Concise at NTU Graduate Institute of Linguistics

Upload: kuanming

Post on 16-Apr-2017

Category: Software


TRANSCRIPT

Page 1: Concise at NTU Graduate Institute of Linguistics
Page 2: Concise at NTU Graduate Institute of Linguistics

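語料庫 Corpus: Collection of Texts

• A large body of text

• Curated and organised

• With an established format and markup

• Chinese corpora:

    • Academia Sinica Balanced Corpus (中研院平衡語料庫)

    • LIVAC Synchronous Corpus (LIVAC漢語共時語料庫)

    • Peking University Corpus (北京大學語料庫)

    • Lancaster Corpus of Mandarin Chinese (蘭開斯特大學漢語平衡語料庫)

    • Lancaster-Los Angeles Spoken Chinese Corpus (蘭開斯特-洛杉磯漢語口語語料庫)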

Page 3: Concise at NTU Graduate Institute of Linguistics

How about SPECIALISED/OPEN CHINESE corpus/corpora?

Page 4: Concise at NTU Graduate Institute of Linguistics
Page 5: Concise at NTU Graduate Institute of Linguistics

Tokenisation 分詞

Tagging 標註

Indexing 索引

Page 6: Concise at NTU Graduate Institute of Linguistics

Simply IMPORT your files!

Page 7: Concise at NTU Graduate Institute of Linguistics

Chinese Tokenisation / Word Segmentation 中文分詞

Page 8: Concise at NTU Graduate Institute of Linguistics

MMSeg Algorithm Chih-Hao Tsai 蔡志浩
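
As a rough illustration of the idea behind dictionary-based segmentation, the sketch below implements plain forward maximum matching: at each position it takes the longest dictionary word and falls back to a single character. This is only the "simple" matching idea; Tsai's MMSEG also has a "complex" mode with ambiguity-resolution rules, which is not shown here. The class name `MaxMatchSketch`, the method `segment`, and the toy dictionary are made up for this example and are not part of mmseg4j or Concise. The code indexes by `char`, so it assumes text within the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of forward maximum matching; hypothetical helper, not mmseg4j code. */
final class MaxMatchSketch {

    /** Splits text into the longest dictionary words, scanning left to right. */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary, invented for this illustration.
        Set<String> dict = Set.of("語料庫", "語言學", "研究", "中文");
        System.out.println(segment("中文語料庫研究", dict, 3));
    }
}
```

With the toy dictionary above, the sample sentence is split into 中文, 語料庫 and 研究. A greedy matcher like this cannot recover from a wrong early choice, which is the kind of ambiguity MMSEG's additional rules are meant to reduce.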

Page 9: Concise at NTU Graduate Institute of Linguistics

mmseg4j: a Java implementation

Page 10: Concise at NTU Graduate Institute of Linguistics

Custom Dictionaries mmseg4j hack

Page 11: Concise at NTU Graduate Institute of Linguistics

Part-Of-Speech Tagging 詞性標註

Page 12: Concise at NTU Graduate Institute of Linguistics
Page 13: Concise at NTU Graduate Institute of Linguistics
Page 14: Concise at NTU Graduate Institute of Linguistics

Indexing 索引

Page 15: Concise at NTU Graduate Institute of Linguistics
Page 16: Concise at NTU Graduate Institute of Linguistics
Page 17: Concise at NTU Graduate Institute of Linguistics

Tokenisation 分詞: MMSeg4j Core

Tagging 標註: Stanford POS Tagger

Indexing 索引: Lucene Core

Page 18: Concise at NTU Graduate Institute of Linguistics

Analysing Gears

Page 19: Concise at NTU Graduate Institute of Linguistics

Concordance Keyword in context
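
To make the keyword-in-context idea concrete, here is a minimal sketch that works on an already tokenised text held in memory. It does not use Concise or Lucene; the class `KwicSketch`, the method `kwic`, and the bracketed output format are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of keyword-in-context (KWIC) lines over pre-tokenised text; not Concise code. */
final class KwicSketch {

    /** Returns one "left [node] right" line per occurrence of the node word. */
    static List<String> kwic(List<String> tokens, String node, int left, int right) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(node)) {
                continue;
            }
            // Clamp the span to the text boundaries.
            int from = Math.max(0, i - left);
            int to = Math.min(tokens.size(), i + 1 + right);
            String leftContext = String.join(" ", tokens.subList(from, i));
            String rightContext = String.join(" ", tokens.subList(i + 1, to));
            lines.add(leftContext + " [" + node + "] " + rightContext);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Toy token sequence, invented for this illustration.
        List<String> tokens = List.of("我", "喜歡", "喝", "咖啡", "也", "喜歡", "喝", "茶");
        for (String line : kwic(tokens, "喝", 2, 1)) {
            System.out.println(line);
        }
    }
}
```

The `left` and `right` arguments play the role of the span size: each hit is shown with at most that many tokens on either side.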

Page 20: Concise at NTU Graduate Institute of Linguistics

Collocation
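
A window-based (surface) view of collocation can be sketched as counting every word that falls within a fixed span around each occurrence of the node word. The example below returns raw co-occurrence counts only; the slides do not say how Concise scores or ranks collocates, so no association measure is attempted. `CollocateSketch` and `collocates` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of window-based collocate counting; hypothetical helper, not Concise code. */
final class CollocateSketch {

    /** Counts the words found within the given span around each occurrence of node. */
    static Map<String, Integer> collocates(List<String> tokens, String node, int left, int right) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(node)) {
                continue;
            }
            int from = Math.max(0, i - left);
            int to = Math.min(tokens.size() - 1, i + right);
            for (int j = from; j <= to; j++) {
                if (j != i) { // skip the node occurrence itself
                    counts.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy token sequence, invented for this illustration.
        List<String> tokens = List.of("他", "站", "在", "屋外", "看", "雨", "屋外", "很", "冷");
        System.out.println(collocates(tokens, "屋外", 2, 2));
    }
}
```

In practice raw counts favour very frequent words, so collocation tools usually combine them with corpus-wide frequencies before ranking.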

Page 21: Concise at NTU Graduate Institute of Linguistics

Word List

Page 22: Concise at NTU Graduate Institute of Linguistics

Word Cluster 2-gram
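
A word cluster of size n is simply a run of n adjacent tokens, so a 2-gram list can be sketched by sliding a window over the token sequence and counting each run. The sketch below is independent of Concise; `NgramSketch` and `ngrams` are names invented for this example, and tokens are joined with a space purely for display.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of n-gram (word cluster) counting; hypothetical helper, not Concise code. */
final class NgramSketch {

    /** Counts every run of n adjacent tokens. */
    static Map<String, Integer> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Map<String, Integer> counts = new TreeMap<>();
        // The loop body never runs when the text is shorter than n tokens.
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy token sequence, invented for this illustration.
        List<String> tokens = List.of("很", "高興", "認識", "你", "很", "高興");
        System.out.println(ngrams(tokens, 2));
    }
}
```

Calling the same method with `n = 1` gives a plain frequency list of single words.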

Page 23: Concise at NTU Graduate Institute of Linguistics

Word Cloud

Page 24: Concise at NTU Graduate Institute of Linguistics

Concordance Plot

Page 25: Concise at NTU Graduate Institute of Linguistics

Collocational Network

Page 26: Concise at NTU Graduate Institute of Linguistics

Scatter Plot Correspondence Analysis

Page 27: Concise at NTU Graduate Institute of Linguistics

In Progress

• Collocational Network

• Correspondence Analysis

• Principal Component Analysis

Wish List

• Notes & Comments

• History (or multi-query)

Page 28: Concise at NTU Graduate Institute of Linguistics

Using concise-core for programmers

1. Create Workspace & Import Files
2. Concordance
3. Word List
4. Surface Collocation
5. Textual Collocation
6. N-gram
7. Cluster

Page 29: Concise at NTU Graduate Institute of Linguistics

1. Create Workspace & Import Files

import java.io.File;
// Workspace, Importer, ConciseDocument and DocumentIterator come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// Import documents
File file1 = new File("path1");
File file2 = new File("path2");
Importer importer = new Importer(ws);
importer.indexFile(file1, false);
importer.indexFile(file2, false);
importer.close();
System.out.println("done");

// Display documents
System.out.println("================");
for (ConciseDocument doc : new DocumentIterator(ws)) {
    System.out.println(doc);
}
System.out.println("================");

Page 30: Concise at NTU Graduate Institute of Linguistics

2. Concordance

import java.io.File;
// Workspace, Conc, ConcLineIterator and ConcLine come from concise-core.

File w = new File("path/to/your/file");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// Query: 喝 ("drink") followed by 咖啡 ("coffee")
Conc conc = new Conc(ws, "喝 咖啡", false);
System.out.println("Search words: " + conc.getSearchWords());

// Set the span size
conc.setSpanSize(Conc.DEFAULT_LEFT_SPAN, Conc.DEFAULT_RIGHT_SPAN);
System.out.println("Left: " + conc.left_span_size);
System.out.println("Right: " + conc.right_span_size);
System.out.println("===============================");

// Print every concordance line of every matching document
for (ScoreDoc d : conc.hitDocs()) {
    ConcLineIterator iter = new ConcLineIterator(conc, d);
    for (ConcLine line : iter) {
        System.out.println(line);
    }
}

Page 31: Concise at NTU Graduate Institute of Linguistics

3. Word List

import java.io.File;
// Workspace, Word and WordIterator come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// Count types (distinct words) and tokens (total occurrences)
long types = 0L;
long count = 0L;
for (Word word : new WordIterator(ws, false)) {
    System.out.println(word.toString());
    types++;
    count += word.totalTermFreq;
}

System.out.println("===========");
System.out.println(types + " types.");
System.out.println(count + " tokens.");

// Demo static sum method
long sumTotal = WordIterator.sumTotalTermFreq(ws, false);
System.out.println(sumTotal);

Page 32: Concise at NTU Graduate Institute of Linguistics

4. Surface Collocation

import java.io.File;
// Workspace, Conc, CollocateIterator, SurfaceCollocateIterator and Collocate come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "\"屋 外\"", false);
System.out.println("search words: " + conc.getSearchWords() + "\n");
conc.setSpanSize(4, 4);

// Surface mode collocation
CollocateIterator iter = new SurfaceCollocateIterator(conc);
System.out.println("surface mode collocation");
System.out.println("========================");
for (Collocate c : iter) {
    System.out.println(c);
}
System.out.println();

Page 33: Concise at NTU Graduate Institute of Linguistics

5. Textual Collocation

import java.io.File;
// Workspace, Conc, TextualCollocateIterator, BOUNDARY and Collocate come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "\"屋 外\"", false);
System.out.println("search words: " + conc.getSearchWords() + "\n");
conc.setSpanSize(4, 4);

// Textual mode collocation
TextualCollocateIterator tIter =
    new TextualCollocateIterator(conc, BOUNDARY.SENTENCE);
System.out.println("textual mode collocation");
System.out.println("========================");
for (Collocate c : tIter) {
    System.out.println(c);
}

Page 34: Concise at NTU Graduate Institute of Linguistics

6. N-gram

import java.io.File;
// Workspace, NgramClusterIterator and Cluster come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

int grams = 2;

NgramClusterIterator ngram = new NgramClusterIterator(ws, grams, true);
for (Cluster c : ngram) {
    System.out.println(c);
}
System.out.println("===========");
System.out.println(grams + "-gram.");

Page 35: Concise at NTU Graduate Institute of Linguistics

7. Cluster

import java.io.File;
// Workspace, Conc, ClusterIterator, ConcClusterIterator and Cluster come from concise-core.

File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "高興_*_*", false);
System.out.println("search words: " + conc.getSearchWords());

conc.setSpanSize(2, 1);
System.out.println("Left: " + conc.left_span_size);
System.out.println("Right: " + conc.right_span_size);
System.out.println("==============================");

ClusterIterator iter = new ConcClusterIterator(conc);
for (Cluster c : iter) {
    System.out.println(c);
}