Concise at NTU Graduate Institute of Linguistics
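Corpus (語料庫): a collection of texts
• A large volume of text
• Curated and organised
• Follows a fixed format and markup scheme
• Chinese corpora
• Academia Sinica Balanced Corpus (中研院平衡語料庫)
• LIVAC Synchronous Corpus (LIVAC漢語共時語料庫)
• Peking University corpus (北京大學語料庫)
• Lancaster Corpus of Mandarin Chinese (蘭開斯特大學漢語平衡語料庫)
• Lancaster Los Angeles Spoken Chinese Corpus (蘭開斯特-洛杉磯漢語口語語料庫)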
How about SPECIALISED/OPEN
CHINESE corpus/corpora?
Tokenisation 分詞
Tagging 標註
Indexing 索引
Simply IMPORT your files!
Chinese Tokenisation / Word Segmentation 中文分詞
MMSeg Algorithm (Chih-Hao Tsai 蔡志浩)
mmseg4j: a Java implementation
Custom Dictionaries: mmseg4j hack
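To make the dictionary-based idea behind word segmentation concrete, here is a minimal sketch of plain forward maximum matching. It is not the MMSeg algorithm itself (MMSeg adds chunk-based disambiguation rules on top of maximum matching) and it does not use the mmseg4j API. The class name ForwardMaxMatch, the toy dictionary, and the maxWordLength parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of mmseg4j or concise-core.
class ForwardMaxMatch {

    /**
     * Greedy longest-match segmentation. At each position the longest
     * dictionary word (up to maxWordLength characters) is taken; if nothing
     * matches, a single character is emitted as its own token.
     * Works on UTF-16 chars, so it assumes text without surrogate pairs.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("我", "喜歡", "喝", "咖啡"); // toy dictionary
        System.out.println(segment("我喜歡喝咖啡", dictionary, 4));
    }
}
```

Adding a domain term to the dictionary set changes where the greedy match stops, which is roughly why custom dictionaries matter for specialised corpora.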
Part-Of-Speech Tagging 詞性標註
Indexing 索引
Tokenisation 分詞: MMSeg4j Core
Tagging 標註: Stanford POS Tagger
Indexing 索引: Lucene Core
Analysing Gears
• Concordance (keyword in context)
• Collocation
• Word List
• Word Cluster (2-gram)
• Word Cloud
• Concordance Plot
• Collocational Network
• Scatter Plot (Correspondence Analysis)
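A concordance, listed above as keyword in context, shows every hit of a search word together with a few tokens on either side. The following is a minimal sketch of that idea over an already tokenised text. It is independent of concise-core; the class name KwicSketch, the bracket display format, and the sample tokens are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of concise-core.
class KwicSketch {

    /** Returns one line per hit: up to `left` tokens, the keyword in brackets, up to `right` tokens. */
    static List<String> concordance(List<String> tokens, String keyword, int left, int right) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(keyword)) {
                continue;
            }
            // Clamp the context window to the bounds of the text.
            int from = Math.max(0, i - left);
            int to = Math.min(tokens.size(), i + 1 + right);
            String before = String.join(" ", tokens.subList(from, i));
            String after = String.join(" ", tokens.subList(i + 1, to));
            lines.add((before + " [" + keyword + "] " + after).trim());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("我", "喜歡", "喝", "咖啡", "他", "也", "喝", "茶");
        for (String line : concordance(tokens, "喝", 2, 1)) {
            System.out.println(line);
        }
    }
}
```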
In Progress
• Collocational Network
• Correspondence Analysis
• Principal Component Analysis
Wish List
• Notes & Comments
• History (or multi-query)
Using concise-core for programmers
1. Create Workspace & Import Files
2. Concordance
3. Word List
4. Surface Collocation
5. Textual Collocation
6. N-gram
7. Cluster
1. Create Workspace & Import Files
// open (or create) the workspace directory
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// import documents
File file1 = new File("path1");
File file2 = new File("path2");
Importer importer = new Importer(ws);
importer.indexFile(file1, false);
importer.indexFile(file2, false);
importer.close();
System.out.println("done");

// display the imported documents
System.out.println("================");
for (ConciseDocument doc : new DocumentIterator(ws)) {
    System.out.println(doc);
}
System.out.println("================");
2. Concordance
File w = new File("path/to/your/file");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

Conc conc = new Conc(ws, "喝 咖啡", false);
System.out.println("Search words: " + conc.getSearchWords());

// set the span size
conc.setSpanSize(Conc.DEFAULT_LEFT_SPAN, Conc.DEFAULT_RIGHT_SPAN);
System.out.println("Left: " + conc.left_span_size);
System.out.println("Right: " + conc.right_span_size);
System.out.println("===============================");

// print every concordance line of every matching document
for (ScoreDoc d : conc.hitDocs()) {
    ConcLineIterator iter = new ConcLineIterator(conc, d);
    for (ConcLine line : iter) {
        System.out.println(line);
    }
}
3. Word List
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// count types (distinct words) and tokens (total occurrences)
long types = 0L;
long count = 0L;
for (Word word : new WordIterator(ws, false)) {
    System.out.println(word.toString());
    types++;
    count += word.totalTermFreq;
}
System.out.println("===========");
System.out.println(types + " types.");
System.out.println(count + " tokens.");

// Demo static sum method
long sumTotal = WordIterator.sumTotalTermFreq(ws, false);
System.out.println(sumTotal);
4. Surface Collocation
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "\"屋 外\"", false);
System.out.println("search words: " + conc.getSearchWords() + "\n");
conc.setSpanSize(4, 4);

// surface mode collocation
CollocateIterator iter = new SurfaceCollocateIterator(conc);
System.out.println("surface mode collocation");
System.out.println("========================");
for (Collocate c : iter) {
    System.out.println(c);
}
System.out.println();
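As a rough illustration of what a span of (4, 4) means, the sketch below counts the tokens that fall within a fixed window around each occurrence of a node word, ignoring sentence boundaries. This reading of "surface" collocation is an assumption based on common corpus-linguistics usage, not a description of how SurfaceCollocateIterator works internally. The class name WindowCollocates and the sample tokens are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of concise-core.
class WindowCollocates {

    /** Counts tokens within `left` positions before and `right` positions after each node occurrence. */
    static Map<String, Integer> count(List<String> tokens, String node, int left, int right) {
        Map<String, Integer> freq = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(node)) {
                continue;
            }
            int from = Math.max(0, i - left);
            int to = Math.min(tokens.size() - 1, i + right);
            for (int j = from; j <= to; j++) {
                if (j != i) { // skip the node itself
                    freq.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "sat", "on", "the", "mat");
        System.out.println(count(tokens, "the", 1, 1));
    }
}
```

Note that when two node occurrences sit close together, a token inside both windows is counted once per window in this sketch. Raw counts like these would normally be passed to an association measure before ranking collocates.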
5. Textual Collocation
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "\"屋 外\"", false);
System.out.println("search words: " + conc.getSearchWords() + "\n");
conc.setSpanSize(4, 4);

// textual mode collocation, bounded by sentence
TextualCollocateIterator tIter =
    new TextualCollocateIterator(conc, BOUNDARY.SENTENCE);
System.out.println("textual mode collocation");
System.out.println("========================");
for (Collocate c : tIter) {
    System.out.println(c);
}
6. N-gram
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has created.");

// size of the n-gram (2 = bigram)
int grams = 2;

NgramClusterIterator ngram = new NgramClusterIterator(ws, grams, true);
for (Cluster c : ngram) {
    System.out.println(c);
}
System.out.println("===========");
System.out.println(grams + "-gram.");
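For readers unfamiliar with n-grams, the sketch below shows the underlying counting step on a plain token list: slide a window of n tokens across the text and tally each sequence. It does not reflect how NgramClusterIterator reads the index. The class name NgramCounter and the sample tokens are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of concise-core.
class NgramCounter {

    /** Counts every run of n consecutive tokens, keyed by the tokens joined with spaces. */
    static Map<String, Integer> count(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // LinkedHashMap keeps n-grams in order of first appearance.
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("喝", "咖啡", "喝", "咖啡", "吧");
        count(tokens, 2).forEach((gram, c) -> System.out.println(gram + "\t" + c));
    }
}
```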
7. Cluster
File w = new File("path/to/your/workspace");
Workspace ws = new Workspace(w);
System.out.println("workspace has opened.\n");

Conc conc = new Conc(ws, "高興_*_*", false);
System.out.println("search words: " + conc.getSearchWords());

conc.setSpanSize(2, 1);
System.out.println("Left: " + conc.left_span_size);
System.out.println("Right: " + conc.right_span_size);
System.out.println("==============================");

// clusters built from the concordance hits
ClusterIterator iter = new ConcClusterIterator(conc);
for (Cluster c : iter) {
    System.out.println(c);
}