The Role of Lexical Resources in CJK Natural Language Processing. Jack Halpern (春遍雀來)



  • The Role of Lexical Resources in CJK Natural Language Processing
      Jack Halpern
      The CJK Dictionary Institute (CJKI)
      ACL/COLING'06 Workshop on Multilingual Language Resources and Interoperability

  • Various Challenges
      - Identifying and processing the large number of orthographic variants in Japanese, and alternate character forms in CJK languages.
      - The lack of easily available comprehensive lexical resources, especially lexical databases, comparable to those for the major European languages.
      - Accurate conversion between Simplified and Traditional Chinese.
      - The morphological complexity of Japanese and Korean.
      - Accurate word segmentation and disambiguation of ambiguous segmentation strings.
      - The difficulty of lexeme-based retrieval and CJK CLIR.
      - Chinese and Japanese proper nouns, which are very numerous and difficult to detect without a lexicon.
      - Automatic recognition of terms and their variants.

  • Named Entity Extraction
      - The number of personal names and their variants (e.g. over a hundred ways to spell Mohammed) is probably in the billions.
      - CJKI maintains databases of millions of proper nouns.
      - Use of keywords or syntactic structures that co-occur with proper nouns, which we refer to as named entity contextual clues (NECC).
      - NER, especially of personal names and place names, is an area in which lexicon-driven methods have a clear advantage over probabilistic methods, and in which the role of lexical resources should be a central one.

  • Linguistic Issues in Chinese
      - A major issue for Chinese segmentors is how to treat compound words and multiword lexical units (MWUs). Some MWUs [Chinese examples lost] are not tagged as segments in Chinese Gigaword.
      - The lexicons used by Chinese segmentors are small-scale or incomplete. Our testing of various Chinese segmentors has shown that coverage of MWUs is often limited.
      - Chinese linguists disagree on the concept of wordhood in Chinese. Various theories, such as the Lexical Integrity Hypothesis, have been proposed.
      - The "correct" segmentation can depend on the application, and there are various segmentation standards. For example, a search engine user looking for a compound [Chinese example lost] is not normally interested in its components per se, unless they are part of that compound.
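The effect of lexicon coverage on MWUs can be illustrated with a minimal sketch of forward maximum matching, a common dictionary-based segmentation technique. This is not the segmentor discussed in the slides; the function name `segment`, the two toy lexicons, and the sample string are assumptions chosen for illustration only.

```python
def segment(text, lexicon):
    """Forward maximum matching: at each position, take the longest
    lexicon entry; fall back to a single character if nothing matches."""
    max_len = max(len(word) for word in lexicon)
    segments = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical toy lexicons: the second one also lists the MWU as an entry.
small_lexicon = {"机器", "翻译", "系统"}
large_lexicon = small_lexicon | {"机器翻译"}

print(segment("机器翻译系统", small_lexicon))  # ['机器', '翻译', '系统']
print(segment("机器翻译系统", large_lexicon))  # ['机器翻译', '系统']
```

With the smaller lexicon the MWU is split into its components; once it is listed as an entry, it is returned as a single unit.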

  • Lexeme
      - Lexemes: the smallest distinctive units associating meaning with form.
      - Predicting compositionality is not trivial and often impossible.
      - Lexical items like [Chinese example lost] represent stand-alone, well-defined concepts and should be treated as single units.

  • Multilevel Segmentation
      - Chinese MWUs can consist of nested components that can be segmented in different ways at different levels, to satisfy the requirements of different segmentation standards.
      - Levels shown (example segmentations lost): multiword lexemic, lexemic, sublexemic, morphemic, submorphemic.
      - The slide marks a preferred level for MT and for NER.
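One way to picture nested components is as a tree that is cut at a chosen depth. The sketch below is only an illustration under assumptions: the tree for 北京日本人学校, the bracketing chosen for it, and the helper names `text_of` and `segment_at` are hypothetical and do not reproduce the slide's table.

```python
def text_of(node):
    """Concatenate all leaves under a node into one string."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node)


def segment_at(node, depth):
    """Descend `depth` levels into the tree; anything deeper stays joined."""
    if isinstance(node, str) or depth == 0:
        return [text_of(node)]
    segments = []
    for child in node:
        segments.extend(segment_at(child, depth - 1))
    return segments


# Hypothetical nesting: 北京 + ((日本 + 人) + 学校)
tree = ("北京", (("日本", "人"), "学校"))

print(segment_at(tree, 0))  # ['北京日本人学校']
print(segment_at(tree, 2))  # ['北京', '日本人', '学校']
print(segment_at(tree, 3))  # ['北京', '日本', '人', '学校']
```

Each application can then request the depth that matches its segmentation standard without re-segmenting the text.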

  • Neologisms
      - The problem of incorrect segmentation is especially obvious in the case of neologisms.
      - diànnǎomí 'cyberphile'
      - diànzǐshāngwù 'e-commerce'
      - zhuīchēzú 'auto fan'

  • Chinese-to-Chinese Conversion (C2C)
      The conversion can be implemented on three levels:
      1. Code Conversion: numerous one-to-many ambiguities.

  • C2C (cont.)
      2. Orthographic Conversion: converts meaningful linguistic units, equivalent to lexemes; must be done with a segmentor.

  • C2C (cont.)
      3. Lexemic Conversion: maps SC and TC lexemes that are semantically …
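The three conversion levels can be contrasted in a small sketch. Everything here is an assumption made for illustration: the tiny tables `CHAR_MAP`, `WORD_MAP`, and `LEXEME_MAP`, the function names, and the premise that the input has already been split into words by a segmentor. It is not CJKI's conversion system.

```python
# Code conversion table: SC character -> possible TC characters.
# SC 发 corresponds to both TC 發 and TC 髮, a one-to-many ambiguity.
CHAR_MAP = {"头": ["頭"], "发": ["發", "髮"]}

# Orthographic conversion table: SC word -> TC spelling of the same word.
WORD_MAP = {"头发": "頭髮", "发展": "發展"}

# Lexemic conversion table: SC lexeme -> a different TC lexeme
# with the same meaning (illustrative entry).
LEXEME_MAP = {"信息": "資訊"}


def code_convert(sc_text):
    """Character-level conversion: return the TC candidates per character."""
    return [CHAR_MAP.get(ch, [ch]) for ch in sc_text]


def convert_words(sc_words, lexemic=False):
    """Word-level conversion of pre-segmented SC text.
    Falls back to the first character candidate for unknown words."""
    out = []
    for word in sc_words:
        if lexemic and word in LEXEME_MAP:
            out.append(LEXEME_MAP[word])
        elif word in WORD_MAP:
            out.append(WORD_MAP[word])
        else:
            out.append("".join(CHAR_MAP.get(ch, [ch])[0] for ch in word))
    return "".join(out)


print(code_convert("头发"))                          # [['頭'], ['發', '髮']]
print(convert_words(["头发"]))                       # 頭髮
print(convert_words(["信息", "发展"], lexemic=True))  # 資訊發展
```

The character-level pass cannot choose between 發 and 髮 on its own, while the word-level table resolves the choice once the segmentor has identified the word.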

  • Traditional Chinese Variants
      - Traditional Chinese has numerous variant character forms.

  • Orthographic Variation in Japanese
      - Highly irregular orthography: four scripts are used to write Japanese, namely kanji, hiragana, katakana, and the Latin alphabet.
      - JP IR problem: 'egg' has four variants, 'chicken' three, and 'to lay' two, so a phrase combining them can be written in 4 × 3 × 2 = 24 ways.
      - Google hit counts for two of these spellings (forms lost in extraction): 398 and 12.
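The 4 × 3 × 2 = 24 count can be reproduced by enumerating the combinations. The variant spellings and the sentence frame below are assumptions supplied for illustration, since the forms on the slide were lost; they are common spellings of these three words.

```python
from itertools import product

# Assumed variant spellings (not recovered from the slide).
EGG = ["卵", "玉子", "たまご", "タマゴ"]      # 'egg': four variants
CHICKEN = ["鶏", "にわとり", "ニワトリ"]      # 'chicken': three variants
LAY = ["産む", "生む"]                        # 'to lay': two variants

# Every way of writing "a chicken lays an egg" with these variants.
phrases = [
    f"{chicken}が{egg}を{lay}"
    for chicken, egg, lay in product(CHICKEN, EGG, LAY)
]

print(len(phrases))  # 24
print(phrases[0])    # 鶏が卵を産む
```

A retrieval system that indexes surface strings would treat these 24 spellings as 24 unrelated queries.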

  • Okurigana Variants

  • Cross-Script Orthographic Variation
      - Google hit counts for three variant forms (forms lost in extraction): 67,500, 66,200, and 58,000.

  • Kana Variants
      - zu
      - normalization

  • Lexicon-driven Normalization
      - Convert variants to a standardized form for indexing.
      - Normalize queries for dictionary lookup.
      - Normalize all source documents.
      - Identify forms as members of a variant group.
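The four operations listed above can be sketched with a variant-group table. The table contents, the choice of canonical forms, and the names `VARIANT_GROUPS`, `normalize`, `same_group`, and `build_index` are hypothetical; a production lexicon would be far larger and would also need to handle inflected forms.

```python
from collections import defaultdict

# Hypothetical variant groups: canonical form -> all known spellings.
VARIANT_GROUPS = {
    "卵": ["卵", "玉子", "たまご", "タマゴ"],
    "鶏": ["鶏", "にわとり", "ニワトリ"],
}

# Reverse lookup: any spelling -> its canonical form.
TO_CANONICAL = {
    variant: canonical
    for canonical, variants in VARIANT_GROUPS.items()
    for variant in variants
}


def normalize(tokens):
    """Map each token to its canonical form; unknown tokens pass through."""
    return [TO_CANONICAL.get(token, token) for token in tokens]


def same_group(a, b):
    """True if both forms are listed in the same variant group."""
    canonical = TO_CANONICAL.get(a)
    return canonical is not None and canonical == TO_CANONICAL.get(b)


def build_index(docs):
    """Inverted index keyed by canonical forms. `docs` maps a document id
    to a list of tokens (assumed to be already segmented)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in normalize(tokens):
            index[token].add(doc_id)
    return index


docs = {1: ["たまご", "料理"], 2: ["玉子", "焼き"]}
index = build_index(docs)
query = normalize(["タマゴ"])

print(query)                    # ['卵']
print(sorted(index[query[0]]))  # [1, 2]
```

Because documents and queries pass through the same table, a query in katakana retrieves documents written in hiragana or kanji.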

  • Orthographic Variation in Korean
      - Far less than in Japanese.
      - Loanwords: 'cake' as ke i keu and ke ik.
      - Personal names: 'Clinton' as keul rin teon and keul rin ton.
      - Mixed script: 'shirt' as wai-syea cheu, also written with a leading Latin letter Y.

  • The Role of Lexical Databases
      - Disk storage is no longer a major issue.
      - CJKI specializes in CJK and Arabic computational lexicography.
      - Key applications: orthographic normalization and named entity extraction.
      - The small-scale lexical resources currently used by many NLP tools are inadequate for these tasks.
      - Since lexicon-driven techniques have proven their effectiveness, there is no need to rely overly on probabilistic methods.
      - Up-to-date lexical resources are the key to achieving major enhancements in NLP technology.