Cleaning the Shared Dictionary of Social IME

@nokuno

#IME2011

Cleaning Social IME Dictionary

Yoh Okuno

About the presenter

•  Name: Yoh Okuno

•  Software Engineer at Yahoo! Japan

•  Interest: NLP, Machine Learning, Data Mining

•  Skill: C/C++, Python, Hadoop, and English.

•  Website: http://yoh.okuno.name/

What is Social IME?

•  The most popular “Cloud-based” Japanese input method (230k unique users per month)

http://www.social-ime.com/

Shared Dictionary of Social IME

•  Noisy & Crazy → Needs cleaning!

shared with all users

Character alignment

•  Align pairs of Kana and Kanji characters monotonically and detect failures of alignment

•  Techniques from statistical machine translation

•  Used m2m-aligner because of its functions

http://code.google.com/p/m2m-aligner/

Input (word, reading):

四季多彩 しきたさい
西都原 さいとばる
iPhone あいふぉん

Aligned output:

四|季|多|彩| し|き|た|さい|
西|都|原| さい|と|ばる|
i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
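The aligned output above can be read back into chunk pairs with a few lines of Python. This is only a sketch: the function name `parse_alignment` is hypothetical, and it assumes the two sides are separated by whitespace, each chunk ends with "|", and "_" marks an empty chunk, as the iPhone example suggests.

```python
def parse_alignment(line):
    """Split one aligned line into (surface_chunk, reading_chunk) pairs."""
    surface, reading = line.split()
    # Each side is a sequence of "|"-terminated chunks; drop the trailing "|".
    surface_chunks = surface.rstrip("|").split("|")
    reading_chunks = reading.rstrip("|").split("|")
    if len(surface_chunks) != len(reading_chunks):
        raise ValueError("chunk counts differ: %r" % line)
    # Assumption: "_" stands for an empty chunk on either side.
    return [
        ("" if s == "_" else s, "" if r == "_" else r)
        for s, r in zip(surface_chunks, reading_chunks)
    ]


pairs = parse_alignment("四|季|多|彩| し|き|た|さい|")
```

For the first example line this yields four pairs, with 彩 mapped to the two-character reading さい.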

Training m2m-aligner

•  Train on 3 datasets

– Mozc’s dictionary (1.5 M words)

– unidic (230k words)

– alt-cannadic (400k words) → most suitable

•  Just run 2 commands

Trained results

•  Three files are generated

Alignment:

Error:

Model:

Applying m2m-aligner

•  Apply to 5 datasets

– Social IME shared dictionary (93k words)

– Mined from Wikipedia (169k words)

– Crawled MS-IME dictionary (18k words)

– Manually corrected MS-IME dictionary (92k words)

– Hatena keyword (315k words)

Mining words from Wikipedia

grep like “[一-龠]+（[ぁ-んヴー]+）”
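The same pattern can be tried in Python. This is a minimal sketch, not the script used for the talk: the capture groups, the `mine_pairs` helper, and the sample sentence are added here for illustration.

```python
import re

# One or more kanji, followed by a kana reading in full-width parentheses.
# The two capture groups are an addition so that word and reading come out separately.
PATTERN = re.compile(r"([一-龠]+)（([ぁ-んヴー]+)）")


def mine_pairs(text):
    """Return a list of (word, reading) tuples found in the text."""
    return PATTERN.findall(text)


# Hypothetical sample sentence in the style of a Wikipedia lead paragraph.
sample = "西都原（さいとばる）は宮崎県の地名。"
pairs = mine_pairs(sample)
```

Because the kanji class stops at the first non-kanji character, only the run of kanji directly before the parenthesis is taken as the word.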

Crawling MS-IME user dictionary

Hatena keyword

Applied results

•  Run:

•  Results:

Dataset  Social IME  Wikipedia  MS-IME  MS-IME2  Hatena
Size     93k         169k       18k     97k      314k
Align    48k         137k       16k     86k      235k
Error    45k         32k        2k      10k      78k
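To compare the datasets, the share of entries that failed to align can be computed from the Size and Error rows above. The snippet below is a small sketch; the counts are the rounded figures from the results (in thousands), so the rates are approximate.

```python
# (size, errors) in thousands of words, copied from the results above.
results = {
    "Social IME": (93, 45),
    "Wikipedia": (169, 32),
    "MS-IME": (18, 2),
    "MS-IME2": (97, 10),
    "Hatena": (314, 78),
}


def error_rate(size, errors):
    """Fraction of entries for which alignment failed."""
    return errors / size


for name, (size, errors) in results.items():
    print(f"{name}: {error_rate(size, errors):.0%}")
```

On these figures, roughly half of the Social IME entries fail to align, against about 10% to 25% for the other datasets.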

Alignment examples

•  Not perfect but practical precision

From Social IME:

From Wikipedia:

“ゃ，ゅ，ょ，っ” should be combined with the previous character
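One possible way to act on this note is a post-processing pass over the aligned chunks. The sketch below is an assumption, not part of m2m-aligner or of the released dictionary: `merge_small_kana` and the sample input are hypothetical, and it merges both sides of a pair so that the chunk counts stay equal.

```python
# The four small kana named on the slide.
SMALL_KANA = set("ゃゅょっ")


def merge_small_kana(pairs):
    """Merge any pair whose reading starts with a small kana into the previous pair."""
    merged = []
    for surface, reading in pairs:
        if merged and reading[:1] in SMALL_KANA:
            prev_surface, prev_reading = merged.pop()
            merged.append((prev_surface + surface, prev_reading + reading))
        else:
            merged.append((surface, reading))
    return merged


# Hypothetical alignment in which "っ" was split off from the preceding kana.
fixed = merge_small_kana([("発", "は"), ("表", "っぴょう")])
```

Pairs whose reading starts with an ordinary kana are left unchanged.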

Error examples (from Social IME)

•  Error analysis is most interesting!

Abbreviations:

Emoticons (顔文字):

Personal Information:

Error examples (from Hatena)

Length limit (16 chars):

Chinese / Korean / old Japanese words:

Semantic translation:

12/29 Released!!

Conclusion

•  Described how to clean Social IME/Wikipedia/MS-IME dictionary using m2m-aligner

•  Released cleaned dictionary today!

•  Future work: automatically classify pairs with alignment errors into emoticons, abbreviations, personal information and so on.

TokyoNLP: call for presenters!

Any questions?
