Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics


ACL 2012 / NEWS 2012


Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Yoh Okuno

Outline

• Introduction
• System
• Experiments
• Conclusion

Outline

• Introduction
  – Statistical Machine Transliteration
  – Baseline and Our Systems
• System
• Experiments
• Conclusion

Machine Transliteration as Monotonic SMT

• The most common approach to machine transliteration follows the manner of SMT (Statistical Machine Translation) [Finch+ 2008]
• It consists of 3 steps:
  1. Align the training data monotonically (character-based)
  2. Train a discriminative model on the aligned data
  3. Decode an input string into an n-best list

Example of Statistical Transliteration

• Given training data of transliteration pairs:

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井

Example of Statistical Transliteration

1. Align the training data using co-occurrence statistics

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井

Example of Statistical Transliteration

2. Train a statistical model from the aligned data

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井
    ↓ 2. Train
Learned Model (Rules):  OKU → 奥   NO → 野   MURA → 村   I → 井

Example of Statistical Transliteration

3. Decode new input and return output

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井
    ↓ 2. Train
Learned Model (Rules):  OKU → 奥   NO → 野   MURA → 村   I → 井
    ↓ 3. Decode
Test Input:  OKUMURA   OKUI   MURANO
Output:      奥村   奥井   村野
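As an illustration, the decode step can be sketched with the toy rules from this example. This is a minimal sketch, not the actual DirecTL+ decoder (which scores candidates with a discriminative model and returns an n-best list); it simply applies the learned rules greedily, longest match first:

```python
# Toy rules learned from the example slides above.
RULES = {"OKU": "奥", "NO": "野", "MURA": "村", "I": "井"}

def decode(source: str) -> str:
    """Greedily segment the source string using the longest matching rule."""
    result = []
    i = 0
    while i < len(source):
        for length in range(len(source) - i, 0, -1):  # longest match first
            piece = source[i:i + length]
            if piece in RULES:
                result.append(RULES[piece])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {source[i:]!r}")
    return "".join(result)

print(decode("OKUMURA"))  # 奥村
print(decode("OKUI"))     # 奥井
print(decode("MURANO"))   # 村野
```

With the three rules above, the test inputs from the slide decode exactly to the outputs shown.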

The Baseline System using m2m-aligner [Jiampojamarn+ 2007, 2008]

• Pipeline: Training Data → Align (m2m-aligner) → Train (DirecTL+) → Decode (DirecTL+) → Output (N-best list)

Our System: mpaligner with Heuristics

• Pipeline: Training Data → Pre-processing → Align (mpaligner) → Train (DirecTL+) → Decode (DirecTL+) → Output (N-best list)
• Pre-processing: Japanese-specific heuristics
  1. JnJk: de-romanization
  2. EnJa: syllable-based alignment
• Align: mpaligner, an improved alignment tool [Kubo+ 2011]
  1. Better accuracy than m2m-aligner
  2. No hand-tuned parameters

Outline

• Introduction
• System
  – Comparing Aligners
  – Japanese-Specific Heuristics
• Experiments
• Conclusion

m2m-aligner: Many-to-Many Alignments [Jiampojamarn+ 2007]

• Alignment tool based on the EM algorithm and MLE
• Advantages:
  1. Can align multiple characters
  2. Performs well on short alignments
• Disadvantages:
  1. Poor performance on long alignments due to overfitting
  2. Requires hand-tuning of length-limit parameters

http://code.google.com/p/m2m-aligner/

mpaligner: Minimum Pattern Aligner [Kubo+ 2011]

• Idea: penalize long alignments during the E-step
• Simple scaling of each pair probability (following [Kubo+ 2011], the probability is raised to the power of the pair's combined length):

  P'(x, y) = P(x, y)^(|x| + |y|)

• x: source string, y: target string
• |x|: length of x, |y|: length of y
• P(x, y): probability of the string pair (x, y)
• Since P(x, y) ≤ 1, longer pairs receive a heavier penalty
• Good performance without hand-tuned parameters

http://sourceforge.jp/projects/mpaligner/
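The length-penalty scaling can be sketched in a few lines. This is an illustrative sketch assuming the P(x, y)^(|x| + |y|) form from [Kubo+ 2011], not the real tool's implementation:

```python
# Sketch of mpaligner's length-penalty scaling, assuming the
# P(x, y)^(|x| + |y|) form; illustrative only, not mpaligner's code.
def scaled_prob(x: str, y: str, p: float) -> float:
    """Scale the pair probability p by the combined length of x and y."""
    return p ** (len(x) + len(y))

# Starting from the same raw probability, a long pair is penalized
# far more heavily than a short one.
short = scaled_prob("NO", "野", 0.5)     # 0.5 ** 3 = 0.125
long_ = scaled_prob("MURA", "村井", 0.5)  # 0.5 ** 6 = 0.015625
assert short > long_
```

This is why mpaligner avoids the overfitting to long alignments that hurts m2m-aligner, without any hand-tuned length limits.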

Motivation: Invalid Alignment Problem

• Character-based alignment can be phonetically invalid
  – It may divide atomic units into meaningless pieces
  – We call the smallest unit of alignment a syllable
• Syllable-based alignment should be used for this task
  – Problem: no training data for syllable-based alignment
• In this study, we propose Japanese-specific heuristics that address this problem using knowledge of Japanese

Examples of Invalid and Valid Alignment

• In Japanese, consonants should be combined with vowels

• JnJk Task
  Type     Source       Target
  Valid    SUZU:KI      鈴:木
  Invalid  SUZ:UKI      鈴:木
  Valid    HIRO:MI      裕:実
  Invalid  HIR:OMI      裕:実
  Valid    OKU:NO       奥:野
  Invalid  OK:UNO       奥:野

• EnJa Task
  Type     Source         Target
  Valid    Ar:thur        アー:サー
  Invalid  A:r:th:ur      ア:ー:サ:ー
  Valid    Cha:p:li:n     チャッ:プ:リ:ン
  Invalid  C:h:a:p:li:n   チ:ャ:ッ:プ:リ:ン
  Valid    Ju:s:mi:ne     ジャ:ス:ミ:ン
  Invalid  J:u:s:mi:ne    ジ:ャ:ス:ミ:ン

Language-Specific Heuristics as Preprocessing

• Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
  – Combine atomic strings into syllables
  – Treat a syllable as one character during alignment
• The definition of a syllable should be chosen carefully
  – It may cause bad side effects
  – Some context is already captured by n-gram features

JnJk Task: De-romanization Heuristic

• De-romanization: convert Roman characters to Kana
• A consonant and a vowel are coupled into one Kana
• A common romanization table (Hepburn) is used

  Roman  A   I   U   E   O
  Kana   あ  い  う  え  お
  Roman  KA  KI  KU  KE  KO
  Kana   か  き  く  け  こ

http://www.social-ime.com/conv-table.html
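The de-romanization heuristic can be sketched with the table fragment shown above. This is an illustrative sketch using only the slide's ten entries; the real heuristic uses a full Hepburn conversion table:

```python
# De-romanization sketch using the small table fragment from the slide;
# a complete Hepburn table would cover all consonant-vowel pairs.
ROMAN_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
}

def deromanize(text: str) -> str:
    """Greedily convert Roman characters to Kana, longest match first."""
    out, i = [], 0
    while i < len(text):
        for length in (2, 1):  # table entries here are 1 or 2 letters long
            piece = text[i:i + length]
            if piece in ROMAN_TO_KANA:
                out.append(ROMAN_TO_KANA[piece])
                i += length
                break
        else:
            out.append(text[i])  # leave uncovered characters untouched
            i += 1
    return "".join(out)

print(deromanize("KAI"))  # かい
```

After this step, each Kana on the source side already bundles a consonant with its vowel, so the aligner can no longer split pairs like KA into phonetically invalid pieces.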

EnJa Task: Syllable-based Alignment

• In the EnJa task, the target side should be aligned in units of syllables, not characters
• Combine sub-characters with the preceding character
• There are 3 types of sub-characters:
  1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
  2. Silent character (Soku-on): e.g. ッ
  3. Hyphen (Cho-on; long vowel): e.g. ー
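The syllable-combining heuristic amounts to one pass over the target string, merging each sub-character into the preceding unit. A minimal sketch, assuming the three sub-character types listed above (the small-vowel set beyond ャュョ is an assumption on my part):

```python
# Sketch of the EnJa syllable-combining heuristic: attach Yo-on, Soku-on
# and Cho-on sub-characters to the preceding Katakana character so each
# resulting unit is treated as one "character" during alignment.
# The small vowels ァィゥェォ are assumed to count as Yo-on-style
# lower-case characters; the slide only shows ャ, ュ, ョ explicitly.
SUB_CHARS = set("ャュョァィゥェォッー")

def to_syllables(katakana: str) -> list[str]:
    """Group a Katakana string into syllable units."""
    units: list[str] = []
    for ch in katakana:
        if ch in SUB_CHARS and units:
            units[-1] += ch  # merge into the previous unit
        else:
            units.append(ch)
    return units

print(to_syllables("チャップリン"))  # ['チャッ', 'プ', 'リ', 'ン']
print(to_syllables("アーサー"))      # ['アー', 'サー']
```

The two examples reproduce the valid alignments チャッ:プ:リ:ン and アー:サー from the earlier slide.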

Outline

• Introduction
• System
• Experiments
  – Official Scores for 8 Language Pairs
  – Further Investigation of JnJk and EnJa
• Conclusion

Experimental Settings

• Conducted 2 types of experiments:
  – Official evaluation on the test sets for 8 language pairs
  – Comparison of the proposed and baseline systems for the JnJk and EnJa tasks on the development sets
• Basically followed the default settings of the tools
  – m2m-aligner: length limits selected carefully
  – Iteration number: optimized on the development set
  – Features: n-gram (N=2) and context (size=7) features
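The exact feature templates of DirecTL+ are not spelled out on the slide, so the following is a hypothetical sketch of what n-gram (N=2) and context (size=7) features around a source position could look like; the feature names and the `extract_features` helper are illustrative, not the tool's actual templates:

```python
# Hypothetical sketch of per-position features: a size-7 context window
# (3 characters to each side of the current one) plus character bigrams
# inside that window. Not DirecTL+'s actual feature extraction.
def extract_features(source: str, pos: int, window: int = 3, n: int = 2):
    """Features for the character at `pos`: windowed context + n-grams."""
    padded = "#" * window + source + "#" * window  # '#' pads the edges
    center = pos + window
    feats = []
    # Context features: each character within the +/-3 window (7 total).
    for offset in range(-window, window + 1):
        feats.append(f"ctx[{offset}]={padded[center + offset]}")
    # N-gram features: bigrams (N=2) inside the same window.
    for start in range(center - window, center + window):
        feats.append(f"ngram={padded[start:start + n]}")
    return feats

feats = extract_features("OKUNO", 0)
print(feats[:4])
```

The point of the sketch is only that some neighboring context is visible to the model even when alignment units are enlarged, which is why the heuristics note that "some context is already captured by n-gram features".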

Official Scores for 8 Language Pairs

• Applied the heuristics to the JnJk and EnJa tasks
• Performed well (top rank on EnPe and EnHe)

  Task  ACC    F-Score  MRR    MAP    Rank
  JnJk  0.512  0.693    0.582  0.401  2
  EnJa  0.362  0.803    0.469  0.359  2
  EnCh  0.301  0.655    0.376  0.292  5
  ChEn  0.013  0.259    0.017  0.013  4
  EnKo  0.334  0.688    0.411  0.334  3
  EnBa  0.404  0.882    0.515  0.403  2
  EnPe  0.658  0.941    0.761  0.640  1
  EnHe  0.191  0.808    0.254  0.190  1

Results in JnJk and EnJa Tasks

• The proposed system outperformed the baselines

Result in JnJk Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.113  0.389    0.182  0.114
  mpaligner    0.121  0.391    0.197  0.122
  Proposed     0.199  0.494    0.300  0.200

Result in EnJa Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.280  0.737    0.359  0.280
  mpaligner    0.326  0.761    0.431  0.326
  Proposed     0.358  0.774    0.469  0.358

Output Examples (10-best list)

JnJk Task (inputs: Harui, Kyotaro)
  1   春井  京太郎
  2   晴井  恭太郎
  3   治井  匡太郎
  4   榛井  強太郎
  5   敏井  共太郎
  6   明井  享太郎
  7   陽井  亨太郎
  8   遙井  杏太郎
  9   遥井  鋸太郎
  10  温井  教太郎

EnJa Task (inputs: Bloy, Grothendieck)
  1   ブロイ    グローテンディック
  2   ブロア    グロートンディック
  3   ブローイ  グローテンディーク
  4   ブロワ    グローテンディック
  5   ブロッイ  グローゾンディック
  6   ブロヤ    グローテンジーク
  7   ブロヨ    グローザーンディック
  8   ブウォイ  グローザンディック
  9   ブロティ  グローシンディック
  10  ブロレィ  グローゼンディック

Error Analysis

• Sparseness problem:
  – Side effect of syllable-based alignment in the EnJa task
  – Too many target-side characters in the JnJk task
• Word origin [Hagiwara+ 2011]:
  – English names come from various languages
  – First and family names can be modeled differently
  – Gender: first names are quite different
• Training data inconsistency or ambiguity:
  – e.g. JAPAN → 日本国 (not a transliteration)

Outline

• Introduction
• System
• Experiments
• Conclusion
  – Future Work

Conclusion

• Applied mpaligner to the machine transliteration task for the first time
  – Performed better than m2m-aligner
  – The maximum likelihood estimation approach is not suitable
• Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
  – De-romanization for the JnJk task
  – Syllable-based alignment for the EnJa task

Future Work

• Combine these heuristics with language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
• Develop language-dependent heuristics for languages other than Japanese
• Can we find such heuristics automatically?

Reference (1)

• Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
• Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
• Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
• Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
• Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.

Reference (2)

• Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
• Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
• Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.

WTIM: Workshop on Text Input Methods

• 1st workshop held with IJCNLP 2011 (Thailand)
  – 12 people presented, from Google, Microsoft, and Yahoo
  – https://sites.google.com/site/wtim2011/
• 2nd workshop planned with COLING 2012 (India)
  – Venue: December 2012 in Mumbai, India
  – Are you interested as a presenter or an attendee?

Any Questions?
