Latent Class Transliteration
based on Source Language Origin
Masato Hagiwara & Satoshi Sekine
Rakuten Institute of Technology, New York
ACL-HLT 2011, June 21
Objective
• Transliteration
– Phonetic translation between languages with different
writing systems
e.g., flextime / フレックスタイム furekkusutaimu
– Useful for machine translation, handling spelling variation, etc.
• Transliteration models
– Phonetic-based re-writing models
(Knight and Graehl 1998)
– Spelling-based supervised models
(Brill and Moore 2000)
Spelling-based model (Brill and Moore 2000)
Edit distance: substitution, insertion, deletion = cost 1
  flextime / furekkusutaimu
Alpha-Beta model: generalization of edit distance to string-to-string substitutions α → β
  flextime / furekkusutaimu
  P(flextime → furekkusutaimu)
    = P(^f→fu) · P(le→re) · P(x→kkusu) · P(ti→tai) · P(me$→mu)
Transliteration probability: the maximum rewriting probability over all possible partitions into substitution pairs (α, β):
  P(t | s) = max over partitions ∏ P(αᵢ → βᵢ)
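The maximum-over-partitions probability above can be computed with dynamic programming over prefix pairs. Below is a minimal sketch; the substitution-probability table is made up for illustration and is not the trained model from the talk.

```python
# Toy substitution probabilities P(alpha -> beta); values are invented.
# "^" and "$" mark word boundaries, as in the ^f -> fu and me$ -> mu units.
SUBST_PROB = {
    ("^f", "fu"): 0.8, ("le", "re"): 0.6, ("x", "kkusu"): 0.5,
    ("ti", "tai"): 0.4, ("me$", "mu"): 0.7,
}

def transliteration_prob(source, target, probs, max_a=3, max_b=5):
    """Max over all partitions of the product of P(alpha -> beta)."""
    s = "^" + source + "$"          # attach boundary markers
    n, m = len(s), len(target)
    # best[i][j] = best probability of rewriting s[:i] into target[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for a in range(1, min(max_a, i) + 1):      # length of alpha
                for b in range(1, min(max_b, j) + 1):  # length of beta
                    p = probs.get((s[i - a:i], target[j - b:j]), 0.0)
                    if p > 0.0 and best[i - a][j - b] > 0.0:
                        best[i][j] = max(best[i][j], best[i - a][j - b] * p)
    return best[n][m]

# Mathematically, P(^f->fu)*P(le->re)*P(x->kkusu)*P(ti->tai)*P(me$->mu) = 0.0672
print(transliteration_prob("flextime", "furekkusutaimu", SUBST_PROB))
```

The `max_a`/`max_b` bounds cap the substring lengths considered, which keeps the search quadratic in each string length times a small constant.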
Multiple Language Origins
piaget / ピアジェ piaje target / ターゲット tāgetto
Single models cannot deal with multiple origins:
P(get → ジェ je)? or P(get → ゲット getto)?
• Class transliteration model (Li et al. 2007)
– Language detection + switching between multiple models
– piaget: French origin → French model
– target: English origin → English model
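Such class-based combination can be sketched as follows, in the spirit of Li et al. (2007). All probabilities below are invented for illustration; "soft" mixes the per-language models by detected origin probability, while "hard" commits to the single most likely origin.

```python
# Sketch of class-based combination over explicit language origins.
# The origin detector output and per-language candidate scores are
# hypothetical numbers, not trained values.

def soft_score(origin_probs, model_probs):
    """SOFT combination: P(t|s) = sum over l of P(l|s) * P(t|s,l)."""
    return sum(origin_probs[l] * model_probs[l] for l in origin_probs)

def hard_score(origin_probs, model_probs):
    """HARD combination: use only the most likely origin's model."""
    best = max(origin_probs, key=origin_probs.get)
    return model_probs[best]

# Hypothetical scores for s = "piaget", candidate t = "piaje":
origin = {"fr": 0.7, "en": 0.3}          # detected P(l | s)
candidate = {"fr": 0.020, "en": 0.001}   # per-language P(t | s, l)
```

With these toy numbers, soft scoring still gives weight to the English model, while hard scoring uses the French model alone.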
Issues with the Class Transliteration Model
• Requires training sets tagged with language origins
– Rare especially for proper nouns
• Language origins ≠ transliteration models
– e.g., spaghetti / スパゲティ supageti
Italian origin, but found in English dictionaries
– e.g., Carl Laemmle / カール・レムリ kāru remuri
German immigrant but listed as an “American” film producer
→ An English transliteration model doesn’t work
Model source language origins as latent classes
Latent Class Transliteration Model
• Proposing “latent class transliteration model”
– Models the “source language origins” as latent classes
– “latent classes” correspond to sets of words with similar
transliteration characteristics
– Trained via the EM algorithm from transliteration pairs
Class transliteration model (explicit language detection; l: language, g: gender):
  P(t | s) = Σ_{l,g} P(l, g | s) · P(t | s, l, g)
Latent class transliteration model (proposed; z: latent class):
  P(t | s) = Σ_z P(z | s) · P(t | s, z)
(s: source, t: target)
Model Training via the EM Algorithm
E step: compute the posterior probability of each latent class for every training pair
M step: re-estimate class priors and substitution probabilities so as to maximize the expected log likelihood
Iterative Learning via the EM Algorithm
Training pairs: piaget → piaje, target → taagetto, …
E step: compute transliteration probabilities based on the αβ model, and the class posteriors γ over the latent classes Lx, Ly, Lz for each segmented pair:
  p/i/a/get → p/i/a/je
  t/ar/get → t/aa/getto
M step: update the per-class transliteration model from the expected counts, e.g. Σ γ · f(get$ → je):
  P(^p→p), P(ar→aa), P(get$→je), P(get$→getto), …
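The E/M updates above can be sketched for a toy version of the model. This assumes fixed segmentations for each training pair and two latent classes with near-uniform initialization plus a small symmetry-breaking perturbation; the real model also sums over partitions, so this is only an illustration.

```python
from collections import defaultdict

# Toy EM for the latent class transliteration model. Simplifying
# assumptions: segmentations are fixed, and substitution probabilities
# are normalized per class over all observed alpha->beta units.
pairs = [
    [("^p", "p"), ("i", "i"), ("a", "a"), ("get$", "je")],   # piaget -> piaje
    [("^t", "t"), ("ar", "aa"), ("get$", "getto")],          # target -> taagetto
]
K = 2  # number of latent classes

units = sorted({u for pair in pairs for u in pair})
prior = [1.0 / K] * K
subst = []
for z in range(K):
    # near-uniform init with a deterministic symmetry-breaking bump
    raw = {u: 1.0 + 0.1 * ((i + z) % 2) for i, u in enumerate(units)}
    norm = sum(raw.values())
    subst.append({u: v / norm for u, v in raw.items()})

def e_step(prior, subst):
    """Posterior gamma(z | pair) for every training pair."""
    gammas = []
    for pair in pairs:
        scores = []
        for z in range(K):
            p = prior[z]
            for u in pair:
                p *= subst[z][u]
            scores.append(p)
        total = sum(scores)
        gammas.append([sc / total for sc in scores])
    return gammas

def m_step(gammas):
    """Re-estimate class priors and per-class substitution probabilities."""
    prior = [sum(g[z] for g in gammas) / len(pairs) for z in range(K)]
    counts = [defaultdict(float) for _ in range(K)]
    for g, pair in zip(gammas, pairs):
        for z in range(K):
            for u in pair:
                counts[z][u] += g[z]  # expected count, e.g. sum of gamma * f(get$->je)
    subst = []
    for z in range(K):
        norm = sum(counts[z].values())
        subst.append({u: counts[z][u] / norm for u in units})
    return prior, subst

for _ in range(20):
    gammas = e_step(prior, subst)
    prior, subst = m_step(gammas)
```

After each iteration the class priors and the per-class substitution distributions remain properly normalized, and pairs with similar rewriting units drift toward the same latent class.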
Experiments
• Estimate the correct transliteration of foreign proper nouns
– Rank candidates by transliteration probability
– Evaluate by top-10 mean reciprocal rank (MRR)
• Datasets
– Dataset 1: Western person name list (6,718 pairs; de + en + fr)
– Dataset 2: Western proper nouns from Wikipedia (11,323 pairs; de + en + fr + it + es)
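The top-10 MRR measure can be sketched as follows; the candidate lists and gold answers below are made up purely for illustration.

```python
# Top-10 mean reciprocal rank (MRR): each test word contributes the
# reciprocal of the rank at which its gold transliteration appears
# among the top 10 candidates (0 if absent), averaged over test words.

def mrr_at_10(ranked_candidates, gold):
    """Mean of 1/rank of the gold answer within the top 10 candidates."""
    total = 0.0
    for cands, answer in zip(ranked_candidates, gold):
        for rank, c in enumerate(cands[:10], start=1):
            if c == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

# Hypothetical ranked outputs for two test words:
ranked = [
    ["piaje", "piajetto"],     # gold at rank 1 -> contributes 1.0
    ["taageto", "taagetto"],   # gold at rank 2 -> contributes 0.5
]
gold = ["piaje", "taagetto"]
```

A candidate outside the top 10 contributes nothing, so MRR rewards models that place the correct transliteration near the top of the list.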
Compared Models
• Alpha-beta method (AB): single model without origin classes
• Class transliteration method (SOFT): per-origin models mixed by detected origin probabilities
• Class transliteration method (HARD): only the single most likely origin's model
• Latent class transliteration method (LATENT, proposed)
Results (performance measure: top-10 mean reciprocal rank, MRR)

Model                                                   Dataset 1   Dataset 2
Alpha-beta method (AB)                                     94.8        90.9
Class transliteration method (HARD)                        90.3        89.8
Class transliteration method (SOFT)                        95.7        92.4
Latent class transliteration method (LATENT, proposed)     95.8        92.4

• LATENT matches or slightly exceeds SOFT and clearly outperforms HARD
• Performance can be higher depending on the number of latent classes
• HARD suffers from low class detection precision (77.4%)
Error Analysis

Example                          SOFT/HARD              LATENT (proposed)
Felix / フェリックス ferikkusu    [en] フィリス firisu
Read / リード riido              [en] レアード reādo
Caen / カーン kān                [fr] シャーン shān
Laemmle / レムリ remuri          [en] リアム riamu
Xavier / ザビア zabia            [en] ガブリエル gaburieru
Hilda / イルダ iruda             [en] ハルラ harura      ハルラ harura
Conclusion
• Proposed the “latent class
transliteration model”
– Models source language origins as latent classes
– Model estimation from transliterated pairs via the
EM algorithm
– Comparable results vs. models with explicit language origins
• Future work
– Sources other than Western languages
– Targets other than Japanese