Latent Class Transliteration
based on Source Language Origin
Masato Hagiwara & Satoshi Sekine
Rakuten Institute of Technology, New York
ACL-HLT 2011, June 21
Objective
• Transliteration
– Phonetic translation between languages with different
writing systems
e.g., flextime / フレックスタイム furekkusutaimu
– Useful for machine translation, handling spelling variation, etc.
• Transliteration models
– Phonetic-based re-writing models
(Knight and Graehl 1998)
– Spelling-based supervised models
(Brill and Moore 2000)
Spelling-based model (Brill and Moore 2000)
Edit distance: substitution, insertion, deletion = cost 1
  flextime / furekkusutaimu
Alpha-Beta model: generalization of edit distance to string-to-string substitutions α → β
  flextime / furekkusutaimu
  P(flextime → furekkusutaimu)
    = P(^f→fu) · P(le→re) · P(x→kkusu) · P(ti→tai) · P(me$→mu)
Transliteration probability: the maximum rewriting probability over all possible partitions into substitution pairs (α, β):
  P(t | s) = max over partitions ∏ P(αᵢ → βᵢ)
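The maximum-over-partitions probability above can be computed with dynamic programming over prefix pairs. Below is a minimal sketch; the substitution-probability table is made up for illustration and is not the trained model from the talk.

```python
# Toy substitution probabilities P(alpha -> beta); values are invented.
# "^" and "$" mark word boundaries, as in the ^f -> fu and me$ -> mu units.
SUBST_PROB = {
    ("^f", "fu"): 0.8, ("le", "re"): 0.6, ("x", "kkusu"): 0.5,
    ("ti", "tai"): 0.4, ("me$", "mu"): 0.7,
}

def transliteration_prob(source, target, probs, max_a=3, max_b=5):
    """Max over all partitions of the product of P(alpha -> beta)."""
    s = "^" + source + "$"          # attach boundary markers
    n, m = len(s), len(target)
    # best[i][j] = best probability of rewriting s[:i] into target[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for a in range(1, min(max_a, i) + 1):      # length of alpha
                for b in range(1, min(max_b, j) + 1):  # length of beta
                    p = probs.get((s[i - a:i], target[j - b:j]), 0.0)
                    if p > 0.0 and best[i - a][j - b] > 0.0:
                        best[i][j] = max(best[i][j], best[i - a][j - b] * p)
    return best[n][m]

# Mathematically, P(^f->fu)*P(le->re)*P(x->kkusu)*P(ti->tai)*P(me$->mu) = 0.0672
print(transliteration_prob("flextime", "furekkusutaimu", SUBST_PROB))
```

The `max_a`/`max_b` bounds cap the substring lengths considered, which keeps the search quadratic in each string length times a small constant.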
Multiple Language Origins
piaget / ピアジェ piaje target / ターゲット tāgetto
Single models cannot deal with multiple origins:
P(get → ジェ je)? or P(get → ゲット getto)?
• Class transliteration model (Li et al. 2007)
– Language detection + switching between multiple models
– piaget: French origin → French model
– target: English origin → English model
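Such class-based combination can be sketched as follows, in the spirit of Li et al. (2007). All probabilities below are invented for illustration; "soft" mixes the per-language models by detected origin probability, while "hard" commits to the single most likely origin.

```python
# Sketch of class-based combination over explicit language origins.
# The origin detector output and per-language candidate scores are
# hypothetical numbers, not trained values.

def soft_score(origin_probs, model_probs):
    """SOFT combination: P(t|s) = sum over l of P(l|s) * P(t|s,l)."""
    return sum(origin_probs[l] * model_probs[l] for l in origin_probs)

def hard_score(origin_probs, model_probs):
    """HARD combination: use only the most likely origin's model."""
    best = max(origin_probs, key=origin_probs.get)
    return model_probs[best]

# Hypothetical scores for s = "piaget", candidate t = "piaje":
origin = {"fr": 0.7, "en": 0.3}          # detected P(l | s)
candidate = {"fr": 0.020, "en": 0.001}   # per-language P(t | s, l)
```

With these toy numbers, soft scoring still gives weight to the English model, while hard scoring uses the French model alone.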
Issues with the Class Transliteration Model
• Requires training sets tagged with language origins
– Rare especially for proper nouns
• Language origins ≠ transliteration models
– e.g., spaghetti / スパゲティ supageti
Italian origin, but found in English dictionaries
– e.g., Carl Laemmle / カール・レムリ kāru remuri
German immigrant but listed as an “American” film producer
→ An English transliteration model doesn’t work
Model source language origins as latent classes
Latent Class Transliteration Model
• Proposing “latent class transliteration model”
– Models the “source language origins” as latent classes
– “latent classes” correspond to sets of words with similar
transliteration characteristics
– Trained via the EM algorithm from transliteration pairs
Class transliteration model (explicit language detection; l: language, g: gender):
  P(t | s) = Σ_{l,g} P(l, g | s) · P(t | s, l, g)
Latent class transliteration model (proposed; z: latent class):
  P(t | s) = Σ_z P(z | s) · P(t | s, z)
(s: source, t: target)
Model Training via the EM Algorithm
E step: compute the posterior probability of each latent class for every training pair
M step: re-estimate class priors and substitution probabilities so as to maximize the expected log likelihood
Iterative Learning via the EM Algorithm
Training pairs: piaget → piaje, target → taagetto, …
E step: compute transliteration probabilities based on the αβ model, and the class posteriors γ over the latent classes Lx, Ly, Lz for each segmented pair:
  p/i/a/get → p/i/a/je
  t/ar/get → t/aa/getto
M step: update the per-class transliteration model from the expected counts, e.g. Σ γ · f(get$ → je):
  P(^p→p), P(ar→aa), P(get$→je), P(get$→getto), …
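The E/M updates above can be sketched for a toy version of the model. This assumes fixed segmentations for each training pair and two latent classes with near-uniform initialization plus a small symmetry-breaking perturbation; the real model also sums over partitions, so this is only an illustration.

```python
from collections import defaultdict

# Toy EM for the latent class transliteration model. Simplifying
# assumptions: segmentations are fixed, and substitution probabilities
# are normalized per class over all observed alpha->beta units.
pairs = [
    [("^p", "p"), ("i", "i"), ("a", "a"), ("get$", "je")],   # piaget -> piaje
    [("^t", "t"), ("ar", "aa"), ("get$", "getto")],          # target -> taagetto
]
K = 2  # number of latent classes

units = sorted({u for pair in pairs for u in pair})
prior = [1.0 / K] * K
subst = []
for z in range(K):
    # near-uniform init with a deterministic symmetry-breaking bump
    raw = {u: 1.0 + 0.1 * ((i + z) % 2) for i, u in enumerate(units)}
    norm = sum(raw.values())
    subst.append({u: v / norm for u, v in raw.items()})

def e_step(prior, subst):
    """Posterior gamma(z | pair) for every training pair."""
    gammas = []
    for pair in pairs:
        scores = []
        for z in range(K):
            p = prior[z]
            for u in pair:
                p *= subst[z][u]
            scores.append(p)
        total = sum(scores)
        gammas.append([sc / total for sc in scores])
    return gammas

def m_step(gammas):
    """Re-estimate class priors and per-class substitution probabilities."""
    prior = [sum(g[z] for g in gammas) / len(pairs) for z in range(K)]
    counts = [defaultdict(float) for _ in range(K)]
    for g, pair in zip(gammas, pairs):
        for z in range(K):
            for u in pair:
                counts[z][u] += g[z]  # expected count, e.g. sum of gamma * f(get$->je)
    subst = []
    for z in range(K):
        norm = sum(counts[z].values())
        subst.append({u: counts[z][u] / norm for u in units})
    return prior, subst

for _ in range(20):
    gammas = e_step(prior, subst)
    prior, subst = m_step(gammas)
```

After each iteration the class priors and the per-class substitution distributions remain properly normalized, and pairs with similar rewriting units drift toward the same latent class.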
Experiments
• Estimate the correct transliteration of foreign proper nouns
– Rank candidates by transliteration probability
– Evaluate by top-10 mean reciprocal rank (MRR)
• Datasets
– Dataset 1: Western person name list (6,718 pairs; de + en + fr)
– Dataset 2: Western proper nouns from Wikipedia (11,323 pairs; de + en + fr + it + es)
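The top-10 MRR measure can be sketched as follows; the candidate lists and gold answers below are made up purely for illustration.

```python
# Top-10 mean reciprocal rank (MRR): each test word contributes the
# reciprocal of the rank at which its gold transliteration appears
# among the top 10 candidates (0 if absent), averaged over test words.

def mrr_at_10(ranked_candidates, gold):
    """Mean of 1/rank of the gold answer within the top 10 candidates."""
    total = 0.0
    for cands, answer in zip(ranked_candidates, gold):
        for rank, c in enumerate(cands[:10], start=1):
            if c == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

# Hypothetical ranked outputs for two test words:
ranked = [
    ["piaje", "piajetto"],     # gold at rank 1 -> contributes 1.0
    ["taageto", "taagetto"],   # gold at rank 2 -> contributes 0.5
]
gold = ["piaje", "taagetto"]
```

A candidate outside the top 10 contributes nothing, so MRR rewards models that place the correct transliteration near the top of the list.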
Compared Models
• Alpha-beta method (AB): single model without origin classes
• Class transliteration method (SOFT): per-origin models mixed by detected origin probabilities
• Class transliteration method (HARD): only the single most likely origin's model
• Latent class transliteration method (LATENT, proposed)
Results (performance measure: top-10 mean reciprocal rank, MRR)

Model                                                   Dataset 1   Dataset 2
Alpha-beta method (AB)                                     94.8        90.9
Class transliteration method (HARD)                        90.3        89.8
Class transliteration method (SOFT)                        95.7        92.4
Latent class transliteration method (LATENT, proposed)     95.8        92.4

• LATENT matches or slightly exceeds SOFT and clearly outperforms HARD
• Performance can be higher depending on the number of latent classes
• HARD suffers from low class detection precision (77.4%)
Error Analysis

Example                          SOFT/HARD              LATENT (proposed)
Felix / フェリックス ferikkusu    [en] フィリス firisu
Read / リード riido              [en] レアード reādo
Caen / カーン kān                [fr] シャーン shān
Laemmle / レムリ remuri          [en] リアム riamu
Xavier / ザビア zabia            [en] ガブリエル gaburieru
Hilda / イルダ iruda             [en] ハルラ harura      ハルラ harura
Conclusion
• Proposed the “latent class
transliteration model”
– Models source language origins as latent classes
– Model estimation from transliterated pairs via the
EM algorithm
– Comparable results vs. models with explicit language origins
• Future work
– Sources other than Western languages
– Targets other than Japanese