
Page 1: Latent Class Transliteration based on Source Language Origin

Latent Class Transliteration

based on Source Language Origin

Masato Hagiwara & Satoshi Sekine

Rakuten Institute of Technology, New York

ACL-HLT 2011, June 21

Page 2: Latent Class Transliteration based on Source Language Origin

Objective

• Transliteration
  – Phonetic translation between languages with different writing systems
    e.g., flextime / フレックスタイム furekkusutaimu
  – Useful for machine translation, spelling variation, etc.

• Transliteration models
  – Phonetic-based rewriting models (Knight and Graehl 1998)
  – Spelling-based supervised models (Brill and Moore 2000)

Page 3: Latent Class Transliteration based on Source Language Origin

Spelling-based model (Brill and Moore 2000)

• Edit distance: substitution, insertion, deletion = cost 1
• Alpha-beta model: a generalization of edit distance to string-to-string
  substitutions α → β, each with a substitution probability P(α → β)
  e.g., flextime / furekkusutaimu, segmented as ^f / le / x / ti / me$ → fu / re / kkusu / tai / mu
• Transliteration probability: the maximum rewriting probability over all possible partitions,
  e.g. P(flextime → furekkusutaimu)
       = P(^f→fu) · P(le→re) · P(x→kkusu) · P(ti→tai) · P(me$→mu)
  (a sketch of this computation follows below)
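To make the "maximum over all partitions" computation concrete, here is a minimal dynamic-programming sketch in Python. The substitution table `sub_prob`, the length limits `max_alpha`/`max_beta`, and attaching the `^`/`$` boundary markers to the source string are illustrative assumptions, not details taken from the slides.

```python
from functools import lru_cache

def alpha_beta_prob(src, tgt, sub_prob, max_alpha=3, max_beta=5):
    """Maximum rewriting probability of src -> tgt over all partitions,
    given string-to-string substitution probabilities sub_prob[(alpha, beta)].
    Boundary markers are assumed to be attached, e.g. src = '^flextime$'."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Best probability of rewriting src[:i] into tgt[:j].
        if i == 0 and j == 0:
            return 1.0
        p = 0.0
        for a in range(0, min(i, max_alpha) + 1):        # length of alpha
            for b in range(0, min(j, max_beta) + 1):     # length of beta
                if a == 0 and b == 0:
                    continue
                q = sub_prob.get((src[i - a:i], tgt[j - b:j]), 0.0)
                if q > 0.0:
                    p = max(p, best(i - a, j - b) * q)
        return p

    return best(len(src), len(tgt))

# Toy table reproducing the factorization on the slide (the probabilities are made up):
sub_prob = {("^f", "fu"): 0.4, ("le", "re"): 0.5, ("x", "kkusu"): 0.3,
            ("ti", "tai"): 0.4, ("me$", "mu"): 0.5}
print(alpha_beta_prob("^flextime$", "furekkusutaimu", sub_prob))  # = 0.4*0.5*0.3*0.4*0.5
```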

Page 4: Latent Class Transliteration based on Source Language Origin

Multiple Language Origins

piaget / ピアジェ piaje        target / ターゲット tāgetto

Should "get" be rewritten as ジェ je or as ゲット getto? It depends on the source
language origin, so a single model cannot deal with multiple origins.

• Class transliteration model (Li et al. 2007)
  – Language detection + switching between multiple language-specific models
    piaget: French origin → French model
    target: English origin → English model
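The hard and soft variants of this class-based approach (later compared as HARD and SOFT in the results) can be written roughly as follows; the notation, in particular P(l | s) for the language-detection probability, is assumed here for illustration rather than copied from the slides.

```latex
% s: source word, t: target transliteration, l: language origin
% HARD: detect the single most likely origin, then apply that language's model
P_{\mathrm{hard}}(t \mid s) = P(t \mid s, \hat{l}),
  \qquad \hat{l} = \arg\max_{l} P(l \mid s)
% SOFT: mix all language-specific models, weighted by the detection probability
P_{\mathrm{soft}}(t \mid s) = \sum_{l} P(l \mid s)\, P(t \mid s, l)
```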

Page 5: Latent Class Transliteration based on Source Language Origin

Issues with the Class Transliteration Model

• Requires training sets tagged with language origins
  – Such tagged data is rare, especially for proper nouns

• Language origin ≠ the right transliteration model
  – e.g., spaghetti / スパゲティ supageti
    Italian origin, but found in English dictionaries
  – e.g., Carl Laemmle / カール・レムリ kāru remuri
    a German immigrant, but listed as an "American" film producer
    → an English transliteration model does not work

→ Model source language origins as latent classes

Page 6: Latent Class Transliteration based on Source Language Origin

Latent Class Transliteration Model

• We propose the "latent class transliteration model"
  – Models the source language origins as latent classes
  – Latent classes correspond to sets of words with similar transliteration characteristics
  – Trained via the EM algorithm from transliteration pairs

Comparison of the two models (s: source, t: target):
  – Class transliteration model: explicit language detection over classes such as
    language and gender (Li et al. 2007)
  – Latent class transliteration model (proposed): a distribution over latent classes
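One plausible way to write the contrast between the two models, with the notation (c for an explicit class such as language or gender, z for a latent class, K classes in total) assumed for illustration:

```latex
% Class transliteration model (Li et al. 2007): explicit classes c (e.g. language, gender)
P_{\mathrm{class}}(t \mid s) = \sum_{c} P(c \mid s)\, P(t \mid s, c)
% Latent class transliteration model (proposed): z is never observed; its distribution
% and the per-class substitution models are estimated jointly by EM
P_{\mathrm{latent}}(t \mid s) = \sum_{z=1}^{K} P(z \mid s)\, P(t \mid s, z)
```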

Page 7: Latent Class Transliteration based on Source Language Origin

Model Training via the EM Algorithm

[Equations on the slide: the log likelihood, the E step, and the M step;
 a generic sketch follows below]
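As a hedged sketch of what these updates typically look like for a mixture of per-class transliteration models: π_z is the prior of latent class z, γ_iz the responsibility of class z for training pair (s_i, t_i), P_z the class-specific αβ model, and f_i(α→β) the count of substitution α→β in pair i's alignment. This notation is assumed for illustration, not copied from the slide.

```latex
% Log likelihood of N training pairs under a K-class mixture
\mathcal{L} = \sum_{i=1}^{N} \log \sum_{z=1}^{K} \pi_z\, P_z(t_i \mid s_i)

% E step: responsibility of latent class z for pair i
\gamma_{iz} = \frac{\pi_z\, P_z(t_i \mid s_i)}{\sum_{z'} \pi_{z'}\, P_{z'}(t_i \mid s_i)}

% M step: re-estimate class priors and per-class substitution probabilities
\pi_z = \frac{1}{N} \sum_{i} \gamma_{iz}, \qquad
P_z(\beta \mid \alpha) = \frac{\sum_{i} \gamma_{iz}\, f_i(\alpha \to \beta)}
                              {\sum_{i} \gamma_{iz} \sum_{\beta'} f_i(\alpha \to \beta')}
```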

Page 8: Latent Class Transliteration based on Source Language Origin

Iterative Learning via EM Algorithm

[Diagram] Training pairs such as piaget → piaje and target → taagetto are segmented
into substitutions (p/i/a/get → p/i/a/je, t/ar/get → t/aa/getto) and associated with
latent classes Lx, Ly, Lz:

  – E step: compute the transliteration probability of each pair under each latent
    class, based on the αβ model
  – M step: update each class's transliteration model, i.e. the substitution
    probabilities P(^p→p), P(ar→aa), P(get$→je), P(get$→getto), ..., from
    γ-weighted counts such as Σ γ · f(get$→je)

(a runnable sketch of this loop follows below)
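A minimal runnable sketch of this loop in Python, using the fixed segmentations shown in the diagram. The number of classes, the random initialization of the responsibilities, and the smoothing floor are illustrative assumptions; the full model would also re-search the partitions with the αβ scorer instead of fixing them.

```python
import random
from collections import Counter

random.seed(0)

# Training pairs pre-segmented into (alpha, beta) substitutions, as in the diagram.
pairs = [
    [("^p", "p"), ("i", "i"), ("a", "a"), ("get$", "je")],    # piaget -> piaje
    [("^t", "t"), ("ar", "aa"), ("get$", "getto")],           # target -> taagetto
]
K = 2                                       # number of latent classes (Lx, Ly, ...)

def m_step(gamma):
    """Re-estimate class priors and per-class substitution tables from
    gamma-weighted counts, e.g. sum_i gamma_iz * f_i(get$ -> je)."""
    pi = [sum(g[z] for g in gamma) / len(pairs) for z in range(K)]
    sub = []
    for z in range(K):
        counts, totals = Counter(), Counter()
        for g, pair in zip(gamma, pairs):
            for a, b in pair:
                counts[(a, b)] += g[z]
                totals[a] += g[z]
        sub.append({ab: c / totals[ab[0]] for ab, c in counts.items()})
    return pi, sub

def pair_prob(pair, sub_z):
    """P_z(t | s) for a fixed segmentation: product of substitution probabilities."""
    p = 1.0
    for ab in pair:
        p *= sub_z.get(ab, 1e-6)            # small floor for unseen substitutions
    return p

# Random responsibilities to break symmetry, then alternate M and E steps.
gamma = [[random.random() for _ in range(K)] for _ in pairs]
gamma = [[x / sum(g) for x in g] for g in gamma]

for _ in range(20):
    pi, sub = m_step(gamma)                 # M step
    gamma = []                              # E step: class responsibilities per pair
    for pair in pairs:
        w = [pi[z] * pair_prob(pair, sub[z]) for z in range(K)]
        gamma.append([x / sum(w) for x in w])

print(pi)                                   # learned latent class priors
print(sub[0].get(("get$", "je")), sub[0].get(("get$", "getto")))
```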

Page 9: Latent Class Transliteration based on Source Language Origin

Experiments

• Task: estimate the correct transliteration of foreign proper nouns
  – Rank the candidates by transliteration probability
  – Evaluate with top-10 mean reciprocal rank (MRR)

• Datasets
  – Dataset 1: Western person name list (6,718; de + en + fr)
  – Dataset 2: Western proper nouns from Wikipedia (11,323; + it + es)
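A small helper showing how a top-10 MRR figure of this kind is typically computed; the function name and the input format (one ranked candidate list per test word, plus its gold transliteration) are assumptions for illustration.

```python
def top_k_mrr(ranked_candidates, gold, k=10):
    """Mean reciprocal rank, counting only the top-k candidates per item.

    ranked_candidates: one candidate list per test item, sorted by model probability
    gold: the correct transliteration of each test item
    """
    total = 0.0
    for cands, answer in zip(ranked_candidates, gold):
        rr = 0.0
        for rank, cand in enumerate(cands[:k], start=1):
            if cand == answer:
                rr = 1.0 / rank
                break
        total += rr
    return 100.0 * total / len(gold)        # on a 0-100 scale, as in the results table

# Correct answer ranked 1st for the first item and 2nd for the second: (1 + 0.5) / 2
print(top_k_mrr([["piaje", "piaget"], ["taagetto", "tāgetto"]],
                ["piaje", "tāgetto"]))      # -> 75.0
```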

Page 10: Latent Class Transliteration based on Source Language Origin

Compared Models

• Alpha-beta method (AB)
• Class transliteration method (SOFT)
• Class transliteration method (HARD)
• Latent class transliteration method (LATENT, proposed)

Page 11: Latent Class Transliteration based on Source Language Origin

Results

Performance measure: top-10 mean reciprocal rank (MRR)

Model                                                    Dataset 1   Dataset 2
Alpha-beta method (AB)                                        94.8        90.9
Class transliteration method (HARD)                           90.3        89.8
Class transliteration method (SOFT)                           95.7        92.4
Latent class transliteration method (LATENT, proposed)        95.8        92.4

Notes on the slide:
• Higher performance of LATENT vs. SOFT/HARD
• Performance can be higher depending on the number of latent classes
• Low class detection precision (77.4%)

Page 12: Latent Class Transliteration based on Source Language Origin

Error Analysis

Examples comparing SOFT/HARD with LATENT (proposed). Each line shows the word, its
correct transliteration with an origin tag, and the erroneous system output(s):

  Felix / フェリックス ferikkusu [en]     →  フィリス firisu
  Read / リード riido [en]               →  レアード reādo
  Caen / カーン kān [fr]                 →  シャーン shān
  Laemmle / レムリ remuri [en]           →  リアム riamu
  Xavier / ザビア zabia [en]             →  ガブリエル gaburieru
  Hilda / イルダ iruda [en]              →  ハルラ harura (both SOFT/HARD and LATENT)

Page 13: Latent Class Transliteration based on Source Language Origin

Conclusion

• Proposed the "latent class transliteration model"
  – Models source language origins as latent classes
  – Estimated from transliteration pairs via the EM algorithm
  – Comparable results to models that use explicit language origins

• Future work
  – Source languages other than Western languages
  – Target languages other than Japanese