Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics
Yoh Okuno


ACL 2012 / NEWS 2012

TRANSCRIPT

Page 1: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Yoh Okuno

Page 2: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
• Conclusion

2  

Page 3: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
  – Statistical Machine Transliteration
  – Baseline and Our Systems
• System
• Experiments
• Conclusion

3  

Page 4: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Machine Transliteration as Monotonic SMT

• The most common approach to machine transliteration is to follow the manner of SMT (Statistical Machine Translation)
• It consists of three steps:
  1. Align the training data monotonically (character-based)
  2. Train a discriminative model on the aligned data
  3. Decode an input string into an n-best list

4  

[Finch+  2008]
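As a concrete, deliberately tiny illustration of these three steps, the Python sketch below assumes the alignment of step 1 is already given, replaces the discriminative model of step 2 with simple rule counting, and decodes greedily to a single best output rather than an n-best list. It is only a toy, not the m2m-aligner/DirecTL+ pipeline the talk actually uses.

```python
# Toy illustration of the align / train / decode pipeline (not the actual
# m2m-aligner + DirecTL+ system used in the talk).
from collections import Counter, defaultdict

# Step 1 (alignment) is taken as given here: source chunks paired with kanji,
# as in the "Aligned Data" example on the following slides.
aligned_data = [
    [("OKU", "奥"), ("NO", "野")],
    [("NO", "野"), ("MURA", "村")],
    [("MURA", "村"), ("I", "井")],
]

# Step 2: "train" a rule table by counting how often each source chunk maps
# to each target chunk, keeping the most frequent target per chunk.
counts: dict[str, Counter] = defaultdict(Counter)
for pair in aligned_data:
    for src, tgt in pair:
        counts[src][tgt] += 1
rules = {src: tgt_counts.most_common(1)[0][0] for src, tgt_counts in counts.items()}

# Step 3: decode a new input by greedy longest-match over the learned rules.
def decode(source: str) -> str:
    output, i = [], 0
    while i < len(source):
        for length in range(len(source) - i, 0, -1):  # longest chunk first
            chunk = source[i:i + length]
            if chunk in rules:
                output.append(rules[chunk])
                i += length
                break
        else:
            i += 1  # skip characters no rule covers
    return "".join(output)

print(decode("OKUMURA"))  # -> 奥村
print(decode("MURANO"))   # -> 村野
```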

Page 5: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

• Given training data of transliteration pairs

Training Data:
  OKUNO   奥野
  NOMURA  野村
  MURAI   村井

5

Page 6: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

1. Align training data utilizing co-occurrence

Training Data:
  OKUNO   奥野
  NOMURA  野村
  MURAI   村井

Aligned Data (after 1. Align):
  OKU:NO   奥:野
  NO:MURA  野:村
  MURA:I   村:井

6

Page 7: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

2. Train statistical model from aligned data

Training Data → 1. Align → Aligned Data:
  OKU:NO   奥:野
  NO:MURA  野:村
  MURA:I   村:井

Aligned Data → 2. Train → Learned Model (Rules):
  OKU  → 奥
  NO   → 野
  MURA → 村
  I    → 井

7

Page 8: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

3. Decode new input and return output

Training Data → 1. Align → Aligned Data → 2. Train → Learned Model (Rules):
  OKU  → 奥
  NO   → 野
  MURA → 村
  I    → 井

Test Input → 3. Decode → Output:
  OKUMURA → 奥村
  OKUI    → 奥井
  MURANO  → 村野

8

Page 9: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

The Baseline System using m2m-aligner

Training Data → Align: m2m-aligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

[Jiampojamarn+ 2007, 2008]

9  

Page 10: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Our System: mpaligner with Heuristics

Training Data → Pre-processing → Align: mpaligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

Japanese-Specific Heuristics (Pre-processing):
  1. JnJk: De-romanization
  2. EnJa: Syllable-based Alignment

Improved Alignment Tool (mpaligner) [Kubo+ 2011]:
  1. Better accuracy than m2m-aligner
  2. No hand-tuning of parameters

10

Page 11: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
  – Comparing Aligners
  – Japanese-Specific Heuristics
• Experiments
• Conclusion

11  

Page 12: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

m2m-aligner: Many-to-Many Alignments

• Alignment tool based on the EM algorithm and MLE
• Advantages:
  1. Can align multiple characters
  2. Performs well on short alignments
• Disadvantages:
  1. Poor performance on long alignments due to overfitting
  2. Requires hand-tuning of length-limit parameters

[Jiampojamarn+ 2007]
http://code.google.com/p/m2m-aligner/

12

Page 13: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

mpaligner: Minimum Pattern Aligner

• Idea: penalize long alignments during the E-step
• Simple scaling as below
  – x: source string, y: target string
  – |x|: length of x, |y|: length of y
  – P(x, y): probability of the string pair (x, y)
• Good performance without hand-tuned parameters
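The scaling formula itself appears only as an image on the original slide and is not in this transcript. The LaTeX below is a hedged reconstruction based on the definitions above and on the stated goal of penalizing long pairs; the exact form is due to [Kubo+ 2011] and should be checked against that paper.

```latex
% Hedged reconstruction (assumption): scale each pair probability in the
% E-step by raising it to the power of the combined string length.
\[
  P(x, y) \;\leftarrow\; P(x, y)^{\,|x| + |y|}
\]
% Since 0 < P(x, y) <= 1, longer pairs (larger |x| + |y|) are pushed toward
% smaller values, which penalizes long alignments without any hand-tuned
% length-limit parameters.
```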

[Kubo+  2011]  

http://sourceforge.jp/projects/mpaligner/ 13  

Page 14: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Motivation: Invalid Alignment Problem

• Character-based alignment can be phonetically invalid
  – It may divide atomic units into meaningless pieces
  – We call the smallest unit of alignment a syllable
• Syllable-based alignment should be used for this task
  – Problem: no training data exists for syllable-based alignment
• In this study, we propose Japanese-specific heuristics for this problem, drawing on knowledge of Japanese

14  

Page 15: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Examples of Invalid and Valid Alignment

• In Japanese, consonants should be combined with vowels

• JnJk Task
  Type     Source    Target
  Valid    SUZU:KI   鈴:木
  Invalid  SUZ:UKI   鈴:木
  Valid    HIRO:MI   裕:実
  Invalid  HIR:OMI   裕:実
  Valid    OKU:NO    奥:野
  Invalid  OK:UNO    奥:野

• EnJa Task
  Type     Source         Target
  Valid    Ar:thur        アー:サー
  Invalid  A:r:th:ur      ア:ー:サ:ー
  Valid    Cha:p:li:n     チャッ:プ:リ:ン
  Invalid  C:h:a:p:li:n   チ:ャ:ッ:プ:リ:ン
  Valid    Ju:s:mi:ne     ジャ:ス:ミ:ン
  Invalid  J:u:s:mi:ne    ジ:ャ:ス:ミ:ン

15  

Page 16: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Language-Specific Heuristics as Preprocessing

• Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
  – Combine atomic strings into syllables
  – Treat a syllable as one character during alignment
• The definition of a syllable should be chosen carefully
  – It may cause bad side effects
  – Some context is incorporated as n-gram features

16  

Page 17: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

JnJk Task: De-romanization Heuristic

• De-romanization: convert Roman characters into Kana
• A consonant and a vowel are coupled into one Kana character
• A common romanization table is used (Hepburn)

  Roman  A   I   U   E   O
  Kana   あ  い  う  え  お
  Roman  KA  KI  KU  KE  KO
  Kana   か  き  く  け  こ

http://www.social-ime.com/conv-table.html

17
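As a concrete illustration of this heuristic, here is a minimal Python sketch (an assumption about the implementation, not the author's actual preprocessing code). It greedily applies a Hepburn-style conversion table, longest match first, so each consonant is coupled with its following vowel into one Kana unit.

```python
# A minimal sketch of the JnJk de-romanization heuristic (illustration only).
# Only a tiny excerpt of a Hepburn-style table is shown; a real system would
# load the full table, e.g. the one linked above.
ROMAN_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
    "NO": "の",  # extra entry from the full Hepburn table, needed below
}

def deromanize(romaji: str) -> list[str]:
    """Greedy longest-match conversion so each consonant is coupled with
    its following vowel into a single Kana unit."""
    kana, i = [], 0
    while i < len(romaji):
        for length in (3, 2, 1):  # longest romanization unit first (e.g. KYA)
            chunk = romaji[i:i + length]
            if chunk in ROMAN_TO_KANA:
                kana.append(ROMAN_TO_KANA[chunk])
                i += length
                break
        else:
            kana.append(romaji[i])  # keep characters the table does not cover
            i += 1
    return kana

# "OKUNO" -> ['お', 'く', 'の']: alignment now operates on Kana syllables
# instead of the raw Roman letters O, K, U, N, O.
print(deromanize("OKUNO"))
```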

Page 18: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

EnJa Task: Syllable-based Alignment

• In the EnJa task, the target side should be aligned in units of syllables, not characters
• Combine sub-characters with the preceding character
• There are 3 types of sub-characters:
  1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
  2. Silent character (Soku-on): e.g. ッ
  3. Hyphen (Cho-on; long vowel): e.g. ー

18  
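A minimal Python sketch of this grouping (an illustration, not the author's actual preprocessing code): sub-characters are appended to the preceding Katakana character so that each syllable becomes a single alignment unit.

```python
# A minimal sketch of the EnJa syllable-grouping heuristic (illustration only):
# Yo-on, Soku-on and Cho-on marks are merged into the preceding Katakana
# character so that each syllable is treated as one alignment unit.
SUB_CHARS = set("ャュョッー")  # the slide's examples; a full list would also
                               # include the other small Kana (ァ ィ ゥ ェ ォ, ...)

def group_syllables(katakana: str) -> list[str]:
    """Attach sub-characters to the preceding character."""
    units: list[str] = []
    for ch in katakana:
        if ch in SUB_CHARS and units:
            units[-1] += ch  # merge into the previous unit
        else:
            units.append(ch)
    return units

# "チャップリン" (Chaplin) -> ['チャッ', 'プ', 'リ', 'ン'], matching the
# valid alignment units shown on the earlier slide.
print(group_syllables("チャップリン"))
```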

Page 19: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
  – Official Scores for 8 Language Pairs
  – Further Investigation for JnJk and EnJa
• Conclusion

19  

Page 20: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Experimental Settings

• Conducted 2 types of experiments
  – Official evaluation on the test set for 8 language pairs
  – Comparison of the proposed and baseline systems on the development set for the JnJk and EnJa tasks
• Basically followed the default settings of the tools
  – m2m-aligner: length limits were selected carefully
  – Iteration number: optimized on the development set
  – Features: N-gram (N=2) and context (size=7) features

20

Page 21: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Official Scores for 8 Language Pairs

• Applied the heuristics to the JnJk and EnJa tasks
• Performed well (top rank on EnPe and EnHe)

  Task  ACC    F-Score  MRR    MAP    Rank
  JnJk  0.512  0.693    0.582  0.401  2
  EnJa  0.362  0.803    0.469  0.359  2
  EnCh  0.301  0.655    0.376  0.292  5
  ChEn  0.013  0.259    0.017  0.013  4
  EnKo  0.334  0.688    0.411  0.334  3
  EnBa  0.404  0.882    0.515  0.403  2
  EnPe  0.658  0.941    0.761  0.640  1
  EnHe  0.191  0.808    0.254  0.190  1

21

Page 22: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Results in JnJk and EnJa Tasks

• The proposed system outperforms the baselines

Result in JnJk Task:
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.113  0.389    0.182  0.114
  mpaligner    0.121  0.391    0.197  0.122
  Proposed     0.199  0.494    0.300  0.200

Result in EnJa Task:
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.280  0.737    0.359  0.280
  mpaligner    0.326  0.761    0.431  0.326
  Proposed     0.358  0.774    0.469  0.358

22  

Page 23: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Output Examples (10-best list)

JnJk Task
  Rank  Harui   Kyotaro
  1     春井    京太郎
  2     晴井    恭太郎
  3     治井    匡太郎
  4     榛井    強太郎
  5     敏井    共太郎
  6     明井    享太郎
  7     陽井    亨太郎
  8     遙井    杏太郎
  9     遥井    鋸太郎
  10    温井    教太郎

EnJa Task
  Rank  Bloy      Grothendieck
  1     ブロイ     グローテンディック
  2     ブロア     グロートンディック
  3     ブローイ    グローテンディーク
  4     ブロワ     グローテンディック
  5     ブロッイ    グローゾンディック
  6     ブロヤ     グローテンジーク
  7     ブロヨ     グローザーンディック
  8     ブウォイ    グローザンディック
  9     ブロティ    グローシンディック
  10    ブロレィ    グローゼンディック

23  

Page 24: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Error Analysis

• Sparseness problem:
  – Side effect of syllable-based alignment in the EnJa task
  – Too many target-side characters in the JnJk task
• Word origin [Hagiwara+ 2011]:
  – English names come from various languages
  – First and family names could be modeled differently
  – Gender: first names are quite different
• Training data inconsistency or ambiguity
  – e.g. JAPAN → 日本国 (not a transliteration)

24

Page 25: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
• Conclusion
  – Future Work

25  

Page 26: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Conclusion

• Applied mpaligner to the machine transliteration task for the first time
  – It performed better than m2m-aligner
  – The maximum likelihood estimation approach is not suitable for this task
• Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
  – De-romanization for the JnJk task
  – Syllable-based alignment for the EnJa task

26  

Page 27: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Future Work

• Combine these heuristics with other language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
• Develop language-dependent heuristics for languages other than Japanese
• Can we find such heuristics automatically?

27

Page 28: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Reference (1)

• Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
• Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
• Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
• Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
• Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.

28

Page 29: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Reference (2)

• Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
• Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
• Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.

29

Page 30: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

WTIM: Workshop on Text Input Methods

• 1st workshop held with IJCNLP 2011 (Thailand)
  – 12 people from Google, Microsoft, and Yahoo presented
  – https://sites.google.com/site/wtim2011/
• 2nd workshop planned with COLING 2012 (India)
  – Venue: Mumbai, India, December 2012
  – Are you interested as a presenter or an attendee?

Page 31: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Any Questions?