Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics
Yoh Okuno


ACL 2012 / NEWS 2012

TRANSCRIPT

Page 1: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Yoh Okuno

Page 2: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
• Conclusion

2  

Page 3: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
  – Statistical Machine Transliteration
  – Baseline and Our Systems
• System
• Experiments
• Conclusion

3  

Page 4: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Machine Transliteration as Monotonic SMT

• The most common approach to machine transliteration is to follow the manner of SMT (Statistical Machine Translation)
• It consists of three steps:
  1. Align the training data monotonically (character-based)
  2. Train a discriminative model on the aligned data
  3. Decode an input string into an n-best list

4  

[Finch+  2008]
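As a concrete, deliberately tiny illustration of these three steps, the Python sketch below assumes the alignment of step 1 is already given, replaces the discriminative model of step 2 with simple rule counting, and decodes greedily to a single best output rather than an n-best list. It is only a toy, not the m2m-aligner/DirecTL+ pipeline the talk actually uses.

```python
# Toy illustration of the align / train / decode pipeline (not the actual
# m2m-aligner + DirecTL+ system used in the talk).
from collections import Counter, defaultdict

# Step 1 (alignment) is taken as given here: source chunks paired with kanji,
# as in the "Aligned Data" example on the following slides.
aligned_data = [
    [("OKU", "奥"), ("NO", "野")],
    [("NO", "野"), ("MURA", "村")],
    [("MURA", "村"), ("I", "井")],
]

# Step 2: "train" a rule table by counting how often each source chunk maps
# to each target chunk, keeping the most frequent target per chunk.
counts: dict[str, Counter] = defaultdict(Counter)
for pair in aligned_data:
    for src, tgt in pair:
        counts[src][tgt] += 1
rules = {src: tgt_counts.most_common(1)[0][0] for src, tgt_counts in counts.items()}

# Step 3: decode a new input by greedy longest-match over the learned rules.
def decode(source: str) -> str:
    output, i = [], 0
    while i < len(source):
        for length in range(len(source) - i, 0, -1):  # longest chunk first
            chunk = source[i:i + length]
            if chunk in rules:
                output.append(rules[chunk])
                i += length
                break
        else:
            i += 1  # skip characters no rule covers
    return "".join(output)

print(decode("OKUMURA"))  # -> 奥村
print(decode("MURANO"))   # -> 村野
```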

Page 5: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

• Given training data of transliteration pairs

Training Data:
  OKUNO   奥野
  NOMURA  野村
  MURAI   村井

5

Page 6: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

1. Align training data utilizing co-occurrence

Training Data:
  OKUNO   奥野
  NOMURA  野村
  MURAI   村井

Aligned Data (after 1. Align):
  OKU:NO   奥:野
  NO:MURA  野:村
  MURA:I   村:井

6

Page 7: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

2. Train statistical model from aligned data

Training Data → 1. Align → Aligned Data:
  OKU:NO   奥:野
  NO:MURA  野:村
  MURA:I   村:井

Aligned Data → 2. Train → Learned Model (Rules):
  OKU  → 奥
  NO   → 野
  MURA → 村
  I    → 井

7

Page 8: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Example of Statistical Transliteration

3. Decode new input and return output

Training Data → 1. Align → Aligned Data → 2. Train → Learned Model (Rules):
  OKU  → 奥
  NO   → 野
  MURA → 村
  I    → 井

Test Input → 3. Decode → Output:
  OKUMURA → 奥村
  OKUI    → 奥井
  MURANO  → 村野

8

Page 9: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

The Baseline System using m2m-aligner

Training Data → Align: m2m-aligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

[Jiampojamarn+ 2007, 2008]

9  

Page 10: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Our System: mpaligner with Heuristics

Training Data → Pre-processing → Align: mpaligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

Japanese-Specific Heuristics (Pre-processing):
  1. JnJk: De-romanization
  2. EnJa: Syllable-based Alignment

Improved Alignment Tool (mpaligner) [Kubo+ 2011]:
  1. Better accuracy than m2m-aligner
  2. No hand-tuning of parameters

10

Page 11: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
  – Comparing Aligners
  – Japanese-Specific Heuristics
• Experiments
• Conclusion

11  

Page 12: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

m2m-aligner: Many-to-Many Alignments

• Alignment tool based on the EM algorithm and MLE
• Advantages:
  1. Can align multiple characters
  2. Performs well on short alignments
• Disadvantages:
  1. Poor performance on long alignments due to overfitting
  2. Requires hand-tuning of length-limit parameters

[Jiampojamarn+ 2007]
http://code.google.com/p/m2m-aligner/

12

Page 13: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

mpaligner: Minimum Pattern Aligner

• Idea: penalize long alignments during the E-step
• Simple scaling as below
  – x: source string, y: target string
  – |x|: length of x, |y|: length of y
  – P(x, y): probability of the string pair (x, y)
• Good performance without hand-tuned parameters
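The scaling formula itself appears only as an image on the original slide and is not in this transcript. The LaTeX below is a hedged reconstruction based on the definitions above and on the stated goal of penalizing long pairs; the exact form is due to [Kubo+ 2011] and should be checked against that paper.

```latex
% Hedged reconstruction (assumption): scale each pair probability in the
% E-step by raising it to the power of the combined string length.
\[
  P(x, y) \;\leftarrow\; P(x, y)^{\,|x| + |y|}
\]
% Since 0 < P(x, y) <= 1, longer pairs (larger |x| + |y|) are pushed toward
% smaller values, which penalizes long alignments without any hand-tuned
% length-limit parameters.
```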

[Kubo+  2011]  

http://sourceforge.jp/projects/mpaligner/ 13  

Page 14: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Motivation: Invalid Alignment Problem

• Character-based alignment can be phonetically invalid
  – It may divide atomic units into meaningless pieces
  – We call the smallest unit of alignment a syllable
• Syllable-based alignment should be used for this task
  – Problem: no training data exists for syllable-based alignment
• In this study, we propose Japanese-specific heuristics for this problem, drawing on knowledge of Japanese

14  

Page 15: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Examples of Invalid and Valid Alignment

• In Japanese, consonants should be combined with vowels

• JnJk Task
  Type     Source    Target
  Valid    SUZU:KI   鈴:木
  Invalid  SUZ:UKI   鈴:木
  Valid    HIRO:MI   裕:実
  Invalid  HIR:OMI   裕:実
  Valid    OKU:NO    奥:野
  Invalid  OK:UNO    奥:野

• EnJa Task
  Type     Source         Target
  Valid    Ar:thur        アー:サー
  Invalid  A:r:th:ur      ア:ー:サ:ー
  Valid    Cha:p:li:n     チャッ:プ:リ:ン
  Invalid  C:h:a:p:li:n   チ:ャ:ッ:プ:リ:ン
  Valid    Ju:s:mi:ne     ジャ:ス:ミ:ン
  Invalid  J:u:s:mi:ne    ジ:ャ:ス:ミ:ン

15  

Page 16: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Language-Specific Heuristics as Preprocessing

• Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
  – Combine atomic strings into syllables
  – Treat a syllable as one character during alignment
• The definition of a syllable should be chosen carefully
  – It may cause bad side effects
  – Some context is incorporated as n-gram features

16  

Page 17: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

JnJk Task: De-romanization Heuristic

• De-romanization: convert Roman characters into Kana
• A consonant and a vowel are coupled into one Kana character
• A common romanization table is used (Hepburn)

  Roman  A   I   U   E   O
  Kana   あ  い  う  え  お
  Roman  KA  KI  KU  KE  KO
  Kana   か  き  く  け  こ

http://www.social-ime.com/conv-table.html

17
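As a concrete illustration of this heuristic, here is a minimal Python sketch (an assumption about the implementation, not the author's actual preprocessing code). It greedily applies a Hepburn-style conversion table, longest match first, so each consonant is coupled with its following vowel into one Kana unit.

```python
# A minimal sketch of the JnJk de-romanization heuristic (illustration only).
# Only a tiny excerpt of a Hepburn-style table is shown; a real system would
# load the full table, e.g. the one linked above.
ROMAN_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
    "NO": "の",  # extra entry from the full Hepburn table, needed below
}

def deromanize(romaji: str) -> list[str]:
    """Greedy longest-match conversion so each consonant is coupled with
    its following vowel into a single Kana unit."""
    kana, i = [], 0
    while i < len(romaji):
        for length in (3, 2, 1):  # longest romanization unit first (e.g. KYA)
            chunk = romaji[i:i + length]
            if chunk in ROMAN_TO_KANA:
                kana.append(ROMAN_TO_KANA[chunk])
                i += length
                break
        else:
            kana.append(romaji[i])  # keep characters the table does not cover
            i += 1
    return kana

# "OKUNO" -> ['お', 'く', 'の']: alignment now operates on Kana syllables
# instead of the raw Roman letters O, K, U, N, O.
print(deromanize("OKUNO"))
```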

Page 18: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

EnJa Task: Syllable-based Alignment

• In the EnJa task, the target side should be aligned in units of syllables, not characters
• Combine sub-characters with the preceding character
• There are 3 types of sub-characters:
  1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
  2. Silent character (Soku-on): e.g. ッ
  3. Hyphen (Cho-on; long vowel): e.g. ー

18  
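A minimal Python sketch of this grouping (an illustration, not the author's actual preprocessing code): sub-characters are appended to the preceding Katakana character so that each syllable becomes a single alignment unit.

```python
# A minimal sketch of the EnJa syllable-grouping heuristic (illustration only):
# Yo-on, Soku-on and Cho-on marks are merged into the preceding Katakana
# character so that each syllable is treated as one alignment unit.
SUB_CHARS = set("ャュョッー")  # the slide's examples; a full list would also
                               # include the other small Kana (ァ ィ ゥ ェ ォ, ...)

def group_syllables(katakana: str) -> list[str]:
    """Attach sub-characters to the preceding character."""
    units: list[str] = []
    for ch in katakana:
        if ch in SUB_CHARS and units:
            units[-1] += ch  # merge into the previous unit
        else:
            units.append(ch)
    return units

# "チャップリン" (Chaplin) -> ['チャッ', 'プ', 'リ', 'ン'], matching the
# valid alignment units shown on the earlier slide.
print(group_syllables("チャップリン"))
```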

Page 19: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
  – Official Scores for 8 Language Pairs
  – Further Investigation for JnJk and EnJa
• Conclusion

19  

Page 20: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Experimental Settings

• Conducted 2 types of experiments
  – Official evaluation on the test set for 8 language pairs
  – Comparison of the proposed and baseline systems on the development set for the JnJk and EnJa tasks
• Basically followed the default settings of the tools
  – m2m-aligner: length limits were selected carefully
  – Iteration number: optimized on the development set
  – Features: N-gram (N=2) and context (size=7) features

20

Page 21: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Official Scores for 8 Language Pairs

• Applied the heuristics to the JnJk and EnJa tasks
• Performed well (top rank on EnPe and EnHe)

  Task  ACC    F-Score  MRR    MAP    Rank
  JnJk  0.512  0.693    0.582  0.401  2
  EnJa  0.362  0.803    0.469  0.359  2
  EnCh  0.301  0.655    0.376  0.292  5
  ChEn  0.013  0.259    0.017  0.013  4
  EnKo  0.334  0.688    0.411  0.334  3
  EnBa  0.404  0.882    0.515  0.403  2
  EnPe  0.658  0.941    0.761  0.640  1
  EnHe  0.191  0.808    0.254  0.190  1

21

Page 22: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Results in JnJk and EnJa Tasks

• The proposed system outperforms the baselines

Result in JnJk Task:
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.113  0.389    0.182  0.114
  mpaligner    0.121  0.391    0.197  0.122
  Proposed     0.199  0.494    0.300  0.200

Result in EnJa Task:
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.280  0.737    0.359  0.280
  mpaligner    0.326  0.761    0.431  0.326
  Proposed     0.358  0.774    0.469  0.358

22  

Page 23: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Output Examples (10-best list)

JnJk Task
  Rank  Harui   Kyotaro
  1     春井    京太郎
  2     晴井    恭太郎
  3     治井    匡太郎
  4     榛井    強太郎
  5     敏井    共太郎
  6     明井    享太郎
  7     陽井    亨太郎
  8     遙井    杏太郎
  9     遥井    鋸太郎
  10    温井    教太郎

EnJa Task
  Rank  Bloy      Grothendieck
  1     ブロイ     グローテンディック
  2     ブロア     グロートンディック
  3     ブローイ    グローテンディーク
  4     ブロワ     グローテンディック
  5     ブロッイ    グローゾンディック
  6     ブロヤ     グローテンジーク
  7     ブロヨ     グローザーンディック
  8     ブウォイ    グローザンディック
  9     ブロティ    グローシンディック
  10    ブロレィ    グローゼンディック

23  

Page 24: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Error Analysis

• Sparseness problem:
  – Side effect of syllable-based alignment in the EnJa task
  – Too many target-side characters in the JnJk task
• Word origin [Hagiwara+ 2011]:
  – English names come from various languages
  – First and family names could be modeled differently
  – Gender: first names are quite different
• Training data inconsistency or ambiguity
  – e.g. JAPAN → 日本国 (not a transliteration)

24

Page 25: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Outline

• Introduction
• System
• Experiments
• Conclusion
  – Future Work

25  

Page 26: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Conclusion

• Applied mpaligner to the machine transliteration task for the first time
  – It performed better than m2m-aligner
  – The maximum likelihood estimation approach is not suitable for this task
• Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
  – De-romanization for the JnJk task
  – Syllable-based alignment for the EnJa task

26  

Page 27: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Future Work

• Combine these heuristics with other language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
• Develop language-dependent heuristics for languages other than Japanese
• Can we find such heuristics automatically?

27

Page 28: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Reference (1)

• Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
• Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
• Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
• Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
• Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.

28

Page 29: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Reference (2)

• Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
• Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
• Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.

29

Page 30: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

WTIM: Workshop on Text Input Methods

• 1st workshop held with IJCNLP 2011 (Thailand)
  – 12 people from Google, Microsoft, and Yahoo presented
  – https://sites.google.com/site/wtim2011/
• 2nd workshop planned with COLING 2012 (India)
  – Venue: Mumbai, India, December 2012
  – Are you interested as a presenter or an attendee?

Page 31: Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Any Questions?