Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics


ACL 2012 / NEWS 2012


Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics

Yoh Okuno

Outline

• Introduction
• System
• Experiments
• Conclusion

Outline

• Introduction
  – Statistical Machine Transliteration
  – Baseline and Our Systems
• System
• Experiments
• Conclusion

Machine Transliteration as Monotonic SMT

• The most common approach to machine transliteration follows the manner of SMT (Statistical Machine Translation) [Finch+ 2008]
• It consists of 3 steps:
  1. Align the training data monotonically (character-based)
  2. Train a discriminative model on the aligned data
  3. Decode an input string into an n-best list

Example of Statistical Transliteration

• Given training data of transliteration pairs:

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井

Example of Statistical Transliteration

1. Align the training data using co-occurrence statistics

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井

Example of Statistical Transliteration

2. Train a statistical model from the aligned data

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井
    ↓ 2. Train
Learned Model (Rules):  OKU → 奥   NO → 野   MURA → 村   I → 井

Example of Statistical Transliteration

3. Decode new input and return output

Training Data:  OKUNO 奥野   NOMURA 野村   MURAI 村井
    ↓ 1. Align
Aligned Data:   OKU:NO 奥:野   NO:MURA 野:村   MURA:I 村:井
    ↓ 2. Train
Learned Model (Rules):  OKU → 奥   NO → 野   MURA → 村   I → 井
    ↓ 3. Decode
Test Input:  OKUMURA   OKUI   MURANO
Output:      奥村   奥井   村野
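As an illustration, the decode step can be sketched with the toy rules from this example. This is a minimal sketch, not the actual DirecTL+ decoder (which scores candidates with a discriminative model and returns an n-best list); it simply applies the learned rules greedily, longest match first:

```python
# Toy rules learned from the example slides above.
RULES = {"OKU": "奥", "NO": "野", "MURA": "村", "I": "井"}

def decode(source: str) -> str:
    """Greedily segment the source string using the longest matching rule."""
    result = []
    i = 0
    while i < len(source):
        for length in range(len(source) - i, 0, -1):  # longest match first
            piece = source[i:i + length]
            if piece in RULES:
                result.append(RULES[piece])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {source[i:]!r}")
    return "".join(result)

print(decode("OKUMURA"))  # 奥村
print(decode("OKUI"))     # 奥井
print(decode("MURANO"))   # 村野
```

With the three rules above, the test inputs from the slide decode exactly to the outputs shown.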

The Baseline System using m2m-aligner [Jiampojamarn+ 2007, 2008]

• Pipeline: Training Data → Align (m2m-aligner) → Train (DirecTL+) → Decode (DirecTL+) → Output (N-best list)

Our System: mpaligner with Heuristics

• Pipeline: Training Data → Pre-processing → Align (mpaligner) → Train (DirecTL+) → Decode (DirecTL+) → Output (N-best list)
• Pre-processing: Japanese-specific heuristics
  1. JnJk: de-romanization
  2. EnJa: syllable-based alignment
• Align: mpaligner, an improved alignment tool [Kubo+ 2011]
  1. Better accuracy than m2m-aligner
  2. No hand-tuned parameters

Outline

• Introduction
• System
  – Comparing Aligners
  – Japanese-Specific Heuristics
• Experiments
• Conclusion

m2m-aligner: Many-to-Many Alignments [Jiampojamarn+ 2007]

• Alignment tool based on the EM algorithm and MLE
• Advantages:
  1. Can align multiple characters
  2. Performs well on short alignments
• Disadvantages:
  1. Poor performance on long alignments due to overfitting
  2. Requires hand-tuning of length-limit parameters

http://code.google.com/p/m2m-aligner/

mpaligner: Minimum Pattern Aligner [Kubo+ 2011]

• Idea: penalize long alignments during the E-step
• Simple scaling of each pair probability (following [Kubo+ 2011], the probability is raised to the power of the pair's combined length):

  P'(x, y) = P(x, y)^(|x| + |y|)

• x: source string, y: target string
• |x|: length of x, |y|: length of y
• P(x, y): probability of the string pair (x, y)
• Since P(x, y) ≤ 1, longer pairs receive a heavier penalty
• Good performance without hand-tuned parameters

http://sourceforge.jp/projects/mpaligner/
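The length-penalty scaling can be sketched in a few lines. This is an illustrative sketch assuming the P(x, y)^(|x| + |y|) form from [Kubo+ 2011], not the real tool's implementation:

```python
# Sketch of mpaligner's length-penalty scaling, assuming the
# P(x, y)^(|x| + |y|) form; illustrative only, not mpaligner's code.
def scaled_prob(x: str, y: str, p: float) -> float:
    """Scale the pair probability p by the combined length of x and y."""
    return p ** (len(x) + len(y))

# Starting from the same raw probability, a long pair is penalized
# far more heavily than a short one.
short = scaled_prob("NO", "野", 0.5)     # 0.5 ** 3 = 0.125
long_ = scaled_prob("MURA", "村井", 0.5)  # 0.5 ** 6 = 0.015625
assert short > long_
```

This is why mpaligner avoids the overfitting to long alignments that hurts m2m-aligner, without any hand-tuned length limits.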

Motivation: Invalid Alignment Problem

• Character-based alignment can be phonetically invalid
  – It may divide atomic units into meaningless pieces
  – We call the smallest unit of alignment a syllable
• Syllable-based alignment should be used for this task
  – Problem: no training data for syllable-based alignment
• In this study, we propose Japanese-specific heuristics that address this problem using knowledge of Japanese

Examples of Invalid and Valid Alignment

• In Japanese, consonants should be combined with vowels

• JnJk Task
  Type     Source       Target
  Valid    SUZU:KI      鈴:木
  Invalid  SUZ:UKI      鈴:木
  Valid    HIRO:MI      裕:実
  Invalid  HIR:OMI      裕:実
  Valid    OKU:NO       奥:野
  Invalid  OK:UNO       奥:野

• EnJa Task
  Type     Source         Target
  Valid    Ar:thur        アー:サー
  Invalid  A:r:th:ur      ア:ー:サ:ー
  Valid    Cha:p:li:n     チャッ:プ:リ:ン
  Invalid  C:h:a:p:li:n   チ:ャ:ッ:プ:リ:ン
  Valid    Ju:s:mi:ne     ジャ:ス:ミ:ン
  Invalid  J:u:s:mi:ne    ジ:ャ:ス:ミ:ン

Language-Specific Heuristics as Preprocessing

• Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
  – Combine atomic strings into syllables
  – Treat a syllable as one character during alignment
• The definition of a syllable should be chosen carefully
  – It may cause bad side effects
  – Some context is already captured by n-gram features

JnJk Task: De-romanization Heuristic

• De-romanization: convert Roman characters to Kana
• A consonant and a vowel are coupled into one Kana
• A common romanization table (Hepburn) is used

  Roman  A   I   U   E   O
  Kana   あ  い  う  え  お
  Roman  KA  KI  KU  KE  KO
  Kana   か  き  く  け  こ

http://www.social-ime.com/conv-table.html
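The de-romanization heuristic can be sketched with the table fragment shown above. This is an illustrative sketch using only the slide's ten entries; the real heuristic uses a full Hepburn conversion table:

```python
# De-romanization sketch using the small table fragment from the slide;
# a complete Hepburn table would cover all consonant-vowel pairs.
ROMAN_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
}

def deromanize(text: str) -> str:
    """Greedily convert Roman characters to Kana, longest match first."""
    out, i = [], 0
    while i < len(text):
        for length in (2, 1):  # table entries here are 1 or 2 letters long
            piece = text[i:i + length]
            if piece in ROMAN_TO_KANA:
                out.append(ROMAN_TO_KANA[piece])
                i += length
                break
        else:
            out.append(text[i])  # leave uncovered characters untouched
            i += 1
    return "".join(out)

print(deromanize("KAI"))  # かい
```

After this step, each Kana on the source side already bundles a consonant with its vowel, so the aligner can no longer split pairs like KA into phonetically invalid pieces.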

EnJa Task: Syllable-based Alignment

• In the EnJa task, the target side should be aligned in units of syllables, not characters
• Combine sub-characters with the preceding character
• There are 3 types of sub-characters:
  1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
  2. Silent character (Soku-on): e.g. ッ
  3. Hyphen (Cho-on; long vowel): e.g. ー
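The syllable-combining heuristic amounts to one pass over the target string, merging each sub-character into the preceding unit. A minimal sketch, assuming the three sub-character types listed above (the small-vowel set beyond ャュョ is an assumption on my part):

```python
# Sketch of the EnJa syllable-combining heuristic: attach Yo-on, Soku-on
# and Cho-on sub-characters to the preceding Katakana character so each
# resulting unit is treated as one "character" during alignment.
# The small vowels ァィゥェォ are assumed to count as Yo-on-style
# lower-case characters; the slide only shows ャ, ュ, ョ explicitly.
SUB_CHARS = set("ャュョァィゥェォッー")

def to_syllables(katakana: str) -> list[str]:
    """Group a Katakana string into syllable units."""
    units: list[str] = []
    for ch in katakana:
        if ch in SUB_CHARS and units:
            units[-1] += ch  # merge into the previous unit
        else:
            units.append(ch)
    return units

print(to_syllables("チャップリン"))  # ['チャッ', 'プ', 'リ', 'ン']
print(to_syllables("アーサー"))      # ['アー', 'サー']
```

The two examples reproduce the valid alignments チャッ:プ:リ:ン and アー:サー from the earlier slide.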

Outline

• Introduction
• System
• Experiments
  – Official Scores for 8 Language Pairs
  – Further Investigation of JnJk and EnJa
• Conclusion

Experimental Settings

• Conducted 2 types of experiments:
  – Official evaluation on the test sets for 8 language pairs
  – Comparison of the proposed and baseline systems for the JnJk and EnJa tasks on the development sets
• Basically followed the default settings of the tools
  – m2m-aligner: length limits selected carefully
  – Iteration number: optimized on the development set
  – Features: n-gram (N=2) and context (size=7) features
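The exact feature templates of DirecTL+ are not spelled out on the slide, so the following is a hypothetical sketch of what n-gram (N=2) and context (size=7) features around a source position could look like; the feature names and the `extract_features` helper are illustrative, not the tool's actual templates:

```python
# Hypothetical sketch of per-position features: a size-7 context window
# (3 characters to each side of the current one) plus character bigrams
# inside that window. Not DirecTL+'s actual feature extraction.
def extract_features(source: str, pos: int, window: int = 3, n: int = 2):
    """Features for the character at `pos`: windowed context + n-grams."""
    padded = "#" * window + source + "#" * window  # '#' pads the edges
    center = pos + window
    feats = []
    # Context features: each character within the +/-3 window (7 total).
    for offset in range(-window, window + 1):
        feats.append(f"ctx[{offset}]={padded[center + offset]}")
    # N-gram features: bigrams (N=2) inside the same window.
    for start in range(center - window, center + window):
        feats.append(f"ngram={padded[start:start + n]}")
    return feats

feats = extract_features("OKUNO", 0)
print(feats[:4])
```

The point of the sketch is only that some neighboring context is visible to the model even when alignment units are enlarged, which is why the heuristics note that "some context is already captured by n-gram features".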

Official Scores for 8 Language Pairs

• Applied the heuristics to the JnJk and EnJa tasks
• Performed well (top rank on EnPe and EnHe)

  Task  ACC    F-Score  MRR    MAP    Rank
  JnJk  0.512  0.693    0.582  0.401  2
  EnJa  0.362  0.803    0.469  0.359  2
  EnCh  0.301  0.655    0.376  0.292  5
  ChEn  0.013  0.259    0.017  0.013  4
  EnKo  0.334  0.688    0.411  0.334  3
  EnBa  0.404  0.882    0.515  0.403  2
  EnPe  0.658  0.941    0.761  0.640  1
  EnHe  0.191  0.808    0.254  0.190  1

Results in JnJk and EnJa Tasks

• The proposed system outperformed the baselines

Result in JnJk Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.113  0.389    0.182  0.114
  mpaligner    0.121  0.391    0.197  0.122
  Proposed     0.199  0.494    0.300  0.200

Result in EnJa Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.280  0.737    0.359  0.280
  mpaligner    0.326  0.761    0.431  0.326
  Proposed     0.358  0.774    0.469  0.358

Output Examples (10-best list)

JnJk Task (inputs: Harui, Kyotaro)
  1   春井  京太郎
  2   晴井  恭太郎
  3   治井  匡太郎
  4   榛井  強太郎
  5   敏井  共太郎
  6   明井  享太郎
  7   陽井  亨太郎
  8   遙井  杏太郎
  9   遥井  鋸太郎
  10  温井  教太郎

EnJa Task (inputs: Bloy, Grothendieck)
  1   ブロイ    グローテンディック
  2   ブロア    グロートンディック
  3   ブローイ  グローテンディーク
  4   ブロワ    グローテンディック
  5   ブロッイ  グローゾンディック
  6   ブロヤ    グローテンジーク
  7   ブロヨ    グローザーンディック
  8   ブウォイ  グローザンディック
  9   ブロティ  グローシンディック
  10  ブロレィ  グローゼンディック

Error Analysis

• Sparseness problem:
  – Side effect of syllable-based alignment in the EnJa task
  – Too many target-side characters in the JnJk task
• Word origin [Hagiwara+ 2011]:
  – English names come from various languages
  – First and family names can be modeled differently
  – Gender: first names are quite different
• Training data inconsistency or ambiguity:
  – e.g. JAPAN → 日本国 (not a transliteration)

Outline

• Introduction
• System
• Experiments
• Conclusion
  – Future Work

Conclusion

• Applied mpaligner to the machine transliteration task for the first time
  – Performed better than m2m-aligner
  – The maximum likelihood estimation approach is not suitable
• Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
  – De-romanization for the JnJk task
  – Syllable-based alignment for the EnJa task

Future Work

• Combine these heuristics with language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
• Develop language-dependent heuristics for languages other than Japanese
• Can we find such heuristics automatically?

Reference (1)

• Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
• Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
• Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
• Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
• Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.

Reference (2)

• Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
• Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
• Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.

WTIM: Workshop on Text Input Methods

• 1st workshop held with IJCNLP 2011 (Thailand)
  – 12 people presented, from Google, Microsoft, and Yahoo
  – https://sites.google.com/site/wtim2011/
• 2nd workshop planned with COLING 2012 (India)
  – Venue: December 2012 in Mumbai, India
  – Are you interested as a presenter or an attendee?

Any Questions?
