Saab Mansour and Hermann Ney
Human Language Technology and Pattern Recognition
Computer Science Department
RWTH Aachen University, Aachen, Germany
NAACL-HLT 2013
Introduction
Domain adaptation uses data from a particular domain to improve the performance of the translation model (TM) on the test domain.
TM adaptation: build a general-domain phrase table, then use in-domain data to modify the phrase probabilities.
Introduction
The corpora used are the Arabic-to-English and German-to-English TED (Technology, Entertainment, Design) tasks of IWSLT (International Workshop on Spoken Language Translation).
Phrase Training
Forced alignment (FA) is used to perform phrase segmentation, alignment training, and probability estimation.
Phrase training with SMT: for a training set y, generate a heuristic-based phrase table P_y^0. After FA training (each sentence is segmented and aligned), re-estimate the phrase probabilities from the FA output, producing a new phrase table P'.
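The re-estimation step can be illustrated with a minimal sketch. The slides do not give the estimator, so plain relative-frequency estimation over the phrase pairs used in the forced alignments is assumed here; the function name, the input format, and the toy data are all hypothetical, and leaving-one-out is not modeled.

```python
from collections import Counter, defaultdict


def reestimate_phrase_table(fa_phrase_pairs):
    """Re-estimate p(target | source) by relative frequency (assumed estimator).

    fa_phrase_pairs: iterable of (source_phrase, target_phrase) tuples, one
    per phrase pair used in a forced alignment (hypothetical input format).
    """
    pair_counts = Counter(fa_phrase_pairs)

    # Total count of each source phrase, used as the normalizer.
    source_totals = Counter()
    for (src, _tgt), count in pair_counts.items():
        source_totals[src] += count

    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_totals[src]
    return dict(table)


# Toy FA output, invented for illustration only.
pairs = [
    ("das haus", "the house"),
    ("das haus", "the house"),
    ("das haus", "the home"),
    ("ist", "is"),
]
table = reestimate_phrase_table(pairs)
```

With this toy input, "das haus" keeps two thirds of its probability mass on "the house" because that pair was chosen in two of its three alignments.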
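Adaptation
For a training set y', generate an initial phrase table P_y'^0, then run FA training on y_in (the in-domain training data) to bias the probabilities toward the in-domain data. The procedure is denoted X-FA-IN, where X names the data used to build the initial table.
Leaving-one-out is used to avoid over-fitting.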
Experimental Setup
Training corpora:
Arabic-to-English:
In-domain: 90K TED sentences
Other-domain: 7.9M sentences of United Nations data
German-to-English:
In-domain: 130K TED sentences
Other-domain: 2.1M sentences from the news-commentary and europarl corpora
Experimental Setup – Translation System
Baseline system: built using SMT toolkit Jane 2.0
Measures: BLEU, TER.
Arabic-English results are case sensitive.
German-English results are case insensitive.
Results
Heuristics: IN, OD, ALL. Standard phrase extraction using word-alignment training and heuristic phrase extraction over the word alignment.
FA standard: IN-FA, OD-FA, ALL-FA. Standard FA phrase training, where the same training set is used for initial phrase table generation as well as for the FA procedure.
FA adaptation: OD-FA0-IN, ALL-FA-IN. FA-based adaptation phrase training, where the initial table is generated from some general data and the FA training is performed on the IN data to achieve adaptation.
Results - measures
BLEU (Bilingual Evaluation Understudy)
Candidate: the the the the the the the
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Standard unigram precision: 7/7
Modified unigram precision: 2/7
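The two precision values in this example can be reproduced with a short sketch. It covers only the unigram case shown on the slide, not full BLEU (no higher-order n-grams, no brevity penalty); the function name is hypothetical, and tokenization is assumed to be lowercasing plus whitespace splitting with the final periods left out.

```python
from collections import Counter


def unigram_precisions(candidate, references):
    """Return (standard, modified) unigram precision as (matches, total) pairs."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    cand_counts = Counter(cand)

    # Standard precision: every candidate word found in any reference counts.
    ref_vocab = {w for r in refs for w in r}
    standard = sum(c for w, c in cand_counts.items() if w in ref_vocab)

    # Modified precision: clip each word's count by its maximum count
    # in any single reference.
    max_ref = Counter()
    for r in refs:
        for w, c in Counter(r).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())

    return (standard, len(cand)), (clipped, len(cand))


standard, modified = unigram_precisions(
    "the the the the the the the",
    ["The cat is on the mat", "There is a cat on the mat"],
)
```

Clipping is what brings the score down to 2/7: "the" occurs at most twice in any single reference, so only two of the seven candidate occurrences are credited.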
Results - measures
TER: Translation Edit Rate
REF: SAUDI ARABIA denied THIS WEEK information published in the AMERICAN new york times
HYP: THIS WEEK THE SAUDIS denied information published in the new york times
TER = 4/13: 4 edits (1 shift, 2 substitutions, and 1 insertion) over 13 reference words
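The edit count in this example can be checked with a simplified sketch. Real TER searches for the best block shifts itself; here the single shift of "THIS WEEK" is applied by hand (an assumption made to keep the code short), and the remaining edits are counted with a word-level Levenshtein distance. The helper names are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def apply_shift(words, start, length, new_pos):
    """Move words[start:start+length] to index new_pos of the remaining words."""
    block = words[start:start + length]
    rest = words[:start] + words[start + length:]
    return rest[:new_pos] + block + rest[new_pos:]


ref = ("SAUDI ARABIA denied THIS WEEK information published "
       "in the AMERICAN new york times").split()
hyp = ("THIS WEEK THE SAUDIS denied information published "
       "in the new york times").split()

# One shift (counted as one edit): move "THIS WEEK" to follow "denied".
shifted = apply_shift(hyp, start=0, length=2, new_pos=3)

num_edits = 1 + word_edit_distance(shifted, ref)
ter = num_edits / len(ref)
```

After the shift, the hypothesis differs from the reference by two substitutions ("THE SAUDIS" for "SAUDI ARABIA") and one missing word ("AMERICAN"), which gives the 4 edits over 13 reference words stated above.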
Mixture Modeling
Linear interpolation of two phrase tables: IN with OD, and IN with OD-FA0-IN. The interpolation weight is uniform (0.5).
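A minimal sketch of such an interpolation follows, with phrase tables represented as nested dictionaries. The function name, the table layout, and the toy probabilities are hypothetical, and a phrase pair missing from one table is assumed to have probability 0 there.

```python
def interpolate_phrase_tables(table_a, table_b, weight=0.5):
    """Mix two tables: p(t|s) = weight * p_a(t|s) + (1 - weight) * p_b(t|s).

    Pairs missing from a table are treated as probability 0 (assumption).
    """
    mixed = {}
    for src in set(table_a) | set(table_b):
        targets_a = table_a.get(src, {})
        targets_b = table_b.get(src, {})
        mixed[src] = {
            tgt: weight * targets_a.get(tgt, 0.0)
            + (1 - weight) * targets_b.get(tgt, 0.0)
            for tgt in set(targets_a) | set(targets_b)
        }
    return mixed


# Toy tables, invented for illustration only.
in_table = {"haus": {"house": 0.8, "home": 0.2}}
od_table = {"haus": {"house": 0.5, "building": 0.5}}
mixed = interpolate_phrase_tables(in_table, od_table)
```

With the uniform weight, a pair seen in both tables gets the average of its two probabilities, while a pair seen in only one table keeps half of its original probability.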
Conclusion
Proposes a phrase training procedure for adaptation using the FA method.
Performance improves on both the Arabic-to-English and German-to-English TED lecture translation tasks: BLEU increases by 0.6% on the development set, and TER decreases by 0.8% and 0.6% on the test and eval sets, respectively.
Finally, in a mixture-model comparison, the adapted OD table performs better than the unadapted OD table.