saab mansour and hermann ney human language technology and pattern recognition computer science...

Post on 31-Dec-2015


Saab Mansour and Hermann Ney
Human Language Technology and Pattern Recognition
Computer Science Department
RWTH Aachen University, Aachen, Germany

NAACL-HLT 2013

Introduction

Domain adaptation uses data from a particular domain to improve the performance of the translation model (TM) on the test domain.

TM adaptation: build a general-domain phrase table, then use in-domain data to adjust the phrase probabilities.

Introduction

The corpora used are the Arabic-to-English and German-to-English tracks of the IWSLT (International Workshop on Spoken Language Translation) TED (Technology, Entertainment, Design) tasks.

Phrase Training

Forced alignment (FA) is used to perform phrase segmentation, alignment training, and probability estimation.

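Phrase training is done with the SMT system itself. For a training set y, a heuristic-based phrase table P_y^0 is generated. FA training is then run (each sentence is segmented and aligned), and the phrase probabilities are re-estimated from its output, producing a new phrase table p'.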
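The re-estimation step can be illustrated as a relative-frequency count over the phrase pairs that forced alignment selects. This is a minimal sketch, assuming the FA output is already available as a list of (source phrase, target phrase) pairs per sentence. The function name `reestimate_phrase_table`, the input format, and the toy data are hypothetical, and the forced-alignment decoder itself is not shown.

```python
from collections import Counter, defaultdict


def reestimate_phrase_table(fa_output):
    """Re-estimate p(target | source) by relative frequency.

    fa_output: list of sentences, each a list of
    (source_phrase, target_phrase) pairs chosen by forced alignment.
    This input format is an assumption made for illustration.
    """
    pair_counts = Counter()
    source_counts = Counter()
    for sentence in fa_output:
        for src, tgt in sentence:
            pair_counts[(src, tgt)] += 1
            source_counts[src] += 1

    # Normalize each pair count by the count of its source phrase.
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_counts[src]
    return dict(table)


# Toy FA output (hypothetical data).
fa_output = [
    [("das haus", "the house"), ("ist klein", "is small")],
    [("das haus", "the building")],
    [("das haus", "the house")],
]
new_table = reestimate_phrase_table(fa_output)
print(new_table["das haus"])
```

In this toy example "das haus" is aligned to "the house" in two of its three occurrences, so that pair receives two thirds of the probability mass.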

Adaption

For a training set y', an initial phrase table P_y'^0 is generated. FA training is then performed on y_in (the in-domain training data), which biases the probabilities toward the in-domain data. This procedure is denoted X-FA-IN.

Leaving-one-out is used to avoid over-fitting.
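One way to picture leaving-one-out: when a sentence is force-aligned, the counts that this same sentence contributed to the phrase table are subtracted first, so the sentence cannot simply be explained by phrases memorized from itself. The sketch below assumes count-based tables and a small fixed fallback probability (`floor`) for pairs seen only in the current sentence. The function name, the argument layout, and the fallback value are all illustrative assumptions, not the exact formulation used in the talk.

```python
from collections import Counter


def leave_one_out_prob(src, tgt, pair_counts, src_counts,
                       sent_pair_counts, sent_src_counts, floor=1e-6):
    """p(tgt | src) with the current sentence's own counts removed.

    floor is an assumed fallback for pairs that occur only in the
    sentence being aligned.
    """
    num = pair_counts[(src, tgt)] - sent_pair_counts[(src, tgt)]
    den = src_counts[src] - sent_src_counts[src]
    if num <= 0 or den <= 0:
        return floor
    return num / den


# Corpus-level counts (hypothetical data).
pair_counts = Counter({("haus", "house"): 3, ("haus", "home"): 1})
src_counts = Counter({"haus": 4})

# Counts contributed by the sentence currently being aligned.
sent_pair_counts = Counter({("haus", "home"): 1})
sent_src_counts = Counter({"haus": 1})

p_house = leave_one_out_prob("haus", "house", pair_counts, src_counts,
                             sent_pair_counts, sent_src_counts)
p_home = leave_one_out_prob("haus", "home", pair_counts, src_counts,
                            sent_pair_counts, sent_src_counts)
print(p_house, p_home)
```

Without leaving-one-out, the pair ("haus", "home") would get probability 1/4 here. With it, the pair falls back to the floor value because its only evidence comes from the sentence being aligned.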

Experimental Setup

Training corpora:

Arabic-to-English:
In-domain: 90K TED sentences
Other-domain: 7.9M sentences of United Nations data

German-to-English:
In-domain: 130K TED sentences
Other-domain: 2.1M sentences from the news-commentary and europarl corpora

Experimental Setup – Translation System

Baseline system: built using SMT toolkit Jane 2.0

Measures: BLEU, TER. Arabic-English results are case sensitive; German-English results are case insensitive.

Results

Heuristics: IN, OD, ALL
Standard phrase extraction using word-alignment training and heuristic phrase extraction over the word alignment.

FA standard: IN-FA, OD-FA, ALL-FA
Standard FA phrase training, where the same training set is used for initial phrase table generation as well as the FA procedure.

FA adaptation: OD-FA0-IN, ALL-FA-IN
FA-based adaptation phrase training, where the initial table is generated from some general data and the FA training is performed on the IN data to achieve adaptation.

Results - measures

BLEU (Bilingual Evaluation Understudy)

Candidate: the the the the the the the
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.

Standard unigram precision: 7/7
Modified unigram precision: 2/7
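The two precision values can be reproduced with a short sketch. It assumes a simplified tokenization (lowercasing and stripping periods); the helper names `tokenize` and `unigram_counts` are hypothetical. Modified precision clips each candidate word count at the maximum number of times the word occurs in any single reference.

```python
from collections import Counter


def tokenize(text):
    # Simplified tokenization (an assumption): lowercase, drop periods.
    return text.lower().replace(".", "").split()


def unigram_counts(candidate, references):
    """Return (standard_matches, clipped_matches, candidate_length)."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = [Counter(tokenize(ref)) for ref in references]
    total = sum(cand_counts.values())

    # Standard precision: every candidate word found in any reference counts.
    standard = sum(count for word, count in cand_counts.items()
                   if any(word in rc for rc in ref_counts))

    # Modified precision: clip by the max count in any single reference.
    clipped = sum(min(count, max(rc[word] for rc in ref_counts))
                  for word, count in cand_counts.items())
    return standard, clipped, total


candidate = "the the the the the the the"
references = ["The cat is on the mat.", "There is a cat on the mat."]
standard, clipped, total = unigram_counts(candidate, references)
print(f"standard: {standard}/{total}, modified: {clipped}/{total}")
```

The word "the" occurs at most twice in a single reference (Reference 1), so only 2 of the 7 candidate tokens are credited under the modified count.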

Results - measures

TER (Translation Edit Rate)

REF: SAUDI ARABIA denied THIS WEEK information published in the AMERICAN new york times
HYP: THIS WEEK THE SAUDIS denied information published in the new york times

TER = 4/13 (4 edits: 1 shift, 2 substitutions, and 1 insertion)
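The arithmetic behind the score is the number of edits divided by the number of reference words. The sketch below only reproduces that division using the edit counts given on the slide. It does not search for the minimal edit sequence, which is the hard part of TER (shifts in particular), and the function name `ter_from_edits` is hypothetical.

```python
def ter_from_edits(shifts, substitutions, insertions, deletions, ref_length):
    """TER = total number of edits / number of reference words."""
    edits = shifts + substitutions + insertions + deletions
    return edits / ref_length


ref = ("SAUDI ARABIA denied THIS WEEK information published "
       "in the AMERICAN new york times")
ref_length = len(ref.split())  # 13 reference words

# Edit counts taken from the example: 1 shift, 2 substitutions, 1 insertion.
score = ter_from_edits(shifts=1, substitutions=2, insertions=1,
                       deletions=0, ref_length=ref_length)
print(f"TER = {score:.3f}")
```

Lower TER is better, since it counts the edits needed to turn the hypothesis into the reference.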

Mixture Modeling

Linear interpolation of IN and OD, and of IN and OD-FA0-IN, with uniform weights (0.5).
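Linear interpolation of two phrase tables can be sketched as a weighted sum of their probabilities, with a phrase pair missing from one table contributing zero from that side. The dictionary layout keyed by (source, target) pairs, the function name `interpolate`, and the toy probabilities are assumptions made for illustration.

```python
def interpolate(table_a, table_b, weight=0.5):
    """Mix two phrase tables: weight * a + (1 - weight) * b.

    Tables map (source_phrase, target_phrase) -> probability.
    A pair absent from a table is treated as having probability 0 there.
    """
    pairs = set(table_a) | set(table_b)
    return {pair: weight * table_a.get(pair, 0.0)
                  + (1.0 - weight) * table_b.get(pair, 0.0)
            for pair in pairs}


# Toy in-domain (IN) and other-domain (OD) tables (hypothetical data).
in_table = {("haus", "house"): 0.8, ("haus", "home"): 0.2}
od_table = {("haus", "house"): 0.4, ("haus", "building"): 0.6}

mixed = interpolate(in_table, od_table, weight=0.5)
for pair in sorted(mixed):
    print(pair, round(mixed[pair], 3))
```

With uniform weights, each source phrase's probabilities in the mixed table still sum to 1 as long as they did in both input tables.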

Conclusion

The paper proposes a phrase training procedure for adaptation based on the FA method.

Performance improved on both the Arabic-to-English and German-to-English TED lecture translation tasks: BLEU increased by 0.6% on the development set, and TER decreased by 0.8% and 0.6% on the test and eval sets, respectively.

Finally, a comparison using the mixture model shows that the adapted OD table performs better than the unadapted OD table.
