MSRA Chinese-English Translation System: Technical Report for the CWMT2008 Evaluation
张冬冬 李志灏 李沐 周明
Microsoft Research Asia
Outline
• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
– Chinese-English Machine Translation
– Chinese-English System Combination
• Conclusion
Evaluation Task Participation
Language & Domain | Task | Task ID
Chinese-English, news | Machine translation | ZH-EN-NEWS-TRANS
Chinese-English, news | System combination | ZH-EN-NEWS-COMBI
English-Chinese, news | Machine translation | EN-ZH-NEWS-TRANS
English-Chinese, sci-tech | Machine translation | EN-ZH-SCIE-TRANS
MSRA Submission
• Machine translation task
  – Primary submission
    • Unlimited training corpus
    • Combining: SysA + SysB + SysC + SysD
  – Contrast submission
    • Limited training corpus
    • Combining: SysA + SysB + SysC
• System combination task
  – Limited training corpus
  – Combining: 10 systems
SysA
• Phrase-based model
• CYK decoding algorithm
• BTG grammar
• Features:
  – Similar to (Koehn, 2004)
• Maximum entropy reordering model
  – (Zhang et al., 2007; Xiong et al., 2006)
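To make the SysA design concrete, here is a minimal sketch of BTG-style CYK decoding: adjacent target spans are combined in straight or inverted order, with the orientation scored by a reordering model. The phrase table and scores below are toy assumptions, and `reorder_score` is only a stand-in for the maximum entropy model used in the actual system.

```python
# Minimal BTG-style CYK decoding sketch (toy data, not the MSRA SysA code).

# Hypothetical phrase table: source span (i, j) -> (translation, log score).
PHRASES = {
    (0, 1): ("he", -0.1),
    (1, 2): ("quickly", -0.4),
    (2, 3): ("ran", -0.2),
    (1, 3): ("ran quickly", -0.5),
}

def reorder_score(orientation):
    """Stand-in for the MaxEnt reordering model: the real system computes
    a feature-based probability; here it is a fixed toy penalty."""
    return -0.1 if orientation == "straight" else -0.8

def btg_cyk(n):
    # chart[(i, j)] = best (translation, score) covering source span [i, j)
    chart = {}
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            best = PHRASES.get((i, j))
            for k in range(i + 1, j):
                if (i, k) in chart and (k, j) in chart:
                    (t1, s1), (t2, s2) = chart[(i, k)], chart[(k, j)]
                    # BTG allows exactly two ways to merge adjacent spans.
                    for orient, text in (("straight", t1 + " " + t2),
                                         ("inverted", t2 + " " + t1)):
                        score = s1 + s2 + reorder_score(orient)
                        if best is None or score > best[1]:
                            best = (text, score)
            if best is not None:
                chart[(i, j)] = best
    return chart[(0, n)]

print(btg_cyk(3))  # best hypothesis covering the whole 3-word toy source
```

The chart is filled bottom-up over span lengths, so every split point sees completed sub-hypotheses, exactly as in CYK parsing.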
SysB
• Syntactic pre-reordering model
  – (Li et al., 2007)
• Motivations
  – Isolating the reordering model from the decoder
  – Making use of syntactic information
SysC
• Hierarchical phrase-based model
– (David Chiang, 2005)
– Hiero re-implementation
• Weighted synchronous CFG
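A weighted synchronous CFG pairs a source pattern with a target pattern that share gap variables. The toy rules and weights below are invented for illustration; they are not from the actual Hiero re-implementation.

```python
# Toy illustration of weighted synchronous CFG rules in the Hiero style
# (Chiang, 2005). All rules and weights here are made up for the example.

# Each rule: (source pattern, target pattern, weight); X1/X2 mark gaps.
RULES = [
    ("X1 的 X2", "the X2 of X1", 0.6),
    ("中国", "China", 0.9),
    ("经济", "economy", 0.8),
]

def apply_rule(rule, subs):
    """Substitute sub-derivations into a rule's target side and multiply
    the weights (a log-linear model would sum log-weights instead)."""
    src, tgt, weight = rule
    for var, (text, sub_weight) in subs.items():
        tgt = tgt.replace(var, text)
        weight *= sub_weight
    return tgt, weight

# Derive "中国 的 经济" -> "the economy of China" with the hierarchical rule:
translation, score = apply_rule(RULES[0],
                                {"X1": ("China", 0.9), "X2": ("economy", 0.8)})
print(translation, round(score, 3))
```

Because the gaps on both sides are linked, the rule reorders the two sub-constituents while translating, which is what lets hierarchical phrases capture long-distance reordering.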
SysD
• String-to-dependency MT
  – (Shen et al., 2008)
  – Integrates a target dependency language model
• Motivations
  – Target dependency structures integrate linguistic knowledge
  – Operates directly on lexical items; simpler than CFG
  – Captures long-distance relations through local dependency structures
System Combination
• Analogous to BBN’s work (Rosti et al., 2007)
• The hypothesis score sums system-weighted word posteriors over the arcs of the confusion network:

S(E) = \sum_{l=1}^{N} \log \Big( \sum_{i=1}^{N_s} \lambda_i \, p_{il}(w_l) \Big), \qquad \lambda_i \in [0,1], \quad \sum_{i=1}^{N_s} \lambda_i = 1
• Adaptations in MSRA system
– Single confusion network
• Candidate skeletons come from top-1 translations of each system
• The best skeleton has the most similarity with others based on BLEU
– Word alignment between the skeleton and the other candidate translations is performed with GIZA++
– Parameters are tuned to maximize BLEU on the development data
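The skeleton-selection step above can be sketched as follows: among the systems' top-1 outputs, pick the one most similar to the rest. As a stand-in for the full BLEU used in the actual system, this sketch scores similarity with a simple average of unigram and bigram precision.

```python
# Skeleton selection sketch: choose the candidate closest to the others.
# The similarity here is a simplified BLEU proxy, not full BLEU.
from collections import Counter

def ngram_precision(hyp, ref, n):
    h = Counter(zip(*[hyp[i:] for i in range(n)]))
    r = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum(min(c, r[g]) for g, c in h.items())
    return overlap / max(sum(h.values()), 1)

def similarity(hyp, ref):
    # Average of 1-gram and 2-gram precision (toy stand-in for BLEU).
    return (ngram_precision(hyp, ref, 1) + ngram_precision(hyp, ref, 2)) / 2

def pick_skeleton(candidates):
    """Return the candidate with the highest average similarity to the
    other candidates, mirroring the BLEU-based skeleton choice."""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        score = sum(similarity(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best

# Hypothetical top-1 outputs from three systems:
outputs = [
    "the economy grew fast".split(),
    "the economy grew quickly".split(),
    "economy has grown quickly".split(),
]
print(" ".join(pick_skeleton(outputs)))
```

The middle hypothesis wins because it overlaps well with both neighbors; the other candidates then align against it to build the confusion network.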
System Combination (Cont.)
Outline
• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
– Chinese-English Machine Translation
– Chinese-English System Combination
• Conclusion
Training Data
Model | Unrestricted corpus (Primary MT Submission) | Restricted corpus (Contrast MT Submission)
Phrase translation model | LDC parallel data, 4.99M sentence pairs | Organizer-provided data, 734.8K sentence pairs
Language model | Gigaword + LDC parallel data (English side), 323M English words | English side of organizer-provided data, 9.21M English words
Reordering model | FBIS + others, 197K sentence pairs | CLDC-LAC-2003-004 (ICT)
Development set | 2005-863-001 (489 pairs) | 2005-863-001 (489 pairs)
Pre-/Post-processing
• Pre-processing
  – Tokenization of Chinese and English sentences before word alignment and language model training
  – Special tokens (date, time, number) recognized and normalized in the training data
  – Special tokens pre-translated with rules in the test data before decoding
• Post-processing
  – English case restoration after translation
  – OOVs removed from the final translation
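The rule-based pre-translation of special tokens can be sketched with a single date rule. The regular expression and output format below are simplified assumptions; the actual system covers dates, times, and numbers with a larger rule set.

```python
# Sketch of rule-based pre-translation of special tokens before decoding.
# Only one hypothetical date pattern is shown; the real rule set is larger.
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def translate_date(m):
    """Render a Chinese date match (年/月/日) as a tokenized English date."""
    year, month, day = m.group(1), int(m.group(2)), int(m.group(3))
    return "%s %d , %s" % (MONTHS[month - 1], day, year)

def pretranslate(sentence):
    # e.g. 2008年10月1日 -> October 1 , 2008
    return re.sub(r"(\d{4})年(\d{1,2})月(\d{1,2})日", translate_date, sentence)

print(pretranslate("会议 于 2008年10月1日 召开"))
```

Replacing such tokens before decoding keeps them out of the phrase table, where sparse dates and numbers would otherwise be frequent OOVs.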
Tools
• MSR-SEG
  – MSRA word segmentation tool, used to segment the Chinese side of the parallel data
• Berkeley parser
  – Parses training and test sentences for the syntactic pre-reordering system
• GIZA++
  – Bilingual word alignment
• MaxEnt Toolkit
  – Reordering model (Le Zhang, 2004)
• MSRA internal tools
  – Language modeling
  – Decoders
  – Case restoration for English words
  – System combination
Experiments for MT Task
System | SSMT2007 (BLEU4, case-insensitive) | CWMT2008 (BLEU4, case-sensitive)
Restricted training corpus:
SysA | 0.2366 | 0.2148
SysB | 0.2505 | 0.2303
SysC | 0.2436 | 0.2255
Contrast Submission | 0.2473 | 0.2306
Unrestricted training corpus:
SysA | 0.3157 | 0.2727
SysB | 0.3208 | 0.2782
SysC | 0.3196 | 0.2762
SysD | 0.3276 | 0.2787
Primary Submission | 0.3389 | 0.2809
Experiments for System Combination
Participating system | SSMT2007 BLEU4 (case-insensitive)
S1-1 | 0.2799
S1-2 | 0.2802
S3-1 | 0.2446
S3-2 | 0.2818
S4-1 | 0.2823
S7-1 | 0.1647
S8-1 | 0.2037
S10-1 | 0.2133
S10-2 | 0.2297
S10-3 | 0.2234
S11-1 | 0.1835
S12-1 | 0.3389
S12-2 | 0.2473
S14-1 | 0.2118
S14-2 | 0.2179
S14-3 | 0.2165
S15-1 | 0.2642
System combination: 0.3274 with the restricted LM; 0.3476 with the unrestricted LM
Conclusions
• Syntactic information improves SMT
– Syntactic pre-reordering model
– Target dependency model
• A limited LM hurts system combination
  – The combined output is worse than the unrestricted-LM result when only the limited LM is used
Thanks!
MSRA Systems
• SysA:
  – Phrase-based translation model over contiguous phrases
• SysB:
  – SysA + multiple pre-reordered source-language inputs
• SysC:
  – Hierarchical phrase-based translation model
• SysD:
  – String-to-target-dependency-tree translation model