MSRA Chinese-English Translation System: Technical Report for the CWMT2008 Evaluation
张冬冬 李志灏 李沐 周明
Microsoft Research Asia
Outline
• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
– Chinese-English Machine Translation
– Chinese-English System Combination
• Conclusion
Evaluation Task Participation
Language & Domain | Task | Task ID
Chinese-English, news | Machine translation | ZH-EN-NEWS-TRANS
Chinese-English, news | System combination | ZH-EN-NEWS-COMBI
English-Chinese, news | Machine translation | EN-ZH-NEWS-TRANS
English-Chinese, sci-tech | Machine translation | EN-ZH-SCIE-TRANS
MSRA Submission
• Machine translation task
  – Primary submission
    • Unlimited training corpus
    • Combining: SysA + SysB + SysC + SysD
  – Contrast submission
    • Limited training corpus
    • Combining: SysA + SysB + SysC
• System combination task
  – Limited training corpus
  – Combining: 10 systems
SysA
• Phrase-based model
• CYK decoding algorithm
• BTG grammar
• Features:
  – Similar to (Koehn, 2004)
• Maximum entropy reordering model
  – (Zhang et al., 2007; Xiong et al., 2006)
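To make the SysA design concrete, here is a minimal sketch of BTG-style CYK decoding: adjacent target spans are combined in straight or inverted order, with the orientation scored by a reordering model. The phrase table and scores below are toy assumptions, and `reorder_score` is only a stand-in for the maximum entropy model used in the actual system.

```python
# Minimal BTG-style CYK decoding sketch (toy data, not the MSRA SysA code).

# Hypothetical phrase table: source span (i, j) -> (translation, log score).
PHRASES = {
    (0, 1): ("he", -0.1),
    (1, 2): ("quickly", -0.4),
    (2, 3): ("ran", -0.2),
    (1, 3): ("ran quickly", -0.5),
}

def reorder_score(orientation):
    """Stand-in for the MaxEnt reordering model: the real system computes
    a feature-based probability; here it is a fixed toy penalty."""
    return -0.1 if orientation == "straight" else -0.8

def btg_cyk(n):
    # chart[(i, j)] = best (translation, score) covering source span [i, j)
    chart = {}
    for span in range(1, n + 1):
        for i in range(n - span + 1):
            j = i + span
            best = PHRASES.get((i, j))
            for k in range(i + 1, j):
                if (i, k) in chart and (k, j) in chart:
                    (t1, s1), (t2, s2) = chart[(i, k)], chart[(k, j)]
                    # BTG allows exactly two ways to merge adjacent spans.
                    for orient, text in (("straight", t1 + " " + t2),
                                         ("inverted", t2 + " " + t1)):
                        score = s1 + s2 + reorder_score(orient)
                        if best is None or score > best[1]:
                            best = (text, score)
            if best is not None:
                chart[(i, j)] = best
    return chart[(0, n)]

print(btg_cyk(3))  # best hypothesis covering the whole 3-word toy source
```

The chart is filled bottom-up over span lengths, so every split point sees completed sub-hypotheses, exactly as in CYK parsing.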
SysB
• Syntactic pre-reordering model
  – (Li et al., 2007)
• Motivations
  – Isolating the reordering model from the decoder
  – Making use of syntactic information
SysC
• Hierarchical phrase-based model
– (David Chiang, 2005)
– Hiero re-implementation
• Weighted synchronous CFG
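A weighted synchronous CFG pairs a source pattern with a target pattern that share gap variables. The toy rules and weights below are invented for illustration; they are not from the actual Hiero re-implementation.

```python
# Toy illustration of weighted synchronous CFG rules in the Hiero style
# (Chiang, 2005). All rules and weights here are made up for the example.

# Each rule: (source pattern, target pattern, weight); X1/X2 mark gaps.
RULES = [
    ("X1 的 X2", "the X2 of X1", 0.6),
    ("中国", "China", 0.9),
    ("经济", "economy", 0.8),
]

def apply_rule(rule, subs):
    """Substitute sub-derivations into a rule's target side and multiply
    the weights (a log-linear model would sum log-weights instead)."""
    src, tgt, weight = rule
    for var, (text, sub_weight) in subs.items():
        tgt = tgt.replace(var, text)
        weight *= sub_weight
    return tgt, weight

# Derive "中国 的 经济" -> "the economy of China" with the hierarchical rule:
translation, score = apply_rule(RULES[0],
                                {"X1": ("China", 0.9), "X2": ("economy", 0.8)})
print(translation, round(score, 3))
```

Because the gaps on both sides are linked, the rule reorders the two sub-constituents while translating, which is what lets hierarchical phrases capture long-distance reordering.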
SysD
• String-to-dependency MT
  – (Shen et al., 2008)
  – Integrates a target dependency language model
• Motivations
  – Target dependency structures integrate linguistic knowledge
  – Operates directly on lexical items; simpler than CFG
  – Captures long-distance relations through local dependency structures
System Combination
• Analogous to BBN’s work (Rosti et al., 2007)
• The hypothesis score sums system-weighted word posteriors over the arcs of the confusion network:

S(E) = \sum_{l=1}^{N} \log \Big( \sum_{i=1}^{N_s} \lambda_i \, p_{il}(w_l) \Big), \qquad \lambda_i \in [0,1], \quad \sum_{i=1}^{N_s} \lambda_i = 1
• Adaptations in MSRA system
– Single confusion network
• Candidate skeletons come from top-1 translations of each system
• The best skeleton has the most similarity with others based on BLEU
– Word alignment between the skeleton and the other candidate translations is performed with GIZA++
– Parameters are tuned to maximize BLEU on the development data
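The skeleton-selection step above can be sketched as follows: among the systems' top-1 outputs, pick the one most similar to the rest. As a stand-in for the full BLEU used in the actual system, this sketch scores similarity with a simple average of unigram and bigram precision.

```python
# Skeleton selection sketch: choose the candidate closest to the others.
# The similarity here is a simplified BLEU proxy, not full BLEU.
from collections import Counter

def ngram_precision(hyp, ref, n):
    h = Counter(zip(*[hyp[i:] for i in range(n)]))
    r = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum(min(c, r[g]) for g, c in h.items())
    return overlap / max(sum(h.values()), 1)

def similarity(hyp, ref):
    # Average of 1-gram and 2-gram precision (toy stand-in for BLEU).
    return (ngram_precision(hyp, ref, 1) + ngram_precision(hyp, ref, 2)) / 2

def pick_skeleton(candidates):
    """Return the candidate with the highest average similarity to the
    other candidates, mirroring the BLEU-based skeleton choice."""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        score = sum(similarity(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best

# Hypothetical top-1 outputs from three systems:
outputs = [
    "the economy grew fast".split(),
    "the economy grew quickly".split(),
    "economy has grown quickly".split(),
]
print(" ".join(pick_skeleton(outputs)))
```

The middle hypothesis wins because it overlaps well with both neighbors; the other candidates then align against it to build the confusion network.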
System Combination (Cont.)
Outline
• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
– Chinese-English Machine Translation
– Chinese-English System Combination
• Conclusion
Training Data
Model | Unrestricted corpus (Primary MT Submission) | Restricted corpus (Contrast MT Submission)
Phrase translation model | LDC parallel data, 4.99M sentence pairs | Organizer-provided data, 734.8K sentence pairs
Language model | Gigaword + LDC parallel data (English side), 323M English words | English side of organizer-provided data, 9.21M English words
Reordering model | FBIS + others, 197K sentence pairs | CLDC-LAC-2003-004 (ICT)
Development set | 2005-863-001 (489 pairs) | 2005-863-001 (489 pairs)
Pre-/Post-processing
• Pre-processing
  – Tokenization of Chinese and English sentences before word alignment and language model training
  – Special tokens (date, time, number) recognized and normalized in the training data
  – Special tokens pre-translated with rules in the test data before decoding
• Post-processing
  – English case restoration after translation
  – OOVs removed from the final translation
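The rule-based pre-translation of special tokens can be sketched with a single date rule. The regular expression and output format below are simplified assumptions; the actual system covers dates, times, and numbers with a larger rule set.

```python
# Sketch of rule-based pre-translation of special tokens before decoding.
# Only one hypothetical date pattern is shown; the real rule set is larger.
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def translate_date(m):
    """Render a Chinese date match (年/月/日) as a tokenized English date."""
    year, month, day = m.group(1), int(m.group(2)), int(m.group(3))
    return "%s %d , %s" % (MONTHS[month - 1], day, year)

def pretranslate(sentence):
    # e.g. 2008年10月1日 -> October 1 , 2008
    return re.sub(r"(\d{4})年(\d{1,2})月(\d{1,2})日", translate_date, sentence)

print(pretranslate("会议 于 2008年10月1日 召开"))
```

Replacing such tokens before decoding keeps them out of the phrase table, where sparse dates and numbers would otherwise be frequent OOVs.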
Tools
• MSR-SEG
  – MSRA word segmentation tool, used to segment the Chinese side of the parallel data
• Berkeley parser
  – Parses training and test sentences for the syntactic pre-reordering system
• GIZA++
  – Bilingual word alignment
• MaxEnt Toolkit
  – Reordering model (Le Zhang, 2004)
• MSRA internal tools
  – Language modeling
  – Decoders
  – Case restoration for English words
  – System combination
Experiments for MT Task
System | SSMT2007 (BLEU4, case-insensitive) | CWMT2008 (BLEU4, case-sensitive)
Restricted training corpus:
SysA | 0.2366 | 0.2148
SysB | 0.2505 | 0.2303
SysC | 0.2436 | 0.2255
Contrast Submission | 0.2473 | 0.2306
Unrestricted training corpus:
SysA | 0.3157 | 0.2727
SysB | 0.3208 | 0.2782
SysC | 0.3196 | 0.2762
SysD | 0.3276 | 0.2787
Primary Submission | 0.3389 | 0.2809
Experiments for System Combination
Participating system | SSMT2007 BLEU4 (case-insensitive)
S1-1 | 0.2799
S1-2 | 0.2802
S3-1 | 0.2446
S3-2 | 0.2818
S4-1 | 0.2823
S7-1 | 0.1647
S8-1 | 0.2037
S10-1 | 0.2133
S10-2 | 0.2297
S10-3 | 0.2234
S11-1 | 0.1835
S12-1 | 0.3389
S12-2 | 0.2473
S14-1 | 0.2118
S14-2 | 0.2179
S14-3 | 0.2165
S15-1 | 0.2642
System combination: 0.3274 with the restricted LM; 0.3476 with the unrestricted LM
Conclusions
• Syntactic information improves SMT
– Syntactic pre-reordering model
– Target dependency model
• A limited LM hurts system combination
  – The combined output is worse than the unrestricted-LM result when only the limited LM is used
Thanks!
MSRA Systems
• SysA:
  – Phrase-based translation model over contiguous phrases
• SysB:
  – SysA + multiple pre-reordered source-language inputs
• SysC:
  – Hierarchical phrase-based translation model
• SysD:
  – String-to-target-dependency-tree translation model