Technical Report on the MSRA Chinese-to-English Translation System for the CWMT2008 Evaluation


Page 1:

Technical Report on the MSRA Chinese-to-English Translation System for the CWMT2008 Evaluation

Dongdong Zhang, Chi-Ho Li, Mu Li, Ming Zhou

Microsoft Research Asia

Page 2:

Outline

• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
  – Chinese-English Machine Translation
  – Chinese-English System Combination
• Conclusion

Page 3:

Evaluation Task Participation

Participated?   Language pair & domain        Task                  Task ID
Yes             Chinese-English, news         Machine translation   ZH-EN-NEWS-TRANS
Yes             Chinese-English, news         System combination    ZH-EN-NEWS-COMBI
                English-Chinese, news         Machine translation   EN-ZH-NEWS-TRANS
                English-Chinese, scientific   Machine translation   EN-ZH-SCIE-TRANS

Page 4:

MSRA Submission

• Machine translation task
  – Primary submission
    • Unlimited training corpus
    • Combining: SysA + SysB + SysC + SysD
  – Contrast submission
    • Limited training corpus
    • Combining: SysA + SysB + SysC
• System combination task
  – Limited training corpus
  – Combining: 10 systems

Page 5:

SysA

• Phrase-based model

• CYK decoding algorithm

• BTG grammar

• Features:
  – Similar to (Koehn, 2004)
• Maximum Entropy reordering model
  – (Zhang et al., 2007; Xiong et al., 2006)
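The bullets above summarize SysA. Below is a minimal, self-contained sketch of how CYK decoding over a BTG grammar can interact with a maximum-entropy reordering model: adjacent spans are merged in straight or inverted order, and each merge is scored. The reordering probabilities and per-word translations are illustrative stand-ins, not MSRA's actual model.

```python
import math

def maxent_reorder_prob(left_span, right_span, order):
    """Toy stand-in for P(order | boundary words of the two spans)."""
    inverted_bias = 0.7 if right_span in {"went", "said"} else 0.3
    return inverted_bias if order == "inverted" else 1.0 - inverted_bias

def btg_decode(translated_spans):
    """CYK over spans: chart[(i, j)] holds (best translation of span i..j, log-score)."""
    n = len(translated_spans)
    chart = {(i, i + 1): (translated_spans[i], 0.0) for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = None
            for k in range(i + 1, j):                    # split point
                (lt, ls), (rt, rs) = chart[(i, k)], chart[(k, j)]
                for order in ("straight", "inverted"):   # the two BTG merge rules
                    hyp = f"{lt} {rt}" if order == "straight" else f"{rt} {lt}"
                    score = ls + rs + math.log(maxent_reorder_prob(lt, rt, order))
                    if best is None or score > best[1]:
                        best = (hyp, score)
            chart[(i, j)] = best
    return chart[(0, n)][0]

# Per-word translations in source (Chinese) order; the inverted merge recovers English order.
print(btg_decode(["he", "to Beijing", "went"]))   # -> "he went to Beijing"
```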

Page 6:

SysB

• Syntactic pre-reordering model
  – (Li et al., 2007)
• Motivations
  – Isolating the reordering model from the decoder
  – Making use of syntactic information
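A minimal sketch of the syntactic pre-reordering idea: source parse trees are rewritten by reordering rules before decoding, so the decoder sees input that is already closer to target-language word order. The tree format and the single VP rule below are illustrative; the actual system of (Li et al., 2007) learns its reordering decisions from parsed, word-aligned data and can emit multiple pre-reordered inputs.

```python
def preorder(node):
    """node is ('LABEL', [children]) for internal nodes, or a plain token string for leaves."""
    if isinstance(node, str):
        return [node]
    label, children = node
    # Illustrative hand-written rule: inside a VP, move a PP modifier after the verb phrase,
    # which is closer to English order.
    if label == "VP" and len(children) == 2 and children[0][0] == "PP":
        children = [children[1], children[0]]
    out = []
    for child in children:
        out.extend(preorder(child))
    return out

# "他 [PP 在 北京] [VP 工作]" is pre-reordered to verb-before-PP before decoding.
tree = ("IP", [("NP", ["他"]), ("VP", [("PP", ["在", "北京"]), ("VP", [("V", ["工作"])])])])
print(" ".join(preorder(tree)))   # -> 他 工作 在 北京
```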

Page 7:

SysC

• Hierarchical phrase-based model
  – (Chiang, 2005)
  – Hiero re-implementation
• Weighted synchronous CFG
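To make the weighted synchronous CFG concrete, here is a minimal sketch of a Hiero-style rule whose nonterminal gaps are reordered on the target side; the rule, its weight, and the substitution helper are illustrative, not part of MSRA's re-implementation.

```python
from collections import namedtuple

SCFGRule = namedtuple("SCFGRule", ["lhs", "src", "tgt", "weight"])

# X -> < 与 X1 有 X2 , have X2 with X1 > : the two gaps swap order on the target side.
rule = SCFGRule(lhs="X",
                src=["与", "X1", "有", "X2"],
                tgt=["have", "X2", "with", "X1"],
                weight=0.8)

def apply_rule(rule, fillers):
    """Substitute translations for the nonterminal gaps on the target side of a rule."""
    return [fillers.get(tok, tok) for tok in rule.tgt]

print(" ".join(apply_rule(rule, {"X1": "Beijing", "X2": "trade relations"})))
# -> "have trade relations with Beijing"
```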

Page 8:

SysD

• String-to-dependency MT
  – (Shen et al., 2008)
  – Integrates a target dependency language model
• Motivations
  – Target dependency structures integrate linguistic knowledge
  – Defined directly on lexical items, simpler than a CFG
  – Capture long-distance relations through local dependency structure
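A minimal sketch of the target dependency language model idea: instead of scoring each word against its linear neighbours, each word is scored given its head in the target dependency tree, so long-distance relations become local tree edges. The probability table and sentence are made-up examples, not the model of (Shen et al., 2008).

```python
import math

# Illustrative head-relative probabilities P(dependent | head, direction); numbers are made up.
DEP_LM = {
    ("held", "president", "left"): 0.20,
    ("held", "talks", "right"): 0.30,
    ("talks", "trade", "left"): 0.25,
}

def dep_lm_logprob(edges):
    """edges: list of (head, dependent, direction) for the target dependency tree."""
    return sum(math.log(DEP_LM.get(edge, 1e-4)) for edge in edges)

# "president held trade talks": 'held' heads 'president' and 'talks', 'talks' heads 'trade'.
edges = [("held", "president", "left"), ("held", "talks", "right"), ("talks", "trade", "left")]
print(dep_lm_logprob(edges))
```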

Page 9:

System Combination

• Analogous to BBN's work (Rosti et al., 2007)

  – Hypothesis score over the confusion network:
    S(e) = Σ_i ( Σ_{l=1..N_s} w_l · p_il ) + λ · log p_LM(e),  with p_il ∈ {0, 1}
    where N_s is the number of component systems, w_l is the weight of system l,
    p_il indicates whether system l voted for the word on arc i, and log p_LM(e)
    is the language-model log-probability of the hypothesis.
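The following toy code illustrates the kind of scoring written above: each chosen word accumulates system-weighted votes, and a weighted language-model log-probability is added. System weights, votes, and the LM term are all made-up numbers; this is not the MSRA implementation.

```python
def arc_score(votes, system_weights):
    """votes[l] = 1 if system l proposed this word at this position, else 0."""
    return sum(w * v for w, v in zip(system_weights, votes))

def hypothesis_score(arcs, system_weights, lm_logprob, lm_weight=0.5):
    """arcs: one vote vector per chosen word; adds a weighted LM log-probability."""
    return sum(arc_score(votes, system_weights) for votes in arcs) + lm_weight * lm_logprob

weights = [0.4, 0.3, 0.3]                    # one weight per component system
chosen = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]   # votes for the three words of one hypothesis
print(hypothesis_score(chosen, weights, lm_logprob=-12.3))
```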

Page 10:

System Combination (Cont.)

• Adaptations in the MSRA system
  – Single confusion network
    • Candidate skeletons come from the top-1 translations of each system
    • The best skeleton is the candidate most similar to the others in terms of BLEU
  – Word alignment between the skeleton and the other candidate translations is performed with GIZA++
  – Parameters are tuned to maximize BLEU on the development data
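A minimal sketch of the skeleton-selection step described above: for each sentence, the top-1 output whose average sentence-level BLEU against the other systems' top-1 outputs is highest becomes the skeleton. The BLEU approximation below is deliberately crude and only for illustration; the subsequent GIZA++ alignment and parameter tuning are not shown.

```python
import math
from collections import Counter

def sent_bleu(hyp, ref, max_n=4):
    """Rough sentence-level BLEU with add-one smoothing (illustration only)."""
    hyp, ref = hyp.split(), ref.split()
    logs = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(count, r[gram]) for gram, count in h.items())
        total = max(sum(h.values()), 1)
        logs.append(math.log((matched + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(sum(logs) / max_n)

def pick_skeleton(top1_outputs):
    """top1_outputs: each system's best translation of one sentence; returns the skeleton index."""
    def avg_bleu(i):
        others = [o for j, o in enumerate(top1_outputs) if j != i]
        return sum(sent_bleu(top1_outputs[i], o) for o in others) / len(others)
    return max(range(len(top1_outputs)), key=avg_bleu)

outputs = ["he went to beijing yesterday",
           "yesterday he went to beijing",
           "he goes to beijing yesterday"]
print(pick_skeleton(outputs))   # index of the top-1 output chosen as the skeleton
```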

Page 11:

Outline

• Overview
  – MSRA Submissions
  – System Description
• Experiments
  – Training Data & Toolkits
  – Chinese-English Machine Translation
  – Chinese-English System Combination
• Conclusion

Page 12:

Training Data

• Phrase translation model
  – Unlimited corpus (primary MT submission): LDC parallel data, 4.99M sentence pairs
  – Limited corpus (contrast MT submission): data provided by the organizers, 734.8K sentence pairs
• Language model
  – Unlimited: Gigaword + English side of the LDC parallel data, 323M English words
  – Limited: English side of the organizer-provided data, 9.21M English words
• Reordering model
  – Unlimited: FBIS + other data, 197K sentence pairs
  – Limited: CLDC-LAC-2003-004 (ICT)
• Development set
  – Both: 2005-863-001 (489 sentence pairs)

Page 13:

Pre-/Post-processing

• Pre-processing
  – Tokenization of Chinese and English sentences
    • Done before word alignment and language model training
    • Special tokens (date, time and number) are recognized and normalized in the training data
  – Special tokens in the test data are pre-translated with rules before decoding
• Post-processing
  – English case restoration after translation
  – OOVs are removed from the final translation
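A minimal sketch of the special-token normalization described above: dates, times and numbers are matched by rules and replaced with placeholder categories before alignment and language-model training (and pre-translated by rules at test time). The patterns and category names are illustrative, not MSRA's actual rules.

```python
import re

# Order matters: match dates and times before bare numbers so digits are not consumed early.
PATTERNS = [
    (re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"), "$date"),
    (re.compile(r"\d{1,2}[点时]\d{1,2}分"), "$time"),
    (re.compile(r"\d+(?:\.\d+)?%?"), "$number"),
]

def normalize_special_tokens(sentence):
    """Replace recognized date/time/number expressions with placeholder categories."""
    for pattern, category in PATTERNS:
        sentence = pattern.sub(category, sentence)
    return sentence

print(normalize_special_tokens("2008年7月18日 会议 上涨 3.5%"))
# -> "$date 会议 上涨 $number"
```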

Page 14:

Tools

• MSR-SEG
  – MSRA word segmentation tool, used to segment the Chinese sentences in the parallel data
• Berkeley parser
  – Parses training and test sentences for the syntactic pre-reordering based system
• GIZA++
  – Used for bilingual word alignment
• MaxEnt Toolkit
  – Reordering model (Le Zhang, 2004)
• MSRA internal tools
  – Language modeling
  – Decoders
  – Case restoration for English words
  – System combination

Page 15:

Experiments for MT Task

System                 SSMT2007 (BLEU4, case-insensitive)   CWMT2008 (BLEU4, case-sensitive)

Limited training corpus
SysA                   0.2366                                0.2148
SysB                   0.2505                                0.2303
SysC                   0.2436                                0.2255
Contrast Submission    0.2473                                0.2306

Unlimited training corpus
SysA                   0.3157                                0.2727
SysB                   0.3208                                0.2782
SysC                   0.3196                                0.2762
SysD                   0.3276                                0.2787
Primary Submission     0.3389                                0.2809

Page 16:

Experiments for System Combination

Component system outputs (SSMT2007 test set, BLEU4, case-insensitive):

System   BLEU4    Used?
S1-1     0.2799
S1-2     0.2802
S3-1     0.2446
S3-2     0.2818
S4-1     0.2823
S7-1     0.1647
S8-1     0.2037
S10-1    0.2133
S10-2    0.2297
S10-3    0.2234
S11-1    0.1835
S12-1    0.3389
S12-2    0.2473
S14-1    0.2118
S14-2    0.2179
S14-3    0.2165
S15-1    0.2642

System combination results (BLEU4, case-insensitive):
Limited LM      0.3274
Unlimited LM    0.3476

Page 17:

Conclusions

• Syntax information improves SMT

– Syntactic pre-reordering model

– Target dependency model

• Limited LM affects system combination

  – Combination output is worse with the limited LM than with the unlimited LM

Page 18:

Thanks!

Page 21:

MSRA Systems

• SysA:
  – Based on a contiguous-phrase translation model
• SysB:
  – SysA + multiple pre-reordered source-language inputs
• SysC:
  – Based on a hierarchical phrase translation model
• SysD:
  – Based on a string-to-target-dependency-tree translation model

Page 22:

SysB

• Syntactic pre-reordering model
  – (Li et al., 2007)
• Motivations
  – Isolating the reordering model from the decoder
  – Making use of syntactic parse information