Discriminative Learning of Extraction Sets for Machine Translation
John DeNero and Dan Klein, UC Berkeley
Identifying Phrasal Translations
In the past two years , a number of US citizens …
过去 两 年 中 , 一 批 美国 公民 …
(gloss: past two year in , one lots US citizen)
Phrase alignment models: Choose a segmentation and a one-to-one phrase alignment
[Figure: competing phrase alignments for 过去: "Past" vs. "Go over"]
Underlying assumption: There is a correct phrasal segmentation
Unique Segmentations?
In the past two years , a number of US citizens …
过去 两 年 中 , 一 批 美国 公民 …
(gloss: past two year in , one lots US citizen)
Problem 1: Overlapping phrases can be useful (and complementary)
Problem 2: Phrases and their sub-phrases can both be useful
Hypothesis: This is why models of phrase alignment don’t work well
Identifying Phrasal Translations
This talk: Modeling sets of overlapping, multi-scale phrase pairs
In the past two years , a number of US citizens …
过去 两 年 中 , 一 批 美国 公民 …
(gloss: past two year in , one lots US citizen)
Input: sentence pairs
Output: extracted phrases
… But the Standard Pipeline has Overlap!
M O T I V A T I O N
[Figure: Sentence Pair → Word Alignment → Extracted Phrases for "In the past two years" / 过去 两 年 中; the extracted phrases overlap]
Related Work
Translation models: Sinuhe system (Kääriäinen, 2009)
Fixed alignments; learned phrase pair weights
Combining aligners: Yonggang Deng & Bowen Zhou (2009)
Fixed directional alignments; learned symmetrization
Extraction models: Moore and Quirk (2007)
Fixed alignments; learned phrase pair weights
Our Task: Predict Extraction Sets
Sentence Pair → Extracted Phrases
Conditional model of extraction sets given sentence pairs
[Figure: two alignment grids for "In the past two years" / 过去两年中: the extracted phrases alone, and the extracted phrases together with "word alignments"]
Alignments Imply Extraction Sets
M O D E L
[Figure: alignment grid for "In the past two years" / 过去 两 年 中]
Word-level alignment links → word-to-span alignments → extraction set of bispans
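The first step of this mapping can be sketched in a few lines of Python. The sketch below is not the authors' code; word_to_spans, links, n_src, and n_tgt are illustrative names. It projects word-level links to word-to-span alignments by taking, for each word, the minimum and maximum index of the words it links to.

```python
def word_to_spans(links, n_src, n_tgt):
    """Project word-level links to word-to-span alignments.

    links: iterable of (i, j) pairs, where source word i links to target word j.
    Returns two lists: for each source word, the (min, max) target span it
    aligns to, and vice versa; unaligned (null) words get None.
    """
    src_span = [None] * n_src
    tgt_span = [None] * n_tgt
    for i, j in links:
        lo, hi = src_span[i] if src_span[i] else (j, j)
        src_span[i] = (min(lo, j), max(hi, j))
        lo, hi = tgt_span[j] if tgt_span[j] else (i, i)
        tgt_span[j] = (min(lo, i), max(hi, i))
    return src_span, tgt_span
```

The extraction set of bispans then follows deterministically from these spans; a sketch of that step appears with the training slides below.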
Nulls and Possibles
[Figure: 据 报道 (gloss: according to / news report), translated as "it is reported", shown with null and possible alignment links]
Nulls: words aligned to nothing
Possibles: links that are allowed but not required
Incorporating Possible Alignments
[Figure: alignment grid for "In the past two years" / 过去 两 年 中, with sure and possible links]
Sure and possible word links → word-to-span alignments → extraction set of bispans
Linear Model for Extraction Sets
[Figure: alignment grid for "In the past two years" / 过去 两 年 中]
Features on sure links
Features on all bispans
Features on Bispans and Sure Links
F E A T U R E S
Example: 过 地球 (gloss: go over / Earth), translated as "over the Earth"
Some features on sure links
HMM posteriors
Presence in dictionary
Numbers & punctuation
Features on bispans
HMM phrase table features: e.g., phrase relative frequencies
Lexical indicator features for phrases with common words
Monolingual phrase features: e.g., “the _____”
Shape features: e.g., Chinese character counts
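As a rough illustration of how these pieces fit together, here is a hedged Python sketch of the linear scorer: the score of a candidate extraction set is the dot product of a weight vector with features summed over its sure links and its bispans. All names below (score_extraction_set, hmm_posterior, bilingual_dict, and the individual feature names) are placeholders standing in for the feature classes listed above, not the actual implementation.

```python
def score_extraction_set(weights, sure_links, bispans, link_feats, bispan_feats):
    """w . phi: sum weighted features over sure links and over bispans."""
    score = 0.0
    for link in sure_links:
        for name, value in link_feats(link).items():
            score += weights.get(name, 0.0) * value
    for bispan in bispans:
        for name, value in bispan_feats(bispan).items():
            score += weights.get(name, 0.0) * value
    return score

# Example (placeholder) link features for one sentence pair:
def make_link_feats(src, tgt, hmm_posterior, bilingual_dict):
    def link_feats(link):
        i, j = link
        return {
            "hmm_posterior": hmm_posterior[i][j],                 # posterior under the HMM aligner
            "in_dict": float((src[i], tgt[j]) in bilingual_dict), # dictionary presence
            "both_numbers": float(src[i].isdigit() and tgt[j].isdigit()),
        }
    return link_feats
```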
Getting Gold Extraction Sets
T R A I N I N G
Hand aligned: sure and possible word links
Deterministic: word-to-span alignments (find the min and max alignment index for each word)
Deterministic: extraction set of bispans (a bispan is included iff every word within the bispan aligns within the bispan)
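Building on the word_to_spans sketch from the Model section, the gold extraction set can be derived as below. This is only a sketch of the stated rule: it uses only the sure links, and the length cap and the requirement of at least one aligned word inside each bispan are assumptions rather than details from the slides.

```python
def gold_extraction_set(links, n_src, n_tgt, max_len=4):
    """All bispans ((i1, i2), (j1, j2)), with inclusive spans, such that
    every aligned word inside the bispan aligns only within the bispan."""
    src_span, tgt_span = word_to_spans(links, n_src, n_tgt)

    def inside(span, lo, hi):
        # Unaligned words never block extraction.
        return span is None or (lo <= span[0] and span[1] <= hi)

    bispans = []
    for i1 in range(n_src):
        for i2 in range(i1, min(n_src, i1 + max_len)):
            for j1 in range(n_tgt):
                for j2 in range(j1, min(n_tgt, j1 + max_len)):
                    if any(src_span[i] is not None for i in range(i1, i2 + 1)) and \
                       all(inside(src_span[i], j1, j2) for i in range(i1, i2 + 1)) and \
                       all(inside(tgt_span[j], i1, i2) for j in range(j1, j2 + 1)):
                        bispans.append(((i1, i2), (j1, j2)))
    return bispans
```

The rule also licenses larger bispans that absorb unaligned boundary words; max_len merely keeps the enumeration small.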
Discriminative Training with MIRA
Loss function: F-score of bispan errors (precision & recall)
Training Criterion: Minimal change to w such that the gold is preferred to the guess by a loss-scaled margin
Gold (annotated) vs. guess (arg max of w · ɸ)
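A hedged sketch of this training step, written as a 1-best MIRA update where the loss is one minus an F-measure over bispans (whether the loss uses F1 or the recall-weighted F5 reported later is not specified here). The sparse feature vectors phi_gold and phi_guess and the step-size cap C are assumptions for illustration.

```python
def f_measure(gold_bispans, guess_bispans, beta=1.0):
    """F_beta over extracted bispans; beta > 1 weights recall more heavily."""
    gold, guess = set(gold_bispans), set(guess_bispans)
    correct = len(gold & guess)
    if correct == 0:
        return 0.0
    p, r = correct / len(guess), correct / len(gold)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def mira_update(weights, phi_gold, phi_guess, loss, C=1.0):
    """Smallest change to the weights such that the gold extraction set
    outscores the model's guess by a margin scaled by the loss."""
    keys = set(phi_gold) | set(phi_guess)
    diff = {k: phi_gold.get(k, 0.0) - phi_guess.get(k, 0.0) for k in keys}
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return weights
    violation = loss - sum(weights.get(k, 0.0) * v for k, v in diff.items())
    tau = min(C, max(0.0, violation) / norm_sq)
    new_weights = dict(weights)
    for k, v in diff.items():
        new_weights[k] = new_weights.get(k, 0.0) + tau * v
    return new_weights
```

In use, loss would be 1 - f_measure(gold_bispans, guess_bispans, beta) for the chosen beta.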
Inference: An ITG Parser
I N F E R E N C E
ITG captures some bispans
Coarse-to-Fine Approximation
Coarse Pass: Features that are local to terminal productions
Fine Pass: Agenda search using coarse pass as a heuristic
We use an agenda-based parser. It’s fast!
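The slides do not spell out the parser's internals, so the following is only a generic best-first agenda loop of the kind described: items are prioritized by their inside score plus an outside estimate taken from the cheaper coarse pass. The item representation, expand, coarse_outside, and is_goal are all assumed interfaces, not the Berkeley implementation.

```python
import heapq
import itertools

def agenda_parse(start_items, expand, coarse_outside, is_goal):
    """Best-first agenda search: pop the item whose inside score plus
    coarse outside estimate is largest (log scores, higher is better)."""
    counter = itertools.count()   # tie-breaker so the heap never compares items
    agenda, finished = [], {}
    for item, inside in start_items:
        heapq.heappush(agenda, (-(inside + coarse_outside(item)), next(counter), item, inside))
    while agenda:
        _, _, item, inside = heapq.heappop(agenda)
        if item in finished:
            continue              # already popped with an equal or better score
        finished[item] = inside
        if is_goal(item):
            return item, inside
        for new_item, new_inside in expand(item, inside, finished):
            if new_item not in finished:
                heapq.heappush(agenda, (-(new_inside + coarse_outside(new_item)),
                                        next(counter), new_item, new_inside))
    return None, float("-inf")
```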
Experimental Setup
R E S U L T S
Chinese-to-English newswire
Parallel corpus: 11.3 million words; sentences of length ≤ 40
MT systems: Tuned and tested on NIST ‘04 and ‘05
Supervised data: 150 training & 191 test sentences (NIST ‘02)
Unsupervised Model: Jointly trained HMM (Berkeley Aligner)
Baselines and Limited Systems
HMM: state-of-the-art unsupervised baseline; jointly trained, with competitive posterior decoding; the source of many features for the supervised models
ITG: supervised ITG aligner with block terminals; state-of-the-art supervised baseline (re-implementation of Haghighi et al., 2009)
Coarse: supervised block ITG + possible alignments; the coarse pass of the full extraction set model
Word Alignment Performance
         Precision   Recall   1 - AER
HMM        84.0       76.9      80.4
ITG        83.4       83.8      83.6
Coarse     82.2       84.2      83.1
Full       84.7       84.0      84.4
Extracted Bispan Performance
         Precision   Recall    F1      F5
HMM        69.5       59.5     64.1    59.9
ITG        75.8       62.3     68.4    62.8
Coarse     70.0       72.9     71.4    72.8
Full       69.0       74.2     71.6    74.0
(F5 is the F-measure with β = 5, weighting recall five times as heavily as precision.)
Translation Performance (BLEU)
         Moses   Joshua
HMM       33.2    34.5
ITG       33.6    34.7
Coarse    34.2    35.7
Full      34.4    35.9
Supervised conditions also included HMM alignments
Conclusions
Extraction set model directly learns what phrases to extract
The system performs well as an aligner or a rule extractor
Are segmentations always bad?
Idea: get overlap and multi-scale into the learning!
Thank you!
nlp.cs.berkeley.edu