jhu mt class: human evaluation of machine translation systems

Post on 17-Jun-2015


Evaluation

•Some (not all) key ingredients in Google Translate:

•Phrase-based translation models

•... Learned heuristically from word alignments

•... Coupled with a huge language model

•... And very tight pruning heuristics

•Q: How do they know it works?

Overview

[Diagram: training data (parallel text) → learner → model]

[Diagram: 联合国 安全 理事会 的 五个 常任 理事 国都 → decoder → However , the sky remained clear under the strong north wind .]

Evaluation

More has been written about machine translation evaluation than about machine translation itself.

Yorick Wilks

•Why evaluate?
•Rank systems.
•Evaluate incremental changes.
•Assess new ideas empirically.

•Evaluation must be:
•Fast
•Cheap
•Reliable
•Repeatable

What It Takes to Compete Against Top Human Jeopardy! Players
Our Analysis Reveals the Winner’s Cloud

[Figure: each dot is an actual historical human Jeopardy! game. Labels: Winning Human Performance, Grand Champion Human Performance, 2007 QA Computer System. Axis: More Confident to Less Confident.]

Computers? Not So Good.

© 2010 IBM Corporation, IBM Research

DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010

IBM Watson: Playing in the Winners Cloud

[Figure: versions plotted: Baseline 12/06, v0.1 12/07, v0.2 05/08, v0.3 08/08, v0.4 12/08, v0.5 05/09, v0.6 10/09, v0.7 04/10, v0.8 11/10.]

美国愿和北韩谈判但拒绝再付出报酬

US willing to negotiate with North Korea but not to pay more compensation.

The United States is willing to hold talks with North Korea but refused to pay remuneration.

“奋进”号因机械手故障推迟到升空

“Progress” postponed because of mechanical hand into the sky.

Launch of “Endeavour” delayed by robotic arm problems.

Rank Sentences

You have judged 25 sentences for WMT09 Spanish-English News Corpus, 427 sentences total taking 64.9 seconds per sentence.

Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Hema-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.

Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.

Translation / Rank (each translation gets a rank from 1 = Best to 5 = Worst):

•These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
•These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.
•These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.
•These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.
•These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.

Annotator: ccb Task: WMT09 Spanish-English News Corpus

Instructions: Rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores. They are relative scores.

Manual Evaluation

Chinese people in the traditional Spring Festival is approaching, the CPC Central Committee this afternoon in Zhongnanhai on the 22nd non-Party personages to convene a forum in Spring Festival, invited the central committees of democratic parties, the leadership of the National Federation of Industry and Commerce and personages without party affiliation on behalf of comrades gathered together State yes, talked in length about the friendship, to greet the Chinese New Year. CPC Central Committee General Secretary and State President and Central Military Commission Chairman Hu Jintao on behalf of the CPC Central Committee, the State Council, to the central committees of democratic parties, leaders of the National Federation of Industry and Commerce and personages without party affiliation, to members of the united front, to extend my New Year's blessing.

Design of the WMT Evaluation (2008-2011)

system A
system B
system C
system D
system E
system F
system G

Resulting ranking:

1. system C
2. system D
3. system A
4. system B
5. system G
6. system F
7. system E

➡ Costly: 361 hours of human effort in 2011.

Are you sure this is the correct ranking?

•In the above example (7 systems), there are 7! = 5040 possible rankings.
•With 10 systems: about 3.6 million possible rankings.
•With 20 systems: about 2.4 quintillion possible rankings.
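The counts above are factorials: n systems can be strictly ordered in n! ways. A minimal sketch to check the arithmetic, assuming rankings without ties (the function name `count_rankings` is only illustrative):

```python
import math


def count_rankings(num_systems: int) -> int:
    """Number of strict total orderings (no ties) of num_systems systems."""
    return math.factorial(num_systems)


for n in (7, 10, 20):
    # 7 -> 5040, 10 -> 3628800, 20 -> 2432902008176640000
    print(n, count_rankings(n))
```

Allowing ties would make the space of possible rankings larger still.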

Design of the WMT Evaluation (2008-2011)

Translators: system A, system B, system C, system D, system E, system F, system G, reference

While (evaluation period is not over):
➡ Sample input sentence.
➡ Sample five translators of it from Systems ∪ {Reference}.
➡ Sample an assessor.
➡ Receive (partial) ranking of translations from assessor.

Example partial ranking received:

1. reference
2. system C
3. system A, system F
4. system D

The partial ranking expands into pairwise rankings (≺ = ranked ahead of, ≻ = ranked behind, ≡ = tied):

reference ≺ system A
reference ≺ system C
reference ≺ system D
reference ≺ system F
system A ≻ system C
system A ≺ system D
system A ≡ system F
system C ≺ system D
system C ≺ system F
system D ≻ system F

WMT Raw Data: pairwise rankings
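The expansion of one partial ranking into pairwise rankings can be sketched as follows. This is an illustration, not the WMT tooling: the function name, the tier-list input format, and the reading of ≺ as "ranked ahead of" are assumptions.

```python
from itertools import combinations


def pairwise_comparisons(tiers):
    """Expand a partial ranking into pairwise relations.

    tiers: list of lists, best tier first; translators in one tier are tied.
    Returns (a, relation, b) triples with a before b alphabetically.
    """
    # Map each translator to its tier index (0 = best).
    rank = {name: i for i, tier in enumerate(tiers) for name in tier}
    pairs = []
    for a, b in combinations(sorted(rank), 2):
        if rank[a] < rank[b]:
            relation = "≺"  # a ranked ahead of b (assumed meaning)
        elif rank[a] > rank[b]:
            relation = "≻"  # a ranked behind b
        else:
            relation = "≡"  # tie
        pairs.append((a, relation, b))
    return pairs


# The example partial ranking from the slide.
example = [["reference"], ["system C"], ["system A", "system F"], ["system D"]]
for a, relation, b in pairwise_comparisons(example):
    print(a, relation, b)
```

Five sampled translators always yield ten pairs, so each judgment contributes ten pairwise observations to the raw data.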

Tournaments

[Graph: vertices system A, system B, system C, system D]

•Directed edge between every pair of vertices.
•Edge from A to B if A beats B in pairwise comparison.
•Widely used to model: sports, web results, elections.

Landau, 1951. On dominance relations and the structure of animal societies

•We use tournaments to model all WMT ’10-’11 rankings (25 tasks).

Tournaments

If tournament is acyclic: topological sort

system A
system B
system C
system D
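As a minimal sketch of the acyclic case, Python's standard `graphlib` module (3.9+) can do the topological sort. The edge set in the example is hypothetical, since the slide's edges did not survive extraction, and `rank_acyclic_tournament` is an illustrative name.

```python
from graphlib import CycleError, TopologicalSorter


def rank_acyclic_tournament(beats):
    """Return systems best-first, or None if the tournament has a cycle.

    beats: dict mapping each system to the set of systems it beats.
    """
    # TopologicalSorter expects node -> predecessors, so invert the edges:
    # the predecessors of a system are the systems that beat it.
    predecessors = {system: set() for system in beats}
    for winner, losers in beats.items():
        for loser in losers:
            predecessors.setdefault(loser, set()).add(winner)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


# Hypothetical acyclic tournament: every system beats those listed after it.
acyclic = {
    "system A": {"system B", "system C", "system D"},
    "system B": {"system C", "system D"},
    "system C": {"system D"},
    "system D": set(),
}
print(rank_acyclic_tournament(acyclic))
```

An acyclic tournament has exactly one topological order, so the ranking is unambiguous in this case.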

Tournaments

[Graph: tournament on system A, system B, system C, system D with edge weights 1, 2, 3, 1, 1, 2]

What if tournament contains cycles?

16 out of 25 tasks in WMT ’10-’11 contain cycles!

Solution: Reverse a set of edges such that:
(a) Resulting graph is acyclic.
(b) Sum of reversed edge weights is minimized.

Set of reversed edges = minimum feedback arc set (MFAS).
In theory, this optimization is NP-hard (Karp, 1972).
In practice, it’s not too hard.
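For a handful of systems the minimum feedback arc set can be found by brute force. The sketch below tries every ordering and keeps the one whose disagreeing edges weigh least. It is not the solver used for the WMT analysis, and the example edge weights are a hypothetical assignment.

```python
from itertools import permutations


def minimum_feedback_arc_set(weights):
    """Brute-force MFAS for a small weighted tournament.

    weights: dict mapping (winner, loser) -> positive weight, one edge per pair.
    Returns (order, reversed_edges, cost) for a cheapest ordering.
    """
    systems = sorted({s for edge in weights for s in edge})
    best = None
    for order in permutations(systems):
        position = {s: i for i, s in enumerate(order)}
        # An edge must be reversed if its winner is placed after its loser.
        reversed_edges = [e for e in weights if position[e[0]] > position[e[1]]]
        cost = sum(weights[e] for e in reversed_edges)
        if best is None or cost < best[2]:
            best = (list(order), reversed_edges, cost)
    return best


# Hypothetical tournament containing the cycle A -> B -> C -> A.
example = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "A"): 1,
    ("A", "D"): 1, ("B", "D"): 1, ("C", "D"): 2,
}
order, reversed_edges, cost = minimum_feedback_arc_set(example)
print(order, reversed_edges, cost)
```

The search is exponential in the number of systems, in line with the NP-hardness noted above, so it only suits small tournaments.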

Important detail: What should the weight be?
The following analysis uses #(wins - losses).
Dumb, but counts each observation equally.
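A minimal sketch of the #(wins - losses) weighting, assuming the raw data is a list of (winner, loser) judgments with ties already dropped. The function name and the choice to leave out pairs with zero net wins are assumptions of this sketch.

```python
from collections import Counter


def weighted_tournament(judgments):
    """Build weighted edges from pairwise judgments.

    judgments: iterable of (winner, loser) pairs, one per observation.
    Returns a dict mapping (net_winner, net_loser) -> wins - losses.
    """
    wins = Counter(judgments)
    edges = {}
    # Visit each unordered pair once.
    for a, b in {tuple(sorted(pair)) for pair in wins}:
        net = wins[(a, b)] - wins[(b, a)]
        if net > 0:
            edges[(a, b)] = net
        elif net < 0:
            edges[(b, a)] = -net
        # net == 0: no net winner; this sketch leaves the pair out (assumption).
    return edges


# Hypothetical judgments: A beats B twice, B beats A once, C beats A once.
print(weighted_tournament([("A", "B"), ("A", "B"), ("B", "A"), ("C", "A")]))
```

Each observation moves one edge weight by exactly one, which is the sense in which every observation counts equally.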

Example: French-English 2010

Task Rankings / MFAS

[Figure: systems shown: onlineB, rwth-combo, cmu-hyposel-combo, cambridge, lium, dcu-combo, cmu-heafield-combo, upv-combo, nrc, uedin, jhu, limsi, jhu-combo, lium-combo, rali, lig, bbn-combo, rwth, cmu-statxfer, onlineA, huicong, dfki, cu-zeman, geneva]

Has WMT solved these problems?

Human evaluation is too slow and expensive!

Human evaluation isn’t reproducible!

With crowdsourcing, WMT has made a good dent in this problem.

Empirically true in the WMT data.

Human Assessment is Fast and Cheap!
