

Structural Phrase Alignment Based on Consistency Criteria

Toshiaki Nakazawa, Kun Yu, Sadao Kurohashi (Graduate School of Informatics, Kyoto University)
{nakazawa, kunyu}@nlp.kuee.kyoto-u.ac.jp [email protected]

[Figure: example-based translation of an input sentence, with panels labeled Input, Translation Examples, Language Models, and Output.
Input: 交差点に入る時私の信号は青でした。 (cross / point / enter / when / my / signal / blue / was)
Translation Examples (English side): "came at me from the side at the intersection"; "to remove when entering a house"; "my signature"; "The traffic light was green". Japanese-side glosses: (suddenly) (rush out) (house) (put off) (signal) (enter) (when) (cross) (point) (my) (signal) (blue) (was).
Output: "My traffic light was green when entering the intersection."]

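$$\operatorname*{arg\,max}_{\text{alignment}} \sum_{i,j} cs\bigl(d_J(a_i, a_j),\; d_E(a_i, a_j)\bigr)$$

d_J: J-Side Distance, d_E: E-Side Distance

[Figure: frequency (log) of alignment pairs against the distance on the J-side and the distance on the E-side.]

[Figure: consistency score plotted over J-Side Distance and E-Side Distance.]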
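The objective above selects the alignment whose pairs of correspondences have the highest summed consistency score. A minimal sketch of that selection follows, assuming an exhaustive search over one English candidate per Japanese phrase. The names `alignment_score`, `best_alignment`, `toy_cs`, and `toy_distance`, the near/far threshold of 2, and the use of positional distance are all hypothetical illustrations, not the poster's implementation; the toy score only mimics the sign pattern (near-near positive, far-far zero, near-far negative).

```python
from itertools import combinations, product


def alignment_score(alignment, dist_j, dist_e, cs):
    """Sum cs(d_J, d_E) over all unordered pairs of links in the alignment."""
    total = 0.0
    for (j1, e1), (j2, e2) in combinations(alignment, 2):
        total += cs(dist_j(j1, j2), dist_e(e1, e2))
    return total


def best_alignment(candidates, dist_j, dist_e, cs):
    """Exhaustively choose one English candidate for each Japanese phrase.

    candidates maps a Japanese phrase id to a list of English phrase ids.
    """
    j_ids = sorted(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[j] for j in j_ids)):
        alignment = list(zip(j_ids, choice))
        score = alignment_score(alignment, dist_j, dist_e, cs)
        if score > best_score:
            best, best_score = alignment, score
    return best, best_score


def toy_cs(d_j, d_e, near=2):
    """Hypothetical score: +1 for near-near, -1 for near-far, 0 for far-far."""
    near_j, near_e = d_j <= near, d_e <= near
    if near_j and near_e:
        return 1.0
    if near_j != near_e:
        return -1.0
    return 0.0


def toy_distance(a, b):
    """Assumption: positional distance stands in for tree distance."""
    return abs(a - b)


# Japanese phrase 1 is ambiguous between English phrases 1 and 7.
toy_candidates = {0: [0], 1: [1, 7], 2: [2]}
chosen, chosen_score = best_alignment(toy_candidates, toy_distance, toy_distance, toy_cs)
print(chosen, chosen_score)
```

Summing over unordered pairs rather than ordered pairs only halves every total, so it does not change which alignment wins. Exhaustive search is exponential in the number of ambiguous phrases and is used here only to keep the sketch short.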

Flow of Our EBMT System

Core Steps of Alignment

• Searching Correspondence Candidates
  – Fine-grained alignment is effective in translation
  – Search for as many candidates as possible using a variety of linguistic information
    • Bilingual dictionaries
    • Transliteration (Katakana words, NEs)
      ローズワイン → rosuwain ⇔ rose wine (similarity: 0.78)
      新宿 → shinjuku ⇔ shinjuku (similarity: 1.0)
    • Numeral normalization
      二百十六万 → 2,160,000 ← 2.16 million
    • Japanese flexible matching (Odani et al. 2007)
    • Substring co-occurrence measure (Cromieres 2006)
• Selecting Correspondence Candidates
  – More candidates lead to more ambiguities and improper alignments
  – A robust alignment method is needed that aligns parallel sentences consistently by selecting an adequate set of candidates
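The numeral normalization step in the list above maps differently written numbers (二百十六万, 2,160,000, 2.16 million) to a common value so they can be matched. A minimal sketch of one way to do this is shown below; the function names `kanji_to_int` and `english_to_int` and the supported unit words are assumptions for illustration, and the poster does not describe how the actual system normalizes numerals.

```python
from decimal import Decimal

KANJI_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
                "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}
ENGLISH_SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def kanji_to_int(text):
    """Convert a kanji numeral such as 二百十六万 to an integer."""
    total = 0    # value of completed large-unit groups (万, 億, 兆)
    section = 0  # value accumulated inside the current group (below 10,000)
    digit = 0    # digit waiting for a unit
    for ch in text:
        if ch in KANJI_DIGITS:
            digit = KANJI_DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            group = section + digit
            total += (group or 1) * LARGE_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def english_to_int(text):
    """Convert strings such as '2,160,000' or '2.16 million' to an integer."""
    parts = text.replace(",", "").split()
    value = Decimal(parts[0])
    if len(parts) == 2:
        value *= ENGLISH_SCALES[parts[1].lower()]
    return int(value)


print(kanji_to_int("二百十六万"), english_to_int("2.16 million"))
```

`Decimal` is used on the English side so that values like 2.16 are scaled without binary floating-point error.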

Experimental Result

• 500 test sentences from Mainichi newspaper parallel corpus
• Bilingual dictionary: KENKYUSYA J-E/J-E 500K entries
• Evaluation criteria: Precision / Recall / F-measure
• Character-based for Japanese, word-based for English

                            Pre     Rec     F
Baseline                    77.47   64.32   70.29
+Consistency Score          80.30   66.90   72.99
Proposed (+CS, +DpndType)   80.77   69.14   74.51
Filtering (80%)             82.48   71.31   76.49
Moses (SMT Toolkit)*        60.19   33.15   42.75
Manual (upper bound)        95.58   89.80   92.60

* Using 300K newspaper domain bi-sentences for training

Quality of Other Language Pairs (AER)

                  English-French   English-Romanian   English-Korean
HLT-NAACL 2003    5.71             28.86              -
ACL 2005          -                26.55              -
(Gildea, 2003)    -                -                  32
GIZA++            15.89            27.19              35
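The F column of the results is consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below recomputes it from the reported Pre and Rec values; the function name `f_measure` and the dictionary layout are only illustrative. Because the reported precision and recall are themselves rounded, a recomputed value can differ from the reported one in the last digit.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pre, Rec, F) as reported in the results table.
reported = {
    "Baseline": (77.47, 64.32, 70.29),
    "+Consistency Score": (80.30, 66.90, 72.99),
    "Proposed (+CS, +DpndType)": (80.77, 69.14, 74.51),
    "Filtering (80%)": (82.48, 71.31, 76.49),
    "Moses (SMT Toolkit)": (60.19, 33.15, 42.75),
    "Manual (upper bound)": (95.58, 89.80, 92.60),
}

for name, (pre, rec, f) in reported.items():
    print(f"{name}: reported F = {f:.2f}, recomputed F = {f_measure(pre, rec):.2f}")
```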

Conclusion

Selecting Correspondence Candidates Using Consistency Score and Dependency Type

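[Figure: correspondence candidates for an example sentence pair, annotated "Ambiguities!" and "Improper alignments!".
English phrases: you / will have to file / insurance / an claim / insurance / with the office / in Japan
Japanese phrases: 日本で (in Japan) / 保険 (insurance) / 会社に対して (to company) / 保険 (insurance) / 請求の (claim) / 申し立てが (instance) / 可能ですよ (you can)]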

Distribution of the distance of alignment pairs in hand-annotated data (Mainichi newspaper 40K sentence pairs) [Uchimoto04]

Consistency Score Function

“Near-Near” pair → Positive Score
“Far-Far” pair → 0
“Near-Far” pair → Negative Score

Baseline:

$$cs(d_J, d_E) = \frac{1}{d_J} + \frac{1}{d_E}$$

Example: 1/1 + 1/2 = 1.5

Dependency Type Distance

Japanese
  predicate: level C                       6
  predicate: level B+/B                    5
  predicate: level B-/A                    4
  predicate: level A- / Others             3
  case no / rentai                         2
  Inside clause                            1

English
  S / SBAR / SQ …                          5
  VP / WHADVP                              4
  WHADJP / ADVP / ADJP / NP / PP / INTJ    3
  QP / PRT / PRN / Others                  1
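One way to use per-type distances like these is to sum them along the path between two phrases in a dependency tree, so that crossing a clause boundary costs more than staying inside a clause. The sketch below does this for the Japanese side. It is an assumption about how the distances combine, not a description of the poster's exact procedure; the function names, the type keys, and the toy tree (including its "case-ga" edge, which falls back to the "Others" value) are hypothetical.

```python
# Japanese dependency types with explicitly listed distances; any other
# type falls back to the "Others" value of 3.
JAPANESE_TYPE_DISTANCE = {
    "predicate:C": 6,
    "predicate:B+/B": 5,
    "predicate:B-/A": 4,
    "case-no/rentai": 2,
    "inside-clause": 1,
}
OTHERS_DISTANCE = 3


def cost_to_ancestors(node, parent, edge_type, type_distance, default):
    """Map the node and each of its ancestors to the summed distance from the node."""
    costs = {node: 0}
    total = 0
    while node in parent:
        total += type_distance.get(edge_type[node], default)
        node = parent[node]
        costs[node] = total
    return costs


def tree_distance(a, b, parent, edge_type,
                  type_distance=JAPANESE_TYPE_DISTANCE, default=OTHERS_DISTANCE):
    """Sum of edge distances on the tree path between nodes a and b."""
    up_a = cost_to_ancestors(a, parent, edge_type, type_distance, default)
    up_b = cost_to_ancestors(b, parent, edge_type, type_distance, default)
    shared = [n for n in up_a if n in up_b]
    if not shared:
        raise ValueError("nodes are not in the same tree")
    # With non-negative distances the cheapest shared ancestor is the lowest one.
    return min(up_a[n] + up_b[n] for n in shared)


# Hypothetical toy tree: each child points to its head, and edge_type labels
# the dependency from the child to that head.
toy_parent = {"insurance": "claim", "claim": "instance", "instance": "can"}
toy_edge_type = {
    "insurance": "inside-clause",
    "claim": "case-no/rentai",
    "instance": "case-ga",  # not listed above, so it uses OTHERS_DISTANCE
}

print(tree_distance("insurance", "claim", toy_parent, toy_edge_type))
print(tree_distance("insurance", "can", toy_parent, toy_edge_type))
```

The same function could be applied to the English side by passing a dictionary built from the English phrase categories and a default of 1.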

How to reflect the inconsistency?

• Proposed a new phrase alignment method using consistency criteria.
• Alignment accuracy is sufficient compared with that of other language pairs.
• We need to acquire the parameters automatically by machine learning.
• We are planning to extend the framework so that it revises the parse result.

(A translation demo by NICT in the exhibition corner is using our system!)

[Figure: dependency trees of the example pair "you will have to file ... in Japan" / 日本で 保険 会社に対して 保険 請求の 申し立てが 可能ですよ, annotated with dependency types.
Japanese dependency labels: デ格 [case “de”], 文節内 [inside clause], 連用 [renyou], 文節内 [inside clause], ノ格 [case “no”], ガ格 [case “ga”]
English node labels: NP, NP, NN, PP, NN, PP
Pair 1: (Ds, Dt) = (1, 1) → Positive Score
Pair 2: (Ds, Dt) = (1, 7) → Negative Score
Annotations: Near! Far!]