deep linguistic information in hybrid machine translation jan hajič charles university in prague...
TRANSCRIPT
Deep Linguistic InformationDeep Linguistic Informationin Hybrid Machine Translationin Hybrid Machine Translation
Jan HajičJan Hajič
Charles University in PragueCharles University in PragueFaculty of Mathematics and PhysicsFaculty of Mathematics and Physics
Institute of Formal and Applied LinguisticsInstitute of Formal and Applied LinguisticsCzech RepublicCzech Republic
““DeepBank:” The Prague Czech-English DeepBank:” The Prague Czech-English Dependency Treebank (2.0)Dependency Treebank (2.0)– Texts, annotation style(s), alignment, toolsTexts, annotation style(s), alignment, tools
The platform: TreexThe platform: Treex
TectoMT: hybrid MT English → Czech TectoMT: hybrid MT English → Czech – The (old) ideaThe (old) idea– Overall designOverall design– Core modulesCore modules
(A Speculation on) The Future(A Speculation on) The Future
Outline: From Data Outline: From Data To an MT SystemTo an MT System
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 22
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 44
The Prague Czech-English The Prague Czech-English Dependency Treebank (PCEDT) 2.0Dependency Treebank (PCEDT) 2.0
Parallel treebankParallel treebank
Dependency style (“Prague”)Dependency style (“Prague”)– (surface) syntax(surface) syntax– syntax & semantics (“tectogrammatics”)syntax & semantics (“tectogrammatics”)
surfacesyntax
syntax &semantics(and more)=“tectogrammatics”
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 55
The Prague Czech-English The Prague Czech-English Dependency Treebank (PCEDT) 2.0Dependency Treebank (PCEDT) 2.0
Parallel treebankParallel treebank
Dependency style (“Prague”)Dependency style (“Prague”)– (surface) syntax(surface) syntax– syntax & semantics (“tectogrammatics”)syntax & semantics (“tectogrammatics”)
Penn Treebank translation into CzechPenn Treebank translation into Czech
Názory na její tříměsíční perspektivu se různí.
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 66
The Prague Czech-English The Prague Czech-English Dependency Treebank (PCEDT) 2.0Dependency Treebank (PCEDT) 2.0
Parallel treebankParallel treebank
Dependency style (“Prague”)Dependency style (“Prague”)– (surface) syntax(surface) syntax– syntax & semantics (“tectogrammatics”)syntax & semantics (“tectogrammatics”)
Penn Treebank translation into CzechPenn Treebank translation into Czech
1 million words1 million words
Published at LDC, June 2012 (LDC2012T08)Published at LDC, June 2012 (LDC2012T08)– Also available through LINDAT-Clarin and META-Also available through LINDAT-Clarin and META-
SHARESHARE
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 77
PCEDT 2.0PCEDT 2.0The Alignment(s)The Alignment(s)
Czech-English alignmentsCzech-English alignments– Sentence-level (manual, natural due to translation)Sentence-level (manual, natural due to translation)
At both syntactic levelsAt both syntactic levels
– Word (node) level Word (node) level automatic, test section manually corrected (in part)automatic, test section manually corrected (in part)
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 88
PCEDT 2.0PCEDT 2.0The Alignment(s)The Alignment(s)
Czech-English alignmentsCzech-English alignments– Sentence-level (manual, natural due to translation)Sentence-level (manual, natural due to translation)
At both syntactic levels 1 At both syntactic levels 1 →→ 1 1
– Word (node) level Word (node) level automatic, test section manually corrected (in part), m automatic, test section manually corrected (in part), m →→ n n
Between annotation levelsBetween annotation levels– Tectogrammatics to surface syntaxTectogrammatics to surface syntax
m → n, incl. 1 → 0m → n, incl. 1 → 0
– Surface syntax to word level (1 → 1)Surface syntax to word level (1 → 1)
tectogrammatics
surface syntax
PTB syntax
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1010
Surface syntax annotationSurface syntax annotation
EnglishEnglish– Dependency (head rules + additions, manual corrections)Dependency (head rules + additions, manual corrections)– Function label (PDT-style) at all nodes (from PTB + rules)Function label (PDT-style) at all nodes (from PTB + rules)– Lemmatization + „pure“ POS tags from PTBLemmatization + „pure“ POS tags from PTB– Automatic (from PTB) + a few manual correctionsAutomatic (from PTB) + a few manual corrections
CzechCzech– PDT style, no changePDT style, no change– Syntax: Syntax: automatic (MST); 2000 sent. automatic (MST); 2000 sent. fully manualfully manual for testing for testing– Lemmatization and tagging: autoLemmatization and tagging: auto
9999%%/96/96%, Spoustov%, Spoustová et al. EACL 2009 (COMPOST tagger)á et al. EACL 2009 (COMPOST tagger)
http://ufal.mff.cuni.cz/compost (Czech, English & other) (Czech, English & other)
– No p-level (of course No p-level (of course ))
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1111
Tectogrammatical annotationTectogrammatical annotation
Manual (both languages)Manual (both languages)
Major featuresMajor features– Nodes with „autosemaNodes with „autosemanntic“ words only (no function words)tic“ words only (no function words)
Ellipsis „restored“ (new node for verbal arguments)Ellipsis „restored“ (new node for verbal arguments)
– (Semantic) function (dependent(Semantic) function (dependent →→ head relation)head relation)Verb arguments + ca 50 functions for other relationsVerb arguments + ca 50 functions for other relations
– Valency lexicons attached (Eng: links to PropBank)Valency lexicons attached (Eng: links to PropBank)– ““Formemes”: prep+case style label (useful in MT and search)Formemes”: prep+case style label (useful in MT and search)– Co-reference integrated (Eng: BBNCo-reference integrated (Eng: BBN + more + more)), Czech: manually, Czech: manually
AlignAlignmentment– To To surface syntax surface syntax & between Czech and English& between Czech and English
This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1212
Accompanying ToolsAccompanying Tools
TrEd (TrEd (http://ufal.mff.cuni.cz/tred))– Annotation, View/Browse and Search environmentAnnotation, View/Browse and Search environment– Open source, perlOpen source, perl– Search and visualization: Search and visualization:
Simple data browser (Simple data browser (http://ufal.mff.cuni.cz/pcedt2.0))
PML-TQ: Powerful query language for complex tree-based annotationPML-TQ: Powerful query language for complex tree-based annotation
Treex (Treex (http://ufal.mff.cuni.cz/treex))– Modular NLP processing environmentModular NLP processing environment– Easy handling of complex NLP-annotated dataEasy handling of complex NLP-annotated data– Modules exists for Czech, English data processing Modules exists for Czech, English data processing
incl. 3incl. 3rdrd-party tools integrated into Treex-party tools integrated into Treex
– CPAN-distributedCPAN-distributed
The famous, (almost) “Vauquois” triangle:The famous, (almost) “Vauquois” triangle:
PCEDT and TectogrammaticsPCEDT and Tectogrammaticsin (hybrid) MTin (hybrid) MT
source language (English) target language (Czech)
POS & lemmatization:morphological layer
shallow syntax:analytical layer
deep syntax & semantics:tectogrammatical layer
a-layer
m-layer
w-layer
t-layer
ANALYSIS TRANSFER SYNTHESIS
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1313
Over 90 steps: both Over 90 steps: both rule-basedrule-based and and statisticalstatistical
Analysis-Transfer-SynthesisAnalysis-Transfer-SynthesisHybrid SystemHybrid System
source language (English) target language (Czech)
a-layer
m-layer
w-layer
t-layer
ANALYSIS TRANSFER SYNTHESIS
Tokenization
Lemmatization
Tagging (Compost)
Parsing (MST)
Analytical dep. function
Convert to t-tree
Grammatemes, formemes
Structural transfer
Basic morph. categories
Agreement
Add function words
Concatenate
Generate forms
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1414
Lexicaltransfer (dictionary)& lexical choice
Example TranslationExample Translation
Machine translation should be easy .Tokenized
machine translation should be easy .NN NN MD VB JJ .
Lemmatized & POS tagged
a-layer(parse)
+functions machine
Atr
translationSb
shouldPred
beObj
.AuxK
easyPnom
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1515
Example TranslationExample Translation
Mark function nodes &edges to
“collapse”machineAtr
translationSb
shouldPred
beObj
.AuxK
easyPnom
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1616
Example TranslationExample Translation
T-treebackbone
+formemes
machinen:attr
translationn:subj
bev:fin
easyadj:compl
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1717
Example TranslationExample Translation
T-treebackbone
+formemes
+grammatemes
machinen:attr
translationn:subj
bev:fin
easyadj:compl
Modality=hortConditional=1Tense=PresSim
DoC=PositiveNum=sg
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1818
Example TranslationExample Translation
Transferstarts:
Clone t-tree
počítačstrojovýstroj n:2adj:attrn:attr
převodpřekladposunn:1
mít být v:finv:inf
snadnýjednoduchýadj:compln:1adv:
Modality=hortConditional=1Tense=PresSim
DoC=PositiveNum=sg
Fill in target language
equivalents:*lemmasformemes
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 1919
* Dictionary translation: MaxEnt classifier, ~106 features
Example TranslationExample Translation
Selectbest
combinationof lemmas &Formemes
(HMTM)
počítačstrojovýstroj n:2adj:attrn:attr
převodpřekladposunn:1
mít být v:finv:inf
snadnýjednoduchýadj:compln:1adv:
Modality=hortConditional=1Tense=PresSim
DoC=PositiveNum=sg
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2020
Example TranslationExample Translation
Cloneto a-tree,add core
morphological & POS tags
+agreement
+function words
strojovýDeg=posCase=1Gen=MInanim
překladNum=sgCase=1
mít Gen=MInanimC=PastPNum=sg
snadnýDeg=posCase=1Gen=MInanim
.
.být C=inf
by
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2121
Example TranslationExample Translation
Rearrangeclitics
strojovýDeg=posCase=1Gen=MInanim
překladNum=sgCase=1
mít Gen=MInanimC=PastPNum=sg
snadnýDeg=posCase=1Gen=MInanim
.
.být C=inf
by
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2222
Example TranslationExample Translation
Synthesizeword forms
strojový
překlad
měl
snadný.
být by
... and flatten the tree:
(capitalize, space) Strojový překlad by měl být snadný.
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2323
WMT Constrained task en → cs:WMT Constrained task en → cs:– TectoMT, Moses (Prague), Moses (Edinburgh) tied 1TectoMT, Moses (Prague), Moses (Edinburgh) tied 1stst
Unconstrained:Unconstrained:(subj. eval.)(subj. eval.)
BLEUBLEUAll < 0.17All < 0.17
ResultsResults
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2424
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2525
Acknowledgements:
The FutureThe Future
Non-isomorphic treesNon-isomorphic trees– Better breakdown to treelets and/or parameter training Better breakdown to treelets and/or parameter training
(than in STSG)(than in STSG)
Multiple paths / n-best listsMultiple paths / n-best lists– At least until statistical componentsAt least until statistical components
Combine with Moses (Combine with Moses (using input latticesusing input lattices))– Two „languages“: original Two „languages“: original & Czech by TectoMT& Czech by TectoMT
Moses with syntactic and semantic factorsMoses with syntactic and semantic factors
Still more generalized syntax and semantics Still more generalized syntax and semantics (AMR/MRS and beyond?)(AMR/MRS and beyond?)
Acknowledgements:
Ministry of Education Czech Rep.
LC536, MSM0021620838
Acknowledgements:
Ministry of Education Czech Rep.
ME09008, 7Ennnn
Acknowledgements:
Czech Science Foundation
GAP406/10/0875
Acknowledgements:
Czech Science Foundation
GPP406/10/P193
Acknowledgements:
Czech Science Foundation
GA405/09/0729
Acknowledgements:
“Information Society” Programme
1ET101120503
Acknowledgements:
Charles Univ. student grants
116310, 158010, 3537/2011
Acknowledgements:
European projects (in part)
034434, 034291, 231720, 247762
Acknowledgements:
European projects (part)
249119, 257528
Acknowledgements:
Charles University research funds
(“PRVOUK”)
Dec. 8, 2012Dec. 8, 2012 Hybrid MT Workshop - Coling 2012 Hybrid MT Workshop - Coling 2012 2626
ReferencesReferences
Thank you!Thank you!
Zdeněk Žabokrtský, Martin Popel: Hidden Markov Tree Model in Dependency-based Machine Translation. In ACL 2009, pp. 145-148
David Mareček, Martin Popel, Zdeněk Žabokrtský: Maximum Entropy Translation Model in Dependency-Based MT Framework. Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, ACL 2010, Uppsala, Sweden, pp. 201-206.
Ondřej Dušek, Zdeněk Žabokrtský, Martin Popel, Martin Majliš, Michal Novák and David Mareček: Formemes in English-Czech Deep Syntactic MT. In WMT’12, Montréal, Canada, pp. 267-274.
Martin Popel, Zdeněk Žabokrtský: TectoMT: Modular NLP Framework. IceTAL 2010, 7th International Conference on Natural Language Processing, Reykjavík, Iceland, pp. 293-304.
TectoMT at WMT 12: http://www.statmt.org/wmt12/pdf/WMT02.pdf