Lecture 11: Statistical/Probabilistic Models
for CLIR & Word Alignment
Wen-Hsiang Lu (盧文祥 )
Department of Computer Science and Information Engineering,
National Cheng Kung University
2011/05/30
Cross-Language Information Retrieval
[Diagram: source query → query translation → target translation → information retrieval → target documents]
• Query in the source language and retrieve relevant documents in the target language(s)
• E.g., "Hussein" may be rendered as 海珊 / 侯賽因 / 哈珊 / 胡笙 (Traditional Chinese) or 侯赛因 / 海珊 / 哈珊 (Simplified Chinese)
References
• Philip Resnik and Noah A. Smith. The Web as a Parallel Corpus. Computational Linguistics, Special Issue on the Web as Corpus, 2003.
• Christopher C. Yang and Kar Wing Li. Automatic Construction of English/Chinese Parallel Corpora. Journal of the American Society for Information Science and Technology, 2003.
• Marcello Federico and Nicola Bertoldi. Statistical Cross-Language Information Retrieval using N-Best Query Translations. SIGIR 2002. ITC-irst Centro per la Ricerca Scientifica e Tecnologica.
• Wessel Kraaij, Jian-Yun Nie and Michel Simard. Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval. Computational Linguistics, Special Issue on the Web as Corpus, 2003.
• Colin Cherry and Dekang Lin. A Probability Model to Improve Word Alignment. ACL 2003. University of Alberta.
The Web as Corpus
Outline
• The Web as a Parallel Corpus (Philip Resnik and Noah A. Smith, Computational Linguistics, 2003)
• Automatic Construction of English/Chinese Parallel Corpora (Christopher C. Yang and Kar Wing Li, JASIST, 2003)
• Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval (Wessel Kraaij, Jian-Yun Nie and Michel Simard, Computational Linguistics, 2003)
The Web as a Parallel Corpus
Philip Resnik and Noah A. Smith
Computational Linguistics, Special Issue on the Web as Corpus, 2003
Parallel Corpora
• Bitexts, bodies of text in parallel translation, play an important role in machine translation and multilingual natural language processing.
• Not readily available in the necessary quantities:
  – Canadian parliamentary proceedings (Hansards) in English/French
  – United Nations proceedings (Linguistic Data Consortium, http://www.ldc.upenn.edu/)
  – Religious texts (Resnik, Olsen, and Diab)
  – Localized versions of software manuals (Resnik and Melamed 1997; Menezes and Richardson)
STRAND
• An architecture for structural translation recognition, acquiring natural data (Resnik 1998, 1999)
• Identifies pairs of Web pages that are mutual translations.
• Web page authors disseminate information in multiple languages.
  – When presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure.
Finding Parallel Web Pages
• Finding parallel text on the Web consists of three main steps:
  – Locating pages that might have parallel translations
  – Generating candidate pairs that might be translations
  – Structurally filtering out nontranslation candidate pairs
• Locating pages
  – Two types: parents and siblings
  – Ask AltaVista: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais")
Two types of Website Structure
STRAND
• Generating candidate pairs:
  – Automatic language identification (Dunning 1994)
  – URL matching: manually creating a list of substitution rules
    • E.g., http://mysite.com/english/home_en.html => http://mysite.com/big5/home_ch.html
  – Document length: length(E) ≈ C · length(F)
• Structural filtering
  – The heart of STRAND
  – Markup analyzer: determines a set of pair-specific structural values for translation pairs
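The URL-matching step above can be sketched as a small set of substitution rules applied to each located page. A minimal illustration, with hypothetical rules (only the /english/ → /big5/ pattern comes from the slide's example):

```python
import re

# Hypothetical substitution rules in the spirit of STRAND's URL matching;
# only the /english/ -> /big5/ pattern comes from the slide's example.
SUBSTITUTION_RULES = [
    (r"/english/", "/big5/"),
    (r"_en\.html$", "_ch.html"),
    (r"/en/", "/zh/"),
]

def candidate_counterparts(url):
    """Apply each rule independently to propose translation-page URLs."""
    candidates = set()
    for pattern, replacement in SUBSTITUTION_RULES:
        new_url, n_subs = re.subn(pattern, replacement, url)
        if n_subs:
            candidates.add(new_url)
    return candidates

# A page under /english/ with an _en.html suffix yields two candidates.
pairs = candidate_counterparts("http://mysite.com/english/home_en.html")
```

In STRAND each generated counterpart URL would then be fetched and passed on to language identification and structural filtering; here we only generate the candidates.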
Automatic Construction of English/Chinese Parallel Corpora
Christopher C. Yang and Kar Wing Li
Journal of the American Society for Information Science and Technology, 2003
Web Parallel Corpora
• Some web sites with bilingual text contain a completely separate monolingual sub-tree for each language.
• Approach: align bilingual titles and match them with dynamic programming
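The dynamic-programming matching mentioned above can be illustrated with a longest-common-subsequence similarity between titles. This is a generic sketch, not the exact measure used by Yang and Li; the titles are made up:

```python
def lcs_length(a, b):
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def title_similarity(title_a, title_b):
    """Symmetric LCS similarity in [0, 1] over word tokens."""
    a, b = title_a.split(), title_b.split()
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))

sim = title_similarity("annual report 2001", "annual report 2001 full text")
```

Title pairs whose similarity exceeds a threshold would be accepted as aligned; a bilingual variant would first map one title into the other language via a dictionary.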
Statistical Cross-Language Information Retrieval using N-Best Query Translations
Marcello Federico and Nicola Bertoldi
ITC-irst Centro per la Ricerca Scientifica e Tecnologica
Outline
• Statistical CLIR approach
• Query-document model
• Query-translation model
Statistical CLIR Approach
• CLIR problem
  – Given a query i in the source language (Italian), find relevant documents d in the target language (English) within a collection D.
  – Rank documents by P(d | i) ∝ P(i, d).
  – To bridge the language difference between query and documents, a hidden variable e is introduced, representing an English translation of i.
Statistical CLIR Approach
– P(e,d) is computed by the query-document model– P(i,e) is computed by the query-translation model
P(i, d) = Σ_e P(i, e, d) = Σ_e P(i, e) P(d | e) = Σ_e P(i, e) · P(e, d) / Σ_{d'} P(e, d')
Query-Document Model
• Statistical LM & smoothing
  – Term frequencies of a document are smoothed linearly, and the amount of probability assigned to never-observed terms is proportional to the size of the document vocabulary.

P(q, d) = P(q | d) P(d)
P(q_1 ... q_n | d) = Π_{k=1..n} P(q_k | d)

P(q | d) = (N(q, d) + |V(d)| · P(q)) / (N(d) + |V(d)|)   (local)
P(q) = (N(q) + 1) / (N + |V|)   (global)

where N(q, d) is the frequency of q in d, N(d) is the length of d, |V(d)| is the document vocabulary size, N is the collection size, and |V| is the collection vocabulary size.
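A minimal sketch of the query-document model above, scoring a query against a document with the linearly smoothed term probabilities (toy counts, hypothetical terms):

```python
import math
from collections import Counter

def global_prob(term, coll_counts, coll_size, vocab_size):
    """Global model: P(q) = (N(q) + 1) / (N + |V|)."""
    return (coll_counts.get(term, 0) + 1) / (coll_size + vocab_size)

def local_prob(term, doc_counts, coll_counts, coll_size, vocab_size):
    """Local model: P(q|d) = (N(q,d) + |V(d)| P(q)) / (N(d) + |V(d)|)."""
    n_d = sum(doc_counts.values())   # N(d): document length
    v_d = len(doc_counts)            # |V(d)|: document vocabulary size
    p_q = global_prob(term, coll_counts, coll_size, vocab_size)
    return (doc_counts.get(term, 0) + v_d * p_q) / (n_d + v_d)

def query_log_prob(query, doc, coll_counts, coll_size, vocab_size):
    """log P(q_1 ... q_n | d) = sum_k log P(q_k | d)."""
    doc_counts = Counter(doc)
    return sum(math.log(local_prob(t, doc_counts, coll_counts,
                                   coll_size, vocab_size)) for t in query)

# Toy collection of 5 tokens over vocabulary {a, b, c}.
coll = {"a": 2, "b": 2, "c": 1}
p_a = local_prob("a", Counter(["a", "a", "b"]), coll, 5, 3)
```

Because the global model already sums to one over the vocabulary, the smoothed local model does too, which makes it a proper unigram distribution per document.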
Query-Translation Model
• According to the HMM:

P(i_1 ... i_n, e_1 ... e_n) = P(e_1) P(i_1 | e_1) Π_{k=2..n} P(e_k | e_{k−1}) P(i_k | e_k)

• Determine N-best translations
  – The most probable translation e* can be computed through the Viterbi search algorithm.
  – Intermediate results of the Viterbi algorithm can be used by the A* search algorithm to efficiently compute the N most probable translations of i.
Query-Translation Model
• P(i | e) is estimated from a bilingual dictionary:

P(i | e) = δ(i, e) / Σ_{i'} δ(i', e),  where δ(i, e) = 1 if (i, e) is a translation pair, 0 otherwise

• P(e | e') is estimated on the target document collection (order-free bigram LM):

P(e | e') = C(e, e') / Σ_{e''} C(e'', e')

• Smoothing (absolute discounting):

P(e | e') = max(C(e, e') − β, 0) / Σ_{e''} C(e'', e') + β(e') · P(e),  with β = n_1 / (n_1 + 2 n_2)

where C(e, e') is the number of co-occurrences appearing in the corpus, P(e) is estimated as described above, n_1 and n_2 represent the number of term pairs occurring once and twice in the corpus, and β(e') is the probability mass freed by discounting.
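A sketch of estimating the order-free co-occurrence bigram with absolute discounting (β = n1 / (n1 + 2·n2)); the redistribution of the discounted mass via the unigram P(e) is one standard discounting variant, not necessarily the paper's exact scheme, and the documents are toy data:

```python
from collections import Counter

def train_cooc_bigram(docs, beta=None):
    """Order-free 'bigram' P(e | e') from document-level co-occurrence
    counts C(e, e'), smoothed by absolute discounting with
    beta = n1 / (n1 + 2 * n2) by default."""
    cooc, unigram = Counter(), Counter()
    for doc in docs:
        terms = set(doc)
        unigram.update(terms)
        cooc.update((e, e2) for e in terms for e2 in terms if e != e2)
    n1 = sum(1 for c in cooc.values() if c == 1)   # pairs seen once
    n2 = sum(1 for c in cooc.values() if c == 2)   # pairs seen twice
    if beta is None:
        beta = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) else 0.5
    total = sum(unigram.values())

    def prob(e, e_prev):
        denom = sum(c for (_, b), c in cooc.items() if b == e_prev)
        if denom == 0:
            return unigram[e] / total            # back off to the unigram
        seen = sum(1 for (_, b) in cooc if b == e_prev)
        discounted = max(cooc[(e, e_prev)] - beta, 0) / denom
        # Mass freed by discounting is redistributed proportionally to P(e).
        return discounted + beta * seen / denom * (unigram[e] / total)

    return prob

prob = train_cooc_bigram([["a", "b"], ["a", "b"], ["a", "c"]])
mass = prob("a", "a") + prob("b", "a") + prob("c", "a")
```

The redistribution term is constructed so that P(· | e') still sums to one, which the test below checks on the toy corpus.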
CLIR Algorithm
• Use two approximations to limit the set of possible translations and documents:
  – Appr. 1: P'(i, e) = P(i, e) if e ∈ T_N(i), 0 otherwise, where T_N(i) is the set of N-best translations of i
  – Appr. 2: P'(e, d) = P(e, d) if d ∈ D_K(e), 0 otherwise, where D_K(e) is the set of K top-ranked documents for e

P(i, d) = Σ_e P(i, e) · P(e, d) / Σ_{d'} P(e, d')
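The two pruning approximations can be combined into one scoring loop; the translation and document probabilities below are toy values, and the normalization Σ_{d'} is taken over the kept documents:

```python
def clir_scores(translations, doc_models, n_best, top_k):
    """P(i, d) ~ sum over the N best translations e of
       P(i, e) * P(e, d) / sum_{d'} P(e, d').

    translations: list of (e, P(i, e)) pairs for the source query i
    doc_models:   e -> {d: P(e, d)} (hypothetical query-document scores)
    """
    scores = {}
    # Appr. 1: keep only the N most probable translations of i.
    best = sorted(translations, key=lambda t: t[1], reverse=True)[:n_best]
    for e, p_ie in best:
        # Appr. 2: keep only the K top-ranked documents for e.
        kept = dict(sorted(doc_models.get(e, {}).items(),
                           key=lambda kv: kv[1], reverse=True)[:top_k])
        norm = sum(kept.values())
        if norm == 0:
            continue
        for d, p_ed in kept.items():
            scores[d] = scores.get(d, 0.0) + p_ie * p_ed / norm
    return scores

scores = clir_scores([("cat", 0.6), ("feline", 0.3), ("kitty", 0.1)],
                     {"cat": {"d1": 0.4, "d2": 0.1},
                      "feline": {"d1": 0.2, "d2": 0.2}},
                     n_best=2, top_k=3)
```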
Complexity of CLIR Algorithm
• n: query length
• N: number of generated translations
• I: average number of translations of a term
• average number of documents spanned by each entry of the inverted file index
Text Preprocessing
Blind Relevance Feedback
• The R most relevant terms are selected from the top B ranked documents according to:

score(w) = r_w · log [ (r_w + 0.5)(N − N_w − B + r_w + 0.5) / ((N_w − r_w + 0.5)(B − r_w + 0.5)) ]

where r_w is the number of documents containing term w among the top B documents, N_w is the number of documents in the collection containing w, and N is the collection size.
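A sketch of the feedback term selection formula above; the term statistics are hypothetical:

```python
import math

def rsj_term_score(r_w, n_w, coll_size, b):
    """score(w) = r_w * log[(r_w+0.5)(N-N_w-B+r_w+0.5) /
                            ((N_w-r_w+0.5)(B-r_w+0.5))]."""
    num = (r_w + 0.5) * (coll_size - n_w - b + r_w + 0.5)
    den = (n_w - r_w + 0.5) * (b - r_w + 0.5)
    return r_w * math.log(num / den)

def select_feedback_terms(term_stats, coll_size, b, r):
    """Keep the R highest-scoring terms; term_stats maps w -> (r_w, N_w)."""
    scored = sorted(term_stats.items(),
                    key=lambda kv: rsj_term_score(kv[1][0], kv[1][1],
                                                  coll_size, b),
                    reverse=True)
    return [w for w, _ in scored[:r]]

# Hypothetical statistics: "reactor" appears in 5 of the top 10 documents
# but only 10 of 1000 overall; "the" is frequent everywhere.
top = select_feedback_terms({"reactor": (5, 10), "the": (5, 500)},
                            coll_size=1000, b=10, r=1)
```

Terms concentrated in the top-ranked documents but rare in the collection score highest, which is exactly the behavior blind relevance feedback relies on.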
Comparison with other CLIR Models
• Hiemstra (1999):

P(i | d) = Π_{k=1..n} [ α_k P(i_k) + (1 − α_k) Σ_{e_k} P(i_k | e_k) P(e_k | d) ]

• Xu (2001):

P(i | d) = Π_{k=1..n} P(i_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k, e_k | d) = Π_{k=1..n} Σ_{e_k} P(i_k | e_k) P(e_k | d)
Term Translation Model using Search Result Pages
• Apply page authority to search-result-based translation extraction:

P(t | q) ≈ Σ_{dr} P(t | dr) P(dr | q) = (1 / P(q)) Σ_{dr} P(t | dr) P(q | dr) P(dr)

where P(dr) = (# links to dr) / (# total links), and dr ranges over the retrieved search result pages.
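A sketch of the search-result translation scoring with link-count page authority; the pages, terms, and link counts are toy values, and P(t | dr), P(q | dr) are approximated here by relative term frequency:

```python
def translation_score(t, q, result_pages):
    """P(t|q) ~ sum_dr P(t|dr) P(q|dr) P(dr), up to the constant 1/P(q).

    result_pages: list of {'terms': [...], 'links': int}; P(dr) is the
    page-authority prior  #links(dr) / #total links.
    """
    total_links = sum(p["links"] for p in result_pages)
    score = 0.0
    for page in result_pages:
        terms = page["terms"]
        p_t = terms.count(t) / len(terms)    # P(t | dr)
        p_q = terms.count(q) / len(terms)    # P(q | dr)
        p_dr = page["links"] / total_links   # page authority
        score += p_t * p_q * p_dr
    return score

# Toy mixed-language result pages for the query term "cisco".
pages = [{"terms": ["思科", "cisco", "router"], "links": 8},
         {"terms": ["cisco", "news"], "links": 2}]
s_trans = translation_score("思科", "cisco", pages)   # candidate translation
s_noise = translation_score("news", "cisco", pages)   # co-occurring noise
```

The authority prior lets highly linked pages dominate the score, so a translation appearing on reputable pages outranks noise from obscure ones.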
Embedding Web-Based Statistical Translation Models
in Cross-Language Information Retrieval
Wessel Kraaij, Jian-Yun Nie and Michel Simard
Computational Linguistics, Special Issue on the Web as Corpus, 2003
Web Mining for CLIR
• The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically.
• The resulting translation models can be embedded in several ways in a retrieval model.
• Conventional approach: IR + MT (machine translation)
Problems in Query Translation
• Finding translations
  – Lexical coverage: proper names and abbreviations
  – Transliteration: phonemic representation of a named entity
    • Jeltsin, Eltsine, Yeltsin, and Jelzin (in Latin script)
• Pruning translation alternatives
• Weighting translation alternatives
Exploitation of Parallel Texts
• Using a pseudofeedback approach (Yang et al. 1998)
• Capturing global cross-language term associations (Yang et al. 1998; Lavrenko 2002)
• Transposing to a language-independent semantic space (Dumais et al. 1997; Yang et al. 1998)
• Training a statistical translation model (Nie et al. 1999; Franz et al. 2001; Hiemstra 2001; Xu et al. 2001)
Mining Process in PTMiner
Embedding Translation into IR Model
• Basic language model
• Normalized log-likelihood ratio (NLLR)
Embedding Translation into IR Model
• Query model
• Document model
• Basic language model (log-likelihood ratio; normalized log-likelihood ratio)
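As a hedged illustration of the NLLR mentioned above: one common formulation interpolates the document model with a background collection model, so a document scores by how much better it explains the query than the collection does. The smoothing weight λ and the exact form are assumptions, not necessarily the paper's:

```python
import math
from collections import Counter

def nllr(query, doc, collection, lam=0.5):
    """NLLR(Q; D) = sum_t P(t|Q) * log( (lam*P(t|D) + (1-lam)*P(t|C)) / P(t|C) )."""
    q, d, c = Counter(query), Counter(doc), Counter(collection)
    score = 0.0
    for t, qc in q.items():
        p_tq = qc / len(query)        # P(t | Q)
        p_td = d[t] / len(doc)        # P(t | D)
        p_tc = c[t] / len(collection) # P(t | C), background model
        if p_tc == 0:
            continue  # a term unseen in the collection carries no evidence
        score += p_tq * math.log((lam * p_td + (1 - lam) * p_tc) / p_tc)
    return score

collection = ["a", "b", "c", "a", "b"]
good = nllr(["a"], ["a", "a"], collection)  # doc contains the query term
bad = nllr(["a"], ["c", "c"], collection)   # doc lacks the query term
```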
A Probability Model to Improve Word Alignment
Colin Cherry and Dekang Lin, University of Alberta
Outline
• Introduction
• Probabilistic Word-Alignment Model
• Word-Alignment Algorithm
  – Constraints
  – Features
Introduction
• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation.
  – E.g., translation lexicons, transfer rules
• Word-alignment problem
  – Conventional approaches usually use co-occurrence models
    • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
  – Indirect association problem: Melamed (2000) proposed competitive linking along with an explicit noise model to solve it
• Goal: propose a probabilistic word-alignment model which allows easy integration of context-specific features.
score(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]

[Example: aligning "CISCO System Inc." with 思科 系統; the unmatched "Inc." is accounted for by the noise model]
Probabilistic Word-Alignment Model
• Given E = e1, e2, …, em , F = f1, f2, …, fn
– If ei and fj are translation pair, then link l(ei, fj) exists
– If ei has no corresponding translation, then null link l(ei, f0) exists
– If fj has no corresponding translation, then null link l(e0, fj) exists
– An alignment A is a set of links such that every word in E and F participates in at least one link
• Alignment problem is to find alignment A to maximize P(A|E, F)
• IBM’s translation model: maximize P(A, F|E)
Probabilistic Word-Alignment Model (Cont.)
• Given A = {l_1, l_2, …, l_t}, where l_k = l(e_ik, f_jk); let l_i^j denote consecutive subsets of A, l_i^j = {l_i, l_{i+1}, …, l_j}
• Let C_k = {E, F, l_1^{k−1}} represent the context of l_k

P(A | E, F) = P(l_1^t | E, F) = Π_{k=1..t} P(l_k | E, F, l_1^{k−1}) = Π_{k=1..t} P(l_k | C_k)
P(l_k | C_k) = P(l_k, C_k) / P(C_k) = P(C_k | l_k) P(l_k) / P(C_k) = P(l_k | e_ik, f_jk) · P(C_k | l_k) / P(C_k | e_ik, f_jk)

using

P(C_k, e_ik, f_jk) = P(C_k) P(e_ik, f_jk | C_k),  with P(e_ik, f_jk | C_k) = 1
P(l_k, e_ik, f_jk) = P(l_k) P(e_ik, f_jk | l_k),  with P(e_ik, f_jk | l_k) = 1
Probabilistic Word-Alignment Model (Cont.)
• C_k = {E, F, l_1^{k−1}} is too complex to estimate
• FT_k is a set of context-related features such that P(l_k | C_k) can be approximated by P(l_k | e_ik, f_jk, FT_k)
• Let C_k' = {e_ik, f_jk} ∪ FT_k

P(l_k | C_k) ≈ P(l_k | C_k') = P(l_k | e_ik, f_jk) · P(FT_k | l_k) / P(FT_k | e_ik, f_jk)

Assuming the features in FT_k are conditionally independent:

P(A | E, F) ≈ Π_{k=1..t} [ P(l_k | e_ik, f_jk) · Π_{ft ∈ FT_k} P(ft | l_k) / P(ft | e_ik, f_jk) ]
An Illustrative Example
Word-Alignment Algorithm
• Input: E, F, T_E
  – T_E is E's dependency tree, which enables us to make use of features and constraints based on linguistic intuitions
• Constraints
  – One-to-one constraint: every word participates in exactly one link
  – Cohesion constraint: use T_E to induce T_F with no crossing dependencies
Word-Alignment Algorithm (Cont.)
• Features
  – Adjacency features ft_a: for any word pair (e_i, f_j), if a link l(e_i', f_j') exists where −2 ≤ i'−i ≤ 2 and −2 ≤ j'−j ≤ 2, then ft_a(i−i', j−j', e_i') is active for this context.
  – Dependency features ft_d: for any word pair (e_i, f_j), let e_i' be the governor of e_i, and let rel be the grammatical relationship between them. If a link l(e_i', f_j') exists, then ft_d(j−j', rel) is active for this context.
  – E.g., for the pair (the, l'), ft_a(1, 1, host) is active; for the pair (the, les), ft_d(1, det) and ft_d(3, det) are active.
Experimental Results
• Test bed: Hansard corpus
  – Training: 50K aligned pairs of sentences (Och & Ney 2000)
  – Testing: 500 pairs
Future Work
• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignment will be pursued.
• The proposed model itself is capable of creating many-to-one alignments, as long as the null probabilities of the words added on the "many" side are taken into account.