TRANSCRIPT
A General Optimization Framework for Smoothing Language Models on
Graph Structures
Qiaozhu Mei, Duo Zhang, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Kullback-Leibler Divergence Retrieval Method

Document d: a text mining paper (about data mining)
Doc Language Model (LM) θd, p(w|d): text 4/100 = 0.04, mining 3/100 = 0.03, clustering 1/100 = 0.01, ..., data = 0, computing = 0, ...
Smoothed Doc LM θd', p(w|d'): text = 0.039, mining = 0.028, clustering = 0.01, ..., data = 0.001, computing = 0.0005, ...

Query q: "data mining"
Query Language Model θq, p(w|q): data 1/2 = 0.5, mining 1/2 = 0.5
Smoothed Query LM, p(w|q'): data = 0.4, mining = 0.4, clustering = 0.1, ...

Similarity function: D(θq || θd) = Σ_{w∈V} p(w|q) log [ p(w|q) / p(w|d) ]
2
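The scoring step on this slide can be sketched in Python; a minimal sketch, where `neg_kl_score` is an illustrative name and the toy probabilities are the assumed values from the slide (ranking by the negative KL divergence is rank-equivalent to the similarity above):

```python
import math

def neg_kl_score(query_lm, doc_lm):
    """Rank score: -D(theta_q || theta_d) = -sum_w p(w|q) * log(p(w|q) / p(w|d)).
    The doc LM must be smoothed so that p(w|d) > 0 wherever p(w|q) > 0."""
    return -sum(p_q * math.log(p_q / doc_lm[w])
                for w, p_q in query_lm.items() if p_q > 0)

# Toy models with the (assumed) values from this slide.
query_lm = {"data": 0.5, "mining": 0.5}
smoothed_doc_lm = {"text": 0.039, "mining": 0.028, "clustering": 0.01,
                   "data": 0.001, "computing": 0.0005}
score = neg_kl_score(query_lm, smoothed_doc_lm)  # higher (less negative) = more similar
```

Note that with the unsmoothed model, p(data|d) = 0 would make the logarithm undefined, which is exactly why smoothing is needed before scoring.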
Smoothing a Document Language Model
3
Estimate LM → smooth LM → retrieval performance
MLE P_MLE(w|d): text 4/100 = 0.04, mining 3/100 = 0.03, Assoc. 1/100 = 0.01, clustering 1/100 = 0.01, ..., data = 0, computing = 0, ...
Goal 1 – assign non-zero prob. to unseen words: text = 0.039, mining = 0.028, Assoc. = 0.009, clustering = 0.01, ..., data = 0.001, computing = 0.0005, ...
Goal 2 – estimate a more accurate distribution from sparse data: text = 0.038, mining = 0.026, Assoc. = 0.008, clustering = 0.01, ..., data = 0.002, computing = 0.001, ...
E.g., interpolation with the collection model P(w|collection):
P(w|d) = (1 − λ) P_MLE(w|d) + λ P(w|collection)
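The interpolation above can be sketched directly; a minimal sketch assuming simple token lists, with `jelinek_mercer` as an illustrative name for this classic interpolation smoothing:

```python
from collections import Counter

def jelinek_mercer(doc_tokens, collection_tokens, lam=0.1):
    """p(w|d) = (1 - lam) * p_MLE(w|d) + lam * p(w|collection)."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    n_d, n_c = len(doc_tokens), len(collection_tokens)
    # Every word of the collection vocabulary receives non-zero mass;
    # assumes the collection contains all document tokens.
    return {w: (1 - lam) * doc_counts[w] / n_d + lam * c / n_c
            for w, c in coll_counts.items()}

lm = jelinek_mercer(["text", "text", "mining"],
                    ["text", "mining", "data", "data"], lam=0.2)
# "data" never occurs in the document, yet lm["data"] > 0
```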
Previous Work on Smoothing
Common recipe: estimate a reference language model θref from the corpus, then interpolate the MLE with it:
P(w|d) = (1 − λ) P_MLE(w|d) + λ P(w|θref)
- Reference = the whole collection: interpolate d with the collection [Ponte & Croft 98]
- Reference = document clusters: interpolate d with its cluster [Liu & Croft 04]
- Reference = nearest neighbors: interpolate d with its neighbor documents, forming a pseudo-document d̃ [Kurland & Lee 04]
4
Problems of Existing Methods
• Smoothing with global background – ignores the collection structure
• Smoothing with document clusters – ignores local structures inside a cluster
• Smoothing using neighbor documents – ignores the global structure
• Different heuristics on θref and interpolation – no clear objective function for optimization; no guidance on how to further improve the existing methods
5
Research Questions
• What is the right corpus structure to use?
• What are the criteria for a good smoothing method? – Accurate language model?
• What do existing methods end up optimizing?
• Could there be a general optimization framework?
6
Our Contribution
• Formulation of smoothing as optimization over graph structures
• A general optimization framework for smoothing both document LMs and query LMs
• Novel instantiations of the framework lead to more effective smoothing methods
7
A Graph-based Formulation of Smoothing
• A novel and general view of smoothing
8
[Figure: documents d1, d2, ... of the collection drawn as a graph; for a word w, the values P(w|d1), P(w|d2), ... form a surface on top of the graph – the MLE surface is rugged, the smoothed surface is smooth; the projection on a plane shows P(w|d) per document]
Collection = Graph (of documents)
Smoothed LM = Smoothed Surface!
Covering Existing Models
9
[Figure: the three existing schemes drawn as graphs – a star graph linking each document d to the background node; a forest linking documents to cluster roots C1–C4 (pseudo docs); a local graph linking d to its nearest neighbors]
Smoothing with graph structure subsumes existing methods:
- Smoothing with global background = star graph
- Smoothing with document clusters = forest w/ pseudo docs
- Smoothing with nearest neighbors = local graph
In all cases: Collection = Graph; Smoothed LM = Smoothed Surfaces
Instantiations of the Formulation
10
Types of Graphs               | Language Models to be Smoothed: Document LM                                               | Query LM
Star graph w/ background node | [Ponte & Croft 98], [Hiemstra & Kraaij 98], [Miller et al. 99], [Zhai & Lafferty 01], ... | N/A
Forest w/ cluster roots       | [Liu and Croft 04]                                                                        | N/A
Local kNN graph               | [Kurland and Lee 04], [Tao et al. 06]                                                     | N/A
Document similarity graph     | Novel                                                                                     | N/A
Word similarity graph         | Novel                                                                                     | Novel
Other graphs?                 | ?                                                                                         | ?
(The first four rows are document graphs; the word similarity graph is a word graph.)
Smoothing over Word Graphs
[Figure: a similarity graph of words; at vertices wu and wv the surface takes the values P(wu|d)/Deg(u) and P(wv|d)/Deg(v)]
Given d, {P(w|d)} = surface over the word graph!
Smoothed LM = Smoothed Surface!
11
The General Objective of Smoothing
12
O(C) = (1 − λ) Σ_{u∈V} w(u) (f_u − f̃_u)² + λ Σ_{(u,v)∈E} w(u,v) (f_u − f_v)²
- Σ_{u∈V} w(u) (f_u − f̃_u)² : fidelity to the MLE (f̃_u is the MLE-based value, f_u the smoothed one)
- Σ_{(u,v)∈E} w(u,v) (f_u − f_v)² : smoothness of the surface
- w(u): importance of vertices
- w(u,v): weights of edges (1/dist.)
The Optimization Framework
13
• Criteria:
 – Fidelity: keep close to the MLE
 – Surface smoothness: local and global consistency
 – Constraint: ∀d, Σ_w p(w|d) = 1
• Unified optimization objective:
 O(C) = (1 − λ) Σ_{u∈V} w(u) (f_u − f̃_u)²  [fidelity to MLE]
      + λ Σ_{(u,v)∈E} w(u,v) (f_u − f_v)²  [smoothness of the surface]
• Smoothing: find f_u = argmin_f O(C)
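The objective can be written directly in code; a sketch, where the dict-based vertex and edge containers are our own convention, not from the paper:

```python
def objective(f, f_tilde, w_node, edges, lam):
    """O(C) = (1 - lam) * sum_u w(u) * (f_u - f~_u)^2
            + lam * sum_{(u,v) in E} w(u,v) * (f_u - f_v)^2
    f, f_tilde, w_node: dicts keyed by vertex; edges: {(u, v): weight}."""
    fidelity = sum(w_node[u] * (f[u] - f_tilde[u]) ** 2 for u in f)
    smoothness = sum(w_uv * (f[u] - f[v]) ** 2
                     for (u, v), w_uv in edges.items())
    return (1 - lam) * fidelity + lam * smoothness
```

With λ = 0 the optimum is the MLE itself; with λ = 1 any constant surface is optimal, so λ trades fidelity against smoothness.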
The Procedure of Smoothing
14
1. Define the graph: construct a document/word graph; define reasonable w(u) and w(u,v), e.g.
   w(u) = Deg(u) = Σ_{v: (u,v)∈E} w(u,v)
2. Define the surfaces: define reasonable f_u
3. Smooth the surfaces: setting
   ∂O(C)/∂f_u = 2(1 − λ) Deg(u) (f_u − f̃_u) + 2λ Σ_{v∈V} w(u,v) (f_u − f_v) = 0
   gives the iterative update
   f_u ← (1 − λ) f̃_u + λ Σ_{v∈V} [w(u,v)/Deg(u)] f_v
4. Apply additional Dirichlet smoothing
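Step 3 can be implemented as a simple fixed-point iteration; a sketch under the assumption w(u) = Deg(u), using dict-based graph containers of our own and assuming every vertex has at least one edge:

```python
def smooth_surface(f_tilde, edges, lam=0.5, iters=50):
    """Iterate f_u <- (1-lam) * f~_u + lam * sum_v [w(u,v)/Deg(u)] * f_v.
    f_tilde: dict {vertex: MLE value}; edges: {(u, v): weight}, undirected.
    Assumes Deg(u) > 0 for every vertex."""
    # Build a symmetric adjacency map and degrees Deg(u) = sum_v w(u,v).
    nbrs = {u: {} for u in f_tilde}
    for (u, v), w_uv in edges.items():
        nbrs[u][v] = w_uv
        nbrs[v][u] = w_uv
    deg = {u: sum(ws.values()) for u, ws in nbrs.items()}
    f = dict(f_tilde)
    for _ in range(iters):
        f = {u: (1 - lam) * f_tilde[u]
                + lam * sum(w_uv * f[v] for v, w_uv in nbrs[u].items()) / deg[u]
             for u in f_tilde}
    return f
```

Because the neighbor average is a contraction scaled by λ < 1, the iteration converges to the unique minimizer of O(C).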
Smoothing Language Models using a Document Graph
15
1. Define the graph: construct a kNN graph of documents; w(u) = Deg(u), w(u,v) = cosine similarity
2. Define the surfaces: f_u = p(w|d_u), or f_u = s(q, d_u)
3. Smooth the surfaces:
   Document language model:
   P(w|d_u) = (1 − λ) P_MLE(w|d_u) + λ Σ_{v∈V} [w(u,v)/Deg(u)] P(w|d_v)
   Alternative – document relevance score, e.g., (Diaz 05):
   s(q, d_u) = (1 − λ) s̃(q, d_u) + λ Σ_{v∈V} [w(u,v)/Deg(u)] s(q, d_v)
4. Apply additional Dirichlet smoothing
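A one-pass version of the document-graph update can be sketched as follows; an illustrative sketch only: it uses a complete cosine-weighted graph instead of a kNN cut, and omits the Dirichlet step:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity of two term-count vectors (Counters)."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def smooth_doc_lms(docs, lam=0.3):
    """One pass of p(w|d_u) <- (1-lam) p_MLE(w|d_u)
                            + lam * sum_v [w(u,v)/Deg(u)] p(w|d_v)."""
    counts = [Counter(d) for d in docs]
    mle = [{w: c / len(d) for w, c in cnt.items()}
           for cnt, d in zip(counts, docs)]
    n = len(docs)
    sim = [[cosine(counts[u], counts[v]) if u != v else 0.0
            for v in range(n)] for u in range(n)]
    deg = [sum(row) for row in sim]
    vocab = {w for d in docs for w in d}
    smoothed = []
    for u in range(n):
        lm = {}
        for w in vocab:
            nbr = (sum(sim[u][v] * mle[v].get(w, 0.0) for v in range(n)) / deg[u]
                   if deg[u] else 0.0)
            lm[w] = (1 - lam) * mle[u].get(w, 0.0) + lam * nbr
        smoothed.append(lm)
    return smoothed
```

Because the neighbor term is a convex combination of proper distributions, each smoothed model still sums to 1.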
Smoothing Language Models using a Word Graph
16
1. Define the graph: construct a kNN graph of words; w(u) = Deg(u), w(u,v) = PMI
2. Define the surfaces: f_u = P(w_u|d)/Deg(u) for a document LM, or f_u = P(w_u|q)/Deg(u) for a query LM
3. Smooth the surfaces:
   Document language model:
   P(w_u|d) = (1 − λ) P_MLE(w_u|d) + λ Σ_{v∈V} [w(u,v)/Deg(v)] P(w_v|d)
   Query language model:
   P(w_u|q) = (1 − λ) P_MLE(w_u|q) + λ Σ_{v∈V} [w(u,v)/Deg(v)] P(w_v|q)
4. Apply additional Dirichlet smoothing
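The word-graph update differs from the document-graph one in its Deg(v) denominator, which follows from the surface definition f_u = P(w_u|d)/Deg(u). A one-pass sketch, with arbitrary assumed edge weights standing in for PMI:

```python
def smooth_lm_on_word_graph(mle, edges, lam=0.3):
    """One pass of p(w_u|d) <- (1-lam) p_MLE(w_u|d)
                            + lam * sum_v [w(u,v)/Deg(v)] p(w_v|d).
    mle: dict {word: probability}; edges: {(u, v): weight}, undirected."""
    nbrs = {u: {} for u in mle}
    for (u, v), w_uv in edges.items():
        nbrs[u][v] = w_uv
        nbrs[v][u] = w_uv
    deg = {u: sum(ws.values()) for u, ws in nbrs.items()}
    return {u: (1 - lam) * p
                + lam * sum(w_uv * mle[v] / deg[v] for v, w_uv in nbrs[u].items())
            for u, p in mle.items()}
```

The Deg(v) denominator makes each word v distribute its mass over its edges, so total probability mass is preserved without renormalization.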
Intuitive Interpretation – Smoothing using a Word Graph
17
P(w_u|d) = (1 − λ) P_ML(w_u|d) + λ Σ_{v∈V} [w(u,v)/Deg(v)] P(w_v|d)
The smoothed LM is the stationary distribution of a Markov chain over words, with transition probabilities P(u → v). Writing a document = a random walk on the word Markov chain: write down w whenever passing w.
Intuitive Interpretation – Smoothing using a Document Graph
P(w|d_u) = (1 − λ) P_ML(w|d_u) · 1 + (1 − λ)(1 − P_ML(w|d_u)) · 0 + λ Σ_{v∈V} [w(u,v)/Deg(u)] P(w|d_v)
The smoothed probability is the absorption probability into the "1" state of a Markov chain over documents, with P(u → 1) = (1 − λ) P_ML(w|d_u), P(u → 0) = (1 − λ)(1 − P_ML(w|d_u)), and P(u → v) = λ w(u,v)/Deg(u). Writing a word w in a document = a random walk on the doc Markov chain: write down w if reaching "1".
Act as neighbors do.
18
Experiments

Data Sets | # docs | Avg doc length | Queries | # relevant docs
AP88-90   | 243k   | 273            | 51-150  | 21829
LA        | 132k   | 290            | 301-400 | 2350
SJMN      | 90k    | 266            | 51-150  | 4881
TREC8     | 528k   | 477            | 401-450 | 4728
19
Methods to evaluate (against Dirichlet smoothing and the methods of Liu and Croft '04 and Tao '06):
• Smooth Document LM on Document Graph (DMDG)
• Smooth Document LM on Word Graph (DMWG)
• Smooth Relevance Score on Document Graph (DSDG)
• Smooth Query LM on Word Graph (QMWG)
• Evaluate using MAP
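MAP, the evaluation measure used in these experiments, follows the standard definition; a sketch with function names of our own:

```python
def average_precision(ranked_ids, relevant):
    """AP for one query: mean of precision@k over ranks k of relevant docs,
    divided by the total number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```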
Effectiveness of the Framework
20
Data Sets | Dirichlet | DMDG              | DMWG †            | DSDG              | QMWG
AP88-90   | 0.217     | 0.254*** (+17.1%) | 0.252*** (+16.1%) | 0.239*** (+10.1%) | 0.239 (+10.1%)
LA        | 0.247     | 0.258** (+4.5%)   | 0.257** (+4.5%)   | 0.251** (+1.6%)   | 0.247
SJMN      | 0.204     | 0.231*** (+13.2%) | 0.229*** (+12.3%) | 0.225*** (+10.3%) | 0.219 (+7.4%)
TREC8     | 0.257     | 0.271*** (+5.4%)  | 0.271** (+5.4%)   | 0.261 (+1.6%)     | 0.260 (+1.2%)
† DMWG: reranking the top 3000 results; this usually yields lower performance than ranking all the documents.
Wilcoxon test: *, **, *** mean significance levels 0.1, 0.05, 0.01.
Graph-based smoothing >> baseline; smoothing Doc LM >> relevance score >> Query LM.
Comparison with Existing Models
21
Data Sets | CBDM (Liu and Croft) | DELM (Tao et al.) | DMDG  | DMDG (1 iteration)
AP88-90   | 0.233                | 0.250             | 0.254 | 0.252
LA        | 0.259                | 0.265             | 0.260 | 0.258
SJMN      | 0.217                | 0.227             | 0.235 | 0.229
TREC8     | N/A                  | 0.267             | 0.271 | 0.270
Graph-based smoothing > state of the art; more iterations > a single iteration (similar to DELM).
Combined with Pseudo-Feedback
22
Data Sets | FB    | FB+QMWG
AP88-90   | 0.271 | 0.273
LA        | 0.258 | 0.267
SJMN      | 0.245 | 0.246
TREC8     | 0.278 | 0.280

Data Sets | DMWG  | FB    | FB+DMWG
AP88-90   | 0.252 | 0.266 | 0.271 **
LA        | 0.257 | 0.257 | 0.267 **
SJMN      | 0.229 | 0.241 | 0.249 **
TREC8     | 0.271 | 0.278 | 0.292 ***

[Figure: pipeline combining word-graph smoothing with pseudo-feedback – query q and document models θd, feedback model θF estimated from top docs, both smoothed over the word graph w, then the top docs are reranked]
Related Work
• Language modeling in information retrieval; smoothing using the collection model – (Ponte & Croft 98); (Hiemstra & Kraaij 98); (Miller et al. 99); (Zhai & Lafferty 01), etc.
• Smoothing using corpus structures
 – Cluster structure: (Liu & Croft 04), etc.
 – Nearest neighbors: (Kurland & Lee 04), (Tao et al. 06)
• Relevance score propagation: (Diaz 05), (Qin et al. 05)
• Graph-based learning: (Zhu et al. 03); (Zhou et al. 04), etc.
23
Conclusions
• Smoothing language models using document/word graphs
• A general optimization framework with various effective instantiations
• Improved performance over the state of the art
• Future work:
 – Combine document graphs with word graphs
 – Study alternative ways of constructing graphs
24