2006/08/03
Chinese Spoken Document Retrieval and Organization
Berlin Chen ( 陳柏琳 )
Associate Professor
Department of Computer Science & Information Engineering
National Taiwan Normal University
2
Outline
• Introduction
• Information Retrieval Models
• Spoken Document Summarization
• Spoken Document Organization
• Language Model Adaptation for Spoken Document Transcription
• Conclusions and Future Work
3
Introduction (1/2)
• Multimedia associated with speech is continuously growing and filling our computers and lives
– e.g., broadcast news, lectures, historical archives, voice mails, meeting records, spoken dialogues
– Speech is the most semantic (information)-bearing component
• On the other hand, speech is the primary and most convenient means of communication between people
– Speech provides a better (more natural) user interface in wireless environments and on small hand-held devices
• Multimedia information access using speech will be very promising in the near future
4
Introduction (2/2)
• Scenario of Multimedia Access using Speech

[Figure: multimedia network content passes through Multimedia Content Processing (information extraction and retrieval (IE & IR); spoken document understanding and organization) and Multimodal Interaction (spoken dialogues) on its way to the networks and users]

[Figure: processing of a Chinese broadcast news archive: named entity extraction, segmentation, topic analysis and organization, summarization, title generation, and information retrieval build a two-dimensional tree structure for organized topics; given an input query and user instructions, the system returns retrieval results, titles, and summaries]

[Figure: graphical model relating documents d_1, ..., d_i, ..., d_N, latent topics T_1, ..., T_k, ..., T_K, and query terms t_1, ..., t_j, ..., t_n through the probabilities P(d_i), P(T_k|d_i), and P(t_j|T_k), for a query Q = t_1 t_2 ... t_n]
5
Related Research Work
• Spoken document transcription, organization, and retrieval have been an active focus of much research in the recent past [R3, R4]
– Informedia System at Carnegie Mellon Univ.
– Multimedia Document Retrieval Project at Cambridge Univ.
– Rough'n'Ready System at BBN Technologies
– SpeechBot Audio/Video Search System at HP Labs
– Spontaneous Speech Corpus and Processing Technology Project in Japan
– ...
6
Related Research Issues (1/2)
• Transcription of Spoken Documents
– Automatically convert speech signals into sequences of words or other suitable units for further processing
• Audio Indexing and Information Retrieval
– Robust representation of the spoken documents
– Matching between (spoken) queries and spoken documents
• Named Entity Extraction from Spoken Documents
– Personal names, organization names, location names, event names
– Very often out-of-vocabulary (OOV) words, difficult to recognize
• Spoken Document Segmentation
– Automatically segment spoken documents into short paragraphs, each with a central topic
7
Related Research Issues (2/2)
• Information Extraction for Spoken Documents
– Extraction of key information, such as who, when, where, what and how, for the information described by spoken documents
• Summarization for Spoken Documents
– Automatically generate a summary (in text/speech form) for each short paragraph
• Title Generation for Multimedia/Spoken Documents
– Automatically generate a title (in text/speech form) for each short paragraph; i.e., a very concise summary indicating the topic area
• Topic Analysis and Organization for Spoken Documents
– Analyze the subject topics of the short paragraphs, or organize the subject topics of the short paragraphs into graphic structures for efficient browsing
8
An Example System for Chinese Broadcast News (1/2)
• For example, a prototype system developed at NTU for efficient spoken document retrieval and browsing [R4]
9
• Users can browse spoken documents in top-down and bottom-up manners
– http://sovideo.iis.sinica.edu.tw/SLG/ngsst/ngsst-index.html
An Example System for Chinese Broadcast News (2/2)
10
Information Retrieval Models
• Information retrieval (IR) models can be characterized by two different matching strategies
– Literal term matching: match queries and documents in an index term space
– Concept matching: match queries and documents in a latent semantic space
Example document (Chinese broadcast news transcript): 香港星島日報篇報導引述軍事觀察家的話表示,到二零零五年台灣將完全喪失空中優勢,原因是中國大陸戰機不論是數量或是性能上都將超越台灣,報導指出中國在大量引進俄羅斯先進武器的同時也得加快研發自製武器系統,目前西安飛機製造廠任職的改進型飛豹戰機即將部署尚未與蘇愷三十通道地對地攻擊住宅飛機,以督促遇到挫折的監控其戰機目前也已經取得了重大階段性的認知成果。根據日本媒體報導在台海戰爭隨時可能爆發情況之下北京方面的基本方針,使用高科技答應局部戰爭。因此,解放軍打算在二零零四年前又有包括蘇愷三十二期在內的兩百架蘇霍伊戰鬥機。

Example query: 中共新一代空軍戰力 ("Communist China's new-generation air force strength"): relevant or not?
11
IR Models: Literal Term Matching (1/2)
• Vector Space Model (VSM)
– Vector representations are used for queries and documents
– Each dimension is associated with an index term and is usually weighted by TF-IDF
– Cosine measure for query-document relevance:

sim(D_j, Q) = \cos(D_j, Q) = \frac{D_j \cdot Q}{\|D_j\| \, \|Q\|} = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{i,Q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,Q}^2}}

– VSM can be implemented with an inverted file structure for efficient document search
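The TF-IDF weighting and cosine measure above can be sketched in a few lines; the function names and the toy tokenized documents below are illustrative assumptions of this example, not part of the systems described in the slides.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a TF-IDF vector per document; df[t] is the document frequency of term t."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """sim(D_j, Q) = D_j . Q / (|D_j| |Q|), summed over the shared index terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query is weighted the same way and scored against every document vector; an inverted file merely restricts the sum to documents sharing at least one query term.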
12
IR Models: Literal Term Matching (2/2)
• Hidden Markov Models (HMM) [R1]
– Also known as language modeling (LM) approaches
– Each document is a probabilistic generative model, consisting of a set of N-gram distributions, for predicting the query Q = w_1 w_2 ... w_L:

P(Q|D_j) = [m_1 P(w_1|D_j) + m_2 P(w_1|C)] \prod_{i=2}^{L} [m_1 P(w_i|D_j) + m_2 P(w_i|C) + m_3 P(w_i|w_{i-1}, D_j) + m_4 P(w_i|w_{i-1}, C)]

where C denotes a general background corpus
– Models can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms
– Such approaches provide a potentially effective and theoretically attractive probabilistic framework for studying IR problems
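The unigram part of the document model above (terms m_1 and m_2) can be sketched as a query-likelihood score; the toy corpus and parameter defaults here are assumptions of this example.

```python
from collections import Counter

def query_likelihood(query, doc, collection, m1=0.7):
    """P(Q|D) = prod_i [ m1*P(w_i|D) + (1-m1)*P(w_i|C) ]: the unigram HMM mixture."""
    d_tf, d_len = Counter(doc), len(doc)
    c_tf, c_len = Counter(collection), len(collection)
    p = 1.0
    for w in query:
        # Interpolate the document unigram with the background-corpus unigram.
        p *= m1 * d_tf[w] / d_len + (1 - m1) * c_tf[w] / c_len
    return p
```

Documents are then ranked by this likelihood; the background term keeps the score nonzero for query words absent from the document.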
13
IR Models: Concept Matching (1/3)
• Latent Semantic Analysis (LSA) [R2]
– Start with a word-document matrix A describing the intra- and inter-document statistics between all terms and all documents
– Singular value decomposition (SVD) is then performed on the matrix to project all term and document vectors onto a reduced latent topical space:

A_{m \times n} \approx U_{m \times k} \, \Sigma_k \, V^T_{k \times n}

– Matching between queries and documents can be carried out in this topical space; a query vector q is folded in as

\hat{q}_{1 \times k} = q_{1 \times m} \, U_{m \times k} \, \Sigma_k^{-1}

and compared with a folded-in document vector \hat{d} via

sim(\hat{q}, \hat{d}) = \cos(\hat{q}\Sigma_k, \hat{d}\Sigma_k) = \frac{\hat{q} \, \Sigma_k^2 \, \hat{d}^T}{\|\hat{q}\Sigma_k\| \, \|\hat{d}\Sigma_k\|}
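The truncated SVD and query fold-in above can be sketched with NumPy; the tiny dense matrix and helper names are assumptions of this example (a real term-document matrix would be large and sparse).

```python
import numpy as np

def lsa(A, k):
    """Truncated SVD: A ~= U_k diag(S_k) V_k^T, keeping the k largest singular values."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]

def fold_in_query(q, Uk, Sk):
    """q_hat = q U_k S_k^{-1}: project a term-count query vector into the topic space."""
    return q @ Uk / Sk
```

Columns of V^T (scaled by the singular values) serve as the document vectors that the folded-in query is compared against.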
14
IR Models: Concept Matching (2/3)
• Probabilistic Latent Semantic Analysis (PLSA) [R5, R6]
– A probabilistic framework for the above topical approach
– The relevance measure is not obtained directly from the frequency of a query term occurring in a document, but from the frequencies of the term and the document in the latent topics:

R(Q, D_j) = \frac{\sum_k P(T_k|Q) \, P(T_k|D_j)}{\sqrt{\sum_k P(T_k|Q)^2} \, \sqrt{\sum_k P(T_k|D_j)^2}}

– A query and a document may thus have a high relevance score even if they share no terms in common
– PLSA can also be viewed as a matrix decomposition, P \approx U_{m \times k} \Sigma_k V^T_{k \times n}, whose entries are the joint probabilities

P(w_i, D_j) = \sum_k P(w_i|T_k) \, P(T_k) \, P(D_j|T_k)
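The relevance measure R(Q, D_j) above is a cosine between two topic posterior vectors; a minimal sketch, assuming the posteriors P(T_k|Q) and P(T_k|D_j) are given as plain lists (how they are estimated is left to the PLSA training not shown here).

```python
import math

def topic_cosine(p_t_q, p_t_d):
    """R(Q, D): cosine between the topic posteriors P(T_k|Q) and P(T_k|D)."""
    dot = sum(a * b for a, b in zip(p_t_q, p_t_d))
    nq = math.sqrt(sum(a * a for a in p_t_q))
    nd = math.sqrt(sum(b * b for b in p_t_d))
    return dot / (nq * nd)
```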
15
IR Models: Concept Matching (3/3)
• PLSA can also be viewed as an HMM or as a topical mixture model (TMM)
– The document is explicitly interpreted as a mixture model used to predict the query, which can easily be related to the conventional HMM modeling approaches widely studied in the speech processing community (topical distributions are tied among documents)
– Thus quite a few theoretically attractive model training algorithms can be applied, in supervised or unsupervised manners

[Figure: a document model for D_j: shared topic distributions P(w_i|T_1), ..., P(w_i|T_K) mixed with document-specific weights P(T_1|D_j), ..., P(T_K|D_j), generating the query Q = w_1 w_2 ... w_L]

P(Q|D_j) = \prod_{i=1}^{L} \sum_{k=1}^{K} P(w_i|T_k) \, P(T_k|D_j)
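The product-of-mixtures likelihood above is straightforward to compute; the dictionary-based topic distributions in this sketch are an assumption of the example, not the slides' implementation.

```python
def tmm_query_likelihood(query, p_w_given_t, p_t_given_d):
    """P(Q|D_j) = prod_i sum_k P(w_i|T_k) P(T_k|D_j).

    p_w_given_t: list of K dicts, the shared topic unigrams P(w|T_k).
    p_t_given_d: list of K floats, the document-specific weights P(T_k|D_j).
    """
    p = 1.0
    for w in query:
        p *= sum(p_w_given_t[k].get(w, 0.0) * p_t_given_d[k]
                 for k in range(len(p_t_given_d)))
    return p
```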
16
Preliminary IR Evaluations
• Experiments were conducted on the TDT2/TDT3 spoken document collections [R6]
– TDT2 was used for parameter tuning/training, and TDT3 for evaluation
– E.g., mean average precision (mAP) tested on TDT3
• HMM/PLSA/TMM were trained in a supervised manner
• The language modeling approaches (TMM/PLSA/HMM) achieve significantly better results than the conventional approaches (VSM/LSA) on this spoken document retrieval (SDR) task
mAP   TMM      HMM      PLSA     VSM      LSA
TD    0.7870   0.7174   0.6882   0.6505   0.6440
SD    0.7852   0.7156   0.6688   0.6216   0.6390
17
Spoken Document Summarization (1/3)
• Extractive Summarization
– Extract a number of indicative sentences from the original spoken document
– A lightweight process compared with abstractive summarization
• Use TMM to explore the relationships between the handcrafted titles and documents of a contemporary newswire text collection for selecting salient sentences from spoken documents
– Build a probabilistic latent topical space

[Figure: a sentence model for S_{g,l}: topic distributions P(w_i|T_1), ..., P(w_i|T_K) with weights P(T_1|S_{g,l}), ..., P(T_K|S_{g,l}), generating the spoken document D_g = w_1 w_2 ... w_{L_g}]
18
Spoken Document Summarization (2/3)
• An illustration of the TMM-based extractive summarization
• The sentence model can be obtained by probabilistic "fold-in"

[Figure: text documents D_1, ..., D_M with handcrafted titles H_1, ..., H_M, and the sentences S_{g,1}, S_{g,2}, S_{g,3}, ..., S_{g,L} of a spoken document D_g, all related through the latent topics T_1, T_2, ..., T_K]

Title/document likelihood:

P(D_j|H_j) = \prod_{w_i \in D_j} \left[ \sum_{k=1}^{K} P(w_i|T_k) \, P(T_k|H_j) \right]^{c(w_i, D_j)}

Sentence/document likelihood:

P(D_g|S_{g,l}) = \prod_{w_i \in D_g} \left[ \sum_{k=1}^{K} P(w_i|T_k) \, P(T_k|S_{g,l}) \right]^{c(w_i, D_g)}

Fold-in re-estimation of the sentence model:

P(T_k|S_{g,l}) = \frac{\sum_{w_i \in S_{g,l}} c(w_i, S_{g,l}) \, P(T_k|w_i, S_{g,l})}{\sum_{w_i \in S_{g,l}} c(w_i, S_{g,l})},
where P(T_k|w_i, S_{g,l}) = \frac{P(w_i|T_k) \, P(T_k|S_{g,l})}{\sum_{k'=1}^{K} P(w_i|T_{k'}) \, P(T_{k'}|S_{g,l})}
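The fold-in re-estimation above can be sketched as a small EM loop that updates only the mixture weights, keeping the topic unigrams fixed; the dictionary-based topic representation and the fixed iteration count are assumptions of this example.

```python
from collections import Counter

def fold_in(sentence, p_w_given_t, p_t_init, iters=20):
    """EM 'fold-in': re-estimate P(T_k|S) with the distributions P(w|T_k) held fixed."""
    counts = Counter(sentence)
    p_t = list(p_t_init)
    K = len(p_t)
    for _ in range(iters):
        # E-step: P(T_k|w, S) proportional to P(w|T_k) P(T_k|S).
        post = {}
        for w in counts:
            denom = sum(p_w_given_t[k].get(w, 0.0) * p_t[k] for k in range(K))
            post[w] = [(p_w_given_t[k].get(w, 0.0) * p_t[k] / denom) if denom else 1.0 / K
                       for k in range(K)]
        # M-step: P(T_k|S) = sum_w c(w,S) P(T_k|w,S) / sum_w c(w,S).
        total = sum(counts.values())
        p_t = [sum(counts[w] * post[w][k] for w in counts) / total for k in range(K)]
    return p_t
```

Sentences are then scored by the likelihood they assign to the whole document, and the top-ranked ones form the extractive summary.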
19
Spoken Document Summarization (3/3)
• Preliminary tests on 200 radio broadcast news stories collected in Taiwan (automatic transcripts with a 14.17% character error rate)
– The ROUGE-2 measure was used to evaluate the performance of the different models
• TMM yields considerable performance gains at the lower summarization ratios
Ratio   VSM      LSA-1    LSA-2     SigSen   TMM      Random
10%     0.2603   0.1872   0.24125   0.1976   0.2808   0.1466
20%     0.2739   0.2089   0.25231   0.2332   0.2837   0.1696
30%     0.3092   0.2288   0.26693   0.2724   0.3100   0.1956
50%     0.4107   0.3772   0.35121   0.3883   0.4266   0.3436
20
Spoken Document Organization (1/3)
• Each document is viewed as a TMM model used to generate itself
– Additional transitions between topical mixtures capture the topological relationships between topical classes on a 2-D map

[Figure: two-dimensional tree structure for organized topics]

A document D_j = w_1 w_2 ... w_{L_j} is generated as

P(w_i|D_j) = \sum_{k=1}^{K} P(T_k|D_j) \sum_{l=1}^{K} P(T_l|Y_k) \, P(w_i|T_l)

where the distribution over the map neighborhood Y_k of topic T_k is defined by the inter-topic map distances:

P(T_l|Y_k) = \frac{E(T_k, T_l)}{\sum_{s=1}^{K} E(T_k, T_s)}, \quad E(T_k, T_l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{dist^2(T_k, T_l)}{2\sigma^2} \right)
21
Spoken Document Organization (2/3)
• Document models can be trained in an unsupervised way by maximizing the total log-likelihood of the document collection:

L_T = \sum_{j=1}^{n} \sum_{i=1}^{V} c(w_i, D_j) \log P(w_i|D_j)

• Initial evaluation results (conducted on the TDT2 collection)
– The TMM-based approach is competitive with the conventional Self-Organizing Map (SOM) approach

Model   Iterations   distBet/distWithin
TMM     10           1.9165
TMM     20           2.0650
TMM     30           1.9477
TMM     40           1.9175
SOM     100          2.0604
22
Spoken Document Organization (3/3)
• Each topical class can be labeled by words selected using the following criterion:

Sig(w_i, T_k) = \frac{\sum_{j=1}^{n} c(w_i, D_j) \, P(T_k|D_j)}{\sum_{j=1}^{n} c(w_i, D_j) \left[ 1 - P(T_k|D_j) \right]}

• An example map for international political news
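The labeling criterion above favors words frequent in documents assigned to topic T_k and rare elsewhere; a minimal sketch, assuming per-document word counts and topic posteriors are given as plain dicts and lists (the denominator form follows the reconstructed equation).

```python
def topic_label_score(word, topic_k, counts, p_t_given_d):
    """Sig(w, T_k) = sum_j c(w,D_j) P(T_k|D_j) / sum_j c(w,D_j) [1 - P(T_k|D_j)].

    counts: per-document word-count dicts; p_t_given_d: per-document topic posteriors.
    """
    num = sum(c.get(word, 0) * p[topic_k] for c, p in zip(counts, p_t_given_d))
    den = sum(c.get(word, 0) * (1.0 - p[topic_k]) for c, p in zip(counts, p_t_given_d))
    return num / den if den else float("inf")
```

Words with the highest scores per topic become the labels shown on the 2-D map.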
23
Dynamic Language Model Adaptation (1/3)
• TMM can also be applied to dynamic language model adaptation for spoken document recognition
– In word graph rescoring, the history H_{w_i} of each newly occurring word w_i can be treated as a document used to predict that word:

P_{TMM}(w_i|H_{w_i}) = \sum_{k=1}^{K} P(w_i|T_k) \, P(T_k|H_{w_i})
24
Automatic Speech Recognition (ASR)
• A search process to uncover the word sequence \hat{W} = w_1 w_2 ... w_m that has the maximum posterior probability:

\hat{W} = \arg\max_W P(W|X) = \arg\max_W \frac{P(X|W) \, P(W)}{P(X)} = \arg\max_W P(X|W) \, P(W)

where P(X|W) is the acoustic model probability and P(W) is the language model probability, with W = w_1, w_2, ..., w_m and each w_i drawn from the vocabulary V = {v_1, v_2, ..., v_N}

• N-gram language modeling:
– Unigram: P(w_1 w_2 ... w_k) = P(w_1) P(w_2) ... P(w_k), with P(w_j) = C(w_j) / \sum_i C(w_i)
– Bigram: P(w_1 w_2 ... w_k) = P(w_1) P(w_2|w_1) ... P(w_k|w_{k-1}), with P(w_j|w_{j-1}) = C(w_{j-1} w_j) / C(w_{j-1})
– Trigram: P(w_1 w_2 ... w_k) = P(w_1) P(w_2|w_1) \prod_{j=3}^{k} P(w_j|w_{j-2} w_{j-1}), with P(w_j|w_{j-2} w_{j-1}) = C(w_{j-2} w_{j-1} w_j) / C(w_{j-2} w_{j-1})
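The maximum-likelihood bigram estimate above is just a ratio of counts; a minimal sketch over a toy token list (a real system would add smoothing for unseen bigrams).

```python
from collections import Counter

def bigram_lm(corpus):
    """ML bigram estimates: P(w_j | w_{j-1}) = C(w_{j-1} w_j) / C(w_{j-1})."""
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    # Return a probability function over (previous word, current word).
    return lambda prev, w: bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
```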
25
Dynamic Language Model Adaptation (2/3)
• For each history model, the topic distributions P(w_i|T_k) are trained beforehand and kept fixed during rescoring, while the mixture weights P(T_k|H_{w_i}) are dynamically estimated ("folded in"):

\hat{P}(T_k|H_{w_i}) = \frac{\sum_{w_n \in H_{w_i}} c(w_n, H_{w_i}) \, P(T_k|w_n, H_{w_i})}{\sum_{w_n \in H_{w_i}} c(w_n, H_{w_i})},
where P(T_k|w_n, H_{w_i}) = \frac{P(w_n|T_k) \, P(T_k|H_{w_i})}{\sum_{l=1}^{K} P(w_n|T_l) \, P(T_l|H_{w_i})}

• The TMM LM can be combined with the background N-gram LM via probability interpolation or probability scaling
– Probability interpolation:

P_{Adapt}(w_i|w_{i-2} w_{i-1}) = \lambda \, P_{TMM}(w_i|H_{w_i}) + (1 - \lambda) \, P_{BG}(w_i|w_{i-2} w_{i-1})

– Probability scaling:

P_{Adapt}(w_i|w_{i-2} w_{i-1}) = \frac{P_{BG}(w_i|w_{i-2} w_{i-1}) \cdot \frac{P_{TMM}(w_i|H_{w_i})}{P_{BG}(w_i)}}{\sum_{l=1}^{V} P_{BG}(w_l|w_{i-2} w_{i-1}) \cdot \frac{P_{TMM}(w_l|H_{w_l})}{P_{BG}(w_l)}}
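The probability-scaling combination above boosts words the topic model favors relative to their background unigram, then renormalizes over the vocabulary; a minimal sketch, assuming the component probabilities are precomputed dicts (the equation itself is a reconstruction from the slides).

```python
def adapt_scale(w, p_bg_trigram, p_tmm, p_bg_unigram):
    """P_Adapt(w|h) proportional to P_BG(w|h) * P_TMM(w|H_w) / P_BG(w), renormalized.

    p_bg_trigram: background trigram probabilities P_BG(.|h) for the current history h.
    p_tmm: topic-model probabilities P_TMM(.|H_w); p_bg_unigram: background unigrams.
    """
    score = lambda v: p_bg_trigram[v] * p_tmm[v] / p_bg_unigram[v]
    return score(w) / sum(score(v) for v in p_bg_trigram)
```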
26
Dynamic Language Model Adaptation (3/3)
• Preliminary tests on 500 radio broadcast news stories
– TMM is competitive with conventional MAP-based LM adaptation and superior to LSA
– Combining the TMM (topical information) and MAP (contextual information) approaches yields additional performance gains

Model                                       CER (%)   PP
Baseline (background trigram LM)            15.22     752.49
+ TMM (interpolation)                       14.47     502.50
+ TMM (scaling)                             13.87     493.26
+ LSA (scaling)                             15.19     676.67
+ MAP in-domain trigram                     13.74     430.59
+ TMM (scaling) + MAP in-domain trigram     13.27     306.75
27
Prototype Systems Developed at NTNU (1/3)
• Spoken Document Retrieval
http://sdr.csie.ntnu.edu.tw
28
Prototype Systems Developed at NTNU (2/3)
• Speech Retrieval and Browsing of Digital Historical Archives
29
Prototype Systems Developed at NTNU (3/3)
• Systems Currently Implemented with a Client-Server Architecture

[Figure: a PDA client connects to a Mandarin LVCSR server, an information retrieval server (a multi-scale indexer producing word-level and syllable-level indexing features and inverted files over the automatically transcribed broadcast news corpus), and an audio streaming server]
30
Conclusions and Future Work
• Multimedia information access using speech will be very promising in the near future
– Speech is the key to multimedia understanding and organization
– Several task domains still remain challenging
• PLSA/TMM-based modeling approaches using probabilistic latent topical information are promising for language modeling, information retrieval, document summarization, segmentation, organization, etc.
– A unified framework for spoken document recognition, organization and retrieval was initially investigated
31
References
[R1] B. Chen et al., “A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents,” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 2, June 2004
[R2] J.R. Bellegarda, “Latent Semantic Mapping,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R3] K. Koumpis and S. Renals, “Content-based access to spoken audio,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R4] L.S. Lee and B. Chen, “Spoken Document Understanding and Organization,” IEEE Signal Processing Magazine, Vol. 22, No. 5, Sept. 2005
[R5] T. Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, Vol. 42, 2001
[R6] B. Chen, “Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval,” Pattern Recognition Letters, Vol. 27, No. 1, Jan. 2006
Thank You!