Survey of Approaches to Information Retrieval of Speech Messages
Kenney Ng
Spoken Language Systems Group
Laboratory for Computer Science
Massachusetts Institute of Technology
February 16, 1996 (DRAFT)
Presenter: 朱惠銘
Survey of Approaches to Information Retrieval of Speech Messages
Introduction
Information Retrieval
Text Retrieval
Differences between text and speech media
Information Retrieval of Speech Messages
1 Introduction
Process, organize, and analyze the data.
Present the data in human-usable form.
Find the “interesting” pieces of information efficiently.
Increasingly large portions of information are in spoken language form: recorded speech messages, radio and television broadcasts.
This motivates the development of automatic methods.
2 Information Retrieval
2.1 Definition
The representation, storage, organization, and accessing of information items.
Return the best matches to the “request” provided by the user.
There is no restriction on the type of documents:
Text Retrieval, Document Retrieval
Image Retrieval, Speech Retrieval
Multimedia Retrieval
2.2 Information Retrieval vs. Database Retrieval
Database Retrieval vs. Information Retrieval:
Database Retrieval returns specific facts (answers that exactly match the request); Information Retrieval returns documents relevant to the user’s request.
Database Retrieval: structured records are well defined; Information Retrieval: documents are typically not well structured.
Database Retrieval: complete specification of the user’s information need; Information Retrieval: incomplete specification of the user’s information need.
Database Retrieval: the user seeks an answer that is a specific fact or piece of information; Information Retrieval: the user is interested in a general topic or subject area and wants to find out more about it.
2.3 Component Processes
Creating document representations (indexing)
Creating request representations (query formation)
Comparing representations (retrieval)
Evaluating retrieved documents (relevance feedback)
2.3 Component Processes (cont.)
Performance
Recall: the fraction of all the relevant documents in the entire collection that are retrieved in response to a query.
Precision: the fraction of the retrieved documents that are relevant.
Average precision: the precision values obtained at each new relevant document in the ranked output for an individual query, averaged.
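As an added illustration (not part of the original slides), a minimal Python sketch of these three measures, assuming a ranked list of document IDs and a set of relevant IDs; the function names are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a retrieved set of documents."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average the precision values obtained at each new relevant document.

    Note: TREC-style average precision divides by the total number of
    relevant documents instead; the slide's wording is followed here.
    """
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# d2 and d4 are relevant; d2 is found at rank 1, d4 at rank 3.
print(average_precision(["d2", "d1", "d4"], {"d2", "d4"}))  # (1/1 + 2/3) / 2
```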
3 Text Retrieval
3.1 Indexing and Document Representation
3.2 Query Formation
3.3 Matching Query and Document Representations
3.1 Indexing and Document Representation
Terms and Keywords
A list of words extracted from the full-text document.
Construct a stop list to remove useless words.
To handle synonyms, construct a dictionary structure that replaces each word in a synonym class with a single representative term.
A tradeoff exists between normalization and discrimination in the indexing process.
Index Term Weighting
Term frequency: the frequency of occurrence of each term in the document.
For term $t_k$ in document $d_i$: $tf(d_i, t_k)$
Index Term Weighting
Inverse document frequency: weight each term inversely proportional to the number of documents in which the term occurs.
For term $t_k$:

$$idf(t_k) = \log\left(\frac{N}{n_{t_k}}\right)$$

where $N$ is the total number of documents and $n_{t_k}$ is the number of documents containing term $t_k$.
Index Term Weighting
Terms that occur frequently in particular documents but rarely in the overall collection should receive a large weight. Combining term frequency and inverse document frequency, with length normalization:

$$w(d_i, t_k) = \frac{tf(d_i, t_k)\, idf(t_k)}{\sqrt{\sum_j \left[ tf(d_i, t_j)\, idf(t_j) \right]^2}}, \qquad idf(t_k) = \log\left(\frac{N}{n_{t_k}}\right)$$
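A minimal Python sketch of this weighting scheme (an added illustration, not from the original slides); doc_tf, df, and N follow the definitions above:

```python
import math

def tfidf_weights(doc_tf, df, N):
    """Length-normalized tf*idf weights for one document.

    doc_tf: {term: frequency of the term in this document}
    df:     {term: number of documents containing the term}
    N:      total number of documents in the collection
    """
    raw = {t: tf * math.log(N / df[t]) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else raw
```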
3.2 Query Formation
Relevance Feedback: the IR system automatically modifies a query based on user feedback about documents retrieved in an initial run.

$$q_{new} = q_{old} + \sum_{d_i \in rel} d_i - \sum_{d_i \in nonrel} d_i$$
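A minimal sketch of this update in Python (an added illustration), treating queries and documents as term-to-weight dicts; the unweighted form matches the formula above, while common Rocchio variants scale each sum:

```python
from collections import defaultdict

def relevance_feedback(q_old, rel_docs, nonrel_docs):
    """q_new = q_old + sum(relevant vectors) - sum(non-relevant vectors)."""
    q_new = defaultdict(float, q_old)
    for d in rel_docs:
        for term, w in d.items():
            q_new[term] += w
    for d in nonrel_docs:
        for term, w in d.items():
            q_new[term] -= w
    return dict(q_new)
```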
3.2 Query Formation
Extracting from a user request a representation of its content.
The indexing methods are also applicable to query formation.
3.3 Matching Query and Document Representations
Boolean Model, Extended Boolean Model
Vector Space Model
Probabilistic Models
Boolean Model
Document representation: binary-valued variables.
True: the term is present in the document; False: the term is absent from the document.
The document can be represented as a binary vector.
Query: Boolean query with AND, OR, and NOT.
Matching function: standard rules of Boolean logic. If the document representation satisfies the query expression, then that document matches the query.
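An illustrative sketch of Boolean matching in Python (added here, with a hypothetical query encoding as nested tuples):

```python
def matches(doc_terms, query):
    """Evaluate a Boolean query against a document's set of terms.

    query is either a term (str) or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q1).
    """
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(doc_terms, q) for q in args)
    if op == "OR":
        return any(matches(doc_terms, q) for q in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(op)

# "speech" AND NOT "video"
print(matches({"speech", "retrieval"}, ("AND", "speech", ("NOT", "video"))))
```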
Extended Boolean Model
The retrieval decision of the Boolean Model may be too harsh.
In the extended Boolean model, for the AND query:

$$sim(q_{and}, d) = 1 - \left[ \frac{(1-d_1)^p + (1-d_2)^p + \cdots + (1-d_K)^p}{K} \right]^{1/p}$$

This is maximal for a document containing all the terms and decreases as the number of matching terms decreases.
Extended Boolean Model
For the OR query:

$$sim(q_{or}, d) = \left[ \frac{d_1^p + d_2^p + \cdots + d_K^p}{K} \right]^{1/p}$$

This is minimal for a document that contains none of the terms and increases as the number of matching terms increases.
The variable p is a constant in the range 1 ≤ p ≤ ∞ that is determined empirically; it is typically in the range 2 ≤ p ≤ 5.
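A small Python sketch of both p-norm similarities (an added illustration), where d is a list of K term weights in [0, 1]:

```python
def pnorm_and(d, p=2.0):
    """Extended Boolean AND: 1 - (mean((1 - d_k)^p))^(1/p)."""
    K = len(d)
    return 1.0 - (sum((1.0 - x) ** p for x in d) / K) ** (1.0 / p)

def pnorm_or(d, p=2.0):
    """Extended Boolean OR: (mean(d_k^p))^(1/p)."""
    K = len(d)
    return (sum(x ** p for x in d) / K) ** (1.0 / p)

print(pnorm_and([1.0, 1.0]), pnorm_or([0.0, 0.0]))  # 1.0 (all match), 0.0 (none)
```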
Vector Space Model
Documents and queries are represented as vectors in a K-dimensional space, where K is the number of indexing terms.

$$sim(q, d) = \frac{\sum_{k=1}^{K} q_k d_k}{\sqrt{\sum_{k=1}^{K} q_k^2}\,\sqrt{\sum_{k=1}^{K} d_k^2}}$$
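A minimal cosine-similarity sketch in Python (an added illustration), with queries and documents as term-to-weight dicts as in the tf*idf sketch above:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```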
Probabilistic Models
Bayes’ Decision Rule
$p(R|d,q)$ denotes the probability that the document d is relevant to the query q.
$p(\bar{R}|d,q)$ denotes the probability that the document d is non-relevant to the query q.
$C_r$ is the cost of retrieving a non-relevant document; $C_n$ is the cost of not retrieving a relevant document.
The expected cost of retrieving an extraneous document is $C_r\,p(\bar{R}|d,q)$.
Retrieve document d if:

$$C_n\,p(R|d,q) > C_r\,p(\bar{R}|d,q)$$
Probabilistic Models (cont.)
How do we compute the posterior probabilities $p(R|d,q)$ and $p(\bar{R}|d,q)$?
Based on Bayes’ rule:

$$p(R|d,q) = \frac{p(d|R,q)\,p(R|q)}{p(d|q)}, \qquad p(\bar{R}|d,q) = \frac{p(d|\bar{R},q)\,p(\bar{R}|q)}{p(d|q)}$$

$p(R|q)$ and $p(\bar{R}|q)$ are the prior probabilities of relevance and non-relevance of a document.
$p(d|R,q)$ and $p(d|\bar{R},q)$ are the likelihoods or class-conditional probabilities.
Probabilistic Models (cont.)
Taking the ratio, the normalizing term $p(d|q)$ cancels:

$$\frac{p(R|d,q)}{p(\bar{R}|d,q)} = \frac{p(d|R,q)\,p(R|q)}{p(d|\bar{R},q)\,p(\bar{R}|q)}$$

Now we have to estimate $p(d|R,q)$ and $p(d|\bar{R},q)$.
Probabilistic Models (cont.)
Assumptions:
The document vectors are binary, indicating the presence or absence of each indexing term: for the K indexing terms, $d_k \in \{0,1\}$ is the kth term in the document vector d, $k = 1, \ldots, K$.
Each term has a binomial distribution.
There are no interactions between the terms.

$$p(d|R,q) = \prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}, \qquad p(d|\bar{R},q) = \prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k}$$

where $p_k = p(d_k=1|R,q)$ and $q_k = p(d_k=1|\bar{R},q)$.
Probabilistic Models (cont.)
$$
\begin{aligned}
sim(q,d) &= \log\frac{p(R|d,q)}{p(\bar{R}|d,q)}
          = \log\frac{p(R|q)\,p(d|R,q)}{p(\bar{R}|q)\,p(d|\bar{R},q)} \\
         &= \log\frac{p(R|q)\prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}}
                     {p(\bar{R}|q)\prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k}} \\
         &= \sum_{k=1}^{K} d_k \log\frac{p_k(1-q_k)}{q_k(1-p_k)}
            + \sum_{k=1}^{K} \log\frac{1-p_k}{1-q_k}
            + \log\frac{p(R|q)}{p(\bar{R}|q)} \\
         &= \sum_{k=1}^{K} d_k w_k + C
\end{aligned}
$$

where the last two sums do not depend on the document and are collected into the constant C.
Probabilistic Models (cont.)
$$w_k = \log\frac{p_k(1-q_k)}{q_k(1-p_k)}$$

$w_k$ is the relevance weight of the kth index term.
Assume $p_k$ is a constant value, 0.5, and $q_k$ is the overall frequency $n_k/N$:

$$w_k = \log\frac{\frac{1}{2}\left(1-\frac{n_k}{N}\right)}{\frac{n_k}{N}\cdot\frac{1}{2}} = \log\frac{N-n_k}{n_k} = \log\left(\frac{N}{n_k}-1\right)$$
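A minimal Python sketch of ranking with these relevance weights under the stated assumptions ($p_k = 0.5$, $q_k = n_k/N$); an added illustration with hypothetical function names:

```python
import math

def relevance_weight(n_k, N):
    """w_k = log((N - n_k) / n_k), assuming p_k = 0.5 and q_k = n_k / N."""
    return math.log((N - n_k) / n_k)

def bim_score(doc_terms, query_terms, df, N):
    """sim(q, d) = sum of d_k * w_k over query terms present in the document."""
    return sum(relevance_weight(df[t], N)
               for t in query_terms if t in doc_terms and 0 < df[t] < N)
```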
4 Differences Between Text and Speech Media
Speech is a richer and more expressive medium than text. (mood, tone)
Robustness of the retrieval models to noise or errors in transcription.
How to accurately extract and represent the contents of a speech message in a form that can be efficiently stored and searched.
5 Information Retrieval of Speech Messages
Speech Message Retrieval
Large Vocabulary Word Recognition Approach
Sub-Word Unit Approach
Word Spotting Approaches
Speech Message Classification and Sorting
Topic Identification
Topic Spotting
Topic Clustering
Large Vocabulary Word Recognition Approach
Suggested by CMU in the Informedia digital video library project.
A user can interact with the text retrieval system to obtain video clips stored in the library that are relevant to his request.
Processing pipeline: sound track of video → large-vocabulary speech recognizer → textual transcript → natural language understanding → full-text information retrieval system.
Sub-Word Unit Approach
Syllabic Units
Phonetic Units
Syllabic Units
VCV-features: sub-word units consisting of a maximal sequence of consonants enclosed between two maximal sequences of vowels.
e.g., INFORMATION yields INFO, ORMA, ATIO.
A subset of these VCV-features is taken as the indexing terms.
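A small Python sketch of VCV-feature extraction as defined above (an added illustration; the systems surveyed derived these units from recognized phone sequences rather than from spelling as done here):

```python
import re

def vcv_features(word):
    """Maximal vowel run + consonant run + vowel run, overlapping on vowels."""
    # The lookahead makes matches overlap: the trailing vowel group of one
    # feature is the leading vowel group of the next.
    pattern = re.compile(r"(?=([AEIOU]+[^AEIOU]+[AEIOU]+))")
    return [m.group(1) for m in pattern.finditer(word.upper())]

print(vcv_features("INFORMATION"))  # ['INFO', 'ORMA', 'ATIO']
```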
Syllabic Units
Criteria for selecting a VCV-feature:
It should occur frequently enough for a reliable acoustic model to be trained for it.
It should not occur so frequently that its ability to discriminate between different messages is poor.
Process
query → VCV-features → tf*idf weights → cosine similarity against the document representations → document with the highest score
Syllabic Units
Major problem: the acoustic confusability of the VCV-features is not taken into account during the selection of indexing features.
Phonetic Units
Uses variable-length phone sequences as indexing features. These features can be viewed as “pseudo-words” and were shown to be useful for detecting or spotting topics in recorded military radio broadcasts.
An automatic procedure based on “digital trees” is used to search the possible subsequences.
A Hidden Markov Model (HMM) phone recognizer with 52 monophone models is used to process the speech.
More domain-independent than a word-based system.
Word Spotting Approaches
Word spotting sits between the simple phonetic and the complex large-vocabulary recognition approaches.
Word spotting has been used in two different ways:
1. A small, fixed number of keywords is selected a priori for both recognition and indexing.
2. The speech messages in the collection are processed and stored in a form (e.g., a phone lattice) that allows arbitrary keywords to be searched for after they are specified by the user.
Speech Message Classification and Sorting
Topic Identification (1)
K keywords; $n_k$ is the binary value indicating the presence or absence of keyword $w_k$.
Find the topic $T_i$ that maximizes the score $S_i$:

$$S_i = \sum_{k=1}^{K} n_k \log\frac{p(T_i, w_k)}{p(T_i)\,p(w_k)}$$
Speech Message Classification and Sorting
Topic Identification (1)
With 6 topics and the top-scoring 40 words for each, there are 240 keywords in total.
Using these keywords on the text transcriptions of the speech messages, 82.4% classification accuracy was achieved.
A genetic algorithm was then used to reduce the number of keywords to 126, with a small drop in classification performance to 78.2%.
Topic Identification (2)
Topic-dependent unigram language models:
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$s_i = \log\prod_{k=0}^{K} p(w_k|T_i)^{n_k} = \sum_{k=0}^{K} n_k \log p(w_k|T_i)$$
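A minimal Python sketch of this unigram topic scoring and classification (an added illustration), assuming precomputed, smoothed keyword probabilities per topic:

```python
import math

def topic_score(counts, keyword_probs):
    """s_i = sum over keywords of n_k * log p(w_k | T_i)."""
    return sum(n * math.log(keyword_probs[w])
               for w, n in counts.items() if w in keyword_probs)

def classify(counts, topic_models):
    """Pick the topic whose unigram model gives the highest score."""
    return max(topic_models, key=lambda t: topic_score(counts, topic_models[t]))
```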
Topic Identification (2)
Number of keywords vs. topic classification accuracy:
All 8431 words in the recognition vocabulary: 72.5%
A subset of 4600 words, selected with a χ² hypothesis test based on contingency tables to pick the “important” keywords: 74%
A genetic algorithm search then used to reduce the set to 203 words: 70%
Topic Identification (3)
The length-normalized topic score:
N is the total number of words in the speech message.
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$S_i = \frac{1}{N}\sum_{k=0}^{K} n_k \log p(w_k|T_i)$$
Topic Identification (3)
With 750 keywords, classification accuracy is 74.6%.
Topic Identification (4)
The topic model is extended to a mixture of multinomials:
M is the number of multinomial model components.
$\pi_m$ is the weight of the mth multinomial component.
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p_m(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ under the mth component for class $T_i$ messages.

$$S_i = \log\left\{\sum_{m=1}^{M}\pi_m\prod_{k=0}^{K} p_m(w_k|T_i)^{n_k}\right\}$$
Topic Identification (4)
Experiments indicate that the more complex mixture models do not perform as well as the simple single-component model.
Topic Spotting (1)
The “usefulness” measure quantifies how discriminating a word is for the topic:

$$u(w_k, T) = p(w_k|T)\log\frac{p(w_k|T)}{p(w_k|\bar{T})}$$

$p(w_k|T)$ and $p(w_k|\bar{T})$ are the probabilities of detecting the keyword in the topic and in the unwanted (off-topic) messages.
This measure selects words that occur often in the topic and have high discriminability.
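A small Python sketch of the usefulness measure for ranking candidate keywords (an added illustration; the detection probabilities are assumed to be precomputed):

```python
import math

def usefulness(p_topic, p_nontopic):
    """u(w, T) = p(w|T) * log(p(w|T) / p(w|not T))."""
    return p_topic * math.log(p_topic / p_nontopic)

# A word detected in 8% of on-topic messages but 1% of off-topic ones
# scores higher than one detected equally often in both.
print(usefulness(0.08, 0.01), usefulness(0.08, 0.08))
```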
Topic Spotting (2)
Topic spotting is performed by accumulating, over a window of speech (typically 60 seconds), the log likelihood ratios of the detected keywords to produce a topic score for that region of the speech message:

$$s = \sum_{k=1}^{K} n_k \log\frac{p(w_k|T)}{p(w_k|\bar{T})}$$
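A minimal sketch of this windowed scoring in Python (an added illustration), assuming a list of (time, keyword) detections from a word spotter and precomputed log likelihood ratios:

```python
def window_scores(detections, llr, window=60.0, step=10.0):
    """Accumulate keyword log likelihood ratios over sliding windows.

    detections: list of (time_in_seconds, keyword) pairs
    llr:        {keyword: log(p(w|T) / p(w|not T))}
    """
    if not detections:
        return []
    end = max(t for t, _ in detections)
    scores, start = [], 0.0
    while start <= end:
        s = sum(llr[w] for t, w in detections
                if start <= t < start + window and w in llr)
        scores.append((start, s))  # topic score for this region of speech
        start += step
    return scores
```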
Topic Spotting (2)
Models that try to capture dependencies between the keywords have also been examined.
$\mathbf{w}$ represents the vector of keywords and $\lambda_k$ is a coefficient of the log-linear model:

$$\log\frac{p(T|\mathbf{w},\mathbf{n})}{p(\bar{T}|\mathbf{w},\mathbf{n})} = \sum_{k=0}^{K}\lambda_k\,n_k\log\frac{p(w_k|T)}{p(w_k|\bar{T})} + \log\frac{p(T)}{p(\bar{T})}$$

Their experiments show that using a carefully chosen log-linear model can give topic spotting performance that is better than using the basic model that assumes keyword independence.
Topic Clustering
Try to discover structure or relationships between messages in a collection.
The clustering process:
Tokenization
Similarity computation
Clustering
Topic Clustering (cont.)
Tokenization: come up with a suitable representation of the speech message that can be used in the next two steps.
Similarity computation: every pair of messages needs to be compared; an N-gram model is used.
Clustering: using hierarchical tree clustering or nearest-neighbor classification.
Works well on true transcription texts: figure of merit (FOM) rates around 90%.
Speech input is worse than text: the FOM drops to about 70% using recognition output, unigram language models, and tree-based clustering.
Thank you.