Survey of Approaches to Information Retrieval of Speech Messages
Kenney Ng
Spoken Language Systems Group
Laboratory for Computer Science
Massachusetts Institute of Technology
February 16, 1996 (DRAFT)
Presenter: 朱惠銘
Survey of Approaches to Information Retrieval of Speech Messages
Introduction
Information Retrieval
Text Retrieval
Differences between text and speech media
Information Retrieval of Speech Messages
1 Introduction
Process, organize, and analyze the data.
Present the data in human-usable form.
Find the “interesting” pieces of information efficiently.
Increasingly large portions of information are in spoken language form: recorded speech messages, radio and television broadcasts.
This motivates the development of automatic methods.
2 Information Retrieval
2.1 Definition
The representation, storage, organization, and accessing of information items.
Return the best matches to the “request” provided by the user.
There is no restriction on the type of documents:
Text Retrieval, Document Retrieval
Image Retrieval, Speech Retrieval
Multimedia Retrieval
2.2 Information Retrieval vs. Database Retrieval
Database Retrieval vs. Information Retrieval:
Database Retrieval returns specific facts (answers that exactly match the request); Information Retrieval returns documents relevant to the user’s request.
Database Retrieval: structured records are well defined; Information Retrieval: documents are typically not well structured.
Database Retrieval: complete specification of the user’s information need; Information Retrieval: incomplete specification of the user’s information need.
Database Retrieval: the user seeks an answer that is a specific fact or piece of information; Information Retrieval: the user is interested in a general topic or subject area and wants to find out more about it.
2.3 Component Processes
Creating document representations (indexing)
Creating request representations (query formation)
Comparing representations (retrieval)
Evaluating retrieved documents (relevance feedback)
2.3 Component Processes (cont.)
Performance
Recall: the fraction of all the relevant documents in the entire collection that are retrieved in response to a query.
Precision: the fraction of the retrieved documents that are relevant.
Average precision: the precision values obtained at each new relevant document in the ranked output for an individual query, averaged.
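As an added illustration (not part of the original slides), a minimal Python sketch of these three measures, assuming a ranked list of document IDs and a set of relevant IDs; the function names are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a retrieved set of documents."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average the precision values obtained at each new relevant document.

    Note: TREC-style average precision divides by the total number of
    relevant documents instead; the slide's wording is followed here.
    """
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

# d2 and d4 are relevant; d2 is found at rank 1, d4 at rank 3.
print(average_precision(["d2", "d1", "d4"], {"d2", "d4"}))  # (1/1 + 2/3) / 2
```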
3 Text Retrieval
3.1 Indexing and Document Representation
3.2 Query Formation
3.3 Matching Query and Document Representations
3.1 Indexing and Document Representation
Terms and Keywords
A list of words extracted from the full-text document.
Construct a stop list to remove useless words.
To handle synonyms, construct a dictionary structure that replaces each word in a synonym class with a single representative term.
A tradeoff exists between normalization and discrimination in the indexing process.
Index Term Weighting
Term frequency: the frequency of occurrence of each term in the document.
For term $t_k$ in document $d_i$: $tf(d_i, t_k)$
Index Term Weighting
Inverse document frequency: weight each term inversely proportional to the number of documents in which the term occurs.
For term $t_k$:

$$idf(t_k) = \log\left(\frac{N}{n_{t_k}}\right)$$

where $N$ is the total number of documents and $n_{t_k}$ is the number of documents containing term $t_k$.
Index Term Weighting
Terms that occur frequently in particular documents but rarely in the overall collection should receive a large weight. Combining term frequency and inverse document frequency, with length normalization:

$$w(d_i, t_k) = \frac{tf(d_i, t_k)\, idf(t_k)}{\sqrt{\sum_j \left[ tf(d_i, t_j)\, idf(t_j) \right]^2}}, \qquad idf(t_k) = \log\left(\frac{N}{n_{t_k}}\right)$$
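A minimal Python sketch of this weighting scheme (an added illustration, not from the original slides); doc_tf, df, and N follow the definitions above:

```python
import math

def tfidf_weights(doc_tf, df, N):
    """Length-normalized tf*idf weights for one document.

    doc_tf: {term: frequency of the term in this document}
    df:     {term: number of documents containing the term}
    N:      total number of documents in the collection
    """
    raw = {t: tf * math.log(N / df[t]) for t, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()} if norm else raw
```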
3.2 Query Formation
Relevance Feedback: the IR system automatically modifies a query based on user feedback about documents retrieved in an initial run.

$$q_{new} = q_{old} + \sum_{d_i \in rel} d_i - \sum_{d_i \in nonrel} d_i$$
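A minimal sketch of this update in Python (an added illustration), treating queries and documents as term-to-weight dicts; the unweighted form matches the formula above, while common Rocchio variants scale each sum:

```python
from collections import defaultdict

def relevance_feedback(q_old, rel_docs, nonrel_docs):
    """q_new = q_old + sum(relevant vectors) - sum(non-relevant vectors)."""
    q_new = defaultdict(float, q_old)
    for d in rel_docs:
        for term, w in d.items():
            q_new[term] += w
    for d in nonrel_docs:
        for term, w in d.items():
            q_new[term] -= w
    return dict(q_new)
```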
3.2 Query Formation
Extracting from a user request a representation of its content.
The indexing methods are also applicable to query formation.
3.3 Matching Query and Document Representations
Boolean Model, Extended Boolean Model
Vector Space Model
Probabilistic Models
Boolean Model
Document representation: binary-valued variables.
True: the term is present in the document; False: the term is absent from the document.
The document can be represented as a binary vector.
Query: Boolean query with AND, OR, and NOT.
Matching function: standard rules of Boolean logic. If the document representation satisfies the query expression, then that document matches the query.
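An illustrative sketch of Boolean matching in Python (added here, with a hypothetical query encoding as nested tuples):

```python
def matches(doc_terms, query):
    """Evaluate a Boolean query against a document's set of terms.

    query is either a term (str) or a tuple:
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q1).
    """
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(doc_terms, q) for q in args)
    if op == "OR":
        return any(matches(doc_terms, q) for q in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(op)

# "speech" AND NOT "video"
print(matches({"speech", "retrieval"}, ("AND", "speech", ("NOT", "video"))))
```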
Extended Boolean Model
The retrieval decision of the Boolean Model may be too harsh.
In the extended Boolean model, for the AND query:

$$sim(q_{and}, d) = 1 - \left[ \frac{(1-d_1)^p + (1-d_2)^p + \cdots + (1-d_K)^p}{K} \right]^{1/p}$$

This is maximal for a document containing all the terms and decreases as the number of matching terms decreases.
Extended Boolean Model
For the OR query:

$$sim(q_{or}, d) = \left[ \frac{d_1^p + d_2^p + \cdots + d_K^p}{K} \right]^{1/p}$$

This is minimal for a document that contains none of the terms and increases as the number of matching terms increases.
The variable p is a constant in the range 1 ≤ p ≤ ∞ that is determined empirically; it is typically in the range 2 ≤ p ≤ 5.
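A small Python sketch of both p-norm similarities (an added illustration), where d is a list of K term weights in [0, 1]:

```python
def pnorm_and(d, p=2.0):
    """Extended Boolean AND: 1 - (mean((1 - d_k)^p))^(1/p)."""
    K = len(d)
    return 1.0 - (sum((1.0 - x) ** p for x in d) / K) ** (1.0 / p)

def pnorm_or(d, p=2.0):
    """Extended Boolean OR: (mean(d_k^p))^(1/p)."""
    K = len(d)
    return (sum(x ** p for x in d) / K) ** (1.0 / p)

print(pnorm_and([1.0, 1.0]), pnorm_or([0.0, 0.0]))  # 1.0 (all match), 0.0 (none)
```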
Vector Space Model
Documents and queries are represented as vectors in a K-dimensional space, where K is the number of indexing terms.

$$sim(q, d) = \frac{\sum_{k=1}^{K} q_k d_k}{\sqrt{\sum_{k=1}^{K} q_k^2}\,\sqrt{\sum_{k=1}^{K} d_k^2}}$$
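A minimal cosine-similarity sketch in Python (an added illustration), with queries and documents as term-to-weight dicts as in the tf*idf sketch above:

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```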
Probabilistic Models
Bayes’ Decision Rule
$p(R|d,q)$ denotes the probability that the document d is relevant to the query q.
$p(\bar{R}|d,q)$ denotes the probability that the document d is non-relevant to the query q.
$C_r$ is the cost of retrieving a non-relevant document; $C_n$ is the cost of not retrieving a relevant document.
The expected cost of retrieving an extraneous document is $C_r\,p(\bar{R}|d,q)$.
Retrieve document d if:

$$C_n\,p(R|d,q) > C_r\,p(\bar{R}|d,q)$$
Probabilistic Models (cont.)
How do we compute the posterior probabilities $p(R|d,q)$ and $p(\bar{R}|d,q)$?
Based on Bayes’ rule:

$$p(R|d,q) = \frac{p(d|R,q)\,p(R|q)}{p(d|q)}, \qquad p(\bar{R}|d,q) = \frac{p(d|\bar{R},q)\,p(\bar{R}|q)}{p(d|q)}$$

$p(R|q)$ and $p(\bar{R}|q)$ are the prior probabilities of relevance and non-relevance of a document.
$p(d|R,q)$ and $p(d|\bar{R},q)$ are the likelihoods or class-conditional probabilities.
Probabilistic Models (cont.)
Taking the ratio, the normalizing term $p(d|q)$ cancels:

$$\frac{p(R|d,q)}{p(\bar{R}|d,q)} = \frac{p(d|R,q)\,p(R|q)}{p(d|\bar{R},q)\,p(\bar{R}|q)}$$

Now we have to estimate $p(d|R,q)$ and $p(d|\bar{R},q)$.
Probabilistic Models (cont.)
Assumptions:
The document vectors are binary, indicating the presence or absence of each indexing term: for the K indexing terms, $d_k \in \{0,1\}$ is the kth term in the document vector d, $k = 1, \ldots, K$.
Each term has a binomial distribution.
There are no interactions between the terms.

$$p(d|R,q) = \prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}, \qquad p(d|\bar{R},q) = \prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k}$$

where $p_k = p(d_k=1|R,q)$ and $q_k = p(d_k=1|\bar{R},q)$.
Probabilistic Models (cont.)
$$
\begin{aligned}
sim(q,d) &= \log\frac{p(R|d,q)}{p(\bar{R}|d,q)}
          = \log\frac{p(R|q)\,p(d|R,q)}{p(\bar{R}|q)\,p(d|\bar{R},q)} \\
         &= \log\frac{p(R|q)\prod_{k=1}^{K} p_k^{d_k}(1-p_k)^{1-d_k}}
                     {p(\bar{R}|q)\prod_{k=1}^{K} q_k^{d_k}(1-q_k)^{1-d_k}} \\
         &= \sum_{k=1}^{K} d_k \log\frac{p_k(1-q_k)}{q_k(1-p_k)}
            + \sum_{k=1}^{K} \log\frac{1-p_k}{1-q_k}
            + \log\frac{p(R|q)}{p(\bar{R}|q)} \\
         &= \sum_{k=1}^{K} d_k w_k + C
\end{aligned}
$$

where the last two sums do not depend on the document and are collected into the constant C.
Probabilistic Models (cont.)
$$w_k = \log\frac{p_k(1-q_k)}{q_k(1-p_k)}$$

$w_k$ is the relevance weight of the kth index term.
Assume $p_k$ is a constant value, 0.5, and $q_k$ is the overall frequency $n_k/N$:

$$w_k = \log\frac{\frac{1}{2}\left(1-\frac{n_k}{N}\right)}{\frac{n_k}{N}\cdot\frac{1}{2}} = \log\frac{N-n_k}{n_k} = \log\left(\frac{N}{n_k}-1\right)$$
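A minimal Python sketch of ranking with these relevance weights under the stated assumptions ($p_k = 0.5$, $q_k = n_k/N$); an added illustration with hypothetical function names:

```python
import math

def relevance_weight(n_k, N):
    """w_k = log((N - n_k) / n_k), assuming p_k = 0.5 and q_k = n_k / N."""
    return math.log((N - n_k) / n_k)

def bim_score(doc_terms, query_terms, df, N):
    """sim(q, d) = sum of d_k * w_k over query terms present in the document."""
    return sum(relevance_weight(df[t], N)
               for t in query_terms if t in doc_terms and 0 < df[t] < N)
```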
4 Differences Between Text and Speech Media
Speech is a richer and more expressive medium than text. (mood, tone)
Robustness of the retrieval models to noise or errors in transcription.
How to accurately extract and represent the contents of a speech message in a form that can be efficiently stored and searched.
5 Information Retrieval of Speech Messages
Speech Message Retrieval
Large Vocabulary Word Recognition Approach
Sub-Word Unit Approach
Word Spotting Approaches
Speech Message Classification and Sorting
Topic Identification
Topic Spotting
Topic Clustering
Large Vocabulary Word Recognition Approach
Suggested by CMU in the Informedia digital video library project.
A user can interact with the text retrieval system to obtain video clips stored in the library that are relevant to his request.
Processing pipeline: sound track of video → large-vocabulary speech recognizer → textual transcript → natural language understanding → full-text information retrieval system.
Sub-Word Unit Approach
Syllabic Units
Phonetic Units
Syllabic Units
VCV-features: sub-word units consisting of a maximal sequence of consonants enclosed between two maximal sequences of vowels.
e.g., INFORMATION yields INFO, ORMA, ATIO.
A subset of these VCV-features is taken as the indexing terms.
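A small Python sketch of VCV-feature extraction as defined above (an added illustration; the systems surveyed derived these units from recognized phone sequences rather than from spelling as done here):

```python
import re

def vcv_features(word):
    """Maximal vowel run + consonant run + vowel run, overlapping on vowels."""
    # The lookahead makes matches overlap: the trailing vowel group of one
    # feature is the leading vowel group of the next.
    pattern = re.compile(r"(?=([AEIOU]+[^AEIOU]+[AEIOU]+))")
    return [m.group(1) for m in pattern.finditer(word.upper())]

print(vcv_features("INFORMATION"))  # ['INFO', 'ORMA', 'ATIO']
```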
Syllabic Units
Criteria for selecting a VCV-feature:
It should occur frequently enough for a reliable acoustic model to be trained for it.
It should not occur so frequently that its ability to discriminate between different messages is poor.
Process
query → VCV-features → tf*idf weights → cosine similarity against the document representations → document with the highest score
Syllabic Units
Major problem: the acoustic confusability of the VCV-features is not taken into account during the selection of indexing features.
Phonetic Units
Uses variable-length phone sequences as indexing features. These features can be viewed as “pseudo-words” and were shown to be useful for detecting or spotting topics in recorded military radio broadcasts.
An automatic procedure based on “digital trees” is used to search the possible subsequences.
A Hidden Markov Model (HMM) phone recognizer with 52 monophone models is used to process the speech.
More domain-independent than a word-based system.
Word Spotting Approaches
Word spotting sits between the simple phonetic and the complex large-vocabulary recognition approaches.
Word spotting has been used in two different ways:
1. A small, fixed number of keywords is selected a priori for both recognition and indexing.
2. The speech messages in the collection are processed and stored in a form (e.g., a phone lattice) that allows arbitrary keywords to be searched for after they are specified by the user.
Speech Message Classification and Sorting
Topic Identification (1)
K keywords; $n_k$ is the binary value indicating the presence or absence of keyword $w_k$.
Find the topic $T_i$ that maximizes the score $S_i$:

$$S_i = \sum_{k=1}^{K} n_k \log\frac{p(T_i, w_k)}{p(T_i)\,p(w_k)}$$
Speech Message Classification and Sorting
Topic Identification (1)
With 6 topics and the top-scoring 40 words for each, there are 240 keywords in total.
Using these keywords on the text transcriptions of the speech messages, 82.4% classification accuracy was achieved.
A genetic algorithm was then used to reduce the number of keywords to 126, with a small drop in classification performance to 78.2%.
Topic Identification (2)
Topic-dependent unigram language models:
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$s_i = \log\prod_{k=0}^{K} p(w_k|T_i)^{n_k} = \sum_{k=0}^{K} n_k \log p(w_k|T_i)$$
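A minimal Python sketch of this unigram topic scoring and classification (an added illustration), assuming precomputed, smoothed keyword probabilities per topic:

```python
import math

def topic_score(counts, keyword_probs):
    """s_i = sum over keywords of n_k * log p(w_k | T_i)."""
    return sum(n * math.log(keyword_probs[w])
               for w, n in counts.items() if w in keyword_probs)

def classify(counts, topic_models):
    """Pick the topic whose unigram model gives the highest score."""
    return max(topic_models, key=lambda t: topic_score(counts, topic_models[t]))
```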
Topic Identification (2)
Number of keywords vs. topic classification accuracy:
All 8431 words in the recognition vocabulary: 72.5%
A subset of 4600 words, selected with a χ² hypothesis test based on contingency tables to pick the “important” keywords: 74%
A genetic algorithm search then used to reduce the set to 203 words: 70%
Topic Identification (3)
The length-normalized topic score:
N is the total number of words in the speech message.
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ in the set of class $T_i$ messages.

$$S_i = \frac{1}{N}\sum_{k=0}^{K} n_k \log p(w_k|T_i)$$
Topic Identification (3)
With 750 keywords, classification accuracy is 74.6%.
Topic Identification (4)
The topic model is extended to a mixture of multinomials:
M is the number of multinomial model components.
$\pi_m$ is the weight of the mth multinomial component.
K is the number of keywords in the indexing vocabulary.
$n_k$ is the number of times keyword $w_k$ occurs in the speech message.
$p_m(w_k|T_i)$ is the unigram or occurrence probability of keyword $w_k$ under the mth component for class $T_i$ messages.

$$S_i = \log\left\{\sum_{m=1}^{M}\pi_m\prod_{k=0}^{K} p_m(w_k|T_i)^{n_k}\right\}$$
Topic Identification (4)
Experiments indicate that the more complex mixture models do not perform as well as the simple single-component model.
Topic Spotting (1)
The “usefulness” measure quantifies how discriminating a word is for the topic:

$$u(w_k, T) = p(w_k|T)\log\frac{p(w_k|T)}{p(w_k|\bar{T})}$$

$p(w_k|T)$ and $p(w_k|\bar{T})$ are the probabilities of detecting the keyword in the topic and in the unwanted (off-topic) messages.
This measure selects words that occur often in the topic and have high discriminability.
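A small Python sketch of the usefulness measure for ranking candidate keywords (an added illustration; the detection probabilities are assumed to be precomputed):

```python
import math

def usefulness(p_topic, p_nontopic):
    """u(w, T) = p(w|T) * log(p(w|T) / p(w|not T))."""
    return p_topic * math.log(p_topic / p_nontopic)

# A word detected in 8% of on-topic messages but 1% of off-topic ones
# scores higher than one detected equally often in both.
print(usefulness(0.08, 0.01), usefulness(0.08, 0.08))
```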
Topic Spotting (2)
Topic spotting is performed by accumulating, over a window of speech (typically 60 seconds), the log likelihood ratios of the detected keywords to produce a topic score for that region of the speech message:

$$s = \sum_{k=1}^{K} n_k \log\frac{p(w_k|T)}{p(w_k|\bar{T})}$$
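A minimal sketch of this windowed scoring in Python (an added illustration), assuming a list of (time, keyword) detections from a word spotter and precomputed log likelihood ratios:

```python
def window_scores(detections, llr, window=60.0, step=10.0):
    """Accumulate keyword log likelihood ratios over sliding windows.

    detections: list of (time_in_seconds, keyword) pairs
    llr:        {keyword: log(p(w|T) / p(w|not T))}
    """
    if not detections:
        return []
    end = max(t for t, _ in detections)
    scores, start = [], 0.0
    while start <= end:
        s = sum(llr[w] for t, w in detections
                if start <= t < start + window and w in llr)
        scores.append((start, s))  # topic score for this region of speech
        start += step
    return scores
```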
Topic Spotting (2)
Models that try to capture dependencies between the keywords have also been examined.
$\mathbf{w}$ represents the vector of keywords and $\lambda_k$ is a coefficient of the log-linear model:

$$\log\frac{p(T|\mathbf{w},\mathbf{n})}{p(\bar{T}|\mathbf{w},\mathbf{n})} = \sum_{k=0}^{K}\lambda_k\,n_k\log\frac{p(w_k|T)}{p(w_k|\bar{T})} + \log\frac{p(T)}{p(\bar{T})}$$

Their experiments show that using a carefully chosen log-linear model can give topic spotting performance that is better than using the basic model that assumes keyword independence.
Topic Clustering
Try to discover structure or relationships between messages in a collection.
The clustering process:
Tokenization
Similarity computation
Clustering
Topic Clustering (cont.)
Tokenization: come up with a suitable representation of the speech message that can be used in the next two steps.
Similarity computation: every pair of messages needs to be compared; an N-gram model is used.
Clustering: using hierarchical tree clustering or nearest-neighbor classification.
Works well on true transcription texts: figure of merit (FOM) rates around 90%.
Speech input is worse than text: the FOM drops to about 70% using recognition output, unigram language models, and tree-based clustering.
Thank you.