A Survey on Automatic Text/Speech Summarization
Shih-Hsiang Lin (林士翔)
Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. D. Das and A. F. T. Martins, A Survey on Automatic Text Summarization, 2007.
2. Y. T. Chen et al., A probabilistic generative framework for extractive broadcast news speech summarization, IEEE Trans. on ASLP, 2009.
3. Hovy’s tutorial, Automated Text Summarization Tutorial, COLING/ACL 1998.
4. Radev’s tutorial, Text summarization, SIGIR 2004.
5. Berlin’s lecture, A Brief Review of Extractive Summarization Research, 2008.
NLP Related Technologies
Outline
• Introduction
• Single-Document Summarization
– Early Work
– Supervised Methods
– Unsupervised Methods
• Multi-Document Summarization
– Not available yet …
• Evaluation
– ROUGE
– Information-Theoretic Method
Introduction
• The subfield of summarization has been investigated by the NLP community for nearly half a century
– A summary is “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that” (Radev, 2000)
    · Summaries may be produced from a single document or multiple documents
    · Summaries should preserve important information
    · Summaries should be short
• Terminology in the summarization dialect
– Extraction: identify important sections of the text
– Abstraction: produce important material in a new way
– Fusion: combine extracted parts coherently
– Compression: throw out unimportant sections of the text
– Indicative vs. Informative vs. Critical
– Generic vs. Query-oriented
– Single-Document Summarization vs. Multi-Document Summarization
Introduction (cont.)
• Input (Jones, 1997)
– Subject type: domain
– Genre: newspaper articles, editorials, letters, reports...
– Form: regular text structure; free-form
– Source size: single doc; multiple docs (few; many)
• Purpose
– Situation: embedded in larger system (MT, IR) or not?
– Audience: focused or general
– Usage: IR, sorting, skimming...
• Output
– Completeness: include all aspects, or focus on some?
– Format: paragraph, table, etc.
– Style: informative, indicative, critical...
*This slide was adopted from Prof. Hovy’s presentation
Introduction (cont.)
• A Summarization Machine
*This slide was adopted from Prof. Hovy’s presentation
Introduction (cont.)
• A brief history of summarization
Speech Summarization
• Fundamental problems with speech summarization
– Disfluencies, hesitations, repetitions, repairs, …
– Difficulties of sentence segmentation
– More spontaneous parts of speech (e.g. interviews in broadcast news) are less amenable to standard text summarization
– Speech recognition errors
• Speech Summarization
– Speech-to-text summarization
    · The documents can be easily looked through
    · The part of the documents that is interesting for users can be easily extracted
    · Information extraction and retrieval techniques can be easily applied to the documents
– Speech-to-speech summarization
    · Wrong information due to speech recognition errors can be avoided
    · Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be presented
*This slide was adopted from Prof. Furui’s presentation
Single-Document Summarization Early Work
• The most cited paper on summarization is that of Luhn (1958)
– The frequency of a particular word in an article provides a useful measure of its significance
– Several key ideas put forward in this paper assumed importance in later work on summarization
    · Words were stemmed to their root forms, and stop words were deleted
    · A list of content words was compiled and sorted by decreasing frequency, the index providing a significance measure of the word
    · A significance factor was derived that reflects the number of occurrences of significant words within a sentence
    · All sentences were ranked in order of their significance factor, and the top-ranking sentences were selected to form the auto-abstract
• Baxendale (1958) also suggested that “sentence position” is helpful in finding salient parts of documents
– An examination of 200 paragraphs found that in 85% of the paragraphs the topic sentence came first and in 7% it came last
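The frequency-based procedure described for Luhn (1958) can be sketched roughly as follows. This is a simplified illustration, not Luhn's exact method: it skips stemming, uses a tiny made-up stop list, and takes the significance factor to be a plain count of significant words in the sentence. The names `STOP_WORDS`, `tokenize` and `luhn_style_summary` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not Luhn's original list).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "it", "on"}

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def luhn_style_summary(sentences, top_words=3, num_sentences=1):
    """Rank sentences by how many high-frequency content words they contain."""
    # Count content words over the whole document (no stemming in this sketch).
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    significant = {w for w, _ in counts.most_common(top_words)}

    # Simplified significance factor: occurrences of significant words.
    def score(sentence):
        return sum(1 for w in tokenize(sentence) if w in significant)

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(ranked[:num_sentences])]

doc = [
    "Summarization systems select important sentences.",
    "The weather was pleasant yesterday.",
    "Important sentences contain frequent content words, "
    "and summarization relies on such sentences.",
]
print(luhn_style_summary(doc))
```

On this toy document the third sentence is selected, because it contains the most occurrences of the three most frequent content words.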
Single-Document Summarization Early Work (cont.)
• Edmundson (1969) describes a system that produces document extracts
– His primary contribution was the development of a typical structure for an extractive summarization experiment (400 technical documents)
– Four kinds of features are used
    · Word frequency
    · Positional feature
    · Cue words: presence of words like “significant” or “hardly”
    · The skeleton of the document: whether the sentence is a title or heading
– Weights were attached to each of these features manually to score each sentence
    · About 44% of the auto-extracts matched the manual extracts
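Scoring with manually attached weights amounts to a weighted linear combination of feature values. The sketch below assumes that form. The weights and feature values are invented for illustration and are not Edmundson's actual numbers.

```python
def edmundson_score(features, weights):
    """Score a sentence as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical hand-set weights, one per feature type named on the slide.
WEIGHTS = {"frequency": 1.0, "position": 2.0, "cue": 1.5, "skeleton": 3.0}

# Hypothetical feature values for two sentences.
sentences = {
    "title_like": {"frequency": 2.0, "position": 1.0, "cue": 0.0, "skeleton": 1.0},
    "body": {"frequency": 3.0, "position": 0.0, "cue": 1.0, "skeleton": 0.0},
}

scores = {name: edmundson_score(f, WEIGHTS) for name, f in sentences.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these made-up numbers the heading-like sentence outranks the body sentence because of the large skeleton weight.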
Single-Document Summarization Supervised Methods
• In the 1990s, with the advent of machine learning techniques in NLP, a series of seminal publications appeared that employed statistical techniques to produce document extracts
• Kupiec et al. (1995) used a naive-Bayes classifier to categorize each sentence as worthy of extraction or not
– Let $s$ be a particular sentence, $S$ the set of sentences that make up the summary, and $F_1, \ldots, F_k$ the features
– Assuming independence of the features
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid s \in S)\, P(s \in S)}{\prod_{i=1}^{k} P(F_i)}$$
– Two additional features are used: sentence length and the presence of uppercase words
– Feature analysis revealed that a system using only the position and the cue features, along with the sentence length, performed best
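As a rough sketch of the naive-Bayes scoring rule above, the probabilities can be estimated by counting over labeled sentences. The two binary features and the training examples are invented for illustration. The estimates are unsmoothed, so the sketch assumes every feature value being scored was seen in training.

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """Estimate P(s in S), P(F_i | s in S) and P(F_i) by counting.

    `examples` is a list of (features, in_summary) pairs, where `features`
    is a tuple of discrete feature values.
    """
    n = len(examples)
    n_summary = sum(1 for _, in_summary in examples if in_summary)
    cond = defaultdict(int)      # counts of (i, value) among summary sentences
    marginal = defaultdict(int)  # counts of (i, value) among all sentences
    for feats, in_summary in examples:
        for i, value in enumerate(feats):
            marginal[(i, value)] += 1
            if in_summary:
                cond[(i, value)] += 1
    return cond, marginal, n_summary, n

def posterior(feats, model):
    """P(s in S | F_1..F_k) under the feature-independence assumption."""
    cond, marginal, n_summary, n = model
    p = n_summary / n  # prior P(s in S)
    for i, value in enumerate(feats):
        p_f_given_s = cond[(i, value)] / n_summary
        p_f = marginal[(i, value)] / n  # no smoothing: value must be seen
        p *= p_f_given_s / p_f
    return p

# Hypothetical features: (is_in_first_paragraph, has_cue_phrase).
training = [
    ((1, 1), True), ((1, 0), True),
    ((0, 0), False), ((0, 1), False), ((1, 0), False), ((0, 0), False),
]
model = train_naive_bayes(training)
print(posterior((1, 1), model), posterior((0, 0), model))
```

Sentences are then ranked by this score and the top ones extracted. Because the independence assumption is only an approximation, the value is best read as a ranking score.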
Single-Document Summarization Supervised Methods (cont.)
• Aone et al. (1999) also incorporated a naive-Bayes classifier, but with richer features
– Signature words: derived from term frequency (TF) and inverse document frequency (IDF)
$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$ and $|D|$ is the number of documents
– Named-entity tagger
– Shallow discourse analysis
– Synonyms and morphological variants were also merged (accomplished using WordNet)
• Lin and Hovy (1997) studied the importance of the sentence position feature
– Because discourse structure varies significantly across domains, a single fixed position rule does not carry over from one genre to another
– They made an important contribution by investigating techniques for tailoring the position method towards optimality over a genre
    · Measured the yield of each sentence position against the topic keywords
    · Then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre
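A minimal sketch of the TF and IDF weighting used for signature words, assuming documents are already tokenized. The function name `tf_idf` and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

docs = [["speech", "summary", "speech"], ["text", "summary"], ["speech", "news"]]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every document gets weight zero, since its IDF is log 1 = 0. Terms concentrated in few documents are the ones that surface as signature words.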
Single-Document Summarization Supervised Methods (cont.)
• Lin (1999) broke away from the assumption that features are independent of each other
– He modeled the problem of sentence extraction using decision trees instead of a naive-Bayes classifier
– Some novel features were introduced in his paper
    · Query signature: normalized score given to sentences depending on the number of query words that they contain
    · IR signature: score given to sentences depending on the number and scores of IR signature words included (the m most salient words in the corpus)
    · Average lexical connectivity: the number of terms shared with other sentences divided by the total number of sentences in the text
    · Numerical data: value 1 when the sentence contains a number
    · Proper name, Pronoun or Adjective, Weekday or Month, Quotation: binary features, like the previous one
    · Sentence length, Sentence order
– Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958)
Single-Document Summarization Supervised Methods (cont.)
• Conroy and O'Leary (2001) modeled the problem of extracting a sentence from a document using a hidden Markov model (HMM)
– The HMM was structured as follows
    · $2s+1$ states, alternating between $s$ summary states and $s+1$ non-summary states
    · “Hesitation” was allowed only in non-summary states and “skipping next state” only in summary states
    · The transition matrix $\hat{M}$ can be estimated from a training corpus: element $\hat{M}_{i,j}$ is the empirical probability of transitioning from state $i$ to state $j$
    · Associated with each state $i$ was an output function $b_i(O) = P(O \mid \text{state } i)$, where $O$ is the observed feature vector; the features are assumed to be multivariate normally distributed, and the training data are used to compute the maximum likelihood estimates of the mean and the covariance matrix (shared covariance)
– Three features are used: position of the sentence, number of terms in the sentence, and likelihood of the sentence terms given the document terms
Single-Document Summarization Supervised Methods (cont.)
• Osborne (2002) used log-linear models to obviate the assumption of feature independence
– Let $c$ be a label, $s$ the item we are interested in labeling, $f_i$ the i-th feature and $\lambda_i$ the corresponding feature weight
– The conditional log-linear model can be stated as follows
$$P(c \mid s) = \frac{1}{Z(s)} \exp\Big(\sum_i \lambda_i f_i(c, s)\Big)$$
where $Z(s) = \sum_{c'} \exp\big(\sum_i \lambda_i f_i(c', s)\big)$ is the normalizer
– The author added a non-uniform prior to the model, claiming that a log-linear model tends to reject too many sentences for inclusion in a summary
$$\mathrm{label}(s) = \arg\max_{c \in C} P(c)\, P(c \mid s) = \arg\max_{c \in C} \Big(\log P(c) + \sum_i \lambda_i f_i(c, s)\Big)$$
– The features included word pairs, sentence length, sentence position, and naive discourse features like “inside introduction” or “inside conclusion”
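The log-linear posterior and the prior-adjusted decision rule can be sketched as follows. The two features, their weights, and the prior values are hypothetical, chosen only to show how a non-uniform prior can flip a borderline rejection into an inclusion.

```python
import math

LABELS = ("keep", "drop")

def features(c, s):
    """Hypothetical binary features that fire only for the 'keep' label."""
    return [float(c == "keep" and s["first"]), float(c == "keep" and s["long"])]

def linear_score(c, s, weights):
    """sum_i lambda_i * f_i(c, s)."""
    return sum(lam * f for lam, f in zip(weights, features(c, s)))

def posterior(s, weights):
    """P(c | s) = exp(sum_i lambda_i f_i(c, s)) / Z(s)."""
    exp_scores = {c: math.exp(linear_score(c, s, weights)) for c in LABELS}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

def label(s, weights, prior):
    """argmax_c [log P(c) + sum_i lambda_i f_i(c, s)]."""
    return max(LABELS, key=lambda c: math.log(prior[c]) + linear_score(c, s, weights))

weights = [1.5, -0.5]                      # made-up feature weights
sentence = {"first": False, "long": True}  # a borderline sentence

print(posterior(sentence, weights))
print(label(sentence, weights, {"keep": 0.5, "drop": 0.5}))  # uniform prior
print(label(sentence, weights, {"keep": 0.7, "drop": 0.3}))  # prior favoring inclusion
```

The normalizer Z(s) does not depend on the label, which is why it can be dropped inside the arg max.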
Single-Document Summarization Supervised Methods (cont.)
• Svore et al. (2007) proposed an algorithm based on neural nets and the use of third-party datasets to perform extractive summarization
– They trained a model that could infer the proper ranking of sentences
    · The ranking was accomplished using RankNet, which is based on neural networks
    · For the training set, they used ROUGE-1 to score the similarity of a human-written highlight and a sentence in the document
    · These similarity scores were used as “soft labels” during training, in contrast with other approaches where sentences are “hard-labeled” as selected or not
– Another novelty of the framework lay in the use of features derived from query logs of Microsoft's news search engine and from Wikipedia entries (third-party datasets)
    · They conjectured that if a document sentence contains keywords used in the news search engine, or entities found in Wikipedia articles, then there is a greater chance of that sentence being in the highlight
– They generated 10 features for each sentence in each document
    · Is first sentence, Sentence position, SumBasic score (unigram), SumBasic bigram score, Title similarity score, Average News Query Term Score, News Query Term Sum Score, Relative News Query Term Score, Average Wikipedia Entity Score, Wikipedia Entity Sum Score
Single-Document Summarization Supervised Methods (cont.)
• Other kinds of supervised summarizers include
– Support vector machine (SVM) (Hirao et al. 2002)
– Gaussian Mixture Models (GMM) (Murray et al. 2005)
– Conditional Random Fields (CRFs) (Shen et al. 2007)
• In general, extractive summarization can be treated as a two-class (summary/non-summary) classification problem (Lin et al. 2009)
– A sentence $S_i$ is represented by a set of $M$ features, $X_i = \{x_{i1}, \ldots, x_{im}, \ldots, x_{iM}\}$
– To summarize documents with different summary ratios, the important sentences of a document can be selected (or ranked) based on the posterior probability of a sentence $S_i$ being included in the summary given the feature set $X_i$
Single-Document Summarization Unsupervised Methods
• Gong (2001) proposed using the vector space model (VSM)
– Sentences and the document to be summarized are represented as vectors using statistical weighting, such as TF-IDF
– Sentences are ranked based on their proximity to the document
$$sim(S_i, D) = \frac{\vec{S_i} \cdot \vec{D}}{|\vec{S_i}|\, |\vec{D}|}$$
– Maximum Marginal Relevance (MMR) (Murray et al. 2005) can be applied so that the summary covers concepts that are both important and different from one another
$$S_{MMR}(i) = a \cdot Sim(S_i, D) - (1 - a) \cdot Sim(S_i, Summ)$$
where $Summ$ is the summary selected so far and $a$ trades off relevance against redundancy
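A small sketch of cosine-based ranking with greedy MMR selection. To stay self-contained it uses raw term counts rather than TF-IDF weights, which is a simplification. The function names and the toy sentences are illustrative.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mmr_summary(sentences, num_sentences=2, a=0.5):
    """Greedily pick sentences maximizing a*Sim(S_i, D) - (1-a)*Sim(S_i, Summ)."""
    vectors = [Counter(s) for s in sentences]
    doc = Counter(t for s in sentences for t in s)
    selected = []
    while len(selected) < min(num_sentences, len(sentences)):
        # Vector of the summary built so far (empty on the first pass).
        summ = Counter(t for i in selected for t in sentences[i])
        candidates = [i for i in range(len(sentences)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: a * cosine(vectors[i], doc)
            - (1 - a) * cosine(vectors[i], summ),
        )
        selected.append(best)
    return selected

sentences = [
    ["stocks", "fell", "sharply"],
    ["stocks", "fell", "sharply", "today"],
    ["rain", "is", "expected"],
]
print(mmr_summary(sentences, 2, a=0.5))  # penalizes the near-duplicate
print(mmr_summary(sentences, 2, a=1.0))  # pure relevance ranking
```

With `a=1.0` the two near-identical sentences are both chosen. Lowering `a` swaps the redundant one for the sentence about a different topic.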
Single-Document Summarization Unsupervised Methods (cont.)
• Latent Semantic Analysis (LSA) (Gong 2001)
– Construct a “term-sentence” matrix for a given document
– Perform Singular Value Decomposition (SVD) on the “term-sentence” matrix
    · The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document
    · Represent each sentence of a document as a semantic vector in the reduced space
[Figure: SVD of the term-sentence matrix, $A = U \Sigma V^{T}$. $A$ is the term-sentence matrix, with rows for the $J$ content words $w_1, w_2, \ldots, w_J$ and columns for the $M$ sentences $S_1, S_2, \ldots, S_M$; $U$ is the left singular vector matrix, whose row $j$ carries the information of word $j$; $\Sigma$ is the singular value matrix with entries $\sigma_1, \sigma_2, \ldots, \sigma_K$; $V^{T}$ is the right singular vector matrix, whose column $i$ carries the information of sentence $i$]
Single-Document Summarization Unsupervised Methods (cont.)
• Probabilistic Generative Framework (Chen et al. 2009)
– Criterion: Maximum a posteriori (MAP)
$$P(S_i \mid D) = \frac{P(D \mid S_i)\, P(S_i)}{P(D)} \overset{rank}{=} P(D \mid S_i)\, P(S_i)$$
– Sentence Generative Model $P(D \mid S_i)$
    · Each sentence of the document is treated as a probabilistic generative model
    · Language Model (LM), Sentence Topic Model (STM) and Word Topic Model (WTM) are initially investigated
– Sentence Prior Model $P(S_i)$
    · The sentence prior is simply set to uniform here
    · It may instead depend on duration/position, correctness of sentence boundary, confidence score, prosodic information, etc.
Single-Document Summarization Unsupervised Methods (cont.)
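– Language Model (LM) Approach (Literal Term Matching)
$$P_{LM}(D \mid S_i) = \prod_{w_j \in D} \big[\lambda\, P(w_j \mid S_i) + (1 - \lambda)\, P(w_j \mid C)\big]^{c(w_j, D)}$$
    · $P(w_j \mid S_i)$: the sentence model
    · $P(w_j \mid C)$: the collection model
    · $\lambda$: a weighting parameter
    · $c(w_j, D)$: the number of occurrences of $w_j$ in $D$
– Sentence Topic Model (STM) Approach (Concept Matching)
$$P_{STM}(D \mid S_i) = \prod_{w_j \in D} \Big[\sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid S_i)\Big]^{c(w_j, D)}$$
– Word Topic Model (WTM) Approach (Concept Matching)
$$P_{WTM}(D \mid S_i) = \prod_{w_j \in D} \Big[\sum_{w_m \in S_i} \alpha_{i,m} \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid M_{w_m})\Big]^{c(w_j, D)}$$
    · $M_{w_m}$: the word topic model of $w_m$; $\alpha_{i,m}$: the weight of $w_m$ in $S_i$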
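The LM approach can be sketched in log space as below. For the sake of a runnable example the collection model is estimated from the document itself, which is an assumption made here so that every document word has nonzero probability. The value `lam=0.5` and the toy sentences are likewise arbitrary.

```python
import math
from collections import Counter

def lm_log_likelihood(document, sentence, collection, lam=0.5):
    """log P_LM(D | S_i): sentence model interpolated with a collection model."""
    sent_counts = Counter(sentence)
    coll_counts = Counter(collection)
    log_p = 0.0
    for word, count in Counter(document).items():
        p_sent = sent_counts[word] / len(sentence)    # P(w_j | S_i)
        p_coll = coll_counts[word] / len(collection)  # P(w_j | C)
        log_p += count * math.log(lam * p_sent + (1 - lam) * p_coll)
    return log_p

sentences = [
    ["typhoon", "hits", "taiwan"],
    ["typhoon", "damage", "reported"],
    ["weather", "fine"],
]
document = [w for s in sentences for w in s]
collection = document  # assumption for this demo only

scores = [lm_log_likelihood(document, s, collection) for s in sentences]
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

With a uniform sentence prior, ranking by this log-likelihood is equivalent to ranking by the MAP criterion. Sentences that share the document's frequent words score higher.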
Multi-Document Summarization
• Task Characteristics
– Input: a set of documents on the same topic
    · Retrieved during an IR search
    · Clustered by a news browser
    · Problem: same topic or same event?
– Output: a paragraph-length summary
    · Salient information across documents
    · Similarities between topics?
– Redundancy removal is critical
• Application-oriented task
– News portal, presenting articles from different sources
– Corporate emails organized by subject
– Medical reports about a patient
Evaluation
• Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004)
– Let $R = \{r_1, \ldots, r_m\}$ be a set of reference summaries, and let $s$ be a summary generated automatically by a system. Let $\Phi_n(d)$ be a binary vector representing the n-grams contained in a document $d$
– The metric ROUGE-N is an n-gram recall-based statistic
$$\mathrm{ROUGE\text{-}N}(s) = \frac{\sum_{r \in R} \langle \Phi_n(r), \Phi_n(s) \rangle}{\sum_{r \in R} \langle \Phi_n(r), \Phi_n(r) \rangle}$$
where $\langle \cdot, \cdot \rangle$ denotes the usual inner product of vectors
– The various versions of ROUGE were evaluated by computing the correlation coefficient between ROUGE scores and human judgment scores
    · ROUGE-2 performed the best among the ROUGE-N variants
– Example (word-segmented sentences): “yesterday / Ma Ying-jeou / visit / mainland China” and “yesterday / Ma Ying-jeou / end / visit / return home”
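With binary n-gram vectors, the inner products in ROUGE-N reduce to set intersections, which the sketch below exploits. The tokens are English glosses of the two word-segmented example sentences. Treating the first as the reference and the second as the system summary is an assumption made only for this illustration.

```python
def ngrams(tokens, n):
    """Set of n-grams in a token list (the support of the binary vector)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(candidate, references, n=1):
    """N-gram recall of `candidate` against a list of reference summaries."""
    cand = ngrams(candidate, n)
    overlap = sum(len(ngrams(r, n) & cand) for r in references)
    total = sum(len(ngrams(r, n)) for r in references)
    return overlap / total if total else 0.0

reference = ["yesterday", "Ma Ying-jeou", "visit", "mainland China"]
candidate = ["yesterday", "Ma Ying-jeou", "end", "visit", "return home"]

print(rouge_n(candidate, [reference], n=1))
print(rouge_n(candidate, [reference], n=2))
```

Three of the four reference unigrams appear in the candidate, but only one of the three reference bigrams does, so ROUGE-2 is noticeably lower than ROUGE-1 for this pair.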
Evaluation (cont.)
• Lin et al. (2006) also proposed an information-theoretic method for the automatic evaluation of summaries
– The central idea is to use a divergence measure (namely the Jensen-Shannon divergence) between a pair of probability distributions
    · The first distribution is derived from an automatic summary and the second from a set of reference summaries
– Let $D = \{d_1, \ldots, d_n\}$ be the set of documents to summarize
    · A distribution parameterized by $\theta_R$ generates reference summaries
    · A summarization system is governed by some distribution parameterized by $\theta_A$
    · We may define a good summarizer as one for which $\theta_A$ is close to $\theta_R$
    · One information-theoretic measure between distributions that is adequate for this is the KL divergence
$$KL(p^{\theta_A} \,\|\, p^{\theta_R}) = \sum_{i=1}^{m} p_i^{\theta_A} \log \frac{p_i^{\theta_A}}{p_i^{\theta_R}}$$
    · However, the KL divergence is unbounded and goes to infinity whenever $p_i^{\theta_R}$ vanishes and $p_i^{\theta_A}$ does not
    · Another problem is that the KL divergence is not symmetric
Evaluation (cont.)
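– Hence, they proposed to use the Jensen-Shannon divergence, which is bounded and symmetric
$$JS(p^{\theta_A} \,\|\, p^{\theta_R}) = \frac{1}{2} KL(p^{\theta_A} \,\|\, r) + \frac{1}{2} KL(p^{\theta_R} \,\|\, r) = H(r) - \frac{1}{2} H(p^{\theta_A}) - \frac{1}{2} H(p^{\theta_R})$$
where $r = \frac{1}{2} p^{\theta_A} + \frac{1}{2} p^{\theta_R}$ and $H(\cdot)$ denotes entropy
– To evaluate a summary $S_A$ given a reference summary $S_R$, the negative JS divergence can be used
$$Score(S_A \mid S_R) = -JS\big(p(\theta_A \mid S_A) \,\|\, p(\theta_R \mid S_R)\big)$$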
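The divergences above are easy to check numerically. The sketch below treats the two distributions as plain probability lists over the same vocabulary. The values in `p_a` and `p_r` are made up, and how such distributions are estimated from summaries is outside this sketch.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution r."""
    r = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

def score(p_summary, p_reference):
    """Negative JS divergence: higher means closer to the reference."""
    return -js(p_summary, p_reference)

# Hypothetical word distributions for an automatic and a reference summary.
p_a = [0.5, 0.5, 0.0]
p_r = [0.0, 0.5, 0.5]

r = [0.5 * (a + b) for a, b in zip(p_a, p_r)]
print(js(p_a, p_r), entropy(r) - 0.5 * entropy(p_a) - 0.5 * entropy(p_r))
print(score(p_a, p_r))
```

Here KL(p_a || p_r) would be infinite because p_r assigns zero mass to the first event, while the JS divergence stays finite and never exceeds log 2.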