A Text Categorization Approach to Automatic Image Annotation
Chin-Hui Lee (李錦輝)
School of ECE, Georgia Institute of Technology
Atlanta, GA 30332-0250, [email protected]
Joint work with Sheng Gao, Institute for Infocomm Research, Singapore
Slide 2: Outline
• Text Categorization (TC): a unified scenario
  – High-dimensional feature extraction (LSA)
  – Discriminative classifier design (MFoM learning)
• Information theory perspective
  – Shannon's study on the entropy of English letters
• From text to multimedia documents and back
  – Tokenization of multimedia patterns into audiovisual alphabets
  – Formation of audiovisual words and LSA-based feature vectors
  – Multimedia pattern recognition through vector-based TC
• Multimedia retrieval applications
  – Automatic image annotation
  – Audio fingerprint and music retrieval
• Summary
Slide 3: Concept & Content Based Image Retrieval
• Indexing and retrieval of photos
• Content-based example search does not give good performance
• Concept-based keyword search
  – GUI
  – Multimedia UI
  – Multilingual VUI
Slide 4: Image Segmentation & Annotation
• Picasso's "Parade" (1917): a picture is worth more than a thousand words?
• [Image] Segmented and annotated as "clown, ball, tree, flying horse, people"
• Do we need image segmentation for object recognition?
Slide 5: Multilingual Image Annotation (IIS)
Top 4 keywords per image:
• 彩虹 (Rainbow), 天氣 (Weather), 花 (Flower), 自然 (Nature)
• 向日葵 (Sunflower), 花 (Flower), 植物 (Plant), 沙漠 (Desert)
• 海豹 (Seal), 哺乳類 (Mammal), 海岸 (Coast), 動物 (Animal)
• 太陽系 (Solar System), 慧星 (Comet), 熱帶魚 (Tropical Fish), 太空 (Universe)
• 瀑布 (Waterfall), 地形 (Landform), 自然 (Nature), 蟑螂 (Cockroach)
• 狗 (Dog), 哺乳類 (Mammal), 穿山甲 (Pangolin), 羊 (Sheep)
Slide 6: Text Categorization – A Unifying Scenario
[Diagram] An unknown document dj is fed to m classifiers T1, …, Tm. Each classifier Ti outputs a decision Ti(dj), which is compared in an evaluation step against Li(dj), the label of dj for category Ci.
Slide 7: Text Categorization – Training Classifiers
[Diagram] For each category Ci, i = 1, …, m, a training set of positive (Pi) and negative (Ni) documents goes through two steps: (1) feature extraction and reduction maps the documents into a new feature space; (2) classifier learning then produces the classifier Ti for category Ci.
Slide 8: Vector Space Representation of Documents and Queries
[Diagram] Documents and queries are points in a common vector space; the example shows document clusters for Consumer Lending, Home Equity Service, Deposit Services, Credit Card Services, and Loan Servicing, with an unknown query placed among the other documents.
Slide 9: LSA Based Feature Extraction
• LSA matrix (also known as a routing matrix) C, with entries

    c_ij = (1 − ε_i) · n_ij / n_j

  where
  – n_ij: number of times word w_i occurs in document A_j
  – n_j: total number of words present in A_j (column sum)
  – n_i: total number of occurrences of w_i in corpus A (row sum)
  – ε_i: normalized entropy of w_i,

    ε_i = − (1 / log N) · Σ_{j=1..N} (n_ij / n_i) · log(n_ij / n_i),   0 ≤ ε_i ≤ 1

  – "indexing" power of w_i in corpus A: η_i = 1 − ε_i
  – ε_i = 0 if w_i has maximum indexing power (n_ij = n_i, i.e., all occurrences in one document)
  – ε_i = 1 if w_i has no indexing power (equally probable, n_ij = n_i / N)
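The entropy-based weighting above can be sketched in a few lines of Python; the 4×3 word-by-document count matrix below is an invented toy example, not data from the talk:

```python
import numpy as np

# Hypothetical 4-word x 3-document count matrix: n[i, j] is the number
# of times word w_i occurs in document A_j.
n = np.array([
    [3, 0, 0],   # word concentrated in one document
    [1, 1, 1],   # word spread evenly across all documents
    [0, 2, 1],
    [5, 0, 2],
], dtype=float)

N = n.shape[1]                      # number of documents
n_i = n.sum(axis=1, keepdims=True)  # total occurrences of w_i (row sum)
n_j = n.sum(axis=0)                 # total words in A_j (column sum)

# Normalized entropy eps_i of w_i over documents, 0 <= eps_i <= 1:
# eps_i = -(1/log N) * sum_j (n_ij/n_i) log(n_ij/n_i), with 0 log 0 = 0.
p = np.divide(n, n_i, out=np.zeros_like(n), where=n_i > 0)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
eps = -plogp.sum(axis=1) / np.log(N)

# LSA matrix entries c_ij = (1 - eps_i) * n_ij / n_j: the indexing power
# 1 - eps_i is 1 for a word concentrated in one document, 0 for a word
# spread uniformly across all documents.
C = (1.0 - eps)[:, None] * n / n_j
```

The first toy word (all counts in one document) gets ε = 0 and full weight; the uniformly spread word gets ε = 1 and is zeroed out of C.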
Slide 10: Maximal Figure-of-Merit Learning (1)
• Generalized score function for HD vector – no pdf
• Misclassification function: simulating the discrete decision rule to be embedded in MFoM learning
• Binary tree classifier with LDF:

    d_j(X; W) = −f(X; W) = −(w_0 + Σ_{k=1..R} w_k x_k)   for C_1
    d_j(X; W) =  f(X; W) =   w_0 + Σ_{k=1..R} w_k x_k    for C_0

• Multi-category classification:

    d_j(X; Λ) = −g_j(X; Λ) + log { (1/(N−1)) · Σ_{i≠j} e^{η·g_i(X; Λ)} }^(1/η)
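The multi-category misclassification measure above (negative target score plus an η-smoothed max over competing scores) can be sketched as follows; the score values are hypothetical:

```python
import numpy as np

def misclassification(g, j, eta=2.0):
    """Multi-category misclassification measure d_j(X; Lambda):
    d_j = -g_j + (1/eta) * log( mean over i != j of exp(eta * g_i) ).
    As eta grows, the second term approaches the best competitor's score,
    so d_j < 0 means class j beats the (smoothed) strongest competitor."""
    g = np.asarray(g, dtype=float)
    comp = np.delete(g, j)                        # competitor scores g_i, i != j
    smooth_max = np.log(np.mean(np.exp(eta * comp))) / eta
    return -g[j] + smooth_max

# Hypothetical discriminant scores for N = 3 classes.
scores = [2.0, -1.0, 0.5]
d0 = misclassification(scores, 0)   # negative: class 0 clearly on top
d1 = misclassification(scores, 1)   # positive: class 1 is misclassified
```

The smoothing keeps d_j differentiable, which is what lets the discrete decision rule be embedded in gradient-based MFoM learning.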
Slide 11: MFoM Classifier Learning (2)
• Approximate the overall performance metric derived from the class 0-1 loss function
  – Averaging F1 as a measure:

    max_W F(T; W) = (1/N) Σ_{j=1..N} F1_j ≈ (1/N) Σ_{j=1..N} 2·TP_j / (FP_j + FN_j + 2·TP_j)

  – Classification error rate as a measure:

    min_Λ L(T; W) = (1/|T|) Σ_{X∈T} Σ_{j=1..N} l_j(X; W) · 1(X ∈ C_j)

  with the smoothed (sigmoid) 0-1 loss

    l_j(X; W) = 1 / (1 + e^{−(α·d_j(X; W) + β)})

• The objective function is highly nonlinear: use GPD,

    W_{t+1} = W_t − κ_t · ∇L(T; W)|_{W = W_t}
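A minimal numerical sketch of the smoothed 0-1 loss and a GPD-style descent step, on an invented two-class linear problem (the data, step size κ and smoothing α are illustrative choices; the real MFoM objective optimizes the metric-based losses above):

```python
import numpy as np

def smooth_loss(d, alpha=1.0, beta=0.0):
    """Sigmoid-smoothed 0-1 loss l = 1 / (1 + exp(-alpha*(d + beta)));
    close to 1 when the misclassification measure d > 0, close to 0 when d < 0."""
    return 1.0 / (1.0 + np.exp(-alpha * (d + beta)))

# Toy binary data: two Gaussian clouds with labels y in {+1, -1}, and a
# linear discriminant f(x; w) = w . x with misclassification measure d = -y*f.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)) + [1.5, 0.0],
               rng.normal(size=(50, 2)) - [1.5, 0.0]])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2)
kappa, alpha = 0.5, 2.0
for t in range(200):                      # W_{t+1} = W_t - kappa * grad L
    d = -y * (X @ w)                      # per-sample misclassification measure
    l = smooth_loss(d, alpha)             # smoothed per-sample loss
    # dL/dw: chain rule through l'(d) = alpha*l*(1-l) and dd/dw = -y*x
    grad = (alpha * l * (1 - l))[:, None] * (-y[:, None] * X)
    w = w - kappa * grad.mean(axis=0)

train_err = np.mean(np.sign(X @ w) != y)  # should be small after descent
```

The finite descent loop stands in for the generalized probabilistic descent (GPD) updates used in MFoM training.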
Slide 12: Task and Experimental Setup
• ModApte split version of the Reuters-21578 corpus
  – lexicon: 10,118 words, after removing 319 stop-words and words occurring fewer than 4 times
  – corpus clean-up: remove documents that are not labeled with topics, are missing topics, or are labeled only with topics occurring exclusively in the training or test corpus
  – final experimental setup: 7,770 training documents, 3,019 test documents, 90 topics
  – some topics have little data for training or testing, and conflicting labels in some cases
Slide 13: Performance Comparison (SIGIR 2003)

    Metric | k-NN  | SVM   | Binary F1-MFoM
    micR   | 0.834 | 0.812 | 0.857
    micP   | 0.881 | 0.914 | 0.914
    micF1  | 0.857 | 0.860 | 0.884
    macF1  | 0.524 | 0.525 | 0.556

• The best performance on the Reuters task is achieved with LSA-based feature extraction and discriminative MFoM learning
Slide 14: Performance vs. LSA Feature Dimension
[Figure]
Slide 15: Separation before and after MFoM
[Figure]
Slide 16: Binary vs. Multi-Class TC (ICML04)
• F1-based comparison: multi-class MFoM works much better for small training sets

    Category | # of training instances | Binary MFoM | MC MFoM
    Sun-meal | 1                       | 0.000       | 0.667
    Potato   | 3                       | 0.333       | 0.750
    Platinum | 5                       | 0.286       | 0.833
    Oat      | 8                       | 0.167       | 0.500
    Income   | 9                       | 0.429       | 0.600
Slide 17: From Text to Multimedia Documents
• Properties of raw multimedia patterns
  – Mostly fuzzy low-level signal representations
  – Hard to locate segmentation and object boundaries
• Definition of common sets of fundamental units
  – No obvious fundamental alphabets and words
  – Precision and coverage of multimedia tokenization
• Extraction of multimedia document feature vectors
  – Dimensionality, discrimination ability and trainability
• What are the missing links?
  – Shannon's information theory perspective (1951)
  – Finding acoustic, audio, visual "alphabets" and "words"
Slide 18: Shannon's Study on Entropy of English
• Approximation to co-occurrences of consecutive letters

    Model                    | Cross entropy (bits) | Comments
    Zeroth order             | 4.76                 | uniform letters
    First order              | 4.03                 | unigram
    Second order             | 2.8                  | bigram
    Shannon's 2nd experiment | 1.34                 | human prediction

• The entropy of this alphabet is language-dependent. This makes text-based language ID of encrypted documents possible without using dictionaries. Can we do the same for spoken languages? How about multimedia documents?
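The order-n estimates in the table can be illustrated by computing conditional letter entropies from a text sample; the repeated pangram below is only a stand-in corpus, so the numbers will not match Shannon's:

```python
import math
from collections import Counter

def ngram_entropy(text, n):
    """Entropy in bits of the empirical n-gram distribution of `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text, order):
    """Per-letter entropy of an order-n letter model, using the chain rule
    H(X_n | X_1..X_{n-1}) = H(n-gram) - H((n-1)-gram)."""
    if order == 1:
        return ngram_entropy(text, 1)
    return ngram_entropy(text, order) - ngram_entropy(text, order - 1)

text = "the quick brown fox jumps over the lazy dog " * 50
h1 = conditional_entropy(text, 1)   # first-order (unigram) estimate
h2 = conditional_entropy(text, 2)   # second-order (bigram) estimate
# Conditioning on the previous letter lowers the per-letter entropy,
# mirroring the 4.03 -> 2.8 drop in the table.
```

The same recipe, applied to tokenized audio or image "letters", is the bridge the talk draws from Shannon's experiment to multimedia documents.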
Slide 19: Common Technology Thread: DSP, Feature Extraction & Classifier Learning
[Diagram] Speech/image/audio input is tokenized into text-like media using A/V alphabet models and an A/V word list; LSA-based feature extraction (SVD) turns the tokens into feature vectors; a TC classifier, obtained by TC classifier learning on a text-document training set, then performs the audiovisual classification and outputs the results.
• First step: define alphabets and train alphabet models
Slide 20: Automatic Image Annotation
• A process that associates concepts or keywords with images to describe their visual content
• AIA can be used to answer queries based on image concepts (Google-style keyword search)
[Diagram] Image/language connection: an image is given the verbal annotation "…, boat, sea, sky, beach, …"
Slide 21: Google Image Search
Query: Tiger
• tiger-regal_1024x768.jpg, "A Siberian tiger in all his majesty", ranking: 1
• Tiger%20pear%20in%20car%20tyre%2075%20dpi.jpg, "Tiger pear in car tyre", ranking: 5
• Juvenile.jpg, "Newborn tigershark", ranking: 23
• tiger-l.png, "Tiger", ranking: 69
Slide 22: Altavista Image Search
Query: Tiger
• Tiger.jpg, no caption, ranking: 1
• tiger.jpg, no caption, ranking: 4
• Tiger__1_11.jpeg, no caption, ranking: 18
• tiger-003.jpg, no caption, ranking: 34
Slide 23: Open Issues
• Which kinds of features best capture the visual information?
• What image components should be used as image units?
• How do we connect visual context to semantic information?
• How do we describe connections among image components to represent high-level annotations?
Slide 24: Generic AIA System
[Diagram] An input image goes through segmentation, then feature extraction and tokenization; a visual/verbal connection model produces the annotation "…, boat, sea, sky, beach, …"
Slide 25: Visual Component Units
The association with verbal information can be done with:
• The entire image (B. Manjunath 1996, M. Swain 1991)
• Segmented regions (Blobworld, C. Carson 1999; Y. Deng 1999)
• Fixed-size sub-image macro-blocks (R.W. Picard 1998, Y. Mori 2000)
Slide 26: Low Level Visual Features
• Color features: RGB, HSV, Lab, and YUV histograms
• Texture features: Gabor wavelets, Wold features, DCT coefficients, FFT coefficients
• Shape features: average orientation, size, convexity, deviation, first moment, area / (boundary length)²
Slide 27: Some AIA Models
• Translation Model (TM)
• Multi-Topic Text Categorization (MC MFoM)
• Maximum Entropy (ME)
• Markov Random Field (MRF)
• Conditional Random Field (CRF)
• Continuous-space Relevance Model with a Gaussian distribution of visual features (CRM)
• Multiple-Bernoulli Relevance Model with a Gaussian distribution of visual features (MBRM)
Slide 28: Automatic Image Annotation (AIA)
• Given an image, automatically associate it with a few semantic labels, i.e., keywords in a predefined vocabulary, to describe the image content based on low-level features (example: Bear, Polar, Snow, Tundra)
• Training stage: learn the association rule P(X, Y) between a low-level image representation X and semantic labels Y, based on a training set
• Prediction stage: predict and rank P(Y|X) for any unknown image X according to the set of N pre-determined concepts
* Gao, Wang and Lee, "Automatic Image Annotation through Multi-Topic Text Categorization," ICASSP 2006
Slide 29: Image Macro-Blocking vs. Segmentation
• Image segmentation is often unreliable
  – Errors propagate from image segmentation (blobs)
  – Speech recognition can be accomplished without explicit word-boundary specifications
• Regular blocking without explicit segmentation
  – Similar to framing of speech utterances
  – Good for signal-independent image processing
• Research issues
  – Size of image sub-blocks
  – Overlapping blocking of images (as in speech)
  – Resolution: image representation and quantization
Slide 30: Visual Terms
• Used for the first time in the Translation Model (Duygulu 2002)
• Starting from the set of feature values, create a finite set of tokens able to represent all images
  – Also used in mixture models (recent BHMMM work)
• Visual terms are analogous to words in documents
  – Any TC algorithm can be easily applied to AIA
• They can be extracted by dividing the feature values into regions according to their similarities and using the centroids of the clusters as the representing terms
Slide 31: Image Tokenization and Visual Terms
• Visual alphabets
  – VQ codebooks for color, texture, shape and others
• Tokenization of all macro-blocks of training images
• Visual words
  – Compute bigrams of neighboring blocks; recurring patterns that appear across images will be remembered
• For a single 6-bit codebook
  – The LSA feature dimension is 4160 (64 × 65) when considering both unigrams and bigrams
  – Multiple and large codebooks are also possible
[Diagram] A 3×3 grid of tokenized macro-blocks X11, X12, …, X33
Slide 32: Visual Bigrams
• Visual bigrams capture the spatial information carried by neighboring blocks: in the 3×3 grid X11 … X33, the center block X22 forms the bigrams X22X11, X22X12, X22X13, …, X22X33
[Figure] Detection error on training and test sets: color features with 7-bit unigrams vs. 6-bit unigrams + bigrams
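The unigram/bigram tokenization sketched on these two slides can be written out as follows; the 3×3 grid of codeword indices is hypothetical, and pairing each block with its 8-neighborhood is one plausible reading of the slide's X22X11 … X22X33 enumeration:

```python
from itertools import product

import numpy as np

# Hypothetical 3x3 grid of VQ codeword indices for one image's
# macro-blocks (a 6-bit codebook would give indices 0..63).
grid = np.array([
    [11, 12, 13],
    [21, 22, 23],
    [31, 32, 33],
])

def visual_terms(grid):
    """Unigram and bigram visual terms: each block's codeword, plus
    ordered (block, neighbor) pairs over the 8-neighborhood."""
    terms = [("uni", int(v)) for v in grid.ravel()]
    H, W = grid.shape
    for r, c in product(range(H), range(W)):
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) == (0, 0):
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W:
                terms.append(("bi", int(grid[r, c]), int(grid[rr, cc])))
    return terms

terms = visual_terms(grid)
# The center block (codeword 22) pairs with all 8 neighbors,
# reproducing the slide's X22X11, X22X12, ..., X22X33 list.
center_pairs = [t for t in terms if t[0] == "bi" and t[1] == 22]
# With a 64-word codebook, the unigram+bigram LSA dimension is
# 64 + 64*64 = 64*65 = 4160, matching slide 31.
```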
Slide 33: Text Representation of Images
• Given a visual lexicon A = {A1, A2, …, AM} with M visual terms, an image document can be represented by V = {V1, V2, …, VM}, each component being the occurrence statistics of the corresponding visual term in that image document
• SVD can be applied to reduce the dimension M
• Semantic concept modeling for image annotation
  – Semantic concept set C = {Cj, 1 ≤ j ≤ N}, where N is the total number of concepts. Each concept has a discriminant function g_j(X; Λ_j) to be trained; multiple relevant keywords are assigned to an image X according to the decision rule
Slide 34: Using Multiple Visual Dictionaries
• If more features are available:
  – Put all the features in one vector and build a visual codebook on the combined vectors (a single visual dictionary)
  – Or extract one visual dictionary per feature (or group of features) and fuse them at a higher level
• Advantages of using more visual dictionaries:
  – Shorter visual vectors (avoids the curse of dimensionality)
  – New features can be added without repeating the entire extraction/tokenization process
Slide 35: Defining Classifier Score Function
[Diagram] Input X is scored by g_1(X, Λ), …, g_N(X, Λ), and the argmax over the N scores gives the class C(X)
• Each classifier score function g_i is trained to discriminate positive from negative examples for the i-th class:

    C(X) = argmax_{1≤j≤N} g_j(X, Λ)

• The classifier is trained to minimize the detection error:

    DetE = Σ_{1≤j≤N} (FP_j + FN_j) / (2·N_j)
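The argmax decision rule and the detection-error measure can be sketched as below; the per-class normalization by N_j follows the reconstruction of the slide's garbled formula, and the score lists are invented:

```python
import numpy as np

def classify(scores):
    """Decision rule C(X) = argmax_{1 <= j <= N} g_j(X; Lambda)."""
    return int(np.argmax(scores))

def det_error(pred, true, num_classes):
    """DetE = sum_j (FP_j + FN_j) / (2 * N_j), where N_j is the number
    of samples of class j (a reconstruction of the slide's formula)."""
    pred, true = np.asarray(pred), np.asarray(true)
    total = 0.0
    for j in range(num_classes):
        n_j = np.sum(true == j)
        if n_j == 0:
            continue                              # skip empty classes
        fp = np.sum((pred == j) & (true != j))    # false positives for class j
        fn = np.sum((pred != j) & (true == j))    # false negatives for class j
        total += (fp + fn) / (2 * n_j)
    return total

# Six hypothetical samples, two per class, scored by three discriminants.
true = [0, 0, 1, 1, 2, 2]
pred = [classify(s) for s in ([3, 1, 0], [2, 2.5, 0], [0, 4, 1],
                              [0, 3, 1], [1, 0, 2], [2, 0, 1])]
```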
Slide 36: Tree of Binary Classifiers
• Linear classification units g_i(X, Λ_i) are ordered in the tree according to the false-positive rate (lower rate nearer the root)
• Each unit applies the rule:
  – if g_i(X, Λ_i) > Th+: assign Label_i (positive classification)
  – if g_i(X, Λ_i) < Th−: assign Not-Label_i (negative classification)
  – else: reject the sample (no decision)
• When the reject option occurs, the sample is sent to a unit trained for the same category but with a different visual dictionary
• Th+ and Th− are chosen according to the statistics of the positive and negative samples
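The per-unit threshold/reject rule and the hand-off to a unit built on a different visual dictionary can be sketched as follows (Th+ = 0.7 and Th− = 0.3 are arbitrary illustrative thresholds, and the fallback for an all-reject chain is an added assumption):

```python
def classify_with_reject(score, th_pos, th_neg):
    """One tree unit: assign the label if g_i(X) > Th+, reject the label
    if g_i(X) < Th-, otherwise defer the decision (reject option)."""
    if score > th_pos:
        return "label"
    if score < th_neg:
        return "not-label"
    return "reject"

def tree_decision(scores, th_pos=0.7, th_neg=0.3):
    """Run a chain of units for the same category, one score per visual
    dictionary, until a firm decision; fall back on the last score's
    side of 0.5 if every unit rejects (hypothetical tie-break)."""
    for s in scores:
        out = classify_with_reject(s, th_pos, th_neg)
        if out != "reject":
            return out
    return "label" if scores[-1] >= 0.5 else "not-label"
```

For example, a first-dictionary score of 0.5 falls in the reject band, so the second dictionary's score decides.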
Slide 37: AIA Examples: Multi-Topic TC
• Ground truth: locomotive, railroad, smoke, train
  – CMRM: water, sky, tree, people
  – MFoM: mountain, sky, tree, train, locomotive, railroad, aerial
• Ground truth: bear, polar, snow, tundra
  – CMRM: water, sky, plane, jet, tree
  – MFoM: tree, bear, snow, polar, tundra, ice
• Ground truth: cat, wood, tiger, water
  – CMRM: people, water, rocks, buildings
  – MFoM: water, cat, tiger, forest
Slide 38: Experimental Evaluation Data Sets
• Corel CD: 374 concepts with a total of 5,000 images; 4,500 images for training and 500 for testing
  – each image (128×192) is uniformly segmented into 96 grids, each with a block size of 16×16
• TRECVID 2003 development set: 33,529 keyframes from 93 MPEG files with 114 concepts, of which only 10 were selected (Aircraft, including Airplane, Airplane_landing and Airplane_takeoff; Animal; Building; Car/Bus/Truck, including Car, Bus and Truck; News subject face, including Male_News_Subject and Female_News_Subject; Non-studio setting; Outdoors; People; Road; and Weather news). About half, i.e., 15,804 keyframes, were randomly selected for training and the remaining 17,725 for testing
  – each image (224×352) is segmented into 77 grids, each with a block size of 32×32
Slide 39: Comparison with Benchmarks on Corel

    Metric   | TM   | CMRM | ME   | MBRM | MFoM
    mP       | 0.06 | 0.10 | 0.09 | 0.24 | 0.25
    mR       | 0.04 | 0.09 | 0.12 | 0.25 | 0.27
    # of det | 49   | 66   | N.A. | 122  | 133

• Testing on 260 out of 374 concepts. TM, CMRM, ME and MBRM results are taken from other published papers
Slide 40: Summary
• Text Categorization (TC): a unified scenario
  – HD LSA feature extraction
  – MFoM discriminative classifier learning
• Information theory perspective: Shannon
• From text to multimedia documents and back
  – Audiovisual alphabets/words and LSA-based features
  – Pattern classification through vector-based TC
• Multimedia applications: achieving better results
  – AIA for concept-based image retrieval
  – Many other applications