

    A Text Categorization Approach to Automatic Image Annotation

Chin-Hui Lee (李錦輝)
School of ECE, Georgia Institute of Technology
Atlanta, GA 30332-0250, USA
[email protected]

Joint work with Sheng Gao of the Institute for Infocomm Research, Singapore

    2

Outline
• Text Categorization (TC): a unified scenario
  – High-dimensional feature extraction (LSA)
  – Discriminative classifier design (MFoM learning)
• Information theory perspective
  – Shannon's study on the entropy of English letters
• From text to multimedia documents and back
  – Tokenization of multimedia patterns into audiovisual alphabets
  – Formation of audiovisual words and LSA-based feature vectors
  – Multimedia pattern recognition through vector-based TC
• Multimedia retrieval applications
  – Automatic image annotation
  – Audio fingerprint and music retrieval
• Summary


    3

Concept & Content Based Image Retrieval

• Indexing and retrieval of photos
• Content-based example search does not give good performance
• Concept-based keyword search
  – GUI
  – Multimedia UI
  – Multilingual VUI

    4

    Picasso’s “Parade” (1917)

A picture is worth more than a thousand words?

    Image Segmentation & Annotation

    “clown, ball, tree, flying horse, people”

    Do we need image segmentation for object recognition?


    5

Multilingual Image Annotation (IIS)

[Slide shows example images, each annotated with its top 4 keywords:]
• 彩虹 (Rainbow), 天氣 (Weather), 花 (Flower), 自然 (Nature)
• 向日葵 (Sunflower), 花 (Flower), 植物 (Plant), 沙漠 (Desert)
• 海豹 (Seal), 哺乳類 (Mammal), 海岸 (Coast), 動物 (Animal)
• 太陽系 (Solar System), 慧星 (Comet), 熱帶魚 (Tropical Fish), 太空 (Universe)
• 瀑布 (Waterfall), 地形 (Landform), 自然 (Nature), 蟑螂 (Cockroach)
• 狗 (Dog), 哺乳類 (Mammal), 穿山甲 (Pangolin), 羊 (Sheep)

    6

Text Categorization – A Unifying Scenario

[Diagram] An unknown document dj is fed in parallel to m classifiers T1, …, Ti, …, Tm. Each classifier Ti outputs a decision Ti(dj), which becomes the label Li(dj) of dj for category Ci; the m labels are then passed to evaluation.


    7

Text Categorization: Training Classifiers

[Diagram] For each category Ci, i = 1, …, m, a training set of positive (Pi) and negative (Ni) documents passes through (1) feature extraction & reduction, giving documents (Pi, Ni) in the new feature space, and then (2) classifier learning, which yields the classifier Ti for category Ci.

    8

Vector Space Representation of Documents and Queries

[Diagram] Documents and an unknown query vector Xq are plotted in a space whose axes correspond to topics such as Consumer Lending, Home Equity Service, Deposit Services, Credit Card Services, and Loan Servicing; retrieval compares the query vector against the document vectors.
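As a concrete illustration of this vector-space matching (not from the slides), here is a minimal Python sketch that ranks documents by cosine similarity to an unknown query vector Xq; the topic axes and all numbers are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors in the topic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative axes: [consumer lending, home equity, deposits, credit cards, loan servicing]
docs = {
    "d1": np.array([0.9, 0.1, 0.0, 0.0, 0.2]),
    "d2": np.array([0.0, 0.0, 0.8, 0.5, 0.1]),
}
x_q = np.array([0.7, 0.2, 0.0, 0.1, 0.3])  # unknown query vector

# Rank documents by similarity to the query (closest direction first)
for name, d in sorted(docs.items(), key=lambda kv: -cosine_similarity(x_q, kv[1])):
    print(name, round(cosine_similarity(x_q, d), 3))
```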


    9

LSA Based Feature Extraction
• LSA Matrix (also known as a Routing Matrix) C
  – n_ij: number of times word w_i occurs in document A_j
  – n_·j: total number of words present in A_j (column sum)
  – n_i·: total number of occurrences of w_i in corpus A (row sum)
  – Matrix entry: c_ij = (1 − ε_i) · n_ij / n_·j
  – Normalized entropy: ε_i = −(1/log N) Σ_{j=1..N} (n_ij/n_i·) · log(n_ij/n_i·), with 0 ≤ ε_i ≤ 1
  – "Indexing" power of w_i in corpus A: η_i = 1 − ε_i
  – ε_i = 0 if n_ij = n_i· (maximum indexing power); ε_i = 1 if n_ij = n_i·/N (equally probable, no indexing power)
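A minimal numpy sketch of the entropy-weighted matrix defined above, followed by the SVD used for dimension reduction; the toy counts and the helper name `lsa_matrix` are illustrative.

```python
import numpy as np

def lsa_matrix(n):
    """c_ij = (1 - eps_i) * n_ij / n_.j for a word-by-document count
    matrix n (rows: words w_i, columns: documents A_j)."""
    N = n.shape[1]                          # number of documents
    n_col = n.sum(axis=0, keepdims=True)    # total words in each A_j
    n_row = n.sum(axis=1, keepdims=True)    # total occurrences of each w_i
    p = n / np.maximum(n_row, 1)            # n_ij / n_i.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    eps = -plogp.sum(axis=1, keepdims=True) / np.log(N)  # normalized entropy
    return (1.0 - eps) * n / np.maximum(n_col, 1)

counts = np.array([[3, 0, 0],   # occurs in one document only -> eps = 0
                   [1, 1, 1]])  # equally probable -> eps = 1, row zeroed out
C = lsa_matrix(counts)
U, s, Vt = np.linalg.svd(C, full_matrices=False)  # SVD for dimension reduction
```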

    10

Maximal Figure-of-Merit Learning (1)
• Generalized score function for high-dimensional vectors – no pdf assumed
• Misclassification function: simulates the discrete decision rule to be embedded in MFoM learning
  – Binary case, with linear discriminant function f(X; W) = Σ_{k=1..R} w_k·x_k + w_0:
      d_j(X; W) = −f(X; W) for X in C_j (positive class)
      d_j(X; W) = +f(X; W) for X not in C_j (negative class)
  – Multi-category classification:
      d_j(X; Λ) = −g_j(X; Λ) + [ (1/(N−1)) Σ_{i≠j} g_i(X; Λ)^η ]^{1/η}
• Binary tree classifier with LDF
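A small numpy sketch of the multi-category misclassification measure above; it assumes positive class scores g_i (as in the usual MCE/MFoM setting), and all numbers are illustrative.

```python
import numpy as np

def misclassification_measure(scores, j, eta=2.0):
    """d_j(X) = -g_j(X) + [ (1/(N-1)) * sum_{i != j} g_i(X)^eta ]^(1/eta);
    negative d_j means X is (correctly) closest to class j."""
    others = np.delete(scores, j)
    competitor = np.mean(others ** eta) ** (1.0 / eta)
    return float(-scores[j] + competitor)

g = np.array([2.0, 1.5, 0.3])             # illustrative class scores g_1..g_N
print(misclassification_measure(g, j=0))  # about -0.92: correct with a margin
```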


    11

MFoM Classifier Learning (2)
• Approximate the overall performance metric, derived from the per-class 0-1 loss function
  – Averaged F1 as the measure:
      max_W F1(T; W) ≈ (1/N) Σ_{j=1..N} F1_j = (1/N) Σ_{j=1..N} 2·TP_j / (FP_j + FN_j + 2·TP_j)
  – Classification error rate as the measure:
      min_Λ L(T; W) = (1/|T|) Σ_{X∈T} Σ_{j=1..N} l_j(X; W) · 1(X ∈ C_j)
  – Smoothed per-class 0-1 loss:
      l_j(X; W) = 1 / (1 + e^{−α(d_j(X; W) + β)})
• The objective function is highly nonlinear; optimize by GPD:
      W_{t+1} = W_t − κ_t ∇L(T; W)|_{W = W_t}
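A minimal sketch of MFoM-style training for the binary linear case: the smoothed 0-1 loss above plus GPD-style steps W_{t+1} = W_t − κ_t ∇L. The analytic gradient for the linear unit, the data, and the step size are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def smooth_loss(d, alpha=1.0, beta=0.0):
    """l_j(X; W) = 1 / (1 + exp(-alpha * (d + beta)))."""
    return 1.0 / (1.0 + np.exp(-alpha * (d + beta)))

def gpd_step(W, b, X, y, kappa=0.1, alpha=1.0, beta=0.0):
    """One descent step for a binary linear unit with d = -y * (W.x + b),
    where y is +1 for positive and -1 for negative examples."""
    d = -y * (X @ W + b)
    l = smooth_loss(d, alpha, beta)
    # Chain rule: dl/dW = alpha * l * (1 - l) * dd/dW, with dd/dW = -y * x
    g = alpha * l * (1.0 - l) * (-y)
    return W - kappa * (g[:, None] * X).mean(axis=0), b - kappa * g.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))  # illustrative labels
W, b = np.zeros(5), 0.0
for _ in range(200):
    W, b = gpd_step(W, b, X, y)   # smoothed error-rate loss decreases
```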

    12

    Task and Experimental Setup

• ModApte split of the Reuters-21578 corpus
  – lexicon: 10,118 words, after removing 319 stop-words and words occurring fewer than 4 times
  – corpus clean-up: remove documents that have no topic labels or whose topics occur only in the training or only in the test corpus
  – final experimental setup: 7,770 training documents, 3,019 test documents, 90 topics
  – some topics have little data for training or testing, and labels conflict in some cases


    13

Performance Comparison (SIGIR 2003)

Measure   Binary F1-MFoM   SVM     k-NN
macF1     0.556            0.525   0.524
micF1     0.884            0.860   0.857
micP      0.914            0.914   0.881
micR      0.857            0.812   0.834

• LSA-based feature extraction with discriminative MFoM learning achieves the best performance on the Reuters task

    14

Performance vs. LSA Feature Dimension

[Chart: classification performance plotted against the LSA feature dimension]


    15

Separation before and after MFoM

[Chart: separation of positive and negative score distributions before and after MFoM learning]

    16

Binary vs. Multi-Class TC (ICML 2004)

F1-based comparison: multi-class MFoM works much better for small training sets

Category   # of training instances   Binary MFoM   MC MFoM
Sun-meal   1                         0.000         0.667
Potato     3                         0.333         0.750
Platinum   5                         0.286         0.833
Oat        8                         0.167         0.500
Income     9                         0.429         0.600


    17

    From Text to Multimedia Documents

• Properties of raw multimedia patterns
  – Mostly fuzzy low-level signal representations
  – Hard to locate segmentation and object boundaries
• Definition of common sets of fundamental units
  – No obvious fundamental alphabets and words
  – Precision and coverage of multimedia tokenization
• Extraction of multimedia document feature vectors
  – Dimensionality, discrimination ability and trainability
• What are the missing links?
  – Shannon's information theory perspective (1951)
  – Finding acoustic, audio, visual "alphabets" and "words"

    18

Shannon's Study on Entropy of English
• Approximation to co-occurrences of consecutive letters

Model                       Cross Entropy (bits)   Comments
Zeroth order                4.76                   uniform letters
First order                 4.03                   unigram
Second order                2.8                    bigram
Shannon's 2nd experiment    1.34                   human prediction

• The entropy of this alphabet is language-dependent, which makes text-based language ID of encrypted documents possible without using dictionaries. Can we do the same for spoken languages? How about multimedia documents?
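An empirical sketch of the estimates in the table above: the order-k conditional entropy can be approximated as the difference of block entropies of k-grams and (k−1)-grams, while the zeroth-order value is log2 of the alphabet size (log2 27 ≈ 4.75 bits for 26 letters plus space). The tiny corpus here is illustrative; Shannon used large bodies of English text.

```python
import math
from collections import Counter

def block_entropy(text, k):
    """Entropy (bits) of the empirical k-gram distribution of `text`."""
    if k == 0:
        return 0.0
    grams = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def conditional_entropy(text, n):
    """Bits per letter of an order-(n-1) model: H(n-gram) - H((n-1)-gram)."""
    return block_entropy(text, n) - block_entropy(text, n - 1)

corpus = "the quick brown fox jumps over the lazy dog " * 50
print(conditional_entropy(corpus, 1))   # first-order (unigram) estimate
print(conditional_entropy(corpus, 2))   # second-order (bigram) estimate
```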


    19

Common Technology Thread: DSP, Feature Extraction & Classifier Learning

[Diagram] Speech/image/audio input is converted to "text" by media tokenization, driven by an A/V alphabet model and an A/V word list. LSA-based feature extraction (SVD) turns the tokenized text into feature vectors. A TC classifier, produced by TC classifier learning on a text-document training set, then performs audiovisual classification and outputs the results.

First step: define the alphabets and train the alphabet models

    20

Automatic Image Annotation
• A process associating concepts or keywords with images to describe their visual content
• AIA can be used to answer queries based on image concepts (Google-style keyword search)

[Diagram] An image is linked to a verbal annotation through an image/language connection, e.g., "…, boat, sea, sky, beach, …"


    21

Google Image Search

Query: Tiger
• Ranking 1: tiger-regal_1024x768.jpg - "A Siberian tiger in all his majesty"
• Ranking 5: Tiger%20pear%20in%20car%20tyre%2075%20dpi.jpg - "Tiger pear in car tyre"
• Ranking 23: Juvenile.jpg - "Newborn tigershark"
• Ranking 69: tiger-l.png - "Tiger"

    22

Altavista Image Search

Query: Tiger
• Ranking 1: Tiger.jpg - no caption
• Ranking 4: tiger.jpg - no caption
• Ranking 18: Tiger__1_11.jpeg - no caption
• Ranking 34: tiger-003.jpg - no caption


    23

Open Issues
• Which kinds of features are best able to capture the visual information?

    • What image components should be used as image units?

    • How to connect visual context to semantic information?

    • How to describe connections among image components to represent high-level annotations?

    24

Generic AIA System

[Diagram] Segmentation → Feature Extraction → Tokenization → Visual/Verbal Connection Model → annotation keywords ("…, boat, sea, sky, beach, …")


    25

Visual Component Units

The association with verbal information can be done with:

    • Entire image(B. Manjunath 1996, M. Swain 1991)

    • Segmented regions(Blobworld - C. Carson 1999, Y. Deng 1999)

    • Fixed-size sub-image macro-blocks(R.W. Picard 1998, Y. Mori 2000)

    26

Low Level Visual Features

• Texture features: Gabor wavelets, Wold features, DCT coefficients, FFT coefficients
• Color features: YUV, Lab, HSV and RGB histograms (an RGB-histogram sketch follows below)
• Shape features: average orientation, size, convexity, deviation, first moment, area / (boundary length)²
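As one concrete low-level color feature, here is a minimal sketch of a quantized RGB histogram for a single macro-block; the 4 bins per channel and the random block are illustrative choices.

```python
import numpy as np

def rgb_histogram(block, bins=4):
    """Normalized joint RGB histogram of an image block (H x W x 3, uint8)
    with `bins` quantization levels per channel."""
    q = (block.astype(np.int32) * bins) // 256            # 0..bins-1 per channel
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / hist.sum()

block = np.random.default_rng(0).integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
print(rgb_histogram(block).shape)   # (64,) feature vector for one 16x16 block
```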


    27

Some AIA Models
• Translation Model (TM)
• Multi-Topic Text Categorization (MC MFoM)
• Maximum Entropy (ME)
• Markov Random Field (MRF)
• Conditional Random Field (CRF)
• Continuous-space Relevance Model (CRM): Gaussian distribution of visual features
• Multiple Bernoulli Relevance Model (MBRM): Gaussian distribution of visual features

    28

    Automatic Image Annotation (AIA)

Given an image, automatically associate it with a few semantic labels, i.e., keywords in a predefined vocabulary, to describe the image content based on low-level features (example annotation: "Bear, Polar, Snow, Tundra")

Training stage: learn the association rule P(X, Y) between a low-level image representation X and semantic labels Y, based on a training set

Prediction stage: predict and rank P(Y|X) for any unknown image X over the set of N pre-determined concepts (sketched below)

* Gao, Wang and Lee, "Automatic Image Annotation through Multi-Topic Text Categorization," ICASSP 2006
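A minimal sketch of the prediction stage described above: rank the N concept scores for an unknown image and keep the top-k keywords as the annotation. The scores and vocabulary are illustrative stand-ins for trained P(Y|X) values.

```python
import numpy as np

def annotate(scores, vocab, k=4):
    """Return the k keywords whose concept scores rank highest for image X."""
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]

vocab = ["bear", "polar", "snow", "tundra", "water", "sky"]
scores = np.array([0.9, 0.8, 0.85, 0.7, 0.2, 0.3])   # illustrative g_j(X)
print(annotate(scores, vocab))   # ['bear', 'snow', 'polar', 'tundra']
```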


    29

    Image Macro-Blocking vs. Segmentation

• Image segmentation is often unreliable
  – Error propagation with image segmentation (blobs)
  – Speech recognition can be accomplished without explicit word-boundary specifications
• Regular blocking without explicit segmentation (sketched below)
  – Similar to framing of speech utterances
  – Good for signal-independent image processing
• Research issues
  – Size of image sub-blocks
  – Overlapping blocking of images (like speech)
  – Resolution: image representation and quantization

    30

    Visual Terms

• Used for the first time in the Translation Model (Duygulu 2002)
• Starting from the set of feature values, create a finite set of tokens able to represent all images
  – Also used in mixture models (recent BHMMM work)
• Visual terms are analogous to words in documents
  – Any TC algorithm can be easily applied to AIA
• They can be extracted by partitioning the feature values into regions according to their similarities and using the cluster centroids as the representative terms


    31

Image Tokenization and Visual Terms
• Visual alphabets
  – VQ codebooks for color, texture, shape and others
• Tokenization of all macro-blocks of training images (sketched below)
• Visual words
  – Compute bigrams of neighboring blocks; recurring patterns appearing across images will be remembered

[Diagram: a 3×3 grid of tokenized macro-blocks X11 … X33]

• For a single 6-bit codebook
  – The LSA feature dimension is 4,160 (64 × 65) when considering both unigrams and bigrams
  – Multiple and large codebooks increase this dimension further
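A minimal sketch of VQ tokenization: each macro-block feature vector is mapped to its nearest codeword in a 6-bit (64-entry) codebook, turning the image into a sequence of visual "letters". The random codebook stands in for one trained by clustering (e.g., k-means) over training-image blocks.

```python
import numpy as np

def tokenize(features, codebook):
    """Assign each block-feature vector the index of its nearest codeword."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)                 # one token per macro-block

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))          # 6-bit codebook: 64 codewords
blocks = rng.normal(size=(96, 8))            # one feature vector per macro-block
tokens = tokenize(blocks, codebook)          # 96 visual-letter indices in [0, 64)
# Unigrams + bigrams over one such codebook give 64 + 64*64 = 4160 LSA dimensions
```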

    32

Visual Bigrams
• Visual bigrams capture the spatial information carried by neighboring blocks (a code sketch follows the chart below)

  X11 X12 X13
  X21 X22 X23
  X31 X32 X33

• The center block X22 is paired with each of its neighbors: (X22, X11), (X22, X12), (X22, X13), …, (X22, X33)

[Chart: detection error on the training and test sets for color features, comparing a unigram-only 7-bit codebook with a unigram+bigram 6-bit codebook]
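A small sketch of the bigram extraction described above: each block's token is paired with the tokens of its 8 neighbors in the grid. The grid contents are illustrative.

```python
import numpy as np

def visual_bigrams(token_grid):
    """Enumerate (center, neighbor) token pairs over the 8-neighborhood."""
    H, W = token_grid.shape
    pairs = []
    for r in range(H):
        for c in range(W):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) != (0, 0) and 0 <= r + dr < H and 0 <= c + dc < W:
                        pairs.append((token_grid[r, c], token_grid[r + dr, c + dc]))
    return pairs

grid = np.arange(9).reshape(3, 3)     # the X11..X33 layout from the slide
print(len(visual_bigrams(grid)))      # 40 directed neighbor pairs in a 3x3 grid
```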


    33

Text Representation of Images
• Given a visual lexicon A = {A1, A2, …, AM} with M visual terms, an image document can be represented by V = {V1, V2, …, VM}, each component being the occurrence statistics of one visual term in that particular image document (see the sketch below)
• SVD can be applied to reduce the dimension M
• Semantic concept modeling for image annotation
  – Semantic concept set C = {Cj, 1 ≤ j ≤ N}, where N is the total number of concepts; each concept has a discriminant function gj(X; Λj) to be trained, and multiple relevant keywords are assigned to an image X according to the decision rule
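A minimal sketch of the bag-of-visual-terms vector V for one image, concatenating unigram and bigram counts into the 4,160-dimensional representation (64 + 64×64) for a single 6-bit codebook; SVD can then reduce the dimension as in text LSA. The helper name and toy tokens are illustrative.

```python
import numpy as np

def image_term_vector(tokens, bigrams, M=64):
    """V = [unigram counts ; bigram counts] over a codebook of M codewords."""
    v_uni = np.bincount(np.asarray(tokens, dtype=int), minlength=M)
    pair_idx = np.asarray([a * M + b for a, b in bigrams], dtype=int)
    v_bi = np.bincount(pair_idx, minlength=M * M)
    return np.concatenate([v_uni, v_bi]).astype(float)

V = image_term_vector([0, 5, 5, 63], [(0, 5), (5, 63)])
print(V.shape)    # (4160,) = 64 unigram + 4096 bigram dimensions
# Stacking the V vectors of all images gives a term-by-image matrix,
# to which np.linalg.svd can be applied for dimension reduction.
```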

    34

Using Multiple Visual Dictionaries
• If more features are available, either:
  – Put all the features in one vector and build the visual codebook on the combined vectors (a single visual dictionary), or
  – Extract a separate visual dictionary for each feature, or group of features, and fuse them at a higher level
• Advantages of using more visual dictionaries:
  – Shorter visual vectors are easier to manage (avoids the curse of dimensionality)
  – New features can be added without repeating the entire extraction/tokenization process


    35

Defining Classifier Score Function

[Diagram] An input X is scored by g1(X, Λ), …, gN(X, Λ); an argmax unit outputs the class C(X).

• Each classifier score function gi is trained to discriminate positive from negative examples for the i-th class; the decision rule is
    C(X) = argmax_{1≤j≤N} gj(X, Λ)
• The classifier is trained to minimize the detection error (sketched below):
    DetE = Σ_{1≤j≤N} (FPj + FNj) / (2N)
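A small sketch of the decision rule and the detection-error measure; note the DetE normalization is reconstructed from a garbled slide and should be treated as an assumption, and all counts are illustrative.

```python
import numpy as np

def det_error(fp, fn):
    """DetE = (1/(2N)) * sum_j (FP_j + FN_j) over N classes (assumed form)."""
    return float((fp + fn).sum() / (2 * len(fp)))

def classify(scores):
    """C(X) = argmax_j g_j(X, Lambda)."""
    return int(np.argmax(scores))

print(classify(np.array([0.2, 1.7, 0.9])))            # -> class 1
print(det_error(np.array([3, 1]), np.array([2, 4])))  # -> 2.5
```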

    36

Tree of Binary Classifiers

Linear classification units gi(X, Λi) are ordered in the tree according to their false-positive rate (lower rate near the root). Each unit applies the rule:

  if gi(X, Λi) > Th+ : assign Labeli (positive classification)
  if gi(X, Λi) < Th− : not Labeli (negative classification)
  else : reject the sample (no decision)

• When the reject option occurs, the sample is sent to a unit trained for the same category but with a different visual dictionary (sketched below)
• Th+ and Th− are chosen according to the statistics of positive and negative samples
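A minimal sketch of one unit's three-way rule with the reject option; the function name and threshold values are illustrative, with Th+ and Th− assumed to come from per-class score statistics (e.g., percentiles).

```python
def tree_unit_decision(score, th_pos, th_neg):
    """Three-way rule for one binary unit in the tree: accept the label,
    rule it out, or defer to a unit built on a different visual dictionary."""
    if score > th_pos:
        return "assign_label"          # positive classification
    if score < th_neg:
        return "not_label"             # negative classification
    return "defer_to_next_dictionary"  # reject option: no decision here

print(tree_unit_decision(0.8, th_pos=0.6, th_neg=-0.6))   # -> assign_label
```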


    37

    AIA Examples: Multi-Topic TC

Example 1 – Ground truth: locomotive, railroad, smoke, train
  CMRM: water, sky, tree, people
  MFoM: mountain, sky, tree, train, locomotive, railroad, aerial

Example 2 – Ground truth: bear, polar, snow, tundra
  CMRM: water, sky, plane, jet, tree
  MFoM: tree, bear, snow, polar, tundra, ice

Example 3 – Ground truth: cat, wood, tiger, water
  CMRM: people, water, rocks, buildings
  MFoM: water, cat, tiger, forest

    38

Experimental Evaluation Data Sets
• Corel CD: 374 concepts with a total of 5,000 images; 4,500 images for training and 500 for testing
  – Each image (128×192) is uniformly segmented into 96 grids, each with a block size of 16×16
• TRECVID 2003 development set: 33,529 keyframes from 93 MPEG files with 114 concepts, out of which only 10 concepts were selected (Aircraft, including Airplane, Airplane_landing and Airplane_takeoff; Animal; Building; Car/Bus/Truck, including Car, Bus and Truck; News subject face, including Male_News_Subject and Female_News_Subject; Non-studio setting; Outdoors; People; Road; and Weather news). We randomly selected about half, i.e., 15,804 keyframes for training, and the remaining 17,725 for testing.
  – Each image (224×352) is segmented into 77 grids, each with a block size of 32×32


    39

    Comparison with Benchmarks on Corel

Measure     MFoM    MBRM    ME      CMRM    TM
# of det.   133     122     N.A.    66      49
mR          0.27    0.25    0.12    0.09    0.04
mP          0.25    0.24    0.09    0.10    0.06

• Testing on 260 out of 374 concepts; the TM, CMRM, ME and MBRM results are from other published papers

    40

    Summary

• Text Categorization (TC): a unified scenario
  – High-dimensional LSA feature extraction
  – MFoM discriminative classifier learning
• Information theory perspective: Shannon
• From text to multimedia documents and back
  – Audiovisual alphabets/words and LSA-based features
  – Pattern classification through vector-based TC
• Multimedia applications: achieving better results
  – AIA for concept-based image retrieval
  – Many other applications
