

    A Text Categorization Approach to Automatic Image Annotation

Chin-Hui Lee (李錦輝)
School of ECE, Georgia Institute of Technology
Atlanta, GA 30332-0250, USA
[email protected]

Joint work with Sheng Gao of the Institute for Infocomm Research, Singapore

    2

Outline
• Text Categorization (TC): a unified scenario
  – High-dimensional feature extraction (LSA)
  – Discriminative classifier design (MFoM learning)
• Information theory perspective
  – Shannon's study on the entropy of English letters
• From text to multimedia documents and back
  – Tokenization of multimedia patterns into audiovisual alphabets
  – Formation of audiovisual words and LSA-based feature vectors
  – Multimedia pattern recognition through vector-based TC
• Multimedia retrieval applications
  – Automatic image annotation
  – Audio fingerprint and music retrieval
• Summary


    3

Concept & Content Based Image Retrieval

• Indexing and retrieval of photos
• Content-based example search does not give good performance
• Concept-based keyword search
  – GUI
  – Multimedia UI
  – Multilingual VUI

    4

    Picasso’s “Parade” (1917)

A picture is worth more than a thousand words?

    Image Segmentation & Annotation

    “clown, ball, tree, flying horse, people”

    Do we need image segmentation for object recognition?


    5

Multilingual Image Annotation (IIS)

[Slide shows example images, each annotated with its top 4 keywords:]
• 彩虹 (Rainbow), 天氣 (Weather), 花 (Flower), 自然 (Nature)
• 向日葵 (Sunflower), 花 (Flower), 植物 (Plant), 沙漠 (Desert)
• 海豹 (Seal), 哺乳類 (Mammal), 海岸 (Coast), 動物 (Animal)
• 太陽系 (Solar System), 慧星 (Comet), 熱帶魚 (Tropical Fish), 太空 (Universe)
• 瀑布 (Waterfall), 地形 (Landform), 自然 (Nature), 蟑螂 (Cockroach)
• 狗 (Dog), 哺乳類 (Mammal), 穿山甲 (Pangolin), 羊 (Sheep)

    6

Text Categorization – A Unifying Scenario

[Diagram] An unknown document dj is fed in parallel to m classifiers T1, …, Ti, …, Tm. Each classifier Ti outputs a decision Ti(dj), which becomes the label Li(dj) of dj for category Ci; the m labels are then passed to evaluation.


    7

Text Categorization: Training Classifiers

[Diagram] For each category Ci, i = 1, …, m, a training set of positive (Pi) and negative (Ni) documents passes through (1) feature extraction & reduction, giving documents (Pi, Ni) in the new feature space, and then (2) classifier learning, which yields the classifier Ti for category Ci.

    8

Vector Space Representation of Documents and Queries

[Diagram] Documents and an unknown query vector Xq are plotted in a space whose axes correspond to topics such as Consumer Lending, Home Equity Service, Deposit Services, Credit Card Services, and Loan Servicing; retrieval compares the query vector against the document vectors.
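As a concrete illustration of this vector-space matching (not from the slides), here is a minimal Python sketch that ranks documents by cosine similarity to an unknown query vector Xq; the topic axes and all numbers are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors in the topic space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative axes: [consumer lending, home equity, deposits, credit cards, loan servicing]
docs = {
    "d1": np.array([0.9, 0.1, 0.0, 0.0, 0.2]),
    "d2": np.array([0.0, 0.0, 0.8, 0.5, 0.1]),
}
x_q = np.array([0.7, 0.2, 0.0, 0.1, 0.3])  # unknown query vector

# Rank documents by similarity to the query (closest direction first)
for name, d in sorted(docs.items(), key=lambda kv: -cosine_similarity(x_q, kv[1])):
    print(name, round(cosine_similarity(x_q, d), 3))
```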


    9

LSA Based Feature Extraction
• LSA Matrix (also known as a Routing Matrix) C
  – n_ij: number of times word w_i occurs in document A_j
  – n_·j: total number of words present in A_j (column sum)
  – n_i·: total number of occurrences of w_i in corpus A (row sum)
  – Matrix entry: c_ij = (1 − ε_i) · n_ij / n_·j
  – Normalized entropy: ε_i = −(1/log N) Σ_{j=1..N} (n_ij/n_i·) · log(n_ij/n_i·), with 0 ≤ ε_i ≤ 1
  – "Indexing" power of w_i in corpus A: η_i = 1 − ε_i
  – ε_i = 0 if n_ij = n_i· (maximum indexing power); ε_i = 1 if n_ij = n_i·/N (equally probable, no indexing power)
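A minimal numpy sketch of the entropy-weighted matrix defined above, followed by the SVD used for dimension reduction; the toy counts and the helper name `lsa_matrix` are illustrative.

```python
import numpy as np

def lsa_matrix(n):
    """c_ij = (1 - eps_i) * n_ij / n_.j for a word-by-document count
    matrix n (rows: words w_i, columns: documents A_j)."""
    N = n.shape[1]                          # number of documents
    n_col = n.sum(axis=0, keepdims=True)    # total words in each A_j
    n_row = n.sum(axis=1, keepdims=True)    # total occurrences of each w_i
    p = n / np.maximum(n_row, 1)            # n_ij / n_i.
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    eps = -plogp.sum(axis=1, keepdims=True) / np.log(N)  # normalized entropy
    return (1.0 - eps) * n / np.maximum(n_col, 1)

counts = np.array([[3, 0, 0],   # occurs in one document only -> eps = 0
                   [1, 1, 1]])  # equally probable -> eps = 1, row zeroed out
C = lsa_matrix(counts)
U, s, Vt = np.linalg.svd(C, full_matrices=False)  # SVD for dimension reduction
```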

    10

Maximal Figure-of-Merit Learning (1)
• Generalized score function for high-dimensional vectors – no pdf assumed
• Misclassification function: simulates the discrete decision rule to be embedded in MFoM learning
  – Binary case, with linear discriminant function f(X; W) = Σ_{k=1..R} w_k·x_k + w_0:
      d_j(X; W) = −f(X; W) for X in C_j (positive class)
      d_j(X; W) = +f(X; W) for X not in C_j (negative class)
  – Multi-category classification:
      d_j(X; Λ) = −g_j(X; Λ) + [ (1/(N−1)) Σ_{i≠j} g_i(X; Λ)^η ]^{1/η}
• Binary tree classifier with LDF
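A small numpy sketch of the multi-category misclassification measure above; it assumes positive class scores g_i (as in the usual MCE/MFoM setting), and all numbers are illustrative.

```python
import numpy as np

def misclassification_measure(scores, j, eta=2.0):
    """d_j(X) = -g_j(X) + [ (1/(N-1)) * sum_{i != j} g_i(X)^eta ]^(1/eta);
    negative d_j means X is (correctly) closest to class j."""
    others = np.delete(scores, j)
    competitor = np.mean(others ** eta) ** (1.0 / eta)
    return float(-scores[j] + competitor)

g = np.array([2.0, 1.5, 0.3])             # illustrative class scores g_1..g_N
print(misclassification_measure(g, j=0))  # about -0.92: correct with a margin
```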


    11

MFoM Classifier Learning (2)
• Approximate the overall performance metric, derived from the per-class 0-1 loss function
  – Averaged F1 as the measure:
      max_W F1(T; W) ≈ (1/N) Σ_{j=1..N} F1_j = (1/N) Σ_{j=1..N} 2·TP_j / (FP_j + FN_j + 2·TP_j)
  – Classification error rate as the measure:
      min_Λ L(T; W) = (1/|T|) Σ_{X∈T} Σ_{j=1..N} l_j(X; W) · 1(X ∈ C_j)
  – Smoothed per-class 0-1 loss:
      l_j(X; W) = 1 / (1 + e^{−α(d_j(X; W) + β)})
• The objective function is highly nonlinear; optimize by GPD:
      W_{t+1} = W_t − κ_t ∇L(T; W)|_{W = W_t}
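A minimal sketch of MFoM-style training for the binary linear case: the smoothed 0-1 loss above plus GPD-style steps W_{t+1} = W_t − κ_t ∇L. The analytic gradient for the linear unit, the data, and the step size are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def smooth_loss(d, alpha=1.0, beta=0.0):
    """l_j(X; W) = 1 / (1 + exp(-alpha * (d + beta)))."""
    return 1.0 / (1.0 + np.exp(-alpha * (d + beta)))

def gpd_step(W, b, X, y, kappa=0.1, alpha=1.0, beta=0.0):
    """One descent step for a binary linear unit with d = -y * (W.x + b),
    where y is +1 for positive and -1 for negative examples."""
    d = -y * (X @ W + b)
    l = smooth_loss(d, alpha, beta)
    # Chain rule: dl/dW = alpha * l * (1 - l) * dd/dW, with dd/dW = -y * x
    g = alpha * l * (1.0 - l) * (-y)
    return W - kappa * (g[:, None] * X).mean(axis=0), b - kappa * g.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))  # illustrative labels
W, b = np.zeros(5), 0.0
for _ in range(200):
    W, b = gpd_step(W, b, X, y)   # smoothed error-rate loss decreases
```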

    12

    Task and Experimental Setup

• ModApte split of the Reuters-21578 corpus
  – lexicon: 10,118 words, after removing 319 stop-words and words occurring fewer than 4 times
  – corpus clean-up: remove documents that have no topic labels or whose topics occur only in the training or only in the test corpus
  – final experimental setup: 7,770 training documents, 3,019 test documents, 90 topics
  – some topics have little data for training or testing, and labels conflict in some cases


    13

Performance Comparison (SIGIR 2003)

Measure   Binary F1-MFoM   SVM     k-NN
macF1     0.556            0.525   0.524
micF1     0.884            0.860   0.857
micP      0.914            0.914   0.881
micR      0.857            0.812   0.834

• LSA-based feature extraction with discriminative MFoM learning achieves the best performance on the Reuters task

    14

Performance vs. LSA Feature Dimension

[Chart: classification performance plotted against the LSA feature dimension]


    15

Separation before and after MFoM

[Chart: separation of positive and negative score distributions before and after MFoM learning]

    16

Binary vs. Multi-Class TC (ICML 2004)

F1-based comparison: multi-class MFoM works much better for small training sets

Category   # of training instances   Binary MFoM   MC MFoM
Sun-meal   1                         0.000         0.667
Potato     3                         0.333         0.750
Platinum   5                         0.286         0.833
Oat        8                         0.167         0.500
Income     9                         0.429         0.600


    17

    From Text to Multimedia Documents

• Properties of raw multimedia patterns
  – Mostly fuzzy low-level signal representations
  – Hard to locate segmentation and object boundaries
• Definition of common sets of fundamental units
  – No obvious fundamental alphabets and words
  – Precision and coverage of multimedia tokenization
• Extraction of multimedia document feature vectors
  – Dimensionality, discrimination ability and trainability
• What are the missing links?
  – Shannon's information theory perspective (1951)
  – Finding acoustic, audio, visual "alphabets" and "words"

    18

Shannon's Study on Entropy of English
• Approximation to co-occurrences of consecutive letters

Model                       Cross Entropy (bits)   Comments
Zeroth order                4.76                   uniform letters
First order                 4.03                   unigram
Second order                2.8                    bigram
Shannon's 2nd experiment    1.34                   human prediction

• The entropy of this alphabet is language-dependent, which makes text-based language ID of encrypted documents possible without using dictionaries. Can we do the same for spoken languages? How about multimedia documents?
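An empirical sketch of the estimates in the table above: the order-k conditional entropy can be approximated as the difference of block entropies of k-grams and (k−1)-grams, while the zeroth-order value is log2 of the alphabet size (log2 27 ≈ 4.75 bits for 26 letters plus space). The tiny corpus here is illustrative; Shannon used large bodies of English text.

```python
import math
from collections import Counter

def block_entropy(text, k):
    """Entropy (bits) of the empirical k-gram distribution of `text`."""
    if k == 0:
        return 0.0
    grams = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

def conditional_entropy(text, n):
    """Bits per letter of an order-(n-1) model: H(n-gram) - H((n-1)-gram)."""
    return block_entropy(text, n) - block_entropy(text, n - 1)

corpus = "the quick brown fox jumps over the lazy dog " * 50
print(conditional_entropy(corpus, 1))   # first-order (unigram) estimate
print(conditional_entropy(corpus, 2))   # second-order (bigram) estimate
```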


    19

Common Technology Thread: DSP, Feature Extraction & Classifier Learning

[Diagram] Speech/image/audio input is converted to "text" by media tokenization, driven by an A/V alphabet model and an A/V word list. LSA-based feature extraction (SVD) turns the tokenized text into feature vectors. A TC classifier, produced by TC classifier learning on a text-document training set, then performs audiovisual classification and outputs the results.

First step: define the alphabets and train the alphabet models

    20

Automatic Image Annotation
• A process associating concepts or keywords with images to describe their visual content
• AIA can be used to answer queries based on image concepts (Google-style keyword search)

[Diagram] An image is linked to a verbal annotation through an image/language connection, e.g., "…, boat, sea, sky, beach, …"


    21

Google Image Search

Query: Tiger
• Ranking 1: tiger-regal_1024x768.jpg - "A Siberian tiger in all his majesty"
• Ranking 5: Tiger%20pear%20in%20car%20tyre%2075%20dpi.jpg - "Tiger pear in car tyre"
• Ranking 23: Juvenile.jpg - "Newborn tigershark"
• Ranking 69: tiger-l.png - "Tiger"

    22

Altavista Image Search

Query: Tiger
• Ranking 1: Tiger.jpg - no caption
• Ranking 4: tiger.jpg - no caption
• Ranking 18: Tiger__1_11.jpeg - no caption
• Ranking 34: tiger-003.jpg - no caption


    23

Open Issues
• Which kinds of features are best able to capture the visual information?

    • What image components should be used as image units?

    • How to connect visual context to semantic information?

    • How to describe connections among image components to represent high-level annotations?

    24

Generic AIA System

[Diagram] Segmentation → Feature Extraction → Tokenization → Visual/Verbal Connection Model → annotation keywords ("…, boat, sea, sky, beach, …")


    25

Visual Component Units

The association with verbal information can be done with:

    • Entire image(B. Manjunath 1996, M. Swain 1991)

    • Segmented regions(Blobworld - C. Carson 1999, Y. Deng 1999)

    • Fixed-size sub-image macro-blocks(R.W. Picard 1998, Y. Mori 2000)

    26

Low Level Visual Features

• Texture features: Gabor wavelets, Wold features, DCT coefficients, FFT coefficients
• Color features: YUV, Lab, HSV and RGB histograms (an RGB-histogram sketch follows below)
• Shape features: average orientation, size, convexity, deviation, first moment, area / (boundary length)²
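As one concrete low-level color feature, here is a minimal sketch of a quantized RGB histogram for a single macro-block; the 4 bins per channel and the random block are illustrative choices.

```python
import numpy as np

def rgb_histogram(block, bins=4):
    """Normalized joint RGB histogram of an image block (H x W x 3, uint8)
    with `bins` quantization levels per channel."""
    q = (block.astype(np.int32) * bins) // 256            # 0..bins-1 per channel
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3)
    return hist / hist.sum()

block = np.random.default_rng(0).integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
print(rgb_histogram(block).shape)   # (64,) feature vector for one 16x16 block
```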


    27

Some AIA Models
• Translation Model (TM)
• Multi-Topic Text Categorization (MC MFoM)
• Maximum Entropy (ME)
• Markov Random Field (MRF)
• Conditional Random Field (CRF)
• Continuous-space Relevance Model (CRM): Gaussian distribution of visual features
• Multiple Bernoulli Relevance Model (MBRM): Gaussian distribution of visual features

    28

    Automatic Image Annotation (AIA)

Given an image, automatically associate it with a few semantic labels, i.e., keywords in a predefined vocabulary, to describe the image content based on low-level features (example annotation: "Bear, Polar, Snow, Tundra")

Training stage: learn the association rule P(X, Y) between a low-level image representation X and semantic labels Y, based on a training set

Prediction stage: predict and rank P(Y|X) for any unknown image X over the set of N pre-determined concepts (sketched below)

* Gao, Wang and Lee, "Automatic Image Annotation through Multi-Topic Text Categorization," ICASSP 2006
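A minimal sketch of the prediction stage described above: rank the N concept scores for an unknown image and keep the top-k keywords as the annotation. The scores and vocabulary are illustrative stand-ins for trained P(Y|X) values.

```python
import numpy as np

def annotate(scores, vocab, k=4):
    """Return the k keywords whose concept scores rank highest for image X."""
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]

vocab = ["bear", "polar", "snow", "tundra", "water", "sky"]
scores = np.array([0.9, 0.8, 0.85, 0.7, 0.2, 0.3])   # illustrative g_j(X)
print(annotate(scores, vocab))   # ['bear', 'snow', 'polar', 'tundra']
```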


    29

    Image Macro-Blocking vs. Segmentation

• Image segmentation is often unreliable
  – Error propagation with image segmentation (blobs)
  – Speech recognition can be accomplished without explicit word-boundary specifications
• Regular blocking without explicit segmentation (sketched below)
  – Similar to framing of speech utterances
  – Good for signal-independent image processing
• Research issues
  – Size of image sub-blocks
  – Overlapping blocking of images (like speech)
  – Resolution: image representation and quantization

    30

    Visual Terms

• Used for the first time in the Translation Model (Duygulu 2002)
• Starting from the set of feature values, create a finite set of tokens able to represent all images
  – Also used in mixture models (recent BHMMM work)
• Visual terms are analogous to words in documents
  – Any TC algorithm can be easily applied to AIA
• They can be extracted by partitioning the feature values into regions according to their similarities and using the cluster centroids as the representative terms


    31

Image Tokenization and Visual Terms
• Visual alphabets
  – VQ codebooks for color, texture, shape and others
• Tokenization of all macro-blocks of training images (sketched below)
• Visual words
  – Compute bigrams of neighboring blocks; recurring patterns appearing across images will be remembered

[Diagram: a 3×3 grid of tokenized macro-blocks X11 … X33]

• For a single 6-bit codebook
  – The LSA feature dimension is 4,160 (64 × 65) when considering both unigrams and bigrams
  – Multiple and large codebooks increase this dimension further
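A minimal sketch of VQ tokenization: each macro-block feature vector is mapped to its nearest codeword in a 6-bit (64-entry) codebook, turning the image into a sequence of visual "letters". The random codebook stands in for one trained by clustering (e.g., k-means) over training-image blocks.

```python
import numpy as np

def tokenize(features, codebook):
    """Assign each block-feature vector the index of its nearest codeword."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)                 # one token per macro-block

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))          # 6-bit codebook: 64 codewords
blocks = rng.normal(size=(96, 8))            # one feature vector per macro-block
tokens = tokenize(blocks, codebook)          # 96 visual-letter indices in [0, 64)
# Unigrams + bigrams over one such codebook give 64 + 64*64 = 4160 LSA dimensions
```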

    32

Visual Bigrams
• Visual bigrams capture the spatial information carried by neighboring blocks (a code sketch follows the chart below)

  X11 X12 X13
  X21 X22 X23
  X31 X32 X33

• The center block X22 is paired with each of its neighbors: (X22, X11), (X22, X12), (X22, X13), …, (X22, X33)

[Chart: detection error on the training and test sets for color features, comparing a unigram-only 7-bit codebook with a unigram+bigram 6-bit codebook]
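A small sketch of the bigram extraction described above: each block's token is paired with the tokens of its 8 neighbors in the grid. The grid contents are illustrative.

```python
import numpy as np

def visual_bigrams(token_grid):
    """Enumerate (center, neighbor) token pairs over the 8-neighborhood."""
    H, W = token_grid.shape
    pairs = []
    for r in range(H):
        for c in range(W):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) != (0, 0) and 0 <= r + dr < H and 0 <= c + dc < W:
                        pairs.append((token_grid[r, c], token_grid[r + dr, c + dc]))
    return pairs

grid = np.arange(9).reshape(3, 3)     # the X11..X33 layout from the slide
print(len(visual_bigrams(grid)))      # 40 directed neighbor pairs in a 3x3 grid
```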


    33

Text Representation of Images
• Given a visual lexicon A = {A1, A2, …, AM} with M visual terms, an image document can be represented by V = {V1, V2, …, VM}, each component being the occurrence statistics of one visual term in that particular image document (see the sketch below)
• SVD can be applied to reduce the dimension M
• Semantic concept modeling for image annotation
  – Semantic concept set C = {Cj, 1 ≤ j ≤ N}, where N is the total number of concepts; each concept has a discriminant function gj(X; Λj) to be trained, and multiple relevant keywords are assigned to an image X according to the decision rule
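A minimal sketch of the bag-of-visual-terms vector V for one image, concatenating unigram and bigram counts into the 4,160-dimensional representation (64 + 64×64) for a single 6-bit codebook; SVD can then reduce the dimension as in text LSA. The helper name and toy tokens are illustrative.

```python
import numpy as np

def image_term_vector(tokens, bigrams, M=64):
    """V = [unigram counts ; bigram counts] over a codebook of M codewords."""
    v_uni = np.bincount(np.asarray(tokens, dtype=int), minlength=M)
    pair_idx = np.asarray([a * M + b for a, b in bigrams], dtype=int)
    v_bi = np.bincount(pair_idx, minlength=M * M)
    return np.concatenate([v_uni, v_bi]).astype(float)

V = image_term_vector([0, 5, 5, 63], [(0, 5), (5, 63)])
print(V.shape)    # (4160,) = 64 unigram + 4096 bigram dimensions
# Stacking the V vectors of all images gives a term-by-image matrix,
# to which np.linalg.svd can be applied for dimension reduction.
```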

    34

Using Multiple Visual Dictionaries
• If more features are available, either:
  – Put all the features in one vector and build the visual codebook on the combined vectors (a single visual dictionary), or
  – Extract a separate visual dictionary for each feature, or group of features, and fuse them at a higher level
• Advantages of using more visual dictionaries:
  – Shorter visual vectors are easier to manage (avoids the curse of dimensionality)
  – New features can be added without repeating the entire extraction/tokenization process


    35

Defining Classifier Score Function

[Diagram] An input X is scored by g1(X, Λ), …, gN(X, Λ); an argmax unit outputs the class C(X).

• Each classifier score function gi is trained to discriminate positive from negative examples for the i-th class; the decision rule is
    C(X) = argmax_{1≤j≤N} gj(X, Λ)
• The classifier is trained to minimize the detection error (sketched below):
    DetE = Σ_{1≤j≤N} (FPj + FNj) / (2N)
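A small sketch of the decision rule and the detection-error measure; note the DetE normalization is reconstructed from a garbled slide and should be treated as an assumption, and all counts are illustrative.

```python
import numpy as np

def det_error(fp, fn):
    """DetE = (1/(2N)) * sum_j (FP_j + FN_j) over N classes (assumed form)."""
    return float((fp + fn).sum() / (2 * len(fp)))

def classify(scores):
    """C(X) = argmax_j g_j(X, Lambda)."""
    return int(np.argmax(scores))

print(classify(np.array([0.2, 1.7, 0.9])))            # -> class 1
print(det_error(np.array([3, 1]), np.array([2, 4])))  # -> 2.5
```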

    36

Tree of Binary Classifiers

Linear classification units gi(X, Λi) are ordered in the tree according to their false-positive rate (lower rate near the root). Each unit applies the rule:

  if gi(X, Λi) > Th+ : assign Labeli (positive classification)
  if gi(X, Λi) < Th− : not Labeli (negative classification)
  else : reject the sample (no decision)

• When the reject option occurs, the sample is sent to a unit trained for the same category but with a different visual dictionary (sketched below)
• Th+ and Th− are chosen according to the statistics of positive and negative samples
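A minimal sketch of one unit's three-way rule with the reject option; the function name and threshold values are illustrative, with Th+ and Th− assumed to come from per-class score statistics (e.g., percentiles).

```python
def tree_unit_decision(score, th_pos, th_neg):
    """Three-way rule for one binary unit in the tree: accept the label,
    rule it out, or defer to a unit built on a different visual dictionary."""
    if score > th_pos:
        return "assign_label"          # positive classification
    if score < th_neg:
        return "not_label"             # negative classification
    return "defer_to_next_dictionary"  # reject option: no decision here

print(tree_unit_decision(0.8, th_pos=0.6, th_neg=-0.6))   # -> assign_label
```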


    37

    AIA Examples: Multi-Topic TC

Example 1 – Ground truth: locomotive, railroad, smoke, train
  CMRM: water, sky, tree, people
  MFoM: mountain, sky, tree, train, locomotive, railroad, aerial

Example 2 – Ground truth: bear, polar, snow, tundra
  CMRM: water, sky, plane, jet, tree
  MFoM: tree, bear, snow, polar, tundra, ice

Example 3 – Ground truth: cat, wood, tiger, water
  CMRM: people, water, rocks, buildings
  MFoM: water, cat, tiger, forest

    38

Experimental Evaluation Data Sets
• Corel CD: 374 concepts with a total of 5,000 images; 4,500 images for training and 500 for testing
  – Each image (128×192) is uniformly segmented into 96 grids, each with a block size of 16×16
• TRECVID 2003 development set: 33,529 keyframes from 93 MPEG files with 114 concepts, out of which only 10 concepts were selected (Aircraft, including Airplane, Airplane_landing and Airplane_takeoff; Animal; Building; Car/Bus/Truck, including Car, Bus and Truck; News subject face, including Male_News_Subject and Female_News_Subject; Non-studio setting; Outdoors; People; Road; and Weather news). We randomly selected about half, i.e., 15,804 keyframes for training, and the remaining 17,725 for testing.
  – Each image (224×352) is segmented into 77 grids, each with a block size of 32×32


    39

    Comparison with Benchmarks on Corel

Measure     MFoM    MBRM    ME      CMRM    TM
# of det.   133     122     N.A.    66      49
mR          0.27    0.25    0.12    0.09    0.04
mP          0.25    0.24    0.09    0.10    0.06

• Testing on 260 out of 374 concepts; the TM, CMRM, ME and MBRM results are from other published papers

    40

    Summary

• Text Categorization (TC): a unified scenario
  – High-dimensional LSA feature extraction
  – MFoM discriminative classifier learning
• Information theory perspective: Shannon
• From text to multimedia documents and back
  – Audiovisual alphabets/words and LSA-based features
  – Pattern classification through vector-based TC
• Multimedia applications: achieving better results
  – AIA for concept-based image retrieval
  – Many other applications
