On-the-fly Visual Category Search in Web-scale Image Collections
TRANSCRIPT
Ken Chatfield, University of Oxford
May 2015
Motivation and Objectives
• Search large unannotated datasets of 1M+ images for object categories
• Do so in real time and without any prior knowledge of the categories
‘Regular’ Category Retrieval
• Pre-trained CNN (e.g. AlexNet) with a softmax layer over the 1,000 ILSVRC classes
• Limited to that fixed vocabulary of queries (car? lion? apple? bus?)
On-the-fly Category Retrieval
• Pre-trained CNN (e.g. AlexNet) used as a feature extractor (fc7 features)
• On-the-fly (OTF) classifier (e.g. linear SVM) trained using data from the web
Proposed Solution
• Bootstrap training using images sourced from the web
• Use highly compact ConvNet features + compression as the basis of an OTF system
• Plus: a novel GPU architecture for iterative on-the-fly learning
Architecture Outline
[Diagram] Text query (e.g. ‘car’) → Google Image Search sourced training images → image encoder φ(I) → positive features φ(I+); a fixed negative pool of precomputed features provides φ(I−); both train a linear SVM → weight vector w; the target dataset (Flickr, Pinterest etc., with precomputed features φ(It)) is then ranked by score wᵀφ(It)
Need for Speed
The ranking stage is the most critical part of the pipeline: given the model w, we must compute wᵀφ(It) for the precomputed feature φ(It) of every image It in the target dataset.
Fast Ranking = Compact Representation
Must compute w·X for all image features in the dataset, giving complexity O(ND), where N is the number of images in the test set and D is the dimensionality of the image representation, so it is important to reduce D:
• Obtain a 128-D representation from the CNN (488 MB / 1M images)
• Then compress further using binarization (122 MB / 1M images)
• Or using product quantization (30.5 MB / 1M images)
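As a rough sketch (sizes here are illustrative, not the production system), the O(ND) ranking stage is a single matrix-vector product over the precomputed feature matrix:

```python
# Sketch of the ranking stage: score every precomputed descriptor against
# the learned SVM weight vector w. Sizes are illustrative; a real
# 1M x 128 float32 matrix would occupy roughly the 488 MB quoted above.
import numpy as np

N, D = 10_000, 128
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D)).astype(np.float32)  # precomputed features
w = rng.standard_normal(D).astype(np.float32)       # on-the-fly SVM model

scores = X @ w                     # one matrix-vector product: O(N*D)
ranking = np.argsort(-scores)      # dataset indices, highest score first
```

Halving D halves both the ranking cost and the feature storage, which is what motivates the compact 128-D features below.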
Lower-dimensional Features
Taking the CNN-M network as base:
conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 (d.o.) 4096-D → fc7 (d.o.) 4096-D → ILSVRC softmax
Replace the 4096-D fully-connected layers with 2048-D and 128-D variants.
Lower-dimensional Features
mAP (VOC 07) by final-layer dimensionality:
4096-D: 79.89   2048-D: 80.1   1024-D: 79.91   128-D: 78.6
Compression
• Binarization by embedding into Hamming space:
  e : ℝᴰ → 𝔹ᴹ,   bᵢ = sgn(U xᵢ)
  where M > D and U is obtained by taking the first D columns of the QR decomposition of a random M x M matrix
• Product Quantization: split each D-dimensional descriptor into S sub-vectors of dimension d = D/S and quantize each sub-vector separately against its own codebook
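The binarization scheme above can be sketched as follows (M and D are chosen for illustration; U has orthonormal columns by construction of the QR decomposition):

```python
# Random-projection binarization into Hamming space, b = sgn(Ux).
# M and D are illustrative values matching the slide's setup (M > D).
import numpy as np

D, M = 128, 1024                   # descriptor dim, binary code length
rng = np.random.default_rng(0)

# U: first D columns of the Q factor of a random M x M matrix
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
U = Q[:, :D]                       # M x D projection

def binarize(x):
    """Embed a D-dim descriptor into an M-bit binary code."""
    return U @ x > 0               # boolean vector of M bits

x = rng.standard_normal(D)
x_near = x + 0.01 * rng.standard_normal(D)   # near-duplicate descriptor
b, b_near = binarize(x), binarize(x_near)
hamming = int(np.count_nonzero(b != b_near)) # small for similar inputs
```

At M = 1024 bits each code occupies 128 bytes, which is consistent with the 122 MB / 1M images figure quoted earlier.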
Evaluation Dataset
• PASCAL VOC 2007: 10,000 annotated images
• MIRFLICKR-1M: 1M unannotated images
• Want to evaluate CNN features for real-world photo retrieval
• Disjoint from ImageNet (as the CNN was trained on it) + with less focus on fine-grained retrieval
Evaluation Dataset
Using the MIRFLICKR-1M dataset as distractors:
• Remove false negatives and evaluate Precision @ K, where K = 100
• Or evaluate Precision @ K over MIRFLICKR-1M directly
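For reference, the Precision @ K metric used throughout can be computed with a small helper like this (a hypothetical function name, not from the talk):

```python
def precision_at_k(ranked_relevance, k=100):
    """Fraction of relevant items among the top k of a ranked list.

    ranked_relevance: iterable of 0/1 relevance flags in ranked order.
    """
    top = list(ranked_relevance)[:k]
    return sum(top) / len(top)

precision_at_k([1, 1, 0, 1, 0, 0], k=4)   # -> 0.75
```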
Retrieval Results
Results for two sample classes over VOC + distractor data (retrieve ~500 images from within 1M images – true positives are 0.05% of the dataset)
[Plot: precision vs. rank (0–100), curves for CNN 2048, CNN 128, CNN 128 PQ, FK 512 and CNN 128 rpbin]
• Class: Sheep – CNN 128 (Prec. 0.32 @ 100)
• Class: Motorbike – CNN 128 (Prec. 0.77 @ 100)
Retrieval Results

Method          mAP    Prec@100         Prec@100            Storage (1M images)
                       (VOC Training)   (Google Training)
CNN-M 2048      55.4   95.4             90.9                7.63 GB
CNN-M 128       51.0   95.1             92.3                488 MB
CNN-M 128 BIN   50.1   94.0             —                   122 MB
CNN-M 128 PQ    50.5   94.6             92.1                30.5 MB
FV              29.3   80.5             —                   312.8 GB
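The product quantization variant in the table stores each image as a handful of byte codes. A toy sketch of the encode/decode round trip (sizes illustrative; real codebooks would come from per-subspace k-means rather than random centroids):

```python
# Toy product quantization: split each D-dim descriptor into S sub-vectors
# and quantize each against its own K-entry codebook (1 byte per sub-code).
import numpy as np

D, S, K = 128, 16, 256            # descriptor dim, sub-quantizers, centroids
d = D // S                        # dimension of each sub-vector
rng = np.random.default_rng(0)

# Stand-in codebooks (random centroids instead of trained k-means output)
codebooks = rng.standard_normal((S, K, d))

def pq_encode(x):
    """Encode x as S one-byte centroid indices (S bytes total)."""
    codes = np.empty(S, dtype=np.uint8)
    for s in range(S):
        sub = x[s * d:(s + 1) * d]
        codes[s] = np.argmin(np.linalg.norm(codebooks[s] - sub, axis=1))
    return codes

def pq_decode(codes):
    """Reconstruct an approximate descriptor from its codes."""
    return np.concatenate([codebooks[s][c] for s, c in enumerate(codes)])

# A descriptor built exactly from centroid 5 of every codebook round-trips
x = np.concatenate([codebooks[s][5] for s in range(S)])
codes = pq_encode(x)
```

At 32 sub-quantizers this would be 32 bytes per image, i.e. roughly the 30.5 MB per 1M images quoted in the table.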
Freeform Queries

VOC vs Google Training
                       VOC Training       Google Training
‘Chair’ – CNN 128      Prec. 0.92 @ 100   Prec. 0.86 @ 100
‘Train’ – CNN 128      Prec. 1.0 @ 100    Prec. 1.0 @ 100
Instances & Faces too

Instances
[Diagram] Root SIFT extractor ψ(I) → xi → VQ encoder φ(xi) / Hamming encoder φ(xi); matched against the target dataset over the N training images (taking the max) → ranking → spatial verification

Faces
[Diagram] Face extractor ψ(I) → If → pre-trained face CNN → φ(If+) over the N training images, plus a negative pool φ(I−) → linear SVM → w → ranking of target dataset tracks φ(It)
Live Demo
Landing Page1User enters text query term and selects search modality (e.g. ‘forest’ using object category search)
Ranked Results3A ranked list of visually matching images is displayed within 1~30 secs of entering the cold query
Querying2A live view of images downloaded from Google Image search as they are used to construct a visual appearance model on-the-fly
Can try out the system live over a dataset of 5M+ images sourced from BBC News footage at: http://varro3.robots.ox.ac.uk:9090
Question:
How can we adapt the standard GPU ConvNet pipeline for on-the-fly search?
We want:
• simultaneous feature computation/model training
• highly parallel operation by using a GPU-bound architecture

ConvNet-based Architecture
• Libraries such as Caffe allow for fast computation of ConvNet features entirely on the GPU
ConvNet-based Architecture
[Diagram] Text query (e.g. ‘sheep’) → Google Image Search training images → RGB input → conv stack → fc stack → CNN features; together with precomputed CNN features from a fixed negative pool, these train an SVM → model w
ConvNet-based Architecture
[Diagram] A batch sampler (batch size B) draws B/2 positives (RGB images from an image buffer fed by Google Image Search training images, passed through the conv stack + fc stack to give CNN features) and B/2 negatives (precomputed CNN features from the fixed negative pool). Both feed an SVM loss layer. The CPU frontend handles image download and sampling; the conv/fc stacks and loss run on the GPU backend.

The SVM loss layer computes the hinge-loss subgradient over the batch:

∇ = −(1/B) Σᵢ₌₁..B I[yᵢ wᵀxᵢ < 1] yᵢ xᵢ
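A toy sketch of the training step this loss layer implies: one subgradient update of a linear SVM on precomputed features (the learning rate, regulariser and data here are illustrative, not the system's actual hyperparameters):

```python
# Mini-batch hinge-loss subgradient step for a linear SVM, mirroring the
# SVM loss layer above. lr and lam are illustrative hyperparameters.
import numpy as np

def svm_step(w, X, y, lr=0.5, lam=1e-4):
    """One update: X is (B, D) features, y has entries in {-1, +1}."""
    margins = y * (X @ w)                  # y_i * w^T x_i
    active = margins < 1                   # indicator I[y_i w^T x_i < 1]
    grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
    return w - lr * grad

# Toy linearly separable data: positives shifted away from negatives
rng = np.random.default_rng(0)
D, B = 16, 64
X = np.vstack([rng.standard_normal((B // 2, D)) + 2.0,
               rng.standard_normal((B // 2, D)) - 2.0])
y = np.array([1.0] * (B // 2) + [-1.0] * (B // 2))

w = np.zeros(D)
for _ in range(50):
    w = svm_step(w, X, y)

accuracy = np.mean(np.sign(X @ w) == y)    # approaches 1.0 on this toy data
```

Because each step only needs a matrix product and a masked sum over the batch, it maps naturally onto a GPU loss layer alongside the feature computation.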
ConvNet-based Architecture
[Diagram] As above, with an added inner product layer: every τ seconds, the precomputed CNN features of the target dataset (MIRFLICKR) are scored against the current model w, alongside the SVM loss layer updates.
Retrieval Results
• Images are fed into the network at a rate of 12 per second
• The dataset is ranked with the current model every ~0.2 seconds
• Most rankings stabilise in under 1 second
[Plot: Precision @ 100 vs. time in seconds (0–2), curves for 10, 20 and 30 training images; ranking snapshots at 0.15 s, 0.36 s, 0.54 s and 0.73 s for the classes sofa, sheep, bus and horse]
Continued Work
Currently working on the following extensions:
• How to select negative training images more intelligently (e.g. selection of the most discriminative negative images per query from a larger 1M+ pool of non-class images)
• How to establish a confidence measure for images in the output ranking, so we know when a query works well or not, and can source training images more intelligently
• Query attribute refinement (e.g. sporty + car)
Related Publications
• “On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets”, IJMIR 2015. Ken Chatfield, Relja Arandjelovic, Omkar Parkhi, Andrew Zisserman
• “Efficient On-the-fly Category Retrieval using ConvNets and GPUs”, ACCV 2014. Ken Chatfield, Karen Simonyan, Andrew Zisserman
• “Return of the Devil in the Details: Delving Deep into Convolutional Nets”, BMVC 2014. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (Best Paper Prize)
http://www.robots.ox.ac.uk/~ken