On-the-fly Visual Category Search in Web-scale Image Collections Ken Chatfield - University of Oxford May 2015

Upload: ken-chatfield

Post on 17-Aug-2015


Page 1: On-the-fly Visual Category Search in Web-scale Image Collections

On-the-fly Visual Category Search in Web-scale Image Collections

Ken Chatfield - University of Oxford

May 2015

Page 2: On-the-fly Visual Category Search in Web-scale Image Collections

Motivation and Objectives

• Search large unannotated datasets of 1M+ images for object categories

• Do so in real-time and without any prior knowledge

Page 3: On-the-fly Visual Category Search in Web-scale Image Collections

‘Regular’ Category Retrieval

Pre-trained CNN (e.g. AlexNet) → softmax → 1,000 ILSVRC classes

Page 4: On-the-fly Visual Category Search in Web-scale Image Collections

‘Regular’ Category Retrieval

Queries: car? lion? apple? bus?

Pre-trained CNN (e.g. AlexNet) → softmax

Page 5: On-the-fly Visual Category Search in Web-scale Image Collections

On-the-fly Category Retrieval

Pre-trained CNN (e.g. AlexNet) → fc7 features → OTF classifier (e.g. linear SVM)

The OTF classifier is trained on training data from the web.

Page 7: On-the-fly Visual Category Search in Web-scale Image Collections

Proposed Solution

• Bootstrap training using images from the web

• Use highly compact ConvNet features + compression as the basis of an OTF system

• Plus: Novel GPU architecture for iterative on-the-fly learning

Page 8: On-the-fly Visual Category Search in Web-scale Image Collections

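Architecture Outline

• Text query (e.g. ‘Car’) → Google Image Search → sourced training images

• Image Encoder $\phi(I)$ → positive features $\phi(I^+)$

• Fixed negative pool (precomputed features) → $\phi(I^-)$

• Linear SVM trained on $\phi(I^+)$ and $\phi(I^-)$ → model $w$

• Target Dataset (Flickr, Pinterest, etc.; precomputed features) → $\phi(I_t)$

• Ranking by $w^\top \phi(I_t)$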
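The outline above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the system's actual implementation: random vectors stand in for the encoder outputs φ(I), the linear SVM is fitted with plain subgradient descent on the hinge loss, and the names `train_linear_svm` and `rank_dataset` as well as the constants `lam`, `lr` and `epochs` are hypothetical.

```python
import numpy as np

def train_linear_svm(pos, neg, lam=1e-4, lr=0.1, epochs=200):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1          # examples violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def rank_dataset(w, b, target_feats):
    """Return target indices sorted by decreasing score w^T phi(I_t) + b."""
    scores = target_feats @ w + b
    return np.argsort(-scores), scores

# Synthetic stand-ins for the features (assumption: 128-D, Gaussian noise).
rng = np.random.default_rng(0)
D = 128
direction = np.zeros(D)
direction[0] = 1.0
pos = 0.05 * rng.standard_normal((30, D)) + direction    # phi(I+), from the web
neg = 0.05 * rng.standard_normal((300, D))                # phi(I-), fixed negative pool
target = 0.05 * rng.standard_normal((1000, D))            # phi(I_t), precomputed
target[:20] += direction                                   # 20 planted positives

w, b = train_linear_svm(pos, neg)
order, scores = rank_dataset(w, b, target)
top20 = set(order[:20].tolist())
print(sorted(top20))
```

The bias `b` does not change the ranking; it is kept only so that the negative pool can satisfy the margin during training.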

Page 9: On-the-fly Visual Category Search in Web-scale Image Collections

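Need for Speed

Same pipeline as the Architecture Outline: query (‘Car’) → Google Image Search sourced training images → Image Encoder $\phi(I)$ → $\phi(I^+)$; negative pool → $\phi(I^-)$; Linear SVM → $w$; Target Dataset (Flickr, Pinterest, etc.) → $\phi(I_t)$ → ranking by $w^\top \phi(I_t)$.

Ranking is the most critical stage: $w^\top \phi(I_t)$ must be computed for every $\phi(I_t)$ in the target dataset.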

Page 10: On-the-fly Visual Category Search in Web-scale Image Collections

Fast Ranking = Compact Representation

Ranking must compute $w^\top x$ for every image feature $x$ in the dataset, giving a complexity of O(ND), so it is important to reduce the dimensionality of the image representation:

• Obtain 128-D representation from CNN (488 MB / 1M images)

• Then compress further using binarization (122 MB / 1M images)

• Or using product quantization (30.5 MB / 1M images)

N: number of images in the test set; D: dimension of the image representation
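The storage figures on this slide can be checked with a short calculation. The sketch below assumes that "MB" means 2^20 bytes and that features are stored as 32-bit floats. The 1024-bit binary code and the 32-byte PQ code are not stated on the slide; they are inferred here from the 122 MB and 30.5 MB figures and should be read as assumptions.

```python
MIB = 2 ** 20  # assumption: the slide's "MB" is 2**20 bytes

def storage_mib(bytes_per_image, n_images=1_000_000):
    """Storage in MiB for n_images codes of the given size."""
    return bytes_per_image * n_images / MIB

float_128d = storage_mib(128 * 4)      # 128 float32 values per image
binary_1024 = storage_mib(1024 // 8)   # assumed 1024-bit binary code
pq_32 = storage_mib(32)                # assumed 32 one-byte PQ indices

# Cost of one exhaustive ranking pass, O(ND) multiply-adds.
N, D = 1_000_000, 128
multiply_adds = N * D

print(f"128-D float32: {float_128d:.1f} MiB")
print(f"binary code:   {binary_1024:.1f} MiB")
print(f"PQ code:       {pq_32:.1f} MiB")
print(f"multiply-adds per ranking: {multiply_adds:,}")
```

Under the same assumptions a 2048-D float32 representation needs 8192 bytes per image, which is about 7.63 GiB per 1M images.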

Page 11: On-the-fly Visual Category Search in Web-scale Image Collections

Lower-dimensional Features

Taking CNN-M network as base:

conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 (dropout) 4096-D → fc7 (dropout) 4096-D → ILSVRC softmax

Page 12: On-the-fly Visual Category Search in Web-scale Image Collections

Lower-dimensional Features

Taking CNN-M network as base:

conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 (dropout) 2048-D → fc7 (dropout) 2048-D → ILSVRC softmax

Replace the 4096-D fc layer with 2048-D or 128-D layers

Page 13: On-the-fly Visual Category Search in Web-scale Image Collections

Lower-dimensional Features

Taking CNN-M network as base:

conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 (dropout) 128-D → fc7 (dropout) 128-D → ILSVRC softmax

Replace the 4096-D fc layer with 2048-D or 128-D layers

Page 14: On-the-fly Visual Category Search in Web-scale Image Collections

Lower-dimensional Features

[Figure: bar chart of mAP (VOC 07) against feature dimensionality; y-axis from 78 to 81]

4096-D: 79.89

2048-D: 80.1

1024-D: 79.91

128-D: 78.6

Page 15: On-the-fly Visual Category Search in Web-scale Image Collections

Compression

• Binarization by embedding into Hamming space:

$e : \mathbb{R}^D \to \mathbb{B}^M$

$b_i = \operatorname{sgn}(U x_i)$

where $M > D$ and $U$ is obtained by taking the first $D$ columns of the QR-decomposition of a random $M \times M$ matrix

• Product Quantization

[Figure: product quantization diagram; labels: D, S, d, Q]
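Both compression schemes on this slide can be sketched with NumPy. This is an illustrative sketch, not the authors' code. Assumptions: M = 1024 bits for the binary code; for product quantization the D-dimensional vector is split into S = 32 sub-vectors of d = 4 dimensions, each quantized to one of K = 256 centroids learned with a few k-means iterations. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 128, 1024            # M > D; 1024 bits is an assumed code length
S, K = 32, 256              # assumed PQ layout: 32 sub-vectors, 256 centroids each
d = D // S

# Binarization: b = sgn(U x), U = first D columns of Q from a random M x M QR.
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
U = Q[:, :D]                # shape (M, D), orthonormal columns

def binarize(X):
    """Map rows of X (N x D) to M sign bits, packed 8 bits per byte."""
    return np.packbits((X @ U.T) > 0, axis=1)

def kmeans(X, k, iters=5):
    """Plain Lloyd iterations; empty clusters keep their previous centroid."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(X):
    """One codebook per sub-vector; result has shape (S, K, d)."""
    return np.stack([kmeans(X[:, s * d:(s + 1) * d], K) for s in range(S)])

def pq_encode(X, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    codes = np.empty((len(X), S), dtype=np.uint8)
    for s in range(S):
        sub = X[:, s * d:(s + 1) * d]
        dists = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(axis=2)
        codes[:, s] = dists.argmin(axis=1)
    return codes

def pq_scores(w, codes, codebooks):
    """Approximate w^T x via a lookup table of sub-vector inner products."""
    table = np.einsum("sd,skd->sk", w.reshape(S, d), codebooks)
    return table[np.arange(S), codes].sum(axis=1)

X = rng.standard_normal((1000, D))
packed = binarize(X)                       # 128 bytes per image
codebooks = pq_train(X)
codes = pq_encode(X, codebooks)            # 32 bytes per image
w = rng.standard_normal(D)
approx = pq_scores(w, codes, codebooks)
```

With PQ codes the ranking score never needs the decompressed vectors: the S x K table is computed once per model `w`, and each image then costs S table lookups.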

Page 16: On-the-fly Visual Category Search in Web-scale Image Collections

Evaluation Dataset

PASCAL VOC 2007: 10,000 annotated images

MIRFLICKR-1M: 1M unannotated images

• Want to evaluate CNN features for real-world photo retrieval

• Datasets are disjoint from ImageNet (which the CNN was trained on) and place less focus on fine-grained retrieval

Page 17: On-the-fly Visual Category Search in Web-scale Image Collections

Evaluation Dataset

Using MIRFLICKR-1M dataset as distractors

Page 18: On-the-fly Visual Category Search in Web-scale Image Collections

Evaluation Dataset

Using MIRFLICKR-1M dataset as distractors

Remove false negatives and evaluate Precision @ K, where K = 100

Or evaluate Precision @ K over MIRFLICKR-1M directly
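The evaluation protocol above can be illustrated with a small function. The reading of "remove false negatives" used here is an assumption: unannotated distractor images that in fact show the queried class are dropped from the ranking before precision is computed. The function name `precision_at_k` and the toy image ids are hypothetical.

```python
def precision_at_k(ranked_ids, positives, ignore=frozenset(), k=100):
    """Precision over the top k ranked items, after skipping ids in `ignore`."""
    kept = [i for i in ranked_ids if i not in ignore][:k]
    if not kept:
        return 0.0
    return sum(1 for i in kept if i in positives) / len(kept)

# Toy ranking: "v*" are annotated VOC images, "d*" are unannotated distractors.
ranked = ["v1", "d7", "v2", "d3", "v9", "d5"]
positives = {"v1", "v2", "v9"}      # annotated as containing the class
false_negatives = {"d7"}            # distractor that actually shows the class

p_raw = precision_at_k(ranked, positives, k=4)
p_clean = precision_at_k(ranked, positives, ignore=false_negatives, k=4)
print(p_raw, p_clean)
```

In the toy example the cleaned score is higher because the distractor "d7" no longer counts as an error in the top 4.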

Page 19: On-the-fly Visual Category Search in Web-scale Image Collections

Retrieval Results

Results for two sample classes over VOC + distractor data (retrieve ~500 images from within 1M images; TP are 0.05% of the dataset)

[Figure: precision against rank (0 to 100), one panel per class; curves: CNN 2048, CNN 128, CNN 128 PQ, FK 512, CNN 128 rpbin]

Class: Sheep, CNN 128 (Prec. 0.32 @ 100)

Class: Motorbike, CNN 128 (Prec. 0.77 @ 100)

Page 21: On-the-fly Visual Category Search in Web-scale Image Collections

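Retrieval Results

Column groups as labelled on the slide: VOC Training, Google Training. The last column is storage per 1M images.

Method | Scores (in slide order) | Storage

CNN-M 2048 | 55.4, 95.4 | 7.63 GB

CNN-M 128 | 51.0, 95.1 | 488 MB

CNN-M 128 BIN | 50.1, 94.0 | 122 MB

CNN-M 128 PQ | 50.5, 94.6, 92.1 | 30.5 MB

FV | 29.3, 80.5, - | 312.8 GB

Third-column values 90.9 and 92.3 belong to two of the first three rows.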

Page 22: On-the-fly Visual Category Search in Web-scale Image Collections

Freeform Queries

Page 24: On-the-fly Visual Category Search in Web-scale Image Collections

VOC vs Google Training

‘Chair’, CNN 128: Prec. 0.92 @ 100 (VOC Training), Prec. 0.86 @ 100 (Google Training)

‘Train’, CNN 128: Prec. 1.0 @ 100 (VOC Training), Prec. 1.0 @ 100 (Google Training)

Page 25: On-the-fly Visual Category Search in Web-scale Image Collections

Instances & Faces too

Instances:

• N training images → Root SIFT Extractor $\psi(I) \to x_i$ → VQ Encoder $\phi(x_i)$ and Hamming Encoder $\phi(x_i)$ → $\phi(I^+)$

• Target Dataset: $\psi(I_t)$, $\phi(x_i)$ → match? → Spatial Verification → match?

• Ranking x N (take max)

Faces:

• N training images → Face Extractor $\psi(I) \to I_f$ → $I_f^+$ → Pre-trained Face CNN $\phi(I)$ → $\phi(I_f^+)$

• Negative Pool → $\phi(I^-)$

• Linear SVM → $w$

• Target Dataset Tracks → $\phi(I_t)$ → Ranking

Page 26: On-the-fly Visual Category Search in Web-scale Image Collections

Live Demo

1. Landing Page: User enters text query term and selects search modality (e.g. ‘forest’ using object category search)

2. Querying: A live view of images downloaded from Google Image search as they are used to construct a visual appearance model on-the-fly

3. Ranked Results: A ranked list of visually matching images is displayed within 1~30 secs of entering the cold query

Can try out the system live over a dataset of 5M+ images sourced from BBC News footage at: http://varro3.robots.ox.ac.uk:9090

Page 27: On-the-fly Visual Category Search in Web-scale Image Collections

ConvNet-based Architecture

• Libraries such as Caffe allow for fast computation of ConvNet features entirely on GPU

Question:

How can we adapt the standard GPU ConvNet pipeline for on-the-fly search?

We want:

• simultaneous feature computation/model training

• highly parallel operation by using a GPU-bound architecture

Page 28: On-the-fly Visual Category Search in Web-scale Image Collections

ConvNet-based Architecture

• Text query (e.g. ‘Sheep’) → Google Image Search → Training Images (RGB)

• RGB → conv stack → fc stack → CNN feat.

• Fixed negative pool: precomputed CNN feats

• SVM → Model $w$

Page 29: On-the-fly Visual Category Search in Web-scale Image Collections

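ConvNet-based Architecture

• Text query (e.g. ‘Sheep’) → Google Image Search → Training Images

• Fixed negative pool: precomputed CNN feats

• Batch Sampler, batch size = B: RGB x B/2 positives, CNN feat. x B/2 negatives

• RGB → conv stack → fc stack → CNN feat.

• SVM Loss Layer: $\nabla = \frac{1}{B} \sum_{i=1}^{B} \mathbb{I}[y_i w^\top x_i < 1] \, y_i x_i$

• The pipeline is split into a CPU Frontend and a GPU Backend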
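The batch sampler and SVM loss layer on this slide can be sketched with NumPy. This is a sketch under assumptions: the slide's ∇ is read as the update direction (the subgradient of the mean hinge loss is its negative), the positive features are taken as already computed rather than produced by the conv and fc stacks, and the names `sample_batch`, `svm_direction` and `sgd_step` plus the constants `lr` and `lam` are hypothetical.

```python
import numpy as np

def sample_batch(pos_feats, neg_feats, B, rng):
    """Draw B/2 positives and B/2 negatives, with labels +1 and -1."""
    half = B // 2
    p = pos_feats[rng.integers(0, len(pos_feats), half)]
    n = neg_feats[rng.integers(0, len(neg_feats), half)]
    X = np.vstack([p, n])
    y = np.concatenate([np.ones(half), -np.ones(half)])
    return X, y

def svm_direction(w, X, y):
    """(1/B) * sum_i I[y_i w^T x_i < 1] * y_i * x_i, as written on the slide."""
    active = y * (X @ w) < 1
    return (y[active][:, None] * X[active]).sum(axis=0) / len(X)

def sgd_step(w, X, y, lr=0.5, lam=1e-4):
    """Move w along the slide's direction, with a small L2 shrinkage."""
    return w + lr * (svm_direction(w, X, y) - lam * w)

# Toy run on synthetic features (assumption: 128-D).
rng = np.random.default_rng(0)
D, B = 128, 32
pos_feats = 0.05 * rng.standard_normal((30, D))
pos_feats[:, 0] += 1.0
neg_feats = 0.05 * rng.standard_normal((300, D))
neg_feats[:, 0] -= 1.0

w = np.zeros(D)
for _ in range(50):
    X, y = sample_batch(pos_feats, neg_feats, B, rng)
    w = sgd_step(w, X, y)
```

Only examples inside the margin contribute to the update, so once a batch is classified with margin at least 1 the direction is zero and `w` changes only through the shrinkage term.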

Page 30: On-the-fly Visual Category Search in Web-scale Image Collections

ConvNet-based Architecture

• Text query (e.g. ‘Sheep’) → Google Image Search → Training Images → Image Buffer

• Fixed negative pool: precomputed CNN feats

• Batch Sampler, batch size = B: RGB x B/2 positives, CNN feat. x B/2 negatives

• RGB → conv stack → fc stack → CNN feat.

• SVM Loss Layer: $\nabla = \frac{1}{B} \sum_{i=1}^{B} \mathbb{I}[y_i w^\top x_i < 1] \, y_i x_i$

• The pipeline is split into a CPU Frontend and a GPU Backend

Page 31: On-the-fly Visual Category Search in Web-scale Image Collections

ConvNet-based Architecture

• Text query (e.g. ‘Sheep’) → Google Image Search → Training Images → Image Buffer

• Fixed negative pool: precomputed CNN feats

• Batch Sampler, batch size = B: RGB x B/2 positives, CNN feat. x B/2 negatives

• RGB → conv stack → fc stack → CNN feat.

• SVM Loss Layer → Model $w$

• Every τ secs: Inner Product Layer applies Model $w$ to the precomputed CNN feats of the Target Dataset (MIRFLICKR)

• The pipeline is split into a CPU Frontend and a GPU Backend

Page 32: On-the-fly Visual Category Search in Web-scale Image Collections

Retrieval Results

• Images are fed into the network at a rate of 12 per second

• Dataset is ranked with current model every ~0.2 seconds

• Most rankings stabilise in under 1 second

[Figure: Precision @ 100 against seconds (0 to 2), two panels; curves: 10 images, 20 images, 30 images]

[Figure labels: 0.15 s, 0.36 s, 0.54 s, 0.73 s; sofa, sheep, bus, horse]
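The timing figures quoted on this slide (12 images per second, a ranking every ~0.2 seconds) can be turned into a small simulation of the interleaved train-and-rank loop. This is a sketch with a simulated clock and synthetic features, not the GPU system itself. The 16-D features, the 20 update steps per tick and all names are assumptions.

```python
import numpy as np

FEED_RATE = 12          # images per second (from the slide)
RANK_EVERY_MS = 200     # ~0.2 s between rankings (from the slide)

def images_available(t_ms):
    """Number of training images downloaded by simulated time t_ms."""
    return FEED_RATE * t_ms // 1000

def hinge_step(w, X, y, lr=0.5):
    """One update along (1/B) * sum of y_i x_i over margin violators."""
    active = y * (X @ w) < 1
    return w + lr * (y[active][:, None] * X[active]).sum(axis=0) / len(X)

rng = np.random.default_rng(0)
D = 16
signal = np.zeros(D)
signal[0] = 1.0
pos_stream = 0.05 * rng.standard_normal((30, D)) + signal   # in arrival order
neg_pool = 0.05 * rng.standard_normal((200, D))             # fixed negative pool
target = 0.05 * rng.standard_normal((500, D))               # precomputed target feats
target[:10] += signal

w = np.zeros(D)
timeline = []
for t_ms in range(RANK_EVERY_MS, 1001, RANK_EVERY_MS):
    n = images_available(t_ms)
    X = np.vstack([pos_stream[:n], neg_pool])
    y = np.concatenate([np.ones(n), -np.ones(len(neg_pool))])
    for _ in range(20):                       # assumed number of updates per tick
        w = hinge_step(w, X, y)
    top10 = np.argsort(-(target @ w))[:10]    # re-rank with the current model
    timeline.append((t_ms, n, top10))

for t_ms, n, top10 in timeline:
    print(t_ms, n, sorted(top10.tolist()))
```

The model is never retrained from scratch: each ranking uses whatever `w` has been reached with the images downloaded so far.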

Page 43: On-the-fly Visual Category Search in Web-scale Image Collections

Continued Work

Currently working on the following extensions:

• How to select negative training images more intelligently (e.g. selection of the most discriminative negative images per query from a larger 1M+ pool of non-class images)

• How to establish a confidence measure for images in the output ranking, so that we know when a query works well or not, and can source training images more intelligently

• Query attribute refinement (sporty + car)

Page 44: On-the-fly Visual Category Search in Web-scale Image Collections

Related Publications

“On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets”, IJMIR 2015. Ken Chatfield, Relja Arandjelovic, Omkar Parkhi, Andrew Zisserman

“Efficient On-the-fly Category Retrieval using ConvNets and GPUs”, ACCV 2014. Ken Chatfield, Karen Simonyan, Andrew Zisserman

“Return of the Devil in the Details: Delving Deep into Convolutional Nets”, BMVC 2014 (Best Paper Prize). Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

http://www.robots.ox.ac.uk/~ken