On-the-fly Visual Category Search in Web-scale Image Collections
TRANSCRIPT
Ken Chatfield, University of Oxford
May 2015
Motivation and Objectives
• Search large unannotated datasets of 1M+ images for object categories
• Do so in real time and without any prior knowledge of the categories
‘Regular’ Category Retrieval
• Pre-trained CNN (e.g. AlexNet) with a softmax layer over the 1,000 ILSVRC classes
• Limited to that fixed vocabulary of queries (car? lion? apple? bus?)
On-the-fly Category Retrieval
• Pre-trained CNN (e.g. AlexNet) used as a feature extractor (fc7 features)
• On-the-fly (OTF) classifier (e.g. linear SVM) trained using data from the web
Proposed Solution
• Bootstrap training using images sourced from the web
• Use highly compact ConvNet features + compression as the basis of an OTF system
• Plus: a novel GPU architecture for iterative on-the-fly learning
Architecture Outline
[Diagram] Text query (e.g. ‘car’) → Google Image Search sourced training images → image encoder φ(I) → positive features φ(I+); a fixed negative pool of precomputed features provides φ(I−); both train a linear SVM → weight vector w; the target dataset (Flickr, Pinterest etc., with precomputed features φ(It)) is then ranked by score wᵀφ(It)
Need for Speed
The ranking stage is the most critical part of the pipeline: given the model w, we must compute wᵀφ(It) for the precomputed feature φ(It) of every image It in the target dataset.
Fast Ranking = Compact Representation
Must compute w·X for all image features in the dataset, giving complexity O(ND), where N is the number of images in the test set and D is the dimensionality of the image representation, so it is important to reduce D:
• Obtain a 128-D representation from the CNN (488 MB / 1M images)
• Then compress further using binarization (122 MB / 1M images)
• Or using product quantization (30.5 MB / 1M images)
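As a rough sketch (sizes here are illustrative, not the production system), the O(ND) ranking stage is a single matrix-vector product over the precomputed feature matrix:

```python
# Sketch of the ranking stage: score every precomputed descriptor against
# the learned SVM weight vector w. Sizes are illustrative; a real
# 1M x 128 float32 matrix would occupy roughly the 488 MB quoted above.
import numpy as np

N, D = 10_000, 128
rng = np.random.default_rng(0)
X = rng.standard_normal((N, D)).astype(np.float32)  # precomputed features
w = rng.standard_normal(D).astype(np.float32)       # on-the-fly SVM model

scores = X @ w                     # one matrix-vector product: O(N*D)
ranking = np.argsort(-scores)      # dataset indices, highest score first
```

Halving D halves both the ranking cost and the feature storage, which is what motivates the compact 128-D features below.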
Lower-dimensional Features
Taking the CNN-M network as base:
conv1 96x7x7 → conv2 256x5x5 → conv3 512x3x3 → conv4 512x3x3 → conv5 512x3x3 → fc6 (d.o.) 4096-D → fc7 (d.o.) 4096-D → ILSVRC softmax
Replace the 4096-D fully-connected layers with 2048-D and 128-D variants.
Lower-dimensional Features
mAP (VOC 07) by final-layer dimensionality:
4096-D: 79.89   2048-D: 80.1   1024-D: 79.91   128-D: 78.6
Compression
• Binarization by embedding into Hamming space:
  e : ℝᴰ → 𝔹ᴹ,   bᵢ = sgn(U xᵢ)
  where M > D and U is obtained by taking the first D columns of the QR decomposition of a random M x M matrix
• Product Quantization: split each D-dimensional descriptor into S sub-vectors of dimension d = D/S and quantize each sub-vector separately against its own codebook
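The binarization scheme above can be sketched as follows (M and D are chosen for illustration; U has orthonormal columns by construction of the QR decomposition):

```python
# Random-projection binarization into Hamming space, b = sgn(Ux).
# M and D are illustrative values matching the slide's setup (M > D).
import numpy as np

D, M = 128, 1024                   # descriptor dim, binary code length
rng = np.random.default_rng(0)

# U: first D columns of the Q factor of a random M x M matrix
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
U = Q[:, :D]                       # M x D projection

def binarize(x):
    """Embed a D-dim descriptor into an M-bit binary code."""
    return U @ x > 0               # boolean vector of M bits

x = rng.standard_normal(D)
x_near = x + 0.01 * rng.standard_normal(D)   # near-duplicate descriptor
b, b_near = binarize(x), binarize(x_near)
hamming = int(np.count_nonzero(b != b_near)) # small for similar inputs
```

At M = 1024 bits each code occupies 128 bytes, which is consistent with the 122 MB / 1M images figure quoted earlier.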
Evaluation Dataset
• PASCAL VOC 2007: 10,000 annotated images
• MIRFLICKR-1M: 1M unannotated images
• Want to evaluate CNN features for real-world photo retrieval
• Disjoint from ImageNet (as the CNN was trained on it) + with less focus on fine-grained retrieval
Evaluation Dataset
Using the MIRFLICKR-1M dataset as distractors:
• Remove false negatives and evaluate Precision @ K, where K = 100
• Or evaluate Precision @ K over MIRFLICKR-1M directly
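For reference, the Precision @ K metric used throughout can be computed with a small helper like this (a hypothetical function name, not from the talk):

```python
def precision_at_k(ranked_relevance, k=100):
    """Fraction of relevant items among the top k of a ranked list.

    ranked_relevance: iterable of 0/1 relevance flags in ranked order.
    """
    top = list(ranked_relevance)[:k]
    return sum(top) / len(top)

precision_at_k([1, 1, 0, 1, 0, 0], k=4)   # -> 0.75
```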
Retrieval Results
Results for two sample classes over VOC + distractor data (retrieve ~500 images from within 1M images – true positives are 0.05% of the dataset)
[Plot: precision vs. rank (0–100), curves for CNN 2048, CNN 128, CNN 128 PQ, FK 512 and CNN 128 rpbin]
• Class: Sheep – CNN 128 (Prec. 0.32 @ 100)
• Class: Motorbike – CNN 128 (Prec. 0.77 @ 100)
Retrieval Results

Method          mAP    Prec@100         Prec@100            Storage (1M images)
                       (VOC Training)   (Google Training)
CNN-M 2048      55.4   95.4             90.9                7.63 GB
CNN-M 128       51.0   95.1             92.3                488 MB
CNN-M 128 BIN   50.1   94.0             —                   122 MB
CNN-M 128 PQ    50.5   94.6             92.1                30.5 MB
FV              29.3   80.5             —                   312.8 GB
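The product quantization variant in the table stores each image as a handful of byte codes. A toy sketch of the encode/decode round trip (sizes illustrative; real codebooks would come from per-subspace k-means rather than random centroids):

```python
# Toy product quantization: split each D-dim descriptor into S sub-vectors
# and quantize each against its own K-entry codebook (1 byte per sub-code).
import numpy as np

D, S, K = 128, 16, 256            # descriptor dim, sub-quantizers, centroids
d = D // S                        # dimension of each sub-vector
rng = np.random.default_rng(0)

# Stand-in codebooks (random centroids instead of trained k-means output)
codebooks = rng.standard_normal((S, K, d))

def pq_encode(x):
    """Encode x as S one-byte centroid indices (S bytes total)."""
    codes = np.empty(S, dtype=np.uint8)
    for s in range(S):
        sub = x[s * d:(s + 1) * d]
        codes[s] = np.argmin(np.linalg.norm(codebooks[s] - sub, axis=1))
    return codes

def pq_decode(codes):
    """Reconstruct an approximate descriptor from its codes."""
    return np.concatenate([codebooks[s][c] for s, c in enumerate(codes)])

# A descriptor built exactly from centroid 5 of every codebook round-trips
x = np.concatenate([codebooks[s][5] for s in range(S)])
codes = pq_encode(x)
```

At 32 sub-quantizers this would be 32 bytes per image, i.e. roughly the 30.5 MB per 1M images quoted in the table.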
Freeform Queries

VOC vs Google Training
                       VOC Training       Google Training
‘Chair’ – CNN 128      Prec. 0.92 @ 100   Prec. 0.86 @ 100
‘Train’ – CNN 128      Prec. 1.0 @ 100    Prec. 1.0 @ 100
Instances & Faces too

Instances
[Diagram] Root SIFT extractor ψ(I) → xi → VQ encoder φ(xi) / Hamming encoder φ(xi); matched against the target dataset over the N training images (taking the max) → ranking → spatial verification

Faces
[Diagram] Face extractor ψ(I) → If → pre-trained face CNN → φ(If+) over the N training images, plus a negative pool φ(I−) → linear SVM → w → ranking of target dataset tracks φ(It)
Live Demo
Landing Page1User enters text query term and selects search modality (e.g. ‘forest’ using object category search)
Ranked Results3A ranked list of visually matching images is displayed within 1~30 secs of entering the cold query
Querying2A live view of images downloaded from Google Image search as they are used to construct a visual appearance model on-the-fly
Can try out the system live over a dataset of 5M+ images sourced from BBC News footage at: http://varro3.robots.ox.ac.uk:9090
Question:
How can we adapt the standard GPU ConvNet pipeline for on-the-fly search?
We want:
• simultaneous feature computation/model training
• highly parallel operation by using a GPU-bound architecture

ConvNet-based Architecture
• Libraries such as Caffe allow for fast computation of ConvNet features entirely on the GPU
ConvNet-based Architecture
[Diagram] Text query (e.g. ‘sheep’) → Google Image Search training images → RGB input → conv stack → fc stack → CNN features; together with precomputed CNN features from a fixed negative pool, these train an SVM → model w
ConvNet-based Architecture
[Diagram] A batch sampler (batch size B) draws B/2 positives (RGB images from an image buffer fed by Google Image Search training images, passed through the conv stack + fc stack to give CNN features) and B/2 negatives (precomputed CNN features from the fixed negative pool). Both feed an SVM loss layer. The CPU frontend handles image download and sampling; the conv/fc stacks and loss run on the GPU backend.

The SVM loss layer computes the hinge-loss subgradient over the batch:

∇ = −(1/B) Σᵢ₌₁..B I[yᵢ wᵀxᵢ < 1] yᵢ xᵢ
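A toy sketch of the training step this loss layer implies: one subgradient update of a linear SVM on precomputed features (the learning rate, regulariser and data here are illustrative, not the system's actual hyperparameters):

```python
# Mini-batch hinge-loss subgradient step for a linear SVM, mirroring the
# SVM loss layer above. lr and lam are illustrative hyperparameters.
import numpy as np

def svm_step(w, X, y, lr=0.5, lam=1e-4):
    """One update: X is (B, D) features, y has entries in {-1, +1}."""
    margins = y * (X @ w)                  # y_i * w^T x_i
    active = margins < 1                   # indicator I[y_i w^T x_i < 1]
    grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
    return w - lr * grad

# Toy linearly separable data: positives shifted away from negatives
rng = np.random.default_rng(0)
D, B = 16, 64
X = np.vstack([rng.standard_normal((B // 2, D)) + 2.0,
               rng.standard_normal((B // 2, D)) - 2.0])
y = np.array([1.0] * (B // 2) + [-1.0] * (B // 2))

w = np.zeros(D)
for _ in range(50):
    w = svm_step(w, X, y)

accuracy = np.mean(np.sign(X @ w) == y)    # approaches 1.0 on this toy data
```

Because each step only needs a matrix product and a masked sum over the batch, it maps naturally onto a GPU loss layer alongside the feature computation.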
ConvNet-based Architecture
[Diagram] As above, with an added inner product layer: every τ seconds, the precomputed CNN features of the target dataset (MIRFLICKR) are scored against the current model w, alongside the SVM loss layer updates.
Retrieval Results
• Images are fed into the network at a rate of 12 per second
• The dataset is ranked with the current model every ~0.2 seconds
• Most rankings stabilise in under 1 second
[Plot: Precision @ 100 vs. time in seconds (0–2), curves for 10, 20 and 30 training images; ranking snapshots at 0.15 s, 0.36 s, 0.54 s and 0.73 s for the classes sofa, sheep, bus and horse]
Continued Work
Currently working on the following extensions:
• How to select negative training images more intelligently (e.g. selection of the most discriminative negative images per query from a larger 1M+ pool of non-class images)
• How to establish a confidence measure for images in the output ranking, so we know when a query works well or not, and can source training images more intelligently
• Query attribute refinement (e.g. sporty + car)
Related Publications
• “On-the-fly Learning for Visual Search of Large-scale Image and Video Datasets”, IJMIR 2015. Ken Chatfield, Relja Arandjelovic, Omkar Parkhi, Andrew Zisserman
• “Efficient On-the-fly Category Retrieval using ConvNets and GPUs”, ACCV 2014. Ken Chatfield, Karen Simonyan, Andrew Zisserman
• “Return of the Devil in the Details: Delving Deep into Convolutional Nets”, BMVC 2014. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (Best Paper Prize)
http://www.robots.ox.ac.uk/~ken