TRANSCRIPT
16th Oct. 2015
Hansung Lee
SW R&D Center, Samsung Electronics
Multiple Object Class Detection &
Localization with Deep Learning (CNN)
Outline
1. Introduction – definition of object recognition problems
2. Feature Extraction vs. Feature Learning
3. Design Issues
4. Methods and Algorithms
References
I. Introduction – definition of object recognition problems
Object Recognition Problem [1]
Major Tasks
Object Instance Recognition
• Identifying previously seen object instances
• A matching problem: measuring the differences between the stored exemplars and the objects to be re-identified in an input image
• Needs some alignment process
Object Class Recognition
• Known as category-level or generic object recognition
• Focuses on recognizing previously unseen instances of some predefined categories
• Challenging problems:
1) The inter-category visual differences may sometimes be very small
2) Large intra-category appearance variations, caused by different object colors, textures, and shapes as well as varying imaging conditions
3) An object in a real-world scene often occupies just a small portion of the scene and is occluded by others or accompanied by similar-looking background structures
Object Class Detection
• Determine whether or not any instances of the categories of interest are present in an input image
• Locate instances of the categories of interest accurately in the image to separate them from the background
X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Object Class Detection (1/2) [1]
Different facets related to object class detection
From X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Object Class Detection (2/2) [1]
The bridging role of categorical appearance models in object class detection
From X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Description of Relevant Visual Cues [1]
Pixel-Level Feature Description
• Gray-level pixel intensity
• Color histogram
Patch-Level Feature Description
• Describes a support region or the neighborhood of a point
• Local feature descriptors: pixel intensities, colors, textures, edges, etc.
• SIFT and its variants; filter bank responses
Region-Level Feature Description
• Bag of Features, HoG and its variants, GIST feature, shape feature, self-similarity feature
• Captures the discriminating visual properties of the target categories or their components
• Keeps sufficient robustness against possible intra-class variations
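The pixel-level cues above can be made concrete with a small sketch. Below is a minimal joint RGB color histogram in NumPy; the bin count and the random test image are illustrative choices, not from the slides:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Pixel-level description: a joint RGB color histogram.

    image: H x W x 3 uint8 array; returns a normalized vector of
    length bins**3.
    """
    # Quantize each channel into `bins` levels.
    quantized = (image.astype(np.int64) * bins) // 256
    # Combine the three channel indices into one joint bin index.
    idx = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
h = color_histogram(img)  # 512-dimensional normalized descriptor
```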
II. Feature Extraction vs. Feature Learning
Filter Banks – Extracting Features from Images (1/2) [2]
Gabor Filter Bank (real part and magnitude part)
Characteristics
• Pros: 1) similar to the human visual system; 2) appropriate for texture representation; 3) the visual cortex of mammalian brains can be modeled by Gabor functions
• Cons: 1) difficult to analyze the results
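A minimal NumPy sketch of the real part of a 2-D Gabor filter, using the standard parameterization (sigma, theta, lambda, psi, gamma); the specific values below are illustrative, not taken from the slides:

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0, gamma=0.5):
    """Real part of a 2-D Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate coordinates to the filter orientation theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

# A small bank: 4 orientations.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```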
Filter Banks – Extracting Features from Images (2/2) [3–5]
The Leung-Malik (LM) Filter Bank
• Multi-scale, multi-orientation filter bank with 48 filters
• A mixture of edge, bar, and spot filters
• First- and second-derivative-of-Gaussian filters at 6 orientations and 3 scales, 8 Laplacian-of-Gaussian filters, and 4 Gaussian filters
The Schmid (S) Filter Bank
• 13 rotationally invariant, isotropic, "Gabor-like" filters
The Maximum Response (MR) Filter Bank
• 2 anisotropic filters (an edge and a bar filter, at 6 orientations and 3 scales)
• 2 rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian)
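The edge and bar filters in these banks are first- and second-order Gaussian derivatives at a set of orientations. A rough NumPy sketch of such a building block; the filter size, scales, and the 3:1 elongation are illustrative assumptions, not the exact LM parameters:

```python
import numpy as np

def gaussian_derivative(size, sigma_x, sigma_y, theta, order):
    """First- or second-order Gaussian derivative filter at orientation theta,
    the building block of edge (order 1) and bar (order 2) filters."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-x_t ** 2 / (2 * sigma_x ** 2) - y_t ** 2 / (2 * sigma_y ** 2))
    if order == 1:                     # edge filter: d/dx of the Gaussian
        f = -x_t / sigma_x ** 2 * g
    else:                              # bar filter: d2/dx2 of the Gaussian
        f = (x_t ** 2 / sigma_x ** 4 - 1 / sigma_x ** 2) * g
    return f - f.mean()                # zero-mean, as filter banks usually are

# 6 orientations x 3 scales x {edge, bar} = 36 anisotropic filters.
orientations = np.linspace(0, np.pi, 6, endpoint=False)
scales = [1.0, np.sqrt(2), 2.0]
bank = [gaussian_derivative(25, s, 3 * s, t, o)
        for o in (1, 2) for s in scales for t in orientations]
```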
Filters Learnt by Convolutional Neural Network
1st Convolutional Layer
Feature Descriptor – Bag of Features [6–10]
Bag of Words built over local features:
• SIFT: Scale-Invariant Feature Transform
• SURF: Speeded-Up Robust Features
• FAST: Features from Accelerated Segment Test
• BRIEF: Binary Robust Independent Elementary Features
From http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
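The bag-of-features idea can be sketched end to end: cluster a set of local descriptors into a visual vocabulary, then describe each image as a histogram of its descriptors over that vocabulary. The tiny k-means below and the random stand-in descriptors are illustrative assumptions, not the pipeline of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(features, k, iters=20):
    """Tiny k-means to build the visual vocabulary from local descriptors."""
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # guard against empty clusters
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def bof_histogram(descriptors, vocabulary):
    """Quantize an image's descriptors against the vocabulary and histogram them."""
    d = np.linalg.norm(descriptors[:, None] - vocabulary[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / hist.sum()

# Stand-ins for 128-dimensional SIFT/SURF descriptors from a training set.
train = rng.normal(size=(500, 128))
vocab = kmeans(train, k=32)
h = bof_histogram(rng.normal(size=(60, 128)), vocab)
```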
Visual Vocabulary vs. Activation Feature Map of CNN
Convolutional Neural Network
From https://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
From H. Han, Deep Learning for Image Understanding - Applying in the Real World, BigComp 2015.
Visualization of Feature Characteristics (1/4)
Gabor feature vs. normalized RGB feature (similarity matrix visualization)
Visualization of Feature Characteristics (2/4)
BoF – FAST vs. BoF – SIFT
Visualization of Feature Characteristics (3/4)
BoF – SURF
Visualization of Feature Characteristics (4/4)
Feature extraction from CNN (7 L) vs. feature extraction from CNN (5 P); learned kernels
III. Design Issues
Categories of Object Detection
Object detection divides into class-specific vs. generic object detection, and single vs. multiple object detection.
Class-specific Object Detection
• Object detectors are specialized for one object class
• Examples: face detection (Haar features + AdaBoost), human body detection (HoG features + SVM)
Generic Object Detection
• Generally, a saliency-based approach: objectness score, saliency measure
• Examples: BING, EdgeBoxes, etc.
• Objects are standalone things with a well-defined boundary and center, as opposed to amorphous background stuff.
Multiple Object Recognition & Localization [11]
Basic Design of MORL
• Issues:
1) Even a highly accurate classifier will produce false positives when faced with so many proposals.
2) Small sections of background can resemble actual objects, causing detection errors.
J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
GODL Approaches – Objectness [12, 13]
Visual Cues (with learned cue parameters)
• Multi-scale saliency: a unique/salient appearance
• Color contrast: a different appearance
• Edge density: a closed boundary
• Superpixel straddling: a closed boundary
Bayesian Cue Integration
Characteristics
• Uses three characteristics: a different appearance, a unique/salient appearance, and a closed boundary
• Pros:
- High recall with a small number of proposals
- Easy to control the number of proposals
• Cons:
- Slower than BING and EdgeBoxes
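The Bayesian cue integration step can be illustrated with a naive-Bayes-style combination of per-window cue scores. This is a simplified sketch, not the learned likelihood tables of Alexe et al.; treating a cue score directly as p(cue | object) is an assumption made here purely for illustration:

```python
def integrate_cues(cue_scores, prior=0.5):
    """Naive-Bayes combination of per-window cue scores into a posterior
    probability that the window contains an object.

    cue_scores: dict cue_name -> score in (0, 1) for one window.
    Illustrative assumption: the score itself is p(cue | object) and
    1 - score is p(cue | background); cues are treated as independent.
    """
    p_obj, p_bg = prior, 1.0 - prior
    for s in cue_scores.values():
        p_obj *= s
        p_bg *= (1.0 - s)
    return p_obj / (p_obj + p_bg)

# One candidate window scored by the four cues from the slide.
window = {"multiscale_saliency": 0.8, "color_contrast": 0.7,
          "edge_density": 0.6, "superpixel_straddling": 0.9}
posterior = integrate_cues(window)
```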
Pipeline for Object Detection
Finding bounding boxes with objectness measurement & heuristics:
• Generating the proposal windows
• Matching predefined features of objects
• Rejecting the invalid bounding boxes
Detecting & localizing the objects with a classifier:
• Classifying each bounding box
• Pruning the invalid bounding boxes
• Object localization
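The pruning step is commonly realized with greedy non-maximum suppression: keep the highest-scoring box, drop boxes that overlap it too much, and repeat. A minimal NumPy version; the IoU threshold of 0.5 is a conventional choice, not one specified in the slides:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and many; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the two overlapping boxes collapse to one
```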
Design Issues – Training Dataset vs. Testing Dataset
• Training data: image instances, resized & cropped
• Testing data: image instance(s) with bounding boxes, cropped & resized
• The two representations must match.
Design Issues – Low Confidence Values
(Example detections: low confidence vs. high confidence)
Design Issues – Low Confidence Value with High Objectness
Contextual Information [14]
• The appearance of the object alone is not enough to tell us the object’s identity.
• The scene adds contextual information about the object’s identity, so we can identify the object as a kettle.
Possibly from C. Galleguillos et al., “Context based Object Categorization: A Critical Survey,” Computer Vision and Image Understanding (CVIU), vol. 114, pp. 712-722, 2010.
IV. Methods and Algorithms
HCP – Hypotheses-CNN-Pooling (1/3) [15]
HCP Framework
From Y. Wei et al., “CNN: Single-label to Multi-label,” arXiv:1406.5726v3 [cs.CV] 9 Jul 2014.
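The “P” in HCP is cross-hypothesis max pooling: the shared CNN scores each hypothesis (object proposal), and the per-hypothesis class scores are fused into one image-level multi-label prediction by taking the maximum per class. A minimal sketch with illustrative numbers:

```python
import numpy as np

def hcp_fusion(hypothesis_scores):
    """Cross-hypothesis max pooling, as in the HCP framework.

    hypothesis_scores: (num_hypotheses, num_classes) array of class scores,
    one row per object proposal fed through the shared CNN.
    Returns the image-level multi-label score vector.
    """
    return hypothesis_scores.max(axis=0)

# 3 hypotheses, 4 classes (illustrative numbers).
scores = np.array([[0.1, 0.9, 0.2, 0.0],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.3, 0.8, 0.1]])
image_level = hcp_fusion(scores)
```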
HCP – Hypotheses-CNN-Pooling (2/3) [15]
Initialization of HCP
HCP – Hypotheses-CNN-Pooling (3/3) [15]
Samples of Predicted Scores
R-CNN – Regions with CNN (1/3) [16, 17]
Object Detection System Overview
• Takes an input image
• Extracts around 2000 bottom-up region proposals
• Computes features for each proposal using a large convolutional neural
network (CNN)
• Classifies each region using class-specific linear SVMs
From R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation” In CVPR, 2014.
From R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation Supplementary material” In CVPR, 2014.
R-CNN – Regions with CNN (2/3) [16, 17]
Object Proposal Transformations
Bounding Box Regression
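Bounding-box regression refines a proposal with four offsets (dx, dy, dw, dh): translations scaled by the proposal's width and height, plus log-space scalings of width and height. A sketch of applying such offsets; the [x1, y1, x2, y2] box layout and the function name are my assumptions, not R-CNN's code:

```python
import numpy as np

def apply_bbox_regression(proposal, deltas):
    """Apply R-CNN-style bounding-box regression offsets (dx, dy, dw, dh)
    to a proposal given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Offsets are scale-invariant in x/y and log-space in w/h.
    cx_new, cy_new = cx + dx * w, cy + dy * h
    w_new, h_new = w * np.exp(dw), h * np.exp(dh)
    return np.array([cx_new - 0.5 * w_new, cy_new - 0.5 * h_new,
                     cx_new + 0.5 * w_new, cy_new + 0.5 * h_new])

# Shift the proposal right by 10% of its width and widen it 1.5x.
refined = apply_bbox_regression(np.array([10., 10., 30., 40.]),
                                np.array([0.1, 0.0, np.log(1.5), 0.0]))
```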
R-CNN – Regions with CNN (3/3) [16, 17]
Experimental Results
Fast R-CNN [18]
Contributions of Fast R-CNN
• Higher detection quality (mAP) than R-CNN, SPPnet
• Training is single-stage, using a multi-task loss
• Training can update all network layers
• No disk storage is required for feature caching
From R. Girshick et al., “Fast R-CNN,” arXiv:1504.08083v2 [cs.CV] 27 Sep 2015.
Faster R-CNN: Region Proposal Network [19]
Object Detection System Overview
• Takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score
• Slides a small network over the conv feature map output by the last shared conv layer
• Each sliding window is mapped to a lower-dimensional vector
• This vector is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls)
From S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497v1 [cs.CV] 4 Jun 2015.
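The RPN predicts proposals relative to k reference anchors at each sliding-window position. A sketch of anchor generation in the spirit of the paper; the base size of 16 with 3 scales × 3 aspect ratios follows common implementations, and the helper names are mine:

```python
import numpy as np

def generate_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) reference anchors centered
    on one sliding-window position, as [x1, y1, x2, y2] around the origin."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base * scale) ** 2     # anchor area is fixed per scale
            w = np.sqrt(area / ratio)      # ratio = h / w
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Replicate the reference anchors over every feature-map position."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (anchors[None] + shifts).reshape(-1, 4)

anchors = generate_anchors()            # 9 anchors per position
grid = shift_anchors(anchors, 4, 4)     # 4x4 feature map -> 144 anchors
```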
YOLO - You Only Look Once (1/4) [11]
From J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
YOLO, a unified pipeline for object detection
• Defines object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
• A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
(1) Resizes the input image to 448 × 448.
(2) Runs a single convolutional network on the image.
(3) Thresholds the resulting detections by the model’s confidence.
YOLO - You Only Look Once (2/4) [11]
How It Works
• Divides the image into regions.
• Predicts bounding boxes and probabilities for each region.
• Bounding boxes are weighted by the predicted probabilities.
• Thresholds the detections by some value to keep only high-scoring detections.
From http://pjreddie.com/darknet/yolo/
YOLO - You Only Look Once (3/4) [11]
Unified Detection Model
A regression problem to a 7 × 7 × 30 tensor, which encodes bounding boxes and class probabilities for all objects in the image.
24 convolutional layers + 2 fully connected layers.
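Decoding such an output tensor can be sketched as follows, assuming the PASCAL VOC setting S = 7, B = 2, C = 20 (depth 30) and a per-cell layout of B boxes of (x, y, w, h, confidence) followed by C class probabilities. This is an illustrative decoder, not the authors' code:

```python
import numpy as np

def decode_yolo(output, S=7, B=2, C=20, conf_thresh=0.2, img_size=448):
    """Decode a YOLO-style S x S x (B*5 + C) output tensor into boxes.

    Assumed per-cell layout: B boxes of (x, y, w, h, confidence), then C
    class probabilities; x, y are offsets within the cell and w, h are
    fractions of the image size.
    """
    detections = []
    cell = img_size / S
    for i in range(S):
        for j in range(S):
            vec = output[i, j]
            class_probs = vec[B * 5:]
            for b in range(B):
                x, y, w, h, conf = vec[b * 5: b * 5 + 5]
                score = conf * class_probs.max()
                if score < conf_thresh:
                    continue
                cx, cy = (j + x) * cell, (i + y) * cell
                bw, bh = w * img_size, h * img_size
                detections.append((cx - bw / 2, cy - bh / 2,
                                   cx + bw / 2, cy + bh / 2,
                                   score, int(class_probs.argmax())))
    return detections

out = np.zeros((7, 7, 30))                   # S=7, B=2, C=20 -> depth 30
out[3, 3, :5] = [0.5, 0.5, 0.2, 0.2, 0.9]    # one confident box in cell (3, 3)
out[3, 3, 15] = 1.0                          # class 5 gets probability 1
dets = decode_yolo(out)
```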
YOLO - You Only Look Once (4/4) [11]
Experimental Results
References
[1] X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
[2] M. Haghighat et al., “Identification Using Encrypted Biometrics,” Computer Analysis of Images and Patterns, pp. 440-448,
2013.
[3] T. Leung et al., “Representing and Recognizing the Visual Appearance of Materials using Three-dimensional textons,”
Int. Journal of Computer Vision, vol. 43, no. 1, pp. 29-44, June 2001.
[4] C. Schmid et al., “Constructing Models for Content-based Image Retrieval,” CVPR, vol. 2, pp. 39-45, 2001.
[5] J. Geusebroek et al., “Fast Anisotropic Gauss Filtering,” IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 938-943,
2003.
[6] G. Csurka et al., “Visual Categorization with Bags of Keypoints,” ECCV, vol. 1, 2004.
[7] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int. Journal of Computer Vision, vol. 60, no. 2,
pp. 91-110, 2004.
[8] H. Bay et al., “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3,
pp. 346-359, 2008.
[9] E. Rosten et al., “Faster and Better: A Machine Learning Approach to Corner Detection,” IEEE TPAMI, vol. 32, no. 1,
pp. 105-119, 2009.
[10] M. Calonder et al., “BRIEF: Computing a Local Binary Descriptor Very Fast,” IEEE TPAMI, vol. 34, no. 7, pp. 1281-1298, 2012.
[11] J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
[12] B. Alexe, T. Deselaers, and V. Ferrari, “What is an Object?,” CVPR, 2010.
[13] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the Objectness of Image Windows,” IEEE TPAMI, vol. 34, no. 11,
pp. 2189-2202, 2012.
[14] C. Galleguillos et al., “Context based Object Categorization: A Critical Survey,” Computer Vision and Image Understanding
(CVIU), vol. 114, pp. 712-722, 2010.
[15] Y. Wei et al., “CNN: Single-label to Multi-label,” arXiv:1406.5726v3 [cs.CV] 9 Jul 2014.
[16] R. Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation” In CVPR, 2014.
[17] R. Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation Supplementary
Material” In CVPR, 2014.
[18] R. Girshick et al., “Fast R-CNN,” arXiv:1504.08083v2 [cs.CV] 27 Sep 2015.
[19] S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497v1
[cs.CV] 4 Jun 2015.
[20] F. Anselmi et al., “Deep Convolutional Networks are Hierarchical Kernel Machines,” CBMM Memo, NSF, no. 35, 2015.