TRANSCRIPT
16th Oct. 2015
Hansung Lee
SW R&D Center, Samsung Electronics
Multiple Object Class Detection &
Localization with Deep Learning (CNN)
Outline
1. Introduction – definition of object recognition problems
2. Feature Extraction vs. Feature Learning
3. Design Issues
4. Methods and Algorithms
References
I. Introduction – definition of object recognition problems
Object Recognition Problem [1]
Major Tasks
Object Instance Recognition
• Identifying previously seen object instances
• A matching problem: measuring the differences between the stored exemplars and the objects to be re-identified in an input image
• Needs some alignment process
Object Class Recognition
• Known as category-level or generic object recognition
• Focuses on recognizing previously unseen instances of some predefined categories
• Challenging problems:
1) The inter-category visual differences may sometimes be very small
2) Large intra-category appearance variations, caused by different object colors, textures, and shapes as well as varying imaging conditions
3) An object in a real-world scene often occupies just a small portion of the scene and is occluded by others or accompanied by similar-looking background structures
Object Class Detection
• Determine whether or not any instances of the categories of interest are present in an input image
• Locate instances of the categories of interest accurately in the image to separate them from the background
X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Object Class Detection (1/2) [1]
Different facets related to object class detection
From X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Object Class Detection (2/2) [1]
The bridging role of categorical appearance models in object class detection
From X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
Description of Relevant Visual Cues [1]
Pixel-Level Feature Description
• Gray-level pixel intensity
• Color histogram
Patch-Level Feature Description
• Describes a support region or the neighborhood of a point
• Local feature descriptors: pixel intensities, colors, textures, edges, etc.
• SIFT and its variants; filter bank responses
Region-Level Feature Description
• Bag of Features, HoG and its variants, GIST feature, shape feature, self-similarity feature
• Captures the discriminating visual properties of the target categories or their components
• Keeps sufficient robustness against possible intra-class variations
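The pixel-level cues above can be made concrete with a small sketch. Below is a minimal joint RGB color histogram in NumPy; the bin count and the random test image are illustrative choices, not from the slides:

```python
import numpy as np

def color_histogram(image, bins=8):
    """Pixel-level description: a joint RGB color histogram.

    image: H x W x 3 uint8 array; returns a normalized vector of
    length bins**3.
    """
    # Quantize each channel into `bins` levels.
    quantized = (image.astype(np.int64) * bins) // 256
    # Combine the three channel indices into one joint bin index.
    idx = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
h = color_histogram(img)  # 512-dimensional normalized descriptor
```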
II. Feature Extraction vs. Feature Learning
Filter Banks – Extracting Features from Images (1/2) [2]
Gabor Filter Bank (real part and magnitude part)
Characteristics
• Pros: 1) similar to the human visual system; 2) appropriate for texture representation; 3) the visual cortex of mammalian brains can be modeled by Gabor functions
• Cons: 1) difficult to analyze the results
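A minimal NumPy sketch of the real part of a 2-D Gabor filter, using the standard parameterization (sigma, theta, lambda, psi, gamma); the specific values below are illustrative, not taken from the slides:

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0, gamma=0.5):
    """Real part of a 2-D Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    # Rotate coordinates to the filter orientation theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

# A small bank: 4 orientations.
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```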
Filter Banks – Extracting Features from Images (2/2) [3–5]
The Leung-Malik (LM) Filter Bank
• Multi-scale, multi-orientation filter bank with 48 filters
• A mixture of edge, bar, and spot filters
• First- and second-derivative-of-Gaussian filters at 6 orientations and 3 scales, 8 Laplacian-of-Gaussian filters, and 4 Gaussian filters
The Schmid (S) Filter Bank
• 13 rotationally invariant, isotropic, "Gabor-like" filters
The Maximum Response (MR) Filter Bank
• 2 anisotropic filters (an edge and a bar filter, at 6 orientations and 3 scales)
• 2 rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian)
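The edge and bar filters in these banks are first- and second-order Gaussian derivatives at a set of orientations. A rough NumPy sketch of such a building block; the filter size, scales, and the 3:1 elongation are illustrative assumptions, not the exact LM parameters:

```python
import numpy as np

def gaussian_derivative(size, sigma_x, sigma_y, theta, order):
    """First- or second-order Gaussian derivative filter at orientation theta,
    the building block of edge (order 1) and bar (order 2) filters."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-x_t ** 2 / (2 * sigma_x ** 2) - y_t ** 2 / (2 * sigma_y ** 2))
    if order == 1:                     # edge filter: d/dx of the Gaussian
        f = -x_t / sigma_x ** 2 * g
    else:                              # bar filter: d2/dx2 of the Gaussian
        f = (x_t ** 2 / sigma_x ** 4 - 1 / sigma_x ** 2) * g
    return f - f.mean()                # zero-mean, as filter banks usually are

# 6 orientations x 3 scales x {edge, bar} = 36 anisotropic filters.
orientations = np.linspace(0, np.pi, 6, endpoint=False)
scales = [1.0, np.sqrt(2), 2.0]
bank = [gaussian_derivative(25, s, 3 * s, t, o)
        for o in (1, 2) for s in scales for t in orientations]
```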
Filters Learnt by Convolutional Neural Network
1st Convolutional Layer
Feature Descriptor – Bag of Features [6–10]
Bag of Words built over local features:
• SIFT: Scale-Invariant Feature Transform
• SURF: Speeded-Up Robust Features
• FAST: Features from Accelerated Segment Test
• BRIEF: Binary Robust Independent Elementary Features
From http://www.codeproject.com/Articles/619039/Bag-of-Features-Descriptor-on-SIFT-Features-with-O
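The bag-of-features idea can be sketched end to end: cluster a set of local descriptors into a visual vocabulary, then describe each image as a histogram of its descriptors over that vocabulary. The tiny k-means below and the random stand-in descriptors are illustrative assumptions, not the pipeline of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(features, k, iters=20):
    """Tiny k-means to build the visual vocabulary from local descriptors."""
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # guard against empty clusters
                centers[j] = features[labels == j].mean(axis=0)
    return centers

def bof_histogram(descriptors, vocabulary):
    """Quantize an image's descriptors against the vocabulary and histogram them."""
    d = np.linalg.norm(descriptors[:, None] - vocabulary[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / hist.sum()

# Stand-ins for 128-dimensional SIFT/SURF descriptors from a training set.
train = rng.normal(size=(500, 128))
vocab = kmeans(train, k=32)
h = bof_histogram(rng.normal(size=(60, 128)), vocab)
```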
Visual Vocabulary vs. Activation Feature Map of CNN
Convolutional Neural Network
From https://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
From H. Han, Deep Learning for Image Understanding - Applying in the Real World, BigComp 2015.
Visualization of Feature Characteristics (1/4)
Gabor feature vs. normalized RGB feature (similarity matrix visualization)
Visualization of Feature Characteristics (2/4)
BoF – FAST vs. BoF – SIFT
Visualization of Feature Characteristics (3/4)
BoF – SURF
Visualization of Feature Characteristics (4/4)
Feature extraction from CNN (7 L) vs. feature extraction from CNN (5 P); learned kernels
III. Design Issues
Categories of Object Detection
Object detection divides into class-specific vs. generic object detection, and single vs. multiple object detection.
Class-specific Object Detection
• Object detectors are specialized for one object class
• Examples: face detection (Haar features + AdaBoost), human body detection (HoG features + SVM)
Generic Object Detection
• Generally, a saliency-based approach: objectness score, saliency measure
• Examples: BING, EdgeBoxes, etc.
• Objects are standalone things with a well-defined boundary and center, as opposed to amorphous background stuff.
Multiple Object Recognition & Localization [11]
Basic Design of MORL
• Issues:
1) Even a highly accurate classifier will produce false positives when faced with so many proposals.
2) Small sections of background can resemble actual objects, causing detection errors.
J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
GODL Approaches – Objectness [12, 13]
Visual Cues (with learned cue parameters)
• Multi-scale saliency: a unique/salient appearance
• Color contrast: a different appearance
• Edge density: a closed boundary
• Superpixel straddling: a closed boundary
Bayesian Cue Integration
Characteristics
• Uses three characteristics: a different appearance, a unique/salient appearance, and a closed boundary
• Pros:
- High recall with a small number of proposals
- Easy to control the number of proposals
• Cons:
- Slower than BING and EdgeBoxes
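The Bayesian cue integration step can be illustrated with a naive-Bayes-style combination of per-window cue scores. This is a simplified sketch, not the learned likelihood tables of Alexe et al.; treating a cue score directly as p(cue | object) is an assumption made here purely for illustration:

```python
def integrate_cues(cue_scores, prior=0.5):
    """Naive-Bayes combination of per-window cue scores into a posterior
    probability that the window contains an object.

    cue_scores: dict cue_name -> score in (0, 1) for one window.
    Illustrative assumption: the score itself is p(cue | object) and
    1 - score is p(cue | background); cues are treated as independent.
    """
    p_obj, p_bg = prior, 1.0 - prior
    for s in cue_scores.values():
        p_obj *= s
        p_bg *= (1.0 - s)
    return p_obj / (p_obj + p_bg)

# One candidate window scored by the four cues from the slide.
window = {"multiscale_saliency": 0.8, "color_contrast": 0.7,
          "edge_density": 0.6, "superpixel_straddling": 0.9}
posterior = integrate_cues(window)
```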
Pipeline for Object Detection
Finding bounding boxes with objectness measurement & heuristics:
• Generating the proposal windows
• Matching predefined features of objects
• Rejecting the invalid bounding boxes
Detecting & localizing the objects with a classifier:
• Classifying each bounding box
• Pruning the invalid bounding boxes
• Object localization
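The pruning step is commonly realized with greedy non-maximum suppression: keep the highest-scoring box, drop boxes that overlap it too much, and repeat. A minimal NumPy version; the IoU threshold of 0.5 is a conventional choice, not one specified in the slides:

```python
import numpy as np

def iou(box, boxes):
    """Intersection over union between one box and many; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the two overlapping boxes collapse to one
```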
Design Issues – Training Dataset vs. Testing Dataset
• Training data: image instances, resized & cropped
• Testing data: image instance(s) with bounding boxes, cropped & resized
• The two representations must match.
Design Issues – Low Confidence Values
(Example detections: low confidence vs. high confidence)
Design Issues – Low Confidence Value with High Objectness
Contextual Information [14]
• The appearance of the object alone is not enough to tell us the object’s identity.
• The scene adds contextual information about the object’s identity, so we can identify the object as a kettle.
Possibly from C. Galleguillos et al., “Context based Object Categorization: A Critical Survey,” Computer Vision and Image Understanding (CVIU), vol. 114, pp. 712-722, 2010.
IV. Methods and Algorithms
HCP – Hypotheses-CNN-Pooling (1/3) [15]
HCP Framework
From Y. Wei et al., “CNN: Single-label to Multi-label,” arXiv:1406.5726v3 [cs.CV] 9 Jul 2014.
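The “P” in HCP is cross-hypothesis max pooling: the shared CNN scores each hypothesis (object proposal), and the per-hypothesis class scores are fused into one image-level multi-label prediction by taking the maximum per class. A minimal sketch with illustrative numbers:

```python
import numpy as np

def hcp_fusion(hypothesis_scores):
    """Cross-hypothesis max pooling, as in the HCP framework.

    hypothesis_scores: (num_hypotheses, num_classes) array of class scores,
    one row per object proposal fed through the shared CNN.
    Returns the image-level multi-label score vector.
    """
    return hypothesis_scores.max(axis=0)

# 3 hypotheses, 4 classes (illustrative numbers).
scores = np.array([[0.1, 0.9, 0.2, 0.0],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.3, 0.8, 0.1]])
image_level = hcp_fusion(scores)
```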
HCP – Hypotheses-CNN-Pooling (2/3) [15]
Initialization of HCP
HCP – Hypotheses-CNN-Pooling (3/3) [15]
Samples of Predicted Scores
R-CNN – Regions with CNN (1/3) [16, 17]
Object Detection System Overview
• Takes an input image
• Extracts around 2000 bottom-up region proposals
• Computes features for each proposal using a large convolutional neural
network (CNN)
• Classifies each region using class-specific linear SVMs
From R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation” In CVPR, 2014.
From R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation Supplementary material” In CVPR, 2014.
R-CNN – Regions with CNN (2/3) [16, 17]
Object Proposal Transformations
Bounding Box Regression
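Bounding-box regression refines a proposal with four offsets (dx, dy, dw, dh): translations scaled by the proposal's width and height, plus log-space scalings of width and height. A sketch of applying such offsets; the [x1, y1, x2, y2] box layout and the function name are my assumptions, not R-CNN's code:

```python
import numpy as np

def apply_bbox_regression(proposal, deltas):
    """Apply R-CNN-style bounding-box regression offsets (dx, dy, dw, dh)
    to a proposal given as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Offsets are scale-invariant in x/y and log-space in w/h.
    cx_new, cy_new = cx + dx * w, cy + dy * h
    w_new, h_new = w * np.exp(dw), h * np.exp(dh)
    return np.array([cx_new - 0.5 * w_new, cy_new - 0.5 * h_new,
                     cx_new + 0.5 * w_new, cy_new + 0.5 * h_new])

# Shift the proposal right by 10% of its width and widen it 1.5x.
refined = apply_bbox_regression(np.array([10., 10., 30., 40.]),
                                np.array([0.1, 0.0, np.log(1.5), 0.0]))
```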
R-CNN – Regions with CNN (3/3) [16, 17]
Experimental Results
Fast R-CNN [18]
Contributions of Fast R-CNN
• Higher detection quality (mAP) than R-CNN, SPPnet
• Training is single-stage, using a multi-task loss
• Training can update all network layers
• No disk storage is required for feature caching
From R. Girshick et al., “Fast R-CNN,” arXiv:1504.08083v2 [cs.CV] 27 Sep 2015.
Faster R-CNN: Region Proposal Network [19]
Object Detection System Overview
• Takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score
• Slides a small network over the conv feature map output by the last shared conv layer
• Each sliding window is mapped to a lower-dimensional vector
• This vector is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls)
From S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497v1 [cs.CV] 4 Jun 2015.
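The RPN predicts proposals relative to k reference anchors at each sliding-window position. A sketch of anchor generation in the spirit of the paper; the base size of 16 with 3 scales × 3 aspect ratios follows common implementations, and the helper names are mine:

```python
import numpy as np

def generate_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = len(scales) * len(ratios) reference anchors centered
    on one sliding-window position, as [x1, y1, x2, y2] around the origin."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base * scale) ** 2     # anchor area is fixed per scale
            w = np.sqrt(area / ratio)      # ratio = h / w
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Replicate the reference anchors over every feature-map position."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (anchors[None] + shifts).reshape(-1, 4)

anchors = generate_anchors()            # 9 anchors per position
grid = shift_anchors(anchors, 4, 4)     # 4x4 feature map -> 144 anchors
```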
YOLO - You Only Look Once (1/4) [11]
From J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
YOLO, a unified pipeline for object detection
• Defines object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
• A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
(1) Resizes the input image to 448 × 448.
(2) Runs a single convolutional network on the image.
(3) Thresholds the resulting detections by the model’s confidence.
YOLO - You Only Look Once (2/4) [11]
How It Works
• Divides the image into regions.
• Predicts bounding boxes and probabilities for each region.
• Bounding boxes are weighted by the predicted probabilities.
• Thresholds the detections by some value to keep only high-scoring detections.
From http://pjreddie.com/darknet/yolo/
YOLO - You Only Look Once (3/4) [11]
Unified Detection Model
A regression problem to a 7 × 7 × 30 tensor, which encodes bounding boxes and class probabilities for all objects in the image.
24 convolutional layers + 2 fully connected layers.
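Decoding such an output tensor can be sketched as follows, assuming the PASCAL VOC setting S = 7, B = 2, C = 20 (depth 30) and a per-cell layout of B boxes of (x, y, w, h, confidence) followed by C class probabilities. This is an illustrative decoder, not the authors' code:

```python
import numpy as np

def decode_yolo(output, S=7, B=2, C=20, conf_thresh=0.2, img_size=448):
    """Decode a YOLO-style S x S x (B*5 + C) output tensor into boxes.

    Assumed per-cell layout: B boxes of (x, y, w, h, confidence), then C
    class probabilities; x, y are offsets within the cell and w, h are
    fractions of the image size.
    """
    detections = []
    cell = img_size / S
    for i in range(S):
        for j in range(S):
            vec = output[i, j]
            class_probs = vec[B * 5:]
            for b in range(B):
                x, y, w, h, conf = vec[b * 5: b * 5 + 5]
                score = conf * class_probs.max()
                if score < conf_thresh:
                    continue
                cx, cy = (j + x) * cell, (i + y) * cell
                bw, bh = w * img_size, h * img_size
                detections.append((cx - bw / 2, cy - bh / 2,
                                   cx + bw / 2, cy + bh / 2,
                                   score, int(class_probs.argmax())))
    return detections

out = np.zeros((7, 7, 30))                   # S=7, B=2, C=20 -> depth 30
out[3, 3, :5] = [0.5, 0.5, 0.2, 0.2, 0.9]    # one confident box in cell (3, 3)
out[3, 3, 15] = 1.0                          # class 5 gets probability 1
dets = decode_yolo(out)
```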
YOLO - You Only Look Once (4/4) [11]
Experimental Results
References
[1] X. Zhang et al., “Object Class Detection: A Survey,” J. ACM Computing Survey, vol. 46, no. 1, pp. 10:1-10:53, 2013.
[2] M. Haghighat et al., “Identification Using Encrypted Biometrics,” Computer Analysis of Images and Patterns, pp. 440-448,
2013.
[3] T. Leung et al., “Representing and Recognizing the Visual Appearance of Materials using Three-dimensional textons,”
Int. Journal of Computer Vision, vol. 43, no. 1, pp. 29-44, June 2001.
[4] C. Schmid et al., “Constructing Models for Content-based Image Retrieval,” CVPR, vol. 2, pp. 39-45, 2001.
[5] J. Geusebroek et al., “Fast Anisotropic Gauss Filtering,” IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 938-943,
2003.
[6] G. Csurka et al., “Visual Categorization with Bags of Keypoints,” ECCV, vol. 1, 2004.
[7] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int. Journal of Computer Vision, vol. 60, no. 2,
pp. 91-110, 2004.
[8] H. Bay et al., “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3,
pp. 346-359, 2008.
[9] E. Rosten et al., “Faster and Better: A Machine Learning Approach to Corner Detection,” IEEE TPAMI, vol. 32, no. 1,
pp. 105-119, 2009.
[10] M. Calonder et al., “BRIEF: Computing a Local Binary Descriptor Very Fast,” IEEE TPAMI, vol. 34, no. 7, pp. 1281-1298, 2012.
[11] J. Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection,” arXiv:1506.02640v3 [cs.CV], Jun. 2015.
[12] B. Alexe, T. Deselaers, and V. Ferrari, “What is an Object?,” CVPR, 2010.
[13] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the Objectness of Image Windows,” IEEE TPAMI, vol. 34, no. 11,
pp. 2189-2202, 2012.
[14] C. Galleguillos et al., “Context based Object Categorization: A Critical Survey,” Computer Vision and Image Understanding
(CVIU), vol. 114, pp. 712-722, 2010.
[15] Y. Wei et al., “CNN: Single-label to Multi-label,” arXiv:1406.5726v3 [cs.CV] 9 Jul 2014.
[16] R. Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation” In CVPR, 2014.
[17] R. Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation Supplementary
Material” In CVPR, 2014.
[18] R. Girshick et al., “Fast R-CNN,” arXiv:1504.08083v2 [cs.CV] 27 Sep 2015.
[19] S. Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv:1506.01497v1
[cs.CV] 4 Jun 2015.
[20] F. Anselmi et al., “Deep Convolutional Networks are Hierarchical Kernel Machines,” CBMM Memo, NSF, no. 35, 2015.