Large-Scale Object Recognition (AMMAI presentation)
TRANSCRIPT
3
"What does classifying more than 10,000 image categories tell us?”
tries to discuss this question
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
4
Datasets
• ImageNet10K – 10184 categories, 9 million images
• ImageNet7K (7404 categories)
• ImageNet1K (1000 categories)
• Rand200{a,b,c} (200 categories)
• CalNet200 (200 categories)
• Ungulate183, Fungus134, Vehicle262
5
Algorithms
• GIST + NN – kNN on L2 distance
• BOW + NN – SIFT for BOW, kNN on L1 distance
• BOW + SVM – one 1-vs-all SVM per category (# of SVMs == # of categories)
• SPM + SVM – SIFT for SPM, 1-vs-all SVM
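These are standard baselines; a minimal scikit-learn sketch of the kNN (L2/L1) and 1-vs-all linear SVM setups could look like the following (feature extraction for GIST/BOW/SPM is assumed to be done already; the random data is only a placeholder):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# placeholder features: in the paper these would be GIST, BOW, or SPM vectors
X_train = np.random.rand(100, 512)
y_train = np.random.randint(0, 10, size=100)
X_test = np.random.rand(5, 512)

# GIST + NN: nearest neighbour under the L2 (Euclidean) distance
gist_nn = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X_train, y_train)

# BOW + NN: nearest neighbour under the L1 (Manhattan) distance
bow_nn = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X_train, y_train)

# BOW/SPM + SVM: one 1-vs-all linear SVM per category
ovr_svm = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)

print(gist_nn.predict(X_test), bow_nn.predict(X_test), ovr_svm.predict(X_test))
```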
6
Computation time analysis
• BOW + SVM (ImageNet10K)
– A 1-vs-all SVM classifier needs 1 hr (2.66 GHz Intel Xeon)
– 16 hrs for testing
• 66 multi-core machines still need several weeks
Distributed computing and efficient learning are needed.
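A rough back-of-the-envelope check of the training cost alone: 10,184 one-vs-all classifiers × ~1 CPU-hour each ≈ 10,184 CPU-hours ≈ 424 CPU-days, i.e. more than a year of single-core compute, before feature extraction and testing are even counted.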
8
Size analysis
• Techniques that outperform others on small datasets may underperform on large datasets
12
From large scale image categorization to entry-level categories
Ordonez, Vicente, et al. "From large scale image categorization to entry-level categories." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
14
Definition of entry-level category
• The name that most people would naturally call an object
– e.g., 圓仔 (Yuan Zai), panda (熊貓), mammal (哺乳類), Ailuropoda melanoleuca (scientific name)
15
To achieve entry-level recognition
• By hypernym?
– Just replace the given output by its hypernym
[Figure: hierarchy with "Bird" as the hypernym of "sparrow" and "penguin"]
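A minimal sketch of this naive replacement, assuming NLTK's WordNet interface (which hypernym to pick is left arbitrary here):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def naive_entry_level(label):
    """Replace a (possibly too specific) label by its direct WordNet hypernym."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    if not synsets:
        return label
    hypernyms = synsets[0].hypernyms()
    return hypernyms[0].lemma_names()[0] if hypernyms else label

print(naive_entry_level("sparrow"))  # replaced by whatever parent WordNet gives
print(naive_entry_level("penguin"))  # same trick, even when the parent is an odd name
```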
16
Problem 1
• You may call a sparrow a bird, but you may not call a penguin a bird
17
Problem 2
• Encyclopedic knowledge vs. common-sense knowledge
– Encyclopedia: "A tulip is not a kind of flower."
– Common sense: "What a beautiful flower!"
18
Two methods
• Method 1: translate the classifier output into an entry-level category
• Method 2: directly learn an entry-level classifier
[Diagram: Image → Classifier → "Tulip" → "Flower" (method 1); Image → Classifier → "Flower" (method 2)]
19
Method 1
• Use a metric for scoring each node
[Figure: the Bird / sparrow / penguin hierarchy annotated with linear SVM outputs (0.8, 0.1, 0.9)]
20
Method 1
• Add the concept of naturalness
• We want v to be natural, but not so high-level that specificity is lost
– φ(v): naturalness; the more often v appears in the Google 1T corpus, the higher φ(v)
– Specificity penalty: the max height of the tree under v
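My reading of this trade-off as a toy sketch (the exact scoring in Ordonez et al. differs in its details; ngram_count, subtree_height and the weight lam are illustrative placeholders):

```python
import math

def entry_level_score(v, ngram_count, subtree_height, lam=1.0):
    """Score a candidate name v: naturalness (how often v appears in the
    Google 1T corpus) minus a penalty for being too high-level."""
    phi = math.log(1 + ngram_count[v])     # naturalness
    penalty = subtree_height[v]            # max height of the tree under v
    return phi - lam * penalty

# toy numbers only, to show the trade-off
ngram_count = {"Ailuropoda melanoleuca": 3_000, "panda": 2_000_000, "mammal": 900_000}
subtree_height = {"Ailuropoda melanoleuca": 0, "panda": 1, "mammal": 8}
best = max(ngram_count, key=lambda v: entry_level_score(v, ngram_count, subtree_height))
print(best)  # "panda": frequent enough to be natural, specific enough to be useful
```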
24
An interesting perspective...
• Why can we (as humans) recognize tens of thousands of objects in such a short time?
• We must have simplified the world; otherwise we would either
– process things slowly (computation cost), or
– have to take in a huge amount of information (memory cost)
(just my own observation XD)
25
An explanation for the paper
• Different kinds of dolphins have similar properties
– So why bother knowing every kind of dolphin?
• Dolphins have properties similar to fish
– So people think they are a kind of fish
27
Probably No
• Natural objects – we identify them by their properties
• Artifacts – we identify them by their functionalities
29
A support from paper
• Even if the result is incorrect, animals tend to be miscategorized as other animals
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
30
Maybe it's because the logic behind how things are made is different
(God vs. human):
Artifacts are made for humans to use.
Natural objects are made to live their own lives.
31
How to implement?
• It is still an open question.
Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. "Discovering object functionality."Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
Woods, Kevin, et al. "Learning membership functions in a function-based object recognition system." J. Artif. Intell. Res.(JAIR) 3 (1995): 187-222.
Weng, Juyang, and Matthew Luciw. "Brain-like emergent spatial processing."Autonomous Mental Development, IEEE Transactions on 4.2 (2012): 161-185.
32
Improving the Fisher Kernel for Large-Scale Image Classification
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
33
Fisher vector revisit
• A kind of image representation
– Input: a set of local descriptors
– Output: a fixed-length Fisher vector
36
Fisher vector revisit
• For each image, N=2
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on.
37
Fisher vector revisit
• Since we already have a GMM describing the image's local descriptors, we can take derivatives of the log-likelihood with respect to its parameters
• Derivatives – how changing the parameters would change how well the GMM fits the image
38
Fisher vector revisit
• Concatenating these derivatives, we get the Fisher vector!
The number of parameters is the same for every image, so the vector has a fixed length
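To make the "concatenate the derivatives" step concrete, here is a minimal numpy sketch of the gradient with respect to the GMM means only, following the form in Perronnin & Dance (2007); the full Fisher vector also stacks the weight and variance gradients, and all variable names here are mine:

```python
import numpy as np

def fisher_vector_means(X, w, mu, sigma2):
    """X: (T, D) local descriptors; w: (K,) GMM weights;
    mu: (K, D) means; sigma2: (K, D) diagonal variances."""
    T = X.shape[0]
    diff = X[:, None, :] - mu[None, :, :]                          # (T, K, D)
    # log-density of every descriptor under every Gaussian
    log_gauss = -0.5 * (np.sum(diff ** 2 / sigma2, axis=2)
                        + np.sum(np.log(2 * np.pi * sigma2), axis=1))
    log_post = np.log(w)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)                # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                      # posteriors (T, K)
    # gradient w.r.t. each mean, then concatenate -> fixed length K*D
    G = (gamma[:, :, None] * diff / np.sqrt(sigma2)).sum(axis=0)   # (K, D)
    G /= T * np.sqrt(w)[:, None]
    return G.ravel()

rng = np.random.default_rng(0)
fv = fisher_vector_means(rng.normal(size=(50, 64)),   # 50 SIFT-like descriptors
                         np.full(8, 1 / 8),           # 8 equally weighted Gaussians
                         rng.normal(size=(8, 64)),
                         np.ones((8, 64)))
print(fv.shape)  # (512,), no matter how many descriptors the image has
```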
39
Fisher vector revisit
• The form of the Fisher vector
– Local descriptors
– Fisher vector (not normalized)
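The formulas on this slide did not survive the transcript; following the cited paper, they are roughly:

\[
X = \{x_t,\; t = 1,\dots,T\}, \qquad
\mathcal{G}^X_{\lambda} = \frac{1}{T}\,\nabla_{\lambda} \log u_{\lambda}(X)
= \frac{1}{T}\sum_{t=1}^{T} \nabla_{\lambda} \log u_{\lambda}(x_t),
\]

where u_λ is the GMM (visual vocabulary) with parameters λ = {w_i, μ_i, Σ_i}.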
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
40
Improvement - L2 Normalization
• Assume the descriptors of a given image follow a distribution p
• p has two parts
– Background part u_λ (image-independent)
– Image-specific part q
41
Improvement - L2 Normalization
• Decompose the vector
42
Improvement - L2 Normalization
• The learning process minimizes the image-independent part – its gradient is approximately zero, because λ was estimated on exactly that background distribution
43
Improvement - L2 Normalization
• To remove the dependence on ω, we can L2-normalize the vector
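A compressed version of the argument in the cited paper, as I read it:

\[
p(x) \approx \omega\, q(x) + (1-\omega)\, u_{\lambda}(x)
\quad\Rightarrow\quad
\mathcal{G}^X_{\lambda} \approx \omega\, \nabla_{\lambda}\, \mathbb{E}_{x\sim q}\!\left[\log u_{\lambda}(x)\right]
+ (1-\omega)\, \nabla_{\lambda}\, \mathbb{E}_{x\sim u_{\lambda}}\!\left[\log u_{\lambda}(x)\right].
\]

The second term is ≈ 0 since λ maximizes the likelihood of the background distribution, so what remains is proportional to ω, the fraction of image-specific descriptors; dividing by the L2 norm cancels this ω.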
44
Improvement - Power Normalization
• As the number of Gaussians increases, the Fisher vector becomes sparser
[Figure: histograms of Fisher vector values for 16, 64, and 256 Gaussians]
45
Improvement - Power Normalization
• Apply power normalization to each dimension of the Fisher vector
• α = 0.5 is a reasonable choice for 256 Gaussians
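A minimal numpy sketch combining both improvements on a raw Fisher vector (the function name is mine; the per-dimension power formula sign(z)·|z|^α is the one in the cited paper):

```python
import numpy as np

def normalize_fisher_vector(fv, alpha=0.5):
    # power normalization: sign(z) * |z|**alpha, applied per dimension
    fv = np.sign(fv) * np.abs(fv) ** alpha
    # followed by L2 normalization
    return fv / (np.linalg.norm(fv) + 1e-12)
```

With α = 0.5 this is a signed square root, which makes the distribution of values less peaky around zero before the dot-product-based SVM sees them.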
49
Large-Scale Experiments
• Training: ImageNet, Flickr groups, VOC 2007 trainval
• Testing: PASCAL VOC 2007 (20 classes)
[29] Harzallah, Hedi, Frédéric Jurie, and Cordelia Schmid. "Combining efficient object localization and image classification." Computer Vision, 2009 IEEE 12th International Conference on.
50
Another thing I want to share
• Deep learning can be used in robotics!
Deep Learning for Detecting Robotic Grasps, Ian Lenz, Honglak Lee, Ashutosh Saxena. To appear in International Journal of Robotics Research (IJRR), 2014.