Large-Scale Object Recognition (AMMAI presentation)
TRANSCRIPT
3
"What does classifying more than 10,000 image categories tell us?”
tries to discuss this question
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
4
Datasets
• ImageNet10K – 10184 categories, 9 million images
• ImageNet7K (7404 categories)
• ImageNet1K (1000 categories)
• Rand200{a,b,c} (200 categories)
• CalNet200 (200 categories)
• Ungulate183, Fungus134, Vehicle262
5
Algorithms
• GIST + NN – kNN on L2 distance
• BOW + NN – SIFT for BOW, kNN on L1 distance
• BOW + SVM – one 1-vs-all SVM per category (# of SVMs == # of categories)
• SPM + SVM – SIFT for SPM, 1-vs-all SVM
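These are standard baselines; a minimal scikit-learn sketch of the kNN (L2/L1) and 1-vs-all linear SVM setups could look like the following (feature extraction for GIST/BOW/SPM is assumed to be done already; the random data is only a placeholder):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# placeholder features: in the paper these would be GIST, BOW, or SPM vectors
X_train = np.random.rand(100, 512)
y_train = np.random.randint(0, 10, size=100)
X_test = np.random.rand(5, 512)

# GIST + NN: nearest neighbour under the L2 (Euclidean) distance
gist_nn = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X_train, y_train)

# BOW + NN: nearest neighbour under the L1 (Manhattan) distance
bow_nn = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X_train, y_train)

# BOW/SPM + SVM: one 1-vs-all linear SVM per category
ovr_svm = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)

print(gist_nn.predict(X_test), bow_nn.predict(X_test), ovr_svm.predict(X_test))
```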
6
Computation time analysis
• BOW + SVM (ImageNet10K)
– A 1-vs-all SVM classifier needs 1 hr (2.66 GHz Intel Xeon)
– 16 hrs for testing
• 66 multi-core machines still need several weeks
Distributed computing and efficient learning are needed.
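A rough back-of-the-envelope check of the training cost alone: 10,184 one-vs-all classifiers × ~1 CPU-hour each ≈ 10,184 CPU-hours ≈ 424 CPU-days, i.e. more than a year of single-core compute, before feature extraction and testing are even counted.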
8
Size analysis
• Techniques that outperform others on small datasets may underperform on large datasets
12
From large scale image categorization to entry-level categories
Ordonez, Vicente, et al. "From large scale image categorization to entry-level categories." Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
14
Definition of entry-level category
• The name that most people would naturally call an object
– e.g., 圓仔 (Yuan Zai), panda (熊貓), mammal (哺乳類), Ailuropoda melanoleuca (scientific name)
15
To achieve entry-level recognition
• By hypernym?
– Just replace the given output by its hypernym
[Figure: hierarchy with "Bird" as the hypernym of "sparrow" and "penguin"]
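A minimal sketch of this naive replacement, assuming NLTK's WordNet interface (which hypernym to pick is left arbitrary here):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def naive_entry_level(label):
    """Replace a (possibly too specific) label by its direct WordNet hypernym."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    if not synsets:
        return label
    hypernyms = synsets[0].hypernyms()
    return hypernyms[0].lemma_names()[0] if hypernyms else label

print(naive_entry_level("sparrow"))  # replaced by whatever parent WordNet gives
print(naive_entry_level("penguin"))  # same trick, even when the parent is an odd name
```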
16
Problem 1
• You may call a sparrow a bird, but you may not call a penguin a bird
17
Problem 2
• Encyclopedic knowledge vs. common-sense knowledge
– Encyclopedia: "A tulip is not a kind of flower."
– Common sense: "What a beautiful flower!"
18
Two methods
• Method 1: translate the classifier output into an entry-level category
• Method 2: directly learn an entry-level classifier
[Diagram: Image → Classifier → "Tulip" → "Flower" (method 1); Image → Classifier → "Flower" (method 2)]
19
Method 1
• Use a metric for scoring each node
[Figure: the Bird / sparrow / penguin hierarchy annotated with linear SVM outputs (0.8, 0.1, 0.9)]
20
Method 1
• Add the concept of naturalness
• We want v to be natural, but not so high-level that specificity is lost
– φ(v): naturalness; the more often v appears in the Google 1T corpus, the higher φ(v)
– Specificity penalty: the max height of the tree under v
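My reading of this trade-off as a toy sketch (the exact scoring in Ordonez et al. differs in its details; ngram_count, subtree_height and the weight lam are illustrative placeholders):

```python
import math

def entry_level_score(v, ngram_count, subtree_height, lam=1.0):
    """Score a candidate name v: naturalness (how often v appears in the
    Google 1T corpus) minus a penalty for being too high-level."""
    phi = math.log(1 + ngram_count[v])     # naturalness
    penalty = subtree_height[v]            # max height of the tree under v
    return phi - lam * penalty

# toy numbers only, to show the trade-off
ngram_count = {"Ailuropoda melanoleuca": 3_000, "panda": 2_000_000, "mammal": 900_000}
subtree_height = {"Ailuropoda melanoleuca": 0, "panda": 1, "mammal": 8}
best = max(ngram_count, key=lambda v: entry_level_score(v, ngram_count, subtree_height))
print(best)  # "panda": frequent enough to be natural, specific enough to be useful
```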
24
An interesting perspective...
• Why can we (as humans) recognize tens of thousands of objects in such a short time?
• We must have simplified the world; otherwise we would either
– process things slowly (computation cost), or
– have to take in a huge amount of information (memory cost)
(just my own observation XD)
25
An explanation for the paper
• Different kinds of dolphins have similar properties
– So why bother knowing every kind of dolphin?
• Dolphins have properties similar to fish
– So people think they are a kind of fish
27
Probably No
• Natural objects – we identify them by their properties
• Artifacts – we identify them by their functionalities
29
A support from paper
• Even if the result is incorrect, animals tend to be miscategorized as other animals
Deng, Jia, et al. "What does classifying more than 10,000 image categories tell us?." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 71-84.
30
Maybe it's because the logic behind how things are made is different
(God vs. human):
Artifacts are made for humans to use.
Natural objects are made to live their own lives.
31
How to implement?
• It is still an open question.
Yao, Bangpeng, Jiayuan Ma, and Li Fei-Fei. "Discovering object functionality."Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013.
Woods, Kevin, et al. "Learning membership functions in a function-based object recognition system." J. Artif. Intell. Res.(JAIR) 3 (1995): 187-222.
Weng, Juyang, and Matthew Luciw. "Brain-like emergent spatial processing."Autonomous Mental Development, IEEE Transactions on 4.2 (2012): 161-185.
32
Improving the Fisher Kernel for Large-Scale Image Classification
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
33
Fisher vector revisit
• A kind of image representation
– Input: a set of local descriptors
– Output: a fixed-length Fisher vector
36
Fisher vector revisit
• For each image, N=2
Perronnin, Florent, and Christopher Dance. "Fisher kernels on visual vocabularies for image categorization." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on.
37
Fisher vector revisit
• Since we already have a GMM describing the image's local descriptors, we can take derivatives of the log-likelihood with respect to its parameters
• Derivatives – how changing the parameters would change how well the GMM fits the image
38
Fisher vector revisit
• Concatenating these derivatives, we get the Fisher vector!
The number of parameters is the same for every image, so the vector has a fixed length
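To make the "concatenate the derivatives" step concrete, here is a minimal numpy sketch of the gradient with respect to the GMM means only, following the form in Perronnin & Dance (2007); the full Fisher vector also stacks the weight and variance gradients, and all variable names here are mine:

```python
import numpy as np

def fisher_vector_means(X, w, mu, sigma2):
    """X: (T, D) local descriptors; w: (K,) GMM weights;
    mu: (K, D) means; sigma2: (K, D) diagonal variances."""
    T = X.shape[0]
    diff = X[:, None, :] - mu[None, :, :]                          # (T, K, D)
    # log-density of every descriptor under every Gaussian
    log_gauss = -0.5 * (np.sum(diff ** 2 / sigma2, axis=2)
                        + np.sum(np.log(2 * np.pi * sigma2), axis=1))
    log_post = np.log(w)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)                # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                      # posteriors (T, K)
    # gradient w.r.t. each mean, then concatenate -> fixed length K*D
    G = (gamma[:, :, None] * diff / np.sqrt(sigma2)).sum(axis=0)   # (K, D)
    G /= T * np.sqrt(w)[:, None]
    return G.ravel()

rng = np.random.default_rng(0)
fv = fisher_vector_means(rng.normal(size=(50, 64)),   # 50 SIFT-like descriptors
                         np.full(8, 1 / 8),           # 8 equally weighted Gaussians
                         rng.normal(size=(8, 64)),
                         np.ones((8, 64)))
print(fv.shape)  # (512,), no matter how many descriptors the image has
```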
39
Fisher vector revisit
• The form of the Fisher vector
– Local descriptors
– Fisher vector (not normalized)
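The formulas on this slide did not survive the transcript; following the cited paper, they are roughly:

\[
X = \{x_t,\; t = 1,\dots,T\}, \qquad
\mathcal{G}^X_{\lambda} = \frac{1}{T}\,\nabla_{\lambda} \log u_{\lambda}(X)
= \frac{1}{T}\sum_{t=1}^{T} \nabla_{\lambda} \log u_{\lambda}(x_t),
\]

where u_λ is the GMM (visual vocabulary) with parameters λ = {w_i, μ_i, Σ_i}.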
Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010. 143-156.
40
Improvement - L2 Normalization
• Assume the descriptors of a given image follow a distribution p
• p has two parts
– Background part u_λ (image-independent)
– Image-specific part q
41
Improvement - L2 Normalization
• Decompose the vector
42
Improvement - L2 Normalization
• The learning process minimizes the image-independent part – its gradient is approximately zero, because λ was estimated on exactly that background distribution
43
Improvement - L2 Normalization
• To remove the dependence on ω, we can L2-normalize the vector
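A compressed version of the argument in the cited paper, as I read it:

\[
p(x) \approx \omega\, q(x) + (1-\omega)\, u_{\lambda}(x)
\quad\Rightarrow\quad
\mathcal{G}^X_{\lambda} \approx \omega\, \nabla_{\lambda}\, \mathbb{E}_{x\sim q}\!\left[\log u_{\lambda}(x)\right]
+ (1-\omega)\, \nabla_{\lambda}\, \mathbb{E}_{x\sim u_{\lambda}}\!\left[\log u_{\lambda}(x)\right].
\]

The second term is ≈ 0 since λ maximizes the likelihood of the background distribution, so what remains is proportional to ω, the fraction of image-specific descriptors; dividing by the L2 norm cancels this ω.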
44
Improvement - Power Normalization
• As the number of Gaussians increases, the Fisher vector becomes sparser
[Figure: histograms of Fisher vector values for 16, 64, and 256 Gaussians]
45
Improvement - Power Normalization
• Apply power normalization to each dimension of the Fisher vector
• α = 0.5 is a reasonable choice for 256 Gaussians
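A minimal numpy sketch combining both improvements on a raw Fisher vector (the function name is mine; the per-dimension power formula sign(z)·|z|^α is the one in the cited paper):

```python
import numpy as np

def normalize_fisher_vector(fv, alpha=0.5):
    # power normalization: sign(z) * |z|**alpha, applied per dimension
    fv = np.sign(fv) * np.abs(fv) ** alpha
    # followed by L2 normalization
    return fv / (np.linalg.norm(fv) + 1e-12)
```

With α = 0.5 this is a signed square root, which makes the distribution of values less peaky around zero before the dot-product-based SVM sees them.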
49
Large-Scale Experiments
• Training: ImageNet, Flickr groups, VOC 2007 trainval
• Testing: PASCAL VOC 2007 (20 classes)
[29] Harzallah, Hedi, Frédéric Jurie, and Cordelia Schmid. "Combining efficient object localization and image classification." Computer Vision, 2009 IEEE 12th International Conference on.
50
Another thing I want to share
• Deep learning can be used in robotics!
Deep Learning for Detecting Robotic Grasps, Ian Lenz, Honglak Lee, Ashutosh Saxena. To appear in International Journal of Robotics Research (IJRR), 2014.