modeling structures in human pose estimation and imediacy ... · deep learning is a...

Structured deep learning for visual localization and recognition Wanli Ouyang (欧阳万里) [email protected] The Chinese University of Hong Kong The University of Sydney

Page 1: Modeling structures in human pose estimation and imediacy ... · Deep learning is a framework/language but not a black ... and increasing the capacity of the learner. 49 • N. Dalal

Structured deep learning for visual localization and recognition

Wanli Ouyang (欧阳万里)

[email protected]

The Chinese University of Hong Kong The University of Sydney

Outline

• Introduction
• Structured deep learning
  – Structured output
  – Structured hidden factors
  – Structured features
• Back-bone model design
• Conclusion

Object recognition

Cat

Object detection

Woman, Child, Tooth brush, Tooth brush

Action recognition

Application

• Automotive safety and automatic car driving
• Robotics and human-computer interaction
• Internet of Things
• Public safety and smart city (image labels: Theft, Traffic jam)
• Social network (image labels: Family, Workmate, Father)
• Industrial production
• Bio-medical imaging (image labels: Microaneurysms, Blot hemorrhages)

Challenges -- person

• Intra-class variation
• Color
• Occlusion
• Deformation

Deep learning

• MIT Tech Review Top 10 Breakthroughs 2013: ranked No. 1
• Hinton won the ImageNet competition
  – Classify 1.2 million images into 1,000 categories
  – Beating existing computer vision methods by 20+%
  – Surpassing human performance
• Holds records on most computer vision problems
• Web-scale visual search, self-driving cars, surveillance, multimedia …
• Simulates brain activities and employs millions of neurons to fit billions of training samples
• Deep neural networks are trained with GPU clusters with tens of thousands of processors

ImageNet Large Scale Visual Recognition Challenge

ImageNet Object Detection Task

200 object classes

~500,000 training images, 60,000 test images

Mean Average Precision (mAP)

• ILSVRC 2013: UvA-Euvision, 22.581%
• ILSVRC 2014: GoogLeNet (Google), 43.9%
• CVPR’15: DeepID-Net, 50.3%
• ILSVRC 2015: ResNet (MSRA), 62.0%
• ILSVRC 2016: GBD-Net, 66.3%

W. Ouyang, X. Wang, et al. “DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection,” CVPR15, TPAMI17

X. Zeng, W. Ouyang, J. Yan, et al. “Crafting GBD-Net for Object Detection,” ECCV16, TPAMI 2017

Our team at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

• Object detection: 2nd in 2014 (Google 1st), 1st in 2016
• Video object detection/tracking: 1st in 2015, 1st in 2016

Our team at Common Objects in Context (COCO)

• Object detection and instance segmentation: 1st in 2018

Is deep model a black box?

Performance vs practical need

[Figure labels: Conventional model; Deep model; Very deep model; Very deep structured learning; Face recognition; Many other applications]

Structure in data

Model structures among neurons

[Figure: network diagrams over input x, hidden layers h1 and h2, and output y, labeled: Conventional; Knowledge based structured latent factor; Structured output; Structured output and feature]

Outline

• Introduction
• Structured deep learning
  – Structured output

Object detection

• Sliding window

• Variable window size
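
The two bullets above can be illustrated with a minimal sketch. It assumes pixel-aligned boxes and a fixed stride; the function name `sliding_windows` and its parameters are hypothetical, chosen only for this illustration.

```python
from typing import Iterator, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def sliding_windows(
    image_w: int,
    image_h: int,
    window_sizes: Sequence[Tuple[int, int]],
    stride: int,
) -> Iterator[Box]:
    """Yield every window of each size that fits fully inside the image."""
    for win_w, win_h in window_sizes:
        # A window larger than the image gives an empty range, so it is skipped.
        for y0 in range(0, image_h - win_h + 1, stride):
            for x0 in range(0, image_w - win_w + 1, stride):
                yield (x0, y0, x0 + win_w, y0 + win_h)


# Toy usage: an 8x6 image scanned with two window sizes.
boxes = list(sliding_windows(8, 6, window_sizes=[(4, 4), (8, 6)], stride=2))
```

Even this toy image yields several candidate boxes per window size, which is why a real detector has to score a very large number of regions.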

Motivation

• Many more negative samples than positive samples

• It is easy to tell that some regions do not contain any object

Cascade Network

[Figure: an image with RoIs passes through Stage 1, Stage 2 and Stage 3. Each stage has a classifier that asks "bg?"; rejected RoIs are dropped and the remaining RoIs on the image go on to the next stage. Example: learned classifier for sheep.]

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.
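
The stage-by-stage rejection in the cascade figure can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: each stage is a hypothetical function returning a background probability, and the threshold `bg_threshold` is invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

RoI = Tuple[int, int, int, int]
# Hypothetical stage: maps an RoI to a background probability in [0, 1].
Stage = Callable[[RoI], float]


def run_cascade(
    rois: Sequence[RoI],
    stages: Sequence[Stage],
    bg_threshold: float = 0.9,  # assumed value, not from the slides
) -> Tuple[List[RoI], List[List[RoI]]]:
    """Return (remaining RoIs, RoIs rejected at each stage)."""
    remaining = list(rois)
    rejected_per_stage: List[List[RoI]] = []
    for stage in stages:
        kept, rejected = [], []
        for roi in remaining:
            # "bg?": reject when the stage is confident the RoI is background.
            (rejected if stage(roi) >= bg_threshold else kept).append(roi)
        rejected_per_stage.append(rejected)
        remaining = kept  # later stages only see the survivors
    return remaining, rejected_per_stage


def area(roi: RoI) -> int:
    return (roi[2] - roi[0]) * (roi[3] - roi[1])


# Toy stages that stand in for learned classifiers.
toy_stages = [
    lambda r: 1.0 if area(r) < 4 else 0.0,    # stage 1 rejects tiny boxes
    lambda r: 1.0 if area(r) > 100 else 0.0,  # stage 2 rejects huge boxes
]
toy_rois = [(0, 0, 1, 1), (0, 0, 5, 5), (0, 0, 20, 20)]
remaining, rejected = run_cascade(toy_rois, toy_stages)
```

Because easy background RoIs leave early, the later and more expensive stages process fewer candidates.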

Model structures among classifiers at different stages

• Build up cascade at several stages in one network

[Figure: classifier chaining with multiple cascade stages. Labels: convolution on image; RoIs; early cascade; roi-pooling; contextual cascade; rejected RoI; remaining RoIs; features; chained CNN features for RoI; classifier (one per stage); chained class scores; softmax; loss function for training; detection scores; detection results]

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.

Model structures among classifiers at different stages

[Figure: the three-stage cascade (image with RoIs, a classifier asking "bg?" at each stage, rejected RoIs, remaining RoIs on the image), with example labels "tooth brush, tooth brush, tooth brush" and "axe, axe, tooth brush"]

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.

Model structures among classifiers at different stages with different context

• Build up structure among classifiers ci(*) at different stages

[Figure: chained classifiers over four stages with features f1, f2, f3, f4. The summed class scores are c1(f1)⊙b1 + c2(f2)⊙b2 + c3(f3)⊙b3 + c4(f4)⊙b4. Softmax outputs p1, p2, p3, p4 feed the tests u(p1, r1)=0?, u(p2, r2)=0?, u(p3, r3)=0?, u(p4, r4)=0?; on Y the RoI is rejected, on N it continues, ending in the detection results.]

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.
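
A rough sketch of the chained scoring shown on this slide follows. Several parts are assumptions made only for illustration: each classifier ci is taken to be linear, bi is treated as a per-class weight vector, and the rejection test u(pi, ri) is assumed to fire when no object class reaches probability ri. The slides do not define these, so this is not the paper's exact formulation.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def chained_scores(features, weights, b, r, bg_index=0):
    """Run one RoI through the chain of classifiers.

    features[t]: feature vector f_t
    weights[t]:  matrix of an (assumed) linear classifier c_t
    b[t]:        per-class vector b_t (assumed meaning)
    r[t]:        rejection threshold r_t (assumed meaning)
    Returns (index of the rejecting stage or None, class probabilities).
    """
    summed = np.zeros(weights[0].shape[0])
    probs = softmax(summed)
    for t, (f_t, W_t, b_t, r_t) in enumerate(zip(features, weights, b, r)):
        summed = summed + (W_t @ f_t) * b_t  # add c_t(f_t) ⊙ b_t to the chain
        probs = softmax(summed)
        # Assumed form of u(p_t, r_t): reject if no object class reaches r_t.
        if np.delete(probs, bg_index).max() < r_t:
            return t, probs
    return None, probs


# Toy example: background plus two object classes, two stages.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
ones = np.ones(3)
stage, probs = chained_scores(
    features=[np.array([2.0, 0.0]), np.array([1.0, 0.0])],
    weights=[W, W], b=[ones, ones], r=[0.5, 0.5],
)
```

Summing scores along the chain lets a later stage refine, rather than replace, what earlier stages decided.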

Experimental results

• Build up structure among classifiers ci(*) at different stages

ImageNet val2 detection mean average precision (%) with different settings on classifier chaining.

cascade? √

chaining classifier? √

mAP 49.4 50.9

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.

Structured output

• Treat deep model as feature extractor

• Jointly learn feature and structured output
  – The structure layer captures structured information that cannot be modeled by a conventional deep model, e.g. the relationship between cascaded classifiers
  – The conventional deep model need not be influenced by problems that the structured model already solves well, e.g. the huge amount of easy negative data

[Figure: two pipelines from input to output: deep model followed by structure learning, and feature extraction followed by structure learning]

Wanli Ouyang, Kun Wang, Xin Zhu, Xiaogang Wang. "Chained Cascade Network for Object Detection", Proc. ICCV, 2017.

Outline

• Introduction
• Learning features
  – Learning Feature Pyramids (ICCV17)
• Learning
  – Structure of output
  – Structured hidden factors
    • Joint deep learning for pedestrian detection (ICCV13)
    • DeepID-Net for object detection (T-PAMI16)
    • Learning Mutual Visibility Relationship for pedestrian detection (IJCV16)
  – Structure of features
• Conclusion

Challenges -- person

• Intra-class variation
• Color
• Occlusion
• Deformation

Hidden, no annotation

Is deep model a black box?

Joint Learning vs Separate Learning

• Separate learning: Data collection → Preprocessing step 1 → Preprocessing step 2 → … → Feature extraction → Classification (stages labeled "Manual design" or "Training or manual design")
• End-to-end learning: Data collection → Feature transform → Feature transform → … → Feature transform → Classification

Deep learning is a framework/language but not a black-box model

Its power comes from joint optimization and increasing the capacity of the learner

• N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005. (10,000+ citations)

• P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008. (4000+ citations)

• W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR, 2012.

Our Joint Deep Learning Model

W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” Proc. ICCV, 2013.

Modeling Part Detectors

Design the filters in the second convolutional layer with variable sizes

[Figure: part models learned from HOG, and the learned filters at the second convolutional layer]

Deformation Layer

Infer the location of object parts
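
One common way to infer a part's location, in the style of deformable part models, is to take the position that maximizes the part detection response minus a penalty for moving away from an expected anchor. The sketch below assumes a quadratic penalty; the function name `place_part` and the `penalty` parameter are hypothetical, and the actual deformation layer in the cited paper may be parameterized differently.

```python
import numpy as np


def place_part(response: np.ndarray, anchor, penalty: float):
    """Return (best score, (row, col)) for one part.

    response: 2-D part detection map
    anchor:   (row, col) where the part is expected
    penalty:  weight of the squared displacement (assumed quadratic form)
    """
    rows, cols = np.indices(response.shape)
    displacement_cost = penalty * (
        (rows - anchor[0]) ** 2 + (cols - anchor[1]) ** 2
    )
    summed = response - displacement_cost
    best = np.unravel_index(np.argmax(summed), summed.shape)
    return float(summed[best]), (int(best[0]), int(best[1]))


# Toy map: a weak response at the anchor and a strong one far away.
response = np.zeros((5, 5))
response[1, 1] = 1.0
response[4, 4] = 3.0
near = place_part(response, anchor=(1, 1), penalty=0.5)
free = place_part(response, anchor=(1, 1), penalty=0.0)
```

With a non-zero penalty the part stays near its anchor; with no penalty it jumps to the strongest response anywhere in the map.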

Visibility Reasoning with Deep Belief Net

Correlates with part detection score

Infer the visibility of object parts

Pedestrian Detection on Caltech (average miss rates)

• HOG+SVM: 68%
• DPM: 63%
• Joint DL: 39%
• Joint DL-v2: 9%

W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.

Our code:

W. Ouyang et al., “Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection,” TPAMI, accepted.

Our code:

Generalize from single pedestrian to multiple pedestrians

• Single pedestrian
• Multiple pedestrians (IJCV’16)
• Generic object detection (TPAMI’17, most popular)

Deformation

Visibility

Structure in neurons

[Figure: network with input x, hidden layer h and output y]

• Conventional neural networks

– Neurons in the same layer have no connection

– Neurons in adjacent layers are fully connected, at least within a local region

Structure exists in the brain

Outline

• Introduction
• Learning features
• Learning
  – Structured output
  – Structured hidden factors
  – Structure of features
    • GBD-Net for object detection (ECCV16)
    • Structured feature learning for pose estimation (CVPR16)
    • CRF-CNN for pose estimation (NIPS16)
    • Attention-Gated CRFs for Contour Prediction (NIPS17)
    • Scene Graph Generation from Objects, Phrases and Region Captions (ICCV17)
• Conclusion

Message from past ImageNet Challenge

Design a good learning strategy (VGG, BN) or a good branching structure (Inception, ResNet) to make the model deeper

Number of layers:

• AlexNet (ILSVRC 2012): 8
• ZF-Net, Overfeat (ILSVRC 2013): 8
• VGG (ILSVRC 2014): 19
• GoogLeNet (ILSVRC 2014): 22
• ResNet (ILSVRC 2015): 152

Is deeper the only way to go?

How can vision researchers’ observations help?

GBD-Net

Context

Context

Visual context helps to identify objects

Motivation

With the deep model, what can we do for context?

• Learn the relationship among features of different resolutions and contextual regions.
• Features of different contextual regions validate each other.
• Not always true.
• Control the flow of message passing.

[Figure: image regions labeled "Rabbit ear", "Rabbit head" and "Human face"]

Fast R-CNN

Gated bi-directional CNN (GBD-Net)

• Features of different context and resolution
• Passing messages among these features

Independent features


Passing message in one direction


Passing message in two directions


Passing message with gates

+3.7% mAP on BN-Inception
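The progression from independent features to gated two-way message passing can be sketched in a few lines. This is a minimal illustration only, assuming each feature is a plain vector and each gate is a sigmoid of the sender's value; GBD-Net itself works on convolutional feature maps, and the names `gated_pass`, `gated_bidirectional`, `w_msg`, `w_gate`, and `b_gate` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def gated_pass(features, w_msg=0.5, w_gate=1.0, b_gate=0.0):
    """One directional sweep: feature i receives a gated message from feature i-1."""
    out = [list(features[0])]
    for i in range(1, len(features)):
        sender = out[i - 1]
        updated = []
        for own, src in zip(features[i], sender):
            gate = sigmoid(w_gate * src + b_gate)  # in (0, 1): how much of the message passes
            message = w_msg * src                  # assumed scalar weight instead of a convolution
            updated.append(relu(own + gate * message))
        out.append(updated)
    return out

def gated_bidirectional(features, **kw):
    """Run the sweep in both directions and concatenate the two results per feature."""
    forward = gated_pass(features, **kw)
    backward = gated_pass(features[::-1], **kw)[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

With a strongly negative `b_gate` the gates close and each feature stays (almost) independent, which is the sense in which gates control the flow of message passing.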


Improvement from GBD-Net

Dataset        ImageNet val2   Pascal VOC 07   COCO (AP50)
Without GBD    48.4            73.1            39.3
+ GBD          52.1            77.2            45.8
Improvement    +3.7            +4.1            +6.5

BN-net (BN-Inception) as the baseline


Brief summary

Features matter

Observations from vision researchers also matter

Use deep model as a tool to model the relationship among features

Gated bi-directional network (GBD-Net)

Pass messages among features from different contextual regions

Code: https://github.com/craftGBD/craftGBD

Zeng et al., "Crafting GBD-Net for Object Detection," TPAMI, accepted.

A pretrained deep model with 269 layers is also provided


Motivation

• Debate

– Lack of "general theory"

• Solution

– Probabilistic model, conditional random field, is used as the theory


Conditional Random Field

Where,


(a) Multi-layer neural network

(b) Structured output space

(c) Structured hidden layer

(d) Attention gated structured hidden layer

[Figure: four graphical models, Model (a) to Model (d), with node labels I, h, z and edge labels e_zh, e_h]

"End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation", CVPR 2016.

"Learning Deep Structured Multi-Scale Features using Attention-Gated CRFs for Contour Prediction", NIPS 2017.

"Structured feature learning for pose estimation", CVPR 2016.

"CRF-CNN: Modeling Structured Information in Human Pose Estimation", NIPS 2016.


Mean Field Approximation

To obtain the estimation of features:

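$$p(\mathbf{h} \mid \mathbf{I}, \boldsymbol{\theta}) = \prod_i Q(\mathbf{h}_i \mid \mathbf{I}, \boldsymbol{\theta})$$

$$Q(\mathbf{h}_i \mid \mathbf{I}, \boldsymbol{\theta}) = \frac{1}{z_{h,i}} \exp\Big\{ -\sum_{h_k} \Phi_h(h_k, \mathbf{I}) - \sum_{(i,j) \in \varepsilon_h} \varphi_h\big(\mathbf{h}_i, Q(\mathbf{h}_j \mid \mathbf{I}, \boldsymbol{\theta})\big) \Big\}$$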
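A mean-field update of this form can be illustrated on a tiny discrete model. The sketch below is an assumption-laden toy, not the network used in the talk: hidden variables take a few discrete states, `unary[i][s]` plays the role of the unary energy, `pairwise[s][t]` is a symmetric pairwise energy shared by all edges, and the function name `mean_field` is hypothetical.

```python
import math

def mean_field(unary, edges, pairwise, iters=10):
    """Iteratively update Q_i(s) proportional to
    exp(-unary_i(s) - sum over neighbours j of E_{Q_j}[pairwise(s, t)])."""
    n, k = len(unary), len(unary[0])
    q = [[1.0 / k] * k for _ in range(n)]       # start from uniform beliefs
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        for i in range(n):
            energy = []
            for s in range(k):
                e = unary[i][s]
                for j in nbrs[i]:
                    # expected pairwise energy under the neighbour's current belief
                    e += sum(q[j][t] * pairwise[s][t] for t in range(k))
                energy.append(e)
            m = min(energy)                      # subtract the minimum for numerical stability
            w = [math.exp(-(e - m)) for e in energy]
            z = sum(w)                           # per-variable normaliser
            q[i] = [x / z for x in w]
    return q
```

On a three-node chain where only the first node has an informative unary term, the beliefs of the other nodes are pulled toward the same state, which is the message-passing behaviour the factorised update is meant to capture.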


Message passing

• Belief propagation

– N^2 => 2N


RESULTS ON LSP (PCP)

Method                   Torso  Head  U.arms  L.arms  U.legs  L.legs  Mean
Chen & Yuille NIPS'2014  92.7   87.8  69.2    55.4    82.9    77.0    75.0
Yang et al. CVPR'2016    96.5   83.1  78.8    66.7    88.7    81.7    81.1
Chu et al. CVPR'2016     95.4   89.6  76.9    65.2    87.6    83.2    81.1
Ours                     96.0   91.3  80.0    67.1    89.5    85.0    83.1

COMPONENT ANALYSIS ON LSP (PCP)

Method                   Torso  Head  U.arms  L.arms  U.legs  L.legs  Mean
Flooding-2itrs-tree      93.5   86.7  73.0    59.8    83.7    79.0    77.1
Flooding-2itrs-loopy     94.0   88.2  74.4    62.1    84.3    80.0    78.4
Serial-tree (ReLU)       95.5   88.9  75.9    63.8    87.1    81.4    80.1
Serial-tree (softmax)    96.0   91.3  80.0    67.1    89.5    85.0    83.1

Loopy Structure


Why structured features?

• Richer visual information

Label: rabbit Visual feature

Facing left


Model structures among neurons

h1

x

y

h2

Conventional

Knowledge based structured latent factor

Structured output

Structured output and feature


Is structured learning only effective for object detection?


Application of structured feature learning

• Haze removal (Submitted to CVPR19)

• Depth estimation (TPAMI 18)

• Contour estimation (NIPS 17)

• Detection (TPAMI17, TPAMI18, …)

• Human pose estimation (CVPR16)

• Person re-identification (CVPR18)

• Relationship estimation (ICCV17)

• Image captioning (ICCV17)

D. Xu, et al., "Monocular Depth Estimation using Multi-Scale Continuous CRFs as Sequential Deep Networks," TPAMI 2018.

W. Ouyang, et al., "Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection," TPAMI 2018.

W. Ouyang, et al., "DeepID-Net: Object Detection with Deformable Part Based Convolutional Neural Networks," TPAMI 2017.

X. Chu, W. Ouyang, et al., "Structured feature learning for pose estimation," CVPR 2016.

Y. Li, W. Ouyang, et al., "Scene Graph Generation from Objects, Phrases and Region Captions," ICCV 2017.

Low-level vision

High-level vision

Vision + Language


Is structured learning only effective for specific vision tasks?

Outline

Introduction

Structured deep learning

Back-bone model design

Conclusion


Back-bone deep model design

• Basic structure of deep model

– AlexNet, VGG, GoogLeNet, ResNet, DenseNet

– Validated on large-scale classification tasks such as ImageNet

– Models pretrained on ImageNet are found to be effective initial models for other tasks

Outline

Introduction

Structured deep learning

Back-bone model design: FishNet (NeurIPS18), Optical flow guided feature (CVPR18)

Conclusion


Low-level and high-level features

Image from Andrew Ng’s slides

Low-level

High-level


Current CNN Structures

Image Classification: Summarize high-level semantic information of the whole image.

Detection/Segmentation: High-level semantic meaning with high spatial resolution

Called U-Net, Hourglass, or Conv-deconv

Architectures designed for tasks of different granularities are DIVERGING


Unify the advantages of networks for pixel-level, region-level, and image-level tasks

Observation and design

• Our observation

1. Diverged structures for tasks requiring different resolutions.

• Design

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.

Hourglass for Classification

Features with high-level semantics and high resolution are good

Directly applying hourglass for classification?

Poor performance.

So what is the problem?

• Different tasks require different resolutions of feature

Our design

• Down sample high-level features with high resolution


Observation and design

• Observation

1. Diverged structures for tasks requiring different resolutions.

2. Isolated Conv blocks the direct back-propagation

• Design

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.

Hourglass for Classification

• Hourglass may bring more isolated convolutions than ResNet

[Figure: a normal Res-Block (1 × 1, c_in; 3 × 3, c_in; 1 × 1, c_out) next to a Res-Block for down/up sampling (the same three layers with stride = 2, plus an additional 1 × 1, c_out convolution)]

The 1 × 1 convolution layer in yellow indicates the isolated convolution.

Observation and design

• Observation

1. Diverged structures for tasks requiring different resolutions.

2. Isolated Conv blocks the direct back-propagation

• Design

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.

2. Design a network that does not need isolated convolution

Our design

[Figure: block with 1 × 1, c_in; 3 × 3, c_in; 1 × 1, c_in convolutions; up/down-sampled low-level features (c_out − c_in channels) are concatenated (C = Concat) to give c_out channels]

Observation and design

• Observation

1. Diverged structures for tasks requiring different resolutions.

2. Isolated Conv blocks the direct back-propagation

3. Features with different depths are not fully explored, or mixed but not preserved

• Design

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.

2. Design a network that does not need isolated convolution

3. Features from varying depths are preserved and refined from each other.

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR'15.

Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV'16.

Difference between "mix" and "preserve and refine"

[Figure: left, mixed features: high-level and low-level features are added (+) and passed through conv layers; right, preserve and refine: high-level and low-level features with message generation (M) between them]

Observation and design

Our observation

1. Diverged structures for tasks requiring different resolutions.

2. Isolated Conv blocks the direct back-propagation

3. Features with different depths are not fully explored, or mixed but not preserved

Solution

1. Unify the advantages of networks for pixel-level, region-level, and image-level tasks.

2. Design a network that does not need isolated convolution

3. Features from varying depths are preserved and refined from each other.

Bharath Hariharan, et al. "Hypercolumns for object segmentation and fine-grained localization." CVPR'15.

Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." ECCV'16.

FishNet: Overview

[Figure: FishNet built from residual blocks. Fish Tail: 224x224 … 56x56, 28x28, 14x14, 7x7. Fish Body: 14x14, 28x28, 56x56. Fish Head: 28x28, 14x14, 7x7, 1x1. Legend: features in the tail part, features in the body part, features in the head part, Concat.]


FishNet: Preservation & Refinement

[Figure: Fish Tail, Fish Body, and Fish Head joined by regular connections and transferring blocks T(⋅); legend: UR blocks, DR blocks, M(⋅), up(⋅), down(⋅), r(⋅)]

Up-sampling and Refinement (UR) Blocks

• Inputs: from tail, from body

• Concat

• r(⋅): sum up every k adjacent channels

• up(⋅): nearest neighbor up-sampling

Down-sampling and Refinement (DR) Blocks

• Inputs: from body, from head

• Concat

• down(⋅): 2 × 2 max-pooling

Features from varying depths refine each other here
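The three primitive operations named on this slide can be sketched on nested Python lists. This is only an illustration of the operations themselves, assuming a feature map is a list of channels and each channel is a 2-D list; how the operations are wired inside the actual UR and DR blocks is not spelled out here, and the function names are hypothetical.

```python
def reduce_channels(x, k):
    """r(.): sum every k adjacent channels, so C channels become C // k."""
    assert len(x) % k == 0, "channel count must be divisible by k"
    out = []
    for c in range(0, len(x), k):
        group = x[c:c + k]
        h, w = len(group[0]), len(group[0][0])
        out.append([[sum(ch[i][j] for ch in group) for j in range(w)] for i in range(h)])
    return out

def upsample_nearest(x):
    """up(.): nearest-neighbour up-sampling by a factor of 2 in height and width."""
    out = []
    for ch in x:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row vertically
        out.append(rows)
    return out

def max_pool_2x2(x):
    """down(.): 2 x 2 max-pooling with stride 2 (height and width must be even)."""
    return [[[max(ch[i][j], ch[i][j + 1], ch[i + 1][j], ch[i + 1][j + 1])
              for j in range(0, len(ch[0]), 2)]
             for i in range(0, len(ch), 2)]
            for ch in x]

def concat(a, b):
    """Concatenate two feature maps along the channel axis."""
    return a + b
```

Summing adjacent channels reduces the channel count without introducing a new convolution, which fits the stated goal of avoiding isolated convolutions.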

FishNet: Performance-ImageNet

[Figure: ImageNet top-1 error versus parameters (× 10^6) and versus FLOPs (× 10^9). FishNet: 22.59%, 21.93%, 21.55%, 21.25%. ResNet: 23.78%, 22.30%, 21.69%.]

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance-ImageNet

[Figure: ImageNet top-1 error versus parameters (× 10^6). FishNet: 22.59%, 21.93% (5.92%), 21.55% (5.86%), 21.25% (5.76%). DenseNet: 22.58% (6.35%), 22.20% (6.20%), 22.15% (6.12%), 21.20%. ResNet: 23.78% (7.00%), 22.30% (6.20%), 21.69% (5.94%).]

Code: https://github.com/kevin-ssy/FishNet

FishNet: Performance on COCO Detection and Segmentation

[Figure: AP on COCO detection and instance segmentation for the R-50, RX-50, and Fish-150 backbones]

Code: https://github.com/kevin-ssy/FishNet


Winning COCO 2018 Instance Segmentation Task


Visualization


Codebase

• Comprehensive

RPN Fast/Faster R-CNN

Mask R-CNN FPN

Cascade R-CNN RetinaNet

More … …

• High performance

Better performance

Optimized memory consumption

Faster speed

• Handy to develop

Written with PyTorch

Modular design

GitHub: mmdet


FishNet: Advantages

1. Better gradient flow to shallow layers

2. Features

➢ contain rich low-level and high-level semantics

➢ are preserved and refined from each other

Code: https://github.com/kevin-ssy/FishNet

Outline

Introduction

Structured deep learning

Back-bone model design: FishNet (NeurIPS18), Optical flow guided feature (CVPR18)

Conclusion


Action Recognition

• Recognize action from videos


Optical flow in Action Recognition

• Motion is important information

• Optical flow

– Effective

– Time consuming

Modality            Acc.    Speed (fps)
RGB                 85.5%   680
RGB + Optical Flow  94.0%   14

We need a better motion representation


Optical flow guided feature

{vx , vy} = optical flow

Coefficient for optical flow:

Optical flow:

Intuitive Inspiration
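The formulas on this slide did not survive extraction. As a hedged reminder of the standard optical flow constraint they most likely refer to (an assumption, since the original equations are not visible), brightness constancy and its first-order expansion can be written as follows, with $I$ the image intensity and $(v_x, v_y)$ the optical flow:

```latex
% Brightness constancy (standard assumption)
I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t)

% First-order expansion gives the optical flow constraint
\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0

% Equivalently, the coefficient vector is orthogonal to (v_x, v_y, 1)
\left[ \frac{\partial I}{\partial x},\; \frac{\partial I}{\partial y},\; \frac{\partial I}{\partial t} \right]
\cdot \left[ v_x,\; v_y,\; 1 \right]^{\top} = 0
```

Under this reading, the spatial and temporal gradients are the "coefficients for optical flow": they constrain the flow without requiring it to be computed explicitly.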


Feature flow:

{ } = feature flow

Optical flow guided feature

[Figure: OFF network. Frame t and Frame t + ∆t pass through feature sub-networks at resolutions K*K, K/2*K/2, and K/4*K/4; Feature t and Feature t + ∆t at each resolution feed an OFF unit (1 × 1 Conv on each input, Sobel (S), Subtract, Concat (C)); the OFF unit outputs are concatenated and fed to the classification sub-network]

Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, Wei Zhang. "Optical Flow Guided Feature: A Motion Representation for Video Action Recognition", Proc. CVPR, 2018.
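The Sobel, Subtract, and Concat steps of the OFF unit can be sketched on a single-channel feature map. This is a simplified illustration under stated assumptions: the 1 × 1 convolutions are omitted, spatial gradients are taken on the features of frame t only, borders are zero-padded, and the names `filter3x3` and `off_unit` are hypothetical rather than taken from the released code.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def filter3x3(fmap, kernel):
    """Cross-correlate a 2-D map with a 3x3 kernel, using zero padding at the border."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di + 1][dj + 1] * fmap[y][x]
            out[i][j] = acc
    return out

def off_unit(feat_t, feat_t_next):
    """Return [spatial gradient x, spatial gradient y, temporal difference] as three channels."""
    fx = filter3x3(feat_t, SOBEL_X)   # Sobel: horizontal spatial gradient
    fy = filter3x3(feat_t, SOBEL_Y)   # Sobel: vertical spatial gradient
    ft = [[b - a for a, b in zip(row_a, row_b)]          # Subtract: temporal difference
          for row_a, row_b in zip(feat_t, feat_t_next)]
    return [fx, fy, ft]               # Concat along the channel axis
```

Both operations are cheap element-wise or fixed-kernel computations on features that the network already produces, which is why this representation avoids the cost of computing optical flow.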


Optical Flow Guided Feature (OFF): Experimental results

1. OFF with only RGB inputs is comparable with the other state-of-the-art methods using optical flow as input.

[Figure: speed (FPS) and accuracy (%) of RGB + Optical flow + I3D, RGB + OFF, and RGB + OFF + Optical Flow]

Code:


Not only for action recognition

• Also effective for

– Video object detection

– Video denoising


Optical Flow Guided Feature (OFF): Experimental results

[Figure: detection (mAP) for resnet+rfcn versus resnet+rfcn+OFF, and compression artifact removal (PSNR, q40) for DnCNN versus DnCNN+OFF]

1. q40 means quantization factor.


Video Compression

The figure is from Bernd Girod’s slides


[Timeline, 1990 to 2010: H.261, H.262, H.263, H.264, H.265]

Disadvantages:

• Hand-crafted techniques

• Not friendly for emerging content

• Not easy to improve the efficiency in the old pipeline

What happens when video compression meets deep learning?


Traditional Video Compression

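[Figure: block diagram of a traditional video codec. The current frame x_t and the previous reconstructed frame x̂_{t−1} from the decoded frames buffer go through block-based motion estimation (motion v_t) and motion compensation to give the prediction x̄_t; the residual r_t = x_t − x̄_t is transformed, quantized (Q), and entropy coded; the inverse transform gives r̂_t, which is used to reconstruct x̂_t]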

Traditional Video Compression

[Figure: codec diagram, prediction step. Inputs x_t and x̂_{t−1} → Motion Estimation → motion vectors v_t → Motion Compensation → prediction x̄_t]

Traditional Video Compression

[Figure: codec diagram, transform step. r_t → DCT → Q → IDCT → r̂_t]

Example 4 × 4 block

r_t:
 3   7   4   2
 0   3   4   5
 1  10   8   4
 8   0   8   6

After DCT:
18  -2  -4   1
-3   1  -2  -4
 1   3   4   3
 3   2  -5  -3

After Q:
16  -1  -2   0
-2   1  -1  -2
 1   2   4   2
 2   2  -4  -2

r̂_t after IDCT:
 4   5   3   2
 1   2   4   4
 1   9   6   3
 7   2   6   6
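The transform, quantize, and inverse-transform chain can be reproduced in a few lines. The sketch below is illustrative only: it assumes an orthonormal 4 × 4 DCT-II and a uniform quantizer with a hypothetical step size of 2, neither of which is stated on the slide; the residual block is the 4 × 4 example read from the slide.

```python
import math

N = 4
# Orthonormal DCT-II basis: row u is the u-th cosine basis function.
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N)) for x in range(N)]
     for u in range(N)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def dct2(block):
    """2-D DCT: C * X * C^T."""
    return matmul(matmul(C, block), transpose(C))

def idct2(coef):
    """2-D inverse DCT: C^T * Y * C (C is orthonormal)."""
    return matmul(matmul(transpose(C), coef), C)

def quantize(coef, step):
    """Uniform quantization followed by de-quantization (the lossy step)."""
    return [[step * round(v / step) for v in row] for row in coef]

# Example residual block r_t (values taken from the slide's example).
RESIDUAL = [[3, 7, 4, 2], [0, 3, 4, 5], [1, 10, 8, 4], [8, 0, 8, 6]]

coefficients = dct2(RESIDUAL)
reconstructed = idct2(quantize(coefficients, step=2.0))  # step size is an assumption
```

Without the quantization step the round trip is exact up to floating-point error, so all of the loss, and all of the bit-rate saving, comes from `quantize`.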

Traditional Video Compression

[Figure: codec diagram with the optimization objective]

min 𝜆𝐷 + 𝑅

Distortion Bit rate
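The objective trades distortion D against bit rate R through the multiplier λ. A minimal sketch of how such a cost selects between coding options is given below; the candidate dictionaries and the function names `rd_cost` and `best_mode` are hypothetical and only illustrate the formula on this slide.

```python
def rd_cost(distortion, rate, lam):
    """Rate-distortion cost in the form shown on the slide: lambda * D + R."""
    return lam * distortion + rate

def best_mode(candidates, lam):
    """Pick the candidate with the lowest rate-distortion cost."""
    return min(candidates, key=lambda c: rd_cost(c["distortion"], c["rate"], lam))

candidates = [
    {"name": "coarse", "distortion": 4.0, "rate": 10.0},  # few bits, more error
    {"name": "fine", "distortion": 1.0, "rate": 30.0},    # more bits, less error
]
low_lambda_choice = best_mode(candidates, lam=1.0)    # rate dominates the cost
high_lambda_choice = best_mode(candidates, lam=10.0)  # distortion dominates the cost
```

A larger λ penalizes distortion more heavily, so the choice shifts toward options that spend more bits.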

Deep Video Compression Model

Method

[Figure: the traditional codec diagram alongside its learned counterpart: Optical Flow Net (v_t), MV Encoder Net (m_t), quantization Q (m̂_t), MV Decoder Net (v̂_t), Motion Compensation Net, Residual Encoder Net (y_t), quantization Q (ŷ_t), Residual Decoder Net, and Bit Rate Estimation Net (inputs m̂_t and ŷ_t)]

min D + λR


Deep Video Compression Model

• Experimental Results


Take home message


• Structured deep learning is

– effective

– for output and features

– from observation

• End-to-end joint training bridges the gap between structure modeling and feature learning