CVPR 2009 Quick Review: Action Recognition

Presenter: Li Zhe (李哲)

JDL, Institute of Computing Technology, Chinese Academy of Sciences, September 18, 2009

23/4/21 1

# Paper

Title: Recognizing Realistic Actions from Videos “in the Wild”

Authors: Jingen Liu, Jiebo Luo, Mubarak Shah

Paper ID: 0598

# Outline

• About the authors
• Abstract
• Background
• Paper structure
• Problem statement
• Algorithm
• Experimental results
• Conclusions

# First Author: Jingen Liu

PhD student, School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA

Research interests: scene understanding and recognition, action recognition, video content analysis and retrieval, object recognition, and crowd tracking.

Papers: 2009: CVPR (2), ICCV (2), ICASSP (1); 2008: ICPR (1), CVPR (2), TRECVID (1)

Background: Ph.D. (in progress): University of Central Florida, Orlando, FL, USA; B.S. and M.S.: Huazhong University of Science and Technology, Wuhan, China.

Homepage: http://www.cs.ucf.edu/~liujg/

# Second Author: Jiebo Luo

Senior Principal Scientist, Kodak Research Laboratories, Rochester, NY

Research interests: image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication.

Academic contributions: IEEE Fellow (2009); 120+ papers; 40+ granted U.S. patents

Background: Senior Principal Scientist (1999-present), Principal Research Scientist (1996-1999), Senior Research Scientist (1995-1996): Kodak Research Laboratories; Ph.D. (1995): Electrical Engineering, University of Rochester; B.S. (1989) and M.S. (1992): Electrical Engineering, University of Science and Technology of China.

Homepage: http://sites.google.com/site/jieboluo/Home

# Third Author: Mubarak Shah

Agere Chair Professor, School of Electrical Engineering & Computer Science, University of Central Florida, Orlando, FL

Research interests: image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication.

Academic contributions: IEEE Fellow (2003); books (2), book chapters (10), journal papers (60), conference papers (130) (counts before 2006)

Background: M.S. (1982) and Ph.D. (1986): Wayne State University, Detroit, Michigan (major: Computer Engineering; minor: Mathematics); E.D.E. (1980): postgraduate diploma, Philips International Institute of Technological Studies, Eindhoven, The Netherlands (major: Speech Recognition); B.S. (1979): National College of Engineering & Technology, Karachi, Pakistan (major: Electronics).

Homepage: http://server.cs.ucf.edu/~vision/faculty/shah.html (not available?); http://unjobs.org/authors/mubarak-shah (CV, 2006)

# Abstract

In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.

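# Abstract (translated from the Chinese slide)

This paper proposes a systematic framework for recognizing realistic actions in videos of natural scenes. Unconstrained videos are abundant in personal collections and on the web, but action recognition on such videos is still far from solved because of the large uncertainty introduced by camera motion, background clutter, and changes in object appearance and scale. The main challenge is how to extract reliable and informative features from these unconstrained videos. The authors extract both motion and static features. Because both kinds of raw features are dense and noisy, they propose strategies to prune them: motion statistics are used to obtain reliable motion features and clean static features; PageRank is used to mine the most informative static features; a divisive information-theoretic algorithm groups semantically related features to build compact yet discriminative visual vocabularies; finally, AdaBoost integrates all the heterogeneous yet complementary features. Action localization and recognition tests on the standard KTH dataset and on the authors' own dataset (11 action categories collected from YouTube and personal videos) both gave impressive results.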

# Background (1/2): Interest Point Detection

Blob detection: aimed at detecting points and/or regions in the image that are either brighter or darker than their surroundings.

Corner detection: aimed at detecting “corner” points in the image. A “corner” can be defined as the intersection of two edges, or as a point for which there are two dominant and different edge directions in a local neighborhood of the point.

Fig. Input and output of a typical corner detection algorithm

# Background (2/2): Bag of Video-Words

[Figure: bag of video-words illustration; source not given on the slide]

# Problem

Recognizing realistic actions from videos “in the wild”

Realistic actions vs. template-based actions

Videos “in the wild” vs. videos recorded in a lab environment

Challenges in realistic videos: large variation in
• camera motion
• cluttered background
• viewpoint
• object scale
• illumination conditions
• object appearance and pose

KTH dataset (example frames): boxing, clapping, jogging, waving

YouTube dataset (example frames): basket shooting, biking, swing, tennis shooting

# Framework

1. Input videos
2. Motion & static feature extraction
3. Motion & static feature pruning
4. Motion & static vocabulary learning
5. Histogram-based video representation
6. Boosted learning & localization

Contributions
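The histogram-based video representation stage of the framework can be illustrated with a minimal sketch. It assumes that a vocabulary is given as a list of cluster centers and that each feature descriptor is assigned to its nearest center by squared Euclidean distance; the slides do not specify the assignment rule, and the names `nearest_word` and `video_histogram` are hypothetical.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary center closest to the descriptor (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - w) ** 2 for d, w in zip(descriptor, vocabulary[k])),
    )


def video_histogram(descriptors, vocabulary):
    """Normalized histogram of visual-word counts for one video."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Toy example: a two-word vocabulary and four 2-D descriptors (made-up numbers).
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (1.0, 0.0)]
histogram = video_histogram(descriptors, vocabulary)
```

In this toy case three descriptors fall on the first word and one on the second. In the framework, one such histogram would be built per feature type (motion and static) before the boosted learning stage.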

# Motivation (1/3): Static Features

Why static features?
• In realistic videos, motion features are unreliable because of unpredictable and often unintended camera motion (camera shake).
• Correlated objects help action recognition, e.g. the ball in “soccer juggling” or the horse in “horseback riding”.
• Static features are complementary to motion features.

How to get static features?
• Interest point detectors: corner features and blob features.

# Motivation (2/3): Feature Pruning

Why feature pruning?
• Motion feature pruning: discard the motion features caused by camera movement or shake.
• Static feature pruning: select the significant static features.

How to prune features?
• Motion feature pruning: use feature statistics and the distribution of spatial locations.
• Static feature pruning: PageRank.

# Motivation (3/3): Vocabulary Learning

Why vocabulary learning?
• To obtain compact yet discriminative visual vocabularies for motion and static features.
• A large visual vocabulary performs better, but over-specific visual words may eventually over-fit the data.
• Combining the two feature types may be more useful than using either one alone.

How to learn vocabularies?
• Use an information-theoretic measure to refine the initial vocabularies by feature grouping.

# Algorithm: Motion Feature Detection

Spatiotemporal interest point detector [P. Dollar et al., VS-PETS 2005]

Input: video
1. Filters: 2-D Gaussian filter in space, 1-D Gabor filter in time
2. Points: local maxima of the filter response; areas: 3-D cuboids around the points
3. Flattened gradient vectors of the cuboids
4. PCA to reduce the dimensionality
Output: motion features
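The temporal half of the filtering step can be sketched as follows. The sketch assumes the response function usually associated with the cited detector, R = (I ∗ g ∗ h_ev)² + (I ∗ g ∗ h_od)², where h_ev and h_od are an even/odd pair of 1-D Gabor filters, and it treats a single pixel's intensity over time, with the spatial Gaussian smoothing assumed to have been applied already. The values of `tau` and `omega`, the filter support, and all function names are illustrative assumptions, not settings taken from the paper.

```python
import math


def gabor_pair(tau, omega):
    """Even/odd temporal Gabor filters sampled at integer frame offsets."""
    half = int(math.ceil(2 * tau))  # assumed support: about +/- 2*tau frames
    offsets = range(-half, half + 1)
    h_ev = [-math.cos(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in offsets]
    h_od = [-math.sin(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in offsets]
    return h_ev, h_od


def convolve_valid(signal, kernel):
    """1-D convolution restricted to positions where the kernel fully overlaps the signal."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def temporal_response(signal, tau=2.0, omega=0.25):
    """Quadrature energy of one pixel's (spatially smoothed) intensity over time."""
    h_ev, h_od = gabor_pair(tau, omega)
    even = convolve_valid(signal, h_ev)
    odd = convolve_valid(signal, h_od)
    return [e * e + o * o for e, o in zip(even, odd)]


def local_maxima(values, threshold):
    """Indices of interior samples above the threshold and not smaller than both neighbors."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i] >= values[i - 1] and values[i] >= values[i + 1]
    ]


# A pixel that oscillates with a period of 4 frames versus a pixel that never changes.
flicker_r = temporal_response([math.cos(math.pi * t / 2) for t in range(17)])
constant_r = temporal_response([1.0] * 17)
```

The oscillating pixel produces a much larger response than the constant one, which is why periodic motion yields interest points. A full detector would compute this response at every pixel and take local maxima over space and time.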

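# Algorithm: Motion Feature Pruning

Camera shake only affects motion feature detection in a few frames, so the affected frames can simply be discarded.

Rules
• Rule 1 (remove abrupt camera motion): if a frame contains too many features, discard the whole frame.
• Rule 2 (select good features): in each remaining frame, keep a fixed proportion of the features, namely those closest to the mean location of all features in that frame.

About 8% improvement in average accuracy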
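The two pruning rules can be sketched in a few lines. The threshold `max_per_frame` and the proportion `keep_ratio` are hypothetical parameters chosen for illustration; the slides do not give the values used in the paper.

```python
import math


def prune_motion_features(frames, max_per_frame, keep_ratio):
    """frames maps a frame index to a list of (x, y) feature locations.

    Rule 1: drop frames with more than max_per_frame features (abrupt camera motion).
    Rule 2: in the remaining frames keep the keep_ratio share of features that lie
            closest to the mean feature location of that frame.
    """
    pruned = {}
    for index, points in frames.items():
        if not points or len(points) > max_per_frame:
            continue  # Rule 1 (empty frames are skipped as well)
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
        ranked = sorted(points, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
        n_keep = max(1, math.ceil(keep_ratio * len(points)))
        pruned[index] = ranked[:n_keep]  # Rule 2
    return pruned


# Frame 0 has four features with one outlier; frame 1 is flooded with 50 features.
frames = {
    0: [(0, 0), (1, 0), (0, 1), (10, 10)],
    1: [(i, i) for i in range(50)],
}
kept = prune_motion_features(frames, max_per_frame=20, keep_ratio=0.5)
```

In this toy input frame 1 is discarded entirely, and frame 0 keeps the two features nearest to its mean location.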

# Algorithm: Static Feature Detection

Interest point detectors
• Harris-Laplacian (HAR) detector
• Hessian-Laplacian (HES) detector
• MSER detector

Pruning using context information
• Detect regions of interest by motion statistics
• Use PageRank to preserve consistent features

[Figure panels: corner features; blob features]

# Algorithm: Static Feature Pruning (1/2)

Motivation: foreground features have motion-consistent matches throughout the entire video sequence. Background features, however, are unstable over the entire video sequence.

Using PageRank to preserve consistent (significant) features
• Construct a feature network G = (V, E) for a given video, with an n×n weight matrix W.
  V: set of vertices (static features, i.e. image patches)
  E: set of weighted edges (feature similarity) [24]
• Rank the features by their persistence (importance):

$$\mathrm{Pr} = \alpha W^{\top}\mathrm{Pr} + \left(\alpha\, b^{\top}\mathrm{Pr} + 1 - \alpha\right) v$$

where
• $\alpha$: scaling factor (0.85 in the experiments)
• $b$: indicator vector identifying the vertices with zero out-degree
• $W$: weight matrix
• $v$: n×1 transport vector with a uniform probability distribution over the vertices

The initial PR value of each vertex is 1/n.
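The ranking step can be sketched as a plain power iteration. This is an illustration under stated assumptions rather than the authors' implementation: the weight matrix is row-normalized inside the function, vertices with zero out-degree hand their score to the uniform transport vector, and the names `pagerank` and `top_fraction` as well as the toy graph are made up for the example.

```python
def pagerank(weights, alpha=0.85, iterations=200, tol=1e-12):
    """Power iteration for PageRank on a weighted graph given as an n x n list of lists."""
    n = len(weights)
    rows, dangling = [], []
    for row in weights:
        total = sum(row)
        dangling.append(total == 0)  # plays the role of the indicator vector b
        rows.append([w / total for w in row] if total > 0 else [0.0] * n)
    uniform = 1.0 / n  # every entry of the transport vector v
    pr = [uniform] * n  # initial PR value 1/n for each vertex
    for _ in range(iterations):
        lost = sum(pr[i] for i in range(n) if dangling[i])  # b^T Pr
        new = [
            alpha * sum(rows[i][j] * pr[i] for i in range(n))
            + (alpha * lost + 1 - alpha) * uniform
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if change < tol:
            break
    return pr


def top_fraction(scores, fraction):
    """Indices of the highest-scoring vertices, keeping at least one."""
    k = max(1, int(fraction * len(scores)))
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy feature network: vertex 0 matches vertices 1-3; vertex 4 matches nothing.
W = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scores = pagerank(W)
kept = top_fraction(scores, 0.2)
```

The well-connected vertex 0 receives the highest score and the isolated vertex 4 the lowest, which matches the intuition that persistent foreground features are the ones with many consistent matches.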

# PageRank

PageRank is a variant of eigenvector centrality, which measures the importance of a node in a given network.

• Vertices are ranked by their relative importance.
• A vertex that neighbors an important vertex should rank higher.

[Figure: PageRank illustration, from Wikipedia]

# Algorithm: Static Feature Pruning (2/2)

Fig. Two examples, from riding (top) and cycling (bottom), demonstrate the effect of feature acquisition. The first row shows the selected features. The top 10% of features by PR value are retained.

# Algorithm: Learning Semantic Vocabulary

Information-theoretic divisive algorithm
• Input: the initial visual words $X$ and the distribution $p(C \mid X)$
• Output: visual word clusters $\hat{X}$
• Initialization: randomly assign the words to clusters
• The procedure is similar to k-means.

Two major steps

1. For each cluster $\hat{x}_i$, compute the prior and the “center”:

$$\pi(\hat{x}_i) = \sum_{x_t \in \hat{x}_i} \pi(x_t), \qquad p(C \mid \hat{x}_i) = \sum_{x_t \in \hat{x}_i} \frac{\pi(x_t)}{\pi(\hat{x}_i)}\, p(C \mid x_t)$$

2. Update the clusters: for each $x_t$, find its new cluster:

$$i^{*}(x_t) = \arg\min_{j}\ \mathrm{KL}\big(p(C \mid x_t),\ p(C \mid \hat{x}_j)\big)$$
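The two alternating steps can be sketched as below. This is a minimal illustration: the initial assignment is passed in explicitly (the slide uses a random one) so that the example is reproducible, empty clusters are simply skipped, and the function names and toy numbers are assumptions made for the example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def cluster_words(priors, cond, assignment, k, iterations=20):
    """priors[t] = pi(x_t), cond[t] = p(C | x_t), assignment[t] = initial cluster of word t."""
    assignment = list(assignment)
    n_classes = len(cond[0])
    for _ in range(iterations):
        # Step 1: prior and "center" p(C | cluster) of every cluster.
        cluster_prior = [0.0] * k
        centers = [[0.0] * n_classes for _ in range(k)]
        for t, j in enumerate(assignment):
            cluster_prior[j] += priors[t]
            for c in range(n_classes):
                centers[j][c] += priors[t] * cond[t][c]
        live = [j for j in range(k) if cluster_prior[j] > 0]
        for j in live:
            centers[j] = [value / cluster_prior[j] for value in centers[j]]
        # Step 2: move each word to the cluster whose center is closest in KL divergence.
        updated = [
            min(live, key=lambda j: kl_divergence(cond[t], centers[j]))
            for t in range(len(cond))
        ]
        if updated == assignment:
            break
        assignment = updated
    return assignment


# Four visual words, two action classes; words 0-1 favor class 0, words 2-3 favor class 1.
priors = [0.25, 0.25, 0.25, 0.25]
cond = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = cluster_words(priors, cond, assignment=[0, 0, 1, 0], k=2)
```

Starting from a deliberately poor assignment, the procedure moves word 3 into the second cluster, so words with similar class distributions end up grouped together. As with k-means, the result depends on the initialization.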

# Experiments: KTH dataset

• Static features: 82.3%
• Motion features: 87.1%
• Hybrid features: 91.8%

Static features: shape information

# YouTube dataset

Category labels shown on the slide: b_shooting, g_walking, t_jumping, s_juggling, cycling, t_swing, t_swinging, v_spiking, diving, swinging, r_riding

11 categories, about 1600 videos

# Experiments: YouTube dataset (1/3)

Figure A: performance comparison of the system with and without motion feature pruning. Average accuracy: 57% before pruning, 65.4% after pruning.

Figure B: performance comparison of the system with and without static feature pruning. Average accuracy: 58.1% before pruning, 63.0% after pruning.


# Experiments: YouTube dataset (2/3)

Average accuracy: motion 65.4%; static 63.1%; hybrid 71.2%

Fig. Comparison of classification performance using motion, static, and hybrid features.


# Experiments: YouTube dataset (3/3)

Fig. The confusion table for classification using hybrid features.

# Conclusions

• The paper presents a systematic framework for recognizing realistic actions from videos “in the wild”.
• Static features are complementary to motion features.
• Using motion cues to prune both motion and static features is helpful.
• Information-theoretic divisive clustering yields compact yet discriminative semantic visual vocabularies.

Thank you!
