Some Recent Works on Human Activity Recognition
Xinxiao Wu (吴心筱), [email protected]
TRANSCRIPT
Action Description
Action, Object and Scene
Multi-View Action Recognition
Action Detection
Complex Activity Recognition
Multimedia Event Detection
Extension of Interest Points
Extension of Bag-of-Words
Mid-level Attribute Feature
Dense Trajectory
Action Bank
Action Description
Bregonzio et al., CVPR, 2009
Clouds of interest points accumulated over multiple temporal scales
Extension of Interest Points
Matteo Bregonzio, Shaogang Gong and Tao Xiang. Recognising Action as Clouds of Space-Time Interest Points. CVPR 2009.
Holistic features of the clouds capture the spatio-temporal distribution of the interest points.
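As a minimal numpy sketch of such holistic cloud features, accumulated over multiple temporal scales (the feature names and the exact feature set here are illustrative assumptions, not the paper's exact definition):

```python
import numpy as np

def cloud_features(points, frame_shape):
    """Holistic features of a cloud of space-time interest points.

    points: (N, 3) array of (x, y, t) interest-point coordinates.
    frame_shape: (height, width) of the video frames.
    The feature set below is illustrative, not the paper's exact one.
    """
    xy = points[:, :2].astype(float)
    centroid = xy.mean(axis=0)               # cloud centre
    spread = xy.std(axis=0)                  # spatial extent of the cloud
    h, w = frame_shape
    density = len(points) / float(h * w)     # points per pixel
    return np.concatenate([centroid, spread, [density]])

def multi_scale_clouds(points, temporal_scales, frame_shape):
    """Accumulate clouds over several temporal window lengths ending at
    the most recent frame, and concatenate their holistic features."""
    t_max = points[:, 2].max()
    feats = []
    for scale in temporal_scales:
        cloud = points[points[:, 2] > t_max - scale]
        feats.append(cloud_features(cloud, frame_shape))
    return np.concatenate(feats)
```

Each temporal scale contributes one fixed-length feature vector, so the final descriptor length is the per-cloud feature count times the number of scales.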
Wu et al., CVPR, 2011
Multi-scale spatio-temporal (ST) context distribution feature
Characterize the spatial and temporal context distributions of interest points over multiple space-time scales.
Extension of Interest Points
Xinxiao Wu, Dong Xu, Lixin Duan and Jiebo Luo. Action recognition using context and appearance distribution features. CVPR 2011.
A set of XYT relative coordinates between the center interest point and other interest points in a local region.
Multi-scale local regions across multiple space-time scales.
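A sketch of collecting these relative XYT coordinates at multiple space-time scales; the cubic region shape and the scale parameterization are simplifying assumptions:

```python
import numpy as np

def st_context(points, radii):
    """Multi-scale spatio-temporal context of interest points (a sketch).

    points: (N, 3) array of (x, y, t) coordinates of interest points.
    radii:  space-time region radii, one per scale.
    For each point, collect the XYT offsets of the other points that
    fall inside a local region around it, at every scale.
    """
    contexts = []  # contexts[s][i] = (M_i, 3) offsets at scale s
    for r in radii:
        per_point = []
        for i, p in enumerate(points):
            offsets = points - p                      # relative XYT coordinates
            inside = (np.abs(offsets) <= r).all(axis=1)
            inside[i] = False                         # exclude the centre point
            per_point.append(offsets[inside])
        contexts.append(per_point)
    return contexts
```

The offsets collected per point and per scale would then be turned into a fixed-length distribution feature (e.g. a histogram over offset bins).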
Wu et al., CVPR, 2011
A global GMM is trained using all local features from all the training videos.
The video-specific GMM for a given video is generated from the global GMM via Maximum A Posteriori (MAP) adaptation.
Extension of Bag-of-Words
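The MAP mean-adaptation step can be sketched as follows. This uses a standard relevance-factor formulation with diagonal covariances; the parameter `tau` and this exact formulation are assumptions, not necessarily the paper's setup:

```python
import numpy as np

def map_adapt_means(x, weights, means, covs, tau=10.0):
    """MAP-adapt the means of a global diagonal-covariance GMM to one video.

    x:       (N, D) local features of the video.
    weights: (K,) global mixture weights.
    means:   (K, D) global component means.
    covs:    (K, D) global diagonal covariances.
    tau:     relevance factor controlling the prior strength (assumed name).
    Returns the video-specific means; a sketch of the adaptation step only.
    """
    N, D = x.shape
    K = means.shape[0]
    # responsibilities gamma[n, k] proportional to w_k * N(x_n | mu_k, Sigma_k)
    log_p = np.empty((N, K))
    for k in range(K):
        diff = x - means[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(np.log(2 * np.pi * covs[k]))
                       - 0.5 * np.sum(diff ** 2 / covs[k], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # MAP interpolation between the data mean and the global prior mean
    n_k = gamma.sum(axis=0)                              # soft counts
    x_bar = gamma.T @ x                                   # (K, D) weighted sums
    return (x_bar + tau * means) / (n_k + tau)[:, None]
```

Components that explain many of the video's features move toward the data; components with little support stay close to the global prior means.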
Kovashka and Grauman, CVPR, 2010
Exploit multiple “bag-of-words” models to represent a hierarchy of space-time configurations at different scales.
Extension of Bag-of-Words
A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. CVPR, 2010.
Savarese et al., WMVC, 2008
Use a local histogram to capture co-occurrences of words in a local region.
Extension of Bag-of-Words
S. Savarese, A. Delpozo, J.C. Niebles and L. Fei-Fei. Spatial-temporal correlatons for unsupervised action classification. WMVC, 2008.
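A sketch of such a local word co-occurrence histogram; the Chebyshev region shape is a simplifying assumption and this is not the paper's exact correlogram definition:

```python
import numpy as np

def cooccurrence_histogram(labels, positions, vocab_size, radius):
    """Local co-occurrence histogram of visual words (a sketch).

    labels:    (N,) visual-word index of each interest point.
    positions: (N, 3) (x, y, t) coordinates of the points.
    radius:    size of the local space-time region.
    Counts how often word pairs (a, b) co-occur within the region.
    """
    hist = np.zeros((vocab_size, vocab_size), dtype=int)
    for i in range(len(labels)):
        d = np.abs(positions - positions[i]).max(axis=1)  # Chebyshev distance
        neighbours = np.where((d <= radius) & (np.arange(len(labels)) != i))[0]
        for j in neighbours:
            hist[labels[i], labels[j]] += 1
    return hist
```

Unlike a plain bag of words, the pairwise counts retain information about which words tend to appear near each other in space-time.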
M. Ryoo and J. Aggarwal, ICCV, 2009.
Propose a “feature type × feature type × relationship” histogram to capture both the appearance of and the relationship between pairwise visual words.
Extension of Bag-of-Words
M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. ICCV, 2009.
Liu et al., CVPR, 2011.
Action attributes: a set of intermediate concepts.
A unified framework: action attributes are effectively selected in a discriminative fashion.
Data-driven Attributes.
Mid-level Attribute Feature
Jingen Liu, Benjamin Kuipers and Silvio Savarese. Recognizing Human Actions by Attributes. CVPR, 2011.
Wang et al., CVPR, 2011.
Sample dense points from each frame and track them based on displacement information from a dense optical flow field.
Dense Trajectory
Heng Wang, Alexander Klaser, Cordelia Schmid and Cheng-Lin Liu. Action Recognition by Dense Trajectories. CVPR, 2011.
Wang et al., CVPR, 2011.
Four descriptors: Trajectory shape, HOG (histograms of oriented gradients), HOF (histograms of optical flow), and MBH (motion boundary histograms).
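The core tracking step can be sketched as follows, using precomputed dense flow fields; the real method also median-filters the flow and re-samples points, which is omitted here:

```python
import numpy as np

def track_points(points, flows):
    """Track sampled points through a sequence of dense flow fields (a sketch).

    points: (N, 2) float array of (x, y) positions sampled in the first frame.
    flows:  list of (H, W, 2) dense optical-flow fields, flow[y, x] = (dx, dy).
    Each point is displaced by the flow at its (rounded) current position,
    yielding one trajectory per point.
    """
    h, w = flows[0].shape[:2]
    trajs = [points.copy()]
    for flow in flows:
        p = trajs[-1]
        xi = np.clip(np.round(p[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(p[:, 1]).astype(int), 0, h - 1)
        trajs.append(p + flow[yi, xi])
    return np.stack(trajs)          # (T+1, N, 2)
```

The Trajectory descriptor is then built from the sequence of displacements along each track, while HOG/HOF/MBH are computed in a space-time volume around it.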
Sadanand and Corso, CVPR, 2012.
Object Bank → Action Bank
Action Bank: a large set of action detectors.
Action Bank
Sreemanananth Sadanand and Jason J. Corso. Action Bank: A High-Level Representation of Activity in Video, CVPR, 2012.
Nazli Ikizler-Cinbis and Stan Sclaroff, ECCV, 2010
Combine the information from person, object and scene
Multiple instance learning + multiple kernel learning
A bag contains all the instances extracted from a video for a particular feature channel.
Different features have different kernel weights.
Nazli Ikizler-Cinbis and Stan Sclaroff, Object, Scene and Actions: Combining Multiple Features for Human Action Recognition, ECCV, 2010.
Marcin Marszalek, Ivan Laptev and Cordelia Schmid, CVPR 2009.
Automatically discover the relation between scene classes and human actions by mining movie scripts.
Marcin Marszalek, Ivan Laptev and Cordelia Schmid, Actions in Context, CVPR, 2009.
Weinland et al., ICCV, 2009.
A 3D visual hull, built from a system of five calibrated cameras, represents each action exemplar.
Daniel Weinland, Edmond Boyer and Remi Ronfard. Action recognition from arbitrary views using 3D exemplars. ICCV, 2009.
View-invariant
Weinland et al., ICCV, 2009.
3D exemplar-based HMM for classification
View-invariant
Yan et al., CVPR, 2008.
4D action feature: 3D shapes over time (4D)
Pingkun Yan, Saad M. Khan, Mubarak Shah. Learning 4D Action Feature Models for Arbitrary View Action Recognition. CVPR, 2008.
View-invariant
Junejo et al., IEEE T-PAMI, 2008.
A novel view-invariant feature: self-similarity descriptor
Frame-to-frame similarity
Imran N. Junejo, Emilie Dexter, Ivan Laptev and Patrick Perez. View-independent action recognition from temporal self-similarities. IEEE T-PAMI, 2008.
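The frame-to-frame self-similarity matrix at the heart of this descriptor can be sketched in a few lines; the choice of per-frame descriptor (e.g. HOG of the person box) is an assumption:

```python
import numpy as np

def self_similarity_matrix(descriptors):
    """Temporal self-similarity matrix of a video (a sketch).

    descriptors: (T, D) per-frame descriptors.
    Entry (i, j) is the Euclidean distance between frames i and j; the
    pattern of this matrix is approximately stable across viewpoints,
    which is what makes the descriptor view-independent.
    """
    sq = np.sum(descriptors ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * descriptors @ descriptors.T
    return np.sqrt(np.maximum(d2, 0.0))
```

Local patterns extracted from this matrix then serve as the view-independent action features.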
View-invariant
Lewandowski et al., ECCV, 2010.
View-independent manifold representation
A stylistic invariant embedded manifold is produced to describe an action for each view.
All view-dependent manifolds are automatically combined to generate a unified manifold.
Michal Lewandowski, Dimitrios Makris and Jean-Christophe Nebel. View and style-independent action manifolds for human activity recognition. ECCV, 2010.
View-invariant
Wu and Jia, ECCV, 2012.
Propose a latent kernelized structural SVM.
The view index is treated as a latent variable and inferred during both training and testing.
Xinxiao Wu and Yunde Jia. View-Invariant action recognition using latent kernelized structural SVM. ECCV, 2012.
Cross-view
Liu et al., CVPR, 2011.
Learn the bilingual-words from both source view and target view.
Transfer action models between two views via the bag-of-bilingual-words model.
Jingen Liu, Mubarak Shah, Benjamin Kuipers and Silvio Savarese. Cross-View Action Recognition via View Knowledge Transfer. CVPR 2011.
Cross-view
Li et al., CVPR, 2012.
Propose “virtual views” to connect action descriptors from source view and target view.
Each virtual view is associated with a linear transformation of the action descriptor, and the sequence of transformations arising from the sequence of virtual views bridges the source and target views.
Ruonan Li and Todd Zickler. Discriminative virtual views for cross-view action recognition. CVPR, 2012.
Cross-view
Wu et al., PCM, 2012.
Transfer Discriminant-Analysis of Canonical Correlations (Transfer DCC).
Minimize the mismatch between data distributions of source and target views.
Xinxiao Wu, Cuiwei Liu, and Yunde Jia. Transfer discriminant-analysis of canonical correlations for view-transfer action recognition, PCM, 2012.
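As background, the canonical-correlation building block that Transfer DCC extends can be sketched as below; this is only the basic correlation computation between two subspaces, not the discriminant analysis or the distribution-matching term:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two subspaces (a sketch).

    X, Y: (D, d) basis matrices of two subspaces, e.g. representing an
    action observed from the source and the target view. The canonical
    correlations are the singular values of Qx^T Qy, where Qx and Qy are
    orthonormal bases of the two subspaces.
    """
    Qx, _ = np.linalg.qr(X)       # orthonormalise each basis
    Qy, _ = np.linalg.qr(Y)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)
```

Correlations near one indicate closely aligned subspaces; Transfer DCC builds its view-transfer similarity on these values while also minimizing the mismatch between the source and target data distributions.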
Yuan et al., IEEE T-PAMI, 2012.
A discriminative pattern matching criterion for action classification: naïve-Bayes mutual information maximization (NBMIM)
An efficient search algorithm: spatio-temporal branch-and-bound (STBB) search algorithm
Junsong Yuan, Zicheng Liu, and Ying Wu, Discriminative video pattern search for efficient action detection, IEEE T-PAMI, 2012.
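A sketch of the two ideas: per-feature discriminative votes in the NBMIM spirit, and a max-sum search, shown here only as its 1-D temporal analogue (Kadane's algorithm) rather than the full spatio-temporal branch-and-bound; the kernel-density approximation and `lam` are assumptions:

```python
import numpy as np

def nbmim_scores(dists_pos, dists_neg, lam=1.0):
    """Per-feature NBMIM-style votes (a sketch of the scoring idea).

    dists_pos / dists_neg: (N,) nearest-neighbour distances of each local
    feature to the positive / negative training sets. Under a kernel-
    density view, log P(d|+) - log P(d|-) is approximated by
    lam * (dist_neg^2 - dist_pos^2).
    """
    return lam * (dists_neg ** 2 - dists_pos ** 2)

def best_temporal_window(frame_scores):
    """Max-sum contiguous window (Kadane's algorithm): the 1-D analogue
    of searching for the best-scoring spatio-temporal subvolume."""
    best, cur, start = -np.inf, 0.0, 0
    span = (0, 0)
    for t, s in enumerate(frame_scores):
        if cur <= 0:
            cur, start = s, t
        else:
            cur += s
        if cur > best:
            best, span = cur, (start, t)
    return best, span
```

Features close to the positive class vote positively, and the detected action is the contiguous region whose votes sum highest.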
Hu et al., ICCV, 2009.
The candidate regions of an action are treated as a bag of instances.
A novel multiple-instance learning framework, named SMILE-SVM (Simulated annealing Multiple Instance Learning Support Vector Machines), is proposed for learning a human action detector.
Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong and Thomas S. Huang. Action detection in complex scenes with spatial and temporal ambiguities. ICCV, 2009.
Gaidon et al., CVPR, 2011.
Actom Sequence Model: represent an activity as a sequence of atomic action-anchored visual features.
Automatically detect atomic actions from an input activity video.
A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient action detection. CVPR, 2011.
Hoai et al., CVPR, 2011.
Jointly perform video segmentation and action recognition.
M. Hoai, Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. CVPR, 2011.
Tang et al., CVPR, 2012.
Each activity is modeled by a set of latent state variables and duration variables.
The states are cluster centers obtained by clustering all fixed-length video clips from the training data.
A max-margin discriminative model is introduced to learn the temporal structure of complex events.
K. Tang, F.-F. Li, and D. Koller. Learning latent temporal structure for complex event detection. CVPR, 2012.
Izadinia and Shah, ECCV, 2012.
A latent discriminative model is proposed to detect low-level events by modeling the co-occurrence relationships between different low-level events in a graph.
Each video is divided into short clips, and each clip is manually annotated with one low-level event label; these annotations are used for training the low-level event detectors.
H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. ECCV, 2012.