Hyeonsoo Kang: Unsupervised Mining of Statistical Temporal Structures in Video


TRANSCRIPT

Page 1: Hyeonsoo , Kang

Hyeonsoo, Kang

Unsupervised Mining of Statistical Temporal Structures in Video

Page 2: Hyeonsoo , Kang

▫ Introduction

▫ Structure of the algorithm

1. Model learning algorithm
2. [Review HMM]
3. Feature selection algorithm

▫ Results

Page 3: Hyeonsoo , Kang

What is “supervised learning?”

Page 4: Hyeonsoo , Kang

What is “supervised learning?”

It is the approach in which the algorithm designers manually identify important structures, collect labeled data for training, and apply supervised learning tools to learn the classifiers.

Page 5: Hyeonsoo , Kang

Good: Works for domain-specific problems at a small scale.

Bad: Burden of labeling and training; cannot be readily extended to diverse new domains at a large scale.

Page 6: Hyeonsoo , Kang

Good: Works for domain-specific problems at a small scale.

Bad: Burden of labeling and training; cannot be readily extended to diverse new domains at a large scale.

Let’s aim at an automated method that not only works well for domain-specific problems but is also flexible and scalable!

Page 7: Hyeonsoo , Kang

Good: Works for domain-specific problems at a small scale.

Bad: Burden of labeling and training; cannot be readily extended to diverse new domains at a large scale.

Let’s aim at an automated method that not only works well for domain-specific problems but is also flexible and scalable!

But is that possible …?

Page 8: Hyeonsoo , Kang

A temporal sequence of nine shots, each shot one second apart

Observations?

Page 9: Hyeonsoo , Kang

Similar color & movements

A temporal sequence of nine shots, each shot one second apart

Page 10: Hyeonsoo , Kang

A temporal sequence of nine shots, each shot one second apart

Observations?

Page 11: Hyeonsoo , Kang

Different color

A temporal sequence of nine shots, each shot one second apart

Page 12: Hyeonsoo , Kang

A temporal sequence of nine shots, each shot one second apart

Observations?

Page 13: Hyeonsoo , Kang

Different camera walk

A temporal sequence of nine shots, each shot one second apart

Page 14: Hyeonsoo , Kang

Let’s focus on a particular domain of videos, such that:

(1) The video structure is in a discrete state space.

(2) The features, i.e., the observations from the data, are stochastic (small statistical variations on the raw features).

(3) The sequence is highly correlated in time.

Page 15: Hyeonsoo , Kang

Unsupervised learning approaches here are chiefly twofold:

(a) Model learning algorithm

(b) Feature selection algorithm

Page 16: Hyeonsoo , Kang

(a) Model learning algorithm

(b) Feature selection algorithm

Using a fixed feature set manually selected based on heuristics, build a model that performs well at distinguishing the high-level structures of the given video.

Using both the model learning algorithm and the feature selection algorithm yields a model and a set of features that together distinguish the high-level structures of the given video well.

Page 17: Hyeonsoo , Kang

(a) Model learning algorithm

(b) Feature selection algorithm

Using a fixed feature set manually selected based on heuristics, build a model that performs well at distinguishing the high-level structures of the given video.

Using both the model learning algorithm and the feature selection algorithm yields a model and a set of features that together distinguish the high-level structures of the given video well.

Page 18: Hyeonsoo , Kang

(a) Model learning algorithm

1. Baseline: uses a two-level HHMM to model structures in video.

2. HHMM ::= Hierarchical Hidden Markov Model.

The Hierarchical Hidden Markov Model is a statistical model derived from the Hidden Markov Model (HMM). The HHMM exploits its hierarchical structure to solve a subset of problems more efficiently, but it can be transformed into a standard HMM. Therefore, the coverage of the HHMM and the HMM is the same, but their efficiency differs.

Page 19: Hyeonsoo , Kang

(a) Model learning algorithm

1. Baseline: uses a two-level HHMM to model structures in video.

2. HHMM ::= Hierarchical Hidden Markov Model.

The Hierarchical Hidden Markov Model is a statistical model derived from the Hidden Markov Model (HMM). The HHMM exploits its hierarchical structure to solve a subset of problems more efficiently, but it can be transformed into a standard HMM. Therefore, the coverage of the HHMM and the HMM is the same, but their efficiency differs. Wait, what is an HMM then?

Page 20: Hyeonsoo , Kang

[Quick Review: HMM]

Consider a simple 3-state Markov model of the weather. We assume that once a day (e.g., at noon), the weather is observed as being one of the following:

(S1) State 1: rain (or snow)
(S2) State 2: cloudy
(S3) State 3: sunny

We postulate that the weather on day t is characterized by a single one of the three states above, and that the matrix A of state transition probabilities is

A = {aij} =
    | 0.4  0.3  0.3 |
    | 0.2  0.6  0.2 |
    | 0.1  0.1  0.8 |

Given that the weather on day 1 (t = 1) is sunny (state 3), we can ask the question: what is the probability (according to the model) that the weather for the next 7 days will be “sunny-sunny-rain-rain-sunny-cloudy-sunny”?

Page 21: Hyeonsoo , Kang

[Quick Review: HMM]

Stated more formally, we define the observation sequence O as

O = {S3, S3, S3, S1, S1, S3, S2, S3} (“sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny”)

corresponding to t = 1, 2, …, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O|Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]

Page 22: Hyeonsoo , Kang

[Quick Review: HMM]

Stated more formally, we define the observation sequence O as

O = {S3, S3, S3, S1, S1, S3, S2, S3} (“sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny”)

corresponding to t = 1, 2, …, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O|Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
           = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S1|S1] P[S3|S1] P[S2|S3] P[S3|S2]

Page 23: Hyeonsoo , Kang

[Quick Review: HMM]

Stated more formally, we define the observation sequence O as

O = {S3, S3, S3, S1, S1, S3, S2, S3} (“sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny”)

corresponding to t = 1, 2, …, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O|Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
           = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S1|S1] P[S3|S1] P[S2|S3] P[S3|S2]
           = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23

Page 24: Hyeonsoo , Kang

[Quick Review: HMM]

Stated more formally, we define the observation sequence O as

O = {S3, S3, S3, S1, S1, S3, S2, S3} (“sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny”)

corresponding to t = 1, 2, …, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O|Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
           = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S1|S1] P[S3|S1] P[S2|S3] P[S3|S2]
           = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
           = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)

Page 25: Hyeonsoo , Kang

[Quick Review: HMM]

Stated more formally, we define the observation sequence O as

O = {S3, S3, S3, S1, S1, S3, S2, S3} (“sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny”)

corresponding to t = 1, 2, …, 8, and we wish to determine the probability of O, given the model. This probability can be expressed (and evaluated) as

P(O|Model) = P[S3, S3, S3, S1, S1, S3, S2, S3 | Model]
           = P[S3] P[S3|S3] P[S3|S3] P[S1|S3] P[S1|S1] P[S3|S1] P[S2|S3] P[S3|S2]
           = π3 · a33 · a33 · a31 · a11 · a13 · a32 · a23
           = 1 · (0.8)(0.8)(0.1)(0.4)(0.3)(0.1)(0.2)
           = 1.536 × 10^-4

where we use the notation πi = P[q1 = Si], 1 <= i <= N, to denote the initial state probabilities.

So far this is an observable Markov model: every state corresponds directly to an observable event.
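As a sanity check, here is a small Python sketch (not part of the original slides) that reproduces this number; the transition matrix is the one from Rabiner's tutorial [1], with the states ordered (S1 rain, S2 cloudy, S3 sunny).

```python
import numpy as np

# Transition matrix from Rabiner's weather example [1];
# rows/columns are ordered (S1 rain, S2 cloudy, S3 sunny).
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.0, 0.0, 1.0])   # day 1 is known to be sunny, so pi_3 = 1

O = [2, 2, 2, 0, 0, 2, 1, 2]     # S3,S3,S3,S1,S1,S3,S2,S3 (0-indexed)

p = pi[O[0]]
for prev, cur in zip(O, O[1:]):
    p *= A[prev, cur]            # multiply the transition probabilities along O
print(p)                         # 1.536e-04
```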

Page 26: Hyeonsoo , Kang

[Quick Review: HMM]

The Hidden Markov Model is not too different from the observable MM; the only change is that each state no longer corresponds directly to an observable (physical) event.

For example, assume the following scenario. You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain another person is performing a coin (or multiple coin) tossing experiment. The other person will not tell you anything about what he is doing exactly; he will only tell you the result of each coin flip.

An HMM is characterized by the following:
1) N, the number of states in the model
2) M, the number of distinct observation symbols per state
3) The state transition probability distribution A = {aij}
4) The observation symbol probability distribution in state j, B = {bj(k)}, where bj(k) = P[vk at t | qt = Sj], 1 <= j <= N, 1 <= k <= M
5) The initial state distribution π = {πi}, where πi = P[q1 = Si], 1 <= i <= N

Page 27: Hyeonsoo , Kang

[Quick Review: HMM]

An HMM requires specification of two model parameters (N and M), specification of the observation symbols, and specification of the three probability measures A, B, and π. Since N and M are implicit in the other variables, we can use the compact notation λ = (A, B, π).
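To make the compact notation concrete, here is a minimal sketch (my own illustration, with made-up numbers rather than anything from the paper) of a discrete-observation HMM λ = (A, B, π) and of evaluating P(O | λ) with the forward algorithm, which sums over all hidden state paths.

```python
import numpy as np

# Illustrative 2-state, 2-symbol HMM; the numbers are invented for this example.
A = np.array([[0.7, 0.3],        # a_ij: state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],        # b_j(k): P(symbol k | state j)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward_likelihood(obs, A, B, pi):
    """P(O | lambda), summing over all hidden state sequences."""
    alpha = pi * B[:, obs[0]]            # alpha_1(j) = pi_j * b_j(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step of the forward algorithm
    return alpha.sum()

print(forward_likelihood([0, 0, 1, 1, 0], A, B, pi))
```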

Page 28: Hyeonsoo , Kang

(a) Model learning algorithm

1. Baseline: uses an HHMM.

2. HHMM ::= Hierarchical Hidden Markov Model.

The Hierarchical Hidden Markov Model is a statistical model derived from the Hidden Markov Model (HMM). The HHMM exploits its hierarchical structure to solve a subset of problems more efficiently, but it can be transformed into a standard HMM. Therefore, the coverage of the HHMM and the HMM is the same, but their efficiency differs. Now, to build an HHMM, we need to estimate its parameters, just as we did for the HMM.

Page 29: Hyeonsoo , Kang

(a) Model learning algorithm

Now, to build an HHMM, we need to estimate its parameters, just as we did for the HMM.

We model the recurring events in each video as HMMs, and the higher-level transitions between these events as another level of Markov chain. In this two-level HHMM, the lower-level states represent variations that can occur within the same event (the observations, i.e., measurements taken from the raw video, are modeled with a mixture of Gaussians), while the higher-level structure elements usually correspond to semantic events.
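As a rough illustration of this two-level idea (not the paper's actual model or parameters), the sketch below samples from a toy HHMM: a high-level chain over the events "play" and "break", and, inside each event, a small lower-level HMM whose sub-states emit Gaussian observations standing in for features such as the dominant color ratio. The high-level chain only advances when the lower-level chain exits.

```python
import numpy as np

rng = np.random.default_rng(0)

events = ["play", "break"]
event_trans = np.array([[0.7, 0.3],               # P(next event | current event)
                        [0.4, 0.6]])
sub_trans = {"play":  np.array([[0.8, 0.2], [0.3, 0.7]]),   # within-event transitions
             "break": np.array([[0.6, 0.4], [0.5, 0.5]])}
exit_prob = {"play": [0.05, 0.10], "break": [0.15, 0.15]}   # chance the sub-chain exits
emit_mean = {"play": [0.85, 0.65], "break": [0.25, 0.35]}   # Gaussian emission means

def sample(T=15):
    e, s = 0, 0                     # current event and sub-state
    seq = []
    for _ in range(T):
        name = events[e]
        obs = rng.normal(emit_mean[name][s], 0.05)
        seq.append((name, s, obs))
        if rng.random() < exit_prob[name][s]:
            # lower-level chain exits: the higher-level event transitions,
            # and the new event's sub-chain restarts from sub-state 0
            e = rng.choice(2, p=event_trans[e])
            s = 0
        else:
            s = rng.choice(2, p=sub_trans[name][s])
    return seq

for name, s, obs in sample():
    print(f"{name:5s} sub-state {s}  obs = {obs:.3f}")
```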

Page 30: Hyeonsoo , Kang

An example of HHMM

Page 31: Hyeonsoo , Kang

An example of HHMM: the higher-level states are sunny, rain, and cloudy, and the lower-level nodes represent variations within each state.

Page 32: Hyeonsoo , Kang

(a) Model learning algorithm

3. To estimate the parameters we use:

(1) The Expectation Maximization (EM) algorithm
(2) Bayesian learning techniques
(3) Reverse-jump Markov chain Monte Carlo (RJ-MCMC)
(4) The Bayesian Information Criterion (BIC)

Page 33: Hyeonsoo , Kang

(a) Model learning algorithm

3. To estimate the parameters we use:

(1) The Expectation Maximization (EM) algorithm
(2) Bayesian learning techniques
(3) Reverse-jump Markov chain Monte Carlo (RJ-MCMC)
(4) The Bayesian Information Criterion (BIC)

Model parameters are updated using EM. Model structure learning uses MCMC: parameter learning for the HHMM with EM is known to converge only to a local maximum of the data likelihood, since EM is a hill-climbing algorithm, and searching for the global maximum of the likelihood landscape is intractable, so we adopt a randomized search.
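The sketch below illustrates only the general idea of "restart the hill-climbing EM and score candidates with BIC"; a 1-D Gaussian mixture stands in for the HHMM emission model, and all data and numbers are synthetic, so this is not the paper's actual learning procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_gmm_1d(x, k, n_iter=200):
    """EM for a 1-D Gaussian mixture with k components (a stand-in model)."""
    n = len(x)
    means = rng.choice(x, size=k, replace=False).astype(float)  # random restart
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        logp = (-0.5 * ((x[:, None] - means) ** 2 / var + np.log(2 * np.pi * var))
                + np.log(w))
        m = logp.max(axis=1, keepdims=True)
        resp = np.exp(logp - m)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-9
        w, means = nk / n, (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    # final data log-likelihood under the learned parameters
    logp = (-0.5 * ((x[:, None] - means) ** 2 / var + np.log(2 * np.pi * var))
            + np.log(w))
    m = logp.max(axis=1, keepdims=True)
    loglik = (m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum()
    return loglik, 3 * k - 1          # free parameters: means, variances, weights

def bic(loglik, n_params, n):
    return loglik - 0.5 * n_params * np.log(n)   # higher is better in this form

# synthetic 1-D feature stream with two regimes
x = np.concatenate([rng.normal(0.2, 0.05, 300), rng.normal(0.8, 0.05, 300)])
candidates = [(k, bic(*em_gmm_1d(x, k), len(x))) for k in (1, 2, 3) for _ in range(5)]
print("best (k, BIC):", max(candidates, key=lambda c: c[1]))
```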

Page 34: Hyeonsoo , Kang

(a) Model learning algorithm

3. To estimate the parameters we use:

(1) The Expectation Maximization (EM) algorithm
(2) Bayesian learning techniques
(3) Reverse-jump Markov chain Monte Carlo (RJ-MCMC)
(4) The Bayesian Information Criterion (BIC)

Model parameters are updated using EM. Model structure learning uses MCMC: parameter learning for the HHMM with EM is known to converge only to a local maximum of the data likelihood, since EM is a hill-climbing algorithm, and searching for the global maximum of the likelihood landscape is intractable, so we adopt a randomized search.

However, I will not go through them one by one; if you are interested, you can find them in the paper by Xie, Lexing, et al. [3].

Page 35: Hyeonsoo , Kang

(a) Model learning algorithm

(b) Feature selection algorithm

Using a fixed feature set manually selected based on heuristics, build a model that performs well at distinguishing the high-level structures of the given video.

Using both the model learning algorithm and the feature selection algorithm yields a model and a set of features that together distinguish the high-level structures of the given video well.

Page 36: Hyeonsoo , Kang

(a) Model learning algorithm

(b) Feature selection algorithm

Using a fixed feature set manually selected based on heuristics, build a model that performs well at distinguishing the high-level structures of the given video.

Using both the model learning algorithm and the feature selection algorithm yields a model and a set of features that together distinguish the high-level structures of the given video well.

Page 37: Hyeonsoo , Kang

Into what aspects can feature selection be divided, and why?

Page 38: Hyeonsoo , Kang

Feature selection is divided into two aspects:

(1) Eliminating irrelevant features – irrelevant features usually disturb the classifier and degrade classification accuracy.

(2) Eliminating redundant ones – redundant features add to computational cost without bringing in new information.

Into what aspects can feature selection be divided, and why?

Page 39: Hyeonsoo , Kang

(b) Feature selection algorithm

1. We use a filter-wrapper method: the wrapper step corresponds to eliminating irrelevant features, and the filter step corresponds to eliminating redundant ones.
(a) Wrapper step – partitions the feature pool into consistent groups
(b) Filter step – eliminates redundant dimensions

2. For example, the feature pool contains features such as Dominant Color Ratio (DCR), Motion Intensity (MI), the least-squares estimation of camera translation (MX, MY), and five audio features – Volume, Spectral roll-off (SR), Low-band energy (LE), High-band energy (HE), and Zero-crossing rate (ZCR).
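As a sketch of the relevance measure used in the wrapper step, the code below computes the information gain (mutual information) of two candidate features about a label sequence; the "Viterbi" labels and feature values here are synthetic, and the discretization into 8 bins is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def information_gain(labels, feature, bins=8):
    """Mutual information between the label sequence and a discretized feature."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    f = np.digitize(feature, edges[1:-1])           # bin index in 0..bins-1
    joint = np.zeros((labels.max() + 1, bins))
    for lab, v in zip(labels, f):
        joint[lab, v] += 1
    joint /= joint.sum()
    pl = joint.sum(axis=1, keepdims=True)           # marginal over labels
    pf = joint.sum(axis=0, keepdims=True)           # marginal over feature bins
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pl @ pf)[nz])).sum())

# toy label sequence (0 = break, 1 = play) and two candidate features
labels = (rng.random(500) < 0.6).astype(int)
dcr = 0.3 + 0.5 * labels + rng.normal(0, 0.05, 500)   # correlated with the labels
zcr = rng.normal(0.5, 0.1, 500)                        # irrelevant noise feature

for name, feat in [("DCR", dcr), ("ZCR", zcr)]:
    print(name, "information gain =", round(information_gain(labels, feat), 3))
```

Features with near-zero information gain would be treated as irrelevant, while the Markov blanket filtering that follows targets redundancy among the remaining ones.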

Page 40: Hyeonsoo , Kang

(b) Feature selection algorithm

3. Algorithm structure. The big picture:

HHMM learning → Viterbi state sequence → information gain (wrapper step) → Markov blanket filtering (filter step) → BIC fitness

Page 41: Hyeonsoo , Kang

(b) Feature selection algorithm

3. Algorithm structure. The big picture:

HHMM learning → Viterbi state sequence → information gain (wrapper step) → Markov blanket filtering (filter step) → BIC fitness

In detail:

Page 42: Hyeonsoo , Kang

Experiments and Results

For soccer videos, the main evaluation focused on distinguishing the two semantic events, play and break.

(a) Model learning algorithm

Page 43: Hyeonsoo , Kang

Experiments and Results

For soccer videos, the main evaluation focused on distinguishing the two semantic events, play and break.

(a) Model learning algorithm

We use a fixed set of features manually selected based on heuristics (dominant color ratio and motion intensity) (Xu et al., 2001; Xie et al., 2002b).

Page 44: Hyeonsoo , Kang

Experiments and Results

For soccer videos, the main evaluation focused on distinguishing the two semantic events, play and break.

(a) Model learning algorithm

We use a fixed set of features manually selected based on heuristics (dominant color ratio and motion intensity) (Xu et al., 2001; Xie et al., 2002b).

We built four different learning schemes and evaluated them against the ground truth:

(1) Supervised HMM
(2) Supervised HHMM
(3) Unsupervised HHMM without model adaptation
(4) Unsupervised HHMM with model adaptation

Page 45: Hyeonsoo , Kang

Experiments and Results

Page 46: Hyeonsoo , Kang

Experiments and Results

For soccer videos, the main evaluation focused on distinguishing the two semantic events, play and break.

(b) Feature selection algorithm

Based on the good performance of the model parameter and structure learning algorithm, we test the performance of the automatic feature selection method that iteratively wraps around and filters.

A 9-dimensional feature vector, sampled every 0.1 seconds, includes: Dominant Color Ratio (DCR), Motion Intensity (MI), the least-squares estimation of camera translation (MX, MY), and five audio features – Volume, Spectral roll-off (SR), Low-band energy (LE), High-band energy (HE), and Zero-crossing rate (ZCR).

Page 47: Hyeonsoo , Kang

Experiments and Results

Evaluation against the play/break labels showed 74.8% accuracy.

For the clip Spain, the final selected feature set was {DCR, Volume}, with 74.8% accuracy.
For the clip Korea, the final selected feature set was {DCR, MX}, with 74.5% accuracy.

[Testing on the baseball video] yielded three consistent compact feature groups: {HE, LE, ZCR}, {DCR, MX}, {Volume, SR}.

The resulting segments have consistent perceptual properties, with one cluster of segments mostly corresponding to pitching shots and other field shots when the game is in play, while the other cluster contains most of the cutaway shots, scoreboards, and game breaks.

Page 48: Hyeonsoo , Kang

Summary

Within a specific domain of videos (sports: soccer and baseball), our unsupervised learning method can perform well.

Our method was chiefly twofold: one part was the model learning algorithm and the other the feature selection algorithm.

In the model learning algorithm, we used the HHMM as the basic model and used techniques such as the Expectation Maximization (EM) algorithm, Bayesian learning techniques, Reverse-jump Markov Chain Monte Carlo (RJ-MCMC), and the Bayesian Information Criterion (BIC) to set the parameters of the model.

In the feature selection algorithm, together with a model of good performance, we used filter-wrapper methods to eliminate irrelevant and redundant features.

Page 49: Hyeonsoo , Kang

Questions
1. What is supervised learning?

2. What is the benefit of using unsupervised learning?

3. Into what aspects can feature selection be divided, and why?

Page 50: Hyeonsoo , Kang

Questions

1. What is supervised learning?
The algorithm designers manually identify important structures, collect labelled data for training, and apply supervised learning tools to learn the classifiers.

2. What is the benefit of using unsupervised learning?
(A) It alleviates the burden of labelling and training.
(B) It also provides a scalable solution for generalizing video indexing techniques.

3. Into what aspects can feature selection be divided, and why?
Feature selection is divided into two aspects:
(1) Eliminating irrelevant features: usually irrelevant features disturb the classifier and degrade classification accuracy.
(2) Eliminating redundant ones: redundant features add to computational cost without bringing in new information.

Page 51: Hyeonsoo , Kang

Bibliography
[1] Rabiner, Lawrence R. "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77.2 (1989): 257-286.
[2] Xie, Lexing, et al. "Structure analysis of soccer video with hidden Markov models." Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on. Vol. 4. IEEE, 2002.
[3] Xie, Lexing, et al. "Unsupervised mining of statistical temporal structures in video." Video Mining. Springer US, 2003. 279-307.
[4] Xu, Peng, et al. "Algorithms and system for segmentation and structure analysis in soccer video." IEEE International Conference on Multimedia and Expo. 2001.

Page 52: Hyeonsoo , Kang

THANK YOU!

Page 53: Hyeonsoo , Kang

Q & A