Movie Content Analysis, Indexing and Skimming - 김덕주 (Duck Ju Kim)



TRANSCRIPT

Movie Content Analysis, Indexing and Skimming

김덕주 (Duck Ju Kim)
This topic is about content-based video analysis, indexing and skimming (representation).

Problems
- What is the objective of content-based video analysis?

- Why does supervised identification have limitations?

- Why should integrated media data be used?
A content-based movie analysis, indexing and skimming system includes three major modules. There are three types of movie events, and the movie is abstracted in the form of a short video clip.

Introduction
- Analysis: structured organization, embedded semantics
- Indexing: tagging semantic units, limited machine perception
- Skimming: abstraction & presentation, video browsing

Analysis means obtaining the structured organization of the video and understanding its embedded semantics. A semantic gap still exists between the real content and the contexts derived from low-level features.

Event Detection Approach
- Shot detection: low-level structure; does not correspond directly to video semantics
- Scene extraction: higher-level context, but many unimportant contents
- Event extraction: higher semantic level; better for revealing, representing and abstracting content
A shot is a set of contiguously recorded image frames. A scene is a collection of semantically related shots. An event is a meaningful video paragraph (see the sketch below).
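A minimal sketch of the shot/scene/event hierarchy described above; the field names are illustrative assumptions, not the paper's data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    """A set of contiguously recorded image frames."""
    start_frame: int          # index of the first frame (assumed convention)
    end_frame: int            # index of the last frame
    keyframes: List[int] = field(default_factory=list)  # e.g. first and last frames

@dataclass
class Scene:
    """A collection of semantically related shots."""
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Event:
    """A meaningful video paragraph (dialog, hybrid event, ...)."""
    label: str                # e.g. "two-speaker dialog"
    shots: List[Shot] = field(default_factory=list)
```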

Speaker Identification Approach
- Standard speech databases: YOHO, HUB4, SWITCHBOARD
- Integration of media cues: speaker recognition + facial analysis, speech cues + visual cues
- Supervised identification: fixed speaker models, insufficient training data, data collection before processing
- Assumes there is only one face / only identifies speakers

Video Skimming Approach
- Pre-developed schemes: discontinuous semantic flow, ignored embedded audio cues
- Computation of six types of features
- Importance evaluation
- Assembling important events

Content Pre-analysis
- Shot detection: color histogram-based approach; extract keyframes (the first and last frames)
- Audio content: classification into silence, speech, music and environmental sounds
- Visual content: detect human faces

Movie Event Extraction
- Develop thematic topics through actions or dialogs
- What to extract? Two-speaker dialogs, multiple-speaker dialogs, hybrid events

Movie Event Extraction
- How to extract?
- Shot sink computation: grouping close and similar shots
- Sink clustering and characterization: periodic, partly-periodic, non-periodic
- Event extraction and classification
- Post-processing
A sink is a grouping of close and similar shots; sinks are characterized as periodic, partly-periodic or non-periodic.

Shot Sink Computation
- Pool of close and similar shots
- Uses visual information
- Window-based sweep algorithm

Notation for the shot similarity measure: Dist(i, j) is the similarity of two shots i and j; b_i, e_i, b_j, e_j are the keyframes of shots i and j; dist(b_i, b_j) is the Euclidean distance or histogram intersection between their color histograms; W_i (i = 1, 2, 3, 4) are weighting coefficients; L_i, L_j are the shot lengths (unit: frames); winL is used for normalization (a sketch follows).
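A minimal sketch of a shot similarity of this form, assuming the four weights combine the pairwise keyframe distances and that the shot lengths are normalized by the sweep window length winL; the exact combination used in the paper is not reproduced here.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Similarity of two L1-normalized color histograms (1 = identical)."""
    return np.minimum(h1, h2).sum()

def shot_similarity(shot_i, shot_j, weights=(0.25, 0.25, 0.25, 0.25), winL=300):
    """Weighted keyframe similarity of two shots.

    shot_* : dict with 'begin_hist', 'end_hist' (keyframe color histograms)
             and 'length' (shot length in frames).  The weighting scheme and
             the length/window normalization are illustrative assumptions.
    """
    b_i, e_i = shot_i["begin_hist"], shot_i["end_hist"]
    b_j, e_j = shot_j["begin_hist"], shot_j["end_hist"]
    w1, w2, w3, w4 = weights
    keyframe_term = (w1 * hist_intersection(b_i, b_j) +
                     w2 * hist_intersection(b_i, e_j) +
                     w3 * hist_intersection(e_i, b_j) +
                     w4 * hist_intersection(e_i, e_j))
    # Favor longer shots, normalized by the sweep window length winL (assumption).
    length_term = min((shot_i["length"] + shot_j["length"]) / (2.0 * winL), 1.0)
    return keyframe_term * length_term
```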

Shot Sink Clustering
- Clustering & characterizing: periodic, partly-periodic, non-periodic
- Degree of shot repetition
Determining the sink periodicity:
- Calculate the relative temporal distances between the shots of a sink
- Compute their mean and standard deviation
- Group the sinks with the K-means algorithm (see the sketch below)
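A minimal sketch of this grouping step, assuming each sink is summarized by the mean and standard deviation of the temporal gaps between its shots and that three K-means clusters correspond to periodic, partly-periodic and non-periodic sinks; scikit-learn's KMeans is used purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def sink_periodicity_features(shot_start_frames):
    """Mean and standard deviation of the temporal gaps between a sink's shots."""
    starts = np.sort(np.asarray(shot_start_frames, dtype=float))
    gaps = np.diff(starts)              # relative temporal distances (assumption)
    return gaps.mean(), gaps.std()

def classify_sink_periodicity(sinks):
    """Group sinks into three clusters: periodic, partly-periodic, non-periodic."""
    feats = np.array([sink_periodicity_features(s) for s in sinks])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
    return labels   # which cluster is which class is decided by inspecting the centers
```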

[Scatter plot of sink features: crosses, triangles and circles denote periodic, partly-periodic and non-periodic sinks.]

Integrating Speech & Face Information
- False alarms: a montage presentation mistaken for a spoken dialog; a multiple-speaker dialog mistaken for a two-speaker dialog
- Solutions for reducing false alarms: integration of embedded audio information, speech shot ratio calculation, inclusion of facial cues (face detection)

Adaptive Speaker Identification
- Shot detection & audio classification
- Face detection & mouth tracking
- Speech segmentation / clustering
- Initial speaker modeling
- Audiovisual-based speaker identification
- Unsupervised speaker model adaptation
The goal is identifying target movie cast members for content indexing purposes.

Face Detection & Mouth Tracking
- Detection & recognition of talking faces

Notation for mouth localization: dist is the distance between the eyes and the mouth; (x1, y1) and (x2, y2) are the eye positions; (x, y) is the mouth center (a sketch of one possible estimate follows).
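A minimal sketch of one way to estimate the mouth center from the detected eye positions, assuming the mouth lies roughly below the midpoint of the eyes at distance dist; the proportionality constant and the geometry are assumptions, not the paper's formula.

```python
import math

def estimate_mouth_center(eye_left, eye_right, dist_ratio=1.1):
    """Estimate the mouth center (x, y) from eye positions (x1, y1), (x2, y2).

    Assumes an upright face: the mouth sits below the eye midpoint, at a
    distance dist proportional to the inter-eye distance (dist_ratio is a guess).
    """
    (x1, y1), (x2, y2) = eye_left, eye_right
    eye_dist = math.hypot(x2 - x1, y2 - y1)
    dist = dist_ratio * eye_dist                      # eyes-to-mouth distance (assumption)
    mid_x, mid_y = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Unit vector perpendicular to the eye line.
    dx, dy = (y2 - y1) / eye_dist, -(x2 - x1) / eye_dist
    # Pick the perpendicular that points down the face (larger y in image coords).
    if dy < 0:
        dx, dy = -dx, -dy
    return mid_x + dist * dx, mid_y + dist * dy
```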

Speech Segmentation

We first classify audio frames into silence and speech. Notation: E is the frame energy, T is the energy threshold, and L is the minimum speech/silence segment length (see the sketch below).
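A minimal sketch of energy-based silence/speech labeling with a minimum segment length, as described above; the frame energy definition and the way short runs are merged into their neighbors are assumptions.

```python
import numpy as np

def label_frames(frames, T):
    """Label each audio frame as speech (True) or silence (False) by energy E > T."""
    energy = np.mean(np.square(frames), axis=1)   # E: mean-square frame energy (assumption)
    return energy > T

def enforce_min_length(labels, L):
    """Flip runs shorter than L frames so every speech/silence segment has length >= L."""
    labels = labels.copy()
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start < L:
                labels[start:i] = not labels[start]   # merge into the neighboring segment
            start = i
    return labels
```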

Speech Clustering
- Two separate segments X1 and X2; joined segment X = {X1, X2}
- For a cluster C with n homogeneous speech segments, a distance Dist(X, C) is computed; a negative value means X is considered to come from the same speaker as C.

The Bayesian Information Criterion (BIC) is used to measure the similarity between two speech segments: the Sigma terms are the segment covariance matrices, and the M terms are the corresponding numbers of feature vectors (a sketch of the BIC distance follows).
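A minimal sketch of a ΔBIC-style distance between two speech segments, assuming the common formulation with full-covariance Gaussians and a standard complexity penalty; the paper's exact variant (including how Dist(X, C) aggregates over the n segments of a cluster) may differ.

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """BIC-based distance between two speech segments.

    X1, X2 : (frames, dims) arrays of cepstral feature vectors.
    Negative values suggest the two segments come from the same speaker.
    """
    X = np.vstack([X1, X2])
    m1, m2, m = len(X1), len(X2), len(X)
    d = X.shape[1]

    def logdet_cov(Z):
        return np.linalg.slogdet(np.cov(Z, rowvar=False))[1]

    # Model-fit term: one Gaussian for the joined segment vs. one per segment.
    fit = 0.5 * (m * logdet_cov(X) - m1 * logdet_cov(X1) - m2 * logdet_cov(X2))
    # Complexity penalty for the extra Gaussian (standard BIC penalty, assumption).
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(m)
    return fit - penalty
```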

Initial Speaker Modeling
- Required for the identification process
- Exploits the inter-relations between facial and speech cues
- For each target cast member A: find a speech shot where A is talking, collect all of its speech segments, and build an initial model: a Gaussian Mixture Model (GMM)

Likelihood-based Speaker Identification
- GMM model notation: components j = 1, 2, ..., m
- For the ith enrolled speaker, compute the log likelihood between X and model M_i (see the sketch below)
Notation: m is the number of mixture components; p_j is the weight, mu_j the mean vector and sigma_j the covariance matrix of component j; X is an observation sequence of T cepstral vectors x_t, t = 1, 2, ..., T; S is the covariance of X and X_bar is its mean.
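A minimal sketch of likelihood-based identification with GMMs, assuming the enrolled speaker whose model M_i gives the highest log likelihood over X is selected; scipy's multivariate_normal is used here for the component densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """Total log likelihood of cepstral vectors X under a GMM (p_j, mu_j, sigma_j)."""
    # (T, m) matrix of weighted component densities.
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covs)
    ])
    return np.sum(np.log(dens.sum(axis=1) + 1e-300))

def identify_speaker(X, speaker_models):
    """Return the index of the enrolled speaker model M_i maximizing log L(X | M_i)."""
    scores = [gmm_log_likelihood(X, *model) for model in speaker_models]
    return int(np.argmax(scores))
```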

Audiovisual Integration for Speaker Identification
- Finalizes the speaker identification task by integrating audio and video cues
- Examine the existence of temporal overlap between a speech cluster and a face sequence
- If the overlap ratio > threshold, assign the face vector to the cluster; otherwise, set the face vector to null
- The speaker identity is then decided from the combined cues (see the sketch below)
Notation: C is a speech cluster, F a face sequence, f the face vectors, v the speaker vectors, and N the number of target speakers.
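A minimal sketch of the temporal-overlap test described above, assuming speech clusters and face sequences are given as (start, end) frame intervals and that the overlap ratio is measured relative to the face sequence; the threshold value is an assumption.

```python
def overlap_ratio(speech_interval, face_interval):
    """Fraction of the face sequence that temporally overlaps the speech interval."""
    (s1, e1), (s2, e2) = speech_interval, face_interval
    overlap = max(0, min(e1, e2) - max(s1, s2))
    return overlap / max(e2 - s2, 1)

def assign_faces_to_clusters(speech_intervals, face_sequences, threshold=0.5):
    """Attach each face sequence to a speech cluster, or None if overlap is too small."""
    assignments = []
    for face in face_sequences:
        ratios = [overlap_ratio(s, face["interval"]) for s in speech_intervals]
        best = max(range(len(ratios)), key=ratios.__getitem__) if ratios else None
        assignments.append(best if best is not None and ratios[best] > threshold else None)
    return assignments
```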

Unsupervised Speaker Model Adaptation
- Updating the speaker model during identification
- Three approaches: average-based, MAP-based and Viterbi-based model adaptation

Average-based Model Adaptation
- Compute the BIC distances and compare d_min with a threshold T
- If d_min < T: update (average) the closest mixture component with the adaptation data
- If d_min > T: initialize a new mixture component
- Update the weight of each component

Notation: the BIC distances are computed between cluster C and all of speaker model P's mixture components b_i; d_min is attained at component b_0.

MAP-based Model Adaptation
- mu_i: mean of mixture component b_i

- L_i: occupation likelihood of the adaptation data

- x_bar: mean of the observed adaptation data

- tau: adaptation speed
- P(i | x_t, M_p): posterior probability that x_t belongs to component b_i

Viterbi-based Model Adaptation
- Allows different feature vectors to come from different components
- Hard decision: any vector either occupies a component or not
- Uses an indicator function for each mixture component instead of the posterior probability (see the sketch below)
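A minimal sketch of a MAP-style mean update built from the quantities defined above (occupation likelihood L_i, adaptation-data mean x_bar, prior mean mu_i, adaptation speed tau), with the Viterbi variant replacing the posterior P(i | x_t, M_p) by a hard 0/1 indicator; the exact update used in the paper may differ.

```python
import numpy as np

def map_adapt_means(X, posteriors, means, tau=10.0, viterbi=False):
    """MAP-style adaptation of GMM component means.

    X          : (T, d) adaptation feature vectors x_t.
    posteriors : (T, m) values of P(i | x_t, M_p) for each vector and component.
    means      : (m, d) current component means mu_i.
    viterbi    : if True, use hard 0/1 assignments (indicator function) instead
                 of posterior probabilities.
    """
    if viterbi:
        hard = np.zeros_like(posteriors)
        hard[np.arange(len(X)), posteriors.argmax(axis=1)] = 1.0
        posteriors = hard
    L = posteriors.sum(axis=0)                                   # occupation likelihoods L_i
    x_bar = (posteriors.T @ X) / np.maximum(L, 1e-10)[:, None]   # observed adaptation means
    alpha = (L / (L + tau))[:, None]                             # data/prior balance via tau
    return alpha * x_bar + (1.0 - alpha) * means                 # adapted means
```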

Event-based Movie Skimming
- Event feature extraction: six types of mid- to high-level features
- Evaluation of importance
- Movie skim generation: assemble major events into the final skim

Event Feature Extraction
- Music ratio, speech ratio, sound loudness, action level (normalized by dividing by the largest value), present cast, theme topic

Event Feature Extraction (notation)
- M: number of features extracted
- N: number of events
- a_i,j: value of the jth feature in the ith event

Movie Skim Generation
- Choosing important events
- User's feature preference
- Event importance vector (see the sketch below)
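A minimal sketch of how an event importance vector could be formed from the N x M feature matrix a and the user's feature-preference weights, and how important events could be assembled into a skim of a target length; the weighted-sum scoring and the greedy selection are assumptions, not the paper's exact procedure.

```python
import numpy as np

def event_importance(features, preference):
    """Importance of each event as a preference-weighted sum of its features.

    features   : (N, M) matrix, features[i, j] = a_i,j.
    preference : (M,) user feature-preference weights.
    """
    return features @ np.asarray(preference, dtype=float)

def generate_skim(events, features, preference, target_length):
    """Greedily pick the most important events until the skim reaches target_length (s)."""
    order = np.argsort(-event_importance(features, preference))
    skim, total = [], 0.0
    for idx in order:
        if total + events[idx]["duration"] > target_length:
            continue
        skim.append(idx)
        total += events[idx]["duration"]
    # Keep the chosen events in their original temporal order.
    return sorted(skim, key=lambda i: events[i]["start"])
```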

Event Detection Results
- Correctness of the event classification
- System performance evaluation
- The hybrid class is excluded

Speaker Identification Results
- Evaluation of the adaptive speaker identification system
- Metrics: false acceptance (FA), false rejection (FR), identification accuracy (IA)

Notation: (A, B) means character A is speaking, but actor B is identified.
[Results compared for the average-based, MAP-based and Viterbi-based adaptation methods.]
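A minimal sketch of how the three metrics above might be computed from per-segment decisions, assuming each decision is either a named speaker or a rejection; the exact definitions used in the evaluation may differ.

```python
def speaker_id_metrics(decisions):
    """Compute FA, FR and IA rates from (true_speaker, predicted) pairs.

    predicted is a speaker name, or None when the segment is rejected.
    FA: a wrong speaker was accepted; FR: the segment was rejected;
    IA: fraction of segments identified correctly.
    """
    fa = sum(1 for truth, pred in decisions if pred is not None and pred != truth)
    fr = sum(1 for truth, pred in decisions if pred is None)
    ia = sum(1 for truth, pred in decisions if pred == truth)
    n = max(len(decisions), 1)
    return fa / n, fr / n, ia / n
```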

Movie Skimming Results
- Qualitative evaluation is difficult; a quantitative measure is obtained from a user study
- 5-point scale (1-5) on: visual comprehension, audio comprehension, semantic continuity, good abstraction, quick browsing, video skipping