Dynamic Surface
Modeling & Applications
Tony Tung
Matsuyama Laboratory, Kyoto University
2005.07-2005.08
2008.06-2014.09
Self-introduction
Interests: computer vision, pattern recognition,
shape modeling, human-computer interaction
Tony TUNG
Matsuyama Laboratory
Graduate School of Informatics, Kyoto University
2005/07/01 - 2005/08/31 : JSPS Summer program (postdoc)
2008/06/01 - 2010/01/31 : Postdoc + JSPS short-term postdoc
2010/02/01 - 2014/09/30 : Assistant Professor (CREST - Kawahara Laboratory)
KAKENHI Wakate B ×2
JSPS AYAME
Microsoft Research Azure project
Contact: tonytung.org
3D Video is:
- Free-viewpoint video
- Image-based system for full surface capture of objects in
motion
- Markerless technique
3D video: full 3D object in motion
3D Video project
[Matsuyama et al., CVIU'04]
3D Video project
Applications: preservation of intangible cultural heritage,
medicine (e.g., gait analysis), entertainment (movies,
sport replay), etc.
3D Video project
T. Matsuyama, S. Nobuhara, T. Takai, T. Tung
Springer 2012 (book)
3D video framework
- Current 3D video studio (3rd) at Kyoto University
•Reconstruction space: 3 m × 3 m × 3 m
•Green background for chroma keying, fluorescent/LED lamps
•16 video cameras 1600x1200@25fps
•Grasshopper IEEE1394b
•Synchronization by external trigger
•Geometrically calibrated
•Cluster of 2 PCs (8 cameras per PC)
3D video framework
• 3D video data = sequence of 3D mesh models
– Frame-by-frame reconstruction using multiview stereo techniques
3D Video Reconstruction [CVPR08] [ICCV09]
3D video reconstruction
3D video reconstruction from multiview stereo:
[Matsuyama et al., CVIU’04] [Matsuyama et al., Springer’12]
++ using temporal cues:
[Tung et al., CVPR’08] [Tung et al., ICCV’09]
3D video super resolution
Image-based super resolution of 3D video
[Tung et al., CVPR’08]
Stereo probabilistic fusion
3D video reconstruction from wide baseline stereo and SfM
probabilistic fusion
[Tung et al., ICCV’09]
Data size issue
• One or several subjects in the 3D video studio
• 3D surface reconstruction by MVS technique
• Volumetric graph-cuts (5mm resolution)
• Each 3D model = 1.5 MB (30,000 triangles)
• 5 min of 3D video = 11.25 GB
How can this large amount of data be managed?
- How to search (analysis, visualization)
- How to handle inconsistency (storage, transfer)
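The data-size figures above follow from simple arithmetic, which a few lines of Python make explicit (frame size and frame rate as given in the slides):

```python
# Back-of-the-envelope check of the 3D video data-size figures above.
frame_size_mb = 1.5          # one 3D mesh model (~30,000 triangles)
fps = 25                     # capture frame rate
duration_s = 5 * 60          # 5 minutes

total_gb = frame_size_mb * fps * duration_s / 1000  # MB -> GB (decimal)
print(total_gb)  # 11.25
```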
Topology Dictionary
for 3D Video Understanding [CVPR07] [CVPR09] [PAMI12]
3D video sequence
Sequential reconstruction
(Inconsistent topology between frames)
Topology can be used to characterize 3D video data
Topology dictionary for 3D video understanding
• Abstraction levels
• Topology-based shape description (frame level)
• Probabilistic motion graph modeling (sequence level)
• Applications
• Analysis: segmentation, annotation, action recognition
• Content-based encoding: summarization, skimming
• Data size compression: storage, streaming
[Tung et al., CVPR’09]
[Tung et al., PAMI’12]
[Matsuyama et al., Springer’12]
Topology-based shape description
Morse theory
Let S be a manifold surface (mesh surface) and μ : S → R a real continuous function.
The Reeb graph is the quotient space of the graph of μ in S × R,
defined by the equivalence relation ~ :
∀(X, Y) ∈ S², X ~ Y ⟺ μ(X) = μ(Y), and X and Y belong to the same
connected component of μ⁻¹(μ(X)).
[Reeb, 1946]
Topology-based shape description
• Multiresolution Reeb graphs
[Hilaga et al., SIGGRAPH’01]
[Tung et al., CVPR’07]
- Automatic extraction of graphs
- R, t, scale invariant
- Homotopic
- Multiresolution coarse-to-fine matching
Topology-based shape description
Reeb graph evaluation
• Robustness to surface noise
Reeb graph vs. skeleton
• “Automatic” 3D shape description
Topology matching
- Invariance to rotation, translation and scale
- Matching using topological and geometrical
attributes (valence, relative area)
- Coarse-to-fine multiresolution strategy
- Similarity of two models M,N from similarity of
topology consistent node pairs {(mi, nj)} at
every level of resolution:
SIM(M, N) = Σ_{r=0..R} Σ_{{i,j}} sim(mi, nj)
[Hilaga et al., SIGGRAPH’01]
[Tung et al., CVPR’07]
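The coarse-to-fine accumulation above can be sketched as follows; the node attributes and the pairing of topology-consistent nodes are simplified placeholders, not the actual matching procedure:

```python
# Sketch of multiresolution topology matching (after [Hilaga et al., SIGGRAPH'01]):
# the similarity of two models accumulates over topology-consistent node pairs
# at every resolution level. Attributes here are toy values, not real graphs.

def node_similarity(m, n):
    # Toy attribute similarity from valence and relative area.
    return min(m["area"], n["area"]) * (m["valence"] == n["valence"])

def SIM(pairs_per_level):
    # pairs_per_level[r] = list of topology-consistent node pairs at level r.
    return sum(node_similarity(m, n)
               for pairs in pairs_per_level
               for m, n in pairs)

# Two coarse levels of a toy match:
pairs = [
    [({"valence": 2, "area": 1.0}, {"valence": 2, "area": 1.0})],    # r = 0
    [({"valence": 1, "area": 0.5}, {"valence": 1, "area": 0.4}),
     ({"valence": 3, "area": 0.5}, {"valence": 2, "area": 0.6})],    # r = 1
]
print(SIM(pairs))  # 1.4  (1.0 + 0.4 + 0.0)
```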
Performance evaluation
Pose retrieval in 3D video sequences
[Huang et al., 3DPVT'10]
Topology clusters
• Dataset clustering using similarity evaluation
Distance matrix {1 - SIM}
Cluster types: repeated poses, long poses, short poses, transitions
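Constructing the {1 - SIM} distance matrix is a one-liner once pairwise similarities are available; the similarity values here are made up for illustration:

```python
import numpy as np

# Turning pairwise topology similarities SIM(M, N) in [0, 1] into the
# distance matrix {1 - SIM} used for dataset clustering. The similarity
# values below are illustrative, not computed from real 3D video frames.
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
])

dist = 1.0 - sim                      # {1 - SIM}
print(dist[0, 2])                     # 0.8: frames 0 and 2 differ in topology

# A simple grouping: frames closer than a threshold join the same cluster.
threshold = 0.5
same_cluster = dist < threshold
print(same_cluster[0, 1], same_cluster[0, 2])  # True False
```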
Topology clusters
• Clustering of (repetitive) atomic actions
Topology clusters
• Motion graph structure (SIGGRAPH’02: [Arikan & Forsyth] [Kovar et al.] [Lee et al.])
• using statistics on cluster size and occurrence
• Applications: summarization, 3D video skimming
3D video skimming
3D video annotation
• Add semantic information to each topology cluster
3D video skimming and annotation
Topology dictionary for 3D video understanding
[Tung et al., CVPR’09] [Tung et al., PAMI’12]
CG models as prior
3D video annotation
Invariant Surface Descriptor for
3D Video Encoding [ACCV12] [TVC14]
3D video encoding
• 3D video data size is big
– Several GB for few minutes of HR sequence
• Impractical for data storage/management
• Impractical for data streaming over network
• Data structure inconsistency prevents existing compression approaches from being efficient
3D video encoding
Approach: Geometry image technique (3D to 2D transform)
• Cut open 3D meshes and re-parameterize on plane
• Apply lossless compression (2D video)
See [Gu et al., SIGGRAPH’02]
for synthetic data
Solution: stabilize the cuts for optimal encoding
3D video encoding
Possible scenarios:
1. Meshes are consistent (share same connectivity)
• Synthetic datasets
2. Meshes are inconsistent (different connectivity, resolution)
• Tracking & remeshing [Cagniart et al., ECCV’10]
• Point-to-point surface alignment (“geodesic mapping”) [Tung et al., CVPR’10] [Tung et al., PAMI’14]
• Geometrical data are inconsistent in time (e.g., raw 3D video)
– Adaptive bitrate streaming (where resolution can vary)
Deformation invariant surface descriptor
[Tung et al., ACCV’12] [Tung et al., Vis. Comp.’14]
Invariant shape descriptor
• Define a surface-based shape descriptor
– Graph defined on object’s surface
– Nodes are geodesically consistent across time
• E.g., surface extremal points
– Edges join the nodes
• Defined as paths on the surface
• Maintained geodesically consistent across time
– Using the previous position of the path (vertices)
– Using the shortest path between nodes
• Probabilistic framework (MAP-MRF) to handle
surface non-rigid deformations
Invariant shape descriptor
1. Invariant to surface deformation and parametrization
2. One-shot parametrization
3. Use as cut graphs
Invariant shape descriptor
3D video encoding
Invariant surface-based descriptor for 3D video encoding
[Tung et al., ACCV’12] [Tung et al., Vis. Comp.’14]
Dynamic Surface Alignment [CVPR10] [PAMI14]
Point-to-point surface alignment
For:
– Shape matching
(retrieval, comparison)
– Motion tracking
– Texture transfer
– …
– 3D video encoding
– Surface dynamics
Point-to-point surface alignment
Appearance-based
• color, corners, local features
e.g., see [Ahmed et al., CVPR08]
Have to deal with:
- Inconsistent colors from multiple views
- Poor texture (e.g., solid color clothing)
- Surface noise
Usual process
1. Find landmark points
2. Refine (interpolate)
Geometry-based
• local geometry property
• mapping/diffusion functions:
spherical [Starck et al., ICCV05],
embedding [Bronstein et al., TVCG07],
multiple maps [Kim et al., SIGGRAPH11],
spectral matching [Lombaert et al., PAMI13]
• patch deformation [Cagniart et al., ECCV10]
Point-to-point surface alignment
f : S1 → S2 is a surface mapping between S1 and S2,
d is a metric,
f is a diffeomorphism
Geodesic mapping
1. Define landmark points using geometry-
based approach
2. Choose the landmark points with
minimum ambiguity (coarse-to-fine
strategy)
3. Refine by propagation
See preliminary work in [Tung et al., CVPR’10]
Geodesic mapping
Model:
• Define a smooth bijective map between two manifolds
(S1, g1) and (S2, g2)
• g1 and g2 are geodesic distances
Geodesic consistency of v1 ∈ S1 and v2 ∈ S2 :
• Assuming two sets of N points
B1 = {b1,…,bN} ⊂ S1 and B2 = {b’1,…,b’N} ⊂ S2
• ∀i ∈ {1,…,N}, |g1(v1, bi) − g2(v2, b’i)| ≤ ε
Global geodesic distance measures distortion between surface points w.r.t. N points.
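The consistency condition can be checked directly once the geodesic distances to the landmarks are known; the distance values below are toy numbers, not distances computed on real surfaces:

```python
# Checking the geodesic consistency condition above: v1 and v2 correspond if
# their geodesic distances to the N landmark pairs (bi, b'i) agree within eps.
# The distance tables are illustrative; real ones come from geodesics on S1, S2.

def geodesically_consistent(g1_v1_to_B1, g2_v2_to_B2, eps=0.05):
    return all(abs(d1 - d2) <= eps
               for d1, d2 in zip(g1_v1_to_B1, g2_v2_to_B2))

g1 = [0.10, 0.42, 0.77]   # g1(v1, bi) for N = 3 landmarks on S1
g2 = [0.12, 0.40, 0.79]   # g2(v2, b'i) on S2

print(geodesically_consistent(g1, g2))        # True  (all gaps <= 0.05)
print(geodesically_consistent(g1, g2, 0.01))  # False
```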
Geodesic mapping
Overview:
• Bt ⊂ St and Bt+1 ⊂ St+1 are surface extremal points (see [Tung et al., PAMI’12])
• Surface extremal points are critical points of the geodesic distance function
• The geodesic consistency condition can be broken when surfaces undergo non-rigid deformations!
• Ambiguity degree A(v ∈ S) for point localization: measure of the number of points geodesically consistent to v w.r.t. B ⊂ S
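The ambiguity degree can be sketched as a count of geodesically consistent candidates; again the distances are illustrative toy values:

```python
# Ambiguity degree A(v) from the slide above: the number of points on S that
# are geodesically consistent with v w.r.t. the landmark set B. A low A(v)
# means v can be localized unambiguously. Distances are toy values.

def ambiguity_degree(g_v_to_B, g_points_to_B, eps=0.05):
    return sum(
        all(abs(dv - dp) <= eps for dv, dp in zip(g_v_to_B, dists))
        for dists in g_points_to_B
    )

g_v = [0.2, 0.5]                       # g(v, bi) for two landmarks
g_pts = [[0.21, 0.52],                 # consistent with v
         [0.22, 0.48],                 # consistent with v
         [0.60, 0.10]]                 # not consistent

print(ambiguity_degree(g_v, g_pts))    # 2
```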
Geodesic mapping
Recursive mapping:
• Recursively choose Ni points in regions
of low ambiguity w.r.t. N landmarks
• Find corresponding points using N’ ≤ N
(N’ = max number of isoline intersections)
• Set N = Ni
Geodesic mapping
N = Ni
Geodesic mapping
• Refinement by MRF optimization:
Labeling problem
Global geodesic distance D_N w.r.t. B^t = {b_i^t} and B^{t+1} = {b_i^{t+1}}
T_p(l_p) : orientation of (p, l_p)
Geodesic mapping
• Experimental results
Point-to-point surface alignment between consecutive frames
Geodesic mapping point-to-point surface alignment
[Cagniart et al., ECCV10] as ground truth [Spectral method] = [Lombaert et al., PAMI13]
Geodesic mapping point-to-point surface alignment
• Quantitative evaluations vs. [Lombaert et al., PAMI13] and [Kim et al., SIGGRAPH11]
Geodesic mapping
• Topology change
Regions where no topology change occurred are not affected
[Kim et al., SIGGRAPH11]
Geodesic mapping
• Applications
Intrinsic Characterization of
Dynamic Surface [CVPR13] [CVPR14]
Natural object dynamics modeling
• Natural scenes are complex but exhibit statistical regularities
– e.g., water, fire, human actions, etc.
• Dynamics modeling has been used for complex
scene segmentation and classification
– Dynamic textures
• Linear Dynamical Systems (distances, BoS)
[Doretto, IJCV02] [Chan, CVPR05] [Ravichandran, CVPR09]
– Dynamic facial events
• Timing structure of LDS
[Kawashima et al., 2007~2010]
Real-world surface dynamics
•Real-world objects in motion exhibit local deformation statistics
•Observation of intrinsic geometry
[Tung et al., CVPR13]
Real-world surface dynamics
Bouncing sequence
Shape index observation across time [Koenderink, Vis. Comp. ‘92]
•Real-world objects in motion exhibit local deformation statistics
•Observation of intrinsic geometry
Real-world surface dynamics
Samba sequence
Shape index observation across time [Koenderink, Vis. Comp. ‘92]
•Real-world objects in motion exhibit local deformation statistics
•Observation of intrinsic geometry
Intrinsic geometry
• Local topology descriptor (Koenderink shape index):
s = (2/π) · arctan((k2 + k1) / (k2 − k1)) ∈ [−1, 1], where k1, k2 are the principal curvatures (k1 ≤ k2)
The shape index varies continuously with respect to surface deformation.
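The shape index can be computed directly from the two principal curvatures; the formula below is Koenderink's definition, with a few sample evaluations:

```python
import math

# Koenderink shape index from the two principal curvatures k1 <= k2
# [Koenderink, Vis. Comp. '92]; s lies in [-1, 1] and varies continuously
# with surface deformation (undefined at umbilic points where k1 == k2).

def shape_index(k1, k2):
    assert k1 <= k2 and k1 != k2
    return (2.0 / math.pi) * math.atan((k2 + k1) / (k2 - k1))

print(shape_index(-1.0, 1.0))            # 0.0: symmetric saddle
print(round(shape_index(0.0, 1.0), 3))   # 0.5: ridge-like
print(round(shape_index(0.9, 1.0), 3))   # close to 1: dome-like cap
```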
Intrinsic geometry
• The shape index variance average gives information on deformation location and relative magnitude
• However, it does not contain information about acceleration patterns or timing structure
Shape index variance averaged over the sequence.
Surface deformation dynamics
• After surface alignment, surface points can be
tracked across time
• Observation of temporal variations of shape
index at each surface point
• Characterization per surface patch
Free sequence
Surface deformation dynamics
• Dynamics modeling using Hybrid Linear Dynamical
System [Kawashima et al., ICIAP’07]
– Hidden state variable with Markovian dynamics
• Continuous hidden state variable x(t)
• Noisy measurements y(t)
– Linear-Gaussian model:
x(t+1) = Fi x(t) + gi + vi(t)
y(t) = H x(t) + w(t)
• Y = { y(t) } : observations
• X = { x(t) } : hidden states in continuous state space
• Fi : transition matrix that models the dynamics of Di
• H : observation matrix mapping hidden states to system output by linear projection
• gi : bias vector, vi(t) : measurement noise, w(t) : observation noise
[Doretto, IJCV02] [Chan, CVPR05] [Ravichandran, CVPR09]
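A minimal simulation of an LDS of this form (illustrative parameters, not fitted ones) shows how the hidden state and the observations interact:

```python
import numpy as np

# A minimal linear dynamical system of the form described above:
#   x(t+1) = F x(t) + g + v(t)      (hidden state, Markovian dynamics)
#   y(t)   = H x(t) + w(t)          (noisy linear-Gaussian measurement)
# F, H, g and the noise scales are illustrative, not learned parameters.

rng = np.random.default_rng(0)
F = np.array([[0.9, 0.1], [0.0, 0.95]])   # transition matrix
H = np.array([[1.0, 0.0]])                # observation matrix
g = np.array([0.05, 0.0])                 # bias vector

x = np.zeros(2)
ys = []
for t in range(100):
    x = F @ x + g + rng.normal(0, 0.01, size=2)   # process noise v(t)
    y = H @ x + rng.normal(0, 0.02, size=1)       # observation noise w(t)
    ys.append(y[0])

print(len(ys))  # 100 scalar observations of the hidden 2-D state
```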
Surface deformation dynamics
• Dynamics modeling using Hybrid Linear Dynamical
System [Kawashima et al., ICIAP’07]
– Model LDS state durations and transitions (i.e., timing
structure)
Surface deformation dynamics
– Model state durations and transitions (i.e., timing structure)
Bag-of-Systems
• Keypoint classification using bag-of-systems
– Bag-of-feature framework
– Codebook obtained by k-medoid clustering
• Codewords accounting for timing distribution
– Soft-weighting accounting for relative state duration
• Classification using SVM with RBF kernel
– Rigid/non-rigid regions
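The codebook step can be sketched with a tiny k-medoids routine over a precomputed distance matrix; in the real pipeline the distances come from comparing the learned LDS, whereas the matrix below is a toy example:

```python
import numpy as np

# Tiny k-medoids clustering on a precomputed distance matrix, as used to
# build the bag-of-systems codebook. Distances between systems would come
# from an LDS-comparison metric; here they are toy values.

def k_medoids(D, k, iters=20):
    n = D.shape[0]
    medoids = list(np.linspace(0, n - 1, k, dtype=int))  # spread-out init
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)        # nearest medoid
        new_medoids = []
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) == 0:                        # keep old medoid
                new_medoids.append(int(medoids[j]))
                continue
            # New medoid = member minimizing total distance within cluster.
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids.append(int(members[np.argmin(costs)]))
        if new_medoids == list(medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

# Four systems: 0,1 are similar to each other, and so are 2,3.
D = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.90],
              [0.90, 0.85, 0.00, 0.05],
              [0.80, 0.90, 0.05, 0.00]])

medoids, labels = k_medoids(D, 2)
print(labels.tolist())  # [0, 0, 1, 1]: items 0,1 vs items 2,3
```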
Rigidity-based classification
- Collection of N = 4 LDS per patch
- K=8 codewords
- For each sequence: 25% for training, 75% for testing
Compared methods: [Ravichandran 09] [Saisan 01] [Ours]
[Tung et al., CVPR’13]
Timing-based local descriptor
I = {overlapping intervals}
[Tung et al., CVPR’14]
• Preserve the local structure of the surface, such as deformation patterns between neighboring patches
Yi, Yj : observed signals
• Histogram of timing:
Bag-of-Timing paradigm
• Timing of local surface element dynamics are
words of a codebook
– Sparse histogram of dynamic state timings
– Find codewords using k-medoids algorithm
– Soft-weighting of descriptors
• Classification (SVM)/ segmentation of
descriptors
– Different rigidity levels
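The soft-weighting of descriptors can be sketched as follows, with made-up codewords and timings; each observation spreads a unit vote over nearby codewords instead of filling a single bin:

```python
import numpy as np

# Soft-weighted histogram of state timings (the bag-of-timing codebook step):
# each observed duration votes for all codewords with weights decaying with
# distance, instead of a hard single-bin assignment. Values are illustrative.

codewords = np.array([0.1, 0.4, 0.8])          # representative state durations
timings = np.array([0.12, 0.38, 0.41, 0.79])   # observed state durations

hist = np.zeros(len(codewords))
for t in timings:
    w = np.exp(-np.abs(codewords - t) / 0.1)   # soft weights
    hist += w / w.sum()                        # each observation contributes 1

hist /= hist.sum()                             # normalized sparse histogram
print(hist.round(2))                           # bin 1 dominates (two timings near 0.4)
```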
Rigidity-based surface segmentation
Surface dynamics
Rigidity-based surface segmentation
Dynamic face
3D face dataset
Dynamic face
Cardiac datasets
Summary
• 3D video is a markerless surface-capture technique for objects in motion
• State-of-the-art 3D video reconstruction
• Silhouette and stereo fusion
• Topology dictionary for 3D video understanding
- Shape description using Reeb graphs
- Sequence encoding by feature vector clustering
- Probabilistic motion graph model
• Applications: skimming, summarization, annotation, content-based description/encoding.
Summary
• Invariant surface-based descriptor
– Geometry video approach
– Deformation invariant surface cut graph
– Probabilistic formulation
– Applications: 3D video data compression for transfer,
storage.
Summary
• Point-to-point surface alignment of 3D video
data
– Recursive geodesic mapping
– Ambiguity measure
– Competitive with state-of-the-art
Accuracy remains to be improved when topology changes occur
Use other intrinsic maps
Summary
• Deformable surface dynamics modeling
– Intrinsic surface properties are tracked across time
– Dynamics modeled using a set of LDS with timing
structure information (using Hybrid LDS)
– Timing-based local descriptor
– Applications: rigidity classification, segmentation with
respect to rigidity levels
• Deformation learning using a generative model
Multimodal Interaction Dynamics
in Group Discussion
using a Smart Digital Signage [ECCVW12] [HCI13] [THMS14] [ECCVW08] [IJNCR14]
Human-human interaction
• Human-human interactions for ambient systems
supervising human communications
• Multimodal sensing and analysis of multiparty
interaction for high-level understanding of
human interactions
• Speaker diarization / Visual information processing
• Annotation of comprehension and interest level
• New indexing scheme of speech archives
• Interaction-oriented approach (reaction)
• Non-verbal information (backchannels, nodding,
gaze )
Related work
• VACE Multimodal meeting corpus [Chen et al., MLMI’06]
• 6 people (round table)
• 12 stereo camera pairs, 3D Vicon IR system, microphones
• AMI meeting corpus [2007]
• 6 cameras, 24 microphones, whiteboard
• IMADE room (poster) [Kawahara et al., Interspeech’08]
• 1 presenter, 2 listeners
• 6-8 multiview video cameras, motion capture (12 markers on
body and head), eye-tracking system with accelerometer, mic
array (8-19) and headset
Related work
Video capture at IMADE room
Why poster sessions?
• Norm in conferences and open labs
• Mixture of lecture and meeting characteristics
• One main speaker with a small audience
• Real-time feedback (backchannels by audience)
• Interactive
• Audience can ask questions or make comments at any time
• Controllable (knowledge/familiarity) and yet real
Overview
1. Multimodal capture system
2. Audio and Visual information processing
3. Multimodal interaction dynamics modeling
4. Experimental validation
• Joint-attention estimation
Portable multimodal system
• 65” plasma screen
• 19-channel mic array + amplifier
• 6 multiple view video cameras
• Vision camera (UXGA, 25fps), synchronized &
calibrated
• 1 PC with GPU
65” display (160cm width)
200cm
30-40 cm: microphone array
Demo at IEEE ICASSP’12
Multimodal data processing
Audio information processing
• Speaker diarization
• Audio segmentation
• Speaker turns
1. Speech enhancement
2. Two GMM models for classification (256 components):
• Speech
• Noise
3. Training by EM [Gomez et al., IEEE Trans. ASLP 2010]
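The classification step can be sketched with two tiny Gaussian mixtures (the real models use 256 components over spectral features; the parameters below are illustrative):

```python
import numpy as np

# Sketch of the two-model classification step above: a frame (here a scalar
# energy feature) is labeled "speech" or "noise" by comparing log-likelihoods
# under two Gaussian mixture models. Real systems use 256-component GMMs over
# spectral features; this toy version uses 2 hand-set components per model.

def gmm_loglik(x, weights, means, stds):
    comp = weights * np.exp(-0.5 * ((x - means) / stds) ** 2) \
           / (stds * np.sqrt(2 * np.pi))
    return np.log(comp.sum())

speech = (np.array([0.5, 0.5]), np.array([2.0, 4.0]), np.array([0.5, 1.0]))
noise = (np.array([0.5, 0.5]), np.array([0.0, 0.5]), np.array([0.3, 0.5]))

def classify(x):
    return "speech" if gmm_loglik(x, *speech) > gmm_loglik(x, *noise) else "noise"

print(classify(3.0))   # speech
print(classify(0.2))   # noise
```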
Video information processing
• Online head motion tracking (for nodding and turning)
1. Face detection [Viola & Jones, CVPR’01]
• Face feature detection (nose)
2. Depth from stereo
3. Feature tracking using probabilistic model (particle
filter) [ECCVW08] [IJNCR14]
• Likelihood updated with color histograms and depth info
• Cope with missing frames, partial occlusions
Video information processing
System demo at IEEE ICASSP2012
A/V interaction
• Input: temporal data (e.g., head positions)
• Speaker diarization
• Head motion of each subject
• Dynamics modeling using HDS [Kawashima et al.,
NIPSw’10]
• System of LDS
• Transitions using a Finite State Machine
• Timing structure analysis
(event classification, multimodal interaction modeling)
Modeling using HDS
• Linear Dynamical System Di
• Y = { y(t) } : observations
• X = { x(t) } : hidden states in continuous state space
• Fi : transition matrix that models the dynamics of Di
• H : observation matrix that maps hidden states to
system output by linear projection
• gi : bias vector, vi(t) : meas. noise, w(t): obs. noise
Modeling using HDS (cont’d)
• Hybrid LDS
1. N LDS Di
2. FSM with N states: S = { qi }
– (N and LDS parameters are estimated using EM)
• Interval-based representation
• Interval: Ik = < qi , tj >
• Duration: tj = ek - bk
[Kawashima et al., NIPSw’10]
Interaction modeling
• Interaction level between multimodal signals
i.e., number of occurrences of synchronized events wrt time
• The distribution of temporal differences of two signals Yk and Yk’ is modeled by:
Z(Yk, Yk’) = Pr({ bk − bk’ = b, ek − ek’ = e } | {(Ik, Ik’) : [bk, ek] ∩ [bk’, ek’] ≠ ∅})
(Z represents synchronization w.r.t. reaction time)
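The overlap condition in Z can be sketched as follows; only interval pairs with a non-empty intersection contribute, and their begin/end offsets are collected (intervals are toy values):

```python
# Counting synchronized events between two interval sequences, in the spirit
# of the Z statistic above: only interval pairs (Ik, Ik') that overlap in time
# contribute, and for those we collect the begin/end time differences.

def overlaps(a, b):
    # Intervals as (begin, end); non-empty intersection [bk, ek] ∩ [bk', ek'].
    return max(a[0], b[0]) < min(a[1], b[1])

def sync_offsets(Y1, Y2):
    return [(b1 - b2, e1 - e2)
            for (b1, e1) in Y1 for (b2, e2) in Y2
            if overlaps((b1, e1), (b2, e2))]

head_motion = [(0.0, 1.0), (3.0, 4.0)]     # e.g., nodding intervals
speech = [(0.2, 1.5), (6.0, 7.0)]          # e.g., speaker-turn intervals

print(sync_offsets(head_motion, speech))   # [(-0.2, -0.5)]
```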
Experimental results
• Two scenarios with digital signage:
• Poster presentation
• Casual discussion
• Speaker/Audience interaction characterization
• A/V processing
• Multimodal interaction dynamics modeling using 6
states
• Insight about joint-attention
Poster presentation
3min
Multimodal interaction modeling
• IHDS with 6 modes for head motion
• LDS clustering & parameter optimization by EM
• LDS timing structure and speaker turn synchronization
Head motion dynamics vs. speech turns
Joint-attention characterization
• Reaction occurrences to A/V stimuli
Audio stimuli Visual stimuli
Casual discussion
3min
Multimodal interaction modeling
Head motion dynamics vs. speech turns
Synchronized state
distribution
Joint-attention estimation
Audio stimuli Visual stimuli
Summary
• Multimodal system with digital signage (smart
poster) for human-human interaction analysis
• Mic array & multiview video
• Poster presentations (1 presenter, 2-3 listeners)
• Multimodal data interaction
• Speaker diarization & dynamical system modeling
(IHDS)
• Joint-attention in group discussion
• Non-verbal events generate more non-verbal
reactions compared to audio events
tonytung.org