30th コンピュータビジョン勉強会@関東 DynamicFusion


Page 1: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion: Reconstruction and Tracking of Non-Rigid Scenes in Real-Time

Richard A. Newcombe, Dieter Fox, Steven M. Seitz

Hiroki Mizuno July 25 '15

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 (30th Computer Vision Study Group @ Kanto, Part 2: CVPR 2015 reading session)

Page 2: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Why I picked this paper (1/2) • It was the first one that caught my eye


Page 3: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Why I picked this paper (2/2) • The authors are impressive


Richard A. Newcombe

Author of DTAM and KinectFusion; co-founder of Surreal Vision

Dieter Fox

Co-author of Probabilistic Robotics

Steven M. Seitz

A leading figure in the multi-view stereo community

Page 4: 30th コンピュータビジョン勉強会@関東 DynamicFusion

What's DynamicFusion • Paper title:

–  DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time

• Motivation: "How can we generalise KinectFusion to reconstruct and track dynamic, non-rigid scenes in real-time?"

• In a nutshell: "extend KinectFusion to dynamic scenes"


Page 5: 30th コンピュータビジョン勉強会@関東 DynamicFusion


So, to set the stage, let me first go over KinectFusion.

Page 6: 30th コンピュータビジョン勉強会@関東 DynamicFusion

KinectFusion
• KinectFusion: Real-Time Dense Surface Mapping and Tracking
  – Presented at ISMAR 2011 (Best Paper Award)
  – By Richard A. Newcombe et al. (with several co-authors from Microsoft Research)
• Rough summary
  – SLAM that takes Kinect depth maps as input
  – Fuses the depth maps in real time to produce a clean surface


KinectFusion: Real-Time Dense Surface Mapping and Tracking*

Richard A. Newcombe (Imperial College London), Shahram Izadi (Microsoft Research), Otmar Hilliges (Microsoft Research), David Molyneaux (Microsoft Research / Lancaster University), David Kim (Microsoft Research / Newcastle University), Andrew J. Davison (Imperial College London), Pushmeet Kohli (Microsoft Research), Jamie Shotton (Microsoft Research), Steve Hodges (Microsoft Research), Andrew Fitzgibbon (Microsoft Research)

Figure 1: Example output from our system, generated in real-time with a handheld Kinect depth camera and no other sensing infrastructure. Normal maps (colour) and Phong-shaded renderings (greyscale) from our dense reconstruction system are shown. On the left for comparison is an example of the live, incomplete, and noisy data from the Kinect sensor (used as input to our system).

ABSTRACT

We present a system for accurate real-time mapping of complex and arbitrary indoor scenes in variable lighting conditions, using only a moving low-cost depth camera and commodity graphics hardware. We fuse all of the depth data streamed from a Kinect sensor into a single global implicit surface model of the observed scene in real-time. The current sensor pose is simultaneously obtained by tracking the live depth frame relative to the global model using a coarse-to-fine iterative closest point (ICP) algorithm, which uses all of the observed depth data available. We demonstrate the advantages of tracking against the growing full surface model compared with frame-to-frame tracking, obtaining tracking and mapping results in constant time within room sized scenes with limited drift and high accuracy. We also show both qualitative and quantitative results relating to various aspects of our tracking and mapping system. Modelling of natural scenes, in real-time with only commodity sensor and GPU hardware, promises an exciting step forward in augmented reality (AR); in particular, it allows dense surfaces to be reconstructed in real-time, with a level of detail and robustness beyond any solution yet presented using passive computer vision.

Keywords: Real-Time, Dense Reconstruction, Tracking, GPU, SLAM, Depth Cameras, Volumetric Representation, AR

Index Terms: I.3.3 [Computer Graphics]: Picture/Image Generation - Digitizing and Scanning; I.4.8 [Image Processing and Computer Vision]: Scene Analysis - Tracking, Surface Fitting; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Artificial, augmented, and virtual realities

* This work was performed at Microsoft Research.

1 INTRODUCTION

Real-time infrastructure-free tracking of a handheld camera whilst simultaneously mapping the physical scene in high detail promises new possibilities for augmented and mixed reality applications.

In computer vision, research on structure from motion (SFM) and multi-view stereo (MVS) has produced many compelling results, in particular accurate camera tracking and sparse reconstructions (e.g. [10]), and increasingly reconstruction of dense surfaces (e.g. [24]). However, much of this work was not motivated by real-time applications.

Research on simultaneous localisation and mapping (SLAM) has focused more on real-time markerless tracking and live scene reconstruction based on the input of a single commodity sensor: a monocular RGB camera. Such 'monocular SLAM' systems, such as MonoSLAM [8] and the more accurate Parallel Tracking and Mapping (PTAM) system [17], allow researchers to investigate flexible infrastructure- and marker-free AR applications. But while these systems perform real-time mapping, they were optimised for efficient camera tracking, with the sparse point cloud models they produce enabling only rudimentary scene reconstruction.

In the past year, systems have begun to emerge that combine PTAM's handheld camera tracking capability with dense surface MVS-style reconstruction modules, enabling more sophisticated occlusion prediction and surface interaction [19, 26]. Most recently in this line of research, iterative image alignment against dense reconstructions has also been used to replace point features for camera tracking [20]. While this work is very promising for AR, dense scene reconstruction in real-time remains a challenge for passive monocular systems which assume the availability of the right type of camera motion and suitable scene illumination.

But while algorithms for estimating camera pose and extracting geometry from images have been evolving at pace, so have the camera technologies themselves. New depth cameras based either on time-of-flight (ToF) or structured light sensing offer dense measurements of depth in an integrated device. With the arrival of Microsoft's Kinect, such sensing has suddenly reached wide consumer-level accessibility. The opportunities for SLAM and AR

Page 7: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Result


https://www.youtube.com/watch?v=quGhaggn3cQ

Page 8: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Overview:


[Figure 3: Overall system workflow. Input depth map R_k → Measurement (compute surface vertex and normal maps V_k, N_k) → Pose Estimation (ICP of predicted and measured surface, yielding T_g,k) → Update Reconstruction (integrate surface measurement into global TSDF S) → Surface Prediction (ray-cast TSDF to compute the predicted surface V̂_{k-1}, N̂_{k-1} used for the next frame).]

Surface reconstruction update: The global scene fusion process, where, given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model maintained with a volumetric, truncated signed distance function (TSDF) representation.

Surface prediction: Unlike frame-to-frame pose estimation as performed in [15], we close the loop between mapping and localisation by tracking the live depth frame against the globally fused model. This is performed by raycasting the signed distance function into the estimated frame to provide a dense surface prediction against which the live depth map is aligned.

Sensor pose estimation: Live sensor tracking is achieved using a multi-scale ICP alignment between the predicted surface and current sensor measurement. Our GPU based implementation uses all the available data at frame-rate.

3.1 Preliminaries

We represent the live 6DOF camera pose estimated for a frame at time k by a rigid body transformation matrix:

$$T_{g,k} = \begin{bmatrix} R_{g,k} & t_{g,k} \\ 0^{\top} & 1 \end{bmatrix} \in \mathbb{SE}_3, \quad (1)$$

where the Euclidean group $\mathbb{SE}_3 := \{R, t \mid R \in \mathbb{SO}_3, t \in \mathbb{R}^3\}$. This maps the camera coordinate frame at time k into the global frame g, such that a point p_k ∈ R^3 in the camera frame is transferred into the global co-ordinate frame via p_g = T_g,k p_k.

We will also use a single constant camera calibration matrix K that transforms points on the sensor plane into image pixels. The function q = π(p) performs perspective projection of p ∈ R^3 = (x, y, z)^T including dehomogenisation to obtain q ∈ R^2 = (x/z, y/z)^T. We will also use a dot notation to denote homogeneous vectors: u̇ := (u^T | 1)^T.
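As a small illustration of these conventions (my own sketch, not code from the paper), the rigid-body transform of Eq. (1) and the projection π amount to a few lines of NumPy; the intrinsics K below are made-up example values.

```python
import numpy as np

def to_global(T_gk, p_k):
    """p_g = T_g,k * p_k : map a camera-frame point into the global frame (Eq. 1)."""
    return T_gk[:3, :3] @ p_k + T_gk[:3, 3]

def pi(p):
    """Perspective projection with dehomogenisation: (x, y, z) -> (x/z, y/z)."""
    return p[:2] / p[2]

# Example: camera placed 1 m along global z, observing a point 1.5 m in front of it.
T_gk = np.eye(4); T_gk[2, 3] = 1.0
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])  # assumed intrinsics
p_k = np.array([0.1, -0.2, 1.5])
p_g = to_global(T_gk, p_k)     # the point expressed in the global frame
pixel = pi(K @ p_k)            # its image pixel, i.e. pi(K p_k)
print(p_g, pixel)
```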

3.2 Surface Measurement

At time k a measurement comprises a raw depth map R_k which provides calibrated depth measurements R_k(u) ∈ R at each image pixel u = (u, v)^T in the image domain u ∈ U ⊂ R^2, such that p_k = R_k(u) K^{-1} u̇ is a metric point measurement in the sensor frame of reference k. We apply a bilateral filter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise D_k,

$$D_k(u) = \frac{1}{W_p} \sum_{q \in \mathcal{U}} \mathcal{N}_{\sigma_s}(\|u - q\|_2)\, \mathcal{N}_{\sigma_r}(\|R_k(u) - R_k(q)\|_2)\, R_k(q), \quad (2)$$

where $\mathcal{N}_{\sigma}(t) = \exp(-t^2 \sigma^{-2})$ and W_p is a normalizing constant.

We back-project the filtered depth values into the sensor's frame of reference to obtain a vertex map V_k,

$$V_k(u) = D_k(u)\, K^{-1} \dot{u}. \quad (3)$$

Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neighbouring map vertices, the corresponding normal vectors,

$$N_k(u) = \nu\big[(V_k(u+1, v) - V_k(u, v)) \times (V_k(u, v+1) - V_k(u, v))\big], \quad (4)$$

where $\nu[x] = x / \|x\|_2$.

[Figure 4: A slice through the truncated signed distance volume showing the truncated function F > μ (white), the smooth distance field around the surface interface F = 0, and voxels that have not yet had a valid measurement (grey), as detailed in Eqn. 9.]

We also define a vertex validity mask: M_k(u) ↦ 1 for each pixel where a depth measurement transforms to a valid vertex; otherwise, if a depth measurement is missing, M_k(u) ↦ 0. The bilateral filtered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5.
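A minimal NumPy sketch of Equations (3) and (4) follows; this is my own CPU illustration, not the paper's GPU implementation, and the intrinsics in the usage line are assumed example values.

```python
import numpy as np

def vertex_and_normal_maps(depth, K):
    """Back-project a depth map into a vertex map (Eq. 3) and derive per-pixel
    normals from neighbouring vertices (Eq. 4). depth: (H, W) array in metres."""
    H, W = depth.shape
    Kinv = np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ Kinv.T                      # K^-1 u_dot for every pixel
    V = depth[..., None] * rays              # V_k(u) = D_k(u) K^-1 u_dot

    # Forward differences along u and v, then cross product and normalisation (Eq. 4).
    dVdu = np.zeros_like(V); dVdv = np.zeros_like(V)
    dVdu[:, :-1] = V[:, 1:] - V[:, :-1]
    dVdv[:-1, :] = V[1:, :] - V[:-1, :]
    N = np.cross(dVdu, dVdv)
    norm = np.linalg.norm(N, axis=-1, keepdims=True)
    N = np.where(norm > 0, N / np.maximum(norm, 1e-12), 0.0)
    return V, N

# Example usage with synthetic data: a flat plane 1 m from the camera.
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
V, N = vertex_and_normal_maps(np.full((480, 640), 1.0), K)
```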

We compute an L = 3 level multi-scale representation of the surface measurement in the form of a vertex and normal map pyramid. First a depth map pyramid D^{l ∈ [1...L]} is computed. Setting the bottom depth map pyramid level equal to the original bilateral filtered depth map, the sub-sampled version D^{l+1} is computed from D^l by block averaging followed by sub-sampling to half the resolution. Depth values are used in the average only if they are within 3σ_r of the central pixel to ensure smoothing does not occur over depth boundaries. Subsequently each level in a vertex and normal map pyramid V^{l ∈ [1...L]}, N^{l ∈ [1...L]} is computed with Equations 3 and 4 using the corresponding depth map level. We note that given the camera-to-global co-ordinate frame transform T_g,k associated with the surface measurement, the global frame vertex is V^g_k(u) = T_g,k V̇_k(u) and the equivalent mapping of normal vectors into the global frame is N^g_k(u) = R_g,k N_k(u).
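The pyramid construction can be sketched as below (an illustrative NumPy version; the 2x2 block size and the choice of the block's first pixel as the reference for the 3σ_r test are my assumptions, since a 2x2 block has no single central pixel):

```python
import numpy as np

def depth_pyramid(D, levels=3, sigma_r=0.03):
    """Build an L-level depth pyramid by 2x2 block averaging with a 3*sigma_r
    validity test against the block's reference pixel (illustrative choice)."""
    pyr = [D]
    for _ in range(levels - 1):
        D = pyr[-1]
        H, W = (D.shape[0] // 2) * 2, (D.shape[1] // 2) * 2
        blocks = (D[:H, :W]
                  .reshape(H // 2, 2, W // 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(H // 2, W // 2, 4))
        ref = blocks[..., :1]                             # reference depth per block
        valid = np.abs(blocks - ref) <= 3.0 * sigma_r     # reject values across depth edges
        summed = np.where(valid, blocks, 0.0).sum(-1)
        count = valid.sum(-1)
        pyr.append(summed / np.maximum(count, 1))
    return pyr

pyr = depth_pyramid(np.random.uniform(0.5, 3.0, (480, 640)))
```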

3.3 Mapping as Surface Reconstruction

Each consecutive depth frame, with an associated live camera pose estimate, is fused incrementally into one single 3D reconstruction using the volumetric truncated signed distance function (TSDF) [7]. In a true signed distance function, the value corresponds to the signed distance to the closest zero crossing (the surface interface), taking on positive and increasing values moving from the visible surface into free space, and negative and decreasing values on the non-visible side. The result of averaging the SDFs of multiple 3D point clouds (or surface measurements) that are aligned into a global frame is a global surface fusion.

An example given in Figure 4 demonstrates how the TSDF allows us to represent arbitrary genus surfaces as zero crossings within the volume. We will denote the global TSDF that contains a fusion of the registered depth measurements from frames 1...k as S_k(p), where p ∈ R^3 is a global frame point in the 3D volume to be reconstructed. A discretization of the TSDF with a specified resolution is stored in global GPU memory where all processing will reside. From here on we assume a fixed bijective mapping between voxel/memory elements and the continuous TSDF representation and will refer only to the continuous TSDF S. Two components are

Figure 9: Interactive simulation of particle physics on dynamic scenes. Particles interact with the dynamically moving foreground user, whilst the physical camera and user move. The user can collide with the physics-enabled virtual objects.

Figure 10: Enabling touch input on arbitrary surfaces with a moving camera. A) Live RGB. B) Composited view with segmented hand and single finger touching a curved surface. C) Rendered as surface normals. D) Single finger drawing on a curved surface. E) Multi-touch on a regular planar book surface. F) Multi-touch on an arbitrarily shaped surface.

demonstrated in the interactive tabletop community, to reach any surface in the user's environment (see Figure 10).

GPU IMPLEMENTATION

Our approach for real-time camera tracking and surface reconstruction is based on two well-studied algorithms [1, 5, 24], which have been designed from the ground-up for parallel execution on the GPU. A full formulation of our method is provided in [21], as well as quantitative evaluation of reconstruction performance. The focus of this section is on the implementation of our novel core and extended GPU pipeline, which is critical in enabling interactive rates. The main system pipeline consists of four main stages (labeled appropriately in Figure 11):

a) Depth Map Conversion: The live depth map is converted from image coordinates into 3D points (referred to as vertices) and normals in the coordinate space of the camera.

b) Camera Tracking: In the tracking phase, a rigid 6DOF transform is computed to closely align the current oriented points with the previous frame, using a GPU implementation of the Iterative Closest Point (ICP) algorithm [1]. Relative transforms are incrementally applied to a single transform that defines the global pose of the Kinect.

c) Volumetric Integration: Instead of fusing point clouds or creating a mesh, we use a volumetric surface representation based on [5]. Given the global pose of the camera, oriented points are converted into global coordinates, and a single 3D voxel grid is updated. Each voxel stores a running average of its distance to the assumed position of a physical surface.

d) Raycasting: Finally, the volume is raycast to extract views of the implicit surface, for rendering to the user. When using the global pose of the camera, this raycasted view of the volume also equates to a synthetic depth map, which can be used as a less noisy, more globally consistent reference frame for the next iteration of ICP. This allows tracking by aligning the current live depth map with our less noisy raycasted view of the model, as opposed to using only the live depth maps frame-to-frame.

[Figure 11: Overview of tracking and reconstruction pipeline from raw depth map to rendered view of 3D scene.]

Each of these steps is executed in parallel on the GPU using the CUDA language. We describe each of these steps of the pipeline in the following sections.

Depth Map Conversion

At time i, each CUDA thread operates in parallel on a separate pixel u = (x, y) in the incoming depth map D_i(u). Given the intrinsic calibration matrix K (of the Kinect infrared camera), each GPU thread reprojects a specific depth measurement as a 3D vertex in the camera's coordinate space as follows: v_i(u) = D_i(u) K^{-1} [u, 1]. This results in a single vertex map V_i computed in parallel. Corresponding normal vectors for each vertex are computed by each GPU thread using neighboring reprojected points: n_i(u) = (v_i(x+1, y) - v_i(x, y)) × (v_i(x, y+1) - v_i(x, y)) (normalized to unit length n_i / ||n_i||). This results in a single normal map N_i computed in parallel. The 6DOF camera pose at time i is a rigid body transform matrix T_i = [R_i | t_i] containing a 3x3 rotation matrix (R_i) and a 3D translation vector (t_i). Given this transform, a vertex and normal can be converted into global coordinates: v^g_i(u) = T_i v_i(u) and n^g_i(u) = R_i n_i(u) respectively.

Camera Tracking

ICP is a popular and well-studied algorithm for 3D shape alignment (see [24] for a detailed study). In KinectFusion, ICP is instead leveraged to track the camera pose for each new depth frame, by estimating a single 6DOF transform that closely aligns the current oriented points with those of the previous frame. This gives a relative 6DOF transform T_rel which can be incrementally applied together to give the single global camera pose T_i. The important first step of ICP is to find correspondences between the current oriented points at time i with the previous at i-1. In our system we use projective data association [24] to find these correspondences. This part of the GPU-based algorithm is shown as pseudocode in Listing 1. Given the previous global camera pose T_{i-1}, each GPU thread transforms a unique point v_{i-1} into camera coordinate space, and perspective projects it into image coordinates. It then uses this 2D point as a lookup into the current vertex (V_i) and normal
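A rough NumPy rendering of that per-pixel lookup follows (Listing 1 itself is not reproduced in this excerpt, so the function below is my own sketch; the distance threshold is an assumed value and the normal-compatibility test is omitted for brevity):

```python
import numpy as np

def projective_data_association(V_model_g, T_est, K, V_live, dist_thresh=0.1):
    """Frame-to-model projective data association (sketch).
    V_model_g: (N, 3) predicted model vertices in the global frame.
    T_est:     4x4 current camera-to-global pose estimate.
    V_live:    (H, W, 3) live vertex map in the camera frame.
    Returns a list of (model_index, (row, col)) candidate correspondences."""
    H, W, _ = V_live.shape
    # Bring the model vertices into the estimated camera frame and project them.
    p_h = np.concatenate([V_model_g, np.ones((V_model_g.shape[0], 1))], axis=-1)
    p_cam = (np.linalg.inv(T_est) @ p_h.T).T[:, :3]
    uvw = (K @ p_cam.T).T
    z = np.maximum(uvw[:, 2], 1e-9)
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    in_img = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    matches = []
    for i in np.flatnonzero(in_img):
        live_cam = V_live[v[i], u[i]]                      # looked-up live vertex
        live_g = T_est[:3, :3] @ live_cam + T_est[:3, 3]   # lift it into the global frame
        if np.linalg.norm(live_g - V_model_g[i]) < dist_thresh:  # distance compatibility test
            matches.append((i, (v[i], u[i])))
    return matches
```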


Page 9: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Mapping as Surface Reconstruction


Page 10: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Mapping as Surface Reconstruction (1/4)
• Assuming the camera pose is known, fuse the live frame into the global scene (fusion)
• Scene representation:
  – Volumetric Truncated Signed Distance Function (TSDF)※
  – A volume data structure
  – Zero at the surface; each voxel stores the signed distance to the surface (a minimal sketch follows the reference below)


※ B. Curless and M. Levoy, "A Volumetric Method for Building Complex Models from Range Images," in Proc. SIGGRAPH, 1996.
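A minimal sketch of such a volume data structure, assuming a cubic grid that stores a (signed distance, weight) pair per voxel; the class name, resolution, and truncation value are illustrative, not from the paper:

```python
import numpy as np

class TSDFVolume:
    """Dense voxel grid storing a truncated signed distance F and a weight W
    per voxel. The surface is the zero level set of F."""
    def __init__(self, resolution=256, size_m=3.0, trunc_m=0.03):
        self.res = resolution
        self.voxel_size = size_m / resolution     # metres per voxel
        self.trunc = trunc_m                      # truncation band (mu)
        self.F = np.ones((resolution,) * 3, dtype=np.float32)   # initialised as far-from-surface
        self.W = np.zeros((resolution,) * 3, dtype=np.float32)  # zero weight = no data yet

    def voxel_centers(self):
        """Global-frame coordinates of every voxel center, shape (res^3, 3)."""
        idx = np.indices((self.res,) * 3).reshape(3, -1).T
        return (idx + 0.5) * self.voxel_size

vol = TSDFVolume(resolution=128)
```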

Page 11: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Mapping as Surface Reconstruction (2/4)
• TSDF overview
  – An algorithm for merging depth maps from multiple viewpoints
  – The depth map obtained from one viewpoint is converted into a TSDF
  – Depth maps from other viewpoints are converted into TSDFs in the same way
  – Taking a weighted sum of these merges the depth maps from all viewpoints
  – Once every viewpoint has been integrated, the surface can be extracted with Marching Cubes or similar


[Figure 2. Unweighted signed distance functions in 3D. (a) A range sensor looking down the x-axis observes a range image, shown here as a reconstructed range surface. Following one line of sight down the x-axis, we can generate a signed distance function as shown. The zero crossing of this function is a point on the range surface. (b) The range sensor repeats the measurement, but noise in the range sensing process results in a slightly different range surface. In general, the second surface would interpenetrate the first, but we have shown it as an offset from the first surface for purposes of illustration. Following the same line of sight as before, we obtain another signed distance function. By summing these functions, we arrive at a cumulative function with a new zero crossing positioned midway between the original range measurements.]

signed distance of each point to the nearest range surface along the line of sight to the sensor. We construct this function by combining signed distance functions d_1(x), d_2(x), ..., d_n(x) and weight functions w_1(x), w_2(x), ..., w_n(x) obtained from range images 1, ..., n. Our combining rules give us for each voxel a cumulative signed distance function, D(x), and a cumulative weight W(x). We represent these functions on a discrete voxel grid and extract an isosurface corresponding to D(x) = 0. Under a certain set of assumptions, this isosurface is optimal in the least squares sense. A full proof of this optimality is beyond the scope of this paper, but a sketch appears in appendix A.

Figure 2 illustrates the principle of combining unweighted signed distances for the simple case of two range surfaces sampled from the same direction. Note that the resulting isosurface would be the surface created by averaging the two range surfaces along the sensor's lines of sight. In general, however, weights are necessary to represent variations in certainty across the range surfaces. The choice of weights should be specific to the range scanning technology. For optical triangulation scanners, for example, Soucy [25] and Turk [30] make the weight depend on the dot product between each vertex normal and the viewing direction, reflecting greater uncertainty when the illumination is at grazing angles to the surface. Turk also argues that the range data at the boundaries of the mesh typically have greater uncertainty, requiring more down-weighting. We adopt these same weighting schemes for our optical triangulation range data.

Figure 3 illustrates the construction and usage of the signed distance and weight functions in 1D. In Figure 3a, the sensor is positioned at the origin looking down the +x axis and has taken two measurements, r_1 and r_2. The signed distance profiles, d_1(x) and d_2(x), may extend indefinitely in either direction, but the weight functions, w_1(x) and w_2(x), taper off behind the range points for reasons discussed below.

Figure 3b is the weighted combination of the two profiles. The combination rules are straightforward:

$$D(x) = \frac{\sum_i w_i(x)\, d_i(x)}{\sum_i w_i(x)}, \quad (1)$$

$$W(x) = \sum_i w_i(x), \quad (2)$$

[Figure 3. Signed distance and weight functions in one dimension. (a) The sensor looks down the x-axis and takes two measurements, r_1 and r_2. d_1(x) and d_2(x) are the signed distance profiles, and w_1(x) and w_2(x) are the weight functions. In 1D, we might expect two sensor measurements to have the same weight magnitudes, but we have shown them to be of different magnitude here to illustrate how the profiles combine in the general case. (b) D(x) is a weighted combination of d_1(x) and d_2(x), and W(x) is the sum of the weight functions. Given this formulation, the zero-crossing, R, becomes the weighted combination of r_1 and r_2 and represents our best guess of the location of the surface. In practice, we truncate the distance ramps and weights to the vicinity of the range points.]

where d_i(x) and w_i(x) are the signed distance and weight functions from the i-th range image.

Expressed as an incremental calculation, the rules are:

$$D_{i+1}(x) = \frac{W_i(x)\, D_i(x) + w_{i+1}(x)\, d_{i+1}(x)}{W_i(x) + w_{i+1}(x)}, \quad (3)$$

$$W_{i+1}(x) = W_i(x) + w_{i+1}(x), \quad (4)$$

where D_i(x) and W_i(x) are the cumulative signed distance and weight functions after integrating the i-th range image.

In the special case of one dimension, the zero-crossing of the cumulative function is at a range, R, given by:

$$R = \frac{\sum_i w_i\, r_i}{\sum_i w_i}, \quad (5)$$

i.e., a weighted combination of the acquired range values, which is what one would expect for a least squares minimization.
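The incremental rules (3)-(5) are easy to check numerically. Below is a small 1-D sketch (my own illustration with made-up range values and ramp width) showing that the zero-crossing of the fused function lands at the weighted mean of the range measurements:

```python
import numpy as np

x = np.linspace(0.0, 2.0, 2001)          # positions along the sensor's line of sight
D = np.zeros_like(x)                     # cumulative signed distance D(x)
W = np.zeros_like(x)                     # cumulative weight W(x)

def integrate(D, W, r, w, ramp=0.05):
    """Fold one range measurement r with weight w into (D, W) via Eqs. 3 and 4.
    The distance ramp is truncated to +/- ramp around r; the sign convention
    (negative in front of the surface, positive behind) does not change Eq. 5."""
    d = np.clip(x - r, -ramp, ramp)
    valid = np.abs(x - r) <= ramp
    D_new = np.where(valid, (W * D + w * d) / np.maximum(W + w * valid, 1e-12), D)
    W_new = W + w * valid
    return D_new, W_new

for r, w in [(1.00, 1.0), (1.02, 3.0)]:  # two noisy measurements of one surface
    D, W = integrate(D, W, r, w)

zero = x[np.argmin(np.abs(np.where(W > 0, D, np.inf)))]
print(zero)  # ~1.015 = (1.0*1.00 + 3.0*1.02) / 4.0, the weighted mean of Eq. 5
```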

In principle, the distance and weighting functions should extend indefinitely in either direction. However, to prevent surfaces on opposite sides of the object from interfering with each other, we force the weighting function to taper off behind the surface. There is a trade-off involved in choosing where the weight function tapers off. It should persist far enough behind the surface to ensure that all distance ramps will contribute in the vicinity of the final zero crossing, but it should also be as narrow as possible to avoid influencing surfaces on the other side. To meet these requirements, we force the weights to fall off at a distance equal to half the maximum uncertainty interval of the range measurements. Similarly, the signed distance and weight functions need not extend far in front of the surface. Restricting the functions to the vicinity of the surface yields a more compact representation and reduces the computational expense of updating the volume.

In two and three dimensions, the range measurements correspond to curves or surfaces with weight functions, and the signed distance ramps have directions that are consistent with the primary directions of sensor uncertainty. The uncertainties that apply to range image integration include errors in alignment between meshes as well as errors inherent in the scanning technology. A number of algorithms for aligning sets of range images have been explored and shown to yield excellent results [11][30]. The remaining error lies in the scanner itself. For optical triangulation scanners, for example, this error has been shown to be ellipsoidal about the range points, with the major axis of the ellipse aligned with the lines of sight of the laser [13][24].

Figure 4 illustrates the two-dimensional case for a range curve derived from a single scan containing a row of range samples.

[Figure 4. Combination of signed distance and weight functions in two dimensions. (a) and (d) are the signed distance and weight functions, respectively, generated for a range image viewed from the sensor line of sight shown in (d). The signed distance functions are chosen to vary between D_min and D_max, as shown in (a). The weighting falls off with increasing obliquity to the sensor and at the edges of the meshes as indicated by the darker regions in (e). The normals, n_1 and n_2, shown in (e), are oriented at a grazing angle and facing the sensor, respectively. Note how the weighting is lower (darker) for the grazing normal. (b) and (e) are the signed distance and weight functions for a range image of the same object taken at a 60 degree rotation. (c) is the signed distance function D(x) corresponding to the per-voxel weighted combination of (a) and (b) constructed using equations 3 and 4. (f) is the sum of the weights at each voxel, W(x). The dotted green curve in (c) is the isosurface that represents our current estimate of the shape of the object.]

In practice, we use a fixed point representation for the signed distance function, which bounds the values to lie between D_min and D_max as shown in the figure. The values of D_min and D_max must be negative and positive, respectively, as they are on opposite sides of a signed distance zero-crossing.

For three dimensions, we can summarize the whole algorithm as follows. First, we set all voxel weights to zero, so that new data will overwrite the initial grid values. Next, we tessellate each range image by constructing triangles from nearest neighbors on the sampled lattice. We avoid tessellating over step discontinuities (cliffs in the range map) by discarding triangles with edge lengths that exceed a threshold. We must also compute a weight at each vertex as described above.

Once a range image has been converted to a triangle mesh with a weight at each vertex, we can update the voxel grid. The signed distance contribution is computed by casting a ray from the sensor through each voxel near the range surface and then intersecting it with the triangle mesh, as shown in figure 5. The weight is computed by linearly interpolating the weights stored at the intersection triangle's vertices. Having determined the signed distance and weight we can apply the update formulae described in equations 3 and 4.

At any point during the merging of the range images, we can extract the zero-crossing isosurface from the volumetric grid. We restrict this extraction procedure to skip samples with zero weight, generating triangles only in the regions of observed data. We will relax this restriction in the next section.

4 Hole filling

The algorithm described in the previous section is designed to reconstruct the observed portions of the surface. Unseen portions of the surface will appear as holes in the reconstruction. While this result is an accurate representation of the known surface, the holes are esthetically unsatisfying and can present a stumbling block to follow-on algorithms that expect continuous meshes. In [17], for example, the authors describe a method for parameterizing patches that entails generating evenly spaced grid lines by walking across the edges of a mesh. Gaps in the mesh prevent the algorithm from creating a fair parameterization. As another example, rapid prototyping technologies such as stereolithography typically require a "watertight" model in order to construct a solid replica [7].

[Figure 5. Sampling the range surface to update the volume. We compute the weight, w, and signed distance, d, needed to update the voxel by casting a ray from the sensor, through the voxel onto the range surface. We obtain the weight, w, by linearly interpolating the weights (w_a, w_b, and w_c) stored at neighboring range vertices. Note that for a translating sensor (like our Cyberware scanner), the sensor point is different for each column of range points.]

One option for filling holes is to operate on the reconstructed mesh. If the regions of the mesh near each hole are very nearly planar, then this approach works well. However, holes in the meshes can be (and frequently are) highly non-planar and may even require connections between unconnected components. Instead, we offer a hole filling approach that operates on our volume, which contains more information than the reconstructed mesh.

The key to our algorithm lies in classifying all points in the volume as being in one of three states: unseen, empty, or near the surface. Holes in the surface are indicated by frontiers between unseen regions and empty regions (see Figure 6). Surfaces placed at these frontiers offer a plausible way to plug these holes (dotted in Figure 6). Obtaining this classification and generating these hole fillers leads to a straightforward extension of the algorithm described in the previous section:

1. Initialize the voxel space to the "unseen" state.

2. Update the voxels near the surface as described in the previous section. As before, these voxels take on continuous signed distance and weight values.

3. Follow the lines of sight back from the observed surface and mark the corresponding voxels as "empty". We refer to this step as space carving.

4. Perform an isosurface extraction at the zero-crossing of the signed distance function. Additionally, extract a surface between regions seen to be empty and regions that remain unseen.

In practice, we represent the unseen and empty states using the function and weight fields stored on the voxel lattice. We represent the unseen state with the function values D(x) = D_max, W(x) = 0, and the empty state with the function values D(x) = D_min, W(x) = 0, as shown in Figure 6b. The key advantage of this representation is that we can use the same isosurface extraction algorithm we used in the previous section without the restriction on interpolating voxels of zero weight. This extraction finds both the signed distance and hole fill isosurfaces and connects them naturally where they meet, i.e., at the corners in Figure 6a where the dotted red line meets the dashed green line. Note that the triangles that arise from
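A toy sketch of that three-state classification, assuming the (D, W) encoding just described (unseen: D = D_max, W = 0; empty: D = D_min, W = 0); the array values below are invented for illustration:

```python
import numpy as np

D_MIN, D_MAX = -0.05, 0.05   # fixed-point bounds of the signed distance (assumed values)

def classify(D, W):
    """Label each voxel as 'unseen', 'empty', or 'near_surface' from the stored
    signed distance D and weight W, following the encoding described above."""
    labels = np.full(D.shape, "near_surface", dtype=object)
    labels[(W == 0) & (D >= D_MAX)] = "unseen"
    labels[(W == 0) & (D <= D_MIN)] = "empty"
    return labels

# Hole-fill surfaces would then be placed on frontiers where an 'unseen'
# voxel neighbours an 'empty' one.
D = np.array([D_MAX, D_MAX, 0.01, -0.02, D_MIN, D_MIN])
W = np.array([0.0,   0.0,   2.0,  3.0,   0.0,   0.0])
print(classify(D, W))
```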


Page 12: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Mapping as Surface Reconstruction (3/4) • Converting a live frame (depth map) into a TSDF


Two components are stored at each location of the TSDF: the current truncated signed distance value F_k(p) and a weight W_k(p),

$$S_k(p) \mapsto [F_k(p), W_k(p)]. \quad (5)$$

(Presenter's annotation: F_k(p) is the TSDF value at point p, and W_k(p) is the weight at point p.)

A dense surface measurement (such as the raw depth map R_k) provides two important constraints on the surface being reconstructed. First, assuming we can truncate the uncertainty of a depth measurement such that the true value lies within ±μ of the measured value, then for a distance r from the camera center along each depth map ray, r < (λR_k(u) - μ) is a measurement of free space (here λ = ||K^{-1} u̇||_2 scales the measurement along the pixel ray). Second, we assume that no surface information is obtained in the reconstruction volume at r > (λR_k(u) + μ) along the camera ray. Therefore the SDF need only represent the region of uncertainty where the surface measurement exists, |r - λR_k(u)| ≤ μ. A TSDF allows the asymmetry between free space, uncertain measurement and unknown areas to be represented. Points that are within visible space at distance greater than μ from the nearest surface interface are truncated to a maximum distance μ. Non-visible points farther than μ from the surface are not measured. Otherwise the SDF represents the distance to the nearest surface point.

Although efficient algorithms exist for computing the true discrete SDF for a given set of point measurements (complexity is linear in the number of voxels) [22], sophisticated implementations are required to achieve top performance on GPU hardware, without which real-time computation is not possible for a reasonable size volume. Instead, we use a projective truncated signed distance function that is readily computed and trivially parallelisable. For a raw depth map R_k with a known pose T_g,k, its global frame projective TSDF [F_Rk, W_Rk] at a point p in the global frame g is computed as,

$$F_{R_k}(p) = \Psi\big(\lambda^{-1}\|t_{g,k} - p\|_2 - R_k(x)\big), \quad (6)$$

$$\lambda = \|K^{-1}\dot{x}\|_2, \quad (7)$$

$$x = \big\lfloor \pi\big(K\,T_{g,k}^{-1}\,p\big) \big\rfloor, \quad (8)$$

$$\Psi(\eta) = \begin{cases} \min\!\big(1, \tfrac{\eta}{\mu}\big)\,\mathrm{sgn}(\eta) & \text{iff } \eta \ge -\mu \\ \text{null} & \text{otherwise.} \end{cases} \quad (9)$$

We use a nearest neighbour lookup ⌊·⌋ instead of interpolating the depth value, to prevent smearing of measurements at depth discontinuities. 1/λ converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using distances along the ray or along the optical axis). Ψ performs the SDF truncation. The truncation function is scaled to ensure that a surface measurement (zero crossing in the SDF) is represented by at least one non-truncated voxel value in the discretised volume either side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight W_Rk(p) is proportional to cos(θ)/R_k(x), where θ is the angle between the associated pixel ray direction and the surface normal measurement in the local frame.

The projective TSDF measurement is only correct exactly at the surface F_Rk(p) = 0 or if there is only a single point measurement in isolation. When a surface is present the closest point along a ray could be another surface point not on the ray associated with the pixel in Equation 8. It has been shown that for points close to the surface, a correction can be applied by scaling the SDF by cos(θ) [11]. However, we have found that approximation within the truncation region for 100s or more fused TSDFs from multiple viewpoints (as performed here) converges towards an SDF with a pseudo-Euclidean metric that does not hinder mapping and tracking performance.
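Equations (6)-(9) can be sketched on the CPU as below (this is my own illustration, not the paper's CUDA kernel). The sign is chosen so that free space is positive, as in the TSDF description of Section 3.3, the truncation is implemented as the common clamped reading of Eq. (9), and the truncation band mu is an assumed number:

```python
import numpy as np

def projective_tsdf(points_g, depth, K, T_gk, mu=0.03):
    """Projective TSDF of one depth map, in the spirit of Eqs. 6-9.
    points_g: (N, 3) global-frame query points (e.g. voxel centers).
    Returns values in [-1, 1], or NaN where the measurement is 'null'.
    Sign convention: positive in free space, negative behind the surface."""
    H, W = depth.shape
    t_gk = T_gk[:3, 3]

    # Eq. 8: project each point into the depth image (nearest-neighbour lookup).
    p_h = np.concatenate([points_g, np.ones((points_g.shape[0], 1))], axis=-1)
    p_cam = (np.linalg.inv(T_gk) @ p_h.T).T[:, :3]
    pix = (K @ p_cam.T).T
    x = np.floor(pix[:, 0] / np.maximum(pix[:, 2], 1e-9)).astype(int)
    y = np.floor(pix[:, 1] / np.maximum(pix[:, 2], 1e-9)).astype(int)
    visible = (p_cam[:, 2] > 0) & (x >= 0) & (x < W) & (y >= 0) & (y < H)

    F = np.full(points_g.shape[0], np.nan)
    xv, yv = x[visible], y[visible]

    # Eq. 7: lambda converts distance along the pixel ray into depth.
    rays = np.linalg.inv(K) @ np.stack([xv, yv, np.ones_like(xv)]).astype(float)
    lam = np.linalg.norm(rays, axis=0)

    # Eq. 6: measured surface depth minus the query point's depth along the ray.
    eta = depth[yv, xv] - np.linalg.norm(t_gk - points_g[visible], axis=-1) / lam

    # Eq. 9 (clamped reading): truncate to [-1, 1]; NaN ('null') far behind the surface.
    F[visible] = np.where(eta >= -mu, np.clip(eta / mu, -1.0, 1.0), np.nan)
    return F
```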

The global fusion of all depth maps in the volume is formed by the weighted average of all individual TSDFs computed for each depth map, which can be seen as de-noising the global TSDF from multiple noisy TSDF measurements. Under an L2 norm the de-noised (fused) surface results as the zero-crossings of the point-wise SDF F minimising:

$$\min_{F \in \mathcal{F}} \sum_k \|W_{R_k}(F_{R_k} - F)\|_2. \quad (10)$$

Given that the focus of our work is on real-time sensor tracking and surface reconstruction we must maintain interactive frame-rates. (For a 640x480 depth stream at 30fps the equivalent of over 9 million new point measurements are made per second). Storing a weight W_k(p) with each value allows an important aspect of the global minimum of the convex L2 de-noising metric to be exploited for real-time fusion; that the solution can be obtained incrementally as more data terms are added using a simple weighted running average [7], defined point-wise {p | F_Rk(p) ≠ null}:

$$F_k(p) = \frac{W_{k-1}(p)\,F_{k-1}(p) + W_{R_k}(p)\,F_{R_k}(p)}{W_{k-1}(p) + W_{R_k}(p)}, \quad (11)$$

$$W_k(p) = W_{k-1}(p) + W_{R_k}(p). \quad (12)$$

No update on the global TSDF is performed for values resulting from unmeasurable regions specified in Equation 9. While W_k(p) provides weighting of the TSDF proportional to the uncertainty of surface measurement, we have also found that in practice simply letting W_Rk(p) = 1, resulting in a simple average, provides good results. Moreover, by truncating the updated weight over some value W_η,

$$W_k(p) \leftarrow \min(W_{k-1}(p) + W_{R_k}(p),\; W_\eta), \quad (13)$$

a moving average surface reconstruction can be obtained enabling reconstruction in scenes with dynamic object motion.

Although a large number of voxels can be visited that will not project into the current image, the simplicity of the kernel means operation time is memory, not computation, bound and with current GPU hardware over 65 gigavoxels/second (≈ 2 ms per full volume update for a 512³ voxel reconstruction) can be updated. We use 16 bits per component in S(p), although experimentally we have verified that as few as 6 bits are required for the SDF value. Finally, we note that the raw depth measurements are used for TSDF fusion rather than the bilateral filtered version used in the tracking component, described later in section 3.5. The early filtering removes desired high frequency structure and noise alike which would reduce the ability to reconstruct finer scale structures.
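The running-average update of Eqs. (11)-(13) then reduces to a few array operations. This sketch (mine, not the paper's GPU kernel) assumes the current frame's projective TSDF is given with NaN marking null values, and uses W_Rk = 1 as the text suggests:

```python
import numpy as np

def fuse(F_glob, W_glob, F_new, w_new=1.0, w_max=100.0):
    """Point-wise weighted running average of TSDF values (Eqs. 11-13).
    F_new: projective TSDF of the current frame, NaN where 'null'."""
    valid = ~np.isnan(F_new)                 # skip unmeasurable regions (Eq. 9)
    F = F_glob.copy(); W = W_glob.copy()
    F[valid] = (W[valid] * F[valid] + w_new * F_new[valid]) / (W[valid] + w_new)
    W[valid] = np.minimum(W[valid] + w_new, w_max)   # weight cap gives a moving average (Eq. 13)
    return F, W

# Usage: start from an empty volume and fold in successive frames.
F_glob = np.zeros(1000); W_glob = np.zeros(1000)
F_frame = np.where(np.random.rand(1000) < 0.8, np.random.uniform(-1, 1, 1000), np.nan)
F_glob, W_glob = fuse(F_glob, W_glob, F_frame)
```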

3.4 Surface Prediction from Ray Casting the TSDF

With the most up-to-date reconstruction available comes the ability to compute a dense surface prediction by rendering the surface encoded in the zero level set F_k = 0 into a virtual camera with the current estimate T_g,k. The surface prediction is stored as a vertex and normal map V̂_k and N̂_k in frame of reference k and is used in the subsequent camera pose estimation step.

As we have a dense surface reconstruction in the form of a global SDF, a per pixel raycast can be performed [21]. Each pixel's corresponding ray, T_g,k K^{-1} u̇, is marched starting from the minimum depth for the pixel and stopping when a zero crossing (+ve to -ve for a visible surface) is found indicating the surface interface. Marching also stops if a -ve to +ve back face is found, or ultimately when exiting the working volume, both resulting in a non-surface measurement at the pixel u.

For points on or very close to the surface interface F_k(p) = 0 it is assumed that the gradient of the TSDF at p is orthogonal to the zero level set, and so the surface normal for the associated pixel u
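Per-pixel raycasting can be sketched as a fixed-step march through the volume that looks for a +ve to -ve zero crossing; this is a much-simplified CPU illustration with nearest-voxel lookups and assumed near/far bounds, not the paper's implementation:

```python
import numpy as np

def raycast_pixel(tsdf, voxel_size, origin, direction, t_min=0.4, t_max=4.0, step=None):
    """March one camera ray through a TSDF grid and return the first
    +ve -> -ve zero crossing (approximate surface point), or None."""
    step = step or 0.5 * voxel_size
    res = tsdf.shape[0]
    f_prev, t_prev = None, None
    for t in np.arange(t_min, t_max, step):
        p = origin + t * direction
        ijk = np.floor(p / voxel_size).astype(int)       # nearest-voxel lookup
        if np.any(ijk < 0) or np.any(ijk >= res):
            break                                        # left the working volume
        f = tsdf[ijk[0], ijk[1], ijk[2]]
        if np.isnan(f):
            f_prev = None
            continue
        if f_prev is not None and f_prev > 0 and f <= 0:   # visible surface crossing
            # Linear interpolation of the crossing point between the two samples.
            t_star = t_prev + step * f_prev / (f_prev - f)
            return origin + t_star * direction
        if f_prev is not None and f_prev < 0 and f >= 0:   # back face: stop, no surface
            return None
        f_prev, t_prev = f, t
    return None

# Example: a synthetic TSDF of a plane at z = 1.0 m inside a 2 m cube, 128^3 voxels.
res, size = 128, 2.0
z = (np.arange(res) + 0.5) * (size / res)
tsdf = np.clip((1.0 - z)[None, None, :] / 0.05, -1, 1) * np.ones((res, res, 1))
direction = np.array([0.3, 0.2, 1.0]); direction /= np.linalg.norm(direction)
hit = raycast_pixel(tsdf, size / res, origin=np.zeros(3), direction=direction)
print(hit)  # expected to lie near the z = 1.0 plane
```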

stored at each location of the TSDF: the current truncated signeddistance value Fk(p) and a weight Wk(p),

Sk(p) 7! [Fk(p),Wk(p)] . (5)

A dense surface measurement (such as the raw depth map Rk) pro-vides two important constraints on the surface being reconstructed.First, assuming we can truncate the uncertainty of a depth mea-surement such that the true value lies within ±µ of the measuredvalue, then for a distance r from the camera center along each depthmap ray, r < (lRk(u)� µ) is a measurement of free space (herel = kK�1

˙

uk2 scales the measurement along the pixel ray). Sec-ond, we assume that no surface information is obtained in the re-construction volume at r > (lRk(u) + µ) along the camera ray.Therefore the SDF need only represent the region of uncertaintywhere the surface measurement exists |r�lRk(u)| µ . A TSDFallows the asymmetry between free space, uncertain measurementand unknown areas to be represented. Points that are within visiblespace at distance greater than µ from the nearest surface interfaceare truncated to a maximum distance µ . Non-visible points fartherthan µ from the surface are not measured. Otherwise the SDF rep-resents the distance to the nearest surface point.

Although efficient algorithms exist for computing the true dis-crete SDF for a given set of point measurements (complexity islinear in the the number of voxels) [22], sophisticated implemen-tations are required to achieve top performance on GPU hardware,without which real-time computation is not possible for a reason-able size volume. Instead, we use a projective truncated signeddistance function that is readily computed and trivially parallelis-able. For a raw depth map Rk with a known pose Tg,k, its globalframe projective TSDF [FRk ,WRk ] at a point p in the global frameg is computed as,

FRk (p) = Y⇣

l�1k(tg,k�pk2�Rk(x)⌘, (6)

l = kK�1˙

xk2 , (7)

x =j

p⇣

KT�1g,kp

⌘k, (8)

Y(h) =

(min

⇣1, h

µ

⌘sgn(h) iff h ��µ

null otherwise(9)

We use a nearest neighbour lookup b.c instead of interpolatingthe depth value, to prevent smearing of measurements at depth dis-continuities. 1/l converts the ray distance to p to a depth (we foundno considerable difference in using SDF values computed using dis-tances along the ray or along the optical axis). Y performs the SDFtruncation. The truncation function is scaled to ensure that a sur-face measurement (zero crossing in the SDF) is represented by atleast one non truncated voxel value in the discretised volume ei-ther side of the surface. Also, the support is increased linearly withdistance from the sensor center to support correct representationof noisier measurements. The associated weight WRk (p) is propor-tional to cos(q)/Rk(x), where q is the angle between the associatedpixel ray direction and the surface normal measurement in the localframe.

The projective TSDF measurement is only correct exactly at thesurface FRk (p) = 0 or if there is only a single point measurementin isolation. When a surface is present the closest point along aray could be another surface point not on the ray associated withthe pixel in Equation 8. It has been shown that for points closeto the surface, a correction can be applied by scaling the SDF bycos(q) [11]. However, we have found that approximation withinthe truncation region for 100s or more fused TSDFs from multipleviewpoints (as performed here) converges towards an SDF with apseudo-Euclidean metric that does not hinder mapping and trackingperformance.

The global fusion of all depth maps in the volume is formed bythe weighted average of all individual TSDFs computed for eachdepth map, which can be seen as de-noising the global TSDF frommultiple noisy TSDF measurements. Under an L2 norm the de-noised (fused) surface results as the zero-crossings of the point-wiseSDF F minimising:

minF2F Â

kkWRk FRk �F)k2. (10)

Given that the focus of our work is on real-time sensor tracking andsurface reconstruction we must maintain interactive frame-rates.(For a 640x480 depth stream at 30fps the equivalent of over 9million new point measurements are made per second). Storinga weight Wk(p) with each value allows an important aspect of theglobal minimum of the convex L2 de-noising metric to be exploitedfor real-time fusion; that the solution can be obtained incrementallyas more data terms are added using a simple weighted running av-erage [7], defined point-wise {p|FRk (p) 6= null}:

Fk(p) =Wk�1(p)Fk�1(p)+WRk (p)FRk (p)

Wk�1(p)+WRk (p)(11)

Wk(p) = Wk�1(p)+WRk (p) (12)

No update on the global TSDF is performed for values resultingfrom unmeasurable regions specified in Equation 9. While Wk(p)provides weighting of the TSDF proportional to the uncertainty ofsurface measurement, we have also found that in practice simplyletting WRk (p) = 1, resulting in a simple average, provides good re-sults. Moreover, by truncating the updated weight over some valueWh ,

Wk(p) min(Wk�1(p)+WRk (p),Wh ) , (13)

a moving average surface reconstruction can be obtained enablingreconstruction in scenes with dynamic object motion.

Although a large number of voxels can be visited that will notproject into the current image, the simplicity of the kernel meansoperation time is memory, not computation, bound and with currentGPU hardware over 65 gigavoxels/second (⇡ 2ms per full volumeupdate for a 5123 voxel reconstruction) can be updated. We use 16bits per component in S(p), although experimentally we have ver-ified that as few as 6 bits are required for the SDF value. Finally,we note that the raw depth measurements are used for TSDF fusionrather than the bilateral filtered version used in the tracking com-ponent, described later in section 3.5. The early filtering removesdesired high frequency structure and noise alike which would re-duce the ability to reconstruct finer scale structures.

3.4 Surface Prediction from Ray Casting the TSDFWith the most up-to-date reconstruction available comes the abil-ity to compute a dense surface prediction by rendering the surfaceencoded in the zero level set Fk = 0 into a virtual camera with thecurrent estimate Tg,k. The surface prediction is stored as a vertexand normal map Vk and Nk in frame of reference k and is used inthe subsequent camera pose estimation step.

As we have a dense surface reconstruction in the form of a globalSDF, a per pixel raycast can be performed [21]. Each pixel’s cor-responding ray, Tg,kK�1

˙

u, is marched starting from the minimumdepth for the pixel and stopping when a zero crossing (+ve to�ve for a visible surface) is found indicating the surface interface.Marching also stops if a �ve to +ve back face is found, or ulti-mately when exiting the working volume, both resulting in non sur-face measurement at the pixel u.

For points on or very close to the surface interface Fk(p) = 0 itis assumed that the gradient of the TSDF at p is orthogonal to thezero level set, and so the surface normal for the associated pixel u

: 点pにおけるTSDF value : 点pにおける重み

stored at each location of the TSDF: the current truncated signeddistance value Fk(p) and a weight Wk(p),

Sk(p) 7! [Fk(p),Wk(p)] . (5)

A dense surface measurement (such as the raw depth map Rk) pro-vides two important constraints on the surface being reconstructed.First, assuming we can truncate the uncertainty of a depth mea-surement such that the true value lies within ±µ of the measuredvalue, then for a distance r from the camera center along each depthmap ray, r < (lRk(u)� µ) is a measurement of free space (herel = kK�1

˙uk2 scales the measurement along the pixel ray). Sec-ond, we assume that no surface information is obtained in the re-construction volume at r > (lRk(u) + µ) along the camera ray.Therefore the SDF need only represent the region of uncertaintywhere the surface measurement exists |r�lRk(u)| µ . A TSDFallows the asymmetry between free space, uncertain measurementand unknown areas to be represented. Points that are within visiblespace at distance greater than µ from the nearest surface interfaceare truncated to a maximum distance µ . Non-visible points fartherthan µ from the surface are not measured. Otherwise the SDF rep-resents the distance to the nearest surface point.

Although efficient algorithms exist for computing the true dis-crete SDF for a given set of point measurements (complexity islinear in the the number of voxels) [22], sophisticated implemen-tations are required to achieve top performance on GPU hardware,without which real-time computation is not possible for a reason-able size volume. Instead, we use a projective truncated signeddistance function that is readily computed and trivially parallelis-able. For a raw depth map Rk with a known pose Tg,k, its globalframe projective TSDF [FRk ,WRk ] at a point p in the global frameg is computed as,

FRk (p) = Y⇣

l�1k(tg,k�pk2�Rk(x)⌘, (6)

l = kK�1˙xk2 , (7)

x =j

p⇣

KT�1g,kp

⌘k, (8)

Y(h) =

(min

⇣1, h

µ

⌘sgn(h) iff h ��µ

null otherwise(9)

We use a nearest neighbour lookup b.c instead of interpolatingthe depth value, to prevent smearing of measurements at depth dis-continuities. 1/l converts the ray distance to p to a depth (we foundno considerable difference in using SDF values computed using dis-tances along the ray or along the optical axis). Y performs the SDFtruncation. The truncation function is scaled to ensure that a sur-face measurement (zero crossing in the SDF) is represented by atleast one non truncated voxel value in the discretised volume ei-ther side of the surface. Also, the support is increased linearly withdistance from the sensor center to support correct representationof noisier measurements. The associated weight WRk (p) is propor-tional to cos(q)/Rk(x), where q is the angle between the associatedpixel ray direction and the surface normal measurement in the localframe.

The projective TSDF measurement is only correct exactly at thesurface FRk (p) = 0 or if there is only a single point measurementin isolation. When a surface is present the closest point along aray could be another surface point not on the ray associated withthe pixel in Equation 8. It has been shown that for points closeto the surface, a correction can be applied by scaling the SDF bycos(q) [11]. However, we have found that approximation withinthe truncation region for 100s or more fused TSDFs from multipleviewpoints (as performed here) converges towards an SDF with apseudo-Euclidean metric that does not hinder mapping and trackingperformance.

The global fusion of all depth maps in the volume is formed bythe weighted average of all individual TSDFs computed for eachdepth map, which can be seen as de-noising the global TSDF frommultiple noisy TSDF measurements. Under an L2 norm the de-noised (fused) surface results as the zero-crossings of the point-wiseSDF F minimising:

minF2F Â

kkWRk FRk �F)k2. (10)

Given that the focus of our work is on real-time sensor tracking andsurface reconstruction we must maintain interactive frame-rates.(For a 640x480 depth stream at 30fps the equivalent of over 9million new point measurements are made per second). Storinga weight Wk(p) with each value allows an important aspect of theglobal minimum of the convex L2 de-noising metric to be exploitedfor real-time fusion; that the solution can be obtained incrementallyas more data terms are added using a simple weighted running av-erage [7], defined point-wise {p|FRk (p) 6= null}:

Fk(p) =Wk�1(p)Fk�1(p)+WRk (p)FRk (p)

Wk�1(p)+WRk (p)(11)

Wk(p) = Wk�1(p)+WRk (p) (12)

No update on the global TSDF is performed for values resultingfrom unmeasurable regions specified in Equation 9. While Wk(p)provides weighting of the TSDF proportional to the uncertainty ofsurface measurement, we have also found that in practice simplyletting WRk (p) = 1, resulting in a simple average, provides good re-sults. Moreover, by truncating the updated weight over some valueWh ,

Wk(p) min(Wk�1(p)+WRk (p),Wh ) , (13)

a moving average surface reconstruction can be obtained enablingreconstruction in scenes with dynamic object motion.

Although a large number of voxels can be visited that will notproject into the current image, the simplicity of the kernel meansoperation time is memory, not computation, bound and with currentGPU hardware over 65 gigavoxels/second (⇡ 2ms per full volumeupdate for a 5123 voxel reconstruction) can be updated. We use 16bits per component in S(p), although experimentally we have ver-ified that as few as 6 bits are required for the SDF value. Finally,we note that the raw depth measurements are used for TSDF fusionrather than the bilateral filtered version used in the tracking com-ponent, described later in section 3.5. The early filtering removesdesired high frequency structure and noise alike which would re-duce the ability to reconstruct finer scale structures.

3.4 Surface Prediction from Ray Casting the TSDFWith the most up-to-date reconstruction available comes the abil-ity to compute a dense surface prediction by rendering the surfaceencoded in the zero level set Fk = 0 into a virtual camera with thecurrent estimate Tg,k. The surface prediction is stored as a vertexand normal map Vk and Nk in frame of reference k and is used inthe subsequent camera pose estimation step.

As we have a dense surface reconstruction in the form of a global SDF, a per pixel raycast can be performed [21]. Each pixel's corresponding ray, T_{g,k} K^{-1} u̇, is marched starting from the minimum depth for the pixel and stopping when a zero crossing (+ve to -ve for a visible surface) is found indicating the surface interface. Marching also stops if a -ve to +ve back face is found, or ultimately when exiting the working volume, both resulting in a non surface measurement at the pixel u.

For points on or very close to the surface interface F_k(p) = 0 it is assumed that the gradient of the TSDF at p is orthogonal to the zero level set, and so the surface normal for the associated pixel u can be computed directly from F_k using a numerical derivative of the SDF (Equation 14).

Two components are stored at each location of the TSDF: the current truncated signed distance value F_k(p) and a weight W_k(p),

$$S_k(p) \mapsto [F_k(p),\, W_k(p)] . \qquad (5)$$

A dense surface measurement (such as the raw depth map R_k) provides two important constraints on the surface being reconstructed. First, assuming we can truncate the uncertainty of a depth measurement such that the true value lies within ±µ of the measured value, then for a distance r from the camera center along each depth map ray, r < (λ R_k(u) − µ) is a measurement of free space (here λ = ‖K^{-1} u̇‖₂ scales the measurement along the pixel ray). Second, we assume that no surface information is obtained in the reconstruction volume at r > (λ R_k(u) + µ) along the camera ray. Therefore the SDF need only represent the region of uncertainty where the surface measurement exists, |r − λ R_k(u)| ≤ µ. A TSDF allows the asymmetry between free space, uncertain measurement and unknown areas to be represented. Points that are within visible space at distance greater than µ from the nearest surface interface are truncated to a maximum distance µ. Non-visible points farther than µ from the surface are not measured. Otherwise the SDF represents the distance to the nearest surface point.

Although efficient algorithms exist for computing the true discrete SDF for a given set of point measurements (complexity is linear in the number of voxels) [22], sophisticated implementations are required to achieve top performance on GPU hardware, without which real-time computation is not possible for a reasonable size volume. Instead, we use a projective truncated signed distance function that is readily computed and trivially parallelisable. For a raw depth map R_k with a known pose T_{g,k}, its global frame projective TSDF [F_{R_k}, W_{R_k}] at a point p in the global frame g is computed as,

$$F_{R_k}(p) = \Psi\left( \lambda^{-1} \lVert t_{g,k} - p \rVert_2 - R_k(x) \right), \qquad (6)$$

$$\lambda = \lVert K^{-1} \dot{x} \rVert_2 , \qquad (7)$$

$$x = \left\lfloor \pi\left( K\, T_{g,k}^{-1}\, p \right) \right\rfloor , \qquad (8)$$

$$\Psi(\eta) = \begin{cases} \min\left(1, \tfrac{\eta}{\mu}\right) \operatorname{sgn}(\eta) & \text{iff } \eta \geq -\mu \\ \text{null} & \text{otherwise} \end{cases} \qquad (9)$$

We use a nearest neighbour lookup ⌊·⌋ instead of interpolating the depth value, to prevent smearing of measurements at depth discontinuities. 1/λ converts the ray distance to p to a depth (we found no considerable difference in using SDF values computed using distances along the ray or along the optical axis). Ψ performs the SDF truncation. The truncation function is scaled to ensure that a surface measurement (zero crossing in the SDF) is represented by at least one non truncated voxel value in the discretised volume either side of the surface. Also, the support is increased linearly with distance from the sensor center to support correct representation of noisier measurements. The associated weight W_{R_k}(p) is proportional to cos(θ)/R_k(x), where θ is the angle between the associated pixel ray direction and the surface normal measurement in the local frame.
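
To make Eqs. (6)-(9) concrete, here is a small per-point sketch (explanatory only, not the paper's kernel; the function name and the use of `None` for "null" are assumptions of this write-up). It follows the sign convention of the prose above, i.e. positive in visible free space, negative behind the surface, and null more than µ behind the measured surface.

```python
import numpy as np

# p: 3-vector in the global frame, K: 3x3 intrinsics,
# T_gk: 4x4 camera-to-global pose, R: HxW raw depth map, mu: truncation distance.
def projective_tsdf(p, K, T_gk, R, mu):
    T_kg = np.linalg.inv(T_gk)
    p_cam = T_kg[:3, :3] @ p + T_kg[:3, 3]                  # p in the camera frame
    if p_cam[2] <= 0.0:
        return None                                         # behind the camera
    x = np.floor(K @ (p_cam / p_cam[2]))[:2].astype(int)    # Eq. (8): nearest pixel
    h, w = R.shape
    if not (0 <= x[0] < w and 0 <= x[1] < h) or R[x[1], x[0]] <= 0.0:
        return None                                         # outside image / no depth
    lam = np.linalg.norm(np.linalg.inv(K) @ np.array([x[0], x[1], 1.0]))  # Eq. (7)
    depth_p = np.linalg.norm(T_gk[:3, 3] - p) / lam         # ray distance -> depth of p
    sdf = R[x[1], x[0]] - depth_p                           # measured depth minus depth of p
    if sdf < -mu:
        return None                                         # Eq. (9): unmeasured region
    return min(1.0, sdf / mu) if sdf >= 0.0 else sdf / mu   # truncate to [-1, 1]
```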


(Slide figure: relationship between the raw depth map R_k(x) and the global TSDF, showing the point p, the pixel x and the truncation distance µ; θ is the angle between the ray and the surface normal.)

Page 13: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Mapping as Surface Reconstruction (4/4)

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 13

In practice, simply setting W_{R_k}(p) = 1 apparently gives good results.


•  Eq. (11): fusion of the TSDFs of the global scene and the live frame (the fused TSDF value combines the global model and the live frame)

•  Eqs. (12), (13): weight update

Page 14: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Prediction

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 14

Figure 3: Overall system workflow. (Input depth map Rk → Surface Measurement: compute surface vertex and normal maps Vk, Nk → Pose Estimation: ICP of predicted and measured surface giving Tg,k → Update Reconstruction: integrate surface measurement into the global TSDF S → Surface Prediction: ray-cast TSDF to compute the predicted maps V̂k-1, N̂k-1.)

Surface reconstruction update: The global scene fusion process, where given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model maintained with a volumetric, truncated signed distance function (TSDF) representation.

Surface prediction: Unlike frame-to-frame pose estimation as performed in [15], we close the loop between mapping and localisation by tracking the live depth frame against the globally fused model. This is performed by raycasting the signed distance function into the estimated frame to provide a dense surface prediction against which the live depth map is aligned.

Sensor pose estimation: Live sensor tracking is achieved using a multi-scale ICP alignment between the predicted surface and current sensor measurement. Our GPU based implementation uses all the available data at frame-rate.

3.1 Preliminaries
We represent the live 6DOF camera pose estimated for a frame at time k by a rigid body transformation matrix:

$$T_{g,k} = \begin{bmatrix} R_{g,k} & t_{g,k} \\ 0^\top & 1 \end{bmatrix} \in \mathbb{SE}_3 , \qquad (1)$$

where the Euclidean group SE_3 := {R, t | R ∈ SO_3, t ∈ ℝ³}. This maps the camera coordinate frame at time k into the global frame g, such that a point p_k ∈ ℝ³ in the camera frame is transferred into the global co-ordinate frame via p_g = T_{g,k} p_k.

We will also use a single constant camera calibration matrix K that transforms points on the sensor plane into image pixels. The function q = π(p) performs perspective projection of p ∈ ℝ³ = (x, y, z)^⊤ including dehomogenisation to obtain q ∈ ℝ² = (x/z, y/z)^⊤. We will also use a dot notation to denote homogeneous vectors, u̇ := (u^⊤ | 1)^⊤.
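
In code, this notation amounts to little more than the following helpers (a trivial sketch; the names `dot`, `pi` and `to_global` are placeholders of this write-up):

```python
import numpy as np

def dot(u):                        # u_dot := (u^T | 1)^T
    return np.append(u, 1.0)

def pi(p):                         # perspective projection with dehomogenisation
    return p[:2] / p[2]

def to_global(T_gk, p_k):          # p_g = T_gk * p_k for a 4x4 pose matrix
    return (T_gk @ dot(p_k))[:3]
```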

3.2 Surface Measurement
At time k a measurement comprises a raw depth map R_k which provides calibrated depth measurements R_k(u) ∈ ℝ at each image pixel u = (u, v)^⊤ in the image domain u ∈ U ⊂ ℝ², such that p_k = R_k(u) K^{-1} u̇ is a metric point measurement in the sensor frame of reference k. We apply a bilateral filter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise D_k,

$$D_k(u) = \frac{1}{W_p} \sum_{q \in U} \mathcal{N}_{\sigma_s}(\lVert u - q \rVert_2)\, \mathcal{N}_{\sigma_r}(\lVert R_k(u) - R_k(q) \rVert_2)\, R_k(q) , \qquad (2)$$

where N_σ(t) = exp(−t² σ^{-2}) and W_p is a normalizing constant.

We back-project the filtered depth values into the sensor's frame of reference to obtain a vertex map V_k,

$$V_k(u) = D_k(u)\, K^{-1} \dot{u} . \qquad (3)$$

Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neighbouring map vertices, the corresponding normal vectors,

$$N_k(u) = \nu\left[ \left( V_k(u+1, v) - V_k(u, v) \right) \times \left( V_k(u, v+1) - V_k(u, v) \right) \right] , \qquad (4)$$

where ν[x] = x / ‖x‖₂.

Figure 4: A slice through the truncated signed distance volume showing the truncated function F > µ (white), the smooth distance field around the surface interface F = 0 and voxels that have not yet had a valid measurement (grey) as detailed in eqn. 9.

We also define a vertex validity mask: M_k(u) ↦ 1 for each pixel where a depth measurement transforms to a valid vertex; otherwise, if a depth measurement is missing, M_k(u) ↦ 0. The bilateral filtered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5.
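
A straightforward numpy sketch of Eqs. (3) and (4), vectorised over the whole image (illustrative only; function names are assumptions of this write-up):

```python
import numpy as np

def vertex_map(D_k, K):
    """Eq. (3): back-project a filtered depth map into a vertex map (H, W, 3)."""
    h, w = D_k.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T            # K^-1 * u_dot for every pixel
    return rays * D_k[..., None]

def normal_map(V_k):
    """Eq. (4): normals from the cross product of neighbouring vertices."""
    N = np.zeros_like(V_k)
    dx = V_k[:-1, 1:] - V_k[:-1, :-1]          # V(u+1, v) - V(u, v)
    dy = V_k[1:, :-1] - V_k[:-1, :-1]          # V(u, v+1) - V(u, v)
    n = np.cross(dx, dy)
    n /= np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12)
    N[:-1, :-1] = n
    return N
```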

We compute an L = 3 level multi-scale representation of the surface measurement in the form of a vertex and normal map pyramid. First a depth map pyramid D^{l∈[1...L]} is computed. Setting the bottom depth map pyramid level equal to the original bilateral filtered depth map, the sub-sampled version D^{l+1} is computed from D^l by block averaging followed by sub-sampling to half the resolution. Depth values are used in the average only if they are within 3σ_r of the central pixel to ensure smoothing does not occur over depth boundaries. Subsequently each level in a vertex and normal map pyramid V^{l∈[1...L]}, N^{l∈[1...L]} is computed with Equations 3 and 4 using the corresponding depth map level. We note that given the camera to global co-ordinate frame transform T_{g,k} associated with the surface measurement, the global frame vertex is V^g_k(u) = T_{g,k} V̇_k(u) and the equivalent mapping of normal vectors into the global frame is N^g_k(u) = R_{g,k} N_k(u).
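
A possible sketch of one level of the block averaging described above; the choice of the top-left sample of each 2x2 block as the "central" pixel for the 3σ_r test is an assumption of this sketch:

```python
import numpy as np

def next_pyramid_level(D_l, sigma_r):
    """Half-resolution depth map by 2x2 block averaging with a 3*sigma_r gate."""
    h, w = D_l.shape[0] // 2 * 2, D_l.shape[1] // 2 * 2
    blocks = D_l[:h, :w].reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    centre = blocks[..., 0, 0][..., None, None]        # assumed "central" sample
    keep = np.abs(blocks - centre) <= 3.0 * sigma_r    # exclude values across depth edges
    return (blocks * keep).sum(axis=(-1, -2)) / keep.sum(axis=(-1, -2))
```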

3.3 Mapping as Surface Reconstruction
Each consecutive depth frame, with an associated live camera pose estimate, is fused incrementally into one single 3D reconstruction using the volumetric truncated signed distance function (TSDF) [7]. In a true signed distance function, the value corresponds to the signed distance to the closest zero crossing (the surface interface), taking on positive and increasing values moving from the visible surface into free space, and negative and decreasing values on the non-visible side. The result of averaging the SDFs of multiple 3D point clouds (or surface measurements) that are aligned into a global frame is a global surface fusion.

An example given in Figure 4 demonstrates how the TSDF allows us to represent arbitrary genus surfaces as zero crossings within the volume. We will denote the global TSDF that contains a fusion of the registered depth measurements from frames 1...k as S_k(p) where p ∈ ℝ³ is a global frame point in the 3D volume to be reconstructed. A discretization of the TSDF with a specified resolution is stored in global GPU memory where all processing will reside. From here on we assume a fixed bijective mapping between voxel/memory elements and the continuous TSDF representation and will refer only to the continuous TSDF S. Two components are stored at each location of the TSDF: the signed distance value F_k(p) and a weight W_k(p) (Equation 5 above).

Page 15: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Prediction •  Extract the surface seen from an arbitrary viewpoint out of the global TSDF •  Needed for Pose Estimation

•  Ray Casting (ray marching) – a rendering method that fits distance functions well – cast a ray from the camera – start from the minimum depth and stop when a zero crossing is found

– Normal computation:

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 15

Figure 5: Reconstruction of a scene showing raycasting of the TSDF (left) without and (middle and right) with interpolation of the TSDF at the surface interface using eqn. 15.

Figure 6: Demonstration of the space skipping ray casting. (Left) pixel iteration counts are shown where for each pixel the ray is traversed in steps of at most one voxel (white equals 480 increments and black 60). (Middle) ray marching steps are drastically reduced by skipping empty space according to the minimum truncation µ (white equals 70 iterations and black 10, ≈ 6× speedup). Marching steps can be seen to increase around the surface interface where the signed distance function has not been truncated. (Right) Normal map at the resulting surface intersection.

The surface normal for the pixel u along which p was found can be computed directly from F_k using a numerical derivative of the SDF:

$$R_{g,k} N_k = N^g_k(u) = \nu\left[ \nabla F(p) \right], \quad \nabla F = \left[ \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y}, \frac{\partial F}{\partial z} \right]^\top \qquad (14)$$

Further, this derivative is scaled in each dimension to ensure correct isotropy given potentially arbitrary voxel resolutions and reconstruction dimensions.

Since the rendering of the surface is restricted to provide physically plausible measurement predictions, there is a minimum and maximum rendering range the ray needs to traverse corresponding to conservative minimum and maximum sensor range (≈ [0.4, 8] meters for the Kinect). This results in the desirable property of the proposed surface prediction requiring a bounded time per pixel computation for any size or complexity of scene with a fixed volumetric resolution.

Classically a min/max block acceleration structure [21] can be used to speed up marching through empty space. However, due to continual updating of the TSDF (which would require a constant update to the min/max macro blocks) we found that simple ray skipping provides a more useful acceleration. In ray skipping we utilise the fact that near F(p) = 0 the fused volume holds a good approximation to the true signed distance from p to the nearest surface interface. Using our known truncation distance we can march along the ray in steps with size < µ while values of F(p) have +ve truncated values, as we can assume a step µ must pass through at least one non-truncated +ve value before stepping over the surface zero crossing. The speed-up obtained is demonstrated in Figure 6 by measuring the number of steps required for each pixel to intersect the surface relative to standard marching.

Higher quality intersections can be obtained by solving a ray/trilinear cell intersection [21] that requires solving a cubic polynomial. As this is expensive we use a simple approximation. Given a ray has been found to intersect the SDF where F⁺_t and F⁺_{t+Δt} are trilinearly interpolated SDF values either side of the zero crossing at points along the ray t and t + Δt from its starting point, we find the parameter t* at which the intersection occurs more precisely:

$$t^{*} = t - \frac{\Delta t\, F^{+}_{t}}{F^{+}_{t+\Delta t} - F^{+}_{t}} . \qquad (15)$$

The predicted vertex and normal maps are computed at the interpolated location in the global frame. Figure 5 shows a typical reconstruction; the interpolation scheme described achieves high quality occlusion boundaries at a fraction of the cost of full interpolation.
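
Putting the ray skipping and the zero-crossing refinement of Eq. (15) together, a per-pixel ray march might be sketched as follows. This is not the paper's implementation; `sample_F` is an assumed callback returning an interpolated, normalised TSDF value (in [-1, 1]) or None outside the volume, and the step sizes are illustrative.

```python
def raycast_pixel(origin, direction, sample_F, mu, t_min=0.4, t_max=8.0):
    """March one pixel ray (origin, direction as numpy 3-vectors, e.g. from
    T_gk and K^-1 u_dot) through the TSDF and return the surface point, or None."""
    t = t_min
    F_prev = sample_F(origin + t * direction)
    while t < t_max and F_prev is not None:
        # Skip quickly (step < mu) while the TSDF sits at its positive truncation,
        # otherwise take a finer step near the surface.
        step = 0.8 * mu if F_prev >= 1.0 else 0.25 * mu
        F_next = sample_F(origin + (t + step) * direction)
        if F_next is None:
            return None                                   # left the working volume
        if F_prev > 0.0 >= F_next:                        # +ve to -ve: visible surface
            t_star = t - step * F_prev / (F_next - F_prev)   # Eq. (15)
            return origin + t_star * direction
        if F_prev < 0.0 <= F_next:                        # -ve to +ve: back face, stop
            return None
        t, F_prev = t + step, F_next
    return None
```

The surface normal at the returned point would then follow from the numerical TSDF gradient of Eq. (14).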

3.5 Sensor Pose Estimation
Live camera localisation involves estimating the current camera pose T_{w,k} ∈ SE_3 (Equation 1) for each new depth image. Many tracking algorithms use feature selection to improve speed by reducing the number of points for which data association need be performed. In this work, we take advantage of two factors to allow us instead to make use of all of the data in a depth image for a dense iterated closest point based pose estimation. First, by maintaining a high tracking frame-rate, we can assume small motion from one frame to the next. This allows us to use the fast projective data association algorithm [4] to obtain correspondence and the point-plane metric [5] for pose optimisation. Second, modern GPU hardware enables a fully parallelised processing pipeline, so that the data association and point-plane optimisation can use all of the available surface measurements.

The point-plane error metric in combination with correspondences obtained using projective data association was first demonstrated in a real time modelling system by [23] where frame-to-frame tracking was used (with a fixed camera) for depth map alignment. In our system we instead track the current sensor frame by aligning a live surface measurement (V_k, N_k) against the model prediction from the previous frame (V̂_{k-1}, N̂_{k-1}). We note that frame-to-frame tracking is obtained simply by setting (V̂_{k-1}, N̂_{k-1}) ← (V_{k-1}, N_{k-1}), which is used in our experimental section for a comparison between frame-to-frame and frame-model tracking.

Utilising the surface prediction, the global point-plane energy, under the L2 norm, for the desired camera pose estimate T_{g,k} is:

$$E(T_{g,k}) = \sum_{\substack{u \in \mathcal{U} \\ \Omega_k(u) \neq \text{null}}} \left\lVert \left( T_{g,k} \dot{V}_k(u) - \hat{V}^{g}_{k-1}(\hat{u}) \right)^{\top} \hat{N}^{g}_{k-1}(\hat{u}) \right\rVert_2 , \qquad (16)$$

where each global frame surface prediction is obtained using the previous fixed pose estimate T_{g,k-1}. The projective data association algorithm produces the set of vertex correspondences {V_k(u), V̂_{k-1}(û) | Ω(u) ≠ null} by computing the perspectively projected point, û = π(K T̃_{k-1,k} V̇_k(u)), using an estimate for the frame-frame transform T̃^z_{k-1,k} = T^{-1}_{g,k-1} T̃^z_{g,k} and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and difference in normal values suffices to reject grossly incorrect correspondences, also illustrated in Figure 7:

$$\Omega(u) \neq \text{null} \;\text{ iff }\; \begin{cases} M_k(u) = 1, \text{ and} \\ \lVert \tilde{T}^{z}_{g,k} \dot{V}_k(u) - \hat{V}^{g}_{k-1}(\hat{u}) \rVert_2 \leq \varepsilon_d, \text{ and} \\ \langle \tilde{R}^{z}_{g,k} N_k(u),\, \hat{N}^{g}_{k-1}(\hat{u}) \rangle \leq \varepsilon_\theta , \end{cases} \qquad (17)$$

where ε_d and ε_θ are threshold parameters of our system. T̃^{z=0}_{g,k} is initialised with the previous frame pose T_{g,k-1}. An iterative solution, T̃^z_{g,k} for z > 0, is obtained by minimising the energy of a linearised version of (16) around the previous estimate T̃^{z-1}_{g,k}. Using the small angle assumption for an incremental transform:

$$\tilde{T}^{z}_{inc} = \left[ \tilde{R}^{z} \,\middle|\, \tilde{t}^{z} \right] = \begin{bmatrix} 1 & \alpha & -\gamma & t_x \\ -\alpha & 1 & \beta & t_y \\ \gamma & -\beta & 1 & t_z \end{bmatrix} , \qquad (18)$$
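
For intuition, one Gauss-Newton step of this linearised point-plane minimisation can be sketched as below, assuming correspondences have already been obtained by projective data association and filtered with Eq. (17). The sketch uses the generic small-angle parametrisation [I + [ω]× | t] rather than the exact layout of Eq. (18), and all names are assumptions of this write-up.

```python
import numpy as np

def point_plane_step(src, dst, nrm):
    """One linearised point-plane ICP update (cf. Eq. 16).

    src: Nx3 live vertices already mapped by the current pose estimate,
    dst, nrm: Nx3 predicted vertices and normals from the raycast model.
    Returns a 4x4 incremental transform to left-multiply onto the estimate.
    """
    r = np.einsum('ij,ij->i', src - dst, nrm)       # per-point point-plane residual
    J = np.hstack([np.cross(src, nrm), nrm])        # N x 6 Jacobian rows [(v x n)^T, n^T]
    A = J.T @ J                                     # 6x6 normal equations
    b = -J.T @ r
    x = np.linalg.solve(A, b)                       # x = (omega, t)
    w, t = x[:3], x[3:]
    R_inc = np.eye(3) + np.array([[0.0, -w[2], w[1]],
                                  [w[2], 0.0, -w[0]],
                                  [-w[1], w[0], 0.0]])   # small-angle rotation
    T_inc = np.eye(4)
    T_inc[:3, :3], T_inc[:3, 3] = R_inc, t
    return T_inc
```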


Page 16: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Measurement

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 16

[Figure 3: Overall system workflow — Input depth map R_k → Measurement (compute surface vertex and normal maps V_k, N_k) → Pose Estimation (ICP of the predicted and measured surface, giving T_g,k) → Update Reconstruction (integrate the surface measurement into the global TSDF S) → Surface Prediction (ray-cast the TSDF to compute the predicted surface V̂_{k-1}, N̂_{k-1} used by the next frame).]

Surface reconstruction update: given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model, maintained with a volumetric truncated signed distance function (TSDF) representation.

Surface prediction: rather than frame-to-frame pose estimation, the loop between mapping and localisation is closed by tracking the live depth frame against the globally fused model: the signed distance function is raycast into the estimated frame to provide a dense surface prediction against which the live depth map is aligned.

Sensor pose estimation: live sensor tracking is achieved using a multi-scale ICP alignment between the predicted surface and the current sensor measurement, using all the available data at frame rate.
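To make the dataflow of Figure 3 explicit, the following Python skeleton shows one pass of the per-frame loop. The four stage functions are hypothetical placeholder stubs standing in for the GPU kernels described above, not the paper's API; only the ordering and the data handed between stages are meant to be accurate.

```python
# Schematic per-frame loop mirroring Figure 3. All four stage functions are
# hypothetical placeholders (stubs), included only to mark where the real
# kernels would plug in.
def surface_measurement(raw_depth, K): raise NotImplementedError
def icp_pose_estimation(V, N, V_hat, N_hat, T_prev, K): raise NotImplementedError
def integrate_into_tsdf(tsdf, raw_depth, T, K): raise NotImplementedError
def raycast_tsdf(tsdf, T, K): raise NotImplementedError

def process_frame(raw_depth, tsdf, T_prev, V_hat, N_hat, K):
    V_k, N_k = surface_measurement(raw_depth, K)                   # de-noise, vertex/normal pyramids
    T_gk = icp_pose_estimation(V_k, N_k, V_hat, N_hat, T_prev, K)  # multi-scale ICP vs. prediction
    integrate_into_tsdf(tsdf, raw_depth, T_gk, K)                  # fuse measurement into global TSDF
    V_hat_next, N_hat_next = raycast_tsdf(tsdf, T_gk, K)           # predict surface for the next frame
    return T_gk, V_hat_next, N_hat_next
```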

Page 17: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Measurement (1/4) •  Preprocessing applied to the live frame: 1.  De-noise (bilateral filter) 2.  Generate a vertex and a normal for each pixel 3.  Build the pyramids (L = 1, 2, 3)


Page 18: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Measurement (2/4) •  De-noise


The raw depth map R_k is filtered to obtain a discontinuity-preserving, de-noised depth map D_k (eq. 2):

$$D_k(u) = \frac{1}{W_p} \sum_{q \in \mathcal{U}} \mathcal{N}_{\sigma_s}\!\big(\lVert u - q \rVert_2\big)\, \mathcal{N}_{\sigma_r}\!\big(\lVert R_k(u) - R_k(q) \rVert_2\big)\, R_k(q) , \quad (2)$$

where $\mathcal{N}_\sigma(t) = \exp(-t^2 \sigma^{-2})$ and $W_p$ is a normalizing constant.

–  R_k : raw depth map
–  u, q : pixels
–  1 / W_p : normalisation

In short, this is a bilateral filter applied to the depth map.
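A minimal NumPy sketch of eq. (2), assuming depth values in metres, zero-valued invalid pixels, and illustrative values for σ_s, σ_r, and the window radius (none of these are fixed in the text above); the real system runs this as a GPU kernel rather than a Python loop.

```python
import numpy as np

def bilateral_filter_depth(raw_depth, sigma_s=4.5, sigma_r=0.03, radius=6):
    """Naive bilateral filter on a depth map (eq. 2). Depth is assumed to be
    in metres, with invalid pixels encoded as 0 and ignored."""
    R = np.asarray(raw_depth, dtype=np.float64)
    H, W = R.shape
    out = np.zeros((H, W))
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / sigma_s**2)        # N_sigma_s term
    for v in range(H):
        for u in range(W):
            d = R[v, u]
            if d <= 0:
                continue
            y0, y1 = max(0, v - radius), min(H, v + radius + 1)
            x0, x1 = max(0, u - radius), min(W, u + radius + 1)
            patch = R[y0:y1, x0:x1]
            w_s = spatial[(y0 - v + radius):(y1 - v + radius),
                          (x0 - u + radius):(x1 - u + radius)]
            w_r = np.exp(-((patch - d) ** 2) / sigma_r**2)  # N_sigma_r term
            w = w_s * w_r * (patch > 0)                     # skip invalid depths
            W_p = w.sum()
            if W_p > 0:
                out[v, u] = (w * patch).sum() / W_p         # 1/W_p normalisation
    return out
```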

Page 19: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Measurement (3/4) •  Vertex map


Using the de-noised depth map D_k, each pixel u is back-projected into 3D (eq. 3):

$$V_k(u) = D_k(u)\, K^{-1} \dot{u} . \quad (3)$$

•  Normal map

The normal is computed from the 3D coordinates of each vertex (cross product of neighbouring pixels' 3D points), eq. 4:

$$N_k(u) = \nu\!\left[ \big(V_k(u+1,v) - V_k(u,v)\big) \times \big(V_k(u,v+1) - V_k(u,v)\big) \right] , \quad (4)$$

where $\nu[x] = x / \lVert x \rVert_2$ : normalisation (to a unit vector).
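A minimal NumPy sketch of eqs. (3) and (4), assuming a pinhole intrinsic matrix K and (column, row) pixel coordinates; the forward-difference direction of the cross product (and hence the sign of the normal) is a choice made here, not taken from the paper.

```python
import numpy as np

def vertex_map(depth, K):
    """Back-project a (de-noised) depth map into camera-frame 3D points,
    eq. (3): V_k(u) = D_k(u) * K^{-1} * u_dot."""
    H, W = depth.shape
    K_inv = np.linalg.inv(K)
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1)   # homogeneous pixels u_dot
    rays = pix @ K_inv.T                                  # K^{-1} u_dot per pixel
    return rays * depth[..., None]                        # scale by depth -> H x W x 3

def normal_map(V):
    """Per-pixel normals from neighbouring vertices, eq. (4): cross product of
    forward differences, normalised to unit length."""
    N = np.zeros_like(V)
    du = V[:, 1:, :] - V[:, :-1, :]        # V(u+1, v) - V(u, v)
    dv = V[1:, :, :] - V[:-1, :, :]        # V(u, v+1) - V(u, v)
    n = np.cross(du[:-1, :, :], dv[:, :-1, :])
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    N[:-1, :-1, :] = np.where(norm > 0, n / np.maximum(norm, 1e-12), 0.0)  # nu[x] = x / ||x||
    return N
```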

Page 20: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Surface Measurement (4/4) •  Pyramid construction

–  3 levels  –  L = 1: the raw resolution  –  L + 1: half the resolution of level L  –  At each level, a vertex map and a normal map are computed and kept (a sketch follows below)

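A hedged sketch of the pyramid construction: block averaging followed by sub-sampling to half resolution, discarding depths more than 3σ_r away from a block's reference pixel so that smoothing does not cross depth boundaries. The paper compares against the central pixel; for simplicity the block's first pixel stands in for it here, and all names and defaults are illustrative.

```python
import numpy as np

def depth_pyramid(D, levels=3, sigma_r=0.03):
    """Level 1 is the bilateral-filtered depth map; each further level is a
    2x2 block average at half resolution, using only depths within
    3*sigma_r of the block's reference pixel."""
    pyr = [np.asarray(D, dtype=np.float64)]
    for _ in range(1, levels):
        D_prev = pyr[-1]
        H, W = (D_prev.shape[0] // 2) * 2, (D_prev.shape[1] // 2) * 2
        blocks = (D_prev[:H, :W]
                  .reshape(H // 2, 2, W // 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(H // 2, W // 2, 4))
        ref = blocks[..., :1]                                  # reference depth per block
        valid = (np.abs(blocks - ref) <= 3 * sigma_r) & (blocks > 0)
        cnt = valid.sum(axis=-1)
        avg = np.where(cnt > 0,
                       (blocks * valid).sum(axis=-1) / np.maximum(cnt, 1),
                       0.0)
        pyr.append(avg)
    return pyr  # vertex/normal maps are then computed per level with eqs. (3) and (4)
```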

Page 21: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Sensor Pose Estimation


Rk Tg,k

Rk

Tg,k-1

Input

Measurement PoseEstimation

UpdateReconstruction

SurfacePrediction

Compute Surface Nertex and

Normal Maps

ICP of Predictedand Measured

Surface

Integrate SurfaceMeasurement

into Global TSDF

Ray-cast TSDF to Compute Surface Prediction

Tg,k-1k

Vk-1

S

Nk-1^ ^

VkNk

Figure 3: Overall system workflow.

Surface reconstruction update: The global scene fusion process, where, given the pose determined by tracking the depth data from a new sensor frame, the surface measurement is integrated into the scene model maintained with a volumetric, truncated signed distance function (TSDF) representation.

Surface prediction: Unlike frame-to-frame pose estimation as performed in [15], we close the loop between mapping and localisation by tracking the live depth frame against the globally fused model. This is performed by raycasting the signed distance function into the estimated frame to provide a dense surface prediction against which the live depth map is aligned.

Sensor pose estimation: Live sensor tracking is achieved using a multi-scale ICP alignment between the predicted surface and the current sensor measurement. Our GPU based implementation uses all the available data at frame-rate.

3.1 Preliminaries
We represent the live 6DOF camera pose estimated for a frame at time k by a rigid body transformation matrix:

T_{g,k} = \begin{bmatrix} R_{g,k} & t_{g,k} \\ 0^\top & 1 \end{bmatrix} \in \mathbb{SE}_3 ,   (1)

where the Euclidean group \mathbb{SE}_3 := \{R, t \mid R \in \mathbb{SO}_3, t \in \mathbb{R}^3\}. This maps the camera coordinate frame at time k into the global frame g, such that a point p_k \in \mathbb{R}^3 in the camera frame is transferred into the global co-ordinate frame via p_g = T_{g,k} p_k.

We will also use a single constant camera calibration matrix K that transforms points on the sensor plane into image pixels. The function q = \pi(p) performs perspective projection of p = (x, y, z)^\top \in \mathbb{R}^3, including dehomogenisation, to obtain q = (x/z, y/z)^\top \in \mathbb{R}^2. We will also use a dot notation to denote homogeneous vectors, \dot{u} := (u^\top \mid 1)^\top.

3.2 Surface Measurement
At time k a measurement comprises a raw depth map R_k which provides calibrated depth measurements R_k(u) \in \mathbb{R} at each image pixel u = (u, v)^\top in the image domain u \in \mathcal{U} \subset \mathbb{R}^2, such that p_k = R_k(u) K^{-1} \dot{u} is a metric point measurement in the sensor frame of reference k. We apply a bilateral filter [27] to the raw depth map to obtain a discontinuity preserved depth map with reduced noise D_k,

D_k(u) = \frac{1}{W_p} \sum_{q \in \mathcal{U}} \mathcal{N}_{\sigma_s}(\|u - q\|_2) \, \mathcal{N}_{\sigma_r}(\|R_k(u) - R_k(q)\|_2) \, R_k(q) ,   (2)

where \mathcal{N}_\sigma(t) = \exp(-t^2 \sigma^{-2}) and W_p is a normalizing constant.

We back-project the filtered depth values into the sensor's frame of reference to obtain a vertex map V_k,

V_k(u) = D_k(u) \, K^{-1} \dot{u} .   (3)

Since each frame from the depth sensor is a surface measurement on a regular grid, we compute, using a cross product between neighbouring map vertices, the corresponding normal vectors,

N_k(u) = \nu\!\left[ (V_k(u+1, v) - V_k(u, v)) \times (V_k(u, v+1) - V_k(u, v)) \right] ,   (4)

where \nu[x] = x / \|x\|_2.

Figure 4: A slice through the truncated signed distance volume showing the truncated function F > \mu (white), the smooth distance field around the surface interface F = 0, and voxels that have not yet had a valid measurement (grey), as detailed in eqn. 9.

We also define a vertex validity mask: M_k(u) \mapsto 1 for each pixel where a depth measurement transforms to a valid vertex; otherwise, if a depth measurement is missing, M_k(u) \mapsto 0. The bilateral filtered version of the depth map greatly increases the quality of the normal maps produced, improving the data association required in tracking described in Section 3.5.

We compute an L = 3 level multi-scale representation of the surface measurement in the form of a vertex and normal map pyramid. First a depth map pyramid D^{l \in [1...L]} is computed. Setting the bottom depth map pyramid level equal to the original bilateral filtered depth map, the sub-sampled version D^{l+1} is computed from D^l by block averaging followed by sub-sampling to half the resolution. Depth values are used in the average only if they are within 3\sigma_r of the central pixel, to ensure smoothing does not occur over depth boundaries. Subsequently each level in a vertex and normal map pyramid V^{l \in [1...L]}, N^{l \in [1...L]} is computed with Equations 3 and 4 using the corresponding depth map level. We note that, given the camera to global co-ordinate frame transform T_{g,k} associated with the surface measurement, the global frame vertex is V^g_k(u) = T_{g,k} \dot{V}_k(u) and the equivalent mapping of normal vectors into the global frame is N^g_k(u) = R_{g,k} N_k(u).
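To make the surface measurement step of Section 3.2 concrete, here is a minimal NumPy sketch of equations (2)-(4): bilateral filtering of the raw depth map, back-projection through K^{-1} into a vertex map, and cross-product normals. It is an illustrative CPU re-implementation rather than the paper's GPU code; the filter parameters, the edge padding, and the function names are my own assumptions.

import numpy as np

def bilateral_filter(R, sigma_s=4.0, sigma_r=0.03, radius=4):
    # Eq. (2): discontinuity-preserving smoothing of a raw depth map R (H x W, metres).
    H, W = R.shape
    D = np.zeros_like(R)
    Wp = np.zeros_like(R)
    Rpad = np.pad(R, radius, mode='edge')
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            q = Rpad[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            w = np.exp(-(dx * dx + dy * dy) / sigma_s**2) * np.exp(-(R - q)**2 / sigma_r**2)
            D += w * q
            Wp += w
    return D / np.maximum(Wp, 1e-12)        # divide by the normalising constant W_p

def vertex_map(D, K):
    # Eq. (3): back-project filtered depth into the sensor frame, V_k(u) = D_k(u) K^-1 u_dot.
    H, W = D.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T         # K^-1 u_dot for every pixel
    return D[..., None] * rays

def normal_map(V):
    # Eq. (4): normals from the cross product of neighbouring vertices, nu[x] = x / ||x||_2.
    dx = V[:, 1:, :] - V[:, :-1, :]         # V(u+1, v) - V(u, v)
    dy = V[1:, :, :] - V[:-1, :, :]         # V(u, v+1) - V(u, v)
    n = np.cross(dx[:-1, :, :], dy[:, :-1, :])
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12)

The L = 3 pyramid of Section 3.2 would then be built by block-averaging D (respecting the 3 sigma_r test) and re-running vertex_map and normal_map at each level.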

3.3 Mapping as Surface Reconstruction
Each consecutive depth frame, with an associated live camera pose estimate, is fused incrementally into one single 3D reconstruction using the volumetric truncated signed distance function (TSDF) [7]. In a true signed distance function, the value corresponds to the signed distance to the closest zero crossing (the surface interface), taking on positive and increasing values moving from the visible surface into free space, and negative and decreasing values on the non-visible side. The result of averaging the SDFs of multiple 3D point clouds (or surface measurements) that are aligned into a global frame is a global surface fusion.

An example given in Figure 4 demonstrates how the TSDF allows us to represent arbitrary genus surfaces as zero crossings within the volume. We will denote the global TSDF that contains a fusion of the registered depth measurements from frames 1...k as S_k(p), where p \in \mathbb{R}^3 is a global frame point in the 3D volume to be reconstructed. A discretization of the TSDF with a specified resolution is stored in global GPU memory where all processing will reside. From here on we assume a fixed bijective mapping between voxel/memory elements and the continuous TSDF representation and will refer only to the continuous TSDF S. Two components are stored at each point of the TSDF: the current truncated signed distance value F_k(p) and a weight W_k(p).
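As a rough, hedged illustration of the surface fusion idea just described, the sketch below keeps a weight volume W alongside the TSDF F and folds each new depth frame's truncated projective distances into a per-voxel running weighted average. The projective-distance approximation, the zero-skew intrinsics, the weighting scheme and the truncation handling are my own simplifications; the paper's exact update rules are given in equations outside this excerpt.

import numpy as np

def fuse_tsdf(F, W, depth, K, T_g_k, vol_origin, voxel_size, mu=0.03, w_max=100.0):
    # Fold one registered depth map into the global TSDF F with weight volume W
    # by a per-voxel running weighted average. K is assumed to have zero skew.
    H, width = depth.shape
    nx, ny, nz = F.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing='ij')
    p_g = vol_origin + voxel_size * np.stack([ii, jj, kk], axis=-1).reshape(-1, 3)
    T_inv = np.linalg.inv(T_g_k)                      # global -> camera frame
    p_c = p_g @ T_inv[:3, :3].T + T_inv[:3, 3]
    z = p_c[:, 2]
    valid = z > 1e-6
    u = np.zeros(z.shape, dtype=int)
    v = np.zeros(z.shape, dtype=int)
    u[valid] = np.round(p_c[valid, 0] * K[0, 0] / z[valid] + K[0, 2]).astype(int)
    v[valid] = np.round(p_c[valid, 1] * K[1, 1] / z[valid] + K[1, 2]).astype(int)
    valid &= (u >= 0) & (u < width) & (v >= 0) & (v < H)
    # Projective signed distance, truncated to [-mu, mu]; voxels far behind the surface are skipped.
    sdf = np.full(z.shape, np.nan)
    sdf[valid] = depth[v[valid], u[valid]] - z[valid]
    valid &= np.isfinite(sdf) & (sdf > -mu)
    tsdf = np.clip(sdf, -mu, mu)
    Ff, Wf = F.reshape(-1), W.reshape(-1)
    Ff[valid] = (Wf[valid] * Ff[valid] + tsdf[valid]) / (Wf[valid] + 1.0)
    Wf[valid] = np.minimum(Wf[valid] + 1.0, w_max)
    return Ff.reshape(F.shape), Wf.reshape(W.shape)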

Page 22: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Sensor Pose Estimation (1/4)
•  Estimate the pose of the live frame
–  Estimated using all pixel information of the live frame
•  Assumption:
–  The inter-frame motion is sufficiently small
•  Solution: ICP
–  Estimate the 6DOF transform between the live frame's (de-noised) vertex/normal maps and the surface prediction
–  Point-plane optimization

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 22

Page 23: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Sensor Pose Estimation (2/4) •  The global point-plane energy

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 23

Figure 5: Reconstruction of a scene showing raycasting of the TSDF (left) without and (middle and right) with interpolation of the TSDF at the surface interface using eqn. 15.

Figure 6: Demonstration of the space skipping ray casting. (Left) Pixel iteration counts are shown where for each pixel the ray is traversed in steps of at most one voxel (white equals 480 increments and black 60). (Middle) Ray marching steps are drastically reduced by skipping empty space according to the minimum truncation \mu (white equals 70 iterations and black 10; approximately a 6x speedup). Marching steps can be seen to increase around the surface interface where the signed distance function has not been truncated. (Right) Normal map at the resulting surface intersection.

The normal for the ray along which p was found can be computed directly from F_k using a numerical derivative of the SDF:

R_{g,k} N_k = N^g_k(u) = \nu\!\left[ \nabla F(p) \right], \qquad \nabla F = \left[ \frac{\partial F}{\partial x}, \frac{\partial F}{\partial y}, \frac{\partial F}{\partial z} \right]^\top .   (14)

Further, this derivative is scaled in each dimension to ensure correct isotropy given potentially arbitrary voxel resolutions and reconstruction dimensions.

Since the rendering of the surface is restricted to provide physically plausible measurement predictions, there is a minimum and maximum rendering range the ray needs to traverse, corresponding to a conservative minimum and maximum sensor range (approximately [0.4, 8] meters for the Kinect). This results in the desirable property of the proposed surface prediction requiring a bounded time per pixel computation for any size or complexity of scene with a fixed volumetric resolution.

Classically a min/max block acceleration structure [21] can be used to speed up marching through empty space. However, due to continual updating of the TSDF (which would require a constant update to the min/max macro blocks) we found that simple ray skipping provides a more useful acceleration. In ray skipping we utilise the fact that near F(p) = 0 the fused volume holds a good approximation to the true signed distance from p to the nearest surface interface. Using our known truncation distance we can march along the ray in steps with size < \mu while values of F(p) have positive truncated values, as we can assume a step \mu must pass through at least one non-truncated positive value before stepping over the surface zero crossing. The speed-up obtained is demonstrated in Figure 6 by measuring the number of steps required for each pixel to intersect the surface relative to standard marching.

Higher quality intersections can be obtained by solving a ray/trilinear cell intersection [21] that requires solving a cubic polynomial. As this is expensive we use a simple approximation. Given a ray has been found to intersect the SDF, where F^+_t and F^+_{t+\Delta t} are trilinearly interpolated SDF values either side of the zero crossing at points along the ray t and t + \Delta t from its starting point, we find the parameter t^* at which the intersection occurs more precisely:

t^* = t - \frac{\Delta t \, F^+_t}{F^+_{t+\Delta t} - F^+_t} .   (15)

The predicted vertex and normal maps are computed at the interpolated location in the global frame. Figure 5 shows a typical reconstruction; the interpolation scheme described achieves high quality occlusion boundaries at a fraction of the cost of full interpolation.

3.5 Sensor Pose Estimation
Live camera localisation involves estimating the current camera pose T_{w,k} \in \mathbb{SE}_3 (Equation 1) for each new depth image. Many tracking algorithms use feature selection to improve speed by reducing the number of points for which data association need be performed. In this work, we take advantage of two factors to allow us instead to make use of all of the data in a depth image for a dense iterated closest point based pose estimation. First, by maintaining a high tracking frame-rate, we can assume small motion from one frame to the next. This allows us to use the fast projective data association algorithm [4] to obtain correspondences and the point-plane metric [5] for pose optimisation. Second, modern GPU hardware enables a fully parallelised processing pipeline, so that the data association and point-plane optimisation can use all of the available surface measurements.

The point-plane error metric in combination with correspondences obtained using projective data association was first demonstrated in a real time modelling system by [23], where frame-to-frame tracking was used (with a fixed camera) for depth map alignment. In our system we instead track the current sensor frame by aligning a live surface measurement (V_k, N_k) against the model prediction from the previous frame (\hat{V}_{k-1}, \hat{N}_{k-1}). We note that frame-to-frame tracking is obtained simply by setting (\hat{V}_{k-1}, \hat{N}_{k-1}) \leftarrow (V_{k-1}, N_{k-1}), which is used in our experimental section for a comparison between frame-to-frame and frame-model tracking.

Utilising the surface prediction, the global point-plane energy, under the L2 norm, for the desired camera pose estimate T_{g,k} is:

E(T_{g,k}) = \sum_{\substack{u \in \mathcal{U} \\ \Omega_k(u) \neq \mathrm{null}}} \left\| \left( T_{g,k} \dot{V}_k(u) - \hat{V}^g_{k-1}(\hat{u}) \right)^\top \hat{N}^g_{k-1}(\hat{u}) \right\|_2 ,   (16)

where each global frame surface prediction is obtained using the previous fixed pose estimate T_{g,k-1}. The projective data association algorithm produces the set of vertex correspondences \{ V_k(u), \hat{V}_{k-1}(\hat{u}) \mid \Omega(u) \neq \mathrm{null} \} by computing the perspectively projected point \hat{u} = \pi(K \tilde{T}_{k-1,k} \dot{V}_k(u)), using an estimate for the frame-frame transform \tilde{T}^z_{k-1,k} = T^{-1}_{g,k-1} \tilde{T}^z_{g,k}, and testing the predicted and measured vertex and normal for compatibility. A threshold on the distance of vertices and difference in normal values suffices to reject grossly incorrect correspondences, also illustrated in Figure 7:

\Omega(u) \neq \mathrm{null} \;\;\text{iff}\;\; \begin{cases} M_k(u) = 1, \; \text{and} \\ \| \tilde{T}^z_{g,k} \dot{V}_k(u) - \hat{V}^g_{k-1}(\hat{u}) \|_2 \le \varepsilon_d, \; \text{and} \\ \langle \tilde{R}^z_{g,k} N_k(u), \hat{N}^g_{k-1}(\hat{u}) \rangle \le \varepsilon_\theta , \end{cases}   (17)

where \varepsilon_d and \varepsilon_\theta are threshold parameters of our system. \tilde{T}^{z=0}_{g,k} is initialised with the previous frame pose T_{g,k-1}. An iterative solution \tilde{T}^z_{g,k} for z > 0 is obtained by minimising the energy of a linearised version of (16) around the previous estimate \tilde{T}^{z-1}_{g,k}. Using the small angle assumption for an incremental transform:

\tilde{T}^z_{inc} = \left[ \tilde{R}^z \mid \tilde{t}^z \right] = \begin{bmatrix} 1 & \alpha & -\gamma & t_x \\ -\alpha & 1 & \beta & t_y \\ \gamma & -\beta & 1 & t_z \end{bmatrix} ,   (18)
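As a rough illustration of the surface prediction step above (ray skipping while F is at its positive truncation, followed by the zero-crossing refinement of eqn. 15), here is a single-ray march in NumPy. The nearest-neighbour volume lookup, the step sizes and the assumption that F stores truncated distances in metres are simplifications of the paper's GPU raycaster.

import numpy as np

def raycast_tsdf(F, origin, direction, mu, voxel_size, t_min=0.4, t_max=8.0):
    # March one ray (origin + t * direction, in metres) through a TSDF volume F that is
    # indexed in voxels. Returns the refined intersection parameter t*, or None if no hit.
    dims = np.array(F.shape)

    def sample(t):
        p = (origin + t * direction) / voxel_size             # continuous voxel coordinates
        i, j, k = np.clip(np.round(p).astype(int), 0, dims - 1)
        return F[i, j, k]                                      # nearest-neighbour lookup (trilinear in the paper)

    t_prev = t_min
    f_prev = sample(t_prev)
    t = t_prev
    while t < t_max:
        # Ray skipping: while F sits at its positive truncation we can take steps of almost mu.
        step = 0.8 * mu if f_prev >= mu else 0.5 * voxel_size
        t = t_prev + step
        f = sample(t)
        if f_prev > 0.0 and f <= 0.0:                          # stepped over the surface zero crossing
            # Eqn. (15): t* = t - dt * F+_t / (F+_{t+dt} - F+_t), with t the sample before the crossing.
            return t_prev - step * f_prev / (f - f_prev)
        f_prev, t_prev = f, t
    return None

The predicted vertex for the pixel is origin + t* * direction, and the predicted normal follows from the numerical gradient of F at that point, as in eqn. (14).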

In the point-plane energy of eqn. (16), (V_k, N_k) come from the live frame, while (\hat{V}^g_{k-1}, \hat{N}^g_{k-1}) are the model prediction from the previous frame.
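Below is a small sketch of how the projective data association and the compatibility tests of eqn. (17) can be evaluated for a single pixel, assuming the live and predicted vertex/normal maps are stored as H x W x 3 arrays with NaNs marking invalid entries. The threshold values, the frame conventions and the helper name associate are assumptions, not the paper's implementation.

import numpy as np

def associate(u, V_live, N_live, V_model_g, N_model_g, T_live, T_prev, K,
              eps_d=0.1, eps_theta=np.deg2rad(20.0)):
    # Projective data association with the compatibility tests of eqn. (17) for one pixel
    # u = (col, row). V_*/N_* are H x W x 3 maps; the model maps are in the global frame.
    R_l, t_l = T_live[:3, :3], T_live[:3, 3]
    p = V_live[u[1], u[0]]
    if not np.isfinite(p).all():                               # vertex validity mask M_k(u)
        return None
    p_g = R_l @ p + t_l                                        # live vertex under the current pose estimate
    p_prev = np.linalg.inv(T_prev) @ np.append(p_g, 1.0)       # into the previous camera's frame
    if p_prev[2] <= 0:
        return None
    q = K @ p_prev[:3]
    col, row = int(round(q[0] / q[2])), int(round(q[1] / q[2]))   # perspective projection pi(.)
    H, W = V_model_g.shape[:2]
    if not (0 <= col < W and 0 <= row < H):
        return None
    v_m, n_m = V_model_g[row, col], N_model_g[row, col]
    if not np.isfinite(v_m).all():
        return None
    if np.linalg.norm(p_g - v_m) > eps_d:                      # vertex distance threshold eps_d
        return None
    n_l = R_l @ N_live[u[1], u[0]]
    if np.arccos(np.clip(np.dot(n_l, n_m), -1.0, 1.0)) > eps_theta:   # normal compatibility threshold
        return None
    return v_m, n_m     # correspondence contributing ((T V_dot_k(u) - v_m)^T n_m) to eqn. (16)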

Technical Report TR04-004, Department of Computer Science, University of North Carolina at Chapel Hill, February 2004.

Linear Least-Squares Optimization for Point-to-Plane ICP Surface Registration

Kok-Lim Low
Department of Computer Science, University of North Carolina at Chapel Hill
Email: [email protected]

ABSTRACT
The Iterative Closest Point (ICP) algorithm that uses the point-to-plane error metric has been shown to converge much faster than one that uses the point-to-point error metric. At each iteration of the ICP algorithm, the change of relative pose that gives the minimal point-to-plane error is usually solved using standard nonlinear least-squares methods, which are often very slow. Fortunately, when the relative orientation between the two input surfaces is small, we can approximate the nonlinear optimization problem with a linear least-squares one that can be solved more efficiently. We detail the derivation of a linear system whose least-squares solution is a good approximation to that obtained from a nonlinear optimization.

1 INTRODUCTION
3D shape alignment is an important part of many applications. It is used for object recognition, in which newly acquired shapes in the environment are fitted to model shapes in the database. For reverse engineering and building real-world models for virtual reality, it is used to align multiple partial range scans to form models that are more complete. For autonomous range acquisition, 3D registration is used to accurately localize the range scanner, and to align data from multiple scans for view-planning computation.

Since its introduction by Besl and McKay [Besl92], the ICP (Iterative Closest Point) algorithm has become the most widely used method for aligning three-dimensional shapes (a similar algorithm was also introduced by Chen and Medioni [Chen92]). Rusinkiewicz and Levoy [Rusinkiewicz01] provide a recent survey of the many ICP variants based on the original ICP concept.

In the ICP algorithm described by Besl and McKay [Besl92], each point in one data set is paired with the closest point in the other data set to form correspondence pairs. Then a point-to-point error metric is used in which the sum of the squared distance between points in each correspondence pair is minimized. The process is iterated until the error becomes smaller than a threshold or it stops changing. On the other hand, Chen and Medioni [Chen92] used a point-to-plane error metric in which the object of minimization is the sum of the squared distance between a point and the tangent plane at its correspondence point. Unlike the point-to-point metric, which has a closed-form solution, the point-to-plane metric is usually solved using standard nonlinear least squares methods, such as the Levenberg-Marquardt method [Press92]. Although each iteration of the point-to-plane ICP algorithm is generally slower than the point-to-point version, researchers have observed significantly better convergence rates in the former [Rusinkiewicz01]. A more theoretical explanation of the convergence of the point-to-plane metric is described by Pottmann et al. [Pottmann02].

In [Rusinkiewicz01], it was suggested that when the relative orientation (rotation) between the two input surfaces is small, one can approximate the nonlinear least-squares optimization problem with a linear one, so as to speed up the computation. This approximation is simply done by replacing sin θ by θ and cos θ by 1 in the rotation matrix.

In this technical report, we describe in detail the derivation of a system of linear equations to approximate the original nonlinear system, and demonstrate how the least-squares solution to the linear system can be obtained using SVD (singular value decomposition). A 3D rigid-body transformation matrix is then constructed from the linear least-squares solution.

2 POINT-TO-PLANE ICP ALGORITHM
Given a source surface and a destination surface, each iteration of the ICP algorithm first establishes a set of pair-correspondences between points in the source surface and points in the destination surface. For example, for each point on the source surface, the nearest point on the destination surface is chosen as its correspondence [Besl92] (see [Rusinkiewicz01] for other approaches to find point correspondences). The output of an ICP iteration is a 3D rigid-body transformation M that transforms the source points such that the total error between the corresponding points, under a certain chosen error metric, is minimal.

When the point-to-plane error metric is used, the object of minimization is the sum of the squared distance between each source point and the tangent plane at its corresponding destination point (see Figure 1). More specifically, if s_i = (s_{ix}, s_{iy}, s_{iz}, 1)^\top is a source point, d_i = (d_{ix}, d_{iy}, d_{iz}, 1)^\top is the corresponding destination point, and n_i = (n_{ix}, n_{iy}, n_{iz}, 0)^\top is the unit normal vector at d_i, then the goal of each ICP iteration is to find M_{\mathrm{opt}} such that

M_{\mathrm{opt}} = \arg\min_{M} \sum_i \big( (M \cdot s_i - d_i) \cdot n_i \big)^2 ,   (1)

where M and M_{\mathrm{opt}} are 4x4 3D rigid-body transformation matrices.

Figure 1: Point-to-plane error between two surfaces.

Low, Kok-Lim. Linear Least-Squares Optimization for Point-to-Plane ICP Surface Registration. Technical Report TR04-004, University of North Carolina at Chapel Hill, 2004.
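To illustrate the linearisation Low describes, the sketch below forms, for each correspondence, one row of a linear system in the six parameters (alpha, beta, gamma, t_x, t_y, t_z) using the small-angle substitution sin θ ≈ θ, cos θ ≈ 1, and solves it by least squares (NumPy's lstsq, which uses an SVD internally, standing in for the report's explicit SVD solution). The function and variable names are mine, not the report's.

import numpy as np

def point_to_plane_small_angle(src, dst, nrm):
    # One linearised point-to-plane ICP step in the spirit of Low's report.
    # src, dst, nrm: (N, 3) source points, destination points and unit normals at dst.
    # Returns a 4x4 rigid-body transform approximately minimising eqn. (1).
    # Row i of the linear system: [ (s_i x n_i)^T  n_i^T ] x = n_i . (d_i - s_i).
    A = np.hstack([np.cross(src, nrm), nrm])                  # (N, 6)
    b = np.einsum('ij,ij->i', nrm, dst - src)                 # (N,)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)                 # SVD-based least squares
    alpha, beta, gamma, tx, ty, tz = x
    # Small-angle rotation (sin t ~ t, cos t ~ 1), then re-orthogonalised to a proper rotation.
    R_lin = np.array([[1.0, -gamma, beta],
                      [gamma, 1.0, -alpha],
                      [-beta, alpha, 1.0]])
    U, _, Vt = np.linalg.svd(R_lin)
    R = U @ Vt
    if np.linalg.det(R) < 0.0:                                # guard against a reflection
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T

In an ICP loop this step would be repeated: re-associate points, solve for T, apply it to src, and iterate until the point-to-plane error stops decreasing.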

Page 24: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Sensor Pose Estimation (3/4)
•  Iterative solution
•  Computing the parameter update

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 24

Figure 7: Example of point-plane outliers as a person steps into a partially reconstructed scene (left). Outliers from compatibility checks (Equation 17) using a surface measurement with (center) and without (right) bilateral filtering applied to the raw depth map. \Omega(u) = null are light grey, with unpredicted/unmeasured points shown in white.

An updated transform is simply \tilde{T}^z_{g,k} = \tilde{T}^z_{inc} \tilde{T}^{z-1}_{g,k}. Writing the update \tilde{T}^z_{inc} as a parameter vector,

x = (\beta, \gamma, \alpha, t_x, t_y, t_z)^\top \in \mathbb{R}^6 ,   (19)

and updating the current global frame vertex estimates for all pixels \{u \mid \Omega(u) \neq \mathrm{null}\}, \tilde{V}^g_k(u) = \tilde{T}^{z-1}_{g,k} \dot{V}_k(u), we minimise the linearised error function using the incremental point transfer:

\tilde{T}^z_{g,k} \dot{V}_k(u) = \tilde{R}^z \tilde{V}^g_k(u) + \tilde{t}^z = G(u)\,x + \tilde{V}^g_k(u) ,   (20)

where the 3x6 matrix G is formed with the skew-symmetric matrix form of \tilde{V}^g_k(u):

G(u) = \left[ \left[ \tilde{V}^g_k(u) \right]_\times \;\middle|\; I_{3\times 3} \right] .   (21)

An iteration is obtained by solving:

\min_{x \in \mathbb{R}^6} \sum_{\Omega_k(u) \neq \mathrm{null}} \| E \|_2^2 ,   (22)

E = \hat{N}^g_{k-1}(\hat{u})^\top \left( G(u)\,x + \tilde{V}^g_k(u) - \hat{V}^g_{k-1}(\hat{u}) \right) .   (23)

By computing the derivative of the objective function with respect to the parameter vector x and setting it to zero, a 6x6 symmetric linear system is generated for each vertex-normal element correspondence:

\sum_{\Omega_k(u) \neq \mathrm{null}} \left( A^\top A \right) x = \sum A^\top b ,   (24)

A^\top = G^\top(u)\, \hat{N}^g_{k-1}(\hat{u}) ,   (25)

b = \hat{N}^g_{k-1}(\hat{u})^\top \left( \hat{V}^g_{k-1}(\hat{u}) - \tilde{V}^g_k(u) \right) .   (26)

In our GPU implementation each summand of the normal system is computed in parallel. The symmetry of the system enables operations and memory to be saved, and the final sum is obtained using a parallel tree-based reduction [13] to obtain the upper triangular component of the symmetric system. The solution vector x is efficiently computed using a Cholesky decomposition on the host (CPU) and coerced back into an SE_3 transform, which we compose onto the previous pose estimate, obtaining \tilde{T}^z_{g,k}. The data association and pose minimisation is embedded into a coarse to fine framework using the bottom 3 levels of a vertex and normal map pyramid. We iterate for a maximum of z_max = [4, 5, 10] iterations in levels [3, 2, 1] respectively, starting with the coarsest level 3. After all iterations are completed we fix the final camera pose T_{g,k} \leftarrow \tilde{T}^{z_max}_{g,k}. (A NumPy sketch of this accumulation and solve is given after the experiment list below.)

Stability and validity check for transformation update: Ideally we would only like to perform a surface update if we are sure that tracking has not failed or is highly erroneous. As inter-frame sensor motion increases, the assumptions made in both the linearisation of the point-plane error metric and the projective data association can be broken. Also, if the currently observable surface geometry does not provide point-plane pairs that constrain the full 6DOF of the linear system, then an arbitrary solution within the remaining free DOFs can be obtained. Both outcomes will lead to a reduced quality reconstruction and tracking failure. We therefore perform a check on the null space of the normal system to ensure it is adequately constrained. We also perform a simple threshold check on the magnitude of the incremental transform parameters x, to ensure the small angle assumption was not drastically broken. If either test fails, the system is placed into re-localisation mode.

Relocalisation: Our current implementation uses an interactive re-localisation scheme, whereby if the sensor loses track, the last known sensor pose is used to provide a surface prediction, and the user is instructed to align the incoming depth frame with what is displayed. While running the pose optimisation, if the stability and validity checks are passed, tracking and mapping are resumed.

4 EXPERIMENTS
We have conducted a number of experiments to investigate the performance of our system. These and other aspects, such as the system's ability to keep track during very rapid motion, are illustrated extensively in our submitted video.

4.1 Metrically Consistent Reconstruction
Our tracking and mapping system provides a constant time algorithm for a given area of reconstruction, and we are interested in investigating its ability to form metrically consistent models from trajectories containing local loop closures without requiring explicit global joint-estimation. We are also interested in the ability of the system to scale gracefully with different processing and memory resources.

To investigate these properties we conducted the following experiment. The Kinect sensor was placed in a fixed location observing a tabletop scene mounted on a turntable. The turntable was then spun through a full rotation as depth data was captured over approximately 19 seconds, resulting in N = 560 frames. For the purposes of our system, if the reconstruction volume is set to span solely the region of the rotating scene, the resulting depth image sequence obtained is obviously equivalent to the Kinect having been moved on a precise circular track around a static table, and this allows us to easily evaluate the quality of tracking. All parameters of the system are kept constant using a reconstruction resolution of 256^3 voxels unless stated otherwise.

The N frames of depth data captured were then processed in each of the following ways:

1. Frames 1...N were fused together within the TSDF using sensor pose estimates obtained with our frame-to-frame only ICP implementation.

2. Frames 1...L, L < N, were fed through our standard tracking and mapping pipeline, forming an incomplete loop closure. Here, sensor pose estimates are obtained by the full frame-model ICP method.

3. Frames 1...N were fed through our standard tracking and mapping pipeline, resulting in a complete loop closure around the table. Again, sensor pose estimates are obtained by frame-model ICP.

4. Frames 1...N were fed not just once but repeatedly for M = 4 loops to the standard tracking and mapping pipeline. This was possible because the sensor motion was such that frame 1 and frame N were captured from almost the same place.
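As referenced above, the per-pixel accumulation of the normal equations (24)-(26) and the Cholesky solve can be sketched as follows. In the paper the summation runs in parallel on the GPU with a tree-based reduction; this NumPy version simply loops over a boolean validity mask. The array names and the exact parameter ordering are assumptions.

import numpy as np

def skew(v):
    # Skew-symmetric matrix [v]_x such that [v]_x w = v x w.
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def icp_increment(V_live_g, V_model_g, N_model_g, valid):
    # Accumulate sum(A^T A) x = sum(A^T b) over valid pixels (eqns. 22-26) and solve for the
    # incremental transform parameters x of eqn. (19). V_live_g holds the live vertices already
    # transformed into the global frame by the previous estimate; the model maps come from raycasting.
    AtA = np.zeros((6, 6))
    Atb = np.zeros(6)
    H, W = valid.shape
    for r in range(H):
        for c in range(W):
            if not valid[r, c]:
                continue
            v_l, v_m, n_m = V_live_g[r, c], V_model_g[r, c], N_model_g[r, c]
            G = np.hstack([skew(v_l), np.eye(3)])      # eqn. (21): G(u) = [ [v]_x | I ]
            A_row = n_m @ G                            # eqn. (25): A^T = G^T N, written here as a row
            b = n_m @ (v_m - v_l)                      # eqn. (26)
            AtA += np.outer(A_row, A_row)
            Atb += A_row * b
    # Solve the symmetric 6x6 system with a Cholesky factorisation, as in the paper.
    L = np.linalg.cholesky(AtA)
    return np.linalg.solve(L.T, np.linalg.solve(L, Atb))

The resulting x is turned into the small-angle incremental transform of eqn. (18) and composed onto the previous estimate; note that the ordering of the rotation parameters in x depends on the skew convention used, and eqn. (19) orders them as (beta, gamma, alpha, t_x, t_y, t_z).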

Figure 7: Example of point-plane outliers as person steps into par-tially reconstructed scene (left). Outliers from compatibility checks(Equation 17) using a surface measurement with (center) and with-out (right) bilateral filtering applied to the raw depth map. W(u) = nullare light grey with unpredicted/unmeasured points shown in white.

an updated transform is simply eTzg,k =

eTzinceTz�1

g,k . Writing the updateeTz

inc as a parameter vector,

x = (b ,g,a, tx, ty, tz)> 2 R6 (19)

and updating the current global frame vertex estimates for all pixels{u|W(u) 6= null}, eVg

k(u) =eTz�1

g,k˙

Vk(u), we minimise the linearisederror function using the incremental point transfer:

eTzg,k

˙

Vk(u) = eRzeV

gk(u)+

et

z = G(u)x+ eV

gk(u) , (20)

where the 3⇥6 matrix G is formed with the skew-symmetric matrixform of eVg

k(u):

G(u) =

heV

gk(u)

i

��I3⇥3

�. (21)

An iteration is obtained by solving:

minx2R6 Â

Wk(u) 6=nullkEk2

2 (22)

E = N

gk�1(u)

>⇣

G(u)x+ eV

gk(u)� V

gk�1(u)

⌘(23)

By computing the derivative of the objective function with respectto the parameter vector x and setting to zero, a 6⇥ 6 symmetriclinear system is generated for each vertex-normal element corre-spondence:

ÂWk(u) 6=null

⇣A

>A

⌘x = ÂA

>b, (24)

A

> = G

>(u)Ngk�1(u), (25)

b = N

gk�1(u)

>⇣

V

gk�1(u)� e

V

gk(u)

⌘.(26)

In our GPU implementation each summand of the normal sys-tem is computed in parallel. The symmetry of the system enablesoperations and memory to be saved and the final sum is obtainedusing a parallel tree-based reduction [13], to obtain the upper tri-angular component of the symmetric system. The solution vectorx is efficiently computed using a Cholesky decomposition on thehost (CPU) and coerced back into an SE3 transform which we com-pose onto the previous pose estimate, obtaining eTz

g,k. The data as-sociation and pose minimisation is embedded into a coarse to fineframework using the bottom 3 levels of a vertex and normal mappyramid. We iterate for a maximum of zmax = [4,5,10] iterations inlevels [3,2,1] respectively, starting with the coarsest level 3. Afterall iterations are completed we fix the final camera pose Tg,k eTzmax

g,kStability and validity check for transformation update Ide-

ally we would only like to perform a surface update if we are sure

that tracking has not failed or is highly erroneous. As inter-framesensor motion increases the assumptions made in both linearisationof the point-plane error metric and the projective data associationcan be broken. Also, if the currently observable surface geometrydoes not provide point-plane pairs that constrain the full 6DOF ofthe linear system then an arbitrary solution within the remainingfree DOFs can be obtained. Both outcomes will lead to a reducedquality reconstruction and tracking failure. We therefore performa check on the null space of the normal system to ensure it is ade-quately constrained. We also perform a simple threshold check onthe magnitude of incremental transform parameters x, to ensure thesmall angle assumption was not drastically broken. If either testfails, the system is placed into re-localisation mode.

Relocalisation Our current implementation uses an interactivere-localisation scheme, whereby if the sensor loses track, the lastknown sensor pose is used to provide a surface prediction, and theuser instructed to align the incoming depth frame with what is dis-played. While running the pose optimisation, if the stability andvalidity checks are passed tracking and mapping are resumed.

4 EXPERIMENTS

We have conducted a number of experiments to investigate the per-formance of our system. These and other aspects, such as the sys-tem’s ability to keep track during very rapid motion, are illustratedextensively in our submitted video.

4.1 Metrically Consistent ReconstructionOur tracking and mapping system provides a constant time algo-rithm for a given area of reconstruction, and we are interested ininvestigating its ability to form metrically consistent models fromtrajectories containing local loop closures without requiring explicitglobal joint-estimation. We are also interested in the ability of thesystem to scale gracefully with different processing and memoryresources.

To investigate these properties we conducted the following ex-periment. The Kinect sensor was placed in a fixed location observ-ing a tabletop scene mounted on a turntable. The turntable wasthen spun through a full rotation as depth data was captured over⇡ 19 seconds, resulting in N = 560 frames. For the purposes ofour system, if the reconstruction volume is set to span solely theregion of the rotating scene, the resulting depth image sequence ob-tained is obviously equivalent to the Kinect having been moved ona precise circular track around a static table, and this allows us toeasily evaluate the quality of tracking. All parameters of the systemare kept constant using a reconstruction resolution of 2563 voxelsunless stated otherwise.

The N frames of depth data captured were then processed in eachof the following ways:

1. Frames 1 . . .N were fused together within the TSDF usingsensor pose estimates obtained with our frame-to-frame onlyICP implementation.

2. Frames 1 . . .L, L < N were fed through our standard trackingand mapping pipeline forming an incomplete loop closure.Here, sensor pose estimates are obtained by the full frame-model ICP method.

3. Frames 1 . . .N were fed through our standard tracking andmapping pipeline resulting in a complete loop closure aroundthe table. Again, sensor pose estimates are obtained by frame-model ICP.

4. Frames 1 . . .N were fed not just once but repeatedly for M = 4loops to the standard tracking and mapping pipeline. This waspossible because the sensor motion was such that frame 1 andframe N were captured from almost the same place.

Figure 7: Example of point-plane outliers as person steps into par-tially reconstructed scene (left). Outliers from compatibility checks(Equation 17) using a surface measurement with (center) and with-out (right) bilateral filtering applied to the raw depth map. W(u) = nullare light grey with unpredicted/unmeasured points shown in white.

an updated transform is simply eTzg,k =

eTzinceTz�1

g,k . Writing the updateeTz

inc as a parameter vector,

x = (b ,g,a, tx, ty, tz)> 2 R6 (19)

and updating the current global frame vertex estimates for all pixels{u|W(u) 6= null}, eVg

k(u) =eTz�1

g,k˙

Vk(u), we minimise the linearisederror function using the incremental point transfer:

eTzg,k

˙

Vk(u) = eRzeV

gk(u)+

et

z = G(u)x+ eV

gk(u) , (20)

where the 3⇥6 matrix G is formed with the skew-symmetric matrixform of eVg

k(u):

G(u) =

heV

gk(u)

i

��I3⇥3

�. (21)

An iteration is obtained by solving:

minx2R6 Â

Wk(u) 6=nullkEk2

2 (22)

E = N

gk�1(u)

>⇣

G(u)x+ eV

gk(u)� V

gk�1(u)

⌘(23)

By computing the derivative of the objective function with respectto the parameter vector x and setting to zero, a 6⇥ 6 symmetriclinear system is generated for each vertex-normal element corre-spondence:

ÂWk(u) 6=null

⇣A

>A

⌘x = ÂA

>b, (24)

A

> = G

>(u)Ngk�1(u), (25)

b = N

gk�1(u)

>⇣

V

gk�1(u)� e

V

gk(u)

⌘.(26)

In our GPU implementation each summand of the normal sys-tem is computed in parallel. The symmetry of the system enablesoperations and memory to be saved and the final sum is obtainedusing a parallel tree-based reduction [13], to obtain the upper tri-angular component of the symmetric system. The solution vectorx is efficiently computed using a Cholesky decomposition on thehost (CPU) and coerced back into an SE3 transform which we com-pose onto the previous pose estimate, obtaining eTz

g,k. The data as-sociation and pose minimisation is embedded into a coarse to fineframework using the bottom 3 levels of a vertex and normal mappyramid. We iterate for a maximum of zmax = [4,5,10] iterations inlevels [3,2,1] respectively, starting with the coarsest level 3. Afterall iterations are completed we fix the final camera pose Tg,k eTzmax

g,kStability and validity check for transformation update Ide-

ally we would only like to perform a surface update if we are sure

that tracking has not failed or is highly erroneous. As inter-framesensor motion increases the assumptions made in both linearisationof the point-plane error metric and the projective data associationcan be broken. Also, if the currently observable surface geometrydoes not provide point-plane pairs that constrain the full 6DOF ofthe linear system then an arbitrary solution within the remainingfree DOFs can be obtained. Both outcomes will lead to a reducedquality reconstruction and tracking failure. We therefore performa check on the null space of the normal system to ensure it is ade-quately constrained. We also perform a simple threshold check onthe magnitude of incremental transform parameters x, to ensure thesmall angle assumption was not drastically broken. If either testfails, the system is placed into re-localisation mode.

Relocalisation Our current implementation uses an interactivere-localisation scheme, whereby if the sensor loses track, the lastknown sensor pose is used to provide a surface prediction, and theuser instructed to align the incoming depth frame with what is dis-played. While running the pose optimisation, if the stability andvalidity checks are passed tracking and mapping are resumed.

4 EXPERIMENTS

We have conducted a number of experiments to investigate the per-formance of our system. These and other aspects, such as the sys-tem’s ability to keep track during very rapid motion, are illustratedextensively in our submitted video.

4.1 Metrically Consistent ReconstructionOur tracking and mapping system provides a constant time algo-rithm for a given area of reconstruction, and we are interested ininvestigating its ability to form metrically consistent models fromtrajectories containing local loop closures without requiring explicitglobal joint-estimation. We are also interested in the ability of thesystem to scale gracefully with different processing and memoryresources.

To investigate these properties we conducted the following ex-periment. The Kinect sensor was placed in a fixed location observ-ing a tabletop scene mounted on a turntable. The turntable wasthen spun through a full rotation as depth data was captured over⇡ 19 seconds, resulting in N = 560 frames. For the purposes ofour system, if the reconstruction volume is set to span solely theregion of the rotating scene, the resulting depth image sequence ob-tained is obviously equivalent to the Kinect having been moved ona precise circular track around a static table, and this allows us toeasily evaluate the quality of tracking. All parameters of the systemare kept constant using a reconstruction resolution of 2563 voxelsunless stated otherwise.

The N frames of depth data captured were then processed in eachof the following ways:

1. Frames 1 . . .N were fused together within the TSDF usingsensor pose estimates obtained with our frame-to-frame onlyICP implementation.

2. Frames 1 . . .L, L < N were fed through our standard trackingand mapping pipeline forming an incomplete loop closure.Here, sensor pose estimates are obtained by the full frame-model ICP method.

3. Frames 1 . . .N were fed through our standard tracking andmapping pipeline resulting in a complete loop closure aroundthe table. Again, sensor pose estimates are obtained by frame-model ICP.

4. Frames 1 . . .N were fed not just once but repeatedly for M = 4loops to the standard tracking and mapping pipeline. This waspossible because the sensor motion was such that frame 1 andframe N were captured from almost the same place.

Figure 7: Example of point-plane outliers as person steps into par-tially reconstructed scene (left). Outliers from compatibility checks(Equation 17) using a surface measurement with (center) and with-out (right) bilateral filtering applied to the raw depth map. W(u) = nullare light grey with unpredicted/unmeasured points shown in white.

an updated transform is simply eTzg,k =

eTzinceTz�1

g,k . Writing the updateeTz

inc as a parameter vector,

x = (b ,g,a, tx, ty, tz)> 2 R6 (19)

and updating the current global frame vertex estimates for all pixels{u|W(u) 6= null}, eVg

k(u) =eTz�1

g,k˙

Vk(u), we minimise the linearisederror function using the incremental point transfer:

eTzg,k

˙

Vk(u) = eRzeV

gk(u)+

et

z = G(u)x+ eV

gk(u) , (20)

where the 3⇥6 matrix G is formed with the skew-symmetric matrixform of eVg

k(u):

G(u) =

heV

gk(u)

i

��I3⇥3

�. (21)

An iteration is obtained by solving:

minx2R6 Â

Wk(u) 6=nullkEk2

2 (22)

E = N

gk�1(u)

>⇣

G(u)x+ eV

gk(u)� V

gk�1(u)

⌘(23)

By computing the derivative of the objective function with respectto the parameter vector x and setting to zero, a 6⇥ 6 symmetriclinear system is generated for each vertex-normal element corre-spondence:

ÂWk(u) 6=null

⇣A

>A

⌘x = ÂA

>b, (24)

A

> = G

>(u)Ngk�1(u), (25)

b = N

gk�1(u)

>⇣

V

gk�1(u)� e

V

gk(u)

⌘.(26)

In our GPU implementation each summand of the normal sys-tem is computed in parallel. The symmetry of the system enablesoperations and memory to be saved and the final sum is obtainedusing a parallel tree-based reduction [13], to obtain the upper tri-angular component of the symmetric system. The solution vectorx is efficiently computed using a Cholesky decomposition on thehost (CPU) and coerced back into an SE3 transform which we com-pose onto the previous pose estimate, obtaining eTz

g,k. The data as-sociation and pose minimisation is embedded into a coarse to fineframework using the bottom 3 levels of a vertex and normal mappyramid. We iterate for a maximum of zmax = [4,5,10] iterations inlevels [3,2,1] respectively, starting with the coarsest level 3. Afterall iterations are completed we fix the final camera pose Tg,k eTzmax

g,kStability and validity check for transformation update Ide-

ally we would only like to perform a surface update if we are sure

that tracking has not failed or is highly erroneous. As inter-framesensor motion increases the assumptions made in both linearisationof the point-plane error metric and the projective data associationcan be broken. Also, if the currently observable surface geometrydoes not provide point-plane pairs that constrain the full 6DOF ofthe linear system then an arbitrary solution within the remainingfree DOFs can be obtained. Both outcomes will lead to a reducedquality reconstruction and tracking failure. We therefore performa check on the null space of the normal system to ensure it is ade-quately constrained. We also perform a simple threshold check onthe magnitude of incremental transform parameters x, to ensure thesmall angle assumption was not drastically broken. If either testfails, the system is placed into re-localisation mode.

Relocalisation Our current implementation uses an interactivere-localisation scheme, whereby if the sensor loses track, the lastknown sensor pose is used to provide a surface prediction, and theuser instructed to align the incoming depth frame with what is dis-played. While running the pose optimisation, if the stability andvalidity checks are passed tracking and mapping are resumed.

4 EXPERIMENTS

We have conducted a number of experiments to investigate the per-formance of our system. These and other aspects, such as the sys-tem’s ability to keep track during very rapid motion, are illustratedextensively in our submitted video.

4.1 Metrically Consistent ReconstructionOur tracking and mapping system provides a constant time algo-rithm for a given area of reconstruction, and we are interested ininvestigating its ability to form metrically consistent models fromtrajectories containing local loop closures without requiring explicitglobal joint-estimation. We are also interested in the ability of thesystem to scale gracefully with different processing and memoryresources.

To investigate these properties we conducted the following ex-periment. The Kinect sensor was placed in a fixed location observ-ing a tabletop scene mounted on a turntable. The turntable wasthen spun through a full rotation as depth data was captured over⇡ 19 seconds, resulting in N = 560 frames. For the purposes ofour system, if the reconstruction volume is set to span solely theregion of the rotating scene, the resulting depth image sequence ob-tained is obviously equivalent to the Kinect having been moved ona precise circular track around a static table, and this allows us toeasily evaluate the quality of tracking. All parameters of the systemare kept constant using a reconstruction resolution of 2563 voxelsunless stated otherwise.

The N frames of depth data captured were then processed in eachof the following ways:

1. Frames 1 . . .N were fused together within the TSDF usingsensor pose estimates obtained with our frame-to-frame onlyICP implementation.

2. Frames 1 . . .L, L < N were fed through our standard trackingand mapping pipeline forming an incomplete loop closure.Here, sensor pose estimates are obtained by the full frame-model ICP method.

3. Frames 1 . . .N were fed through our standard tracking andmapping pipeline resulting in a complete loop closure aroundthe table. Again, sensor pose estimates are obtained by frame-model ICP.

4. Frames 1 . . .N were fed not just once but repeatedly for M = 4loops to the standard tracking and mapping pipeline. This waspossible because the sensor motion was such that frame 1 andframe N were captured from almost the same place.

Figure 7: Example of point-plane outliers as person steps into par-tially reconstructed scene (left). Outliers from compatibility checks(Equation 17) using a surface measurement with (center) and with-out (right) bilateral filtering applied to the raw depth map. W(u) = nullare light grey with unpredicted/unmeasured points shown in white.

an updated transform is simply T̃ᶻ_{g,k} = T̃ᶻ_inc T̃ᶻ⁻¹_{g,k}. Writing the update T̃ᶻ_inc as a parameter vector,

x = (β, γ, α, t_x, t_y, t_z)^T ∈ R^6    (19)

and updating the current global frame vertex estimates for all pixels {u | W(u) ≠ null}, Ṽ^g_k(u) = T̃ᶻ⁻¹_{g,k} V̇_k(u), we minimise the linearised error function using the incremental point transfer:

T̃ᶻ_{g,k} V̇_k(u) = R̃ᶻ Ṽ^g_k(u) + t̃ᶻ = G(u) x + Ṽ^g_k(u),    (20)

where the 3×6 matrix G is formed with the skew-symmetric matrix form of Ṽ^g_k(u):

G(u) = [ [Ṽ^g_k(u)]_× | I_{3×3} ].    (21)

An iteration is obtained by solving:

min_{x ∈ R^6} Σ_{W_k(u) ≠ null} ||E||²₂    (22)

E = N^g_{k-1}(u)^T ( G(u) x + Ṽ^g_k(u) − V^g_{k-1}(u) )    (23)

By computing the derivative of the objective function with respect to the parameter vector x and setting to zero, a 6×6 symmetric linear system is generated for each vertex-normal element correspondence:

Σ_{W_k(u) ≠ null} ( A^T A ) x = Σ A^T b,    (24)

A^T = G^T(u) N^g_{k-1}(u),    (25)

b = N^g_{k-1}(u)^T ( V^g_{k-1}(u) − Ṽ^g_k(u) ).    (26)

In our GPU implementation each summand of the normal system is computed in parallel. The symmetry of the system enables operations and memory to be saved and the final sum is obtained using a parallel tree-based reduction [13], to obtain the upper triangular component of the symmetric system. The solution vector x is efficiently computed using a Cholesky decomposition on the host (CPU) and coerced back into an SE3 transform which we compose onto the previous pose estimate, obtaining T̃ᶻ_{g,k}. The data association and pose minimisation is embedded into a coarse to fine framework using the bottom 3 levels of a vertex and normal map pyramid. We iterate for a maximum of z_max = [4, 5, 10] iterations in levels [3, 2, 1] respectively, starting with the coarsest level 3. After all iterations are completed we fix the final camera pose T_{g,k} = T̃^{z_max}_{g,k}.

Stability and validity check for transformation update: Ideally we would only like to perform a surface update if we are sure that tracking has not failed or is highly erroneous. As inter-frame sensor motion increases the assumptions made in both linearisation of the point-plane error metric and the projective data association can be broken. Also, if the currently observable surface geometry does not provide point-plane pairs that constrain the full 6DOF of the linear system then an arbitrary solution within the remaining free DOFs can be obtained. Both outcomes will lead to a reduced quality reconstruction and tracking failure. We therefore perform a check on the null space of the normal system to ensure it is adequately constrained. We also perform a simple threshold check on the magnitude of incremental transform parameters x, to ensure the small angle assumption was not drastically broken. If either test fails, the system is placed into re-localisation mode.

Relocalisation: Our current implementation uses an interactive re-localisation scheme, whereby if the sensor loses track, the last known sensor pose is used to provide a surface prediction, and the user instructed to align the incoming depth frame with what is displayed. While running the pose optimisation, if the stability and validity checks are passed tracking and mapping are resumed.
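Hedged illustration: the 6×6 normal system of equations (22)-(26) can be accumulated in a few lines of NumPy. This is a generic point-to-plane ICP linearisation in the spirit of the derivation above, not the paper's GPU implementation; `src_pts` stands for the live vertices transformed by the current pose estimate, and `dst_pts` / `dst_nrm` for the predicted model vertices and normals (the parameter ordering and sign conventions may differ from the paper's).

    import numpy as np

    def point_plane_system(src_pts, dst_pts, dst_nrm):
        # Accumulate sum(a a^T) and sum(a b) over all valid correspondences.
        # a = [s x n, n] and b = n . (d - s) is the standard small-angle
        # linearisation of the point-to-plane error for x = (rotation, translation).
        AtA = np.zeros((6, 6))
        Atb = np.zeros(6)
        for s, d, n in zip(src_pts, dst_pts, dst_nrm):
            a = np.concatenate([np.cross(s, n), n])
            b = np.dot(n, d - s)
            AtA += np.outer(a, a)
            Atb += a * b
        return AtA, Atb

    def solve_increment(src_pts, dst_pts, dst_nrm):
        # The paper solves the symmetric 6x6 system with a Cholesky factorisation;
        # np.linalg.solve is used here for brevity.
        AtA, Atb = point_plane_system(src_pts, dst_pts, dst_nrm)
        return np.linalg.solve(AtA, Atb)  # x = (3 rotation params, t_x, t_y, t_z)

In the real pipeline this loop body is evaluated per pixel on the GPU and summed with a parallel reduction, as described above.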

4 EXPERIMENTS

We have conducted a number of experiments to investigate the performance of our system. These and other aspects, such as the system's ability to keep track during very rapid motion, are illustrated extensively in our submitted video.

4.1 Metrically Consistent Reconstruction

Our tracking and mapping system provides a constant time algorithm for a given area of reconstruction, and we are interested in investigating its ability to form metrically consistent models from trajectories containing local loop closures without requiring explicit global joint-estimation. We are also interested in the ability of the system to scale gracefully with different processing and memory resources.

To investigate these properties we conducted the following experiment. The Kinect sensor was placed in a fixed location observing a tabletop scene mounted on a turntable. The turntable was then spun through a full rotation as depth data was captured over ≈ 19 seconds, resulting in N = 560 frames. For the purposes of our system, if the reconstruction volume is set to span solely the region of the rotating scene, the resulting depth image sequence obtained is obviously equivalent to the Kinect having been moved on a precise circular track around a static table, and this allows us to easily evaluate the quality of tracking. All parameters of the system are kept constant using a reconstruction resolution of 256^3 voxels unless stated otherwise.

The N frames of depth data captured were then processed in each of the following ways:

1. Frames 1...N were fused together within the TSDF using sensor pose estimates obtained with our frame-to-frame only ICP implementation.

2. Frames 1...L, L < N were fed through our standard tracking and mapping pipeline forming an incomplete loop closure. Here, sensor pose estimates are obtained by the full frame-model ICP method.

3. Frames 1...N were fed through our standard tracking and mapping pipeline resulting in a complete loop closure around the table. Again, sensor pose estimates are obtained by frame-model ICP.

4. Frames 1...N were fed not just once but repeatedly for M = 4 loops to the standard tracking and mapping pipeline. This was possible because the sensor motion was such that frame 1 and frame N were captured from almost the same place.

ICP update increment:

Energy Function:

Page 25: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Sensor Pose Estimation (4/4) •  Coarse-to-fine estimation
–  3-level pyramid –  An upper limit on the number of ICP iterations is set for each level (see the sketch below)
•  L3 : 4 iterations •  L2 : 5 iterations •  L1 : 10 iterations
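A sketch of the coarse-to-fine schedule from this slide (illustrative only; `icp_step` is a placeholder for one round of data association plus one solve of the 6×6 system, e.g. the `solve_increment` sketch above composed onto the current pose):

    # Per-level iteration caps from the slide: L3 -> 4, L2 -> 5, L1 -> 10.
    MAX_ITERS = {3: 4, 2: 5, 1: 10}

    def coarse_to_fine_icp(live_pyramid, model_pyramid, T_init, icp_step):
        # Iterate from the coarsest level (3) down to the finest (1),
        # warm-starting each level with the pose estimated at the previous one.
        T = T_init
        for level in (3, 2, 1):
            for _ in range(MAX_ITERS[level]):
                T = icp_step(live_pyramid[level], model_pyramid[level], T)
        return T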

Page 26: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Result


https://www.youtube.com/watch?v=quGhaggn3cQ

Page 27: 30th コンピュータビジョン勉強会@関東 DynamicFusion


Up to this point: KinectFusion

From here on: DynamicFusion

Page 28: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion: Overview (1/3) •  Extends KinectFusion to dynamic scenes



1 INTRODUCTION

Real-time infrastructure-free tracking of a handheld camera whilst simultaneously mapping the physical scene in high detail promises new possibilities for augmented and mixed reality applications.

In computer vision, research on structure from motion (SFM) and multi-view stereo (MVS) has produced many compelling results, in particular accurate camera tracking and sparse reconstructions (e.g. [10]), and increasingly reconstruction of dense surfaces (e.g. [24]). However, much of this work was not motivated by real-time applications.

Research on simultaneous localisation and mapping (SLAM) has focused more on real-time markerless tracking and live scene reconstruction based on the input of a single commodity sensor: a monocular RGB camera. Such 'monocular SLAM' systems such as MonoSLAM [8] and the more accurate Parallel Tracking and Mapping (PTAM) system [17] allow researchers to investigate flexible infrastructure- and marker-free AR applications. But while these systems perform real-time mapping, they were optimised for efficient camera tracking, with the sparse point cloud models they produce enabling only rudimentary scene reconstruction.

In the past year, systems have begun to emerge that combine PTAM's handheld camera tracking capability with dense surface MVS-style reconstruction modules, enabling more sophisticated occlusion prediction and surface interaction [19, 26]. Most recently in this line of research, iterative image alignment against dense reconstructions has also been used to replace point features for camera tracking [20]. While this work is very promising for AR, dense scene reconstruction in real-time remains a challenge for passive monocular systems which assume the availability of the right type of camera motion and suitable scene illumination.

But while algorithms for estimating camera pose and extracting geometry from images have been evolving at pace, so have the camera technologies themselves. New depth cameras based either on time-of-flight (ToF) or structured light sensing offer dense measurements of depth in an integrated device. With the arrival of Microsoft's Kinect, such sensing has suddenly reached wide consumer-level accessibility. The opportunities for SLAM and AR

Page 29: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion: Overview (2/3) •  Fundamentally the same as KinectFusion
–  Scene representation: Truncated Signed Distance Function (TSDF)
–  The input data is a depth map –  Depth maps are fused while estimating the pose
•  Extension point –  Representation of a dynamic scene:
Canonical Space + Warp-Field –  Canonical Space : the reference space –  Warp-Field : the deformation between the live frame and the reference space
(Deformation)

Page 30: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion: Overview (3/3)


(a) Initial Frame at t = 0s (b) Raw (noisy) depth maps for frames at t = 1s, 10s, 15s, 20s (c) Node Distance (d) Canonical Model (e) Canonical model warped into its live frame (f) Model Normals

Figure 2: DynamicFusion takes an online stream of noisy depth maps (a,b) and outputs a real-time dense reconstruction of the moving scene (d,e). To achieve this, we estimate a volumetric warp (motion) field that transforms the canonical model space into the live frame, enabling the scene motion to be undone, and all depth maps to be densely fused into a single rigid TSDF reconstruction (d,f). Simultaneously, the structure of the warp field is constructed as a set of sparse 6D transformation nodes that are smoothly interpolated through a k-nearest node average in the canonical frame (c). The resulting per-frame warp field estimate enables the progressively denoised and completed scene geometry to be transformed into the live frame in real-time (e). In (e) we also visualise motion trails for a sub-sample of model vertices over the last 1 second of scene motion together with a coordinate frame showing the rigid body component of the scene motion. In (c) we render the nearest node to model surface distance where increased distance is mapped to a lighter value.

of objects with both translation and rotation results in significantly better tracking and reconstruction. For each canonical point v_c ∈ S, T_lc = W(v_c) transforms that point from canonical space into the live, non-rigidly deformed frame of reference.

Since we will need to estimate the warp function for each new frame, W_t, its representation must be efficiently optimisable. One possibility is to densely sample the volume, e.g. representing a quantised SE(3) field at the resolution of the truncated signed distance (TSDF) geometric representation. However, a typical TSDF volume reconstruction at a relatively low resolution of 256^3 voxels would require the solution of 6 × 256^3 parameters per frame, about 10 million times more than in the original KinectFusion algorithm, which only estimates a single rigid transformation. Clearly, a completely dense parametrisation of the warp function is infeasible. In reality, surfaces tend to move smoothly in space, and so we can instead use a sparse set of transformations as bases and define the dense volumetric warp function through interpolation. Due to its computational efficiency and high quality interpolation capability we use dual-quaternion blending (DQB) [11] to define our warp function:

W(x_c) ≡ SE3(DQB(x_c)),    (1)

where the weighted average over unit dual quaternion transformations is simply

DQB(x_c) ≡ ( Σ_{k∈N(x_c)} w_k(x_c) q_kc ) / || Σ_{k∈N(x_c)} w_k(x_c) q_kc ||,

with each unit dual-quaternion q_kc ∈ R^8. Here, N(x) are the k-nearest transformation nodes to the point x and w_k : R^3 → R defines a weight that alters the radius of influence of each node, and SE3(.) converts from quaternions back to an SE(3) transformation matrix. The state of the warp-field W_t at time t is defined by the values of a set of n deformation nodes N^t_warp = {dg_v, dg_w, dg_se3}_t. Each of the i = 1..n nodes has a position in the canonical frame dg^i_v ∈ R^3, its associated transformation T_ic = dg^i_se3, and a radial basis weight dg^i_w that controls the extent of the transformation: w_i(x_c) = exp( −||dg^i_v − x_c||^2 / (2 (dg^i_w)^2) ). Each radius parameter dg^i_w is set to ensure the node's influence overlaps with neighbouring nodes, dependent on the sampling sparsity of nodes, which we describe in detail in section (3.4). Since the warp function defines a rigid body transformation for all supported space, both position and any associated orientation of space is transformed, e.g., the vertex v_c from a surface with orientation or normal n_c is transformed into the live frame as (v_t, 1)^T = W_t(v_c)(v_c^T, 1)^T and (n_t, 0)^T = W_t(v_c)(n_c^T, 0)^T. We note that scaling of space can also be represented with this warp function, since compression and expansion of space are represented by neighbouring points moving in converging and diverging directions. Finally, we note that we can factor out any rigid body transformation common to all points in the volume, e.g., due to camera motion. We therefore introduce the explicit warped model to live camera transform, T_lw, and compose this onto the volumetric warp function;
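A minimal NumPy sketch of equation (1), assuming the per-node transforms dg_se3 have already been converted to unit dual quaternions (`node_dq`, an (n, 8) array) and that node positions `node_pos` (dg_v) and radii `node_sigma` (dg_w) are given as arrays. Quaternion sign consistency across blended nodes is assumed and the rigid factor T_lw is omitted; this is my own illustration, not the paper's code.

    import numpy as np

    def quat_mul(a, b):
        # Hamilton product of quaternions in (w, x, y, z) order.
        w1, x1, y1, z1 = a
        w2, x2, y2, z2 = b
        return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                         w1*x2 + x1*w2 + y1*z2 - z1*y2,
                         w1*y2 - x1*z2 + y1*w2 + z1*x2,
                         w1*z2 + x1*y2 - y1*x2 + z1*w2])

    def dq_from_rt(q_rot, t):
        # Unit dual quaternion (8-vector) from a unit rotation quaternion and translation.
        q_dual = 0.5 * quat_mul(np.concatenate([[0.0], t]), q_rot)
        return np.concatenate([q_rot, q_dual])

    def dq_to_rt(dq):
        # Recover the rotation matrix and translation from a unit dual quaternion.
        q_rot, q_dual = dq[:4], dq[4:]
        w, x, y, z = q_rot
        R = np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                      [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                      [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
        q_conj = np.array([w, -x, -y, -z])
        t = 2.0 * quat_mul(q_dual, q_conj)[1:]
        return R, t

    def warp(x_c, node_pos, node_dq, node_sigma, k=4):
        # W(x_c): blend the k-nearest deformation nodes with Gaussian weights
        # w_i(x_c) = exp(-||dg_v - x_c||^2 / (2 dg_w^2)) and apply the blended SE(3).
        d2 = np.sum((node_pos - x_c) ** 2, axis=1)
        nn = np.argsort(d2)[:k]                              # k-nearest nodes N(x_c)
        w = np.exp(-d2[nn] / (2.0 * node_sigma[nn] ** 2))
        blended = (w[:, None] * node_dq[nn]).sum(axis=0)
        blended /= np.linalg.norm(blended[:4])               # normalise by the real part
        R, t = dq_to_rt(blended)
        return R @ x_c + t

For a surface normal n_c the same blended transform would be applied without the translation, i.e. n_t = R @ n_c, matching the homogeneous forms above.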

Page 31: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Result


https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be

Page 32: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion Outline 1.  Dense Non-rigid Warp Field
–  How the dynamic scene is represented
2.  Dense Non-rigid Surface Fusion –  How live frames are fused
3.  Estimating the Warp-Field State –  How the warp field is estimated
4.  Extending the Warp-Field –  How the warp-field structure is incrementally extended

Page 33: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion Outline 1.  Dense Non-rigid Warp Field
–  How the dynamic scene is represented
2.  Dense Non-rigid Surface Fusion –  How live frames are fused
3.  Estimating the Warp-Field State –  How the warp field is estimated
4.  Extending the Warp-Field –  How the warp-field structure is incrementally extended

Page 34: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (1/5) •  The deformation between the live frame and the canonical model
•  The naive approach –  Assign a 6D transformation to every point of the volume –  With 256^3 voxels the number of parameters to estimate is 256^3 × 6 = 16,777,216 × 6 ≈ 100 million per frame (checked in the snippet below)
–  The parameter count explodes; this is infeasible.
•  Instead, introduce a representation based on sparse 6D deformation nodes and interpolation
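The arithmetic behind the parameter-count bullet above, as a quick runnable check:

    voxels = 256 ** 3          # 16,777,216 voxels at a 256^3 TSDF resolution
    params = 6 * voxels        # one 6-DoF transform per voxel
    print(f"{voxels:,} voxels -> {params:,} parameters per frame")
    # 16,777,216 voxels -> 100,663,296 parameters per frame (~1e8)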

Page 35: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (2/5) •  Sparse 6D deformation nodes
: Embedded Deformation Graph –  Represents the polygon mesh as a graph structure –  Achieves as-rigid-as-possible deformation
•  Interpolation of the deformation : Dual-Quaternion Blending –  The deformation at a point x_c is expressed as a weighted average over its k-nearest nodes –  Fast skinning with few artifacts

Page 36: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (3/5) •  Embedded Deformation Graph

–  A technique from the graphics field –  Achieves as-rigid-as-possible deformation –  Represents the polygon mesh as a graph; when a point moves, the connected points move along with it

R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. ACM Trans. Graph., 26(3):80, 2007.

is the zero vector, the deformed position g̃_j of node j is simply equal to g_j + t_j.

More interestingly, the deformation graph can be used to deform any geometric model defined by vertices in R^3. Since transferring the deformation to an embedded shape requires computation proportional to the shape's complexity, efficiency is of paramount importance. Consequentially, we employ an algorithm similar to the widely used and highly efficient skeleton-subspace deformation from character animation. The influence of individual graph nodes is smoothly blended so that the deformed position ṽ_i of each shape vertex v_i, i ∈ 1...n, is a weighted sum of its position after application of the deformation graph affine transformations:

ṽ_i = Σ_{j=1}^{m} w_j(v_i) [ R_j (v_i − g_j) + g_j + t_j ].    (2)

While linear blending may result in some artifacts for extremely coarse graphs, they are negligible for moderately dense ones like those shown in our examples. This result is not surprising, since only a few extra joint transformations are needed to greatly reduce artifacts exhibited by skeleton-subspace deformation [Weber 2000; Mohr and Gleicher 2003]. In our case, the nodes are evenly distributed over the entire shape so that the blended transformations are very similar to one another.

Normals are transformed similarly, according to the weighted sum of each normal transformed by the inverse transpose of the node transformations, and then renormalized:

ñ_i = Σ_{j=1}^{m} w_j(v_i) R_j^{-T} n_i.    (3)

The weights w_j(v_i), j ∈ 1...m, are spatially varying and thus depend on the vertex position. Due to the graph structure, transformations that are close to one another will be the most similar. Thus, for consistency and efficiency, we limit the influence of the deformation graph on a particular vertex to the k-nearest nodes. The weights for each vertex are precomputed according to

w_j(v_i) = (1 − ||v_i − g_j|| / d_max)^2    (4)

and then normalized to sum to one. Here, d_max is the distance to the k+1-nearest node. We use k = 4 for all examples, except the experiment in Figure 5 (d), where k = 8.

The layout of the deformation graph nodes should roughly conform to the shape of the model being edited. In our experiments, a uniform sampling of the model surface produces the best results. Such a sampling is easily accomplished by distributing points densely over the surface, and repeatedly removing all points within a given radius of a randomly chosen one until a desired sampling density is reached. For meshes, simplification algorithms also produce good results when the triangle aspect ratio is restricted to avoid long and skinny triangles. For particle simulations, a simple and efficient construction of the node layout can be achieved by sampling particle paths through time. The number of graph nodes determines the expressibility of the deformation graph. Coarse edits can be made with a small number of nodes, while highly detailed ones require denser sampling. We find that full body pose changes are effectively accomplished with 200 to 300 nodes.

Graph edges connect nodes of overlapping influence and are used to enforce consistency in the overall deformation. Once the node positions are determined, the connectivity is computed automatically by creating an edge between any two nodes that influence the same vertex. Thus, the graph structure depends on how it is evaluated.

[Figure panels: user edit; E_con; E_con + E_reg; E_con + E_reg + E_rot]

Figure 2: A simple deformation graph shows the effect of the three terms of the objective function. The quadrilaterals at each graph node illustrate the deformation induced by the corresponding affine transformation. Without the rotational term, unnatural shearing can occur, as shown in the bottom right. The transformation for node g_2 is applied to neighboring nodes g_1 and g_3, yielding the predicted positions shown on the bottom left as gray circles. The regularization term minimizes the squared distance between these predicted positions and their actual positions g̃_1 and g̃_3.

4 Optimization

Once the deformation graph has been specified, the user manipulates an embedded primitive by selecting vertices and moving them around. The vertices serve as positional constraints for an optimization problem in which the affine transformations of the deformation graph comprise the unknown variables. The objective function encodes detail preservation directly by specifying that the affine transformations should be rotations. Consequently, local features deform as rigidly as possible. A second energy term serves as a regularizer for the deformation by indicating that the affine transformations of adjacent graph nodes should agree with one another.

Rotation. In order for a 3×3 matrix R to represent a rotation in SO(3), it must satisfy six conditions: each of its three columns must be unit length, and all columns must be orthogonal to one another [Grassia 1998]. The squared deviation from these conditions is given by the function Rot(R):

Rot(R) = (c_1·c_2)^2 + (c_1·c_3)^2 + (c_2·c_3)^2 + (c_1·c_1 − 1)^2 + (c_2·c_2 − 1)^2 + (c_3·c_3 − 1)^2    (5)

where c_1, c_2, and c_3 are the 3×1 column vectors of R. This function is nonlinear in the matrix entries. The term E_rot sums the rotation error over all transformations of the deformation graph:

E_rot = Σ_{j=1}^{m} Rot(R_j).    (6)

Regularization. Conceptually, each of the affine transformations represents a localized deformation centered at a graph node. Since nearby transformations have overlapping influence, we must ensure that the computed transformations are consistent with respect to one another. We add a regularization term to the optimization inferred from the graph structure. If nodes j and k are neighbors, they affect a common subset of the embedded shape. The position of node k
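As a concrete reading of equations (2) and (4) above, a small NumPy sketch of the embedded-deformation blend (my own illustration of Sumner et al.'s formulas, not their code). `g` holds the node positions g_j, and `R`, `t` the per-node rotations and translations; at least k+1 nodes are assumed.

    import numpy as np

    def ed_weights(v, g, k=4):
        # w_j(v) = (1 - ||v - g_j|| / d_max)^2 over the k nearest nodes,
        # normalised to sum to one; d_max is the distance to the (k+1)-nearest node.
        d = np.linalg.norm(g - v, axis=1)
        order = np.argsort(d)
        nearest, d_max = order[:k], d[order[k]]
        w = (1.0 - d[nearest] / d_max) ** 2
        return nearest, w / w.sum()

    def ed_deform(v, g, R, t, k=4):
        # Deformed vertex: v~ = sum_j w_j(v) * [ R_j (v - g_j) + g_j + t_j ]   (eq. 2)
        nearest, w = ed_weights(v, g, k)
        v_def = np.zeros(3)
        for j, wj in zip(nearest, w):
            v_def += wj * (R[j] @ (v - g[j]) + g[j] + t[j])
        return v_def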

Page 37: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (4/5) •  Dual-Quaternion Blending

–  A technique from the graphics field –  Fast skinning with few artifacts –  Interpolates while preserving volume as much as possible –  Uses two quaternions to represent rotation and translation


q_kc : Dual-Quaternion (the unit dual quaternion of each node, q_kc ∈ R^8)

L. Kavan, S. Collins, J. Žára, and C. O'Sullivan. Skinning with Dual Quaternions. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, I3D '07, pages 39-46, New York, NY, USA, 2007. ACM.

Linear Blending vs. Dual Quaternion Blending (figures adapted from http://rodolphe-vaillant.fr/?e=29)

Page 38: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (4/5) •  Parameters of the sparse 6D transformation nodes:


–  dg_v : position in the canonical frame
–  dg_se3 : the transform between the canonical frame and the live frame
–  dg_w : controls the node's radius of influence


The DQB weights
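To make the blending concrete, here is a minimal NumPy sketch of the per-node Gaussian weight w_i(x_c) and of dual-quaternion blending followed by conversion back to an SE(3) matrix, roughly mirroring Eq. (1). The node layout (a list of dicts holding dg_v, dg_w and an 8-vector unit dual quaternion under the key "dq") and all helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def node_weight(x_c, dg_v, dg_w):
    # w_i(x_c) = exp(-||dg_v - x_c||^2 / (2 dg_w^2)): Gaussian radius of influence
    return np.exp(-np.sum((dg_v - x_c) ** 2) / (2.0 * dg_w ** 2))

def quat_mul(a, b):
    # Hamilton product of quaternions stored as (w, x, y, z)
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def dq_to_se3(dq):
    # Convert a (blended) dual quaternion (q_real | q_dual), 8 numbers, to a 4x4 rigid transform.
    q_r, q_d = dq[:4], dq[4:]
    s = np.linalg.norm(q_r)
    q_r, q_d = q_r / s, q_d / s                      # normalise the weighted blend
    w, x, y, z = q_r
    R = np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                  [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                  [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])
    # translation: vector part of 2 * q_d * conj(q_r)
    t = 2.0 * quat_mul(q_d, q_r * np.array([1.0, -1.0, -1.0, -1.0]))[1:]
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def dqb_warp(x_c, nodes, k=4):
    # SE3(DQB(x_c)): blend the k nearest nodes' unit dual quaternions with weights w_i(x_c).
    dists = [np.linalg.norm(n["dg_v"] - x_c) for n in nodes]
    blend = np.zeros(8)
    for i in np.argsort(dists)[:k]:
        blend += node_weight(x_c, nodes[i]["dg_v"], nodes[i]["dg_w"]) * nodes[i]["dq"]
    return dq_to_se3(blend)
```

In the paper the same blend is evaluated per voxel and per surface vertex; the brute-force k-nearest lookup above stands in for whatever spatial index a real implementation would use.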

Page 39: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-rigid Warp Field (5/5)

•  Warp-Field:

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 39

Our complete warp-field is then given as:

$\mathcal{W}_t(x_c) = T_{lw}\, SE3(DQB(x_c))$ .   (2)


T_lw : deformation of the whole scene due to the camera motion

SE3(·) : conversion from a dual quaternion to SE(3), the group of rigid motions of Euclidean space
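Continuing the sketch above, the full warp of Eq. (2) simply composes the global rigid transform T_lw onto the blended per-point transform; points are warped with homogeneous coordinate 1 and normals with 0. Again an illustrative sketch, with T_dqb standing for the output of a DQB evaluation such as the dqb_warp helper shown earlier.

```python
import numpy as np

def full_warp(T_lw, T_dqb):
    # W_t(x_c) = T_lw * SE3(DQB(x_c))  (Eq. (2)); T_dqb is the blended 4x4 transform.
    return T_lw @ T_dqb

def warp_point_normal(W, v_c, n_c):
    # Points carry homogeneous coordinate 1; normals carry 0 (directions are not translated).
    v_t = (W @ np.append(v_c, 1.0))[:3]
    n_t = (W @ np.append(n_c, 0.0))[:3]
    return v_t, n_t
```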

Page 40: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion Outline 1.  Dense Non-rigid Warp Field

–  How a dynamic scene is represented

2.  Dense Non-rigid Surface Fusion –  How live frames are fused into it

3.  Estimating the Warp-Field State –  How the warp field is estimated

4.  Extending the Warp-Field –  How the warp-field structure is extended incrementally

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 40

Page 41: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-Rigid Surface Fusion (1/3) •  Update the canonical model from the live frame's depth

–  Assumes the warp field has already been estimated

•  Update procedure: –  Use the warp field to deform the canonical model into the live frame

–  Fuse the resulting TSDF with the live frame's TSDF

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 41

Page 42: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-Rigid Surface Fusion (2/3)

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 42

3.2. Dense Non-Rigid Surface Fusion

Given the model-to-frame warp field $\mathcal{W}_t$, the canonical model geometry is updated as follows. The reconstruction in the canonical space S is represented by the sampled TSDF $\mathcal{V} : S \mapsto \mathbb{R}^2$ within a voxel domain $S \subset \mathbb{N}^3$; each voxel x, corresponding to the sampled point $x_c$, stores a tuple $\mathcal{V}(x) \mapsto [v(x) \in \mathbb{R}, w(x) \in \mathbb{R}]^\top$ holding the weighted average of all projective TSDF values observed for that point so far, together with the sum of the associated weights. The projective TSDF fusion of [6] is extended to non-rigidly deforming scenes: given the live depth image $D_t$, each voxel centre $x_c \in S$ is transformed into the live frame by its estimated warp, $(x_t^\top, 1)^\top = \mathcal{W}_t(x_c)(x_c^\top, 1)^\top$, and the warped centre is projected directly into the depth frame. This allows the TSDF for a point in the canonical frame to be updated by computing the projective TSDF in the deforming frame, without resampling a warped TSDF in the live frame. The projective signed distance at the warped canonical point is

$\mathrm{psdf}(x_c) = \left[ K^{-1} D_t(u_c) \left[ u_c^\top, 1 \right]^\top \right]_z - \left[ x_t \right]_z$ ,   (3)

where $u_c = \pi(K x_t)$ is the pixel into which the voxel centre projects, $[\,\cdot\,]_z$ denotes the z (optical-axis) component in the camera frame, K is the known 3×3 camera intrinsic matrix, and $\pi$ performs perspective projection. Each voxel x is then updated by TSDF fusion:

$\mathcal{V}(x)_t = \begin{cases} [v'(x), w'(x)]^\top, & \text{if } \mathrm{psdf}(dc(x)) > -\tau \\ \mathcal{V}(x)_{t-1}, & \text{otherwise} \end{cases}$   (4)

where $dc(\cdot)$ transforms a discrete voxel point into the continuous TSDF domain and $\tau > 0$ is the truncation distance. The updated TSDF value is given by the weighted averaging scheme of [5], with the weight truncation introduced in [18]:

$v'(x) = \dfrac{v(x)_{t-1}\, w(x)_{t-1} + \min(\rho, \tau)\, w(x)}{w(x)_{t-1} + w(x)}, \qquad \rho = \mathrm{psdf}(dc(x)),$

$w'(x) = \min\big(w(x)_{t-1} + w(x),\; w_{\max}\big)$ .   (5)

Unlike the static fusion scenario, where w(x) only encodes the uncertainty of the depth value observed at the projected pixel, the weight here also accounts for uncertainty in the warp function at $x_c$: the further $x_c$ is from an already mapped and observable surface region, the less certain its transformation, so the weight is scaled by the average distance to the k-nearest deformation nodes, $w(x) \propto \frac{1}{k} \sum_{i \in N(x_c)} \| dg^i_w - x_c \|^2$. This non-rigid fusion generalises the static reconstruction case used in KinectFusion, replacing the single rigid model-to-camera transform with a per-voxel warp into the live (non-rigid) frame (see Figure 3); it greatly simplifies non-rigid reconstruction compared with methods that explicitly warp all frames into a canonical frame, and, given a correct warp field, it maintains the optimality guarantees for surface reconstruction from noisy observations proved for the static case in [6].

Projective signed distance function of the live frame: Eq. (3)

Fusion of the TSDFs: Eq. (4)

Truncation: the min(ρ, τ) clamp in Eq. (5)
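To make Eqs. (3)-(5) concrete, below is a minimal NumPy sketch of one per-voxel update: the canonical voxel centre is assumed to have already been warped into the live frame as x_t, it is projected with the intrinsics K, the projective signed distance is compared against the truncation band, and the truncated value is blended into the running weighted average. The argument layout and the observation weight w_obs (which in the paper would also carry the node-distance uncertainty scaling described above) are simplifying assumptions for illustration, not the authors' code.

```python
import numpy as np

def fuse_voxel(v_prev, w_prev, x_t, depth, K, tau, w_obs, w_max=100.0):
    """One non-rigid projective TSDF update for a voxel whose canonical centre has
    already been warped to x_t in the live frame (sketch of Eqs. (3)-(5))."""
    u = K @ x_t
    px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))   # u_c = pi(K x_t)
    if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]):
        return v_prev, w_prev                                   # projects outside the frame
    d = depth[py, px]
    if d <= 0:
        return v_prev, w_prev                                   # no valid depth sample
    # psdf: z of the back-projected pixel minus z of the warped centre (Eq. (3))
    p_cam = d * (np.linalg.inv(K) @ np.array([px, py, 1.0]))
    psdf = p_cam[2] - x_t[2]
    if psdf <= -tau:
        return v_prev, w_prev                                   # behind the truncation band (Eq. (4))
    rho = min(psdf, tau)                                        # truncate the signed distance
    v_new = (v_prev * w_prev + rho * w_obs) / (w_prev + w_obs)  # weighted average (Eq. (5))
    w_new = min(w_prev + w_obs, w_max)
    return v_new, w_new
```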

Page 43: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Dense Non-Rigid Surface Fusion (3/3)

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 43

[Figure 3 of the paper] Non-rigid scene deformation (a–c) and introducing an occlusion (d–f): (a) live frame t = 0, (b) live frame t = 1, (c) canonical → live; (d) live frame t = 0, (e) live frame t = 1, (f) canonical → live. Each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame. In (a) the warp is initialised to the identity transform, so the three rays shown in the live frame also map as straight lines in the canonical frame; as the scene deforms (b), the warp transforms each canonical point to its live-frame location, bending the corresponding rays (c). This warp can be achieved with two 6D deformation nodes (shown as circles), the left node applying a clockwise twist. In (d) a new scene contains a cube about to occlude the bar; as the cube occludes part of the bar in the live frame (e), the canonical-frame points (f) are warped to correctly pass through the cube.


Page 44: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion Outline 1.  Dense Non-rigid Warp Field

–  How a dynamic scene is represented

2.  Dense Non-rigid Surface Fusion –  How live frames are fused into it

3.  Estimating the Warp-Field State –  How the warp field is estimated

4.  Extending the Warp-Field –  How the warp-field structure is extended incrementally

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 44

Page 45: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Estimating the Warp-Field State (1/5) •  Estimate dg_se3 from the depth map D_t

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 45

3.3. Estimating the Warp-field State $\mathcal{W}_t$

The current values of the transformations $dg_{se3}$ in $\mathcal{W}_t$ are estimated, given a newly observed depth map $D_t$ and the current reconstruction $\mathcal{V}$, by constructing an energy function that is minimised by the desired parameters:

$E(\mathcal{W}_t, \mathcal{V}, D_t, \mathcal{E}) = \mathrm{Data}(\mathcal{W}_t, \mathcal{V}, D_t) + \lambda\, \mathrm{Reg}(\mathcal{W}_t, \mathcal{E})$ .   (6)

The data term is a dense model-to-frame ICP cost $\mathrm{Data}(\mathcal{W}_t, \mathcal{V}, D_t)$, coupled with a regularisation term $\mathrm{Reg}(\mathcal{W}_t, \mathcal{E})$ that penalises non-smooth motion fields and ensures as-rigid-as-possible deformation between transformation nodes connected by the edge set $\mathcal{E}$. Coupling a data term formed from linearly blended transformations with a rigid-as-possible graph-based regularisation is a form of the embedded deformation graph model introduced in [25]; the parameter $\lambda$ trades off relaxing rigidity over the field when the data are of high quality against a smooth, consistent deformation of unobserved or noisily observed regions of space.

3.3.1 Dense Non-Rigid ICP Data-term

The aim is to estimate all non-rigid transformation parameters $T_{ic}$ and $T_{lw}$ that warp the canonical volume into the live frame, by densely aligning the current surface reconstruction, extracted from the canonical volume's zero level set, to the live frame's depth map. Surface prediction and data association: the current zero level set of the TSDF $\mathcal{V}$ is extracted by marching cubes and stored as a polygon mesh with point-normal pairs in the canonical frame, $\hat{\mathcal{V}}_c \equiv \{V_c, N_c\}$, which is non-rigidly transformed into the live frame using the current warp field $\mathcal{W}_t$, giving the warped point-normals $\hat{\mathcal{V}}_w$.

Estimate the parameters that minimise this energy.

Energy: data term (the dense ICP cost)

+ regularisation term (a penalty on non-smooth motion fields)

W_t : warp field; V : current surface reconstruction (polygon mesh); D_t : live-frame depth map; ε : edge set (node connectivity)
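As a sketch of how the total energy of Eq. (6) could be evaluated, the snippet below assumes the per-pixel point-plane residuals of the data term and the per-edge residuals of the regulariser have already been computed (see the later sketches). The Tukey and Huber penalties are the standard robust forms; the thresholds and λ are arbitrary illustrative values, not the paper's settings.

```python
import numpy as np

def tukey(r, c=0.01):
    # Tukey biweight penalty: saturates for |r| > c, robust to outlier correspondences.
    r = np.abs(np.asarray(r, dtype=float))
    out = np.full_like(r, c * c / 6.0)
    m = r <= c
    out[m] = (c * c / 6.0) * (1.0 - (1.0 - (r[m] / c) ** 2) ** 3)
    return out

def huber(r, delta=0.0001):
    # Huber penalty: quadratic near zero, linear in the tails (discontinuity preserving).
    r = np.abs(np.asarray(r, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def total_energy(data_residuals, reg_residuals, edge_weights, lam=1.0):
    # E = Data + lambda * Reg (Eq. (6)); data_residuals: per-pixel point-plane errors,
    # reg_residuals: per-edge 3-vectors, edge_weights: alpha_ij per edge.
    data = np.sum(tukey(data_residuals))
    edge_norms = np.linalg.norm(np.asarray(reg_residuals, dtype=float), axis=-1)
    reg = np.sum(np.asarray(edge_weights, dtype=float) * huber(edge_norms))
    return data + lam * reg
```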


Page 46: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Estimating the Warp-Field State (2/5) •  Extract the current surface

–  Extract the zero level set of the TSDF with marching cubes

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 46

https://ja.wikipedia.org/wiki/マーチングキューブ法
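One common way to realise the extraction step on this slide is scikit-image's marching cubes; the sketch below assumes a dense float TSDF array plus a weight array, and is only an illustration of the idea, not the authors' pipeline (the keyword arguments follow scikit-image's measure.marching_cubes API).

```python
import numpy as np
from skimage import measure

def extract_canonical_surface(tsdf, weights, voxel_size):
    # Zero level set of the canonical TSDF as a point-normal polygon mesh (V_c, N_c).
    verts, faces, normals, _ = measure.marching_cubes(
        tsdf, level=0.0,
        spacing=(voxel_size, voxel_size, voxel_size),  # vertices in metric canonical coords
        mask=weights > 0)                              # skip voxels that were never observed
    return verts, faces, normals
```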

Page 47: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Estimating the Warp-Field State (3/5) •  ICP Cost

–  Render the canonical model with the current T –  Compute the point-plane error against the live frame's depth map

第30回(後編) コンピュータビジョン勉強会@関東 CVPR2015読み会 47


An initial estimate for data association (correspondence) between the model geometry and the live frame is obtained by rendering the warped surface $\hat{\mathcal{V}}_w$ into the live frame, shaded with canonical-frame vertex positions, using a rasterising rendering pipeline. This yields a prediction of the canonical-frame geometry currently visible in the live frame, $P(\hat{\mathcal{V}}_c)$, stored as a pair of images $\{\mathbf{v}, \mathbf{n}\} : \Omega \mapsto P(\hat{\mathcal{V}}_c)$ over the pixel domain $\Omega$ of the predicted images, holding the rendered canonical-frame vertices and normals. Given optimal transformation parameters for the current frame, the predicted-to-be-visible geometry should transform close, modulo observation noise, to the live surface $\mathrm{vl} : \Omega \mapsto \mathbb{R}^3$ formed by back-projecting the depth image, $[\mathrm{vl}(u)^\top, 1]^\top = K^{-1} D_t(u)\, [u^\top, 1]^\top$. This is quantified by a per-pixel dense model-to-frame point-plane error, computed under the robust Tukey penalty $\psi_{data}$ and summed over the predicted image domain $\Omega$:

$\mathrm{Data}(\mathcal{W}, \mathcal{V}, D_t) \equiv \sum_{u \in \Omega} \psi_{data}\!\left( \hat{n}_u^\top (\hat{v}_u - \mathrm{vl}_{\tilde{u}}) \right)$ .   (7)

The transformed model vertex is given by $\tilde{T}_u = \mathcal{W}(\mathbf{v}(u))$, producing the canonical-to-live point-normal predictions $\hat{v}_u = \tilde{T}_u \mathbf{v}(u)$ and $\hat{n}_u = \tilde{T}_u \mathbf{n}(u)$; data association of that model point-normal with a live-frame point-normal is made through perspective projection into the pixel $\tilde{u} = \pi(K \hat{v}_u)$. Ignoring the negligible cost of rendering $\hat{\mathcal{V}}_w$, extracting, predicting and performing projective data association with the currently visible canonical geometry gives a data-term evaluation whose computational complexity is bounded by the number of pixels in the observation image; each summand depends only on a subset of the n transformations when computing $\mathcal{W}$, and the region over which each node has a numerically significant impact on the error function is compact, so in practice the cost is similar to a single rigid-body dense projective point-plane data-term evaluation (as used in KinectFusion).

3.3.2 Warp-field Regularization

The deformation must be estimated not only for currently visible surfaces but over all space within S, which enables reconstruction of new surface regions about to come into view. Nodes affecting canonical space with no currently observed surface have no associated data term, and in any case noise, missing data and insufficient geometric texture in the live frame (an analogue of the aperture problem in optical flow) make the optimisation of the transform parameters ill-posed. The motion of non-observed geometry is therefore constrained with a simpler model: it is assumed to deform in a piece-wise smooth way. A deformation-graph-based regularisation is defined between transformation nodes, where an edge between nodes i and j adds a rigid-as-possible term to the total error under the discontinuity-preserving Huber penalty $\psi_{reg}$, summed over all pair-wise connected nodes:

$\mathrm{Reg}(\mathcal{W}, \mathcal{E}) \equiv \sum_{i=0}^{n} \sum_{j \in \mathcal{E}(i)} \alpha_{ij}\, \psi_{reg}\!\left( T_{ic}\, dg^j_v - T_{jc}\, dg^j_v \right)$ ,   (8)

where $\mathcal{E}$ defines the regularisation graph topology and $\alpha_{ij}$ is the weight associated with the edge, set to $\alpha_{ij} = \max(dg^i_w, dg^j_w)$.
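A sketch of the as-rigid-as-possible edge residuals inside Eq. (8): for every edge (i, j) of the deformation graph, compare where node i's and node j's transforms send node j's canonical position, returning the α_ij = max(dg_w^i, dg_w^j) edge weights alongside. The node and edge containers are assumptions for illustration; the Huber penalty from the energy sketch above would be applied on top.

```python
import numpy as np

def arap_edge_residuals(nodes, edges):
    """Per-edge residuals T_ic dg_v^j - T_jc dg_v^j of Eq. (8).
    nodes: list of dicts with 'dg_v' (3,), 'dg_w' (float), 'T' (4x4 canonical-to-live).
    edges: iterable of (i, j) index pairs of the regularisation graph."""
    residuals, weights = [], []
    for i, j in edges:
        dgj = np.append(nodes[j]["dg_v"], 1.0)
        r = (nodes[i]["T"] @ dgj)[:3] - (nodes[j]["T"] @ dgj)[:3]
        residuals.append(r)
        weights.append(max(nodes[i]["dg_w"], nodes[j]["dg_w"]))   # alpha_ij
    return np.array(residuals), np.array(weights)
```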

Point-Plane Error
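A sketch of the per-pixel point-plane residual behind Eq. (7): warp each predicted-visible canonical point-normal into the live frame with its blended transform, associate it with a live-frame point by projecting into the depth map, and take the normal-weighted distance; the Tukey penalty from the energy sketch would then be applied to these residuals. The array conventions (one 4x4 transform per vertex, a dense depth image, intrinsics K) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def point_plane_residuals(verts_c, normals_c, warps, depth, K):
    """Point-plane errors n_u . (v_u - vl_u) for the predicted-visible canonical geometry.
    warps: one 4x4 live-frame transform per vertex (W_t evaluated at that vertex)."""
    res = []
    Kinv = np.linalg.inv(K)
    for v_c, n_c, W in zip(verts_c, normals_c, warps):
        v = (W @ np.append(v_c, 1.0))[:3]            # warped model vertex
        n = (W @ np.append(n_c, 0.0))[:3]            # warped model normal
        u = K @ v                                    # projective data association u = pi(K v)
        px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
        if not (0 <= py < depth.shape[0] and 0 <= px < depth.shape[1]) or depth[py, px] <= 0:
            continue                                 # no valid live-frame correspondence
        vl = depth[py, px] * (Kinv @ np.array([px, py, 1.0]))   # back-projected live point
        res.append(float(n @ (v - vl)))
    return np.array(res)
```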

[Figure 3 panels — left group: non-rigid scene deformation, (a) live frame t = 0, (b) live frame t = 1, (c) canonical → live; right group: introducing an occlusion, (d) live frame t = 0, (e) live frame t = 1, (f) canonical → live.]

Figure 3: An illustration of how each point in the canonical frame maps, through a correct warp field, onto a ray in the live camera frame when observing a deforming scene. In (a) the first view of a dynamic scene is observed. In the corresponding canonical frame, the warp is initialized to the identity transform and the three rays shown in the live frame also map as straight lines in the canonical frame. As the scene deforms in the live frame (b), the warp function transforms each point from the canonical frame into the corresponding live frame location, causing the corresponding rays to bend (c). Note that this warp can be achieved with two 6D deformation nodes (shown as circles), where the left node applies a clockwise twist. In (d) we show a new scene that includes a cube that is about to occlude the bar. In the live frame (e), as the cube occludes a portion of the bar, the points in the canonical frame (f) are warped to correctly pass through the cube.

We obtain an initial estimate for data-association (correspondence) between the model geometry and the live frame by rendering the warped surface \hat{V}_w into the live frame, shaded with canonical frame vertex positions, using a rasterizing rendering pipeline. This results in a prediction of the canonical frame's geometry that is currently predicted to be visible in the live frame: P(\hat{V}_c). We store this prediction as a pair of images \{v, n\} : \Omega \mapsto P(\hat{V}_c), where \Omega is the pixel domain of the predicted images, storing the rendered canonical frame vertices and normals.

Given optimal transformation parameters for the current time frame, the predicted-to-be-visible geometry should transform close, modulo observation noise, to the live surface v_l : \Omega \mapsto \mathbb{R}^3, formed by back projection of the depth image: [v_l(u)^\top, 1]^\top = K^{-1} D_t(u) [u^\top, 1]^\top. This can be quantified by a per pixel dense model-to-frame point-plane error, which we compute under the robust Tukey penalty function \psi_{data}, summed over the predicted image domain \Omega:

Data(\mathcal{W}, \mathcal{V}, D_t) \equiv \sum_{u \in \Omega} \psi_{data}\!\left( \hat{n}_u^\top (\hat{v}_u - v_{l,u}) \right).   (7)

The per-pixel warp of the model vertex v(u) is simply \tilde{T}_u = \mathcal{W}(v(u)), producing the current canonical-to-live-frame point-normal predictions \hat{v}_u = \tilde{T}_u v(u) and \hat{n}_u = \tilde{T}_u n(u), and data-association of that model point-normal is made with a live frame point-normal through perspective projection into the pixel \tilde{u} = \pi(K \hat{v}_u).
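As a concrete aside (not from the paper's code), the live vertex map v_l above is just the back-projection K^{-1} D_t(u) [u^\top, 1]^\top evaluated at every pixel. A minimal NumPy sketch follows; the intrinsics values are placeholders.

import numpy as np

def backproject_depth(depth, K):
    # Back-project a depth map D_t into an HxWx3 vertex map:
    # [v_l(u)^T, 1]^T = K^-1 * D_t(u) * [u^T, 1]^T (the z component equals the depth).
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)
    rays = pix @ np.linalg.inv(K).T          # K^-1 [u, v, 1] per pixel
    return rays * depth[..., None]           # scale each ray by its measured depth

K = np.array([[525.0,   0.0, 319.5],         # placeholder Kinect-like intrinsics
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])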


Ω : the pixel domain of the predicted images


ψ_data : the robust Tukey penalty function
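Putting the pieces of Eq. (7) together, here is a minimal NumPy sketch of the dense point-plane data term. It assumes the warped vertex and normal prediction maps are already rendered into the live frame and that the live vertex map comes from a back-projection such as the backproject_depth sketch above; the function names, the Tukey scale constant, and the validity-mask handling are illustrative choices, not the paper's implementation.

import numpy as np

def tukey(r, c=0.01):
    # Tukey biweight penalty; c is an assumed scale in metres.
    r = np.abs(r)
    rho = np.full_like(r, c * c / 6.0)                      # constant cost beyond the cutoff
    inside = r <= c
    rho[inside] = (c * c / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return rho

def data_term(pred_v, pred_n, live_v, valid):
    # Eq. (7): sum_u psi_data( n_u^T (v_u - v_l(u)) ) over the predicted pixel domain.
    # pred_v, pred_n : HxWx3 warped model vertices / normals rendered into the live frame
    # live_v         : HxWx3 back-projected live depth map
    # valid          : HxW boolean mask where both a prediction and a depth sample exist
    residual = np.sum(pred_n * (pred_v - live_v), axis=-1)  # per-pixel point-plane distance
    return np.sum(tukey(residual)[valid])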

Page 48: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Estimating the Warp-Field State (4/5)
•  Handles the parts that are not visible from the live frame
•  Estimated via the Deformation Graph:
–  Using the node connectivity, regularize so that the field as a whole deforms as-rigidly-as-possible (a numeric sketch of this regulariser follows the symbol notes below)


ε(i) : the set of edges connected to the i-th node


ψ_reg : the discontinuity-preserving Huber penalty
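And a matching sketch of the regulariser in Eq. (8), assuming each node i stores a 4x4 transform T[i], a position dg_v[i] and a radius weight dg_w[i]; applying the Huber penalty to the norm of the 3-vector residual, and the threshold value, are illustrative choices rather than the paper's exact formulation.

import numpy as np

def huber(r, delta=0.01):
    # Huber penalty on the norm of a 3-vector residual; delta is an assumed threshold.
    n = np.linalg.norm(r)
    return 0.5 * n * n if n <= delta else delta * (n - 0.5 * delta)

def reg_term(T, dg_v, dg_w, edges):
    # Eq. (8): sum over graph edges (i, j) of alpha_ij * psi_reg( T_i dg_j^v - T_j dg_j^v ).
    total = 0.0
    for i, j in edges:
        p = np.append(dg_v[j], 1.0)                    # node j's position, homogeneous
        diff = (T[i] @ p - T[j] @ p)[:3]               # disagreement of the two node transforms at dg_j^v
        total += max(dg_w[i], dg_w[j]) * huber(diff)   # alpha_ij = max(dg_i^w, dg_j^w)
    return total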


Page 49: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Estimating the Warp-Field State (5/5)
•  Hierarchical Deformation Tree:
–  Turns the Deformation Graph into a pyramid
–  Enables coarse-to-fine optimization
•  Embedded Deformation Graph (original formulation)
–  Defines the range of regularization influence by k-nearest neighbors
•  Compared to that, the hierarchy has advantages in computational cost and stability (see the sketch below)
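The slide does not spell out how the tree is built, so the following is only one plausible construction under my own assumptions: greedily decimate the level-0 warp nodes with a growing support radius; the base radius and growth factor are assumed values, not the paper's.

import numpy as np

def decimate_nodes(positions, radius):
    # Greedy subsampling: keep a node only if it lies farther than `radius`
    # from every node kept so far.
    kept = []
    for idx, p in enumerate(positions):
        if all(np.linalg.norm(p - positions[k]) > radius for k in kept):
            kept.append(idx)
    return kept

def build_hierarchy(level0_positions, base_radius=0.025, growth=4.0, levels=3):
    # Level 0 = the dense warp nodes; every coarser level is a decimated copy
    # whose nodes cover a larger radius, giving the coarse-to-fine structure.
    hierarchy = [np.asarray(level0_positions, dtype=float)]
    radius = base_radius
    for _ in range(1, levels):
        radius *= growth
        hierarchy.append(hierarchy[-1][decimate_nodes(hierarchy[-1], radius)])
    return hierarchy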

Page 50: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Efficient Optimization
•  First, estimate the camera pose (Tlw) with the same procedure as KinectFusion
•  Next, minimize E to estimate the warp field of each deformation node
•  Solved with the Gauss-Newton method (see the sketch below)
–  Gauss-Newton: a Newton method specialized for minimizing sums of squares
–  Approximates the Hessian (second derivatives) with the Jacobian (first derivatives), i.e. JᵀJ
–  Solved efficiently using a Cholesky decomposition
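A generic sketch of the Gauss-Newton step described above (not the paper's GPU solver): the Hessian is approximated by JᵀJ and the normal equations are solved with a Cholesky factorisation; the damping term and the convergence test are my own additions to keep the toy solver stable.

import numpy as np

def gauss_newton(residuals, jacobian, theta0, iters=10, damping=1e-6):
    # Minimise 0.5 * ||r(theta)||^2 by repeatedly solving (J^T J) delta = -J^T r.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r = residuals(theta)                              # stacked data + regularisation residuals
        J = jacobian(theta)
        JTJ = J.T @ J + damping * np.eye(theta.size)      # Gauss-Newton Hessian approximation
        L = np.linalg.cholesky(JTJ)                       # JTJ = L L^T
        y = np.linalg.solve(L, -J.T @ r)                  # forward substitution
        delta = np.linalg.solve(L.T, y)                   # back substitution
        theta = theta + delta
        if np.linalg.norm(delta) < 1e-8:
            break
    return theta

The real system presumably exploits the sparsity of JᵀJ that follows from each node's compact support; the dense toy solve above ignores that.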

our complete warp-field is then given as:

\mathcal{W}_t(x_c) = T_{lw}\, SE3(\mathbf{DQB}(x_c)).   (2)
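To make Eq. (2) concrete, here is a small NumPy sketch of evaluating the warp at a point by dual-quaternion blending (DQB) of nearby nodes and then applying T_lw. The Gaussian weight based on the node radius dg_w, and the plain normalisation of the blended dual quaternion, are my reading of the method rather than the paper's exact code.

import numpy as np

def quat_mul(a, b):
    # Hamilton product of quaternions stored as (w, x, y, z).
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([[1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                     [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def dqb_point(x, nodes):
    # Blend each node's unit dual quaternion (q_r, q_d), with q_d = 0.5 * (0, t) * q_r,
    # using a Gaussian weight from its position dg_v and radius dg_w, then normalise
    # and apply the resulting SE(3) transform to the point x.
    qr_sum, qd_sum = np.zeros(4), np.zeros(4)
    for q_r, t, dg_v, dg_w in nodes:
        w = np.exp(-np.sum((dg_v - x) ** 2) / (2.0 * dg_w ** 2))
        qr_sum += w * q_r
        qd_sum += w * 0.5 * quat_mul(np.array([0.0, *t]), q_r)
    qr = qr_sum / np.linalg.norm(qr_sum)
    qd = qd_sum / np.linalg.norm(qr_sum)
    R = quat_to_rot(qr)
    t = 2.0 * quat_mul(qd, qr * np.array([1.0, -1.0, -1.0, -1.0]))[1:]  # translation from the dual part
    return R @ x + t

def warp_point(x_c, nodes, T_lw):
    # Eq. (2): W_t(x_c) = T_lw * SE3(DQB(x_c)), applied to the canonical point x_c.
    return (T_lw @ np.append(dqb_point(x_c, nodes), 1.0))[:3]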

3.2. Dense Non-Rigid Surface Fusion

We now describe how, given the model-to-frame warp field \mathcal{W}_t, we update our canonical model geometry. Our reconstruction into the canonical space S is represented by the sampled TSDF \mathcal{V} : S \mapsto \mathbb{R}^2 within a voxel domain S \subset \mathbb{N}^3. The sampled function holds, for each voxel x \in S corresponding to the sampled point x_c, a tuple \mathcal{V}(x) \mapsto [v(x) \in \mathbb{R}, w(x) \in \mathbb{R}]^\top holding a weighted average of all projective TSDF values observed for that point so far, v(x), together with the sum of all associated weights, w(x).

We extend the projective TSDF fusion approach originally introduced by [6] to operate over non-rigidly deforming scenes. Given the live depth image D_t, we transform each voxel center x_c \in S by its estimated warp into the live frame, (x_t^\top, 1)^\top = \mathcal{W}_t(x_c)(x_c^\top, 1)^\top, and carry through the TSDF surface fusion operation by directly projecting the warped center into the depth frame. This allows the TSDF for a point in the canonical frame to be updated by computing the projective TSDF in the deforming frame without having to resample a warped TSDF in the live frame. The projective signed distance at the warped canonical point is:

psdf(x_c) = \left[ K^{-1} D_t(u_c) [u_c^\top, 1]^\top \right]_z - [x_t]_z,   (3)

where u_c = \pi(K x_t) is the pixel into which the voxel center projects. We compute distance along the optical (z) axis of the camera frame using the z component denoted [\,\cdot\,]_z. K is the known 3 \times 3 camera intrinsic matrix, and \pi performs perspective projection. For each voxel x, we update the TSDF to incorporate the projective SDF observed in the warped frame using TSDF fusion:

\mathcal{V}(x)_t = \begin{cases} [v'(x), w'(x)]^\top, & \text{if } psdf(dc(x)) > -\tau \\ \mathcal{V}(x)_{t-1}, & \text{otherwise} \end{cases}   (4)

where dc(\cdot) transforms a discrete voxel point into the continuous TSDF domain. The truncation distance is \tau > 0, and the updated TSDF value is given by the weighted averaging scheme [5], with the weight truncation introduced in [18]:

v'(x) = \frac{ v(x)_{t-1}\, w(x)_{t-1} + \min(\rho, \tau)\, w(x) }{ w(x)_{t-1} + w(x) }, \quad \rho = psdf(dc(x)), \quad w'(x) = \min\big( w(x)_{t-1} + w(x),\, w_{max} \big).   (5)

Unlike the static fusion scenario, where the weight w(x) encodes the uncertainty of the depth value observed at the projected pixel in the depth frame, we also account for uncertainty associated with the warp function at x_c. In the case of the single rigid transformation in original TSDF fusion, we are certain that observed surface regions, free space, and unobserved regions transform equivalently. In our non-rigid case, the further away the point x_c is from an already mapped and observable surface region, the less certain we can be about its transformation. We use the average distance from x_c to its k-nearest deformation nodes as a proxy for this increase in uncertainty and scale: w(x) \propto \frac{1}{k} \sum_{i \in N(x_c)} \| \mathrm{dg}_i^v - x_c \|_2. We note that our non-rigid fusion generalises the static reconstruction case used in KinectFusion, replacing the single (rigid) model-to-camera transform with a per voxel warp that transforms the associated space into the live (non-rigid) frame (see Figure 3). This technique greatly simplifies the non-rigid reconstruction process over methods where all frames are explicitly warped into a canonical frame. Furthermore, given a correct warp field, then, since all TSDF updates are computed using distances in the camera frame, the non-rigid projective TSDF fusion approach maintains the optimality guarantees for surface reconstruction from noisy observations originally proved for the static reconstruction case in [6].
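A per-voxel sketch of the non-rigid projective TSDF update in Eqs. (3)–(5), assuming the voxel centre has already been warped into the live camera frame as x_t; the pinhole projection, the bounds handling, and the fixed observation weight stand in for the uncertainty-scaled weight described above.

import numpy as np

def fuse_voxel(v_prev, w_prev, x_t, depth, K, tau=0.01, w_obs=1.0, w_max=100.0):
    # Eq. (3): project the warped voxel centre, read the depth, and form
    # psdf = [K^-1 D_t(u) [u, 1]]_z - [x_t]_z, which reduces to D_t(u) - x_t.z.
    u = K @ x_t
    col, row = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
    if not (0 <= row < depth.shape[0] and 0 <= col < depth.shape[1]) or depth[row, col] <= 0:
        return v_prev, w_prev                     # no valid depth sample for this voxel
    psdf = depth[row, col] - x_t[2]
    if psdf <= -tau:                              # Eq. (4): far behind the surface -> keep old value
        return v_prev, w_prev
    rho = min(psdf, tau)                          # truncate in front of the surface
    v_new = (v_prev * w_prev + rho * w_obs) / (w_prev + w_obs)   # Eq. (5): running weighted average
    w_new = min(w_prev + w_obs, w_max)
    return v_new, w_new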

3.3. Estimating the Warp-field State \mathcal{W}_t

We estimate the current values of the transformations dg_{se3} in \mathcal{W}_t, given a newly observed depth map D_t and the current reconstruction \mathcal{V}, by constructing an energy function that is minimised by our desired parameters:

E(\mathcal{W}_t, \mathcal{V}, D_t, \mathcal{E}) = Data(\mathcal{W}_t, \mathcal{V}, D_t) + \lambda\, Reg(\mathcal{W}_t, \mathcal{E}).   (6)

Our data term consists of a dense model-to-frame ICP cost Data(\mathcal{W}_t, \mathcal{V}, D_t), which is coupled with a regularisation term Reg(\mathcal{W}_t, \mathcal{E}) that penalises non-smooth motion fields and ensures as-rigid-as-possible deformation between transformation nodes connected by the edge set \mathcal{E}. The coupling of a data term formed from linearly blended transformations with a rigid-as-possible graph based regularisation is a form of the embedded deformation graph model introduced in [25]. The regularisation parameter \lambda enables a trade-off between relaxing rigidity over the field when given high quality data, and ensuring a smooth consistent deformation of non- or noisily-observed regions of space. We define these terms in the next subsections.

3.3.1 Dense Non-Rigid ICP Data-term

Our aim is to estimate all non-rigid transformation parameters T_{ic} and T_{lw} that warp the canonical volume into the live frame. We achieve this by performing a dense non-rigid alignment of the current surface reconstruction, extracted from the canonical volume's zero level set, into the live frame's depth map.

Surface Prediction and Data-Association: The current zero level set of the TSDF \mathcal{V} is extracted by marching cubes and stored as a polygon mesh with point-normal pairs in the canonical frame: \hat{V}_c \equiv \{V_c, N_c\}. We non-rigidly transform this mesh into the live frame using the current warp field \mathcal{W}_t, resulting in the warped point-normals \hat{V}_w.

Energy function : see Eq. (6) above

Page 51: 30th コンピュータビジョン勉強会@関東 DynamicFusion

DynamicFusion Outline
1.  Dense Non-rigid Warp Field
–  How a dynamic scene is represented
2.  Dense Non-rigid Surface Fusion
–  How live frames are fused
3.  Estimating the Warp-Field State
–  How the warp field is estimated
4.  Extending the Warp-Field
–  How the warp-field structure is incrementally extended

Page 52: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Extending the Warp-Field
•  Inserting New Deformation Nodes (see the sketch below):
–  After surface fusion of the live frame, the warp nodes are adjusted
–  Scan the polygon mesh; if a vertex is found that is not within the support of any current node, add a node there
–  The initial warp parameters of an added node are computed with DQB
•  Updating the Regularization Graph:
–  Rebuild the Hierarchical Deformation Tree
–  Level 0 = the warp nodes
–  Next level: scan the regularization nodes and decimate them; their warps are recomputed with DQB
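A sketch of the node-insertion pass, assuming nodes are stored as parallel lists of positions and support radii; the coverage test against the node radius and the default radius for new nodes are illustrative, and the new node's warp would then be initialised by DQB of its neighbours as in the earlier warp-field sketch.

import numpy as np

def insert_new_nodes(mesh_vertices, node_pos, node_radius, default_radius=0.025):
    # Scan the fused mesh; wherever a vertex is not covered by any existing node's
    # support radius, add a new node at that vertex. The new node's transform is
    # then initialised from the current warp field via DQB (not repeated here).
    node_pos, node_radius = list(node_pos), list(node_radius)
    added = []
    for v in mesh_vertices:
        v = np.asarray(v, dtype=float)
        covered = any(np.linalg.norm(v - p) <= r for p, r in zip(node_pos, node_radius))
        if not covered:
            node_pos.append(v)
            node_radius.append(default_radius)    # assumed fixed support radius for new nodes
            added.append(v)
    return node_pos, node_radius, added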

Page 53: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Result


https://www.youtube.com/watch?v=i1eZekcc_lM&feature=youtu.be

Page 54: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Limitation
•  Quickly moving from a closed to an open topology
–  When the canonical model has a closed topology, opening it tends to make tracking fail
–  Edges of the regularization graph cannot be split
–  Conversely, going from open to closed can be represented, although the graph edges still do not become connected
•  Fast motion
•  Motion of regions not visible from the live frame

Page 55: 30th コンピュータビジョン勉強会@関東 DynamicFusion

Conclusion
•  DynamicFusion
–  SLAM in dynamic scenes
–  Real-time 3D reconstruction
•  Contributions
–  Non-rigid volumetric TSDF fusion
–  Real-time estimation of a dense 6D warp field