28th CV勉強会@関東 #3



TRANSCRIPT

Page 1: 28th CV勉強会@関東 #3

28th CV勉強会@関東 — コンピュータビジョン最先端ガイド5

Multi-View Stereo #3

Mar. 28, 2015 Hiroki Mizuno

1

Page 2: 28th CV勉強会@関東 #3

What this talk covers
•  3D reconstruction from multiple images

1.  Introduction
2.  System overview and configuration notes
    1.  Image collection
    2.  Camera parameter estimation
    3.  Dense shape reconstruction
        1.  State-of-the-art MVS research examples
3.  Multi-View Stereo
    1.  Photometric consistency
    2.  Depth-map reconstruction
    3.  Mesh reconstruction from depth maps
    4.  Reconstruction results
4.  Conclusion

2

Page 3: 28th CV勉強会@関東 #3

Dense shape reconstruction
•  The output of Structure from Motion (SfM):
  – camera parameters
  – a 3D point cloud of the feature correspondences between cameras
  = a sparse shape reconstruction

3

SfM

Image borrowed from the Bundler examples

Page 4: 28th CV勉強会@関東 #3

Dense shape reconstruction

4

Page 5: 28th CV勉強会@関東 #3

Dense shape reconstruction

http://www.di.ens.fr/pmvs/gallery.html

5

Page 6: 28th CV勉強会@関東 #3

Dense shape reconstruction

6

http://www.di.ens.fr/pmvs/gallery.html

Page 7: 28th CV勉強会@関東 #3

Multi-View Stereo
•  What is Multi-View Stereo?
  – High-accuracy, high-density 3D shape reconstruction from calibrated multi-view images

•  Strengths
  – high resolution
  – capture speed
  – cost

•  Weaknesses
  – requires texture
  – real-time reconstruction is difficult

7

Page 8: 28th CV勉強会@関東 #3

Survey of Multi-View Stereo algorithms
•  "A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms"
  –  CVPR 2006
  –  Authors:
     •  Steven M. Seitz  •  Brian Curless  •  James Diebel  •  Daniel Scharstein  •  Richard Szeliski

•  A taxonomy of the various algorithms
•  A benchmark and ranking of reconstruction performance

8

Page 9: 28th CV勉強会@関東 #3

Multi-View Stereo taxonomy •  Scene representation •  Photo-consistency measure •  Visibility model •  Shape prior •  Reconstruction algorithm •  Initialization requirements

9

Page 10: 28th CV勉強会@関東 #3

Scene representation •  How the 3D space being reconstructed is represented

– Volume – Polygon Mesh – Set of depth maps

10

Page 11: 28th CV勉強会@関東 #3

Scene representation •  Volume

– 3D space partitioned into a regular grid
  •  Voxel
  •  Level set

Voxel
• each grid cell stores a binary object occupancy

Level set
• each grid cell stores the distance to the nearest surface

(a minimal sketch of both representations follows below)

11
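Not from the slides: a minimal NumPy sketch of the two volumetric representations, filling an occupancy grid and a signed-distance (level-set) grid for the same simple shape. The grid size, extent, and sphere are made up purely for illustration.

```python
import numpy as np

# Hypothetical setup: a 64^3 grid over the cube [-1, 1]^3 and a sphere of
# radius 0.5 centered at the origin as the "object".
n = 64
xs = np.linspace(-1.0, 1.0, n)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
dist_to_center = np.sqrt(X**2 + Y**2 + Z**2)

# Voxel representation: each grid cell stores a binary occupancy value.
occupancy = dist_to_center <= 0.5                 # bool array, shape (64, 64, 64)

# Level-set representation: each grid cell stores the signed distance to the
# nearest surface point (negative inside, positive outside); the surface is
# recovered as the zero level set.
signed_distance = dist_to_center - 0.5

print(occupancy.sum(), signed_distance.min(), signed_distance.max())
```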

Page 12: 28th CV勉強会@関東 #3

Scene representation •  Polygon Mesh

– A set of vertices and the faces that connect them

12

Page 13: 28th CV勉強会@関東 #3

Scene representation •  Set of depth maps

– A per-pixel depth value for each camera (see the back-projection sketch below)

13
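Not from the slides: a small sketch of how a depth map plus camera parameters encodes geometry, by back-projecting every pixel to a 3D point in world coordinates. The helper name is hypothetical, and the world-to-camera convention x_cam = R·x_world + t is an assumption; adapt it to your calibration.

```python
import numpy as np

def depth_map_to_points(depth, K, R, t):
    """Back-project a per-pixel depth map into world-space 3D points.

    depth : (H, W) array of depths along the optical axis, z in camera coords (0 = no data)
    K     : (3, 3) camera intrinsics
    R, t  : world-to-camera pose, i.e. x_cam = R @ x_world + t
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                      # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x (H*W)

    rays = np.linalg.inv(K) @ pix                  # rays with unit z in camera coords
    pts_cam = rays * depth.reshape(1, -1)          # scale each ray by its z-depth

    pts_world = R.T @ (pts_cam - t.reshape(3, 1))  # camera -> world
    valid = depth.reshape(-1) > 0
    return pts_world.T[valid]                      # N x 3 array of 3D points
```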

Page 14: 28th CV勉強会@関東 #3

Photo-consistency measure
•  How the agreement in "appearance" between images is computed
  – Scene space
  – Image space

•  Reflectance model
  – Most algorithms assume a Lambertian model
    • appearance does not depend on the viewpoint
    • shading depends only on the light source and the surface orientation
  – Some more recent algorithms (as of 2006) also handle BRDFs and the like

14

Page 15: 28th CV勉強会@関東 #3

Photo-consistency measure •  Scene space

– Project a point in the scene into each camera and compute photo-consistency there

– Photo-consistency is commonly computed with SSD or NCC (see the sketch below)

15
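Not from the slides: a minimal sketch of the two patch scores. In scene-space matching, a 3D point (or small oriented patch) is projected into two or more views, the local image patches around the projections are sampled, and a score such as NCC or SSD decides whether that 3D hypothesis is photo-consistent.

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation of two same-sized patches; 1.0 = identical appearance."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def ssd(patch_a, patch_b):
    """Sum of squared differences; lower means more photo-consistent."""
    diff = patch_a.astype(np.float64) - patch_b.astype(np.float64)
    return float(np.sum(diff * diff))
```

NCC is the more common choice in MVS because the per-patch normalization gives some invariance to brightness and contrast differences between views.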

[Slide shows Figure 11.3 and the surrounding rectification section from p. 538 of R. Szeliski, "Computer Vision: Algorithms and Applications" (Sept. 3, 2010 draft).

Figure 11.3, Epipolar geometry: (a) epipolar line segment corresponding to one ray; (b) corresponding set of epipolar lines and their epipolar plane.]

Page 16: 28th CV勉強会@関東 #3

Photo-consistency measure •  Image space

– Project the camera images into the scene and compute photo-consistency there (illustrated below with plane sweep)

16

[Slide shows Figure 11.6 and the surrounding plane-sweep discussion from p. 541 (Section 11.1, Epipolar geometry) of R. Szeliski, "Computer Vision: Algorithms and Applications".

Figure 11.6, Sweeping a set of planes through a scene (Szeliski and Golland 1999): (a) the set of planes seen from a virtual camera induces a set of homographies in any other source (input) camera image; (b) the warped images from all the other cameras can be stacked into a generalized disparity space volume I(x, y, d, k) indexed by pixel location (x, y), disparity d, and camera k.

The accompanying text notes that sweeping the disparity d through a series of hypotheses maps each input image k into the virtual camera $\tilde{P}$ through a family of homographies

$$\mathbf{x}_k \sim \tilde{P}_k \tilde{P}^{-1} \mathbf{x}_s = \tilde{H}_k \mathbf{x} + \mathbf{t}_k d = \bigl(\tilde{H}_k + \mathbf{t}_k [\,0\;\;0\;\;d\,]\bigr)\,\mathbf{x}, \qquad (11.3)$$

where $\mathbf{x}_k$ and $\mathbf{x}$ are homogeneous pixel coordinates in the source and virtual (reference) images; the members of the family $\tilde{H}_k(d) = \tilde{H}_k + \mathbf{t}_k [\,0\;\;0\;\;d\,]$, parameterized by the addition of a rank-1 matrix, are related to each other through a planar homology (Hartley and Zisserman 2004, A5.2).]

Page 17: 28th CV勉強会@関東 #3

Visibility model
•  How each camera decides what is visible and what is not
  –  i.e., the occlusion problem

•  Geometric
  –  tackles visibility head-on
  –  since this is fundamentally a chicken-and-egg problem, it is handled by, e.g., constraining the camera layout

•  Quasi-geometric
  –  uses approximate geometry
  –  e.g., build a rough reconstruction such as a visual hull first, then compute photo-consistency

•  Outlier-based
  –  treats occluded views as outliers and ignores them
  –  the approach described under "photo-consistency from multiple images" also falls into this category

17

Page 18: 28th CV勉強会@関東 #3

Shape prior
•  A prior model of the shape
  –  photo-consistency alone fails, especially in regions without texture

•  Minimal surface
  –  assumes the surface is smooth
  –  struggles with high-curvature regions
  –  level-set and mesh-based algorithms

•  Maximal surface
  –  space-carving style approaches
  –  stop as soon as a photo-consistent solution is found
  –  can represent high curvature
  –  reconstructions tend to come out larger overall
  –  voxel coloring, space carving

•  Image-based
  –  depths of neighboring pixels are smooth
  –  2D Markov Random Field (see the energy sketch below)

18
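Not from the slides, but the image-based prior is usually written as a 2D MRF energy over one view's depth map: a per-pixel photo-consistency data term plus a pairwise smoothness term between neighboring pixels. The exact form below is a generic illustration, not the chapter's notation.

```latex
E(d) \;=\; \sum_{p} C\!\left(p, d_p\right)
      \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V\!\left(d_p, d_q\right),
\qquad \text{e.g. } V(d_p, d_q) = \min\!\left(|d_p - d_q|,\, \tau\right)
```

Here $C(p, d_p)$ is the matching cost (low when the views are photo-consistent at depth $d_p$), $\mathcal{N}$ is the 4-neighborhood, and the truncated penalty keeps depth discontinuities affordable.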

Page 19: 28th CV勉強会@関東 #3

Reconstruction algorithm •  3D Volume

– compute a cost function at every cell of the volume
  – then extract the surface
  – voxel coloring, volumetric MRF

•  Evolving surface
  – approaches that progressively evolve a surface toward the solution
  – level sets, space carving

•  Depth map
  – compute multiple depth maps independently, then merge them

•  Feature points
  – reconstruct a sparse set of points first, then interpolate between them

19

Page 20: 28th CV勉強会@関東 #3

Initialization requirements
•  What an algorithm needs to get started

  – A rough bounding box or volume
    •  space carving
    •  level sets (need a high-quality initial estimate)

  – A foreground/background segmentation
    •  silhouette-based methods

  – A range of disparity or depth values
    •  image-space algorithms

20

Page 21: 28th CV勉強会@関東 #3

Benchmark Datasets

21

[Slide shows Figures 1 and 2 from Seitz et al. (CVPR 2006) together with the paper's Section 3, "Multi-view data sets".

Figure 1: Multi-view datasets (temple, dino, bird, dogs) with laser-scanned 3D models.

Figure 2: The 317 camera positions and orientations for the temple dataset; the gaps are due to shadows. The 47 cameras of the ring dataset are shown in blue and red, and the 16 sparse-ring cameras in red only.

The section describes how the images were captured on the Stanford spherical gantry with a 640 × 480 CCD camera (roughly 0.25 mm per pixel on the object), calibrated from 68 views of a planar grid, and how reference models were obtained with a Cyberware Model 15 laser stripe scanner (single-scan resolution 0.25 mm, accuracy 0.05–0.2 mm).]


Camera configuration: 47 viewpoints, camera resolution 640 × 480

Temple: 10 × 16 × 8 cm

Dino: 7 × 9 × 7 cm

Page 22: 28th CV勉強会@関東 #3

Benchmark Results

22

http://vision.middlebury.edu/mview/eval/

Page 23: 28th CV勉強会@関東 #3

最先端のMVS研究例 •  "Silhouette and stereo fusion for 3D object modeling"

–  CVIU 2004 –  ターンテーブルを使い、10度ごとに画像取得 –  Visual Hull → Polygon Mesh復元 –  レーザレンジセンサーレベルの復元に成功

23

Fig. 16 (from the paper): Reconstructions using our proposed approach. Left: one original image used in the reconstruction. Middle: Gouraud-shaded reconstructed models (45843, 83628 and 114496 vertices respectively). Right: textured models.

Input image / reconstructed model — 114,496 vertices

Gouraud shading Textured

Page 24: 28th CV勉強会@関東 #3

最先端のMVS研究例 •  "A Globally Optimal Algorithm for Robust TV-L1"

–  ICCV2007

– 中間データとしてDepth-Mapを保持 – 複数のDepth-Mapを併合することでポリゴンメッシュ復元

24
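To make the depth-map merging step concrete: as I read the paper, each depth map is first converted into a (truncated) signed distance field $f_i$ on a voxel volume, and the fused implicit surface $u$ is obtained by minimizing a TV-L1 energy roughly of the form below. This is my reconstruction of the formulation, not a verbatim quote, and the per-view weights $w_i$ are optional confidences; the polygon mesh is then extracted as the zero level set of the minimizer.

```latex
E(u) \;=\; \int_{\Omega} \Bigl\{\, |\nabla u(\mathbf{x})|
      \;+\; \lambda \sum_{i} w_i(\mathbf{x}) \,\bigl| u(\mathbf{x}) - f_i(\mathbf{x}) \bigr| \,\Bigr\}\, d\mathbf{x}
```

The total-variation term favors a smooth, minimal-area surface, while the robust L1 data terms make the fusion tolerant of outlier depths in individual maps.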

[Slide shows Figures 3 and 4 from the paper, plus part of its Section 5 discussion.

Figure 3: Selected depth images and the final mesh (379958 triangles) for the "Dino" dataset.

Figure 4: The "Statue" dataset (consisting of 40 source views); the final mesh has 230460 triangles.]

Page 25: 28th CV勉強会@関東 #3

State-of-the-art MVS research examples
•  Advantages of depth-map reconstruction
  – the problem is posed in the 2D image domain rather than in 3D
  – real-time reconstruction also becomes possible

25

Page 26: 28th CV勉強会@関東 #3

最先端のMVS研究例 •  "Towards Internet-scale Multi-view Stereo "

– CVPR 2010 – 法線付きの点群として3次元復元 – 大規模MVS

26

[Slide shows the first page of the paper.]

Towards Internet-scale Multi-view Stereo
Yasutaka Furukawa (Google Inc.), Brian Curless (University of Washington), Steven M. Seitz (Google Inc. and University of Washington), Richard Szeliski (Microsoft Research)

Abstract: This paper introduces an approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections. The main idea is to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstructions. This overlapping clustering problem is formulated as a constrained optimization and solved iteratively. The merging algorithm, designed to be parallel and out-of-core, incorporates robust filtering steps to eliminate low-quality reconstructions and enforce global visibility constraints. The approach has been tested on several large datasets downloaded from Flickr.com, including one with over ten thousand images, yielding a 3D reconstruction with nearly thirty million points.

Figure 1. Our dense reconstruction of Piazza San Marco (Venice) from 13,703 images with 27,707,825 reconstructed MVS points (further upsampled ×9 for high quality point-based rendering).

Piazza San Marco (Venice) — viewpoints: 13,703; reconstructed points: 27,707,825

Page 27: 28th CV勉強会@関東 #3

State-of-the-art MVS research examples
•  Challenges of large-scale MVS
  – the view clustering problem
    •  cluster the SfM output into the sets of views each MVS run needs
  – parallelized on a PC cluster
    •  even so, it still takes several hours

27

Page 28: 28th CV勉強会@関東 #3

Freely available software

Structure from Motion (SfM)
  –  Bundler
     •  this is what Photo Tourism uses
  –  Voodoo Camera Tracker
     •  SfM from video

Multi-View Stereo (MVS)
  –  Patch-based Multi-view Stereo (PMVS)
  –  Poisson Surface Reconstruction
     •  mesh generation from an oriented (normal-equipped) point cloud (see the sketch after this slide)

Web services
  –  My3DScanner (service apparently discontinued?)
     •  Bundler + PMVS + Poisson Surface Reconstruction
  –  Photosynth
  –  Automatic Reconstruction Conduit

Viewer
  –  MeshLab

28
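Not from the slides: as a concrete illustration of the PMVS → Poisson Surface Reconstruction step, here is a minimal sketch using the Open3D library (my choice for illustration; the deck does not mention Open3D, and the file names are hypothetical). It meshes an oriented point cloud such as the dense points produced by PMVS.

```python
import open3d as o3d

# Hypothetical input: an oriented point cloud, e.g. dense points exported by PMVS.
pcd = o3d.io.read_point_cloud("dense_points.ply")
if not pcd.has_normals():
    pcd.estimate_normals()  # Poisson reconstruction requires oriented normals

# Poisson surface reconstruction: fits an indicator function to the oriented
# points and extracts its iso-surface as a watertight triangle mesh.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
o3d.io.write_triangle_mesh("mesh.ply", mesh)  # inspect the result in MeshLab
```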