
Page 1: A Stereo Vision-based Mixed Reality System with Natural ...yokoya.naist.jp/paper/datas/381/ismr01.pdfA Stereo Vision-based Mixed Reality System with Natural Feature Point Tracking

A Stereo Vision-based Mixed Reality Systemwith Natural Feature Point Tracking

Masayuki Kanbara†, Hirofumi Fujii‡, Haruo Takemura† and Naokazu Yokoya†

†Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0101, JAPAN

‡Matsushita Communication Industrial Co., Ltd., 600 Saedo-cho, Tsuzuki-ku, Yokohama, Kanagawa 224-8539, JAPAN
E-mail:

Abstract

This paper proposes a method to extend the registration range of a vision-based mixed reality system. We propose to use natural feature points contained in images captured by a pair of stereo cameras in conjunction with pre-defined fixed fiducial markers. The system also incorporates an inertial sensor to achieve a robust registration method which can handle the user's fast head rotation and movement. The system first uses pre-defined fiducial markers to estimate a projection matrix between the real and virtual coordinate systems. At the same time, the system picks up and tracks a set of natural feature points from the initial image. As the user moves around in the MR environment, the initial markers fall out of the camera frame and the natural features are then used to recover the projection matrix. Experiments evaluating the feasibility of the method are carried out and show its potential benefits.

Keywords: Augmented Reality, Vision-based Registration, Inertial Sensor, Natural Feature Tracking, Stereo Vision

1 Introduction

Augmented reality produces an environment in which virtual objects are superimposed on the user's view of the real environment. Augmented reality has received a great deal of attention as a new method for displaying information or increasing the reality of virtual environments. A number of applications have already been proposed and demonstrated [1, 2, 5, 9]. To implement an augmented reality system, we must solve several problems. Geometric registration is the most important one, because virtual objects should be superimposed at the right place, as if they really existed in the real world.

One of the major approaches to the registration between the real and virtual worlds is the vision-based method [3, 6, 8, 10, 11]. These methods, which are sometimes referred to as vision-based tracking or registration, estimate the position and orientation of the user's viewpoint from images captured by a camera attached at the user's viewpoint. Because the method usually relies on fiducial markers placed in the environment, the measurement range is limited in practice.

To overcome this limitation, Park et al. proposed a registration method that tracks natural features in addition to markers in the real environment [7]. The method realizes wide-range registration between the real and virtual worlds by tracking markers and natural features. However, tracking natural features in the captured images is usually difficult for two reasons. One is that template matching for tracking is not robust or stable enough, especially when the perspective of the scene changes as the user moves in a 3-D environment, although template matching-based tracking has been extensively studied in the field of computer vision. The other is that the tracking itself is time consuming and makes it difficult for the system to run in real time.

One solution to the problem above is to predict the positions of features in the next frame using a Kalman filter [7]. The method avoids miss tracking of markers and reduces the calculation cost by limiting the search area in the image. On the other hand, You et al. proposed a tracking method using an inertial sensor attached to a camera for predicting markers' positions [12]. The method is able to track the features even when the camera moves fast. However, the method cannot accurately predict the positions of markers when the camera is translated, since an inertial sensor can measure only the rotation of the camera. Figure 1 illustrates the problem. When the camera is translated and rotated at the same time, the predicted position of a marker at time t+Δt is P' if the camera translation is not considered. The predicted position P' then differs from the correct position Pt+Δt of the marker by the translation T.


Figure 1 Relationship between a marker and a camera in motion.

In this paper, we propose a stereo vision-based augmented reality system with a wide registration range that uses pre-defined fiducial markers and natural features. In addition, we describe a robust tracking method using an inertial sensor, which predicts the positions of markers and natural features for tracking.

The rest of the paper is structured as follows. Section 2 describes the stereo vision-based registration method with an inertial sensor using natural feature points. In Section 3, experimental results with the proposed method and a discussion of a prototype system are given. Finally, Section 4 summarizes the present work and describes future work.

2 Geometric registration using markers and natural features

We assume in this study that a pair of stereo cameras is virtually located at the viewer's two eyes in an augmented reality system. Figure 2 shows the flowchart of the proposed method. First, to memorize the positions of markers and natural features (both of which may simply be called features hereafter), they are detected from a pair of stereo images (A in Fig. 2). In order to track the features, their positions are predicted using a pair of stereo images and an inertial sensor (B and C in Fig. 2). At the same time, the predictions are evaluated, because the tracking of natural features contains errors (D in Fig. 2). Finally, the results above are used for estimating a model-view matrix which represents the relationship between the real and virtual coordinate systems (E in Fig. 2).

Figure 2 Flow diagram of the registration method.

2.1. Detection of markers and natural features

In the first frame, markers and natural features are detected from a pair of stereo images. The markers are detected by color matching [3]. The natural features are detected by using Moravec's interest operator [4]. The interest operator can detect characteristic points as natural features that are easily matched between two consecutive images. In the proposed method, the natural features are detected by applying the interest operator to evenly spaced regions, so that the detected features are distributed evenly over the images. Figure 3 shows an N × M search window layout for the interest operator, which is applied to the left image of the stereo pair.
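As an illustration of this detection step, the following Python sketch (our own, not the authors' implementation; the grid and window sizes are arbitrary) computes Moravec's interest measure and keeps the strongest response in each cell of an evenly spaced grid:

```python
import numpy as np

def moravec_response(img, y, x, win=3):
    """Minimum SSD between the window at (y, x) and the same window
    shifted by one pixel along four principal directions (Moravec)."""
    h = win // 2
    patch = img[y-h:y+h+1, x-h:x+h+1].astype(float)
    best = np.inf
    for dy, dx in ((0, 1), (1, 0), (1, 1), (1, -1)):
        shifted = img[y-h+dy:y+h+1+dy, x-h+dx:x+h+1+dx].astype(float)
        best = min(best, ((patch - shifted) ** 2).sum())
    return best

def detect_features(img, n_rows=4, n_cols=4, win=3):
    """Pick the strongest Moravec response in each cell of an
    n_rows x n_cols grid, so features are spread over the image."""
    H, W = img.shape
    margin = win // 2 + 1
    feats = []
    for r in range(n_rows):
        for c in range(n_cols):
            y0, y1 = r * H // n_rows, (r + 1) * H // n_rows
            x0, x1 = c * W // n_cols, (c + 1) * W // n_cols
            best, best_pt = -1.0, None
            for y in range(max(y0, margin), min(y1, H - margin)):
                for x in range(max(x0, margin), min(x1, W - margin)):
                    v = moravec_response(img, y, x, win)
                    if v > best:
                        best, best_pt = v, (y, x)
            if best_pt is not None:
                feats.append(best_pt)
    return feats
```

Taking one feature per cell is a simple way to realize the even distribution described above; a real implementation would also suppress weak responses in textureless cells.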

2.2. Prediction of positions of features

2.2.1. Analysis of camera motion. Figure 4 illustrates the relationship between the positions of the stereo cameras at time t and t+Δt. In general, the motion of the cameras contains rotation and translation. Since the center of rotation differs from the position of the stereo cameras, the translation T of the camera can be decomposed into two translations Tr


Figure 3 Search windows of natural features in the first frame.

Figure 4 Decomposition of camera motion.

and Tt. Between Tr, Tt and the whole translation T, the following equation holds:

T = Tr + Tt    (1)

where Tr is the translation caused by the rotation of the camera, which is obtained by the inertial sensor, and Tt represents the user's viewpoint translation, that is, the translation of the rotation center. In this paper, the positions of features in the next frame are predicted by estimating both Tr and Tt.

2.2.2. Estimating camera motion and predicting positions of features. When θ, Tr and Tt are estimated, the positions of features in the current frame can be predicted as follows:

Pt+Δt = R Pt + Tr + Tt    (2)

where R is the transformation matrix of the rotation θ, which is obtained by the inertial sensor.

Figure 5 Translation of camera in rotation by a displacement.

The stereo cameras are actually translated even when the camera is only rotated, because of the displacement between the center of rotation and the position of the stereo cameras, as shown in Figure 5. Because the displacement is constant, the translation Tr can be calculated with the following equation:

Tr = R Pc − Pc    (3)

where Pc represents the displacement between the center of rotation and the position of the stereo cameras. The transformation matrix R is the aforementioned matrix, since the rotation of the camera is the same as the rotation obtained by the inertial sensor. The calculation is applied to both cameras.

Next, the translation Tt of the center of rotation can be determined as follows:

Tt = Pt+Δt − (R Pt + Tr)
   = Pt+Δt − P'    (4)

The equation means that the translation Tt of the cameras can be calculated from the relationship between the position Pt+Δt of a feature in the current frame and its predicted position P'.

The proposed method assumes that the marker farthest from the cameras shows only a small movement between adjacent images. Hence, the translation Tt is predicted by estimating Pt+Δt of the farthest marker from the camera.
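The prediction of Equations (2)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming (the function names are hypothetical); R is the rotation matrix from the inertial sensor, Pc the fixed offset from the rotation center to the camera, and the farthest marker supplies the viewpoint translation:

```python
import numpy as np

def rotation_translation(R, Pc):
    """Eq. (3): camera translation induced by rotation about a center
    displaced by Pc from the camera."""
    return R @ Pc - Pc

def estimate_Tt(R, Tr, far_prev, far_now):
    """Eq. (4): viewpoint translation recovered from the farthest
    marker, whose apparent inter-frame motion is assumed small."""
    return far_now - (R @ far_prev + Tr)

def predict(R, Tr, Tt, Pt):
    """Eq. (2): predicted position of a feature in the current frame."""
    return R @ Pt + Tr + Tt
```

A usage pattern would be: obtain R from the inertial sensor, compute Tr, estimate Tt from the farthest tracked marker, then call predict for every other feature to center its search window.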

2.3. Tracking of features

In this section, the tracking of markers and natural features with the prediction above is described.


2.3.1. Tracking of markers. This section describes two cases: the predicted position is inside or outside of the current frame. In the case where the predicted position is inside of the image, a search window is determined by the predicted position. Then, the markers are tracked by detecting the marker's region in the search window based on color information. Next, the 3D positions of the markers are calculated with a stereo matching algorithm. Note that the farthest marker is tracked without the prediction.

In the case where the predicted position is outside of the image, tracking is continued by assuming that the predicted position is correct in the current frame. Therefore, the markers can be tracked continuously even when markers that have once gone out of sight come back into view again.

2.3.2. Tracking of natural features. The natural features are tracked by using a standard template matching technique applied to two consecutive images. A template is made from a neighboring region of each natural feature in the previous frame, and the similarity measure is a normalized cross correlation. When the cameras rotate in the roll direction, the rotation of the template is taken into account in the matching process.
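A minimal sketch of this tracking step (ours; template and search-window sizes are illustrative, and the roll compensation mentioned above is omitted) matches a template taken around a feature in the previous frame inside a window centered on its predicted position:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equal-size patches."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    d = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / d if d > 0 else 0.0

def track(prev_img, cur_img, pt, tmpl=5, search=8):
    """Template from the previous frame around pt; search in the
    current frame around the predicted center (here pt itself)."""
    h = tmpl // 2
    y, x = pt
    T = prev_img[y-h:y+h+1, x-h:x+h+1]
    H, W = cur_img.shape
    best, best_pt = -2.0, pt
    for cy in range(max(y - search, h), min(y + search, H - h - 1) + 1):
        for cx in range(max(x - search, h), min(x + search, W - h - 1) + 1):
            s = ncc(T, cur_img[cy-h:cy+h+1, cx-h:cx+h+1])
            if s > best:
                best, best_pt = s, (cy, cx)
    return best_pt, best
```

In the actual system the search center would come from the prediction of Section 2.2 rather than from the old position, which is what keeps the window small.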

2.3.3. Evaluation of tracking results. The tracking results are evaluated by the following constraints, because the natural feature positions drift as the template is updated. When one of the following constraints is not satisfied, the tracking is discontinued.

- Correlation between two consecutive images. The normalized cross correlation between two consecutive frames is used to evaluate the tracking error caused by mismatches or occlusions due to camera motion. If the correlation is below a given threshold, the tracking of the natural feature is discontinued.

- Epipolar constraint. Corresponding points in a stereo pair must lie on the epipolar line, which is the intersection of the image plane with the plane determined by the 3D feature point and the two lens centers. If the corresponding points do not lie on the epipolar line, the tracking of the natural feature is discontinued.

- Displacement of 3D positions. The 3D positions of the features in the current frame are compared with their positions recorded in the first frame. If the distance between the position of a natural feature in the current frame and its position in the first frame is larger than a threshold, the tracking of the natural feature is discontinued.
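The three tests above can be sketched as a single predicate (our simplification: it assumes a rectified stereo pair, so the epipolar constraint reduces to comparing row coordinates, and all thresholds are illustrative):

```python
import numpy as np

def keep_track(score, pL, pR, p3d_now, p3d_init,
               ncc_min=0.8, epi_tol=1.5, move_max=0.05):
    """Return True if a tracked natural feature passes all three
    rejection tests of Section 2.3.3. pL/pR are (row, col) points in
    a rectified stereo pair; 3D positions are in meters."""
    if score < ncc_min:                          # correlation test
        return False
    if abs(pL[0] - pR[0]) > epi_tol:             # epipolar test (rectified)
        return False
    drift = np.linalg.norm(np.asarray(p3d_now, float) -
                           np.asarray(p3d_init, float))
    if drift > move_max:                         # 3D displacement test
        return False
    return True
```

With unrectified cameras the middle test would instead measure the distance from the right-image point to the epipolar line given by the fundamental matrix.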

Figure 6 Registration using features.

2.4. Estimation of model-view matrix

A model-view matrix that represents the relationship between the world and camera coordinate systems is estimated using the features. Figure 6 illustrates the relationship between the positions of features in the first and current frames. The model-view matrix is calculated by matching their 3D positions. Provided that the position of the i-th feature in the world coordinate system recorded in the first frame is Pinit,i and the position of the i-th feature in the camera coordinate system in the current frame is Pnow,i, the model-view matrix M can be estimated by minimizing a sum of squared differences (SSD) as follows:

min Σi ai ||Pnow,i − M Pinit,i||²

where ai is a parameter that represents the credibility of each feature. The credibility is defined by the constraints described in Section 2.3.3.
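The paper does not state how this minimization is solved; one standard choice for such a weighted rigid alignment is a Kabsch/Umeyama-style closed-form solution via SVD, sketched below (names are ours, and M is returned as a 4×4 homogeneous matrix):

```python
import numpy as np

def model_view(P_init, P_now, a):
    """Weighted least-squares rigid transform M = [R | t] minimizing
    sum_i a_i ||P_now_i - (R P_init_i + t)||^2 (Kabsch/Umeyama)."""
    P_init = np.asarray(P_init, float)
    P_now = np.asarray(P_now, float)
    a = np.asarray(a, float) / np.sum(a)
    ci = a @ P_init                      # weighted centroid, first frame
    cn = a @ P_now                       # weighted centroid, current frame
    X = (P_init - ci) * a[:, None]       # weighted, centered sources
    Y = P_now - cn                       # centered targets
    U, _, Vt = np.linalg.svd(X.T @ Y)    # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # proper rotation only
    t = cn - R @ ci
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, t
    return M
```

The weights ai down-weight features whose tracks barely passed the tests of Section 2.3.3, so a drifting feature perturbs the pose less than a reliably tracked marker.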

3 Implementation and Experiments

3.1. Prototype system

We have constructed a prototype of a video see-through augmented reality system using two small CCD cameras (Toshiba IK-UM42) and an inertial sensor (InterSense IS-300) mounted on an HMD (Olympus Media Mask), as shown in Figure 7, for demonstrating the proposed geometric registration algorithm. The baseline length between the two cameras is set to 6.5 cm. The optical axes of the cameras are set to be parallel to the viewer's gaze direction (actually the head direction). The images captured by the cameras are fed into a graphics workstation (SGI Onyx2 IR: 16 CPU MIPS R10000 195 MHz) through the digital video interface (DIVO). The orientation of the camera (head) obtained by the inertial sensor is also fed into the workstation through a serial interface. The incoming real world images are merged with virtual objects and output from the DIVO interface to the HMD. The hardware


Figure 7 Appearance of stereo video see-through HMD with inertial sensor.

Figure 8 Configuration of prototype system.

configuration of the whole system is illustrated in Figure 8. Figure 9 shows the appearance of an experiment using the prototype system.

3.2. Experiments on marker tracking with position prediction

The experiment uses four blue markers placed on a desktop as fiducial markers. Figure 10 shows the results of predicting the markers' positions with translation prediction of the user's head, in comparison with the prediction without translation prediction. The solid squares represent search windows with translation prediction; the dotted squares represent those without it. Note that the user's head translation is predicted using only the marker farthest from the cameras. As shown in Figure 10, the case which does not consider the user's head translation cannot accurately track the markers. On the other hand, it is confirmed that the proposed method can accurately predict the markers' positions using only one tracked marker, even when the user's head is translated or a marker is occluded by another object.

Figure 9 Appearance of the experiment.

Figure 11 shows the results of tracking markers under rapid camera motion. The solid and dotted squares represent search windows with and without translation prediction, respectively. The markers in each image are tracked successfully with translation prediction, while the tracking without translation prediction fails because a marker in the current frame does not fall inside the search window, as is clearly observed in Figure 11.

The rate of miss tracking using only the stereo cameras is 7.47%. The rates of miss tracking using both the stereo cameras and the inertial sensor, without and with considering the translation of the user's head, are 1.26% and 0.34%, respectively. Note that the rate of miss tracking is defined as follows:

rate of miss tracking = (number of miss-tracked markers) / (total number of frames × number of observed markers)

Note that the total number of frames is 150 and the number of observed markers is 4 in the experiment shown in Figure 11.
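As a worked example of the rate, with the 150 frames and 4 markers of this experiment, a hypothetical count of 2 miss-tracked marker observations would give:

```python
frames, markers = 150, 4      # values from the experiment in Figure 11
missed = 2                    # hypothetical count of miss-tracked observations
rate = missed / (frames * markers)
print(f"{rate:.2%}")          # → 0.33%
```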

3.3. Experiments on natural feature tracking

Figure 12 illustrates the results of natural feature tracking using the proposed method, where the mark '+' represents the positions of the natural features. The marks in the top image of the sequence represent the positions of natural features detected by Moravec's interest operator. Only reliable natural features are tracked, by considering the constraints mentioned in Section 2.3.3.

Figure 13 shows the results of geometric registration using both markers and natural features. The solid and dotted lines are drawn connecting the 3D positions of the markers by using the estimated model-view matrix. The solid lines represent the results of registration using markers and natural features, and the dotted lines those using only markers.


In the experiment, only the markers and natural features captured in the right half of the image are used for registration, in order to verify the robustness of the proposed method when the markers go out of the images. As shown in Figure 13, the solid lines are more accurate than the dotted lines.

4 Conclusion

This paper has proposed a stereo vision-based augmented reality system with a wide registration range. We have used natural feature points contained in images captured by a pair of stereo cameras in conjunction with pre-defined fiducial markers. In addition, the method realizes robust feature tracking by using an inertial sensor which predicts the positions of features. The feasibility of the prototype system has been successfully demonstrated through experiments. Our future work will include automatic detection of new natural features that come into sight, for a wider range of registration.

Acknowledgments

This work was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

References

[1] R. Azuma. A Survey of Augmented Reality. In Presence, volume 6(4), pages 355–385, 1997.

[2] S. Feiner, B. MacIntyre, and D. Seligmann. Knowledge-based Augmented Reality. In Commun. of the ACM, volume 36(7), pages 52–62, 1993.

[3] M. Kanbara, T. Okuma, H. Takemura, and N. Yokoya. A Stereoscopic Video See-through Augmented Reality System Based on Real-time Vision-based Registration. In Proc. IEEE Virtual Reality 2000, pages 255–262, 2000.

[4] H. Moravec. Visual Mapping by a Robot Rover. In Proc. 6th IJCAI, pages 598–600, 1979.

[5] T. Ohshima, K. Satoh, H. Yamamoto, and H. Tamura. AR2 Hockey: A Case Study of Collaborative Augmented Reality. In Proc. VRAIS'98, pages 14–18, 1998.

[6] T. Okuma, K. Kiyokawa, H. Takemura, and N. Yokoya. An Augmented Reality System Using a Real-time Vision Based Registration. In Proc. ICPR'98, volume 2, pages 1226–1229, 1998.

[7] J. Park, S. You, and U. Neumann. Natural Feature Tracking for Extendible Robust Augmented Realities. In Proc. of Int. Workshop on Augmented Reality, 1998.

[8] J. Rekimoto. Matrix: A Realtime Object Identification and Registration Method for Augmented Reality. In Proc. APCHI, pages 63–68, 1998.

[9] A. State, M. Livingston, W. Garrett, G. Hirota, and H. Fuchs. Technologies for Augmented Reality Systems: Realizing Ultrasound-Guided Needle Biopsies. In Proc. SIGGRAPH'96, pages 439–446, 1996.

[10] M. Uenohara and T. Kanade. Vision-Based Object Registration for Real-time Image Overlay. In Proc. CVRMed'95, pages 13–22, 1995.

[11] Y. Yokokohji, Y. Sugawara, and T. Yoshikawa. Accurate Image Overlay on See-through Head-mounted Displays Using Vision and Accelerometers. In Proc. IEEE Virtual Reality 2000, pages 247–254, 2000.

[12] S. You, U. Neumann, and R. Azuma. Hybrid Inertial and Vision Tracking for Augmented Reality Registration. In Proc. IEEE Virtual Reality '99, pages 260–267, 1999.


Figure 10 Position prediction of markers with and without considering a translation (solid square: search window with translation prediction, dotted square: search window without translation prediction).

Figure 11 Marker tracking with and without position prediction in rapid motion (solid square: search window of tracked marker with translation prediction, dotted square: search window of tracked marker without translation prediction).


Figure 12 Result of natural feature tracking.

Figure 13 Result of registration using both markers and natural features (solid line) compared with that using only markers (dotted line).