
Semi-interactive construction of 3D event logs for scene investigation - 2: A review of 3D reconstruction: Towards scene investigation using handheld cameras


UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

Semi-interactive construction of 3D event logs for scene investigation

Dang, T.K.

Publication date 2013

Link to publication

Citation for published version (APA):

Dang, T. K. (2013). Semi-interactive construction of 3D event logs for scene investigation.

General rights

It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations

If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.


Chapter 2

A Review of 3D Reconstruction: Towards Scene Investigation Using Handheld Cameras

In scene investigation, many tasks require or can benefit from a 3D model of the scene. For instance, the measurement of certain objects can be done off-site without pre-determining what needs to be measured while capturing. Complicated tasks like hypothesis validation absolutely require a 3D model of the scene. Thus it would be ideal to have an efficient method to easily model a scene for investigation from a set of images, or even better from a video log of the investigation process.

The problem of 3D reconstruction from images or videos has been extensively examined. While good results have been shown in controlled environments using high quality still images, we study the many challenges of using video input in an uncontrolled environment. This chapter gives an overview of the complete process and reviews the related work in each step. By doing so we identify the opportunities to apply 3D reconstruction from images/videos in scene investigation.


2.1 Overview of 3D Reconstruction From Video Sequences

In our proposed scheme for scene investigation, a person moves around and captures the scene, which we assume to be static (i.e. there are no moving objects in the input).

Building a 3D model from videos for scene investigation is the purpose of this review. To support various tasks of scene investigation, such as 3D navigation, measurement, or hypothesis validation, the following aspects are important:

• Robustness: we are looking for a system that works in different setups and in varying conditions.

• Flexibility: investigators should spend time mainly on their job, i.e. investigation, rather than setting up complex hardware and performing tedious calibration. Thus we need a flexible solution. It would be great if an investigator can grab any camera, record the scene and do reconstruction later on.

• Precision: to obtain the right interpretation of the scene and enable hypothesis validation, the method should be accurate.

• Usability: since investigators are not experts in computer vision or graphics, the reconstruction procedure should minimize the human effort. Interaction, if required, should be intuitive and reasonable.

Those requirements are hard to meet in one system. To a certain level, they contradict each other. For example, to obtain precision, we usually have to sacrifice flexibility and usability.


Figure 2.1: The 3D reconstruction step of the overall process can be decomposed into four sub-steps.

Following common frameworks for reconstruction, we divide the reconstruction process into four main steps (Figure 2.1), which are discussed in the following sections.

1. Lens distortion correction (Section 2.2): The coordinates of pixels suffer from a radial lens distortion, thus should be corrected.

2. Feature processing (Section 2.3): The objective of this step is to detect the same features in different frames and match them.

(4)

3. Structure and Motion Recovery (Section 2.4): This step recovers the structure of the scene, i.e. the 3D coordinates of the features, and the motion of the cameras, i.e. their poses and internal parameters.

4. Model Creation (Section 2.5): This step creates a 3D model of the scene in some desired representations, for example a textured mesh.

2.2 Lens Distortion Correction


Figure 2.2: Detail of lens distortion correction. The Lens calibration step should be done once.

In 3D reconstruction, feature coordinates are assumed to be produced by an ideal pinhole camera [61] (p. 153). Real cameras do not conform to that model. The difference between a real and an ideal camera is called the lens distortion. For most handheld cameras, it is too significant to be ignored. Therefore, lens distortion correction should be applied before running any geometric algorithms [28].

To correct the distortion, we first need to model it. Lens distortion is typically modeled by a radial displacement function (x_d, y_d)^T = L(r̃)(x̃, ỹ)^T, where (x_d, y_d) is the actual distorted frame coordinate and (x̃, ỹ) is the ideal coordinate. Here, r̃ = √(x̃² + ỹ²) is the distance from the point to the center of radial distortion, which is usually close to the image center, and L(r̃) is commonly a quartic polynomial, e.g. L(r̃) = 1 + k1·r̃² + k2·r̃⁴. To find the parameters of the distortion function L(r̃), several methods have been proposed [169, 94, 35]. More sophisticated distortion models also include a tangential displacement, perpendicular to the radial one. Figure 2.3 gives an example of lens distortion and correction.
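To make the model concrete, the following sketch applies a polynomial radial displacement of the form L(r̃) = 1 + k1·r̃² + k2·r̃⁴ to ideal coordinates. The coefficients and the distortion center are made-up values for illustration only, not parameters of any particular camera.

```python
import numpy as np

def apply_radial_distortion(x, y, k1, k2, cx=0.0, cy=0.0):
    """Map ideal coordinates to distorted ones with L(r) = 1 + k1*r^2 + k2*r^4,
    where (cx, cy) is the center of radial distortion (assumed values)."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy                    # squared distance to the center
    L = 1.0 + k1 * r2 + k2 * r2 * r2          # quartic radial displacement factor
    return cx + L * dx, cy + L * dy

# Points far from the center are displaced the most (illustrative coefficients).
xs = np.array([0.1, 0.5, 1.0])
ys = np.array([0.0, 0.5, 1.0])
print(apply_radial_distortion(xs, ys, k1=-0.1, k2=0.02))
```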

The process of finding the distortion function parameters is called lens calibration. It also recovers other camera intrinsic parameters such as the focal length. As long as the intrinsic parameters remain fixed, which is often the case, the calibration only needs to be performed once. It is usually done with a calibration object, such as a chessboard [169, 94].



(a) Radial component of distortion model (b) Complete distortion model

Figure 2.3: A lens distortion model calibrated using the Camera Calibration Toolbox for Matlab [19].

It is also possible to calibrate without a calibration object, as in [35], based on the fact that “straight lines have to be straight”, by choosing an arbitrary straight line in the scene.

In the reconstruction process, lens distortion correction is the first step. In practice it is actually more complex. Since the correction is non-linear, it may reduce the photometric quality and affect the feature detection. Thus point and edge detection are usually done on the original distorted frames, and the correction is then applied to the detected features. Correcting all pixels is expensive, hence we should do it only when it is really required. For example, in stereo mapping, a sub-step of model creation explained later, only a subset of the frames is used and the distortion is fully corrected.

Doing the calibration in advance means that we cannot use zooming, since zooming changes the intrinsic parameters. In scene investigation this is fortunately a fair assumption, as users would move close to an object to observe it and take pictures. If zooming is desired, methods exist to estimate the camera's intrinsic parameters during the reconstruction, such as [28]. This, however, makes the reconstruction algorithm more complex, and thus less accurate.
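As an illustration of this workflow (a one-off calibration with a chessboard, after which only the detected feature coordinates are corrected rather than whole frames), the following OpenCV sketch may help. The image paths, board size, and feature points are placeholders, not data from this thesis.

```python
import glob
import cv2
import numpy as np

# --- One-off lens calibration from chessboard shots (hypothetical file names) ---
pattern = (9, 6)                                   # number of inner corners
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calibration/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
print("RMS reprojection error:", rms)

# --- Correct only detected feature coordinates, not every pixel of every frame ---
features = np.array([[[320.0, 240.0]], [[610.5, 122.3]]], np.float32)  # dummy points
corrected = cv2.undistortPoints(features, K, dist, P=K)   # back in pixel coordinates
print(corrected.reshape(-1, 2))
```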

2.3 Feature Processing

The first step in 3D reconstruction is to detect and match features in different frames. Until now, the features used in structure recovery processes are points and lines. The three main steps in feature processing (Figure 2.4) are detection, description, and initial matching.

Finding features in an image is done using a detector. The most important information a detector gives is the location of features. It can also return other characteristics such as the scale and orientation. Two characteristics of a good detector, as defined in [134], are repeatability and reliability. Repeatability is the ability to detect the same features in different images. Reliability means that the detected features are distinctive enough so that the number of candidates for matching is small. For 3D reconstruction, location accuracy is important, since in a complex reconstruction process small errors might be accumulated or magnified, resulting in a bad final result.



Figure 2.4: Feature detection and matching process.


Now suppose we have two images of the same scene and their features. To find corresponding pairs of features, we need feature descriptors. The feature description process takes a feature detected in the previous step and produces descriptive information. This descriptor is usually represented by a vector. Features in different frames are matched by comparing their descriptions. A good descriptor should be invariant to rotation, scaling, and affine transformation, so that the same feature in different images would be characterized by almost the same value, and should be reliable in reducing the number of possible matches [104]. In addition, to deal with a large number of features and high-dimensional descriptor vectors, we need an efficient initial matching strategy.

Research on interest point and line detection and description is summarized separately in Sections 2.3.1 and 2.3.2. Matching of both kinds of features is discussed in Section 2.3.3, since they share the same principles.

2.3.1 Interest Points

In this document, a point feature is called an interest point; in other work it is also often referred to as a keypoint (e.g. [93]).

Point detection

Schmid et al. classify point detectors into three categories: contour based, intensity based, and parametric model based [134].

• Contour based detectors first extract contours from images, then find points that have special characteristics, e.g. junctions, endings, or curvature maxima. A multi-scale framework can be utilized to get more robust results.

• Intensity based detectors find interest points by examining the intensity change around points. To measure the change, first and second derivatives of frames are used in many different forms and combinations.

• Parametric model based detectors find interest points by matching models or templates, for example of L-corners, to a frame.

Many point detectors have been proposed. The Harris corner detector [59] is well-known and is invariant to rotation and partially to intensity change. However, it is not scale invariant. Scale invariant detectors such as [93, 102] search for features over scale space. Lowe's SIFT [93] searches for local extrema of Difference of Gaussians in space and scale. Mikolajczyk and Schmid [102] use Harris corners to search for features in the spatial domain, then use a Laplacian in scale to select features invariant to scale. An affine invariant detector is proposed by Tuytelaars and Van Gool [158]. Starting from a local intensity maximum, it searches along rays through that point to find local intensity extrema. The link formed by those extrema defines an interest region that is approximated by an ellipse. By searching along many rays and using ellipses to represent regions, the detected regions are affine-invariant. Their experiments show that the method can deal with view changes up to 60 degrees.

Figure 2.5: A feature (corner) under different transformations: rotation, rotation and scaling, and an affine transformation.

Repeatability is evaluated by the ratio of the number of repeated points over the total number of detected points in the common part of two frames. The reliability of a detector is measured by the diffusion of local jets, a set of image derivatives, of a large number of interest points: the more diffuse the values, the more reliable the detector. This diffusion is measured using entropy. Among the detectors examined in [134], the Improved Harris corner, improved from the original by employing a more appropriate differential operator, is the best in terms of both repeatability and reliability. The evaluation in [134] does not cover scale and affine invariant detectors.

Mikolajczyk et al. give a comparison of affine detectors [105]. Instead of the diffusion of descriptors, this comparison evaluates reliability using the matching score, “the ratio between the number of correct matches and the smaller number of detected regions in the pair of images”. On average the Hessian-Affine detector [103] performs best, where we should note that SIFT [93] uses approximately the same detection scheme as Hessian-Affine. MSER [99] is the most repeatable under viewpoint change, which is important in reconstruction. A more recent, elaborate comparison focusing on geometric applications by Aanæs et al. [2] confirms the superior performance of SIFT [93] and Hessian-Affine [103], but reveals that MSER [99] does not perform as well as found in [105]. In their comparison they also found that the simple Harris corner detector [59] performs very well when the scale change is small.

Note that under scale and affine transformations, a point is usually no longer a point but becomes a region. In the literature, robust detectors therefore report “interest regions” instead of interest points. When matching regions for 3D reconstruction, we can simply use the centroids of the regions for computation.

Speed is a criterion not mentioned in the above evaluations. In practice it can be important if we handle large amounts of data, e.g. when doing reconstruction from videos. There are several efficient implementations [157] of detectors; for instance SURF [12] is an efficient detector and descriptor inspired by SIFT [93]. Those implementations significantly improve the speed without sacrificing the other criteria.

Location accuracy is another missing criterion in existing evaluations of detectors. Location accuracy is very important in 3D reconstruction [157]. Because of the complex reconstruction process, limited errors in the location of interest points can be magnified into large errors. In [110] the relation between intensity noise and certainty of interest point location is derived for the Harris corner detector [59], which is rather outdated. More investigation on the location accuracy of state-of-the-art detectors, e.g. SIFT [93], should be done.

Point description

Point descriptors are classified by Mikolajczyk and Schmid [104] into the four following categories:

• Distribution based descriptors. Histograms are used to represent the characteristics of the region. The characteristics could be pixel intensity, distance from the center point [86], relative ordering of intensity [168], or gradient [93].

• Spatial-frequency descriptors. These techniques are used in the domain of texture classification and description. Texture description using Gabor filters is standardized in MPEG-7 [126].

• Differential descriptors. A set of local derivatives is used to describe an interest region. The local jet, used in [134] to evaluate the reliability of detectors, is an example.

• Moments. Van Gool et al. [161] use moments to describe a region. The central moment of a region in combination with the moment's order and degree forms the invariant.

The invariance of descriptors is obtained in many ways for different changing factors. For example, in [93, 102] maxima of local gradients with different directions are used to identify the orientation. Other sets of rotation invariants can be used to characterize the region, e.g. the Fourier-Mellin transformation used in [11]. Scale and skew determined in the detection phase are used to normalize image patches in [11, 131].

Mikolajczyk and Schmid [73, 104] have done an evaluation of descriptors. Two criteria are used: the Receiver Operating Characteristic and the recall. The first one is the ratio of detection rate over false positive rate; and the second is the ratio of correct matches over possible matches. Qualitatively, the results for the two criteria are the same. The SIFT descriptor [93], which is invariant to scaling and partially to view change, and SIFT-based methods [77, 104] are in the top group. This evaluation also shows that using region-based detectors, in particular the scale and affine invariant detectors, gives slightly better results.


Conclusion

We have summarized various methods to detect and to describe interest points, as well as comparisons between them. Choosing a detector somewhat depends on the type of input scenes. SIFT and SIFT-based descriptors, such as [66], are attractive because of their performance. Location accuracy of interest points has not been studied, although it has been identified as an important criterion, especially for 3D reconstruction. Thus, in order to apply 3D reconstruction in scene investigation, we need to study the location accuracy of interest points in more depth.
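As a small illustration of interest point detection and description, the sketch below runs OpenCV's SIFT implementation on a single frame; the file name is a placeholder and the printed attributes are simply those stored in each keypoint.

```python
import cv2

img = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)   # placeholder frame
sift = cv2.SIFT_create()

# Detect keypoints (location, scale, orientation) and compute 128-D descriptors.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), "keypoints, descriptor matrix shape:", descriptors.shape)

# Each keypoint carries a sub-pixel location, a scale, and an orientation,
# all of which matter for the later geometric computations.
for kp in keypoints[:3]:
    print(kp.pt, kp.size, kp.angle)
```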

2.3.2 Lines

In practice one type of feature cannot cover all possible situations. For instance, detectors like SIFT [93] and SURF [12] detect blob-like structures, which are plentiful in natural scenes but not in man-made scenes. Lines also provide more choice of geometric constraints for structure and motion recovery. In this sub-section we discuss line detection and matching for 3D reconstruction.

Line detection

Line detection includes two steps: first edge detection, and then line extraction. Edge detection finds the pixels potentially belonging to a line. Parameterized lines are then extracted from those pixels in the line extraction step.

Many edge detection schemes are available in the literature [137]. Edge detection is based on intensity change. Edge detectors usually follow the same routine: smoothing, applying edge enhancement filters, applying a threshold, and edge tracing.

Evaluations of edge detectors are inconsistent and do not converge [137, 47], for reasons such as unclear objectives and varying parameters. Shin et al. have done an evaluation using structure from motion as a black box to test edge detectors [137]. It shows that, overall, the Canny detector [22] is most suitable because of its performance, speed, and low sensitivity to parameter variations. This evaluation, however, has weak points. The structure from motion algorithm described in [151], which is used in the evaluation, is based on the two-view geometric constraint. This cannot fully show the potential of using lines, which should be shown with three-view constraints. The second weak point is that the “intermediate processing”, i.e. parameterizing lines and matching them, is fixed, which affects the final result. Thus, the result is not complete enough to draw a conclusion.

Extracting lines can be done in several ways. The Hough transform [71] is famous for curve fitting. Despite its long history, the Hough transform and its extensions are still used [106, 6, 112]. A simpler approach connects line segments with a limit on angle changes and then uses the least median of squares method to fit the connected paths into lines [145, 137]. Robust estimators such as RANSAC [45] are widely used to fit lines to the points of segments. We have not found a complete and concrete evaluation of line extraction.
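A minimal sketch of the two-step pipeline (edge detection followed by line extraction), here using the Canny detector and a probabilistic Hough transform from OpenCV; the image file and thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

img = cv2.imread("corridor.png", cv2.IMREAD_GRAYSCALE)     # placeholder frame

# Step 1: edge detection (smoothing, gradient-based enhancement, hysteresis thresholds).
edges = cv2.Canny(cv2.GaussianBlur(img, (5, 5), 1.0), 50, 150)

# Step 2: line extraction with the probabilistic Hough transform.
segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                           threshold=80, minLineLength=40, maxLineGap=5)
print(0 if segments is None else len(segments), "line segments extracted")
```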


Line description

Lines can be matched based on attributes such as orientation, length, extent of overlap, or distance. Optical flow can be employed in case the difference between two views is small [83].

An evaluation of line description and matching algorithms for reconstruction is missing in the literature. Authors give evaluations to compare their work with previous works, e.g. [83], but those are not complete enough to draw a conclusion.

Lines give stronger relations between views. They are plentiful and easy to extract in scenes dominated by man-made objects. One of the first works that uses line correspondences and trifocal tensors is Beardsley et al. [14]. In their work, lines are not used directly; point correspondences are used first to recover the geometric information. The potential of using line features is confirmed in [123]: using both lines and points as input features, it is shown that trifocal tensors and camera motion can be recovered more accurately.

2.3.3 Initial Matching Strategy

Features are initially matched across frames based on their descriptors. The simplest method is exhaustive search, optionally with a limit on the motion vector length. This approach is obviously slow when matching a large number of features across many frames.

When the number of features is large, a KD-tree [49] can be used to find matches with a binary search strategy. The KD-tree is, however, inefficient when the descriptor vectors are high-dimensional; SIFT descriptors, for example, are 128-element vectors. Best Bin First (BBF) [15] and Fast Filtering Vector Approximation (FFVA) [163] approximate the KD-tree search to gain efficiency while searching a large database.
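The sketch below illustrates approximate nearest-neighbour matching of high-dimensional descriptors with OpenCV's FLANN-based matcher (a randomized KD-tree search in the spirit of BBF), followed by Lowe's ratio test to prune ambiguous matches. The descriptor matrices are random stand-ins for real SIFT descriptors from two frames.

```python
import cv2
import numpy as np

# Stand-ins for SIFT descriptor matrices of two frames (normally from the detector).
rng = np.random.default_rng(0)
desc1 = rng.random((500, 128), dtype=np.float32)
desc2 = rng.random((480, 128), dtype=np.float32)

# Approximate KD-tree search; exhaustive search would use cv2.BFMatcher instead.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 4}, {"checks": 64})
knn = flann.knnMatch(desc1, desc2, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the runner-up.
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences kept")
```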

Finally, the problem can be formulated as a bipartite graph matching problem: finding the set of correspondences between the two feature sets with the greatest total weight, given that each feature can be matched to at most one feature in the other set. One must be aware that in this formulation the total cost is optimized, while what we really need is the number of correct matches. Though they are related, they are not the same.

Matching groups of close features is more accurate and efficient than individual matching. This has been shown for both points [50] and lines [162].

When matching a pair of images, exhaustive search can be good enough, and can even outperform BBF. This is to be expected, as binary search is only efficient when the number of candidates is much larger than the number of dimensions. Descriptors are usually high-dimensional, so if the number of features is not large, binary search is not efficient. It pays off mainly when searching a large pre-indexed dataset, and we also expect such methods to perform well when matching an image against a large set of unstructured images.

Matches produced here are initial and may contain outliers (Figure 2.6a). Validation of those matches is embedded within the geometric constraint computation. Beardsley et al. [14] use the geometric constraints in both the two-view and three-view cases to match lines. Schmid and Zisserman extend this idea to matching both lines and curves [135]. This technique is now standard in reconstruction.
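A sketch of this outlier rejection step: the fundamental matrix is estimated robustly with RANSAC, and correspondences that violate the epipolar constraint are discarded. The two point sets are synthetic stand-ins (projections of random 3D points into two views, with a few gross outliers injected).

```python
import cv2
import numpy as np

# Synthetic stand-in for the initial matches of the previous step.
rng = np.random.default_rng(0)
X = rng.uniform([-1, -1, 4], [1, 1, 8], (200, 3))          # 3D points in front of the cameras
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], float)
rvec2, tvec2 = np.array([0.0, 0.2, 0.0]), np.array([0.5, 0.0, 0.0])
pts1 = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)[0].reshape(-1, 2)
pts2 = cv2.projectPoints(X, rvec2, tvec2, K, None)[0].reshape(-1, 2)
pts2[:20] += rng.uniform(30, 60, (20, 2))                  # inject gross outliers

# Robustly fit F; matches violating x'^T F x = 0 (beyond the threshold) are outliers.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
print(int(mask.sum()), "of", len(pts1), "matches kept as inliers")
```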



Figure 2.6: (a) Correspondences initially matched based on SIFT descriptors. Note the inconsistent motion field showing outliers. (b) After applying the geometric constraint technique all matches are correct.

2.3.4 Summary and Conclusion

We have summarized the literature of feature detection and matching for the structure recov-ery problem. Two kinds of features are examined: points and lines.

With points, many detection and matching methods are available in the literature, with good evaluations [134, 105, 157, 2]. These evaluations, however, lack a criterion on location accuracy, which is very important in the noise-sensitive process of 3D reconstruction. Efficiency is rarely considered as a criterion in the literature, but it is important in practice, especially if we want to do reconstruction from videos.

Using line features improves accuracy [123]. However, line detection and matching schemes and their evaluations are not available at the proper level, especially not in the context of 3D reconstruction. This should be explored more, as lines are plentiful in man-made structures and thus frequently encountered in scene investigation.

2.4 Structure and Motion Recovery

The structure and motion recovery step estimates the structure of the scene and the motion of the cameras. Taking feature correspondences as input, it estimates the geometric constraints among views. Then the projection matrices that represent the motion information are recovered. Finally, the 3D coordinates of the features, i.e. the structure information, are computed via triangulation [65]. Figure 2.7 summarizes the steps.


Figure 2.7: Details of the structure and motion recovery step.

In the following subsections we first discuss the problem of structure and motion recovery from multiple images, which is quite standard by now. Then we discuss the benefits and the problems of using large amounts of data, typical in reconstruction from video sequences.

2.4.1 Multiple View Geometry and Stratification of 3D Geometry

This subsection gives a brief overview of multiple view geometry and the concept of geometric stratification that are required to understand the following subsections.

Multiple view geometry

The research in 3D reconstruction from multiple views started with two views. This is quite natural since humans also see the world through a two-view system. Initial research assumed calibrated cameras, i.e. the intrinsic parameters of each camera and the relative positions of two cameras, if a stereo system is employed, are known. All of those parameters are acquired via a calibration process (see Section 2.2).

For the calibrated case, the essential matrix E [69] is used to represent the constraints between two normalized views. Given the calibration matrix K (a 3×3 matrix that encodes the focal length, aspect ratio, and skew of the camera), a view is normalized by transforming all points using the inverse of K: x̂ = K⁻¹x, in which x is the 2D homogeneous coordinate of a point in the view. The new calibration matrix of the view is then the identity. For a corresponding pair of points (x, x′) in homogeneous coordinates, E is defined by the simple equation x̂′ᵀE x̂ = 0.
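As a small numerical illustration of these definitions, the sketch below builds an essential matrix E = [t]×R from an assumed relative pose, normalizes two corresponding points with K⁻¹, and verifies that the constraint x̂′ᵀE x̂ = 0 holds; all values are illustrative.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                 # illustrative calibration matrix

def normalize(x, K):
    """x_hat = K^-1 x: remove the intrinsic parameters from a homogeneous point."""
    return np.linalg.inv(K) @ x

# Essential matrix E = [t]_x R for an assumed relative pose of the second camera.
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
t_cross = np.array([[0.0, -t[2], t[1]],
                    [t[2], 0.0, -t[0]],
                    [-t[1], t[0], 0.0]])
E = t_cross @ R

# Project one 3D point into both views and check the epipolar constraint.
X = np.array([0.2, -0.1, 5.0])
x1 = K @ X                        # first camera at the origin: x ~ K[I|0]X
x2 = K @ (R @ X + t)              # second camera: x' ~ K[R|t]X
print(normalize(x2, K) @ E @ normalize(x1, K))   # ~0 up to floating point error
```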

In the uncalibrated case, the essential matrix is extended to a new concept. During the 1990s, the concept of the fundamental matrix F, or bifocal tensor, was introduced and well studied by Faugeras [108] and Hartley [62].



Figure 2.8: Two-view geometry. X is a 3D point; x and x′ are its projections. C and C′ are the camera centers; the line connecting them is called the baseline. X, C, and C′ together define the epipolar plane. l and l′ are the epipolar lines of the two projections of X. The projections of the camera centers onto the other views, e and e′, are called the epipoles.

The F matrix is the generalization of E and its defining equation is very similar: x′ᵀFx = 0.

The difference is that in uncalibrated reconstruction the K matrix is unknown, and thus the view coordinates cannot be normalized; therefore x is used in the equation instead of x̂. F is still “fundamental” for research on multiple view geometry since it is simple, yet very informative. Its relations with other ways of expressing constraints can be found in [61]. Some principal concepts of two-view geometry, or epipolar geometry, are explained in Figure 2.8.

Three-view geometry was also developed during the 1990s. A three-view constraint, represented by a trifocal tensor T, captures the relation among the projections of a line in three views. Trifocal tensors define a richer set of constraints over views: apart from the line-line-line correspondence, they also define constraints for mixed point and line correspondences, up to point-point-point. Furthermore, they introduce the homography to transfer points between two views. Unlike the fundamental matrix, which defines a point-to-line relation, i.e. a one-to-many relation, line correspondences defined by trifocal tensors are one-to-one. This is one of the advantages of trifocal tensors [61, 44]. Trifocal geometry is explained in Figure 2.9.

Stratification of 3D geometry

Focal tensors (F or T ) form the constraints among multiple views. But from them to the final structure and motion is a long way.

Motion information, or the camera parameters of a view, is either intrinsic or extrinsic. Intrinsic parameters are the focal length, skewness, etc. Extrinsic parameters are the position and orientation. In a stereo vision system, such as the human vision system, the two “cameras” can be fully calibrated; that is, all intrinsic and extrinsic parameters are known.



Figure 2.9: Line correspondences among three views are the basis to define three-view tensors. Points on line l are transferred to points on line l′ by the homography induced by the plane (C′, l′) (Figure 15.1 in [61]).

In a mono vision system, which is the case in reconstruction with a handheld camera, the extrinsic parameters are always unknown in advance. If the camera is calibrated in advance (see Section 2.2), the reconstruction process is called calibrated reconstruction; otherwise it is called uncalibrated reconstruction.

In the uncalibrated case, no prior calibration is used; the missing information must be recovered via a further step called self-calibration, which is discussed in Section 2.4.3. For better comprehension, we should first understand the concept of geometric stratification, introduced by O. Faugeras in [43].

The space we are familiar with is the Euclidean space, in which all familiar concepts like absolute length, angle, ratio, and parallelism exist (Table 2.1). Taking away the concept of absolute length, we have the metric space. If the concept of angle is also taken away, we have the affine space, in which parallelism, ratio, and centroid exist but angles cannot be measured. Since there is no way to tell angles apart, in this space a rectangle is, for example, just the same as a parallelogram. The least informative space is the projective space, in which the concepts of parallelism, ratio, and centroid do not exist, while tangency and cross ratio still exist.

From uncalibrated images, scene geometry is recovered step by step from projective space up to Euclidean space. This process uses invariant objects. For example, the plane at infinity is an invariant object that helps upgrade from the projective to the affine space. The plane at infinity is where all parallel lines meet, hence the word infinity. In the projective space it is, however, not at infinity. If we find it in the projective space and transform the space so that the plane at infinity is indeed at infinity, then we have upgraded to affine space.

Characteristics of geometric spaces are summarized in Table 2.1 and can be found in [114] or [61].

Since we are looking for a flexible method, we continue the discussion of structure and motion recovery for the uncalibrated case. The calibrated case is similar, since knowing the intrinsic parameters does not allow skipping any upgrading step; it does, however, make the problem more constrained and the results more accurate.


Stratum       DoF   Transformation matrix    Invariants
Projective    15    T (4 × 4, invertible)    Intersection, tangency of surfaces, cross-ratio
Affine        12    [A t; 0ᵀ 1]              + parallelism, centroid, plane at infinity
Metric        7     [sR t; 0ᵀ 1]             + relative distance, angle, absolute conic
Euclidean     6     [R t; 0ᵀ 1]              + absolute distance

Table 2.1: Characteristics of geometric strata [114, 61]. Transformations are defined by homogeneous coordinate matrices: T is a 4 × 4 invertible matrix, A is a 3 × 3 invertible matrix, R is a 3 × 3 rotation matrix, s denotes a scaling factor, and t is a 3D translation vector. Each stratum inherits the invariants of the stratum above it (“+”).

2.4.2 Projective Structure and Motion

Having only knowledge of feature correspondences, the best reconstruction we can obtain is a projective one. There are infinitely many ways to obtain projection matrices from a focal tensor. Methods, implementation hints, and evaluations of focal tensor computation are well discussed by Hartley and Zisserman in [61].

The computation of the focal tensor at its simplest involves solving a linear equation system. If the input, i.e. the feature correspondences, includes outliers, robust methods such as RANSAC [45] or Least Median of Squares must be employed to reject them. Then iterative optimization, e.g. using Levenberg-Marquardt, should be used to improve the result. Choosing the error function for the minimization is very important, since algebraic errors, i.e. the estimation errors computed directly from the geometric constraint equations, do not express geometric meaning. Geometric or Sampson distances are advised [61].

With the focal tensors recovered, a projective reconstruction is already available. There are many decompositions from tensors to projection matrices [95]. Commonly one assumes that the first camera projection matrix is P1 = [I | 0], where I is the 3 × 3 identity matrix. By doing so we can derive the other view's projection matrix based on the constraint.

In case of more than two views, in order to have a consistent structure, the decomposition into projection matrices must be done with homographies induced from the same reference plane. This can be based on fundamental matrices [96] or trifocal tensors [9].

To avoid complex equations, one can use the additive structure building (ASB) method as in [121]. Starting from an initial structure, new views are added one by one. A new projection matrix is computed from a linear equation system formed from correspondences between already reconstructed 3D points and their projections on the new view. A non-sequential adding strategy can be used to reduce the accumulated error.
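To illustrate the core of this view-adding step, the sketch below estimates a 3×4 projection matrix from 2D-3D correspondences with a basic, unnormalized DLT (direct linear transformation) and verifies it on synthetic data. A practical implementation would add data normalization and robust outlier rejection (e.g. RANSAC); this is not the exact formulation used in [121].

```python
import numpy as np

def resection_dlt(X, x):
    """Estimate a 3x4 projection matrix P from n >= 6 correspondences between
    3D points X (n x 3) and image points x (n x 2), by solving x ~ P X with SVD."""
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        A.append([0.0] * 4 + [-c for c in Xh] + [v * c for c in Xh])
        A.append(Xh + [0.0] * 4 + [-u * c for c in Xh])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)            # right null vector, reshaped to P

# Synthetic check: project points with a known camera, then recover it up to scale.
rng = np.random.default_rng(1)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], float)
P_true = K @ np.hstack([np.eye(3), np.array([[0.3], [0.0], [0.2]])])
X = rng.uniform([-1, -1, 4], [1, 1, 8], (20, 3))
xh = (P_true @ np.c_[X, np.ones(20)].T).T
x = xh[:, :2] / xh[:, 2:]
P_est = resection_dlt(X, x)
print(np.abs(P_est / P_est[2, 3] - P_true / P_true[2, 3]).max())   # ~0
```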


Using the first view's coordinates as the world coordinate system simplifies the equations of the reconstruction, since the projection matrix of the first view is simply P = [I | 0]. However, it makes the computation unstable and sensitive to noise [153]. That is the motivation for factorization methods, which produce a consistent set of projection matrices directly from the correspondences. The first factorization algorithm was introduced by Tomasi and Kanade for orthographic projection. Sturm and Triggs then extended it to perspective projection [147]. Further developments to solve the problems of initialization, missing trajectories, and continuous reconstruction are given in [1, 98]. An evaluation is given in [148]. Plane-based calibration [159] uses plane features represented via homographies. It has higher accuracy than point-based factorization, and overcomes the missing trajectory problem.

Theoretically, factorization gives better results than the structure update technique. Yet we have not found any explicit experimental verification of this, and whether factorization is more accurate and effective than ASB with a good frame selection is still an open question. Furthermore, the key to the advantage of factorization is having reliable tracks of features over frames, which are difficult to obtain in practice.

Conclusion. Both methods, ASB and factorization, should be evaluated further. While factorization has been the research direction in recent years, ASB has more practical relevance: outliers and ill-conditioned views can be rejected at each step. Besides, since bundle adjustment, which optimizes all motion and structure at once, is usually applied afterwards, the final result is almost the same.

2.4.3 Metric Structure and Motion

Upgrading a projective reconstruction to a metric one requires additional constraints [115]. The research on self-calibration ranges from methods with the strict assumption of knowing all fixed intrinsic parameters to flexible, practical ones with minimal and realistic assumptions, for example only the condition that the pixel grid is square [67, 115, 124, 122].

Available methods. Many metric upgrade methods go directly from projective to metric space. Heyden [67] derives the solution from the projection matrix equation, and needs at least five known intrinsic parameters. Pollefeys [115] builds up the method from an analysis of the absolute quadric equation, an abstract object encoding characteristics of both the affine and metric strata. This method is quite popular. It is improved by employing prior knowledge on camera parameters in [117]. Its constraint enforcement problem is solved by Chandraker et al. [23] using Linear Matrix Inequality relaxation. Ponce et al. proposed a new abstract object for calibration: the Absolute Quadric Complex [122]. It has the advantage of decoupling skew and aspect ratio from the other intrinsic parameters. This is an appealing characteristic since we all use digital cameras that have rectangular or square pixels. In [159], after projective reconstruction via factorization of a matrix of homographies, a method to upgrade to metric space is presented. It is also based on the theory of the absolute quadric.

Hartley, on the other hand, starting from the observation that iteration is tricky [124] and that a direct upgrade method has the difficulty of constraint enforcement [61], proposed a fully stratified method. The method first upgrades to the affine level by an exhaustive search for the plane at infinity.


To limit the search space it employs the cheirality constraint, i.e. the reconstructed points must be in front of the camera [63]. After this the affine structure is upgraded to a metric one as described in [84].

Evaluation. We identify the four possibly best auto-calibration methods: Pollefeys et al. [117], Chandraker et al. [23], Ponce et al. [122], and Ueshiba and Tomita [159].

Unfortunately a complete comparison among them with simulated or real data is not available. As constraint enforcement is added, we expect the second method [23] to outperform the first [117]. But the evaluation in [23] on 25 real datasets uses only qualitative criteria, and the first turns out to be the winner; this may be caused by the numerical instability of the optimization software. The third method only outperforms the first one in simulations at a noise standard deviation of 3.5 pixels, which is quite high and unrealistic in our opinion. The fourth method has not been compared to any other.

Figure 2.10: A few images from a frame sequence and the recovered metric structure and motion. The structure (point cloud) is built from keypoints only thus looks quite sparse. Result is generated using VisualSFM [167].

Conclusion. In summary, several methods exist for metric reconstruction, but a complete evaluation on robustness, accuracy, and flexibility does not exist. Some simulated results show that the average error is about 1 to 2 percent, while results from real data have errors of about 3 to 7 percent [121, 23]. This means that uncalibrated reconstruction may be of limited use for measurement and hypothesis validation, which need highly accurate models.

2.4.4 Degeneracy

Degenerate input is input from which it is impossible to make a metric reconstruction, either because of the characteristics of the scene or because of the capturing positions.

In practice, input captured by a person using a hand-held camera is hardly ever absolutely degenerate. However, nearly degenerate inputs are common in practice, for instance when a camera moves along a wall or on an elliptic orbit around the object. That is why studying degeneracy and detecting those cases is extremely important in creating a robust reconstruction method, or selecting the most suitable method for the case.

The study of degeneracy started very early and is still the subject of recent research, e.g. by Kahl et al. [74]. Degeneracy is caused by structure, motion, or a combination of both.


Structure degeneracy happens when the observed points and the viewpoints follow a certain rule. Motion degeneracy depends only on the camera motion, for example pure rotation, and thus can happen with any scene. Sturm's paper [146] studies degeneracy for the case of fixed intrinsic parameters. He also suggested a “brute force” approach to select the best algorithm. Pollefeys gives a practical approach that examines the condition number of the equation system [113]. This helps to reject degenerate cases but does not indicate the proper reconstruction method.

In [117], Pollefeys et al. show how to deal with planar structure, the most common structure degeneracy. The method uses the General Robust Information Criteria [156] on the trifocal tensor to detect and handle degeneracy. A scene with a dominant plane is a very common near-degenerate case [26, 48], because most features are found on that plane. The fundamental matrix computation usually fails in this case. The problem is solved using Degenerate RANSAC [26] or Quasi-Degenerate RANSAC [48], which test the degeneracy hypothesis during the computation.

In conclusion, detection of degeneracy is important in 3D reconstruction. The fewer camera parameters are known, the more ambiguous the reconstruction will be [113]. An important fact is that degeneracy cannot be identified in a completely automatic way; hence for each application one should use context knowledge as much as possible. Some systems explicitly request users to follow a strict capturing guideline to avoid degeneracy, e.g. Photosynth† or ARC3D. In scene investigation, since we do not want to limit the investigators' movement, we should find another way to overcome degeneracy, or at least make the capturing guideline less strict.

2.4.5 Reconstruction from Videos

Research in 3D reconstruction started with the question whether it is possible to do 3D reconstruction from a set of images. But for applications like scene investigation, it is more natural to use videos.

Using video sequences as input is a trade-off: sacrificing intra-frame quality, i.e. resolution and sharpness, for inter-frame quality, i.e. relatedness and overlap between frames. To compensate for the loss due to lower frame quality, we can exploit the inter-frame redundancy in the data. From a statistical point of view, more projections of a point mean more samples, and the estimation converges more reliably to the true value. The best texture, found by selecting the best view or by super-resolution, can be used to get better visualization quality. Frame sequences also enable some techniques to deal with shadow, shading, and highlights [121].

One advantage of using videos is flexibility. Taking still images for reconstruction is troublesome, even for an expert. For example, assume that we know that the best move is going around an object, with an angular difference between consecutive views of about 15 degrees. Without measuring, it is hard to follow that guideline. Using a hand-held video camera, we only have to worry about the type of move, i.e. going around, and leave the frame selection to the computer. In the case of reconstruction with an unknown target, such as in crime scene investigation where at first we do not know which objects are evidence, using videos helps to avoid missing details.

http://labs.live.com/photosynth/


It has been shown possible to reconstruct 3D models from large amounts of data [142, 5]. These works, however, use images of better quality than video frames. Also, in terms of the amount of input, the number of images used for each scene in these works (a few thousand) is far smaller than the number of frames in a video log. To take advantage of video input, we have to solve some extra steps compared to reconstruction from still images: frame selection, sequence segmentation, structure fusion, and bundle adjustment.

• Frame selection. Among a number of frames, selecting good frames will improve the reconstruction result. Good frames are ones that have proper geometric attributes and good photometric quality. The problem is related to the estimation of the views’ position and orientation and photometric quality evaluation.

• Sequence segmentation. Reconstruction algorithms assume that a sequence is continuously captured. The sequence should be broken into proper scene parts and reconstructed separately and fused later.

• Structure fusion. Results of processing different video segments, either generated by different captures or through segmentation, must be fused together to create a final unique result.

• Bundle adjustment. The reconstruction process includes local updates, for example feature matching and structure update, and bias assumptions, e.g. the use of a first-view coordinate system. Those lead to inconsistency and accumulated errors in the global result. There should be a global optimization step to produce a unique consistent result.

For all mentioned problems, solutions exist to a certain level, yet no solution is absolutely perfect. For example, available bundle adjustment tools like [90] work well with a limited number of images, and get extremely slow when the number of input images increases.
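To make the role of bundle adjustment concrete, here is a toy sketch that jointly refines camera poses and 3D points by minimizing the total reprojection error with SciPy's Levenberg-Marquardt solver. The intrinsics are assumed known and fixed, every point is assumed visible in every view, and the gauge (global coordinate) freedom is left unhandled; real tools such as [90] use sparse solvers and are far more elaborate.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

# Toy problem: 3 cameras, 40 points, known intrinsics, all points visible everywhere.
rng = np.random.default_rng(2)
n_cams, n_pts = 3, 40
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], float)
X_true = rng.uniform([-1, -1, 4], [1, 1, 8], (n_pts, 3))
rvecs = [np.array([0.0, 0.1 * i, 0.0]) for i in range(n_cams)]
tvecs = [np.array([0.3 * i, 0.0, 0.0]) for i in range(n_cams)]
obs = np.array([cv2.projectPoints(X_true, rvecs[i], tvecs[i], K, None)[0].reshape(-1, 2)
                for i in range(n_cams)])

def residuals(params):
    """Stacked reprojection errors for packed camera poses (6 each) and 3D points."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    X = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for i in range(n_cams):
        proj = cv2.projectPoints(X, cams[i, :3], cams[i, 3:], K, None)[0].reshape(-1, 2)
        res.append((proj - obs[i]).ravel())
    return np.concatenate(res)

# Start from a perturbed structure and jointly refine structure and motion.
x0 = np.concatenate([np.concatenate([r, t]) for r, t in zip(rvecs, tvecs)]
                    + [X_true.ravel() + rng.normal(0, 0.05, n_pts * 3)])
sol = least_squares(residuals, x0, method="lm")
print("cost before:", 0.5 * np.sum(residuals(x0) ** 2), " cost after:", sol.cost)
```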

2.4.6 Summary and Conclusion

There are many options for each step in the reconstruction process. Some steps are well evaluated while others need further evaluation. Both accuracy and robustness should be improved in order to make vision-based 3D reconstruction more applicable. Using more data, such as found in video, can improve the quality, but many problems also come along when using video sequences.

Aiming for a flexible application, we favor the ASB method for projective reconstruction and the Pollefeys et al. method [117] for metric upgrade because they are relatively simple to implement and have good performance.

Reconstruction from uncalibrated images gives flexibility. However, it has been observed that reconstruction from long uncalibrated sequences gives poor results, because of error accumulation and an issue called projective drift [121, 29]. Thus, in practical settings, as soon as enough constraints are present, we should upgrade to a metric space. Then we can use a simpler algorithm to update the structure with other images [142, 41].


2.5 Model Creation

Once the structure and motion are recovered, we can proceed to model creation. Building a 3D model from a set of calibrated frames and some geometric information of a scene is called multi-view stereo reconstruction (MVSR). An overview and comparison of MVSR algorithms is given by Seitz et al. [136]. MVSR is a comprehensive topic that involves image processing, multi-view geometry, and computer graphics. Instead of trying to cover all of its aspects, we present one commonly used class of methods, e.g. used in the well-known work of Pollefeys et al. [121]. We relate characteristics and concepts to the general overview of Seitz et al. [136] and refer interested readers to that paper for a good overview of the topic.


Figure 2.11: Zoom in of a multi-view stereo reconstruction showcase.

According to the categorization in [136], the presented class of methods is image-based. It includes four sub-steps: rectification, stereo mapping, mesh building, and texture mapping (Figure 2.11). Rectification aligns scanlines between two images. Stereo mapping computes a dense matching map between points of different calibrated views. The scanline alignments speed up this step. From matching maps, depth maps are recovered through triangulation. Then in the mesh building step, multiple depth maps are merged to create a polygon mesh. The final step, texture mapping, extracts textures from selected views and maps these onto the wire-frame model.

We will present each of those steps in the following sub-sections.

2.5.1 Rectification

Rectification is a pre-processing step typical for image-based MVSR methods. It exploits the epipolar geometry to align epipolar lines so that corresponding points will have the same y-coordinate in two frames. This makes the computation in the stereo mapping step faster.

The first class of rectification methods is planar rectification, e.g. [64]. Both images are projected onto a plane parallel to the baseline. This method is simple and fast. It fails, however, in the case of a forward moving camera, which commonly happens in scene investigation, for example when moving along a street or corridor. In this case, planar rectification creates an unbounded image.
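A sketch of planar rectification in the uncalibrated setting using OpenCV: from matched points and the fundamental matrix, cv2.stereoRectifyUncalibrated returns two homographies that map corresponding points to equal scanlines. The matches here are synthetic stand-ins generated from a sideways camera motion (the favourable case for this class of methods).

```python
import cv2
import numpy as np

# Synthetic matches between two views with a sideways relative motion (stand-ins).
rng = np.random.default_rng(3)
X = rng.uniform([-1, -1, 4], [1, 1, 8], (150, 3))
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], float)
pts1 = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)[0].reshape(-1, 2)
pts2 = cv2.projectPoints(X, np.array([0.0, 0.1, 0.0]), np.array([0.5, 0.0, 0.0]),
                         K, None)[0].reshape(-1, 2)
F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

# Planar rectification: after warping, corresponding points share the same y-coordinate.
ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (640, 480))
p1 = cv2.perspectiveTransform(pts1.reshape(-1, 1, 2), H1).reshape(-1, 2)
p2 = cv2.perspectiveTransform(pts2.reshape(-1, 1, 2), H2).reshape(-1, 2)
print("max scanline misalignment:", np.abs(p1[:, 1] - p2[:, 1]).max())
```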



Figure 2.12: Polar rectification [116]. A point p is encoded by a pair (r, θ).

The second class of rectification methods is non-planar rectification. The first method in this class is cylindrical rectification, proposed by Roy et al. [129]. Images are projected onto a cylinder whose axis is the baseline. The unbounded image size problem is solved this way; the cost, however, is complexity, which is not a desired characteristic of what is just a preparation step. Pollefeys proposed a method called polar rectification [116] that solves the problem while keeping things simple. Each pixel is coded by two components: the scan line it lies on, which is an epipolar line, and its distance to the epipole. The method does not require reprojection of pixels but only scanning and recoding. A later work [109] refines this method to reduce feature distortion and completes the solution for the case of an epipole at infinity.

If we want to use videos captured by investigators, for which any kind of movement can be expected, we should use the polar rectification.

2.5.2 Stereo Mapping

The stereo mapping step establishes dense matching maps between images. From them, depth maps are computed using triangulation. Triangulation is well discussed in [65]; here we focus on how to produce the matching maps.

Stereo mapping is not trivial, as shown by the number of papers, different constraints, and strategies covered in the evaluation by Scharstein and Szeliski [132]. In the following paragraphs, we summarize their taxonomy and evaluation of stereo mapping methods.

Taxonomy. The traditional definition of stereo mapping considers only two rectified views, and the matching map is represented as the disparity map with respect to the reference image. It includes four subtasks:

1. Matching cost computation. Differences of any pair of pixels from two different images are computed using a cost function, for instance squared intensity difference. The range of pixels in the current image to be compared to a pixel in the reference image is limited based on a geometric basis such as by the epipolar constraint.

2. Cost aggregation. Making a matching decision based on the cost of a single pixel is unreliable due to noise. Cost aggregation improves the reliability of the cost by computing it over the neighborhood of the pixel. Aggregation in many cases enforces local smoothness.


3. Disparity computation. The decision on which elements match is made in this step. It can be a simple winner-take-all iteration through all pixels or as complex as a global optimization over all pixels.

4. Disparity refinement. Refinements include sub-pixel disparity estimation, e.g. by curve fitting, and post-processing such as applying a median filter to clean up mismatches.

Algorithms can either be local or global. Local or window-based algorithms decide the validity of each match based on matching cost within a limited window region, while global ones define smoothness assumption over whole images and solve an optimization problem. Step (3) of local algorithms is simply a winner-take-all. In global methods, the third step is the most important one where a global smoothness constraint is enforced.

Global methods are classified into three sub-classes depending on how the disparity computation is implemented: global optimization, dynamic programming, and cooperative algorithms.

• Global optimization. The smoothness constraint enforcement is usually interpreted as an energy-minimization problem. The energy function encodes both intensity and smoothness errors. The smoothness function must also preserve the discontinuity at edges. A match is then decided by finding the minimum. A variety of algorithms are in use including Markov Random Fields, simulated annealing, graph-cut, and belief-propagation.

• Dynamic programming. Methods in this class solve the minimization along scanlines. The advantage is their high speed. Two challenges in this class are achieving inter-scanline consistency and defining the cost for occluded pixels. There are techniques, such as described in [78], to overcome the first; the second remains a major problem.

• Cooperative algorithms. Cooperative algorithms iteratively perform local optimization. This finally creates a result that is similar to a global optimization.

Scharstein also lists algorithms that fall outside the taxonomy, for instance those that use optical flow in a hierarchical framework or use a multi-valued representation for disparity maps [16, 18].
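As one concrete stereo mapping example, the sketch below runs OpenCV's semi-global block matcher (window-based costs combined with smoothness optimization along scanlines) on a rectified pair and converts disparity to depth; the file names, matcher parameters, focal length, and baseline are all assumed values.

```python
import cv2
import numpy as np

# Rectified left/right frames produced by the rectification step (placeholder files).
left = cv2.imread("rect_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("rect_right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global matching: window-based costs plus smoothness terms along scanlines.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7,
                             P1=8 * 7 * 7, P2=32 * 7 * 7,
                             uniquenessRatio=10, speckleWindowSize=100)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0   # fixed-point output

# With focal length f (pixels) and baseline b (metres), depth follows from Z = f*b/d.
f, b = 800.0, 0.12                                                # assumed values
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * b / disparity[valid]
print("median depth of valid pixels:", np.median(depth[valid]))
```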

Multi-view stereo mapping. Stereo mapping has also been generalized to three or more views [81, 165]. But due to the required assumption of no occlusion, these methods use only information from views which are close together. This is a typical limitation of the image-based MVSR methods. The mesh building step then has to resolve the global relations by merging depth maps.

Discussion. Stereo mapping bears several characteristics of an MVSR algorithm. In [136], photo-consistency measures the visual consistency among views; most stereo mapping methods measure it with a window-matching metric.



Figure 2.13: Sparse structure (a), and dense structure (b) of the same scene. Result is generated using VisualSFM [167].

A visibility model defines how to identify occlusion. Since stereo mapping is mainly developed for close views, it uses an outlier-based model. In the disparity refinement, smoothing is common; it is one type of shape prior, i.e. using expected characteristics of scenes to improve results. The result of this step, depth maps, is one way of representing the scene. Other representations, such as textured meshes, are the result of the next steps: mesh building and texture mapping.

2.5.3 Mesh building and Texture mapping

For scene reconstruction, we consider geometric quality as most important for measurement and hypothesis validation. Mesh building and texture mapping, which relate mainly to photometric quality, are presented briefly for the completeness of the process.

Mesh Building. From multiple depth maps, a polygon mesh can be built using, for example, marching cubes [89] or marching triangles [70]. This assures the global consistency of views. The result is a polygon mesh, which is another type of scene representation.
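A minimal sketch of the mesh-building idea: once the depth maps have been fused into a volumetric signed-distance (or occupancy) grid, marching cubes extracts a triangle mesh from it. Here scikit-image's implementation is run on a toy sphere volume standing in for real fused depth maps.

```python
import numpy as np
from skimage import measure

# Toy volume standing in for a signed-distance field fused from multiple depth maps.
grid = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
sdf = np.sqrt((grid ** 2).sum(axis=0)) - 0.6    # signed distance to a sphere of radius 0.6

# Marching cubes extracts the zero-level surface as a triangle mesh.
vertices, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(len(vertices), "vertices,", len(faces), "triangles")
```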

Texture mapping. Since for each patch of the mesh there are several associated texture patches from different views, the main problem is how to select or fuse those. The simplest method is to take the texture from the most frontal view. A better approach is to combine textures from different images, as proposed in [34]. Textures from different images should be blended together to avoid ghosting effects, using, for example, Poisson blending [111]. The borders of the texture patches should be blended to make them seamless, for instance using graph cuts [4]. Since videos can be the input of the reconstruction, space-time super-resolution [133] is possibly useful to produce not only pleasant but also more detailed, high-resolution textures; yet we have not seen any such work in 3D reconstruction.
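A minimal sketch of the "most frontal view" heuristic mentioned above: for each mesh face, pick the camera whose viewing direction is most opposed to the face normal. The data structures are hypothetical and visibility/occlusion checks are deliberately omitted; the sketch only illustrates the selection rule.

```python
import numpy as np

def most_frontal_view(face_center, face_normal, camera_centers):
    """Return the index of the camera that sees the face most frontally,
    i.e. whose viewing direction is most anti-parallel to the face normal.
    Note: occlusion is not checked here."""
    view_dirs = face_center - camera_centers                  # camera -> face directions
    view_dirs /= np.linalg.norm(view_dirs, axis=1, keepdims=True)
    frontality = -(view_dirs @ face_normal)                   # 1.0 means perfectly frontal
    return int(np.argmax(frontality))

# Hypothetical usage with three camera positions around one face.
cameras = np.array([[0.0, 0.0, 5.0], [4.0, 0.0, 3.0], [0.0, 4.0, 1.0]])
center = np.array([0.0, 0.0, 0.0])
normal = np.array([0.0, 0.0, 1.0])                            # outward face normal along +z
print(most_frontal_view(center, normal, cameras))             # expect camera 0
```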

In case the scene has close light sources, they should be detected and information about those sources should be taken into account in step 3. In fact, the close light source problem is even more serious, since it affects the reconstruction right from the beginning. Light source estimation requires known shapes or surfaces [144], while the surfaces available at this stage already suffer from the lack of light source information. Using special objects, such as reflective metal spheres [87], to locate light sources is possibly the most efficient way to tackle the problem.

At the highest level, one can try to estimate the materials of objects in the scene. This is, however, extremely costly and unnecessary in many cases. For instance, in crime scene investigation a textureless model is sometimes enough for measurement or even hypothesis validation.

2.5.4 Discussion

We discussed the image-based class of MVSR methods, which take only frames and some geometric information as their initialization requirement [136]. By going through four main steps, we finally have a model of the scene composed of textured meshes. What we have not mentioned so far is evaluation.

For image-based MVSR, the stereo mapping step decides the geometric quality. A complete evaluation is given in [132]§. Algorithms are evaluated based on the bad pixel percentage in non-occluded regions, textureless regions, and discontinuous regions. In general, various ways are used to initialize the disparity, and all the best methods use a global optimization based on belief propagation.
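The bad-pixel percentage used in that evaluation can be written down in a few lines. The sketch below follows the common definition (disparity error above a fixed threshold), with a hypothetical mask selecting the region of interest (non-occluded, textureless, or discontinuous pixels); the ground truth here is synthetic.

```python
import numpy as np

def bad_pixel_percentage(disparity, ground_truth, region_mask, threshold=1.0):
    """Percentage of pixels inside region_mask whose disparity error exceeds threshold."""
    error = np.abs(disparity - ground_truth)
    bad = (error > threshold) & region_mask
    return 100.0 * bad.sum() / max(region_mask.sum(), 1)

# Hypothetical usage: a smooth ground-truth ramp, a noisy estimate,
# and a mask that keeps only non-occluded pixels (all pixels in this toy example).
gt = np.tile(np.linspace(0.0, 63.0, 640), (480, 1))
estimate = gt + np.random.normal(0.0, 0.5, size=gt.shape)
non_occluded = np.ones_like(gt, dtype=bool)
print(bad_pixel_percentage(estimate, gt, non_occluded, threshold=1.0))
```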

From the MVSR view, geometric quality is judged based on accuracy and completeness [136]. Accuracy is measured by the distance of the reconstructed model, in the mesh representation, to a ground truth obtained by laser scanners. Completeness is measured as the percentage of the scanned model covered by the reconstructed model. Datasets and evaluation results of various methods are publicly available¶. The weak point of this evaluation is that it only considers small Lambertian objects. Datasets of non-Lambertian objects or large-scale scenes, such as a room, are not available at the moment.
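A simplified, point-based version of these two measures is sketched below: accuracy as a high percentile of the distances from the reconstruction to the scanned ground truth, and completeness as the fraction of ground-truth points lying within a tolerance of the reconstruction. The cited evaluation works on meshes; this point-cloud version only illustrates the idea, and the data is synthetic.

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_and_completeness(reconstruction, ground_truth, tolerance=0.01, percentile=90):
    """reconstruction, ground_truth: N x 3 / M x 3 point arrays in the same units (e.g. metres)."""
    # Accuracy: how far reconstructed points are from the ground-truth surface.
    dist_rec_to_gt, _ = cKDTree(ground_truth).query(reconstruction)
    accuracy = np.percentile(dist_rec_to_gt, percentile)
    # Completeness: fraction of the ground truth covered by the reconstruction.
    dist_gt_to_rec, _ = cKDTree(reconstruction).query(ground_truth)
    completeness = 100.0 * np.mean(dist_gt_to_rec < tolerance)
    return accuracy, completeness

# Hypothetical usage with a noisy, partial copy of a synthetic ground truth.
gt = np.random.rand(5000, 3)
rec = gt[:4000] + np.random.normal(0.0, 0.002, size=(4000, 3))
print(accuracy_and_completeness(rec, gt))
```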

Building a photorealistic model is a difficult task, especially just from visual input. Geometric and photometric quality are both important and not easy to obtain. Nevertheless, when modeling a scene we should consider the purpose. For example, if the goal is measurement, we may ignore the texturing. In most cases the purpose of the model is only to give an overview of the scene, so aiming for a photorealistic, highly accurate model is not the highest priority.

2.6 Conclusion and Discussion

We have reviewed the main steps in reconstruction from image sequences: lens distortion correction, feature processing, structure and motion recovery, and model creation. Each step, or even sub-step, is already a field of research in itself, so we have kept a moderate level of detail: only an overview and an assessment of the methods were given. From that analysis we have identified the challenges we face when applying reconstruction in scene investigation.

§The evaluation is frequently updated and available on the Internet at http://cat.middlebury.edu/stereo/.


Lens distortion correction is also well studied. Since zooming is not a crucial function, we suggest calibrating the camera in advance, or at least fixing the intrinsic parameters by not zooming. This simplifies the reconstruction and leads to more accurate results.
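A sketch of what "calibrating in advance" can look like with OpenCV and a planar chessboard, followed by undistorting the frames before reconstruction. The board size, folder, and file names are placeholders, and the snippet assumes at least one board is detected.

```python
import glob
import cv2
import numpy as np

# Calibrate once, in advance, with a planar chessboard (placeholder: 9x6 inner corners).
pattern = (9, 6)
object_points = np.zeros((pattern[0] * pattern[1], 3), np.float32)
object_points[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_list, img_list = [], []
for path in glob.glob("calibration/*.jpg"):          # hypothetical calibration images
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_list.append(object_points)
        img_list.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_list, img_list, gray.shape[::-1], None, None)

# Later, undistort every video frame with the fixed intrinsics before reconstruction.
frame = cv2.imread("frame_0001.jpg")                 # hypothetical frame from the video log
undistorted = cv2.undistort(frame, K, dist)
cv2.imwrite("frame_0001_undistorted.jpg", undistorted)
```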

Feature detection and matching are well studied. Modern feature point detectors are repeatable and reliable, but whether they are accurate is unknown, while for 3D reconstruction accuracy is of major importance. We see the benefit of using lines, but a good evaluation is needed to decide which method to choose and how to use it. Matching methods based on feature descriptors are quite reliable, but never completely correct. Thus, employing robust estimation in the structure and motion recovery step is a must.
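Because descriptor matches are never fully correct, robust estimation is typically wrapped around them. The sketch below uses OpenCV's SIFT, Lowe's ratio test, and a RANSAC fundamental-matrix fit to reject the remaining outliers; the frame file names are placeholders, and this is one common recipe rather than the specific method evaluated in this thesis.

```python
import cv2
import numpy as np

img1 = cv2.imread("frame_a.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input frames
img2 = cv2.imread("frame_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test: keep a match only if it is clearly better than the runner-up.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Robust estimation: RANSAC rejects matches that violate the epipolar geometry.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
print(len(good), int(inlier_mask.sum()))
```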

The structure and motion recovery is a complex step, especially if we take uncalibrated sequences as input. Though successes have been shown for image sequences and even large collections of images, taking the available methods to a new application domain is a new challenge. We expect new problems stemming from input quality and user behavior. That is why reconstruction from videos is called a “black art” [61], and thus we should study it further in practice.

Various methods are available for model creation. Evaluations exist, but their setup is far from the scale we find at real scenes. Overall, this step is very complex if we aim for a photorealistic and highly accurate model. In our opinion, we should aim for an efficient method to build a good overview visualization, and do more modeling on demand, with the user interactively steering the modeling process depending on the case at hand.

In summary, all elements for a 3D reconstruction system are available. In some steps there are even multiple options to choose from, but how well those elements work together in scene investigation is an unanswered question. In this thesis we study this problem and consider how automatic methods and interaction can be employed in a synergetic way, and to what extent they yield complete and accurate 3D models of the scene.
