
UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

UvA-DARE (Digital Academic Repository)

Semi-interactive construction of 3D event logs for scene investigation

Dang, T.K.

Publication date 2013

Document Version Final published version

Link to publication

Citation for published version (APA):

Dang, T. K. (2013). Semi-interactive construction of 3D event logs for scene investigation.

General rights

It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations

If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please contact the Library via Ask the Library: https://uba.uva.nl/en/contact, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.


Semi-Interactive Construction of 3D Event Logs for Scene Investigation


All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without written permission from the author.


Semi-Interactive Construction of 3D Event Logs for Scene Investigation

ACADEMISCH PROEFSCHRIFT (academic dissertation)

to obtain the degree of doctor at the University of Amsterdam, under the authority of the Rector Magnificus, prof. dr. D.C. van den Boom, before a committee appointed by the Doctorate Board, to be defended in public in the Agnietenkapel

on Friday, 31 May 2013, at 10:00

by

Trung Kien Dang


Promotor: Prof. dr. ir. A. W. M. Smeulders

Co-promotor: Dr. M. Worring

Overige leden (other members): Prof. dr. ir. F. C. A. Groen, Prof. dr. T. Gevers, Dr. ir. L. Dorst, Prof. dr. M. J. Sjerps, Dr. J. Bijhold, Prof. dr. M. Welling

Faculteit (faculty): Faculteit der Natuurwetenschappen, Wiskunde en Informatica

The work described in this thesis was supported by the MultimediaN project.

Advanced School for Computing and Imaging

MultimediaN

The work described in this thesis has been carried out within the graduate school ASCI, at the Intelligent Systems Lab Amsterdam of the University of Amsterdam. ASCI dissertation series number 278.

Intelligent Systems Lab Amsterdam

University of Amsterdam, The Netherlands


Contents

1 Introduction  1
  1.1 Motivation  1
  1.2 Problem Statement  3
  1.3 Organization  4

2 A Review of 3D Reconstruction: Towards Scene Investigation Using Handheld Cameras  7
  2.1 Overview of 3D Reconstruction From Video Sequences  8
  2.2 Lens Distortion Correction  9
  2.3 Feature Processing  10
    2.3.1 Interest Points  11
    2.3.2 Lines  14
    2.3.3 Initial Matching Strategy  15
    2.3.4 Summary and Conclusion  16
  2.4 Structure and Motion Recovery  16
    2.4.1 Multiple View Geometry and Stratification of 3D Geometry  17
    2.4.2 Projective Structure and Motion  20
    2.4.3 Metric Structure and Motion  21
    2.4.4 Degeneracy  22
    2.4.5 Reconstruction from Videos  23
    2.4.6 Summary and Conclusion  24
  2.5 Model Creation  25
    2.5.1 Rectification  25
    2.5.2 Stereo Mapping  26
    2.5.3 Mesh building and Texture mapping  28
    2.5.4 Discussion  29
  2.6 Conclusion and Discussion  29

3 A Theoretical Analysis of The Perspective Error in Blob Detectors  31
  3.1 Introduction  32
  3.2 Background on Blob Detectors  32
    3.2.1 The blob detector family  33
    3.2.2 Characteristics  33
  3.3 Perspective Drift  34
    3.3.1 Notation  34
    3.3.2 Perspective drift for one camera  34
    3.3.3 Relative perspective drift for two cameras  37
    3.3.4 Effects on 3D reconstruction  38
  3.4 Experiments  39
    3.4.1 Experimental setup  39
    3.4.2 Projected relative perspective drift  40
    3.4.3 Manifestations in reconstruction results  42
    3.4.4 Discussion  45
  3.5 Conclusion  45

4 3D Modeling of Indoor Scenes Using Handheld Cameras  47
  4.1 Introduction  48
  4.2 Background  49
    4.2.1 The 3D modeling process  49
    4.2.2 Degeneracy  49
  4.3 Modeling Framework for Indoor Scenes  52
    4.3.1 Frame Filter  52
    4.3.2 Frame Segmentation  53
    4.3.3 Structure recovery  54
  4.4 Results  55
    4.4.1 Evaluation of frame filtering  55
    4.4.2 Evaluation of frame segmentation  58
    4.4.3 Final result  58
  4.5 Conclusion  59

5 A Semi-interactive Panorama Based 3D Reconstruction Framework for Indoor Scenes  61
  5.1 Introduction  62
  5.2 Related Work  63
    5.2.1 Reconstruction from panoramas  63
    5.2.2 Interaction in reconstruction  64
  5.3 Framework Overview  64
  5.4 Building a Walls-And-Floor Model  66
    5.4.1 Smart corner picking  67
    5.4.2 Rectifying panoramas  67
    5.4.3 Estimating the floor-plan  68
    5.4.4 Reconstructability analysis  70
  5.5 Adding Details using Perspective Extrusion  73
  5.6 Results  75
    5.6.1 Datasets  75
    5.6.2 Accuracy  75
    5.6.3 Efficiency and Completeness  77
  5.7 Conclusion  78

6 Building 3D Event Logs for Video Investigation  81
  6.1 Introduction  82
  6.2 Related work  85
  6.3 Analyzing Investigation Logs  86
    6.3.1 Investigation events  86
    6.3.2 Segmentation using Structure-Motion Features  87
  6.4 Mapping Investigation Events to a 3D Model  90
    6.4.1 Automatic mapping of events  91
    6.4.2 Interactive mapping of events  91
  6.5 Evaluation  92
    6.5.1 Dataset  93
    6.5.2 Analyzing investigation logs  94
    6.5.3 Mapping events to 3D model  96
  6.6 Navigating Investigation Logs  97
  6.7 Conclusion  98

7 Summary and Conclusion  101
  7.1 Summary  101
  7.2 Conclusion and Future Work  103


Chapter 1

Introduction

1.1 Motivation

The increasing availability of cameras and the reduced cost of storage have encouraged people to use images and videos in many aspects of their lives. Instead of writing a diary, nowadays many people capture their daily activities with a camera. When such capturing is continuous, this is known as "life logging". This idea goes back to Vannevar Bush's Memex device [21] and is still a topic of active research [7, 150, 37]. Similarly, professional activities can be recorded on video to create professional logs. For example, in home safety assessment, an investigator can walk around, examine a house, and record speech notes at the same time. Another interesting professional application is crime scene investigation. Instead of looking for evidence while purposely taking photos and writing notes, investigators can just wear a head-mounted device and focus on finding the evidence, while everything is automatically recorded in a log. These professional applications all share a similar setup, namely a first-person-view video log recorded in a typically static scene. In this thesis we focus on this group of professional logging applications, which we call scene investigation.

Our proposed scene investigation framework includes three phases:

• Capturing. An investigator records the scene and all objects of interest it contains using various media including photos, videos, and speech. The capturing is a complex process in which the investigator performs several actions to record different aspects of the scene. In particular, the investigator records the overall scene to get an overview, walks around to search for objects of interest, and then examines those specific objects in detail. Together these actions form the events in the capturing process.

• Processing. In this phase, all data are analyzed to yield information about the scene, the objects, and the events in the capturing phase.

• Reviewing. In the reviewing phase, an investigator uses the collected data to perform various tasks: assessing the evidence, getting an overview of the case, measuring specific scene characteristics, or evaluating hypotheses.


One important characteristic of the capturing phase is that it is difficult, if not impossible, to repeat: the assumption that the scene remains static holds only for a limited time-span. In crime scene investigation, for example, it is very hard to keep the crime scene untouched for a long time. Thus, it is best that the investigator captures the scene as completely as possible to avoid the need for further visits.

Depending on the specific reviewing task, the requirements on completeness and accuracy in the capturing phase vary. For example, measuring specific characteristics requires high accuracy but not completeness, while evaluating a hypothesis requires both completeness and sufficient accuracy.

Events of the investigation process constitute valuable information. Important events lead us quickly to important facts, while relations between events suggest connected facts. But all this information is only implicitly present in the data. The processing phase should uncover this information in a way that is completely transparent to the user. This is the main subject of our research: how to turn data from the capturing phase into information that supports the various tasks in the reviewing phase.

In the traditional investigation process, experts take photos of and notes on the scene and the objects they find important. This standard way of capturing does not provide a sufficient basis for the later processing and reviewing phases. In particular, a collection of photos cannot give a good overview of the scene. Thus, it is hard to imagine the relations between objects, come up with hypotheses, and assess them based on the photos alone. To get a better overview, in some cases investigators use the pictures to create a panorama. But since the viewpoint is fixed for each panorama, it does not give a good spatial impression. The measuring task is also not easily performed using those photos when the investigator has not planned for this in advance. More complicated tasks, like formulating a hypothesis on how the suspect moved, are very difficult to perform using a collection of photos without a good sense of space. Finally, a collection of photos and notes hardly captures investigation events, which are important for understanding the investigation process. For supporting advanced tasks, new ways of capturing the scene are needed.

A potential solution to enhance scene investigation is to log the whole investigation process using a handheld or head-mounted camera, and to create 3D models of the scene. Research on techniques to capture crime scenes as 3D models has pointed out that 3D models can be used in many investigation tasks [72]. 3D models make discussion easier, hypothesis assessment more accurate, and court presentation much clearer. Using a video log to capture investigation events is straightforward in the capturing phase: instead of taking photos and notes, investigators film the scene with a handheld camera. All moves and observations of investigators are thus recorded in video logs. However, in order to reap the benefits, it is crucial to have a method to extract events from the logs for reviewing. When combined, 3D models and video logs have great potential to improve information accessibility. For example, a 3D model can help visualize the spatial relations between events, or details of certain parts of the model can be checked by reviewing events captured in that part of the scene. Together, a 3D model of the scene and the log events form a 3D event log of the case. Such a log, apart from direct applications like event-based navigation in 3D, will enable other applications such as knowledge mining of expert moves, finding correlations among cases, or training.

There are several paths towards creating 3D event logs: interactive, automatic, or semi-interactive. 3D models can be built interactively using 3D authoring software such as Blender. This, however, requires a lot of time for measuring the scene and for interactive model building. As an alternative, 3D models can also be acquired automatically using laser scanners. However, laser scanners are expensive and not flexible, hence there are limits on their large-scale deployment. Another automatic approach is modeling the scene from image data. Under specific conditions, e.g. for certain outdoor scenes, it is possible to automatically reconstruct a 3D model from video [121, 40]. This suggests that we could reconstruct a 3D model from an investigation log and somehow combine it with the event analysis. However, there are inherent limits on the accuracy one can obtain, and under certain circumstances the reconstruction might not even succeed [121, 23, 146]. As we will show in this thesis, this is especially true when the aim is to reconstruct indoor scenes. A wiser approach is to handle the two problems, building 3D models and analyzing events, separately, and to leverage user interaction to overcome the limitations of the automatic algorithms, i.e. to take a semi-interactive approach. Semi-interactive approaches for this problem are not commonly found, and there are several questions that need to be answered before this is feasible. Those questions are discussed in the following section.

1.2 Problem Statement

As discussed, fully automatic methods for 3D reconstruction would be ideal, but they are not always realistic because of their limitations in terms of accuracy and robustness. Yet we should not ignore them. Automatic methods hold the technical elements for creating an effective and affordable method of 3D reconstruction. Studying them in the scene investigation context will help us understand what computers can do in this application, and what the difficulties are that need user interaction to solve. In addition, automatic reconstruction from videos is directly relevant for the estimation of camera motion and scene structure, which are both important features for analyzing the investigation log. Hence, our first question is:

Q1: To what extent can automatic methods be used for accurately reconstructing a 3D model from video logs?

As we have argued, using interaction is the best solution in practical applications. Users would rather spend a dozen minutes to obtain an accurate result than wait hours for a mediocre automatic result. The problem is that we should not overuse interaction and create just another manual tool. The proper way is to keep the interactions as limited as possible. To be effective, they should also be simple and intuitive. Once we answer the first question about what computers can do in 3D reconstruction, we will have the basis to answer the next question:

Q2: How to semi-interactively construct 3D models such that interaction is simple, intuitive, and as limited as possible?

Once we have 3D models of investigated scenes, our next interest is to analyze video logs of investigations, turning them into sequences of events. More specifically, we are interested in finding and analyzing events that capture the expert knowledge of investigators. Detecting events in videos is known to be a challenging problem in the literature, since the information to be extracted highly depends on the context. Also, most existing work is about detecting events in the content, i.e. what is captured in the video, whereas in scene investigation we are also interested in how the content is captured. So our next question is:

Q3: How to detect and understand investigation events in investigation logs?

Finally, we consider how to use the results of the analysis for navigation of the scene. To that end we should note that events are related not only in time, but also in space. Given that we have the 3D model of the scene (by answering Q2), the investigation process will be more comprehensible if its events are visualized in that model. Hence the last question is:

Q4: How to connect investigation events to the 3D model of the scene?

1.3 Organization

The following chapters of the thesis aim to answer the questions raised.

To answer Q1, we investigate in Chapter 2 the tools and methods available in the literature for fully automatic 3D reconstruction from image sequences or videos. We identify the opportunities and challenges one faces when applying those automatic methods in the practical application of scene investigation. In Chapter 3, using a theoretical model, we analyze one aspect that affects the accuracy of automatic 3D reconstruction, namely the feature location error. In Chapter 4, we fully investigate Q1 in practice: every required step for automatic 3D reconstruction from indoor investigation videos is sketched and experimentally verified. This gives a clear picture of the performance a fully automatic method can deliver in scene investigation applications, and paves the way to answering the other questions.

Drawing from the knowledge and experience gained in doing 3D reconstruction, Chapter 5 presents a semi-interactive system for reconstructing indoor scenes. This system is the answer to Q2, delivering an effective and accurate method that hopefully will bring more use of 3D models into scene investigation.

Chapter 6 aims to answer Q3 and Q4. We present a framework to analyze a video log of an investigation, and to visualize it so that its events can be navigated in 3D. Together, Chapters 5 and 6 provide a complete methodology for building a 3D event log.

The process of creating and accessing 3D event logs is summarized in Figure 1.1. Our contribution lies mainly in studying the processing phase for a scene investigation solution using handheld cameras, to enhance users' access to the captured scene.

Figure 1.1 (overview): the capturing, processing, and reviewing phases. Images and a video log are processed into a 3D model, an event log, and finally a 3D event log via automatic 3D reconstruction, semi-interactive 3D reconstruction, automatic video log segmentation, and semi-interactive matching (our proposed solution).

Chapter 2

A Review of 3D Reconstruction: Towards Scene Investigation Using Handheld Cameras

In scene investigation, many tasks require or can benefit from a 3D model of the scene. For instance, the measurement of certain objects can be done off-site, without pre-determining what needs to be measured while capturing. Complicated tasks like hypothesis validation absolutely require a 3D model of the scene. Thus it would be ideal if we had an efficient method to easily model a scene for investigation from a number of images, or even better from a video log of the investigation process.

The problem of 3D reconstruction from images or videos has been extensively examined. While good results have been shown in controlled environments using high-quality still images, we study the many challenges of using video input in an uncontrolled environment. This chapter gives an overview of the complete process and reviews the related work for each step. By doing so we identify the opportunities for applying 3D reconstruction from images or videos in scene investigation.


2.1 Overview of 3D Reconstruction From Video Sequences

In our proposed scheme, scene investigation, a person moves around and captures the scene, which we assume to be static (i.e. there are no moving objects in the input).

Building a 3D model from videos for scene investigation is the purpose of this review. To support various tasks of scene investigation, such as 3D navigation, measurement, or hypothesis validation, the following aspects are important:

• Robustness: we are looking for a system that works in different setups and in varying conditions.

• Flexibility: investigators should spend time mainly on their job, i.e. investigation, rather than setting up complex hardware and performing tedious calibration. Thus we need a flexible solution. It would be great if an investigator could grab any camera, record the scene, and do the reconstruction later on.

• Precision: to obtain the right interpretation of the scene and enable hypothesis validation, the method should be accurate.

• Usability: since investigators are not experts in computer vision or graphics, the reconstruction procedure should minimize the human effort. Interaction, if required, should be intuitive and reasonable.

These requirements are hard to meet in one system; to a certain extent, they contradict each other. For example, to obtain precision, we usually have to sacrifice flexibility and usability.


Figure 2.1: The 3D reconstruction step of the overall process can be decomposed into four sub-steps.

Following common frameworks for reconstruction, we divide the reconstruction process into four main steps (Figure 2.1), which are discussed in the following sections.

1. Lens distortion correction (Section 2.2): The coordinates of pixels suffer from a radial lens distortion and thus should be corrected.

2. Feature processing (Section 2.3): The objective of this step is to detect the same features in different frames and match them.

3. Structure and motion recovery (Section 2.4): This step recovers the structure of the scene, i.e. the 3D coordinates of the features, and the motion of the camera, i.e. the pose and internal parameters of the cameras.

4. Model creation (Section 2.5): This step creates a 3D model of the scene in some desired representation, for example a textured mesh.

2.2 Lens Distortion Correction


Figure 2.2: Detail of lens distortion correction. The Lens calibration step should be done once.

In 3D reconstruction, feature coordinates are assumed to be produced by an ideal pinhole camera [61] (p. 153). Real cameras do not conform to that model. The difference between a real and an ideal camera is called the lens distortion. For most handheld cameras, it is too significant to be ignored. Therefore, lens distortion correction should be applied before running any geometric algorithms [28].

To correct the distortion, we first need to model it. The lens distortion is typically modeled by a radial displacement function $(x_d, y_d)^T = L(\tilde{r})\,(\tilde{x}, \tilde{y})^T$, where $(x_d, y_d)$ is the actual distorted frame coordinate and $(\tilde{x}, \tilde{y})$ is the ideal coordinate. Here, $\tilde{r} = \sqrt{\tilde{x}^2 + \tilde{y}^2}$ is the distance from the point to the center of radial distortion, which is usually close to the image center. $L(\tilde{r})$ is commonly a quartic function. To find the parameters of the distortion function $L(\tilde{r})$, several methods have been proposed [169, 94, 35]. More sophisticated distortion models include a tangential displacement, which is perpendicular to the radially distorted point. Figure 2.3 gives an example of lens distortion and correction.
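To make the model concrete, here is a minimal sketch (not taken from the thesis) that applies a polynomial radial model $L(\tilde{r}) = 1 + k_1\tilde{r}^2 + k_2\tilde{r}^4$ to ideal coordinates; the coefficient values and the helper name are illustrative only.

```python
import numpy as np

def apply_radial_distortion(xy_ideal, k1, k2):
    """Map ideal (undistorted) coordinates to radially distorted ones.

    Sketch of (x_d, y_d)^T = L(r) (x, y)^T with L(r) = 1 + k1*r^2 + k2*r^4.
    xy_ideal: (N, 2) array of coordinates relative to the distortion center.
    The coefficients k1, k2 are hypothetical values, not calibrated ones.
    """
    r2 = np.sum(xy_ideal ** 2, axis=1, keepdims=True)  # squared radius per point
    L = 1.0 + k1 * r2 + k2 * r2 ** 2                    # quartic scaling factor L(r)
    return L * xy_ideal                                 # distorted coordinates

# Example with made-up coefficients.
pts = np.array([[0.10, 0.20], [0.50, -0.30]])
print(apply_radial_distortion(pts, k1=-0.096, k2=0.185))
```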

The process of finding the distortion function parameters is called lens calibration. It also recovers other camera intrinsic parameters such as the focal length. As long as the intrinsic parameters remain fixed, which is often the case, the calibration only needs to be performed once. It is usually done with a calibration object, such as a chessboard [169, 94]. It is also possible to calibrate without a calibration object, as in [35], based on the fact that "straight lines have to be straight" and by choosing an arbitrary straight line in the scene.

Figure 2.3: A lens distortion model calibrated using the Camera Calibration Toolbox for Matlab [19]. (a) Radial component of the distortion model; (b) complete distortion model.

In the reconstruction process, lens distortion correction is the first step. In practice it is actually more complex. Since the correction is non-linear, it may reduce the photometric quality and affect feature detection. Thus point and edge detection are usually done on the original distorted frames, and the correction is then applied to the detected features. Correcting all pixels is expensive, hence we should do it only when it is really required, for example in stereo mapping, a sub-step of model creation explained later, where only a subset of the frames is used and the distortion is fully corrected.
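As an illustration of correcting only the detected feature coordinates rather than whole frames, the following sketch uses OpenCV's undistortPoints; the intrinsic matrix and distortion coefficients are placeholders standing in for the output of the calibration step.

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients (would come from lens calibration).
K = np.array([[2090.0, 0.0, 1631.5],
              [0.0, 1708.0, 1223.5],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.096, 0.185, -0.010, -0.002, 0.0])   # k1, k2, p1, p2, k3

# Feature locations detected in the original (distorted) frame, shape (N, 1, 2).
features = np.array([[[100.0, 200.0]], [[1500.0, 900.0]]], dtype=np.float32)

# Correct only these coordinates; P=K keeps the output in pixel coordinates.
corrected = cv2.undistortPoints(features, K, dist, P=K)
print(corrected.reshape(-1, 2))
```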

Doing the calibration in advance means that we cannot use zooming, since it changes the intrinsic parameters. In scene investigation this is fortunately a fair assumption, as users would move close to an object to observe it and take pictures of it. If zooming is desired, methods exist to estimate the camera's intrinsic parameters during the reconstruction, such as [28]. This, however, makes the reconstruction algorithm more complex, and thus noticeably less accurate.

2.3 Feature Processing

The first step in 3D reconstruction is to detect and match features in different frames. Until now, the features used in structure recovery processes are points and lines. The three main steps in feature processing (Figure 2.4) are detection, description, and initial matching.

Finding features in an image is done using a detector. The most important information a detector gives is the location of features. It can also return other characteristics such as the scale and orientation. Two characteristics of a good detector, as defined in [134], are repeatability and reliability. Repeatability is the ability to detect the same features in different images. Reliability means that the detected features are distinctive enough so that the number of candidates for matching is small. For 3D reconstruction, location accuracy is also important, since in a complex reconstruction process small errors might be accumulated or magnified, resulting in a bad final result.

Figure 2.4: The feature detection and matching process.

Now suppose we have two images of the same scene and their features. To find corresponding pairs of features, we need feature descriptors. The feature description process takes a feature detected in the previous step and produces descriptive information. This descriptor is usually represented by a vector. Features in different frames are matched by comparing their descriptors. A good descriptor should be invariant to rotation, scaling, and affine transformation, so that the same feature in different images is characterized by almost the same value, and should be reliable in reducing the number of possible matches [104]. In addition, to deal with a large number of features and high-dimensional descriptor vectors, we need an efficient initial matching strategy.

Research on interest point and line detection and description is summarized separately in Sections 2.3.1 and 2.3.2. Matching of both kinds of features is discussed in Section 2.3.3, since they share the same principles.

2.3.1 Interest Points

In this document, a point feature is called an interest point; in other work it is also often referred to as a keypoint (e.g. [93]).

Point detection

Schmid et al. classify point detectors into three categories: contour based, intensity based, and parametric model based [134].

Contour based detectors first extract contours from images, then find points that have special characteristics, e.g. junctions, endings, or curvature maxima. A multi-scale framework can be utilized to get more robust results.

Intensity based detectors find interest points by examining the intensity change around points. To measure the change, first and second derivatives of frames are used in many different forms and combinations.

Parametric model based detectors find interest points by matching models or templates, for example of L-corners, to a frame.

Many point detectors have been proposed. The Harris corner detector [59] is well known and is invariant to rotation and partially to intensity change. However, it is not scale invariant. Scale invariant detectors such as [93, 102] search for features over scale space. Lowe's SIFT [93] searches for local extrema of the Difference of Gaussians in space and scale. Mikolajczyk and Schmid [102] use Harris corners to search for features in the spatial domain, then use a Laplacian in scale to select features invariant to scale. An affine invariant detector is proposed by Tuytelaars and Van Gool [158]. Starting from a local intensity maximum, it searches along rays through that point to find local intensity extrema. The link formed by those extrema defines an interest region that is approximated by an ellipse. By searching along many rays and using ellipses to represent regions, the detected regions are affine invariant. Their experiments show that the method can deal with view changes of up to 60 degrees.

Figure 2.5: A feature under different transformations: the original corner feature, rotated, rotated and scaled, and affine transformed.

Repeatability is evaluated by the ratio of the number of repeated points over the total number of detected points in the common part of two frames. The reliability of a detector is measured by the diffusion of the local jets, a set of image derivatives, of a large number of interest points. The more diffuse the values, the more reliable the detector; this diffusion is measured using entropy. Among the detectors examined in [134], the Improved Harris corner, which improves on the original by employing a more appropriate differential operator, is the best in terms of both repeatability and reliability. The evaluation in [134] does not cover scale and affine invariant detectors.

Mikolajczyk et al. give a comparison of affine detectors [105]. Instead of the diffusion of descriptors, this comparison evaluates reliability using the matching score, "the ratio between the number of correct matches and the smaller number of detected regions in the pair of images". On average the Hessian-Affine [103] detector performs best, where we should note that SIFT [93] uses approximately the same detection scheme as Hessian-Affine. MSER [99] is the most repeatable in terms of viewpoint change, which is important in reconstruction. A more recent, elaborate comparison focusing on geometric applications, by Henrik Aanæs et al. [2], confirms the superior performance of SIFT [93] and Hessian-Affine [103], but reveals that MSER [99] does not perform as well as found in [105]. In their comparison they also found that the simple Harris corner detector [59] performs very well when the scale change is small.

Note that under scale and affine transformations, a point is usually no longer a point but becomes a region. So in the literature, for robust detectors we see "interest regions" instead of interest points. When matching regions for 3D reconstruction, we can simply use the centroids of the regions for computation.

Speed is a criterion not mentioned in the above evaluations. In practice it can be important if we handle large amounts of data, e.g. when doing reconstruction from videos. There are several efficient implementations [157] of detectors; for instance, SURF [12] is an efficient detector inspired by SIFT [93]. Those implementations significantly improve the speed without sacrificing other criteria.

Location accuracy is another missing criterion in existing evaluations of detectors. Location accuracy is very important in 3D reconstruction [157]: because of the complex reconstruction process, small errors in the location of interest points can be magnified into large errors. In [110] the relation between intensity noise and the certainty of the interest point location is derived for the Harris corner detector [59], which is rather outdated. More investigation into the location accuracy of state-of-the-art detectors, e.g. SIFT [93], should be done.
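For concreteness, a small detection sketch (assuming an OpenCV build in which SIFT is available, e.g. opencv-python 4.4 or later; the image path is a placeholder):

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Blob-like interest points: SIFT returns location, scale, orientation and a descriptor.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Corner-like interest points via the Harris measure.
corners = cv2.goodFeaturesToTrack(img, maxCorners=500, qualityLevel=0.01,
                                  minDistance=10, useHarrisDetector=True)

print(len(keypoints), "SIFT keypoints,",
      0 if corners is None else len(corners), "Harris corners")
```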

Point description

Point descriptors are classified by Mikolajczyk and Schmid [104] into the four following categories:

Distribution based descriptors. Histograms are used to represent the characteristics of the region. The characteristics could be pixel intensity, distance from the center point [86], relative ordering of intensity [168], or gradient [93].

Spatial-frequency descriptors. These techniques are used in the domain of texture classification and description. Texture description using Gabor filters is standardized in MPEG-7 [126].

Differential descriptors. A set of local derivatives is used to describe an interest region. The local jet, used in [134] to evaluate the reliability of detectors, is an example.

Moments. Van Gool et al. [161] use moments to describe a region. The central moment of a region in combination with the moment's order and degree forms the invariant.

The invariance of descriptors is obtained in many ways for different changing factors. For example, in [93, 102] maxima of local gradients with different directions are used to identify the orientation. Other sets of rotation invariants can be used to characterize the region, e.g. the Fourier-Mellin transformation used in [11]. Scale and skew determined in the detection phase are used to normalize image patches in [11, 131].

Mikolajczyk and Schmid [73, 104] have done an evaluation of descriptors. Two criteria are used: the Receiver Operating Characteristic and the recall. The first is the ratio of the detection rate over the false positive rate, and the second is the ratio of correct matches over possible matches. Qualitatively, the results for the two criteria are the same. The SIFT descriptor [93], which is invariant to scaling and partially to view change, and SIFT-based methods [77, 104] are in the top group. This evaluation also shows that using region-based detectors, in particular the scale and affine invariant detectors, gives slightly better results.


Conclusion

We have summarized various methods to detect and to describe interest points, as well as comparisons between them. Choosing a detector somewhat depends on the type of input scenes. SIFT and SIFT-based descriptors, such as [66], are attractive because of their performance. The location accuracy of interest points has not been studied, although it has been identified as an important criterion, especially for 3D reconstruction. Thus, in order to apply 3D reconstruction in scene investigation, we need to study the location accuracy of interest points in more depth.

2.3.2 Lines

In practice one type of feature cannot cover all possible situations. For instance, detectors like SIFT [93] and SURF [12] detect blob-like structures, which are plentiful in natural scenes but not in man-made scenes. Lines also provide more choice of geometric constraints for structure and motion recovery. In this sub-section we discuss line detection and matching for 3D reconstruction.

Line detection

Line detection includes two steps: first edge detection, and then line extraction. Edge detection finds the pixels potentially belonging to a line. Parameterized lines are then extracted from those pixels in the line extraction step.

Many edge detection schemes are available in the literature [137]. Edge detection is based on intensity change. Edge detectors usually follow the same routine: smoothing, applying edge enhancement filters, applying a threshold, and edge tracing.

Evaluations of edge detectors are inconsistent and not convergent [137, 47], for reasons such as unclear objectives and varying parameters. Shin et al. have done an evaluation using structure from motion as the black box to test edge detectors [137]. It shows that overall the Canny detector [22] is most suitable because of its performance, fast speed, and low sensitivity to variations in parameters. There are, however, weak points to this evaluation. The structure from motion algorithm described in [151], which is used in the evaluation, is based on the two-view geometric constraint. This cannot fully show the potential of using lines, which should be shown with three-view constraints. The second weak point is that the "intermediate processing", i.e. parameterizing lines and matching them, is fixed, which affects the final result. Thus, the result is not complete enough to draw a conclusion.

Extracting lines can be done in several ways. The Hough transform [71] is famous for curve fitting; despite having a long history, the Hough transform and its extensions are still used [106, 6, 112]. A simpler approach connects line segments with a limit on angle changes and then uses the least median of squares method to fit the connected paths into lines [145, 137]. Robust estimators such as RANSAC [45] are also popularly used to fit lines to the points of segments. We have not found a complete and concrete evaluation of line extraction.
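A short sketch of this two-step scheme, using the Canny detector followed by a probabilistic Hough transform (OpenCV-based; the thresholds and the image path are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # placeholder path

# Step 1: edge detection (smoothing, gradient thresholding, edge tracing).
edges = cv2.Canny(cv2.GaussianBlur(img, (5, 5), 1.5), 50, 150)

# Step 2: line extraction with the probabilistic Hough transform.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=40, maxLineGap=5)
print(0 if lines is None else len(lines), "line segments")
```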


Line description

Lines can be matched based on attributes such as orientation, length, extent of overlap, or distance. Optical flow can be employed in case the difference between two views is small [83].

An evaluation of line description and matching algorithms for reconstruction is missing in the literature. Authors give evaluations to compare their work with previous work, e.g. [83], but those are not complete enough to draw a conclusion.

Lines give stronger relations between views. Lines are plentiful and easy to extract in scenes dominated by man-made objects. One of the first works that uses line correspondences and trifocal tensors is that of Beardsley et al. [14]. In their work, lines are not used directly; point correspondences are used first to recover the geometry information. The potential of using line features is confirmed in [123]: using both lines and points as input features, it is shown that trifocal tensors and camera motion can be recovered more accurately.

2.3.3 Initial Matching Strategy

Features are initially matched across frames based on their descriptors. The simplest method is exhaustive search, optionally with a limit on the motion vector length. This approach is obviously slow when matching a large number of features in many frames.

When the number of features is large, a KD-tree [49] can be used to find matches with a binary search strategy. But the KD-tree becomes inefficient when the descriptor vectors are large; for example, SIFT descriptors are 128-element vectors. Best Bin First (BBF) [15] and Fast Filtering Vector Approximation (FFVA) [163] approximate the KD-tree search to gain efficiency while searching a large database.

Finally, the problem can be formulated as a bipartite graph matching problem: finding the set of edges with the greatest sum of weights from one set of nodes to the other, given that each node can connect to only one other node. One must be aware that in this formulation the total cost is optimized, while what we really need is the number of correct matches; though the two are related, they are not the same.

Matching groups of close features is more accurate and efficient than individual matching. This has been shown for both points [50] and lines [162].

When matching a pair of images, exhaustive search can be good enough, and can even outperform BBF. This is to be expected, as tree-based search is only efficient when the number of candidates is much larger than the number of dimensions. Descriptors are usually high-dimensional, so if the number of features is not large, tree-based search is not efficient. It is, however, clearly better when searching a large pre-indexed dataset, and we expect such methods to also perform well when matching an image against a large set of unstructured images.
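To illustrate the two strategies for an image pair, the sketch below matches SIFT descriptors both exhaustively and with FLANN's approximate KD-tree, and then applies Lowe's ratio test; parameter values and image paths are illustrative.

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Exhaustive search: usually good enough for a single pair of images.
bf_matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)

# Approximate KD-tree search (FLANN): pays off for large, pre-indexed datasets.
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 4}, {"checks": 32})
flann_matches = flann.knnMatch(des1, des2, k=2)

# Lowe's ratio test keeps only distinctive matches.
good = [p[0] for p in flann_matches
        if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
print(len(good), "putative matches")
```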

Matches produced here are initial and may contain outliers (Figure 2.6a). Validation of those matches is embedded within the geometric constraint computation. Beardsley et al. [14] use the geometric constraints in both the two-view and three-view cases to match lines. Schmid and Zisserman extend this idea to matching both lines and curves [135]. This technique is now standard in reconstruction.
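Continuing the previous sketch, the standard way to embed this validation is a RANSAC estimation of the fundamental matrix, keeping only matches consistent with the epipolar geometry (again a sketch, not the thesis's implementation; kp1, kp2, and good come from the matching snippet above).

```python
import cv2
import numpy as np

# Pixel coordinates of the putative matches from the previous sketch.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Robustly estimate F; the mask flags matches that satisfy x'^T F x = 0 within tolerance.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC,
                                        ransacReprojThreshold=1.0, confidence=0.99)
inliers = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print(len(inliers), "geometrically consistent matches")
```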

Figure 2.6: (a) Correspondences initially matched based on SIFT descriptors; note the inconsistent motion field showing outliers. (b) After applying the geometric constraint technique all matches are correct.

2.3.4 Summary and Conclusion

We have summarized the literature on feature detection and matching for the structure recovery problem. Two kinds of features are examined: points and lines.

With points, many detection and matching methods are available in the literature, with good evaluations [134, 105, 157, 2]. These evaluations, however, do not include a criterion on accuracy, which is very important in the noise-sensitive process of 3D reconstruction. Efficiency is also rarely considered as a criterion in the literature, but it is important in practice, especially if we want to do reconstruction from videos.

Using line features improves accuracy [123]. However, line detection and matching schemes and their evaluations are not available at the proper level, especially not in the context of 3D reconstruction. This should be explored more, as lines are plentiful in man-made structures and thus frequently encountered in scene investigation.

2.4 Structure and Motion Recovery

The structure and motion recovery step estimates the structure of the scene and the motion of the cameras. Taking feature correspondences as input, it estimates the geometric constraints among views. Then the projection matrices that represent the motion information are recovered. Finally, the 3D coordinates of the features, i.e. the structure information, are computed via triangulation [65]. Figure 2.7 summarizes the steps.


Figure 2.7: Details of the structure and motion recovery step.

In the following subsections we first discuss the problem of structure and motion recovery from multiple images, which is quite standard now. Then we discuss the benefits and the problems of using large amounts of data, typical in reconstruction from video sequences.

2.4.1 Multiple View Geometry and Stratification of 3D Geometry

This subsection gives a brief overview of multiple view geometry and the concept of geometric stratification that are required to understand the following subsections.

Multiple view geometry

The research in 3D reconstruction from multiple views started with two views. This is quite natural since humans also see the world through a two-view system. Initial research assumed calibrated cameras, i.e. the intrinsic parameters of each camera and the relative positions of two cameras, if a stereo system is employed, are known. All of those parameters are acquired via a calibration process (see Section 2.2).

For the calibrated case, the essential matrix $E$ [69] is used to represent the constraints between two normalized views. Given the calibration matrix $K$ (a 3×3 matrix that encodes the focal length, aspect ratio, and skew of the camera), the view is normalized by transforming all points using the inverse of $K$: $\hat{x} = K^{-1}x$, in which $x$ is the 2D coordinate of a point in the view. The new calibration matrix of the view is then the identity. For a corresponding pair of points $(x, x')$ in homogeneous coordinates, $E$ is defined by the simple equation $\hat{x}'^T E \hat{x} = 0$.
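A self-contained sketch of the calibrated two-view case: synthetic points are projected into two views, $E$ is estimated from the correspondences, the relative pose is recovered, and the points are triangulated. All names and values are illustrative; the recovered translation is only defined up to scale.

```python
import cv2
import numpy as np

# Synthetic ground truth so the example is self-contained.
rng = np.random.default_rng(0)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
X_true = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))   # points in front of both cameras
R_true, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))          # small rotation about the y-axis
t_true = np.array([[1.0], [0.0], [0.0]])                      # baseline along x

def project(X, R, t):
    x = (K @ (R @ X.T + t)).T
    return x[:, :2] / x[:, 2:]

pts1 = project(X_true, np.eye(3), np.zeros((3, 1)))           # first camera at the origin
pts2 = project(X_true, R_true, t_true)

# Essential matrix from the correspondences, then relative pose.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# Triangulate with P = K[I|0] and P' = K[R|t]; dehomogenize the 4xN result.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
X = (X_h[:3] / X_h[3]).T                                      # reconstructed up to scale
print(np.round(R, 3), np.round(t.ravel(), 3))
```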

In the uncalibrated case, the essential matrix is extended to a new concept. During the 1990s, the concept of the fundamental matrix $F$, or bifocal tensor, was introduced and well studied by Faugeras [108] and Hartley [62]. The $F$ matrix is the generalization of $E$ and the defining equation is very similar: $x'^T F x = 0$. The difference is that in uncalibrated reconstruction the $K$ matrix is unknown and thus the view coordinates cannot be normalized; therefore $x$ is used in the equation instead of $\hat{x}$. $F$ is still "fundamental" for research in multiple view geometry, since it is simple yet very informative. Its relations with other ways of expressing constraints can be found in [61]. Some principal concepts in two-view geometry, or epipolar geometry, are explained in Figure 2.8.

Figure 2.8: Two-view geometry. $X$ is a 3D point; $x$ and $x'$ are its projections. $C$ and $C'$ are the camera centers. The line connecting them is called the baseline. Together, $X$, $C$, and $C'$ define the epipolar plane. $l$ and $l'$ are the epipolar lines of the two projections of $X$. The projections of the camera centers on the other views, $e$ and $e'$, are called epipoles.

Three-view geometry was also developed during the 1990s. A three-view constraint, represented by a trifocal tensor $T$, captures the relation among the projections of a line in three views. Trifocal tensors define a richer set of constraints over views: apart from a line-line-line correspondence, they also define point-line-line, point-line-point, point-point-line, and point-point-point constraints. Furthermore, they introduce the homography to transfer points between two views. Unlike the fundamental matrix, which defines a point-to-line relation, i.e. a one-to-many relation, line correspondences defined by trifocal tensors are one-to-one. This is one of the advantages of trifocal tensors [61, 44]. Trifocal geometry is explained in Figure 2.9.

Stratification of 3D geometry

Focal tensors (F or T ) form the constraints among multiple views. But from them to the final structure and motion is a long way.

Motion information, i.e. the camera parameters at a view, is either intrinsic or extrinsic. Intrinsic parameters are the focal length, skewness, etc. Extrinsic parameters are the position and orientation. In a stereo vision system, such as the human vision system, the two "cameras" can be fully calibrated, that is, all intrinsic and extrinsic parameters are known. In a mono vision system, the case in reconstruction with a handheld camera, the extrinsic parameters are always unknown in advance. If the camera is calibrated in advance (see Section 2.2), the reconstruction process is called calibrated reconstruction; otherwise it is called uncalibrated reconstruction.

Figure 2.9: Line correspondences among three views are the basis to define three-view tensors. Points on line $l$ are transferred to points on line $l'$ by the homography induced by the plane $(C', l')$ (Figure 15.1 in [61]).

In the uncalibrated case, no prior calibration is used; the missing information must be recovered via a further step called self-calibration, which is discussed in Section 2.4.3. For better comprehension, we should first understand the concept of geometric stratification, introduced by O. Faugeras in [43].

The space we are familiar with is the Euclidean space, in which all familiar concepts like absolute length, angle, ratio, and parallelism exist (Table 2.1). Taking away the concept of length, we have the metric space. If the concept of angle is taken from the metric space, we have the affine space, in which parallelism, ratio, and centroid exist, but we cannot measure angles. Since there is no way to tell angles apart in this space, for example, a rectangle is just the same as a parallelogram. The least informative space is the projective space, in which the concepts of parallelism, ratio, and centroid do not exist, while tangency and cross ratio still exist.

From uncalibrated images, scene geometry is recovered step by step from projective space up to Euclidean space. This process uses invariant objects. For example, the plane at infinity is an invariant object that helps upgrade from the projective to the affine space. The plane at infinity is where all parallel lines meet, hence the word infinity. In the projective space it is, however, not at infinity. If we find it in the projective space and transform the space so that the plane at infinity is indeed at infinity, then we have upgraded to affine space.

Characteristics of the geometric spaces are summarized in Table 2.1 and can be found in [114] or [61].

Since we are looking for a flexible method, we continue the discussion on structure and motion recovery in the uncalibrated case. The calibrated reconstruction is similar, since knowing the intrinsic parameters does not help to skip any upgrading step, but it does make the problem more constrained and the results more accurate.

Table 2.1: Characteristics of geometric strata [114, 61]. Transformations are defined by homogeneous coordinate matrices: T is a 4 × 4 invertible matrix, A is a 3 × 3 invertible matrix, R is a 3 × 3 rotation matrix, s is a scaling factor, and t is a 3D translation vector. A "+" means the invariants of the stratum above are kept and the listed ones are added.

Stratum      DoF   Transformation            Invariants
Projective   15    T                         Intersection, tangency of surfaces, cross-ratio
Affine       12    [ A  t ; 0^T  1 ]         + Parallelism, centroid, plane at infinity
Metric        7    [ sR  t ; 0^T  1 ]        + Relative distance, angle, absolute conic
Euclidean     6    [ R  t ; 0^T  1 ]         + Absolute distance

2.4.2 Projective Structure and Motion

Having only knowledge of feature correspondences, the most elaborate reconstruction we can obtain is a projective reconstruction. There are infinitely many ways to obtain projection matrices from a focal tensor. Methods, implementation hints, and evaluations of focal tensor computation are well discussed by Hartley and Zisserman in [61].

The computation of the focal tensor at its simplest involves solving a linear equation system. If the input, i.e. the feature correspondences, includes outliers, robust methods such as RANSAC [45] or Least Median of Squares must be employed to reject them. Then iterative optimization, e.g. using Levenberg-Marquardt, should be used to improve the result. Choosing the error function for the minimization is very important, since the algebraic errors, i.e. the estimation errors computed directly from the geometric constraint equations, do not express geometric meaning. Geometric or Sampson distances are advised [61].

With the focal tensors recovered, a projective reconstruction is already available. There are many decompositions from tensors to projection matrices [95]. Commonly one assumes that the first camera projection matrix is $P_1 = [I \mid 0]$, where $I$ is the 3 × 3 identity matrix. By doing so we can derive the other view's projection matrix based on the constraint.

In case of more than two views, in order to have a consistent structure, the decomposition into projection matrices must be done with homographies induced from the same reference plane. This can be based on fundamental matrices [96] or trifocal tensors [9].

To avoid complex equations, one can use the additive structure building (ASB) method as in [121]. After obtaining the initial structure, new views are added one by one. A new projection matrix is computed from a linear equation system formed from correspondences between already reconstructed 3D points and their projections in the new view. A non-sequential adding strategy can be used to reduce the accumulated error.
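The core of adding a new view can be sketched as a robust resection (PnP) step from 3D–2D correspondences; note that this sketch uses OpenCV's RANSAC-based PnP rather than the plain linear system described above, and X3d, x2d, K, and dist are assumed inputs from the existing reconstruction.

```python
import cv2
import numpy as np

# Assumed inputs: X3d (Nx3 reconstructed points), x2d (Nx2 projections in the new frame),
# K (3x3 calibration matrix) and dist (distortion coefficients, or None).
ok, rvec, tvec, inliers = cv2.solvePnPRansac(X3d, x2d, K, dist, reprojectionError=2.0)

R_new, _ = cv2.Rodrigues(rvec)            # rotation of the newly added view
P_new = K @ np.hstack([R_new, tvec])      # its projection matrix, used to extend the structure
```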


Using the first view's coordinates as the world coordinate system simplifies the equations of the reconstruction, since the projection matrix of the first view is simply $P = [I \mid 0]$. However, it makes the computation unstable and sensitive to noise [153]. That is the motivation for factorization methods, which produce a consistent set of projection matrices directly from the correspondences. The first factorization algorithm was introduced by Tomasi and Kanade for orthographic projection. Sturm and Triggs then extended it to perspective projection [147]. Further developments to solve the problems of initialization, missing trajectories, and continuous reconstruction are given in [1, 98]. An evaluation is given in [148]. Plane-based calibration [159] uses plane features represented via homographies. It has higher accuracy than point-based factorization, and overcomes the missing trajectory problem.
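The core idea of factorization can be shown in a few lines for the orthographic (affine) case, assuming complete feature tracks; this is a toy sketch of the rank-3 decomposition, not the full pipeline, and the result is only defined up to an affine ambiguity that a metric upgrade would later remove.

```python
import numpy as np

def affine_factorization(W):
    """Tomasi-Kanade-style factorization sketch for affine/orthographic cameras.

    W: 2F x P measurement matrix stacking the x- and y-coordinates of P points
    tracked through F frames (complete tracks assumed).
    Returns motion (2F x 3) and structure (3 x P), up to an affine ambiguity.
    """
    W_centered = W - W.mean(axis=1, keepdims=True)        # remove per-row translation
    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    motion = U[:, :3] * np.sqrt(s[:3])                    # camera rows
    structure = np.sqrt(s[:3])[:, None] * Vt[:3]          # 3D points in an affine frame
    return motion, structure
```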

Theoretically, factorization gives better results compared to the structure update technique. Yet we have not found any explicit experimental verification of this. Whether factorization is more accurate and effective than ASB with a good frame selection is still an open question. Furthermore, the key to the advantage of factorization is having reliable tracks of features over frames, but those are difficult to obtain in practice.

Conclusion. Both methods, ASB and factorization, should be evaluated further. While factorization has been the research direction in recent years, ASB has more practical relevance: outliers and ill-conditioned views can be rejected at each step. Besides, since bundle adjustment, which optimizes all motion and structure at once, is usually used afterwards, the final result is almost the same.

2.4.3 Metric Structure and Motion

Upgrading a projective reconstruction to a metric one requires additional constraints [115]. The research on self-calibration ranges from methods with the strict assumption of knowing all fixed intrinsic parameters to flexible, practical ones with minimal and realistic assumptions, for example with only the condition that the pixel grid is square [67, 115, 124, 122].

Available methods. Many metric upgrade methods go directly from projective to metric space. Heyden [67] derives the solution from the projection matrix equation, and needs at least five known intrinsic parameters. Pollefeys [115] builds up the method from an analysis of the absolute quadric equation, an abstract object encoding characteristics of both the affine and metric strata. This method is quite popular. It is improved by employing prior knowledge on camera parameters in [117]. Its constraint enforcement problem is solved by Chandraker et al. [23] using Linear Matrix Inequality relaxation. Ponce et al. proposed a new abstract object for calibration: the Absolute Quadric Complex [122]. It has the advantage of decoupling skew and aspect ratio from the other intrinsic parameters. This is an appealing characteristic since we all use digital cameras that have rectangular or square pixels. In [159], after projective reconstruction via factorization of a matrix of homographies, a method to upgrade to metric space is presented. It is also based on the theory of the absolute quadric.

Hartley, on the other hand, starting from the observation that iteration is tricky [124] and that a direct upgrade method has difficulty with constraint enforcement [61], proposed a fully stratified method. The method first upgrades to the affine level by an exhaustive search for the plane at infinity. To limit the search space it employs the cheirality constraint, i.e. the reconstructed points must be in front of the camera [63]. After this the affine structure is upgraded to a metric one as described in [84].

Evaluation. We identify the four possibly best auto-calibration methods: Pollefeys et al. [117], Chandraker et al. [23], Ponce et al. [122], and Ueshiba and Tomita [159]. Unfortunately a complete comparison among them on simulated or real data is not available. As constraint enforcement is added, we expect the second method [23] to outperform the first [117]. But the evaluation in [23] on 25 real datasets uses only qualitative criteria, and the first turns out to be the winner. This may be caused by the numerical instability of the optimization software. The third method only outperforms the first one in simulation at a noise standard deviation of 3.5 pixels, which is quite high and unrealistic in our opinion. The fourth method has not been compared to any other.

Figure 2.10: A few images from a frame sequence and the recovered metric structure and motion. The structure (point cloud) is built from keypoints only and thus looks quite sparse. The result is generated using VisualSFM [167].

Conclusion. In summary, several methods exist for metric reconstruction, but a complete evaluation on robustness, accuracy, and flexibility does not exist. Some simulated results show that the average error is about 1 to 2 percent, while results from real data have errors of about 3 to 7 percent [121, 23]. This means that uncalibrated reconstruction may be of limited use for measurement and hypothesis validation, which need highly accurate models.

2.4.4 Degeneracy

Degenerate input is input from which it is impossible to make a metric reconstruction, either because of the characteristics of the scene or because of the capturing positions.

In practice, input captured by a person using a hand-held camera is rarely exactly degenerate. However, nearly degenerate input is common, for instance when a camera moves along a wall or on an elliptic orbit around the object. That is why studying degeneracy and detecting those cases is extremely important in creating a robust reconstruction method, or in selecting the most suitable method for the case.

The study of degeneracy started very early and is still the subject of recent research, e.g. by Kahl et al. [74]. Degeneracy is caused either by structure, by motion, or by a combination of both.

Structure degeneracy happens when the observed points and the viewpoints follow a certain rule. Motion degeneracy, on the other hand, depends only on camera motion, for example pure rotation, and thus can happen with any scene. Sturm [146] studies degeneracy for the case of fixed intrinsic parameters. He also suggested a "brute force" approach to select the best algorithm. Pollefeys gives a practical approach that examines the condition number of the equation system [113]. This helps to reject the degenerate case but does not give the proper reconstruction method.
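To illustrate the condition-number idea, a simplified sketch (not Pollefeys' exact formulation; the function name is our own) inspects the singular values of the eight-point design matrix built from the matched points:

    import numpy as np

    def epipolar_condition_number(x1, x2):
        """Condition number of the eight-point design matrix.

        x1, x2: (N, 2) arrays of matched (normalized) image points, N >= 8.
        A large ratio between the first and the eighth singular value
        signals a (nearly) degenerate configuration, e.g. most points on
        a single plane or a (nearly) rotation-only camera motion.
        """
        A = np.column_stack([
            x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
            x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
            x1[:, 0], x1[:, 1], np.ones(len(x1)),
        ])
        s = np.linalg.svd(A, compute_uv=False)
        return s[0] / s[7]

A rejection threshold on this ratio can then be used to discard ill-conditioned image pairs before attempting the reconstruction.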

In [117], Pollefeys et al. show how to deal with planar structure, the most common structure degeneracy. The method uses the General Robust Information Criteria [156] on the trifocal tensor to detect and handle degeneracy. It has been noticed that a scene with a dominant plane is a very common near-degenerate case [26, 48]. This is because most features are found on that plane. The fundamental matrix computation usually fails in this case. The problem is solved using Degenerate RANSAC [26] or Quasi Degenerate RANSAC [48], which test the degeneracy hypothesis during the computation.
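The idea behind these degeneracy-aware estimators can be sketched as follows: after estimating the epipolar geometry, check whether its inliers are almost all explained by a single homography. The sketch below only illustrates this idea; it is not the actual DEGENSAC or QDEGSAC algorithm, and the thresholds are arbitrary choices.

    import cv2
    import numpy as np

    def dominant_plane_degenerate(x1, x2, ratio_threshold=0.8):
        """Flag a near-degenerate two-view geometry caused by a dominant plane.

        x1, x2: (N, 2) float arrays of matched points in the two images.
        Returns True when most epipolar-geometry inliers also fit a single
        homography, in which case the fundamental matrix is unreliable.
        """
        F, f_mask = cv2.findFundamentalMat(x1, x2, cv2.FM_RANSAC, 1.0, 0.99)
        if F is None:
            return True  # estimation failed outright
        inl1 = x1[f_mask.ravel() == 1]
        inl2 = x2[f_mask.ravel() == 1]
        H, h_mask = cv2.findHomography(inl1, inl2, cv2.RANSAC, 3.0)
        if H is None:
            return False
        return h_mask.sum() / float(len(inl1)) > ratio_threshold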

In conclusion, detection of degeneracy is important in 3D reconstruction. The fewer camera parameters are known, the more ambiguous the reconstruction will be [113]. An important fact is that degeneracy cannot be identified in a completely automatic way. Hence for each application one should use the context knowledge as much as possible. Some systems explicitly request users to follow a strict capturing guideline to avoid degeneracy, e.g. Photosynth† or ARC3D. In scene investigation, since we do not want to limit the investigators' movement, we should find another way to overcome degeneracy or at least make the capturing guideline less strict.

2.4.5 Reconstruction from Videos

Research in 3D reconstruction started with the question of whether it is possible to do 3D reconstruction from a set of images. But for applications like scene investigation, it is more natural to use videos.

Using video sequences as input is a trade-off: sacrificing intra-frame quality, i.e. resolution and sharpness, for inter-frame quality, i.e. relatedness and overlap between frames. To compensate for the loss due to lower frame quality, we can exploit the inter-frame redundancy in the data. From a statistical point of view, more projections of a point means more samples, and the estimation more reliably converges to the true value. The best texture, found by selecting the best view or by super-resolution, can be used to improve visualization quality. Frame sequences also enable some techniques to deal with shadow, shading, and highlights [121].

One advantage of using videos is flexibility. Taking still images for reconstruction is troublesome, even for an expert. For example, assume that we know that the best move is going around an object, with an angular difference between consecutive views of about 15 degrees. Without measuring, it is hard to follow that guideline. Using a hand-held video camera, we only have to worry about the type of move, i.e. going around, and leave the frame selection to computers. In case of reconstruction with an unknown target, such as in crime scene investigation where at first we do not know which objects are evidence, using videos helps to avoid missing details.

† http://labs.live.com/photosynth/

It has been shown possible to reconstruct 3D models from large amounts of data [142, 5]. These works, however, use images of better quality than video frames. Also, in terms of the amount of input, the number of images used for each scene in these works (a few thousand) is far smaller than the number of frames in a video log. To take advantage of video input, we have to address some extra steps compared to reconstruction from still images. They are frame selection, sequence segmentation, structure fusion, and bundle adjustment.

• Frame selection. Among a large number of frames, selecting good ones will improve the reconstruction result. Good frames are those that have proper geometric attributes and good photometric quality. The problem is related to the estimation of the views' position and orientation and to photometric quality evaluation.

• Sequence segmentation. Reconstruction algorithms assume that a sequence is continuously captured. The sequence should be broken into proper scene parts, reconstructed separately, and fused later.

• Structure fusion. Results of processing different video segments, either generated by different captures or through segmentation, must be fused together to create a final unique result.

• Bundle adjustment. The reconstruction process includes local updates, for example feature matching and structure update, and biased assumptions, e.g. the use of a first-view coordinate system. Those lead to inconsistency and accumulated errors in the global result. There should be a global optimization step to produce a unique consistent result (formalized below).
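As referenced in the last item, the global optimization is bundle adjustment, which jointly refines all camera matrices P_i and 3D points X_j by minimizing the total reprojection error over the observed image points x_ij:

    \min_{\{P_i\},\,\{X_j\}} \; \sum_{i} \sum_{j} d\!\left(\mathbf{x}_{ij},\, P_i \mathbf{X}_j\right)^{2}

where d(·, ·) is the distance in the image plane and the sum runs only over the views i in which point j is visible.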

For all mentioned problems, solutions exist to some extent, yet no solution is perfect. For example, available bundle adjustment tools like [90] work well with a limited number of images, but get extremely slow when the number of input images increases.

2.4.6 Summary and Conclusion

There are many options for each step in the reconstruction process. Some steps are well evaluated while others need further evaluation. Both accuracy and robustness should be improved in order to make visual-based 3D reconstruction more applicable. Using more data, such as found in video, can improve the quality, but many problems also come along when using video sequences.

Aiming for a flexible application, we favor the ASB method for projective reconstruction and the Pollefeys et al. method [117] for metric upgrade because they are relatively simple to implement and have good performance.

Reconstruction from uncalibrated images gives flexibility. However, it has been observed that reconstruction from long uncalibrated sequences gives poor results, because of error accumulation and an issue called projective drift error [121, 29]. Thus, in practical settings, as soon as enough constraints are present, we should upgrade to a metric space. Then we can use a simpler algorithm to update the structure with other images [142, 41].

2.5 Model Creation

Once the structure and motion are recovered, we can proceed to model creation. From a set of calibrated frames and some geometric information of a scene, the problem of building a 3D model is called multi-view stereo reconstruction (MVSR). An overview and comparison of MVSR algorithms is given by Seitz et al. [136]. MVSR is a broad topic that involves image processing, multi-view geometry, and computer graphics. Instead of trying to cover all of its aspects, we present one commonly used class of methods, as used in the well-known work of Pollefeys et al. [121]. We relate characteristics and concepts to the general overview of Seitz et al. [136] and refer interested readers to that paper for a good overview of the topic.

Figure 2.11: Zoom-in of the model creation step of the reconstruction pipeline: rectification, stereo mapping (producing depth maps from rectified images and the recovered structure and motion), mesh building (producing a wire-frame model), and texture building (producing the textured 3D model).

According to the categorization in [136], the presented class of methods is image-based. It includes four sub-steps: rectification, stereo mapping, mesh building, and texture mapping (Figure 2.11). Rectification aligns scanlines between two images. Stereo mapping computes a dense matching map between points of different calibrated views. The scanline alignment speeds up this step. From matching maps, depth maps are recovered through triangulation. Then in the mesh building step, multiple depth maps are merged to create a polygon mesh. The final step, texture mapping, extracts textures from selected views and maps these onto the wire-frame model.

We will present each of those steps in the following sub-sections.

2.5.1 Rectification

Rectification is a pre-processing step typical for image-based MVSR methods. It exploits the epipolar geometry to align epipolar lines so that corresponding points will have the same y-coordinate in two frames. This makes the computation in the stereo mapping step faster.

The first class of rectification methods is planar rectification, e.g. [64]. Both images are projected onto a plane parallel to the baseline. This method is simple and fast. It, however, fails in the case of a forward-moving camera, which commonly happens in scene investigation, for example when moving along a street or corridor. In this case, planar rectification will create an unbounded image.

Figure 2.12: Polar rectification [116]. A point p is encoded by the pair (r, θ).

The second class of rectification methods is non-planar rectification. The first method in this class is cylindrical rectification, proposed by Roy et al. [129]. Images are projected on a cylinder whose axis is the baseline. The unbounded image size problem is solved this way. However, the cost is complexity, which is undesirable for what is just a preparation step. Pollefeys proposed a method called polar rectification [116] that solves the problem while keeping things simple. Each pixel is encoded by two components: the scan line that it lies on, which is an epipolar line, and its distance to the epipole. The method does not require projection of pixels but only scanning and re-encoding. A later work [109] refines this method to reduce feature distortion and completes the solution for the case of an epipole at infinity.
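A minimal sketch of the re-encoding step only (the full method must also transfer the angular parameterization consistently to the second image; the names below are our own):

    import numpy as np

    def polar_coordinates(points, epipole):
        """Re-encode pixel positions as (r, theta) around the epipole.

        points:  (N, 2) array of pixel coordinates (x, y).
        epipole: (2,) array with the epipole in the same image.
        Pixels on one epipolar line share theta, so resampling the image
        along lines of constant theta yields one rectified scanline per
        epipolar line.
        """
        d = points - epipole
        r = np.hypot(d[:, 0], d[:, 1])        # distance to the epipole
        theta = np.arctan2(d[:, 1], d[:, 0])  # direction of the epipolar line
        return r, theta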

If we want to use videos captured by investigators, for which any kind of movement can be expected, we should use polar rectification.

2.5.2 Stereo Mapping

The stereo mapping step establishes dense matching maps between images. From these, depth maps are computed using triangulation. Triangulation is well discussed in [65]. Here we focus on how to produce matching maps.
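Although the remainder of this section focuses on matching maps, the depth-recovery step mentioned above can be illustrated with a minimal linear (DLT) triangulation sketch; see [65] for more accurate alternatives:

    import numpy as np

    def triangulate_point(P1, P2, x1, x2):
        """Linear (DLT) triangulation of a single correspondence.

        P1, P2: (3, 4) camera projection matrices.
        x1, x2: (2,) pixel coordinates of the same point in the two views.
        Returns the 3D point in non-homogeneous coordinates.
        """
        A = np.stack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The solution is the right singular vector of A with the
        # smallest singular value.
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]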

Stereo mapping is not trivial, as shown by the number of papers and the different constraints and strategies covered in Scharstein and Szeliski's evaluation [132]. In the following paragraphs, we summarize their taxonomy and evaluation of stereo mapping methods.

Taxonomy. The traditional definition of stereo mapping considers only two rectified views, and the matching map is presented as the disparity map with respect to the reference image. It includes four subtasks:

1. Matching cost computation. The difference between any pair of pixels from the two images is computed using a cost function, for instance the squared intensity difference. The range of pixels in the current image to be compared to a pixel in the reference image is limited on geometric grounds, for example by the epipolar constraint.

2. Cost aggregation. Making a matching decision based on the cost of a single pixel is unreliable due to noise. Cost aggregation improves reliability by accumulating the cost over a neighborhood of the pixel. Aggregation in many cases enforces local smoothness (a combined sketch of subtasks 1 and 2 follows below).
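To make subtasks 1 and 2 concrete, the following brute-force sketch computes squared-difference costs on a pair of rectified grayscale images and aggregates them over a square window before a winner-take-all disparity choice; the window size and disparity range are arbitrary assumptions:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def ssd_disparity(left, right, max_disp=64, window=5):
        """Brute-force SSD block matching on rectified grayscale images.

        left, right: 2-D float arrays whose scanlines are aligned by
        rectification. Returns a disparity map with respect to the left
        (reference) image.
        """
        h, w = left.shape
        cost = np.full((max_disp, h, w), np.inf)
        for d in range(max_disp):
            # Subtask 1 -- matching cost: squared intensity difference
            # between the reference pixel and its candidate at disparity d.
            sq_diff = (left[:, d:] - right[:, :w - d]) ** 2
            # Subtask 2 -- cost aggregation: average the cost over a
            # window x window neighborhood to suppress noise.
            cost[d, :, d:] = uniform_filter(sq_diff, size=window)
        # Winner-take-all disparity selection per pixel.
        return np.argmin(cost, axis=0)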
