
UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

Semi-interactive construction of 3D event logs for scene investigation

Dang, T.K.

Publication date 2013

Link to publication

Citation for published version (APA):

Dang, T. K. (2013). Semi-interactive construction of 3D event logs for scene investigation.

General rights

It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations

If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or send a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.


Chapter

5

A Semi-interactive Panorama Based

3D Reconstruction Framework for

Indoor Scenes

We present a semi-interactive method for 3D reconstruction, specialized for indoor scenes, which combines computer vision techniques with efficient interaction. As starting point we use panoramas which, for their wide field of view, are popular for visualizing indoor scenes, though on their own they cannot convey depth. Exploiting user-defined knowledge, in terms of a rough sketch of orthogonality and parallelism in the scene, we design smart interaction techniques to semi-automatically reconstruct a scene from coarse to fine level. The framework is flexible and efficient: users can build a coarse walls-and-floor textured model in five mouse clicks, or a detailed model showing all furniture in a couple of minutes of interaction. We show reconstruction results on four different scenes. The accuracy of the reconstructed models is high, with around one percent error at full room scale. Thus, our framework is a good choice both for applications requiring accuracy and for applications requiring only a 3D impression of the scene.

This chapter has been published in Computer Vision and Image Understanding [32]


5.1 Introduction

A realistic 3D model of a scene and the objects it contains is ideal for applications such as giving an impression of a room in a house for sale, reconstructing bullet trajectories in crime scene investigation, or building realistic settings for virtual training [72]. It gives good spatial perception and enables functionalities such as measurement, manipulation, and annotation. One broad categorization of scenes is outdoor versus indoor. Outdoor scenes have been popular in many modeling applications [118, 142], especially creating models of urban scenes [120, 30]. Indoor scenes are prevalent in applications like real estate management, home decoration, or crime scene investigation (CSI), but research on them is limited, with some notable exceptions [138, 88, 42]. In this chapter we consider the 3D reconstruction of indoor scenes.

While in applications like real estate management a coarse model of a room is sufficient, other applications need more complete models. For instance, in CSI the model should be complete and show all the details in the crime scene, as any object is potentially evidence. Each application also requires a different level of accuracy. Home decoration, for example, does not need extreme accuracy, for its purpose is merely to give an impression of the scene. For the CSI application, the model should be as accurate as possible in order to make measurements and hypothesis validation reliable. Here we seek a framework that can create complete and accurate models for highly demanding applications such as CSI, as well as coarse models for less demanding applications.

3D models are often built manually from measurements and images using the background map technique. Modelers take images of the object from orthogonal views (top, side and front), and try to create a model matching those images. A measurement is required to scale the model of the object to the right size. Modelling from measurements and images is only suitable for simple scenes, as complex scenes with many objects require a lot of measurements, images, and interaction. Even with measurements, accurately modeling objects is difficult, since the assumption that the line of view is orthogonal to the object is hard to meet in practice. Since manual reconstruction is cumbersome and time consuming [55], automatic or semi-interactive reconstruction is preferred.

Automatic methods do exist and have shown good results for isolated objects and outdoor scenes [121, 142, 23, 139, 160]. Those methods require a camera moving around and looking towards the scene to capture it from multiple viewpoints [46, 115, 117, 125]. Such moves maintain a large difference between viewpoints, giving accurately estimated 3D coordinates [65]. Unfortunately, in practice people tend not to follow such moves, making these methods inaccurate and unreliable. Indeed, in the well-known PhotoSynth system it has been observed that quality suffers when users do not follow the appropriate moves [139]. In simple cases, when modelling single-object scenes, automatic methods give results with 2 to 5 percent relative error [119]. This is sufficient for visualization, but rather poor for measurements such as in CSI applications. In indoor scenes, where space is limited, the situation is even worse, as it is difficult, if not impossible, to perform the capturing moves suitable for automatic reconstruction. So, automatic reconstruction methods in their current state are not sufficient for accurate indoor scene reconstruction.

Semi-interactive methods are potential solutions [34, 39, 57]. A small amount of interaction, helping computers identify important features, makes reconstruction more reliable.

A few mouse clicks are enough to build a coarse model [42]. Recent work, such as the VideoTrace system [160], shows that interaction can be made smart and efficient by exploiting automatically estimated geometric information.

While interaction helps to efficiently improve the reliability, there is still the problem of having limited space to move around in indoor scenes. Using panoramas is a potential solution. Panoramas give a broad field of view, so a few panoramas are enough to completely capture a scene, and moving around the scene is no longer a problem. Furthermore, building panoramas is reliable, so using panoramas contributes to the reliability of the overall solution. The advantages of interaction on the one hand and panoramas on the other suggest that a combination of them would be a good solution for indoor scene reconstruction.

Following the above observations, we propose a multi-stage, semi-interactive, panorama based framework for indoor scenes. In the first stage, a coarse model is built. This stage extends the technique in [42]. We make the interaction more efficient by providing a smart interaction technique, and rectify panoramas to guarantee that the accuracy meets our target quality. Furthermore, we give a reconstructability analysis and, based on it, present a capture assistant to guide the placement of the camera. The results of the first stage, a coarse model and geometric constraints, facilitate efficient interaction to build a detailed model in the second stage. This framework overcomes the problems mentioned and makes it easier to create accurate and complete models.

In the next section we summarize related work. Section 5.3 gives an overview of our framework. Section 5.4 describes how to turn panoramas into a floor-plan and how to build a coarse 3D model. Section 5.5 describes the interaction to add details to the coarse model. Then we evaluate the accuracy and show how efficient the framework is. We close the chapter with a discussion on how to further automate the framework.

5.2 Related Work

5.2.1 Reconstruction from panoramas

A panorama is a wide-angle image, typically generated by stitching images taken from the same viewpoint [149]. Since panoramas cover a wide view, they must be mapped onto a cylinder or sphere for viewing; accordingly, they are called cylindric or spherical panoramas. Being wide-angle, panoramas give a good overview of a scene, especially in indoor scenes where the field of view is limited. On the other hand, they do not give a good spatial perception, since the viewpoint is fixed at one point. There is work on creating panoramas using multiple viewpoints, called multi-perspective panoramas [88, 170, 164]. However, multi-perspective panoramas only yield a 3D impression from the original viewpoints. Other methods are needed to make real 3D models.

3D reconstruction from panoramas is found in [138, 88, 57]. In [138], a scene is modelled from geometric primitives, which are manually selected in panoramas of the scene. Reconstruction is done separately for each panorama, and then the results of different panoramas are merged together. In [88] a dense 3D point cloud is estimated from multi-perspective panoramas. It, however, requires a special rig for capturing the panoramas. In [57] a method to do reconstruction from a cylindric panorama is proposed. It assumes that the scene, e.g. a


room, is composed of a set of connected rectangles. This method requires that all corners of the room are visible, which is often not the case in practice. In [42], a method to reconstruct an indoor scene from normal single-perspective panoramas is described. The result is a coarse 3D model including walls onto which panoramas are projected. Such a model is not sufficient for some applications such as CSI, but this simple and flexible method gives good intermediate results towards building a detailed model.

5.2.2 Interaction in reconstruction

There are many types of interaction in reconstruction. In the simplest case users define geometric primitives, such as points, lines, or pyramids, and match these to the image data [34]. In [39], quadric surfaces are used to support more complex objects. VideoTrace [160] lets users draw and correct vertices of a model in an image sequence.

The efficiency of interaction can be improved by exploiting what is already known about the scene. The guiding principle is to obtain as many geometric constraints as possible, and use them to assist the interaction. These constraints can come from domain knowledge, from the user interacting with the model, or from automatic estimation by the system, each of which we will now briefly describe.

Domain knowledge in the form of prior knowledge about the type of scenes to be reconstructed is helpful in designing efficient interaction. For example, when modelling man-made scenes we can assume that parallel lines abound; thus, vanishing points are helpful in constraining the interaction [138, 27, 166]. In urban scenes there are often repeated components such as windows; hence, instead of modelling them separately, the user can copy them [39]. In a man-made scene, objects are stacked on top of each other, e.g. a table is on the floor and books are on the table. We can exploit this to reduce the interaction and improve accuracy [55].

Scene specific geometric constraints can be provided by users. In [55], users define how an object should be bound to another one, to reduce the degrees of freedom in the interaction to reconstruct that object. In [42], after roughly defining a room by a sketch, users can build a coarse model with a few mouse clicks.

Some geometric constraints can be reliably estimated by computers. In some cases, coarse 3D structure and camera motion information can be estimated. State-of-the-art interactive reconstruction systems, including [160, 139], take advantage of such information sources to create intuitive and efficient interaction. For example, in the VideoTrace system [160], vertices drawn in one frame by the user are tracked and rendered in the other frames by the system. Users browse forward or backward in the video sequence to correct those vertices until satisfied. For the user this feels like refining a model rather than creating it from scratch.

In practice, those three sources of constraints are often mixed in the modelling flow, which is also what we will do in this chapter.

5.3 Framework Overview

Our framework is an A-to-Z solution, from capturing an indoor scene to modelling it, which is summarized in Figure 5.2.

Figure 5.1: Illustration of the input and (intermediate) results of the reconstruction process, using a simple rectangular room as example: (a) a rectangular room; (b) the (unwrapped) panorama of the room, taken from a viewpoint inside it; (c) the walls-and-floor model; (d) adding more detail to the model.

The framework takes as input a sketch of the floor-plan: a top-down drawing of a room, made by the user, that describes its walls and their relative positions. The capture planning module analyzes the sketch to tell the user how many panoramas are needed to completely capture the scene, and suggests camera placement, i.e. the appropriate viewpoints. Either calibrated or uncalibrated cameras can be used, but to guarantee good accuracy, we advise pre-calibrating the camera and correcting the lens distortion of the images before stitching them into panoramas. Users can use a software package of their own choice to estimate the camera motion and stitch the corrected images together into panoramas, for example Hugin† (Figure 5.1b).

To build a coarse model of a room, the user picks the corners, i.e. the intersections of walls, in the panoramas. The framework provides a smart corner picking method to make the interaction comfortable. The locations of the corners in the panoramas and the sketch are enough to estimate the correct floor-plan and build a coarse model of the scene [42] (Figure 5.1c). More expressively, we call this coarse model, which includes textured walls and floor, a walls-and-floor model. A typical rectangular room needs only one panorama to build such a model, while irregular rooms may need more than one panorama, depending on the shape of the room and the viewpoints of the panoramas. This stage is discussed in detail in Section 5.4.

In order to add more detail efficiently, we exploit the geometric constraint resulting from the observation that indoor scenes contain many flat objects aligned to walls. We iteratively use known surfaces to guide an interaction type that we call perspective extrusion to add objects. This technique helps to quickly build a detailed model (Figure 5.1d). Details of this stage are given in Section 5.5.

Figure 5.2: Overview of the proposed framework. Sketching the floor-plan, identifying corners, and identifying object parameters are (semi-)interactive steps; capture planning, building panoramas, building the walls-and-floor model, and adding objects to the model are automated; the results are the walls-and-floor model and the detailed model.

5.4 Building a Walls-And-Floor Model

In this section we discuss methods for building a walls-and-floor model. For easier comprehension, we present the floor-plan estimation and other elements prior to the capture planning. For the moment, we assume that the set of panoramas given is sufficient for floor-plan estimation.

We let the user draw a sketch of the floor-plan indicating orthogonality and parallelism of walls, and use a method built upon that in [42] to estimate an accurate floor-plan. This method is based on the observation that the horizontal dimension of the panoramic image is proportional to the horizontal view angle of the panorama. Thus a set of corners divides the panorama into horizontal view angles of known ratio. Provided that each panorama looks all around the room, the total horizontal view angle is 360 degrees, without any measurement needed. Hence we know each horizontal view angle. This observation is valid when the corners are perfectly aligned to the vertical dimension. Thus, to make a more accurate floor-plan estimation than in [42], we first rectify the panoramas to meet that condition.
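As a concrete illustration of this observation, the view angles can be computed directly from the corners' horizontal pixel coordinates; the sketch below (function name and input format are ours, not part of the framework) assumes a full 360-degree cylindric panorama:

```python
import math

def corner_angles(corner_xs, panorama_width):
    """Horizontal view angles between consecutive corners.

    In a full 360-degree cylindric panorama the horizontal pixel
    coordinate is proportional to the view angle, so the angle between
    two corners is their pixel distance scaled by 2*pi / width.
    """
    xs = sorted(corner_xs)
    angles = []
    for a, b in zip(xs, xs[1:] + [xs[0] + panorama_width]):
        angles.append((b - a) * 2.0 * math.pi / panorama_width)
    return angles  # the angles always sum to 2*pi
```

For example, a viewpoint at the center of a square room, with corners at x = 0, 250, 500, 750 in a 1000-pixel-wide panorama, yields four angles of 90 degrees each.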

Building 360-degree panoramas is well studied [149], so we do not discuss it here. For the next step, indicating corners in panoramas, we provide smart corner picking. Rectifying panoramas and estimating the floor-plan are subsequently discussed below. Then we present the reconstructability analysis and the capture assistant.

1. Let the user pick a point in/near a corner in the panorama.

2. Find the best image, according to Eq. (5.1).

3. Perform Canny edge detection in a vertical band, one tenth of the image width wide, around the picked point.

4. Fit a line through the picked point and the edges using RANSAC, where the line must go through the picked point.

5. Optimize the line without constraining it to the picked point.

Table 5.1: Smart corner picking process.

5.4.1 Smart corner picking

In order to estimate the floor-plan, coordinates of the top-down projections of corners are needed. As panoramas may not be well aligned, getting one point on a corner is not enough. Instead we need to identify a corner by a line segment. One way to do that is to ask a user to manually draw a line onto a panorama. To make it even simpler, we provide a utility to let users just casually pick a point in a panorama and the system will automatically identify the corner line.

Since the straightness of lines is not preserved in the coordinate system of a panorama, here a cylindric one, we must project a user-picked point into one of the images from which the panorama is created, to work in the image coordinate system. We assume that the best image is the one whose image plane is most orthogonal to the projection ray of the picked point. In other words, the angle between the ray from the viewpoint to the image center and the projection ray r_c of the picked point is smallest:

i_f = arg min_i ∠(r(i), r_c)    (5.1)

where r(i) is the principal ray of image i.

Since panoramas are usually approximately aligned, we limit the detection to a vertical image band around the picked point. We detect vertical edges around that point, and fit a line through the picked point and edge points using RANSAC [45]. The picked point is used here as an anchor to prevent the auto-detected line from moving to a wrong location. Since the picked point is not exactly at the right position, we afterwards relax the condition, optimizing the line without constraining it to go through the picked point, to yield the final line. The process is summarized in Table 5.1 and two examples are given in Figure 5.3.
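The anchored RANSAC step can be sketched as follows; since the line must pass through the picked point, each hypothesis needs only one sampled edge point (parameter names and defaults are illustrative, not the paper's):

```python
import math
import random

def ransac_line_through_anchor(anchor, edge_points, n_iters=200, tol=2.0):
    """Fit a line constrained to pass through `anchor`: each hypothesis
    is defined by one sampled edge point; keep the direction with the
    most inliers (edge points within `tol` pixels of the line)."""
    ax, ay = anchor
    best_dir, best_inliers = None, []
    rng = random.Random(0)  # fixed seed for reproducibility
    for _ in range(n_iters):
        px, py = rng.choice(edge_points)
        dx, dy = px - ax, py - ay
        n = math.hypot(dx, dy)
        if n == 0:
            continue
        dx, dy = dx / n, dy / n
        # perpendicular distance of each edge point to the anchored line
        inliers = [(x, y) for x, y in edge_points
                   if abs(-dy * (x - ax) + dx * (y - ay)) < tol]
        if len(inliers) > len(best_inliers):
            best_dir, best_inliers = (dx, dy), inliers
    return best_dir, best_inliers
```

The final unconstrained refinement of the paper (step 5 of Table 5.1) would then re-fit a free line to `best_inliers`, e.g. by least squares.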

5.4.2 Rectifying panoramas

To accurately estimate the floor-plan, we first rectify the panoramas so that corners are aligned to the vertical dimension for a cylindrical panorama.

Each corner together with the viewpoint defines a plane. These planes remain unchanged no matter how we move the coordinate system, since they are defined by the scene and the viewpoint. To align the panorama cylinder we need to find a rotation R that makes those

Figure 5.3: Two examples of smart corner picking. (a) The user picks a point. (b) Edges are detected in a vertical image band; a line is fitted through the picked point and the edges. Note that there is another (even longer) vertical line, but the algorithm smartly takes the edge close to the picked point. (c) The final result.

planes parallel to the vertical direction. In other words, after transforming by R, the normals of the planes are orthogonal to w = (0, 0, 1)^T, i.e.

u_i^T R^{-1} w = 0    (5.2)

where u_i are the planes' normals.

Using this constraint, given at least three corners, we can compute the last column of R^{-1}, or equivalently the last row of R, by finding the least-squares solution. If the last row of R is r_3 = (a, b, c), then, from the constraint that R is orthogonal, we choose its other rows as:

r_1 ≅ (−b, a, 0)
r_2 ≅ (−ac, −bc, a^2 + b^2)    (5.3)

where ≅ means equal up to a scale, and |r_1| = |r_2| = |r_3| = 1.

Having computed R, we resample the panoramic image to finish the rectification.
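A sketch of this computation, assuming the normals u_i are given as 3-vectors: the last row r_3 of R is the least-squares solution of Equation (5.2), i.e. the right singular vector of the stacked normals with the smallest singular value, and r_1, r_2 follow from Equation (5.3):

```python
import numpy as np

def rectifying_rotation(normals):
    """Rotation R aligning the corner planes with the vertical axis.

    Each plane normal u_i must satisfy u_i^T R^{-1} w = 0 (Eq. 5.2),
    so r3 is the least-squares null direction of the stacked normals;
    r1 and r2 are completed per Eq. 5.3 and normalized to unit length.
    """
    U = np.asarray(normals, dtype=float)
    _, _, Vt = np.linalg.svd(U)
    a, b, c = Vt[-1]                 # smallest right singular vector = r3
    if a * a + b * b < 1e-12:        # panorama is already vertical
        return np.eye(3)
    r3 = np.array([a, b, c])
    r1 = np.array([-b, a, 0.0])
    r2 = np.array([-a * c, -b * c, a * a + b * b])
    r1 /= np.linalg.norm(r1)
    r2 /= np.linalg.norm(r2)
    return np.vstack([r1, r2, r3])
```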

5.4.3 Estimating the floor-plan

The locations of corners in panoramas, identified in the previous step, give sets of horizontal angles between the corners when viewed from the panorama viewpoint. If we have a way to represent those angles in terms of coordinates of projections of corners and viewpoints in the floor-plan, we have a set of constraints to estimate the floor-plan and the viewpoints. Here we briefly review such a method presented in [42], discuss its applicability, and show how we extend it for our work.

Figure 5.4: Parameterization of the floor-plan model given a sketch, simplified from Figure 2 in [42]. (a) To reduce the number of parameters, corners are represented by shared parameters. (b) Each viewpoint is parameterized separately. Locations of corners in a panorama at the viewpoint give a set of angles between corners as viewed from the viewpoint.

A sketch is a model of the floor-plan. We force users to draw rectilinear lines parallel to the axes by providing them with a drawing grid. Of course, this alignment can be done automatically, but drawing in such a way helps users to correctly define parallelism and orthogonality. Note that, as only parallelism and orthogonality matter in the parameterization, a sketch of a rectangular room is any arbitrary rectangle.

Assuming that the room has n corners, we need at most 2n parameters to represent it. A viewpoint, whose coordinates also have to be estimated, is represented by a pair of separate parameters. Suppose that we have v panoramas; then the total number of parameters is 2n + 2v. For each wall drawn in the sketch that is parallel to an axis, since the two corners of the wall share a horizontal or vertical coordinate, the number of parameters is reduced by one (Figure 5.4a). Hence the number of parameters is reduced by the number of such walls, m. To further reduce the number of parameters, the origin of the coordinate system is set at one corner, and the length of one wall is set to one, as the reconstruction is up to scale anyway. These settings reduce the number of parameters by 3. In summary, the number of parameters to be estimated is:

2n + 2v − m − 3 (5.4)
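The parameter count of Equation (5.4) is straightforward to evaluate; for instance, a rectangular room (n = 4, m = 4) seen from one panorama (v = 1) has only three unknowns, matching Figure 5.4a:

```python
def num_parameters(n_corners, n_viewpoints, n_axis_aligned_walls):
    """Number of unknowns in the floor-plan model (Eq. 5.4):
    two coordinates per corner and per viewpoint, minus one shared
    coordinate per axis-aligned wall, minus 3 for fixing the origin
    and the scale."""
    return 2 * n_corners + 2 * n_viewpoints - n_axis_aligned_walls - 3
```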

From the model of the floor-plan that contains the coordinates of corners and viewpoints, we can estimate the angle between two corners as seen from a viewpoint (Figure 5.4b). These angles are equal to the set of angles defined by user-picked corners in the panoramas. This set of constraints can be used to estimate the parameters of the floor-plan model and the viewpoints.

At this point, the coordinates of the top-down projections of the viewpoints are estimated, but the viewpoints' heights are missing. Complete viewpoint coordinates are required to add more details to the model in the later stage. Since we already know the floor and the projection of the viewpoint on the floor, we only need one point to compute the relative distance from

the viewpoint to the floor. To get that point, we ask the user to pick any floor point in each panorama and compute the viewpoint height from it.

Figure 5.5: When the floor-plan is not rectilinear (a), or when from the viewpoint we cannot see all corners (b), we may need more than one panorama to estimate it.
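One plausible reading of this step: once the floor-plan (and hence the planar distance from the viewpoint's floor projection to the picked floor point) is known, the viewpoint height follows from the vertical angle at which that floor point is seen below the horizon. A hypothetical helper, not the paper's formulation:

```python
import math

def viewpoint_height(planar_distance, depression_angle):
    """Height of the viewpoint above the floor, given the planar
    distance to a picked floor point and the angle below the horizon
    at which that point is seen (hypothetical formulation)."""
    return planar_distance * math.tan(depression_angle)
```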

5.4.4 Reconstructability analysis

We now give an analysis of the floor-plan estimation method. To estimate the floor-plan and the viewpoint coordinates, the number of constraints must be greater than or equal to the number of unknowns given in Equation (5.4) of the previous subsection.

Suppose that viewpoint i sees c_i corners; since the sum of the angles is 360 degrees, this gives c_i − 1 independent constraints. Since the viewpoints are different, constraints of one viewpoint are independent of constraints of other viewpoints. The problem is solvable when the number of constraints is greater than or equal to the number of parameters:

∑_{i=1}^{v} c_i ≥ 2n + 3v − m − 3    (5.5)

Common rooms have all walls parallel to an axis, i.e. the floor-plan is a rectilinear polygon, thus m is equal to n. Equation (5.5) then simplifies to:

∑_{i=1}^{v} c_i ≥ n + 3v − 3    (5.6)

Suppose that we can find a point from which all corners are visible, i.e. c_i = n. Equation (5.6) then further simplifies to v ≥ 1. So indeed, given a rectilinear floor-plan, one panorama that sees all corners might be enough to estimate it. A special, yet the most common, case is a rectangular room: since we see all four corners from any viewpoint, one panorama might be enough to reconstruct the walls-and-floor model.
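Condition (5.5) can be checked mechanically; a sketch, where `corners_seen[i]` holds the number of corners c_i visible from viewpoint i (names are ours):

```python
def is_reconstructable(corners_seen, n_corners, n_axis_aligned_walls):
    """Check condition (5.5): each viewpoint i contributes c_i - 1
    independent angle constraints; reconstruction is possible when the
    constraints are at least as numerous as the unknowns of Eq. 5.4."""
    v = len(corners_seen)
    constraints = sum(c - 1 for c in corners_seen)
    unknowns = 2 * n_corners + 2 * v - n_axis_aligned_walls - 3
    return constraints >= unknowns
```

For a rectangular room seen from one viewpoint, `corners_seen = [4]` gives three constraints against three unknowns, so one panorama suffices.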

We need more panoramas when the floor-plan is not a rectilinear polygon, or when from the chosen viewpoint we cannot see all corners. Figure 5.5 shows examples.


5.4.5 The capture assistant

The capture assistant helps users in planning viewpoints in the room so that the reconstruction is possible and the model covers all of the room. To that end, it must know the number of unknowns given a sketch, the number of constraints produced by viewpoints and the area they cover. Furthermore, it is preferred that the number of viewpoints is minimal.

The number of unknowns is computed easily using equations (5.5) and (5.6). A line segment from any point within a convex polygon to any of its vertices does not go out of the polygon. Hence if the floor-plan is convex, counting the constraints is trivial since from any viewpoint we see all the corners. When the floor-plan is concave, the problem is non trivial. Since we keep the sketching simple, only asking users to align rectilinear lines of the sketch parallel to axes, the sketch is freely stretched unevenly along axes. Our solution is to decompose the sketch into tiles and compute the minimal number of observable corners from each tile, invariant to how it is stretched along axes. The method for decomposing a sketch into such tiles, which we call invariant observable areas, is described in Algorithm 1. Algorithm 1 Decomposing a sketch into invariant observable areas

• Step 1: Cut the sketch into tiles using all distinguished x and y coordinates. A sketch is turned into a set of rectangles and triangles (Figure 5.6.a). Where each of them is called a tile (Figure 5.6.b).

• Step 2: For each tile, find its invariant observable area by the following steps: – Initiate the area to contain only the tile itself.

– Iteratively add a tile if it, together with some tiles already added, forms a convex polygon containing the initial tile.
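Step 1 of Algorithm 1 can be sketched as follows for a rectilinear floor-plan; we cut along all distinct x and y coordinates and keep each grid cell whose center lies inside the polygon (a rectangles-only simplification of the paper's rectangles-and-triangles decomposition):

```python
def cut_into_tiles(polygon):
    """Cut a rectilinear floor-plan polygon (list of (x, y) vertices in
    order) into axis-aligned tiles using all distinct coordinates."""
    xs = sorted({x for x, _ in polygon})
    ys = sorted({y for _, y in polygon})

    def inside(px, py):
        # standard ray-casting point-in-polygon test
        hit = False
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            if (y1 > py) != (y2 > py):
                xcross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
                if px < xcross:
                    hit = not hit
        return hit

    tiles = []
    for x1, x2 in zip(xs, xs[1:]):
        for y1, y2 in zip(ys, ys[1:]):
            if inside((x1 + x2) / 2, (y1 + y2) / 2):
                tiles.append(((x1, y1), (x2, y2)))
    return tiles
```

An L-shaped room with corners (0,0), (2,0), (2,1), (1,1), (1,2), (0,2) decomposes into three unit tiles; the top-right cell falls outside the polygon.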

Lemma 5.4.1 If the sketch differs from the real floor-plan by an uneven scaling, the Invariant Observable Areas (IOAs) are invariant to that scaling.

Proof Since the sketch differs from the real floor-plan by an uneven scaling, the coordinates of corners are transformed by a monotonic function; thus the order between any pair of x or y coordinates is preserved. That means that if x_a > x_b in the floor-plan, or in one sketch, it still holds in any other sketch. Consequently, the order of tiles, as decomposed in the algorithm above, is horizontally and vertically unchanged in any sketch, and hence the IOAs, each a set of tiles built following Step 2 of Algorithm 1, are unchanged. □

Lemma 5.4.2 Any point in an IOA is observable from any point in the initial tile.

Proof Any point is observable from any other point within a convex polygon. Since the extending scheme only adds a new tile if it is part of a convex polygon with the initial tile, all points in the IOA are observable from any point in the initial tile. □

Having the IOAs, we check whether the planned viewpoints cover the whole room and provide enough constraints to estimate the real floor-plan. The IOA of a viewpoint is the IOA of the

Figure 5.6: Illustration of the sketch decomposition algorithm. (a) The sketch is cut into rectangles and triangles using all distinguished x and y coordinates. (b) The tile graph indicates the possibilities of traveling among tiles. (c) For each tile the initial observable area is itself (black); then tiles reached by traveling parallel to the axes are iteratively added (gray); finally, tiles reachable from both directions are added (diagonal pattern). (d) The number of corners contained in the observable area is the minimal number of observable corners from the tile.

tile containing it. By checking whether the union of the planned viewpoints' IOAs covers the sketch, we can make sure that the set of viewpoints covers the whole scene. Checking whether the floor-plan is solvable is done by summing the number of corners observed by each IOA and comparing the result to the condition in (5.5).

Given the IOAs of a sketch, finding an optimal set of viewpoints, i.e. the smallest number of viewpoints that covers the scene completely and satisfies the reconstructability condition (5.5), is a hard problem. Let us construct a graph representing the problem. Each tile is a node in the graph. For each tile, we have edges connecting it to all tiles in its IOA. Since observability is symmetric (if a tile is observable from another one, then from it we can also observe the other tile), the edges are undirected. Putting aside the reconstructability condition, our problem is finding the minimal set of nodes from which edges connect to all remaining nodes. This is the minimal dominating set problem, one of the known NP-complete problems [82]. With the additional condition, our problem is arguably of the same complexity. To suggest users a solution in interactive time, we propose the greedy Algorithm 2.

In practice, since there are objects in the room, we might not be able to put the camera at the suggested positions, or to see all the corners we should see according to the analysis. Should an object, e.g. a tall wardrobe, completely block one or more corners, it must be considered part of the walls; the procedure to suggest viewpoints is then the same. If a suggested tile is inappropriate for placing the camera, users can mark it so that Algorithm 2 ignores that tile when recomputing the suggested viewpoints. This procedure has proven to give good results in practical cases.

Viewpoints also affect the accuracy of the floor-plan and the texture quality. In practice, since the panorama is built from high resolution images, the texture quality should not be a

Algorithm 2: Suggesting viewpoints, the greedy algorithm

• Step 1. Find a dominating set. Initialize an empty dominating set of tiles. While the scene is not covered by the union of the IOAs of the tiles in the set, add the tile whose IOA contains the most uncovered tiles.

• Step 2. Satisfy the reconstructability condition. While condition (5.5) is not satisfied, add the tile whose IOA contains the most corners, i.e. provides the most constraints.
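Step 1 of Algorithm 2 can be sketched as a standard greedy set-cover loop; `ioas[t]` is assumed to be the set of tile indices observable from tile t (the data structure is our assumption):

```python
def suggest_viewpoints(ioas, n_tiles):
    """Greedy Step 1 of Algorithm 2: repeatedly pick the tile whose IOA
    covers the most still-uncovered tiles, until all tiles are covered."""
    uncovered = set(range(n_tiles))
    chosen = []
    while uncovered:
        best = max(range(n_tiles), key=lambda t: len(ioas[t] & uncovered))
        if not ioas[best] & uncovered:
            break  # remaining tiles cannot be covered
        chosen.append(best)
        uncovered -= ioas[best]
    return chosen
```

Step 2 would extend the loop with the corner-count criterion until condition (5.5) holds.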

problem. Intuitively, to estimate the floor-plan accurately, one should place the camera in the center of the room to balance the constraints.

After this stage, we have a textured walls-and-floor model. In this model, objects are projected onto the walls and the floor. It gives a good overview of the scene. For applications such as real estate management, as indicated, this should be satisfactory. However, for an application such as CSI, the object localization is not detailed enough; thus, we need the second stage to add more detail.

5.5 Adding Details using Perspective Extrusion

The model now contains the planes of the walls, the floor, and the viewpoint locations. We design interactive methods to add detail to the model in the spirit of the whole framework: flexibly reconstructing objects from coarse to fine. For example, a table is reconstructed first, and then the stack of books on it. Characteristics of indoor scenes are exploited in designing interaction methods that meet this idea.

In indoor scenes, many objects are composed of planes. Since objects are often aligned with walls, those planes are likely parallel to at least one wall or to the floor. As indicated earlier, this gives a constraint for reconstructing objects. The resulting action is similar to an extrusion, a standard technique in manual 3D modelling. In a normal extrusion, the orthogonal projection of the object's boundary on a reference plane is orthogonally popped up over a known distance, creating a new planar object surface. In our situation we do not see the object in orthogonal views, but from a panorama viewpoint. So, instead of moving the object's boundary along lines orthogonal to the reference plane, we move it along rays from the viewpoint to the boundary's original locations in the reference plane (Figure 5.1.d). Because the boundary moves along viewing rays, we call this a perspective extrusion.

Our aim is to reconstruct an object surface S that is parallel to an already reconstructed plane (Figure 5.7). S is reconstructed from a set of three parameters. The reference plane l is a reconstructed plane to which the plane of S is parallel; d denotes the distance from S to l; and b is the projection of the boundary of S in a panorama. The reconstruction procedure consists of shifting the reference plane l over distance d to get the object plane p, and cutting p by the pyramid formed by b and the viewpoint from which b is seen. Once we have S, users can choose whether the object is a solid box or just a planar surface. The perspective extrusion process is summarized in Table 5.2.


1. The user picks the reference plane l.

2. The user defines the distance from l to the object plane p, either from one or two viewpoints.

3. Compute the object plane p by shifting l by d.

4. The user defines the boundary through its projection b onto a panorama.

5. Compute initial S by cutting the object plane p by the pyramid of b and the panorama viewpoint.

6. The user chooses the object type, either a solid box or a planar surface.

Table 5.2:Perspective extrusion process
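The geometric core of the process in Table 5.2 can be sketched as shifting the reference plane along its normal and intersecting the boundary's viewing rays with the shifted plane. The representation below (a plane as a point plus unit normal, the boundary as 3D points on l) is an assumption for illustration, not the thesis implementation.

```python
import numpy as np

def perspective_extrude(plane_point, plane_normal, d, viewpoint, boundary_on_l):
    """Pop up an object surface from reference plane l (steps 3 and 5)."""
    n = plane_normal / np.linalg.norm(plane_normal)
    p0 = plane_point + d * n            # a point on the shifted object plane p
    surface = []
    for q in boundary_on_l:
        ray = q - viewpoint             # ray from the viewpoint through q on l
        t = np.dot(p0 - viewpoint, n) / np.dot(ray, n)
        surface.append(viewpoint + t * ray)   # intersection of the ray with p
    return np.array(surface)
```

For example, extruding a boundary point on the floor plane by d = 1 toward a camera above it slides the point along its viewing ray, so it also shrinks toward the viewpoint rather than moving straight up.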

Figure 5.7: A perspective extrusion pops up an object S from an already reconstructed plane l: the boundary b is extruded along rays from the viewpoint over distance d.

In related work such as [55], object parameters are defined indirectly in terms of geometric objects, e.g. a rectangular box. In pictures of indoor scenes, objects are frequently occluded, making the use of geometric objects difficult. To give more options in reconstructing an object, we choose to let users define the parameters directly and separately. For example, a box is defined by one of its faces and the distance to the plane that face is parallel to. The distance can be defined by a line orthogonal to any reconstructed plane.

The reference plane l is picked from the current model. We provide two ways to define d, using either one or two viewpoints. To define d from a single viewpoint, the user draws a line from the object surface orthogonally to a reconstructed plane. To define d from two viewpoints, the user picks the projections of a point on the object surface in two panoramas. We then triangulate these two projections to estimate the 3D coordinates of that point and its distance to l which, as l is already reconstructed, gives us the distance d. This strategy is useful when there is no physical clue to guide drawing a line from the object's surface orthogonally to a reconstructed plane. For example, for a chair with bent legs standing in the middle of the room, there would be no physical clue to draw d from a single viewpoint. The boundary b is a polygon drawn by the user from the viewpoint. To assist the drawing of b, we assume by default that the boundary of S has orthogonal angles and is symmetric, as long as the drawing of b does not contradict this assumption. Using these assumptions, we predict the boundary and render it. This is helpful for defining b accurately, especially when a vertex is occluded.
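Defining d from two viewpoints reduces to a small triangulation: the surface point is estimated as the midpoint of the closest approach of its two viewing rays, and d is its distance to l. A minimal sketch, assuming each ray is given as an origin and unit direction (names are illustrative):

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    # Solve for s, t minimizing |(o1 + s*d1) - (o2 + t*d2)|,
    # then return the midpoint of the two closest points.
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(o2 - o1) @ d1, (o2 - o1) @ d2])
    s, t = np.linalg.solve(A, b)
    return 0.5 * ((o1 + s * d1) + (o2 + t * d2))

def distance_to_plane(point, plane_point, plane_normal):
    n = plane_normal / np.linalg.norm(plane_normal)
    return abs((point - plane_point) @ n)
```

The midpoint formulation is robust to the small ray skew caused by calibration noise, since the two rays from real panoramas rarely intersect exactly.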


For flexibility and accuracy, we let users define any parameter (l, d, or b) from any available panorama viewpoint. A possible way to further increase flexibility and accuracy is to let users adjust the boundary b from different viewpoints, as in VideoTrace [160]. However, that is only effective with many viewpoints, i.e. many observations of the boundary. To keep the framework simple and the number of input panoramas small, we have decided not to use that technique.

To be reconstructible, objects must be seen and the parameters for perspective extrusion must be definable. The capture assistant described in Section 5.4.5 handles part of this by ensuring that all of the floor and walls will be seen. Objects can of course be occluded completely by other objects, but that is rarely the case for the main objects in a scene. For l and b, if objects are complex or curvy, we can only approximate them (Figure 5.11c, d). For a “floating” object, like the chair in Figure 5.10a, there is no solid connection from its surface to another surface, so one should use two viewpoints to define d. In general, if an object has a sufficiently different appearance in two panoramas, then it is reconstructible.

5.6 Results

We now present results showing that the proposed framework overcomes difficulties in indoor scene reconstruction to efficiently produce complete and accurate models.

5.6.1 Datasets

Four scenes are used in our evaluation (Figure 5.8). Three are rooms in a house captured by ourselves. The last is a fake crime scene captured by The Netherlands Forensic Institute. The ground truth is defined by measurements made on objects in the scenes. All are typical indoor scenes: rather complex, with limited space.

For every scene, the minimal number of panoramas required, as computed by our capture assistant, is one. However, because of obstacles (furniture) there was no position from which all corners could be captured, so we had to use two panoramas for each of the three rooms. For the fake crime scene, we used one panorama.

5.6.2 Accuracy

Since the reconstructed model is determined only up to a scale and a rotation, we have to eliminate this ambiguity in order to evaluate the accuracy. To do so, we estimate a transformation from the estimated floor-plan to the ground-truth floor-plan. We apply this to the model, and then evaluate the model at two levels: at room scale (floor-plan error) and at object scale (object measurements).
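The alignment step can be sketched as a 2D similarity fit (an Umeyama-style Procrustes alignment) from estimated to ground-truth floor-plan corners; this illustrates the idea and is not the code used in the thesis.

```python
import numpy as np

def fit_similarity_2d(X, Y):
    """Fit scale s, rotation R, translation t so that s*R@x + t ~ y."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc.T @ Xc)        # cross-covariance SVD
    D = np.diag([1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflection
    R = U @ D @ Vt
    s = np.trace(D @ np.diag(S)) / (Xc ** 2).sum()
    t = my - s * R @ mx
    return s, R, t

def apply_similarity(s, R, t, X):
    return s * (R @ X.T).T + t
```

After applying the fitted transform to the whole model, the residuals at the corners are the floor-plan errors reported below.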

Table 5.3 shows floor-plan errors with and without rectifying panoramas. In two out of three datasets the improvement is significant. In the third, the Bedroom, the error without rectification is almost the same as with rectification, since the angles of the original panoramas are almost perfect. Using uncalibrated images (calibration done during stitching) is possible, though the results are not as good as with calibrated images. With pre-calibrated images and panorama rectification, the errors are a few centimeters in a room of about ten square meters. The relative errors, computed by dividing the absolute error by the length of the diagonal of the rectangular bounding box of the true floor-plan, are about one percent. The estimated floor-plan of the dining room is less accurate since some of its corners were hard to identify in the panoramas. Our accuracy is higher than in [42], where the error is about 4 percent. Two differences are responsible for the improvement: our floor-plan estimation strategy and our panorama rectification. In [42], a sketch of several rooms is used to parameterize and estimate the floor-plan of multiple rooms; it was noted there that doing so, and thus ignoring the thickness of walls, might reduce the accuracy. To achieve high accuracy, we estimate the floor-plan of each room separately. More importantly, our rectification eliminates the inaccurate alignment in the input panoramas.

Figure 5.8: Evaluated scenes, their sketches, and the number of panoramas used: (a) Bedroom, 2 panoramas; (b) Dining room, 2; (c) Kitchen, 2; (d) Fake crime scene, 1.

                 Without rectification   Uncalibrated images   Calibrated & Rectification
  Bedroom        0.48 ± 1.45             0.49 ± 0.16           0.38 ± 0.14
  Dining room    7.50 ± 3.20             7.48 ± 3.17           1.18 ± 0.49
  Kitchen        9.88 ± 3.24             0.48 ± 0.23           0.28 ± 0.05

Table 5.3: Floor-plan relative errors (in percent, mean ± standard deviation). To achieve the best accuracy, lens distortion correction should be applied before panorama stitching, and panorama rectification (Section 5.4.2) should be used. The floor-plan error of the fake crime scene is not available due to lack of ground truth.
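The relative floor-plan error is simply the absolute error normalized by the diagonal of the ground-truth floor-plan's bounding box; as a sketch:

```python
import numpy as np

def relative_error(abs_errors, gt_corners):
    """Relative error in percent: absolute error / bounding-box diagonal."""
    lo, hi = gt_corners.min(axis=0), gt_corners.max(axis=0)
    diagonal = np.linalg.norm(hi - lo)
    return 100.0 * np.asarray(abs_errors) / diagonal
```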

For objects, since the angles between geometric primitives, lines and planes, are already enforced during the reconstruction, we only evaluate the length errors, absolute and relative to the ground truth lengths.

The accuracy of our framework is quite high, e.g. compared to [42, 119]. Object accuracy is slightly lower than scene accuracy in terms of relative error, but our examination shows that the absolute errors are about the same.


                      Average Object Error
                      Absolute (cm)   Relative (%)
  Bedroom             2.4 ± 1.9       2.38 ± 1.41
  Dining room         1.6 ± 1.2       1.84 ± 2.00
  Kitchen             1.1 ± 1.0       1.17 ± 1.06
  Fake crime scene    6.2 ± 2.6       1.84 ± 0.89

Table 5.4: Average object errors (mean ± standard deviation).

5.6.3 Efficiency and Completeness

Our framework is efficient: a scene can be modeled in about ten minutes. Figure 5.9 shows the model of a rather complex scene, the fake crime scene. The walls-and-floor model is built in seconds. All furniture is modeled in about five minutes. Building the final model, which includes small objects such as cups on tables, takes ten minutes. Furthermore, users do not need to measure objects at capture time.

Figure 5.9: Resulting models as a function of the time and interaction spent, for the fake crime scene: (a) walls-and-floor model, 0 minutes, 6 mouse clicks; (b) all-furniture model, 5 minutes, 10 extrusions; (c) final model and (d) final textured model, 10 minutes, 19 extrusions.

Figure 5.10 shows models of some scenes built using our framework. Close-ups of objects picked from the reconstructed models are given in Figure 5.11. Objects composed of planar surfaces are well reconstructed, while complex curvy objects can only be approximated using perspective extrusions.

Figure 5.10: Models reconstructed using the proposed framework: (a) Bedroom, (b) Dining room, (c) Kitchen.

Figure 5.11: Objects picked from the models in Figures 5.9 and 5.10: (a) stove, (b) table, (c) couch, (d) fake body. It takes less than a minute to model an object. Objects composed of planar surfaces (the stove and the table) are well reconstructed using our method, while complex objects like the fake body are hard to approximate using perspective extrusions alone.

5.7 Conclusion

We have proposed a panorama-based semi-interactive 3D reconstruction framework for indoor scenes. The framework overcomes the problem of the limited field of view in indoor scenes and has the desired properties: robustness, efficiency, and accuracy. These properties make it suitable for a broad range of applications, from a coarse model created in a few seconds for a presentation to a detailed model for measurement in crime scene investigation. Models inexpensively created using our framework are an intuitive medium to manage and retrieve digitized information about scenes and to use it in interactive applications.

A limitation of the framework is its inability to model complex objects. This could be counteracted by other, more expensive techniques. For example, the VideoTrace technique [160] lets users model objects from video sequences. The ortho-image technique [152] creates background maps from image sequences to assist artists in modelling objects in 3D authoring software. As objects get complex, both techniques require images from many different angles and more interaction. Since our panoramic images are calibrated, we can integrate those techniques into our framework as plugins. Once an object is reconstructed using such a technique, we can automatically integrate it back into our model by matching the panoramic images to the image sequence used to model the object and then estimating the pose of the object. Thus the framework is a useful tool both for quickly building coarse models and for efficiently building accurate models. In the accompanying video the system is demonstrated on a number of realistic scenes.
