
UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)

Semi-interactive construction of 3D event logs for scene investigation

Dang, T.K.

Publication date 2013

Link to publication

Citation for published version (APA):

Dang, T. K. (2013). Semi-interactive construction of 3D event logs for scene investigation.



Chapter 6

Building 3D Event Logs for Video Investigation

In scene investigation, creating a work log using a handheld camera is more convenient and more complete than using photos and notes. By introducing video analysis and computer vision techniques, it is possible to build a system that enables users to navigate through the logs of an investigation in time and space. Such navigation gives a better overview and makes the data more accessible. We develop methods for processing video logs and present an interface for navigating the result. The processing includes (i) segmenting a log into events using a novel structure and motion feature, so that the log is more accessible in the time dimension, and (ii) mapping video frames to a 3D model of the scene, so the log can be navigated in space. Our results show that, using the proposed features, we recognize more than 70 percent of all frames correctly and find all the events. We then provide a method to semi-interactively map those events to a 3D model of the scene, with which more than 80 percent of the events can be mapped. The result is a spatio-temporal representation of the investigation that supports applications such as revisiting the scene, examining the investigation itself, or hypothesis testing.


6.1 Introduction

The increasing availability of cameras and the reduced cost of storage have encouraged people to use images and videos in many aspects of their lives. Instead of writing a diary, nowadays many people capture their daily activities with a camera. When such capturing is continuous this is known as “life logging”. This idea goes back to Vannevar Bush’s Memex device [21] and is still a topic of active research [7, 150, 37]. Similarly, professional activities can be recorded with videos to create professional logs. For example, in home safety assessment, an investigator can walk around, examine a house and record speech notes at the same time. Another interesting professional application is crime scene investigation. Instead of looking for evidence, purposely taking photos and writing notes, investigators can just wear a head-mounted device and focus on finding the evidence, while everything is automatically recorded in a log. These professional applications all share a similar setup, namely a first person view video log recorded in a typically static scene. In this thesis we focus on this group of professional logging applications, which we call scene investigation.

Our proposed scene investigation framework includes three phases: capturing, processing, and reviewing.

In the capturing phase, an investigator records the scene and all objects of interest it contains using various media, including photos, videos, and speech. The capturing is a complex process in which the investigator performs several actions to record different aspects of the scene. In particular, the investigator records the overall scene to get an overview, walks around to search for objects of interest, then examines those specific objects in detail. Together these actions form the events in the capturing process. In the processing phase, the system analyzes all data to yield information about the scene, the objects, and the events in the capturing phase. Later, in the reviewing phase, an investigator uses the collected data to perform various tasks: assessing the evidence, getting an overview of the case, measuring specific scene characteristics, or evaluating hypotheses.

In common investigation practice, experts take photos of the scene and the objects they find important and add hand-written notes to them. This standard way of recording does not provide a sufficient basis for the later processing and reviewing phases. A collection of photos cannot give a good overview of the scene. Thus, it is hard to understand the relations between objects, come up with hypotheses and assess them based on the photos alone. In some cases, investigators use the pictures to create a panorama to get a better overview [17]. But since the viewpoint is fixed for each panorama, it does not give a good spatial impression. The measuring task is also not easily done on those photos when the investigator has not planned for this in advance. More complicated tasks, like making a hypothesis on how the suspect moved, are very difficult to perform using a collection of photos without a good sense of space. Finally, a collection of photos and notes hardly captures investigation events, which are important for understanding the investigation process.

A potential solution to enhance scene investigation is to have 3D models of the scene, and log the whole investigation process using a handheld or mounted camera. Research on techniques to capture crime scenes as a 3D model has pointed out that 3D models can be used in many investigation tasks [72, 55]. 3D models make discussion easier, hypothesis assessment more accurate, and court presentation much clearer.


Instead of taking photos and notes, investigators film the scene with a camera. All moves and observations of investigators are thus recorded in video logs. However, in order to reap the benefits, it is crucial to have a method to extract events from the logs for reviewing (Figure 6.1).


Figure 6.1: An investigation log is a series of investigation events (like (a) taking an overview, (b) searching, (c) getting details, or (d) examining) within a scene. When reviewing the investigation, it is helpful if an analysis can point out which events happened and where in the scene they took place.

When combined, 3D models and video logs have great potential to improve information accessibility. For example, a 3D model can help visualize the spatial relation between events, or details of certain parts of the model can be checked by reviewing the events captured


in that part of the scene. Together, a 3D model of the scene and log events form a 3D event log of the case. Such a log, apart from direct applications like event-based navigation in 3D, will enable other applications such as knowledge mining of expert moves, finding correlations among cases, or teaching novice investigators.

3D models of a scene can be built in various ways [160, 139, 121, 141]. Here we assume that the 3D model is already created, in our case with the semi-interactive 3D reconstruction method we described in [32]. We focus on analyzing an investigation log to find events and on connecting them to the 3D model.

Analyzing an investigation log is different from regular video analysis. The common target in video analysis is to determine the content of a shot, while in investigation log analysis we already know the content (the scene) and the purpose (investigation). The investigation events we want to detect arise from both the content and the intentions of the cameraman. For example, when an investigator captures an object of interest she will walk around the object and zoom in on it. From the video log, even for humans, these events are not easy to recognize.

Once we have the logs analyzed, the remaining task is to map the events to the 3D model of the scene. If the data were high quality imagery, accurate matching would be possible [93, 142]. Video frames of investigation logs, however, are not optimal for matching as they suffer from intensity noise and motion blur. This hinders the performance of familiar matching methods.

Figure 6.2: Overview of the framework to build a 3D event log of an investigation.

In this chapter, we present our solution to analyzing investigation logs and connecting them to a 3D model of the scene to power a log navigation system for later reviewing of the


investigation. In the next section we review the related work. Then the two main components of our system are presented in subsequent sections: (i) analyzing an investigation log to segment it into events, and (ii) connecting a log to a 3D model for reviewing (Figure 6.2). In Section 6.3 we introduce our novel features to classify frames in order to segment a log into events. This turns a log into a story of the investigation, making it more accessible in the time dimension. Section 6.4 presents our semi-interactive approach to map events to a reconstructed 3D model. Together with the log segmentation step, this builds a 3D event log containing investigation events and their spatial and temporal relations. Section 6.5 evaluates the results of the proposed solution at the two analysis stages: segmenting a log into events, and mapping the events to a 3D model. Finally, in Section 6.6 we present our interface which allows for navigating the 3D events.

6.2 Related work

Video analysis and segmentation.

Video analysis often starts with segmenting a video into units for easier management and processing. The commonly used unit is a shot, whose boundary can be detected quite reliably, e.g. based on motion [107]. Then the videos are analyzed using various low-level features and machine learning to get more abstract information, such as whether a specific concept is present [143]. As we want to get more information on the movements and actions of the investigator, attention and intention are two important aspects. An attention model defines what elements in the video are most likely to get the viewer's attention. Many works are based on the visual saliency of regions in frames, which is then used as a criterion to select key frames [97, 101, 3]. Attention analysis tries to capture the passive reaction of the viewer when watching a scene captured in video. Intention analysis, in contrast, tries to find the motivation and reaction of the cameraman to a scene. This information leads to one more browsing dimension [100], or another way of summarizing [3].

Tasks such as summarizing are very difficult to tackle as a general problem. Indeed, existing systems have been built to handle data in specific domains, such as news [8] and sports [130], rather than trying to handle everything. In all examples mentioned, we see that domain specific summarizing methods perform better. Social networking video sites, like YouTube, urge for research on the analysis of user generated videos [3] as well as life logs [7], a significant sub-class of user generated videos. Indeed, we have seen research in both hardware [52, 7, 36] and algorithms [37, 38] to meet that need.

Domains dictate requirements and applicable techniques for analysis. For example, life logs, and in general many user generated videos, are one-shot. This means the familiar unit of video analysis (the shot) is no longer suitable, and new units as well as new segmentation methods must be developed [3]. The quality of those videos is lower than that of professionally produced videos. Unstable motion and varying types of scenes violate the common assumptions on the motion model. A more difficult issue is that those videos are less structured, making it harder to analyze contextual information. In this work we consider a class of user generated videos, professional logs of scene investigation, that shares many challenges with general user generated videos and video logs, but also has its own domain specific characteristics.


Video navigation

A simple video navigation scheme, as seen on every DVD, divides a video into tracks and presents them with a representative frame and description. If we apply multimedia analysis and know more about the purpose of navigation, there are many alternative ways to navigate a video. For example, in [79] the track and representative frame scheme is enhanced using an interactive mosaic as a customized interface. The method takes into account various features, including color distribution, existence of human faces, and time, to select and pack key frames into a mosaic template. Apart from the familiar time dimension, we can also navigate in space. Tour into video [75] shows the possibility of spatial navigation in video by decomposing an object into different depth layers, allowing users to watch the video from new perspectives. The navigation can also be object based rather than frame based. In [56], object tracking enables an object-based video navigation scheme. For example, users can navigate video by dragging an object from one frame to a new location. The system then automatically navigates to the frame in which the object location is closest to that expectation. What the novel navigation schemes described above have in common is that they depend on video analysis to get the information required for navigation. That is also the way we approach the problem: first analyzing video logs and then using the result as the basis for navigation in a 3D model.

6.3 Analyzing Investigation Logs

In this section we first discuss investigation events and their characteristics; through these we motivate our solution for segmenting investigation logs, which is described subsequently.

6.3.1 Investigation events

Watching logs produced by professionals (crime investigators), we identify four types of events: search, overview, detail, and examination. In a search segment investigators look around the scene for interesting objects. An overview segment is taken with the intention to capture spatial relations between objects and to position oneself in the room. In a detail segment the investigator is interested in a specific object e.g. an important trace, and moves closer or zooms in to capture it. Finally, in examination segments, investigators carefully look at every angle of an important object. The different situations lead to four different types of segments in an investigation log. As a basis for video navigation, our aim is to automatically segment an investigation log into these four classes of events.

There are several clues for segmentation, namely structure and motion, visual content, and voice. Voice is an accurate clue; however, as users usually add voice notes at a few important points only, it does not cover all the frames of the video. Since in an investigation the objects of interest vary greatly, both in type and appearance, the visual content approach is infeasible. So understanding the movement of the cameraman is the most reliable clue. The class of events can be predicted by studying the trajectory of the cameraman's movement and his position relative to the objects. In computer vision terms this represents the structure of the scene and the motion of the camera. We observe that the four types of events have different structure and motion patterns. For example, an overview has moderate pan and tilt camera motion and the camera is far from the objects.


Investigation event    Characteristics

Search - Unstable, mixed motion

Overview - Moderate pan/tilt; far from objects

Detail - Zooming like motion; close to objects

Examination - Go around objects; close to objects

Table 6.1: Types of investigation events, and characteristics of their motion patterns.

Table 6.1 summarizes the different types of events and the characteristics of their motion patterns.

Though the description in Table 6.1 looks simple, performing the log segmentation is not. Some terms, such as “go around objects”, are at a conceptual level. These are not considered in standard camera motion analysis, which usually classifies video motion into pan, tilt, and zoom. Also, it is not just about camera motion. For example, the term “close” implies that we also need features representing the structure (depth) of the scene. Thus, to segment investigation logs, we need features containing both camera motion and structure information.

6.3.2 Segmentation using Structure-Motion Features

As discussed, in order to find investigation events, we need new features capturing patterns of motion and structure. We propose such features below. These features, employed in a three-step framework (Figure 6.3), help to segment a log into investigation events despite varying content.

Extracting structure and motion features

In order to build features capturing camera motion and the structure of the scene, we look at geometric models capturing this information. In particular, we consider which information is described well by geometric models, and which is not. Note that in our case (investigation) the scene can be assumed static. The most general model capturing structure and motion in this case is the fundamental matrix [61]. In practice, many applications, taking advantage of domain knowledge, use more specific models. Table 6.2 shows those models from the most general to the most specific. In studio shots where cameras are mounted on tripods, i.e. no translation is present, the structure and motion are well captured by the homography model [61]. If the motion between two consecutive frames is small, it can be approximated by the affine model. This fact is well exploited in video analysis [127]. When the only information required is whether a shot is a pan, tilt, or zoom, then the three-parameter model is enough [107, 85]. In that way, the structure of the scene, i.e. the variety in 3D depth, is ignored.

As discussed, we need both structure and motion information to segment a log into the defined classes of events. While it is possible to estimate the structure and motion even from an uncalibrated sequence [121], that approach is not robust and not efficient enough for the freely captured investigation videos.

We base our method on the different models in Table 6.2. We first find the correspondences between frames to derive the motion and structure information.



Figure 6.3: Sub-steps to automatically segment an investigation log into events.

Model (d.o.f.)                   Structure and motion assumption
Homography H_P (8)               Flat scene, or no translation in motion
Affine model H_A (6)             Far flat scene
Similarity model H_S (4)         Far flat scene, image plane parallel to scene plane
Three-parameter model H_R (3)    Same as H_S, no rotation around the principal ray

Table 6.2: Motion models commonly used in video analysis, their names and degrees of freedom, and the structure and motion conditions under which they hold.

How well those correspondences fit into the geometric models tells us something about the characteristics of the scene as well as the motion. Such a measurement is called an information criterion (IC) [156].

An IC measures the likelihood of a model being the correct one, taking into account both fitting errors and model complexity. The lower the IC value, the more likely the model is correct. Vice versa, the higher the value, the more likely the structure and motion possess properties not captured by the model. A series of IC values computed on the four models in Table 6.2 characterizes the scene and the camera motion within it. Based on those IC values we can build features capturing structure and motion information in video. Figure 6.4 summarizes the proposed features and their meaning, derived from Table 6.2.

The IC we use here is the Geometric Robust Information Criterion (GRIC) [156]. GRIC, as reflected in its name, is robust against outliers. It has been successfully used in 3D reconstruction from images, e.g. in Chapter 3 of this thesis.


recon-6.3. Analyzing Investigation Logs 89

HP- Flat scene, or no

translation

HA- Far flat scene

Hs- Far flat scene, image

plane parallel to scene plane

HR- Same as HS, no rotation

around the principal ray

cR– same as cS,, plus camera

rotation around the principal ray

cS– same as cA, plus

scene-camera alignment

cA– same as cP, plus

scene's distance

cP- Depth variety, translation

in camera motion

Structure and motion criteria Motion model assumptions

Figure 6.4:Proposed structure and motion criteria for video analysis.

The main purpose of GRIC is to find the least complex model capable of describing the data.

To introduce GRIC, let us first define some parameters. Let d denote the dimension of the model; r the input dimension; k the model's degrees of freedom; and E = [e_1, e_2, ..., e_n] the set of residuals resulting from fitting the corresponding data points to the model. The GRIC is now formulated as:

g(d, r, k, E) = \sum_{e_i \in E} \min\!\left(\frac{e_i^2}{\sigma^2},\ \lambda_3 (r - d)\right) + \left(\lambda_1 n d + \lambda_2 k\right) \qquad (6.1)

where σ is the standard deviation of the residuals.

The left term of (6.1), derived from the fitting residuals, is the model fitting error. The minimum function is meant to threshold outliers. The right term, consisting of model parameters, is the model complexity. λ_1, λ_2, and λ_3 are parameters steering the influence of the fitting error and the model complexity on the criterion. Their suggested values are log(r), log(rn), and 2 respectively [156].

In our case, we consider a two-dimensional problem, i.e. d = 2, and the dimension of the input data is r = 4 (two 2D points). The degrees of freedom k for the different models are given in Table 6.2, and n is the number of correspondences. The GRIC equation, making the dependence on the models in Table 6.2 explicit, simplifies to:

g_H(k, E) = \sum_{e_i \in E} \min\!\left(\frac{e_i^2}{\sigma^2},\ 4\right) + \left(2 n \log(4) + k \log(4n)\right) \qquad (6.2)

In order to make the criteria comparable over frames, the number of correspondences n should be the same. We get correspondences from motion fields using a fixed sampling grid. As mentioned, GRIC is robust against outliers, so the outliers often present in motion fields should not be a problem. For a pair of consecutive frames, we compute the GRIC for each of the four models listed in Table 6.2 and Figure 6.4. For example, c_P = g_H(8, E), with E the set of residuals of fitting the correspondences to the H_P model.
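As an illustration of how these criteria can be computed, here is a minimal Python sketch (not code from the thesis); the residual arrays and σ are assumed to come from robustly fitting the models of Table 6.2 to the sampled correspondences:

```python
import numpy as np

def gric(residuals, k, sigma, d=2, r=4):
    """GRIC score of Eq. (6.2): lower means the model describes the data better.

    residuals : per-correspondence fitting errors e_i
    k         : degrees of freedom of the motion model (Table 6.2)
    sigma     : standard deviation of the residuals
    d, r      : model and input dimensions (2 and 4 in our two-view case)
    """
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    # Robust data term: outliers are clipped at lambda_3 * (r - d) = 2 * (4 - 2) = 4.
    data_term = np.minimum(residuals**2 / sigma**2, 2.0 * (r - d)).sum()
    # Complexity term: lambda_1 * n * d + lambda_2 * k, with lambda_1 = log(r), lambda_2 = log(r * n).
    complexity = np.log(r) * n * d + np.log(r * n) * k
    return data_term + complexity

# The four structure-and-motion criteria for one frame pair, where residuals_P, ...,
# residuals_R are the residuals of fitting the four models of Table 6.2:
# c_P = gric(residuals_P, k=8, sigma=sigma)   # homography H_P
# c_A = gric(residuals_A, k=6, sigma=sigma)   # affine H_A
# c_S = gric(residuals_S, k=4, sigma=sigma)   # similarity H_S
# c_R = gric(residuals_R, k=3, sigma=sigma)   # three-parameter H_R
```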

Our features include estimates of the three 2D frame motion parameters of the H_R model, and the four GRIC values of the four motion models (Figure 6.4). The frame motion parameters (namely the dilation factor o, the horizontal movement h, and the vertical movement v) have been used in video analysis before [107, 85], e.g. to recognize detail segments [85]. We consider them as the baseline features. Our proposed measurements (c_P, c_A, c_S, and c_R) add 3D structure and motion information to those baseline features. To make the features robust against noisy measurements and to capture the trend in the structure and the motion, we use the mean and variance of the criteria/parameters over a window of frames. This yields a 14-element feature vector for each frame:

F = \left[\, \bar{o}, \bar{h}, \bar{v}, \bar{c}_P, \bar{c}_A, \bar{c}_S, \bar{c}_R,\ \tilde{o}, \tilde{h}, \tilde{v}, \tilde{c}_P, \tilde{c}_A, \tilde{c}_S, \tilde{c}_R \,\right] \qquad (6.3)

where \bar{\cdot} denotes the mean and \tilde{\cdot} the variance of a value over the feature window w_f.
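Purely as an illustration of Equation 6.3 (not the thesis implementation), the per-frame feature vectors could be assembled as follows, assuming per-frame-pair arrays of the motion parameters and GRIC criteria are already available:

```python
import numpy as np

def sm_feature_vectors(o, h, v, c_P, c_A, c_S, c_R, w_f=8):
    """Build the 14-element structure-and-motion feature vector of Eq. (6.3) for
    every frame, using the mean and variance over a trailing window of w_f frames.
    All inputs are 1D arrays with one value per frame pair."""
    signals = np.stack([o, h, v, c_P, c_A, c_S, c_R], axis=1)    # shape: (frames, 7)
    features = []
    for t in range(len(signals)):
        window = signals[max(0, t - w_f + 1):t + 1]              # last w_f frames up to t
        features.append(np.concatenate([window.mean(axis=0),     # means of o, h, v, c_P..c_R
                                        window.var(axis=0)]))    # variances, same order
    return np.asarray(features)                                   # shape: (frames, 14)
```

Whether the window is trailing or centered is a design choice of this sketch; the thesis only specifies its length w_f.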

Classifying frames

There are four classes corresponding to the four types of events listed in Table 6.1. The search acts as “the others” class, containing frames that do not have a clear intention.

Since logs are captured by handheld or head-mounted cameras, the motion in the logs is unstable. Consequently, the input features are noisy. It is hard, even for humans, to classify every class correctly. While the detail class is quite recognizable from its zooming motion, it is hard to distinguish the search and examination classes. Therefore, we expect that the boundary between classes is not well defined by traditional motion features. While the proposed features are expected to distinguish the classes, we do not know which features are best to recognize which classes. Thus, a random forest classifier is a good choice as it is capable of selecting features. In fact, we also carried out the experiment with another popularly used classifier, the support vector machine. The results are indeed better with the random forest classifier. Hence, in Section 6.5, we only present results obtained with the random forest classifier.
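The thesis uses the random forest implementation of the Weka package (Section 6.5); purely as an illustrative sketch, an equivalent setup with scikit-learn in Python (an assumed substitute, not the original toolchain) could look as follows, with placeholder data standing in for the real feature vectors and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: X holds 14-element feature vectors (Eq. 6.3), y holds event labels.
# The label coding 0 = search, 1 = overview, 2 = detail, 3 = examination is illustrative.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 14))
y_train = rng.integers(0, 4, 1000)
X_test = rng.random((200, 14))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
frame_labels = clf.predict(X_test)   # one predicted event class per frame
```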

Merging labeled frames

As mentioned, the captured motion is unstable and the input data for classification is noisy. We thus expect many frames of other classes to be misclassified as search frames. To improve the result of the labeling step, we first apply a voting technique over a window of frames of length w_v, the voting window, to relabel all frames using majority voting. Finally, we group consecutive frames having the same label into events.
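A minimal sketch of these two post-processing steps (illustrative only, not the thesis code), assuming integer class labels per frame:

```python
import numpy as np

def majority_vote(labels, w_v=24):
    """Relabel each frame with the majority label in a centered window of about w_v frames."""
    labels = np.asarray(labels)
    voted = labels.copy()
    half = w_v // 2
    for t in range(len(labels)):
        window = labels[max(0, t - half):t + half + 1]
        voted[t] = np.bincount(window).argmax()
    return voted

def group_into_events(labels):
    """Group consecutive frames with the same label into (start_frame, end_frame, label) events."""
    events, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            events.append((start, t - 1, labels[start]))
            start = t
    return events

# Example: events = group_into_events(majority_vote(frame_labels, w_v=24))
```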

6.4 Mapping Investigation Events to a 3D Model

In this section, we present the method to enhance the comprehensiveness of the investigation in the space dimension by connecting events to a 3D model of the scene, thus enabling interaction with a log in 3D.

For each type of event, we take one or more representative frames to hint at the part of the scene covered by the event and to sketch the camera motion. Overview and detail events are represented by the middle frames of the events. This is based on the assumption that the middle frame of an overview or detail event is close to the average pose of all the frames in the event. Search and examination events are represented by three frames: the first, the middle, and the


last. To visualize the event in space, we have to match those representative frames to the 3D model.

Logs are captured at varying locations and poses in the scene, and video frames are not as clear as high resolution images. Also, the number of images calibrated to the 3D model is limited, so we expect that some representative frames may be poorly matched or cannot be matched at all. This is indeed the case, as we show in the evaluation section. To overcome those problems, we propose a semi-interactive solution containing two steps (Figure 6.5): (i) automatically map as many representative frames as possible to the 3D model, and then (ii) let users interactively adjust the predicted camera poses of the other representative frames.

6.4.1 Automatic mapping of events

Since our 3D model is built using an image-based method [32], the frame-to-model mapping is formulated as image-to-image matching. Note that color laser scanners also use images, calibrated to the scanning points, to capture color information. So our solution is also applicable to laser scanning based systems.

Let I denote the set of images from which the model is built, or more generally a set which is calibrated to the 3D model. Matching a representative frame i to one of the images in I enables us to recover its camera pose. To do that, we use the well-known SIFT detector and descriptor [93], of which we studied the characteristics in Chapter 3. First, SIFT keypoints and descriptors are computed for representative frame i and every image in I. Keypoints of frame i are initially matched to keypoints of every image in I based only on comparing descriptors [93]. Correctly matched keypoints are then found by robustly estimating the geometric constraints between the two images [61]. There might be more than one image in I matched to frame i. Since one matched image is enough to recover the camera pose, we take the one with the most correctly matched keypoints, which potentially gives the most reliable camera pose estimation.

Once representative frames are matched to images from which the 3D model is built, we have in fact matched frames to 3D points in the 3D model. Thus, we can estimate the camera pose of each matched frame using the 5-point algorithm [141].
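The following sketch illustrates the automatic matching step with OpenCV's SIFT implementation (an assumption about tooling, not the thesis code): it selects, for one representative frame, the calibrated image with the most matches. For brevity it uses Lowe's ratio test rather than the robust geometric verification of [61], and the subsequent pose recovery is omitted:

```python
import cv2

def best_calibrated_match(frame, calibrated_images, ratio=0.75):
    """Return the index of the image in I with the most SIFT matches to the frame,
    and the number of matches. Images are assumed to be 8-bit grayscale arrays."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    _, frame_desc = sift.detectAndCompute(frame, None)
    best_idx, best_count = None, 0
    for idx, image in enumerate(calibrated_images):
        _, image_desc = sift.detectAndCompute(image, None)
        if frame_desc is None or image_desc is None:
            continue
        matches = matcher.knnMatch(frame_desc, image_desc, k=2)
        good = [m for m, n in (p for p in matches if len(p) == 2)
                if m.distance < ratio * n.distance]
        if len(good) > best_count:
            best_idx, best_count = idx, len(good)
    return best_idx, best_count
```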

6.4.2 Interactive mapping of events

To overcome the missing matches between some events and the 3D model, we employ user interaction. The simplest way is to ask the user to navigate the 3D model to viewpoints close to the representative frames of those events. However, this is ineffective as the starting viewpoints could be far from the appropriate ones.

Since the log is continuously captured and events are usually short, the camera poses of nearby events are close. We can exploit this to reduce the time users need to navigate to find those viewpoints (Figure 6.6). For each representative frame that is not mapped to the 3D model, we search backward and forward to find the closest mapped representative frames (both automatically mapped and previously interactively mapped, in terms of frames). We use the camera poses of these closest representative frames to initialize the camera pose of the unmapped representative frame. There are 6 parameters defining a camera pose: 3 defining the coordinates in space, and 3 defining the camera orientation/rotation.



Figure 6.5: Mapping events to a 3D model includes an automatic mapping that paves the way for interactive mapping.

We interpolate each of them from the parameters of the two closest known camera poses:

p_u = \frac{p_i d_j + p_j d_i}{d_i + d_j} \qquad (6.4)

where p_u is a parameter of the unknown camera pose; p_i and p_j are the same parameters of the two closest known camera poses; and d_i and d_j are the frame distances to the frames of those known camera poses.

Applying this initialization, as illustrated in Figure 6.6, we utilize automatically mapped and previously interactively mapped results to reduce the interaction time needed to register an unmapped frame.
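A minimal sketch of this initialization (illustrative only), linearly interpolating all six pose parameters according to Equation 6.4:

```python
def interpolate_pose(pose_i, pose_j, d_i, d_j):
    """Initialize the pose of an unmapped frame from its two closest mapped frames (Eq. 6.4).
    pose_i, pose_j : 6-element parameter lists (3 position + 3 orientation)
    d_i, d_j       : frame distances to those mapped frames"""
    return [(p_i * d_j + p_j * d_i) / (d_i + d_j)
            for p_i, p_j in zip(pose_i, pose_j)]

# Example: a frame 10 frames after mapped frame i and 30 frames before mapped frame j
# is initialized three quarters of the way from pose_j towards pose_i.
# pose_u = interpolate_pose(pose_i, pose_j, d_i=10, d_j=30)
```

Linearly interpolating the three orientation parameters is a simplification in this sketch that is reasonable for the small pose differences expected between nearby frames.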

Having the camera poses of the frames, we can visualize them as frustums hinting at the camera position and field of view. Also, we can now compute which representative frames (events) cover which points in the 3D model. This information is useful for 3D log navigation.

6.5 Evaluation

In this section, we detail the implementation and give an evaluation of the log analysis method (Section 6.3) and the method to connect these logs to a 3D model (Section 6.4).



Figure 6.6: Manually mapping an event to the 3D model could be hard (a). Fortunately, automatic mapping could provide an initial guess (b) that gives more visual similarity to the frames of the event. From there, users can quickly adjust the camera pose to a satisfactory position (c).

6.5.1 Dataset

In order to obtain clear ground truth for training, we capture a set of videos of separate types of events. The setup is a typical office scene, captured using a handheld camera. In total there are more than 15 thousand frames captured purely for training. For testing, we capture logs in the same office scene. Furthermore, we had crime investigators and others capture logs in fake crime scenes. Those logs in total are about one hour of video. The ground truth of those logs is obtained by manual segmentation.


6.5.2 Analyzing investigation logs

Criteria

We evaluate the log analysis at the two stages of the algorithm, namely classifying frames and segmenting logs. For the former, we look at the frame classification result. For the latter, which is more important as it is the purpose of the analysis, we use three criteria to evaluate the quality of the resulting investigation story: the completeness, the purity, and the continuity.

To define those criteria, we first define what we mean by a correct event. Let S = {s_1, s_2, ..., s_k} denote a segmentation. Each event s_i has a range r_i (a tuple composed of the start and end frames of the event) and a class l_i. Now let two segmentations \hat{S} and \bar{S} be given. A segment \hat{s}_i is considered correct with respect to the reference segmentation \bar{S} if there exists an event \bar{s}_j in \bar{S} that sufficiently overlaps \hat{s}_i and has the same class label. The condition is:

\alpha(\hat{s}_i, \bar{S}) =
  \begin{cases}
    1 & \text{if } \exists\, \bar{s}_j : \dfrac{|\hat{r}_i \cap \bar{r}_j|}{\min(|\hat{r}_i|, |\bar{r}_j|)} > k \,\wedge\, \hat{l}_i \equiv \bar{l}_j \\
    0 & \text{otherwise}
  \end{cases}
\qquad (6.5)

where |.| is the number of frames in a range, and k indicates how much the two events must overlap. Here we use k = 0.75.

Now suppose that \hat{S} is the result of automatic segmentation and \bar{S} is the reference segmentation. The completeness of the story, C, showing whether all events are found, is defined as the ratio of segments of \bar{S} correctly identified in \hat{S}:

C = \frac{\sum_{i=1}^{|\bar{S}|} \alpha(\bar{s}_i, \hat{S})}{|\bar{S}|} \qquad (6.6)

The purity of the story, P, reflecting whether the identified events are correct, is defined as the ratio of segments of \hat{S} correctly identified in \bar{S}:

P = \frac{\sum_{i=1}^{|\hat{S}|} \alpha(\hat{s}_i, \bar{S})}{|\hat{S}|} \qquad (6.7)

where |S| is the total number of segments in a segmentation S.

The last criterion is the continuity of the story, U, reflecting how well events are recovered without being broken into several events or wrongly merged. It is defined as the ratio of the number of events in the result to the number in the ground truth:

U = \frac{|\hat{S}|}{|\bar{S}|} \qquad (6.8)

If U is greater than 1.0, the number of events in the result is greater than the real number of events, implying that there are false alarm events. When U is less than 1.0, the number of events found is less than the actual number of events, implying that we miss some events. An important restriction on U is that we do not want a high value of U, as the number of events should be manageable for reviewing logs. A perfect result has all criteria equal to 1.0.
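For concreteness, a small Python sketch (not from the thesis) that computes the three criteria for a pair of segmentations, each given as a list of (start frame, end frame, label) events:

```python
def alpha(seg, reference, k=0.75):
    """Eq. (6.5): 1 if some reference event overlaps seg enough and has the same label."""
    def overlaps(a, b):
        (a0, a1, la), (b0, b1, lb) = a, b
        inter = max(0, min(a1, b1) - max(a0, b0) + 1)      # frames shared by both ranges
        shorter = min(a1 - a0 + 1, b1 - b0 + 1)            # length of the shorter event
        return la == lb and inter / shorter > k
    return 1 if any(overlaps(seg, ref) for ref in reference) else 0

def story_criteria(result, reference, k=0.75):
    """Completeness C (Eq. 6.6), purity P (Eq. 6.7) and continuity U (Eq. 6.8)."""
    C = sum(alpha(ref, result, k) for ref in reference) / len(reference)
    P = sum(alpha(seg, reference, k) for seg in result) / len(result)
    U = len(result) / len(reference)
    return C, P, U
```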


                             Accuracy
Baseline                     0.596
Proposed features            0.710
Proposed features & voting   0.755

Table 6.3: Accuracy of the frame classification.

Implementation

Results presented here are produced with feature window w_f = 8 and voting window w_v = 24 (i.e. 1 second) (Section 6.3.2). The motion fields are estimated using OpenCV's implementation of the Lucas-Kanade method∗. Of course, other implementations can also be used. We use the random forest classifier implemented in the Weka package†.

Results

The accuracies of classifying frames using the baseline features and using the proposed features, with and without voting, are given in Table 6.3. When using only 2D frame motion features, the accuracy of the frame classification is 0.60. Our proposed structure and motion features improve the accuracy to 0.71. Looking into the confusion matrices (Table 6.4 a,b) we see that the recall of most classes is increased. The largest improvement is for the recall of the search class, increasing from 0.639 to 0.868. The recall of the examination class is increased considerably, from 0.074 to 0.195. Most of the incorrect results are frames misidentified as search frames. As mentioned, this is an expected problem as video logs captured from handheld or head-mounted cameras are unstable. After we apply voting (Table 6.4 c), the overall accuracy is further improved to 0.755. The recall of the examination class is decreased. However, as shown later, with voting the final result is improved overall.

We evaluate the log segmentation with the overlap threshold k set to 0.75. The results are given in Table 6.5, including results with and without post-processing by voting with window w_v = 24. Without post-processing the completeness is C = 1.0. As it is important in reviewing an investigation to find all events, this is a very good result. The purity of the story is reasonable, P = 0.65. However, the number of events is extremely high compared to the ground truth (U = 17.41). This is undesirable as it would take much time to review the investigation. Fortunately, applying the voting technique, the number of identified events is much lower and acceptable (U = 2.16), while the completeness remains perfect. The purity P is decreased to 0.58. This is practically acceptable as users can correct the false alarm events during reviewing. Table 6.6 gives a detailed evaluation for each class before and after voting. After applying voting, P is slightly decreased for all classes, while U is greatly reduced towards 1.0, the perfect value.

The results presented above are the merged results of the data captured by ourselves in a lab room and the data captured by different people in the fake crime scene. This is because we found no significant difference between them (the average accuracy is only about one percent better for the data we captured ourselves). This shows that the method is stable.

∗ http://opencv.willowgarage.com
† http://www.cs.waikato.ac.nz/ml/weka


Search Overview Detail Examination
Search 0.639 0.136 0.174 0.051
Overview 0.476 0.325 0.131 0.068
Detail 0.229 0.053 0.660 0.058
Examination 0.535 0.077 0.314 0.074
(a)

Search Overview Detail Examination

Search 0.868 0.046 0.046 0.041

Overview 0.502 0.497 0.000 0.001

Detail 0.537 0.016 0.328 0.119

Examination 0.723 0.029 0.053 0.195

(b)

Search Overview Detail Examination

Search 0.931 0.028 0.026 0.015

Overview 0.515 0.485 0.000 0.000

Detail 0.649 0.012 0.280 0.059

Examination 0.834 0.000 0.022 0.144

(c)

Table 6.4: Classification results (confusion matrices): (a) using only 2D motion parameters as features (baseline), (b) using the proposed structure and motion features, and (c) using the proposed features and voting. Increased and decreased recall (compared to the baseline) are in bold and italic respectively. The recall is improved for most of the classes, especially the hard examination class.

C P U

Before voting 1.00 0.65 17.41

After voting 1.00 0.58 2.16

Table 6.5: Log segmentation result with and without applying voting.

6.5.3 Mapping events to 3D model

In terms of representative frames, the percentage of frames mapped to the 3D model is 70.4 percent, of which 20.5 percent is mapped automatically. The percentage of frames that cannot be matched due to lack of overlap, i.e. no visual clue to map at all, is 29.6 percent. This results in about 80 percent of the events being mapped to the 3D model (Table 6.7). Table 6.7 also provides more insight into the mappability of each type of event. All the overview events are matched, either automatically or interactively. The examination events are the hardest: none of them is matched automatically. This is due to the fact that those events are usually captured at close distance, while the panorama images are captured with a wide view. Thus, the scale difference is beyond the range of the SIFT descriptor. A solution could be to work with panoramas at higher resolution.

In conclusion, more than 80 percent of the events can be mapped to the 3D model, of which about 25 percent is done automatically. This provides sufficient connection to represent a log in a 3D model, giving us a spatio-temporal representation of the investigation for review.


             C     P     U
Search       1.00  0.68  16.57
Overview     1.00  0.66  139.50
Detail       1.00  0.71  7.41
Examination  1.00  0.54  54.33
(a) Before voting

             C     P     U
Search       1.00  0.58  2.13
Overview     1.00  0.61  19.00
Detail       1.00  0.68  1.15
Examination  1.00  0.33  4.00
(b) After voting

Table 6.6: Log segmentation evaluation per class.

             Map                      Miss
             Automatic  Interactive
All events   20.9       61.0          18.1
Search        9.6       29.4           6.8
Overview      1.7        1.1           0.0
Detail        9.6       28.2          10.2
Examination   0.0        2.3           1.1

Table 6.7: Percentage of events mapped automatically or interactively, and of events that cannot be mapped. In total 81.9 percent of the events are mapped and 18.1 percent are missed.


6.6 Navigating Investigation Logs

We describe here our navigation system for investigation logs. As discussed, the system aims to enable users to re-visit and re-investigate scenes. The user interface, shown in Figure 6.7, includes the main window showing a 3D model of the scene and a storyboard at the bottom showing events in chronological order. These two components present the investigation in space and time. Users navigate an investigation via interaction with the two components. When the user selects one event, camera frustums are displayed in the model to hint at the area in the scene covered by that segment. Vice versa, when the user clicks at a point in the model, the log segments covering that point are highlighted and the camera frustums of those events are displayed. These interactions visualize the relation between the scene and the log, i.e. the spatial and the temporal elements of the investigation. To take a closer look at a segment, users click on its camera frustum to transform it into a camera viewpoint and watch the video segment in an attached window. These interactions are demonstrated in the accompanying video.



Figure 6.7: The log navigation allows users to dive into the scene (a), check the events related to a location (b), and watch the video segment of an event and compare it to the scene (c).

6.7 Conclusion

We propose to use a combination of video logs and 3D models, coined 3D event logs, to provide a new way to do scene investigation. The 3D event logs provide a comprehensive representation of an investigation process in time and space, helping users to easily get an


overview of the process and understand its details. To build such event logs we have to overcome two problems: (i) decomposing a log into investigation events, and (ii) mapping those events into a 3D model.

By using novel features capable of describing scene structure and camera motion, together with machine learning techniques, we can classify frames into event classes with more than 70 percent accuracy. This helps to recover the investigation story completely, with fairly good purity. To map events to a 3D model, we use a semi-interactive approach that combines automatic computer vision techniques with user interaction. More than 80 percent of the events in our experimental logs were mapped into a 3D model of the scene, providing a presentation that supports reviewing well.
