Bachelor thesis

Artificial Intelligence

v1.0

Radboud University

DupLog: probabilistic logical interpretation of Duplo assemblies from 3D vision

Author:

Sil van de Leemput (s4085469)

SilvandeLeemput@gmail.com

Supervisor/assessor:

Dr. ir. Martijn van Otterlo

M.vanOtterlo@donders.ru.nl


Abstract

Composite real-world objects can be logically explained by their components and their relations. Knowing and expressing the logical relations between the components can facilitate 3D object recognition employing semantic information. DupLog is a novel method for 3D model acquisition of Duplo block assemblies from video. It acquires percepts through a Kinect camera and translates these inputs into a structured representation through octrees and logical constraints. The high-level, declarative (logical) constraints are represented in a probabilistic logic and provide a new principled way for 3D object recognition. The system is efficient and effective in the interpretation of arbitrary Duplo buildings.

Keywords: 3D object recognition, model acquisition, probabilistic logic, knowledge base, octrees, Kinect, Lego Duplo, Problog


Contents

1 Introduction
  1.1 Thesis goals
  1.2 DupLog
  1.3 Thesis outline

2 Related work
  2.1 Computer Vision
  2.2 Structured representations of objects
  2.3 Duplo/Lego domain

3 Problem specification
  3.1 System overview
  3.2 3D bin grid
  3.3 Assembly construction
    3.3.1 Set restrictions
    3.3.2 Construction restrictions
    3.3.3 Restrictions as assumptions
  3.4 Output
  3.5 From vision data to logical representation
    3.5.1 Point clouds
    3.5.2 Octrees

4 Feature extraction from vision
  4.1 Step I: percept acquisition and model integration
    4.1.1 Calibration and preprocessing
    4.1.2 Alignment to bin grid
  4.2 Step II: octree segmentation
    4.2.1 Color categorization

5 High-level interpretation using probabilistic logic
  5.1 Logic for image interpretation
  5.2 Step III: block grouping

6 Implementation
  6.1 Physical setup
    6.1.1 Hardware
  6.2 Software
    6.2.1 Project organization
    6.2.2 Data-flow diagram
    6.2.3 Visualizations

7 Functionality
  7.1 Using DupLog
    7.1.1 Main panel
    7.1.2 Fnect-settings panel
  7.2 Common interactions
    7.2.1 Calibration of the camera
    7.2.2 Calibrating the octree settings
    7.2.3 Calibrating the qualitative colors
    7.2.4 Scanning a Duplo assembly

8 Results
  8.1 Accuracy
    8.1.1 Incorrect alignment
    8.1.2 Noise
    8.1.3 Occlusion
    8.1.4 Assembly ambiguity
    8.1.5 Color mismatch
  8.2 Run-time complexity

9 Discussion
  9.1 Strengths
  9.2 Limitations
  9.3 Future research
    9.3.1 Vision
    9.3.2 Logic
  9.4 Conclusions

10 Acknowledgment

Bibliography

A Appendix
  A.1 Software table
  A.2 Kinect calibration theory
  A.3 Problog code listing
    A.3.1 Block grouping


Chapter 1

Introduction

There is an increasing demand for real-world computer systems that are able to interpret visual imagery from the world around them. Exemplary applications are: autonomous cars that can navigate traffic (e.g. Google driverless cars, DARPA challenges), surveillance cameras that detect criminal activity, robots that navigate the streets in the same way humans do (e.g. OBELIX 1), medical diagnostic systems that can automatically detect symptoms and even interactive games like Life of George 2. These systems often need to not only interpret their surroundings, but also act on them autonomously (e.g. stopping for a red traffic signal, alarming the police, avoiding pedestrians, giving a score), which stresses the importance of their capability to correctly model and interpret relevant cues from their environment.

The field of computer vision deals with making a computer system that can interpret (acquire and analyze) visual scenery. This is a very challenging and difficult task, since it is basically an inverse problem, where observed measurements (raw image data) are turned into information about physical objects [17]. It is additionally challenging because the information needed is often unknown and there is insufficient information to fully specify the solution, which introduces uncertainty and probabilities. What makes it even more challenging is that such systems are often required to combine a high accuracy (recognition rate) with fast (often real-time) processing speed, while remaining flexible enough to deal well with variations and uncertainties [5]. Both the interpretation speed and the recognition accuracy must be adjusted to the application domain. For example, a driverless car should recognize a stop signal quickly enough to be able to stop in time. The recognition rate must be good as well, since it is not acceptable for a driverless car to miss an oncoming vehicle. For an automated medical diagnostic system, on the other hand, accuracy is more important than speed. For both the driverless car and the diagnostic system it holds that they should be flexible enough not to fail when the object to be detected is illuminated or oriented differently.

In contrast, humans are very good at interpreting what they see of the 3D world around them.

1. More on the OBELIX robot can be found at: http://www.sickinsight-online.com/world-first-robot-obelix-moves-through-freiburg-autonomously/

2. Life of George is a free iOS/Android application released in October 2011 that employs the default camera present on the target platform. The program shows an image of a Lego brick assembly that the user should mimic using real Lego bricks, after which the program scans the assembly and presents a score based on the performance (i.e. accuracy and time) of the user. More can be found at: http://george.lego.com/


They typically do this by employing knowledge about the world. A car is more likely to be found on a road than in the water, houses have doors and windows, if there is fire then there is smoke, etcetera. Using the same principle of utilizing declarative, high-level knowledge is highly desirable in computer vision, although the inherently noisy and uncertain nature of vision data has been an obstacle to actually applying such knowledge in many vision contexts. Yet, a large set of vision problems is about structured objects where such high-level knowledge can be applied. An example of such a structured domain is houses. Knowing about spatial relations (e.g. exterior doors are mostly found on the first floor, a roof is on top of the walls of a house) can help to localize and explain low-level features.

Applying probabilistic knowledge to facilitate computer vision is not a new idea; there are works that demonstrate how to use logically expressed background knowledge to interpret visual features from scenes (e.g. Bayesian networks). The question is rather what a good approach to doing this looks like. Advances in logical problem solvers make it possible to efficiently solve high-level scene interpretation problems [16] by applying abductive reasoning with probabilities. Often these approaches are designed for implementation with constraint solvers like Prolog, where probabilistic and first-order logic facts are combined to tackle scene interpretation problems given as a set of facts, returning the set of all possible interpretations and their probabilities. Less work is focused on a principled method or system that combines computer vision methods and logical (probabilistic) constraint solvers in order to interpret visual scenery.

1.1 Thesis goals

There clearly is a demand for flexible computer systems that can interpret objects from visual scenery, and there are expressive probabilistic first-order logic approaches available for scene interpretation. Therefore the first goal of this thesis is to combine those to provide a principled method for creating computer vision systems for object recognition that utilize knowledge expressed as first-order logical constraints. The second goal of this thesis is to provide a working example implementation of such a system, called DupLog, that uses Duplo blocks as structured domain and a Kinect camera to acquire features from.

These goals give rise to the following research questions:

1. How to combine low-level computer vision with high-level probabilistic first-order logic?

2. How to represent the output descriptions for Duplo assemblies (i.e. what is “correct”)?

3. How to model the domain knowledge for the Duplo blocks?

4. How to integrate all components into a working computer program?

5. How to acquire features from the Kinect?


1.2 DupLog

The name DupLog 3 is composed of the words Duplo and Logic. DupLog is an object recognition system for interpreting real Duplo assemblies by observing them with a Kinect sensor, using domain-related background knowledge represented as logical probabilistic constraints. Duplo structures are captured using a Microsoft Xbox Kinect camera. This popular device provides a way to register not only color images, but also depth information for all objects within an area of 6 m² in the view of the sensor. Through the Kinect, DupLog can capture multiple views of point cloud information from Duplo assemblies and combine those into an integrated model. That model is then segmented into bins using an octree data structure. The bins are subsequently grouped into blocks using a constraint solver, and finally grouped into an assembly representation by applying background knowledge related to Duplo.

For the implementation the highly structured Duplo blocks domain is used. Duplo blocks are Lego blocks for small children and are twice the size of Lego. Both Duplo and Lego are distinctly colored building blocks that can be clamped onto each other to create a large number of different assemblies. They are excellent objects for computer vision, since they are regularly shaped, have distinct colors and are easy to provide constraints for that facilitate recognition by a computer.

1.3 Thesis outline

At this point the reader probably has many questions. Therefore a small roadmap of the thesis is given here. The main focus is on DupLog, explaining the underlying principles as the thesis progresses.

In chapter 2 this work is compared with other relevant scientific work in the fields of computer vision, object recognition and probabilistic logic. It partially answers the first research question.

Chapter 3 gives the problem definition and the fundamental approach of how to combine computer vision with first-order logic, gives the chosen output representations and gives the assumptions on which the domain knowledge is built. This effectively answers the first three research questions.

Detailed descriptions of how features are extracted from the Kinect and how the features are processed can be found in chapter 4 and chapter 5, answering question five and also question one in more detail.

The implementation details about the technical framework, tools and programs used are mainly discussed in chapter 6, answering research question four. Chapter 7 explains how to utilize DupLog in a general sense and functions as a manual.

Chapter 8 presents results for the accuracy, flexibility and efficiency of DupLog. It gives an answer to the last research question.

3. A paper about DupLog has recently been submitted to the 25th Benelux Conference on Artificial Intelligence (BNAIC 2013) [18].


In chapter 9 a detailed discussion of DupLog is given, reflecting on the work from various angles.

Additional information can be found in appendix A. It contains information about calibration of the Kinect lenses, a list of the software used for DupLog with related web links, and a listing of the ProbLog programs for DupLog.


Chapter 2

Related work

This chapter describes the relation of this thesis to other scientific work. It is situated at the intersection of computer vision, structured (logical) representations of objects, and the Duplo/Lego domain. The system is built upon the idea that many visual scenes are best described using high-level representational devices such as graphs, and even more generally using logical languages. Each of these topics is explored and related to this work.

2.1 Computer Vision

One of the most challenging goals for computer vision is generic object recognition, where the task is to recognize and classify instances or classes for any set of data. Despite much research, intelligent, practical and real-time object detection and recognition methods are still not available for the general case (e.g. [14, 17]). The problem with creating computer vision systems that are accurate, flexible and fast is that computer vision is an “inverse problem” with a lot of ambiguity and uncertain factors, since the richness of visual imagery is huge [17]. Therefore a lot of computer vision systems are designed only for very specific tasks, usually under very controlled conditions that simplify the recognition task. For example, such systems demand that the lighting conditions are always the same, or that the objects are always oriented in the same way, etcetera.

Throughout the years general computer vision [17] has matured enormously and produced a vast array of object recognition techniques for recognizing objects from 2D and 3D image data. The introduction of devices like the Kinect camera, which provide both RGB and depth information, also gave rise to many new techniques that learn feature descriptors from such data (e.g. [3]), such as convolutional k-means. The issue with many of those techniques is that they are optimized for specific applications under specific circumstances and can be difficult to transfer to new applications.

2.2 Structured representations of objects

Structured (logical) representations (see [13] for an overview) are less prominent in the vast literature. Older work in structured pattern recognition [4] already acknowledged the usefulness of structured representations (such as graphs) for visual data, but lacked the power and flexibility of modern probabilistic reasoning systems for such structured data (see [6] for good starting points to the literature).

Increasingly many examples of using high-level, structured representations in computer vision have appeared in the literature in recent years: high-level scene perception using description logics [11] or using a hierarchical approach [1]. A recent trend in the neighboring field of robotics is to employ such representations for affordances [10], i.e. representing objects in terms of what one can do with them (for example, a chair affords one to sit on it). These systems are based on similar visual input, and utilize structured representations of recognized objects for subsequent tasks.

Three-dimensional objects can be represented as a wireframe, a constructive solid geometric representation, a spatial-occupancy representation or a surface boundary representation [2]. A wireframe representation consists of a list of vertex points and a list of edges that indicate the connections between vertex points. A constructive solid geometric representation models objects out of volumetric primitives like cubes, spheres, cones, etcetera. Spatial-occupancy representations have non-overlapping units that indicate the occupancy of space, e.g. point clouds. A point cloud is a set of points in a 3D volume. Surface boundary representations define a 3D object using surfaces. DupLog uses a spatial-occupancy representation, since Duplo blocks fit well in a 3D cube-shaped grid.

2.3 Duplo/Lego domain

There are also numerous works on the Lego domain, but only a few are relevant for this thesis. Some works aim to generate structures, for example using a (population-based) search method. In the work of Peysakhov et al. [12] a representation of Lego block assemblies is described, which was adopted to represent Duplo assemblies. The representation is slightly altered to be better suited for DupLog.

For the interpretation of assemblies from visual input, two very recent approaches exist: KinectFusion by Izadi et al. [7] and the work of Miller et al. [9] both target this problem. Their systems are able to capture percepts and build a model of scene objects in real time. The work of Miller et al. is especially relevant for this thesis, since it shows an efficient way to acquire a model of Duplo block assemblies by introducing some constraints. They also describe improvements in combining multiple percepts into an integrated model in real time. However, neither method utilizes the unique combination of probabilistic logic and octrees to acquire the models.


Chapter 3

Problem specification

At a very high level the goal of this thesis is to create a system that acquires the correct representations of real-world objects, given background knowledge represented as logical constraints on how (sub)parts of objects may relate to each other. The relations might for example be spatial or color specific. This high-level goal is good to keep in mind, but is far too broad for this work; therefore Duplo assemblies are used as the real-world objects that serve as input. A Kinect camera is used to acquire relevant RGB and depth data from such assemblies. A Duplo assembly is a construction that consists of one or more Duplo blocks that are assembled in an arbitrary way. To simplify the recognition task the number of “valid” assemblies is limited to a restricted set (i.e. all assemblies that can be created from a subset of 9 Duplo blocks) that should satisfy certain construction restrictions (e.g. Duplo blocks may only be positioned with their studs facing up). The problem statement can now be formulated as follows.

Given an input consisting of raw RGB and depth data of any Duplo block assembly A satisfying the set and construction restrictions, acquired from a Kinect camera, construct a logical representation R(A) in a representation like the one used by Peysakhov et al. [12], using a logical background theory about relations and constraints concerning Duplo blocks.

Further specification requires a definition of the logical constraints, making the set restrictions and the construction restrictions explicit, and the definition of a method to translate A into R(A), all of which can be found in the following sections. First an overview of the entire interpretation process is given, with a compact description. Then the components of the interpretation process are discussed in more detail, starting with the 3D bin grid G. Next, the constraints for constructing a valid assembly input A are given, after which the output representation R(A) is discussed. Finally the intermediate steps for translating a valid assembly A, through a Kinect and subsequent processing, into a “correct” representation R(A) using point clouds and octrees are described.


Figure 3.1: Global overview of the interpretation process in DupLog.

3.1 System overview

Figure 3.1 shows the complete interpretation process in DupLog. On the top right we see a possible Duplo assembly A and a Kinect camera that captures “real world” percepts (RGB and depth images) of it. The blue blocks under the Kinect camera represent algorithms that together translate the raw data from the Kinect camera into a logical representation of a Duplo assembly (“assembly representation” R(A)). These algorithms are grouped into four steps, each of which forms a logically separable subpart of the system.

Starting with the “raw data” from the Kinect camera, the data are translated in each step into another representation and/or type. The type and visualization associated with each representation can be seen on the right half of Figure 3.1. First, raw “real world” percepts of a Duplo block assembly come from the Kinect camera (at the top), which gives both depth and color images that are mapped from 2D to 3D into point clouds, additionally applying some linear filters and corrections. A point cloud is a collection of points (called voxels) within a 3D volume. One point cloud percept only contains the surfaces of one side of a Duplo block assembly, since a Kinect camera can only face one side of an assembly at a time. Therefore, multiple such point clouds are combined into a single point cloud (completing step I). After multiple percepts are combined, octree segmentation (step II) is applied to create a bin representation, by segmenting the point cloud (using a data structure called an octree) into equally distributed non-overlapping subvolumes and by representing these subvolumes (called bins) as logical facts. Steps I and II represent the “vision” part of the system and are discussed in chapter 4.

In the next step the bin facts are translated into 2 × 2 Duplo block predicates (step III). Finally the 2 × 2 Duplo block predicates are translated into the “correct” output assembly representation (step IV). The last two steps employ a knowledge base that contains facts related to Duplo blocks as background knowledge to calculate the “correct” output assembly representation R(A). Both steps are discussed in chapter 5.

3.2 3D bin grid

Most of the representations used for DupLog relate in some way to a 3D occupancy grid G, which makes it important to define this grid. Figure 3.2 shows the 3D bin grid G in model coordinates. The grid G can be described as a 3D volume segmented into N1 × N2 × N3 subvolumes called bins, with N1, N2, N3 ∈ N and where dx, dy, dz are the dimensions of one bin in the grid, shown on the right in Figure 3.3. One bin represents one out of eight parts of a 2 × 2 Duplo block in world dimensions (excluding the studs). Since 2 × 2 Duplo blocks have dimensions of 31.7 × 31.7 × 19.2 millimeters (mm) 1, this gives dx = dz = 15.85 mm and dy = 9.6 mm. G can be interpreted as an occupancy grid for all points of the directly visible surfaces. The system treats the center of the bottom surface of the grid (i.e. (0, 0, 0)) as the origin.
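To make the bin geometry concrete, the following small C++ sketch (illustrative only, not code from the DupLog implementation) maps a point in model coordinates (in mm, with the origin at the center of the bottom surface of G) to its bin indices:

#include <cmath>

struct BinIndex { int x, y, z; };

// dx = dz = 15.85 mm and dy = 9.6 mm as derived above; the x and z indices can
// be negative because the origin lies at the center of the grid surface.
BinIndex toBin(float xmm, float ymm, float zmm) {
    const float dx = 15.85f, dy = 9.6f, dz = 15.85f;
    return { static_cast<int>(std::floor(xmm / dx)),
             static_cast<int>(std::floor(ymm / dy)),
             static_cast<int>(std::floor(zmm / dz)) };
}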

Figure 3.2: A visualization of the 3D bin grid G with a green (upper right corner) and a blue (lower left corner) 2 × 2 Duplo block. It also shows the capture orientations (front, left, back and right) and the model coordinates.

1. The dimensions for the Duplo blocks were adopted from the following website:


3.3 Assembly construction

Any input assembly A can be made from a fixed (and known) set of nine Duplo blocks. This restriction simplifies the task of recognizing the assembly by ruling out many hypotheses, while it still allows for at least a couple of hundred different assemblies. Furthermore, the construction of A assumes that assemblies are placed flat on the grid surface and that all studs of the Duplo blocks point up. All the flat surfaces (i.e. not those of the studs) of the Duplo blocks have to be perpendicular or parallel to each other and have to be correctly aligned to the 3D bin grid G. A complete and explicit list of the constraints concerning the set is given by the set restrictions; the construction and positioning constraints are given by the construction restrictions. By satisfying the constraints from both lists one creates a “valid” assembly that is correctly positioned.

3.3.1 Set restrictions

Figure 3.3: On the left the set D of Duplo blocks used for making assemblies. On the right the dimensions of a bin from 3D bin grid G relative to a 2 × 2 Duplo block.

Each assembly should be made from blocks that are a subset of finite set D that consists of the following 9 blocks (shown on the left in Figure 3.3):

• 4 Duplo blocks of 2 × 2, one of each color: red, green, blue and yellow.

• 4 Duplo blocks of 2 × 4, one of each color: red, green, blue and yellow.

• 1 additional red 2 × 4 Duplo block.

The constraints that arise from picking these blocks from the finite set D give rise to the set restrictions (SR):

SR1. An assembly A must have blocks so that A ⊆ D.

SR2. Each block has one qualitative color that can either be red, green, blue or yellow.

SR3. The smallest block is 2 × 2.


SR4. The largest block is exactly twice the size of the smallest block and can be oriented as 2 × 4 or as 4 × 2.

The restrictions can be seen as assumptions that relate to the steps briefly described in section 3.1. Assumption SR2. is exploited by steps II, III and IV, and assumptions SR3. and SR4. are used in steps III and IV to give better estimates. Assumption SR1. is only used in step IV to give better explanations, but this assumption could easily be removed from the knowledge base to allow for larger sets of Duplo blocks (while SR2., SR3. and SR4. still hold).

3.3.2 Construction restrictions

The construction and placement of Duplo assemblies requires a little elaboration, since each assembly should satisfy all of the following construction restrictions (CR):

CR1. Blocks must be positioned in bin grid G so that for each bin b of each block it holds that −n ≤ bx < n ∧ −n ≤ bz < n, where n ∈ N, n = 5, and bx and bz are respectively the x- and z-coordinate of b within G.

CR2. Small blocks occupy a 2 × 2 × 2 cluster of exactly 8 bins within G.

CR3. Large blocks occupy a 2 × 2 × 4 or a 4 × 2 × 2 cluster of exactly 16 bins within G.

CR4. Blocks must always be positioned with studs facing up.

CR5. Blocks on the floor (within G with a y-coordinate of 0) are always supported.

CR6. Blocks not on the floor (with a y-coordinate greater than 0) must be supported by other blocks, that is, there should either be a supported block directly under or directly above such a block.

CR7. Blocks cannot occupy the same bins within G.

CR8. The position (0, 0, 0) of G aligns with the origin (center) of the grid surface.

CR9. The grid G only contains Duplo blocks within the boundary defined by CR1.

Note that since G has a determined position relative to the grid surface (i.e. because of CR8.), all the construction restrictions say something about the positioning and the alignment of assemblies. CR1. entails that each block should be placed within a square boundary around the origin depending on n. The value for n can however easily be changed to allow broader assemblies. CR2. and CR3. do not only express how much space blocks occupy within bin grid G, but also that blocks should be perpendicular or parallel to each other, since otherwise they would occupy more bins than these constraints enforce. These two constraints are also used to group bins into blocks. CR4. also demands correct alignment of the blocks. CR5. and CR6. make sure that there are no “floating” blocks. CR7. says in short that once a bin in grid G is occupied by a block it cannot also be occupied by another block. CR9. enforces that no other objects than Duplo blocks are placed within the bounds of bin grid G.


3.3.3 Restrictions as assumptions

Both types of restrictions serve as assumptions that can be actively used within the interpretation steps (especially the logical part). This is illustrated in Table 3.1.

Assumption   Step I   Step II   Step III   Step IV

SR1.  ×  ×
SR2.  ×  ×  ×
SR3.  ×  ×
SR4.  ×  ×
CR1.  ×
CR2.  ×  ×
CR3.  ×
CR4.
CR5.  ×
CR6.  ×
CR7.  ×  ×
CR8.  ×  ×  ×  ×
CR9.  ×  ×  ×  ×

Table 3.1: Table indicating for which steps the assumptions are actively applied.

3.4 Output

The output is a structured (spatial-occupancy) representation which specifies where each block is positioned in space (closely related to bin grid G) and is based on the representation used by Peysakhov et al. [12]. Using the notation from section 3.1, it should be possible for a human to correctly reconstruct any input Duplo assembly A from the corresponding output representation R(A). This is however not entirely realistic in a lot of cases, since the percepts can only capture the directly visible surfaces of an assembly. Several blocks at the “inside” of an assembly may therefore be covered, leading to several hypotheses about what is hidden (e.g. no block or a colored block). Therefore it is only reasonable to expect that from the output representation R(A) it is possible to at least reconstruct the outside of A.

Even with that assumption there are still certain instances of assembly A that are uncertain or ambiguous about the original configuration of the assembly. These ambiguities are currently resolved by simply picking the first interpretation with the highest probability. This leaves room for undesirable ambiguities and therefore there is room for additional filtering to optimize for accuracy, but this is left for future research.

The final representation is called the “assembly representation”, where each fact represents the position and color of one Duplo block with respect to the bin grid G.


There are three different predicate names that encode the dimension and orientation of a block: d22 describes a 2 × 2 block, d42 describes a 4 × 2 block with the long side on the x axis and d24 describes a 2 × 4 block with the long side on the z axis. The four arguments of each predicate indicate respectively the x coordinate, the z coordinate, the y coordinate and the qualitative color of the block. The qualitative color can be red r, green g, blue b or yellow y. The following facts are, for example, sufficient to describe the blue and green 2 × 2 Duplo blocks in Figure 3.2.

d22(-5, -5, 0, b).
d22(3, 3, 0, g).

If one wants to stack a yellow 2 × 4 Duplo block and then a red 4 × 2 Duplo block on top of the blue block, so that the first sticks out two bins to the right and the second sticks out two bins to the back, the following facts can be added.

d24(-5, -5, 1, y).
d42(-5, -5, 2, r).

Notice that the coordinates of each block are not the coordinates of the center of the block within the grid G; instead they describe the coordinates of the bin within the block with the lowest value for each coordinate. So when referring to Figure 3.2, the position of a block relates to the bin within that block that is nearest to the bottom, the left and the front. This decision was made to simplify the creation of a logical background theory. Also notice that the y-coordinate increments by 1 instead of 2 for each stacked block. This decision was made because it is not possible with the current set and construction restrictions to position a block at any odd y-position within G. The assembly representation thus only uses the minimal set of possible configuration locations, to reduce human errors while working with the assembly coordinates.

There are several differences with the representation for Lego assemblies of Peysakhov et al. [12]. First of all, the dimensions of the blocks are inherent to the predicates, whereas Peysakhov et al. describe the dimensions using arguments. Secondly, the position of the blocks relative to each other is described using graphs and relational predicates, which is not the case for DupLog, where all blocks are positioned relative to bin grid G. It is still relatively easy to rewrite the dimensions of each block to the notation of Peysakhov et al., which could be useful when extending the application with other sized blocks.
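As an illustration of these conventions, the following C++ sketch expands one assembly-representation fact into the bins it occupies in G. It reflects my reading of this section (the fact's y argument counts block layers, so layer k covers bin layers 2k and 2k + 1) and is not part of the DupLog code base:

#include <string>
#include <vector>

struct Bin { int x, z, y; char color; };

std::vector<Bin> occupiedBins(const std::string& pred, int x, int z, int y, char color) {
    int sx = 2, sz = 2;                 // footprint in bins: d22 = 2 x 2
    if (pred == "d42") sx = 4;          // long side along the x axis
    if (pred == "d24") sz = 4;          // long side along the z axis
    std::vector<Bin> bins;
    for (int ix = 0; ix < sx; ++ix)
        for (int iz = 0; iz < sz; ++iz)
            for (int iy = 0; iy < 2; ++iy)          // every block is two bin layers tall
                bins.push_back({x + ix, z + iz, 2 * y + iy, color});
    return bins;                        // 8 bins for d22, 16 for d42/d24 (cf. CR2. and CR3.)
}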

3.5 From vision data to logical representation

At this point there are definitions for the input assemblies A and the corresponding output assembly representation R(A), but there is a gap that needs to be bridged between low-level data acquisition from vision of such assemblies A and high-level assembly interpretation. For DupLog there is a clear boundary between those two parts. The first part gathers features from the low-level data and summarizes them as a list of logical facts for the second part (step I and step II). The representation should be well chosen, since it needs to be both the output of the low-level vision phase and the input to the logical interpretation part (step III and IV). The “bin representation” is a collection of these logical facts (i.e. vision features) called “bins”, which relate one-to-one with the bins within the bin grid G and are directly computed from point clouds and octrees.

Figure 3.4: Example of using point clouds and octrees to get the bin representation.

3.5.1 Point clouds

Point clouds are a detailed way to represent the occupancy of 3D points (called voxels) within a volume. A point cloud can be described as a set of points of the form (x, y, z), where the x, y, z values are the position of a point in Euclidean space. The image on the outer left in Figure 3.4 shows a point cloud for one side of a simple Duplo block assembly acquired through the Kinect (i.e. the “percept acquisition” part of step I). The point clouds of DupLog also have, besides the position, (r, g, b) values that represent the color intensities of respectively red, green and blue.

Since capturing only one side of a Duplo block assembly is often not enough to decide on its structure, it is necessary to capture multiple sides and integrate those into a single point cloud (i.e. the “model integration” part of step I). Such a combined point cloud can be seen at the center left of Figure 3.4.

3.5.2 Octrees

Since point clouds are often too detailed to directly transform into a neat representation, an octree [8] is used to reduce the dimensionality. Octrees segment point clouds into subvolumes until a certain resolution is reached. An octree is, as the name suggests, a tree-like structure in which each node has eight subnodes. An octree can be obtained by specifying a boundary cube and a resolution. The boundary cube is segmented into eight equally sized subcubes, and this segmentation process continues recursively for each subcube until the resolution value is reached. Each leaf node in the octree represents a part of space occupied by voxels found in the point cloud. Thus from the root node of the tree it is possible to find all the points, and as one traverses down the tree, for each subsequent step the volume searched is 1/8 of the previous step, finally ending at the location of the voxels.

By only partitioning nodes that actually contain voxels, a sparse octree can be obtained. Advantages of such a sparse representation are the facilitation of voxel localization, the provision of additional information about the rough outline of the points and fast operations on the tree.


In order to get the bin representation, first a combined point cloud of a Duplo block assembly is segmented using an octree (step II). Each leaf node of the acquired sparse octree then corresponds with a certain bin within the 3D bin grid G. Each valid leaf node is then represented as a logical fact, along with information about its 3D location relative to its position within G as well as a qualitative label for the average color of the voxels within the leaf node. This transformation from point cloud to bin representation through an octree can be seen in the two rightmost images of Figure 3.4. First the point cloud is segmented using an octree, then the bin representation is calculated from that (see section 4.2 for more detail). Informally we can say that each bin represents a small piece of physical matter detected by the camera. Using logical reasoning with constraints, we can then group several of these pieces (i.e. bins) to form Duplo blocks and finally the assembly representation, which is discussed in chapter 5.
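The following C++ sketch illustrates the recursive idea behind such a sparse octree. It is a simplified stand-in, not the actual DupLog octree code; the names and the stopping criterion (a target cube edge length) are assumptions:

#include <array>
#include <memory>
#include <vector>

struct Voxel { float x, y, z; unsigned char r, g, b; };

struct OctreeNode {
    float cx, cy, cz, half;                       // cube center and half edge length
    std::vector<const Voxel*> points;             // voxels falling inside this cube
    std::array<std::unique_ptr<OctreeNode>, 8> children;
};

// Recursively split a node into eight subcubes until the target resolution
// (edge length of one bin) is reached; only occupied children are created,
// which yields a sparse octree whose leaves correspond to bins.
void subdivide(OctreeNode& node, float resolution) {
    if (2.0f * node.half <= resolution || node.points.empty()) return;
    float q = node.half / 2.0f;
    for (const Voxel* v : node.points) {
        int idx = (v->x >= node.cx) | ((v->y >= node.cy) << 1) | ((v->z >= node.cz) << 2);
        if (!node.children[idx]) {
            auto child = std::make_unique<OctreeNode>();
            child->cx = node.cx + ((idx & 1) ? q : -q);
            child->cy = node.cy + ((idx & 2) ? q : -q);
            child->cz = node.cz + ((idx & 4) ? q : -q);
            child->half = q;
            node.children[idx] = std::move(child);
        }
        node.children[idx]->points.push_back(v);
    }
    node.points.clear();                          // interior nodes keep no points
    for (auto& c : node.children)
        if (c) subdivide(*c, resolution);
}

The caller would fill the root node with all voxels of the combined point cloud, set its boundary cube, and call subdivide with the bin edge length as resolution.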


Chapter 4

Feature extraction from vision

This chapter explains the workings of the first two steps of the interpretation process, in which raw depth and image data from the Kinect sensor are transformed into a set of facts (the bin representation). In step I multiple point cloud percepts are acquired from the sensor and combined into an integrated model ready for octree segmentation. Step II then segments the integrated model using an octree and renders a set of facts from it to be used by the constraint solvers of steps III and IV discussed in chapter 5. Both steps use the data structures described in section 3.5 and it is assumed that the reader is familiar with them.

4.1 Step I: percept acquisition and model integration

In step I four point cloud percepts captured from the Kinect sensor (in physical coordinates) are combined and integrated into a single combined point cloud (in model coordinates). Each of the point cloud percepts is captured from a different capture orientation from Figure 3.2, where each is associated with a different angle (see Table 4.1). An integrated point cloud model Pcmb containing all visible (outer) surfaces of one Duplo block assembly can be obtained by aligning four properly calibrated and preprocessed point clouds Pi, each captured with a different capture orientation per i, by:

P_{cmb} = \bigcup_{i=0}^{3} C_i K R T P_i

where Pi is a preprocessed point cloud matrix of percept number i ∈ N, T is a transformation matrix, R is a rotation matrix and K is a correction matrix that together align Pi to the 3D bin grid G. Since the percepts were taken from different viewpoints, each aligned matrix is rotated as the prefinal part of step I using the discrete rotation matrices Ci. These matrices rotate each aligned matrix by a multiple i of 90° corresponding to the capture orientation (see Table 4.1) and the number of times the model was physically rotated around the y-axis. Finally Pcmb is obtained by simply adding all the voxels after applying Ci to the aligned point clouds. T, R, K and Ci are described in more detail in the following subsections.


4.1.1 Calibration and preprocessing

The input from the sensor consists of two 640 × 480 images, one with color information and the other with depth information. This input is transformed into a Euclidean point cloud Pi such that the perspective is orthogonally corrected and the unit size is expressed in millimeters. The calibration measurements needed to do this may be provided for the sensor or can be determined experimentally (for more detail see appendix section A.2). The calibration measurements can be used to map the two images correctly onto each other so that for each voxel within the point cloud Pi there is an associated color value.

A lot of the voxels lie outside the area of interest (that is, outside the area around the origin of 3D bin grid G related to constraint CR1. from section 3.3.2). Since each voxel is later used in the octree segmentation calculation, these voxels have a negative effect on performance. The performance is improved by ignoring all voxels that are past the area of interest (i.e. everything except the Duplo assembly). This idea is implemented by ignoring voxels that have a z-coordinate greater than a threshold t. As long as the threshold t is chosen in a way that it does not ignore voxels that are part of Duplo assemblies, it will only influence performance and not the recognition process.
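A minimal sketch of this preprocessing, assuming a standard pinhole camera model with placeholder intrinsics (the actual calibration values are obtained as described in appendix A.2), could look as follows:

#include <cstdint>
#include <vector>

struct Voxel { float x, y, z; uint8_t r, g, b; };

// Back-project a 640x480 depth image (in mm) and its mapped color image to a
// metric point cloud, dropping voxels beyond the depth cutoff t. The values
// fx, fy, cx, cy below are hypothetical placeholder intrinsics.
std::vector<Voxel> backProject(const std::vector<uint16_t>& depthMM,
                               const std::vector<uint8_t>& rgb, float t) {
    const int W = 640, H = 480;
    const float fx = 594.0f, fy = 591.0f, cx = 320.0f, cy = 240.0f;   // assumed intrinsics
    std::vector<Voxel> cloud;
    for (int v = 0; v < H; ++v) {
        for (int u = 0; u < W; ++u) {
            float z = static_cast<float>(depthMM[v * W + u]);
            if (z == 0.0f || z > t) continue;       // no reading, or outside the area of interest
            Voxel p;
            p.x = (u - cx) * z / fx;                // perspective-corrected x in mm
            p.y = (cy - v) * z / fy;                // image rows grow downward, world y grows up
            p.z = z;
            const uint8_t* c = &rgb[(v * W + u) * 3];
            p.r = c[0]; p.g = c[1]; p.b = c[2];
            cloud.push_back(p);
        }
    }
    return cloud;
}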

From this point on it is assumed that the point clouds of the percepts are properly corrected and contain only the voxels of interest (i.e. after applying threshold t). The point clouds consist of n voxels, where each voxel p^j consists of position data p^j_x, p^j_y, p^j_z and associated color data p^j_r, p^j_g, p^j_b, with j, n ∈ N and 0 < j ≤ n. Since the alignment of the point clouds is only concerned with the position information, a point cloud can be rewritten for step I as a 4 × n matrix Pi, where n ∈ N is the number of voxels of point cloud number i. The first three rows contain respectively the x, y and z coordinates of each point and the final row consists of only ones. Note that each column within the point cloud Pi represents a single voxel.

P_i = \begin{pmatrix} p^1_x & p^2_x & \cdots & p^{n-1}_x & p^n_x \\ p^1_y & p^2_y & \cdots & p^{n-1}_y & p^n_y \\ p^1_z & p^2_z & \cdots & p^{n-1}_z & p^n_z \\ 1 & 1 & \cdots & 1 & 1 \end{pmatrix}

4.1.2 Alignment to bin grid

The point clouds Pi contain voxels indicating the surfaces of Duplo blocks. In order to segment the point clouds Pi, it is necessary to align the surfaces of the point cloud Pi to the 3D bin grid G described earlier in section 3.2. In other words: it is necessary to find a translation matrix T, a rotation matrix R and a correction matrix K such that when applied to point cloud Pi it is correctly aligned with G.

The rotation matrix R has to make sure that all the surfaces of the blocks are either perpendicular or parallel to the bins, taking into account the earlier described constraints CR2. and CR3. from section 3.3.2. The translation matrix T should ensure that the surfaces of the point cloud are relatively the same distance from the model origin as the surfaces of the Duplo blocks are from the origin in the real world. The trick with this last step is not to exactly align the surfaces on the outlines of the cubes of the octree, but to move all voxels a little further down the z-axis (no more than half a bin). In this way the occupied bins represent the found outer surfaces of Duplo blocks.

In order to facilitate Duplo model acquisition it is more convenient to have the distance of voxels relative to the origin (the center of the square grid) instead of in terms of the distance from the camera. The translation matrix T translates all the voxels of Pi to the origin of bin grid G by choosing values for ox, oy and oz:

T = \begin{pmatrix} 1 & 0 & 0 & o_x \\ 0 & 1 & 0 & o_y \\ 0 & 0 & 1 & o_z \end{pmatrix}

With the voxels relative to the origin, the model now needs to be rotated around the x-axis to correct for the camera tilt angle and a possible slope of the table surface. Since the camera is looking down at angle θ, the front of the grid is seen as having a lower y-position than the rear of the grid. Because the octree works like a cube-like 3D grid, it expects the Duplo blocks to be perpendicular to the camera (see Figure 4.1). This can be done with matrix R and tilt angle θ:

R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{pmatrix}

Figure 4.1: This image shows the camera tilt angle θ. The tilt correction rotates all voxels relative to the focal point. This entails that after the surface has been tilted by θ, the surfaces of the Duplo blocks appear perpendicular to the camera.

At this point an optimization step is to ignore all voxels with a y-position lower than the y-position of the origin. This reduces the number of voxels and thus the complexity of the calculation. The octree structure uses only cubes, i.e. dx = dy = dz, but we already know that 2 × 2 Duplo blocks do not satisfy this condition (see section 3.2). This is resolved by simply multiplying the y-coordinate of each voxel with a factor dx/dy. Since this fraction is a constant it can be precalculated. Another small correction is performed, because very tall Duplo assemblies are registered as coming a little more towards the camera. This is probably due to a small error in the calculation of the depth image from the Kinect through inaccurate lens properties (see also appendix section A.2). A small slope ω relative to the y-position of each voxel is used to push back this error. This gives the following correction matrix:

K = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \frac{d_x}{d_y} & 0 \\ 0 & \omega & 1 \end{pmatrix}

The final alignment step before the point clouds are added to Pcmb is to rotate each of the point clouds by the angle associated with its capture orientation (see Table 4.1). This gives the following rotation matrices Ci with i ∈ N and 0 ≤ i < 4:

C_i = \begin{pmatrix} \cos(\alpha_i) & 0 & \sin(\alpha_i) \\ 0 & 1 & 0 \\ -\sin(\alpha_i) & 0 & \cos(\alpha_i) \end{pmatrix}

Capture orientation   Angle symbol   Angle
front                 α0             0°
left                  α1             90°
back                  α2             180°
right                 α3             270°

Table 4.1: The capture orientations and their associated angles.
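A per-voxel version of the alignment Ci K R T Pi could look like the sketch below. It is an illustration of the matrices above, not the DupLog implementation; the parameter names mirror the symbols in the text:

#include <cmath>
#include <vector>

struct Voxel { float x, y, z; unsigned char r, g, b; };

void alignToBinGrid(std::vector<Voxel>& cloud, int i,           // i = capture orientation (0..3)
                    float ox, float oy, float oz,               // translation T to the grid origin
                    float theta,                                 // camera tilt angle for R
                    float dxOverDy, float omega) {               // correction factors for K
    const float a = i * 1.5707963f;                              // alpha_i = i * 90 degrees
    for (Voxel& p : cloud) {
        // T: move voxels relative to the origin of bin grid G
        float x = p.x + ox, y = p.y + oy, z = p.z + oz;
        // R: rotate around the x-axis to undo the camera tilt
        float y1 = std::cos(theta) * y - std::sin(theta) * z;
        float z1 = std::sin(theta) * y + std::cos(theta) * z;
        // K: scale y so bins become cubes, and push tall structures back by omega * y
        float y2 = dxOverDy * y1;
        float z2 = omega * y1 + z1;
        // Ci: rotate around the y-axis by the capture orientation angle
        p.x = std::cos(a) * x + std::sin(a) * z2;
        p.y = y2;
        p.z = -std::sin(a) * x + std::cos(a) * z2;
    }
}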

4.2 Step II: octree segmentation

Next the combined point cloud model Pcmb is segmented using an octree (see also section 3.5.2), which returns the relevant octree bins and their associated color. Essentially this step segments the combined model with a 3D bin grid that closely resembles G, but has as smallest unit a cube with height, width and depth equal to dx and dz. Since each y-dimension within the point cloud Pcmb was scaled with dx/dy in step I, the grid can effectively be treated as if it is the 3D bin grid G. Step II segments each 2 × 2 Duplo block into eight bins. The segmentation into eight instead of four bins is thought to result in a lower overall error rate. A single bin within the octree contains all voxels from Pcmb within its bounds. Many bins will contain some voxels but are not interesting for the Duplo assembly structure. Such irrelevant bins can be discarded using a threshold tminpoints = 60, i.e. the minimal number of points a bin needs to contain to be considered. The value for tminpoints was obtained by testing on various assemblies how well the constraints CR2. and CR3. from section 3.3.2 were satisfied and then picking the overall best value from those tests. Step II effectively interprets a point cloud and rewrites it to a list of facts of the following form:

bin(x, z, y, color)

where bin(· · · ) is a fact indicating a bin with enough points, x, y, z denote the position of the bin within the octree and color is the qualitative color estimated from the median voxel of that bin.

4.2.1 Color categorization

There are four qualitative colors that can be attributed to a bin: red r, green g, blue b and yellow y. Each of those qualitative colors has a corresponding unique combination of RGB values that closely resembles the qualitative color it represents. The qualitative color for a bin is determined by first taking the color values of the median voxel of that bin and then calculating the relative distance between the color values of that bin and the color values of each qualitative color. The qualitative color with the lowest distance to the bin is finally attributed.

The median voxel is in this way a good indicator for the color of a bin, since it is very probable to have an RGB value close to the most common color within the bin. This is in contrast with the average, which is heavily affected by outliers. The median is determined by first sorting all the voxels within a bin using the following comparison function f : voxel × voxel → −1 | 0 | 1 (4.1) between voxels va and vb:

f(v_a, v_b) = \begin{cases} -1 & \text{if } g(v_a) < g(v_b) \\ 0 & \text{if } g(v_a) = g(v_b) \\ 1 & \text{if } g(v_a) > g(v_b) \end{cases} \qquad g(v) = v_r \ll 16 \mid v_g \ll 8 \mid v_b \tag{4.1}

where the value −1 means that va is less than vb, the value 0 means that va is equal to vb, the value 1 means that va is greater than vb, and where the <<-operator is the usual binary left shift operator.

Next the distance dist(i,v) between the RGB values of the median voxel v and the average RGB values of qualitative color qi from the set of qualitative colors q is estimated by the following equation:

dist_{(i,v)} = A_{(i,v)} \cdot A_{(i,v)}, \qquad A_{(i,v)} = \begin{pmatrix} v_r \\ v_g \\ v_b \end{pmatrix} - \begin{pmatrix} q^i_r \\ q^i_g \\ q^i_b \end{pmatrix}

where the ·-operator is the inner product. When the distances between the RGB values of the voxel and all of the qualitative colors have been calculated, the qualitative color with the smallest distance is selected. If this distance exceeds the threshold tcolor = 50000, the color is ignored. This threshold is chosen in such a manner that it only affects extremely deviating colors (e.g. black).
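The following C++ sketch summarizes this color categorization. It is a simplified illustration; the reference RGB values are placeholders that would normally come from the calibration procedure of section 7.2.3:

#include <algorithm>
#include <climits>
#include <string>
#include <vector>

struct Voxel { float x, y, z; int r, g, b; };

// Sort key used to find the median voxel of a bin: pack RGB into one integer.
static long key(const Voxel& v) { return (long(v.r) << 16) | (long(v.g) << 8) | long(v.b); }

std::string qualitativeColor(std::vector<Voxel> bin /* voxels of one bin */) {
    // The median voxel is a robust estimate of the dominant color in the bin.
    std::sort(bin.begin(), bin.end(),
              [](const Voxel& a, const Voxel& b) { return key(a) < key(b); });
    const Voxel& m = bin[bin.size() / 2];

    struct Ref { const char* name; int r, g, b; };
    const Ref refs[] = { {"r", 200, 40, 40}, {"g", 40, 160, 60},      // placeholder reference
                         {"b", 30, 60, 180}, {"y", 230, 200, 40} };   // colors from calibration
    const long tcolor = 50000;                                         // distance threshold
    long best = LONG_MAX; const char* label = "unknown";
    for (const Ref& q : refs) {
        long dr = m.r - q.r, dg = m.g - q.g, db = m.b - q.b;
        long d = dr * dr + dg * dg + db * db;                          // squared Euclidean distance
        if (d < best) { best = d; label = q.name; }
    }
    return best <= tcolor ? label : "unknown";                         // ignore deviating colors
}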

The lighting conditions may vary per environment. If the system is not calibrated correctly for the lighting conditions of the environment, this can result in misattribution of the qualitative colors for the bins, resulting in a wrong output representation. Usually this type of error is detected while calibrating the system and can be resolved by entering new RGB values for the qualitative colors. More on this type of error can be found in section 8.1.5.

There is a calibration utility for easily setting the RGB values of the qualitative colors. Calibration can be done through this utility by placing one or more blocks of the same color in front of the camera and running it prior to capturing the percepts. This procedure has to be repeated for each different color. As long as the lighting conditions do not drastically change, this procedure has to be done only once for each color. A detailed guide on this procedure can be found in section 7.2.3.


Chapter 5

High-level interpretation using probabilistic logic

Steps I and II have transformed the raw image input into a set of bins, each representing a part of the point cloud of the underlying data. These bins are represented as logical facts (i.e. the bin representation): bin(3, 0, 0, b), bin(3, 1, 0, b), . . . (see also Figure 3.1), where the first three arguments determine the position of the bin in 3D space, and the last contains one of the colors of the Duplo domain. The next two steps assemble groups of bins into Duplo blocks using probabilistic logic. The first step (III) generates a Duplo assembly consisting of only 2 × 2 blocks, i.e. of the smallest granularity. The second step (IV) tries to assemble 2 × 2 blocks into larger 4 × 2 blocks if these are available in the set and likely present.

5.1 Logic for image interpretation

The literature shows several ways of applying logic to interpret images. One of the most elegant approaches is proposed by Shanahan [16]. Instead of reasoning from Duplo blocks to their features (i.e. from cause to effect), it is possible to use abduction and reason from effect to cause (e.g. from voxels to Duplo assemblies) within certain domains. Shanahan elaborates on this idea and provides examples of how probabilities can give an explanation of the most likely image interpretation.

This work initially used this approach through AILog2 and SWI-Prolog (see appendix A.1). AILog2 is a Prolog program for SWI-Prolog that introduces an environment for logical inference using probabilities and abductive reasoning. However, it appeared that both performance and maintainability suffered greatly through the choice for this software, since negation was not supported (resulting in defining a lot of additional facts) and because it is an interpreted program within SWI-Prolog, without optimizations and with an exponential complexity in the number of facts, resulting in problematic run-times for solving queries.

The disappointing results with AILog2 resulted in a switch to the ProbLog [15] module within Yet Another Prolog (YAP), a different, better performing environment. The ProbLog module does not really provide abduction, but it provides probabilities, negation and a lot of optimized tools. For step III and step IV the main focus is on the problog_max query, which gives the interpretation with the highest probability of all possible interpretations. It does not give the same results as the abductive “explain” in AILog2, but it tries to mimic its behavior by “explaining” a list of features to the best of its ability. Queries for a ProbLog program thus start with:

problog_max(pred(arg), P, F).

where the predicate pred is for example block grouping or assembly grouping and the argument arg is a list of bins or blocks. The query returns a list F of all probabilistic facts that result in the highest probability P while satisfying all the constraints for the predicate pred with the arguments arg.

The ProbLog programs are built to exhaust the list of bins or blocks given for arg, restricted by the set and construction constraints related to Duplo blocks mentioned in sections 3.3.1 and 3.3.2. By picking smart probabilistic facts (with corresponding probabilities) it is possible to keep the desirable interpretations close to the most probable outcome of the problog_max query.

5.2 Step III: block grouping

input : a list B of bins (x, y, z, color)
output: a list R of 2 × 2 blocks
begin
    R ← ∅
    while B ≠ ∅ do
        D ← take S ⊆ B so that S contains only the bins of the highest block layer of B
        for color ∈ {r, g, b, y} do
            D′ ← take S ⊆ D so that S contains only bins with color color
            if D′ ≠ ∅ then
                R ← R ∪ mostLikelyGrouping(D′, R)
        B ← B \ D
    return R

Algorithm 1: Block grouping.

First the bins are grouped into blocks: one block consists of eight bins within a 2 × 2 × 2 cube of bins. The block grouping is efficiently done in a sequential fashion from the two highest layers of bins to the two lowest layers of bins and per color (see Algorithm 1). The resulting list of blocks R′ of mostLikelyGrouping(D′, R) is determined by a ProbLog program (listed in A.3.1) which computes:

D', R \vdash_{\mathrm{problog\_max}} R'

Each bin in D′ can be part of a block with probability Pblock = 0.95, and missing bins have a probability Pfn = 0.2 (false negative). Alternatively a bin can be noise with probability Pfp = 0.02 (false positive). The intuition behind the chosen probabilities is illustrated in Table 5.1. For each bin the system evaluates whether it is possible to group it either as noise, with probability Pfp, or together with directly surrounding bins as a 2 × 2 × 2 cube of bins with probability Pblock. For such a grouping the number of bins is evaluated. Each missing bin multiplies the grouping probability by a factor Pfn. The probabilities are chosen such that grouping is favored over noise; only if a grouping has 2 or fewer bins does the system consider it more probable that those bins are noise. Finally a list with the most probable block groupings is returned as a block representation R′.

Bin grouping      Probability it is a block           Probability it is noise
8 bins present    Pblock = 0.95                       Pfp^8 = 0.02^8
7 bins present    Pblock · Pfn = 0.19                 Pfp^7 = 0.02^7
6 bins present    Pblock · Pfn^2 = 0.038              Pfp^6 = 0.02^6
3 bins present    Pblock · Pfn^5 = 0.000304           Pfp^3 = 0.000008
1 bin present     Pblock · Pfn^7 = 0.95 · 0.2^7       Pfp = 0.02
2 bins present    Pblock · Pfn^6 = 0.0000608          Pfp^2 = 0.0004

Table 5.1: Examples of bin groupings and the likelihood of them being a 2 × 2 Duplo block or noise.
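The trade-off in Table 5.1 can be checked directly with a few lines of C++. This is a worked example of the probabilities above, not the ProbLog program itself:

#include <cmath>
#include <cstdio>

int main() {
    const double Pblock = 0.95, Pfn = 0.2, Pfp = 0.02;
    for (int present = 8; present >= 1; --present) {
        int missing = 8 - present;
        double pBlock = Pblock * std::pow(Pfn, missing);   // grouping with `missing` absent bins
        double pNoise = std::pow(Pfp, present);            // all `present` bins are false positives
        std::printf("%d bins present: block %.3g vs noise %.3g -> %s\n",
                    present, pBlock, pNoise, pBlock > pNoise ? "group" : "noise");
    }
    return 0;
}

Running this confirms that grouping only loses to the noise interpretation when two or fewer bins are present.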


5.3 Step IV: Assembly grouping

The last step takes the results of step III and computes the most likely assembly interpretation S in the form of a list of Duplo blocks, given Duplo-related constraints C and experimentally determined probabilities P = {Pd22 = 0.60, Pd24 = 0.15, Pd42 = 0.15, Pnoise = 0.01} (program listed in appendix A.3.2). Given the list of blocks R from step III, the constraints C and the probabilities P, an optimal solution S (having the highest probability) is determined as:

R, C, P \vdash_{\mathrm{problog\_max}} S

The probabilities P have been determined in a similar way as for step III. For each block in R there are four possibilities: it is a 2 × 2 block with probability Pd22, it is a 2 × 4 block with probability Pd24, it is a 4 × 2 block with probability Pd42, or it is noise with probability Pnoise. If two blocks are found to be part of a larger block, they are both removed from R before other blocks are evaluated.

Table 5.2 lists the constraints C that were used for this step. The constraints are largely composed of the restrictions from section 3.3.

Constraint   Description                                                          Restriction
i            Blocks come from a fixed/known set.                                  SR1.
ii           Duplo blocks cannot overlap each other.                              CR7.
iii          All blocks with y = 0 are on the ground.                             CR5.
iv           Blocks need to be supported by other blocks if they are not on
             the ground (meaning that a block cannot float in the air).           CR6.
v            Two bordering blocks can be grouped as one 2 × 4 or one 4 × 2
             Duplo block with respectively probabilities Pd24 and Pd42.           CR3.
vi           A block can be noise or be a 2 × 2 Duplo block with respectively
             probabilities Pnoise and Pd22.                                       CR2.

Table 5.2: Constraints used for the assembly grouping.


Chapter 6

Implementation

This chapter provides a general implementation outline for DupLog. It describes the hardware and software decisions behind the results presented in chapter 8. The first section, 6.1, explains the physical setup and the hardware used. The next section, 6.2, deals with the software and describes the languages, libraries, programs and underlying design patterns that were adopted in order to develop DupLog.

6.1 Physical setup

Figure 6.1: The physical setup.

Figure 6.1 shows the physical setup for DupLog. It is designed for acquiring percepts of Duplo block assemblies in a semi-automated fashion. A Kinect sensor is placed on top of a cardboard box on an uncluttered tabletop surface, tilted 22° downwards and focused on a grid surface on the tabletop. The grid surface is closely related to the 3D bin grid G from section 3.2, i.e. the center of the 2D grid surface aligns with the origin of grid G (i.e. (0, 0, 0)). The Kinect sensor is connected to a computer on which the programs described in section 6.2.1 are installed. The measurements in the figure should be seen as a reference and are not that strict.

6.1.1 Hardware

The hardware for the system consists of a Kinect sensor and a computer to run the software on (i.e. steps I-IV and the knowledge base). The computer should have at least a keyboard and a monitor, and preferably a mouse (for interacting with visualizations and GUIs). The computer can be any modern desktop or laptop computer with at least one USB 2.0 (or higher) port, preferably with a graphics card installed (for the visualizations) and plenty of Random Access Memory (RAM) to store the intermediate data structures. The Kinect sensor should be connected to the computer via a Universal Serial Bus (USB) cable. For this thesis a MacBook Pro laptop with an Intel Core 2 Duo 2.16 GHz processor, an ATI Radeon X1600 GPU and 3 GB of 667 MHz DDR2 SDRAM was used.

6.2 Software

The major design ideas for developing the software are “efficiency”, “flexibility”, having meaningful “visualizations” of data structures and “portability”. Efficiency is important for real-world applications: the recognition of assemblies should be accomplished within reasonable time and scale well to larger applications. This can be achieved by optimization, i.e. reducing the complexity of algorithms where possible. Flexibility is desirable in order to use and test different parts of the software independently, without having to change (much of) the other parts. This can be achieved by keeping all components loosely coupled and applying object-oriented programming techniques. The project employs many different 3D data structures (e.g. point clouds and bins) that are hard to interpret in their stored form. Proper visualizations are therefore very useful for quickly assessing the quality of a given data structure. The software was developed on a computer with Mac OS X 10.6.8 installed, but since other people working on other platforms might want to use the software in the future, it should be portable. To ensure portability, mainly libraries and programs are utilized that compile and behave the same on the other major platforms (Windows, Unix-based).

The major part of the project is written in C++, since it offers both low-level and high-level operations. For this project it provides much control over memory (shared memory) and other low-level features (e.g. pointers) that help optimize performance (efficiency), while still allowing programs to be structured in an object-oriented fashion (flexibility). A drawback of choosing C++ over languages like Java or Python is the additional development time: both body and header files have to be maintained per class, memory has to be managed manually, and care is needed to write code that compiles on multiple platforms.

To manage all the files, dependencies and paths related to the tools, a Cross Platform Make (CMake) project file was used. CMake is a cross-platform automated build system that uses compiler- and platform-independent configuration files to control the software compilation process. A list of all the external libraries and programs that were used can be found in appendix section A.1.

6.2.1 Project organization

Tool name (shared memory used): Description

datalab (register): Utility for creating and managing the register.
fnect (register, point clouds, capture settings): Utility for acquiring sensor data from a Kinect sensor and storing it as a point cloud. Used for the first part of step I (percept acquisition).
pc (register, point clouds): Utility for managing point clouds. Also performs the second part of step I (model integration).
oct (register, octree settings): Utility for configuring octree settings.
vis (register, visualization settings): Utility for configuring visualization settings.
str (register, strings): Utility for manipulating string representations.
dvis (register, octree settings, point clouds, strings, visualization settings): Utility for visualizing data structures.
filter (register, point clouds, octree settings): Program that applies step II (octree segmentation) and outputs the results as a string.
probfilter (-): Program that takes in Problog queries and a Problog program and returns the results as block representation or assembly representation strings. Used for step III (block grouping) and step IV (assembly grouping).
form (-): Utility that rewrites string representations to other forms.
colorpick (register, point clouds): Utility tool for getting the color values of Duplo blocks.

Table 6.1: Table of the major project tools.

The emphasis of the system is on data structures (e.g. point clouds, assembly representations and program settings). The software for this project therefore mainly consists of a set of tools (console programs, listed in Table 6.1) that store, retrieve and manipulate uniquely named data structures in shared memory. A user can define a unique name and a type for a data structure and add it to a register so it can be used by the tools. The data structures for this project are of the following types:


octree settings: Settings for performing octree segmentation.
capture settings: Settings for percept acquisition from a Kinect sensor.
visualization settings: Settings for visualizing other data structures.
string: An array of text characters.

The programs can operate in parallel on the data structures (e.g. a visualization and a capture tool operate simultaneously on the same point cloud) by using mutex synchronization primitives. The shared memory management and associated synchronization primitives were realized through the Boost.Interprocess library. The Boost.Program_options library is used to maintain a consistent interface for the tools.
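The following minimal sketch illustrates how such a shared-memory register with mutex protection can be set up with Boost.Interprocess. The segment and mutex names and the simple counter are hypothetical; the actual DupLog register stores the richer data structures described above.

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/sync/named_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>

namespace bip = boost::interprocess;

int main() {
    // Open or create the shared-memory segment acting as the register.
    bip::managed_shared_memory segment(bip::open_or_create, "ExampleRegister", 65536);

    // Find or construct a uniquely named data structure inside the segment.
    int* counter = segment.find_or_construct<int>("ExampleCounter")(0);

    // A named mutex guards concurrent access by the other tools.
    bip::named_mutex mutex(bip::open_or_create, "ExampleRegisterMutex");
    {
        bip::scoped_lock<bip::named_mutex> lock(mutex);
        ++(*counter);  // e.g. a capture tool updates while a visualization reads
    }
    return 0;
}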

For the octree segmentation (filter tool) and the point cloud data type the Point Cloud Library (PCL) is used. The PCL framework contains numerous state-of-the-art algorithms and is maintained by researchers all over the world. The utility program dvis uses visualization settings to generate interactive 3D visualizations for the point clouds and string representations by using the Visualization Toolkit (VTK).
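As a rough impression of how PCL can be used to build an octree over a point cloud, consider the sketch below. This is not the actual filter tool; the example points and the 0.0159 m leaf size are illustrative assumptions.

#include <vector>
#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/octree/octree_search.h>

int main() {
    pcl::PointCloud<pcl::PointXYZ>::Ptr cloud(new pcl::PointCloud<pcl::PointXYZ>);
    cloud->push_back(pcl::PointXYZ(0.01f, 0.02f, 0.01f));  // normally filled from p1
    cloud->push_back(pcl::PointXYZ(0.05f, 0.02f, 0.01f));

    // One octree leaf per bin (assumed leaf size of 0.0159 m).
    pcl::octree::OctreePointCloudSearch<pcl::PointXYZ> octree(0.0159f);
    octree.setInputCloud(cloud);
    octree.addPointsFromInputCloud();

    // The centers of the occupied leaves approximate the filled bins of the 3D bin grid.
    std::vector<pcl::PointXYZ, Eigen::aligned_allocator<pcl::PointXYZ>> centers;
    octree.getOccupiedVoxelCenters(centers);
    return centers.empty() ? 1 : 0;
}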

6.2.2 Data-flow diagram

Figure 6.2 is a data-flow diagram for the complete interpretation process in DupLog. It shows how the software tools from Table 6.1 and the data structures relate to each of the steps of the interpretation process (from section 3.1). The yellow boxes in the diagram represent programs that operate on data structures. The programs can manipulate data structures directly through shared memory or pass data to other programs using pipes. It is assumed that the program settings f1 and o1 are already properly configured for the fnect and filter tools respectively. Configuration of f1 can be done with the fnect tool and configuration of o1 can be done with the oct tool.

Starting at the top, the first part of step I (percept acquisition) is done with the fnect tool. This tool uses the open source libfreenect library (hence the name of the tool) from the OpenKinect community to get raw data from the Kinect sensor. It is used to sequentially capture four point cloud percepts for the front f, left side l, right side r and back b of a Duplo assembly. Next, the four percepts are combined by the pc tool into one point cloud p1, concluding the second part of step I (model integration). Then the filter tool takes p1 and the settings from o1 and outputs a string containing the bin representation by applying octree segmentation (step II). This string is rewritten by the form tool into a string of Problog queries. The Problog queries, together with a file containing the knowledge base, are passed on to the probfilter tool. This tool loads the Yet Another Prolog (YAP) environment, the Problog module and the knowledge base from the file into memory. It then processes the queries, which results in a block representation string for the block grouping (step III). The string is stored in shared memory as s1 through the str tool. To complete the process, the block representation in s1 is converted into another set of Problog queries by form and passed to the probfilter tool together with a file containing another knowledge base. The probfilter tool performs the assembly grouping (step IV) and outputs an assembly representation as a string, which is stored through the str tool to s2, finishing the interpretation process.

Figure 6.2: Data-flow diagram with the necessary software components and data structures for the acquisition process in DupLog.

6.2.3 Visualizations

A very important design decision was to create a visualization for each of the representations that can be viewed and manipulated in real-time. The visualizations are realized through the VTK library and built upon the shared memory register in order to run in parallel. The visualizations are not central to the system, as the system can run without them, but they greatly enhance the analysis and interpretation of all the different data structures.

Before running a visualization, a visualization settings data structure should be created in shared memory through the register. The settings determine what data structure should be visualized, the resolution of the visualization window, the background color and settings for helper objects (plane and cube grid).

While a visualization runs it can be manipulated through mouse interactions (mostly click and drag). This allows users to rotate, move and zoom the 3D data representations. It is also possible to manipulate the data structures and the visualization settings while the visualization runs.
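The following minimal VTK sketch illustrates the basic pattern behind such an interactive visualization: a renderer with a background color, a render window with a resolution, an interactor for the mouse interactions, and a single cube actor as a stand-in for a visualized data structure. The sizes and colors are illustrative assumptions, not the actual dvis code.

#include <vtkActor.h>
#include <vtkCubeSource.h>
#include <vtkPolyDataMapper.h>
#include <vtkProperty.h>
#include <vtkRenderWindow.h>
#include <vtkRenderWindowInteractor.h>
#include <vtkRenderer.h>
#include <vtkSmartPointer.h>

int main() {
    // A single cube as a stand-in for one bin or Duplo block.
    auto cube = vtkSmartPointer<vtkCubeSource>::New();
    cube->SetXLength(1.0);
    cube->SetYLength(1.0);
    cube->SetZLength(1.0);

    auto mapper = vtkSmartPointer<vtkPolyDataMapper>::New();
    mapper->SetInputConnection(cube->GetOutputPort());

    auto actor = vtkSmartPointer<vtkActor>::New();
    actor->SetMapper(mapper);
    actor->GetProperty()->SetColor(0.8, 0.1, 0.1);  // e.g. a red Duplo block

    auto renderer = vtkSmartPointer<vtkRenderer>::New();
    renderer->AddActor(actor);
    renderer->SetBackground(0.1, 0.1, 0.1);  // background color from the settings

    auto window = vtkSmartPointer<vtkRenderWindow>::New();
    window->AddRenderer(renderer);
    window->SetSize(640, 480);  // window resolution from the settings

    auto interactor = vtkSmartPointer<vtkRenderWindowInteractor>::New();
    interactor->SetRenderWindow(window);

    window->Render();
    interactor->Start();  // click and drag to rotate, move and zoom
    return 0;
}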

The default setup creates eight visualization settings v1, v2, v3, v4, vf, vl, vr, vb in shared memory, which respectively are visualizations for p1, o1, s1, s2, f, l, r, b.


Chapter 7

Functionality

This chapter describes the functionality of DupLog. It starts with a detailed description of how to interact with the system through both the console and the GUI. The common interactions with DupLog are then described as a set of use cases to give the reader a global overview.

7.1 Using DupLog

The console programs from section 6.2.1 allow much control over the different data structures and visualizations, but executing common tasks (e.g. calibration of the sensor and octree settings, capturing a Duplo assembly) often entails repeating a lot of commands. To reduce the time needed for these tasks it is often more convenient to serialize those commands through batch files or a Graphical User Interface (GUI). Both approaches are implemented for DupLog: the batch files are Unix shell scripts and the GUI is a Max file.

The GUI basically calls the console tools from Table 6.1 in the background, while presenting the user with three graphical panels containing controls and feedback visualizations. This section shows the three different GUI panels and explains what they do and which underlying console tools are called. The GUI greatly simplifies common DupLog tasks, but it is still possible to accomplish the same through the command line.

7.1.1 Main panel

The main panel, shown in Figure 7.1, is the panel that opens when starting the GUI. It contains control elements grouped into boxes, with each box related to a different control group. Through the main panel it is possible to initialize the system, start the sensor, start visualizations, scan a Duplo assembly, store and retrieve point clouds and open the other panels.


Figure 7.1: The main panel of the DupLog GUI.

Initialization controls

At the top left the “Initialize” button can be seen with a colored indicator next to it. When the indicator is bright the system is properly initialized; when it is dark it is not. By pressing the button the user can initialize or re-initialize the system, which calls the following console commands:

./datalab -c
./setup.sh

The initialization process creates a register and all data structures used by DupLog in shared memory. The register maintains the references to the data structures. The data structures created are: p1, f, l, r, b being the point clouds, f1 being the Kinect sensor settings, o1 being the octree settings, s1, s2 being string representations and v1, v2, v3, v4, vf, vl, vr, vb being visualization settings for all the other data structures.

It is important to wait for the program to finish loading before performing other operations with DupLog. If the system breaks down through system faults or illegal commands, rerunning the initialization process often fixes the problem, but remember to store your settings first because initialization resets them to their default values.

Camera controls

The box below the initialization box contains the camera controls. It contains the button “Fnect Settings” that opens the fnect-settings panel in a new window, and a button “Start Camera” that starts the Kinect sensor when pressed. The indicator next to the “Start Camera” button indicates whether the sensor is running. The “FPS” counter below that button gives the capture rate of the sensor in frames per second. When the sensor is started, the “Start Camera” button changes into a “Stop Camera” button, which can be clicked to stop the camera once it is running. To start the camera through the console type (preferably in a new console window):

./fnect -r f1

And to stop the sensor type:

./fnect -s f1

Octree controls

The box below the camera controls contains the octree controls. It provides a button “Octree Settings” to open the octree-settings panel and another button “Update Octree” that signals the visualization of o1 to redraw the octree by calling the following console command:

./oct --change o1

Visualization controls

The box below the octree controls has the visualization controls. Here the user can select the data structures to visualize by enabling and disabling check-boxes labeled “p1”, “o1”, “s1” and “s2”, corresponding respectively to visualization settings v1, v2, v3 and v4. After pressing the “Change” button, all visualizations are first closed and then the checked visualizations start. The indicator is bright when there are visualizations running and dark otherwise. Through the “Close” button a user can close all visualizations with one click. The “Display Cubes” check-box enables or disables a helper 3D cube grid for visualization v1. This is especially useful during calibration of the sensor.

To start a visualization (e.g. v1) from the console type:

./vis -r v1
