

Master Graduation Project

3D Face Reconstruction using Structured Light on a Hand-held Device

Author: Martin Roa Villescas

Supervisors: Dr. Ir. Frank van Heesch, Prof. Dr. Ir. Gerard de Haan

A thesis submitted in fulfilment of the requirements for the degree of Master of Embedded Systems

in the Smart Sensors & Analysis Research Group, Philips Research

August 2013


Abstract

Department of Mathematics and Computer Science
Master of Embedded Systems

3D Face Reconstruction using Structured Light on a Hand-held Device
by Martin Roa Villescas

A 3D hand-held scanner using the structured lighting technique has been developed by the Smart Sensors & Analysis research group (SSA) in Philips Research Eindhoven. This thesis presents an embedded implementation of this scanner. A translation of the original MATLAB implementation into the C language yielded a speedup of approximately 15 times when running on a desktop computer. However, running the new implementation on an embedded platform increased the execution time from 0.5 seconds to more than 14 seconds. A wide range of optimizations was proposed and applied to improve the performance of the application, resulting in a final execution time of 5.1 seconds. Moreover, a visualization module was developed to display the reconstructed 3D models by means of the projector contained in the embedded device.

Acknowledgements

I owe a debt of gratitude to the many people who helped me during my years at TU/e.

First, I would like to thank Frank van Heesch, my supervisor at Philips, an excellent professional and an even better person, who showed me the way through this challenging project while encouraging me at every step of the way. He was always generous with his time and steered me in the right direction whenever I felt I needed help. He has deeply influenced every aspect of my work.

I would also like to express my sincerest gratitude to my professor Gerard de Haan, the person responsible for opening Philips' doors to me. His achievements are a constant source of motivation. Gerard is a clear demonstration of how the collaboration between industry and academia can produce unprecedented and magnificent results.

My special thanks to all my fellow students at Philips Research, who made these eight months a wonderful time of my life. Their input and advice contributed significantly to the final result of my work. In particular, I would like to thank Koen de Laat for helping me set up an automated database system to keep track of the profiling results.

Furthermore, I would like to thank Catalina Suarez, my girlfriend, for her support during this year. Your company has translated into the happiness I need to perform well in the many aspects of my life.

Finally, I would like to thank my family for their permanent love and support. It is hard to find the right words to express the immense gratitude that I feel for those persons who have given me everything so that I could be standing where I am now. Mom and dad, my achievements are the result of the infinite love that you have given me throughout my life and I will never stop feeling grateful for that.


Contents

Abstract
Acknowledgements
List of Figures

1 Introduction
  1.1 3D Mask Sizing project
  1.2 Objectives
  1.3 Report organization

2 Literature study
  2.1 Surface reconstruction
    2.1.1 Stereo analysis
    2.1.2 Structured lighting
      2.1.2.1 Triangulation technique
      2.1.2.2 Pattern coding strategies
      2.1.2.3 3D human face reconstruction
  2.2 Camera calibration
    2.2.1 Definition
    2.2.2 Popular techniques

3 3D face scanner application
  3.1 Read binary file
  3.2 Preprocessing
    3.2.1 Parse XML file
    3.2.2 Discard frames
    3.2.3 Crop frames
    3.2.4 Scale
  3.3 Normalization
    3.3.1 Normalization
    3.3.2 Texture 2
    3.3.3 Modulation
    3.3.4 Texture 1
  3.4 Global motion compensation
  3.5 Decoding
  3.6 Tessellation
  3.7 Calibration
    3.7.1 Offline process
    3.7.2 Online process
  3.8 Vertex filtering
    3.8.1 Filter vertices based on decoding constraints
    3.8.2 Filter vertices outside the measurement range
    3.8.3 Filter vertices based on a maximum edge length
  3.9 Hole filling
  3.10 Smoothing

4 Embedded system development
  4.1 Development tools
    4.1.1 Hardware
      4.1.1.1 Single-board computer survey
      4.1.1.2 BeagleBoard-xM features
    4.1.2 Software
      4.1.2.1 Software libraries
      4.1.2.2 Software development tools
  4.2 MATLAB to C code translation
    4.2.1 Motivation for developing in C language
    4.2.2 Translation approach
  4.3 Visualization

5 Performance optimizations
  5.1 Double to single-precision floating-point numbers
  5.2 Tuned compiler flags
  5.3 Modified memory layout
  5.4 Reimplementation of C's standard power function
  5.5 Reduced memory accesses
  5.6 GMC in y dimension only
  5.7 Error in Delaunay triangulation
  5.8 Modified line shifting in GMC stage
  5.9 New tessellation algorithm
  5.10 Modified decoding stage
  5.11 Avoiding redundant calculations of column-sum vectors in the GMC stage
  5.12 NEON assembly optimization 1
  5.13 NEON assembly optimization 2

6 Results
  6.1 MATLAB to C code translation
  6.2 Visualization
  6.3 Performance optimizations

7 Conclusions
  7.1 Future work

Bibliography

List of Figures

1.1 A subset of the CPAP masks offered by Philips.
1.2 A 3D hand-held scanner developed in Philips Research.
2.1 Standard stereo geometry.
2.2 Assumed model for triangulation as proposed in [4].
2.3 Examples of pattern coding strategies.
2.4 A reference framework assumed in [25].
3.1 General flow diagram of the 3D face scanner application.
3.2 Example of the 16 frames that are captured by the hand-held scanner.
3.3 Flow diagram of the preprocessing stage.
3.4 Flow diagram of the normalization stage.
3.5 Example of the 18 frames produced in the normalization stage.
3.6 Camera frame sequence in a coordinate system.
3.7 Flow diagram for the calculation of the texture 1 image.
3.8 Flow diagram for the global motion compensation process.
3.9 Difference between pixel-based and edge-based decoding.
3.10 Vertices before and after the tessellation process.
3.11 The Delaunay tessellation with all the circumcircles and their centers [33].
3.12 The calibration chart.
3.13 The 3D model before and after the calibration process.
3.14 3D resulting models after various filtering steps.
3.15 Forehead of the 3D model before and after applying the smoothing process.
4.1 The BeagleBoard-xM offered by Texas instruments.
4.2 Simplified diagram of the 3D face scanner application.
4.3 UV coordinate system.
4.4 Diagram of the visualization module.
5.1 Execution times of the MATLAB and C implementations after run on different platforms.
5.3 Execution time before and after tuning GCC's compiler options.
5.4 Modification of the memory layout of the camera frames.
5.5 Execution time with a different memory layout.
5.6 Execution time before and after reimplementing C's standard power function.
5.7 Order of execution before and after the optimization.
5.8 Difference in execution time before and after reordering the preprocessing stage.
5.9 Flow diagram for the GMC process as implemented in the MATLAB code.
5.10 Difference in execution time before and after modifying the GMC stage.
5.11 Execution time of the application after fixing an error in the tessellation stage.
5.12 Execution times of the application before and after optimizing the line shifting mechanism in the GMC stage.
5.13 The Delaunay triangulation was replaced with a different algorithm that takes advantage of the fact that vertices are sorted.
5.14 Execution times of the application before and after replacing the Delaunay triangulation with the new approach.
5.15 Execution time of the application before and after optimizing the decoding stage.
5.16 Flow diagram for the optimized GMC process that avoids the recalculation of the image's columns sum.
5.17 Execution times of the application before and after avoiding redundant calculations of column-sum vectors in the GMC stage.
5.18 NEON SIMD architecture extension featured by Cortex-A series processors along with the related terminology.
5.19 Execution flow after first NEON assembly optimization.
5.20 Execution times of the application before and after applying the first NEON assembly optimization.
5.21 Example of how to construct a LUT to apply gamma correction to the average of two 2-bit pixels.
5.22 Execution times of the application before and after applying the second NEON assembly optimization.
5.23 Final execution flow after second NEON assembly optimization.
6.1 Execution times of the MATLAB and C implementations after run on different platforms.
6.2 Example of the visualization module developed.
6.3 Performance evolution of the 3D face scanner's C implementation.
6.4 Execution times for each stage of the application.

1 Introduction

The potential of science and technology to improve every aspect of life seems to be boundless, or at least this is what the innovations of the previous centuries suggest.

Among the many different interests that advocate the development of science and technology, human healthcare has always been an important stimulus. New technologies are constantly being developed by leading companies all around the world to improve the quality of people's lives. A clear example is the case of the Dutch multinational Royal Philips Electronics, which devotes special interest to the development and introduction of meaningful innovations that improve people's lives.

Within the wide range of products offered by Philips, there is a specific group categorized under the name of sleep solutions that aims at improving the sleep quality of people. A well-known family of products contained within this category are the so-called CPAP (Continuous Positive Airway Pressure) masks. Such masks are used primarily in the treatment of sleep apnea, a sleep disorder characterized by pauses in breathing or instances of very low breathing during sleep [1]. According to a recent study conducted by Philips in collaboration with the University of Twente, 6.4% of the surveyed population was found to suffer from this disorder [2]. A total of 4,206 people, comprising women and men of different ages and levels of education, took part in the 2-year study. A similar survey was undertaken by the National Institutes of Health in the United States of America [3]. It reported that sleep apnea was prevalent in more than 18 million Americans, i.e. 6.62% of the country's population.

To meet the large demand for CPAP masks, Philips has designed and introduced a wide variety of mask models that seek to fulfill the different needs and constraints that arise due to several factors, which include the large diversity of size and shape of human faces, inclination towards breathing through the mouth or nose, diagnosis of diseases such as sinusitis or dermatitis, or disorders such as claustrophobia, amongst others. A subset of these models is shown in Figure 1.1. It is important to mention that a poor selection of a CPAP mask might cause undesirable side effects to the patient, such as marks or even pressure ulcers. Consequently, the physical dimensions of each patient's face play a crucial role in the selection of the most appropriate CPAP mask.

Figure 1.1: A subset of the CPAP masks offered by Philips: (a) Amara, (b) ComfortClassic, (c) ComfortGel Blue, (d) ComfortLite 2, (e) FitLife, (f) GoLife, (g) ProfileLite Gel, (h) Simplicity, (i) ComfortGel.

Unfortunately, the current practices used to assess the adequacy of CPAP masks based on facial dimensions are quite error prone. They rely on trial-and-error procedures in which the patient tries on different mask models and selects the one he thinks is the most comfortable. In order to alleviate this problem, Philips Research launched the 3D Mask Sizing project, which aims to develop an automated embedded system capable of assisting sleep technicians in prescribing the most appropriate CPAP mask for each patient.

1.1 3D Mask Sizing project

The 3D Mask Sizing project is based on the initiative of Philips to develop some technological means that can assist sleep technicians in the selection of a proper CPAP mask model for each patient. A series of algorithms, methods and hardware prototypes are the result of several years of research carried out by the Smart Sensing & Analysis research group in Philips Research Eindhoven. The resulting automated mask advising system comprises four main parts:

1. An accurate 3D model reconstruction of the patient's face dimensions and geometry.

2. The extraction of facial landmarks from the reconstructed model by means of computer vision algorithms.

3. The actual fit quality assessment by virtually fitting a series of 3D mask models to the reconstructed face.

4. The creation of a custom cushion that optimizes for uniform pressure along the cushion contour.

The focus of this thesis project is on the first step.

As part of the progress made in the 3D Mask Sizing project at Philips Research Eindhoven, a first prototype of a 3D hand-held scanner using the structured lighting technique was already developed and is the basis for the present project. Figure 1.2a shows the hardware setup of this device. In short, this scanner is capable of capturing a picture sequence of a patient's face while illuminating it with specific structured light patterns. This picture sequence is processed by means of a series of algorithms in order to reconstruct a 3D model of the face. An example of a resulting 3D model is presented in Figure 1.2b. The reconstruction process and all other calculations are currently performed offline and are mostly implemented in MATLAB.

Figure 1.2: A 3D hand-held scanner developed in Philips Research: (a) hardware, (b) 3D model example.

1.2 Objectives

The main objective of this thesis project is to extend the functionality of the mentioned scanner such that the 3D reconstruction is computed locally on the embedded platform.

This implies transforming the already developed methods and algorithms in such a way that extra-functional requirements are taken into account. These extra-functional requirements involve an optimal use of the available computational resources. Highest priority should be given to the execution time of the application. Specifically, the 3D reconstruction should run on the embedded device in less than 5 seconds on average. Because the embedded processor contained in the final product will be similar to an ARM Cortex-A8, the new implementation should target this processor in particular by making proper use of the specific features it provides. Moreover, the visualization of the reconstructed face model should be made possible by means of the embedded projector contained in the device.

1.3 Report organization

This report is organized as follows: Chapter 2 presents the basic principles that underlie different technologies for surface reconstruction, placing special emphasis on structured lighting techniques. In Chapter 3, an overview of the 3D face scanner application is provided, which functions as the starting point for the current project. Chapter 4 details the most relevant aspects that pertain to the implementation of the 3D face scanner application on an embedded device. In Chapter 5, a series of optimizations used to reduce the execution time of the application are described. Chapter 6 highlights the most important results of the development process, namely the MATLAB to C translation, the visualization module and the set of optimizations. Finally, Chapter 7 concludes the thesis while delineating paths for further improvements of the presented work.

2 Literature study

This chapter presents a selective analysis of the state-of-the-art in the field of surface reconstruction, placing special emphasis on structured lighting techniques. A brief overview of the three main underlying technologies used for depth estimation is presented first. This is followed by an example of stereo analysis, which serves as the basis for the more specific structured lighting techniques. Moreover, this example helps to illustrate why stereo analysis is considered less preferable for 3D face reconstruction applications when compared with the structured lighting techniques. Special emphasis is placed on the scientific principles underlying structured lighting techniques. Furthermore, a classification of the different types of pattern coding strategies available in the literature is given, along with an analysis of their suitability for our application. Finally, the chapter concludes with a brief discussion of camera calibration and its most representative techniques.

2.1 Surface reconstruction

Surface reconstruction has a wide range of practical applications such as computer modeling of 3D objects (such as those found in areas like architecture, mechanical engineering, or surgery), distance measurements for vehicle control, surface inspections for quality control, approximate or exact estimates of the location of 3D objects for automated assembly, and fast location of obstacles for efficient navigation [4].

Technologies for surface reconstruction include contact and non-contact techniques, the latter being our principal interest. Non-contact techniques may be further categorized as echo-metric, reflecto-metric and stereo-metric, as proposed in [5]. Echo-metric techniques use time-of-flight measurements to determine the distance to an object, i.e. they are based on the time it takes for a wave (acoustic, micro, electromagnetic) to reflect from an object's surface through a given medium. Reflecto-metric techniques process one or more images of the object to determine its surface orientation and consequently its shape. Finally, stereo-metric techniques determine the location of the object's surface by triangulating each point with its corresponding projections in two or more images.

Echo-metric techniques suffer from a number of drawbacks. Systems employing such techniques are heavily affected by environmental parameters such as temperature and humidity [6]. These parameters affect the velocity at which waves travel through a given medium, thus introducing errors in depth measurement. On the other hand, both reflecto-metric and stereo-metric techniques are less affected by environmental parameters. However, reflecto-metric techniques entail a major difficulty, i.e. they require an estimation of the model of the environment. In the remainder of this section, we will limit the discussion to the stereo-metric category, and focus on the structured lighting techniques.

2.1.1 Stereo analysis

Considering that surface reconstruction by means of structured lighting can be regarded as an extension of the more general stereo-vision technique, an introductory example of stereo analysis is presented in this section. This example, taken from [4], intends to show why the use of structured lighting becomes essential for our application.

Surface reconstruction can be achieved by means of the visual disparity that results when an object is observed from different camera viewpoints. In its simplest form, two cameras can be used for this purpose. Triangulation between a point in the object and its respective projection in each of the camera projection planes can be used to calculate the depth at which this point lies from a certain reference. Note, however, that in order to calculate the triangulation, more parameters are required. These parameters refer, for example, to the distance at which the cameras are located from one another (extrinsic parameter), or to the focal length of each of the cameras (intrinsic parameter).

Figure 2.1 illustrates the so-called standard stereo geometry [4] of two cameras. In this model, the origin of the XYZ-coordinate system O = (0, 0, 0) is located at the focal point of the left camera. The focal point of the right camera lies at a distance b along the X-axis from the left camera, i.e. at the point (b, 0, 0). Both cameras are assumed to have the same focal length f. As a consequence, the images of both cameras are located in the same image plane. The Z-axis coincides with the optical axis of the left camera. Moreover, the optical axes of both cameras are parallel to each other and oriented towards the scene objects. Also, note that because the x-axes of both images are identically oriented, rows with the same row number in the two different images lie on the same straight line.

Figure 2.1: Standard stereo geometry.

In this model a scene point P = (X, Y, Z) is projected onto two corresponding image points

$$p_{\text{left}} = (x_{\text{left}}, y_{\text{left}}) \quad \text{and} \quad p_{\text{right}} = (x_{\text{right}}, y_{\text{right}})$$

in the left and right images, respectively, assuming that the scene point is visible from both camera viewpoints. The disparity with respect to plef t is a vector given by

$$\Delta(x_{\text{left}}, y_{\text{left}}) = (x_{\text{left}} - x_{\text{right}},\; y_{\text{left}} - y_{\text{right}})^T \qquad (2.1)$$

between two corresponding image points.

In the standard stereo geometry, pinhole camera models are used to represent the considered cameras. The basic idea of a pinhole camera is that it projects scene points P onto image points p according to a central projection given by

$$p = (x, y) = \left( \frac{f \cdot X}{Z},\; \frac{f \cdot Y}{Z} \right) \qquad (2.2)$$

assuming that Z > f.

According to the ideal assumptions considered in the standard stereo geometry of the two cameras, it holds that $y = y_{\text{left}} = y_{\text{right}}$. Therefore, for the left camera the central projection equation is given directly by Equation 2.2, considering that the pinhole camera model assumes the Z-axis to coincide with the optical axis of the camera. Furthermore, given the displacement of the right camera by b along the X-axis, the central projection equation is given by

$$(x_{\text{right}}, y) = \left( \frac{f \cdot (X - b)}{Z},\; \frac{f \cdot Y}{Z} \right).$$

Rather than calculating a disparity vector given by Equation 2.1 for all corresponding pairs of points in the different images, the scalar disparity proves to be sufficient under the assumptions made in the standard stereo geometry. The scalar disparity of two corresponding points in each one of the images with respect to $p_{\text{left}}$ is given by

$$s_{sg}(x_{\text{left}}, y_{\text{left}}) = \sqrt{(x_{\text{left}} - x_{\text{right}})^2 + (y_{\text{left}} - y_{\text{right}})^2}.$$

However, because rows with the same row number in the two images have the same y value, the scalar disparity of a pair of corresponding points reduces to

$$s_{sg}(x_{\text{left}}, y_{\text{left}}) = |x_{\text{left}} - x_{\text{right}}| = x_{\text{left}} - x_{\text{right}}. \qquad (2.3)$$

Note that it is valid to remove the absolute value operator because of the chosen arrangement of the cameras. A disparity map $\Delta(x, y)$ is defined by applying Equation 2.3 to all corresponding points in the two images. For those points that could not be associated with a corresponding point in the other image (for example because of occlusion), the value "undefined" is recorded.

Finally, in order to come up with the equations that determine the 3D location of each point in the scene, note that from the two central projection equations of the two cameras it follows that

$$Z = \frac{f \cdot X}{x_{\text{left}}} = \frac{f \cdot (X - b)}{x_{\text{right}}}$$

and therefore

$$X = \frac{b \cdot x_{\text{left}}}{x_{\text{left}} - x_{\text{right}}}.$$

Using the previous equation it follows that

$$Z = \frac{b \cdot f}{x_{\text{left}} - x_{\text{right}}}.$$

By substituting this result into the projection equation for y it follows that

$$Y = \frac{b \cdot y}{x_{\text{left}} - x_{\text{right}}}.$$

The last three equations allow the reconstruction of the coordinates of the projected points P within the three-dimensional XYZ-space, assuming that the parameters f and b are known and that the disparity map $\Delta(x, y)$ was measured for each pair of corresponding points in the two images. Note that a variety of methods exists to calibrate different types of camera configuration systems, i.e. to determine their intrinsic and extrinsic parameters. More on these calibration procedures is discussed in Section 2.2.
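As a concrete illustration of how these reconstruction equations translate into code, the following C sketch computes (X, Y, Z) for one pair of corresponding image points. The function name, parameter layout and the example values are illustrative assumptions, not taken from any actual implementation.

#include <math.h>
#include <stdio.h>

/* Reconstruct a scene point (X, Y, Z) from a pair of corresponding image
 * points in the standard stereo geometry. f is the focal length and b the
 * base distance. Returns 0 on success, -1 when the disparity is (close to)
 * zero and the depth is undefined. Illustrative sketch only. */
static int reconstruct_point(double x_left, double y, double x_right,
                             double f, double b,
                             double *X, double *Y, double *Z)
{
    double disparity = x_left - x_right;   /* Equation 2.3 */
    if (fabs(disparity) < 1e-12)
        return -1;
    *X = b * x_left / disparity;
    *Y = b * y / disparity;
    *Z = b * f / disparity;
    return 0;
}

int main(void)
{
    double X, Y, Z;
    /* Hypothetical values; f and b must be expressed in consistent units. */
    if (reconstruct_point(120.0, 35.0, 95.0, 800.0, 60.0, &X, &Y, &Z) == 0)
        printf("X=%.2f Y=%.2f Z=%.2f\n", X, Y, Z);
    return 0;
}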

The process of determining corresponding point pairs is known as the correspondence problem. A wide variety of techniques is used to solve the correspondence problem in stereo image analysis. Such techniques generally involve the extraction and matching of features between two or more images. These features are typically corners or edges contained within the images. Although these techniques are found to be appropriate for a certain number of applications, they present a number of drawbacks that make them unsuitable for many others. The main drawbacks are: (i) feature extraction and matching is generally computationally expensive; (ii) features might not be available depending on the nature of the environment or the placement of the cameras; and (iii) low lighting conditions generally increase the complexity of the matching procedure, thus making the system more error prone. Such problems can generally be overcome by resorting to a different but related family of techniques known as structured lighting techniques. While structured lighting techniques involve a completely different methodology for solving the correspondence problem, they share a large part of the theory presented in this section regarding the depth reconstruction process.

2.1.2 Structured lighting

Structured lighting methods can be thought of as a modification of the previously described stereo analysis approach where one of the cameras is replaced by a light source which projects a light pattern actively into the scene. The location of an object in space can then be determined by analyzing the deformation of the projected light pattern.

The idea behind this modification is to simplify the complexity of the correspondence analysis by actively manipulating the scene.

It is important to note that stereoscopic systems do not assume complex requirements for image acquisition, since they mostly rely on theoretical, mathematical and algorithmic analyses to solve the reconstruction problem. On the other hand, the idea behind structured lighting methods is to shift this complexity to another level, such as the engineering prerequisites of the overall system [4].

A wide variety of light patterns have been proposed by the research community [5], [7]–[17]. Their aim is to reduce the large number of images that would have to be captured when using the most basic of all approaches, i.e., a light spot. In Section 2.1.2.2 a classification of the encoded patterns available is presented. Nevertheless, the light spot projection technique serves as a solid starting point to introduce the main principle underlying the depth recovery of most other encoded light patterns: the triangulation technique.

2.1.2.1 Triangulation technique

Triangulation refers to the process of determining the location of a point by measuring angles formed from it to points at either end of a fixed baseline. Various approaches have been proposed for accomplishing this task. An early analysis was described by Hall et al. [18] in 1982. Klette also presented his own analysis in [4]. In the following, an overview of Klette’s triangulation approach is explained.

Figure 2.2 shows the simplified model that Klette assumes in his analysis. Note that the system can be thought of as a 2D object scene, i.e. it has no vertical dimension. As a consequence, the object, light source and camera all lie in the same plane. The angles α and β are given by the calibration. As in the previous example, the base distance b is assumed to be known and the origin of the coordinate system O coincides with the projection center of the camera.

Figure 2.2: Assumed model for triangulation as proposed in [4].

The goal is to calculate the distance d between the origin O and the object point $P = (X_0, Z_0)$. This can be done using the law of sines as follows:

$$\frac{d}{\sin(\alpha)} = \frac{b}{\sin(\gamma)}.$$

From $\gamma = \pi - (\alpha + \beta)$ and $\sin(\pi - \gamma) = \sin(\gamma)$ it holds that

$$\frac{d}{\sin(\alpha)} = \frac{b}{\sin(\pi - \gamma)} = \frac{b}{\sin(\alpha + \beta)}.$$

Therefore, distance d is given by

$$d = \frac{b \cdot \sin(\alpha)}{\sin(\alpha + \beta)}$$

which holds for any point P lying on the surface of the object.
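A minimal C sketch of this triangulation formula follows; the angles are assumed to be expressed in radians and the function name is hypothetical.

#include <math.h>

/* Distance d from the camera's projection center O to the object point P,
 * given the base distance b and the calibrated angles alpha and beta
 * (in radians): d = b * sin(alpha) / sin(alpha + beta).
 * Illustrative sketch; no guard against the degenerate case alpha + beta = 0. */
static double triangulate_distance(double b, double alpha, double beta)
{
    return b * sin(alpha) / sin(alpha + beta);
}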

2.1.2.2 Pattern coding strategies

As stated earlier, there is a wide variety of pattern coding strategies available in the literature that aim to fulfill all requirements found in different scenarios and applications.

In coded structured light systems, every coded pixel in the pattern has its own codeword that allows direct mapping, i.e. every codeword is mapped to the corresponding coordinates of a given pixel or group of pixels in the pattern. A codeword can be represented using grey levels, colors or even geometrical characteristics. The following classification of pattern coding strategies was proposed by Salvi et al. in [19]:

• Time-multiplexing: This is one of the most commonly used strategies. The idea is to project a set of patterns onto the scene, one after the other. The sequence of illuminated values determines the codeword for each pixel. The main advantage of this kind of pattern is that it can achieve high spatial resolution in the measurements. However, its accuracy is highly sensitive to movement of either the structured light system or objects in the scene during the time period when the acquisition process takes place. Previous research in this area includes the work of [5], [7], [8]. An example of this coding strategy is the binary coded pattern shown in Figure 2.3a; a generic sketch of how such a pattern set can be generated is given after Figure 2.3.

• Spatial Neighborhood: In this strategy, the codeword that is assigned to a given pixel depends on its neighborhood. Codification is done on the basis of intensity [9]–[11], color [12], or a unique structure of the neighborhood [13]. In contrast with time-multiplexing strategies, spatial neighborhood strategies allow for all coding information to be condensed into a single projection pattern, making them highly suitable for applications that involve timing constraints, such as autonomous navigation. The compromise, however, is deterioration in spatial resolution. Figure 2.3b is an example of this strategy proposed by Griffin et al. [14].

• Direct coding: In direct coding strategies every pixel in the pattern is labeled by the information it represents. In other words, the entire codeword for a given point is contained in a unique pixel, as explained in [19]. Basically, there are two ways to achieve this: either by using a large range of color values [15], [16], or by introducing periodicity [17]. Although in theory this group of strategies can be used to reconstruct objects with high resolution, a major problem occurs in practice: the colors imaged by camera(s) of the system do not only depend on the projected colors, but also on the intrinsic colors of the measuring surface and light source. The consequence is that reference images become necessary. Figure 2.3c shows an example of a direct coding strategy proposed in [16].

Figure 2.3: Examples of pattern coding strategies: (a) time-multiplexing, (b) spatial neighborhood, (c) direct coding.
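To make the time-multiplexing idea more concrete, the sketch below generates a set of binary stripe patterns in which each projector column receives a Gray-coded codeword across the pattern sequence. This is a generic illustration of the strategy described in the time-multiplexing bullet above; the function name, buffer layout and bit ordering are assumptions, not the pattern set actually used by the scanner.

#include <stdlib.h>

/* Generate num_patterns binary stripe patterns of size width x height.
 * Pattern k encodes one bit of the Gray code of each column index
 * (most significant bit first, i.e. coarsest stripes first), so the
 * sequence of illuminated values observed at a pixel forms the column's
 * codeword. 2^num_patterns should be >= width. patterns[k] must point to
 * width*height bytes (written as 0 or 255). Illustrative sketch of a
 * time-multiplexed coding strategy. */
static void generate_gray_code_patterns(unsigned char **patterns,
                                        int num_patterns,
                                        int width, int height)
{
    for (int k = 0; k < num_patterns; k++) {
        for (int x = 0; x < width; x++) {
            unsigned int gray = (unsigned int)x ^ ((unsigned int)x >> 1);
            int bit = (int)((gray >> (num_patterns - 1 - k)) & 1u);
            unsigned char value = bit ? 255 : 0;
            for (int y = 0; y < height; y++)
                patterns[k][y * width + x] = value;
        }
    }
}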

2.1.2.3 3D human face reconstruction

Given the importance of face reconstruction in a wide range of fields, such as security, forensics, or even entertainment, it is no surprise that special focus has been devoted to this area by the research community over the last decades. A comparative study of three different 3D face reconstruction approaches is presented in [20]. Here, the most representative techniques of three different domains are tested. These domains are binocular stereo, structured lighting, and photometric stereo. The experimental results show that active reconstruction techniques perform better than purely passive ones for this application.

The majority of analyses of vision-based reconstruction have focused on general performance for arbitrary scenes rather than on specific objects, as reported in [20]. Nevertheless, some effort has been made to evaluate structured lighting techniques with a special focus on human face reconstruction. In [21], a comparison is presented between three structured lighting techniques (Gray Code, Gray Code Shift, and Stripe Boundary) to assess 3D reconstruction for human faces by using mono and stereo systems. The results show that the Gray Code shift coding performs best given the high number of emitted patterns it uses. A further study on this topic was performed by the same author in [22]. Again, it was found that time-multiplexing techniques, such as binary encoding using Gray Code, provide the highest accuracy. With a rather different objective than that sought by Woodward et al. in [21] and [22], Fechteler et al. [23] also focus their effort on presenting a framework that captures 3D models of faces in high resolution with low computational load. Here, the system uses a single colored stripe pattern for the reconstruction purpose plus a picture of the face illuminated with regular white light that is used as texture.

Particular aspects of 3D human face reconstruction, such as the proximity, size and texture involved, make structured lighting a suitable approach. In contrast, other reconstruction techniques might be less suitable when dealing with these particular aspects. For example, stereoscopic approaches fail to provide positive results when the textures involved do not contain features that can be easily extracted and matched by means of algorithms, as in the case of the human face. On the other hand, the concepts behind structured lighting make it very convenient to reconstruct this kind of surface, given the proximity involved and the size limits of the object in question (appropriate for projecting encoded patterns).

With regard to the suitability of the different pattern coding strategies for our application (3D human face reconstruction by means of a hand-held scanner), there are several factors to consider. Spatial neighborhood strategies do not offer high spatial resolution, which is needed by the algorithms that assess the fit quality of the various mask models.

Direct coding strategies suffer from practical problems that affect their robustness in different scenarios. This centers the attention on the time-multiplexing techniques, which are known to provide high spatial resolution. The problem with such techniques is that they are highly sensitive to movement, which is likely to be present on a hand-held device. Fortunately, there are several approaches to how this problem can be solved. Consequently, a time-multiplexing technique is employed in our application.

2.2 Camera calibration

Camera calibration is a crucial ingredient in the process of metric scene measurement.

This section presents a review of some of the most popular techniques with special focus on those that are regarded as adequate for our application.

(27)

2.2.1 Definition

Camera calibration is the process of determining a mathematical approximation of the physical and optical behavior of an imaging system by using a set of parameters. These parameters can be estimated by means of direct or iterative methods and they are divided into two groups. On the one hand, intrinsic parameters determine how light is projected through the lens onto the image plane of the sensor. The focal length, projection center and lens distortion are all examples of intrinsic parameters. On the other hand, extrinsic parameters measure the position and orientation of the camera with respect to a world coordinate system, as defined in [24]. To better illustrate these ideas, consider Figure 2.4, which corresponds to the optical system for the structured pattern projection and triangulation considered in [25]. The focal length fc and the projection center Oc are examples of intrinsic parameters of the camera, while the distance D between the camera and the projector corresponds to an extrinsic parameter.

Figure 2.4: A reference framework assumed in [25].
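For reference, intrinsic and extrinsic parameters are commonly grouped in the textbook pinhole projection written in homogeneous coordinates; this is a standard formulation added here only for illustration, not the specific model used in [24] or [25]:

$$s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \underbrace{\begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}}_{\text{intrinsic}} \underbrace{\begin{pmatrix} R & \mathbf{t} \end{pmatrix}}_{\text{extrinsic}} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Here $f_x$, $f_y$ and $(c_x, c_y)$ encode the focal length and projection center, while the rotation $R$ and translation $\mathbf{t}$ relate the world coordinate system to the camera; lens distortion, when modeled, is applied on top of this projection.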

2.2.2 Popular techniques

In 1982, Hall et al. [18] proposed a technique consisting of an implicit camera calibration that uses a 3 × 4 transformation matrix which maps 3D object points to their respective 2D image projections. Here, the model of the camera does not consider any lens distortion. For a detailed description of this method refer to [18]. Some years later, in 1986, Faugeras improved Hall's work by proposing a technique that was based on extracting the physical parameters of the camera from the transformation technique proposed in [18]. The description of this technique is given in [26] and [27]. A non-linear explicit camera calibration that included radial lens distortion was proposed by Salvi in his Ph.D. thesis [28] which, as he mentions, can be regarded as a simple adaptation of Faugeras' linear method. However, a method that would become much more popular and that is still widely used was proposed by Tsai in 1987 [29]. Here, the author proposes a two-step technique that models only radial lens distortion. Also worth mentioning is the model proposed by Weng [30] in 1992, which includes three different types of lens distortion.

The calibration mechanism that is currently being used in our application is based on the work performed by Peter-Andre Redert as part of his PhD thesis [31]. Although this mechanism focuses on stereo camera calibration, it was generalized for a system with one camera and one projector. It involves imaging a controlled scene from different positions and orientations. The controlled scene consists of a rigid calibration chart with several markers. The geometric and photometric properties of such markers are known precisely so that they can be detected. After corresponding markers in the different images are found, an algorithm searches for the optimal set of camera parameters for which triangulation of all corresponding marker-point pairs gives an accurate reconstruction of the calibration chart. This calibration mechanism is discussed further in Section 3.7.

3 3D face scanner application

This chapter provides a general overview of the 3D face scanner application developed by the Smart Sensing & Analysis research group and provided as a starting point for the current project. Figure 3.1 presents the main steps involved in the 3D reconstruction process.

Figure 3.1: General flow diagram of the 3D face scanner application.

The current scanner uses a total of 16 binary coded patterns that are sequentially projected onto the scene. For each projection, the scene is captured by means of the embedded camera, hence producing 16 different grayscale frames (Figure 3.2) that are fed to the application in the form of a binary file. This falls in line with the discussion presented in Section 2.1.2.3 of the literature study of why time-multiplexing strategies are more suitable than spatial neighborhood or direct coding strategies for face reconstruction applications. In Sections 3.1 to 3.9, each of the steps shown in Figure 3.1 is described.


Figure 3.2: Example of the 16 frames that are captured by the embedded camera while the scene is being illuminated with binary structured light patterns. This frame sequence is the input for the 3D face scanner application.

3.1 Read binary file

The first step of the application is to read the binary file that contains the required information for the 3D reconstruction. The binary file is composed of two parts, the header and the actual data. The header contains metadata of the acquired frames, such as the number of frames and the resolution of each one. The second part contains the actual data of the captured frames. Figure 3.2 shows an example of such frame sequence, which from now on will be referred to as camera frames.

3.2 Preprocessing

The preprocessing stage comprises the four steps shown in Figure 3.3. Each of these steps is described in the following subsections.

Figure 3.3: Flow diagram of the preprocessing stage.

3.2.1 Parse XML file

In this stage, the application first reads an XML file that is included for every scan. This file contains relevant information for the structured light reconstruction. This information includes: (i) the type of structured light patterns that were projected when acquiring the data; (ii) the number of frames captured while structured light patterns were being projected; (iii) the image resolution of each frame to be considered and (iv) the calibration data.

3.2.2 Discard frames

Based on the number of frames value read from the XML file, the application discards extra frames that do not contain relevant information for the structured light approach, but that are provided as part of the input.

3.2.3 Crop frames

The original resolution of each camera frame (480 × 768) is modified in order to obtain a new, more suitable resolution for the subsequent algorithms of the program (480 × 754). This is accomplished by cropping the pixels that are close to the top border of the images. Note that this operation does not imply a loss of information in this application in particular. This is because pixels near the frame borders do not contain facial information, and therefore, can be safely removed.

3.2.4 Scale

Each pixel of the camera frame sequence (as provided by the embedded camera) is represented by an 8-bit unsigned integer value that ranges from 0 to 255. In this stage, the data type is transformed from unsigned integer to floating point while dividing each pixel value by 255. The new set of values range between 0 and 1.
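A minimal sketch of this conversion, assuming the frames are stored as one contiguous 8-bit array (names and layout are illustrative):

#include <stddef.h>
#include <stdint.h>

/* Convert 8-bit camera samples to single-precision floats in [0, 1].
 * 'count' is width * height * number_of_frames. Illustrative sketch. */
static void scale_frames(const uint8_t *src, float *dst, size_t count)
{
    const float inv = 1.0f / 255.0f;   /* multiply instead of dividing per pixel */
    for (size_t i = 0; i < count; i++)
        dst[i] = (float)src[i] * inv;
}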

3.3 Normalization

Even though this section is entitled Normalization, a few more tasks are performed in this stage of the application, as shown by the blue rectangles in Figure 3.4. Here, wide arrows represent the flow of data, whereas dashed lines represent the order of execution. The numbers inside the small data arrows pointing towards the different tasks represent the number of frames used as input by each task. The dashed-line rectangle that encloses the normalization and texture 2 tasks indicates that there is no clear sequential execution between these two, but rather that they are executed in an alternating fashion. This type of diagram will prove particularly useful in Chapter 5 in order to explain the modifications that were made to the application to improve its performance. An example of the different frames that are produced in this stage is visualized in Figure 3.5. A brief description of each of the tasks involved in this stage follows.

Figure 3.4: Flow diagram of the normalization stage.

3.3.1 Normalization

The purpose of this stage is to extract the reflectivity component (texture information) from the camera frames, while aiming at enhancing the deformed illumination patterns in the resulting frame sequence. Figure 3.5a illustrates the result of this process. The deformed patterns are essential for the 3D reconstruction process.

In order to understand how this process takes place, we need to look back at Figure 3.2. Here, it is possible to observe that the projected patterns in the top row frames are equal to their corresponding frame in the bottom row, with the only difference being that the values of the projected pattern are inverted. For each corresponding pair, a new image frame is generated according to the following equation:

$$F_{\text{norm}}(x, y) = \frac{F_{\text{camera}}(x, y, a) - F_{\text{camera}}(x, y, b)}{F_{\text{camera}}(x, y, a) + F_{\text{camera}}(x, y, b)}$$

where a and b correspond to aligned top and bottom frames in Figure 3.2, respectively.

An example of the resulting frame sequence is shown in Figure 3.5a.

Figure 3.5: Example of the 18 frames produced in the normalization stage: (a) normalized frame sequence, (b) texture 2 frame sequence, (c) modulation frame, (d) texture 1 frame.

3.3.2 Texture 2

The calculation of the texture 2 frame sequence follows the same procedure as the one used to calculate the normalized frame sequence. In fact, the output of this process is an intermediate step in the calculation of the normalized frames, which is the reason why the two processes are said to be performed in an alternating fashion. The mathematical equation that describes the calculation of the texture 2 frame sequence is:

$$F_{\text{texture2}}(x, y) = F_{\text{camera}}(x, y, a) + F_{\text{camera}}(x, y, b)$$

The resulting frame sequence (Figure 3.5b) is used later in the global motion compensation stage.
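The following C sketch computes the normalized frame and the texture 2 frame in a single pass over one aligned frame pair, which also illustrates why the two tasks alternate: the sum used for texture 2 is the denominator of the normalized value. The buffer layout, names and division-by-zero guard are illustrative assumptions, not the scanner's actual code.

#include <stddef.h>

/* For one aligned pair of camera frames (pattern and inverted pattern),
 * compute the normalized frame (A - B) / (A + B) and the texture 2 frame
 * (A + B). Pixels are floats in [0, 1]; npixels = width * height.
 * Illustrative sketch only. */
static void normalize_pair(const float *frame_a, const float *frame_b,
                           float *normalized, float *texture2,
                           size_t npixels)
{
    for (size_t i = 0; i < npixels; i++) {
        float sum  = frame_a[i] + frame_b[i];
        float diff = frame_a[i] - frame_b[i];
        texture2[i] = sum;
        /* Guard against division by zero in completely dark pixels. */
        normalized[i] = (sum > 0.0f) ? diff / sum : 0.0f;
    }
}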


3.3.3 Modulation

The purpose of this stage is to find the range of measured values for each (x, y) pixel of the camera frame sequence, along the time dimension. This is done in two steps. First, two frames are generated by finding the maximum and minimum values along the time (t) dimension (Figure 3.6) for every (x, y) value in a frame.

Figure 3.6: Camera frame sequence in a coordinate system.

Second, a modulation frame is produced by finding the difference between the previously generated frames, i.e.:

$$F_{\text{mod}}(x, y) = F_{\text{max}}(x, y) - F_{\text{min}}(x, y)$$

Such modulation frame (Figure 3.5c) is required later during the decoding stage.
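A sketch of the modulation computation over the camera frame sequence, assuming the frames are stored as an array of frame pointers (an illustrative layout, not the application's actual data structure):

#include <stddef.h>

/* Compute the modulation frame as the per-pixel range (max - min) of the
 * camera frame sequence along the time dimension. frames[t] points to
 * npixels floats; nframes is the sequence length. Illustrative sketch. */
static void compute_modulation(const float *const *frames, int nframes,
                               size_t npixels, float *modulation)
{
    for (size_t i = 0; i < npixels; i++) {
        float minv = frames[0][i];
        float maxv = frames[0][i];
        for (int t = 1; t < nframes; t++) {
            float v = frames[t][i];
            if (v < minv) minv = v;
            if (v > maxv) maxv = v;
        }
        modulation[i] = maxv - minv;
    }
}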

3.3.4 Texture 1

Finally, the last task in the Normalization stage corresponds to the generation of the texture image that will be mapped onto the final 3D model. In contrast to the previous three tasks, this subprocess does not take the complete set of 16 camera frames as input, but only the two with the finest projection patterns. Figure 3.7 shows the four processing steps that are applied to the input in order to generate a texture image such as the one presented in Figure 3.5d.

Figure 3.7: Flow diagram for the calculation of the texture 1 image: average frames, gamma correction, 5×5 mean filter, histogram stretch.
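The first two steps of this pipeline, averaging the two finest-pattern frames and applying gamma correction, can be sketched as follows. The 8-bit look-up table and the example gamma value of 1/2.2 are assumptions made for illustration only (Chapter 5 describes a LUT-based NEON variant of this step), and the 5×5 mean filter and histogram stretch are omitted.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Average two 8-bit frames pixel-wise and apply gamma correction through a
 * precomputed 256-entry look-up table. Illustrative sketch; the subsequent
 * 5x5 mean filter and histogram stretch are not shown. */
static void texture1_average_gamma(const uint8_t *frame_a,
                                   const uint8_t *frame_b,
                                   uint8_t *out, size_t npixels,
                                   double gamma)
{
    uint8_t lut[256];
    for (int v = 0; v < 256; v++)
        lut[v] = (uint8_t)(255.0 * pow(v / 255.0, gamma) + 0.5);

    for (size_t i = 0; i < npixels; i++) {
        unsigned int avg = ((unsigned int)frame_a[i] + frame_b[i] + 1) / 2;
        out[i] = lut[avg];
    }
}

/* Example call with an assumed gamma of 1/2.2:
 * texture1_average_gamma(a, b, tex, width * height, 1.0 / 2.2); */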


3.4 Global motion compensation

The major drawback of time-multiplexing strategies is their high sensitivity to movement.

In fact, if no measures are taken to correct the slight amount of movement of the scanner or of the objects in the scene during the acquisition process, the complete reconstruction process fails. Although the global motion compensation stage is only a minor part of the mechanism that makes the entire application robust to motion, it is not negligible in the final result.

Global motion compensation is an extensive field of research to which many different approaches and methods have been contributed. The approach used in this application is amongst the simplest in terms of complexity. Nevertheless, it suffices for the needs of the current application.

Figure 3.8 presents an overview of the algorithm used to achieve the global motion compensation. This process takes as input the normalized frame sequence introduced in the previous section. As noted at the bottom of the figure, these steps are repeated for every pair of consecutive frames. As a first step, the pixels in each column are added for both frames. This results in two vectors that hold the column sums of each frame.

The second step is to determine by how many pixels the second image is displaced with respect to the first one. In order to achieve this, the sum of absolute differences (SAD) between elements of the two column-sum vectors is calculated while slowly displacing the two vectors with respect to each other. The result is a new vector containing the SAD value for each displacement. Subsequently, the index of the smallest element in the SAD vector is found in order to determine the number of pixels by which the second image needs to be shifted. The process concludes by performing the actual shift of the second frame.

Figure 3.8: Flow diagram for the global motion compensation process.
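The column-sum and SAD search described above can be sketched in C as follows. The function names, the search range and the normalization of the SAD over the overlapping region are illustrative choices; Chapter 5 describes how the application refines this stage (for instance by avoiding redundant column-sum calculations between consecutive pairs).

#include <math.h>
#include <stddef.h>

/* Sum the pixels of each column of a width x height float frame.
 * Applied to both frames of a pair before estimate_shift(). */
static void column_sums(const float *frame, int width, int height,
                        float *sums)
{
    for (int x = 0; x < width; x++) {
        float s = 0.0f;
        for (int y = 0; y < height; y++)
            s += frame[y * width + x];
        sums[x] = s;
    }
}

/* Return the displacement (in pixels) of frame B's column-sum vector with
 * respect to frame A's that minimizes the sum of absolute differences,
 * searched in [-max_shift, max_shift]. Illustrative sketch of the
 * displacement estimate used for global motion compensation. */
static int estimate_shift(const float *sums_a, const float *sums_b,
                          int width, int max_shift)
{
    int best_shift = 0;
    float best_sad = INFINITY;

    for (int shift = -max_shift; shift <= max_shift; shift++) {
        float sad = 0.0f;
        int count = 0;
        for (int x = 0; x < width; x++) {
            int xb = x + shift;
            if (xb < 0 || xb >= width)
                continue;
            sad += fabsf(sums_a[x] - sums_b[xb]);
            count++;
        }
        if (count > 0) {
            sad /= (float)count;   /* normalize for the varying overlap */
            if (sad < best_sad) {
                best_sad = sad;
                best_shift = shift;
            }
        }
    }
    return best_shift;
}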


3.5 Decoding

In Section 2.1.1 of the literature study, the correspondence problem was defined as the process of determining corresponding point pairs between the captured images and the projected patterns. This is exactly what is being accomplished during the decoding stage.

A novel approach has been implemented in which the identification of the projector stripes is based not on the values of the pixels themselves (as it is typically done), but rather on the edges formed by the transitions of the projected patterns. Figure 3.9 illustrates the different sets of decoded values that result with each of these methods.

Here, it is possible to observe that the pixel-based method produces a stair-casing effect due to the decoding of neighboring pixels that lie on the same stripe of the projected pattern. On the other hand, the edge-based method removes this undesirable effect by decoding values for only parts of the image in which a transition occurs. Furthermore, this approach enables sub-pixel accuracy for the determination of the positions where the transitions occur, meaning that the overall resolution of the 3D reconstruction increases considerably.

Figure 3.9: The stair-casing effect caused by pixel-based decoding is not present when edge-based decoding is used.
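One common way to obtain the sub-pixel position of a pattern transition, shown here purely as an illustration and not necessarily the exact method used in the application, is to linearly interpolate the zero crossing of the normalized signal between two neighboring pixels:

/* Given the normalized values of two neighboring pixels v0 (at position y)
 * and v1 (at position y + 1) with opposite signs, return the sub-pixel
 * position of the zero crossing, i.e. of the pattern transition.
 * Illustrative sketch of sub-pixel edge localization. */
static double subpixel_edge(double y, double v0, double v1)
{
    /* Linear interpolation: v0 + t * (v1 - v0) = 0  =>  t = v0 / (v0 - v1). */
    return y + v0 / (v0 - v1);
}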

The decoding process results in a set of vertices, each one associated with a depth code.

Note, however, that the unit of measurement used to describe the position and depth of each vertex is based on camera pixels and code values, respectively, meaning that these vertices still do not represent the actual geometry of the face. The calibration process, explained in a later section, is the part of the application that translates the pixel and code values to standard units (such as millimeters), thus recreating the actual shape of the human face.

3.6 Tessellation

Tessellation refers to the process of covering a plane with geometric shapes in such a way that no overlaps occur. In computer graphics, these geometric shapes are generally chosen to be triangles, also called "faces". The reason for using triangles is that their vertices, by definition, lie on the same plane. This, in turn, avoids the generation of non-simple or non-convex polygons, which are not guaranteed to be rendered correctly. A complete example illustrating this point can be found in [32].

A set of 3D vertices calculated in the decoding stage is the input to the tessellation process. Here, however, the third dimension does not play a role, and hence the z coordinate of each vertex can be thought of as being equal to 0. This implies that the new set of vertices consists only of (x, y) coordinates that lie on the same plane, as shown in Figure 3.10a. This graph corresponds to a very close view of the nose area in the reconstructed face example.

(a) Vertices before applying the Delaunay triangulation. (b) Result after applying the Delaunay triangulation.

Figure 3.10: Close view of the vertices in the nose area before and after the tessellation process.

The question that arises here is how to connect the vertices in such a way that the complete surface is covered with triangles. The answer is to use the Delaunay triangulation, which is probably the most common triangulation used in computer vision. The main advantage it has over other methods is that the Delaunay triangulation avoids "skinny" triangles, reducing potential numerical precision problems [33]. Moreover, the Delaunay triangulation is independent of the order in which the vertices are processed.


Figure 3.10b shows the result of applying the Delaunay triangulation to the vertices shown in Figure 3.10a.

Although there exist a number of different algorithms for computing the Delaunay triangulation, the final outcome of each conforms to the following definition: a Delaunay triangulation for a set P of points in a plane is a triangulation DT(P) such that no point in P is inside the circumcircle of any triangle in DT(P) [33]. This definition can be understood by examining Figure 3.11.


Figure 3.11: The Delaunay tessellation with all the circumcircles and their centers [33].
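To make the empty-circumcircle definition concrete, the sketch below implements the classical in-circle test for a single triangle; a triangulation is Delaunay precisely when no point of the set lies strictly inside the circumcircle of any of its triangles. This only illustrates the criterion and is not the triangulation algorithm used in the application; the type and function names are assumptions.

```c
/* In-circle test: point d lies inside the circumcircle of the
 * counter-clockwise triangle (a, b, c) when the determinant is positive. */
typedef struct { double x, y; } point_t;

int in_circumcircle(point_t a, point_t b, point_t c, point_t d)
{
    double ax = a.x - d.x, ay = a.y - d.y;
    double bx = b.x - d.x, by = b.y - d.y;
    double cx = c.x - d.x, cy = c.y - d.y;

    double det = (ax * ax + ay * ay) * (bx * cy - by * cx)
               - (bx * bx + by * by) * (ax * cy - ay * cx)
               + (cx * cx + cy * cy) * (ax * by - ay * bx);

    return det > 0.0;   /* assumes (a, b, c) is ordered counter-clockwise */
}
```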

3.7 Calibration

The set of (x, y) vertices and corresponding depth code values that results from the decoding process is not expressed in standard units of measure; it still has to be translated into standard units such as millimeters. This is precisely the objective of the calibration process.

The calibration mechanism used in the application is based on the work of Peter-Andre Redert as part of his PhD thesis [31]. The entire process is divided into two parts: an offline and an online process. Moreover, the offline process consists of two stages, the camera calibration and the system calibration. It is important to clarify that while the offline process is performed only once (camera properties and distances within the system do not change with every scan), the online process is carried out for every scan instance. The calibration stage referred to in Figure 3.1 is the latter.


3.7.1 Offline process

As already mentioned, the offline process comprises the two stages described below.

Camera calibration: This part of the process is concerned with the calculation of the intrinsic parameters of the camera, as explained in Section 2.2 of the literature study. In short, the objective is to precisely quantify the optical properties of the camera. The current approach accomplishes this by imaging the special calibration chart shown in Figure 3.12 from different orientations and distances. After corresponding markers in the different images are found, an algorithm searches for the optimal set of camera parameters for which triangulation of all corresponding marker-point pairs gives an accurate reconstruction of the calibration chart.

Figure 3.12: The calibration chart used to determine the intrinsic parameters of a camera and the extrinsic parameters of a projector-scanner system. All absolute dimensions and photometric properties of the round markers are known precisely.

System calibration: The second part of the calibration process refers to the camera-projector system calibration, i.e. the determination of the extrinsic parameters of the system. Again, this part of the process images the calibration chart from different distances. However, this time, structured light patterns are emitted by the projector while the acquisition process takes place. The result is that each projector code is associated with a known depth and camera position.

3.7.2 Online process

The result of the offline calibration is a set of parameters that model the optical properties of the scanner system. These are passed to the application inside the XML file for every scan. Such parameters represent the coefficients of a fifth-order polynomial used for translating the set of (x, y) vertices with their corresponding depth code values into standard units of measure. In other words, the online process consists of evaluating a polynomial with all the x, y and depth code values calculated in the decoding stage in order to reconstruct the geometry of the face. Figure 3.13 shows the state of the 3D model before and after the reconstruction process.
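The sketch below illustrates the flavor of this online step. Since the exact form of the fifth-order polynomial and its coefficient layout are not reproduced here, a hypothetical per-vertex polynomial in the depth code, evaluated with Horner's scheme, is shown purely as an example; the structures, names and coefficient layout are assumptions.

```c
/* Hypothetical sketch of the online calibration step: map a decoded vertex
 * (pixel coordinates plus depth code) to millimeters by evaluating a
 * fifth-order polynomial. The real mapping combines x, y and the code;
 * only the polynomial evaluation itself is illustrated here. */
typedef struct { double x, y, code; } raw_vertex_t;
typedef struct { double x, y, z; } mm_vertex_t;

mm_vertex_t to_millimeters(raw_vertex_t v, const double coef[6])
{
    /* Horner evaluation: z = c5*code^5 + c4*code^4 + ... + c0 */
    double z = coef[5];
    for (int i = 4; i >= 0; i--)
        z = z * v.code + coef[i];

    mm_vertex_t out = { v.x, v.y, z };   /* x, y scaling omitted in this sketch */
    return out;
}
```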

(a) Before reconstruction. (b) After reconstruction.

Figure 3.13: The 3D model before and after the calibration process.

3.8 Vertex filtering

As can be seen from Figure 3.13b, there are a number of extra vertices (and faces) that have not been correctly reconstructed and, therefore, should be removed from the model. Vertex filtering is applied to remove all these noisy vertices and faces based on different criteria. The process is divided into the following three steps.

3.8.1 Filter vertices based on decoding constraints

First, if the distance between consecutive decoded points is larger than a maximum threshold in the x or z dimension, then these points are removed. Second, in order to avoid falsely decoded vertices due to camera noise (especially in the parts of the images where the light does not hit directly), a minimal modulation threshold needs to be exceeded or else the associated decoded point is discarded. Finally, if the decoded vertices lie outside a margin defined in accordance with the image dimensions, then these are removed as well.
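A minimal sketch of these three checks is given below; the data structure, threshold names and their values are placeholders, and only the checks themselves follow the description above.

```c
#include <math.h>
#include <stdbool.h>

/* Placeholder representation of a decoded point. */
typedef struct { double x, z, modulation; } decoded_point_t;

/* Returns true when the current point survives all three decoding constraints. */
bool keep_point(decoded_point_t prev, decoded_point_t cur,
                double max_dx, double max_dz, double min_modulation,
                double margin, double image_width)
{
    if (fabs(cur.x - prev.x) > max_dx || fabs(cur.z - prev.z) > max_dz)
        return false;                  /* jump between consecutive points */
    if (cur.modulation < min_modulation)
        return false;                  /* too little projected light (camera noise) */
    if (cur.x < margin || cur.x > image_width - margin)
        return false;                  /* outside the image margin */
    return true;
}
```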


3.8.2 Filter vertices outside the measurement range

The measurement range, defined during the offline calibration, refers to the minimum and maximum values that each decoded point can have in the z dimension. These values are read from the XML file. The long triangles shown in Figure 3.13b that either extend far into the picture or come very close to the camera are all removed in this stage. The resulting 3D model after being filtered with the two previously described criteria is shown in Figure 3.14a.

3.8.3 Filter vertices based on a maximum edge length

Several steps are involved in the removal of vertices based on the maximum edge length criterion. Initially, the length of every edge contained in the model is calculated. This is followed by determining a new set of edges L that contains the longest edge of each face. After this operation, the mean length of the longest-edge set is calculated. Finally, only faces whose longest edge is shorter than seven times this mean value, i.e. L < 7 × mean(L), are kept. Figure 3.14b shows the result after this operation.
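The following sketch captures this criterion. The mesh data structures and function names are assumptions, but the decision rule (keep a face only when its longest edge is shorter than seven times the mean of all longest edges) follows the description above.

```c
#include <math.h>
#include <stdlib.h>

typedef struct { double x, y, z; } vertex_t;
typedef struct { int v[3]; } face_t;      /* indices into the vertex array */

static double edge_length(vertex_t a, vertex_t b)
{
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return sqrt(dx * dx + dy * dy + dz * dz);
}

/* Marks the faces to keep (keep[i] = 1) and returns how many are kept. */
int filter_faces(const vertex_t *verts, const face_t *faces, int n_faces, int *keep)
{
    if (n_faces == 0)
        return 0;

    double *longest = malloc(n_faces * sizeof(double));
    double mean = 0.0;

    /* Longest edge of every face and the mean of those lengths. */
    for (int i = 0; i < n_faces; i++) {
        double e0 = edge_length(verts[faces[i].v[0]], verts[faces[i].v[1]]);
        double e1 = edge_length(verts[faces[i].v[1]], verts[faces[i].v[2]]);
        double e2 = edge_length(verts[faces[i].v[2]], verts[faces[i].v[0]]);
        longest[i] = fmax(e0, fmax(e1, e2));
        mean += longest[i];
    }
    mean /= n_faces;

    /* Keep a face only if its longest edge is below 7 times the mean. */
    int kept = 0;
    for (int i = 0; i < n_faces; i++) {
        keep[i] = longest[i] < 7.0 * mean;
        kept += keep[i];
    }

    free(longest);
    return kept;
}
```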

(a) The 3D model after the filtering steps described in Subsections 3.8.1 and 3.8.2. (b) The 3D model after the filtering step described in Subsection 3.8.3. (c) The 3D model after the filtering step described in Section 3.9.

Figure 3.14: 3D resulting models after various filtering steps.

3.9 Hole filling

In the last processing step of the 3D face scanner application, two actions are performed. The first one is concerned with an algorithm that takes care of filling undesirable holes that appear due to the removal of vertices and faces that were part of the face surface. This is accomplished by adding a vertex in the middle of the hole and then connecting every surrounding edge with this point. The second action refers to another filtering step of
