3D estimation of a salient object

N/A
N/A
Protected

Academic year: 2021

Share "3D estimation of a salient object"

Copied!
23
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


Control Engineering

Klaas Jan Russcher

BSc assignment

Supervisors:

prof.dr.ir. S. Stramigioli
prof.dr.ir. A. de Boer
dr.ir. F. van der Heijden

September 2011
Report nr. 019CE2011
Control Engineering
EE-Math-CS
University of Twente
P.O. Box 217
7500 AE Enschede
The Netherlands


Contents

1 Introduction

2 Saliency Mapping
  2.1 Saliency in computer vision
  2.2 Saliency tests
  2.3 Conclusion

3 3D reconstruction
  3.1 Rectification
  3.2 LinMMSE estimation
  3.3 Optimal triangulation
  3.4 Parameter error analysis
      3.4.1 Monte Carlo analysis
      3.4.2 Parameter sweep
  3.5 Conclusion

4 3D estimation of a salient object
  4.1 NAO
  4.2 Obtain second image
  4.3 Coordinate system transformation
      4.3.1 Matching the coordinate systems
      4.3.2 Orientate coordinate system to camera 1
  4.4 3D estimation of a salient object
      4.4.1 Test setup
      4.4.2 Results

5 Conclusions and recommendations


Chapter 1

Introduction

Figure 1.1: Humanoid NAO.

In 2008, Rob Reilink developed a saliency algorithm for a humanoid head with stereo vision [7]. This report attempts to implement that saliency algorithm on the NAO humanoid (see figure 1.1). Because NAO does not have stereo vision, the 3D position of a salient object is instead estimated with 3D reconstruction.

In chapter 2 the theory of saliency mapping is explained. Saliency mapping is tested on some images to show under what conditions it works best.

Chapter 3 treats the distance estimation, which in this report is called 3D reconstruction. Three methods of 3D reconstruction are explained (rectification, linMMSE estimation and optimal triangulation). A Monte Carlo analysis and a parameter sweep are done to determine the least error prone method.

The saliency mapping and 3D reconstruction are combined in chapter 4. It is explained how the two images that are needed for 3D reconstruction are obtained. The coordinate systems of the saliency mapping, 3D reconstruction and NAO are matched and the algorithm is tested. Chapter 5 draws the conclusions and makes recommendations.


Chapter 2

Saliency Mapping

Saliency mapping is a bottom-up visual attention system in which primal instincts determine the interesting points. The neurobiological model of saliency mapping was introduced by Koch and Ullman in 1985 [5]. In 1998 Itti, Koch and Niebur described how saliency mapping can be implemented in computer vision [4]. Section 2.1 explains how saliency mapping in computer vision works. In section 2.2 saliency mapping is tested on some images to show how well it works. The results of the test images are discussed in the conclusion (section 2.3).

2.1 Saliency in computer vision

Itti, Koch and Niebur derived an algorithm for saliency mapping in [4]. The general architecture of saliency mapping is displayed in figure 2.1, and the algorithm will be explained using this figure. The first step in figure 2.1 is the linear filtering of the input image. It is only filtered for color and intensity; filtering for orientation requires a filter that is not in the OpenCV vision library.

Creating this filter would cost too much time and is therefore beyond the scope of this project. Reilink showed in [7] that saliency mapping works well without orientation filtering.

Figure 2.1: Architecture of saliency mapping [4].

The algorithm starts with a 640x480 RGB input image. This image is split into three new images: red (r), green (g) and blue (b). Intensity map I is created by adding these three images element-wise and dividing each element by three. The red, green and blue images are normalized by the intensity to decouple hue from intensity. Four color maps are created:

R = r − (g + b)/2,   G = g − (r + b)/2,   B = b − (r + g)/2,   Y = (r + g)/2 − |r − g|/2 − b.
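
As an illustration, a minimal Python sketch of this step is given below. It assumes a BGR uint8 frame as delivered by OpenCV; the function name and the lower bound on the normaliser are illustrative choices, not taken from the implementation used in this project.

```python
import cv2
import numpy as np

def intensity_and_color_maps(bgr):
    b, g, r = [c.astype(np.float32) for c in cv2.split(bgr)]
    I = (r + g + b) / 3.0                        # intensity map

    # normalise hue by intensity (floor avoids division by zero in dark pixels)
    norm = np.maximum(I, 1.0)
    r, g, b = r / norm, g / norm, b / norm

    R = r - (g + b) / 2.0                        # red
    G = g - (r + b) / 2.0                        # green
    B = b - (r + g) / 2.0                        # blue
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b  # yellow
    return I, R, G, B, Y
```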

For each of the five maps a Gaussian pyramid of scale 8 is created (the source image is scale 0).

These pyramids are used for the calculation of the center-surround differences (denoted by the symbol ⊖). The higher the scale of a Gaussian pyramid, the more smoothed the image is, which filters the small differences away. If an image at a low scale of the Gaussian pyramid (center) and an image at a higher scale of the Gaussian pyramid (surround) are compared, the contrast between the center and its surround becomes visible.

For the intensity I(c, s) six maps are created, with c = {2, 3, 4} and s = c + {3, 4}. The center-surround difference is the absolute difference between two scales of the Gaussian pyramid:

I(c, s) = |I(c) ⊖ I(s)|    (2.1)

For the color maps it is not just the center-surround difference of each color. Because the human eye is sensitive to color pairs, this is simulated. The color pairs the human eye is sensitive to are green/red, red/green, blue/yellow and yellow/blue. The center-surround differences that are created for the color maps are RG(c, s) and BY(c, s):

RG(c, s) = |(R(c) − G(c)) ⊖ (G(s) − R(s))|    (2.2a)
BY(c, s) = |(B(c) − Y(c)) ⊖ (Y(s) − B(s))|    (2.2b)

All the center-surround differences have now been created.
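
The sketch below illustrates how the Gaussian pyramids and the center-surround differences (2.1) and (2.2) could be computed, assuming the ⊖ operation is implemented by resizing the coarser (surround) level to the resolution of the finer (center) level and taking the point-wise absolute difference; the helper names are illustrative.

```python
import cv2
import numpy as np

def gaussian_pyramid(img, levels=9):
    # scale 0 is the source image, scale 8 the coarsest level
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def center_surround(center, surround):
    # bring the surround level back to the center resolution and subtract
    surround = cv2.resize(surround, (center.shape[1], center.shape[0]))
    return np.abs(center - surround)

def feature_maps(pyr_I, pyr_R, pyr_G, pyr_B, pyr_Y):
    i_maps, c_maps = [], []
    for c in (2, 3, 4):
        for s in (c + 3, c + 4):
            i_maps.append(center_surround(pyr_I[c], pyr_I[s]))                         # (2.1)
            c_maps.append(center_surround(pyr_R[c] - pyr_G[c], pyr_G[s] - pyr_R[s]))   # RG (2.2a)
            c_maps.append(center_surround(pyr_B[c] - pyr_Y[c], pyr_Y[s] - pyr_B[s]))   # BY (2.2b)
    return i_maps, c_maps   # 6 intensity maps and 12 colour maps
```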

There is still a problem, however: when an image has many high-intensity spots and just one weak red spot, the intensity spots will suppress the red spot. Therefore a map normalization operator (Λ) is created. This operator promotes maps with few peaks and suppresses maps with many peaks. The map normalization operator consists of 3 steps:

1. Scale the map to a maximum M. In case of an 8-bit image, M will typically be 255.

2. Find all the local maxima of the map and calculate the average m̄ over these maxima.

3. Scale the map with (M − m̄)²/M².
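
A sketch of the map normalization operator Λ; finding the local maxima with a 3x3 dilation is an assumption, since the report does not specify how the maxima are detected.

```python
import cv2
import numpy as np

def normalize_map(m, M=255.0):
    # 1. scale the map so that its maximum becomes M
    m = m.astype(np.float32)
    if m.max() > 0:
        m = m * (M / m.max())

    # 2. average value m_bar over the local maxima of the map
    dilated = cv2.dilate(m, np.ones((3, 3), np.uint8))
    maxima = m[(m == dilated) & (m > 0)]
    m_bar = maxima.mean() if maxima.size else 0.0

    # 3. promote maps with few strong peaks, suppress maps with many peaks
    return m * ((M - m_bar) ** 2) / (M ** 2)
```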

As shown in figure 2.1 there are now 12 color maps (6 RG and 6 BY) and 6 intensity maps. These maps are combined into two "conspicuity maps", one for the intensity (Ī) and one for the color (C̄). The maps are scaled to scale 4 of the Gaussian pyramid and a per-element addition then results in a conspicuity map. The saliency map is constructed from the two conspicuity maps:

S = (Λ(Ī) + Λ(C̄)) / 2    (2.3)

The final step in the saliency mapping algorithm used in this project is the selection of the most salient location. This is done by the winner-takes-all principle: the point on the saliency map with the highest value is the focus of attention.
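
A sketch of the combination step (2.3) and the winner-takes-all selection, reusing the normalize_map() helper sketched above; shapes and names are illustrative.

```python
import cv2
import numpy as np

def saliency_map(i_maps, c_maps, target_shape):
    h, w = target_shape
    resize = lambda m: cv2.resize(m, (w, h))      # bring every map to scale 4

    I_bar = sum(resize(m) for m in i_maps)        # intensity conspicuity map
    C_bar = sum(resize(m) for m in c_maps)        # colour conspicuity map
    return 0.5 * (normalize_map(I_bar) + normalize_map(C_bar))   # (2.3)

def focus_of_attention(S):
    # winner takes all: pixel with the highest saliency value
    y, x = np.unravel_index(np.argmax(S), S.shape)
    return x, y
```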

The original saliency mapping algorithm also has an inhibition of return step (see figure 2.1). The inhibition of return suppresses the focus of attention for a couple of hundred milliseconds and prevents the algorithm from staring at the same object. This step is not used in this project because it could affect the goal of selecting the same object in two different images.

2.2 Saliency tests

Figure 2.2a is a test image of colored balloons in a bright blue sky. Figure 2.2b is the resulting saliency map. The two biggest balloons have approximately the same size, one is red and the other is blue. But it is clear that the big red balloon attracts the most attention due to the contrast with the bright blue sky.

(a) (b)

Figure 2.2: Balloon test image (a) and the resulting saliency map (b).


Figure 2.3a is an image of the Control Engineering lab. The handle of a red screwdriver that is held by someone is visible in the image. The red color of the screwdriver contrasts well with its background.

This can be seen in the resulting saliency map (figure 2.3b).

(a) (b)

Figure 2.3: Test image with red screwdriver (a) and the resulting saliency map (b).

The last test image is shown in figure 2.4a. It is also an image of the Control Engineering lab, but now without a salient object. The result for an image without a salient object is shown in figure 2.4b.

(a) (b)

Figure 2.4: Test image of the Control Engineering lab (a) and the resulting saliency map (b).

2.3 Conclusion

Figure 2.2b shows that saliency mapping works for an image with big salient objects and a background that consists of a plain color. When the background gets more complicated and the salient object gets smaller, the saliency mapping tends to keep working well (figure 2.3b). When there are no salient objects the saliency mapping tries to spot something, but the results are useless (figure 2.4b).

So saliency mapping easily spots salient objects in scenes where the objects contrast well with their background. But when there are no contrasting objects, the outcome of the saliency mapping is unpredictable.

This is something to keep in mind, because when the localization error is too large it is also possible that the object does not contrast with its background well enough to be detected.


Chapter 3

3D reconstruction

In this chapter the subject of 3D reconstruction is treated. 3D reconstruction recovers the 3D coordinates of an object that is seen by two cameras at different positions. In a situation without noise it is not difficult to reconstruct the coordinates of an object, but most of the time the image lines will not cross each other in 3D space due to noise, so a more advanced method is needed to reconstruct an object's 3D coordinates in this case. There are various methods to reconstruct a 3D point from two noisy image points. The methods covered here are the rectification method, the linMMSE estimation (both suggested by [9]) and the optimal triangulation method.

First the three methods of 3D reconstruction are explained (sections 3.1 to 3.3). Then a parameter error analysis is done to estimate which of the three reconstruction methods is the least sensitive to parameter errors (section 3.4). In the conclusion at the end of this chapter the results of the parameter error analysis are discussed and the least error prone reconstruction method is chosen for use on NAO (section 3.5).

The most common symbols that are used in this chapter, with their meaning, are displayed in table 3.1.

Table 3.1: Definitions of parameters

Symbol              Definition
CCS                 Coordinate system attached to camera 1
CCS'                Coordinate system attached to camera 2
x = [x, y]          Image coordinates of the point in image 1
x' = [x', y']       Image coordinates of the point in image 2
X = [X, Y, Z]       3D point in CCS space
X' = [X', Y', Z']   3D point in CCS' space
R                   Rotation matrix to convert CCS to CCS'
t                   Translation vector to translate CCS to CCS'
K                   Calibration matrix of camera 1
K'                  Calibration matrix of camera 2
P = K[I|0]          Projection matrix of camera 1
P' = K'[R|t]        Projection matrix of camera 2
D                   Focal distance

3.1 Rectification

The rectification method ([9], [2]) is the simplest of the three 3D reconstruction methods. It virtually manipulates the images in such a way that image 1 and image 2 get identical orientations. This means that a point of interest on an imaginary horizontal line in image 1 is also on that same imaginary horizontal line in image 2, only at a different position along that line.


This makes the reconstruction of a 3D point much easier.

So the only difference between these two images is that image 2 is shifted a certain distance in the x–direction in comparison to image 1. When the images are rectified, triangulation [3] determines the 3D point X.

Set t' = −R^T t. The rotation matrix used to rotate both images has the form of formula (3.1a), with elements as in formula (3.1b):

R_rect = [r_1^T; r_2^T; r_3^T]    (3.1a)

r_1 = t'/|t'|,   r_2 = [−r_1(2), r_1(1), 0]^T / sqrt(r_1(1)² + r_1(2)²),   r_3 = r_1 × r_2    (3.1b)

With R_rect calculated, the images can be rectified (3.2):

x_rect = K R_rect K^-1 x,    x'_rect = K' R_rect R K'^-1 x'    (3.2)

The disparity is the difference between x_rect and x'_rect. It is used to determine the coordinates of the 3D point with the formulas in (3.3):

Z_rect = ||t'|| D / disparity,    X_rect = x_rect Z_rect / D,    Y_rect = y_rect Z_rect / D    (3.3)

This gives a point X_rect, but that point lies in the rectified coordinate system and not in the CCS coordinate system. X = R_rect^-1 X_rect transfers X_rect back to CCS coordinates.
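
A minimal sketch that transcribes formulas (3.1)-(3.3) literally; the inputs (calibration matrices, relative pose, homogeneous image points and focal distance D) and all variable names are assumptions for illustration.

```python
import numpy as np

def reconstruct_rectified(x1, x2, K, K2, R, t, D):
    t_p = -R.T @ t                                    # t'
    r1 = t_p / np.linalg.norm(t_p)
    r2 = np.array([-r1[1], r1[0], 0.0]) / np.hypot(r1[0], r1[1])
    r3 = np.cross(r1, r2)
    R_rect = np.vstack((r1, r2, r3))                  # (3.1a)

    x1r = K @ R_rect @ np.linalg.inv(K) @ x1          # (3.2)
    x2r = K2 @ R_rect @ R @ np.linalg.inv(K2) @ x2
    x1r, x2r = x1r / x1r[2], x2r / x2r[2]             # back to pixel coordinates

    disparity = x1r[0] - x2r[0]
    Z = np.linalg.norm(t_p) * D / disparity           # (3.3)
    X = x1r[0] * Z / D
    Y = x1r[1] * Z / D
    return np.linalg.inv(R_rect) @ np.array([X, Y, Z])   # back to CCS coordinates
```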

3.2 LinMMSE estimation

The reconstruction with the linear minimum mean squared error (linMMSE) estimation is described in [6] and [9]. It is derived from Kalman filtering, which is used to track a moving point; the linMMSE estimation is the update step of the Kalman filter. The advantage of this method is that it not only reconstructs a point X, but also constructs a covariance matrix. With the covariance matrix one is able to say something about the precision of the reconstructed point X.

The linMMSE estimation starts with the construction of a vector z and a matrix H (3.4). z_1 and z_2 are the positions of the point of attention in image 1 and image 2 respectively. H_1 and H_2 are the camera calibration matrices of camera 1 and camera 2, but without the last row.

z = [z_1; z_2],    H = [H_1; H_2]    (3.4)

The covariance matrices of point X and point X' are also needed for the linMMSE estimation. These can be calculated with formula (3.5) (I_2 is a 2x2 identity matrix). In this formula there is an X̂: an initial estimate of where the point of attention could be, which for this application is the middle of the working space. σ is the uncertainty of point X̂ and is taken as the distance of X̂ to the edge of the working space; it is used to construct C_x = σ² I_3. For the calculations the two covariance matrices are put into one covariance matrix (3.5):

C(X̂) = Z² σ² I_2,    C_n(X̂) = [ C_1(X̂)  0 ;  0  C_2(X̂') ]    (3.5)

The first estimate has three steps. First the innovation matrix S is calculated (3.6a). The second step is to calculate the linMMSE gain K_a (3.6b). The third step calculates the new X (3.6c); this is called the Kalman update.

S = H C_x H^T + C_n(X̂)    (3.6a)

K_a = C_x H^T S^-1    (3.6b)

X_first = X + K_a (z − H X)    (3.6c)

The second estimate has four steps. The first three steps (3.7) are essentially the same as the first three steps of the first estimate. The last step calculates the new covariance matrix (3.8).

S = H C_x H^T + C_n(X_first)    (3.7a)

K_a = C_x H^T S^-1    (3.7b)

X_second = X + K_a (z − H X)    (3.7c)

C_x,new = C_x − K_a S K_a^T    (3.8)

The reconstructed X is X_second and the covariance matrix is C_x,new.
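
A sketch of the two-pass linMMSE update described above. Here z is the stacked measurement vector, H the stacked measurement matrix, X0 the initial estimate (the middle of the working space), sigma its uncertainty and Cn_of a callable that returns C_n(·) of (3.5); all names are illustrative.

```python
import numpy as np

def linmmse_estimate(z, H, X0, sigma, Cn_of):
    Cx = sigma ** 2 * np.eye(3)                      # C_x = sigma^2 I_3

    # first estimate (3.6)
    S = H @ Cx @ H.T + Cn_of(X0)
    Ka = Cx @ H.T @ np.linalg.inv(S)
    X_first = X0 + Ka @ (z - H @ X0)

    # second estimate and covariance update (3.7), (3.8)
    S = H @ Cx @ H.T + Cn_of(X_first)
    Ka = Cx @ H.T @ np.linalg.inv(S)
    X_second = X0 + Ka @ (z - H @ X0)
    Cx_new = Cx - Ka @ S @ Ka.T
    return X_second, Cx_new
```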

3.3 Optimal triangulation

The method of optimal triangulation described here is the one that is proposed in [3]. The method is based on the maximum-likelihood estimation (MLE). Therefore the assumption is made that the noise that causes the uncertainty of the 3D point has a Gaussian distribution.

This method uses the obtained corresponding points of the two images (x and x' in homogeneous coordinates) and the fundamental matrix F. The fundamental matrix relates points in image 1 to corresponding points in image 2 such that x'^T · F · x = 0. It is calculated from the camera matrices of both cameras (K and K'), the rotation matrix R and the translation vector t, as displayed in formula (3.9):

F = K'^-T · [t]_× · R · K^-1    (3.9)

[t]_× is the skew-symmetric matrix constructed from the translation vector t.
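
A small helper for (3.9); a minimal sketch with illustrative names.

```python
import numpy as np

def skew(t):
    # skew-symmetric matrix [t]_x of the translation vector t
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    # F = K'^-T [t]_x R K^-1, formula (3.9)
    return np.linalg.inv(K2).T @ skew(t) @ R @ np.linalg.inv(K1)
```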

Multiplying F with x gives an epipolar line l' in image 2, and multiplying the transpose of F with x' gives an epipolar line l in image 1. When there are no errors, point x lies on epipolar line l and point x' lies on line l'. In practice this is never the case.

The optimal triangulation method tries to minimize the cost function. The cost function is the sum of the squared perpendicular distance between the image point x and the epipolar line l in image 1 and the squared perpendicular distance between the image point x’ and the epipolar line l’ in image 2.

In [3] the optimal triangulation method is presented as an 11-step algorithm. Here the same structure will be used to explain the algorithm. The first 5 steps are used to put x, x’ and F in such a form that the cost function can be easily calculated.

1. The first step is to define two matrices V and V' (3.10). These matrices take the points x and x' to the origin of their coordinate systems; x, y, x' and y' are the coordinates of the points x and x'. After these transformations x and x' are both at [0, 0, 1]^T of the new coordinate system.

V = | 1  0  −x |        V' = | 1  0  −x' |
    | 0  1  −y |             | 0  1  −y' |
    | 0  0   1 |             | 0  0   1  |    (3.10)

2. Replace the fundamental matrix to make it compatible with the new coordinate system:

F = V'^-T · F · V^-1    (3.11)

3. The left and right epipoles are calculated with F. The right epipole e is the null space of F and the left epipole e' is the null space of the transposed F. The epipoles have to be normalized so that e_1² + e_2² = 1 and e'_1² + e'_2² = 1, and both epipoles are multiplied by the sign of their last element.


4. Now form the rotation matrices W and W':

W = |  e_1  e_2  0 |        W' = |  e'_1  e'_2  0 |
    | −e_2  e_1  0 |             | −e'_2  e'_1  0 |
    |  0    0    1 |             |  0     0     1 |    (3.12)

5. Replace the fundamental matrix to adjust it to the new coordinate system:

F = W'^-T · F · W^-1    (3.13)

The cost function is given by formula (3.14). Steps 6 to 9 evaluate the cost function and so determine the image points that give the smallest error.

S(t) = t² / (1 + f²t²) + (ct + d)² / ((at + b)² + f'²(ct + d)²)    (3.14)

6. Step 5 results in a fundamental matrix of the form of formula (3.15). a, b, c and d are read from this matrix; f and f' are e_3 and e'_3 respectively.

F = |  f·f'·d  −f'·c  −f'·d |
    | −f·b      a      b    |
    | −f·d      c      d    |    (3.15)

7. g(t) (3.16) is the numerator of the derivative of S(t) (3.14). Fill in the variables a, b, c, d, f and f' and write g(t) as a polynomial in t. Solve g(t) = 0 to get the roots.

g(t) = t((at + b)² + f'²(ct + d)²)² − (ad − bc)(1 + f²t²)²(at + b)(ct + d)    (3.16)

8. Evaluate the cost function S(t) at each of the roots of g(t). If there are complex roots, only evaluate the real part of those roots. Also evaluate S(∞) (3.17). Select the t for which the cost function has the smallest value as t_min.

S(∞) = 1/f² + c² / (a² + f'²c²)    (3.17)

9. Construct the two epipolar lines l = [t_min f, 1, −t_min]^T and l' = [−f'(c t_min + d), a t_min + b, c t_min + d]^T.

The point on an epipolar line l = [λ, µ, ν]^T that is closest to the origin is x = [−λν, −µν, λ² + µ²]^T. Construct x̂ and x̂' with this formula.

The last two steps transform the image points back to the original coordinate system and point X is determined.

10. Replace x̂ and x̂' with x̂ = V^-1 W^T x̂ and x̂' = V'^-1 W'^T x̂' respectively.

11. To complete the optimal triangulation method, point X̂ has to be calculated. This is done with the use of matrix A (3.18), where P^{i,T} denotes the i-th row of the projection matrix P. Take the singular value decomposition A = U D V^T; X̂ is the last column of V (the last row of V^T). X̂ is a homogeneous vector and therefore its last element has to be one, so divide X̂ by its last element to get a homogeneous vector.

A = | x̂ · P^{3,T} − P^{1,T}     |
    | ŷ · P^{3,T} − P^{2,T}     |
    | x̂' · P'^{3,T} − P'^{1,T}  |
    | ŷ' · P'^{3,T} − P'^{2,T}  |    (3.18)
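
Instead of transcribing the eleven steps, the same Hartley-Zisserman algorithm can be sketched with OpenCV: cv2.correctMatches() moves x and x' onto consistent epipolar lines (steps 1-10) and cv2.triangulatePoints() solves the homogeneous system of (3.18). The helper fundamental_matrix() is the sketch given after (3.9); the input shapes and names are illustrative assumptions.

```python
import cv2
import numpy as np

def optimal_triangulation(x1, x2, K1, K2, R, t):
    F = fundamental_matrix(K1, K2, R, t)                   # formula (3.9)
    P1 = K1 @ np.hstack((np.eye(3), np.zeros((3, 1))))     # P  = K [I | 0]
    P2 = K2 @ np.hstack((R, t.reshape(3, 1)))              # P' = K'[R | t]

    pts1 = np.asarray(x1, dtype=np.float64).reshape(1, 1, 2)
    pts2 = np.asarray(x2, dtype=np.float64).reshape(1, 1, 2)
    pts1, pts2 = cv2.correctMatches(F, pts1, pts2)         # corrected x-hat, x'-hat

    Xh = cv2.triangulatePoints(P1, P2, pts1.reshape(2, 1), pts2.reshape(2, 1))
    return (Xh[:3] / Xh[3]).ravel()                        # divide by the last element
```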


3.4 Parameter error analysis

As said in this chapter's introduction, the image points will be affected by noise. The rotation matrix R and the translation vector t describe the shift from the camera position in the initial robot configuration to the camera position after the robot has moved its head to another configuration. This rotation matrix and translation vector have uncertainties due to errors in NAO's sensors. The image points x and x' have uncertainties because the images are built up out of pixels, so it is not possible to get the exact image point. And the calibration matrices K and K' have uncertainties caused by the calibration process.

For the 3D reconstruction the most robust method is wanted, i.e. the method that is the least sensitive to the parameter errors. Two analysis methods are used to determine the sensitivity of the 3D reconstruction methods to errors. Both analyze the 3D reconstruction methods for each parameter separately: first the error in the rotation matrix R, then the error in the translation vector t, and so on. In this way it becomes clear for which errors the 3D reconstruction methods are the least sensitive.

The influence of five parameter errors has been analyzed. These five parameters are the focal distance of the camera (D), the camera center (cc) (these two parameters are part of the camera calibration matrix K), the translation vector (t), the rotation matrix (R) and the image points. The uncertainties of the focal distance, camera center and image points are determined by the Camera Calibration Toolbox for Matlab. The uncertainties of the translation vector and the rotation matrix are estimated by tests performed on the NAO robot. For these tests NAO is commanded to change the yaw and pitch angles of the head by 0.5 rad, after which the angle changes are measured with NAO's internal sensors. The error is the measured angle change minus 0.5 rad. This test was done several times and resulted in an uncertainty for the rotation angle and the translation vector (see table 3.2).

The uncertainties used in the simulations are two times the measured standard deviations; this ensures with 95% certainty that the parameter value will be within the uncertainty region (the uncertainties of the parameters have a Gaussian distribution). All the parameter values with their uncertainties are displayed in table 3.2. The uncertainties for the camera center and the translation vector are equal for each element. The value of the image point is not given, because it differs between image 1 and image 2 and also depends on which of the five 3D points is used.

Table 3.2: NAO's parameters and uncertainties

Parameter                    Value             Uncertainty (σ)   2 × σ
Focal distance (mm)          752               2                 4
Camera center (mm)           [331, 253]^T      1.6               3.2
Image point (pixels)         -                 0.12              0.24
Rotation angle (radian)      0.5               0.01              0.02
Translational vector (mm)    [28, 29, −24]^T   1.3               2.6

For the simulation five 3D points were used (3.19); all distances are in millimeters. These points were transformed to image coordinates with the error-free projection matrices; the parameter errors were only applied in the 3D reconstruction methods.

X_1-5 =  [ 0 ]   [ 0 ]   [  0  ]   [ 40 ]   [ 250  ]
         [ 0 ]   [ 0 ]   [ −25 ]   [ 40 ]   [  0   ]
         [100]   [300]   [ 300 ]   [750 ]   [ 1500 ]    (3.19)

3.4.1 Monte Carlo analysis

The Monte Carlo method is suggested by [9] and described in [1]. The Monte Carlo method simulates each 3D reconstruction method N times, and each time it takes a random value within the uncertainty of the parameter. Each of the N simulations gives a 3D point X_est. This X_est is subtracted from the real 3D point X, which gives an error vector with errors in the x-, y- and z-directions. The systematic error is calculated by taking the mean of each element of the error vector over the N simulations; the random error is calculated by taking the standard deviation of each element of the error vector over the N simulations. The RMS absolute error is the root mean square of these two errors.

The Monte Carlo analysis was done with N = 10,000. The results are displayed in table 3.3; the values in the table are worst case scenarios. The uncertainties of the camera center and image point are given in pixel distance, the rotation angle in degrees, and the rest of the parameters and the errors are given in millimeters.

Table 3.3: Results Monte Carlo analysis (absolute error)

Uncertainty                        RMS linMMSE estimation   RMS rectification   RMS optimal triangulation
                                   absolute error           absolute error      absolute error
Focal distance, σ_D = 4 mm         80.35                    231.22              0.26
Camera center, σ_cc = 3.2          472.15                   7,194.32            2.15
Rotation angle, σ_α = 1.15         780.91                   38,044.77           15.89
Translation vector, σ_t = 2.6 mm   131.75                   145.51              19.01
Image point, σ_x = 0.24            55.38                    72.08               29.61
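
A sketch of the Monte Carlo analysis for a single parameter. The callables reconstruct and project, and the keyword param_error, are hypothetical placeholders for the simulation code; only the error statistics follow the description above.

```python
import numpy as np

def monte_carlo_error(reconstruct, project, X_true, sigma, N=10_000):
    rng = np.random.default_rng()
    errors = np.empty((N, 3))
    for i in range(N):
        delta = rng.normal(0.0, sigma)            # random value within the uncertainty
        x1, x2 = project(X_true)                  # ideal (error-free) image points
        X_est = reconstruct(x1, x2, param_error=delta)
        errors[i] = X_est - X_true                # error vector in x, y and z

    systematic = errors.mean(axis=0)              # mean over the N simulations
    random = errors.std(axis=0)                   # standard deviation over the N simulations
    return np.sqrt(systematic ** 2 + random ** 2) # RMS absolute error
```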

3.4.2 Parameter sweep

The parameter sweep also simulates each 3D reconstruction method N times, but now it sweeps through the uncertainty of the parameter: it starts at −∆_max and stops at +∆_max. The error vector is calculated and its norm is taken. If there are multiple 3D simulation points, the average of the error vector norms is taken. The final step is to plot the result of the N simulations against the parameter error.

The results of this analysis are plotted in figure 3.1. The errors on the x-axis are the deviations from the real parameter value.
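
A sketch of the parameter sweep under the same hypothetical placeholders as the Monte Carlo sketch; it differs only in stepping through −∆_max to +∆_max instead of drawing random values.

```python
import numpy as np

def parameter_sweep(reconstruct, project, points, delta_max, N=200):
    deltas = np.linspace(-delta_max, delta_max, N)
    mean_norms = []
    for d in deltas:
        norms = []
        for X_true in points:
            x1, x2 = project(X_true)
            X_est = reconstruct(x1, x2, param_error=d)
            norms.append(np.linalg.norm(X_est - X_true))
        mean_norms.append(np.mean(norms))         # average over the 3D test points
    return deltas, mean_norms                     # plot mean_norms against deltas
```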

3.5 Conclusion

In the previous section (3.4) the three 3D reconstruction methods were analyzed for their sensitivity to parameter errors. The Monte Carlo method and the parameter sweep give a clear picture of which reconstruction method is the least error prone. Both figure 3.1 and table 3.3 show that the optimal triangulation method is by far the least error prone 3D reconstruction method. Only for the image location error do the rectification and linMMSE estimation errors come near the values of the optimal triangulation method.

A reason for the large errors in the rectification method and the linMMSE estimation is the small translation vector t of NAO; it is in the order of a few centimeters, which makes these two methods sensitive to parameter errors. Testing shows that making the translation vector, for instance, 10 times larger reduces the errors in both methods significantly, whereas the errors in the optimal triangulation method stay the same.

Another reason for the large errors in the linMMSE estimation is the initial estimate of the point of attention and the uncertainty (σ) of that point. Because there is no idea where the point of attention is, the middle of the working space is chosen with an uncertainty that covers approximately the whole working space. This gives an initial guess that is most likely not close to the point of attention, which causes a large error in the linMMSE estimation.

Based on the results in figure 3.1 and table 3.3, the optimal triangulation method is chosen as the method for 3D reconstruction because it gives the best results.


Figure 3.1: Results of the parameter sweep. Panels (a)-(e) plot the absolute error |error| (mm) of the rectification, linMMSE estimation and optimal triangulation methods against, respectively, the focal distance error (mm), camera center error (mm), rotation angle error (rad), translational vector error (mm) and image location error (mm).


Chapter 4

3D estimation of a salient object

This chapter presents the whole algorithm of the 3D estimation of a salient object (section 4.4). The main parts have already been treated in chapter 2 (saliency mapping) and in section 3.3 (optimal triangulation method). Section 4.1 presents the robot used. Section 4.2 shows the conditions under which the second image is captured. The transformation of the coordinate system is explained in section 4.3.

4.1 NAO

The humanoid used for this project is NAO (figure 1.1). NAO is manufactured by the French company Aldebaran Robotics. It has a height of 58 cm and weighs 4.3 kg [8]. The humanoid's equipment includes two CMOS digital cameras. The fields of view of these two cameras do not overlap, so taking an image with both cameras at the same time cannot be used for 3D reconstruction. NAO also has 32 Hall effect sensors, a two-axis gyrometer and a three-axis accelerometer. These sensors are used to determine the position of NAO.

NAO is supported by the NaoQi API framework. This framework enables the user to let NAO walk and move easily. It is also used for getting the position of the cameras and the pitch and yaw angles of NAO’s head.

4.2 Obtain second image

When image 1 is obtained it does not matter in which position NAO is or when it is obtained. That is different for image 2. Image 2 has to be obtained as soon as possible after image 1, because as time elapses NAO's surroundings will most likely change. The field of view at camera position 2 has to overlap the field of view at camera position 1; therefore the position of NAO does matter when image 2 is obtained.

The fields of view have to overlap, otherwise a 3D reconstruction cannot be made.

When a salient object is spotted in image 1, the yaw and pitch angles of NAO's head will be changed so that the salient object moves to the image center. The maximum angle changes are the angle changes that transfer an object from the border of the image to the center.

The salient object in image 1 is represented by image coordinates. The coordinates of the image center are subtracted from the salient object coordinates and the result is divided by the image center coordinates (half the image size), which gives the relative image coordinates. These relative image coordinates are multiplied by the maximum angle changes. The result of these angle changes is that the object will appear in the center of the image.

There are two conditions which the angle changes have to meet. The first condition is that an angle change has to be at least 0.1 radian. This condition is needed because the angles must also change when the salient object is close to the center of the image. The second condition is that an angle change has to result in an angle that is within the yaw and pitch range of NAO's head. When this condition is not met, the angle change is reduced until it is within the yaw or pitch range. When the angle change then gets smaller than 0.1 radian, the angle change is set to 0.1 radian in the opposite direction.
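
A sketch of this step, assuming a 640x480 image; the maximum angle changes and the yaw/pitch ranges are illustrative values, not NAO's actual joint limits, and the sign convention of the head movement is left to the caller.

```python
import numpy as np

MIN_CHANGE = 0.1   # radian; minimum angle change (first condition)

def head_angle_changes(salient_xy, yaw, pitch,
                       image_size=(640, 480),
                       max_change=(0.5, 0.4),       # assumed border-to-center angles
                       yaw_range=(-2.0, 2.0),
                       pitch_range=(-0.6, 0.5)):
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    rel = ((salient_xy[0] - cx) / cx, (salient_xy[1] - cy) / cy)   # relative coordinates

    def limit(delta, current, lo, hi):
        # first condition: change by at least 0.1 rad
        if abs(delta) < MIN_CHANGE:
            delta = MIN_CHANGE * (1.0 if delta >= 0 else -1.0)
        # second condition: the resulting angle must stay inside the joint range
        if not lo <= current + delta <= hi:
            delta = float(np.clip(current + delta, lo, hi)) - current
            if abs(delta) < MIN_CHANGE:            # reduced too far: flip the direction
                delta = -MIN_CHANGE * (1.0 if delta >= 0 else -1.0)
        return delta

    return (limit(rel[0] * max_change[0], yaw, *yaw_range),
            limit(rel[1] * max_change[1], pitch, *pitch_range))
```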


4.3 Coordinate system transformation

The position of the camera that is acquired from NAO is transformed twice. First it is rotated to match the orientation used in the saliency mapping and in the camera calibration. Then it is transformed so that the camera position of image 1 is at the center of the coordinate system. This is necessary for the calculation of the rotation matrix and translational vector.

4.3.1 Matching the coordinate systems

For the saliency mapping and the calibration of the camera a 2D coordinate system is used. The origin of those coordinate systems is at the upper left corner of the image; x runs along the width and y along the height of the image. For a Cartesian coordinate system, z would then be the depth of the image.

The orientation of NAO’s coordinate system is as shown in figure 4.1.

Figure 4.1: NAO’s coordinate system [8].

To rotate NAO's coordinate system to the desired orientation, it is first rotated by −90° around the x-axis and then by −90° around the z-axis. The resulting rotation matrix is shown in formula (4.1):

R_xz = | 1      0           0        |   | cos(−90°)   sin(−90°)  0 |   | 0  −1   0 |
       | 0  cos(−90°)   sin(−90°)    | · | −sin(−90°)  cos(−90°)  0 | = | 0   0  −1 |
       | 0  −sin(−90°)  cos(−90°)    |   | 0           0          1 |   | 1   0   0 |    (4.1)

4.3.2 Orientate coordinate system to camera 1

Figure 4.2: Transform NAO.

The rotation matrix and translation vector that are needed for the optimal triangulation method are based on a coordinate system with its origin at camera position 1. The coordinate system is oriented in such a way that the x-axis runs along the width of the image and the y-axis along the height of the image. Figure 4.2 shows a schematic drawing of NAO's head and torso. Originally the origin of the coordinate system is somewhere at the center of the torso; it has to be translated to the camera in NAO's head.


To transform the coordinate system to the camera, four transformation matrices are needed. First the coordinate system has to be translated to the center of the head joint. Then the coordinate system has to be rotated twice, once for the yaw and once for the pitch. The last transformation is the transformation from the head joint to the camera. Equation (4.2) shows these four transformation matrices; the yaw angle is φ and the pitch angle is ϕ.

| 1  0  0  0     |   | 1  0        0        0 |   | cos(ϕ)   0  sin(ϕ)  0 |   | 1  0  0  −dstX |
| 0  1  0  −dstJ | · | 0  cos(φ)  −sin(φ)   0 | · | 0        1  0       0 | · | 0  1  0  −dstY |
| 0  0  1  0     |   | 0  sin(φ)   cos(φ)   0 |   | −sin(ϕ)  0  cos(ϕ)  0 |   | 0  0  1  −dstZ |
| 0  0  0  1     |   | 0  0        0        1 |   | 0        0  0       1 |   | 0  0  0  1     |    (4.2)

4.4 3D estimation of a salient object

All the individual parts of the 3D estimation of a salient object have been explained in the previous chapters and sections. The complete algorithm is constructed from these parts and is represented as a six-step algorithm:

1. Take an image with NAO's camera; this image will be called image 1. Also record the angles of NAO's head joint.

2. Create the saliency map for image 1 (see chapter 2). It returns the most salient position (x_1) in the image.

3. Turn NAO's head towards the salient object and take a second image (see section 4.2); this image will be called image 2. Here too, record the angles of NAO's head joint.

4. Create the saliency map for image 2. It returns the most salient position (x_2) in the image.

5. Perform the coordinate transformation (see section 4.3) and calculate the rotation matrix and the translation vector.

6. The two image coordinates of the salient object are now known (x_1 and x_2), as are the rotation matrix and the translation vector. These are the four parameters for the optimal triangulation method. Now reconstruct the 3D point with the optimal triangulation method (see section 3.3).

4.4.1 Test setup

For the test setup NAO is put on the floor, facing a white wall, with a bright red object taped to the wall. The distance between NAO and the wall is 30 cm and the red object is at the camera center when image 1 is taken. The reason for this is that the optimal triangulation method puts the origin of the coordinate system at camera position 1. When the red object is at the center of image 1, the x- and y-values of the reconstructed 3D point are approximately zero and the z-value will be the distance from the camera to the red object. These conditions ensure that it can easily be determined whether the 3D estimation of a salient object works.

4.4.2 Results

Figure 4.3 shows image 1 and image 2 that were taken for the first test.


(a) (b)

Figure 4.3: Test 1: image 1 (a) and image 2 (b).

The saliency maps that belong to the images of the first test are shown in figure 4.4.

(a) (b)

Figure 4.4: Test 1: saliency maps of image 1 (a) and image 2 (b).

The reconstructed 3D point for the first test is [0.12, −5.4, −229]T.

For the second test the red object is placed in a corner of the camera image. This ensures that the angle changes are relatively large. Figure 4.5 shows the captured images 1 and 2.

(a) (b)

Figure 4.5: Test 2: image 1 (a) and image 2 (b).

The resulting saliency maps for the second test are shown in figure 4.6.


(a) (b)

Figure 4.6: Test 2: saliency maps of image 1 (a) and image 2 (b).

The reconstructed 3D point for the second test is [43, −34, 1271]T.


Chapter 5

Conclusions and recommendations

The reconstructed 3D point of test 1 lies approximately 23 cm behind NAO. The second reconstructed 3D point is 1.3 m in front of NAO. This bears no relation to the 30 cm in front of NAO that it should be. In fact, numerous tests were done and each reconstructed 3D point was completely different; only in a few cases was it near the 30 cm in front of NAO. The 3D estimation of a salient object has been analyzed and there are two possible error sources that could explain why it does not work.

As can be seen in figure 4.4 and figure 4.6, there can be no misunderstanding about which image point the red object is at. The salient places have a size of a few pixels, but the pixel error should not be the problem (see section 3.5). So the problem is not in the saliency mapping part.

For the optimal triangulation method the translational vector from camera position 1 to camera position 2 is needed. Camera position 1 is taken here as the center of the coordinate system and should therefore have the coordinates [0, 0, 0]^T. However for the first test these coordinates are [3.7, −1.1, −4.6]^T and for the second test [−21.4, 19.1, 33.2]^T. This indicates that there is something wrong with the coordinate system transformations (section 4.3).

The first error source could be the distance from the original coordinate system to the head joint and the distance from the head joint to the camera (see figure 4.2). If these distances have a significant error, the transformation will not result in a coordinate system with the origin at NAO’s camera.

The second error source could be the error in the yaw and pitch angles of NAO's head. If NAO is commanded to set these angles to 0, the angles are not set completely to 0; the error in the angle is approximately 0.02 radian. This is 20% of the minimum angle change and that could cause a significant error. So the conclusion is that the 3D estimation of a salient object does not work due to coordinate transformation errors.

To make the 3D estimation of a salient object work, it is recommended to have NAO calibrated by Aldebaran Robotics. The calibration ensures that if the pitch and yaw angles are set to a certain value, NAO will move its head to that exact position.

When NAO is calibrated, the positions of NAO's camera and NAO's head joint can be calculated. If the pitch and yaw angles are 0, the distance of NAO's camera in the z-direction is known. Then the pitch of NAO's head should be changed until the value of the camera's z-direction becomes 0. This forms a triangle with two known angles and one known side, which can be used to calculate the distance in the y-direction from the head joint to the camera. When this distance is subtracted from the distance in the y-direction in the case that the yaw and pitch angles are 0, the result is the distance from the original coordinate system to the head joint.

Another possible explanation for why the algorithm does not work came to me just before printing this report. To determine the position of an object it is needed that the epipoles of the two images (O and O', see figure 5.1) do not coincide. For NAO the assumption was made that the epipoles do not coincide, but it is possible that for NAO the epipoles do coincide, and that could explain the results. So the recommendation is to first check whether the epipoles of the images taken by NAO coincide.

If they do, then the optimal triangulation could still be the best 3D reconstruction method if the movements of NAO were adapted with this new knowledge.


Bibliography

[1] Costas Anyfantakis, Guglielmo Maria Caporale, and Nikitas Pittis. Parameter instability and forecasting performance: a Monte Carlo study. Discussion paper series DEDP 04/04, London Metropolitan University, Department of Economics, London, 2004.

[2] Andrea Fusiello, Emanuele Trucco, and Alessandro Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12:16-22, 2000.

[3] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2000.

[4] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.

[5] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4:219-227, 1985.

[6] K. K. Lee, K. H. Wong, M. M. Y. Chang, Y. K. Yu, and M. K. Leung. Extended Kalman filtering approach to stereo video stabilization. In 19th International Conference on Pattern Recognition, pages 3470-3473, 2008.

[7] R. Reilink, S. Stramigioli, F. van der Heijden, and G. van Oort. Saliency-based humanoid gaze emulation using a moving camera setup. MSc thesis 019CE2008, University of Twente, Dept. of Electrical Engineering, Control Engineering.

[8] Aldebaran Robotics. NAO documentation. Online, July 2011.

[9] Ferdi van der Heijden. 3D position estimation from 2 cameras. Available on request at the Department of Electrical Engineering, Mathematics and Computer Science (EEMCS) of the University of Twente, April 2011.
