
Gaze Estimation

Based on head location and eye-gaze

Steve Nowee 10183914

Honours Extension Bachelor Thesis
Credits: 6 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisors:
Prof. dr. T. Gevers, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam
Dr. R. Valenti, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam

July 30, 2014


Acknowledgements

I would like to thank Dr. Roberto Valenti for his continuous support and Prof. Dr. Theo Gevers for agreeing to evaluate this project, even after the regular bachelor thesis had ended. I would also like to thank Dr. Raquel Fernández for her efforts in coordinating the Honours Programme.

Abstract

In this report an application of gaze estimation is discussed. This application estimates the location of the head with respect to the webcam and the screen, and it estimates eye gaze for a static head location. For both of these components, the underlying principles and approaches are explained. The accuracy of the resulting head location estimation is acceptable; the eye gaze estimation, however, is considerably less accurate. Possible causes for this inaccuracy are discussed.


Contents

1 Introduction
2 Related Work
3 Methodology
  3.1 Head Pose
  3.2 Head Location
  3.3 Eye Gaze
  3.4 Combining Eye Gaze and Head Location
4 Results
  4.1 Discussion
5 Conclusion
  5.1 Future Work


1 Introduction

Gaze estimation is a field of study with many possible applications. One example of how gaze estimation can be used is as a means of controlling technology, e.g. a computer. For someone who is paralyzed, gaze tracking can be a good solution to regain a sense of independence, by being able to do things without needing help. Another example where gaze tracking can be helpful is in the field of marketing. For businesses and shops, it can be interesting to know what the consumer actually looks at, for example in advertisements. If it is known what draws an average consumer's attention in an advertisement, this principle can be used in other advertisements as well.

In this report, an implemented application that estimates the gaze of a person is discussed. This application consists of head location estimation and eye gaze estimation. In Section 2, related work is discussed. This is followed by the methods used, which are explained in Section 3. The accuracy of the application is discussed in Section 4 and lastly, in Section 5, the conclusion is given.

2 Related Work

To know where a person looks, the pose of the face (e.g. the rotation) is important, as well as the eye gaze. Tan et al. [2002] have proposed a method of estimating eye gaze based on the appearance of the eye. Depending on where someone is looking, the appearance of the eye region will differ, and based on these differences the gaze direction can be estimated.

Another approach to estimating eye gaze, is by using the corneal reflection of the pupil. This has been proposed by Yoo and Chung [2003]. Using infrared LED lights, one on each corner of a screen, the reflections of these lights can be detected in the cornea of a user. These reflected points can be mapped from the eye to the screen corners. Consequently, the center of the pupil can be mapped onto a point on the screen. This method of gaze estimation is relatively simple and fast.

Head pose estimation can be treated as a regression problem, as was done by Fanelli et al. [2011]. They proposed the use of random regression forests to determine head poses from training data. The approach is simple and achieves accuracy equal to that of state-of-the-art methods.

3 Methodology

A system to estimate the direction of a person's gaze is based on several components. The first of these components is the location and pose of a person's head. The second component is the position of the eyes. For the system proposed in this report, the head pose does not vary. In the following subsections, these components will be discussed in more detail.


3.1 Head Pose

A person’s head pose can be subdivided into three degrees of freedom (DOF), since there are three orthogonal dimensions of rotation. These three DOF are called roll, pitch and yaw. Roll is tilting the head sideways, pitch is moving the head up and down and yaw is moving the head left and right. In figure 1 these three are visualized.

Figure 1: The three degrees of freedom in the head.

For the system proposed in this report, it is assumed that the subject’s head pose remains constant and the subject is looking straight ahead.

3.2 Head Location

In a system of gaze estimation, the location of a person's head is very important. For example, based on a person's eyes, it can be estimated that he is looking at the upper left corner of his screen. However, when that person's head shifts far enough to the right while the eyes keep the same position, the gaze is on the upper right corner of the screen. This example can be seen in figure 2.


Figure 2: An example of the effect of head location on the gaze estimation.

The estimation of the actual location of the head takes place in several steps. First the distance between the head and the camera, the depth, should be estimated. This is achieved by having a known length, for example the distance between the pupils (interpupillary distance), at a known distance from the camera. For this report an interpupillary distance of 6.3 centimeters was used. This is the average distance between the pupils for male and female subjects [Dodgson, 2004]. The distance away from the camera was set at 50 centimeters. This results in a distance between the pupils in the recorded image, in pixels. Using the real distance between the pupils, the distance between the pupils in pixels and the distance to the camera, equation 1 calculates the focal length of the camera.

f = d ∗ Z/D. (1)

In this equation, f denotes the focal length, d denotes the distance between the pupils in pixels, Z denotes the distance away from the camera in centimeters and D denotes the actual distance between the pupils in centimeters. Then, when the focal length is known, equation 1 can be rewritten to calculate the distance from the camera in centimeters, which can be seen in equation 2.

Z = f ∗ D/d. (2)
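As a concrete illustration of equations 1 and 2, the sketch below (Python, not part of the original implementation) first calibrates the focal length from a known pupil distance in pixels at the known 50 centimeter distance, and then estimates the depth for later frames. The helper names and the example pixel values are illustrative assumptions.

```python
# Minimal sketch of the depth-from-interpupillary-distance idea (equations 1 and 2).
# Constants follow the report: D = 6.3 cm average interpupillary distance,
# Z = 50 cm during calibration. Function names and example values are illustrative.

D_CM = 6.3          # real interpupillary distance in centimeters
Z_CALIB_CM = 50.0   # known distance from the camera during calibration

def focal_length(pupil_dist_px, depth_cm=Z_CALIB_CM, pupil_dist_cm=D_CM):
    """Equation 1: f = d * Z / D, with d in pixels."""
    return pupil_dist_px * depth_cm / pupil_dist_cm

def depth_from_pupils(pupil_dist_px, f, pupil_dist_cm=D_CM):
    """Equation 2: Z = f * D / d, giving the head-camera distance in centimeters."""
    return f * pupil_dist_cm / pupil_dist_px

# Example: calibrate once, then estimate depth for a later frame.
f = focal_length(pupil_dist_px=80.0)             # pupils 80 px apart at 50 cm
z = depth_from_pupils(pupil_dist_px=64.0, f=f)   # pupils appear closer together, so the head is farther away
```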

The next step in estimating the location of the head is finding the translation of the center of the head with respect to the origin of the camera. Let the center of each frame be the origin of the camera. The center of the head is computed using CrowdSight, in x and y coordinates from the frame origin in the top left corner of the frame. To get the horizontal and vertical distance between the camera origin and the center of the head, the x and y coordinates of the head should respectively be subtracted from the x and y coordinates of the camera origin. This results in the translation of the head with respect to the camera, in pixels. Using the previously calculated depth value (the distance from the camera) and the focal length, the translations in pixel values can be converted into translations in centimeters to get the location of the head in real world coordinates. For these coordinates in real world units, the intersection of a straight gaze from the face to the screen can be found, converting the real world units back to pixel units to project this gaze on the screen.
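The conversion from the pixel translation to centimeters follows the pinhole model (offset_cm = offset_px · Z / f). A minimal sketch, with illustrative names and the head center taken as given (in the report it comes from CrowdSight):

```python
# Sketch of converting the head's pixel offset from the frame center (the camera
# origin) into centimeters, using the estimated depth and the focal length.

def head_location_cm(head_px, frame_size_px, depth_cm, f):
    """head_px: (x, y) of the head center, origin at the top left corner of the frame.
    frame_size_px: (width, height). Returns the (horizontal, vertical) offset in cm."""
    cx, cy = frame_size_px[0] / 2.0, frame_size_px[1] / 2.0   # camera origin = frame center
    dx_px, dy_px = cx - head_px[0], cy - head_px[1]           # origin minus head, as in the text
    return dx_px * depth_cm / f, dy_px * depth_cm / f         # back-project pixels to centimeters
```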

However, it is not enough to know where a person's head is to know where that person is looking. It is also important to know where a person's eye gaze is directed. In the following subsection, the eye gaze component is discussed.

3.3 Eye Gaze

A person can see in a wide range by moving only his eyes, making it important to know the direction of a person’s eye gaze to estimate his overall gaze. There are several methods to achieve a reliable estimation of eye gaze. This can be done through the use of head mounted devices [Noris et al., 2008] or by using the corneal reflection of infrared lighting to determine the gaze direction of the pupils [Yoo and Chung, 2003].

The method used in this report relies on the position of the pupil relative to the inner eye corner. This relative position can be represented by the vector between the two positions. An example of such vectors can be seen in figure 3.

Figure 3: Examples of vectors between the inner eye corner and the center of the pupil. With a neutral gaze (left) and an upward gaze (right).

In the following paragraphs, operations on the eye region of a recorded frame are discussed. In order to acquire these eye regions, the detected face is first normalized on scale and rotation. This is a necessary step to make the extraction of the eye regions stable and consistent. Based on the distance between the eyes, and thus the original size of the face, the face region is either enlarged or downscaled. A rotation matrix is then applied to the face region to negate the actual pose the head is in, which results in a frontal view of the face. In this image of the face, with a constant size, it is easy to extract the eye regions, because they are always at the same position. These eye regions are then used in all further operations.
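A hedged sketch of such a scale and rotation normalization, based on the two eye centers and OpenCV's affine warp; the target eye coordinates and output size are assumptions, not values taken from the report:

```python
import cv2
import numpy as np

def normalize_face(frame, left_eye, right_eye, out_size=(200, 200),
                   target_left=(60.0, 80.0), target_right=(140.0, 80.0)):
    """Warp the frame so the eyes land on fixed positions in a fixed-size face image."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))                     # in-plane rotation of the eye line
    scale = (target_right[0] - target_left[0]) / np.hypot(dx, dy)
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)          # rotate and scale about the eye midpoint
    # shift so the eye midpoint ends up between the target eye positions
    M[0, 2] += (target_left[0] + target_right[0]) / 2.0 - center[0]
    M[1, 2] += target_left[1] - center[1]
    return cv2.warpAffine(frame, M, out_size)
```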

Another normalization that has been applied, and which makes the application more robust, is illumination normalization. In order to make the application invariant to changes in the lighting environment, a preprocessing and normalization procedure proposed by Tan and Triggs [2010] is applied to each scale- and rotation-normalized facial image.
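For reference, a hedged sketch of a Tan and Triggs [2010] style preprocessing chain (gamma correction, difference-of-Gaussians filtering, contrast equalization); the parameter values are the commonly cited defaults and are assumptions here, not values taken from the report:

```python
import cv2
import numpy as np

def illumination_normalize(gray, gamma=0.2, sigma0=1.0, sigma1=2.0, alpha=0.1, tau=10.0):
    """Tan-Triggs style illumination normalization of a grayscale face image."""
    img = np.power(gray.astype(np.float32) / 255.0, gamma)                  # gamma correction
    img = (cv2.GaussianBlur(img, (0, 0), sigma0)
           - cv2.GaussianBlur(img, (0, 0), sigma1))                         # difference of Gaussians
    img /= np.mean(np.abs(img) ** alpha) ** (1.0 / alpha)                   # contrast equalization, stage 1
    img /= np.mean(np.minimum(np.abs(img), tau) ** alpha) ** (1.0 / alpha)  # stage 2, truncated at tau
    img = tau * np.tanh(img / tau)                                          # compress remaining extremes
    return cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```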

The center of the pupil is computed using the CrowdSight software. In order to create a stable estimation of a person's gaze, the extracted vectors should be stable as well. Consequently, the eye corner point should be detected in a stable fashion. This detection is performed using template matching, with the use of the OpenCV library. When matching an image with a template, the origin of the template (0,0) (its upper left corner) is moved over each pixel of the image at which the whole template still fits in the image. Each of these pixels results in a similarity measure between the template and that specific region of the image. Depending on the method of calculating the similarity, either the maximal or the minimal similarity value corresponds to the best match of the template.
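A minimal sketch of this template matching step with OpenCV is given below; the normalized cross-correlation method is an assumption (with squared-difference methods the minimum would mark the best match instead), and the function name is illustrative:

```python
import cv2

def match_eye_corner(eye_region_gray, template_gray):
    """Locate the best match of an eye corner template inside a grayscale eye region."""
    result = cv2.matchTemplate(eye_region_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)        # for this method, the maximum is the best match
    h, w = template_gray.shape[:2]
    corner = (max_loc[0] + w // 2, max_loc[1] + h // 2)   # patch center, assuming the template is centered on the corner
    return corner, max_val
```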

A template for each eye corner was extracted manually from an arbitrary data set of facial images. For each eye, ten eye corners were extracted as images of 23x27 pixels. These eye corner images were converted to grayscale and then averaged. The resulting templates can be seen in figure 4.

Figure 4: The left and right eye corner templates, respectively left and right.

After finding the best matching regions in the eye images, these regions are extracted and combined with the generic, manually extracted eye corner templates. These combined templates are the templates that will be used in the matching of the next iteration, which will then be combined with newly extracted eye corner regions. This should make the templates more specific to the particular user and thus should make the process of detecting the eye corners more stable, since the eye templates become more similar to the actual eye corners of the person. However, the detection of the eye corners has to be accurate, because otherwise the templates will be adapted with regions that do not show eye corners, resulting in more inaccurate detections in following frames.
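The template adaptation can be sketched as a simple blend of the newly matched region into the running template; the report does not state the weighting, so the blending factor below is an assumption:

```python
import cv2
import numpy as np

def update_template(template_gray, matched_region_gray, weight=0.1):
    """Blend the best-matching region into the template so it slowly adapts to the user.
    Both images must have the template size (e.g. 23x27 pixels)."""
    blended = cv2.addWeighted(matched_region_gray.astype(np.float32), weight,
                              template_gray.astype(np.float32), 1.0 - weight, 0.0)
    return blended.astype(np.uint8)
```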

Having found both the pupil center and the eye corner for both eyes, the representing vector can be computed. This is achieved by finding the horizontal and vertical location differences between the two points. In this vector, the eye corners are the anchor points, because these do not move relative to the face. Thus, the vectors can be represented as follows:

g = (X, Y),

where X and Y denote the horizontal and vertical differences between the pupil center and the inner eye corner.


In order to know where a certain gaze vector is directed, calibration is necessary. A set of known points on the screen is shown to a user. For each of these points, the user has to move their eyes towards them. This results in a set of points on the screen and their corresponding eye vectors. Having collected the eye vectors for these calibration points, a set of coefficients can be found that best translates the eye vectors to the screen points. This translation is represented by the following equations:

X_screen = α_0 + α_1 X + α_2 Y + α_3 XY,
Y_screen = β_0 + β_1 X + β_2 Y + β_3 XY. (3)

In these equations, X_screen and Y_screen denote the x and y coordinates on the screen, α_i and β_i denote the coefficients, and X and Y denote the eye vector values.

In the application created for this report a set of twenty-five calibration points was used. The locations of the calibration points can be seen in figure 5.

Figure 5: The locations of the twenty-five calibration points.
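Assuming the twenty-five points of figure 5 form an even 5x5 grid with a fixed margin (the exact layout is not specified beyond the figure), the targets could be generated as follows; the screen size and margin are illustrative:

```python
import numpy as np

def calibration_points(width=1920, height=1080, margin=50, n=5):
    """Return n*n evenly spaced (x, y) screen positions, here 25 calibration targets."""
    xs = np.linspace(margin, width - margin, n)
    ys = np.linspace(margin, height - margin, n)
    return [(int(x), int(y)) for y in ys for x in xs]
```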

The formulas in equation 3 are both first order polynomials and thus linear. A method to fit a first order polynomial to a set of data is the least squares method. For the application discussed in this report, a linear least squares method has been applied to acquire the best fitting parameters. For this parameter fitting, the formulas in equation 3 have to be converted to matrix notation:

\[
\begin{bmatrix}
x_{\text{screen},1} & y_{\text{screen},1} \\
x_{\text{screen},2} & y_{\text{screen},2} \\
\vdots & \vdots \\
x_{\text{screen},25} & y_{\text{screen},25}
\end{bmatrix}
=
\begin{bmatrix}
1 & x_1 & y_1 & x_1 y_1 \\
1 & x_2 & y_2 & x_2 y_2 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{25} & y_{25} & x_{25} y_{25}
\end{bmatrix}
\begin{bmatrix}
\alpha_0 & \beta_0 \\
\alpha_1 & \beta_1 \\
\alpha_2 & \beta_2 \\
\alpha_3 & \beta_3
\end{bmatrix}. \tag{4}
\]

In this matrix equation, the leftmost matrix consists of the calibration points and shall be called xy in further equations. The middle matrix consists of the partial derivatives of the formulas in equation 3 with respect to the parameters α_i and β_i. These partial derivatives are the same for α and β. This middle matrix shall be denoted as XY in further equations. Lastly, the rightmost matrix consists of the eight parameter values, four for the x coordinate calculation and four for the y coordinate calculation. This final matrix shall be denoted by ab in further equations.

With these denotations, equation 4 becomes:

xy = XY ∗ ab. (5)

In this equation, the values for matrices xy and XY are known and can be used to estimate the best fits for the parameters in matrix ab. In order to make such an estimation, the equation has to be rewritten. First a left-multiplication with the transpose XY^T on both sides of the equation is applied, resulting in:

XY^T ∗ xy = XY^T ∗ XY ∗ ab. (6)

Lastly, to isolate the matrix with the parameter values, both sides of equation 6 are left-multiplied by the inverse of XY^T ∗ XY, which results in:

ab = (XY^T ∗ XY)^(-1) ∗ XY^T ∗ xy. (7)
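A minimal sketch of this fit and its use, with NumPy; np.linalg.lstsq solves the same least squares problem as the explicit normal equations of equation 7, but more stably:

```python
import numpy as np

def fit_calibration(eye_vectors, screen_points):
    """eye_vectors: (25, 2) array of eye vectors (X, Y); screen_points: (25, 2) screen coordinates.
    Returns the (4, 2) coefficient matrix ab, alphas in column 0 and betas in column 1."""
    X, Y = eye_vectors[:, 0], eye_vectors[:, 1]
    XY = np.column_stack([np.ones_like(X), X, Y, X * Y])      # design matrix of equation 4
    ab, *_ = np.linalg.lstsq(XY, screen_points, rcond=None)   # equivalent to (XY^T XY)^-1 XY^T xy
    return ab

def gaze_to_screen(ab, eye_vector):
    """Apply equation 3 with the fitted coefficients to a new eye vector."""
    X, Y = eye_vector
    return np.array([1.0, X, Y, X * Y]) @ ab                  # (x_screen, y_screen)
```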

Having found the best fitting parameters for calculating the screen coordinates given eye gaze vectors, these parameters can be applied to all following computed eye gaze vectors. This way, an estimation is made as to where a user is looking based on the eyes.

3.4 Combining Eye Gaze and Head Location

In the previous section, the eye gaze estimation has been explained, using a set of fitted parameters to transform eye vector data into screen coordinates. However, these fitted parameters only work for one position of the head. Consider the following: a user looks at the top right corner of the screen from a certain distance and the parameters are fitted in this situation. Now, when the user moves backward but still looks at the top right corner of the screen, the eye vector will not be the same as in the previous position.

To overcome this issue, there are several possibilities. The first approach to overcome this problem would be to create several models, instead of one single model. Each model only consists of the parameters for one location and distance of the head. So by fitting several sets of parameters, for several locations and distances of the head, the application could estimate the gaze of users that are in varying locations. Since the actual location of a user is computed for each frame that is caught by the camera, the closest model can be found resulting in the best fitting set of parameters to convert the eye vectors to screen coordinates. However, the accuracy of this approach depends on the number of models that will be used and the more models are used the more laborious the task. Even so, the models have to be fitted only once.
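A sketch of how such a model lookup might work, assuming each model stores the head location it was calibrated at together with its coefficient matrix (the data layout is an assumption, as this approach was not implemented):

```python
import numpy as np

def select_model(models, head_location_cm):
    """models: list of (calibration_location_cm, coefficient_matrix) pairs.
    Returns the coefficients of the model calibrated closest to the current head location."""
    distances = [np.linalg.norm(np.asarray(loc) - np.asarray(head_location_cm))
                 for loc, _ in models]
    return models[int(np.argmin(distances))][1]
```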

The other approach to estimating a user's gaze at arbitrary locations is by using angles. Instead of mapping the eye vector data to screen coordinates, it can be mapped to the angles at which the eye must be oriented when looking at specified points on the screen. For this calibration, the subject must face the camera at a known distance with his eyes aligned with the origin of the camera. The horizontal and vertical distances of each point on the screen with respect to the camera can be calculated. Subsequently, using these distances and the known distance from the camera, the horizontal and vertical angle of the eyes can be calculated with the following formula:

γ = tan^(-1)(point distance / screen distance). (8)

In this equation, γ is the angle of the eye, point distance is the relative distance between a point and the camera and screen distance is the distance between the subject and the screen. The calibration process for these angles would be the same as the calibration discussed in section 3.3. Having found a set of parameters that accurately converts eye vector data to horizontal and vertical angles of the eyes, these parameters can be used to compute angles for newly extracted eye vectors. Using these angles and the estimated distance of a subject's face, the horizontal and vertical distances can be calculated using:

point distance = tan(γ) ∗ screen distance. (9)

Combining these estimated horizontal and vertical distances with the displacement of the face with respect to the camera origin, an estimation of the combined gaze can be made. Since only one model has to be calibrated when using this approach, it seems like a more attractive method than the one discussed in the previous paragraph.
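Equations 8 and 9 can be sketched as a pair of small helpers; the names are illustrative and, as noted below, this approach was not implemented:

```python
import numpy as np

def eye_angle(point_offset_cm, subject_distance_cm):
    """Equation 8: the angle of the eye for a point at a given offset from the camera."""
    return np.arctan2(point_offset_cm, subject_distance_cm)

def point_offset(gamma, subject_distance_cm):
    """Equation 9: the on-screen offset corresponding to a fitted eye angle."""
    return np.tan(gamma) * subject_distance_cm
```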

Regretfully, neither of the two discussed methods has been implemented. The estimation of the location of a subject's head and the eye gaze estimation have, however, been implemented separately. In the following section, both of these separate parts will be evaluated.

4 Results

For both of the parts of the application, the one estimating head location and the one estimating eye gaze, the same method of evaluating has been used. This method is the mean squared error.
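A minimal sketch of this measure; whether the per-point deviation is taken as the Euclidean distance between the shown and the estimated point is an assumption here:

```python
import numpy as np

def mean_squared_error(actual_points_cm, estimated_points_cm):
    """Mean of the squared deviations between shown and estimated points, in cm."""
    actual = np.asarray(actual_points_cm, dtype=float)
    estimated = np.asarray(estimated_points_cm, dtype=float)
    deviations = np.linalg.norm(actual - estimated, axis=1)   # per-point deviation
    return float(np.mean(deviations ** 2))
```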

To test the head location estimation, a set of eight points was sequentially shown on the screen. For each of the shown points, the user had to position his head so that the point on the screen aligned with the user's eyes. Note again that it is assumed that the user does not rotate his head in any way. The deviations between the actual points on the screen and the estimated points were squared, summed and then averaged by dividing by eight. This resulted in a mean squared error of 7.43 centimeters.

The evaluation of the eye gaze estimation is performed similarly to that of the head location estimation. Again, just as with the head location estimation, eight points were shown on the screen. Now, instead of moving the head to align with the points on the screen, the user had to move his eyes to look at the points while keeping his head still in one position. The found deviations between the real points and the estimated points were squared, summed and averaged. This resulted in a mean squared error of 27.13 centimeters. It should be noted that, due to the unstable detection of the right pupil by CrowdSight, only the left eye vector has been used in the evaluation.

4.1 Discussion

The achieved accuracy for the head location estimation is not optimal, but it could have been worse. Because only one camera was used, the face detection deviates slightly as a face moves closer towards the side edges of the screen. This slight deviation in the detection phase might account for the mean squared error of 7.43 centimeters in the estimation.

The mean squared error of the eye gaze estimation was higher than that of the head location estimation. There are several possible causes for it being as high as it is. Firstly, it is possible that a first order polynomial is not a good enough fit for this specific task. Also, the fact that the coefficients have been fitted for one specific head location means that if a user's location varies even slightly from the calibration location, the accuracy will be considerably lower. Lastly, the way the eye corner detection is implemented makes it vulnerable to inaccurate detections, which result in continuously faulty detections because of the template adaptation. A more consistently stable approach could be used instead of the current method.

5 Conclusion

An application of gaze estimation has been implemented and discussed in this report. As mentioned before, this application has unfortunately not reached its originally intended form. However, it does consist of major components that, combined, would have achieved the original goal. These components are a head location estimation component and an eye gaze estimation component.

5.1 Future Work

The head location estimation component works with an acceptable accuracy, whereas the eye gaze estimation is less accurate. To improve the eye gaze estimation, there are several options. Firstly, a more complex fitting polynomial could be applied in the calibration procedure, such as a second order polynomial. Also, to acquire a stable eye corner detection, the level of curvature could be used. Naturally, the eyelid edges and eye corners will have a higher level of curvature than their surroundings. By finding the highest curvature intersecting with a corner interest point, the eye corners could be detected more reliably. Lastly, the two methods described in section 3.4 could be implemented to achieve a more robust and more functional gaze estimation application.


References

Neil Anthony Dodgson. Variation and extrema of human interpupillary distance. In Proceedings of SPIE, volume 5291, pages 36–46, 2004.

Gabriele Fanelli, Juergen Gall, and Luc Van Gool. Real time head pose estimation with random regression forests. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 617–624. IEEE, 2011.

Basilio Noris, Karim Benmachiche, and Aude Billard. Calibration-free eye gaze direction detection with gaussian processes. In VISAPP (2), pages 611–616, 2008.

Kar-Han Tan, David Kriegman, and Narendra Ahuja. Appearance-based eye gaze estimation. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision (WACV 2002), pages 191–195. IEEE, 2002.

Xiaoyang Tan and Bill Triggs. Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing, 19(6):1635–1650, 2010.

Dong Hyun Yoo and Myung Jin Chung. Non-contact eye gaze estimation system using robust feature extraction and mapping of corneal reflections. In Proceedings of the International Conference on Advanced Robotics, Coimbra, Portugal, volume 6, 2003.
