Real-time self-generated motion parallax on a personal mobile display

Remi van Veen

June 8, 2016

Supervisor(s): dr. R.G. Belleman

Informatica, Universiteit van Amsterdam

Abstract

Using computer graphics it is possible to render virtual 3D scenes to a 2D display surface by applying geometric projections. This method uses a model that includes a virtual camera to generate a view that, for best results, approximates the location of the observer. For a more realistic projection, the projection method would need to take into account the position of the observer so that the view changes with the observer's position. This thesis describes the design and implementation of a projection method that does this. In addition to this, the 2D display surface used is mobile, and its position and orientation with respect to a real-world 3D object are also taken into account. This way, a novel illusion appears to the observer of having a window into a virtual 3D world. An Android tablet is used as the mobile display, of which the front-facing and rear camera are used to track the position of the observer and the 3D object. The observer is tracked using eye tracking, while the 3D object is tracked using augmented reality markers. Based on the position of the observer, the virtual camera projection is changed, while based on the position and orientation of the tablet with respect to the 3D object, the virtual 3D scene is transformed. The realism of the illusion is measured by comparing the observer's viewing angle to the display with the virtual camera's viewing angle into the 3D scene. The results show accuracy to a few degrees, which could in the future be improved by using a better camera and more powerful hardware.

Contents

1 Introduction
1.1 Outline

2 Related work
2.1 The ‘Dead Cat Demo’
2.2 Wii remote head tracking
2.3 Other work

3 Design

4 Implementation
4.1 Eye localization
4.2 Eye tracking
4.2.1 Validation
4.3 Real-world eye position estimation
4.4 Virtual camera projection
4.4.1 Virtual camera planes
4.4.2 Virtual camera position
4.4.3 Off-center projection matrix
4.5 Marker tracking
4.6 Tablet camera position correction
4.6.1 Front-facing camera
4.6.2 Rear camera

5 Experiments
5.1 Virtual camera projection correctness
5.2 Performance

6 Results

7 Discussion
7.1 Possible applications
7.2 Limitations
7.2.1 Freedom of movement
7.2.2 Device compatibility
7.3 Future work
7.3.1 Virtual camera projection correctness
7.3.2 See-through display illusion
7.3.3 Markerless object detection

8 Conclusions

1 Introduction

Using computer graphics it is possible to render virtual 3D scenes to a 2D display surface by applying geometric projections. This method uses a model that includes a virtual camera to generate a view that, for best results, approximates the location of the observer. Because the exact location of the observer is unknown, this projection method assumes it to be static. For a more realistic projection, the projection method would need to take into account the position of the observer so that the view changes with the observer’s position. This would create the effect that is commonly referred to as “motion parallax”.

If in addition to this the 2D display surface would be mobile, and its position and orientation with respect to a real-world 3D object would also be taken into account, a novel illusion would be generated in which, in theory, a realistic illusion would appear to the observer of having a window into a virtual 3D world. Figure 1.1 illustrates a possible setup of an observer holding a mobile 2D display surface in front of a 3D object.

Figure 1.1: Illustration of an observer holding a mobile 2D display surface in front of a 3D object. Images derived from an original image by Michael Scarpa [22].

This thesis describes the design and implementation of this display method. An Android tablet is used as the mobile 2D display surface on which a virtual 3D scene will be rendered. The tablet is required to have a front-facing and rear camera, which can be used to track the observer's position and the position and orientation of the tablet with respect to the 3D object. Suitable tracking methods are needed, as well as a projection method that takes these factors into account.

1.1 Outline

In Chapters 2 and 3, related work and the intended design of the implementation are discussed. In Chapter 4, the way the observer and the 3D object are tracked is described, as well as a projection method that takes these factors into account. In Chapter 5, the correctness of the projection method and the overall performance of the implementation are measured. In Chapter 6, the display method is demonstrated. In Chapter 7, possible applications and future work are discussed. Finally, conclusions are drawn in Chapter 8.

2 Related work

2.1 The ‘Dead Cat Demo’

In 2003, researchers of the Informatics Institute of the University of Amsterdam developed a demonstration that is very close to the goal of this thesis [22]. Trackers were placed on a mobile display, a physical object and the observer's head, which were then tracked by a tracking subsystem. Based on the positions of these three elements, a visualization was rendered by another subsystem and streamed to the display. This resulted in a convincing illusion of having a window into a virtual 3D world, but it mostly attracted attention because it demonstrated that it is possible to distribute interactive graphics applications over a high-speed/low-latency wide-area network. For the demonstration a preserved panther cub was used as the physical object onto which a CT scan was rendered: hence the name ‘Dead Cat Demo’. Figure 2.1 shows the setup of the demonstration. While an article about the performance of the networking aspect of the project has been published, a paper on the project itself was never finished.

2.2 Wii remote head tracking

A well-known project that implemented a display method rendering a 3D scene on a 2D display with respect to the position of the observer is Johnny Lee's Wii remote head tracking, published in 2008 [15]. By tracking two infrared LEDs mounted to a pair of glasses using the Wii remote's infrared camera, an illusion of looking through a virtual window instead of looking at a flat screen appeared to the observer. The Wii remote was attached to a common personal computer, on which the 3D scene was rendered. Besides the source code, Lee did not publish a comprehensive description of the techniques used to achieve these results. However, in 2008 Kevin Hejn and Jens Peter Rosenkvist analyzed and described Lee's work in their paper about Wii remote head tracking [11].

Lee’s Wii remote head tracking provided a source of inspiration for projects that aimed to achieve similar results. For example, Jens Garstka and Gabriele Peters published an article that described a project that used the same virtual camera transformations as the Wii remote head tracking method, but used a Microsoft Kinect motion controller to track the observer’s position [8].

2.3 Other work

As early as 1979, Brian Rogers and Maureen Graham published a paper about displaying 3D motion parallax on a 2D display surface [21]. Patterns were displayed on an oscilloscope, while the observer's head rested on a chinrest. If the observer moved around the display surface, the chinrest would move as well, allowing the observer's position to be tracked. This way, movement of the chinrest produced relative movement between the dots on the oscilloscope face, simulating the parallax produced by a 3D surface [21].

Jun Rekimoto's head tracker for virtual reality from 1995 already used the same projection method as Johnny Lee's Wii remote head tracking [20]. For this project, the observer was tracked using background subtraction applied to common webcam images. The goal of this project was to improve the human ability to understand complex 3D structures presented on a 2D computer screen [20].

A paper by Mark Hancock et al. investigated the ability of participants to identify projections of virtual scenes that most closely match an equivalent real scene [9]. A similar paper by Frank Steinicke et al. studied people's ability to judge virtual object orientations under different projection conditions on a tabletop display [24]. These papers provide a psychophysical background on the subject of displaying 3D objects on 2D display surfaces.

3 Design

As described in the introduction, an Android tablet is used as the 2D display surface on which a virtual 3D scene will be rendered. The Android tablet is required to have a front-facing and a rear camera, so the observer and a real-world object can be tracked. The ‘Dead Cat Demo’ discussed in Section 2.1 is very close to the goal of this project, but used multiple subsystems which caused the setup to be very expensive and not very portable. For this project, the tracking, as well as the rendering of the 3D scene, will be done by the tablet itself.

To track the observer's position, the eyes will be tracked. This is similar to Johnny Lee's tracking of two infrared LEDs mounted to a pair of glasses [15], but does not require this extra hardware. Also, the regular front-facing camera of the Android tablet can be used, while Lee used the infrared camera of a Wii remote. The real-world object will be tracked using an augmented reality marker, as augmented reality markers are flat images with easily trackable features. Figure 3.1 shows the setup of the demonstration.

The idea is that first the real-world position and orientation of the augmented reality marker with respect to the tablet are estimated and the virtual objects in the 3D scene are transformed accordingly. Next, the position of the observer with respect to the tablet is estimated and this position is taken into account by the virtual camera projection. This way, the projection based on the observer’s position and the 3D scene transformation based on the position and orientation of the tablet to the augmented reality marker are separated, so they can be implemented and used independently of each other.

Figure 3.1: Setup of the mobile window into a virtual 3D world demonstration. Image derived from an original image by Michael Scarpa [22].
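To make the intended separation concrete, one frame of the pipeline can be outlined as follows. This is an illustrative sketch in Python rather than the actual Unity3D implementation described in Chapter 4, and the four callables (track_marker, track_observer, transform_scene, set_projection) are hypothetical placeholders for the components developed there.

```python
def update_frame(rear_frame, front_frame,
                 track_marker, track_observer, transform_scene, set_projection):
    """One iteration of the display pipeline: marker pose first, observer second.

    The marker pose only transforms the virtual objects and the observer
    position only changes the virtual camera projection, so the two aspects
    stay independent and can also be used on their own.
    """
    pose = track_marker(rear_frame)          # position/orientation of the marker
    if pose is not None:
        transform_scene(pose)                # translate/rotate/scale the 3D scene

    observer = track_observer(front_frame)   # observer position via eye tracking
    if observer is not None:
        set_projection(observer)             # off-center projection (Chapter 4)
```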

4 Implementation

This project is developed using Unity3D [25]. Unity3D is a multi-platform game engine in which a 3D scene can easily be built up using a drag-and-drop interface. A virtual camera is used to render the 3D scene, to which scripts can be attached that implement the virtual camera's projection method.

For the implementation of the eye tracking aspect of the project, OpenCV is used [19]. OpenCV is an open-source computer vision library that contains many well-known computer vision algorithms. Using OpenCV, the image pre-processing and computer vision algorithms needed to implement eye tracking do not have to be implemented from scratch. To use OpenCV with Unity3D, the OpenCV for Unity plugin is used [18].

The augmented reality marker tracking aspect of the project is implemented using Vuforia, a widely used augmented reality framework that can be integrated into Unity3D [28].

4.1 Eye localization

To track the observer’s position, the observer’s eyes are tracked using the tablet’s front-facing camera. To be able to do this, first the initial positions of the eyes have to be localized. Because the eyes are only relatively small points in the analyzed camera frame, the camera frame is reduced to specific regions of interest [29] [32]. Figure 4.1 illustrates this. First, the image is reduced to the observer’s face using face detection. To do this, the Viola-Jones Haar-like feature-based cascade classifier is used [27]. This classifier uses Haar-like feature models that are trained using positive and negative example images, from which features are extracted that are used by the classifier to detect similar objects. The classifier indicates a rectangular region in the analyzed image in which a face is located. If multiple faces are detected, the face that is the closest to the camera, and thus occupies the largest area in the image, is analyzed.

After obtaining a rectangular face region, the bottom half and the upper quarter of this region are removed and the region is slightly narrowed, as based on the human face proportions, the eyes should be located in this region [13]. The remaining region is split vertically, as there is one eye at each half of the region. Finally, eye detection is performed on the two remaining regions of interest. Again, a cascade classifier is used, this time using a Haar-like feature model trained for detecting eyes. As the algorithm indicates rectangular regions in the image in which the eyes are located, the exact eye positions are deduced by taking the center points of these rectangles. By performing the eye detection on two very specific regions of interest, the chance to obtain false positives is reduced to nearly zero.

Pre-trained Haar-like feature models, including models for face and eye detection, are widely available online [17]. As the model used for detecting faces is trained for detecting frontal faces, the observer should look straight at the camera.
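The thesis implements this step as a Unity3D script through the OpenCV for Unity plugin; the sketch below shows the same two-stage localization in plain Python with OpenCV, assuming the pre-trained haarcascade_frontalface_default.xml and haarcascade_eye.xml files are available locally. The exact region proportions are illustrative approximations of the face-proportion heuristic described above, not the values used in the thesis.

```python
import cv2

face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier("haarcascade_eye.xml")

def localize_eyes(gray):
    """Return the (x, y) centers of the two eyes in a grayscale image, or None."""
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Analyze the face that occupies the largest area (closest to the camera).
    fx, fy, fw, fh = max(faces, key=lambda f: f[2] * f[3])
    # Keep roughly the second quarter of the face height and narrow it slightly:
    # based on human face proportions, the eyes should lie in this region.
    top, bottom = fy + fh // 4, fy + fh // 2
    left, right = fx + fw // 8, fx + 7 * fw // 8
    mid = (left + right) // 2
    centers = []
    for x0, x1 in ((left, mid), (mid, right)):      # one eye in each half
        roi = gray[top:bottom, x0:x1]
        eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=3)
        if len(eyes) == 0:
            return None
        ex, ey, ew, eh = eyes[0]
        # The classifier returns a rectangle; its center is taken as the eye position.
        centers.append((x0 + ex + ew // 2, top + ey + eh // 2))
    return centers
```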


(a) Analyzed image. (b) Detected face rectangle. (c) Estimated eye region.

(d) Split eye region. (e) Detected eye rectangles. (f) Deduced eye positions.

Figure 4.1: Step-by-step illustration of the eye detection process. Analyzed image taken from the BioID Face Database [30].

4.2 Eye tracking

Using the eye localization algorithm on every camera image to track the observer's eyes is not an option, as this would require the observer to look straight at the camera all the time. This would mean that if the observer's face is slightly tilted or rotated, the eyes would not be detected anymore. Also, eye localization is computationally expensive, as first a face is detected using a classifier, and then two eyes are detected using another classifier.

Instead, the pyramidal Lucas-Kanade tracking algorithm is used [3] [33]. This algorithm does not search for a frontal face but simply tries to track the positions of two points, the eyes, in the stream of camera images. Given a camera frame and a set of points in this image that should be tracked, the pyramidal Lucas-Kanade tracking algorithm returns the new positions of these points in the next camera image by tracking a window of pixels around the points. As described in the previous section, the initial positions of the points that should be tracked are determined by eye localization. The output of the tracking algorithm is then used for tracking the eyes in the next image, and so on. To improve the results, the camera frames are preprocessed by applying histogram equalization, which reduces the effect of fluctuating lighting conditions [33] [12].
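A minimal sketch of this tracking step with OpenCV in Python follows (the thesis itself uses the OpenCV for Unity plugin inside Unity3D); the window size, pyramid depth and termination criteria are illustrative values, not necessarily those used in the thesis.

```python
import cv2
import numpy as np

LK_PARAMS = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def preprocess(frame):
    """Grayscale + histogram equalization to reduce lighting fluctuations."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.equalizeHist(gray)

def track_eyes(prev_gray, next_gray, prev_points):
    """Track the two eye points from the previous frame into the next frame."""
    pts = np.array(prev_points, dtype=np.float32).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                      pts, None, **LK_PARAMS)
    if status is None or not status.all():
        return None                      # a point was lost; fall back to localization
    return next_pts.reshape(-1, 2)       # new (x, y) positions of both eyes
```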

4.2.1 Validation

The pyramidal Lucas-Kanade tracking algorithm always returns a new position for a tracked point, even if the point is not visible to the camera anymore. Also, very fast movement of the observer can cause the algorithm to return wrong results. To detect wrong results, cross-checking is applied by tracking the points forwards and backwards in time [1]. After obtaining the new position of a tracked point in a new camera image, the tracking process is applied backwards. The new camera image with the new point that is determined by the algorithm is given as input, and the position of this point in the previous camera image is determined by the algorithm. The correct position of the tracked point in the previous camera image is already known and can thus be compared to the resulting point. By calculating and thresholding the Euclidean distance between the resulting point and the already known point, it is possible to determine if the result is correct. If the result is incorrect, the tracking is stopped and the process starts over by performing eye localization again. This validation step reduces the performance of the eye tracking, as the tracking algorithm has to be run twice for every camera frame.
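The forward-backward check can be layered on top of the track_eyes sketch above; the 2-pixel threshold is an arbitrary illustrative choice, as the thesis does not state which threshold it uses.

```python
import numpy as np

def track_eyes_validated(prev_gray, next_gray, prev_points, max_dist=2.0):
    """Cross-check the Lucas-Kanade result by tracking forwards and backwards."""
    forward = track_eyes(prev_gray, next_gray, prev_points)
    if forward is None:
        return None
    # Track the new points backwards into the previous frame...
    backward = track_eyes(next_gray, prev_gray, forward)
    if backward is None:
        return None
    # ...and compare them to the already known positions in the previous frame.
    dist = np.linalg.norm(np.asarray(prev_points, dtype=np.float32) - backward, axis=1)
    if (dist > max_dist).any():
        return None   # inconsistent result: stop tracking and redo eye localization
    return forward
```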

4.3 Real-world eye position estimation

When the observer’s eye position in a camera image is known, it is used to change the projection of the virtual camera. To do this, the real-world eye position is estimated based on the eye coordinates in the camera image. This is done using three phases, in a very similar way to Johnny Lee’s estimation of the real-world position of the tracked infrared LEDs [14].

First, the camera image coordinate system is changed. As the origin of a 3D scene is at the center of the screen, the origin of the camera image should also be at the center. The origin of a camera image, however, is at the top left. Transforming a point to the new coordinate system is a simple but important step and is done as follows:

$x' = x - \frac{width}{2} \qquad y' = y - \frac{height}{2}$

Here, width and height represent the camera image resolution in pixels.

Next, the observer’s distance to the camera is estimated. This distance is used to move the virtual camera along the z-axis, and is also needed to estimate the observer’s horizontal and vertical position to the camera. To determine the observer’s distance to the camera, Lee uses the known real-world distance between the two infrared LEDs that are being tracked. The same method can be used with eye tracking, using the average human interpupillary distance of 63 millimetres [5]. Besides the interpupillary distance, also the angle between the eyes and the camera is needed. Using the horizontal field of view of the camera, the angle per horizontal pixel from the center of the camera frame can be calculated. This is done as follows:

$\text{radians per pixel}_{horizontal} = \frac{fov_{horizontal} \cdot \pi / 180}{width}$

Here, $fov_{horizontal}$ is the horizontal field of view of the camera in degrees and $width$ is the width of the camera image in pixels. By multiplying this value with the number of pixels between the center of the camera image and a certain point, the angle between the camera and that point is estimated. Similarly, by multiplying this value with the Euclidean distance in pixels between the two tracked eye points in the camera image, the angle between these points and the camera is estimated [11]. Taking half of this angle and the center point between the eyes, so a right-angled triangle is obtained, the distance in millimetres from the camera to the center point can be estimated using the tangent:

$\text{eye distance}_{pixels} = \sqrt{(\text{left eye}_x - \text{right eye}_x)^2 + (\text{left eye}_y - \text{right eye}_y)^2}$

$\text{angle} = \text{eye distance}_{pixels} \cdot \text{radians per pixel}_{horizontal}$

$\text{distance}_{mm} = \frac{\text{interpupillary distance} / 2}{\tan(\text{angle} / 2)}$

Here, $\text{left eye}$ and $\text{right eye}$ are the tracked eye points in the camera image.

Figure 4.2 illustrates this principle. As the figure shows, this method assumes that the observer is right in front of the camera. In practice, this is not always the case, which would mean that using the center point between the eyes would not yield a right-angled triangle. This assumption thus simplifies the calculations significantly, while the results are still accurate.


Figure 4.2: Illustration of the calculation of the observer’s distance to the camera (top-down view). Image derived from an original image by Kevin Hejn and Jens Peter Rosenkvist [11].

Finally, using the estimated distance to the camera, the observer’s horizontal and vertical position to the camera is estimated. This position is used to move the virtual camera along the x-axis and y-axis. To estimate this position, the center point between the two tracked eye points is used again. To calculate the horizontal position to the camera, the angle between this point and the camera is again calculated by multiplying the angle per horizontal pixel by the horizontal distance in pixels between the point and the center of the camera frame. Next, using the calculated distance of the observer and a sine calculation, the observer’s horizontal position to the camera is estimated:

$\text{eye center}_x = \frac{\text{left eye}_x + \text{right eye}_x}{2}$

$\text{angle}_{horizontal} = \text{eye center}_x \cdot \text{radians per pixel}_{horizontal}$

$\text{observer}_x = \sin(\text{angle}_{horizontal}) \cdot \text{distance}_{mm}$

To calculate the vertical angle between the camera and the observer, so the observer’s vertical position to the camera can be estimated, Lee again uses the horizontal field of view of the camera [14]. However, to calculate this angle, the vertical distance in pixels between the camera frame center and the center point between the eyes is used. Therefore, using the vertical field of view of the camera leads to more accurate results:

$\text{radians per pixel}_{vertical} = \frac{fov_{vertical} \cdot \pi / 180}{height}$

$\text{eye center}_y = \frac{\text{left eye}_y + \text{right eye}_y}{2}$

$\text{angle}_{vertical} = \text{eye center}_y \cdot \text{radians per pixel}_{vertical}$

$\text{observer}_y = \sin(\text{angle}_{vertical}) \cdot \text{distance}_{mm}$

Here, $fov_{vertical}$ is the vertical field of view of the camera in degrees and $height$ is the height of the camera image in pixels.
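Putting the three phases together, the estimation can be expressed as a single function. The sketch below is a plain-Python rendering of the formulas above (the thesis implements them in a Unity3D/C# script); the field-of-view arguments depend on the camera used, and 63 mm is the average interpupillary distance cited in the text.

```python
import math

def estimate_observer_position(left_eye, right_eye, width, height,
                               fov_h_deg, fov_v_deg, ipd_mm=63.0):
    """Estimate the observer's real-world position relative to the camera (mm).

    left_eye / right_eye are tracked eye points in camera image pixels,
    width / height the camera resolution, fov_h_deg / fov_v_deg the camera's
    horizontal and vertical field of view in degrees.
    """
    # Phase 1: move the image origin from the top left to the image center.
    lx, ly = left_eye[0] - width / 2.0, left_eye[1] - height / 2.0
    rx, ry = right_eye[0] - width / 2.0, right_eye[1] - height / 2.0

    rad_per_px_h = math.radians(fov_h_deg) / width
    rad_per_px_v = math.radians(fov_v_deg) / height

    # Phase 2: distance to the camera from the angle subtended by the two eyes.
    eye_dist_px = math.hypot(lx - rx, ly - ry)
    angle = eye_dist_px * rad_per_px_h
    distance_mm = (ipd_mm / 2.0) / math.tan(angle / 2.0)

    # Phase 3: horizontal and vertical position from the eye center point.
    center_x, center_y = (lx + rx) / 2.0, (ly + ry) / 2.0
    observer_x = math.sin(center_x * rad_per_px_h) * distance_mm
    observer_y = math.sin(center_y * rad_per_px_v) * distance_mm
    return observer_x, observer_y, distance_mm
```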

4.4 Virtual camera projection

After estimating the real-world position of the observer to the camera, the virtual camera projection is changed accordingly.

4.4.1 Virtual camera planes

A virtual camera uses two planes, the near plane and the far plane, to bound the visible part of a 3D scene. The near plane is the closest plane visible to the camera, so objects that are in front of the near plane are not rendered. Similarly, the far plane is a plane that defines at what distance objects are no longer rendered. Figure 4.3 illustrates the near and far planes of a virtual camera. The position of the near plane is an important aspect when changing the virtual camera projection based on the observer's position in a realistic way.

Figure 4.3: Near and far plane of a virtual camera [16].

In the coordinate system of a virtual 3D scene, the near plane is a unit square that is scaled to the aspect ratio of the screen. As the origin of the coordinate system is at the center of the screen, the top left of the near plane is located at $(-\frac{1}{2} \cdot ratio, \frac{1}{2})$ and the bottom right of the near plane is located at $(\frac{1}{2} \cdot ratio, -\frac{1}{2})$. Figure 4.4 illustrates this by surrounding a 3D scene with a unit cube that is scaled to the screen aspect ratio, showing that at the near plane, the boundaries of the cube are exactly at the edges of the screen. This property is used when transforming the virtual camera projection based on the observer's position.

Figure 4.4: Unit cube scaled to the screen aspect ratio surrounding a 3D scene, showing that at the near plane, the boundaries of the cube are exactly at the edges of the screen.

4.4.2 Virtual camera position

The first step of changing the virtual camera projection based on the observer's position is changing the position of the virtual camera in the 3D scene. First, the estimated real-world position of the observer is scaled to the height of the screen that is used:

$camera_x = \frac{observer_x}{\text{screen height}_{mm}} \qquad camera_y = \frac{observer_y}{\text{screen height}_{mm}} \qquad camera_z = \frac{distance_{mm}}{\text{screen height}_{mm}}$


As the near plane of the virtual camera is a unit square that is scaled to the screen aspect ratio, this way the observer’s estimated real-world coordinates are translated to the coordinate system of the 3D scene [14]. Therefore, they can directly be used to change the position of the virtual camera.

4.4.3 Off-center projection matrix

Figure 4.5a shows the effect of changing the position of the virtual camera when the observer moves to the bottom left of the screen. The bottom left of the scaled cube is now very prominently visible, and the edges of the cube are no longer at the edges of the screen. This effect is not very realistic: moving to the bottom left of a real window would mean that the bottom left of the cube should barely be visible, and the cube's edges would still be at the edges of the real window.

To keep the cube’s edges fixed at the edges of the screen, the near plane of the camera should stay fixed, while the way the virtual camera looks through the near plane, and thus the visible field of the 3D scene, should change. Figure 4.6 illustrates this. To achieve this effect, an off-center projection matrix is used. Using an off-center projection matrix, the perspective’s vanishing point is not necessarily in the center of the screen, which results in a realistic perspective based on the observer’s point of view [16]. Based on the position of the near plane and far plane, the off-center projection matrix is calculated as follows:

$$P = \begin{pmatrix}
\frac{2 \cdot near_z}{near_{right} - near_{left}} & 0 & \frac{near_{right} + near_{left}}{near_{right} - near_{left}} & 0 \\
0 & \frac{2 \cdot near_z}{near_{top} - near_{bottom}} & \frac{near_{top} + near_{bottom}}{near_{top} - near_{bottom}} & 0 \\
0 & 0 & -\frac{far_z + near_z}{far_z - near_z} & -\frac{2 \cdot near_z \cdot far_z}{far_z - near_z} \\
0 & 0 & -1 & 0
\end{pmatrix}$$

Here, $near_z$ is the virtual camera's distance to the near plane and $far_z$ is the distance to the far plane. $near_{left}$, $near_{right}$, $near_{top}$ and $near_{bottom}$ define the boundaries of the near plane. These boundaries are relative to the virtual camera position and are calculated as follows [14]:

$near_{left} = \frac{near_z \cdot (-\frac{1}{2} \cdot ratio + camera_x)}{camera_z} \qquad near_{right} = \frac{near_z \cdot (\frac{1}{2} \cdot ratio + camera_x)}{camera_z}$

$near_{top} = \frac{near_z \cdot (\frac{1}{2} - camera_y)}{camera_z} \qquad near_{bottom} = \frac{near_z \cdot (-\frac{1}{2} - camera_y)}{camera_z}$

As the near plane is a unit square that is scaled to the screen aspect ratio, $-\frac{1}{2} \cdot ratio$ and $\frac{1}{2} \cdot ratio$ are used when calculating the left and right boundaries. Figure 4.5b shows the effect of the camera movement and the off-center projection matrix when the observer moves to the bottom left of the screen. It is clearly visible that the position of the near plane now stays fixed, as the cube's boundaries are still exactly at the screen's edges. Also, the bottom left of the cube is barely visible anymore, which is the correct projection when the observer moves to the bottom left of the screen.
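In the implementation this matrix is built in a Unity3D script and assigned to the virtual camera; the sketch below reproduces the same construction with NumPy so the entries can be inspected outside Unity. The near_z and far_z defaults are arbitrary illustrative values, and the observer position is first scaled to screen-height units as in Section 4.4.2.

```python
import numpy as np

def off_center_projection(observer_x_mm, observer_y_mm, distance_mm,
                          screen_height_mm, ratio, near_z=0.1, far_z=100.0):
    """Off-center projection matrix for an observer at the given position."""
    # Section 4.4.2: scale the observer position to screen-height units,
    # matching the unit-square near plane of the virtual camera.
    cam_x = observer_x_mm / screen_height_mm
    cam_y = observer_y_mm / screen_height_mm
    cam_z = distance_mm / screen_height_mm

    # Section 4.4.3: near plane boundaries relative to the camera position.
    left = near_z * (-0.5 * ratio + cam_x) / cam_z
    right = near_z * (0.5 * ratio + cam_x) / cam_z
    top = near_z * (0.5 - cam_y) / cam_z
    bottom = near_z * (-0.5 - cam_y) / cam_z

    return np.array([
        [2 * near_z / (right - left), 0, (right + left) / (right - left), 0],
        [0, 2 * near_z / (top - bottom), (top + bottom) / (top - bottom), 0],
        [0, 0, -(far_z + near_z) / (far_z - near_z), -2 * near_z * far_z / (far_z - near_z)],
        [0, 0, -1, 0],
    ])
```

In the Unity implementation, the equivalent matrix would be assigned to the virtual camera's projection matrix, with the camera itself translated to the scaled position $(camera_x, camera_y, camera_z)$.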

(a) Virtual camera position changed only. (b) Virtual camera position changed and off-center projection matrix applied.

Figure 4.5: Unit cube scaled to the screen aspect ratio surrounding a 3D scene, to illustrate the effect of the different camera transformations based on the observer’s position.

Figure 4.6: Change in visible field when the virtual camera moves, while the near plane stays fixed [11].

4.5 Marker tracking

Finally, the marker tracking aspect of the project is implemented, so the virtual objects in the 3D scene are transformed based on the real-world position and orientation of an augmented reality marker. After a marker image is loaded into Vuforia, the platform tries to detect and track the marker in the rear camera stream; while the marker is tracked, its position and orientation are estimated.

By default, Vuforia changes the virtual camera projection based on the position and orientation of the detected marker. As discussed in Chapter 3 however, the 3D scene itself should be transformed, so the marker tracking aspect is separated from the virtual camera projection based on the observer's position. This is achieved by setting Vuforia's world center mode parameter to camera. To take the real-world size of the marker into account, so the virtual objects are scaled correctly to the marker size, the Vuforia marker width parameter is set to 1/10 of the marker width in centimeters.

4.6 Tablet camera position correction

4.6.1 Front-facing camera

When determining the observer’s horizontal and vertical position to the camera, the calculations assume that the camera is centered. In practice this is not the case, as this would mean that the camera is positioned in the center of the screen. To account for this, the near plane dimensions are used again.

If the camera is not horizontally centered when holding the tablet in landscape mode, its horizontal distance to the center of the screen is determined. Next, using the width of the screen and the fact that the near plane is a unit square that is scaled to the aspect ratio of the display, this distance is translated to the coordinate system of the 3D scene:

$correction_x = \frac{\text{horizontal camera distance to display center}_{mm}}{\text{screen width}_{mm}} \cdot ratio$

This way, the distance of the camera to the center of the near plane in 3D scene coordinates is calculated. If the camera is closest to the left edge of the screen, and thus closest to the left edge of the near plane, this value is added to the estimated horizontal position of the observer. If the camera is closest to the right edge of the screen, this value is subtracted from the estimated horizontal position of the observer.

Similarly, if the camera is not vertically centered when holding the tablet in landscape mode, this is corrected using its vertical distance to the center of the screen:

$correction_y = \frac{\text{vertical camera distance to display center}_{mm}}{\text{screen height}_{mm}}$

Again, if the camera is closest to the bottom edge of the screen, and thus closest to the bottom edge of the near plane, this value is added to the estimated vertical position of the observer. If the camera is closest to the top edge of the screen, this value is subtracted from the estimated vertical position of the observer.

When the front-facing camera of a tablet is not horizontally centered, the lens is sometimes tilted so the camera’s field of view is closer to the field of view of a centered camera. Because of this, the orientation of the camera is not the same as the orientation of the tablet. On the Nvidia Shield Tablet used for this project, the front-facing camera is tilted to the top-right. When the observer moves closer to the screen, this tilt causes it to look like the observer moves towards the camera. This is corrected by subtracting the camera tilt in degrees from the estimated angles between the observer and the camera, as discussed in Section 4.3. As the exact camera tilt is not specified by tablet manufacturers, it has to be estimated. For the Nvidia Shield Tablet used for this project, the horizontal camera tilt has been estimated at 8.9 degrees, while the vertical camera tilt has been estimated at only 0.75 degrees.
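Both corrections are small arithmetic adjustments; a sketch, assuming a camera that sits closest to the left and bottom edges of the screen in landscape orientation and using the tilt values estimated above for the Nvidia Shield Tablet, could look like this (signs flip for a camera placed towards the other edges).

```python
import math

# Estimated lens tilt of the Nvidia Shield Tablet front camera (Section 4.6.1).
TILT_H_DEG = 8.9
TILT_V_DEG = 0.75

def corrected_angles(angle_h, angle_v):
    """Subtract the estimated camera tilt from the observer angles (in radians)."""
    return angle_h - math.radians(TILT_H_DEG), angle_v - math.radians(TILT_V_DEG)

def camera_offset_correction(cam_x, cam_y, dx_mm, dy_mm,
                             screen_width_mm, screen_height_mm, ratio):
    """Shift the scaled observer position by the camera's offset from the display
    center. dx_mm / dy_mm are the camera's horizontal and vertical distances to the
    display center; adding them assumes the camera is closest to the left and bottom
    edges of the screen."""
    correction_x = dx_mm / screen_width_mm * ratio
    correction_y = dy_mm / screen_height_mm
    return cam_x + correction_x, cam_y + correction_y
```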

4.6.2 Rear camera

The rear camera of Android tablets is often not centered either. As the virtual objects in the 3D scene are transformed according to the position and orientation of the tablet to the augmented reality marker, this should also be taken into account. If the marker is centered relative to the tablet’s screen, the 3D scene should be centered at the display as well. If the camera was centered, Vuforia would detect the marker in the center of the camera frame and thus put the 3D model at the center of the scene. With a camera that is positioned on the left, however, the marker would be detected more to the right of the camera frame. Again, this is solved by adding an offset to the horizontal translation of the 3D scene, depending on the rear camera placement.

5 Experiments

5.1 Virtual camera projection correctness

The virtual camera projection based on the observer’s position aims to create continuity between the real world and the virtual 3D scene. To determine the extent to which this continuity is achieved, the difference between the observer’s viewing angle to the display and the corresponding viewing angle into the 3D scene is calculated. This is done by moving a picture of a face around the display at a distance of 30 centimeters, which is a likely viewing distance for someone using a tablet. The real-world viewing angle is determined by calculating the angle between the observer and the center of the display of the tablet. The viewing angle into the 3D scene is determined by calculating the angle between the virtual camera and the center of the near plane.
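Concretely, the comparison described above can be computed as follows. This is an illustrative sketch (the thesis does not spell out the formulas): the real-world angle is taken from the known, physically measured position of the face picture, and the virtual angle from the camera position that the eye tracking pipeline produced in scene coordinates.

```python
import math

def real_world_viewing_angle(true_observer_x_mm, true_distance_mm):
    """Horizontal angle between the measured observer position and the display center."""
    return math.degrees(math.atan2(true_observer_x_mm, true_distance_mm))

def virtual_viewing_angle(camera_x, camera_z):
    """Horizontal angle between the virtual camera and the center of the near plane,
    using the scaled scene coordinates of Section 4.4.2."""
    return math.degrees(math.atan2(camera_x, camera_z))

def viewing_angle_error(true_observer_x_mm, true_distance_mm, camera_x, camera_z):
    """Absolute difference between the two viewing angles, in degrees."""
    return abs(real_world_viewing_angle(true_observer_x_mm, true_distance_mm)
               - virtual_viewing_angle(camera_x, camera_z))
```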

Figure 5.1 shows the average error and the standard deviation for horizontal viewing angles over 100 samples. As the results show, movement to the right (negative angles) causes the error values to increase steeply. Movement to the left, however, results in smaller error values. This is caused by the camera tilt discussed in Section 4.6.1. Because of this tilt, the orientation of the camera is not equal to the orientation of the tablet, adding an error to the estimation of the real-world position of the observer. While the camera tilt correction described in Section 4.6.1 improves the results, the way the camera looks into the real world still differs from what it would be if the orientation of the camera were equal to the orientation of the tablet. It should be noted that, as the exact camera tilt is not specified by tablet manufacturers, the camera tilt is estimated, and more accurate estimations might improve the results.

Figure 5.1: Error between the observer's real-world horizontal viewing angle and the virtual camera's horizontal viewing angle into the 3D scene (absolute error in degrees against real-world horizontal viewing angle in degrees).

In addition, the estimation of the observer's distance to the camera assumes that the observer is right in front of the camera. As discussed in Section 4.3, this simplifies the calculations significantly, but the results could improve by using a more accurate estimation.

Figure 5.2 shows the average error and the standard deviation for vertical viewing angles over 100 samples. As the results show, the vertical viewing angle of the virtual camera projection is accurate to a fraction of a degree. Similar to the horizontal viewing angle measurements, downward movement causes the error values to increase faster than upward movement, but the error values are a lot smaller than the horizontal viewing angle error values. This is caused by the fact that the vertical camera tilt is estimated at only 0.75 degrees, while the horizontal camera tilt is estimated at 8.9 degrees. This shows that the degree of camera tilt is an important factor for the viewing angle accuracy.

The standard deviations of the vertical viewing angle errors are a lot smaller than the standard deviations of the horizontal viewing angle errors. This is caused by the fact that the Viola-Jones cascade classifier used for eye localization, as discussed in Section 4.1, sometimes localizes the eyes left or right of the center of the pupil, causing the estimated horizontal position of the observer to shift right or left. The localized vertical position of the eyes is more stable, causing the estimated vertical position of the observer to be more stable as well.

Figure 5.2: Error between the observer's real-world vertical viewing angle and the virtual camera's vertical viewing angle into the 3D scene (absolute error in degrees against real-world vertical viewing angle in degrees).

5.2 Performance

When running the application, the Android tablet should process two camera sources, track objects in both camera sources and render a 3D scene all at the same time. These are all computationally intensive tasks, which means that processing power of the tablet is an important factor. In this section, the performance of the different aspects of the demonstration, separately and combined, is measured on high-end Android devices of four different generations. First, the 3D scene is rendered based on the observer’s position estimated using eye tracking. Next, the 3D scene is rendered based on the augmented reality marker position and orientation estimated using marker tracking. Finally, the 3D scene is rendered based on both aspects. However, as the Nvidia Shield Tablet is the only device used that supports the front-facing and rear camera to be active simultaneously, only the separate aspects are tested on the other devices. More on device compatibility is discussed in Section 7.2.2.

Both the eye tracking and marker tracking are performed on camera frames with a resolution of 640 by 360 pixels. As the results in Figure 5.3 show, marker tracking and rendering the 3D scene accordingly runs at 30 frames per second on all devices. The reason for this is that Vuforia has capped its frame rate at 30 frames per second as this is the maximum frame rate of the camera of most Android devices. Therefore, all devices perform marker tracking and render the 3D scene accordingly at the maximum possible frame rate.

The eye tracking frame rate is considerably lower, especially when running eye tracking and marker tracking simultaneously on the Nvidia Shield Tablet. However, the eye tracking performance increases quickly on newer generation devices. On the most recent generation device, the eye tracking frame rate also reaches 30 frames per second, which is the maximum frame rate of the device's front-facing camera. This means that on new generation devices that support the front-facing and rear camera to be active simultaneously, the application could run at frame rates close to the maximum possible frame rate of 30 frames per second. As the eye tracking algorithm is executed on the CPU, the increased frame rates of the application on newer generation devices depend on the increased processing power of these devices.

Device | Eye tracking and 3D rendering | Marker tracking and 3D rendering | Combined
HTC One M7 (2013) | 11 | 30 | –
Nvidia Shield Tablet (2014) | 17 | 30 | 13
Samsung Galaxy S6 (2015) | 25 | 30 | –
HTC 10 (2016) | 30 | 30 | –

Figure 5.3: Performance of the different aspects, separately and combined, of the application on high-end Android devices of four different generations, in frames per second. As the Nvidia Shield Tablet is the only device used that supports the front-facing and rear camera to be active simultaneously, only the separate aspects are tested on the other devices.

6 Results

While the display method described in this thesis is best demonstrated in real life, pictures taken from different points of view illustrate what it is capable of. This is done by positioning the tablet on a grid, and adding a virtual grid to the 3D scene displayed on the screen. The position and orientation of the tablet to the real-world grid are tracked using an augmented reality marker. Figure 6.1 illustrates how the virtual grid is transformed when the tablet moves around the real-world grid. Figure 6.1a shows that when the tablet is positioned parallel to the real-world grid, the lines of the virtual grid are parallel to the lines of the real-world grid as well. Figure 6.1b shows that when the tablet is moved, the virtual grid is transformed in such a way that its lines are still parallel to the lines of the real-world grid.

(a) Tablet positioned parallel to the real-world grid. The lines of the virtual grid are parallel to the lines of the real-world grid as well.

(b) Tablet positioned oblique to the real-world grid. The virtual grid is transformed in such a way that its lines are still parallel to the lines of the real-world grid.

Figure 6.1: Illustration of how the virtual grid is transformed according to the position and orientation of the tablet to the real-world grid, which is tracked using an augmented reality marker.

Figure 6.2 illustrates the virtual camera projection when the position and orientation of the tablet to the real-world grid are kept the same, but the observer's position to the display changes. When the observer moves around the display, the virtual camera projection is changed in such a way that the lines of the virtual grid are still parallel to the lines of the real-world grid. The virtual camera projection also changes if the observer moves upwards or downwards or moves closer to the screen, but that is hard to illustrate using pictures.

(26)

(a) Observer right in front of the display. The lines of the virtual grid are parallel to the lines of the real-world grid.

(b) Observer moved to the right side of the display. The virtual camera projection is changed in such a way that the lines of the virtual grid are still parallel to the lines of the real-world grid.

Figure 6.2: Illustration of the virtual camera projection based on the observer’s position to the display. The tablet is positioned parallel to the real-world grid, while the observer moves around the display.

7 Discussion

7.1 Possible applications

This thesis describes a novel display method but has not yet touched upon the applications in which it would be useful. Possible applications lie in the field of non-destructive examination. Non-destructive examination is a wide group of analysis techniques to evaluate the properties of a material, component or system without causing damage to it [31]. For example, virtual autopsy is a virtual alternative to a traditional autopsy. Using CT or MRI scans, a 3D model of the examined body is generated, which can be examined digitally by forensic pathologists. Currently, these 3D models are mostly examined on a computer that is controlled in a conventional way, using mouse and keyboard. This way, there is no direct link between what is shown on the screen and the real body. To improve this, Lars Christian Ebert et al. developed a system that, by hovering a handheld tracker over the body, displays the corresponding part of the 3D model on a screen on the wall [6]. This way, interaction between the real world and the 3D model is possible, but the user of this system still has to make a link between the position of the tracker and what is displayed on the screen. If this system is extended to use the display method described in this thesis, by hovering a mobile display over the examined body, the 3D model can be displayed corresponding to the position and orientation of the display to the body. By also changing the virtual camera projection corresponding to the position of the observer, a direct link between the body and the displayed 3D model is created.

Another possible application similar to virtual autopsy is examination of coral. This way, using CT scans again, coral can be investigated without damaging it. By hovering the mobile display over parts that look interesting from the outside, the corresponding insides are projected on the mobile display.

Another possible application could be using the mobile display during surgery. When a surgeon has to cut in a very specific direction, for example to avoid damaging vital parts of the body, the mobile display can help with this. As the displayed graphics are based on the surgeon's point of view, the cutting direction would be displayed correctly from every point of view. However, this would require both the tracking of the observer and the marker tracking to be completely accurate, as during surgery there is no room for any error.

In contrast to the possible scientific applications discussed so far, another possible application could be using the mobile display for quest games. This way, for example, a certain 3D scene is only displayed when the mobile display is aimed at a certain real-world scene or object, creating a location-based window into a virtual 3D world. This would result in a new and very interactive quest experience.

7.2 Limitations

7.2.1 Freedom of movement

As the observer is tracked using the front-facing camera of the Android tablet, the observer must stay within the field of view of this camera. The same goes for the tracked real-world object, which must stay within the field of view of the rear camera of the tablet. Compared to this, the ‘Dead Cat Demo’ discussed in Section 2.1 used a tracking subsystem, which was thus independent of the position of the mobile display. This way, the observer, as well as the real-world object, could move 180 degrees around the display. However, as Android devices evolve every year, it might be possible that in the future Android devices with fisheye cameras are developed, which would also allow for 180 degrees of freedom of movement.

7.2.2 Device compatibility

The eye tracking augmented reality demonstration uses the front-facing camera of the Android tablet to track the observer's position. At the same time, the rear camera of the tablet is used to track an augmented reality marker. This thus requires both cameras to be active simultaneously. In practice, on most Android devices only one camera can be active at a time. This is probably caused by manufacturers using the same Universal Serial Bus for both cameras, to save space in the housing of the device. Rapidly switching between both cameras is not possible, as enabling a camera on the Android platform takes about a second. Therefore, the eye tracking augmented reality demonstration is currently only supported on a small range of devices. As support for two simultaneously active cameras is not listed on the specification sheets of Android devices, the only way to determine whether a device supports this is by testing. Android devices that have been tested and confirmed to support two simultaneously active cameras are listed below:

• HTC One M8 (2014)
• Nvidia Shield Tablet (2014)
• Nvidia Shield Tablet K1 (2015)
• LG Nexus 5X (2015)
• LG G5 (2016)

7.3 Future work

7.3.1 Virtual camera projection correctness

To determine the correctness of the virtual camera projection based on the observer's position, the difference between the real-world viewing angle to the display and the virtual camera's viewing angle into the 3D scene has been measured in Section 5.1. As the results showed, the viewing angle is accurate to a few degrees, but the camera tilt causes the results to be less accurate. To create a convincing illusion, the viewing angle should be as accurate as possible. Using a front-facing camera that is not tilted would improve the results, as then the orientation of the camera would be equal to the orientation of the tablet. Also, as discussed in Section 5.1, the fact that the eyes are sometimes localized a bit to the left or right of the center of the pupil causes the standard deviations of the horizontal viewing angle errors to be relatively large. The eye localization accuracy, as well as the eye tracking accuracy, could be improved by using camera images of higher resolution, as this would result in more detailed images. Currently, this would lead to a significant decrease in performance, but the performance measurements in Section 5.2 show that the performance increases rapidly on newer hardware. This means that in the future it would be possible to perform eye tracking on higher resolution camera images, while the application would still run at the 30 frames per second limit of most current cameras.

Another factor is that the exact interpupillary distance of the observer is not known. Instead, the average human interpupillary distance is used, as discussed in Section 4.3. Using this average value instead of the exact interpupillary distance of the observer causes the viewing angle error to become larger. Especially for scientific applications, it is recommended to specify the exact interpupillary distance of the observer.

7.3.2 See-through display illusion

Currently, both the position of a real-world 3D object and the observer are taken into account when rendering a completely virtual 3D scene. This could be extended by also showing the real world behind the 3D scene using the rear camera images. This way, a realistic illusion of having a see-through display could be created. This would mean that when the observer moves around the display, the camera images should be transformed accordingly, just like the virtual 3D scene is transformed.

In 2008, Chris Harrison and Scott E. Hudson published a paper that describes pseudo-3D video chatting using a single webcam [10]. Based on the observer's point of view, the video is skewed to create the pseudo-3D effect. Figure 7.1 illustrates this. This method might be adapted to create the see-through display illusion. However, as the camera is at a fixed position, the field of view into the real world is limited, leading to the unused screen space visible in Figure 7.1. As stated before, it might be possible that in the future Android devices with fisheye cameras are developed, which would take away this limitation.

(a) Observer leaning left. (b) Observer in front of screen. (c) Observer leaning right.

Figure 7.1: Pseudo-3D video chatting using a single webcam. Images created by Chris Harrison and Scott E. Hudson [10].

7.3.3 Markerless object detection

Currently, the real-world object is tracked using an augmented reality marker. However, markerless object detection would lead to a better experience, as then the real-world object does not have to be altered. To achieve this while still using the rear camera of the tablet, a 3D model of the real-world object could be used to first detect and estimate the pose of an object and then track the object. Several papers discuss methods to implement this [4] [2]. To track simpler planar objects, other methods are available, such as tracking planar structures [23] [7] or combining the information provided by edges and feature points [26].

8 Conclusions

In this thesis, the design and implementation of a display method has been discussed that renders 3D scenes on a 2D mobile display surface and takes the position of the observer, as well as the position and orientation of the display with respect to a real-world 3D object into account. The observer is tracked using eye tracking, while the real-world 3D object is tracked using augmented reality markers.

To change the projection of the virtual camera based on the observer’s real-world position, the virtual camera is translated and an off-center projection matrix is used. As the results in Chapter 6 showed, a realistic projection of the 3D scene is displayed according to the observer’s point of view. As stated in the introduction, this display method should, in theory, create a novel illusion to an observer of having a window into a virtual 3D world. To create an illusion that is as realistic as possible, the viewing angle of the virtual camera into the 3D scene should be as close as possible to the observer’s viewing angle to the display. Measurements showed that at a viewing distance of 30 centimeters, the horizontal viewing angle is accurate to a few degrees, while the vertical viewing angle is accurate to a fraction of a degree. These results, and thus the realism of the illusion, could be improved by using a front-facing camera that is not tilted, so its orientation is equal to the orientation of the display, and by applying eye tracking to camera images of higher resolution, so its accuracy is improved. Currently, this would lead to a significant decrease in performance, but the performance measurements in Section 5.2 show that the performance of the application increases rapidly on newer hardware.

Based on the real-world position and orientation of the augmented reality marker, not the virtual camera, but the virtual objects in the 3D scene itself are transformed. This way, this aspect of the display method is separated from the virtual camera projection based on the observer’s position, so both aspects can also be used individually. The results in Chapter 6 showed that a 3D scene is translated, rotated and scaled according to the real-world position and size of the augmented reality marker.

Besides improvement of the virtual camera projection based on the observer's real-world position, several aspects of the display method could be improved in the future. Because of the limited field of view of current cameras on Android devices, freedom of movement of the mobile display with respect to the observer and the real-world object is limited. As Android devices evolve every year, it might be possible that in the future Android devices with fisheye cameras are developed, which would allow for 180 degrees of freedom of movement. Also, the real-world object tracking could be improved by using markerless object detection, so objects can be tracked without attaching augmented reality markers to them. Finally, the illusion of having a window into a virtual 3D world can be extended to the illusion of having a see-through display, by displaying and realistically transforming the rear camera images of the real world behind the 3D scene.


Bibliography

[1] Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., and Szeliski, R. A database and evaluation methodology for optical flow. International Journal of Computer Vision 92, 1 (2011), 1–31.

[2] Beier, D., Billert, R., Brüderlin, B., Stichling, D., and Kleinjohann, B. Marker-less vision based tracking for mobile augmented reality. In Proceedings of the 2nd IEEE/ACM International Symposium on Mixed and Augmented Reality (2003), IEEE Computer Society, p. 258.

[3] Bouguet, J.-Y. Pyramidal Implementation of the Lucas Kanade Feature Tracker. Intel Corporation 5, 1-10 (2001), 4.

[4] Comport, A. I., Marchand, É., and Chaumette, F. A real-time tracker for markerless augmented reality. In Proceedings of the 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality (2003), IEEE Computer Society, p. 36.

[5] Dodgson, N. A. Variation and extrema of human interpupillary distance. In Electronic Imaging 2004 (2004), International Society for Optics and Photonics, pp. 36–46.

[6] Ebert, L. C., Ruder, T. D., Martinez, R. M., Flach, P. M., Schweitzer, W., Thali, M. J., and Ampanozi, G. Computer-Assisted Virtual Autopsy Using Surgical Navigation Techniques. American Journal of Roentgenology 204, 1 (2015), W58–W62.

[7] Ferrari, V., Tuytelaars, T., and Van Gool, L. Markerless augmented reality with a real-time affine region tracker. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality (2001), IEEE Computer Society, pp. 87–96.

[8] Garstka, J., and Peters, G. View-dependent 3D projection using depth-image-based head tracking. In 8th IEEE International Workshop on Projector Camera Systems PROCAMS (2011), pp. 52–57.

[9] Hancock, M., Nacenta, M., Gutwin, C., and Carpendale, S. The effects of changing projection geometry on the interpretation of 3D orientation on tabletops. In Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces (2009), ACM, pp. 157–164.

[10] Harrison, C., and Hudson, S. E. Pseudo-3D video conferencing with a generic webcam. In Multimedia, 2008. ISM 2008. Tenth IEEE International Symposium on (2008), IEEE, pp. 236–241.

[11] Hejn, K., and Rosenkvist, J. P. Headtracking using a Wiimote. Graduate Project, Department of Computer Science, University of Copenhagen (2008).

[12] Hidai, K.-i., Mizoguchi, H., Hiraoka, K., Tanaka, M., Shigehara, T., and Mishima, T. Robust face detection against brightness fluctuation and size variation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (2000), vol. 2, IEEE Computer Society, pp. 1379–1384.

[13] Królak, A., and Strumiłło, P. Eye-blink detection system for human-computer interaction. Universal Access in the Information Society 11, 4 (2012), 409–419.

[14] Lee, J. Head tracking for desktop VR displays using the Wii remote. http://www.cs.cmu.edu/johnny/projects/wii, 2007. [Online].

[15] Lee, J. C. Hacking the Nintendo Wii remote. Pervasive Computing, IEEE 7, 3 (2008), 39–45.

[16] Near and Far Planes. https://www.agi.com/resources/help/online/AGIComponents/Programmer's%20Guide/Overview/Graphics/Camera/ViewFrustum.html. [Online].

[17] OpenCV pre-trained Haar-like feature models. https://github.com/Itseez/opencv/tree/master/data/haarcascades. [Online].

[18] OpenCV for Unity Homepage. http://enoxsoftware.com/opencvforunity/. [Online].

[19] OpenCV Homepage. http://opencv.org/. [Online].

[20] Rekimoto, J. A Vision-Based Head Tracker for Fish Tank Virtual Reality. In Virtual Reality Annual International Symposium, 1995. Proceedings. (1995), IEEE Computer Society, pp. 94–100.

[21] Rogers, B., and Graham, M. Motion parallax as an independent cue for depth perception. Perception 8, 2 (1979), 125–134.

[22] Scarpa, M., Belleman, R. G., Sloot, P. M., and de Laat, C. T. Highly interactive distributed visualization. Future Generation Computer Systems 22, 8 (2006), 896–900.

[23] Simon, G., Fitzgibbon, A. W., and Zisserman, A. Markerless tracking using planar structures in the scene. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality (2000), IEEE Computer Society, pp. 120–128.

[24] Steinicke, F., Bruder, G., and Kuhl, S. Realistic perspective projections for virtual objects and environments. ACM Transactions on Graphics (TOG) 30, 5 (2011), 112.

[25] Unity3D Homepage. http://unity3d.com/. [Online].

[26] Vacchetti, L., Lepetit, V., and Fua, P. Combining edge and texture information for real-time accurate 3D camera tracking. In Proceedings of the 3rd IEEE and ACM International Symposium on Mixed and Augmented Reality (2004), IEEE Computer Society, pp. 48–56.

[27] Viola, P., and Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2001), vol. 1, IEEE Computer Society, pp. I–511.

[28] Vuforia Homepage. http://vuforia.com/. [Online].

[29] Wang, P., Green, M. B., Ji, Q., and Wayman, J. Automatic eye detection and its validation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2005), IEEE Computer Society, pp. 164–164.

[30] The BioID Face Database. https://www.bioid.com/About/BioID-Face-Database. [Online].

[31] Wikipedia. Nondestructive testing. https://en.wikipedia.org/wiki/Nondestructive_testing. [Online].

[33] Zhao, Z., Fu, S., and Wang, Y. Eye Tracking Based on the Template Matching and the Pyramidal Lucas-Kanade Algorithm. In International Conference on Computer Science & Service System (CSSS) (2012), IEEE Computer Society, pp. 2277–2280.
