Touchless interaction with 3D images in the operating room

N/A
N/A
Protected

Academic year: 2021

Share "Touchless interaction with 3D images in the operating room"

Copied!
103
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

the operating room

Exploring usability in augmented reality

Sarah Elisabeth Rüegger

Master’s Thesis

Human-Machine Communication
University of Groningen, The Netherlands

July 2018

Internal supervisor:

Dr. F. Cnossen (Department of Artificial Intelligence, University of Groningen)

External supervisors:

Dr. ir. P.M.A. van Ooijen (Department of Radiology, University Medical Center Groningen)

J. Kraeima, MSc. (Department of Oral and Maxillofacial Surgery, University Medical Center Groningen)


Douglas Adams, The Salmon of Doubt

Abstract

Surgeons often consult 3D medical images during procedures, but using a mouse and keyboard to interact with visual displays can lead to problems with operating room sterility and interrupted workflow. Touchless interaction techniques, such as gestures and speech, are promising alternatives, especially in augmented reality systems, but only if they are efficient and intuitive for surgeons to use. This project investigates the usability of touchless interaction for use during surgery.

A usability study was conducted with medical experts to test “3jector”, an application featuring gesture-based interaction with projections of 3D images. The goal was to evaluate the usability of the gestures and assess whether gestural interaction is acceptable in the operating room.

Insights from the study led to the design of a second application that runs on a Microsoft HoloLens, an augmented reality headset. This application allows interaction with holograms using gestures, gaze and speech. A further usability study evaluated two different kinds of 3D rotation: constrained to one axis at a time, or controlling all axes at once.

The efficiency and satisfaction of the two rotation methods were compared. The study also examined when users choose to use gestures or voice commands.

The studies’ results show that surgeons believe touchless interaction will be beneficial in the operating room. Main insights provided by this research include the necessity of simple gestures, the benefits of using both speech and gestures together, and users’ preference for 3D rotation constrained to one axis. These insights are an important step towards giving surgeons greater control over medical images.

Contents

Abstract ii

List of Figures v

List of Tables vi

Abbreviations vii

1 Introduction 1

1.1 Research question . . . 3

1.2 Thesis structure . . . 3

2 Theoretical Background 4
2.1 Touchless interaction . . . 4

2.1.1 Gestures . . . 5

2.1.2 Speech . . . 6

2.1.3 Eye movements and gaze . . . 7

2.1.4 Multimodality . . . 8

2.2 3D visualization of medical images . . . 9

2.3 Augmented reality . . . 10

2.3.1 Types of viewing devices . . . 11

2.3.2 Medical augmented reality . . . 12

2.3.3 Interaction in augmented reality . . . 13

2.4 Usability testing . . . 14

2.4.1 Think-aloud method . . . 16

2.4.2 System Usability Scale . . . 16

2.5 Devices . . . 17

2.5.1 Leap Motion Controller . . . 17

2.5.2 HoloLens . . . 18

2.6 Related work . . . 19

2.6.1 Touchless interaction with medical images . . . 19

2.6.2 Touchless interaction in 3D . . . 23

2.6.3 Touchless interaction in augmented reality . . . 25

2.6.4 Conclusion . . . 27

3 Part 1 - 3jector 29
3.1 Introduction . . . 29


3.2 System . . . 30

3.2.1 Architecture . . . 30

3.2.2 GUI . . . 30

3.2.3 Gestures . . . 31

3.3 Usability study . . . 35

3.3.1 Participants . . . 35

3.3.2 Study protocol . . . 36

3.3.3 Data analysis . . . 40

3.4 Results . . . 42

3.4.1 SUS . . . 42

3.4.2 Errors . . . 43

3.5 Discussion . . . 46

3.5.1 Usability . . . 46

3.5.2 Findings . . . 48

3.6 Conclusion . . . 53

4 Part 2 - HoloLens 55
4.1 Introduction . . . 55

4.2 System . . . 56

4.2.1 Architecture . . . 56

4.2.2 GUI . . . 56

4.2.3 Touchless user interface . . . 58

4.3 Usability Study . . . 61

4.3.1 Participants . . . 61

4.3.2 Study protocol . . . 62

4.3.3 Data analysis . . . 65

4.4 Results . . . 66

4.4.1 Rotation tasks . . . 66

4.4.2 SUS . . . 66

4.4.3 Multimodality . . . 67

4.5 Discussion . . . 70

4.5.1 Rotation modes . . . 70

4.5.2 Touchless interaction . . . 74

4.6 Conclusion . . . 78

5 Discussion 79
5.1 Comparison of 3jector & HoloLens . . . 79

5.1.1 Comparing the usability studies . . . 80

5.1.2 Comparing the applications and interaction designs . . . 82

5.2 Impact of research . . . 84

5.3 Limitations and future work . . . 86

5.4 Conclusion . . . 87

References 89

List of Figures

2.1 A virtuality continuum . . . 10

2.2 Interpreting SUS scores . . . 17

2.3 Leap Motion Controller . . . 18

2.4 Microsoft HoloLens . . . 19

3.1 3jector’s graphical user interface . . . 31

3.2 Menu . . . 31

3.3 DICOM menu . . . 32

3.4 Setup of 3jector for the usability study . . . 36

3.5 SUS scores for all participants, for the surgeons and for the radiologists . . . 43
3.6 SUS ratings for each question . . . 44

4.1 GUI for free rotation . . . 57

4.2 GUI for constrained rotation . . . 57

4.3 Visual clues of interaction . . . 58

4.4 HoloLens air tap gesture . . . 59

4.5 Procedure for the HoloLens usability study . . . 63

4.6 Physical object used for the rotation tasks besides the virtual version . . . 63

4.7 Rotations used for the usability study . . . 64

4.8 Rotation task times for the two rotation modes . . . 67

4.9 SUS rating by question . . . 68

4.10 Number of “Rotate” commands per minute . . . 70

List of Tables

3.1 Actions for the menu . . . 33

3.2 Actions for Navigation mode . . . 33

3.3 Actions for Cutting mode . . . 34

3.4 Actions for Brightness mode . . . 34

3.5 Actions for the DICOM menu . . . 35

3.6 How many participants experienced each error for a given task . . . 46

3.7 Comparing SUS for Leap Motion Controller systems . . . 47

4.1 Actions for gestures and gaze . . . 59

4.2 Voice commands and their effect . . . 60

4.3 Number of participants that failed to complete the rotation task . . . 66

4.4 SUS scores for the rotation modes . . . 68

4.5 Success rate of keywords by number of attempts . . . 69

Abbreviations

AR Augmented Reality

CT Computed Tomography

DICOM Digital Imaging and Communications in Medicine
GUI Graphical User Interface

HCI Human-Computer Interaction
LMC Leap Motion Controller
MRI Magnetic Resonance Imaging
NUI Natural User Interface

OR Operating Room

PET Positron-Emission Tomography
SUS System Usability Scale

UMCG University Medical Center Groningen
VR Virtual Reality

2D Two-dimensional

3D Three-dimensional

X axis Left-right axis
Y axis Up-down axis

Z axis Forward-backward axis


1 Introduction

In any given operating room, surgeons typically have access to multiple visual displays showing medical image data during their procedures. The images help experts make diagnoses, plan procedures and serve as intra-operative references or guides. Advances in imaging techniques and other technology have given surgeons a better understanding of their cases. One important development is that images are increasingly available in three dimensions, allowing medical experts to form more detailed diagnoses and intervention plans. While these 3D images are usually viewed on 2D screens, there is growing interest in the medical field to view them in three dimensions. One promising area is augmented reality (AR), where the physical world is enhanced with computer-generated virtual objects. This technology could be especially useful during surgery, letting doctors view medical images directly on the patient in real time.

Whether viewing traditional 2D images on a screen or interacting with 3D models in AR, it is crucial to give surgeons full control over these images, such as the ability to scale, rotate and browse through them. However, this is not a simple task in the operating room. Several obstacles hinder surgeons from directly using a mouse and keyboard (O’Hara et al., 2014). For one, they can endanger sterility, and using them may require a surgeon to move away from the patient, causing delays. Surgeons often try to overcome these problems by instructing assistants to interact with the images, but this communication is often frustrating. Finally, traditional 2D input methods like a mouse and keyboard are often simply unsuitable for performing actions in 3D space.


Touchless interaction methods, including using gestures and voice commands, offer a way of overcoming these issues. They are often viewed as a natural approach to interacting with 3D medical images on screens and in AR. However, they are not yet very well studied in terms of usability, or user-friendliness (Mewes, Hensen, Wacker, & Hansen, 2017). Usability is crucial for new technologies and systems, and especially so in the medical field, where usability errors could be costly in terms of time, money and patient outcome.

This thesis aims to assess the usability of systems that feature touchless interaction, focusing on systems that facilitate viewing and manipulating 3D medical images during surgery. Methods from human-computer interaction let us gain valuable insights from tests with users about user satisfaction, system performance, usability issues and more.

The usability of two different touchless interactive applications was investigated for this project. The first application, known as 3jector, is designed to allow surgeons to view 3D images projected onto a wall and interact with them by using gestures with a Leap Motion Controller, a small device that recognizes hand movements. This application was tested with medical experts to evaluate how comfortable they felt using gestures to control images, whether they thought gestures could be used during surgery, and to find any usability problems that needed to be accounted for.

The second application runs on the Microsoft HoloLens, a pair of AR smartglasses.

Insights from the 3jector user study were taken into account to develop an application for manipulating 3D objects by using gestures, voice commands and gaze as input. The performance of the application was tested in a second usability study, with the goal of discovering what kinds of manipulation are best suited to AR, and how users interact when they have different input modalities available to them.

The results from both user studies presented in this thesis contribute to the understand- ing of the usability of touchless interaction for use in surgery.


1.1 Research question

The main research question is as follows:

How can the usability of interacting with 3D medical images throughout the surgical patient care process be improved with gestures and voice commands?

The following objectives were used to answer the research question:

• Evaluate the usability of gestures with the Leap Motion Controller to manipulate 3D medical images

• Explore the possibilities of combining gestures and voice commands for a Microsoft HoloLens-based augmented reality system to view and manipulate 3D medical images

• Evaluate the usability of gesture and speech interactions within the augmented reality system

1.2 Thesis structure

Chapter 2, the Theoretical Background, introduces important concepts for this thesis, including touchless interaction, augmented reality and usability. It also includes a review of relevant research conducted and described in the scientific literature.

The two parts of this project are each discussed in their own chapter. Chapter 3 concerns Part 1, the 3jector study, and Chapter 4 covers Part 2, the HoloLens study. In both chapters, the respective applications are described, followed by an elaborate look at the usability study conducted and a discussion of the results.

In the concluding Chapter 5, the results from both usability studies are compared and the most important recommendations are summarized. The thesis is also discussed in the larger research context, including limitations of the project and suggestions for future research.


2 Theoretical Background

2.1 Touchless interaction

Interaction is called touchless when it does not require mechanical contact between the user and any part of the system (De La Barré, Chojecki, Leiner, Mühlbach, & Ruschin, 2009). A variety of input modalities can be used to facilitate this kind of interaction.

The three most common are gestures, speech, and eye movements, but the definition also encompasses lesser-used potential inputs such as facial expressions or electrical brain activity.

One of the main application areas for touchless interactive systems is use during surgery. Increasingly, surgeons refer to medical images (scans, models etc.) during procedures. Making the interaction with these images touchless has two main benefits: sterility and improved workflow.

The need for sterility in the operating room means that surgeons usually cannot simply use a mouse and keyboard to interact with image viewing systems. Instead, surgeons often delegate this task to another member of the surgical team. The surgeons instruct the assistant on which images to show. This approach is flawed for two reasons. Firstly, surgeons require direct control of medical images to fully understand the data (Johnson, O’Hara, Sellen, Cousins, & Criminisi, 2011). Secondly, the communication between the assistant using the computer and the surgeon can be imprecise or complicated, leading to misunderstandings, time delays and interrupted workflow (O’Hara et al., 2014). Some surgeons bypass these issues by interacting with the mouse or keyboard themselves, using a barrier between the surgeon’s glove and the image viewing device to preserve sterility. However, this also hinders workflow, as surgeons must often move away from the patient to access the systems showing the medical data.

For these reasons, it is advantageous to allow surgeons to interact with computer systems using touchless methods. This allows them to have direct control over what they see and do, without having to instruct an assistant. If the touchless system is appropriately placed, surgeons can stay close to the patient while accessing data.

A fully functional touchless image viewing system must enable surgeons to browse through the available image data set and manipulate whole images or specific parts.

This manipulation includes panning, scaling, rotation and cutting, and depending on the system and the type of medical images, the adjustment of various parameters like contrast, density functions and opacity (O’Hara et al., 2014).

Even if a touchless image viewer offers all these functions, it will only be beneficial if it is usable for the surgeon. Touchless interaction is often associated with natural user interfaces (NUIs), though they are not synonymous. According to Wigdor and Wixon (2011), a user interface is natural if it allows a user experience that feels like an extension of their body. In addition, using NUIs should be intuitive for both novices and expert users. While any type of interaction can be designed with NUI principles in mind, the term is most often applied to touch-based gestures and touchless interaction, which are said to afford implicit naturalness. However, some have questioned the usability of NUIs, pointing to issues such as a lack of visible clues, the difficulty for the user to discover all possible commands, and lack of consistency (Norman & Nielsen, 2010). Therefore, usability testing is crucial for touchless interaction.

The following sections introduce some of the common modalities of touchless interaction and discuss their benefits as well as some of the usability issues that must be accounted for.

2.1.1 Gestures

The term “gesture” is used to refer to various types of body movements. With the current abundance of touch screens, gesture is often used as shorthand for surface-based gesture.

But gestures can also refer to touchless, in-air gestures. Within this category, some restrict the term only to manual gestures, while others include other body parts, such as feet or even full-body movements. Since this research is focused on touchless interaction, the term gesture will be used exclusively for in-air gestures, and unless otherwise noted, will only include manual movements.

There are two kinds of technologies that are used for detecting and interpreting gestures in order to use them as system input. The first are inertial sensors, which can detect motion and are worn on the hand, head or body. The second, camera-based group makes use of color and/or depth information to segment images (Mewes et al., 2017).

Gestures are perhaps the most common touchless input modality and popular for NUIs.

Humans naturally make frequent use of gestures when communicating and interacting with the world. However, there are some usability challenges that are inherent to gestural interaction. Learnability is certainly an issue. If a large set of gestures must be remembered to interact with a system, this can place a cognitive strain on the user. A possible solution is to use a limited number of gestures that have different effects in various system modes. Another issue is the idea of discoverability, or whether all possible gestural commands can be discovered spontaneously by a user. Using simple gestures or providing visual guides might improve this. Finally, the lack of haptic feedback means that a system must find other ways of providing gestural feedback, whether this be visual, auditory or other (LaViola Jr., Kruijff, McMahan, Bowman, & Poupyrev, 2017).

LaViola Jr. et al. (2017) stress the importance of having so-called delimiters for gestures, which are ways of ensuring that normal movements do not unintentionally activate system commands. This can be achieved by providing a mechanism to turn gestural control on and off, or by limiting the interaction area, so that only gestures within this area are interpreted as system input.
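To make the delimiter idea concrete, the following minimal Python sketch gates incoming hand positions with an explicit on/off toggle and a bounded interaction zone. The frame format, zone dimensions and threshold values are invented for illustration and are not taken from any of the systems discussed in this thesis.

    from dataclasses import dataclass

    # Interaction zone above the sensor, in millimetres (illustrative values).
    ZONE_X = (-150, 150)   # left-right
    ZONE_Y = (100, 400)    # height above the sensor
    ZONE_Z = (-150, 150)   # forward-backward

    @dataclass
    class HandFrame:
        x: float
        y: float
        z: float

    class GestureDelimiter:
        """Forwards hand frames only while control is enabled and the hand is inside the zone."""

        def __init__(self):
            self.enabled = False

        def toggle(self):
            # Bound to an explicit activation gesture or voice command.
            self.enabled = not self.enabled

        def accept(self, frame: HandFrame) -> bool:
            in_zone = (ZONE_X[0] <= frame.x <= ZONE_X[1]
                       and ZONE_Y[0] <= frame.y <= ZONE_Y[1]
                       and ZONE_Z[0] <= frame.z <= ZONE_Z[1])
            return self.enabled and in_zone

    # Frames outside the zone, or received while control is off, never reach the recognizer.
    delimiter = GestureDelimiter()
    delimiter.toggle()                                    # user switches gesture control on
    print(delimiter.accept(HandFrame(x=0, y=250, z=20)))  # True: inside the interaction zone
    print(delimiter.accept(HandFrame(x=0, y=600, z=20)))  # False: hand is too high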

2.1.2 Speech

There are several benefits to using speech as system input. Besides being touchless, it is also “eyes-free”. This is useful in an environment like an operating room where surgeons may not want to shift their view from the patient. Additionally, if microphones cover a large enough range, speech interaction is ubiquitous, that is, available throughout an environment or room. Finally, even more so than gestures, speech is considered to be an intuitive and natural interaction mode because it is so important to human communication (Cohen, Giangola, & Balogh, 2004).

Voice interfaces exist on a range of complexity, from simple speech recognition, which typically involves short voice commands, to fully natural spoken dialogue, allowing a discourse between the user and the system. In all systems, a speech recognition engine uses natural language processing techniques to understand the acoustic signals. The accuracy of speech recognition can be affected by things like background noise and accents (LaViola Jr. et al., 2017).

A major issue for using speech as input is that users are likely to be talking throughout a task without meaning to interact with the system (LaViola Jr. et al., 2017). In a crowded operating room, multiple people may be conversing at once, and it would be confusing and cumbersome if someone inadvertently triggered a command. One way to get around this is the “push-to-talk” technique, where a button is pushed whenever the user wishes to make a voice command. Of course, this runs counter to true touchless interaction. However, there may be multimodal approaches that emulate this functionality, perhaps by requiring a specific gesture to make a voice command. An alternative solution to accidental command triggering is to require a “wake word”. For example, the Amazon Echo, a computer assistant, only responds to commands that begin with “Alexa”.
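A wake-word filter can be sketched as a thin layer in front of a small command vocabulary. The recognizer is assumed to deliver a plain text transcript; the wake word and the command set below are hypothetical and do not correspond to the interface of any particular speech engine.

    WAKE_WORD = "system"   # illustrative wake word, not tied to any real product
    COMMANDS = {"rotate": "start_rotation", "zoom in": "zoom_in", "stop": "stop_action"}

    def handle_utterance(transcript: str):
        """Ignore speech unless it starts with the wake word, then match a known keyword."""
        text = transcript.lower().strip()
        if not text.startswith(WAKE_WORD):
            return None                        # ordinary conversation, not a command
        keyword = text[len(WAKE_WORD):].strip()
        return COMMANDS.get(keyword)           # None if the keyword is not in the vocabulary

    print(handle_utterance("please pass the retractor"))  # -> None
    print(handle_utterance("System zoom in"))             # -> "zoom_in"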

As with gestures, discoverability can also be difficult for speech. If a system supports relatively natural dialog, this is not a big problem. However, many systems use short voice commands, and it may be hard for users to discover all the correct keywords if they were not appropriately chosen.

Despite its naturalness, speech control can be tiring if used continuously, and is not suitable for all tasks and environments (LaViola Jr. et al., 2017). For example, speech is cumbersome for navigating through a menu, and obviously unsuited for noisy settings.

For these reasons, most interfaces that make use of speech use it in connection with other modalities.

2.1.3 Eye movements and gaze

Eye tracking allows a user’s eye movements to be measured to determine what a user is looking at. Eye trackers typically direct infrared light into the eye, which makes the pupil bright and easy to track and creates a corneal reflection. The vector between the pupil center and the corneal reflection is used to calculate the point-of-regard (Poole & Ball, 2006).
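A common way to turn the pupil-glint vector into a point-of-regard is to fit a mapping to screen coordinates from a short calibration in which the user fixates known points. The least-squares sketch below illustrates that step with a simple second-order polynomial model; all calibration values are made up.

    import numpy as np
    from itertools import product

    def design(vectors):
        """Second-order polynomial terms of the pupil-glint vector (vx, vy)."""
        vx, vy = vectors[:, 0], vectors[:, 1]
        return np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx ** 2, vy ** 2])

    # Calibration: pupil-glint vectors recorded while the user fixates a 3x3 grid of known
    # screen points (all numbers are made up for illustration).
    calib_vectors = np.array(list(product([0.10, 0.25, 0.40], [0.20, 0.35, 0.50])))
    calib_screen = np.array(list(product([100, 500, 900], [100, 400, 700])), dtype=float)

    # Fit one polynomial mapping per screen axis by least squares.
    coeffs, *_ = np.linalg.lstsq(design(calib_vectors), calib_screen, rcond=None)

    def point_of_regard(vx, vy):
        """Estimate where on the screen (in pixels) the user is looking."""
        return (design(np.array([[vx, vy]])) @ coeffs)[0]

    print(point_of_regard(0.25, 0.35))   # approximately [500. 400.], the central calibration point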

Eye movements and gaze have most frequently been used as input techniques in devices designed for disabled users who struggle to use traditional input methods. Recently, mainstream applications are also beginning to use gaze as input, frequently in virtual and mixed reality applications (Poole & Ball, 2006). One popular solution is to use gaze position as an alternative for a mouse cursor.

A significant benefit of using gaze as input is that it is very fast. Long distances can be spanned quickly, which is especially useful for far-away objects in virtual environments.

Furthermore, it requires no training, as users simply look at the object they want to manipulate. However, there is no way to turn off eye movements. Also, since many eye movements are subconscious, users may need to get used to using gaze as intentional input, so the technique may not be as natural as first assumed (Jacob & Keith, 2003).

In order to avoid having subconscious eye movements trigger functions, some systems combine gaze with other forms of input, an example of multimodality (Poole & Ball, 2006).

2.1.4 Multimodality

The goal of using multiple forms of input is to allow for a richer set of interactions compared with unimodal interfaces (LaViola Jr., Buchanan, & Pittman, 2014). There are several advantages associated with multimodal systems. They are more flexible, often allowing users to choose inputs they prefer, which is beneficial to people with disabilities who may not be able to use all modalities equally well. Multimodality can mean more efficient systems and may reduce the number of errors, especially if noise is present in one of the input modalities. There is also some evidence that multimodal interaction can reduce cognitive load, and can help prevent fatigue from relying too long on one input modality (e.g., extended screen usage) (LaViola Jr. et al., 2017; Turk, 2014). Additionally, Oviatt, Coulston, and Lunsford (2004) found that as cognitive load and task complexity increases, users tend to start using more multimodal interaction.

This may be particularly relevant for high-stress situations during surgeries.


There are six different ways to combine input modalities, as defined by LaViola Jr. et al. (2014):

• Complementarity: Multiple inputs are needed to issue a single command, each providing different information

• Redundancy: Multiple inputs with the same information are needed to issue a single command

• Equivalence: A user can choose which input to use to issue a command

• Specialization: Only one input is available for a command

• Concurrency: Multiple inputs are used for separate commands that are issued at the same time

• Transfer: One input transfers information to a second input, which then uses this information to complete a command

While many different combinations of inputs have been used to create multimodal interfaces, the pair of gestures and speech is particularly popular. They occur together in natural human communication. In systems, they complement each other, according to Billinghurst (1998): gestures are intuitive for direct manipulation of objects, while speech is very descriptive.
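As a concrete illustration of complementarity, the sketch below fuses a pointing gesture, which supplies the object, with a spoken verb, which supplies the action. The event structure, command vocabulary and fusion window are invented for this example.

    import time

    FUSION_WINDOW = 1.5   # seconds within which a gesture and an utterance form one command

    class MultimodalFusion:
        """Complementarity: the gesture supplies the object, the voice command supplies the action."""

        def __init__(self):
            self.last_pointed_object = None
            self.last_pointed_time = 0.0

        def on_pointing(self, object_id):
            # Gesture channel: remember what the user most recently pointed at.
            self.last_pointed_object = object_id
            self.last_pointed_time = time.time()

        def on_speech(self, verb):
            # Speech channel: combine the spoken verb with the recently pointed-at object.
            recent = time.time() - self.last_pointed_time <= FUSION_WINDOW
            if verb in {"rotate", "hide", "enlarge"} and recent and self.last_pointed_object:
                return f"{verb} {self.last_pointed_object}"
            return None   # incomplete command: no recent pointing gesture or unknown verb

    fusion = MultimodalFusion()
    fusion.on_pointing("vertebra_L4")
    print(fusion.on_speech("rotate"))   # -> "rotate vertebra_L4"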

2.2 3D visualization of medical images

Medical images are primarily used for four purposes: education, diagnosis, treatment planning and intraoperative support (Preim & Bartz, 2007). A variety of imaging technologies are available, with CT, MRI, ultrasound and PET being some of the most common. Both two-dimensional and three-dimensional images are widely used. 2D viewing is often done slice-by-slice through the medical data and allows for a detailed and precise analysis. For this reason, many radiologists rely primarily on 2D images when diagnosing. On the other hand, 3D images offer an overview of medical data.

Surgeons, who must have a good understanding of the spatial relations of the data, are more likely to use 3D visualizations for treatment planning and as aids during surgery (Preim & Bartz, 2007).

3D visualization can be achieved in multiple ways. The simplest includes shaded images on 2D screens, while more sophisticated 3D projections can be viewed with special glasses, and the most immersive 3D images can be found as virtual displays in augmented reality environments (Robb, 2000). It is important to note that 3D visualization refers not only to the display of 3D objects, but also explicitly requires the ability to manipulate, analyze and interpret the data. 3D images must be closely associated with interactive visualization techniques in order to reach their full potential (Robb, 2000).

2.3 Augmented reality

Augmented reality (AR) is a subset of mixed reality, which refers to the range between real environments and virtual reality (Azuma, 1997). Figure 2.1 shows this continuum.

In virtual reality, users find themselves in completely simulated environments, with no remnants of the physical world present. In augmented virtuality, the environment is still virtual, but real objects are incorporated. By contrast, augmented reality technologies display the real environment and incorporate virtual objects.

The combination of real and virtual aspects is the first of three criteria for AR as described by Azuma (1997). The second states that the system must be interactive in real time. This means that films using computer-generated imagery effects are excluded from the definition. Furthermore, the virtual objects must be three-dimensional, so simple 2D wall projections are also not considered AR.

Figure 2.1: A virtuality continuum (Adapted from Milgram and Kishino (1994))


This section first introduces the different types of devices that are used to create AR environments, then the developing field of medical AR, and finally addresses issues of HCI in AR.

2.3.1 Types of viewing devices

There are different ways of adding virtual objects to reality, and many different kinds of AR devices have been developed in the last few years. They can be classified along two different dimensions - the technique used to incorporate virtual objects into reality, and the positioning of the display (van Krevelen & Poelman, 2010). There are benefits and drawbacks to each of these techniques and display types, and the choice of device depends on the application.

The three techniques for adding augmented objects to an environment are video see-through, optical see-through and projective. Using video see-through, users do not view the real world directly, but instead see a video of reality that incorporates augmented virtual objects. This is comparatively easy to implement, as the real environment is already viewed in a digitized form. However, there are problems with inherent timing delays, and the resolution of “reality” depends on the quality of the camera and display.

Optical see-through devices avoid these problems by letting the user directly view the real environment through transparent displays. Semi-transparent mirrors are then used to overlay the virtual objects. But it is more difficult to align the real and virtual objects in this technique, and the field of view is limited. A larger field of view may be achieved with projective methods, which project the virtual objects onto real-world surfaces. These systems must be calibrated whenever the setup changes and are unsuitable for outdoor applications (van Krevelen & Poelman, 2010).

Each of these techniques can be used with one of the three display types, head-mounted, hand-held or spatial. Head-mounted (sometimes called head-worn) displays are incorporated into helmets or take the form of eyeglasses (sometimes known as smartglasses), and the world is displayed in front of the eyes. Interest in AR for hand-held devices has soared with the widespread availability of smartphones, although other gadgets such as tablets and flashlights are also used (van Krevelen & Poelman, 2010). Finally, spatial displays are static within the environment and user independent. They may use screens, projectors or holograms to create the AR environment.


2.3.2 Medical augmented reality

AR has multiple applications in the medical field, including medical training and patient education. However, one of the biggest research areas concerns AR technologies for use in surgery. The main goal is to visualize the medical data in the same space as the patient, so both can directly be viewed together (Sielhorst, Feuerstein, & Navab, 2008).

Several benefits are expected from AR in surgery as compared to other visualizations.

For instance, while much of the patient data is already available to surgeons, there may be multiple displays placed in the operating room, each showing different types of medical images or other information. Locating and consulting these images can interrupt surgeons in their workflow, and the different sets of hardware can clutter an already crowded operating room. If a surgeon needs to interact with images, traditional devices can be unsuitable because of sterility issues in addition to the known limitations of using 2D interaction techniques on 3D data. The implicit three-dimensional nature of AR presupposes the use of 3D interaction techniques, giving surgeons better control of their data. Finally, the overlay of medical images directly on the patient offers several advantages. Multiple types of data can be fused and superimposed on the real anatomy, providing more information to the surgeon at the same instant. Also, because the virtual object matches the real-world orientation and position, surgeons do not have to mentally rotate images, which is necessary when viewing images on a screen. This is presumed to improve hand-eye coordination (Sielhorst et al., 2008).

AR is also an innovation for image-guided surgery. In these surgical procedures, the positions of surgical instruments are tracked and displayed within patient-specific images. This allows surgeons to see what they are doing, especially in minimally invasive procedures where they would otherwise be operating blindly (Grimson, Lorigo, Kapur, & Kikinis, 1999). AR allows these images to be transferred from a screen directly onto a patient. This will let surgeons precisely trace their planned incisions and compare the current situation of the patient to the planned procedure.

A further opportunity for AR in surgery is the possibility of virtual interactive presence, or telementoring. This will allow surgeons to receive assistance and guidance from one or more remotely located surgeons who are observing the surgery. AR technologies are able to show the combined augmented environment to both the operating and the mentoring surgeon. This lets both of them view and point at the patient during the surgery to interact and communicate (Shenai et al., 2011). This would be especially useful for procedures familiar to only a few experts, so they could assist in surgeries worldwide, and for operations taking place in remote locations where additional expertise is needed.

2.3.3 Interaction in augmented reality

AR applications are examples of 3D user interfaces, which can be problematic for user interaction. Users can find it difficult to interact in 3D space (LaViola Jr. et al., 2017).

Common 2D interaction techniques may not be appropriate for 3D systems. Consider a computer mouse, which operates in two dimensions. This limitation makes it difficult to use it to position an object in 3D environments.

Different interaction techniques are employed to overcome these challenges. The choice of input method depends on the type of AR device used, the amount of interaction required as well as on the type of application. The techniques include the following:

• 2D input devices: Although AR is by nature three-dimensional, this does not mean that 2D interaction techniques cannot be used. For example, screen-based AR systems can function similarly to traditional 2D GUIs, so a mouse and keyboard can provide input. Hand-held AR, especially on smartphones, usually uses touchscreens for interaction. However, in head-mounted or projected AR, these types of devices are not typically available.

• Controllers: Rather than using 2D input, systems may make use of 3D pointing devices, which control more degrees of freedom and allow better 3D manipulation. These controllers include 3D mice, joysticks and wands, sometimes with haptic feedback. They can be used with different types of AR systems, but interacting with controllers is not as intuitive as directly interacting with physical objects (Billinghurst, Clark, & Lee, 2015).

• Tangible objects: Here, physical objects are mapped to virtual ones, and users can interact with virtual objects by manipulating their physical counterparts (Billinghurst et al., 2015). This can be very intuitive because users interact directly with the physical world, but requiring physical objects may not be suitable for all types of AR systems, especially mobile ones.


• Touchless: The touchless interaction techniques introduced above, including gestures and speech, are also frequently used. No controllers or other physical objects are required, which makes it particularly suitable for wearable and static AR. Ideally, touchless techniques would provide a natural user experience, but as section 2.1 discussed, there is still much work to be done in terms of usability.

2.4 Usability testing

The International Organization for Standardization defines usability as the “extent to which a product or system can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use” (ISO, 1998). Usability is generally considered to be multi-dimensional, with multiple properties influencing overall usability. The following five attributes are commonly differentiated (Nielsen, 1993):

• Learnability: The effort and time necessary for novice users to learn to interact with the system. This is considered one of the most crucial factors of usability, since poor learnability may discourage many potential users. Easy-to-learn systems commonly display a steep initial learning curve.

• Efficiency: The ability of experienced users to perform tasks with the system at a high, consistent level of performance. If a system is inefficient, even expert users will be unable to carry out tasks quickly and accurately.

• Memorability: The extent to which casual, infrequent users remember how to interact with the system. Of course, better learnability usually makes a system more memorable, but generally, good memorability ensures that casual users do not need to re-learn the entire system.

• Errors: The error rate of the system, as well as the ability of users to recover from errors. It is probably impossible to eliminate all minor user errors, but it should be easy to discover and reverse them. Catastrophic errors that remain undiscovered or destroy previous work should not occur.

• Satisfaction: The pleasantness of interacting with the system. This is the most subjective attribute and is influenced by the other four properties.


Usability testing refers to evaluating a system on usability with potential users. Tests may evaluate a system in terms of one of the usability attributes mentioned above, or evaluate general usability. There are two main concepts of usability testing, summative and formative. They each have a different focus and goals, but they ultimately complement each other to ensure more usable systems.

Summative evaluations revolve around measurements (Lewis, 2006). Summative user tests resemble traditional experiments in that they generally involve participants completing tasks in a formal setting, after which task performance is assessed. Metrics such as task completion, accuracy, completion time and error rate may be considered. But summative testing is not limited to objective measurements. Researchers are also interested in evaluating subjective factors, including satisfaction, which is usually assessed by a questionnaire. The aim of summative usability testing is to ensure that a system meets certain usability goals or to compare different systems in terms of usability, based on the elicited metrics.

On the other hand, the goal of formative usability testing is diagnosing usability issues in order to eliminate them from a system (Lewis, 2006). In formative user studies, evaluators observe users performing tasks and talk to them about their experience with the system in order to find usability problems. These can include common errors, difficulties and confusing functionalities. Formative studies are often less formal than summative ones.

The two concepts of usability testing are not mutually exclusive. The development of a system may call for one or the other type to be used at different stages, or even within the same round of user testing. A usability study may have both formative and summative goals. In this way, tests can both quantify usability and detect areas that require usability improvement.

The following sections discuss two techniques used in this thesis to assess usability: the think-aloud method for eliciting users’ thought processes while interacting with a system, and the System Usability Scale, a popular usability questionnaire.

2.4.1 Think-aloud method

The think-aloud method is an observational technique frequently used in usability evaluations. The method asks users to think aloud while performing tasks, including verbalizing which actions they are taking, what they think is happening, why they are taking specific actions and what they are attempting to do (Dix, Finlay, Abowd, & Beale, 2004). This technique offers evaluators information on users’ underlying cognitive processes (Yen & Bakken, 2009). The think-aloud protocol is part of formative usability testing, in that it helps to identify problems of the interface. Indeed, it should not be combined with performance tests, because verbalizing may pose a burden on users and slow them down (Nielsen, 1993).

There are multiple benefits to this protocol. It is simple to set up and observe. It also requires a fairly small number of users. Nielsen (1994) has shown that a user study with about 5 users can be expected to find 75% of usability problems, and 10-15 users allow almost all problems to be discovered. However, as noted above, thinking aloud can be difficult for some people, and some users may need to be prompted to keep talking with questions.

Additionally, thinking aloud may influence how users perform tasks, meaning that the observations are not wholly natural (Nielsen, 1993).

2.4.2 System Usability Scale

The System Usability Scale (SUS) is a summative method for assessing user satisfaction (Brooke, 1996). It has been widely used in user studies and shown to be valid and reliable from about 8-12 users (Brooke, 2013). The questionnaire was designed to be quick and simple to administer. It consists of ten statements which users score on a five-point Likert scale, from “strongly disagree” to “strongly agree”. Half of the statements are positively-phrased and half are negatively-phrased. These alternate to ensure users must contemplate their answers:

1. I think that I would like to use this system frequently.

2. I found the system unnecessarily complex.

3. I thought the system was easy to use.

4. I think that I would need the support of a technical person to be able to use this system.



5. I found the various functions in this system were well integrated.

6. I thought there was too much inconsistency in this system.

7. I would imagine that most people would learn to use this system very quickly.

8. I found the system very cumbersome to use.

9. I felt very confident using the system.

10. I needed to learn a lot of things before I could get going with this system.

Figure 2.2: Interpreting SUS scores (Bangor et al., 2009)

Users’ answers are used to compute a score from 0-100. The SUS score allows comparisons to be made between different systems or different versions of systems. 68 is considered the average SUS score. In addition, Bangor, Kortum, and Miller (2009) have proposed various ways that SUS scores can be interpreted (see figure 2.2). Scores can be interpreted as acceptable, marginally acceptable or unacceptable; given a letter grade; and described with an adjective.
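The standard scoring rule can be written down in a few lines: each odd-numbered (positive) item contributes its rating minus one, each even-numbered (negative) item contributes five minus its rating, and the sum is multiplied by 2.5 to reach the 0-100 range. The sketch below implements this rule; the example ratings are invented.

    def sus_score(ratings):
        """Compute the SUS score (0-100) from ten Likert ratings (1-5), given in question order."""
        if len(ratings) != 10 or not all(1 <= r <= 5 for r in ratings):
            raise ValueError("SUS needs ten ratings between 1 and 5")
        contributions = [
            r - 1 if i % 2 == 0 else 5 - r   # items 1,3,5,7,9 are positive; 2,4,6,8,10 negative
            for i, r in enumerate(ratings)
        ]
        return sum(contributions) * 2.5

    # Example: a fairly positive (invented) response pattern.
    print(sus_score([4, 2, 4, 1, 4, 2, 5, 2, 4, 2]))   # -> 80.0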

2.5 Devices

This section introduces the devices that were used in the usability studies featured in this thesis.

2.5.1 Leap Motion Controller

The Leap Motion Controller (LMC) (figure 2.3) is a small device (79mm x 13mm x 30mm) developed by Leap Motion, Inc. (https://www.leapmotion.com/) to detect and recognize hand gestures. It contains three infrared LED lights and two cameras that capture an area of roughly 80cm above the device (Colgan, 2014). Data from the Controller is sent via USB to a host computer. This computer then uses software from Leap Motion to construct a 3D representation of a user’s hands. The software uses algorithms to form the 3D model and deduce the positions of obscured parts (such as fingers), though the exact techniques are not publicly available.

Figure 2.3: Leap Motion Controller (from https://www.leapmotion.com/press#117)

2.5.2 HoloLens

The Microsoft HoloLens (figure 2.4) is the first commercially available AR headset (Evans, Miller, Pena, MacAllister, & Winer, 2017). The see-through glasses allow the wearers to see holograms in the real environment. It is a battery-powered, self-contained system running Windows 10 (for full technical specifications, see https://www.gamespot.com/articles/microsoft-hololens-specs-and-features-detailed/1100-6435187/). It uses cameras and sensors to map environments and track the user’s position.

The HoloLens is designed for touchless interaction. It supports three input types: gaze, gestures and voice (Microsoft HoloLens Team, 2016). Gaze is used in a similar way as a cursor, to target objects. Actions can be performed on these targets by performing the “air tap”, a gesture where a tapping motion is performed with the raised index finger. The air tap is similar to a mouse click. If the downward position of the finger is held, a click-and-drag functionality is activated. The only other recognizable gesture is the “bloom”, where a user first holds all finger tips closely together pointing upwards and then spreads all fingers out. This gesture is reserved for accessing the Windows 10 menu.

Finally, the speech engine allows applications to recognize voice commands.
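Conceptually, this input model pairs a gaze-determined target with press and release events from the air tap. The following sketch only illustrates that select-versus-drag logic; it is not the HoloLens API, and the hold threshold is an arbitrary illustrative value.

    HOLD_THRESHOLD = 0.4   # seconds; shorter presses count as a tap (select), longer as a drag

    class GazeTapDispatcher:
        """Route air-tap press/release events to the object currently targeted by gaze."""

        def __init__(self):
            self.gaze_target = None
            self.pressed_at = None
            self.pressed_target = None

        def on_gaze(self, target):
            self.gaze_target = target            # gaze acts like a cursor hovering over an object

        def on_press(self, timestamp):
            self.pressed_at = timestamp          # finger goes down: remember time and target
            self.pressed_target = self.gaze_target

        def on_release(self, timestamp):
            held = timestamp - self.pressed_at
            gesture = "drag_end" if held >= HOLD_THRESHOLD else "select"
            target = self.pressed_target
            self.pressed_at, self.pressed_target = None, None
            return gesture, target

    dispatcher = GazeTapDispatcher()
    dispatcher.on_gaze("hologram_skull")
    dispatcher.on_press(timestamp=10.00)
    print(dispatcher.on_release(timestamp=10.10))   # -> ('select', 'hologram_skull')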


Figure 2.4: Microsoft HoloLens (from https://upload.wikimedia.org/wikipedia/commons/0/02/Ramahololens.jpg)

2.6 Related work

In recent years, usability studies have been carried out on several touchless interactive applications. This section discusses some of the studies that are most relevant for this thesis, in three broad categories. The first includes research on touchless interaction with medical image viewers. These studies specifically address usability in the OR, and are often conducted with surgeons or other medical experts. Because of this project’s focus on 3D images, a second section examines research on 3D touchless interaction.

These studies typically do not have a medical background in mind, but they offer important insights into how to interact touchlessly with 3D volumes. Finally, the last section discusses the few studies that have been conducted about touchless interaction in augmented reality.

2.6.1 Touchless interaction with medical images

Many research groups are interested in the potential of using touchless interaction to control image viewers during surgery. A recent review revealed more than 30 such systems described in the scientific literature (Mewes et al., 2017). Several systems only feature browsing through sets of images, without supporting further essential image manipulation techniques, and other systems did not undergo any usability testing; these are excluded from this discussion. A selection of the remaining studies is discussed here to identify the most important insights about the usability of touchless medical image viewing systems.

The Kinect, a motion sensing device developed by Microsoft, has been a popular device for gesture recognition research. For example, Tan, Chao, Zawaideh, Roberts, and Kinney (2013) used it to recognize gestures that controlled a cursor for a 2D image viewer. Various tools (such as zoom, measure or scroll) could be selected by moving the cursor to the tool icon with the right hand. The left hand could then be raised and lowered to imitate “mouse up” and “mouse down” commands and perform the desired image manipulations. A usability study with 29 radiologists revealed that they found most tasks easy or moderately difficult. Despite some recognition problems, 69% of users thought that the system would be useful in interventional radiology procedures, which the authors saw as a promising result for touchless interaction.

The Kinect is suited for recognizing fairly large gestures involving hands or arms. By contrast, the Leap Motion Controller (LMC), another popular gesture recognition device, is able to detect smaller and more detailed hand and finger positions. Mewes, Saalfeld, Riabikin, Skalej, and Hansen (2016) developed an LMC-controlled 2D and 3D image viewer. Five fine hand gestures were used for functions like 3D rotation, zooming, and pointing and clicking. A user study with 9 participants showed that users preferred easy and robust gestures, including moving a fist to pan and zoom. Some gestures were harder for the LMC to recognize consistently, which reduced their usability considerably.

This included the clicking gesture: tapping the thumb. As a whole, the participants said the gestures were well-suited to the task, but were prone to errors. Mewes et al.’s user study is noteworthy because it evaluated the usability of the individual gestures, not just the application as a whole, allowing a more detailed understanding of their touchless interaction paradigm.

A study by Nestorov, Hughes, Healy, Sheehy, and O’Hare (2016) is interesting because it compared the LMC, the Kinect and a traditional computer mouse for interacting with a 2D image viewer in a formal usability study. 10 radiology residents completed tasks that involved image selection, zooming and measuring. Completing the tasks took about the same amount of time in both touchless conditions, but considerably less time with the computer mouse. In terms of accuracy, the LMC and the computer mouse performed similarly, while the Kinect did comparatively poorly. Additionally, 40 surgeons and radiologists tested the system and rated its usability with the SUS. The LMC and Kinect rated similarly, with marginally acceptable scores of 63.4 and 66.1 respectively. These results indicate that the LMC and Kinect are comparable for use in gestural interaction, but the LMC might offer better accuracy.

The fact that both touchless conditions were slower than using a mouse may be an obstacle to adopting touchless interaction in surgery, as the ultimate goal is to improve the image viewing process. However, in real procedures, surgeons are less likely to use a mouse because of sterility issues, and more likely to give assistants instructions to use the image viewer. Therefore, if touchless interaction outperforms delegation to an assistant, it would still provide major benefits, even if it is slower than using a mouse.

Wipfli, Dubois-Ferrière, Budry, Hoffmeyer, and Lovis (2016) tested this assumption.

They performed a formal usability study with 30 participants with a medical background.

Three interaction modes with a DICOM viewer were tested: with a mouse, with gestures from a Kinect, and by giving oral instructions to a third person. Tasks included scaling, rotating and changing the contrast of images. Usability was evaluated on efficiency, effectiveness and satisfaction.

Tasks completed with a mouse were more than twice as fast as the other conditions, but using gestures was also significantly faster than directing a third person. Satisfaction was best for using a mouse, but still good for the gesture condition. No significant difference in error rates was found.

This study is important because it offers support for using touchless interaction, as it seems to be more efficient and easier to use than relaying instructions to an assistant.

However, gestures performed significantly worse than mouse-based interaction in Wipfli et al.’s study. The paper does not elaborate on the kinds of gestures they used, so it is plausible that “better” gestures (and the possible addition of voice control) could reduce the gap between touchless interaction and traditional input methods.

All of the previously mentioned image viewing systems were operated with gestures.

While gestures are the most common touchless input modality, they are not the only one. A few studies have looked at systems with multimodal touchless interaction. One example is by Ebert, Hatch, Ampanozi, Thali, and Ross (2012). Their system used gestures identified by the Microsoft Kinect and voice recognition software from Apple Voice for control of 2D medical images. They used simple gestures, moving the flat palm of one or both hands vertically or horizontally. These gestures could be used in three different modes: repositioning the image, scrolling through the data set, and adjusting the window. Voice commands were used to switch between the operating modes, to apply predefined settings, to select different data sets, and to toggle gesture control on or off. Gestures and speech thus control mostly separate functionalities, except when using speech to select a predefined setting.

10 medical professionals took part in a usability study that compared touchless interaction with mouse control. The participants were asked to recreate screenshots with the system. On average, it took about 1.4 times longer to complete the task with gestures and voice than with a mouse. This difference is less than in the Wipfli study discussed above. Participants also became over 20% faster on average throughout the study using the touchless gestures. Participants had spent about 10 minutes on average on training with the system. This is a promisingly short time, but more practice would probably have led to better results for the touchless system. This study indicates the importance of users receiving enough time to learn the gestures and voice commands.

While the Ebert et al. prototype used both gestures and voice, the use of voice went mostly unanalyzed by the authors, with no discussion of how having a multimodal interface affects users. The benefits of multimodal interaction are often extolled, but there is a lack of research examining how surgeons use systems when multiple inputs are available. The exception is a study by Mentis et al. (2015). They implemented a system where all functionalities could be achieved either with gestures (recognized by a Kinect) or with voice commands. To use voice commands for continuous actions, the system used a start-stop model; for example, a user could say “Clip In” to begin clipping, and then “Stop” at the desired clipping plane. This action could also be realized by a hand gesture.

This redundancy of inputs allowed Mentis et al. to compare when surgeons choose to use gestures or voice. Unfortunately, they only performed a single use case with one surgeon, though it did occur during a real surgery. The surgeon used both input modalities during the use case. From this, the authors concluded that one cannot say that one input is always better suited for a certain functionality, but that the benefits of each input are circumstantial. They say that surgeons may be more likely to use gestures away from the patient table, as their hands are free. Though this insight is interesting, the lack of formal analysis and the single participant considered does limit its impact.

Still, the fact that the system was observed during a real surgery is an asset. Most touchless interaction systems for surgery are tested in usability laboratory settings, allowing concrete measurements of usability. Trials conducted during real procedures are often less formal, like the Mentis et al. study, but they provide validation for the real-world potential of touchless interaction.

Rosa and Elizondo (2014) also made use of real surgeries for usability testing. They developed a touchless interaction system for intraoperative use in dental surgery. An LMC was used to recognize gestures to manipulate images on a computer screen. The images included both preoperative 2D images and 3D simulation models used to plan the procedure, as well as intraoperative images. Interaction was primarily based on pointing a finger at the computer, which made a cursor appear onscreen. To select an object or option, the finger was moved closer towards the screen. Some functions required two fingers. These gestures allowed navigating through images, zooming, rotating, measuring, and adjusting contrast and brightness.

The system was tested during 11 dental surgery procedures. Though the usability of the individual gestures was not specifically addressed, the authors concluded that, overall, the surgeons had easy access to and direct control over the necessary images. A “perceived increase” in the number of times the images were consulted was also reported, but this did not have an effect on the durations of the surgeries. It is not mentioned how much training the surgeons received with the system, but the authors recommend that users complete four to six 30-minute training sessions in order to fully make use of the touchless interaction.

2.6.2 Touchless interaction in 3D

As discussed above, computer interaction in 3D is often not as straightforward as 2D interaction. Therefore, it is important to consider some of the usability issues presented by touchless interaction with 3D objects. The studies examined in this section are not about medical systems, but they offer insight into how to design touchless interaction with 3D medical images.


A study published by Škrlj, Bohak, Guna, and Marolt (2015) performed a formal usability evaluation that compared a regular mouse and keyboard, a 3D mouse with six degrees of freedom, and gestures with the LMC for manipulating 3D images. 29 participants were required to match the orientation and zoom level of a (non-medical) image.

In the touchless condition, scale and rotation were mapped to the movement of the user’s hand. Task completion times were recorded, and participants filled out the SUS.

Results showed that solving tasks with the normal mouse and the 3D mouse took about the same amount of time, while tasks completed with the LMC took 1.8 times longer.

The 3D mouse scored best on the SUS, while the Leap scored only “ok”. Two reasons were presented for this. Firstly, some of the participants had trouble remembering the correct gestures. More importantly, the lack of haptic feedback inherent to touchless interaction caused confusion. Nevertheless, there were some participants for whom the touchless interaction felt natural. The authors concluded that the LMC was not yet mature enough for widespread use, but that it did offer promising new kinds of interfaces.

Kirmizibayrak et al. (2011) also compared touchless and traditional inputs for 3D rotation. Two-handed gestures recognized by a Kinect were used as a metaphor of holding the object from its sides and turning it. In a usability study, 15 participants matched the rotation of target images using these gestures or a normal mouse. The gestures were significantly faster and more accurate than the mouse. This stands in contrast to the Škrlj et al. study, and shows that the choice of gestures matters. The two-handed rotation apparently allowed participants to outperform the mouse condition, which was not possible with Škrlj et al.’s one-handed approach. This study indicates that with the appropriate gestures, touchless interaction can serve as a substitute for traditional input methods.

Caggianese, Gallo, and Neroni (2016) investigated different 3D manipulation techniques. Their system involved a VR headset and a LMC to recognize gestures. They were interested in how users can position and rotate 3D objects. To study this, they compared two techniques. In the direct version, a user could move or rotate an object in all directions at once. The position or rotation was mapped to the user’s hand movement, so, for example, moving the hand forward diagonally would also move the object forward diagonally. By contrast, the constrained technique meant that a user had to first select the axis along which to move or rotate the object, after which the hand movements only affected the object along that axis. For example, selecting the X axis meant that the gestures would only move the object to the left and right.
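
To make the distinction concrete, the following sketch contrasts the two positioning techniques in Unity-style C#. It is an illustration of the general idea only, not Caggianese et al.’s implementation; the class, the handDelta parameter and the sensitivity value are hypothetical.

```csharp
using UnityEngine;

// Illustrative sketch (not Caggianese et al.'s code): direct vs. axis-constrained
// positioning of a 3D object driven by tracked hand movement.
public class PositioningTechniques : MonoBehaviour
{
    public Transform target;                         // the manipulated 3D object
    public float sensitivity = 0.001f;               // metres of object movement per millimetre of hand movement
    public Vector3 constrainedAxis = Vector3.right;  // axis selected by the user beforehand (constrained mode)

    // Direct technique: a diagonal hand movement produces a diagonal object movement.
    public void MoveDirect(Vector3 handDelta)
    {
        target.position += handDelta * sensitivity;
    }

    // Constrained technique: only the component of the hand movement along the
    // preselected axis moves the object.
    public void MoveConstrained(Vector3 handDelta)
    {
        float along = Vector3.Dot(handDelta, constrainedAxis.normalized);
        target.position += constrainedAxis.normalized * along * sensitivity;
    }
}
```

In the constrained variant, any hand movement perpendicular to the chosen axis is discarded, which is what makes it slower for free positioning but easier to control for precise adjustments.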

In a user study, 10 participants were asked to move objects to specified locations and rotate objects to match certain orientations. The direct and constrained techniques were compared, and the SUS was completed for both. For positioning objects, users preferred direct control of all dimensions, and considered the constrained version too slow. However, the rotation task was more complex, and here, users preferred rotating along only one axis at a time. The constrained rotation achieved a SUS score of 61.34, compared to only 49.86 for the direct technique.

This is an important finding for 3D touchless manipulation. It may also help explain why the touchless condition in the Škrlj et al. study above performed badly. Their rotation technique was direct; perhaps users would have rated the LMC higher if they had been able to rotate in a constrained manner.

Coelho and Verbeek (2014) also looked at 3D object positioning. They compared gestures with the LMC (controlling all dimensions at once) to a mouse for moving an object to a specified target in 3D space. A usability study with 35 participants revealed that completing the task took 40% less time with gestures than with the mouse, verifying that touchless interaction can perform well for 3D manipulation.

2.6.3 Touchless interaction in augmented reality

One of the very few usability studies on touchless interaction in medical AR was conducted by Frikha, Ejbali, Zaied, and Ben Amar (2015). Gestures were used to interact with a 3D virtual heart in their screen-based AR system. There were four functions: attenuation, scaling, rotation and displaying anatomical text labels. To attenuate, a user held up one finger on one hand; to scale, two fingers; and so on.

An informal user study with 15 participants led the researchers to conclude that the users had no problems interacting with the heart and could manipulate a virtual object “in the same way as real objects”. This claim seems far-fetched, especially considering that the number of fingers corresponding to a command is arbitrary. Participants could choose which gesture to perform, so there was no test of whether the users knew which command was associated with which gesture. This is a major limitation of the study.

For more detailed evaluations of the usability of touchless interaction in AR, non-medical systems must be considered. For example, while Radkowski and Stritzke (2012) were primarily interested in assembly tasks in AR, their findings can be applied to touchless interaction in AR in general. The study used a video see-through spatial screen to display the augmented environment and a Microsoft Kinect to recognize hand gestures.

The interface had two modes of operation. The direct mode allowed fast translation and rotation by mapping hand movements to object movements. The precise mode let users perform more fine-tuned tasks by selecting axes on which to rotate, translate or scale. In both modes, a 3D virtual cursor followed the position of the user’s hand, and closing the hand to a fist allowed objects to be selected. The use of the two modes is very similar to that in the Caggianese et al. (2016) study mentioned above.

In a usability study, 15 participants were asked to carry out a six-step assembly task after a few minutes of practice. The users could choose when to use the direct or the precise mode. All were able to complete the task. A questionnaire showed that the users did not experience much difficulty in interacting with the system. However, not all of the participants used the two modes as intended, often using the precise mode for all interactions, despite results showing that this mode took more time than the direct mode. This could indicate that users found it difficult to grasp the difference between the modes, and perhaps that this interaction paradigm needs to be rethought. Nevertheless, this study demonstrated that hand gestures could be used to perform precise tasks in AR.

Another non-medical study by Lee, Billinghurst, Baek, Green, and Woo (2013) provides relevant information on combining gestures and speech in AR. They hypothesized that a multimodal interface would be more efficient, effective and satisfactory than using speech or gestures alone. They tested this with a task requiring users to change the shape, color and direction of virtual objects. Gestures consisted of pointing, an open hand and a closed hand, while speech input consisted of commands like “cylinder”, “green” and “up”. The AR environment was shown on a screen.

A usability study with 20 users found that while the multimodal interface outperformed the gesture interface in terms of task completion time, it was not significantly faster than using speech only, nor did it lead to fewer errors. However, subjectively, the users felt the multimodal mode was easier, more natural and more effective than either the gesture or the speech mode. This points to the benefit of using both interaction modes in AR.

The gestures used in this study were fairly rudimentary, which might have counteracted some of the advantages of multimodal interaction; it is plausible that a more developed gestural interaction system could lead to an even better performance for multimodal interfaces.

Another multimodal interface was tested by Manuri, Piumatti, and Torino (2015). Their system featured a LMC mounted on top of a pair of optical see-through smartglasses. Both gestures and speech recognition were implemented. The system was evaluated in a very simple use case, which involved playing and pausing an augmented video and a 3D animated model, as well as leafing through the pages of a book. Seven users participated, and the study was performed in both a neutral setting and in a noisy environment.

In the neutral setting, participants preferred voice controls because they were simple, whereas the gestures needed to be learned and were harder to reproduce. In the noisy setting, speech recognition was impaired, so users switched to gestures. This implies that multimodal interfaces offer robust input methods. While important to the understanding of multimodal touchless interaction, Manuri et al.’s research should not be taken to mean that speech will always be preferable to gestures; for more complex 3D tasks, such as rotation or slicing, speech may be limited.
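
As a rough sketch of how such redundant input channels can be wired together, the example below registers a small speech vocabulary with Unity’s KeywordRecognizer while a separate gesture path triggers the same actions, so either channel can take over when the other is impaired. The command words, gesture names and Play/Pause placeholders are hypothetical and are not taken from Manuri et al.’s system.

```csharp
using UnityEngine;
using UnityEngine.Windows.Speech;   // speech keywords, available on Windows 10 builds

// Sketch of a redundant multimodal interface: the same actions can be reached by
// voice or by gesture, so either channel can take over when the other is impaired.
public class MultimodalCommands : MonoBehaviour
{
    private KeywordRecognizer recognizer;

    void Start()
    {
        recognizer = new KeywordRecognizer(new[] { "play", "pause" });
        recognizer.OnPhraseRecognized += OnPhraseRecognized;
        recognizer.Start();
    }

    private void OnPhraseRecognized(PhraseRecognizedEventArgs args)
    {
        if (args.text == "play") Play();
        else if (args.text == "pause") Pause();
    }

    // A gesture recognizer (e.g. Leap Motion pose detection) would call the same
    // methods, so the mapping from input to action stays in one place.
    public void OnGesture(string gestureName)
    {
        if (gestureName == "swipe") Play();
        else if (gestureName == "fist") Pause();
    }

    private void Play()  { Debug.Log("Playing content"); }   // placeholder actions
    private void Pause() { Debug.Log("Pausing content"); }

    void OnDestroy()
    {
        if (recognizer != null) recognizer.Dispose();
    }
}
```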

2.6.4 Conclusion

Various studies indicate that surgeons and other medical experts believe touchless interaction to be an asset for the operating room, allowing them direct control over images. It has been shown to be more efficient and usable than instructing an assistant to operate the image viewer. These are promising results, but it must also be mentioned that many applications have only been tested with small, informal usability studies. More extensive user testing, especially during real surgeries, may reveal more insights into the usability of touchless interaction.

The two most popular gesture recognition devices, the Kinect and the Leap Motion Controller, have both been shown to support reliable touchless interaction systems. Results show that they are largely comparable in terms of usability, but the LMC may be more accurate. However, some gestures are less robust because the LMC cannot recognize them reliably, which should be taken into account when developing the gestural design.

In general, it seems clear that some gestures are better than others and should therefore be chosen carefully. With a suitable set of gestures and enough training time to learn them, the difference between traditional input methods and touchless interaction becomes smaller. For certain 3D tasks, touchless interaction may even be superior.

Most touchless medical image viewers feature gestures as their main interaction paradigm. A few also make use of voice, but for the most part, they do not evaluate how gestures and voice are used together multimodally. The little research that is available indicates that users appreciate having multiple inputs available and that they perceive multimodal interaction as more usable. More extensive research in medical settings could help substantiate this.

Some new challenges are expected for touchless interaction in 3D and augmented reality, but initial results suggest that it is suitable for these settings. As more 3D and AR systems are developed, care must be taken to make 3D interaction usable.

Part 1 - 3jector

3.1 Introduction

This chapter describes the 3jector system and discusses the usability study that was performed to evaluate it with users. The system was developed as a collaboration between the 3D lab of the University Medical Center Groningen (UMCG) and COSMONiO1, a company interested in artificial intelligence and computer vision. Their project focused on gestural interaction with 3D medical images in the operating room. The result was the “3jector” prototype, whose name is a portmanteau of 3D and projector. 3jector allows 3D medical images to be projected onto a wall in the OR and manipulated with gestures recognized by the Leap Motion Controller. Its goal is to give surgeons easy, touchless access to medical images during surgery, ultimately leading to more successful procedures.

A usability study was conducted with surgeons and radiologists to assess how well the prototype performs and how users experience gestural interaction, and to find usability problems that would hinder its acceptance in real-life situations.

1 http://www.cosmonio.com/

3.2 System

3.2.1 Architecture

The 3jector prototype runs on a Windows 10 computer and relies on an Optoma ultra-short-throw projector to project a 3D image onto a wall, where it can be viewed with 3D glasses. A Leap Motion Controller (Orion Software version 3.2.1) is connected via USB. It detects gestures in an approximately 80 cm range (the interaction area). The software was developed using the game engine Unity (Version 5.4.1). The gestures and software were largely designed by Lingyun Yu, then a post-doc at the UMCG. Some additional features were implemented for this thesis.
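
The sketch below illustrates how such an interaction area can be enforced in Unity: the LMC is polled for tracking frames and hands whose palm lies farther than roughly 80 cm from the device are ignored. It uses the classic Leap C# Controller/Frame polling interface and a hypothetical threshold; it is not the actual 3jector code.

```csharp
using Leap;
using UnityEngine;

// Sketch: poll the Leap Motion Controller every frame and only consider hands
// that fall inside the interaction area (roughly 80 cm from the device).
public class InteractionArea : MonoBehaviour
{
    private Controller controller;
    private const float MaxRangeMm = 800f;   // Leap coordinates are in millimetres

    void Start()
    {
        controller = new Controller();
    }

    void Update()
    {
        Frame frame = controller.Frame();
        foreach (Hand hand in frame.Hands)
        {
            if (hand.PalmPosition.Magnitude > MaxRangeMm)
                continue;                    // hand is outside the interaction area

            // Hands inside the range would be passed on to gesture recognition here.
            Debug.Log((hand.IsLeft ? "Left" : "Right") + " hand in range");
        }
    }
}
```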

3.2.2 GUI

Figure 3.1 shows the main screen of the graphical user interface (GUI) of the 3jector system. The image is at the center of the screen (A). In 3D mode, this image is three-dimensional and can be viewed with 3D glasses. Three axes in red, blue and green (B) are incorporated into the image to show the dimensionality. A cube (C) in the lower right corner has the same colors on its sides, which helps the user understand the orientation of the image. The upper left corner displays information on the file name and current mode (D). The upper right corner features an icon (E) corresponding to the action being performed. In Figure 3.1, the action icon represents free rotation.

When a user’s hands are within range of the LMC, hand models (F) are shown on the screen. In this way, the user sees a system representation of their hand(s) in real time, which is especially useful for checking that the system is accurately recognizing the performed gestures.

In the lower middle of the screen, a semicircle represents the menu (G). When the menu is opened (Figure 3.2), the user may choose between four options: three modes (Navigation, Brightness or Cutting) and the DICOM menu. The DICOM menu (Figure 3.3) shows thumbnails of all available image files. Selecting a file will load the image on the main screen in Navigation mode.

Figure 3.1: 3jector’s graphical user interface

Figure 3.2: Menu

3.2.3 Gestures

This section discusses the gestural interface. The possible actions and their effects are listed below, categorized by mode. The correct gesture for each action is given in the following format: the number of hands involved, the hand pose, and the movement. If an action requires only one hand, the other hand should be outside of the LMC’s interaction area. Some actions are always available, such as selecting from the menu, while others can only be performed if the correct mode is active, such as adjusting contrast. Furthermore, some gestures may execute different actions in different modes.

Figure 3.3: DICOM menu

The gestures rely on four basic hand poses:

• Palm: All fingers and the thumb are extended and the hand is held palm down

• Pointing: The index finger is extended, the other fingers and thumb are closed

• Pinched: The thumb touches the tip of the index finger, the other fingers are held loosely

• Fist: All fingers and thumb are closed
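
As an illustration, the sketch below shows one way these four poses could be distinguished from Leap Motion hand data, using the number of extended fingers together with the built-in pinch and grab strengths. The thresholds are illustrative guesses and not the values used in 3jector.

```csharp
using System.Linq;
using Leap;

public enum HandPose { Palm, Pointing, Pinched, Fist, Unknown }

// Sketch: classify a tracked Leap Motion hand into one of the four basic poses.
// The 0.8/0.9 thresholds are illustrative guesses, not 3jector's actual values.
public static class PoseClassifier
{
    public static HandPose Classify(Hand hand)
    {
        int extended = hand.Fingers.Count(f => f.IsExtended);

        if (hand.GrabStrength > 0.9f)        // all fingers and thumb closed
            return HandPose.Fist;
        if (hand.PinchStrength > 0.8f)       // thumb touching the tip of the index finger
            return HandPose.Pinched;
        if (extended == 1 &&
            hand.Fingers.First(f => f.IsExtended).Type == Finger.FingerType.TYPE_INDEX)
            return HandPose.Pointing;        // only the index finger is extended
        if (extended == 5)
            return HandPose.Palm;            // open, flat hand
        return HandPose.Unknown;
    }
}
```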

Menu

These actions are available in all modes.

Table 3.1: Actions for the menu

Action | Effect | Gesture
Pointing | Menu option that is pointed at is highlighted in yellow | One hand, pointing at object, static
Selecting | Selected menu option opens (DICOM menu opens, or mode is activated) | One hand, pointing with thumb extended, static

Navigation

Except for constrained rotation, these actions are also available in the Cutting and Brightness modes.

Table 3.2: Actions for Navigation mode

Action | Effect | Gesture
Free rotation | Image rotates; user controls three dimensions simultaneously | One hand, palm, panning
Scaling | Image gets smaller or larger | Two hands, pinched, moving toward or away from each other
Resetting | Image returns to default orientation and size | Two hands, fists, static
Constrained rotation | User controls one dimension at a time: in and out, image rotates over x-axis; horizontally, image rotates over y-axis; vertically, image rotates over z-axis | One hand, pinched, static; second hand, palm, moving forward and backward, horizontally or vertically
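
To illustrate how the constrained-rotation gesture in Table 3.2 can be interpreted, the sketch below assumes that one hand holds a pinch while the palm movement of the second hand is reduced to its dominant direction, which then selects the single axis to rotate over. The axis mapping follows the table above, but the class and sensitivity value are hypothetical simplifications rather than the actual 3jector source.

```csharp
using UnityEngine;

// Sketch: constrained rotation driven by the second (palm) hand while the first
// hand holds a pinch. The dominant component of the palm movement decides which
// single axis the image rotates over, following Table 3.2.
public class ConstrainedRotation : MonoBehaviour
{
    public Transform image;            // the projected 3D image
    public float sensitivity = 0.5f;   // degrees of rotation per millimetre of palm movement

    // "palmDelta" is the frame-to-frame change of the second hand's palm position
    // (x: horizontal, y: vertical, z: in/out), e.g. taken from Leap tracking data.
    public void Apply(bool firstHandPinched, Vector3 palmDelta)
    {
        if (!firstHandPinched) return;   // constrained rotation only while the first hand pinches

        float ax = Mathf.Abs(palmDelta.x);
        float ay = Mathf.Abs(palmDelta.y);
        float az = Mathf.Abs(palmDelta.z);

        if (az >= ax && az >= ay)        // in and out: rotate over the x-axis
            image.Rotate(Vector3.right, palmDelta.z * sensitivity, Space.World);
        else if (ax >= ay)               // horizontal: rotate over the y-axis
            image.Rotate(Vector3.up, palmDelta.x * sensitivity, Space.World);
        else                             // vertical: rotate over the z-axis
            image.Rotate(Vector3.forward, palmDelta.y * sensitivity, Space.World);
    }
}
```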
