Watch your tongue and read my lips: a real-time, multi-modal visualisation of articulatory data


Erasmus Mundus European Masters Program in

Language and Communication Technologies

Master’s Thesis

Watch your tongue and read my lips: a real-time, multi-modal visualisation of articulatory data

Kristy James

s2615754

supervised by

Dr. Ingmar Steiner and Dr. Martijn Wieling


I hereby confirm that the thesis presented here is my own work, with all assistance acknowledged.

Amsterdam, 6 May 2016


This thesis presents a new software tool for the visualisation of EMA data, using 3D animation in a game engine. This tool displays the movement of articulators in real-time, extrapolating from point-tracking data to a basic representation of tongue and lip movement, and is also able to induce a palate trace from streamed data, with plans to include more accurate tongue models in the future. The tool is written in Python and reads data into Blender, an animation and game engine, in real-time. In addition, Blender game-like resources have been developed, so that a face 'scene' is provided, whose appearance and behaviour the user can fully customise to their own needs. The tool can both display pre-recorded data from various data formats, which may be of use in demonstrating recorded data from different speakers, and stream live data from an NDI WAVE machine, which could be adapted to provide online feedback for pronunciation training. In both modes, game controls allow the user to choose their preferred viewpoint and set game parameters, whilst the researcher can set other parameters before the streaming commences.

Furthermore, effort has been taken to incorporate several modalities: in static data mode, simultaneously recorded ultrasound videos can be overlaid on the image, and synchronised sound recording and playback is supported for live data.

The accuracy of this software's visualisation was tested in an online experiment that involved more than 110 participants: the subjects were challenged to 'speech read' three types of vocal tract visualisations, identifying the prompt and whether they were displaying matched or unmatched stimuli. In this experiment a competing two-dimensional visualisation (VisArtico) was found more effective, though this 3D system performed comparably to a third system that showed the data as dots; the experiment also gathered valuable feedback about aspects of the software.

The open-source nature of both this package and Blender, as well as the ease of scripting with Python, mean that this software could readily be adapted for experimenting with real-time feedback for pronunciation training or speech therapy, both by applying changes to its manipulation of the raw data and by experimenting with visual adaptations and feedback in the Blender GUI.


This work could not have been completed without the help and assistance of a great many people to whom I wish to express my gratitude. Firstly to my supervisors – to Ingmar Steiner for challenging me to use new technologies, providing steady advice and constant feedback and a willing ear whenever there were problems, and to Martijn Wieling for all the help with the articulograph, scientific and statistical advice, and for being ever-ready to share your resources with me and promote my work to others.

Thanks must also go to Mark Tiede, whose MVIEW scripts were used liberally for visualisations in this work, and to Mark Liberman who promoted the experiment on LanguageLog. To Brigitte Carstens for allowing me access to the Carstens Wiki; to Angelika Braun for use of the Trier dataset; to Fabrizio Nunnari for helping conquer Blender and for your debugging help; and to Simon de Vries for pre-processing the Matlab files.

To the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) for employing me as a student assistant during part of the thesis period, and for funding a trip to the workshop on 'Feedback in Pronunciation Training' run by Saarbrücken's IFCASL project.

To the members of the Multimodal Speech Processing group for being excellent office-mates and for providing excellent feedback, particularly Sébastien Le Maguer for the advice on network programming, and Alexander Hewer for your patient explanation of head-correction algorithms. To the other office HiWis for forgiving being inadvertently audio-recorded during software development of that feature, and for your camaraderie and humor; to the CoLi Systems Group for your technical assistance and for the inspiration for the title of this document; and to Tuur, Anastasiia, Slavomir, Kata and Samantha for pilot testing the experiment, and Tim and Rob for your feedback.

To Bobbye Pernice, Ivana Kruiff-Korbayova and Gosse Bouma for administrative support within the LCT Program.

To everyone in CoLi at Saarbrücken, whose friendship has made this year a fantastic one.

And to my family and Vincent, without whom none of this would have been possible.


1 introduction
1.1 Software development motivation
1.2 Intended audience and usage
1.3 Speech-reading experiment
1.4 Structure of this thesis

2 background
2.1 Vocal tract imaging techniques
2.1.1 Image-based techniques
2.1.2 Point-tracking techniques
2.2 Electromagnetic Articulography
2.2.1 How EMA works
2.2.2 Development of EMA systems
2.2.3 Combining EMA with other modalities
2.2.4 Drawbacks of EMA
2.3 Software for visualising the vocal tract
2.3.1 VisArtico
2.3.2 MVIEW
2.3.3 Opti-Speech
2.3.4 Other visualisations
2.3.5 Visualisations in other modes
2.4 Applications of EMA
2.4.1 Speech therapy
2.4.2 Pronunciation training for language learning
2.4.3 General use
2.5 Intelligibility of visual speech
2.5.1 Studying laypersons' interpretation of vocal tract visualisations
2.5.2 Studies on talking heads
2.5.3 Studies on visualisation intelligibility benefits

3 data manipulation
3.1 Data sources
3.1.1 EMA corpora
3.1.2 Common EMA data formats
3.1.3 Describing machine accuracy
3.2 Real-time vs. pre-recorded data processing
3.2.1 Scaling the visualisation
3.2.2 Palate and other traces
3.2.3 Identification of coil roles
3.2.4 Storing synchronised data
3.3 Smoothing the data
3.4 Head movement correction and change of base
3.4.1 Transformation to local coordinate system
3.4.2 Head-correction within global coordinate system
3.4.3 Recovering head movement
3.5 Rotation in the context of EMA
3.5.1 Describing rotation
3.5.2 Rotation information from EMA
3.5.3 Processing rotation within the application

4 software
4.1 Basic requirements
4.1.1 Networking requirements
4.1.2 Visualisation requirements
4.1.3 A note on open-source and the code
4.2 System architecture
4.2.1 Emulating a real-time server from static data
4.2.2 'Gameserver' client application
4.2.3 Game loop in Blender Game Engine (BGE)
4.3 Setting the Blender scene
4.3.1 Rigging and meshes
4.4 Communication between sub-modules

5 experimental evaluation
5.1 Introduction
5.2 Research questions and hypotheses
5.2.1 Abstractness and Learning
5.2.2 Difference Detection
5.3 Materials
5.3.1 Stimuli
5.3.2 Software
5.3.3 Video processing
5.4 Method
5.4.1 Question generation
5.4.2 Question presentation
5.4.3 Answer revelation
5.4.4 Qualitative judgements
5.4.5 Promotion
5.5 Results and analysis
5.5.1 Participant backgrounds
5.5.2 Subject variability
5.5.3 Prompt variability
5.6 Statistical analysis
5.6.1 Investigating the identification score
5.6.2 Investigating the difference detection

6.4 Pitfalls of the experiment
6.5 Returning to a broader context

7 conclusion
7.1 Software package
7.2 Speech-reading experiment

a survey prompt extraction scripts
a.1 Extraction of the prompts' timestamps and file numbers in the log file
a.2 Cutting of the WAV and TSV files based on the timestamps
a.3 Constructing the video in MATLAB
a.4 Converting AVI or MP4 files to web-compatible video formats
a.5 Prefixing filenames

b survey materials
b.1 Question random generation
b.2 Setting answer options and video sources
b.3 Answer display behaviour
b.4 Survey text

c survey data analysis
c.1 Pre-processing and importing results into R
c.2 Qualitative analysis
c.3 Descriptive statistics and plots

Table 1   Comparison of state-of-the-art EMA machines
Table 2   Development of EMA systems
Table 3   Screenshots of point-tracking tools
Table 4   VENI and Trier datasets
Table 5   List of speech corpora
Table 6   EMA file formats
Table 7   Comparison of MATLAB and Blender
Table 8   Word pair stimuli displayed in the experiment
Table 9   Fixed vs. random effects for analysis
Table 10  Mixed-model coefficients for the identification dataset
Table 11  Difference detection model coefficients
Table 12  Mean difference detection by system and match
Table 13  Selected survey comments

Figure 1   Transmitter and receiver coil locations
Figure 2   AG501 and NDI WAVE
Figure 3   Screenshot of the VisArtico GUI
Figure 4   Screenshot of the MVIEW GUI
Figure 5   Screenshot of the Opti-Speech system
Figure 6   Additional visualisation tools
Figure 7   Two visualisation systems
Figure 8   Two talking heads
Figure 9   Palate traces from VENI dataset
Figure 10  VENI biteplate
Figure 11  Coil tilting reducing effective surface area
Figure 12  Ematoblender's basic system architecture
Figure 13  GUI for the 'real time' server
Figure 14  Female model created using MakeHuman
Figure 15  Bones controlling face and lips
Figure 16  Weight painting for various bones
Figure 17  Screenshot of a video for MATLAB
Figure 18  Screenshot of a video for VisArtico
Figure 19  Screenshot of a video for Ematoblender
Figure 20  Example question page in the survey platform
Figure 21  Participant age
Figure 22  Language backgrounds of participants
Figure 23  Non-native years learning English by age
Figure 24  Participant gender
Figure 26  Scores by exposure and education level
Figure 25  Subject education and previous exposure levels
Figure 27  Time on question
Figure 28  Scores by question number
Figure 29  Mean page score per subject
Figure 30  Identification by prompt and system
Figure 31  Heat map of strict scores
Figure 32  Mean difference detection by stimulus pair
Figure 33  Qualitative feedback on Ematoblender

2D two-dimensional

3D three-dimensional

AIC Akaike information criterion

BGE Blender Game Engine

CSV comma-separated values

CT Computed Tomography

EMA Electromagnetic Articulography

EPG Electropalatography

fps frames-per-second

IK Inverse Kinematics

L2 second-language

MEG magnetoencephalography

MRI Magnetic Resonance Imaging

RMS Root Mean Square

RMSE Root Mean Square Error

SL side-lip

SLL side-lip-left

SLR side-lip-right

TB tongue back

TM tongue middle

TSV tab-separated values

TT tongue tip

UDP User Datagram Protocol

US ultrasound

XRMB X-ray microbeam

1 INTRODUCTION

This thesis was inspired by work being undertaken at the University of Groningen by my co-supervisor, Martijn Wieling, which used Electromagnetic Articulography (EMA) to quantify the differences in articulation that are responsible for the production of a foreign accent in English by Dutch speakers (Wieling et al., 2015a). Not only were these insights great conversation starters for a foreigner studying linguistics in the Netherlands, but I also became fascinated by the possibility of being able to show speakers this difference in real-time, particularly to aid second-language (L2) learners in their pronunciation, or for use during speech therapy. Happily there was interest at both the University of Groningen and Saarland University in a real-time visualisation system for use by researchers, and I hope that this thesis' final product caters both to their aims of faithfully reproducing data collected for scientific experiments, as well as being intuitive and interesting to any layperson (albeit one with access to an articulograph!) who may wish to use it with pedagogical aims.

In addition to the production of this software, this work aims to investigate the effects that different levels of abstractness in the representation of EMA data have upon subjects' ability to interpret the data.

The recent developments in three-dimensional (3D) vocal tract visualisations (eg. by Katz et al. (2014), this work) also raise questions relating to the efficacy of presenting 3D visualisations as compared to two-dimensional (2D) visualisations for the various purposes for which EMA is used, which motivated the work in chapter 5 (introduced in more detail in section 1.3). This comprises an online experiment in which laypersons 'speech-read' from silent vocal-tract visualisations with differing levels of abstractness, allowing a comparison of their identification of mono-syllabic English prompts as shown by different systems. This experiment is also used to verify the accuracy (relative to established technologies) with which the aforementioned software displays EMA data.

1.1 software development motivation

Many different methods of visualising vocal tract movements exist, and even for just one point-tracking technology, EMA, many different systems for visualising its output are already available as commercial or free applications (see section 2.3). This raises the question of how a new system can contribute to the field – indeed, this thesis was undertaken with the goal of making a novel contribution by:

• utilising open-source software where possible (namely animation in Blender and collecting and manipulating the data using Python) so that the product can be shared and accessed freely within the research community

• using a simplistic tongue model as a basis but allowing researchers to replace this with their own (more sophisticated) tongue models

• exploring multi-modal visualisation options such as integrating visual data (for example a video of the user's face), overlaying ultrasound data onto the animation, or playing back simultaneously recorded audio

• exploring different methods of indicating the goal pronunciation and whether this is achieved.

The catalyst for a new visualisation was that increased processing speeds in modern computers (even in more modest systems) allow for a visualisation using a game engine (such as Blender) that can render more lifelike graphics than previous systems, and there has been an explosion of recent developments in 3D renderings of point-tracking data (see section 2.3). Other work at Saarland University focuses on constructing realistic tongue and palate models (Hewer et al., 2014, 2015), and the addition of these models to the software is an exciting future development, as is the possibility of using the game engine, with the tongue as a controller, to construct 'serious games'; real-time feedback for pronunciation training is already being explored using other systems (Katz et al., 2014).

1.2 intended audience and usage

The design of this software is very clearly shaped by its intended users: principally researchers in phonetics, but also with potential use in the domains of foreign language learning or speech rehabilitation.

mere points in a fixed space as the experiment is performed. The real-time nature of the response means they see these articulations directly rather than recording the data and analysing it at a later date. The researcher is therefore better placed to detect irregularities, such as coils detaching, or to identify when the coils have been mislabelled (eg. the coil intended for the tongue tip (TT) is placed at tongue back (TB)), as such problems are immediately apparent.

This thesis was devised with input primarily from the researcher user-base, though it did receive some feedback from pronunciation teachers later in its construction. Previous research has indicated that visual feedback from EMA can be a useful tool for teaching L2 pronunciation (Levitt and Katz, 2010; Suemitsu et al., 2013). It has also been shown that talking heads have been useful for both speech training for hard-of-hearing persons and L2-teaching for adults (Granström, 2004). From these studies we can learn much about the elements that need to be displayed onscreen and incorporated into the training.

The intended audience has affected this project at the design stage in many ways: Firstly, the project utilises open-source software (Python and Blender), ensuring that the technologies involved remain freely available for use into the future. Secondly, the code will be released under an open-source licence, so that it can be modified by its users. It is envisaged that researchers may want to alter the appearance of the application, make additions in order to parse an unseen EMA data format, or make changes to the data manipulation processes for their own needs. Language instructors fortunate enough to have access to an EMA machine may want to add game-like functionality or include their own target pronunciations. Thirdly, the objects in the .blend file that make up the scene can be replaced with user-defined meshes, such as custom tongue or lip meshes, and additional items related to sensor location (for example, ear meshes, if this addition took one's fancy) could be incorporated, so that the appearance of the software can be modified and more accurate models can be included. In addition, efforts have been taken to make the design intuitive for the user, and the code well-documented to support these aims. Hopefully these efforts will ensure that the project is a useful tool for both of these user groups.

1.3 speech-reading experiment

by rendering the articulators as surfaces, and greater video-realism than 2D visualisations, but whether these details improve the subject's ability to distinguish the differences between trajectories which are important for their training is an open question. This is an especially relevant question for intra-oral articulators such as the tongue, which subjects cannot be accustomed to seeing; thus it is not obvious in which way subjects prefer to visualise these articulators. Indeed, earlier studies that used live EMA data to provide real-time visual feedback have used a variety of different visualisations (eg. Katz et al., 2010; Levitt and Katz, 2010; Suemitsu et al., 2013; Katz and Mehta, 2015), and differing evaluation metrics, which make the effect of the visualisation formats difficult to compare.

This kind of evaluation is distinct (though not inseparable) from the evaluation of talking heads, though as seen in section 2.3 EMA visualisation software and talking heads have historically worked with different input data and were not used in real-time. Thus this thesis asks the central research question: 'Which level of abstractness (dots, 2D or 3D) is optimal for presenting EMA data in a speech-reading context?', and also aims to investigate (through qualitative evaluation questions) how the presentation of a 3D EMA-driven talking head can be optimised for intelligibility. These research questions are outlined further in section 5.2.

1.4 structure of this thesis

Section 2 describes the context of electromagnetic articulography and ultrasound, the technologies visualised with this system, as well as other recent visualisations that form a backdrop to this new approach. It also describes studies evaluating talking heads and concludes with a short summary of some applications of these visualisations and data collection in this modality. Section 3 details data sources and formats used in the project, as well as the mathematical manipulations that occur (for instance, during smoothing and head-correction) before the points are projected into the game engine.

The structure of the software package produced is introduced in section 4, justifying its architecture based on the choice of (and limitations of) the game engine and potential applications of the system. It details the functionality of each major component of the software package, and describes some of the problems (and their solutions) encountered during development.

the factors that influence how well subjects identify the prompt, such as their exposure to EMA visualisations, and the abstractness of the visualisation.

2 BACKGROUND

This section has two aims: firstly, it is intended to familiarise the reader with the basic workings of the articulographic techniques used for data collection in both the software application and the experimental component that make up this work – namely Electromagnetic Articulography (EMA) for the collection of the point-tracking data that drives the animation. Competing and informing technologies are also described: ultrasound (US), used for the collection of imaging data of the tongue that can be shown as a video overlay and the main competing technology with EMA for real-time visualisations, and Magnetic Resonance Imaging (MRI), which was used by a colleague to create an accurate tongue mesh (Hewer et al., 2014).

Secondly, the research questions that are investigated in chapter 5 and chapter 6 are placed within their theoretical context: laypersons' interpretation of vocal tract visualisations, and the evaluation of talking heads.

The chapter is structured as follows: Firstly, a short summary of competing fleshpoint-tracking technologies is included to justify the choice of EMA as the leading method for this visualisation. As the main technology used here, its history and technological basis are then summarised. Secondly, existing visualisation software packages are briefly outlined, and their applications are described. These represent the state of the art in this field, though few have yet been used with real-time data.

Thirdly, some common applications of EMA data and online EMA feedback are outlined, namely speech therapy, L2 pronunciation training, and general research into speech and other movements of the vocal tract. Finally, the problem of measuring the intelligibility of visual speech is described, incorporating descriptions of existing talking heads and methodology for evaluating their quality.

2.1 vocal tract imaging techniques

Techniques used to gain information about speech production can be broadly divided into two categories:1

1. imaging techniques, where a 2D photograph (or tomogram, in the case of a slice of tissue) of the vocal tract is obtained by capturing either light reflected from or projected through an object, or a signal induced in an object by a magnetic field. Many 2D representations may be returned in the case of several 'slices' being performed.

2. point-tracking techniques, which record the location information of markers attached to various anatomical points of interest.

Though many imaging techniques result in higher-resolution data, most are unsuitable for investigating articulation due to the long time needed to acquire the image, or due to harmful side-effects such as X-ray exposure. Point-tracking techniques similarly have their pitfalls: line-of-sight measures fail when articulators are obscured (when the mouth is closed, for example), and the effect of the attachments made to the speech production organs upon subjects' articulation remains an active discussion. These are discussed further with relation to EMA in subsection 2.2.4.

2.1.1 Image-based techniques

Popular image-based techniques for investigating speech production are cineradiography, Magnetic Resonance Imaging (MRI) and ultrasound (US).2

Computed Tomography (CT) has also been occasionally used for this purpose.

Cineradiography

Cineradiography (for the purposes of vocal tract imaging, generally consisting of producing lateral X-ray videos of speech production) is a technique with a long history, though it is now rarely performed on humans for research purposes due to concerns about X-ray exposure. For many years it was considered an excellent reference technique due to the high temporal resolution of the videos (Jallon and Berthommier, 2009), which displayed all (not just the supralaryngeal) articulators, before MRI was used for these purposes (Badin et al., 1998). Despite this, the superimposition of articulators over each other on the image has been problematic, meaning that the use of these historic corpora often requires a great deal of manual annotation, even when the corpora themselves have been digitised, such as by Munhall et al. (1995). Recent work focusses on using automatic or semi-automatic methods to extract tongue and lip contours from these corpora (Jallon and Berthommier, 2009), though the technique is still used in animal studies to investigate vocal tract movements (Ghazanfar et al., 2012; Riede et al., 2006; Tecumseh Fitch, 2000).

1 Though as rightly noted by Steiner (2010, p. 27), point-tracking merely represents a reduced-dimensionality version of the same data which is gained from imaging processes.

Computed Tomography

CT is a technique which uses X-rays taken from multiple angles to reconstruct a tomogram of a slice (or slices) of tissue. It allows articulations to be captured with reasonable temporal resolution, at a sampling rate of 15 Hz or faster, and allows for the collection of multiple planes at once. However, this is not fast enough to sample speech, and so it is not suited to studies of articulation. The technique's use of X-rays is also a weakness, as it limits the number of recordings that can be made from one subject. Nonetheless, this technique has a spatial resolution of less than 1 mm, and it can be used to construct planar images in any direction (Stone, 2006, p. 527).

Magnetic Resonance Imaging

MRI, on the other hand, is generally considered harmless as it exploits the behaviour of hydrogen atoms when exposed to a magnetic field (they emit a weak radio signal when they realign to this field) to indicate the hydrogen density of each area in the slice. This provides a tomogram of the vocal tract (though without teeth) in high spatial resolution. Though previously MRI could only be used to capture images of speech production where the vocal tract position can be sustained for some time, or repeated over several utterances of the same stimuli, the technology now has impressive spatial and temporal resolution that can track vocal tract movements dynamically (Lingala et al., 2016). These videos are very useful to the speech research community, and inform aspects of other vocal tract visualisations (such as modelling the shape of the tongue or palate). However, the complexity and cost of accessing such a machine may be prohibitive for some purposes. Additionally, the supine position of the subject and the noise of the machine prompt compensation by the subjects (Steiner et al., 2014), making it a less-than-ideal environment for language teaching or speech rehabilitation, which form a primary goal of real-time visualisations.

Ultrasound

Ultrasound imaging relies on high-frequency sound waves which, generated by an electric current passing through a piezoelectric crystal, pass through soft tissue and echo back to the crystal once they encounter a difference in the density of the tissue they are travelling through.

The strength of the echo influences the colour in which the signal is displayed, and the delay indicates the distance from the crystal at which the density difference was found. This allows images to be reconstructed in which white lines represent flesh boundaries and the XY coordinates on a plane represent location relative to the probe in 2D; where several crystals in an array are used at the same time, the result is an image rather than a single measurement.3
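To make the delay-to-distance relation concrete, the following minimal Python sketch (an illustration, not taken from the thesis software) converts a round-trip echo delay into a depth estimate. It assumes the commonly used nominal speed of sound in soft tissue of about 1540 m/s; the function name is an assumption for this example.

    # Minimal sketch of the echo-delay-to-depth relation described above.
    # Assumes a nominal speed of sound in soft tissue (~1540 m/s); a real
    # scanner calibrates this value and processes many echoes per scan line.

    SPEED_OF_SOUND_TISSUE_M_S = 1540.0

    def echo_depth_mm(round_trip_delay_s: float) -> float:
        """Estimate the depth (in mm) of the reflecting tissue boundary.

        The pulse travels to the boundary and back, so the one-way distance
        is half of (speed of sound * delay).
        """
        return SPEED_OF_SOUND_TISSUE_M_S * round_trip_delay_s / 2.0 * 1000.0

    # An echo arriving 100 microseconds after the pulse corresponds to a
    # boundary roughly 77 mm from the transducer crystal.
    print(f"{echo_depth_mm(100e-6):.1f} mm")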

Spatial resolution for this technique is good, at less than 0.7 mm, though a major shortcoming is that the signal cannot traverse any 'air-gap'. That is, if a straight line between a part of the tongue and the probe is not fully composed of flesh, but is interrupted by air forming a 'sub-lingual cavity', the signal is reflected at this boundary. This is a particular problem for retroflex sounds, where the tongue adopts a particularly vertical position, so the crystals must be placed in a position such that the echoes pass through flesh as much as possible. Therefore the positioning under the tongue is crucial in order to make maximal use of the technique, though some data loss can still occur. Nonetheless, ultrasound's temporal resolution is good, and a series of images can be displayed in real-time to show movement.

Another shortcoming is that the images can be difficult to interpret by the layperson, and may appear grainy. This could mean it is more difficult to use for pedagogical purposes, though studies have overcome these problems, many of which are summarised in Wilson et al. (2006). Edge-detection software also exists that can automatically extract the tongue shapes and display them (Li et al., 2005), though this extraction and display in real-time is not trivial.

Imaging summary

In summary, whilst CT and MRI result in higher-quality images than ultrasound, the setup of the machines and the X-ray exposure or cost and noise of using the machine make them inconvenient for real-time speech feedback. Though ultrasound images are grainy, their use in conjunction with an expert who can interpret and explain the important features is promising for pronunciation training, which has already been shown in several studies (Wilson et al., 2006; Tsui, 2012; Bernhardt et al., 2008).

2.1.2 Point-tracking techniques

Point-tracking measurement techniques aim to gain insights into speech production by affixing small tracking devices (receiver coils in the case of EMA, metal pellets in the case of X-ray microbeam and markers in the case of optical tracking) to the articulators. These are then tracked at a high sampling rate, giving insights into the movement of the selected fleshpoints.

X-ray microbeam

X-ray microbeam (XRMB) (Kiritani et al., 1975) locates small gold trackers using a lower-radiation X-ray, which is passed as an extremely thin beam through the subject, producing a 2D image. The image is most often of the sagittal plane. It provides excellent point-tracking data with high spatial and temporal accuracy, using the predicted movement of the pellets to target the beam. XRMB also represents an improvement over cineradiography techniques (which reveal moving X-rays of areas of interest in real-time but expose the subject to unacceptable levels of radiation for research purposes). However, XRMB also exposes the subject to low levels of ionising radiation.4

Many datasets are publicly available (Westbury et al., 1990); the machine itself was located at the University of Wisconsin.

Optical tracking

Optical tracking, such as the proprietary system Optotrak (NDI Digital), uses a camera to track special markers affixed to the body in 3D space. Such a system can support a large number of markers at once with high fidelity and does not present any danger to the subject, though it is less suited for the study of articulation because it requires a line-of-sight between the camera and the markers. Nonetheless, it has proven useful for analysing jaw and head movements.

Other techniques

Palatography is a technique whereby an artificial palate containing an array of electrodes or infrared diodes is fitted to and placed over the subject's hard palate. Electropalatography (EPG) produces an electric current when the tongue is in contact with the palate, and optopalatography measures the distance between the tongue and the palate at all times.5

4 Despite this, the dose from X-ray microbeam is much lower, in part due to adaptations that only expose the subject as often as necessary for pellet tracking, and only in the areas where the pellets are expected (Westbury et al., 1990).

Two techniques that can be used to complement EMA are electroglottography and plethysmography. Electroglottography measures vocal fold vibration during phonation using an analog sensor. It has high temporal resolution but is sensitive to subject movement (Stone, 2006). For the study of breathing, a plethysmograph can be used, which measures the changes in lung volume. A respiratory inductance plethysmograph calculates this using wires around the chest and abdomen, upon which the stretch is measured.

2.2 electromagnetic articulography

EMA overcomes the difficulties of the previously mentioned point-tracking methods by providing 3D trajectory information both inside and outside the mouth with a high sampling rate (even up to 1250 Hz in the AG501 (Stella et al., 2013b)) that does not expose its subjects to radiation, and is therefore suited to repeated use for training or research.6

2.2.1 How EMA works

The following is based on the description by Hoole and Nguyen (1999). EMA provides information about the coordinates of receiver coils (generally 2 mm to 3 mm in size) glued to articulators. These may be in the line of sight or outside it (here within the mouth); their location can be measured so long as it is within the measurement volume of the machine.7

Coordinates are obtained for the receiver coils (those glued to the articulators) by inferring their distance from transmitter coils (placed around the head at strategic positions or in a box lateral to the head), making use of the properties of the signal induced in the receiver coil by an alternating magnetic field. This signal is inversely proportional to the cube of the distance between the transmitter and receiver:

r_{\text{sig}} = \frac{1}{\text{dist}(t - r)^{3}}

where t is the location of the transmitter coil, r the location of the receiver coil, and the receiver coil has a parallel alignment to the transmitter coil. Rotation of the receiver coil away from the transmitter (referred to as tilt or twist) decreases its surface area relative to the transmitter, and thus the signal, making the coil appear further away than it actually is. The receiver signal is transmitted to the console via cables from the sensor that lead out of the mouth to the machine. The calculation of the sensor position in 2D can be achieved by two or three transmitter coils, though 3D EMA requires more.8

6 Though due to exposure to the strong magnetic field it is recommended that EMA not be used with subjects who have a pacemaker (Nor, 2011), despite some investigations showing it does not cause adverse effects in several devices (Joglar et al., 2009).
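As a rough numerical illustration of the inverse-cube relation above – a minimal Python sketch, not code from the thesis software or from any articulograph's internal software – the distance implied by an idealised signal amplitude can be recovered by inverting the cube law. The function name and the calibration constant k are assumptions for this example, and coil tilt is ignored, which is exactly the error source described above.

    import numpy as np

    # Minimal sketch: invert the idealised inverse-cube law r_sig = k / dist**3
    # to recover a transmitter-receiver distance from a measured amplitude.
    # Assumes parallel coil alignment; tilt lowers the induced signal and makes
    # the coil appear further away than it is. The constant k stands in for
    # machine-specific calibration and is purely illustrative.

    def distance_from_signal(r_sig, k=1.0):
        """Distance implied by signal amplitude(s) under the ideal cube law."""
        return (k / np.asarray(r_sig)) ** (1.0 / 3.0)

    # Amplitudes relative to a reference distance of 1 unit: the signal falls
    # with the cube of the distance, so 1/8 of the signal means twice as far.
    print(distance_from_signal([1.0, 0.125, 0.008]))  # approx. [1, 2, 5]

In a real system, distance estimates from several transmitter coils are combined to solve for the sensor's 3D position and orientation, which is why 3D EMA needs more transmitters than the 2D case.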

Figure 1.: Mid-sagittal views showing early transmitter coil locations and common receiver coil positions: (a) placement of three transmitter coils, from Schönle et al. (1987); (b) illustration of common coil positions along the mid-sagittal plane, as described by Perkell et al. (1992, fig. 1).

2.2.2 Development of EMA systems

The development of EMA systems began with several prototypes using transducers to measure distance in the 1970s (Hixon, 1971; van der Giet, 1977).9

Though they differed in their number and configuration (between one and three coils were tracked), the effect of rotation of the sensors on the signal meant that only sensors on the mid-sagittal plane could be measured (such that the technique was often called EMMA, for Electromagnetic Mid-sagittal Articulography), and head movement had to be restricted to ensure this configuration was maintained. Later systems (Perkell et al., 1992; Branderud, 1985) resolved this problem somewhat by using two transmitter coils, though rotational alignment remained a source of unreliability. Three-transmitter systems using a solution by Schönle et al. (1983, 1987), the AG100/200, resolved this problem and became popular systems installed worldwide.

The addition of extra transmitters to better detect sensor alignment solved both the rotational problem and the need for sensors to remain in the mid-sagittal plane (or for the head to be held still relative to the transmitters, and thus restricted in a helmet), resulting in the AG500 and AG501 machines (in which the subject sits within a cube-shaped plastic structure, or under a hanging three-armed structure, respectively) produced by Carstens Medizinelektronik.

8 For instance, the AG500 uses six transmitter coils, the AG501, nine.

New systems developed by Northern Digital also remove the need for a helmet, as the generator coils are housed within a box placed at the side of the subject’s head.10

Where reference coils are used to normalise for head movement, the subject can move freely within the measurement volume and the data can be recovered. One initial system, AURORA, had a relatively low (for speech) sampling rate of 40 Hz, and was evaluated with good accuracy (Frantz et al., 2003; Kroger et al., 2008), though later releases suggest it is now less intended for speech kinematics than medical applications (aur), that role passing to the younger system, the WAVE. The WAVE was evaluated by Berry (2011), and it was found that its automatic head-correction does not impact negatively upon its accuracy.

Another recent system combines EMA with magnetoencephalography (MEG), namely the Magneto-articulography for the Assessment of Speech Kinematics (MASK) system, developed by Sick Kids and the University of Toronto and evaluated by Lau (2013). Though general EMA techniques cannot be combined with MEG for reasons of magnetic interference, in this technique the magnetometers from the MEG read the positions of the (in this case, transmitter) coils on the articulators, which operate at a frequency below that of the brain, yielding synchronised brain activity and articulographic data. However, this is at the expense of the accuracy of the location readings, and the technology requires further development to reach the required level of accuracy (Lau, 2013).

10 As this system is proprietary, the internal workings remain unknown.

Figure 2.: (a) Carstens AG501; (b) NDI WAVE.

Name | AG500 | AG501 | WAVE
Sensors | 12 | 24 | 8, 16
Transmitter coils | 6 | 9 | unknown
Volume | 300 mm sphere | >300 mm sphere | 300 mm³ or 500 mm³
Frames per second | 200 fps | 250 fps | 100 fps, ~400 fps with hardware upgrade
Accuracy | Deviations above 1.8 mm (due to numerical software) (Kroos, 2012; Yunusova et al., 2009) | RMS of 0.3 mm within 110 mm measurement volume (Stella et al., 2013b) | 85% of errors <0.5 mm, 98% <1.0 mm (Berry, 2011)

Table 1.: Comparison of state-of-the-art EMA machines

2.2.3 Combining EMA with other modalities

EMA fleshpoint-tracking can be combined with other vocal-tract imaging modalities, which not only help researchers identify which utterances the EMA data pertain to, but can also provide more detailed imaging information (that may have a slower frame rate than EMA), show movements of articulators or muscles to which EMA coils cannot be affixed, record visible stimuli with line-of-sight measures, or even investigate brain activity during speech. These combinations are made especially feasible by newer EMA systems where the articulograph is portable and allows greater access to the subject. Audio recordings, often with a directional microphone, have also been conducted with the technique from the earliest experiments in order to identify which segments were said.

allows both magnetoencephalographic brain readings and EMA positional readings to occur simultaneously.

2.2.4 Drawbacks of EMA

Despite the aforementioned benefits of using EMA for speech research and teaching, such as its inexpensive nature (compared to other technologies), safety, and ability to be combined with other technologies, it does have some drawbacks, and these must be evaluated when choosing the best modality in which to investigate a research question. The most prominent problem is that EMA data only reports on the areas to which the receiver coils can be attached; glottal or nasal places or manners of articulation cannot be captured within the data, nor can voicing distinctions.

The need to glue the sensors securely to articulators within the mouth is time-consuming, and is certainly more invasive than imaging techniques or line-of-sight point-tracking measures. In addition, attaching the coils to the very tip of the tongue or too far back on the tongue blade can be uncomfortable for the subject, so the tongue sensors only represent a limited area on the tongue surface. For a more accurate picture of how this low-dimensional data (as compared to an image or video) corresponds to real articulator positions, techniques such as MRI, cineradiography or ultrasound are necessary to learn this correspondence – three (or five) points on the tongue surface are difficult to interpret meaningfully (whether in a visualisation or otherwise) without knowing how they relate to the tongue position as a whole.

Furthermore, the degree to which the presence of the receiver coils and wires in the subject's mouth affects their speech is debatable, though no vocal tract measurement technique is perfect in this regard (for instance, MRI produces a lot of noise and requires the subject to be in a supine position, while ultrasound requires holding a probe carefully or wearing a heavy helmet). Hoole and Nguyen (1999) suggest that the wires do not interfere with articulation so long as the front tongue sensor is not placed too close to the tongue tip. Katz (2006) found that while there were no consistent changes across speakers, some individuals' articulations were affected, reinforcing the need for screening of subjects before experiments. They also found that individuals with aphasia and apraxia may produce less intelligible speech in this condition.


Figure 3.: Screenshot of the VisArtico GUI playing a POS file, from Ouni et al. (2012).

2.3 software for visualising the vocal tract

Software packages for working with fleshpoint-tracking data can be broadly divided into two groups – analysis software that visualises EMA data from a corpus and performs corrections and visualisations for research purposes, and visualisation software with ‘talking-head’-like functionality that is intended for use in speech-therapy settings. Some talking-heads use acoustic-to-articulatory inversion, and infer the articulations based on a number of parameters (such as ARTUR). Others use live point-tracking data (such as Opti-Speech) to drive these visualisations.11

One of the oldest software packages for the analysis and visualisation of EMA data was developed at Haskins Laboratories, named HADES (Löfqvist et al., 1993). Later, Nguyen's EMATOOLS (Nguyen, 2000) was popular, as well as software for calibration and data display/analysis packaged with the Carstens articulographs (Car) where these machines were used. More recently, VisArtico and MVIEW have emerged as leading tools; their functionality is summarised below. Additionally, Opti-Speech, a recent state-of-the-art tool which performs largely the same function as this work, is described. Table 3 gives an overview of these and other tools that can be used for EMA data analysis and visualisation.

2.3.1 VisArtico

Ouni et al.'s VisArtico tool aims to provide an intuitive interface to EMA data, particularly to benefit users who are not able to access or use MATLAB. It runs on any machine that supports JAVA, and has a GUI that is used when setting up the coil configuration and visualising the movements in the data (Ouni et al., 2012). It was initially conceived as a tool for the Carstens AG500, though functionality has been extended to include other data formats, such as the AG501, NDI WAVE, several corpus-specific formats as well as general motion capture data (Ouni, 2015). It presents several views that are animated simultaneously (3D view, mid-sagittal view, waveform, trajectory visualisation etc.) and supports image overlay and display of segmentation files (such as Praat's TextGrid), as well as export of these views as images or video. The articulators displayed are the lips, tongue contour, and jaw angle (as seen in Figure 3), and a palate trace can also be shown or calculated.

Figure 4.: Screenshot of the MVIEW GUI playing a .mat file of the utterance "The girl was thirsty" with the cursor at the position of /s/.

2.3.2 MVIEW


Figure 5.: Screenshot of the Opti-Speech system, from a promotional video (utd).

2.3.3 Opti-Speech

Opti-Speech is an initiative of UT Dallas by Katz et al., initially using the animation engine Maya.12

Shown in Figure 5, it represents the most-advanced current system, interfacing in real-time with the NDI WAVE and displaying this information as movements on a standard human facial avatar. While the speech organs are not accurate anatomical models, they resemble the articulators so as to provide an intuitive display. Additionally, interactive articulatory targets, such as figures that light up when the tongue reaches that position, are already in use. This has prompted initial research investigating whether native English speakers reached a certain palatal target, and whether this changed in a condition with visual feedback.

2.3.4 Other visualisations

Articulographs generally ship with pre-packaged software for data analysis too – these complement the internal software that determines the coil coordinates from the strength of the received signals. The NDI WAVE comes with the WaveFront software package, which calculates the sensor positions and gives feedback about their activation. Additionally there is an Application Programming Interface (API) to allow users to access the data in real-time. Carstens machines come with a control server laptop with pre-loaded software (such as cs5view, cs5recorder, cs5cal etc.) for the calculation of sensor positions and their real-time display (ag5, b).

EGUANA and Articulate Assistant Advanced are other actively-developed and popular tools for data analysis, the latter being multi-modal and supporting EPG and US data too. In addition, other recent tools include Kolb (2015)'s visualisation tool for the EMA-MAE corpus in Python, and Speech Movement Analysis for Speech and Hearing Research (SMASH) (Green et al., 2013). Artimate is a lightweight tool not for analysis of large datasets, but rather for visualising articulators during speech synthesis (Steiner and Ouni, 2012). Some of these tools are shown in Figure 6.

Name | Author | Year | Platform | Availability
Articulate Assistant Advanced | Wrench | 2003; 2007 | Standalone | Registration key purchase
EGUANA | Pascal van Lieshout | 2013 | Standalone | Free with registration
EMATOOLS | Nguyen | 2000 | MATLAB | Superseded
MVIEW | Tiede | 2010 | MATLAB | From author
Artimate | Steiner and Ouni | 2012 | JAVA, Ardor 3d | Public, open-source license
VisArtico | Ouni et al. | 2012 | JAVA | Public as binary
SMASH | Green et al. | 2013 | MATLAB, GUI driven | Unknown
Opti-Speech | Katz et al. | 2014 | Maya/Unity | Not yet publicly available
Title unknown | Kolb | 2015 | PyQT | Not yet available

Table 3.: Various point-tracking visualisation systems

2.3.5 Visualisations in other modes

Aside from visualisations of the tongue/articulators built especially for ultrasound (for instance Xu et al., 2015; Wrench, 2003, 2007) or the plethora of medical tools that can visualise CT/MRI, many orofacial clones or talking heads aim to visualise the motions of the vocal tract using data from several modalities. Often, the input to these models is text or (more ambitiously) an acoustic signal, where the mapping between these inputs and the desired movements has been trained using a dataset (which is often EMA data due to its low dimensionality).

Some popular systems are:


Figure 6.: Additional visualisation tools: (a) software by Kolb (2015, p. 61); (b) Carstens AG501 real-time view with facial trace; (c) EGUANA software; (d) WaveFront (Nor, 2011, p. 20); (e) Articulate Assistant Advanced in EPG mode; (f) JustView trace viewer for AG500 (ag5, a).

Figure 7.: Two visualisation systems: (a) ARTUR, from Bälter et al. (2005, fig. 1); (b) Artisynth, from Lloyd et al. (2012, fig. 1).

• the orofacial clone by Badin et al. (2008, 2010b), a more realistic talking head that is built based on MRI, CT and video data. See Figure 8b.

• ARTUR (Bälter et al., 2005; Engwall and Bälter, 2007), a speech training aid14 with a mid-sagittal viewpoint that uses speech recognition to detect mispronunciations. See Figure 7a.

• Kröger's 'Visual Model of Articulation' (Kröger, 2003), which uses grapheme-allophone conversion to provide mid-sagittal views of articulatory organs controlled by 9 parameters for a speech therapy context.

• Artisynth (Fels et al., 2006; Lloyd et al., 2012), a biomechanical simulation environment for modelling anatomical structures, developed to model speech production. See Figure 7b.

• SYNFACE (Beskow et al., 2004), a multilingual system that uses acoustic-to-articulatory inversion to display visual speech using a synthetic talking head for hearing-impaired people when using the telephone.

2.4 applications of ema

Fleshpoint tracking data from EMA has found many applications, for instance in the study of speech (particularly investigating features such as coarticulation) and swallowing, in the mechanics of speech disorders and foreign-language speech, and as a modality informing other tasks in language technology such as speech recognition and speech synthesis.


2.4.1 Speech therapy

Visual augmented feedback is one tool often used in a speech therapy setting, and a range of pronunciation training applications (such as BALDI (Massaro et al., 2003) or Kröger (2009)) have already made use of vocal tract visualisations, though these were constructed from text input or pre-recorded. For real-time personalised feedback, ultrasound has previously proved much more popular than EMA due to its accessibility and non-invasiveness. Despite this, some studies have also been conducted with EMA to help treat apraxia of speech (Katz et al., 2010, 2007) by providing augmented visual feedback of articulatory movements.

EMA also potentially has the ability to overcome the limitations that ultrasound presents when used for speech therapy feedback, as described by Bernhardt et al. (2005). Unlike US, EMA can show both sagittal and coronal views simultaneously and can provide palate-relative information after performing a palate trace, though the cost and portability of the machine (relevant for some models more than others) still present a practical limiting factor.

In another application for persons with speech problems, EMA data has also been investigated for use in silent speech interfaces, where the movements of a user who is unable to produce normal speech following a surgical procedure such as a laryngectomy are resynthesised into vocal sounds (Fagan et al., 2008; Denby et al., 2010).

2.4.2 Pronunciation training for language learning

In terms of investigating how speakers can improve their pronunciation of a second language (L2), EMA data has been used for both online and offline analyses of the differences between L2 and L1 goal trajectories. Wieling et al. (2015b) conducted a study in which phonemically similar Dutch and English words were pronounced by a number of native speakers (21 and 22, respectively) of those languages, in order to later identify which landmarks differ in the pronunciation of certain selected phonemes in this language pair. Such insights can inform pedagogical practice, as teachers can explicitly make learners aware of the difference in articulation between their native and target language, rather than relying on an acoustic contrast.


motivating the learners to move the tongue trace’s tip to the correct position.

Inspired by this work, Suemitsu et al. (2013) taught Japanese learners the English /æ/ with a Carstens AG500 articulograph, attempting to use a data-driven target position as derived from the X-ray microbeam corpus (Westbury et al., 1990) and their own EMA corpus. In this study participants underwent a training phase, where they acclimatised themselves to the format of visual feedback, then attempted to match the on-screen targets. This also represented an extension by attempting to match the entire tongue contour, rather than just the tongue tip.15

Another study (Veenstra, 2014) developed training materials for Dutch learners of English using audio-visual materials based on EMA data, and found a trend (though non-significant) for learners in the audio-visual condition to perform better than those with only auditory materials.

2.4.3 General use

EMA data is often used to train models of articulatory inversion, in which an acoustic signal forms the input, and fleshpoint coordinates over time form the output, though as a one-to-many problem (a certain acoustic signal can be the result of many articulations) this is not trivial. Badin et al. (2010a) and Youssef et al. (2011) use an orofacial clone constructed from MRI, CT, and video data, and these detailed models are driven by a small number of parameters obtained from EMA data. The aim of the research is to use acoustic-to-articulatory inversion to infer the EMA coordinates that drive the clone, sparing the subject from being hooked up to the EMA machine.16

Speech, swallowing and other research

EMA has also been used to investigate other questions relating to tongue movements. Serrurier et al. (2012) investigated the relationship between tongue movements for speech and for feeding, and concluded that speech movements can be reconstructed from those of feeding. Trumpeters have also been investigated using EMA, as the more traditional methods of exploring musical instrument playing techniques with MRI fall below the sampling rate required for some fast movements, which were captured at 250 fps (Bertscha and Hooleb, 2014).

15 Japanese learners of English were also the population of a 2D ultrasound study that aimed to teach the English /l/ and /ɹ/ (Tsui, 2012), with gains in pronunciation accuracy reported, and the use of an ultrasound modality for L2 teaching and speech therapy is also an active research area. Indeed, Gick et al. (2008) showed in a pilot test that pronunciation gains could be made in this contrast after only one 30-minute lesson. This contrast was also trained in Massaro and Light (2003) using a talking head without real-time feedback, though seeing the indicators in this case did not significantly improve learners' results.

Speech recognition

The use of speech production knowledge for the development of speech recognition systems has a significant history (Frankel and King, 2001; Wrench and Richmond, 2000), for example with Wrench and Richmond (2000) showing that such information can help disambiguate the acoustic signal, though King et al. (2007) suggest that this information is underutilised. This area of study is too large to summarise here, though recent developments explore the use of acoustic-to-articulatory inversion and focus on generalising these methods to be speaker-independent (Ghosh and Narayanan, 2011).

2.5 intelligibility of visual speech

The following subsections outline two major areas in the field of vi-sual speech: how laypersons interpret (and learn to interpret) vocal tract visualisations, and how previous talking heads have been eval-uated. The description of these two areas is motivated not only by the need to evaluate the software package constructed in the scope of this thesis, but also because this area touches on a greater question in the field of augmented visual feedback for speech training: how can we best present the subject’s articulation (or a visualisation of the correct trajectories) so that they can detect the difference that must be adjusted for? Alternatively, in a situation where there is no real-time feedback, how can we best present stimuli so that they can be easily interpreted by the users?

As a proxy measure for the ease of interpretation of a visual stimulus by its users, we can consider intelligibility; indeed, this is a common thread throughout many of the methodologies described in the following subsections. The earliest canonical studies in these areas (such as Sumby and Pollack, 1954) already indicate that visualising speech can improve intelligibility.

Of course, any evaluation of a talking head must also rest on its accuracy with respect to the stimuli (e.g. timing, fidelity of movement). This can be measured explicitly by taking the coordinates of parts of the visualisation, or alternatively by comparing its appearance with standard and established tools. Additionally, laypersons should be able to learn to interpret this visualisation quickly and easily, and its overall attractiveness (which may or may not influence its intelligibility) deserves mention.
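As a minimal illustration of the first option (taking coordinates from the visualisation), the snippet below compares rendered marker positions against recorded coil positions frame by frame; the arrays and error scale are hypothetical stand-ins, not data from this project.

```python
import numpy as np

def rmse_per_coil(recorded, rendered):
    """Root-mean-square Euclidean error between recorded coil positions and
    the positions read back from the visualisation, per coil, in whatever
    units the inputs share (e.g. mm). Both arrays: (frames, coils, 3)."""
    squared_dist = np.sum((recorded - rendered) ** 2, axis=-1)   # (frames, coils)
    return np.sqrt(np.mean(squared_dist, axis=0))                # (coils,)

# Hypothetical example: 100 frames, 6 coils, 3D coordinates, with simulated
# rendering error of roughly half a unit.
rng = np.random.default_rng(1)
recorded = rng.normal(size=(100, 6, 3))
rendered = recorded + rng.normal(scale=0.5, size=recorded.shape)
print(rmse_per_coil(recorded, rendered))
```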


and for comparison purposes, in a simple dots-and-tails representation of the coil locations. The specific research questions and hypotheses are outlined in detail in section 5.2, though it was expected that the more detailed visualisations would perform better than dots-and-tails, and (as they show basically the same shapes) that the 3D and 2D visualisations would perform comparably.17

A secondary goal is to collect information about users' subjective experiences when interpreting the different types of visualisations. This data can be used to optimise the appearance of the software tool, which should ideally appeal to (i.e. be attractive to and interpretable by) both experts and laypersons.

2.5.1 Studying laypersons’ interpretation of vocal tract visualisations

Many studies have investigated the effect that the inclusion of the visual mode18 has on speech intelligibility. Sumby and Pollack (1954) conducted an early study showing that speech intelligibility improved when the speaker's facial and lip movements were visible, as well as with a decrease in vocabulary size, by manipulating the signal-to-noise ratio of the acoustic signal. Furthermore, the McGurk effect notably demonstrates that visual information affects humans' perception of the auditory signal itself: for instance, when an auditory /ba/ is presented together with a visual /ga/, the subject perceives /da/, even when they have knowledge of the workings of the effect itself (McGurk and MacDonald, 1976).

A common manipulation to test the intelligibility of visible speech is to present it alongside the auditory speech, where the auditory speech is presented under different noise levels. This can be considered an intrinsic evaluation of the quality of the visual speech. Other studies in these areas perform extrinsic evaluations and have focused on the effects of using such models for speech therapy or language learning (Hazan et al., 2005). These may focus on how users interact with the visualisations, for instance, how they are interpreted by naive users (e.g. Kröger, 2003).
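The noise-level manipulation itself is straightforward; the sketch below shows one common way of doing it (scaling additive noise so that the mixture reaches a chosen signal-to-noise ratio), using white noise as a stand-in for both signals rather than any particular study's materials.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise reaches the requested
    signal-to-noise ratio in dB (based on the mean power of each signal)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_noise_power / noise_power)

# Hypothetical example: white noise stands in for both the speech recording
# and the masking noise; a real study would load actual waveforms here.
rng = np.random.default_rng(2)
speech = rng.normal(size=16000)                      # one second at 16 kHz
noise = rng.normal(size=16000)
degraded = mix_at_snr(speech, noise, snr_db=-9.0)    # a heavily degraded condition
```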

Whilst orofacial clones and mid-sagittal visualisations of the speech production organs have long been used in a speech therapy context, the manner in which patients learn to interpret these visualisations remains under-investigated. The use of US as an effective tool for visual feedback during pronunciation training makes it clear that patients/students can learn to interpret grainy images with an instructor's guidance; however, the degree to which users can respond to point-tracking measures in raw data form (or when driving only basic representations of the main articulators) remains unresolved. Nonetheless, it is clear that where normally invisible articulators are shown, or where an abstract representation is used, subjects must first develop the skills to interpret the visualisation before they can utilise it for learning purposes.

17 As a 3D system is not expected to significantly improve intelligibility over 2D, the reader at this point may question the justification for developing the system. Indeed, it is hoped that the overall attractiveness and realism of a 3D animation provide a pleasing user experience; that in a real-time situation game-like controls allow better use of the extra dimension to see the scene from task-appropriate angles; and that the freedom from using sensors on (or splines representing) a mid-sagittal line along the tongue can better visualise lateral consonants.

The area of ‘speech reading’ aims to investigate these questions, covering phenomena such as lip-reading (and how speech intelligibility interacts with the audio and visual modalities), whether intra-oral articulators can also be ‘read’ in the same manner, and whether certain configurations of talking heads augment speech intelligibility. Additionally, the ability to speech-read from animations is a holistic measure of the quality of an animation, and can be used to evaluate the appearance of the talking heads in question. The special case of ‘real-time’ feedback when using EMA or ultrasound also provides the subject with extra information with which to learn the mapping between the visualisation and their gestures, as they both see the visualisation and can make use of the auditory and somato-sensory feedback they receive from their own speech production. Many studies that utilise real-time visual feedback like this allow a period of time in which the subject acclimatises to the display, during which they do not produce speech but practise fitting their tongue to targets (for example, 5 minutes in Suemitsu et al. (2013)).

The study described in this thesis draws inspiration from this previous work by providing a training period in which subjects are invited to watch vocal-tract animations in a zero-noise condition, and by showing each prompt with auditory speech after the question has been answered (these are described in section 5.4).

2.5.2 Previous studies in interpreting talking heads

The animations presented in this experiment are essentially talking heads, though in two cases with much lower dimensionality. The evaluation of the software tool by judging its intelligibility is informed by the following studies.


they could identify more words after using the tool (Massaro et al., 2003).

Figure 8: Two talking heads. (a) Different presentation conditions of BALDI, as described in Massaro (2003, fig. 4); (b) Badin et al.'s orofacial clone (Badin et al., 2010b, fig. 1).

To date, no external evaluation has been conducted on Badin et al.'s talking head, though Badin et al. (2010b)19 evaluated the ability of humans to identify the visemes under different presentation conditions (described in more detail in section 2.5.3).

19 I refer the reader to this article for an excellent summary of previous studies about the use of talking heads for language learning and their evaluation.

The topic of how children interpret the images they see during speech therapy sessions has been explored before, whereby:

• Kröger (2003) asked children to mimic the sounds that they thought a mute talking head was making. Their success above chance level confirmed that even children can ‘read’ the movements of intra-oral articulators from a visualisation.

• Bälter et al. (2005) confirmed that a talking head interface with visible articulators could be used by children with no prior training, even those with language disorders, though it was not suitable for illiterate children.

2.5.3 Previous studies in talking head presence and degrees of abstractness influencing intelligibility

Evaluations of talking heads, apart from considering the accuracy of their representations of facial and articulatory movements, have also formed the basis of studies into the contributions that the visibility of the face and intra-oral articulators make to speech intelligibility. The following studies present various methodologies that have been used to investigate this issue, as well as general talking-head evaluation:20

• Badin et al. (2010b) presented audiovisual stimuli in four presentation conditions (audio alone, audio with mid-sagittal view but no tongue, audio with mid-sagittal view and tongue, audio with textured face model) and four signal-to-noise ratios (no audio, 9 dB, +3 dB and no noise).

Their subjects informally reported that they found the tongue video useful only in the condition with no audio, and group differences suggested that one group may have benefitted from implicit learning after watching the stimuli with no added noise first. Indeed, the effects of the presentation modes were stronger at low signal-to-noise ratios. They also found a significant interaction between the consonants and the presentation condition: some consonants have very clear labial movements that subjects were easily able to distinguish. They also identified a group of ‘poor tongue readers’: these subjects did not benefit from viewing the tongue in the high signal-to-noise condition and performed worse when it was shown compared to other visual presentation conditions.

• Ouni et al. (2007) investigated metrics of how well a talking head contributes to speech intelligibility, taking a high-quality video recording of the actual speaker as a benchmark upper bound. The intention of their proposed metric is to allow the comparison of different talking heads (whether investigating different presentation conditions, comparing competing tools, or evaluating a tool's improvement during its development).

To validate the metric they tested the identification of 27 CV combinations in five presentation conditions, which differed in whether they displayed audio, a natural video, or the BALDI talking head, under five different noise conditions. From the experiment they concluded that both the presence of BALDI and the natural speaker improved intelligibility, though BALDI less so than the natural talker. In subsequent tests where the presence of lips only was compared to the full face, the full face gave better intelligibility (though the difference was less pronounced for the natural video).
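One classic way of quantifying such a contribution, going back to Sumby and Pollack's work, is to normalise the audiovisual gain by the headroom left above the audio-only score. Whether Ouni et al.'s metric takes exactly this form should be checked against the original article, so the formula below is illustrative only.

```latex
% A and AV are the proportions of correctly identified items in the
% audio-only and audiovisual conditions; C_V expresses how much of the
% possible improvement over audio alone the visual display provides.
\[
  C_V = \frac{\mathrm{AV} - \mathrm{A}}{1 - \mathrm{A}}
\]
```

Values of $C_V$ near 1 indicate that the visual display recovers most of what audio alone misses, which makes the measure comparable across noise levels and across talking heads.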

• Cosker et al. (2005) propose evaluating the realism of near-video-realistic talking heads21 by considering the degree to which they can induce the McGurk effect, with ten (McGurk-inducing) monosyllabic words used as prompts. They suggest that the individual utterances that succeed or fail at this aim can then be analysed to improve the system. Their particular model was a 2D image-based head with no access to the intra-oral articulators (it was trained on video footage).22

For this evaluation, they produced ‘real’ McGurk tuples (an original video stimulus dubbed with a different word) and ‘synthetic’ tuples (the talking head dubbed with audio from a different stimulus), after which the subject had to write down the word they perceived (an open-response paradigm), with the hypothesis that where lip-synching or animation is poor, the audio cue will dominate. They also asked forced-choice questions, aiming to gather users' opinions about the naturalness of the talking heads. From this study they were able to judge a list of possible McGurk tuples on how well they performed as indicators, and were able to suggest ways that their system (Cosker et al., 2003, 2004) could be improved, such as by increasing the sampling rate.

21 Though intra-oral animations are by no means near-video animations, the investigation of whether they can incite the McGurk effect would be an interesting future development. The article also neatly summarises current subjective evaluation methods for talking heads (Cosker et al., 2005, pg. 272): […] ground-truth mouth animation parameters, typically obtained from a real speaker.

• Some evaluations of BALDI measured its silent-speech intelligibility relative to video footage. One study (Chaloupka, 2002) synthesised 110 different words (each subject saw 40), where the subject had to choose, from two, three, or four words, the one which corresponded to the silent stimulus. In only one subsection were the words similar (varying in one phoneme; the rest seemingly varied randomly from the items selected). Though many of this particular study's conclusions may be disputed, the results showing similar levels of recognition for the BALDI and video stimuli in this similar-word condition are relevant for this work.

The experiment described in this document is particularly inspired by Badin et al. (2010b) as described above: their finding that the effect of the presentation mode is stronger in silent conditions motivated the use of silent stimuli in this experiment. Their finding that certain consonants have different degrees of readability due to their lip movements also informed the selection of prompts with a variety of consonants and the use of minimal pairs (aiming to capture details about which of the consonants in a word is more problematic, see subsection 5.3.1). The other studies above were not adapted for the task at hand, as they rely on comparisons with video footage, which is not available in this scenario; however, their insights helped to inform the closed-choice paradigm and the choice of stimuli.



Thus, from the literature we can see that the visualisation of talking heads and articulators can, in certain circumstances, benefit speech intelligibility, though this depends on the quality of the visualisation as well as the presentation condition (i.e. which articulators are shown). A silent stimulus can be considered as having a signal-to-noise ratio of −∞, and it is in this condition that the stimuli in the present experiment were presented.
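In decibel terms this follows directly from the definition of the signal-to-noise ratio: as the speech signal vanishes, the ratio inside the logarithm goes to zero and the SNR diverges to negative infinity.

```latex
\[
  \mathrm{SNR} = 10 \log_{10}\!\left(\frac{P_{\text{speech}}}{P_{\text{noise}}}\right)
  \;\longrightarrow\; -\infty
  \quad \text{as } P_{\text{speech}} \to 0 .
\]
```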
