
Speech communication systems in realistic environments: strategies for improving system performance and user experience

Citation for published version (APA):

Cvijanovic, N. (2017). Speech communication systems in realistic environments : strategies for improving system performance and user experience. Technische Universiteit Eindhoven.

Document status and date: Published: 09/10/2017

Document Version: Publisher's PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website.

• The final author version and the galley proof are versions of the publication after peer review.

• The final published version features the final layout of the paper including the volume, issue and page numbers.

Link to publication

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

• You may not further distribute the material or use it for any profit-making activity or commercial gain.

• You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, please follow below link for the End User Agreement:

www.tue.nl/taverne

Take down policy

If you believe that this document breaches copyright, please contact us at openaccess@tue.nl providing details and we will investigate your claim.


Speech communication systems in realistic environments: Strategies for improving system performance and user experience


Nemanja Cvijanović


Eindhoven University of Technology


Speech communication systems in realistic environments: Strategies for improving system performance and user experience

Nemanja Cvijanović

August 28, 2017


The work described in this thesis has been carried out at the Philips Research Laboratories in Eindhoven and the Eindhoven University of Technology.

This research received financial support from the European Commission under Contract FP7-PEOPLE-2011-290000 and from Philips Research Laboratories Eindhoven.

Cover idea and design: Gosia Perz and Nemanja Cvijanović.

Cover extras: Bilge Çelik Aydin, Illapha Cuba Gyllensten, Salvador Peregrina, Sarah Wassum.

A catalogue record is available from the Eindhoven University of Technology Library. ISBN: 978-90-386-4336-6

Financial support by the Eindhoven University of Technology for the publication of this thesis is gratefully acknowledged.


Speech communication systems in realistic environments: Strategies for improving system performance and user experience

Thesis

submitted to obtain the degree of doctor at the Eindhoven University of Technology, on the authority of the rector magnificus, prof.dr.ir. F.P.T. Baaijens, for a committee appointed by the Doctorate Board, to be defended in public on Monday 9 October 2017 at 16:00

by

Nemanja Cvijanović


This dissertation has been approved by the promotors, and the composition of the doctoral committee is as follows:

chairman: prof.dr. C. J. H. Midden
1st promotor: prof.dr. A. G. Kohlrausch
2nd promotor: prof.dr. V. Hazan (University College London)
members: prof.dr.ir. H. Van hamme (Katholieke Universiteit Leuven)
prof.dr. J. H. D. M. Westerink
prof.dr.ir. J. H. Eggen
dr. E. Janse (Radboud Universiteit Nijmegen)
advisor: ing. C. P. Janse (Philips Research Laboratories Eindhoven)

The research described in this dissertation was carried out in accordance with the TU/e Code of Scientific Conduct.


Contents

1 General introduction
  1.1 Speech communication in everyday scenarios
  1.2 Use of speech communication systems in noise
  1.3 Evaluation of the effects of noise and noise suppression algorithms
  1.4 Objectives and thesis structure motivation
  1.5 Outline

Part I

2 Articulatory information capture using ultrasonic Doppler sensing
  2.1 Introduction
  2.2 Ultrasound Doppler sensing setup and background
  2.3 Modelling the ultrasound reflection
    2.3.1 Contributions from articulatory movements
    2.3.2 Acoustic leakage
    2.3.3 Air flow-ultrasound interaction
    2.3.4 Effects of vibrations from the speaker's face
    2.3.5 Interactions between ultrasound and the produced speech
    2.3.6 Extended reflection model
  2.4 Conclusion
  2.5 Appendix

3 Ultrasound-based speech enhancement using an articulatory-to-acoustic feature mapping in a hands-free scenario
  3.1 Introduction
  3.2 Speech-related information extraction
    3.2.1 Preprocessing of the captured reflection
    3.2.3 Discrimination of similar articulatory patterns
    3.2.4 Inter- and intra-speaker variability and capture setup dependency
  3.3 Ultrasound-aided speech enhancement system
    3.3.1 Ultrasound-to-AC feature codebook mapping
    3.3.2 Set-up and preprocessing parameters
    3.3.3 Ultrasound sensor potential
    3.3.4 Model-based speech enhancement system evaluation
  3.4 Conclusion
  3.5 Appendix
    3.5.1 Examples of inter- and intra-speaker differences captured by ultrasonic Doppler sensing
    3.5.2 Effects of the Doppler sensing setup on the captured ultrasound reflection

4 Robustness improvement strategies in ultrasound-based speech communication systems
  4.1 Introduction
  4.2 Extended ultrasound reflection model
  4.3 Features for robust Doppler-based speech processing
  4.4 Experimental results
    4.4.1 Setup
    4.4.2 Evaluation framework
    4.4.3 Results and discussion
  4.5 Beamforming for background movement compensation
    4.5.1 Background and system characteristics
    4.5.2 Experiments and results
  4.6 Conclusion and outlook
  4.7 Appendix

Part II

5 Recognition memory in background noise for conversational and clear speaking styles in different age groups
  5.1 Introduction
  5.2 Experimental design and methodology
    5.2.1 Participants
    5.2.2 Setup
    5.2.3 Stimuli
    5.2.5 Data analysis
  5.3 Results
  5.4 Discussion
  5.5 Outlook
  5.6 Acknowledgement
  5.7 Appendix

6 Effects of noise on arousal in a speech communication setting
  6.1 Introduction
  6.2 Arousal measurement background
    6.2.1 Autonomic nervous system
    6.2.2 Electrodermal activity and heart rate variability
  6.3 Experimental design and methodology
    6.3.1 Experimental conditions
    6.3.2 Collaborative tasks
    6.3.3 Physiological data measurement setup
    6.3.4 Physiological features
    6.3.5 Mental effort self-evaluation
    6.3.6 Relaxation phases
    6.3.7 Participants
    6.3.8 Statistical analysis methods
  6.4 Results
    6.4.1 Mental effort self-evaluation analysis
    6.4.2 Physiological data analysis
    6.4.3 Condition duration and learning effects
  6.5 Discussion
    6.5.1 Subjective measures
    6.5.2 Physiological measures
    6.5.3 Experimental design considerations
  6.6 Conclusion and outlook

7 General discussion
  7.1 Main contributions
  7.2 Critical reflection
  7.3 Suggestions for future research

References

Summary



Chapter 1

General introduction

1.1 Speech communication in everyday scenarios

Speech is used to convey information between speakers and listeners in a variety of face-to-face and mediated communication scenarios. Regardless of the scenario and the seemingly little effort we exert during communication in most instances, the speech communication process is rather complex, and parts of it have yet to be completely understood. Starting from the intent of the speaker, the speech communication chain includes the formulation of the message, the activation of articulators and speech production, and the acoustic transmission of speech to the listener, who is then responsible for the perception and understanding of the message. For mediated communication scenarios, the transmission process also includes the enhancement and encoding of the speech signal.

Given this complexity and its importance in everyday life, speech communication has been the research focus of numerous scientific disciplines. These range from auditory neuroscience, linguistics, speech perception, and cognitive science to more technical disciplines including acoustics, speech enhancement, speech recognition and speech synthesis. The popularity of speech communication research has also been affected by recent trends and the rapid advancements in connectivity, mediated communication technology and human-technology interaction and their commercial potential. These led to numerous applications requiring reliable speech recognition, natural language processing or speech synthesis, e.g., Apple's Siri, Amazon's Alexa, Microsoft's Cortana, Google Now, and various speech-generating devices aiding speech impaired users¹. When researching speech communication, however, it is important to remember that speech communication usually happens in non-ideal environments in which the communicators must cope with and adapt to various interferences and distractions. The effects of such everyday environments on the speech communication process and strategies to overcome them are the main focus of this dissertation.

¹ A recent market analysis estimated the global voice recognition market value to increase from $104.4 billion in 2016 to $184.8 billion in 2021 [1].

During speech communication in the real world, the most common interference is background noise, which is inevitable and ubiquitous. Originating from a variety of sources, it affects communication at work (e.g., computer and other electronic devices, nearby conversations from colleagues), during a commute/travel (e.g., car and train noise, wind noise, announcements in public transport) or leisure (e.g., music in the background, conversations in the background in restaurants and cafes, household appliances). During a conversation, noise can reduce speech intelligibility, which may lead to an increase in listening effort [2, 3] and ultimately listener fatigue as well as annoyance and stress [4–6]. The severity of these adverse effects, however, varies across different listener populations. One important factor is age, as it has been shown that (normal hearing) elderly listeners exert more effort in recognizing and understanding speech in adverse conditions compared to young adults, e.g., [7–10]. Furthermore, the adverse effects of non-ideal listening conditions are more pronounced in non-native listeners (see [11] for a detailed review) and listeners with hearing impairment [12–14].

The adverse effects of background noise during speech communication naturally depend on its level. Several studies have been conducted in the past with the aim to quantify average levels of background noise in every-day scenarios. One example is the study presented in [15] which investigated the environmental noise levels in classrooms, private houses, hospitals, de-partment stores, trains, and airplanes. The results of this technical report were summarized in [16] and revealed average background noise levels mea-sured in classrooms, houses, hospitals, and department stores ranging from 45 dBA − 55 dBA SPL, while increasing to approximately 70 dBA − 75 dBA SPL in trains and airplanes. In a more recent study, the noise levels in subway and bus stops in New York were investigated and average noise levels of ap-proximately 86 dBA measured, with maximum levels exceeding 110 dBA [17]. In addition to the background noise level, the characteristics of the in-terfering sounds are relevant when examining the corresponding effects on speech intelligibility and communication. Studies investigating the process-ing and recognition of speech in adverse conditions usually classify the

mask-1

A recent market analysis estimated the global voice recognition market value to in-crease from $104.4 billion in 2016 to $184.8 billion in 2021 [1].

(14)

Chapter 1. General introduction 3

ing effects of the employed interferers into two classes based on their charac-teristics – energetic and informational masking [18–21]. Energetic masking occurs when the speech signal of interest and the interfering signal(s) contain energy in the same frequency regions at the same time, while informational masking is assumed to occur when the target signal (in our case speech) and the interfering signal share similar informational content, e.g., both target and interfering signals are speech [18, 19]. A detailed review and discussion on the characteristics of and separation between energetic and informational masking is found in [18] and [21]. Again, the effects of different types of masking may vary across different listener populations. For example, the effects of informational masking have been shown to depend on language proficiency of the listeners (native vs. non-native listeners) in [22].

1.2 Use of speech communication systems in noise

The detrimental effects of noise on speech communication are particularly prominent during the use of speech communication systems, as the microphones they employ capture all ambient sounds from the environment in which they are used. As such, their performance, and thus the user's experience, is highly dependent on the acoustic characteristics of the environment. Noise can not only lead to the capture and subsequent transmission of noisy and unintelligible speech but also limit the functionality of the underlying system, e.g., the functionality of applications relying on speech recognition. For this reason, speech communication systems nowadays employ numerous methods for enhancing the captured noisy signal [23–25]. These methods, however, are usually designed for specific applications and perform best for a set of predefined conditions, e.g., hand-held vs. hands-free scenarios. A general solution, applicable to all noise types, use cases and SNR conditions, does not exist so far.

Compared to speech communication systems, the effects of challenging acoustical conditions are less pronounced during face-to-face communication. This is mostly due to the numerous benefits of binaural hearing as well as the vast capabilities of the human auditory system. One example is stream segregation [26], shown using the well-known cocktail party effect first described in [27]. In scenarios with multiple active speakers, stream segregation allows listeners to focus on the voice of interest while ignoring other speakers, which benefits intelligibility [26]. Similar to the adverse effects of noise, these benefits may differ depending on the listener characteristics. For example, the age of the listeners has been shown to affect stream segregation in [28].

Another major difference between face-to-face communication and the use of speech communication systems is the multimodal nature of speech perception, wherein listeners utilize visual and even haptic information in addition to auditory cues to aid their perception [29]. The benefits of this phenomenon have been studied extensively in audio-visual speech perception research [30, 31]. Hereby, studies with normal hearing listeners have shown the benefits of visual cues during speech perception in shadowing experiments (participants repeat speech directly after hearing it) in silence [32], their intelligibility benefits for speech perception in noise [33, 34] or during the perception of foreign language, heavily accented speech or speech containing complex sentences [35]. Furthermore, similar benefits have also been found in studies with hearing-impaired listeners (see [36] as an example). Additionally, the McGurk effect, described in [37], demonstrates the power of visual cues during speech perception. Again, the benefits of visual cues depend on listener and speaker characteristics – studies in [38] and [39] show that elderly participants benefit less from visual cues compared to young adults, while the study presented in [40] showed a reduction in visual cue benefit for non-native speech perception.

Motivated by the multimodal nature of speech perception, researchers have recently started incorporating additional modalities into speech communication systems. The main ideas were to utilize auxiliary sensors robust to ambient noise to circumvent the limitations posed by traditional microphones, and to use additional information about the produced speech signal, absent from the traditional microphone signal. Such systems can be based on video [41, 42], electromagnetic articulography [43], various ultrasound setups [44–46] and bone conduction [47, 48], among others. Aiming to benefit from the phenomenon of audio-visual speech perception, the captured auxiliary information often describes facial or articulatory movements during speech production including the velocities, movement trajectories or coordinates of predefined articulators. This type of information can be captured using various methods, and their choice, as well as the choice of the corresponding auxiliary modality, are paramount in the design phase of multimodal speech communication systems. While systems using electromyography, for example, may provide accurate and robust coordinates and thus precise kinematic information about the articulators, they also require the placement of electrodes on the speaker's face. Consequently, as a trade-off, their use can be overly intricate, uncomfortable, and impractical for everyday scenarios. The first part of this dissertation focuses on this question and examines the feasibility, potential and robustness of an ultrasound sensor in a multimodal speech communication system as a cheap and comfortable alternative to other modalities.

1.3 Evaluation of the effects of noise and noise suppression algorithms

While the use of speech enhancement algorithms in speech communication systems is needed to ensure high performance, especially in realistic scenarios, their evaluation and the development of reliable evaluation methods are equally important, as shown by the numerous studies and standards devoted to them [49–55]. The utilized evaluation methods focus either on the quality of the speech signal, e.g., how natural the speech signal sounds after processing, or on the intelligibility of the processed speech, and can be classified as subjective or objective [24, Chapters 10 and 11].

Intelligibility evaluation involves numerous listening tests including syllable, word or sentence recall (see [24, Chapter 10] for a detailed discussion), and the subsequent determination of the so-called speech reception threshold (SRT) [56, 57]. Subjective speech quality is usually evaluated using listening tests during which listeners evaluate the presented material using a provided rating scale, e.g., the Mean Opinion Score (MOS) [58]. However, since such listening tests are expensive and time consuming, objective methods for quality evaluation are usually preferred. These include comparisons of the original and processed signals resulting in measures such as the segmental signal-to-noise-ratio (segSNR) [59] and perceptual evaluation of speech quality (PESQ) [54] among others. While these measures can be obtained without expensive and strenuous listening tests, they do not always correlate well with the subjective methods [52].
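As a concrete illustration of such an objective measure, the following is a minimal Python sketch of a segmental SNR computation, assuming time-aligned clean and processed signals; the frame length, hop size and clamping range are illustrative choices rather than the parameters of [59].

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, hop=128,
                  floor_db=-10.0, ceil_db=35.0):
    """Average per-frame SNR (dB) between a clean reference and a processed signal.

    Per-frame values are clamped to [floor_db, ceil_db], as is common for
    segmental SNR variants; the exact settings here are illustrative.
    """
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    assert clean.shape == processed.shape, "signals must be time-aligned"

    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = c - processed[start:start + frame_len]   # frame-wise error signal
        snr = 10.0 * np.log10(np.sum(c ** 2) / (np.sum(e ** 2) + 1e-12) + 1e-12)
        frame_snrs.append(np.clip(snr, floor_db, ceil_db))
    return float(np.mean(frame_snrs))
```

Clamping the per-frame values keeps silent frames from dominating the average, which is the usual motivation for a segmental rather than a global SNR.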

Outside of the field of speech signal processing and telecommunications, different cognitive measures are utilized to assess the effects of noise on speech perception. The most prominent example is listening effort, which refers to the mental effort required for listening, where mental effort can be defined as the "deliberate allocation of resources to overcome obstacles in goal pursuit when carrying out a task" [60, pg. 11S]. The effects of different noise types and background noise levels on listening effort have been studied using a variety of different experimental designs and methods. Dual task paradigms have been utilized in [2] and [61], while subjective rating scales have been applied in [62]. Recently, researchers have also started employing physiological measures to assess the effects of challenging acoustical conditions on listening effort with the argument that, since noise affects listening and mental effort, noise-induced physiological changes during speech perception should also be present [63–65]. Given the recent advances in wearable technologies and developments in the so-called internet of things, these psychophysiological approaches are especially interesting as numerous wearable sensors, able to provide physiological data from users in real-time, already exist, e.g., [66] and [67]. These sensors could be used to evaluate the effects of environmental noise on user experience and well-being in real-time, allowing for a more personal evaluation of speech enhancement algorithms or even changes and adaptations of relevant algorithm parameters.

The examination of the effects of noise through measures of listening effort and physiological changes is the focus of the second part of this dissertation.

1.4 Objectives and thesis structure motivation

This dissertation does not follow the more common mono-thematic approach but investigates speech communication from different perspectives and across multiple disciplines. This approach was partially motivated by the multi-disciplinary nature of the European research network to which the author's project belonged – INSPIRE (Investigating Speech Processing In Realistic Environments). The main aim of INSPIRE was to combine the knowledge and research effort of the numerous disciplines involved in speech communication research, which granted the author access to experts from a variety of disciplines ranging from psychology through signal processing to physiology. Additionally, the research comprised in this dissertation was carried out in a multi-disciplinary work environment at Philips Research, guided by a supervisory team with a mixed technical and behavioral background and with close collaboration with the Speech, Hearing and Phonetic Sciences department at University College London (UCL). These factors, as well as the author's motivation to perform multi-disciplinary research, shaped the organization and nature of this dissertation.

The fragmentation of traditional speech communication research into numerous sub-disciplines and the lack of interaction between them has led to significant differences in research methodologies and difficulties in cross-disciplinary knowledge transfer. Algorithm development and research in technical fields (e.g., speech enhancement) mostly focuses on the improvement of speech quality based on predefined objective metrics, as discussed in the previous section. While practicality and real-world scenarios are given a high priority in such disciplines, they rarely investigate cognitive, physiological and other user-centric aspects of speech communication. Part of the reason for such an approach is also the high cost of subjective tests and behavioral studies with human participants in general, or even the lack of infrastructure required to organize and perform them. In comparison, the more behavioral disciplines are mainly interested in the speakers or listeners in a communication scenario and in cognitive, perceptual, and physiological effects of different manipulations in the speech communication chain. Such studies may also investigate changes in the effects of certain manipulations between different groups of participants. Examples include the examination of differences between normal hearing adults and children, elderly, non-native speakers, hearing impaired listeners or cochlear implant users, which may not be a high priority from a commercial point of view. These studies, however, often focus on laboratory settings instead of realistic, real-world scenarios. This may involve listening experiments presenting speech in complete silence or under unrealistic noise levels (extremely low SNRs). While such approaches are necessary to ensure that the effect of interest is present and observed in isolation, it is also important to evaluate whether the effects also occur in everyday communication scenarios. One of the aims of this dissertation was to combine technical and behavioral approaches, and investigate the effects of realistic environments on speech communication. Furthermore, the author aimed to develop algorithms for improving the performance of speech communication systems in such environments and potential methods for their evaluation.

The first part of this dissertation focuses on the development and evaluation of a multimodal approach to aid speech communication in noise. The main research questions were related to the suitability of ultrasound for articulatory movement capture, the type, utility, and potential benefit of the information captured via our ultrasound setup and finally the stability and robustness of the setup towards artifacts and challenges that arise in a realistic use case. To address these, this part starts with a detailed analysis of the articulatory information captured via our ultrasonic Doppler sensing setup followed by an examination of the potential benefit of the auxiliary ultrasound sensor. For further evaluation, the captured articulatory information is then integrated in a state-of-the-art noise suppression system. Finally, the developed multimodal approach is extended to a realistic communication scenario through a robustness analysis of the auxiliary ultrasound modality.

The second part of the dissertation then focuses on the cognitive and physiological effects of noise on speakers and listeners during speech perception through two experiments. The aim of these was to examine the effects of noise and speech enhancement on the users of a speech communication system and determine whether behavioral approaches can be used for a more user-centric evaluation of speech enhancement algorithms. The first experiment was carried out at UCL and designed to examine the effects of ambient noise and a commercial state-of-the-art noise suppression algorithm on recognition memory. As the used algorithm was designed for a hands-free teleconferencing scenario, a relatively high SNR of 10 dB was used for the noisy condition – teleconferences are usually not held in worse environmental conditions. Listening experiments were used to determine the recognition memory scores for different conditions, two different age groups (young adults and elderly group) and two different speaking styles. The motivation for the comparison between different age groups was to determine the noise suppression algorithm benefit for populations known to be affected more by challenging environments during speech communication (elderly participants) and the availability of an elderly participant pool at UCL. While this study combined behavioral with technical approaches and utilized realistic stimuli, the second study aimed to take this a step further and better emulate an everyday speech communication scenario. For this purpose, the design of the second study revolved around the communication between two participants using a speech communication system under environmental (noise) conditions. It examined the effects of noise on arousal of the participants while they solved collaborative tasks together. Arousal was assessed using popular physiological measures whose sensitivity and suitability for arousal detection in realistic speech communication scenarios was examined. With these insights, we aimed to determine whether such physiological measures could potentially be applied to evaluate technical methods for speech quality improvement or to monitor user experience during the use of speech communication systems.

1.5 Outline

The first part of the dissertation focuses on the use of an ultrasound sensor to enhance the robustness of a speech communication system in challenging environments. The multimodal system employs an ultrasound sensor consisting of an ultrasonic receiver and an ultrasonic transmitter as well as a standard air-conduction (AC) microphone, enabling synchronized ultrasound and AC signal capture. With the sensor positioned in front of the user and aimed at the user's face, an ultrasound signal is emitted and reflected off of the user's face and recorded again. This captured reflection then contains information about articulatory movements during speech production, which is encoded in shifts of the emitted carrier frequency occurring due to the Doppler effect.

Chapter 2 of the dissertation presents the system setup and hardware used throughout the first part of this thesis and provides an analysis of the captured ultrasound reflection including a detailed discussion about the various sources contributing to it. These sources range from articulatory movements, which are the original motivation for the employed ultrasound setup, to several non-articulatory contributions often overlooked in similar systems in the literature. Finally, the chapter is concluded with a model of the captured ultrasound reflection.

Chapter 3 focuses on the use of speech-production-related information, captured with the ultrasound setup introduced in Chapter 2, in a multimodal speech communication system. This system utilizes auxiliary speech production features extracted from the ultrasound reflection for speech enhancement. Firstly, a set of ultrasound features is developed and their link to acoustic speech features discussed with emphasis on their discriminability, robustness to inter- and intra-speaker differences as well as changes in the capture setup. Subsequently, a speech enhancement system relying on a mapping between ultrasound features, extracted from the ultrasound reflection, and clean acoustic speech features extracted from the AC microphone signal is presented and the potential of this mapping explored. Finally, the ultrasound-to-AC feature mapping is employed and evaluated in a state-of-the-art speech enhancement system.

While Chapters 2 and 3 assumed that only speech-production-related movements and events are encoded in the ultrasound reflection, Chapter 4 investigates the robustness of the utilized ultrasound setup by including artifacts caused by non-articulatory movements in the captured ultrasound reflection. Essentially, since the ultrasound setup cannot differentiate between useful articulatory movements and non-articulatory movements originating from the user or objects in the background of the user, methods for the detection and, in some cases, compensation of such non-articulatory artifacts are required. Such methods are developed and presented in this chapter followed by their implementation and evaluation on an ultrasound-based voice activity detector. Finally, a setup utilizing an ultrasound receiver array is examined and the potential benefit of corresponding beamforming approaches for the robustness of our setup discussed.

The second part of the dissertation investigates the effects of noise and noise suppression methods, designed to aid speech communication in challenging acoustic environments, on the experience and well-being of the users of speech communication systems.

In the first study, described in Chapter 5, the effects of noise and noise suppression on the memory encoding of speech are examined in a recognition memory framework, where participants were asked to recognize previously heard sentences in a set of sentences containing previously heard but also unheard (new) sentences. The recognition memory was evaluated for different background noise levels (silence, noisy and noisy with noise reduction), speaking styles (conversational vs. clear speaking style) and for two different age groups (young adult and elderly participant groups).

The second study, presented in Chapter 6, focused on the effects of noise on the arousal of users of speech communication systems in a speech communication setting – participant pairs solved collaborative tasks together through communication under different background noise conditions. The participants' arousal states in each noise condition were measured using physiological features extracted from captured skin conductance and heart rate variability data. Additionally, the perceived mental effort required for communication was assessed through subjective rating scales. Furthermore, this chapter examines the sensitivity of physiological measures regularly used for arousal state detection in the literature with the aim to determine whether the most frequently used physiological features are in fact suited for arousal state detection in speech communication experiments.

Finally, Chapter 7 summarizes the contributions of the dissertation and provides an outlook and ideas for future research.


Chapter 2

Articulatory information capture using ultrasonic Doppler sensing

2.1 Introduction

Information about articulatory movements during speech production has been utilized in a variety of research disciplines. It has been used to investigate the kinematic characteristics of speech production [68–75], examine the effects, progression and onset of speech disorders or changes in articulation induced by different medical conditions [76–80], assist in the treatment of speech disorders [81] or as auxiliary information to improve the performance of speech communication systems [42, 45, 82–84]. Corresponding to these numerous application domains, a variety of articulatory data acquisition methods has been developed, which depend mostly on the articulatory information required by the application, e.g., exact coordinates of certain articulators, muscle activation patterns, velocity profiles or simple speech presence decisions. Some of the developed methods include the use of pulsed echo ultrasound [68–70, 85], X-ray microbeam systems [73, 76], video [42, 86], electromagnetic articulographs [72, 74, 78–81, 87], electromyographs [43, 83, 84, 88–90], magnetic resonance imaging (MRI) [75] and ultrasonic Doppler sensing [45, 82]. In this chapter, the ultrasonic Doppler sensing approach is investigated further.

One aim of this part of the thesis is to investigate the use of articulatory information to aid speech communication in challenging environments. For this purpose, we employ an auxiliary sensor to capture such information in a multimodal speech communication system. The choice of an ultrasonic Doppler sensing approach was motivated by our application, as the use of several aforementioned articulatory information capture methods is not feasible or is impractical in daily speech communication scenarios (MRI, for example, or any intrusive method requiring the placement of sensors on the user's face). Ultrasonic Doppler sensing employs sensors consisting of an ultrasonic transmitter, emitting an ultrasound carrier signal at a pre-defined frequency, and an ultrasonic receiver. For articulatory information capture, these sensors are usually aimed at the speaker's face so that the emitted signal is reflected off of it and the corresponding reflection recorded by the receiver. Facial movement during speech production leads to shifts in the original ultrasound signal frequency due to the Doppler effect, and these shifts can be analyzed to extract articulatory movement information. One of the main benefits of such a sensor for a speech communication system is its robustness to ambient noise and reverberation. While other aforementioned sensors may also share this property, the Doppler sensing approach is also non-intrusive, i.e., no skin contact is required, as opposed to methods like electromagnetic articulography, electromyography or X-ray microbeam systems. Additionally, unlike video [91–93], ultrasound sensing does not raise privacy concerns since no sensitive user information is recorded. Finally, the sensors employed in this work consist of low-cost off-the-shelf hardware components and can easily be integrated into existing speech communication systems and hardware.

While the augmentation of a speech communication system with an auxiliary ultrasound sensor and its application for enhancement of the captured noisy speech signal in challenging environments is discussed in the following chapters, the present chapter focuses on the information about speech production captured via ultrasonic Doppler sensing. As the employed setup is essentially an ultrasonic Doppler sonar and not designed specifically for articulatory information capture, it is important to analyze the captured reflections and differentiate between the different contributions that affect them. This knowledge is essential for the extraction of meaningful information and speech production features from the captured reflection, which are then used for further processing as discussed in Chapters 3 and 4.

2.2 Ultrasound Doppler sensing setup and background

Throughout this work, we focus on a hands-free speech communication scenario, in which all sensors included in the employed communication system are positioned at a 50 cm distance from the speaker. The ultrasonic sensor used here consists of an ultrasound transmitter with a carrier frequency of 40 kHz (Prowave 400ST100) and an ultrasound receiver with a bandwidth of 3 kHz (approximately −6 dB at 40 kHz ± 1.5 kHz) around the carrier frequency (Prowave 400SR100). Additionally, the employed sensor contains a standard air-conduction (AC) microphone, allowing for time-aligned articulatory information and speech capture. The choice of sensor components and overall design are based on the systems in [94] and [95]. In comparison to those systems, the distance between the transmitter and the receiver of the sensor was increased to 10 cm as shown in Fig. 2.1 to reduce carrier signal leakage.

Figure 2.1: Hardware used during the recordings. The sensor consists of one AC microphone, an ultrasound transmitter (Prowave 400ST100) and an ultrasound receiver (Prowave 400SR100). The ultrasound transmitter operates with a carrier frequency of 40 kHz and the receiver has a bandwidth of 3 kHz around the carrier.

In our recording setup, the sensor was positioned at the height of the speaker's mouth and aimed at the speaker's face. The sound pressure level of the transmitted ultrasound signal at a 50 cm distance was approximately 80 dB SPL¹. The emitted signal is reflected off of the speaker's face and captured by the ultrasound receiver. This reflection can contain frequency components different from the carrier frequency due to the Doppler effect, which describes the change in frequency of a sound wave after being reflected off of a moving surface – here, the surfaces are the articulators involved in speech production. Assuming one moving surface, the frequency of the ultrasound reflection after the Doppler shift, $f_r$, is given by

$$f_r = \frac{c + v}{c - v}\, f_c \approx \left(1 + \frac{2v}{c}\right) f_c = f_c + \Delta f, \qquad (2.1)$$

where $f_c$, $v$, $c$ and $\Delta f$ are the frequency of the emitted carrier signal, the velocity of the moving articulators or other parts of the speaker's face², e.g., cheeks or chin, the speed of sound and the movement-induced Doppler shift, respectively [99]. The factor 2 is related to the fact that the beam has to travel the distance between sensor and speaker twice, similar to applications using Doppler radar. A schematic of the recording setup is shown in Fig. 2.2. Throughout all recordings, the sensor shown in Fig. 2.1 is used in horizontal orientation – the representation in Fig. 2.2 is used solely for visualization purposes.

¹ This was a significantly lower exposure level than the levels considered safe by various standards from around the world [96, 97]. A detailed discussion on the use of ultrasound, its characteristics and effects on people as well as additional safety aspects is given in [98].

² This includes only velocity components collinear with the direction of the ultrasound signal.
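To make the magnitudes implied by equation (2.1) concrete, the short sketch below (in Python, with illustrative velocity values) evaluates the approximate Doppler shift for a few surface velocities at the 40 kHz carrier used in this setup.

```python
SPEED_OF_SOUND = 343.2   # m/s, approximate speed of sound at room temperature
CARRIER_FREQ = 40_000.0  # Hz, carrier frequency of the ultrasound transmitter

def doppler_shift(velocity, fc=CARRIER_FREQ, c=SPEED_OF_SOUND):
    """Approximate Doppler shift (Hz) for a surface moving at `velocity` m/s.

    Implements the approximation delta_f = (2 * v / c) * fc from equation (2.1);
    positive velocities denote movement towards the sensor.
    """
    return 2.0 * velocity / c * fc

# Peak articulator velocities of roughly 1 m/s correspond to shifts of about
# 230 Hz, while a 1 kHz shift would require a velocity above 4 m/s.
for v in (0.1, 0.5, 1.0, 4.3):
    print(f"v = {v:4.1f} m/s  ->  delta_f ~ {doppler_shift(v):6.1f} Hz")
```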

Figure 2.2: Diagram of the recording setup used throughout all recordings and experiments in this thesis, unless stated otherwise. The sensor containing the ultrasonic receiver and transmitter pair as well as the AC microphone is positioned in front of the speaker at the height of the speaker's mouth and at a distance of d_ss = 50 cm. The distance between the transmitter and receiver on the sensor is d_tr = 10 cm. Throughout the experiments in this dissertation, the sensor shown in Fig. 2.1 is used in horizontal orientation.

Articulatory-movement-induced Doppler shifts in the ultrasound reflection can be directly associated with the velocities of visible surfaces and articulators moving during speech production. In equation (2.1), the movement velocity v is assumed to be positive (negative) for movements towards (away from) the sensor, while only movement components parallel to the ultrasound beam result in Doppler shifts. Although articulatory movements are mostly perpendicular to the ultrasound beam, their parallel components are strong enough to cause Doppler shifts, as seen in the example in Fig. 2.3, which shows a spectrogram of the captured ultrasound reflection for a recording of five different uttered sentences. Furthermore, Fig. 2.4 shows the average power spectral density (PSD) across the first utterance of the example compared to a silent segment of the same duration (≈ 3 s). Welch's periodogram method with a Hann window length of ≈ 140 ms and 50 % overlap was used to compute the average PSDs for all the ultrasound reflection signal analysis examples in this chapter [100]. These examples show that slow articulatory movements, resulting in Doppler shifts below 500 Hz, are 30 − 60 dB below the carrier energy, while contributions resulting in Doppler shifts of 500 − 1000 Hz are 60 − 75 dB weaker than the contributions at the carrier while still being above the noise floor of about 75 dB below the carrier. Additionally, a similar setup has also been successfully employed in the past for articulatory information capture, e.g., [101], [102], [103] and [104]. While the observed sideband contributions have mostly been assumed to originate from articulatory movement in the past, in the following we will show that other sources contribute to the ultrasound reflection as well.

Figure 2.3: Time-domain waveform of the captured AC signal (top panel) and the corresponding spectrogram of the ultrasound reflection (bottom panel) for five different sentences. Articulatory information, encoded in Doppler shifts, can be observed around the 40 kHz carrier.

Figure 2.4: Power spectral density averaged across the first utterance in Fig. 2.3 compared to a segment of silence of the same duration.
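The PSD analysis described above can be sketched as follows, assuming a single-channel recording of the ultrasound receiver sampled fast enough to cover the band around 40 kHz (e.g., 96 kHz); the file name is a hypothetical placeholder, and the roughly 140 ms Hann window with 50 % overlap follows the description given above.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

# Hypothetical single-channel recording of the ultrasound receiver, e.g. at 96 kHz.
fs, reflection = wavfile.read("ultrasound_reflection.wav")

nperseg = int(0.140 * fs)                    # Hann window of roughly 140 ms
freqs, psd = welch(reflection.astype(float), fs=fs, window="hann",
                   nperseg=nperseg, noverlap=nperseg // 2)

# Express the PSD in dB relative to the carrier peak, as in Fig. 2.4.
psd_db = 10.0 * np.log10(psd / psd.max())
band = (freqs >= 37_000) & (freqs <= 43_000)
for f, level in zip(freqs[band][::40], psd_db[band][::40]):
    print(f"{f / 1000:6.2f} kHz : {level:7.1f} dB")
```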

At this point it is important to note that while we use ultrasound Doppler processing for articulatory information capture, this is not the only applicable or feasible methodology in ultrasound-based speech communication systems. In [44], a standing wave between the speaker and the sensor was established using a 40 kHz excitation signal, and information about the movement of the lips was extracted based on changes in the envelope of the ultrasound reflection. Similarly, lip state detection has been used to determine the opening state of the speakers' mouths (open, closed, or half-open) in [46, 105, 106], where the detection methods relied on tracking resonance patterns in the reflected ultrasound signal – these were affected by resonating of the excitation signal inside the user's mouth. In comparison to the approach used here, the latter method relied on low-frequency ultrasound, using frequencies of 20 − 24 kHz, and on a short, 2 − 8 cm, distance between sensor and speaker. Finally, a different methodology (ultrasound imaging), utilizing an ultrasound transducer fixed under the speaker's chin, has been employed in silent speech interfaces in the past to get information about the vocal tract changes during speech production [43].

2.3 Modelling the ultrasound reflection

The original purpose of an additional ultrasound modality in our speech communication system was the capture and use of articulatory information related to speech production to aid communication. This use case is especially interesting in noisy environments, as the ultrasound sensor employed here is robust to environmental noise outside the 40 kHz ± 1.5 kHz frequency band, unlike a conventional AC microphone. Therefore, a major component of an ultrasound-based system is the extraction of meaningful information from the captured ultrasound reflection. Different contributions to the reflection captured using our setup are discussed in the following.

2.3.1 Contributions from articulatory movements

The captured ultrasound reflection contains contributions from a number of moving surfaces and articulators on the speaker's face involved in speech production, e.g., tongue, jaws, lips³, chin, etc. Hereby, each contribution is dependent on the distance between the corresponding moving surface or articulator and the sensor, the area of the moving surface and finally its velocity. When modelling the ultrasound reflection, all of these movement components and their dependencies need to be taken into account. Here, we use parts of our model presented in [104] as a starting point, and express the captured reflection as

$$u_r^m(t) = \sum_{i=1}^{N_a} g_a^i(t)\,\cos\!\left(2\pi\left(f_c + \Delta f_a^i(t)\right)t + \phi_a^i(t)\right) + \sum_{j=1}^{N_s} g_s^j(t)\,\cos\!\left(2\pi f_c t + \phi_s^j(t)\right), \qquad (2.2)$$

where $N_a$ and $\Delta f_a^i(t)$ are the number of moving articulators and the Doppler shift resulting from the i-th moving articulator, respectively. The effect of the distance between the i-th articulator and the sensor on its contribution is represented by the corresponding phase term $\phi_a^i(t)$, while the dependency on the articulator surface area is incorporated through the gain factor $g_a^i(t)$. In addition to reflections from the speaker's face, the emitted signal is also reflected off of stationary objects and surfaces within the ultrasound beam. These reflections are represented by the second summation term and their contributions agree in frequency with the carrier, where $N_s$, $g_s^j(t)$ and $\phi_s^j(t)$ are the number of stationary objects that reflect the emitted signal and the corresponding gain and phase terms depending on the reflecting surface area and its proximity to the sensor, respectively.

³ Articulators involved in speech production can be divided into two classes depending on whether they move during speech production. Moving articulators are classified as active while stationary articulators are passive. The lips in particular are passive and active articulators, as the lower lip usually moves while the upper lip remains stationary [107, Chapter 2].
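As a minimal illustration of the structure of equation (2.2), the sketch below synthesizes a reflection containing a single moving articulator and a single stationary reflector; the gains, phase offset and sinusoidal velocity profile are arbitrary values chosen only to show how the two summation terms combine, with the time-varying Doppler shift realized through a cumulative phase.

```python
import numpy as np

fs = 192_000                      # sampling rate high enough to represent a 40 kHz carrier
fc = 40_000.0                     # carrier frequency in Hz
c = 343.2                         # speed of sound in m/s
t = np.arange(0, 1.0, 1.0 / fs)   # one second of signal

# Single moving articulator: an arbitrary slow sinusoidal velocity profile.
velocity = 0.5 * np.sin(2 * np.pi * 4 * t)             # m/s
delta_f = 2.0 * velocity / c * fc                       # instantaneous Doppler shift, eq. (2.1)
phase_articulator = 2 * np.pi * np.cumsum(fc + delta_f) / fs
articulator_term = 0.01 * np.cos(phase_articulator)     # gain kept constant for simplicity

# Single stationary reflector: a contribution at the carrier frequency only.
stationary_term = 1.0 * np.cos(2 * np.pi * fc * t + 0.3)

reflection = articulator_term + stationary_term         # simplified instance of eq. (2.2)
```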

The example in Fig. 2.3 shows sideband contributions of up to ±3 kHz in the ultrasound reflection. A Doppler shift of 1 kHz would correspond to a movement velocity above 4 m/s given a carrier frequency of 40 kHz. Such velocities, however, have not been reported in existing research on the kinematics of articulatory movements. Studies in this field mostly investigated changes in articulator velocity and displacement profiles throughout various speech characteristic manipulations (e.g., changes in speaking rate or stress) – kinematics of tongue movements were investigated in [68–70, 72, 73], lip movement profiles in [71–73, 108] and jaw kinematics in [71–73]. In the aforementioned studies, the measured peak articulator velocities were mostly lower than 1 m/s, which would lead to a Doppler shift of approximately 230 Hz according to equation (2.1). However, the ultrasound reflection in Fig. 2.3 clearly shows Doppler shifts that exceed this mark. One possible explanation for this occurrence may be that studies investigating articulatory movement kinematics usually focus on the main articulators – lips, jaws and tongue. However, these are not the only surfaces moving during speech production. As the ultrasound sensor can detect movement from any surface illuminated by its beam, it is possible that certain surfaces on the speaker's face move fast enough to result in higher Doppler shifts. However, we argue that the observed contributions originate from different sources.

2.3.2 Acoustic leakage

The first source of potential additional contributions to the ultrasound reflection is acoustic leakage – components with frequencies around the used carrier frequency originating from the uttered speech. Acoustic leakage effects were investigated using a speech recording utilizing a Bruel&Kjaer 4939 1/4 inch free-field microphone with a frequency range of 4 Hz − 100 kHz. The recording was conducted in an anechoic chamber with the aim to determine whether frequency components present in an acoustic speech signal reach 40 kHz. The acoustic waveform and spectrogram of the recorded speech for the first 10 sentences from the list of Harvard sentences presented in [109] are shown in Fig. 2.5. It can be seen that all uttered sentences contain segments with acoustic components around the 40 kHz frequency band, which are mostly related to the fricatives /s/ (e.g., smooth, makes, served), /z/ (e.g., rise, size), /f/ (e.g., fine, fed, faced) and /θ/ (e.g., smooth, depth) in this example. If we look at one of the stronger contributions seen in the last sentence (Large size in stockings is hard to sell) originating from the /s/ in the word size (approximately at the 26 s mark), it can be seen that the acoustic contributions picked up around 40 kHz seem to be similarly strong as the contributions between 2 and 5 kHz. For further analysis, Fig. 2.6 shows a comparison between the PSDs of 100 ms segments taken from the vowel /a/ in large, the fricative /s/ in size and a silent segment. Indeed, the contributions for the fricative /s/ around 40 kHz are approximately 10 dB above the noise floor and thus leak into the ultrasound receiver.

Figure 2.5: AC microphone signal waveform and corresponding spectrogram of the first 10 sentences from the Harvard sentence list recorded using a free-field microphone with a frequency range of 4 Hz − 100 kHz. It can be seen that the uttered sentences contain acoustic components in the 40 kHz range – the spectral range of our ultrasonic receiver is highlighted by the dashed lines.

Figure 2.6: Comparison between the PSDs of a silent segment, the vowel /a/ in large and the fricative /s/ in size from the last utterance in the example in Fig. 2.5.
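A comparison of this kind can be sketched as follows, where short segments are reduced to their average PSD level inside the receiver band; the synthetic segments and the 192 kHz sampling rate are stand-ins for the actual 100 ms excerpts of the free-field recording.

```python
import numpy as np
from scipy.signal import welch

def band_level_db(segment, fs, f_lo=38_500.0, f_hi=41_500.0):
    """Average PSD level (dB) of a segment within the ultrasound receiver band."""
    freqs, psd = welch(segment, fs=fs, window="hann", nperseg=min(len(segment), 4096))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return 10.0 * np.log10(np.mean(psd[band]) + 1e-20)

# Synthetic stand-ins for 100 ms segments of a 192 kHz recording:
# broadband "fricative-like" noise vs. a low-frequency "vowel-like" tone.
fs = 192_000
t = np.arange(int(0.1 * fs)) / fs
rng = np.random.default_rng(0)
fricative_like = rng.normal(scale=0.1, size=t.size)                     # energy up to fs/2
vowel_like = np.sin(2 * np.pi * 200 * t) + 0.001 * rng.normal(size=t.size)

print("fricative-like:", round(band_level_db(fricative_like, fs), 1), "dB")
print("vowel-like:    ", round(band_level_db(vowel_like, fs), 1), "dB")
```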

2.3.3 Air flow-ultrasound interaction

Apart from acoustic leakage, another source of higher Doppler contribution may be related to the interactions between the emitted ultrasound signal and the air flow generated by inhaling and exhaling during speech production. In a first attempt to further investigate this interaction, a recording was carried out in which a speaker, at the standard 50 cm distance from the sensor, articulated two different sentences with and without vocalization – in the latter case, the speaker moved his lips without producing any sound and without breathing. The reasoning for this experiment was to compare pure articulatory contributions to the contributions seen in the ultrasound reflection for normal speech. The acoustic waveform and corresponding ultrasound spectrogram are shown in Fig. 2.7, while a comparison between PSDs from the vocalized and mimed speech, again averaged over one utterance, is shown in Fig. 2.8.

Differences between vocalized and silent speech are visible in both examples. In Fig. 2.7, the contributions of silent speech seem to be contained within a 1 kHz band around the carrier (40 kHz ± 500 Hz). In comparison, vocalized speech results in contributions exceeding 1 kHz. This can also be observed in the average PSDs presented in Fig. 2.8, where contributions above the noise floor can be seen for higher distances to the carrier for the vocalized examples. As the differences between the silent and vocalized speech are apparent throughout each utterance and not only for sounds leading to acoustic leakage (e.g., fricatives discussed in the example in Fig. 2.5), it is reasonable to conclude that, apart from acoustic leakage, the air flow generated during speech production interacts with the emitted ultrasound signal. These interactions then lead to additional components in the ultrasound reflection with frequencies outside the carrier unrelated to the Doppler effect. In the example in Fig. 2.7 and Fig. 2.8, the sideband contributions with higher distances from the carrier (higher than 500 Hz) generated during vocalized speech are 10 − 15 dB above the noise floor but significantly weaker (up to 40 dB in the example) compared to contributions at frequencies closer to the carrier corresponding to large articulator movements present in both vocalized and silent speech (see also the discussion for Fig. 2.3).

Figure 2.7: AC microphone signal waveform and ultrasound reflection spectrogram for two uttered sentences (each repeated twice) from the Harvard sentence list [109]. The sentences were uttered twice with and twice without vocalization, and significantly wider ultrasound contributions, in terms of sideband width, can be observed in the vocalized cases. The used sentences were 'The birch canoe slid on the smooth planks.' (first two utterances) and 'Large size in stockings is hard to sell.' (last two utterances).

When analyzing these recordings, it is important to note that while articulatory movements can be emulated during mimed speech, they are not entirely equal to the standard movements occurring during speech. The reason for this is that vibrations of articulators and parts of the speaker's face that occur during the production of voiced sounds or stop consonants, which are generated by the sudden release of air pressure, are difficult to recreate without the actual air flow.

To further investigate the effect of breathing, a speaker, seated in front of the sensor at the standard position defined in Fig. 2.2, was recorded while breathing normally. This recording is shown in Fig. 2.9, where the AC microphone signal is presented as a reference – the subject knocked on his chair prior to each exhalation throughout the recording. Again, for analysis purposes, an average PSD, this time averaged over one breathing instance (the third breath starting around 20 s in this example), is presented in Fig. 2.10. Hereby, the left panel of Fig. 2.10 shows the average contributions of breathing and compares them to contributions from speech created without vocalization from the example in Fig. 2.7, while the right panel compares the breathing contributions in the upper and lower sidebands.

Figure 2.8: Average PSD comparison between utterances produced with and without vocalization. The PSDs were computed for the utterance 'The birch canoe slid on the smooth planks.'.

It can be observed that breathing causes contributions around the carrier that resemble white noise, due to the interaction between the air flow and the emitted ultrasound signal. These contributions are up to 20 dB above the noise floor, depending on the distance to the carrier and the exhalation strength, but they are still weaker than the contributions from the articulatory movements shown in the left panel of Fig. 2.10. It is important to note once again that the presented PSDs are averaged over a whole utterance or breathing sequence, which implies that the difference between the contributions will vary with the position within the utterance and the breathing state. Furthermore, the exhalation strength is expected to be lower during speech than during a full exhalation. The right panel of Fig. 2.10 shows that the majority of the contributions lies in the upper frequency sideband (above the carrier frequency).
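Comparisons of this kind can be reproduced with a few lines of standard signal processing. The following minimal sketch is our own illustration and not the processing chain used for the presented figures; the function name, the 192 kHz sampling rate, the segment length, the guard band and the file name are assumptions. It estimates the PSD around the carrier with Welch's method and reports the upper-to-lower sideband power ratio:

import numpy as np
from scipy.signal import welch

def sideband_power_ratio(x, fs, fc=40_000.0, max_offset=1_000.0, guard=100.0):
    # Welch PSD of the recorded reflection and the ratio (in dB) of the power
    # in the upper sideband to the power in the lower sideband; a small guard
    # band around the carrier is excluded so that the strong carrier component
    # is not counted towards either sideband.
    f, psd = welch(x, fs=fs, nperseg=8192)
    upper = (f > fc + guard) & (f < fc + max_offset)
    lower = (f > fc - max_offset) & (f < fc - guard)
    return f, psd, 10.0 * np.log10(psd[upper].sum() / psd[lower].sum())

# usage (hypothetical file name and assumed sampling rate):
# f, psd, ratio_db = sideband_power_ratio(np.load("breathing_reflection.npy"), fs=192_000)
# print(f"upper/lower sideband power ratio: {ratio_db:.1f} dB")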

Figure 2.9: Ultrasound reflection spectrogram for an inhaling and exhaling subject for a 50 cm sensor-subject distance and approximately 80 dB SPL of the emitted ultrasound signal at the speaker. Prior to exhalation, the subject knocked on his chair to create a reference point in the AC signal. The air flow during exhalations causes a white noise-like contribution in the ultrasound reflection.

The observed contributions can be explained as a modulation of the emitted ultrasound signal frequency by the air flow occurring during speech production. Part of this modulation originates from the moving air particles present during an exhalation – these modulate the medium in which the ultrasound wave propagates. Due to the direction of the air flow during exhalation (towards the sensor), most of the contributions from this interaction should lie in the upper frequency sideband. However, since the ultrasound wave has to travel the speaker-to-sensor distance twice, contributions in the lower frequency sideband are also present – the air particles associated with the propagation of the ultrasound signal are affected by the air flow twice, in opposing directions. The difference in the strength and bandwidth of these contributions is explained by the fact that only a fraction of the emitted ultrasound signal is reflected off the speaker.

Figure 2.10: PSDs for contributions of breathing and articulatory movements from speech created without vocalization (from the example in Fig. 2.8) are shown in the left panel. The right panel compares the upper and lower sidebands of the breathing contributions in the ultrasound reflection. The PSDs were computed over a segment of ≈ 3 s.

An additional source of modulation may be the changing temperature of the air in front of the speaker (due to breathing) during ultrasound propagation. The speed of sound depends on the temperature of the medium in which the sound propagates such that

c = 343.2 · √((273.15 + T) / 293.15) m/s,

where T is the temperature in degrees Celsius [110, pp. 13]. Therefore, fluctuations in the medium temperature correspond to fluctuations in the speed of sound and result in additional modulations of the emitted ultrasound carrier⁴.

Even though the effects observed in the experiment presented in Fig. 2.9 may be exaggerated, as the air flow generated during speech is likely weaker than the air flow generated during normal breathing, it is still reasonable to assume that similar effects also occur during speech production. These effects will depend on the produced speech sound (the air pressure is, for example, higher for consonants than for vowels [111]), on the speaking style [112] and on the personal characteristics of the speaker [113]. Finally, while the contributions of acoustic leakage and of air flow-ultrasound interactions may seem similar, the discussed air flow effects can only be observed in a narrow, asymmetric frequency band around the carrier, which allows us to distinguish them from the acoustic leakage discussed in the previous section. The same argument can also be used to conclude that the effects are not induced by acoustic leakage during breathing (sounds generated during breathing with components in the ultrasound frequency band).

⁴ As an example, a temperature fluctuation of ∆T = 1 °C around room temperature results in a change of ∆c ≈ 0.6 m/s.
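As a quick check of the footnote's arithmetic (our own linearization of the above formula around T = 20 °C, not taken from the original text):

∆c ≈ (∂c/∂T) · ∆T = 343.2 / (2 · √(293.15 · (273.15 + T))) · ∆T ≈ 0.59 m/s for ∆T = 1 °C at T = 20 °C,

i.e., a relative change in the speed of sound of roughly 0.17 %.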

2.3.4 Effects of vibrations from the speaker's face

Figure 2.11: AC signal spectrogram and ultrasound reflection spectrogram for the sustained phonation of the vowel /a/ for different fundamental frequencies. During the last utterance, the pitch frequency was varied twice throughout the sustained phonation. Contributions symmetric around the carrier, at frequencies corresponding to multiples of the fundamental frequency of the speaker, can be seen in both the upper and lower sideband of the ultrasound reflection.

In addition to the effects discussed so far, additional modulations of the emitted ultrasound carrier may be induced by vibrations of parts of the speaker's face during speech production. These interactions were first observed in a recording of a sustained phonation of the vowel /a/ for different pitch frequencies in the standard scenario. Spectrograms of the signals recorded by the AC microphone (top) and the corresponding ultrasound reflection (bottom) are shown in Fig. 2.11, where the vowel /a/ was sustained during four utterances for different and, in the case of the last utterance, varying fundamental frequencies. The spectrogram of the ultrasound reflection shows multiple parallel sideband contributions around the carrier frequency, occurring at distances from the carrier that are a multiple of the fundamental frequency of the corresponding utterance (40 kHz ± k · F0 Hz), where k is a positive integer – they mimic the harmonics of the corresponding acoustic signal. This connection can also be observed through the matching changes in the speaker's F0 and the contribution frequency: an increase (decrease) in pitch results in a larger (smaller) frequency distance between contribution and carrier. The average power spectrum of a 1 s segment of the first utterance of the sustained vowel /a/ from the example in Fig. 2.11, together with the average spectra of the sustained vowels /e/ and /o/, is shown in Fig. 2.12. Peaks at locations corresponding to harmonics of the acoustic signal can be observed, with magnitudes up to 20 dB above the noise floor. While these peaks correspond to harmonics in the acoustic signal, as already mentioned, they are not entirely symmetric in terms of magnitude. One reason for this is the non-ideal characteristics of the ultrasound receiver, shown in Fig. 2.16 in the Appendix: the receiver sensitivity differences for frequencies in the upper and lower sidebands at the same distance from the carrier reach up to 3 dB.

Figure 2.12: Average PSDs of 1 s segments of an ultrasound reflection resulting from the sustained phonation of different vowels. Symmetric contributions (peaks) around the carrier at k · F0 Hz distances can be seen.
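As an illustration of how the sideband locations visible in Fig. 2.12 could be read off automatically, the following sketch (our own, not the processing used for the presented figures; the sampling rate, spectral resolution, prominence threshold and file name are assumptions) picks spectral peaks within ± 500 Hz of the carrier and returns their offsets, which for a sustained vowel should cluster near multiples of F0:

import numpy as np
from scipy.signal import welch, find_peaks

def sideband_offsets(x, fs, fc=40_000.0, span=500.0, prom_db=6.0):
    # Offsets (Hz) of spectral peaks within +/- span of the carrier; for a
    # sustained vowel these should lie near 0, +/-F0, +/-2*F0, ...
    f, psd = welch(x, fs=fs, nperseg=19_200)   # roughly 10 Hz resolution at fs = 192 kHz
    band = (f > fc - span) & (f < fc + span)
    psd_db = 10.0 * np.log10(psd[band])
    peaks, _ = find_peaks(psd_db, prominence=prom_db)
    return f[band][peaks] - fc

# usage (hypothetical file name and assumed sampling rate):
# offsets = sideband_offsets(np.load("sustained_vowel_reflection.npy"), fs=192_000)
# print(np.round(offsets))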

The aforementioned effects observed for sustained vowels cannot be explained by the standard Doppler effect, as they would require a surface moving towards (or away from) the sensor with an approximately constant velocity for a prolonged period of time. We argue that the observed effects are caused by frequency modulation of the ultrasound carrier by vibrating surfaces, which would explain the similarity between the spectra of the observed ultrasound reflection for sustained vowels in Fig. 2.11 and the theoretically expected spectrum of a frequency modulated signal whose modulating signal is a sinusoid [114, Chapter 8]. In this special case, the resulting modulated signal can be represented by a sum of Bessel functions of the first kind and n-th order. The spectrum of such a signal then has a carrier component with an infinite number of sidebands at frequencies fc ± k · fmod, where k = 1, 2, 3, ... and fmod is the frequency of the modulating signal (for a more detailed discussion and formula derivation, see [114, Chapter 8] or [115, Chapter 5]).
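For reference, this expansion can be written compactly as follows (a standard result; the amplitude A and the modulation index β are our notation and are not taken from the original text):

s(t) = A · cos(2π·fc·t + β · sin(2π·fmod·t)) = A · Σk Jk(β) · cos(2π·(fc + k·fmod)·t), with k running over all integers,

where Jk denotes the Bessel function of the first kind and order k. For a reflection off a surface vibrating with peak displacement d, β is on the order of 4π·d/λ (a simplification assuming perpendicular incidence and the two-way propagation path), so the strength of the higher-order sidebands grows with the vibration amplitude relative to the ultrasound wavelength λ.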

To examine these modulations in more detail, additional measurements were carried out with a new setup. Instead of a human speaker, a loudspeaker (Genelec 1031A) was used. The reasoning behind this was that the loudspeaker membrane provides a larger vibrating surface than a human speaker, where vibration effects may be caused by vibrations of the speaker's face or of the inside of the speaker's mouth – the emitted ultrasound signal may enter the speaker's mouth and a reflection may exit it during speech production. Here, it is important to note that while vibration contributions seem to be absent in the running speech examples discussed in the previous sections, this may be due to the overlap with other speech production contributions, to the relatively weak magnitude of such contributions at normal speaking rates, or to the characteristics of the employed spectral analysis. Assuming that reflections from the inside of the speaker's mouth contribute to the effect, one example in which such reflections may be strong is the low-distance use case in which the sensor is mounted on a hand-held device, e.g., a mobile phone. An example of such a close recording (dss = 5 cm) is shown in Fig. 2.13. Indeed, the sidebands resulting from frequency modulation are clearly visible around the 14 s, 19 s, 24 s and 35 s marks. The effects of sensor positioning are discussed in more detail in the next chapter.

To simulate a seated speaker according to the standard setup in Fig. 2.2, a 35 x 33 x 55 cm cardboard box was placed on a chair in front of the sensor. The box also ensured a sufficiently strong reflection (replacing the human speaker's torso). The loudspeaker was then placed on top of the box, at the height of the ultrasound sensor and at the standard 50 cm distance from the sensor.

Figure 2.13: Spectrograms of the ultrasound reflection for nine sentences with a 5 cm distance between speaker and sensor.

In a first experiment, a 300 Hz sinusoid was emitted from the loudspeaker. The sound pressure level emitted by the loudspeaker, measured at the ultrasound sensor, was set to approximately 80 dB SPL at the start of the experiment and was decreased in 6 dB steps every 5 s. The resulting spectrogram of the ultrasound reflection (top), as well as a comparison between PSDs averaged over 2 s segments for three different sound pressure levels (bottom), can be seen in Fig. 2.14. Note that the use of a large and flat reflective surface resulted in a stronger reflection at the carrier compared to the (human) speaker.

Contributions in the upper and lower sidebands at fc ± 300 Hz are present throughout the recording, and their magnitude decreases with the sound pressure level, as expected. For the loudest setting, the second-order Bessel components are still strong enough to result in visible and measurable sidebands at 40 kHz ± 600 Hz.
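This behaviour can be reproduced in a short simulation. The sketch below is our own illustration, not the actual measurement; the sampling rate, vibration amplitude and analysis parameters are assumptions chosen for visibility. It models the reflection off a surface vibrating sinusoidally at 300 Hz as a phase-modulated 40 kHz carrier and lists the PSD levels at the carrier and at the first- and second-order sideband positions:

import numpy as np
from scipy.signal import welch

fs = 192_000                    # assumed sampling rate (Hz)
fc, fm = 40_000.0, 300.0        # carrier and vibration frequency (Hz)
c = 343.2                       # speed of sound (m/s)
disp = 0.2e-3                   # assumed peak displacement of the vibrating surface (m)

t = np.arange(0.0, 2.0, 1.0 / fs)
lam = c / fc                    # ultrasound wavelength (about 8.6 mm)
beta = 4.0 * np.pi * disp / lam # modulation index for the two-way reflection path
rx = np.cos(2.0 * np.pi * fc * t + beta * np.sin(2.0 * np.pi * fm * t))

# 10 Hz resolution so that the carrier and the k*300 Hz sidebands fall on bins
f, psd = welch(rx, fs=fs, nperseg=19_200)
for k in (0, 1, -1, 2, -2):
    freq = fc + k * fm
    level = 10.0 * np.log10(psd[np.argmin(np.abs(f - freq))])
    print(f"{freq:7.0f} Hz: {level:6.1f} dB")   # sidebands appear at fc +/- k*300 Hz

With the assumed displacement, the first-order sidebands sit roughly 17 dB below the carrier and the second-order sidebands roughly 39 dB below it, illustrating how the sideband pattern of Fig. 2.14 can arise from a purely vibrational interaction.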

In a second experiment, a more complex signal was examined: a speech signal was played back via the loudspeaker, again with a level of 80 dB SPL at the ultrasound sensor. The spectrograms of the speech signal and of the corresponding ultrasound reflection are shown in Fig. 2.15. Comparing the top and middle panels in Fig. 2.15, we can recognize harmonic components of the speech signal in the ultrasound reflection, caused by the vibration-induced frequency modulation. While the high sound pressure levels used in this experiment do not occur during human communication in realistic environments, the presence of the same effects was validated using lower sound pressure levels; 80 dB SPL was chosen in the presented examples for clarity of presentation.

While a more detailed analysis of the underlying acoustical processes and their exact modelling was beyond the scope of this dissertation, the presented experiments led to the conclusion that vibrating surfaces cause a frequency modulation of the emitted ultrasound carrier and result in the phenomena observed in Figs. 2.11, 2.14 and 2.15.


Figure 2.14: Ultrasound reflection spectrogram for a loudspeaker placed in front of the ultrasound sensor at a 50 cm distance and the emission of a 300 Hz tone at a decreasing sound pressure level (marked by the vertical dotted lines). Due to the interaction between the emitted ultrasound and the vibrating loudspeaker membrane, a frequency modulation of the ultrasound carrier can be observed.

2.3.5 Interactions between ultrasound and the produced speech

While analyzing the ultrasound reflection, we came across the experiment in [116], where the authors investigated the modulation of an ultrasonic wave by the wave field of an audio source. In their setup, the ultrasound receiver and transmitter were placed in separate locations at a 4 m distance from each other, and the emitted ultrasound signal crossed the wave field of an audio source before being captured by the receiver. The results of this experiment led the authors to the conclusion that the audio wave can affect the speed of sound along the propagation path of the ultrasonic wave, which results in its modulation. Two possible sources responsible for this change in the speed of sound were identified: air particle motion induced by the audio waves, and the change of medium density caused by the audio wave, which is associated with a periodic change of the pressure and density in the medium.

Figure 2.15: Ultrasound reflection spectrogram for a loudspeaker placed in front of the ultrasound sensor at a 50 cm distance, emitting a speech signal. The top panel shows the spectrogram of the emitted speech signal, the bottom panel the spectrogram of the corresponding ultrasound reflection, and the middle panel the spectrogram of the upper sideband of the ultrasonic reflection for easier comparison. Due to the interaction between the emitted ultrasound signal and loudspeaker membrane vibrations, the components of the emitted speech signal (mainly the lower harmonics) modulate the emitted ultrasound carrier.

While the results from the experiments with sinusoidal sounds in [116] show similarities with the frequency modulation effects discussed in the previous section, we argue that such non-linear effects in air, due to the interaction of the two sound waves, are negligible for the applications examined in this thesis. The main reason for this conclusion lies in the employed sound pressure levels, which lead to physical differences between our setup and the setup in [116]. While a dynamic range of 120 dB within the used frequency spectrum was required in [116], resulting in extremely high sound pressure levels,
