
UNIVERSITY OF THE BASQUE COUNTRY

MASTER THESIS

Improving the Intelligibility of Esophageal Speech using Non-Parallel Voice Conversion

Author: Inge SALOMONS

First Supervisor: Prof. Eva NAVAS, University of the Basque Country

Second Supervisor: Prof. Dr. Martijn WIELING, University of Groningen

European Masters Program Language and Communication Technologies


“For most people, technology makes things easier. For people with disabilities, technology makes things possible.”


Abstract

Impaired speech can result in reduced intelligibility, causing communication problems and other (social) difficulties for patients. To overcome this, Voice Conversion (VC) methods have been proposed to improve the intelligibility of impaired speech. VC aims to modify a speech signal uttered by a source speaker to sound as if it was uttered by a target speaker, while maintaining the linguistic content of the utterance. This study aims to determine to what extent the intelligibility of esophageal speech (speech produced by means of vibrations in the esophagus after the larynx has been removed) in Spanish can be improved when using a state-of-the-art non-parallel VC model. In a non-parallel approach, the linguistic information of the source and target data is unrelated. One of the advantages of this approach compared to a parallel approach is that the source data set is relatively small, meaning that the speaker does not have to provide as much data to train the VC model on. Since producing speech while suffering from a speech impairment can require a lot more effort compared to the natural process of speech production, the community of esophageal speakers could benefit greatly from a model that improves the intelligibility of their speech by only providing a few minutes of training data. We adopt a model that improves the intelligibility of impaired speech resulting from an articulation disorder in Mandarin and apply it to esophageal speech and dysarthric speech in Spanish. To study the model performance regarding the acoustic features of the speech signals, effects on the spectral envelope and fundamental frequency (F0) are also considered. The model is a Generative Adversarial Network (GAN) that is trained on non-parallel data and learns feature relations between impaired and unimpaired speech from mel-spectrograms. The results demonstrate that, although improvements in the speech quality were identified, the model did not succeed in improving the speech intelligibility. This indicates that further research is necessary in order to find an alternative approach that is able to improve the intelligibility of esophageal speech without the need for a parallel data set.


Preface

When people asked me what I wanted to be when I grew up, I responded with "nurse", then "teacher", then "doctor" and finally "midwife". Looking back, I think I just wanted to pursue a career in which I could actively help people. So, after I graduated high school, I started studying midwifery. However, soon thereafter I found out that there was not enough room to feed my curiosity; most of my questions were considered "not relevant" and "unnecessary". That is when I decided to switch plans. I chose to study linguistics, which was in line with my love for languages, but also had a practical side: my new plan was to become a clinical linguist. However, again I found myself being mostly intrigued by the theoretical side of the topic, which eventually led me to the field of computational linguistics. Little did I know back then that now, six years later, the circle would be complete. I proudly present this thesis, whose topic reflects my interests in a research area where questions are welcome and new ones are created every day, and which at the same time aims at improving the lives of others.

That being said, I want to strongly emphasize that my motivation and the knowledge I gained over the years are merely a fraction of how this thesis has come about. I did not do this alone.

First of all, I would like to say that not enough words in the world could express how grateful I am for the patient, supportive, and enthusiastic guidance from my supervisor Eva Navas. She was there every step of the way, ready to answer questions, and no question seemed to be too much. Her enthusiasm and optimism kept me motivated and intrigued until the very last moment. I hope to meet her in person one day, so I can say "Thank you, gracias, eskerrik asko!"

I would also like to express my gratitude towards my second supervisor, Martijn Wieling, who, with his keen eye and critical questions, made me realize that sometimes it is necessary to determine whether you are still on the path you wanted to be on, and then find your way back to the path towards your goal.

Furthermore, I acknowledge the important part that my fellow students, the teachers, the program coordinators, and all others involved in the master's program played in my academic journey.

Last, but not least, I would like to thank, from the bottom of my heart, those who have always had faith in me, especially in those moments when I stopped believing in myself: my dearest family members and friends. To all the people close to me I say: without your love and support I would not have succeeded, and most importantly, without you I would not be the person I am today.

I dedicate this work to my parents, because they taught me how to be the best version of myself. Even though that is a task without a finish line, I hope they see this work as a result of their efforts.


Contents

Abstract
Preface
List of Figures
List of Tables
List of Abbreviations
1 Introduction
2 Speech Production
  2.1 The human speech system
  2.2 Impaired speech
    2.2.1 Esophageal speech
    2.2.2 Dysarthric speech
  2.3 Summary
3 Voice Conversion
  3.1 Summary
4 Related Work
  4.1 Therapy techniques
  4.2 Acoustic methods
  4.3 ASR and TTS
  4.4 VC methods
  4.5 Summary
5 The Model
  5.1 Summary
6 Data and Processing
  6.1 Target speech
  6.2 Source speech
    6.2.1 Esophageal speech
    6.2.2 Dysarthric speech
  6.3 Processing procedure
  6.4 Summary
7 Experimental Design
  7.1 Work environment
  7.2 Model configuration
  7.3 Procedure
    7.3.1 Phase 1: Understanding the model
    7.3.2 Phase 2: Model development
    7.3.3 Phase 3: Training
    7.3.4 Phase 4: Testing and evaluation
  7.4 Summary
8 Results
  8.1 Speech intelligibility
  8.2 Speech quality
  8.3 Summary
9 Discussion and Conclusion
A Data information per speaker


List of Figures

2.1 The human speech production system (from Adami, 2010)
2.2 The human speech system and air and food flow before and after a laryngectomy (from About breathing stomas 2018)
2.3 Waveform (top left), spectrogram (bottom left), relative intensity (top right) and auto pitch (bottom right) for unimpaired, esophageal and dysarthric speech
3.1 Diagram of a voice conversion system
5.1 Schematic overview of the model (Chen, Lee, and Tsao, 2018)
5.2 Global variances of data from original experiment by Chen, Lee, and Tsao (2018)
5.3 Mean Opinion Score (MOS) comparison between the proposed approach by Chen, Lee, and Tsao (2018) and two baselines, one parallel (cGAN) and another non-speaker-characteristics-preserving (CycleGAN)
6.1 Waveforms of dysarthric and esophageal speech signals before processing and after original and adapted processing
7.1 Perceptual loss over 100 epochs for models trained on different types of training data: dysarthric speech, and esophageal speech from three different speakers
8.1 Mean and individual STOI index values of the original and converted speech signals per speech type
8.2 Mean and individual WER values of the original and converted speech signals per speech type
8.3 Mean and individual MCD values of the original and converted speech signals per speech type
8.4 Mean and individual RMSE values of the original and converted speech signals per speech type
8.5 Global variances per speech type
8.6 Spectrograms of one of the original and converted speech signals per speech type
8.7 Waveforms and pitch curves of one of the original (black) and converted speech signals per speech type


List of Tables

6.1 Target speech data set information per speaker sex and total
6.2 Source speech data set information per speech type and speaker
A.1 Information per speaker of the unimpaired speech data


List of Abbreviations

AI Artificial Intelligence

ALS Amyotrophic Lateral Sclerosis

ASR Automatic Speech Recognition

CD Cepstrum Distance

dBFS deciBel relative to Full Scale

DNN Deep Neural Network

DTW Dynamic Time Warping

ELU Exponential Linear Unit

F0 Fundamental frequency

FFT Fast Fourier Transform

GAN Generative Adversarial Network

GMM Gaussian Mixture Model

MCD Mel-Cepstral Distortion

MFC(C) Mel-Frequency Cepstrum (Coefficient)

ML Machine Learning

MOS Mean Opinion Score

NNMF Non-Negative Matrix Factorization

LP Linear-Predictive

LSF Line Spectral Frequencies

(B)LSTM (Bidirectional) Long Short-Term Memory

PPG Phonetic Posteriorgram

ReLU Rectified Linear Units

RMSE Root Mean Squared Error

RNN Recurrent Neural Network

STOI Short-Time Objective Intelligibility

STFT Short-Time Fourier Transform

TEP Tracheo Esophageal Puncture

TTS Text To Speech

VC Voice Conversion

VT Voice Transformation


Chapter 1

Introduction

In The Age of A.I., a documentary covering ’the ways Artificial Intelligence, Machine Learning and Neural Networks will change the world’, former football player Tim Shaw heard his own voice again for the first time in many years. Tim suffers from amyotrophic lateral sclerosis (ALS), which is a progressive neurodegenerative disease that affects nerve cells in the brain and the spinal cord. Voluntary muscle action is progressively affected, which eventually results in losing the ability to speak, eat, move and breathe. He has difficulty speaking and articulating well, which affects the intelligibility of his speech.

A team of researchers and software engineers created an Automatic Speech Recognition (ASR) system that is able to transform impaired speech to text. Tim recorded several hours of audio to train this model. In addition, the team created a Text-to-Speech (TTS) system that transformed written text to speech in Tim’s former voice, the one he had before his illness. This model was trained on speech that was available from his time as a football player, like interviews or other video or audio fragments. When the researchers used this system to read out loud a letter that he wrote to his 22-year-old self, his parents got emotional. Tim had his voice back. This is a beautiful example of how Artificial Intelligence can improve the quality of life of people, and in this particular case of people whose speech is impaired.

However, even though this combination of systems works well, there is still room for improvement. In an ideal situation, the impaired speech would directly be translated into more natural and intelligible speech without losing the speaker identity and speaker characteristics, using one system and not requiring a lot of input data to train on. Recent research in the field of language technology has shown that this situation has the potential to become reality. The methods achieving this are based on Voice Conversion (VC), which is a procedure in which a speech signal uttered by a source speaker is modified to sound as if it was uttered by a target speaker, while maintaining the linguistic content of the utterance. Current VC methods consist of Machine Learning (ML) approaches in which a model is trained on a data set of impaired and unimpaired speech. This data set can be either parallel or non-parallel. If the data set is parallel, this means that the linguistic information of the source and target data is related, in other words, the linguistic content is the same.

The main goal of this study is to determine to what extent the intelligibility of esophageal speech can be improved when using a non-parallel VC model.

Esophageal speech is different from natural, unimpaired speech and from most other types of impaired speech, because the source of the speech production process is unnatural: esophageal speech is a form of alaryngeal speech, meaning that the speaker has no larynx. This results in a range of differences in the acoustic domain. For example, since there is no vibration from the vocal folds, the periodicity in the speech signal is weaker and less clear, resulting in a distorted fundamental frequency (F0). Dysarthric speech, on the other hand, is also less intelligible than unimpaired speech, but since the abnormality lies in the filtering of the sounds, it is more similar to natural speech in terms of acoustic features.

To achieve our research goal, we adopted a state-of-the-art model that improves the intelligibility of impaired speech resulting from an articulation disorder in Mandarin, and applied it to esophageal speech and dysarthric speech in Spanish. The model is a Generative Adversarial Network (GAN) that is trained on a non-parallel data set, and extracts the learning parameters from the mel-spectrograms of the speech signals.

Based on the theoretical framework and related work involved in this topic, the following hypotheses were formed:

1. (a) The intelligibility of dysarthric speech will improve, because the characteristics of this type of speech are similar to those of the source speech in the original study.

(b) Due to the abnormality in the sound generation of esophageal speech, the intelligibility might not improve, or only to a smaller extent than for dysarthric speech.

2. The spectral envelope, which is a representation of the filter response, will improve for dysarthric speech due to the abnormality in the sound filtering of dysarthric speech, but will be less or not affected in esophageal speech.

3. The F0 curve will improve for esophageal speech to compensate for the abnormality in the sound generation. Since dysarthric speech has no problem with the source, the model will have no effect on the restoration of F0, or a smaller effect than for esophageal speech.

The results of the experiment showed that the intelligibility of both speech types did not improve after conversion, with respect to their intelligibility before conversion. However, improvements in the quality of speech were identified. As expected, the spectral envelope of dysarthric speech and the F0 curve of esophageal speech were more similar to unimpaired speech than before conversion.

By performing this study we contributed to the research field that focuses on the improvement of impaired speech, by providing new insights about re-using an existing non-parallel VC model for a different type of impaired speech in general, and more specific knowledge regarding the effect of this type of voice conversion on dysarthric and esophageal speech.

In the next chapter, the process of natural speech production is explained briefly, after which the two types of impaired speech, esophageal and dysarthric speech, are discussed in more detail. Then, in chapter 3, a general overview of the VC method is given. In chapter 4 other approaches that are related to this topic are discussed. It allows the reader to understand the key concepts and state-of-the-art of the current topic, and provides a critical view on the previous approaches that has led to the current approach. The chapter that follows describes the state-of-the-art model that is adopted and the results obtained in the original study (chapter 5). The data set and the procedure of processing the data are discussed in chapter 6. Then the experimental design of the current study is introduced in chapter 7, describing the work environment, model configuration and experimental procedure. Chapter 8 presents the results, showing the outcomes of the evaluation procedure. Finally these findings are discussed in chapter 9, followed by a general conclusion.


Chapter 2

Speech Production

2.1 The human speech system

In natural circumstances, speech is produced by pushing air from the lungs through the vocal folds to the mouth and nasal cavity. By moving the articulators, such as the lips, tongue and jaw, different sounds are produced. When done correctly, this results in a language that can then be perceived and understood by other speakers of that language. Which language is spoken depends on the combination and types of sounds that are produced. The human speech system consists of the sub-glottal system (the lungs and trachea), the vocal tract (the vocal cords or glottis, pharynx, larynx and oral cavity) and the nasal tract (soft palate and nose cavity). Figure 2.1 shows a diagram of a human speech system and where the different parts are located. Vibration of the vocal cords results in a periodic interruption of the airflow, creating voiced sounds such as vowels. The frequency of this periodicity is referred to as the fundamental frequency (F0) and is perceived as the pitch. In unvoiced sounds there is no vibration of the vocal cords, meaning that the airflow is free, which is characterized by the absence of F0.

In the source-filter theory (Fant, 1971), speech production is described as a two-stage process, namely the sound generation (source), and the sound filtering by the resonant properties of the vocal tract (filter).

An abnormality in one of these two stages may cause difficulties in the production of speech. These difficulties vary from stuttering, to aphasia, to articulatory problems, but all result in some level of impaired speech. Two types of impaired speech that are the topic of this research will be explained in further detail, namely alaryngeal speech and dysarthric speech. The most important difference between these two types of speech is that in the case of alaryngeal speech there is an abnormality in the generation of the sound, whereas dysarthric speech is caused by brain damage, resulting in an abnormality in the sound filtering.

2.2 Impaired speech

Speech is an important part of human communication. An impairment in speech production can directly cause communication problems, which indirectly can result in social difficulty or depression. This often affects relationships with friends and family or can result in social isolation (Dysarthria: Symptoms and Causes 2020). The severity of the impairment and the resulting decrease in intelligibility varies per impaired speech type and person. Two types of impaired speech that are related to this study, esophageal and dysarthric speech, are described in more detail below.

Figure 2.3 shows the waveforms, spectrograms, relative intensity and pitch values for fragments of unimpaired, esophageal and dysarthric speech, with the same linguistic content (‘de atonía’, extracted from the sentence ‘Unos días de euforia y meses de atonía.’). The recordings of the two impaired speech signals were taken from the data set used for this study (see chapter 6), and the recording of the unimpaired reference sentence was provided by a native Spanish female speaker.

2.2.1 Esophageal speech

Natural speech as described in the previous section is also referred to as laryngeal speech, since the larynx, and especially the vocal folds, play an important role. If a person has undergone a larynx amputation surgery, usually because of laryngeal cancer, the production of laryngeal speech is no longer possible. This is due to the absence of the vocal folds and the separation of the airway from the nasal cavity and the mouth. In order to breathe, laryngectomees receive a stoma in the throat, directly attached to the trachea. Figure 2.2 shows the difference in speech systems before and after a laryngectomy.

FIGURE 2.2: The human speech system and air and food flow before and after a laryngectomy (from About breathing stomas 2018)

There are three alternative ways to speak without a larynx. The first is using a voice prosthesis or tracheo esophageal puncture (TEP). An opening between the trachea and oesophagus allows the speaker to push air from the lungs through this opening up into the mouth when covering the stoma with a finger. Although this is the most common way, it occasionally results in speaking difficulty because the pharynx goes into spasm or there is swelling around the opening. The second way is using an electrolarynx, a battery-operated device that creates the vibrations that the vocal cords normally produce. Because it makes some noise, it is particularly used if a voice prosthesis is not (yet) an option. The last type of alaryngeal speech is esophageal speech. A speaker who produces speech in this way pumps air from the mouth into the esophagus and the stomach and, when releasing this air, a vibrating tissue around the entrance of the esophagus generates excitation signals. These signals are articulated the same way as when natural speech is produced. This method used to be the most common one and is still preferred by people who would like to communicate without the need of any equipment or surgical procedure (Learning to speak again 2018).

Apart from the way the speech is generated, esophageal speech differs from natural speech in a number of acoustic features (Othmane, Martino, and Ouni, 2019):

• The waveform of the speech signal is more unstable;

• There are specific noisy sounds, present in all frequency bands;

• The fundamental frequency (F0) contour is more chaotic;

• The spectral power, representing the intensity, fluctuates more.

These differences in acoustic features between esophageal and unimpaired speech are clearly visible in figure 2.3, namely there is no periodicity in the waveform, the auto pitch curve is less stable, the relative intensity fluctuates more, and there is more noise in the spectrogram.
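Contours like those shown in figure 2.3 can be reproduced with standard tools; the sketch below uses librosa's pyin pitch tracker and frame-wise RMS energy. This is assumed tooling for illustration (the thesis does not state which software produced the figure), and the file name is hypothetical.

```python
# Extract an F0 contour and a relative-intensity (RMS) curve for inspection.
# Assumed tooling for illustration; the file name is hypothetical.
import librosa

y, sr = librosa.load("esophageal_fragment.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
rms = librosa.feature.rms(y=y)[0]   # frame-wise energy, a proxy for relative intensity

# For esophageal speech, f0 typically contains many unvoiced (NaN) frames and
# jumps erratically, and rms fluctuates more than for unimpaired speech.
```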

The unnatural characteristics of esophageal speech result in a more hoarse, creaky and unnatural voice, which is less intelligible to listeners (Lachhab et al., 2015). In general, the intelligibility and quality of esophageal speech are lower than those of laryngeal speech, but to what extent they are lower depends on the experience and abilities of the speaker. The amount of listening effort it takes the listener to understand the esophageal speaker decreases when they become more familiar with the speaker (Raman et al., 2019).


2.2.2 Dysarthric speech

Dysarthria is a mild to severe motor speech disorder that is a result of weakness in the muscles that are used for speech, due to brain damage. These are muscles in the face, lips, tongue, throat and the muscles used for breathing. The brain damage can happen at birth or after an illness or injury, such as a stroke or tumor, or diseases like amyotrophic lateral sclerosis (ALS), Parkinson’s or Lyme disease (among others). Some medications can also cause dysarthria. Dysarthric speech can be characterized as slurred, slow or rapid, with an uneven rhythm or volume, and monotone. The exact symptoms depend on the underlying cause and type of dysarthria, but in general dysarthric speech is more difficult to understand than unimpaired speech (Dysarthria: Symptoms and Causes 2020).

In figure 2.3, dysarthric speech can be compared with unimpaired speech in terms of acoustic features. The utterance of dysarthric speech in this example is spoken by a female Spanish speaker with ALS. The relative intensity shows that the dysarthric speech is more monotone, and the decrease in articulation is visible in the waveform of the signal, where there is a less clear distinction between phonemes than in the waveform of unimpaired speech. Also the difference in rhythm is visible; the fragment is one second longer due to longer phonemes and internal pauses.

2.3 Summary

Natural healthy speech is produced by pushing air from the lungs through the vocal cords to the mouth and nose and creating sounds by moving the articulators (lips, tongue, jaw). An abnormality in this process can result in less intelligible, impaired speech. In esophageal speech, the sounds are generated differently because of the absence of the vocal folds. In dysarthric speech, the sounds are filtered differently because of a weakness in the articulatory muscles.

FIGURE 2.3: Waveform (top left), spectrogram (bottom left), relative intensity (top right) and auto pitch (bottom right) for unimpaired speech (0.7 sec.), esophageal speech (0.7 sec.) and dysarthric speech (1.6 sec.)

Chapter 3

Voice Conversion

Recent approaches to improve the intelligibility of alaryngeal speech implement the Voice Conversion (VC) method, which is a Voice Transformation (VT) technique. The process of VT in general refers to the modification of one or more aspects of a human-produced speech signal, while preserving the linguistic information of the signal. In VC, the goal is to modify a speech signal uttered by a source speaker to sound as if it was uttered by a target speaker, while maintaining the linguistic content of the utterance. More specifically, in VC the speaker-dependent characteristics of the speech signal are modified, such as spectral and prosodic aspects, but the speaker-independent information, referring to the linguistic content, is left unchanged.

The VC process consists of two phases. In the training phase the model is trained to learn the feature relations between the source and target speech. In the conversion phase new input utterances from the source speaker are converted using these learned relations. In a typical VC system, in both phases the mapping features of the input waveform signals are first computed based on a speech analysis, so that the speech properties can be modified. Examples of such features are the spectral envelope, cepstrum, line spectral frequencies (LSF) and formants. Figure 3.1, adapted from Mohammadi and Kain (2017), shows a diagram of the two phases of a VC system and their processes. This diagram shows that the time alignment step is optional. This is related to whether the data set is parallel or not. If a data set is parallel, it means that the linguistic content of the source speaker is the same as that of the target speaker. This approach of using parallel data requires the use of a time alignment process, manual or automatic. An example of an often used automatic process is a dynamic time warping (DTW) algorithm, which computes the best time alignment between the source and target utterances, resulting in a pair of source and target features of equal length. This is necessary because two recordings of the same utterance, even when spoken by the same speaker, are very likely to differ in phoneme duration and therefore in the total length of the recording. So, in the training phase, the source and target speaker’s speech segments are time-aligned in order to associate segments that are related in terms of linguistic content with each other (Mohammadi and Kain, 2017). However, this time alignment step is not necessary if the data set is non-parallel, meaning that the linguistic content of the source speaker is not related to that of the target speaker.
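As an illustration of this optional alignment step, the sketch below aligns the MFCC sequences of a parallel source/target pair with librosa's DTW implementation. The function name and feature choice are assumptions made for this example, not a reproduction of any specific system discussed here.

```python
# Minimal DTW alignment of a parallel source/target pair (illustrative sketch).
import numpy as np
import librosa

def align_parallel_pair(source_wav, target_wav, sr=16000, n_mfcc=13):
    x, _ = librosa.load(source_wav, sr=sr)
    y, _ = librosa.load(target_wav, sr=sr)
    mfcc_x = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc)
    mfcc_y = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # wp is the optimal warping path: pairs of (source_frame, target_frame) indices
    _, wp = librosa.sequence.dtw(X=mfcc_x, Y=mfcc_y, metric="euclidean")
    wp = np.array(wp[::-1])                              # path is returned end-to-start
    return mfcc_x[:, wp[:, 0]], mfcc_y[:, wp[:, 1]]      # equal-length feature pairs
```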

The results of a VC system can be evaluated with respect to three different aspects: the similarity between the converted speech and the target speech (speaker similarity), the quality of the speech, and the speech intelligibility. There exist subjective evaluation measures such as the mean opinion score (MOS) and objective evaluation measures such as the mel-cepstral distance (mel-CD). Some objective evaluation metrics can only be applied when the data set is parallel.
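As an example of an objective metric, the mel-cepstral distortion between converted and target speech can be computed roughly as follows. The sketch assumes the two mel-cepstral sequences are already time-aligned (so it only applies to parallel data) and follows one common convention of excluding the 0th (energy) coefficient.

```python
# Rough sketch of mel-cepstral distortion (MCD) between aligned utterances.
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_conv):
    """mc_ref, mc_conv: aligned arrays of shape (frames, coefficients)."""
    diff = mc_ref[:, 1:] - mc_conv[:, 1:]                    # skip the 0th coefficient
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * float(np.mean(frame_dist))   # in dB
```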

Apart from converting impaired speech to more natural speech, the VC method has other applications. Mohammadi and Kain (2017) name the following: personalizing TTS systems, speech-to-speech translation, biometric voice authentication systems, speaking- and hearing-aid devices, and telecommunications.

FIGURE 3.1: Diagram of a voice conversion system

3.1 Summary

The method of Voice Conversion aims to convert source speech to target speech while maintaining the linguistic content. In a non-parallel approach, the linguistic content of the source and target speech are unrelated, meaning that the sizes of the two data sets can differ. A common application of this method is to improve the intelligibility of impaired speech.


Chapter 4

Related Work

As long as esophageal speech has been offered as an alternative speaking method by speech therapists, efforts have been made to make this kind of speech more intelligible. Over time the focus shifted from changing the process of speech production to converting its output, the speech itself. Speech that results from brain damage, such as dysarthric speech, has also been the topic of much research focused on intelligibility improvement. This chapter provides an overview of previous approaches in general and of different kinds of statistical VC methods.

4.1 Therapy techniques

Early approaches to improve the intelligibility of esophageal speech consisted of introducing new speech therapy techniques. An example is the ‘push-harder’ technique, proposed by Christensen and Dwyer (1990), which involves the emphasis on the production of unvoiced consonants, by making the speaker ‘push harder’ on these sounds. This would increase the intra-oral air pressure and sharpen the voicing contrast between voiced and unvoiced sounds. Analysis of listening test scores showed that the average intelligibility of unvoiced consonants increased from 55% pre-therapy to 66% post-therapy. Even though speech therapy is a very important part of producing alaryngeal speech and the quality of it differs per person, the absence of the vocal folds inherently entails a decrease in naturalness and intelligibility and cannot be overcome by speech therapy alone.

The importance of speech rehabilitation in the first year after total laryngectomy is demonstrated by Singer et al. (2013). The results of their longitudinal study showed that laryngectomees who attended any form of rehabilitation were more intelligible after one year than those who did not.

4.2 Acoustic methods

The next step in this research area was the approach of creating a new voice signal by adapting the acoustic features of the original voice signal. One of these synthesis methods is a linear-predictive (LP) analysis-synthesis method, for example the one proposed by Qi, Weinberg, and Bi (1995). After performing an LP analysis, the computed LP coefficients were combined with a synthetic source voice signal. To test five different conditions, they then performed some speech processing over the synthesized signal. In the overall best evaluated condition, the fundamental frequency (F0) was raised to 160 Hz and smoothed, and the formant frequencies were raised by 5%. For three out of the four alaryngeal speakers, listeners preferred the synthesized word over the original word in more than 80% of the cases. However, a more ideal method would enhance the speech of all speakers and should therefore focus more on individual speech rather than general characteristics.

To enhance the intelligibility of alaryngeal speech produced by means of an electrolarynx, noise reduction techniques have been proposed that led to speech that listeners evaluated as more pleasant and more natural (Liu et al., 2006; Tanaka et al., 2013). This technique focuses on reducing noise in speech signals by applying spectral subtraction and auditory masking.

Other approaches consisted of filtering, which resulted in clearer voices as evaluated by listeners (Hisada and Sawada, 2002), or using a synthetic glottal waveform, shown to reduce the perceived breathiness and harshness of the speech signals (Pozo and Young, 2006).

4.3 ASR and TTS

Whereas Voice Conversion converts the speech using one system, there exist two applications in the field of speech technology that can also be very useful when trying to make oneself more understandable. One application is Automatic Speech Recognition (ASR). Instead of writing down what a person wants to say, they can save time by simply saying it, after which the speech is transformed to text automatically. The second application is Text-to-Speech (TTS). One enters a text and this is transformed to speech. In the most simple case, a web application tool can be used to transform the text to synthetic speech. However, even though the newest synthetic voices can sound very natural, in this process the speaker identity is completely lost. A more complicated way is therefore to use a TTS system that is trained on the voice of the specific person. As introduced in the first chapter, the combination of an ASR and a TTS system yields very impressive results, to the point where the synthetic voice is almost identical to the original voice (Shor et al., 2019; Chorowski et al., 2019). Although hours of training data were needed to perform this well, the authors showed that 71% of the improvement in relative Word Error Rate (WER) scores resulted from only five minutes of training data. However, this approach still requires the complexity of training two different models, and ideally the same results would be achieved by using only one model. This is where statistical Voice Conversion (VC) models become interesting.

4.4 VC methods

In recent years many kinds of statistical VC methods have been proposed. These methods can be categorized in various ways, but one way to divide them is by whether the data set is parallel or non-parallel. Especially in the beginning of the VC era, methods to improve the intelligibility of impaired speech were trained on data from a parallel data set.

A model that has been one of the most popular VC methods over the years is the Gaussian Mixture Model (GMM). A GMM is a probabilistic model that assumes that all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. A Gaussian distribution, also known as the normal distribution, has a bell-shaped curve and assumes that there is an equal number of measurements above and below the mean value. Among others, Doi et al. (2010), Nakamura et al. (2012) and Lachhab et al. (2015) used this method to improve the intelligibility of alaryngeal speech. Regardless of the popularity of this method, Othmane, Martino, and Ouni (2019) show that deep neural networks (DNNs) outperform GMMs, even more so in combination with a time-dilated Fourier cepstra algorithm.

To improve esophageal speech in Spanish, Serrano et al. (2018) proposed a parallel VC method using a Long Short-Term Memory (LSTM) network. They trained the model on 100 sentences produced by one esophageal speaker paired with 100 sentences produced by a non-impaired speaker. This method reduced the Word Error Rate (WER) from 56 percent for the original esophageal speech to 41 percent for the converted speech. However, a preference test showed that listeners still preferred to listen to the original speech rather than the generated speech.

Other parallel VC approaches are methods such as Non-Negative Matrix Factorization (Aihara et al., 2013), One-to-Many Eigenvoice Conversion (Doi et al., 2014), a restricted Boltzmann machine-based probabilistic model (Nakamura et al., 2012) and different kinds of neural networks (Othmane, Martino, and Ouni, 2019; Doi et al., 2014).

More recently, approaches were introduced that are able to convert impaired speech to more intelligible speech using non-parallel data. In this approach the linguistic content of the source speaker is unrelated to that of the target speaker, which makes it possible to use a larger data set of target speech and a smaller set of source speech. Not only does a model usually train better on a larger data set, it is also easier to obtain data from healthy speakers than from impaired speakers, since it takes impaired speakers more effort to speak and the healthy speech can be taken from an already existing data set. When using non-parallel data, any combination of impaired and healthy speech can be used. This also removes the risk of making errors in the time alignment between the source and target speech signals. Since a GMM performing VC is a supervised learning model and therefore requires a parallel data set, this method cannot be used in this approach. However, some of the VC methods used in parallel approaches have also been used in a non-parallel approach, such as Non-Negative Matrix Factorization (Aihara et al., 2013) and neural networks.

In a later study by Serrano et al. (2019), the performance of the parallel VC model described above (Serrano et al., 2018) was compared to a non-parallel VC method. The non-parallel approach is a Phonetic Posteriorgram (PPG) based model consisting of two components: an ASR system, used to obtain the PPGs of the input source signals, and a Bidirectional LSTM (BLSTM) network, trained to predict the corresponding acoustic parameters of the target signals from the obtained PPG vectors. It appeared that for this non-parallel approach the WER was not reduced. However, both methods had the same reduction in spectral distance.

Chen, Lee, and Tsao (2018) proposed the use of a Generative Adversarial Network (GAN) as a VC method, a framework in which neural networks are trained against each other. More specifically, two models, a generator G and a discriminator D, are trained simultaneously and function as adversaries. G tries to generate new data that is similar to the target data. D estimates the probability that an input sample is real speech data or is generated by G. The model improves if G is able to maximize the probability of D making an error (introduced by Goodfellow et al. (2014)). See the next chapter for more details about this work. According to the authors, this approach is unique because it is both non-parallel and able to preserve the speaker’s characteristics. The greatest advantage of a non-parallel approach compared to a parallel approach when aiming to improve the intelligibility of impaired speech is that it requires a much smaller set of impaired speech. Since speakers with impaired speech usually have more difficulty talking, they would benefit from a model for which they have to provide much less speech data than for a parallel model. Coming back to the second point of why this approach is unique, the authors mention that other non-parallel VC methods were mostly focused on speaker identity, meaning that the speech is directly transformed from one speaker to that of another, resulting in the loss of the characteristics of the speaker. Since it is crucial in the task of improving the intelligibility of speech to preserve these characteristics, so that only the quality of the speech is transformed, the authors created this model. As can be read in the next chapter, the model parameter values are learned from the mel spectrograms of the speech signals and not from linguistic features of the signal itself. Because of this approach, the model can be applied to languages other than Mandarin, which in the current study is Spanish.

4.5 Summary

Approaches to improve the intelligibility of impaired speech changed from speech therapy techniques, to acoustic methods, to machine learning methods. The current state-of-the-art model is a non-parallel VC method, improving the intelligibility of impaired speech while keeping the speaker’s characteristics.


Chapter 5

The Model

We adopted the voice conversion method proposed by Chen, Lee, and Tsao (2018). Their approach is a Generative Adversarial Network (GAN) consisting of three models: a generator G, a discriminator D, and a controller C.

The process during the training phase is as follows: with impaired speech as input, the controller outputs a code which is taken as the input of the generator. The generator then generates normal speech based on this input code and tries to fool the discriminator. While the discriminator learns to judge whether the output of the generator is real speech or generated speech, the controller learns to generate code that makes the generator output unimpaired speech which has the same linguistic content and the same speaker characteristics as the impaired speech, thus minimizing their high-level differences. Then finally, during the conversion phase, the controller encodes impaired utterances for the generator, which in turn converts them to normal speech. For a schematic overview of the model see figure 5.1. The training data consists of a large data set of unimpaired utterances T, which is used to train D and G, and a much smaller data set of impaired utterances S, which is used to train C. The utterances of the two data sets are not related.

FIGURE 5.1: Schematic overview of the model (Chen, Lee, and Tsao, 2018)

In mathematical terms, the speech from unimpaired subjects $T$ is denoted as $\{x^t_i\}_{i=1}^{N}$, where $x^t_i$ is an acoustic feature from the unimpaired speech and $N$ the number of audio segments. The impaired speech $S$ is denoted as $\{x^s_i\}_{i=1}^{N'}$, where $N'$ is the number of audio segments. As mentioned before, the data set for $S$ is a lot smaller than the data set for $T$: $N' \ll N$. Equations 5.1, 5.2, 5.3 and 5.4 represent the statistical interaction between the models, where $\sum$ denotes a sum and $\mathbb{E}$ the expected value (weighted average). The generator $G$ (5.2) learns to generate audio segments $\tilde{x}$ that yield a score of 1 from $D$, given a vector $c$, $\tilde{x} = G(c)$, which is the output of $C$ and has a distribution $P_c(c)$. The discriminator $D$ (5.1) learns to assign unimpaired speech $x^t$ a score of 1 and generated audio segments $\tilde{x}$ a score of 0; it tries to distinguish $x^t \sim T$ from $\tilde{x} \sim G(c)$ while $G$ is attempting to fool it. The controller $C$ (5.3) learns to make the impaired speech $x^s$ and the corresponding output of $G$, $G(C(x^s))$, as close as possible, by choosing (and outputting) the condition $c$ that causes $G$ to generate speech similar to $x^s$. The objective is to generate speech that sounds as unimpaired as $x^t$, but contains the speaker characteristics of $x^s$. Since the only objective of the controller is to find the optimal value of $c$, a smaller set of impaired training data can be used. The distance measure $L(x, x')$ (5.4) evaluates the difference between two audio segments and calculates the perceptual loss, where $D_l(x)$ denotes the output of the $l$-th layer of $D$ given input $x$. While $G$ and $D$ are trained simultaneously and updated continuously, $C$ is updated by minimizing the perceptual loss (5.4).

$$\mathcal{L}_D = \mathbb{E}_{x^t \sim T}\big[(D(x^t) - 1)^2\big] + \mathbb{E}_{c \sim P_c(c),\, \tilde{x} \sim G(c)}\big[(D(\tilde{x}))^2\big] \qquad (5.1)$$

$$\mathcal{L}_G = \mathbb{E}_{c \sim P_c(c),\, \tilde{x} \sim G(c)}\big[(D(\tilde{x}) - 1)^2\big] \qquad (5.2)$$

$$\mathcal{L}_C = \mathbb{E}_{x^s \sim S}\big[L(G(C(x^s)), x^s)\big] \qquad (5.3)$$

$$L(x, x') = \sum_{l} 2^{-2l}\, \big|D_l(x) - D_l(x')\big|_1 \qquad (5.4)$$
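A minimal TensorFlow sketch of these objectives is given below. It assumes D, G and C are Keras models and that the intermediate outputs of D are exposed as a list of callables (d_layers); it illustrates equations 5.1-5.4 and is not the authors' implementation.

```python
# Illustrative TensorFlow 2 versions of the objectives in equations 5.1-5.4.
import tensorflow as tf

def d_loss(d, x_t, x_fake):
    # Eq. 5.1: least-squares discriminator loss (real -> 1, generated -> 0)
    return tf.reduce_mean((d(x_t) - 1.0) ** 2) + tf.reduce_mean(d(x_fake) ** 2)

def g_loss(d, x_fake):
    # Eq. 5.2: the generator tries to make D output 1 for generated segments
    return tf.reduce_mean((d(x_fake) - 1.0) ** 2)

def perceptual_loss(d_layers, x, x_prime):
    # Eq. 5.4: weighted L1 distance between the layer-wise activations of D
    loss = 0.0
    for l, layer_fn in enumerate(d_layers, start=1):
        loss += 2.0 ** (-2 * l) * tf.reduce_sum(tf.abs(layer_fn(x) - layer_fn(x_prime)))
    return loss

def c_loss(g, c, d_layers, x_s):
    # Eq. 5.3: the controller minimizes the perceptual distance between the
    # impaired input and the speech generated from its code
    return perceptual_loss(d_layers, g(c(x_s)), x_s)
```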

During their experiment, Chen, Lee, and Tsao (2018) trained the model with 132 utterances of impaired speech spoken by one male Mandarin speaker and 20,000 unimpaired utterances in Mandarin spoken by 100 males, 200 different sentences each. The impaired speaker suffers from a serious articulatory injury as a result of the surgical removal of parts of the articulators.

In the conversion phase, with impaired speech as input, the model outputs the corresponding converted speech signal as a raw waveform. The model also performs Automatic Speech Recognition (ASR); this ASR component transforms both the impaired input and the output speech signals generated by G into text. Apart from these texts and the generated audio waveforms, the model outputs the spectrograms of both signals as well. For the entire test set, the model also outputs a graph with the global variances (GV) of the unimpaired speech, the impaired speech and the generated speech. Figure 5.2 shows these values for their test results. As they point out in the paper, the GV of the impaired speech is quite different from that of unimpaired speech, whereas the GV of the generated speech is similar. Mean Opinion Score (MOS) analysis of the results shows that, in terms of how clear the speech is in articulation compared to unimpaired speech, the score increased from 52.6% for the impaired speech to 59.4% for the generated speech.
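Global variance is commonly computed as the per-dimension variance of the spectral features over all frames of a set of utterances; the sketch below assumes that definition (the exact computation used by the authors is not specified).

```python
# Sketch of a global variance (GV) curve over mel-spectrogram frames.
import numpy as np

def global_variance(mel_spectrograms):
    """mel_spectrograms: list of arrays with shape (n_mels, frames)."""
    frames = np.concatenate(mel_spectrograms, axis=1)   # stack frames of all utterances
    return np.var(frames, axis=1)                       # one GV value per mel band
```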

Chen, Lee, and Tsao (2018) compared their approach with two baselines: a Conditional GAN (cGAN) method and a CycleGAN method. They used the cGAN method proposed by Isola et al. (2016), which requires a parallel training data set.


FIGURE 5.2: Global variances of data from original experiment by Chen, Lee, and Tsao (2018)

The network consists of a discriminator and a conditional generator, which takes impaired speech as input and outputs unimpaired speech. Chen, Lee, and Tsao (2018) created parallel data to their unimpaired data set, by recording the same utterances from a male unimpaired speaker. For the CycleGAN method they used the one proposed by Zhu et al. (2017). This method does not require a parallel data set and learns to transform an object from one domain to another domain, namely that of speaker identity. Chen, Lee, and Tsao (2018) used their speech data as training data, as they did in their own approach. Since this method directly transforms the speech of one speaker to that of another speaker, the speaker characteristics are not preserved in the generated speech. The results obtained by Chen, Lee, and Tsao (2018) show that their non-parallel, speaker characteristics preserving approach outperforms both the cGAN and the CycleGAN method, when evaluating the Mean Opinion Scores (MOS) of the observed similarity of both the content and the speaker, and the observed articulation in the generated speech (see figure 5.3).

FIGURE 5.3: Mean Opinion Score (MOS) comparison between the approach proposed by Chen, Lee, and Tsao (2018) and two baselines, one parallel (cGAN) and another non-speaker-characteristics-preserving (CycleGAN)


5.1 Summary

The adopted model is a Generative Adversarial Network that improves the intelligibility of impaired speech. Compared to other VC methods with the same goal, this approach is unique in that it is non-parallel, meaning that it is trained on a small data set of impaired speech and a large data set of unimpaired speech, unrelated in linguistic content, while preserving the speaker characteristics.


Chapter 6

Data and Processing

To train the Discriminator and the Generator, a large set of unimpaired speech was used as target speech. To train the Controller, smaller sets of impaired speech were used as source speech. The linguistic content of the source data is unrelated to the linguistic content of the target data, which means that the data set of source and target speech is non-parallel. This chapter describes the characteristics of the data set and the data processing procedure.

6.1 Target speech

The target data consists of a data set of unimpaired speech containing 11,168 utterances spoken by different Spanish male and female speakers, extracted from a publicly available speech corpus called M-AILABS. The audio files range in length from 1 to 15 seconds. Each of the speech signals contains a segment of speech spoken when reading a book (El Quijote) out loud. The entire data set was used as training data. Table 6.1 lists the characteristics of the data set regarding the duration and number of recordings, per speaker sex and in total. See appendix A for this information per speaker.

TABLE 6.1: Target speech data set information per speaker sex and total

                            Female     Male       Total
Number of speakers          16         9          25
Number of recordings        7126       4042       11,168
Total duration              13:39:10   07:39:40   21:18:50
Average speaker duration    00:51:12   00:51:04   00:51:09
Average recording duration  00:00:07   00:00:07   00:00:07


6.2 Source speech

6.2.1 Esophageal speech

The data set of esophageal speech consists of three times 100 sentences spoken by three different Spanish laryngectomees. Speaker one (female) and speaker three (male) are considered proficient speakers of esophageal speech and speaker two (female) is a less proficient speaker. The data set was made available by the AhoLab Signal Processing Laboratory of the University of the Basque Country, where the speech was recorded in a semi-professional recording room and originally used for another research project (Erro et al., 2015). The audio files range in length from 3 to 12 seconds. 85 randomly chosen sentences were used as training data and the other 15 sentences were used as test data.

6.2.2 Dysarthric speech

The data set of dysarthric speech consists of 100 sentences uttered by a Spanish female suffering from amyotrophic lateral sclerosis (ALS). The origin of the data is the same as for the esophageal speech, with the only difference that the recordings were made by the speaker herself at home. The linguistic content of these sentences is the same as that of the sentences in the data set of esophageal speech. The audio files range in length from 3 to 25 seconds. The same 85 sentences that formed the training set for the esophageal speech were used as training data and the other 15 sentences as test data.

TABLE 6.2: Source speech data set information per speech type and speaker

                       Dysarthric   Esophageal
                                    Speaker 1    Speaker 2         Speaker 3
Speaker sex            Female       Female       Female            Male
Proficiency level      –            Proficient   Less proficient   Proficient
Number of recordings   100          100          100               100
Total duration         00:21:42     00:13:05     00:21:18          00:12:29
Average duration       00:00:13     00:00:08     00:00:13          00:00:07

6.3 Processing procedure

A separate audio processing script created by Chen, Lee, and Tsao (2018) processes and transforms the raw speech signals so they can be used as input data for the model. This processing procedure consists of the following steps:

1. Normalization: the relative intensity is increased or decreased to match a reference level. In more technical terms, this normalization process consists of applying the difference between the target and the actual dBFS of the sound to the speech signal, where dBFS is a measurement of amplitude levels and represents the decibels relative to full scale.

2. Silence trimming: fragments of the signal that have a lower intensity than the threshold value are trimmed.

3. Conversion to 16kHz sample rate.

4. Short-Time Fourier Transform (STFT): a spectrogram, representing the signal in the time-frequency domain, is created, using a Hanning window length of 50 milliseconds, a hop length of 12.5 milliseconds and a Fast Fourier Transform (FFT) size of 1024.

5. Transformation to mel-spectrogram: a mel-spectrogram is constructed using 128 mel frequency bands with a frequency range of 55Hz to 7600Hz.

6. Conversion to decibel scale, by performing logarithmic transformation.

After conversion, the mel-spectrograms are converted back to raw speech signals by first re-scaling the output of the model and multiplying it by the pseudo-inverse of the mel filter bank, in order to recover the linear spectrogram. Then, the waveform is reconstructed by estimating the phase using Griffin and Lim’s algorithm (Griffin and Lim, 1984).
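A minimal sketch of steps 2 to 6 and of the inversion described above, written with librosa; the parameter names and helper functions are assumptions for illustration, not the authors' processing script.

```python
# Illustrative forward (wav -> mel in dB) and inverse (mel in dB -> wav) processing.
import numpy as np
import librosa

SR, N_FFT, WIN, HOP = 16000, 1024, 800, 200   # 50 ms Hanning window, 12.5 ms hop

def wav_to_mel(path, top_db=25):
    y, _ = librosa.load(path, sr=SR)                    # step 3: resample to 16 kHz
    y, _ = librosa.effects.trim(y, top_db=top_db)       # step 2: silence trimming
    stft = librosa.stft(y, n_fft=N_FFT, hop_length=HOP,
                        win_length=WIN, window="hann")  # step 4: STFT
    mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT,
                                 n_mels=128, fmin=55, fmax=7600)
    mel = mel_fb @ np.abs(stft)                         # step 5: mel spectrogram
    return librosa.amplitude_to_db(mel)                 # step 6: decibel scale

def mel_to_wav(mel_db):
    mel = librosa.db_to_amplitude(mel_db)
    mel_fb = librosa.filters.mel(sr=SR, n_fft=N_FFT,
                                 n_mels=128, fmin=55, fmax=7600)
    linear = np.maximum(np.linalg.pinv(mel_fb) @ mel, 0.0)  # pseudo-inverse of the mel filter bank
    return librosa.griffinlim(linear, hop_length=HOP, win_length=WIN)  # phase estimation
```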

Two adaptations to this procedure were essential. When analyzing the input data after processing, it appeared that in the original signals the amplitudes were cut off on both sides and some non-silent fragments were deleted (see figure 6.1), resulting in audio signals that were not recognizable as speech. Therefore the following adjustments were made to the processing procedure (roughly sketched in the code example after this list):

• In the normalization process, the target dBFS value was changed from -20 to -30. It appeared that the average dBFS was -18 for unimpaired speech, -21 for dysarthric speech and -33 for esophageal speech [2]. However, -30 was the optimal target value for all speech types: a higher value resulted in clipping and a lower value did not have an effect.

• In the trimming process, the silence threshold (10 dB below the signal peak power) was lowered to 15 dB for unimpaired speech, and 25 dB for esophageal and dysarthric speech. This resulted in trimmed signals, as intended by the authors, but without losing non-silent fragments.
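The adjusted normalization step could look roughly as follows; pydub is an assumption here (the thesis reports dBFS values but does not name the library), and the adjusted trimming thresholds correspond to the top_db argument in the processing sketch above.

```python
# Level normalization to a target dBFS (assumed pydub-based implementation).
from pydub import AudioSegment

def normalize_to_dbfs(path, target_dbfs=-30.0):
    seg = AudioSegment.from_file(path)
    return seg.apply_gain(target_dbfs - seg.dBFS)   # shift the level to the target dBFS
```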

Figure 6.1 shows the waveforms of an original fragment of dysarthric and esophageal speech, and waveforms after using the original and adapted audio processing procedure.

6.4 Summary

The data set used as target speech consists of more than 11,000 fragments of unimpaired speech and the data used as source speech consists of 100 fragments of impaired speech (dysarthric or esophageal). Adaptations to the original audio processing procedure were made to ensure no information in the input data was lost.

[2] This is the average of the three speakers of esophageal speech: speaker 1 (-34), speaker 2 (-33) and speaker 3 (-32).


FIGURE 6.1: Waveforms of dysarthric and esophageal speech signals before processing and after original and adapted processing (panels: dysarthric / esophageal speech with no processing, original processing, and adapted processing)


Chapter 7

Experimental Design

7.1 Work environment

The experiments were executed on a TITAN GPU, made available by the University of the Basque Country (UPV/EHU), in a Python 3 [1] environment using Anaconda [2]. TensorFlow 2.0 [3] was used to implement the deep neural network architecture. The original script of the model was written to be executed using an older version of TensorFlow. However, in order to keep the model as up-to-date as possible, we converted the script to run under TensorFlow 2.0 by means of the tf_upgrade_v2 tool. Additionally used modules were librosa [4], matplotlib [5], SpeechRecognition [6], tqdm [7] and PocketSphinx [8]. A Spanish language model was installed to perform the built-in PocketSphinx Automatic Speech Recognition (ASR) part of the experiments.

7.2 Model configuration

To maintain as much of the original architecture of the model as possible, we used the same model configuration as Chen, Lee, and Tsao (2018). The GAN is trained using an Adam optimizer, with a learning rate of 0.0002 for C and G and 0.0001 for D, and a β1 value of 0.5 and β2 value of 0.9 for all three models. The activation function used for G and C was the Exponential Linear Unit (ELU) function, and for D the Rectified Linear Unit (ReLU) function. The batch size was 64 and the dropout rate was 0.9 for C and 0.8 for D.
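In TensorFlow 2, this configuration corresponds roughly to the sketch below; the activation layers shown are only placeholders, since the full network architecture is not reproduced here.

```python
# Optimizer and activation settings as reported above (illustrative only).
import tensorflow as tf

opt_g = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.9)
opt_c = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.9)
opt_d = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5, beta_2=0.9)

BATCH_SIZE = 64
DROPOUT_C, DROPOUT_D = 0.9, 0.8          # dropout rates for C and D
act_g_and_c = tf.keras.layers.ELU()      # activation used in G and C
act_d = tf.keras.layers.ReLU()           # activation used in D
```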

7.3 Procedure

The experimental procedure of the current study consisted of four consecutive phases, which will be described below.

[1] https://www.python.org/download/releases/3.0/
[2] https://www.anaconda.com/
[3] https://www.tensorflow.org
[4] https://github.com/librosa/librosa/tree/main/librosa
[5] https://matplotlib.org/
[6] https://pypi.org/project/SpeechRecognition/
[7] https://pypi.org/project/tqdm/
[8] https://pypi.org/project/pocketsphinx/


7.3.1 Phase 1: Understanding the model

When adopting a state-of-the-art model it is first necessary to understand the model and know how to run it correctly. The authors of the model (Chen, Lee, and Tsao, 2018) gave access to all the files that were needed to run the model by means of a GitHub repository. Since the README.md file only contains a link to the research paper, most of the knowledge was gained by studying the code closely and reading the information provided with the help option of the parameters. It then became a process of trial and error, reading and solving the errors that appeared when running the model.

7.3.2 Phase 2: Model development

Since the model outputs five validation samples per epoch, taken from a pre-defined validation set extracted from the input training data, the effect of changing hyper-parameter values on the model performance can be checked. Since the data set of source speech in this study consists of 85 speech fragments as opposed to 132 in the original study, the model performance was analyzed after changing the batch size from 64 to 32; however, this change did not have an effect. Also, since clipping appeared to be an issue in the processing procedure, the clipping value was enlarged from 3.0 to higher values, namely 3.5, 4.0 and 5.0, but changing this hyper-parameter value did not have an effect either. More than 50 different combinations of hyper-parameter values and processing parameter values were tried to understand the model response to these parameters.

In the end, the architecture of the original model was maintained, and adaptations were made only to the processing procedure (see section 6.3) and the training time (see next section).

7.3.3 Phase 3: Training

Four experiments were performed, differing in the type of source data used to train the Controller. In each experiment, the Controller was trained on a different subset of source training data (see table 6.2 in chapter 6), namely speech from one speaker of dysarthric speech and speech from three speakers of esophageal speech, differing in sex and proficiency level. Figure 7.1 shows the perceptual losses for the validation data in each experiment. The perceptual loss between the source and the target sentence is represented by the output of the distance measure, which is updated after every batch. The average perceptual loss of every epoch is used for evaluation; the lower the perceptual loss, the better. The epochs corresponding to the minimum values of the losses are marked in the graph: 5, 9, and 37 for the esophageal speakers and 66 for the dysarthric speaker. The loss progressions indicate the best model performance for the dysarthric speaker, followed by esophageal speaker 1 (the female proficient speaker), followed by the other two esophageal speakers. This was confirmed by an informal subjective evaluation by the author and resulted in the decision to consider only the dysarthric speaker and esophageal speaker 1 for evaluation in the next phase.



FIGURE 7.1: Perceptual loss over 100 epochs for models trained on different types of training data: dysarthric speech, and esophageal speech from three different speakers

Regarding the number of epochs that the models were trained for in the experiments, Chen, Lee, and Tsao (2018) did not specify in their paper the number of epochs they used for their experiment. The default value of the hyper-parameter representing the number of epochs is 800; however, the models in the current experiment were trained for 100 epochs, for two reasons.

First of all, training the model for 800 epochs took more than four days. Since more than 50 training rounds were needed in order to understand how the model could best be used with the training data, the time span of this study did not allow for training the model for 800 epochs each time.

Secondly, since the perceptual loss started to increase again before 100 epochs were reached, there was no motive to train the model for a longer time.

7.3.4 Phase 4: Testing and evaluation

During training, the model parameter values were saved after every fifth epoch, so for the conversion phase the saved models closest to the optimal number of epochs, i.e. the number with the lowest perceptual loss, were used. This means that the model trained for 35 epochs was used for esophageal speech and the model trained for 65 epochs for dysarthric speech, to convert the fifteen unseen sentences of the test set. The results of each set were then compared and evaluated by means of several evaluation measures.

Since the linguistic content of the source and target speech is unrelated, there is no direct objective evaluation metric available to measure the accuracy of the converted speech compared to the original speech. To obtain an intelligibility measure that is as objective as possible, we evaluated the results by means of the STOI index and Word Error Rates. To objectively evaluate the speech quality, the mel-frequency cepstra and fundamental frequency (F0) curves were compared. The global variances were also analyzed. A detailed explanation of each measure is given below.


To test whether values were significantly different or not, we performed statistical analyses in R. To compare the values between original and converted speech signals, we used the Wilcoxon signed-rank test, which is a non-parametric alternative to the paired t-test. To compare the values between the two types of speech, we used the non-parametric alternative to the Welch two-sample t-test, namely the Wilcoxon rank-sum test. Non-parametric tests were chosen due to the small sample size. The test set consisted of 15 speech fragments; however, in the evaluation process one fragment was left out because it appeared to be wrongly recorded in the set of dysarthric speech.
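Although the analyses were carried out in R, the same two tests can be illustrated in Python with SciPy (a hedged sketch with made-up example values; SciPy is not listed among the modules used in this study):

```python
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

# Paired comparison (original vs. converted, same sentences):
# Wilcoxon signed-rank test, the non-parametric counterpart of the paired t-test.
stoi_original = np.array([0.21, 0.19, 0.24, 0.23, 0.20])   # illustrative values only
stoi_converted = np.array([0.20, 0.18, 0.22, 0.21, 0.19])
print(wilcoxon(stoi_original, stoi_converted))

# Unpaired comparison (esophageal vs. dysarthric):
# Wilcoxon rank-sum / Mann-Whitney U test, the non-parametric counterpart
# of the Welch two-sample t-test.
stoi_esophageal = np.array([0.23, 0.21, 0.25, 0.22])        # illustrative values only
stoi_dysarthric = np.array([0.14, 0.12, 0.15, 0.13])
print(mannwhitneyu(stoi_esophageal, stoi_dysarthric))
```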

STOI index The STOI (Short-Time Objective Intelligibility) index is an objective intelligibility measure and has a monotonic relation with subjective speech intelligibility, where a higher value denotes better intelligible speech (Taal et al., 2010; Taal et al., 2011). For this measure the intelligibility of a speech signal is based on differences with a reference signal in the time-frequency domain. If there is no parallel reference signal of unimpaired speech available, as is the case in this study, a computer-generated signal based on healthy speech (a synthetic signal) can be used as the reference signal (Janbakhshi, Kodrasi, and Bourlard, 2019). To compare the intelligibility of the original impaired speech with the converted speech, both are compared to synthetic speech, in order to determine which kind of speech has a higher STOI value. The synthetic speech was time-aligned with the original test sentences of each set of the source data, and trimmed and normalized as in the processing procedure described in section 6.3.
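The thesis does not state which STOI implementation was used; as an assumption, the third-party pystoi package provides one, and a minimal sketch could look like this (file names are hypothetical):

```python
import librosa
from pystoi import stoi  # assumption: the pystoi package is used to compute STOI

# Both the original and the converted signal are scored against the
# time-aligned synthetic reference.
reference, fs = librosa.load("synthetic_sentence_01.wav", sr=16000)
degraded, _ = librosa.load("original_or_converted_sentence_01.wav", sr=16000)

# Truncate to a common length before scoring.
n = min(len(reference), len(degraded))
score = stoi(reference[:n], degraded[:n], fs, extended=False)
print(f"STOI = {score:.3f}")
```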

Word Error Rate We used the original impaired test utterances on the one hand and the corresponding generated output of the model on the other hand as input for an Automatic Speech Recognition (ASR) evaluation system, in order to compare the Word Error Rates (WER) of the outputs (hypotheses) with the annotation of the original test sentences (reference). The WER value is derived from the Levenshtein distance, and is computed by dividing the sum of the number of substitutions (S), deletions (D) and insertions (I) by the number of words in the reference sentence (N) and multiplying the result by 100 to represent the value as a percentage:

\mathrm{WER}\ (\text{in }\%) = \frac{S + D + I}{N} \times 100 \qquad (7.1)
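As an illustrative sketch (not the evaluation script used in this study), equation 7.1 can be computed from a word-level edit distance as follows:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent, as in equation 7.1; S, D and I follow from
    the word-level Levenshtein alignment between reference and hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100


print(wer("el perro corre en el parque", "el perro con en parque"))  # about 33.3%
```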

Global Variance The global variance represents the standard deviation of the parameter values, which can be compared for the impaired speech, the generated speech and the unimpaired speech. The graph representing these values for the entire test set is provided with the test samples.

Mel-Cepstral Distortion The mel-cepstral distortion (MCD) is an objective measure of the difference between two sequences of mel-frequency cepstra (MFC). An MFC is a representation of the short-term power spectrum of a speech signal, based on a linear cosine transform of a log power spectrum on a non-linear mel scale of frequency, and is made up of mel-frequency cepstrum coefficients (MFCCs). The advantage of representing a signal in this way is that the filter response can be separated from the excitation: the lower coefficients in the quefrency domain correspond to the filter response, whereas the higher coefficients correspond to the excitation. To calculate the MCD, only the lower coefficients are considered, meaning that the filter responses, i.e. the spectral envelopes, of two signals are compared. The calculation of the MCD measure is represented by equation 7.2, where Csyn refers to the MFCCs of the synthesized signal and Chyp to the MFCCs of the original or converted signal. The MFCCs of the signals are extracted by means of the Ahocoder vocoder (Erro et al., 2014). The MCD between the synthesized and the original signals is compared with the MCD between the synthesized and converted signals.

\mathrm{MCD} = \mathrm{mean}\left( \frac{10}{\ln 10} \sqrt{ 2 \left( C_{\mathrm{syn}} - C_{\mathrm{hyp}} \right)^2 } \right) \qquad (7.2)
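A hedged NumPy sketch of equation 7.2 is given below; it assumes the two MFCC sequences have already been extracted (in this study with Ahocoder) and time-aligned, and it excludes the 0th (energy) coefficient, which is one common convention rather than a detail stated in the text:

```python
import numpy as np

def mel_cepstral_distortion(c_syn: np.ndarray, c_hyp: np.ndarray) -> float:
    """MCD between two aligned MFCC matrices of shape (frames, coefficients)."""
    diff = c_syn[:, 1:] - c_hyp[:, 1:]            # drop the energy coefficient
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))              # mean over frames, as in eq. 7.2
```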

F0 distance The distance between the F0 curves of two signals is measured by calculating the Root Mean Squared Error (RMSE) between the F0 values of these signals. As with the MFCCs, the F0 values are extracted by means of the Ahocoder vocoder. The RMSE between the synthesized and the original signals is compared with the RMSE between the synthesized and converted signals.
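For illustration only (the thesis does not specify how unvoiced frames were treated), the F0 distance can be sketched as an RMSE over frames in which both contours are voiced:

```python
import numpy as np

def f0_rmse(f0_ref: np.ndarray, f0_hyp: np.ndarray) -> float:
    """RMSE between two aligned F0 contours, skipping frames that are unvoiced
    (F0 == 0) in either signal; this handling of unvoiced frames is an assumption."""
    voiced = (f0_ref > 0) & (f0_hyp > 0)
    return float(np.sqrt(np.mean((f0_ref[voiced] - f0_hyp[voiced]) ** 2)))
```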

7.4 Summary

After understanding, developing and training the model, a small set of impaired speech fragments was converted. These converted signals were then analyzed and evaluated with respect to speech intelligibility and quality. The original model architecture was maintained and the model performance during development was assessed on the basis of the perceptual loss of the validation samples. It turned out that the optimal number of training epochs differed per source speech type.



Chapter 8

Results

This chapter provides an overview of the values of the objective measures described in section 7.3.4 of the previous chapter, presented per evaluation category: speech intelligibility and speech quality.

8.1 Speech intelligibility

As described in the previous chapter, the STOI index is an objective evaluation measure that represents the intelligibility of a speech signal and is denoted by a value of at most one, where a higher value means higher intelligibility. The mean STOI index of the converted speech signals is statistically the same as that of the original signals, for both types of speech (see figure 8.1). For both the original and converted signals, esophageal speech has a significantly higher mean STOI index (x̄_or = 0.226, x̄_con = 0.208) than dysarthric speech (x̄_or = 0.139, x̄_con = 0.121), with p_or = 0.0454 and p_con = 0.002997.

FIGURE 8.1: Mean and individual STOI index values of the original and converted speech signals per speech type

The WER represents the percentage of errors made in Automatic Speech Recognition compared to the total number of words in a text. The lower the WER value, the higher the similarity between the reference text and the hypothesis (recognized) text. The figures in 8.2 show the individual and mean WER values of the recognized texts of the original and converted test signals per speech type. The mean WER values of the recognized texts of the converted signals (x̄_DS = 163.3%, x̄_ES = 112.2%) are significantly higher than those of the original signals (x̄_DS = 117.0%, x̄_ES = 98.1%), p_DS < 0.001. A statistical analysis also shows that for both types of speech signals, original and converted, the WER values of the recognized texts of the esophageal signals are lower, p_or = 0.04079 and p_con < 0.001.

FIGURE 8.2: Mean and individual WER values of the original and converted speech signals per speech type

8.2 Speech quality

The difference between the mel-frequency cepstra of a reference and a hypothesis signal is represented by means of the mel-cepstral distortion (MCD). Figure 8.3 shows the MCD values for both types of speech, where each speech signal type is compared with a synthetic reference. There is a statistically significant difference between the mean MCD values of the original and converted speech signals for both speech types. However, for dysarthric speech the MCD of the converted signals is lower (x̄_or = 12.8, x̄_con = 11.8, p < 0.001), whereas for esophageal speech it is higher (x̄_or = 12.3, x̄_con = 12.7, p = 0.03528).

FIGURE 8.3: Mean and individual MCD values of the original and converted speech signals per speech type

The differences between the fundamental frequency (F0) curves of two signals can be represented by means of the RMSE. The figures in 8.4 show these values for the different speech signal types per speech type, compared with a synthetic reference. A statistical analysis of these values shows that the RMSE between the synthetic and the converted esophageal speech signals is lower than between the synthetic and the original esophageal speech signals (x̄_or = 146.17, x̄_con = 98.06, p < 0.001). For dysarthric speech there is no significant difference.

FIGURE 8.4: Mean and individual RMSE values of the original and converted speech signals per speech type

The global variances for both speech types are shown in figure 8.5. These graphs show that dysarthric speech has a higher variance than esophageal speech, but that the difference between the variance of the original and the converted speech is smaller.

FIGURE 8.5: Global variances per speech type (dysarthric speech and esophageal speech)

Spectrograms of the original and converted speech signals per speech type are shown in figure 8.6, and the waveforms and F0 curves in figure 8.7. They refer to one instance of the test set and are therefore not representative of the entire test set, but they provide a useful visualization of the differences between the original and converted speech signals.
