Deep neural networks for speech enhancement for communications

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Communication and Information Technology

Academic year 2019-2020

Supervisor: Prof. dr. ir. Nilesh Madhu
Counsellors: Alexander Bohlender, Yanjue Song

Student number: 01503169


Preface and acknowledgement

After playing around with machine learning by developing an AI for a board game and for natural language processing, it was interesting to work on the combination of machine learning and speech enhancement, indulging in my curiosity about machine learning and signal processing. Also, it was enjoyable to work on something with a clear, practical application: everyone has experience with being the recipient of a call where you can barely understand the person speaking due to background noise. Removing this background noise would make for a much better experience. This became abundantly clear during the current COVID-19 pandemic, where a lot of people had to start working from home and meet by videoconferencing, accompanied by frustrations about speech quality due to background noise or low-quality microphones.

The work done in this thesis would not have been possible without the advice and feedback from my thesis promotor prof. Nilesh Madhu and my counsellors Alexander Bohlender and Yanjue Song. I am very grateful for their input and guidance. The support and encouragement of my family and friends have also been essential, for which I am very thankful. My mother, my father, and my brother Olivier have made home a great environment to work in throughout my studies. I also would especially like to thank my fellow electrical engineering students who made the breaks way more interesting, and my roommate, Jarne Van den Herrewegen, for the pleasant company and putting up with my rambling about neural network architectures.

Guillaume Van Wassenhove June 2020


Permission for usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.

Guillaume Van Wassenhove June 2020


Deep neural networks for speech enhancement for communications

Guillaume Van Wassenhove

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering - main subject Communication and Information Technology

Academic year 2019-2020

Supervisor: Prof. dr. ir. Nilesh Madhu
Counsellors: Alexander Bohlender, Yanjue Song

Department of Electronics and Information Systems, Faculty of Engineering and Architecture, Ghent University

Abstract

Speech is one of the most used forms of communication. When having a conversation using, e.g. a mobile phone, the speech is degraded by background noise and limited-quality recording equipment, which can significantly impair speech quality. To cope with this, speech enhancement aims to suppress the noise.

The goal of this thesis is to develop and evaluate deep neural network-based speech enhancement systems that can suppress noise of various types and strengths. Deep neural networks have already been successfully applied for this purpose. In this thesis, one CNN-based and two RNN-based state-of-the-art network architectures are implemented and evaluated. Deep neural networks for speech enhancement are often trained using the mean square error loss function, which does not take the human auditory perception into account. It is shown that by using a perceptually motivated loss function based on PESQ, subjective speech enhancement performance can be improved. In related literature, often only the magnitude of the noisy speech signal is enhanced, while the noisy phase is left untouched. As enhancing the noisy phase can increase speech enhancement performance, different phase enhancement techniques are investigated in this thesis.


Deep neural networks for speech enhancement for communications

Guillaume Van Wassenhove

Supervisor(s): prof. dr. ir. Nilesh Madhu, Alexander Bohlender, Yanjue Song

Abstract—Speech is one of the most used forms of communication. When having a conversation using, e.g. a mobile phone, the speech is degraded by background noise and limited-quality recording equipment, which can significantly impair speech quality. To cope with this, speech enhancement aims to suppress the noise.

The goal of this thesis is to develop and evaluate deep neural network-based speech enhancement systems that can suppress noise of various types and strengths. Deep neural networks have already been successfully applied for this purpose. In this thesis, one CNN-based and two RNN-based state-of-the-art network architectures are implemented and evaluated. Deep neural networks for speech enhancement are often trained using the mean square error loss function, which does not take the human auditory perception into account. It is shown that by using a perceptually motivated loss function based on PESQ, subjective speech enhancement performance can be improved. In related literature, often only the magnitude of the noisy speech signal is enhanced, while the noisy phase is left untouched. As enhancing the noisy phase can increase speech enhancement performance, different phase enhancement techniques are investigated in this thesis.

Keywords—Speech enhancement, deep neural networks, phase enhancement

I. INTRODUCTION

In speech communication, background noise such as traffic noise or other people talking can significantly increase listener fatigue and make the speaker harder to understand. More and more people are teleconferencing, where low noise levels and excellent intelligibility are required. Besides, hearing aids and automatic speech recognition systems also benefit from less noise.

Speech enhancement aims to suppress background noise and enhance the speech component, so speech quality is improved, thus reducing listener fatigue and increasing intelligibility.

Traditional speech enhancement methods are based on statistical models, which assume the noise has certain statistics. These models therefore have difficulty 'tracking' the noise when it is non-stationary. Recently deep neural networks (DNNs) have also been applied for speech enhancement, with excellent results. However, DNNs are trained without taking human auditory perception into account. Also, often only the magnitude of the noisy signal is enhanced while keeping the noisy phase untouched.

In this thesis, a perceptually motivated loss function is used to improve perceptual speech enhancement performance, and several DNNs incorporating noisy phase enhancement are proposed and evaluated.

The speech enhancement scheme will be introduced in Section II. Section III will give an overview of the state-of-the-art speech enhancement methods. In Section IV, the speech enhancement techniques proposed in this thesis will be described. The experimental setup and the results will be discussed in Section V. Section VI concludes the work, including some notes on future work.

II. PROBLEM FORMULATION

As can be seen in Figure 1, the goal of speech enhancement is to obtain an estimate of the clean speech ŝ(k), based on the assumption that the noise is additive:

y(k) = s(k) + v(k)    (1)

with y(k) the noisy speech signal, v(k) the noise signal and k the time index.

Fig. 1. Speech enhancement scheme.

Most speech enhancement techniques apply the enhancement in the short-time Fourier transform domain, i.e. the signal is transformed by the short-time Fourier transform (STFT), enhanced, and transformed back to a waveform by the inverse STFT. Equation 1 then results in:

Y(µ, λ) = S(µ, λ) + V(µ, λ)    (2)

In this equation, µ is the frequency bin index and λ is the timeframe index.

III. STATE-OF-THE-ART SINGLE-CHANNEL SPEECH ENHANCEMENT

This section concerns speech enhancement using the signal from a single microphone.

A. Statistical models

Traditional methods are based on statistical models, which assume that the noise is stationary, that the Fourier coefficients of the signal are Gaussian random variables, and/or that the noise and the speech are independent. Some of these assumptions are made for the spectral subtraction technique [1], Wiener filtering [2] and the MMSE-LSA estimator [3]. These assumptions do not always hold, which leads to these models having difficulty in 'tracking' the noise if it is non-stationary.

The minimum mean square error log-spectral amplitude (MMSE-LSA) estimator uses a gain function G(µ, λ) to estimate the clean speech:

|Ŝ(µ, λ)| = G(µ, λ) |Y(µ, λ)|    (3)


The gain function is a function of the a priori SNR ξ(µ, λ) and the a posteriori SNR γ(µ, λ):

ξ(µ, λ) = σ²_S(µ, λ) / σ²_V(µ, λ)    (4)

γ(µ, λ) = |Y(µ, λ)|² / σ²_V(µ, λ)    (5)

with σ²_S(µ, λ) the variance of the speech and σ²_V(µ, λ) the variance of the noise. The variance of the noise is measured during non-speech parts [4]. The a priori SNR is estimated using a decision-directed approach based on the clean speech estimate Ŝ(µ, λ − 1) of the previous timeframe [5].

B. Deep neural networks

Deep neural networks have shown exceptional results in fields such as computer vision [6], and they have also been applied successfully to speech enhancement [7]. Like for statistical models, most deep neural networks for speech enhancement predict an STFT mask [8] or STFT coefficients [7], but waveform enhancement is also possible [9]. Deep neural networks are trained on a selection of noise types, causing them to generalise less well to different noise types than statistical models.

In the STFT domain there are multiple possibilities for the network output. Direct estimation of the denoised STFT or denoised spectrogram is possible:

Ŝ(µ, λ) = f(Y(µ, λ))    (6)

but masks are commonly used [10]:

Ŝ(µ, λ) = Y(µ, λ) f(Y(µ, λ))    (7)

with f(Y(µ, λ)) the output prediction for an input Y(µ, λ).

Three state-of-the-art neural networks are implemented as baselines.

B.1 CNN

This neural network is a timeframe-by-timeframe convolutional neural network (CNN) taking in the a priori and a posteriori SNR (defined in Equations 4 and 5) from a single timeframe. A 2D convolution is applied across the frequency dimension and the a priori/posteriori SNR dimension. The output of the convolution is run through three fully-connected feedforward layers, predicting an ideal ratio mask (IRM) [10] as the output:

IRM(µ, λ) = |S(µ, λ)|² / (|S(µ, λ)|² + |V(µ, λ)|²)    (8)

This model comprises about 10 million parameters.

B.2 EHNet

EHNet [11] consists of a 2D convolution layer followed by two bidirectional long short-term memory (LSTM) layers [12] and a fully-connected feedforward layer, directly estimating the clean speech spectrogram. Due to the bidirectional nature of the LSTM layers, this network takes in the whole spectrogram, i.e. all previous and future timeframes are required to predict the current timeframe. The 2D convolution is applied across the frequency and time dimensions.

This model comprises about 60 million parameters.

B.3 NSNet

NSNet [13] is made up of 3 gated recurrent unit (GRU) layers [14] followed by a fully-connected feedforward layer. A gain mask is predicted in the STFT domain.

While the CNN and EHNet are trained using the mean square error (MSE) loss function, the NSNet loss function weighs the MSE loss during active and non-active speech parts differently:

loss_speech = MSE(S_active, (G ⊙ S)_active)    (9)

loss_noise = MSE(0, G ⊙ Y)    (10)

loss = α · loss_speech + (1 − α) · loss_noise    (11)

where S_active denotes a matrix containing |S(µ, λ)|, with the subscript 'active' denoting the timeframes with speech activity, and ⊙ denotes the element-wise multiplication. By tuning the weighting α, a trade-off can be made between increased noise reduction with stronger target speech distortion and vice versa.

This model comprises about 1.2 million parameters.

IV. PROPOSED METHODS

A. Loss function for improved subjective performance

Most neural networks for speech enhancement are trained using the mean square error (MSE) loss function. However, the MSE does not take the human auditory perception into account. PESQ is a standardised and widely-recognised metric for speech quality evaluation. However, using PESQ as a loss function is not possible since it is non-differentiable.

A differentiable approximation of PESQ, called the perceptual metric for speech quality evaluation (PMSQE), was proposed as a loss function in [15]. In this thesis, this PMSQE loss was combined with the unchanged NSNet architecture [13] in order to improve the perceptual speech enhancement performance.

A problem that remains is that the PMSQE loss shows strong non-linearity with singular points, meaning it is not differentiable everywhere. This could cause problems in gradient descent (e.g. exploding gradients), causing the model not to learn. To alleviate this, the weighted loss function from NSNet is also added to the loss function:

loss = β · loss_PMSQE + (1 − β) · loss_NSNet    (12)

with β the weighting factor between the PMSQE loss and the NSNet loss. Grid searching suggests that the optimal β is 0.4.

B. Phase enhancement

Based on the assumption that enhancing the noisy phase does not increase speech quality [16], speech enhancement is applied to the STFT magnitude spectrum, and the estimated clean speech waveform is reconstructed using the noisy phase. However, it has been shown that for most speech enhancement models enhancing the phase does have a positive effect on speech quality [17]. Thus, it is expected that phase enhancement using deep neural networks can increase speech quality.

B.1 Joint mask estimation network (JMENet)

Contrary to the magnitude spectrogram, the phase spectrogram does not contain any structure. The real and imaginary parts of the complex spectrogram do contain structure, making them candidates for enhancement.

Fig. 2. Phase enhancement, joint mask estimation network architecture (JMENet). M denotes the number of frequency bins, and L denotes the number of timeframes.

Figure 2 shows a network with an encoder/decoder structure consisting of a 2D convolution and a 2D deconvolution across the frequency and time dimensions of the spectrogram, and three GRU layers in the middle. The network predicts a joint mask for the real and imaginary part of the spectrogram.

The loss function for training this network is the summation of the MSE loss of the real part and of the imaginary part.

B.2 Separate mask estimation network (SMENet)

Figure 3 shows the same encoder architecture as JMENet, but after the GRU layers there are two different 'heads': one predicting a mask for the real part and the other a mask for the imaginary part of the spectrogram. Each 'head' consists of an independent fully-connected layer. These fully-connected layers each have more parameters than the 2D deconvolution layer in JMENet, and can thus better capture the mapping from the dense GRU output to a spectrogram mask. The drawback is that the fully-connected layers are independent, such that correlations between the real and imaginary part of the spectrogram will not be learned.

Fig. 3. Phase enhancement, separate mask estimation network architecture (SMENet).

During training, the same loss function as for JMENet is used.

B.3 Cosine transform

Using the cosine transform instead of the Fourier transform brings the benefit that the cosine transform coefficients are real. This means a single representation of the cosine transform is possible, in contrast to the two representations necessary for the Fourier transform: either real and imaginary part, or magnitude and phase. With a cosine transform equivalent of the STFT, the short-time cosine transform (STCT), a speech enhancement model can be applied on the real cosine transform coefficients, thus implicitly enhancing both the magnitude and the phase w.r.t. the STFT representation.
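For illustration, the following minimal NumPy/SciPy sketch shows such an STCT analysis/synthesis pair built from the DCT; the frame length, hop size and window are assumed values, and exact window normalization at synthesis is omitted for brevity.

```python
# Minimal sketch (not the thesis implementation): a short-time cosine transform
# (STCT), analogous to an STFT but with purely real coefficients.
import numpy as np
from scipy.fft import dct, idct

def stct(x, frame_len=512, hop=256):
    """Frame the signal, window each frame, and apply a DCT-II per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return dct(frames, type=2, norm='ortho', axis=1)   # (frames, M), real-valued

def istct(C, frame_len=512, hop=256):
    """Inverse DCT per frame and overlap-add back to a waveform (no COLA scaling)."""
    frames = idct(C, type=2, norm='ortho', axis=1)
    out = np.zeros((C.shape[0] - 1) * hop + frame_len)
    for i, f in enumerate(frames):
        out[i * hop: i * hop + frame_len] += f
    return out

# Because the STCT coefficients are real, a mask or network output can be applied
# to them directly, implicitly modifying both magnitude and phase.
```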

A network using the STCT is shown in Figure 4.

The loss is calculated between the output waveform and the target waveform, and backpropagated through the inverse STCT and STCT layers. This requires the implementation of differentiable STCT and inverse STCT layers. The STCT operation can be seen as a 1D convolution operation over the input waveform, with the convolutional kernels being the rows of the discrete cosine transform (DCT) matrix H. This matrix is obtained similarly to how a discrete Fourier transform (DFT) matrix is obtained, by taking the DCT of the identity matrix of dimensions M × M, with M the DCT length:

H = ⎡ 1                          1                          ⋯   1                             ⎤
    ⎢ cos(π/M · 1/2)             cos(π/M · 3/2)             ⋯   cos(π/M · ((M−1) + 1/2))      ⎥
    ⎢ ⋮                          ⋮                          ⋱   ⋮                             ⎥
    ⎣ cos(π/M · 1/2 · (M−1))     cos(π/M · 3/2 · (M−1))     ⋯   cos(π/M · ((M−1) + 1/2)(M−1)) ⎦    (13)

V. EXPERIMENTAL RESULTS

A. Dataset

The Microsoft Scalable Noisy Speech Dataset (MS-SNSD) [18] is used for noise and speech recordings, combined with the Aachen Impulse Response Database [19] to add reverberation. Noisy speech mixtures are generated by mixing a noise dataset and a dry clean speech dataset at SNRs of −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 20 dB, 30 dB and 40 dB. Reverberation is added to the dry clean speech to make the speech signal more realistic. The target signal during training is the dry clean speech, i.e. the clean speech without reverberation.

Fig. 4. Phase enhancement, network using the STCT (STCTNet). M denotes the number of discrete cosine transform coefficients, and L denotes the number of timeframes.

84 hours of noisy speech are generated from the MS-SNSD dataset, of which 10% are used for validation and 90% for training. Two test sets are generated as well, and both use speech samples that are outside the training set. The seen noise test set is generated with noise that has been used in the training set, and the unseen noise test set is generated with noise that is not in the training set. Each test set is 2 hours and 20 minutes long, and contains noisy speech mixtures at −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 25 dB.

B. Metrics

Evaluation is done on the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility (STOI) [20], which give an indication of the subjective speech quality. It has been shown that PESQ has good correlation with subjective listening tests [21].
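In practice these metrics can be computed with third-party Python packages; the snippet below is a sketch assuming 16 kHz mono signals of equal length and the pesq and pystoi packages (file names are placeholders).

```python
# Sketch of PESQ/STOI evaluation for one enhanced utterance.
import soundfile as sf
from pesq import pesq      # ITU-T P.862 implementation (third-party package)
from pystoi import stoi    # short-time objective intelligibility

fs = 16000
clean, _ = sf.read('clean.wav')          # reference: dry clean speech
enhanced, _ = sf.read('enhanced.wav')    # model output

pesq_score = pesq(fs, clean, enhanced, 'wb')             # wideband PESQ
stoi_score = stoi(clean, enhanced, fs, extended=False)   # value in [0, 1]
print(f'PESQ: {pesq_score:.2f}  STOI: {stoi_score:.2f}')
```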

C. Results and discussion

The results can be seen in Table I.

C.1 Baselines

NSNet performs the best for PESQ and STOI, both for seen and unseen noise and all input SNRs. The CNN beats NSNet in PESQ on seen noise for an input SNR of 10 dB, but the difference is small (PESQ of 3.19 versus 3.12). For an input SNR of 25 dB the MMSE-LSA estimator performs better in terms of PESQ, but only slightly (PESQ of 3.73 versus 3.70).

TABLE I
PESQ and STOI, evaluated on the seen and unseen noise test set.

                               Seen noise            Unseen noise
SNR (dB)  Model            PESQ (MOS)   STOI     PESQ (MOS)   STOI
−10       Noisy mixture    1.52         0.67     1.27         0.52
          MMSE-LSA         1.69         0.66     1.11         0.50
          CNN              1.78         0.69     1.03         0.51
          EHNet            1.26         0.64     0.70         0.41
          NSNet            1.90         0.73     1.33         0.55
          NSNet-PMSQE      1.74         0.69     1.16         0.49
          JMENet           1.78         0.68     1.16         0.49
          SMENet           1.50         0.70     1.02         0.52
          STCTNet          1.34         0.58     0.87         0.42
0         Noisy mixture    2.03         0.83     1.61         0.74
          MMSE-LSA         2.32         0.83     1.73         0.72
          CNN              2.56         0.85     1.80         0.76
          EHNet            2.13         0.86     1.40         0.72
          NSNet            2.57         0.88     2.04         0.80
          NSNet-PMSQE      2.54         0.86     1.90         0.76
          JMENet           2.34         0.83     1.85         0.73
          SMENet           2.20         0.87     1.69         0.74
          STCTNet          1.83         0.72     1.35         0.60
10        Noisy mixture    2.69         0.94     2.30         0.91
          MMSE-LSA         2.97         0.93     2.50         0.89
          CNN              3.19         0.94     2.61         0.91
          EHNet            2.91         0.95     2.44         0.93
          NSNet            3.12         0.95     2.72         0.93
          NSNet-PMSQE      3.19         0.95     2.81         0.92
          JMENet           2.84         0.92     2.49         0.88
          SMENet           2.80         0.95     2.48         0.91
          STCTNet          2.04         0.75     1.85         0.73
25        Noisy mixture    3.60         1.00     3.29         0.99
          MMSE-LSA         3.73         0.99     3.44         0.99
          CNN              3.70         0.96     3.44         0.96
          EHNet            3.50         0.99     3.30         0.98
          NSNet            3.70         0.99     3.46         0.99
          NSNet-PMSQE      3.78         0.98     3.59         0.98
          JMENet           3.52         0.97     3.26         0.96
          SMENet           3.53         0.99     3.27         0.99
          STCTNet          2.53         0.81     2.44         0.81

The excellent performance, relatively small network size and real-time nature of NSNet are the reasons this network architecture is used for further experiments.

C.2 PMSQE loss

Using the PMSQE loss function, performance is degraded in comparison with the NSNet baseline for low input SNRs, but it still beats the MMSE-LSA baseline. The higher the input SNR, the better NSNet-PMSQE performs, overtaking the NSNet baseline when the input SNR is 10 dB or higher. The lower performance at low input SNRs may be due to too much speech distortion compared to NSNet. With further tuning, it is expected that NSNet-PMSQE can perform better than NSNet for all input SNRs.

C.3 Phase enhancement

For both seen and unseen noise the performance of JMENet and SMENet on PESQ and STOI is mediocre: NSNet always performs better, and JMENet and SMENet perform similarly to or worse than the MMSE-LSA baseline.

The performance on PESQ and STOI for STCTNet is poor: the model consistently has lower PESQ and STOI than the MMSE-LSA baseline. This is the result of the strong speech distortion that can be noticed when listening to the output of STCTNet.

VI. CONCLUSIONS AND FUTURE WORK

In this thesis, it was shown that taking human auditory perception into account when training deep neural networks for speech enhancement is important. By using a perceptually motivated loss function, subjective speech quality is increased, while retaining the network architecture and the real-time nature of a model.

Further research is required into phase enhancement: the networks that enhance the phase also introduce too much speech distortion. This indicates that larger neural networks may be required to do phase enhancement.

The combination of phase enhancement and perceptually motivated loss functions could lead to even better performance. Also, the speech enhancement could be improved by introducing local contextual information from future timeframes, while keeping the resulting delay acceptable. Finally, multi-channel speech enhancement could be applied, leveraging data from multiple microphones.

REFERENCES

[1] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, Apr. 1979.

[2] Jae S. Lim and Alan V. Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech,” Proceedings of the IEEE, vol. 67, no. 12, Dec. 1979.

[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, Apr. 1985.

[4] Timo Gerkmann and Richard C. Hendriks, “Noise power estimation based on the probability of speech presence,” in 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2011, pp. 145–148.

[5] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.

[6] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105. Curran Associates, Inc., 2012.

[7] Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori, “Speech Enhancement Based on Deep Denoising Autoencoder,” in INTERSPEECH 2013. 2013, pp. 436–440, ISCA.

[8] Robert Rehr and Timo Gerkmann, “Robust DNN-Based Speech Enhancement with Limited Training Data,” in Speech Communication; 13th ITG-Symposium, 2018, pp. 1–5.

[9] Dario Rethage, Jordi Pons, and Xavier Serra, “A Wavenet for Speech Denoising,” arXiv:1706.07162 [cs], Jan. 2018.

[10] Yuxuan Wang, Arun Narayanan, and DeLiang Wang, “On Training Targets for Supervised Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, Dec. 2014.

[11] Han Zhao, Shuayb Zarar, Ivan Tashev, and Chin-Hui Lee, “Convolutional-Recurrent Neural Networks for Speech Enhancement,” arXiv:1805.00579 [cs, eess], May 2018.

[12] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[13] Yangyang Xia, Sebastian Braun, Chandan K. A. Reddy, Harishchandra Dubey, Ross Cutler, and Ivan Tashev, “Weighted Speech Distortion Losses for Neural-network-based Real-time Speech Enhancement,” arXiv:2001.10601 [cs, eess], Feb. 2020.

[14] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” arXiv:1406.1078 [cs, stat], Sept. 2014.

[15] Juan Manuel Martin-Donas, Angel Manuel Gomez, Jose A. Gonzalez, and Antonio M. Peinado, “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, Nov. 2018.

[16] D. Wang and Jae Lim, “The unimportance of phase in speech enhancement,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679–681, 1982.

[17] Kuldip Paliwal, Kamil Wójcicki, and Benjamin Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, Apr. 2011.

[18] Chandan K. A. Reddy, Ebrahim Beyrami, Jamie Pool, Ross Cutler, Sriram Srinivasan, and Johannes Gehrke, “A Scalable Noisy Speech Dataset and Online Subjective Test Framework,” in Interspeech 2019. Sept. 2019, pp. 1816–1820, ISCA.

[19] Marco Jeub, Magnus Schafer, and Peter Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in 2009 16th International Conference on Digital Signal Processing, Santorini, Greece, July 2009, pp. 1–5, IEEE.

[20] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 2010, pp. 4214–4217, IEEE.

[21] Yi Hu and Philipos C. Loizou, “Evaluation of Objective Quality Measures for Speech Enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, Jan. 2008.

Contents

1 Introduction

2 Single-channel speech enhancement
  2.1 Traditional methods
    2.1.1 Spectral subtraction
    2.1.2 Wiener filter
    2.1.3 Non-linear minimum mean square error estimation (MMSE)
    2.1.4 Subspace algorithms
  2.2 Machine learning-based methods
    2.2.1 Spectral enhancement
    2.2.2 Waveform enhancement
  2.3 Baselines
    2.3.1 MMSE-LSA
    2.3.2 CNN based on a priori and a posteriori SNR
    2.3.3 EHNet
    2.3.4 NSNet
    2.3.5 Hyperparameters
    2.3.6 Dataset
    2.3.7 Metrics
    2.3.8 Results

3 Loss functions for improved subjective performance
  3.1 Multi-scale spectral loss
  3.2 Perceptual metric for speech quality evaluation
    3.2.1 Results

4 Phase enhancement
  4.1 STFT phase enhancement
    4.1.1 Joint mask estimation network (JMENet)
    4.1.2 Separate mask estimation network (SMENet)
    4.1.3 Results
  4.2 STCT phase enhancement
    4.2.1 Short-time cosine transform
    4.2.2 Network in the STCT domain
    4.2.3 Network in the waveform domain
    4.2.4 Results

5 Conclusions and future work

List of Figures

2.1 Speech enhancement scheme.
2.2 Speech enhancement scheme using STFT.
2.3 Wiener filter block diagram.
2.4 Dilated convolutions with exponentially increasing dilation factors in WaveNet, taken from [1].
2.5 CNN based on a priori and a posteriori SNR.
2.6 EHNet architecture, taken from [2].
2.7 NSNet architecture, taken from [3].
3.1 Noisy speech and denoised spectrogram, model trained using the multi-scale spectral loss.
4.1 Different complex spectrogram representations.
4.2 STFT phase enhancement network architectures.
4.3 Network in the waveform domain, computing the STCT of a waveform, denoising in the STCT domain, and transforming back to a waveform with the inverse STCT. Loss is calculated between noisy and denoised waveforms.

List of Tables

2.3.1 Hyperparameters for the CNN based on a priori and a posteriori SNR.
2.3.2 Hyperparameters for EHNet.
2.3.3 Hyperparameters for NSNet.
2.3.4 Non-perceptual metrics for the 4 baseline models, evaluated on the seen noise test set.
2.3.5 Non-perceptual metrics for the 4 baseline models, evaluated on the unseen noise test set.
2.3.6 Perceptual metrics for the 4 baseline models, evaluated on the seen and unseen noise test set.
3.2.1 Non-perceptual metrics for the NSNet trained with the PMSQE loss function, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen noise test set.
3.2.2 Non-perceptual metrics for the NSNet trained with the PMSQE loss function, compared with the NSNet and MMSE-LSA baselines, evaluated on the unseen noise test set.
3.2.3 Perceptual metrics for the NSNet trained with the PMSQE loss function, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen and unseen noise test set.
4.1.1 Hyperparameters for the joint mask estimation network.
4.1.2 Non-perceptual metrics for JMENet and SMENet, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen noise test set.
4.1.3 Non-perceptual metrics for JMENet and SMENet, compared with the NSNet and MMSE-LSA baselines, evaluated on the unseen noise test set.
4.1.4 Perceptual metrics for JMENet and SMENet, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen and unseen noise test set.
4.2.1 Hyperparameters for the waveform network.
4.2.2 Non-perceptual metrics for the network using the STCT, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen noise test set.
4.2.3 Non-perceptual metrics for the network using the STCT, compared with the NSNet and MMSE-LSA baselines, evaluated on the unseen noise test set.
4.2.4 Perceptual metrics for the network using the STCT, compared with the NSNet and MMSE-LSA baselines, evaluated on the seen and unseen noise test set.
A.1 Non-perceptual metrics, evaluated on the seen noise test set.
A.2 Non-perceptual metrics, evaluated on the unseen noise test set.


Chapter 1

Introduction

Speech is one of the most commonly used forms of communication between humans, especially when conveying ideas or explaining a concept, as there are lots of subtle cues in speech that cannot be captured by written text. Nowadays, speech communication often happens over a mobile phone or laptop, in a noisy environment with traffic noise or other people talking. Background noise plays a significant role in the perceived speech quality, which makes the suppression of noise a requirement for efficient speech communication.

Mobile phones or laptops are devices with some compute power available. By using some of this computing power to run an algorithm that suppresses background noise and enhances the speech component, speech quality can be improved. Speech enhancement encompasses the research, development, and testing of these algorithms. Besides, speech enhancement has lots of other use cases, e.g. in hearing aids [4], or as a preprocessing step for automatic speech recognition [5].

First, an overview of single-channel speech enhancement is given, which is speech enhancement using the signal from a single microphone. Traditional statistical models are discussed in Section 2.1 and DNN-based models in Section 2.2.

Deep neural networks (DNNs) are a class of artificial neural networks, which are an artificial approximation of the biological neural networks that are present in, e.g. a human brain. DNNs consist of multiple connected layers by which the input is transformed. By training a DNN, the optimal parameters for these transformations are obtained. Training is often done by giving examples of what the output of the DNN should look like. This is called supervised learning.

Deep neural networks have given a massive boost of performance in fields such as computer vision [6] and natural language processing [7], and they have been successfully applied in speech processing for automatic speech recognition as well [8]. For speech enhancement, deep neural networks can also lead to an enormous increase in performance compared to traditional statistical models.

Some state-of-the-art models are implemented and evaluated in Section 2.3. These models will be used as a baseline to compare with further research.

An important element of training a deep neural network is the loss function, which calculates the error between the network’s predictions and the correct examples. Since most loss functions do not take human auditory perception into account, in Chapter 3 two perceptually motivated loss functions ([9], [10]) are combined with a state-of-the-art model from Section 2.3 and evaluated.

Lastly, as enhancement of the phase of a noisy signal has been largely ignored up to this point in this thesis, research is done on noisy phase enhancement and its effect on speech quality: different new network architectures and features are tested and compared in Chapter 4.


Chapter 2

Single-channel speech enhancement

In speech enhancement, the goal is to improve the speech quality of a noisy speech signal. This enhancement is based on the assumption that the noisy speech signal y(k) consists of the clean speech signal s(k) with additive noise v(k), with k the discrete time index, as can also be seen in Figure 2.1.

y(k) = s(k) + v(k) (2.1)

Using several methods, an estimate of the clean speech ŝ(k) can be obtained.

Figure 2.1: Speech enhancement scheme.

2.1 Traditional methods

This section offers an overview of traditional single-channel speech enhancement techniques, meaning techniques not involving machine learning.

Figure 2.2: Speech enhancement scheme using STFT.

These techniques are all based around the estimation of the noise present in the input signal. This noise estimate is then used to produce an estimate of the clean speech signal. As is shown in Figure 2.2, traditional speech enhancement is often done in the short-time Fourier transform (STFT) domain, where the Fourier spectrum is calculated for each timeframe using overlapping windows, resulting in a spectrogram. Applying this to Equation 2.1 results in

Y(µ, λ) = S(µ, λ) + V (µ, λ) (2.2)

In this equation, µ is the frequency bin index and λ is the timeframe index.

After enhancement, the estimated clean speech spectrogram Ŝ(µ, λ) is transformed back into the time domain using the inverse short-time Fourier transform (ISTFT). The phase of the clean speech signal is set equal to the noisy speech signal's phase, as noise in the phase is assumed to be unimportant for human hearing [11].
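This analysis/modification/synthesis loop can be sketched as follows; this is a minimal illustration using scipy, in which the spectral gain is only a placeholder for the estimators discussed in this chapter.

```python
# Minimal STFT-domain enhancement skeleton (illustrative, not the thesis code).
# The gain G is a placeholder; the noisy phase is reused at synthesis.
import numpy as np
from scipy.signal import stft, istft

def enhance(y, fs, nperseg=512, noverlap=256):
    _, _, Y = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)  # Y[mu, lambda]
    mag, phase = np.abs(Y), np.angle(Y)

    # Placeholder gain (identity mask). A real system would use e.g. spectral
    # subtraction, Wiener filtering, MMSE-LSA, or a DNN-predicted mask here.
    G = np.ones_like(mag)

    S_hat = (G * mag) * np.exp(1j * phase)          # enhanced magnitude, noisy phase
    _, s_hat = istft(S_hat, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return s_hat
```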

2.1.1 Spectral subtraction

Spectral subtraction was one of the first techniques to be proposed for speech enhancement [12].

It works in the STFT domain, following Equation 2.2. An estimate of the noise magnitude spectrum |V̂(µ, λ)| is obtained by averaging the non-speech segments of the noisy speech magnitude spectra |Y(µ, λ)|. The noise is subtracted from the noisy speech, resulting in the estimated clean speech:

|Ŝ(µ, λ)| = |Y(µ, λ)| − |V̂(µ, λ)|

Using the ISTFT with the noisy phase, ∠Ŝ(µ, λ) = ∠Y(µ, λ), the estimated clean speech signal ŝ(k) is obtained.

A disadvantage of the spectral subtraction method is that it causes ‘musical noise’. This musical noise is caused by isolated peaks remaining in the spectrum after spectral subtraction, which the human ear perceives as musical tones.
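A minimal sketch of magnitude spectral subtraction is given below, assuming the first few frames of the spectrogram contain only noise; the flooring at zero is exactly what leaves the isolated spectral peaks that are perceived as musical noise.

```python
# Sketch of magnitude spectral subtraction (illustrative).
import numpy as np

def spectral_subtraction(Y, noise_frames=10):
    """Y: complex STFT of the noisy speech, shape (freq_bins, timeframes)."""
    # Estimate the noise magnitude from frames assumed to contain no speech.
    noise_mag = np.mean(np.abs(Y[:, :noise_frames]), axis=1, keepdims=True)
    clean_mag = np.maximum(np.abs(Y) - noise_mag, 0.0)   # subtract, floor at zero
    return clean_mag * np.exp(1j * np.angle(Y))          # reuse the noisy phase
```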

2.1.2 Wiener filter

Much of the theory regarding the application of the Wiener filter in speech enhancement was laid out in [13]. Other filtering techniques have also been explored for speech enhancement, e.g. Kalman filtering [14].

Figure 2.3: Wiener filter block diagram.

Defining the impulse response of the Wiener filter as h(n), as in Figure 2.3, then the estimation error e(k) is

e(k) = s(k) − ŝ(k) = s(k) − h(n) ∗ y(k)    (2.3)

The Wiener filter will minimize e²(k), i.e. the mean square error. This minimization is done optimally when the spectral coefficients of both the noise and the clean speech signals are Gaussian independent variables. In the STFT domain, the Wiener filter is defined as:

H(µ, λ) = |S(µ, λ)|² / (|S(µ, λ)|² + |V(µ, λ)|²)    (2.4)

As the Wiener filter is a linear time-invariant filter, this assumes a linear relationship between the noisy speech signal and the clean speech signal, i.e. when the noisy speech signal doubles in amplitude the clean speech signal does as well.

However, the clean speech S(µ, λ) and the noise V (µ, λ) are necessary to obtain the Wiener filter, meaning those will have to be estimated. As in the previous section, the noise can be estimated during the non-speech parts of the noisy speech signal. In [13] it was proposed to model the clean speech: voiced speech is generated by filtering a pulse train, and unvoiced speech is generated by filtering white noise. Thus, the problem of estimating the clean speech signal was reduced to the problem of estimating the speech model filter coefficients.

Note that, in contrast to spectral subtraction, the Wiener filter method estimates the spectral magnitude and phase of the clean speech signal and can do this optimally in terms of mean square error.
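Given per-bin estimates of the speech and noise powers, the Wiener gain of Equation 2.4 is a simple element-wise ratio; the sketch below assumes those power estimates are already available.

```python
# Sketch of the Wiener gain (Equation 2.4), given power estimates of speech
# and noise with the same shape as the spectrogram.
import numpy as np

def wiener_gain(speech_power, noise_power, eps=1e-12):
    return speech_power / (speech_power + noise_power + eps)

# Applied to the noisy STFT Y, the clean speech estimate is H * Y.
```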



2.1.3 Non-linear minimum mean square error estimation (MMSE)

One of the most important assumptions when using the Wiener filter is that the noisy speech and the clean speech spectral coefficients have a linear relationship. Also, the Wiener filter yields an optimal clean speech spectrum Ŝ(µ, λ) in terms of MMSE, but it does not yield an optimal clean speech spectral magnitude |Ŝ(µ, λ)|.

In this section, the non-linear minimum mean-square error estimation (MMSE) is discussed, but other non-linear estimators also exist. These ‘estimators’ are called as such because they give an estimate for the clean speech spectral coefficients, based on the statistics of the noisy speech spectral coefficients. A maximum-likelihood estimator for speech enhancement was first applied in [15], after which in [16] a minimum mean-square error estimator was used to estimate the clean speech spectral magnitude.

In [16] it was assumed the spectral coefficients of the clean speech and the noise are Gaussian independent variables. The authors derived an MMSE estimator for both the spectral magnitude and the spectral phase. The problem with both estimators is that they require an estimate of the variance of the noise σ²_V(µ, λ), and the a priori SNR ξ(µ, λ), i.e. the SNR before noise is added:

ξ(µ, λ) = σ²_S(µ, λ) / σ²_V(µ, λ)    (2.5)

Obtaining the noise variance is relatively straightforward: assuming the noise is stationary, the noise variance can easily be obtained during the non-speech segments of the noisy speech signal [17]. Two different estimators for the a priori SNR are proposed: a maximum-likelihood approach and a so-called "decision-directed" estimation approach.

The maximum-likelihood estimator offers an estimate of the a priori SNR based on consecutive observations of the noisy speech spectral magnitudes. The "decision-directed" estimator leverages the relationship between the a priori SNR and the a posteriori SNR γ(µ, λ), which enables them to derive the a priori SNR as a recursive function of the previous spectral magnitude estimate. Defining the a posteriori SNR as

γ(µ, λ) = |Y(µ, λ)|² / σ²_V(µ, λ)    (2.6)

the recursive a priori SNR estimation is done as follows:

ξ̂(µ, λ) = α |Ŝ(µ, λ − 1)|² / σ²_V(µ, λ − 1) + (1 − α) max(γ(µ, λ) − 1, 0)    (2.7)

where α is a parameter weighing the past a priori SNR (first term) versus the current a posteriori SNR (second term).
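The recursion of Equation 2.7 can be sketched as a loop over timeframes; in the snippet below the noise variance and the spectral gain function (e.g. the MMSE-LSA gain) are assumed to be given, and the noise variance is treated as slowly varying.

```python
# Sketch of decision-directed a priori SNR estimation (Equation 2.7).
# noisy_power: |Y|^2 and noise_var: estimated noise variance, both (bins, frames).
# gain_fn(xi, gamma) is an assumed spectral gain function, e.g. MMSE-LSA.
import numpy as np

def decision_directed_snr(noisy_power, noise_var, gain_fn, alpha=0.98):
    n_bins, n_frames = noisy_power.shape
    xi = np.zeros_like(noisy_power)          # a priori SNR estimates
    prev_clean_power = np.zeros(n_bins)      # |S_hat(mu, lambda - 1)|^2
    for l in range(n_frames):
        gamma = noisy_power[:, l] / noise_var[:, l]           # a posteriori SNR
        xi[:, l] = (alpha * prev_clean_power / noise_var[:, l]
                    + (1 - alpha) * np.maximum(gamma - 1.0, 0.0))
        G = gain_fn(xi[:, l], gamma)                          # spectral gain
        prev_clean_power = (G ** 2) * noisy_power[:, l]       # feed back the estimate
    return xi
```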


This MMSE approach is improved by minimizing the mean square error of the logarithm of the spectral magnitude [18]. The log-spectral magnitude is a better metric for speech since auditory perception is logarithmic. This minimum mean square error log-spectral amplitude (MMSE-LSA) estimator significantly improved the noise reduction performance compared to the plain spectral magnitude estimator [18], [19].

2.1.4 Subspace algorithms

Subspace speech enhancement algorithms assume a clean signal is confined to a subspace of the noisy Euclidean space. The noisy speech signal vector can then be decomposed into a clean speech signal vector and a noise signal vector. Thus, by removing the noise signal vector, the clean speech signal is obtained. These algorithms are based on linear algebra decomposition techniques such as singular value decomposition (SVD) and eigenvalue decomposition. The applications of SVD for speech enhancement were explored in [20], while in [21] eigenvalue decomposition was applied for speech enhancement. The main focus in [21] was increasing the speech quality and reducing listener fatigue (i.e. removing high amplitude noise from e.g. a factory environment) while minimizing any loss of intelligibility (i.e. making sure the speech can still be well understood).

2.2 Machine learning-based methods

The methods discussed in the section above only work reliably when the noise is stationary, which often is not the case: e.g. traffic, typing on a keyboard and shutting doors are non-stationary noise types. Traditional techniques have difficulty in ‘tracking’ the noise when the noise is non-stationary. Also, assumptions are often made regarding the independence of noise and speech, the spectral distribution of the noise and the speech, and so forth. These assumptions lead to a reduction in performance, as they are not always accurate.

Machine learning-based methods offer a data-driven approach instead of these assumptions, and the rise of machine learning and deep learning has not gone unnoticed in the speech enhancement domain: both ‘classical’ machine learning (e.g. support vector machines (SVMs) [22]) and deep learning (e.g., deep denoising autoencoders [23]) have been used to do speech enhancement. In this thesis, the focus is on deep learning-based speech enhancement.



2.2.1 Spectral enhancement

Most machine learning-based speech enhancement techniques work in the spectral domain, similar to the spectral subtraction discussed above. Speech enhancement in the spectral domain is a regression problem: using the STFT of the noisy speech signal as the input, a prediction is made to denoise the input. Often not the whole STFT is used as the input, but only the magnitude, thus disregarding the phase and enhancing only the magnitude spectrum. There are multiple possibilities for the type of output prediction: direct estimation of the denoised STFT or denoised spectrogram is possible,

Ŝ(µ, λ) = f(Y(µ, λ))    (2.8)

but masks are commonly used [24]:

Ŝ(µ, λ) = Y(µ, λ) f(Y(µ, λ))    (2.9)

with f(Y(µ, λ)) the output prediction for an input Y(µ, λ).

Both supervised learning and unsupervised learning have been applied for spectral enhancement.

Supervised learning

In supervised learning, a large dataset of both the noisy speech mixture and clean speech is necessary. Such a dataset is generated by combining a clean speech dataset with a noise dataset, thus creating noisy speech mixtures. Examples of such datasets are the TIMIT speech dataset [25], the ETSI background noise dataset [26] and the Microsoft Scalable Noisy Speech Dataset (containing both speech and noise) [27]. The noisy speech mixtures, generated by mixing speech with noise, are the input to the neural network, and a loss function (e.g. mean squared error) is defined between the neural network output and the clean speech. The neural network is then trained by optimizing for this loss.

Lots of neural network architectures have been used to apply spectral enhancement: fully-connected feed-forward networks [23], [28], convolutional neural networks (CNNs) [29], recurrent neural networks (RNNs) [5], transformers [30], and so on, as well as neural networks combining the aforementioned architectures [2]. Recent research has focused on increasing performance by using various loss functions [31], incorporating phase enhancement [32] and real-time enhancement [3].


Unsupervised and semi-supervised learning

One of the problems with the approach described above is that the neural network only learns the noise types which were present in the dataset. To alleviate this problem, unsupervised and semi-supervised learning techniques have been used. In fact, all the speech enhancement techniques described in Section 2.1 are unsupervised. However, those techniques are mostly based on statistical models with certain assumptions, which can cause degradation of the enhancement performance when the noise is non-stationary. A widely used technique in unsupervised speech enhancement is nonnegative matrix factorization (NMF). Using NMF, a spectrogram is factorized into two nonnegative matrices: the dictionary or basis matrix, and the NMF coefficient or activation matrix. The basis matrix will contain the most important features 'distilled' from the spectrogram, while the activation matrix will combine these features to reconstruct the spectrogram. Using semi-supervised learning, an NMF model for clean speech can be trained, i.e. clean speech spectrograms are factorized into a clean speech basis matrix and an activation matrix. Clean speech is thus represented as a linear combination of basis vectors from the basis matrix. When the speech is corrupted by noise, clean speech can be recovered by using the correct linear combination [33], [34].

A problem with the NMF approach is the assumption of linearity; more recent studies have used variational autoencoders (VAEs) based on deep neural networks to generate the clean speech model [35].

2.2.2 Waveform enhancement

While spectral enhancement using machine learning can deliver better performance than traditional spectral enhancement techniques, both reconstruct the enhanced/denoised signal using the noisy speech signal phase. However, it has been shown that the phase does have an impact on speech quality [36], in contrast to an earlier paper that claimed the opposite [11].

In [11] it is shown that enhancing the phase only improves speech quality noticeably when using long timeframes (M = 4096). Also, the authors claim that enhancing the phase is difficult, and errors in the estimated clean phase lead to worse speech quality than just using the noisy phase. Thus, it is recommended to use the noisy phase to estimate the clean speech signal.


Using the noisy phase is, for example, also what the spectral subtraction method does in Section 2.1.1. In [36], experiments using different enhancement methods are tested, showing that enhancing the phase does increase speech quality for the MMSE-LSA estimator described in Section 2.1.3, especially when the clean phase can be predicted very accurately. This suggests that phase enhancement may also increase the speech quality when using other enhancement methods.

Waveform enhancement offers a solution to the phase enhancement problem: by directly enhancing the waveform, both the magnitude and the phase spectrum are enhanced. Also, this enables end-to-end training of the deep neural network, as no features need to be manually extracted from the waveforms by way of the STFT.

In [29], it is shown that a fully convolutional network (FCN), a network consisting only of convolutional layers, trained end-to-end on raw waveforms, can perform better than neural networks using fully connected layers. An FCN is used because it can capture temporal patterns of the waveform, and it has much fewer parameters than a fully connected (dense) network. However, the authors note that the network has difficulties in generating high-frequency components.

In [37] a generative adversarial network (GAN) [38] is proposed for speech enhancement, which was called speech enhancement GAN (SEGAN). A GAN consists of two neural networks: a generator and a discriminator. The generator generates an output (from some input), while the discriminator learns if the output of the generator is correct or not. Both networks are adversaries: the generator learns to ‘fool’ the discriminator, while the discriminator gets better at distinguishing the generator’s output. In SEGAN, the generator learns a mapping from noisy waveforms to clean waveforms, and the discriminator learns to distinguish between clean waveforms and noisy waveforms.

WaveNet [1], a generative speech model, was also used for speech enhancement in [39]. Dilated convolutions are used, which are convolutions that have spaces in the convolutional kernel, e.g. a 1D kernel [w0, w1, w2] becomes [w0, 0, w1, 0, w2] with a dilation of 2. This increases the so-called receptive field of the convolution, i.e. how much the convolution 'sees': the original kernel has a receptive field of 3, while the dilated kernel has a receptive field of 5. In WaveNet, a denoised sample is predicted by applying dilated convolutions to a noisy waveform sample preceded by a number of previous samples and succeeded by a number of future samples, as can be seen in Figure 2.4. Doing this for each noisy waveform sample, the denoised waveform is predicted. This network is non-causal, as future samples are necessary to be able to make a prediction. Thus, real-time enhancement is not possible: some delay must be added, depending on how many future samples are necessary.
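The effect of dilation on the receptive field can be illustrated with a one-dimensional convolution in PyTorch; this is only an illustrative snippet, not the WaveNet model itself.

```python
# Illustrative sketch: a kernel of size 3 with dilation 2 covers 5 input samples,
# i.e. the receptive field grows without adding parameters.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 16)                              # (batch, channels, samples)
conv = nn.Conv1d(1, 1, kernel_size=3, dilation=1)      # receptive field 3
dilated = nn.Conv1d(1, 1, kernel_size=3, dilation=2)   # receptive field 5

print(conv(x).shape)     # torch.Size([1, 1, 14]) -> 2 border samples lost
print(dilated(x).shape)  # torch.Size([1, 1, 12]) -> 4 border samples lost
```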



Figure 2.4: Dilated convolutions with exponentially increasing dilation factors in WaveNet, taken from [1].

2.3 Baselines

Baselines are necessary to be able to compare different speech enhancement techniques. Two types of baselines are used: statistical models and machine learning models. In the following section, a brief overview of the baseline models is given. The hyperparameters for each baseline are given in Section 2.3.5. As this thesis mainly focusses on speech enhancement in the spectral domain, all the baseline models do the enhancement in the spectral domain.

2.3.1 MMSE-LSA

As a statistical model baseline, the minimum mean-square error log-spectral amplitude estimator from [18] is used. This model was already discussed in section 2.1.

The MMSE-LSA model was chosen because it is often used as a baseline in literature (e.g. in [23]), and it offers reasonable performance while having limited complexity. The following gain function was used, as proposed in [18]:

G(µ, λ) = [ξ(µ, λ) / (1 + ξ(µ, λ))] · exp( 0.5 ∫_{ν(µ,λ)}^{∞} (e⁻ᵗ / t) dt )    (2.10)

with ν(µ, λ) = [ξ(µ, λ) / (1 + ξ(µ, λ))] · γ(µ, λ).
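The integral in the exponent of Equation 2.10 is the exponential integral E1, so the gain can be evaluated directly; the sketch below assumes the a priori and a posteriori SNRs are already available as arrays.

```python
# Sketch of the MMSE-LSA gain (Equation 2.10).
import numpy as np
from scipy.special import exp1   # E1(x) = integral from x to infinity of e^-t / t dt

def mmse_lsa_gain(xi, gamma):
    """xi: a priori SNR, gamma: a posteriori SNR (arrays of equal shape)."""
    nu = xi / (1.0 + xi) * gamma
    nu = np.maximum(nu, 1e-8)                 # avoid E1(0) = infinity
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(nu))
```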

2.3.2 CNN based on a priori and a posteriori SNR

This model is a frame-by-frame model, meaning the model takes in a noisy STFT magnitude timeframe and outputs a denoised STFT magnitude timeframe, not taking into account previous or future timeframes. For each STFT magnitude timeframe, the a priori and a posteriori SNR (as defined in Equations 2.5 and 2.6) are calculated for each frequency bin and normalized to the range between 0 and 1. The a priori and a posteriori SNR are calculated based on an estimate of the timeframe's noise power, so the performance of this model depends on the accuracy of this estimation.

A visualisation of the CNN input can be seen in Figure 2.5a, with ξµ the a priori SNR and γµ the a posteriori SNR for frequency bin µ, and fi the result of the convolution. The normalized a priori and a posteriori SNRs are stacked, and a 2D convolution is applied along the frequency bin dimension and the a priori/posteriori SNR dimension.

(a) Network input.

(b) Network architecture.

Figure 2.5: CNN based on a priori and a posteriori SNR.

The convolutional layer is followed by max pooling (to reduce the dimensionality) and dropout. The features extracted by the convolution are flattened, so the timeframe has a single feature vector. This is followed by three fully-connected feedforward layers. An overview of the network architecture can be seen in Figure 2.5b.

The output of the model is an ideal ratio mask (IRM) [24]:

IRM(µ, λ) = |S(µ, λ)|² / (|S(µ, λ)|² + |V(µ, λ)|²)

This model is trained using the mean square error (MSE) loss between the predicted output IRM and the target IRM, and the parameters are optimized using the Adam optimizer [40]:

MSE(Ŝ, S) = (1 / (M × L)) · Σ_{µ=0}^{M−1} Σ_{λ=0}^{L−1} (Ŝ(µ, λ) − S(µ, λ))²    (2.11)
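During training, the target IRM and the MSE of Equation 2.11 can be computed per mixture; a short sketch, assuming the clean speech and noise STFTs are available (which is the case when the mixtures are generated synthetically).

```python
# Sketch of the training target (IRM) and the MSE loss of Equation 2.11.
import numpy as np

def ideal_ratio_mask(S, V, eps=1e-12):
    """S, V: complex STFTs of clean speech and noise, same shape."""
    return np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(V) ** 2 + eps)

def mse(pred, target):
    return np.mean((pred - target) ** 2)   # averages over all M x L bins
```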

2.3.3 EHNet

EHNet [2] is a deep neural network speech enhancement model, based on a combination of the CNN and RNN architecture. The network architecture can be seen in Figure 2.6. With a 2D convolution layer, features are extracted for each timestep of the input magnitude spectrogram. This convolution is done using a kernel size of 32 in the frequency domain, and 11 in the time domain. The resulting convolution has a window size of 11: data is needed from the previous 5, the future 5, and the current timeframe.

The extracted features are flattened, so each timeframe consists of a single feature vector. These flattened features are run through a two-layer bidirectional long short-term memory (LSTM) [41], consisting of 1024 units each, followed by a fully connected feedforward layer to recover the output magnitude spectrogram.

As a bidirectional LSTM is used, i.e. two parallel LSTM layers, one going forwards through the timeframes and one going backwards, all future timeframes need to be available as well, meaning online real-time speech enhancement is not possible.

Figure 2.6: EHNet architecture, taken from [2].

The mean-square error loss function is used to calculate the loss between the predicted output spectrogram and the clean speech spectrogram. The model parameters are optimized with the AdaDelta optimizer [42].
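A compact PyTorch sketch of such an architecture is shown below; the layer sizes follow Table 2.3.2, while the padding and the exact flattening of the convolutional features are simplifying assumptions rather than the authors' implementation.

```python
# Sketch of an EHNet-style model: per-timeframe Conv2d feature extraction,
# a 2-layer bidirectional LSTM, and a dense output layer.
import torch
import torch.nn as nn

class EHNetSketch(nn.Module):
    def __init__(self, n_bins=256, conv_channels=256, lstm_units=1024):
        super().__init__()
        self.conv = nn.Conv2d(1, conv_channels, kernel_size=(32, 11),
                              stride=(16, 1), padding=(0, 5))
        conv_bins = (n_bins - 32) // 16 + 1              # frequency bins after striding
        self.blstm = nn.LSTM(conv_channels * conv_bins, lstm_units,
                             num_layers=2, bidirectional=True,
                             batch_first=True, dropout=0.1)
        self.fc = nn.Linear(2 * lstm_units, n_bins)

    def forward(self, spec):                             # spec: (batch, freq, time)
        x = torch.relu(self.conv(spec.unsqueeze(1)))     # (batch, C, freq', time)
        x = x.permute(0, 3, 1, 2).flatten(2)             # (batch, time, C * freq')
        x, _ = self.blstm(x)
        return self.fc(x).transpose(1, 2)                # (batch, freq, time)

# est = EHNetSketch()(noisy_magnitude)   # trained with MSE against |S|
```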



Figure 2.7: NSNet architecture, taken from [3].

2.3.4 NSNet

NSNet [3] is a deep neural network speech enhancement model, based on a relatively simple RNN architecture. The complete network and training system can be seen in Figure 2.7.

Contrary to EHNet, the output spectrogram is not directly predicted, but a gain mask is predicted:

|Ŝ(µ, λ)| = G(µ, λ) |Y(µ, λ)|    (2.12)

with G the predicted gain mask.

Predicting a mask results in better performance than directly predicting a spectrogram, as the mask acts as a filter for the noisy speech spectrogram, whereas direct estimation ignores the noisy speech spectrogram. Furthermore, "masks are likely easier to learn than spectral envelopes, as their spectrotemporal patterns are more stable with respect to speaker variations" [24].

Each timeframe of the noisy log-power spectrogram is normalized and applied to the input of a three-layer gated recurrent unit (GRU) [43], followed by a fully connected feedforward layer to recover a gain mask. This way, NSNet works in real-time: denoising the current frame only requires the current frame and previous frames, but no future frames.

The novel idea in NSNet is the utilization of a weighted loss function, which weighs the loss differently when speech is active in the clean speech signal. Using a simple energy-based voice activity detector (VAD), the timeframes with speech activity are obtained. By tuning the weighting, a trade-off can be made between increased noise reduction with stronger target speech distortion and vice versa.

The loss function is defined as follows:

loss_{\text{speech}} = \text{MSE}\big(S_{\text{active}}, (G \odot S)_{\text{active}}\big)    (2.13)

loss_{\text{noise}} = \text{MSE}\big(0, G \odot Y\big)    (2.14)

loss = \alpha \, loss_{\text{speech}} + (1 - \alpha) \, loss_{\text{noise}}    (2.15)

where S_{\text{active}} denotes the matrix containing |S(\mu, \lambda)|, with the subscript ‘active’ denoting the timeframes with speech activity, and \odot denotes element-wise multiplication. Minimizing loss_{\text{speech}} minimizes the speech distortion, and minimizing loss_{\text{noise}} minimizes the residual noise.
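A minimal PyTorch sketch of this weighted loss is shown below. The tensor layout, the boolean VAD vector, and the selection of speech-active frames by boolean indexing are assumptions; the energy-based VAD itself is not shown.

```python
import torch

def weighted_loss(gain, noisy_mag, clean_mag, vad, alpha=0.35):
    """Speech-distortion-weighted loss, following Eqs. 2.13-2.15.

    gain, noisy_mag, clean_mag: tensors of shape (frequency_bins, timeframes).
    vad: boolean tensor of shape (timeframes,), True where speech is active.
    """
    # Speech distortion: compare the clean speech with the masked clean speech,
    # restricted to the speech-active timeframes.
    loss_speech = torch.mean((clean_mag[:, vad] - gain[:, vad] * clean_mag[:, vad]) ** 2)
    # Residual noise: the masked noisy spectrogram is driven towards zero.
    loss_noise = torch.mean((gain * noisy_mag) ** 2)
    return alpha * loss_speech + (1 - alpha) * loss_noise
```

Setting α closer to 1 penalizes speech distortion more heavily, while a smaller α favours stronger noise suppression, which is the trade-off described above.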

The model parameters are optimized using the Adam optimizer.

2.3.5 Hyperparameters

CNN based on a priori and a posteriori SNR

The hyperparameters can be seen in Table 2.3.1.

In the 2D convolutional layer, the first dimension is the frequency dimension and the second dimension is the a priori and a posteriori SNR dimension. The max pooling layer halves the frequency dimension. Note that the last fully-connected layer uses the sigmoid activation function, as the IRM is bounded between 0 and 1. The result is a network of around 10 million trainable parameters.


Hyperparameters

STFT               512-point FFT, 50% overlap, Hann window
2D convolution     32 kernels, size 5 × 2, ReLU activation
Max pooling        kernel size 2 × 1, stride 2 × 1
Dropout            probability 0.5
Fully-connected 1  2056 units, ReLU activation
Fully-connected 2  515 units, ReLU activation
Fully-connected 3  257 units, sigmoid activation
Batch size         35
Optimizer          Adam, learning rate 0.001

Table 2.3.1: Hyperparameters for the CNN based on a priori and a posteriori SNR.
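A PyTorch sketch consistent with these hyperparameters is given below. The per-timeframe input layout (one channel, 257 frequency bins, 2 SNR features) and the absence of padding in the convolution are assumptions, as the table does not specify them; with these choices the sketch ends up at roughly the 10 million trainable parameters mentioned above.

```python
import torch
import torch.nn as nn

class SnrCnn(nn.Module):
    """Sketch of the CNN operating on a priori / a posteriori SNR features.

    Input:  (batch, 1, 257, 2) -- 257 frequency bins, 2 SNR features per bin.
    Output: (batch, 257)       -- the estimated IRM for one timeframe.
    """

    def __init__(self, freq_bins=257):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 2)),             # 32 kernels of size 5 x 2
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),  # halves the frequency axis
            nn.Dropout(p=0.5),
            nn.Flatten(),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 1, freq_bins, 2)).shape[1]
        self.regressor = nn.Sequential(
            nn.Linear(n_flat, 2056), nn.ReLU(),
            nn.Linear(2056, 515), nn.ReLU(),
            nn.Linear(515, freq_bins), nn.Sigmoid(),          # IRM is bounded in [0, 1]
        )

    def forward(self, x):
        return self.regressor(self.features(x))
```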

EHNet

The same hyperparameters were used as in the EHNet paper [2], as can be seen in Table 2.3.2.

The 2D convolutional layer’s first dimension is the frequency dimension and its second dimension is the time dimension. Dropout in the bidirectional LSTM is applied between the LSTM layers. The result is a network of around 66 million trainable parameters.

Hyperparameters

STFT                510-point FFT, 50% overlap, Hann window
2D convolution      256 kernels, size 32 × 11, stride 16 × 1, ReLU activation
Bidirectional LSTM  2 layers, 1024 hidden units each, 0.1 dropout probability
Fully-connected     256 units, ReLU activation
Batch size          16
Optimizer           AdaDelta, learning rate 1

Table 2.3.2: Hyperparameters for EHNet.
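The following PyTorch sketch follows Table 2.3.2. The time-axis zero-padding of 5 frames (to keep the number of output frames unchanged) and the ordering of the flattened convolution features are assumptions, since these details are not fixed here.

```python
import torch
import torch.nn as nn

class EHNetSketch(nn.Module):
    """Sketch of EHNet: a 2D convolution followed by a bidirectional LSTM.

    Input:  (batch, 1, 256, timeframes)  noisy magnitude spectrogram.
    Output: (batch, timeframes, 256)     estimated clean magnitude spectrogram.
    """

    def __init__(self, freq_bins=256):
        super().__init__()
        # Kernel spans 32 frequency bins and 11 timeframes.
        self.conv = nn.Conv2d(1, 256, kernel_size=(32, 11),
                              stride=(16, 1), padding=(0, 5))
        n_features = 256 * ((freq_bins - 32) // 16 + 1)        # 256 * 15 = 3840
        self.blstm = nn.LSTM(input_size=n_features, hidden_size=1024,
                             num_layers=2, dropout=0.1,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(nn.Linear(2 * 1024, freq_bins), nn.ReLU())

    def forward(self, x):
        h = torch.relu(self.conv(x))            # (batch, 256, 15, timeframes)
        h = h.permute(0, 3, 1, 2).flatten(2)    # (batch, timeframes, 3840)
        h, _ = self.blstm(h)                    # (batch, timeframes, 2048)
        return self.fc(h)                       # (batch, timeframes, 256)
```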

NSNet

The same hyperparameters were used as in the NSNet paper [3], as can be seen in Table 2.3.3.

Dropout in the GRU is applied between the GRU layers. The fully-connected layer uses the sigmoid activation function, as the network predicts a gain mask and amplification is not desired.

α is the loss function weighting factor, as defined in Equation 2.15.

The result is a network of around 1.2 million trainable parameters.

Hyperparameters

STFT             512-point FFT, 75% overlap, Hamming window
GRU              3 layers, 257 hidden units each, 0.2 dropout probability
Fully-connected  257 units, sigmoid activation
α                0.35
Batch size       32
Optimizer        Adam, learning rate 0.001

Table 2.3.3: Hyperparameters for NSNet.
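A PyTorch sketch following Table 2.3.3 is given below; the per-frame input normalization and the feature extraction from the noisy signal are left out. Because the GRU is unidirectional, the prediction for a frame depends only on the current and previous frames.

```python
import torch
import torch.nn as nn

class NSNetSketch(nn.Module):
    """Sketch of NSNet: a 3-layer GRU predicting a gain mask per timeframe.

    Input:  (batch, timeframes, 257)  normalized log-power spectrogram.
    Output: (batch, timeframes, 257)  gain mask in [0, 1].
    """

    def __init__(self, freq_bins=257):
        super().__init__()
        self.gru = nn.GRU(input_size=freq_bins, hidden_size=freq_bins,
                          num_layers=3, dropout=0.2, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(freq_bins, freq_bins), nn.Sigmoid())

    def forward(self, x):
        h, _ = self.gru(x)   # unidirectional: uses only past and current frames
        return self.fc(h)    # sigmoid keeps the gain between 0 and 1
```

The predicted gain is then applied to the noisy magnitude spectrogram as in Equation 2.12.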

2.3.6 Dataset

The Microsoft Scalable Noisy Speech Dataset [27] (MS-SNSD) is used as a source for dry clean speech and noise. Data is generated as follows: a random clean speech fragment and a random noise fragment are chosen. Reverberation is added to the clean speech fragment by convolving it with a randomly chosen room impulse response from the Aachen Impulse Response Database [44]. The reverberant clean speech fragment and the noise fragment are then mixed at a signal-to-noise ratio (SNR) of −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, 20 dB, 30 dB or 40 dB.

The noisy mixtures all have a length of 8 seconds. If the clean speech signal is shorter than 8 seconds, additional speech fragments are randomly picked until a length of at least 8s is obtained.

From 3 hours of noise and 16 hours of speech, 84 hours of noisy mixtures are generated, of which 90% is used for training and 10% for validation. During training, the target signal is the dry clean speech signal, i.e. the clean speech fragment without reverberation. Testing is done using two test sets, generated in the same way as the training data but with speech segments that do not occur in the training set. For the seen noise test set, the noise is the same noise that is also present in the training set. For the unseen noise test set, noise is used that is not present in the training set, in order to test the model's generalization to unseen noise. The seen and unseen noise test sets each contain 2 hours and 20 minutes of data.
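The SNR-controlled mixing step can be sketched as follows. This is an illustrative numpy sketch; scaling the noise (rather than the speech) and using full-signal power for the SNR computation are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a (reverberant) speech signal with noise at a target SNR in dB.

    Both inputs are 1-D arrays of equal length; the noise is scaled so that
    10*log10(P_speech / P_noise) equals snr_db.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```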

2.3.7 Metrics

Signal to noise ratio

One of the most straightforward metrics is the increase in the signal-to-noise ratio from the input of the model to the model’s output. In the time domain, this is defined as:

\Delta\text{SNR} = 10 \log_{10}\left(\frac{\sum_k s^2(k)}{\sum_k \hat{v}^2(k)}\right) - 10 \log_{10}\left(\frac{\sum_k s^2(k)}{\sum_k v^2(k)}\right)    (2.16)

with \hat{v}(k) the residual noise after denoising:

\hat{v}(k) = \hat{s}(k) - s(k)    (2.17)

The SNR can also be calculated by splitting the signal into timeframes, similar to the way the STFT works:

E_s(\lambda) = \frac{1}{M} \sum_{\mu=0}^{M-1} s^2(\lambda M + \mu)    (2.18)

\Delta\text{SNR}_{\text{segmental}} = 10 \log_{10}\left(\frac{1}{L} \frac{\sum_{\lambda=0}^{L-1} E_s(\lambda)}{\sum_{\lambda=0}^{L-1} E_{\hat{v}}(\lambda)}\right) - 10 \log_{10}\left(\frac{1}{L} \frac{\sum_{\lambda=0}^{L-1} E_s(\lambda)}{\sum_{\lambda=0}^{L-1} E_v(\lambda)}\right)    (2.19)

with \lambda the timeframe index, M the timeframe length in number of samples, and L the number of timeframes.

An energy-threshold voice activity detector (VAD) is defined with threshold \langle s^2(k) \rangle / 10, where \langle s^2(k) \rangle denotes the average of s^2(k) over k. Using this VAD, the difference in SNR during the speech-active parts of the signal is obtained. The sums in the equation above then run only over the timeframes containing speech, instead of over all timeframes:

\Delta\text{SNR}_{\text{active}} = 10 \log_{10}\left(\frac{1}{L_{\text{active}}} \frac{\sum_{\lambda\,\text{active}} E_s(\lambda)}{\sum_{\lambda\,\text{active}} E_{\hat{v}}(\lambda)}\right) - 10 \log_{10}\left(\frac{1}{L_{\text{active}}} \frac{\sum_{\lambda\,\text{active}} E_s(\lambda)}{\sum_{\lambda\,\text{active}} E_v(\lambda)}\right)    (2.20)

with L_{\text{active}} the number of speech-active timeframes.
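The SNR-based metrics can be sketched in numpy as follows. Applying the energy threshold to the per-frame speech energy is an assumption, and the 1/L and 1/L_active factors are omitted because they cancel in the difference of the two logarithms.

```python
import numpy as np

def frame_energies(x, frame_len):
    """Per-frame energies E_x(lambda) over non-overlapping frames (Eq. 2.18)."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean(frames ** 2, axis=1)

def delta_snr(s, v, v_hat):
    """Broadband SNR improvement (Eq. 2.16), with v_hat = s_hat - s."""
    snr_in = 10 * np.log10(np.sum(s ** 2) / np.sum(v ** 2))
    snr_out = 10 * np.log10(np.sum(s ** 2) / np.sum(v_hat ** 2))
    return snr_out - snr_in

def delta_snr_active(s, v, v_hat, frame_len):
    """SNR improvement over the speech-active frames only (Eq. 2.20)."""
    e_s = frame_energies(s, frame_len)
    active = e_s > np.mean(s ** 2) / 10        # energy-threshold VAD
    e_v = frame_energies(v, frame_len)[active]
    e_vh = frame_energies(v_hat, frame_len)[active]
    snr_in = 10 * np.log10(np.sum(e_s[active]) / np.sum(e_v))
    snr_out = 10 * np.log10(np.sum(e_s[active]) / np.sum(e_vh))
    return snr_out - snr_in
```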

Noise attenuation

The noise attenuation (NA) measures the ratio between the noise power at the input of the model and the noise power \tilde{v}(k) remaining in the denoised signal, obtained by applying the gain mask for the noisy speech signal to the noise signal:

\text{NA} = 10 \log_{10}\left(\frac{1}{L} \frac{\sum_k v^2(k)}{\sum_k \tilde{v}^2(k)}\right)    (2.21)

Similar to the speech-active SNR, the speech-active noise attenuation can be obtained:

\text{NA}_{\text{active}} = 10 \log_{10}\left(\frac{1}{L_{\text{active}}} \frac{\sum_{\lambda\,\text{active}} E_v(\lambda)}{\sum_{\lambda\,\text{active}} E_{\tilde{v}}(\lambda)}\right)    (2.22)

This metric was taken from [45].

Segmental speech to distortion ratio

The segmental speech-to-distortion ratio (SSDR) measures the amount of speech distortion introduced by the model. Defining \tilde{s}(k) as the speech component in the denoised signal, obtained by applying the gain mask for the noisy speech signal to the clean speech signal, the SSDR (in dB) is defined as:

\text{SSDR} = \frac{1}{L_{\text{active}}} \sum_{\lambda\,\text{active}} 10 \log_{10}\left(\frac{E_s(\lambda)}{E_{s - \tilde{s}}(\lambda)}\right)    (2.23)

This metric was taken from [45].
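For illustration, a self-contained numpy sketch of the SSDR is shown below; the non-overlapping framing and the externally supplied speech-activity mask (from the same energy-based VAD as above) are assumptions.

```python
import numpy as np

def ssdr(s, s_tilde, frame_len, active):
    """Segmental speech-to-distortion ratio in dB (Eq. 2.23).

    s: dry clean speech; s_tilde: gain mask applied to the clean speech;
    active: boolean array selecting the speech-active frames.
    """
    def energies(x):
        n = len(x) // frame_len
        return np.mean(x[:n * frame_len].reshape(n, frame_len) ** 2, axis=1)

    e_s = energies(s)               # E_s(lambda)
    e_d = energies(s - s_tilde)     # distortion energy E_{s - s~}(lambda)
    return np.mean(10 * np.log10(e_s[active] / e_d[active]))
```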

Perceptual evaluation of speech quality

While the previous metrics are undoubtedly useful for gauging the performance of a model, they do not take human auditory perception and preferences into account. That is why the perceptual evaluation of speech quality (PESQ) was developed in ITU-T recommendation P.862 (02/2001). The goal was to obtain an objective metric for predicting the subjective speech quality of a signal. PESQ is expressed on a Mean Opinion Score (MOS) scale and ranges from −0.5 to 4.5.

PESQ is a good metric for measuring speech quality, as there is a substantial correlation (ρ = 0.65) between PESQ and subjective scores [46].

Short-time objective intelligibility measure

Another metric taking human perception of speech into account is the short-time objective intelligibility measure (STOI), which was proposed in [47]. STOI measures the intelligibility of a degraded (noisy) speech signal with respect to a reference (clean) speech signal. It ranges from 0 to 1, with higher values indicating that the degraded speech signal is easier to understand. This is different from PESQ, as PESQ is an objective measure of speech quality, i.e., how fatiguing it is to listen to a speech signal.
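In practice, both metrics can be computed with freely available Python packages, for example the third-party pesq, pystoi and soundfile packages; this is only an illustration and not necessarily the tooling used in this work. The file names are hypothetical, and 16 kHz wideband signals are assumed.

```python
import soundfile as sf
from pesq import pesq      # pip install pesq   (ITU-T P.862 implementation)
from pystoi import stoi    # pip install pystoi

# Hypothetical file names; any clean/denoised pair sampled at 16 kHz works.
clean, fs = sf.read("clean.wav")
denoised, _ = sf.read("denoised.wav")

pesq_score = pesq(fs, clean, denoised, "wb")             # wideband PESQ (MOS scale)
stoi_score = stoi(clean, denoised, fs, extended=False)   # STOI in [0, 1]
print(f"PESQ: {pesq_score:.2f}, STOI: {stoi_score:.2f}")
```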

2.3.8 Results

Non-perceptual metrics

The results for the non-perceptual metrics for the 4 baseline models, evaluated on the seen and unseen noise test sets, can be seen in Tables 2.3.4 and 2.3.5. Only the results for an input SNR of −10 dB, 0 dB, 10 dB and 25 dB are given; an overview of the results for all input SNRs can be found in Tables A.1 and A.2 in the Appendix.

NSNet is (on average) the best performing model when looking at non-perceptual metrics, especially in low SNR conditions. When the input SNR is higher, the CNN based on a priori/posteriori SNR tends to perform better in terms of SNR increase and noise attenuation, but it also introduces more speech distortion than NSNet. On unseen noise the performance of NSNet does drop, but it remains the best performing model on average.

Perceptual metrics

The results for the perceptual metrics for the 4 baseline models, evaluated on the seen and unseen noise test sets, can be seen in Table 2.3.6. Again, only the results for an input SNR of −10 dB, 0 dB, 10 dB and 25 dB are given; an overview of the results for all input SNRs can be found in Table A.3 in the Appendix.

NSNet also performs best in terms of PESQ and STOI, for both seen and unseen noise and nearly all input SNRs. The CNN outperforms NSNet in PESQ on seen noise for an input SNR of 10 dB, but the difference is small (PESQ of 3.19 versus 3.12). For an input SNR of 25 dB the MMSE-LSA estimator performs better in terms of PESQ, but only slightly (PESQ of 3.73 versus 3.70).


SNR (dB)  Model     ΔSNR (dB)  ΔSNR_active (dB)  NA (dB)  NA_active (dB)  SSDR (dB)
-10       MMSE-LSA      4.31        4.21          10.67      10.51          11.07
          CNN          10.80        9.97          18.58      17.14           4.55
          EHNet        11.99       10.74          18.29      15.97           6.22
          NSNet        18.91       16.82          42.43      32.13          10.68
0         MMSE-LSA      5.66        5.37          10.15       9.70          15.13
          CNN           9.32        8.68          18.18      15.08          10.22
          EHNet        12.65       10.90          27.35      19.81          12.98
          NSNet        14.68       12.84          38.24      22.28          17.32
10        MMSE-LSA      5.46        5.01           9.19       8.24          20.09
          CNN           7.57        6.79          16.58      11.52          15.71
          EHNet         7.37        5.70          21.15       9.49          17.71
          NSNet         8.71        7.04          32.58      11.55          23.83
25        MMSE-LSA      4.34        3.80           7.60       5.96          26.26
          CNN           5.60        4.76          14.59       8.57          19.93
          EHNet         0.58       -0.15          10.15       1.13          22.13
          NSNet         2.75        1.53          21.65       3.10          28.93

Table 2.3.4: Non-perceptual metrics for the 4 baseline models, evaluated on the seen noise test set.

SNR (dB)  Model     ΔSNR (dB)  ΔSNR_active (dB)  NA (dB)  NA_active (dB)  SSDR (dB)
-10       MMSE-LSA      1.07        0.92           9.11       8.79           8.82
          CNN           8.45        7.50          14.47      14.19           2.68
          EHNet         1.62       -0.20          21.88      18.79           3.09
          NSNet        10.74        9.06          36.54      31.81           7.11
0         MMSE-LSA      3.23        2.82           8.50       7.78          12.93
          CNN           7.13        6.57          13.53      12.26           7.90
          EHNet         6.54        4.61          28.68      15.99           8.32
          NSNet        10.15        8.53          33.39      21.38          13.75
10        MMSE-LSA      3.40        2.81           7.54       6.14          18.53
          CNN           6.83        6.07          12.32       9.39          13.75
          EHNet         3.95        2.55          18.20       5.27          16.80
          NSNet         6.92        5.40          29.33      10.51          22.08
25        MMSE-LSA      2.42        1.73           6.10       3.85          25.94
          CNN           5.88        5.19          11.40       7.69          19.47
          EHNet        -0.30       -1.39          21.32       0.89          21.76
          NSNet         2.91        1.69          20.62       2.78          28.83

Table 2.3.5: Non-perceptual metrics for the 4 baseline models, evaluated on the unseen noise test set.


                         Seen noise          Unseen noise
SNR (dB)  Model          PESQ (MOS)  STOI    PESQ (MOS)  STOI
-10       Noisy mixture  1.52        0.67    1.27        0.52
          MMSE-LSA       1.69        0.66    1.11        0.50
          CNN            1.78        0.69    1.03        0.51
          EHNet          1.26        0.64    0.70        0.41
          NSNet          1.90        0.73    1.33        0.55
0         Noisy mixture  2.03        0.83    1.61        0.74
          MMSE-LSA       2.32        0.83    1.73        0.72
          CNN            2.56        0.85    1.80        0.76
          EHNet          2.13        0.86    1.40        0.72
          NSNet          2.57        0.88    2.04        0.80
10        Noisy mixture  2.69        0.94    2.30        0.91
          MMSE-LSA       2.97        0.93    2.50        0.89
          CNN            3.19        0.94    2.61        0.91
          EHNet          2.91        0.95    2.44        0.93
          NSNet          3.12        0.95    2.72        0.93
25        Noisy mixture  3.60        1.00    3.29        0.99
          MMSE-LSA       3.73        0.99    3.44        0.99
          CNN            3.70        0.96    3.44        0.96
          EHNet          3.50        0.99    3.30        0.98
          NSNet          3.70        0.99    3.46        0.99

Table 2.3.6: Perceptual metrics for the 4 baseline models, evaluated on the seen and unseen noise test set.


Chapter 3

Loss functions for improved subjective performance

The loss function is one of the few variables of a deep neural network that can be modified to ‘steer’ the network’s output in a particular direction, since the loss function operates on spectrograms, which have a clear structure that can be reasoned about. This is in contrast to the network architecture, where the hidden states do not necessarily have an intuitive structure.

In the following, it is examined how the unchanged NSNet architecture performs when various loss functions are used for training. NSNet is selected because it is the best-performing baseline model, it works in real time, and it has a relatively low number of parameters.

3.1 Multi-scale spectral loss

The multi-scale spectral loss was proposed in [9], where it was applied in the context of music synthesis. It is based on the mean absolute error (MAE):

\text{MAE}(\hat{S}, S) = \frac{\sum_{\mu=0}^{M-1} \sum_{\lambda=0}^{L-1} |\hat{S}(\mu, \lambda) - S(\mu, \lambda)|}{M \times L}    (3.1)

The loss function is defined as follows:

loss_i = \text{MAE}(\hat{S}_i, S_i) + \alpha \, \text{MAE}(\log \hat{S}_i, \log S_i)    (3.2)

loss = \sum_i loss_i    (3.3)


where S_i is a magnitude spectrogram calculated using an STFT with an FFT length of i, and α is a weighting factor, which is set to 1. The idea is to use multiple FFT lengths so that different spectro-temporal resolutions are incorporated in the loss function.

However, since the output of the network is a gain mask applied to the noisy STFT with a single, fixed FFT length (512-point FFT in the case of NSNet), experiments were run using only a single FFT length.
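A PyTorch sketch of the loss is given below. The set of FFT lengths and the 75% overlap are assumptions following common practice for this loss; for the single-resolution experiment described above, fft_sizes would be reduced to (512,). The small eps constant stabilizes the logarithm.

```python
import torch

def multiscale_spectral_loss(s_hat, s, fft_sizes=(2048, 1024, 512, 256, 128, 64),
                             alpha=1.0, eps=1e-7):
    """Multi-scale spectral loss (Eqs. 3.1-3.3) between two time-domain signals."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=s.device)
        mag_hat = torch.stft(s_hat, n_fft, hop_length=n_fft // 4,
                             window=window, return_complex=True).abs()
        mag = torch.stft(s, n_fft, hop_length=n_fft // 4,
                         window=window, return_complex=True).abs()
        lin = torch.mean(torch.abs(mag_hat - mag))                       # MAE on magnitudes
        log = torch.mean(torch.abs(torch.log(mag_hat + eps) - torch.log(mag + eps)))
        loss = loss + lin + alpha * log                                  # Eq. 3.2, summed over i
    return loss
```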

The result can be seen in Figure 3.1: the denoised output sounds like a poorly working VAD. During some speech fragments the model acts as an all-pass filter, letting through both speech and noise, while other speech fragments are blocked entirely. As the output quality is clearly unacceptable, it was decided not to evaluate this model further.

(a) Noisy speech spectrogram.
(b) Denoised spectrogram.

Figure 3.1: Noisy speech and denoised spectrogram, for the model trained using the multi-scale spectral loss.
