
MSc Artificial Intelligence

Master Thesis

Modeling Partial Loudness of Audio Waves for Psychoacoustically Improved Machine Hearing

by

Laura Ruis

10158006

September 23, 2020

36 EC February - August, 2020

Supervisors:

M. Bruse

J. Alakuijala

Assessors:

Prof. Dr. W. Ferreira Aziz

Prof. Dr. P. S. M. Mettes

Google Research


Abstract

In this work we model the partial loudness of the spectral components that make up complex sounds, based on a dataset that we gather from four listeners. Understanding human perception of loudness is a long-standing goal in hearing science. Decades of psychophysical experiments provide insights into the acoustical phenomena that affect the loudness of sound. They show that spectral interactions in complex sounds mean that the overall loudness is not a simple summation over tones, and can cause suppression, masking, intensification, and dissonance. In this thesis we look at the partial loudness of complex sounds and add to these decades of psychophysical experiments by gathering a dataset with sine-on-sine and noise-on-noise partial masking patterns, where each listening test is labeled by four independent listeners. We learn several models on the dataset of sinusoid sounds. Previous work on partial loudness assumes knowledge of the spectral contents and levels of complex sounds. The first model we learn also uses this assumption; it significantly improves over a naive baseline and predicts partial loudness for both within-distribution and out-of-distribution data. We use this model to sample extra weakly-supervised training data for an end-to-end neural model. This second model drops the simplifying assumption of known spectral contents and levels used in the literature so far and predicts partial loudness directly from the raw waveform, conditioned only on the probe frequency. We pre-process the waveforms with discrete Fourier transforms and a bio-mimicking model of the human cochlea. This model is able to accurately predict partial masking on a within-distribution test set, but fails completely when tested outside of the training distribution. We analyze and discuss the patterns in the data and the predictions of the models, and all data and implementations are made publicly available.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions & Contributions

2 The Psychoacoustics of Loudness
  2.1 Sound Pressure Level, Loudness, and Critical Bands
  2.2 The Perception of Frequencies in Complex Sounds
  2.3 Modeling the Cochlea: CAR-FAC

3 Generating, Labeling, and Analyzing Partial Loudness Data
  3.1 The Problem Definition
  3.2 Listening Tests
    3.2.1 Partial Loudness Experiment
    3.2.2 ISO Equal-Loudness Reproduction Experiment
    3.2.3 Additional Partial Loudness Tests
  3.3 Data Analysis
    3.3.1 Partial Loudness Experiment
    3.3.2 ISO Equal-Loudness Reproduction Experiment
    3.3.3 Additional Partial Loudness Tests
  3.4 Discussion on Labeling

4 Predicting Partial Loudness of the Spectral Components of Sound
  4.1 The Spectral Partial Loudness Model - Modeling in the Spectral Domain
    4.1.1 Gradient-Free Optimization - Nelder-Mead
    4.1.2 Out-of-distribution Prediction
    4.1.3 Results
  4.2 The Waveform Partial Loudness Model - Modeling in the Waveform Domain
    4.2.1 Sampling Weakly-Supervised Training Data
    4.2.2 Feature Extraction
    4.2.3 The Models
    4.2.4 Experimental Setup
    4.2.5 Results
  4.3 Discussion on Modeling

5 Conclusion

Glossary

Appendices
  A Sound and the Human Ear
  B Ethical Data Collection
  C Listening Equipment
  D Table of References to Data and Implementations


Acknowledgements

I am very grateful to Martin Bruse for supervising this thesis. His enthusiasm has been infectious, he guided me from zero knowledge of audio and human hearing to where I am now, and he developed many of the tools used for data collection. I have rarely met someone who grasps complex subjects so quickly and easily. The collaboration has been very valuable to me. I am also grateful to Jyrki Alakuijala for enabling this collaboration in the first place and for gently keeping me on the right track throughout the research effort. I am thankful for valuable feedback from Thomas Fischbacher, both on writing and on modeling. Additionally, this work would not have been possible without the labeling effort by Martin Bruse, Moritz Firsching, Lode Vandevenne, and Jyrki Alakuijala. For this work I have been very lucky to receive feedback from an expert in human hearing research, for which I am grateful. Feedback from someone with as much experience and as broad a bird's-eye view of the field as Dick Lyon is invaluable. Finally, I'd like to thank both Wilker Aziz and Pascal Mettes for feedback and for examining the work for my graduation.

1 Introduction

1.1 Motivation

In this work we propose a novel model of the perceived loudness of the components that make up complex sounds. The human auditory system has evolved to efficiently process the changing air pressure that we call sound. How this happens is relatively well understood from a physical and mechanical perspective, yet understanding the factors that determine how we perceive sound is an open area of research. What we perceive is not only determined by the physical stimuli of sound, but involves many top-down processes like attention (Moore et al., 1997) and imagined speech (Tian et al., 2018). The field of psychophysics aims to quantify the complex relation between the physical stimuli we receive from the external world and the way we perceive them. Its subfield psychoacoustics experimentally identifies multiple separable ways of analyzing sound waves that play a role in perception. Perceived loudness, which deals with how we perceive sound on a scale from quiet to loud, is an important part of this.

Sound is a vibration of particles in a medium like air. This vibration manifests as a deviation in the pressure of that medium, which can be quantified logarithmically in decibels of sound pressure level. For one particular form of sound, a sine wave, the sound pressure level and frequency largely determine how loud we perceive it (Fletcher and Munson, 1933). At the same sound pressure level, a pure tone at the low end of the spectrum sounds quieter than a tone playing at 1000 Hz. In reality, we hardly ever hear pure sine waves, and for any sound that is not a pure sinusoid, how loud we perceive it is significantly more complex. When multiple pure tones play simultaneously, the result sounds louder than either alone. However, interactions between the sounds mean that the overall loudness is not a simple summation over tones (Stevens, 1956; Zwicker and Scharf, 1965). Decades of psychophysical experiments provide insights into these effects and relate them to physical stimuli (for a detailed review see Fastl and Zwicker, 2006). They show that frequencies in complex sounds interact, causing suppression, masking, intensification, and dissonance.

How these phenomena influence perception is a complex problem. Much research has been done on understanding perceived loudness (Fastl and Zwicker, 2006), but the term "loudness" is only defined for full signals. In this work we investigate the perceived loudness of the components that make up a full signal, hereafter called the partial loudness[1] of the spectral components of sound. A small part of psychoacoustic loudness research has looked at partial loudness. Zwicker and Jaroszewski (1982), for example, derive tone-on-tone suppression patterns for a set of sounds with two frequencies playing simultaneously, but extrapolating such results to more complex sounds over the full frequency spectrum seems impossible. Furthermore, that work looks at full masking, meaning it investigates how a sound can be rendered fully inaudible. We are also interested in partial masking, where only part of a sound is masked. Moore et al. (1997) propose a model that predicts loudness, hearing thresholds, and partial loudness. The model is a complex sequence of filters modeling the path of sound from free field through the auditory system, but the data used to derive it are not available. Additionally, the partial loudness predictions it makes are only accurate under specific conditions on the masking sound, and they require full knowledge of the spectral contents and levels of the sounds. In this thesis we aim to move closer to understanding how humans perceive the spectral components of sound by gathering data that covers as much of the acoustic phenomena caused by interacting frequencies as possible, while keeping the dataset small enough to be labeled by human listeners. Labeling acoustic data is a difficult and time-consuming process, and the data we gather does not cover the full spectrum or pressure range. In our model designs we therefore rely on inductive biases informed by hearing science. A first model aims to solve the simplified problem of predicting partial loudness conditioned on the spectral components and sound pressure levels. We call this model the "spectral partial loudness model" and use it to sample extra data for a second model that predicts partial loudness from the raw waveform. To the best of our knowledge, predicting partial loudness without knowledge of the spectral contents and levels of a complex sound has not been attempted before. For this latter model, which we call the "waveform partial loudness model", we extract rich features from a bio-mimicking model of the cochlea that represents auditory masking (Lyon, 2011c) and has been shown to improve many machine hearing tasks like music retrieval (Lyon et al., 2010) and music melody matching (Lyon, 2017). We fit the models to the data with gradient-free and gradient-based methods, and all data and models are made publicly available.

Perception of loudness is an important part of machine hearing, with applications in hearing aids, mastering of videos and music, and music recommendation. A pleasant hearing aid mimics how various sounds are perceived by humans with normal hearing, and music and video streaming services need to ensure each signal is on average equally loud. Additionally, quantifying the partial loudness of the spectral components of sounds can be useful for compression. Compression is a reduction in the number of bits used to represent data. It can be done in a lossless or lossy fashion. For the former, the data need to be decompressed to their exact original values. For the latter, information may be lost after decompression. In many cases we can discard the information that is not perceived by humans; to this end we need to know what is perceived by humans. Specifically, quantization artifacts can occur when allocating bits to spectral components of digital signals. A perceptually lossless compressed stream needs to make sure these artifacts are not noticeable (i.e., fully masked). For a perceptually lossy compression scheme the artifacts need to be equally noticeable across the entire stream, motivating the need to quantify the partial loudness of its spectral components. Further, state-of-the-art perceptual similarity metrics depend on comparing the similarity of spectral components (Hines et al., 2015; Beerends et al., 2013), and providing a way to quantify their partial loudness will likely improve the psychophysical accuracy of these metrics.

[1] A term first encountered in Moore et al. (1997).

1.2 Research Questions & Contributions

We propose a novel model of the perceived loudness of the components that make up complex sounds, i.e., partial loudness of spectral components of sound. There is no dataset publicly available to learn such a model from, motivating us to gather a new dataset. We aim to answer the following research questions.

How do we gather data on the partial loudness of spectral components of signals from human listeners such that as much as possible of the patterns caused by the acoustic phenomenon masking is covered?

Currently, to the best of our knowledge, there is no dataset available that quantifies the partial loudness of the spectral components of sound. In this thesis we carefully gather a dataset of complex sounds, the distribution of which is informed by hearing science, and label it with the partial loudness of its spectral components. We analyze the data and make it publicly available. The gathered data clearly show that more partial masking happens when tones are closer together on the frequency spectrum, and that frequencies can mask more up the spectrum than down. However, the data are also noisy, and we reflect on several ways to improve the methodology of collecting them in the future. After going over the necessary background material and related work in Section 2, the data generation, labeling, and analysis are discussed in Section 3.

Can we learn models from this data and use it to predict the partial loudness of the spectral components of unseen data of the same distribution?

The complexity of acoustical data makes it costly to label a dataset that covers all frequency combinations. We need to interpolate a model between the data points. Furthermore, we make use of prior knowledge to introduce helpful inductive biases into the models. In the first model, the "spectral partial loudness model", we fit curves conditioned on the spectrum and levels. The parameters of the distributions are learned with Nelder-Mead (Nelder and Mead, 1965). The second model is a neural model that predicts partial masking from the raw waveform, called the "waveform partial loudness model". The waveform is preprocessed with a model that mimics the excitation pattern of the cochlea, called the cascade of asymmetric resonators with fast-acting compression (CAR-FAC; Lyon, 2011a). Both approaches significantly improve over a naive baseline in terms of mean-squared error. We additionally judge the models empirically based on plots of their predictions. This research question is addressed in Section 4.

Is it possible to extrapolate outside of the data distribution?

The gathered data consist of sounds with two frequencies. In reality, the sounds we listen to almost always consist of more than that. Furthermore, the literature teaches us that bands of noise mask sounds differently than sinusoids do (Allen and Neely, 1997). In addition to the training data and within-distribution test set, we gather three out-of-distribution (OOD) test sets and test the models' predictive capabilities on them. The first is a two-tone set with sinusoids, but with different spectral contents and levels than seen in the training set. The second contains band-limited noise as maskers instead of sinusoids. The final set is a three-tone set with sinusoids. These datasets are specified and analyzed in Section 3. We test our models' predictive power on the OOD data in Section 4. The spectral partial loudness model (i.e., the simplified model conditioned on knowledge of the spectral contents and levels of the sound) succeeds on parts of the OOD data, but the waveform partial loudness model (i.e., the end-to-end model learned from the waveform) fails completely. We discuss several ways to overcome this issue. In Section 5 we conclude, discuss the problems encountered, and propose potential solutions.

2 The Psychoacoustics of Loudness

How the physical representation of sound relates to the psychophysical experience is a long-standing area of research and far from straightforward. From a physical perspective, sounds are variations in the pressure of the medium their waves vibrate in, and quantifying the magnitude of the sensation of sound pressure level is an important part of psychoacoustic analysis of sound. In the following we discuss the psychoacoustics of perceived loudness and related psychoacoustical phenomena. We additionally introduce the loudness experiments that we partly reproduce as a comparison of experimental conditions. We discuss auditory masking and position this research relative to two important related works. In this thesis we attempt to model auditory masking from little and noisy data. To this end we introduce inductive biases into the model designs, one of which is a model of the human cochlea that provides rich features for machine hearing applications; its workings are summarized in the following. Appendix A gives a brief overview of sound from a physical perspective and of the human cochlea.

2.1 Sound Pressure Level, Loudness, and Critical Bands

The sound pressure level (SPL) measures the deviation from ambient air pressure (p_a) relative to some reference scale for pressure deviations (in air often p_0 = 20 µPa) in decibels (dB) and is defined by the following relation:

$$L_p = 20 \log_{10}\left(\frac{p - p_a}{p_0}\right)\ \mathrm{dB} \qquad (1)$$
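To make Equation 1 concrete, here is a minimal sketch of the conversion between pressure deviation and dB SPL. The function names are illustrative and not from the thesis code (which is written in Golang).

```python
import numpy as np

P0 = 20e-6  # reference pressure p0 = 20 micropascal


def spl_db(p, p_ambient=0.0):
    """Sound pressure level (Equation 1) of a pressure p in pascal."""
    return 20.0 * np.log10((p - p_ambient) / P0)


def pressure_from_spl(level_db, p_ambient=0.0):
    """Inverse of Equation 1: pressure deviation for a given dB SPL."""
    return p_ambient + P0 * 10.0 ** (level_db / 20.0)


# A pressure deviation of 0.02 Pa corresponds to 60 dB SPL.
assert round(spl_db(0.02)) == 60
```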

Loudness is the perception of sound pressure and is studied as part of the field of psychoacoustics. Experiments with human listeners have shown that the loudness of pure tones (i.e., sine waves) depends on frequency and level (Fletcher and Munson, 1933). Fletcher and Munson first derived equal-loudness contours (see Figure 1). These contours show which frequencies and levels are perceived to be equally loud by human listeners. Loudness is quantified on a phon scale (Robinson and Dadson, 1956), which is defined to be equal to the level in dB SPL of a 1 kHz pure tone of the same loudness. Perception of sound level relates differently to physical stimulus parameters for sine waves (i.e., pure tones) than for noise (Allen and Neely, 1997). Many more psychoacoustic experiments show how the loudness of more complex sounds depends on physical stimulus parameters in addition to frequency and level, like bandwidth, duration, and temporal structure (Seeber, 2008). To make matters more complex, Steinberg (1937) showed that loudness depends heavily on properties of the individual listener. For comprehensive reviews of loudness experiments, refer to Moore (2003) and Fastl and Zwicker (2006).

Figure 1: The Fletcher-Munson equal-loudness contours (image from Fastl and Zwicker (2006)), which first showed that different frequencies at the same level are perceived at different loudnesses. Each contour shows at what level a particular frequency must play to be perceived as equally loud as a 1 kHz tone, and the level at which the 1 kHz tone plays defines a particular phon and sone level. E.g., a 1 kHz tone at 40 dB is perceived as equally loud as a 0.05 kHz tone at ±70 dB, meaning a 0.05 kHz tone at 70 dB has 40 phons or 1 sone.

The psychoacoustic experiments mapping physical stimuli to perception reveal the complex nature of the problem at hand and motivate the development of many analytical expressions to model this mapping. Early laws that attempt to describe in analytical form how perception varies as a function of physical parameters are accurate at some parts of the audible spectrum of frequency and level and less so at others (Stevens, 1961).


The method that dominates current approaches to quantifying loudness first calculates the estimated excitation pattern from a spectral analysis based on properties of the human auditory system and from this calculates the overall loudness (Zwicker, 1958, 1960). These models were originally developed for steady-state sounds for normal listeners and later extended for time-varying sounds (Zwicker, 1977) and revised for hearing-impaired listeners (Chalupper and Fastl, 2002). The method became a standard in 1991 (Zwicker et al., 1991). An important notion used in this model is that of critical bands (Fletcher and Munson, 1933; Fletcher, 1940). The concept of critical bands postulates that there are 'bandwidths' of frequencies that interact, and frequencies outside of these bandwidths act relatively independently. How wide these bandwidths are and where the cut-off frequencies lie on the spectrum depends on the problem being considered (Lyon, 2017). When applied to loudness, the notion of critical bands means that for signals with similar frequencies their intensities add, having a smaller effect on the resulting loudness than for signals with frequency content that is separated by more than a critical bandwidth. In reality these effects are more complex, and different experiments discover different critical bands. The derivation of critical bands we use in this thesis is by Zwicker (1961). Zwicker derived the critical bands shown in Table 1 from listening tests in which several acoustic phenomena occur, like masking, perception of phase, and loudness of complex sounds. Zwicker assumed in his measurements that the exact location of the bands on the spectrum is not fixed, but the width of the bands is.

Table 1: Critical Bands as measured by Zwicker (1961). The leftmost column shows the Bark scale. This scale transforms the frequency scale in Hz to Bark, postulating that within each critical band frequencies interact, and frequencies act relatively independently when from different critical bands. In reality the bands are not a step-wise function with fixed cut-off points but a continuous function of frequency.

Critical Band (Bark)   Center Frequency (Hz)   Cut-off Frequency (Hz)   Bandwidth (Hz)
 1                       50                      100                      100
 2                      150                      200                      100
 3                      250                      300                      100
 4                      350                      400                      100
 5                      450                      505                      110
 6                      570                      630                      120
 7                      700                      770                      140
 8                      840                      915                      145
 9                     1000                     1080                      160
10                     1170                     1265                      190
11                     1370                     1475                      210
12                     1600                     1720                      240
13                     1850                     1990                      280
14                     2150                     2310                      320
15                     2500                     2690                      380
16                     2900                     3125                      450
17                     3400                     3675                      550
18                     4000                     4350                      700
19                     4800                     5250                      900
20                     5800                     6350                     1100
21                     7000                     7650                     1300
22                     8500                     9400                     1800
23                    10500                    11750                     2500
24                    13500                    15250                     3500
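As an illustration of how Table 1 is used later in this thesis, the sketch below maps a frequency in Hz to its 1-based critical band number via the table's cut-off frequencies. This lookup is illustrative code of our own, not part of the thesis implementation.

```python
# Cut-off frequencies (Hz) of the 24 critical bands in Table 1 (Zwicker, 1961).
ZWICKER_CUTOFFS_HZ = [
    100, 200, 300, 400, 505, 630, 770, 915, 1080, 1265, 1475, 1720,
    1990, 2310, 2690, 3125, 3675, 4350, 5250, 6350, 7650, 9400, 11750, 15250,
]


def bark_band(f_hz):
    """Return the 1-based critical band (Bark number) that f_hz falls in."""
    for band, cutoff in enumerate(ZWICKER_CUTOFFS_HZ, start=1):
        if f_hz <= cutoff:
            return band
    raise ValueError("frequency lies above the 24th critical band")


assert bark_band(1370) == 11  # center frequency of band 11 in Table 1
```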

2.2 The Perception of Frequencies in Complex Sounds

The term "loudness" does not apply to the perceived loudness of the spectral components of sounds, and most research on perception of sound pressure level concerns the overall loudness of full signals. When quantifying the perceived level of the spectral components of sounds instead, an important phenomenon is masking. Auditory masking is the effect of one signal on the perception of another signal. Masking occurs both in the spectral domain and the time domain. The former refers to components of signals that are suppressed or less audible when played simultaneously, and the latter refers to signals that are suppressed or less audible when played non-simultaneously. Here, we focus on simultaneous masking of steady-state sounds.

The notion of critical bands describes how frequencies interact, and in the context of masking it postulates that signals that are spectrally similar tend to mask each other more strongly (Zwicker and Jaroszewski, 1982).


This can be seen in Figure 2, where the full masking patterns of a test tone under a 1 kHz masker at different levels are shown. When the test tone is close to the masker in frequency, almost 50 dB of sound can be rendered inaudible. When complex sounds interact, the overall loudness does not simply sum, due to masking. A model to predict the overall loudness of sounds under masking conditions was proposed by Zwicker and Scharf (1965). It calculates excitation patterns from the physical representation of sound, which are then used in an analytical expression to calculate the overall loudness. The model is derived from experimental data consisting of single tones and four-tone complexes masked by bands of noise. They show that their model yields predictions of the same pattern as the experimental data, but differs when compared in an absolute way. How to derive partial loudness from this model, however, is unclear. Moore et al. (1997) revise this model to better fit experimental data collected by Zwicker (1963) and to be usable for predicting the partial loudness of sounds without the need for correction factors. The model is a sequence of filters representing the transfer of sounds from free field to the excitation patterns in the brain. The authors add that the partial loudness calculations come with a caveat: they are only accurate when the masker is a relatively broadband sound without strong temporal fluctuations.

Figure 2: Image taken from Zwicker and Jaroszewski (1982): "Tone-on-tone masking patterns for a simultaneous masker at f_m = 1,000 Hz with a level L_m = 20-60 dB SPL in 10 dB steps. The excitation patterns for maskers at 20, 40, and 60 dB SPL are also plotted mirrored in frequency to emphasize the non-linearity of the pattern (dashed dotted line)." Each curve shows how many dB SPL are fully masked (and thus rendered inaudible) under masking by a 1 kHz tone at different levels. Note that in this work we look at partial masking instead of full masking, and how these patterns translate to partial masking is unclear.

2.3 Modeling the Cochlea: CAR-FAC

With the invention of the Helmholtz resonator to identify frequencies in complex sound (Von Helmholtz, 1863), the default way to model the cochlea[2] and do spectral analysis became the filter bank (Seeber, 2008). A filter is anything that takes a signal, performs computations on it, and produces a filtered output signal. A filter bank is an array of band-pass filters, each of which separates out part of the input signal. A more efficient way of modeling how sound travels in the cochlea is the filter cascade, and a recent approach that uses this to realistically model the cochlea is the bio-mimetic model called "cascade of asymmetric resonators with fast-acting compression" (CAR-FAC; Lyon, 2011a). CAR-FAC mimics auditory physiology like masking and is based on a pole-zero filter cascade (PZFC) model of auditory filtering in combination with a multi-time-scale coupled automatic-gain-control (AGC) network (Lyon, 2011c), both briefly discussed below.

The job of the cochlea is to translate the physical vibrations of sound into electrical information the brain can recognize. This is accomplished on the basilar membrane, a rigid element that extends across the length of the cochlea. The membrane consists of fibers with different resonant frequencies. The brain can differentiate frequencies in sound because higher-frequency waves vibrate fibers close to the outer ear, and lower-frequency waves vibrate fibers at the other end of the membrane. The pattern is passed from the cochlear nerve to the cerebral cortex, where the brain can interpret it. Louder sounds, for example, release more energy and move a greater number of hair cells near the fibers with matching resonant frequency. The brain maps the electrical impulses sent by the cochlear nerve to the sound we perceive.

[2] For a brief introduction to the human cochlea, the reader is referred to Appendix A.

[Figure 3: block diagram of CAR-FAC: a cascade of filter stages whose half-wave rectified outputs drive AGC filters, which feed a control signal back to the filter stages; the signal flows from a single input to per-stage outputs.]

Figure 3: Image taken with permission from Lyon (2011c), a schematic depiction of the CAR-FAC model of peripheral auditory filtering. The cascaded filter stage is shown at the top and the feedback automatic gain control (AGC) at the bottom. The former models the wave propagation in the cochlea with filters and the latter reduces the dynamic range of the output signal levels. CAR-FAC can be used to extract features from raw waveforms of sounds that can represent acoustic phenomena like auditory masking.

The pole-zero filter cascade (PZFC) part of CAR-FAC models the wave propagation in the cochlea with an approximate method for solving nonuniform distributed wave systems (Lyon, 1998). A filter can, together with the gain constant, be completely characterized by its poles and zeros: the roots of the denominator and numerator polynomials of the transfer function, respectively. The transfer function is simply the algebraic representation of the filter in the frequency domain. The PZFC is schematically depicted in Figure 3 (top). The parameters of the PZFC are learned with a nonlinear optimization procedure, and the resulting filter cascade is shown to fit human psychophysical data better than any previously considered model (Lyon, 2011b). The bottom of Figure 3 shows the AGC filters. Automatic gain control is a circuit designed to reduce the dynamic range of the output signal level of a system by amplifying weak signals more than strong ones. The AGC filter of CAR-FAC compresses a wide input dynamic range into a narrower output dynamic range (Lyon, 2017). This stage models the adaptation of the cochlear gain. The resulting full CAR-FAC model can, among other things, handle auditory masking; it represents audible differences and suppresses inaudible differences (Lyon, 2011c). The features that can be obtained from CAR-FAC are shown to be effective in downstream tasks such as music retrieval (Lyon et al., 2010), audio fingerprinting, cover-song detection, and speech recognition (Lyon, 2011c, 2017).
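The following toy sketch illustrates the AGC idea in isolation, for a single channel: a smoothed estimate of the output power feeds back to reduce the gain, so strong inputs are compressed more than weak ones. It is a deliberately simplified stand-in for intuition, not the multi-channel, multi-time-scale AGC network of CAR-FAC; all constants are invented.

```python
import numpy as np


def toy_agc(x, smoothing=0.005, strength=50.0):
    """Single-channel automatic gain control (illustration only).

    Tracks a smoothed estimate of the output power and lowers the gain
    as that power grows, compressing a wide input dynamic range into a
    narrower output range.
    """
    gain, power = 1.0, 0.0
    out = np.empty_like(np.asarray(x, dtype=float))
    for i, sample in enumerate(x):
        out[i] = gain * sample
        power = (1 - smoothing) * power + smoothing * out[i] ** 2
        gain = 1.0 / (1.0 + strength * power)  # stronger signal -> lower gain
    return out
```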

3 Generating, Labeling, and Analyzing Partial Loudness Data

3.1 The Problem Definition

The goal of this thesis is to quantify the partial loudness of the spectral components of a sampled audio signal x. The spectral components of an audio signal are all the frequencies that make up the signal. Each frequency is perceived by humans at a particular level. This perceived partial loudness of a frequency depends on the other spectral components of the sound and their levels, among other things.[3] We denote the sampling rate (i.e., the number of samples taken of the continuous signal per second) of the discretized audio signal by f_s, the total number of samples under consideration by N, and the audio signal by s, a function that maps time to amplitude: $s: \mathbb{N}_0 \to \mathbb{R}_{\geq 0}$.

We denote the frequency by f and distinguish probe frequencies and masker frequencies, where the former is the frequency of which we currently want to find the partial loudness and the latter are the frequencies that are assumed to mask a part of the probe frequency. We want to learn a mapping g from a sampled signal x = (s(t_1), s(t_2), . . . , s(t_N)) to a partial loudness for a probe frequency on a Bark scale. Denote the probe frequency on a Bark scale by cb_{f_p} and the partial loudness of that probe by l_{p,c} (where c denotes the critical band, i.e., Bark number, that f_p falls in). Recall that the Bark scale transforms the frequency scale in hertz to another scale where each critical band contains all frequencies that interact. The mapping g depends on the sampled signal x and the current probe frequency under consideration cb_{f_p}, and maps any number of samples N to a single partial loudness:

$$g: \bigcup_{N \geq 1} \mathbb{R}^N \times \mathbb{R} \to \mathbb{R}_{\geq 0}$$
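In code, this learning problem can be summarized by the following hypothetical interface (a stub of our own, not the thesis implementation): an arbitrary-length waveform plus a probe position on the Bark scale in, one nonnegative partial loudness out.

```python
import numpy as np


def g(x: np.ndarray, cb_probe: float) -> float:
    """Map a sampled signal x (any length N >= 1) and a probe frequency
    on the Bark scale to a nonnegative partial loudness in dB.

    Sections 4.1 and 4.2 describe two learned instantiations of this mapping.
    """
    raise NotImplementedError
```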

3.2 Listening Tests

In order to learn the mapping g, we collect data from human listeners. This section relates to the first research question:

How do we gather data on the partial loudness of spectral components of signals from human listeners such that as much as possible of the patterns caused by the acoustic phenomenon masking is covered?

Spectral interactions between components of complex sounds make it impossible to use the standards that map single frequencies to loudness to reason about signals with more than one frequency. We need a dataset consisting of signals with more than one frequency playing simultaneously. The continuous nature of the problem makes it infeasible to 'label' all combinations of frequencies and sound pressure levels with their partial loudness directly. Additionally, the partial loudness under masking by sinusoids differs from that under masking by, for example, bands of noise. Collecting and labeling a dataset that covers all phenomena is infeasible, and we make use of findings from hearing science to select parts of the problem domain such that the data reflect as much of the phenomena as possible with as little redundancy as possible. In the following we discuss the data generation and labeling procedure for the partial loudness experiment and an additional experiment done to compare experimental conditions with those in the literature. We specify three extra data sets gathered to allow testing for out-of-distribution generalization. All code used to generate the audio as described in the following sections is written in Golang and open-sourced.[4] Refer to Appendix D for links to all implementations.

3.2.1 Partial Loudness Experiment

The Data Distribution: We construct a dataset consisting of two-tone sounds (i.e., two superpositioned sinusoids), where one tone is the masker frequency (f_m) and one the probe frequency (f_p). The simplifying assumption here is that the probe does not mask the masker, although in reality this might happen. Let A_p be the amplitude of the probe and A_m the amplitude of the masker:

$$s_p(t) = A_p \sin(f_p t)$$
$$s_m(t) = A_m \sin(f_m t)$$
$$s_c(t) = s_p(t) + s_m(t)$$

From hearing science we know that the closer two tones are together on the frequency spectrum, the more they mask each other (Zwicker and Jaroszewski, 1982).

[3] The partial loudness may depend on many more factors, like attention, which we ignore in this work.
[4] https://github.com/google-research/korvapuusti


Figure 4: A heatmap of masker-probe frequency combinations present in the gathered dataset on a Bark scale, where the horizontal axis shows the probe frequency and the vertical axis the masker frequency. There are 6 examples of each masker-probe combination, which cover 3 masker levels combined with 2 probe levels.

Figure 5: The ISO reproduction examples plotted on the equal-loudness contours. Each curve represents a particular phon level that sounds equally loud. The red crosses show the probe tones present in the dataset, and the blue crosses the reference tones at decibel levels 40, 50, and 60. Note that the locations of the data points on the Y-axis reflect the level we expect according to the ISO standard, not the actual data. The listeners are asked to listen to a sound at a red cross and compare its loudness to the sound at the blue cross on the same contour.

However, distinguishing a particular tone in a combination of two tones from the same critical band played simultaneously is impossible due to a beating sensation that will be perceived by the listener. This motivates us to gather a dataset in the following way. We select five critical bands spread over the audible spectrum (which for humans ranges from 20 to 20 000 Hz) and take the center frequency of each to be a masker frequency. For each masker frequency, we select a specific set of probes that depends on the location of the masker frequency on the spectrum. Denote the critical band of the masker by cb_m; the probe frequencies f_p for each masker are then chosen to be the center frequencies of all critical bands cb_i for all i ∈ {m−4, . . . , m+4} \ {m}. Outside this range we take every other critical band to be a probe frequency. We use the critical bands defined by Zwicker (1961) (see Table 1 in Section 2).

To generate the data we need to select the amplitudes A_p and A_m. As mentioned in Section 2.1, the SPL can be calculated from the pressure with Equation 1, and we can use its inverse to calculate the amplitude from the SPL. To make sure that a full-scale sine at A = 1 is calibrated to be playing at 90 dB, we calculate the amplitude with A = 10^{(L_p − 90)/20}. We generate samples with levels L_{p,p} ∈ {30, 60} for the probe tones and L_{p,m} ∈ {40, 60, 80} for the masker tones. This gives a total of 6 listening tests for each probe-masker combination. In Figure 4 this is visualized in terms of critical band combinations: the vertical axis shows the masker frequency on a Bark scale, and the horizontal axis the probe frequency. Note that some rows in Figure 4 suggest that five critical bands of probes above or below a masker frequency are combined (i.e., for the masker frequency at cb_m = 16). This is due to the probe frequencies selected at every other critical band coinciding with the probes around the masker frequency.
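A minimal NumPy sketch of this generation procedure is shown below. The actual generation code is written in Golang (see the korvapuusti repository); the sampling rate and function names here are illustrative assumptions. Note that synthesis uses the angular frequency 2πf, which is left implicit in the notation above.

```python
import numpy as np

FS = 48_000  # assumed sampling rate, samples per second


def amplitude_from_spl(level_db):
    """Amplitude such that a full-scale sine (A = 1) is calibrated to 90 dB."""
    return 10.0 ** ((level_db - 90.0) / 20.0)


def two_tone(f_probe, l_probe, f_masker, l_masker, seconds=1.0):
    """Superposition of a probe and a masker sinusoid at the given dB levels."""
    t = np.arange(int(FS * seconds)) / FS
    s_p = amplitude_from_spl(l_probe) * np.sin(2 * np.pi * f_probe * t)
    s_m = amplitude_from_spl(l_masker) * np.sin(2 * np.pi * f_masker * t)
    return s_p + s_m


# One listening-test stimulus: a 150 Hz probe at 30 dB under a 50 Hz masker at 80 dB.
x = two_tone(150, 30, 50, 80)
```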

The Labeling Procedure: To mitigate the fact that perceived level depends on the individual listener (Steinberg, 1937), all data points get labeled by four independent listeners, each of whom is supplied with a decibel meter and a pair of headphones.[5] Figure 6a shows the front-end of the listening test. To quantify how one spectral component can mask another, human listeners are asked to adjust the level of the probe sinusoid in quiet until it matches the perceived partial loudness of that same tone under masking conditions. Since we set a full-scale sine (A = 1) to be L_p = 90, listeners are asked to calibrate the volume of a full-scale sinusoid on the machine to be playing at 90 dB with the provided decibel meter (step 2 in Figure 6a).[6] The listeners are then asked to adjust the volume of the single tone (which plays s_p(t)) until it is perceived to be equally loud as that same frequency in the combined tones (s_c(t)) (steps 4-7 in Figure 6a). If the listeners are not able to hear the probe tone at all, they can choose this option on the screen (step 8 in Figure 6a). This might happen, for example, if the frequency is too high to be perceived at all. The listening tests label the data with the partial loudness of the probe frequency under masking by the masker frequency.

[5] See Appendix C for specifications of the equipment used.
[6] To be sure that the machine the listener is using doesn't use some nonlinear volume control, the listeners also need to check that one other sine plays at 75 dB and another at 60 dB.

Example 3.1. Let the masker tone be at critical band cb_m = 1 with center frequency f_m = 50 and L_{p,m} = 80, and the probe tone at cb_p = 2 with f_p = 150 and L_{p,p} = 30. This gives rise to the signal s_c(t) = 80 sin(50t) + 30 sin(150t), from which we sample N points to get x. The human listener might adjust the volume of the probe tone to l_{p,2} = 20, effectively giving L_{p,p} − l_{p,2} = 30 − 20 = 10 dB of masking. This gives a single labeled example with partial loudness l_{p,2} = 20.
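The label-to-masking conversion in Example 3.1 amounts to the following small helper (illustrative code; the zero floor for negative masking follows the convention used later in Section 3.3.1).

```python
def masked_db(probe_level_db, partial_loudness_db):
    """Decibels of partial masking: generated probe level minus perceived level,
    floored at zero (negative masking, i.e., intensification, counts as none)."""
    return max(0.0, probe_level_db - partial_loudness_db)


assert masked_db(30.0, 20.0) == 10.0  # Example 3.1
```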

3.2.2 ISO Equal-Loudness Reproduction Experiment

Recall from Section 2.1 that the equal-loudness contours quantify the loudness of single-frequency tones relative to a 1 kHz tone. The Fletcher-Munson equal-loudness contours were redefined by ISO to a standard in 2003 and are now called ISO equal-loudness contours (Mellert et al., 2003). In order to compare the conditions of our experiment to existing standards, we add a small set of ISO equal-loudness examples to the data. We use the most recent ISO equal-loudness standard: ISO 226:2003. A tone playing at X decibels at 1 kHz is said to have X phons. For this experiment we select reference tones playing at 3 levels, namely 40, 50, and 60 dB, and 8 probe tones for each reference tone, spread out over the audible spectrum. These data points are visualized in Figure 5. From this visualization one can see that, according to the ISO standard, a tone playing at the same decibel level but with a lower frequency is perceived to be less loud. The reference tones are depicted with blue crosses in Figure 5, and the accompanying probes are the red crosses on the same curve, of which there are 8 per reference tone. Note that the location of the probes on the Y-axis is where the ISO experiments would place them; in reality they are playing at the same level as the reference tone.

We ask the listeners to adjust the level of a probe tone until it sounds equally loud as the 1 kHz reference tone playing at a particular level (see Figure 6b). We can compare the data gathered in this experiment to the equal-loudness contours. It should be noted that the equal-loudness contours were obtained from listening tests under different conditions: we ask listeners to use headphones, whereas the ISO standard performs listening tests in a free progressive plane wave. Furthermore, the ISO listeners are in the age range of 18 through 25, whereas our listeners are in an older age range, with the youngest being 27. The ISO-reproduction listening tests provide us with a decibel level that is perceived to be equally loud as a 1 kHz reference tone playing at a particular level.

Example 3.2. Let the reference 1 kHz tone be playing at 40 dB and the probe a 100 Hz tone. The listener might adjust the volume of the probe to 60 dB, effectively saying that a 100 Hz tone at 60 dB is perceived at 40 phons.

3.2.3 Additional Partial Loudness Tests

The sine-on-sine masking discussed so far is limited in scope, and we want to compare the partial masking patterns derived from this data with patterns under different conditions. We ask the listeners to label three extra, out-of-distribution (OOD) test sets that can be used to test models outside of the training distribution.

OOD set 1: sine. The first set we label contains sine-on-sine tones like the training set, but with an unseen masker frequency, unseen probe frequencies, and unseen probe and masker levels. The training set contains only five different frequencies as maskers, whereas the audible spectrum contains any frequency between 20 and 20 000 Hz. We want to be able to test whether the training set is representative of other parts of the audible spectrum. We additionally choose different masker and probe levels, to test whether the patterns in the training set translate across differences in sound pressure level. In this test set, we take the masker frequency to be 843 Hz (8 Bark), the masker levels 50 and 70 dB, and the probe levels 20 and 40 dB. The probe frequencies are the range of all frequencies from 100 Hz to 15 000 Hz with steps of one ERB. The ERB (equivalent rectangular bandwidth; Moore and Glasberg, 1983) scale is another scale that approximates the bandwidths of filters in human hearing, similar to the Bark scale. By using the ERB scale we ensure the probe frequencies will be different from those seen in the training set.

OOD set 2: noise. The second test set contains bands of noise instead of sinusoids, for both the probe and the masker frequency. The generated noise for the masker has a center frequency that also occurs in the training set, namely 1370 Hz (11 Bark), playing at 80 dB, and the noise is half a critical band wide. The probe frequencies are again the range of all frequencies from 100 Hz to 15 000 Hz with steps of one ERB, playing at 30 dB. This test set can be used to determine whether the patterns in the training set translate to types of sound other than superpositioned sinusoids.

OOD set 3: threetone. The third additional test set contains sinusoids, but each example contains two masker frequencies instead of one. We take masker frequencies that both appear in the training set: 1370 Hz (11 Bark) and 2908 Hz (16 Bark), each playing at 80 dB. The probe frequencies are again the range of all frequencies from 100 Hz to 15 000 Hz with steps of one ERB, playing at 60 dB. This test set can be used to determine whether the patterns in the training set translate to sounds with more than two frequencies.
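To illustrate the probe grid shared by all three OOD sets, the sketch below generates frequencies from 100 Hz to 15 000 Hz in steps of one ERB, using the common ERB-rate approximation of Glasberg and Moore (1990). The thesis cites Moore and Glasberg (1983), whose parameterization differs slightly, so treat this as an approximation of the actual grid.

```python
import numpy as np


def erb_rate(f_hz):
    """ERB-rate (ERB number) of a frequency, Glasberg & Moore (1990) formula."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(e):
    """Inverse of erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def ood_probe_frequencies(lo_hz=100.0, hi_hz=15_000.0):
    """Probe frequencies spaced one ERB apart between lo_hz and hi_hz."""
    return [erb_rate_to_hz(e)
            for e in np.arange(erb_rate(lo_hz), erb_rate(hi_hz), 1.0)]
```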


(a) The screen presented to the listeners for the listening test of type A, the probe-masker-combinations, as described in Section 3.2.1.

(b) The screen presented to the listeners for the listening test of type B, the ISO Equal-Loudness reproduction examples, as described in Section 3.2.2.

Figure 6: These screens are the front-end of the listening tests presented to four individual listeners, asking them to label a particular probe frequency with its partial loudness (Figure a) or with the level at which it sounds equally loud as a 1 kHz reference tone (Figure b).

3.3 Data Analysis

The total number of labeled examples obtained is 516 for the partial loudness experiment, 24 for the ISO equal-loudness experiment, 144 for OOD set 1, 36 for OOD set 2, and 36 for OOD set 3. All four listeners failed to hear probe tones at 17 625 Hz, and one listener failed to hear any tone above 10 000 Hz. This leaves us with 488 labeled examples for the masking experiment and 24 examples for the ISO reproduction experiment (where 3 of the 24 ISO examples are labeled only by the three listeners who could hear the probe). To measure the reliability of our data, we calculate Krippendorff's alpha coefficient as a measure of agreement between the four listeners (Krippendorff, 2011). Krippendorff's alpha relates the disagreement observed between listeners to the disagreement expected by chance. Perfect reliability of the data gives α = 1, and no reliability at all gives α = 0. The coefficient can be negative when disagreement exceeds the expected chance disagreement. For a detailed example of how to compute Krippendorff's alpha, refer to Krippendorff (2011). The author suggests that it should be customary to require α ≥ 0.8 for reliable data, and that α ≥ 0.667 is the lowest possible coefficient at which to accept any conclusions from the data. All labeled data are made publicly available, and links to the locations where they are deposited can be found in Appendix D.
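Agreement can be computed with, for example, the third-party krippendorff Python package; the sketch below uses invented answers for four listeners on four listening tests, with NaN marking a probe a listener could not hear.

```python
import numpy as np
import krippendorff  # third-party package, e.g. `pip install krippendorff`

# Rows: the four listeners; columns: listening tests. Values are the
# answered partial loudness in dB; NaN marks an inaudible probe.
# These numbers are invented for illustration only.
answers = np.array([
    [20.0, 35.0, 28.0, np.nan],
    [22.0, 33.0, 30.0, 41.0],
    [18.0, 36.0, 27.0, 40.0],
    [21.0, 34.0, 29.0, 43.0],
])
alpha = krippendorff.alpha(reliability_data=answers,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha: {alpha:.3f}")
```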

3.3.1 Partial Loudness Experiment

Krippendorff's alpha coefficient for the partial loudness data is α = 0.703 before any preprocessing, meaning we are allowed to draw conclusions, but agreement is not ideal. In order to obtain a dataset with sufficient reliability, we drop examples for which the empirical variance between the answers is too large. To this end we calculate the empirical variance for all examples and combine them into a histogram. All examples that lie above the 85th quantile are dropped, causing us to drop 66 examples. We then split the remaining examples into a training set and a test set, where the test set constitutes a random subset of 15% of the total data. This leaves us with |D| = 318 training examples and 56 test examples. The training set obtains a Krippendorff's alpha coefficient of α = 0.785.
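A sketch of this preprocessing step, assuming the labels are arranged as an (examples × listeners) array; the quantile, split fraction, and seed mirror the text, but the function itself is illustrative.

```python
import numpy as np


def filter_and_split(labels, var_quantile=0.85, test_frac=0.15, seed=0):
    """Drop examples whose across-listener variance exceeds the given quantile,
    then split the remainder into train and test indices."""
    variances = labels.var(axis=1)
    keep = np.flatnonzero(variances <= np.quantile(variances, var_quantile))
    rng = np.random.default_rng(seed)
    rng.shuffle(keep)
    n_test = int(round(test_frac * len(keep)))
    return keep[n_test:], keep[:n_test]  # train indices, test indices
```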

We can visualize the data in the style of Zwicker and Jaroszewski (1982) by calculating the partial masking as the actual sound pressure level minus the perceived loudness. We calculate the mean partial loudness from the data and plot the difference with the actual sound pressure level for each example, where we set any example below zero to zero, assuming that negative masking (i.e., intensification) means no masking occurs. Figure 8 shows the partial masking patterns for all masker frequencies and probe level combinations. The Y-axis of each subplot shows how much of the signal is masked in decibels, and the X-axis the probe frequency on a log scale. The vertical lines on the data points show the empirical variance between listeners. The green dotted vertical line across each plot provides a reference for the masker frequency, and the horizontal line for the actual probe level. These plots show that maskers can mask more up the frequency spectrum than down. This is consistent with findings from the literature on partial masking (Zwicker and Jaroszewski, 1982; Moore, 2003). Furthermore, the higher the masker level, the more masking occurs, and a masker at level 40 seems unable to mask probes at levels 30 and 60. The data seem noisy, however. Although Krippendorff's alpha suggests there is sufficient agreement between listeners, the variance between the answers is often large, and the pattern is not always consistent with the expectation that being spectrally closer to the masker means more masking (e.g., for masker frequency 568 Hz, probe level 30 dB, and masker level 80 dB). Additionally, for the low and high frequencies (60 Hz and 7000 Hz), no clear patterns are observed.


3.3.2 ISO Equal-Loudness Reproduction Experiment

Figure 7: The ISO equal-loudness contours plotted with the labeled data. The blue crosses depict the reference 1 kHz tones at 40, 50, and 60 dB, and the colored dots are the examples that are expected to be close to the curve of the same color. The vertical bars show the variance between the answers of the listeners. For this experiment, listeners are asked to listen to the sounds at each of the red dots and adjust the loudness until it is equally loud as the tone at the blue cross on the curve with the same color as the dot. The SPL in dB they answer determines the location of the dots on the Y-axis.

In Figure 7 the labeled data are plotted on the equal-loudness contours. This experiment contains 24 examples labeled by four independent listeners. One of the listeners was not able to hear the probe tone at 10 313 Hz. When dropping this particular label from the data, the Krippendorff alpha is α = 0.447, meaning the disagreement between the listeners is too high to draw reasonable conclusions from the data. All listeners reported that it was difficult to compare the loudness of tones that differ in frequency. From Figure 7 it can be seen that the highest variance occurs at the highest-frequency example. If we drop this example, the Krippendorff alpha becomes α = 0.932, meaning we can draw conclusions from the other probe tones.

The observed pattern does not seem to match the ISO equal-loudness contours as inferred from listeners by Mellert et al. (2003). We can conclude from this experiment that our experimental setup differs too much from that used in the ISO standard to give comparable results. Additionally, the variance between the answers makes it impossible to derive a pattern.

3.3.3 Additional Partial Loudness Tests

OOD set 1: sine. In Figures 9 and 10 the labeled first OOD test set is shown. Again a clear pattern of tones masking more up the frequency spectrum than down is visible, and higher-level maskers mask more. The masker at 50 dB is not able to mask higher-frequency probes at 40 dB. The Krippendorff's alpha coefficient for this data is α = 0.807, meaning we can draw conclusions without any preprocessing. Nevertheless, the empirical variance between labels for the probe tones with a higher frequency than the masker frequency seems high.

OOD set 2: noise. In Figure 11 the results of noise-on-noise partial masking tests are shown. Interestingly, the labels of different listeners have less variance between them than for experiments with sinusoids. Other than this, the pattern seems similar. The example around 10 000 Hz is likely an error, and when we remove it the set gets a Krippendorff's alpha of α = 0.857.


Figure 8: All partial masking patterns derived from the listening tests. The overall vertical axis shows increasing masker frequencies and the overall horizontal axis increasing probe levels. Each data point in a plot shows the amount of masked decibels (on the sub Y-axes) for a particular probe frequency (on the sub X-axes) for a particular masker frequency and probe level. These patterns show that for masker levels 80 and 60, the closer the probe frequencies are to the masker frequency, the more they are masked, and that masker frequencies can mask more up the spectrum than down. For some masker-level and probe-level combinations no patterns are seen and the data are noisy.


Figure 9: Masking patterns for a masker at 843 Hz at 50 and 70 dB, with probes at 20 dB. For both masker levels these data points show that the closer the probe frequency is to the masker frequency, the more masking happens, and the masker frequency can mask more up the spectrum than down.

Figure 10: Masking patterns for a masker at 843 Hz at 50 and 70 dB, with probes at 40 dB. For masker level 70 the pattern that a masker can mask more up the spectrum than down is again observed, but at masker level 50 the masker cannot mask higher frequencies at all, likely because the probe level is too close to the masker level.

Figure 11: Masking patterns for a white noise masker at 1370 Hz at 80 dB, with white noise probes at 30 dB. The pattern is the same as for sinusoids, but shows much less empirical variance, signalling a higher inter-annotator agreement for this dataset.

Figure 12: Masking patterns for two maskers at 1370 and 2908 Hz, both at 80 dB, with probes at 60 dB. The amount of masked decibels increases up to the highest masker frequency. For probe frequencies higher than the masker frequency the variance is too large to draw conclusions.


OOD set 3: threetone. In Figure 12 the listening tests with two maskers are shown. Again the variance seems lower for the answers up to the highest masker frequency (2908 Hz), after which the variance between labels increases. The error at 10 000 Hz is again present. The amount of masking is larger with two maskers playing at 80 dB than with a single one, with at the peak as much as 40 dB rendered inaudible. When removing the erroneous example at 10 000 Hz, the Krippendorff's alpha of this data is α = 0.549, which means there is too little agreement to draw conclusions from the data. From Figure 12 we see that this is due to the answers for frequencies higher than the highest masker at 2908 Hz. If we remove those, the Krippendorff's alpha becomes α = 0.977, but since these data are used only for testing and not for learning models, we keep all data points.

3.4 Discussion on Labeling

The Partial Loudness Experiment. The labeled examples we obtained with the partial loudness experiments are noisy, and few remain after dropping the ones with too high variance. To draw any reasonable and robust conclusions, more listeners are most likely needed. Additionally, in many listening tests no masking seems to happen because the probe and masker are too far apart on the spectrum. Given that each example is expensive to label, for future work these examples could be dropped. From Figure 8 one can see that hardly any masking happens when the masker plays at 40 dB, both for a probe at 30 dB and at 60 dB; this shows that examples where the masker has a similar or lower level than the probe do not, in retrospect, show interesting patterns either. Perhaps an iterative way of gathering partial loudness data would be better suited for psychoacoustic tests. E.g., for a masker frequency and probes from about 6 critical bands below to 8 above the masker (since frequencies mask more up the spectrum), have listeners label the data and repeat the process at parts of the spectrum where the empirical variance is high. Furthermore, we have not tested what range of frequencies the supplied headphones can produce. Perhaps the probe tones at 17 625 Hz that no listener could hear were simply not produced by the headphones at all.

The method of volume calibration we used for the listening tests was in retrospect flawed. The decibel meter was held to the headphones as depicted in the third animation of Figures 6a and 6b, leaving part of the headphones uncovered. However, when listening to the tests, the headphones are on the subjects' heads and closed. This means that the output will be louder when the headphones are on the listener's head. We have tested this difference by calibrating a tone at 70 dB[7] to the volume of a machine, putting the headphones on the listener's head, and measuring the level with the decibel meter. The decibel meter then shows about 73 dB, meaning the tone plays about 3 dB louder than expected. This happens for both the probe tone and the combined tones in the experiments, and therefore should simply result in a shift in the resulting partial loudness. In our experiments in the next section, we only look at the partial masking (i.e., the difference between the actual level and the partial loudness), meaning the fact that the absolute value of the partial loudness might differ by 3 dB should not matter. In the future, however, this should be taken into account.

Finally, the headphones we used were selected for their flat frequency response, meaning the response of the headphones for different frequencies should be roughly the same. However, in an experiment to test the response of one of the listeners' pairs of headphones, we found that a tone generated at 70 dB at 100 Hz has a level roughly 6 dB higher than a tone at 1000 Hz, meaning that even though a flat frequency response is advertised, the low frequencies get boosted. With the right equipment, the frequency responses of all the used headphones could be measured and used to post-process the data to control for differences in frequency response.

ISO Equal-Loudness Reproduction Experiment. For this experiment we made a mistake in the design of the tests. We did not provide listeners with the choice to flag probe tones that were inaudible to them even at the highest level (see Figure 6b). We solved this by post-processing the data as follows: we matched the listener IDs to those in the partial loudness experiment, where the listeners did flag the probe tones that were inaudible to them even at the highest level, and removed these probe tones from the ISO equal-loudness dataset. This experiment might also have been affected by boosted low frequencies. If we assume the level of the probe at around 100 Hz in the equal-loudness reproduction experiment is boosted by about 6 dB, the labels for that probe tone should have ended up 6 dB higher on the Y-axis than they did in Figure 7. Again, this could be controlled for by taking the frequency response of the headphones into account when post-processing the data.


The Additional Test Sets. Test set 2 (noise-on-noise partial masking) shows much less variance between answers, indicating that partial masking patterns are more easily derived from this type of sound.


4 Predicting Partial Loudness of the Spectral Components of Sound

We want to learn the mapping $g$ from the input signal $x$ to the partial loudness of the probe frequency $l_{p,c}$.

There are a few important considerations to take into account during modeling. First, we have a small, noisy training set of $|\mathcal{D}| = 318$ examples, which is likely not enough to learn an end-to-end model from. Second, the examples we have do not cover all frequencies and levels. These considerations call for the introduction of reasonable inductive biases.

4.1 The Spectral Partial Loudness Model - Modeling in the Spectral Domain

Before we attempt to solve the problem as stated in Section 3.1, we simplify it by assuming the spectral components and pressure levels per tone are given. In practice these should rather be inferred from the input, but this assumption allows us to construct models that can be used to sample extra weakly-supervised data for an end-to-end model. We choose to model a set of simple curves conditioned on the masker frequency, probe level, and masker level. As can be seen in Figure 8, the further away the probe is from the masker in terms of frequency, the less it is masked. Additionally, the masker masks more up the spectrum than down. In order to have as few degrees of freedom as possible, we model the pattern with a function inspired by a skewed Gaussian distribution, a generalization of the normal distribution that allows for skewness. For the data to fit this shape, we transform the partial loudness to masked decibels: $L_{m,c} = L_{p,p} - l_{p,c}$. The resulting spectral partial loudness model $h_{\theta_n}$ maps a spectrum and level specification to a single nonnegative number quantifying the masked decibels, where $\theta_n$ specifies the set of parameters used. Since the pattern we are modeling is not a probability distribution, we can drop the normalization constant. The resulting model depends on four parameters: $\xi \in \mathbb{R}$ the location, $\omega > 0$ the scale, $\alpha \in \mathbb{R}$ the shape, and an amplitude $a \geq 0$ that allows the output to be scaled to the data.

$$h_{\theta_1}(cbf_p, cbf_m, L_{p,p}, L_{p,m}) := a \cdot \exp\left(-\frac{(cbf_p - \xi)^2}{2\omega^2}\right) \int_{-\infty}^{\alpha \frac{cbf_p - \xi}{\omega}} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) \, dt$$

We get rid of one degree of freedom by setting the mode $m$ of the distribution to the masker frequency in Bark, $cbf_m$, since we know from literature that the function must peak here. This allows us to define the location parameter in terms of the masker frequency and the other parameters:

$$m = \xi + \omega \cdot m_0(\alpha) := cbf_m$$
$$\xi = cbf_m - \omega \cdot m_0(\alpha)$$

The mode $m$ is unique, but for arbitrary $\alpha$ there is no analytical expression for $m_0$ (Azzalini, 2013). We therefore use the following numerical approximation, taken from Azzalini (2013):

$$m_0(\alpha) \approx \mu_z - \frac{\gamma_1 \sigma_z}{2} - \frac{\operatorname{sgn}(\alpha)}{2} \exp\left(-\frac{2\pi}{|\alpha|}\right)$$

where

$$\mu_z = \sqrt{\frac{2}{\pi}}\,\delta, \quad \sigma_z = \sqrt{1 - \mu_z^2}, \quad \delta = \frac{\alpha}{\sqrt{1 + \alpha^2}}, \quad \gamma_1 = \frac{4 - \pi}{2} \cdot \frac{\left(\delta\sqrt{2/\pi}\right)^3}{\left(1 - 2\delta^2/\pi\right)^{3/2}}$$

For each $(cbf_m, L_{p,p}, L_{p,m})$ combination we fit the parameters $\theta_1 = (\omega, \alpha, a)$ to the data.
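To make the model concrete, the following is a minimal Python sketch of $h_{\theta_1}$; the function names are ours, and SciPy's norm.cdf stands in for the integral term (which equals the standard normal CDF $\Phi$):

```python
import numpy as np
from scipy.stats import norm

def mode_offset(alpha):
    # Approximation of the skew-normal mode offset m_0(alpha) from
    # Azzalini (2013); assumes alpha != 0 (for alpha = 0 the offset is 0).
    delta = alpha / np.sqrt(1.0 + alpha ** 2)
    mu_z = np.sqrt(2.0 / np.pi) * delta
    sigma_z = np.sqrt(1.0 - mu_z ** 2)
    gamma_1 = (4.0 - np.pi) / 2.0 * (delta * np.sqrt(2.0 / np.pi)) ** 3 \
        / (1.0 - 2.0 * delta ** 2 / np.pi) ** 1.5
    return mu_z - gamma_1 * sigma_z / 2.0 \
        - np.sign(alpha) / 2.0 * np.exp(-2.0 * np.pi / np.abs(alpha))

def h_theta1(cbf_p, cbf_m, omega, alpha, a):
    # Unnormalized skew-normal masking curve with its mode pinned to the
    # masker frequency cbf_m; frequencies are in Bark.
    xi = cbf_m - omega * mode_offset(alpha)  # location from the mode constraint
    z = (cbf_p - xi) / omega
    return a * np.exp(-z ** 2 / 2.0) * norm.cdf(alpha * z)
```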

In theory, any masker frequency can partially mask a probe, so we would like our model to allow heavy tails. The Cauchy distribution is similar to the normal distribution, but with heavier tails and a taller peak. In addition to the skewed Gaussian model, we therefore fit Cauchy distributions to the data. The Cauchy is a symmetrical distribution, so to allow for skewness we fit a different distribution to the right side of the masker frequency than to the left side. Besides the learned amplitudes $a_l$ and $a_r$ (for the amplitudes of the curves to the left and right side of the masker, respectively), we learn one scale parameter for each side ($\gamma_l > 0$ and $\gamma_r > 0$). This gives the following model:

$$h_{\theta_2}(cbf_p, cbf_m, L_{p,p}, L_{p,m}) := \begin{cases} a_r \cdot \dfrac{1}{\pi \gamma_r \left(1 + \left(\frac{cbf_p - \xi}{\gamma_r}\right)^2\right)}, & \text{if } cbf_p > cbf_m \\[2ex] a_l \cdot \dfrac{1}{\pi \gamma_l \left(1 + \left(\frac{cbf_p - \xi}{\gamma_l}\right)^2\right)}, & \text{if } cbf_p < cbf_m \end{cases}$$


When the probe frequency equals the masker frequency, $h_{\theta_2}$ is undefined. In any case, when the masker and probe frequency coincide there will be no masking, but intensification. For each $(cbf_m, L_{p,p}, L_{p,m})$ combination we fit the parameters $\theta_2 = (\gamma_l, \gamma_r, a_l, a_r)$ to the data.
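A corresponding sketch for $h_{\theta_2}$, building on the imports above and assuming the location $\xi$ is set to the masker frequency $cbf_m$, where the curve peaks:

```python
def h_theta2(cbf_p, cbf_m, gamma_l, gamma_r, a_l, a_r):
    # Two half-Cauchy curves joined at the masker frequency; each side has
    # its own scale and amplitude so the overall shape can be skewed.
    a, gamma = (a_r, gamma_r) if cbf_p > cbf_m else (a_l, gamma_l)
    return a / (np.pi * gamma * (1.0 + ((cbf_p - cbf_m) / gamma) ** 2))
```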

4.1.1 Gradient-Free Optimization - Nelder-Mead

We learn the parameters with a heuristic-based optimization method called Nelder-Mead. The objective function we optimize is the mean-squared error over the data points, with additional penalties on part of the amplitude domain. Amplitudes below zero or above the probe level are nonsensical, so we add a penalty proportional to the difference between the boundary and the learned parameter. Let there be $N$ data points for each $(cbf_m, L_{p,p}, L_{p,m})$ combination.

$$\mathcal{L}_{\theta_1} = \frac{1}{N} \sum_{n=1}^{N} \left( L_{m,c}^{(n)} - h_{\theta_1}(cbf_p) \right)^2 + \mathbb{I}_{a > L_{p,p}}(a - L_{p,p}) + \mathbb{I}_{a < 0}\,|a|$$

$$\mathcal{L}_{\theta_2} = \frac{1}{N} \sum_{n=1}^{N} \left( L_{m,c}^{(n)} - h_{\theta_2}(cbf_p) \right)^2 + \sum_{a_i \in \{a_l, a_r\}} \left( \mathbb{I}_{a_i > L_{p,p}}(a_i - L_{p,p}) + \mathbb{I}_{a_i < 0}\,|a_i| \right)$$

In order to stabilize parameter search we take several measures. Our input data is on a scale of 1 to 25 Bark and the output on a scale of 0 to approximately 60. To make optimization more stable, we preprocess the data as follows: instead of masking in decibel, we use masking in bel (1 B is 10 dB). This transforms the data to an input ranging from 1 to 25 and an output ranging from 0 to 6. Furthermore, reasonable initialization of the parameters is important. Nelder-Mead searches parameter space efficiently by exploring each parameter domain in turn, but when it searches, for example, the scale parameter while the amplitude is still zero, the learned functions will always be zero and the search will amount to nothing. It is therefore important to initialize both functions such that search starts from a reasonable distribution. We initialize the amplitudes at the probe bel level and the scales at 1. For the skewed Gaussian we initialize the skewness parameter $\alpha$ at 3, which gives a right-skewed function. We run Nelder-Mead for 300 iterations with a stepsize $\Delta = 0.01$, which determines the change in parameters in subsequent iterations. All code that implements Nelder-Mead in Python is made publicly available (for locations of the code refer to Appendix D).
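For illustration, the sketch below fits $h_{\theta_1}$ for a single $(cbf_m, L_{p,p}, L_{p,m})$ combination; it substitutes SciPy's Nelder-Mead for our own implementation, and the data points shown are hypothetical:

```python
from scipy.optimize import minimize

def loss_theta1(theta, cbf_p, masked_bel, cbf_m, probe_bel):
    # MSE plus penalties on amplitudes outside the meaningful range
    # [0, probe level]; everything is expressed in bel for stability.
    omega, alpha, a = theta
    pred = h_theta1(cbf_p, cbf_m, omega, alpha, a)
    mse = np.mean((masked_bel - pred) ** 2)
    penalty = max(a - probe_bel, 0.0) + max(-a, 0.0)
    return mse + penalty

# Hypothetical data for one combination: masker at 17 Bark, probe at 3 B (30 dB).
cbf_m, probe_bel = 17.0, 3.0
cbf_p = np.array([13.0, 15.0, 17.5, 18.0, 19.0, 21.0])
masked_bel = np.array([0.0, 0.3, 2.1, 1.8, 1.2, 0.4])

theta0 = np.array([1.0, 3.0, probe_bel])  # scale 1, right skew, amplitude at probe level
result = minimize(loss_theta1, theta0, args=(cbf_p, masked_bel, cbf_m, probe_bel),
                  method='Nelder-Mead', options={'maxiter': 300})
omega, alpha, a = result.x  # fitted parameters
```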

4.1.2 Out-of-distribution Prediction

The training set is gathered such that it covers a large part of the audible spectrum as well as the level range. However, we have only learned models for particular masker-probe frequency and level combinations. In order to use these models for prediction on unseen masker-probe combinations we use the following procedure. Given a masker at frequency $f_m$ and level $L_{p,m}$, and a probe at frequency $f_p$ and level $L_{p,p}$, we select from our models the one with the closest masker frequency in Bark. From the resulting six models we select the one with the closest masker level. Finally, from the resulting two models we select the one with the closest probe level for prediction. We then transform this model to center around the masker frequency we are trying to do prediction for (a simple shift of the curve along the X-axis). This procedure can also be used for noise-on-noise masking, by simply selecting the center frequencies of the noise bands.
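A sketch of this nearest-model selection, assuming the fitted parameters are stored in a dictionary keyed by $(cbf_m, L_{p,m}, L_{p,p})$ (a hypothetical layout):

```python
def select_model(models, cbf_m_query, masker_level, probe_level):
    # Select by nearest masker frequency (in Bark), then nearest masker
    # level, then nearest probe level, following the procedure in the text.
    nearest_f = min({k[0] for k in models}, key=lambda f: abs(f - cbf_m_query))
    pool = {k: v for k, v in models.items() if k[0] == nearest_f}
    nearest_lm = min({k[1] for k in pool}, key=lambda l: abs(l - masker_level))
    pool = {k: v for k, v in pool.items() if k[1] == nearest_lm}
    key = min(pool, key=lambda k: abs(k[2] - probe_level))
    return key, pool[key]
```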

For masking patterns of sounds with more than two frequencies we need a different procedure. We choose the following: select a model with the two-tone procedure described in the previous paragraph for each masker frequency, and sum the predicted masking over the maskers. We again clip the predicted value at the probe level, since masking more than the actual level is nonsensical.
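Continuing the sketch above, prediction for multiple maskers might look as follows (hypothetical names; levels are assumed to be in the same units as the model keys):

```python
def predict_masking(models, cbf_p, probe_level, maskers):
    # maskers: list of (cbf_m, masker_level) pairs. Per-masker predictions
    # are summed; the total is clipped at the probe level, since masking
    # more than the actual level is nonsensical.
    total = 0.0
    for cbf_m_query, masker_level in maskers:
        key, (omega, alpha, a) = select_model(models, cbf_m_query,
                                              masker_level, probe_level)
        shift = cbf_m_query - key[0]  # re-center the stored curve on the query masker
        total += h_theta1(cbf_p - shift, key[0], omega, alpha, a)
    return min(total, probe_level)
```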

4.1.3 Results

In Table 2 the mean-squared errors of the predictions on the training, within-distribution test, and out-of-distribution test sets are shown. Additionally, since we are reporting results on a dataset that has never been used before, we show results of a baseline that always predicts zero masking ($\hat{L}_{m,c} = 0$). The MSE of such a baseline provides an upper bound which, when crossed, shows that a model performs worse than not using the model at all. The model based on the skewed Gaussian always performs significantly better than this baseline, and also outperforms the Cauchy model on every dataset. The Cauchy model outperforms the baseline on the training and test set, and on one OOD test set (noise-on-noise masking), but performs worse than the baseline on both the OOD test set with sine-on-sine masking and the set with an extra masker tone. For this latter test set it significantly overestimates the amount of masking.


Table 2: MSE for the models learned with Nelder-Mead, averaged over all $(cbf_m, L_{p,p}, L_{p,m})$ combinations and averaged over three runs (standard deviation between brackets). The top row shows the MSE obtained when always predicting a partial masking of zero, which is used as a baseline. The skewed Gaussian spectral partial loudness model always does significantly better than this baseline. The Cauchy variant fails to do better than the baseline on two of the three OOD test sets. Bold numbers are significantly different from the baseline with a p-value of 0.01 (one-sided t-test); underlined numbers are significantly different from the baseline with a p-value of 0.05 (one-sided t-test).

                      Train               Test                Sine                White               Threetone
$\hat{L}_{m,c} = 0$   0.1651 (± 0.0000)   0.1927 (± 0.0000)   0.3974 (± 0.0000)   1.2284 (± 0.0000)   3.7430 (± 0.0000)
S-Gaussian            0.0304 (± 0.0023)   0.0320 (± 0.0018)   0.1978 (± 0.0128)   0.4126 (± 0.0689)   2.1967 (± 0.5172)
Cauchy                0.0538 (± 0.0003)   0.0340 (± 0.0000)   1.1303 (± 0.0000)   8.8650 (± 4.3617)   3.1986 (± 0.1918)

In Figures 16 and 17 the learned models' predictions on the training set are plotted for each $(cbf_m, L_{p,p}, L_{p,m})$ combination. First, we look at the skewed Gaussian model (Figure 16). Although it always outperforms the baseline, there seem to be considerable issues with the learned models. For lower masker levels, the models are often simply flat curves that predict a similar masking level for the whole spectrum. For example, for the masker at 60 Hz playing at 60 dB, with a probe of 30 dB, the model predicts a constant amount of masking of around 1 dB, which means it predicts a partial loudness of $\hat{l}_{p,c} = 30 - 1 = 29$ dB for any probe frequency $c$. Additionally, for maskers 568 Hz (6 Bark) and 1370 Hz (11 Bark) the models predict zero masking for probes with a lower frequency than the masker. This is probably an artifact of the fact that we use a restricted model with only a few degrees of freedom, and optimization found it was better in terms of MSE to use a very large skewness to fit the data points at higher frequencies than to also fit the points at lower frequencies. Further, at masker 568 Hz the function collapses at higher frequencies. This is due to instabilities of numerical integration. The model requires integrating a function from $-\infty$ to $\alpha \frac{cbf_p - \xi}{\omega}$, and this does not work when the upper limit is a large negative number. In practice we choose to integrate from $-5$ to $\alpha \frac{cbf_p - \xi}{\omega}$, which gets rid of almost all numerical instabilities, but not all. Some curves do represent the data rather well, like the one at masker frequency 2908 Hz and probe level 30 dB. For the probe frequency at 17 Bark the model predicts 21 dB of masking, which means the probe tone is perceived by the listeners at a partial loudness of $\hat{l}_{p,17} = 9$ dB, even though it is playing at 30 dB.

In Figure 17 the patterns learned with the Cauchy model are shown. The patterns again show more masking up the spectrum than down, and a higher amplitude for the right Cauchy than for the left. However, it also becomes clear why the model fails on OOD test sets 1 and 3: the model overestimates the amount of masking by a large margin, since the predicted amplitude at 1370 Hz is very large.


Figure 13: Out-of-distribution results with the Skewed Gaussian model on OOD set 1 (sine-on-sine). For both probe levels, at probe frequencies above the masker frequency the spectral partial loudness model overestimates the amount of masking by a masker at level 70 and underestimates it for a masker at level 50. For probe frequencies below the masker frequency the pattern is inaccurate for both masker levels.

Figure 14: Out-of-distribution results with the Skewed Gaussian model on OOD set 2 (noise-on-noise). The spectral partial loudness model predicts the pattern reasonably well, showing that partial masking patterns for noise are similar to those of sinusoids.

Figure 15: Out-of-distribution results with the Skewed Gaussian model on OOD set 3 (two sinusoid maskers). For probe frequencies lower than the masker frequencies the pattern is predicted reasonably well, but for probe frequencies higher than the lowest masker frequency the spectral partial loudness model completely fails to extrapolate.

OOD set 1: sine. For the OOD test sets we visualize the skewed Gaussian model only, since the Cauchy model is not significantly better than the baseline. Figure 13 shows the results for the first OOD test set: sine-on-sine partial masking with a masker frequency of 843 Hz at 50 and 70 dB, and probes at 20 and 40 dB. The model overestimates all masking for the masker at 70 dB, which is to be expected since we simply selected the learned model at 80 dB here (see Section 4.1.2). For example, the probe tone at 10 Bark and 20 dB SPL is playing at a partial loudness of $l_{p,10} = 20 - 13 = 7$ dB, but the model predicts 25 dB of masking, which translates into a prediction of a partial loudness of $\hat{l}_{p,10} = 0$ dB. For the probe tone at 10 Bark and 40 dB SPL


Figure 16: All learned partial masking patterns with the Skewed Gaussian spectral partial loudness model ($h_{\theta_1}$). The overall vertical axis shows increasing masker frequencies and the overall horizontal axis shows increasing probe level. The patterns for masker frequencies 60 and 7000 Hz are mostly constant amounts of partial masking across the probe frequency spectrum. For the other masker frequencies the learned patterns are reasonably accurate for masker level 80, but again often constant for masker levels 60 and 40.


Figure 17: All learned partial masking patterns with the Cauchy spectral partial loudness model ($h_{\theta_2}$). The overall vertical axis shows increasing masker frequencies and the overall horizontal axis shows increasing probe level. This model seems better at predicting partial masking for the lower masker levels than the Skewed Normal model is, but the learned amplitudes often seem too high.
