
Sound Recognition: A Cognitive Way

Alle Veenstra

August 13, 2010

Master Thesis Artificial Intelligence Dept. of Artificial Intelligence

University of Groningen, The Netherlands

Primary supervisor:

Dr. T. C. Andringa (Artificial Intelligence, University of Groningen) Secondary supervisor:

Ir. J. D. Krijnders (INCAS³)


Contents

Abstract
Acknowledgments

1 Introduction
  1.1 Problem definition
  1.2 Related research
      1.2.1 Popular audio feature-extractors
      1.2.2 Sound recognition research
  1.3 Conventional sound recognition
  1.4 Research question
  1.5 Summary of the introduction

2 Theory
  2.1 Physics
  2.2 The ear
  2.3 Textures
  2.4 The GIST of the scene
  2.5 Hearing and listening
  2.6 Awareness
  2.7 Context
  2.8 Summary of the theory

3 Implementation
  3.1 Architecture
  3.2 Signal segmentation
  3.3 Signal properties
      3.3.1 Smoothing
      3.3.2 Certainty estimation
      3.3.3 Pulsal properties
      3.3.4 Tonal properties
      3.3.5 Noisy properties
      3.3.6 Cartoony representation
  3.4 Patterns
  3.5 Rule based knowledge system
  3.6 Context
  3.7 System output
  3.8 Summary of the implementation

4 Experiment
  4.1 The freesound.org dataset
  4.2 Sound recognition a cognitive approach
  4.3 Conventional sound recognition
  4.4 Classification
  4.5 Human sound recognition experiment
  4.6 Summary of the experiment

5 Results and discussion
  5.1 Results
  5.2 Comparison to conventional systems

6 Conclusion
  6.1 Future work

Experimental result data
Rules of the RBKS

(5)

Abstract

Sound recognition systems aim to determine what source produced a sound event. Until now, such systems have lacked explicit knowledge about sound sources; they rely on crude signal descriptions and large annotated training databases. These databases, it is hoped, allow a reliable correlation between signal descriptions and annotations. Some modern perception approaches are based on the concept of gist, which suggests that a signal is first analyzed crudely, not unlike conventional sound recognition systems, but that the crude analysis is followed by a more detailed, knowledge-driven approach. In this work I explore a way of involving explicit knowledge about sound sources and audition, resulting in a more cognitive approach to sound recognition. I find that this approach is complementary to conventional sound recognition systems, as it leads to higher classification performance on a single sound event recognition task and it gives more insight into recognitions. For example, in addition to answering what source produced a sound event, my approach can also tell precisely where (in time and frequency) the evidence stems from and why the sound event was recognized as such. Also, this approach is extensible, as knowledge can be added and specialized, possibly increasing its performance beyond that of conventional sound recognition systems.


Acknowledgments

First and foremost I would like to thank my parents Wolter and Aukje Veenstra. I am grateful for their support during my studies and in the choices I made.

Next I would like to thank my classmates Maurice Mulder and Marijn Stollgenga.

Thanks to their help and friendship my years of studying were a delight. Together we finished most projects with great results and had a lot of fun.

Finally, my thanks go to my supervisors Tjeerd Andringa and Dirkjan Krijnders. They provided me with a great subject for my master's thesis. Furthermore, I would like to thank Tjeerd for the opportunity he gave me for a sneak peek into the tasks of a PhD student.


Chapter 1 Introduction

This thesis presents a more cognitive approach to sound recognition. Chapter 1 gives an introduction to the subject, reviews related work and states the research question. In chapter 2 the fundamental ideas driving the system design are presented. The implementation is described in detail in chapter 3. Chapter 4 describes an experiment that compares my approach to conventional and human sound recognition. In chapter 5 the results of the experiment are discussed, along with what could be done to improve them.

Finally, chapter 6 summarizes and concludes the thesis.

This chapter explains why this research is performed. Section 1.1 explains sound recognition and its terminology. Related research is reviewed in section 1.2. Section 1.3 addresses what is missing in conventional sound recognition approaches: knowledge about sound sources and audition. In section 1.4 the research question and objectives are stated.

Finally, section 1.5 summarizes this introduction.

1.1 Problem definition

Imagine you are walking on a street in your favorite city. During your walk you hear an abundance of sound. You hear the wind blowing and howling as it flows around buildings.

You listen to the guitars and violins of street musicians playing on the sidewalk. Often you hear people chatting and laughing. You also hear alarms from shops and birds fighting for food. The combination of all these events forms the auditory landscape, or soundscape.

In this soundscape, you can make subtle distinctions. For example, you can distinguish cars from buses driving past you on the street, and you can recognize a friend by her voice from a single "Hello".

These examples illustrate how easily and precisely we perform sound recognition, that is, finding out which sources produced which sound events. Sound recognition can also be automated; it can be performed by robots or computer programs. My thesis shows


how this can be achieved in a way that is more similar to how humans do this. It presents a more cognitive approach to sound recognition by introducing knowledge about sound sources and audition.

1.2 Related research

How can computers and robots perform sound recognition? What is being done on this topic in the research community? This section addresses these questions by presenting research related to my thesis. First, two popular auditory feature extraction and classification methods are addressed. Second, conventional sound recognition approaches in different domains are discussed. Furthermore, we look at research on creating real-world sound recognition systems. Finally, research describing the need for interaction between top-down and bottom-up processing is addressed.

1.2.1 Popular audio feature-extractors

To a computer, an audio recording is just a collection of bytes. These bytes represent the sonic pressure development (shape) of the recorded signal. For a computer to make sense of this, it needs to extract information from the signal. This process is called feature extraction. In this subsection we look at two often-used methods for this.

Mel-frequency cepstral coefficients (MFCCs) can be used as features for classification.

MFCCs use the Mel scale, a scale on which equal steps correspond to frequency differences judged equal by human listeners. Apart from that, they use the cepstrum, which is the cosine transform of the logarithmic frequency spectrum. MFCCs capture information about the shape of the spectrum. They provide a workable representation for clean speech recognition, because they are an efficient representation of the envelopes of harmonic complexes.

Matching pursuit (MP) is a signal reconstruction approach, which can also be used for feature extraction. It searches for the best approximation of a signal by combining and weighing dictionary elements. Such dictionary elements are, for example, sinusoid, sawtooth or square periodic functions. The Fourier transform is a subset of MP with a sinusoidal dictionary. MP provides an efficient representation for most time-frequency information.
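
As an illustration, the sketch below implements a bare-bones matching pursuit over a dictionary of sinusoid atoms. The dictionary contents, atom count and toy test signal are my own choices for this example and are not taken from the cited work.

```python
import numpy as np

def matching_pursuit(signal, freqs, sr, n_atoms=10):
    """Greedy matching pursuit over a dictionary of unit-norm sinusoid atoms.

    Returns the selected (frequency, phase) atoms with their coefficients and
    the remaining residual; dictionary and stopping rule are illustrative only.
    """
    t = np.arange(len(signal)) / sr
    atoms, labels = [], []
    for f in freqs:
        for phase_name, atom in (("cos", np.cos(2 * np.pi * f * t)),
                                 ("sin", np.sin(2 * np.pi * f * t))):
            atoms.append(atom / np.linalg.norm(atom))   # unit-norm dictionary element
            labels.append((f, phase_name))
    D = np.array(atoms)                                 # (n_dictionary, n_samples)

    residual = signal.astype(float).copy()
    selection = []
    for _ in range(n_atoms):
        corr = D @ residual                             # inner product with every atom
        best = int(np.argmax(np.abs(corr)))
        coef = corr[best]
        residual -= coef * D[best]                      # remove the explained part
        selection.append((labels[best], coef))
    return selection, residual

# Usage: decompose a toy mixture of two tones.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
picked, res = matching_pursuit(x, freqs=np.arange(100, 2000, 20), sr=sr, n_atoms=4)
print(picked[:2])
```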

These MFCC and MP features can be used for classification. The Waikato Environment for Knowledge Analysis (WEKA) (Hall et al., 2009) provides many common classifiers. My thesis uses their k-nearest-neighbor, multi-layered-perceptron and support vector machine classifiers. The support vector machine is trained with sequential minimal optimization (SMO) (Platt, 1998). SMO is a fast way of training support vector machines by feeding training examples sequentially.

1.2.2 Sound recognition research

What tasks are performed with sound recognition in the research community? How have other researchers used sound recognition? This subsection gives an overview of sound recognition research.

A soundscape is the auditory equivalent of a landscape. In soundscape recognition the task is to figure out what kind of environment a sound comes from. For example, the sounds of a busy street are distinguishable from the sounds of a forest (Chu et al., 2008, 2009). Chu et al. (2008, 2009) use both MP and MFCC features. These features describe both the frequency (in the case of MFCCs) and the time-frequency (in the case of MP) domain. They use Gaussian mixture models for classification. For this task, the performance of this method is comparable to that of human listeners. But it is unclear whether this task is representative, since no sound events are recognized and the test set is small.

The method of Chu et al. (2008, 2009) is called a bag-of-frames approach. This comes from the fact that summaries of frames cut from the signal are used as features. Although it may work for soundscape recognition, it fails at more specific tasks like music recognition (Aucouturier et al., 2007).

The freesound.org website hosts a continuously growing collection of audio recordings.

The samples on this website are uploaded by users from all over the world. Any copyright-free audio files can be uploaded. At the time of writing it contains more than 80,000 sounds. The uploaded sounds can be tagged by the community of users. Martinez et al.

(2009) suggested a method for automatically tagging unseen or untagged samples in this database. Their method uses MFCCs features and k-nearest-neighbor classification. They analyze in-depth how sounds are tagged on the website.

Andringa and Niessen (2006) propose a method for creating a real-world sound recognition system. They reason that an interaction between bottom-up and top-down processing is required for this. An up-to-date model of the environment is maintained to drive top-down reasoning. Besides that, physical realizability is used to constrain recognition possibilities. This constraint ensures that the environment model and evidence are consistent with physical laws. This research gives an overview of the requirements for real-world sound recognition.

Recent research of Andringa (2008) discusses the concept of sound textures. Three basic sound textures are defined: noise, pulse and tone. They conducted an experiment to show that humans might use similar low-level perceptual distinctions. In this experiment the dissimilarities between sound events are compared to the differences reported by humans (Gygi et al., 2007). The results indicate that humans use sound texture to perform first-stage sound event categorization. This makes sound textures an interesting tool for sound recognition, as it is biologically plausible.

Imagine you hear a ticking sound. This could make you wonder whether there is a clock ticking or whether somebody walking on heels is approaching you. If you are then greeted by a friend wearing heeled shoes, you know that the sound was probably from this person and not from a clock. This example illustrates dependencies between events and how they can be used to resolve ambiguities.

The previous example first shows bottom-up processing: the hearing of the ticking sound. Second, it shows top-down processing: inferring that this sound came from a pair of heeled shoes. The bottom-up process generates hypotheses, which are validated using a top-down process. When a sound recognition system needs to deal with ambiguities, it could use this bottom-up and top-down interaction (Andringa and Niessen, 2006).

1.3 Conventional sound recognition

Today’s sound recognition systems are focused on using machine learning classification techniques. A classifier is trained on signal features that describe the main distribution of energy without any attempt to separate sound events. No explicit knowledge about audition and sound sources is used. Only a training database is used. This section shows how these conventional sound recognition systems operate.


Figure 1.1: The architecture of a conventional sound recognition system. (1) The blue line represents the input signal. Time and amplitude are respectively on the x- and y-axis. (2) A window with an arbitrary length slides over the signal. This way overlapping segments are produced. (3) Features are calculated for each segment. The features are summarized. (4) Classification is made using feature summaries. The obtained "features" are of the signal, not of the sound sources and how they produced sound events.

Figure 1.1 shows a typical sound recognition system. The first step presents the signal in raw form to the system (see figure 1.1 part 1). The second step is segmentation of the audio signal (see figure 1.1 part 2). A so-called sliding window cuts the signal into overlapping and equally sized intervals. The windows are arbitrarily placed on the signal. The window length is determined by trial and error. This results in a collection of signal intervals. The third step is to extract features from the signal intervals (see figure 1.1 part 3).

Popular features are obtained from the Fourier coefficients and MFCCs. The features are calculated from each signal interval. These features are summarized using statistics such as the mean, variance, skewness, etc. The fourth step is to classify using the extracted features (see figure 1.1 part 4).
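
To make the four steps concrete, here is a minimal sketch of such a bag-of-frames pipeline. The library choices (librosa and scikit-learn), the 13-coefficient MFCC setting and the SVM parameters are illustrative assumptions; the experiments in this thesis use WEKA's classifiers instead.

```python
import numpy as np
import librosa
from scipy.stats import skew
from sklearn.svm import SVC

def summarize_recording(path, n_mfcc=13):
    """Steps 1-3 of figure 1.1: load, frame, extract MFCCs, summarize."""
    y, sr = librosa.load(path, sr=None)                      # (1) raw signal
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (2)+(3) framed features
    # Summarize each coefficient over time with simple statistics.
    return np.concatenate([mfcc.mean(axis=1),
                           mfcc.var(axis=1),
                           skew(mfcc, axis=1)])

def train_and_classify(train_paths, train_labels, test_paths):
    """Step 4: classify the feature summaries with a support vector machine."""
    X_train = np.vstack([summarize_recording(p) for p in train_paths])
    X_test = np.vstack([summarize_recording(p) for p in test_paths])
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, train_labels)
    return clf.predict(X_test)
```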

What does the mean of MFCCs and Fourier coefficients tell about sound sources? Not much. These features tell little about the sound source properties and how they produce sound events. They are about signal properties, while the goal is sound event recognition.

They are thus not that suitable for the task. There is only knowledge about the signal produced by sound events. Explicit knowledge about sound events is nowhere to be found in such systems. This makes it near impossible to recognize sound events in unconstrained environments, as the environment has a great impact on the signal. So, features about the signal itself only allow usage in constrained environments.

We have seen that conventional sound recognition systems lack knowledge about sound sources. How can we add this kind of knowledge? Well, how do humans perform sound event recognition and how can we model it? These are the central questions of my thesis.

1.4 Research question

Conventional sound recognition systems lack explicit knowledge about sound events. Modeling such knowledge in a sound recognition system might improve performance and give insight into human audition. The aim of this thesis can be summarized with the following research questions:

• How can we design a sound recognition system that uses knowledge about sound events and human audition?

• Does this approach improve recognition performance over conventional approaches?

These questions lead to the following research objectives:

1. Model gist-like crude signal analysis, to generate interpretation hypotheses that can be verified.

2. Model knowledge consistent with the physics of sound sources.

(14)

3. Model contextual knowledge: how sound sources and events relate to each other in the world.

4. Model signal-driven hypothesis generation and knowledge-driven hypothesis checking.

5. Compare this approach to a conventional sound recognition approach.

1.5 Summary of the introduction

In sum, sound recognition is about finding out what sources produced a sound event. Conventional sound recognition systems lack explicit knowledge about sound sources; they only use signal properties. This limits their performance, as they rely on classifiers and require large annotated training databases. This leads to the following research questions: how can we design a system that uses knowledge about sound events and human audition? And, does this approach improve recognition performance over conventional approaches?


Chapter 2 Theory

This chapter presents knowledge about sound sources and human audition which could be used for automatic sound event recognition. Most discussed topics are from the work of Andringa (2010), which provides an overview of the various aspects of auditory perception.

Audition is discussed in chronological order: from physical objects producing sonic events, through low-level cognition, and ending with high-level cognition. Section 2.1 shows that a sound signal contains patterns resulting from sound event physics and transmission effects. Section 2.2 describes how sonic pressure waves (sound) can be transduced into a time-frequency representation using a model of the basilar membrane. How the distribution of energy in the time-frequency plane represents sound texture, which can be used for initial low-level classification, is described in section 2.3. Section 2.4 addresses how low-level auditory perception is driven by crude low-level signal statistics. Addressed in section 2.5 is how hearing and listening form a continuous interaction between the generation and checking of hypotheses. Section 2.6 argues that we become aware of hypotheses and not of the signal. Furthermore, section 2.7 addresses how context facilitates the interpretation of sound. Finally, section 2.8 summarizes this chapter.

2.1 Physics

We start at the sound event, by explaining how physical objects produce sound. Sound events emit their energy partially as sonic energy: pressure waves traveling through air. The physical interactions of objects cause sonic energy to be distributed in a source-specific way. For example, an object can resonate, emitting energy in specific frequency ranges. Or, an object can release energy quickly, emitting pressure waves in short intervals. This shows that an object's physics determines the kind of sound that is produced.

The signal that reaches our ears is not what is produced by a sound event. Many transmission effects like noise, masking and reverberation take place. Figure 2.1 shows an

example of how what we perceive is built up. The signal is a combination of sound event physics and transmission effects. Sound events create patterns, and transmission effects change them in a lawful way.

Figure 2.1: An example of how a signal is a combination of a pattern and transmission effects. (1) shows the pure pattern generated by the sound event. (2) shows the transmission effects, which in this example are simulated background noise and reverberation. (3) shows the actual perceived signal.

Figure 2.2: The signal we perceive is the result of sound event physics plus transmission effects.

According to Gaver (1993), sound event physics plus transmission effects result in patterns. Figure 2.2 shows an illustration of this concept. This means that the physics of an object cause the patterns we perceive. Such patterns are similar when the physics are similar. A large bell (low frequency) will produce a pattern similar to a small bell (high frequency). In this case, only the size and the position in the time-frequency plane will change.

2.2 The ear

We have seen how objects produce sound and what happens to it before it reaches our ears. Now we look one step further into what goes on inside our ears and how sound is made available for cognitive processing. Sonic vibrations are in the time domain, but what we hear is in the time-frequency domain. This operation is performed by the cochlea,

which transduces audio from the time into the time-frequency domain.


Figure 2.3: A dissection of the human ear. (1) The eardrum, malleus and incus, which transfer sound in the form of vibrations to the cochlea. (2) The cochlea transduces sound from the time domain to the time-frequency domain. (3) The auditory nerve connects the cochlea to the brain.

The human inner ear turns sonic waves into neural firing patterns. Figure 2.3 shows a dissection of the human ear. Sonic waves cause the eardrum to vibrate. The malleus and incus transmit these vibrations to the cochlea. Inside the cochlea these vibrations stimulate the basilar membrane. The basilar membrane connects to many hair cells, which produce a graded potential. This potential is picked up by spiking neurons. This way, neural firing patterns are the consequence of sonic waves.

Each part of the basilar membrane responds to different frequencies. High frequencies stimulate the start of the basilar membrane. Low frequencies stimulate the end of the basilar membrane. The frequency response is logarithmic, similar to the Mel scale. Thus, the basilar membrane transduces audio from the time into the time-frequency domain.


Figure 2.4: An unfolded cochlea (left) containing the basilar membrane, and a cochleogram (right) of the sound of a babbling baby followed by a ball hitting a wall.

Time and frequency are respectively on the x- and y-axis. Like the basilar membrane, the frequency axis is logarithmic. A cochleogram shows local energy in decibel (dB) over frequency and time.

The gammachirp filterbank (Irino, 1997) models the basilar membrane. It transduces recorded audio into the time-frequency domain. It does this by segmenting the basilar


membrane. The frequency response of each segment is represented as a low-pass filtered log of the squared excitation. The gammachirp filterbank output is the frequency response of basilar membrane segments.

A cochleogram shows the basilar membrane response over time. Figure 2.4 shows an example cochleogram. It shows the local energy, in decibel, of the frequencies over time.
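
As a rough illustration of the transduction described above, the sketch below approximates a cochleogram with a bank of log-spaced band-pass filters followed by energy smoothing and conversion to dB. It is only a crude stand-in for the gammachirp filterbank of Irino (1997) that is actually used; the channel count, frequency range and smoothing length are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochleogram(y, sr, n_channels=64, f_lo=50.0, f_hi=4600.0, frame=0.005):
    """Crude cochleogram: log-spaced band-pass filters, smoothed energy in dB.

    A stand-in for the gammachirp filterbank, for illustration only.
    """
    centers = np.geomspace(f_lo, f_hi, n_channels)        # logarithmic frequency axis
    hop = int(sr * frame)
    rows = []
    for fc in centers:
        lo, hi = fc / 2 ** 0.25, fc * 2 ** 0.25            # half-octave band around fc
        b, a = butter(2, [lo, min(hi, 0.49 * sr)], btype="band", fs=sr)
        band = lfilter(b, a, y)
        energy = band ** 2
        # Low-pass the squared excitation with a moving average, then subsample.
        win = np.ones(hop) / hop
        env = np.convolve(energy, win, mode="same")[::hop]
        rows.append(10.0 * np.log10(env + 1e-12))          # local energy in dB
    return np.array(rows), centers                         # shape: (frequency, time)
```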

2.3 Textures

Next, we discuss the lowest level of auditory perception. How do we perceive sound in the earliest perception stages?

Sound events emit their energy as sonic waves. Such emissions take place in different ways: in a pulsal way, where energy is emitted at discrete points in time; in a tonal way, where energy is emitted at discrete frequencies over longer intervals; or in a noisy way, where energy is spread over both time and frequency. Sound events thus distribute their energy in a specific way, a fact that can be used for early categorization.


Figure 2.5: Three cochleograms showing extremes in energy distribution. (1) Pure noise spreads energy over both time and frequency. (2) A pure pulse spreads energy over frequency within a short time period. (3) A pure tone concentrates energy over time within a small frequency range.

Textures describe sound in terms of pulsal, tonal and noisy contributions. Figure 2.5 shows three cochleograms. They show noise, a pulse and a tone.

Figure 2.6 shows how textures are applied. For each section of the cochleogram the texture is estimated. The texture tells in what way energy is distributed. This is in frequency, in time or in a combination; it is mapped to a point on the triangle of figure 2.6. This way, the whole cochleogram is represented in terms of tonal, pulsal and noisy contributions.

The result of the texture application is shown in figure 2.7. A distinction is made between pulsal, tonal and noisy parts. Each similar region is colored the same way, resulting in a patchwork of textures.
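
The exact texture estimator is not spelled out here, so the sketch below is only one plausible heuristic: it compares how strongly the local energy of a cochleogram section varies along the time axis versus the frequency axis and maps the outcome to a pulsal, tonal or noisy label. The threshold is invented for illustration.

```python
import numpy as np

def texture_of_section(section, ratio=2.0):
    """Label a local cochleogram section (frequency x time, in dB) as
    'pulsal', 'tonal' or 'noisy'.

    Heuristic only: pulses vary strongly over time, tones vary strongly over
    frequency, noise varies little in either direction.  The ratio threshold
    is an illustrative choice, not the estimator used in the thesis.
    """
    var_over_time = np.mean(np.var(section, axis=1))   # variation along the time axis
    var_over_freq = np.mean(np.var(section, axis=0))   # variation along the frequency axis
    if var_over_time > ratio * var_over_freq:
        return "pulsal"
    if var_over_freq > ratio * var_over_time:
        return "tonal"
    return "noisy"
```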

Textures are a way to represent sound. But is there any evidence that humans use similar measures? Our goal is to model the human auditory system as well as possible. Can textures be linked to human auditory perception?

Figure 2.6: The application of textures on a cochleogram. (1) is an illustration of a cochleogram. (2) is the distribution of energy, which can be over frequency, over time or a combination; this is the actual texture space. The texture of each local time-frequency section (3, 4 and 5) is estimated. (3) is a section with a tonal texture. (4) is a section with a noisy texture. (5) is a section with a pulsal texture. This picture is based on Andringa (2010).

How do humans perceptually categorize environmental sounds? Gygi et al. (2007) conducted a rather strange experiment to find out. They gave their test subjects the assignment of estimating the similarity of random sound pairs. Imagine you have to tell the similarity between a purring cat and a tea kettle. This is quite an awkward thing to do. But as awkward as it might be, after a while the test subjects learned to perform this task and evaluated 10,000 sound pairs. Analysis of this data shows three strong perceptual groups. The first group contains harmonic sounds with predominantly periodic contributions (discrete in frequency). The second group contains impact sounds with predominantly pulse-like contributions (discrete in time). The third group contains continuous sounds. This indicates that humans have low-level categories similar to sound textures, which make a similar distinction.


Figure 2.7: The textures of the sound in figure 2.4. The reddish colors (10 - 7) show pulsal regions. The greenish colors (6 - 3) show tonal regions. The blue color (2) shows noisy regions. The dark blue color (1) shows regions lacking energy.

2.4 The GIST of the scene

We hear many things, but do not listen to everything we hear. What happens to the information that is not attended? Research shows that this information, in both audition (Harding et al., 2008a) and vision (Oliva, 2005), becomes available in an ensemble representation, in the form of statistical characteristics. But what is the form of this representation and what does this tell us?

Figure 2.8: Illustration of the information gathered in the first 100ms of exposure to a beach scene. This is called the gist of a scene. A crude first analysis of the scene is being made, guiding further in-depth analysis. This image is from research of Oliva (2005).

Please take a look at figure 2.8, which seems like a vague artistic painting. It actually is a representation of the information we use within the first 100ms after observing a beach scene. Although there is little detail in the image, one can guess the scene location.

Such information can steer the interpretation of the scene from early on.

This early interpretation is called the gist of a scene (Oliva, 2005). A meaningful interpretation of a scene can be given within 100ms (Potter, 1976). The visual gist contains statistical information about groups of objects. The conceptual gist is enriched and modified as the perceptual information bubbles up from early stages of visual processing (Oliva, 2005).

Is there also an auditory gist? Do we get a fast and crude initial overview of our auditory surroundings? Harding et al. (2008a) show that there is considerable evidence supporting the existence of an auditory gist. This tells us that the human auditory system also uses crude signal statistics for early categorization.

2.5 Hearing and listening

Now that the early stages of perception have been discussed, we look into how attention is used. Not all sounds are attended; audition is selective. Here we see that audition is a continuous interaction between hearing and listening. This interaction is first illustrated with an example.


Figure 2.9: An example of the interaction between hearing and listening. Hypothesis generation and hypothesis checking play a dominant role in this. The time-frequency plane is an illustration of the cochleogram. The strings represent observations and hypotheses.

The clouds represent the concepts becoming conscious. This image is from Andringa (2010).

Suppose that, on a relaxing Sunday morning, a fictional person called Charlie decided to take a walk through the park. Most people find this quite an ordinary activity, but for


Charlie it is different. After being struck with an eye infection, she lost her vision. Now she has to rely on audition and tactile feedback from her cane.

During her walk Charlie hears something – a vague repetition of crispy pulses. As she starts to listen to the sound, she figures it might be footsteps. Is there someone approaching her? The pulses stop and make way for harmonic tones. It is Jack, the man from next door, greeting her with "Morning neighbor!". Obviously Charlie is not the only one enjoying the park this time of day.

This example shows the workings of audition. Charlie hears something and focuses her attention on it by listening. This illustrates the difference between hearing and listening. Hearing is driven by the signal and attracts attention to salient events. Listening selectively focuses attention depending on goals and signal-driven hypotheses.


Figure 2.10: The interaction between hearing and listening. This is a generalization of the example seen in figure 2.9. (1) is an illustration of a cochleogram. (2) shows how salience guides attention in the hearing process. (3) shows how hypotheses are verified in the listening process. (4) shows how expectation management is the process where task/goal-relevant hypotheses are maintained. This image is from Andringa (2010).

Hearing and listening play a complementary role in our auditory perception. Audition is considered a combination of hearing and listening (Harding et al., 2008b). Figure 2.10 shows the interaction between hearing and listening.

Remember the last time you spoke with someone in a crowded room? Even when there are many people speaking at the same time, we can hear each other. This is called

the cocktail party effect (Cherry, 1953; Bronkhorst, 2000). The secret of why we still can hear each other in such settings lies in our ability to focus our attention and listen carefully.

What distinctions can be made between kinds of attention? First there is signal-driven attention, which focuses on the properties of a signal. This form of attention is mostly salience-based: information that pops out of the background receives focus. Second there is knowledge-driven attention, which focuses on objects and on performing tasks. Not all information that we perceive is useful at all times. Information relevant to the task we perform receives more attention.

An effect of top-down attention is that we become unaware of task-irrelevant information. This causes blindness (Mack, 2003; Cherry, 1953) to stimuli not receiving focus.

We do not become aware of the signal properties that are not important to our current task.

2.6 Awareness

Eventually, some sounds will reach awareness in the form of sound events. Low-level statistical signal descriptions end up facilitating high-level symbolic interpretations.

When you read this sentence the visual stimuli of the letters are translated into words.

Words are meaningful symbols and reach awareness. Whatever the font, the size or the color of the paper, the meaning is perceived. The shape of the letters, the actual stimulus, does not reach awareness unless we specifically attend to it.

Think about what happens when you are talking to somebody in a crowded place.

Transmission effects, like reverberation and masking, are now influencing the speech signal. Similar to font and paper properties, these effects change the speech signal in a major way. But what we hear is the "clean" voice of the speaker and not the changed signal properties.

These examples indicate that we become conscious of symbols and objects and not of the signal or its properties. How do these signals become symbols and objects in our mind? There are often multiple ways to interpret a signal. These interpretations are hypotheses about the meaning of the signal. Our mind tries to find the hypothesis that most accurately matches the signal.

2.7 Context

Another important high-level cognitive trait is our use of context: how concepts relate to each other in the real world. Context gives much information about how observations


can be interpreted.

Imagine yourself gently petting a cat that is lying on your lap and purring. What sound do you picture at this scene? Now imagine an older car running stationary: a low and fast pulsating sound. Is there a big difference between those two sounds? Not really. They are both low and rhythmic, and are probably difficult to discriminate from each other.

Often, sound has more than one possible interpretation. In such cases the context can facilitate the interpretation. As audition is performed in the real world, it is always in a context.

Niessen et al. (2008) propose a method for using context in sound recognition. They model contextual knowledge in a dynamic network model. Their approach reduces the search space used for low-level features. This way, the time spent by the algorithm is greatly reduced.

2.8 Summary of the theory

In sum, the human auditory system uses knowledge about sound sources. The patterns in a cochleogram are the result of sound event physics and transmission effects. The cochlea transduces sound from the time into the time-frequency domain. The gammachirp filterbank is a model of the cochlea and results in a cochleogram. The energy distribution in the cochleogram defines the texture of sound, being tonal, pulsal or noisy. An interaction between hearing and listening takes place in which hypotheses are generated and checked in accordance with the task. Early stages of perception result in crude statistical features about a visual or auditory scene. We become conscious of hypotheses and not of the signal. The knowledge about sound sources and the human auditory system summarized above can be used to design a sound recognition system.


Chapter 3

Implementation

This chapter presents the implementation details. Section 3.1 gives an overview of the implementation architecture. Section 3.2 addresses segmentation of the sound signal.

In section 3.3 the extraction of signal properties is discussed. Section 3.4 shows how hypotheses about the sound event physics are generated. How knowledge about sound events is represented is shown in section 3.5. Section 3.6 addresses the modeling of knowledge about the world (context). In section 3.7 an example of the system output is presented and discussed. Finally, section 3.8 summarizes this chapter.

3.1 Architecture

In the implementation I try to use all knowledge addressed in the theory (chapter 2). The system is designed in multiple layers, see figure 3.1. Each layer has its own responsibility with respect to the kind of knowledge it models. This ensures transparency and makes the relations between knowledge types explicit. In this section we discuss the individual layers and their responsibilities.

The first layer is responsible for transducing the signal from the time into the time-frequency domain and then selecting interesting parts for further analysis. It is based on the theory from sections 2.2 and 2.3. This layer performs segmentation of the cochleogram;

segmenting the signal into areas that contain energy. Segmentation of the time-frequency plane results in rectangular areas, called patches. Similarly textured areas are grouped and a patch is created around them. Similarity is based on the texture type, which can be pulsal, tonal or noisy.

The second layer is responsible for extracting properties from the patches. It is based on the theory from section 2.4. This layer contains a number of signal property estimators.

A few example properties are: repetitions, tone complex and noise distribution. These properties are extracted from individual patches. Which properties are extracted depends


Figure 3.1: This figure shows an overview of the layered implementation. It shows the kind of interaction between layers, being: salience, signal capturing, hypothesis generation or hypothesis checking. Each layer has its own responsibility for modeling a specific kind of knowledge. (1) The signal segmentation layer is responsible for signal-driven attention. (2) The signal-driven properties layer is responsible for modeling knowledge about the signal. (3) The pattern-based expectation layer is responsible for modeling knowledge about physics.

(4) The rule-based knowledge system is responsible for modeling knowledge about sound sources. (5) The context activation network is responsible for modeling knowledge about the world (context).

upon the texture underlying the patch, which can be pulsal, tonal or noisy. Each of these texture types has its own set of applicable signal property estimators.

The third layer is responsible for generating hypotheses that link signal properties to source physics. It is based on the theory from section 2.1. This layer searches for patterns across extracted signal properties. Patterns can be found both in single and in multiple patches. This kind of knowledge about physics and patterns describes how objects interact. An example of a pattern is the driven resonator. This is an object class that produces a tone when driven by a force. Thus, the patterns are abstractions of physical interactions.

The fourth layer is responsible for modeling knowledge about sound sources. This layer verifies the pattern hypotheses from the previous layer and generates hypotheses about the source that a sound stems from. It is based on the theory from sections 2.5, 2.1 and 2.6.

This layer is a rule-based knowledge system, which stores explicit knowledge in rules. Such systems are also called expert systems. They can be used to model the knowledge that experts use to make decisions. Wikipedia is a possible source of such expert knowledge.

For example, the Wikipedia page about speech describes some sonic properties, which in principle can be translated into rules. The rules describe what patterns and signal properties are caused by what sound sources.
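
As an illustration of how such a rule could be encoded, the sketch below expresses one hypothetical rule. The property names, thresholds and the "speech" conclusion are invented for this example; the rules actually used by the system are listed in the appendix "Rules of the RBKS".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """A rule links observed signal properties/patterns to a source hypothesis."""
    name: str
    conditions: List[Callable[[Dict[str, float]], bool]]
    hypothesis: str

    def fire(self, evidence: Dict[str, float], certainty: Dict[str, float]):
        """Return (hypothesis, certainty) if all conditions hold, else None.

        The combined certainty is the product of the certainties of the
        properties the rule relies on, mirroring section 3.3.2.
        """
        if all(cond(evidence) for cond in self.conditions):
            score = 1.0
            for key in evidence:
                score *= certainty.get(key, 1.0)
            return self.hypothesis, score
        return None

# A hypothetical rule: a harmonic complex modulated at speech-like rates
# suggests speech.
speech_rule = Rule(
    name="harmonic complex + modulation -> speech",
    conditions=[lambda e: e.get("tone complex", 0) > 0.5,
                lambda e: 2.0 < e.get("repetition frequency", 0) < 8.0],
    hypothesis="speech",
)

evidence = {"tone complex": 0.8, "repetition frequency": 4.5}
certainty = {"tone complex": 0.7, "repetition frequency": 0.6}
print(speech_rule.fire(evidence, certainty))   # -> ('speech', ~0.42)
```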

The fifth layer is responsible for making contextual knowledge available. This layer verifies whether sound source hypotheses from the previous layer are supported by their context. It is based on the theory from sections 2.5, 2.1 and 2.6. This layer contains contextual relations between sound sources. It has more than one use. The first use is to resolve recognition ambiguities. For example, when there is evidence supporting both a purring cat and a car engine, there is an ambiguity. This conflict might be resolved using contextual knowledge. Another use is to recognize more complex sound sources, like a running coffee machine, which has many sound events spread over time in a predictable way. A coffee machine is difficult to recognize as a whole, but its sound event parts (steam, ticking and water) are not.
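
The sketch below is a toy spreading-activation network, not the dynamic network model of Niessen et al. (2008). The nodes and relation weights are invented; it only shows how contextual support could tip the balance between the purring-cat and car-engine hypotheses mentioned above.

```python
import numpy as np

# Hypothetical nodes: two competing source hypotheses plus two context nodes.
nodes = ["purring cat", "car engine", "living room", "street"]
# Symmetric relation weights (invented): a source supports the context it
# usually occurs in, and vice versa.
W = np.array([
    [0.0, 0.0, 0.8, 0.0],   # purring cat <-> living room
    [0.0, 0.0, 0.0, 0.8],   # car engine  <-> street
    [0.8, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.0],
])

def spread_activation(initial, W, steps=5, decay=0.5):
    """Iteratively mix each node's activation with support from its neighbours."""
    a = np.array(initial, dtype=float)
    for _ in range(steps):
        a = decay * a + (1.0 - decay) * W @ a
    return a

# Ambiguous low-level evidence (both sources equally likely), but other
# recognitions have already activated the "street" context.
initial = [0.5, 0.5, 0.0, 0.6]
final = spread_activation(initial, W)
for name, act in zip(nodes, final):
    print(f"{name:12s} {act:.2f}")   # "car engine" ends up above "purring cat"
```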


Figure 3.2: The example recording followed in explaining the implementation. It is a concatenation of multiple single sound events. It contains pink noise (0.0s - 0.5s), propeller aircraft (0.5s - 1.0s), jet aircraft (1.0s - 2.1s), axe hit (2.1s - 2.5s), baby (2.5s - 3.5s), ball (3.5s - 4.2s), bell (4.2s - 5.1s), bird (5.1s - 6.0s), bowling kegs (6.0s - 5.9s) and water bubbling (5.9s - 7.8s).

To explain the system, an example sound recording is followed throughout the explanation of the individual layers. Figure 3.2 shows the cochleogram of this recording. It contains a concatenation of single sound events.


3.2 Signal segmentation


Figure 3.3: The texture of the recording in figure 3.2.

Take a look at figure 3.3. This figure shows the texture of the time-frequency plane.

There are ten textural categories; no energy and all combinations of pulse, tone and noise.

Such a combination is for example: pulsal and tonal, but no noise.

Interesting signal parts, those that contain sufficient energy (the upper 60 dB), are segmented into patches. These patches are rectangular areas in the time-frequency plane.

The texture is transformed into three binary masks, one mask for each texture. These masks combine time-frequency areas with similar texture.


Figure 3.4: Patches applied on the texture of recording in figure 3.2. Patches are rectangular sections in the time-frequency plane.

Sound event physics                   Signal structure               Property name
Highly damped impact                  Structured pulse               Individual pulse
Periodic damped impacts               Periodic pulses                Repetition frequency
Resonance                             A single tone                  Isolated tone
Resonance with beats                  A single tone with AM          Modulated tone
Changing resonance frequency          A tone changing in frequency   Sheared tone
Multiple resonations                  Multiple tones                 Tone complex
Many uncorrelated processes           Noise                          Noise spectrum
Broad resonance                       Noise with a band              Noise band
Many uncorrelated damped processes    Decaying noise                 Energy slope

Table 3.1: This table shows how signal properties relate to sound event physics.

A connected components algorithm is used to identify separate regions on the binary masks. A rectangular patch is drawn around each region, see figure 3.4. Each patch is marked with its source mask, being pulsal, tonal or noisy. This information is used by other layers.

Are patches the best way to segment a signal? No. In fact, it’s a crude way to segment.

For example, a tonal sweep is a diagonal line in the time-frequency plane. Such diagonal patterns cover only a small portion of a rectangular patch.

Although using rectangular patches is probably not the best way, they provide an easy initial segmentation approach. Rectangular boxes are easy to store and calculate with.

They only require two xy-points to store. Calculating overlap between two boxes is simple and fast. This makes their usage sufficient, as it does not complicate the implementation much.
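
A minimal sketch of this segmentation step, assuming the texture stage has already produced one boolean mask per texture type. The SciPy routines perform the connected-component labelling and the resulting bounding boxes are the patches.

```python
import numpy as np
from scipy import ndimage

def extract_patches(mask, texture_type):
    """Turn one binary texture mask (frequency x time) into rectangular patches.

    Each patch is stored as two corner points plus the texture it came from,
    mirroring the description above.
    """
    labels, n_regions = ndimage.label(mask)          # connected components
    patches = []
    for sl in ndimage.find_objects(labels):          # one bounding box per region
        f_slice, t_slice = sl
        patches.append({
            "texture": texture_type,
            "freq_range": (f_slice.start, f_slice.stop - 1),
            "time_range": (t_slice.start, t_slice.stop - 1),
        })
    return patches

# Usage on a toy mask with two separate energetic regions.
mask = np.zeros((10, 20), dtype=bool)
mask[2:4, 3:8] = True      # a short tonal blob
mask[6:9, 12:15] = True    # another region
print(extract_patches(mask, "tonal"))
```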

3.3 Signal properties

The signal property estimators estimate properties of the signal (using the cochleogram).

These signal properties relate to the sound source physics, as seen in table 3.1. They are applied to the patches extracted in the first layer. The textural source of the patch determines what properties are calculated; the signal determines what analysis (property extraction) is applied to it.

Certainty estimates of the applied properties are calculated. These are maintained and used for later decisions and calculations. Furthermore, properties with low certainty can easily be discarded. Currently, the implementation discards properties with a certainty lower than 20%. It is important to make an accurate certainty estimation, a process which


requires clear criteria or costs a lot of trial-and-error work on a training set.

The description of the implementation uses many different mathematical symbols.

Table 3.2 shows their unit and what they stand for. Variables enclosed in angle brackets ⟨...⟩ are property estimators.

3.3.1 Smoothing

Suppose a sound event is producing a pulse every half second and is located in a small room. The walls will add reflections of the pulses and noise. These reflections will likely also be present in the signal. Estimating the repetition frequency of the sound source in the signal now becomes difficult, because the reflections are also pulses and noise is added. These transmission effects need to be reduced to be able to estimate the repetition frequency of the sound event.

Smoothing the signal can help reduce these kinds of noise. In this implementation the signal is smoothed using a two-dimensional moving average filter. This filter replaces the value of a point with the average value of the area surrounding that point. But then the question arises: what is the optimal size of the area?

Suppose a signal property is calculated in the time domain. This implies that information in the time domain is important. Now suppose there is much noise in the frequency domain, which is also used (summarized, for example) for the same calculation. The noise needs to be eliminated, which can safely be done using smoothing filters, as no information from this domain is vital for the calculation. Thus, the domain from which no information is extracted can be smoothed without introducing potential problems.

For each point in the input signal the local area is averaged. The number of points, or scope, to include in the average is arbitrary. This scope determines the strength of the smoothing. A small scope results in little smoothing and a large scope results in strong smoothing. Each individual situation has its own optimal scope, depending on how much noise is present and the calculation that needs to be performed.

Noise has the highest possible entropy; it requires the most bits to encode. When noise is reduced, the entropy of a signal is also reduced. This property can be used to find a suitable smoothing scope. Figure 3.5 illustrates how this concept is applied in a real-world example.

Let A be the moving average filter, applied to a two-dimensional cochleogram patch ξ_{t,f} with scope size φ. Let g be a function used for estimating a signal property; this function acquires some property from the signal. For example, g could measure the distances between peaks over time. When the noise decreases as the scope size φ is changed, the entropy H also decreases.

Symbol                  Unit      Meaning
\xi_{t,f}               dB        a patch in the cochleogram
t                       seconds   time interval
\vec{t}                           all patch time intervals
f                       Hertz     frequency
\vec{f}                           all patch frequencies
\phi                              smoothing scope size
H(\vec{x})                        entropy of \vec{x}
A(\xi_{t,f}, \phi)                smoothing average filter on input \xi_{t,f} with scope size \phi
S(x, \mu, \sigma)                 value of a sigmoid with midpoint \mu and breadth \sigma at x
G(x, \mu, \sigma)                 value of a Gaussian distribution with mean \mu and standard deviation \sigma at x
G(\mu, \sigma)                    Gaussian distribution with mean \mu and standard deviation \sigma
\vec{p}_i               dB        individual peak shape
\rho                              correlation
g                                 arbitrary function
\mu                               center point
\sigma                            breadth
\beta_0                           linear regression base
\beta_1                           linear regression slope
\vec{\epsilon}                    error between regression and data
\vec{\gamma}            dB        energy in all time intervals
\vec{\psi}              dB        energy in all frequencies
\langle x \rangle                 property x estimation

Table 3.2: The mathematical symbols used and their meaning.


Figure 3.5: Example of the successful application of entropy minimization smoothing.

This figure shows a patch of a cochleogram containing a propeller plane. The goal is to estimate the repetition frequency of the pulses generated by the propeller (which is approximately 28 pulses per second). (1) shows an unprocessed version of the patch.

Under this cochleogram are the peaks found by the peak-detection algorithm, which are too many. The extra peaks are probably caused by reflections or interference from other sound events. The estimated repetition frequency of this unprocessed version is 52.7 pulses per second. (2) shows a smoothed version of the patch, using entropy minimization smoothing on the peak distance. Using this, a more reliable estimation of the repetition frequency can be made. The estimated repetition frequency of this smoothed version is 23.6 pulses per second.

\[
\min_{\phi} H\!\left(g\bigl(A(\xi_{t,f}, \phi)\bigr)\right) \tag{3.1}
\]

Formula 3.1 shows how entropy minimization is applied. It is important to choose a function g that fits the task. Also, it is important to smooth the domain from which no information is extracted.
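
A sketch of formula 3.1 under a few assumptions: the property g measures peak distances over time, so the moving average A is applied along the frequency axis only; the candidate scopes, the pooling of peak distances across channels and the histogram bin count are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.signal import find_peaks
from scipy.stats import entropy

def peak_distance_entropy(patch):
    """Entropy H of the pooled peak-to-peak distances, per frequency channel."""
    distances = []
    for channel in patch:                        # iterate over frequency channels
        peaks, _ = find_peaks(channel)
        distances.extend(np.diff(peaks))
    if len(distances) < 3:
        return np.inf                            # not enough peaks to judge
    hist, _ = np.histogram(distances, bins=10, density=True)
    return entropy(hist + 1e-12)

def best_smoothing_scope(patch, scopes=range(1, 30, 2)):
    """Formula 3.1 (sketch): choose the scope phi that minimizes H(g(A(patch, phi))).

    Here g measures peak distances over time, so the smoothing A is applied
    along the frequency axis only (the domain no information is taken from).
    """
    def smooth(scope):
        return uniform_filter(patch, size=(scope, 1))   # average neighbouring channels
    return min(scopes, key=lambda s: peak_distance_entropy(smooth(s)))
```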

3.3.2 Certainty estimation

All property estimators give a certainty estimation about their analysis. This estimation is always the product of one or more confidence estimations of metrics extracted from the signal. There are four different ways of converting a metric x to a certainty estimation:

a sigmoid (formula 3.2), an inverse sigmoid (formula 3.3), an unnormalized Gaussian distribution (formula 3.4) and an inverse, unnormalized Gaussian distribution (formula 3.5). One advantage of using such functions is that only the midpoint µ and breadth σ have to be determined by trial-and-error work. Another advantage is that these functions are continuous, which eases the trial-and-error process and allows for noise and unreliable data.

\[
S(\langle\text{property}\rangle, \mu, \sigma) = \frac{1}{1 + e^{-\frac{\langle\text{property}\rangle - \mu}{\sigma}}} \tag{3.2}
\]
\[
S^{-1}(\langle\text{property}\rangle, \mu, \sigma) = 1 - \frac{1}{1 + e^{-\frac{\langle\text{property}\rangle - \mu}{\sigma}}} \tag{3.3}
\]
\[
G(\langle\text{property}\rangle, \mu, \sigma) = e^{-\frac{(\langle\text{property}\rangle - \mu)^2}{2\sigma^2}} \tag{3.4}
\]
\[
G^{-1}(\langle\text{property}\rangle, \mu, \sigma) = 1 - e^{-\frac{(\langle\text{property}\rangle - \mu)^2}{2\sigma^2}} \tag{3.5}
\]
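
Formulas 3.2-3.5 translate directly into code. The sketch below also recomputes a formula 3.7-style certainty as a product of such mappings; the duration, frequency-range and strength values fed in are made up.

```python
import numpy as np

def sigmoid(x, mu, sigma):                 # formula 3.2
    return 1.0 / (1.0 + np.exp(-(x - mu) / sigma))

def inv_sigmoid(x, mu, sigma):             # formula 3.3
    return 1.0 - sigmoid(x, mu, sigma)

def gaussian(x, mu, sigma):                # formula 3.4 (unnormalized)
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def inv_gaussian(x, mu, sigma):            # formula 3.5
    return 1.0 - gaussian(x, mu, sigma)

# A formula 3.7-style certainty for a hypothetical pulse patch:
duration, freq_range, pulse_strength = 0.05, 900.0, 26.0   # made-up measurements
certainty = (inv_sigmoid(duration, 0.2, 0.05)
             * sigmoid(freq_range, 500, 10)
             * sigmoid(pulse_strength, 20, 3))
print(round(certainty, 3))
```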

3.3.3 Pulsal properties

What signal properties can be calculated from a pulsal texture?


Figure 3.6: Examples of pulsal textured cochleograms. (1) is suitable for the individual pulse property. (2) is suitable for the repetition frequency property.

Individual pulse

The first pulsal property is the individual pulse. It determines whether there is a (strong) pulse in the patch ξt,f with time/frame t and frequency/segment f . The strength of the pulse is estimated using the median energy difference over time, see formula 3.6.

\[
\langle\text{pulse strength}\rangle = \operatorname{median}\!\left(\max_{t} \xi_{t,f} - \min_{t} \xi_{t,f}\right) \tag{3.6}
\]

Certainty of this property is estimated by taking the product of confidence estimators on the duration in seconds, frequency range and estimated strength in dB, see table 3.3.

As an example, formula 3.7 shows how the certainty of this property is calculated using table 3.3.

\[
\text{certainty} = S^{-1}(\text{duration}(\xi_{t,f}), 0.2, 0.05) \cdot S(\max \vec{f} - \min \vec{f}, 500, 10) \cdot S(\langle\text{pulse strength}\rangle, 20, 3) \tag{3.7}
\]

Single pulse
property          parameters                               unit
duration          inverse sigmoid with µ = 0.2, σ = 0.05   seconds
frequency range   sigmoid with µ = 500, σ = 10             Hertz
pulse strength    sigmoid with µ = 20, σ = 3               dB

Table 3.3: The confidence estimation parameters of the single pulse statistic.

Repetition frequency

The second pulsal property is the repetition frequency. This property estimates the pulse frequency in a patch ξ_{t,f} with time/frame t and frequency/segment f. First the signal inside the patch is smoothed using entropy minimization on the time between peaks (g = peakdistances), see formula 3.8. The repetition frequency (in peaks per second) is estimated as the median of the number of peaks per segment, pps(ξ_{t,f}), over time, see formula 3.9.

\[
\xi'_{t,f} = A\!\left(\xi_{t,f},\ \min_{\phi} H\bigl(\text{peakdistances}(A(\xi_{t,f}, \phi))\bigr)\right) \tag{3.8}
\]
\[
\langle\text{repetition frequency}\rangle = \operatorname{median}\!\left(\frac{\vec{\text{pps}}(\xi'_{t,f})}{\text{duration}(\xi_{t,f})}\right) \tag{3.9}
\]

Certainty is estimated by taking the product of confidence estimators on the number of pulses n_p and the pulse-shape similarity. The more pulses, the more evidence supporting a repetition pattern. The pulse shape is calculated by summarizing the energy in the frequency domain. Let p_i be the shape of an individual peak, i.e. the frames of the cochleogram surrounding a peak. The size of p_i is given by the mean distance between peaks. The shape similarity is calculated by comparing the individual pulse shapes

p_i with the mean pulse shape p_µ, see formula 3.10. Shape comparison is based on the positive correlation ρ.

\[
\text{shape similarity} = \max\bigl(\{\,1 - \rho(\vec{p}_i, \vec{p}_\mu),\ 0\,\}\bigr) \tag{3.10}
\]

Repetitions
property           parameters                       unit
shape similarity   sigmoid with µ = 0.5, σ = 0.2
number of peaks    sigmoid with µ = 8, σ = 2
duration           sigmoid with µ = 0.2, σ = 0.01   seconds
peaks per second   sigmoid with µ = 0.5, σ = 0.05   seconds^-1

Table 3.4: The confidence estimation parameters of the repetitions property.
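
A sketch of formulas 3.8 and 3.9, assuming the smoothing scope has already been chosen (for example by the entropy minimization sketched at the end of section 3.3.1). The frame-rate argument and the default peak picking are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.signal import find_peaks

def repetition_frequency(patch, frame_rate, scope):
    """Formula 3.9 (sketch): median over channels of peaks-per-second.

    `patch` is a pulsal patch (frequency x time), `frame_rate` the number of
    cochleogram frames per second and `scope` a smoothing scope, e.g. the one
    chosen by entropy minimization (formula 3.8).
    """
    smoothed = uniform_filter(patch, size=(scope, 1))      # formula 3.8, fixed scope
    duration = patch.shape[1] / frame_rate                 # patch duration in seconds
    peaks_per_second = []
    for channel in smoothed:                               # one frequency segment at a time
        peaks, _ = find_peaks(channel)
        peaks_per_second.append(len(peaks) / duration)
    return float(np.median(peaks_per_second))
```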

3.3.4 Tonal properties

What signal properties can be calculated from a tonal texture?


Figure 3.7: Examples of tonal textured cochleograms. (1) is suitable for the isolated tone property. (2) is suitable for the modulated tone property. (3) is suitable for the sheared tone property. (4) is suitable for the tone complex property.

Isolated tone

The first tonal property is the isolated tone. It estimates properties of a single tone in a patch. The tone is estimated as being the mean patch frequency, see formula 3.11.

\[
\langle\text{tone frequency}\rangle = \operatorname{mean}(\vec{f}\,) \tag{3.11}
\]

Certainty is estimated by taking the product of confidence estimators on the similarity with a Gaussian distribution, the frequency range and the duration of the patch. The shape of the tone γ, see formula 3.12, is compared to the shape of a Gaussian distribution G with the same mean µ_γ and standard deviation σ_γ. The comparison is performed using the root mean squared error (RMSE) of the difference between the two shapes, see formula 3.13. This gives an indication of how clean the tone is, even though perfect tones have no Gaussian shape.

\[
\vec{\gamma} = \sum_{t=1}^{n_t} \xi_{t,f} \tag{3.12}
\]
\[
\text{RMSE} = \sqrt{\operatorname{mean}\bigl((\vec{\gamma} - G(\mu_{\vec{\gamma}}, \sigma_{\vec{\gamma}}))^2\bigr)} \tag{3.13}
\]

Isolated tone
property        parameters                               unit
RMSE            inverse sigmoid with µ = 0.4, σ = 0.08
duration        sigmoid with µ = 0.2, σ = 0.08           seconds
segment count   inverse sigmoid with µ = 15, σ = 0.5

Table 3.5: The confidence estimation parameters of the isolated tone property.
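
A sketch of the Gaussian-shape check of formulas 3.12 and 3.13, under one possible reading: the tone shape γ (energy summed over time) is normalized and compared, over the channel index, to a Gaussian with the same weighted mean and standard deviation. The normalization is my own choice.

```python
import numpy as np

def isolated_tone_rmse(patch):
    """Formulas 3.12-3.13 (sketch): how Gaussian-shaped is the tone's spectrum?

    `patch` is a tonal patch (frequency x time).  The shape gamma is the energy
    summed over time; mean and standard deviation are taken over the channel
    index, weighted by gamma (one plausible reading of the thesis).
    """
    gamma = patch.sum(axis=1)                         # formula 3.12
    gamma = gamma - gamma.min()                       # keep the weights non-negative
    gamma = gamma / gamma.sum()
    idx = np.arange(len(gamma))
    mu = np.sum(idx * gamma)                          # weighted mean channel
    sigma = np.sqrt(np.sum(gamma * (idx - mu) ** 2))  # weighted standard deviation
    gauss = np.exp(-((idx - mu) ** 2) / (2.0 * sigma ** 2))
    gauss = gauss / gauss.sum()
    return float(np.sqrt(np.mean((gamma - gauss) ** 2)))   # formula 3.13
```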

Modulated tone

The second tonal property is the modulated tone. It analyzes amplitude modulation in a patch ξ_{t,f} with time/frame t and frequency/segment f. First the signal inside the patch is smoothed using entropy minimization on the time between peaks (g = peakdistances), see formula 3.14. Second, the frequency with the highest cumulative energy (f_max) is calculated, see formula 3.15. The depth of the modulation is estimated by comparing the median energy at the peaks, peaks(ξ_{t,f}), and the median energy at the valleys, valleys(ξ_{t,f}), on segment f_max, see formula 3.16.

\[
\xi'_{t,f} = A\!\left(\xi_{t,f},\ \min_{\phi} H\bigl(\text{peakdistances}(A(\xi_{t,f}, \phi))\bigr)\right) \tag{3.14}
\]
\[
f_{\max} = \max_{f} \sum_{t=1}^{n_t} \xi'_{t,f} \tag{3.15}
\]
\[
\langle\text{modulation depth}\rangle = \operatorname{median}\bigl(\text{peaks}(\xi'_{t,f_{\max}})\bigr) - \operatorname{median}\bigl(\text{valleys}(\xi'_{t,f_{\max}})\bigr) \tag{3.16}
\]

Certainty is estimated by taking the product of confidence estimators on the positive peak shape correlation and the number of peaks. Let p_i be the shape of an individual modulation peak, i.e. the frames of the cochleogram surrounding a peak. The size of p_i is given by the mean distance between peaks. The shape similarity is calculated by comparing the individual pulse shapes p_i with the mean pulse shape p_µ, see formula 3.17.

Shape comparison is based on the positive correlation ρ.

\[
\text{shape similarity} = \max\bigl(\{\,1 - \rho(\vec{p}_i, \vec{p}_\mu),\ 0\,\}\bigr) \tag{3.17}
\]

Sheared tone

The third tonal property is the sheared tone. It analyzes the frequency increase or decrease of a tone. It estimates the angle of the shear of the tonal ridges in the cochleogram.
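
The sheared tone description ends here in this excerpt. As a hedged illustration of how a shear angle could be estimated with the regression symbols β0, β1 and ε from table 3.2, the sketch below fits a line to the per-frame peak frequency of a tonal patch; the use of a least-squares fit is an assumption, not a confirmed detail of the implementation.

```python
import numpy as np

def sheared_tone(patch, center_freqs, frame_rate):
    """Estimate the frequency slope of a tonal ridge by linear regression.

    `patch` is a tonal patch (frequency x time), `center_freqs` the channel
    center frequencies (Hz) and `frame_rate` the frames per second.  Returns
    the regression base beta0, slope beta1 (Hz per second) and the residual
    error epsilon, echoing the symbols of table 3.2.
    """
    times = np.arange(patch.shape[1]) / frame_rate
    ridge = center_freqs[np.argmax(patch, axis=0)]     # dominant frequency per frame
    beta1, beta0 = np.polyfit(times, ridge, deg=1)     # slope and base of the ridge
    epsilon = ridge - (beta0 + beta1 * times)          # error between regression and data
    return beta0, beta1, epsilon
```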
