
Comparing the Accuracy and Workings of Automatic Speech Recognition and Human Speech Recognition in Noisy Environments

Tineke Slotegraaf

Student number: 10458581
Supervisor: Prof. Dr. Bart de Boer
Number of words: 9566


Abstract:

This literature review reports findings of the past decade on human speech recognition (HSR) and automatic speech recognition (ASR). The aim of the review is to find out whether the fields of HSR and ASR have started to work more closely together, have learned from each other, and have incorporated the recommendations for improvement by Moore and Cutler (2001).

Insights in the field of HSR are mainly based on differences between human listeners and the observation that humans do not perform at 100% accuracy when listening in noise. Unfortunately for this review, most ASR improvements came from large companies; these systems are not reported in papers but hidden in patents. Because these system improvements are difficult to find and to understand, they are not covered in this review.

The fields of HSR and ASR can still learn more from each other through closer cooperation and communication. They can report their findings in similar ways, and discuss how the other field can help them find new insights.

Introduction

The technologies we can use to communicate with each other through written words have developed rapidly in the past decades. First, texts were written by hand, then with a typewriter, and eventually on a computer. The main improvements of the latter are the ability to change the text without having to start all over again and the spellcheckers that help you correct your mistakes. The latest improvement is that you do not even need to type the text; you can dictate it to the computer, just like the bosses of old did to their secretaries.

Just a couple of years ago, software that converted speech into written words needed to be trained on your voice and could only be used in a room with as little background noise as possible. The main reason to use the software was to prevent Repetitive Strain Injury from typing too much. The newer systems use less working memory and run on a smartphone as well as on a large desktop computer. But how good are the speech recognition systems of today? This literature review gives an overview of the newer systems to try to answer this question.

To get an idea of how well systems perform, they are compared to the best example of a speech recognition system: humans. We listen to speech all the time and most often this goes well. The comparison of human speech recognition (HSR) and automatic speech recognition (ASR) has been made before (Moore & Cutler, 2001). This review presents some new insights from both fields and examines whether the advice for improvement in these fields was followed or not.

Moore and Cutler (2001) described ASR as either a small system that works with only a few voice commands or a large system that only works on a computer. These systems were based on Hidden Markov Models (HMMs) from the 1980s, with some additional algorithms to improve performance. I will describe these models in more detail in the chapter about ASR systems. HSR research is based on task analysis. It is harder to find out what the different parts of the brain do, since we cannot switch a part of the system off to see how the rest functions without it (Moore & Cutler, 2001). This results in models of speech recognition that are incomplete. The tasks used to measure speech recognition mostly require participants to repeat the words or sounds they heard, and someone other than the researcher testing the participants scores the performance as correct or incorrect.


The chapter on HSR focuses on the differences between humans, which show that humans do not always perform at 100% accuracy, as is assumed in the literature (Moore & Cutler, 2001).

Moore and Cutler (2001) concluded that the fields of ASR and HSR can help and learn from each other. ASR research also relies on models, but these do include what is not known. They use statistics to adapt to new speakers and/or environments, which can be helpful in cochlear implants as well. On the other hand, these parameters 'forget' old settings when adapting. Maybe HSR can give insights that help ASR systems keep functioning in previously learned settings after having adapted to a new setting.

The biggest problem in comparing the two systems is that they report different measurements. ASR research reports recognition accuracy, while HSR research on healthy individuals assumes humans have an accuracy of 100%. To get ASR systems to work at the same accuracy as humans, researchers tried to use more training data. When a computer is given the same amount of speech a 10-year-old child has heard (about 10,000 hours), the error rate drops from 20% (at 140 hours of training) to 12% (Moore & Cutler, 2001; Moore, 2003). This shows that 'just using more training' is not enough to make ASR systems perform at HSR level. There is a need for more structured models that better exploit the information available in the data.

To get better models of how humans do what they do (from seeing an image to reasoning deductively, but also understanding speech), new technologies like brain imaging are used. It is still not really possible to shut one part of the brain off and see the effect, but imaging does give insight into which parts work together and what the different parts do. Since neuroimaging research on hearing and understanding speech is very new, there is no separate chapter on it. The next section briefly presents some findings and explains how they can help our understanding of HSR.

Speech recognition in Neuroscience

The first technique is the use of animal models that can explain a phenomenon found in humans. For example, seeing edges of surfaces works the same in a cat as in a human. With animals, more invasive measurements are possible than with humans: it is easier to get ethical approval to place electrodes inside the brain of an animal (which gives more accurate data) than inside that of a human when you want to see which neurons are active during a given task. These findings are then used for further research in humans, to see whether the same brain areas are involved and whether the animal models are accurate.

Frisina and Frisina (1997) used animal models of presbycusis (age-related hearing loss) to form their hypothesis that the same brain areas are involved in presbycusis in humans. This was indeed what they found: as people get older, the sensitivity to speech-frequency pure tones is reduced, which causes hearing loss. This finding can help in building a model of speech recognition in aging humans.

When we want to look at the activity of the human brain we cannot insert electrodes to record the electrical signals from the neurons directly. Instead, we can measure the signals that reach the scalp with electroencephalography (EEG), or we can use a larger machine that measures the magnetic signals from the brain with magnetoencephalography (MEG). Using the latter technique requires a shielded chamber that no other magnetic signals can reach, because the signals from the brain are very weak compared with those of, for example, an elevator. Both techniques can measure changes in brain activation within milliseconds and are therefore very fast, but they cannot give a very accurate picture of where in the brain the signal is coming from.

An example of MEG research in the HSR field is a study of degraded speech (spoken numbers) and short-term memory (Obleser, Wöstmann, Hellbernd, Wilsch & Maess, 2012). They looked at the oscillation frequency of the signals, which indicates that brain areas are working together in the task. The only brain area they found that was already associated with speech recognition is important in degraded hearing and aging. This means the finding can help improve the model of aging and hearing loss, as well as extend that model with other brain areas that work together to solve the problem of degraded speech.

An even bigger machine takes images of the brain using large magnets to see which brain areas receive more blood at a given moment, which means that area was active. This functional Magnetic Resonance Imaging (fMRI) is very precise about which area was active, but every scan takes seconds. An example of the use of fMRI in HSR research is a study of listening to speech and non-speech in increasingly difficult listening conditions (Erb, Henry, Eisner & Obleser, 2013). They found an executive network that was active for both speech and non-speech and adapted rapidly, which can be used to extend a model of speech recognition in noise.

So far, these techniques only show which brain areas are involved in different tasks involving speech recognition. This will start to be useful for ASR research once we understand better what these areas do precisely and how they work together, which may take another decade of research in this field.

Research Questions

Using the papers discussed in the first part of this introduction as a view of how ASR worked a decade ago and what was then considered the most important information from HSR (Moore & Cutler, 2001; Moore, 2003), this paper will discuss what else has been found in both fields since then. The questions this review will answer are: (1) How much better are humans at understanding speech in noise compared to computers? (2) Do humans perform at 100% accuracy in noisy conditions, and why (not)? (3) Did the field of ASR take the advice of Moore and Cutler (2001) and develop new or different algorithms to increase the accuracy of speech recognition (in noise)?

To answer these questions it is important to first discuss what type of noise and speech this review will focus on, as well as what normal hearing conditions are. Next, new findings in HSR and in ASR are discussed. Once both fields are described, a comparison is made to see where we stand and how well both fields perform. The review ends with a personal opinion about improvements for both fields and a brief conclusion.

What are normal hearing conditions?

In our everyday lives, we encounter different hearing conditions. We can listen to a presentation, to someone talking on the phone, or try to have a conversation in a club. When we see the person we are listening to, we use body language and lip-reading to help us understand what is said. On the phone we can understand speech even though we cannot rely on body language and the sound is distorted. This distortion of the voice, as well as background noise from cars, other people talking, or music, are examples of adverse listening conditions (ACs); speaking with an accent or speech disorders count as ACs as well (Mattys, Davis, Bradlow & Scott, 2012). Many ACs can cause failure to recognize phonetic features and lexical representations. These are the conditions this paper will focus on.

In research, people listen to speech in noise to find out which noises affect our comprehension of speech most, or which sounds are most likely to intermingle with speech. Most researchers record speech and noise separately, so they can adjust the signal-to-noise ratio (SNR) (Cutler, Weber, Smits & Cooper, 2004; Rogers, Lister, Febo, Besing & Abrams, 2006; Hygge, Rönnberg, Larsby & Arlinger, 1992; Strait, Parbery-Clark, Hittner & Kraus, 2012; Ruggles, Freyman & Oxenham, 2014). Other research found that people adjust their voices when speaking in noisy environments, called Lombard speech (Hansen, 1996; Valentini-Botinhao, Maia, Yamagishi, King & Zen, 2012). This means that just changing the SNR might not replicate normal hearing conditions when the SNR is low, that is, when the noise is very loud compared to the speech signal. Hearing accuracy in real-life conditions with loud noise present might be higher than research suggests, because we adjust our voices to make it easier for other people to understand what we say.
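As an illustration of how such stimuli are typically prepared, the sketch below mixes a speech recording with a noise recording at a chosen SNR. It is a minimal example in Python with made-up placeholder signals; the cited studies used real recordings and their own tooling.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`,
    then add it to the speech signal (both are 1-D float arrays)."""
    # Match lengths by tiling and trimming the noise.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Required noise power for the target SNR: SNR_dB = 10 * log10(Ps / Pn).
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + noise

# Example: at 0 dB SNR, speech and babble carry equal power.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder for a real speech recording
babble = rng.standard_normal(16000)   # placeholder for a babble recording
mixture = mix_at_snr(speech, babble, snr_db=0)
```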

Since there are different kinds of noise and different kinds of speech, the next sections give an overview of these kinds. They also define the kind of noise and speech the rest of this review will focus on.

What is noise?

In the broadest sense, noise is everything apart from the signal you are trying to pick up, whether it is that one person you are looking for in a crowd and suddenly spot, or the voice of the friend you are listening to in an environment filled with traffic noise. In this review, the focus is on acoustic noise. These noises can be divided into several categories: white noise, pink noise, Gaussian noise, and multiple-speaker babble noise. Each will be explained briefly, and some examples of the use of these noises in speech recognition research will be given after the explanations.

Sound is made up of waves. The frequency of a wave gives the pitch of the sound, or how high the sound is. All frequencies together form a spectrum. If all frequencies between 20 Hz and 20,000 Hz (the hearing range of humans) are used to create a sound, and the energy is randomly divided over all these frequencies, we call it white noise. This noise can be created on a computer and sounds like a very long /sh/ sound. It does not interfere much with speech, since speech uses a variety of frequencies that change a lot, but if it is loud enough it will make it hard to hear the source one is listening to.

Pink noise is more frequently present in everyday life than white noise. It still spans the entire hearing range, but since higher sounds need less energy to be equally loud, the energy decreases with each octave in the spectrum. It is present in almost all electric devices and does not interfere with speech if it is not too loud, for the same reasons that white noise does not interfere with speech.
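To make the two noise types concrete, the sketch below generates white noise and pink noise on a computer. It is a minimal illustration: the shaping of the spectrum by 1/f for pink noise is one common approach, and the sample rate and durations are arbitrary choices.

```python
import numpy as np

def white_noise(n_samples, rng):
    """White noise: equal average power at every frequency."""
    return rng.standard_normal(n_samples)

def pink_noise(n_samples, rng, sample_rate=44100):
    """Pink noise: power falls off as 1/f (3 dB less energy per octave),
    generated here by shaping the spectrum of white noise."""
    spectrum = np.fft.rfft(white_noise(n_samples, rng))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    freqs[0] = freqs[1]                   # avoid dividing by zero at DC
    spectrum /= np.sqrt(freqs)            # 1/sqrt(f) in amplitude = 1/f in power
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.max(np.abs(noise))  # normalise to [-1, 1]

rng = np.random.default_rng(0)
two_seconds = 2 * 44100
white = white_noise(two_seconds, rng)
pink = pink_noise(two_seconds, rng)
```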

Gaussian noise can be visualized as a normal curve: it has a lot of energy in the middle of the spectrum and little energy at the ends. When created on a computer it will interfere with speech more than white or pink noise, since we use the middle of our hearing range for speech. When it occurs naturally it is often in spectra we cannot hear, like waves from the sun, but it can interfere with computer networks or communication channels.

Multiple-speaker babble noise comes from other people speaking at the same time as the person you are listening to. We all know this noise from being in a pub, on the street, or in public transport. It is thought to be the most difficult noise, since its spectrum is the same as that of the source you are listening to and it contains similar sounds. If the language is the same as the one you are listening to, lexical interference is also possible: you hear a word in the noise that interferes with the conversation you are trying to follow.

For steady noises (white, pink, and Gaussian), Mishra and Lutman (2014) found that we can block them out after a while, so that we can focus on the sound that (might) need our attention. There is a brain structure that helps in unmasking the speech signal in noise (the medial olivocochlear system), but they could not tell whether this structure is also at work without artificial activation.

With multiple-speaker babble, listening to a speech signal becomes easier when more people are talking; the worst performance was found when two talkers were in the noise signal (Rosen, Souza, Ekelund & Majeed, 2013). When more people speak, it is more likely that the whole spectrum is filled with noise, which causes less lexical interference with the speech signal. In that case the babble becomes more and more like Gaussian noise.


In research on speech recognition in noise, a noise signal and a speech signal are mixed in the lab at different signal-to-noise ratios (SNRs). To restrict this review, the research discussed focuses on multiple-speaker babble noise, since it is the hardest noise to handle for both humans and automatic systems.

What is speech?

Speech is made of sound waves and we use it to communicate. In the literature, speech is divided into several areas of interest: the sounds of letters, words, and sentences. The loudness of speech is between 55 and 80 dB (Dirks, Morgan & Dubno, 1982); a noise signal should be in that same range to interfere with the signal. Too soft and there is no problem understanding the speech signal; too loud and the speech signal will not be heard at all.

In research on speech sounds, researchers often use two phonemes and change their order to see if we can distinguish between similar sounds (e.g., /f/ and /θ/ in English). This has been used in research with humans, for example by presenting two phonemes and testing whether they are recognized both syllable-initially and syllable-finally (Weber & Smits, 2003). They found that (in multiple-speaker babble noise) stops and fricatives were the hardest to distinguish and glides and liquids the easiest. None of this research was performed on automatic speech recognizers.

Sounds are rather difficult to distinguish in noise, partly because in normal conditions we do not need to: the sound is embedded in other sounds to form a word. Looking at words from a lexical point of view, Becker, Nevins, and Levine (2012) found that monosyllables are more protected in a language, but that this protection could not be learned in an artificial language. Protection here means that the sound does not change much when the word goes from singular to plural, which should make it easier to recognize the right word.

Even more than on isolated words, our speech perception relies on sentences. There is a context, so that when we did not hear a word correctly, or missed part of it because of intervening noise, we can still fill in the blanks. In research, sentences are rarely used. In automatic systems it is very difficult to build a complete semantic model of a language that can fill in the blanks; in humans, it is more interesting to look at what goes wrong when listening to sounds or words.

Although looking at the comprehension of spoken sentences in noise seems like the best way to answer the research questions of this review, more research (in both fields) has been done on understanding words. The review will therefore focus on recognizing spoken words in multiple-speaker babble noise.

Human Speech Recognition

Age

Children need years to completely master a language: to gather a vocabulary of their own as well as to learn the rules of syntax. It starts with producing sounds to imitate what they hear, which become words and eventually sentences with syntax. After about 10 years (or roughly 10,000 hours of heard speech) children master their mother tongue (Moore & Cutler, 2001; Moore, 2003), but even then they keep learning about irregularities in the language as well as new words. Learning new words in one's mother tongue happens throughout life, since the language itself evolves and sometimes new jargon needs to be learned. Most ASR systems aim to work at the level humans reach around the age of ten.

It not only takes us long to learn a language. When we get older and still have our vocabulary and rules of syntax, we lose some of our hearing ability, which makes it harder to understand speech. This hearing loss is often caused by presbycusis and often starts around the age of 75, which more and more people reach in our aging society (Ciorba, Bianchini, Pelucchi & Pastore, 2012). Deafness is present in about 55% of the elderly population, but only 14% use hearing aids (Ramdoo, Bowen, Dale, Corbridge, Chatterjee & Gosney, 2014). Because of this hearing loss, the ability to recognize speech also deteriorates.

When the performance of speech recognition is tested, or when computer performance is compared to human performance, these differences by age are usually not taken into account.

Non-native listeners

Non-native listeners are people who learned a new language later in life and are listening to that language. Examples are the use of English at universities in countries where English is not the official language, or immigrants (e.g., from Latin America to the U.S.A.). These listeners can be fluent in their new language, but are still affected more by disadvantageous listening conditions (e.g., listening in a noisy environment) than native listeners are (Cutler, Weber, Smits & Cooper, 2004; Rogers, Lister, Febo, Besing & Abrams, 2006). Despite these findings, little research has been done on bilinguals and speech perception in noise.

Cutler et al. (2004) looked at the understanding of consonant-vowel and vowel-consonant syllables in American English by both native American-English listeners and Dutch students who were fluent in English. The syllables were masked by multiple-speaker babble (six speakers) at three different SNRs (0, 8, and 16 dB), where the signal is louder than the noise if the number is higher than zero and both are equally loud at an SNR of 0 dB. They found that the non-native listeners always performed worse than the native listeners, no matter the SNR. Between 0 and 8 dB SNR there was a significant difference in performance for both consonants and vowels, while between 8 and 16 dB SNR the difference was only significant for consonants. The position of the sound in the syllable did not affect this outcome.

Rogers et al. (2006) tested people who originally spoke Spanish but learned English before the age of six and speak English without a Spanish accent. Early bilinguals seem to have many advantages from being bilingual, like better problem-solving skills and better performance in tasks involving memory or inhibition of attention, and they are perceived as native speakers of their second language. Rogers et al. (2006) found that performance (repeating back monosyllabic words) was equal in quiet, but that the bilinguals performed worse at every SNR tested (-6, -2, 0, 2, and 4 dB).

Cutler et al. (2004) concluded: "non-native listening is at all processing levels slower and less accurate than native listening." By adding the multiple-speaker babble noise, all processing is slowed down from the beginning, and it is possible that this leads to exceeding the thresholds of auditory memory storage. Rogers et al. (2006) showed that this is a problem for all bilingual speakers, even those who seem dominant in their second language. These people can be helped by counselling and by acknowledging that, compared to native speakers, they will have more difficulty hearing in different environments as they get older.

Hearing aids

Another difference between humans is the ability to hear properly. Elderly people get hearing problems, but hearing problems can also stem from a birth defect or from damage to the ears caused by listening to loud sounds or music too often. For these people there are hearing aids or hearing implants that amplify the sounds and help them hear better. This is probably the field that can learn most from automatic speech recognition software. The remainder of this section is about the differences between normal-hearing people (NH) and hearing-impaired people (HI), as well as how the HI can be helped with their speech recognition.

For NH it is quite easy to block out noises as long as they do not resemble speech (whether played backward or forward) and are not too loud. Even when the noise is babble, most of the conversation will be understood. For HI the kind of noise does not matter; it will interfere with the speech anyway. HI do not benefit from the amplitude variations that occur in natural speech (Hygge, Rönnberg, Larsby & Arlinger, 1992).

Normal speech is around 60 dB and often takes place in a room with noise. Dirks et al. (1982) found that to perform at 50% accuracy, NH require an SNR of -3 dB (the noise is 3 dB louder than the signal), while HI require an SNR of +12 dB (the signal is 12 dB louder than the noise). They used only one setting, but this huge difference does show that HI perform much worse in a noisy environment than NH.

One way to improve speech recognition in HI is "clear speech", which includes slower talking. This is mostly used when one knows the other has hearing problems, or when we are in a very noisy situation, for example at a concert (Grynpas, Baker & Hazan, 2011).

Another way to improve speech recognition, in both HI and NH when listening in a noisy environment, is iconic gestures. We automatically make these gestures with our hands or posture when talking, to help the listener understand us. If you show the video images without sound, however, it is hard to figure out what the gestures mean. Since the increase in speech recognition is greater than the sum of sound and gestures alone, it seems that the two sources are fused into a new, integrated percept (Holle, Obleser, Rueschemeyer & Gunter, 2010).

The best solution to help HI improve their speech recognition is a well-working hearing aid, but hearing aids amplify both speech and background sounds (Bronkhorst, 2000). Using the head shadow (the sound has to travel around the head to reach the second ear, so there is a tiny delay), NH gain 7 dB in SNR, but HI gain only 0-2 dB. This means that for HI there is almost no difference, but for NH the signal effectively becomes 7 dB louder than it actually is, which improves recognition. Hearing aids are now being developed that make better use of where the signal comes from in order to block noise from other directions. These findings might also help automatic speech recognition to filter speech from noise.

Musical Training

When people (mostly children) learn to play a musical instrument, they first learn to recognize pitch. After a while they know their instrument well enough to play with other musicians, where everyone plays a different part, and they learn to listen for who is playing the melody or bass line (because these parts should be played a little louder than the rest). Finding a particular line in the music can be viewed as listening in a noisy condition. Researchers do not agree on the benefit of musical training for speech recognition in noise.

In research with children, there was a significant difference between those who had musical training and those who had not while listening in a multiple-speaker babble of six people (Strait et al., 2012). There was no difference when listening in quiet. The musical training might modulate the auditory brainstem, which is also involved in language experience, and strengthen the neural and cognitive underpinnings of speech perception. This research involved children around the age of ten who chose (or were made to by their parents) to play music, so there is always the chance that a pre-existing difference in the brainstem caused these children to start playing an instrument and keep playing it.

Another possibility is that it depends on the kind of instrument one plays; violins and woodwinds are really different. This was a conclusion of Ruggles et al. (2014), who investigated voiced sentences and whispered sentences. They did not expect to find a significant effect for the latter, while the literature suggested they would find one for the former (since voicing and pitch are related). Their finding can also be explained by the fact that all people are trained to listen to speech in noisy environments; musical training might not be as beneficial as first expected.


Conclusion

This section showed that people differ in their ability to listen to speech. The differences are small to non-existent in silence, but can be quite large in noise, especially in multiple-speaker babble.

We might want to think of automatic speech recognition systems in the same way. Maybe they act more like non-native listeners that need to do more computations to understand the speech. Or the systems might act like the hearing impaired who cannot distinguish the signal from the background as well as normal hearing people.

To get a better view on this I will explain some automatic speech recognition systems in the next section.

Automatic Speech Recognition

Difficulties

The introduction of this paper suggested that automatic systems perform worse than humans at speech recognition, and that more training seems to have little effect on performance (Moore & Cutler, 2001; Moore, 2003). A first step is to find out why it is hard for computers to recognize speech, especially in noise. There are several reasons: humans also use their eyes to read posture and other body language; a human voice is not static but changes when emotions are present; there are huge differences between voices (child vs. adult, men vs. women); and to be recognized, a word has to be in the lexicon.

All of these reasons make it harder for an automatic speech recognition (ASR) system to understand speech. In such a system we would like to have as few parameters as possible, and these are fixed after the training phase. It is relatively easy to do this for one voice, but then someone else cannot use the same system. It is also relatively easy to make it recognize speech in silence, but a lot harder to filter the signal from the background noise.

When humans listen to speech, they use their model of the language to understand it. This model contains not only rules for forming sentences and a lexicon of words and their meanings, but also what we know about the subject of the conversation and the person talking. We use all this information to fill in blanks when listening. It is very hard to build such a model into a computer. Most difficult is dealing with homophones, words with different meanings that sound the same, such as tail vs. tale (Lakra, Prasad, Sharma, Atrey & Sharma, 2012).

In this chapter, older and newer ASR systems are reviewed to show how they work and what has changed in the past decade. Unfortunately, many of the new technologies were not reported by research groups in scientific papers, but by large companies who have hidden them in difficult patents. Those patents will not be discussed in this review.

Old systems

Old systems often use a Hidden Markov Model (HMM) for speech recognition. These models are also often used in other pattern recognition systems, such as recognition of written numbers and letters, because they work well on those tasks. In an HMM only the outcome is known to the user, not how that outcome was derived. It is like drawing balls from a couple of bowls: you know how many balls of each colour are in each bowl and how many bowls there are, but when someone randomly picks balls from the bowls and only shows you the outcome, you do not know which ball came from which bowl. The model's parameters are set in a training phase with feedback, and the model is then tested on a different set to see how well it performs.
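As a concrete illustration of the bowl analogy, the sketch below decodes the most likely sequence of hidden "bowls" for an observed sequence of ball colours using the Viterbi algorithm. The probabilities are invented for illustration; real ASR systems work with continuous acoustic features and far larger models.

```python
import numpy as np

# Toy discrete HMM: 2 hidden states (the "bowls"), 3 observable symbols (ball colours).
start = np.array([0.6, 0.4])                 # P(first state)
trans = np.array([[0.7, 0.3],                # P(next state | current state)
                  [0.4, 0.6]])
emit  = np.array([[0.5, 0.4, 0.1],           # P(symbol | state)
                  [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Return the most likely hidden-state sequence for an observed symbol sequence."""
    n_states, T = len(start), len(obs)
    logp = np.full((T, n_states), -np.inf)   # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans[:, s])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]] + np.log(emit[s, obs[t]])
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2, 2]))   # most likely bowl sequence for these ball colours
```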


Hansen (1996) recognized that most ASR systems do not perform well in noise and that this might be due to the Lombard effect: we speak differently when noise is present, so that other people still understand what we say. This could decrease recognition by 58%, so that only 30.3% was recognized where the ASR first recognized 88.3%. The algorithms meant to enhance speech actually cancel out the noise and therefore also the Lombard speech characteristics in these noisy situations.

To solve this problem, Hansen (1996) developed additional algorithms to work with HMMs and enhance speech with front-end processing approaches. He made a feature enhancement with formant-based stress equalization for keyword recognition, and a feature enhancement with compensation for noise and stress in the signal. He showed that these new features, added to an HMM, contributed to significantly better recognition performance in adverse listening conditions.

When a transcription of a spoken text is saved to make the document easier to search with a query, another problem arises: the terms in the query and the document might not be exactly the same. Crestani (2003) suggests that systems need a thesaurus to overcome this problem. The differences between the original text and the query might be either semantic or phonetic. Because of this, phonetic similarity estimation via a phone confusion matrix (a matrix of all phones and how often two phones are mistaken for each other) might be a very useful tool to find the word that was spoken, given that it differs from the word recognized. This was only a computational approach, but if it can be implemented it will most likely improve existing ASR systems.
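A minimal sketch of this idea is given below, using a made-up confusion matrix over three phones. A real system would estimate the matrix from recognizer errors over a full phone set and would first align phone sequences of different lengths.

```python
# Toy phone confusion matrix: confusion[a][b] is the probability that the
# recogniser outputs phone `b` when phone `a` was actually spoken.
confusion = {
    "t": {"t": 0.80, "d": 0.15, "k": 0.05},
    "d": {"t": 0.20, "d": 0.75, "k": 0.05},
    "k": {"t": 0.10, "d": 0.05, "k": 0.85},
}

def phone_similarity(spoken, recognised):
    """Average confusion probability between two equal-length phone sequences."""
    assert len(spoken) == len(recognised)
    probs = [confusion[a].get(b, 0.0) for a, b in zip(spoken, recognised)]
    return sum(probs) / len(probs)

# A spoken /t d/ recognised as /d t/ still gets a non-zero similarity,
# because /t/ and /d/ are often confused with each other.
print(phone_similarity(["t", "d"], ["d", "t"]))   # 0.175
```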

New systems

In the new systems developed today, HMMs still play their role, mostly with additional software, but some new, more flexible networks have also been developed. This section discusses some recent papers on ASR. Unfortunately, these are not the top new developments, since those are found in patents from large companies like Apple and Microsoft.

One new way to set the parameters of an HMM is by using fuzzy logic, which uses not only the values 0 and 1 (or FALSE and TRUE), but also everything in between (e.g., 0.15 or 0.80). These numbers are estimated from a lot of training data and they make the parameters less fixed. Lakra et al. (2012) used fuzzy logic to tackle accents, speed of pronunciation, and emphasis in speech. It still needs to be implemented to find out how well this works. If it works, the system will be more robust and able to listen to different voices both in quiet and in stressful situations (like in background noise).
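To illustrate what "everything in between 0 and 1" means in practice, the sketch below assigns fuzzy membership degrees to a speaking rate. The categories and breakpoints are invented for illustration and are not the rules used by Lakra et al. (2012).

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: 0 outside [a, c], rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def speaking_rate_memberships(syllables_per_second):
    """Degree to which a speaking rate counts as slow, normal, or fast.
    The breakpoints are made up for illustration only."""
    x = syllables_per_second
    return {
        "slow":   triangular(x, 0.0, 2.0, 4.0),
        "normal": triangular(x, 3.0, 5.0, 7.0),
        "fast":   triangular(x, 6.0, 8.0, 12.0),
    }

# 3.5 syllables per second is partly "slow" and partly "normal" at the same time,
# instead of being forced into exactly one category.
print(speaking_rate_memberships(3.5))
```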

There has been a lot of research on speech recognition systems that listen to English sentences. Building a system for a "new" language can be really expensive, so Schultz, Vu, and Schlippe (2013) built a database with 20 languages to find similar phonemes. These phonemes are the most common phonemes in the world and can be used to implement a new language or dialect. Schultz et al. (2013) built a corpus of audio-speech data (100 sentences from 100 adults per language), corresponding transcripts, pronunciation dictionaries of the transcripts, and baseline n-gram language models, using an HMM. Users can build an application using online speech recognition and synthesis in a talk-back function; this serves as a database of recorded speech to which the users do need to add a transcript. Another necessity is a vocabulary list, so the system knows whether a sequence of phonemes is a word or not. The output for a new language is calculated by a multi-layer HMM trained on 12 languages. Building a system for your own language this way will help produce better systems and make them available to more people.
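For readers unfamiliar with n-gram language models, the sketch below estimates a bigram model from a toy corpus with add-one smoothing. The corpus and the smoothing choice are illustrative only and are not those used in the GlobalPhone work.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Bigram probabilities with add-one smoothing, estimated from a tiny corpus."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        for prev, word in zip(words, words[1:]):
            bigrams[prev][word] += 1
    vocab_size = len(unigrams)

    def probability(prev, word):
        # P(word | prev) with add-one smoothing over the vocabulary.
        return (bigrams[prev][word] + 1) / (unigrams[prev] + vocab_size)

    return probability

p = train_bigram_model(["the cat sat", "the cat slept", "a dog sat"])
print(p("the", "cat"))   # higher than p("the", "dog"), since "the cat" was seen twice
```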

A deep neural network (DNN) is a neural network with hidden layers. The nodes in these layers communicate with each other through weighted connections (the weight expresses how important the connection is), loosely modelled on how neurons in a brain communicate with one another. The hidden layers enable a DNN to compose features from the lower layers, which gives it the potential to model complex data with fewer nodes per layer and more flexibility. Wang and Wang (2014) used a DNN to separate monaural speech from non-stationary noise at low SNRs. As inputs for the DNN they made time-frequency representations for every frequency in the sound and extracted acoustic features; these representations are essentially visual plots of the sound. By using a window of frames rather than a single time slice, the system gains more flexibility.
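The sketch below shows the general shape of such a network: a small feedforward DNN in PyTorch that maps a window of spectral features to a time-frequency mask for one frame. The layer sizes, feature dimensions, and training target are assumptions for illustration and do not reproduce Wang and Wang's (2014) actual setup.

```python
import torch
import torch.nn as nn

n_freq_bins = 64          # frequency channels per frame (assumed)
context_frames = 5        # window of frames fed to the network (assumed)

# Hidden layers compose the input features; the sigmoid output is a per-bin
# mask in [0, 1] that keeps speech-dominated bins and suppresses noisy ones.
model = nn.Sequential(
    nn.Linear(n_freq_bins * context_frames, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, n_freq_bins),
    nn.Sigmoid(),
)

# One training step on made-up data: the target is an "ideal" mask computed
# from clean speech and noise, which a real system derives from its corpus.
features = torch.randn(32, n_freq_bins * context_frames)   # batch of 32 windows
ideal_mask = torch.rand(32, n_freq_bins)
loss = nn.functional.mse_loss(model(features), ideal_mask)
loss.backward()
```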

DNNs work well as long as the mismatch between the training and testing phase is not too large. They can be improved by a supervised speech-separation system that incorporates context at the feature level to decrease the word error rate. Narayanan and Wang (2014) tested such a system at SNRs between -6 and +9 dB, with noise that simulated a family living room. They found word error rates between 31.4% and 14.1%, which means that even with the signal being 9 dB louder than the noise, 14.1% of the words were not correctly recognized. To put this into perspective: in a study on life-log recording data, an error rate of 32% was considered acceptable, because people search through the recordings using keywords they use often (Seo, Kim, Song & Hong, 2014). Using that 32% error rate as a cut-off point, the system of Narayanan and Wang (2014) performs well in a noisy environment.
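Word error rate is the standard ASR metric referred to above. The sketch below shows how it is computed from an edit-distance alignment between a reference transcript and a recognizer hypothesis; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard edit-distance dynamic programme."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

# One deletion ("the") and one substitution ("off" -> "of") over 4 words: WER = 0.5.
print(word_error_rate("turn the lights off", "turn lights of"))
```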

To conclude, an example of how ASR can help in everyday life. This is a preliminary study for a new system in a hospital to help nurses with the clinical handover at the end of a shift (Dawson, Johnson, Suominen, Basilakis, Sanchez, Estival, Kelly & Hanlen, 2014). The handover is a conversation between nurses who take over a shift from one another. The nurse who is leaving talks about the patients: what has happened and who needs more attention. The nurse who is starting her shift makes notes, but those often cannot include all the necessary information. A transcript of the conversation makes it easier to look at the details again, and it can help in claims of not having received information about a patient. The nurses are involved in the implementation of the system to make sure it is helpful and will be used. Practice runs of the software in a noisy hospital show an accuracy of around 77%, which can be improved to 99% with just one hour of training on a specific voice. It will take about a year to fully incorporate the system in the hospital.

Is every system an ASR?

The systems reviewed so far in this chapter run on a computer and need lots of training data to perform well in a test. Some companies offer a transcription service, where speech is converted to text, but they do not seem to work with software. These companies jump into the large market for transcripts (for court orders, meetings, etc.). Liem, Zhang and Chen (2011) show that there are companies that reach an average accuracy of 99%, but these are expensive because humans do all the work. It is much cheaper to use an ASR, but Google Voice's voicemail-to-text has an accuracy of 78-86%, which is considered too low. One example of a good program is Dragon, which can reach an accuracy of 99% if the system is trained on one voice. The problem with Dragon is that it does not work as well in a meeting, where more than one person speaks.

Liem et al. (2011) went for a kind of hybrid of ASR and HSR. They created new software that uses a game environment to transcribe speech. The game was tested on students who had to transcribe short sound clips of 10 seconds; their score was based on how much their transcript resembled that of another student transcribing at the same moment. When these students agree, it is most likely that the transcript is correct. The accuracy was 96.6% and can probably be improved by giving players more time or by adding a feature to the game that lets players improve other people's transcripts. The latter would probably help to remove the slang and completed sentences that were not in the sound clip. This game comes close to the accuracy of people transcribing a whole conversation, but is much cheaper.


It seems that even though ASR systems improve, they still cannot compete with human transcribers. The next chapter makes a comparison between HSR and ASR performance and elaborates on what both fields can still learn from each other.

Comparing human speech recognition and automatic speech recognition

The previous chapters discussed literature on both human speech recognition (HSR) and automatic speech recognition (ASR). The basis for this review is found in two articles that compare the fields and make suggestions on how the comparison can be improved (Moore & Cutler, 2001; Moore, 2003). There has been only one other, more recent comparison of the two fields, a literature review (Scharenborg, 2007). The field of HSR works with tests based on reaction times, correctness, and brain scans, performed in different hearing conditions, and reports what goes wrong but not how well people perform. The field of ASR works with large training sets and a test, mostly in similar hearing conditions and focusing on a particular speaker, and only reports percentages correct, but not what goes wrong. In other words, the improvements proposed by Moore and Cutler (2001) for better cooperation between the fields of HSR and ASR, by presenting the same kinds of findings, do not seem to have been implemented.

Scharenborg (2007) made another attempt to show the fields of HSR and ASR that they can learn from each other. In HSR there has been an attempt to use ASR techniques to improve models of speech recognition, but the performance of these models was still lower than human performance. Using HSR findings in ASR research is even harder, because HSR models leave many details unspecified. What could help is a better dialogue between the fields, in which they tell each other what sort of findings might help them. It is possible that ASR needs to build an entirely new architecture once HSR finds out more about how children acquire language. The review ends with a very positive remark about the future of this research: "We have just started to get to know one another, now it is time to make things work."

An example of working together is the Lombard effect discovered in HSR: we change the way we speak in the presence of noise. This effect has not yet been implemented in an ASR system, but it can be found in a system that works the other way around, converting text into spoken words in a car navigation system (Valentini-Botinhao et al., 2012). The voices used in car navigation systems are usually flat and were recorded or created as if speaking in silence. However, in a car there is usually a lot of noise. By implementing the Lombard effect in this speech system, understanding of the speech by humans was improved in all noise types, and it was most effective for noise that resembles speech. This shows the power of the Lombard effect and the importance of implementing it in ASR systems.

Another example of HSR findings used in ASR research is that humans use more than their ears when listening. ASR research has focused on the linguistic content of speech: semantics and the lexicon. Humans, however, also use paralinguistic content such as emphasis, accent, and gestures. Lakra et al. (2012) used these findings to add fuzzy logic to a hidden Markov model (HMM) to help distinguish accent, speed of pronunciation, and emphasis, and so improve speech recognition. This system works more like humans because it can handle different voices, emotions, and Lombard speech. A next step might be to add a memory layer, so that blanks can be filled in using knowledge of the speaker and/or subject, or a camera, so that gestures can be used for better understanding.

Even though there are some examples of the fields of HSR and ASR working together, both can still make large improvements in this respect.


Personal opinion

This review discussed the difficulty of specifying speech and noise, findings in human speech recognition (HSR), findings in automatic speech recognition (ASR), and whether the fields learn from each other and/or work together. The questions this review tried to answer were: (1) How much better are humans at understanding speech in noise compared to automatic systems? (2) Do humans really perform at 100% accuracy in noisy conditions, and why (not)? (3) Did the field of ASR take the advice of Moore and Cutler (2001) and develop new or different algorithms to increase the accuracy of speech recognition (in noise)?

The next sections discuss these questions and answer them as well as possible based on the previous chapters of this review. After that, some shortcomings and improvements based on the answers to these research questions are mentioned, and further questions to improve both fields are formulated.

How much better are humans at understanding speech in noise compared to automatic systems?

This was the main question of this literature review. Unfortunately, it cannot really be answered. We do know that ASR is not as good as HSR, because if it were, we would no longer need companies with human transcribers and there would be no need for a game to make the best transcription of a 10-second speech signal (Liem et al., 2011). Instead, we would have a system that gives the transcription right away.

There have been improvements in ASR, but these are hidden in the patents of large companies. There are no publications on how well these new systems work or on what they cannot do. The latter is a problem in all ASR research: papers focus on percentages correct, while other information might be more fruitful.

To improve future comparisons between HSR and ASR, it would help if both fields reported their findings in the way they are used to as well as in the way the other field reports its findings. This means that HSR can report the percentage correct in different conditions, and ASR can focus more on the types of mistakes its systems make. By shifting their focus in publications, the fields can be compared and can learn more from each other. If computers make the same mistakes as humans do, they may improve if they learn how humans correct these mistakes.

Another thing that might improve both fields is the shift in HSR from only looking at reaction times to using brain imaging techniques. At this point we can only say which region is involved in performing a specific task, and knowing which region is involved does not say anything about how we do the task. When this field grows and we understand more and more about how humans do what they do, we can build more complete models and use the findings to improve automatic systems.

Do humans really perform at 100% accuracy in noisy conditions, and why (not)?

This question needs some sub-questions to answer it correctly. However, if the noise is not too loud, and the listener is a normal-hearing adult who is not suffering from hearing loss and is listening to her mother tongue, she will perform at about 100% accuracy.

Most research does not report the percentage correct if it is performed on people with normal hearing. Therefore, the chapter on HSR focused on differences between people. Elderly people can get presbycusis and, as a consequence, do not understand speech in noise as well as healthy people (Ciorba et al., 2012). But we also saw that people who are listening to a language other than their mother tongue, even if they learned that language very early, have trouble with speech recognition in noise (Cutler et al., 2004; Rogers et al., 2006).


It is easy to believe that we understand speech in noise at a full 100%, since most situations in which we have a conversation involve noise, whether it is cars passing on a nearby street, the static noise of a working computer, or another conversation at a nearby table in a restaurant. Part of understanding speech in noise, however, is that we also look at gestures and use Lombard speech to make it easier for the listener to understand what we say (Hansen, 1996; Valentini-Botinhao et al., 2012). In a lab setting with headphones that present the speech signal, this extra information is (often) not present.

A major shortcoming of much research in both HSR and ASR is that the Lombard effect is not recognized as important. Researchers simply record speech in a quiet space and add noise afterwards; this is not a situation of speech recognition that we would encounter normally. Lombard speech is not even the same in different noise situations: we change the characteristics of our voice depending on the background noise (Hansen, 1996; Valentini-Botinhao et al., 2012). Using more realistic settings will help improve our understanding in both fields.

Did the field of ASR take the advice of Moore and Cutler (2001) and develop new/different algorithms to increase the accuracy of speech recognition (in noise)?

This question is easier to answer: yes, researchers did develop new algorithms to increase the accuracy of speech recognition. Most improvements were made in being able to listen to more than one voice and still recognize the words correctly. The improvements were not based on entirely new systems, but mainly on additional algorithms added to hidden Markov models (HMMs) or on neural networks with hidden layers.

Future questions

If we want the fields to truly learn from each other, a lot of research still needs to be done. One of the most important things to investigate is whether simply adding noise to clean recordings differs significantly from using Lombard speech recorded in noise. If there is a difference, many research findings in the HSR field (with noise present) will need to be checked.

Another improvement would be for ASR to focus on the types of errors its systems make. Are these similar to human phoneme mishearings? If so, new algorithms can be built to try to overcome this problem and obtain better accuracies. Necessary for this will be the use of semantics and better language models, since the system will need to "see" whether a word is strange in a certain sentence or not.

In Artificial Intelligence programmes, students learn the vocabulary and jargon of different research fields so that they can bridge between them. Something like that will also be necessary for the fields of HSR and ASR if they are to learn more from each other. Computations from ASR might also be helpful in creating better cochlear implants, and/or the other way around. But this will only be possible if the researchers in both fields understand what the others did and found.

Conclusion

This paper focused on the findings of Moore and Cutler (2001) on the fields of human speech recognition (HSR) and automatic speech recognition (ASR). Since their article was written over a decade ago, and the field of computer science is growing rapidly, recent literature was used to investigate new findings in both HSR and ASR and whether the fields have improved their cooperation.

Unfortunately, it seems that the recommendations by Moore and Cutler (2001) have not been implemented in the fields. ASR researchers still only report how well their system performed, without notes on what exactly goes wrong. HSR researchers still do the opposite and focus on the mistakes made, not on how many mistakes are made.


Most improvements made to ASR systems are hidden in the technical patents of large companies, which makes it even harder to compare the two fields. The fact is that automatic systems perform better at speech recognition than a decade ago, but they can improve even more once the fields of ASR and HSR start working together more closely.


References

Becker, M., Nevins, A., & Levine, J. (2012). Asymmetries in generalizing alternations to and from initial syllables. Language, 88(2), 231-268.

Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86(1), 117-128.

Ciorba, A., Bianchini, C., Pelucchi, S., & Pastore, A. (2012). The impact of hearing loss on the quality of life of elderly adults. Clinical Interventions in Aging, 7, 159.

Crestani, F. (2003). Combination of similarity measures for effective spoken document retrieval. Journal of Information Science, 29(2), 87-96.

Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. The Journal of the Acoustical Society of America, 116(6), 3668-3678.

Dawson, L., Johnson, M., Suominen, H., Basilakis, J., Sanchez, P., Estival, D., ... & Hanlen, L. (2014). A usability framework for speech recognition technologies in clinical handover: A pre-implementation study. Journal of Medical Systems, 38(6), 1-9.

Dirks, D. D., Morgan, D. E., & Dubno, J. R. (1982). A procedure for quantifying the effects of noise on speech recognition. Journal of Speech and Hearing Disorders, 47(2), 114-123.

Erb, J., Henry, M. J., Eisner, F., & Obleser, J. (2013). The brain dynamics of rapid perceptual adaptation to adverse listening conditions. The Journal of Neuroscience, 33(26), 10688-10697.

Frisina, D. R., & Frisina, R. D. (1997). Speech recognition in noise and presbycusis: relations to possible neural mechanisms. Hearing Research, 106(1-2), 95-104.

Grynpas, J., Baker, R., & Hazan, V. (2011, August). Clear speech strategies and speech perception in adverse listening conditions. In International Congress of Phonetic Science.

Hansen, J. H. (1996). Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication, 20(1), 151-173.

Holle, H., Obleser, J., Rueschemeyer, S. A., & Gunter, T. C. (2010). Integration of iconic gestures and speech in left superior temporal areas boosts speech comprehension under adverse listening conditions. Neuroimage, 49(1), 875-884.

Hygge, S., Ronnberg, J., Larsby, B., & Arlinger, S. (1992). Normal-hearing and hearing-impaired subjects' ability to just follow conversation in competing speech, reversed speech, and noise backgrounds. Journal of Speech, Language, and Hearing Research, 35(1), 208-215.

Lakra, S., Prasad, T. V., Sharma, D. K., Atrey, S. H., & Sharma, A. K. (2012). Application of fuzzy mathematics to speech-to-text conversion by elimination of paralinguistic content. arXiv preprint arXiv:1209.4535.

Liem, B., Zhang, H., & Chen, Y. (2011, August). An Iterative Dual Pathway Structure for Speech-to-Text Transcription. In Human Computation.

Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language and Cognitive Processes, 27(7-8), 953-978.

Mishra, S. K., & Lutman, M. E. (2014). Top-down influences of the medial olivocochlear efferent system in speech perception in noise. PLoS ONE, 9(1), e85756.


Moore, R. K., & Cutler, A. (2001, July). Constraints on theories of human vs. machine recognition of speech. In Proceedings of the workshop on speech recognition as pattern classification (Vol. 3).

Moore, R. K. (2003, September). A comparison of the data requirements of automatic speech recognition systems and human listeners. In INTERSPEECH.

Narayanan, A., & Wang, D. (2014). Joint noise adaptive training for robust automatic speech recognition. Proc. ICASSP, to appear.

Obleser, J., Wöstmann, M., Hellbernd, N., Wilsch, A., & Maess, B. (2012). Adverse listening conditions and memory load drive a common alpha oscillatory network. The Journal of Neuroscience, 32(36), 12376-12383.

Ramdoo, K., Bowen, J., Dale, O. T., Corbridge, R., Chatterjee, A., & Gosney, M. A. (2014). Opportunistic hearing screening in elderly inpatients. SAGE Open Medicine, 2, 2050312114528171.

Rogers, C. L., Lister, J. J., Febo, D. M., Besing, J. M., & Abrams, H. B. (2006). Effects of bilingualism, noise, and reverberation on speech perception by listeners with normal hearing. Applied Psycholinguistics, 27(03), 465-485.

Rosen, S., Souza, P., Ekelund, C., & Majeed, A. A. (2013). Listening to speech in a background of other talkers: Effects of talker number and noise vocoding. The Journal of the Acoustical Society of America, 133(4), 2431-2443.

Ruggles, D. R., Freyman, R. L., & Oxenham, A. J. (2014). Influence of musical training on understanding voiced and whispered speech in noise. PLoS ONE, 9(1), e86980.

Scharenborg, O. (2007). Reaching over the gap: A review of efforts to link human and automatic speech recognition research. Speech Communication, 49(5), 336-347.

Schultz, T., Vu, N. T., & Schlippe, T. (2013, May). GlobalPhone: A multilingual text & speech database in 20 languages. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 8126-8130). IEEE.

Seo, D., Kim, S., Song, G., & Hong, S. G. (2014, January). Speech-to-text-based life log system for smartphones. In Consumer Electronics (ICCE), 2014 IEEE International Conference on (pp. 343-344). IEEE.

Strait, D. L., Parbery-Clark, A., Hittner, E., & Kraus, N. (2012). Musical training during early childhood enhances the neural encoding of speech in noise. Brain and language, 123(3), 191-201.

Valentini-Botinhao, C., Maia, R., Yamagishi, J., King, S., & Zen, H. (2012, March). Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 3997-4000). IEEE.

Wang, Y., & Wang, D. (2014). A structure-preserving training target for supervised speech separation. Proc. ICASSP, to appear.

Weber, A., & Smits, R. (2003). Consonant and vowel confusion patterns by American English listeners. In Proceedings of the 15th International Congress of Phonetic Sciences (pp. 1437-1440).
