
PERCEPTION OF VOICE CUES AND LINGUISTIC FEATURES WITH COCHLEAR IMPLANT SIMULATION

by Thawab Shehab

A Master’s thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science

(Clinical Linguistics)

at the Joint European Erasmus Mundus Master’s Programme in Clinical Linguistics

(EMCL+)

Student Number

S3859843

UNIVERSITY OF GRONINGEN August 2020


PERCEPTION OF VOICE CUES AND LINGUISTIC FEATURES WITH COCHLEAR IMPLANT SIMULATION

Thawab Shehab

Under the supervision of Dr. Thomas Koelewijn at the University of Groningen and Professor Stefan Werner at the University of Eastern Finland.


ABSTRACT

The present work asks whether linguistic processing affects speech perception when listeners perceive a degraded speech signal. It investigated how lexical and phonological features are involved in word recognition and how they interact with voice perception, especially with changes in one important voice cue: the Vocal Tract Length (VTL). Based on previous research and related experimental studies, phonological features seem to influence the perception of a degraded speech signal (Koelewijn et al., 2020). The current study used three types of vocoding (No vocoding, Low Spread vocoding, and High Spread vocoding) to simulate CI transmission, and three word type conditions (Real words, Non-words, and Reversed words). Seventy-five real Dutch words and seventy-five non-words were taken from the VariaNTS corpus, while reversed words were created by time reversing the audio files of the real words. Six Dutch native speakers with normal hearing completed the online experiment, a Three Intervals-Three Alternative Forced Choice Task (3I-3AFC) in which changes in VTL were manipulated using a staircase procedure. The data were collected by measuring the smallest Just Noticeable Difference (JND) in VTL that each participant could detect, expressed as a threshold in semitones. I predicted that JNDs would be larger with reversed words than with real words and non-words, because reversed words contain no lexical or semantic information and do not follow the natural phonology of the Dutch language. Stimuli in the High Spread vocoder (4th order filter) were also predicted to yield larger JND thresholds, as found in previous studies. The results were as expected, but there was no interaction between the vocoding effect and the word status effect.


DEDICATION

With writing, we survive.

I dedicate my thesis work to Aziz Almosawi,

Thank you for everything.


ACKNOWLEDGMENTS

I am grateful to the supervisors of my master's thesis. It was an honor to work with Dr. Thomas Koelewijn, my direct supervisor at UMCG, without whom I cannot imagine having gone through the stages of this project. Dr. Koelewijn offered constructive criticism and encouragement throughout the writing of my thesis. He was incredibly supportive when distancing measures were introduced and the research lab was closed due to the COVID-19 situation. I am grateful to Professor Stefan Werner from the University of Eastern Finland, who was my first teacher in the EMCL+ program. Professor Werner encouraged me and provided valuable advice throughout the first and last semesters of EMCL+. He supervised my master's thesis and supported my work.

I am grateful to Professor Deniz Başkent for her encouragement to students in the dB SPL lab at UMCG. Special thanks go to Dr. Etienne Gaudrain, who created an online platform to perform several lab experiments, including the current research. I am enormously grateful to the Dutch native speakers who took part in the present study.

I want to thank my family for their love and support throughout the whole master's program.


TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 What is a Cochlear Implant?
1.2 CI transmission of voice cues
1.3 Linguistic interference
1.4 Semantics and phonology of words and non-words
1.5 Current Study: Research questions and Predictions
CHAPTER 2 PSYCHOPHYSICS
2.1 Vocoding
2.2 Just Noticeable Difference (JND)
2.3 Three Intervals-Three Alternative Forced Choice Task (3I-3AFC)
CHAPTER 3 METHOD: THE EXPERIMENT (LISTEN TO VOICES)
3.1 Participants
3.2 Stimuli
3.3 Procedure
3.4 Results
3.4.1 Word Status
3.4.2 Vocoder Type
3.5 Discussion
3.6 Valorization
REFERENCES


LIST OF TABLES

3.1 Examples of real words and non-words that were used in the experiment from the VariaNTS Corpus
3.2 The nine blocks of the experiment
3.3 Descriptive measures for the nine conditions in the experiment
3.4 Group mean and SD per vocoding condition
3.5 Group mean and SD per word status condition


LIST OF FIGURES

1.1 Schematic representation of the waveform and coding of F0 and VTL
3.1 Boxplots of JND thresholds in the nine conditions


Chapter 1

Introduction

The ability to discriminate between voices is crucial for speech perception and overall communication. Unfortunately, people with a Cochlear Implant (CI) have trouble distinguishing sounds and voices, unlike people with normal hearing (NH). The limited delivery of voice cues through CI devices is related to restricted signal transmission between the CI electrodes and neurons in the cochlea (Başkent, Gaudrain, Tamati, & Wagner, 2016). Even in multichannel CIs, the coding of Fundamental Frequency (F0) and Vocal Tract Length (VTL) is limited. Researchers have been developing strategies to improve the coding of F0 so that CI users have better pitch perception. However, the perception of VTL, which is highly impaired, needs more investigation (Fuller, Gaudrain, Clarke, Galvin, Fu, Free, & Başkent, 2014). This paper contains a literature review on CIs, the perception of voice cues, and linguistic processing in speech recognition. The second chapter covers psychophysical aspects related to the experimental design of the current study. The Psychophysics chapter includes information on the vocoding procedure that alters voice cue perception (i.e., VTL) to simulate the CI transmission of degraded speech signals. It also explains the Just Noticeable Difference (JND) score for the VTL voice cue, which is used as a threshold, and the Three Intervals-Three Alternative Forced Choice Task (3I-3AFC). The process of data collection is described in the Method chapter, together with the data analysis, results, and discussion.

1.1 What is a Cochlear Implant?

A CI is an auditory prosthesis for people with severe to profound sensorineural hearing loss who do not benefit from acoustic hearing aids (O'Donoghue, Nikolopoulos, & Archbold, 2000). In sensorineural hearing loss, the natural function of the inner ear is impaired, so the ability to transmit acoustic signals from the cochlea to the auditory nerve is diminished or even lost. CI devices mimic the role of the inner ear by stimulating the auditory nerve with an electrode array inserted surgically inside the cochlea (Başkent, Gaudrain, Tamati, & Wagner, 2016). A cochlear implant system has external and internal components. The external parts are a microphone that picks up sound waves, a processor that converts acoustic signals into electric signals, and a transmitting coil. These parts connect via a magnet to the internal components under the scalp: a receiver/stimulator and the intracochlear electrode array (Gopalakrishna, Kehtarnavaz, & Loizou, 2010).

The history of cochlear implant surgeries goes back to the 1970s and 1980s, when the CI was first approved by the Food and Drug Administration (FDA) as a treatment for adults, and then for children in the 1990s (Spencer & Marschark, 2003). In the past, children born with congenital deafness may have had the option to learn sign language for communication. With CI technology, however, these children have the opportunity to hear speech, which allows them to develop language skills and be part of larger communities. In the past decades, research in the CI field has investigated how CI users derive the most benefit from this technology over the years. It is essential to know whether CI users rely on the sensory information they receive or on pre-existing mechanisms when processing speech (Zatorre, 2001). It is even more interesting to see how their speech and language outcomes are affected (Geers, 2002).

The literature has given attention to performance variation between people with Normal Hearing (NH) and CI users (e.g., Peng, Tomblin, & Turner, 2008; Plyler, Bahng, & Von Hapsburg, 2008; Most & Aviner, 2009; Jin, Liu, & Sladen, 2014; Ji, Galvin, Chang, Xu, & Fu, 2014). These studies have shown that the way CI users perceive sounds like speech and music differs from how normal-hearing people perceive them. The electrical signal might be distorted or carry unclear speech elements due to background noise or competing speakers. In addition, there is high variance among CI users themselves in how they perceive speech. One of the factors that causes this variance in speech degradation between CI users is damage to the cochlea, which can happen during implant surgery. Drilling and insertion can also lead to high impedance around the electrode array and might change the pathway that currents must take for auditory nerve activation (Arenberg Bierer, 2010). Other factors include a potential mismatch in the frequency-place mapping (Faulkner, 2006), the limited dynamic range of electric hearing (Santos, Cosentino, Hazrati, Loizou, & Falk, 2013), and the presence of acoustic low-frequency hearing in the implanted or contralateral ear (Gantz, Turner, Gfeller, & Lowder, 2005).


Ideally, a speech signal is perceived when all segmental and suprasegmental cues are present. For example, voice pitch can provide information on voicing and manner of articulation, while suprasegmental cues (utterance prosody) like rhythm, intonation, and stress placement are carried by both F0 and the amplitude envelopes. Pitch fluctuations help in the perception of sentence prosody and overall sentence comprehension (Carlson, 2009; Veenendaal, Groen, & Verhoeven, 2016). Modern multichannel CIs take the cochlear tonotopic organization into consideration, unlike the old single-channel devices. High frequencies are therefore delivered to electrodes near the round window and elicit high pitch percepts, while apical electrodes receive lower frequencies to produce low pitch percepts (Baumann & Nobbe, 2006).

The organized transmission of voice cues means that multichannel CI users might show better performance in speech perception. Nevertheless, electrical stimulation does not act the same as acoustic hearing. The CI processor bandpass filters the acoustic input of broadband speech signals into a few frequency bands. Therefore, the distinct electrodes designed for tonotopic stimulation of the auditory nerve will always deliver fixed-rate current pulses modulated by the slowly varying amplitude envelopes of the corresponding spectral band. As a result, spectro-temporal fine structure is lost, which affects speech perception in challenging situations such as background noise or conversations with multiple talkers (Friesen et al., 2001; Fu & Nogaki, 2005; Stickney et al., 2004).

It is crucial to understand the challenging circumstances that CI users go through in real-life listening situations. Since clinical experiments are designed in ideal conditions, they may not resemble everyday life situations. The main goal of a CI is to help people regain their ability to hear language and communicate with people around them, and consequently, to have a better quality of life. Natural listening conditions include classrooms, parties, meetings, restaurants, and conversations with more than one person. Many studies have investigated how CI users perceive speech in these different situations (Beer, Kronenberger, & Pisoni, 2011; Amann & Anderson, 2014; Dunn, Noble, Tyler, Kordus, Gantz, & Ji, 2010; Hast, Schlücker, Digeser, Liebscher, & Hoppe, 2015). This has increased interest in the perception of voice cues and in comparing performance between CI users and people with normal hearing (Fu, Chinchilla, & Galvin, 2004; Pyschny, Weber, Walger, von Wedel, & Meister, 2007; Zaltz, Goldsworthy, Kishon-Rabin, & Eisenberg, 2018; Başkent, Luckmann, Ceha, Gaudrain, & Tamati, 2018).


1.2 CI transmission of voice cues

Hearing is a natural function that people perform without normally thinking about all the underlying steps involved in the process. In real-life situations, we are surrounded by a variety of sounds, which may include environmental noise and signals from other conversations that are not meaningful or related to ours. These sounds need to be separated so that we can focus on the target signal and neglect the interfering noise (Bronkhorst, 2000; Cherry, 1953). We also need to gather speech parts to form a stream of meaningful speech (Başkent et al., 2016).

Figure 1.1. (Figure 12-4 from Başkent et al., 2016, copied with the first author's permission) Left column: schematic representations of the waveform (A–D) and spectrum (E–H) of the vowel /a/ for different combinations of F0 and VTL. Right column: schematic representation of the coding of F0 (I–L) and VTL (M–N) in the implant.

There are multiple voice cues that listeners rely on for the segregation of speech signals, but the two essential cues are the Fundamental Frequency (F0) and the Vocal Tract Length (VTL). The first cue (F0) is related to the glottal pulse rate of a speaker, while the second (VTL) is an acoustic cue that gives information about the size or height of the speaker (Roers et al., 2009; Gaudrain & Başkent, 2015). Both F0 and formant frequencies vary between speakers of different ages and genders: adults have lower formant frequencies than children, while females have higher formant frequencies than males (Whiteside & Hodgson, 2000). Perception of these two cues plays a vital role in the perception of speech from different speakers in everyday life situations.

It is astonishing how the auditory system can segregate information about the size of a sound source (i.e., the speaker's vocal tract length) from information about its shape and structure (Irino, Aoki, Kawahara, & Patterson, 2010). In speech signals, the perception of the glottal pulse rate of a talker (F0) depends on both the temporal and the place coding of speech. On the other hand, the perception of VTL, which is related to the speaker's size, depends on perceiving spectral characteristics such as the formant structure (Başkent et al., 2016). An increase in F0 means more frequent glottal pulses (a higher frequency). A decrease in VTL leads to an expansion of the spectral envelope, as shown in Figure 1.1. According to previous studies, increasing the difference in F0 and VTL between competing signals leads to better segregation of speech segments in a cocktail party situation where multiple competing speakers are present (Darwin, Brungart, & Simpson, 2003; Vestergaard, Fyson, & Patterson, 2009). The cocktail party problem refers to the challenge of segregating a target voice from a background of interfering speech or noise that may come from different directions or may contain different accents (Cherry, 1953; Loizou et al., 2009). NH listeners develop coping mechanisms for these situations, but it is challenging for CI users to do the same.
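For reference, VTL differences in this line of research are conventionally expressed in semitones of spectral scaling (a standard relation, noted here rather than taken from the thesis text): a VTL difference of Δ semitones corresponds to a length ratio of

\[
\text{ratio} = 2^{\Delta/12},
\]

so a 12 st difference corresponds to a factor of 2. Because formant frequencies scale inversely with vocal tract length, this is equivalent to compressing or expanding the spectral envelope along the frequency axis by the same factor.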

Unfortunately, the delivery of voice cues through CI devices is sometimes weak and limited because of the restricted signal transmission. Even in multichannel CIs, the coding of F0 is limited by the fixed-rate pulses and the low spectral resolution. Coding strategies developed to enhance the temporal fine structure in CIs do not provide a benefit for VTL perception (Gaudrain & Başkent, 2015). More studies are needed to investigate the perception of VTL, which is profoundly impaired in CI users.

Previous studies examined how modifying F0 and VTL may influence the ability to recognize speaker gender (Fuller et al., 2014). Meister, Fürsen, Streicher, Lang-Roth, and Walger (2016) tested CI users and NH listeners on speaker gender recognition with single-word and sentence stimuli. Findings from this study revealed that changing the two cues combined had a more substantial influence on the perception of speaker gender for NH listeners. NH listeners relied on both F0 and VTL, because these cues conflict when manipulated singly. CI users, on the other hand, showed ambiguous responses for gender recognition, which can be explained by the limited spectral resolution of CI devices. The authors also found that performance on sentences was better than on single words in both groups, which might be explained by sentences containing more information that allows for more detailed analysis. However, CI listeners did not make use of VTL cues regardless of stimulus type.

Gaudrain and Başkent (2015) aimed to investigate possible causes for the limited sensitivity to VTL differences in CI users. Their first hypothesis was that CI users cannot perceive VTL due to reduced spectral resolution. The second hypothesis was that VTL differences are detected but not used by CI users for gender categorization. The results indicated that in 12-channel devices, the current spread causes channel interaction, which might be the main reason preventing CI users from perceiving VTL differences between male and female speakers. Current spread refers to the electrical spread of excitation in the cochlea when current is injected through an intra-cochlear electrode of the CI and returned through an extra-cochlear electrode. The amount of spread determines how much interaction appears between individual stimulation channels (Gaudrain & Başkent, 2015). Later studies investigated how sensitive CI users are to changes in voice cues (F0 and VTL).

Gaudrain and Başkent (2018) measured the raw sensitivity to any perceptual difference that results from a change in F0 or VTL. The goal was to explore to what degree CI users can hear F0 and VTL cues, based on the hypothesis that CI users would show larger (worse) sensitivity thresholds for differences in F0 and VTL compared to NH individuals (Gaudrain & Başkent, 2018). They measured the Just Noticeable Difference (JND), the smallest change in a voice cue that listeners can detect (the discrimination threshold), measured in semitones (st). In a three-interval, three-alternative forced-choice (3I-3AFC) procedure, CI users had an average F0 JND of 9.19 st and a VTL JND of 7.19 st, while NH individuals had only 1.95 and 1.73 st, respectively. The findings suggest that the deficit in voice perception in CI listeners is more than just poor perception of F0, as they also lack adequate access to VTL cues due to limited spectral resolution (Gaudrain & Başkent, 2018).

Another study, by El Boghdady, Gaudrain, and Başkent (2019), explored the relationship between the difficulty of understanding speech in multi-talker settings and the perception of voice cues in CI users across three experiments. They hypothesized that higher sensitivity to F0 and VTL differences would correlate with better overall Speech on Speech (SoS) performance. Therefore, CI users' deficits in SoS intelligibility and comprehension could relate to their reduced sensitivity in voice cue perception. In general, CI listeners had significantly larger (worse) JNDs for F0 compared to NH listeners. The findings do not support the hypothesis that increasing F0 and VTL differences between two concurrent speakers (target and masker voices) should lead to an improvement in SoS intelligibility for CI users. Unlike the NH group, CI users showed a slight decrease in SoS intelligibility with increasing VTL difference. The results also showed that the lack of benefit from VTL differences in SoS intelligibility and comprehension performance in CI users is correlated with their reduced sensitivity to VTL. However, the CI users who were more sensitive to variations in F0 benefited more in SoS intelligibility (El Boghdady, Gaudrain, & Başkent, 2019).

1.3 Linguistic interference

Neurocognitive processes such as working memory (capacity), inhibition, and sequencing skills are strongly correlated with language outcome measures (Pisoni, Conway, Kronenberger, Henning, & Anaya, 2010). With a better understanding of these processes, we can explain and predict individual differences in speech and language outcomes following cochlear implantation. Neurolinguistic and psycholinguistic studies have demonstrated how language is represented in the brain (Yelland, 1994; Morton, 1969). For instance, the memory system that stores all our knowledge and concepts about words is called the mental lexicon. Lexical access refers to the processes that support the mapping between a stimulus and its corresponding representation in memory (Yelland, 1994). Several approaches and models have described lexical access for spoken and written stimuli (e.g., activation models, connectionist models, computational models). In the case of spoken stimuli (auditory input), the phonemes are the stimulus's critical features. The phonemes are not available directly in the speech waveform but through an abstract representation of its acoustic properties.

Speech stimuli contain diversity that is not available in written stimuli, such as intonation and different voice onset times, which might be challenging for CI users. One of the activation models that describes lexical access is Morton's original logogen model (1969). "Logogens" are the units (feature counters) that contain and receive perceptual information from the input. Each logogen corresponds to a single entry in lexical memory, and logogens are activated by speech sounds, images, objects, or written information. Morton later revised the model to include two sets of inputs: visual input logogens and auditory input logogens. The visual input logogens count only orthographic and semantic features, while auditory input logogens count phonological and semantic features (Morton, 1969, 1979). In 1986, Harris and Coltheart made a distinction between input logogens and output logogens. They suggested that information recovered from the cognitive system (e.g., semantic properties of the input stimuli) is then linked to separate sets of output logogens: auditory output (for speaking) and visual output (for writing) (Yelland, 1994).

People with hearing impairment often have some form of language and communication difficulty compared to NH listeners. For instance, prelingually deaf children score significantly below their peers on many tasks that involve verbal working memory, sentence repetition, and sequential learning. Although training and therapy programs might result in improvements, more intervention and evaluation are recommended for these children (Kronenberger & Pisoni, 2016). In post-lingual deafness, after long periods of sensory deprivation, there is a decline in the ability to evoke the phonological representations that are important for oral and written communication (Lazard, Lee, Truy, & Giraud, 2013). Although evoking non-speech sounds (environmental sound representations) is not difficult for post-lingually deaf subjects, it might be altered. This long-term cognitive alteration of auditory processing can lead to functional cerebral reorganization (Lazard et al., 2013).

On the other hand, people with normal hearing understand their native speech without effort, and they attend quickly to acoustic information that is distinctive and reliable within their native language. Adult and child CI users show individual differences in their spoken language outcomes. Language difficulties are related to neurocognitive processing skills, including the operations used in the encoding, storage, and retrieval of phonological and lexical representations of spoken words (Başkent et al., 2016).

When perceiving a degraded speech signal, the brain can use context and linguistic knowledge to complete or restore the masked parts of the speech. This function is called phonemic restoration or top-down restoration, and it is affected by hearing impairment (Warren & Sherman, 1974; Başkent, 2010). Some cochlear implant studies examined actual CI users to compare their performance with NH listeners (Bhargava, Gaudrain, & Başkent, 2014; Bhargava, Gaudrain, & Başkent, 2016). Others used a simulation of CI processing with a noise-band vocoder to mimic electric hearing by creating degraded speech signals (Gaudrain & Carlyon, 2013; Gaudrain, Grimault, Healy, & Béra, 2008). When a speech signal is interrupted with noise, some speech elements might be removed or inaccessible to the listener. However, when silences are inserted in speech, some of the speech elements may be replaced with misleading cues. In both scenarios, a reduction in the spectral resolution of the speech signal will result in reduced speech intelligibility, increasing the listening effort and the cognitive processing load, and may lead to mental fatigue (Bess & Hornsby, 2014).

In lexical processing, NH listeners can use the sentential information preceding a target word to anticipate upcoming expressions and understand the meaning. However, CI users have a reduced ability to benefit from previously heard information, making lexical access more effortful. With speech signals degraded by electric hearing or an artificial noise-band vocoder, subjects might take longer to identify the stimuli. Also, with a degraded speech signal, context is used more slowly than with a clear signal. This can put the hearing-impaired listener in the dilemma of still processing the last sentence when the next sentence has already begun (Winn & Moore, 2018). In real-life situations, listeners need an adequate level of integration of semantic and grammatical sources for successful perception of spoken language. Because of periods of auditory deprivation, CI users rely on disrupted perceptual and linguistic systems with atypical neurocognitive skills (Castellanos, Pisoni, Kronenberger, & Beer, 2016). Despite the limitations of CI devices, CI users seem to adapt after a long period of exposure to degraded speech signals, although their spoken language skills may remain poor. In a study by Tamati and Pisoni (2015), CI users were less sensitive than NH listeners to differences in intelligibility between foreign-accented and native speech. The study indicated that CI users who were more sensitive to foreign accent differences also had better speech perception abilities. Therefore, the authors concluded that CI users' sensitivity to variability in the speech signal might be correlated with the development and use of speech and language processing skills. The previously mentioned study by Meister et al. (2016) suggested that the greater amount of information in sentences allows for more detailed analysis, explaining the better performance on sentences than on single words in both the NH and CI groups. They argued that there are more suprasegmental cues in sentences than in words, and that phonetic information is sparser in word stimuli. In the same study, they concluded that regardless of stimulus type, CI users did not use VTL cues (Meister et al., 2016).

The effect of linguistic information and CI processing was examined recently by Koelewijn, Arts, Gaudrain, Tamati, and Başkent (2020). The purpose was to understand the relationship between voice perception and the linguistic processes that may contribute to improvements in speech understanding for CI users in challenging situations (e.g., multiple talkers). The study included two experiments with Dutch native speakers with normal hearing, and both experiments used a 3I-3AFC discrimination task. Voice cues (F0 and VTL) were manipulated to measure JNDs on a continuum between an artificial target male voice and the female reference voice. The first experiment measured JNDs for easy and hard Dutch words and easy and hard Dutch non-words from the VariaNTS corpus. The selection of words was based on two linguistic features: lexical frequency and neighborhood density. These two features affect spoken word recognition (Howes, 1957; Savin, 1963; Luce & Pisoni, 1998), in that words with higher lexical frequency are more accessible than words with lower lexical frequency, and words that have fewer neighbors are phonetically more unique and easier to recognize. The VariaNTS corpus assumes that pseudowords have no clear representation in the mental lexicon and that they do not look or sound like existing Dutch words or names. However, all these Dutch non-words can be appropriately pronounced by native Dutch speakers and can be recognized depending on their sounds, phonotactic probability, and density (Arts, Başkent, & Tamati, in prep.). Phonotactic probability refers to the likelihood that a non-word could occur in a natural language, while neighborhood density refers to the relation of non-words to typical phonological sequences of real words (Janse & Newman, 2013). Therefore, easy words in the first experiment of Koelewijn et al. (2020) were the words with high lexical frequency and low neighborhood density, while easy non-words were those with high phonotactic probability and low neighborhood density. These words and non-words were presented either in a fixed condition (three identical utterances in one trial, e.g., Kamp, Kamp, Kamp) or a variable condition (three different utterances in one trial, e.g., Kamp, Zwak, Plek). The voice cues were manipulated relative to a reference female voice, either in F0 or VTL or both combined. In the second experiment, Dutch words from the NVA corpus were compared with reversed words, which are time-reversed sound files. In addition, the second experiment used stimuli in three types of vocoding: a low-spread vocoder (12th order filters), a high-spread vocoder (4th order filters, which are wider and cause spread of frequency information), and a no-vocoder condition. The first experiment showed no significant difference in JNDs between words and non-words, but lower (better) JNDs were observed with the fixed stimuli than with the variable stimuli.

Moreover, the characteristics of the real words (easy versus hard) did not affect JNDs, whereas lower JNDs were observed with easy non-words than with hard non-words. In the second experiment, JNDs were affected by word status (words versus reversed words) in the VTL condition, especially with variable stimuli, showing an effect of word status and of phonological properties. The study indicated that linguistic variation interferes with the perception of voice cues both with and without CI simulation, and the authors concluded that individual perception of voice cues is related to phonological processing.


1.4 Semantics and phonology of words and non-words

Investigating the linguistic effect on voice cue perception involves creating a testing paradigm that compares stimuli with rich linguistic content against stimuli that do not include the same linguistic information, for example, comparing sentences with single words, or words with non-words (pseudowords). The next section explains word recognition and the difference between real words and non-words in more detail.

Word recognition is the set of processes required to link sensory input to a corresponding mental representation. It involves checking the accuracy of the match between the memory and stimulus representations. It provides the contents of the accessed memory representation, such as the phonological form of a word and its syntactic properties. It also resolves any inconsistency between the input and the accessed representation (e.g., speech or spelling mistakes), so that the product of recognition is the form intended by the speaker or writer (Yelland, 1994).

A real word is a small utterance of language that carries a meaning. Words are often called the building blocks of language, and learners use them to make connections between heard or written forms and their referents. Words are the linguistic units of interest in the logogen model, in which phonological and orthographic codes are linked with units of meaning shared between people (Morton, 1969, 1979; Nation, 2014). Content words refer to an action (verb) or an entity (noun), while function words serve a linking purpose in sentences (e.g., prepositions). There is no clear relationship between the phonology of a word (its sounds and phonemes) and its meaning in the lexicon (semantic information). This relationship was long described as arbitrary, but some studies have shown that the sound-meaning mapping is more systematic than previously expected (Monaghan, Shillcock, Christiansen, & Kirby, 2014). Still, each language has its own phonological system, which makes the phonological forms of words differ between languages even when they refer to the same meaning. For example, the Dutch word vogel means bird in English, and both refer to an animal that flies.

Pseudowords (non-words), on the other hand, are made up of combinations of sounds that follow the phonological constraints of a natural language (Keuleers & Brysbaert, 2010). Non-words may sometimes sound like words, but they are not real: they do not carry meanings, nor do they exist in the lexicon. Researchers can create non-words using pseudoword generators (e.g., Keuleers & Brysbaert, 2010) for lexical decision tasks in psycholinguistic experiments and to measure the semantic priming effect. In a lexical decision task, subjects are presented with prime-target pairs on each trial (e.g., nurse, hospital), and the instruction is to decide as quickly as possible whether the target is a word or a non-word. The semantic priming effect appears when the response to a target word is influenced by the prime word presented earlier in the trial. Because recognizing one word can activate a network of related words in the lexicon, faster (shorter) reaction times are observed with target words that are more semantically related to their primes. Thus, in the previous example, faster responses are observed when the target word (hospital) is preceded by (nurse) than by an unrelated prime (e.g., panda). Recognition is facilitated by the activation of semantically related words and concepts (Heyman, Van Rensbergen, Storms, Hutchison, & De Deyne, 2015). Another lexical decision study used two probabilities to explain semantic priming: the relatedness proportion and the non-word ratio (Neely, Keefe, & Ross, 1989). The relatedness proportion refers to the likelihood that a target word is semantically related to its prime, while the non-word ratio is the probability that a target is a non-word given that it is unrelated to its prime. Semantic priming for targets with higher lexical frequencies (high dominance) was modulated more by the relatedness proportion than by the non-word ratio. On the other hand, semantic priming for target words with lower lexical frequencies was strongly influenced by the non-word ratio (Neely, Keefe, & Ross, 1989).

A study by Cassani, Chuang, and Baayen (2019) aimed to examine a systematic relation between words' sounds and meanings by testing the bootstrapping theory. They hypothesized that children can immediately exploit the semantic relations of words evoked by their sounds and that even isolated non-words elicit an informative semantic impression. Their hypothesis contradicted the "phonological bootstrapping" theory, which suggests that the mapping between form and meaning is indirect and mediated by grammar. Phonological bootstrapping assumes that children map the phonological structure onto the likely lexical category (also called the syntactic category, e.g., noun, verb, adjective) and, having determined this, form predictions about the meaning of the new word. The hypothesis of Cassani et al. (2019) was based on the idea that the sound of a word can provide useful and reliable cues for inferring the implicit meaning of a word directly from its sound, without the intermediate step of category identification (Sharpe & Marantz, 2017). The paradigm included English non-words in two categories: verb-like non-words and noun-like non-words. Mappings between form and meaning were estimated using standard linear transformations from linear algebra, and a Linear Discriminant Analysis (LDA) was performed. The findings supported the hypothesis that children can infer the implicit meaning of a word directly from its sound. They also suggested that non-words do land in semantic space, such that children can capitalize on their semantic relations with other elements in the lexicon to decide whether a non-word is more likely to denote an entity (noun-like) or an action (verb-like). Therefore, the relation between form and meaning is not mediated by grammar, and abstract lexical categories are not necessarily involved. On the contrary, an informative relationship may exist between phonology and meaning, so the mapping that causally links form and meaning is non-arbitrary (Cassani et al., 2019).

There is a difference between real words, non-words, and reversed words based on their phonological and lexical components. Real words have both phonological and semantic information that facilitates how NH listeners recognize them. Non-words, on the other hand, obey the phonological constraints of a natural language, but there is still a debate on their mental and semantic representation, which cannot be explained without their relationship to real words. Finally, time reversing an auditory stimulus results in hearing the end of a word first and its beginning last. This procedure makes reversed words sound unnatural to native speakers. On that account, the reversed words used in Koelewijn et al. (2020) do not follow the phonological constraints of the Dutch language, and they certainly do not carry semantic information that could be represented in the mental lexicon.

1.5 Current Study: Research questions and Predictions

The current study investigates the influence of linguistic processes on the perception of voice cues in simulated CI processing. There are two research questions: (1) are VTL JNDs affected by the type of word stimuli (real words, non-words, and reversed words), and (2) does the word status effect interact with vocoding?

We hypothesized, first, that spoken word recognition involves phonological and semantic components that facilitate lexical retrieval. Therefore, larger JNDs are predicted with reversed words than with real words or non-words. We also predict smaller JNDs with real words than with non-words and reversed words. Any significant difference in JNDs between real words and non-words would signal the distinct representation of real words in the mental lexicon. Second, simulating the CI transmission of the signal by manipulating the sharpness of the filter is expected to affect the VTL JNDs. Larger (worse) JNDs are expected with the high-spread vocoder (4th order filter) because of the high spread of information compared to the other vocoding conditions.


Chapter 2

Psychophysics

2.1 Vocoding

In 1939, Homer Dudley introduced the channel vocoder (voice coder), a machine that spoke with the help of a human operator seated in front of a console. The machine contained an artificial speech synthesis system implemented in hardware: analogue circuits of bandpass filters, switches, and amplifiers connected to a loudspeaker. This idea had an impact on telephony and speech transmission applications and, most importantly, on the development of cochlear implant devices. CI devices nowadays are programmed digitally with a modified version of the vocoder analysis algorithm (Loizou, 2006). Researchers use noise-vocoded speech to simulate the speech transduced by a cochlear implant, a procedure that allows normal-hearing listeners to be tested with degraded speech signals in order to improve CI devices. When speech is transmitted through a cochlear implant, the spectral details might be reduced and the temporal fine structure might be altered. The intelligibility of noise-vocoded speech depends on properties such as the number of frequency bands in the vocoder: speech vocoded with ten or more bands is more intelligible than speech vocoded with just four bands (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005).


In Gaudrain and Başkent (2015), the authors examined the effect of different vocoding conditions on the discrimination thresholds of normal-hearing subjects. The first two experiments investigated whether reducing the spectral resolution causes an increase in the Just Noticeable Difference for VTL only and not for F0. They manipulated the number of frequency bands (no vocoder, 12 bands, 4 bands) in the first experiment, and the type of carrier in the vocoders in the second experiment. The results of the first experiment showed an effect of the number of frequency bands in the sinewave vocoder on VTL JNDs, while F0 was less affected. The second experiment revealed that the carrier type (i.e., sinewave, noise, low-noise) had a significant effect on F0 but not on VTL. In the third experiment, they measured JNDs for VTL only. The aim was to simulate two aspects of electrical stimulation that are known to limit spectral resolution: the number of electrodes and the current spread in the cochlea. For that reason, they manipulated the number of channels and the sharpness of the bandpass filters. The number of channels affects how spectral information is quantized along the frequency axis, while the sharpness of the filters determines how much overlap and blending occurs between the channels. The authors found that these two manipulations had a similar effect on VTL difference detection: fewer bands yielded larger JNDs, and shallower filters were also associated with larger JNDs. However, an interaction was found between the two factors: when there were enough bands (channels) in a vocoder, using sharper filters did not improve the JNDs (Gaudrain & Başkent, 2015). In the current study, we manipulated the sharpness of the filters to simulate low and high spread in signal transmission, which has been suggested to cause variance in the speech processing abilities of CI users.
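To make these two manipulations concrete, the sketch below outlines a generic noise-band vocoder in R, the language used for the analyses in Chapter 3. It is an illustrative simplification rather than the implementation used in the experiment: the band edges, envelope cutoff, and use of the signal package are assumptions made only for this example.

# Generic noise-band vocoder sketch (illustration only; not the experiment's implementation).
# Assumes the 'signal' package for Butterworth filter design and zero-phase filtering.
library(signal)

noise_vocode <- function(x, fs, n_bands = 12, filter_order = 4,
                         f_lo = 150, f_hi = 7000, env_cutoff = 300) {
  edges <- exp(seq(log(f_lo), log(f_hi), length.out = n_bands + 1))  # log-spaced band edges (Hz)
  out <- numeric(length(x))
  for (i in seq_len(n_bands)) {
    band <- c(edges[i], edges[i + 1]) / (fs / 2)       # band limits, normalized to Nyquist
    bp   <- butter(filter_order, band, type = "pass")  # analysis/synthesis band-pass filter
    sub  <- filtfilt(bp, x)                            # band-limited speech
    lp   <- butter(2, env_cutoff / (fs / 2), type = "low")
    env  <- filtfilt(lp, abs(sub))                     # slowly varying amplitude envelope
    carrier <- rnorm(length(x))                        # white-noise carrier for this channel
    out  <- out + filtfilt(bp, carrier * pmax(env, 0)) # modulated noise, refiltered into the band
  }
  out / max(abs(out))                                  # peak-normalize the summed channels
}

In this sketch, n_bands plays the role of the number of electrodes, and filter_order plays the role of filter sharpness: a lower order gives shallower slopes and more overlap between neighbouring channels (the high-spread case), while a higher order gives steeper slopes and less channel interaction (the low-spread case).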

2.2 Just Noticeable Difference (JND)

The Just Noticeable Difference (JND) refers to the minimum amount of stimulus change required to elicit a difference in sensation on the psychological continuum. It can be explained by Weber's law, which states that physical stimulus intensity must be increased by a constant fraction of its starting value to be just noticeably different. Weber's law makes it possible to compare sensory discrimination across different conditions and modalities (Das & Alagirusamy, 2010). The JND is also called a difference limen. It is used in psychoacoustic experiments in which subjects are asked to compare two sounds and to indicate which one is higher in level or in frequency (Long, 2014). The JND is often measured with a method of limits, which helps control for the fact that different observers make judgments at different rates.
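Weber's law can be written compactly as follows, with I the stimulus magnitude, ΔI the just noticeable change, and k the Weber fraction, roughly constant for a given modality and condition:

\[
\frac{\Delta I}{I} = k,
\]

so the detectable change grows in proportion to the baseline value of the stimulus.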


When measuring the JND, the experimenter controls the starting point and the size of the increment or decrement for a given variable. In each step, either in an ascending or a descending series, the subject must report whether the stimulus appears "less than," "equal to," or "greater than" the standard (e.g., the original stimulus). Within the method of adjustments, there will be a range of values of the variable close to that of the standard in which a subject might find it challenging to reach a judgment of "lesser" or "greater" and will likely say it is "equal." Because of this difficulty and indecision, the mean of ascending trials will differ from that of descending ones. Even if subjects are not allowed to select "equal," errors of habituation or anticipation may result in different means for ascending and descending trials (Vickers, 2014). In cochlear implant studies, researchers measure JNDs to examine how well CI users and NH listeners can discriminate voices with different characteristics (e.g., the voices of different speakers). In the current study, we measure JNDs in a 3I-3AFC test for different conditions.

2.3 Three Intervals-Three Alternative Forced Choice Task (3I-3AFC)

In an AFC test, multiple stimuli are presented to the subjects, and they are given a criterion by which to select one stimulus, for example, selecting the one stimulus among three or four that differs in a defined attribute (ASTM, 2009). Adaptive staircase procedures are widely used for measuring auditory thresholds because they avoid the effects of variation in response criterion (Johnson, Watson, & Kelly, 1984; Schlauch & Rose, 1990).

These procedures differ from the method of constant stimuli (MCS), which involves separate estimates of individual points along the psychometric function. In an adaptive staircase procedure, the subject's responses on one trial or a series of trials determine the next stimulus level. A decision rule governing the changes in the stimulus level is selected to target a fixed level of performance (Schlauch & Rose, 1990), and the stimulus level corresponding to that performance criterion is taken to be the threshold. Subjects do not need prior knowledge of the threshold region for adaptive staircase procedures: if the stimulus level starts far from the threshold, the level tends to move toward the region of interest. Schlauch and Rose (1990) found that 3- and 4-interval forced-choice (IFC) procedures are more efficient than a 2IFC procedure with a decision rule that targets 70.7% correct performance.
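The 70.7% figure follows from the two-down, one-up decision rule (Levitt, 1971), which is also the rule used in the procedure described in Chapter 3: at convergence, the probability of stepping down (two consecutive correct responses) equals the probability of stepping up (one incorrect response), so

\[
p^{2} = 0.5 \quad\Longrightarrow\quad p = \tfrac{1}{\sqrt{2}} \approx 0.707 .
\]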


Chapter 3

Method: The Experiment (Listen to Voices)

3.1 Participants

Six participants were recruited by word of mouth to perform the online experiment. Initially, twelve participants started the experiment, but only six of them had complete data from all the blocks and were therefore included in the analysis. Ideally, audiometric thresholds would have been measured for each subject in the lab before the experiment to ensure that they had normal hearing. However, due to the lockdown situation in Groningen, research activities were conducted online, and we relied on self-reported normal hearing from our participants. All subjects were Dutch native speakers with normal hearing and normal or corrected-to-normal vision. The participants' ages ranged from 19 to 50 years, and they had no previous history of neurocognitive disorders. All subjects joined the online experiment voluntarily after completing a questionnaire and signing an informed consent form. The experiment "Listen to Voices" is part of the PICKA-XL project, approved by the ethics committee of the University Medical Center Groningen, and was available at https://dbsplab.fun/. This website contained other experiments from the same research lab.


3.2 Stimuli

Three lists of stimulus items were included in the study: seventy-five real Dutch words, seventy-five Dutch non-words, and seventy-five reversed words. The words and non-words were taken from the VariaNTS corpus (Arts et al., in prep.). To create the reversed words, we used Adobe Audition to time reverse the audio files of the real words. The original corpus contains 11 linguistic categories based on lexical frequency, phonotactic probability, and neighborhood density. We selected the list of words with high lexical frequency and low neighborhood density (easy words) and the list of non-words with low phonotactic probability and high neighborhood density (hard non-words). The terms 'easy' and 'hard' in the VariaNTS corpus refer to the processing demands of the linguistic information. Koelewijn et al. (2020) compared easy and hard words and non-words: the results showed no significant difference between easy and hard real words and no significant difference between words and non-words, but lower JNDs for easy non-words in the variable condition. Therefore, in this experiment, we focused on the variable condition (presenting three different stimuli in one trial). Because no difference was observed between easy and hard words, we compared the easy words with the hard non-words. Table 3.1 shows some examples of the words and non-words included in the study; the full lists of stimuli are presented in the Appendices.
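The reversed words were created in Adobe Audition; purely for illustration, the same time reversal can be sketched in R with the tuneR package (the file names and the assumption of mono recordings are placeholders, not details of the actual stimulus pipeline):

# Illustrative time reversal of a mono wav file with tuneR (not the pipeline actually used).
library(tuneR)

reverse_wav <- function(in_path, out_path) {
  w <- readWave(in_path)                      # read the original recording (assumed mono)
  rev_w <- Wave(left = rev(w@left),           # reverse the sample order
                samp.rate = w@samp.rate,
                bit = w@bit)
  writeWave(rev_w, out_path)                  # write the reversed stimulus
}

# reverse_wav("kamp.wav", "kamp_reversed.wav")  # hypothetical file names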

The VariaNTS corpus contains recordings from five different speakers. We selected the audio files recorded by a 20-year-old native Dutch female speaker with an average F0 of 214.36 Hz. All lists of words were analyzed with WORLD (Morise, Yokomori, & Ozawa, 2016) to obtain the F0 contour and the spectral envelope. In the adaptive procedure described below, the VTL of the words was varied on each trial: three randomly selected words were resynthesized with WORLD using the new VTL parameters and normalized to an average duration of 600 ms. The words were resynthesized even when the VTL was unchanged compared to the original voice.

Table 3.1 Examples of real words and non-words that were used in the experiment from the VariaNTS Corpus.

Real words           Non-words
kamp  greep  dorp    dap  kein  doer
zwak  steen  vorm    raf  peis  roen


3.3 Procedure

In a design similar to Gaudrain and Başkent (2015), discrimination thresholds (JNDs) were obtained using a 3I-3AFC adaptive procedure. The VTL of the stimuli was manipulated using WORLD (Morise, Yokomori, & Ozawa, 2016), while F0 was held constant in all conditions. During each trial, the subjects heard three stimuli and saw three buttons on the screen light up as each corresponding stimulus was played. The instruction was to choose the one stimulus that differed in voice from the other two by clicking on the corresponding button on the computer screen. In each trial, the subjects were presented with three items (from the three categories). Two items were produced with the original (recorded) voice parameters, while the odd stimulus (randomly assigned to one of the three presentation intervals) was produced with a VTL that differed from the original voice. All VTL differences were expressed relative to the actual VTL of the original speaker. The voice difference in each condition started at 12 st, calculated as the Euclidean distance in the F0-VTL plane expressed in semitones relative to the reference voice. A given step size then modified the voice difference according to the subject's responses.

The two-down, one-up adaptive procedure means that when a subject gave two consecutive correct answers, the voice difference was reduced by a specific step size (making discrimination more challenging). Whenever they gave one incorrect answer, the voice difference was increased by that same step size (making discrimination easier). This yields an estimate of the voice difference corresponding to 70.7% correct discrimination on the psychometric function (Levitt, 1971).
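A minimal sketch of this two-down, one-up logic is given below, written in R for illustration only (the experiment itself ran on the online PICKA platform; the step size, trial count, and simulated listener are placeholder assumptions):

# Illustrative two-down, one-up staircase (not the experiment's actual code).
run_staircase <- function(n_trials = 60, start_st = 12, step_st = 2, floor_st = 0) {
  diff_st <- start_st            # current VTL difference in semitones
  correct_in_row <- 0
  track <- numeric(n_trials)
  for (t in seq_len(n_trials)) {
    track[t] <- diff_st
    correct <- simulate_response(diff_st)        # placeholder 3AFC response for this difference
    if (correct) {
      correct_in_row <- correct_in_row + 1
      if (correct_in_row == 2) {                 # two consecutive correct: make it harder
        diff_st <- max(floor_st, diff_st - step_st)
        correct_in_row <- 0
      }
    } else {                                     # one incorrect: make it easier
      diff_st <- diff_st + step_st
      correct_in_row <- 0
    }
  }
  track   # adaptive track; the JND is typically estimated from the last reversals of such a track
}

# Placeholder listener model: chance is 1/3 in a 3AFC task, approaching 1 for large differences.
simulate_response <- function(diff_st) runif(1) < (1/3 + (2/3) * (1 - exp(-diff_st / 4)))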

The study aimed to compare three types of linguistic items (word status) in three vocoding conditions. Therefore, the experiment included nine blocks divided over three sessions, and its total duration was around 30-40 minutes. The vocoder conditions were randomized over the nine blocks. Each subject completed three training rounds before starting the nine experimental blocks. Each run started with a difference of 12 st calculated along the spoke, and a given step size then modified the voice difference according to the subject's responses. Table 3.2 shows the nine conditions of the experiment. The Results section reports the data analysis for the six participants who completed the whole experiment.


Table 3.2 The nine blocks of the experiment.

Condition (vocoder / word status)   Words      Non-words      Reversed words
No vocoder (No)                     No Words   No Non-words   No Rev-words
Low-spread vocoder (LS)             LS Words   LS Non-words   LS Rev-words
High-spread vocoder (HS)            HS Words   HS Non-words   HS Rev-words

3.4 Results

The median JND in the No Words condition (0.67 st, SE = 0.3 st) was the smallest of the nine conditions in the experiment, and the largest median JND was in the HS Rev-words condition (10.82 st, SE = 2.6 st). The mean JND for the No vocoder condition averaged over all word status conditions was the smallest (1.8 st) compared to the Low Spread (6.2 st) and High Spread (8.3 st) vocoders. Table 3.3 shows descriptive measures for the nine conditions. For word status, the mean JND averaged over the three vocoders was smallest for words (3.5 st) compared to non-words (5.5 st) and reversed words (7.3 st). Figure 3.1 shows boxplots of the JNDs in the different conditions based on vocoder type and word status, on a logarithmic scale. The boxplots were created in RStudio using the ggplot function from the ggplot2 package.

Table 3.3 Descriptive measures for the nine conditions in the experiment.

Vocoder  Word status  Mean   Median  SD    Range  SE
No       Words        1      0.67    0.72  1.86   0.29
No       Non-words    1.67   1.07    1.62  4.38   0.66
No       Rev-words    2.8    1.67    3.23  8.36   1.32
LS       Words        3.37   3.59    1.03  2.85   0.42
LS       Non-words    7.25   5.36    6.05  16.18  2.47
LS       Rev-words    8.24   6.63    5.66  15.16  2.31
HS       Words        6.22   6.08    3.23  7.02   1.32
HS       Non-words    7.69   5.64    5.62  15.05  2.29
HS       Rev-words    10.99  10.82   6.4   15.65  2.61


Figure 3.1 The Just Noticeable Differences in nine conditions based on vocoder type and word status. Average VTL JNDs, in semitones, shown on a logarithmic scale (base10) for each condition. The boxplots extend from the lower to the upper quartile (the interquartile range, IQ), and the midline indicates the median. The whiskers indicate the highest and lowest values, and the dots indicate the outliers, i.e., data points larger than 1.5 times the IQ.
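As noted above, the boxplots were produced with ggplot2. A minimal sketch of this kind of call is shown below, assuming a long-format data frame with hypothetical columns vocoder, word_status, and jnd:

# Illustrative ggplot2 call for Figure 3.1-style boxplots (data frame and column names are assumed).
library(ggplot2)

ggplot(jnd_data, aes(x = vocoder, y = jnd, fill = word_status)) +
  geom_boxplot() +
  scale_y_log10() +          # JNDs plotted on a log10 scale, as in Figure 3.1
  labs(x = "Vocoder condition", y = "VTL JND (semitones)", fill = "Word status")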

Table 3.4 Group mean and SD per vocoding condition

Vocoder Mean Median SD Range SE

No 1.82 1.14 1.84 4.85 0.75

LS 6.29 6.09 3.1 8.92 1.26

HS 8.3 6.6 3.27 7.95 1.34

Table 3.5 Group mean and SD per word status condition

Word Status Mean Median SD Range SE

Words 3.53 3.7 1.29 2.76 0.53

Non-words 5.54 4.36 2.93 7.28 1.19


The JND thresholds were log transformed to approximate a normal distribution and to reduce differences in variability. The data were analyzed using a repeated-measures Type III analysis of variance (ANOVA), with word status and vocoder type as repeated factors. The car package was used, and a linear model was specified with the lm function in R version 4.0.1. The log-transformed JND thresholds were the dependent variable, and vocoder and wordS (word status) were the independent variables; the term vocoder*wordS in the formula adds the interaction between vocoder and word status to the model, as shown in the lm syntax:

lm(log_threshold ~ vocoder * wordS)

The outcome showed a significant main effect of vocoder [F(2,10) = 31.41, p < 0.05] and a significant main effect of word status [F(2,10) = 4.69, p < 0.05]. However, there was no significant interaction [F < 1, p = 0.947]. For planned comparisons between word status conditions, the thresholds were averaged over all vocoder conditions; likewise, for planned comparisons between vocoder conditions, the thresholds were averaged over all word status conditions.
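A sketch of this analysis in R is shown below; the data frame and column names are assumptions, and a complete repeated-measures treatment would additionally model the subject factor, which is omitted here for brevity:

# Illustrative Type III ANOVA on the log-transformed JND thresholds (sketch, not the exact analysis script).
library(car)

jnd_data$log_threshold <- log10(jnd_data$threshold)       # log transform of the JNDs
options(contrasts = c("contr.sum", "contr.poly"))          # sum-to-zero contrasts, required for Type III tests

m <- lm(log_threshold ~ vocoder * wordS, data = jnd_data)  # main effects plus their interaction
Anova(m, type = 3)                                         # car::Anova with Type III sums of squares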

3.4.1 Word Status

For the planned comparisons, three paired-samples t-tests were used to compare JNDs between the word status conditions. A significant difference was found between words and reversed words [t(5) = -2.94, p < 0.05], and non-words and reversed words also differed significantly [t(5) = -2.76, p < 0.05]. However, words and non-words did not differ significantly [t(5) = -2.41, p = 0.060]. This result shows that more phonological information was associated with smaller (better) JNDs.
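These planned comparisons can be sketched as paired t-tests in R, assuming one mean threshold per participant and word status condition, averaged over vocoders (object names are placeholders):

# Illustrative planned comparisons for word status (one participant-mean value per condition).
t.test(words_mean, rev_words_mean, paired = TRUE)      # words vs. reversed words
t.test(nonwords_mean, rev_words_mean, paired = TRUE)   # non-words vs. reversed words
t.test(words_mean, nonwords_mean, paired = TRUE)       # words vs. non-words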

3.4.2 Vocoder Type

Three paired-samples t-tests were used to compare JNDs between the vocoder conditions. The results differed significantly between the No vocoder and the LS vocoder [t(5) = -8.71, p < .001] and between the No vocoder and the HS vocoder [t(5) = -6.37, p < 0.05]. No significant difference was found between the Low and High Spread vocoders [t(5) = -2.03, p = 0.098]. The JNDs were most affected when using the High Spread vocoder (12 bands, 4th order filter).


3.5 Discussion

The current experiment shows differences in VTL JNDs depending on word status and vocoding. The stimuli included words, non-words, and reversed words in three vocoder conditions: a High Spread vocoder (HS: 12 bands, 4th order filter), a Low Spread vocoder (LS: 12 bands, 12th order filter), and a No vocoder condition. The stimuli in the No vocoder condition represent clear speech as perceived by NH listeners, whereas the LS and HS vocoders simulate the low or high spread of signal transmission, respectively, that can occur in different CI users. The smallest JNDs were associated with words and with the No vocoder condition, while the vocoded conditions yielded larger JNDs in general. The statistical analysis showed main effects of both vocoder type and word status on the JNDs.

With this study, we aimed to answer the following research questions: First, are VTL JNDs affected by word stimuli (real words, non-words, and reversed words)? Second, does this word status effect interact with vocoding? The results showed that VTL JNDs for words are significantly smaller than for reversed words, and that VTL JNDs for non-words are significantly smaller than for reversed words. Regarding vocoding (CI simulation), the vocoder conditions showed a significant main effect, as mentioned above: stimuli that represent how NH listeners perceive speech signals show lower (better) JNDs than LS and HS vocoded stimuli, which were used to resemble some aspects of CI sound transmission (degraded speech signals). The word status effect, however, did not interact significantly with vocoding.

The first hypothesis about spoken word recognition suggested that phonological and semantic components facilitate lexical retrieval. The current results support the predictions about word status: larger JNDs were associated with reversed words than with real words and non-words. The significant difference between words and reversed words might, at first, signal a benefit of both semantic and phonological components for speech perception. However, there was no significant difference in JNDs between real words and non-words, which suggests that listeners do not depend on semantic information as much as on phonology. A significant difference was found between reversed words and non-words, which supports the idea that hearing speech signals with natural Dutch phonology (even when they carry no meaning) influences speech perception.

In highly degraded speech signals, listeners appear not to distinguish words (with meaning) from non-words (without meaning). This result also suggests that phonological properties are more critical for perceiving degraded speech signals than the distinct representation of real words in the mental lexicon. In line with the findings of Koelewijn et al. (2020), our results show that the perception of VTL changes is not affected by lexical properties of real words but is closely related to the phonological properties of words and non-words. This result is also supported by the hypotheses of Cassani et al. (2019), who suggested that the sound of a word or non-word can provide useful and reliable cues for recognition.

In the second hypothesis, we predicted higher thresholds for the perception of VTL in vocoded speech, which was also borne out by the results. A significant difference was found between JNDs in the No vocoder and LS vocoder (12th order filter) conditions, as well as between the No vocoder and HS vocoder (4th order filter) conditions. Larger VTL JNDs were associated with the HS vocoder compared to the No vocoder condition, indicating that VTL changes are harder to detect when spectral information is highly spread. However, there was no significant difference between the LS and HS vocoders, which is in line with the findings of Gaudrain & Başkent (2015). VTL perception depends on spectral resolution, which can be manipulated by changing the number of channels (bands) in a vocoder and by changing the sharpness of the filters (i.e., a 12th order filter is sharper than a 4th order filter). Gaudrain & Başkent (2015) showed that both manipulations had perceptually similar effects on the detection of VTL differences. However, the authors suggested that the two manipulations are not independent: with a high number of channels (e.g., 12 bands), increasing the filter order beyond eight did not further improve the JND. Our results show the same pattern, as the mean JNDs in the LS and HS vocoder conditions did not differ significantly.

The perception of voice cues, especially VTL, needs further investigation in order to enhance the perception of different voices and talkers. In the current study, we focused on VTL perception for different word stimuli, because in Koelewijn et al. (2020) the perception of VTL was related to word status. Our study can be extended into larger experiments with more vocoding conditions and more participants. With more participants, we might see a difference between the 4th and 12th order filter conditions. However, we must keep in mind the conclusion of Gaudrain and Başkent (2015) that sharpening the filters does not always improve VTL JNDs, as was the case in their experiment with 12 bands between the 8th and 12th order filters.

Speech perception depends on cognitive skills that allow top-down processing to enhance the acoustic bottom-up information, which means that the brain can apply what it knows to fill in the blanks. The degraded speech input of a CI might affect this top-down mechanism and cognitive abilities (Başkent et al., 2016). For future related studies that include sentences, it might be predicted that context makes speech perception of degraded signals easier. However, Winn and Moore (2018) showed that the benefit of context might not survive outside the idealized laboratory or clinical environment: when a listener uses context to repair part of a sentence, later-occurring auditory stimuli interfere with that repair process. The increased listening effort in deaf and hard-of-hearing listeners might therefore result not just from poor auditory encoding but also from inefficient use of context and prolonged processing of misperceived utterances that compete with the perception of incoming speech.

The findings of the current study contribute to a better understanding of the interaction between linguistic processing and the perception of voice cues. This study is part of a larger project that includes other experiments. It also gives insight into the involvement of lexical and phonological properties in speech perception, which is especially important for clinical applications that help actual CI users, whether pre-lingually or post-lingually deaf.

Based on the data we have, we cannot entirely dismiss a difference between words and non-words. When we look at the results, we see some differences (e.g., in the LS condition) that call for further investigation. However, the study was underpowered, so future research should include more participants to establish whether there is a difference between words and non-words. A semantic effect might then become more evident in comparison with the phonological effect.

For future experimental studies, I recommend including all the lists of words and non-words from the VariaNTS corpus (Arts et al., in prep) and designing experiments that contain sentence stimuli. Future studies could compare the list of hard words with the list of hard non-words, because both lists are linguistically more complex according to the corpus. The real words were matched on psycholinguistic variables such as lexical frequency and neighborhood density. These two features affect spoken word recognition (Howes, 1957; Savin, 1963; Luce & Pisoni, 1998), because words with higher lexical frequencies are more accessible than words with low lexical frequencies, and words with fewer neighbors are phonetically more distinct and easier to recognize. Matching on these complexity variables would allow a better comparison with the list of hard non-words. The non-words in the VariaNTS corpus were matched on phonotactic probability and density, and the hard non-words are expected to place higher demands on linguistic processing (Arts et al., in prep). A comparison between easy words and easy non-words is also recommended, as both are predicted to impose fewer processing demands. On the other hand, non-words do not have a clear representation in the mental lexicon and are evaluated by their relation to typical phonological sequences of real words (Janse & Newman, 2013). This means that a strong phonological effect is still expected with non-words, as opposed to lexical involvement.

3.6 Valorization

To conclude, the ability of CI users to perceive different talkers and voices is an important aspect that further develops their language and communication skills in real life. Developing the technology behind CI devices is a topic of interest for researchers in the field, where different companies compete to produce the best technology for stimulation and speech transmission. However, the involvement of language abilities is also fundamental for predicting communication outcomes. In clinical training for CI users, therapy programs target auditory-verbal skills to improve overall language skills. These two main aspects, language skills and CI transmission, were addressed in the current study with a relatively small experiment that opens the door to further questions and to larger experiments with more participants in the future.

The current study is beneficial not only to researchers and companies that manufacture CI devices, but also to families and clinicians who are in direct contact with CI users. The knowledge gained in this project can be developed into strategies and tools that address the daily-life difficulties of CI users. Increasing social awareness of these difficulties can enhance the quality of communication between CI users and the people around them.


REFERENCES

Amann, E., & Anderson, I. (2014). Development and validation of a questionnaire for hearing implant users to self-assess their auditory abilities in everyday communication situations: The Hearing Implant Sound Quality Index (HISQUI19). Acta Oto-Laryngologica, 134(9), 915-923.

Arenberg Bierer, J. (2010). Probing the electrode-neuron interface with focused cochlear implant stimulation. Trends in Amplification, 14(2), 84-95.

Arts, F., Başkent, D., & Tamati, T. N. (in prep). Development and structure of the VariaNTS corpus: A spoken Dutch multi-talker and linguistic variability corpus for speech perception assessment.

ASTM International. (2009). Standard terminology relating to sensory evaluations of materials and products (E253-209a). West Conshohocken, PA: ASTM International.

Başkent, D. (2010). Phonemic restoration in sensorineural hearing loss does not depend on baseline speech perception scores. The Journal of the Acoustical Society of America, 128(4), EL169-EL174. https://doi.org/10.1121/1.3475794

Başkent, D., Gaudrain, E., Tamati, T. N., & Wagner, A. (2016). Perception and psychoacoustics of speech in cochlear implant users. In A. T. Cacace, E. de Kleine, A. G. Holt, & P. van Dijk (Eds.), Scientific foundations of audiology: Perspectives from physics, biology, modeling, and medicine (pp. 285-319). San Diego, CA: Plural Publishing. ISBN 978-1-59756-652-0.

Başkent, D., Luckmann, A., Ceha, J., Gaudrain, E., & Tamati, T. N. (2018). The discrimination of voice cues in simulations of bimodal electro-acoustic cochlear-implant hearing. The Journal of the Acoustical Society of America, 143(4), EL292-EL297.

Baumann, U., & Nobbe, A. (2006). The cochlear implant electrode-pitch function. Hearing Research, 213(1-2), 34-42. https://doi.org/10.1016/j.heares.2005.12.010

Beer, J., Kronenberger, W. G., & Pisoni, D. B. (2011). Executive function in everyday life: Implications for young cochlear implant users. Cochlear Implants International, 12(sup1), S89-S91.

Bess, F. H., & Hornsby, B. W. (2014). Commentary: Listening can be exhausting: Fatigue in children and adults with hearing loss. Ear and Hearing, 35(6), 592.

Bhargava, P., Gaudrain, E., & Başkent, D. (2014). Top-down restoration of speech in cochlear-implant users. Hearing Research, 309, 113-123.

Bhargava, P., Gaudrain, E., & Başkent, D. (2016). The intelligibility of interrupted speech: Cochlear implant users and normal-hearing listeners. Journal of the Association for Research in Otolaryngology, under revision.

Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions. Acta Acustica united with Acustica, 86(1), 117-128.

Carlson, K. (2009). How prosody influences sentence comprehension. Language and Linguistics Compass, 3(5), 1188-1200.
