Variation in intensity dynamics: A cross-linguistic study of between-speaker variability among L1 and L2 speakers

Academic year: 2021


Variation in intensity dynamics: A cross-linguistic study of between-speaker variability among L1 and L2 speakers

by

Carolina Lins Costa Almeida Castro Machado

Thesis submitted to the Department of Humanities in Partial Fulfillment of the Requirements for the Degree of

Master of Arts in Theoretical and Experimental Linguistics at

Leiden University

July 2020

Supervisor: Dr. W. F. L. Heeren
Second reader: Dr. N. H. de Jong


Table of Contents

Abstract
1. INTRODUCTION
1.1. Problem Statement
1.2. Motivation and Objectives
1.3. Thesis Outline
2. THEORETICAL BACKGROUND
2.1. Speech Production in L1 and L2
2.1.1. Differences Between L1 and L2 Speech Production Models
2.1.2. Factors Affecting L1 and L2 Speech
2.1.3. The Mechanism of Speech Production
2.2. Sound Intensity
2.3. Speech Dynamics
2.4. Intensity Dynamics
2.5. Speaker-specificity
2.6. Forensic Speaker Comparison
2.7. Research Questions and Hypotheses
3. METHODOLOGY
3.1. Corpus
3.2. Data Preparation
3.3. Data Extraction
4. RESULTS
4.1. Factor Analysis
4.2. Multinomial Logistic Regression
4.3. Linear Discriminant Analysis
4.4. Linear Mixed-Effects Model
4.5. Intraclass Correlation
5. DISCUSSION
5.1. Between-speaker Variation
5.2. Language Effects
5.3. Limitations and Future Research
6. CONCLUSION
7. BIBLIOGRAPHY
APPENDICES
Appendix I. Group statistics: Samples per speaker per language
Appendix II. Interval plots of measures of negative intensity dynamics
Appendix III. Interval plots of measures of positive intensity dynamics
Appendix IV. Confusion matrix (in percentage) of analyses with positive (A+) and negative (A–) measures for the subset L1 Dutch
Appendix V. Confusion matrix (in percentage) of analyses with positive (A+) and negative (A–) measures for the subset L2 English


Abstract

Do measures of intensity dynamics change when an individual speaks in a non-native language? This study investigates positive and negative intensity dynamics (related to the articulatory mouth-opening and mouth-closing gestures) in spontaneous speech produced by Dutch speakers in their native language (L1) and in English, a second language (L2). Statistical analyses showed that between-speaker variability was explained by the mean, standard deviation and sequential variability of positive and negative intensity dynamics. Negative dynamics explained a larger proportion of between-speaker variability, suggesting less prosodic control over the mouth-closing movement. When assessing the influence of language on intensity dynamics, a significant effect was found on both positive and negative dynamics. These findings suggest that intensity dynamics differ between the L1 and the L2. However, speaker-specific information may be embedded in these time-bound measures regardless of the language (L1 or L2) in use, suggesting that intensity dynamics have discriminative power across languages. Finally, the results add support to findings positing that speaker-specificity may not be restricted to the native language.


1. INTRODUCTION

What makes your voice different from your neighbor's voice? This question has sent linguists searching for measurable differences between individuals' voices. A multitude of factors influence the way we speak, such as the audible acoustic characteristics of pitch and speech rhythm, the idiosyncratic way in which we pronounce a certain word or phoneme, or an accent indicating the region we came from (Coulthard et al., 2016, p. 136). The study of how and why voices differ between speakers is a central concern of forensic phonetics, in which experienced phoneticians carry out, among other methods, speaker comparisons. This entails comparing voice samples of an unknown speaker to samples of a known speaker to establish whether the unattributed samples may belong to the known speaker. The process can be described as an auditory-acoustic analysis, in which the analyst performs an aural-perceptual investigation and examines the physical features of the speech signal (Rose, 2002; Coulthard et al., 2016). To date, acoustic features such as measures of fundamental frequency (Rose, 2002; Gold & French, 2011) and vowel formants (Goldstein, 1976; McDougall, 2007; He et al., 2019) have been extensively studied and employed in speaker comparisons. However, there are still features capable of supporting forensic voice analysis that we know little about, for example, intensity dynamics (He & Dellwo, 2017).

In simple terms, intensity is what we auditorily perceive as loudness; a loud sound has a higher intensity and a quieter sound a lower intensity. For example, by pronouncing a word louder than the rest of an utterance to make it prominent, you radiate much more energy from your mouth than when speaking the same word without emphasis. This amount of radiated energy is intensity, which in a word in focus is relatively higher than in the rest of the utterance. The intensity carried by the speech wave depends on the amplitude of vibration of the vocal folds; for example, if the amplitude is low, the amount of energy dispersed will also be low. Moreover, inherent intensity differs fundamentally across phonemes, the smallest units of speech sounds. Just as we perceive differences in global speech loudness, we can also identify the inherent "loudness" of different phones even when the overall voice intensity is kept constant. This is because speech sounds, such as vowels and consonants, have intrinsically different intensities as a result of the different configurations of the vocal tract.

The differences between phones' intensities in a speech chain can be visualized as a curve with high and low points of intensity as the utterance progresses. To put it simply, think of the intensity curve as the silhouette of a mountain chain with peaks and valleys: places where intensity values are at their highest are peaks, and places where they are at their lowest are valleys. The gradual transitions between the peaks and valleys of a speech chain can be understood as intensity dynamics. Besides providing acoustic information related to the message of an utterance, intensity dynamics may also reflect information about a speaker's behavioral and biological characteristics (He & Dellwo, 2017; He et al., 2019).
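The peak-and-valley picture can be made concrete with a small sketch that extracts rising (positive) and falling (negative) intensity movements from a toy intensity contour and summarizes them with their mean, standard deviation, and sequential variability. This is an illustrative reconstruction only, not He & Dellwo's (2017) actual algorithm; the function names and the toy contour are invented for this example:

```python
import numpy as np

def intensity_dynamics(contour, dt=0.01):
    """Illustrative split of a dB contour (sampled every dt seconds)
    into rising and falling slopes between its turning points."""
    d = np.diff(contour)
    # Turning points: where the sign of the local slope changes.
    turning = np.where(np.diff(np.sign(d)) != 0)[0] + 1
    idx = np.concatenate(([0], turning, [len(contour) - 1]))
    # Slope (dB/s) of each monotonic stretch between turning points.
    slopes = [(contour[b] - contour[a]) / ((b - a) * dt)
              for a, b in zip(idx[:-1], idx[1:]) if b > a]
    pos = [s for s in slopes if s > 0]   # rising: mouth-opening gestures
    neg = [s for s in slopes if s < 0]   # falling: mouth-closing gestures
    return pos, neg

def summary(slopes):
    """Mean, standard deviation, and sequential variability
    (mean absolute difference between consecutive slopes)."""
    x = np.asarray(slopes)
    seqvar = np.mean(np.abs(np.diff(x))) if len(x) > 1 else 0.0
    return x.mean(), x.std(), seqvar

# Toy contour in dB: two peaks separated by a valley.
contour = np.array([50, 60, 70, 62, 55, 64, 72, 58], dtype=float)
pos, neg = intensity_dynamics(contour)
```

On this toy contour the function finds two rising and two falling stretches; in real speech the contour would come from an intensity analysis of the recording, with many more peaks and valleys per utterance.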

1.1. Problem Statement

Thus far, between-speaker differences in intensity dynamics have only been demonstrated in studies analysing read speech samples from native speakers of Zürich German (He & Dellwo, 2017). It is therefore necessary to test whether these results replicate in other languages before generalizing the claim that intensity dynamics may contain enough speaker-specific information to differentiate between speakers. Hence, this thesis seeks to address this gap by investigating whether the intensity dynamics of native Dutch speakers provide enough speaker-specific information to distinguish, to a degree, between different speakers. Furthermore, given that almost all Dutch nationals (94%) report being able to speak at least one language other than their native tongue (European Commission, 2012, p. 16), it is of great interest to determine whether patterns found in measures of intensity dynamics in their mother tongue (L1) also emerge in their second language (L2). English was chosen as the L2 in this study because it is the most widely spoken non-native language in the Netherlands, where 90% of the population reported being able to use it (ibid., p. 21).

Regarding speech style, it has been proposed that some phonetic patterns are exclusive to spontaneous speech, setting it apart from other styles such as read speech (Simpson, 2013). These patterns are believed to result from the communicative situation the speaker is in; that is, different communicative situations involve different factors influencing speech gestures and the phonetic signal (ibid., p. 163). For example, a speaker reading a passage aloud might focus more on the vocalization of the message than on the message itself, since the information being delivered is already given in the text. Conversely, when two speakers are conversing, information is in the foreground and vocalization tends to receive less attention, also because, in case of misunderstanding, clarification and repetition of the message are possible.

Given the above, it seems important to investigate idiosyncratic information (i) in the intensity dynamics of spontaneous speech samples, and (ii) when a speaker uses an L2, since certain groups of native Dutch speakers – such as university students and lecturers, or employees of multinational companies in the Netherlands – are quite likely to use a non-native language in everyday exchanges. By assessing whether speaker-specific information is still present when a speaker uses different languages in informal communicative situations, one could establish the robustness of intensity dynamics as a feature for forensic voice analysis.


1.2. Motivation and Objectives

First and foremost, in addition to He & Dellwo (2017), more studies of intensity dynamics with native and non-native speakers of different languages are necessary to understand which speaker-specific factors influence this acoustic variable and why, filling a gap in our current knowledge. Secondly, research of this kind could also benefit forensic phoneticians because, in order to express their findings more reliably, they need population statistics for the acoustic parameters being analyzed (Foulkes & French, 2012; Coulthard et al., 2016). This study addresses that need by providing initial background statistics on the intensity dynamics of native Dutch speakers. Finally, using spontaneous speech samples rather than prepared read speech brings the analysis of intensity dynamics closer to the realities of forensic phonetic casework.

Thus, the main purpose of this study is to investigate to what extent measures of intensity dynamics reveal between-speaker variability in spontaneous speech samples of native Dutch speakers when they use their L1 and when they use English, a second language (L2) in which they are proficient.

1.3. Thesis Outline

This thesis begins by providing a theoretical background (Chapter 2) on the main aspects of this research, highlighting key theoretical concepts related to first and second language speech production, contextualizing the research in terms of the acoustic aspects involved and the speaker- and language-specific factors affecting them, and providing an overview of where the findings of this thesis may be practically applied. The chapter ends with a concise overview of the previous sections, culminating in the research questions and hypotheses investigated in this study.


Chapter 3 describes the sources and methods of the study, including the corpus used, the procedures for preparing and extracting the acoustic data, and the statistical methods employed. In Chapter 4 the results are presented. Subsequently, in Chapter 5, these results are discussed and related to previous findings; the chapter ends by addressing the limitations of this thesis and proposing directions for future research. Finally, the conclusion (Chapter 6) highlights the key findings of this thesis and their practical implications for the field of forensic phonetics.


2. THEORETICAL BACKGROUND

2.1. Speech Production in L1 and L2

In the seminal work “Speaking: From Intention to Articulation” Levelt (1989) states that the generation of uninterrupted speech involves several underlying components functioning in a highly automatic manner. In developing “a blueprint for the speaker” he considers the individual an information processor, that is, someone who can transform feelings, thoughts or intentions into “fluent articulated speech” (ibid., p. 1). The components of this archetypal speech production system are a conceptualizer, a formulator, an articulator, and the speech-comprehension system. The conceptualizer consists of the conceptual module used to generate preverbal messages. The formulator entails two sub-components, the first is a grammatical encoder, which creates a “surface structure” embedding the preverbal message, and the second is a phonological encoder, which outputs a phonetic plan of the surface structure generated by the grammatical encoder. Evidence for a planning phase in speech production has been found, for instance in examples of sound reversals called spoonerisms, such as the slip of the tongue “peach seduction” when the speaker meant “speech production” (Fromkin, 1973, p. 185). Following this planning phase, the next component is the articulator, which unfolds over time and executes the phonetic plan resulting in overt speech. The two last outputs, the internal and external speech, are monitored by the speaker in the speech-comprehension system.

Although Levelt (1989) claimed that the speech production system is highly automatic, he further clarifies that only the formulator and the articulator components can be considered somewhat reflex-like, while the generating and monitoring components require the speaker's attention throughout the entire speech process (p. 28). Given that the articulator component can be considered automatic when individuals speak their L1, we can understand why speakers using their L2 tend to produce sounds closer to those of their L1. Kormos' (2006) integrated model of L2 speech production provides a theoretical account of the influence of the L1 on L2 speech production. Grounding her model on Levelt's (1989, 1999) architecture of the speaker, she details the distinctions between the L1 and L2 speech processes. Table 2.1 presents an overview of the main differences between the components of the monolingual and bilingual1 speech processes according to Kormos (2006).

While the scope of this thesis falls essentially within the articulatory component of the speech production process, it is important to understand the process as a whole, since the interaction of components in the speech production system affects the articulatory output.

2.1.1. Differences Between L1 and L2 Speech Production Models

The differences between L1 and L2 start with the conceptualization of the message in the L2. Items from both languages compete for selection; hence, before the relevant concepts are activated and encoded, the speaker has to select the language s/he will use. It is assumed that during the formulation of an utterance the linguistic knowledge of a native speaker is complete, making lexico-grammatical, morpho-phonological, and phonetic encoding quasi-automatic; i.e., during the conception of the message a speaker should possess all the knowledge needed to choose the correct lexemes and syntax for the intended message and "translate" it without much effort into speech with the correct phonological and phonetic forms. Conversely, since it is theorized that, unlike that of her/his native language, a speaker's L2 knowledge will never be complete, the same encoding processes may require much more attention from the speaker, hindering communication in the L2 and demanding a greater effort to overcome communication problems. Furthermore, it has been proposed that words from both languages are stored in the same lexicon, that syntactic rules shared by both languages are stored together, and that identical phonemes in the L1 and L2 have a joint memory representation (Kormos, 2006, citing Roelofs, 2003 and Meijer & Fox Tree, 2003).

1 In this thesis bilingual refers to late bilingualism, defined as the acquisition of a second language after

Regarding articulatory gestures, the monolingual model by Levelt (1989, 1999) proposes that speakers pronounce phonemes in their native language virtually instinctively, because the necessary articulatory gestures for producing these sounds are already automatized and stored in the syllabary, a repository of articulatory syllable programs (Levelt, 1999, p. 88). Conversely, both beginning and advanced L2 learners have difficulty acquiring new articulatory gestures for L2 phonemes, implying that the production of these sounds is not automatic (Kormos, 2006, p. 167). Consequently, these differences in the automatization of L1 and L2 speech might affect acoustic measures that depend on the temporal organization of speech, such as intensity dynamics.

The last component of both speech production models is long-term memory. This component contains the syllabary, the knowledge of facts and figures (episodic memory), and the mental lexicon. An additional subcomponent for bilingual speakers is the declarative memory, which contains the phonological and syntactic rules of the L2 necessary for speech production in the non-native language. This subcomponent is essential to L2 speakers, since they may not have the production rules of the second language automatized, nor as part of their encoding systems as L1 speakers do.

2.1.2. Factors Affecting L1 and L2 Speech

Because a large amount of the knowledge needed for the processing and production of language is automated for L1 speakers, most (if not all) attentional resources can be employed in the self-perception component for self-monitoring; i.e., "the checking of the correctness and appropriateness of the produced verbal output" (Kormos, 2006, p. 185). Conversely, because of the reduced automatization in the processing of L2 speech, non-native speakers may need to prioritize language processing for the sake of intelligibility, leaving speech production with very limited attentional resources. Kormos (2006) argues that the lack of attentional resources and especially the lack of automaticity during speech production may be two central causes of "why L2 speech is slower than L1 speech" (p. 154). Although this might generally be the case for beginning L2 learners, proficiency or length of experience could alter this tendency. Kormos (ibid.) suggested that advanced L2 speakers can generally create syllable programs in the L2 and store them in their syllabary; consequently, experienced speakers would produce L1 and L2 speech at a similar pace, relying on L1 and L2 articulatory gestures stored in their long-term memory. The idea of reliance on the L1 is also shared by theories of L2 phonology acquisition, which assume that speech production in the L2 is affected by the phonologically similar categories in the speaker's L1 (Flege, 1995; Best et al., 2001; Best & Tyler, 2006; Escudero, 2009).

Among other possible reasons for the effect of language on speech production is the semantic diversity between languages. In a recent paper, Kemmerer (2019) provides an extensive discussion, from a neurolinguistic point of view, positing that semantic diversity between languages may significantly affect speech production, thereby strengthening the idea that an utterance should "be tuned to the target language" (Levelt, 1989, p. 71). Indeed, comparing the categorization of static spatial relationships in English and in Dutch makes cross-linguistic differences evident. As reported by Bowerman and Choi (2001), the spatial semantic categories of the two languages differ: while English uses the preposition on for both contiguity and support, Dutch uses the preposition op for support and aan for contiguity; e.g., kop op de tafel "cup on the table" and foto aan de muur "picture on the wall" (ibid., p. 485). The different semantic categories in these languages may affect speech production prior to the articulatory act, during the formulation of the message. After conceptualizing the message, several closely related word meanings may be coactivated, from which the speaker has to select the most appropriate one given the message s/he wishes to convey (Kemmerer, 2019, p. 5).

Table 2.1 Distinctions Between the Components in Monolingual and Bilingual Speech Production. The components in this table are based on Levelt's (L) and Kormos' (K) models.

Conceptualization of the message (K)
Monolingual: activation of relevant concepts to be encoded.
Bilingual: language selection; activation of relevant concepts to be encoded.

Formulation (K)
Monolingual: complete language knowledge to carry out lexico-grammatical, morpho-phonological, and phonetic encoding.
Bilingual: assumed never-complete language knowledge. Lexical encoding: L1 and L2 words are stored together. Syntactic encoding: syntactic rules shared by L1 and L2 are stored together. Phonological encoding: identical L1 and L2 phonemes are represented together.

Articulation (K)
Monolingual: syllabary, in which articulatory gestures are stored as syllables and are fully automatized.
Bilingual: beginner L2 speakers rely on L1 syllable programs; advanced L2 speakers can generally create separate units for syllables in the L2.

Long-Term Memory (K)
Monolingual: mental lexicon (stores conceptual and semantic information); knowledge of the external and internal world; syllabary.
Bilingual: mental lexicon2; L1 & L2 episodic memory; L2 declarative rules; syllabary.

Self-Perception (L)
Monolingual: more attentional resources for self-monitoring.
Bilingual: fewer attentional resources for monitoring, depending on the speaker's proficiency level.

2 Kormos (2006) points out that there is still disagreement whether semantic and conceptual levels of representation are independently stored in the L2. In her model, she posits that the mental lexicon stores conceptual information of both languages.


In the above-mentioned example, the speaker might not have much trouble selecting the best word. Now imagine that this semantic subdivision were larger in Dutch, or consider the case of false cognates, words that sound phonetically similar but have different meanings. These instances might cost more attentional resources, since selecting the correct word prior to articulation might take more time. This selection time may affect speakers' production either by slowing it down or through the manifestation of pauses and dysfluencies (Felker et al., 2019). Eventually, though, regardless of which language speakers are using, they will employ the same anatomical components of the articulatory mechanism to produce speech.

2.1.3. The Mechanism of Speech Production

Every utterance we produce is a joint effort of our articulators and a sophisticated underlying system responsible for sending neural impulses that activate the muscles involved in a complex and rapid change of articulatory settings, altering the shape of our vocal cavity and generating the air pressure necessary for the production of sound (Hixon et al., 2020). Although speech production might be at times perceived as an isolated phenomenon, the act of speaking encompasses a variety of intricate features, such as the individual’s neurophysiology, the physics involved in articulation and respiration for speech, and the cognitive mechanism used to monitor speech sounds (Marchal, 2009).

Speech production starts with a power source provided by the lungs through respiration, where the respiratory muscles generate a current of air. The airstream is then modified during phonation by the vocal folds at the larynx and led into the vocal tract. During articulation the phonatory airstream is shaped by the different configurations of the tongue, velum and lips, from which it surfaces as speech sounds. These sounds can be analyzed in physical terms, i.e. acoustically, in the resulting speech signal. In general, among the characteristics analyzed are the fundamental frequency (F0), auditorily correlated with pitch; formant structure, best associated with vowels; and amplitude, which we perceive as loudness. These acoustic characteristics are some of the speech-related features that help us understand speech; they can form patterns that may be linked to a specific meaning we might have previously stored in our phonological system. Moreover, speech production is a dynamic process taking place in the temporal domain. This enables analyses that provide information about the gradual transitions between the building blocks of speech, such as the natural transition from a vowel to a consonant, or from one syllable to the next (Pols, 2011).

Measuring dynamics in conversational speech is not a trivial task, since the speech signal contains variation stemming from, for instance, (i) the lack of clear boundaries between phonemes, syllables, and words, (ii) coarticulation, and/or (iii) phonological processes, such as reduction, insertion and deletion of phonemes. Similarly, measuring intensity presents its own challenges, due to the number of variables simultaneously influencing the amplitude of the speech signal: the inherent intensity of different phones, the lack of control in data collection (e.g. the distance between the speaker and the microphone affects the measured intensity value), and the speaker's emotional state, to name a few (Fry, 1979). However, studying amplitude over time, i.e. intensity dynamics, may capture linguistic and extra-linguistic information that measures of timing or intensity alone would fail to provide.

Additionally, the effect of language on speech dynamics has recently been demonstrated by Schwartz and Kaźmierski (2019). In a study of the acquisition of vowel dynamics by L1 Polish speakers of L2 English, they found evidence of a temporal reorganization of vocalic targets in English for Polish learners of this L2. The investigation revealed that vowel dynamics differed systematically between these languages: in English more formant movement is observed at the beginning of the vowel, whereas in Polish movement occurs at the end of the vowel. They also found an effect of proficiency on these measures: more proficient L2 English speakers showed a shift in formant movement, from back to front. A possible explanation of this effect was rooted in the idea that the dynamic patterns of vowels must somehow be represented in the phonological system.

The important lesson to be drawn from these studies is that differences at a higher level of the language production system – e.g. the formulation component of Kormos' (2006) model – and the influence of the L1 may affect L2 speech production. Next, before diving into the main theme of this thesis, intensity dynamics, it is necessary to fully understand the two concepts underlying it: sound intensity and speech dynamics.

2.2. Sound Intensity

Simply put, intensity is "the amount of energy present in a sound" (Cruttenden, 1997, p. 3). This energy is powered by a supply of air gathered in the lungs during inhalation and released during exhalation. More specifically, during inhalation the chest and lungs expand, increasing lung volume and reducing the pressure inside the lungs, which draws air in; during exhalation the thorax and lungs contract, reducing lung volume and expelling the inhaled air, which in turn raises the air pressure inside the lungs (Fry, 1979). According to Boyle's law (Marchal, 2009, p. 4), this follows from the inverse relationship between volume and pressure: for instance, a greater lung volume reduces the pressure inside the lungs. Thus, the amount of air inhaled influences overall intensity (Fry, 1979, p. 66): the more air we inhale, the higher the subglottal pressure in our lungs, which, upon uncontrolled release, will result in a higher amplitude of the speech wave. In physical terms, voice intensity is chiefly controlled by subglottal pressure, which has an exponential relationship with intensity (Isshiki, 1964), meaning that a large change in intensity can be achieved by a relatively small change in subglottal air pressure.
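Because intensity is conventionally reported on the logarithmic decibel scale, equal ratios of pressure amplitude correspond to equal steps in dB, so a modest multiplicative change in amplitude yields a sizeable change in level. The minimal sketch below shows the standard amplitude-to-level conversion; the helper name db_level is invented for illustration:

```python
import math

def db_level(amplitude, ref=1.0):
    """Level in decibels of a pressure amplitude relative to a reference."""
    return 20.0 * math.log10(amplitude / ref)

# Doubling the amplitude adds ~6 dB; a tenfold increase adds exactly 20 dB.
print(round(db_level(2.0), 2))   # 6.02
print(round(db_level(10.0), 2))  # 20.0
```

The same conversion underlies intensity contours in acoustic analysis software, which typically express the frame-by-frame energy of the signal in dB relative to a fixed reference.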


Although intensity seems to be mainly a result of subglottal pressure variation, the glottal and supraglottal regions also affect intensity during the modulation of speech. Intensity can be increased or decreased by raising or lowering the airflow resistance at the vocal folds. This resistance is determined by the velocity of the airflow coming from the lungs and the dimensions of the pathway it passes through; therefore, intensity varies depending on the configuration of the laryngeal structures above the glottis (Stevens, 2000, p. 27). The same can be said of the vocal tract, where its shape, material properties and the adjustment of its active articulators (i.e. tongue and lips) contribute to the variation of intensity. For example, the effect of tongue position on the intensity of phonemic segments can be seen by comparing the alveolar approximant /ɹ/ and the alveolar lateral approximant /l/ in English, where /ɹ/ has been shown to have a higher intrinsic intensity than /l/ (Parker, 2008). Although this is the case in English, the way that active articulators influence intensity seems to be language-specific (Gordon et al., 2012). In a cross-linguistic study, Kaland and Postma-Nilsenová (2014) reported differences in the intrinsic intensity of the same vowels produced by Dutch and English native speakers. Overall, the mean intrinsic intensity for all vowels produced by male speakers was around 30 dB lower in Dutch than in English.

Although the intensity of a segment produced in the same articulatory manner varies between languages, it is important to note that the vocal folds and the vocal tract cannot amplify the intensity of a sound; what they can do is provide either an open or obstructed path for the airstream from the lungs to be radiated. That is, the differences in degree of obstruction in the vocal tract will affect the radiation efficiency of the sound coming from the mouth (Zhang, 2016, p. 2628). Titze and Palaparthi (2018) have theorized and demonstrated with computational models that radiation efficiency may also depend on a speaker’s mouth and head sizes; they concluded that variation in intensity also seems to depend on the opening diameter of the mouth and anatomical shape of an individual’s head.


Titze and Palaparthi's (2018) finding added further evidence to a claim made over twenty years earlier: Summerfield (1992) stated that the size of the mouth opening is one of the factors determining the overall intensity of the speech stream (p. 76). That is, not just how open or obstructed the vocal tract is, but also the degree of mouth opening, which affects the rate of airflow through the vocal tract, was believed to determine the overall amplitude of an utterance. Put simply, Summerfield (ibid.) theorized that a fully open mouth allows a greater airflow to pass through the vocal tract than a half-open mouth. Consequently, the intensity of a sound, which depends on the amount of airflow passing through the oral tract, would be higher for a larger mouth opening than for a smaller one, which hinders the airflow passing through. Later, Summerfield's assumption received strong empirical support from research conducted by Chandrasekaran and colleagues (2009), which will be discussed in the following section.

Overall, the information presented above indicates that understanding what aspects of speech production affect voice intensity is not a simple task. More than just the broad statement that intensity is directly correlated to subglottal pressure, it has become clear that the speech apparatus, along with its active articulators and different configurations, seem to also affect sound intensity.

2.3. Speech Dynamics

As previously described, the speech sounds we hear are produced by the rapid movements of our articulators overlapping in time and forming a continuum of actions. It is because of the dynamic nature of speech that we can produce sounds encoding messages (Greenberg, 2006). Besides carrying linguistic information, speech dynamics also includes paralinguistic and extralinguistic information, such as the affective state of the speaker, and her/his physiological, anatomical, and behavioral characteristics (Dellwo et al., 2007).

In phonetics, dynamic features of speech are those related to the temporal dimension, providing a view of how articulatory movements change over time. A dynamic analysis of these motions involves their transitional parts between and within, for instance, words and syllables. These transitions were studied as early as 1927 in Raymond H. Stetson’s monograph Motor Phonetics. Stetson’s view of speech was already one focused on articulation. That is, he proposed that we hear movements made audible by voice instead of the acoustic sounds resulting from articulation (Stetson, 1927, p. 203). He reasoned that movements were the most important aspect of speech because even deaf people are able to articulate words without sound and understand words through “lip-reading”, thus rendering sound a non-essential part of speech since the communicative process may still happen without it (ibid., p. 192). However interesting his theory might be, Stetson neglected the fact that deaf people also use other cues to understand speech, and lip-reading is just one of them; body movements and context also aid in “soundless communication”. Still, his theory and experimental investigation of the respiratory movement of syllables did add some insight into how motor activity might be organized during speech. Ultimately, Stetson’s investigation paved the way for the modern motor control models, which try to explain articulatory movements as a function of neuromuscular coordination, sensory-motor information, and cognitive factors (Marchal, 2009, p. 15). Since this thesis focuses on speaker-specific characteristics of speech dynamics, I will refrain from explaining purely kinematic motor control models.
Instead, I will focus on models that primarily account for biomechanical factors influencing speech motor control, since these factors seem to be directly linked to speaker-specificity in speech production (for an extensive review of the current speech motor control models I recommend Parrell et al., 2019). Nevertheless, it is worth pointing out that the main lesson we can take from all these models is the existence of a complex interlinkage between nerves and muscles which are engaged during the speaking process.

Biomechanical models of speech production are able to imitate and simulate speech production based on speaker-specific characteristics derived from medical imaging technologies (Tang et al., 2017). Such models generally include the components of the vocal tract along with their muscle properties and shapes, the articulators’ kinematics (i.e., their position, velocity and acceleration), and the neural inputs and reflexes involved during speech production (Sanguinetti et al., 1998). These components and their characteristics, based on data from real speakers, are set into a simulation framework that synthesizes speech and attempts to offer an understanding of the mechanisms engaged during the production process. The main idea linked to this modeling approach is that physical patterns carrying linguistic information (phonetic patterns) are directly impacted by the physical phenomena underlying their production (Perrier et al., 2011). This notion implies that acoustic outputs resulting from the mechanisms involved during speech production may reflect anatomical and neuro-physiological features related to these mechanisms. With that in mind, it seems plausible to infer that transitional acoustic aspects of speech may also be related to the physical transition of the physiological mechanisms responsible for the creation of sound.

In fact, the relationship between intensity dynamics and articulatory movements responsible for mouth opening and closing has been demonstrated. As previously stated, Titze and Palaparthi (2018) and Chandrasekaran and colleagues (2009) revealed that the size of mouth opening is related to changes in intensity. While Titze and Palaparthi’s (2018) analysis of intensity was static in nature, i.e., it did not take movement in speech into consideration, Chandrasekaran and colleagues (2009) offered a dynamic observation, demonstrating that the amplitude envelope is closely related to the time course of opening and closing mouth gestures (ibid., p. 13).

They observed “robust correlations and close temporal correspondence between the area of the mouth opening (or inter-lip distance) and the auditory envelope” (Chandrasekaran et al., 2009, p. 2), through their analysis of audiovisual speech across different languages and different speech contexts. That is, for the examined languages, English and French, the audiovisual speech samples of read and spontaneous utterances exhibited a strong spatial and temporal correlation between mouth opening area and the acoustic envelope. To put it another way, at the same time that the mouth reached its largest opening area, the greatest amount of energy was displayed in the amplitude envelope. Particularly, their results suggested that relevant information related to this correlation is prominent in slower movements in time, indicating syllable structure (ibid., p. 7). Moreover, they found that the timing of lip movements varies depending on the articulatory context. For instance, their results showed that in vowel-consonant-vowel (VCV) combinations the velocity of the lips from open-to-close-to-open states is faster when the consonant is at the beginning of a word (ibid., p. 13). Finally, their data showed that greater mouth areas correspond to higher intensity values in the acoustic signal. Overall, they observed that the correlation between mouth area and amplitude envelope was strong in both languages.

Of particular interest to this thesis, their analyses suggested that even though the relationship between the temporal patterns of the mouth opening gesture and the amplitude envelope was strong, there was also a significant amount of intra- and inter-speaker variability present (Chandrasekaran et al., 2009, p. 5). To put it more simply, although the overall correlation between the mouth opening gesture and intensity is very strong, the size of this correlation varies between speakers. Although not hypothesized by these authors, this effect could be due to speaker-specific characteristics and the fact that no person is able to produce the exact same acoustic sound more than once. In fact, He and Dellwo’s (2017) study of inter-speaker variability in the temporal organization of intensity contours suggested that differences in the dynamics between speakers of Zürich German may indicate the influence of specific neurophysiological characteristics over articulatory movements. This means that the individual biological characteristics of a speaker might affect the opening and closing mouth gestures, which in turn would affect the intensity dynamics of the same utterance produced by different speakers.

2.4. Intensity Dynamics

Variations in intensity in the course of an utterance reflect increases and decreases in the amount of energy over time. He and Dellwo (2017) described variations in intensity over time, or intensity dynamics, as the velocity of increase and decrease in intensity between peaks and valleys. Following the authors’ manner of computation, Figure 2.1 provides a visualization of these concepts; in the two-syllable word “student” we can see two vocalic peaks, that is, places with a large amount of energy in a syllable. The point where intensity reaches its maximum value is the peak of the syllable (IP). Between two peaks is a place where the amount of energy is at its minimum relative to them. The point where intensity is at its lowest is called a trough or valley (IV). The speed of intensity decrease, between a peak and its following valley, and of intensity increase, between a valley and its successive peak, is defined respectively as negative and positive intensity dynamics.

Figure 2.1. The upper plot contains an oscillogram of the word “student” in teal and its amplitude envelope (superimposed in red). The lower plot illustrates the intensity curve, its peak (IP) and valley (IV) values, and the time points associated with them (tP and tV). These plots are based on the description of positive and negative dynamics by He and Dellwo (2017).

Positive dynamics is understood as the temporal displacement of acoustic energy that starts at the lowest intensity point (valley) in the acoustic signal and increases to the next highest intensity point (peak). These dynamics are calculated as how fast intensity linearly increases from a valley to its right-adjacent peak. Contrariwise, negative dynamics is understood as a displacement that starts at a peak point and decreases to its following valley point. Negative dynamics are thus calculated as how fast intensity linearly decreases from a peak to its right-adjacent valley. In Figure 2.1, negative dynamics is demonstrated in the intensity curve (lower plot) by the dark red secant line from IP to IV, and positive dynamics by the secant line from IV to IP.
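
To make these slope computations concrete, they can be sketched in a few lines of Python. This is an illustrative sketch of the calculation described above, not He and Dellwo’s (2017) actual script; the function names and example peak/valley values are invented.

```python
# Sketch of the dynamics measures described above: the slope of the
# intensity curve between a valley and its right-adjacent peak
# (positive dynamics), and between a peak and its right-adjacent
# valley (negative dynamics). Points are (time_s, intensity_dB) pairs.

def positive_dynamics(valley, peak):
    """Speed of intensity increase from a valley to the next peak (dB/s)."""
    (t_v, i_v), (t_p, i_p) = valley, peak
    return (i_p - i_v) / (t_p - t_v)

def negative_dynamics(peak, valley):
    """Speed of intensity decrease from a peak to the next valley (dB/s)."""
    (t_p, i_p), (t_v, i_v) = peak, valley
    return (i_v - i_p) / (t_v - t_p)

# Invented example: valley at 0.10 s (55 dB), peak at 0.22 s (70 dB),
# next valley at 0.40 s (52 dB).
pos = positive_dynamics((0.10, 55.0), (0.22, 70.0))  # 125.0 dB/s
neg = negative_dynamics((0.22, 70.0), (0.40, 52.0))  # -100.0 dB/s
```

Note that negative dynamics comes out as a negative number (intensity falls over time), so its magnitude, not its sign, expresses the speed of the closing gesture.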

Regarding changes in intensity dynamics, He and Dellwo (2017) associated positive dynamics with opening mouth gestures and negative dynamics with closing mouth gestures, based on previous evidence suggesting that changes in intensity vary as a function of mouth radius (Chandrasekaran et al., 2009; Titze & Palaparthi, 2018). Interestingly, their analyses of both types of dynamics (positive and negative) reflected a trend proposed over ninety years ago by Stetson (1928): that negative movements show greater variation between speakers than positive movements. Their combined reasoning for why these opposing movements differ is instructive. While Stetson (ibid.) hypothesized that positive muscle groups were responsible for purely starting a movement, He and Dellwo (2017) postulated that positive dynamics contain linguistic information essential to communication. They hypothesized that positive dynamics call for a more controlled mouth opening gesture to reach the presumed articulatory state of a phonetic segment (i.e. the phonetic target). Hence, the smaller variation of positive movements seems to be due to the necessity of being understood while speaking. On the other hand, the causes for greater variation in negative movements seem to be more nuanced. Essentially, He and Dellwo (ibid.) attributed the increased variation in negative dynamics to reduced articulatory control. That is, once the phonetic target has been reached, a speaker may reduce control over the articulatory gesture. Stetson’s (1928) hypothesis for the possible causes of variation in negative movements is more intricate and involves not just a movement necessary to stop the ongoing movement, but also a phase where the preparation for a new movement takes place.
Without going deeper into the processes happening after the desired target has been reached, it becomes clear that both studies agree that negative movements seem to vary much more than their positive counterpart, and that such variation could indicate information related to the biological characteristics of a speaker. In the following section I will present empirical evidence for this theory, demonstrated by biomechanical models of speech production.

2.5. Speaker-specificity

When we hear two people talking, we might notice that one seems to have a higher pitch than the other, or that one sounds like s/he has a constantly stuffed nose. This is one of many examples showing us that individuals differ in speech production, and that such differences are perceptibly exposed in speech. In the acoustics of an utterance these apparent variations may be a reflection of intrinsic factors, such as the individual anatomical and neurocognitive characteristics of a speaker, and of extrinsic factors like a speaker’s social context (Weirich, 2015, pp. 416–417). In the scope of this thesis are the intrinsic characteristics of a speaker, which rest on the premise that no human being is exactly like another, not even twins (e.g., San Segundo & Yang, 2018; Zuo & Mok, 2015).

It has been proposed that two major types of physical variation in the speech waveform carry speaker characteristics. The first, static variability, reflects the physical variations of the speech organs. The second, dynamic variability, is based on behavioral variations (Kitamura & Akagi, 2007). The latter form of variability is most commonly illustrated by coarticulation, a process where articulatory movements of adjacent phonetic segments influence each other during their transitions. Speaker-specificity in coarticulation is calculated as the degree of gestural overlap between segments. For instance, a speaker who speaks more slowly may have a lower degree of gestural overlap, thus showing less coarticulatory behavior than another (Dellwo et al., 2007).

Vowels offer another type of dynamic variation conveying speaker-specific information. Zuo and Mok (2015) investigated speaker similarities in the formant dynamics of Mandarin–Shanghainese bilingual twins and found that, although these individuals have the same anatomical structure allowing them to produce the same canonical phonetic segment, individual choices drove their speech production, resulting in differences in the formant dynamics of each twin. Moreover, the researchers found that differences between twins are also affected by language dominance; i.e., there were more differences between twin pairs when they spoke in their non-dominant language (Zuo & Mok, 2015).

Other than behavioral differences, anatomical differences between speakers, such as the structure of the vocal tract (including muscles, bones, and nerves), may influence speech dynamics. The effect of vocal tract structure on the motor control strategies of speech has been modelled using biomechanical models, in 2D by Perrier and Winkler (2015) and in 3D by Stavness and colleagues (2013). The 2D models generated included the neck and head structures of different speakers. Both body parts were shown to affect the length and direction of the tongue muscle fibers. Consequently, different patterns of articulatory trajectories were observed between different models. This variability was attributed to anatomical differences in the tongue and palatal shapes of different speakers and was reflected in the acoustic domain (Perrier & Winkler, 2015). Similarly, Stavness and colleagues’ 3D models demonstrated that inter-speaker differences in speech production were due to variation in muscle anatomy around the mouth. Because of these anatomical differences, distinctive motor control strategies would potentially be adopted by different speakers. That is, the same motor goal would be achieved with different configurations of the speech apparatus (Stavness et al., 2013, p. 887).

Evidence for different motor control strategies was provided by earlier research, which, analogous to the results of the 2D and 3D models, revealed that different speakers employed different motor control strategies in the production of the same utterance. De Nil and Abbs (1991) found a wide variety of mouth closing sequences involving the lower lip (LL), upper lip (UL) and jaw (J). Although mainly interested in the influence of speaking rate on the peak velocity of articulatory movements during bilabial closure, they found that at different speaking rates there is a large inter-speaker variation in the patterns of the mouth closing gesture and the frequency with which they were used. Contrary to previous assumptions, their results showed that the pattern UL–LL–J was not the most common closing pattern employed. In fact, a variety of sequencing patterns was observed across speakers and speaking rates, such as LL or J leading the sequence, or the peak velocity of two or three articulators occurring simultaneously (ibid., p. 847). Furthermore, the frequency with which these patterns were employed also differed between speakers; for instance, the pattern UL–J–LL was used predominantly by one speaker while LL–J–UL was the preference of another.

Ultimately, the studies mentioned in this section illustrate the influence of biomechanical characteristics on the acoustic patterns of speech, and how these characteristics, along with articulatory decisions, could explain speaker-specific information found in the speech signal. Next, I will provide a brief overview of a process, employed in the forensic field, in which anatomical differences between speakers are fundamental.

2.6. Forensic Speaker Comparison

Speaker recognition is rooted in our ability to recognize the voices of those familiar to us just by listening to them (Hollien, 1990, p. 189). Based on this perceptual mechanism, phoneticians thought that conducting voice comparisons in the forensic field could be an asset to the justice system. As a result, two forms of recognition processes are now carried out for forensic purposes; namely, speaker comparison (formerly referred to as speaker identification) and speaker verification. The former is a comparative analysis where speech from an unknown source is compared against samples of a known speaker (Foulkes & French, 2012, p. 2). The latter is a form of recognition of known samples, to verify whether the known speaker “matches” her/his own samples (Hollien, 1990, p. 190).

In the field of forensic phonetics, speaker comparison is conducted by experienced phoneticians who are tasked to establish whether a speech sample of unknown origin matches a reference speech sample, where the speaker’s identity is known (Rose, 2002, p. 24).

Sometimes an important piece of evidence in a case is a telephone interception, and a way to assess the strength of this evidence is by conducting a speaker comparison. All parties involved in the legal process, the prosecution, the defense, and the judge, benefit from the analysis conducted by a specialized forensic phonetician, since they are interested in the origin of the disputed sample; i.e., whether the suspect produced the speech sample or not. Besides being performed through an acoustic-phonetic analysis by an experienced forensic phonetician, speaker comparison can be done with the help of automatic speaker recognition (ASR) systems. Since ASR falls outside the scope of this thesis, I will refrain from explaining the intricate methodology involved. However, it is worth pointing out that this technology is mostly used as a support tool for the analyses conducted by the forensic phonetician (French & Stevens, 2013). Moreover, unlike speech recognition, which focuses on the content of an utterance, and speaker verification, which is provided with samples of a talker who wants to be recognized, forensic speaker comparison deals with sometimes uncooperative speakers, non-contemporary speech samples, and distorted signals, all factors that make the acoustic analysis of disputed samples challenging.

Research in the field of forensic speech science has demonstrated that acoustic features may reflect speech characteristics shared by a specific group of speakers. This is demonstrated when performing speaker comparisons, where commonalities and differences between the speaker in question and other speakers sharing the same socio-linguistic background are exposed (French & Stevens, 2013). To assess the significance of similarities between speech samples, forensic phoneticians make use of a background database that represents the relevant population to which the suspect belongs (Morrison et al., 2018). Features that are shared by a large number of speakers in this database are not as significant as those which are uncommon (French & Stevens, 2013). Furthermore, to contribute speaker-specific information, features need to vary little within a speaker but largely between speakers (Foulkes & French, 2012, p. 3). Hollien (1990) describes several features used in forensic phonetic analysis, such as fundamental frequency (F0), vowel formants, vocal intensity, prosody, general voice quality, and other speech characteristics (p. 197). Of these features, long-term average F0 and vowel formants are considered routine, since it has been established that they can provide information about a speaker’s articulatory and phonological patterns (Foulkes & French, 2012, p. 11). However, regarding the usefulness of a feature, voice quality has been considered the most useful parameter for speaker discrimination, whereas vowel formants, although widely deemed a strong parameter, have also been considered “rarely insightful” for discriminating speakers (Gold & French, 2011, p. 302).

Intensity per se is not considered a useful discriminative feature (other than in the context of formants). A reason for this might be the significant within-speaker changes in intensity due to, for instance, a possible Lombard effect4 during elicitation (Hollien, 1990, p. 198). However, considering the temporal organization of intensity provides a different way to use this acoustic feature. Intensity dynamics is an aspect of speech rhythm, which has been considered a useful parameter for speaker discrimination (Gold & French, 2011, p. 302). Therefore, investigating between-speaker variation in intensity dynamics may contribute significantly to understanding this rhythmic aspect of speech and offer insight into whether it could be useful in forensic applications, especially given the previously discussed notion that this feature may be highly dependent on individual anatomical characteristics.

4 A compensatory effect stemming from speakers’ tendency to increase vocal intensity when they cannot hear themselves well, for instance in noisy environments.

2.7. Research Questions and Hypotheses

As explained earlier, the underlying mechanisms of speech production in the L1 and in the L2 are both similar and divergent (Levelt, 1989, 1999; Kormos, 2006). While the similarities lie in the mechanical apparatus used during the production of speech (Hixon et al., 2020; Marchal, 2009), the differences relate to the complex mechanisms taking place before articulation, which are believed to be affected by the different languages (Flege, 1995; Best et al., 2001; Best & Tyler, 2006; Escudero, 2009).

Moreover, the way speakers use their speech apparatus is believed to be highly influenced by their anatomical characteristics (San Segundo & Yang, 2018; Titze & Palaparthi, 2018; Zuo & Mok, 2015; Fry, 1979; Chandrasekaran et al., 2009; Dellwo, 2007; Kitamura & Akagi, 2007). Together with language-specific constraints (Kemmerer, 2019; Schwartz & Kaźmierski, 2019; Bowerman & Choi, 2001), these differences are directly reflected in the acoustic signal, since acoustic features are differently affected by a combination of both characteristics and constraints. For example, the way our tongue, mouth and lips may influence the intensity of the vowel /i/ has been demonstrated to differ between languages (Gordon et al., 2012) and between speakers of the same language (Perrier & Winkler, 2015). Likewise, the formant dynamics of the same vowel have exhibited differences in movement between languages (Schwartz & Kaźmierski, 2019) and between speakers with the same linguistic background (He et al., 2019). Similarly, intensity dynamics, the focus of this thesis, have been demonstrated to vary between speakers (He & Dellwo, 2018). However, that investigation was conducted in only one language. Therefore, this thesis sets out to determine whether this dynamic feature also varies between native speakers of Dutch and across languages, namely L1 Dutch and L2 English. More specifically, this study seeks to clarify:

RQ1: Whether measures of intensity dynamics vary between native Dutch speakers using their native language.

RQ2: Whether this between-speaker variability is also present when these speakers use English, a second language they are proficient in.

RQ3: Whether language has an effect on intensity dynamics; and which measure(s) of intensity dynamics show the most between-speaker variation across languages.

Regarding the first two research questions, I hypothesize that between-speaker variation in intensity dynamics will exist regardless of the language used by the speaker, since it stems from speaker-specific characteristics (He & Dellwo, 2018). This means that the productions of native Dutch speakers will vary between individuals both in their L1 and in the L2.

Regarding the third research question, I hypothesize that, analogous to previous studies of speech dynamics (Schwartz & Kaźmierski, 2019), language will have an influence on intensity dynamics. Finally, also based on previous results (He & Dellwo, 2018; He et al., 2019), I hypothesize that negative measures of intensity dynamics will display the largest amount of variation across speakers in both languages.

3. METHODOLOGY

3.1. Corpus

The present study uses the Database LUCEA corpus (see Orr & Quené, 2017 for further details), from which 59 female native Dutch speakers (age range of approx. 17–26 years, with no reported speech and hearing disorders) were selected. Participants were recruited at University College Utrecht (UCU) and recorded in their first semester in 2012 and 2013, and at the end of the academic year in 2013, 2014 and 2015. Furthermore, the English level of all participants was collected with the aid of a language background questionnaire, in which speakers had to report their age of acquisition and degree of exposure to the language (Orr et al., 2011). The Dutch participants reported being proficient in English by providing the results of a formal proficiency exam, since the entry requirement for English at UCU is at the level of a proficient user of the language (minimally comparable to level B1 of the Common European Framework of Reference for Languages).

The speakers were asked to perform six speaking tasks, of which two were selected for this study, specifically two two-minute-long prepared informal monologues on a free topic in English (L2) and in Dutch (L1). Although prepared, these monologues can be considered spontaneous samples of speech, since they were not elicited from reading material and were not directed in any way. However, they differ from completely spontaneous (i.e., unprepared) speech because the topic was prepared in advance. In both languages most speakers talked about the same topic; however, some of them simply continued their monologue in the other language. While the former speakers’ samples resulted in more homogeneous data, the same cannot be said about the latter.

Each participant was simultaneously recorded via eight microphones in a quiet furnished office with at least one facilitator seated at the opposite side from the speaker (Orr & Quené, 2017). For this study the selected recordings were the ones captured by the microphone closest to the speaker (Sennheiser Headset HSP 2ew; 44.1 kHz; 16 bit), since this microphone gathered the most optimal speech sample with little variation in the distance between the microphone and the speaker’s mouth, which is of extreme importance when analyzing envelope-based measures. The monologues were manually annotated by two annotators and checked by a third at four levels: language spoken, speech type, speech and silence intervals, and an orthographic transcription of the utterances.

To accurately measure interactions of amplitude and duration, essential to characterize intensity dynamics, a subset of 51 speakers was created, because the quality of eight of the 59 recordings was too poor for this research. In this subset two more levels were annotated by the author of this thesis; namely, stretches of continuous speech without any interruptions, which were selected manually to ensure precision, followed by an automatic segmentation of these stretches into smaller chunks. The nature of these chunks will be described in the following section.

3.2. Data Preparation

Prior to speech chunking, the waveforms of the two prepared monologues in the L1 and the L2 were automatically extracted and saved as separate audio files. This step was taken because the complete files were too large, resulting in excessive lag during data processing, and because normalization of the data needed to be done per language. Normalizing the L1 and L2 data separately ensures that, for each language sample, no cross-linguistic influence would affect the analyses of the extracted measures of intensity dynamics.

Following Tilsen and Arvaniti’s (2013) method for analyzing speech rhythm using envelope-based measures, the chunking of the spontaneous speech data was achieved by obtaining uninterrupted speech segments between 1.4 s and 1.6 s long. The authors stated that too short or too long stretches of speech might fail to provide accurate information about the rhythmic characteristics of speech. For instance, durations shorter than 1 s do not contain enough syllables to provide rhythmic information; likewise, durations longer than 3 s may show a lot of rhythmic variation due to a mixture of different speech tempos (ibid., p. 629). Adopting durations between 1,400 ms and 1,600 ms could, therefore, reduce unwanted variation in the speech data. Furthermore, this strategy resolves the issue of time normalization of the different sentences, since chunk durations are nearly uniformly distributed around 1,500 ms with a variation of ±100 ms from this value (ibid., p. 629), and it reduces any variation that may be caused by different speech tempos.
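
The chunking idea can be made concrete with a small sketch that splits a continuous stretch of speech into chunk durations drawn uniformly from the 1.4–1.6 s interval. This is a hypothetical reimplementation of the rationale, not the script actually used in this thesis; the function name and parameters are invented for illustration.

```python
import random

def chunk_stretch(total_s, lo=1.4, hi=1.6, seed=0):
    """Split a continuous stretch of `total_s` seconds into (start, end)
    chunks whose durations fall in [lo, hi] s; any leftover shorter
    than `lo` is discarded, mirroring Tilsen & Arvaniti's (2013)
    rationale of near-uniform chunk durations around 1.5 s."""
    rng = random.Random(seed)
    chunks, t = [], 0.0
    while total_s - t >= lo:
        # Draw a duration in [lo, hi], capped by the remaining speech
        d = min(rng.uniform(lo, hi), total_s - t)
        chunks.append((t, t + d))
        t += d
    return chunks

# A 10-second uninterrupted stretch yields six or seven ~1.5 s chunks
chunks = chunk_stretch(10.0)
```

Because every chunk duration stays within ±100 ms of 1.5 s, no further time normalization of the chunks is required downstream.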

The resulting speech signals were then prepared in Praat (Boersma & Weenink, 2014), following the initial stages of He and Dellwo’s (2017) methodology. First, the DC bias was removed by subtracting the mean amplitude from the signal; then a higher-sampled amplitude envelope was created by low-pass filtering the full-wave rectified speech signal at 10 Hz [Hann filter, roll-off = 6 dB/octave]. Next, an intensity object was created in Praat (using the command To Intensity…, with Minimum pitch = 100 Hz; Time step = 0.0 s; Subtract mean = True). This command squares and windows the signal before creating the intensity object (Kaiser-Bessel window: β = 20; side lobe attenuation ≅ –190 dB; length: 32 ms). The result of this series of signal manipulations is the amplitude envelope of each signal and its intensity object containing intensity point values over time.
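
The envelope-extraction steps (DC removal, full-wave rectification, low-pass smoothing at roughly 10 Hz) can be approximated outside Praat as in the sketch below. Note this is a simplified stand-in: it smooths with a Hann-shaped moving average rather than reproducing Praat's exact filter, so its output will differ slightly from the procedure described above.

```python
import numpy as np

def amplitude_envelope(x, fs, cutoff_hz=10.0):
    """Approximate amplitude envelope: remove the DC bias, full-wave
    rectify, then smooth with a Hann window whose length corresponds
    to the cutoff frequency (a simplification of low-pass filtering)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # remove DC bias
    rectified = np.abs(x)                  # full-wave rectification
    win = np.hanning(int(fs / cutoff_hz))  # ~100 ms smoothing window
    win /= win.sum()                       # unit-gain moving average
    return np.convolve(rectified, win, mode="same")

# Synthetic example: a 1 kHz carrier amplitude-modulated at 4 Hz;
# the slow 4 Hz modulation survives in the envelope
fs = 16000
t = np.arange(0, 1.0, 1 / fs)
signal = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
env = amplitude_envelope(signal, fs)
```

The smoothed curve tracks the slow, syllable-scale fluctuations in energy while discarding the fast carrier oscillation, which is exactly what the 10 Hz low-pass in the Praat procedure achieves.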

It is important to mention that the normalization of the intensity object in this study differs from that of He and Dellwo (2017) due to the nature of our data. While their study elicited carefully controlled read speech, this thesis uses spontaneous speech samples, which require a different method of normalization. For this reason, the values in the intensity curve were linearly normalized to the range [0.01, 1] using the formula

𝐼′(𝑓) = (1 − 0.01)/(𝑚𝑎𝑥 − 𝑚𝑖𝑛) × [𝐼(𝑓) − 𝑚𝑖𝑛] + 0.01,

where 𝐼′(𝑓) and 𝐼(𝑓) refer to the normalized and original intensity values at frame index f; max and min refer to the maximum and minimum values of the original intensity curve; and 1 and 0.01 are the new maximum and minimum values of 𝐼′(𝑓). This procedure is analogous to the one employed by He and colleagues (2019) for the normalization of the first formant (F1). According to the researchers, the normalized curve retains only the information about the trajectory of the F1 curve that can be associated with speaker-specific articulatory gestures (ibid., p. 210).
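For illustration, the normalization formula can be expressed directly in code. This is a sketch assuming the intensity curve is available as a NumPy array; the function name is mine, not from the thesis.

```python
import numpy as np

def normalize_intensity(curve, new_min=0.01, new_max=1.0):
    """Linear normalization I'(f) = (new_max - new_min)/(max - min)
    * (I(f) - min) + new_min, mapping the curve into [new_min, new_max]."""
    old_min, old_max = curve.min(), curve.max()
    return (new_max - new_min) / (old_max - old_min) * (curve - old_min) + new_min

# A curve spanning 40-70 dB is mapped so that 40 dB -> 0.01 and 70 dB -> 1.0
curve = np.array([40.0, 55.0, 70.0])
normalized = normalize_intensity(curve)
```

Because the rescaling is linear, the shape of the intensity trajectory is preserved while absolute level differences between recordings are removed.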

Because of the large amount of (spontaneous) data used in this study, the detection of peaks and valleys, points where the envelope reaches maximum and minimum intensity values, was done semi-automatically. Instead of placing these points between pre-established syllable boundaries, an algorithm was crafted to automatically detect candidate peak and valley points by surveying the amplitude envelope. The algorithm iterates through all points in the envelope, gathering their intensity values over time. To determine whether a point is a syllabic peak or valley, the algorithm stores each point value and compares it with the following one: if the next value is larger, the current point is stored as a candidate valley; if the next value is smaller, the current point is stored as a candidate peak. After these prospective peaks and valleys have been stored, adjacent pairs of peak and valley points are checked against each other: only pairs whose intensity difference exceeds a predetermined threshold are kept as valid syllable peaks or valleys. Adopting a minimum intensity difference of 5 dB between peak and valley points filters out potential outliers, i.e., erroneous point placements where a syllable contains a "shallow" valley5 between the true maximum and a smaller peak (Figure 3.1). The output of this process is a table containing the time in seconds and the intensity in decibels of the peak and valley points of each syllable in a continuous stretch of speech. The time values in this table were then used to automatically place peak and valley demarcation points on the intensity contour.

5 The idea of a "shallow" valley can be interpreted as the result of articulatory motion between the components within a syllable, as opposed to a valley which has a significantly lower value compared to its adjacent peaks.
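As an illustrative reconstruction (not the thesis's actual script), the candidate detection plus 5 dB thresholding might look as follows in Python. The neighbour-comparison details are my own simplification, and manually checking the output would still be required.

```python
def detect_extrema(times, values, threshold_db=5.0):
    """Collect local turning points of the intensity contour as candidate
    peaks/valleys, then keep only candidates that differ by at least
    `threshold_db` from an adjacent candidate of the opposite kind."""
    candidates = []
    for i in range(1, len(values) - 1):
        if values[i - 1] < values[i] >= values[i + 1]:
            candidates.append((times[i], values[i], "peak"))
        elif values[i - 1] > values[i] <= values[i + 1]:
            candidates.append((times[i], values[i], "valley"))
    kept = []
    for k, (t, v, kind) in enumerate(candidates):
        neighbours = candidates[max(0, k - 1):k] + candidates[k + 1:k + 2]
        if any(abs(v - nv) >= threshold_db for _, nv, _ in neighbours):
            kept.append((t, v, kind))
    return kept

# Toy contour (dB): a shallow 52 dB bump between two 50 dB valleys is
# rejected, while the 60 dB and 62 dB peaks survive the 5 dB threshold.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
contour = [50, 60, 50, 52, 50, 62, 50]
extrema = detect_extrema(times, contour)
```

The shallow peak is exactly the kind of spurious point Figure 3.1 illustrates: without the threshold it would be stored as a syllabic peak.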


Together with the speech signal, these points were manually checked to ensure that their placement was correct. In case of incorrect placement, points were readjusted manually.

Figure 3.1. Oscillogram (in black) and amplitude envelope (superimposed in blue) of the utterance “this week”. A shallow valley (v) and smaller peak (p) are represented in red. These could be falsely selected by the algorithm as true values, if they were not checked against the predetermined threshold. True peak (p) and valley (v) points are represented in bold and are teal colored.

3.3. Data Extraction

The intensity values of peaks and valleys were obtained at each of the demarcation points from the intensity curve using cubic interpolation. This method is preferable to simply taking the nearest sampled value because it provides true continuity of the motion trajectory passing through each peak and valley point. Next, positive dynamics (𝜈𝐼[+]) were computed as the rate at which the intensity level increased from a valley to its succeeding peak: 𝜈𝐼[+] ≝ (𝐼𝑃 − 𝐼𝑉)/(𝑡𝑃 − 𝑡𝑉), where 𝐼𝑃 and 𝐼𝑉 refer to the intensity values at the peak time 𝑡𝑃 and valley time 𝑡𝑉. Similarly, negative dynamics (𝜈𝐼[−]) were measured as the rate of intensity decrease from a peak to its right-adjacent valley: 𝜈𝐼[−] ≝ |𝐼𝑉 − 𝐼𝑃|/(𝑡𝑉 − 𝑡𝑃); here the absolute value of the intensity difference is taken, since only the magnitude of the change is of interest for the analyses (He & Dellwo, 2017: 490).
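A minimal sketch of these definitions, using hypothetical extrema (times in seconds, intensities in dB):

```python
def intensity_dynamics(extrema):
    """Compute positive dynamics v_I[+] = (I_P - I_V)/(t_P - t_V) over each
    valley-to-peak rise and negative dynamics v_I[-] = |I_V - I_P|/(t_V - t_P)
    over each peak-to-valley fall, from (time, intensity, kind) tuples."""
    pos, neg = [], []
    for (t1, i1, k1), (t2, i2, k2) in zip(extrema, extrema[1:]):
        if k1 == "valley" and k2 == "peak":
            pos.append((i2 - i1) / (t2 - t1))     # rising slope in dB/s
        elif k1 == "peak" and k2 == "valley":
            neg.append(abs(i2 - i1) / (t2 - t1))  # magnitude of fall in dB/s
    return pos, neg

# Hypothetical alternating extrema: rises of 20 dB over 100 ms (~200 dB/s)
# and a fall of 15 dB over 150 ms (~100 dB/s)
ext = [(0.00, 45.0, "valley"), (0.10, 65.0, "peak"),
       (0.25, 50.0, "valley"), (0.35, 70.0, "peak")]
pos, neg = intensity_dynamics(ext)
```

Larger values in either list indicate faster intensity change, which is the raw material for the summary measures described next.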

Additionally, the distributions of positive and negative dynamics in a chunk of spontaneous speech were characterized by calculating the mean, standard deviation, and Pairwise Variability Index, or PVI (Grabe & Low, 2002), of both types of dynamics. The mean and standard deviation of each dynamics type capture the central tendency and the overall dispersion of a speaker's intensity dynamics, respectively. The PVI conveys the amount of variability between successive syllables by computing and averaging the differences between sequential intervals in an utterance (Grabe & Low, 2002). Following He and Dellwo's (2017) notation, MEAN_𝑣I[−], STDEV_𝑣I[−], and PVI_𝑣I[−] refer to negative dynamics, and MEAN_𝑣I[+], STDEV_𝑣I[+], and PVI_𝑣I[+] to positive dynamics. These measures were stored in separate data subsets corresponding to the language spoken, namely English (EN) and Dutch (NL). Each subset contained all positive and negative dynamics measures per chunk per speaker.
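A sketch of these per-chunk summaries, where the PVI is taken as the raw PVI, i.e. the mean absolute difference between successive values, applied here to hypothetical dynamics values in dB/s:

```python
import statistics

def summarize_dynamics(values):
    """Per-chunk summary of one dynamics type: MEAN, sample STDEV, and
    raw PVI = mean |d_k - d_(k+1)| over successive values (Grabe & Low, 2002)."""
    return {
        "MEAN": statistics.mean(values),
        "STDEV": statistics.stdev(values),
        "PVI": statistics.mean(abs(a - b) for a, b in zip(values, values[1:])),
    }

# Four successive positive-dynamics values from one hypothetical chunk
stats = summarize_dynamics([180.0, 220.0, 200.0, 260.0])
```

Computed per chunk and per dynamics type, these three numbers yield the six measures (MEAN, STDEV, and PVI for 𝑣I[+] and 𝑣I[−]) analyzed in the next section.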

3.4. Data Analyses

The extracted measures of positive and negative intensity dynamics in each language subset were analyzed to answer the three research questions of this thesis. RQs 1 and 2 can be answered by the same methods since they embody the same sub-questions: (i) Do measures of intensity dynamics encode different types of information? (ii) How much of the between-speaker variability is explained by each measure of intensity dynamics? (iii) Is there a significant between-speaker effect on each measure of intensity dynamics? The last research question embodies two sub-questions: (iv) Does language influence speakers’ measures of intensity dynamics? and (v) How much of the variability in measures of intensity dynamics is owed to within-speaker versus between-speaker variation?

According to He and Dellwo (2017), positive and negative measures may encode different types of information, which could be established if they separate into two independent factors (p. 491). Therefore, factor analysis (FA) was employed on all measures of intensity dynamics in both language subsets to test whether positive and negative dynamics indeed load onto separate factors.
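To illustrate the logic (not the thesis's actual analysis): in a small simulation where three measures share one latent factor and three share another, a two-factor FA with varimax rotation recovers the separation. This sketch assumes scikit-learn is available; all names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis  # assumed dependency

# Simulate six measures: three driven by a latent "positive dynamics"
# factor and three by a latent "negative dynamics" factor.
rng = np.random.default_rng(1)
n = 500
f_pos = rng.normal(size=n)
f_neg = rng.normal(size=n)

def noisy(factor):
    return factor + 0.3 * rng.normal(size=n)

X = np.column_stack([noisy(f_pos), noisy(f_pos), noisy(f_pos),
                     noisy(f_neg), noisy(f_neg), noisy(f_neg)])

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(X)
loadings = fa.components_  # shape (2, 6): rows = factors, columns = measures
```

In this simulation the first three columns load predominantly on one factor and the last three on the other, which is the pattern that would support treating positive and negative dynamics as encoding different information.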
