
Tilburg University

Chinese Tones: Can You Listen With Your Eyes?

Han, Yueqiao

Publication date:

2021

Document Version

Publisher's PDF, also known as Version of Record
Link to publication in Tilburg University Research Portal

Citation for published version (APA):

Han, Y. (2021). Chinese Tones: Can You Listen With Your Eyes? The Influence of Visual Information on Auditory Perception of Chinese Tones. [s.n.].

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Take down policy


CHINESE TONES: CAN YOU LISTEN WITH YOUR EYES?

Yueqiao Han

Invitation

It is my great pleasure to invite you to attend the public defence of my PhD thesis entitled

Chinese Tones: Can You Listen with Your Eyes?
The Influence of Visual Information on Auditory Perception of Chinese Tones

which will be held on Friday, June 18, 2021 at 10:00 in the Aula of Tilburg University, Warandelaan 2, Tilburg.


Chinese Tones:

Can You Listen with Your Eyes?

The Influence of Visual Information on

Auditory Perception of Chinese Tones


Financial support was received from the China Scholarship Council (CSC).

ISBN: 978-94-6423-261-5
Printed by: ProefschriftMaken | www.proefschriftmaken.nl
Cover design and layout: Bregje Jaspers | ProefschriftOntwerp.nl
Layout inspired by: http://www.martijnwieling.nl/


Chinese Tones:

Can You Listen with Your Eyes?

The Influence of Visual Information on

Auditory Perception of Chinese Tones

Dissertation

submitted in order to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. W.B.H.J. van de Donk, to be defended in public before a committee appointed by the college for


Promotor:

prof. dr. M.G.J. Swerts (Tilburg University)

Copromotores:

dr. M.B.J. Mos (Tilburg University)
dr. M.B. Goudbeek (Tilburg University)

Members of the doctoral committee:

prof. dr. A. Chen (Utrecht University)
prof. dr. Y. Chen (Leiden University)


Acknowledgments

There is never any way to thank all the people whose physical or spiritual presence made this dissertation possible, but I'm going to give it a try anyway.

There are two parties that I would like to acknowledge first. One is my sponsor, the China Scholarship Council (CSC), without whose financial support my journey would have been crippled at the very beginning. The other is my supervisors: Marc, Maria, and Martijn. Although it has been said by so many PhD students, and it is kind of a cliché, I still have to repeat after them: I wouldn't have made it to the end without their help and support. I don't believe they had co-supervised many students before me, and I'm the lucky one who got to have all three of them, but I would advise them to do this more often in the future, because I have experienced it as such a fruitful and powerful combination. Marc, you always know how to help me figure out an interesting introduction and compose an insightful discussion. You are so good at telling a story in a paper by threading together all the scattered information that it's not surprising you have already published two books. Researcher, writer, and musician (you are the best saxophone player I know): how cool to own even one of these titles, let alone all of them!

Maria, your comments are always the first to enter my mailbox. You are the one who is never shy to put your finger on the fatal flaw in my arguments and speak your mind. You are the one who made me sweat when I could not give a satisfactory answer, and you are also the one who motivated me to get better and stronger. You are a master of managing time and work. I learned a lot from you.

And then there is Martijn, the savior. You are the one who joined the team when I was desperate for hands-on data-analysis experience. You are joyful and kind: you always cheered me up when I was worried, and you definitely saved my ass in the data-analysis odyssey. I honestly don't think I would have finished this without your wonderful help!

I would like to extend my thanks to the members of my committee: prof. dr. Aoju Chen, prof. dr. Yiya Chen, prof. dr. Denis Burnham, prof. dr. ing. Hansjörg Mixdorff, and prof. dr. Jean Vroomen. They took the time to read my papers, gave elaborate comments, and gave their approval of my work. I am very grateful for your work.


There are many people who have helped me along this journey, and I want to single out some of them. Ad Backus, thank you for your comments on my earliest paper and my completed dissertation, and for your moral support while I was writing my dissertation. Thiago, thank you for working together with me on my last paper; your expertise made that one happen. Yan Gu, Veronique, Giovana, Mariana, Loes, and Emmelyn, thank you for reading my dissertation and preparing me for the defense; it was very useful and much appreciated. Nadine, Alexandra, and Mariana, thank you for being wonderful friends and helping me organize my graduation session.

Another thank you goes out to all my kind colleagues in DCC for treating me well and creating such a wonderful workplace: Lauraine, Jacqueline, Alex, Rein, Naomi, David, Tess, Ruben, Debby … My apologies for not naming everyone. I am grateful to have made close friends with many of you. In addition, I would like to thank the colleagues I worked with on the PhD council and TiPP (Tilburg PhD Platform). What a pleasant and fruitful experience to have alongside my PhD track!

I am greatly indebted to the participants who took part in my studies and to the people who helped me during the data collection. My special thanks go out to Marlon Titre and his colleagues at Fontys for providing me with all the help I needed to collect data for my study.

Even with the professional support of all these people, I could not have delivered my dissertation as well as it turned out without the support and encouragement of my family. I would like to take this opportunity to share my deepest gratitude to my husband, Jaap. He has always helped me in every way he can. Without his constant proofreading and extraordinary graphing skills, my dissertation would not have been as good as this final version. (And yes, he even proofread these acknowledgments.) I am also grateful for the presence of my angel, Marin. Thank you for being such an amazing daughter. You are sweet, smart, and growing up so fast. You give me extra strength and determination to finish this dissertation, because being your mom is the most empowering thing I could receive. There is no way this journey could have been completed without the support of my dearest sisters and brother. They have always supported me with everything they could. To them, I am always their youngest sister, and their love never stopped, even though we are on different sides of the globe. Most importantly, I have to thank my parents, who silently accepted my decision to leave my homeland to pursue a study they never really understood. From the beginning they gave me every opportunity to follow my own path, and I will always be grateful to them. I know my father is proud of me. I also know my mother couldn't be happier with everything I have accomplished. Thank you, thank you so much.


Contents

1 General Introduction
1.1 Tone in Chinese
1.2 Visual information in tone
1.3 Elements and variables in current studies
1.3.1 Contextual factors
1.3.2 Individual differences between perceivers
1.4 Research question
1.5 Methodology
1.6 Overview

2 Effects of Modality and Speaking Style on Mandarin Tone Identification by Tone-naïve Listeners
2.1 Introduction
2.1.1 The effect of modality on tone perception
2.1.2 The effect of speaking style on tone perception
2.1.3 Variation between speakers and between tones
2.1.4 The current study
2.2 Methodology
2.2.1 Participants
2.2.2 Stimuli
2.2.3 Procedure
2.3 Results
2.3.1 The effects of modality and speaking style
2.3.2 The effects of speaker and tone
2.4 Discussion and conclusion

3 Mandarin Tone Identification by Tone-naïve Musicians and Non-musicians in Auditory-visual and Auditory-only Conditions
3.1 Introduction
3.1.1 Tone perception and musical ability
3.1.2 Tone perception and visual information
3.2 Materials and methods
3.2.1 Participants
3.2.2 Materials and stimuli
3.2.3 Procedure
3.3 Results
3.3.1 Overall tone perception
3.3.2 Individual tone perception
3.3.3 A more fine-grained look at musicality
3.3.4 Musicality and tone perception
3.4 Discussion
3.5 Conclusion

4 Relative Contribution of Auditory and Visual Information to Mandarin Chinese Tone Identification by Native and Tone-naïve Listeners
4.1 Introduction
4.2 Methodology
4.2.1 Participants
4.2.2 Stimuli
4.2.3 Procedure
4.3 Results
4.3.1 How would a McGurk effect work at the tone level for native speakers of Chinese?
4.3.2 How much do visual cues affect tone-naïve listeners in identifying Mandarin Chinese tones?
4.3.3 What are the roles of congruent and incongruent visual information in tone perception?
4.4 Discussion and conclusion

5 Automatic Classification of Produced and Perceived Mandarin Tones on the Basis of Acoustic and Visual Properties
5.1 Introduction
5.2 Corpus construction
5.3 Perception study
5.4 Machine Learning methods
5.4.1 Data
5.4.2 Features

6 General Discussion and Conclusion
6.1 Main findings
6.2 Theoretical implications
6.2.1 Audio-visual tone perception
6.2.2 Individual differences between perceivers
6.2.3 A theory of tone perception


Chapter 1

General Introduction

This dissertation is a study on the linguistic use of tone. More than half of the languages spoken in the world (60%-70%) are so-called tone languages. Unlike most European languages, which rely primarily on phonological distinctions between consonants and vowels to distinguish word meanings, tone languages such as Mandarin Chinese additionally use changes in tone to mark lexical distinctions. Because of its unfamiliarity, tone is known to be difficult to learn for Western speakers. This dissertation investigates possible ways to improve the perception of tone for tone-naïve speakers. It sets out to examine the factors that potentially promote efficient perception of Mandarin Chinese tone. More specifically, it looks into the contribution of visual information (in particular, potential cues displayed by a speaker's face) to Mandarin Chinese tone perception for tone-naïve perceivers, as well as that of other factors, such as differences in the speaking style of the speaker (natural vs. teaching speaking style) and the musicality of the perceivers (musicians vs. non-musicians). These variables are investigated in a Mandarin Chinese tone identification task. Moreover, this dissertation also contains a computational study that compares the relative contribution of acoustic and visual information to tone perception and tone classification. In this chapter, I sketch some background information about tone, especially tone in Mandarin Chinese, introduce the research questions addressed in the dissertation, and give an overview of the studies reported in this thesis, including some considerations of the relevant methodological aspects.

1.1 Tone in Chinese


question or a confirmation. However, such usage differs from the way tone is exploited in a language like Mandarin Chinese, since the core meaning of the word in American English does not change.

Tone languages make up nearly 70% of the world's languages, and they are extremely common in Africa (e.g., Yoruba), East and South-East Asia (e.g., Thai), and Central America (e.g., Mixtec) (Yip, 2002). Most European languages are not tonal, but there are exceptions, such as Swedish, Norwegian, Serbo-Croatian, and a few Dutch Limburgian dialects, in which tone can also be used to mark lexical contrasts.

Of all those tone languages, Chinese is spoken by the largest population by far (total users in all countries in 2015: 1,107,162,230¹). Under the general banner of Chinese, eight major language/dialect groups are subsumed: Mandarin, Wu, Yue (Cantonese), Xiang (Hunan), Gan (Jiangxi), Kejia (Hakka), and Southern and Northern Min. Although they do share a great deal in common, such as syntax, no two of these groups are mutually intelligible; the mutual unintelligibility is mostly due to differences in phonology (Bao, 1990, 1999). Mandarin originated in North China and is spoken across most of northern and southwestern China. The Mandarin dialect group is spoken by more people and over a larger geographical area than any other major dialect group (65% of the Chinese population in 2017, as estimated by Ethnologue).

In contemporary linguistics, Mandarin tones are often described in terms of pitch height and pitch shape. Accordingly, there are four main distinctive Mandarin tones², conventionally numbered 1 to 4: tone 1: high-level (5-5); tone 2: mid-rising (or mid-high-rising; 3-5); tone 3: low-dipping (also low-falling-rising or mid-falling-low-rising; 2-1-4); and tone 4: high-falling (5-1) (Chao, 1930)³.
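The Chao-numeral description above can be captured as a small lookup table. The following sketch (the names and data structure are my own illustration, not from the dissertation) derives each tone's overall pitch direction from its contour:

```python
# Chao tone numerals: 1 = lowest pitch, 5 = highest.
# The four main Mandarin tones, as described in the text above.
MANDARIN_TONES = {
    1: {"label": "high-level",   "contour": (5, 5)},
    2: {"label": "mid-rising",   "contour": (3, 5)},
    3: {"label": "low-dipping",  "contour": (2, 1, 4)},
    4: {"label": "high-falling", "contour": (5, 1)},
}

def direction(contour):
    """Overall pitch direction: compare the end point with the start point."""
    start, end = contour[0], contour[-1]
    if end > start:
        return "rising"
    if end < start:
        return "falling"
    return "level"

for number, tone in MANDARIN_TONES.items():
    print(f"tone {number}: {tone['label']} {tone['contour']} -> {direction(tone['contour'])}")
```

Note that the simple start-vs-end comparison labels tone 3 "rising" overall, even though its defining feature is the dip; the full contour, not just the endpoints, is what distinguishes the tones.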

1.2 Visual information in tone

Tone is an acoustic phenomenon: listeners do not need to see speakers to be able to understand them (e.g., a conversation can take place via the phone). Interestingly, the more tonal the language, the greater listeners' reliance on auditory information. Sekiyama and Burnham (2008) explained this by noting that tonal languages (and semi-tonal languages, such as Japanese) have fewer phonemes (consonants, vowels, and syllables) and a simpler syllabic and phonological structure than English; because of this, lip-read information may be used less in speech/tone processing.

¹ https://www.ethnologue.com/language/cmn

² There is a fifth tone, a neutral tone, which functions on the grammatical level and cannot appear on single-syllable words.

³ The numerical substitute has been commonly used for tone contours, with a numerical value

Mandarin Chinese, which has an elaborate tonal system, has been shown to have clear acoustic correlates of tone, notably pitch and pitch contour. In particular, fundamental frequency (F0) patterns (both height and contour) and the direction of pitch can distinguish the four main distinctive Mandarin tones. Other acoustic variables, such as duration and amplitude, can also be perceptually informative (Chen & Massaro, 2008; Ryant, Yuan, & Liberman, 2014), but to a lesser extent than fundamental frequency. Therefore, acoustic information is essential in (Mandarin) tone perception, and accordingly listeners (at least native listeners) rely greatly on it when it is available.
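To make the role of F0 concrete, here is a rough, self-contained sketch of estimating F0 from a short voiced frame by autocorrelation. This is my own simplification for illustration (the signal, sampling rate, and search range are invented), not the dissertation's analysis pipeline:

```python
import numpy as np

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Crude autocorrelation-based F0 estimate for one voiced frame."""
    frame = frame - frame.mean()                        # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)             # plausible lag range
    lag = lo + np.argmax(ac[lo:hi])                     # strongest periodicity
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr                      # one 40 ms frame
level_tone_frame = np.sin(2 * np.pi * 220 * t)          # steady 220 Hz, like a level tone
print(estimate_f0(level_tone_frame, sr))                # prints an estimate close to 220 Hz
```

Tracking such per-frame estimates across a syllable would yield the F0 contour (height plus shape) that distinguishes the four tones.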

At the same time, the way we perceive speech can be influenced by visual factors: it is a multisensory/multimodal process. What we hear can be affected by what we see (Campbell, Dodd, & Burnham, 1998; Han, Goudbeek, Mos, & Swerts, 2018, 2019; Rosenblum, 2008). For instance, seeing the face of the speaker normally helps the listener perceive speech better (Bailly, Perrier, & Vatikiotis-Bateson, 2012; Hirata & Kelly, 2010; Sumby & Pollack, 1954), especially in noisy environments (e.g., Burnham, Lau, Tam, & Schoknecht, 2001; Mixdorff, Hu, & Burnham, 2005b). Similarly, seeing the face of a speaker also aids hearing impaired listeners decoding the auditory speech signal (Desai, Stickney, & Zeng, 2008; Smith & Burnham, 2012).


consciously or unconsciously, to produce the different melodic configurations (Rosenblum, 2008; Swerts & Krahmer, 2008; Zheng, Hirata, & Kelly, 2018).

A number of studies have explored the nature and locus of the visual cues in tone production and perception, and have revealed fairly reliable configurations of visual cues related to tone acquisition, even though the exact visual cues have not yet been clearly defined. For instance, strong correlations between head movements and F0 were observed by Yehia, Kuratate, and Vatikiotis-Bateson (2002). Similar visual cues relating to more general movements of the head and/or eyebrows have previously been reported to function as correlates of larger-scale prosodic structures in other languages, for example, quick movements of the head that co-occur with pitch accents (Burnham et al., 2006; Krahmer & Swerts, 2007; Vatikiotis-Bateson & Yehia, 1996).

Although there is visual information when speakers produce tones, whether and how this information is picked up by perceivers has attracted scholars' attention for the past two decades. Burnham et al. (2000) tested native identification of the six Cantonese tones in auditory-only, visual-only, and auditory-visual modes; this was the first empirical study on the cue value of visual information for lexical tone. Their participants' tone perception in the visual-only condition was significantly above chance level, which provides evidence that there is indeed visual speech information for lexical tone. Since then, more studies on visual and audio-visual tone perception have been conducted. Burnham et al. (2001) conducted a same-different discrimination study on Cantonese tones, in which native Thai and Australian English speakers also performed significantly better than chance under visual-only conditions. Also, Chen and Massaro (2008) found that native Mandarin speakers identified lexical tones from visual information alone at above-chance levels. Visual facilitation of tone identification has especially been found for speech in noise, for both Mandarin (Mixdorff et al., 2005b) and Thai (Mixdorff et al., 2005a; Burnham et al., 2015).


they found positive evidence that this type of motion is important for Thai tone production. Further research will be vital to describe visual tone cues more precisely, in both perception and production (Reid et al., 2015).

In general, visual speech information is known to benefit speech perception. For instance, an early study by Sumby and Pollack (1954) showed that seeing the speaker's face improves intelligibility for listeners. More specifically, for the perception of tone, visual facilitation mainly appears under difficult listening conditions (e.g., for hearing-impaired listeners or with a noise-masked auditory signal) (see Campbell et al., 1998, and Bailly et al., 2012, for comprehensive collections of studies). The extent to which auditory-visual information facilitates or improves tone identification compared to auditory-only information (i.e., the superiority of bimodal over unimodal performance) differs widely across individuals and their experience (Burnham et al., 2015; Grant & Seitz, 1998). Furthermore, the benefits of visual/facial information for tone perception depend strongly on context, and in particular on the availability of a clear and reliable acoustic signal. Where such a signal is available, extra visual information may actually distract perceivers instead of facilitating their tone perception, since they are reluctant to use visual information when acoustic sources are available and reliable. For example, Burnham et al. (2001) found that, with clean speech, Australian English speakers identified Cantonese words that differed only in tone better in the auditory-only (AO) condition than in the auditory-visual (AV) condition (where they also had access to lip and face movements).

Similar results also appeared in another study concerning visual cues in tone perception conducted by Mixdorff, Hu, and Burnham (2005b). In their study, native Mandarin speakers identified Mandarin tones in various auditory and/or visual conditions (clean, reduced, and masked audio-only/audio-visual). They found that adding visual information in the clear and devoiced auditory conditions was not particularly helpful. However, tone perception improved significantly in the babble-noise masked condition. The authors speculated that the absence of a facilitating effect for visual information on tone identification may be due to a ceiling effect for native speakers in clear audio conditions: auditory information suffices for quick and correct identification of tones, unless this information is compromised, that is, under low speech-to-noise ratios, in which case visual information is beneficial. Smith and Burnham (2012) found that tone-naïve listeners outperformed native listeners in the visual-only condition in a task of Mandarin tone discrimination, additionally suggesting that visual information for tone may be underused by normal-hearing tone language perceivers.


suprasegmental level (i.e., tone), and more obvious on the non-native perceivers than on the native perceivers.

1.3 Elements and variables in current studies

While the way tones are acquired by listeners has attracted some scholarly attention (e.g., Burnham et al., 2000; 2001; Francis, Ciocca, Ma, & Fenn, 2008; Hao, 2012; So & Best, 2010), detailed knowledge of the factors that promote efficient acquisition is lacking. The current studies investigate several factors that are potentially important for the acquisition of tones, but have not yet been studied in a systematic way, or have not been combined in an integrated approach. These factors can be categorized into two groups: (1) contextual factors, such as the auditory, visual or audio-visual modality in which speech is presented, and speaking style of a speaker who is producing speech in a natural or teaching manner; and (2) individual characteristics, related to differences between tone-native and tone-naïve perceivers, and to perceivers with and without musical backgrounds.

1.3.1 Contextual factors


relative contribution of auditory and visual information was compared during Mandarin Chinese tone perception with congruent and incongruent auditory and visual materials for speakers of Mandarin Chinese and speakers of non-tonal languages. We further explore the contribution of visual cues by adding them to a computational model for tone classification that has so far been based on conventional acoustic features only (Chapter 5). By comparing automatic and human classification of Mandarin Chinese tones, the representativeness of our models as models of tone learning is assessed.

The second contextual factor concerns possible adjustments speakers of the native language make when they talk to learners of their language. There is evidence that speakers adapt their speaking style to their audience and to the communicative context. A well-known example of this is infant-directed speech (IDS), where adults adapt their speaking style in the presence of the children (Burnham, Kitamura, & Vollmer-Conna, 2002; Fernald & Kuhl, 1987; Kuhl et al., 1997). IDS has been hypothesized to aid the learning process (Kuhl et al., 1997; Thiessen, Hill, & Saffran, 2005). Similarly, a native speaker who is addressing a non-native listener may adapt their speech to improve learning and understanding. In a teaching setting, they may, for example, be more inclined to speak slowly and in a more hyperarticulated manner (Bradlow & Bent, 2002; Smiljanić & Bradlow, 2007, 2009). Assuming that a teaching style that attends to the needs of learners may also make tonal contrasts more salient, my dissertation also aims to study whether a hyperarticulated speaking style helps learners to perceive tonal information (Chapter 2, 3 and 5).

1.3.2 Individual differences between perceivers

Visual information is mainly relevant for native speakers, and, in general, speakers of tone languages focus more on auditory information than speakers of non-tone languages. This then raises questions about its use by, and usefulness for, tone-naïve people, who (have to) acquire the tones. Therefore, I chose to focus mainly on tone-naïve participants in our tone identification experiments, in order to trace the tone acquisition process (Chapters 2, 3, 4, and 5) and to avoid the ceiling effects on tone identification that would typically arise with native speakers. Most importantly, my studies on tone-naïve speakers can contribute to the field of tone learning for second/foreign-language speakers.


Milovanov, Huotilainen, Välimäki, Esquef, & Tervaniemi, 2008; Milovanov, Pietilä, Tervaniemi, & Esquef, 2010). This dissertation aims to explore whether musical expertise also helps tone-naïve listeners to correctly identify Mandarin Chinese tones (Chapter 3). Because of their extensive musical training, musicians are particularly sensitive to the acoustic structure of sounds (i.e., frequency, duration, intensity, and timbre). This sensitivity has been shown to influence their perception of pitch contours in spoken language (Schön et al., 2004), but the extent to which musicians are affected by the presence of (exaggerated) visual information during speech perception has remained largely unexplored. Given their extensive training in analyzing the acoustic signal, they might not be as inclined to use visual cues as non-musicians and might therefore benefit less from added visual information. We hypothesize that musicians may still benefit from the added visual information for Mandarin tone identification, but that this contribution is likely smaller than for non-musicians.

1.4 Research questions

The aim of this dissertation is to study the value of visual information (over and above acoustic information) in Mandarin tone perception for tone-naïve perceivers, in combination with other contextual and individual factors. Moreover, this dissertation explores the relative strength of acoustic and visual information in tone perception and tone classification. The next four chapters present studies aiming to answer these research questions. Generally, Chapters 2 and 3 report on empirical studies investigating to what extent tone-naïve perceivers are able to identify tones in isolated words, and whether they can benefit from (seeing) the speaker's face, from a hyperarticulated speaking style, and from their own musical experience. Chapter 4 deals with whether there is audio-visual integration at the tone level in native speakers of Mandarin Chinese and tone-naïve perceivers (i.e., we explored perceptual fusion between auditory and visual information). Chapter 5 studies the acoustic and visual features of the tones produced by native speakers of Mandarin Chinese; computational models based on acoustic, visual, and combined acoustic-visual features are constructed to automatically classify Mandarin tones. More detailed research questions are presented in each empirical chapter.

1.5 Methodology


studies were included in each empirical chapter of this dissertation. Although the tones produced by native Mandarin Chinese speakers were mainly used to create the experimental stimuli for the perception test, and perception is the focus of this thesis, the assumption here is that by studying both production and perception we can examine what perceivers pick up from what the speaker produces acoustically and visually. Native Mandarin Chinese speakers were instructed to produce individual words, while the participants were asked to identify which tone they thought they had heard and/or seen (Chapters 2, 3, and 4). The produced stimuli give us information about what a speaker does (the acoustic information they convey, the visual cues they employ), and the perception results tell us to what extent this information is relevant for the perceivers (Chapters 4 and 5). An additional reason to look into both production and perception is that previous work has shown there is not necessarily a direct relation between speech production and perception (e.g., Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997; Casserly & Pisoni, 2010; Baese-Berk & Samuel, 2016). For instance, some acoustic variation, while systematic and potentially a good classifier, may not work well in perception because it is below a perceptual threshold. Similarly, there is no direct relationship between tone production and perception (Wang, Spence, Jongman, & Sereno, 1999; Wang, Jongman, & Sereno, 2003).

Second, we recruited (Mandarin Chinese) tone-native and tone-naïve participants with various backgrounds (the latter mainly Dutch). Data from the former group were needed to set a baseline for the perception experiment and to compare performance in employing visual cues (Chapters 2 and 4). Questionnaires were used to select and group the participants according to their language background (tone-native vs. tone-naïve) and musical behavior (musician vs. non-musician). These questionnaires are commonly used in research on language learning (Chapters 2 and 3), and their reliability and validity have been established. Third, to assess the importance of acoustic and visual features in the process of tone classification and tone perception, Chapter 5 made use of Machine Learning (ML), a subarea of Artificial Intelligence that aims to learn how to categorize data by using patterns and inference instead of explicit instructions. By comparing automatic and human classification of Mandarin Chinese tones, the representativeness of our models as models of tone learning can be established.


the validity of the experimental stimuli on the one hand, and are the basis of classifying tones on the other hand.

A recurring method for analysing the data obtained in the perception experiments is repeated-measures ANOVA (Chapters 2 and 3), which is widely used in psycholinguistics and serves the goals of these experiments well. The results were also analysed with mixed-effects models in the R software environment (Baayen, Davidson, & Bates, 2008) when more variables needed to be included in the analysis (Chapter 4). Since the last study (presented in Chapter 5) was concerned with a classification problem, logistic regression was used to classify the produced and perceived tones. In addition, we relied heavily on analyses of confusion matrices to gain insight into how perceivers had categorized the various tones.
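As an illustration of the classification step, here is a minimal, hedged sketch of training a logistic-regression tone classifier and inspecting its confusion matrix. The feature profiles and data are synthetic stand-ins invented for this example; they are not the dissertation's corpus, feature set, or results:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Toy acoustic features per token: (mean F0 in Hz, F0 slope in Hz/s, duration in s).
# One invented profile per tone, loosely inspired by the contour descriptions.
profiles = {
    1: (250, 0.0, 0.30),    # high-level
    2: (200, 60.0, 0.35),   # mid-rising
    3: (160, -20.0, 0.40),  # low-dipping
    4: (240, -80.0, 0.28),  # high-falling
}

X, y = [], []
for tone, (f0, slope, dur) in profiles.items():
    for _ in range(50):  # 50 noisy tokens per tone
        X.append([f0 + rng.normal(0, 10),
                  slope + rng.normal(0, 10),
                  dur + rng.normal(0, 0.02)])
        y.append(tone)
X, y = np.array(X), np.array(y)

clf = LogisticRegression(max_iter=1000).fit(X, y)
cm = confusion_matrix(y, clf.predict(X), labels=[1, 2, 3, 4])
print(cm)  # rows: true tone, columns: predicted tone
```

Off-diagonal cells of such a matrix show which tone pairs a classifier, or, for the perception data, the human listeners, confuse most often.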

1.6 Overview

Having introduced the research questions and some of the methodological aspects of this dissertation, I can now present an overview of the remaining chapters. Chapters 2 to 5 are self-contained texts (i.e., they all have their own abstract, introduction, and discussion section) and are based on articles either published in (Chapters 2, 3, and 4) or submitted to (Chapter 5) peer-reviewed journals. Therefore, some overlap in text between individual chapters, and between those chapters and this introduction, is unavoidable. The author of this thesis was the main researcher in all studies presented here.

The study presented in Chapter 2 investigates the effect of visual cues (comparing audio-only with audio-visual presentations) and speaking style (comparing a natural speaking style with a teaching speaking style) on the perception of Mandarin tones by non-native listeners, looking both at the relative strength of these two factors and their possible interactions. Native speakers of a non-tonal language were asked to distinguish Mandarin Chinese tones on the basis of audio-only or audio-visual materials. To include natural variation, the experimental stimuli were recorded from four different speakers. Participants' responses and reaction times were recorded, and the proportion of correct responses and average reaction times were reported. Continuing the exploration of the potential factors in tone perception, Chapter 3 is concerned with the effects of musicianship of the participants


to participants with or without musical experience. The Goldsmiths Musical Sophistication Index (Müllensiefen, Gingras, Musil, & Stewart, 2014) was used to measure the musical sophistication of each participant. A linear regression analysis was conducted to find out whether a specific musical ability/skill as measured by the subscales of the Gold-MSI is related to successful tone identification. Since the effects of the two independent variables might vary among tones, the effects for each tone were subsequently assessed individually in the study.
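As an illustration of the kind of linear regression analysis mentioned here, the snippet below fits a simple least-squares line to invented data relating a hypothetical Gold-MSI subscale score to identification accuracy. It is a toy sketch of the technique, not the analysis actually reported in Chapter 3.

```python
# Toy sketch: ordinary least-squares fit of accuracy on a single
# (hypothetical) Gold-MSI subscale score.
def linreg(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx  # slope and intercept

scores = [10, 20, 30, 40]            # invented subscale scores
accuracy = [0.40, 0.50, 0.60, 0.70]  # invented proportions correct
slope, intercept = linreg(scores, accuracy)
print(slope, intercept)  # ~0.01 and ~0.3 for this perfectly linear toy data
```

A positive slope in such a fit would correspond to the hypothesis that higher musical sophistication goes with better tone identification.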

Chapter 4 focuses on comparing the relative contribution of auditory and visual information during Mandarin Chinese tone perception. Two questions were investigated in this chapter: first, whether a McGurk effect can also be discerned at the tone level in native speakers of Mandarin Chinese; and second, how visual information affects tone perception for native speakers and non-native (tone-naïve) speakers. To answer these questions, various tone combinations of congruent (AxVx) and incongruent (AxVy) auditory-visual materials (10 syllables with 16 tone combinations each) were constructed and presented to native speakers of Mandarin Chinese and speakers of non-tonal languages. Accuracy, defined as the percentage of correct identifications of a tone based on its auditory realization, was used as the dependent variable. In general, there are two hypotheses. The first is that (native and tone-naïve) participants mainly depend on auditory information when they have to identify Mandarin Chinese tones; both groups are therefore expected to identify the congruent stimuli more accurately than the incongruent ones. The second is that (congruent) visual information facilitates speech perception, especially for perceivers who lack comprehensive knowledge of the language (tone-naïve participants), while this additional value of visual cues is less important for native participants. Furthermore, when participants are presented with the incongruent experimental materials, there are three possible outcomes of how the cues from the different modalities are combined: non-integration, integration, and attenuation.
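The congruent/incongruent design described above can be sketched by crossing the four auditory tones with the four visual tones; this is an illustrative enumeration of the stimulus space, not the actual stimulus-construction code.

```python
# Sketch of the Chapter 4 stimulus design: every auditory tone x crossed
# with every visual tone y gives 16 AxVy combinations per syllable,
# of which 4 are congruent (x == y) and 12 incongruent (x != y).
from itertools import product

tones = [1, 2, 3, 4]
combos = list(product(tones, tones))  # (auditory, visual) pairs
congruent = [(a, v) for a, v in combos if a == v]
incongruent = [(a, v) for a, v in combos if a != v]

print(len(combos), len(congruent), len(incongruent))  # 16 4 12
print(10 * len(combos))  # 10 syllables x 16 combinations = 160 dubbed stimuli
```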

Chapter 5 zooms in further on the relative importance of


these questions, four Mandarin speakers were videotaped while they produced ten syllables with four Mandarin tones (i.e., 40 words, in two styles: natural and teaching), totaling 160 stimuli (the same stimuli as in Chapter 2). These audio-visual stimuli were subsequently presented to 43 tone-naïve participants in a tone identification task (the same data as from the non-musicians in Chapter 3). Basic acoustic and visual features were extracted. We used various machine learning techniques to identify the acoustic and visual features that are most important for classifying the tones. The classifiers were trained on produced tone classification (given a set of auditory and visual features, predict the produced tone) and on perceived/responded tone classification (given a set of features, predict the corresponding tone as identified by the participant).
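To give a flavour of what such classifiers do, the sketch below trains a small multinomial logistic regression (the classifier type used in Chapter 5) by stochastic gradient descent on invented two-dimensional features; "mean pitch" and "pitch slope" are illustrative stand-ins for the acoustic and visual features actually extracted, and class labels 0–3 stand for tones 1–4.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def train(X, y, n_classes=4, lr=0.5, epochs=300):
    # Multinomial logistic regression via per-sample gradient descent.
    dim = len(X[0])
    W = [[0.0] * (dim + 1) for _ in range(n_classes)]  # weights + bias
    for _ in range(epochs):
        for x, t in zip(X, y):
            xb = x + [1.0]
            p = softmax([sum(w * v for w, v in zip(W[c], xb))
                         for c in range(n_classes)])
            for c in range(n_classes):
                g = p[c] - (1.0 if c == t else 0.0)  # cross-entropy gradient
                for j in range(dim + 1):
                    W[c][j] -= lr * g * xb[j]
    return W

def predict(W, x):
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(Wc, xb)) for Wc in W]
    return scores.index(max(scores))

# Invented feature prototypes [mean pitch, pitch slope] per tone:
# tone 1 high level, tone 2 rising, tone 3 low/dipping, tone 4 falling.
protos = [[0.8, 0.0], [0.5, 1.0], [0.2, 0.0], [0.6, -1.0]]
random.seed(1)
X, y = [], []
for tone, (mu, slope) in enumerate(protos):
    for _ in range(20):
        X.append([mu + random.gauss(0, 0.05), slope + random.gauss(0, 0.1)])
        y.append(tone)
W = train(X, y)
print([predict(W, p) for p in protos])
```

In the dissertation's setup, the same machinery is trained twice: once with the produced tone as the target, and once with the participant's response as the target.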

Finally, Chapter 6 provides a general discussion of the main findings, and


Chapter 2

Effects of Modality and Speaking Style on Mandarin Tone Identification by Tone-naïve Listeners

Abstract. Although the way tones are acquired by second or foreign language learners has attracted some scholarly attention, detailed knowledge of the factors that promote efficient learning is lacking. In this article, we look at the effect of visual cues (comparing audio-only with audio-visual presentations) and speaking style (comparing a natural speaking style with a teaching speaking style) on the perception of Mandarin tones by non-native listeners, looking both at the relative strength of these two factors and their possible interactions. Both the accuracy and reaction time of the listeners were measured in a task of tone identification. Results showed that participants in the audio-visual condition distinguished tones more accurately than participants in the audio-only condition. Interestingly, this varied as a function of speaking style, but only for stimuli from specific speakers. Additionally, some tones (notably tone 3) were recognized more quickly and accurately than others.*

2.1 Introduction

For many second language learners, the ultimate goal of learning a language is to be able to communicate like a native speaker. Acquiring a new language entails that a whole gamut of linguistic structures needs to be learned, including grammatical, lexical and phonological characteristics, as well as pragmatic aspects of language use. This paper focuses on acquiring

* This chapter is based on: Han, Y., Goudbeek, M., Mos, M., & Swerts, M. (2018). Effects of


specific phonological properties of a language, namely tones in Mandarin Chinese. Chinese tones serve to distinguish word meanings. For instance, if the Mandarin Chinese syllable /ma/ is produced with a rising tone, it means “hemp”, whereas it means “scold” when produced with a falling tone. Obviously, tonal information represents a crucial aspect of this language’s sound structure, and needs to be learned by a second language learner.

While the way tones are acquired by listeners has attracted some scholarly attention (Burnham et al., 2000, 2001; Francis et al., 2008; Hao, 2012; So & Best, 2010), detailed knowledge of the factors that promote efficient learning is lacking. In the current study, we investigate two factors that are potentially important for the acquisition of tones, but that have not yet been studied in a systematic way or combined in an integrated approach. First, we explore the effect of visual cues in a speaker's face on tone identification by tone-naïve listeners. In most of our daily interactions, we both hear and see our conversation partners: whenever visual information is available, observers use it to decode what they hear (Davis & Kim, 2004; Hazan et al., 2006; Navarra & Soto-Faraco, 2007). However, while it has been shown that speech perception in general is affected by such visual information, the added value of facial expressions for tone perception is context-dependent. It has even been argued that the extra visual information from the face may actually distract listeners from accurate tone perception, since listeners are reluctant to use visual information when acoustic sources are available and reliable (Burnham et al., 2001). To gain insight into the possible added value of visual information, and to identify under what circumstances listeners use visual information, we test whether learners who can see the speakers outperform those who only have access to auditory information (see section 2.1.1).

The second factor concerns the effect of speaking style. Listeners usually encounter tones in two different speaking styles: a natural style, representing the way native speakers speak in most of their daily interactions; and a teaching style, that is, the hyperarticulated manner in which teachers/native speakers address non-native speakers in a teaching context. Assuming that a teaching style that attends to the needs of learners may also make tonal contrasts more salient, the second goal of our study is to examine whether a teaching style helps learners to perceive tonal information (see section 2.1.2).


2.1.1 The effect of modality on tone perception

Generally speaking, speech perception is multimodal, which means that it involves information from more than just the auditory modality. Whenever visual information is available, observers use it to decode what they hear (Bailly et al., 2012; Burnham et al., 2001; Calvert et al., 2004; Campbell et al., 1998; Massaro, 1998). Visual information is provided by movements of the lips, face, head and neck. The impact of such cues has been demonstrated with the classical McGurk effect (McGurk & MacDonald, 1976): observers perceived an auditory [ba] paired with a visual [ga] as “da” or “tha”. This shows that auditory speech perception changes with simultaneously presented incongruent visual information of the speaker’s face. In other words, access to visual information about the source of speech can have clear effects on speech perception, as it alters the perception of speech.

Various studies have already investigated the impact of visual information on speech perception by linking facial cues and gestures (head and/or hand) to speech comprehension. The results have demonstrated the supportive role of visual information for speech perception in face-to-face interaction (Hirata & Kelly, 2010; Sueyoshi & Hardison, 2005; Yehia, Kuratate, & Vatikiotis-Bateson, 2002). One early study, conducted by Sumby and Pollack in 1954, examined the contribution of visual factors to oral intelligibility by manipulating the presence or absence of a supplementary visual display of a speaker’s facial and labial movements. Subjects were instructed to select the words they heard (or they thought they had heard) from a furnished list. When the speakers could both be seen and heard, the speech was considered to be more intelligible, in particular when the speech-to-noise ratio was low (i.e., in noisy contexts) or the number of alternatives listeners had to choose from was limited. These results suggest that supplementary visual observation of the speaker improves the intelligibility of oral speech in specific situations. In general, congruent visual information during articulation facilitates speech perception (Cutler & Chen, 1997; Hallé, Chang, & Best, 2004; Ye & Connine, 1999).


benefit achieved (i.e., the superiority of bimodal performance in relation to unimodal performance) differs widely across individuals (Grant & Seitz, 1998).

Burnham et al. (2000), for example, investigated the perception of Cantonese tones with Cantonese native speakers, who were either phonetically-trained or phonetically-naïve. They found no difference in performance between audio-only and audio-visual conditions and while listeners performed worse in the visual-only condition, they still performed above chance. Interestingly, phonetically-naïve listeners outperformed phonetically-trained listeners, which Burnham et al. attribute to attentional and learning processes. Another study concerning visual cues in tone perception was conducted by Mixdorff et al. (2005b). In their study, native Mandarin speakers identified Mandarin tones in various auditory and/or visual conditions (clear, reduced, and masked audio-only/audio-visual). They found that, in the clear and devoiced auditory conditions, adding visual information was not particularly helpful (similar to the findings of Burnham et al., 2000). However, tone perception was significantly improved in the babble-noise masked condition. The absence of a facilitating effect for visual information on tone identification may be due to a ceiling effect for native speakers in clear audio conditions: auditory information suffices for quick and correct identification of tones, unless this information is compromised, that is, under low speech-to-noise ratios, in which visual information is beneficial.

While all of the studies above have used native listeners as participants, some studies have included non-native participants. For example, Burnham et al. (2001) compared tonal and non-tonal native speakers (Thai and Australian English) in their ability to discriminate Cantonese tones. They found that both groups performed significantly above chance, even in visual-only conditions, confirming that there is visual information in the face for tone discrimination. However, they did not find an advantage of audio-visual stimuli for Australian English speakers in either clear or noisy auditory conditions. Thai native speakers, however, did benefit from audio-visual presentation in noisy conditions. In this study, the researchers manipulated the speech-to-noise ratio by adding a certain level of background noise. They concluded that visual cues were more salient for the Thai listeners in degraded auditory settings. These findings suggest that perceivers who have difficulty accessing the auditory material because of noise, hearing impairment or because it is not in their native language might benefit the most from the supplementary visual information when listening to tones.


implant-simulated audio-only, cochlear implant-simulated audio-visual and visual-only. They found that, in the visual-only condition, both Mandarin and Australian English speakers discriminated tones above chance levels. As in Burnham et al. (2000), tone-naïve listeners (Australian English speakers) outperformed native speakers of Mandarin Chinese. Their explanation was that the visual information may in fact be underused by native speakers that have come to rely on their auditory abilities for their native language.

Given the mixed effects with respect to the contribution of visual cues to tonal perception and the possible role of ceiling effects, the current study investigates participants who were naïve with respect to tone identification: native speakers of a non-tonal language. This strongly reduces the possibility of ceiling effects when comparing audio-visual and audio-only conditions. We focused primarily on the added value of visual information for tone-naïve listeners in two clear, yet distinct auditory conditions: when speakers employ a “teaching style” specifically geared to non-native listeners or a more natural speaking style, geared towards fellow native speakers. This distinction is discussed in the following section.

2.1.2 The effect of speaking style on tone perception

Adult speakers possess the ability to intuitively and automatically adjust their speaking style to meet the demands of the target audience or the communicative situation (Junqua, 1993; Kuhl et al., 1997; Skowronski & Harris, 2006). They show sensitivity to characteristics of the audience they are addressing (Burnham et al., 2002). To make themselves more intelligible to the listeners, speakers usually articulate in a more “exaggerated” manner: they maximize phonetic contrast, attempt to speak more slowly, more loudly, and more clearly (Smiljanić & Bradlow, 2009). These modifications in speaking style have been discussed extensively as “clear speech” (Ferguson & Kewley-Port, 2007; Smiljanić & Bradlow, 2007, 2009; Uchanski, 2005).


concluded that tone fidelity is not affected by the exaggerated intonation of IDS. In contrast to the claims that IDS helps in highlighting important aspects of speech (Thiessen et al., 2005), Benders (2013) argues that a hyperarticulated speaking style in IDS might not facilitate language learning, but is primarily meant to promote affection between mothers and infants. Similarly, Cristia and her colleagues have substantially criticised the didactic account of IDS (e.g., Cristia, 2013; Cristia & Seidl, 2014; Martin et al., 2015).

In the field of second language learning, a hyperarticulated speaking style is commonly associated with the "teacher talk" (or "foreigner talk") that teachers use when addressing second language learners in the classroom, anticipating learners' needs for assistance in their attempts at comprehension (Ferguson, 1975, 1981). For instance, Uther, Knoll and Burnham (2007) compared "foreigner-directed speech" (FDS), IDS and regular adult-directed speech. Their results suggest that the linguistic modifications found in both infant- and foreigner-directed speech are didactically oriented, and that these modifications are independent of vocal pitch and affective valence. With respect to vocal aspects of different speaking styles, more attention has been paid to segmental correlates like vowels and transitions (Llisterri, 1992) than to lexical tone information (Chen & Massaro, 2008). To the best of our knowledge, no previous studies have explored to what extent the acquisition of tones by non-native listeners is affected by the speaking style to which they are exposed, which is all the more surprising given the ubiquitous use of clear speech by foreign language teachers. The second goal of our study is therefore to investigate whether exposure to a hyperarticulated speaking style (teaching style) leads to better tone recognition than exposure to a natural speaking style.

2.1.3 Variation between speakers and between tones


of the visual cues that they provide (Grant & Braida, 1991; Lesner & Kricos, 1981). For example, in a study presenting visual-only stimuli from six female speakers to a group of normal-hearing subjects, three speakers were judged to be difficult to speechread and three were judged to be fairly easy to speechread (Kricos & Lesner, 1982).

With respect to tone perception, there has been relatively little research on speaker variation (but see Creel, Aslin, & Tanenhaus, 2008; Gagné et al., 1994; Nygaard, Sommers, & Pisoni, 1995). In most cases, only one speaker was employed to produce all experimental stimuli (such as in Burnham et al., 2001, 2006; Mixdorff & Charnvivit, 2004; Mixdorff et al., 2005a, 2005b; Reid et al., 2015). The study of Smith and Burnham (2012), mentioned previously, recruited two adult native speakers of Mandarin (one male and one female), but, given the low number of speakers, their study does not address a possible speaker effect. Chen and Massaro (2008) studied the role of visual information in tone perception and showed that female speakers were easier to understand. In their study, four Chinese native speakers, two male and two female, produced the experimental materials. Mandarin participants identified the tones before and after a learning phase. The results revealed that performance was generally better if the speaker was female. Their explanation for this finding was that female speakers tended to have more salient head/chin movement than male speakers. Based on the above, individual speaker differences should be examined, independently of modality and speaking style.


It has been suggested that some lexical tones are easier to distinguish than others during audio-visual speech perception. For example, Mixdorff et al. (2005b) observed that Mandarin Chinese native speakers frequently confused Mandarin tone 1 with tone 2 when they were asked to identify the tones in the devoiced-audio-visual condition. Additionally, they also found that tone 1 yielded the fewest correct responses, whereas tone 3 yielded the highest scores in the devoiced-audio-visual condition. Chen and Massaro (2008) also mentioned that the visual cues for tone 3 (neck movements) tended to be more pronounced than those for tone 2 and tone 4, and tone 4 tended to have the shortest duration of visual cues. Speakers presumably provided extra visual information when they used tone 3 by dipping their head or chin, which made tone 3 the easiest one to distinguish, while tone 2 and tone 4 were relatively hard to discriminate on the basis of visual information. Since facial motion might provide better clues for the identification of some tones than others, the effects of the visual modality may differ between tones.

Even though the main focus in this study is on the role of modality and speaking style in tone identification, the role of speaker and tone variation will be investigated as well, based on the considerations outlined above. However, these analyses should be considered exploratory post-hoc analyses rather than tests of predefined hypotheses.

2.1.4 The current study


tones from (1) visual cues (i.e., whether or not a learner can see the speaker) and (2) the speaking style (i.e., whether or not the language input is transmitted in teaching style).

Given the beneficial effect visual cues are supposed to have for tone perception, the hypothesis we explore is that participants in the bimodal (audio-visual) condition will outperform participants in the unimodal (audio-only) condition, i.e., they will give more correct responses and have shorter reaction times. Similarly, we predict that participants exposed to the teaching style perform better than their counterparts who are exposed to the stimuli in natural style. Speaking style and modality most likely have independent effects, but there might also be interactions between the two. The combination of clear speech and visual cues might, for example, facilitate tonal learning more than each factor independently. In contrast (and in line with the absence of audiovisual superiority in clear speech conditions), visual information might be of little added value in a teaching style because of the clear audio signal. Based on the finding that audiovisual information is mostly beneficial in situations where the auditory signal is degraded, we conjecture that the difference between the audiovisual and audio-only condition is more pronounced in the natural speaking style condition, while in the teaching style condition, this difference is attenuated or perhaps even absent. The reasoning behind this is that, for tone-naïve listeners, normal speech presents less clear (e.g., "degraded") information about tone than teaching style.

We conducted a tone perception experiment, in which native speakers of Dutch were asked to distinguish Mandarin Chinese tones on the basis of audio or video materials. To account for variation between speakers and between tones, the experimental stimuli were recorded using four different speakers and four different tones. We report the proportion of correct responses and average reaction times. We use reaction times in addition to accuracy, because they have proven useful for indicating the degree of helpfulness of visual cues and teaching style (Chen, 2003; Schneider, Dogil, & Möbius, 2011) and to extend previous research that only reports the proportion of correct responses (Burnham et al., 2001; Chen & Massaro, 2008; Mixdorff et al., 2005a, 2005b).

2.2 Methodology


As dependent variables, we recorded both the accuracy (whether a response was correct or not) and the reaction time (how long a participant took to respond) for each stimulus.

2.2.1 Participants

Eighty-six participants were recruited from the Tilburg University participant pool. The age of the participants ranged from 18 to 35 (M = 23, SD = 2.9). None of them had previously been exposed to tone languages. Seventy-two percent of the participants were native speakers of Dutch; the remaining subjects were German, Italian, British, Spanish, Austrian, Indonesian, Bulgarian, and Turkish. They either received 0.5 study credits for their participation or a small token of appreciation. Participants were randomly assigned to one of the four conditions (video + teaching; video + natural; audio + teaching; audio + natural), while maintaining a balanced gender distribution in each group.

2.2.2 Stimuli

Stimulus construction

We constructed a word list with 10 Mandarin monosyllables (e.g., ma, ying …; selection based on Chen & Massaro, 2008 and Francis et al., 2008, see Appendix 1 for the complete list). Each of these syllables was chosen such that the four tones would generate four different meanings, resulting in 40 (10 syllables × 4 tones) different existing words in Mandarin Chinese. Four adult native Mandarin-Chinese speakers (two male and two female) were asked to read out these words. All speakers were born and raised in China and had come to the Netherlands for their graduate studies; they had been in the Netherlands for less than three years. Speakers were instructed to produce the 40 words in two different scenarios in sequence: a natural mode ("pronounce these words as if you were talking to a Chinese speaker") and a teaching mode ("as if you were talking to someone who is not a Chinese speaker"). In both conditions, there were no other instructions or constraints imposed on the way they should produce the stimuli. There was a 20-minute break between the two recordings to avoid fatigue, with the recording of the natural stimuli preceding the recording of the teaching style stimuli.


In total, 320 stimuli were produced; two sets of 160 video stimuli (10 syllables × 4 tones × 4 speakers) were generated, in teaching and in natural modes. These video clips were segmented into individual tokens, with each token containing one stimulus. In the final analysis, two problematic stimuli were discarded: one had been cut too short, and the other had been produced incorrectly by the speaker. We converted the video format from mp4 to avi using Freemake Video Converter (version 4.1.6) to ensure compatibility with E-Prime. Format Factory (version 3.9.5) was used to extract the sound from each video to generate the material for the audio-only conditions. This resulted in four types of experimental stimuli: video + teaching (VT); video + natural (VN); audio + teaching (AT); audio + natural (AN).
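The stimulus inventory described above can be sanity-checked by enumerating the design; the syllable names below are placeholders (the real list is in Appendix 1), so this is an illustrative sketch rather than actual experiment code.

```python
# Sanity-check sketch of the stimulus inventory:
# 10 syllables x 4 tones x 4 speakers x 2 styles = 320 recorded tokens.
from itertools import product

syllables = [f"syl{i}" for i in range(1, 11)]  # placeholders for ma, ying, ...
tones = [1, 2, 3, 4]
speakers = ["S1", "S2", "S3", "S4"]
styles = ["natural", "teaching"]
modalities = ["video", "audio"]

recordings = list(product(speakers, styles, syllables, tones))
print(len(recordings))  # 320

# Each recording is presented either audio-visually or audio-only,
# yielding the four condition types: VT, VN, AT, AN.
conditions = {f"{m[0].upper()}{s[0].upper()}" for m, s in product(modalities, styles)}
print(sorted(conditions))  # ['AN', 'AT', 'VN', 'VT']
```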

Pilot study

To ascertain the feasibility of the experimental task and the validity of the stimuli, 24 native Mandarin Chinese speakers were asked to identify the tones presented in the audio + natural condition (the supposedly most challenging condition). These speakers were born and raised in China; they were postgraduate students (aged 21-40) who had been staying in the Netherlands between four months and five years. Their identification accuracy was 99.5%, indicating the validity of the stimuli and the feasibility of the task.

Stimulus characteristics

Acoustic and visual analyses were conducted to assess whether the differences between the two speaking styles (teaching and natural) were present in the experimental stimuli. We measured the mean duration and the average pitch of the two sets of experimental stimuli. In general, we expected that the duration of the teaching stimuli would be longer than the duration of the natural style stimuli. Similarly, we expected that tone patterns would be exaggerated in the hyperarticulated style. The tone fidelity, however, should not be impacted by the speaking style, since pitch is closely related to the lexical meaning of the word (Xu & Burnham, 2010).

We used Praat 6.0.33 (Boersma & Weenink, 2017) to measure the duration and average pitch of the experimental stimuli. A repeated-measures ANOVA, with tone and speaking style as the within-subject factors and speaker as between-subject factor, revealed that speaking style had a significant effect on the duration of the stimuli, F (1, 9) = 63.3, p < .001, ηp² = .876. In line with our


p = .183, ηp² = .188. Figure 2.1 provides an illustrative example of the difference between the two speaking styles for the four tones. The figure clearly shows the expected rising and falling patterns, which are more pronounced (especially in their duration) in the teaching style. Table 2.1 and Table 2.2 present the means and standard errors (as well as confidence intervals) for the average pitch and duration. Speaker differences accounted for a decent amount of variation in the average duration of the experimental stimuli, F (3, 27) = 8.01, p = .001, ηp² = .471, and even more variation in average pitch, F (3, 27) = 28.7, p < .001, ηp² = .761 (Table 2.1). Tones accounted for a large amount of variation between the two speaking styles in duration: F (3, 27) = 399.8, p < .001, ηp² = .978, and in pitch: F (3, 27) = 66.4, p < .001, ηp² = .881 (Table 2.2). There was a significant interaction between tone and speaking style in terms of duration: F (3, 27) = 9.01, p < .001, ηp² = .50. However, no significant interaction was found between tone and speaking style in terms of average pitch: F (3, 27) = 0.59, p > .05, ηp² = .63. Thus, our global acoustic analysis reveals strong effects of individual speakers and tones, while small differences between the two speaking styles also emerge.
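The Praat measurements themselves are not reproduced here, but as an illustration of what estimating "average pitch" involves, the sketch below implements a bare-bones normalized-autocorrelation F0 estimator on a synthetic tone. All parameters are illustrative; Praat's pitch algorithm is considerably more sophisticated.

```python
import math

def estimate_f0(samples, fs, fmin=150.0, fmax=400.0):
    """Crude one-frame pitch estimate via normalized autocorrelation."""
    n = len(samples)
    best_lag, best_r = None, -1.0
    # search candidate periods between 1/fmax and 1/fmin seconds
    for lag in range(int(fs / fmax), int(fs / fmin) + 1):
        num = sum(samples[i] * samples[i - lag] for i in range(lag, n))
        den = math.sqrt(sum(s * s for s in samples[lag:]) *
                        sum(s * s for s in samples[:n - lag]))
        r = num / den if den else 0.0
        if r > best_r:
            best_r, best_lag = r, lag
    return fs / best_lag

fs = 8000
f0 = 220.0  # synthetic "tone" at 220 Hz
samples = [math.sin(2 * math.pi * f0 * t / fs) for t in range(800)]
print(len(samples) / fs)  # duration in seconds: 0.1
print(estimate_f0(samples, fs))  # close to 220 Hz
```

Duration is trivially the number of samples divided by the sampling rate; a pitch track as in Figure 2.1 would repeat the F0 estimate over successive short frames.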

Figure 2.1. Plots of tone contours for natural (a) and teaching style (b). Figure based


Speaker  Tone   Mean           SE             95% CI lower    95% CI upper
                N      T       N      T       N      T        N      T
1        1      .434   .564    .036   .032    .354   .493     .514   .635
1        2      .509   .633    .039   .029    .421   .568     .597   .698
1        3      .663   .815    .032   .033    .590   .739     .736   .891
1        4      .373   .375    .028   .021    .310   .328     .436   .422
2        1      .454   .567    .029   .036    .389   .485     .519   .649
2        2      .458   .510    .026   .035    .399   .430     .517   .590
2        3      .701   .848    .036   .037    .621   .765     .781   .931
2        4      .395   .419    .026   .024    .337   .364     .453   .474
3        1      .429   .436    .026   .031    .370   .366     .488   .506
3        2      .507   .519    .024   .027    .452   .459     .562   .579
3        3      .588   .640    .033   .029    .514   .575     .662   .705
3        4      .349   .360    .027   .028    .288   .298     .410   .422
4        1      .413   .431    .025   .027    .357   .370     .469   .492
4        2      .490   .536    .031   .027    .419   .474     .561   .598
4        3      .590   .687    .019   .029    .546   .622     .634   .752
4        4      .336   .343    .024   .031    .282   .273     .390   .413

Table 2.1. Descriptive statistics for duration (separately by speaker, tone and style) (s). Note that SE represents standard error; CI represents confidence interval; N represents natural style; T represents teaching style.


Speaker  Tone    Mean                 SE                95% CI lower         95% CI upper
                 N        T           N       T         N        T           N        T
1        1       232.662  252.691     5.253   4.534     220.778  242.434     244.546  262.948
1        2       199.245  208.177     5.039   7.435     187.847  191.357     210.643  224.997
1        3       168.108  187.620     9.136   12.746    147.441  158.787     188.775  216.453
1        4       295.655  322.456     12.627  10.317    267.091  299.118     324.219  345.794
2        1       172.967  195.988     5.805   5.436     159.834  183.690     186.100  208.286
2        2       149.407  160.717     6.898   3.911     133.802  151.869     165.012  169.565
2        3       133.542  138.875     5.683   4.836     120.686  127.935     146.398  149.815
2        4       232.851  221.915     24.496  13.603    177.436  191.142     288.266  252.688
3        1       156.752  138.275     18.896  3.102     114.006  131.258     199.498  145.292
3        2       114.581  135.800     8.987   11.564    94.252   109.641     134.910  161.959
3        3       170.019  168.692     36.792  26.611    86.791   108.495     253.247  228.889
3        4       253.851  222.173     36.442  30.937    171.413  152.188     336.289  292.158
4        1       305.976  323.645     3.548   5.989     297.950  310.098     314.002  337.192
4        2       229.345  241.122     12.771  9.022     200.455  220.712     258.235  261.532
4        3       200.173  194.949     10.446  5.479     176.543  182.554     223.803  207.344
4        4       346.520  348.839     20.047  13.421    301.171  318.480     391.869  379.198

Table 2.2. Descriptive statistics for average pitch (separately by speaker, tone, and style) (Hz). Note that SE represents standard error; CI represents confidence interval; N represents natural style; T represents teaching style.
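As an aside on how cell statistics like those in Tables 2.1 and 2.2 are typically derived, the sketch below computes a mean, standard error, and normal-approximation 95% confidence interval (mean ± 1.96 · SE) from a set of token measurements. The sample durations are invented for illustration; the thesis does not state whether a normal or t-based interval was used, so this is only one plausible reconstruction.

```python
import math

def describe(values, z=1.96):
    """Mean, standard error, and normal-approximation 95% CI
    (mean +/- z * SE) for a list of measurements."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    se = sd / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)

# Hypothetical durations (s) for one speaker/tone/style cell.
durations = [0.41, 0.45, 0.43, 0.47, 0.44]
mean, se, ci = describe(durations)
print(round(mean, 3), round(se, 3))  # 0.44 0.01
```

With five tokens the interval is fairly wide; the narrow intervals in the tables above reflect the larger number of tokens per cell in the actual stimulus set.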

For the visual analyses, we expected the hyperarticulated (teaching) style to result in more facial movement than the natural style. We used Flow Analyzer4 to track the amount of movement present in the video as an estimate of the magnitude of the movements. The total amount of motion was measured for each speaker, in both speaking styles, for each syllable/tone combination. A repeated-measures ANOVA showed a main effect of speaking style on the total amount of motion, F(1, 9) = 115, p < .001, ηp² = .928. In the teaching style (M = 0.25, SE = 0.01), speakers tended to use more visual cues than in the natural style (M = 0.15, SE = 0.003), which

4 Flow Analyzer is a piece of software, based on optical flow analysis, for extracting motion from video.



is in line with the idea of hyperarticulation. Individual speakers also differed significantly in their amount of movement, F(3, 27) = 19.56, p < .001, ηp² = .685. Pairwise comparisons (with Bonferroni adjustment) showed that Speaker 1 and Speaker 3 provided the most visual movement information; there was no significant difference between Speaker 1 (M = 0.27, SE = 0.023) and Speaker 3 (M = 0.21, SE = 0.008). Speaker 4 (M = 0.18, SE = 0.003) moved significantly less than Speakers 1 and 3, p < .02, and Speaker 2 (M = 0.13, SE = 0.009) signaled the least visual information. Tone did not affect the amount of movement, F(3, 27) = 2.27, p = .103, ηp² = .202, and the difference in motion between the natural and teaching speaking styles did not vary significantly across the separate tones (tone × style interaction): F(3, 27) = 2.79, p > .05, ηp² = .237.

Flow Analyzer can also measure the motion displayed in the horizontal (x) and vertical (y) directions, which gives a clearer picture of the directionality, or type, of motion for the different tones. Table 2.3 provides the amount of movement in the x and y directions and shows, for example, that tone 1 has the least vertical and the most horizontal movement, in line with a level tone. Conversely, tones 2, 3 and 4 show more vertical than horizontal motion.

Tone    Mean           SE             95% CI lower    95% CI upper
        x      y       x      y       x      y        x      y
1       .037   .029    .005   .001    .026   .026     .047   .032
2       .028   .039    .003   .002    .023   .035     .034   .043
3       .025   .047    .001   .003    .022   .041     .029   .053
4       .033   .047    .001   .003    .030   .040     .036   .055

Table 2.3. Descriptive statistics for the amount of facial movement on the x and y axes for the different tones (pixels per frame). Note that SE represents standard error and CI represents confidence interval.
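Flow Analyzer's implementation is not reproduced here, but per-axis quantities of the kind shown in Table 2.3 can be approximated as in the sketch below. This is a minimal illustration, assuming dense optical-flow fields (per-pixel (dx, dy) displacements, e.g. from a Farneback-style algorithm) have already been computed for each video frame pair; `motion_by_axis` is a hypothetical helper, not part of Flow Analyzer.

```python
import numpy as np

def motion_by_axis(flow_frames):
    """Average per-pixel motion along x and y (pixels per frame).

    flow_frames: iterable of (H, W, 2) arrays holding per-pixel
    (dx, dy) displacement vectors from a dense optical-flow step.
    """
    # Mean absolute displacement per frame, separately for dx and dy.
    per_frame = np.array([np.abs(f).mean(axis=(0, 1)) for f in flow_frames])
    x_motion, y_motion = per_frame.mean(axis=0)
    return x_motion, y_motion

# Tiny synthetic example: two 2x2 flow fields, pure horizontal
# motion in the first and pure vertical motion in the second.
f1 = np.zeros((2, 2, 2)); f1[..., 0] = 1.0   # dx = 1 everywhere
f2 = np.zeros((2, 2, 2)); f2[..., 1] = 0.5   # dy = 0.5 everywhere
x, y = motion_by_axis([f1, f2])
print(x, y)  # 0.5 0.25
```

Averaging absolute components, rather than signed ones, keeps opposite movements (e.g. the fall and rebound of a dipping tone) from cancelling out, which matches the interpretation of Table 2.3 as an amount of motion.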


2.2.3 Procedure

All sessions were conducted in a sound-attenuated room. E-Prime (version 2.0; Zuccolotto, Roush, Eschman, & Schneider, 2012) was used to set up and run the experiment. The full procedure consisted of three blocks: instruction, practice trials, and test trials. Before the experiment started, participants were asked to fill out a questionnaire that assessed their language background. After that, a brief instruction about Mandarin Chinese tones was displayed on the screen (see Figure 2.2 for a screenshot): “There are four tones in Mandarin Chinese: the first tone is a High-Level tone, symbolized as ‘ ¯ ’, the second tone is a Mid-Rising tone, symbolized as ‘ ̷ ’, the third tone is a Low-Dipping tone, symbolized as ‘ ˅ ’, and the fourth tone is a High-Falling tone, symbolized as ‘ \ ’.”

The participants’ task was to identify the tones they perceived from the speakers. Three practice trials allowed participants to become familiar with the testing procedure and the stimuli. After the practice trials, the experiment leader checked with the participants to make sure they fully understood the concept of tones (in particular the symbols) and the task. Finally, the 160 test stimuli (video/audio) were presented in an order randomized by E-Prime for each participant. Participants had 10 seconds to respond on each trial and received feedback: “good job” or “incorrect”, depending on the correctness of their response, or “no response” if they had not reacted within the 10 seconds. To motivate the participants to do their best, a special rule was programmed into the experimental procedure: if a participant gave ten correct responses consecutively, the experiment would stop5.

Participants wore headsets and were seated directly in front of the PC running the experiment. All stimuli were presented at a comfortable listening level. The participants were instructed to press the designated keys labeled with the corresponding tone symbols (“¯”, “ ̷ ”, “˅”, “\”; see Figure 2.3) as accurately and as quickly as possible once they had made their decision. Their responses and reaction times were recorded automatically by E-Prime.
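The E-Prime script itself is not reproduced in the thesis. As a rough illustration of the trial logic just described (randomized presentation, three feedback messages, and the early-stop rule), here is a hypothetical Python re-implementation; `run_block` and `get_response` are invented names, with `get_response` standing in for E-Prime’s timed key collection and returning a tone number 1–4 or None on timeout.

```python
import random

def run_block(stimuli, get_response, max_consecutive=10):
    """Sketch of the trial loop: randomized presentation, feedback
    per trial, and an early stop after ten consecutive correct
    responses (the rule described in the procedure above)."""
    order = list(stimuli)
    random.shuffle(order)
    streak = 0
    log = []
    for stim in order:
        response = get_response(stim)   # pressed tone (1-4) or None
        if response is None:
            feedback = "no response"
            streak = 0
        elif response == stim["tone"]:
            feedback = "good job"
            streak += 1
        else:
            feedback = "incorrect"
            streak = 0
        log.append((stim, response, feedback))
        if streak >= max_consecutive:
            break  # ten in a row: end the experiment early
    return log
```

Simulating a responder that is always correct, the loop ends after exactly ten trials, which mirrors the early-finish behavior the thesis reports for some participants.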

5 In total, 22 participants finished the experiment early (they gave ten correct responses consecutively).


Figure 2.2. Screenshot of the brief introduction to Mandarin Chinese tones (in the video conditions).
