
Audio-visual synchrony perception

Citation for published version (APA):

Eijk, van, R. L. J. (2008). Audio-visual synchrony perception. Technische Universiteit Eindhoven. https://doi.org/10.6100/IR634898

DOI:

10.6100/IR634898

Document status and date: Published: 01/01/2008

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)




J.F. Schouten School for User-System Interaction Research.

An electronic copy of this thesis in PDF format is available from the website of the library of the Technische Universiteit Eindhoven (http://www.tue.nl/bib).

© 2008, Rob L.J. van Eijk, The Netherlands

CIP-DATA LIBRARY TECHNISCHE UNIVERSITEIT EINDHOVEN

Eijk, Rob L.J. van

Audiovisual synchrony perception / by Rob Lambertus Jacobus van Eijk. Eindhoven: Technische Universiteit Eindhoven, 2008. Proefschrift.

ISBN 978-90-386-1264-5
NUR 778

Keywords: Synchrony perception / Temporal interval discrimination / Psychophysics

Cover design: Paul Verspaget


Audio-visual synchrony perception

PROEFSCHRIFT

to obtain the degree of doctor at the Technische Universiteit Eindhoven, by authority of the Rector Magnificus, prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the Doctorate Board on Thursday 29 May 2008 at 16:00

by

Rob Lambertus Jacobus van Eijk


Promotoren:

prof.dr. A.G. Kohlrausch and

prof.dr. J.F. Juola

Copromotor:


Contents

1 Introduction . . . 1

1.1 Intersensory timing. . . 1

1.2 Subjective simultaneity . . . 2

1.3 Sensitivity to temporal interval differences . . . 11

1.4 Perceptual and cognitive aspects of intersensory timing . . . 15

1.5 Stimulus complexity . . . 23

1.6 Overview of this thesis . . . 25

2 Audio-visual synchrony and temporal order judgments: Effects of experimental method and stimulus type . . . 27

2.1 Introduction . . . 28

2.2 Method . . . 28

2.3 Results . . . 32

2.4 General discussion . . . 42

3 Temporal interval discrimination thresholds depend on perceived synchrony for audio-visual stimulus pairs . . . 49

3.1 Introduction . . . 50

3.2 Experiment 1: Thresholds at 0 ms, PSS, and 2×PSS . . . 52

3.3 Experiment 2: Additional thresholds within and beyond the synchrony range . . . 60

3.4 General discussion . . . 63

4 Effects of visual predictive information on audio-visual temporal order and simultaneity judgments . . . 69

4.1 Introduction . . . 70

4.2 Method . . . 71

4.3 Results . . . 74


5 … detection thresholds . . . 93

5.1 Introduction . . . 94

5.2 Method . . . 96

5.3 Results . . . 99

5.4 General discussion . . . 110

6 General discussion . . . 115

6.1 Main findings . . . 115

6.2 Future research . . . 117

A Notes on the literature overview of PSS estimates . . . 121

Bibliography . . . 123

Summary . . . 133

Samenvatting . . . 137

Acknowledgments . . . 141


1 Introduction

1.1 Intersensory timing

Most events in our natural environment, such as someone speaking in front of us or a book falling to the floor, provide us with information via different sensory modalities that is generally integrated into a single multisensory representation (King, 2005; Spence, 2007; Stein and Meredith, 1994). These examples illustrate the importance of the auditory and visual modalities when passively perceiving events. When we interact with the environment, the importance of the tactile modality also becomes apparent. When typing on a personal computer, for example, one can feel the keyboard beneath one's fingers, hear the sound produced by pressing a key, and see the corresponding character appear on the screen. In this example both visual and auditory information are transmitted through the external world, whereas the tactile stimulation is co-located with the physical object (the keyboard). As such, tactile stimulation is generally limited to nearby events, whereas auditory and visual stimulation can come from events occurring within a relatively large range of distances (Gepshtein et al., 2005; Hillis et al., 2002; Miyazaki et al., 2006). Given the relatively low speed of sound, the auditory component of a perceived event will always reach an observer later than the visual component, and this difference increases with physical distance. These examples from daily life show that audio-visual integration does not require a physically synchronous presentation of the auditory and visual components of a multisensory event, but must, to some extent, be tolerant of temporal disparities.

In the physical world, the timing relationship between auditory and visual signals arising from our natural environment is determined by the properties of some physical process, e.g., the physical moment of impact of the objects involved and the properties of the media carrying the signals to the respective sensory receptors. Auditory and visual information do not always stem directly from our natural environment, however, but may also be provided through the reproduction of prerecorded audio-visual material on television or in the cinema, or may be generated in real time in a virtual environment. In artificial environments such as television, teleconferencing systems, and games, the reproduction of timing relations between sensory modalities depends entirely on technology. Temporal disparities between auditory and visual components may have a detrimental effect on the perceived quality of audio-visual presentations (Rihs, 1995), and may, for example, also hamper the feeling of presence in a virtual environment (see Kohlrausch and van de Par, 2005, for an overview of audio-visual research in the context of multimedia applications). Therefore, it is important to control the temporal relationship between signals of different modalities in a way that is perceptually optimal. In realistic computer games and interactive virtual-reality environments the auditory signal is not readily available, but has to be rendered in real time from a physical model of the environment (Moeck et al., 2007; Murphy and Rumsey, 2001). The complexity of the sound-rendering system may, for example, be scaled (Murphy and Rumsey, 2001) such that sound quality is optimal given the available processing resources, while the delay with which the auditory component is presented remains well within perceptually acceptable limits. These examples illustrate the importance of psychophysical studies that explore the limits of audio-visual synchrony perception.

The remainder of this introductory chapter is organized as follows: The concept of subjective simultaneity is treated in detail in section 1.2, along with the most common experimental methods used to obtain estimates of the point of subjective simultaneity. Furthermore, an overview is provided of the estimates of subjective simultaneity that have been reported in the literature. In section 1.3 sensitivity to audio-visual temporal intervals is discussed, along with different experimental methods used to estimate sensitivity. Section 1.4 provides an overview of the perceptual and cognitive aspects that may influence subjective simultaneity and sensitivity to audio-visual temporal intervals. Section 1.5 deals with the influence of stimulus complexity on measures of audio-visual synchrony perception. This introduction is concluded by section 1.6, which gives an overview of this thesis and derives the central research topics to be addressed.

1.2 Subjective simultaneity

1.2.1 Definition and methodology

Subjective simultaneity in the context of audio-visual perception is generally expressed by the point of subjective simultaneity (henceforth PSS), which indicates the relative auditory delay (ms) between the components of a bimodal stimulus at which synchrony is perceived. By convention (Arrighi et al., 2006; Aschersleben and Müsseler, 1999; Enoki et al., 2006; Vatakis and Spence, 2006a,b; Zampini et al., 2003a,b; but see Lewald and Guski, 2004; Spence et al., 2003), positive auditory delays indicate that the auditory component trails the visual component (with 0 ms being physical synchrony of the auditory and visual stimulus components). Negative values are used for the far less frequent case of an auditory component of some event occurring before its visual counterpart (see also section 1.2.2). In the literature, the PSS is defined in different ways, depending on the experimental method used. The two most commonly used methods are the synchrony judgment (SJ) task and the temporal order judgment (TOJ) task.

In the SJ task, observers are asked to judge whether the auditory and visual components of a stimulus are synchronous or not. Stimuli are presented with variable onset asynchronies, and the SJ task thus yields a relatively direct measure of perceived synchrony (e.g., Fujisaki et al., 2004; Stone et al., 2001; Zampini et al., 2005b). Figure 1.1 shows a schematic representation of the response pattern in such a task. The black curve indicates the observed proportion of 'synchronous' responses as a function of the relative audio delay between the auditory and visual components of the stimulus. The grey curve indicates the 'non-synchronous' response proportions. In an SJ task the PSS is defined as the midpoint of the range of delays that are predominantly judged to be synchronous (termed the "synchrony range" in this thesis). Two variants of the synchrony judgment task can be discerned: (1) the SJ2 task, with only two response categories, 'synchronous' and 'non-synchronous,' and (2) the SJ3 task, with three response categories: 'audio first,' 'synchronous,' and 'video first.'

In the TOJ task (e.g., Aschersleben and Müsseler, 1999; Spence et al., 2003; Sternberg and Knoll, 1973; Zampini et al., 2003a,b) observers are asked to indicate which of two modalities was stimulated first by responding with either 'audio first' or 'video first.' The response pattern in such a task is shown schematically in Figure 1.2. The proportion of 'video first' responses (black curve) increases monotonically with increasing audio delay, while the proportion of 'audio first' responses (grey curve) decreases correspondingly. The PSS is estimated as the point at which the proportion of 'audio first' judgments equals the proportion of 'video first' judgments, the TOJ 50% point.



Figure 1.1: Schematic ‘synchronous’ (black) and ‘asynchronous’ (grey) response curves as a function of the audio delay (ms). The intersection points between synchronous and asynchronous response curves, termed synchrony boundaries, are indicated using vertical dashed lines. The synchrony range corresponds to the range of delays between the synchrony boundaries. The PSS is defined as the midpoint of the synchrony range.


Figure 1.2: Schematic ‘video first’ (black) and ‘audio first’ (grey) response curves as a function of the audio delay (ms). The intersection point between ‘audio first’ and ‘video first’ response curves, the TOJ 50% point, is used as PSS estimate. The temporal window of integration is defined as the range of 25% to 75% response proportions, indicated using vertical dashed lines. The just-noticeable difference (JND) is defined by subtracting the delay at the 50% point from the audio-visual delay at the 75% point. As such, the width of the temporal window of integration equals 2×JND.


1.2.2 Physical and physiological aspects

In our natural environment the relative timing between an auditory and a visual signal is influenced by their propagation speeds, and inside our body by sensory transduction and neural conduction times. As mentioned in section 1.1, auditory and visual signals are subject to the different propagation speeds of their respective media, which cause the auditory component of an audio-visual event to always reach the sensory receptors of an observer later than the visual component (Spence and Squire, 2003). That is, whereas the arrival of a visual signal is almost instantaneous due to a propagation speed of approximately 300 × 10^6 m/s, the arrival of an auditory signal may be substantially delayed due to a propagation speed of approximately 340 m/s (i.e., in air an auditory signal takes approximately 3 ms to travel a distance of 1 m).

When auditory and visual signals reach the human ears and eyes, the associated physical (light and sound) energy has to be converted into neural activity through transduction in the relevant sensory receptors. Light stimulation of the retina is transmitted to the optic nerve through a phototransduction process in the rods and cones, followed by a chain of neurochemical stages, lasting around 50 ms (Arrighi et al., 2006; Schiffman, 2001). Sound energy is transduced by transforming sound waves into mechanical motion at the ear drum, which is conveyed via the middle ear to the cochlea, where it is transformed into nerve impulses (Schiffman, 2001). As the acoustic transduction process takes only about 1 ms and travel times in the inner ear are below 10 ms, sound transduction in the human ear is about 40–50 ms faster than light transduction in the human eye (Arrighi et al., 2006; King, 2005). This temporal disparity is increased even further by the longer neural transmission times in the visual system (King, 2005). As a result, only when an audio-visual event occurs at a distance of approximately 15 m do the corresponding auditory and visual signals arrive simultaneously at the sensory cortices. For audio-visual events occurring within this so-called 'horizon of simultaneity' (Pöppel, 1988), the auditory component arrives in our brain first, whereas for events beyond the horizon of simultaneity the visual component arrives first. That is, an auditory stimulus presented at a relative delay of less than 45 ms will arrive in the brain before the corresponding visual component, whereas an auditory stimulus presented at a relative delay larger than 45 ms will arrive later.
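The arithmetic behind the horizon of simultaneity can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name is hypothetical, and the 45 ms constant is an assumed midpoint of the 40–50 ms transduction difference cited above.

```python
SPEED_OF_LIGHT_M_S = 3.0e8        # ~300 x 10^6 m/s
SPEED_OF_SOUND_M_S = 340.0        # ~3 ms per metre in air
TRANSDUCTION_ADVANTAGE_MS = 45.0  # assumed: sound transduced ~40-50 ms faster

def brain_arrival_difference_ms(distance_m: float) -> float:
    """Auditory-minus-visual arrival time (ms) in the brain for a
    physically synchronous audio-visual event at the given distance.
    Positive values mean the visual component arrives first."""
    travel_gap_ms = 1000.0 * (distance_m / SPEED_OF_SOUND_M_S
                              - distance_m / SPEED_OF_LIGHT_M_S)
    return travel_gap_ms - TRANSDUCTION_ADVANTAGE_MS

# Distance at which the difference is zero: the horizon of simultaneity (~15 m).
horizon_m = (TRANSDUCTION_ADVANTAGE_MS / 1000.0) * SPEED_OF_SOUND_M_S
```

At 0 m the sound "wins" by the full transduction advantage; beyond roughly 15 m the acoustic travel time dominates and the visual component arrives in the brain first.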


1.2.3 Literature overview of PSS estimates

The physical and physiological aspects treated in section 1.2.2 lead to the expectation that the PSS should occur near the point of physical synchrony or at some positive audio delay, i.e., with the auditory component lagging behind the visual component. More specifically, due to the faster transduction of sound, an auditory lag is required at the level of the peripheral sensory receptors for auditory and visual signals to arrive simultaneously in the brain. Furthermore, adaptation to the positive audio delays experienced in everyday life may well have shifted the PSS in the direction of larger, more positive delays (due to temporal recalibration; see section 1.4.2). Due to temporal ventriloquism (see section 1.4.3), the perceived moment of occurrence of a visual stimulus may be biased in the direction of a trailing auditory stimulus. As temporal ventriloquism critically depends on the auditory signal following the visual signal (although see Aschersleben and Bertelson, 2003), it may result in a larger tolerance for positive, audio-trailing delays, but not for negative, audio-leading delays. As such, temporal ventriloquism may also shift the PSS in the direction of larger, more positive delays, with the auditory component trailing the visual component.

Research in the area of perceived audio-visual synchrony has made use of a wide range of stimulus types (see, e.g., Arrighi et al., 2006; Enoki et al., 2006; Keetels and Vroomen, 2005; Vatakis and Spence, 2006a) and experimental methods (see, e.g., Dixon and Spitz, 1980; Exner, 1875; Vatakis et al., 2008; Vroomen et al., 2004). Stimuli varied from simple (e.g., a flash of light accompanied by an audible click; see, e.g., Aschersleben and Müsseler, 1999; Hamlin, 1895; Jaśkowski et al., 1990) to complex (e.g., a video of a person speaking, or playing a musical instrument; see, e.g., Dixon and Spitz, 1980; Hollier and Rimell, 1998; Vatakis and Spence, 2006a). Comparing PSS values derived from different experimental methods shows that TOJ and SJ tasks often yield different results. A non-exhaustive overview of PSS values reported by or estimated from various studies is shown in Table 1.1. Given the context of this thesis, the overview in Table 1.1 is restricted to publications about perceived temporal relations between the auditory and visual modalities (i.e., other modalities and unimodal studies are excluded) that aim to uncover the perception of audio-visual synchrony under "normal conditions" (i.e., studies that attempt to manipulate synchrony perception by, for example, manipulating attention, or by exposing participants to asynchronous audio-visual adaptation stimuli, are excluded).

Table 1.1: PSS values (ms) for audio-visual stimuli reported by or estimated from studies using different methods and stimulus types. Negative values indicate that the auditory component of the stimulus led the visual component, whereas positive values indicate that the visual component led, at the point at which both judgments were at the 50% point (TOJ task), or at the midpoint of the 'synchronous' judgment range (SJ2 and SJ3 tasks). The range of reported PSS values is indicated by the minimum and maximum (separated by '...'), or by the standard deviation if individual PSS values were not reported. 'N' is the number of participants in the cited study, 'n/a' stands for 'not applicable,' and 'n/r' for 'not reported' (or impossible to derive). See Appendix A for additional information.

Study | Note on stimulus | Task[1] | PSS (range; ms) | N

Flash-click (stationary) stimulus:
Aschersleben and Müsseler (1999) | | TOJ | -13 (n/r) | 16
Bald et al. (1942) | Hidden sound source | TOJ | -9 (n/r) | 32
Bald et al. (1942) | Visible sound source | TOJ | -1 (n/r) | 30
Bloch (1887; in Hamlin, 1895) | | TOJ | -4 (n/r) | n/r
Dinnerstein and Zlotogura (1968) | | TOJ | +71 (±61) | 23
Enoki et al. (2006) | Sudden appearance | SJ | +36 (+24...+50) | 11
Exner (1875) | | SJ | +50 (n/a) | 1
Fujisaki et al. (2004) | No adaptation | SJ | +4 (±31) | 7
Fujisaki et al. (2004) | Adaptation to 0 ms | SJ | -10 (±35) | 7
Hamlin (1895) | | TOJ | -19 (-34...-2) | 2
Hirsh and Fraisse (1964) | | TOJ | +29 (n/r) | 8
Hirsh and Fraisse (1964) | | SJ | +22 (n/r) | 8
Hirsh and Sherrick (1961) | | TOJ | +5 (n/r) | 5
Jaśkowski et al. (1990) | | TOJ | +48 (+36...+69) | 3
Keetels and Vroomen (2005) | | TOJ | +8 (n/r) | 15
Rutschmann and Link (1964) | | TOJ | -43 (-46...-40) | 2
Smith (1933) | | TOJ | -8 (n/r) | 40
Smith (1933) | | SJ | -2 (n/r) | 40
Spence et al. (2003) | | TOJ | +20 (±24) | 8
Stone et al. (2001) | | SJ | +51 (-21...+150) | 17
Teatini et al. (1976) | Visual stimulus left | TOJ | -7 (-30...+23) | 5
Teatini et al. (1976) | Visual stimulus right | TOJ | +5 (-27...+27) | 5
Tracy (in Hamlin, 1895) | | TOJ | -10 (-20...+1) | 6
Vatakis et al. (2008) | Adaptation to 0 ms | TOJ | +1 (n/r) | 13
Vatakis et al. (2008) | Adaptation to 0 ms | SJ | +1 (-50...+65) | 13
Vroomen et al. (2004) | Adaptation to 0 ms | TOJ | -6 (±24) | 10
Vroomen et al. (2004) | Adaptation to 0 ms | SJ | -11 (±12) | 10
Whipple (1899) | Single presentation | TOJ | -13 (-73...+6) | 5
Whipple (1899) | Repeated presentation | TOJ | -4 (-16...+6) | 6
Zampini et al. (2005b) | Same stimulus location | SJ | +22 (-30...+69) | 40
Zampini et al. (2005b) | Different stimulus location | SJ | +33 (-17...+72) | 40
Zampini et al. (2003a) | Same stimulus location | TOJ | +60 (±17) | 9
Zampini et al. (2003a) | Different stimulus location | TOJ | +75 (±19) | 9

Simple (motion) stimulus:
Arrighi et al. (2006) | Biological motion | SJ | +60 (n/r) | 3
Arrighi et al. (2006) | Non-biological motion | SJ | +35 (n/r) | 3
Arrighi et al. (2006) | Random motion | SJ | +20 (n/r) | 3
Aschersleben and Müsseler (1999) | | TOJ | -17 (n/r) | 16
Dixon and Spitz (1980) | | SJ | +56 (n/r) | 18
Enoki et al. (2006) | Free fall | SJ | +74 (+52...+115) | 11
Hollier and Rimell (1998) | Short visual cue (pen) | SJ | +28 (n/r) | 12
Hollier and Rimell (1998) | Long visual cue (axe) | SJ | +40 (n/r) | 12
Lewkowicz (1996) | | SJ | +24 (-5...+65) | 10
Vatakis and Spence (2006a) | Object action | TOJ | +63 (±90) | 28

Complex stimulus:
Dixon and Spitz (1980) | | SJ | +64 (n/r) | 18
Hollier and Rimell (1998) | | SJ | +38 (n/r) | 12
McGrath and Summerfield (1985) | | SJ | +30 (-1...+131) | 12
Rihs (1995) | | SJ | +40 (n/r) | 18
Smeele (1994) | | TOJ | -105 (±30) | 12
Smeele (1994) | | SJ | -12 (±41) | 6
van Wassenhove et al. (2007) | | SJ | +26 (n/r) | 20
Vatakis and Spence (2006a) | Speech | TOJ | -36 (±143) | 28
Vatakis and Spence (2006a) | Guitar music | TOJ | +65 (±169) | 28
Vatakis and Spence (2006a) | Piano music | TOJ | -84 (±259) | 28

[1] All SJ studies used an SJ2 task, with the exception of Exner (1875), who used an SJ3 task.


The most striking results in Table 1.1 are the negative PSS values, which represent situations in which the auditory stimulus had to lead the visual stimulus for the pair to be perceived as synchronous. Since the PSS values reported in Table 1.1 are generally measured at the level of the peripheral sensory receptors of the observers (with the exception of the study by Bald et al., 1942, which reported delays measured at the stimulus source), such negative "external" delays are indeed highly unnatural.

It can be seen from Table 1.1 that negative overall PSS values are reported mainly for the TOJ task. In their review of the audio-visual TOJ literature, Neumann and Niepel (2004) found that the majority of studies yielded a negative PSS and conclude that “[o]n the whole, these studies clearly suggest a negative PSS as the rule [p. 254].” Later on they qualify this conclusion by referring to TOJ studies that yielded a positive PSS and they state that this “sheds some doubt on the generality of the finding that the PSS is situated at a negative SOA [p. 254].” Thus, from the review of the literature presented in Table 1.1 it may be concluded that the two tasks (TOJ and SJ) might be measuring different things. That is, the SJ task emphasizes the judgment of “synchrony” vs. “successiveness,” whereas the TOJ task emphasizes the judgment of “order,” which requires the perception of successiveness for correct perception (Allan, 1975; Hirsh and Sherrick, 1961). Indeed, Shore et al. (2005, p. 1260) report that their “. . . present findings corroborate the claim (Allan, 1975; Hirsh and Sherrick, 1961) that judgments of temporal order and judgments of simultaneity (versus successiveness) are fundamentally different.” In the context of their unimodal experiments on tactile temporal processing, Shore et al. (2005, p. 1252) state that “. . . it has been argued that TOJs require more information about the stimuli before a correct response can be made” and that “. . . this increased processing requirement might reveal more subtle effects than the simpler simultaneity judgments used in previous studies.” Furthermore, Zampini et al. (2003a, p. 208) note that TOJ and SJ tasks “. . . may reflect very different processes/mechanisms (i.e. one related to multisensory binding, and the other related to temporal discrimination instead. . . ).” Such differences could call into question whether estimates of parameters, such as the PSS, are independent of the experimental method. 
Although explanations for the differences in PSS values shown in Table 1.1 can be based on differences in experimental methods, this hypothesis has not previously been fully addressed within a single study. It has been suggested that differences between PSS estimates derived from SJ and TOJ tasks should be investigated experimentally (Shore et al., 2002; Zampini et al., 2005b). Indeed, Fujisaki et al. (2004; see section 1.4.2 for details) measured recalibration of audio-visual simultaneity using both synchrony judgment and temporal order judgment tasks in a within-subject design and "obtained similar adaptation effects" for both tasks, although the effect was less stable for the TOJ task. Smeele (1994) found a significant difference between PSS values obtained from SJ and TOJ data for 9 out of the 10 speech stimuli she used when comparing results for the six participants common to both experiments. Interestingly, she found a very high correlation between TOJ and SJ PSS values, and a constant shift of 94 ms between them (with TOJ PSS values being more negative). A between-subjects design comparing SJ2 to TOJ was employed by Smith (1933), who reported consistent results between the two tasks in that both produced negative overall PSS values, although he did report that individual differences were somewhat greater in the SJ2 task. Vatakis et al. (2008) found a significant PSS shift when exposing participants to an audio-visual speech video with the auditory speech lagging behind the visual stream; the shift, however, was observed only in the SJ task, not in the TOJ task. Vroomen et al. (2004) also report a between-subjects study in which TOJ and SJ2 were compared using the same stimuli. They found similar shifts in the PSS with the two methods after adaptation to a series of stimulus pairs with specific offsets of their audio-visual components (although the absence of a significant difference between the two judgment tasks may have been due to a lack of statistical power; see also section 1.4.2). In summary, the literature comparing the effect of experimental method on audio-visual synchrony perception is not only limited, it has also produced different results for TOJ and SJ procedures.

1.3 Sensitivity to temporal interval differences

A concept related to the point of subjective simultaneity is the sensitivity with which people can discriminate between different audio-visual temporal intervals. A clear example is a thunderstorm that is approaching a group of hikers, or receding from them. By estimating the temporal interval between the (visual) lightning flash and the (auditory) thunder, one can determine that the thunderstorm is approaching if successive intervals become progressively shorter. Detecting that an interval has become shorter, however, requires that the difference in interval length is large enough to be noticed.
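The hikers' rule of thumb can be made concrete with a short, hypothetical Python sketch, using the propagation speed of about 340 m/s (~3 ms per metre) from section 1.2.2; the names are illustrative, not from the thesis.

```python
SPEED_OF_SOUND_M_S = 340.0  # in air; light's travel time is neglected

def storm_distance_m(flash_to_thunder_s: float) -> float:
    """Distance (m) to a lightning strike, estimated from the interval
    between seeing the flash and hearing the thunder."""
    return flash_to_thunder_s * SPEED_OF_SOUND_M_S

# Progressively shorter flash-to-thunder intervals imply approach:
intervals_s = [9.0, 7.5, 6.0]
approaching = all(b < a for a, b in zip(intervals_s, intervals_s[1:]))
```

Whether an observer can actually make this judgment depends, as noted above, on whether the change in interval length exceeds the discrimination threshold.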


At least three different methods for measuring sensitivity to audio-visual asynchrony have been reported in the literature. In one method, various delays between an auditory and a visual stimulus are introduced in a test sequence using the method of constant stimuli; the just-noticeable difference (JND) can then be determined from the slope of the response curves in a temporal order judgment (TOJ) paradigm (e.g., Hirsh and Sherrick, 1961; Vatakis and Spence, 2006a). Related sensitivity measures can be obtained from a synchrony judgment (SJ) task, either from the slopes at the synchrony boundaries or from the width of the range of 'synchronous' responses (e.g., Arrighi et al., 2006; Vatakis et al., 2008; Zampini et al., 2005b,c). In a third method, discrimination thresholds are determined directly using an adaptive procedure with more than one observation interval (e.g., McGrath and Summerfield, 1985).

1.3.1 Just-noticeable difference (JND)

In TOJ experiments, sensitivity is generally characterized by the JND, which can be determined by fitting the 'video first' data with a cumulative Gaussian distribution (e.g., by means of a Probit Analysis (Finney, 1952), in which the proportions are converted to standard z-scores and fitted with a straight line across audio-visual onset asynchrony; see, e.g., Hirsh and Sherrick, 1961; Rutschmann and Link, 1964). The JND is defined by subtracting the audio-visual delay at the 50% point from the audio-visual delay at the 75% point, and is thus inversely related to the slope (see also Figure 1.2).[2] The JND then defines an interval around the PSS (the TOJ 50% point), called the temporal window of integration (Navarra et al., 2005; Spence and Squire, 2003), within which participants are unable to accurately determine the temporal order of an auditory and a visual stimulus. The temporal window of integration is thus defined as the range of 25% to 75% response proportions in a TOJ task, and its width equals 2×JND.
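The Probit Analysis described above can be illustrated with a minimal Python sketch. The helper name and synthetic data are illustrative assumptions; a real analysis would weight points by trial counts and exclude proportions of exactly 0 or 1, which have no finite z-score.

```python
from statistics import NormalDist

def probit_fit(delays_ms, p_video_first):
    """Fit a cumulative Gaussian to 'video first' response proportions:
    convert proportions to z-scores, fit a straight line over delay,
    and read off the PSS (50% point) and the JND (75% minus 50% point)."""
    nd = NormalDist()
    z = [nd.inv_cdf(p) for p in p_video_first]  # requires 0 < p < 1
    n = len(delays_ms)
    mx = sum(delays_ms) / n
    mz = sum(z) / n
    slope = (sum((x - mx) * (y - mz) for x, y in zip(delays_ms, z))
             / sum((x - mx) ** 2 for x in delays_ms))
    intercept = mz - slope * mx
    pss = -intercept / slope        # delay at z = 0, i.e. the 50% point
    jnd = nd.inv_cdf(0.75) / slope  # delay difference from 50% to 75%
    return pss, jnd
```

For data generated from a true cumulative Gaussian, the fit recovers the underlying PSS, and the temporal window of integration is then 2×JND wide.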

[2] A different way to determine the JND is to divide the distance from the 75% point to the 25% point by two, which gives the average distance of the 75% and 25% points from the 50% point. Due to the symmetrical shape of the (fitted) cumulative Gaussian distribution, both approaches yield identical JNDs.

1.3.2 Sensitivity derived from a synchrony judgment task

In a synchrony judgment (SJ) experiment, several methods for estimating sensitivity can be used. The synchrony boundaries define the range of delays that are predominantly judged to be synchronous, and can thus be seen as asynchrony detection thresholds (see also Figure 1.1). The 'audio first' synchrony boundary, which is always located at a negative, audio-leading delay, is generally closer to physical synchrony than the 'video first' synchrony boundary (Arrighi et al., 2006; Enoki et al., 2006; Lewkowicz, 1996). This suggests that observers are more sensitive to negative, audio-leading delays than to positive, video-leading delays (i.e., sensitivity to audio-visual asynchrony is asymmetrical). When the 'synchronous' responses are fitted with a Gaussian distribution, its standard deviation is commonly used as a measure of sensitivity (Arrighi et al., 2006; Vatakis et al., 2008; Zampini et al., 2005b,c). A similar measure of sensitivity is provided by the width of the synchrony range. Finally, the slopes of the 'synchronous' response curve at the synchrony boundaries indicate sensitivity at the transition from perceived synchrony to perceived asynchrony.
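As a sketch of the Gaussian-based SJ measures, assuming the 'synchronous' proportions have already been fitted with a scaled Gaussian and taking, hypothetically, a proportion of 0.5 as the synchrony criterion (the function name and criterion are assumptions, not from the cited studies):

```python
import math

def sj_measures(amplitude, mu_ms, sigma_ms, criterion=0.5):
    """From a Gaussian fitted to 'synchronous' response proportions,
    p(d) = amplitude * exp(-(d - mu_ms)**2 / (2 * sigma_ms**2)),
    derive the synchrony boundaries (where p crosses the criterion),
    the synchrony range, and the PSS as the midpoint of that range."""
    if amplitude <= criterion:
        raise ValueError("fitted curve never reaches the criterion")
    half_width = sigma_ms * math.sqrt(2.0 * math.log(amplitude / criterion))
    lo, hi = mu_ms - half_width, mu_ms + half_width
    return {"boundaries": (lo, hi), "range": hi - lo, "pss": (lo + hi) / 2.0}
```

Note that a single symmetric Gaussian cannot by itself capture the asymmetry described above (the 'audio first' boundary lying closer to physical synchrony); the fitted standard deviation summarizes overall sensitivity.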

1.3.3 Discrimination threshold

In contrast to the TOJ and SJ methodologies, which use judgments of individual audio-visual pairs presented using the method of constant stimuli, audio-visual discrimination thresholds are determined using two or three successive audio-visual pairs that are to be discriminated in tasks using the method of limits. In such experiments, subjects have to discriminate between a standard reference stimulus with a given audio-visual delay and a stimulus with a smaller or larger audio-visual delay.

One of the few studies of audio-visual discrimination thresholds was performed by McGrath and Summerfield (1985), who used a three-interval, two-alternative forced-choice procedure with synthetic audio-visual approximations to bilabial consonant-vowel syllables. In their procedure, the first interval always contained the physically synchronous reference (or standard) stimulus with which stimuli in the following two intervals had to be compared. Subjects were to indicate which of the two latter intervals contained the target stimulus, with an audio-visual delay different from the reference (the other stimulus always matched the standard). Depending on the subject’s response, the amount of audio-visual asynchrony in the target stimulus was changed adaptively. McGrath and Summerfield (1985) reported an average negative threshold of 79 ms and an average positive threshold of 138 ms.

Grant et al. (2004) used a two-interval, two-alternative forced-choice procedure with correct-answer feedback. Participants had to judge which of two films of a female talker appeared to be “out of sync.” One of the films was always presented


in (physical) synchrony, whereas the other film contained an adaptively controlled amount of audio-visual asynchrony. Grant et al. (2004) found a negative threshold of approximately 50 ms, and a positive threshold of approximately 200 ms. The same procedure was adopted by van de Par and Kohlrausch (2000), who used an animation of a white disc that accelerated downward until it hit a bar, after which it returned. This visual animation was accompanied by a short acoustic impact sound. Van de Par and Kohlrausch (2000) reported a negative threshold of 29 ms, and a positive threshold of 85 ms. In all these adaptive procedures, thresholds were defined as the audio-visual delay that led to a specific percentage of correct responses, e.g., 70.7% when using the 1-up, 2-down procedure (Levitt, 1971).
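The logic of such an adaptive track can be sketched as follows: a 1-up, 2-down staircase lowers the delay after two consecutive correct responses and raises it after each error, so that it converges on the delay yielding 70.7% correct (Levitt, 1971). All names, step sizes, and the simulated observer below are illustrative assumptions:

```python
import math
import random

def staircase_1up_2down(p_correct, start=200.0, step=20.0, n_trials=400, seed=0):
    """Simulate a 1-up, 2-down track; returns the delay (ms) at which the
    simulated observer is correct on about 70.7% of the trials."""
    rng = random.Random(seed)
    delay, streak, last_dir, reversals = start, 0, 0, []
    for _ in range(n_trials):
        if rng.random() < p_correct(delay):
            streak += 1
            if streak < 2:
                continue               # one correct response: no change yet
            streak, direction = 0, -1  # two correct in a row: make it harder
        else:
            streak, direction = 0, +1  # one error: make it easier
        if last_dir and direction != last_dir:
            reversals.append(delay)    # record direction reversals
        last_dir = direction
        delay = max(1.0, delay + direction * step)
    return sum(reversals[-8:]) / len(reversals[-8:])  # mean of last reversals

def p_correct(d):
    """Hypothetical 2AFC observer: chance (50%) at small delays,
    performance rising with the audio-visual delay."""
    return 0.5 + 0.5 / (1.0 + math.exp(-(d - 80.0) / 30.0))

threshold = staircase_1up_2down(p_correct)
```

For this simulated observer, the track should settle in the vicinity of the delay at which `p_correct` equals 0.707 (about 70 ms here).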

Although the reported thresholds vary, all studies reported that positive thresholds are larger than negative thresholds. This suggests greater sensitivity to an auditory advance than to a visual advance relative to the point of objective simultaneity (POS; although see Sinex, 1978, for some exceptions).

1.3.4 Comparison of sensitivity measures

Whereas TOJ data only yield a single measure of sensitivity, both synchrony judgment and temporal interval discrimination tasks yield two measures of sensitivity. Similar to the asymmetric sensitivity to audio-leading and video-leading delays in the synchrony judgment literature (section 1.3.2), differences in sensitivity to negative and positive audio-visual asynchronies were also demonstrated using threshold measurements (section 1.3.3). Explanations for this asymmetry in thresholds (Dixon and Spitz, 1980; Grant et al., 2004; McGrath and Summerfield, 1985; van de Par and Kohlrausch, 2000; see Alais and Carlile, 2005, for a recent review) refer to the natural temporal relations between an auditory and visual event in the real world, due to the relatively low speed of sound (see section 1.2.2). Grant et al. (2004) and van de Par and Kohlrausch (2000) suggested that, as a result of these natural temporal relations, the human perceptual system might have adapted its processing to tolerate and even bind sensory events over a range of relative delays into a “common event” interpretation that extends more liberally into the visual leading range than it does for stimulus pairs in which the auditory component leads.

It is not known if the larger positive thresholds observed at a 0-ms reference delay persist over a large range of reference delays, or if they are localized mainly



near the POS. It could also be expected that delay discrimination thresholds increase proportionally in both directions from the POS, in accordance with Weber’s Law (as demonstrated in the slope estimates from a TOJ study by Alais and Carlile (2005), in which they manipulated the apparent distance of a sound source to vary the PSS). Another possible prediction for the relative size of discrimination thresholds can be derived from the perceived synchrony of the reference delay. That is, if the perception of synchrony is categorical in nature, then discrimination thresholds should be large within the range of perceived synchrony, but decrease as test delays approach either side of the synchrony category boundary, where perceived synchrony yields to the clear perception of ‘audio first’ or ‘video first.’ These two possibilities are discussed in more detail in Chapter 3.
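These two candidate threshold profiles can be made concrete in a small numerical sketch. All constants below (Weber fraction, boundary locations, threshold levels) are illustrative assumptions, not values from the literature:

```python
def weber_threshold(ref_delay_ms, k=0.3, floor_ms=20.0):
    """Weber-like prediction: the discrimination threshold grows in
    proportion to the magnitude of the reference delay from the POS."""
    return floor_ms + k * abs(ref_delay_ms)

def categorical_threshold(ref_delay_ms, boundaries=(-50.0, 150.0),
                          inside_ms=120.0, edge_ms=40.0, scale_ms=50.0):
    """Categorical prediction: thresholds are largest deep inside the
    perceived-synchrony range and shrink toward either category boundary."""
    lo, hi = boundaries
    nearest = min(abs(ref_delay_ms - lo), abs(ref_delay_ms - hi))
    if lo <= ref_delay_ms <= hi:
        return edge_ms + (inside_ms - edge_ms) * min(1.0, nearest / scale_ms)
    return edge_ms

# Weber: thresholds increase monotonically away from the POS on both sides.
# Categorical: thresholds peak inside the synchrony range and dip at its
# boundaries, independent of the absolute delay magnitude.
```

Measuring thresholds at several asynchronous reference delays distinguishes the two profiles, since they diverge most clearly just outside the synchrony boundaries.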

1.4 Perceptual and cognitive aspects of intersensory timing

It has been suggested that the tolerance of small asynchronies between auditory and visual components of a single event may be explained by (1) the human ability to adapt to everyday exposure to auditory delays (Fujisaki et al., 2004; Navarra et al., 2005; Vroomen et al., 2004; Vatakis et al., 2008), (2) the “auditory capture of vision” (Aschersleben and Bertelson, 2003; Bertelson and Aschersleben, 2003; Morein-Zamir et al., 2003; Vroomen and Keetels, 2006), or (3) compensation for the distance of an audio-visual event (see, e.g., Alais and Carlile, 2005; Arnold et al., 2005; Lewald and Guski, 2004; Sugita and Suzuki, 2003). These explanations are discussed below in more detail, together with the associated perceptual phenomena. Furthermore, the effect of spatial disparity on PSS and JND values is also treated here.

1.4.1 Spatial disparity

In nature, auditory and visual components of a common event originate from the same location. In laboratory settings and technical systems, however, visual stimuli are often produced on computer or television screens, whereas auditory stimuli are produced by headphones or loudspeakers. As a consequence, auditory and visual stimuli need no longer originate from a common spatial position.

Spence et al. (2003) presented auditory and visual stimuli from two possible positions. Two target LEDs were positioned 26 cm on either side of a fixation LED (62 cm in front of the observer). The two loudspeakers were placed directly behind


each target LED. Flash-click stimuli were presented from either the same or a different spatial position with relative delays of ±10, ±30, ±55, ±90, and ±200 ms. Participants judged the temporal order of auditory and visual stimuli. No significant effect of spatial disparity on the PSS was found (+20 ms, on average). The JND values for the same position (53 ms), however, were significantly larger than JND values for the different position (42 ms). From these results Spence et al. concluded that spatial disparity improves the accuracy with which people can perform a temporal order judgment task. Zampini et al. (2003a) used an experimental set-up that was almost identical to the one used by Spence et al. (2003). The relative delays, however, were ±20, ±30, ±55, ±90, and ±200 ms. In Experiment 1, participants performed a ‘modality’ temporal order judgment (i.e., they indicated which modality came first). The JND values for stimuli presented from the same position (32 ms) were significantly larger than those for stimuli presented from different positions (22 ms), essentially confirming the results of Spence et al. (2003). Furthermore, Zampini et al. also found a significant shift in the PSS, which was 60 ms for the same condition, and 75 ms for the different condition.

In yet another very similar experimental set-up (Zampini et al., 2005b) participants performed an SJ2 task, but now for relative delays of 0, ±20, ±30, ±70, ±200 ms. Standard deviations of the synchrony curve (a fitted Gaussian) were significantly larger for the same condition (114 ms) than for the different condition (91 ms). Furthermore, PSS values were smaller for the same condition (19 ms) than for the different condition (32 ms). In line with results from TOJ tasks, participants were more likely to respond with ‘simultaneous’ when stimuli were presented from the same position, than when they were presented from different positions.

Zampini et al. (2003b) manipulated the spatial location of the auditory stimulus by presenting it either through headphones or from a centrally positioned loudspeaker. Fixation light and target lights were located directly in front of the loudspeaker. Participants judged the temporal order of auditory and visual stimuli presented at relative delays of ±20, ±30, ±55, ±90, and ±200 ms. No effect of the spatial position of the auditory stimulus was found on JND values (86 ms), nor on PSS values (29 ms) in Experiment 1. In subsequent experiments, Zampini et al. argued that the lower JNDs previously reported for stimuli originating from different spatial locations may critically depend on presentation of the stimuli to different sides of the body midline; i.e., lower JNDs are found when stimuli are presented to different cerebral hemispheres, but not when stimuli are presented from different spatial locations within the same hemifield.



The influence of hemispheric redundancy was also studied by Keetels and Vroomen (2005). Flash-click stimuli were presented from loudspeakers placed at 10°, 30°, and 50° to the left and right of fixation, and LEDs placed directly in front of the two loudspeakers at 10°. Participants had to judge the temporal order of audio-visual stimuli that were presented at various relative delays (0, ±30, ±60, ±90, ±120, ±240 ms) and various spatial disparities (same location, or at 20° or 40° spatial separation in same or different hemifields). No significant effect of spatial disparity on the PSS (+8 ms) was found. The JNDs, however, were smaller (1) when stimuli were presented from different spatial positions, (2) when stimuli were presented from different hemifields, and (3) for stimuli presented at 40° than for stimuli presented at 20° spatial disparity. Keetels and Vroomen (2005) concluded that audio-visual JNDs depend on the relative spatial and hemispheric disparity at which stimuli are presented.

In summary, spatial disparity may produce a shift in the point of subjective simultaneity, and improve accuracy in temporal order judgments. Such a performance difference was not demonstrated, however, when the auditory stimulus was presented over headphones instead of over a loudspeaker that was effectively in the same position as the visual stimulus.

1.4.2 Temporal recalibration

As a result of the different propagation speeds of light and sound, humans are accustomed to perceiving everyday events with auditory signals arriving later at the sensory receptors than the corresponding visual signals. As such, it is expected that observers are more tolerant of lagging audio than of lagging video when integrating the sensed components of an event. If our perceptual system has indeed adapted to the auditory delays present in our natural environment, it is possible that temporal recalibration occurs when participants are exposed for some time to delays that differ from those experienced in everyday life.

Fujisaki et al. (2004) tested the expectation of temporal recalibration by exposing participants to audio-visual tone-flash stimuli with different, constant delays for several minutes. Participants performed an SJ2 task after exposure to relative delays of -235 ms, 0 ms, and +235 ms (in separate sessions). Exposure to a delay of 0 ms resulted in a PSS value of -10 ms. After exposure to a delay of -235 ms an average PSS value of -32 ms was obtained, which constitutes a shift in the direction of more negative, audio leading delays compared to the 0-ms adaptation condition. A PSS value of +27 ms was


found after exposure to a +235 ms delay. Interestingly, there was a small but significant difference in the PSS between the no-adaptation condition (+4 ms) and the condition in which participants were exposed to physically synchronous stimuli (-10 ms). Fujisaki et al. (2004, p. 774) offered a possible explanation by stating that “the no-adaptation condition might have been more affected by pre-adaptation to the natural environment, in which audio signals tend to be delayed relative to visual signals.” Besides a PSS shift, Fujisaki et al. also found a widening of the synchrony range in the direction of the adapted lag.

Recalibration of subjective simultaneity was also studied by Vroomen et al. (2004). Similar to Fujisaki et al. (2004), they exposed participants for 3 min to flash-click stimuli with a constant relative delay of 0, ±100, or ±200 ms (in separate sessions). After exposure, half of the participants performed a temporal order judgment (TOJ) task, and the other half performed a simultaneity judgment (SJ2) task using the same stimuli as in the exposure phase. Results were not significantly different for the TOJ and the SJ2 task, although this may have been due to a lack of power in the between-subjects analysis. The PSS values were approximately -10 ms following exposure to negative, audio leading delays, and approximately +10 ms after exposure to positive, video leading delays. Vroomen et al. found that the effect of exposure leveled off for adaptation delays around ±100 ms. Furthermore, no PSS shifts were found in a control experiment using exposure lags of ±350 ms. Vroomen et al. suggest that temporal recalibration is limited for audio-visual exposure lags in the range where a sound can capture the perceived onset of a light (see the section on temporal ventriloquism further ahead).

Somewhat different results were reported by Navarra et al. (2005). They exposed participants to videotapes of audio-visual speech, or of a hand playing the piano, presented either in physical synchrony or with an auditory delay of 300 ms. Participants performed a temporal order judgment (TOJ) task using a (flash-click) stimulus that was different from the exposure stimulus. Furthermore, the TOJ task was performed while participants monitored the exposure stimulus (although some practice trials preceded the actual measurement phase). Unlike in the two studies above, no PSS shifts were reported. Similar to Fujisaki et al. (2004), however, Navarra et al. did find a widening of the temporal window of integration following exposure to asynchronous speech or music videos. That is, an average JND of 118 ms was reported after exposure to a synchronous stimulus, but an average JND of 135 ms was found after exposure to an asynchronous stimulus. No temporal recalibration effects were observed



after exposing participants to a delay of 1000 ms. Similar to Vroomen et al. (2004), Navarra et al. suggested that the audio-visual lag used in the asynchronous exposure stimulus must remain within the temporal window of integration.

Vatakis et al. (2008) used a procedure similar to that used by Navarra et al. (2005). They exposed participants to videotapes of audio-visual speech, presented in physical synchrony, or with an auditory delay of 300 ms. While participants monitored the exposure stimulus, they performed a TOJ or an SJ2 task (in separate sessions), using a flash-click stimulus. Similar to Navarra et al. (2005) they reported larger JNDs after exposure to asynchronous stimuli for TOJ data (139 ms vs. 117 ms), but no effect on PSS estimates. The SJ task, however, resulted in a significant PSS shift in the direction of the video leading exposure lag (+16 ms vs. +1 ms). Similar to Navarra et al. (2005) and Fujisaki et al. (2004), a wider synchrony range was found after asynchronous exposure for the SJ task (i.e., a larger standard deviation of the fitted Gaussian distribution; 157 ms vs. 139 ms).

The effect of spatial disparity on temporal recalibration was investigated by Keetels and Vroomen (2007). Participants were exposed for 3 min to flash-click stimuli with a delay of ±100 ms (in different sessions). The location of the auditory stimulus during exposure was manipulated as a within-subjects variable (but between blocks), such that it either (1) corresponded to the location of the light and the fixation point, or (2) was laterally displaced by a distance of 70 cm. After exposure, participants performed a TOJ task. The location of the auditory stimulus during the TOJ task was manipulated as a between-subjects variable, such that it was either always co-located with the light source (for half the subjects), or always laterally displaced (for the other half of the subjects). In line with the studies discussed above (Fujisaki et al., 2004; Vatakis et al., 2008; Vroomen et al., 2004), a PSS shift was reported when comparing exposure to -100 ms (-8 ms) with exposure to +100 ms (+5 ms). The location of the sound during exposure and measurement did not produce any main or interaction effects. Keetels and Vroomen (2007) thus concluded that spatial disparity does not affect temporal adaptation in the audio-visual domain. Since spatial disparity was shown to affect the accuracy with which participants can make temporal order judgments, this may seem a surprising result. Keetels and Vroomen offered an explanation for this finding by pointing out that the role of space in hearing is to steer vision, and that spatial co-localization thus need not be a requirement for intersensory pairing to occur.


task with relative delays from a Gaussian distribution either centered on +80 ms, or on -80 ms (i.e., no exposure phase prior to the TOJ task). They reported PSS shifts in the direction of the mean relative delay. Relative delays centered on +80 ms resulted in an average PSS of +86 ms, whereas relative delays centered on -80 ms yielded an average PSS of -49 ms. Their results are, however, open to an alternative explanation (different from temporal recalibration). The PSS shifts might also be explained by an observer bias, such as response frequency equalization (see also section 2.4).

In summary, depending on the delay in the adaptation stimulus, temporal recalibration may produce a shift in the point of subjective simultaneity, and may widen the synchrony range and the temporal window of integration. For temporal recalibration to occur, however, spatial co-localization is not a necessary condition.

1.4.3 Temporal ventriloquism

Temporal ventriloquism refers to a temporal bias in the perception of a visual stimulus in the direction of the occurrence of an auditory stimulus. Morein-Zamir et al. (2003) examined temporal ventriloquism by presenting irrelevant sounds in different temporal configurations while participants performed a visual temporal order judgment task. When sounds preceded the first visual stimulus and followed the second visual stimulus, a performance improvement was reported (i.e., lower JND values compared to a baseline condition in which sounds and lights were presented simultaneously). A performance deterioration was found, however, when the two sounds were presented between the two visual stimuli. Morein-Zamir et al. (2003) showed that the performance improvement is critically dependent on the second sound trailing the second light, for delays up to 225 ms. In terms of audio-visual synchrony perception, temporal ventriloquism may thus play a role by temporally aligning a visual stimulus and a subsequently (i.e., asynchronously) presented auditory stimulus (Spence and Squire, 2003).

The effect that Morein-Zamir et al. (2003) produced on JND values provided an indirect measure of the temporal attraction exerted by audition on vision. Aschersleben and Bertelson (2003), however, tried to assess the contributions of the auditory and visual modalities to temporal ventriloquism in a more direct way. In one experiment they asked participants to produce tapping movements that were simultaneous with a visual pacing stimulus, while ignoring an auditory distracter stimulus. In a second experiment they asked participants to tap along with an auditory pacing stimulus while ignoring a visual distracter stimulus. Distracter stimuli were presented with relative delays of



0 ms, ±15 ms, ±30 ms, and ±45 ms. Results showed that the timing of the taps was biased very strongly towards the auditory distracter. A significant bias was also found towards the visual distracter, but the effect was much smaller. Aschersleben and Bertelson (2003) concluded that auditory dominance in temporal processing is strong, but not total. Note that Aschersleben and Bertelson found an effect of both trailing and leading auditory (distracter) stimuli, whereas Morein-Zamir et al. (2003) did not find an effect of the first auditory stimulus preceding the first visual stimulus.

Vroomen and Keetels (2006) studied the effect of spatial disparity on temporal ventriloquism. In the first experiment the two visual stimuli were presented from locations 5° above or below central fixation. Sounds were presented from a speaker in between the two lights, or from a position at 90° to the left or right of the participant. Sounds preceded the first visual stimulus and followed the last visual stimulus by either 100 ms or 0 ms. Visual stimuli were presented with relative delays ranging from -75 ms (lower first) to +75 ms (upper first). Participants had to judge whether the lower or the upper light was presented first. No effects on the PSS were found. The JNDs, however, were smaller for the ±100 ms auditory delay (22 ms vs. 28 ms when auditory and visual stimuli were presented simultaneously). More importantly, no interaction was found between auditory delay and spatial disparity, indicating that temporal ventriloquism was unaffected by the spatial separation or correspondence between auditory and visual stimuli. Subsequent experiments confirmed the findings of the first experiment. Vroomen and Keetels offered the same explanation as for the absence of an influence of spatial disparity on temporal recalibration. That is, if the function of space in hearing is to steer vision, then there is no reason for spatial correspondence to be a necessary condition for multisensory interaction to occur.

In summary, it has been shown in the literature that both auditory and visual modalities may play a role in temporal ventriloquism, but that the auditory modality clearly dominates. For temporal ventriloquism to occur, spatial co-localization is not a necessary condition. Results regarding a leading auditory stimulus are contradictory. Furthermore, the potential effect on perceived simultaneity has not been established directly.

1.4.4 Distance compensation for temporal disparities

Somewhat related to the topic of temporal recalibration is the notion that people may compensate for the longer travel time of sound when perceiving an audio-visual event


that takes place at some distance. Sugita and Suzuki (2003) asked participants to judge the temporal order of white noise bursts (presented through headphones) and light flashes (produced by LEDs at different distances, 1–50 m, from the observer). They reported that PSS values increased with viewing distance in a way that was roughly consistent with the travel time of the sound signals for distances at least up to 10 m. Alais and Carlile (2005) used a constant visual stimulus (Gaussian-profile luminance blob) without depth information and white noise that sounded as if it was produced at different distances (5–40 m). They asked participants to judge the temporal order and, similar to Sugita and Suzuki (2003), found that “observers were taking account of the distance of the sound source” and “were attempting to compensate for the travel time from that source with a subjective estimate of the speed of sound [p. 2245].”
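For reference, the acoustic delays implied by these viewing distances follow directly from the speed of sound (about 343 m/s in air at room temperature; light travel time is negligible at such distances). A minimal computation:

```python
SPEED_OF_SOUND_M_S = 343.0   # approximate speed of sound in air at 20 °C

def acoustic_delay_ms(distance_m):
    """Arrival delay of the sound relative to the light, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000.0

# Delays over the 1-50 m range of distances used in these studies:
# about 2.9 ms at 1 m, 29.2 ms at 10 m, and 145.8 ms at 50 m.
delays = {d: acoustic_delay_ms(d) for d in (1, 5, 10, 20, 50)}
```

Full compensation would thus have to shift the PSS by roughly 146 ms at 50 m, which makes the absence of any distance effect in several of the studies discussed below all the more striking.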

Other studies (Arnold et al., 2005; Lewald and Guski, 2004; Stone et al., 2001; Washikita et al., 2007), however, have failed to demonstrate compensation for stimulus distance. Lewald and Guski (2004), for example, used loudspeakers and LEDs placed on a lawn at distances of 1, 5, 10, 20 and 50 m from the observer. Participants had to indicate the temporal order of the auditory and visual signals. Results showed that subjective simultaneity occurred at a +17 ms (auditory) delay, regardless of the distance of the stimulus. That is, compensation for the travel time of sound did not occur.

Arnold et al. (2005) used three different audio-visual tasks with sound presented both from loudspeakers and headphones (in separate sessions), and visual stimuli presented from four different distances in different sessions (approximately 1, 5, 10, and 15 m). In the stream-bounce task, participants were shown two dots that moved towards each other, became superimposed, and then moved away from one another. Sound was presented at relative delays in the range of -300 ms to +300 ms. Participants had to indicate whether the two dots were passing through, or bouncing off, one another. In the causal attribution task participants were shown two visual collisions, but heard only one sound (also presented at delays in the range of ±300 ms). Participants had to indicate whether the tone sounded as if it was being produced by the collision of the upper, or the lower, pair of dots. In the TOJ task, participants were shown a moving dot that approached, became superimposed on, and then moved away from a stationary dot. Sound was presented with relative delays in the range of -200 ms to +200 ms. Participants had to indicate whether they felt that the timing of the tone was too early or too late to be consistent with the collision of the dots. No perceptual compensation for the difference between the speeds of light and sound was observed in any of the tasks. As such, Arnold et al. suggest that



visual and auditory events that reach an observer at the same time become perceptually bound, even when they could not have originated from the same (distant) event. Compensation was observed for headphone-presented sounds only when one of the authors imagined that the visual and auditory stimuli had originated from a common source. From this they concluded that the origin of the compensation is not perceptual, but rather cognitive.

Given the contradictory results in the literature, it is unclear to what extent distance compensation plays a role in audio-visual synchrony perception.

1.5 Stimulus complexity

Although it is not immediately clear from the literature overview presented in Table 1.1 in section 1.2 that stimulus complexity should have an effect on perceived simultaneity (but see Arrighi et al., 2006; Enoki et al., 2006; Vatakis and Spence, 2006a), it seems likely that the amount of information available in a stimulus should influence an observer’s ability to compare the moments of occurrence of the auditory and visual components of an event (e.g., Spence, 2007). Indeed, Aschersleben (1999) has suggested that differences in the ecological validity of the stimuli could explain some of the conflicting results reported in the TOJ and SJ literature. She considered the bouncing disk stimulus used by Lewkowicz (1996) to be more ecologically valid than the flash-click stimuli used in other studies. In a flash-click stimulus, for example, only single and contextually-isolated auditory and visual events are present. Besides lacking ecological validity, or everyday familiarity, such stimuli also do not enter into an obvious causal relationship. Therefore, it is likely that a clear expectation of the temporal relationship will be lacking as well, and thus different PSS values and sensitivities could be expected for these stimuli than for stimulus pairs that support such a temporal expectation. A simple impact event, such as a ball bouncing off a surface, for example, promotes a clear impression of a moving stimulus leading to an event with a causal interpretation (the bouncing sound is caused by the ball hitting the surface) and thus a more predictable temporal relation: the auditory component is not expected to precede the visual impact of the ball.
Furthermore, the ecologically invalid occurrence of sound preceding video should be more easily detected because, unlike with a flash-click stimulus, the occurrence of the visual event (the moment of impact) can be predicted from the context of the ball’s trajectory, which is continuously available over


some observation period. Enoki et al. (2006) indeed found that asynchrony detection thresholds for leading auditory components are smaller when observers can predict the moment of impact of a moving ball, compared to the situation in which the ball suddenly appears, as in a flash-click stimulus.

In section 1.2 the expectation of positive PSS values was expressed as a possible consequence of everyday exposure to auditory delays, and of the faster neural conduction time for sound. A third contributory factor might be the apparent causal relationship present in the audio-visual stimulus. That is, most real-world events occur in a continuous visual context in which sporadic auditory spikes occur as the result of predictable collisions and other contacts between objects. A related factor is the possibility to predict the visual component: Enoki et al. (2006) showed that, for auditory and visual components to be perceived as synchronized, a larger auditory delay is required when such prediction is possible than when it is not (70 ms vs. 35 ms).

Visual predictive information is not strictly necessary for an obvious causal relationship to be promoted by an audio-visual stimulus. For example, Stekelenburg and Vroomen (2007) used a video clip of a person tearing apart a paper sheet, which offers no visual predictive information, but does support a clear causal interpretation. In simple audio-visual events, however, it is often the case that the visual modality conveys information that allows the observer to predict the moment of occurrence of the auditory component (e.g., watching a hammer moving towards and striking a peg that produces a sound at impact). Although our perception of physical objects is mediated by our senses, and physical objects are thus never perceived directly, it is customary to refer to visual percepts as objects (Julesz and Hirsh, 1972). Furthermore, both the occurrence and the perceived properties of sound appear “to be completely explained by reference to the objects that gave them birth” (Walter Murch, in Chion, 1994, p. xvi). As a result, the auditory component of an audio-visual event resulting from the physical impact between two objects is perceived to result from the visual impact. In line with such a causal relationship is the expectation that, for an impression of simultaneity to occur, the visual component (i.e., “the cause”) should occur in close temporal proximity to the auditory component (i.e., “the effect”), but the cause should precede the effect.



1.6 Overview of this thesis

The main motivation for the work described in this thesis was to find an explanation for the unnatural, negative PSS values reported in the literature. Such negative PSS values are almost exclusively reported for temporal order judgment (TOJ) studies, and only rarely so for synchrony judgment (SJ) studies. Therefore, the focus of the first study, described in Chapter 2, was to compare the effects of TOJ and SJ tasks on estimates of subjective simultaneity. Since it has been suggested that stimulus complexity has an influence on the resulting PSS value (Arrighi et al., 2006; Enoki et al., 2006; Vatakis and Spence, 2006a), two stimuli of varying complexity were used.

The next chapter focuses on sensitivity to audio-visual temporal intervals. It is known that observers are more sensitive to audio-leading asynchronies than to video-leading asynchronies. That is, negative thresholds are smaller than positive thresholds. It is not known, however, whether this threshold asymmetry persists over a larger range of standard delays. In Chapter 3 two hypotheses were tested regarding thresholds at asynchronous reference delays: (1) discrimination thresholds follow Weber’s Law, and (2) discrimination thresholds are related to perceived synchrony of the reference delay. Discrimination thresholds were measured for negative and positive standard delays for the same participants and stimulus types as in Chapter 2.

Besides the complexity of the audio-visual stimulus, it has been argued that the apparent causal relationship in a stimulus also influences the perception of audio-visual synchrony. That is, apparent causality should result in a clear expectation regarding the temporal order in which different signals are generated (i.e., the effect cannot precede the cause), and thus perceived. In order to study the effect of apparent causality, one needs a stimulus in which the apparent causal relationship can be systematically manipulated. Such a stimulus is presented in Chapter 4, which explores the effects of visual predictive information and apparent causality on perceived synchrony. As in Chapter 2 the effect of experimental method was also included, but now limited to the use of TOJ and SJ3 tasks.

In the final experimental chapter (Chapter 5) the influence of visual predictive information and apparent causality on asynchrony detection thresholds was explored. Discrimination thresholds were measured using the same stimuli and participants as in Chapter 4. Also, the relationship between asynchrony detection thresholds and synchrony judgment data was further investigated.


2 Audio-visual synchrony and temporal order judgments: Effects of experimental method and stimulus type

When we perceive an audio-visual event in our natural environment, there will always be a physical delay between the arrival of the leading visual component and the trailing auditory component. Assuming that our perceptual system has adapted to these naturally occurring delays, this natural temporal relationship suggests that the point of subjective simultaneity (PSS) should occur at an auditory delay larger than or equal to 0 ms. A review of the literature suggests that PSS estimates derived from a temporal order judgment (TOJ) task differ from those derived from a synchrony judgment (SJ) task, with (unnatural) auditory-leading PSS values reported mainly for the TOJ task. Data are reported for two stimulus types that differed in complexity: (1) a flash and click, and (2) a bouncing ball and impact sound. The same subjects judged the temporal order and synchrony of both stimulus types, using three experimental methods: (1) a TOJ task with two response categories (‘audio first,’ and ‘video first’), (2) an SJ task with two response categories (SJ2; ‘synchronous,’ and ‘asynchronous’), and (3) an SJ task with three response categories (SJ3; ‘audio first,’ ‘synchronous,’ and ‘video first’). Both stimulus types produced correlated PSS estimates with the SJ tasks, but PSS values estimated using the TOJ procedure were uncorrelated with those obtained from the SJ tasks. The results suggest that the SJ task should be preferred over the TOJ task when one is primarily interested in perceived audio-visual synchrony.

This chapter is based on van Eijk, Kohlrausch, Juola, and van de Par, “Audio-visual synchrony and temporal order judgments: Effects of experimental method and stimulus type,” accepted for publication in Perception & Psychophysics.


2.1 Introduction

In this chapter a study is presented of the effects of experimental method on PSS and sensitivity estimates using all three judgment methods reported in the literature, as well as two different stimuli of varying complexity. The study reported here is the first that compares the effects of both stimulus complexity and three temporal judgment methods (SJ2, SJ3, and TOJ) in a completely within-subjects design. A simple stimulus was used without motion or visual predictive information (i.e., a light flash and a click sound), and a more complex stimulus with natural motion and visual predictive information (i.e., a bouncing ball and a subjectively appropriate impact sound). Measurements for both stimulus types were made using TOJ, SJ2, and SJ3 tasks in separate, counterbalanced sessions with the same set of participants. The aim was to collect evidence within a single study to reveal whether the unnatural, negative PSS values reported in the literature may be explained by differences in experimental methods (see section 1.2.3).
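For readers unfamiliar with how PSS values are typically derived from judgment data, the following sketch illustrates the standard approach: a cumulative Gaussian psychometric function is fitted to the proportion of ‘video first’ responses as a function of audio delay, and the PSS is the delay at which that function crosses 50%. The simple grid-search fit and the response proportions below are illustrative only; they are not the analysis code or data of this study.

```python
import math

def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian psychometric function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def fit_pss(delays_ms, p_video_first):
    """Grid-search fit of a cumulative Gaussian to the proportion of
    'video first' responses. The fitted mean mu is the PSS (50% point);
    sigma characterizes judgment sensitivity (a shallower slope means
    lower sensitivity)."""
    best = (float("inf"), 0.0, 1.0)
    for mu in range(-150, 151):            # candidate PSS values (ms)
        for sigma in range(10, 301, 5):    # candidate slope parameters (ms)
            err = sum((cum_gauss(d, mu, sigma) - p) ** 2
                      for d, p in zip(delays_ms, p_video_first))
            if err < best[0]:
                best = (err, float(mu), float(sigma))
    return best[1], best[2]

# Hypothetical TOJ data: audio delay in ms (negative = audio leads)
delays = [-200, -100, -50, 0, 50, 100, 200]
p_vf = [0.02, 0.10, 0.25, 0.50, 0.75, 0.90, 0.98]
pss, sigma = fit_pss(delays, p_vf)
```

In practice a maximum-likelihood fit (e.g., probit regression) would be preferred over this least-squares grid search, but the definition of the PSS as the 50% point is the same.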

2.2 Method

Two stimulus types were used in the current experiment: a flash-click stimulus, and a bouncing ball stimulus. The measurements for the two stimulus types were initially carried out in two separate parts, but are described here as a single experiment.

The flash-click stimulus has a long history in SJ and TOJ research (e.g., Exner, 1875; Hirsh and Sherrick, 1961; Jaśkowski et al., 1990). Given the absence of motion cues, predictive information, and apparent causality (although see Whipple, 1899), the flash-click stimulus may very well be the simplest stimulus in the context of auditory-visual temporal perception research. As such it is an excellent reference stimulus for the comparison of stimulus types of different complexity.

To investigate the effect of stimulus type on the PSS and judgment sensitivity, a second stimulus was used: a simulation of a bouncing ball. This stimulus differs from the flash-click stimulus in two important, and somewhat related, aspects: (1) motion, and (2) predictive information. Whereas the flash-click stimulus contained no motion, but a suddenly appearing and disappearing circle, the bouncing ball stimulus contains a circle that is continuously present in the visual scene and that produces a visual event by moving toward, and apparently contacting, a horizontal bar (and then bouncing back up again).

Several years passed between the measurements with the bouncing ball stimulus for subjects 1–3 and 12 (see also van de Par et al., 2002) and those for subjects 4–11. In the intervening period, equipment changes led to minor differences in the experimental method. Described here is the method used for the most recent measurements (participants 4–11). Given that the changes in method were relatively minor (see van de Par et al., 2002), and that a repetition of these measurements by participant 2 yielded almost identical results, it is expected that these changes had little or no influence on the results and conclusions presented here.

2.2.1 Participants

Flash-click stimulus

Three female and eight male subjects participated voluntarily. Four participants were familiar with the details of the experiment (i.e., they were the four authors of the corresponding paper). All participants had previously completed measurements with the bouncing ball stimulus. Participants varied in age over the range 26–58 years with a mean of 35. All participants reported normal or corrected-to-normal vision and normal hearing.

Bouncing ball stimulus

Four female and eight male subjects, including the four informed ones, participated voluntarily. Participants varied in age from 23–53 years with a mean of 32. All participants reported normal or corrected-to-normal vision and normal hearing.

2.2.2 Apparatus

The visual stimulus was shown on a Dell D1025HE CRT computer monitor at a resolution of 1024 × 768 pixels and at an 85-Hz refresh rate. The auditory stimulus was played through a Creative SB Live! sound card, a Fostex PH-50 headphone amplifier, and Sennheiser HD 265 linear headphones. Participants were seated in front of the monitor at an approximate distance of 60 cm and responded using a keyboard. The setting was a dimly-lit, sound-attenuated room.


In a calibration procedure the timing control of the auditory and visual signals was determined by using a photo cell that measured the light emitted by a flashing circle on the screen. The position of the flashing circle corresponded to the region where the visual event was presented during the experiments. The electrical output of the photo cell was shown as a trace on an oscilloscope together with a pulsed tone that was played in synchrony with the flashing circle, using the same equipment that presented stimuli in the present study. In these calibration conditions, the synchrony between the flash and the tone was shown to be accurate to within ±2 ms.
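To illustrate the kind of offset computation such a calibration involves, the audio-visual offset can be estimated from recorded photo-cell and audio traces by comparing their threshold-crossing onsets. The traces, threshold, and sample rate below are hypothetical; this is not the actual calibration software used in the study.

```python
def onset_index(samples, threshold=0.5):
    """Return the index of the first sample exceeding the threshold."""
    for i, v in enumerate(samples):
        if v > threshold:
            return i
    raise ValueError("no onset found")

def av_offset_ms(photo_trace, audio_trace, fs_hz):
    """Audio onset minus visual onset, in ms (positive = audio lags video)."""
    return (onset_index(audio_trace) - onset_index(photo_trace)) * 1000.0 / fs_hz

# Toy traces sampled at 10 kHz: light onset at sample 10, sound at sample 15
photo = [0.0] * 10 + [1.0] * 20
audio = [0.0] * 15 + [1.0] * 15
offset_ms = av_offset_ms(photo, audio, fs_hz=10_000)  # 0.5 ms
```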

In section 1.4 the influence of spatial disparity on sensitivity to audio-visual temporal intervals was discussed. As no influence of headphone presentation was found (Zampini et al., 2003b), it can be expected that the presentation of the auditory stimulus over headphones in the present study (as opposed to using speakers) did not affect the resulting judgments.

2.2.3 Stimulus

Flash-click

The visual part of the flash-click stimulus consisted of a white disc (97 cd/m², as measured using an LMT L1003 luminance meter) shown during 1 frame (12 ms) at a central position on the screen. The disc had a diameter of 49 pixels and subtended a visual angle of about 1.4° at an unconstrained viewing distance of about 60 cm. The total duration of the visual stimulus was 2 s, during which four corners of a surrounding square were visible to indicate the central location of the flash. The square was presented to give equivalent spatial and temporal information about the upcoming flash and click that participants had in viewing the bouncing ball stimuli. The occurrence of the flash was randomized with the timing restriction that it occurred within the time window of 500–1500 ms after the onset of the surrounding square. The acoustic part of the stimulus consisted of a 12-ms white noise burst with a sound pressure level of 67 dB.
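The reported visual angle can be checked with elementary trigonometry. The pixel pitch assumed below (~0.30 mm) is a plausible value for such a CRT at this resolution, but it is not stated in the text.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an object viewed frontally."""
    return math.degrees(2.0 * math.atan2(size_cm / 2.0, distance_cm))

PIXEL_PITCH_CM = 0.030           # assumed pixel pitch (~0.30 mm), not from the text
disc_cm = 49 * PIXEL_PITCH_CM    # 49-pixel disc diameter -> about 1.47 cm
angle = visual_angle_deg(disc_cm, 60.0)  # about 1.4 degrees at 60 cm
```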

Bouncing ball

The visual part of the bouncing ball stimulus consisted of a white disc (identical to the disc in the flash-click stimulus) against a black background. An animation showed the disc apparently moving down over a distance of 460 pixels towards a horizontal bar, impacting it, and then bouncing back up again with the same (constant) acceleration.
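The described trajectory can be sketched as a constant-acceleration fall followed by a mirrored bounce under the same acceleration. The frame rate and fall distance come from the apparatus and stimulus descriptions above; the fall duration is an assumed example value, as the actual animation timing is not specified here.

```python
FPS = 85         # monitor refresh rate (frames per second), from the apparatus
DROP_PX = 460    # fall distance in pixels, from the stimulus description
T_FALL = 0.5     # assumed fall duration in seconds (example value only)
ACCEL = 2 * DROP_PX / T_FALL ** 2   # px/s^2, chosen so that y(T_FALL) = DROP_PX

def disc_y(t):
    """Disc position (px below the start point) at time t in seconds:
    an accelerating fall until impact, then a mirrored bounce upward
    under the same constant acceleration."""
    if t <= T_FALL:
        return 0.5 * ACCEL * t ** 2
    tb = t - T_FALL             # time elapsed since the impact
    v0 = ACCEL * T_FALL         # speed at the moment of impact (px/s)
    return DROP_PX - v0 * tb + 0.5 * ACCEL * tb ** 2

# Per-frame positions for one fall-and-bounce cycle at 85 Hz
frames = [disc_y(n / FPS) for n in range(int(2 * T_FALL * FPS) + 1)]
```

By symmetry the disc returns to its starting height at t = 2 × T_FALL, so the visual impact, and hence the moment at which the sound is expected, is fully predictable from the approach phase.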
