
3 | Literature Review

3.1 Direct Scaling

3.1.1 Double-Blind Triple-Stimulus

The double-blind triple-stimulus test is described in ITU-R Recommendation BS.1116-1 [29]. It is used to assess systems that introduce sounds with very small impairments.

It is recommended that participants have a certain level of expertise; they should be selected based on their experience with listening tests. A training phase is also recommended so that the test subjects can familiarize themselves with the test. This training procedure can also be used to select the most qualified test subjects.

The double-blind triple-stimulus test presents the test subject with three stimuli: "A", "B" and "C". The test subject is asked to rate the impairment of B compared to A, and of C compared to A, on a scale of 1 to 5 (see Table 3.1).

The double-blind triple-stimulus test was used by Lokki and Savioja [15] in a study where auralisations were validated by both objective and subjective means. Multiple listening test methods were used in this research. However, the authors do not mention their reason for choosing the double-blind triple-stimulus test, nor the pros and cons of the methods. The test subjects were asked to compare the spatial and timbral differences between recorded and auralised sound fragments. Based on the results of this listening test, Lokki and Savioja [15] concluded that the auralisations in their study were almost indistinguishable from the actual recordings. However, transient signals such as a snare drum did yield auralisations with audible differences.

Table 3.1: Answering scale of the double-blind triple-stimulus test [29].

Impairment Grade

The same listening test was used by Lindau et al. [30] in their research on Headphone Transfer Functions (HpTFs). An HpTF describes the response of a headphone and its coupling with the ears [31]. HpTFs are generated by playing known signals through the headphone and recording the result with in-ear microphones. Non-individual, individual and generic HpTFs were compared in this research. Because one of the two samples ("B" or "C") is identical to the reference ("A"), one slider should always be set to 5 (imperceptible). Test subjects could therefore use only one slider to rate the difference; the other slider had to be set to 5. This is an advantage of the double-blind triple-stimulus test: it forces the test subject to grade the hidden reference as imperceptible. It also allows the experimenter to test the subject's ability to detect the artifacts of the test stimuli when compared to the reference [29].
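The hidden-reference check described above can be illustrated with a small sketch. It is not taken from BS.1116-1 [29] or from Lindau et al. [30]; the function name `screen_trial` and the convention of reporting a difference grade (test grade minus hidden-reference grade) are assumptions made here for illustration.

```python
def screen_trial(grade_b, grade_c, hidden_reference):
    """Evaluate one double-blind triple-stimulus trial (illustrative only).

    grade_b, grade_c: grades given to stimuli B and C on the 1-to-5 scale.
    hidden_reference: "B" or "C", the stimulus that is identical to A.
    Returns (difference_grade, reference_identified).
    """
    if hidden_reference not in ("B", "C"):
        raise ValueError("hidden_reference must be 'B' or 'C'")
    for grade in (grade_b, grade_c):
        if not 1.0 <= grade <= 5.0:
            raise ValueError("grades must lie between 1 and 5")

    if hidden_reference == "B":
        reference_grade, test_grade = grade_b, grade_c
    else:
        reference_grade, test_grade = grade_c, grade_b

    # Assumed convention: a negative value means the test stimulus was
    # graded as more impaired than the hidden reference.
    difference_grade = test_grade - reference_grade

    # The subject identified the hidden reference if it was graded 5.
    reference_identified = reference_grade == 5.0
    return difference_grade, reference_identified
```

For example, `screen_trial(5.0, 3.5, "B")` returns `(-1.5, True)`: the subject left the hidden reference at 5 and graded the test stimulus 1.5 points lower. A trial such as `screen_trial(4.0, 5.0, "B")` returns `(1.0, False)`, indicating that the subject mistook the hidden reference for the impaired stimulus.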

Hiekkanen et al. [32] also used an adaptation of ITU-R Recommendation BS.1116-1 [29] for their research on the difference between real and virtualized loudspeakers. Auralisations make it possible to present test subjects with stimuli from different speakers originating from the same position. Differences between the real and virtual loudspeaker were evaluated using five attributes. The first three are related to source localization: apparent sound width, direction of events and distance to events. The first describes how the width of the source is perceived by the subjects. The other two describe the perceived direction and distance of the source relative to the (virtual) position of the subject. The fourth attribute is spaciousness, which is described as the amount of space present in the stimuli. The fifth attribute, tone colour, describes the spectral content. The same answering scale as in Table 3.1 was used.

A similar listening test was used by Paul [33], who compared three instead of two different stimuli with the reference. Test subjects had to mark their rating on a linear scale printed on paper, ranging from very similar to very different. The answering scale has no reference points on which subjects can base their scaling of the perceived difference. This can potentially lead to significant differences in ratings between subjects. Although the scale is continuous from 0 to 10, the marks were transformed into numerical values with a precision of 0.5.
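The transformation of a continuous mark into a value with a precision of 0.5 can be sketched as follows. Paul [33] does not state, as far as summarized here, how marks between two steps were rounded; rounding half up is an assumption, and the name `quantize_mark` is hypothetical.

```python
import math


def quantize_mark(mark, step=0.5, low=0.0, high=10.0):
    """Map a continuous mark on [low, high] to the nearest multiple of step.

    Ties are rounded up (an assumption; the source does not specify this).
    """
    if not low <= mark <= high:
        raise ValueError("mark lies outside the answering scale")
    # floor(x + 0.5) rounds half up, unlike Python's built-in round(),
    # which rounds ties to the nearest even number.
    return math.floor(mark / step + 0.5) * step
```

With this sketch, a mark measured at 3.3 becomes 3.5 and a mark at 3.2 becomes 3.0, so two marks that differ by only 0.1 on paper can end up half a point apart after the transformation.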

3.1.2 MUltiple Stimulus with Hidden Reference and Anchors

Another frequently used test is described in Recommendation ITU-R BS.1534-1 [34], also known as MUltiple Stimulus with Hidden Reference and Anchors (MUSHRA). This codec listening test is used for the subjective assessment of audio coding formats such as MP3, AAC and FLAC. The difference between the double-blind triple-stimulus test and MUSHRA is that MUSHRA is more suitable for evaluating audio of lower quality. It is better at discriminating quality differences that, with other test methods, might cluster in the lower half of the scale [28]. With MUSHRA, the test subject is presented with multiple audio fragments: the reference signal, one or more impaired audio fragments, a hidden reference and a hidden anchor. This anchor is a low-pass filtered version of the unprocessed signal (the reference). When test subjects are presented with the different stimuli and asked to rate the differences, the low-pass filtered signal should be given the lowest score and the hidden reference the highest. These two 'extreme' stimuli encourage the test subjects to use a broad rating range. This range is what makes MUSHRA suitable for evaluating audio of lower quality [28]: difference ratings for multiple stimuli of lower quality could otherwise accumulate at the bottom end of the answering scale shown earlier (see Table 3.1) [28]. An example of the interface used for MUSHRA tests can be seen in Fig. 3.2.
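The role of the hidden reference and the anchor can be illustrated with a short check on the ratings of a single trial. This is a sketch under stated assumptions rather than a procedure prescribed by BS.1534-1 [34]: the 0-to-100 score range, the dictionary layout and the name `check_mushra_trial` are all chosen here for illustration.

```python
def check_mushra_trial(ratings, reference_key="hidden_reference",
                       anchor_key="anchor"):
    """Check whether one subject used the two 'extreme' stimuli as intended.

    ratings: dict mapping a stimulus label to its score (assumed 0 to 100).
    Returns a dict with two flags and the rating range that was used.
    """
    highest = max(ratings.values())
    lowest = min(ratings.values())
    return {
        # The hidden reference should receive the highest score.
        "reference_rated_highest": ratings[reference_key] == highest,
        # The low-pass filtered anchor should receive the lowest score.
        "anchor_rated_lowest": ratings[anchor_key] == lowest,
        # A broad range suggests the extremes spread out the other ratings.
        "range_used": highest - lowest,
    }
```

For a trial rated `{"hidden_reference": 100, "anchor": 20, "codec_a": 65, "codec_b": 40}` (hypothetical labels and scores), both flags are true and the range used is 80. A trial in which an impaired fragment scores above the hidden reference would set the first flag to false, which could be a reason to inspect that subject's ratings more closely.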

Watanabe et al. [35] used the MUSHRA method in their study on the acoustics of a virtual renaissance theatre. Actual recordings in the theatre were compared with convolved stimuli based on the measured transfer functions of the different loudspeakers used in the theatre. They used the MUSHRA method to present the test subject with five different signals. Three perceptual attributes (Apparent Source Width (ASW) [36], Listener Envelopment (LEV) [36] and Clarity) were chosen to be assessed by the test subjects.

Figure 3.2: An example of the user interface used in the blind grading phase [34]

3.1.3 AB Comparison

Malecki et al. [37] used the AB method for their research on the assessment of multi-channel auralisations. The RIRs were recorded in a test room using both a multi-microphone technique and a SoundField-type microphone. These RIRs were convolved with the anechoic signals. The anechoic signals were also used to record reference samples in the same test room. The recommendations of ITU-R BS.1284-1 [38] were followed. The test was performed in three series, each using a different sound system: a 5.0 surround system, a stereophonic system and headphones. The test subject group consisted of eight people, all of whom had at least a few years of experience with listening tests. During the test, the subjects were presented with pairs of samples, which could randomly consist of different or identical samples. Test subjects first had to answer whether the samples were different or equal. If they judged the samples to be different, they were asked to express the difference on a scale ranging from 1 to 5 (see Table 3.2), although answering this question was not mandatory. However, this scale is not based on any ITU recommendation. The detection rate of the auralisations was found to be significant for all three systems. The stereophonic system had a notably lower detection rate (63%) than the headphone system (77%) and the 5.0 system (83%).
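To illustrate what a significant detection rate means, the following sketch computes a one-sided exact binomial p-value against guessing. It is not the analysis reported by Malecki et al. [37]: the chance level of 0.5, the trial counts in the example and the name `detection_p_value` are assumptions made here.

```python
from math import comb


def detection_p_value(n_detected, n_trials, chance=0.5):
    """Probability of at least n_detected correct answers by guessing.

    Assumes independent trials with a fixed chance level (0.5 by default),
    which is an illustrative simplification.
    """
    if not 0 <= n_detected <= n_trials:
        raise ValueError("n_detected must lie between 0 and n_trials")
    # Sum the binomial probabilities of n_detected, ..., n_trials successes.
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_detected, n_trials + 1)
    )
```

As a hypothetical example, 8 detections in 10 trials give a p-value of 56/1024, or about 0.055, which would not be significant at the 5% level, whereas the same 80% rate over a larger number of trials would be. A detection rate therefore cannot be judged without the number of trials behind it.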

An AB comparison method was also used by Lokki and Järveläinen [39] in a study on the subjective evaluation of auralisations, in which real-head recordings were compared with auralisations. In total, 24 pairs of auralisations and recordings were used for the comparison. To measure the reliability and bias of the test subjects, eight pairs of identical samples were mixed in with the other pairs. The test subject was presented with an auralisation and the accompanying recording and asked to rate the difference between the two samples.

Table 3.2: Indirect scaling of differences used by Malecki et al. [37]

Impairment                                            Grade
The differences are hard to notice                    1.0
Differences based on the noise and crackles           2.0
Small differences based on the quality and sound      3.0
Big differences in sound                              4.0
Very big differences                                  5.0