
6 | Listening Tests

6.4 Listening Test Procedure

The listening tests were performed in meeting room 2.03, located in the Zwarte Doos on the TU/e campus in Eindhoven. The room was chosen for its low noise level and its relatively remote location, which minimized disturbance by potential passers-by. It is a meeting room without windows. Subjects were asked to take their place in front of the test laptop, after which they received their handout (see Appendix H). For both tests, the test subjects were asked to stop if they had not finished their tests within 20 minutes. Generally, this time window was sufficient for the subjects to finish the test. The subjects were given a short break between both tests. The procedure of the second test was explained during these breaks. Again, the subjects had 20 minutes to complete the tests.

6.4.1 Double-Blind Triple-Stimulus Test

The first test performed by the subjects was the double-blind triple-stimulus test. The test subject was presented with two sliders and three play buttons on the GUI (Fig. 6.3). Button A always played the reference, i.e. the measurement. Buttons B and C randomly played the simulation or the measurement. The test subject was asked to rate the difference between samples B and A and between samples C and A on a scale of 1 to 5. The rating scale for this test is based on that used by Lokki and Järveläinen [39]. Since one of the two samples is identical to the reference, one of the samples has to be rated with 5. This was enforced by automatically setting the rating of one slider to 5 when the other slider was moved. It was still possible for the test subject to rate both samples with 5. Every sample was assessed 5 times by the subjects. The samples were assessed in a randomized order.
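The trial logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_trials` and `on_slider_moved`, the dictionary layout and the sample labels are hypothetical and are not taken from the actual test GUI.

```python
import random


def build_trials(sample_ids, repetitions=5, seed=None):
    """Return a shuffled trial list; each trial hides the reference behind B or C."""
    rng = random.Random(seed)
    trials = []
    for sample_id in sample_ids:
        for _ in range(repetitions):
            # Button A always plays the measurement (the reference).
            # One of B/C replays the measurement, the other plays the simulation.
            hidden_reference = rng.choice(["B", "C"])
            trials.append({"sample": sample_id, "hidden_reference": hidden_reference})
    # Randomize the order in which the trials are presented.
    rng.shuffle(trials)
    return trials


def on_slider_moved(ratings, moved, value):
    """Store the moved slider's value and force the other slider to 5."""
    other = "C" if moved == "B" else "B"
    ratings[moved] = value
    ratings[other] = 5
    return ratings


# Hypothetical sample labels: 10 samples, 5 repetitions each.
trials = build_trials([f"SA{i}" for i in range(1, 11)], repetitions=5, seed=1)
print(len(trials))
```

Forcing the untouched slider to 5 ensures that at least one of the two samples is always rated as identical to the reference, while still allowing a subject to rate both with 5.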

6.4.2 Signal Detection Theory Test

The second test was the SDT-test. The subject was presented with one play button and two check-boxes (see Fig. 6.3). The question the subjects had to answer was whether the sample was real or simulated. They were only allowed to listen to the sample once. Next, the subject had to answer the question by ticking one of the two boxes and proceed to the next question. The samples were repeated 10 times. Again, the order of the samples was randomized.

6.5 Results

The tests were performed by 21 subjects (average age 25.8). For both tests, the results of one subject were unavailable due to a technical error. Four subjects mentioned that they suffered from some kind of hearing disability. Although these disabilities varied widely in significance, it was decided to remove the results of all four subjects from the analysis. Their results can nevertheless be found in Appendix E and F. An analysis of the test subject reliability can be found in Appendix C.

Figure 6.3: GUIs: a. SDT b. double-blind triple-stimulus

Test power calculations per tested sample were performed, using the method of Bruun Brockhoff and Schlich [86] to calculate the final sample size. As mentioned in section 6.3.3, assumptions on the homogeneity of the samples were made beforehand. Afterwards, the answers turned out to be more homogeneous than assumed for most test subjects, which meant that a higher sample size could be used for the analysis [86].

The test answers for the double-blind triple-stimulus test were highly homogeneous and led to a high sample size ranging from 77 to the maximum value of 80 for all samples. With this sample size, an effect size of 0.5 and α = 0.05, the test power was 0.99. Lower effect sizes of 0.3 and 0.1 led to lower test powers of 0.87 and 0.22, both of which are below the commonly used minimum test power of 0.95.
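To illustrate how test power depends on effect size and sample size, the sketch below uses the normal approximation for a one-sided one-sample test. The source does not state which test or software the power calculation was based on, so this choice is an assumption and the function name `approx_power` is hypothetical. The resulting values should only be expected to lie in the neighbourhood of the reported powers, not to reproduce them exactly.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n, alpha=0.05):
    """Normal-approximation power of a one-sided one-sample test (assumed test type)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)  # critical value under the null hypothesis
    # Under the alternative, the test statistic is shifted by effect_size * sqrt(n).
    return z.cdf(effect_size * sqrt(n) - z_crit)


for d in (0.5, 0.3, 0.1):
    print(d, round(approx_power(d, 80), 2))
```

The sketch makes the trend in the text visible: with a sample size of 80, power drops sharply as the effect size decreases from 0.5 to 0.1.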

The results for the SDT-test were found to be less homogeneous. The calculated sample size ranged from 77 to 89, with an average of 81 per tested sample. This is significantly lower than the maximum sample size of 200. According to Table 6.10, this sample size allows an α-error of 0.10 and a pc value of 0.75. This pc value corresponds to a lower d'min value of 0.9539.
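The reported threshold of 0.9539 coincides with the relation d' = √2 · z(pc), which is commonly used to convert a proportion correct into d' for two-alternative tasks. Whether this is the conversion behind Table 6.10 is an assumption, since the table lies outside this section. The sketch below, with the hypothetical function name `d_prime_min`, shows the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def d_prime_min(pc):
    """Convert a proportion correct pc into d' via d' = sqrt(2) * z(pc) (assumed relation)."""
    return sqrt(2.0) * NormalDist().inv_cdf(pc)


print(round(d_prime_min(0.75), 4))  # 0.9539
```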

Test subject reliability for the double-blind triple-stimulus test was checked according to ITU-R [29]. No subjects showed unreliable results in the first check. The second check, consisting of repeated t-tests, was not performed: since the scores of the test were based on integers instead of a decimal rating, this data set could not provide usable results. Scores were calculated following the recommendations of ITU-R [29].

The ratings given to the measurements were subtracted from the ratings given to the auralisations. A negative score therefore indicates that the auralisation was found to sound different from the measurement. A positive score means that the test subject found the auralisation to sound more like the reference (which was the measurement) than the measurement itself. This indicates that, according to the subject, the difference between the auralisation and the measurement was small. The average scores per sample can be seen in Table 6.11.
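As a small worked example of this scoring rule, the sketch below computes the difference scores for one sample. The ratings are invented for illustration and the function name `db_score` is hypothetical.

```python
from statistics import mean


def db_score(rating_auralisation, rating_measurement):
    """Score = rating of the auralisation minus rating of the measurement."""
    return rating_auralisation - rating_measurement


# Hypothetical (auralisation, measurement) ratings for five repetitions of one sample.
# In each trial at least one of the two ratings is 5, as enforced by the sliders.
trials = [(3, 5), (4, 5), (5, 5), (2, 5), (5, 4)]
scores = [db_score(a, m) for a, m in trials]
print(scores)  # [-2, -1, 0, -3, 1]
print(mean(scores))
```

Only the last trial yields a positive score: there the subject rated the auralisation as closer to the reference than the measurement.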

Fig. 6.4 shows the boxplots of the scores of the double-blind triple-stimulus test. One subject rated all but one sample with 1. Although visible in Fig. 6.4 as the low outliers at every sample, the ratings of this subject did not significantly alter the results of the test. The scores of the different samples were compared as well. Since the results of all samples were normally distributed (see Appendix D), the paired-samples t-test could be used to compare the means of the different samples [87]. No significant differences were found for different distances, radiation angles, stimuli and reverberation times (see Appendix E for the results of the comparisons).
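For reference, the statistic behind a paired-samples t-test can be sketched as follows. The per-subject scores are invented for illustration and the function name `paired_t_statistic` is hypothetical. The sketch only returns the t value and the degrees of freedom, and leaves the lookup of the p-value to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for paired observations x and y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical mean scores of five subjects for two samples.
sample_a = [-1.8, -2.0, -1.4, -1.6, -2.2]
sample_b = [-1.6, -1.9, -1.7, -1.5, -2.0]
t_value, dof = paired_t_statistic(sample_a, sample_b)
print(round(t_value, 3), dof)
```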

For the SDT-test, the d' values were calculated per subject with Formula 6.1. Table 6.12 provides the average d' values per sample. Boxplots of the scores per sample can be seen in Fig. 6.5. The average d'-scores, H-ratios and F-ratios per subject can be found in Appendix F. None of the samples exceeds the minimum effect threshold d'min of 0.9539. This means that no relevant difference between the auralisation and the measurement of any sample pair was found. The results for samples 3 and 9 were found not to be normally distributed. The recommended Wilcoxon signed-rank test [87] was therefore used for the comparisons involving these samples instead of the paired-samples t-test. Again, no significant differences were found for different distances, source radiation angles, stimuli and reverberation times. The comparisons can be seen in Appendix D.
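A minimal sketch of a d' calculation is given below, assuming that Formula 6.1 is the standard definition d' = z(H) − z(F), where H is the proportion of measurements marked as real and F the proportion of auralisations marked as real. The log-linear correction (adding 0.5 to the counts and 1 to the totals) is an assumption added here to keep the result finite when a ratio equals 0 or 1; the source does not state whether or how such cases were handled. The function name `d_prime` and the counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise):
    """d' = z(H) - z(F), with a log-linear correction (an assumption, see lead-in)."""
    # Corrected hit and false-alarm ratios always lie strictly between 0 and 1.
    h = (hits + 0.5) / (n_signal + 1)
    f = (false_alarms + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf
    return z(h) - z(f)


# Hypothetical counts for one subject and one sample pair, 10 trials each:
# 6 measurements marked "real" (hits), 7 auralisations marked "real" (false alarms).
print(round(d_prime(6, 10, 7, 10), 2))
```

In this example the false-alarm ratio exceeds the hit ratio, so the resulting d' is negative.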

For both tests, independent t-tests and paired-samples t-tests were also performed on groups of combined data from different samples, e.g. a comparison between the combined results of all three calibrated samples and those of their non-calibrated counterparts. The different groups, the comparisons and the results can be found in Appendix G. Despite the bigger sample sizes, no significant differences were found.

Table 6.11: Average DB-score per Sample

Sample   SA1    SA2    SA3    SA4    SA5    SA6    SA7    SA8    SA9    SA10
Mean    -1.75  -1.98  -1.67  -1.70  -1.82  -1.42  -1.8   -1.61  -1.50  -1.94

Table 6.12: Average d' scores

Sample   SA1    SA2    SA3    SA4    SA5    SA6    SA7    SA8    SA9    SA10
Mean     0.05  -0.04  -0.10   0     -0.07   0.36   0.13  -0.01   0      0.19

Figure 6.4: Average scores for the double-blind triple-stimulus test for all samples. (Boxplots per sample: L2S1, L2S10, L2S16, L3S1, L1S8, Trumpet, Drums, L2S1_c, L3S1_c, L2S10_c. Vertical axis: score from -5 to 1, annotated from "Very different" to "Very similar".)

Figure 6.5: Average scores for the SDT-test for all samples. (Boxplots per sample: L2S1, L2S10, L2S16, L3S1, L1S8, Trumpet, Drums, L2S1_c, L3S1_c, L2S10_c. Vertical axis: d'-score from -2.5 to 2.5.)

7 | Discussion

The research goal was to investigate the different ways in which the authenticity of auralisations can be analysed. Out of the multiple available methods discussed in the literature review, the SDT-test and the double-blind triple-stimulus test were selected and carried out. In this section, the results of these listening tests are discussed, followed by the test procedures, measurements and simulations.

The d' values of the SDT-test are relatively low or even negative. The average scores ranged from −0.10 to 0.36. None of the scores exceeded the relevant difference threshold d'min of 0.9539. This d'min value corresponds to an effect size pc of 0.75, which with this sample size was the lowest possible pc value with significant results (see Table 6.8). Remarkable are the negative scores for four of the ten samples.

A negative score indicates that the F-ratio of the sample is higher than the H-ratio (see Formula 6.1). This means that for a sample pair with a negative score, the auralisation was marked as real more often than the measurement. According to Stanislaw and Todorov [88], negative d'-scores can arise due to sampling errors or response confusion (for instance, when a subject answers no while intending to answer yes). Macmillan and Creelman [82] state that negative values can also arise by chance if the number of trials is small and should not be a cause for concern. This could be the case for this research, since the samples were only repeated ten times. All values are relatively close to zero, which could indicate that the subjects were performing at chance level. The low values might also be caused by a lack of training of the subjects. With training, the subjects would be exposed to the samples before the test and could practise differentiating between measurements and auralisations. This could lead to higher H-ratios and lower F-ratios [82]. Lindau and Weinzierl [16] had a similarly inexperienced and untrained test panel. For their study, BRIRs were recorded with a dummy head using a setup of 5 different speakers. The test subject was placed in the same positions wearing transparent headphones. The study used a similar SDT-test setup in which the subject had to decide whether the stimulus was played through the loudspeakers (real) or through the headphones (simulated). Eleven subjects were presented with 100 stimuli. The average d' value for their 2010 stage test (0.051) is close to the average d' value over all samples of this SDT-test (0.042).

All samples received a negative average score in the double-blind triple-stimulus test. The negative values indicate that the test subjects were able to distinguish the auralisations from the measurements. For both tests, the difference in results between samples with various S/R-distances, source radiation angles, anechoic stimuli and reverberation times was analysed. No significant difference was found for any of the comparisons.

Due to the low sample size, significant results could only be achieved for relatively high pc detection rate values. As mentioned, the lowest possible pc value was 0.75. This value is commonly used for these types of tests [16]. Still, the study of Lindau and Weinzierl [16] used a stricter value of 0.55. A large effect size of 0.5 was likewise required to support the results of the double-blind triple-stimulus test.

When looking at the difference in average rating in the double-blind triple-stimulus test, all three calibrated samples had a higher average rating than their non-calibrated counterparts. However, the differences found (0.043, 0.14 and 0.19) were too small to be significant. The calibration was done by adjusting the absorption of one surface in the model. It is possible that a different calibration method, in which all surfaces were adjusted, would have led to different results. Adjusting this one surface already removed one clear first-order reflection from the model, which improved the quality of the auralisations. Perhaps more unwanted reflections were present in the auralisations that were not removed by calibrating the model. The effect of the improved reverberation time could then be too small to yield significant results.

The effect of the reverberation was also expected to be larger for the samples with a longer S/R-distance, since the contribution of the direct sound to the BRIRs is smaller there than for BRIRs recorded at a short S/R-distance. However, the effect of calibrating the model seems similar for the shorter and longer distances.

The calibration was performed to create a model with the same reverberation time as the measurement. Besides the T30, the adjustments of the model could also have affected other acoustical parameters. These changes were not investigated further, but could possibly have influenced the results. It could be questioned whether the reverberation time is the optimal parameter to use for the calibration process. Had the model been calibrated to match an acoustical parameter of the measurement other than the T30, the calibrated samples might have produced different results.

In previous research, differences in the anechoic signal are reported to create differences in the rating or the detection of auralisations. Lindau et al. [43] found a difference in detection scores for the drum and trumpet samples when compared to the male speech sample. The research of Malecki et al. [37] also found a lower detection score for the trumpet sample than for the speech sample. Similar differences were not found in either test of this study when comparing the scores of the drum and trumpet samples to those of the male speech sample.

Tests were also performed in which groups of data collected from different samples were compared to other groups. The samples gathered in a group all had at least one parameter (S/R-distance, source radiation angle, calibrated or non-calibrated) in common. Although no significant results were found in the end, these comparisons had a drawback from the start: despite having one parameter in common, the samples in a group differed from each other in other parameters. Had significant results been found, it would still have been difficult to say with certainty which parameter caused the difference.

The set test duration of 20 minutes, as recommended by Bech and Zacharov [28], proved to be sufficient for most subjects to answer all of the questions in both tests. In only a few instances, the 20-minute limit was found to be too short and the subject was asked to stop the test. As stated in the literature review, little is reported on the set test duration in other studies. The literature suggests that subjects were allowed to finish all of the questions without a time limit, although this is never explicitly stated.

Similar results produced with a larger sample size could be more usable and significant. The recommended maximum length of 20 minutes did reduce the number of times the samples could be repeated. The total number of test subjects is also relatively low. The sample size could be increased by recruiting more test subjects or by using more repetitions. However, most comments made by the subjects after the tests concerned the number of repetitions of the speech sample. Some subjects found the repetitive nature of the sample bothersome and commented that it could have affected their results. Simply increasing the number of repetitions of this sample could thus lead to more complaints.

More repetitions of the samples may be needed. This would lead to a lower detectable effect size or detection rate and possibly to significant results. With the 5 repetitions used for the double-blind triple-stimulus test and the 10 used for the SDT-test, it is possible that smaller differences between the samples have been missed. The SDT-test results included some negative d' values, which could possibly also be avoided if the samples were repeated more often [82].

Differences can be found in the spectra of the measured and simulated BRIRs. The largest deviations seem to occur at the lower frequencies. The same trend can be seen when comparing the spectrum of the BRIR made with the ECHO source in an anechoic ODEON model with the spectrum of the ECHO rotation measurement. The largest differences are also found in the lower octave bands. The rotation measurement has significantly less energy here than the simulation. From the 250 Hz octave band onwards, the difference between the simulation and the measurement becomes smaller. In addition, the HRTF that was used in the model differed from that of the B&K HATS dummy head used for the measurements. This made differences between the auralisations and measurements inevitable. This error was reduced as much as possible by analysing all the available HRTFs and picking the best-fitting file.

8 | Conclusions

The first part of the research goal focused on investigating the different ways in which the authenticity of auralisations can be rated. The SDT-test and double-blind triple-stimulus test methods were selected after analysing the different listening test methods used in studies concerning auralisations. The double-blind triple-stimulus test was found suitable because of its ability to assess small differences and because it offers a convenient way to assess the subject's expertise. The SDT-test was chosen because it does not ask the subject to rate the samples in any way and thus limits the personal bias of the subjects. It also does not present the subject with a comparison to a reference and is therefore very different in its setup from the double-blind triple-stimulus test. Both methods were carried out in the second part of the study.

The measured BRIRs used for this study were recorded in a TU/e sports hall using the ECHO speech source and a B&K HATS dummy head. The ray-based software ODEON was used to create the simulated BRIRs. The parameters of the simulated model, such as the T30, differed significantly from the measurements. Therefore, a calibrated model with T30 values matching those of the measurements was created as well. The simulations were performed with an HRTF that differed from that of the dummy head used for the measurements. The spectra of the measured and simulated BRIRs differ from each other as well. The simulated BRIRs have more power at the lower frequencies. Since the HRTFs are different, some spectral differences were inevitable. When comparing the spectra of the measured and modelled source, the biggest differences are also found at the lower and middle frequencies, up to approximately 1000 Hz. However, here the measurements generally have less power for the lower frequencies.

Ten different measurement/auralisation sample pairs were assessed with both tests. These samples differed in the following aspects: source/receiver distance, source radiation angle, anechoic stimulus and reverberation time. The samples were repeated 5 times during the double-blind triple-stimulus test and 10 times during the SDT-test. The tests were performed with 21 test subjects, who were given a time limit of 20 minutes for each test. This limit proved to be suitable for most subjects to answer all of the questions. Due to hearing disabilities and a technical error, the results of only 16 subjects were used for the analysis.

Although the same sample set was used, the two tests provided different results. The low d' values from the SDT-test, with means ranging from −0.10 to 0.36, indicate that the subjects had difficulty differentiating the measurements from the auralisations. No sample had a d'-score that exceeded the minimum effect threshold of 0.9539. The negative average double-blind triple-stimulus scores for every sample, however, show