
Measuring calibration of likelihood ratio systems



Yara van Schaik (10262458)

MSc in Forensic Science, University of Amsterdam (UvA)

Research performed at the Netherlands Forensic Institute (NFI)

Supervisor: Peter Vergeer

Examiner: Marjan Sjerps

Period of research project (36 EC): 10/04/2017 - 16/10/2017

Suggested journal: Science & Justice

November 7, 2017

Abstract

When forensic evidence is examined, the conclusions are summarized by a weight of evidence, most commonly the likelihood ratio (LR). Several methods exist to calculate these ratios, and good performance of such models is essential. Calibration is a performance characteristic, and this property can be measured. Different methods to measure the calibration of an LR system are available, and four of them are explored in this work. One metric is based on moments, another on rates of misleading evidence, and the remaining two are based on the pool adjacent violators (PAV) transformation and on empirical cross-entropy (ECE). We propose a new numerical calibration measure derived from the PAV transformation algorithm. A simulation study is performed to examine the performance of the calibration metrics, with the goal of identifying pros and cons of each measure and providing guidelines on which measure to use in which situation. The first two methods do not behave as desired, whereas PAV and ECE perform better. On the basis of this work, PAV is more sensitive than ECE.

Keywords: forensic science, strength of evidence, likelihood ratio, calibration, calibration metric

1. Introduction

After the examination of evidence, forensic scientists summarize the interpretation of their findings by a weight of evidence. A commonly used measure for the weight of evidence is the likelihood ratio (LR) [1, 2, 3]. It expresses the ratio of the probability of the evidence given the prosecution hypothesis (Hp) and the probability of the evidence given the defence hypothesis (Hd). It takes values between 0 and ∞. A value greater than 1 supports the prosecution hypothesis and a value less than 1 supports the defence hypothesis. The likelihood ratio is then handed over to the court, where it is the task of the judges or jury to combine the weights of all evidence into their final judgment.
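In symbols, with E denoting the evidence, this reads

\[
\mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)}.
\]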

Over the last two decades, numerical methods to calculate LRs have been introduced. These methods are used by forensic experts to support their conclusions. Below, these methods are referred to as LR systems. Depending on the type of data that is examined, different methods for calculating likelihood ratios are available. Generally, one strives for large LR values when Hp is true and LRs close to zero when Hd is true. However, aspects like bad quality or low quantity of the evidence measurements, a wrong choice of statistical models or simply bad luck may result in likelihood ratios supporting the wrong hypothesis in a crime case. All of the above may lead to wrongful conviction or acquittal of a suspect, which is of course undesirable. Unfortunately, not every cause of bad performance is something one can get hold of. However, aspects like a wrong choice of statistical models can be prevented, and they are the basis of this research. The focus is on measuring the performance of LR models, rather than improving it.

1.1. Calibration

Calibration is a performance characteristic of an LR system or a set of LRs, and this property can be measured. To give insight into the definition of calibration, the classical weather forecasting example [4, 5] is given. A probabilistic weather forecaster assigns a probability that it rains on, say, day i. He does that for a set of future days, when the ground truth is unknown. At the end of each day, it is known whether it rained or not on that day. A way of evaluating the performance of the weather forecaster is to look at his calibration. A forecaster is said to be well-calibrated when, for the days to which he assigned a probability p, it rained on a fraction p of those days. So, for example, when he assigned a probability of 0.8 to a set of 10 days, we want it to have rained on 8 of those days.

Calibration is defined as the degree of approximation of all predicted probabilities to the actual, or empirical, probabilities [6]. When X is a perfectly calibrated set of LRs, it has the following property [7]:

\[
x = \mathrm{LR}(\mathrm{LR}=x) := \frac{P(\mathrm{LR}=x \mid H_p)}{P(\mathrm{LR}=x \mid H_d)} \quad \text{for all } x \in X. \tag{1.1}
\]

An information theoretical proof of the claim that this property of an LR set means that the LR system improves predictions can be found in Appendix A. Throughout this article, the term well-calibrated will be used. This does not refer to perfect calibration, but to calibration that is reasonably good. In practice, perfect calibration cannot be measured, since available amounts of LRs are always finite and they represent a random draw. Finding numerical values that better express what 'reasonably good' calibration means is within the scope of this study.

An LR system is well-calibrated when the output of the LR system yields correct LR values. Bad calibration indicates that a significant part of the LRs have a wrong value, i.e. too large or too small. It is therefore very desirable for a likelihood ratio system to be well-calibrated.

An LR system can be miscalibrated in several ways. In this research, four general types of miscalibrated LR sets were investigated. The first set has too large LRs. In the second one, all LRs are too small. The LRs are too extreme (i.e. too large for LRs greater than 1, and too small for LRs less than 1) in the third and too weak in the last miscalibrated LR set.

1.2. Misleading evidence

Some of the calibration metrics examined in this study (see section 2.3) are defined in terms of misleading evidence. This refers to LRs pointing in the wrong direction, i.e. an LR greater than 1 when in fact the defence hypothesis is true, or an LR less than 1 when in fact the prosecution hypothesis is true. A relatively large fraction of misleading LRs in a set may be indicative of miscalibration, see section 2.2.2.

1.3. Measures of calibration

Measurement of calibration not only depends on the LR system, but also on the chosen calibration metric. A number of studies have applied one or more calibration metrics to some LR data in order to determine whether the corresponding LR system works well. For examples, the reader is referred to Vergeer et al. [8], Van Es et al. [9] and Leegwater et al. [10].

No study has yet been performed on the comparison of different calibration measures. Likewise, no agreements or protocols on which methods to use in which situation have been established.


1.4. Scope of this study

The goal of this study is to compare several methods that measure calibration of LR systems, and to investigate their pros and cons. Simulations were used in order to obtain controlled data. Furthermore, we attempted to define some guidelines on which measures to use in which situation. In section 2, the simulations that were used to create data are explained, together with the investigated calibration metrics. Section 3 describes the results obtained from the comparison of the outputs of the calibration measures. The discussion and conclusion cover sections 4 and 5 respectively.

2. Methods

In order to investigate the performance of different calibration metrics, sets of LRs to apply them to are needed. It is required to know what characteristics the datasets have, so that a desired behavior of the calibration measures can be defined. Firstly, LR sets that are well-calibrated were simulated. After that, different types of ill-calibrated LR sets were simulated. To each obtained set of LRs, four calibration measures were applied. In the end, for each metric it was examined to what extent it distinguishes between well- and ill-calibrated sets of LRs.

All simulations and calculations were performed using the software R [11].

2.1. Real LR datasets

In this study, some information about distributions of real LR sets was needed for the simulations. For this reason, four datasets were obtained from Vergeer et al. [8] and Van Es et al. [9]. They comprise three LR datasets based on the comparison of gasoline samples and one LR dataset based on the comparison of glass samples. For the first three datasets, the same-source (SS) and different-source (DS) LRs have sample size 60 and 15420 respectively. The glass dataset has sample size 320 for SS and 51040 for DS.

2.2. Simulation

2.2.1 Simulating well-calibrated data

In the first simulation, a well-calibrated LR set was created by randomly drawing LR values out of distributions that are perfectly calibrated. A model of Gaussian distributions that meets certain characteristics was chosen. These characteristics are explained below.

For a perfectly calibrated LR system, if one of the different-source and same-source natural log likelihood ratio (LLR) distributions is Gaussian, then the other one is Gaussian as well (Van Leeuwen and Brümmer [7]). Furthermore, the distributions have equal variance and mirrored means around the y-axis (where log LR = 0). So, for Gaussian distributions

\[
\mathrm{SS} \sim \mathcal{N}(\mu_s, \sigma_s) \quad \text{and} \quad \mathrm{DS} \sim \mathcal{N}(\mu_d, \sigma_d), \tag{2.1}
\]

we have \(\sigma := \sigma_s = \sigma_d\) and \(\mu_s = -\mu_d\). Also, \(\sigma^2 = 2\mu_s\), and \(\mu_s\) is related to the discrimination performance measured by the equal error rate (EER) through the relation

\[
\mu_s = 2\left(\Phi^{-1}(\mathrm{EER})\right)^2. \tag{2.2}
\]

For proofs of all claims, the reader is referred to Van Leeuwen and Brümmer [7].

According to the above theory, in the first simulation, well-calibrated sets of LLRs were created by randomly drawing values from two normal distributions (representing the same-source and different-source LLR distributions) satisfying all of the above properties:

\[
\mathrm{SS} \sim \mathcal{N}\!\left(\mu_s, \sigma = \sqrt{2\mu_s}\right) \quad \text{and} \quad \mathrm{DS} \sim \mathcal{N}\!\left(-\mu_s, \sigma = \sqrt{2\mu_s}\right). \tag{2.3}
\]

Throughout this article, whenever the terms LLR or log LR are used, they refer to the natural logarithm (ln) of the LR.

As can be seen from equation (2.3), µs is a free variable, which could be varied over the entire range of positive real numbers. To restrict this interval, the EERs belonging to some real LR datasets obtained from Vergeer et al. [8] and Van Es et al. [9] (see section 2.1) were used. With those EERs, values for µs were calculated using equation (2.2). The values 6, 11 and 17 were obtained and used for µs in the simulations.

Other variables in the simulations were the sample sizes of both SS and DS. For real LR data, when n samples are used to create the data, often all samples are compared to each other and to themselves. This way, a sample size of n is obtained for SS and \(\binom{n}{2}\) for DS. This proportion in sample size was used for the LLR distributions in the simulations. For n, the value 300 was chosen, giving 300 SS and 44,850 DS LLRs per simulated set. This choice is based on the number of LLRs calculated for same-source comparisons in the glass database (see section 2.1), which was the largest of all databases that were studied.
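As an illustration, a minimal R sketch of this first simulation step, under the assumptions above (function and variable names are ours, not from the original code):

```r
# Simulate one well-calibrated set of natural-log LRs (simulation 1):
# mu_s follows from the EER via equation (2.2), the LLRs from equation (2.3).
simulate_well_calibrated <- function(mu_s, n_ss = 300, n_ds = choose(300, 2)) {
  sigma <- sqrt(2 * mu_s)
  list(llr_ss = rnorm(n_ss, mean =  mu_s, sd = sigma),   # same-source LLRs
       llr_ds = rnorm(n_ds, mean = -mu_s, sd = sigma))   # different-source LLRs
}

# Example: an EER of about 0.042 gives mu_s = 2 * qnorm(0.042)^2, roughly 6.
mu_s <- 2 * qnorm(0.042)^2   # equation (2.2)
sim1 <- simulate_well_calibrated(mu_s)
```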

2.2.2 Simulating ill-calibrated data

In the second part of this study, ill-calibrated LR sets were created. This was done by applying transformations to the distributions of simulation 1 (see equation 2.3). There are several ways to adapt those distributions in order to obtain an ill-calibrated LR system. Four general ways were investigated, which are stated below. In all transformations, the manipulation size c can be any positive constant (but greater than 1 for the last two transformations).

• Too large LRs (simulation R): the distributions are shifted to the right along the x-axis.

\[
\mathrm{SS} \sim \mathcal{N}\!\left(\mu_s + c, \sigma = \sqrt{2\mu_s}\right) \quad \text{and} \quad \mathrm{DS} \sim \mathcal{N}\!\left(-\mu_s + c, \sigma = \sqrt{2\mu_s}\right). \tag{2.4}
\]

• Too small LRs (simulation L): the distributions are shifted to the left along the x-axis.

\[
\mathrm{SS} \sim \mathcal{N}\!\left(\mu_s - c, \sigma = \sqrt{2\mu_s}\right) \quad \text{and} \quad \mathrm{DS} \sim \mathcal{N}\!\left(-\mu_s - c, \sigma = \sqrt{2\mu_s}\right). \tag{2.5}
\]

• Too extreme LRs (simulation E): the distributions remain the same as in simulation 1. After randomly drawing log LRs from both distributions, all obtained values are multiplied by c, resulting in more extreme LRs.

• Too weak LRs (simulation W): the distributions remain the same as in simulation 1. After randomly drawing log LRs from both distributions, all obtained values are divided by c, resulting in weaker LRs.

For the first two transformations, three values for c were picked: c ∈ {1, 3, 5}. The other two transformations are multiplications or divisions, whereby multiplying or dividing all LLRs by c = 1 would not change the original LLRs. For those latter two transformations, c ∈ {2, 4, 6} were chosen. The four transformations are illustrated in the sketch below.
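A minimal R sketch of the four transformations, applied to the LLR sets from the previous snippet (shifting the drawn LLRs by c is equivalent to shifting the underlying distributions; names are ours):

```r
# Apply one of the four miscalibration transformations to a vector of LLRs.
# type: "R" (too large), "L" (too small), "E" (too extreme), "W" (too weak)
transform_llrs <- function(llrs, type, c) {
  switch(type,
         R = llrs + c,    # shift right on the log scale
         L = llrs - c,    # shift left on the log scale
         E = llrs * c,    # multiply LLRs: more extreme LRs
         W = llrs / c,    # divide LLRs: weaker LRs
         stop("unknown transformation type"))
}

# Example: simulation E with manipulation size c = 4
sim_E <- lapply(sim1, transform_llrs, type = "E", c = 4)
```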

2.3. Calibration metrics

In this section, the four calibration metrics that were compared in this study are described.

2.3.1 Moments

According to Good [12], if a set of LRs is well-calibrated, then the nth moment of the likelihood ratio about the origin (hence \(E[(X-c)^n]\) with c = 0) given Hp is equal to the (n+1)th moment of the LR given Hd, i.e.

\[
E(\mathrm{LR}^n \mid H_p) = E(\mathrm{LR}^{n+1} \mid H_d). \tag{2.6}
\]

A proof of this claim is given in Appendix B. For a database of LRs, the difference between both sides of equation (2.6) can be calculated, which defines a measure of the calibration loss of the examined LR system.

Van Es et al. [9] used this same calibration metric in terms of moments, for the cases n = 0 and n = −1, to measure calibration of an LR set. This research also restricted the use of the 'Moments' calibration measure to those specific cases, by looking at equation (2.6) for n = 0 and n = −1, i.e.:

\[
1 = E(\mathrm{LR} \mid H_d) \quad \text{and} \tag{2.7}
\]
\[
1 = E\!\left(\frac{1}{\mathrm{LR}} \,\Big|\, H_p\right). \tag{2.8}
\]

In theory, for a well-calibrated LR set, both (2.7) and (2.8) apply. In practice, one has a finite LR set consisting of skewed distributions, which most of the time results in expected values less than 1.
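For a simulated LLR set, the two moments-based calibration values can be estimated as follows (sketch; function name is ours):

```r
# Moments-based calibration values (equations 2.7 and 2.8).
# For a well-calibrated set both expectations equal 1; deviations from 1
# are taken as calibration loss. LLRs are natural logs, so LR = exp(LLR).
moments_metric <- function(llr_ss, llr_ds) {
  c(n_0       = mean(exp(llr_ds)),    # E(LR | Hd),   case n = 0
    n_minus_1 = mean(exp(-llr_ss)))   # E(1/LR | Hp), case n = -1
}

moments_metric(sim1$llr_ss, sim1$llr_ds)
```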

2.3.2 Probability of misleading evidence

Royall [13, 14] defined a calibration measure in terms of the probability of misleading evidence. When a set of LRs is well-calibrated, the following holds: if Hd is true and one observes E, then it is unlikely that one will obtain a strong value for the LR favoring the false hypothesis (Hp in this case). Specifically, for every constant k ≥ 1,

\[
P(\mathrm{LR} \geq k \mid H_d) \leq \frac{1}{k}. \tag{2.9}
\]

Likewise, if Hp is true and one observes E, then for every constant k ≥ 1 the following holds:

\[
P\!\left(\mathrm{LR} \leq \frac{1}{k} \,\Big|\, H_p\right) \leq \frac{1}{k}. \tag{2.10}
\]

Equations (2.9) and (2.10) determine upper bounds for the rates of misleading evidence of perfectly calibrated LR sets. This research restricted the use of this calibration metric to the case k = 2, i.e.

\[
P(\mathrm{LR} \geq 2 \mid H_d) \leq \frac{1}{2} \quad \text{and} \tag{2.11}
\]
\[
P\!\left(\mathrm{LR} \leq \frac{1}{2} \,\Big|\, H_p\right) \leq \frac{1}{2}. \tag{2.12}
\]

For simplicity, from now on we call these calibration measures 'Misleading Hd' and 'Misleading Hp' respectively.

As far as we know, no study has yet applied this calibration metric to an LR system.
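The two rates of misleading evidence for k = 2 can be estimated from an LLR set as follows (sketch; names are ours):

```r
# Rates of misleading evidence (equations 2.11 and 2.12).
# LR >= k corresponds to LLR >= log(k) on the natural-log scale.
misleading_metric <- function(llr_ss, llr_ds, k = 2) {
  c(misleading_Hd = mean(llr_ds >=  log(k)),   # P(LR >= k   | Hd)
    misleading_Hp = mean(llr_ss <= -log(k)))   # P(LR <= 1/k | Hp)
}

misleading_metric(sim1$llr_ss, sim1$llr_ds)
```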

2.3.3 PAV transformation

The third calibration measure is based on the Pool Adjacent Violators (PAV) transformation [15, 16]. A PAV transform (see figure 1) is an algorithm that, applied to a set of LRs, transforms the set into a well-calibrated one. The transform works over a finite range, which is defined by the overlapping range of LRs under Hp and Hd. By plotting the PAV transform of the LLRs against the original LLRs, together with the line y = x, it can be seen if and how the two lines deviate from each other. For a well-calibrated LR set, in theory, the lines will completely overlap. A large deviation between the two lines indicates bad calibration.

Figure 1: A PAV transformation of LRs calculated for the comparison of gasoline samples, retrieved from Vergeer et al. (2014).

This method for measuring calibration is a visual one. One has to determine, by looking at two lines in one plot, whether an LR system is well- or ill-calibrated. In this study, a numerical measure for the calibration loss that is based on the PAV transform was defined. We named this measure 'devPAV', and defined it as the ratio of (i) the total area between the PAV-transform line and the line y = x to (ii) the length of the interval of all LRs that underwent the PAV transform and obtained a finite PAV value.

It is expected that large devPAV values indicate bad calibration, whereas good calibration leads to a small value of devPAV.
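The sketch below illustrates one possible implementation of the PAV transform and devPAV in R, using isotonic regression; the exact area computation and the prior-odds correction are our own reading of the definition above, not necessarily identical to the original implementation.

```r
# PAV transform of a pooled LLR set and the devPAV calibration measure.
devPAV <- function(llr_ss, llr_ds) {
  llr   <- c(llr_ss, llr_ds)
  label <- c(rep(1, length(llr_ss)), rep(0, length(llr_ds)))
  ord   <- order(llr)

  # Isotonic regression of the ground-truth labels on the LLRs
  # gives a monotone estimate of P(Hp | LLR).
  post <- isoreg(llr[ord], label[ord])$yf

  # Convert posteriors to calibrated LLRs, removing the SS/DS proportion
  # (the 'prior' implied by the composition of the dataset).
  pav_llr <- log(post / (1 - post)) - log(length(llr_ss) / length(llr_ds))

  # Keep only LLRs that obtained a finite PAV value.
  x <- llr[ord]
  keep <- is.finite(pav_llr)
  x <- x[keep]; y <- pav_llr[keep]
  if (length(x) < 2) return(NA_real_)

  # Area between the PAV line and the line y = x (trapezoidal rule on |y - x|),
  # divided by the length of the LLR interval with finite PAV values.
  d <- abs(y - x)
  area <- sum(diff(x) * (head(d, -1) + tail(d, -1)) / 2)
  area / (max(x) - min(x))
}

devPAV(sim1$llr_ss, sim1$llr_ds)
```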

2.3.4 Empirical Cross-Entropy (ECE)

The last examined calibration measure is Empirical Cross-Entropy (ECE) [17], which was also used in [8] and [9]. This metric measures performance based on strictly proper scoring rules. Better performance is indicated by a smaller ECE value, as it measures a cost. The ECE is calculated for the LR method over a range of prior probabilities, after which the ECE is plotted as a function of the prior probability. An example of such a plot is shown in figure 2.

Figure 2: ECE plot for LRs calculated for the comparison of gasoline samples, retrieved from Vergeer et al. (2014).

One of the ways to look at the calibration error of an LR set is by calculating the difference between the log-likelihood-ratio cost (Cllr) and the minimal log-likelihood-ratio cost (Cllr_min). The Cllr is the ECE at prior odds of 1, and the Cllr_min is the ECE value of the LRs obtained after PAV, at prior odds of 1. The difference between the two values is called Cllr_cal [18], and it measures the calibration loss.

Furthermore, the ECE curve of the LR system can be compared to the curve where all the LR values are equal to 1. This latter curve is non-informative, since it assigns equal weight to the evidence under both hypotheses. If the ECE values calculated for an LR system are greater than those of the non-informative LR set, this would suggest that it is better not to use the system (as an LR providing no information performs better). LR systems that are perfectly calibrated will always perform better than the non-informative curve. Therefore, crossing of the two curves is indicative of calibration problems. If a crossing occurs on the left side of the y-axis, it is indicative of too strong misleading evidence given Hd (i.e. LR ≫ 1 given Hd). If a crossing occurs on the right side of the y-axis, it indicates too strong misleading evidence given Hp (i.e. LR ≪ 1 given Hp).

Throughout the experiments, only the numerical output (Cllr_cal) of the ECE calibration metric was investigated.
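A sketch of the Cllr_cal computation; the Cllr formula follows Brümmer and du Preez [18], and pav_transform() is a hypothetical helper that returns PAV-recalibrated SS and DS LLRs (cf. the devPAV sketch above):

```r
# Log-likelihood-ratio cost Cllr for natural-log LRs (costs in bits).
cllr <- function(llr_ss, llr_ds) {
  0.5 * (mean(log2(1 + exp(-llr_ss))) +   # penalty for SS LLRs that are too low
         mean(log2(1 + exp( llr_ds))))    # penalty for DS LLRs that are too high
}

# Calibration loss: Cllr_cal = Cllr - Cllr_min, with Cllr_min the Cllr of the
# same LLRs after the PAV transform (pav_transform() is assumed, not shown).
cllr_cal <- function(llr_ss, llr_ds) {
  pav <- pav_transform(llr_ss, llr_ds)
  cllr(llr_ss, llr_ds) - cllr(pav$llr_ss, pav$llr_ds)
}
```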

3. Results

For each calibration metric and simulation, the 5th and 95th percentiles were calculated by repeating every simulation 2000 times. The number of 2000 was chosen since it is recommended for accurately calculating the 5th or 95th percentile (for bootstrap simulations) [19]. In order to verify this, each experiment (consisting of 2000 repetitions of the same simulation) was repeated three times. The obtained percentiles of calibration values of each experiment led to the same interpretation of the results. This supports the choice of 2000 repetitions.
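The experimental loop can be sketched as follows, reusing the functions from the earlier snippets (structure and names are ours):

```r
# Repeat a simulation n_rep times, compute one scalar calibration metric per
# repetition, and return the 5th and 95th percentiles of the resulting values.
run_experiment <- function(metric, type = NULL, c = NA, mu_s = 6, n_rep = 2000) {
  values <- replicate(n_rep, {
    sim <- simulate_well_calibrated(mu_s)
    if (!is.null(type)) sim <- lapply(sim, transform_llrs, type = type, c = c)
    metric(sim$llr_ss, sim$llr_ds)
  })
  quantile(values, probs = c(0.05, 0.95))
}

run_experiment(devPAV)                     # simulation 1 (well-calibrated)
run_experiment(devPAV, type = "E", c = 4)  # transformation E
```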

3.1. Calibration of data for fixed µs

In the first part of the experiments, µs = 6 was selected for the Gaussian LLR distributions. The four calibration measures were applied to simulated data of all the types that are discussed in section 2.2.

3.1.1 Moments

In this section, to reveal potential overlap of calibration values between simulation 1 and transformation R, L, E or W data, the 95th percentile for simulation 1 and the 5th percentile for the different transformations and manipulation sizes are plotted. In theory, the calibration value of a perfectly calibrated set of LRs, when measured with the 'Moments' calibration metric, would equal 1; strongly elevated calibration values indicate bad calibration. The dotted black line represents the 95th percentile of the first simulation. The 5th percentile of each of the four transformations (over varying manipulation size) is represented by its own color. If for a certain manipulation size c a colored line is plotted under the black line, the corresponding 5th percentile is smaller than the 95th percentile for the first simulation. This indicates that some calibration values obtained for simulation 1 LR data are larger than some of the calibration values obtained with the transformation. Accordingly, the calibration metric cannot very well distinguish between an ill-calibrated set of this form and a well-calibrated set. To illustrate this better, boxplots of a calibration metric applied to simulation 1 and transformation L data (with manipulation size c = 1) are shown in figure 3. It can be seen that most calibration values obtained for simulation L data are larger than most values obtained for simulation 1 data. However, for some LR datasets the opposite is true. Therefore, this is an example of an ill-calibrated type of LR dataset that, under this metric, cannot fully be distinguished from simulation 1 LR data.

Figure 3: Boxplots of the 'Moments (n = −1)' calibration metric applied to simulation 1 and simulation L (with manipulation size c = 1) LR data.

The results for the 'Moments' calibration metric can be found in the percentile plots in figures 4 and 5. For the specific case n = 0, the red and blue lines are all above the black line, as desired. The green and yellow lines are all positioned under the black line. The position of the green lines can be explained by the fact that the expected value of the LR decreases, since both distributions of LLRs are moved to the left along the x-axis. An explanation for the position of the yellow lines is that the LRs are weakened, so the LRs in the region (0, 1) increase and the LRs in (1, ∞) decrease. The decreasing LRs in the second region have a bigger influence on the expected value of the LR under Hd than the increasing LRs in the first region do. This is because the calculations are on the natural LR scale, whereas the manipulations are on a log scale: a change of an LR value from 100 to 10 is more influential on the expectation than a change from 0.01 to 0.1, although both result from the same weakening of the LLRs (a division by c = 2 on the log scale). As a result, the expected value E(LR | Hd) decreases.

Figure 4: Plots showing percentiles for the calibration metric 'Moments (n = 0)' applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

Figure 5: Plots showing percentiles for the calibration metric 'Moments (n = −1)' applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

For n = −1, the consequences of too large and too small LRs are interchanged compared to the calibration metric 'Moments' for n = 0. Under this measure, some simulated datasets of too large LRs have a smaller calibration value than well-calibrated sets. This is due to the fact that the LR distributions are moved to the right along the x-axis and therefore the expectation of the inverse of the LR decreases. The 5th percentile of the simulation of too weak LRs (transformation W) is also smaller than the first simulation's 95th percentile. This can be explained in the same way as was done for the case n = 0, only now the expectation of 1/LR under Hp is influenced more by the change of LRs in the region (0, 1) than by those in the (1, ∞) region. Simulation W therefore leads to a lower expectation E(1/LR | Hp).

Notably, for small manipulation sizes, the 5th percentile of simulation L is below the 95th percentile of the first simulation. This is not the case for the percentiles obtained by applying the calibration metric 'Moments (n = 0)' to LR data from simulations 1 and R. An explanation is that the distances between the 5th and 95th percentiles for n = −1 of both simulations 1 and L are larger than the distances between the 5th and 95th percentiles of simulations 1 and R for n = 0, respectively. This is due to the small sample size of the same-source distribution compared to that of the different-source LRs. A smaller sample size leads to more fluctuation in calibration values, and therefore to more widely separated percentiles.

3.1.2 Misleading evidence

The calibration metrics 'Misleading Hp' and 'Misleading Hd' were restricted to the case k = 2, and the results are shown in figures 6 and 7. Similar to the previously described calibration measure, the plots show the 95th percentile (black line) for simulation 1 and the 5th percentiles (colored lines) for the transformations.

When the 'Misleading Hp' metric is applied, only simulation R behaves as desired. The colored lines corresponding to all other simulations are under the black line, indicating that the 5th percentiles for those simulations are all smaller than the 95th percentile for simulation 1. The position of the 5th percentile for simulation L can be explained by the decrease of the LRs.

Figure 6: Plots showing percentiles for the calibration metric 'Misleading Hp' for k = 2 applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

Figure 7: Plots showing percentiles for the calibration metric 'Misleading Hd' for k = 2 applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

Therefore, the rate of misleading LRs under Hp increases, resulting in a larger calibration value. Mostly, data obtained by simulation W seem to perform better in terms of calibration than simulation 1 data. This is because the LRs get weaker, resulting in a lower rate of LRs under Hp that are less than or equal to 1/2. For simulation E, for each manipulation size the corresponding 5th percentile is smaller than the 95th percentile of the first simulation. This may give the impression that under this metric calibration values of LR sets consisting of too extreme LRs are smaller than calibration values of well-calibrated LR sets. However, most calibration values calculated for simulation E data are larger than most of the first simulation's calibration values. If, instead of comparing the 95th percentile of simulation 1 with the 5th percentile of simulation E, the 5th percentiles of both simulations were compared to each other, then one would see that the 5th percentile corresponding to simulation E is larger than that of the first simulation.

The calibration metric 'Misleading Hd' behaves in the same way as its 'Misleading Hp' counterpart, only here the results for too large and too small LRs are interchanged. Too large LRs result in a lower rate of misleading LRs under Hp, and the opposite is true for LRs under Hd. The other way around, decreasing LR values cause a higher rate of misleading LRs under Hp and a lower rate of misleading LRs under Hd. Simulations E and W behave approximately the same under this calibration metric as under the metric 'Misleading Hp', and therefore their results are quite similar. One difference is the better separation between calibration values of simulation 1 and some of the transformations (simulations R and E) under the measure 'Misleading Hd', compared to some of the transformations (simulations L and E) under the 'Misleading Hp' metric. This might be due to a larger distance between the 5th and 95th percentiles for a simulation under the latter metric, which is caused by the smaller sample size for the same-source LRs.

3.1.3 PAV transformation

The results obtained for the PAV calibration metric are shown in figure 8. Also here, the 95th percentile for the first simulation and the 5th percentiles for the transformations are plotted.

As can be seen from figure 8, none of the colored lines are positioned under the black line. So, for the tested manipulation sizes, the PAV metric correctly makes a distinction between well- and ill-calibrated LR sets.

Figure 8: Plots showing percentiles for the PAV calibration metric applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

3.1.4 Empirical Cross-Entropy

The percentiles corresponding to the ECE calibration metric are shown in figure 9. Again, the 95th and 5th percentiles for simulation 1 and the transformations respectively are plotted.

Figure 9: Plots showing percentiles for the ECE calibration metric applied to simulations with µs = 6. The 95th percentile for simulation 1 and the 5th percentiles for transformations R, L, E & W are plotted.

Also here, no detrimental position of the percentiles of any of the manipulations relative to those of the first simulation is observed. This means that for the tested manipulation sizes a proper distinction is made between well- and ill-calibrated LR sets.


3.2. Increasing mean value µs

In the second part of this research, all experiments from section 3.1 were repeated, after increasing µs to the values 11 and 17.

In this section, only the results for the calibration metrics PAV and ECE are discussed, as the other metrics do not perform as desired for any µs ∈ {6, 11, 17}.

Figure 10: Percentile plots for the PAV and ECE calibration metrics applied to simulations with µs = 11.

In figures 10 and 11, the 95th and 5th percentiles of the PAV and ECE calibration metrics applied to simulation 1 and the transformations, respectively, with µs = 11 and µs = 17 are shown. For both calibration metrics, the positions of the percentiles for µs = 11 do not differ much from those for µs = 6. One difference is that for small manipulation sizes the 5th percentiles of the transformations are smaller than the 95th percentile of the first simulation. This is because a large µs results in better separation of the DS and SS distributions, and therefore fewer values are transformed by the PAV algorithm.

Figure 11: Percentile plots for the PAV and ECE calibration metrics applied to simulations with µs = 17.

3.3. Differences between PAV and ECE

In the last stage of this study, the PAV and ECE calibration metrics were compared to each other to see if they differ in sensitivity. For simulations R and L, manipulation sizes smaller than 1 were chosen, and sizes smaller than 2 were picked for simulations E and W. Per simulation, for every new manipulation size, percentiles were calculated. The selection of smaller manipulation sizes was continued until some size was found for which the 5th percentile of the transformation is smaller than the 95th percentile of simulation 1 for only one of the metrics PAV or ECE. For each simulation, the value µs = 6 was chosen for the same-source distributions. As can be seen in figure 12, for all transformations manipulation sizes showing a higher sensitivity for one of the metrics were found. For all four types of data transformations, PAV is more sensitive than ECE.

Another way to measure sensitivity is by calculating the area under the curve (AUC) of a receiver operating characteristic (ROC) graph [20]. A ROC graph is a depiction of classifier performance and the AUC is a numerical summary of it, here taking values between 0.5 and 1. An AUC of 1 corresponds to a good classifier and an AUC of 0.5 to a bad (uninformative) one.

Per calibration metric and per transformation, AUC values were calculated using the 2000 calibration values obtained for simulation 1 and for the transformation. They are shown in table 1. All AUC values for PAV are at least as large as the corresponding AUC values for ECE. This is in agreement with the percentile plots in figure 12 and therefore supports the claim that PAV is more sensitive than ECE.

            Simulation R                 Simulation L
        c=0.65  c=0.70  c=0.75      c=0.70  c=0.75  c=0.80
ECE      0.97    0.98    0.99        0.96    0.97    0.99
PAV      0.98    0.99    0.99        0.98    0.99    0.99

            Simulation E                 Simulation W
        c=1.10  c=1.20  c=1.30      c=1.20  c=1.25  c=1.30
ECE      0.55    0.70    0.86        0.81    0.89    0.94
PAV      0.85    0.99    1.00        0.97    0.99    0.99

Table 1: AUC values for ECE and PAV calculated for simulation 1 and transformations R, L, E and W. Per transformation, different manipulation sizes c are chosen.
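The AUC for one metric and one transformation can be estimated directly from the two sets of 2000 calibration values with the rank-based (Wilcoxon) estimator (sketch; names are ours):

```r
# AUC for separating transformed from well-calibrated data on the basis of the
# calibration values: the probability that a randomly chosen transformed set
# receives a larger calibration value than a randomly chosen well-calibrated
# set (ties counted as 1/2), via the rank-sum statistic.
auc <- function(values_transformed, values_sim1) {
  r  <- rank(c(values_transformed, values_sim1))
  n1 <- length(values_transformed)
  n2 <- length(values_sim1)
  (sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n2)
}
```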

4. Discussion

In this study, the performance of several calibration metrics was compared with the use of simulated LR data. For some types of simulations, the calibration measures in terms of moments and rates of misleading evidence did not perform as desired. Applied to the simulations discussed in this study, PAV and ECE both turned out to be good calibration metrics, with a higher sensitivity for PAV.

Figure 12: Percentile plots for the PAV and ECE calibration metrics applied to simulations with small manipulation sizes (µs = 6).

However, when one keeps increasing the value of µs in the same-source and different-source distributions, eventually all calibration metrics will fail to make a distinction between well- and ill-calibrated LR sets.

Below, some discussion points are highlighted. Firstly, an explanation of the unsatisfactory results of the first two calibration metrics is given. Then, the pros and cons of PAV and ECE are discussed. After that, it is explained why large values of µs result in bad performance of the calibration metrics. This is followed by a discussion of the generalizability of the obtained results. Finally, some recommendations for future research are given.

The calibration performance of a set of too weak LRs is not correctly reflected by any of the 'Moments' or 'Rate of misleading evidence' metrics, as under these metrics it seems to perform as well as, or even better than, a well-calibrated set in terms of calibration. Also, depending on the choice of n and of Hp or Hd, bad calibration of too large or too small LRs is not detected by these metrics. This undesired behavior can be explained by the definition of the metrics. Both calibration metrics focus on LRs from only the different-source or the same-source distribution. Simulation L, for example, moves both distributions to the left along the x-axis. This results in a lower expectation of the LRs under Hd and a lower rate of misleading evidence under Hd, and therefore a better calibration output when measured with the calibration metrics 'Moments (n = 0)' and 'Misleading Hd'. Furthermore, both metrics apply calculations to the LRs on the natural (non-log) scale, whereas all manipulations are carried out on a log scale. Values on the left side of the y-axis weigh little to nothing in the calculation of the expectation, whereas values on the right are very influential on the expected value. Therefore, the calibration metrics in terms of moments and rates of misleading evidence are not always suitable for their purpose.

Based on the current work, PAV is preferred over ECE. This is not only due to its higher sensitivity, but also because of its visual representation. A PAV plot shows which LR values cause good or bad calibration. When, for example, an outlier is added to a well-calibrated LR set, the calibration value obtained with PAV might be large, indicating bad calibration. If one in addition looks at the PAV transformation plot, one will see that the LR set in fact is not that badly calibrated. An ECE plot, on the other hand, is a summary of the calibration of the whole LR set plotted as a function of the prior odds. In the example above, the corresponding ECE plot will probably reflect bad calibration, though it will not be visible that the outlier is causing the large output value.

Another part of this study showed that increasing values of µs result in a worse distinction between simulation 1 and R, L, E or W data for PAV and ECE. This is explained by the fact that the SS and DS distributions are more separated, so fewer LR values are transformed by the PAV algorithm. Therefore, for large µs, the calibration values for PAV and ECE decrease for each of the simulations.

A final note is that all metrics were tested on simulated LLR data randomly drawn from normal distributions. In reality, LLR datasets globally consist of combinations of all simulations. It is therefore expected that similar results are obtained when other types of LR data are studied. However, as mentioned in the previous example, we do not know what the impact of outliers will be on the performance of PAV and ECE.

A recommendation for future research is to add outliers to the SS and DS distributions of simulation 1 in order to study the output of the calibration metrics, to complement the present research. Another research point would be to decrease the sample size for both the different-source and the same-source LR set. For some of the simulations we have already seen that a smaller sample size leads to worse separation between their calibration values and those obtained for simulation 1 data. Furthermore, in this research the focus was on the numerical estimation of calibration, but ECE and PAV also include visual representations of the calibration performance. Within the ECE calibration metric, one can look at (the number of) crossings between the curve of the original LRs and the non-informative LR curve in the ECE plot.

5. Conclusion

It can be concluded that the calibration measures in terms of moments and rates of misleading evidence do not behave as desired for all types of ill-calibrated LR data. For Gaussian same-source and different-source LLR distributions, the PAV and ECE metrics perform well, with PAV being more sensitive than ECE. However, for large values of µs, none of the calibration metrics can make a distinction between well- and ill-calibrated LR sets.

On the basis of the current work, the PAV metric is preferred over ECE. This is due to PAV's higher sensitivity and its visual representation, which is easier to interpret than an ECE plot.

The results of this research are based on assumptions about the distribution of LR sets. All forms of real LR data are combinations of the studied distributions, so we expect that similar conclusions can be drawn when the same study is performed on any type of dataset. An exception might be datasets containing outliers, which could result in bad performance of the PAV and ECE calibration metrics. However, the PAV transformation plot might prevent one from interpreting the results in a wrong way.

References

[1] C.G.G. Aitken, F. Taroni, Statistics and the evaluation of evidence for forensic scientists, second ed., John Wiley & Sons, Chichester, UK, 2004.

[2] I.W. Evett, Towards a uniform framework for reporting opinions in forensic science casework, Sci. Justice 38 (1998) 198-202.

[3] B. Robertson, G.A. Vignaux, C.E.H. Berger, Interpreting evidence: Evaluating forensic science in the courtroom, John Wiley & Sons, Chichester, UK, 1995.

[4] A.P. Dawid, The well-calibrated Bayesian, J. Am. Stat. Assoc. 77 (1982) 605-610.

[5] M.H. De Groot, S.E. Fienberg, The comparison and evaluation of forecasters, J. R. Stat. Soc. (Series D: The Statistician) 32 (1982) 12-22.

[6] A. Bella, C. Ferri, J. Hernández-Orallo, M.J. Ramírez-Quintana, On the effect of calibration in classifier combination, Appl. Intell. 38 (2013) 566-585.

[7] D.A. van Leeuwen, N. Brümmer, The distribution of calibrated likelihood-ratios in speaker recognition, Interspeech (2013) 1619-1623.

[8] P. Vergeer, A. Bolck, L.J.C. Peschier, C.E.H. Berger, J.N. Hendrikse, Likelihood ratio methods for forensic comparison of evaporated gasoline residues, Sci. Justice 54 (2014) 401-411.

[9] A. van Es, W. Wiarda, M. Hordijk, I. Alberink, P. Vergeer, Implementation and assessment of a likelihood ratio approach for the evaluation of LA-ICPMS evidence in forensic glass analysis, Sci. Justice (2017), in prep.

[10] A.J. Leegwater, D. Meuwly, M. Sjerps, P. Vergeer, I. Alberink, Performance study of a score-based likelihood ratio system for forensic fingermark comparison, J. Forensic Sci. (2016), in prep.

[11] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2008. URL http://www.R-project.org.

[12] I.J. Good, Weight of evidence: A brief survey, Bayesian Stat. 2 (1985) 249-270.

[13] R. Royall, Statistical evidence: A likelihood paradigm, first ed., Chapman & Hall, London, 1997.

[14] R. Royall, On the probability of observing misleading statistical evidence, J. Am. Stat. Assoc. 95 (2000) 760-768.

[15] R.E. Barlow, D.J. Bartholomew, J.M. Bremner, H.D. Brunk, Statistical inference under order restrictions: The theory and application of isotonic regression, Wiley, New York, 1972.

[16] G. Zadora, A. Martyna, D. Ramos, C. Aitken, Statistical analysis in forensic science: Evidential value of multivariate physicochemical data, John Wiley & Sons, Hoboken, 2014.

[17] D. Ramos, J. Gonzalez-Rodriguez, Reliable support: Measuring calibration of likelihood ratios, Forensic Sci. Int. 230 (2013) 156-169.

[18] N. Brümmer, J. du Preez, Application-independent evaluation of speaker detection, Comput. Speech Lang. 20 (2006) 230-275.

[19] B. Efron, R.J. Tibshirani, An introduction to the bootstrap, Chapman & Hall, New York, 1993.

[20] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861-874.

Appendix A. LR(LR) = LR implies lower entropy

Proof. Let E be the evidence space and H the hypothesis space. Let Q be the self-defined probability measure and P the empirical probability measure. Suppose that for a subset \(E_n \subset E\), for all \(E_{nj} \in E_n\) (\(j \in \{1, \ldots, m\}\)) we have

\[
Q(H_i \mid E_{nj}) = C_n. \tag{A.1}
\]

Then we also have

\[
\mathrm{LR}(E_{n1}) = \cdots = \mathrm{LR}(E_{nm}). \tag{A.2}
\]

Define \(E_n^{\mathrm{sum}} := \cup_{j=1}^{m} E_{nj}\).

Claim: If \(\mathrm{LR}(E_n^{\mathrm{sum}}) = \mathrm{LR}(E_{nj})\), then we have information gain over H.

Proof: The entropy of H conditioned on \(E_n\) is written as

\[
\begin{aligned}
H(H \mid E_n) &= \sum_{j=1}^{m} P(E_{nj})\, H(H \mid E_n = E_{nj}) \\
&= -\sum_{i=1}^{2} \sum_{j=1}^{m} P(H_i, E_{nj}) \log Q(H_i \mid E_{nj}) \\
&= -\sum_{i=1}^{2} P(H_i) \sum_{j=1}^{m} P(E_{nj} \mid H_i) \log Q(H_i \mid E_{nj}) \\
&= -\sum_{i=1}^{2} P(H_i) \log C_n \sum_{j=1}^{m} P(E_{nj} \mid H_i) \\
&= -\sum_{i=1}^{2} P(H_i) \log C_n \, P(E_n^{\mathrm{sum}} \mid H_i) \\
&= -\sum_{i=1}^{2} P(E_n^{\mathrm{sum}}, H_i) \log C_n \\
&= -P(E_n^{\mathrm{sum}}) \sum_{i=1}^{2} P(H_i \mid E_n^{\mathrm{sum}}) \log C_n.
\end{aligned} \tag{A.3}
\]

The minimum of this conditional entropy is reached when \(C_n := Q(H_i \mid E_{nj}) = P(H_i \mid E_n^{\mathrm{sum}})\). Assuming Q(H) is well-calibrated, this minimum is reached when \(\mathrm{LR}(E_{nj}) = \mathrm{LR}(E_n^{\mathrm{sum}})\). Subsequently, we have

\[
\begin{aligned}
H(H) - H(H \mid E_n) &= \sum_{i=1}^{2} \int_{E} P(H_i, E) \log \frac{P(H_i \mid E)}{P(H_i)} \, dE \\
&= \sum_{i=1}^{2} \int_{E_n^{\mathrm{sum}} \in E} P(H_i, E_n^{\mathrm{sum}}) \log \frac{P(H_i \mid E_n^{\mathrm{sum}})}{P(H_i)} \, dE_n^{\mathrm{sum}} \\
&= \int_{E_n^{\mathrm{sum}} \in E} P(E_n^{\mathrm{sum}}) \sum_{i=1}^{2} P(H_i \mid E_n^{\mathrm{sum}}) \log \frac{P(H_i \mid E_n^{\mathrm{sum}})}{P(H_i)} \, dE_n^{\mathrm{sum}}. 
\end{aligned} \tag{A.4}
\]
∎

Claim: \(\mathrm{LR}(E_n^{\mathrm{sum}}) = \mathrm{LR}(E_{nj})\) is equivalent to

\[
\mathrm{LR}(\mathrm{LR}(E_n^{\mathrm{sum}})) = \mathrm{LR}(E_n^{\mathrm{sum}}). \tag{A.5}
\]

Proof: All LRs in a partition \(E_n \in E\) are equal, so

\[
P(\mathrm{LR}(E_n^{\mathrm{sum}}) \mid H_i) = \sum_{j=1}^{m} Q(E_{nj} \mid H_i), \tag{A.6}
\]

for all \(i \in \{1, 2\}\). Then the following holds:

\[
\begin{aligned}
\mathrm{LR}(\mathrm{LR}(E_n^{\mathrm{sum}})) &= \frac{P(\mathrm{LR}(E_n^{\mathrm{sum}}) \mid H_1)}{P(\mathrm{LR}(E_n^{\mathrm{sum}}) \mid H_2)}
= \frac{\sum_{j=1}^{m} Q(E_{nj} \mid H_1)}{\sum_{j=1}^{m} Q(E_{nj} \mid H_2)} \\
&= \frac{\sum_{j=1}^{m} P(E_{nj} \mid H_1)}{\sum_{j=1}^{m} P(E_{nj} \mid H_2)}
= \frac{P(E_n^{\mathrm{sum}} \mid H_1)}{P(E_n^{\mathrm{sum}} \mid H_2)} = \mathrm{LR}(E_n^{\mathrm{sum}}).
\end{aligned} \tag{A.7}
\]
∎

Appendix B. \(E(\mathrm{LR}^n \mid H_p) = E(\mathrm{LR}^{n+1} \mid H_d)\)

Proof.

\[
\begin{aligned}
E(\mathrm{LR}^n \mid H_p) &= \int_0^{\infty} \mathrm{LR}^n \, P(\mathrm{LR} \mid H_p) \, d\mathrm{LR} \\
&= \int_0^{\infty} \left(\frac{P(\mathrm{LR} \mid H_p)}{P(\mathrm{LR} \mid H_d)}\right)^{\!n} P(\mathrm{LR} \mid H_p) \, d\mathrm{LR} \\
&= \int_0^{\infty} \left(\frac{P(\mathrm{LR} \mid H_p)}{P(\mathrm{LR} \mid H_d)}\right)^{\!n+1} P(\mathrm{LR} \mid H_d) \, d\mathrm{LR} \\
&= \int_0^{\infty} \mathrm{LR}^{n+1} \, P(\mathrm{LR} \mid H_d) \, d\mathrm{LR} \\
&= E(\mathrm{LR}^{n+1} \mid H_d). 
\end{aligned} \tag{B.1}
\]
∎
