Web app for the elicitation of a likelihood ratio in forensic science: Development and pilot study

Academic year: 2021



Master’s Thesis Psychology
Methodology and Statistics Unit, Institute of Psychology
Faculty of Social and Behavioral Sciences, Leiden University
Date: 28-10-2019

Student number: S2102226

Supervisors: Mark de Rooij (internal), Jan de Koeijer (external), and Peter Vergeer (external)

Web app for the elicitation of a likelihood ratio in forensic science: Development and pilot study


Acknowledgements

I would like to thank my supervisors at the NFI, Jan de Koeijer and Peter Vergeer, for the opportunity to do my Master’s Thesis on such an interesting and innovative topic. They helped me greatly in understanding the basics of forensic science and how statistics is used in the forensic sciences. I would also like to thank my supervisor Mark de Rooij of Leiden University. He helped me give structure to my Master’s Thesis, and his advice was greatly appreciated and incorporated in my research. I would also like to thank the experts at the NFI who were willing to participate in the development of the web app and the pilot study. The study could not have been successfully conducted without their expertise.

Finally, I must express my gratitude to my boyfriend Sander and my parents for their support and encouragement throughout the process of researching and writing my Master’s Thesis. Their love and encouragement motivated me greatly and I could not have done this without them. Thank you.


Abstract

The strength of evidence in a criminal case is reported at the Netherlands Forensic Institute (NFI) with a verbal expression of probability and a likelihood ratio (LR) interval. The aim of the study was to develop an instrument for the elicitation of the probability distribution of an LR in the forensic field and to perform a pilot study to evaluate the instrument. The study thus comprised two phases: Phase I was the development of an instrument for the elicitation of a probability distribution, and Phase II was the pilot study to evaluate the instrument. A web app was deemed the most appropriate instrument for elicitation because a web app allows for: (a) a structured approach to elicitation, (b) a formal setting, (c) easy distribution within the NFI, (d) immediate (graphical) feedback, and (e) the results to be summarized in a standardized report. The web app was developed in R (R Core Team, 2019) with Shiny (Chang, Cheng, Allaire, Xie, & McPherson, 2019) in an iterative process of programming periods and feedback sessions. Four steps of elicitation are followed in the web app: 1) elicitation of probability judgments with the trial roulette method (Gore, 1987), 2) fitting of a parametric probability distribution to the probability judgments, 3) feedback to the expert based on the probability judgments with the P and V method (Hora, 2007), and 4) revision of the probability judgments based on the feedback (O’Hagan et al., 2006). A pilot study was performed to evaluate the web app for (a) face validity, (b) convergent validity, (c) usability, and (d) reliability. Based on the pilot study, the web app seems to have face validity, test-retest reliability, and convergent validity. The usability of the web app is acceptable, but failures of the web app should be resolved before it can be deemed to have good usability. The pilot study was performed with only four experts and the conclusions should thus be interpreted with caution. Further development of the web app is needed before it can be used within the NFI.


Table of contents

1 Introduction
1.1 Verbal likelihood ratio interval scale
1.2 Aim of the study
2 Phase I: Development of the elicitation instrument
2.1 Literature review
2.2 Method
3 Elicitation using the web app
3.1 Preparation of elicitation session
3.2 Example case
3.3 Step 1: Elicitation of probabilities
3.4 Step 2: Fitting of parametric distribution
3.5 Step 3: Feedback on the parametric distribution
3.6 Step 4: Adjusting the probability distribution
3.7 Combining the distributions
4 Phase II: Pilot study
4.1 Method
4.2 Results
5 Discussion
5.1 Limitations
5.2 Recommendations for future research and development
6 Conclusion


1 Introduction

When a criminal offence such as a burglary has taken place, the police will investigate the incident. When the investigation is successful, pieces of evidence are collected and a suspect is arrested. The pieces of evidence are analyzed in a forensic lab, typically a police forensic lab or the Netherlands Forensic Institute (NFI). The forensic experts analyze the piece(s) of evidence and report their conclusions. The NFI and a few other forensic institutes around the world report the majority of their conclusions with a verbal likelihood ratio interval scale; see Table 1 for the NFI version and the English translation (Buckleton & Champod, 2006). Judges and other legal experts often rely on the conclusions in forensic reports as a basis for their legal decision(s) (Lund & Iyer, 2017).

Table 1.
Verbal likelihood ratio interval scale used at the NFI

Dutch verbal expression         English translation              Order of magnitude LR
Ongeveer even waarschijnlijk    Approximately equally probable   1-2
Iets waarschijnlijker           Slightly more probable           2-10
Waarschijnlijker                More probable                    10-100
Veel waarschijnlijker           Appreciably more probable        100-10,000
Zeer veel waarschijnlijker      Far more probable                10,000-1,000,000
Extreem veel waarschijnlijker   Extremely more probable          > 1,000,000

This chapter will discuss the history of the likelihood ratio interval scale, the theory behind the scale and its potential drawbacks. The aim of the current study is discussed at the end of the chapter.


1.1 Verbal likelihood ratio interval scale

1.1.1 From verbal expressions to likelihood ratio

Traditionally the strength of evidence was reported as the probability of the hypothesis of the prosecution using a verbal scale. This verbal scale would range, for example, from ‘possible’ to ‘likelihood bordering certainty’, similar to the verbal expressions shown in Table 1 (de Keijser & Elffers, 2012). Verbal probability expressions are intuitive to most people and are thus often preferred over other expressions of probability such as frequencies or numeric probabilities (Burgman, Fidler, Mcbride, Walshe, & Wintle, 2006; Garthwaite, Kadane, & O’Hagan, 2005). However, the interpretation of verbal probability expressions often differs between people, both lay people and experts (Willems, Albers, & Smeets, 2019). One person might interpret the verbal expression ‘probable’ differently than another person (Burgman et al., 2006; Garthwaite et al., 2005; Hora, 2007). These different interpretations can lead to miscommunication between experts, both legal and forensic, and to misinterpretation of probability statements (Hora, 2007).

The forensic scientific community came under pressure in recent years to incorporate quantitative statements (such as probabilities or statistics) in the verbal scale to increase its scientific foundation (Thompson & Newman, 2015). Probabilities and statistics are hard to come by in most disciplines of forensic science because every criminal case is unique and very few quantitative data are available to base probabilities and statistics on. The interpretation of probability and statistical procedures in the frequentist approach to statistics is based on hypothetical infinite repetitions of the event of interest (Bolstad, 2007). Frequentist statistics does not apply to the forensic sciences because the event of interest (a crime) only occurs once and cannot be repeated. The Bayesian approach to statistics, in contrast, is highly applicable in the forensic sciences because its premise is not based on repetition of the event of interest. Likelihood ratios based on Bayes’ theorem were introduced in the forensic field more than a decade ago. The likelihood ratio (LR) is the ratio between two probabilities: that of the evidence given the hypothesis of the prosecution and that of the evidence given the hypothesis of the defense. Together with the prior odds, the LR can be used to determine which hypothesis is more likely. Numeric intervals of the LR were incorporated in a verbal scale to quantify the verbal statements which were traditionally used. This resulted in the verbal LR interval scale shown in Table 1.

1.1.2 Likelihood ratio approach

The LR approach is used internationally and in multiple disciplines of forensic science as a method to evaluate the evidential value of a piece of evidence (Buckleton & Champod, 2006; Lund & Iyer, 2017). The odds form of Bayes’ theorem is used as a framework in the LR approach to interpret the evidence probabilistically (Buckleton & Champod, 2006; Lucy, 2005):

    Pr(Hp | E) / Pr(Hd | E) = [Pr(E | Hp) / Pr(E | Hd)] × [Pr(Hp) / Pr(Hd)],

where Pr(...) stands for probability, Hp for the hypothesis of the prosecution, Hd for the hypothesis of the defense, E for the evidence, and ‘|’ for ‘given’. This equation is verbally expressed as:

    posterior odds = likelihood ratio × prior odds.

In theory the LR approach can be used to conclude which hypothesis is more likely, that of the defense or that of the prosecution, based on the prior odds and the LR (Lund & Iyer, 2017). The posterior odds represent this conclusion. The prior odds are usually determined by the judge or other decision makers and indicate the likelihood of the suspect being guilty given the case circumstances, before the evidence at hand has been considered. The forensic scientist is typically not involved in determining the prior odds.


The forensic scientist determines the ‘strength’ of a piece of evidence and expresses this with the LR. The LR is the ratio between two probabilities: the probability of the evidence given Hp and the probability of the evidence given Hd. When the LR is greater than one, the piece of evidence is more likely to be found given Hp than given Hd. For example, suppose a red fiber was found at a crime scene. A suspect (John) is arrested and John has a red jacket. A fiber from John’s red jacket will be compared to the red fiber found at the crime scene to investigate how similar they are. A simple example of hypotheses would be: Hp = the red fiber is from John’s red jacket, and Hd = the fiber is from a random other red item. The LR would thus be the ratio between (a) the probability of the observed similarity between the crime scene fiber and the fiber from John’s red jacket given that the crime scene fiber is from John’s red jacket and (b) the probability of that similarity given that the crime scene fiber is from a random other red item. The forensic expert assesses the evidence (the crime scene fiber and the jacket fiber in the example) and the hypotheses of the prosecution and the defense. He or she then estimates the LR and reports the corresponding verbal expression as shown in Table 1.
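The fiber example can be sketched numerically as follows; the two probabilities used here are invented purely for illustration and are not taken from the thesis:

```python
def likelihood_ratio(p_e_given_hp: float, p_e_given_hd: float) -> float:
    """LR = Pr(E | Hp) / Pr(E | Hd)."""
    return p_e_given_hp / p_e_given_hd

# Hypothetical judgment: the observed similarity is very probable if the
# fiber came from John's jacket, and rare if it came from a random other
# red item.
lr = likelihood_ratio(p_e_given_hp=0.95, p_e_given_hd=0.001)
print(round(lr))  # 950 -> the evidence is about 950 times more likely under Hp
```

With these illustrative numbers the LR would fall in the ‘appreciably more probable’ interval (100 to 10,000) of Table 1.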

Quantitative data to base probabilities on are not available in most forensic disciplines, which makes numerical calculation of the LR impossible (Martire, Kemp, Sayle, & Newell, 2014). There have been initiatives in recent years to make quantitative data available (Gill, 2018). However, in the absence of quantitative data, the LR is estimated by the expert, based on his or her experience and the relevant literature.

1.1.3 Potential drawbacks and limitations

The use of verbal probability expressions has some drawbacks. The variation in interpretations of verbal probability expressions between experts (and, to a lesser degree, within experts) is considerable (O’Hagan et al., 2006). Numeric expressions of probability have advantages over verbal expressions: 1) numbers provide greater distinction between strengths of evidence than verbal expressions do because only a limited number of verbal probability expressions exist, 2) experts who use the same numeric LR value want to convey the same message, so there is less variation in interpretation, and 3) when an expert has to use a number rather than a verbal expression to express his or her probability judgment, the expert is forced to make his or her thinking process explicit (Marquis et al., 2016).

The NFI expert presently reports a verbal LR with the accompanying range of LR values as shown in Table 1. When the expert estimates the LR to be near a threshold value, the expert has to choose which LR interval he or she will report. A brainstorm session at the start of the current study revealed that experts will often choose the lower interval because it is most favorable for the defendant and they are certain that the LR is at least of this magnitude. For example: an expert assesses the LR to be approximately 10,000. The expert now has the choice between the two intervals ‘appreciably more probable’ (LR is 100 to 10,000) and ‘far more probable’ (LR is 10,000 to 1,000,000). He or she will then usually choose the lower category ‘appreciably more probable’. This causes information loss because the conclusion now only indicates that the evidence is somewhere between 100 and 10,000 times more likely given Hp than given Hd: a considerably large range, especially when the expert first determined the LR to be approximately 10,000.

The use of an LR interval scale implies that the expert deems all LR values in the chosen interval equally probable, i.e., that the interval has a uniform distribution (Benétreau-Dupin, 2015; Nease & Owens, 1990). Experts could have more information on the probability of certain LR values than the uniform distribution and the broad LR intervals indicate, so this method can lose information on which LR values are more probable than others. A narrow range of values is more informative than a broad range of values (O’Hagan et al., 2006). For example, when someone has to make a judgment about Barack Obama’s age, a broad range of possible ages such as 10-100 would be accurate but not very informative; a narrower range such as 40-60 is far more informative. The NFI expert could, in principle, indicate whether the LR is at the lower or upper end of the interval, but the current method used at the NFI does not invite or accommodate such nuances.

Nuances to LR categories could be very useful. In some serious and complex criminal cases, it can be desirable to combine the individual LR intervals of multiple pieces of evidence into one LR interval. The threshold values of the LR intervals can be multiplied to calculate a combined LR interval. The resulting combined LR interval can be very broad, making interpretation difficult. For example, the combination of evidence assessed as ‘more probable’ (LR is 10 to 100) and evidence assessed as ‘far more probable’ (LR is 10,000 to 1,000,000) results in an interval of possible LRs from 100,000 to 100,000,000. This shows how quickly the interval of combined LRs can become immense.
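The interval combination described above amounts to multiplying the lower bounds together and the upper bounds together; a minimal sketch (the function name is mine, not from the thesis):

```python
def combine_lr_intervals(*intervals):
    """Combine independent LR intervals by multiplying the lower bounds
    and the upper bounds of the intervals."""
    low, high = 1.0, 1.0
    for lo, hi in intervals:
        low *= lo
        high *= hi
    return low, high

# 'More probable' (10-100) combined with 'far more probable' (10,000-1,000,000):
combined = combine_lr_intervals((10, 100), (10_000, 1_000_000))
print(combined)  # (100000.0, 100000000.0)
```

The combined interval spans three orders of magnitude, which illustrates how quickly such intervals become immense.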

1.1.4 Points of improvement

A narrow LR interval which accurately reflects the expert’s knowledge would limit information loss and is more informative than a broad LR interval. A narrower LR interval could be elicited, but elicitation of the underlying probability distribution of the LR would provide even more information than an interval alone. Information on which LR values are most and least probable becomes available when a probability distribution is specified. The elicitation of a probability distribution of the LR could have multiple advantages over the elicitation of a verbal probability expression and LR interval: (a) a verbal probability expression is no longer needed, (b) a narrower interval of possible LR values can be specified, and (c) nuances in the probability judgment of the LR become possible.


1.2 Aim of the study

To improve the current method used at the NFI for elicitation of the LR, an instrument was developed. The aim of the study was to develop an instrument which could be used to elicit a probability distribution of the LR from experts in the forensic field. A pilot study was conducted to evaluate the (a) face validity, (b) convergent validity, (c) usability, and (d) reliability of the instrument. Elicitation is a relatively small field of research, especially elicitation in the forensic field. To the author’s knowledge, elicitation of a probability distribution of the LR in the forensic field has not yet been reported.

The results of the study are presented in two phases. Chapter two discusses the first phase of the study, the development of the instrument. The instrument is presented in chapter three. In chapter four the pilot study performed to evaluate the instrument is discussed. Chapter five contains the discussion of the instrument and results of the pilot study. Chapter six is the conclusion.

2 Phase I: Development of the elicitation instrument

2.1 Literature review

A literature review was performed at the beginning of the study to investigate elicitation of probability judgments. The review started with an important book in the field of probability elicitation, “Uncertain judgements: Eliciting experts’ probabilities” by O’Hagan et al. (2006), and articles on expert probability elicitation provided by the supervisors. The reference lists of the book and articles were checked, and searches were done in web databases such as Web of Science with terms such as “elicitation” and “expert elicitation”. The literature review was used to define the construct of elicitation and to make decisions on the type of elicitation instrument and the elicitation methods. This section discusses the results of the literature review, defines the construct of expert probability elicitation, and discusses the choices for elicitation methods and elicitation instruments.

2.1.1 Expert probability elicitation

In expert probability elicitation the expert’s belief about the quantity of interest is translated to a probability statement such as a probability distribution (Colson & Cooke, 2018; Dias, Morton, & Quigley, 2018; Wolfson, 2015). The purpose of elicitation is to structure the thinking process of the expert in such a way that the expert comes to a rational and well thought through probability judgment such that the probability distribution accurately represents the expert’s knowledge and the uncertainty about the quantity of interest (O’Hagan et al., 2006). In elicitation three primary roles can be identified, that of the decision maker, the expert, and the facilitator (Garthwaite et al., 2005; O’Hagan et al., 2006; Oakley, 2010). The decision maker makes a decision based on the outcome of the elicitation process. In forensic science this is often a legal expert such as a judge. The facilitator is the one who performs the elicitation and guides the expert through the elicitation process (O’Hagan et al., 2006). A person whose knowledge the decision maker wishes to elicit is called the expert (Burgman et al., 2006; Garthwaite et al., 2005). In the forensic field the expert is the person who evaluates the results of the forensic analyses.

The quantity of interest differs per discipline in the forensic sciences. In some disciplines the experts directly estimate the LR, while in other disciplines the experts estimate the two probabilities underlying the LR [Pr(E | Hp) and Pr(E | Hd)]. In yet other disciplines the experts estimate only the probability that the evidence is found given the hypothesis of the defense [Pr(E | Hd)], because the probability that the evidence is found given the hypothesis of the prosecution is certain and thus 1. The LR is then one over the rarity: the expert estimates how rare the piece of evidence is. It was decided that the elicitation instrument should be able to handle these three different scenarios such that the elicitation instrument can be used in most of the disciplines within the NFI.
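The three scenarios the instrument has to support can be sketched as follows (a minimal illustration with invented numbers; the function names are mine, not taken from the web app):

```python
def lr_from_two_probabilities(p_e_given_hp: float, p_e_given_hd: float) -> float:
    """Scenario 2: the expert estimates both probabilities underlying the LR."""
    return p_e_given_hp / p_e_given_hd

def lr_from_rarity(p_e_given_hd: float) -> float:
    """Scenario 3: Pr(E | Hp) = 1, so the LR is one over the rarity."""
    return 1.0 / p_e_given_hd

# Scenario 1 is the direct estimate of the LR itself; scenarios 2 and 3:
print(round(lr_from_two_probabilities(0.9, 0.003)))  # 300
print(round(lr_from_rarity(0.01)))                   # 100
```

In scenario 3 the expert only has to judge how rare the evidence is: evidence seen in 1% of the alternative population yields an LR of 100.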

2.1.2 Elicitation method

Elicitation of probability judgments usually follows four steps: 1) elicitation of probabilities or summary statistics, 2) fitting of a parametric distribution to the elicited probability judgments, 3) feedback to the expert based on the fitted parametric distribution, and 4) assessment of the fitted distribution by the expert (O’Hagan et al., 2006). The four steps are a natural way of improving the accuracy of the probability judgment (O’Hagan et al., 2006; Oakley, 2010). The elicitation method is the method used for eliciting the probability judgment(s) from the expert, i.e., step one of the four steps. We chose to integrate these four steps in the web app to improve the accuracy of the probability judgments. This also provides a natural guide for the elicitation session, which further improves its structured nature.

Numerous methods for the elicitation of probabilities or summary statistics exist, all with their own advantages and disadvantages. Little research has been done to compare and validate these methods (Johnson, Tomlinson, Hawker, Granton, & Feldman, 2010). There is therefore no generally best elicitation method. Three of the most popular elicitation methods are: 1) the P and V method (Hora, 2007), 2) the trial roulette method (Gore, 1987), and 3) the bisection method (Hora, 2007; Oakley, 2010). During a brainstorm session at the start of the current study, some NFI experts indicated that they found the bisection method and the P and V method arduous. We therefore chose the trial roulette method as the elicitation method to be used in the instrument. The advantages of this method are 1) that it allows the expert to estimate the probabilities without actually having to state them and 2) the direct visual feedback to the expert (Veen, Stoel, Zondervan-Wijnenburg, & van de Schoot, 2017).


2.1.3 Elicitation instrument

Elicitation can be approached in many ways; for example, elicitation can be done with an interview or a questionnaire. Questionnaires are low in cost and easy to execute but come at the cost of the quality of responses (O’Hagan et al., 2006). Interviews are thus preferred unless the questions in the questionnaire are simple and the experts have a high level of commitment. A formal and structured approach to elicitation will convince reviewers of the accuracy and meaningfulness of the expert judgment (Hora, 2007). A formal and structured approach is thus appropriate when the probability judgment will be reviewed by others and when it can be challenged in court. The elicitation instrument for the NFI should be formal and structured but should still have enough flexibility to adjust to the unique features of every criminal case. Furthermore, questionnaires are not an appropriate elicitation instrument for the NFI because the quality of responses could be low. It was decided that a web app would be an appropriate elicitation instrument. A web app allows for a structured approach to elicitation, can be used in a formal setting, can be easily distributed within the NFI, provides immediate (graphical) feedback, and can produce standardized reports which are comparable across criminal cases. Furthermore, a web app can be developed in such a way that it allows for the unique features needed for the different disciplines within the NFI.

A few web apps for the elicitation of probability distributions are available and have some good qualities (Morris, Oakley, & Crowe, 2014; Veen et al., 2017). However, these web apps 1) were not specifically made for the forensic field, 2) do not give the choice between the elicitation of the LR or the two underlying probabilities of the LR, and 3) do not allow for the combination of independent probability distributions.


2.2 Method

2.2.1 Participants

Feedback sessions were held separately with three supervisors and two experts of the NFI. The two NFI experts were recruited via the supervisors at the NFI.

2.2.2 Materials

The web app was developed in R (R Core Team, 2019) version 3.6.1 with the package Shiny (Chang et al., 2019). Shiny allows for the development of interactive documents such as web apps (a plethora of examples can be found at https://shiny.rstudio.com/gallery/). The interactive plot for the trial roulette method was made using code from the R package SHELF (Oakley, 2019) and the R code from SHELF’s web app for elicitation from single experts. The R packages fitdistrplus (Delignette-Muller & Dutang, 2015) and rriskDistributions (Belgorodski, Greiner, Tolksdorf, & Schueller, 2017) were used to fit probability distributions.

2.2.3 Procedure

Once the literature review was concluded and decisions had been made on which elicitation method would be most appropriate and which kind of instrument to develop, the development of the web app began. The web app was developed in an iterative process of programming periods and feedback sessions. The feedback given during the feedback sessions was incorporated into the web app until the web app was deemed satisfactory by the supervisors and NFI experts. The feedback sessions with the supervisors and one of the NFI experts were informal and unstructured: the web app and its features were demonstrated and points of improvement and ideas were discussed. Two structured feedback sessions were held with a second NFI expert. During these sessions the expert was asked to use the web app as they would for a criminal case. There was room for questions and feedback during these sessions. At the end the expert was asked to fill in the System Usability Scale (further discussed in chapter four) and was asked whether the probability distribution of the LR in the web app represented the distribution the expert had in mind. The answers to these questions were used to discuss points of improvement and new ideas which could be incorporated in the web app.

3 Elicitation using the web app

3.1 Preparation of elicitation session

Before an elicitation session is held, some preparation is required. There is broad scientific consensus on the most important stages of the elicitation procedure (Burgman et al., 2006). These stages are: 1) definition of the problem, 2) identification and recruitment of experts, and 3) the elicitation session. The first and second stages can be considered preparation for the elicitation session. The preparation of an elicitation session at the NFI will often consist only of the definition of the problem, because experts are identified earlier in the forensic process. The definition of the problem involves the definition of the two hypotheses such that the expert can make a probability judgment about them.

Next to the preparation of the elicitation session, it is good practice to follow some general guidelines during the elicitation session to avoid the most common biases in elicitation, such as overconfidence and anchoring (Burgman et al., 2006; Johnson et al., 2010). The guidelines for elicitation with the web app, based on the guidelines by Johnson et al. (2010), are: (a) the facilitator should give clear instructions throughout the elicitation with the help of the written questions and guidelines in the web app, (b) the expert is provided with some practice elicitation tasks before the elicitation of the LR, and (c) the facilitator asks the expert to make a summary of his or her evidence base. Practice elicitation tasks are useful to train the expert in the protocol of the elicitation task, to overcome incoherence in probability judgments, and to work on overconfidence of the expert (Dias et al., 2018; Garthwaite et al., 2005). Practice tasks can, for example, consist of almanac questions or questions about the distance between cities (Hora, 2007). It is good practice to make a summary of the evidence and theory on which the expert bases his or her probability judgment (Dias et al., 2018). This summary serves a dual purpose: a record is kept of the evidence base, and the expert has an opportunity to recall all the available information (O’Hagan et al., 2006). The web app provides the opportunity to make notes, which are included in the final report of the elicitation.

3.2 Example case

In this manuscript the web app is presented using an example criminal case from Nordgaard, Ansell, Drotz and Jaeger (2012): “A red Volvo ran into a blue Saab and the Volvo left the scene before the police arrived. The damaged front left wing area of the blue Saab was investigated and a red paint flake was recovered. Later the police found a red Volvo in a parking place with scratches on the right front door. Comparison paint from the damaged area of the door was collected.” (Nordgaard et al., 2012, p. 18). Both paint samples were analyzed and the paint sample found on the blue Saab was compared to the paint sample from the red Volvo. Two hypotheses were formulated in order to estimate the LR. Hp: “The red multilayered paint flake recovered from the blue Saab originates from the red Volvo” and Hd: “The red multilayered paint flake recovered from the blue Saab originates from another car” (Nordgaard et al., 2012, p. 18). The LR cannot be computed numerically and is thus estimated by the expert. The expert states the following: “The probability of obtaining matching results if the proposition Hp is true is considered very large. The probability of obtaining matching results if the alternative proposition Hd is true is considered to be smaller than if the paint flake had consisted of only industrial coating due to the repaint layer.” (Nordgaard et al., 2012, p. 19). The web app is used to determine an LR for this case. This was done by the author of the current research and not by a forensic expert; the main purpose of the example case is to show the workings of the web app, not to determine a real LR.


3.3 Step 1: Elicitation of probabilities

The first screen (after the welcome screen) in the web app is the elicitation of the

probability judgments, see Figure 1 for the screen before use. The probability elicitation is done using the trial roulette method. The expert is guided through the steps of this method by use of numbered statements and questions. The first question the expert should answer is which

quantity of interest will be elicited. There are three choices: (a) elicitation of the LR, (b) separate elicitation of the numerator Pr(E | Hp) and the denominator Pr(E | Hd) of the LR, and (c)

elicitation of the denominator Pr(E | Hd) when the numerator equals one. Next, the expert is

asked for the practical boundaries of the quantity of interest. The scale of the x-axis can be set to a logarithmic or linear scale and by adjusting the number of columns the values of the x-axis can be manipulated. When the expert is satisfied with his or her choices, he or she is asked to make a histogram of the quantity of interest. It is explained in the instruction that every ‘chip’ or block in the histogram represents the probability of the values of the column on the x-axis. For example, when only 1 chip is placed in the column 10-100 there is a probability of 100% that the quantity of interest lies within these boundaries. The probability of each chip is shown at the top of the plot. The expert is asked to place the chips such that the histogram represents his or her

probability judgment of the quantity of interest. See Figure 2 for the example elicitation session. The option to elicit the two probabilities separately was chosen. The x-axis values were adjusted and the two probabilities were estimated; the probability of the numerator Pr(E | Hp) was estimated to be close to 1.


Step 2: Fitting of parametric distribution

The expert makes a histogram of the LR, probabilities, or rarity, using the trial roulette method in the first step. A parametric probability distribution is fitted to this histogram in step two. In order to do this a dataset of values based on the histogram is made. The dataset consists of means of the columns that contain chips. For every chip in a column the mean is added to the dataset. Depending on the scale of the x-axis this is either the arithmetic mean (linear x-axis) or the logarithmic mean (logarithmic x-axis). For example, in Figure 2 the probability of the numerator Pr(E | Hp) is deemed to be most likely between 0.9 and 1. Four chips were placed in

the column 0.95 to 1. The histogram was on a linear scale thus the arithmetic mean will be taken, which is approximately 0.975. The mean of the column is added to the dataset for each chip, which means 0.975 will be present four times in the dataset because there were four chips placed in this column.
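The chip-to-dataset conversion described above can be sketched as follows. This is an illustrative Python reimplementation, not the app's actual R code; the column bounds and chip counts are taken from the worked example.

```python
import math

def chips_to_dataset(columns, log_scale=False):
    """columns: list of (lower, upper, n_chips) tuples, one per histogram column.
    Returns a dataset with one column mean per chip, as described in step two."""
    data = []
    for lo, hi, n in columns:
        if log_scale:
            # logarithmic x-axis: use the geometric (logarithmic) mean
            mean = math.sqrt(lo * hi)
        else:
            # linear x-axis: use the arithmetic mean
            mean = (lo + hi) / 2
        data.extend([mean] * n)
    return data

# Four chips in the column 0.95 to 1 on a linear scale: 0.975 is added four times
print(chips_to_dataset([(0.95, 1.0, 4)]))
```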

Figure 3 shows the two distributions which were fit to the example elicitation session in Figure 2. The function ‘fitdist’ (Delignette-Muller & Dutang, 2015) is used to fit the normal, lognormal, gamma, Weibull, uniform, and in case of the elicitation of two probabilities or rarity the beta distribution to the dataset. The expert can choose which probability distribution they deem appropriate, or, they can choose the best fitting probability distribution. The best fitting probability distribution is determined with the Akaike Information Criterion (AIC); the probability distribution with the lowest AIC has the best fit to the dataset. The minimum and maximum value of the plot can be adjusted if necessary. In the example the best fitting probability distributions were chosen.
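The "fit several candidates, keep the lowest AIC" logic can be sketched as below. This is a hedged Python/SciPy analogue of what the text describes 'fitdist' doing in R, not the app's code; the candidate list and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def best_fit_by_aic(data):
    """Fit several parametric distributions; return the name with the lowest AIC."""
    data = np.asarray(data, dtype=float)
    # location fixed at 0 for the positive-support distributions
    candidates = {
        "norm": {},
        "lognorm": {"floc": 0},
        "gamma": {"floc": 0},
        "weibull_min": {"floc": 0},
    }
    results = {}
    for name, fixed in candidates.items():
        dist = getattr(stats, name)
        params = dist.fit(data, **fixed)
        k = len(params) - len(fixed)          # number of free parameters
        loglik = np.sum(dist.logpdf(data, *params))
        results[name] = 2 * k - 2 * loglik    # AIC = 2k - 2 * log-likelihood
    return min(results, key=results.get), results

rng = np.random.default_rng(1)
sample = rng.normal(80, 10, size=500)
best, aics = best_fit_by_aic(sample)
print(best)  # name of the lowest-AIC candidate for this sample
```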


3.4 Step 3: Feedback on the parametric distribution

When a parametric probability distribution is fitted in step two, it implies certain probabilities which the expert did not specify. These probabilities are presented to the expert to improve the probability judgment (Burgman et al., 2006). The feedback on the probability distribution is based on the P and V method (Hora, 2007). The expert is first presented with the 1st, 50th, and 99th percentiles. The 1st and 99th percentiles are pointed out because they represent the boundaries of the quantity of interest: there is a low probability that the quantity of interest lies beyond them. The expert is reminded that the 50th percentile works like a coin flip: the quantity of interest has a 50% probability of being lower or higher than this percentile. The expert also has the option to see more percentiles (5th, 25th, 33rd, 66th, 75th, and

95th). The feedback for the example elicitation session can be seen in Figure 4. The expert is

presented with the percentiles on the right and figures on the left. The colors in the figures represent the 1st, 5th, 25th, 33rd, 50th, 66th, 75th, 95th, and 99th percentiles.
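Computationally, this feedback step reduces to evaluating the quantile function of the fitted distribution at the percentiles listed above. A minimal Python sketch (using SciPy rather than the app's R code; the Normal(mean = 80, sd = 10) distribution from the results chapter is used as an illustration):

```python
from scipy import stats

FEEDBACK_PROBS = (0.01, 0.05, 0.25, 0.33, 0.50, 0.66, 0.75, 0.95, 0.99)

def feedback_percentiles(dist, probs=FEEDBACK_PROBS):
    """Return the percentiles implied by a fitted (frozen) distribution."""
    return {p: float(dist.ppf(p)) for p in probs}

lr_dist = stats.norm(loc=80, scale=10)
pct = feedback_percentiles(lr_dist)

# The 50th percentile behaves like a coin flip: the quantity of interest has a
# 50% probability of lying below it and a 50% probability of lying above it.
print(round(pct[0.50], 1))  # 80.0
```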

3.5 Step 4: Adjusting the probability distribution

Based on the percentiles presented in step three the expert decides if the fitted probability distribution accurately describes her or his beliefs about the quantity of interest (the LR,

probabilities, or rarity). When this is not the case the expert should adjust the values of the percentiles. The functions ‘get.dist.par’ (where ‘dist’ is replaced with the appropriate probability distribution) from the package rriskDistributions (Belgorodski et al., 2017) are used to fit a new probability distribution to the percentiles. The same parametric probability distributions as in step two are fitted. The best fitting probability distribution is chosen as the probability distribution with the lowest convergence value. The expert is presented with the percentiles


based on the new probability distribution and can choose to adjust the percentiles once more. This process is repeated until the expert deems the probability distribution appropriate. For consistency, the final probability distribution is always fitted to the percentiles, even when the expert chose not to adjust them.
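Refitting a distribution to expert-adjusted percentiles, as the rriskDistributions functions do, amounts to a quantile-matching problem: find parameters whose quantiles are as close as possible to the target values. A hedged least-squares sketch in Python for the normal case (illustrative, not the package's implementation):

```python
import numpy as np
from scipy import stats, optimize

def norm_from_percentiles(probs, values):
    """Find the Normal(mu, sigma) whose quantiles best match the target percentiles."""
    probs, values = np.asarray(probs), np.asarray(values)

    def loss(theta):
        mu, sigma = theta
        q = stats.norm.ppf(probs, loc=mu, scale=abs(sigma))
        return np.sum((q - values) ** 2)  # squared quantile error

    res = optimize.minimize(loss, x0=[values.mean(), values.std() + 1.0])
    mu, sigma = res.x
    return mu, abs(sigma)

# Percentiles of a Normal(80, 10): 1st ~56.7, 50th = 80, 99th ~103.3
mu, sigma = norm_from_percentiles([0.01, 0.50, 0.99], [56.7, 80.0, 103.3])
print(round(mu), round(sigma))  # 80 10
```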

When the expert chose to either elicit the probabilities or rarity, the probability

distribution of the LR is made based on the final probability distributions of the probabilities or rarity. A dataset of random LR values is made by dividing 10,000 randomly drawn values of the final probability of the numerator Pr(E | Hp) by 10,000 randomly drawn values of the final

probability distribution of the denominator Pr(E | Hd). When eliciting the rarity, the numerator

values are all 1. A probability distribution is then fitted to the dataset of LR values in the same manner as in step two, with the function ‘fitdist’ (Delignette-Muller & Dutang, 2015). The expert can choose the best fitting probability distribution or another parametric probability distribution if desired. Percentiles of the final probability distribution of the LR are shown but are not adjustable. If the expert is unhappy with the final probability distribution of the LR, the expert should reassess the probability distributions of the probabilities or rarity. Figure 5 shows the probability distribution of the LR based on the probability distributions of the two separate probabilities without adjustment of the percentiles.
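The Monte Carlo construction of the LR distribution described above can be sketched as follows; the two beta distributions stand in for hypothetical elicited numerator and denominator distributions, not the example case's actual fits.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000

# draws from the final distribution of the numerator Pr(E | Hp): close to 1
num = stats.beta(20, 1).rvs(n, random_state=rng)
# draws from the final distribution of the denominator Pr(E | Hd): small
den = stats.beta(2, 100).rvs(n, random_state=rng)

lr_values = num / den  # 10,000 random LR values
# when eliciting rarity, the numerator draws would all be 1: lr_values = 1 / den

print(lr_values.shape, bool(np.median(lr_values) > 1))
```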


3.6 Combining the distributions

The web app also facilitates combining the probability distributions of multiple (independent) LRs. This is not part of the elicitation session but can be done after multiple elicitation sessions are completed and multiple probability distributions are obtained. In theory, the probability distributions of the LRs can be used to calculate the posterior odds for multiple pieces of evidence. The formula for the posterior odds shown in chapter one is extended to:

combined posterior odds = LR1 × LR2 × … × LRp × prior odds.

Combined posterior odds can thus be calculated in theory by multiplying the LRs and the prior odds; likewise, a combined LR can be calculated by multiplying the individual LRs. It is then the task of the legal decision makers to use the combined LR to arrive at the combined posterior odds.

A dataset of the probability distribution can be downloaded after the elicitation session is completed. This dataset consists of 100,000 random LR values drawn from the final probability distribution of the LR. The dataset is used to combine the probability distributions of several independent LRs into one probability distribution for the combined LR. The user uploads the dataset for each probability distribution and the web app multiplies the LR values to calculate combined LR values. A new probability distribution is fitted to the combined LR values in the same manner as described in step two. The user is presented with the combined probability distribution (e.g., a normal distribution), its mean and standard deviation, and a plot of the probability distribution.
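The combination step can be sketched as an elementwise product of the uploaded LR samples followed by a refit. The two lognormal inputs below are hypothetical elicited distributions; the product of independent lognormal variables is itself lognormal, which makes the behavior easy to check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000

# 100,000 LR values per uploaded dataset (hypothetical elicited distributions)
lr1 = stats.lognorm(s=0.5, scale=50).rvs(n, random_state=rng)  # median 50
lr2 = stats.lognorm(s=0.3, scale=4).rvs(n, random_state=rng)   # median 4

combined = lr1 * lr2  # combined LR values, multiplied elementwise

# refit a distribution to the combined values (cf. step two)
s, loc, scale = stats.lognorm.fit(combined, floc=0)
print(round(scale))  # median of the combined LR, close to 50 * 4 = 200
```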

4 Phase II: Pilot study

The web app was evaluated for face validity, convergent validity, usability, and reliability in a similar fashion to Johnson et al. (2010). Face validity is when the method appears to measure what it intends to measure (Johnson, Tomlinson, Hawker, Granton, Grosbein, et al., 2010). The web app has face validity when the NFI experts believe that it makes a good representation of their probability judgment of the LR. Convergent validity is when the method correlates with other methods which measure the same construct (Johnson, Tomlinson, Hawker, Granton, Grosbein, et al., 2010). The web app should measure the expert’s probability judgment of the LR. The method used at the NFI to measure the probability judgment of the LR is the verbal LR interval scale. Therefore, the final probability distribution of the LR obtained with the web app should fall within the verbal LR interval for the same criminal case. The ease of use of the web app is referred to as usability. Reliability refers to consistency in measurement (Gregory, 2011). Reliability of the web app thus refers to the consistency of the final probability distribution of the LR. This chapter discusses the method and results of the pilot study.

4.1 Method

4.1.1 Participants

A total of eight NFI experts were approached to participate in the pilot study via email. Some experts were first approached by a supervisor to see if they would be interested in participating. The experts who were invited to participate in the pilot study were not randomly selected. They were selected for their expertise in their respective forensic disciplines and for their understanding and interest in statistics. They were either recommended by one of the supervisors or one of the participating experts. The following characteristics of the participants were collected: gender, field of expertise, and years of experience.

4.1.2 Materials

The expert should base his or her probability judgment of the LR on the report of an old criminal case. To minimize the anchoring bias the experts were asked beforehand to contact a colleague who could choose a report of an old criminal case. This report includes information on


the background of the case, the forensic analyses of the evidence, and a verbal LR conclusion with its appropriate LR interval. The colleague was asked to blind the verbal LR conclusion given in the report to prevent the anchoring bias. This meant the expert could review the case and the results but not the verbal LR conclusion and the numeric LR interval.

The web app was evaluated with a questionnaire which measured face validity and usability. Face validity was measured with the question: “Do you feel the final probability distribution which you made with the web app was a good representation of the LR?” (Dutch: “Had je het gevoel dat de uiteindelijke kansverdeling van de LR die je hebt gemaakt met de web app een goede representatie was van de LR?”). The question could be answered with “Yes”, “No” or “Different because: …”. Usability of the web app was measured with the System Usability Scale (SUS) and the time it took to finish the elicitation task.

The SUS is a 10-item standardized questionnaire to measure perceived usability which has been shown to be valid and reliable (Lewis & Sauro, 2018). The items are scored on a 5-point Likert scale. The SUS provides a score from 0 to 100, with 0 representing very poor perceived usability and 100 representing excellent usability; how to score the SUS can be found in Brooke (1996). The adjusted version of the SUS suggested by Bangor, Kortum, and Miller (2008) was used in the current study.
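Brooke's (1996) SUS scoring rule can be written out as a short function. This is the standard published scoring scheme, shown here as an illustration rather than the exact script used in the study:

```python
def sus_score(responses):
    """Score ten SUS Likert responses (1-5, in item order 1..10) on a 0-100 scale."""
    assert len(responses) == 10
    total = 0
    for item, r in enumerate(responses, start=1):
        # odd items contribute (response - 1), even items contribute (5 - response)
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5  # 40 raw points scaled to 0-100

# all-neutral answers give the midpoint score
print(sus_score([3] * 10))  # 50.0
```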

4.1.3 Procedure

The elicitation session started with a practice exercise to minimize overconfidence and to familiarize the expert with the web app. The expert was asked to estimate the distance between two Dutch cities, The Hague and Eindhoven (~144 km by car or ~100 km as the crow flies). The expert then used the web app, with verbal guidance from the researcher, to make a probability distribution of what they thought was the distance between The Hague and Eindhoven. The researcher thus acted as the facilitator. The verbal guidance always followed the written guidance


in the web app as outlined in chapter 3. During the practice task there was also room to ask questions. Next, the elicitation task started. The elicitation task was to make a probability distribution for the LR of the old criminal case prepared by a colleague. Before the elicitation task started, the expert was asked to give a summary of the case and the results of the analysis. The expert used the web app with verbal guidance to make a probability distribution of the LR they deemed appropriate for the case. The verbal guidance followed the written guidance in the web app. After the elicitation task ended, the expert was asked to fill in the questionnaire. To measure test-retest reliability the evaluation session was repeated one to two weeks later. The same procedure was followed for the second session, with the exception of the practice exercise.

4.1.4 Statistical analysis

Face validity was measured by the percentage of experts who chose “Yes” to the question “Do you feel the final probability distribution which you made with the web app was a good representation of the LR?”. The SUS final score was calculated for each participant according to the rules of the SUS scale (Brooke, 1996). The mean and standard deviation of the final SUS scores were calculated in Microsoft Excel.

Statistical analyses for convergent validity and test-retest reliability were approached in the same manner. For convergent validity the probability distribution of the LR from the first session was compared to the verbal LR interval as shown in Table 1. The LR interval of the criminal case was treated as a uniform distribution. For reliability the probability distribution of the LR from the first session was compared to that of the second session. The two-sample Kolmogorov-Smirnov (KS) test was used to compare the probability distributions. This was done by drawing a sample of 100,000 random values from each probability distribution. The KS test tests the null hypothesis that the two samples have the same underlying probability distribution (Massey, 1951). The KS test was done with R (R Core Team, 2019) version 3.6.1.
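The comparison described above can be sketched with SciPy's two-sample KS test (the study used R). The Normal(mean = 80, sd = 10) elicited distribution and the uniform(10, 100) interval follow expert A's row in Table 3, and the sample size matches the 100,000 draws used in the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# sample from the elicited LR distribution and from the LR interval,
# the latter treated as a uniform distribution
elicited = stats.norm(loc=80, scale=10).rvs(n, random_state=rng)
interval = stats.uniform(loc=10, scale=90).rvs(n, random_state=rng)  # uniform(10, 100)

stat, p = stats.ks_2samp(elicited, interval)
# with samples this large, even modest differences are highly significant
print(round(stat, 2), p < 0.01)
```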


4.2 Results

4.2.1 Descriptive statistics

Four NFI experts agreed to participate in the pilot study. The characteristics of the four experts are shown in Table 2. Two experts used old criminal cases which were blinded by colleagues, one expert used an old criminal case which was not blinded, and one expert used an Australian criminal case. The Australian case was unique; more information about this case can be found in Mitchell et al. (2019). The evidence given Hd in this case was so unlikely that no LR had been given previously; it was formerly estimated that the probability of the evidence given the hypothesis of the defense would be near zero.

Table 2.

Characteristics of the experts who participated in the evaluation.

Characteristics Participants (N = 4)

Male, n (%) 3 (75%)

Field of expertise

DNA, n (%) 1 (25%)

Firearms and toolmarks, n (%) 1 (25%)

Micro traces, n (%) 2 (50%)

Years of experience, median (range) 9 (6-11)

Method of elicitation

LR, n (%) 2 (50%)


4.2.2 Face validity

Face validity was measured by asking the experts the question: “Do you feel the final probability distribution which you made with the web app was a good representation of the LR?”. Three out of the four experts (75%) answered the question with “Yes”; the fourth expert chose “Different” and gave a textual explanation. This expert stated that the LR interval of the final probability distribution in the web app was very broad, and questioned whether the web app would succeed in producing a narrow LR interval and whether it would be a good addition to the current method used at the NFI.

4.2.3 Convergent validity

Convergent validity was measured by comparing the probability distribution of the LR the expert made in the first elicitation session to the LR interval given in the report of the criminal case. The LR interval was treated as a uniform distribution. The probability distributions obtained in session one and the LR interval conclusions are shown in Figure 7. The criminal case of expert B had no formal LR conclusion, so convergent validity could not be measured for this expert. The results of the KS tests are shown in Table 3. The KS test was significant for all experts, indicating that for no expert were the two probability distributions identical. The median and range of the session-one probability distributions fall approximately within the range of the LR conclusions for experts A and D. For expert C the range of the session-one probability distribution roughly overlaps with the LR conclusion; the right half falls outside the range of the LR conclusion, with a maximum possible LR value of 170. However, the median falls within the range of the LR conclusion.


Table 3.

Results of convergent validity tests.

Expert A: KS = 0.55 (p < 0.01)
  Session 1: Normal (mean = 80, sd = 10), median (range) = 80 (38, 125)
  LR conclusion: More probable, uniform (10-100), median (range) = 55 (10, 100)

Expert B: KS not computed (no formal LR conclusion)
  Session 1: Lognormal (meanlog = 6.34, sdlog = 1.5), median (range) = 1765 (0.8, 261725)
  LR conclusion: -

Expert C: KS = 0.64 (p < 0.01)
  Session 1: Lognormal (meanlog = 4.53, sdlog = 0.151), median (range) = 93 (49, 170)
  LR conclusion: More probable, uniform (10-100), median (range) = 55 (10, 100)

Expert D: KS = 0.61 (p < 0.01)
  Session 1: Weibull (shape = 6.28, scale = 3.95), median (range) = 4 (1, 6)
  LR conclusion: Slightly more probable, uniform (2-10), median (range) = 6 (2, 10)

4.2.4 Usability

Usability of the web app was measured with the SUS questionnaire and by measuring the time it took to complete the elicitation task. The first session took around 30 minutes for all four experts; the second session took around 15-20 minutes. The SUS had a mean score of 69.36 (SD = 6.57).

Failures

Usability is also determined by the number of successes and failures of the program (Bangor et al., 2008). The pilot study revealed some failures of the web app which had not surfaced during the development phase. Expert B wanted to estimate a very small probability with a large range of possible values (0.000001 < Pr() < 0.001). This range was too large for the web app and resulted in errors. The expert chose to settle for a smaller range of values in order to continue the session; the failure thus influenced the probability judgment and decreased usability.

The most frequent failure was of a different nature and related to how the final probability distribution is fitted. The final probability distribution is fitted on the percentiles of the probability distribution fitted to the probability judgments (steps one and two). This was done so that the expert can adjust the percentiles if necessary. When the expert does not adjust the percentiles, the final probability distribution should match the probability distribution fitted to the histogram. The pilot study revealed that these two probability distributions are often similar but not identical. This is most likely due to the use of two different functions and R packages to fit a probability distribution to: 1) the dataset of the histogram, and 2) the percentiles. An example of this failure can be seen in Figure 8, taken from expert C in the second session. Expert C chose the best fitting probability distribution in step two, which was the gamma distribution (shape = 27.1, rate = 8.07). It is expected that the final best fitting probability distribution, fitted to the percentiles, would be the same gamma distribution. However, the final best fitting probability distribution fitted to the non-adjusted percentiles was a lognormal distribution (meanlog = 1.10, sdlog = 0.18). This failure can be avoided by not choosing the best fitting distribution for the final distribution but the same parametric distribution which was fitted to the histogram; however, the parameters of the probability distributions can still differ slightly. Although none of the experts noticed this failure, it causes inconsistency and therefore decreases the usability.


Figure 8. Probability distributions fitted in the feedback screen of the web app to the same elicited data.

4.2.5 Reliability

Reliability was measured by repeating the elicitation session after one to two weeks. The two probability distributions for each expert are shown in Table 4 and visualized in Figure 9. The two probability distributions were compared with the KS test; results are shown in Table 4. The KS test was significant for all experts, which indicates that for no expert was the probability distribution of the second session identical to that of the first session. The medians and ranges of the probability distributions of session one and session two were comparable for all experts except for the medians of the probability distributions of expert B. The probability distributions of experts A and C were both broader in the second session than in the first session. The median and range of the probability distributions of expert D are very similar.


Table 4.

Results of test-retest reliability tests.

Expert A: KS = 0.26 (p < 0.01)
  Session 1: Normal (mean = 80, sd = 10), median (range) = 80 (38, 125)
  Session 2: Normal (mean = 73, sd = 14.2), median (range) = 73 (8, 144)

Expert B: KS = 0.85 (p < 0.01)
  Session 1: Lognormal (meanlog = 6.34, sdlog = 1.5), median (range) = 1765 (1, 261725)
  Session 2: Lognormal (meanlog = 9.58, sdlog = 0.8), median (range) = 14555 (405, 565112)

Expert C: KS = 0.36 (p < 0.01)
  Session 1: Lognormal (meanlog = 4.53, sdlog = 0.151), median (range) = 93 (49, 170)
  Session 2: Lognormal (meanlog = 4.69, sdlog = 0.243), median (range) = 109 (37, 314)

Expert D: KS = 0.24 (p < 0.01)
  Session 1: Weibull (shape = 6.28, scale = 3.95), median (range) = 4 (1, 6)
  Session 2: Gamma (shape = 27.1, rate = 8.07), median (range) = 3 (1, 7)

5 Discussion

The aim of the current study was to develop an instrument which can be used for elicitation of the probability distribution of an LR in the forensic field and to perform a pilot study to evaluate the instrument. The web app was developed in increments of feedback sessions and programming periods in R with the package Shiny (Chang et al., 2019; R Core Team, 2019). The web app follows four steps of elicitation: 1) probability judgments are elicited with the trial roulette method (Gore, 1987), 2) a parametric distribution is fitted to the probability judgments, 3) percentiles of the parametric probability distribution are presented to the expert based on the P and V method (Hora, 2007), and 4) the expert can adjust the percentiles until the expert deems the probability distribution an appropriate representation of his or her probability judgment.


A pilot study was conducted to evaluate the face validity, convergent validity, usability, and reliability of the web app. The majority of the experts (75%, n = 4) deemed the web app to have face validity. The one expert (25%) who gave a textual explanation instead was expert B, and the explanation related to the failure of the web app in the unique Australian case. From this result it can be concluded that the web app has face validity. According to Gregory (2011), face validity is “a matter of social acceptability and not a technical form of validity”. If the web app lacked face validity it might not be taken seriously by the experts. However, face validity is not a technical form of validity, so it cannot be concluded that the web app is valid based on face validity alone.

Convergent validity, a more technical form of validity, was also assessed during the pilot study. To assess convergent validity, the probability distribution of the LR made with the web app was compared to the verbal LR interval conclusion for the same case. The verbal LR interval conclusion was treated as a uniform distribution. The KS test showed that none of the probability distributions were identical. However, the probability distributions of experts A and D fall within the range of the verbal LR interval. The probability distribution of expert C is a typical example where the LR is deemed to be around a threshold value: the probability distribution allows for nuance where the verbal LR interval does not, and the distribution therefore does not fall entirely within the verbal LR interval. There was no formal conclusion for the criminal case of expert B, so no comparison could be made for this expert. Based on the three remaining experts, a cautious conclusion can be drawn that the web app has convergent validity.

Usability was measured by the time it took the expert to complete the elicitation session and with the SUS. The first elicitation session took longer (30 minutes) than the second (15-20 minutes). This difference is most likely due to the explanation and the elicitation practice exercise in the first session. The experts were not asked what they thought of the time it took to complete the elicitation task, and, to the author's knowledge, no guidelines on elicitation session length are given in the literature. However, 15 to 20 minutes seems an acceptable length. The SUS score (M = 69.36, SD = 6.57) was ‘OK’ (< 71) and acceptable according to Bangor, Kortum, and Miller (2008): a SUS score is acceptable when it is above the average SUS score of 68 and is deemed ‘Good’ when it is higher than 71. The pilot study revealed two failures of the web app: 1) the inability to handle large ranges of possible values, and 2) a discrepancy between the probability distribution fitted in the second step and the final probability distribution. These failures decrease the usability of the web app.

Test-retest reliability was measured by repeating the elicitation task one to two weeks after the first. The unique criminal case of expert B again showed during the second session that the web app cannot handle large ranges of values, and expert B had to compromise his probability judgment; the probability distributions of expert B are therefore very dissimilar. The ranges and medians of the probability distributions of the first and second sessions of the other three experts were comparable, although the probability distributions of experts A and C had a wider range in the second session than in the first. This suggests that the experts might have been less certain of their probability judgment during the second elicitation session. It could also indicate that the experts adjusted their probability judgment such that they were more certain it would overlap with the first elicitation session. The KS test was significant for all four experts, meaning the probability distributions of the second session were not identical to those of the first session. However, the KS test was done with 100,000 random values, which can lead to significant differences even when the differences are small. When the distributions, ranges, and medians are compared visually, it can be concluded that the probability distributions may not be identical but are very similar. It is thus concluded that the pilot study showed test-retest reliability of the web app, although two of the four experts were less certain of their probability judgment in the second elicitation session.

The web app seems to have face validity, test-retest reliability, and convergent validity. Usability of the web app is acceptable according to the SUS and the duration of the elicitation session. However, the failures of the web app should be resolved before it can be deemed to have good usability. The results of the convergent validity assessment show that experts may have more information on the probability distribution of the LR than the uniform probability distribution of the verbal LR interval scale indicates. The web app allows for nuances in the probability judgment of the LR which the verbal LR interval scale cannot express. This was one of the goals of the web app, and it has thus been achieved.

5.1 Limitations

The interpretation of the results of the pilot study should take into account that only four experts participated. One of these experts (expert B) worked with a criminal case that might not be representative of the work they do on a regular basis. This case also had no formal verbal LR interval conclusion, which meant that convergent validity could not be measured. This unique case did show that the web app could not handle large intervals of values. The criminal cases should have been blinded by colleagues, but this did not happen in one case, which could have influenced that expert's probability judgment through the anchoring bias: an expert who knew the verbal LR interval conclusion could have anchored the probability judgment on the LR interval. Statistical knowledge of the experts was assumed to be adequate but was not tested; it could be that the concept of probability distributions was difficult for some of the experts, which could in turn have influenced the results.


5.2 Recommendations for future research and development

The web app contains some failures which have to be resolved before it can be used for elicitation of the LR within the NFI. This can be done by adjusting the feedback screen in the web app. Probability distributions are fitted in two different manners in the web app; using the same method of fitting a probability distribution could improve the consistency of the web app and resolve this failure. This means that either the trial roulette method or the P and V method should be used in both step one and step four. The web app also cannot handle large intervals at the moment; options should be explored to broaden the range of values the web app can handle. Non-parametric distributions were not considered in the current study because of the strengths of parametric distributions. A weakness of the web app is that the parametric distributions do not allow for probability distributions with left-sided tails. Non-parametric distributions could solve this problem and could thus be explored. After further development it is recommended that the web app be thoroughly evaluated and validated before it is put to use within the NFI.

6 Conclusion

The pilot study showed the potential and the limitations of the web app. It showed that: (a) the experts deem that the web app measures their probability judgment of the LR (face validity), (b) the web app accurately measures the LR (convergent validity), and (c) the web app allows the expert to produce reasonably consistent probability judgments (test-retest reliability). The web app is a good step towards a new and innovative method to quantify the conclusion of the LR as a probability distribution. However, more development and research are necessary before the web app can be used at the NFI for elicitation of the LR.


References

Bangor, A., Kortum, P. T., & Miller, J. T. (2008). An empirical evaluation of the system usability scale. International Journal of Human-Computer Interaction, 24(6), 574–594.

Belgorodski, N., Greiner, M., Tolksdorf, K., & Schueller, K. (2017). rriskDistributions: Fitting Distributions to Given Data or Known Quantiles.

Benétreau-Dupin, Y. (2015). The Bayesian who knew too much. Synthese, 192(5), 1527–1542.

Bolstad, W. M. (2007). Introduction to Bayesian Statistics (2nd ed.). Hoboken, NJ: John Wiley and Sons.

Brooke, J. (1996). SUS-A quick and dirty usability scale. Usability evaluation in industry, 189(194), 4–7.

Buckleton, J., & Champod, C. (2006). An extended likelihood ratio framework for interpreting evidence. Science & Justice, 46(2), 69–78.

Burgman, M., Fidler, F., Mcbride, M., Walshe, T., & Wintle, B. (2006). Eliciting expert judgments: Literature review. Risk Analysis, ACERA Project 0611, 1–71.

Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2019). Shiny: Web Application Framework for R.

Colson, A. R., & Cooke, R. M. (2018). Expert elicitation: Using the classical model to validate experts’ judgments. Review of Environmental Economics and Policy, 12(1), 113–132.

Delignette-Muller, M. L., & Dutang, C. (2015). fitdistrplus: An R package for fitting distributions. Journal of Statistical Software, 64(4), 1–34.

Dias, L. C., Morton, A., & Quigley, J. (2018). Elicitation. The Science and Art of Structuring Judgement. Cham, Switzerland: Springer.

Garthwaite, P. H., Kadane, J. B., & O’Hagan, A. (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100(470), 680–

(47)

700.

Gill, P. (2018). Interpretation continues to be the main weakness in criminal justice systems: Developing roles of the expert witness and court. WIREs Forensic Sci, 1:e1321. Wiley. Gore, S. (1987). Biostatistics and the Medical Research council. Medical Research Council

News, 35, 19–20.

Gregory, R. J. (2011). Psychological testing History, principles, and applications. Boston: Pearson Eduction.

Hora, S. C. (2007). Eliciting probabilities from experts. Advances in Decision Analysis: From Foundations to Applications (pp. 129–153). Cambridge: Cambridge University Press. Johnson, S. R., Tomlinson, G. A., Hawker, G. A., Granton, J. T., & Feldman, B. M. (2010). Methods to elicit beliefs for Bayesian priors: a systematic review. Journal of Clinical Epidemiology, 63(4), 355–369.

Johnson, S. R., Tomlinson, G. A., Hawker, G. A., Granton, J. T., Grosbein, H. A., & Feldman, B. M. (2010). A valid and reliable belief elicitation method for Bayesian priors. Journal of Clinical Epidemiology, 63(4), 370–383.

de Keijser, J., & Elffers, H. (2012). Understanding of forensic expert reports by judges, defense lawyers and forensic professionals. Psychology, Crime and Law, 18(2), 191–207.

Lewis, J. R., & Sauro, J. (2018). Item Benchmarks for the System Usability Scale. Journal of Usability Studies, 13(3), 158–167.

Lucy, D. (2005). Introduction to Statistics for Forensic Scientists. Chichester, West Sussex, England ; Hoboken, NJ: Wiley.

Lund, S. P., & Iyer, H. K. (2017). Likelihood ratio as weight of forensic evidence: A closer look. Journal of Research of National Institute of Standards and Technology, 122(27).

(48)

W. D., et al. (2016). Discussion on how to implement a verbal scale in a forensic laboratory: Benefits, pitfalls and suggestions to avoid misunderstandings. Science & Justice, 56(5), 364–370.

Martire, K. A., Kemp, R. I., Sayle, M., & Newell, B. R. (2014). On the interpretation of

likelihood ratios in forensic science evidence: Presentation formats and the weak evidence effect. Forensic Science International, 240, 61–68.

Massey, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253), 68–78.

Mitchell, N., Blankers, B., Kokshoorn, B., Van Der Stelt, A., & McDonald, S. (2019). A cold case turns hot after 30 years. Australian Journal of Forensic Sciences, 51(sup1), S60–S67. Taylor & Francis.

Morris, D. E., Oakley, J. E., & Crowe, J. A. (2014). A web-based tool for eliciting probability distributions from experts. Environmental Modelling and Software, 52, 1–4.

Nease, R. F., & Owens, D. K. (1990). Assessment and representation of prior beliefs:

Unexpected implications of the uniform distribution. Medical Decision Making, 10(2), 112– 114.

Nordgaard, A., Ansell, R., Drotz, W., & Jaeger, L. (2012). Scale of conclusions for the value of evidence. Law, Probability and Risk, 11(1), 1–24.

O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E., et al. (2006). Uncertain judgements: Eliciting experts’ probabilities. Uncertain Judgements: Eliciting Experts’ Probabilities. wiley.

Oakley, J. E. (2010). Eliciting probability distributions. Retrieved from http://www.jeremy-oakley.staff.shef.ac.uk/Oakley_elicitation.pdf

(49)

R Core Team. (2019). R: A language and environment for statistical computing. R Foundation for statistical computingR Foundation for statistical computing, Vienna, Austria.

Thompson, W. C., & Newman, E. J. (2015). Lay understanding of forensic statistics: Evaluation of random match probabilities, likelihood ratios, and verbal equivalents. Law and Human Behavior, 39(4), 332–349.

Veen, D., Stoel, D., Zondervan-wijnenburg, M., & van de Schoot, R. (2017). Proposal for a five-step method to elicit expert judgment. Frontiers in Psychology, 8, 2110.

Willems, S. J. W., Albers, C. J., & Smeets, I. (2019). Variability in the interpretation of Dutch probability phrases - a risk for miscommunication.

Wolfson, L. J. (2015). Elicitation of Probabilities and Probability Distributions. International Encyclopedia of the Social & Behavioral Sciences (pp. 382–385). Amsterdam: Elsevier.
