
Feedback as a tool to improve probability judgements in forensic feature-comparison experts

Master: MSc Forensic Science
Name: Judith van Diggelen
Student number: 10573216
Word count: 8961

Supervisors: Erwin Mattijssen MSc and dr. Peter Vergeer
Examiner: prof. dr. Marjan Sjerps


Index

Abstract

1. Introduction
1.1 Forensic feature-comparison methods
1.2 Critique on feature-comparison methods
1.3 Categorical source judgements and degree-of-support judgements
1.4 Validating likelihood-ratio opinions
1.5 Question addressed in this review

2. Probability judgements in forensic experts
2.1 Experts vs novices
2.2 Statistical learning in forensic experts
2.3 Cognitive processes of statistical learning

3. Feedback as a learning mechanism for cognitive bias
3.1 Cognitive bias
3.2 Improving performance and adjusting bias
3.3 Feedback as a learning mechanism

4. Improving calibration in a research setting
4.1 Previous research
4.2 Calibration
4.2.1 Scoring rules
4.2.2 Misleading evidence
4.2.3 The calibration graph
4.3 Feedback

5. Discussion

Search strategy


Abstract

This literature review examines the subjective probability judgements of forensic feature-comparison experts. Forensic feature-comparison experts, such as fingerprint or firearm examiners, compare samples to assess whether these have originated from the same source or from different sources, based on the features found on the samples. After examination, experts provide a degree of support for either the same-source hypothesis or the different-source hypothesis, using a likelihood-ratio approach. It is important that experts are well-calibrated, meaning that they are able to provide a degree of support that is in line with the true frequency of occurrence of the features. Unfortunately, some feature-comparison experts are not properly calibrated, which may disadvantage a suspect's defence in criminal cases. The aim of this literature review is to provide insight into the cognitive processes underlying probability judgements, to assess how calibration can be improved and to provide recommendations for future research on this matter. Feature-comparison experts appear to have learned statistical information from their environment through statistical learning to a certain extent, but their knowledge of statistical information is not sufficient for well-calibrated degree-of-support judgements. Providing trial-by-trial feedback and performance feedback, and creating a deliberate-practice environment, appear to be promising methods for improving calibration in feature-comparison methods. For future research on this matter, it is recommended that both trial-by-trial and performance feedback, in combination with deliberate practice, are examined as means to improve calibration. To measure calibration, strictly proper scoring rules can be used; however, further research on calibration measurement may provide alternative methods.


1. Introduction

1.1 Forensic feature-comparison methods

Feature-comparison methods in forensic science are procedures in which examiners compare evidentiary samples and source samples by assessing the features found on these samples. An evidentiary sample is a sample that has been brought in as evidence, for example a fingerprint, a DNA sample or a bullet recovered from a crime scene. These samples can be compared to a source sample, for example a fingerprint from a suspect, a DNA sample from a suspect or a reference bullet (shot from a firearm that is proposed to have fired the recovered bullet). Some feature-comparison methods are performed with little to no human judgement or by automated systems; these are known as objective methods. On the other hand, there are subjective feature-comparison methods, in which procedures are performed by human examiners and rely on a substantial amount of human judgement (President’s Council of Advisors on Science and Technology, 2016). One of these subjective methods is forensic firearm examination, a discipline that focusses on the examination of recovered firearms, spent cartridge cases, bullets and/or other related artefacts from crime scenes (Jackson et al., 2016). The examination is based on a feature comparison of the contours left on the surface of ammunition (bullets or spent cartridge cases) during the process of firing a firearm. These contours (also known as features or tool marks) can consist of either impressions or striations, produced by imperfections of the firearm on the ammunition. Impressions are made by direct pressure between two surfaces, while striations are impressions combined with lateral movement of the two surfaces. Other subjective feature-comparison methods include fingerprint examination, voice recognition, hair comparison and facial recognition.

1.2 Critique on feature-comparison methods

In forensic science, these feature-comparison methods have been subject to critique concerning the validity of their scientific methods. In 2009, the National Research Council (NRC) reviewed the methods of the forensic science community in the United States of America and reported shortcomings in the forensic methods that were used to provide evidence in court. They reported that many of the feature-comparison methods had not been validated and lacked proper determination of error rates and reliability testing (National Research Council, 2009). In 2016, another impactful report was produced by an advisory council for President Barack Obama to assess the scientific validity of feature-comparison methods in forensic science. It was found that nearly all comparison methods needed (additional) validation before they could be used in court, except for single-source DNA examination and fingerprint examination (President’s Council of Advisors on Science and Technology, 2016). Though both reports were written with regard to the situation in the United States of America, the critical assessment of the methods influenced the international forensic community. Over the past years, some feature-comparison forensic disciplines have focused their research on investigating the validation and reliability of their methods by empirically assessing the accuracy of forensic experts’ conclusions (Growns & Martire, 2020). Meanwhile, it is up to the legal systems of different countries to decide what they accept as forensic evidence in court. One main distinction between the different legal systems is the manner in which forensic experts report the results of a subjective feature comparison to the court. In some countries, such as the United States of America, forensic experts provide the court with a categorical source judgement, while in multiple European countries and New Zealand, forensic experts report their findings using a degree-of-support judgement, also known as a likelihood ratio (Bolton-King, 2016).

1.3 Categorical source judgements and degree-of-support judgements

Categorical source judgements are used by most forensic feature-comparison experts and institutes around the world to report the conclusions of their investigation. With this method, experts report a “match”, “no-match” or “inconclusive” statement after comparing an evidentiary sample and a source sample. A “match” indicates that the expert believes that the evidentiary sample originates from the same source as the source sample. A “no-match” indicates that the expert believes the samples originate from different sources. This method simplifies the result of an investigation, which can aid the court in understanding whether or not the expert believes that the evidentiary sample originated from the same source as the source sample. However, when using a categorical source judgement, an expert is forced to either defend the “match” or “no-match” conclusion, or to lose information on the comparison by reporting an “inconclusive” result (Kerkhoff et al., 2013).

In contrast to categorical source judgements, degree-of-support judgements give experts the opportunity to express a degree of belief corresponding to the probative value of the result of an examination, given two stated alternative propositions. In several countries, including the Netherlands, degree-of-support judgements are reported using a likelihood ratio (LR) that represents the strength of the evidence (Bolton-King, 2016; Jackson et al., 2016). This approach is based on Bayes' theorem:

$$\text{prior odds} \times \text{likelihood ratio} = \text{posterior odds}$$

This approach can be used for a comparison of the evidence based on two different hypotheses (also known as the propositions). In criminal cases, these two hypotheses often represent the position of the prosecutor and the position of the defence. The prior odds represent a mathematical description of the prior knowledge about the hypotheses that are being tested, before any evidence is examined. The posterior odds represent a mathematical description of the knowledge of the hypotheses that are being tested after the examination of the evidence. The likelihood ratio is used by an expert to express the strength of the evidence, and it is defined as follows (Jackson et al., 2016):


$$\text{likelihood ratio} = \frac{P(\text{evidence} \mid \text{hypothesis is true})}{P(\text{evidence} \mid \text{alternative hypothesis is true})}$$

A practical example of the use of an LR by forensic experts is given for feature comparison in firearm examination. Visual examinations can be performed by firearm experts by comparing the features of a recovered bullet to reference bullets fired from a firearm that is hypothesised to have been used to shoot the recovered bullet. If no firearm is available in a case, bullets from a scene can be compared to each other to determine whether the evidence provides support for the proposition that both bullets have been shot from the same firearm (Bolton-King, 2016; Jackson et al., 2016). The propositions used in these examinations are usually that the compared ammunition originates from the same source (shot from the same firearm) or from different sources (shot from different firearms). In practice, the considered hypotheses from the prosecutor and the defence may be as follows. Hypothesis of the prosecutor: “The recovered bullet and the reference bullet were fired from the same firearm”. Hypothesis of the defence: “The recovered bullet and the reference bullet were fired from different firearms”. The firearm expert estimates the probability of the evidence given either of those hypotheses, based on the feature comparison of the impressions and striations on the surfaces of the bullets. To conclude this example, the examination can result in the following conclusion by the expert: “It is 10 times more likely to find the evidence given that the hypothesis of the prosecutor is true, than if the hypothesis of the defence is true”.
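To make the relation between such a reported LR and Bayes' theorem concrete, a small worked example can be given. The prior odds below are purely illustrative and are not taken from any cited study; in practice the prior and posterior odds are the domain of the court, while the expert reports only the LR:

$$\underbrace{\frac{1}{100}}_{\text{prior odds}} \times \underbrace{10}_{\text{likelihood ratio}} = \underbrace{\frac{10}{100}}_{\text{posterior odds}} = \frac{1}{10}$$

In words: if, before considering the bullet comparison, the same-source proposition were judged 100 times less probable than the different-source proposition, an LR of 10 would leave it 10 times less probable. The evidence shifts the odds, but does not by itself determine them.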

Though both categorical source reporting and LR-based reporting methods are used internationally, there is growing support for the LR-based approach, as the categorical source judgement conveys less information. The categorical approach is simpler to explain to judges, which is very valuable, but while the LR-based approach is more difficult to explain, the LR is capable of showing the people involved in the criminal justice system how complex these judgements actually are (Kerkhoff et al., 2017). Furthermore, among experts, scientists and different legal disciplines, the LR-based approach has been widely accepted as an admissible procedure for inferring probability judgements (Biedermann et al., 2013; Bolton-King, 2016). In the remainder of this review, the focus lies on the use and calibration of the LR by forensic experts in the context of feature-comparison methods.

1.4 Validating likelihood-ratio opinions

The LR-based approach in subjective feature-comparison methods gives experts the opportunity to provide a degree-of-support judgement. However, as has been established, this judgement is based on a subjective examination of the samples. Most importantly, in forensic cases, evidentiary samples do not have a known source and there will always be a degree of uncertainty involved in these judgements. To validate the use of an LR-based approach in feature-comparison methods, it must be assessed whether or not the forensic experts are well-calibrated. As described by Vergeer et al. (2020), a well-calibrated LR-system (in this case the forensic expert) should produce LR-values that are in line with their expected frequencies under the two hypotheses that have been determined. In essence, these expected frequencies are the number of times the evidence would present as it has, given that the hypotheses are true. An ill-calibrated forensic expert produces LR-values that are either too large or too small and therefore misleading (Vergeer, Alberink, et al., 2020).

In firearm examination, experts are expected to provide these LR values (or degrees of support) by estimating the probabilities based on their experience and internal databases (Growns & Martire, 2020). In a recent study, the calibration of this degree-of-support judgement was examined by Mattijssen et al. (2020). They found that the 10 firearm examiners who use an LR-based approach in their work showed poor calibration of their LRs. This means that the degree of support given by the experts for a certain same-source or different-source decision did not fall within the expected ranges of degree of support based on a proportion of misleading choices. This measurement of calibration, based on misleading choices, is further explained in section 4.2.2.

Even though the experts performed well on a categorical source-choice task, when estimating the LR for a same-source or different-source conclusion, the firearm examiners showed overconfidence in their degree-of-support judgements. Over- and under-confidence is a mismatch between the judged degree of support given by the expert concerning certain evidence and the relative frequency with which the evidence actually presents itself (O’Hagan et al., 2006a). In a study examining the expertise of handwriting experts from the US, it was found that these experts were better than novices at estimating the frequency of occurrence of specific handwriting features; however, even the best-performing expert showed an average deviation from the true feature-occurrence value of 18.5% (Martire et al., 2018). Additionally, it was shown that fingerprint examiners have obtained knowledge concerning the rarity of fingerprint patterns, but were unable to express this with an accurate quantitative judgement (Mattijssen, Witteman, Berger, & Stoel, 2020).

Overconfidence biases have also been found over the years among other experts, such as lawyers, physicians and weather forecasters (Ayton & Pascoe, 1995; González-Vallejo & Bonham, 2007; Mazzocco & Cherubini, 2010). Depending on which hypothesis a forensic expert finds support for, either over- or under-confidence in the reported degree of support can be disadvantageous for suspects in court and must be treated as a serious issue in the criminal justice system.

1.5 Question addressed in this review

At this moment, opinion-based likelihood ratios by forensic firearm experts are not validated and there is scientific evidence that they are poorly calibrated (Mattijssen, Witteman, Berger, Brand, et al., 2020). Examining the (cognitive) processes may provide insight into expertise and the ability to create probability judgements in general. Specifically, the process of statistical learning is explored.


Additionally, potential ways to improve calibration in experts are considered. One of the methods that has been suggested to reduce overconfidence in experts is feedback, where performance feedback can lead to better calibration of probability judgements (Stone & Opel, 2000). Therefore, the aim of this review is to first establish the cognitive and psychological processes involved in the calibration of likelihood ratios. This is followed by an analysis of whether or not feedback can be used to resolve poor calibration in forensic feature-comparison experts. Finally, possible approaches are discussed for future research into the improvement of calibration of feature-comparison experts.


2. Probability judgements in forensic experts

In this chapter, the cognitive theories and studies relevant to the probability judgements performed by forensic experts, and experts in general, are examined. First, the difference in performance between experts and non-experts (novices) is reviewed. Then, the ability to make probability judgements is reviewed by focussing on the concept of statistical learning at a more cognitive level.

2.1 Experts vs novices

When hearing the term “expert”, people form an idea about certain special capabilities that enable a person to be called an expert. One will instinctively picture a person with specific knowledge or a skill set that distinguishes them from novices, and whose expertise is simultaneously acknowledged by the public or by their colleagues (Ericsson, 2006). In the progression from novice to expert, it appears that novices, intermediates (and other sub-levels) and experts approach their task differently. For example, a novice will only start to learn to recognise certain characteristics, while after more practice they will start to distinguish important from unimportant cues. In turn, an expert is described as someone who has enough experience to proceed on intuitive judgement and knowledge (Edmond et al., 2017; O’Hagan et al., 2006b). Interestingly, it has been found that experts are not always aware of how they come to their conclusions, confirming the intuitive aspect of their decision making. When asked for their reasoning, experts may offer reasons that are retrospective rationalisations, and having experts provide reasoning for their decisions can actually reduce their performance (Aitken et al., 2010; Edmond et al., 2017).

In forensic science, feature-comparison experts face a cognitively challenging task, as they have to assess the evidential value regarding whether or not two items, such as bullets or fingerprints, originate from the same source. These tasks are not performed solely on intuition, as it has been found that forensic feature-comparison experts use both non-analytical (intuitive) and analytical processing to perform this difficult task (Dror & Stoel, 2014; Growns & Martire, 2020). Additionally, Dror & Stoel (2014) describe the key to superior abilities and expertise as the development of top-down cognitive mechanisms, such as selective attention towards important cues and the chunking of information to remember it better.

Studies have been performed to assess the difference in performance between forensic experts and novices in feature-comparison methods. These include fingerprint examination, where it was found that experts perform better than novices in categorical source-choice tasks under short time intervals, and that experts benefit more than novices from longer time intervals. Additionally, it has been suggested that memory retention is better in experts than in novices, both in chess experts and in fingerprint examiners (Growns & Martire, 2020; Schneider et al., 1993; Thompson & Tangen, 2014).


As mentioned, the circumstances surrounding the comparison of two samples are often challenging, making the cognitive task difficult for forensic experts. There can be both inter- and intra-variability in the samples used for comparison. For example, fingerprints can be deposited on different surfaces, or a firearm may show differences in deposited striations and/or impressions because of usage and wear (Dror & Cole, 2010). Consequently, after the assessment of the features in the samples, the probability of the evidence given the two hypotheses still has to be established when an LR-based approach is considered. The frequency, and thus the rarity, of the features on the samples must be taken into account, as this will influence the probability of the evidence (given the hypotheses). However, there are no databases available to calculate these frequencies and therefore the estimated LR is a subjective representation of the expert’s internal sense of the feature frequencies. Therefore, in the next section, the ability to learn frequencies is assessed.

2.2 Statistical learning in forensic experts

Statistical learning is the process of learning statistical properties (such as patterns) from the sensory information that is received from the environment. Two types of statistical learning are distinguished: conditional statistical learning and distributional statistical learning. Conditional statistical learning is the process of learning joint relationships between stimuli, such as one element always occurring together with another. This review focuses on distributional statistical learning, which is the process of learning the frequencies and variability of certain stimuli in the environment (Growns & Mattijssen, 2020). One could argue that the conditional relationship between multiple features on one sample may be important for estimating the frequency of features in the environment. On the other hand, if an expert considers the full picture of different striations and impressions as the feature of the sample, this does not hold. Based on the definitions described by Growns & Mattijssen (2020), in this review the feature comparison is approached as if experts consider the full picture with all marks as the feature whose frequency is to be estimated.

As mentioned in the previous section, the ability to correctly learn the frequencies of stimuli in the environment is key for the forensic feature-comparison expert when estimating their degree of support. Only a few studies have examined statistical learning in forensic experts. In a study concerning fingerprint examiners, it was found that, for the specific subset of fingerprint patterns used in the study, the fingerprint experts were better than novices at estimating the rarity of general patterns in fingerprints. However, they were not able to give correct frequency judgements for the patterns (Mattijssen, Witteman, Berger, & Stoel, 2020). The authors report that it is unclear whether the examiners were unable to statistically learn the frequencies, or whether their personal work experience and knowledge interfered with the statistical learning. Additionally, in a study regarding forensic document experts from the US, the ability to estimate the frequency of handwriting features was compared between novices, US experts and experts from outside the US. The experts from the US were better than novices and better than experts from outside the US, indicating that the US experts have statistically learned from their environments. However, even though the experts from the US performed better, they still showed a very high error rate, which in turn indicates that the statistical learning may have been better, but still not in line with the true frequency in the environment (Martire et al., 2018).

2.3 Cognitive processes of statistical learning

Although the cognitive sciences may not seem immediately relevant to forensic science research, they can bring forensic science essential insights into human cognition, which in turn may improve the processes involved in forensic feature-comparison methods, but can also improve processes in crime scene investigation by assessing the cognitive pitfalls of context and emotion (Edmond et al., 2017).

To better understand statistical learning, the cognitive processes associated with it are assessed. In the last few decades, a growing amount of research has been done on statistical learning and its involvement in different cognitive domains and areas of psychology. In light of this, the importance of considering statistical learning a fundamental construct of learning and development in general has been highlighted (Bogaerts et al., 2020; Sherman et al., 2020). An important aspect of statistical learning is that it consists of, and is affected by, multiple processes, and in general is based on two main cognitive mechanisms. As reviewed by Conway (2020), these two main cognitive mechanisms are cognitive plasticity and top-down modulation. Cognitive plasticity regulates perception and associative learning based on a bottom-up principle. This means that sensory information entering the brain through perception is processed implicitly and passively. This sensory information can consist of visual stimuli, sounds or touch, and cognitive plasticity mediates the perception and associative learning for this input. The other cognitive mechanism that Conway (2020) lists is top-down modulation. Top-down modulation of learning involves active processes such as selective attention and working memory. Selective attention is the active cognitive mechanism of focussing on certain stimuli, while working memory is the active process of holding on to a piece of information for a certain time so that it can be used by other processes (Conway, 2020; Dror & Stoel, 2014). In research assessing the statistical learning of language, it was shown that disruption of a specific brain area (the dorsolateral prefrontal cortex, DLPFC) changed the balance between cognitive plasticity and top-down modulation, benefitting cognitive plasticity during a learning task. Additionally, it was found that weaker top-down modulation and reduced activity of the DLPFC were related to better statistical learning, suggesting better consolidation of long-term learning (Ambrus et al., 2020; Bogaerts et al., 2020; Tóth et al., 2017). These findings provide an interesting outlook on the cognition involved in statistical learning, as they suggest that top-down modulation, with active, explicit attention, may actually inhibit the statistical learning process. Combined with the idea that expertise is based on the development of top-down cognitive mechanisms, as described by Dror & Stoel (2014), this may suggest that with the development of expertise, statistical learning is inhibited. Although more research should be done on this matter, this combination of literature can provide a potential explanation for the ill-calibrated performance of feature-comparison experts on the estimation of feature frequencies.

Additionally, more research must still be done on the subject of statistical learning and memory consolidation, and the integration of new information and stored information (Bogaerts et al., 2020). It has also been suggested that there are distinct but related memory processes that mediate distributional and conditional statistical learning (Thiessen et al., 2013). However, it is still unclear how different memory processes may contribute to the ability to distinguish between different frequencies or to specifically estimate frequencies (Growns & Mattijssen, 2020).

In the following chapter, cognitive bias in feature-comparison experts is examined and feedback is considered as a learning mechanism to improve calibration.


3. Feedback as a learning mechanism for cognitive bias

The topic of this chapter is cognitive bias in relation to probability judgements in experts, and the possibility of improving the calibration of experts by using feedback as a learning mechanism.

3.1 Cognitive bias

The collaboration between bottom-up processing and top-down modulation enables the brain to learn and process information. As bottom-up input becomes more difficult, for example because two samples such as fingerprints have been deposited on different surfaces, there will be a higher sensitivity to cognitive bias (Dror & Cole, 2010; Dror & Stoel, 2014). It is important to realise that cognitive bias is unrelated to dishonesty or partiality; it simply reflects the imperfect processes of perception and cognition that are embedded in human nature. The effect of cognitive bias on categorical source-choice tasks has been investigated, and it was found that contextual information, individual expectation and emotion can influence the decision-making process (Found, 2015). For forensic feature-comparison examiners who use an LR-based approach to report their findings, another form of cognitive bias was found to be present: ill-calibration of their LR judgements (Mattijssen, Witteman, Berger, Brand, et al., 2020). This indicates that their internal representation of the frequency of features in the environment is not in line with the true frequency of features in the environment. Examiners can be either overconfident or under-confident. An examiner who is overconfident will provide a probability judgement that is larger than the relative frequency of the event in question. Analogously, an examiner who shows under-confidence will provide a probability judgement that is smaller than the relative frequency of the event. The total degree of over- or under-confidence of an examiner on a task depends on the difference between the reported mean judgement and the relative frequency (O’Hagan et al., 2006a). These biases are relevant for the calibration of experts, and in the next section methods to reduce them are reviewed. However, it is also important to assess why these biases may be present in these experts. Although there is no definitive answer, three different mechanisms have been suggested (Ferretti et al., 2016). First, “anchoring” may cause examiners to unintentionally use the judgement of another examiner on the comparison, causing their judgement to be anchored towards the primary examiner. This may cause a secondary examiner not to adjust their estimations for extreme points in a probability distribution. Second, people would rather be precise than accurate and therefore provide smaller confidence intervals. Finally, working memory has a limited capacity and could therefore constrain the information available to people, causing them to rely on a smaller set of relevant facts; this may result in a lower variance and overconfidence in their estimates (Ferretti et al., 2016).

3.2 Improving performance and adjusting bias

Improving performance and adjusting bias are difficult tasks. Measures to reduce bias can be applied either beforehand, or after the participants have performed their task. However, a simple measure like warning people of their bias beforehand, or asking them to explain their reasoning afterwards, is unable to adjust bias or improve performance; the latter can even worsen performance (Edmond et al., 2017; O’Hagan et al., 2006a). In the previous century, research was done on factors and strategies that may improve performance; here, a summary is given of the suggestions made by Murphy and Winkler (1984) and Fischhoff (1989), as cited in O’Hagan (2006b). Murphy and Winkler were interested in the factors underlying the excellent probability judgements of meteorological forecasters and found that practice, prompt feedback about the ground truth, the evaluation of performance by quantitative scoring rules, effort, technological resources and the objectivity of the forecasters were the most important factors. Fischhoff (1989) specified four conditions for improving judgements. First, sufficient practice with a task or set of tasks that is consistent in its design. Second, the reporting criterion must be well-defined, because judgements may otherwise be open to contingencies. Third, the feedback must be specific to the task at hand. And finally, the need for learning must be explicit. The above-mentioned factors and strategies have a lot in common with the deliberate practice method suggested by Ericsson et al. (1993). It was shown that expert performance on a task generally depends on the amount of deliberate practice that has been performed by the individual. Deliberate practice is a method of performing specific practice tasks to increase speed and accuracy in motor, perceptual and cognitive tasks. A number of conditions have been suggested to create an environment suitable for deliberate practice, such as motivation and a willingness to put in effort from the participants. Furthermore, the practice tasks must be in line with the knowledge of the participant so that the task is well understood, and, in line with Fischhoff’s suggestion, the set of tasks must be similar and consistent. Finally, it is essential that the participants receive immediate feedback and information about their performance.

Based on this information, if a task is well-designed and consistent, and if the participant is internally motivated, practices deliberately and receives effective feedback, it should be possible to improve performance. The following section focuses on feedback as a learning mechanism.

3.3 Feedback as a learning mechanism

When considering feedback as a learning mechanism to improve performance in probability judgements, it is important to realise that there are different forms of feedback that one can provide. Outcome feedback simply informs a person whether their answer was correct or incorrect. Performance feedback provides someone with a detailed description of the quality of the judgements. And finally, environmental feedback is additional (environmental) information about the events that are to be judged (Harvey & Fischer, 2014; Stone & Opel, 2000).

A further distinction has been made by Stone & Opel (2000), who examined the difference between environmental feedback and performance feedback. The environmental feedback consisted of an informative lecture on the cues that were relevant for the probability-judgement task. The performance feedback consisted of a personal calibration graph of the participant's performance. It was found that environmental feedback improved performance on a discrimination task (categorical source-choice), while performance feedback improved the calibration of the participants (Stone & Opel, 2000). It has also been found that when feedback is given very frequently, such as weather forecasters receiving feedback practically daily, calibration improves (Ferretti et al., 2016). Additionally, the combination of outcome feedback with information on the task structure seems to improve performance (O’Hagan et al., 2006a). In line with these findings, it was shown that trial-by-trial feedback via a scoring rule can produce positive changes in performance in limited time (González-Vallejo & Bonham, 2007). A scoring rule rewards or penalises probability statements depending on the ‘closeness’ of these statements to the outcome. With trial-by-trial feedback, the feedback could be given frequently and in combination with a scoring rule, which may have contributed to the positive effect found in their study.

All in all, there seems to be sufficient evidence that feedback can be used as a method to improve the performance of probability judgements. Combined with the deliberate practice model suggested by Ericsson et al. (1993), in which the expert is motivated, the tasks are appropriate and feedback is given immediately after performance, this constitutes a promising method for improving the performance of probability judgements. In the following chapter the focus shifts towards more practical factors, to examine experimental set-ups that may contribute to research on the improvement of calibration in forensic feature-comparison experts.


4. Improving calibration in a research setting

It has become clear that there are certain issues regarding the calibration of feature-comparison experts in the forensic community. The previous chapters attempted to provide insight into the underlying mechanisms of probability judgement and the use of feedback as a learning mechanism, and discussed the poor calibration of feature-comparison experts. Now, the focus shifts towards a discussion of potential experimental methods and considerations for future research. The overall aim is to be able to answer the following two questions: What is the best way to measure calibration? And: What is the best way to provide feedback to improve calibration? First, an overview of three studies that have been discussed in previous sections is given to provide insight into previous research methods. Then, multiple methods of measuring calibration are discussed, followed by potential methods of providing feedback in an experimental set-up.

4.1 Previous research

In this section, an overview of the studies of Stone & Opel (2000), González-Vallejo & Bonham (2007) and Mattijssen, Witteman, Berger, Brand, et al. (2020) is provided. This overview is necessary to be able to compare and consider their methods with regard to potential future research into the calibration of feature-comparison experts.

The study by Stone & Opel (2000) focussed on calibration and discrimination and examined whether these judgements were affected by specific training procedures in the form of feedback. They performed this study with 84 participants (43 male) who were shown art slides, each accompanied by two time periods. The participants then had to give a probability estimate that the art originated from the latter time period, with probabilities from 0% to 100% in 10% increments. Half of the slides were easy and the other half were hard. The study was set up so that all participants first performed the task, then received one of three possible feedback trainings, and then performed the task again (with different slides) to assess the effect of the different trainings. One group received performance feedback during the training period: they were given a calibration graph and a personal feedback session of 2-5 minutes. This group improved the calibration of their probability judgements. The second group received environmental feedback in the training period: they were given a 30-minute lecture on art history. This group improved most on the discrimination in their judgements. Finally, the last group was a control group that did not receive any training.

A study by González-Vallejo & Bonham (2007) examined calibration and discrimination by measuring confidence and accuracy in an experiment with 129 participants who answered 150 general-knowledge questions. The study consisted of two phases. In the first phase, all participants were shown 100 (of the 150) questions with their correct answers; 50 of the presented question-answer pairs were shown only once, the other 50 were presented 3 times. In the second phase, the participants had to answer all 150 questions in a two-option forced-choice task and then rate their confidence. In this second phase, the participants were divided into three groups with different methods of providing feedback based on a scoring system. The first group received feedback through an all-reward system, in which participants receive points for every correct answer. The second group received an all-penalty system, in which participants only lose points for wrong answers. The final feedback method was reward-and-penalty, in which participants receive both. It was found that all systems improved accuracy and confidence; however, the combined system of rewards and penalties showed a significant improvement in calibration, such that confidence was best aligned with accuracy. The main difference between this study and that of Stone & Opel (2000) is that outcome feedback in combination with a scoring rule is given per answer (trial by trial), instead of performance feedback after completing the whole task. The authors cite Stone & Opel's research as well, noting that their findings are in line with it, but they show how it is also possible to tackle both discrimination and calibration by using their scoring system as feedback.
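As an illustration of how such trial-by-trial scoring feedback could be operationalised, the sketch below implements three simple scoring schemes (all-reward, all-penalty, and reward-and-penalty) derived from a quadratic score. The exact point functions used by González-Vallejo & Bonham (2007) are not reproduced here; the function names and constants are illustrative assumptions only.

```python
# Minimal sketch of trial-by-trial scoring feedback (illustrative, not the
# exact scheme of González-Vallejo & Bonham, 2007).

def quadratic_closeness(prob: float, outcome: int) -> float:
    """Closeness of a probability judgement to the outcome (1 = perfect)."""
    return 1.0 - (prob - outcome) ** 2

def trial_feedback(prob: float, outcome: int, scheme: str) -> float:
    """Return the points shown to the participant after a single trial."""
    closeness = quadratic_closeness(prob, outcome)
    if scheme == "all_reward":          # only gains, scaled by closeness
        return 100 * closeness
    if scheme == "all_penalty":         # only losses, scaled by distance
        return -100 * (1.0 - closeness)
    if scheme == "reward_and_penalty":  # gains for close, losses for distant
        return 100 * closeness - 100 * (1.0 - closeness)
    raise ValueError(f"unknown scheme: {scheme}")

# Example: a participant judges P(same source) = 0.9, but the ground truth
# is different source (outcome = 0).
for scheme in ("all_reward", "all_penalty", "reward_and_penalty"):
    print(scheme, round(trial_feedback(0.9, 0, scheme), 1))
```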

Mattijssen, Witteman, Berger, Brand, et al. (2020) performed a study to assess the validity and reliability of firearm examinations. They examined the validity of categorical source judgements from firearm examiners and compared this to the validity of a computer-based feature-comparison method, but also studied the reliability of the judgements of the firearm examiners to determine their calibration. To stay on topic, this review focuses on their experimental set-up concerning examiner judgements; more information concerning the validity and comparison of categorical source-choice judgements of examiners and the computer-based method can be found in their paper (Mattijssen, Witteman, Berger, Brand, et al., 2020). The study had 77 international firearm examiners who were presented with 60 comparison images in which the striation patterns of two spent cartridge cases were shown side by side. The examiners were asked questions about these images, including a categorical source choice (same-source or different-source) and a degree of support. The degree-of-support judgement could only be provided by 10 of the 77 firearm examiners, as the other examiners did not use an LR-based method in casework. For these 10 firearm examiners, the calibration of their judged degrees of support was calculated using a method based on the proportion of misleading choices. A misleading choice refers to a degree of support pointing towards one of the two hypotheses, while the other hypothesis in fact represents the ground truth. The proportions of misleading choices for certain judged degrees of support were compared to calculated expected proportions of misleading choices for those judged degrees of support, which in turn resulted in a calibration graph showing that the examiners were ill-calibrated, displaying overconfidence.

4.2 Calibration

The above-mentioned papers measured calibration using different methods. Both Stone & Opel (2000) and González-Vallejo & Bonham (2007) use methods based on strictly proper scoring rules, while Mattijssen, Witteman, Berger, Brand, et al. (2020) measured calibration by assessing the proportion of misleading choices per judged degree of support. The following sections provide further insight into these methods.

4.2.1 Scoring rules

Scoring rules are a measurement of the quality of probability judgements, but they can also be used within a task to encourage the assessor during the actual elicitation of a probability judgement (Gneiting & Raftery, 2007). One variety of scoring rules that is interesting, specifically for the calibration of feature-comparison experts, is the strictly proper scoring rule. Strictly proper scoring rules are constructed so that the best (expected) score is obtained only when the probability judgement is in line with the target variable (Mojab, 2016). The target variable in this case could be the frequency of occurrence of a certain feature pattern. Stone & Opel (2000) used a strictly proper scoring rule, the Brier scoring rule, to measure the calibration of the probability judgements. The Brier scoring rule is a measurement of probability-judgement performance and consists of a measurement of calibration and discrimination (DeGroot & Fienberg, 1983). The Brier score is calculated by taking the mean of all the probability scores from a task. These probability scores are calculated by subtracting the actual outcome of an event (for example: either two samples are same-source (1) or not (0)) from the probability judgement of the event, and then squaring this number (DeGroot & Fienberg, 1983; Stone & Opel, 2000; van Gelder, 2015). The calibration within this score is the difference between the prediction by an assessor and the relative frequency of the event; for a well-calibrated assessor this difference will be (close to) zero. The discrimination in the score is represented by the closeness of the proportion of true predictions (same source or different source) to either 0 or 1 (DeGroot & Fienberg, 1983). Because for both calibration and discrimination a smaller number indicates better probability judgements, a smaller Brier score indicates a better quality of probability judgement.
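The Brier score described above can be computed directly from a set of probability judgements and ground-truth outcomes. The sketch below is a minimal implementation of that definition; the example judgements are invented for illustration.

```python
# Minimal Brier score sketch: mean squared difference between probability
# judgements and outcomes (1 = same source, 0 = different source).
# Smaller scores indicate better probability judgements.

def brier_score(probs: list[float], outcomes: list[int]) -> float:
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Illustrative example: four comparisons, two same-source and two
# different-source, with hypothetical judged probabilities of same source.
judgements = [0.9, 0.7, 0.2, 0.4]
ground_truth = [1, 1, 0, 0]   # 1 = same source, 0 = different source
print(brier_score(judgements, ground_truth))  # 0.075 in this example
```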

González-Vallejo & Bonham (2007) also used a scoring rule based on the Brier scoring rule, but they did not just use it to measure the quality of the participants' performance; they also provided the participants with information about their scores in order to influence their performance positively. They manipulated the functions of the Brier scoring rule to create the different scoring feedback environments for the participants. For a detailed description of their calculations, see González-Vallejo & Bonham (2007).

4.2.2 Misleading evidence

Mattijssen, Witteman, Berger, Brand, et al. (2020) use the proportion of misleading choices per judged degree of support to measure calibration at one point in time. As mentioned before, a misleading choice is made by an examiner when they provide a degree of support for one of the hypotheses, while in fact the other hypothesis is the ground truth. For the examiners' performance, the authors calculated an expected proportion of misleading evidence per range of judged degree of support. For example, such a range can be a judged degree of support of 10-100 (or 100-1,000, or 1,000-10,000, etc.) for the hypothesis that the compared samples originated from the same source. For all samples to which the examiner has given this degree of support of 10-100, the proportion of misleading choices of the examiner is calculated: in other words, how often this examiner has given this degree of support of 10-100 to a sample pair that actually had different sources. This calculated proportion is then compared to a range of expected misleading choices for those degrees of support. To elaborate, a high degree-of-support judgement is expected to give a low proportion of misleading choices, as it expresses a high probability of the evidence given a certain hypothesis. In contrast, a low degree-of-support judgement has a higher expected proportion of misleading choices, with a larger range. This method of measuring calibration can produce a calibration graph that visually represents the calibration. In a comparison of different calibration measurement techniques by Vergeer, van Schaijk, et al. (2020), it was found that measuring calibration based on misleading evidence was able to differentiate between well-calibrated and ill-calibrated LR distributions, indicating that misleading evidence is a sensitive metric for calibration. However, compared to other metrics, the misleading-evidence measurement lacked stability when conditions, such as the sample size, varied (Vergeer, van Schaijk, et al., 2020).
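To illustrate the bookkeeping behind this measure, the sketch below computes the observed proportion of misleading choices per judged degree-of-support range. It only covers the observed side; the expected proportions against which Mattijssen, Witteman, Berger, Brand, et al. (2020) compare these values require further derivations that are not reproduced here, and the data below are invented.

```python
# Sketch: observed proportion of misleading choices per judged LR range.
# A choice is "misleading" when support is given for the same-source
# hypothesis while the ground truth is different source (or vice versa).
from collections import defaultdict

# Hypothetical records: (judged LR range for same source, ground truth)
records = [
    ("10-100", "same"), ("10-100", "different"), ("10-100", "same"),
    ("100-1000", "same"), ("100-1000", "same"), ("2-10", "different"),
]

counts = defaultdict(lambda: {"total": 0, "misleading": 0})
for lr_range, truth in records:
    counts[lr_range]["total"] += 1
    if truth == "different":          # support for same source was misleading
        counts[lr_range]["misleading"] += 1

for lr_range, c in counts.items():
    proportion = c["misleading"] / c["total"]
    print(f"LR {lr_range}: {proportion:.2f} misleading ({c['total']} choices)")
```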

4.2.3 The calibration graph

A calibration graph can be useful in two ways. First, it can be used to visualise the result of the calibration for the researcher. Since calibration is a complex factor to measure, it can be valuable to plot the values reported by an assessor in the same graph as the expected values, in order to examine the calibration and any over- or under-confidence of the assessor. When examining the possibility of improving calibration, it can be informative to examine the calibration graph before and after a training intervention. Second, as seen in Stone & Opel (2000), the calibration graph can provide a visual representation of performance and can therefore be used to give participants performance feedback. If this were done for feature-comparison experts, it could visualise over- or under-confidence, for example the overconfidence seen in firearm experts (Mattijssen, Witteman, Berger, Brand, et al., 2020). For a visual representation of what a calibration graph for feature-comparison experts may look like, an example is shown in figure 1. This example graph is based on a hypothetical task in which feature-comparison experts perform a categorical source choice (the choices being either same-source or different-source) in combination with a judged degree of support in the form of an LR range (such as 2-10 or 10-100). The calibration graph in figure 1 is based on the proportion of correct same-source choices per judged degree of support; the blue S-curve represents well-calibrated judgements, while the orange and green dots represent over- and under-confident judgements, respectively. As a visual aid, the orange arrows show how an overconfident judgement should be adjusted towards a well-calibrated judgement. For example, the overconfident judged degree of support of 10-100 has a proportion of approximately 0.75 correct same-source choices, which for a well-calibrated judgement is better suited to a judged degree of support of 2-10. Similarly, the green arrows indicate how an under-confident judgement can be adjusted towards a well-calibrated judgement. Note that this example is based on a same-source hypothesis, but a similar graph can be made for the different-source hypothesis, examining the proportion of correct different-source choices. Additionally, when considering misleading evidence (incorrect source choices), a similar but inverted calibration graph can be made.

Figure 1: Calibration graph showing the difference between well-calibrated judgements (blue line), overconfident judgements (orange dots) and under-confident judgements (green dots) in the proportion of correct same-source choices per judged degree of support. Note that all judged degrees of support are made based on a same-source hypothesis. The orange arrows indicate the adjustment that can be made for the overconfident judged degrees of support towards well-calibrated judgements, based on the proportion of correct same-source choices. The larger the judged degree of support (right side of the graph), the higher the proportion of correct same-source choices; the smaller the judged degree of support (left side of the graph), the lower the proportion of correct same-source choices.
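A calibration graph like figure 1 can be constructed from such binned data. The sketch below plots hypothetical observed proportions of correct same-source choices against the midpoints of the judged LR ranges, together with a reference curve LR/(1+LR); using that curve as the well-calibrated reference assumes equal numbers of same-source and different-source comparisons in the test set, and all plotted values are invented. It also assumes numpy and matplotlib are available.

```python
# Sketch of a calibration graph for judged degrees of support (same-source
# hypothesis); all plotted values are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

# Reference curve: for equal base rates of same- and different-source pairs,
# a well-calibrated LR corresponds to a proportion LR / (1 + LR) of correct
# same-source choices (an assumption of this sketch).
lr = np.logspace(-2, 4, 200)
plt.plot(lr, lr / (1 + lr), label="well-calibrated reference")

# Hypothetical observed proportions per judged LR range (bin midpoints on a
# log scale), e.g. overconfident judgements falling below the reference.
bin_mid = [5, 30, 300, 3000]         # midpoints of ranges 2-10, 10-100, ...
observed = [0.55, 0.75, 0.85, 0.90]  # invented proportions
plt.scatter(bin_mid, observed, color="orange", label="observed (overconfident)")

plt.xscale("log")
plt.xlabel("judged degree of support (LR) for the same-source hypothesis")
plt.ylabel("proportion of correct same-source choices")
plt.legend()
plt.show()
```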

4.3 Feedback

With the information gathered in the previous chapter, the use of feedback may improve calibration, which is essential in the current situation for feature-comparison experts. The feedback methods from Stone & Opel (2000) and González-Vallejo & Bonham (2007), both separately and combined, are considered for the improvement of calibration. Stone & Opel (2000) showed that performance feedback, with the use of a calibration graph and an individual feedback session with an examiner, improved calibration in an art-focused task for participants who had no previous knowledge of art (history). González-Vallejo & Bonham (2007) showed that calibration was improved by using a trial-by-trial outcome feedback method that included a scoring rule with both rewards and penalties, on a general-knowledge task. In both studies, the participants started with a learning phase because they did not have previous knowledge about the specific task. Feature-comparison experts, however, will not need a learning phase, as they already have the task knowledge from regularly performing feature comparisons. Therefore, their first measurement on a feature-comparison task will be a baseline measurement, which can be used to create a baseline calibration graph. Letting the feature-comparison experts start with a discrimination task combined with a degree of support will create this baseline calibration graph, which can be used for performance feedback after the experts have completed their task. Instead of providing the experts with performance feedback, it is also possible to provide them with outcome feedback in combination with a scoring rule after the baseline measurement, in a second test round with trial-by-trial feedback, as suggested by González-Vallejo & Bonham (2007). Both forms of feedback can be considered separately, but given the urgency of improving the calibration of feature-comparison experts, it can also be interesting to examine a combination of both feedback techniques. It is not known how this combination will affect the calibration of experts, but since both techniques have had a positive effect, it can be valuable to see whether a larger positive effect is found with the combination. In an experimental set-up, this can be examined by comparing three different groups of feature-comparison experts, each receiving one of three feedback forms. The first group would receive only performance feedback, with a calibration graph and an individual session; a second group would receive only trial-by-trial feedback using the reward/penalty scoring rule suggested by González-Vallejo & Bonham (2007); and the third group would receive both.

Incorporating deliberate practice into this feedback is an important consideration. A main difference with the participants of Stone & Opel (2000) and González-Vallejo & Bonham (2007) is that their participants were psychology students who participated to fulfil class requirements, whereas the deliberate practice method calls for a strong sense of internal motivation to perform at the highest level (Ericsson et al., 1993). Considering the importance of their reports to the courts, feature-comparison experts may well have a stronger motivation to perform well on the task. Additionally, deliberate practice calls for immediate feedback, which is consistent with the set-up of Stone & Opel (2000), but not with the trial-by-trial method, where 48 hours passed between the start of the first (learning-phase) task and the start of the second task. Therefore, if trial-by-trial feedback with a scoring rule is used, it is recommended to follow the basis of deliberate practice and perform the second task immediately after a baseline measurement of performance. This element of time also influences the choice to use different samples in the baseline measurement and the second measurement. In the trial-by-trial feedback study by González-Vallejo & Bonham (2007), about two-thirds of the samples were re-used for the second task, which took place 48 hours later. However, if the feature-comparison experts perform their second task on the same day as their baseline measurement, using the same samples may influence the results. Therefore, it is recommended to follow the set-up of Stone & Opel (2000) and use different samples for the baseline measurement and the second task, both for performance feedback and for trial-by-trial feedback.

In summary, three previous studies were examined specifically with regard to the best measurement of calibration and the best manner of providing feedback to feature-comparison experts. Scoring rules can be useful for the measurement of calibration, as used by González-Vallejo & Bonham (2007). Misleading evidence is a sensitive measure, but more research must be done into methods of measuring calibration using misleading evidence, as there is an indication that other methods may be better (Vergeer, van Schaijk, et al., 2020). The calibration graph provides a lot of information and can be used for a visual representation of results, but also offers an opportunity to provide feedback. Both performance feedback with a calibration graph in combination with an individual feedback session, and trial-by-trial feedback appear to be promising methods for the improvement of calibration in feature-comparison experts. Therefore, it is recommended to perform a study comparing the improvement in calibration for feature-comparison experts with (1) performance feedback, (2) trial-by-trial feedback with a scoring rule, and (3) a combination of (1) and (2).


5. Discussion

It was found that certain forensic feature-comparison experts showed poor calibration of their opinion-based likelihood ratios (Mattijssen, Witteman, Berger, Zheng, et al., 2020). Therefore, this literature review focussed on three main aspects: the cognitive processes involved in the calibration of likelihood ratios, the use of feedback to improve calibration in feature-comparison experts and potential approaches for future research. The cognitive and psychological processes of estimating a probability value, such as a degree of support for a specific hypothesis in an LR-based approach, depend in part on the ability to learn statistical information from the environment (Growns & Mattijssen, 2020). This process is known as statistical learning, which is based on a combination of bottom-up processing and top-down modulation of information in the brain (Conway, 2020). It was also found that the frequency, or rarity, of features in feature-comparison methods was not accurately estimated by experts, although their performance was better than that of novices (Growns & Martire, 2020; Mattijssen, Witteman, Berger, & Stoel, 2020). It is an urgent matter that this statistically learned information is improved, and the literature shows that providing feedback is a promising method (Ericsson et al., 1993; González-Vallejo & Bonham, 2007; Stone & Opel, 2000). Additionally, deliberate practice, which includes motivation of the expert, appropriate tasks and immediate feedback, is an important factor in the improvement of performance on a task (Ericsson et al., 1993). Finally, previous research on calibration and feedback methods was assessed, to consider potential methods for the improvement of calibration in feature-comparison experts (González-Vallejo & Bonham, 2007; Mattijssen, Witteman, Berger, Brand, et al., 2020; Stone & Opel, 2000). Trial-by-trial feedback in combination with a reward-and-penalty scoring rule, and performance feedback given immediately after a task of multiple feature comparisons, using a calibration graph and an individual feedback session, show potential for improving experts' performance. It is therefore recommended to perform a study comparing both feedback methods and examining the effect of a combination of both. Additionally, the use of strictly proper scoring rules to measure calibration is recommended. However, it must be kept in mind that new, improved methods of calibration measurement are being developed, such as by Vergeer, van Schaijk, et al. (2020), which can potentially provide more accurate measurements.

With these recommendations in mind, a few factors should be taken into account when developing an experimental set-up to improve the calibration of feature-comparison experts. First, experts' regular casework contains a certain base rate of same-source and different-source samples, but also a base rate of comparisons that are more difficult or easier. It is unknown whether improvement of experts' calibration in casework depends on the feedback task reflecting an accurate base rate of difficult and easy sample comparisons.

Second, in this review the degree of support for a hypothesis has been examined as a probability judgement. However, as explained in the introduction, an LR is based on two probability judgements, one for each of the hypotheses. The study by Mattijssen, Witteman, Berger, Brand, et al. (2020) derived the degree of support given by the experts for either the same-source or the different-source hypothesis under the assumption that the numerator of the LR equals 1, with the denominator consisting of a probability judgement based on the subjective perception of the frequency of occurrence of features. This makes their LR resemble a random match probability. Their finding of poor calibration among the firearm examiners is based on this method, and it may be interesting to re-examine their calibration if the experts were to provide probability judgements for both hypotheses, creating a proper LR.
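Written out explicitly (with $H_{ss}$ and $H_{ds}$ introduced here as shorthand for the same-source and different-source hypotheses, and $E$ for the observed features), the approach described above amounts to

\[ LR = \frac{P(E \mid H_{ss})}{P(E \mid H_{ds})} \approx \frac{1}{P(E \mid H_{ds})}, \]

so that the reported value is governed entirely by the expert's judged frequency of occurrence of the observed features under the different-source hypothesis, i.e., by a random match probability.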

Third, as noted by Mattijssen, Witteman, Berger, Brand, et al. (2020), only 10 of 77 international firearm examiners used an LR-based approach in their casework. Since that was already an international study, it may prove difficult to gather more participants. However, that study concerned firearm examiners only; including experts from other feature-comparison disciplines could yield a larger total number of participants and a larger-scale investigation. Although the sample comparisons would have to be tailored to each discipline's expertise, setting up studies so that effects can be compared between disciplines may provide more insight into whether feedback can improve the calibration of feature-comparison experts in general.

The final consideration concerns the limits of human capabilities. If future research shows that feature-comparison experts cannot improve their calibration sufficiently to provide valid LRs for the courts, transferring feature comparisons to computer-based methods must be considered. A substantial amount of research is currently devoted to computer-based methods; in some areas they surpass the abilities of human experts, while in others they still fall short (Dror & Stoel, 2014; Mattijssen, Witteman, Berger, Zheng, et al., 2020). In the future they may aid in feature-comparison casework, but until then it is essential to investigate how the calibration of forensic feature-comparison experts can be improved.


Search strategy

For this review, literature was obtained using the following web-based platforms:

- www.sciencedirect.com

- www.pubmed.ncbi.nlm.nih.gov/

- www.ncbi.nlm.nih.gov/

- www.scholar.google.nl/

The literature search started from the paper by Mattijssen, Witteman, Berger, Brand, et al. (2020), which prompted the research question. After exploring the literature referenced in that paper, the search was continued using the following search terms and combinations:

“feature-comparison methods”, “likelihood ratio”, “feedback learning”, “learning probability judgement”, “probability estimation”, “feedback learning experts”, “subjective probability judgement”, “statistical learning”, “cognitive forensic science”, “expert perception”, “deliberate practice”, “decision making”, “improving calibration”, “experts”, “strictly proper scoring rules”, “brier score”.

Additionally, some papers were used for reverse citation searching, to identify later work that cited them, including:

Dror, I. E., & Stoel, R. D. (2014). Cognitive Forensics: Human Cognition, Contextual Information, and Bias. In G. Bruinsma & D. Weisburd (Eds.), Encyclopedia of Criminology and Criminal Justice (pp. 353–363). Springer New York. https://doi.org/10.1007/978-1-4614-5690-2_147

Mattijssen, E. J. A. T., Witteman, C. L. M., Berger, C. E. H., Brand, N. W., & Stoel, R. D. (2020). Validity and reliability of forensic firearm examiners. Forensic Science International, 307, 110112. https://doi.org/10.1016/j.forsciint.2019.110112

Stone, E. R., & Opel, R. B. (2000). Training to Improve Calibration and Discrimination: The Effects of Performance and Environmental Feedback. Organizational Behavior and Human Decision Processes, 83(2), 282–309. https://doi.org/10.1006/obhd.2000.2910


References

Aitken, C., Roberts, P., & Jackson, G. (2010). Fundamentals of probability and statistical evidence in criminal proceedings: guidance for judges, lawyers, forensic scientists and expert witnesses. http://eprints.nottingham.ac.uk/id/eprint/1859

Ambrus, G. G., Vékony, T., Janacsek, K., Trimborn, A. B. C., Kovács, G., & Nemeth, D. (2020). When less is more: Enhanced statistical learning of non-adjacent dependencies after disruption of bilateral DLPFC. Journal of Memory and Language, 114(June), 104144. https://doi.org/10.1016/j.jml.2020.104144

Ayton, P., & Pascoe, E. (1995). Bias in human judgement under uncertainty? The Knowledge Engineering Review, 10(1), 21–41. https://doi.org/10.1017/S0269888900007244

Biedermann, A., Garbolino, P., & Taroni, F. (2013). The subjectivist interpretation of probability and the problem of individualisation in forensic science. Science and Justice, 53(2), 192–200. https://doi.org/10.1016/j.scijus.2013.01.003

Bogaerts, L., Frost, R., & Christiansen, M. H. (2020). Integrating statistical learning into cognitive science. Journal of Memory and Language, 115(August), 104167. https://doi.org/10.1016/j.jml.2020.104167

Bolton-King, R. S. (2016). Preventing miscarriages of justice: A review of forensic firearm identification. Science and Justice, 56(2), 129–142. https://doi.org/10.1016/j.scijus.2015.11.002

Committee on Identifying the Needs of the Forensic Sciences Community: National Research Council. (2009). Strengthening forensic science in the United States: A path forward. The National Academies Press. https://doi.org/10.17226/12589

Conway, C. M. (2020). How does the brain learn environmental structure? Ten core principles for understanding the neurocognitive mechanisms of statistical learning. Neuroscience and Biobehavioral Reviews, 112(January), 279–299. https://doi.org/10.1016/j.neubiorev.2020.01.032

DeGroot, M. H., & Fienberg, S. E. (1983). The Comparison and Evaluation of Forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32(1/2), 12–22. https://doi.org/10.2307/2987588

Dror, I. E., & Cole, S. A. (2010). The vision in “blind” justice: Expert perception, judgment, and visual cognition in forensic pattern recognition. Psychonomic Bulletin and Review, 17(2), 161– 167. https://doi.org/10.3758/PBR.17.2.161


Dror, I. E., & Stoel, R. D. (2014). Cognitive Forensics: Human Cognition, Contextual Information, and Bias. In G. Bruinsma & D. Weisburd (Eds.), Encyclopedia of Criminology and Criminal Justice (pp. 353–363). Springer New York. https://doi.org/10.1007/978-1-4614-5690-2_147

Edmond, G., Towler, A., Growns, B., Ribeiro, G., Found, B., White, D., Ballantyne, K., Searston, R. A., Thompson, M. B., Tangen, J. M., Kemp, R. I., & Martire, K. (2017). Thinking forensics: Cognitive science for forensic practitioners. Science and Justice, 57(2), 144–154. https://doi.org/10.1016/j.scijus.2016.11.005

Ericsson, K. A. (2006). An introduction to Cambridge handbook of expertise and expert performance: Its development, organization, and content. In The Cambridge Handbook of Expertise and Expert Performance.

Ericsson, K. A., Krampe, R. T., & Tesch-Römer, C. (1993). The Role of Deliberate Practice in the Acquisition of Expert Performance. Psychological Review, 100(3), 363–406. https://doi.org/10.1037/0033-295x.100.3.363

Ferretti, V., Guney, S., Montibeller, G., & Winterfeldt, D. Von. (2016). Testing best practices to reduce the overconfidence bias in multi-criteria decision analysis. Proceedings of the Annual Hawaii International Conference on System Sciences, 2016-March, 1547–1555. https://doi.org/10.1109/HICSS.2016.195

Found, B. (2015). Deciphering the human condition: The rise of cognitive forensics. Australian Journal of Forensic Sciences, 47(4), 386–401. https://doi.org/10.1080/00450618.2014.965204

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.1198/016214506000001437

González-Vallejo, C., & Bonham, A. (2007). Aligning confidence with accuracy: Revisiting the role of feedback. Acta Psychologica, 125(2), 221–239. https://doi.org/10.1016/j.actpsy.2006.07.010

Growns, B., & Martire, K. A. (2020). Human factors in forensic science: The cognitive mechanisms that underlie forensic feature-comparison expertise. Forensic Science International: Synergy, 2, 148–153. https://doi.org/10.1016/j.fsisyn.2020.05.001

Growns, B., & Mattijssen, E. J. A. T. (2020). Distributional Statistical Learning: How and How Well Can It Be Measured? 2213–2219.

Harvey, N., & Fischer, I. (2014). Development of experience-based judgment and decision making: The role of outcome feedback. The Routines of Decision Making, 117–138. https://doi.org/10.4324/9781410611826


Pearson Education Limited.

Kerkhoff, W., Stoel, R. D., Mattijssen, E. J. A. T., & Hermsen, R. (2013). The likelihood ratio approach in cartridge case and bullet comparison. AFTE Journal, 45(3), 284–289.

Kerkhoff, W., Stoel, R. D., Mattijssen, E. J. A. T., Hermsen, R., Hertzman, P., Hazard, Gallidabino, M., Hicks, T., & Champod, C. (2017). Cartridge case and bullet comparison: Examples of evaluative reporting. AFTE Journal, 49(2), 111–121.

Martire, K. A., Growns, B., & Navarro, D. J. (2018). What do the experts know? Calibration, precision, and the wisdom of crowds among forensic handwriting experts. Psychonomic Bulletin and Review, 25(6), 2346–2355. https://doi.org/10.3758/s13423-018-1448-3

Mattijssen, E. J. A. T., Witteman, C. L. M., Berger, C. E. H., Brand, N. W., & Stoel, R. D. (2020). Validity and reliability of forensic firearm examiners. Forensic Science International, 307, 110112. https://doi.org/10.1016/j.forsciint.2019.110112

Mattijssen, E. J. A. T., Witteman, C. L. M., Berger, C. E. H., & Stoel, R. D. (2020). Assessing the frequency of general fingerprint patterns by fingerprint examiners and novices. Forensic Science International, 313, 110347. https://doi.org/10.1016/j.forsciint.2020.110347

Mattijssen, E. J. A. T., Witteman, C. L. M., Berger, C. E. H., Zheng, X. A., Soons, J. A., & Stoel, R. D. (2020). Firearm examination: Examiner judgments and computer-based comparisons. Journal of Forensic Sciences. https://doi.org/10.1111/1556-4029.14557

Mazzocco, K., & Cherubini, P. (2010). The effect of outcome information on health professionals’ spontaneous learning. Medical Education, 44(10), 962–968. https://doi.org/10.1111/j.1365-2923.2010.03744.x

Mojab, R. (2016). Probabilistic Forecasting with Stationary VAR Models. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2818213

O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E., & Rakow, T. (2006a). The Elicitation of Probabilities. In Uncertain Judgements: Eliciting Experts’ Probabilities (pp. 61–96). John Wiley & Sons, Ltd. https://doi.org/10.1002/0470033312.ch4

O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E., & Rakow, T. (2006b). The Psychology of Judgement. In Uncertain Judgements: Eliciting Experts’ Probabilities (pp. 33–59). John Wiley & Sons, Ltd.

President’s Council of Advisors on Science and Technology. (2016). Report to the President - Forensic Science in Criminal Courts: Ensuring Scientific Validity. September, 1–160.
