The use of complex mixtures for DNA database searches when relatives are present in the database


A graduation research project (36 EC)

Written by

Anouk de Ronde, 10419292

MSc in Forensic Science

University of Amsterdam

January 2014 - June 2014

20 June 2014

Supervisor: K.J. Slooten, The Netherlands Forensic Institute

Examiner: M. Sjerps


Preface

In order to complete the Master's programme in Forensic Science at the University of Amsterdam, I did my internship at the Netherlands Forensic Institute. The subject of this thesis is the use of complex mixtures for a DNA database search when relatives are present in the database. The project started in January 2014 and was completed in June 2014.

I wish to thank Klaas Slooten for all the support during the project. I also want to thank Jerien Koopman for the assistance during the project. Thanks also to Marjan Sjerps, who guided me on behalf of the UvA.


Abstract

Forensic scientists increasingly recognize the DNA database as a crime-solving tool for single source profiles. Additionally, the analysis and interpretation of mixed DNA profiles form a key part of forensic DNA analysis. Using complex mixtures with drop-out and drop-in for DNA database searches would be valuable for obtaining investigative leads in a case. However, a relative of the donor may be present in the database, and this can lead to problematic situations since relatives share alleles identical by descent. In this research, I study whether it is feasible to search complex mixtures against a DNA database in which relatives are present. The computer software Mathematica is used to simulate profiles, relatives, complex mixtures and a DNA database representing the Dutch offender database. In the database search and the familial search, a likelihood ratio (LR) is calculated for each individual in the database and the obtained LRs are studied. The research shows that the quality of the results depends on different features of the mixture: allelic drop-out, allelic drop-in, the number of donors, the number of replicates and rare alleles all influence the outcome of a database search. However, database searches and familial searches on complex two-person mixtures are shown to be feasible as an investigative technique, especially in cases where a victim profile is known. For familial searches, there is a trade-off between the true positive rate and the false positive rate of the search, and these rates differ per mixture due to various variables. It is therefore difficult to determine one fixed LR threshold that can be used for all mixtures; an LR threshold has to be decided per mixture, for familial searches as well as for database searches.
Additionally, some three-person mixtures are analyzed but the results show that these mixtures are probably not very suitable for DNA database searches and familial searches. In order to be able to draw a conclusion about three-person mixtures, a full analysis would be necessary.


Chapter 1 - Introduction

The analysis and interpretation of mixed DNA profiles form an important part of forensic DNA analysis. In some cases, a suspect is available who may be the donor of mixed trace evidence found at the crime scene. In other cases, no suspect is available and a forensic DNA database may help to search for a donor profile. The fact that relatives share part of their DNA is used in investigations of family relationships, and this feature can also be used in a database search. Forensic scientists increasingly recognize the DNA database as a crime-solving tool for single source profiles, and it is useful to study the possibility of using complex mixtures for database searches. To that end, the complications that arise when relatives are present in the database have to be studied, and this is the subject of my research. I performed this research at the Netherlands Forensic Institute (NFI). In this thesis, I will present my results.

1.1 Background

Complex mixtures

In criminal cases concerning a DNA mixture, several individuals contributed to a biological stain. Sexual assault cases are an example where the vaginal swab may indicate the presence of the victim and one or more other individuals. When these samples are analyzed in the lab, a mixed DNA profile is obtained and it can be difficult to determine who the contributors to the mixture are.

The determination of the contributors to a mixed biological stain can be even more difficult when only a low amount of DNA is present, the so-called low-template DNA (LTDNA) samples. In the profiling process of these samples, stochastic effects that influence the typing of the alleles are more likely to occur, resulting in artifacts like allele drop-out and allele drop-in (Buckleton, Triggs and Walsh (2004)). Allele drop-out occurs when an allele fails to amplify during the polymerase chain reaction (PCR) amplification. Allele drop-in may occur due to the high sensitivity of the STR typing when analyzing LTDNA samples. The sensitivity is very high in order to generate a reliable STR profile out of a low-template sample. Because of this high sensitivity, additional alleles (drop-in alleles) that were not originally present in the sample can be observed due to sporadic contamination. Profiles obtained from LTDNA samples are so-called complex mixture profiles, and it has been shown that the interpretation of these complex mixtures can be very difficult.

Likelihood ratio calculation with mixtures

In order to determine the evidential value of a possible match between a mixture profile and a reference profile, the use of the likelihood ratio method is recommended by the DNA commission of the International Society of Forensic Genetics (Gill et al. (2006)). The likelihood ratio (LR) is calculated by dividing the probability of observing the mixture given the prosecutor's hypothesis (Hp) by the probability of observing the mixture given the defense hypothesis (Hd).

The LR calculation for full profile mixtures with no artifacts is described by Weir et al (1997). Curran, Gill and Bill (2005) describe a method to calculate an LR that allows for drop-out and drop-in, the so-called 'semi-continuous model'. In this model, information derived from the electropherogram is used to determine the drop-out probabilities. These drop-out probabilities together with the possibility of drop-in are used in the LR calculation. Haned, Slooten and Gill (2012) extend this model to the 'SplitDrop' model, where the probability of drop-out can be varied per contributor or per hypothesis.

The question that arises is how the correct drop-out and drop-in probabilities can be estimated. The drop-in parameter is estimated by examining negative controls. The drop-out parameter is more difficult to assess, and this is discussed by many scientists in the literature (Gill, Kirkham and Curran (2007), Balding and Buckleton (2009), Tvedebrink et al (2009) and Gill et al (2012)).


No consensus has yet been reached about how to assess the exact drop-out probability, and in most cases a range of plausible values is analyzed (Balding and Buckleton (2009)).

Database search

Nowadays, a DNA database is often used as a powerful tool for the identification of individuals in criminal cases. In 1990, the CODIS database was created as a pilot by the FBI to store and compare DNA profiles (Gershaw & Schweighardt (2011)). The first database in Europe was formed in the UK in 1995 (Martin, Schmitter and Schneider (2001)). This database showed how effective a DNA database is for the identification of perpetrators. In 1998, the Netherlands introduced a national DNA database. Nowadays, the Dutch DNA database consists of more than 180,000 DNA profiles from individuals (see footnote 1), which are typed with the SGM+ kit (10 STR loci + Amelogenin) or with the NGM kit (15 loci + Amelogenin).

Initially, complete profiles from single contributors or mixed profiles where the main contributor could be unambiguously identified were searched against the database. Nowadays, research is done into searching unresolved mixtures against the database, as described by Bleka et al (2013) and Bright et al (2014). In a database search with a mixture, for every person in the database a LR is determined for the hypotheses:

Hp: The individual in the database and n unknown individual(s) contributed to the mixture.

Hd: n+1 unknown individuals contributed to the mixture.

The LRs are sorted and the result is a ranking, which can be used as an investigative lead in a criminal case.

Kinship analysis

In kinship analysis, the fact that relatives share part of their profiles is used to analyze family relations. Children receive an allele from their father and an allele from their mother and therefore parents and children share half of their DNA profiles, mutations ignored. Also siblings, half siblings and otherwise related persons share alleles identical by descent because of shared ancestry. This feature was already used in mass disaster victim identifications and in the identification of missing persons.

Recently, Bieber, Brenner and Lazer (2006) described that these features could potentially be used to search in an offender DNA database to identify relatives of a potential suspect. This method is called familial searching. Familial searching is only performed on full single source profiles and can be done by searching in the DNA database for rare alleles present in the profile, a high number of matching alleles with the profile or by the calculation of the likelihood ratio for every profile in the database. Previous studies by Hicks et al (2010) and Curran and Buckleton (2008) have shown that the LR method outperforms the other methods. Familial search is nowadays only performed on the parent/child and sibling relationships, since the power of a familial search is readily reduced for relationships such as half-siblings (Curran and Buckleton (2008)). Nevertheless, Slooten and Meester (2014) show that it is sometimes also possible to find a half sibling with the use of a sibling search.

In a familial search, for every person in the database a LR is computed for the hypotheses:

Hp: A relative of the individual in the database is a donor of the crime stain.

Hd: An unknown is the donor of the crime stain.

High LRs are characteristic for relatives but a high LR can also be obtained by a coincidental match with an unrelated individual. Therefore, Bieber, Brenner and Lazer (2006) suggest that investigators can use the LR ranking of the database as an investigative lead in a criminal case.

Footnote 1: www.dnadatabank.forensischinstituut.nl/resultaten/groei_dna_databank_strafzaken - Accessed 10 June 2014


1.2 Aim of this research

Although a lot of research has been done in the separate fields of mixture analysis, database searches and kinship analysis, the combination of these fields has received little study, and the studies that do combine them leave out important elements.

Nowadays, research is done into searching unresolved mixtures against the database in order to obtain an investigative lead in a case, as described by Bleka et al (2013) and Bright et al (2014). Bleka et al (2013) perform a database search on complex mixtures and try to compare different match statistics. But the individuals in the database are all unrelated and it is important to study what will happen if relatives are present in the database. Bright et al (2014) discuss the search of mixed DNA profiles directly against a profile database and test the effects of over- and underestimating the number of contributors to a mixture, taking a fixed drop-in and drop-out parameter into account. Also in this research, the possibility of relatives present in the database is not taken into account and the effect of different drop-out probabilities on the database search results is not studied.

Chung, Fung and Hu (2010) and Chung and Fung (2013) focus on familial searches on mixtures. They focus on cases where a victim and one perpetrator are the contributors to a mixture with no drop-out and drop-in. These studies show that familial search is feasible on full profile mixtures, assuming the victim's profile is known. I want to study whether this is also possible for mixtures where none of the contributors are known and for complex mixtures.

In my research, the combination of complex mixtures, database searches and kinship analysis is studied. It is known that in a database search, the probability of an adventitious match is low if the crime profile and the database profile are both full single source profiles. But if a mixture with artifacts is used, adventitious matches will occur with a higher probability. Additionally, the probability of a match between a relative in the database and the mixture will increase. Important in the calculation of the likelihood ratios are the drop-out probabilities, but since there is no consensus yet about how to estimate these probabilities, wrong estimations can occur. All these factors can possibly lead to dangerous situations where a relative can be misinterpreted as the donor of the mixture and these situations have to be studied before complex mixtures can be used for database searches.

The aim of this research is to investigate whether complex mixtures can be used for DNA database searches and what will happen if relatives are present in the database. Therefore, the research questions addressed in this research are:

1. What likelihood ratios in favor of being the donor do relatives of the donor have for different complex mixtures?

- What is the difference between the LRs for the different kinds of relationships?

- What is the effect of the number of replicates (one or three) on the LRs for relatives and the donor?

- What LRs do relatives have for being the donor of a mixture with drop-out and drop-in, when $d_{mix} = d_{LR}$?

- What is the effect of calculating the likelihood ratio with the wrong drop-out vector ($d_{mix} \neq d_{LR}$)?

2. When a database is used to search for donors of a mixture, how high will the LR of relatives be?

- What is the effect of drop-out on the ranking in the database for relatives in case of $d_{mix} = d_{LR}$?

- What is the effect of calculating the likelihood ratio with the wrong drop-out vector ($d_{mix} \neq d_{LR}$) on the database ranking?


3. Is familial search on complex mixtures feasible?

- What are the probabilities of observing a relative of a donor in the top-n of the ranking for different drop-out rates?

- What are the false positive and true positive rates for the different kinds of mixtures?

- Is it possible to determine an LR threshold for familial searches on complex mixtures?

Scope

These research questions form the basis of this thesis. The mixtures I analyze are two-person and three-person mixtures with different drop-out rates. I investigate the influence of different relationships, the number of replicates and the drop-out probability on the results. The relationships parent/child, full siblings and half siblings are examined, since these relationships share the most alleles identical by descent. In the experiments, I studied the following two sets of hypotheses, which I will refer to as the victim case and the victimless case.

Victim case:

Hp: Victim + suspect contributed to the mixture.

Hd: Victim + unknown contributed to the mixture.

Victimless case:

Hp: Suspect + unknown contributed to the mixture.

Hd: Two unknowns contributed to the mixture.

In the victim case, the profile of the victim is known and this is a full profile with no drop-out or drop-in. The profile of the unknown contributor possibly shows drop-out and drop-in. An example of the victim case is a sexual assault case, where the vaginal swab indicates the full profile of the victim and another unknown profile. In the victimless case, none of the contributors are known and both contributors possibly show drop-out and drop-in.

1.3 Structure of the thesis

In the next chapter, the technical background supporting the computer simulations is discussed. The research can be subdivided into two areas: the research into an undesired high likelihood ratio for relatives and the research into familial search on mixtures. In Chapter 3, the first topic is discussed. This study is divided in two experiments; one to observe the LRs for different relatives and a second experiment in which a DNA database is used. In Chapter 4, familial search on mixtures is described. In this section, a high LR for a relative is used as an advantage to find relatives in a database. In Chapter 5, some three person mixtures are tested and the results are analyzed. Chapter 6 contains the discussion of the results, followed by the references and Appendices A and B, containing respectively additional tables and figures.


Chapter 2 - Theoretical framework

In order to answer the research questions, computer simulations are performed using the computer software Mathematica. A notebook is generated, which is used to simulate profiles, relatives and mixtures as well as a DNA database representing the Dutch offender DNA database. In this chapter, I will describe the theoretical background supporting this notebook.

2.1 Simulation of profiles and relatives

For the calculations, the Dutch allele frequencies published by Westen et al (2014) are used. This list of allele frequencies does not take subpopulations into account. Throughout the research, it is assumed that the Hardy-Weinberg equilibrium holds. More specifically, it is assumed that the two alleles present on the same locus are combined into a genotype independently. In this equilibrium, the frequency of a genotype consisting of the alleles $a$ and $b$, with population frequencies $p_a$ and $p_b$, can be described by Equation (1).

$$P(a, b) = \begin{cases} p_a^2 & \text{if } a = b \\ 2\, p_a p_b & \text{if } a \neq b \end{cases} \tag{1}$$

In the Mathematica notebook used, a random profile per locus is generated by a weighted choice of two alleles by their population frequency. This is done for all 23 loci mentioned by Westen et al (2014). Because the alleles present on different loci are assumed to be combined into a genotype independently of each other, the random match probability for a complete profile $g$ can be calculated with the use of Equation (2).

$$P(g) = \prod_{l=1}^{L} (2 - \delta_{a_l b_l})\, p_{l,a_l}\, p_{l,b_l} \tag{2}$$

Where $l = 1, \dots, L$ are the loci, $(a_l, b_l)$ is the genotype at locus $l$, $p_{l,a}$ is the frequency of allele $a$ on locus $l$ and $\delta$ is the Kronecker delta function, which is defined as described in Equation (3).

$$\delta_{ab} = \begin{cases} 1 & \text{if } a = b \\ 0 & \text{if } a \neq b \end{cases} \tag{3}$$

For the randomly generated profiles, the SGM+ loci and the NGM loci can be shown and the random match probabilities for these profiles can be calculated with Equation (2) for the corresponding loci.
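As an illustration of Equations (1) and (2), the following Python sketch draws a genotype per locus by a weighted choice of two alleles and multiplies the per-locus genotype frequencies into a random match probability. It is not the Mathematica notebook used in this research: the locus names, allele labels and frequencies in `FREQS` are made-up values, and all function names are hypothetical.

```python
import random

# Hypothetical allele frequencies for two loci (illustrative values only,
# not the Dutch frequencies published by Westen et al).
FREQS = {
    "locus_A": {"12": 0.2, "13": 0.5, "14": 0.3},
    "locus_B": {"8": 0.6, "9": 0.4},
}

def sample_genotype(freqs, rng):
    """Draw two alleles independently, weighted by population frequency."""
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    a, b = rng.choices(alleles, weights=weights, k=2)
    return tuple(sorted((a, b)))

def genotype_probability(genotype, freqs):
    """Hardy-Weinberg genotype frequency: (2 - delta_ab) * p_a * p_b."""
    a, b = genotype
    delta = 1 if a == b else 0
    return (2 - delta) * freqs[a] * freqs[b]

def random_match_probability(profile, all_freqs):
    """Product of the per-locus genotype frequencies over all loci."""
    rmp = 1.0
    for locus, genotype in profile.items():
        rmp *= genotype_probability(genotype, all_freqs[locus])
    return rmp

rng = random.Random(1)
profile = {locus: sample_genotype(f, rng) for locus, f in FREQS.items()}
print(profile, random_match_probability(profile, FREQS))
```

Restricting `profile` to the SGM+ or NGM loci before calling `random_match_probability` would give the random match probability for the corresponding kit.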

For relatives, we know they share alleles identical by descent (IBD) with a certain probability. These probabilities are denoted by the relatedness coefficients $(\kappa_0, \kappa_1, \kappa_2)$ described by Fung and Hu (2008). If person A and B are related according to $(\kappa_0, \kappa_1, \kappa_2)$, $\kappa_0$ denotes the probability that neither allele of A is IBD to an allele of B, $\kappa_1$ the probability that one of the alleles of A is IBD to one allele of B, and $\kappa_2$ the probability of two alleles IBD between person A and person B.

The relatedness coefficients used in this thesis are shown in Table 1. With the use of these coefficients, a relative of a profile can be generated in the notebook. Since the relatedness coefficients for half siblings are identical to the coefficients for the relationships grandparent-grandchild and uncle-nephew, the results obtained for the half siblings also apply to these relationships.

Relationship     $\kappa_0$   $\kappa_1$   $\kappa_2$
Parent-child     0            1            0
Full siblings    0.25         0.5          0.25
Half siblings    0.5          0.5          0
Unrelated        1            0            0

Table 1: Relatedness coefficients. Probability that relatives share 0, 1 or 2 alleles IBD.
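The generation of a relative from the coefficients in Table 1 can be sketched as follows in Python. This is a minimal sketch under stated assumptions, not the notebook itself: the number of IBD alleles is drawn first, the IBD alleles are copied from the given genotype and the remaining alleles are drawn from the population. The frequencies in `FREQS` and all function names are hypothetical.

```python
import random

# Relatedness coefficients from Table 1: P(0 IBD), P(1 IBD), P(2 IBD).
KAPPA = {
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

# Hypothetical allele frequencies for one locus (illustrative values only).
FREQS = {"12": 0.2, "13": 0.5, "14": 0.3}

def draw_allele(freqs, rng):
    """Draw one allele weighted by its population frequency."""
    alleles = list(freqs)
    return rng.choices(alleles, weights=[freqs[a] for a in alleles], k=1)[0]

def sample_relative_genotype(genotype, kappa, freqs, rng):
    """Generate the genotype of a relative at one locus."""
    n_ibd = rng.choices([0, 1, 2], weights=kappa, k=1)[0]
    if n_ibd == 2:
        new = list(genotype)  # both alleles shared IBD
    elif n_ibd == 1:
        # one allele copied from the given genotype, one from the population
        new = [rng.choice(genotype), draw_allele(freqs, rng)]
    else:
        new = [draw_allele(freqs, rng), draw_allele(freqs, rng)]
    return tuple(sorted(new))

rng = random.Random(5)
print(sample_relative_genotype(("12", "14"), KAPPA["full siblings"], FREQS, rng))
```

Applying `sample_relative_genotype` to every locus of a profile yields a complete relative profile.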


2.2 Simulating mixtures and likelihood ratio calculation

In order to simulate mixtures and calculate the likelihood ratio, the semi-continuous model as described by Curran, Gill and Bill (2005) and Haned, Slooten and Gill (2012) is implemented. Curran, Gill and Bill (2005) denote the probability of an allele dropping out by $D$ and the probability of an allele not dropping out by $\bar{D} = 1 - D$. Haned, Slooten and Gill (2012) modified this definition for homo- and heterozygosity. They assume there is a number $d_i$, the heterozygous drop-out probability, and a number $d_i'$, the homozygous drop-out probability, for an individual $i$. The homozygous drop-out probability is assumed to be at maximum $d_i^2$ since alleles amplify independently. In the Mathematica implementation used in this research, the maximum $d_i' = d_i^2$ is assumed.

For drop-in, the variable $c$ is used. It denotes the expected number of drop-in alleles per locus. The event of an allele dropping in is rare, and laboratory records have indicated that one drop-in allele per 20 loci is expected (Gill, Kirkham and Curran (2007)). Therefore $c$ is set to 0.05.

In the notebook, mixtures have to be simulated with drop-out and drop-in. At first, the contributors to the mixture are randomly generated. For each allele in their profiles, there are two options: either to drop out or not to drop out. The probabilities for alleles to drop out are described by the vector $d = (d_1, \dots, d_n)$, where element $d_i$ is the drop-out probability for contributor $i$. The probability for a certain allele to occur in the DNA mixture needs to be calculated.

Suppose we look at a locus $l$ with allelic ladder $\{a_1, \dots, a_m\}$. These are all the possible alleles that can occur at this locus, and they each have a population frequency $p_a$. We assume that an allele either will be present or won't be present in the mixture profile. These probabilities depend on the known genotypes $g_1, \dots, g_n$ for the mixture, the probability for the allele to drop out and the probability of dropping in, $c$. The probability for an allele $a$ not to be present in the mixture $M$ is described by Equation (4), and the probability for an allele to be present in the mixture is 1 minus the probability described by Equation (4).

$$P(a \notin M \mid g_1, \dots, g_n) = (1 - c\,p_a) \prod_{i=1}^{n} d_i^{\,n_i(a)} \tag{4}$$

Where $0^0 = 1$ by definition. In order to not observe an allele $a$ in the mixture, the allele must not have dropped in. This is described in Equation (4) by $(1 - c\,p_a)$. Furthermore, if allele $a$ was present in the genotype of a donor, it must have dropped out. This happens for person $i$ with probability $d_i^{\,n_i(a)}$, where $d_i$ is the drop-out probability and $n_i(a) \in \{0, 1, 2\}$ is the number of copies of allele $a$ that donor $i$ has. In Mathematica, the mixtures are created by the use of Equation (4).
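The simulation step can be illustrated with a short Python sketch. It assumes that Equation (4) has the form $(1 - c\,p_a)\prod_i d_i^{n_i(a)}$, with $c$ the drop-in parameter, $p_a$ the allele frequency, $d_i$ the drop-out probability of contributor $i$ and $n_i(a)$ the number of copies of allele $a$ in the genotype of contributor $i$. The function names and the example frequencies are hypothetical and this is not the notebook code.

```python
import random

def prob_allele_absent(allele, genotypes, dropout, drop_in, freqs):
    """Assumed form of Equation (4): (1 - c * p_a) * prod_i d_i ** n_i(a)."""
    prob = 1.0 - drop_in * freqs[allele]
    for genotype, d in zip(genotypes, dropout):
        prob *= d ** genotype.count(allele)  # Python evaluates 0 ** 0 as 1
    return prob

def simulate_replicate(genotypes, dropout, drop_in, freqs, rng):
    """Decide for every ladder allele whether it shows up in the replicate."""
    observed = set()
    for allele in freqs:
        absent = prob_allele_absent(allele, genotypes, dropout, drop_in, freqs)
        if rng.random() >= absent:
            observed.add(allele)
    return observed

# Hypothetical one-locus example: two contributors, drop-out vector (0.1, 0.5).
freqs = {"12": 0.2, "13": 0.5, "14": 0.3}
contributors = [("12", "13"), ("13", "13")]
rng = random.Random(42)
print(simulate_replicate(contributors, (0.1, 0.5), 0.05, freqs, rng))
```

Calling `simulate_replicate` several times with the same contributors gives multiple replicates of one mixture.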

In order to calculate a likelihood ratio, the probability to observe the alleles $A \subseteq \{a_1, \dots, a_m\}$ at a locus in the mixture has to be calculated. For this purpose, Equation (5) is used. The probability depends on the known genotypes, the probability for the allele to drop out and the probability of dropping in, $c$. For the drop-out probability, $d_{LR}$ is used, where $d_{LR}$ denotes the drop-out vector used for the calculation of the likelihood ratio. In the optimal situation, $d_{LR} = d_{mix}$ holds, where $d_{mix}$ is the drop-out vector the mixture was made with.

$$P(M = A \mid g_1, \dots, g_n) = \prod_{a \in A} P(a \in M \mid g_1, \dots, g_n) \prod_{a \notin A} P(a \notin M \mid g_1, \dots, g_n) \tag{5}$$

For a mixture, multiple replicates can be made. For each replicate, Equation (5) can be used to determine $P(M_j = A_j \mid g_1, \dots, g_n)$, where $M_j$ is the $j$th replicate. Because each replicate is conditionally independent of all other replicates made, multiple replicates of a mixture can be analyzed with the use of Equation (6).

$$P(M_1, \dots, M_r \mid g_1, \dots, g_n) = \prod_{j=1}^{r} P(M_j \mid g_1, \dots, g_n) \tag{6}$$
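A minimal Python sketch of Equations (5) and (6) is given below. It assumes that the per-allele absence probability of Equation (4) has the form $(1 - c\,p_a)\prod_i d_i^{n_i(a)}$; the function names and example values are hypothetical.

```python
def prob_allele_absent(allele, genotypes, dropout, drop_in, freqs):
    """Assumed form of Equation (4): (1 - c * p_a) * prod_i d_i ** n_i(a)."""
    prob = 1.0 - drop_in * freqs[allele]
    for genotype, d in zip(genotypes, dropout):
        prob *= d ** genotype.count(allele)  # Python evaluates 0 ** 0 as 1
    return prob

def prob_replicate(observed, genotypes, dropout, drop_in, freqs):
    """Equation (5): product over observed and unobserved ladder alleles."""
    prob = 1.0
    for allele in freqs:
        absent = prob_allele_absent(allele, genotypes, dropout, drop_in, freqs)
        prob *= (1.0 - absent) if allele in observed else absent
    return prob

def prob_replicates(replicates, genotypes, dropout, drop_in, freqs):
    """Equation (6): replicates are conditionally independent."""
    prob = 1.0
    for observed in replicates:
        prob *= prob_replicate(observed, genotypes, dropout, drop_in, freqs)
    return prob

# Hypothetical one-locus example with three replicates.
freqs = {"12": 0.2, "13": 0.5, "14": 0.3}
genotypes = [("12", "13"), ("13", "13")]
replicates = [{"12", "13"}, {"13"}, {"13", "14"}]
print(prob_replicates(replicates, genotypes, (0.2, 0.6), 0.05, freqs))
```

Here the drop-out vector passed to the functions plays the role of the vector used for the LR calculation, which may differ from the vector the mixture was made with.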

Imagine a mixture $M$ with $n$ unknown contributors, $U_1, \dots, U_n$. There is a database profile (profile $g_S$) of a person $S$ in the database for which we want to check whether he is a donor of the mixture. At first, we will focus on one locus. All observed alleles in the mixture at this locus are listed and an additional Q allele is added. The Q allele is introduced by Gill, Kirkham and Curran (2007) and it accounts for an allele of any type which is not already observed in the mixture profile. The allele frequency for the Q allele is described by $p_Q = 1 - \sum_{a \in A} p_a$, where $p_a$ are the frequencies of the alleles observed in the mixture at the particular locus. All possible combinations of 2 alleles are generated from the list of observed alleles and the Q allele. This list of possible combinations ($G$) represents the possible genotype elements that could be a contributor to this mixture for the particular locus. The probability of observing this mixture for this locus given the prosecutor's hypothesis can be described by Equation (7).

$$P(M \mid H_p) = \sum_{g_2, \dots, g_n \in G} P(M \mid g_S, g_2, \dots, g_n)\, P(g_2, \dots, g_n \mid g_S) \tag{7}$$

Here $g_S$ replaces the genotype of the unknown contributor $U_1$, and $g_2, \dots, g_n$ are variables taking values in the set $G$. Since $U_2, \dots, U_n$ are all assumed to be unrelated to $S$, they are independent of the profile $g_S$. They are also independent of each other, and therefore $P(g_2, \dots, g_n \mid g_S) = P(g_2) \cdots P(g_n)$ holds. These probabilities can be calculated with the population frequencies. $P(M \mid g_S, g_2, \dots, g_n)$ is calculated with the use of Equation (5). Since all loci are assumed to be independent, these probabilities can be determined for each locus and multiplied in order to find $P(M \mid H_p)$ for the complete profile. By dividing this by the probability of observing the mixture given the defense hypothesis, we obtain the likelihood ratio.
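The one-locus calculation described above can be sketched in Python for the victimless two-person case (Hp: database person + one unknown, Hd: two unknowns). This is a simplified sketch, not the notebook: all unobserved alleles are collapsed into a single allele labelled `"Q"`, the per-allele absence probability is assumed to be $(1 - c\,p_a)\prod_i d_i^{n_i(a)}$, and all names and example values are hypothetical.

```python
from itertools import combinations_with_replacement, product

def prob_absent(allele, genotypes, dropout, drop_in, p):
    """Assumed form of Equation (4): (1 - c * p_a) * prod_i d_i ** n_i(a)."""
    prob = 1.0 - drop_in * p[allele]
    for genotype, d in zip(genotypes, dropout):
        prob *= d ** genotype.count(allele)  # Python evaluates 0 ** 0 as 1
    return prob

def prob_replicates(replicates, genotypes, dropout, drop_in, p):
    """Equations (5) and (6): product over alleles and over replicates."""
    prob = 1.0
    for observed in replicates:
        for allele in p:
            absent = prob_absent(allele, genotypes, dropout, drop_in, p)
            prob *= (1.0 - absent) if allele in observed else absent
    return prob

def genotype_prior(genotype, p):
    """Hardy-Weinberg frequency of an unordered genotype."""
    a, b = genotype
    return (1 if a == b else 2) * p[a] * p[b]

def likelihood(replicates, known, n_unknown, dropout, drop_in, p):
    """Sum over all genotype combinations of the unknown contributors."""
    candidates = list(combinations_with_replacement(sorted(p), 2))
    total = 0.0
    for unknowns in product(candidates, repeat=n_unknown):
        prior = 1.0
        for genotype in unknowns:
            prior *= genotype_prior(genotype, p)
        genotypes = list(known) + list(unknowns)
        total += prior * prob_replicates(replicates, genotypes, dropout, drop_in, p)
    return total

def locus_lr(replicates, db_genotype, dropout, drop_in, freqs):
    """LR at one locus for a database genotype, two-person victimless case."""
    observed = set().union(*replicates)
    p = {a: freqs[a] for a in observed}
    p["Q"] = 1.0 - sum(p.values())  # frequency of the Q allele
    s = tuple(a if a in observed else "Q" for a in db_genotype)
    hp = likelihood(replicates, [s], 1, dropout, drop_in, p)
    hd = likelihood(replicates, [], 2, dropout, drop_in, p)
    return hp / hd

freqs = {"12": 0.2, "13": 0.5, "14": 0.3}
replicates = [{"12", "13"}, {"13"}]
print(locus_lr(replicates, ("12", "13"), (0.3, 0.3), 0.05, freqs))
```

Multiplying `locus_lr` over all loci gives the LR for a complete database profile. The enumeration grows quickly with the number of unknown contributors, which is why it is only practical for small numbers of profiles.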

2.3 The DNA database

In order to explore the limits of performing a database search on mixtures, a large DNA database had to be simulated and likelihood ratios for the individuals in the database had to be calculated. To simulate the Dutch DNA database, a database is created by the random generation of 200,000 profiles, of which 100,000 are SGM+ profiles and 100,000 are NGM profiles.

At first, $P(M \mid H_p)$ was calculated separately for each profile in the database. This approach is computationally intensive and therefore time-consuming to apply to many random profiles. The likelihood ratio calculations for the relative profiles could be performed in this way because only a small number of relative profiles are added to the database, but to be able to perform calculations on a database of 200,000 random individuals in a reasonable time, another approach was necessary.

In this new approach, we used the probability distribution for the likelihood ratio under $H_d$, described by Slooten and Egeland (2013) and Dorum et al (2014). For a mixture, all possible genotypes that could be the donor of this mixture are listed in the same way as described above. Per locus, the LR for every possible genotype element as well as the probability for the genotype element to occur in the population is calculated with the use of the frequencies. An LR for a full profile can now be calculated by multiplying the LRs for the genotype elements per locus. The probability for this LR to occur in the population is the product of the probabilities for the genotype elements to occur in the population. With this information, $P(LR = x \mid H_d)$ is known and we have a probability distribution for the LRs under $H_d$ which can be used to sample LRs for full random profiles. This method is much less time-consuming, and by implementing it, the likelihood ratios for the 200,000 random individuals can be determined in a reasonable time.
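The sampling idea can be sketched in Python as follows. For every locus, the sketch assumes a table with the LR of each possible genotype element and its population probability; an LR for a random profile is then obtained by drawing one genotype element per locus and multiplying the LRs. The tables in `PER_LOCUS` are invented for illustration (chosen so that the expected LR per locus is 1) and the function names are hypothetical.

```python
import random
from itertools import product

# Hypothetical per-locus tables: (LRs of the genotype elements,
# population probabilities of those elements).
PER_LOCUS = [
    ([0.0, 1.0, 4.0], [0.6, 0.2, 0.2]),
    ([0.5, 2.0], [2.0 / 3.0, 1.0 / 3.0]),
]

def sample_profile_lr(per_locus, rng):
    """Sample the LR of one random (unrelated) profile."""
    lr = 1.0
    for lrs, probs in per_locus:
        lr *= rng.choices(lrs, weights=probs, k=1)[0]
    return lr

def exact_lr_distribution(per_locus):
    """Enumerate all locus combinations (only feasible for a few loci)."""
    dist = {}
    tables = [list(zip(lrs, probs)) for lrs, probs in per_locus]
    for combo in product(*tables):
        lr, prob = 1.0, 1.0
        for x, p in combo:
            lr *= x
            prob *= p
        dist[lr] = dist.get(lr, 0.0) + prob
    return dist

rng = random.Random(2014)
sampled = [sample_profile_lr(PER_LOCUS, rng) for _ in range(10)]
print(sampled)
print(exact_lr_distribution(PER_LOCUS))
```

Sampling avoids the full enumeration in `exact_lr_distribution`, which becomes infeasible when many loci are combined.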


2.4 Familial search

In familial searching, the following set of hypotheses is compared:

Hp: A relative of the individual in the database is a contributor to the mixture.

Hd: An unknown is a contributor to the mixture.

The calculation of the LR in the notebook had to be adjusted to be able to perform familial search. Imagine again a mixture $M$ with $n$ unknown contributors $U_1, \dots, U_n$. We have a database profile ($g_S$) of the person $S$ in the database for which we want to check whether he is a relative of contributor $U_1$. For one locus, the probability of observing this mixture given the prosecutor's hypothesis can be described by Equation (8).

$$P(M \mid H_p) = \sum_{g_1, \dots, g_n \in G} P(M \mid g_1, \dots, g_n)\, P(g_1, \dots, g_n \mid g_S) \tag{8}$$

This equation looks a lot like Equation (7), only in this case we assume a relationship between $g_1$ and $g_S$ and therefore $g_1$ is not replaced by $g_S$. $P(M \mid g_1, \dots, g_n)$ is calculated with the use of Equation (5). For $P(g_1, \dots, g_n \mid g_S)$, we apply largely the same method as described for Equation (7). Since $U_2, \dots, U_n$ are assumed to be unrelated to $S$, they are independent of the profile $g_S$ and therefore the probabilities $P(g_2, \dots, g_n \mid g_S) = P(g_2) \cdots P(g_n)$ can again be calculated with the normal population frequencies. But in this case, for $g_1$, a relationship with $g_S$ is assumed and therefore the probability $P(g_1 \mid g_S)$ has to be calculated with a different formula.

Fung and Hu (2008) describe formulas to calculate the probability of observing a genotype from an unknown, $(a, b)$, given a genotype of a known, $(c, d)$, and the hypothesis that they are related according to a relatedness coefficient vector $\kappa = (\kappa_0, \kappa_1, \kappa_2)$, where $\kappa_i$ is the probability of sharing $i$ alleles IBD. This equation is shown in (9).

$$P\big((a,b) \mid (c,d), \kappa\big) = (2 - \delta_{ab}) \Big( \kappa_0\, p_a p_b + \frac{\kappa_1}{4} \big( p_a \delta_{bc} + p_a \delta_{bd} + p_b \delta_{ac} + p_b \delta_{ad} \big) + \frac{\kappa_2}{2} \big( \delta_{ac} \delta_{bd} + \delta_{ad} \delta_{bc} \big) \Big) \tag{9}$$

$\delta$ denotes the Kronecker delta function, as defined in (3). With the use of Equation (9), $P(g_1 \mid g_S)$ can be calculated and by using Equation (8), $P(M \mid H_p)$ is found. By dividing the obtained probability by the probability of observing the mixture given the defense hypothesis, we obtain a likelihood ratio for a familial search.
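The conditional genotype probability of Equation (9) can be sketched in Python as below. The sketch assumes the coefficients are given as (probability of 0, 1, 2 alleles IBD) and writes the formula as a weighted sum of the unrelated, one-IBD and two-IBD cases; the function names and example frequencies are hypothetical.

```python
def delta(x, y):
    """Kronecker delta."""
    return 1 if x == y else 0

def prob_genotype_given_relative(g_unknown, g_known, kappa, freqs):
    """P(unordered genotype (a, b) | relative has genotype (c, d), kappa)."""
    a, b = g_unknown
    c, d = g_known
    k0, k1, k2 = kappa
    pa, pb = freqs[a], freqs[b]
    no_ibd = k0 * pa * pb
    one_ibd = k1 / 4 * (pa * (delta(b, c) + delta(b, d))
                        + pb * (delta(a, c) + delta(a, d)))
    two_ibd = k2 / 2 * (delta(a, c) * delta(b, d) + delta(a, d) * delta(b, c))
    return (2 - delta(a, b)) * (no_ibd + one_ibd + two_ibd)

# Hypothetical one-locus example: child genotype given a parent genotype.
freqs = {"12": 0.2, "13": 0.5, "14": 0.3}
parent_child = (0.0, 1.0, 0.0)
print(prob_genotype_given_relative(("12", "12"), ("12", "14"), parent_child, freqs))
```

For an unrelated pair, with coefficients (1, 0, 0), the function reduces to the Hardy-Weinberg genotype frequency.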

2.5 Determination of the true positive rate and false positive rate

In this research, it is also of interest to look at the false positive and true positive rates of a familial search at certain thresholds for the LR. A Mathematica notebook is created which is used to calculate the probability that a relative has an LR above some threshold and to determine how many false positives can be expected for this threshold. In order to do this, we used the probability distributions for the likelihood ratio under $H_p$ and $H_d$. In section 2.3 of this chapter, the probability distribution for the likelihood ratio under $H_d$ is used to select LRs for random profiles.


Slooten and Meester (2014) prove that with the use of the distribution for the LR under the defense hypothesis, the distribution for the LR under the prosecutor's hypothesis can easily be obtained by using Equation (10).

$$P(LR = x \mid H_p) = x \cdot P(LR = x \mid H_d) \tag{10}$$

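An estimate of the probability $P(LR > t \mid H_p)$ for some threshold $t$ is obtained by taking a large sample of LRs, $x_1, \dots, x_N$, from the $P(LR = x \mid H_p)$ distribution. From this sample, the number of LRs higher than $t$ is determined. Dividing this number by the sample size $N$ will result in the fraction of LRs higher than $t$, the true positive rate for this threshold $t$. Formally, this is described with Equation (11).

$$P(LR > t \mid H_p) \approx \frac{1}{N} \sum_{i=1}^{N} 1_{\{x_i > t\}} \tag{11}$$

Where $1_{\{x > t\}}$ is the indicator function indicating membership in set $\{x > t\}$:

$$1_{\{x > t\}} = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } x \leq t \end{cases} \tag{12}$$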
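Equations (10) and (11) can be illustrated with the following Python sketch. It assumes a discrete LR distribution under Hd is available as a list of LR values with their probabilities; the values in `LRS` and `PROBS_HD` are invented (chosen so that the expected LR under Hd is 1) and the function names are hypothetical.

```python
import random

def hp_distribution(lrs, probs_hd):
    """Equation (10): P(LR = x | Hp) = x * P(LR = x | Hd)."""
    return [x * p for x, p in zip(lrs, probs_hd)]

def true_positive_rate(sample, threshold):
    """Equation (11): fraction of sampled LRs above the threshold."""
    return sum(1 for x in sample if x > threshold) / len(sample)

# Hypothetical discrete LR distribution under Hd.
LRS = [0.0, 0.5, 2.0, 8.0]
PROBS_HD = [0.6, 0.4 / 3.0, 0.2, 0.2 / 3.0]

probs_hp = hp_distribution(LRS, PROBS_HD)
rng = random.Random(7)
sample = rng.choices(LRS, weights=probs_hp, k=10000)
print(true_positive_rate(sample, 1.0))
```

In this toy distribution the exact true positive rate at threshold 1 is 0.4 + 8 * 0.2 / 3, roughly 0.93, so the printed estimate should be close to that value.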

Applying the same procedure to determine the false positive rate is more difficult. The event of a high LR for a random person is a rare event with a small probability. The sample taken from the distribution should be very large in order to determine this probability in the way described above.

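To estimate the false positive rate, Kruijver (2014) suggests the use of importance sampling. In importance sampling, the distribution of the LR under $H_p$ is used to estimate the false positive rate, $P(LR > t \mid H_d)$. Again, we take a large sample of LRs, $x_1, \dots, x_N$, from the $P(LR = x \mid H_p)$ distribution. The importance sampling estimator is given by Equation (13).

$$P(LR > t \mid H_d) \approx \frac{1}{N} \sum_{i=1}^{N} 1_{\{x_i > t\}}\, w(x_i) \tag{13}$$

With:

$$w(x) = \frac{P(LR = x \mid H_d)}{P(LR = x \mid H_p)} \tag{14}$$

In Equation (10), we have seen that $P(LR = x \mid H_p) = x \cdot P(LR = x \mid H_d)$, and therefore we can state that $w(x) = 1/x$, resulting in Equation (15).

$$P(LR > t \mid H_d) \approx \frac{1}{N} \sum_{i=1}^{N} 1_{\{x_i > t\}}\, \frac{1}{x_i} \tag{15}$$

With the use of Equation (15), the false positive rate can be estimated without having to take a very large sample.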
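The importance sampling estimate of Equation (15) can be sketched in Python as below. The sketch assumes a positive threshold and LRs sampled under Hp, reweighted by 1/x; the discrete distribution in `LRS` and `PROBS_HD` is invented for illustration and the function name is hypothetical.

```python
import random

def false_positive_rate(sample_hp, threshold):
    """Equation (15): mean of 1{x > t} / x over LRs sampled under Hp."""
    return sum(1.0 / x for x in sample_hp if x > threshold) / len(sample_hp)

# Hypothetical discrete LR distribution under Hd (expected LR equals 1).
LRS = [0.0, 0.5, 2.0, 8.0]
PROBS_HD = [0.6, 0.4 / 3.0, 0.2, 0.2 / 3.0]

# Distribution under Hp via P(LR = x | Hp) = x * P(LR = x | Hd).
probs_hp = [x * p for x, p in zip(LRS, PROBS_HD)]

rng = random.Random(11)
sample = rng.choices(LRS, weights=probs_hp, k=10000)
exact = sum(p for x, p in zip(LRS, PROBS_HD) if x > 1.0)
print(false_positive_rate(sample, 1.0), exact)
```

Because high LRs are common under Hp, every sampled value contributes to the estimate, whereas direct sampling under Hd would rarely produce an LR above a high threshold.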


Chapter 3 - Undesired high likelihood ratio for relatives

In some situations, a high likelihood ratio for a relative can result in problematic situations where the relative can be mistaken for being the donor of a mixture. Before mixtures can be used for a DNA database search, these problematic situations have to be studied. This chapter is divided in two sections: the study into the likelihood ratios for relatives, the donor and random persons for different complex mixtures (3.1) and the study into using a complex mixture for a database search in case relatives are present in the database (3.2).

3.1 Likelihood ratios for relatives

In this first part, I examine what LRs for relatives can be expected for different kinds of complex mixtures in order to answer research question one: What likelihood ratios in favor of being the donor do relatives of the donor have for different complex mixtures? This is tested with the use of three experiments. In the first experiment, the LR distributions for the different relatives and for different drop-out vectors are studied. In the second experiment, the influence of the number of replicates is tested. The third experiment focuses on what happens when the drop-out used for the LR is not the same as the drop-out of the mixture, that is, when the drop-out estimation is wrong. For each experiment, I will discuss the experimental design used and the results obtained.

Experiment 1.1 - Testing the effect of the drop-out vector on the likelihood ratio.

Experimental design

Three replicates of a two-person DNA mixture with a certain drop-out vector (d1, d2) and drop-in are made. For the first donor, a relative is created and the LR for this relative being a contributor to the mixture is calculated. This is repeated 1000 times per drop-out vector per relative. The LR in this experiment is calculated with the same drop-out as the drop-out the mixture is made with. Both the hypothesis sets of the victimless case and the victim case are tested, and the exact drop-out values used are shown in Tables 1A and 2A respectively, which can be found in Appendix A. By comparing the results, the effect of the drop-out probabilities on the likelihood ratios for different relatives is studied.

Results

For single source traces, it is known that the donor has on average the highest LR for being the donor of the mixture. Siblings and children have on average approximately the same LR, which is much lower than the LR for the real donor, and half-siblings have a lower LR than siblings and children. A randomly generated profile has on average the lowest likelihood ratio for being the donor of the mixture. This is logical, since siblings and children share more alleles identical by descent (IBD) than half-siblings, and half-siblings share more alleles IBD than unrelated persons.

Victimless case

For mixtures in which none of the contributors is known, the results show that the higher the drop-out probability, the higher the LR for the relative being the donor of the mixture. This is clearly seen in Figure 1, which shows the paired histograms of the LR distributions for the donor, sibling and random person for the drop-out vector (0.1, 0) on the left-hand side and the drop-out vector (0.8, 0) on the right-hand side. The figure shows that when the drop-out increases, the LRs for the sibling and the random person increase while the LR for the real donor decreases. This shift of the LR distributions results in more overlap between the distributions, leading to more cases where the sibling has an LR as high as the real donor. The same phenomenon is observed for the relationships parent/child and half siblings. This is displayed in Figures 1B and 2B, which can be found in Appendix B.

Figure 1: Victimless case. Paired histogram showing the distributions of the donor, sibling and random person.

By comparing the paired histograms for the sibling, parent/child and half sibling, it can be seen that the sibling and the parent/child relationships are the most dangerous ones, because their distributions are the closest to the distribution of the real donor. Some overlap with the distribution of the real donor is observed, which means that in some cases these relatives can have an LR as high as the real donor. If the drop-out increases, this overlap increases and more dangerous situations can occur. The half sibling distribution and the distribution for the real donor do not overlap when the drop-out is (0.1, 0), but if the drop-out increases to (0.8, 0), this relationship shows overlap too.

We also recorded the percentage of relatives with an LR higher than 1000 and higher than 10000. It can be observed that these percentages increase when the drop-out probability increases. This is shown in Figure 2, which shows the percentage of parent/child relatives that have an LR higher than 1000 and higher than 10000. The drop-out for the second contributor is fixed at 0, and the drop-out for the first contributor is varied along the x-axis. The percentages increase up to a certain point, after which they drop again. This is because when the drop-out is too high, hardly any information can be obtained from the profile, resulting in low LRs.

When the drop-out probabilities are d1 = 0.1 and d2 = 0, only 0.3% of the siblings score higher than 1000. When the drop-out probabilities are d1 = 0.8 and d2 = 0, this percentage increases to 9%. The figures obtained for the parent/child relation and for the half sibling relation are shown in Figure 2B in Appendix B. These figures show the same overall development, although the percentages are lower. This is in line with the likelihood ratio distributions obtained before.

Victim case

When the second contributor is a victim whose profile is known, the same trend can be seen. The results show that the higher the drop-out probability, the higher the average LRs of the relatives are. Figure 3 shows the paired histograms of the distributions for the donor, sibling and random person for the drop-out vector (0.1, 0) on the left-hand side and the drop-out vector (0.8, 0) on the right-hand side, for the case in which a victim profile without drop-out is known.

Figure 3: Victim case. Paired histogram showing the distributions of the donor, sibling and random person.

By comparing Figure 3 with Figure 1, we can clearly see that the LR for the sibling in the victim case is much lower than the LR in the victimless case. Even when the drop-out increases, the distribution of the sibling is almost identical to the distribution for the random person and shows almost no overlap with the donor distribution. Figures 3B and 4B in Appendix B show the paired histograms for the parent/child and the half sibling relationships, respectively, for the victim case. For these data, the same observations can be made. The percentage of relatives having a likelihood ratio higher than 1000 or 10000 is always 0 when one contributor is known.

We can conclude that when one of the contributors is known to be a victim, far fewer dangerous situations occur than when none of the contributors is known. This was expected, since when a victim's profile is known, the profile of the other, unknown contributor can be partly deduced.

Experiment 1.2 - Testing the effect of the number of replicates on the likelihood ratio.

Experimental design

The experiments performed in Experiment 1.1 are now performed with the use of one replicate. This data is compared to the data obtained for three replicates in Experiment 1.1, in order to determine the influence of the number of replicates on the LRs for relatives.

Results

By direct comparison of the results obtained with one and with three replicates, we have found that the use of three replicates results in fewer dangerous situations. An example can be seen in Figure 4, which shows the distributions of the donor and the sibling for a mixture with drop-out probabilities d1 = 0.3 and d2 = 0.5. When using three replicates, the LRs that siblings obtain for being the donor of this mixture are considerably lower, and the LRs that donors obtain are considerably higher, than when using one replicate. This results in less overlap between the two distributions, which means that there are fewer cases where the sibling has an LR as high as a real donor. This trend is observed for all mixtures and for all relationships. It is therefore recommended to use three replicates when a mixture is used for a database search, especially in cases where the drop-out is quite high.

Figure 4: Distributions for the donor. One replicate (left) vs. three replicates (right) for a mixture with drop-out (0.3, 0.5).

Experiment 1.3 - Testing the effect of using a wrong drop-out parameter for the likelihood ratio.

In this experiment, the drop-out used for the calculation of the LR is not the same as the drop-out used for the creation of the mixture. In real life, this is often the case because the true drop-out probability of the mixture is not known and has to be assessed. In this experiment, the drop-out vector used for the calculation of the LR is determined by means of the number of mismatches between the mixture profile and the profile of a suspect. It may in some cases be tempting to use this method in casework when a suspect profile is available, and it is therefore interesting to study the consequences for the LR.

A mixture with drop-out, drop-in and three replicates is simulated. A relative of the first donor is created and this profile is used as the profile of a suspect. The number of mismatches between the profile of this relative and each of the separate replicates is counted and divided by the total number of alleles (90 alleles: an NGM profile with 15 loci and 2 alleles per locus, compared against 3 replicates). The resulting fraction is the drop-out probability used for the calculation of the LR. The LR is calculated for the hypotheses:

Hp: Suspect + unknown contributed to the mixture.

Hd: Two unknowns contributed to the mixture.
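The mismatch-counting estimate described above can be written down as a minimal Python sketch. The data layout (a dictionary of loci for the suspect and one dictionary of observed allele sets per replicate) and all names are assumptions made for illustration; they are not taken from the Mathematica code used in this project.

```python
def estimate_dropout_by_mismatches(suspect_profile, replicates):
    """Fraction of the suspect's alleles that are missing from the replicates.

    suspect_profile: dict mapping locus name -> tuple of two alleles.
    replicates: list of dicts mapping locus name -> set of observed alleles.
    Every allele of the suspect is checked against every replicate, so a
    15-locus profile and 3 replicates give 15 * 2 * 3 = 90 comparisons.
    """
    mismatches = 0
    total = 0
    for replicate in replicates:
        for locus, alleles in suspect_profile.items():
            observed = replicate.get(locus, set())
            for allele in alleles:
                total += 1
                if allele not in observed:
                    mismatches += 1
    return mismatches / total


# Hypothetical data: 15 loci, suspect is 10/12 at every locus.
suspect = {f"locus{i}": (10, 12) for i in range(15)}
full = {locus: {10, 12, 14} for locus in suspect}      # both suspect alleles seen
partial = {locus: {10, 14} for locus in suspect}       # allele 12 missing everywhere
estimated_dropout = estimate_dropout_by_mismatches(suspect, [full, full, partial])
print(estimated_dropout)  # 15 mismatches out of 90 comparisons
```

Note that the count treats every unexplained allele of the suspect as drop-out, whether or not the suspect actually contributed, which is one reason the estimate can be off when the suspect is a relative rather than the donor.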


The drop-out values used for the mixtures are shown in Table 3A, which can be found in Appendix A. By comparing these data to the data obtained in Experiment 1.1, where the LR is calculated with the true drop-out of the mixture, for the same hypotheses and drop-out rates, the effect of this drop-out estimation method can be studied.

Results

In general, when the drop-out probability is assessed as described above, the drop-out is overestimated when the true drop-out of the mixture is low and underestimated when it is high. This can be seen in Figure 5, which shows the estimated drop-out on the y-axis and the real drop-out of the mixture on the x-axis. The gray line represents the ideal situation, in which the estimated drop-out used for the LR calculation is the same as the drop-out of the mixture. We see that for all the relatives, the donor and the random persons, the results deviate from this line. For the donor, the drop-out is always underestimated. For the relatives, the estimated drop-out is quite constant while the drop-out of the mixture is varied from 0.1 to 0.8.

Figure 5: Victimless case. Drop-out estimation of the drop-out for the first contributor.

By comparing the results with the results in experiment 1.1, it is observed that in case the drop-out probability is overestimated, the LRs for the relatives are higher. In case the drop-out probability is under-estimated, the calculated LRs for the relatives are lower than they should be.

Figure 6 shows the cumulative distributions of the likelihood ratios of siblings for the mixture with drop-out (0.1, 0). The gray distribution represents the case where the LR is calculated with the true drop-out of the mixture, and the black distribution is obtained when the drop-out is estimated as described above, so that the drop-out used for the LR differs from the drop-out of the mixture.

For a mixture with drop-out (0.1, 0), we have seen that the drop-out for the first contributor is overestimated. In Figure 6, we can see that this results in higher LRs than when the true drop-out is used.

Figure 6: Drop-out estimation. Cumulative distributions for siblings for a mixture with drop-out (0.1, 0).

Figure 7 shows the cumulative distributions when a mixture with drop-out (0.8, 0) is used. In this figure we can clearly see that with a drop-out of (0.8, 0), higher LRs are obtained than with a drop-out of (0.1, 0). In Figure 5 we have seen that for a drop-out of (0.8, 0), the drop-out is underestimated by the estimation method. In Figure 7, we can clearly see that this underestimation of the drop-out results in lower likelihood ratios.

The mismatch-counting method between the suspect profile and the mixture gives relatively good results only when the drop-out of the mixture is around 0.5. When the drop-out probability is expected to be lower or higher, this estimation method will not lead to reliable results.

3.2 Database search

In this section, I study whether complex mixtures can be used for a DNA database search, especially in cases where relatives are present in the database. For this purpose, it is interesting to look at the position relatives obtain in the database ranking for different drop-out values, and whether a relative in the database can be mistaken for the donor of a mixture. With this experiment, I try to answer research question two: When the DNA database is used to search for donors of a mixture, how high will relatives score? This is tested in two experiments. In the first experiment, I tested mixtures with different drop-out vectors, with the LR calculated using the same drop-out as the mixture. In the second experiment, I tested what happens if the drop-out used for the likelihood ratio is wrongly estimated, so that it differs from the drop-out of the mixture. For each type of mixture, two probabilities are determined: the probability that a relative has an LR higher than the LR of the donor, and the probability that a relative ranks above all unrelated persons in the database. The latter probability indicates how dangerous it is if the database does not contain the real donor but only a relative of the donor.

Experiment 2.1 - Testing the effect of the drop-out vector on the database search when the LR is calculated with the drop-out of the mixture

For each mixture made with three replicates, a database consisting of LRs for random persons is created by sampling from the LR distribution as described in section 2.3. This database consists of 200,000 LRs, of which 100,000 belong to profiles typed with the NGM kit and 100,000 to profiles typed with the SGM+ kit. Additionally, 500 NGM siblings, 500 SGM+ siblings, 500 NGM children, 500 SGM+ children, 500 NGM half-siblings and 500 SGM+ half-siblings of the first donor are added to the database with their LR for this mixture. A database search is performed for different mixtures and the profiles with an LR higher than one are selected, since these profiles are considered to be of interest in a database search. The selected LRs are sorted and the ranking is recorded. For every drop-out vector, this is repeated 10 times. The exact drop-out vectors used are described in Table 4A, which can be found in Appendix A. Both hypothesis sets, of the victim case and the victimless case, are tested. By comparing the results, the effect of increased drop-out probabilities on the outcome of the database search is studied.
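The selection and ranking step described above, together with the two probabilities of interest, can be sketched in Python as follows. The LR values are invented for illustration and the function names are hypothetical; the sketch only shows the bookkeeping, not the LR calculation itself.

```python
def rank_candidates(labelled_lrs, minimum_lr=1.0):
    """Keep (label, lr) pairs with lr > minimum_lr and sort them from high to low."""
    selected = [(label, lr) for label, lr in labelled_lrs if lr > minimum_lr]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


def fraction_above(values, reference):
    """Fraction of values that are strictly greater than the reference value."""
    return sum(1 for value in values if value > reference) / len(values)


# Hypothetical LRs for one simulated mixture.
donor_lr = 5.0e6
sibling_lrs = [2.0e3, 8.0e6, 0.4, 150.0]
unrelated_lrs = [0.001, 3.0, 900.0, 0.2, 12.0]

database = (
    [("donor", donor_lr)]
    + [("sibling", lr) for lr in sibling_lrs]
    + [("unrelated", lr) for lr in unrelated_lrs]
)

ranking = rank_candidates(database)

# Probability that a sibling outranks the donor.
p_above_donor = fraction_above(sibling_lrs, donor_lr)
# Probability that a sibling outranks every unrelated person,
# which matters when the donor is absent from the database.
p_above_unrelated = fraction_above(sibling_lrs, max(unrelated_lrs))

for position, (label, lr) in enumerate(ranking, start=1):
    print(position, label, lr)
```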

Figure 7: Drop-out estimation. Cumulative distributions for siblings for a mixture with drop-out (0.8, 0).

Results

Victimless case

When all contributors of the mixture are unknown, a higher drop-out probability results in higher LRs for the relatives and the random persons. This was already observed in earlier experiments.

When the drop-out increases, it is observed that more relatives have an LR higher than the SGM+ donor. This is most common for NGM relatives, mainly siblings, followed by children. Half-siblings very rarely have an LR higher than the SGM+ donor, approximately on the same scale as random persons. An example of a database search result can be seen in Table 2.

Table 2: Database search. Results of a database search on mixture (0.1, 0.1) (left) and (0.5, 0.5) (right).

We can see that more relatives end up higher than the real donor typed with SGM+ if the drop-out is (0.5, 0.5) (right) than if the drop-out is lower, (0.1, 0.1) (left). For example, for siblings typed with the NGM kit and a drop-out vector of (0.1, 0.1), 0.08% of the siblings have a higher LR than the SGM+ donor, while with a drop-out vector of (0.5, 0.5), this percentage is 1.02%. If the donor is typed with the NGM kit, only 0.02% of the siblings have an LR higher than the LR of the real donor for both mixtures.

It is also observed that when the drop-out probability increases, the probability that the LR of a relative is higher than the LRs of all unrelated persons in the database also increases. In that case, if the relative is present in the database and the donor is not, the relative ends up at the first position in the database ranking. The percentages of relatives that have an LR higher than all random persons are shown in Table 3.

Drop-out            (0.1, 0.1)    (0.5, 0.5)
Sibling NGM         0.86%         4.2%
Child NGM           0.1%          2%
Half Sibling NGM    0%            0.04%

Table 3: Victimless case. Percentage of relatives with LR higher than the LR of the highest random person in database.

In Table 3, we can see that for a mixture with drop-out (0.5, 0.5), 4.2% of the siblings end up at the first position in the database search if the donor is not present. For children this is 2%, and for half siblings only 0.04%. These percentages seem quite low, but these are the situations that can cause trouble in a database search. Therefore, if a database search is performed, these probabilities have to be kept in mind.


Victim case

In victim cases, the same phenomena are observed. If the drop-out increases, the percentage of relatives ending up before the donor typed with the SGM+ kit increases a little, but remains very low. For example, for a mixture with drop-out (0.1, 0), 0% of the NGM siblings are ranked before the real donor typed with the SGM+ kit. For a mixture with drop-out (0.5, 0), this is 0.2%. For siblings typed with the SGM+ kit, 0% end up before the real donor typed with SGM+ for a mixture with drop-out (0.1, 0), and 0.1% for a drop-out of (0.5, 0). These results suggest that mixtures where one contributor is known are very well suited for a DNA database search when the real donor is assumed to be present in the database.

What happens when the donor is not present in the database? Again we observe that when the drop-out probability increases, the probability that the LR of a relative is higher than the likelihood ratios of all random persons in the database also increases. The percentages of relatives having a likelihood ratio higher than the highest unrelated person are shown in Table 4. Again, these percentages seem low, but they can lead to dangerous situations where a relative is mistaken for the donor of a mixture.

Drop-out            (0.1, 0)      (0.5, 0)
Sibling NGM         0.22%         4.4%
Child NGM           0%            1.2%
Half Sibling NGM    0%            0.02%

Table 4: Victim case. Percentage of relatives with LR higher than LR of the highest random person in database.

Experiment 2.2 - Testing the effect of the drop-out vector on the database search when the LR is calculated with a different drop-out than that of the mixture

In this experiment, the situation is tested in which the likelihood ratio is not calculated with the same drop-out as the drop-out of the mixture. These are the cases where the drop-out is overestimated or underestimated. Additionally, an interesting question is what happens if the drop-out of the mixture is different for the two donors while the drop-out used for the calculation of the likelihood ratio is taken to be uniform for both donors. To study this, the same experimental design as described in Experiment 2.1 is used, with the drop-out vectors described in Table 5A, Appendix A. The database search is performed for the hypotheses of the victimless case. By comparing the results, we try to see what effect a wrong estimation has on the database search.

Results

In Experiment 1.3 we have seen that overestimation of the drop-out results in higher LRs. This is again observed when performing a database search. If we look at the probability that a relative has an LR higher than the real donor, no clear effect is observed. This is shown in Table 5. We can see that the probability for a sibling typed with NGM increases when the drop-out is overestimated, from 4/5000 to 7/5000, but for a sibling typed with the SGM+ kit, this probability decreases to 0.


Table 5: Over-estimation of the drop-out. Probability that a relative is ranked higher than the donor typed with SGM+.

If we look at the percentage of relatives having an LR higher than the LR of the highest random person, we see an increase in percentages. These values are shown in Table 6A in Appendix A.

If the drop-out is underestimated, the LRs decrease, as observed in earlier experiments. The probability that a relative has a likelihood ratio higher than the real donor typed with SGM+ decreases, which can be seen in Table 6.

Table 6: Under-estimation of the drop-out. Probabilities relative higher than donor typed with SGM+.

If we look at the percentage of relatives having an LR higher than the LR of the highest random person, we see a decrease. These percentages are shown in Table 7A in Appendix A.

When the drop-out is estimated as a uniform drop-out while it is not (drop-out vectors 5, 6, 7 and 8 described in Table 5A in Appendix A), we see the same trend as described for overestimation or underestimation of the drop-out, depending on whether the drop-out for the first contributor is overestimated or underestimated.


Chapter 4 - Familial search on mixtures

In this chapter, I will study the use of complex mixtures for familial searches in order to answer research question three: Is familial search on complex mixtures feasible? In a familial search, the database is searched for a relative of the donor. The hypotheses tested in familial searching are:

Hp: A relative of the database member is the donor of the crime sample.

Hd: An unknown person is the donor of the crime sample.

In familial search, the high LR for relatives is used as an advantage. Nowadays, familial search is only performed on single source traces. In this chapter I will look into the possibilities of performing familial search on mixtures and complex mixtures. This chapter is divided into two parts: a study of familial search on mixtures without drop-out and drop-in (4.1) and a study of familial search on complex mixtures (4.2).

4.1 Familial search on mixtures without drop-out and drop-in

Chung, Fung and Hu (2010) and Chung and Fung (2013) investigated whether performing a familial search on mixtures makes sense. In their research, only mixtures which show full two-person profiles, of which one profile is a known victim profile, are tested. Hong Kong Chinese allele frequencies and Swedish allele frequencies are used, respectively. In my research, I conducted the same study with the Dutch allele frequencies published by Westen et al. (2014) in order to see whether the results obtained are similar. The experiment is divided into two sub-experiments.

Experiment 3.1 - Research of Chung, Fung and Hu (2010) and Chung and Fung (2013)

An offender database of 50,000 unrelated individuals with an SGM+ profile is simulated. For one individual in the database, a relative is created and a mixture is made in which this relative represents the first unknown contributor to the mixture. The other contributor is assumed to be a known profile of a victim. A familial search is performed on this mixture and the calculated LRs are ranked. The rank of the relative is recorded to see how many individuals in the ranked list should be investigated in order to successfully identify the relative of the unknown contributor. This is done 1000 times, the same number used by Chung and Fung.

Additionally, the same research as described by Chung, Fung and Hu (2010) is performed. In this research, a database of 30,000 individuals and the Identifiler kit is used. This experiment is also repeated 1000 times.
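The evaluation described above records the rank of the relative in each simulated search and then asks how often that rank falls within the top k. A minimal Python sketch of this bookkeeping is given below; the ranks are invented for illustration and the function names are hypothetical.

```python
def rank_of_relative(relative_lr, other_lrs):
    """1-based rank of the relative when all LRs are sorted from high to low."""
    return 1 + sum(1 for lr in other_lrs if lr > relative_lr)


def top_k_probability(ranks, k):
    """Fraction of simulated searches in which the relative is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Rank of a relative with LR 50 among four other database members.
example_rank = rank_of_relative(50.0, [100.0, 2.0, 75.0, 0.1])

# Hypothetical ranks of the relative in ten simulated familial searches.
ranks = [1, 3, 12, 250, 7, 1, 48, 1020, 2, 95]
curve = {k: top_k_probability(ranks, k) for k in (1, 10, 100, 1000)}
print(example_rank, curve)
```

Plotting the values of `curve` against k gives the kind of top-k curve used to present the familial search results.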

Results

First, I recall the results for familial searches on single source profiles. If a familial search on a single source profile is performed in the same way as described above, the results shown in Figure 8 are obtained. In this figure, the probability of finding the relative in the top k is plotted against k. The figure clearly shows that the parent/child search is the most effective one. It can also be observed that searching for a half sibling is not feasible, since the probability of finding the half sibling is very low. Therefore, familial search is nowadays only performed for the relationships parent/child and siblings, and in my study I will also focus only on these relationships.


Figure 8: Familial search on single source profiles. Results for the parent/child, sibling and half sibling familial search in a database of 50,000 SGM+ profiles.

Figure 9 shows the results for a familial search on a mixture. The black line denotes the results obtained by Chung and Fung (2013) with the Swedish allele frequencies, and the dashed line denotes the results obtained with the Dutch allele frequencies when a victim profile is known. By comparing the parent/child search on the left and the sibling search on the right in Figure 9, it is shown that the search for parent/child relationships is the most effective. This was also observed above and in the study of Chung and Fung (2013).

From Figure 9 (left) it can be observed that we have a probability of approximately 0.85 of finding the parent/child in the top 100. For siblings (Figure 9, right) this probability is 0.65. By direct comparison of my results with the results obtained by Chung and Fung (2013), we see small differences, which can be explained by the different allele frequencies used and by sampling uncertainty.

I also performed the same experiment for the victimless case where the second contributor is not known (black dotted line) to see if it is also feasible to use mixtures where none of the contributors are known. It can be seen that the results in this case are a lot worse than the results obtained when a contributor is known. Nevertheless, in approximately 50% of the cases, the child or parent is found in the top 100 profiles and for siblings this is in approximately 45% of the cases. This is still a quite good hit rate, especially because this database only consists of SGM+ profiles and the results for NGM profiles are expected to be better.

Figure 9: Familial search. Probability of identifying the parent-child (left) and full sibling (right) in the top k profiles in a database of 50,000 SGM+ profiles.


The results obtained by performing the same research as described by Chung, Fung and Hu (2010) in a database with profiles typed with the Identifiler kit can be found in Figure 6B in Appendix B. The observations described above also apply to these data. However, the overall probabilities are higher, since the Identifiler kit contains additional autosomal loci.

Chung and Fung (2013) mention that there is a linear relationship between the probability of identification in the top k profiles and the logarithm of k. This can also be observed in Figure 7B in Appendix B, which plots my results for the parent/child and the sibling familial search against the logarithm of k. A consequence is that when the scale of investigation is already large, enlarging it further will result in only a small improvement.

Experiment 3.2 - NGM profiles

The same experiment as described in Experiment 3.1 is performed for the case in which a victim profile is known, but now for a database with NGM profiles. This is done because the NGM kit is nowadays used in the Netherlands, and I want to compare these results to the results obtained with the SGM+ kit to see how much better the NGM kit performs.

Results

Table 7 shows the results obtained with an NGM database versus an SGM+ database. By direct comparison, we can clearly see that the results obtained with the NGM database are much better. For example, if we investigate the top 10 profiles, we find the parent/child in 52.8% of the cases when using an SGM+ database, while this percentage is 95.4% for an NGM database. This clearly shows that the use of NGM typed profiles is preferable.


4.2 Familial search on complex mixtures

An interesting further question is whether familial search on complex mixtures with drop-out and drop-in makes sense. A study is conducted in which I looked at the top-k ranking of the relative and at the false positive and true positive rates of the familial searches. Because a database similar to the Dutch database is desirable, a new database is created which is different from the database used by Chung, Fung and Hu (2010).

Experiment 4.1 - Testing the effect of the drop-out vector in familial searches.

An offender database consisting of 100,000 NGM profiles and 100,000 SGM+ profiles is simulated. For one individual in the database, a relative is created and a mixture is made in which this relative represents the unknown contributor to the mixture. The mixtures are made with three replicates with drop-out and drop-in, where the exact drop-out values are described in Table 3A in Appendix A. Both the victim and the victimless case are tested. A familial search is performed and the likelihood ratios obtained are sorted. The rank of the relative typed with the SGM+ kit and with the NGM kit is recorded. This is repeated 100 times per drop-out vector.

Results

Victimless case

The first observation is that a relative typed with the NGM kit has a higher chance of identification in the top k profiles than a relative typed with the SGM+ kit. When the relative is typed with the SGM+ kit, the probability of identifying this relative is very low. This is visible in Figure 10, which shows the results of searching for an NGM typed parent/child versus an SGM+ typed parent/child. The mixture used has drop-out vector (0.3, 0.3). We can clearly see that the results obtained for the NGM profile are much better. This means that better results are expected when, in further investigation, the profiles ending up in the top k are upgraded to an NGM profile. The same results are obtained in a familial search for a sibling. The results for the sibling search for a drop-out of (0.3, 0.3) are shown in Figure 8B, which can be found in Appendix B.

Figure 10: Results of NGM vs. SGM+. A parent/child familial search in the victimless case on a mixture with drop-out (0.3, 0.3).

In Figure 11, the results for different drop-out values are shown next to each other. This figure shows a parent/child familial search where the relative is typed with the NGM kit. When the drop-out increases, we can see that the percentage of identification in the top k profiles decreases a little. This is expected, since earlier experiments have shown that more people have a high LR when the drop-out is higher. For example, if we investigate the top 50 profiles, we have a 65% chance of identification when there is little drop-out (0.1, 0.1) and a 45% chance of identification when there is a high drop-out (0.5, 0.5). Although this percentage decreases, the difference is not very large. This can be due to the fact that we use three replicates. If we have a drop-out of (0.1, 0.1) and we use three replicates, the probability that an allele is not seen in any of the replicates is 0.1^3 = 0.001. For a drop-out of (0.5, 0.5), this probability is 0.5^3 = 1/8, so in a profile with 30 alleles (NGM), approximately 4 alleles (30 x 1/8 = 3.75) are expected not to be seen. Therefore, with the use of three replicates, almost all alleles will be seen. Additionally, all donors have the same drop-out. In this case, it is more difficult to distinguish between the different donors, which leads to more uncertainty. If the second donor had a very different drop-out, the results might show a bigger difference between the different drop-out values for the first donor.
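The replicate arithmetic in the paragraph above can be checked with a few lines of Python. The sketch assumes that drop-out events are independent between replicates and between alleles, and that the profile consists of 30 distinct allele copies; the function names are hypothetical.

```python
def prob_unseen_in_all_replicates(dropout, n_replicates):
    """Probability that an allele drops out in every one of the replicates."""
    return dropout ** n_replicates


def expected_unseen_alleles(dropout, n_replicates, n_alleles):
    """Expected number of alleles of a profile that are seen in no replicate."""
    return n_alleles * prob_unseen_in_all_replicates(dropout, n_replicates)


for dropout in (0.1, 0.5):
    p_unseen = prob_unseen_in_all_replicates(dropout, 3)
    n_unseen = expected_unseen_alleles(dropout, 3, 30)
    print(f"d = {dropout}: P(unseen) = {p_unseen:.3f}, expected unseen alleles = {n_unseen:.2f}")
```

With one replicate instead of three, the same functions give 3 and 15 expected unseen alleles for drop-out 0.1 and 0.5 respectively, which illustrates why replicates soften the effect of a high drop-out.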

Figure 11: Results for different drop-out mixtures. A NGM parent/child familial search in the victimless case.

In Figure 12, the results for an NGM sibling familial search are shown. The difference between the different drop-out mixtures is even less pronounced than for a parent/child familial search. This means that the percentage of identification in the top k profiles hardly decreases when the drop-out of the mixture increases.

Figure 12: Results for different drop-out mixtures. A full sibling familial search in the victimless case.

Victim case

We have already seen in Experiment 3.1 that in cases where a victim profile is known, the percentages of identification in the top k profiles are higher than in the victimless case. We therefore expect the results for the victim case to be better in this experiment as well. The results for a parent/child familial search with different drop-out vectors are shown in Figure 13. By direct comparison with Figure 11, we can clearly see that the results are much better when a victim profile is known. For example, for a mixture with only a little drop-out, (0.1, 0.1), we find the parent/child in the top 20 in 95% of the cases. For the victimless case, this is approximately 45%.

Figure 13: Results for different drop-out mixtures. A parent/child familial search in the victim case.

The results obtained for the sibling search in the victim case are shown in Figure 14. By direct comparison with Figure 12, it can be observed that the results are much better when a victim profile is known. Additionally, in the victim case there is more difference between the different drop-out values.

Figure 14: Results for different drop-out mixtures. A sibling familial search in the victim case.

Experiment 4.2 - False positive and true positive rates.

In this experiment, the probability that a relative has a likelihood ratio higher than a certain threshold (true positive rate) and the probability that a random person has a likelihood ratio exceeding that threshold (false positive rate) are studied for certain mixtures. The true positive rate and the false positive rate are determined for four situations: victim case when searching for a relative of the first donor typed with the NGM kit, victim case with SGM+ relative, victimless case with NGM relative and victimless case with SGM+ relative. This is done with the use of importance sampling, and per mixture a sample of 100,000 LRs is taken from the corresponding distribution. The false positive and true positive rates are determined as described in section 2.5. This is repeated 100 times per mixture. The exact drop-out values used can be found in Table 4A, Appendix A.
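To make the two rates concrete, the following Python sketch computes them by plain counting for a range of thresholds. This is a simplified illustration, not the importance sampling procedure of section 2.5, and the LR values and names are hypothetical.

```python
def exceedance_rate(lrs, threshold):
    """Fraction of LRs that are strictly greater than the threshold."""
    return sum(1 for lr in lrs if lr > threshold) / len(lrs)


def rates_per_threshold(relative_lrs, unrelated_lrs, thresholds):
    """List of (threshold, true positive rate, false positive rate) tuples."""
    return [
        (t, exceedance_rate(relative_lrs, t), exceedance_rate(unrelated_lrs, t))
        for t in thresholds
    ]


# Hypothetical LRs for relatives of the donor and for unrelated persons.
relative_lrs = [0.5, 20.0, 300.0, 5000.0, 80000.0]
unrelated_lrs = [0.001, 0.02, 0.3, 4.0, 0.9, 15.0, 0.05, 150.0, 0.0004, 2.0]

rates = rates_per_threshold(relative_lrs, unrelated_lrs, [1.0, 10.0, 100.0, 1000.0])
for threshold, tpr, fpr in rates:
    print(f"threshold {threshold:>7}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Raising the threshold lowers both rates at once, which is the trade-off between finding the relative and limiting the number of unrelated persons that have to be investigated.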
