
The Extraction of Short Tandem Repeats from the 1000 Genomes Project Sequencing Data

Jonna Muller (10317929)
Literature thesis
Master Forensic Science, University of Amsterdam
Supervisor: prof. dr. A.D. Kloosterman (UvA, NFI)
Co-assessor: dr. M. Jonker (UvA, SILS)


Abstract

The purpose of this thesis is to explore and discuss the bioinformatics software tools available for extracting short tandem repeats (STRs) from next-generation sequencing (NGS) data and their applicability to the 1000 Genomes Project sequencing data. STRs are repetitive and highly variable regions in the genome with multiple uses in forensic investigations. STRs on the Y chromosome (Y-STRs) form haplotypes which are used to determine the bio-geographic ancestry of the individual who left a DNA-containing trace. Y-STR databases have been developed for this purpose, but many haplotypes occur too rarely in these databases to allow reliable bio-geographic ancestry determination. This problem might be solved by extending such databases with more Y-STR haplotype data. The 1000 Genomes Project dataset is a good source of these Y-STRs, but the extraction of STRs from NGS data has proven difficult for the mainstream bioinformatics software tools that process and analyze NGS data. Short reads and the repetitive nature of STRs cause ambiguous alignment of multi-reads to the reference; STRs longer than the short reads are lost during analysis because reads must fully span a STR; STRs that deviate too much from the reference due to their high mutation rates fail to align; and DNA slippage during PCR hampers reliable STR calling. Six new bioinformatics software tools for STR analysis, their methods for tackling these problems and their applicability to the 1000 Genomes Project dataset are discussed. Finally, this thesis argues that the methods these software tools use to extract STRs from NGS data have been shown to work well on various types of NGS data, but that their algorithms need some adjustment to also be applicable to the extensive and variable dataset of the 1000 Genomes Project. It is recommended to gather the aspects of these algorithms that work well on the 1000 Genomes Project sequencing data and to implement them in a new algorithm made specifically for the extraction of STRs from the 1000 Genomes Project dataset.


Contents

1.| Introduction

2.| Forensic DNA Typing with Short Tandem Repeats

3.| The 1000 Genomes Project

4.| The Generation of the 1000 Genomes Project Dataset

4.1 Next-generation sequencing

4.2 The phases of the 1000 Genomes Project

5.| The 1000 Genomes Project Dataset

5.1 What do the data look like?

5.2 Where to find the data?

5.3 What is the quality of the data?

6.| Using the 1000 Genomes Project Dataset for Forensic Science

6.1 The applications of different types of STRs in forensics

6.2 The determination of paternal bio-geographic ancestry with Y-STR haplotypes

7.| The Extraction of STRs from NGS Data

7.1 The problems of extracting STRs from NGS data

7.2 Examples of mainstream aligners

7.3 The new bioinformatics software tools for extracting STRs from NGS data

7.3.1 Solutions for extracting STRs from NGS data

7.3.2 Comparing the STR extraction tools

7.4 Other factors for extracting STRs from the 1000 Genomes Project dataset

Conclusion

Ethical considerations

References


1.| Introduction

Forensic science consists of a wide range of disciplines investigating many types of evidence, such as fingerprints, toxicological samples, and ballistics (Jobling & Gill, 2004). However, in the past 30 years DNA analysis has received a great deal of attention and study, making it one of the most powerful types of evidence in forensic science. The aim of forensic DNA analysis is to identify the source of a trace. This is done via DNA profiling, in which DNA markers are characterized to make a DNA profile. Since the 1990s, short tandem repeats (STRs) have been used as markers to make these DNA profiles (Jeffreys, Wilson, & Thein, 1985).

The most familiar use of DNA profiles is to compare the DNA profile of a found trace to the DNA profile of a known individual, such as a suspect, or to many individuals in a database. However, STRs on the Y chromosome (Y-STRs) of male individuals and the haplotypes they form carry information about the origin of an individual (Kayser et al., 2003). In forensics this information is used to determine where a male individual who left a DNA trace comes from: his bio-geographic ancestry. By searching for a found Y-STR haplotype in one of the existing Y-chromosomal databases, the likely geographic region of ancestry and the frequency of this haplotype are obtained. There are multiple Y-STR databases, but many of the haplotypes they contain occur only very rarely in the database. Such haplotypes do not have a geographic signature and thus cannot be used to determine the donor's likely bio-geographic ancestry (Kayser, 2016). In forensics, Y-STR haplotypes are very useful in many situations (Kayser, 2017). For example, in cases where no match is found with autosomal STR profiles, knowing the perpetrator's bio-geographic ancestry can still narrow down the suspect pool. However, for Y-STR haplotypes to be useful, the databases need to be extended, because as they are now, searching for the likely bio-geographic ancestry of a generated Y-STR haplotype often does not yield a result.

A solution to this problem might be to identify more Y-STRs and haplotypes to insert into the databases. A good source of these Y-STR haplotypes is the dataset of the 1000 Genomes Project. This project was one of the first large-scale sequencing projects, sequencing 2504 whole human genomes of individuals from across the world using next-generation sequencing (NGS) (National Institutes of Health, 2008; The 1000 Genomes Project Consortium, 2010). The dataset created by this project has been made publicly available on the Internet.

However, due to the repetitive nature of STRs, extracting them from data generated by NGS has proven very difficult for the mainstream bioinformatics software tools used to process and analyze NGS data (Treangen & Salzberg, 2012). This brings me to the research question of this literature thesis: how can we extract STR genotype data from NGS data? With the advent of NGS, many new bioinformatics software tools for finding and extracting STRs from NGS data were developed. This thesis gives an overview of some of these software tools and discusses their advantages and disadvantages when applied to the dataset of the 1000 Genomes Project.

The outline of this thesis is as follows. First, STRs and their use as markers for DNA profiles are discussed. Next, the goals of the 1000 Genomes Project are explained, along with how these goals were reached and what the resulting dataset looks like. After this, the problems that come with analysing STRs from NGS data are described, and the bioinformatics solutions developed for these problems and how they apply to the dataset of the 1000 Genomes Project are discussed. In the conclusion the final findings are summarised and future directions proposed.


2.| Forensic DNA Typing with Short Tandem Repeats

For almost three decades now, STRs have been used as markers for DNA profiles (Jeffreys et al., 1985). A STR is a repetitive stretch of DNA in which a certain DNA motif of one to six nucleotides is repeated in tandem six to several dozen times (Butler, 2005). About 3% of the human genome consists of these STRs, with on average one STR every 10,000 base pairs. They are mostly found in noncoding regions throughout the genome but are also present in coding regions. STRs have a relatively high mutation rate, ranging from 10^-2 to 10^-4 per repeat per cellular generation (Weber & Wong, 1993), compared with the approximate mutation rate of 10^-8 for single nucleotide polymorphisms (SNPs) (Fan & Chu, 2007). Because of their high mutation rate, STRs are highly polymorphic and have a very high discrimination power when analyzed in multiplex. Other advantages of using STRs as DNA markers are the speed at which they can be analyzed and the reduced costs compared to previously used DNA techniques, the possibility to analyze low-template or degraded DNA, and their ability to resolve simple DNA mixtures (Jobling & Gill, 2004).

Because of the high mutation rate of STRs, the number of repeat units making up a particular STR locus is highly variable between individuals, with the result that many alleles exist for one STR locus (Butler, 2005). This is what makes them suitable for discriminating between individuals. In forensic DNA profiling, STR multiplex systems are used in which multiple STR loci are typed in one experiment. Capillary electrophoresis (CE) has been the main method for DNA profiling since the 1990s, separating STR fragments by size using an electric field (Jeffreys et al., 1985).

These DNA profiles mainly consist of autosomal STRs. However, STRs are also present on both sex chromosomes. These sex-chromosomal STRs are discussed in more detail in chapter 6.

3.| The 1000 Genomes Project

The 1000 Genomes Project was launched in 2008 with the goal of sequencing at least a thousand human genomes representing populations across the world, creating the largest public catalogue of the human genome that can be studied for genetic variation (National Institutes of Health, 2008). By the time the project was finalized in 2015, 2504 human genomes had been sequenced.

Prior to the 1000 Genomes Project, two other projects had been completed that created a human genetic catalogue for studying genetic variation. The first was the Human Genome Project, which ran from 1990 to 2003 and sequenced the first human genome (Collins, Morgan, & Patrinos, 2003). Another project set up to study genetic variation was the International HapMap Project, which ran from 2002 to 2009 and produced a public catalogue, the HapMap, of common SNPs and patterns of sequence variation (The International HapMap 3 Consortium, 2010). Even though these large projects were very successful, providing a great deal of new information about human genetic structure, genetic variants and their association with many genetic diseases, they also showed that human genetics is very complex and that much more research is needed to understand it better (Collins, Lander, Rogers, & Waterson, 2004; The International HapMap 3 Consortium, 2010). For example, the Human Genome Project revealed that the human genome contains only about 20,000-25,000 protein-coding genes (Collins et al., 2004) instead of the previously estimated 30,000-40,000 (Lander et al., 2001), indicating that human gene expression is much more complicated than was first thought. And the International HapMap Project focused only on finding the most common SNPs, those present in at least 5% of the population (The International HapMap Consortium, 2003), while research on SNPs and their association with disease has shown that many common diseases are more complex than initially thought and cannot be explained by these common variants alone (The 1000 Genomes Project Consortium, 2010). Also, in a 2010 press release by the National Institutes of Health (NIH), Lisa Brooks, who was the Program Director for genetic variation at the National Human Genome Research Institute (NHGRI), explained that genome-wide association studies on data from the HapMap only give a region in the genome where variation associated with a particular disease exists (National Institutes of Health, 2010). Additional expensive and time-consuming sequencing is needed to identify the actual causative variants linked to the disease.

Figure 1 | The logo of the 1000 Genomes Project when it started in 2008 (Clarke, 2012)

To increase the quality of studies into genetic variation and its association with disease, the 1000 Genomes Project extended these big projects, building on their data and generating new methods to fill in the missing genetic information and obtain a more detailed view of the human genome (National Institutes of Health, 2008). In this way the variants associated with a disease can be found more quickly and easily. The specific goal of the 1000 Genomes Project was to capture over 95% of the variants accessible to current sequencing technologies with allele frequencies of 1% in 26 populations originating from Europe, East Asia, South Asia, West Africa and the Americas (The 1000 Genomes Project Consortium, 2010). Because of the abundance of low-frequency (0.5%-5% minor allele frequency (MAF)) and rare variants (<0.5% MAF) in the human genome relative to common variants, and their significant contribution to the genetic architecture of disease, it is important to capture these as well. This is why the 1000 Genomes Project also aimed to discover variants with allele frequencies as low as 0.1%. Figure 2 shows the fraction of human genetic variation that was not represented in dbSNP (a database for a variety of genetic variation) before the 1000 Genomes Project started (Clarke, 2012). After the Human Genome Project, only 20% of human genetic variation had been discovered. From 2001 to 2008 this percentage increased to 90% due to big sequencing projects like the International HapMap Project. The 1000 Genomes Project Consortium aimed to reduce the fraction of undiscovered variation to less than 1%. In a news release of the National Institutes of Health (2008), F.S. Collins, NHGRI director at the time, stated the following: "This new project will increase the sensitivity of disease discovery efforts across the genome 5-fold and within gene regions at least 10-fold" (National Institutes of Health, 2008).

Furthermore, the 1000 Genomes Project delivered a far more complete genetic variant catalogue than the HapMap, including not only SNPs but also structural variants like deletions and insertions (National Institutes of Health, 2008; The 1000 Genomes Project Consortium, 2010). For the investigation of the human genome and the role of genetic variation in a particular phenotype or trait, knowledge about the entire spectrum of variation types and allele frequencies was needed.

Thus, by discovering and genotyping all forms of human DNA variation with allele frequencies as low as 0.1% in populations across the world, and by providing accurate and more complete haplotype information for all these populations, the 1000 Genomes Project provided a detailed open genetic catalogue that can serve as a foundation for investigating the association between genotype and phenotype (The 1000 Genomes Project Consortium, 2010).


4.| The Generation of the 1000 Genomes Project Dataset

4.1 Next-generation sequencing

Sanger sequencing was introduced in the late 1970s (Sanger, Nicklen, & Coulson, 1977) and was the main DNA sequencing method for over 30 years (Schuster, 2008). Its ability to generate long reads and to sequence with very high accuracy makes it well suited for sequencing single genes and for STR analysis. However, because of its low throughput of 96 samples at a time, its limitation of sequencing only 1000-1200 base pairs in one run (Zhang, Chiodini, Badr, & Zhang, 2011) and a sensitivity too low to capture rare variants (Behjati & Tarpey, 2013), sequencing a thousand whole human genomes in the fine detail the 1000 Genomes Project Consortium had set out to achieve would not have been feasible with Sanger sequencing. Indeed, it took the Human Genome Project 13 years to sequence just one human genome with Sanger sequencing (Collins et al., 2003).

Because of these limitations of the Sanger sequencing method and the eagerness of scientists to start big sequencing projects, new sequencing technologies were developed in the late 20th and early 21st century (Schuster, 2008). These new high-throughput next-generation sequencing (NGS) technologies can sequence millions of DNA fragments in parallel, increasing the throughput tremendously and making it possible to sequence a complete human genome within a single day (Behjati & Tarpey, 2013). Also, the sequencing depth of NGS can be increased to the user's liking. By sequencing the human genome multiple times, the sensitivity of NGS increases, making it possible to locate rare variants that Sanger sequencing would overlook. Finally, the cost of sequencing a whole genome has been brought down significantly with NGS (Wetterstrand, 2016). The National Human Genome Research Institute (NHGRI) has tracked the costs of DNA sequencing from 2001 until now and estimates that the cost of sequencing one whole genome decreased from $100 million to $10 million between 2001 and 2007, and to about $1000 in 2017 (Fig. 3).

The development of NGS technologies, which sequence DNA much faster, much cheaper and at a much higher resolution, made it possible to carry out the tasks of the 1000 Genomes Project. Different NGS sequencing platforms have been developed in the past decade (Metzker, 2010; Pillai, Gopalan, & Lam, 2017). For the 1000 Genomes Project, multiple NGS platforms from three NGS providers were used (Clarke, 2012): the 454 Roche Genome Sequencer FLX System, the Illumina Genome Analyzer I and II and Illumina HiSeq, and the ABI SOLiD system. These different platforms use different methods but follow the same steps (Metzker, 2010): library preparation, clonal amplification, sequencing, and data analysis. During library preparation the genomic DNA to be sequenced is randomly fragmented, generating fragments of different sizes. Short pieces of synthetic DNA called adaptors are added to the ends of these template DNA fragments. Via these adaptors the DNA fragments are immobilized on a solid surface such as a bead or a slide. Once immobilized, the DNA fragments are clonally amplified by either emulsion PCR or bridge PCR. After amplification the DNA fragments are sequenced. The three NGS platforms used by the 1000 Genomes Project each perform a different sequencing method, but all are based on the sequencing-by-synthesis principle, in which the incorporation of nucleotides during DNA synthesis is visualized and the DNA sequence is determined from these signals.

Figure 2 | Fraction of human genetic variation not in the dbSNP. The aim of the 1000 Genomes Project was to bring this number down to less than 1% (Clarke, 2012)

Figure 3 | The sequencing cost in US dollars per human genome from 2001 to 2017 estimated by the National Human Genome Research Institute, National Institutes of Health. From 2001 through 2007 first-generation sequencing was used of which Sanger sequencing formed the basis. From 2008 next-generation sequencing became the main sequencing method for sequencing whole genomes (from Wetterstrand, 2016).

4.2 The phases of the 1000 Genomes Project

The 1000 Genomes Project was set up in multiple stages: the pilot project consisting of three pilot studies running from 2008 to 2010 and the main project also consisting of three phases, running from 2009 to 2015 (The 1000 Genomes Project Consortium, 2010).


Tables 1 and 2 show an overview of the pilot project and the main project respectively, including the four most important publications of the 1000 Genomes Project Consortium.

The pilot studies were conducted to get a better understanding of the new NGS techniques and platforms, to develop, compare and evaluate different methods for genome-wide sequencing, and to test which strategy is best for detecting and genotyping variants occurring at low frequencies (The 1000 Genomes Project Consortium, 2010). Besides providing a better understanding of the NGS technologies, the pilot studies also yielded a lot of data of interest for genetic variation studies, as it was much deeper and more informative than the data previously available.

It is important to note that in this thesis the terms ‘depth’ and ‘coverage’ mean the same thing, namely the average number of times a nucleotide in the genome is sequenced and represented in the collection of reads. Thus, a coverage or depth of 20x means that on average each nucleotide is sequenced 20 times. I would normally use the term depth, but because the 1000 Genomes Project Consortium used the word coverage to describe the phases of the project, I adopt that word here to avoid confusion.
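To make this concrete, the short Python sketch below (my own illustration, not part of the 1000 Genomes Project documentation; all numbers are made up) computes the average depth as the total number of sequenced bases divided by the genome length.

# Illustrative only: how average depth/coverage relates to the amount of sequence produced.
GENOME_LENGTH = 3_200_000_000   # approximate length of the human genome in base pairs
READ_LENGTH = 100               # assumed read length in bp
NUM_READS = 640_000_000         # assumed number of reads produced for one sample

# average depth = total sequenced bases / genome length
average_depth = NUM_READS * READ_LENGTH / GENOME_LENGTH
print(f"average depth is about {average_depth:.1f}x")   # ~20x: each base sequenced about 20 times on average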

Table 1 | Overview of the pilot project of the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010)

Pilot study | Strategy | Depth of coverage | Purpose

Pilot study 1 | Low-coverage project: whole-genome sequencing of 179 unrelated individuals | Low coverage, 2-6x (average 3.6x) | Assess sequencing efficiency and the combining of data from multiple samples

Pilot study 2 | Trio project: whole-genome sequencing of two trios using different NGS platforms | High coverage, 20-60x (average 42x) | Assess the effect of high coverage and of the different NGS platforms

Pilot study 3 | Exon project: 8140 exons in 697 individuals | High coverage, 50x (average 56x) | Assess methods for gene-region variants

Data published in 2010: https://www.nature.com/articles/nature09534

In the first pilot study the genomes of 179 individuals were sequenced at low coverage (2-6x), after which the data of all these individuals were combined to see whether shared genetic variants could be discovered in this way from low-coverage data (The 1000 Genomes Project Consortium, 2010). In pilot study 2 the genomes of two families, each consisting of a female adult and her parents, were sequenced at a high coverage of 20-60x using the three different NGS platforms. The purpose of this pilot study was to assess the effect of coverage on the yield of genetic variants and to get a better understanding of the different NGS platforms. In the final pilot study, only exons were sequenced, at high coverage. In total 8140 exons from 906 genes in 697 individuals were sequenced at an average coverage of 56x. The goal of this study was to develop and evaluate strategies for targeted sequencing and for finding rare variants in the more conserved gene regions. Actual functional variants are more commonly found in these conserved coding regions, and because these coding regions are fairly conserved, the allele frequencies of these variants are also very low. To discover these rare variants, deeper sequencing is necessary.


These three sequencing strategies each supplied different types of information (The 1000 Genomes Project Consortium, 2010). The low-coverage project identified shared common variants. However, due to the low coverage, sequencing information is missed, which results in some incorrect genotype calls and missed rare variants. By sequencing at high coverage and using multiple sequencing platforms, as in the trio project, most variant types throughout the genome were discovered without a lot of missing data and gaps in the sequence. The method of the exon project also resulted in accurate discovery of rare variants, but only in the targeted regions.

In 2009 all three pilot studies were completed and the Consortium published their results in Nature in 2010 (The 1000 Genomes Project Consortium, 2010). The results indicate that low-coverage sequencing is an effective method for finding variation throughout the whole genome. For the detection and accurate genotyping of rarer variants in particular regions, deep targeted sequencing is more suitable. With these pilot studies, the Consortium successfully developed protocols for whole-genome and targeted sequencing and developed and validated algorithms for variant detection.

Table 2 | Overview of the main project of the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2012, 2015).

Main project phase | Strategy | # Samples sequenced

Phase 1 | Low-coverage whole-genome sequencing and targeted deep exome sequencing | 1092 samples

Phase 2 | Method improvement and development | Expansion to 1722 samples

Phase 3 | Methods from phase 2 applied to make a variation catalogue | All 2504 samples

Publications: Phase 1, published in 2012: https://www.nature.com/articles/nature11632
Phase 3, published in 2015: https://www.nature.com/articles/nature15393 and https://www.nature.com/articles/nature15394

With the methods and algorithms developed during the pilot project, the three phases of the main project were designed (Table 2). Three sequencing strategies were used throughout the main project: low-coverage whole-genome sequencing (4x per individual), high-coverage targeted exome sequencing (50-100x) and dense microarray genotyping (The 1000 Genomes Project Consortium, 2010, 2015).

The goal of the first phase of the main project was to obtain an integrated view of human variation (Clarke, 2012). Collecting several types of data for 1092 samples, by conducting low-coverage (2-6x, average 5x) whole-genome sequencing, deep targeted sequencing of exomes (50-100x, average 80x) and dense SNP microarray genotyping, proved to be an effective way to find variants and reconstruct haplotypes (The 1000 Genomes Project Consortium, 2012).

In the second phase of the main project over 600 additional individuals were sequenced, bringing the total number of sequenced individuals to 1722 from 19 populations (Clarke, 2012). The data from these individuals were used to improve the methods used in phase 1 and to develop new methods for complex variant calling. No papers were published for this phase.


In the third and final phase of the main project, individuals from all 26 populations were sequenced with the methods developed in phase 2, making it possible to also include more types of genetic variants. This final phase brought the total number of sequenced individuals to 2504. Their genomes were reconstructed using whole-genome sequencing with an average depth of 7.4x and targeted exome sequencing with an average depth of 65.7x, and high-density SNP microarrays were used to genotype the individuals and, if available, their first-degree relatives (The 1000 Genomes Project Consortium, 2012, 2015).

The main project ran from 2010 to 2015, reconstructing the genomes of 2504 individuals from 26 populations originating from Europe, East Asia, South Asia, West Africa and the Americas. From every population approximately 100 samples were sequenced (The 1000 Genomes Project Consortium, 2010).

5.| The 1000 Genomes Project Dataset

5.1 What do the data look like?

The 1000 Genomes Project created a tremendous amount of data. By 2012, 260 terabytes of data had been collected and made publicly available in over 250,000 files (Clarke et al., 2012). By the time the project was finalized, more than 100 trillion base pairs had been sequenced (Zheng-Bradley et al., 2017) and more than 88 million variants discovered (The 1000 Genomes Project Consortium, 2015).

Table 3 | The types of file formats used for storing the data produced by the 1000 Genomes Project (Clarke et al., 2012)

File format | Description

FASTQ | Text-based format for sequence reads with a per-base Phred quality score

SAM and BAM | Alignment formats storing reads aligned against a reference genome

CRAM | Compressed alignment format reducing disk footprint

VCF | Text-based format storing variant calls and genotypes

These data are stored in different types of formats (table 3) (Clarke et al., 2012).

The raw sequencing data are stored as FASTQ files, a widely used text-based file format for sequence reads that also includes the Phred quality score of every single base called (Cock, Fields, Goto, Heuer, & Rice, 2009).
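As an illustration of what such a record contains, the sketch below (my own example, not taken from the 1000 Genomes Project files; the read name and sequences are made up) parses one FASTQ record and decodes its quality string, assuming the common Phred+33 ASCII encoding.

# Illustrative only: one FASTQ record consists of four lines.
record = [
    "@read_001",              # 1) read identifier (hypothetical)
    "GATAGATAGATAGATAGATA",   # 2) the called bases
    "+",                      # 3) separator line
    "IIIIIIIIIIIIIIHHHHHG",   # 4) one quality character per base
]

name, sequence, _, quality_chars = record
phred_scores = [ord(c) - 33 for c in quality_chars]     # assumes Phred+33 encoding
error_probs = [10 ** (-q / 10) for q in phred_scores]   # probability a base call is wrong

print(name, sequence)
print("Phred scores:", phred_scores)
print("highest per-base error probability:", max(error_probs))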

The alignments are stored as BAM or CRAM files (Clarke et al., 2012). The Sequence Alignment/Map (SAM) and Binary Alignment/Map (BAM) formats were created by members of the 1000 Genomes Project Consortium for storing reads aligned against the reference sequence and for further processing (H. Li et al., 2009). SAM is an alignment format that supports both long and short reads from all types of sequencers and all types of alignment tools. Because multiple types of sequencers and alignment software were used during the 1000 Genomes Project, such a generic alignment format was needed.


The BAM and CRAM formats contain exactly the same information as the SAM format, but in a compressed fashion. Because NGS can generate billions of reads that are aligned to the reference (Behjati & Tarpey, 2013), SAM files can become very large. To save storage space on the servers, such big files can be compressed, and many different compression methods exist. SAM files are text that humans can read (H. Li et al., 2009). When these big SAM files are converted to BAM files, the readable text is converted into a binary representation. Humans cannot read these files, but the 1000 Genomes Project Consortium created the software tool SAMtools, which can work directly with these compressed BAM files without having to uncompress them. CRAM files are also compressed versions of the alignment but use a different and better compression method than BAM, storing the data in containers and blocks (EMBL-EBI, n.d.-b). Conversion between CRAM and BAM files is straightforward, so that these data can be analyzed with SAMtools as well.
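As a small illustration of working with such compressed alignments without uncompressing them first (a sketch only; the file name, the chromosome coordinates and the availability of the pysam package are assumptions on my part), a BAM file can be queried directly from Python:

import pysam  # Python wrapper around the SAMtools/HTSlib libraries (assumed installed)

# Open a (hypothetical) coordinate-sorted and indexed BAM file and fetch the reads
# overlapping an arbitrary 1 kb window on the Y chromosome.
with pysam.AlignmentFile("sample.bam", "rb") as bam:        # "rb" = read BAM
    for read in bam.fetch("Y", 2_655_000, 2_656_000):        # requires a .bai index
        print(read.query_name, read.reference_start, read.cigarstring)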

For a more elaborate explanation of the SAM format and its compressed version BAM, the development site of SAMtools can be visited: http://samtools.sourceforge.net/. More information about CRAM and its compression method can be found on its usage site: https://www.ebi.ac.uk/ena/software/cram-usage#bam_to_cram.

Finally, the results of the analyses, the variant calls and individual genotypes, are stored in the variant call format (VCF) (Clarke et al., 2012). This generic format was also designed by members of the 1000 Genomes Project Consortium to store annotations and all types of genetic variants, from large-scale indels to SNPs (Danecek et al., 2011). Just like the BAM format, VCF files can be indexed, making quick retrieval of variants of interest possible.
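To give an impression of this format (an illustration of my own; the variant shown and the sample genotypes are invented), a VCF data line is tab-separated and can be parsed as follows:

# Illustrative only: one invented VCF data line with two haploid sample genotypes.
vcf_line = "Y\t2655180\t.\tG\tA\t100\tPASS\tAC=12;AN=2504\tGT\t0\t1"

fields = vcf_line.split("\t")
chrom, pos, var_id, ref, alt = fields[0], int(fields[1]), fields[2], fields[3], fields[4]
info = dict(item.split("=") for item in fields[7].split(";"))   # INFO column as a dictionary

print(f"{chrom}:{pos} {ref}>{alt}, allele count {info['AC']} out of {info['AN']} chromosomes")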

5.2 Where to find the data?

During the project, the data was made available as soon as possible on mirrored download sites at the European Bioinformatics Institute (EBI) and the National Center for Biotechnology Information (NCBI) (Clarke et al., 2012). By creating mirrored websites, the network traffic on a website is reduced resulting in an improved availability of the website and download speed. In America, the website of the NCBI (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/) has quicker access while in Europe the EBI website (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) shows greater download speeds. These websites are still up and running.

The data was also shared on three browsers, one for the data of the pilot project (http://pilotbrowser.1000genomes.org/index.html), one for the data created during the first phase of the main project (http://phase1browser.1000genomes.org/index.html) and one for single-nucleotide polymorphisms (SNPs) and indels found in phase 3 of the main project (http://phase3browser.1000genomes.org/index.html) (Clarke et al., 2012). These websites are also still available.

However, the 1000 Genomes Project Consortium recommends viewing the data in the Ensembl genome browser, which also contains data from other projects. The 1000 Genomes Project data can be viewed as aligned to the reference genome GRCh37 (http://grch37.ensembl.org/Homo_sapiens/Info/Index), which was used as the reference during the main phase of the project, or as aligned to the current reference genome GRCh38 (http://www.ensembl.org/Homo_sapiens/Info/Index), to which the reads created during the 1000 Genomes Project were realigned after the project was finalized (Zheng-Bradley et al., 2017).

Furthermore, the SNPs and indels captured by the 1000 Genomes Project were submitted to dbSNP, a database for short genetic variations, and the structural variants to the Database of Genomic Variants archive (DGVa), a database for large genetic variations (Clarke et al., 2012).


5.3 What is the quality of the data?

During the project, the accuracy and error rate of the data improved considerably (The 1000 Genomes Project Consortium, 2015). The increased read length and depth, the reduction in per-base errors of the new and improved NGS techniques developed over the years, and the introduction of paired-end sequencing resulted in better primary sequencing data. The analysis of the sequence data also improved, due to new methods for identifying and filtering poor-quality data, increased accuracy of mapping reads to the reference genome, especially in repetitive regions, and new possibilities to capture a wider range of variant types. Together with the increasing sample size, this resulted in a threefold reduction in error rates between the first and the third phase of the main project (The 1000 Genomes Project Consortium, 2012, 2015).

The accessible genome, that is, the portion of the reference genome to which sufficient reads could be mapped to allow reliable variant discovery after filtering out poor-quality reads, also increased between the phases of the project (The 1000 Genomes Project Consortium, 2010, 2012, 2015). In the pilot phase about 85% of the reference genome proved to be accessible. Using the same criteria as in the pilot phase, the accessible genome increased to approximately 94% in the first phase of the main project and to 96% in the third phase. However, stricter criteria were introduced for the first and third phase of the main project, reducing the accessible genome to 72.2% and 76.9% respectively, but also increasing the reliability of variant discovery.

During the third phase of the main project the variant discovery power and the accuracy of the genotyping results were evaluated (The 1000 Genomes Project Consortium, 2015). The power to detect SNPs with sample frequencies of ≥0.5% and >1.0% was estimated to be >95% and >99% respectively. The power to detect indels at these same sample frequencies was estimated at >80% and >85% respectively, and the power to detect SNPs with a frequency of 0.1% was estimated to be 75%. Finally, the heterozygous genotype accuracy was estimated at 99.4% for SNPs and 99.0% for indels.

6.| Using the 1000 Genomes Project Dataset for Forensic Science

6.1 The applications of different types of STRs in forensics

STRs are found on the autosomal chromosomes and on both sex chromosomes. However, for this thesis I will focus on the STRs found on the Y chromosome (Y-STRs), because I think Y-STR analysis can benefit most in the short term from the extraction of STRs from the 1000 Genomes Project dataset. In forensics, autosomal STRs have been used as markers for DNA profiles for three decades now (Jeffreys et al., 1985) and their use is already well accepted by courts of law and by society. X-chromosomal STRs (X-STRs), on the other hand, are not that commonly used in forensics. Their possible applications have been studied in the past decade (Diegoli, 2015), but much more research is needed before they are as generally accepted as evidence. Y-STRs have a multitude of forensic applications, one of them being the determination of the paternal bio-geographic ancestry of the source of a specific DNA-containing trace (Kayser, 2017). This method is used in forensics to a certain extent, but more Y-STR data are needed for better and more reliable results. For these reasons I will focus on Y-STRs. However, autosomal STR and X-STR analyses can also benefit from additional STR data.


6.2 The determination of paternal bio-geographic ancestry with Y-STR haplotypes

STRs found on the Y chromosome (Y-STRs) have long been used to study differences in genetic structure between human populations and to determine paternal bio-geographic ancestry (Kayser et al., 2003; Mahal & Matsoukas, 2017). In forensics, determining someone's bio-geographic ancestry can be useful in multiple situations (Kayser, 2017). In cases where the perpetrator is unknown to the investigators, no autosomal STR profile match will be found. In such cases, the bio-geographic background of an individual can still guide the investigators towards finding the perpetrator. Y-STRs are also useful when autosomal STR profiling fails. This can happen when trying to profile a male/female DNA mixture, or when only the bones of victims remain, from which, due to environmental insults, it is difficult to extract sufficient DNA to make an autosomal STR profile. In such cases Y-STR lineage markers can complement autosomal STR profiles to increase the statistical confidence of an identification. Furthermore, in disaster victim identification, Y-STRs can be used to identify a victim when only distant relatives are available to supply reference DNA and their autosomal STR profiles deviate too much from the victim's to say anything about relatedness.

Y-STRs are good paternal lineage markers for several reasons (Kayser, 2017; Kayser et al., 2003). First of all, the Y-STRs used for determining bio-geographic ancestry are located on the non-recombining part of the Y chromosome, due to which they show linked inheritance. These linked Y-STRs form haplotypes that are specific for certain regions and populations in the world. Secondly, STRs mutate rather quickly, and because the used part of the Y chromosome does not undergo recombination, a mutation stays in the gene pool as long as male offspring exist. Cultural factors like patrilocal residence (living near or with the husband's parents instead of the wife's parents) and polygyny (marriage of one man to multiple women) contribute to genetic differences between geographic regions. For these reasons, a haplotype composed of a set of linked Y-STRs is specific for a certain region in the world.

Figure 4 | The relation between haplogroups and haplotypes. Haplogroups are separated by different single-nucleotide polymorphisms (SNPs) that happen over time. In the following generations STR polymorphisms happen, creating different haplotypes that make up the haplogroup. From haplotypes it is usually possible to predict the haplogroup. Altered from


In turn, a haplogroup consists of a set of similar haplotypes that descend from a common ancestor carrying a particular SNP (Fig. 4) (Kayser, 2016). Thus, a haplogroup comprises a set of haplotypes that each carry STR polymorphisms making them slightly different. When the haplotype is known, its haplogroup can usually be predicted. Because SNPs arise much less frequently than STR mutations, haplogroups can be used to trace someone's ancestry back thousands of years, whereas STR polymorphisms can be used for much more recent historical events (Kayser, 2017).

Y-STRs can only be used for determining paternal bio-geographic ancestry because Y chromosomes are inherited only by male offspring from the father (Kayser, 2017; Kayser et al., 2003; Mahal & Matsoukas, 2017), meaning that Y-STR analysis cannot be used when looking for a female perpetrator. However, males are most commonly responsible for violent crimes and sex offenses. Among the Dutch population, males are four times more likely to become a suspect in a crime than females (Centraal Bureau voor Statistiek, 2016). Thus, for most cases, Y-STR profiling can be very useful.

Multiple Y-chromosome haplotype databases exist, including the U.S. Consolidated Y-STR Database (http://www.usystrdatabase.org/), consisting of 35,658 haplotypes, and the Y-Chromosome Haplotype Reference Database (YHRD) (www.yhrd.org), consisting of 197,102 haplotypes, making it the largest and most widely used database. A Y-STR haplotype found in a trace can be searched in these databases, which in turn return the likely geographic region of ancestry and the frequency of this haplotype, which can be used to calculate the match probability (Kayser, 2017).
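To illustrate how such a database frequency feeds into the calculation (a deliberately simplified sketch of the commonly used counting method, with made-up observation numbers; it is not the procedure prescribed by any particular database):

# Illustrative only: counting-method frequency estimate for a Y-STR haplotype.
N = 197_102   # size of the database, e.g. the YHRD figure cited above
x = 25        # hypothetical number of times the haplotype of interest was observed

frequency = x / N
print(f"estimated haplotype frequency: {frequency:.6f}")

# For a haplotype never observed in the database (x = 0), one common convention is to
# report the upper bound of a 95% confidence interval rather than a frequency of zero.
alpha = 0.05
upper_bound = 1 - alpha ** (1 / N)
print(f"upper 95% bound for an unobserved haplotype: {upper_bound:.2e}")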

Many Y-STR haplotypes are very rare and occur only a few times, or not at all, in the database. These haplotypes cannot be used for determining bio-geographic ancestry (Kayser, 2016). Finding more STRs might resolve this problem. The 1000 Genomes Project sequenced 2504 human genomes from individuals originating from 26 populations located in five major regions of the world (Europe, East Asia, South Asia, West Africa and the Americas) (The 1000 Genomes Project Consortium, 2010), thereby also providing a lot of STR data. These STRs can be studied and new haplotypes can be identified and added to the Y-STR haplotype databases. This might improve the accuracy of the estimated haplotype frequencies and increase the precision of the geographic region of ancestry of specific haplotypes, making the results of a haplotype search more reliable and informative.

7.| The Extraction of STRs from NGS Data

NGS is a high-throughput sequencing technology that sequences millions of DNA fragments simultaneously and can produce billions of reads per run (Behjati & Tarpey, 2013). For the 1000 Genomes Project short reads were produced, ranging from 25 bp to 160 bp during the pilot project; for the final phase a minimum read length of 70 bp was used (EMBL-EBI, n.d.-a). Aligning short reads containing the repetitive sequences of STRs to a reference genome to elucidate their location in the genome, and the subsequent analysis of these STRs, proved to be the central computational problem for the mainstream software tools used during the advent of NGS at the beginning of the 21st century (Treangen & Salzberg, 2012). Because there was a lack of software tools for analyzing STRs, these DNA regions were often neglected in many studies (Kristmundsdóttir, Sigurpálsdóttir, Kehr, & Halldórsson, 2016). From around 2010, many new pipeline tools were created to tackle the problems that come with aligning short reads containing STRs and with subsequent STR genotype calling.

Before discussing these new software tools, the problems encountered by the older mainstream alignment software tools when aligning short-read data containing repetitive sequences like STRs are explained and some examples of these software tools are given.


7.1 The problems of extracting STRs from NGS data

First of all, aligning billions of reads to a reference genome takes a lot of computing power and also a lot of time (Reinert, Langmead, Weese, & Evers, 2015; Treangen & Salzberg, 2012). The older aligners work with a trade-off between the time it takes to align the reads and the ability to detect genetic variants (Langmead & Salzberg, 2012; H. Li & Homer, 2010). Many of the older aligners use a genome indexing system to make a smaller selection of possible mapping locations, so that the whole reference genome does not have to be scanned in the search for possible fits (Langmead & Salzberg, 2012). This indexing system works best for ungapped alignment, an alignment method that avoids gaps within sequences caused by sequencing errors or true genetic variants like indels. Software using this method usually fails to align reads containing indels and will miss out on this type of information (H. Li & Homer, 2010). And when reads containing indels are aligned with the ungapped alignment method, this can lead to false genetic variant calls, including false SNPs and false structural variants. The gapped alignment method does take indels into account by allowing a gap in the read or reference sequence where an insertion or deletion has occurred relative to the other. However, allowing gaps increases the search space of the reference genome tremendously, making index-aided alignment very inefficient and computationally challenging and significantly decreasing alignment speed. Thus, ungapped alignment is efficient and fast but misses important genetic variant information and produces wrong variant calls, while gapped alignment leads to accurate variant calls but is inefficient and slow. For the detection of STR variation the gapped alignment method is needed, which makes the alignment phase take a long time (Gymrek, Golan, Rosset, & Erlich, 2012). Thus, when NGS technologies were developed, there was a need for a more efficient way to work with the tremendous amount of data created by NGS.
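The toy comparison below (my own illustration with invented sequences; real aligners use far more sophisticated indexed algorithms) shows why this matters for indels: a read carrying a single 1 bp deletion racks up mismatches under an ungapped comparison, because every base after the deletion is out of register, whereas a gapped (edit-distance) comparison charges it only a single gap.

# Illustrative only: ungapped vs gapped comparison of an indel-containing read.
def ungapped_mismatches(read, ref):
    # Best mismatch count over all ungapped placements of the read inside the reference.
    best = len(read)
    for start in range(len(ref) - len(read) + 1):
        window = ref[start:start + len(read)]
        best = min(best, sum(a != b for a, b in zip(read, window)))
    return best

def gapped_distance(read, ref):
    # "Fitting" alignment by dynamic programming: the whole read is aligned somewhere
    # inside the reference, and gaps (insertions/deletions) cost 1 each.
    prev = [0] * (len(ref) + 1)          # leading reference bases may be skipped for free
    for i, r in enumerate(read, 1):
        cur = [i]
        for j, c in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != c)))
        prev = cur
    return min(prev)                     # trailing reference bases may be skipped for free

reference = "ACGTGGCTAAGCTTACCGGATCGA"
read = "GGCTAAGCTACCGGAT"   # the sample carries a 1 bp deletion relative to the reference

print("ungapped mismatches:", ungapped_mismatches(read, reference))   # 5 with these sequences
print("gapped distance:", gapped_distance(read, reference))           # 1: the deletion costs one gap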

The second problem has to do with the read lengths and the repetitive character of STRs. Many short reads do not fully encompass a STR, and because a STR is so repetitive it does not contain much information about its location in the genome. Due to sequence similarities between repeat motifs of different STRs, a read containing only part of a STR may map to multiple locations (Reinert et al., 2015; Treangen & Salzberg, 2012). The more similar two different STRs are, the lower the confidence of read mapping within these repeats becomes (Fig. 5) (Treangen & Salzberg, 2012). Reads mapping to more than one location in the genome are called multi-reads; they create ambiguity about their true position in the genome and can lead to the detection of false genetic variants. The percentage of short reads (≥25 bp) that can map to more than one location in the human genome is approximately 20-30%, depending on the alignment software used.
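The small sketch below (an illustration of my own, with an invented reference containing two loci of the same AGAT motif) shows why a read that lies entirely within a repeat is a multi-read, while a read that includes unique flanking sequence maps to a single position.

# Illustrative only: exact-match "mapping" of reads against a toy reference.
reference = ("TTGCAC" + "AGAT" * 6 + "CCTGAA"      # STR locus 1: (AGAT)6 with unique flanks
             + "GGATCCTTAG"
             + "ACTTGA" + "AGAT" * 8 + "TGCCAT")    # STR locus 2: (AGAT)8 elsewhere in the genome

def mapping_positions(read, ref):
    # All positions where the read matches the reference exactly (a real aligner would
    # use an index and tolerate mismatches, but the ambiguity problem is the same).
    return [i for i in range(len(ref) - len(read) + 1) if ref[i:i + len(read)] == read]

repeat_only_read = "AGAT" * 4                  # read falling entirely inside the repeat
anchored_read = "TTGCAC" + "AGAT" * 2 + "AG"   # read containing unique flanking sequence

print(len(mapping_positions(repeat_only_read, reference)), "candidate positions (multi-read)")
print(len(mapping_positions(anchored_read, reference)), "candidate position (unique)")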

Another problem with short reads containing a partial STR is that these reads are not informative about the number of repeats that make up the STR; they only give the minimum number of repeats (Cao et al., 2014; Fungtammasan et al., 2015; Gymrek et al., 2012; Tang et al., 2017). This would mean that only reads fully encompassing a STR can be used to detect STR variation. Many older software tools cannot distinguish between reads containing a complete STR and reads containing a partial STR, and thus also align these partial STR reads. However, using only reads that fully encompass a STR limits the analysis to STRs shorter than the reads. Longer STRs will be lost, which is a waste of informative data (Cao et al., 2014; Tang et al., 2017).
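The difference is easy to see in a small sketch (my own illustration; motif, flanks and reads are invented): a spanning read yields the exact repeat count, whereas a read ending inside the repeat only yields a lower bound.

# Illustrative only: spanning reads give an exact count, partial reads a lower bound.
import re

MOTIF = "AGAT"

def longest_motif_run(read, motif=MOTIF):
    # Number of repeat units in the longest uninterrupted run of the motif in the read.
    runs = re.findall(f"(?:{motif})+", read)
    return max((len(run) // len(motif) for run in runs), default=0)

spanning_read = "TTGCAC" + MOTIF * 6 + "CCTGAA"   # both flanks present: allele is exactly 6
partial_read = "TTGCAC" + MOTIF * 4               # read ends inside the repeat: at least 4

print("spanning read:", longest_motif_run(spanning_read), "repeat units (exact)")
print("partial read:", longest_motif_run(partial_read), "repeat units (lower bound only)")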


Figure 5 | Ambiguous read mapping of multi-reads. The more similar two copies of one repeat are, the lower the confidence of correct read mapping becomes. In this figure two copies each of the STRs X and Y are shown, with the mapping reads beneath them. For both STRs, one of the mapping reads is highlighted and its sequence given. Starting with STR X, the copies X1 and X2 are exactly the same. Because the highlighted read aligns equally well to X1 and X2, its mapping confidence is low. The highlighted read aligning to STR Y has two mismatches with Y2 and none with Y1. Thus, the confidence of mapping this read to Y1 is higher. Altered from Treangen & Salzberg, 2012.

Furthermore, the high mutation rate of STRs causes problems when aligning reads containing STRs. STRs have a mutation rate of 10^-2 to 10^-4 per repeat per generation (Weber & Wong, 1993), which can make them deviate considerably from the reference genome. When such reads are mapped in their entirety to the reference genome, matching every nucleotide of the read, some reads will deviate too much from their corresponding position in the reference for mapping to take place (Fungtammasan et al., 2015). This causes a bias towards the STR length as it appears in the reference, resulting in incorrect allele frequency estimation and an underestimation of the actual level of STR variation (Fungtammasan et al., 2015; Gymrek et al., 2012; Tae, McMahon, Settlage, Bavarva, & Garner, 2013).

These mutations in STRs are caused by strand slippage during DNA replication (Gemayel, Vinces, Legendre, & Verstrepen, 2010). DNA slippage is the mispairing of DNA strands containing a repetitive sequence, forming a loop in one of the two strands and resulting in an altered repeat number (Fig. 6). A looped-out template strand results in a deletion, and a looped-out nascent strand in an insertion, of one or more repeat units. This happens during normal DNA replication in a cell, but also during PCR amplification, creating DNA amplicons with different repeat lengths than the original DNA to be sequenced (Fungtammasan et al., 2015; Gymrek et al., 2012). Since PCR is a standard step in library preparation for NGS, this PCR stutter noise can impede correct allele calling of repetitive sequences like STRs. This does not affect the alignment of reads containing STRs, but the PCR stutter noise should be taken into consideration when genotyping STRs.
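A deliberately simplified simulation (my own sketch, not the error model of any of the tools discussed below; all probabilities are invented) illustrates what stutter looks like in the read data: most reads support the true allele, with smaller peaks one repeat unit below and above it.

# Illustrative only: a toy PCR stutter model for a true allele of 10 repeat units.
import random
from collections import Counter

random.seed(1)
TRUE_ALLELE = 10     # true number of repeat units (hypothetical)
P_MINUS_ONE = 0.08   # assumed chance that slippage removes one repeat unit
P_PLUS_ONE = 0.02    # assumed chance that slippage adds one repeat unit

def observed_repeats():
    r = random.random()
    if r < P_MINUS_ONE:
        return TRUE_ALLELE - 1
    if r < P_MINUS_ONE + P_PLUS_ONE:
        return TRUE_ALLELE + 1
    return TRUE_ALLELE

read_support = Counter(observed_repeats() for _ in range(1000))
print(read_support)   # mostly 10 repeats, with smaller "stutter" counts at 9 and 11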

7.2 Examples of mainstream aligners

Some of the first mainstream alignment software tools for aligning short reads generated by NGS are Maq (H. Li, Ruan, & Durbin, 2008), Mosaik (Smith et al., 2008), SOAPAligner (R. Li, Li, Kristiansen, & Wang, 2008), Novoalign (http://www.novocraft.com/products/novoalign/), BFAST (Homer, Merriman, & Nelson, 2009), BWA (H. Li & Durbin, 2009), Bowtie (Langmead, Trapnell, Pop, & Salzberg, 2009), SHRiMP (Rumble et al., 2009) and mrFAST (Alkan et al., 2009), developed in 2008 and in 2009.

To search through the data quickly and to find the corresponding location of a read in the reference, these aligners use hashing algorithms or the Burrows-Wheeler transform (BWT) (Lee et al., 2014). How these methods work exactly is beyond the scope of this literature thesis, but they make it possible to work with the tremendous number of reads generated by NGS.


However, the algorithms on which these alignment pipelines are built were not specifically written to handle reads containing repetitive sequences like STRs. SHRiMP, for example, does not tolerate non-unique matches and discards reads that map to multiple locations with similar strength (Rumble et al., 2009). Maq does report multi-reads, but when they map equally well they are not taken into account for variant calling (H. Li et al., 2008). SOAPAligner, Bowtie and BWA do have parameters controlling how repeats are treated: SOAPAligner reports the number of equally good best hits and can report all locations (H. Li & Durbin, 2009; Treangen & Salzberg, 2012), Bowtie distributes multi-reads randomly over the repeats in the reference (Treangen & Salzberg, 2012), and BWA randomly reports just one alignment of a multi-read and assigns it a mapping score (H. Li & Durbin, 2009; Treangen & Salzberg, 2012). However, Yu et al. (2012) tested the performance of Bowtie, BWA, SOAP and also Novoalign and showed that it is better to remove multi-reads, because they tend to align incorrectly, which worsens the alignment accuracy.

Only mrFAST and Mosaik appear to have reasonably good options for aligning reads to repetitive regions. mrFAST does this by reporting all locations to which a multi-read can map, and it was shown to reach the structurally most complex and dynamic regions of the human genome (Alkan et al., 2009). Mosaik handles multi-reads by reporting the best mapping and adding a tag with information on mapping quality and the number of other mapping locations (Lee et al., 2014). However, reporting all possible locations still leaves ambiguity about a read's true location, and a randomly picked mapping location is not necessarily the true one. All in all, these mainstream alignment software tools do work efficiently with the large amounts of data created by NGS, but they do not specifically focus on aligning reads to the repetitive sequences of STRs. Thus, by themselves these software tools do not create alignments of STRs suitable for further variant analysis. However, other software tools were created that use these alignments but process them further, so that only informative reads with a reliable alignment are used for variant analysis (Fungtammasan et al., 2015; Highnam et al., 2013). RepeatSeq (Highnam et al., 2013) is one of those STR detection and genotyping tools. The developers of RepeatSeq evaluated the ability of different mapping algorithms to map low-complexity sequences like repeats. Their results showed that Novoalign and the newest version of Bowtie mapped these sequences most accurately, while BWA showed the least accurate alignment.

Figure 6 | DNA slippage during replication. DNA strands containing repeat units (orange blocks) can dissociate and mispair during replication, forming a loop which can result in a deletion or insertion of repeat units. Altered from Gemayel et al., 2010.


7.3 The new bioinformatics software tools for extracting STRs from NGS data

Because of the problems older mainstream aligners encountered when analyzing NGS data, new bioinformatics data processing and analysis pipelines were created that are able to handle the large amounts of data and the repetitive character of STRs (Treangen & Salzberg, 2012). Table 4 shows some of the bioinformatics software tools developed for aligning and/or calling STRs in NGS data. In this paragraph the methods used by these software tools to overcome the problems of extracting STRs from NGS data are discussed and compared.

There are more bioinformatics software tools for processing and analyzing STRs in NGS data than these six, but discussing all of them would go beyond the scope of this thesis. These six software tools were selected because they appear to be the most popular, recurring frequently in the literature, because of their capability to analyze whole genomes instead of only performing targeted analysis of STRs, and because of their different strategies for extracting STRs from NGS data.

Table 4 | Newly developed software tools for extracting STRs from NGS data.

Software tool | Year | Reference

lobSTR | 2012 | (Gymrek et al., 2012)

RepeatSeq | 2012 | (Highnam et al., 2013)

STRViper | 2013 | (Cao et al., 2014)

STR-FM | 2015 | (Fungtammasan et al., 2015)

popSTR | 2016 | (Kristmundsdóttir et al., 2016)

TREDPARSE | 2017 | (Tang et al., 2017)

7.3.1 Solutions for extracting STRs from NGS data

Short read problems

A big problem for STR alignment is the presence of multi-reads, which cause ambiguous alignment. Different methods have been developed to solve this problem.

lobSTR, STR-FM and popSTR try to solve this problem with flanking-based mapping (Fungtammasan et al., 2015; Gymrek et al., 2012; Kristmundsdóttir et al., 2016). These software tools do not align the complete repetitive sequence but take the two regions that flank the STR and align those to the reference. The idea behind this method is that the flanking regions of STRs are non-repetitive and unique, so that these reads can be mapped to only one location in the genome instead of multiple locations, avoiding ambiguous alignment. After alignment, the STR repeat number is determined by directly counting the repeat units within the reads.
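The sketch below is loosely inspired by this flanking-based idea; it is my own simplification and not the actual implementation of lobSTR, STR-FM or popSTR, and the flank sequences and motif are invented. A read is genotyped by direct counting only if both unique flanks are found in it.

# Illustrative only: direct repeat counting between two (hypothetical) unique flanks.
LEFT_FLANK, RIGHT_FLANK, MOTIF = "TTGCAC", "CCTGAA", "AGAT"

def call_str_allele(read):
    # Return the repeat count if both flanks are present, otherwise None.
    left = read.find(LEFT_FLANK)
    right = read.find(RIGHT_FLANK)
    if left == -1 or right == -1 or right <= left:
        return None                                   # read does not span the STR
    between = read[left + len(LEFT_FLANK):right]
    if between != MOTIF * (len(between) // len(MOTIF)):
        return None                                   # interrupted repeat; skipped in this sketch
    return len(between) // len(MOTIF)

print(call_str_allele("TTGCAC" + "AGAT" * 7 + "CCTGAA"))   # 7: spanning read, direct count
print(call_str_allele("TTGCAC" + "AGAT" * 3))              # None: STR longer than the read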

However, this method has several limitations. First of all, it has been found that more than two-thirds of the STRs in the human genome overlap with or lie close to transposable elements, meaning that one or both flanking regions of these STRs are not unique at all but are repetitive as well (Tae et al., 2013). Another problem of these three software tools, which is also one of the problems mentioned before, is that they can only use reads that fully encompass a STR, since they filter out reads that do not have flanking regions on both sides of the STR. Because of this, STRs longer than the reads are removed from the data and not used in further analysis. STRViper (Cao et al., 2014) and TREDPARSE (Tang et al., 2017) acknowledged this problem and developed different methods to resolve it.


STRViper uses paired-end sequencing data to infer STR variation (Cao et al., 2014). Paired-end reads are better at resolving STRs than single reads, especially when the data consist of short reads (Reinert et al., 2015). When a fragment is sequenced from both sides, paired-end reads are created. When one of these reads falls within a repeat, its true location can still be determined if the other read originates from a unique, non-repetitive region, because the approximate distance between the paired-end reads is known. Thus, the read that lies outside the STR helps identify the true location of the read mapping within the STR. Each paired-end sequenced fragment suggests a difference in length between the STR in the DNA to be sequenced and the reference (Cao et al., 2014). When the insert size between the mapped paired-end reads is longer than the actual insert size of the fragment, the STR of that fragment consists of fewer repeat units than the reference, and vice versa. STRViper estimates the repeat number of the STR by Bayesian inference using this difference in insert size. The great advantage of this method is that reads fully encompassing the STR are not necessary, so that STRs longer than the reads are also included in the analysis, giving more complete and informative results.
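The arithmetic behind this insert-size signal can be sketched as follows (my own simplified illustration, not STRViper's actual Bayesian model; all numbers are invented):

# Illustrative only: estimating a sample's repeat count from mapped insert sizes.
MOTIF_LENGTH = 4            # e.g. a tetranucleotide STR
MEAN_FRAGMENT_LENGTH = 400  # expected insert size of the sequencing library
REFERENCE_REPEATS = 12      # repeat units at this locus in the reference genome

def repeats_from_insert(mapped_span):
    # A pair mapping further apart than the true fragment length implies a shorter STR.
    length_difference = MEAN_FRAGMENT_LENGTH - mapped_span
    return REFERENCE_REPEATS + length_difference / MOTIF_LENGTH

print(repeats_from_insert(412))   # 9.0: the sample STR is about three repeat units shorter

# A single pair is noisy because fragment lengths vary; averaging many pairs (or full
# Bayesian inference over the fragment-length distribution, as STRViper does) helps.
estimates = [repeats_from_insert(span) for span in (408, 412, 416, 410, 414)]
print(sum(estimates) / len(estimates))   # 9.0 with these example spans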

This sounds very promising; however, the criticism of flanking-based mapping also applies to the method of STRViper, since the paired-end reads need to be aligned to the flanking regions of the STR. Software tools will find it difficult to reliably map reads from STR loci next to or close to a transposon. This issue can for now only be reduced by producing longer reads, but the 1000 Genomes Project has been finalized and produced short reads, so for this dataset no real solution exists yet. Furthermore, STRViper's approach of estimating the repeat number from the difference in insert size is less accurate than the direct counting of repeat units used by the three earlier software tools (Kojima et al., 2016). This is especially true for low-coverage NGS data.

Figure 7 | The four types of reads used by TREDPARSE in its probabilistic model to determine the allele of a particular STR. The orange bar indicates a STR. a. Spanning reads contain the complete STR and both flanking regions. b. Partial reads contain one flanking region and a part of the STR. c. Repeat-only reads contain only the STR without flanking regions. d. Paired-end reads that together span the STR. Altered from Tang et al., 2017.

TREDPARSE (Tang et al., 2017) developed another method for aligning and analyzing STRs longer than the NGS reads. It uses information from four different sources: spanning reads that fully encompass a STR, partial reads containing one flank and part of the STR, repeat-only reads consisting solely of repeat sequence, and paired-end reads that together cover a STR region (Fig. 7). From each type of read, information about the STR length can be extracted. This information is combined in a probabilistic model from which the STR size is inferred.
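The sketch below gives a strongly simplified, hypothetical flavor of such a model rather than TREDPARSE's actual implementation: each evidence type contributes to a likelihood over candidate repeat numbers, and the best-supported candidate is reported. All values and weights are invented for illustration.

```python
def combined_log_likelihood(candidate, spanning_counts, partial_minima, insert_estimates,
                            sigma=1.0):
    """Toy log-likelihood of a candidate repeat number given three evidence types.

    spanning_counts  : repeat numbers counted directly in spanning reads
    partial_minima   : lower bounds on the repeat number implied by partial reads
    insert_estimates : repeat-number estimates derived from paired-end insert sizes
    """
    loglik = 0.0
    # Spanning reads: treat each observed count as Gaussian noise around the candidate.
    for count in spanning_counts:
        loglik += -((count - candidate) ** 2) / (2 * sigma ** 2)
    # Partial reads only tell us the STR has at least this many repeat units.
    for minimum in partial_minima:
        if candidate < minimum:
            loglik += -10.0  # heavy penalty when the candidate is too short
    # Paired-end estimates are noisier, so they get a wider standard deviation.
    for estimate in insert_estimates:
        loglik += -((estimate - candidate) ** 2) / (2 * (2 * sigma) ** 2)
    return loglik


# Hypothetical evidence for one locus; report the best-supported repeat number.
spanning, partial, inserts = [12, 12, 13], [10, 11], [12.5, 11.8]
best = max(range(5, 31), key=lambda n: combined_log_likelihood(n, spanning, partial, inserts))
print(best)  # -> 12
```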

Mutation rate problems

Because STRs mutate so frequently, they can deviate so much from the reference genome that the alignment software tool fails to map the read to the reference due to insufficient similarity (Fungtammasan et al., 2015).

This problem is also solved by the flanking-based mapping method of lobSTR, STR-FM and popSTR. The flanking regions have a much lower mutation rate and therefore do not deviate much from the reference. Furthermore, the user can adjust the alignment parameters to their own liking. For lobSTR, the user can set the minimum number of bases required in both flanking regions to align a read and the number of mismatches allowed in each flanking region (Gymrek et al., 2012). For STR-FM, the user can likewise set the number of flanking bases required for alignment and the number of mismatches allowed, as well as the minimum Phred quality score of these bases (Fungtammasan et al., 2015). When using popSTR, in addition to the minimum number of bases required in each flanking region, the minimum alignment score for these regions can be adjusted (Kristmundsdóttir et al., 2016). In this way the user can decide how similar two sequences need to be for alignment, minimizing the number of incorrect mappings while avoiding that reads are discarded too quickly.
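A sketch of how such user-adjustable thresholds could work is given below. The function and parameter names are illustrative only and do not correspond to the actual command-line options of lobSTR, STR-FM or popSTR.

```python
def flank_passes(read_flank, ref_flank, base_qualities,
                 min_length=8, max_mismatches=1, min_phred=20):
    """Decide whether a flanking region is usable for anchoring a read.

    read_flank     : flanking sequence observed in the read
    ref_flank      : corresponding flanking sequence in the reference
    base_qualities : Phred quality scores of the flank bases in the read

    The thresholds mimic the kind of parameters these tools expose: minimum
    flank length, allowed mismatches and minimum base quality.
    """
    if len(read_flank) < min_length:
        return False
    if any(q < min_phred for q in base_qualities):
        return False
    mismatches = sum(1 for a, b in zip(read_flank, ref_flank) if a != b)
    return mismatches <= max_mismatches


# Hypothetical flank with one mismatch and good base qualities: accepted.
print(flank_passes("GATTACCGTA", "GATTACCGTT", [30] * 10))  # -> True
```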

STRViper also addresses this problem by mapping the paired-end reads outside the STR region; I could not find any documentation of alignment parameters that can be adjusted by the user. For TREDPARSE, four different types of reads are aligned. Mismatches and gaps are allowed, but how many, and whether these thresholds can be adjusted by the user, is not mentioned. It is also not mentioned whether the spanning reads, the partial reads and the repeat-only reads are aligned in their entirety or only by their flanks. When mapped in their entirety, these reads might fail to align due to excessive dissimilarity to the reference. The paired-end reads will map as long as they fall in unique locations in the genome.

Another software tool for calling STR genotypes from whole-genome data not mentioned yet is RepeatSeq (Highnam et al., 2013). RepeatSeq is one of the tools that use aligners such as Novoalign or Bowtie to map the reads to a reference, after which RepeatSeq filters out reads not fully encompassing a STR and infers the most probable genotype from the mapped reads. However, because the reads are mapped in their entirety, RepeatSeq is vulnerable to the problem of reads deviating too much from the reference to align.

The high mutation rate of STRs is caused by DNA slippage during replication (Gemayel et al., 2010). Just as these software pipelines all have a way to cope with reads deviating strongly from the reference, they also need to take variation caused by DNA slippage during PCR into account when calling STR alleles, since such variation is not true variation present in the sequenced DNA. Most of the software tools mentioned above have an error model for this event, but each model differs slightly in which properties of STRs it takes into account.

The mutation rate of STRs ranges from 10^-2 to 10^-4 per STR per cellular generation (Weber & Wong, 1993). This mutation rate varies per STR depending on four intrinsic properties: the length of the STR, the repeat unit length, the repeat unit motif and the purity (interruptions) of the STR (Fungtammasan et al., 2015; Gemayel et al., 2010; Legendre, Pochet, Pak, & Verstrepen, 2007). STR length has the biggest impact on the mutation rate, with slippage events increasing for STRs with more repeat units, independent of the length of the repeat unit (Fungtammasan et al., 2015; Shinde, Lai, Sun, & Arnheim, 2003). The purity of a STR is also positively correlated with DNA slippage (Legendre et al., 2007). The repeat unit length, in turn, is negatively correlated with the slippage rate, with tetranucleotide STRs showing the least stutter and mono- and dinucleotide STRs the most (Fungtammasan et al., 2015; Shinde et al., 2003). Finally, the base composition of the repeat unit also affects DNA slippage: the GC (guanine and cytosine) content of the STR is positively correlated with slippage events (Walser & Furano, 2010).

Because these factors all influence the mutation rate of a STR, an error correction model should take them into account when determining the true STR length. In that way, false alleles can be better distinguished from true alleles and true genotypes. However, most correction models of the STR software tools take only one or two of these factors into consideration. lobSTR uses a statistical learning approach that models stutter noise created by PCR slippage to estimate the probability that a read is a product of DNA slippage, based solely on the length of the read and thus the STR length (Gymrek et al., 2012). TREDPARSE also estimates the probability that a read is a product of DNA slippage, using a correction model trained by lobSTR, but additionally considers the GC content of the reads (Tang et al., 2017). When a read is found to be a product of DNA slippage, these models estimate by how many bases the read deviates from the true STR length. The article presenting STR-FM mentions and studies all four properties influencing slippage events, but the correction model incorporated in STR-FM uses an error profile based only on repeat number and repeat unit motif (Fungtammasan et al., 2015). The same holds for the article on RepeatSeq, which discusses the four factors influencing slippage events but considers only STR length and repeat unit length in its approach to find the most probable genotype (Highnam et al., 2013). In addition to these two STR properties, RepeatSeq also considers the quality of the mapped reads to enhance genotyping accuracy. PopSTR likewise considers additional factors in its error model (Kristmundsdóttir et al., 2016): besides the STR length and STR purity, it takes into account the mapping quality of the entire read and of the flanking regions separately, as well as the Phred quality scores of the bases in the flanking regions. It seems that only STRViper can incorporate all four intrinsic properties of STRs influencing DNA slippage events. STRViper uses a Bayesian method to call STR variation (Cao et al., 2014), and with a Bayesian method different sources of information can be incorporated in the prior, to the user's liking.
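To make the differences between these correction models concrete, the sketch below shows a toy stutter-probability model in which all four intrinsic STR properties contribute to the estimated chance that an observed read is a slippage artifact. The functional form and the coefficients are invented for illustration and are not taken from any of the published tools.

```python
import math


def stutter_probability(repeat_number, unit_length, purity, gc_content):
    """Toy logistic model for the probability that a read shows PCR stutter.

    repeat_number : number of repeat units in the STR
    unit_length   : length of the repeat unit (1 = mono-, 4 = tetranucleotide)
    purity        : fraction of the STR matching the pure repeat motif (0-1)
    gc_content    : GC fraction of the repeat unit (0-1)

    The signs follow the relations described in the text: longer, purer and
    GC-richer STRs slip more, while longer repeat units slip less. The
    coefficients themselves are made up purely for illustration.
    """
    score = (-4.0
             + 0.15 * repeat_number   # longer STRs: more slippage
             - 0.50 * unit_length     # longer repeat units: less slippage
             + 2.00 * purity          # purer STRs: more slippage
             + 1.00 * gc_content)     # higher GC content: more slippage
    return 1.0 / (1.0 + math.exp(-score))


# A long, pure dinucleotide repeat stutters more than a short tetranucleotide one.
print(round(stutter_probability(20, 2, 1.0, 0.5), 2))  # higher probability
print(round(stutter_probability(8, 4, 0.9, 0.5), 2))   # lower probability
```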

7.3.2 Comparing the STR extraction tools

All in all, the STR analysis tools employ multiple methods for aligning STRs and calling STR genotypes accurately. The developers of these tools have also compared each other's performance by running the tools on real or simulated NGS data.

RepeatSeq, released shortly after lobSTR, compared the alignment capabilities of lobSTR to Novoalign, an alignment tool RepeatSeq can use before genotyping the STRs (Highnam et al., 2013). When both tools aligned a human genome from the 1000 Genomes Project to a reference, Novoalign mapped far more reads (93.9%) than lobSTR did (2.59%), which is expected since lobSTR filters out non-informative reads before alignment (Gymrek et al., 2012) while RepeatSeq does so after alignment (Highnam et al., 2013). For the subsequent genotyping, however, RepeatSeq genotyped far more repeats (90%) than lobSTR (<3%). This large difference is likely caused by the large difference in reads mapped by Novoalign and lobSTR. The study shows that lobSTR loses many reads somewhere in its pipeline, possibly because its filtering for informative reads is much more stringent than that of RepeatSeq, or because its alignment method fails to map many informative reads.

The developers of lobSTR also compared the alignment capabilities of lobSTR to Novoalign (Gymrek et al., 2012). These results show a much smaller difference in reads aligned, but the percentage of informative reads aligned is again very small: 0.10% for lobSTR and 0.09% for Novoalign. Somewhere in the algorithm and its settings many reads are thus lost, which is a waste of the data.

The developers of RepeatSeq do mention that lobSTR will probably show better results for STRs with extreme allelic variation compared to the reference (Highnam et al., 2013). This is likely because RepeatSeq has its reads aligned in their entirety.

STRViper was released one year after lobSTR and RepeatSeq, and its developers compared the performance of the three tools on simulated paired-end sequencing data of three genomes (Cao et al., 2014). These results show that at a sequencing depth of 40x and higher, STRViper calls the most variants, followed by lobSTR, with RepeatSeq calling the fewest. However, STRViper's confidence in its variant calls depends on sequencing depth, with accuracy increasing at higher depth. At depths below 40x, STRViper's performance deteriorates, while lobSTR and RepeatSeq have been shown to work well with lower-coverage data (Gymrek et al., 2012; Highnam et al., 2013). The running times of the three software pipelines at different sequencing depths are shown in Table 5. RepeatSeq is remarkably fast, and I doubt whether these values are correct or should be in hours instead of minutes. STRViper and lobSTR do not deviate much from each other.


Table 5 | Running times of STRViper, lobSTR and RepeatSeq at different sequencing depths, in minutes unless noted otherwise (Cao et al., 2014).

            10x     40x     100x
STRViper    3.86    15.50   36.50
lobSTR      8.70    34.52   1.55 hours
RepeatSeq   0.50    1.02    1.45

The developers of popSTR also compared its genotype calls with those of lobSTR and determined their error rates against genotypes generated by capillary electrophoresis (Kristmundsdóttir et al., 2016). For this comparison, whole-genome sequencing data of ten individuals were analyzed. The results show that at sequencing depths below 25x popSTR has a higher accuracy, while from 25x upwards their accuracies are the same (Fig. 8). When comparing the called genotypes to those obtained by capillary electrophoresis, popSTR showed a 50% lower error rate than lobSTR. The running times of popSTR and lobSTR were also compared: popSTR took 9.9 hours to analyze the sequenced genome of one individual, for which lobSTR needed 39.2 hours. This seems like a big difference, but popSTR does not perform its own alignment as lobSTR does. Without the alignment step, lobSTR takes only 0.1 hour to analyze one human genome. It is not mentioned how long the alignment software used by popSTR needed to align the genome, nor what the sequencing depth for this experiment was.

Figure 8 | The accuracy of lobSTR and popSTR as a function of the number of reads overlapping the STR.
