
ESSAY

Improving somatic mutation detection in cancer

Wouter Huiting, MSc 1769863

13-8-2016

Supervisor: Dr. C. van Diemen


Contents

Preface

Abstract

1. Introduction
1.1 The challenges of somatic variant calling in cancer
1.2 Goal of this essay

2. Optimizing the workflow of tumour variant calling
2.1 DNA extraction and shearing
2.2 Library preparation
2.3 High-throughput sequencing: base-calling, depth and coverage
2.4 Read quality assessment and pre-processing
2.5 Read mapping
2.6 Variant calling, annotation and prioritization

3. Discussion

4. References


Preface

This essay was written in the spring of 2016 in the context of my Master's programme in Behavioral and Cognitive Neuroscience at the University of Groningen. The writing of this essay was supervised by Dr. C. van Diemen, whom I would like to thank for the opportunity.


Abstract

Somatic variation analysis is important to help us understand the onset and progression of cancer. Unfortunately, although next-generation sequencing technology has advanced rapidly over the last decade, most NGS strategies still prove inadequate to accurately grasp the molecular complexity of cancer, a fact that is largely the result of intratumour heterogeneity. In addition, the high throughput of modern NGS platforms has left us with the difficult task of managing and analysing vast amounts of data, forcing researchers to rely heavily on bioinformatics. Advances in experimental and computational techniques designed to cope with these challenges occur quickly, resulting in a rapidly evolving workflow of somatic tumour variant analysis. It is important that the entire cancer research community is informed regularly about these advances. Here I try to provide an overview of the complete workflow of tumour variant analysis in a way that is relevant to all those involved in cancer research. In addition, I highlight several weak links in this workflow and provide recommendations on how to cope with them.


1. Introduction

Since the dawn of next-generation sequencing (NGS) more than a decade ago, the sequencing technology has evolved immensely, driving speed and throughput to an unprecedented level (Goodwin et al. 2016). The financial costs of sequencing an entire human genome have been brought down from roughly US$10 million to little over a thousand dollars in just a few years' time (https://www.genome.gov/27541954/dna-sequencing-costs-data). NGS technologies are now widely used in laboratories around the world, and many acclaim their potential as a diagnostic tool (Katsanis et al. 2013; Vrijenhoek et al. 2015). The development and widespread adoption of NGS technology would not have been possible without the efforts of Frederick Sanger and his colleagues almost forty years ago. In 1977, Sanger introduced his ground-breaking 'chain-termination' sequencing technique, which was subsequently automated. Due to its high accuracy (e.g. through long read lengths) and ease of use, Sanger sequencing would remain the dominant DNA sequencing technology for decades. Around the turn of the millennium, fundamentally different sequencing methods started to spring up, collectively referred to as 'next-generation sequencing' (NGS) techniques. Aided by the development of new technologies such as high-resolution imaging, these methods offered several advantages over Sanger sequencing. Most importantly, they all allowed mass parallelisation of sequencing, greatly improving throughput (Shendure, Ji. 2008). See Box 1 for an overview of the principles behind Sanger and modern NGS platforms; an in-depth overview lies beyond the scope of this essay (for an excellent recent review see for example Goodwin et al. 2016).

One field that has particularly benefited from this development is cancer research. Cancer comprises a group of diseases that arise from 'a clone that has accumulated the requisite somatically-acquired genetic aberrations, leading to malignant transformation' (Stratton. 2011; Watson et al. 2013). The development of effective therapies for cancer patients requires a comprehensive assessment of the role of these somatic variants in tumour formation (Wang et al. 2013). Before the advent of NGS, studies relied on a costly and low-throughput workflow of PCR amplification followed by Sanger sequencing to identify candidate cancer drivers. The widespread adoption of NGS technologies meant that the cancer research field could initiate systematic sequencing 'screens' to identify such somatic variants much more effectively. NGS studies have since contributed greatly to our understanding of the mutations that drive tumourigenesis in different tumour types (Watson et al. 2013). However, accurate somatic mutation 'calling' still remains highly challenging.


1.1 The challenges of somatic variant calling in cancer

True somatic variants are very difficult to distinguish from artefacts. Besides errors in the sequencing process itself (e.g. contamination and base-calling errors), there are issues inherent to the short-read nature of NGS: amplification bias and ambiguities in short-read mapping, to name a few (Wang et al. 2013). However, most of these problems arise in every NGS project. What makes somatic cancer variants particularly difficult to identify is intratumour heterogeneity (Fidler. 1978; Marusyk et al. 2012). As tumours evolve from single cells, clonal lineages begin to diverge, giving rise to distinct subpopulations within the tumour (Gerlinger et al. 2012). It is this genomic diversity that drives tumour proliferation, enabling cancer cells to survive a range of selective pressures from the tumour's microenvironment: pH, hypoxia, therapy and others (Davis, Navin. 2016). The result is a highly resistant tumour in which variants are non-uniformly present (Landau et al. 2013), with variant allele frequencies (VAFs) as low as 5% having been reported (Carter et al. 2012). Several sequencing strategies have been employed to disentangle this heterogeneity, and they can be divided into two types. Methods of the first type sample from different places in the tumour and sequence them simultaneously. This can be done with different macroscopic regions of the tumour mass, a method referred to as multiregion sequencing (Gerlinger et al. 2012; Yates et al. 2015), but also on a microscopic scale using a range of single cells (Navin et al. 2011; Xu et al. 2012). A related technique is to fluorescently label and sort cells from the tumour to obtain homogeneous subpopulations of cancer cells (Bolognesi et al. 2016). The latter method has the added benefit of removing healthy cells from further analysis. The second strategy is to sequence ultra-deep (Nik-Zainal et al. 2012). As I will discuss later, this strategy has shown great promise of substantially improving the identification of tumour variants, even those with very low VAFs (Griffith et al. 2015).

Although choosing the appropriate sequencing method is key, other aspects have to be taken into account as well. Among them are practical considerations, for example the choice of a valid control input to be able to accurately distinguish germline variation from true somatic variants. Others are more strategic, like the choice to include or omit a PCR amplification step. As we will see further down, all of these can influence the outcome of the tumour variant analysis. One element in which there is still significant room for improvement is the use of bioinformatics tools in the various data analysis steps of somatic variant calling. Modern NGS platforms generate large amounts of heterogeneous data, with higher error rates and (generally) shorter read lengths compared to Sanger sequencing platforms (see Box 1c). Because of this, NGS poses considerable challenges for data management and computational analysis (Schadt et al. 2010). Consequently, somatic variant analysis is forced to rely heavily on bioinformatics (Pabinger et al. 2013). A wide range of algorithms has been written to aid in specific parts of the data analysis (Li, Homer. 2010; Bao et al. 2014), but their sheer number makes choosing the right toolset a strenuous task, especially for the less experienced user (Pabinger et al. 2013).

1.2 Goal of this essay

Further advances in somatic variation analysis are important to help us understand the onset and progression of cancer. As existing sequencing technologies are constantly improved, and new technologies emerge at a fast pace, it is key that the cancer research field is informed regularly about new methods and tools. Excellent reviews on how to optimize variant calling are indeed written periodically, but these often highlight only one aspect of the entire workflow. Moreover, the majority of these reviews is directed at geneticists and bioinformaticians (see for instance Alioto et al. 2015 or Griffith et al. 2015). This makes it challenging for clinical researchers and cell biologists to keep track of the advances in somatic variant calling methods in cancer. As NGS techniques and the associated data analysis are becoming more and more complex, the danger arises of a loss of crosstalk between bioinformatics and (clinically driven) biological research. This can result in a knowledge gap that will directly affect the clinical impact of NGS. Such a gap could prevent clinicians and researchers from smaller labs from exploiting the power of NGS technology in the future. To prevent such a schism it is crucial that we bring these fields closer together. For these reasons, I here present an interdisciplinary review of the entire workflow of tumour variant calling, starting from the bench-work. For clarity, I focus on several elements and principles that arguably have the largest effect on the accuracy of somatic variant calls, including library preparation, sequencing depth and coverage, and the choice of mappers and callers; other factors will be covered more briefly.

Box 1 (continued)

d. Sanger and Illumina sequencing

(I) Sanger sequencing workflow. (II-IV) General workflow of Illumina sequencing, the current market leader in NGS platforms. See also Box 1a. Images taken from Mardis, 2013. For additional information, the reader is referred to the excellent recent review by Goodwin et al. (2016).



2. Optimizing the workflow of tumour variant calling

2.1 DNA extraction and shearing

Tumour DNA is typically obtained from formalin-fixed, paraffin-embedded (FFPE) tumour samples, the standard preservation format for diagnostic surgical pathology (Kokkat et al. 2013). In parallel to sampling the tumour, DNA is extracted from peripheral blood to serve as a 'normal' genome. This allows true somatic variants in the tumour to be discriminated from the host's germline variants later in the workflow (Shah et al. 2009; see also Figure 2). If no peripheral blood is available, DNA from normal tissue surrounding the tumour in the FFPE sample is sometimes taken as a control (Pleasance et al. 2010). Importantly, choosing the proper source for a healthy control is no trivial task. For instance, peripheral blood often contains circulating tumour cells (Pantel, Speicher. 2015), and tissue surrounding the tumour may appear healthy but can in fact harbour cells with significant genomic defects (Sadanandam et al. 2012; Troester et al. 2016). Both situations are unwanted, as they prevent the accurate identification of tumour-related somatic alterations. Another potential problem stems from the fact that the fixation and embedding needed to make FFPE tumour samples can damage the DNA (Ben-Ezra et al. 1991; Williams et al. 1999). This not only reduces the amount of DNA available for sequencing, it can also lead to inaccurate variant calling (Akbari et al. 2005). Pre-analytical molecular sample characterization has been proposed to correct for these problems (Sah et al. 2013).

Next, the obtained DNA has to be sheared into the short fragments - typically tens to hundreds of nucleotides long - required for high-throughput NGS (Poptsova et al. 2014). Although several DNA shearing methods exist, all of them rely on one of two principles: shearing by mechanical force or shearing through enzymatic fragmentation (Knierim et al. 2011). Few studies have investigated the effects of the various DNA fragmentation protocols on variant calling accuracy, as it has long been assumed that shearing occurs in a random (i.e. unbiased) manner. Indeed, there is data supporting the idea that DNA fragmentation is random, regardless of the method employed, leading some to suggest that 'a fragmentation method can be chosen solely according to lab facilities, feasibility and experimental design' (Knierim et al. 2011). However, several studies suggest that DNA shearing is in fact subject to a 'fragmentation bias', as both enzymatic and mechanical fragmentation were shown to be sequence specific (Hansen et al. 2010; Grokhovsky et al. 2011). In an interesting study from 2014, Poptsova and coworkers showed that ultrasound shearing of genomic DNA can cause an amplified (i.e. higher than chance) cleavage of GC-rich areas, likely as a result of local variations in DNA structural dynamics. Regardless of the exact sequence that is preferred, a fragmentation bias will result in a non-random distribution of fragment lengths and sequence ends, which in turn leads to a non-uniform read coverage after the alignment step (see section 2.3; Finotello et al. 2014). It is important that sequencing studies are aware of this problem and try to correct for it. The solution could be a two-step fragmentation protocol, although that has only been validated for ChIP-seq (Mokry et al. 2010). Alternatively, it might be possible to negate the fragmentation bias by adding specific chemical agents in the shearing step (Grokhovsky et al. 2011).

2.2 Library preparation

An important principle of library preparation is 'library complexity', the number of unique fragments present in the library. The main goal when preparing any sequencing library is to make it as complex (i.e. diverse) as possible, so that it accurately reflects the complexity of the original genetic sequence (Head et al. 2014), while at the same time avoiding bias (Van Dijk et al. 2014).

Once the dsDNA is properly fragmented, the sequencing library can be made. In this process oligonucleotide adapters (specific to the NGS platform used) are attached to the ends of the fragments, preparing them for sequencing. Fragment ends must first be processed to generate the blunt-ended fragments required for adapter ligation (Head et al. 2014). End-processing is a two-step process that starts with the enzymatic blunting and 5' phosphorylation of both ends of the fragment, followed by the addition of an adenine nucleotide to the 3' ends. This A-tail not only reduces the risk of fragment chimeras, but also facilitates the ligation of the T-tailed adapter oligonucleotides (Quail et al. 2008). This is a crucial step in the variant calling workflow, as the use of different protocols can result in marked variation in variant calling effectiveness (see Rhodes et al. 2014; Alioto et al. 2015).

After end-repair and A-tailing, adapters are ligated to the fragments. These adapters are a crucial part of library preparation as they hybridize to sequencing primers during the sequencing reaction. Importantly, the choice of adapters depends not only on the NGS platform used, but also on whether single-end or paired-end reads are pursued (see further down). As a result of imperfect end-repair or A-tailing, artefacts can arise in the library, including adapter dimers. As these dimers form clusters very efficiently during sequencing, they take up valuable space and thus waste the capacity of the sequencing platform (Head et al. 2014). A size-selection step right after shearing (but before adapter ligation) has been proposed as a means of reducing the effect of these adapter artefacts (Quail et al. 2009). This also yields a tighter distribution of fragment sizes, resulting in a more homogeneous PCR amplification, which in turn provides a more uniform read coverage after the alignment step (see section 2.3; Quail et al. 2009).

Standard NGS library preparation protocols then rely on a PCR step to amplify the library (i.e. enriching for properly ligated fragments) before sequencing (Kebschull, Zador. 2015). The need for this stems largely from the notion that libraries should be carefully quantified before sequencing commences (see further down), and most quantification protocols require large amounts of DNA to ensure accurate titration (Meyer et al. 2008; Parkinson et al. 2012). Unfortunately, PCR is an inherently biased process: it not only skews data (and thus reduces library complexity), but it can also introduce hybrid or erroneous sequences into the library (Aird et al. 2011). Different factors are thought to underlie these PCR-induced imperfections, among which fragment length, content-dependent amplification of sequences, template switching and even the intrinsic stochasticity of PCR (Kebschull, Zador. 2015). Variation in the PCR parameters (temperature, polymerase, buffer) plays a substantial role in this, and the effects can be exacerbated with every PCR cycle (Dabney, Meyer. 2012; Meyer, Liu. 2014). Because of these issues with PCR amplification, some studies opt to exclude this step from library preparation (Kozarewa et al. 2009; Quail et al. 2009). The idea behind this is that eliminating PCR amplification improves the coverage of regions with a high GC content and reduces the amount of duplicate reads after sequencing (Kozarewa et al. 2009). Omitting PCR indeed results in a more homogeneous distribution of reads and can thus improve variant detection in cancer (Alioto et al. 2015). However, a PCR-free approach does require a larger amount of starting material, and as FFPE samples generally yield small quantities of DNA, a PCR-free approach to identify somatic tumour variants might not always be possible (Luthra et al. 2015).

The last step of library preparation is quantification, during which the precise amount of adapter-ligated fragments present in the library is determined. This allows the researcher to load the correct amount of sample onto the sequencing station (Liu et al. 2012; Loman et al. 2012), which is important, as sequencing experiments performed with too many or too few correctly ligated library fragments can yield poor data quality (Laurie et al. 2013). Several methods for library quantification exist, but the most widespread is real-time quantitative PCR (qPCR), mainly owing to its ability to assess only the amount of adapter-ligated fragments (Buehler et al. 2010). Unfortunately, qPCR also has considerable disadvantages that can compromise variant calling. Template size and sequence content can result in an amplification bias, just as with a regular PCR (Valasek, Repa. 2005). In addition, qPCR demands that a standard curve is created for each sample, a laborious process that is prone to inaccuracies (Yun et al. 2006). To overcome these problems White and coworkers developed a digital emulsion PCR method to quantify libraries (White et al. 2009). This method is based on the (massively parallel) fluorescent detection of a probe oligonucleotide (e.g. TaqMan) added to the adapter-ligated fragments. A droplet digital PCR (ddPCR) device (e.g. QX100®, Bio-Rad) first generates thousands of droplets; most are empty, but some contain one or more template molecules of DNA. Next, droplets are counted and assessed for fluorescence (all or nothing, so a binary read-out), after which the total number of input molecules can be calculated (White et al. 2009). ddPCR offers superior sensitivity and stability over conventional qPCR quantification (Robin et al. 2016).
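The calculation behind that binary read-out is a Poisson correction: because a single droplet can hold more than one template molecule, the fraction of negative droplets determines the mean occupancy per droplet. The sketch below illustrates the arithmetic only; the droplet counts are invented, and the ~0.85 nl droplet volume is an assumed nominal value for Bio-Rad's QX droplet generators, not taken from this essay's sources.

```python
import math

def ddpcr_concentration(n_positive: int, n_total: int,
                        droplet_volume_nl: float = 0.85) -> float:
    """Estimate template concentration (copies per microlitre) from the
    binary droplet read-out of a ddPCR run.

    A droplet can contain more than one molecule, so the positive fraction
    is Poisson-corrected: the mean number of molecules per droplet is
    lambda = -ln(fraction of negative droplets).
    """
    negative_fraction = (n_total - n_positive) / n_total
    lam = -math.log(negative_fraction)         # mean molecules per droplet
    return lam / (droplet_volume_nl * 1e-3)    # copies per microlitre

# Illustrative numbers: 4,500 positive droplets out of 15,000 counted
print(f"{ddpcr_concentration(4500, 15000):.0f} copies/ul")
```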

2.3 High-throughput sequencing: base-calling, depth and coverage

While new NGS instruments are being developed at an astonishing pace (Goodwin et al. 2016), the accuracy and speed of the main NGS platforms currently in use are also constantly being improved. Several elements are of particular importance in this respect.

Base-calling algorithms turn the sequencer's raw output (fluorescence in the case of Illumina, current changes in the case of Thermo Fisher) into a base call. However, due to inevitable imperfections in sequencing chemistry and signal detection, errors in base-calling can arise. For instance, Illumina technology suffers from a number of biases owing to its technology, including phasing (and prephasing), signal decay and cross-talk (see Figure 1; Cacho et al. 2015). The standard Illumina base-calling algorithm, 'Bustard', reduces the effects of these uncertainties on base-call accuracy by explicitly modelling these biases. However, the error rates in Bustard's calls can still be significantly improved (by up to 30%; Nielsen et al. 2011) by more sophisticated algorithms like BlindCall and freeIbis (Das, Vikalo. 2013; Renaud et al. 2013; Ye et al. 2014). An in-depth description of the mathematical principles used by these base-callers lies far beyond the scope of this report, but an excellent review of these algorithms was recently written by Cacho and coworkers (2015). Importantly, the use of these advanced statistical tools instead of Bustard was shown to significantly reduce false positive SNP calls in tumour variant analysis (Nielsen et al. 2011).

Two other concepts that are crucial for accurate somatic mutation detection in cancer are sequencing depth and coverage. Depth and coverage are two highly related terms that are frequently used interchangeably. Some use coverage to describe the breadth of sequence coverage, i.e. the percentage of the target genome that is sequenced a given number of times. Depth can then be thought of as the 'redundancy of coverage' (Sims et al. 2014), often denoted as n x (e.g. 30x depth). The importance of sequencing depth becomes instantly clear when one considers the internal error rate of high-throughput, short-read sequencing - without sufficient depth, it is impossible to distinguish sequencing mistakes from real sequence variants. In addition, a uniform coverage is needed to eliminate the underrepresentation of SNPs in specific regions of the genome (for instance regions with high GC content). Accordingly, an increased depth and uniformity of coverage can rescue sequencing errors (Sims et al. 2014).
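As a back-of-the-envelope illustration of how depth relates to sequencing effort, the expected average depth follows the standard Lander-Waterman relation C = L x N / G (read length times number of reads, divided by genome size). The numbers in this minimal sketch are illustrative, not drawn from the studies cited here.

```python
def expected_depth(read_length_bp: int, n_reads: int, genome_size_bp: int) -> float:
    """Lander-Waterman estimate of average sequencing depth: C = L * N / G."""
    return read_length_bp * n_reads / genome_size_bp

# Roughly 30x: 600 million 150 bp reads over a ~3.1 Gb human genome
print(f"{expected_depth(150, 600_000_000, 3_100_000_000):.1f}x")
```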


Figure 1. Commonly modelled base-calling errors for the Illumina platform

(a) Scaled cytosine (C) intensity versus cycle of a single read. A spike indicates a potential C nucleotide at that position. Phasing is observed as an anticipation signal in the cycle before a C (left arrow) and after it (right arrows). It occurs during the sequencing process when one or more strands within a cluster fail to incorporate the next base in the read. These reads start lagging behind, distorting the fluorescence emissions. Prephasing occurs when two bases are incorporated in a single cycle. (b) Maximum intensity (signal) and median intensity (noise) plotted against cycle. During the sequencing of the complementary strand, some material may be lost, causing a decreased signal-to-noise ratio known as signal decay. (c) Intensity versus fluorophore emission spectrum. The spectrum of the guanine (G) fluorophore bleeds into the optimal spectrum of the thymine (T) filter. Thus, when a G fluorophore is excited, a T signal will also be detected. This causes a positive correlation between the intensities of these two channels, a phenomenon known as cross-talk. (d) Two-dimensional histogram of intensity data of the T channel versus the G channel. The G fluorophores (right arrow) transmit to the T channel, hence the positive linearity. However, the T fluorophores do not transmit to the G channel. A similar situation occurs with the A and C channels (not shown). Figure and text adapted from Cacho et al. 2015.

For the accurate detection of germline single nucleotide variants (SNPs), a 30x average depth across 95% of the genome was shown to be sufficient (Ajay et al. 2011). Most cancer genomes (and 'normal' control genomes) are sequenced to comparable depths (Mardis et al. 2012; Borad et al. 2014), as previous studies indicated that depths of 15x-50x are sufficient to detect all SNPs and small indels (Bentley et al. 2008; Ajay et al. 2011). However, these estimates are largely based on high-purity tumours, while in fact most tumours exhibit severe heterogeneity (as discussed earlier). As a result, a somatic tumour variant can have a VAF of 5% (Biankin et al. 2012) or even lower. Such rare variants are unlikely to be picked up with a sequencing depth of 30x. A recent study by Griffith and coworkers (2015) showed that a 30x-50x depth for whole genome sequencing is indeed insufficient for adequate variant identification in the face of sample contamination, aneuploidy or even moderate intratumour heterogeneity. Instead, they recommend a depth of 500x-1000x for the discovery of novel variants, especially those with VAFs < 10%. Performing such ultra-deep sequencing (Wagle et al. 2012) of both the tumour and normal genomes on current NGS platforms is extremely costly, and so a 100x-300x depth has been proposed as a compromise (Alioto et al. 2015). In addition, it is important that the ratio of tumour : normal depth is kept as close to one as possible, and at least within a 10% range, as this appears to reduce the amount of false positives (Alioto et al. 2015).
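The intuition that 30x is too shallow for a 5% VAF variant can be made concrete with a simple binomial model: the probability of sampling at least a handful of variant-supporting reads rises steeply with depth. The sketch below is only an optimistic upper bound - it ignores sequencing error and mapping artefacts - and the threshold of three supporting reads is an illustrative assumption, not a value from the cited studies.

```python
from math import comb

def p_detect(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """Probability of sampling at least `min_alt_reads` variant-supporting
    reads at a site, assuming reads are drawn binomially at the given
    variant allele frequency (VAF)."""
    p_below = sum(comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
                  for k in range(min_alt_reads))
    return 1 - p_below

for depth in (30, 100, 300, 1000):
    print(f"{depth:>4}x: P(>=3 reads support a 5% VAF variant) = "
          f"{p_detect(depth, 0.05):.2f}")
```

At 30x the chance of even three supporting reads is well under 50%, while at several hundredfold depth it approaches certainty, in line with the depth recommendations above.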

2.4 Read quality assessment and pre-processing

Upon completion of the desired number of sequencing runs, the output first needs to be evaluated for its quality (Pabinger et al. 2013). Modern high-throughput sequencing platforms produce many millions of short DNA reads with every run (Goodwin et al. 2016). Importantly, not all of these reads meet predefined quality standards, as they generally contain sequencing artefacts (Dai et al. 2010). These errors have to be removed by trimming and filtering the reads. A range of tools has been developed to execute the different steps of quality assessment and subsequent pre-processing (Pabinger et al. 2013). As this is a complex process to grasp fully without prior bioinformatics training, only the basic principles will be discussed here.

The first step entails the visualization of the base quality scores included in the output of the NGS platform. The output (in the text-based FASTQ format) contains not only the predicted sequence, but also, for every base, the estimated probability of an erroneous call (as discussed earlier). These error probabilities (P), ranging from roughly 10% to 0.0001%, are converted into standard quality scores or 'Phred scores' by calculating -10 log10(P). As a result, the Phred scores form an array that is equal in length to the array of base calls, with values typically ranging from 10 to 40 (stored in ASCII format, see http://blog.nextgenetics.net/?e=33). Note that a 10% error rate in base calling translates into a Phred score of 10; higher Phred scores mean higher estimated accuracy (Nielsen et al. 2011). Programs like FastQC (Andrews. 2010) process these output files and produce graphical summary reports, allowing one to quickly assess the quality of the data.
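The conversion and the ASCII encoding are compact enough to show directly. A minimal sketch; the example quality string is invented, and the offset of 33 is the convention used by Sanger-style FASTQ and modern Illumina output.

```python
from math import log10

def phred_score(error_probability: float) -> int:
    """Convert a base-call error probability P into a Phred score:
    Q = -10 * log10(P)."""
    return round(-10 * log10(error_probability))

def decode_quality_line(quality_line: str, offset: int = 33) -> list[int]:
    """Decode the ASCII-encoded quality line of a FASTQ record into
    per-base Phred scores."""
    return [ord(char) - offset for char in quality_line]

print(phred_score(0.10))             # 10 (1 error in 10 calls)
print(phred_score(0.001))            # 30 (1 error in 1,000 calls)
print(decode_quality_line("II?5+"))  # [40, 40, 30, 20, 10]
```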


Next, the reads are trimmed and filtered based on both the quality scores and sequence properties (Pabinger et al. 2013). This is important, as read quality is often not consistent over the entire length of a read (Huse et al. 2007; Dohm et al. 2008), and ignoring low-quality base calls hampers downstream variant calling accuracy (Olson et al. 2015). Trimmomatic, PRINSEQ and other tools can perform various trimming tasks, among which adapter and primer trimming and 3' and 5' low-quality stretch trimming. It is important to also remove reads that do not meet a minimum average base quality, as well as reads that fall outside the minimum or maximum read length thresholds. ClinQC integrates multiple quality control tools, making it highly useful for clinical research (Pandey et al. 2016). Importantly, performing these various trimming and filtering steps was shown to result in significant improvements in variant calling (Del Fabbro et al. 2013).
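To make the trimming and filtering logic concrete, here is a deliberately bare-bones sketch of two of the operations named above - 3' quality trimming and length/mean-quality filtering. It is a toy stand-in for what Trimmomatic and PRINSEQ do; the thresholds are invented for illustration.

```python
def trim_trailing(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Remove low-quality bases from the 3' end of a read - a bare-bones
    version of trailing-quality trimming."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def passes_filters(seq: str, quals: list[int],
                   min_length: int = 36, min_mean_q: float = 25.0) -> bool:
    """Discard reads that are too short or have a poor average base quality."""
    if not quals:
        return False
    return len(seq) >= min_length and sum(quals) / len(quals) >= min_mean_q

# Example: the last two bases of this read fall below Q20
seq, quals = trim_trailing("ACGTACGT", [38, 38, 35, 30, 28, 25, 12, 8])
print(seq, quals)                # ACGTAC [38, 38, 35, 30, 28, 25]
print(passes_filters(seq, quals))  # False: too short after trimming
```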

2.5 Read mapping

After pre-processing and quality assessment, the reads are ready for further downstream analysis. The classic approach used by many variant analysis projects is to align ('map') the reads of both the tumour and the normal sample to a validated human reference genome like GRCh37 or the newer GRCh38 (http://www.ncbi.nlm.nih.gov/project/genome/assembly/grc/human; see also Figure 2). This process, commonly referred to as 'resequencing', entails multiple complex bioinformatic steps (an alternative is to computationally 'stitch' the reads together, a process referred to as de novo assembly; see Box 2). These steps have to overcome technical hurdles that are collectively dubbed the 'read-mapping problem' (Trapnell, Salzberg. 2009). The core of this read-mapping problem is two-fold. In a practical sense, aligning billions of short sequences to a large genome requires highly efficient algorithms in the absence of extreme computational power (a desktop computer offers only so much memory). A more strategic problem stems from the complexity and heterogeneity of the human genome. Chromosomes are not simple arrays of nucleotides in which variation consists of the occasional SNP. On the contrary, the genomic sequence carries insertions and deletions (indels), translocations, inversions, duplications and copy number variants (CNVs) (Feuk et al. 2006), and these structural variants likely account for ten times more variation among human genomes than SNPs do (Pang et al. 2010). As a consequence, chromosomes differ from person to person (Baker. 2012), and in somatic tissue even from cell to cell (Astolfi et al. 2010; O'Huallachain et al. 2012). This heterogeneity is exacerbated greatly in cancer genomes due to their so-called 'mutator phenotype' (Albertson et al. 2009; Loeb. 2016).

While the short reads generated by modern high-throughput sequencers are well suited to picking up point mutations, they are difficult to work with in the face of (large) structural rearrangements (Trapnell, Salzberg. 2009). Indeed, NGS technologies have long been 'biased towards typing unique tags in the genome' (Baker. 2012).


Figure 2. Analysis of tumour and matched normal DNA

Distinguishing somatic tumour variants from germline variants requires the parallel sequencing of DNA from tumour tissue and DNA from normal tissue. Here, peripheral blood is used as a 'normal' (see also section 2.1). After sequencing, the reads are mapped to the human reference genome (in green). Discrepancies observed in both samples are germline variants (in this example heterozygous), whereas those observed only in the tumour sample are inferred to be somatic variants.

A range of alignment algorithms ('mappers') has been developed over the last years to tackle the read-mapping problem, including Bowtie (Langmead et al. 2009), BWA (Li, Durbin. 2009) and SOAP/SOAP2 (Li, Yu, Li et al. 2009), to name just a few. Instead of describing how each of these works and how it can be wielded to optimize tumour variant calling, I will provide some general focus points here. The first thing to realise when executing a tumour resequencing project is that it is very important to align the tumour and normal reads to an appropriate reference sequence (Alioto et al. 2015). At this moment the main source of human reference genomes is the Genome Reference Consortium (GRC). The GRC keeps updating the human reference assembly, as even the most recent build (GRCh38) contains gaps, particularly around the centromeres (Chaisson et al. 2015). As a result, rare reads that belong somewhere else are mapped to the wrong place in the genome (their true location is missing from the reference), leading to local read pile-ups, i.e. false positives. This phenomenon can be mitigated by providing a 'decoy' sequence in addition to the reference genome. A decoy is made up of sequences known to be absent from the reference; by including it, such reads are 'scavenged' away from the reference (Li. 2014). An extra control step is to filter the alignment for so-called 'blacklisted sites' in the genome. These are sites known to suffer from extensive read pile-up, and they should be excluded from further downstream analysis (Miga et al. 2015). Indeed, the use of decoy sequences and blacklists can reduce false positives in somatic mutation detection (Miga et al. 2015; Alioto et al. 2015). To reduce false positives it is also key that reads mapped with many errors are filtered out (Pabinger et al. 2013).
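A minimal sketch of the blacklist filtering step just described: candidate calls are dropped when they fall inside known problem intervals. The intervals and positions below are invented for illustration; real blacklists are distributed as genome-interval files.

```python
import bisect

# Hypothetical blacklist: per chromosome, sorted, non-overlapping
# (start, end) intervals; coordinates are invented for illustration.
BLACKLIST = {
    "chr1": [(121_500_000, 125_000_000)],
    "chr7": [(61_000_000, 62_000_000)],
}

def is_blacklisted(chrom: str, pos: int) -> bool:
    """True if a position falls inside a blacklisted interval."""
    intervals = BLACKLIST.get(chrom, [])
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]

candidate_calls = [("chr1", 123_000_000), ("chr1", 200_000_000), ("chr7", 61_500_000)]
kept = [call for call in candidate_calls if not is_blacklisted(*call)]
print(kept)  # only ('chr1', 200000000) survives
```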

It is also recommended to use paired-end or mate-pair reads (Pabinger et al. 2013). While in single-end sequencing short fragments are read only from one end, in paired-end sequencing both ends of longer fragments are read (Volik et al. 2003). The result is a collection of paired reads separated by a known distance, so relative positional data can be inferred in addition to the sequence itself. This technique has proven very useful in detecting so-called 'copy-neutral rearrangements' in cancer genomes (Bashir et al. 2008; Oesper et al. 2012). Paired-end sequencing also helps to map reads over repetitive regions more precisely* (Treangen et al. 2011).

Finally, it is important to choose the right alignment software. For this it is wise to consider the NGS platform that was used, as its output sometimes requires the use of specific algorithms (Luthra et al. 2015). However, it is far more important to choose an alignment tool based on the application at hand. Not only do the various algorithms often differ in their ability to pick up specific genetic alterations, but in addition, they tend to suffer from a significant trade-off between speed and accuracy (Ruffalo et al. 2011). Indeed, the use of different alignment tools can clearly impact variant calling (Griffith et al. 2015). For alignment in the context of a tumour variant workflow, the tools Bowtie2, Novoalign and GMAP appear to be valid choices, as they show a high accuracy in picking up a range of genomic variations. However, Novoalign is much slower than the other two, most likely because it is based on a different alignment algorithm (Bao et al. 2014). This speed-accuracy trade-off has to be considered, especially in a clinical setting. As choosing an appropriate alignment tool is clearly a strenuous task, it is recommended to use multiple alignment strategies in parallel (Griffith et al. 2015). By assuming that a consensus in alignment has a higher likelihood of being correct, one can increase the accuracy of the variant calling workflow (Goode et al. 2013).

*A thorough discussion of paired-end sequencing is not provided here; for an excellent review see for example Risca, Greenleaf (2015).

One post-alignment processing procedure that should also be mentioned here is the removal of PCR duplicates. PCR duplicates are reads of the exact same length and sequence identity that arise during library amplification, and they consequently align with the exact same mapping coordinates. As a result of PCR bias, some reads are amplified much more than others, resulting in heterogeneous coverage (Aird et al. 2011), as described earlier. To correct for this bias it is common practice to remove excess duplicate reads after the alignment step (DePristo et al. 2011). Importantly, duplicate removal should not be performed carelessly, as overcorrecting read counts can produce flawed variant calls (Zhou et al. 2014).
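As an illustration of the core idea, the sketch below collapses reads that share mapping coordinates, keeping one representative per group. It is a minimal, assumption-laden stand-in: reads are represented as plain dictionaries, and dedicated tools (e.g. Picard MarkDuplicates) apply the same idea with far more care, also handling read pairs and quality-based tie-breaking.

```python
from collections import defaultdict

def remove_duplicates(reads: list[dict]) -> list[dict]:
    """Collapse PCR duplicates: reads sharing a chromosome, 5' mapping
    position and strand are treated as copies of one original fragment,
    and only the copy with the highest mapping quality is kept."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)
    return [max(group, key=lambda r: r["mapq"]) for group in groups.values()]

reads = [
    {"name": "r1", "chrom": "chr2", "pos": 1000, "strand": "+", "mapq": 60},
    {"name": "r2", "chrom": "chr2", "pos": 1000, "strand": "+", "mapq": 37},  # duplicate of r1
    {"name": "r3", "chrom": "chr2", "pos": 1420, "strand": "-", "mapq": 60},
]
print([r["name"] for r in remove_duplicates(reads)])  # ['r1', 'r3']
```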

2.6 Variant calling, annotation and prioritization

The last step of the tumour variant analysis workflow is variant calling, followed by variant annotation and prioritization (Pabinger et al. 2013). By comparing the aligned reads of the tumour and normal samples with each other, and with a reference genome, a range of somatic tumour variants can be detected (see Figure 2). It is important that a distinction between germline and somatic variants is made, as these variants frequently play different roles in tumour development and progression (Pujana. 2014). Like read alignment, variant calling and annotation rely heavily on the use of bioinformatics - GATK and SAMtools are well known 'callers' (Li, Handsaker et al. 2009; McKenna et al. 2010), and some tools like Strelka (Saunders et al. 2012) are specifically designed to pick up somatic variants. These programs employ different algorithms to identify candidate variants (Altmann et al. 2012). Basic tools identify variants when the number of high-confidence base calls that disagree with the reference base exceeds a certain threshold; more refined tools also take into account strand bias and the quality of neighbouring base calls (Olson et al. 2015). Importantly, most commonly used variant callers appear ill-suited to handle ultra-deep sequencing data. This is likely the result of an 'over-training' of parameters and filtering procedures towards a 30x-40x tumour-normal pair (Griffith et al. 2015). As the performance of different variant callers can change in different settings (as a result of different algorithm parameters), 'the selection of an appropriate algorithm should be driven by each experiment's design' (Griffith et al. 2015). Recently a new variant caller, VarDict, was developed specifically for ultra-deep sequencing (Lai et al. 2016). Choosing the right caller is a crucial element in any variant analysis workflow: callers have to be stringent enough to control false positive calls, but not too stringent, as this will result in false negatives (Olson et al. 2015).
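To make the threshold idea above concrete, here is a toy single-position caller. It is purely illustrative, with invented thresholds; production callers such as GATK and Strelka rely on probabilistic models and many more filters than this.

```python
from collections import Counter

def naive_call(ref_base: str, pileup: str, quals: list[int],
               min_q: int = 20, min_alt_reads: int = 4, min_vaf: float = 0.10):
    """Toy threshold-based caller for one genomic position: count
    high-quality base calls that disagree with the reference and report
    the most frequent alternative allele if it clears both an absolute
    and a fractional threshold."""
    high_q = [b for b, q in zip(pileup.upper(), quals) if q >= min_q]
    alt_counts = Counter(b for b in high_q if b != ref_base.upper())
    if not alt_counts:
        return None
    alt, n_alt = alt_counts.most_common(1)[0]
    if n_alt >= min_alt_reads and n_alt / len(high_q) >= min_vaf:
        return alt, n_alt, len(high_q)
    return None

# 20 reads cover the site; 5 high-quality reads support a C>T change
print(naive_call("C", "CCCCCCCCCCCCCCCTTTTT", [30] * 20))  # ('T', 5, 20)
```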

Another factor that has to be taken into account when choosing a caller is the mapper used in the alignment step. Recently, Alioto and co-workers showed that certain mapper-caller combinations show a much higher compatibility than others (Alioto et al. 2015). If possible, it is recommended to use multiple variant callers to correct for this cross-talk, as this was shown to improve the performance of somatic variant calling (for the same reason as using multiple mappers; Bao et al. 2014; Alioto et al. 2015).
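In its simplest form, combining callers is a voting scheme over normalized variant records. A minimal sketch, with invented call sets from three hypothetical callers:

```python
from collections import Counter

def consensus_calls(call_sets: list[set], min_callers: int = 2) -> set:
    """Keep a variant only if at least `min_callers` of the supplied call
    sets report it; each call is a (chrom, pos, ref, alt) tuple."""
    votes = Counter(call for calls in call_sets for call in calls)
    return {call for call, n in votes.items() if n >= min_callers}

caller_a = {("chr17", 7578406, "C", "T"), ("chr12", 25398284, "C", "A")}
caller_b = {("chr17", 7578406, "C", "T")}
caller_c = {("chr17", 7578406, "C", "T"), ("chr1", 115258747, "C", "T")}
print(consensus_calls([caller_a, caller_b, caller_c]))
# only the variant reported by >= 2 callers survives
```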

After variants have been called, the data enters another bioinformatics pipeline in which variants are annotated for clinical relevance. The sheer amount of data generated during tumour variant analysis means that performing this step manually would be a long and difficult process (Dienstmann et al. 2014). For this reason, a range of software tools has again been developed to streamline the process. Together, these tools help to stepwise filter out calls that are known to be irrelevant while prioritizing those with the largest clinical significance. First, the less reliable and the common variant calls are removed, including those with low coverage, low quality and those supported by a low-confidence read alignment (Patel et al. 2014). The remaining variants can then be prioritized relative to the disease and the genomic context (Bao et al. 2014). In this step, the aim is to identify those somatic tumour variants that can 'confer diagnostic, prognostic, or treatment-related information' (Sukhai et al. 2016). For an overview of the tools developed to perform these steps see Pabinger et al. 2013 or Bao et al. 2014. A particularly powerful toolset in this respect is ANNOVAR. ANNOVAR integrates public databases that store detailed information on possible variants, including experimental and/or clinical evidence, as well as detailed genomics data from, for instance, the ENCODE project (Wang et al. 2010). By doing so, it offers a complete annotation and prioritization of tumour variants, helping clinical researchers to interpret the data and make an informed decision regarding treatment (Yang, Wang. 2015). In addition, it is important that findings are communicated to clinical oncology practice in a clear and timely fashion. A recent review by Dienstmann and coworkers not only describes a systematic approach to variant annotation and prioritization, but also proposes a structured format for cancer pathology reports (Dienstmann et al. 2014). It is initiatives like this that will help us exploit the enormous potential of big genomics data (Eisenstein. 2015) in the fight against cancer.


3. Discussion

In order to better understand cancer and tumourigenesis it is key that the somatic variants underlying the disease are well characterized. Tumour variant analysis is, however, no trivial task, in particular because of intratumour heterogeneity. At the same time, deciphering this heterogeneity is one of the primary goals of cancer research, as it is thought to be a driving force of tumour proliferation and resistance to therapy. Designing an efficient (and cost-effective) tumour variant analysis workflow is therefore challenging. Researchers not only have to decide how to perform the bench-work, but must also find the appropriate tools to support their specific NGS data analysis. It is important that all those involved understand how the various components of the somatic variant analysis workflow are executed, and where there is room for improvement. Indeed, errors that are picked up only in the final stages of variant calling might arise at the bench.

The overview provided here points out several key areas for improvement of the somatic tumour variant workflow (see also Table 1). Some of these recommendations are beginning to be followed up in cancer studies, for instance the use of combined analysis tools: although researchers have long relied on the use of a single caller, in recent years more studies have instead combined the output of multiple tools (Field et al. 2015). As pipelines employing multiple callers significantly outperform those that don't in terms of accuracy (Alioto et al. 2015), this trend will likely benefit cancer research greatly. Importantly, the same holds true for the use of multiple alignment tools (Griffith et al. 2015). Another positive development is the increased number of studies that use ultra-deep sequencing for tumour variant analysis. However, as the costs of whole-genome sequencing are still substantial, many researchers opt to achieve higher depths by sequencing only parts of the genome. One such technique is whole exome sequencing (WES). Although WES is now a routinely used instrument to detect genetic variation in humans (Koboldt et al. 2013), several recent studies show that sequencing the entire genome (WGS) is more accurate than WES when it comes to detecting somatic variants (Fang et al. 2014; Meynert et al. 2014). The reason for this is that WES produces a more heterogeneous read coverage than WGS, likely because it is subject to higher levels of sequencing bias (Veal et al. 2012; Belkadi et al. 2015). Nonetheless, WES is still a powerful, and considerably cheaper, method. A technique that is perhaps clinically more relevant than WES is targeted resequencing of mutational 'hotspots' of tumours (Mamanova et al. 2010; Gerstung et al. 2012). With this technique a restricted gene panel can be sequenced ultra-deep, allowing one to pick up low-VAF somatic mutations in known causative genes (Agrawal et al. 2011). Importantly, neither WES nor targeted sequencing allows a complete characterization of a tumour genome (Griffith et al. 2015): the first does not capture variation in non-coding DNA*, and the latter sequences only user-specified genomic stretches. Even with very high sequencing depth it can be extremely difficult to identify low-VAF SNPs (Griffith et al. 2015). The required depth depends heavily on the complexity of the sequencing library. The question of how much additional information can be gained when a specific library is sequenced deeper is therefore highly relevant. Recently, Daley and Smith presented a computational method that can quantify library complexity; such a method is likely to prove very useful in controlling the costs of routine tumour variant analysis in the clinic (Daley, Smith. 2013).
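The intuition behind such complexity estimates can be shown empirically: as more reads are sampled from a low-complexity library, the number of distinct fragments saturates, and sequencing deeper yields little new information. The sketch below draws such a saturation curve on simulated data; it is a crude empirical stand-in for, not an implementation of, the statistical extrapolation of Daley and Smith (2013).

```python
import random

def complexity_curve(fragment_ids: list, steps: int = 5) -> list[tuple[int, int]]:
    """Distinct fragments observed as a function of reads sampled -
    an empirical library-complexity saturation curve."""
    shuffled = random.sample(fragment_ids, len(fragment_ids))
    curve = []
    for step in range(1, steps + 1):
        n = len(shuffled) * step // steps
        curve.append((n, len(set(shuffled[:n]))))
    return curve

# Simulated low-complexity library: 10,000 unique fragments, 50,000 reads
random.seed(1)
library = [random.randrange(10_000) for _ in range(50_000)]
for n_reads, n_distinct in complexity_curve(library):
    print(f"{n_reads:>6} reads -> {n_distinct:>5} distinct fragments")
```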

Table 1: The main recommendations to optimize tumour variant calling

DNA extraction & shearing
- Be aware of poor DNA yield and/or quality when extracting from FFPE samples; employ pre-analytical sample characterization

Library preparation
- Size selection after shearing
- PCR-free preparation if possible
- Use KAPA HiFi polymerase
- Optimize temperatures and/or duration of PCR steps
- Size selection after PCR amplification
- Use ddPCR to quantify the library
- Use paired-end reads

Depth and coverage
- Sequence deep (100x-300x), especially for variants with a low (< 10%) expected VAF
- Keep the tumour : normal depth ratio close to one

Quality assessment
- Use multiple software tools, or a complete toolset like ClinQC

Read mapping
- Pick the most complete validated reference genome (GRCh37 or GRCh38)
- Use decoy sequences and blacklisted sites
- Use multiple mapping tools
- Use PCR duplicate removal with caution

Variant calling
- Optimize the mapper-caller combination / use multiple callers


* Non-coding DNA is rapidly shedding its 'junk DNA' nickname, as a significant amount of it appears to have some functional role (Flintoft. 2005; but see also Palazzo, Gregory. 2014). Importantly, somatic variants in non-coding DNA have also been directly linked to cancer (Khurana et al. 2016).

Other recommendations made here are still far from becoming standard practice. For example, most cancer studies still rely on a PCR step to amplify their libraries ahead of quantification, while this amplification has long been recognized as a source of artificial mutations as well as substantial bias (Aird et al. 2011). Recently, alternative approaches have been introduced, for instance the use of ddPCR for library quantification when input material is limiting (Robin et al. 2016). The arrival of Illumina's TruSeq® PCR-free technology is another important step towards reducing PCR artefacts (Alioto et al. 2015). However, as TruSeq requires very high amounts of starting material compared to other NGS technologies, it is currently not always considered a viable option in tumour variant analysis (Huptas et al. 2016). Future studies should evaluate whether ddPCR quantification could enable a PCR-free approach when working with FFPE tumour samples.

The 'age of clinical sequencing' (Daley, Smith. 2013) is clearly approaching fast. Before genetic screening of tumours, and with it personalized cancer medicine, can become a routine application, however, several important issues will have to be dealt with. One aspect that is sometimes overlooked is the duration of the complete variant calling workflow, which means that in the case of some aggressive cancers NGS analysis might simply be too slow (Goodwin et al. 2016). Indeed, faster systems will need to be developed to allow a ubiquitous deployment of NGS in the cancer clinic. In addition, the accuracy and sensitivity of somatic variant workflows will need to be further optimized and standardized, especially in the face of somatic variants with a low VAF. It is key that new variant calling workflows, as well as their individual pipeline components, are thoroughly evaluated to identify and isolate potential sources of error (Davies et al. 2016). This will greatly facilitate the analysis of tumour NGS data and, ultimately, clinical interpretation.


4. References

Agrawal, N., Frederick, M. J., Pickering, C. R., Bettegowda, C., Chang, K., Li, R. J., . . . Myers, J. N. (2011). Exome sequencing of head and neck squamous cell carcinoma reveals inactivating mutations in NOTCH1. Science (New York, N.Y.), 333(6046), 1154-1157.

Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., . . . Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in illumina sequencing libraries. Genome Biology, 12(2), R18.

Ajay, S. S., Parker, S. C., Abaan, H. O., Fajardo, K. V., & Margulies, E. H. (2011). Accurate and comprehensive sequencing of personal genomes. Genome Research, 21(9), 1498-1505.

Akbari, M., Hansen, M. D., Halgunset, J., Skorpen, F., & Krokan, H. E. (2005). Low copy number DNA template can render polymerase chain reaction error prone in a sequence-dependent manner. The Journal of Molecular Diagnostics : JMD, 7(1), 36-39.

Albertson, T. M., Ogawa, M., Bugni, J. M., Hays, L. E., Chen, Y., Wang, Y., . . . Preston, B. D. (2009). DNA polymerase epsilon and delta proofreading suppress discrete mutator and cancer phenotypes in mice. Proceedings of the National Academy of Sciences of the United States of America, 106(40), 17101-17104.

Alioto, T. S., Buchhalter, I., Derdak, S., Hutter, B., Eldridge, M. D., Hovig, E., . . . Gut, I. G. (2015). A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nature Communications, 6, 10001.

Altmann, A., Weber, P., Bader, D., Preuss, M., Binder, E. B., & Muller-Myhsok, B. (2012). A beginners guide to SNP calling from high-throughput DNA-sequencing data. Human Genetics, 131(10), 1541-1554.

Andrews, S. FastQC: A quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc.


Baker, M. (2012). Structural variation: The genome's hidden architecture. Nature Methods, 9(2), 133-137.

Bao, R., Huang, L., Andrade, J., Tan, W., Kibbe, W. A., Jiang, H., & Feng, G. (2014). Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Informatics, 13(Suppl 2), 67-82.

Bashir, A., Volik, S., Collins, C., Bafna, V., & Raphael, B. J. (2008). Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Computational Biology, 4(4), e1000051.

Belkadi, A., Bolze, A., Itan, Y., Cobat, A., Vincent, Q. B., Antipenko, A., . . . Abel, L. (2015). Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the National Academy of Sciences of the United States of America, 112(17), 5473-5478.

Ben-Ezra, J., Johnson, D. A., Rossi, J., Cook, N., & Wu, A. (1991). Effect of fixation on the amplification of nucleic acids from paraffin-embedded material by the polymerase chain reaction. The Journal of Histochemistry and Cytochemistry : Official Journal of the Histochemistry Society, 39(3), 351-354.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., . . . Smith, A. J. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218), 53-59.

Berlin, K., Koren, S., Chin, C. S., Drake, J. P., Landolin, J. M., & Phillippy, A. M. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, 33(6), 623-630.

Biankin, A. V., Waddell, N., Kassahn, K. S., Gingras, M. C., Muthuswamy, L. B., Johns, A. L., . . . Grimmond, S. M. (2012). Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature, 491(7424), 399-405.

Bolognesi, C., Forcato, C., Buson, G., Fontana, F., Mangano, C., Doffini, A., . . . Manaresi, N. (2016). Digital sorting of pure cell populations enables unambiguous genetic analysis of heterogeneous formalin-fixed paraffin- embedded tumors by next generation sequencing. Scientific Reports, 6, 20944.

Borad, M. J., Champion, M. D., Egan, J. B., Liang, W. S., Fonseca, R., Bryce, A. H., . . . Carpten, J. D. (2014). Integrated genomic characterization reveals novel, therapeutically relevant drug targets in FGFR and EGFR pathways in sporadic intrahepatic cholangiocarcinoma. PLoS Genetics, 10(2), e1004135.


Buehler, B., Hogrefe, H. H., Scott, G., Ravi, H., Pabon-Pena, C., O'Brien, S., . . . Happe, S. (2010). Rapid quantification of DNA libraries for next-generation sequencing. Methods (San Diego, Calif.), 50(4), S15-8.

Cacho, A., Smirnova, E., Huzurbazar, S., & Cui, X. (2015). A comparison of base-calling algorithms for illumina sequencing technology. Briefings in Bioinformatics.

Carter, S. L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., . . . Getz, G. (2012). Absolute quantification of somatic DNA alterations in human cancer. Nature Biotechnology, 30(5), 413-421.

Chaisson, M. J., Huddleston, J., Dennis, M. Y., Sudmant, P. H., Malig, M., Hormozdiari, F., . . . Eichler, E. E. (2015). Resolving the complexity of the human genome using single-molecule sequencing. Nature, 517(7536), 608-611.

Dabney, J., & Meyer, M. (2012). Length and GC-biases during sequencing library amplification: A comparison of various polymerase-buffer systems with ancient and modern DNA sequencing libraries. Biotechniques, 52(2), 87-94.

Dai, M., Thompson, R. C., Maher, C., Contreras-Galindo, R., Kaplan, M. H., Markovitz, D. M., . . . Meng, F. (2010). NGSQC: Cross-platform quality analysis pipeline for deep sequencing data. BMC Genomics, 11 Suppl 4, S7.

Daley, T., & Smith, A. D. (2013). Predicting the molecular complexity of sequencing libraries. Nature Methods, 10(4), 325-327.

Das, S., & Vikalo, H. (2013). Base calling for high-throughput short-read sequencing: Dynamic programming solutions. BMC Bioinformatics, 14, 129.

Davies, K. D., Farooqi, M. S., Gruidl, M., Hill, C. E., Woolworth-Hirschhorn, J., Jones, H., . . . Aisner, D. L. (2016). Multi-institutional FASTQ file exchange as a means of proficiency testing for next-generation sequencing bioinformatics and variant interpretation. The Journal of Molecular Diagnostics : JMD, 18(4), 572-579.

Davis, A., & Navin, N. E. (2016). Computing tumor trees from single cells. Genome Biology, 17(1), 113.

Del Fabbro, C., Scalabrin, S., Morgante, M., & Giorgi, F. M. (2013). An extensive evaluation of read trimming effects on illumina NGS data analysis. PloS One, 8(12), e85024.


DePristo, M. A., Banks, E., Poplin, R., Garimella, K. V., Maguire, J. R., Hartl, C., . . . Daly, M. J. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491-498.

Dienstmann, R., Dong, F., Borger, D., Dias-Santagata, D., Ellisen, L. W., Le, L. P., & Iafrate, A. J. (2014). Standardized decision support in next generation sequencing reports of somatic cancer variants. Molecular Oncology, 8(5), 859-873.

Eisenstein, M. (2015). Big data: The power of petabytes. Nature, 527(7576), S2-4.

Fang, H., Wu, Y., Narzisi, G., O'Rawe, J. A., Barron, L. T., Rosenbaum, J., . . . Lyon, G. J. (2014). Reducing INDEL calling errors in whole genome and exome sequencing data. Genome Medicine, 6(10), 89.

Ferrarini, M., Moretto, M., Ward, J. A., Surbanovski, N., Stevanovic, V., Giongo, L., . . . Sargent, D. J. (2013). An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics, 14, 670.

Feuk, L., Marshall, C. R., Wintle, R. F., & Scherer, S. W. (2006). Structural variants: Changing the landscape of chromosomes and design of disease studies. Human Molecular Genetics, 15 Spec No 1, R57-66.

Fidler, I. J. (1978). Tumor heterogeneity and the biology of cancer invasion and metastasis. Cancer Research, 38(9), 2651-2660.

Field, M. A., Cho, V., Andrews, T. D., & Goodnow, C. C. (2015). Reliably detecting clinically important variants requires both combined variant calls and optimized filtering strategies. PloS One, 10(11), e0143199.

Finotello, F., Lavezzo, E., Bianco, L., Barzon, L., Mazzon, P., Fontana, P., . . . Di Camillo, B. (2014). Reducing bias in RNA sequencing data: A novel approach to compute counts. BMC Bioinformatics, 15 Suppl 1, S7.

Flintoft, L. (2005). Genome evolution: An adaptive view of non-coding DNA. Nature Reviews Genetics.

Gerlinger, M., Santos, C. R., Spencer-Dene, B., Martinez, P., Endesfelder, D., Burrell, R. A., . . . Swanton, C. (2012). Genome-wide RNA interference analysis of renal carcinoma survival regulators identifies MCT4 as a Warburg effect metabolic target. The Journal of Pathology, 227(2), 146-156.

Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., & Beerenwinkel, N. (2012). Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature Communications, 3, 811.

Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., . . . Campbell, I. G. (2013). A simple consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90.

Goodwin, S., Gurtowski, J., Ethe-Sayers, S., Deshpande, P., Schatz, M. C., & McCombie, W. R. (2015). Oxford nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Research, 25(11), 1750-1756.

Goodwin, S., McPherson, J. D., & McCombie, W. R. (2016). Coming of age: Ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6), 333-351.

Griffith, M., Miller, C. A., Griffith, O. L., Krysiak, K., Skidmore, Z. L., Ramu, A., . . . Wilson, R. K. (2015). Optimizing cancer genome sequencing and analysis. Cell Systems, 1(3), 210-223.

Grokhovsky, S. L., Il'icheva, I. A., Nechipurenko, D. Y., Golovkin, M. V., Panchenko, L. A., Polozov, R. V., & Nechipurenko, Y. D. (2011). Sequence-specific ultrasonic cleavage of DNA. Biophysical Journal, 100(1), 117-125.

Hansen, K. D., Brenner, S. E., & Dudoit, S. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38(12), e131.

Head, S. R., Komori, H. K., LaMere, S. A., Whisenant, T., Van Nieuwerburgh, F., Salomon, D. R., & Ordoukhanian, P. (2014). Library construction for next-generation sequencing: Overviews and challenges. Biotechniques, 56(2), 61-64, 66, 68, passim.

Huptas, C., Scherer, S., & Wenning, M. (2016). Optimized Illumina PCR-free library preparation for bacterial whole genome sequencing and analysis of factors influencing de novo assembly. BMC Research Notes, 9, 269.

Huse, S. M., Huber, J. A., Morrison, H. G., Sogin, M. L., & Welch, D. M. (2007). Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology, 8(7), R143.

Katsanis, S. H., & Katsanis, N. (2013). Molecular genetic testing and the future of clinical genomics. Nature Reviews Genetics, 14(6), 415-426.

Kebschull, J. M., & Zador, A. M. (2015). Sources of PCR-induced distortions in high-throughput sequencing data sets. Nucleic Acids Research, 43(21), e143.

Khurana, E., Fu, Y., Chakravarty, D., Demichelis, F., Rubin, M. A., & Gerstein, M. (2016). Role of non-coding sequence variants in cancer. Nature Reviews Genetics, 17(2), 93-108.

Knierim, E., Lucke, B., Schwarz, J. M., Schuelke, M., & Seelow, D. (2011). Systematic comparison of three methods for fragmentation of long-range PCR products for next generation sequencing. PloS One, 6(11), e28240.

Koboldt, D. C., Steinberg, K. M., Larson, D. E., Wilson, R. K., & Mardis, E. R. (2013). The next-generation sequencing revolution and its impact on genomics. Cell, 155(1), 27-38.

Kokkat, T. J., Patel, M. S., McGarvey, D., LiVolsi, V. A., & Baloch, Z. W. (2013). Archived formalin-fixed paraffin-embedded (FFPE) blocks: A valuable underexploited resource for extraction of DNA, RNA, and protein. Biopreservation and Biobanking, 11(2), 101-106.

Kozarewa, I., Ning, Z., Quail, M. A., Sanders, M. J., Berriman, M., & Turner, D. J. (2009). Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nature Methods, 6(4), 291-295.

Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., . . . Dry, J. R. (2016). VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Research, 44(11), e108.

Landau, D. A., Carter, S. L., Stojanov, P., McKenna, A., Stevenson, K., Lawrence, M. S., . . . Wu, C. J. (2013). Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell, 152(4), 714-726.

Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.

Laurie, M. T., Bertout, J. A., Taylor, S. D., Burton, J. N., Shendure, J. A., & Bielas, J. H. (2013). Simultaneous digital quantification and fluorescence-based size characterization of massively parallel sequencing libraries. Biotechniques, 55(2), 61-67.

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics (Oxford, England), 30(20), 2843-2851.

Li, H. (2016). Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics (Oxford, England), 32(14), 2103-2110.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14), 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., . . . 1000 Genome Project Data Processing Subgroup. (2009). The sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078-2079.

Li, H., & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation sequencing. Briefings in Bioinformatics, 11(5), 473-483.

Li, R., Yu, C., Li, Y., Lam, T. W., Yiu, S. M., Kristiansen, K., & Wang, J. (2009). SOAP2: An improved ultrafast tool for short read alignment. Bioinformatics (Oxford, England), 25(15), 1966-1967.

Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., . . . Law, M. (2012). Comparison of next-generation sequencing systems. Journal of Biomedicine & Biotechnology, 2012, 251364.

Loeb, L. A. (2016). Human cancers express a mutator phenotype: Hypothesis, origin, and consequences. Cancer Research, 76(8), 2057-2059.

Loman, N. J., Misra, R. V., Dallman, T. J., Constantinidou, C., Gharbia, S. E., Wain, J., & Pallen, M. J. (2012). Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology, 30(5), 434-439.

Luthra, R., Chen, H., Roy-Chowdhuri, S., & Singh, R. R. (2015). Next-generation sequencing in clinical molecular diagnostics of cancer: Advantages and challenges. Cancers, 7(4), 2023-2036.
