How does sequence variability affect de novo assembly quality?


Citation for this paper:

Skern-Mauritzen, R., Malde, K., Besnier, F., Nilsen, F., Jonassen, I., Reinhardt, R. … Glover, K.A. (2013). How does sequence variability affect de novo assembly quality? Journal of Natural History, 47(5-12), 901-910.


R. Skern-Mauritzen, K. Malde, F. Besnier, F. Nilsen, I. Jonassen, R. Reinhardt, B. Koop, S. Dalvin, S. Mæhle, H. Kongshaug, and K.A. Glover

2013

This article was originally published at:


Journal of Natural History, 2013

Vol. 47, Nos. 5–12, 901–910, http://dx.doi.org/10.1080/00222933.2012.738833

How does sequence variability affect de novo assembly quality?

Rasmus Skern-Mauritzen(a,b,*), Ketil Malde(a), Francois Besnier(a), Frank Nilsen(b,c), Inge Jonassen(b,c), Richard Reinhardt(d), Ben Koop(e), Sussie Dalvin(a,b), Stig Mæhle(a), Heidi Kongshaug(b,c) and Kevin A. Glover(a)

(a) Institute of Marine Research, Bergen, Norway; (b) Salmon Louse Research Centre, Bergen, Norway; (c) University of Bergen, Bergen, Norway; (d) Max Planck Genome Centre, Cologne, Germany; (e) University of Victoria, Victoria, BC, Canada

(Received 10 October 2011; final version received 8 October 2012; first published online 3 January 2013)

Molecular genetic tools have become standard in biological studies of both model and non-model species. This has created a growing need for sequence information, a resource hitherto limited for many species. With new sequencing technologies this is rapidly changing, and whole genome shotgun sequencing has become a realistic goal for many species. However, present sequencing protocols require more DNA than can be extracted from single individuals of many small metazoans, potentially forcing sequencing projects to perform sequencing on samples derived from several individuals. A pertinent question thus arises: can wild samples be used or is inbreeding necessary? In the present study we compare assemblies generated using sequence data from inbred and wild Lepeophtheirus salmonis. The results indicate not only that measures to reduce the genetic variability may significantly improve the final assemblies but also that deeper coverage to some extent can compensate for the detrimental effects of natural sequence variability.

Keywords: sequencing; assembly; genetic variation; inbreeding

Introduction

In biological sciences molecular methods are being applied at an ever increasing rate and have become standard approaches also when working with non-model organisms. As a consequence, considerable time and resources are used to obtain sequence data for in silico analyses and as baseline information for downstream applications such as Northern blotting, quantitative real-time polymerase chain reaction and in situ hybridization. The advent of new sequencing technologies, with the associated decrease in sequencing costs and increase in speed, has made whole genome shotgun sequencing (WGS) feasible for a large variety of projects (Ekblom and Galindo 2011).

The objective for de novo genome sequencing will often be to generate an assembly that can be annotated and analysed, and subsequently used to identify single nucleotide polymorphisms (SNPs), design primers and probes etc. The quality of the generated assemblies is of great importance for subsequent annotations (Florea et al. 2010) and consequently also for downstream analyses. The assembly quality in turn depends on the assembly algorithm used as well as the amount, type and quality of the data entered into the analyses (Dalloul et al. 2010; Florea et al. 2010; Lin et al. 2011). Consequently, a number of papers comparing assembly tools and sequencing platforms have been published (Harismendy et al. 2009; Bao et al. 2011; Glenn 2011; Lin et al. 2011; Suzuki et al. 2011).

Next-generation sequencing protocols generally require 1–20 µg of high-quality DNA for construction of sequencing libraries, and sequencing projects often require construction of more than one library (e.g. libraries for different sequencing platforms, paired end libraries etc.). Even for a large arthropod, such as the up to 11 mm long ectoparasitic marine copepod Lepeophtheirus salmonis (Johnson and Albright 1991), it may be challenging to obtain sufficient DNA for library construction from a single individual. A requirement for more libraries, isolation from smaller species, or the desire to use dissected tissues to reduce the risk of contamination increases the number of individuals necessary for sequencing library construction. As the need for extraction from several individuals arises it becomes relevant to ask whether natural sequence variation in a population could affect the quality of the final assembly. However, studies directly addressing the effect of sequence variability on assemblies are absent, despite the fact that such studies could be a valuable reference when selecting sources of DNA (e.g. inbred cultures versus wild specimens) for de novo sequencing of small organisms.

The salmon louse, L. salmonis, is an economically important copepod ectoparasite with a genome size between 550 and 600 megabase pairs (Mbp) according to the animal genome size database (www.genomesize.com). We are presently sequencing the genome of an inbred strain of L. salmonis and simultaneously generating a resource of SNPs by sequencing wild specimens of L. salmonis sampled across several regions in the North Atlantic. These two data sets containing sequences from the same species with different degrees of genetic variability are comparable because they have been generated using the same sequencing platform (Illumina HiSeq2000; Illumina Inc., San Diego, CA, USA). To address the effect of genetic variability on sequence data assembly, equally sized data sets representing different starting materials were constructed from the available sequence data and assembled. Here we present statistics for the resulting assemblies that may serve as an information baseline when designing projects for de novo sequencing of small organisms.

Material and methods

Sequencing and tissue sampling

The raw sequence data used in the present study were obtained from two projects with different aims. Material from whole untreated wild L. salmonis was sampled in the field for an SNP detection project and material from inbred sterilized L. salmonis for a genome-sequencing project was sampled from experimental facilities as previously described (Hamre et al. 2009).

Material for SNP analysis and DNA isolation

Eight adult female L. salmonis were sampled from each of five localities. Four of these have been described previously (Glover et al. 2011): C858 (Canada), S856 (Shetland), I852 (Ireland), N849 (northern Norway). The fifth sample consisting of eight females was collected in September 2008 from an emamectin-benzoate-desensitized population in Austevoll, western Norway. For all samples, DNA was isolated in a 96-well format using the DNeasy kit according to the manufacturer’s instructions (Qiagen, Hilden, Germany). Equal amounts of DNA from each of the eight individuals from each station were pooled to meet concentration demands and were sequenced by Fasteris SA using the Illumina HiSeq 2000 platform following their standard protocols.

Material for genome sequencing and DNA isolation

Inbred adult female L. salmonis were sampled after 27 generations of inbreeding of the Ls1a culture as previously described (Hamre et al. 2009). To reduce the amount of non-salmon louse contamination before sequencing, DNA was purified from starved (2 days) individuals treated with 3% Virkon® in sterilized seawater. The specimens were digested using ample lyophilized proteinase K in 400 µl of 100 mM NaCl, 10 mM Tris–HCl pH 8, 25 mM EDTA and 5% sodium dodecyl sulphate at 37°C for 2–4 hours. DNA was extracted by addition of 400 µl phenol:chloroform:isoamyl alcohol (25:24:1) before gentle homogenization and phase separation by centrifugation at maximum r.p.m. (16,100 g) for 5 min at room temperature. The aqueous supernatant was thereafter transferred to new tubes and 2.5 volumes of ice-cold 90% ethanol was added. The DNA was then precipitated by addition of 0.1 volumes of 3 M sodium acetate at pH 5.2. When visibly precipitating, the high molecular weight (HMW) DNA was spooled on shepherds’ crooks prepared from glass Pasteur pipettes. The HMW-DNA was then cleaned in 70% ethanol, dried at room temperature and resuspended in water. The HMW-DNA was sequenced by Fasteris SA using the Illumina HiSeq 2000 platform following the same protocols as for sequencing of wild specimens. Additional 454 Life Sciences sequencing (not described in detail) was performed on ovaries from the same inbred strain and used to generate a draft genome assembly.

Data set preparation and analyses

Generating sequence sets from inbred and wild L. salmonis

An inherent challenge in comparing the data from the two studies was the different sources of DNA. In addition to the reduction in variability caused by inbreeding, the material chosen for sequencing of the L. salmonis genome was expected to contain less contamination than the libraries prepared for SNP detection because they had been starved and treated with Virkon® to reduce contamination. To further eliminate sequencing reads from contaminants, both data sets were mapped against a draft genome assembly based on inbred 454 reads (data not published). The mapping was performed using Burrows–Wheeler Aligner (BWA; Li and Durbin 2009) with default parameters. We then used a custom program (available from the authors upon request) to extract only read pairs where at least one read mapped to the genome.
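The pair-retention rule described above can be illustrated with a short Python sketch. This is not the authors' custom program, which is not reproduced in the paper; it is a minimal illustration that assumes the mapping output is available as SAM text, that both mates of a pair share a read name, and that SAM flag bit 0x4 marks an unmapped read. The function name `keep_pairs_with_mapped_mate` and the example records are hypothetical.

```python
from collections import defaultdict

UNMAPPED = 0x4                      # SAM flag bit: this read is unmapped
NOT_PRIMARY = 0x100 | 0x800         # secondary or supplementary alignment


def keep_pairs_with_mapped_mate(sam_lines):
    """Return the names of read pairs in which at least one mate is mapped."""
    mapped = defaultdict(bool)
    for line in sam_lines:
        if line.startswith("@"):    # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & NOT_PRIMARY:      # judge each mate by its primary record only
            continue
        mapped[name] = mapped[name] or not (flag & UNMAPPED)
    return {name for name, ok in mapped.items() if ok}


# Hypothetical records: pairA has both mates mapped, pairB one, pairC none.
example = [
    "@SQ\tSN:contig1\tLN:1000",
    "pairA\t99\tcontig1\t10\t60\t5M\t=\t50\t45\tACGTA\tIIIII",
    "pairA\t147\tcontig1\t50\t60\t5M\t=\t10\t-45\tACGTA\tIIIII",
    "pairB\t73\tcontig1\t10\t60\t5M\t=\t10\t0\tACGTA\tIIIII",
    "pairB\t133\tcontig1\t10\t0\t*\t=\t10\t0\tACGTA\tIIIII",
    "pairC\t77\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "pairC\t141\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
kept = keep_pairs_with_mapped_mate(example)
```

Keeping a pair when only one mate maps retains reads that extend beyond the edges of the draft contigs, which would be lost if both mates were required to map.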

Data set generation

To construct comparable data sets, we extracted a fixed number of random read pairs from each of the sequence sets. The smallest data set size was chosen to correspond to the smallest of the wild-type data sets, containing 34,393,766 read pairs, or approximately 12× genome coverage. We therefore extracted an equal number of random read pairs from each of the other wild-type data sets, and also extracted five sets of random read pairs from the inbred runs so that sequence reads for each of the five sets were extracted from single sequencing runs. Similarly, we extracted 73,571,936 read pairs (~24×) from the three largest wild-type data sets, and the same amount from individual sequence runs of inbred data. As for the 12× data sets, all data in an inbred 24× data set came from the same sequencing run. Finally, we pooled all wild-type data and used bootstrapping to extract five sets of 108,000,000 read pairs (~36×), and similarly generated five data sets from a pool of the inbred data. These 36× data sets contained reads from all sequencing runs.
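The two sampling schemes can be sketched as follows. The paper does not name the tool used for extraction, so this is only an illustration: it assumes read pairs are held in memory as a list of identifiers, and it reads "bootstrapping" as sampling with replacement from the pooled data, which is an interpretation rather than something the text states. The helper names are hypothetical. The coverage helper follows the calculation given with Table 2 (sequence amount in Mbp divided by a stipulated genome size of 600 Mbp).

```python
import random

GENOME_SIZE_MBP = 600  # stipulated genome size used in the paper


def subsample_pairs(pairs, n, seed=0):
    """Draw n distinct read pairs at random (sampling without replacement)."""
    return random.Random(seed).sample(pairs, n)


def bootstrap_pairs(pairs, n, seed=0):
    """Draw n read pairs at random with replacement (one bootstrap replicate)."""
    rng = random.Random(seed)
    return [rng.choice(pairs) for _ in range(n)]


def coverage(total_mbp, genome_size_mbp=GENOME_SIZE_MBP):
    """Estimated coverage: sequence amount divided by genome size."""
    return total_mbp / genome_size_mbp


# Toy pool standing in for the read pairs of one sequencing run.
pool = [f"pair{i}" for i in range(1000)]
subset = subsample_pairs(pool, 100)
replicates = [bootstrap_pairs(pool, 100, seed=s) for s in range(5)]

# The 12x Canada data set in Table 2 contains 5884 Mbp.
canada_cov = coverage(5884)
```

Fixing the seed makes each extracted set reproducible, which matters when several equally sized sets are compared.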

Sequence variation within and between data sets

To compare the diversity in the data sets, a simplified variation calling procedure was used. We aligned the generated data sets against the reference using BWA. We then performed variant calling using SAMTOOLS PILEUP –VCF (Li et al. 2009). The output was filtered to remove variants called only because of disagreement with the reference sequence, but where the read data showed a unanimous consensus, and the numbers of remaining variants were counted.
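The filtering step can be illustrated with a small Python sketch. It does not parse real pileup output; it assumes, purely for illustration, that each called site has already been reduced to a reference base and a dictionary of allele counts observed in the reads. The function names and the example sites are hypothetical.

```python
def reads_disagree(allele_counts):
    """True if the reads at a site support more than one allele."""
    observed = [allele for allele, count in allele_counts.items() if count > 0]
    return len(observed) > 1


def count_variable_sites(called_sites):
    """Count called sites that remain after dropping unanimous ones."""
    return sum(reads_disagree(counts) for _ref, counts in called_sites)


# Hypothetical called sites: (reference base, allele counts in the reads).
called_sites = [
    ("C", {"T": 10}),          # reads unanimous but differ from reference: dropped
    ("G", {"G": 7, "A": 5}),   # reads disagree with each other: counted
    ("T", {"C": 6, "G": 4}),   # reads disagree with each other: counted
]
n_variable = count_variable_sites(called_sites)
```

A site where every read carries the same non-reference base reflects a difference between the sample and the draft reference, not variation within the sample, which is why it is excluded from the count.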

Sequence data assemblies

The 26 data sets generated as described above were imported into CLC GENOMICS WORKBENCH® v. 4.6.1 and trimmed using default settings. Subsequent assembly was performed using CLC GENOMICS WORKBENCH® v. 4.6.1 with standard settings, except that the maximum distance for paired reads was adjusted to 450 to accommodate the larger than default insert size. Contig N50 values were calculated from an approximated genome size of 600 Mbp.
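Because the N50 here is computed against an assumed genome size and not against the assembly length, a short sketch may help. It follows the definition given with Table 2 (the longest length such that contigs of at least that length contain at least 50% of the genome) and also returns the number of contigs needed to reach that point. The function name and the toy contig lengths are hypothetical; this is not the calculation script used in the study.

```python
def n50_for_genome(contig_lengths, genome_size):
    """N50 relative to an assumed genome size.

    Returns (n50, number_of_contigs), or None when the contigs together
    span less than half of the assumed genome size.
    """
    target = genome_size / 2
    covered = 0
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        covered += length
        if covered >= target:
            return length, count
    return None


# Toy example: five contigs and an assumed genome size of 2000 bp.
contigs = [500, 400, 300, 200, 100]
result = n50_for_genome(contigs, genome_size=2000)   # (300, 3)
too_short = n50_for_genome(contigs, genome_size=4000)
```

Using a fixed genome size makes N50 values comparable between assemblies whose total lengths differ.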

Results

Read mapping

Mapping of the original sequence reads to the best available genome assembly, and discarding all reads that did not map, resulted in variable fractions of the sequencing reads being retained from the different sequencing runs (Table 1). Even when omitting the Austevoll 1st run, a significantly larger average fraction of the reads from the wild samples was discarded compared with the inbred samples (Table 1). Furthermore the results showed a higher variability in the fraction of reads classified as contamination (i.e. discarded reads) from the wild samples compared with the inbred samples.
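The group averages and standard deviations in Table 1 can be recomputed from the per-run values. The sketch below assumes the sample standard deviation (n - 1 denominator), which reproduces the tabulated figures; the variable names are illustrative.

```python
from statistics import mean, stdev

# Reads retained (%) per sequencing run, from Table 1.
wild = [94.23, 94.16, 92.13, 74.20, 70.91, 78.86]  # Austevoll 1st run omitted
inbred = [96.44, 95.02, 97.07, 96.91]

wild_mean, wild_sd = mean(wild), stdev(wild)
inbred_mean, inbred_sd = mean(inbred), stdev(inbred)

# Fraction of reads discarded, i.e. classified as contamination.
wild_discarded = 100 - wild_mean
inbred_discarded = 100 - inbred_mean
```

The discarded fractions come to roughly 16% for the wild samples and 4% for the inbred samples.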

Genetic variability of data sets generated for assembly

As a proxy for genetic variability we used a simple count of variable sites generated using SAMTOOLS PILEUP for all the generated data sets (Table 2). The results suggest that significant genetic variability remained in the inbred data sets after 27 generations of semi-intensive inbreeding (see Table 1 in Hamre et al. 2009 for a description of the inbreeding regimen). The number of identified variant sites in the inbred data sets increased with data set size from 12× to 24× (Table 2) and then appeared to remain stable when increasing the size of the data set to 36× coverage. In contrast, the variability of the wild data sets continued to increase with increasing size. It is noteworthy that the variation in variable site count in the bootstrapped 36× data sets was extremely low compared with the variation in the smaller 12× and 24× data sets. Regardless of data set size, the polymorphism density derived for the data sets from sequencing of eight wild L. salmonis was significantly higher than the density found in equally sized data sets derived from inbred lice.

Table 1. Overview of the fraction of reads retained after mapping to the best available salmon louse genome assembly.

Sequencing run      Source           Reads retained (%)   Average   SD
Austevoll 1st run   Wild samples     10.40                10.40     N/A
Austevoll 2nd run   Wild samples     94.23                84.08     10.66
Canada 1st run                       94.16
Canada 2nd run                       92.13
Ireland                              74.20
Nordland                             70.91
Shetland                             78.86
Inbred 1st run      Inbred samples   96.44                96.36     0.93
Inbred 2nd run                       95.02
Inbred 3rd run                       97.07
Inbred 4th run                       96.91

Assembly statistics

The results showed that increasing the data set size improved assembly statistics (Table 2). This improvement was clearly seen in higher N50 values and a reduced number of contigs contributing to the N50. Notably, the results also showed that the inbred data sets, with their lower variation, produced better assemblies than the wild data sets. Although the size of the largest contigs increased with the amount of data, this figure was sufficiently variable to overlap substantially between assemblies of equally sized inbred and wild data sets. The results furthermore indicate that the effect of adding data is not saturated at 36× coverage for Illumina sequencing, suggesting that assemblies will improve further at increased coverage.
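The comparison can be made concrete by recomputing the group averages from the per-assembly N50 values in Table 2. The sketch below uses only those tabulated values; the dictionary layout and variable names are illustrative.

```python
from statistics import mean

# Per-assembly N50 values from Table 2 (Nordland 12x is n.a. and omitted).
n50 = {
    ("12x", "wild"): [749, 894, 778, 518],
    ("12x", "inbred"): [1051, 1455, 1002, 937, 937],
    ("24x", "wild"): [2876, 2638, 2798],
    ("24x", "inbred"): [3450, 3706, 3396],
    ("36x", "wild"): [5689, 5700, 5685, 5685, 5665],
    ("36x", "inbred"): [8768, 8854, 8858, 8797, 8775],
}

# Average N50 per coverage level and source.
avg = {group: mean(values) for group, values in n50.items()}

# Ratio of inbred to wild average N50 at each coverage level.
ratio = {
    cov: avg[(cov, "inbred")] / avg[(cov, "wild")]
    for cov in ("12x", "24x", "36x")
}
```

At every coverage level the inbred average exceeds the wild average, while the wild average at one coverage level exceeds the inbred average at the level below it, in line with the observation that additional data can offset the effect of sequence variability.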

Discussion

Molecular approaches based on sequence data have created an increasing need for acquiring sequence resources. For smaller organisms DNA may have to be isolated from several individuals to meet the concentration requirements of sequencing protocols. However, little information is available on the effect of reducing sequence variation on assemblies. Here we present results from assemblies of sequence data from a WGS sequencing project on inbred L. salmonis and an SNP-detection project on wild L. salmonis. The assemblies were generated from data sets of approximately 12, 24 and 36 times coverage of the L. salmonis genome, which represents a reasonable sequencing range for a small-scale WGS project on a non-model species.

Table 2. Overview of generated data sets, their variability (see text for details) and assembly statistics.

Size  Source     Mbp     Cov.  Average cov.  Variant sites  Average variability  N50 (Ns)        Average N50  N50 #contigs (largest contig)  Average #contigs
12X   Canada     5884    9.8   10.0          4,818,422      3,815,269            749 (29,063)    735          228,143 (15,840)               238,423
      Ireland    6265    10.4                3,583,427                           894 (34,642)                 190,634 (13,352)
      Shetland   6139    10.2                3,770,231                           778 (30,985)                 218,402 (14,801)
      Nordland   5602    9.3                 3,657,995                           n.a. (21,609)                n.a.
      Austevoll  6016    10.0                3,246,270                           518 (24,566)                 316,511 (20,925)
      Inbred 1   6535    10.9  10.6          1,302,184      1,279,559            1051 (46,477)   1076         168,622 (20,013)               163,646
      Inbred 2   6189    10.3                1,045,830                           1455 (28,331)                99,776 (19,549)
      Inbred 3   6454    10.8                1,296,946                           1002 (42,925)                175,978 (18,087)
      Inbred 4   6345    10.6                1,377,016                           937 (39,431)                 186,954 (23,787)
      Inbred 5   6350    10.6                1,375,821                           937 (38,901)                 186,902 (19,446)
24X   Canada     13,800  23.0  22.4          4,231,054      4,256,095            2876 (67,552)   2771         58,257 (39,569)                60,625
      Ireland    13,162  21.9                4,345,494                           2638 (61,438)                63,312 (34,691)
      Shetland   13,402  22.3                4,191,736                           2798 (62,881)                60,306 (35,710)
      Inbred 1   13,517  22.5  22.7          1,598,925      1,474,379            3450 (72,672)   3517         51,793 (28,883)                50,750
      Inbred 2   13,806  23.0                1,382,381                           3706 (73,655)                48,162 (42,800)
      Inbred 3   13,585  22.6                1,441,831                           3396 (71,123)                52,327 (33,131)
36X   Wild 1     19,636  32.7  32.7          4,543,974      4,574,832            5689 (55,129)   5685         28,842 (63,359)                28,845
      Wild 2     19,636  32.7                4,594,473                           5700 (55,818)                28,838 (103,567)
      Wild 3     19,636  32.7                4,594,408                           5685 (55,709)                28,818 (92,841)
      Wild 4     19,636  32.7                4,545,958                           5685 (55,992)                28,845 (80,226)
      Wild 5     19,636  32.7                4,595,345                           5665 (54,989)                28,881 (81,733)
      Inbred 1   20,004  33.3  33.3          1,459,872      1,460,590            8768 (55,192)   8810         20,089 (113,056)               19,991
      Inbred 2   20,026  33.4                1,459,775                           8854 (55,818)                19,921 (86,515)
      Inbred 3   20,004  33.3                1,460,107                           8858 (55,709)                19,902 (88,241)
      Inbred 4   20,004  33.3                1,460,671                           8797 (55,992)                19,994 (88,243)
      Inbred 5   20,004  33.3                1,462,525                           8775 (54,989)                20,050 (79,589)

The amount of sequencing data used in each assembly is given as megabase pairs (Mbp) and the estimated coverage (Cov.) is calculated based on a stipulated genome size of 600 Mbp. The N50 values are defined as the longest length such that at least 50% of the genome is contained in contigs of this size or larger. The total number of Ns in each assembly is given in parentheses after the N50 values. The number of contigs in the N50 is given in the N50 #contigs column along with the largest contig size in parentheses.

The data sets were assembled with CLC GENOMICS WORKBENCH. The assembly results showed that the average contigs in assemblies from data based on inbred L. salmonis were larger than average contigs generated from data from samples of eight L. salmonis individuals collected in the wild (Table 2). Hence, assembly statistics indicate that reduction of genetic variation is highly desirable. It should be noted that the population size of L. salmonis is very large and that other organisms with smaller populations are expected to exhibit lower sequence variation, which in turn could improve the results from wild samples. The results furthermore showed that adding sequence data improved common assembly quality parameters such as N50 and the number of contigs in the N50 (Table 2). It should be noted that the largest contig size is considerably more variable and consequently a less reliable indicator of assembly quality. Although the effect of adding sequence reads saturates when coverage is sufficiently high, the results indicate that expanding a data set may compensate for sequence variability in the sampled material.

The data sets were generated by mapping Illumina paired end reads from wild and inbred sources (see Material and methods) on an L. salmonis genome assembly based on 454 Life Sciences sequencing reads of inbred ovaries and discarding all reads that did not map to ensure that the data sets were comparable. The assembly against which we mapped the reads was generated from sequences derived from dissected inbred ovaries only (454 WGS sequence reads not used for other purposes in the present study), so we are confident that the vast majority of the assembly represents genuine L. salmonis sequence. The SNP-detection project sequencing was performed on untreated whole lice, and we therefore expected that these data sets would contain more contaminants than the inbred data sets. This was supported in the mapping step where 16% of the reads derived from wild L. salmonis did not map to the best inbred genome assembly as opposed to 4% of the inbred reads. Although it cannot be ruled out that fractions of the wild population may contain genetic material not present in the inbred strain, the results strongly suggest that a significant proportion of the reads may be expected to be contamination if measures are not implemented to counter this. To this end, the results indicate that simple measures such as starvation and Virkon® treatment can significantly lower the risk for contamination.

Sequence variabilities in the constructed data sets were evaluated from simple heterogenic site counts (Table 2) and showed that the natural sequence variation in pools of eight individuals sampled at the same site was higher than in the inbred data sets. This appears to confirm the reduced sequence variation in the inbred strain reported earlier (Hamre et al. 2009) and later supported by analyses using an expanded set of microsatellites (12 out of 13 microsatellites were fixed, the last had two alleles, data not shown) following methods previously reported (Glover et al. 2011). The apparent large remaining sequence variability in the inbred strain seems surprising if microsatellite variability is a reliable proxy of genetic variation. However, these estimates are based on a draft genome using a simplified variant calling procedure, and so are likely to be inflated compared with the true number of variants. For instance, sequencing errors are likely to contribute to false variant calls, and genomic repeats may be collapsed in the draft genome, causing any differences to be counted as variants. Nevertheless, 9× and 20× data sets originating from inbred ovaries from more than 30 adults contained 2.7 and 3.1 million variable sites, respectively, which is significantly higher than the numbers of variable sites in the generated inbred 24× data sets originating from only three inbred individuals (approximately 1.5 million variable sites, Table 2), indicating that considerable residual variation is still present in the inbred Ls1a strain after 27 generations of inbreeding. Hence, despite the uncertainties pertaining to the absolute numbers of variants in the data sets, these results suggest that the loss of variation in microsatellite markers is disproportionately high in comparison to the loss of genetic variation in general, and that the inbreeding regimen described by Hamre et al. (2009) has not been optimal for reducing genetic variability.

The variability counts also revealed that increasing the sample size resulted in a larger number of variant sites. However, the increase in variability from the wild 24× data set to the wild 36× data set was surprisingly low considering that the wild 24× data sets were generated from single localities whereas the wild 36× data sets contained reads from all localities. This suggests that most of the variable sites are found throughout the North Atlantic, supporting earlier studies indicating that L. salmonis displays a high degree of gene-flow, consistent with a species that can disperse at both planktonic and adult stages (Glover et al. 2011). The homogeneous level of sequence variability in the 36× data sets compared with the smaller data sets, that all stem from single sequencing runs, is probably the result of the bootstrapping procedure averaging variability among sequencing runs, which in turn indicates that the error rate variation between sequencing runs is noticeable.

Sequencing platforms exhibit different coverage biases, i.e. some sequence regions will be under-represented in reads from one platform but exhibit normal coverage when sequenced with another (Harismendy et al. 2009; Dalloul et al. 2010). Therefore a combination of sequencing platforms is recommended to improve results. The results presented here are based on paired end Illumina sequencing only and employing several sequencing platforms may influence the effect of sequence variability on assemblies.

The conclusion from the present study is that measures that can be taken to reduce sequence variability, e.g. inbreeding or reducing the number of individuals used, will result in better assemblies. Furthermore, the results indicate that, in small sequencing projects, the beneficial effect of adding data may compensate for the adverse effect of using samples with some genetic variation.

Acknowledgements

We acknowledge the financial contributions from Marine Harvest and The Fishery and Aquaculture Industry Research Fund. We also appreciate our discussions with Dr James Emmanuel Bron, which resulted in restructuring of the analyses.

References

Bao SY, Jiang R, Kwan WK, Wang BB, Ma X, Song YQ. 2011. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. 56:406–414.

Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Blomberg L, Bouffard P, Burt DW, Crasta O, Crooijmans RPMA, et al. 2010. Multi-platform next generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. Plos Biol. 8.

Ekblom R, Galindo J. 2011. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107:1–15.

Florea L, Souvorov A, Salzberg SL. 2010. Genes and genomes, an imperfect world: comparison of gene annotations of two Bos taurus draft assemblies. Genome Biol. 11 (Suppl.1):P13.

Glenn TC. 2011. Field guide to next-generation DNA sequencers. Mol Ecol Resour. 11:759–769.

Glover KA, Stolen AB, Messmer A, Koop BF, Torrissen O, Nilsen F. 2011. Population genetic structure of the parasitic copepod Lepeophtheirus salmonis throughout the Atlantic. Marine Ecol Prog Ser. 427:161–172.

Hamre LA, Glover KA, Nilsen F. 2009. Establishment and characterisation of salmon louse (Lepeophtheirus salmonis (Kroyer 1837)) laboratory strains. Parasitol Int. 58:451–460.

Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA. 2009. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol. 10:R32.

Johnson SC, Albright LJ. 1991. The developmental stages of Lepeophtheirus salmonis (Kroyer, 1837) (Copepoda, Caligidae). Can J Zool-Rev Canad Zool. 69:929–950.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. 2009. The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079.

Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng HW. 2011. Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27:2031–2037.

Suzuki S, Ono N, Furusawa C, Ying BW, Yomo T. 2011. Comparison of sequence reads obtained from three next-generation sequencing platforms. PLoS One. 6:e19534.