
Research paper

FDSTools: A software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise

Jerry Hoogenboom [a,b,*,1], Kristiaan J. van der Gaag [a,b,1], Rick H. de Leeuw [a], Titia Sijen [b], Peter de Knijff [a], Jeroen F.J. Laros [a]

[a] Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC, The Netherlands

[b] Division Biological Traces, Netherlands Forensic Institute, Laan van Ypenburg 6, 2497 GB, The Hague, The Netherlands

ARTICLE INFO

Article history:
Received 26 September 2016
Received in revised form 31 October 2016
Accepted 23 November 2016
Available online 27 November 2016

Keywords:
Forensic science
Next generation sequencing
Massively parallel sequencing
Short tandem repeat
Powerseq
FDSTools
Software

ABSTRACT

Massively parallel sequencing (MPS) is on the advent of a broad scale application in forensic research and casework. The improved capabilities to analyse evidentiary traces representing unbalanced mixtures are often mentioned as one of the major advantages of this technique. However, most of the available software packages that analyse forensic short tandem repeat (STR) sequencing data are not well suited for high throughput analysis of such mixed traces. The largest challenge is the presence of stutter artefacts in STR amplifications, which are not readily discerned from minor contributions. FDSTools is an open-source software solution developed for this purpose. The level of stutter formation is influenced by various aspects of the sequence, such as the length of the longest uninterrupted stretch occurring in an STR. When MPS is used, STRs are evaluated as sequence variants that each have particular stutter characteristics which can be precisely determined. FDSTools uses a database of reference samples to determine stutter and other systemic PCR or sequencing artefacts for each individual allele. In addition, stutter models are created for each repeating element in order to predict stutter artefacts for alleles that are not included in the reference set. This information is subsequently used to recognise and compensate for the noise in a sequence profile. The result is a better representation of the true composition of a sample. Using Promega Powerseq™ Auto System data from 450 reference samples and 31 two-person mixtures, we show that the FDSTools correction module decreases stutter ratios above 20% to below 3%. Consequently, much lower levels of contributions in the mixed traces are detected. FDSTools contains modules to visualise the data in an interactive format allowing users to filter data with their own preferred thresholds.

© 2016 The Authors. Published by Elsevier Ireland Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Analysis of Short Tandem Repeats (STRs) has been a successful forensic tool in the past two decades. The comparison of STR profiles from forensic DNA evidentiary traces with reference samples and DNA databases has provided essential information in many forensic cases [1]. Standard practice is to use Capillary Electrophoresis (CE) to analyse STR length variation. In recent years, Massively Parallel Sequencing (MPS) was introduced as a new method to analyse STRs and other forensic DNA markers [2,3]. MPS enables the simultaneous detection of both length and sequence variation of STRs, which increases the discriminatory value substantially [4–6]. The output of CE consists of peaks reflecting fluorescent signal intensities with their own respective shapes and peak heights. The output of MPS data analysis consists simply of read counts of the observed sequences. Both methods can suffer from the occurrence of PCR artefacts such as STR stutters [7]. This especially complicates the analysis of STR profiles coming from multiple contributors, which is common in forensic evidentiary traces [8]. The level of stutter formation depends on a number of distinct aspects of the sequence, including the A/T content of the repeat unit and the number of consecutive repeat units occurring in an STR [9]. Since any specific STR length identified by CE can consist of multiple different sequences, these CE-identified length variants show a larger variation in measured stutter percentage than individual sequences analysed through MPS. This decreased variation in stutter percentage for MPS STR data may aid in the interpretation of mixtures [2], as it allows for a better prediction of stutter behaviour, which can be used to filter the data for stutter products. Existing software packages for the analysis of STR sequencing data [10–12] do not support extensive filtering and correction of systemic PCR and/or sequencing errors and therefore seem less suited for analysis of mixed DNA samples.

This prompted us to develop a software package that harbours the following features: 1) characterisation and correction of noise in the sequencing data caused by PCR stutter or other systemic PCR and/or sequencing errors; 2) visualisation of sequencing data as comprehensive profiles; 3) filtering of data in graphs and tables with user definable thresholds and 4) open-source accessibility. Forensic DNA Sequencing Tools (FDSTools) is available via the Python Package Index (either by manual installation or by using the command ‘pip install fdstools’). We assess the performance of FDSTools on 31 two-person mixtures genotyped via the Promega Powerseq™ Auto System for which we first generated a reference dataset of 450 samples.

* Corresponding author.
E-mail addresses: j.hoogenboom@nfi.minvenj.nl (J. Hoogenboom), k.van.der.gaag@nfi.minvenj.nl (K.J. van der Gaag), r.h.de_leeuw@lumc.nl (R.H. de Leeuw), t.sijen@nfi.minvenj.nl (T. Sijen), p.de_knijff@lumc.nl (P. de Knijff), j.f.j.laros@lumc.nl (J.F.J. Laros).
1 These authors contributed equally.

Forensic Science International: Genetics
Journal homepage: www.elsevier.com/locate/fsig
http://dx.doi.org/10.1016/j.fsigen.2016.11.007
1872-4973/© 2016 The Authors. Published by Elsevier Ireland Ltd.

2. Materials and methods

2.1. Sample preparation

PCR products and sequencing libraries were prepared as described previously [2] using a prototype Promega Powerseq™ Auto System containing 23 STRs and amelogenin. A set of 450 Dutch samples [13] and 31 two-person mixtures were amplified and sequenced. The mixtures consisted of three combinations of two donors selected randomly from a pool of unrelated individuals, which were mixed in different ratios. The minor components in the mixtures contributed 0.5% (six mixtures), 1% (six mixtures), 5% (four mixtures), 10% (six mixtures), 20% (six mixtures) and 50% (three mixtures).

Since the mixtures were used to test the performance of the software and also to determine analysis thresholds that are fit for purpose, we balanced the influence of varying DNA inputs in the PCR and increased drop-out due to low DNA input. This was achieved by using a minimum of 60 pg of the minor component in the 0.5%, 1% and 5% mixtures, resulting in a total DNA input of 12 ng, 6 ng and 1.2 ng, respectively (60 pg resulted in less than 20% drop-out in the validation of Powerplex 6C [14]). The same total DNA input of 1.2 ng was used for the 5%, 10%, 20% and 50% mixtures (resulting in 120 pg and 240 pg of the minor components in the 10% and 20% mixtures, respectively). The DNA input was 0.5 ng for single donor samples.
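
The input amounts quoted above follow from simple proportions. The short Python sketch below reproduces that arithmetic; the function names are illustrative only and are not part of FDSTools.

```python
def total_input_ng(minor_pg, minor_fraction):
    """Total DNA input (ng) needed so that the minor component amounts to minor_pg."""
    return minor_pg / minor_fraction / 1000.0


def minor_input_pg(total_ng, minor_fraction):
    """Amount of the minor component (pg) for a given total input and fraction."""
    return total_ng * 1000.0 * minor_fraction


if __name__ == "__main__":
    # Mixtures in which the minor component was fixed at 60 pg
    for fraction in (0.005, 0.01, 0.05):
        total = total_input_ng(60, fraction)
        print(f"{fraction:.1%} minor at 60 pg: {total:.1f} ng total input")
    # Mixtures in which the total input was fixed at 1.2 ng
    for fraction in (0.10, 0.20):
        minor = minor_input_pg(1.2, fraction)
        print(f"{fraction:.0%} minor in 1.2 ng: {minor:.0f} pg minor component")
```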

The genotypes of the donors used in the mixtures were known, which enables the identification of drop-in and drop-out allele calls. Paired-end sequencing data of all amplicons was generated using the MiSeq® Sequencer (Illumina).

2.2. Initial data processing

In Fig. 1, the main tools of the FDSTools package and their role in the data analysis pipeline are displayed. The tools can be split into three functional groups: tools for reference database creation, tools for reference database curation (data quality assessment) and tools for case sample filtering and data interpretation. In addition, the package contains initial data processing tools such as TSSV [10] that are common to reference database samples and case samples.

Fig. 1. Flowchart of the analysis process, showing the main tools (blue rectangles) of the FDSTools package and their roles in the data analysis pipeline. The output of each tool can be visualised using the Vis tool (not shown). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.2.1. Paired-end read merging

Using paired-end sequencing, forward and reverse strand molecules of each amplicon were sequenced from both ends. The first 300 nucleotides from either end were obtained. These read pairs were merged into a consensus read by aligning the read pair such that the largest possible overlap is obtained while allowing for up to 33% mismatches in the overlapped region. Most amplicons were about 300 base pairs in length and provided fully complementary read pairs.

With STR amplicons that are longer than 300 bp, a problem may occur when both reads end in the middle of the STR structure and the pair may be merged into a truncated STR sequence. A modified version of FLASH 1.2.11 [15] (available via github.com/Jerrythafast/FLASH-lowercase-overhang) was used to mark the bases that were not in the overlapped region in lowercase in the consensus read. This enables detection of truncated STR sequences in downstream analysis.
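
The merging idea can be illustrated with a small Python sketch. This is not the FLASH algorithm: it ignores base qualities, handles only the case where the end of the first read overlaps the start of the reverse-complemented second read, and simply takes the first read's bases in the overlap. The function names and the `min_overlap` parameter are assumptions made for illustration; only the 33% mismatch allowance and the lowercase marking come from the text.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, max_mismatch_ratio=0.33, min_overlap=10):
    """Merge a read pair using the largest acceptable overlap.

    Bases outside the overlapped region are written in lowercase, bases
    inside it in uppercase. Returns None if no acceptable overlap exists.
    """
    fwd = read1.upper()
    rev = reverse_complement(read2.upper())
    # Try the largest overlap first and shrink until one is acceptable.
    for overlap in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        left = fwd[len(fwd) - overlap:]
        right = rev[:overlap]
        mismatches = sum(a != b for a, b in zip(left, right))
        if mismatches <= max_mismatch_ratio * overlap:
            return fwd[:len(fwd) - overlap].lower() + left + rev[overlap:].lower()
    return None
```

Because the largest acceptable overlap wins, two reads that both end inside a long repeat can be joined at the wrong repeat unit. The lowercase overhang is what lets downstream tools notice that the flanks of such a read were never confirmed by both reads.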

2.2.2. Linking reads to loci and alleles

The merged reads are linked to specific loci and alleles by the TSSV tool, which is a wrapper around a simplified version of the TSSV [10] program called TSSV-Lite. TSSV links reads to loci by scanning the reads for the sequences flanking the STR loci used. The flanking sequences of each locus, which usually represent the most 5′ nucleotides of the primers, are provided to FDSTools in a library file, together with various other details about the loci used. Supplementary File 1 represents the library file used in this study. The file contains a description for the contents of each section.

Each read is scanned for these flanking sequences by computing alignments. In this study, the flanking sequences were 18 nucleotides in length and two substitutions (or two inserted or deleted bases) per flank were allowed in the alignment. Reads are categorised as ‘unrecognised’ if no flanking sequence is found. Furthermore, both flanking sequences are required to have at least one uppercase letter, which ensures that overlapped reads that are potentially truncated are categorised as ‘unrecognised’ as well. Reads in which only one flanking sequence is found with at least one uppercase letter get linked to a locus but flagged as ‘no start’ or ‘no end’ depending on whether the left or right flank is missing, respectively (optionally, these reads can be written to separate fasta or fastq files).

The main output of TSSV is a text file with tab-separated values. The file contains one line for every unique sequence of each locus. The columns include the name of the locus, the sequence, and the number of reads carrying this particular sequence. Read counts are given separately for the forward and reverse strand.

TSSV includes additional options for filtering sequences that are seen too few times and sequences with a length outside a given range (e.g., primer-dimers). This range can be specified separately for each locus. Furthermore, filtered sequences can be aggregated into a single ‘other sequences’ category for each locus. In this project, only singletons (i.e., sequences with only one read) were aggregated to the ‘other sequences’ category.
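
The aggregation of rarely seen sequences can be sketched as follows. The row layout (locus, sequence, forward reads, reverse reads) mirrors the columns described above, but the function and parameter names are hypothetical and do not reflect the TSSV command-line interface or file format.

```python
from collections import defaultdict


def aggregate_rare(rows, min_reads=2):
    """Collapse sequences seen fewer than min_reads times per locus.

    rows: iterable of (locus, sequence, forward_reads, reverse_reads).
    With min_reads=2 only singletons are aggregated, as in this project.
    """
    kept = []
    other = defaultdict(lambda: [0, 0])
    for locus, sequence, fwd, rev in rows:
        if fwd + rev >= min_reads:
            kept.append((locus, sequence, fwd, rev))
        else:
            other[locus][0] += fwd
            other[locus][1] += rev
    # Append one aggregated line per locus that had filtered sequences.
    for locus, (fwd, rev) in sorted(other.items()):
        kept.append((locus, "other sequences", fwd, rev))
    return kept
```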

2.3. Building a reference database

One function of FDSTools is the building of a reference database. Such a database can be used to obtain estimates of recurring allele-specific systemic noise. Here, ‘noise’ refers to the complete collection of sequences observed in a sample, except the sample’s true allelic sequences. Noise includes any artefact deriving from the PCR as well as the sequencing (such as PCR stutter or single-nucleotide errors). Additionally, based on the reference data a statistical model can be derived that aims to predict stutter ratios for alleles not present in the reference set.

The creation of a reference database involves various tools included in the FDSTools package, which will be discussed in the next sections. In addition to these separate tools, FDSTools offers the Pipeline tool, which conveniently integrates the entire data analysis pipeline. Users are advised to use Pipeline as it removes the complexity of having to run several separate tools and to combine their output. Pipeline takes a simple configuration file containing the analysis parameters and automatically runs the appropriate tools.

Building a reference database is a two-phase process. In the first phase, the reference samples are analysed in a global manner to identify their alleles and reject those samples in which the alleles are not readily identified. In the second phase, the systemic noise of each of these alleles is analysed in detail.

2.3.1. Allele calling for reference samples

Determining the alleles of single donor reference samples is a fairly straightforward process because these generally represent the one or two most abundant sequences for any locus. FDSTools includes Allelefinder to call alleles this way. It is applied after Stuttermark, which is described below. A number of thresholds, outlined in Fig. 2, are used to guard against including alleles of potential low-level contaminations. For heterozygous loci, a second allele is only called if it passes the allele threshold, which is defined in this project as 30% of the read count of the most frequent allele at the same locus. As we expect no stutter above 30% [2], this threshold separates the alleles from noise. No alleles are called at a locus when additional sequences occur that have a read count below the allele threshold but above the noise threshold (which is defined as 15% of the most frequent allele in this project) or if a third sequence passes the allele threshold. If more than two loci in the same sample fail to give a result for these reasons, the overall quality of the sample is considered too poor to report any alleles. Additionally, Allelefinder can be configured to call at most one allele at haploid loci.

Fig. 2. Thresholds used by Allelefinder to call alleles in reference samples. Sequence variants with a read count above the allele threshold are called as alleles. The four lighter-shaded bars represent stutter variants (as recognised by Stuttermark), which are ignored by Allelefinder.

The three potential pitfalls are 1) PCR stutter artefacts that exceed the noise threshold; 2) strong read count imbalance for heterozygous alleles, which may be the result of e.g., primer-site sequence variants and 3) autosomal trisomy, which is rare. To deal with the problem of stutter, each sample was analysed with Stuttermark [2] before calling alleles. With Stuttermark, sequences that are in a stutter position of another sequence while having a read count below a user-supplied percentage with respect to the other sequence are marked as ‘stutter’. Sequences that have a read count that is too high to be explained by stutter alone will not be marked as ‘stutter’, as they may coincide with a genuine allele. The thresholds used here were 30% for −1 stutter (loss of a repeat unit) and 10% for +1 stutter (gain of a repeat unit). For −2 stutter products, a 30% threshold of the −1 stutter product is used. Sequences that are marked as ‘stutter’ are completely ignored by Allelefinder.

Allelefinder produces the list of alleles and a report detailing for which samples and loci allele calling is rejected and for which reasons.
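
The per-locus decision described above can be written down as a minimal sketch. It assumes that sequences marked as ‘stutter’ have already been removed and that a sequence "passes" a threshold when its read count is at least the threshold value; the function name, argument names and the exact comparison operators are assumptions, not the Allelefinder implementation.

```python
def call_reference_alleles(read_counts, allele_pct=30.0, noise_pct=15.0,
                           max_alleles=2):
    """Call alleles at one locus of a single-donor reference sample.

    read_counts: dict mapping sequence -> read count (stutter excluded).
    Returns the called alleles (most abundant first), or None when allele
    calling is rejected for this locus.
    """
    if not read_counts:
        return None
    top = max(read_counts.values())
    allele_min = top * allele_pct / 100.0
    noise_min = top * noise_pct / 100.0
    alleles = [s for s, n in read_counts.items() if n >= allele_min]
    # Sequences between the noise and allele thresholds are unexplained.
    unexplained = [s for s, n in read_counts.items()
                   if noise_min < n < allele_min]
    if unexplained or len(alleles) > max_alleles:
        return None
    return sorted(alleles, key=read_counts.get, reverse=True)
```

Setting `max_alleles=1` corresponds to the haploid configuration mentioned in the text. The sample-level rule (rejecting a sample when more than two loci fail) would be applied by the caller over all loci.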

2.3.2. Estimating average allele-specific systemic noise

For each allele, a profile of recurring systemic noise, including PCR stutter products as well as any other ‘side products’, can be generated based on the reference data. Noise profiles are always computed separately for forward and reverse reads, because strand bias may exist in the sequencing technology used. Profiles are also computed separately for each locus, under the assumption that noise production is not influenced by alleles of other loci. The level of noise is expressed as the number of noise reads as a percentage of the number of reads of the parent allele. In the context of PCR stutter analysis, this quantity is often referred to as the ‘stutter ratio’, despite the representation as a percentage of the parent allele. We use the generalised term ‘noise ratio’ (also represented as a percentage of the parent allele) to account for all other systemic noise as well.

$$\text{Noise ratio} = \frac{\text{Noise reads}}{\text{Allele reads}} \times 100$$

In homozygous samples, the noise ratio can be calculated by dividing the number of reads of a non-allelic sequence by the number of reads of the allele. Allele-specific noise profiles are readily computed from homozygous samples carrying this allele by scaling the read counts in each sample such that the parent allele is 100 and averaging the noise ratios for each noise sequence. These per-allele noise statistics and other statistics, such as the standard deviations of the noise ratios, can be obtained using BGHomStats. In heterozygous samples the extraction of noise sequences is more complex, because it has to be determined which proportions each allele contributed to the observed noise sequences. We assume that noise in heterozygous samples corresponds to the sum of the noise profiles of the two alleles, after the application of a scaling correction to account for differences in the amount of each allele amplified. This is needed as even for heterozygous allele pairs, PCR efficiency may vary due to primer binding site sequence variation or STR length [16]. To extract noise from heterozygous reference samples an iterative approach was taken and implemented in the BGEstimate tool in FDSTools.
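
For the homozygous case, the scaling and averaging step can be illustrated with a few lines of Python. The sketch below is a simplified stand-in for what BGHomStats reports, for one strand and one locus; the function name is hypothetical, and treating a noise sequence that is absent from a sample as a ratio of 0% is an assumption.

```python
from statistics import mean, stdev


def homozygous_noise_profile(samples, allele):
    """Average noise ratios over homozygous samples carrying `allele`.

    samples: list of dicts mapping sequence -> read count.
    Returns {noise_sequence: (mean_ratio, standard_deviation)}, with the
    ratios expressed as a percentage of the parent allele's reads.
    """
    ratios = {}
    for counts in samples:
        parent_reads = counts[allele]
        for sequence, reads in counts.items():
            if sequence != allele:
                ratios.setdefault(sequence, []).append(100.0 * reads / parent_reads)
    n = len(samples)
    profile = {}
    for sequence, values in ratios.items():
        # Assumption: a sequence not seen in a sample has a ratio of 0%.
        values = values + [0.0] * (n - len(values))
        profile[sequence] = (mean(values), stdev(values) if n > 1 else 0.0)
    return profile
```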

In essence, the algorithm, which is discussed in more detail in Supplementary Text 1, seeks a non-negative least squares solution to the matrix equation AP = C. In this equation, C is an N × M matrix of constants derived from the read counts in the reference samples, A is an N × N matrix summarising the allele balance in the samples, and P is an N × M matrix containing the estimated profiles of systemic noise. N is the number of unique genuine alleles among the reference samples and thus also the number of profiles produced and M is the total number of unique sequences observed.

Matrix C is computed once at the start of the algorithm. Each row in C corresponds to one allele and contains the sum of the read counts of all samples that have that particular allele, after scaling the allele to 100 reads for homozygous samples and 50 reads for heterozygotes. The noise profiles in P are initialised with the assumption that no systemic noise is present, i.e., all elements are set to 0, except for the elements that correspond to the actual alleles, which are set to 100.

The algorithm then proceeds by repeatedly re-estimating the allele balance matrix A while reading cross-contributions between the alleles from the current profiles P and subsequently re-estimating P by finding a non-negative least squares optimal solution to AP = C. The values thus obtained in P are the average noise ratios of all observed systemic noise for all alleles (i.e., each row in P contains the noise profile of one allele).

To avoid noise from one allele being incorporated in the noise profile of another allele, a minimum of three different heterozygous genotypes per allele was used in this study. A threshold can be set for the minimal read count of noise to consider and the minimal percentage (we used 80%) of reference samples with the same allele which should contain the same noise before it is included in the noise profile. Each of these parameters can be set using various options of the ‘fdstools bgestimate’ command.

2.3.3. Relating the amount of stutter to repeat length

With the methods outlined above, profiles of systemic noise were obtained for each allele present in the reference set. However, one would also like to be able to filter and correct the noise originating from alleles that are not (yet) included in the reference set, as case samples may be encountered that contain alleles for which no reference sample was available. For this purpose, we developed a method to predict the sequence and corresponding amount of PCR stutter artefacts that would be produced for any allele of a given locus. Note that this method does not predict noise other than noise resulting from STR stutter or single nucleotide stretches.

Previous studies have shown that the amount of stutter is strongly correlated with the length of the repeated sequence [17] and even more so with the number of consecutive repeat units [2,18]. The FDSTools tool Stuttermodel seeks to fit polynomial functions to the repeat length and stutter ratio in homozygous reference samples. Stuttermodel scans each of the alleles for all positions where a particular repeat unit (e.g., the sequence ‘AGAT’) is repeated and records the length of this repeat, as the number of nucleotides, including incomplete repeats at the beginning or end of the repeated stretch. For each sample with this allele, the number of noise reads that lack exactly one repeat is counted. Reads that combine the loss of one repeat with one or more other differences (e.g., substitutions, or stutter in another stretch of repeats in the same allele) are included in this count. The counts thus obtained are used to compute the noise ratios of individual stutter sites and a polynomial function is fitted to quantify the relationship between the length of the repeat and the stutter ratio.

This analysis is repeated for each unique repeat unit of a length between one and a configurable maximum number of nucleotides (inclusive), treating cyclically equivalent units (e.g., ‘ATAG’ and ‘AGAT’) and their respective reverse complements (e.g., ‘CTAT’ and ‘ATCT’) synonymously. The amount of +1 stutter, −2 stutter etc. is analysed the same way.
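
One way to treat cyclically equivalent units and their reverse complements synonymously is to map every unit to a single canonical representative. The sketch below picks the alphabetically smallest rotation of the unit or of its reverse complement; this particular choice of representative is an assumption for illustration and need not match the convention used inside Stuttermodel.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit):
    """Return the reverse complement of an uppercase repeat unit."""
    return unit.translate(_COMPLEMENT)[::-1]


def rotations(unit):
    """Return all cyclic rotations of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def canonical_unit(unit):
    """Map a repeat unit to one representative of its equivalence class."""
    unit = unit.upper()
    return min(rotations(unit) + rotations(reverse_complement(unit)))
```

With this definition the four units named in the text, ‘ATAG’, ‘AGAT’, ‘CTAT’ and ‘ATCT’, all map to the same key and would therefore share one fitted function.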

Because different loci behave differently in stutter formation, a separate function is fitted for each locus. Additionally, a polynomial function is fitted to all data at once, which is used to predict stutter in alleles of loci for which insufficient reference data was available to fit a locus-specific function. Separate functions are fitted for the forward and reverse strands.

For each fitted function, Stuttermodel also determines the lower bound of the repeat length for which the function gives meaningful results. This lower bound is defined as the lowest repeat length for which the function produces a nonnegative result and the function is non-decreasing. Below this threshold, and in any other points where the function value would be negative, the function value is set to zero.

The quality of fit is assessed by computing the coefficient of determination,

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \quad \text{where} \quad \hat{y}_i = \begin{cases} f_i & f_i \geq 0 \\ 0 & f_i < 0 \end{cases}$$

with $y_i$ the noise ratios of the reference samples, $\bar{y}$ the mean, $f_i$ the polynomial function’s estimate of the noise ratio of sample $i$, and $\hat{y}_i$ the modified function value. The $R^2$ score will be close to one when the function is a good fit and lower otherwise.
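
The clamping and the resulting score can be made concrete with a short sketch. It evaluates an already fitted polynomial; the fitting itself is not shown, and the function names, the coefficient ordering and the `lower_bound` argument are illustrative assumptions rather than the Stuttermodel interface.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial; coeffs are in increasing order of degree."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def clamped_prediction(coeffs, repeat_length, lower_bound=0):
    """Predicted noise ratio, set to zero below lower_bound or if negative."""
    if repeat_length < lower_bound:
        return 0.0
    return max(0.0, polyval(coeffs, repeat_length))


def r_squared(coeffs, lengths, ratios, lower_bound=0):
    """Coefficient of determination using the clamped predictions."""
    y_mean = sum(ratios) / len(ratios)
    ss_res = sum((y - clamped_prediction(coeffs, x, lower_bound)) ** 2
                 for x, y in zip(lengths, ratios))
    ss_tot = sum((y - y_mean) ** 2 for y in ratios)
    return 1.0 - ss_res / ss_tot
```

For instance, with the hypothetical second-degree function f(x) = 0.01x² − 2, a repeat of 10 nucleotides would give f = −1, which is clamped to a predicted stutter ratio of zero.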

Stuttermodel supports fitting polynomial functions of any degree. To prevent over-fitting while still allowing a non-linear relationship, second-degree polynomials (with a minimum R² score) were used. In cases where the fit for one strand has an R² score above the threshold while the fit for the other strand scores below the threshold, both fits are rejected to prevent unintended introduction of strand bias by filtering stutter on only one strand.

2.3.4. Curating the reference database

To make sure all reference samples were of good quality and all alleles were called correctly, they were put through the same analysis pipeline as case samples, thereby performing noise filtering and correction on the reference samples. It is important to note that these reference samples were previously genotyped by us in great detail using CE [13]. The remaining amounts of noise in each sample were assessed using BGAnalyse (described below) to identify potentially unsuitable reference samples that still passed the thresholds of Allelefinder. Any sample with a notably higher amount of remaining background was manually removed from the set of reference samples to prevent pollution of the noise profiles.

BGAnalyse was developed and employed to analyse the remaining noise after correction. For each locus and each sample, this tool calculates the least frequent (this can be a negative value because of over-correction), most frequent, and total noise as a percentage of the number of reads of the highest allele at each locus. These results are subsequently visualised to easily identify potentially problematic samples. In the visualisation, samples can be sorted by any of the calculated values or by coverage (total number of reads). Samples were subjected to manual inspection and any sample that exhibited non-stutter products with corrected read counts above 4% of the most frequent allele or above 2% of the total reads was rejected.

2.4. Analysing case samples

The analysis of mock case samples was performed in a three-step process which is described in the following sections.

1. A prediction was made for the amount of stutter for each sequence in the sample, using the fitted polynomial functions obtained from running Stuttermodel on the reference samples. These predictions are used to extend the allele-specific noise profiles obtained from running BGEstimate on the reference samples.

2. The extracted noise profiles are used to filter and correct the noise in the case sample.

3. Alleles are called and the sample is subjected to manual interpretation.

Similar to the creation of a reference database, analysing case samples involves multiple tools discussed in the following sections. Pipeline offers a convenient way to automatically analyse a case sample with all tools discussed.

2.4.1. Predicting stutter amounts for unknown alleles

Because case samples may contain alleles that are not present in the reference samples, noise profiles for these alleles need to be predicted. FDSTools includes the BGPredict tool, which uses a previously created Stuttermodel file to predict the amounts of stutter artefacts for alleles not present in the reference data. BGPredict finds all sequences in the analysed case sample in which a particular repeat unit is repeated. The expected amount of stutter in this repeat is then computed using the corresponding fitted polynomial function from the Stuttermodel file. All possible combinations of stutter are taken into consideration when the frequencies of each stutter artefact are computed. The noise profiles created in this way are used to extend the noise profiles in the previously created BGEstimate file (a tool called BGMerge is included in FDSTools for this purpose).

2.4.2. Noise filtering and correction in case samples

To be able to filter systemic noise in case samples, one first needs to determine which alleles are likely present in the sample. To this end, the algorithm of BGEstimate is essentially reversed, i.e., the goal is now to solve for a in aP = c, where c is a row vector with the sample’s read counts for the M sequences in the noise profiles and a is a row vector with the estimated amount of each of the N profiles in the sample. P is the N × M matrix of noise profiles obtained from BGEstimate, extended with the predictions obtained from BGPredict. Solving for a is done in a non-negative least squares sense as before, giving estimated allele contributions that best fit the various sequences (alleles as well as noise) present in the sample.

Background-corrected read counts can then be computed by first subtracting the scaled profiles from the sample’s read counts

$$d \leftarrow c - aP$$

and then adding the total size of each profile to the corresponding allele, i.e.,

$$d_n \leftarrow d_n + a_n \sum_{m=1}^{M} P_{n,m}, \quad \forall n \in [1 \ldots N]$$

Note that d may have negative elements if the sample contains a lower amount of a certain sequence than was predicted by the profiles of its dominant alleles.
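
The two correction steps can be traced with a small numerical sketch. It assumes that the vector a has already been estimated (the non-negative least squares solve is not shown) and that the sequences are ordered such that the first N of the M columns are the alleles of the N profiles. The function name and the toy numbers are illustrative assumptions and are not taken from BGCorrect.

```python
def correct_background(c, a, P):
    """Apply d <- c - aP, then d_n <- d_n + a_n * sum_m P[n][m].

    c: read counts of the M sequences in the sample.
    a: estimated amount of each of the N profiles in the sample.
    P: N x M noise profiles; column n (n < N) is the allele of profile n.
    """
    n_profiles, n_sequences = len(P), len(c)
    # Step 1: subtract the scaled profiles from the observed read counts.
    d = [c[m] - sum(a[n] * P[n][m] for n in range(n_profiles))
         for m in range(n_sequences)]
    # Step 2: return the total size of each profile to its parent allele.
    for n in range(n_profiles):
        d[n] += a[n] * sum(P[n])
    return d


if __name__ == "__main__":
    # Toy locus: allele 2 sits in the stutter position of allele 1,
    # and allele 1 is assumed to produce 10% stutter.
    P = [[100.0, 10.0],
         [0.0, 100.0]]
    c = [1000.0, 150.0]
    a = [10.0, 0.5]          # satisfies aP = c for this toy example
    print(correct_background(c, a, P))
```

In this toy example, 100 of the 150 reads observed for the second sequence are attributed to stutter of the first allele and returned to it, so the total number of reads is preserved while the minor allele is left with the reads not explained by the major allele's profile.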

FDSTools offers BGCorrect to filter and correct background noise following the procedure outlined above. Given a sample data file (obtained from TSSV for example) and a file containing noise profiles, BGCorrect produces a copy of the sample data with additional columns giving the amounts of each sequence attributed to noise and the amounts of each sequence that would be recovered by noise correction (i.e., adding the noise to the originating allele). These values are given separately for the forward and reverse strand. Although the method by which BGCorrect computes them results in non-integer values, it was decided not to round these numbers to avoid unnecessary loss of precision. If necessary, these numbers can be rounded to integer values, thereby easing the interpretation as ‘read counts’ when presented in a graph or table in a report.

2.4.3. Allele calling for case samples

The naïve method of calling alleles that Allelefinder uses is not appropriate for case samples, since these may contain alleles of multiple contributors in different quantities. Therefore, calling alleles in case samples is done by computing various statistics based on the information of the detected sequences and subsequently setting interpretation thresholds on these statistics. For this, Samplestats was developed, which operates on and adds various columns to the output of BGCorrect. Samplestats automatically marks sequences as ‘allele’ using the thresholds outlined in Table 1.

Alleles can also be called while visualising the sample data; hence, FDSTools includes the Samplevis visualisation. By means of the interactive graphical user interface of Samplevis, the same set of thresholds as depicted in Table 1 is available to filter the visible sequences and to automatically call alleles. Thresholds can be specified separately for the graphs and for the tables. While the table displays the called alleles, less conservative settings may be used for the filtering of the corresponding graph to ensure visibility of alleles just below the allele-calling threshold. The results of changing the thresholds are immediately visible. Clicking a sequence in any of the graphs toggles its ‘allele’ status. This allows the user to manually add alleles to and remove alleles from the profile. A note is added to manually added alleles, stating that the allele is ‘User-added’. Similarly, if the user removes any alleles, the allele remains visible but a ‘User-removed’ note is added. In this way it remains easy to trace back exactly which alleles meet the thresholds and which ones were manually added and removed.

Samplestats can also be used to filter sequences using the same types of thresholds (albeit with less stringent threshold values than used for allele calling, as potential alleles should not be filtered out) and (optionally) aggregate the filtered sequences per locus to a single line categorised ‘other sequences’.

2.5. Visualisation

For visualisation of the data, FDSTools makes use of the JavaScript graphing library Vega [19]. Vega graphs can be embedded on a web page, exposing a JavaScript programming interface that allows for updating the graphs based on the user’s interaction with the web page. Vega can also run on Node.js, which allows it to be included in automated analysis pipelines to generate (static) image files.

FDSTools comes with Vega graph specifications and accompanying interactive web pages (HTML files) to visualise the output of each tool. The Vis tool can be used to obtain self-contained HTML files containing visualisations of various types of data files generated by the other tools. For example, Samplevis visualises a sample data file as a sequence profile and Profilevis visualises background noise profiles obtained from BGEstimate or BGPredict. A description of each visualisation can be found in Supplementary Table 1. When viewed in a web browser, the web page provides additional controls that allow the user to filter the data, switch between linear and logarithmic scales, or select different subsets of the data to visualise. The default values for the settings on the web page can be set when the HTML file is generated by the Vis tool.

The web pages also offer the option to save the displayed graphs as a Scalable Vector Graphics (SVG) or rasterised Portable Network Graphics (PNG) image, so that they can be imported into documents. Alternatively, the Vis tool can supply a raw Vega graph specification file (either with or without embedded data), which can then be used by Vega to generate SVG or PNG images directly on the command line.

Table 1

Interpretation thresholds for case samples in Samplestats and Samplevis. Sequences that meet either the ‘Percentage correction’ or ‘Percentage recovery’ threshold (or both) as well as all the other thresholds will be marked as ‘allele’. These threshold values are evaluated after noise correction. The ‘Allele calling default’ column lists the default threshold values for calling alleles. The ‘Filtering default’ column lists the default values used for filtering displayed sequences in Samplevis graphs.

| Threshold | Description | Allele calling default | Filtering default |
| --- | --- | --- | --- |
| Total reads | Minimum number of reads per allele. Non-systemic (and thus unfilterable) sequence errors occur sporadically. This threshold ensures that a minimal amount of amplified product is present to support the allele call. | 30 | 5 |
| Reads per strand | Minimum number of reads per allele for both strands. This threshold can be used to exclude low template sequences with strong strand bias. | 1 | 0 |
| Percentage of most frequent | The number of reads as a percentage of the number of reads of the most frequent allele at the locus. This threshold sets a limit to the mixture proportions that can be analysed in mixed samples or to the allele balance in samples with a single contributor. | 2% | 0.5% |
| Percentage of locus | Each allele contributes at least this percentage to the total number of reads of the locus. With this threshold, a minimum contribution percentage can be enforced. | 1.5% | 0% |
| Percentage correction | This percentage derives from the number of reads after noise correction minus the number of reads before correction, which is divided by the number of reads before correction. Consequently, the percentage correction is negative if noise correction resulted in a reduction of the read count of a sequence. Therefore, with this threshold set to 0%, any sequence representing noise will not be called as an allele. To be able to detect alleles of minor contributors that coincide with noise products for the major contributor’s alleles, the ‘percentage recovery’ threshold described below is allowed to overrule this threshold. | 0% | 0% |
| Percentage recovery | The number of reads added by noise correction as a percentage of the total number of reads after noise correction. After noise correction, at least this percentage of reads must have originated from corrected noise. The rationale behind this threshold is that only allelic sequences will have substantial amounts of recovered reads. When an allele of a minor contribution coincides with the stutter of an allele of the major contributor, noise will be extracted and added to the major contributor’s parent allele resulting in a negative percentage correction. Yet, since the minor contributor’s contribution to the reads also results in noise products that are corrected, the allele will receive recovered reads and a percentage recovery >0%. To allow the calling of alleles for which no noise profile exists (or no noise was detected) in the reference database the threshold is set at 0% by default. | 0% | 0% |
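The way the thresholds in Table 1 combine can be illustrated with a minimal sketch. It is not taken from the FDSTools source: the function and field names are hypothetical, every comparison is assumed to be "greater than or equal to", and all read counts other than `before` are assumed to be the values after noise correction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Allele-calling defaults as listed in Table 1 (field names are hypothetical)."""
    min_total_reads: int = 30
    min_reads_per_strand: int = 1
    min_pct_of_most_frequent: float = 2.0
    min_pct_of_locus: float = 1.5
    min_pct_correction: float = 0.0
    min_pct_recovery: float = 0.0


def pct_correction(before, after):
    """(after - before) / before as a percentage; negative when reads were removed."""
    if before == 0:
        return 0.0  # assumption: a sequence without original reads counts as uncorrected
    return 100.0 * (after - before) / before


def pct_recovery(added, after):
    """Reads added by noise correction as a percentage of the corrected read count."""
    return 100.0 * added / after if after else 0.0


def is_allele(before, after, added, forward, reverse,
              most_frequent, locus_total, t=Thresholds()):
    """Apply all thresholds; correction OR recovery must pass, the rest must all pass."""
    passes_other = (
        after >= t.min_total_reads
        and min(forward, reverse) >= t.min_reads_per_strand
        and 100.0 * after / most_frequent >= t.min_pct_of_most_frequent
        and 100.0 * after / locus_total >= t.min_pct_of_locus
    )
    passes_noise = (
        pct_correction(before, after) >= t.min_pct_correction
        or pct_recovery(added, after) >= t.min_pct_recovery
    )
    return passes_other and passes_noise
```

For a minor-contributor allele that coincides with a stutter position, the percentage correction is negative, yet the recovered reads give a positive percentage recovery, so the either-or rule still allows the call.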


3. Results and discussion

We developed FDSTools, a software package containing a suite of tools that can be used for the analysis of forensic MPS data. With these tools, FDSTools provides detailed insight into the quality of a sample and the noise profile of a certain allele (or sequence variant). In Supplementary Table 1, an overview of all tools currently available in the package is provided, of which a selection was described in more detail in Section 2.

To enhance the analysis of mixed samples, FDSTools identifies, extracts and corrects for PCR or sequencing noise such as stutter from a reference database with the aim to discern low mixture proportions. Different STR amplification assays and different amplification protocols could result in different noise. It is therefore important to base the database for noise correction on references generated by a method that is representative of the casework samples to be analysed.

Note that it is not possible to correct all noise completely as the level of noise shows variation between samples.

3.1. Reference database

Our reference samples were sequenced with an average coverage of 65,000 reads and a mode of about 45,000 reads. For the present study, a minimum coverage of 6000 reads per sample was required, which relates to an average of 250 reads per locus as 24 loci were co-amplified. For heterozygous loci, fewer than 250 reads per locus is not sufficient to quantify low amounts of noise accurately.

3.1.1. Reference sample curation

Since the reference database is used to filter and correct noise in case samples, it is essential that the reference samples contain no contaminants and reference alleles are called correctly. Although all other steps can be performed automatically by FDSTools, a manual curation of samples in the reference database is needed. BGAnalyse was developed to facilitate this process by visualising potential outliers.

Allelefinder automatically rejected two out of the initial 450 samples which were clearly contaminated and three samples that had too low coverage to detect alleles reliably. Manual inspection of samples with a notably higher amount of remaining noise after correction in BGAnalyse resulted in the rejection of an additional 16 samples. Reasons for rejection were low-level contamination, low coverage and low sequencing quality. The interactive BGAnalyse visualisations displaying the remaining noise for the reference samples are available in Supplementary File 2a (before database curation) and 2b (after curation). For the majority of samples, the highest remaining noise variant in the complete profile did not exceed 3% of the number of reads of the highest allele at the locus, while without correction STR stutters can represent over 20%. For the remaining 429 samples, no drop-in or drop-out was observed when calling alleles using Allelefinder with the settings described in Section 2.3.1.

3.1.2. Extending noise profiles for noise correction

As described in Section 2.4.1, case samples may contain alleles which are not present in the reference database. In such cases, FDSTools resorts to noise prediction instead of noise estimation. A column in the output file of BGCorrect marks whether correction has been performed using data obtained from BGEstimate (if the allele was available in the reference database) or by using BGPredict (if not available in the reference database).
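The choice between an estimated and a predicted noise profile can be pictured as a lookup with a fallback. The sketch below is a simplified illustration, not the FDSTools implementation: it assumes profiles are stored per allele in plain dictionaries, and the function name and example ratios are hypothetical.

```python
def merge_profiles(estimated, predicted):
    """Prefer the estimated profile of an allele; fall back to the prediction.

    Both arguments map an allele name to its noise profile (here a dict of
    noise sequence -> noise ratio in percent). The returned dict also records
    which source was used, similar in spirit to the marker column in the
    BGCorrect output.
    """
    merged = {}
    for allele in estimated.keys() | predicted.keys():
        if allele in estimated:
            merged[allele] = ("BGEstimate", estimated[allele])
        else:
            merged[allele] = ("BGPredict", predicted[allele])
    return merged


# Hypothetical example values, for illustration only.
estimated = {"CE10_TCTA[10]_-20T>A": {"CE10_TCTA[9]_-20T>A": 8.0}}
predicted = {
    "CE10_TCTA[10]_-20T>A": {"CE10_TCTA[9]_-20T>A": 7.5},
    "CE11_TCTA[11]_-20T>A": {"CE11_TCTA[10]_-20T>A": 9.0},
}
merged = merge_profiles(estimated, predicted)
```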

From the results from Stuttermodel it becomes evident that for simple STRs consisting of a single repeating element or for long stretches of a specific repeating element within a complex STR, only few reference samples are needed to reliably fit a stutter model. However, when complex repeats consist of several repeating elements of which some show little length variation, correction using the stutter model is suboptimal as exemplified by the predictions for D12S391. This STR locus consists of two repeat units: an AGAT repeat stretch of highly variable length and an ACAG repeat that is repeated 6 to 8 times for most individuals. Since Stuttermodel predicts the amount of stutter based on the repeat length, at least four different repeat lengths need to be available in homozygous reference samples to obtain a reliable fit. However, the set of reference samples used in this study only contained homozygotes with 6 to 8 repeats of ACAG, which is not sufficiently variable to obtain a reliable fit. Consequently, ACAG is omitted from the stutter model for D12S391, even though this repeat stutters up to 9% for the longer repeats (8 repeat units, data not shown). When BGEstimate does not obtain a background noise profile, BGPredict will not correct stutter in this repeat and thus stutters will remain present. As a last resort, BGPredict offers the possibility to use a stutter model based on data from all loci that have the same repeat unit sequence if no locus-specific fit is available. Supplementary Fig. 1 displays the stutter model obtained from the set of 429 reference samples, including the individual observations on which the model was based.
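The requirement of at least four different repeat lengths, and the last-resort fallback to a model based on all loci, can be sketched as a small decision function. This is a hedged illustration only: the function names are hypothetical and the real tools apply further criteria that are not modelled here.

```python
MIN_DISTINCT_LENGTHS = 4  # stated in the text: at least four different repeat lengths


def has_enough_variation(repeat_lengths):
    """True if the homozygous observations cover enough distinct repeat lengths."""
    return len(set(repeat_lengths)) >= MIN_DISTINCT_LENGTHS


def choose_model_source(locus_lengths, all_loci_lengths, allow_fallback=True):
    """Return which data a stutter fit could be based on, or None if no fit is possible.

    locus_lengths: repeat lengths seen in homozygous samples at this locus.
    all_loci_lengths: repeat lengths seen at all loci sharing the repeat unit.
    """
    if has_enough_variation(locus_lengths):
        return "locus-specific"
    if allow_fallback and has_enough_variation(all_loci_lengths):
        return "all-loci"
    return None
```

With homozygotes covering only 6 to 8 repeats, as for ACAG in D12S391, `choose_model_source([6, 7, 8], [])` returns `None`, which corresponds to the repeat being left out of the locus-specific model.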

Combining BGEstimate and BGPredict (by using BGMerge) instead of using BGPredict alone is expected to reduce the noise remaining after correction, as the combined correction also corrects for noise other than stutters. This is confirmed when we determine the percentage of remaining noise (the reads representing remaining noise as a percentage of the reads for the most frequent allele at the locus) and plot the highest percentage and various percentiles (90th, 95th and 99th) (Supplementary Fig. 2a, b). The percentiles illustrate how often samples exhibit outlying noise sequence variants. When the 99th percentile is regarded, BGPredict alone retains on average 2.6% noise and the combined correction 2.4%. Also, the combined correction results in fewer overcorrected variants.

Thus, BGPredict can be used without BGEstimate with a slightly reduced accuracy in correction. Note that BGEstimate should not be used without BGPredict since alleles not included in the reference database will not be corrected, which can result in a combination of corrected and uncorrected alleles and remaining noise for the uncorrected alleles.
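The remaining-noise summary used in this comparison can be reproduced in a few lines. The sketch assumes a nearest-rank percentile definition, which may differ from the definition used for the supplementary figures, and the function names and example values are hypothetical.

```python
import math


def remaining_noise_pct(noise_reads, most_frequent_allele_reads):
    """Remaining noise reads as a percentage of the most frequent allele at the locus."""
    return 100.0 * noise_reads / most_frequent_allele_reads


def percentile(values, pct):
    """Nearest-rank percentile (an assumed definition) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-sample remaining noise percentages at one locus.
per_sample = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 0.2, 0.8, 1.2, 2.6]
summary = {p: percentile(per_sample, p) for p in (90, 95, 99)}
highest = max(per_sample)
```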

3.1.3. Reference database size and coverage

To test the effect of the sample size and type from which the reference database is built, we used the complete curated reference database of 429 samples and a random selection of 100 samples (both with combined BGEstimate and BGPredict correction, which was found to be slightly better as described in Section 3.1.2). Supplementary Fig. 2c, d display an overview of the most frequent and the total remaining noise at each locus after correction. The different percentiles of the reference samples are given to illustrate how often samples exhibit outlying noise sequence variants.

When comparing the results for the complete database with the results for the subset of 100 samples, the difference in remaining noise seems surprisingly small (Supplementary Fig. 2c, d). However, with a smaller database, fewer alleles will fit the criteria to create a BGEstimate noise profile and more alleles rely on noise prediction by BGPredict. Indeed, for the reference set of 429 samples, only 3.5% of the alleles are corrected using BGPredict. This percentage increases to 10.2% when the correction is based on the subset of 100 samples.

In a larger reference database more alleles will be observed. Supplementary Fig. 3 displays the alleles observed in the reference databases of 429 and 100 samples. To fit the criteria to create a BGEstimate noise profile, alleles need to be present as a homozygous genotype or be available as part of shared genotypes with at least three other alleles that must also fit these criteria. For the stutter model, only the homozygous genotypes are used. In the larger database of 429 samples, more alleles fit these criteria than in the smaller database of 100 samples.

To examine the effect of read coverage of the reference samples on noise profile analysis, we generated two subsets comprising samples with high or low coverage, which is specified as a total read count between 82,000 and 350,000 or between 8000 and 44,000, respectively. The high coverage set comprised 71 samples; the low coverage set 70. We noticed that in the low-coverage noise profiles, strand bias can occur, especially for the low-percentage noise, which is due to single-strand drop-out of this noise. This is illustrated by the BGEstimate noise profiles for the CE10_TCTA[10]_-20T>A allele for locus D7S820 in Supplementary Fig. 4, in which forward and reverse reads are in good or reasonable balance for all seven noise sequences in the high coverage sample set while good balance is only seen for the two main noise sequences in the low coverage set.

Since the most abundant noise after correction in a sample is usually in the range of 0.5–3% (for STR analysis), we recommend a coverage of at least 1000 reads per locus (which relates to a 24,000 total read coverage for our 24 loci amplification kit) for the samples of the reference database to obtain the most accurate noise estimates.
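The arithmetic behind this coverage recommendation can be made explicit with a short sketch. It assumes, for illustration only, that reads are spread evenly over the loci and that the two alleles of a heterozygous locus each receive half of the locus reads; the function names are hypothetical.

```python
def total_coverage(reads_per_locus, n_loci=24):
    """Total reads per sample needed for a given average number of reads per locus."""
    return reads_per_locus * n_loci


def expected_noise_reads(reads_per_locus, noise_pct, heterozygous=False):
    """Expected reads of a noise product that occurs at noise_pct of its parent allele."""
    # Assumption: a heterozygous locus splits its reads evenly over both alleles.
    allele_reads = reads_per_locus / 2 if heterozygous else reads_per_locus
    return allele_reads * noise_pct / 100.0


# At the minimum of 250 reads per locus, a 0.5% noise product of one
# heterozygous allele is expected in fewer than one read.
low = expected_noise_reads(250, 0.5, heterozygous=True)
# At the recommended 1000 reads per locus the same product yields a few reads.
recommended = expected_noise_reads(1000, 0.5, heterozygous=True)
```

Under these assumptions, a noise product at 0.5% of its parent allele is effectively unobservable at 250 reads per locus, which is consistent with the recommendation of a higher coverage for reference samples.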

3.1.4. Infrequent alleles

Depending on the composition of the reference database, occasionally alleles will be encountered that are not included in the database. BGPredict can predict the noise from stutter or other repeating elements, but correction of other types of noise (like low-level SNPs caused by sequence errors) is not possible for these infrequent alleles.

We therefore recommend obtaining BGEstimate noise profiles for as many alleles as possible, while retaining good quality of these noise profiles. Several filtering criteria can be applied, such as the minimum number of different heterozygous genotypes per allele, the minimum number of samples per allele and the minimum number of homozygous samples per allele. The effect of increasing the stringency of the filtering criteria on the number of retrieved BGEstimate noise profiles for our 429 reference set is shown in Supplementary Table 2. The settings selected for use in this study are: at least two samples per allele (which ensures noise is not based on a single sample as that could be an outlier) that present at least three different heterozygous or at least one homozygous genotype (i.e., the samples can be three different heterozygotes or two homozygotes or one homozygote and one heterozygote).

When an allele at a heterozygous locus fails the criteria, the complete locus carrying this allele cannot be used for establishment of a noise profile since the noise cannot be attributed to either of the two alleles. Thus, for both alleles at a heterozygous locus no noise profile is extracted.
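The settings selected in this study can be written down as a small eligibility check. The sketch below is illustrative and uses hypothetical names; it does not model the additional requirement that the partner alleles in heterozygous genotypes must themselves fit the criteria.

```python
def eligible_for_profile(genotypes, min_samples=2, min_het_genotypes=3, min_hom=1):
    """Decide whether an allele has enough reference data for a BGEstimate profile.

    genotypes: one (allele_a, allele_b) tuple per reference sample carrying the
    allele at this locus; a homozygous sample has allele_a == allele_b.
    """
    n_hom = sum(1 for a, b in genotypes if a == b)
    distinct_het = {frozenset((a, b)) for a, b in genotypes if a != b}
    return len(genotypes) >= min_samples and (
        n_hom >= min_hom or len(distinct_het) >= min_het_genotypes
    )
```

One homozygote plus one heterozygote passes, as do three different heterozygotes, whereas a single homozygous sample or the same heterozygous genotype observed three times does not.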

3.1.5. Accuracy of noise reference database and stutter model

To verify the accuracy of the noise profiles obtained through BGEstimate and BGPredict, it can be useful to compare the average noise ratios with the noise ratios observed in individual homozygous samples. The noise ratios of all noise in all homozygous reference samples can easily be collected using the BGHomRaw tool. These data points can be plotted on top of a noise profile to inspect the consistency and variation in the noise ratios of various types of noise for each allele. In Fig. 3, the noise profile of the most frequent allele of D7S820 (CE10_TCTA[10]_-20T>A) is displayed, which has foremost a -1 stutter (CE10_TCTA[9]_-20T>A) in addition to a -1 nt slippage product at the A-stretch (CE9.3_TCTA[10]_-20T>-). The individual observations for the homozygous samples coincide nicely with the estimated noise profile ratios.

Fig. 3. Noise profile of D7S820 allele CE10_TCTA[10]_-20T>A. The noise ratio is shown for each systemic noise sequence observed with a noise ratio of 0.1% or higher. Individual observations in homozygous samples (above 0.5%) are displayed as circles. As expected, the most frequently observed noise sequence is the -1 stutter, but since the allele contains a single-nucleotide stretch of 9 A nucleotides, a considerable portion of the noise consists of sequences with slippage at this A-stretch (or a combination of the two).

Similarly, it is useful to compare the functions fitted by Stuttermodel to the data points to which they were fitted. Stuttermodel includes an option to write the raw data points to a separate output file, which can be visualised together with the fitted model as shown in Fig. 4 for D7S820. This example shows that the homozygous calls and the Stuttermodel estimation follow the same trend and that there is no discrepancy between forward and reverse reads. The same holds for the A-stretch (data not shown).

In the stutter model, fits with an R² score below 0.75 were rejected. Although this may seem a very low R² score, we obtained better results by including more fits than by excluding them, which would result in the inability of the stutter model to be used to filter and correct stutter for the respective repeat units at all.
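The rejection rule based on the R² score can be illustrated with a minimal least-squares sketch. A straight line is used here purely as a simplifying assumption; it is not claimed to be the functional form that Stuttermodel fits, and the function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (x, y); needs at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line; ys must not all be equal."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def fit_or_reject(repeat_lengths, stutter_pcts, min_r2=0.75):
    """Return (slope, intercept, r2), or None when the fit scores below min_r2."""
    slope, intercept = linear_fit(repeat_lengths, stutter_pcts)
    r2 = r_squared(repeat_lengths, stutter_pcts, slope, intercept)
    return (slope, intercept, r2) if r2 >= min_r2 else None
```

A rejected fit returns `None`, which mirrors the consequence described above: without an accepted fit, stutter for that repeat unit cannot be predicted at all.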

Fig. 4. Stuttermodel for the -1 stutter of D7S820. On the x-axis the length of the repeat is displayed (in nucleotides) and on the y-axis the -1 stutter noise ratio (as percentage of reads of the parent allele) is displayed. Each homozygous reference sample is displayed as a dot and the lines display the fitted functions used for calculating the expected stutter of each allele.

Fig. 5. Sequence profile of a single-source sample. Sequence profile of loci D18S51 and D19S433 of a single-source sample. A sequence profile displays the read count before correction (in purple bars) and shows the effects of noise filtering (light purple for the reads that are removed) and noise correction (with the noise reads added to the parent alleles in dark orange). When performing correction, it is possible that an allele gains reads because the noise reads originating from this allele are added, but loses reads at the same time since the noise of another allele in the profile includes reads of this allele. This overlapping part of added and removed reads is marked separately in light orange. This means that the original read count of an allele before correction is the combination of the purple and the light orange bar. The lines in the bars indicate the strand balance; the line is drawn near the top of the bar if the majority of reads of a sequence is on the forward strand, near the bottom of the bar if the majority of reads is on the reverse strand, and in the middle of the bar in the absence of strand bias. Sequences displayed in green in the graphs are the alleles that the software infers to be genuine alleles in the sample. These are also displayed in the table. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


3.2. Sample analysis

3.2.1. Allele calling, interpretation and visualisation

When a reference database has been created, one can proceed with the analysis of samples. FDSTools analyses sequencing data, calls alleles and interprets the data by correction for noise as inferred from the reference database. Results can be represented as a graphical sequence profile output and as an interactive profile report.

In Fig. 5, an example of a sequence profile of two loci of a single-source sample is displayed (generated by the command ‘fdstools vis sample’). A sequence profile displays the read counts before and after correction and visualises the effects of noise filtering and noise correction. A more detailed explanation of the interpretation of a sequence profile can be found in Supplementary Fig. 5.

The interactive sequence profile reports provide separate filtering options for the graphs and tables displayed (see Section 2.4.3). In the graphs, all alleles that are hidden by the filtering options are (optionally) aggregated as a separate bar (displaying the cumulative numbers of reads) with the label ‘other sequences’. In addition, we aggregate all singleton reads into ‘other sequences’ already in the first step of the analysis (using ‘fdstools tssv --minimum 2 --aggregate-filtered’), which has the additional benefits of speeding up subsequent analysis and decreasing data storage demand.
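The effect of aggregating low-count sequences can be shown with a short sketch. It mimics the behaviour described above in plain Python and is not the FDSTools code; the function name and example data are hypothetical.

```python
def aggregate_filtered(read_counts, minimum=2, label="other sequences"):
    """Keep sequences with at least `minimum` reads; pool the rest under one label."""
    kept = {}
    pooled = 0
    for sequence, count in read_counts.items():
        if count >= minimum:
            kept[sequence] = count
        else:
            pooled += count
    if pooled:
        kept[label] = pooled
    return kept


# Hypothetical read counts at one locus; two sequences are singletons.
locus_reads = {"TCTA[10]": 120, "TCTA[9]": 9, "TCTA[10]_errA": 1, "TCTA[10]_errB": 1}
aggregated = aggregate_filtered(locus_reads)
```

The total read count of the locus is preserved, while the number of distinct sequences that later steps have to process is reduced.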

3.2.2. Improving heterozygote balance through noise correction

The amplification of long STR alleles in the PCR is generally less efficient than that of shorter alleles and, in addition, long STR alleles suffer from a higher degree of stutter, resulting in reduced heterozygote balance between the two alleles [2]. Since FDSTools determines which ‘noise reads’ are derived from which parent alleles, these reads can (optionally) be added to the read counts of the parent alleles, which theoretically will improve the heterozygote balance.

When we examine the heterozygote allele balance in the 429 single-source reference samples, an improved heterozygote balance is indeed observed when the stutter reads are added to the read counts of the parent alleles (Table 2). Heterozygote balance was determined per locus by dividing the read counts for the less frequent alleles by those for the more frequent alleles, and taking the average of all 429 samples.
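The heterozygote balance calculation described above is simple enough to sketch directly. The function names and read counts below are hypothetical and serve only to show why returning noise reads to their parent alleles can raise the balance.

```python
def heterozygote_balance(reads_a, reads_b):
    """Read count of the less frequent allele divided by that of the more frequent one."""
    low, high = sorted((reads_a, reads_b))
    return low / high


def mean_balance(samples):
    """Average balance at one locus over (reads_a, reads_b) pairs, one pair per sample."""
    balances = [heterozygote_balance(a, b) for a, b in samples]
    return sum(balances) / len(balances)


# Hypothetical sample: the longer allele has fewer reads but lost more of them to stutter.
short_allele, long_allele = 1000, 800
short_recovered, long_recovered = 80, 150

before = heterozygote_balance(short_allele, long_allele)
after = heterozygote_balance(short_allele + short_recovered,
                             long_allele + long_recovered)
```

In this example the balance rises from 0.80 to about 0.88, because the longer allele, which stutters more, regains proportionally more reads.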

3.2.3. Mixture analysis

For the analysis of mixtures, noise correction may assist in identifying the alleles of a low minor contributor. We used 31 two-person mixtures with minor contributions of 50%, 20%, 10%, 5%, 1% and 0.5% to assess this expectation.

We varied the ‘percentage of locus’ threshold (Table 1) for calling alleles, which sets a limit to the mixture proportion. When no noise correction was applied the threshold was varied between 5.0% and 1.5%; when noise correction was applied, a lower

Table 2

Heterozygote balance for original, filtered and corrected data sets. The read counts for the less frequent alleles are divided by those for the more frequent alleles, and the average for all 429 single-source reference samples is taken.

| Data set \ locus | Amel | CSF1PO | D10S1248 | D12S391 | D13S317 | D16S539 | D18S51 | D19S433 | D1S1656 | D21S11 | D22S1045 | D2S1338 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uncorrected data | 0.83 | 0.88 | 0.82 | 0.79 | 0.88 | 0.87 | 0.85 | 0.84 | 0.88 | 0.89 | 0.85 | 0.77 |
| Filtered data without noise reads added to allele read count | 0.83 | 0.90 | 0.87 | 0.80 | 0.89 | 0.89 | 0.87 | 0.88 | 0.89 | 0.90 | 0.88 | 0.78 |
| Corrected data with noise reads added to allele read count | 0.83 | 0.91 | 0.89 | 0.84 | 0.90 | 0.91 | 0.89 | 0.90 | 0.91 | 0.91 | 0.93 | 0.80 |

| Data set \ locus | D2S441 | D3S1358 | D5S818 | D7S820 | D8S1179 | FGA | Penta D | Penta E | TH01 | TPOX | vWA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uncorrected data | 0.90 | 0.88 | 0.90 | 0.88 | 0.89 | 0.85 | 0.87 | 0.78 | 0.88 | 0.88 | 0.85 |
| Filtered data without noise reads added to allele read count | 0.90 | 0.90 | 0.91 | 0.89 | 0.90 | 0.87 | 0.87 | 0.78 | 0.88 | 0.89 | 0.88 |
| Corrected data with noise reads added to allele read count | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.89 | 0.88 | 0.81 | 0.89 | 0.90 | 0.90 |

Table 3

Average number of drop-in alleles and average drop-out percentage per sample for different ‘percentage of locus’ allele-calling thresholds.

a) Summary of drop-in and drop-out rates for various allele-calling thresholds. Columns give the minor contribution; each cell lists # drop-in / % drop-out.

| Analysis method and threshold | 50%: 600 pg | 20%: 240 pg | 10%: 120 pg | 5%: 60 pg | 1%: 60 pg | 0.5%: 60 pg |
| --- | --- | --- | --- | --- | --- | --- |
| Without correction, 5.0% | 0.7/0.0% | 0.8/2.1% | 1.2/25.0% | 1.0/33.1% | 1.7/39.9% | 0.8/40.8% |
| Without correction, 2.5% | 4.3/0.0% | 3.8/0.0% | 4.3/2.5% | 4.2/21.6% | 3.5/38.3% | 2.3/40.4% |
| Without correction, 2.0% | 5.3/0.0% | 5.3/0.0% | 5.3/0.5% | 4.8/15.3% | 4.2/38.1% | 2.8/40.1% |
| Without correction, 1.5% | 7.7/0.0% | 7.7/0.0% | 7.0/0.0% | 6.0/10.8% | 6.0/37.4% | 4.7/39.7% |
| With correction, 3.0% | 0.0/0.0% | 0.3/0.0% | 0.3/5.7% | 0.0/30.3% | 0.7/41.1% | 0.3/41.1% |
| With correction, 2.5% | 0.3/0.0% | 0.3/0.0% | 0.5/1.4% | 0.0/25.1% | 0.7/40.6% | 0.3/41.1% |
| With correction, 2.0% | 0.7/0.0% | 1.2/0.0% | 1.8/0.5% | 0.8/17.4% | 0.8/40.6% | 1.2/40.8% |
| With correction, 1.5% | 1.7/0.0% | 2.3/0.0% | 2.5/0.0% | 2.0/10.8% | 2.7/39.9% | 2.3/40.8% |
| With correction, 1.0% | 4.0/0.0% | 5.2/0.0% | 4.5/0.0% | 4.5/3.8% | 5.7/37.8% | 3.2/40.6% |
| With correction, 0.5% | 16.3/0.0% | 17.3/0.0% | 14.0/0.0% | 15.5/2.4% | 14.0/30.3% | 11.0/37.4% |

b) Categorised drop-out rates when using 1.5% allele-calling threshold (with correction)

| Minor contribution | 20%: 240 pg | 10%: 120 pg | 5%: 60 pg | 1%: 60 pg | 0.5%: 60 pg |
| --- | --- | --- | --- | --- | --- |
| Alleles unique to the minor (homozygous) | 0.0% | 0.0% | 0.0% | 91.3% | 100.0% |
| Alleles unique to the minor (heterozygous) | 0.0% | 0.0% | 31.6% | 98.1% | 99.4% |
| Alleles unique to minor | 0.0% | 0.0% | 27.0% | 97.2% | 99.4% |
| All alleles of the minor | 0.0% | 0.0% | 18.5% | 67.7% | 69.3% |
| All alleles of the major | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
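Drop-in and drop-out figures of the kind reported in Table 3 can be derived by comparing the called alleles of a sample with the alleles expected from its known contributors. The sketch below rests on assumptions that the table does not state: alleles are represented as (locus, sequence) pairs, and drop-out is expressed relative to all expected alleles in the sample. The function names and example data are hypothetical.

```python
def drop_in_count(called, expected):
    """Number of called alleles that none of the known contributors carries."""
    return len(set(called) - set(expected))


def drop_out_pct(called, expected):
    """Percentage of expected alleles that were not called."""
    expected = set(expected)
    missed = expected - set(called)
    return 100.0 * len(missed) / len(expected)


# Hypothetical two-locus example with alleles as (locus, sequence) pairs.
expected = {("D7S820", "TCTA[10]"), ("D7S820", "TCTA[11]"),
            ("TH01", "AATG[6]"), ("TH01", "AATG[9]")}
called = {("D7S820", "TCTA[10]"), ("D7S820", "TCTA[11]"),
          ("TH01", "AATG[6]"), ("TH01", "AATG[8]")}

n_drop_in = drop_in_count(called, expected)
pct_drop_out = drop_out_pct(called, expected)
```

Restricting `expected` to the alleles unique to the minor contributor would give a categorised rate of the kind listed in part b of the table.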
