FDSTools: A software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise


Research paper


Jerry Hoogenboom^a,b,*,1, Kristiaan J. van der Gaag^a,b,1, Rick H. de Leeuw^a, Titia Sijen^b, Peter de Knijff^a, Jeroen F.J. Laros^a

^a Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC, The Netherlands

^b Division Biological Traces, Netherlands Forensic Institute, Laan van Ypenburg 6, 2497 GB, The Hague, The Netherlands

ARTICLE INFO

Article history:

Received 26 September 2016

Received in revised form 31 October 2016

Accepted 23 November 2016

Available online 27 November 2016

Keywords:

Forensic science

Next generation sequencing

Massively parallel sequencing

Short tandem repeat

Powerseq

FDSTools

Software

ABSTRACT

Massively parallel sequencing (MPS) is on the verge of broad-scale application in forensic research and casework. The improved capability to analyse evidentiary traces representing unbalanced mixtures is often mentioned as one of the major advantages of this technique. However, most of the available software packages that analyse forensic short tandem repeat (STR) sequencing data are not well suited for high-throughput analysis of such mixed traces. The largest challenge is the presence of stutter artefacts in STR amplifications, which are not readily discerned from minor contributions. FDSTools is an open-source software solution developed for this purpose. The level of stutter formation is influenced by various aspects of the sequence, such as the length of the longest uninterrupted stretch occurring in an STR. When MPS is used, STRs are evaluated as sequence variants that each have particular stutter characteristics which can be precisely determined. FDSTools uses a database of reference samples to determine stutter and other systemic PCR or sequencing artefacts for each individual allele. In addition, stutter models are created for each repeating element in order to predict stutter artefacts for alleles that are not included in the reference set. This information is subsequently used to recognise and compensate for the noise in a sequence profile. The result is a better representation of the true composition of a sample. Using Promega Powerseq™ Auto System data from 450 reference samples and 31 two-person mixtures, we show that the FDSTools correction module decreases stutter ratios above 20% to below 3%. Consequently, much lower levels of contributions in the mixed traces are detected. FDSTools contains modules to visualise the data in an interactive format allowing users to filter data with their own preferred thresholds.

1. Introduction

Analysis of Short Tandem Repeats (STRs) has been a successful forensic tool in the past two decades. The comparison of STR profiles from forensic DNA evidentiary traces with reference samples and DNA databases has provided essential information in many forensic cases [1]. Standard practice is to use Capillary Electrophoresis (CE) to analyse STR length variation. In recent years, Massively Parallel Sequencing (MPS) was introduced as a new method to analyse STRs and other forensic DNA markers [2,3].

MPS enables the simultaneous detection of both length and sequence variation of STRs, which increases the discriminatory value substantially [4–6]. The output of CE consists of peaks reflecting fluorescent signal intensities with their own respective shapes and peak heights. The output of MPS data analysis consists simply of read counts of the observed sequences. Both methods can suffer from the occurrence of PCR artefacts such as STR stutters [7]. This especially complicates the analysis of STR profiles coming from multiple contributors, which is common in forensic evidentiary traces [8]. The level of stutter formation depends on a number of distinct aspects of the sequence, including the A/T content of the repeat unit and the number of consecutive repeat units occurring in an STR [9]. Since any specific STR length identified by CE can consist of multiple different sequences, these CE-identified length variants show a larger variation in measured stutter percentage than individual sequences analysed through MPS. This decreased variation in stutter percentage for MPS STR data may aid in the interpretation of mixtures [2], as it allows for a better prediction of stutter behaviour, which can be used to filter the data for stutter products. Existing software packages for the analysis of STR sequencing data [10–12] do not support extensive filtering and correction of systemic PCR and/or sequencing errors and therefore seem less suited for analysis of mixed DNA samples.

This prompted us to develop a software package that harbours the following features: 1) characterisation and correction of noise in the sequencing data caused by PCR stutter or other systemic PCR and/or sequencing errors; 2) visualisation of sequencing data as comprehensive profiles; 3) filtering of data in graphs and tables with user-definable thresholds and 4) open-source accessibility. Forensic DNA Sequencing Tools (FDSTools) is available via the Python Package Index (either by manual installation or by using the command ‘pip install fdstools’). We assess the performance of FDSTools on 31 two-person mixtures genotyped via the Promega Powerseq™ Auto System for which we first generated a reference dataset of 450 samples.

* Corresponding author.

E-mail addresses: j.hoogenboom@nfi.minvenj.nl (J. Hoogenboom), k.van.der.gaag@nfi.minvenj.nl (K.J. van der Gaag), r.h.de_leeuw@lumc.nl (R.H. de Leeuw), t.sijen@nfi.minvenj.nl (T. Sijen), p.de_knijff@lumc.nl (P. de Knijff), j.f.j.laros@lumc.nl (J.F.J. Laros).

1 These authors contributed equally.

http://dx.doi.org/10.1016/j.fsigen.2016.11.007

Forensic Science International: Genetics

journal homepage: www.elsevier.com/locate/fsig

2. Materials and methods

2.1. Sample preparation

PCR products and sequencing libraries were prepared as described previously [2] using a prototype Promega Powerseq™ Auto System containing 23 STRs and amelogenin. A set of 450 Dutch samples [13] and 31 two-person mixtures were amplified and sequenced. The mixtures consisted of three combinations of two donors selected randomly from a pool of unrelated individuals, which were mixed in different ratios. The minor components in the mixtures contributed 0.5% (six mixtures), 1% (six mixtures), 5% (four mixtures), 10% (six mixtures), 20% (six mixtures) and 50% (three mixtures).

Since the mixtures were used to test the performance of the software and also to determine analysis thresholds that are fit for purpose, we balanced the influence of varying DNA inputs in the PCR against increased drop-out due to low DNA input. This was achieved by using at least 60 pg of the minor component in the 0.5%, 1% and 5% mixtures, resulting in a total DNA input of 12 ng, 6 ng and 1.2 ng, respectively (60 pg resulted in less than 20% drop-out in the validation of Powerplex 6C [14]). The same total DNA input of 1.2 ng was used for the 5%, 10%, 20% and 50% mixtures (resulting in 120 pg and 240 pg of the minor components in the 10% and 20% mixtures, respectively). The DNA input was 0.5 ng for single donor samples.
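The input amounts quoted above follow from simple proportions. The sketch below reproduces that arithmetic; the function names are hypothetical and are not part of FDSTools.

```python
def total_input_ng(minor_pg, minor_fraction):
    """Total DNA input (ng) needed for the minor donor to contribute minor_pg."""
    return minor_pg / minor_fraction / 1000.0


def minor_input_pg(total_ng, minor_fraction):
    """Amount of minor-donor DNA (pg) in a mixture with the given total input."""
    return total_ng * 1000.0 * minor_fraction


# 60 pg of minor component in the 0.5%, 1% and 5% mixtures
for fraction in (0.005, 0.01, 0.05):
    print(fraction, round(total_input_ng(60, fraction), 3), "ng total")

# 1.2 ng total input in the 10% and 20% mixtures
for fraction in (0.10, 0.20):
    print(fraction, round(minor_input_pg(1.2, fraction), 3), "pg minor")
```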

The genotypes of the donors used in the mixtures were known, which enables the identification of drop-in and drop-out allele calls. Paired-end sequencing data of all amplicons was generated using the MiSeq® Sequencer (Illumina).

2.2. Initial data processing

In Fig. 1, the main tools of the FDSTools package and their role in the data analysis pipeline are displayed. The tools can be split into three functional groups: tools for reference database creation, tools for reference database curation (data quality assessment) and tools for case sample filtering and data interpretation. In addition, the package contains initial data processing tools such as TSSV [10] that are common to reference database samples and case samples.

Fig. 1. Flowchart of the analysis process, showing the main tools (blue rectangles) of the FDSTools package and their roles in the data analysis pipeline. The output of each tool can be visualised using the Vis tool (not shown). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

2.2.1. Paired-end read merging

Using paired-end sequencing, forward and reverse strand molecules of each amplicon were sequenced from both ends. The first 300 nucleotides from either end were obtained. These read pairs were merged into a consensus read by aligning the read pair such that the largest possible overlap is obtained while allowing for up to 33% mismatches in the overlapped region. Most amplicons were about 300 base pairs in length and provided fully complementary read pairs.

With STR amplicons that are longer than 300 bp, a problem may occur when both reads end in the middle of the STR structure and the pair may be merged into a truncated STR sequence. A modified version of FLASH 1.2.11 [15] (available via github.com/Jerrythafast/FLASH-lowercase-overhang) was used to mark the bases that were not in the overlapped region in lowercase in the consensus read. This enables detection of truncated STR sequences in downstream analysis.
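The merging idea can be illustrated with a much simplified sketch. This is not the FLASH algorithm: it ignores base qualities, takes the overlapped bases from the first read, and the names `merge_pair` and `min_overlap` are assumptions made for illustration only. It tries the largest overlap first, accepts it when at most 33% of the overlapped positions mismatch, and writes the non-overlapped bases in lowercase.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, max_mismatch=0.33, min_overlap=10):
    """Merge read1 with the reverse complement of read2 (simplified sketch).

    Overlaps are tried from largest to smallest. Bases outside the
    overlapped region are returned in lowercase, overlapped bases in
    uppercase. Returns None if no acceptable overlap is found.
    """
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch * overlap:
            left = read1[:-overlap].lower()   # only covered by read1
            right = mate[overlap:].lower()    # only covered by read2
            return left + tail.upper() + right
    return None
```

A read pair that only overlaps inside a repetitive stretch can satisfy the mismatch limit at a shifted position, which is how a truncated STR sequence arises; the lowercase marking lets downstream tools see which bases were confirmed by both reads.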

2.2.2. Linking reads to loci and alleles

The merged reads are linked to specific loci and alleles by the TSSV tool, which is a wrapper around a simplified version of the TSSV [10] program called TSSV-Lite. TSSV links reads to loci by scanning the reads for the sequences flanking the STR loci used. The flanking sequences of each locus, that usually represent the most 5′ nucleotides of the primers, are provided to FDSTools in a library file, together with various other details about the loci used. Supplementary File 1 represents the library file used in this study. The file contains a description for the contents of each section.

Each read is scanned for these flanking sequences by computing alignments. In this study, the flanking sequences were 18 nucleotides in length and two substitutions (or two inserted or deleted bases) per flank were allowed in the alignment. Reads are categorised as ‘unrecognised’ if no flanking sequence is found. Furthermore, both flanking sequences are required to have at least one uppercase letter, which ensures that overlapped reads that are potentially truncated are categorised as ‘unrecognised’ as well. Reads in which only one flanking sequence is found with at least one uppercase letter get linked to a locus but flagged as ‘no start’ or ‘no end’ depending on whether the left or right flank is missing, respectively (optionally, these reads can be written to separate fasta or fastq files).
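The categorisation rules can be sketched as follows. This is an illustration under simplifying assumptions, not the TSSV-Lite implementation: it allows substitutions only (no insertions or deletions), scans by brute force instead of computing alignments, and uses hypothetical function names. The label ‘recognised’ for reads with both flanks is also chosen here for illustration.

```python
def find_flank(read, flank, max_subs=2):
    """Return the start of the first window that matches `flank` with at
    most `max_subs` substitutions and contains an uppercase base, or -1."""
    size = len(flank)
    upper = read.upper()
    for start in range(len(read) - size + 1):
        window = upper[start:start + size]
        subs = sum(a != b for a, b in zip(window, flank))
        has_upper = any(ch.isupper() for ch in read[start:start + size])
        if subs <= max_subs and has_upper:
            return start
    return -1


def categorise(read, left_flank, right_flank):
    """Classify a merged read by which flanking sequences were found."""
    has_left = find_flank(read, left_flank) >= 0
    has_right = find_flank(read, right_flank) >= 0
    if has_left and has_right:
        return "recognised"
    if has_right:
        return "no start"   # left flank missing
    if has_left:
        return "no end"     # right flank missing
    return "unrecognised"
```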

The main output of TSSV is a text file with tab-separated values. The file contains one line for every unique sequence of each locus. The columns include the name of the locus, the sequence, and the number of reads carrying this particular sequence. Read counts are given separately for the forward and reverse strand.

TSSV includes additional options for filtering sequences that are seen too few times and sequences with a length outside a given range (e.g., primer-dimers). This range can be specified separately for each locus. Furthermore, filtered sequences can be aggregated into a single ‘other sequences’ category for each locus. In this project, only singletons (i.e., sequences with only one read) were aggregated to the ‘other sequences’ category.
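As an illustration of the aggregation step, the sketch below folds sequences with fewer than a minimum number of reads into one ‘other sequences’ row per locus. The row layout (locus, sequence, forward reads, reverse reads) and the function name are assumptions for this example and do not describe the actual TSSV file format or options.

```python
from collections import defaultdict


def aggregate_rare(rows, min_reads=2):
    """Keep rows with at least `min_reads` total reads and sum the rest
    into a single 'other sequences' row per locus."""
    kept = []
    other = defaultdict(lambda: [0, 0])
    for locus, sequence, forward, reverse in rows:
        if forward + reverse >= min_reads:
            kept.append((locus, sequence, forward, reverse))
        else:
            other[locus][0] += forward
            other[locus][1] += reverse
    for locus, (forward, reverse) in other.items():
        kept.append((locus, "other sequences", forward, reverse))
    return kept
```

With `min_reads=2`, only singletons are aggregated, which corresponds to the setting described for this project.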

2.3. Building a reference database

One function of FDSTools is the building of a reference database. Such a database can be used to obtain estimates of recurring allele-specific systemic noise. Here, ‘noise’ refers to the complete collection of sequences observed in a sample, except the sample’s true allelic sequences. Noise includes any artefact deriving from the PCR as well as the sequencing (such as PCR stutter or single-nucleotide errors). Additionally, based on the reference data a statistical model can be derived that aims to predict stutter ratios for alleles not present in the reference set.

The creation of a reference database involves various tools included in the FDSTools package, which will be discussed in the next sections. In addition to these separate tools, FDSTools offers the Pipeline tool, which conveniently integrates the entire data analysis pipeline. Users are advised to use Pipeline as it removes the complexity of having to run several separate tools and to combine their output. Pipeline takes a simple configuration file containing the analysis parameters and automatically runs the appropriate tools.

Building a reference database is a two-phase process. In the first phase, the reference samples are analysed in a global manner to identify their alleles and reject those samples in which the alleles are not readily identified. In the second phase, the systemic noise of each of these alleles is analysed in detail.

2.3.1. Allele calling for reference samples

Determining the alleles of single donor reference samples is a fairly straightforward process because these generally represent the one or two most abundant sequences for any locus. FDSTools includes Allelefinder to call alleles this way. It is applied after Stuttermark, which is described below. A number of thresholds, outlined in Fig. 2, are used to guard against including alleles of potential low-level contaminations. For heterozygous loci, a second allele is only called if it passes the allele threshold, which is defined in this project as 30% of the read count of the most frequent allele at the same locus. As we expect no stutter above 30% [2], this threshold separates the alleles from noise. No alleles are called at a locus when additional sequences occur that have a read count below the allele threshold but above the noise threshold (which is defined as 15% of the most frequent allele in this project) or if a third sequence passes the allele threshold. If more than two loci in the same sample fail to give a result for these reasons, the overall quality of the sample is considered too poor to report any alleles. Additionally, Allelefinder can be configured to call at most one allele at haploid loci.

Fig. 2. Thresholds used by Allelefinder to call alleles in reference samples. Sequence variants with a read count above the allele threshold are called as alleles. The four lighter-shaded bars represent stutter variants (as recognised by Stuttermark), which are ignored by Allelefinder.
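The threshold logic can be summarised in a short sketch. It is an illustration of the rules as stated in the text, not the Allelefinder source code; the function names are hypothetical, and treating a read count exactly equal to the allele threshold as passing is an assumption.

```python
def call_locus(read_counts, stutter=frozenset(), allele_frac=0.30, noise_frac=0.15):
    """Call alleles at one locus of a single-donor reference sample.

    read_counts maps sequence -> read count; sequences listed in `stutter`
    (as marked beforehand) are ignored. Returns the called alleles, or an
    empty list if the locus is rejected.
    """
    counts = {seq: n for seq, n in read_counts.items() if seq not in stutter}
    if not counts:
        return []
    top = max(counts.values())
    alleles = [seq for seq, n in counts.items() if n >= allele_frac * top]
    # Sequences between the noise and allele thresholds make the locus ambiguous.
    ambiguous = [seq for seq, n in counts.items()
                 if noise_frac * top < n < allele_frac * top]
    if len(alleles) > 2 or ambiguous:
        return []
    return sorted(alleles, key=counts.get, reverse=True)


def call_sample(loci, max_failed_loci=2):
    """Call all loci; report nothing if more than two loci fail."""
    calls = {locus: call_locus(counts) for locus, counts in loci.items()}
    failed = sum(1 for alleles in calls.values() if not alleles)
    return {} if failed > max_failed_loci else calls
```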

The three potential pitfalls are 1) PCR stutter artefacts that exceed the noise threshold; 2) strong read count imbalance for heterozygous alleles, which may be the result of e.g., primer-site sequence variants and 3) autosomal trisomy, which is rare. To deal with the problem of stutter, each sample was analysed with Stuttermark [2] before calling alleles. With Stuttermark, sequences that are in a stutter position of another sequence while having a read count below a user-supplied percentage with respect to the other sequence are marked as ‘stutter’. Sequences that have a read count that is too high to be explained by stutter alone will not be marked as ‘stutter’, as they may coincide with a genuine allele. The thresholds used here were 30% for −1 stutter (loss of a repeat unit) and 10% for +1 stutter (gain of a repeat unit). For −2 stutter products, a 30% threshold of the −1 stutter product is used. Sequences that are marked as ‘stutter’ are completely ignored by Allelefinder.

Allelefinder produces the list of alleles and a report detailing for which samples and loci allele calling is rejected and for which reasons.

2.3.2. Estimating average allele-specific systemic noise

For each allele, a profile of recurring systemic noise, including PCR stutter products as well as any other ‘side products’, can be generated based on the reference data. Noise profiles are always computed separately for forward and reverse reads, because strand bias may exist in the sequencing technology used. Profiles are also computed separately for each locus, under the assumption that noise production is not influenced by alleles of other loci. The level of noise is expressed as the number of noise reads as a percentage of the number of reads of the parent allele. In the context of PCR stutter analysis, this quantity is often referred to as the ‘stutter ratio’, despite the representation as a percentage of the parent allele. We use the generalised term ‘noise ratio’ (also represented as a percentage of the parent allele) to account for all other systemic noise as well.

$$\text{Noise ratio} = \frac{\text{Noise reads}}{\text{Allele reads}} \times 100$$

In homozygous samples, the noise ratio can be calculated by dividing the number of reads of a non-allelic sequence by the number of reads of the allele. Allele-specific noise profiles are readily computed from homozygous samples carrying this allele by scaling the read counts in each sample such that the parent allele is 100 and averaging the noise ratios for each noise sequence. These per-allele noise statistics and other statistics, such as the standard deviations of the noise ratios, can be obtained using BGHomStats. In heterozygous samples the extraction of noise sequences is more complex, because it has to be determined which proportions each allele contributed to the observed noise sequences. We assume that noise in heterozygous samples corresponds to the sum of the noise profiles of the two alleles, after the application of a scaling correction to account for differences in the amount of each allele amplified. This is needed as even for heterozygous allele pairs, PCR efficiency may vary due to primer binding site sequence variation or STR length [16]. To extract noise from heterozygous reference samples an iterative approach was taken and implemented in the BGEstimate tool in FDSTools.
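For the homozygous case, the computation can be sketched in a few lines. This is an illustration of the scaling and averaging described above, not the BGHomStats implementation; the function names and the dictionary-based data layout are assumptions, and strand separation is omitted for brevity.

```python
from statistics import mean, stdev


def noise_ratios(sample, allele):
    """Noise ratio (% of the parent allele) of every non-allelic sequence."""
    parent = sample[allele]
    return {seq: 100.0 * n / parent for seq, n in sample.items() if seq != allele}


def homozygous_profile(samples, allele):
    """Average noise ratio and standard deviation per noise sequence over
    homozygous samples (each a dict of sequence -> read count)."""
    per_sample = [noise_ratios(sample, allele) for sample in samples]
    sequences = set().union(*per_sample)
    profile = {}
    for seq in sequences:
        # A sequence absent from a sample contributes a ratio of zero.
        values = [ratios.get(seq, 0.0) for ratios in per_sample]
        spread = stdev(values) if len(values) > 1 else 0.0
        profile[seq] = (mean(values), spread)
    return profile
```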

In essence, the algorithm, which is discussed in more detail in Supplementary Text 1, seeks a non-negative least squares solution to the matrix equation AP = C. In this equation, C is an N × M matrix of constants derived from the read counts in the reference samples, A is an N × N matrix summarising the allele balance in the samples, and P is an N × M matrix containing the estimated profiles of systemic noise. N is the number of unique genuine alleles among the reference samples and thus also the number of profiles produced and M is the total number of unique sequences observed.

Matrix C is computed once at the start of the algorithm. Each row in C corresponds to one allele and contains the sum of the read counts of all samples that have that particular allele, after scaling the allele to 100 reads for homozygous samples and 50 reads for heterozygotes. The noise profiles in P are initialised with the assumption that no systemic noise is present, i.e., all elements are set to 0, except for the elements that correspond to the actual alleles, which are set to 100.

The algorithm then proceeds by repeatedly re-estimating the allele balance matrix A while reading cross-contributions between the alleles from the current profiles P and subsequently re-estimating P by finding a non-negative least squares optimal solution to AP = C. The values thus obtained in P are the average noise ratios of all observed systemic noise for all alleles (i.e., each row in P contains the noise profile of one allele).

To avoid noise from one allele being incorporated in the noise profile of another allele, a minimum of three different heterozygous genotypes per allele was used in this study. A threshold can be set for the minimal read count of noise to consider and the minimal percentage (we used 80%) of reference samples with the same allele which should contain the same noise before it is included in the noise profile. Each of these parameters can be set using various options of the ‘fdstools bgestimate’ command.

2.3.3. Relating the amount of stutter to repeat length

With the methods outlined above, profiles of systemic noise were obtained for each allele present in the reference set. However, one would also like to be able to filter and correct the noise originating from alleles that are not (yet) included in the reference set, as case samples may be encountered that contain alleles for which no reference sample was available. For this purpose, we developed a method to predict the sequence and corresponding amount of PCR stutter artefacts that would be produced for any allele of a given locus. Note that this method does not predict noise other than noise resulting from STR stutter or single nucleotide stretches.

Previous studies have shown that the amount of stutter is strongly correlated with the length of the repeated sequence [17] and even more so with the number of consecutive repeat units [2,18]. The FDSTools tool Stuttermodel seeks to fit polynomial functions to the repeat length and stutter ratio in homozygous reference samples. Stuttermodel scans each of the alleles for all positions where a particular repeat unit (e.g., the sequence ‘AGAT’) is repeated and records the length of this repeat, as the number of nucleotides, including incomplete repeats at the beginning or end of the repeated stretch. For each sample with this allele, the number of noise reads that lack exactly one repeat is counted. Reads that combine the loss of one repeat with one or more other differences (e.g., substitutions, or stutter in another stretch of repeats in the same allele) are included in this count. The counts thus obtained are used to compute the noise ratios of individual stutter sites and a polynomial function is fitted to quantify the relationship between the length of the repeat and the stutter ratio.


This analysis is repeated for each unique repeat unit of a length between one and a configurable maximum number of nucleotides (inclusive), treating cyclically equivalent units (e.g., ‘ATAG’ and ‘AGAT’) and their respective reverse complements (e.g., ‘CTAT’ and ‘ATCT’) synonymously. The amount of +1 stutter, −2 stutter etc. is analysed the same way.
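One way to treat cyclically equivalent units and their reverse complements synonymously is to map each unit to a canonical representative. The sketch below picks the alphabetically smallest rotation over both strands; this choice of representative and the function names are assumptions for illustration, not a description of how Stuttermodel stores repeat units.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(unit):
    """All cyclic rotations of a repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def canonical_unit(unit):
    """Smallest rotation of the unit or of its reverse complement."""
    revcomp = unit.translate(_COMPLEMENT)[::-1]
    return min(rotations(unit) | rotations(revcomp))


# The four units named in the text map to the same representative.
print({canonical_unit(u) for u in ("ATAG", "AGAT", "CTAT", "ATCT")})
```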

Because different loci behave differently in stutter formation, a separate function is fitted for each locus. Additionally, a polynomial function is fitted to all data at once, which is used to predict stutter in alleles of loci for which insufficient reference data was available to fit a locus-specific function. Separate functions are fitted for the forward and reverse strands.

For each fitted function, Stuttermodel also determines the lower bound of the repeat length for which the function gives meaningful results. This lower bound is defined as the lowest repeat length for which the function produces a non-negative result and the function is non-decreasing. Below this threshold, and in any other points where the function value would be negative, the function value is set to zero.

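The quality of fit is assessed by computing the coefficient of determination,

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \quad \text{where} \quad \hat{y}_i = \begin{cases} f_i & \text{if } f_i \geq 0 \\ 0 & \text{if } f_i < 0 \end{cases}$$

with $y_i$ the noise ratios of the reference samples, $\bar{y}$ the mean, $f_i$ the polynomial function’s estimate of the noise ratio of sample $i$, and $\hat{y}_i$ the modified function value. The R² score will be close to one when the function is a good fit and lower otherwise.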

Stuttermodel supports fitting polynomial functions of any degree. To prevent over-fitting while still allowing a non-linear relationship, second-degree polynomials (with a minimum R² score) were used. In cases where the fit for one strand has an R² score above the threshold while the fit for the other strand scores below the threshold, both fits are rejected to prevent unintended introduction of strand bias by filtering stutter on only one strand.
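The fitting and scoring steps can be illustrated with NumPy. This is a minimal sketch under stated assumptions rather than the Stuttermodel code: the function name is hypothetical, the data are assumed to be paired arrays of repeat lengths and stutter ratios for one locus and strand, and only negative function values are clamped to zero (the lower-bound rule for the non-decreasing region is left out).

```python
import numpy as np


def fit_stutter_model(lengths, ratios, degree=2):
    """Fit a polynomial to (repeat length, stutter ratio) pairs and score it.

    Returns the coefficients (highest power first) and the R^2 score
    computed on function values clamped at zero.
    """
    lengths = np.asarray(lengths, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    coeffs = np.polyfit(lengths, ratios, degree)
    fitted = np.polyval(coeffs, lengths)
    clamped = np.maximum(fitted, 0.0)          # negative predictions become zero
    ss_res = np.sum((ratios - clamped) ** 2)
    ss_tot = np.sum((ratios - ratios.mean()) ** 2)
    return coeffs, 1.0 - ss_res / ss_tot
```

A fit would then be accepted only if the score for both strands reaches the chosen minimum R², in line with the rule stated above.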

2.3.4. Curating the reference database

To make sure all reference samples were of good quality and all alleles were called correctly, they were put through the same analysis pipeline as case samples, thereby performing noise filtering and correction on the reference samples. It is important to note that these reference samples were previously genotyped by us in great detail using CE [13]. The remaining amounts of noise in each sample were assessed using BGAnalyse (described below) to identify potentially unsuitable reference samples that still passed the thresholds of Allelefinder. Any sample with a notably higher amount of remaining background was manually removed from the set of reference samples to prevent pollution of the noise profiles.

BGAnalyse was developed and employed to analyse the remaining noise after correction. For each locus and each sample, this tool calculates the least frequent (this can be a negative value because of over-correction), most frequent, and total noise as a percentage of the number of reads of the highest allele at each locus. These results are subsequently visualised to easily identify potentially problematic samples. In the visualisation, samples can be sorted by any of the calculated values or by coverage (total number of reads). Samples were subjected to manual inspection and any sample that exhibited non-stutter products with corrected read counts above 4% of the most frequent allele or above 2% of the total reads was rejected.
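The three per-locus statistics can be sketched as follows. The function name and the dictionary layout are hypothetical; the sketch only mirrors the definitions given in the text and is not the BGAnalyse implementation.

```python
def remaining_noise(corrected, alleles):
    """Lowest, highest and total remaining noise at one locus, each as a
    percentage of the read count of the highest allele.

    corrected maps sequence -> noise-corrected read count (may be negative
    after over-correction); alleles is the set of called alleles.
    """
    highest_allele = max(corrected[a] for a in alleles)
    pct = [100.0 * n / highest_allele
           for seq, n in corrected.items() if seq not in alleles]
    if not pct:
        return {"lowest": 0.0, "highest": 0.0, "total": 0.0}
    return {"lowest": min(pct), "highest": max(pct), "total": sum(pct)}
```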

2.4. Analysing case samples

The analysis of mock case samples was performed in a three-step process which is described in the following sections.

1. A prediction was made for the amount of stutter for each sequence in the sample, using the fitted polynomial functions obtained from running Stuttermodel on the reference samples. These predictions are used to extend the allele-specific noise profiles obtained from running BGEstimate on the reference samples.

2. The extracted noise profiles are used to filter and correct the noise in the case sample.

3. Alleles are called and the sample is subjected to manual interpretation.

Similar to the creation of a reference database, analysing case samples involves multiple tools discussed in the following sections. Pipeline offers a convenient way to automatically analyse a case sample with all tools discussed.

2.4.1. Predicting stutter amounts for unknown alleles

Because case samples may contain alleles that are not present in the reference samples, noise profiles for these alleles need to be predicted. FDSTools includes the BGPredict tool, which uses a previously created Stuttermodel file to predict the amounts of stutter artefacts for alleles not present in the reference data. BGPredict finds all sequences in the analysed case sample in which a particular repeat unit is repeated. The expected amount of stutter in this repeat is then computed using the corresponding fitted polynomial function from the Stuttermodel file. All possible combinations of stutter are taken into consideration when the frequencies of each stutter artefact are computed. The noise profiles created in this way are used to extend the noise profiles in the previously created BGEstimate file (a tool called BGMerge is included in FDSTools for this purpose).

2.4.2. Noise filtering and correction in case samples

To be able to filter systemic noise in case samples, one first needs to determine which alleles are likely present in the sample. To this end, the algorithm of BGEstimate is essentially reversed, i.e., the goal is now to solve for a in aP = c, where c is a row vector with the sample’s read counts for the M sequences in the noise profiles and a is a row vector with the estimated amount of each of the N profiles in the sample. P is the N × M matrix of noise profiles obtained from BGEstimate, extended with the predictions obtained from BGPredict. Solving for a is done in a non-negative least squares sense as before, giving estimated allele contributions that best fit the various sequences (alleles as well as noise) present in the sample.

Background-corrected read counts can then be computed by first subtracting the scaled profiles from the sample’s read counts

$$d \leftarrow c - aP$$

and then adding the total size of each profile to the corresponding allele, i.e.,

$$d_n \leftarrow d_n + a_n \sum_{m=1}^{M} P_{n,m}, \quad \forall n \in [1 \ldots N]$$

Note that d may have negative elements if the sample contains a lower amount of a certain sequence than was predicted by the profiles of its dominant alleles.
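The two steps above can be written compactly with NumPy and SciPy. This is a sketch of the stated equations, not the BGCorrect source: the function name is hypothetical, strands are not treated separately, and it assumes that the first N columns of P correspond to the N alleles in the same order as the rows of P.

```python
import numpy as np
from scipy.optimize import nnls


def correct_sample(c, P):
    """Estimate allele contributions a (aP ~ c, a >= 0) and return the
    background-corrected read counts d.

    c: read counts of the M sequences; P: N x M matrix of noise profiles,
    with column n (n < N) assumed to hold allele n itself.
    """
    c = np.asarray(c, dtype=float)
    P = np.asarray(P, dtype=float)
    # aP = c is equivalent to P^T a^T = c^T, the form nnls expects.
    a, _residual = nnls(P.T, c)
    d = c - a @ P                        # remove the scaled profiles
    n_profiles = P.shape[0]
    d[:n_profiles] += a * P.sum(axis=1)  # return each profile's total to its allele
    return a, d
```

In this formulation the total read count is preserved whenever the profiles explain the sample exactly: reads removed from noise sequences reappear at their originating alleles.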

FDSTools offers BGCorrect to filter and correct background noise following the procedure outlined above. Given a sample data file (obtained from TSSV for example) and a file containing noise profiles, BGCorrect produces a copy of the sample data with additional columns giving the amounts of each sequence attributed to noise and the amounts of each sequence that would be recovered by noise correction (i.e., adding the noise to the originating allele). These values are given separately for the forward and reverse strand. Although the method by which BGCorrect computes them results in non-integer values, it was decided not to round these numbers to avoid unnecessary loss of precision. If necessary, these numbers can be rounded to integer values, thereby easing the interpretation as ‘read counts’ when presented in a graph or table in a report.

2.4.3. Allele calling for case samples

The naïve method of calling alleles that Allelefinder uses is not appropriate for case samples, since these may contain alleles of multiple contributors in different quantities. Therefore, calling alleles in case samples is done by computing various statistics based on the information of the detected sequences and subsequently setting interpretation thresholds on these statistics. For this, Samplestats was developed, which operates on and adds various columns to the output of BGCorrect. Samplestats automatically marks sequences as ‘allele’ using the thresholds outlined in Table 1.

Alleles can also be called while visualising the sample data; hence, FDSTools includes the Samplevis visualisation. By means of the interactive graphical user interface of Samplevis, the same set of thresholds as depicted in Table 1 are available to filter the visible sequences and to automatically call alleles. Thresholds can be specified separately for the graphs and for the tables. While the table displays the called alleles, less conservative settings may be used for the filtering of the corresponding graph to ensure visibility of alleles just below the allele-calling threshold. The results of changing the thresholds are immediately visible. Clicking a sequence in any of the graphs toggles its ‘allele’ status. This allows the user to manually add alleles to and remove alleles from the profile. A note is added to manually added alleles, stating that the allele is ‘User-added’. Similarly, if the user removes any alleles, the allele remains visible but a ‘User-removed’ note is added. In this way it remains easy to trace back exactly which alleles meet the thresholds and which ones were manually added and removed.

Samplestats can also be used to filter sequences using the same types of thresholds (albeit with more stringent threshold values than used for allele calling, as potential alleles should not be filtered out) and (optionally) aggregate the filtered sequences per locus to a single line categorised ‘other sequences’.

2.5. Visualisation

For visualisation of the data, FDSTools makes use of the JavaScript graphing library Vega [19]. Vega graphs can be embedded on a web page, exposing a JavaScript programming interface that allows for updating the graphs based on the user’s interaction with the web page. Vega can also run on Node.js, which allows it to be included in automated analysis pipelines to generate (static) image files.

FDSTools comes with Vega graph specifications and accompanying interactive web pages (HTML files) to visualise the output of each tool. The Vis tool can be used to obtain self-contained HTML files containing visualisations of various types of data files generated by the other tools. For example, Samplevis visualises a sample data file as a sequence profile and Profilevis visualises background noise profiles obtained from BGEstimate or BGPredict. A description of each visualisation can be found in Supplementary Table 1. When viewed in a web browser, the web page provides additional controls that allow the user to filter the data, switch between linear and logarithmic scales, or select different subsets of the data to visualise. The default values for the settings on the web page can be set when the HTML file is generated by the Vis tool.

The web pages also offer the option to save the displayed graphs as a Scalable Vector Graphics (SVG) or rasterised Portable Network Graphics (PNG) image, so that they can be imported into documents. Alternatively, the Vis tool can supply a raw Vega graph specification file (either with or without embedded data), which can then be used by Vega to generate SVG or PNG images directly on the command line.

Table1

InterpretationthresholdsforcasesamplesinSamplestatsandSamplevis.Sequencesthatmeeteitherthe‘Percentagecorrection’or‘Percentagerecovery’threshold(orboth)as wellasalltheotherthresholdswillbemarkedas‘allele’.Thesethresholdvaluesareevaluatedafternoisecorrection.The‘Allelecallingdefault’columnliststhedefault thresholdvaluesforcallingalleles.The‘Filteringdefault’columnliststhedefaultvaluesusedforﬁlteringdisplayedsequencesinSamplevisgraphs.

Threshold Description Allele

calling default

Filtering default

Totalreads Minimumnumberofreadsperallele.Non-systemic(andthusunﬁlterable)sequenceerrorsoccursporadically.Thisthreshold ensuresthataminimalamountofampliﬁedproductispresenttosupporttheallelecall.

30 5

Readsper strand

Minimumnumberofreadsperalleleforbothstrands.Thisthresholdcanbeusedtoexcludelowtemplatesequenceswithstrong strandbias.

1 0

Percentage ofmost frequent

Thenumberofreadsasapercentageofthenumberofreadsofthemostfrequentalleleatthelocus.Thisthresholdsetsalimittothe mixtureproportionsthatcanbeanalysedinmixedsamplesortotheallelebalanceinsampleswithasinglecontributor.

2% 0.5%

Percentage oflocus

Eachallelecontributesatleastthispercentagetothetotalnumberofreadsofthelocus.Withthisthreshold,aminimumcontribution percentagecanbeenforced.

1.5% 0%

Percentage correction

Thispercentagederivesfromthenumberofreadsafternoisecorrectionminusthenumberofreadsbeforecorrection,whichis dividedbythenumberofreadsbeforecorrection.Consequently,thepercentagecorrectionisnegativeifnoisecorrectionresultedina reductionofthereadcountofasequence.Therefore,withthisthresholdsetto0%,anysequencerepresentingnoisewillnotbecalled asanallele.

Tobeabletodetectallelesofminorcontributorsthatcoincidewithnoiseproductsforthemajorcontributor’salleles,the‘percentage recovery’thresholddescribedbelowisallowedtooverrulethisthreshold.

0% 0%

Percentage recovery

Thenumberofreadsaddedbynoisecorrectionasapercentageofthetotalnumberofreadsafternoisecorrection.Afternoise correction,atleastthispercentageofreadsmusthaveoriginatedfromcorrectednoise.Therationalebehindthisthresholdisthat onlyallelicsequenceswillhavesubstantialamountsofrecoveredreads.Whenanalleleofaminorcontributioncoincideswiththe stutterofanalleleofthemajorcontributor,noisewillbeextractedandaddedtothemajorcontributor’sparentalleleresultingina negativepercentagecorrection.Yet,sincetheminorcontributor’scontributiontothereadsalsoresultsinnoiseproductsthatare corrected,theallelewillreceiverecoveredreadsandapercentagerecovery>0%.Toallowthecallingofallelesforwhichnonoise proﬁleexists(ornonoisewasdetected)inthereferencedatabasethethresholdissetat0%bydefault.

0% 0%
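The two correction-based thresholds at the end of Table 1 reduce to simple ratios. The following Python sketch restates both definitions; the function and argument names are illustrative and are not part of FDSTools, and the read counts in the usage note are invented.

```python
def percentage_correction(reads_before: float, reads_after: float) -> float:
    """(reads after correction - reads before) / reads before, as a percentage."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_after - reads_before) / reads_before


def percentage_recovery(reads_added: float, reads_after: float) -> float:
    """Reads added by noise correction as a percentage of reads after correction."""
    if reads_after <= 0:
        raise ValueError("reads_after must be positive")
    return 100.0 * reads_added / reads_after


# Invented example: a minor-contributor allele that coincides with a stutter
# position of the major contributor's allele.
reads_before = 100          # observed reads for the sequence
reads_removed = 80          # reads reassigned to the major's parent allele
reads_added = 5             # reads recovered from this allele's own noise
reads_after = reads_before - reads_removed + reads_added

correction = percentage_correction(reads_before, reads_after)
recovery = percentage_recovery(reads_added, reads_after)
```

In this invented case the percentage correction is negative while the percentage recovery is positive, which is the situation the ‘percentage recovery’ threshold is meant to capture.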


3. Results and discussion

We developed FDSTools, a software package containing a suite of tools that can be used for the analysis of forensic MPS data. With these tools, FDSTools provides detailed insight into the quality of a sample and the noise profile of a certain allele (or sequence variant). In Supplementary Table 1, an overview of all tools currently available in the package is provided, of which a selection was described in more detail in Section 2.

To enhance the analysis of mixed samples, FDSTools uses a reference database to identify, extract and correct for PCR or sequencing noise such as stutter, with the aim of discerning low mixture proportions. Different STR amplification assays and different amplification protocols could result in different noise. It is therefore important to base the database for noise correction on references generated by a method that is representative of the casework samples to be analysed.

Note that noise cannot be corrected completely, because the level of noise varies between samples.

3.1. Reference database

Our reference samples were sequenced with an average coverage of 65,000 reads and a mode of about 45,000 reads. For the present study, a minimum coverage of 6000 reads per sample was required, which relates to an average of 250 reads per locus as 24 loci were co-amplified. For heterozygous loci, fewer than 250 reads per locus are not sufficient to quantify low amounts of noise accurately.

3.1.1. Reference sample curation

Since the reference database is used to filter and correct noise in case samples, it is essential that the reference samples contain no contaminants and that reference alleles are called correctly. Although all other steps can be performed automatically by FDSTools, a manual curation of samples in the reference database is needed. BGAnalyse was developed to facilitate this process by visualising potential outliers.

Allelefinder automatically rejected two out of the initial 450 samples which were clearly contaminated and three samples that had too low coverage to detect alleles reliably. Manual inspection of samples with a notably higher amount of remaining noise after correction in BGAnalyse resulted in the rejection of an additional 16 samples. Reasons for rejection were low-level contamination, low coverage and low sequencing quality. The interactive BGAnalyse visualisations displaying the remaining noise for the reference samples are available in Supplementary File 2a (before database curation) and 2b (after curation). For the majority of samples, the highest remaining noise variant in the complete profile did not exceed 3% of the number of reads of the highest allele at the locus, while without correction STR stutters can represent over 20%. For the remaining 429 samples, no drop-in or drop-out was observed when calling alleles using Allelefinder with the settings described in Section 2.3.1.

3.1.2. Extending noise profiles for noise correction

As described in Section 2.4.1, case samples may contain alleles which are not present in the reference database. In such cases, FDSTools resorts to noise prediction instead of noise estimation. A column in the output file of BGCorrect marks whether correction has been performed using data obtained from BGEstimate (if the allele was available in the reference database) or by using BGPredict (if not available in the reference database).

From the results of Stuttermodel it becomes evident that for simple STRs consisting of a single repeating element, or for long stretches of a specific repeating element within a complex STR, only few reference samples are needed to reliably fit a stutter model. However, when complex repeats consist of several repeating elements of which some show little length variation, correction using the stutter model is suboptimal, as exemplified by the predictions for D12S391. This STR locus consists of two repeat units: an AGAT repeat stretch of highly variable length and an ACAG repeat that is repeated 6 to 8 times for most individuals. Since Stuttermodel predicts the amount of stutter based on the repeat length, at least four different repeat lengths need to be available in homozygous reference samples to obtain a reliable fit. However, the set of reference samples used in this study only contained homozygotes with 6 to 8 repeats of ACAG, which is not sufficiently variable to obtain a reliable fit. Consequently, ACAG is omitted from the stutter model for D12S391, even though this repeat stutters up to 9% for the longer repeats (8 repeat units, data not shown). When BGEstimate does not obtain a background noise profile, BGPredict will not correct stutter in this repeat and thus stutters will remain present. As a last resort, BGPredict offers the possibility to use a stutter model based on data from all loci that have the same repeat unit sequence if no locus-specific fit is available. Supplementary Fig. 1 displays the stutter model obtained from the set of 429 reference samples, including the individual observations on which the model was based.

Combining BGEstimate and BGPredict (by using BGMerge) instead of using BGPredict alone is expected to reduce the noise remaining after correction, as the combined correction also corrects for noise other than stutters. This is confirmed when we determine the percentage of remaining noise (the reads representing remaining noise as a percentage of the reads for the most frequent allele at the locus) and plot the highest percentage and various percentiles (90th, 95th and 99th) (Supplementary Fig. 2a, b). The percentiles illustrate how often samples exhibit outlying noise sequence variants. At the 99th percentile, BGPredict alone retains on average 2.6% noise and the combined correction 2.4%. Also, the combined correction results in fewer overcorrected variants.
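The remaining-noise measure defined here can be sketched in a few lines of Python. The helper names are hypothetical and the nearest-rank percentile is an assumption made for illustration; the percentile method actually used for Supplementary Fig. 2 is not stated in the text.

```python
import math


def remaining_noise_pct(noise_reads: float, top_allele_reads: float) -> float:
    """Remaining noise reads as a percentage of the most frequent allele's reads."""
    return 100.0 * noise_reads / top_allele_reads


def nearest_rank_percentile(values, pct: float) -> float:
    """Nearest-rank percentile (an assumed convention, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented per-sample values: highest remaining noise (in %) at one locus.
per_sample_noise = [
    remaining_noise_pct(noise, top)
    for noise, top in [(5, 1000), (12, 1200), (8, 800), (30, 1000), (4, 2000)]
]
p90 = nearest_rank_percentile(per_sample_noise, 90)
highest = max(per_sample_noise)
```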

Thus, BGPredict can be used without BGEstimate with a slightly reduced accuracy in correction. Note that BGEstimate should not be used without BGPredict since alleles not included in the reference database will not be corrected, which can result in a combination of corrected and uncorrected alleles and remaining noise for the uncorrected alleles.

3.1.3. Reference database size and coverage

To test the effect of the sample size and type from which the reference database is built, we used the complete curated reference database of 429 samples and a random selection of 100 samples (both with combined BGEstimate and BGPredict correction, which was found to be slightly better as described in Section 3.1.2). Supplementary Fig. 2c, d display an overview of the most frequent and the total remaining noise at each locus after correction. The different percentiles of the reference samples are given to illustrate how often samples exhibit outlying noise sequence variants.

When comparing the results for the complete database with the results for the subset of 100 samples, the difference in remaining noise seems surprisingly small (Supplementary Fig. 2c, d). However, with a smaller database, fewer alleles will fit the criteria to create a BGEstimate noise profile and more alleles rely on noise prediction by BGPredict. Indeed, for the reference set of 429 samples, only 3.5% of the alleles are corrected using BGPredict. This percentage increases to 10.2% when the correction is based on the subset of 100 samples.

In a larger reference database more alleles will be observed. Supplementary Fig. 3 displays the alleles observed in the reference databases of 429 and 100 samples. To fit the criteria to create a BGEstimate noise profile, alleles need to be present as a homozygous genotype or be available as part of shared genotypes with at least three other alleles that must also fit these criteria. For the stutter model, only the homozygous genotypes are used. In the larger 429-sample database, more alleles fit these criteria than in the smaller 100-sample database.

To examine the effect of read coverage of the reference samples on noise profile analysis, we generated two subsets comprising samples with high or low coverage, specified as a total read count between 82,000 and 350,000 or between 8000 and 44,000, respectively. The high-coverage set comprised 71 samples; the low-coverage set 70. We noticed that in the low-coverage noise profiles, strand bias can occur, especially for the low-percentage noise, owing to single-strand drop-out of this noise. This is illustrated by the BGEstimate noise profiles for the CE10_TCTA[10]_-20T>A allele for locus D7S820 in Supplementary Fig. 4, in which forward and reverse reads are in good or reasonable balance for all seven noise sequences in the high-coverage sample set, while good balance is only seen for the two main noise sequences in the low-coverage set.

Since the most abundant noise after correction in a sample is usually in the range of 0.5–3% (for STR analysis), we recommend a coverage of at least 1000 reads per locus (which relates to a 24,000 total read coverage for our 24-locus amplification kit) for the samples of the reference database to obtain the most accurate noise estimates.
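The coverage figures quoted in Section 3.1 follow from simple arithmetic, sketched below in Python. The function names are hypothetical, and the expected-noise calculation is an illustrative assumption (an even split of reads between the two alleles of a heterozygote), not a formula taken from FDSTools.

```python
def reads_per_locus(total_reads: float, n_loci: int) -> float:
    """Average reads per locus when reads are spread evenly over all loci."""
    return total_reads / n_loci


def total_reads_needed(min_reads_per_locus: int, n_loci: int) -> int:
    """Total coverage needed to reach a given average per-locus coverage."""
    return min_reads_per_locus * n_loci


def expected_noise_reads(locus_reads: float, noise_pct: float,
                         heterozygous: bool = True) -> float:
    """Expected reads for a noise product at noise_pct of its parent allele.

    Assumption: a heterozygous locus splits its reads evenly over two alleles.
    """
    parent_reads = locus_reads / 2 if heterozygous else locus_reads
    return parent_reads * noise_pct / 100.0


minimum_per_locus = reads_per_locus(6000, 24)        # study minimum
recommended_total = total_reads_needed(1000, 24)     # recommended coverage
noise_at_minimum = expected_noise_reads(250, 0.5)    # under one read
noise_at_recommended = expected_noise_reads(1000, 0.5)
```

Under these assumptions, a 0.5% noise product at a heterozygous locus with 250 reads is expected to yield less than one read, which illustrates why such coverage cannot quantify low amounts of noise accurately.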

3.1.4. Infrequent alleles

Depending on the composition of the reference database, occasionally alleles will be encountered that are not included in the database. BGPredict can predict the noise from stutter or other repeating elements, but correction of other types of noise (like low-level SNPs caused by sequence errors) is not possible for these infrequent alleles.

We therefore recommend obtaining BGEstimate noise profiles for as many alleles as possible, while retaining good quality of these noise profiles. Several filtering criteria can be applied, such as the minimum number of different heterozygous genotypes per allele, the minimum number of samples per allele and the minimum number of homozygous samples per allele. The effect of increasing the stringency of the filtering criteria on the number of retrieved BGEstimate noise profiles for our 429-sample reference set is shown in Supplementary Table 2. The settings selected for use in this study are: at least two samples per allele (which ensures noise is not based on a single sample, as that could be an outlier) that present at least three different heterozygous genotypes or at least one homozygous genotype (i.e., the samples can be three different heterozygotes, two homozygotes, or one homozygote and one heterozygote).

When an allele at a heterozygous locus fails the criteria, the complete locus carrying this allele cannot be used for establishment of a noise profile since the noise cannot be attributed to any of the two alleles. Thus, for both alleles at a heterozygous locus no noise profile is extracted.
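The selection settings described in this section can be written as a small predicate. This is a minimal Python sketch under stated assumptions: the function and parameter names are hypothetical, and the sketch checks one allele in isolation, so it does not model the additional requirement that the partner alleles in heterozygous genotypes must themselves pass the criteria.

```python
def qualifies_for_noise_profile(n_homozygous: int,
                                het_partner_alleles: list,
                                min_samples: int = 2,
                                min_het_genotypes: int = 3,
                                min_homozygous: int = 1) -> bool:
    """Check the per-allele criteria used in this study (illustrative only).

    n_homozygous:        number of samples homozygous for the allele
    het_partner_alleles: partner allele of each heterozygous sample carrying
                         the allele (one entry per sample)
    """
    n_samples = n_homozygous + len(het_partner_alleles)
    n_het_genotypes = len(set(het_partner_alleles))
    if n_samples < min_samples:
        return False
    return (n_het_genotypes >= min_het_genotypes
            or n_homozygous >= min_homozygous)


# The three accepted combinations named in the text:
three_heterozygotes = qualifies_for_noise_profile(0, ["11", "12", "13"])
two_homozygotes = qualifies_for_noise_profile(2, [])
one_of_each = qualifies_for_noise_profile(1, ["11"])
```

A single homozygous sample, or two heterozygous samples alone, would be rejected by this predicate.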

3.1.5. Accuracy of noise reference database and stutter model

To verify the accuracy of the noise profiles obtained through BGEstimate and BGPredict, it can be useful to compare the average noise ratios with the noise ratios observed in individual homozygous samples. The noise ratios of all noise in all homozygous reference samples can easily be collected using the BGHomRaw tool. These data points can be plotted on top of a noise profile to inspect the consistency and variation in the noise ratios of various types of noise for each allele. In Fig. 3, the noise profile of the most frequent allele of D7S820 (CE10_TCTA[10]_-20T>A) is displayed, which has foremost a -1 stutter (CE10_TCTA[9]_-20T>A) in addition to a -1 nt slippage product at the A-stretch (CE9.3_TCTA[10]_-20T>-). The individual observations for the homozygous samples coincide nicely with the estimated noise profile ratios.

Fig. 3. Noise profile of D7S820 allele CE10_TCTA[10]_-20T>A. The noise ratio is shown for each systemic noise sequence observed with a noise ratio of 0.1% or higher. Individual observations in homozygous samples (above 0.5%) are displayed as circles. As expected, the most frequently observed noise sequence is the -1 stutter, but since the allele contains a single-nucleotide stretch of 9 A nucleotides, a considerable portion of the noise consists of sequences with slippage at this A-stretch (or a combination of the two).

Similarly, it is useful to compare the functions fitted by Stuttermodel to the data points to which they were fitted. Stuttermodel includes an option to write the raw data points to a separate output file, which can be visualised together with the fitted model as shown in Fig. 4 for D7S820. This example shows that the homozygous calls and the Stuttermodel estimation follow the same trend and that there is no discrepancy between forward and reverse reads. The same holds for the A-stretch (data not shown).

In the stutter model, fits with an R² score below 0.75 were rejected. Although this may seem a very low threshold, we obtained better results by including more fits than by excluding them, since excluding a fit means that the stutter model cannot be used to filter and correct stutter for the respective repeat unit at all.
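The R² acceptance rule can be illustrated with a short Python sketch. A straight line fitted by ordinary least squares is used here purely for illustration; the functional form that Stuttermodel actually fits is not specified in this section, and all names and data points below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def accept_fit(xs, ys, min_r2=0.75):
    """Keep a fit only if its R² reaches the threshold used in this study."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept) >= min_r2


# Invented data: repeat length (nt) against -1 stutter ratio (% of parent).
repeat_lengths = [32, 36, 40, 44, 48]
clear_trend = [2.0, 4.0, 6.0, 8.0, 10.0]
no_trend = [5.0, 2.0, 6.0, 3.0, 5.0]
```

With these invented points, `accept_fit(repeat_lengths, clear_trend)` keeps the fit, whereas `accept_fit(repeat_lengths, no_trend)` rejects it.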

Fig. 4. Stutter model for the -1 stutter of D7S820. On the x-axis the length of the repeat is displayed (in nucleotides) and on the y-axis the -1 stutter noise ratio (as percentage of reads of the parent allele) is displayed. Each homozygous reference sample is displayed as a dot and the lines display the fitted functions used for calculating the expected stutter of each allele.

Fig. 5. Sequence profile of a single-source sample. Sequence profile of loci D18S51 and D19S433 of a single-source sample. A sequence profile displays the read count before correction (in purple bars) and shows the effects of noise filtering (light purple for the reads that are removed) and noise correction (with the noise reads added to the parent alleles in dark orange). When performing correction, it is possible that an allele gains reads because the noise reads originating from this allele are added, but loses reads at the same time since the noise of another allele in the profile includes reads of this allele. This overlapping part of added and removed reads is marked separately in light orange. This means that the original read count of an allele before correction is the combination of the purple and the light orange bar. The lines in the bars indicate the strand balance; the line is drawn near the top of the bar if the majority of reads of a sequence is on the forward strand, near the bottom of the bar if the majority of reads is on the reverse strand, and in the middle of the bar in the absence of strand bias. Sequences displayed in green in the graphs are the alleles that the software infers to be genuine alleles in the sample. These are also displayed in the table. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


3.2. Sample analysis

3.2.1. Allele calling, interpretation and visualisation

When a reference database has been created, one can proceed with the analysis of samples. FDSTools analyses sequencing data, calls alleles and interprets the data by correction for noise as inferred from the reference database. Results can be represented as a graphical sequence profile output and as an interactive profile report.

In Fig. 5, an example of a sequence profile of two loci of a single-source sample is displayed (generated by the command ‘fdstools vissample’). A sequence profile displays the read counts before and after correction and visualises the effects of noise filtering and noise correction. A more detailed explanation of the interpretation of a sequence profile can be found in Supplementary Fig. 5.

The interactive sequence profile reports provide separate filtering options for the graphs and tables displayed (see Section 2.4.3). In the graphs, all alleles that are hidden by the filtering options are (optionally) aggregated as a separate bar (displaying the cumulative numbers of reads) with the label ‘other sequences’. In addition, we aggregate all singleton reads into ‘other sequences’ already in the first step of the analysis (using ‘fdstools tssv --minimum 2 --aggregate-filtered’), which has the additional benefits of speeding up subsequent analysis and decreasing data storage demand.

3.2.2. Improving heterozygote balance through noise correction

The amplification of long STR alleles in the PCR is generally less efficient than that of shorter alleles and, in addition, long STR alleles suffer from a higher degree of stutter, resulting in reduced heterozygote balance between the two alleles [2]. Since FDSTools determines which ‘noise reads’ are derived from which parent alleles, these reads can (optionally) be added to the read counts of the parent alleles, which theoretically will improve the heterozygote balance.

When we examine the heterozygote allele balance in the 429 single-source reference samples, an improved heterozygote balance is indeed observed when the stutter reads are added to the read counts of the parent alleles (Table 2). Heterozygote balance was determined per locus by dividing the read counts for the less frequent alleles by those for the more frequent alleles, and taking the average of all 429 samples.
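The balance measure reported in Table 2 can be sketched as follows. The Python below is a minimal illustration with hypothetical function names and invented read counts; it only restates the definition given above (less frequent allele divided by more frequent allele, averaged over samples).

```python
def heterozygote_balance(reads_a: float, reads_b: float) -> float:
    """Read count of the less frequent allele divided by the more frequent one."""
    return min(reads_a, reads_b) / max(reads_a, reads_b)


def mean_locus_balance(allele_read_pairs) -> float:
    """Average heterozygote balance over all samples at one locus."""
    balances = [heterozygote_balance(a, b) for a, b in allele_read_pairs]
    return sum(balances) / len(balances)


# Invented example: the longer allele has fewer reads but more stutter.
short_allele, short_stutter = 1000, 80
long_allele, long_stutter = 800, 120

uncorrected = heterozygote_balance(short_allele, long_allele)
corrected = heterozygote_balance(short_allele + short_stutter,
                                 long_allele + long_stutter)
```

In this invented case, adding each allele's stutter reads back to its parent raises the balance from 0.80 to about 0.85, the direction of change seen in Table 2.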

3.2.3. Mixture analysis

For the analysis of mixtures, noise correction may assist in identifying the alleles of a low minor contributor. We used 31 two-person mixtures with minor contributions of 50%, 20%, 10%, 5%, 1% and 0.5% to assess this expectation.

We varied the ‘percentage of locus’ threshold (Table 1) for calling alleles, which sets a limit to the mixture proportion. When no noise correction was applied the threshold was varied between 5.0% and 1.5%; when noise correction was applied, a lower

Table 2

Heterozygote balance for original, filtered and corrected datasets. The read counts for the less frequent alleles are divided by those for the more frequent alleles, and the average for all 429 single-source reference samples is taken.

| Dataset \ locus | Amel | CSF1PO | D10S1248 | D12S391 | D13S317 | D16S539 | D18S51 | D19S433 | D1S1656 | D21S11 | D22S1045 | D2S1338 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Uncorrected data | 0.83 | 0.88 | 0.82 | 0.79 | 0.88 | 0.87 | 0.85 | 0.84 | 0.88 | 0.89 | 0.85 | 0.77 |
| Filtered data without noise reads added to allele read count | 0.83 | 0.90 | 0.87 | 0.80 | 0.89 | 0.89 | 0.87 | 0.88 | 0.89 | 0.90 | 0.88 | 0.78 |
| Corrected data with noise reads added to allele read count | 0.83 | 0.91 | 0.89 | 0.84 | 0.90 | 0.91 | 0.89 | 0.90 | 0.91 | 0.91 | 0.93 | 0.80 |

| Dataset \ locus | D2S441 | D3S1358 | D5S818 | D7S820 | D8S1179 | FGA | Penta D | Penta E | TH01 | TPOX | vWA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Uncorrected data | 0.90 | 0.88 | 0.90 | 0.88 | 0.89 | 0.85 | 0.87 | 0.78 | 0.88 | 0.88 | 0.85 |
| Filtered data without noise reads added to allele read count | 0.90 | 0.90 | 0.91 | 0.89 | 0.90 | 0.87 | 0.87 | 0.78 | 0.88 | 0.89 | 0.88 |
| Corrected data with noise reads added to allele read count | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.89 | 0.88 | 0.81 | 0.89 | 0.90 | 0.90 |

Table 3

Average number of drop-in alleles and average drop-out percentage per sample for different ‘percentage of locus’ allele-calling thresholds.

a) Summary of drop-in and drop-out rates for various allele-calling thresholds. Columns give the minor contribution; each cell lists # drop-in / % drop-out.

| Analysis method and threshold | 50%: 600 pg | 20%: 240 pg | 10%: 120 pg | 5%: 60 pg | 1%: 60 pg | 0.5%: 60 pg |
|---|---|---|---|---|---|---|
| Without correction, 5.0% | 0.7 / 0.0% | 0.8 / 2.1% | 1.2 / 25.0% | 1.0 / 33.1% | 1.7 / 39.9% | 0.8 / 40.8% |
| With correction, 3.0% | 0.0 / 0.0% | 0.3 / 0.0% | 0.3 / 5.7% | 0.0 / 30.3% | 0.7 / 41.1% | 0.3 / 41.1% |
| With correction, 2.5% | 0.3 / 0.0% | 0.3 / 0.0% | 0.5 / 1.4% | 0.0 / 25.1% | 0.7 / 40.6% | 0.3 / 41.1% |
| With correction, 2.0% | 0.7 / 0.0% | 1.2 / 0.0% | 1.8 / 0.5% | 0.8 / 17.4% | 0.8 / 40.6% | 1.2 / 40.8% |
| With correction, 1.5% | 1.7 / 0.0% | 2.3 / 0.0% | 2.5 / 0.0% | 2.0 / 10.8% | 2.7 / 39.9% | 2.3 / 40.8% |
| With correction, 1.0% | 4.0 / 0.0% | 5.2 / 0.0% | 4.5 / 0.0% | 4.5 / 3.8% | 5.7 / 37.8% | 3.2 / 40.6% |
| With correction, 0.5% | 16.3 / 0.0% | 17.3 / 0.0% | 14.0 / 0.0% | 15.5 / 2.4% | 14.0 / 30.3% | 11.0 / 37.4% |

b) Categorised drop-out rates when using 1.5% allele-calling threshold (with correction)

| Minor contribution | 20%: 240 pg | 10%: 120 pg | 5%: 60 pg | 1%: 60 pg | 0.5%: 60 pg |
|---|---|---|---|---|---|
| Alleles unique to the minor (homozygous) | 0.0% | 0.0% | 0.0% | 91.3% | 100.0% |
| Alleles unique to the minor (heterozygous) | 0.0% | 0.0% | 31.6% | 98.1% | 99.4% |
| Alleles unique to minor | 0.0% | 0.0% | 27.0% | 97.2% | 99.4% |
| All alleles of the minor | 0.0% | 0.0% | 18.5% | 67.7% | 69.3% |
| All alleles of the major | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
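The quantities in Table 3 can be illustrated with a short Python sketch of threshold-based allele calling at a single locus. The function names, the use of "greater than or equal to" for the threshold, and the read counts are assumptions made for illustration; the sketch counts a drop-in as a called sequence that belongs to neither contributor and expresses drop-out as the percentage of expected alleles that were not called.

```python
def call_alleles(read_counts: dict, pct_of_locus: float) -> set:
    """Call every sequence holding at least pct_of_locus % of the locus reads."""
    total = sum(read_counts.values())
    return {seq for seq, n in read_counts.items()
            if 100.0 * n / total >= pct_of_locus}


def drop_in_and_drop_out(called: set, expected: set):
    """Return (number of drop-in alleles, percentage of expected alleles lost)."""
    drop_in = len(called - expected)
    drop_out_pct = 100.0 * len(expected - called) / len(expected)
    return drop_in, drop_out_pct


# Invented corrected read counts: A is the major, B and C the minor's alleles,
# N is residual noise that belongs to neither contributor.
locus_reads = {"A": 900, "B": 60, "C": 32, "N": 8}
true_alleles = {"A", "B", "C"}

strict = drop_in_and_drop_out(call_alleles(locus_reads, 5.0), true_alleles)
lenient = drop_in_and_drop_out(call_alleles(locus_reads, 0.5), true_alleles)
```

In this invented locus, the strict threshold loses one of the three expected alleles without any drop-in, while the lenient threshold recovers all alleles at the cost of one drop-in, mirroring the trade-off visible across the rows of Table 3a.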