Bacterial strain classification: From complex sequence data to fast epidemiology


Maarten Boon 10764399

Bachelor thesis. Course: Afstudeerproject BSc KI (Graduation Project BSc Artificial Intelligence)

Credits: 18 EC

Bachelor Artificial Intelligence, College of Science, University of Amsterdam, Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisor: Dr. A. E. Budding

Department of Medical Microbiology and Infection Control VU University Medical Center

De Boelelaan 1117 1081 HV Amsterdam


Abstract

In recent years, significant improvements have been made in identifying similarity between bacterial strains. These improvements rely on whole genome sequencing, which assembles and annotates the entire genome of a bacterium. Linking this data to the quickly determined genetic fingerprint of a bacterium provides a way to rapidly classify bacterial strains. A virtual implementation of the amplified fragment length polymorphism technique creates a comparable genetic fingerprint from whole genome sequences, which makes it possible to compare genome sequencing data with the genetic fingerprint of a bacterial strain. After creating this virtual implementation and clustering its results, a system is obtained that can quickly identify a sample from a hospital in order to detect a possible epidemic.


Acknowledgement

I would like to express my gratitude to my supervisor Dr. A. E. Budding and Madelon van der Bijl for their patience, constructive criticism and fruitful insights. Besides them I would also like to thank the rest of the IS-Diagnostics team for giving me the possibility to keep working with them on this system even after my thesis is finished.


1 Introduction

Nowadays, people who are seriously ill are treated in a hospital, which acts as a facility for both treatment and recovery. The combination of ill people and people of weak health makes epidemics caused by bacteria a dangerous occurrence in hospitals, one that should be dealt with as soon as possible. To battle an epidemic one must first find its point of origin to prevent any further spread of the bacteria, and to do so one has to determine in which places a specific bacterial strain is present. Once it is known exactly where a specific strain is found, it can be determined how people are being infected. This in turn provides a way to set up measures to prevent spread, quarantine patients and beat the epidemic.

At this moment there are two primary ways to determine whether two samples belong to the same strain (that is, are genetically similar). The first is Whole Genome Sequencing (WGS) [3], a process in which the entire DNA sequence (genome) of a bacterium is cut into small portions using enzymes; these portions are then organised into sequences in a computer to recreate the entire genome. This process, though accurate once completed, currently takes two to four months to perform to standards high enough for application in epidemiology. The second method, Amplified Fragment Length Polymorphism (AFLP) [8], involves cutting the genome into portions of certain lengths; on the basis of these lengths a match can be made between two samples of the same strain. This technique takes between one and three hours to perform, but it does not provide a designation of the strain found.

With these two techniques it is possible to determine whether a certain strain is the one causing the epidemic and which strain this is. However, it would be beneficial for hospitals if both could be done at the same time. To do this, the data obtained by AFLP must be matched to the existing WGS data. This requires a linking algorithm that takes AFLP data as input and references it against the existing WGS database. The algorithm consists of two parts: a virtual AFLP conducted on the WGS data, followed by a classification that determines which known strain the AFLP sample corresponds to.

1.1 Research question

How can a system be created that can be used to classify bacterial strains from different data types, in order to detect epidemics in hospitals?


2 Theoretical Foundation

2.1 Whole Genome Sequencing

Whole Genome Sequencing (WGS), commonly carried out with Next Generation Sequencing technology, is the encapsulating name for the variations of the method used to obtain the full sequence of a genome. This is a difficult procedure, since current systems are not powerful enough to analyse the entire DNA sequence at once. Shotgun sequencing was created to circumvent this problem. The first step of this technique is fragmentation: cutting the large DNA molecule into small fragments using cutting enzymes. These small fragments are then sequenced, and a heuristic search algorithm assembles the resulting sequences according to the overlap they have with each other, as seen in Figure 1. This technique provides a way to go from laboratory results (in vitro) to an annotated sequence (in silico).

Figure 1: Image summary of Whole Genome Sequencing

Whole Genome Sequencing encounters the problem that, because of hardware limitations, the DNA sequence has to be cut into many small fragments. For sequences of 4 million base pairs (Mbp) this results in a large heuristic search problem that takes a considerable amount of computing power to resolve. In addition, it is not always possible to reassemble the full sequence, which results in annotations that contain multiple incomplete parts of the assembly or shifted assembly patterns [5]. These problems mean that this technique, though valuable for its ability to provide accurate full annotations of sequences, is not suitable for rapid epidemiology because of the time and computing power required to create an accurate annotation.
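The overlap-based assembly described above can be illustrated with a toy example. The sketch below is a simple greedy heuristic chosen for illustration only; the source does not specify which assembly algorithm is used in practice, and the names overlap, greedy_assemble and min_length are hypothetical. It also shows how reads without sufficient overlap leave several incomplete parts instead of one full sequence.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the list of remaining sequences (contigs). More than one
    contig remains when some reads do not overlap with any other read.
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # no overlapping pair left: the assembly stays incomplete
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))
    print(greedy_assemble(["AAAA", "CCCC"]))
```

The pairwise comparison of all reads in every round is what makes the search expensive for genomes of several Mbp; real assemblers use far more efficient data structures than this sketch.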

2.2 Amplified Fragment Length Polymorphism

Amplified Fragment Length Polymorphism (AFLP) is a highly efficient and sensitive method of DNA fingerprinting [6] that is capable of individual (strain level) identification of organisms. The method consists of three steps that together produce a fingerprint-like result, which can be used to compare two bacteria on strain level. The four components needed to perform AFLP are restriction enzymes (footnote 2), DNA ligase (footnote 3), synthetic adapters that fit on the cut ends of the DNA sequence left by the restriction enzymes, and an isolated DNA sequence.

2.2.1 Restriction and Ligation

To start the AFLP process, the isolated DNA sequence has to be cut into smaller fragments that are suitable for multiplication. This is done by two restriction enzymes, one of which is a frequent cutter and the other a rare cutter. These enzymes latch onto the DNA sequence at a specific sequence of nucleotides and then cut this sequence in a staggered pattern, as in Figure 2b (footnote 4). The frequent cutter latches onto a short sequence that, because of its length, occurs more often, while the rare cutter generally connects to a longer sequence of nucleotides that occurs less often. The staggered cut leaves a number of unbound nucleotides, or sticky ends, on both sides of the cut. The synthetic adapters, whose sticky ends are the complement of the ones created on the DNA fragments by the restriction enzymes, connect to these sticky ends (ligation), preparing the fragments for the selective amplification step.

2.2.2 Polymerase Chain Reaction

In the next step, Polymerase Chain Reaction (PCR) [4] is used to create selective copies of the fragments. Heating the fragment mixture breaks the bonds that hold the nucleotide pairs of the strands together (denaturation), so that every fragment separates into two loose strands that are complementary to each other. When the mixture is then cooled down, the primers in the mixture bind to the complement of their nucleotide sequence (primer annealing). Once a primer is attached, DNA polymerase, an enzyme that binds loose nucleotides to a strand of DNA, extends the primer until it reaches the end of the strand (primer extension). These three steps together form one PCR cycle, which doubles the amount of DNA to which a primer was attached. It is important to note, however, that duplication of a strand only starts from the adapter that a primer attaches to. Depending on the configuration of adapters attached to a fragment, the fragment therefore multiplies linearly (if only one primer attaches to it), exponentially (if both ends of the fragment receive a primer) as in Figure 2c, or not at all (if no primers attach to it).
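The difference between these three cases can be made concrete with a small calculation. The sketch below assumes an idealised PCR in which every strand that can be primed is copied in every cycle; the function name strands_after and its parameters are illustrative and not part of the thesis software.

```python
def strands_after(cycles: int, primed_ends: int) -> int:
    """Count the single strands derived from one double-stranded fragment.

    Assumes an idealised PCR in which every primed strand is copied once
    per cycle. primed_ends is the number of fragment ends (0, 1 or 2)
    whose adapter matches a primer.
    """
    if primed_ends not in (0, 1, 2):
        raise ValueError("primed_ends must be 0, 1 or 2")
    if primed_ends == 2:
        # Both original strands and all of their copies can be primed,
        # so the number of strands doubles every cycle.
        return 2 * 2 ** cycles
    if primed_ends == 1:
        # Only one original strand can be primed and its copies cannot,
        # so exactly one new strand is added per cycle.
        return 2 + cycles
    # No primer attaches: the fragment is never copied.
    return 2


if __name__ == "__main__":
    for ends in (2, 1, 0):
        print(ends, [strands_after(n, ends) for n in (0, 10, 20)])
```

After a few dozen cycles the exponentially amplified fragments outnumber the others by many orders of magnitude, which is why only fragments with a primer on both ends show up in the fingerprint.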

2.2.3 Analysis

The last step is the analysis of the fragment mixture to obtain the strain specific fingerprint, which is done using capillary electrophoresis [2]. This method is based on the principle that when an electric current is put through a fragment mixture using a positive and a negative electrode, all fragments start moving towards the positive electrode. The smaller a fragment is, the quicker it moves towards the positive electrode. This results in a spread of fragments by length (Figure 3, footnote 5), where the height of a bar indicates the frequency of that fragment.

Footnote 2: Enzymes that can cut a strand of DNA at a certain sequence of nucleotides, creating 'sticky' ends on both sides of the cut.

Footnote 3: Type of enzyme that helps bond two 'sticky' ends of DNA to each other.

Footnote 4: http://ars.els-cdn.com/content/image/1-s2.0-S0169534799016596-gr1.jpg

Footnote 5: https://www.researchgate.net/profile/Panagiotis_Kalaitzis/publication/51420580/figure/fig5/AS:305917926100992@1449947925013/Figure-5-Virtual-gel-of-PCR-amplicon-pairs-using-c-and-d-primers-from-leaf-or-seed.png


Figure 2: Graphic representation of AFLP

3 Method

To compare different data types, it is best either to convert both to a new type that represents the important elements of each, or to convert one data type to the other. The data in this problem is suited to the second option, since WGS data can be converted to AFLP data using a Virtual AFLP (VAFLP). Given a file containing a DNA sequence and two restriction enzymes, such a program performs AFLP on the given DNA sequence. However, the current versions of VAFLP [7][1] are not suitable for this project, since they are either no longer available for use or restricted to small input files. Therefore a new version of VAFLP has to be created that is both always accessible and able to take sequence files of any size.

3.1 Dataset

As mentioned in the previous paragraph, the available dataset consists of WGS data files, which are files containing a string that represents the annotated genome acquired using WGS. The dataset has been compiled by scraping these files from the NCBI GenBank database, using a Python script to locate and download all WGS data files for a given species. Additionally, an overview file that matches WGS files to strains has to be downloaded from GenBank, in order to provide a means to validate strains against existing VAFLP results. The complete dataset consists of all assemblies in GenBank for three species: Clostridium difficile (around 900 assemblies), Pseudomonas aeruginosa (around 1700 assemblies) and Staphylococcus aureus (around 4700 assemblies).

Figure 3: Example of capillary electrophoresis result for plant comparison, with fragments marked by length

3.2 Virtual Amplified Fragment Length Polymorphism

The first step is to create a new VAFLP which, given the same input sequence, has to produce the same peaks as the existing implementations. This is done by creating a virtual restriction enzyme and a virtual PCR, and by picking a data type that can easily represent the fingerprint. Together these three elements recreate the AFLP process.

3.2.1 Virtual Restriction Enzyme

While creating a Virtual Restriction Enzyme (VRE), one has to keep in mind that DNA is a two-stranded sequence in which the strands are complementary to each other, meaning that whatever happens on one strand can happen in complement and in the opposite direction on the other strand. With this in mind, and acknowledging that input files only represent a single strand of DNA, the most obvious construction of the VRE is a function that takes a DNA sequence and two restriction sequences and then splits the string at these sequences and at their reverse complements, to account for actions on the other strand.

Because real restriction enzymes do not necessarily cut in the middle of their recognition sequence, two fragments that hold the same number of nucleotides in the centre can be assigned different lengths on a single strand, depending on the cutting positions of the enzymes. In the real world this problem does not occur, because the length of a fragment is the distance between the furthest ends measured on either strand, as displayed in Figure 4. To negate this problem, both ends of the single-strand fragment receive the complete restriction sequence of their restrictor.

Figure 4: Example double strand length versus single strand length
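The behaviour of the VRE described in this section can be sketched as follows. This is a minimal illustration in Python and not the Java implementation of the thesis tool: the function names are hypothetical, and the recognition sequences GAATTC (EcoRI) and TTAA (MseI) are only an example choice of a rare and a frequent cutter. Every fragment keeps the complete recognition sequence on both of its ends.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def find_sites(genome: str, recognition: str):
    """Return (start, end) of every match of a recognition sequence
    or of its reverse complement on the given single strand."""
    sites = set()
    for pattern in {recognition, reverse_complement(recognition)}:
        start = genome.find(pattern)
        while start != -1:
            sites.add((start, start + len(pattern)))
            start = genome.find(pattern, start + 1)
    return sorted(sites)


def digest(genome: str, recognition_sequences):
    """Split the genome at all restriction sites.

    Each fragment runs from the start of one site to the end of the next,
    so it carries the complete recognition sequence on both ends. The
    pieces before the first and after the last site are dropped, because
    they lack a restriction site on one end (a linear input is assumed).
    """
    sites = sorted({site
                    for recognition in recognition_sequences
                    for site in find_sites(genome, recognition)})
    return [genome[left_start:right_end]
            for (left_start, _), (_, right_end) in zip(sites, sites[1:])]


if __name__ == "__main__":
    print(digest("CCGAATTCAAAATTAAGG", ["GAATTC", "TTAA"]))
```

Searching for the reverse complement on the same strand is equivalent to searching for the recognition sequence itself on the other strand, which is why a single-strand input file suffices.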

3.2.2 Virtual Polymerase Chain Reaction

After restricting the DNA sequence, the fragments that grow exponentially have to be selected. To do this a virtual Polymerase Chain Reaction (VPCR) has to be created. The VPCR does not have to perform the three steps of a real PCR, because the fragment data allows a simple check. A fragment would be duplicated exponentially in the real PCR if it has on one end one restriction sequence, followed by any selective nucleotides its primer has, and on the other end the reverse complement of the other restriction sequence together with any selective nucleotides of its primer. These fragments are kept in memory; all other fragments are of no use and can be removed.
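The check performed by the VPCR can be sketched as a comparison of the two fragment ends. The sketch below assumes that every fragment carries the complete recognition sequence on both ends, as described for the VRE; the function name is_amplified and its parameters are hypothetical and not taken from the thesis software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def is_amplified(fragment: str, site_a: str, site_b: str,
                 selective_a: str = "", selective_b: str = "") -> bool:
    """Return True if the fragment would be amplified exponentially.

    The fragment must start with one recognition sequence followed by the
    selective nucleotides of its primer, and end with the reverse
    complement of the other recognition sequence plus its selective
    nucleotides. Both orientations of the fragment are accepted.
    """
    primer_a = site_a + selective_a
    primer_b = site_b + selective_b
    forward = (fragment.startswith(primer_a)
               and fragment.endswith(reverse_complement(primer_b)))
    mirrored = (fragment.startswith(primer_b)
                and fragment.endswith(reverse_complement(primer_a)))
    return forward or mirrored


if __name__ == "__main__":
    fragments = ["GAATTCAAAATTAA", "GAATTCAAGAATTC"]
    kept = [f for f in fragments if is_amplified(f, "GAATTC", "TTAA")]
    print(kept)
```

A fragment with the same restrictor on both ends is rejected, which mirrors the case in the real PCR where only one type of primer attaches.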

3.2.3 Analysis

Now that the VRE and VPCR have provided the relevant fragments for the VAFLP, the last thing that has to be done is creating the genetic fingerprint from these fragments. To do so every fragment is binned using the formula (footnote 7):

$$\text{bin number} = \text{fragment length} + a$$

where a is a correction for ignoring the adapters (and their lengths) in VAFLP. After binning all fragments the result is an N×1 vector, where N is the maximum length threshold for the sequence. This vector contains the strain specific fingerprint of the DNA sequence given to the VAFLP; an example of such a virtual fingerprint can be seen in Figure 5.
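The binning step can be sketched in a few lines. The default a = 19 is the value given in the text; the threshold max_length = 600, the choice to count fragments per bin, and the choice to discard fragments that fall outside the vector are assumptions made for this illustration only.

```python
def fingerprint(fragment_lengths, max_length: int = 600, a: int = 19):
    """Bin fragment lengths into a fingerprint vector with max_length entries.

    Entry k - 1 holds the number of fragments with bin number k, where
    bin number = fragment length + a. Fragments whose bin number falls
    outside 1..max_length are discarded (an assumption of this sketch).
    """
    vector = [0] * max_length
    for length in fragment_lengths:
        bin_number = length + a
        if 1 <= bin_number <= max_length:
            vector[bin_number - 1] += 1
    return vector


if __name__ == "__main__":
    vector = fingerprint([14, 14, 100, 590])
    peaks = {index + 1: count for index, count in enumerate(vector) if count}
    print(peaks)
```

Because every genome is mapped to a vector of the same length N, fingerprints of different strains can be compared element by element.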

3.3 Clustering

After running the VAFLP on a dataset of WGS data, the result is a group of N×1 vectors, or equivalently an N×M data matrix, where M is the number of elements in the dataset. To be able to analyse this matrix, clustering must be performed in order to obtain a database that can be quickly searched by computers and validated by humans.

Footnote 7: By default a = 19; however, this number can differ depending on the length of the adapters used for a restrictor.

Figure 5: Visualisation of virtual strain specific fingerprint as created by the VAFLP.

3.3.1 Unweighted Pair Group Method with Arithmetic Mean

In order to make the database and its similarity values easily verifiable against real life data, the clustering algorithm and distance measure are chosen to match those used in real world techniques. The clustering algorithm chosen is the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [9], whose results can be quickly visualised because it constructs a dendrogram (rooted tree). This method uses a distance matrix to merge the two most similar clusters, in a process of two steps that are repeated until one cluster remains. First, however, a distance matrix has to be created from the existing data. This is done using the cosine similarity (which equals the Pearson correlation when the vectors are mean-centred), which for vectors A and B is equal to:

$$\text{similarity} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{1}$$
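Equation (1) can be written out directly for two fingerprint vectors. The sketch below uses plain Python lists; the function name cosine_similarity and the choice to return 0 for an all-zero vector are assumptions of this illustration.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equally long vectors, as in equation (1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An empty fingerprint shares no peaks with anything (assumption).
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1, 0, 2, 0], [1, 0, 2, 0]))
    print(cosine_similarity([1, 0, 2, 0], [0, 3, 0, 1]))
```

Two fingerprints without any shared peak have a dot product of zero, while fingerprints with the same peaks in the same proportions point in the same direction.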

For fingerprint vectors, which have no negative entries, the cosine similarity is a value between 0 (no similarity) and 1 (identical up to scaling) that describes the similarity between two vectors. The distance between two vectors is therefore measured as d(x, y) = 1 − similarity, and the distance between two clusters can be determined as the mean of the distances d(x, y) between their elements. For clusters U and V:

$$d(U, V) = \frac{1}{|U|\,|V|} \sum_{x \in U} \sum_{y \in V} d(x, y) \tag{2}$$

After calculating the M×M distance matrix, the two clusters with the smallest distance are merged into a new (parent) cluster W = U ∪ V. Following the creation of the parent cluster, its distance to all other clusters is calculated using the following formula, where W is the parent cluster and Z is another cluster in the dataset:

$$d(W, Z) = d(U \cup V, Z) \tag{3}$$

$$d(U \cup V, Z) = \frac{|U|\, d(U, Z) + |V|\, d(V, Z)}{|U| + |V|} \tag{4}$$

The results of these calculations are used to create the new (M − 1)×(M − 1) distance matrix. The clustering steps are repeated until only one cluster is left. This produces a dendrogram in which every node holds the nodes below it and the distance between these nodes, with the exception of the leaf nodes, which hold only cluster elements that have a similarity of 1 to each other.
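The two repeated steps, merging the closest pair and updating the distances with equation (4), can be sketched as follows. This is a straightforward illustration that favours readability over speed; the function name upgma and the nested-tuple representation of the dendrogram are assumptions of this sketch and not the data structures of the thesis tool.

```python
def upgma(distances, labels):
    """Cluster with UPGMA and return the dendrogram as nested tuples.

    distances is a symmetric matrix (list of lists) of pairwise distances,
    labels names its rows. Every internal node is a tuple
    (left subtree, right subtree, distance between the two subtrees).
    """
    count = len(labels)
    # cluster id -> (subtree, number of elements in the cluster)
    clusters = {i: (labels[i], 1) for i in range(count)}
    # (smaller id, larger id) -> distance between the two clusters
    dist = {(i, j): distances[i][j]
            for i in range(count) for j in range(i + 1, count)}
    next_id = count
    while len(clusters) > 1:
        # Step 1: merge the two clusters with the smallest distance.
        u, v = min(dist, key=dist.get)
        d_uv = dist.pop((u, v))
        tree_u, size_u = clusters.pop(u)
        tree_v, size_v = clusters.pop(v)
        w = next_id
        next_id += 1
        # Step 2: distance from the parent W to every other cluster Z,
        # using the size-weighted mean of equation (4).
        for z in clusters:
            d_uz = dist.pop((min(u, z), max(u, z)))
            d_vz = dist.pop((min(v, z), max(v, z)))
            dist[(z, w)] = (size_u * d_uz + size_v * d_vz) / (size_u + size_v)
        clusters[w] = ((tree_u, tree_v, d_uv), size_u + size_v)
    return next(iter(clusters.values()))[0]


if __name__ == "__main__":
    matrix = [[0.0, 0.25, 0.5],
              [0.25, 0.0, 0.75],
              [0.5, 0.75, 0.0]]
    print(upgma(matrix, ["A", "B", "C"]))
```

In the example, A and B are merged first at distance 0.25, after which the distance from the new cluster to C is the mean of 0.5 and 0.75.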


4 Results

The Virtual Amplified Fragment Length Polymorphism is executed on the dataset, which results in a large number of VAFLP output files such as the one represented in Figure 5. To verify that the newly created VAFLP tool performs in accordance with the previously created systems, it has to be tested against their output. This is done by finding specific strains in the dataset that have previously been processed using the old tools and comparing those results with the output of the newly created tool. If the outputs differ, the new VAFLP does not perform as it is supposed to; if the outputs are exactly the same, the new tool functions as it should. As shown in Figure 6, the VAFLP results are the same as those produced by InSilico (footnote 9), so the output provided by the VAFLP is validated and ready to be used for clustering.

Figure 6: InSilico and VAFLP result comparison for Clostridium difficile 630 strain

After running the VAFLP and clustering the outcomes, a heat map can be created per species, depicting both the dendrogram and the VAFLP results per strain. The result for the Clostridium difficile species in Figure 7 provides an overview, with the dendrogram at the top, the clusters showing in the middle and groups of anomalies near the edges of the heat map. The large number of strains in a select couple of clusters raises the question of whether the VAFLP actually works. However, upon consulting the WGS files, this seems to have more to do with the similarity of the assemblies than with the effectiveness of the VAFLP, because the large clusters turn out to be data dumps from institutes investigating the ability to track small genetic changes in a strain using WGS data.

Footnote 9: http://insilico.ehu.eus/AFLP/

Figure 7: Heat map and dendrogram of Clostridium difficile species

5 Conclusion

The system as proposed in this paper shows the potential to significantly speed up the detection of epidemics within hospitals. This is done by applying a virtual AFLP (VAFLP) on whole genome sequences in order to vectorize them. This makes it possible to compare the whole genome sequence to data provided by the application of real world AFLP on a biological sample, thus making it possible to compare that sample to the samples in the database.

It can be concluded from the test results that the VAFLP proposed in this paper is more flexible and scalable than the ones currently in existence, because it is not limited to a maximum size of input sequence or a maximum number of bands that it can return. Together with the ability of the new system to handle incomplete WGS data, this makes it more suitable for use in hospitals than the current VAFLP tools. From the results produced by the VAFLP as described in this article and clustering by UPGMA, it can be concluded that it is possible to create a system that combines different data types in order to detect and cluster similar strains from a set of samples. This system, in combination with the standardisation effort of real world AFLP by the VUmc, can provide the basis for a significant speed-up in epidemiology by applying a rapid similarity check for samples against a large database of known strains.


6 Discussion

Although the algorithm has effectively been tested on the three species from the dataset, only one of these had a separate set from the inSilico VAFLP [1] tool to verify against. This is not enough to guarantee correct performance for all possible restrictors. For example, one case that has not been investigated is that of largely overlapping restrictor sequences, which could create an incorrect restriction match in the copied ends of the fragments and cause phantom occurrences to be registered by the VPCR. Because the VAFLP implementation runs on the local machine, a certain amount of data gathering has to be performed before one can build a large initial database of VAFLP data. During this project this was done by scraping the NCBI GenBank. In general this process means that the initial setup of a database takes a considerable amount of time. However, once this initial setup has been created on one machine, it can be copied over for use on other machines or even provided through an online server system.

Another point of note is that the current system uses a bottom-up clustering algorithm that has to re-cluster the entire dataset every time a new element is added, which results in a computationally heavy clustering step.

7 Future Work

Although this research has led to a fast algorithm for the generation of virtual AFLP data from WGS data, it is only the first step towards faster clinical epidemiology. The next step would be to create a mapping between the in silico VAFLP data and in vitro AFLP data, in order to provide a match between the two data types even if this data is noisy. Such a mapping would help automate the analysis of bacterial samples, saving time and resources that could be put to use on processing more samples.

Another avenue left open by this research is the possibility of implementing a continuously learning clustering algorithm that can update the database whenever a new sample is tested, in order to create a flexible and always up-to-date system suitable for use across multiple hospitals.

Even though this research was focused on creating a link between two data types, this was in essence done by vectorizing large text files. It would be interesting to see whether a form of VAFLP would work for the vectorization of large text corpora in order to check for similarities. An interesting point in such research would be to find out whether any restrictors exist that would make this feasible, and to what extent this method would be able to express similarities.

References

[1] Joseba Bikandi et al. “In silico analysis of complete bacterial genomes: PCR, AFLP–PCR and endonuclease restriction”. In: Bioinformatics 20.5 (2004), pp. 798–799. DOI: https://doi.org/10.1093/bioinformatics/btg491.

[2] Norman J Dovichi. “DNA sequencing by capillary electrophoresis”. In: Electrophoresis 18.12-13 (1997), pp. 2393–2399. DOI: https://doi.org/10.1002/elps.1150181229.

[3] David J Edwards and Kathryn E Holt. “Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data”. In: Microbial Informatics and Experimentation 3.1 (2013), p. 2. DOI: https://dx.doi.org/10.1186/2042-5783-3-2.

[4] Lilit Garibyan and Nidhi Avashia. “Polymerase chain reaction”. In: Journal of Investigative Dermatology 133.3 (2013), pp. 1–4. DOI: https://doi.org/10.1038/jid.2013.1.

[5] Nicholas J Loman et al. “High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity”. In: Nature Reviews Microbiology 10.9 (2012), pp. 599–606. DOI: https://dx.doi.org/10.1038/nrmicro2850.

[6] Ulrich G Mueller and L LaReesa Wolfenbarger. “AFLP genotyping and fingerprinting”. In: Trends in Ecology & Evolution 14.10 (1999), pp. 389–394. DOI: https://doi.org/10.1016/S0169-5347(99)01659-6.

[7] Stephane Rombauts, Yves Van de Peer, and Pierre Rouzé. “AFLPinSilico, simulating AFLP fingerprints”. In: Bioinformatics 19.6 (2003), pp. 776–777. DOI: https://doi.org/10.1093/bioinformatics/btg090.

[8] PHM Savelkoul et al. “Amplified-fragment length polymorphism analysis: the state of an art”. In: Journal of Clinical Microbiology 37.10 (1999), pp. 3083–3091.

[9] R. R. Sokal and C. D. Michener. “A statistical method for evaluating systematic relationships”. In: University of Kansas Scientific Bulletin 28 (1958), pp. 1409–1438.

A Appendix

A.1 Heat map species

Figure 8: Heat map and dendrogram of Pseudomonas aeruginosa species

Figure 9: Heat map and dendrogram of Staphylococcus aureus species

A.2 VAFLP Software

The VAFLP tool runs on Java and requires JDK 1.8 or above to compile. All information on the tool and the software itself can be found in the following repository: https://github.com/mlboon/genomeLink
