An algorithm for matching AFLP datasets


Bachelor Informatica


Cyriel de Ceuninck van Capelle

June 8, 2016

Informatica, Universiteit van Amsterdam


Abstract

Generating unique fingerprints of bacteria at a strain-specific level can be done by DNA analysis using the amplified fragment length polymorphism (AFLP) method. AFLP output data, however, contain considerable measurement variation: nucleotide shifts, peak intensity variations and missing peaks. These variations make classification of bacteria difficult. This paper describes an algorithm that deals with them. The algorithm first pre-processes the AFLP data and then classifies the bacteria. It correctly classifies all samples in the AFLP dataset provided by Vrije Universiteit Medisch Centrum (VUmc), with sufficient margin on both strain and species level. The results indicate that peak intensity variations make the classification of profiles worse and that, in the considered dataset, most likely only shifts of 1 nucleotide occur.


Introduction

Fast and efficient travelling is common in our modern world. People can get from one place to another in a relatively short time. The advantages of efficient and fast travelling are large, but there are also disadvantages. One such disadvantage is that infectious diseases can spread relatively easily, as people can carry these diseases all over the world. Dangerous bacteria that are resistant to most antibiotics can thus spread quickly. To stop this spread, bacteria have to be analysed quickly to establish their source. Once the source is established, preventive measures can be implemented. It is therefore essential that quick and accurate methods are available for the analysis of bacteria in the event that new, potentially dangerous bacteria turn up.

The microbiology department at VUmc has established a technique to quickly identify bacteria using a standardized and optimized version of the AFLP method. This method is capable of generating unique strain-specific fingerprints of bacteria [1]. The source of dangerous bacteria can thus be established by generating a unique fingerprint using the AFLP method and comparing it with a database containing fingerprints of all previously found bacteria. This way, infectious bacteria can be identified and preventive measures can be implemented.

However, the fingerprint data produced by this AFLP method contain a lot of measurement variation, caused by the equipment in the laboratory machines during the AFLP process. Fingerprints of the same bacteria, which should be identical, can therefore differ from each other, which makes it hard to classify these bacteria accurately. To identify the source of dangerous, infectious bacteria, these variations should be filtered so that classification can be performed correctly. Since the laboratory procedure of the AFLP method has already been optimized by VUmc, the remaining reduction of measurement variation and the classification should be performed in software, at the data analysis level.

1.1 Research question

The goal of this thesis is to analyse the AFLP data produced by the optimized AFLP method of VUmc and to implement a suitable method for correct bacterial classification. VUmc has made an AFLP dataset of different kinds of bacterial strains available for analysis. This dataset will be analysed to find which kinds of measurement variation are generated by the machines and to what extent they occur. Once the types of measurement variation are identified, suitable methods can be applied to reduce them. After this reduction, classification can be performed to correctly classify bacteria. When new infectious bacteria are analysed using the AFLP method and a unique fingerprint is created, the classification method has to compare this new fingerprint to the database of existing fingerprints to find the correct source of the analysed bacteria. Preferably, the accuracy of this classification process is near 100%, as human lives are potentially affected by its outcome.


CHAPTER 2

AFLP

2.1 Method

Amplified fragment length polymorphism or AFLP is a method for generating fingerprints based on DNA fragments [1]. AFLP is a relatively easy and fast method to determine DNA fingerprints of any kind of organism [2]. It is therefore a suitable method for generating unique fingerprints of bacteria. AFLP consists of five steps before a fingerprint is established. First, DNA has to be isolated from the organism, after which smaller DNA fragments are generated with specific restriction enzymes. Ligation is then performed on these fragments to make them suitable for Polymerase Chain Reaction (PCR), which amplifies the DNA fragments. This amplification is needed to determine the sequence length of the fragments in the final step, electrophoresis [3].

2.1.1 DNA restriction and ligation

The first step in the AFLP process after DNA isolation is to cut the isolated DNA of the organism into smaller fragments with restriction enzymes. Restriction enzymes are enzymes that bind to a particular sequence of nucleotides in DNA. When such an enzyme finds this sequence, it cuts the DNA at that position. AFLP uses two types of restriction enzymes: rare cutter enzymes and frequent cutter enzymes. First, the frequent cutter cuts the DNA into multiple smaller DNA fragments. Then, the rare cutter cuts these fragments at particular places, so that fragments arise with a frequent cutter side and a rare cutter side. Only these fragments are analysed in the remaining AFLP steps. An advantage of using frequent and rare cutter enzymes is control over the number of fragments [1].


Figure 2.1: Restriction and ligation phase of the AFLP method [3].

After the DNA is cut by restriction enzymes, and a group of smaller fragments is thus generated, ligation is performed. The DNA fragments cut by the restriction enzymes contain ‘sticky ends’, which are single-stranded parts of DNA on both ends of these fragments. Ligation deals with these ‘sticky ends’ by ligating adapters to the single-stranded pieces of DNA. This step is needed to prepare the DNA fragments for further analysis using PCR [3]. Figure 2.1 shows an example of the restriction and ligation process in the AFLP method. First, double-stranded DNA is cut into smaller fragments using the EcoRI and MseI restriction enzymes. This results in fragments with small pieces of single-stranded DNA on both ends, the previously discussed ‘sticky ends’. EcoRI and MseI adapters are then ligated to these ‘sticky ends’, which results in fragments of double-stranded DNA [1]. The fragments are now ready for the next step, Polymerase Chain Reaction.

2.1.2 PCR amplification

Polymerase Chain Reaction or PCR is a technique used to make copies of specific DNA sequences. PCR can potentially create millions of copies of the same sequence of DNA [4]. PCR works by adding small sequences of DNA, called primers, that are complementary to a part of the DNA fragment to be amplified. Primers are needed to determine which DNA fragment is to be multiplied. Once primers are added, PCR can be performed and the fragment is multiplied in a number of cycles. Each PCR cycle consists of three steps: denaturation, primer annealing and primer extension. First, the reaction mixture is heated (denaturation), which separates the DNA into two single strands. The reaction mixture is then cooled, which leads to hybridization of the primers to their complementary DNA strand (primer annealing). In the last step, DNA polymerase creates new DNA strands by extending the primers (primer extension). One PCR cycle is now completed and the considered DNA fragment is doubled. PCR multiplies DNA fragments exponentially, as every single DNA fragment is doubled during each cycle [5]. In the optimized and standardized version of AFLP by VUmc, a fixed number of cycles has been selected to ensure homogeneity of the data. PCR is needed in the AFLP process because DNA fragment length analysis, performed in the electrophoresis step, cannot detect single DNA fragments by size. Therefore, each DNA fragment has to be multiplied so that detection by the electrophoresis machine is possible.
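As a rough illustration of the exponential growth described above, the sketch below assumes ideal amplification in which every fragment is doubled in every cycle. Real reactions are less efficient. The function name and the cycle count are illustrative only; they are not the values used by VUmc.

```python
def copies_after_pcr(initial_fragments: int, cycles: int) -> int:
    """Copies after `cycles` ideal PCR cycles (perfect doubling is an assumption)."""
    if initial_fragments < 0 or cycles < 0:
        raise ValueError("initial_fragments and cycles must be non-negative")
    return initial_fragments * 2 ** cycles


# A single fragment passes the one-million mark after 20 ideal cycles.
print(copies_after_pcr(1, 20))  # 1048576
```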

2.1.3 Electrophoresis

The final step in AFLP is electrophoresis, in which the strain-specific unique fingerprint of the bacteria is generated. Electrophoresis sorts the restricted, ligated and multiplied DNA fragments by sequence length. This is done by applying an electric current through a medium containing the DNA fragments. DNA fragments are negatively charged, so when an electric field is created with a positive and a negative electrode, the fragments tend to move towards the positive electrode. Smaller DNA fragments travel faster through the medium because they encounter less resistance. This way, fragments spread according to their size, which leads to a unique representation of DNA sequence lengths for each different organism. In the AFLP method optimized and standardized by VUmc, capillary electrophoresis is used to separate the DNA fragments by sequence length. The principle behind this technique is the same as for original electrophoresis; only the instrumentation differs. With capillary electrophoresis, a capillary is used as the medium for the DNA fragments to travel through. One end of the capillary is dipped into a sample reservoir containing the DNA fragments and the other end is dipped into a buffer-filled reservoir. Then, an electric field is applied to the capillary, which makes the fragments move towards the positive side of the capillary medium. Afterwards, a laser-induced fluorescence detector is used to detect the DNA fragments ordered by size. The advantages of capillary electrophoresis are the flexibility of capillaries and the relatively easy installation of capillaries into automated machines [6]. Once the DNA fragments have been detected, a unique fingerprint has been generated.

Figure 2.2: Electrophoresis result after analysis of two samples using AFLP. Each black line in both samples represents DNA fragments of that particular nucleotide length. Darker lines contain more fragments with that particular nucleotide length than lighter lines. These results are unique for each bacteria at a strain-specific level; unique fingerprints are now created [7].


2.1.4 Summary

Figure 2.3: Schematic overview of the AFLP method [3].

Figure 2.3 shows an overview of all the steps performed in the AFLP process. First, DNA of the bacteria is isolated, after which fragments are created by restriction and ligation. Primers are then added to these DNA fragments for multiplication by the PCR method. In the final step, the DNA fragments are sorted according to their length. In the AFLP method optimized by VUmc, capillary electrophoresis is used instead of the gel analysis shown in Figure 2.3. Furthermore, parameters such as the type of restriction enzymes and the number of PCR cycles are standardized by VUmc by using the same values for every AFLP bacteria analysis. This way, standardized fingerprint data is ready to be investigated for noise reduction and classification.

2.1.5 Data output

Figure 2.4: Output of the AFLP method, a unique fingerprint at a strain-specific level of a bacteria.


Figure 2.4 shows the output of the AFLP method after processing the DNA of one bacteria. This graph is the result of digitizing an electrophoresis result such as the one shown in Figure 2.2. The resulting data is two-dimensional and consists of the number of nucleotides of a fragment and the number of fragments, or intensity, with that nucleotide length. In the graph of Figure 2.4, the x-axis thus represents the number of nucleotides and the y-axis the fragment intensity. Each peak shows the number of DNA fragments with that particular length in nucleotides. Each AFLP data graph optimized and standardized by VUmc has a fixed nucleotide length range from 0 to 600. As discussed, this AFLP data graph is unique for the particular bacteria at a strain-specific level.

2.2 Encountered problems

Figure 2.5: AFLP data of two samples of the same bacteria at a strain-specific level.

Figure 2.5 shows two AFLP data graphs of the same bacteria at a strain-specific level from two different samples. In theory, these graphs should be exactly the same, since the bacteria do not differ from each other on a genetic level: their DNA sequences are identical. As shown in Figure 2.5, however, this is not the case: the AFLP data graphs differ in their intensities and nucleotide lengths. These differences are most likely caused by small variations in the quantities of material used during the AFLP process. Due to these measurement variations, classification of bacteria is difficult. It proves difficult, for instance, to match sample 1 in Figure 2.5 to sample 2, and a mismatch is likely to occur. To prevent this from happening, the measurement variations have to be filtered and a suitable classification method has to be found to correctly match bacteria. When investigating these measurement variations, we can distinguish different types of variation in the AFLP data: shifts, fragment intensity variations and missing peaks. These are discussed in the next few sections.


2.2.1 Nucleotide shifts

Figure 2.6: Example of two shift occurrences between two AFLP data graphs.

Two peak occurrences in the graph of Figure 2.5 are examined in Figure 2.6. As discussed, these peaks should not differ from each other in terms of intensity and nucleotide length. The two peaks obviously do not share the same intensity, but they also differ from each other in nucleotide length: the number of nucleotides of the two peaks differs by one. These differences in nucleotide length between two peaks, named shifts, occur quite frequently in the AFLP data. Shifts also occur randomly in the AFLP data; they do not show a particular pattern. Most shifts span one nucleotide, but shifts of two nucleotides also occur. Shifts of three nucleotides or more most likely do not occur.

2.2.2 Peak intensity variations

The most noticeable type of measurement variation in the AFLP dataset is the variation in peak intensities. In Figure 2.5 it is clearly visible that sample 1 of the particular E. faecium bacterial strain contains far more peaks with low intensities than sample 2. These intensity variations do not always follow a particular pattern: for example, four big peaks in sample 1 resemble those in sample 2 in intensity. The other, lower peaks, however, show some sort of linear pattern, as the relative intensity differences between these peaks stay the same to some extent.


Figure 2.7: AFLP data of two samples of the same bacteria at a strain-specific level.

If we look at the two AFLP data graphs of the S. aureus bacteria at a strain-specific level in Figure 2.7, this linear pattern is visible as well for some peaks. This linear pattern is most noticeable at higher nucleotide lengths. Peak occurrences at low nucleotide lengths show more random variations. Overall, these peak intensity variations occur mostly in linear form in the AFLP data where relative peak intensity differences stay the same to some extent.

2.2.3 Missing peaks

Figure 2.8: Example of two peaks missing in either sample 1 or sample 2 of the E. faecium bacteria.


Another type of measurement variation is the random disappearance of peaks. When comparing samples of the same bacteria at a strain-specific level, one would expect peaks to occur at the same nucleotide length in all samples, or at a slightly different length due to shifts. This is, however, not the case: Figure 2.8 shows two peak locations where a peak occurs in one sample but not in the other. These missing peaks occur more frequently at lower nucleotide length values. Missing peaks also tend to occur more frequently at low intensities.


CHAPTER 3

Implementation

3.1 Pre-processing the AFLP data

3.1.1 Equalization and normalization

As discussed in the previous chapter, measurement variation occurs in several forms. We thus have to apply techniques to filter these variations so we can classify bacteria correctly at a strain-specific level. One of the applied techniques is equalization, which will solve the peak intensity problem to some extent. Equalization basically reduces the relative peak intensity distances between all peaks in an AFLP data graph. This results in an AFLP data graph where peaks are more clustered towards the mean. Equalization is done using the following formula:

$$x' = \mu + \frac{x - \mu}{c}, \qquad c > 0,\ c \in \mathbb{N} \tag{3.1}$$

In Equation 3.1, x denotes the intensity of the input peak, µ the mean of all peak intensities in the related AFLP data graph, and the variable c determines to what extent the peaks are equalized. The formula thus takes the difference between the old peak intensity and the mean intensity, divides it by c and adds the result to the mean. Because the difference is reduced by a factor c, the new peak intensity lies closer to the mean. The formula is applied to every single peak in the AFLP data graph. Figure 3.1 shows an example of equalization.


Figure 3.1: AFLP data graph before and after equalization with c = 3.
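A minimal sketch of Equation 3.1 is given here. It assumes that a sample is stored as a list of intensities indexed by nucleotide length and that every non-zero entry counts as a peak; both are assumptions of this sketch, as are the function name and the toy values.

```python
from statistics import mean


def equalize(intensities, c):
    """Apply Equation 3.1 to every peak; zero entries (gaps) stay untouched."""
    if c < 1:
        raise ValueError("c must be a positive integer")
    peaks = [x for x in intensities if x > 0]  # assumption: non-zero means peak
    if not peaks:
        return [0.0 for _ in intensities]
    mu = mean(peaks)  # mean over the peak intensities only
    return [mu + (x - mu) / c if x > 0 else 0.0 for x in intensities]


sample = [0, 9000, 0, 3000, 0, 600]  # made-up intensities
print(equalize(sample, 3))  # [0.0, 5800.0, 0.0, 3800.0, 0.0, 3000.0]
```

With c = 1 the data is returned unchanged, which corresponds to running the method without equalization.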

One problem arises when using the equalization formula of Equation 3.1. This can be seen in Figure 3.1. The standard AFLP data graph in Figure 3.1 has an intensity range between 0 and 9000 whereas the equalized AFLP data graph ranges in intensity between 0 and 4000 fragments. This range depends on the mean intensity of the AFLP peak data which can differ between samples. Therefore we have to normalize the AFLP data after equalization. This is done using min-max normalization [8]:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3.2}$$

Here, x is again the input peak intensity, min(x) the minimum peak intensity found in the related AFLP data graph and max(x) the maximum peak intensity found in the related AFLP data graph. Our AFLP data will always have gaps between at least two peaks, so our minimum fragment value, or min(x), will always be 0. We can therefore simplify Equation 3.2 to the following:

$$x' = \frac{x}{\max(x)} \tag{3.3}$$

The formula in Equation 3.3 is also applied to every peak in the related AFLP data graph. This sets the peak intensity range from 0 to 1, where 1 is assigned to the highest peak, as shown in Figure 3.2.


Figure 3.2: Equalized AFLP data graph before and after normalization.
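The simplified normalization of Equation 3.3 can be sketched as follows, again assuming a sample stored as a list of intensities whose minimum is 0. The function name and the input values are illustrative.

```python
def normalize(intensities):
    """Scale intensities to [0, 1] with Equation 3.3 (assumes the minimum is 0)."""
    highest = max(intensities, default=0)
    if highest <= 0:
        return [0.0 for _ in intensities]
    return [x / highest for x in intensities]


equalized = [0.0, 5800.0, 0.0, 3800.0, 0.0, 3000.0]  # made-up equalized sample
print([round(x, 3) for x in normalize(equalized)])
# [0.0, 1.0, 0.0, 0.655, 0.0, 0.517]
```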

3.1.2 Normal distribution

Equalization and normalization deal with the random intensity variations. Another type of measurement variation that must be addressed to obtain suitable, classifiable AFLP data is the random occurrence of shifts. These shifts, as discussed in Section 2.2.1, do not follow a particular pattern. However, studies show that measurement errors are often normally distributed [9]. We therefore deal with potential peak shifts by normally distributing each peak intensity in the related AFLP data graph. Shifted peaks will thus share the same peak intensity to some extent when compared to each other. The method for normally distributing each peak in an AFLP data graph involves a number of steps. First, an array of integers is created of the following form:

$$[-3s, \ldots, 0, \ldots, 3s], \qquad s \geq 0,\ s \in \mathbb{N} \tag{3.4}$$

Here, 0 represents the position of the peak and s sets the width of the normal distribution curve. This array is used to compute the normally distributed values, which results in an array representing the bell curve. The following Gaussian function is applied to convert each value in the array to a normally distributed value, with the peak value set to 1:

$$f(x) = e^{-\frac{x^2}{2s^2}} \tag{3.5}$$

As an example, an array is created with s = 1. This results in the following array:

$$[-3, -2, -1, 0, 1, 2, 3] \tag{3.6}$$

Each value in the array is then recalculated using Equation 3.5. The array now contains the following values:

$$[0.011, 0.135, 0.606, 1.0, 0.606, 0.135, 0.011] \tag{3.7}$$

This array is normally distributed: its peak value is 1 and the other values follow the Gaussian function in Equation 3.5. We can now apply this array to our peak intensities. This is done by multiplying each normally distributed value in our array with the related peak intensity value. For example, a piece of an AFLP data output sample has the following form:

$$[0, 0, 0, 0.8, 0, 0, 0.6, 0, 0] \tag{3.8}$$

Each peak value, in this case 0.8 and 0.6, is spread out using the normally distributed array of Equation 3.7. In this particular case, which occurs quite frequently, the normally distributed peak intensity values overlap. A gap of only two indices exists between the two peak values, 0.8 and 0.6. Since our normally distributed array with s = 1 extends 3 indices to either side of a peak, two candidate values arise at the indices in this gap. This problem is solved by keeping the highest of the two values. We thus say that peaks with higher intensity values are more important than an overlapping peak with a lower intensity value. Using this method, the AFLP array of Equation 3.8 combined with the normally distributed array of Equation 3.7 results in the following:

$$[0.009, 0.108, 0.485, 0.8, 0.485, 0.364, 0.6, 0.364, 0.081] \tag{3.9}$$

This process of normally distributing peaks is done for each peak in the related AFLP data graph. This finalizes the pre-processing part of our method. AFLP data which is equalized, normalized and where peaks are normally distributed results in the following AFLP data graph:

Figure 3.3: Pre-processed AFLP data graph. Equalized and normalized AFLP data graph with peaks which are normally distributed.
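The peak-spreading step can be sketched as below. The sketch evaluates Equation 3.5 on the offsets of Equation 3.4, multiplies the result with each peak, and keeps the highest value where spread peaks overlap. Treating s = 0 as "no spreading" and treating non-zero entries as peaks are assumptions of this sketch, and the function names are illustrative.

```python
import math


def gaussian_kernel(s):
    """Values of Equation 3.5 at the offsets -3s..3s of Equation 3.4."""
    if s == 0:
        return [1.0]  # assumption: s = 0 leaves every peak at its own position
    return [math.exp(-(x * x) / (2 * s * s)) for x in range(-3 * s, 3 * s + 1)]


def spread_peaks(intensities, s):
    """Spread every peak with the kernel; overlaps keep the highest value."""
    kernel = gaussian_kernel(s)
    half = len(kernel) // 2
    result = [0.0] * len(intensities)
    for i, peak in enumerate(intensities):
        if peak <= 0:
            continue  # gaps carry no intensity to spread
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(result):  # values falling outside the range are dropped
                result[j] = max(result[j], peak * weight)
    return result


example = [0, 0, 0, 0.8, 0, 0, 0.6, 0, 0]  # the array of Equation 3.8
print([round(v, 3) for v in spread_peaks(example, 1)])
# [0.009, 0.108, 0.485, 0.8, 0.485, 0.364, 0.6, 0.364, 0.081]
```

For the input of Equation 3.8 this reproduces the array of Equation 3.9. The kernel itself agrees with Equation 3.7 up to rounding in the third decimal.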

3.2 Matching

Our AFLP output data has now been equalized and normalized, and its peak values are normally distributed. All types of measurement variation have therefore been dealt with to some extent, except for missing peaks. Missing peaks are not addressed, as shifts make it very difficult to determine correctly where peaks should occur. We cannot fill empty ‘spots’ with peaks, as we simply do not know for sure whether these spots are really missing peaks or just shifted peaks. A method to deal with missing peaks is therefore not implemented; data matching is performed with the pre-processed data discussed in Section 3.1.

Bacterial classification is now performed by comparing pre-processed AFLP data graphs of new bacteria with the database containing all previously examined AFLP fingerprints of bacteria. Comparing, or data matching, is done by calculating the average peak error. This error is determined by comparing each peak intensity value of one graph with the related peak intensity value in the other graph. These errors are determined for each peak in both samples. The sum of these errors is then divided by the total number of peaks in both samples, which results in the average peak error.


Figure 3.4: Determining the average peak error by comparing peak values between two samples.

Figure 3.4 shows an example of determining the average peak error. For simplicity, the number of nucleotides is set from 200 to 280. The average peak error process starts by looping over the nucleotide range and determining whether a peak is found. The peak intensity value is determined at that particular nucleotide length in both AFLP data samples when a peak is found. The difference in peak intensity is then calculated and added to the total peak error. So in Figure 3.4, the first added error will be around 0.6 since the peak intensity value of the first peak in the top graph sits around 0.6 and the intensity value at that particular nucleotide length equals 0.0 in the bottom graph. Again, this is done for each single peak in both graphs. This results in a total error when all peak errors are added together. This sum is then divided by the total number of peaks in both samples which results in the average peak error. In algorithmic form:

Listing 3.1: Algorithm to determine the average peak error between two samples.

    def average_peak_error(sample1, sample2, is_peak):
        # sample1, sample2: pre-processed AFLP data samples
        # is_peak(sample, i): peak test, not specified in this listing
        peak_error = 0
        num_peaks = 0
        # Loop over every nucleotide length value
        for i in range(0, 600):
            # Determine fragment intensity difference if peak is found
            if is_peak(sample1, i) or is_peak(sample2, i):
                peak_error = peak_error + abs(sample1[i] - sample2[i])
                num_peaks = num_peaks + 1
        # Divide total error by number of peaks in both samples
        # (assumed here: one count per position that holds a peak)
        return peak_error / num_peaks if num_peaks else 0.0


The input AFLP sample of a new bacteria is compared with every other sample in the AFLP dataset using the algorithm in Listing 3.1. The input sample will then be classified to the existing bacteria AFLP sample in the dataset with the lowest average peak error.
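The matching and classification steps can be sketched together as follows. The sketch assumes that the peak positions of each sample are known as a set of indices and that the total number of peaks is counted as the number of positions where at least one sample has a peak; neither detail is spelled out in the text. The names and the toy database are hypothetical.

```python
def average_peak_error(sample1, sample2, peaks1, peaks2):
    """Mean absolute intensity difference over all positions that hold a peak.

    peaks1 and peaks2 are the peak positions of each sample (assumed known).
    """
    positions = set(peaks1) | set(peaks2)
    if not positions:
        return 0.0
    total = sum(abs(sample1[i] - sample2[i]) for i in positions)
    return total / len(positions)


def classify(query, query_peaks, database):
    """Return the label of the database entry with the lowest average peak error.

    database maps a label to a (sample, peak_positions) pair.
    """
    return min(
        database,
        key=lambda label: average_peak_error(
            query, database[label][0], query_peaks, database[label][1]
        ),
    )


# Made-up pre-processed samples, five nucleotide positions each.
database = {
    "strain A": ([0.0, 0.9, 0.0, 0.5, 0.0], {1, 3}),
    "strain B": ([0.7, 0.0, 0.0, 0.0, 1.0], {0, 4}),
}
print(classify([0.0, 1.0, 0.0, 0.4, 0.0], {1, 3}, database))  # strain A
```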


CHAPTER 4

Experiments

4.1 Dataset

The AFLP dataset provided by VUmc contained 46 samples of 8 different bacteria at a strain-specific level. These 8 bacteria belong to four different bacterial species, namely E. coli, E. faecium, P. aeruginosa and S. aureus. A bacterial species has different bacterial strains, each of which has a unique fingerprint. Bacteria at a strain-specific level are thus unique, but bacterial strains from the same species share common peaks. Bacteria therefore differ the most at the inter species level. For example, E. coli 279 and E. coli 4839 share common peaks, so their unique fingerprints should be more similar to each other than the fingerprints of E. coli and E. faecium bacteria are. This distinction is important to determine the accuracy of our data matching method.

Strain-specific bacteria    Number of samples
E. coli 279                 6
E. coli 4839                6
E. faecium 1144             5
E. faecium 3197             6
P. aeruginosa 2011          6
P. aeruginosa 5771          6
S. aureus 1045              6
S. aureus 1580              5
Total: 8                    46

Table 4.1: Overview of the AFLP dataset provided by VUmc.

4.2 Result

Each AFLP sample in the considered dataset was pre-processed using the method in Section 3.1. Then, all samples were compared with each other and matched using the method in Section 3.2. Three parameters influence the result of the classification method: variable c discussed in Section 3.1.1, variable s discussed in Section 3.1.2 and the range of nucleotides. This range is normally set from 0 to 600, but the discussed types of measurement variation may be concentrated in particular regions of it. The nucleotide range was therefore added to the set of parameters.


4.2.1 Average peak error matrix

Figure 4.1: Average peak error matrix. The average peak error of each sample compared to every other individual sample in the dataset is determined and displayed in the matrix. Thus, each data point represents the average peak error between those particular two samples. This results in a matrix of size 46x46. Parameters: nucleotide range: [0, 600], c = 1, s = 0.

Figure 4.1 shows the average peak error between all samples with the related parameters. This average peak error matrix is based on the data matching method without equalization and without normally distributed peaks; variable c was therefore set to 1 and variable s to 0. The nucleotide range was set to its full range of [0, 600]. As seen in Figure 4.1, this results in 45 correctly classified bacteria samples. Each green or red point marks, for its row, the sample combination with the smallest average peak error, and thus the sample to which that row's sample is classified. For simplicity, classified points are only shown per row; columns are excluded. For example, row 1 of the average peak error matrix in Figure 4.1 shows the classification outcome of the first sample of the E. coli 279 bacteria. This sample is classified to sample 3 of the E. coli 279 bacteria, which belongs to the same strain-specific bacteria, and is thus matched correctly. The red dot represents the only sample that is not classified correctly. The diagonal of the average peak error matrix in Figure 4.1 contains the data matching result of comparing each sample with itself. This diagonal therefore always contains an average peak error of 0.0.
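The row-wise classification shown in the matrix can be sketched as follows. The sketch assumes that the diagonal is skipped when looking for the smallest error in a row, since a sample trivially matches itself with error 0.0. The function names and the 4x4 toy matrix are hypothetical and not taken from the dataset.

```python
def classify_rows(error_matrix, labels):
    """For each row, pick the lowest-error column other than the diagonal."""
    predictions = []
    for i, row in enumerate(error_matrix):
        candidates = [j for j in range(len(row)) if j != i]
        best = min(candidates, key=lambda j: row[j])
        predictions.append(labels[best])
    return predictions


def count_correct(error_matrix, labels):
    """Number of rows whose best match carries the same strain label."""
    predictions = classify_rows(error_matrix, labels)
    return sum(p == t for p, t in zip(predictions, labels))


# Made-up matrix: two samples each of two hypothetical strains.
labels = ["A", "A", "B", "B"]
errors = [
    [0.0, 0.1, 0.6, 0.7],
    [0.1, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.2],
    [0.7, 0.6, 0.2, 0.0],
]
print(count_correct(errors, labels))  # 4
```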


4.2.2 Intra strain, intra species, inter species

We have now visualized the result of our method with the average peak error matrix shown in Figure 4.1. To show the impact of the pre-processing methods discussed in Section 3.1, equalization and normally distributed peaks were not yet included. Our next step is therefore to tune the parameters to find an optimal combination that further improves this average peak error matrix and shows the impact of the methods in Section 3.1. This requires a suitable accuracy measure that provides feedback on how well the data matching method performs with different parameter settings. We obtain it by determining the average peak error at different levels in the average peak error matrix. We distinguish three levels: intra strain, intra species and inter species. Intra strain is the average peak error of all samples within the same strain, intra species is the average peak error between samples of different strains within the same species, and inter species is the average peak error between the analysed bacterial species and all other bacterial species. In Figure 4.1, for example, the intra strain accuracy of the E. coli 279 bacteria is represented by the average peak error of all samples in the leftmost ‘block’. The intra species accuracy for the E. coli 279 bacteria is represented by the second block, which is the result of comparing this bacteria with its counterpart from the same species, the E. coli 4839 bacteria. Inter species accuracy is then represented by the other six blocks. Together, these three measurements can be used to determine the accuracy of the data matching algorithm. We look for the biggest difference between them, as we seek the highest margin in peak error between strains and species. Intra strain, intra species and inter species results are determined for each average peak error matrix with different parameter values as input.
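A minimal sketch of the three accuracy levels is given below. It assumes a symmetric error matrix, so only pairs above the diagonal are visited, and it reports one mean per level over the whole dataset rather than per bacteria as in the figures. The function name, the labels and the numbers are made up for illustration.

```python
from statistics import mean


def group_errors(error_matrix, strains, species):
    """Mean error per comparison level over all pairs of distinct samples."""
    groups = {"intra strain": [], "intra species": [], "inter species": []}
    n = len(error_matrix)
    for i in range(n):
        for j in range(i + 1, n):  # symmetric matrix assumed
            if strains[i] == strains[j]:
                level = "intra strain"
            elif species[i] == species[j]:
                level = "intra species"
            else:
                level = "inter species"
            groups[level].append(error_matrix[i][j])
    return {level: mean(vals) for level, vals in groups.items() if vals}


# Hypothetical samples: strains A1 and A2 of species A, strain B1 of species B.
strains = ["A1", "A1", "A2", "B1"]
species = ["A", "A", "A", "B"]
errors = [
    [0.0, 0.1, 0.3, 0.8],
    [0.1, 0.0, 0.5, 0.6],
    [0.3, 0.5, 0.0, 0.7],
    [0.8, 0.6, 0.7, 0.0],
]
print(group_errors(errors, strains, species))
```

A larger gap between the intra strain mean and the intra species mean indicates more margin for classification at strain level.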


4.2.3 Parameter values analysis

Figure 4.2: Intra strain, intra species and inter species results of all different bacteria on species level with different values of s. Parameters: nucleotide range: [0, 600], c = 3.

Figure 4.2 shows the result of experimenting with variable s, which sets the width of the normal distribution curve as discussed in Section 3.1.2. The purple dots represent the average peak error values of all sample combinations within the same strain, for example all combinations between the samples of E. coli 279. The yellow dots represent the intra species average peak errors, for example all combinations between the samples of E. coli 279 and E. coli 4839, and the blue dots represent the inter species average peak errors. We want to classify bacteria at a strain-specific level, so the margin between the yellow dots and the purple dots is the most important indicator of the accuracy of the data matching method. Figure 4.2 shows that s = 1 gives the most desirable results. Therefore, s = 1 was used in the following analyses.

Figure 4.3: Intra strain, intra species and inter species results of all different bacteria on species level with different values of c. Parameters: nucleotide range: [0, 600], s = 1.

In Figure 4.3, the intra strain, intra species and inter species results for different values of variable c are shown. We see that the margins between the three accuracy measurements increase when equalization is performed with higher values of variable c. Therefore, c = 40 was selected as the parameter value for determining the optimal nucleotide range in the following graph.

Figure 4.4: Intra strain, intra species and inter species results of all different bacteria on species level with different nucleotide ranges. Parameters: c = 40, s = 1.

Figure 4.4 shows the result of experimenting with the nucleotide range, that is, with limiting the size of the input data. Here, the results differ quite a lot and it is difficult to see visually which nucleotide range is optimal. A nucleotide range of [100, 600] was chosen as the value for generating a more suitable average peak error matrix as our result.

4.2.4 Result

Figure 4.5: Average peak error matrix. The average peak error of each sample compared to every other individual sample in the dataset is determined and displayed in the matrix. Thus, each data point represents the average peak error between those particular two samples. This results in a matrix of size 46x46. Parameters: nucleotide range: [100, 600], c = 40, s = 1.

Figure 4.6: Intra strain, intra species and inter species results with the parameter values of the two average peak error matrices shown in Figures 4.1 and 4.5.

Figure 4.5 shows the average peak error matrix with the optimal parameters, found visually in Figures 4.2, 4.3 and 4.4. All 46 samples are now classified correctly. Also, as seen in Figure 4.6, the margin between the intra strain, intra species and inter species results is bigger than for the average peak error matrix in Figure 4.1.

CHAPTER 5

Discussion

The green dots in the average peak error matrix of Figure 4.5 show that each sample in the dataset provided by VUmc will be classified correctly. Margins between strains and species are sufficient to accurately establish the source of each sample in the dataset at a strain-specific level. This result was obtained by analysing the parameter graphs in Section 4.2.3. The results of these graphs are quite interesting. Figure 4.2, for example, shows the intra strain, intra species and inter species average peak errors of each bacterium on species-specific level for different values of variable s. Margins between intra strain and intra species are the biggest for s = 1, so normally distributed peaks with a width of 3 are optimal. This results in a fairly narrow bell shape of the normal distribution curve, so only shifts of 1 nucleotide are taken into account to some extent. Thus, most likely only shifts of 1 nucleotide occur in the AFLP data. Shifts greater than 1 probably do not occur often, since the results where variable s exceeds 1 are fairly poor.

Another interesting result is seen in Figure 4.3, where different values of variable c are tested. These results are quite clear: the higher variable c, the bigger the margins between strains and species and the more accurate the classification. Higher values of c mean that equalization is performed more rigorously, as all peaks are corrected more strongly towards the mean. Thus, differences between peak intensities become smaller with higher values of variable c, and equalization with high values of c effectively eliminates them. We can therefore conclude, after analysing Figure 4.3, that variations in peak intensities occur to such an extent that it is better to discard peak intensities. Of course, results with a wider range of values of c are needed to confirm this conclusion, but Figure 4.3 gives a fairly good impression.

When examining the different nucleotide ranges for the four different bacterial species in the dataset, we do not see a clear pattern. This is obviously due to the different AFLP fingerprints of different species. Peaks are distributed along the nucleotide range differently for each individual bacterium on species- and strain-level. It is therefore difficult to find an optimal nucleotide range for the dataset. This becomes even harder when the dataset is expanded with more bacterial species, as more AFLP graphs with different peak distributions are added to the dataset. A generally valid recommendation for the appropriate nucleotide range can therefore not be made from the results of Figure 4.4. For this dataset, however, a nucleotide range of 100 to 600 was selected, purely based on visual analysis of Figure 4.4.

5.1 Further improvements

5.1.1 Estimating optimal parameter values

Figure 4.3, for example, is based on a nucleotide range of 0 to 600 and s = 1. We do not know how variable c behaves with other nucleotide ranges or other values of variable s. It is therefore necessary to examine all combinations of parameter values to find the exact optimal parameter values for classifying bacteria.
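Examining all combinations amounts to an exhaustive grid search. The sketch below is only an illustration of that idea: the function grid_search and the scoring function toy_score are hypothetical, and in practice the score would be a margin computed from the average peak error matrix for the given parameters (for example intra species minus intra strain error, higher being better).

```python
from itertools import product


def grid_search(score, s_values, c_values, ranges):
    """Try every (s, c, nucleotide range) combination and keep the best one.

    `score` is supplied by the caller and is assumed to return a number
    where higher means a larger margin between strains and species.
    """
    best_params, best_score = None, float("-inf")
    for s, c, nt_range in product(s_values, c_values, ranges):
        value = score(s, c, nt_range)
        if value > best_score:
            best_params, best_score = (s, c, nt_range), value
    return best_params, best_score


def toy_score(s, c, nt_range):
    """Stand-in score, NOT derived from AFLP data; for demonstration only."""
    return -abs(s - 1) + c / 100 + (1 if nt_range[0] == 100 else 0)


best, value = grid_search(
    toy_score,
    s_values=[1, 2, 3],
    c_values=[3, 20, 40],
    ranges=[(0, 600), (100, 600)],
)
print(best, value)
```

The number of evaluations grows with the product of the three value lists, so the lists should be kept modest if every evaluation recomputes a full average peak error matrix.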

5.1.2 Shift analysis

We have seen that parameter value s = 1 performs best on the dataset provided by VUmc. This indicates that shifts only occur with a nucleotide difference of 1, since the normally distributed widths are fairly small. It would be interesting to validate this statement with a dedicated shift analysis. This can be done by assigning the same peak value to the nucleotide positions +1 and -1 instead of computing the normal distribution of each peak. This way, each peak keeps the same value under shifts of 1 nucleotide. The size of this window becomes a new variable that can be set to any value to deal with shifts of different nucleotide lengths. If we repeat the analysis of Figure 4.2 with this new method for dealing with shifts, we can tell more accurately which shifts occur most in the considered dataset. This method may also perform better than our method where peaks are normally distributed.
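A minimal sketch of this proposed flat window is given below. The function name spread_flat and the parameter w (the maximum shift in nucleotides) are hypothetical. The text does not specify what should happen when the windows of two neighbouring peaks overlap; the sketch assumes the larger value is kept.

```python
def spread_flat(fingerprint, w=1):
    """Copy each peak value to all positions within w nucleotides of the peak.

    fingerprint : list of peak intensities indexed by nucleotide length
    w           : largest shift to tolerate (w=1 covers positions -1 and +1)
    Overlapping windows keep the larger value (an assumption of this sketch).
    """
    n = len(fingerprint)
    out = [0.0] * n
    for pos, value in enumerate(fingerprint):
        if value == 0:
            continue  # no peak at this nucleotide length
        for k in range(max(0, pos - w), min(n, pos + w + 1)):
            out[k] = max(out[k], value)
    return out


print(spread_flat([0, 0, 1.0, 0, 0, 0.5], w=1))
# → [0.0, 1.0, 1.0, 1.0, 0.5, 0.5]
```

Repeating the analysis for w = 1, 2, 3 and so on would then show directly which shift length gives the largest margins.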

5.1.3 Simplify method

Improvements on shift analysis can thus be made to provide more information about shift occurrences. Another type of measurement variation, discussed after the analysis of Figure 4.3, is the occurrence of random peak intensity variations. According to Figure 4.3, our algorithm performs better when variable c is increased. Minimizing relative peak intensity differences thus leads to better classification results. Therefore, equalization as discussed in Section 3.1.1 is probably unnecessary, as results are better when all peak intensities are set to an equal value. Our algorithm can thus be simplified by discarding peak intensities, so that only the nucleotide length of peaks is taken into account. This results in a much simpler method than the one discussed in Chapter 3 but, judging by the parameter graphs in Section 4.2, it will most likely perform better on the considered dataset.
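As an illustration of this simplification, the sketch below reduces a fingerprint to peak presence only. The names to_presence and mismatch_fraction and the threshold parameter are hypothetical, and mismatch_fraction is a simple stand-in for comparing two profiles, not the average peak error defined in Chapter 3.

```python
def to_presence(fingerprint, threshold=0.0):
    """Return 1 where a peak exceeds `threshold` and 0 elsewhere."""
    return [1 if value > threshold else 0 for value in fingerprint]


def mismatch_fraction(a, b):
    """Fraction of positions where exactly one of the two profiles has a peak."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


profile_a = to_presence([0, 3.2, 0, 0.4])   # -> [0, 1, 0, 1]
profile_b = to_presence([0, 1.1, 0.9, 0])   # -> [0, 1, 1, 0]
print(mismatch_fraction(profile_a, profile_b))
# → 0.5
```

With intensities removed, the peaks at position 1 count as a full match even though their heights differ, which is the behaviour that high values of c approximate.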

5.2 Conclusion

Bacterial classification with AFLP datasets proves to be difficult due to various types of measurement variations, which cause differences in AFLP data graphs between two genetically identical bacteria. We have applied various methods to the AFLP output graphs to reduce these variations and make them suitable for bacterial classification. After tuning the parameters of our algorithm and analysing the parameter graphs discussed in Section 4.2, an optimal parameter value combination was identified. It leads to an average peak error matrix in which every single bacterium is matched correctly at a strain-specific level. Results have also shown that some measurement variations appear in a particular form. Shifts most probably occur with a difference of 1 nucleotide, and peak intensity variations occur to such an extent that intensities should probably be discarded in comparisons. While our algorithm provides sufficient results, a simpler and more optimized algorithm can now be constructed with these findings, to provide a suitable bacterial classification method, especially for AFLP datasets optimized and standardized by VUmc.

Bibliography

[1] P. Vos, R. Hogers, M. Bleeker, M. Reijans, T. Van de Lee, M. Hornes, A. Friters, J. Pot, J. Paleman, M. Kuiper, et al., “AFLP: a new technique for DNA fingerprinting,” Nucleic acids research, vol. 23, no. 21, pp. 4407–4414, 1995.

[2] U. G. Mueller and L. L. Wolfenbarger, “AFLP genotyping and fingerprinting,” Trends in Ecology & Evolution, vol. 14, no. 10, pp. 389–394, 1999.

[3] M. Blears, S. De Grandis, H. Lee, and J. Trevors, “Amplified fragment length polymorphism (AFLP): a review of the procedure and its applications,” Journal of Industrial Microbiology and Biotechnology, vol. 21, no. 3, pp. 99–114, 1998.

[4] H. A. Erlich, “Polymerase chain reaction,” Journal of clinical immunology, vol. 9, no. 6, pp. 437–447, 1989.

[5] A. Hadidi and T. Candresse, “Polymerase chain reaction,” Viroids, pp. 115–122, 2003.

[6] N. J. Dovichi and J. Zhang, “How capillary electrophoresis sequenced the human genome,” Angewandte Chemie International Edition, vol. 39, no. 24, pp. 4463–4468, 2000.

[7] C. Jones, K. Edwards, S. Castaglione, M. Winfield, F. Sala, C. Van de Wiel, G. Bredemeijer, B. Vosman, M. Matthes, A. Daly, et al., “Reproducibility testing of RAPD, AFLP and SSR markers in plants by a network of european laboratories,” Molecular breeding, vol. 3, no. 5, pp. 381–390, 1997.

[8] I. B. Mohamad and D. Usman, “Standardization and its effects on k-means clustering algorithm,” Res. J. Appl. Sci. Eng. Technol, vol. 6, no. 17, pp. 3299–3303, 2013.

[9] A. Lyon, “Why are normal distributions normal?,” The British Journal for the Philosophy of Science, vol. 65, no. 3, pp. 621–649, 2014.

APPENDIX A

Appendix

A.1 Scripts

Following the method described in Chapter 3, the scripts are divided into a pre-processing part and a data matching part. The pre-processing script loads the AFLP data and equalizes, normalizes and normally distributes the peaks. Afterwards, the result is saved as a data tree. This tree structure contains bacterial species at the top level, different strains at the second level and samples at the bottom level. The data matching script loads this data tree and computes the average peak errors between all sample combinations in the considered dataset. The result is stored in a text file for further analysis.

1. pre-process AFLP data.py

(a) load text file with AFLP data

(b) pre-process data:

i. equalize

ii. normalize

iii. normally distribute peaks

(c) save as data tree

2. match data.py

(a) load data tree

(b) compute average peak errors

(c) store result in text file

A data tree is constructed by using the python objects: Species.py, Strain.py and Sample.py stored in the objects folder. The data tree consists of an array of species objects. Each species object in this array thus has the following structure:

Species1

Strain1

Sample1 Sample2 Sample3 Sample4

A.1.1 pre-process AFLP data.py

This file pre-processes the AFLP data. This data is initially stored in a text file, which is read by the get_data(filepath) functions. Scripts are provided for the following two text file structures:

E COLI 279 (1) 1 0 FALSE

601 1 0.00 1 FAM BINNED False VUMC MMS 10624 A01 0080.fsa Escherichia coli AT13093 50 0 Untagged

The get_data(filepath) functions filter this data to obtain the species name, strain name, sample id, nucleotide length and peak intensity. These values are stored in the object files to construct a data tree. Each object contains the following values:

1. Species.py

(a) Species name

(b) List of strain objects

2. Strain.py

(a) Strain name

(b) List of sample objects

3. Sample.py

(a) Sample name

(b) Unique fingerprint: array of 600 peak intensity values

The resulting data tree is stored in the data tree folder as a pickle file.
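The object structure listed above can be sketched as follows. This is an illustrative reconstruction using dataclasses, not the contents of Species.py, Strain.py and Sample.py; the helper functions save_tree, load_tree and iter_samples are hypothetical names introduced for this example.

```python
import pickle
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    name: str
    fingerprint: List[float]  # 600 peak intensity values


@dataclass
class Strain:
    name: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Species:
    name: str
    strains: List[Strain] = field(default_factory=list)


def save_tree(tree, path):
    """Store the list of Species objects as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(tree, handle)


def load_tree(path):
    """Load a list of Species objects from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def iter_samples(tree):
    """Yield (species name, strain name, sample) for every sample in the tree."""
    for species in tree:
        for strain in species.strains:
            for sample in strain.samples:
                yield species.name, strain.name, sample


# Toy tree with one species, one strain and two empty fingerprints.
toy_tree = [
    Species("E COLI", [
        Strain("279", [
            Sample("E COLI 279 (1)", [0.0] * 600),
            Sample("E COLI 279 (2)", [0.0] * 600),
        ]),
    ]),
]
for species_name, strain_name, sample in iter_samples(toy_tree):
    print(species_name, strain_name, sample.name)
```

Flattening the tree with iter_samples gives the data matching step a simple list of labelled samples from which all sample combinations can be formed.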

A.1.2 match data.py

This file loads the pickle file containing the data tree, after which average peak errors are computed for each sample combination. The result is stored in a text file which has the following structure:

E COLI279(1) E COLI279(2) 0.0695995782421 intra strain E COLI

E COLI279(1) and E COLI279(2) are the compared samples, 0.069 is the average peak error, intra strain is the type of measurement and E COLI is the species type of the related sample. This is done for each sample combination. The resulting text file can be used for further analysis.
