
THE USE OF BAYESIAN STATISTICS IN MASS SPECTROMETRY DATA

Literature research

Elli Karampini


CONTENTS

PART I

CHAPTER 1: BAYESIAN STATISTICS

1.1. Bayes’ Theorem

1.2. Bayesian network

CHAPTER 2: MASS SPECTROMETRY

2.1. Introduction to mass spectrometry

2.2. Instrumentation of mass spectrometry

PART II

CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC STUDIES

3.1. Data pretreatment

3.2. Data treatment

CHAPTER 4: BAYESIAN STATISTICS IN METABOLOMIC STUDIES

4.1. Applications of Bayesian approach in metabolomics

CHAPTER 5: BAYESIAN STATISTICS IN FORENSIC SCIENCES

5.1. Bayes’ Theorem and forensic evidence

5.2. Mass spectrometry and Bayesian statistics in forensic

sciences

CHAPTER 6: BAYESIAN STATISTICS IN VARIOUS APPLICATIONS

PART III

CHAPTER 7: A CRITICAL REVIEW

ACKNOWLEDGEMENTS


To my parents, Alexandros and Mary, and Achilles


CHAPTER 1: BAYESIAN STATISTICS

Bayesian statistics is the sub-category of statistics in which the evidence about the true state of a hypothesis is expressed in degrees of belief, more specifically in Bayesian probabilities. The term Bayesian was adopted to honor Thomas Bayes, although Pierre-Simon Laplace had independently worked on the same subject under the name "the probability of causes".

There are two main schools of thought in calculating probabilities. In the frequentist approach, the p-value is defined as the chance of observing the data at hand, or more extreme data, given that the hypothesis is true. This type of probability is calculated through a frequentist test, such as a t-test of one or two means. However, such a test does not answer the question "is my hypothesis correct?" but rather takes the hypothesis as a given. What the frequentist approach does ensure is a certain probability of failure of the whole procedure: although it cannot answer the question "is my hypothesis correct?", it guarantees that, when the procedures of the frequentist test are followed, the frequency of wrong answers is kept under control. By contrast, the answer to "is my hypothesis correct?" can be given through a Bayesian approach. More specifically, Bayes' theorem determines the probability that the hypothesis is correct. This process combines the background information I, the outcome (data) D of an experiment and the probability of the hypothesis before the data are considered, in order to assess the probability of the hypothesis itself, $P(\theta \mid D, I)$ [1].

The purpose of this thesis is to critically review recent developments (roughly the last five years) in the application of Bayesian statistics to the analysis of mass spectrometry data. Firstly, a short introduction to Bayesian statistics and mass spectrometry is given in Part I. In Part II, articles relevant to the topic from a broad spectrum of analytical fields, such as proteomics, metabolomics and forensic sciences, are discussed, and finally a critical review is presented in Part III.


1.1. Bayes’ theorem

Bayes' theorem is a rule that indicates how to treat conditional probabilities. The conditional probability of an event is defined as the probability obtained with the additional information that another event has already occurred [1]. The theorem was first introduced by Thomas Bayes (1701-1761) and published under the title "An Essay towards solving a Problem in the Doctrine of Chances" in 1763, two years after his death. It was his friend Richard Price (1723-1791) who communicated the paper through John Canton (1718-1772) to the Royal Society [2].

Bayes' theorem is based on the two fundamental rules of probability theory, the product rule and the addition rule. The first defines the joint probability of two or more propositions by means of the following equation

$$P(x, y \mid I) = P(x \mid I)\, P(y \mid x, I) = P(y \mid I)\, P(x \mid y, I)$$

where x and y are the propositions, which are interchangeable, I is the background information and $P(x, y \mid I)$ is the probability of x and y conditional on I. The addition rule can be expressed as

$$P(y_j \mid I) = \sum_{i=1}^{M} P(y_j, x_i \mid I) = \sum_{i=1}^{M} P(x_i \mid I)\, P(y_j \mid x_i, I)$$

and, due to symmetry,

$$P(x_i \mid I) = \sum_{j=1}^{N} P(x_i, y_j \mid I) = \sum_{j=1}^{N} P(y_j \mid I)\, P(x_i \mid y_j, I)$$

where $\{x_i: i = 1, 2, \ldots, M\}$ and $\{y_j: j = 1, 2, \ldots, N\}$ are sets of propositions, in general with M ≠ N. In this case the marginalization of joint probabilities over a discrete set of variables is defined, also known as the Total Probability Theorem. The same applies for continuous variables, such that $x \in X$, $y \in Y$ and $X, Y \subseteq \mathbb{R}$,

$$P(x \mid I) = \int_Y P(x, y \mid I)\, dy$$

and

$$P(y \mid I) = \int_X P(x, y \mid I)\, dx$$

where, as explained in Armstrong et al., “marginalization can be considered as integrating out unnecessary variables” [1].

By putting the above-mentioned probability rules together, we arrive at the mathematical formula of Bayes' theorem. If the purpose is to determine the probability of a continuous parameter θ (for instance the mean μ and/or the standard deviation σ), given the data D and the background information I, then we can express the joint probability of θ and D, given I, as:

$$P(\theta, D \mid I) = P(\theta \mid I)\, P(D \mid \theta, I) = P(D \mid I)\, P(\theta \mid D, I)$$

Since the two expressions on the right-hand side are equal, we arrive at:

$$P(\theta \mid D, I) = \frac{P(\theta \mid I)\, P(D \mid \theta, I)}{P(D \mid I)}$$

where $P(D \mid I) = \int_{\Theta} P(\theta, D \mid I)\, d\theta$ with $\Theta \subseteq \mathbb{R}$. The above formula is the mathematical equation of Bayes' theorem, and each term has its own meaning. Firstly, $P(\theta \mid D, I)$ is known as the posterior probability, which asserts the plausibility of θ given D and I. Secondly, $P(\theta \mid I)$ is called the prior probability, which is the initial probability value before any additional information, e.g. the data D, is taken into account; in other words, it is the plausibility of θ before conducting the experiment. The numerator term $P(D \mid \theta, I)$ is the likelihood and quantifies the plausibility of D given θ and I, while the denominator $P(D \mid I)$ plays the role of a normalization factor [1].
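To make the roles of these terms concrete, the following minimal Python sketch (not taken from any cited paper; the data, prior and noise level are invented) evaluates the posterior of a continuous parameter θ, here the mean of normally distributed measurements, on a discrete grid:

```python
import numpy as np

# Minimal sketch: posterior P(theta | D, I) for the mean of normally
# distributed data, computed on a discrete grid of candidate theta values.
data = np.array([5.1, 4.8, 5.3, 5.0, 4.9])   # hypothetical measurements D
sigma = 0.2                                   # assumed known noise std. dev.
theta = np.linspace(4.0, 6.0, 2001)           # grid of candidate means

prior = np.ones_like(theta) / theta.size      # flat prior P(theta | I)
# Likelihood P(D | theta, I): product of Gaussian densities over the data,
# accumulated in log space for numerical stability.
log_lik = -0.5 * ((data[:, None] - theta[None, :]) / sigma) ** 2
log_lik = log_lik.sum(axis=0)
unnorm = prior * np.exp(log_lik - log_lik.max())
posterior = unnorm / unnorm.sum()             # P(D | I) as normalization

print("posterior mean:", (theta * posterior).sum())
```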

In the case of hypothesis testing, where two hypotheses $H_1$ and $H_2$ are under investigation given a set of data D, the Bayesian formula for each one is as follows:

$$P(H_1 \mid D) = \frac{P(H_1)\, P(D \mid H_1)}{P(D)} \quad \text{and} \quad P(H_2 \mid D) = \frac{P(H_2)\, P(D \mid H_2)}{P(D)}$$

By dividing these two equations, we get:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1)}{P(H_2)} \cdot \frac{P(D \mid H_1)}{P(D \mid H_2)}$$

which can be summarized as:

$$\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio} \quad [2, 3]$$
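As a minimal numeric illustration of this rule (the prior and likelihood values below are invented):

```python
# Minimal sketch of the posterior-odds rule; all numbers are invented.
prior_H1, prior_H2 = 0.5, 0.5        # P(H1), P(H2)
lik_H1, lik_H2 = 0.08, 0.02          # P(D | H1), P(D | H2)

posterior_odds = (prior_H1 / prior_H2) * (lik_H1 / lik_H2)
print(posterior_odds)                # prior odds * likelihood ratio = 4.0
```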

Although Bayes' theorem has been known since the 18th century, a remarkable increase in the employment of this theory in different fields, such as Chemistry and Physics, has been recorded in the last two decades. To illustrate this, a search for the terms "Bayesian" or "Bayes" was conducted in the scientific search engine Web of Science. It was found that 106,585 out of 110,663 relevant articles were dated after 1990 and, of those, 90,013 after 2000. This sudden increase in interest might be due to the rise of computational power in recent decades.

1.2. Bayesian network

Probabilities play a crucial role in pattern recognition, and it is highly beneficial to support the analysis with diagrammatic representations of the dependences between variables, known as probabilistic graphical models. Graphical models offer useful properties, such as a simple way of visualizing a probabilistic model and insights into the model's properties, including conditional independence properties. The Bayesian network, also called a directed graphical model, belongs to this category. As far as the graphical illustration is concerned, the diagram consists of nodes, which represent random variables or sets of variables, connected by arcs, which express probabilistic relationships between these variables [4].

Consider for example the joint probability distribution over three variables a, b, c, i.e. $p(a, b, c)$. By applying the product rule to the joint distribution, the result is:

$$p(a, b, c) = p(c \mid a, b)\, p(a, b)$$

And by applying the product rule a second time, the right-hand side of the above equation becomes:

$$p(a, b, c) = p(c \mid a, b)\, p(b \mid a)\, p(a)$$

The latter equation can be described as a graphical model by first introducing a node for each variable and associating each node with the corresponding conditional distribution. For $p(c \mid a, b)$ there will be arcs from nodes a and b to node c, for $p(b \mid a)$ there will be an arc from a to b, and finally for $p(a)$ there will be no incoming links. If there is a link from node x to node y, then node x is called the parent of node y, while node y is called the child of node x. The link indicates that the probability of y depends on x; in other words, that $p(y \mid x)$ can adopt some value.

Figure 1: Bayesian model for three variables a, b, c with their conditional distributions, represented as arcs [4].

The example discussed above can be extended to K variables. The joint probability over K variables is given by:

$$p(x_1, x_2, \ldots, x_K) = p(x_K \mid x_1, \ldots, x_{K-1}) \cdots p(x_2 \mid x_1)\, p(x_1)$$

The equation can be represented as a directed graph with K nodes, one for each conditional distribution on the right-hand side. Each node has incoming arcs from all lower-numbered nodes; therefore, this graph is called fully connected [4].

There are also cases where arcs are absent, as shown in Fig. 2: $x_1$, $x_2$ and $x_3$ are parent nodes and there is no direct link between $x_6$ or $x_7$ and the parent nodes. This absence provides interesting information about the properties of the class of distributions that the graph represents. The decomposition of the joint distribution of these seven variables is given by:

$$p(x_1, x_2, \ldots, x_7) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3)\, p(x_6 \mid x_4)\, p(x_7 \mid x_4, x_5)$$

The rule can be generalized; thus, for a graph with K nodes, the joint distribution is given by:

$$p(x) = \prod_{k=1}^{K} p(x_k \mid pa_k)$$

where $pa_k$ denotes the set of parents of $x_k$ and $x = \{x_1, \ldots, x_K\}$. At this point an important restriction should be mentioned: there must be no closed paths within the graph, i.e. it must be impossible to move from node to node following the direction of the arcs and end up back at the starting node. Such graphs are commonly referred to as Directed Acyclic Graphs (DAGs) [4].

Figure 2: Directed acyclic graph of seven variables, with three parent nodes and no direct link between $x_6$ or $x_7$ and the parent nodes [4].
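As an illustration of this factorization, the following minimal Python sketch evaluates $p(x) = \prod_k p(x_k \mid pa_k)$ for the seven-variable graph of Figure 2; the conditional probability function used here is a hypothetical placeholder, not from any cited source:

```python
# Minimal sketch: evaluating p(x) = prod_k p(x_k | pa_k) for the DAG of
# Figure 2. The parent sets follow the figure; the conditional probability
# function is an invented placeholder for binary variables.
parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

def cond_prob(value, parent_values):
    # Placeholder p(x_k | pa_k): a node is more likely "on" (1) when more
    # of its parents are "on".
    p_on = (1 + sum(parent_values)) / (2 + len(parent_values))
    return p_on if value == 1 else 1 - p_on

def joint_prob(x):
    # x maps node index -> observed value; multiply one factor per node.
    p = 1.0
    for k, pa in parents.items():
        p *= cond_prob(x[k], [x[j] for j in pa])
    return p

print(joint_prob({1: 1, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1, 7: 0}))
```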


CHAPTER 2: MASS SPECTROMETRY

2.1. Introduction to mass spectrometry

Mass spectrometry (MS) is an analytical technique that enables the identification and quantification of compounds of interest by measuring the mass-over-charge ratio (m/z) and abundance of ions in the gas phase.

Mass spectrometry started to be employed in experiments over a century ago. Joseph John Thomson discovered in 1910 that each charged particle followed its own path to the detector, which was a photographic plate. The first experiments were done on hydrogen; later, other atoms and molecules of carbon, oxygen and nitrogen were used. He argued that no particles would strike the detector's plate at the same spot unless they shared the same velocity and charge-over-mass ratio (e/m). By inspecting the plate and knowing one of the parabolic paths that a set of particles sharing the same velocity and e/m had followed, the e/m of the other particles could be deduced. This is considered to be the birth of mass spectrometry, and important features discussed by Thomson still remain relevant [5].

Through mass spectrometry, scientists are able to measure atomic and molecular weights of species in complex samples, analyze compounds at low concentrations and analyze compounds without initial sample purification steps. These advantages outweigh the few disadvantages, such as the consumption of the sample once it is analyzed through MS [5].

Mass spectrometry can be coupled to several different techniques, such as liquid chromatography and gas chromatography, to gain more information about the sample. MS can also be coupled to itself, resulting in tandem mass spectrometry (MS/MS).

Tandem mass spectrometry was developed in order to acquire structural information about the analyte by coupling mass spectrometers in series between the ion source and the detector [5, 6]. The principle behind this type of experiment is simple. The targeted compound is ionized and its characteristic ions are separated from the mixture in the first mass spectrometer. The selected primary ions, also known as parent or precursor ions, are then decomposed in the dissociation region, resulting in fragment (historically known as daughter) ions, which are analyzed by the last mass spectrometer before reaching the detector. Tandem mass spectrometry can achieve high specificity and sensitivity while retaining the advantage of a fast response time [6].

Figure 3: Schematic representation of a tandem mass spectrometer.

2.2. Instrumentation of mass spectrometry

A mass spectrometer has three main components: a) the ionization source, b) the mass analyzer and c) the ion detector [5]. One of the key performance indicators of a mass spectrometer is the resolving power, i.e. the minimum mass difference that can be separated at a given mass. Other performance indicators are the mass accuracy, which indicates how accurately the real mass of a compound is determined, the sensitivity, which is expressed as the signal-to-noise ratio, and finally the linear dynamic range (Fig. 4) [7].

The ionization source serves the role of converting the analytes (M) into ions. Ionization occurs either when an electron is removed or added, yielding a radical (M·+ or M·−, respectively), or when charged species (e.g. H+) are added or subtracted, resulting in [M+H]+ or [M−H]−. The ionization source is also responsible for transferring the ions into the gas phase before they are introduced into the mass analyzer. There are different types of ionization sources: a) electron (impact) ionization (EI), b) chemical ionization (CI), c) electrospray ionization (ESI), d) atmospheric pressure chemical ionization (APCI), e) atmospheric pressure photoionization (APPI) and f) matrix-assisted laser desorption ionization (MALDI). EI is considered a "hard" ionization technique, resulting in considerable fragmentation of the molecular ion, while ESI and MALDI are mostly employed when the analytes are biological macromolecules, which are easily degraded under harsh conditions [5, 7].

Figure 4: Characteristics of the performance of a mass spectrometer [7].

The mass analyzer is the central part of the mass spectrometer; its role is to separate the ions with respect to their mass-over-charge ratio at a defined mass accuracy and resolution. There are various types of mass analyzers differing in concept and performance, such as the quadrupole mass filter, the time-of-flight (ToF) analyzer, the ion trap analyzer and the Orbitrap system. Often they work as stand-alone mass analyzers; however, the current trend points in the direction of hyphenated systems, which increase the strength of the mass spectrometer by combining their advantages. Some of these systems are the triple-quadrupole mass filter, the quadrupole-time-of-flight (Q-ToF) system and the time-of-flight-time-of-flight (ToF-ToF) analyzer, with the last one being strongly connected to the analysis of biomolecules [7].

Finally, the ion detector is a device that is able to generate electrical current, whose intensity is proportional to the abundance of the ions. Ions exiting the mass analyzer can either be directed to a single-channel detector or dispersed to a focal-plate (array) detector. Quadrupole or ion trap instruments are equipped with single-channel detectors, whereas time-of-flight systems use focal-plate detectors. The generated electrical current is subsequently digitized and the signal is transferred to a computer, where the data can be managed and stored [7].


CHAPTER 3: BAYESIAN STATISTICS IN PROTEOMIC STUDIES

Proteomics is the large-scale study of the proteome. The term proteome refers to the entire set of proteins, including their modifications, produced by an organism or a cellular system. The main goal of proteomics is the comprehensive and quantitative description of protein expression levels under the influence of environmental changes, such as drug treatment or disease [8]. The main analytical approach involves mass spectrometry (MS) with soft ionization. For the purification or separation of the sample prior to the analysis, liquid chromatography (LC) is usually selected.

The data derived from an MS or LC-MS system are very complex, and their further analysis can be divided into two categories: data pretreatment and data treatment. Each one, with respect to Bayesian statistics, is discussed in the following sections.

3.1. Data pretreatment

Data pretreatment is an essential step before the data become available for further analysis, which in the case of proteomic studies is strongly related to the discovery of biomarkers, drug development and disease classification. Data pretreatment usually consists of several stages, including denoising, peptide detection and spectra alignment.

All mass spectra and tandem mass spectra contain, apart from the peaks of peptide fragments that are considered useful signals, peaks of instrument noise and contaminants. This is especially the case when the data derive from complex samples. The noise peaks should therefore be removed for optimal matching to be successful [9].

Several publications can be found in which Bayesian statistics is used in the data pretreatment step. For example, Shao et al. in 2013 proposed an approach based on Bayesian inference for denoising spectra in order to build spectral libraries. They built a Bayesian classifier to distinguish between signal (S) and noise (N) and trained it so that no assumptions about peptide fragmentation behavior or instrument settings are needed [9].

The authors selected four different features that characterize a peak and serve as good indicators of whether peak $i$ ($i = 1, 2, \ldots, N$) is signal or noise. The first feature was the rank, $F_r(i)$, based on the intensity of the peak: for the most intense peak $F_r = 1$, for the second most intense $F_r = 2$, and so on. Since the intensities vary significantly from spectrum to spectrum, the team used the intensity rank as a surrogate in order to avoid large changes in scale. The second selected feature was the m/z, i.e. $F_m(i)$, which measures the relative position of a peak in the spectrum. The probability of finding signal is obviously not constant throughout the m/z range, but the exact trend is unknown prior to the experiment and is discovered from the data. Finally, the complement features, $F_{c,Z}$, record whether a complementary fragment ion can be found in the same spectrum, where the parameter Z denotes the sum of the assumed charges of the complementary pair, and the sister features, $F_{s,\Delta}$, determine the existence of a sister fragment ion in the spectrum, located at a distance Δ away from the peak of interest. This captures information concerning common neutral losses or isotopic ions [9].

Each peak is categorized as signal (S) or noise (N) by a consensus algorithm, according to whether or not it is consistently found across replicates. Once the peaks are labeled as S or N, the conditional probabilities $P(F_k \mid S)$ and $P(F_k \mid N)$ can be calculated for each feature $F_k$ mentioned above. These conditional probabilities constitute the Bayesian classifier, and they can immediately be used to denoise singleton spectra or be written to a parameter file for future use. Given the conditional probabilities, the posterior probability of an unlabeled peak can be computed by means of Bayes' theorem:

$$P(S \mid \{F_k\}) = \frac{P(S) \prod_k P(F_k \mid S)}{P(S) \prod_k P(F_k \mid S) + P(N) \prod_k P(F_k \mid N)}$$

where $P(S)$ is the prior probability of a peak being signal, without any additional information, and $P(N) = 1 - P(S)$. The prior probability was predicted by the team using a linear regression model with 16 features for any given spectrum. As in a typical Bayesian classifier, the posterior probability can be subjected to an appropriate threshold to decide whether the peak is preserved in the denoised spectrum [9].
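A minimal sketch of this classification rule is given below. The feature names and conditional probability tables are invented placeholders; in the actual method they are learned from consensus-labeled replicate spectra [9]:

```python
# Minimal sketch of the naive-Bayes peak posterior above. The feature names
# and the conditional probability tables are invented for illustration.
p_signal = 0.3                                    # prior P(S); P(N) = 1 - P(S)
cond = {                                          # P(feature present | class)
    "rank_top20":     {"S": 0.70, "N": 0.20},
    "has_complement": {"S": 0.40, "N": 0.05},
    "has_sister":     {"S": 0.50, "N": 0.10},
}

def posterior_signal(peak_features):
    # peak_features: dict feature -> bool for one unlabeled peak
    ps, pn = p_signal, 1.0 - p_signal
    for name, present in peak_features.items():
        fs, fn = cond[name]["S"], cond[name]["N"]
        ps *= fs if present else 1.0 - fs
        pn *= fn if present else 1.0 - fn
    return ps / (ps + pn)                         # Bayes' theorem

peak = {"rank_top20": True, "has_complement": False, "has_sister": True}
print(posterior_signal(peak))                     # keep if above a threshold
```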

The researchers claimed that the computed probabilities were reasonably accurate, and their "denoiser" showed that the filtered spectra retained signal peaks and exhibited high similarity to their replicates, which indicates that the method would be a useful tool for spectral libraries. Additionally, the classifier is very flexible and can be further improved by adding or modifying the selected features [9].

Peptide detection, whose main goal is to convert the raw spectra into a list of peptides and to find the existence probability of each peptide candidate, has a direct effect on subsequent analyses such as protein identification and quantification, biomarker discovery and classification of different samples. The difficulty is that peptides usually give several peaks in the spectra, due to different charge states during ionization and to isotopic peaks. To address this issue, Sun et al. proposed in 2010 a Bayesian approach for peptide detection (BPDA), which can be applied to MS data generated by instruments with sufficiently high resolution [10].

The authors used one-dimensional mass spectra (1D MS), and the proposed method can be considered a three-step approach. The first step is to obtain a list of peptide candidates from the observed peaks, the second is to model the observed spectra taking into account the signals of the N peptide candidates, and the final step is to apply the algorithm to the fitted MS model to infer the best-fitting peptide signals [10].

In more detail, the spectra were first baseline corrected, the noise was filtered, peaks were detected using "mspeaks" (a Matlab function) and a list of peptide candidates was generated. The mass of each peptide candidate was computed by means of the following equation:

$$\text{mass} = i(d - m_{pc}) - j\, m_{nt}, \quad i = 1, 2, \ldots, cs;\ j = 1, 2, \ldots, iso$$

where mass is the mass of one peptide candidate, d denotes the m/z value of a detected peak, $m_{pc}$ is the mass of one positive charge, $m_{nt}$ is the mass shift due to the addition of one neutron, and the parameters $i$ and $j$ represent the charge state and isotopic position, respectively. The researchers then modeled the spectra, taking into account the different charge states and isotopic positions of each candidate; the model also incorporates the existence probability of the candidates and the thermal noise. The signal of N peptide candidates was given by the following equation:

$$y_m = \sum_{k=1}^{N} \lambda_k g_k(x_m) + \varepsilon_m = \sum_{k=1}^{N} \lambda_k \sum_{i=1}^{cs} \sum_{j=0}^{iso} c_{k,ij}\, f(x_m;\ \rho_{k,ij};\ a_{k,ij}) + \varepsilon_m, \quad m = 1, 2, \ldots, M$$

where $x_m$ is the m-th m/z value in the spectrum, $y_m$ is the intensity at $x_m$, M is the number of observations, $\varepsilon_m$ is the noise ($\varepsilon_m \sim N(0, \sigma^2)$), $f(x_m;\ \rho_{k,ij};\ a_{k,ij})$ is the peak shape function, taken as Gaussian-shaped, with $a_{k,ij}$ being the theoretical m/z value of the peak for the k-th candidate and $\rho_{k,ij}$ the peak's width, $c_{k,ij}$ is the height of the peak of peptide k at charge state i and isotopic position j, and, finally, $\lambda_k$ is an indicator random variable, which is 1 if the peptide truly exists and 0 otherwise. The goal is to determine all the unknown parameters of the model ($\theta \triangleq \{\lambda_k, c_{k,ij};\ k = 1, \ldots, N;\ i = 1, \ldots, cs;\ j = 1, \ldots, iso\}$), based on the observed spectrum $y = [y_1, \ldots, y_M]^T$, and especially $\lambda_k$. Therefore, the Bayesian approach was employed to obtain the posterior probabilities of θ, $P(\theta \mid y)$. The posterior probability of $\lambda_k$ can be obtained by integrating the joint posterior probability over all parameters except $\lambda_k$:

$$P(\lambda_k \mid y, \theta_{-\lambda_k}) \propto p(\lambda_k)\, p(y \mid \theta), \quad \text{where } \theta_{-\lambda_k} \triangleq \theta \setminus \lambda_k$$

and for the computation the team chose the Gibbs sampling method, a variant of Markov Chain Monte Carlo (MCMC) [10].
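Gibbs sampling alternates draws from each parameter's full conditional distribution. The following minimal sketch applies it to a deliberately simple toy model (the posterior of the mean and variance of normal data), not to the BPDA model itself; all priors and data below are invented for illustration:

```python
import numpy as np

# Minimal Gibbs-sampling sketch on a standard toy model: posterior of the
# mean mu and variance s2 of normal data, alternating full-conditional draws.
rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=50)            # hypothetical intensities
n = y.size
mu0, tau2 = 0.0, 100.0                        # vague N(mu0, tau2) prior on mu
a0, b0 = 2.0, 2.0                             # InvGamma(a0, b0) prior on s2

mu, s2 = y.mean(), y.var()
samples = []
for it in range(5000):
    # mu | s2, y ~ Normal (conjugate update)
    v = 1.0 / (1.0 / tau2 + n / s2)
    m = v * (mu0 / tau2 + y.sum() / s2)
    mu = rng.normal(m, np.sqrt(v))
    # s2 | mu, y ~ Inverse-Gamma, sampled via a Gamma draw
    a, b = a0 + n / 2.0, b0 + 0.5 * ((y - mu) ** 2).sum()
    s2 = 1.0 / rng.gamma(a, 1.0 / b)
    if it >= 1000:                            # discard burn-in
        samples.append((mu, s2))

print("posterior mean of mu:", np.array(samples)[:, 0].mean())
```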

According to the authors, BPDA, which considers charge states and isotopic positions, compared favorably to commercial and open-source software in terms of peptide detection, but fell short in terms of computational time, as it was found to be time-consuming, especially when running in raw data mode [10].

Two years later, in 2012, the same team published a new paper, which can be described as the continuation of the first. This time they proposed a peptide detection approach that uses LC-MS data as input. They presented BPDA2d, a two-dimensional (2D) Bayesian peptide detection algorithm, to process the data more efficiently. BPDA2d shares the same core as BPDA, which is to evaluate all possible combinations of peptide candidates in order to minimize the mean square error (MSE) between the inferred and observed spectra. The difference between the algorithms is that BPDA models spectra along the m/z dimension only, while BPDA2d models spectra along both the m/z and retention time (RT) dimensions [11].

After baseline correction, noise filtering and peak detection along the m/z axis, the authors added one more step before obtaining the list of peptide candidates: the detected 1D peaks were connected along the RT dimension. The 1D peaks were sorted according to their RT positions, and if multiple peaks were connected at the same RT, only the one with the larger intensity was retained. The generation of peptide candidates then followed their previously described method. The model used to formulate the spectra was substantially the same, apart from the addition of time as a parameter:

$$y(x_m, t) = \sum_{k=1}^{N} \lambda_k g_k(x_m, t) + \varepsilon(t) = \sum_{k=1}^{N} \lambda_k \sum_{i=1}^{cs} \sum_{j=0}^{iso} c_{k,ij}\, l_k(t)\, I_{x_m = a_{k,ij}} + \varepsilon(t), \quad m = 1, 2, \ldots, M;\ t = 1, 2, \ldots, T$$

where $I$ is an indicator function ($I_A = 1$ if the condition A holds, $I_A = 0$ otherwise) and $l_k$ is the normalized elution profile of the k-th peptide candidate. As in the previous article, the model takes into account charge states and isotopic positions, but now also includes the peptides' elution peaks. It likewise incorporates the existence probability of the candidates and the thermal noise. The authors calculated posterior probabilities of the unknown parameters (θ) of the model using Bayes' theorem and focused on $\lambda_k$, the indicator random variable ($\lambda_k = 1$ if the peptide truly exists, $\lambda_k = 0$ if not), by integrating the joint posterior probability $P(\theta \mid y)$ over all other parameters except $\lambda_k$ [11].

The authors claimed that their model, BPDA2d, surpassed advanced software such as msInspect, as well as their previous proposal BPDA, in terms of sensitivity and detection accuracy. They also mentioned that the method is better suited for time-of-flight (ToF) data [11].

The alignment of spectra, which is necessary to correct for experimental variations, a common problem in mass spectrometry, is the basic step in the comparison of spectra. Alignment approaches can be divided into two categories: 1) feature-based and 2) profile-based approaches. The first is based on the distinction between signals from analytes and irrelevant noise, which is the key point for a successful approach, also known as peak detection, followed by the direct alignment of the detected peaks. The latter, on the other hand, uses the whole spectrum to evaluate the experimental variation and adjusts each spectrum accordingly; the attempt is to find an alignment that minimizes the difference between all spectra and a reference spectrum [12, 13, 14].

The alignment of MS spectra was the purpose of a paper by Kong et al. in 2009, with the overall goal of comparing mean spectra across different patient populations, which is helpful for biomarker discovery. The authors, advancing a previously developed method (Reilly et al. 2004), proposed a profile-based approach using a parametric model in conjunction with Bayesian inference [12]. The team started with normalization: the alignment model depends on the spectra's abundances, and therefore their variations, due to differences in sample preparation or matrix crystallization, needed to be minimized. The chosen method was to apply one scaling factor to each MS run, following the notation of Wu (2004). The second step was the alignment model, given by the following equation:

$$x_i(t) = \theta(\xi_i(t)) + \varepsilon_i(t)$$

where $x_i(t)$ denotes the height/intensity (on the log scale) of sample i at time t (which corresponds to a certain m/z in a ToF instrument), $\theta(t)$ is the average spectrum for the patient population at t, $\xi_i(t)$ is the deforming function for sample i at time t and, finally, $\varepsilon_i(t)$ is the random error. The restriction is that $\xi_i(t)$ must be monotone increasing, since otherwise it could erase observed peaks in the spectra. The $\xi_i(t)$ function is parameterized as a piecewise linear function with knots positioned at the locations (or a subset of the locations) of the observed data. The posterior modes of $\xi_1(t), \xi_2(t), \ldots, \xi_n(t)$ are estimated by minimizing

$$\sum_{i=1}^{n} \sum_{j=1}^{T-1} \frac{1}{|E_j|} \left\{ \int_{E_j} \left[x_i(t) - \theta(\xi_i(t))\right]^2 dt + \frac{\sigma^2}{\tau^2} \int_{E_j} \left[\xi_i(t) - t\right]^2 dt \right\}$$

where the team assumed that the least squares of $x_i(t) - \theta(\xi_i(t))$ over $E_j$, i.e. $\int_{E_j} [x_i(t) - \theta(\xi_i(t))]^2 dt$, are independently distributed as $\sigma^2 |E_j| \chi_l^2$, and that the least squares of $\xi_i(t) - t$ over $E_j$ are likewise independently distributed as $\tau^2 |E_j| \chi_l^2$. The $\xi_i(t)$ function needs only to be defined for $t \in T$ $(t_1, t_2, \ldots, t_T)$, and $E_j$ is the partition of $[t_1 = \min_{t \in T} t,\ t_T = \max_{t \in T} t]$ defined by the locations of the knots of $\xi_i(t)$. The above equation is subject to two conditions: 1) $\xi_i(t_j) < \xi_i(t_{j+1})$, in order to guarantee that $\xi_i(t)$ is strictly monotone increasing, and 2) if $\min_j(x_i(t_{j+1}), x_i(t_j)) > r_i(t_j)$ then $|\xi_i(t_{j+1}) - \xi_i(t_j)| = t_{j+1} - t_j$, with $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, T-1$, so as to maintain the shape of a peak along the ToF axis during the alignment process. Lastly, for the computations the researchers followed the approach of Reilly et al. (2004) and applied a dynamic programming (DP) algorithm to relate the minimized value to the approximation of the objective function [12].

The results of this method showed that the model is very efficient for low mass accuracy data. For MS spectra with 5-50 ppm accuracy, the method still shows an improvement with respect to other conventional methods, although it is not as efficient as with low mass accuracy data. An interesting idea mentioned by the authors is that this model can be used to align spectra from different laboratories, which are often severely misaligned, since the method can handle such misalignments [12].

The alignment of LC-MS data was addressed in two different papers in 2013, published by the same team and proposing the same approach with slight differences. The approach was tested, in one case, on proteomic and metabolomic data [13] and, in the other, on proteomic, metabolomic and glycomic data [14]. The team's proposal is a Bayesian Alignment Model (BAM), which is a profile-based approach.

BAM performs RT alignment based on multiple chromatograms of LC-MS runs. The model is based on the alignment of the total ion chromatogram (TIC) or the base-peak chromatogram, thus reducing the order of the data (from second to first). The model has two major components: the prototype function and the mapping function. The prototype function, m(t), characterizes the part of the spectro-chromatograms that is shared by the different samples; for the i-th chromatogram at RT t, the intensity is referred to the prototype function indexed by the mapping function $u_i(t)$, i.e. $m(u_i(t))$. Each chromatogram ($y_i$ for sample i) is modeled as:

$$y_i(t) = c_i + a_i\, m(u_i(t)) + e_i(t)$$

for $i = 1, 2, 3, \ldots, N$ observed chromatograms $y_i(t)$, where $c_i \sim N(c_0, \sigma_c^2)$ and $a_i \sim N(a_0, \sigma_a^2)$ are parameters and $e_i(t)$ is the error, which is considered independent and normally distributed, $e_i(t) \sim N(0, \sigma_e^2)$. The prototype function is modeled with a B-spline regression, $m = B_m \psi$, where ψ is a vector of elements drawn from a normal distribution with a specific mean ($\psi_l \sim N(\psi_{l-1}, \sigma_\psi^2)$, with $\psi_0 = 0$), and the mapping function is a piecewise linear function characterized by a set of knots $\tau = (\tau_0, \tau_1, \tau_2, \ldots, \tau_{k+1})$ and their corresponding indices $\varphi_i = (\varphi_{i,0}, \varphi_{i,1}, \varphi_{i,2}, \ldots, \varphi_{i,k+1})$. Through this formulation the alignment problem is transformed into an inference task: given the chromatograms $y = \{y_1, y_2, \ldots, y_N\}$, the model parameters $\theta = \{a, c, \psi, a_0, c_0, \sigma_a^2, \sigma_c^2, \sigma_e^2, \sigma_\psi^2\}$ need to be estimated. The authors used Markov Chain Monte Carlo (MCMC) methods to draw inferences for the parameters. Once the inference is complete, the alignment is carried out by applying the inverse mapping function to each chromatogram, i.e. $\hat{y}_i(t) = y_i(\hat{u}_i^{-1}(t))$ [13, 14].
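The final step, applying the inverse mapping function, can be illustrated with a minimal sketch. The piecewise-linear mapping, its knots and the toy chromatogram below are invented; this shows only the mechanics of $\hat{y}_i(t) = y_i(\hat{u}_i^{-1}(t))$, not the BAM inference itself:

```python
import numpy as np

# Minimal sketch: warping one chromatogram with a piecewise-linear mapping
# function and its inverse. Knots and data are invented for illustration.
t = np.arange(0.0, 100.0, 1.0)                 # retention-time grid
y = np.exp(-0.5 * ((t - 42.0) / 3.0) ** 2)     # a peak observed at RT 42

knots_t = np.array([0.0, 50.0, 99.0])          # mapping-function knots tau
knots_u = np.array([0.0, 44.0, 99.0])          # u_i at the knots (RT drift)

def u(tq):                                     # piecewise-linear u_i(t)
    return np.interp(tq, knots_t, knots_u)

def u_inv(tq):                                 # inverse mapping u_i^{-1}(t)
    return np.interp(tq, knots_u, knots_t)

y_aligned = np.interp(u_inv(t), t, y)          # resample y_i at u_i^{-1}(t)
print("peak before:", t[y.argmax()], "after:", t[y_aligned.argmax()])
```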

Although both papers share the same overall approach, the second paper can be regarded as an advanced version of the first. In the first article, the authors used a single ion chromatogram to estimate the prototype and mapping functions for RT alignment. As claimed, the model showed better performance than other profile-based methods, such as the Dynamic Time Warping (DTW) model (J. Chemometrics (2004) 18: 231-241) [13]. However, as stated in the second article, the first model lacked the integration of prior knowledge, e.g. internal standards, and assumed the existence of a pattern based on a single ion chromatogram only. They therefore introduced an advanced version, which incorporates multiple representative chromatograms and internal standards [14].

As far as the multiple representative chromatograms are concerned, the authors proposed a clustering approach for their identification, in which the chromatograms are considered simultaneously in the profile-based alignment to assist the estimation of the prototype and mapping functions. As for the internal standards, based on their peaks the research team was able to evaluate the RT variations by means of Gaussian Process (GP) regression, and this information is used as the prior of the mapping function [14]. The figures below graphically illustrate the models of the first and second articles.

Figure 5: Bayesian alignment model, profile-based approach [13].

Figure 6: Bayesian alignment model, profile-based approach with the incorporation of internal standards and chromatographic clustering [14].

3.2. Data treatment

Identifying proteins in a complex mixture is a common task that is usually performed by tandem MS, which has become a very useful tool for this purpose. Typically, the protein identification process consists of two main stages: in the first stage, the observed spectra and precursor m/z values are matched to peptides by a database search tool, and in the second stage proteins are scored and ranked using the scored peptides. The large amount of data produced by the MS can be considered beneficial on the one hand, but on the other it leads to false matches between peptides and spectra, lowering the specificity. One of the most common problems is the degeneracy of peptides, i.e. the possibility of matching a peptide to multiple proteins. These degenerate peptides make calculating the posterior probabilities of proteins difficult, since the posterior probability of one protein depends on the presence of another when a peptide is matched to both of them. With respect to this problem, Serang et al. proposed in 2010 a Bayesian approach for computing posterior probabilities [15].

The suggested model uses three parameters and allows a protein with strong independent evidence to minimize the significance of supporting evidence that is shared with other proteins. The model is based on seven assumptions: 1) the recovery of one peptide by the precursor scan does not influence the retrieval of other peptides, given the set of proteins in the sample; 2) the creation and observation of one spectrum does not influence the creation and observation of other spectra, given the set of peptides selected by the precursor scan; 3) a peptide is emitted by an associated present protein with probability α; 4) the wrong detection of peptides from noisy signals (the probability that a truly absent peptide, not created by an associated protein, is falsely observed) has a constant probability β; 5) the prior belief that a protein is present in the sample has probability γ; 6) the prior probabilities are independent; and finally, 7) each spectrum depends only on its best-matched peptide. From this probability model the team was able to compute the likelihood of a set of proteins, which is proportional to the probability that these proteins would create the observed spectra, as follows:

$$L(R = r \mid D) \propto P(D \mid R = r) = \sum_{\forall e} \prod_{\varepsilon} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$

which, by factoring out the sum over each peptide state $e_\varepsilon$ in turn, becomes

$$P(D \mid R = r) = \prod_{\varepsilon} \sum_{\forall e_\varepsilon} P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)\, P(E_\varepsilon = e_\varepsilon \mid R = r)$$

where R denotes the present proteins, E is the set of present peptides, ε is the peptide index and D is the observed spectra. R and E are random variables representing the true presence of proteins and peptides, and r and e are specific values of these variables. $P(D_\varepsilon \mid E_\varepsilon = e_\varepsilon)$ was calculated by PeptideProphet and $P(E_\varepsilon = e_\varepsilon \mid R = r)$ by the proposed model. The team also proposed a procedure for making the computation of posterior probabilities for large data sets more feasible. The procedure involves three steps: 1) partitioning, 2) clustering and 3) pruning, presented in Fig. 7 [15].

Figure 7: (A) Partitioning: a protein is dependent on other proteins within its connected sub-graph and not on proteins that share no peptides with the proteins in that sub-graph. (B) Clustering: proteins with identical connectivity, e.g. proteins 1 and 2, can be clustered together to compute their posterior probabilities more efficiently. (C) Pruning: within a connected sub-graph, proteins that are connected only by peptides with zero probability, e.g. proteins 4 and 5, can be divided into two sub-graphs that do not connect [15].

As mentioned in the paper, the model assumptions are not ideal. However, it is possible to evaluate their accuracy and replace them with others; for instance, assumption 5 can be improved by introducing more complex priors. Depending on the data, the model showed similar and sometimes better performance compared with other methods. Additionally, it compares favorably to other methods in cases of high-scoring degenerate peptides [15].

The validation of database search results is an aspect of interest in protein identification. As indicated above, protein identification usually has two stages: 1) matching peptides against a database for their identification and 2) scoring and ranking proteins based on the identified peptides [15]. Peptide identification is achieved by the combination of tandem MS and database searching. The validation of these results is still developing in aspects such as specificity and sensitivity. In 2009, Zhang et al. proposed a Bayesian non-parametric model (BNP) for the validation of database search results that incorporates popular methods from statistical learning, such as the Bayesian method for posterior probability calculation [16]. The model integrates an extended set of features selected from the literature, including peptide fragmentation knowledge and chemical or physical properties of the peptide.

After the tandem MS spectra are searched against the database, the first stage is to construct two subsets: one includes the decoy matches and the other consists of the matches validated by the cutoff-based method. Based on these sets, the coefficients of the LDF (linear discriminant function) score can be calculated by means of multivariate linear regression. In the second step, the distribution of the LDF score (x) is fitted by a non-parametric PDF (probability density function) with maximum likelihood parameter estimation. The hypothesis mixture PDF is formulated as $p(x) = P_{pos} f(x) + P_{neg} g(x)$, based on the idea that the random and correct matches can be grouped into subcategories and that the LDF score of each subcategory should have a simple distribution, e.g. a normal distribution. The negative component $P_{neg} g(x)$, which accounts for the random matches, can be estimated by a fully non-parametric density estimation procedure carried out by maximum likelihood estimation with the EM algorithm, as indicated by Duda et al. in 2001 and Archambeau et al. in 2003. The positive component $P_{pos} f(x)$, which accounts for the correct matches, can be estimated by a restricted fully non-parametric density estimate. After the estimation of the conditional PDFs, the probability that a match with LDF score x is correct can be calculated by the following formulation:

$$p_{cor} = \frac{p_{pos} f(x)}{p_{pos} f(x) + p_{neg} g(x)}$$

This formulation can be explained as follows. From Bayes' theorem,

$$\frac{p(pos \mid D)}{p(neg \mid D)} = \frac{p(pos)}{p(neg)} \cdot \frac{p(D \mid pos)}{p(D \mid neg)} \quad (I)$$

and, as a normalization condition,

$$p(pos \mid D) + p(neg \mid D) = 1, \quad \text{hence} \quad p(neg \mid D) = 1 - p(pos \mid D) \quad (II)$$

By introducing (II) into (I) we get:

$$\frac{p(pos \mid D)}{1 - p(pos \mid D)} = \frac{p(pos)\, p(D \mid pos)}{p(neg)\, p(D \mid neg)}$$

and then, after some rearrangement:

$$p(pos \mid D) = \frac{p(pos)\, p(D \mid pos)}{p(pos)\, p(D \mid pos) + p(neg)\, p(D \mid neg)}$$

In this equation, $p(pos \mid D)$ can be identified with $p_{cor}$ in the formula above, and $p(D \mid pos)$ and $p(D \mid neg)$ are $f(x)$ and $g(x)$ respectively, leading to the formulation the team proposed. Finally, the authors were able to make a decision in accordance with a cost function, expressed as the FDR (false discovery rate) [16]. As claimed by the research team, the model can provide a correct probability for each assignment, which supports the subsequent analysis. The model was able to identify more high-confidence proteins from an MS/MS data set than other methods, such as ProteinProphet. Its strongest aspect, as indicated in the paper, is the confirmation of a larger number of confident peptides; thus, it can provide more information for later biological analysis [16].
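A minimal sketch of the mixture computation described above is given below; the densities f(x) and g(x) are taken as Gaussians purely for illustration, whereas Zhang et al. estimate them non-parametrically with the EM algorithm [16]:

```python
import numpy as np

# Minimal sketch of the two-component mixture posterior p_cor. The
# correct-match and random-match score densities are invented Gaussians.
def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

p_pos, p_neg = 0.4, 0.6                      # mixing proportions
f = lambda x: gauss(x, 3.0, 1.0)             # density of correct matches
g = lambda x: gauss(x, 0.0, 1.0)             # density of random matches

def p_cor(x):
    # posterior probability that a match with LDF score x is correct
    return p_pos * f(x) / (p_pos * f(x) + p_neg * g(x))

for score in (0.0, 1.5, 3.0):
    print(score, round(float(p_cor(score)), 3))
```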

The protein identification issue was also addressed by LeDuc and co-workers in 2014. In this case, "top-down" experiments were employed to identify and characterize whole proteins. The main characteristic of a top-down experiment is that the precursor ion is an intact proteoform rather than small peptides produced by enzymatic digestion prior to mass spectrometry (the shotgun or bottom-up experiment). Thus, the mass of the precursor ion represents a native whole protein, and its fragment ions support the characterization and verification of the primary structure. In order to better capture the information given by top-down proteomics, the team suggested a scoring system for protein identification based on Bayesian statistics, under the name C-score [17].

The authors started with the basic formulation of Bayes' theorem:

$$P(\text{Proteoform}_i \mid \text{data}) = \frac{P(\text{Proteoform}_i)\, P(\text{data}_{MS/MS} \mid \text{Proteoform}_i)}{P(\text{data}_{MS/MS})}$$

where $P(\text{Proteoform}_i \mid \text{data})$ is the posterior probability (the probability of the i-th proteoform given the MS/MS data), $P(\text{Proteoform}_i)$ is the prior probability of proteoform i, $P(\text{data}_{MS/MS} \mid \text{Proteoform}_i)$ is the likelihood, read as the probability of the data given proteoform i, and $P(\text{data}_{MS/MS})$ is known as the probability of the data. The team restated the above equation by defining variables, arriving at:

$$P(\varphi_q \mid M_O, \{m_i\}) = \frac{P(\varphi_q)\, P(M_O, \{m_i\} \mid \varphi_q)}{P(M_O, \{m_i\})}$$

where $M_O$ is the observed mass of the precursor ion, $m_i$ is the mass of the i-th of the n fragment ions, so that $\{m_i\}_{i=1}^{n}$ is the set of all observed fragment ions, and $\varphi_q$ is the q-th candidate (from the k candidate proteoforms in the database). The prior probability $P(\varphi_q)$ can be taken as "all hypotheses are equal", or one can assign higher or lower prior probabilities to particular candidates. An interesting aspect of this scoring model is that, in contrast to other Bayesian methodologies (also presented in this literature thesis), the model has no unknown parameters: instead of inferring values from the data collected for the study in question, the values are taken either from the team's knowledge of mass spectrometry or from prior studies that focused on determining the needed values [17].

In order to calculate the likelihood $P(M_O, \{m_i\} \mid \varphi_q)$, LeDuc et al. assumed independence of all $m_i$, and thus:

$$P(M_O, \{m_i\} \mid \varphi_q) = P(M_O \mid \varphi_q) \prod_{i=1}^{n} P(m_i \mid \varphi_q)$$

To avoid issues such as the calculated value decreasing as the list of fragment ions grows, the research team developed the aforementioned equation into:

$$P(M_O, \{m_i\} \mid \varphi_q) = f\!\left(P(M_O \mid \varphi_q)\right)\, g\!\left(\left(\prod_{i=1}^{n} P(m_i \mid \varphi_q)\right)^{1/n}\right)$$

where f is a simple identity function for the precursor ion and g is a linear function of the base-10 logarithm of the fragment ion probability [17]. As stated by the authors, the proposed model showed high specificity and sensitivity compared with other methods. It is also mentioned that, when the data are sufficient, the C-score demonstrates high characterization power, and the method is flexible enough to adapt the scoring system to the experimental procedure [17].
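The effect of the geometric mean can be seen in a minimal sketch; the probabilities below are invented, and g is simplified to an identity function rather than the linear-in-log10 form used by LeDuc et al. [17]:

```python
import math

# Minimal sketch of the geometric-mean device: averaging fragment-ion
# probabilities so the score does not shrink merely because more fragment
# ions are listed. All probabilities are invented for illustration.
p_precursor = 0.9                        # P(M_O | phi_q)
frag_probs = [0.8, 0.7, 0.9, 0.6]        # P(m_i | phi_q) for n fragment ions

plain_product = p_precursor * math.prod(frag_probs)
geo_mean = math.prod(frag_probs) ** (1.0 / len(frag_probs))
score = p_precursor * geo_mean           # f = identity; g simplified here

print(plain_product, score)              # product shrinks with n; mean does not
```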

Spectral counting is a label-free quantitative approach in shotgun proteomics, defined as measuring the abundance of a given protein based on the number of tandem mass spectral observations of its identified peptides. Spectral counts (SPCs) have shown good correlation with the abundance of the corresponding protein. Since SPCs can be extracted from any database search engine, spectral counting is a flexible and straightforward technique. In 2011, Booth et al. proposed a Bayesian model for comparing SPCs under two treatments, which allows the simultaneous classification of thousands of proteins [18].

The classification is conducted by calculating the posterior odds:

$$O_i = \frac{P(I_i = 1 \mid \text{data})}{P(I_i = 0 \mid \text{data})}, \quad i = 1, \ldots, p$$

where $I_i$ is the indicator for the non-null status of the i-th protein; a "non-null status" indicates that the protein has been affected by the treatment. If $O_i > c$, for a suitably large c, the protein is classified as non-null. The choice of the threshold c is based on controlling the false discovery rate (FDR), i.e. the rate at which proteins are classified as non-null when in fact there is no treatment effect. In order to compute the posterior odds, the research team considered the following model:

$$\log \mu_{ij} = \beta_0 + \beta_1 T_j + b_{0i} + b_{1i} I_i T_j + \log L_i + \log N_j$$

where $\mu_{ij}$ denotes the expected count for protein i in replicate j, $\beta_0$ is the overall mean for the control replicates, $\beta_1$ is the overall treatment effect, $b_{0i}$ and $b_{1i}$ are the corresponding protein-specific effects, $L_i$ and $N_j$ are offsets accounting for the protein length and the replicate effect, respectively, and $T_j$ is a binary indicator of the treatment. The model is completed by placing prior distributions on the model parameters. The authors considered three different prior distributions for the protein-specific coefficients: one allows for potential correlation between the protein-specific coefficients, while the other two assume they are independent, and one of these allows the posterior mean of the protein-specific treatment effects to differ between the null and non-null groups. The necessary computations were performed by means of Markov Chain Monte Carlo (MCMC) methods [18].

The results announced by the team showed that the proposed model is more statistically coherent and valid than the other approaches they compared it to (Bayes factors) and leads to a simple classification; however, it is a significantly slower process than one-at-a-time methods, such as the score test, where the results are instantaneous [18].
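To make the threshold choice concrete, the following minimal sketch (with invented posterior probabilities, not from Booth et al.) selects non-null proteins by controlling an estimated Bayesian FDR:

```python
import numpy as np

# Minimal sketch: declare proteins non-null in order of posterior
# probability until the estimated Bayesian FDR exceeds the target.
post = np.array([0.99, 0.97, 0.91, 0.80, 0.55, 0.30])  # P(I_i = 1 | data)
target_fdr = 0.10

order = np.argsort(-post)                 # most confident proteins first
# expected fraction of false discoveries among the top-k selections
fdr = np.cumsum(1.0 - post[order]) / np.arange(1, post.size + 1)
n_selected = int(np.sum(fdr <= target_fdr))
print("declare non-null:", order[:n_selected],
      "estimated FDR:", fdr[n_selected - 1])
```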

For the quantification of a peptide, apart from label-free techniques, there are also labeling techniques, such as the 18O-labeling approach. In enzymatic 18O-labeling, the two oxygen atoms in the carboxyl terminus of a peptide are replaced with oxygen isotopes from heavy-oxygen water. The result is an m/z shift of 4 Da for the labeled peptide; the labeled and unlabeled peptides are thus separated with respect to their m/z in a spectrum, which allows the comparison of two samples. In practice, however, due to water impurities (presence of 16O and 17O) and mislabeling, not all labeled peptides receive two 18O atoms, which results in a complex mixture of shifted and overlapping isotopic peaks from the labeled and unlabeled samples. In order to estimate the relative abundance of a peptide in two samples in the face of this problem, Zhu et al. proposed a model within a Bayesian framework for MALDI-TOF data in 2011 [19].

The suggested model is an extension of a previous modeling approach (Valkenborg 2008, Zhu et al. 2010), to which random effects for technical/biological variability were added. The model is given by:

$$y_{ij} = \mu_{ij} + \varepsilon_{ij}$$

where $y_{ij}$ is the experimental intensity obtained at the i-th spectrum (or sample) and j-th peak; the errors $\varepsilon_{ij} \sim N(0, \sigma^2 \mu_{ij}^{2\theta})$ are independent, and θ is the power parameter of the variance function, accounting for the heteroscedastic nature of MS data. The mean intensity $\mu_{ij}$ is the expected intensity of the j-th peak (j = 1 denoting the monoisotopic peak of the unlabeled peptide) of the i-th spectrum. This can be expressed as:

$$\mu_{ij} \equiv E(y_{ij}) = \begin{cases} H_i R_j + Q_i H_i \sum_{k=0}^{\min(4,\, j-1)} P_k R_{j-k} & \text{if } 1 \le j \le l \\ Q_i H_i \sum_{k=j-l}^{4} P_k R_{j-k} & \text{if } l+1 \le j \le l+4 \end{cases}$$

where a peptide has $l \ge 5$ isotopic variants and an 18O-labeled peptide has $l + 4$.

$H_i$ ($H_i \sim N(H, \sigma_H^2)$) is the unobserved abundance of the peptide in the unlabeled sample (Sample I) in the i-th spectrum, and $Q_i$ ($Q_i \sim N(Q, \sigma_Q^2)$) is the relative abundance of the peptide from the labeled sample (Sample II) with respect to Sample I. $P_k$ is the m/z shift probability, which is calculated via an MCMC model, and $R_j = h_j / h_1$, $j = 1, \ldots, l$, is the isotopic ratio of the j-th isotopic variant, where $h_1$, $h_2$, etc. denote the probabilities of occurrence of the first, second, etc. isotopic variant. The terms $H Q P_k R_{j-1-k}$ show the contribution to the mean value of the observed peaks from the isotopic variants of the peptide from Sample II [19].

The team suggested that, by using a Bayesian approach, the incorporation of prior information could be advantageous for the analysis of the data. Although similar approaches had been published in previous years (Eckel-Passow, 2006), this method takes into account the presence of 17O atoms in the heavy water and allows the isotopic distribution to be determined by the data. As pointed out, some extensions of the approach are under further research, such as the incorporation of an informative prior into the Bayesian model, which would yield a gain in precision for the estimation of the parameters [19].

Current methods for data analysis aimed at biomarker discovery, e.g. cancer diagnosis at an early stage, can be divided into two categories: 1) profiling, where the input is a list of peaks, and 2) whole-spectrum, in which the entire spectrum is the input. It is argued that the profiling method is of greater importance, since the detected peaks are more meaningful in that they represent species that can be identified and further studied. The profiling method mainly consists of eight steps: i) resampling, ii) noise reduction, iii) baseline correction, iv) normalization, v) peak detection, vi) peak alignment, vii) feature (peak) selection and viii) classification. As part of a profiling study driven by the goal of discovering biomarkers that can distinguish cancer from normal samples, He et al. proposed Bayesian additive regression trees (BART) to build a classification model [20].

As shown in the schematic illustration of the proposed profiling method (Figure 8), after the data were baseline corrected and normalized, the next step was peak detection. The authors used, for the first time in this context, a smooth non-linear energy operator (SNEO), a method that has been successfully used in electroencephalography and electrocardiography and that was modified to be suitable for peak detection in MS data.

Figure 8: The proposed method in a schematic illustration [20].

After the peak detection, a correlation-based peak selection was applied, and the selected small peak set was used as input for BART in order to build a prediction model. BART was applied to classify samples and identify biomarkers. It is akin to methods that construct a set of classifiers, e.g. decision trees, to classify new data points. BART is defined by a prior and a likelihood and reports inference as a summary of all relevant uncertainties. It can be considered a Bayesian "sum-of-trees" model, where each tree is constrained by a prior to be a weak learner. For biomarker identification, the first step is to rank the selected peaks according to their contribution to the classification; in this case, the contribution is determined by counting how many times a peak is used in the BART model. At first the model was built on a few top-ranking peaks, and then the number of peaks was progressively increased [20].

As asserted in the paper, the method showed excellent classification performance, and the obtained results could be subjected to further research and validation. It is also mentioned that BART was accurate and its results better interpretable. Finally, by using the built-in partial dependence plot function of the BART model, it was possible to examine the effect of each biomarker on cancer identification, as well as the interactions between the biomarkers [20].

As mentioned in the description of the previous paper, feature selection is a preliminary step in the data analysis, with the final goal of building a classifier for biomarker discovery. It is a common phenomenon that the initial discovery results in a relatively large collection of biomarkers, of which only a few remain relevant after subsequent testing with new data. The main problem is the over-fitting of classifiers, which, due to small sample sizes and large numbers of variables, results in a high false-positive rate among biomarker candidates. To overcome this problem, Kuschner et al. developed a method for feature selection based on a Bayesian network (BN) in 2010 [21].

To build the BN, the authors used a model-free test for independence based on mutual information. Mutual information (MI) can be described as the measure of the information gained about one variable by knowing the value of another, and is calculated by the equation:

$$MI(X; Y) = \sum_{x,y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\, P(y)}$$

where X and Y are two variables, x and y represent all the possible values that X and Y can take, respectively, and $P(x, y)$ denotes the joint probability that X takes the value x and Y takes the value y. An MI value equal to 0 indicates that the variables are independent. The first step of the method is thus to find the variables that show dependency on the class by calculating $MI(\text{class}; \text{feature})$. All features with $MI(\text{class}; \text{feature})$ above a threshold are considered to have a connection with the class variable; in the graphical illustration of the BN this means that a directed arc is created from the class node to the node representing the selected feature. In this way the set of first-level features is established, and once this is done they are tested individually against all other features to determine connections between features. This is done by calculating the mutual information between the first-level features and all features: if $MI(\text{first-level feature}; \text{feature})$ is above the threshold (equal to the threshold used in the previous step), a directed arc is created to represent this dependency. When the connection is between two first-level features, an additional test is required to determine the direction of the arc. Such a test is based on computing the remaining mutual information between the class (C) and one of the first-level variables (F1) when the other first-level variable (F2) is known:

$$MI(C; F_1 \mid F_2) = \sum_{c, f_1, f_2} P(C, F_1, F_2) \log_2 \frac{P(C, F_1 \mid F_2)}{P(C \mid F_2)\, P(F_1 \mid F_2)}$$

When this conditional mutual information is 0, the initial link between the class and the (first-level) feature F2 is removed [21]. Similar tests on mutual information are done to simplify the BN by eliminating non-relevant connections. For instance, if the connection to the class C is of the form $C \to F_1 \to F_2$, then the feature F2 becomes independent of the class C when the data are partitioned on the values of F1, and $MI(C; F_2 \mid F_1)$ drops to zero. This indicates that F2 is independent of the class, and the initial link $C \to F_2$ is eliminated, providing a means to organize the first-level features with their in-between dependencies.
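The screening step can be illustrated with a minimal sketch that estimates MI(class; feature) from co-occurrence counts of discretized values; the toy data below are invented:

```python
import numpy as np

# Minimal sketch: estimate MI(X; Y) in bits from co-occurrence counts of
# integer-coded (discretized) values.
def mutual_information(x, y):
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()                       # P(x, y)
    px = joint.sum(axis=1, keepdims=True)      # P(x)
    py = joint.sum(axis=0, keepdims=True)      # P(y)
    nz = joint > 0                             # skip zero cells (0 log 0 = 0)
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

cls = np.array([0, 0, 0, 0, 1, 1, 1, 1])       # class labels
feat = np.array([0, 0, 1, 0, 1, 1, 1, 0])      # discretized peak intensity
print(mutual_information(cls, feat))           # ~0.19 bits; keep if > threshold
```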

Fig. 9 illustrates the first BN for leukemia data. The first-level features are those showing dependency on the class when conditioned on other features, while the second-level features had a low MI with the class when conditioned on the parent feature. As stated in the article, the Bayesian network/mutual information approach was able to provide a more distinct division between stable and unstable features and gave the opportunity to examine relationships between features that may be useful in identification. However, when the model was tested with artificial data, the method was not able to recreate the intended network completely, due to limitations of the algorithm [21].

Proteomic data treatment also covers the prediction of post-translational modifications (PTMs), for example the identification of peptide sequences and the PTMs associated with a peptide in a biological sample [22]. PTMs are chemical modifications involving the enzymatic addition of a small chemical group or a larger moiety to the side chain of one or more amino acids. They regulate the activity of a protein and may occur either after or during translation [23]. The analytical method usually chosen for PTM prediction is tandem MS followed by analysis with a "blind" (unrestrictive) PTM search engine. This type of search engine is used because it can predict both known and novel PTMs, but such engines are usually affected by the noise in mass spectrometric data and produce false predictions containing inaccurate modification masses and incorrect modification positions [22]. To address this issue, Chung et al. proposed a machine learning algorithm, PTMclust, in 2011 [23]. In 2013, however, the team, addressing the limitations of PTMclust, proposed a new method, iPTMclust, which introduces prior probabilities on the model variables and parameters in order to overcome these disadvantages [22].

Figure 9: Bayesian network for leukemia data. First-level features have high mutual information with the class when conditioned on other features. Second-level features showed little mutual information with the class when conditioned on the parent features [21].

PTMclust is applied to the results given by "blind" search engines and improves the predictions by suppressing the noise in the data and clustering peptides with the same modification into PTM groups. Based on the group, the algorithm finds the most probable modification mass and position [23]. Nonetheless, PTMclust showed some limitations, such as the greedy method for selecting the number of PTM clusters, the need for manual parameter tuning and the lack of a confidence score per modification position. In order to overcome these limitations, the team extended PTMclust by making use of an infinite non-parametric mixture model, and thus infinite-PTMclust (iPTMclust) was developed [22].

The core of the model in iPTMclust remained the same as in PTMclust and describes how a modification mass and a modification position are generated. In the extended version (iPTMclust), priors were introduced on the model variables and parameters that control the choice of a PTM group from a very large number of PTM groups. The relationships between the variables can be illustrated by a Bayesian network (Fig. 10). The shaded nodes correspond to the observed variables, the unshaded nodes in the bottom plate indicate the latent variables, the unshaded nodes in the upper plates are the model's parameters, and the hyper-parameters (a hyper-parameter being a parameter of a prior distribution) are shown outside the plates. The model's parameters are the mixing coefficient (a), the modification mass means ($\mu_k$), the modification mass variances ($\Sigma_k$) and the probability of a modification occurring on an amino acid ($\beta_{kj}$); the observed variables are the observed modification mass ($m_n$) and position ($x_n$) and the peptide sequence ($S_n$). By combining the structure of the Bayesian network with the conditional distributions, the joint probability can be written as:

Figure 10: Bayesian network showing the relationship between variables of the model [22].

$$P(c, a, z, x, m, \mu, \Sigma, \beta, \lambda, \upsilon, \varphi, \xi, \gamma, \omega \mid S, \Psi) = P(\gamma \mid \Psi)\, P(\lambda \mid \Psi)\, P(\upsilon \mid \Psi)\, P(\varphi \mid \Psi)\, P(\xi \mid \Psi)\, P(\omega \mid \Psi) \times \prod_{n=1}^{N} \left[ P(c_n \mid \gamma)\, P(m_n \mid c_n, \lambda, \upsilon, \varphi, \xi)\, P(a_n \mid c_n, \omega)\, P(z_n \mid a_n, S_n, \Psi)\, P(x_n \mid z_n, \Psi) \right]$$

where Ψ represents the model hyper-parameters for the hyper-priors placed on γ, λ, υ, φ, ξ and ω. The combination of latent variables and prior distributions leads to a complex joint probability distribution over high-dimensional spaces; therefore, an MCMC (Markov Chain Monte Carlo) method was employed for the necessary computations [22].

The authors claimed that iPTMclust outperformed their previous method (PTMclust) and other PTM algorithms. Since iPTMclust provides the user with position-level confidence scores for modifications, the quality of the results can be evaluated and the analysis further refined [22].
