Statistical support of ageing research at the UMCG


faculty of mathematics and natural sciences


Master Project Mathematics

February 2015

Student: M. Eisenmann

First supervisor: Dr. M.A. Grzegorczyk

Second supervisor: Prof. Dr. E.C. Wit


Acknowledgements

Foremost, I would like to thank my supervisor Dr. Marco Grzegorczyk for the support throughout the year I was working on this thesis. He helped me during the whole process, was always open for questions and available when needed. I also want to thank Prof. Dr. Ernst Wit for being the second supervisor of this research project.

I would also like to thank the people of the UMCG, Peter Horvatovich, Karin Wolters, Barbara Bakker, Jolita Ciapaite and Sarah Stolle, for allowing me to be part of this interesting study and for the pleasant discussions and meetings.


This thesis is about the statistical support of ageing research on mice at the UMCG hospital in Groningen. The mice were kept under different living conditions in order to study the effect of those conditions. Multi-way ANOVA and linear regression are used to analyse the data.

The data set consists of peptide concentrations in mouse cells, measured via mass spectrometry.

The way of preprocessing, analysing and interpreting the data is meant to serve as a framework for similar analyses that will be done at the UMCG in the future.


Introduction

When I first approached my supervisor Dr. Marco Grzegorczyk about a topic for my master's thesis, he told me about his collaboration with Peter Horvatovich, who works in analytical biochemistry for the University of Groningen at the UMCG hospital in Groningen. We had a first meeting with Peter, where he introduced us to an experiment in ageing research. In this experiment, peptide concentrations in cells of mice had been measured with mass spectrometry. Before the data were collected, the mice had been living under different conditions over a time span of two years.

The data set is completely new, meaning that no analysis has been performed on it before. It is potentially a rich set of data, containing a lot of information yet to be found. Since a lot of time and resources went into collecting the data, it is important to perform a state-of-the-art analysis and to draw the right conclusions.

The aim of this thesis is to provide a first general analysis of the data set, to see its potential, and also to draw some conclusions. To this end, analysis of variance (ANOVA) is used. ANOVA is a rather old method that goes back to the work of Fisher in 1918, but it is still state of the art for the kind of problem we are facing here. It is used in many fields to analyse data and is a well-understood method.

The data analysed in this thesis consist of 78 different peptide concentrations measured in three tissues for 97 mice. The mice were killed at four different time points: after 6, 12, 18 and 24 months. While alive, the mice were given different diets and had different access to food: either as much as they wanted, or rationed portions. Some were also given a running wheel to allow them some exercise. These different living conditions can be interpreted as factors with different levels, a classical setup for an ANOVA.

When its model assumptions are fulfilled, the ANOVA allows us to test the significance of the factors and to estimate the effects of their levels. Since the data have been collected for ageing research, we are mostly interested in the effect sizes and significance of the different ages of the mice. This can allow a better understanding of the ageing process.

This thesis aims to help with the writing of a paper about the data set. This paper will be published by Prof. dr. Barbara M. Bakker.

A big part of the thesis was the communication with the team at the UMCG. They needed to explain the relevant biology to us, and we needed to explain our mathematical point of view to them. Even though it is necessary to understand the origin and aim of the data set, the biology is kept to a minimum in this document, simply because it is a mathematical thesis.

In addition to the peptide concentrations, the pyruvate oxidation in mitochondria has been measured. This is a measure of the activity of a reaction chain in mitochondria. The data set contains peptides coding for the proteins participating in the reaction chain. The thesis also tries to find the proteins that regulate the activity of this chain.

Another aim of the thesis is to deliver a kind of tool to the UMCG that allows simpler analyses of similar data sets in the future. To this end, the process of analysing this data set is kept general, so that it can easily be adapted and reused. The code used for the analysis is also written in such a way that it can be applied to other data.


ANOVA-theory

This chapter is based on the chapter ”The Analysis of Variance (ANOVA)” from the book ”Regression: Linear Models in Statistics”, by N.H. Bingham and J.M. Fry.

Student's t-test is used to compare two normal means. To compare more than two, something different has to be done. A possible answer is to perform an ANOVA (analysis of variance). The theory of ANOVA goes back to Fisher in 1918. His motivation came from comparing yields of one crop grown under several conditions. His approach was the following: if the variability between groups of different treatment is bigger than the variability within those groups, one would not believe the treatment means to be the same. So ANOVA compares means of groups by analysing variances.

2.1 One-way ANOVA

For r treatments, let µ_i be the mean value of the i-th treatment, so i = 1, . . . , r.

For every i, there are n_i data points X_ij. The X_ij are assumed to be normal and independent. All X_ij are assumed to have the same variance σ². So

X_ij ∼ N (µ_i, σ²) with j = 1, . . . , n_i and i = 1, . . . , r.

Let

$$ n = \sum_{i=1}^{r} n_i $$

be the total number of data points. Let a bullet denote that a certain index has been averaged out, i.e. for the i-th group mean we write

$$ X_{i\bullet} = \bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}, $$


the grand mean is denoted as

$$ X_{\bullet\bullet} = \bar{X} = \frac{1}{n} \sum_{i=1}^{r} \sum_{j=1}^{n_i} X_{ij}, $$

and the i-th sample variance as

$$ S_i^2 := \frac{1}{n_i} \sum_{j=1}^{n_i} (X_{ij} - X_{i\bullet})^2. $$

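The total sum of squares is defined as

$$ SS = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (X_{ij} - X_{\bullet\bullet})^2 = \sum_{i=1}^{r} \sum_{j=1}^{n_i} \left[ (X_{ij} - X_{i\bullet}) + (X_{i\bullet} - X_{\bullet\bullet}) \right]^2. $$

Note that

$$ \sum_{j} (X_{ij} - X_{i\bullet}) = 0. $$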

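If we expand the square in the total sum of squares, it simplifies to

$$
\begin{aligned}
SS &= \sum_i \sum_j (X_{ij} - X_{i\bullet})^2 + \sum_i 2 (X_{i\bullet} - X_{\bullet\bullet}) \sum_j (X_{ij} - X_{i\bullet}) + \sum_i \sum_j (X_{i\bullet} - X_{\bullet\bullet})^2 \\
&= \sum_i \sum_j (X_{ij} - X_{i\bullet})^2 + \sum_i \sum_j (X_{i\bullet} - X_{\bullet\bullet})^2 \\
&= \sum_i n_i S_i^2 + \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2.
\end{aligned}
$$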

Here the first term is a measure of the variability within groups; we call it SSE (sum of squares for error). The second term measures the variability between groups of different treatment; we call it SST (sum of squares for treatment). So

SS = SSE + SST.
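The decomposition can be checked numerically. The following is a minimal sketch in plain Python; the function name `one_way_sums_of_squares` and the example numbers are assumptions made for illustration and are not taken from the mice data.

```python
def one_way_sums_of_squares(groups):
    """Return (SS, SSE, SST) for a list of groups of observations."""
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Total sum of squares around the grand mean.
    ss = sum((x - grand_mean) ** 2 for g in groups for x in g)
    # Within-group variability (sum of squares for error).
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    # Between-group variability (sum of squares for treatment).
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    return ss, sse, sst


# Made-up example: three treatments with unequal group sizes.
groups = [[4.1, 5.0, 4.6], [6.2, 5.8, 6.9, 6.1], [5.1, 4.9]]
ss, sse, sst = one_way_sums_of_squares(groups)
print(ss, sse + sst)  # the two numbers agree up to rounding error
```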

Let the null hypothesis H₀ be that there is no effect of treatment, i.e. µ_i = µ for all i and X_ij ∼ N(µ, σ²). So under H₀ our data come from one big sample of size n. This means that

$$ \frac{SS}{\sigma^2} = \frac{1}{\sigma^2} \sum_i \sum_j (X_{ij} - X_{\bullet\bullet})^2 \sim \chi^2(n - 1) $$

under H₀. Any χ²-distribution with k degrees of freedom has mean k, so under H₀

$$ E\left[ \frac{SS}{n - 1} \right] = \sigma^2. $$


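With or without H₀ it holds that

$$ \frac{n_i S_i^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_j (X_{ij} - X_{i\bullet})^2 \sim \chi^2(n_i - 1). $$

Next we use the addition property of the χ²-distribution:

$$ \frac{SSE}{\sigma^2} = \sum_i \frac{n_i S_i^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_i \sum_j (X_{ij} - X_{i\bullet})^2 \sim \chi^2(n - r). $$

By the same reasoning as before it follows that

$$ E\left[ \frac{SSE}{n - r} \right] = \sigma^2. $$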

For the following, H₀ is assumed to be true. From the independence of the sample mean and the sample variance of a normal sample, we get that S_i² and X_i• are independent. S_i² is also independent of X_j• for j ≠ i, since they come from different independent samples.

Combined, this means that every S_i² is independent of every group mean X_j•, and therefore also of SST, which depends on the group means only. Hence SSE and SST are independent:

SS/σ² = SSE/σ² +_ind SST/σ²,

where +_ind denotes a sum of independent terms.

If a χ²-distributed variable with (n₁ + n₂) degrees of freedom is split into an independent sum in which one summand is χ²(n₁), then the second summand must be χ²(n₂). This is called the subtraction property of the χ²-distribution. Here the left-hand side is χ²(n − 1) and the first part on the right-hand side is χ²(n − r).

Therefore, the second part must be χ²(r − 1).

Next we define the mean sum of squares by dividing the χ²-distributed variables by their degrees of freedom. The mean sums of squares for error and for treatment are defined analogously:

$$
\begin{aligned}
MS &:= SS/df(SS) = SS/(n - 1), \\
MSE &:= SSE/df(SSE) = SSE/(n - r), \\
MST &:= SST/df(SST) = SST/(r - 1).
\end{aligned}
$$

We know, again from the mean of the χ²-distribution, that

$$ E[MS] = E[MSE] = E[MST] = \sigma^2. $$

We have to remember that this holds only if H₀ is true. Without H₀ one only gets E[MSE] = σ².


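Next we look into the F-statistic

$$ F := MST/MSE. $$

Under H₀, F is the ratio of two independent χ²-distributed variables, each divided by its degrees of freedom, so it has a Fisher distribution with (r − 1) and (n − r) degrees of freedom. Comparing the F-statistic with the table values of the Fisher distribution therefore gives a way of testing H₀. If H₀ is not true, we expect MST, and hence the F-statistic, to be bigger. To show this we look at the following: with or without H₀ we have

$$
\begin{aligned}
SST &= \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2 = \sum_i n_i X_{i\bullet}^2 - 2 X_{\bullet\bullet} \sum_i n_i X_{i\bullet} + X_{\bullet\bullet}^2 \sum_i n_i \\
&= \sum_i n_i X_{i\bullet}^2 - n X_{\bullet\bullet}^2,
\end{aligned}
$$

since Σ_i n_i X_i• = n X•• and Σ_i n_i = n. Taking expectations and using E[Y²] = var(Y) + (EY)², we get

$$
\begin{aligned}
E[SST] &= \sum_i n_i E[X_{i\bullet}^2] - n E[X_{\bullet\bullet}^2] \\
&= \sum_i n_i \left[ \mathrm{var}(X_{i\bullet}) + (E X_{i\bullet})^2 \right] - n \left[ \mathrm{var}(X_{\bullet\bullet}) + (E X_{\bullet\bullet})^2 \right].
\end{aligned}
$$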

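We have that var(X_i•) = σ²/n_i, and, as Σ_i n_i = n and the group means are independent,

$$ \mathrm{var}(X_{\bullet\bullet}) = \mathrm{var}\left( \frac{1}{n} \sum_i n_i X_{i\bullet} \right) = \frac{1}{n^2} \sum_i n_i^2 \, \mathrm{var}(X_{i\bullet}) = \frac{1}{n^2} \sum_i n_i^2 \frac{\sigma^2}{n_i} = \frac{\sigma^2}{n}. $$

So with

$$ \bar{\mu} := \frac{1}{n} \sum_i n_i \mu_i = E X_{\bullet\bullet} = E\left[ \frac{1}{n} \sum_i n_i X_{i\bullet} \right], $$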


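we get

$$
\begin{aligned}
E(SST) &= \sum_i n_i \left( \frac{\sigma^2}{n_i} + \mu_i^2 \right) - n \left( \frac{\sigma^2}{n} + \bar{\mu}^2 \right) \\
&= (r - 1)\sigma^2 + \sum_i n_i \mu_i^2 - n \bar{\mu}^2 \\
&= (r - 1)\sigma^2 + \sum_i n_i \mu_i^2 - 2 n \bar{\mu}^2 + n \bar{\mu}^2 \\
&= (r - 1)\sigma^2 + \sum_i n_i \mu_i^2 - 2 n \bar{\mu} \, \frac{1}{n} \sum_i n_i \mu_i + \sum_i n_i \bar{\mu}^2 \\
&= (r - 1)\sigma^2 + \sum_i n_i \mu_i^2 - 2 \sum_i n_i \mu_i \bar{\mu} + \sum_i n_i \bar{\mu}^2 \\
&= (r - 1)\sigma^2 + \sum_i n_i (\mu_i - \bar{\mu})^2.
\end{aligned}
$$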

So the inequality

$$ E(SST) \geq (r - 1)\sigma^2 $$

holds, with equality if and only if µ_i = µ̄ for all i, i.e. under H₀. We therefore want to reject the null hypothesis H₀ if the F-statistic takes too high values, and use a one-tailed Fisher test with significance level α: we reject H₀ for

$$ F > F_\alpha(r - 1, n - r). $$

In practice we will look at the p-value for a treatment to decide whether it is significant or not.
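As an illustration of the test statistic, the sketch below computes MST, MSE and F for a one-way layout in plain Python. The function name `one_way_f_statistic` and the toy data are assumptions chosen so that the result can be verified by hand; the thesis analysis itself was done in MATLAB.

```python
def one_way_f_statistic(groups):
    """Return (F, df_treatment, df_error) for a one-way ANOVA."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    sst = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    mse = sse / (n - r)   # mean sum of squares for error
    mst = sst / (r - 1)   # mean sum of squares for treatment
    return mst / mse, r - 1, n - r


# Toy data: group means 2, 3, 4 and grand mean 3, so SSE = 6 and SST = 6.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f, df1, df2 = one_way_f_statistic(groups)
print(f, df1, df2)  # 3.0 2 6
```

Standard tables give a critical value of approximately 5.14 for F with (2, 6) degrees of freedom at α = 0.05, so for this toy data set H₀ would not be rejected.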

2.2 Multi-way ANOVA

In the section above we looked into the effect of a single kind of treatment. What if we want to apply two kinds of treatment at the same time? In the example with the crops one could think of different strategies of watering the plants and of planting them in different surroundings, e.g. on a hill or not. One would call this an experiment with two factors (treatment and surrounding). To be able to separate these effects in an analysis, the growing area of the crop needs to be divided into blocks with the same surroundings, and each block into plots that receive different treatments. The plot assigned to a treatment should be chosen randomly. This kind of experimental design will later also be found in the mice data.

For a one-way ANOVA the model equations looked like

$$ X_{ij} = \mu_i + \epsilon_{ij} $$

with µ_i being the group mean for the i-th treatment and ε_ij ∼ N(0, σ²). For the two-way ANOVA we take the following model:

$$ X_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, $$

where µ is the overall mean, α_i is the i-th treatment effect, β_j the effect of the j-th surrounding and ε_ij ∼ N(0, σ²) as above. As before, i runs from 1 to r, but now j runs from 1 to n, where n denotes the number of surroundings (blocks), so that there is one observation per combination of treatment and surrounding. Note that

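$$ \sum_i \alpha_i = 0, \qquad \sum_j \beta_j = 0, $$

since there is now a kind of intercept in the form of the overall mean µ. We take the algebraic identity

$$ X_{ij} - X_{\bullet\bullet} = (X_{ij} - X_{i\bullet} - X_{\bullet j} + X_{\bullet\bullet}) + (X_{i\bullet} - X_{\bullet\bullet}) + (X_{\bullet j} - X_{\bullet\bullet}). $$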

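We square and add up. One can check that the cross terms on the right-hand side vanish to leave only the squares:

$$ \sum_i \sum_j (X_{ij} - X_{\bullet\bullet})^2 = \sum_i \sum_j (X_{ij} - X_{i\bullet} - X_{\bullet j} + X_{\bullet\bullet})^2 + n \sum_i (X_{i\bullet} - X_{\bullet\bullet})^2 + r \sum_j (X_{\bullet j} - X_{\bullet\bullet})^2. $$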

This can be written as

SS = SSE + SST + SSB.

Here the new term SSB is introduced for the sum of squares for blocks. Analogously to the one-way case, this can be shown to be an independent sum of χ²-distributed random variables. The total sample size is now nr, so SS has (nr − 1) degrees of freedom. As before, SST and SSB have (r − 1) and (n − 1) degrees of freedom respectively, and SSE has (nr − 1) − (r − 1) − (n − 1) = (n − 1)(r − 1), as required by the subtraction property of the χ²-distribution.

We now have two F-statistics F_T and F_B:

$$ F_T = \frac{MST}{MSE}, \qquad F_B = \frac{MSB}{MSE}. $$

So, analogously to the one-way case, we can now get p-values for a treatment effect and a surrounding effect from the Fisher distribution.
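The two-way decomposition and the two F-statistics can be sketched in the same way. The code below assumes one observation per combination of treatment and block, as in the model above; the function name `two_way_anova` and the example table are hypothetical.

```python
def two_way_anova(table):
    """table[i][j] is the observation for treatment i and block j."""
    r = len(table)       # number of treatments
    n = len(table[0])    # number of blocks
    grand = sum(sum(row) for row in table) / (r * n)
    row_means = [sum(row) / n for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(n)]
    ss = sum((table[i][j] - grand) ** 2 for i in range(r) for j in range(n))
    sst = n * sum((m - grand) ** 2 for m in row_means)
    ssb = r * sum((m - grand) ** 2 for m in col_means)
    sse = sum(
        (table[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(r)
        for j in range(n)
    )
    mse = sse / ((r - 1) * (n - 1))
    return {
        "SS": ss, "SSE": sse, "SST": sst, "SSB": ssb,
        "F_T": (sst / (r - 1)) / mse,
        "F_B": (ssb / (n - 1)) / mse,
    }


# Made-up 3 x 3 layout: rows are treatments, columns are blocks.
table = [[10.0, 12.0, 11.0], [14.0, 15.0, 16.0], [9.0, 13.0, 11.0]]
result = two_way_anova(table)
print(result["SS"], result["SSE"] + result["SST"] + result["SSB"])
```

The two printed numbers agree up to rounding error, which mirrors SS = SSE + SST + SSB.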

The concept of adding another factor to the one-way model can be extended in exactly the same way to any number of factors one would like to include in a model. For our mice data we would like to have four different factors: the age of the mouse, its diet, its activity and the tissue the sample comes from.

2.3 Interactions

If one has an experiment with several factors, one could think of interactions between some factors. For example, a certain treatment could have an effect only in a specific surrounding. To model this extension of an ANOVA, one adds an interaction term to the model equations:

$$ X_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ij}. $$

The parameters are the same as before, but now the interaction term γ_ij is present. There is one value of γ_ij for every combination of treatment and surrounding, i.e. nr values in total. The interaction term acts like a new factor added to the model. To avoid overfitting the data, one should not include interaction terms without reason. But if the simple model cannot explain enough of the data, it can be useful to check whether a much better result is reached by adding one or more interactions between factors.


Chapter 3 Experiment setup

3.1 Experiment setup

The experiment was done by an ageing research team at the UMCG hospital in Groningen, the Netherlands. During the time the data were collected, 97 mice were kept under different conditions. The mice were killed at different ages, and samples were taken from three different places of the body. The goal was to gain information about how the different living conditions of the mice influence the ageing process.

The different living conditions of the mice were:

• Food: the mice either got a high fat (HF) or a low fat (LF) diet.

• Activity: the mice could eat as much as they wanted (AL for ad libitum), or they could eat as much as they wanted and had a running wheel (RW), or they had a restricted amount of food (CR for calorie restricted).

The measurements were taken from the following three tissues:

• heart,

• liver,

• skeletal muscle.

The mice were randomly separated into two batches. For each tissue, the samples of each batch were randomly distributed over 4-5 gels for measurement. The table below shows the experimental design for Batch 1 and the liver and skeletal muscle tissues. So the data come from a fully randomized setup suitable for a multi-way ANOVA for the factors age, diet, activity, batch and gel.

To take the measurements, mass spectrometry has been used to get peptide concentrations. Only the peptides that were included in the data set in every gel have been used for the analysis. This results in 78 peptides per tissue and mouse.

So in total there are 78 × 3 × 97 = 22698 data points taken into account for the ANOVA.


Mouse ID   Age   Experimental group   Gel liver   Gel skeletal muscle
6.84       6     HF CR                1           3
6.83       6     HF CR                1           4
6.68       6     LF CR                4           4
6.66       6     LF CR                4           3
6.5        6     LF RW                2           3
6.49       6     LF RW                4           1
6.4        6     LF AL                1           4
6.34       6     HF AL                3           1
6.33       6     HF AL                3           1
6.3        6     LF AL                1           1
6.19       6     HF RW                1           3
6.17       6     HF RW                4           2
12.82      12    HF CR                3           4
12.81      12    HF CR                3           2
12.62      12    HF AL                4           1
12.61      12    HF AL                2           4
12.43      12    LF CR                1           2
12.41      12    LF CR                1           4
12.3       12    HF RW                3           3
12.22      12    LF AL                4           1
12.21      12    LF AL                2           1
12.2       12    HF RW                2           3
12.102     12    LF RW                2           3
12.101     12    LF RW                1           2
18.93      18    HF RW                2           3
18.92      18    HF RW                1           1
18.62      18    LF RW                3           1
18.61      18    LF RW                4           1
18.5       18    HF AL                2           4
18.33      18    HF CR                3           2
18.31      18    HF CR                3           3
18.3       18    HF AL                3           2
18.152     18    LF CR                4           3
18.151     18    LF CR                3           4
18.123     18    LF AL                1           2
18.121     18    LF AL                3           1
24.82      24    LF CR                3           2
24.81      24    LF CR                2           2
24.43      24    LF AL                1           1
24.41      24    LF AL                1           4
24.2       24    HF AL                2           3
24.1       24    HF AL                4           3
24.202     24    LF RW                3           2
24.201     24    LF RW                2           1
24.171     24    HF RW                1           4
24.161     24    HF RW                1           1
24.122     24    HF CR                4           3
24.121     24    HF CR                3           3


3.2 Data preparation and approach

The whole data analysis has been done in MATLAB, so as a first step all the data needed to be imported from Excel. Due to the randomized setup this had to be done very carefully, to label each measurement with the right mouse-ID and its experimental group.

For every sample two replicates have been measured because of the large measurement variance. In theory these measurements should deliver exactly the same result, namely the true peptide concentration. So for the ANOVA it does not make sense to treat them as independent observations; instead, the two replicates are averaged to get closer to the true value.

In fact the amount of data was bigger than the 22698 points used for the analysis. Every point is the average of two replicates. Additionally, there were the "incomplete" peptides mentioned above, which had not been measured in all the gels.

There are 78 "complete" peptides coding for 51 proteins. Every single peptide codes for one protein only, so no peptide counts more than once.

The protein concentrations might allow conclusions about how active certain metabolic mechanisms are, and therefore about the stage of the ageing process. So we would like to know if and how the different living conditions influence the protein concentrations.

The idea now is to perform one ANOVA for each protein, to get p-values for all the factors. This makes it possible to conclude whether changing a specific living condition (including age, which is not really a condition) has an effect on the resulting protein concentration.

For 27 of the 51 proteins there are two measured peptides coding for the same protein. The underlying biology says that the concentrations of two such peptides are the same. So to visualize the data, it makes sense to take the average of the two measured peptide concentrations to get closer to the true value.

The first question that might arise is whether the concentration of a protein changes at all with the ageing of a mouse. If this is not the case, then the protein is not interesting for the question asked here. On the other hand, it can be interesting for other questions, for example under which conditions one can increase the concentration of a certain protein, and therefore maybe the activity of a metabolic mechanism. So we would like to make a ranking of the proteins, sorted by the significance of the factor age.
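Such a ranking is a simple sort once the p-values are available. The sketch below uses placeholder protein names and invented p-values purely for illustration; none of these numbers come from the mice data, and the 0.05 threshold is the one used later for the heatmap.

```python
# Hypothetical p-values per protein and factor (placeholders, not real results).
p_values = {
    "protein_A": {"age": 0.20, "diet": 0.01, "activity": 0.40},
    "protein_B": {"age": 0.0004, "diet": 0.30, "activity": 0.02},
    "protein_C": {"age": 0.03, "diet": 0.60, "activity": 0.70},
}

# Sort the proteins by the p-value of the factor age, most significant first.
ranking = sorted(p_values, key=lambda name: p_values[name]["age"])

# Keep only the proteins whose age effect is significant at the 5% level.
significant_for_age = [name for name in ranking if p_values[name]["age"] < 0.05]

print(ranking)
print(significant_for_age)
```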

In the end we want to get p-values for all the biological factors, i.e. the different living conditions of the mice. We are not interested in any effect of the non-biological factors "gel" and "batch". To avoid overfitting due to these two factors, their effects are filtered out of the data before the analysis. In theory the effect of "gel" and "batch" should of course be zero, since by construction the only differences in the lives of the mice should be the biological factors.

To filter out the effect of "gel" and "batch", every measurement gets the mean of its gel subtracted and the overall mean of its tissue added:

$$ X_{\mathrm{new}} = X_{\mathrm{old}} - \mathrm{gel}(X_{\mathrm{old}}) + \mathrm{tissue}(X_{\mathrm{old}}), $$

where gel(X_old) denotes the mean of all measurements on the same gel as X_old and tissue(X_old) the overall mean of the same tissue.

The adding of the overall mean of the tissue ensures, that the difference induced by the factor tissue is still in the data set. Without it, the data would be centered around zero, leading to the conclusion, that the concentration of proteins in different parts of the body of a mouse is the same, which is clearly wrong.
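The correction can be sketched as follows. This is a minimal illustration for the measurements of a single peptide, not the MATLAB code used for the thesis; it assumes that gels are numbered separately within each tissue, and the function name `remove_gel_effect`, the record layout and the numbers are hypothetical.

```python
from collections import defaultdict


def remove_gel_effect(measurements):
    """Return corrected values: value - gel mean + tissue mean."""
    by_tissue = defaultdict(list)
    by_gel = defaultdict(list)
    for m in measurements:
        by_tissue[m["tissue"]].append(m["value"])
        # Assumption: a gel is identified by the pair (tissue, gel number).
        by_gel[(m["tissue"], m["gel"])].append(m["value"])
    tissue_mean = {k: sum(v) / len(v) for k, v in by_tissue.items()}
    gel_mean = {k: sum(v) / len(v) for k, v in by_gel.items()}
    return [
        m["value"] - gel_mean[(m["tissue"], m["gel"])] + tissue_mean[m["tissue"]]
        for m in measurements
    ]


# Made-up example: two liver gels with clearly different levels.
measurements = [
    {"tissue": "liver", "gel": 1, "value": 10.0},
    {"tissue": "liver", "gel": 1, "value": 12.0},
    {"tissue": "liver", "gel": 2, "value": 20.0},
    {"tissue": "liver", "gel": 2, "value": 22.0},
]
corrected = remove_gel_effect(measurements)
print(corrected)  # [15.0, 17.0, 15.0, 17.0]
```

After the correction both gels have the same mean, namely the tissue mean of 16, while the spread within each gel is unchanged.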


Results and discussion

4.1 Model assumptions for the ANOVA

As described in chapter 2, the ANOVA uses the following model:

$$ X_{ijklm} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_l + \kappa_m + \epsilon_{ijklm}, $$

where α_i stands for the i-th age group, β_j for the j-th diet, γ_k for the k-th activity, δ_l for the l-th tissue and κ_m for the m-th peptide. The residuals ε_ijklm are assumed to be normally distributed around zero. So to see if this model makes sense, one has to check the normality of the residuals.

This check can be done and illustrated with so-called QQ-plots (quantile-quantile plots). Here the residuals are sorted and then plotted against the theoretical quantiles of a normal distribution. If the model assumptions are true, i.e. if the residuals are normally distributed, then the points in the plot will form a straight line. The residuals can also be standardised: let ε_ijklm ∼ N(µ, σ²), then

$$ Z := \frac{\epsilon_{ijklm} - \mu}{\sigma} \sim N(0, 1) $$

is standard normally distributed. Here µ and σ² are estimated by the mean and the sample variance of the ε_ijklm respectively. The QQ-plot for Z is called a standard QQ-plot for ε_ijklm. For normal ε_ijklm the points in the standard QQ-plot will lie on the line

x = y.

So one can see in the QQ-plot, whether the residuals are normally distributed or not, to justify the results of the ANOVA. This is a general procedure applicable to all similar settings.
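The coordinates of a standard QQ-plot can be computed with the Python standard library. The sketch below is an illustration under stated assumptions: the plotting positions (i − 0.5)/n are one common convention and not necessarily the one used for the figures in this thesis, and the function names and residual values are made up.

```python
from statistics import NormalDist, mean, stdev


def standard_qq_points(residuals):
    """Return (theoretical quantile, standardised residual) pairs."""
    n = len(residuals)
    mu = mean(residuals)
    sigma = stdev(residuals)  # square root of the sample variance
    z = sorted((e - mu) / sigma for e in residuals)
    # Assumed plotting positions (i - 0.5) / n for the normal quantiles.
    q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(q, z))


def pearson(points):
    """Pearson correlation between the two coordinates of the points."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up residuals for illustration.
residuals = [-1.8, 0.4, 1.1, -0.3, 0.2, -0.9, 2.0, 0.6, -0.5, -0.8]
points = standard_qq_points(residuals)
print(pearson(points))  # close to 1 when the points lie near the line x = y
```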

In the ageing data set of the mice there are two kinds of proteins: those that are coded by only one peptide, and those that are coded by two. If there were two peptides, the two sets of measurements have been treated separately, and therefore they have been plotted as two distinct sets of points in the QQ-plots as well.

Figure 4.1: Standard QQ-plot for Cpt2

Figure 4.1 shows the standard QQ-plot for the protein Cpt2. The sorted residuals of Cpt2 lie nicely on the straight line x = y. This results in a high correlation between the residuals and the normal quantiles. The three biggest residuals might not fit the model very well, but overall the ANOVA model is well capable of explaining the measured data for Cpt2. Since this is not an artificial data set, one would not expect a perfectly straight line, and a result like this definitely justifies the ANOVA model used.

In Figure 4.2 the standard QQ-plot for Pdk1 is shown. This protein is coded by two different peptides, namely AVPLAGFGYGLPISR (blue) and EISLLPDNLLR (red). Clearly the red peptide shows very nice results, again justifying the ANOVA model. For the blue peptide the values do not lie on x = y, due to big residuals at the negative and positive ends of the value spectrum. But also for the blue peptide most of the values lie on a straight line, which again indicates a normal distribution for nearly the whole measured set. So in this case, too, the ANOVA model can be justified by means of a QQ-plot.

There are also some proteins that show an irregular picture in the corresponding QQ-plot. An example of this is shown in Figure 4.3. Here the two curves of the peptides coding for the protein Cs have been drawn. The blue peptide (ALGVLAQLIWSR) shows a peak in the middle of the measured data, supported by a lot of points. This indicates that the residuals are clearly not normally distributed and therefore the model assumptions of the ANOVA are not satisfied, which means its results are not really justified.

Figure 4.2: Standard QQ-plot for Pdk1

So before interpreting the results of any of the ANOVAs, one should check whether its results can be justified by checking the model assumptions. One way to do this is to look at the QQ-plot of the residuals.

For the mice data set, 8 out of the 51 proteins show at least somewhat irregular QQ-plots. These are the following:

• Cs

• Mdh2

• Ndufs1

• Slc26a10

• Slc25a3

• Slc25a4

• Slc25a5

• Suclg1

If one wants to explain the measured data of one of these proteins in particular, the model should probably be adapted to obtain better results. This has not been done here, since the required changes can differ from protein to protein and we only want to make statements about the whole data set.


Figure 4.3: Standard QQ-plot for Cs

4.2 Results of the ANOVA

In this section the results of the 51 independent multi-way ANOVAs are presented and visualised. To do so, a complete discussion is given for one protein, together with comments on the overall results. For every protein the ANOVA gives four or five p-values stating whether the corresponding factor was significant for the measurements. The proteins coded by a single peptide get a p-value for the factors diet, age, activity and tissue. For the proteins with two peptides an additional p-value indicates the significance of the peptide.

4.2.1 Overall results

A list with the complete results of the 51 ANOVAs is given in the appendix.

Figure 4.4 shows a heatmap of all the resulting p-values. In this heatmap a black box stands for a significant p-value (< 0.05) and a red box indicates non-significance. The shade of red shows how far the p-value lies above the threshold: the brighter the red, the closer the p-value is to 0.05. The green boxes in the row of the factor peptide correspond to the value −1, which has been assigned to all the proteins with only one peptide. So the green boxes just mean that the peptide factor does not exist in the corresponding columns.

Most obvious is that the factor tissue is significant for all the proteins. This makes immediate sense, since the cells from which the samples originate were taken from different organs and therefore had different tasks. For different tasks a cell will need different material, e.g. proteins, so this significance was expected.

Figure 4.4: Heatmap of the p-values from the ANOVA

The proteins in the map are sorted from left to right by the significance of their p-value for the factor age. About 70% of the protein concentrations are affected by the age of the mice. For these proteins one can then examine whether the concentration rises or falls over time. Knowing the roles of these proteins in the metabolism, the results can help to understand the process of ageing, or just what getting older means for the resulting concentrations. On the other hand, these proteins can be seen as an indicator of how far a mouse is in its ageing process, i.e. one could say that the age is partially characterised by the concentration of certain proteins. From this perspective the factors diet and activity become very interesting for the proteins with a significant age effect: one might then be able to influence the speed of ageing through external influences such as a certain diet or exercise.

In the heatmap one can see that the factors diet and activity are significant for about half of the proteins. The significant values of both factors seem to follow a random pattern that is not connected to the significance of the factor age.

However, for the last few proteins on the heatmap none of the three factors seems to be significant, so for these proteins the ANOVA does not deliver any results of value.

4.2.2 Results and visualisation for Hadhb

In this section the analysis of the protein Hadhb is interpreted. Hadhb is the protein with the third most significant age effect (see Figure 4.4). The factors diet, activity, tissue and peptide are significant as well. The protein is coded by two peptides, namely DQLLLGPTYATPK and LAAAFAVSR.

In Figure 4.5 the measurements for Hadhb are visualized. The samples are sorted by tissue, and the three different parts of the data are clearly visible: the first part corresponds to the samples from the skeletal muscle, the second to the heart and the last to the liver. It was mentioned before that in theory the two peptide concentrations measured for each mouse should be the same. In the graph one can see that this is not the case in practice, even though they are clearly correlated (with a coefficient of 0.85, as is also given in the plot). Therefore the average of the two concentrations is taken, to be closer to the real value. The measurements in the liver seem to be of very good quality, since here the measured concentrations for the two peptides are almost identical.

The variance in the heart measurements is higher than in the other two tissues. This is actually an overall observation. Since the ANOVA assumes the same variance in all groups, this leads to a deviation from normality in the residuals. But since the residuals overall show good normality, this effect is not too big. Figure 4.6 shows the standard QQ-plot for the protein. The residuals follow x = y quite reasonably, which justifies the ANOVA model for this protein. The big variance in the heart results in the deviating points at both ends of the sorted residuals.

Figure 4.5: Peptide concentrations for Hadhb

One can also see that there are some measurements that do not seem to follow the overall pattern. From a modelling point of view these are clearly outliers, and they exist for almost all the proteins. It turned out that it is not easy to find a reason to remove those measurements from the data, because no clear biological reason for the extreme values could be found. During the analysis Karin Wolters tried to find the reasons for some extreme results. It turned out that looking into this is an immense amount of work, and that it is not really worth it as long as we are only interested in overall results. If the interest should focus on a specific protein, then it is necessary to find reasons for the extreme values to be able to exclude them for better results.

In Figure 4.7 a box-plot of the residuals of the protein is shown. In a perfect model these boxes are centered at zero, they all have the same size and the lines outside the boxes (the whiskers) are of equal length. The length of these lines indicates the spread of the data outside the box. As mentioned above, the variance in the measurements from the heart is bigger than in the other two tissues. The box-plot makes it possible to visualize and quantify this difference. It also helps to see extreme outliers. For a data set of this size some marked outliers can occur and are not really worrying. But in this example at least two measurements clearly qualify as outliers that should be looked at more closely and, if possible, removed from the data set (the highest value from liver and the lowest from heart).

Figure 4.6: Standard QQ-plot for Hadhb

Factor/Level   Effect
Intercept       95.27
LF              -4.58
HF               4.58
6 months       -10.07
12 months       -7.93
18 months        0.96
24 months       17.04
RW               4.26
AL               2.16
CR              -6.42


Figure 4.7: Residual box-plot of the factor Tissue for Hadhb


As mentioned above, the protein concentration of Hadhb shows significance for the factors age, diet and activity. In the table above the coefficients of the ANOVA model are listed. The first observation is that the concentration of Hadhb rises with the age of the mouse. Turning the argument around, one could say that the higher the concentration of Hadhb, the further the ageing process has progressed.

From this point of view it would be interesting to slow the ageing process down by adjusting the factors diet and activity accordingly. So for Hadhb one should give the mouse a low-fat diet and a restricted amount of food, with neither a running wheel nor unlimited food. This follows from the fact that the effects of the levels LF and CR are negative.
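The effects in the table can be checked against the sum-to-zero constraints of the ANOVA model, and the levels with the most negative effect can be read off programmatically. The sketch below only restates the numbers from the table; the dictionary layout is an assumption made for illustration.

```python
# Effects for Hadhb as listed in the table (tissue and peptide effects are not listed).
effects = {
    "diet": {"LF": -4.58, "HF": 4.58},
    "age": {"6 months": -10.07, "12 months": -7.93, "18 months": 0.96, "24 months": 17.04},
    "activity": {"RW": 4.26, "AL": 2.16, "CR": -6.42},
}

# Within each factor the level effects should sum to zero.
sums = {factor: sum(levels.values()) for factor, levels in effects.items()}

# For the factors that can be influenced, pick the level with the lowest effect.
lowest = {
    factor: min(levels, key=levels.get)
    for factor, levels in effects.items()
    if factor != "age"
}

print(sums)    # all sums are zero up to rounding error
print(lowest)  # {'diet': 'LF', 'activity': 'CR'}
```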

So the results the ANOVA yields for Hadhb are just what one would hope for:

• The factor age is significant for the measurements, and they show a clear trend over time.

• The residuals are sufficiently normally distributed to justify the results of the ANOVA.

• The factors diet and activity are significant as well, which suggests that the protein concentration can be influenced externally.

4.2.3 Results without outliers - an example

The analysis of the data set has shown that almost all proteins have some outliers that do not seem to follow the pattern of the other data points. In this section an example of an extreme outlier is shown, and its negative influence on the outcome of the ANOVA is discussed.

First of all, figure 4.8 shows that the measurements for the protein Uqcrc2 contain an extreme outlier in the measurements of the peptide LSVTATR. The analysis including the outlier gives significant values only for the factors tissue and peptide. All the factors for which we would like to see significance get p-values that are clearly too high. Therefore Uqcrc2 lies in the last part of the heatmap (figure 4.4), where the ANOVA does not give any results of value. The outlier corresponds to the heart measurement of this peptide from mouse 6.36.

After first encountering such outliers, one would clearly like to remove them from the data set to obtain a more homogeneous set of measurements. But this cannot be done without a biological reason, e.g. sickness of the mouse. The search for a reason to remove a data point can be time-consuming. Therefore it has not been done yet.


Figure 4.8: Measurements for Uqcrc2 including the outlier

Figure 4.9 shows the measurements excluding the outlier. The first observation is that this plot contains more information due to the smaller scale. The results of the ANOVA have also changed massively. Without the outlier the analysis gives significance for all the factors except diet. So excluding the extreme measurement leads to better results.

This shows that removing an outlier can turn an ANOVA result of no value into a good, significant one.

It is important to find biological reasons to exclude measurements that do not fit the pattern of the overall data. Doing so can increase the value of the data, as measured by the results obtained from it.

Factor     p-value with outlier   p-value without outlier
Diet       0.24                   0.13
Age        0.31                   4.51E-06
Activity   0.53                   2.62E-06
Tissue     9.41E-05               3.19E-183
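The mechanism behind this change can be illustrated with a much simpler model than the multi-way ANOVA used in the thesis. The sketch below computes the F-statistic of a one-way ANOVA on synthetic numbers that are invented for illustration: a single extreme value inflates the within-group variance and thereby shrinks the F-statistic, which corresponds to a larger p-value. The function name `one_way_f` and all data values are assumptions.

```python
from statistics import mean


def one_way_f(groups):
    """F-statistic of a one-way ANOVA: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Synthetic concentrations for three age groups (not thesis data).
clean = [[10.0, 11.0, 9.0, 10.5], [12.0, 13.0, 12.5, 11.5], [14.0, 15.0, 14.5, 13.5]]
# Same data with one extreme measurement added to the first group.
with_outlier = [clean[0] + [60.0], clean[1], clean[2]]

f_clean = one_way_f(clean)
f_with = one_way_f(with_outlier)
print(f"F without outlier: {f_clean:.2f}, F with outlier: {f_with:.2f}")
```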


Figure 4.9: Measurements for Uqcrc2 excluding the outlier

4.3 Pyruvate oxidation in mitochondria

Besides the influence of the living conditions of the mice on the protein concentrations, another question came up while working on the data. Jolita Ciapaite measured the pyruvate oxidation in mitochondria for the skeletal muscle samples in the data set. The process in which this oxidation takes place is shown schematically in figure 4.10. It is a biological chain of reactions that transforms ADP into ATP. ATP is an energy carrier that can be used directly by the body of the mouse. The activity of this reaction chain can be measured by the amount of O2 consumed. The O2-flux measurements therefore tell us how active the pathway was for every mouse, and one can again try to find a pattern that connects the measured O2-flux with the living conditions of the mice.

In the scheme the proteins are shown in orange. The concentrations of all the proteins that participate in the reaction chain have been measured. The goal of this part of the analysis was to find the regulating proteins in the energy production chain. So we are looking for protein concentrations that show the same pattern as the O2-flux across the different conditions.

As before, the different living conditions are assumed to have an effect on the protein concentrations, as the ANOVA has shown.


Figure 4.10: Metabolic pathway of pyruvate oxidation in mitochondria

In figure 4.11 the measured O2-flux is shown split up by the conditions. Every combination of the factors diet and activity gets a separate plot, and the factor age is given on the x-axis to provide a kind of time evolution for the activity of the reaction chain. It is of course not a true time axis, since the measurements all come from different mice. Each plotted point is the mean value over all the mice that were held under the same conditions and killed at the same age. The standard deviation of the measurements is drawn as a vertical bar around every point.

The general trend is that the chain gets less active over time. The mice that had a running wheel seem to follow a different pattern than the ones that lacked one. For these mice the activity of the reaction chain stays more or less constant until the age of 18 months. After that the measured values decrease, as they do for the other activities.

There are 96 skeletal muscle samples. With 2 diets, 3 activities and 4 ages this gives 96/(2 × 3 × 4) = 4 mice per combination of conditions. So each point in figure 4.11 is based on only 4 mice. This is a very small number, which leads to the large standard deviations shown in the plots.
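The grouping behind such a plot can be sketched as follows. The snippet builds a balanced design with the factor levels named in the text and computes the mean and standard deviation per group. The flux values are synthetic placeholders and the function name `summarise` is hypothetical; the thesis code itself is written in MATLAB and is not reproduced here.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, stdev

DIETS = ("LF", "HF")
ACTIVITIES = ("RW", "AL", "CR")
AGES = (6, 12, 18, 24)  # months
REPLICATES = 96 // (len(DIETS) * len(ACTIVITIES) * len(AGES))  # 4 mice per group


def summarise(samples):
    """Map (diet, activity, age) to (mean, standard deviation, group size)."""
    groups = defaultdict(list)
    for diet, activity, age, flux in samples:
        groups[(diet, activity, age)].append(flux)
    return {key: (mean(v), stdev(v), len(v)) for key, v in groups.items()}


# Synthetic O2-flux values, invented for illustration only (not thesis data).
samples = [
    (d, a, t, 100.0 - t + r)
    for d, a, t in product(DIETS, ACTIVITIES, AGES)
    for r in range(REPLICATES)
]
summary = summarise(samples)
print(len(samples), len(summary), summary[("LF", "RW", 6)])
```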

Figure 4.11: Measured O2-flux split up by living conditions of the mice

The same kind of plot has been made for the measurements of all the peptides. An example of the resulting plots is given in figure 4.12. This is the plot for the peptide VAVLGASGGIGQPLSLLLK, which belongs to the protein Mdh2. Already from the plots one can see that the standard deviation bars overlap strongly, which means that the drawn pattern carries little information. This can of course also be shown mathematically.

To explain the measured O2-flux directly with the peptide concentrations, we use linear regression. This makes sense because a protein is needed once in every step of the reaction chain. For every group of 4 mice that were held under the same conditions and killed at the same age, a simple linear regression has been performed between the O2-flux measurements and the corresponding peptide concentrations. For every such group this leads to a p-value p_dat, which stands for the significance of the linear model within that group. To give one p-value p_da per combination of diet and activity, 4 p-values have to be "averaged". To do so, the p-values are transformed with the inverse normal cumulative distribution function, averaged, and transformed back, i.e.

\[
p_{da} = \Phi\left( \frac{1}{4} \sum_{t} \Phi^{-1}(p_{dat}) \right).
\]

Here Φ is the cumulative distribution function of the standard normal distribution, and the sum over t runs over the four levels of the factor age. The same idea is used to obtain a final p-value for each peptide, which checks whether the peptide shows the same pattern as the O2-flux:

\[
p_{\text{peptide}} = \Phi\left( \frac{1}{6} \sum_{d,a} \Phi^{-1}(p_{da}) \right).
\]

Figure 4.12: Peptide concentration for VAVLGASGGIGQPLSLLLK split up by living conditions of the mice
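The two-stage combination of p-values described here can be sketched in a few lines of Python. The snippet follows the formulas as stated, with a plain average of the transformed values (division by 4 and by 6). This differs from Stouffer's method, which divides the sum by the square root of the number of p-values. The function names and the input values are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution: mean 0, sd 1


def combine_p_values(p_values):
    """Phi( mean of Phi^-1(p) ), as in the formulas for p_da and p_peptide.

    Every p-value must lie strictly between 0 and 1.
    """
    z_scores = [STD_NORMAL.inv_cdf(p) for p in p_values]
    return STD_NORMAL.cdf(sum(z_scores) / len(z_scores))


def peptide_p_value(p_dat):
    """p_dat maps (diet, activity) to the list of four per-age p-values."""
    p_da = {key: combine_p_values(ps) for key, ps in p_dat.items()}
    return p_da, combine_p_values(list(p_da.values()))


# Synthetic per-group p-values, invented for illustration (not thesis results).
p_dat = {
    (d, a): [0.30, 0.55, 0.70, 0.45]
    for d in ("LF", "HF")
    for a in ("RW", "AL", "CR")
}
p_da, p_peptide = peptide_p_value(p_dat)
print(round(p_peptide, 3))
```

A useful property of this transformation is that identical inputs are reproduced: combining four p-values of 0.5 gives 0.5, and combining 0.2 with 0.8 also gives 0.5 by symmetry of the normal distribution.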

A lot of peptides show patterns in the produced plots, but because the standard deviations are very high, these patterns might just occur randomly. To get more results from this analysis, the number of data points should be increased. To get consistent results from so little data, the variance would have to be very small, which is not the case.

Unfortunately none of the resulting p-values for the peptides is significant. A complete list of the resulting p-values is given in the appendix.
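How little four points per group can show is seen from a small calculation. The thesis does not state which test produced the per-group p-values; assuming the usual two-sided t-test for the slope of a simple linear regression, the test statistic for n = 4 points has 2 degrees of freedom, and the p-value then reduces to 1 - |r|, where r is the sample correlation. A group is significant at the 5% level only if |r| exceeds 0.95. The sketch below checks this identity numerically; the function names and data are illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Sample correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope_p_value_n4(x, y):
    """Two-sided p-value of the regression slope for exactly 4 points: 1 - |r|."""
    if len(x) != 4 or len(y) != 4:
        raise ValueError("this closed form only holds for n = 4")
    return 1.0 - abs(pearson_r(x, y))


def slope_p_value_via_t(x, y):
    """Same p-value via the t-statistic and the t CDF with 2 degrees of freedom.

    Uses F(t) = 1/2 + t / (2 * sqrt(2 + t^2)); requires |r| < 1.
    """
    r = pearson_r(x, y)
    t = r * sqrt(2.0) / sqrt(1.0 - r * r)
    return 1.0 - abs(t) / sqrt(2.0 + t * t)


# Synthetic example with r = 0.8: clearly correlated, yet p = 0.2.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
print(slope_p_value_n4(x, y), slope_p_value_via_t(x, y))
```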


Chapter 5 Conclusion

Since the ANOVA delivered good results, it proved to be the right choice for analysing the data set. For most of the proteins the residuals are sufficiently close to normally distributed to justify the results of the ANOVA. The ANOVA showed that most of the proteins depend significantly on the factor age, which was the main outcome we were looking for. It has also been shown that a lot of proteins depend significantly on the factor diet, the factor activity, or both, which suggests that the concentrations of these proteins can be influenced externally. These results allow conclusions about the ageing process of mice, and since a lot of these proteins are found not only in mice but also in humans, they might also be meaningful for human ageing.

We have found the data set to be informative and have given an overview of the information it contains. This is a good basis for further work on the data set.

It became clear that the data contains numerous outliers. Some of these outliers are extreme, as shown in section 4.2.3. We have seen that the outliers can have a big negative influence on the results of the ANOVA. For future and maybe more specific work on single proteins, it is necessary to find biological reasons that justify removing these outliers. We have shown that the ANOVA can give much more satisfying results without the extreme values.

Even though the ANOVA has shown the measurements of the peptide concentrations to be very informative, the same can unfortunately not be said about the O2-flux measurements. Probably because the data set is too small and the variance therefore large, the linear regression did not show any significance. This means that we were unable to find the regulating proteins for the reaction chain in the mitochondria.

There have been numerous productive meetings between the team at the UMCG, Marco and me. These meetings led to a good collaboration, with satisfying results for all parties involved.

The complete MATLAB code and the resulting numbers and figures will be given to Peter to allow reproduction of the results. The team at the UMCG can then build on this work when they continue with a more specific analysis in the future.

The provided code and this thesis can be used to analyse similar data sets in the future, since the methods used here carry over to such data. The code is written in a fairly general way to allow automatic data preparation, ANOVA analysis and all kinds of visualisations of the data and the residuals. The import of the data from Excel is hard-coded, so this part has to be redone for other data sets, because they will not be in exactly the same form as the data set imported into MATLAB here.


Appendix: Lists of results

The following four pages contain a complete table with the results for all 51 ANOVA models. The table is large and may be hard to read, but its size made it difficult to present differently.

After that, the complete list of p-values for the analysis with the O2-flux is given.


ANOVA results

Protein    Intercept      p-Diet          LF          HF       p-Age           6          12          18          24  p-Activity          RW          AL          CR
Acaa2       1.43E+02    1.93E-05   -9.70E+00    9.70E+00    1.82E-09   -1.54E+01   -1.32E+01    6.60E+00    2.21E+01    1.73E-01    5.68E+00   -4.21E+00   -1.47E+00
Acadm       7.39E+01    1.01E-01   -1.92E+00    1.92E+00    9.40E-05   -8.42E+00   -1.21E+00    6.14E+00    3.49E+00    8.62E-01    7.86E-01   -7.59E-01   -2.77E-02
Acads       3.57E+01    4.32E-08   -1.71E+00    1.71E+00    8.12E-12   -3.53E+00   -6.77E-01    2.01E+00    2.20E+00    2.71E-11    2.58E+00    2.07E-01   -2.79E+00
Acadvl      5.08E+01    1.92E-11   -3.98E+00    3.98E+00    6.94E-02   -8.38E-02   -2.08E+00    2.28E+00   -1.09E-01    7.71E-02    1.23E+00   -1.82E+00    5.92E-01
Aco2        3.03E+02    9.04E-01    1.51E+00   -1.51E+00    9.18E-01    2.79E+00    9.36E+00   -1.46E+01    2.42E+00    1.54E-01    1.21E+01   -3.39E+01    2.18E+01
Atp5b       3.31E+02    2.50E-02   -7.10E+00    7.10E+00    6.12E-02   -9.54E+00   -6.06E+00    1.26E+01    3.01E+00    5.84E-06    2.20E+01   -9.09E+00   -1.29E+01
Cox5a       1.15E+02    2.66E-01   -2.55E+00    2.55E+00    2.17E-07    8.62E+00    7.68E+00    7.59E+00   -2.39E+01    4.34E-02    7.34E+00   -6.69E+00   -6.52E-01
Cpt1b       2.40E+01    1.69E-03   -1.01E+00    1.01E+00    3.42E-02   -9.09E-01    9.27E-01    9.59E-01   -9.77E-01    6.79E-06    1.70E+00    4.17E-01   -2.12E+00
Cpt2        2.54E+01    7.56E-07   -1.62E+00    1.62E+00    2.29E-02   -8.21E-01   -9.33E-01    1.53E+00    2.24E-01    3.31E-03    1.36E+00   -4.82E-02   -1.31E+00
Cs          2.14E+02    6.19E-02   -5.08E+00    5.08E+00    2.43E-04    2.93E-01    1.12E+01    8.16E+00   -1.97E+01    3.61E-01    5.27E+00   -3.87E+00   -1.40E+00
Cycs        2.55E+02    9.30E-03   -1.49E+01    1.49E+01    1.72E-06    2.78E+01    1.30E+01    1.30E+01   -5.38E+01    6.26E-01    7.67E+00   -5.01E+00   -2.67E+00
Decr1       4.60E+01    5.38E-08   -3.28E+00    3.28E+00    2.49E-08   -5.52E+00   -1.27E+00    2.58E+00    4.21E+00    9.62E-02    1.55E+00    1.44E-02   -1.57E+00
Dlat        4.98E+01    8.67E-01    1.29E-01   -1.29E-01    1.10E-01   -3.28E+00    8.92E-01    1.15E+00    1.23E+00    3.69E-06    4.37E+00    8.07E-01   -5.17E+00
Dld         5.43E+01    3.90E-01   -5.63E-01    5.63E-01    1.38E-05   -4.23E+00   -2.08E+00    2.10E+00    4.21E+00    7.92E-08    3.90E+00    1.27E+00   -5.17E+00
Dlst        3.87E+01    5.10E-01   -4.03E-01    4.03E-01    1.31E-02   -3.30E+00    3.25E-02    1.97E+00    1.30E+00    1.11E-05    3.12E+00    8.93E-01   -4.02E+00
Echs1       3.62E+01    1.94E-01   -6.63E-01    6.63E-01    2.77E-09   -4.00E+00   -2.65E+00    2.10E+00    4.56E+00    1.45E-02    2.03E+00   -6.00E-01   -1.43E+00
Eci1        2.30E+01    3.78E-04   -1.31E+00    1.31E+00    1.19E-06   -1.92E+00   -1.76E+00    5.20E-01    3.16E+00    8.75E-06    2.02E+00    3.52E-01   -2.37E+00
Etfa        9.43E+01    5.23E-04   -3.39E+00    3.39E+00    2.30E-12   -9.27E+00   -4.94E+00    4.14E+00    1.01E+01    2.63E-01    2.07E+00   -2.63E-01   -1.80E+00
Etfb        1.06E+02    3.04E-03   -5.94E+00    5.94E+00    3.53E-02   -6.04E+00   -4.70E+00    8.65E+00    2.09E+00    6.09E-02    5.89E+00   -5.66E+00   -2.35E-01
Etfdh       5.54E+01    1.79E-05   -3.44E+00    3.44E+00    6.42E-05   -4.25E+00    4.30E-01    5.90E+00   -2.08E+00    2.04E-02    2.95E+00   -5.78E-01   -2.37E+00
Fh          7.91E+01    1.43E-02   -2.35E+00    2.35E+00    1.42E-01   -3.49E+00   -1.11E-01    2.60E+00    9.99E-01    4.91E-02    2.48E+00    6.92E-01   -3.17E+00
Gpx4        6.62E+00    1.52E-01    9.34E-02   -9.34E-02    1.55E-17   -1.28E-01    4.78E-01    5.95E-01   -9.46E-01    8.59E-03    2.75E-01   -1.99E-01   -7.67E-02
Gsr         2.30E+00    6.33E-01    1.21E-02   -1.21E-02    4.79E-04   -7.06E-02   -8.78E-02   -2.67E-02    1.85E-01    6.27E-01    2.33E-02   -3.38E-02    1.05E-02
Hadh        7.60E+01    6.19E-06   -3.58E+00    3.58E+00    3.60E-13   -8.30E+00   -3.31E+00    3.32E+00    8.30E+00    4.69E-05    4.54E+00   -5.05E-01   -4.03E+00
Hadha       8.11E+01    1.02E-06   -4.83E+00    4.83E+00    9.78E-09   -5.59E+00   -5.47E+00    1.16E+00    9.90E+00    1.31E-03    4.42E+00   -6.52E-02   -4.36E+00



Protein     p-Tissue  skeletal muscle       heart       liver  Peptide 1               Peptide 2        p-Peptide    Effect 1    Effect 2
Acaa2      8.56E-129        -1.05E+02   -3.41E+01    1.39E+02  VGVPTETGALTLNR          -                 1.00E+00
Acadm       1.37E-73        -2.99E+01    4.27E+01   -1.28E+01  ANWYFLLAR               -                 1.00E+00
Acads      1.06E-145        -1.48E+01    3.14E+00    1.17E+01  ITEIYEGTSEIQR           LVIAGHLLR         4.47E-02   -6.19E-01    6.19E-01
Acadvl     1.79E-115        -4.83E+00    2.31E+01   -1.83E+01  FFEEVNDPAK              IFEGANDILR        6.33E-42   -8.58E+00    8.58E+00
Aco2        9.56E-43         5.89E+01    2.00E+02   -2.59E+02  NAVTQEFGPVPDTAR         VAGILTVK          8.74E-35    1.65E+02   -1.65E+02
Atp5b      7.29E-101         2.35E+01    9.05E+01   -1.14E+02  IPVGPETLGR              VVDLLAPYAK        9.48E-24   -3.32E+01    3.32E+01
Cox5a       2.39E-87         7.28E+01    2.01E+01   -9.29E+01  IIDAALR                 -                 1.00E+00
Cpt1b      1.04E-144         4.00E+00    1.86E+01   -2.26E+01  ALLHGNCYNR              -                 1.00E+00
Cpt2        3.25E-50        -8.47E+00    4.31E+00    4.16E+00  SEYNDQLTR               -                 1.00E+00
Cs         1.77E-191         1.04E+02    7.37E+01   -1.78E+02  ALGVLAQLIWSR            LVAQLYK           4.52E-69   -5.51E+01    5.51E+01
Cycs       3.76E-125         2.15E+02    1.46E+01   -2.29E+02  TGPNLHGLFGR             TGQAAGFSYTDANK    6.74E-33   -7.30E+01    7.30E+01
Decr1       3.94E-73        -1.57E+01    1.99E+01   -4.22E+00  VAFITGGGTGLGK           -                 1.00E+00
Dlat        3.57E-89         3.12E+01   -5.76E+00   -2.55E+01  ILVPEGTR                -                 1.00E+00
Dld         1.85E-28         8.65E-01    9.20E+00   -1.01E+01  ALTGGIAHLFK             VCHAHPTLSEAFR     2.84E-02   -1.44E+00    1.44E+00
Dlst        6.42E-18         3.70E+00    4.62E+00   -8.31E+00  GLVVPVIR                -                 1.00E+00
Echs1       4.09E-71        -1.66E+01    2.70E+00    1.39E+01  AQFGQPEILLGTIPGAGGTQR   -                 1.00E+00
Eci1        1.67E-97        -1.14E+01   -5.77E+00    1.71E+01  WLAIPDHSR               -                 1.00E+00
Etfa       1.55E-119        -4.09E+01    2.77E+01    1.32E+01  LLYDLADQLHAAVGASR       TIYAGNALCTVK      6.71E-92   -2.41E+01    2.41E+01
Etfb        5.24E-46        -3.07E+01    4.35E+01   -1.28E+01  AGDLGVDLTSK             VSVISVEEPPQR      7.24E-07   -1.00E+01    1.00E+01
Etfdh       2.67E-93        -2.28E+01    3.50E+01   -1.22E+01  NLSIYDGPEQR             -                 1.00E+00
Fh         2.64E-100         2.62E+01    2.04E+01   -4.66E+01  IYELAAGGTAVGTGLNTR      -                 1.00E+00
Gpx4       1.02E-157         3.14E+00   -1.29E-01   -3.01E+00  EFAAGYNVK               YAECGLR           4.14E-27   -7.41E-01    7.41E-01
Gsr         4.96E-87         4.73E-02   -9.53E-01    9.06E-01  LNTIYQNNLTK             -                 1.00E+00
Hadh       2.60E-121        -3.31E+01    9.32E+00    2.38E+01  LGAGYPMGPFELLDYVGLDTTK  LVEVIK            8.80E-01    1.18E-01   -1.18E-01
Hadha      9.34E-108        -3.27E+01    3.43E+01   -1.63E+00  DGPGFYTTR               THINYGVK          2.15E-32    1.24E+01   -1.24E+01
