Bayesian mixed effects models for microbiome data

Academic year: 2021


Bayesian mixed effects models for

microbiome data

Master of Medical Informatics

Master thesis by Alma Revers

Clinical Epidemiology, Biostatistics and Bioinformatics (KEBB)
Amsterdam University Medical Center (Amsterdam UMC)
University of Amsterdam (UvA)
August 2019


Master Thesis

Bayesian mixed effects models for microbiome data

Student:
A. Revers, BSc.
10338012, a.revers@amsterdam.umc.nl

Mentors:
Prof. dr. A.H. Zwinderman
Dr. X. Zhang

Tutor:
Dr. M.H. Hof

SRP duration:
December 2018 – August 2019

Location of Scientific Research Project:
Department of Clinical Epidemiology, Biostatistics and Bioinformatics (KEBB)
University of Amsterdam (UvA) - Amsterdam University Medical Center (Amsterdam UMC)
Meibergdreef 9, 1105 AZ Amsterdam


Preface

This thesis, "Bayesian mixed effects models for microbiome data", is my final assignment for the Master Medical Informatics at the University of Amsterdam. I carried out my scientific research project (SRP), of which this thesis is the result, at the department of Clinical Epidemiology, Biostatistics and Bioinformatics (KEBB) at the Amsterdam University Medical Center, location AMC. During my SRP I learned a lot, and I am excited about the next four years as a PhD student.

I want to thank my supervisors Koos Zwinderman and Xiang Zhang for their discussions, feedback and encouragement to think critically. Furthermore, I would like to thank Michel Hof for being my tutor and making sure the number of pages of this thesis stayed within an acceptable range. Also, thanks to Marit van Barreveld, Attila Csala and the rest of the KEBB for making sure I had a good time during my SRP.

Of course, I would like to thank my family and friends, especially Kyra and Martin for all the cups of coffee/tea we drank during this project, and Vincent for his feedback. Lastly, I want to thank my boyfriend Ruud for his endless amount of encouragement and feedback.


Abstract

The microorganisms living in our intestines influence our health, and diet is one of the lifestyle factors that influence the composition of the gut-microbiome. The gut-microbiome is sampled from faecal material. With 16S rRNA gene sequencing, the microorganisms are clustered into operational taxonomic units (OTUs). There are about 5,000 different OTUs in the gut, and for most samples the total number of OTU counts varies between 10,000 and 150,000, which leads to high dimensional data. In addition, OTUs have a hierarchical ordering, called their phylogenetic relationships. With existing methods it is not yet possible to test which OTU is influenced by which diet variable while taking these phylogenetic relationships into account. Therefore, we propose in this thesis a new method to test the association between OTUs and a diet variable, using Bayesian mixed effects models.

To specify the model, we first assumed that the relationships between every OTU and the same diet variable were comparable across all OTUs. Second, we assumed that the relationships between every OTU and a diet variable were comparable only for OTUs of the same family. Under the first assumption six models were developed: a Poisson, a (zero-inflated) negative binomial, a (beta-)binomial and a multinomial model. Under the second assumption three models were developed: a (zero-inflated) negative binomial and a beta-binomial model. The No-U-turn sampler algorithm as implemented in Stan was used for estimating the parameters.

Results showed that the negative binomial model was the best model in terms of false positive rate and precision. Furthermore, we found no increase in precision when including the phylogenetic relationships. A downside of our current method is that it is limited to a relatively small number of OTUs, as the fitting process of the current implementation is time-consuming. Further optimization is needed before our mixed effects model can be used with the full-size dataset.

We anticipate that with future optimization of this model, we can identify gut-microbiome - diet associations. Knowing these associations leads to a better understanding of how our diet influences our gut-microbiome, and therefore, our health. Based on this knowledge, alterations in dietary patterns can be made that could benefit our health.


Samenvatting

The gut flora consists of microorganisms that live in our intestines, mostly bacteria and fungi. The gut flora influences our health, and is in turn influenced by, among other things, our diet. The gut flora can be studied through faecal samples, which are analysed using 16S rRNA gene sequencing. This technique clusters the gut microorganisms into operational taxonomic units (OTUs). There are approximately 5,000 different kinds of OTUs present in our gut flora. The total number of OTUs counted in a sample differs per person, but usually lies between 10,000 and 150,000. An additional complication is that the OTUs are also related to one another; these are their phylogenetic relationships. With existing methods it is not yet possible to test which OTUs are influenced by which diet variable while taking the phylogenetic relationships of the OTUs into account. In this thesis a new method is developed to uncover the relationship between the gut OTUs and a diet variable, using a Bayesian mixed effects model.

To build the model, we first assumed that all relationships between all OTUs and the same diet variable are comparable. The second assumption was that only the OTUs within the same family have a comparable relationship with the same diet variable. Under the first assumption we built six models: a Poisson, a (zero-inflated) negative binomial, a (beta-)binomial and a multinomial model. Under the second assumption we built three models: a (zero-inflated) negative binomial and a beta-binomial model. The No-U-turn sampler algorithm as implemented in Stan was used for estimating the parameters.

Our result is that the negative binomial model is the best model, in terms of the number of false positives and precision. Furthermore, including the phylogenetic relationships does not lead to higher precision. A drawback of our method is that it is limited to a relatively small number of OTUs, because the estimation process is rather time-consuming. Further optimisation is needed before the method can be used on the full dataset.

We expect that with further optimisation of this method it will become possible to identify gut flora - diet relationships. Knowing these relationships, we will better understand how our diet influences our gut flora, and how our diet should be adjusted for better health.


Contents

1 Introduction
1.1 Microbiome
1.2 Operational taxonomic units (OTUs)
1.3 OTU as dependent variable
1.4 This thesis

2 Background on Bayesian statistics
2.1 Markov chain Monte Carlo
2.2 Stan

3 Model development (part 1): A mixed effects model without phylogenetic relationships
3.1 Introduction
3.2 Methods
3.3 Results
3.4 Discussion

4 Model development (part 2): A mixed effects model without phylogenetic relationships
4.1 Introduction
4.2 Methods
4.3 Results
4.4 Discussion

5 Model development (part 3): Including the phylogenetic relationships into a mixed effects model
5.1 Introduction
5.2 Methods
5.3 Results


6 Application on real data
6.1 Introduction
6.2 Method
6.3 Results
6.4 Discussion
7 Discussion
7.1 Main findings
7.2 Interpretation of results
7.3 Limitations
7.4 Comparison with literature
7.5 Implications
7.6 Future research
8 Conclusion
References
Appendices
A List of terms
B Graphical models
C Estimated slopes with different priors


Chapter 1

Introduction

1.1 Microbiome

We are not living alone. The human body contains many microbiomes. Microbiomes are collections of microorganisms living in and on our body: mostly bacteria and fungi, but also viruses and other microorganisms. The gut is one place in the human body with a microbiome. The gut-microbiome can be sampled from faecal material [1]. Every person has a different composition of the gut-microbiome: for example, one person has more of microorganism type A, while another person has more of microorganism type B. Recent studies have found that the gut-microbiome influences health, such as the chance of having obesity or inflammatory bowel diseases [2–5]. Changing the composition of the gut-microbiome could, therefore, benefit the health of patients suffering from those conditions. Diet is one of the lifestyle factors that influence the composition of the gut-microbiome [5].

In 2010 a new technique was introduced to measure and quantify the gut-microbiome: 16S ribosomal RNA (rRNA) gene sequencing. The 16S rRNA gene is present in most bacteria and can be efficiently isolated [6]. The technique consists of three steps: (1) all DNA in a faecal sample is isolated, (2) broken into fragments, and (3) the fragments specific to the 16S rRNA gene are sequenced. The sequences are subsequently mapped against known bacterial genome sequences in 16S rRNA databases. If a particular fragment matches a particular sequence in the database, the corresponding bacterium is identified. Techniques used before 2010 required multiplication of the microbiome before analysis, which is a time-consuming process. With the 16S rRNA gene sequencing technique, the faecal sample can be analysed directly.

With techniques such as 16S rRNA gene sequencing becoming more readily available, studies increasingly include microbiome data. However, the statistical methods for analysing microbiome data are mostly underdeveloped [7]. Therefore this thesis focuses on statistical methods for analysing microbiome data.



Table 1.1: Example of OTU counts

OTU 1 OTU 2 OTU 3 OTU 4 OTU 5 OTU 6

Sample 1 749 0 2025 6031 0 5802

Sample 2 0 3021 102 349 0 0

Sample 3 366 127 0 312 1720 107

Sample 4 196 1417 25 13067 1463 0

Sample 5 0 330 185 23 34 980

In this introduction, a description of gut-microbiome data is given first. Next, an introduction is given to current methods that can be used for analysing the effect of subject characteristics, such as diet, on the gut-microbiome. Afterwards, an explanation is given of why a new method is needed to analyse the effect of diet on the gut-microbiome.

1.2 Operational taxonomic units (OTUs)

Microorganisms in the faecal sample with less than 3% difference in the DNA sequence of the 16S rRNA gene are clustered into operational taxonomic units (OTUs) [7]. For simplicity's sake, we interpreted this as "one OTU represents one microorganism". There are approximately 5,000 different OTUs in gut-microbiome samples. The measurements of OTUs are counts: the number of times a specific part of the 16S rRNA gene is identified. Empirically observed total OTU counts in faecal samples vary between roughly 10,000 and 150,000 and are determined by the experiment. To test whether the observed ±5,000 different OTUs are related to subject characteristics such as age, sex, diet, and medication use, statistical models will contain a large number of parameters and require a very large number of subjects donating samples to estimate reliably.

To introduce a couple of terms, an example of OTU counts is given in table 1.1. The first term is the sequencing depth, sometimes called the library size: the total sum of all OTU counts of a sample. The sequencing depth of sample 3, for example, is 2632. The second term often used with OTU data is the abundance: the abundance of OTU 1 in sample 3 is 366. The relative abundance (RA) is the abundance divided by the sequencing depth, and the sum of all RAs of a subject is always 1. In this example, the RA of OTU 1 in sample 3 is 0.139.

• Sequencing depth of sample 3: 366 + 127 + 0 + 312 + 1720 + 107 = 2632

• Relative abundance of OTU 1 in sample 3: 366 / 2632 = 0.139

• Sum of the RAs of sample 3: 366/2632 + 127/2632 + 0/2632 + 312/2632 + 1720/2632 + 107/2632 = 1
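The calculations above can be sketched in a few lines of Python, using the counts of sample 3 from table 1.1:

```python
# Sequencing depth and relative abundance for sample 3 of table 1.1.
counts = {"OTU 1": 366, "OTU 2": 127, "OTU 3": 0,
          "OTU 4": 312, "OTU 5": 1720, "OTU 6": 107}

# Sequencing depth (library size): the total of all OTU counts in the sample.
depth = sum(counts.values())
print(depth)                             # 2632

# Relative abundance (RA): each count divided by the sequencing depth.
rel_abundance = {otu: c / depth for otu, c in counts.items()}
print(round(rel_abundance["OTU 1"], 3))  # 0.139
print(sum(rel_abundance.values()))       # 1 (up to floating-point rounding)
```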

As well as the high number of OTUs, the analysis of OTU counts is not straightforward because of several characteristics of the OTU data. Firstly, the counts of different OTUs vary between subjects and between OTUs, and also depend on the sequencing depth of the sample [7]. If a subject is measured to have only a few OTUs, this does not imply that the subject truly has a small number of OTUs; the sequencing merely revealed a small number of OTUs in the sample. So, the number of different OTUs in a faecal sample is not the same as the number of different OTUs in the subject. Likewise, the observed abundance of an OTU in the sample is not the same as the real abundance in the subject. This difference between the OTU abundance in the sample and the 'real' abundance in the subject has implications for the assumptions made about the OTUs. If the OTUs are summarised into relative abundances (RAs), which sum to 1, a small negative correlation is implied between the OTUs: if one OTU has a high RA, other OTUs must have a low RA, such that the sum of all RAs remains 1. However, this negative correlation is mainly a reflection of the data management and analysis and does not have to be real, because the observed counts of the OTUs represent a stochastic sample of the subject's whole gut-microbiome [8].

Secondly, OTU data are often overdispersed. Overdispersed data have a large variance compared to the mean: some subjects will have only a small count of an OTU, where others may have tens of thousands of counts of the same OTU. OTU data are often sparse too; many OTUs have a count of zero [9]. These zeros have two possible origins. They can be a true zero, meaning the subject does not have this OTU in the gut. Alternatively, they can be a "sampling zero", meaning the subject may have this OTU, but it was not counted in the sample.
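The overdispersion and sparsity described above can be illustrated with simulated counts; the parameters below are purely illustrative and not taken from real OTU data:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Poisson counts: the variance equals the mean by construction.
poisson = rng.poisson(lam=50, size=10_000)

# Negative binomial counts with a small dispersion parameter n:
# variance = mean + mean**2 / n, far exceeding the mean, and many
# draws are zero, mimicking typical OTU count data.
mean, n = 50.0, 0.3
negbin = rng.negative_binomial(n=n, p=n / (n + mean), size=10_000)

print(poisson.var() / poisson.mean())   # close to 1
print(negbin.var() / negbin.mean())     # far above 1: overdispersion
print((negbin == 0).mean())             # a sizeable fraction of zeros
```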

The last characteristic of the OTUs is that they are related to each other. They have a hierarchical ordering (species, genus, family, order, class, phylum, kingdom and domain), together called their phylogenetic relationships [7]. Phylogenetic relationships are a representation of microbiome evolution, based on differences in the DNA sequence. The phylogenetic tree is a representation of this hierarchical ordering (figure 1.1): the leaves of the phylogenetic tree are the OTUs, and the nodes are the common ancestors. Not every OTU has a full ordering; some OTUs have a family but no genus or species. The distance between two OTUs expresses how closely they are related: OTUs in the same family have a shorter distance than OTUs that are only in the same class but in a different order and family. The distances between OTUs are collected in a distance matrix [10].

In figure 1.1 the leaves are the OTUs and the nodes are the species, genus, family, order, class, etc. Not every OTU has a full hierarchical ordering: OTU 1 only has a family, class, phylum, kingdom and domain, but no genus or species. The distance between OTU 2 and OTU 3 is small, as they are in the same genus. In this example, the distance between OTU 1 and OTU 4 is large; they are only in the same order.
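As a toy illustration of these distances, a tree like the one in figure 1.1 can be encoded with parent links, counting the distance between two OTUs as the number of edges via their lowest common ancestor. The labels below are hypothetical, and real distance matrices are usually computed from branch lengths rather than edge counts:

```python
# Toy phylogenetic tree as child -> parent links (hypothetical labels,
# loosely following figure 1.1: OTU 1 lacks the genus and species levels).
parent = {
    "OTU 1": "family A",
    "OTU 2": "genus A", "OTU 3": "genus A",
    "genus A": "family A",
    "family A": "order A",
    "OTU 4": "family B",
    "family B": "order A",
}

def path_to_root(node):
    """All ancestors of a node, nearest first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges from a to b through their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors_a = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in ancestors_a:
            return ancestors_a[n] + j
    raise ValueError("nodes are not in the same tree")

print(tree_distance("OTU 2", "OTU 3"))  # 2: same genus
print(tree_distance("OTU 1", "OTU 4"))  # 4: only the order is shared
```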

Figure 1.1: An example of a phylogenetic tree

This thesis focuses on the effect of diet on the gut-microbiome. Diet-specific factors, for example the amount of red meat or the amount of trans fat in a subject's diet, may influence the gut-microbiome. It is safe to assume that the diet influences the gut-microbiome and not the other way around; the reverse would mean that, for example, the amount of coffee a person drinks is influenced by their gut-microbiome. The cause is more likely the coffee intake, and the consequence the microorganisms present in the gut. The OTU is therefore the dependent variable.

1.3 OTU as dependent variable

When analysing the effect of a continuous or discrete covariable on an OTU using regression analyses, research questions can be answered such as: what is the effect of age on a specific OTU? How does diet influence the OTUs? Is there a difference in the abundance of an OTU between diabetic and non-diabetic patients?

For this type of regression analysis, assumptions often have to be made about the distribution of the OTU counts. When estimating the association between a predictor (a diet variable) and an outcome such as OTU data, assuming that the OTU counts follow a distribution may increase the reliability of the estimated associations. The parameters of the distribution are modelled mathematically as a function of the predictor.

OTU data are count data. A distribution which can model counts is the Poisson distribution. However, the fit of a Poisson distribution on OTU data is often poor, because OTU data are overdispersed and the Poisson distribution is not flexible enough. An example of an OTU count distribution is given in figure 1.2. The Poisson distribution curve (red) does not fit the data well: the skewness is not modelled correctly (the red Poisson curve does not go beyond 20, whereas the data has counts of 40), and the mean count (around 10) is over-estimated. Therefore, an alternative distribution for modelling OTUs is the negative binomial distribution (blue curve), or the zero-inflated negative binomial distribution [11–13]. The negative binomial distribution, with its overdispersion parameter, fits better than the Poisson distribution.

Figure 1.2: Example of OTU counts fitted with a Poisson and a negative binomial distribution
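The better fit of the negative binomial can be checked on simulated counts. This is a rough sketch with hypothetical parameters: the thesis fits such models in Stan, whereas the snippet below simply compares log-likelihoods using moment estimates and SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Simulated overdispersed counts standing in for one OTU
# (hypothetical parameters, not fitted to real data).
y = rng.negative_binomial(n=2.0, p=2.0 / (2.0 + 12.0), size=500)  # mean ~12

# Poisson fit: the maximum likelihood rate is the sample mean.
loglik_pois = stats.poisson.logpmf(y, y.mean()).sum()

# Negative binomial fit via the method of moments:
# with mean m and variance v, size = m**2 / (v - m) and prob = m / v.
m, v = y.mean(), y.var()
loglik_nb = stats.nbinom.logpmf(y, m**2 / (v - m), m / v).sum()

print(loglik_pois < loglik_nb)  # True: the negative binomial fits better
```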

Sequencing depth can vary between subjects, leading to bias if it is not taken into consideration. This variation can be addressed using normalisation. After normalisation, the OTU data are no longer counts and will sometimes approximately follow a normal distribution. Usually there are, however, still many zeros, as normalisation does not change the number of zeros in the OTU counts. Normalised data can therefore be analysed with, for example, a (zero-inflated) Gaussian model [14].

Another solution to the varying sequencing depths is to use the RAs of the OTUs, which sum to 1 for every subject. This type of OTU data also has many zeros and can be analysed with, e.g., a zero-inflated beta distribution [15] or other models for outcomes between 0 and 1.

The main drawback of single-OTU analyses is that the correlation between the OTUs is not modelled. Multivariate analysis of multiple OTUs can be done using multinomial [16], Dirichlet-multinomial [17] or zero-inflated logistic-normal [18] models. Existing implementations for comparing groups of subjects are PERMANOVA [19] and a Dirichlet-multinomial model [20, 21].

When analysing multiple OTUs at once, their known phylogenetic relationships might be incorporated in the statistical analysis. Using those known relationships can increase precision. However, the phylogenetic relationships are not included in existing implementations of the methods mentioned above [16–18].


The phylogenetic relationships between OTUs can be viewed as a tree. A straightforward approach to incorporate them in a model is to view an internal node in the tree as a combination of all descending nodes [22]. Such a combination can be calculated simply by summing all descending nodes, or by other combination approaches. An analysis is then performed on this 'new' OTU with one of the methods mentioned above, or using scoring methods such as the quasi-conditional association test [23]. The main limitation of these approaches is that they only work if the OTUs that are combined are influenced by diet in the same manner. This is not always the case, and we therefore consider this approach unsuitable for our case.

1.4 This thesis

Diet is one of the lifestyle factors that influence the gut-microbiome [5]. However, with the current methods described above, the influence of diet on the OTUs cannot be appropriately tested while taking the phylogenetic relationships into account.

The basis of the new method is a mixed effects model. With OTU data, there are many different OTUs and not many samples. Furthermore, not every sample has all OTUs present, so sometimes there is very little data about one OTU. Estimating the effect of diet for every OTU independently is therefore not the best option, as it gives very noisy results. A better option is to pool the OTUs, borrowing strength from one OTU to another. A method for such pooling is a mixed effects model.
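The intuition behind pooling can be illustrated with a small simulation: per-"OTU" effects estimated from a few noisy observations each, with and without shrinkage towards the overall mean. The numbers are hypothetical, and for simplicity the known simulation variances are used as pooling weights (an empirical-Bayes-style shortcut, not the full mixed effects model of this chapter):

```python
import numpy as np

# Many "OTUs", each with only a few noisy observations of its true effect.
rng = np.random.default_rng(seed=42)
n_otus, n_obs = 200, 3
true_effects = rng.normal(loc=0.0, scale=1.0, size=n_otus)
data = true_effects[:, None] + rng.normal(scale=3.0, size=(n_otus, n_obs))

# No pooling: estimate each OTU's effect from its own observations only.
no_pool = data.mean(axis=1)

# Partial pooling: shrink each estimate towards the overall mean, weighting
# by the ratio of between-OTU and within-OTU variance.
sigma2_within = 3.0**2 / n_obs            # variance of each per-OTU mean
tau2_between = 1.0                        # spread of the true effects
weight = tau2_between / (tau2_between + sigma2_within)
pooled = weight * no_pool + (1 - weight) * data.mean()

mse = lambda est: np.mean((est - true_effects) ** 2)
print(mse(pooled) < mse(no_pool))         # True: pooling reduces the error
```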

As quality checks of our model, the false positive rate (FPR) and precision are used. The credibility intervals determine the precision: parameters with a smaller credibility interval are estimated with more certainty than parameters with a wider credibility interval.

1.4.1 Research questions and objective

The main research question of this thesis is:

How can gut-microbiome data be analysed using a mixed effects model taking the phylogenetic relationships into account?

With the subquestions:

• How can microbiome data be analysed using a mixed effects model without taking the phylogenetic relationships into account?

– What is the effect of the mixed effect model on computation time, precision and false positive rate?

• How can the phylogenetic relationships be taken into account?

– What is the effect of including the phylogenetic relationships in the model on computation time, precision and false positive rate?


• What is the effect of diet on the gut-microbiomes?

A new statistical method is developed within a Bayesian framework and tested with data from the Healthy Life in an Urban Setting (HELIUS) study [24]. We expect that our new approach will alleviate the high dimensional estimation problem and increase precision to identify gut-microbiome - diet associations.

1.4.2 Chapter overview

In chapter 2, a short introduction to Bayesian statistics is given. In chapters 3 and 4, multiple models are developed and tested on simulated and real data; the focus of those chapters is on computation time, precision and FPR. The models developed there do not yet include the phylogenetic relationships. In chapter 5, the best model of chapters 3 and 4 is extended to include the phylogenetic relationships, again focusing on computation time, precision and FPR. In chapter 6, the best method of chapter 5 is used to analyse our real data, focusing on the effects of diet on the gut-microbiome. Chapter 7 is the overall discussion, and chapter 8 the conclusion.


Chapter 2

Background on Bayesian statistics

Bayesian statistics is a field of statistics in which the aim is to update the parameters of probability distributions based on newly observed data. In the previous sentence, two main concepts of Bayesian statistics are mentioned: parameters and updating. First, every parameter has a probability distribution. A parameter is not observed; it needs to be estimated from the observed data. An example of a parameter is the effect of a diet variable on a single OTU. Every parameter has a 'degree of belief', or uncertainty, which is represented by its probability distribution [25].

The other concept is updating based on new data. In Bayesian statistics, there is a prior belief about the parameters, and this prior belief is updated using the data. This updating is also called Bayesian inference. Combining the data with the prior gives the posterior belief. This updating makes Bayesian statistics useful for high-dimensional problems such as the effect of diet on the gut-microbiome. In high-dimensional problems, there are often more parameters to be estimated than there are observations. Even then, the parameters can be updated: more observations are still preferable, but Bayesian inference is not restricted by their number.

A graphical representation of the prior, data and posterior is shown in figure 2.1. The red curve is the likelihood of the data, and the blue curve is the prior belief. Combining the two (i.e. combining the data and the prior) gives the posterior (black curve). The posterior is drawn towards the data, as the likelihood is more peaked than the prior and therefore has more influence on the posterior.

Another way of describing Bayesian statistics is:

P(θ | data) = P(data | θ) P(θ) / ∫ P(data | θ) P(θ) dθ ∝ P(data | θ) P(θ)    (2.1)

Here θ is the value(s) of the parameter(s). The prior probability distribution of the parameter is P(θ). The prior is combined with the likelihood of the data given the parameter, P(data|θ). This product is divided by a normalization constant: the integral of the likelihood of the data times the prior over the parameter space. The normalization constant is a constant (i.e. independent of θ), and therefore the posterior P(θ|data) is proportional (∝) to the likelihood of the data times the prior. This combination of the likelihood of the data with the prior is Bayes' rule, see equation 2.1.

Figure 2.1: Sketch of the probabilities of the likelihood of the data, prior and posterior distributions
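Equation 2.1 can be made concrete with a grid approximation for a toy one-parameter problem; the Beta prior and the binomial data below are hypothetical:

```python
import numpy as np

# Grid approximation of equation 2.1 for a toy problem:
# prior Beta(2, 2) on a proportion theta, and 7 successes in 10 binomial trials.
a, b = 2.0, 2.0
k, n = 7, 10

theta = np.linspace(0.001, 0.999, 999)
prior = theta**(a - 1) * (1 - theta)**(b - 1)    # P(theta), up to a constant
likelihood = theta**k * (1 - theta)**(n - k)     # P(data | theta), up to a constant
unnormalised = likelihood * prior                # numerator of equation 2.1

# Dividing by the (grid) normalization constant gives the posterior.
posterior = unnormalised / unnormalised.sum()

# The conjugate result is Beta(a + k, b + n - k), with mean (a + k)/(a + b + n).
print(round((theta * posterior).sum(), 3))       # 0.643, i.e. 9/14
```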

2.1 Markov chain Monte Carlo

Bayesian inference aims to calculate the posterior of a probability model. For a few simple models we can derive the posterior analytically (i.e. solve the integral involved in the posterior), but for most models (such as the ones in this thesis) another method for calculating the posterior is needed.

A family of methods for estimating the posterior is Markov chain Monte Carlo (MCMC). These methods estimate the posterior instead of solving it analytically, and with MCMC almost every model can be estimated. We will not go into detail about how MCMC works, but a couple of concepts need to be introduced for understanding the remainder of this thesis. An introduction to MCMC can be found in Ravenzwaaij et al. [26].

First, MCMC methods work with chains. A chain is a list of sets of estimated parameter values, whereby the next set of estimated parameter values depends on the previous one. With MCMC, multiple chains are run for a model. These chains have different starting points, but after a large number of iterations the chains should give the same sets of estimated values for the parameters.

The period before the chains agree is called the warmup or burn-in. Those iterations are not included in the results.
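The chain and warmup ideas can be illustrated with a minimal random-walk Metropolis sampler. This is a deliberately simple MCMC variant, not the sampler used in this thesis, and all settings below are illustrative:

```python
import numpy as np

# A minimal random-walk Metropolis sampler targeting a standard
# normal "posterior" (known here only up to a constant).
rng = np.random.default_rng(seed=3)

def log_posterior(theta):
    return -0.5 * theta**2        # standard normal, up to a constant

def run_chain(start, iters=5000, step=1.0):
    chain = np.empty(iters)
    theta = start
    for i in range(iters):
        proposal = theta + rng.normal(scale=step)   # small random step
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        chain[i] = theta
    return chain

# Two chains with different starting points; discard the warmup iterations.
chains = [run_chain(start)[1000:] for start in (-10.0, 10.0)]
for c in chains:
    print(round(c.mean(), 1), round(c.std(), 1))   # both near mean 0, sd 1
```

Despite the very different starting points, both chains end up describing the same distribution, which is exactly what the convergence checks below verify.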


Figure 2.2: An example of a trace plot with convergent chains

Figure 2.3: An example of a trace plot where the chains did not converge

The next concept is autocorrelation. Autocorrelation is the correlation between the sets of estimations of parameters in one chain. If this autocorrelation is high the results may not be optimal.

The last term is convergence. Convergence means that multiple chains of the MCMC have reached the same estimated values for the parameters. In this thesis, three methods are mainly used to check whether the MCMC chains have reached convergence: trace plots, the R̂ statistic and the effective sample size (ESS).

Inspecting trace plots is a visual method. All chains are plotted, with the iteration numbers after warmup on the x-axis and the estimated value of the plotted parameter on the y-axis. If the chains have converged, the estimated values of the parameter will overlap, as shown in figure 2.2. If there is no convergence yet, one or more chains will not overlap, as shown in figure 2.3.

The next method for checking convergence is the value of the R̂ statistic. R̂ is the ratio of the variance of the pooled estimated parameters across all chains versus the average variance of the estimated parameters within one chain, and is calculated per parameter. If the model has reached convergence, R̂ will be one: the variance within a chain will be the same as the pooled variance across chains. If the ratio is above one, convergence has not yet been reached.

Figure 2.4: Examples of leapfrog integrator trajectories

The last method for checking convergence is the ESS of a parameter. Ideally, every iteration results in a set of estimated parameters that is independent of the other sets. However, within MCMC chains this is often not the case; they often have autocorrelation. With a high autocorrelation between the sets of estimated parameters, the ESS will be close to zero. With a low correlation between the sets of estimated parameters, the ESS will be close to the number of iterations after warmup.
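Rough versions of R̂ and the ESS can be computed directly from stored chains. These are simplified sketches of the diagnostics described above, not the exact (rank-normalised) formulas implemented in Stan:

```python
import numpy as np

def r_hat(chains):
    """Potential scale reduction: pooled vs within-chain variance."""
    chains = np.asarray(chains)                         # shape (n_chains, n_iter)
    n = chains.shape[1]
    w = chains.var(axis=1, ddof=1).mean()               # within-chain variance
    b = n * chains.mean(axis=1).var(ddof=1)             # between-chain variance
    var_pooled = (n - 1) / n * w + b / n
    return np.sqrt(var_pooled / w)

def ess(chain):
    """Effective sample size from the autocorrelation of one chain."""
    chain = np.asarray(chain) - np.mean(chain)
    n = len(chain)
    acov = np.correlate(chain, chain, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for t in range(1, n):
        if rho[t] < 0.05:                # truncate once autocorrelation dies out
            break
        tau += 2.0 * rho[t]
    return n / tau

rng = np.random.default_rng(seed=0)
good = rng.normal(size=(4, 1000))        # four well-mixed, independent chains
bad = good + 5.0 * np.arange(4)[:, None] # chains stuck in different places

print(r_hat(good))                        # close to 1: converged
print(r_hat(bad) > 1.5)                   # True: far above 1, no convergence
print(ess(good[0]) > 500)                 # True: nearly independent draws
```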

2.1.1 Hamiltonian Monte Carlo

The variant of MCMC used in this thesis is Hamiltonian Monte Carlo (HMC). Again, this thesis will not go into detail about how HMC works and only introduces some concepts. An introduction about HMC can be found in Monnahan et al. [27].

The posterior is unknown, as it cannot be solved analytically. What the HMC algorithm does is analogous to walking in an unknown dark room trying to find a particular location. When walking around in an unknown dark room, most people take small steps, hoping not to bump into the furniture or walls. The HMC algorithm does the same, taking only small steps between sets of estimated values for the parameters, trying to find the optimal set of estimates. However, when there are only small changes between the sets of parameters, autocorrelation will be high. So, not every set of estimated values for the parameters needs to be considered: the HMC algorithm selects the sets to consider using the leapfrog integrator.

The series of sets of estimates of which only the last one is saved is called a trajectory. An example of the leapfrog integrator is given in figure 2.4. Two settings influence the trajectory. First, the step-size: with a small step-size, there is a small change between consecutive sets of parameter values. A smaller step-size leads to more calculations, but the outline is followed more precisely (the blue arrows). With a larger step-size, there are fewer calculations per trajectory (the red arrows), but the trajectory is also less precise. The other setting is the number of steps in one trajectory: the more steps, the more calculations and the larger the distance between the end point and the start point of the trajectory [27].

The step-size and the number of steps of the leapfrog integrator together make up the length of the trajectory. If this length is too long, the trajectory will make a circle and end up near its starting point. If the length is too short, the algorithm will not make an informed decision on the next step, making it a so-called random walk. There is a trade-off between the step-size and the number of steps of the leapfrog integrator versus the computation time of the HMC.
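The leapfrog trajectories described above can be sketched for the simplest possible case, a quadratic potential, where the exact trajectory in (position, momentum) space is a circle. This is a toy illustration, not Stan's implementation:

```python
# A bare-bones leapfrog integrator for the toy Hamiltonian
# H(q, p) = q**2/2 + p**2/2, whose exact trajectories are circles.
def leapfrog(q, p, step_size, n_steps):
    grad = lambda q: q                      # gradient of the potential q**2/2
    p = p - 0.5 * step_size * grad(q)       # initial half-step for the momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p               # full position step
        p = p - step_size * grad(q)         # full momentum step
    q = q + step_size * p
    p = p - 0.5 * step_size * grad(q)       # final half-step for the momentum
    return q, p

q0, p0 = 1.0, 0.0
energy = lambda q, p: 0.5 * q**2 + 0.5 * p**2

# A small step-size follows the circle accurately: energy is nearly conserved.
q1, p1 = leapfrog(q0, p0, step_size=0.01, n_steps=100)
print(round(energy(q1, p1), 3))            # 0.5, same as energy(q0, p0)

# A trajectory of roughly one full period loops back to its starting
# point: the wasted "circle" (U-turn) that NUTS is designed to avoid.
q2, p2 = leapfrog(q0, p0, step_size=0.1, n_steps=63)
print(abs(q2 - q0) < 0.1 and abs(p2 - p0) < 0.1)   # True: back near the start
```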

An implementation of HMC is the No-U-turn sampler (NUTS). The main benefit of NUTS is that the algorithm estimates the best step-size and number of steps for the leapfrog integrator. NUTS solves the issue of overly long trajectories by estimating the ideal step-size during the warmup and changing the number of steps with every iteration. It does, however, need a maximum number of steps and a target acceptance rate. These settings are called the max tree depth and adapt delta. The max tree depth defines 2^(max tree depth) as the maximum number of steps in the leapfrog integration. The target acceptance rate determines the ideal step-size during the warmup: after calculating a new trajectory, the trajectory is accepted or not, meaning that the values of the parameters are updated or remain the same. In general, the better the step-size, the higher the acceptance rate. A high adapt delta, and therefore a high acceptance rate after warmup, gives a better fit of the step-size after warmup, but it increases computation time too. The trajectory also needs to be of the right length: if the step-size is small, the number of steps needs to go up to keep the same trajectory length. Therefore, if the adapt delta goes up, the max tree depth needs to go up as well; otherwise the trajectory will be too short, creating a random walk [28, 29].

2.2 STAN

NUTS is implemented in the software STAN [29, 30]. STAN is a probabilistic programming language and is mainly used for Bayesian statistics. With STAN, the user can specify their models and run them with NUTS. STAN can be used through different interfaces; in this thesis, STAN is run through the R interface RStan.


Chapter 3

Model development (part 1): A mixed effects model without phylogenetic relationships

3.1 Introduction

We use mixed effects models to estimate relationships between OTU count data, as outcome Y, and diet data, as predictor X. We assume that the relationships between every OTU and the same diet variable are to some degree comparable, hence the mixed effects model. We also assume that the relationship between every OTU and the same diet variable differs per subject. However, this difference in relationship is identical across all OTUs of the same subject, and the differences across subjects are comparable.

3.2 Methods

3.2.1 The data

In the study by Deschaxaux et al., diet diaries and faecal samples of HELIUS participants were considered [24]. These data can be captured in a dataset consisting of two matrices. The first matrix includes all OTU counts Y. This is an N_subj × N_OTU matrix, representing all OTUs of all subjects, with 1…i…N_subj subjects and 1…j…N_OTU OTUs, whereby Y_ij is the count of OTU j of subject i. Z_i is the sum of the counts of all OTUs for subject i. An example of the OTU count matrix Y is shown in table 3.1. In this example, N_subj = 5, i.e. there are five subjects, and N_OTU = 5, i.e. there are five OTUs. Y_13 is 2025, being the count of the first subject and the third OTU.

For this thesis two extra columns are added to the matrix Y. An example of Y as used in this thesis is in table 3.2. N_subj is still 5, but the sixth column of this matrix is the total count of all other OTUs, so N_OTU = 6, or 5 + 1. The last column is the sum of all OTUs (Z).

Table 3.1: Example of matrix Y

OTU 1 OTU 2 OTU 3 OTU 4 OTU 5

Subject 1 749 0 2025 6031 0

Subject 2 0 3021 102 349 0

Subject 3 366 127 0 312 1720

Subject 4 196 1417 25 13067 1463

Subject 5 0 330 185 23 34

Table 3.2: Example of matrix Y as used in this thesis

OTU 1 OTU 2 OTU 3 OTU 4 OTU 5 Sum of all other OTUs Sum of all OTUs (Z)

Subject 1 749 0 2025 6031 0 104052 112857

Subject 2 0 3021 102 349 0 90182 93654

Subject 3 366 127 0 312 1720 116305 118830

Subject 4 196 1417 25 13067 1463 9849 26017

Subject 5 0 330 185 23 34 150737 151309

The second matrix consists of diet variables X. This is an N_subj × N_diet matrix, with 1…k…N_diet diet variables of 1…i…N_subj subjects, where X_ik is the value of diet variable k for subject i. An example of the diet variable matrix X is table 3.3. Here, the number of subjects N_subj is again 5, and the number of diet variables N_diet is five as well. For example, X_24 is 50.3. In this chapter, only one diet variable is included, so N_diet is 1 and the included diet variable is a vector x of length N_subj. The x-variable is scaled and centred.

3.2.2 The mixed effects models

As explained in the introduction chapter, OTUs can be modelled using different distributions. We will consider the Poisson distribution, the (zero-inflated) negative binomial distribution, the (beta-)binomial distribution and the multinomial distribution. The models are developed within the Bayesian framework. The probabilistic programming language

Table 3.3: Example of matrix X

Diet variable 1 Diet variable 2 Diet variable 3 Diet variable 4 Diet variable 5

Subject 1 24.3 0 0.8 7.14 35.71

Subject 2 7.14 10.6 21.4 50.3 100.5

Subject 3 0 2.8 21.9 22.8 20

Subject 4 40.17 0 73.1 12.3 0


STAN is used for specifying the models [28, 29].

The parameters of these distributions depend on the values of the x-variable, the diet variable. This is accomplished by specifying a function between x and the parameters of the distribution of Y. The function linking the diet variable to a parameter consists of two parts: the regression function and the link function. The link function depends on the distribution. The regression function we use is always the same:

A_ij + B_ij × x_i    (3.1)

where A_ij is an intercept for OTU_j and subject_i, and B_ij is a slope for OTU_j and subject_i. More on this regression function is explained in section 3.2.2.2.

3.2.2.1 Distributions to model the OTUs

OTUs are count data. Count data are often modelled using a Poisson distribution. The Poisson distribution has only one parameter, λ, which is both the mean and the variance of Y_j. λ is linked to x with a log link function:

y_ij ∼ Poisson(λ_ij; Z_i)
log(λ_ij) = A_ij + B_ij × x_i + log(Z_i)    (3.2)

As we will use a Bayesian framework, this Poisson model, like all other models, requires specification of the distributions of A_ij and B_ij; this is discussed in section 3.2.2.2. All priors are discussed in section 3.2.3.
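To make the role of the offset log(Z_i) in equation 3.2 concrete, the sketch below evaluates the link for one OTU under made-up values of A_ij and B_ij (the numbers are purely illustrative): the offset makes the expected count proportional to a subject's total count Z_i.

```python
import math

# Assumed intercept and slope for one OTU (illustrative only).
A, B = -9.0, 0.3
x_i = 0.5                         # a standardised diet value

rates = {}
for Z_i in (50_000, 100_000, 200_000):
    # Eq. 3.2: log link with log(Z_i) as offset.
    rates[Z_i] = math.exp(A + B * x_i + math.log(Z_i))
    print(Z_i, round(rates[Z_i], 2))

# Doubling Z_i doubles the expected count, so the model effectively
# describes the relative abundance exp(A + B * x_i) = lambda / Z_i.
```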

The Poisson distribution might not be the best distribution to model the OTUs, as the OTU counts are often overdispersed. A model with a separate variance parameter can better fit overdispersed data such as OTU data. A distribution that models count data and has a separate variance parameter is the negative binomial distribution, which has a second parameter φ for the variance. The diet variable is linked to the mean (µ) in the same way as in the Poisson model.

y_ij ∼ negative binomial(µ_ij, φ_j; Z_i)
log(µ_ij) = A_ij + B_ij × x_i + log(Z_i)    (3.3)

Apart from overdispersion, OTU counts are also often zero-inflated, meaning that the value Y_ij = 0 occurs often compared to Y_ij > 0. This characteristic of OTU data can be modelled with zero-inflated distributions, and hence we consider a mixture of a Bernoulli and a negative binomial distribution. A benefit of zero-inflated models is that the number of observed zeros does not affect the mean, µ, as strongly as it would without a zero-inflated model:

y_ij ∼ θ_j + (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij = 0
y_ij ∼ (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij > 0
log(µ_ij) = A_ij + B_ij × x_i + log(Z_i)
P(Y_ij = 0) ∼ Bernoulli(θ_j)    (3.4)
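To make the mixture in eq. 3.4 concrete: under the mean-dispersion parameterisation of the negative binomial, the probability of a zero count is (φ/(φ+µ))^φ, so the zero-inflated mixture assigns P(y = 0) = θ + (1 − θ)(φ/(φ+µ))^φ. The values of θ, µ and φ below are made up for the example.

```python
def nb_zero_prob(mu, phi):
    """P(Y = 0) for a negative binomial with mean mu and dispersion phi."""
    return (phi / (phi + mu)) ** phi

def zinb_zero_prob(theta, mu, phi):
    """P(Y = 0) under the zero-inflated mixture of eq. 3.4."""
    return theta + (1 - theta) * nb_zero_prob(mu, phi)

mu, phi = 5.0, 2.0
for theta in (0.0, 0.3, 0.6):
    print(theta, round(zinb_zero_prob(theta, mu, phi), 3))

# theta = 0 recovers the plain negative binomial; a larger theta adds
# extra zeros without changing the count distribution for y > 0.
```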

Both the Poisson and negative binomial distributions model the OTU variable as counts. An alternative is to model the OTU variables as probabilities, using a binomial or beta-binomial distribution. The binomial distribution has one parameter, the mean (µ), and the logit of the mean is used as link function.

y_ij ∼ binomial(µ_ij; Z_i)
logit(µ_ij) = A_ij + B_ij × x_i    (3.5)

A more flexible approach is possible with the beta-binomial distribution. Similar to the negative binomial distribution, there is a second variance parameter φ, making it a better fit for data with overdispersion.

y_ij ∼ beta-binomial(α_ij, β_ij; Z_i)
α_ij = µ_ij × φ_j
β_ij = (1 − µ_ij) × φ_j
logit(µ_ij) = A_ij + B_ij × x_i    (3.6)

Univariate models ignore relationships between OTUs. To include correlations between OTUs, a multivariate distribution is needed. In this thesis, a multinomial distribution is considered. The multinomial model has softmax as link function; the softmax uses the last OTU (the sum of all other OTUs) as the reference category.

y_i1 … y_iN_OTU ∼ multinomial(θ_i1 … θ_iN_OTU; Z_i)
θ_ij = exp(A_ij + B_ij × x_i) / (1 + Σ_{j=1}^{N_OTU−1} exp(A_ij + B_ij × x_i))    (j < N_OTU)
θ_iN_OTU = 1 − Σ_{j=1}^{N_OTU−1} θ_ij    (3.7)

3.2.2.2 Mixed effects model

In all six models we specified the mean parameter to be a function of:

A_ij + B_ij × x_i    (3.8)

Here we consider further modelling of A_ij and B_ij.

The intercept (A) is split into a fixed effect (α_0) and a random effect. This random effect is further divided into an effect depending on the subject (α_subj_i) and an effect depending on the OTU (α_OTU_j). A similar decomposition is done for the slope (B).

A_ij = α_0 + α_subj_i + α_OTU_j
B_ij = β_0 + β_subj_i + β_OTU_j    (3.9)

All subject effects (α_subj, β_subj) are assumed to come from a multivariate normal distribution (MVN) with a mean of 0. All OTU effects are assumed to come from a separate MVN, again with a mean of 0. All these random effects can be seen as deviations from the fixed effects α_0 and β_0.

(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj)
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_OTU)    (3.10)

The covariance matrix (Σ) can be hard to estimate as it increases in size [29]. We therefore split the covariance matrix into a vector of variances (σ²) and a correlation matrix. The correlation matrix is further decomposed into a lower triangular matrix (L), where LLᵀ is the correlation matrix. These two parameters are easier to estimate. A smart method for splitting the covariance matrix into a vector of variances and a correlation matrix is known as the Cholesky factorization [29]. Using the Cholesky factorization we can write the multivariate distributions MVN(0, Σ) as (diag(σ²) × L) × w, where w is a standard normal distribution.
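This decomposition can be checked numerically. The sketch below draws correlated intercept-slope pairs by scaling standard-normal draws w with diag(σ) × L, where L is the Cholesky factor of a 2×2 correlation matrix; note that in the sampling step it is the standard deviations that scale the draws. The correlation and standard deviations are invented for the example.

```python
import math, random

rho = 0.5                 # assumed intercept-slope correlation
sigma = (1.0, 0.5)        # assumed standard deviations

# Cholesky factor L of the correlation matrix [[1, rho], [rho, 1]],
# so that L times its transpose reproduces the correlation matrix.
L = ((1.0, 0.0),
     (rho, math.sqrt(1.0 - rho * rho)))

def draw_pair(rng):
    """One (alpha, beta) pair: (diag(sigma) x L) x w with w ~ N(0, 1)."""
    w = (rng.gauss(0, 1), rng.gauss(0, 1))
    alpha = sigma[0] * (L[0][0] * w[0] + L[0][1] * w[1])
    beta = sigma[1] * (L[1][0] * w[0] + L[1][1] * w[1])
    return alpha, beta

rng = random.Random(42)
pairs = [draw_pair(rng) for _ in range(200_000)]
a = [p[0] for p in pairs]
b = [p[1] for p in pairs]
mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)
cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)
corr = cov / (sigma[0] * sigma[1])
print(round(corr, 2))   # close to rho
```

This is the same non-centered construction used in the STAN models: only independent standard normals and a small Cholesky factor are sampled, which is easier for NUTS than sampling Σ directly.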

3.2.3 Prior distribution

When setting the priors for parameters, there is always a balance between a too uninformative prior, where the algorithm estimating the parameters can have trouble converging, and a too informative prior, where the results will mainly reflect the prior belief, with too little influence from the data.

For all models, the priors for the fixed effects α_0 and β_0 are normal distributions. For the fixed slope parameter (β_0) the expected value is close to zero. This fixed effect is the same for all OTUs, meaning the fixed slope parameter β_0 will differ from zero only if all OTUs are affected by the diet variable in the same way, which is unlikely. Therefore, a normal distribution with a mean of 0 and a variance of 1 is used as prior for β_0.

As for the fixed intercept α_0, this is larger than zero if all OTUs have a high mean count, and closer to zero if the mean count is close to zero for all OTUs. A mean count close to zero can be the case for OTUs with a low abundance. It is unlikely that all OTUs have a high mean count, as OTUs are often sparse. Again, a normal distribution with a mean of 0 and a variance of 1 is used.

Both the subject’s specific random effects and the OTUs specific random effect need priors for the variance and the decomposed correlation matrix. As prior distribution for lower triangle matrix L, we use the LKJ Correlation Density distribution. This distribution only has one parameter. If set to 1, the likelihood of every value between -1 and +1 is almost equal. With a higher setting for the parameter of the LKJ correlation density distribution, the likelihood of the correlation being 0 increases, and the probability of values near the -1 and +1 decrease [29]. We expect a correlation between the intercepts and slopes, but it is unlikely this will be a correlation of -1 or 1, therefore the prior of the L is a LKJ correlation density distribution of 2.

The prior for the variance σ² is an exponential distribution with intensity 1, as the variance is always positive and values closer to zero are preferred over higher values.

The negative binomial and the beta-binomial distributions both have a second parameter φ. As OTUs are overdispersed, it is expected that the variance of the counts will be large. Also, the variance can never be below zero. The prior is therefore an uninformative gamma distribution. The gamma distribution has two parameters, a and b, with mean a/b and variance a/b². This means that a large a combined with a small b creates an uninformative prior. We use an a of 10 and a b of 0.1.
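The implied width of this prior is easy to verify from the gamma moments:

```python
a, b = 10.0, 0.1
prior_mean = a / b          # a/b = 100
prior_var = a / b ** 2      # a/b^2, approximately 1000
prior_sd = prior_var ** 0.5
print(prior_mean, round(prior_var, 6), round(prior_sd, 1))
```

A mean of 100 with a standard deviation of roughly 31.6 leaves a wide range of plausible values for φ, which is what makes the prior uninformative here.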

The zero-inflated negative binomial model we consider in this thesis is a mixture of a negative binomial distribution and a Bernoulli distribution. The Bernoulli distribution has one parameter θ, which can take every value between 0 and 1. A uniform prior between 0 and 1 is used.

3.2.4 Experimental setup

3.2.4.1 Real data; testing computation times and power

All models are tested with five OTUs, with all other OTUs present in our subjects summed into the sixth reference OTU. A first test of the model estimations is done with only ∼10% of the subjects and a second test with the full set of subjects. The computation time includes only the sampling (both burn-in and posterior iterations) and does not include compiling of the STAN model. The time in seconds of the longest-running chain is reported. If convergence was not reached within 30.000 seconds (∼8 hours), the model estimation was terminated and reported as >30.000 seconds.

To check convergence of the models, trace plots, the R̂ and the ESS were inspected. All three convergence checks were done for every parameter.

3.2.4.2 Simulation study; testing false positive rate

Of the models that reached convergence within 30.000 seconds with the full set of subjects, the FPR was tested. To test the FPR of the models, the HELIUS data was used as a basis, whereby one diet variable was randomly shuffled. The model was run with 5(+1) OTUs and one diet variable, ten times, every time re-shuffling the diet variable randomly. Again, convergence was checked using trace plots, R̂ and ESS.
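The shuffling scheme can be sketched as follows. Refitting the full Bayesian model is abstracted away here: `association_detected` is a hypothetical stand-in that flags a detection when the absolute Pearson correlation between the shuffled diet variable and an OTU exceeds a threshold, in place of the credibility-interval check on the slope used in the thesis. All data below are simulated.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation, stdlib only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def association_detected(diet, otu, threshold=0.1):
    # Hypothetical placeholder for "the 95% credibility interval of the
    # slope excludes zero" after fitting the model.
    return abs(pearson(diet, otu)) > threshold

rng = random.Random(0)
n = 1090                                  # number of subjects in the thesis
diet = [rng.random() for _ in range(n)]   # simulated diet variable
otu = [rng.random() for _ in range(n)]    # simulated OTU signal

false_positives = 0
for _ in range(10):                       # ten re-shuffles, as in 3.2.4.2
    shuffled = diet[:]
    rng.shuffle(shuffled)                 # destroys any real association
    if association_detected(shuffled, otu):
        false_positives += 1
print(false_positives)
```

Because shuffling removes any real diet-OTU association, every detection after shuffling counts as a false positive; a well-calibrated model should rarely detect anything here.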

3.2.4.3 Software, initial parameters, settings and hardware

All models were made using RSTAN version 2.18. The initial values were randomly drawn using the STAN default initial values (between -2 and 2). The max tree depth and adapt delta settings were by default 10 and 0.8. When convergence checks failed, these settings were increased to 12 and 0.9, and again increased to 15 and 0.99 if needed. A total of 2000 iterations were done per chain. Two chains were run in parallel on an Intel i5-6267U processor.

3.2.5 Graphical models

Combining the mixed effects model with the Poisson, negative binomial, zero-inflated negative binomial, binomial, beta-binomial and multinomial distributions gave a total of 6 different mixed effects models; these graphical models are illustrated in figures 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6, respectively. An explanation of graphical models can be found in appendix B.


Figure 3.1: Graphical model of the Poisson mixed effects model

y_ij ∼ Poisson(λ_ij; Z_i)
log(λ_ij) = A_ij + B_ij × x_i + log(Z_i)
A_ij = α_0 + α_OTU_j + α_subj_i
B_ij = β_0 + β_OTU_j + β_subj_i
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)


Figure 3.2: Graphical model of the Negative Binomial mixed effects model

y_ij ∼ negative binomial(µ_ij, φ_j; Z_i)
log(µ_ij) = A_ij + B_ij × x_i + log(Z_i)
A_ij = α_0 + α_OTU_j + α_subj_i
B_ij = β_0 + β_OTU_j + β_subj_i
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)
φ ∼ gamma(10, 0.1)


Figure 3.3: Graphical model of the zero-inflated negative binomial mixed effects model

y_ij ∼ θ_j + (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij = 0
y_ij ∼ (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij > 0
log(µ_ij) = A_ij + B_ij × x_i + log(Z_i)
P(Y_ij = 0) ∼ Bernoulli(θ_j)
A_ij = α_0 + α_OTU_j + α_subj_i
B_ij = β_0 + β_OTU_j + β_subj_i
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)
φ ∼ gamma(10, 0.1)
θ ∼ U(0, 1)


Figure 3.4: Graphical model of the Binomial mixed effects model

y_ij ∼ binomial(µ_ij; Z_i)
logit(µ_ij) = A_ij + B_ij × x_i
A_ij = α_0 + α_OTU_j + α_subj_i
B_ij = β_0 + β_OTU_j + β_subj_i
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)


Figure 3.5: Graphical model of the Beta-Binomial mixed effects model

y_ij ∼ beta-binomial(α_ij, β_ij; Z_i)
α_ij = µ_ij × φ_j
β_ij = (1 − µ_ij) × φ_j
logit(µ_ij) = A_ij + B_ij × x_i
A_ij = α_0 + α_OTU_j + α_subj_i
B_ij = β_0 + β_OTU_j + β_subj_i
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)
φ ∼ gamma(10, 0.1)


Figure 3.6: Graphical model of the multinomial mixed effects model

y_i1 … y_iN_OTU ∼ multinomial(θ_i1 … θ_iN_OTU; Z_i)
θ_ij = exp(A_ij + B_ij × x_i) / (1 + Σ_{j=1}^{N_OTU−1} exp(A_ij + B_ij × x_i))    (j < N_OTU)
θ_iN_OTU = 1 − Σ_{j=1}^{N_OTU−1} θ_ij
A_ij = α_0 + α_OTU_j + α_subj_i    (j < N_OTU)
B_ij = β_0 + β_OTU_j + β_subj_i    (j < N_OTU)
(α_OTU_j, β_OTU_j)ᵀ ∼ MVN(0, Σ_otu), implemented as (diag(σ²_otu) × L_otu) × w_otu
(α_subj_i, β_subj_i)ᵀ ∼ MVN(0, Σ_subj), implemented as (diag(σ²_subj) × L_subj) × w_subj
α_0 ∼ N(0, 1)
β_0 ∼ N(0, 1)
σ²_otu ∼ exponential(1)
L_otu ∼ LKJ correlation density(2)
w_otu ∼ N(0, 1)
σ²_subj ∼ exponential(1)
L_subj ∼ LKJ correlation density(2)
w_subj ∼ N(0, 1)


Figure 3.7: Scatter plots of the diet variable and the 5 OTUs

3.3 Results

The OTU matrix (Y) included 2170 subjects and 12376 OTUs. The diet matrix (X) contained 5350 subjects and 74 diet variables. Only subjects present in both matrices were included for further analyses. The final datasets therefore included 1090 subjects, 12376 OTUs, and 74 diet variables of those 1090 subjects.

In this chapter, only the first 5 OTUs were used in all analyses, and the other 12371 OTUs were summed together into the 6th OTU. A total of 74 diet variables were in the data set, but only the first diet variable was used in the analyses. This unstandardised diet variable had a median value of 29.02, a minimum value of 0 and a maximum value of 501.43. In figure 3.7, the unstandardised diet variable is plotted versus the 5 included OTUs. The correlations between the diet variable and the OTUs were 0.109, -0.002, -0.008, -0.123 and -0.048, respectively. The diet variable was standardised in our analyses.

3.3.1 100 subjects

3.3.1.1 Poisson model

With the default settings of the NUTS algorithm, the Poisson model had trouble estimating the optimal values for the intensity parameters λ. The ESS values were not large. Combining this with the number of max tree depth warnings indicates that the leapfrog integration trajectories were too short for this model.

By increasing the allowed max tree depth to 15, we found the best results, with ESS values of at least 310. The lower ESS values are of subject-specific parameters. The highest R̂ is 1.01. Computation time was 2269 seconds.

In figure 3.8 we illustrate the effect of the max tree depth setting. Both trace plots are of the variance of the subject-specific intercept. In the left trace plot, with the default settings of the NUTS algorithm, the model has not converged. In the right trace plot, with the max tree depth set to 15, the model has converged.


Figure 3.8: Trace plots of the variance of the subject intercepts from the Poisson model (left: default settings, right: max tree depth increased to 15)

3.3.1.2 Negative binomial model

The negative binomial model with the default settings took 125 seconds. The variance of the subject intercepts had the lowest ESS, of 332. There were no warnings about the leapfrog integration, indicating that NUTS did not have trouble finding the optimal values for the parameters.

3.3.1.3 Zero-inflated negative binomial model

The zero-inflated negative binomial model was slower than the negative binomial model; it took 158 seconds with the default settings. There was an additional parameter to estimate, so this is in line with expectations. The variance of the subject-specific intercepts had the lowest ESS, of 348. There were no warnings about the leapfrog integration.

3.3.1.4 Binomial model

The binomial model did not reach convergence with the default settings. We increased the max tree depth, but found that convergence remained problematic. With a max tree depth of 15, computation time was 2706 seconds and the lowest ESS was only 38. Especially the variance of the subject-specific slopes turned out to be difficult to estimate.

3.3.1.5 Beta-binomial model

The beta-binomial model took 97 seconds for 2000 iterations with the default settings. The ESS of the variance of the subject slope was 127. There were only three warnings about the adapt delta.

3.3.1.6 Multinomial model

The multinomial model did not reach convergence with the default settings. We increased the max tree depth setting to 15. The computation time was about 871 seconds, the lowest ESS was 106, and also with this model the variance of the subject-specific slopes turned out to be difficult to estimate.

3.3.2 All subjects

3.3.2.1 Poisson model

With all 1090 subjects included, it was expected that the computation time would be longer. The Poisson model again had trouble converging with the default settings; there were 1998 max tree depth warnings. The max tree depth was set to 15. The computation time exceeded 30.000 seconds and therefore the NUTS was terminated.

3.3.2.2 Negative binomial model

The negative binomial model with the default settings took 4172 seconds. The lowest ESS was 725, with an R̂ of 1.

3.3.2.3 Zero-inflated negative binomial model

The zero-inflated negative binomial model was again slower than the negative binomial model with the default settings: 6806 seconds.

3.3.2.4 Binomial model

The binomial model did not converge after 2000 iterations with the default settings. All 2000 warnings were max tree depth warnings. The trace plots of the variance of the subject-specific intercepts and slopes both indicate that there was no convergence. This is also shown in the scatter plot in figure 3.9: there were two separated regions of estimated values of the variance of the subject slopes. The max tree depth was increased to 15. After more than 8 hours, 50% of the iterations were completed. Based on this, the model was terminated, as the total computation time would be too long for the model to be of any use, given that we were only testing a small data set.

3.3.2.5 Beta-binomial model

The beta-binomial model took 2522 seconds with the default settings and did converge. The lowest ESS was for the fixed slope effect, at 344, with an R̂ of 1. The trace plot confirmed convergence as well. The ESS values of the variance of the intercept and slope of the subjects were 634 and 709, both with an R̂ of 1. Because this was a different result compared to the estimations with 100 subjects of the beta-binomial model, the model was rerun with the same default settings. This second run confirmed the results of the first run.

Figure 3.9: Scatter plot of the variance of the subject intercepts versus the variance of the subject slope from the binomial model

3.3.2.6 Multinomial model

With the default settings, the multinomial model did not converge. After altering the settings, the multinomial model took longer than 30.000 seconds to reach convergence. The model was terminated.

3.3.3 Comparing estimated values

The estimated values of the subject-specific intercepts (α_subj) of the final models tested with 100 subjects are in figure 3.10. It becomes clear why the variance of the subject intercepts was hard to estimate in the binomial and multinomial models: the intercepts differ more per subject than in the Poisson, negative binomial, zero-inflated negative binomial and beta-binomial models.

The estimated slopes per OTU with all subjects are almost the same for the three models. They also show the same direction of effect as the correlations between the diet variable and the OTU in question. The correlation of the first OTU is 0.109, and all three models show a positive association. Similarly, the 4th OTU had a negative correlation of -0.123.


3.3.4 False positive rate

For the negative binomial, zero-inflated negative binomial and beta-binomial models, the FPR was tested by shuffling the diet variable. This was done 10 times. None of the three models produced false positives.

3.4 Discussion

3.4.1 Main findings

The main finding of this chapter is that the current parametrisation does not work well enough for the data at hand. All six models had problems with convergence. Some convergence problems could be solved by adjusting algorithmic settings (allowing longer trajectories). However, all six models had long computation times.

3.4.2 Interpretation of results

The main problem in all models is the subject-specific random effects, mainly the variance, which is hard to estimate. With the negative binomial, zero-inflated negative binomial and beta-binomial models there is a high autocorrelation within the chains, indicated by the low ESS. To increase the ESS, more iterations are needed. With the Poisson, binomial and multinomial models, the leapfrog integration settings needed to be altered.

The subject-specific intercepts and slopes are difficult to estimate. This is not unexpected, because there are no repeated measurements or other forms of clustering in the data. The overdispersion parameter in the (zero-inflated) negative binomial and beta-binomial models simplified the estimation of the subject-specific intercepts. The offset used in the Poisson and (zero-inflated) negative binomial models also simplified the estimation of the subject-specific intercepts.


Chapter 4

Model development (part 2): A mixed effects model without phylogenetic relationships

4.1 Introduction

From the previous chapter, we concluded that the mixed effects models were difficult to estimate; especially the subject-specific parameters were almost unidentifiable. In this chapter we therefore drop these parts of the models.

4.2 Methods

The data used have the same structure as in section 3.2.1. In summary, the OTU counts are in matrix Y. This is an N_subj × N_OTU matrix, representing all OTUs of all subjects, with 1…i…N_subj subjects and 1…j…N_OTU OTUs. Y_ij is the count of OTU j of subject i. Z_i is the sum of the counts of all OTUs for subject i. The diet variable x is a vector of length N_subj; x_i is the value of the diet variable of subject i. x is scaled and centred.

The OTUs are again modelled using six different distributions; we will consider the Poisson, (zero-inflated) negative binomial, (beta-)binomial and multinomial distributions. Again, all models are developed within the Bayesian framework. The probabilistic programming language STAN is used for specifying and estimating the models [28, 29]. The parameters of these distributions again depend on x, the diet variable.

A_j + B_j × x_i    (4.1)

where A_j is an intercept for OTU_j and B_j is a slope for OTU_j. A_j is assumed to be sampled from a normal distribution with mean µ_α and variance σ²_α. B_j is sampled from a normal distribution as well, with mean µ_β and variance σ²_β. Both normal distributions are assumed to be independent, which is also a simplification compared to the previous chapter.

A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)    (4.2)

The mean of all A_j is µ_α, and this is considered to be a parameter. The prior for this parameter is a standard normal. The mean of all B_j is µ_β; the prior for µ_β is also a standard normal, because we do not expect the included OTUs to have a similar association with the diet variable. The variances of the intercept and slope have an exponential prior with intensity 1. The exponential is positive, as the variance will always be bigger than zero, and lower values are more likely than higher values. The priors of all other parameters, those that do not depend on x, are the same as in chapter 3 (section 3.2.3).

µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)    (4.3)
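The simplified hierarchy of eqs. 4.1-4.3 can be sketched by simulating from the priors. All numbers drawn below are illustrative only; they are prior draws, not estimates from the thesis.

```python
import random

rng = random.Random(7)

# Hyperpriors (eq. 4.3): standard normals for the means, Exponential(1)
# for the spread of the OTU-level effects.
mu_alpha = rng.gauss(0, 1)
mu_beta = rng.gauss(0, 1)
sigma_alpha = rng.expovariate(1.0)
sigma_beta = rng.expovariate(1.0)

# OTU-level effects (eq. 4.2): one intercept and one slope per OTU,
# shared across subjects; no subject-specific effects remain.
n_otu = 6
A = [rng.gauss(mu_alpha, sigma_alpha) for _ in range(n_otu)]
B = [rng.gauss(mu_beta, sigma_beta) for _ in range(n_otu)]

# Linear predictor (eq. 4.1) for one subject with diet value x_i.
x_i = 0.8
eta = [a + b * x_i for a, b in zip(A, B)]
print(len(eta), [round(e, 2) for e in eta])
```

Compared to chapter 3, the parameter count per model drops from two effects per subject-OTU pair to two effects per OTU, which is what makes these models estimable.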

The experimental setup is similar to that of chapter 3, explained in section 3.2.4, with the alteration that the models are only tested with all 1090 subjects.

4.2.1 Graphical models

Combining the mixed effects model with the Poisson, negative binomial, zero-inflated negative binomial, binomial, beta-binomial and multinomial distributions gives a total of 6 different mixed effects models; these graphical models are illustrated in figures 4.1, 4.2, 4.3, 4.4, 4.5 and 4.6, respectively. An explanation of graphical models can be found in appendix B.


Figure 4.1: Graphical model of the Poisson mixed effects model

y_ij ∼ Poisson(λ_ij; Z_i)
log(λ_ij) = A_j + B_j × x_i + log(Z_i)
A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)


Figure 4.2: Graphical model of the negative binomial mixed effects model

y_ij ∼ negative binomial(µ_ij, φ_j; Z_i)
log(µ_ij) = A_j + B_j × x_i + log(Z_i)
A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)
φ ∼ gamma(10, 0.1)


Figure 4.3: Graphical model of the zero-inflated negative binomial mixed effects model

y_ij ∼ θ_j + (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij = 0
y_ij ∼ (1 − θ_j) × negative binomial(µ_ij, φ_j; Z_i)    if y_ij > 0
log(µ_ij) = A_j + B_j × x_i + log(Z_i)
P(Y_ij = 0) ∼ Bernoulli(θ_j)
A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)
φ ∼ gamma(10, 0.1)
θ ∼ U(0, 1)


Figure 4.4: Graphical model of the binomial mixed effects model

y_ij ∼ binomial(µ_ij; Z_i)
logit(µ_ij) = A_j + B_j × x_i
A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)


Figure 4.5: Graphical model of the beta-binomial mixed effects model

y_ij ∼ beta-binomial(α_ij, β_ij; Z_i)
α_ij = µ_ij × φ_j
β_ij = (1 − µ_ij) × φ_j
logit(µ_ij) = A_j + B_j × x_i
A_j ∼ N(µ_α, σ_α)
B_j ∼ N(µ_β, σ_β)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)
φ ∼ gamma(10, 0.1)


Figure 4.6: Graphical model of the multinomial mixed effects model

y_i1 … y_iN_OTU ∼ multinomial(θ_i1 … θ_iN_OTU; Z_i)
θ_ij = exp(A_j + B_j × x_i) / (1 + Σ_{j=1}^{N_OTU−1} exp(A_j + B_j × x_i))    (j < N_OTU)
θ_iN_OTU = 1 − Σ_{j=1}^{N_OTU−1} θ_ij
A_j ∼ N(µ_α, σ_α)    (j < N_OTU)
B_j ∼ N(µ_β, σ_β)    (j < N_OTU)
µ_α ∼ N(0, 1)
µ_β ∼ N(0, 1)
σ_α ∼ exponential(1)
σ_β ∼ exponential(1)


Table 4.1: The computation times and convergence measures per model

Model                            | Time (s) | Warnings      | Lowest ESS  | Highest R̂
Poisson                          | 118      | 0             | 289 (σ²_β)  | 1.01 (σ²_β)
Negative binomial                | 118      | 1 adapt delta | 888 (σ²_β)  | 1 (σ²_β)
Zero-inflated negative binomial  | 319      | 4 adapt delta | 936 (σ²_β)  | 1 (µ_β)
Binomial                         | 111      | 0             | 419 (σ²_α)  | 1 (σ²_β)
Beta-binomial                    | 308      | 0             | 963 (σ²_β)  | 1 (A_4)
Multinomial                      | 308      | 0             | 310 (σ²_α)  | 1 (σ²_β)

4.3 Results

The 5 OTUs used in this chapter were the same 5 OTUs as used in the previous chapter, as was the diet variable. All other OTUs in our data were summed into the 6th OTU. The correlations between the diet variable and the OTUs were 0.109, -0.002, -0.008, -0.123 and -0.048, respectively.

All six models converged with the default settings of the NUTS algorithm. The computation times were 118, 118, 319, 133, 346 and 427 seconds for the Poisson, negative binomial, zero-inflated negative binomial, binomial, beta-binomial and multinomial models, respectively (table 4.1). None of the models had trouble reaching convergence. However, in every model the variance of the slope coefficient had the lowest ESS or the highest R̂ of all parameters. The trace plots of the slope variance σ²β in figure 4.7 indicate why: the estimated values of this parameter were close to zero, which is the boundary of its parameter space, since a variance must be positive.

4.3.1 Comparing estimated values

The estimated slopes are shown in figure 4.8. The (zero-inflated) negative binomial and beta-binomial models were less certain about their estimates than the Poisson, binomial and multinomial models. OTU 1 and OTU 4 had the largest correlations with the diet variable in absolute value (0.109 and −0.123), and indeed all models indicated a large effect for these two OTUs.

Note that the credibility intervals of the slopes according to the Poisson, binomial and multinomial models are very small.

4.3.2 False positive rate

For all models, the FPR was tested. The Poisson model had the highest FPR: 48 out of 50. The multinomial and binomial models had high FPRs as well: 43 and 40 out of 50, respectively. These results suggest that the credibility intervals of these models are too small and indicate a lack of fit of these models to the data. The beta-binomial model had an FPR of only 1 out of 50, and the (zero-inflated) negative binomial models 0 out of 50.
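The FPR computation amounts to counting, over repeated null fits, how often the central 95% credibility interval of a slope excludes zero. A minimal sketch of that check on vectors of posterior draws; the draws below are made up, not posterior samples from the fitted models.

```python
def excludes_zero(draws, level=0.95):
    """True if the central credibility interval of the posterior draws
    excludes zero, i.e. the slope would be flagged as a discovery."""
    tail = (1.0 - level) / 2.0
    s = sorted(draws)
    lo = s[int(tail * (len(s) - 1))]
    hi = s[int((1.0 - tail) * (len(s) - 1))]
    return lo > 0.0 or hi < 0.0

# FPR = fraction of null slopes whose interval excludes zero.
null_tests = [[-0.2, -0.1, 0.0, 0.1, 0.2],   # interval covers zero
              [0.05, 0.06, 0.07, 0.08, 0.09]]  # all draws positive
fpr = sum(excludes_zero(d) for d in null_tests) / len(null_tests)
```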


Figure 4.8: The estimated slopes per OTU


4.4 Discussion

4.4.1 Main findings

Our simplified models were identifiable in the sense that computation times were acceptable and few or no convergence issues were encountered. The Poisson, binomial and multinomial models had high FPRs; their credibility intervals are too small and therefore overly precise, so these models do not fit the data well. The (zero-inflated) negative binomial and beta-binomial models do seem to fit the data.

4.4.2 Interpretation of results

Having a second variance parameter, as the negative binomial and beta-binomial distributions do, reduces the FPR. OTU counts are known to be overdispersed, meaning the variance of the counts is larger than the mean count. With an additional variance parameter, a better fit to the data is obtained than with distributions that lack an overdispersion parameter, such as the Poisson, binomial and multinomial distributions.
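The overdispersion can be made concrete with the gamma–Poisson representation of the negative binomial: drawing the Poisson rate from a gamma distribution with mean µ inflates the variance from µ (the Poisson case) to µ + µ²/φ. A small stdlib-only simulation; the sample size and parameter values are arbitrary and chosen only for this demonstration.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's inversion-by-multiplication Poisson sampler
    (fine for moderate rates)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(42)
mu, phi, n = 20.0, 5.0, 20_000
# Negative binomial as a gamma-Poisson mixture: rate ~ Gamma(phi, scale=mu/phi),
# which has mean mu and variance mu**2 / phi.
draws = [poisson_draw(rng.gammavariate(phi, mu / phi), rng) for _ in range(n)]
mean = sum(draws) / n
var = sum((d - mean) ** 2 for d in draws) / n
# Theory: variance = mu + mu**2 / phi = 100, five times the Poisson's 20.
```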

All models estimated a small variance of the slopes. As a variance can never be zero or negative, the algorithm had trouble estimating values near the boundary of the parameter space. This was less of a problem for the (zero-inflated) negative binomial and beta-binomial models, as they estimated a larger slope variance. This is another indication that the Poisson, binomial and multinomial models do not fit the data well.

The zero-inflated negative binomial model did not perform better in terms of the FPR than the negative binomial model without zero-inflation, and the estimated slopes were nearly identical. The data used in this chapter (the first 5 OTUs) were OTUs that are present in most subjects. When testing the models with other OTUs that contain more zeros, the zero-inflated models might outperform the models without zero-inflation; however, this was not tested within this chapter.

In our implementation, the intercept and slope of the regression function were independent. This independence assumption is perhaps not valid: OTUs with a small intercept often have a larger slope, and those with a large intercept a smaller slope. As the models in this chapter were only tested with 5+1 OTUs, an estimate of this dependency would be based on just six intercepts and slopes. When scaling up the number of included OTUs, it will be better to estimate the dependency between the intercepts and slopes.
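One way to introduce this dependency would be a joint (bivariate normal) prior on intercept and slope with a correlation parameter ρ. A hypothetical sketch of drawing a correlated (A_j, B_j) pair via the Cholesky factor of a 2×2 covariance; the hyperparameter values are illustrative and not part of the fitted models in this chapter.

```python
import math
import random

def draw_intercept_slope(mu_a, mu_b, sd_a, sd_b, rho, rng):
    """One draw of (A_j, B_j) from a bivariate normal with correlation rho,
    built from two independent standard normals (2x2 Cholesky factor)."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    a = mu_a + sd_a * z1
    b = mu_b + sd_b * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return a, b

rng = random.Random(7)
# Negative rho: small intercepts tend to pair with larger slopes.
pairs = [draw_intercept_slope(-5.0, 0.0, 1.0, 0.2, -0.6, rng)
         for _ in range(20_000)]
```

In a Stan implementation this would typically be expressed with an LKJ prior on the correlation matrix rather than a fixed ρ.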

4.4.3 Comparison with literature

There was a higher FPR when using a Poisson, binomial or multinomial model. Xiang et al. also found a higher FPR with a Poisson model, indicating that the Poisson distribution does not fit this kind of data very well [31]. Xiang's study did not include the binomial and multinomial distributions. However, a Poisson distribution approximates a binomial distribution when the number of trials is large and the success probability per trial is small [32], which is roughly the case for the OTU counts, as there is a large range of possible counts. So if the Poisson distribution has a high FPR, the binomial distribution will show the same behaviour. The multinomial distribution can be rewritten as a conditioned Poisson distribution [33]; therefore, the multinomial model will probably show the same lack of fit as the Poisson model.
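Both relationships are easy to check numerically. Below, a Binomial(10 000, 0.001) pmf is compared with a Poisson(10) pmf, and the multinomial pmf is recovered exactly as a product of independent Poisson pmfs conditioned on their total (the identity holds for any choice of the auxiliary rate λ). The check points are arbitrary values, not data from this thesis.

```python
import math

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def multinom_pmf(y, theta):
    """Multinomial pmf for counts y with category probabilities theta."""
    out = math.factorial(sum(y))
    for yj, tj in zip(y, theta):
        out *= tj ** yj / math.factorial(yj)
    return out

# Multinomial as independent Poissons conditioned on the total count:
y, theta, lam = (3, 2, 5), (0.2, 0.3, 0.5), 7.0
conditioned = (math.prod(poisson_pmf(yj, lam * tj) for yj, tj in zip(y, theta))
               / poisson_pmf(sum(y), lam))
```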

References
