
Estimation of parameters of ordinary differential equations

Stijn de Vos 1541234

Master Thesis in Mathematics

December 19, 2012


Estimation of parameters of ordinary differential equations

Summary

The problem of inferring parameters of differential equations from noisy data is of great importance in the study of gene-regulatory networks. In this thesis we look at the problem of parameter estimation. First, three methods are reviewed, each from a different discipline (frequentist, Bayesian, machine learning). Then we propose and investigate a Bayesian method, which we apply to various toy problems.

Master Thesis in Mathematics
Author: Stijn de Vos (1541234)
First supervisor: E.C. Wit
Second supervisor: W.P. Krijnen
External supervisor:
Date: December 19, 2012

Institute of Mathematics and Computing Science
P.O. Box 407
9700 AK Groningen
The Netherlands


Contents

1 Introduction
  1.1 Differential Equations and Modelling in Biology
  1.2 Mathematical formulation
  1.3 Types of inference
    1.3.1 Frequentist methods
    1.3.2 The likelihood principle
    1.3.3 Bayesian methods
    1.3.4 Markov Chain Monte Carlo Methods
    1.3.5 Machine learning in statistics

2 Comparison of methods
  2.1 A frequentist approach
    2.1.1 General framework
    2.1.2 Application in a Single Input Motif
  2.2 A Bayesian approach
  2.3 Machine Learning methods
    2.3.1 Smoothing method

3 A different Bayesian method
  3.1 Reproducing Kernel Hilbert Space
  3.2 Application to parameter estimation
  3.3 Experimentation in R
  3.4 Application to a nonhomogeneous ODE

4 Multidimensional systems and model selection
  4.1 Systems of ODEs
  4.2 Bayesian information criterion and Bayes factors
  4.3 Conclusion

A R source code

Bibliography


Chapter 1

Introduction

From a scientific point of view, it is of great importance to understand how genes activate or repress each other. It could explain why humans and mice share about 85% of their genes, and yet are completely different organisms. It could also help us understand how to treat genetic diseases more efficiently. The collection of all gene interactions is called the gene regulatory network. The mapping and understanding of this network is a challenge: although an exact count is not known, the human genome is estimated to carry between 20,000 and 30,000 genes. To make things worse, any number of genes may jointly affect any number of other genes. Also, determining which genes influence each other is one thing, but determining in what way they do so is another matter. The problem of modelling gene regulatory networks is being investigated in a number of ways. Some common methodologies are ([21]): network topology models (e.g. [10]), network control logic models (e.g. [25]), and dynamic models (e.g. [3]).

In this thesis we will focus on a mathematical problem connected to the dynamic modelling approach. First, we shall review the difference between frequentist, Bayesian and machine learning methods. In chapter two we will review a number of articles that present a solution to the problem (see below). In chapter three we will consider technical and computational aspects of Bayesian methods. Finally, we will attempt to improve the Bayesian approach given by [2] in chapter four.

1.1 Differential Equations and Modelling in Biology

Since the rise of systems biology, mathematics has had an increasing influence on tackling biological problems. As can be seen in [19], research has been done on the modelling of gene-regulatory networks and intracellular biochemical reactions in general. A popular method of expressing biochemical reactions mathematically is the Power Law Formalism, which is a system of nonlinear ordinary differential equations written in the form of a Law of Mass Action:

$$\dot{x}_i(t) = F_i(x_1, \ldots, x_n) = \alpha_i \prod_{m=1}^{N} x_m^{P_{im}} - \beta_i \prod_{m=1}^{N} x_m^{Q_{im}}$$

where i ∈ {1, ..., N}, the α_i, β_i are kinetic rates and P_im, Q_im are matrices of stoichiometric coefficients. This formalism was applied to a variety of biological problems. Later this framework was generalized and dubbed S-Systems. Not only was this framework applicable to a variety of biological and mathematical problems, it also provided biologists with a way to investigate a complex system in detail or as a whole. In other words, this framework could be applied to a single cell, a network of cells or a whole organism.
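To make this concrete, a small two-species power-law system can be simulated numerically in R; the sketch below assumes the deSolve package is available, and the rates and exponent matrices are made-up illustrative values.

library(deSolve)

# dx_i/dt = alpha_i * prod_m x_m^P[i,m] - beta_i * prod_m x_m^Q[i,m]
alpha <- c(2, 1); beta <- c(1, 1)                 # made-up kinetic rates
P <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)      # production exponents
Q <- matrix(c(1, 0,
              0, 1), nrow = 2, byrow = TRUE)      # degradation exponents

s_system <- function(t, x, parms) {
  prodP <- apply(P, 1, function(p) prod(x^p))
  prodQ <- apply(Q, 1, function(q) prod(x^q))
  list(alpha * prodP - beta * prodQ)
}

out <- ode(y = c(x1 = 1, x2 = 0.5), times = seq(0, 10, by = 0.1),
           func = s_system, parms = NULL)
head(out)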

Unfortunately there are limitations to using ordinary differential equations. At the microscopic level, reality is of a stochastic nature. This stochasticity has several origins:

• Randomness on the molecular level; during the formation of heavy molecules, each molecule can assume a different quantum state.

• Randomness on the chemical level; a chemical reaction may depend on any number of complex processes depending on internal variables. In large systems this stochasticity is of no consequence, but in living cells the number of biomolecules may not be as large, and the resulting process may show large fluctuations.

• Randomness in gene expression; in an individual cell, the production of proteins and enzymes was discovered to follow a stochastic pattern with a random distribution of intensities and durations. The transcription factor concentrations and the rate of protein production fluctuate between individual cells as well (see [1] and [20]).

So the use of ordinary differential equations is not always legitimate. In some scenarios one should resort to partial or stochastic differential equations.

1.2 Mathematical formulation

In a dynamic modelling setting, we use an ordinary, partial or stochastic differential equation (ODE, PDE, SDE, respectively) to model gene activity. Here, gene activity can be defined as, for example, the number of mRNA molecules present in a cell. We let x_i denote the activity of gene i, i ∈ {1, ..., n}. In the ODE case, the differential equation governing the gene activities is denoted by

$$\dot{x}(t) = f(x(t), t \mid \theta)$$

Here, x(t) = (x_1(t), x_2(t), ..., x_n(t)) ∈ R^n is the state variable and f : R^n → R^n. We let θ = (θ_1, θ_2, ..., θ_p) ∈ R^p denote a vector of parameters, such as gene-specific parameters, activity levels of other genes, et cetera. We now have to infer what these parameters are.

Since accurately measuring the number of mRNA molecules is hard to accomplish, the data we are given contain noise. This means that instead of observing the state vector x(t), we have noisy measurements y(t) = x(t) + ε(t). The problem, then, is this: suppose we have a noisy data set {y_i}_{i=1}^n, consisting of noisy observations of a state vector x(t) that satisfies some differential equation. How can we infer the parameters θ_1, ..., θ_p of this differential equation?

Of course, reality could make our life even harder. It could be that our differential equation has no closed-form solution, for example. Or maybe we can only observe a subvector of the state vector x(t). Or maybe we are not dealing with an ordinary differential equation but with a stochastic differential equation. In chapter two we will encounter a Bayesian method that deals with the case of a missing closed-form solution. The other two complications remain largely untouched in this thesis.


1.3 Types of inference

We look at three different types of inference: frequentist, Bayesian and machine learning methods.

1.3.1 Frequentist methods

Although the way frequentist statistics is taught today suggests that the discipline has always had one unified view on inference, this is not the case. Most people will associate frequentist methods with concepts like hypothesis testing, confidence intervals and significance levels. These were, however, ideas originating from different views on hypothesis testing [11]. In the twentieth century there was a lot of debate between two camps. On the one hand there were Neyman and Pearson, who are now well known for their null- and alternative-hypothesis setup. On the other hand there was R. A. Fisher, who introduced the concepts of significance levels and mathematical likelihood. The debate between these two camps focused, among other things, on the role of probability theory in statistics. Sometimes Fisher's likelihood approach is not considered part of the frequency tradition. Nowadays, statisticians use a mix of these concepts, and the distinction between the different views on frequentist statistics is not always made. In general, frequentists view the concept of probability as a long-run frequency. That is, the probability of an event is the proportion of occurrences of that event in an infinite number of trials. In statistics, to adhere to the frequentist tradition usually means that one considers the obtained data to be random and the to-be-estimated parameters to be fixed. We then formulate a test statistic and a null hypothesis. If, under the null hypothesis, the test statistic crosses a pre-determined threshold, we reject the null hypothesis; otherwise we fail to reject it. Consider the following example.

Example Suppose we obtain data {x_i}_{i=1}^n, which we think are independent and identically distributed (iid) from a N(µ, σ²) distribution where σ² is known. We want to infer the population mean µ and formulate the hypothesis that µ = µ₀. Since we know the variance, we can use a z-test (with the variance unknown and a small sample, a t-test would be appropriate). A z-test uses a test statistic T, which is a function of our data {x_i}_{i=1}^n. This T is in a way a summary of our data and, if the population is normally distributed, also follows a normal distribution. Given T and its standard error s, we compute the z-score z = (T − µ₀)/s, from which we can compute a p-value P(|Z| ≥ |z|). Given some significance level, we then reject or fail to reject the hypothesis that µ = µ₀.
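As a minimal illustration, this z-test takes only a few lines of base R; the sample, µ₀ and σ below are made-up values.

set.seed(1)
sigma <- 2; mu0 <- 5
x <- rnorm(30, mean = 5.5, sd = sigma)            # sample whose true mean is 5.5

z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # z-score of the sample mean
p_value <- 2 * pnorm(-abs(z))                     # two-sided p-value
c(z = z, p = p_value)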

1.3.2 The likelihood principle

Statistical inference makes use of a couple of elementary principles. One of these originated with Fisher but was championed by statisticians such as Edwards and Birnbaum. Edwards ([4]) stated the likelihood principle as follows:

Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data. (...) For a continuum of hypotheses, this principle asserts that the likelihood function contains all the necessary information.


Authors sometimes distinguish between a weak and a strong version of this principle. The weak version is formulated much like Edwards' statement above, while the strong version also refers to sequential experiments. According to the strong likelihood principle, if two experiments yield equivalent likelihood functions, then we should make the same inference. Here, a likelihood function is called equivalent to another likelihood function if they differ only by a constant multiple. It is especially the strong version that is controversial. Consider the following problem. Suppose we want to test the fairness of a coin, i.e. we want to test whether P(heads) = θ = 1/2. In a frequentist hypothesis test, we could take H₀: θ = 1/2 versus H₁: θ > 1/2. Suppose we tossed the coin 12 times and observed 9 heads (and thus 3 tails). We could perform this experiment in the following ways:

(1) The number of tosses is predetermined, i.e. we decided before running the experiment that we were going to toss the coin 12 times. Then the number of heads X has a binomial distribution B(12, θ) with likelihood

$$l_1(k \mid \theta) = \binom{12}{k} \theta^{k} (1-\theta)^{12-k}, \qquad l_1(9 \mid \theta) = 220\, \theta^{9} (1-\theta)^{3}$$

The p-value is

$$P(X \geq 9 \mid H_0) = \sum_{k=9}^{12} \binom{12}{k} \left(\frac{1}{2}\right)^{12} = 0.073$$

If we take the very common significance level of α = 0.05, then the hypothesis that the coin is fair is not rejected.

(2) The number of tails is predetermined, i.e. we decide to continue tossing the coin until we reach 3 tails. Now the number of heads X has a negative binomial distribution NB(3, 1 − θ). The likelihood function is

$$l_2(\theta \mid k = 9) = \binom{11}{2} \theta^{9} (1-\theta)^{3} = 55\, \theta^{9} (1-\theta)^{3}$$

which is a constant multiple of l_1, so l_2 is equivalent to l_1. However, in this case the p-value is

$$P(X \geq 9 \mid H_0) = \sum_{x=9}^{\infty} \binom{3+x-1}{x}\, 2^{-x-3} = 0.0327$$

leading to the rejection of H₀. So even though the likelihood functions are equivalent, they lead to different conclusions. Although this might make one wonder whether to adopt the likelihood principle or not, Birnbaum presented a proof in 1962 that claimed to show that the likelihood principle can be derived from two other, less controversial, statistical principles. The first one, the sufficiency principle, states:

Let two observations x and y be such that T(x) = T(y) for some sufficient statistic T for the family f(·|θ). Then the inferences on θ based on x and y should be the same.

The second one is the conditionality principle, which says that if we pick an experiment concerning the inference on some parameter θ from a collection of possible experiments, then any experiment not chosen is irrelevant to the inference. These principles are arguably less controversial, so it would seem that we are justified, by Birnbaum's proof, in accepting the likelihood principle. However, the proof is not accepted by everyone. See, for instance, the critique of Birnbaum's proof given by Deborah Mayo in [12].
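Returning to the coin example: both p-values can be reproduced in a few lines of base R, which makes the conflict between the two designs concrete.

# Binomial design: 12 tosses fixed, 9 heads observed.
sum(dbinom(9:12, size = 12, prob = 0.5))     # 0.073

# Negative binomial design: toss until the 3rd tail, counting heads.
# pnbinom(q, size, prob) gives P(X <= q) for the number of heads
# ('failures') before the 3rd tail ('success').
1 - pnbinom(8, size = 3, prob = 0.5)         # 0.0327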

1.3.3 Bayesian methods

Bayesian statistics, founded on a subjective view of probability theory, differs fundamentally from frequentist statistical methods. A key difference is the view that the data is not random and that we can express our knowledge about underlying models and their parameters in terms of probabilities. These probabilities are taken to be conditional on the given data. A probability is not a proportion or frequency, but a measure of knowledge. The famous Bayes' Theorem has been known since 1763, but only became famous after Pierre-Simon Laplace generalized Bayes' original findings in 1774 [5]. Bayesian statistics fell out of favor at the start of the twentieth century when Fisher presented his framework for mathematical statistics, both for philosophical and practical reasons. This changed as the century progressed, however, in part because of advances in computer technology and research into methods that could overcome the computational difficulties associated with Bayesian statistics, most notably Markov Chain Monte Carlo (MCMC) methods. In Bayesian statistics, one computes the posterior distribution of a to-be-inferred parameter θ, given some dataset {x_i}_{i=1}^n and a prior distribution, which has to be supplied. Using Bayes' theorem, these concepts are related via

$$p(\theta \mid \{x_i\}_{i=1}^n) = \frac{p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)}{p(\{x_i\}_{i=1}^n)}$$

where p(A|B) is the probability density function (pdf) of A, conditional on B. The expression p({x_i}_{i=1}^n) can be expanded as

$$p(\{x_i\}_{i=1}^n) = \int p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)\, d\theta$$

The pdf p({x_i}_{i=1}^n | θ) is also called a likelihood, so, heuristically, Bayes' theorem can be expressed as

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal}}$$

The expression $\left(\int p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)\, d\theta\right)^{-1}$ serves as a normalizing constant and is often not really of interest. It is sometimes ignored, and Bayes' theorem is expressed as

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

This method of doing statistics brings its own difficulties. For one, the specification of a prior is often non-trivial, and even the use of a 'non-informative', 'flat' or 'vague' prior such as a uniform distribution is not obvious. Another issue is constituted by the integrals that have to be computed in Bayesian calculations. In the case of high-dimensional data, for example, the integrals involved can become intractable very quickly. But even in one-dimensional problems we might need computational methods to approximate integrals [17].


Example Suppose we have iid (meaning "independent and identically distributed") data {x_1, ..., x_n} drawn from a N(µ, σ²) distribution where µ is unknown and σ² is known. We want to infer µ from the data {x_i}_{i=1}^n. This means that we want to derive the posterior distribution p(µ | {x_1, ..., x_n}). To do this we need a prior, which we choose to be N(0, 1) distributed, and the likelihood:

$$p(\{x_1,\ldots,x_n\} \mid \mu) = \prod_{i=1}^n p(x_i \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\} = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\}$$

So we have

$$p(\mu \mid x_1,\ldots,x_n) = \frac{p(x_1,\ldots,x_n \mid \mu)\, p(\mu)}{\int p(x_1,\ldots,x_n \mid \xi)\, p(\xi)\, d\xi}$$

and after plugging in the prior, likelihood and marginal we find the posterior density of µ given {x_1, ..., x_n} to be

$$p(\mu \mid \{x_1,\ldots,x_n\}) = \frac{\exp\left\{-\frac{\mu^2}{2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\}}{\int \exp\left\{-\frac{\xi^2}{2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\xi)^2\right\}\, d\xi}$$

which is also a normal distribution. This is not a coincidence. In general, if x | µ ~ N(µ, σ²) and µ ~ N(τ, γ²), then

$$\mu \mid x \sim N\!\left(\frac{\gamma^2}{\sigma^2+\gamma^2}\,x + \frac{\sigma^2}{\sigma^2+\gamma^2}\,\tau,\ \left(\frac{1}{\gamma^2} + \frac{1}{\sigma^2}\right)^{-1}\right)$$

In this case, we call the prior on µ a conjugate prior: a distribution p(θ) which is in the same family of distributions as its posterior distribution p(θ | x). Conjugate priors make life easier from a computational point of view, as they give a closed-form expression for the posterior distribution. We are not always this lucky, however. In higher dimensions it can be the case that the integral in the denominator has no closed-form expression; in such a case we have to resort to numerical methods. Even in one-dimensional cases we may have to resort to numerical simulation, for example in the case of a poly-t distribution [16].
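As a quick numerical illustration, the n-observation version of this conjugate update can be checked in base R; the prior N(0, 1), σ = 1 and the true mean 2 are arbitrary choices.

set.seed(1)
sigma <- 1; tau <- 0; gamma <- 1                  # known noise sd, prior N(tau, gamma^2)
x <- rnorm(20, mean = 2, sd = sigma)              # data with true mu = 2
n <- length(x)

# Closed-form conjugate update for n iid observations:
post_var  <- 1 / (n / sigma^2 + 1 / gamma^2)
post_mean <- post_var * (sum(x) / sigma^2 + tau / gamma^2)
c(mean = post_mean, sd = sqrt(post_var))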

1.3.4 Markov Chain Monte Carlo Methods

Doing Bayesian statistics often entails approximating complicated, high-dimensional integrals, such as

$$E[g(X)] = \int g(x)\, p(x)\, dx$$

where X is a multivariate random variable. This is not always easy to do, and numerically evaluating this integral can cost a lot of computer resources. To combat this problem, one can use methods that exploit the probabilistic nature of these integrals. These methods are called Markov Chain Monte Carlo (MCMC) methods and are based on Monte Carlo integration. The idea behind Monte Carlo integration is to use the approximation

$$E_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

Since, by the Law of Large Numbers,

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) = E[f(x)] \qquad \text{almost surely,}$$

we know that if we draw enough iid samples, we can approximate the true expected value. The Markov Chain variant differs in that the samples X_1, ..., X_n are no longer independent and identically distributed but are dependent. Specifically, the fact that they are drawn from a Markov chain means that

$$E(X_n \mid X_{n-1}, \ldots, X_1) = E(X_n \mid X_{n-1}).$$

A well known implementation of an MCMC method is the Metropolis algorithm, which works as follows. Suppose we want to draw samples from a target distribution p(x) that we can evaluate, possibly only up to a normalizing constant, and suppose we have a symmetric proposal density q(x | x_i). Then do:

1. Set i = 0 and draw an initial sample x^(0);

2. generate a proposed new sample x* from q(x | x_i);

3. calculate the ratio a = p(x*)/p(x_i);

4. if a ≥ 1 then set x_{i+1} = x*; else accept x* with probability a, and otherwise set x_{i+1} = x_i. Increase i and repeat from step 2.

The idea of accepting x* with probability a < 1 is that we might 'jump' out of a local maximum of p(x), so that we may find a better candidate.
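A minimal base-R implementation of this algorithm, targeting a standard normal density known only up to its normalizing constant (the step size and starting point are arbitrary choices):

target <- function(x) exp(-x^2 / 2)               # unnormalized target density

metropolis <- function(n_samples, x0 = 0, step = 1) {
  x <- numeric(n_samples)
  x[1] <- x0
  for (i in 2:n_samples) {
    proposal <- rnorm(1, mean = x[i - 1], sd = step)   # symmetric proposal
    a <- target(proposal) / target(x[i - 1])
    x[i] <- if (runif(1) < a) proposal else x[i - 1]   # accept with prob min(a, 1)
  }
  x
}

samples <- metropolis(10000)
c(mean(samples), sd(samples))                     # should be close to 0 and 1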

1.3.5 Machine learning in statistics

The field of machine learning deals with the computational aspects of learning and could be considered a subfield of artificial intelligence. One of the main goals of this field is to develop and research algorithms that enable a machine or artificial agent to learn from the input it receives. Although machine learning does not always focus on probabilistic models, it does have a strong connection with statistics. The treatment of data is a bit different: a data set is divided into a training set, a validation set and a test set. The training set is used to fit models, the validation set is used to estimate prediction error, and the test set is used to assess the generalization error of the chosen model. A key concept in machine learning is that of a loss function. In a generic machine learning setup we have a target variable Y, a vector of inputs X and some prediction model f̂(X). The loss function L(Y, f̂(X)) then gives a measure of error between Y and f̂(X). Common choices of loss functions are

$$-2 \cdot \log P(Y \mid X) \qquad \text{(log-deviance)}$$

$$\sum_{i=1}^{n} (Y_i - \hat{f}(X_i))^2 \qquad \text{(quadratic loss)}$$

$$\sum_{i=1}^{n} |Y_i - \hat{f}(X_i)| \qquad \text{(absolute loss)}$$

If one is interested in robust estimation, in the sense that the effect of outliers should be reduced, one can use loss functions such as the Huber loss function, given by

$$L(Y, \hat{f}(X)) = \begin{cases} (Y - \hat{f}(X))^2 & \text{for } |Y - \hat{f}(X)| \leq \delta \\ 2\delta\, |Y - \hat{f}(X)| - \delta^2 & \text{otherwise} \end{cases}$$

After we've picked a model f̂ we can calculate quantities like the test error, defined as E[L(Y, f̂(X)) | T] for a training set T, or the training error, defined as (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
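The robustness of the Huber loss is easy to see numerically; a small base-R sketch comparing the losses on residuals with one outlier (δ = 1 is an arbitrary choice):

quadratic_loss <- function(r) r^2
absolute_loss  <- function(r) abs(r)
huber_loss     <- function(r, delta = 1)
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)

residuals <- c(-0.3, 0.1, 0.5, 8)                 # the last residual is an outlier
sapply(list(quadratic = quadratic_loss,
            absolute  = absolute_loss,
            huber     = huber_loss),
       function(L) sum(L(residuals)))
# The outlier dominates the quadratic loss but is damped by the Huber loss.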

Example A common problem in machine learning is classification. The goal is to learn the class to which some datapoint x belongs. In a generic classification problem we start with a training set D = {(x_i, y_i)}_{i=1}^n. Here x_i ∈ X for some sample space X, and y_i ∈ Y is an element of a label space Y. In this training set each input x_i is labelled by y_i, making clear to which class an input belongs. After a classifying algorithm is trained on this training set, its purpose is to classify new datapoints.

A more concrete example is given by linear support vector machines. Here the label space is often binary, say Y = {−1, 1}. The goal of a linear SVM is to find a hyperplane that separates the datapoints in D in some optimal fashion. Such a hyperplane can be described by {x | w · x + b = 0}, where w is a vector perpendicular to the hyperplane and |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin. These w and b are learned from the training data. A new test point x₀ is then classified as belonging to +1 or −1 by the decision rule sign(w · x₀ + b). This is of course a very simple example. It could be that the data is not perfectly separable by a single hyperplane, in which case we could try to fit piecewise hyperplanes. In more general settings we can also use nonlinear decision boundaries. By using kernel functions, one can project data that is not linearly separable into a higher-dimensional space to make it separable.
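For completeness, a linear SVM on toy two-dimensional data; this sketch assumes the e1071 package (an R interface to libsvm) is installed.

library(e1071)

set.seed(42)
X <- rbind(matrix(rnorm(40, mean = -1), ncol = 2),   # class -1 cloud
           matrix(rnorm(40, mean =  1), ncol = 2))   # class +1 cloud
y <- factor(rep(c(-1, 1), each = 20))

model <- svm(x = X, y = y, kernel = "linear", scale = FALSE)
predict(model, matrix(c(0.5, 0.5), nrow = 1))        # classify a new point

# The hyperplane parameters can be recovered from the fitted model
# (sign conventions follow libsvm's internal class ordering):
w <- t(model$coefs) %*% model$SV
b <- -model$rho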


Chapter 2

Comparison of methods

In this chapter we review some methods proposed over the last decade. We begin with a frequentist method with a direct application in gene activity inference. We then move on to a Bayesian method and a machine learning method, focusing on the mathematical problem itself.

2.1 A frequentist approach

We look at a frequentist (in Fisherian spirit) approach given in [8] and [9]. The statistical framework provided in these articles is applied to a simple gene regulatory network, called a SIM (single input motif), in a species of antibiotic-producing bacteria (Streptomyces coelicolor).

The article proceeds as follows. First, a dynamical model of gene expression is given which takes into account transcription, decay rate, network structure and stochastic effects. In this model, gene expression is defined as the number of transcribed mRNA molecules at time t, denoted by µ(t). The rate of change is then described by the ordinary differential equation

$$\dot{\mu}(t) = p(t) - \delta\mu(t) \tag{2.1}$$

which has as a general solution

$$\mu(t) = \mu_0 e^{-\delta t} + \int_0^t e^{-\delta(t-\tau)}\, p(\tau)\, d\tau \tag{2.2}$$

Here, p(t) is a production term and δ is some constant, to be interpreted as a degradation rate. The production term p(t) depends on the type of regulation that occurs. In [8], Michaelis-Menten kinetics, having been used before in modelling enzyme reactions, model the production rate, taking into account saturation effects that can occur.

2.1.1 General framework

In a gene regulatory network, we consider k genes and M regulators in general. Each gene i has a gene expression described by µ_i(t). Each µ_i(t) in turn depends on gene-specific parameters θ_i, the decay rate δ_i and the production rate p_i(t). Because each gene is assumed to be regulated by M regulators, µ_i also depends on the regulator activities η_1, ..., η_M. What we want to do now is infer the regulator activity profiles η_i from a set of observations at time points t_1, ..., t_n. Since gene activity measurements are subject to noise, we have to specify how we believe the noise is distributed. Denote by g_icr(t) the observed gene expression of gene i under condition c from replicate r; the condition c expresses whether our gene is of wild type or of mutant type. Let us assume that these g_icr(t) are log-normally distributed, and denote the mean and variance of the log expression of gene i under condition c by m_ic(t) and σ_i². By specifying a probability distribution we've introduced new parameters. To distinguish between the parameters we're originally interested in and the parameters that arise from our statistical model, we sometimes call the former structural parameters and the latter nuisance parameters. In this case, the kinetic parameters are the structural parameters.

Now that we've assumed a distribution from which our observations are drawn, we want to use likelihoods (in true Fisher fashion) to find estimates of the kinetic parameters. Given an observation g_icr(t) and the fact that it is log-normally distributed, we have the likelihood

$$l(m_{ic}(t), \sigma_i^2 \mid g_{icr}(t)) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(\log(g_{icr}(t)) - m_{ic}(t))^2}{2\sigma_i^2}}$$

This isn't quite what we want, since we want to estimate the kinetic parameters θ_i, not the nuisance parameters m_ic(t) and σ_i². However, since E(g_icr(t)) = µ_ic(t) and µ_ic(t) depends on the θ_i, this is not a problem and we have a likelihood of the θ_i given g_icr(t).

2.1.2 Application in a Single Input Motif

In [8] the statistical framework is applied to a Single Input Motif (SIM). This is a pattern in a gene regulatory network where there is one regulatory gene which regulates k genes. The Michaelis-Menten model gives an expression for the production rate in a cell:

$$p(t) = \beta\, \frac{\eta(t)}{\gamma + \eta(t)} + \alpha$$

and the general expression for the gene activity of gene k is

$$\mu_k(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta \int_0^t e^{-\delta(t-\tau)}\, \frac{\eta(\tau)}{\gamma + \eta(\tau)}\, d\tau$$

Because we have no way of knowing what η(t) looks like, we assume it can be approximated by a piecewise constant function, i.e. η = (η_0, ..., η_{N−2}), where η_j is constant on the time interval (t_j, t_{j+1}), j = 0, ..., N − 2. After plugging this into µ_k(t) we get

$$\mu(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta e^{-\delta t}\, \frac{1}{\delta} \sum_{j=0}^{N-2} \left(e^{\delta t_{j+1}} - e^{\delta t_j}\right) \frac{\bar{\eta}_j}{\gamma + \bar{\eta}_j}$$
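The piecewise-constant approximation of µ(t) is straightforward to evaluate numerically; a minimal R sketch (the parameter values and the regulator profile below are made up):

mu_approx <- function(t, mu0, alpha, beta, gamma, delta, tgrid, eta_bar) {
  j <- which(tgrid[-1] <= t)                 # intervals (t_j, t_{j+1}] below t
  s <- sum((exp(delta * tgrid[j + 1]) - exp(delta * tgrid[j])) *
             eta_bar[j] / (gamma + eta_bar[j]))
  (mu0 - alpha / delta) * exp(-delta * t) + alpha / delta +
    beta * exp(-delta * t) * s / delta
}

tgrid   <- seq(0, 10, by = 1)
eta_bar <- rep(1, length(tgrid) - 1)         # constant regulator activity
mu_approx(5, mu0 = 0, alpha = 0.1, beta = 2, gamma = 1, delta = 0.5,
          tgrid = tgrid, eta_bar = eta_bar)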

This statistical model is applied to a cluster of genes which may or may not be regulated by a regulator, in this case cdaR. This gene and the gene cluster it belongs to are associated with the production of naturally derived antibiotics. The goal of the experiment is to find out which genes in the cluster are regulated by cdaR, given some expression data. This is done by taking samples from two different replicates: one has cdaR in the cluster and one has cdaR removed. In the experiment the data is partitioned as g_ctr = µ + κ_c + τ_t + ε_ctr, where κ_c is a term accounting for the presence of the regulator and τ_t is a possible time effect. Using ANOVA, the significance of the presence of cdaR is then investigated. With p-values < 0.01, 17 genes in the cdaR cluster are found to be differentially expressed. Using maximum likelihood estimation the activity profile η̄_t for cdaR is then inferred, and the confidence bounds for η̄_t are obtained through a Wilks procedure.


2.2 A Bayesian approach

For a Bayesian approach to parameter inference we review an article by [2]. Let's first review how one would cast our problem into a Bayesian framework. As before, we have our differential equation ẋ(t) = f(x, t|θ) and our noisy dataset y(t) = x(t) + ε(t), y(t) = (y_1(t), ..., y_n(t)), where we assume ε(t) is N(0, σ²I) distributed for some unknown σ². Since we take a number of measurements at timepoints [t_1, ..., t_m], we can summarize everything by Y = X + E, where Y, X, E are matrices with one row per state component and one column per timepoint:

$$Y = \begin{pmatrix} y_1(t_1) & y_1(t_2) & \cdots & y_1(t_m) \\ y_2(t_1) & y_2(t_2) & \cdots & y_2(t_m) \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix}$$

Remember that we want to infer the parameters θ given the data Y. In Bayesian statistics, one wants to calculate the probability density of the parameters θ given Y. We are therefore interested in finding p(θ | Y). We could find this by marginalizing the distribution of all unknowns in this problem, namely θ, σ and x_0 (in some cases x_0 is given, but let us assume that we are not so lucky). Then by Bayes' theorem we have

$$p(\theta, \sigma, x_0 \mid Y) = \frac{p(\theta, \sigma, x_0)\, p(Y \mid \theta, \sigma, x_0)}{\int p(Y \mid \theta, \sigma, x_0)\, p(\theta, \sigma, x_0)\, d\theta\, d\sigma\, dx_0}$$

The integral in the denominator is of course essential for normalization of the posterior distribution, but nevertheless cumbersome and not really of interest, since it only serves as a normalizing constant. We can summarize what we know by writing

$$p(\theta, \sigma, x_0 \mid Y) \propto p(\theta, \sigma, x_0)\, p(Y \mid \theta, \sigma, x_0)$$

and if we assume that θ, σ and x_0 are independent (which certainly does not seem unreasonable, since θ is obtained from nature and σ is imposed by the measuring system), then we obtain

$$p(\theta, \sigma, x_0 \mid Y) \propto p(\theta)\, p(\sigma)\, p(x_0)\, p(Y \mid \theta, \sigma, x_0)$$

Here, p(θ), p(σ), p(x_0) are priors that we have to specify. Now we can obtain p(θ | Y) by integrating over σ and x_0:

$$p(\theta \mid Y) \propto \int\!\!\int p(\theta)\, p(\sigma)\, p(x_0)\, p(Y \mid \theta, \sigma, x_0)\, d\sigma\, dx_0$$

where we again omitted the normalizing constant $\int p(Y \mid \theta, \sigma, x_0)\, p(\theta, \sigma, x_0)\, d\theta\, d\sigma\, dx_0$.

So far so good, but there is a catch. Since p(Y | σ, θ, x_0) = N(X, σ²I), we need the solution of the differential equation ẋ(t) = f(x, t|θ) to obtain X. This is not a problem for simple differential equations, but in real life we are often faced with differential equations that have no closed-form solution. One would then have to approximate this solution numerically, which can be computationally very expensive. It is this problem that [2] tries to tackle. The main idea behind the presented method is as follows. Since the parameters θ and x_0 are considered random, we can also consider the state variable x(t) to be random. By interpreting x(t) as a random variable we could try to find its distribution and thereby circumvent the need for an explicit solution to the ODE. The distribution of x(t) depends on the function f(x, θ, t) and the priors p(θ), p(x_0).

We cannot just pick any prior for θ and x_0, since the differential equation must be satisfied and hence ẋ(t) must exist. So we have to pick a class of functions which is hopefully rich enough. It turns out that we can accomplish this by using Gaussian processes (see also [15]).


Definition A Gaussian process is a collection of random variables such that any finite number of these random variables have a joint Gaussian distribution.

A Gaussian process can be described by a function f(x) such that, if we take some values x_1, ..., x_n from the domain of f, the vector (f(x_1), ..., f(x_n)) is jointly Gaussian distributed. A Gaussian process is completely determined by its mean function and covariance function (sometimes called a covariance kernel), given by m(x) = E[f(x)] and k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))] respectively. We then write f(x) ~ GP(m(x), k(x, x′)). Note that if the domain of f is finite, we have a Gaussian distribution, so Gaussian processes are more general than Gaussian distributions. Working in the Gaussian setting has other mathematical advantages; the Gaussian family is closed under addition, multiplication, conditioning, and marginalization.
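To build some intuition, a Gaussian process can be sampled at finitely many points by drawing from the corresponding multivariate normal; a small base-R sketch with a squared-exponential kernel (the length-scale of 1 is an arbitrary choice):

t_grid <- seq(0, 5, length.out = 100)
k <- function(s, t) exp(-(s - t)^2 / 2)            # squared-exponential kernel
K <- outer(t_grid, t_grid, k) + 1e-8 * diag(100)   # small jitter for stability
f <- t(chol(K)) %*% rnorm(100)                     # one sample path of the GP
plot(t_grid, f, type = "l")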

Why are we interested in Gaussian processes? Recall that we assumed θ, σ and x_0 to be random, with priors p(θ), p(σ) and p(x_0). Interpreting the state variable as a Gaussian process has the advantage that ẋ(t) exists and is a Gaussian process too:

Theorem Given a Gaussian process {X_t | t ∈ T} for some domain T, pick any subset {t_1, ..., t_n} of T. Then the vector (Ẋ(t_1), ..., Ẋ(t_n)) is normal.

For more details, see [23]. In other words, Gaussian processes have the nice property of ensuring differentiable sample paths.

We assume x(t) is a Gaussian process with x_i(t) ~ GP(0, k_i(s, t)), where k_i(s, t) is some covariance function. This gives a distribution for X, namely p(X | φ) = N(0, C_φ), where C_φ is a covariance matrix whose entries are k_i(t_i, t_j) for time points t_i, t_j and which depends on some parameters represented by a vector φ. Since we have a distribution for X, we also have p(φ, σ | Y), p(X | Y, σ, φ) and, since Ẋ exists, the distribution p(Ẋ | Y, φ, σ). Next, the authors of [2] define a new model on Ẋ given X, θ, γ as p(Ẋ | X, θ, γ) = N(f(X, t|θ), γI).

Since X is now a random variable, we can find a distribution p(θ, σ, φ, X, γ | Y) and then marginalize over every random variable except θ to find p(θ | Y). We can expand p(θ, σ, φ, X, γ | Y) by writing

$$p(\theta, \sigma, \phi, X, \gamma \mid Y) = p(\theta, \gamma \mid X, Y, \phi, \sigma)\, p(X \mid Y, \phi, \sigma)\, p(\phi, \sigma \mid Y)$$

and the distributions on the right-hand side can be determined to find the desired posterior.

The idea that one does not require the solution of the differential equation is an attractive one, but one can raise some questions about the proposed method. The authors use what is called a Product of Experts (PoE) to marry the models p(Ẋ | Y, θ, φ, σ) and p(Ẋ | X, θ, γ). A PoE is a model used in machine learning to combine several marginal distributions into a joint distribution, via

$$P(x \mid \{\theta_j\}) = \frac{1}{Z} \prod_{j=1}^{M} f_j(x \mid \theta_j) \qquad \text{where} \qquad Z = \int \prod_{j=1}^{M} f_j(x \mid \theta_j)\, dx$$

is a normalizing constant. This seems ad hoc: using this model looks like a technical convenience rather than a model founded on empirical observations. In machine learning, where the emphasis is on prediction and the training of programs, this is not an issue. But in problems such as gene activity inference one should have a solid justification for invoking a new model.

2.3 Machine Learning methods

We review a method which could be categorized under Machine Learning.

2.3.1 Smoothing method

We review a method presented in [14]. The problem under consideration is nearly the same as with the previous two methods, with only a few differences. We consider the differential equation

$$\dot{x}(t) = f(x, u, t \mid \theta)$$

where u = u(t) is a vector of input functions and θ is a vector of parameters we want to estimate. The goal is to build an approximation of x and 'tweak' this approximation according to some optimization criterion. In [14] the approximation x̂ of x is given by a basis function expansion. This means that for every x_i, i ∈ {1, ..., n}, we have x̂_i = c_i^T φ_i(t), where c_i is a vector of scalars and φ_i(t) is a vector of basis functions. We hope that we can pick basis functions such that x̂ approximates x reasonably well. As before, we have measurements y_ij at timepoints t_ij, where i ∈ {1, ..., n} and j ∈ {1, ..., m}. So we have y_ij = x_ij + ε_ij, where ε_ij is the error associated with variable x_i at timepoint j. We could also assume we can only partially measure x, but this only makes the notation more cumbersome and we shall ignore this possibility. These errors are assumed to have joint probability density p(ε_i | σ_i), where σ_i is a vector of density-dependent parameters (in the case of normally distributed errors, the σ_i would denote variances, for example). In this setup, our structural parameters are θ and σ, and the nuisance parameters are represented by the vectors c_i. We now let c be the vector one obtains by concatenating all the c_i, and denote by Φ_i the matrix consisting of φ_k(t_ij), the values of φ_k at timepoints t_ij. By writing

$$\Phi = \begin{pmatrix} \Phi_1 & 0 & \cdots & 0 \\ 0 & \Phi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_d \end{pmatrix}$$

we have as our estimator, given our measurements,

$$\begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \vdots \\ \hat{x}_d \end{pmatrix} = \begin{pmatrix} \Phi_1 & 0 & \cdots & 0 \\ 0 & \Phi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_d \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_d \end{pmatrix}$$

To reiterate, we want to estimate c, σ and θ given our dataset and noise model. The estimator ĉ of c is defined to be a function of σ and θ, so we have ĉ = ĉ(θ, σ | λ). We optimize this ĉ through the use of an optimization criterion, denoted by J(ĉ | σ, θ, λ). This criterion must penalize ĉ to the extent that x̂ = ĉᵀφ fails to satisfy the differential equation ẋ(t) = f(x, u, t | θ). The vector λ consists of parameters λ_i which control the amount of regularization, i.e. it expresses the extent to which a deviation from satisfying the differential equation is penalized. These λ_i can be set manually or according to another optimization criterion F(λ).

The structural parameters θ and σ are fitted too. This is done via a data-fitting criterion H(θ, σ | λ), where λ again controls the amount of regularization. H can be any measure of fit of θ and σ to the data y. For example, one could use the (negative) log-likelihood or the error sum of squares. If we take the negative log-likelihood as a measure of fit to the data, we get

$$H(\theta, \sigma \mid \lambda) = -\sum_{i \in I} \ln\left[g(e_i \mid \sigma_i, \theta, \lambda)\right]$$

where I ⊆ {1, ..., d} and the errors are e_ij = y_ij − ĉ_i(σ_i, θ; λ)^T φ_i(t_ij). We can see that H depends on σ and θ both directly and indirectly, namely through the error density g(e_i | σ_i, θ, λ) and the error y_ij − ĉ_i(σ_i, θ | λ)^T φ_i(t_ij) respectively.

So what does J look like? As mentioned before, we want J to have a penalty term which penalizes ĉ to the extent that x̂ fails to satisfy the differential equation. We can express such a penalty term by introducing

$$L_{i,\theta}(x_i) = \dot{x}_i - f_i(x, u, t \mid \theta)$$

which of course is zero when L_{i,θ} is evaluated at the true x_i, but is in general non-zero at an estimator x̂_i. A penalty could then be defined by

$$\mathrm{PEN}_i(\hat{x}) = \int \{L_{i,\theta}(\hat{x}_i(t))\}^2\, dt$$

which accumulates all deviations from the differential equation over a time interval wherein we take measurements. Other penalties can be defined if desired, such as PEN_i(x̂) = ∫ |L_{i,θ}(x̂_i(t))| dt.

These index-dependent penalties can be combined into an overall penalty term via

$$\mathrm{PEN}(\hat{x} \mid \lambda) = \sum_{i=1}^{d} \lambda_i\, \mathrm{PEN}_i(\hat{x})$$

where the λ_i can be adjusted to give additional weight to some penalty terms PEN_i. Finally, the authors combine this penalty term and the data-fitting criterion to obtain

$$J(\hat{c} \mid \sigma, \theta, \lambda) = -\sum_{i \in I} \ln\left[g(e_i \mid \sigma_i, \theta, \lambda)\right] + \mathrm{PEN}(\hat{x} \mid \lambda)$$

We can see that there is a kind of chain reaction in this method. Through H(θ, σ|λ) we obtain a fit for σ, θ which in turn gives an estimate ˆc for c.

Given this setup, we can now obtain an estimate for θ. Noting the dependence of H on the data y, we write H(θ, σ | y, λ). The authors now derive an estimator θ̂(y) of θ as the solution to the equation

$$\frac{\partial H(\sigma, \theta \mid y, \lambda)}{\partial \theta} = 0$$

where ∂H(σ, θ | y, λ)/∂θ can be expressed by calculating its total derivative and applying the implicit function theorem.
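To make the two-level scheme concrete, here is a hedged R sketch for the scalar ODE ẋ = θx: the inner level fits the coefficients c by minimizing a quadratic-loss version of J for fixed θ, and the outer level fits θ through the data-fitting criterion of the profiled estimate ĉ(θ). The polynomial basis, the fixed λ and the quadrature grid are simplifying assumptions of this sketch, not choices prescribed by [14].

set.seed(1)
theta_true <- -1
t_obs <- seq(0, 2, length.out = 20)
y     <- exp(theta_true * t_obs) + rnorm(20, sd = 0.05)

# Polynomial basis and its derivative, plus a fine quadrature grid.
basis  <- function(t) outer(t, 0:5, `^`)
dbasis <- function(t) outer(t, 0:5, function(t, k) ifelse(k == 0, 0, k * t^(k - 1)))
t_fine <- seq(0, 2, length.out = 200)

# Inner criterion J: squared data error plus lambda * integral of (x' - theta x)^2.
J <- function(c_vec, theta, lambda) {
  fit   <- sum((y - basis(t_obs) %*% c_vec)^2)
  resid <- dbasis(t_fine) %*% c_vec - theta * (basis(t_fine) %*% c_vec)
  fit + lambda * mean(resid^2) * diff(range(t_fine))
}
c_hat <- function(theta, lambda)
  optim(rep(0, 6), J, theta = theta, lambda = lambda, method = "BFGS")$par

# Outer criterion: plain data fit of the profiled estimate c_hat(theta).
H <- function(theta, lambda = 100)
  sum((y - basis(t_obs) %*% c_hat(theta, lambda))^2)

optimize(H, interval = c(-5, 0))$minimum   # should land near theta_true = -1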


Chapter 3

A different Bayesian method

In the previous chapter we've seen a Bayesian method that uses Gaussian processes on the state variable x(t). One downside of that method was the invocation of a new model, the Product of Experts. In this chapter we look at a method which also uses the Bayesian framework and places GPs on the state variable, but which has no need for a Product of Experts or any other extraneous model.

3.1 Reproducing Kernel Hilbert Space

We first review some theoretical concepts that we shall need. The first is the notion of a Reproducing Kernel Hilbert Space (RKHS for short).

Definition Let H be a Hilbert space, i.e. a complete inner product space, of real functions f defined on a set X. Then H is called a reproducing kernel Hilbert space with inner product ⟨·, ·⟩_H if there exists a function

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

for which the following properties hold:

• for every x ∈ X the function k_x(y) = k(x, y) belongs to H;

• k is positive definite, i.e. Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) > 0 for all a_1, ..., a_n ∈ R not all zero;

• ⟨f(·), k(·, x)⟩_H = f(x) for f ∈ H.

The latter property is sometimes known as the reproducing property. The function k is called a kernel function, or sometimes just a kernel. The kernel is characteristic for an RKHS, as is explicated by the following theorem (see also [15]).

Theorem 3.1.1. (Moore-Aronszajn theorem) Let X be a set. Then for every positive definite function k(·, ·) on X × X there exists a unique RKHS, and for every RKHS there is a unique kernel function k.

There is a close connection between the kernel k of an RKHS and a corresponding covariance operator/covariance matrix: an n × n matrix C such that f^T C f > 0 for all nonzero f ∈ H, where f = (f(x_1), ..., f(x_n)).

The kernel function is a covariance operator by definition, since

$$f^{T} C f = \sum_{i,j=1}^{n} f(x_i)\, f(x_j)\, k(x_i, x_j) > 0$$

Conversely, a covariance operator induces a kernel function through C_ij = k(x_i, x_j), which is positive definite. We now review Lemma 5 from [22], which gives us some important facts about kernels and covariance operators.

Lemma 3.1.2. The following hold for any RKHS (H, ⟨·, ·⟩_S).

1. The function s : X × X → R defined by s(x_i, x_j) = ⟨S_{x_i}, S_{x_j}⟩_S is a kernel function;

2. any inner product ⟨f, g⟩_S, with f, g ∈ H, can be uniquely expressed in the form f^T T g, where T is a positive definite operator;

3. let S be the operator S : H → H defined by S_ij = s(x_i, x_j); then S = T^{-1}.

The reader is referred to [?] for another overview of the relation between RKHSs and the estimation of parameters of differential equations. What we can take away from this lemma is that for any covariance operator C there is an RKHS with inner product ⟨f, g⟩_C = f^T C g for f, g ∈ H. Conversely, given an inner product from a given RKHS we can construct a unique covariance operator. We now use the connection between the inner product of an RKHS and a covariance operator in our original problem.

3.2 Application to parameter estimation

We look at a simple toy problem. Consider, for t ∈ T ⊆ R and θ ∈ R, θ ≠ 0, the ordinary differential equation

$$\dot{x}(t) = \theta x(t)$$

Suppose we have n samples at timepoints t_1, ..., t_n. We can rewrite this equation as

$$\left(\frac{d}{dt} - \theta\right) x(t) = 0 \qquad \forall t \in T$$

This equation must hold at all timepoints t_i, i.e.

$$\left(\left.\frac{d}{dt}\right|_{t=t_i} - \theta\right) x(t_i) = 0$$

where d/dt|_{t=t_i} means the derivative with respect to t, evaluated at the point t = t_i. We can rewrite this as a matrix equation. Denote by D the matrix

$$D = \begin{pmatrix} \left.\frac{d}{dt}\right|_{t=t_1} & 0 & \cdots & 0 \\ 0 & \left.\frac{d}{dt}\right|_{t=t_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \left.\frac{d}{dt}\right|_{t=t_n} \end{pmatrix}$$


Then we have (D − θI)x = 0, where x = (x_1, ..., x_n) = (x(t_1), ..., x(t_n)). Define L_θ := D − θI. This gives L_θ x = 0, which is equivalent to saying that

$$\langle L_\theta x, L_\theta x \rangle = 0 \iff (L_\theta x)^{*} L_\theta x = 0$$

where A^* denotes the Hermitian adjoint of the operator A. Denote K_θ^{-1} = L_θ^* L_θ. We do not know whether K_θ^{-1} exists in this case. However, we will look at a discretized version of the ODE. In general, discretizing an ODE is not a problem if the solution is sufficiently continuous, since the ODE can be approximated by piecewise-linear approximations (see [?], section 1.1). In the case of our toy problem, we transform the ODE into a difference equation by considering a discretized version of the differential operator D, namely the matrix

$$D_{\mathrm{discr}} = \frac{1}{\Delta t} \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}$$

In this case, (D_discr − θI) is just a matrix, and instead of (L_θ x)^* L_θ x = 0 we have (L_θ x)^T L_θ x = x^T L_θ^T L_θ x = 0.

Notice that L_θ is invertible whenever all timepoints t_i ≠ 0, since (d/dt)(e^{θt})|_{t=t_i} − θ = 0 if and only if t_i = 0. This implies that L_θ^T L_θ is positive definite. Thus, by the previous lemma, K_θ^{-1} induces an inner product, i.e.

$$x^T L_\theta^T L_\theta x = x^T K_\theta^{-1} x = \langle x, x \rangle_{K_\theta^{-1}} = \|x\|^2_{K_\theta^{-1}}$$

So this K_θ^{-1} is also a kernel function and hence a covariance matrix. Note that

$$x^T K_\theta^{-1} x = 0 \iff x^T (\lambda K_\theta)^{-1} x = 0$$

for λ ∈ R, λ > 0. We view x as a zero-mean Gaussian process, which we approximate by a multivariate normal distribution of sufficient dimension. We view our n samples as coming from this multivariate normal distribution; they are thus multivariate normal themselves, since the normal family is closed under marginalization. Since we are looking at a discretized version of the ODE, we let

$$(x_1, \ldots, x_n)^T \sim N(0, (\lambda K_\theta)^{-1}) \qquad \text{where} \qquad K_\theta^{-1} = L_\theta^T L_\theta$$

is built from the discretized version of L_θ. Placing a prior on x has been done before, but in this case we also allow λ to control the extent to which x can deviate from satisfying the differential equation. Note that if λ → ∞ then (λK_θ)^{-1} → 0, and x will center around the 'true' solution. The idea behind using (λK_θ)^{-1} is that we want to find a good choice of λ to obtain estimates for θ. We shall try to find a relationship between λ and an estimate for θ.
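The covariance construction is easy to set up in base R; a minimal sketch (the grid, θ and λ are arbitrary, and the backward difference in the last row of D is my choice to keep the matrix square):

n <- 50; dt <- 0.1
theta <- -3; lambda <- 50

D <- matrix(0, n, n)                            # discretized d/dt
for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt       # backward difference, last row

L <- D - theta * diag(n)                        # L_theta = D - theta * I
K_inv <- crossprod(L)                           # K_theta^{-1} = L^T L
Sigma <- solve(lambda * K_inv)                  # prior covariance (lambda K_theta)^{-1}
x_draw <- t(chol(Sigma + 1e-10 * diag(n))) %*% rnorm(n)   # one draw from the prior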


Now, suppose that we have a noisy dataset y, consisting of measurements at timepoints t_i, i ∈ {1, ..., n}. Suppose these measurements contain additive, normal, iid noise ε ~ N(0, σ²). We then have

$$y_i = x_i + \epsilon_i$$

for all i. Since x ~ N(0, (λK_θ)^{-1}), we have that y | x ~ N(x, σ²I) and y ~ N(0, σ²I + (λK_θ)^{-1}), where I is the identity matrix of appropriate dimension. We now go back to our original problem, i.e. the estimation of θ. We again do this in a Bayesian setting. We would like to obtain, for some fixed λ > 0, the distribution p(θ, σ | y). By Bayes' theorem, this is

$$p(\theta, \sigma \mid y) = \frac{p(y \mid \theta, \sigma)\, p(\sigma, \theta)}{p(y)}$$

Suppose we place a noninformative prior on (σ, θ), i.e. p(σ, θ) ∝ 1. Then we have

$$p(\theta, \sigma \mid y) \propto p(y \mid \theta, \sigma) = N(0, \sigma^2 I + (\lambda K_\theta)^{-1})$$

The joint distribution of x and y is given by

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} (\lambda K_\theta)^{-1} & (\lambda K_\theta)^{-1} \\ (\lambda K_\theta)^{-1} & \sigma^2 I + (\lambda K_\theta)^{-1} \end{pmatrix} \right)$$

We will now investigate the effectiveness of this framework in tackling the aforementioned estimation problem using R.

3.3 Experimentation in R

Since we know that p(θ, σ | y) ∝ p(y | θ, σ) = N(0, σ²I + (λK_θ)^{-1}), we might ask how the likelihood p(y | θ, σ) varies with θ and σ for varying values of λ. We can do this in R as follows. First we generate samplepoints from the true solution of ẋ = θx, giving x = (e^{θt_1}, ..., e^{θt_n}). A noisy dataset is then created via y <- rnorm(n, x, sd = .5), where sd is the standard deviation, which we set to 0.5. Setting the true value of θ to θ = −3, the likelihood p(y | θ, σ) can then be plotted over a grid of (θ, σ) values for several fixed values of λ.

We can see that, as λ gets bigger, the likelihood surface becomes flat in the θ direction for any fixed σ: every θ gives about the same likelihood p(y | θ, σ). Since we want to find an estimate for θ, this is not desirable. To circumvent this problem we can estimate σ before we consider the likelihood. We do this by finding a smoother for y and then using this smoother to estimate σ. Given a smoother S such that ŷ = Sy, an unbiased estimator for σ² is

$$\hat{\sigma}^2 = \frac{1}{n - \mathrm{df}(S)} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where df denotes the degrees of freedom. Now that we have a predetermined estimate of σ, we can investigate the likelihood for varying θ and λ. Given grids θ_intv = (θ_1, ..., θ_k) and λ_intv = (λ_1, ..., λ_l), we evaluate the likelihood for all θ_i ∈ θ_intv and λ_j ∈ λ_intv. This gives us a matrix Z with elements

$$Z_{ij} = p(y \mid \theta_i, \lambda_j)$$


Given this matrix, we might ask how an estimate for θ varies with λ. One possible way of obtaining such an estimate is through the posterior mean of θ given y:

$$E_\lambda(\theta \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$$

where p(θ | y) is the posterior density of θ given y. The subscript λ emphasizes that this estimate depends on our choice of λ. We can approximate this posterior mean by changing the integral to a sum. However, since the posterior is determined only up to a constant, we have

$$E_\lambda(\theta \mid y) \approx \sum_{i=1}^{k} \theta_i\, q(y \mid \theta_i)$$

where q(y | θ_i) = p(y | θ_i)/K for some constant K > 0. To make sure the q(y | θ_i) factors are properly scaled, we let K = Σ_{i=1}^k p(y | θ_i). If we compute this posterior mean for every column of Z (that is, for each λ_j), we get a vector of posterior means:

$$\left(E_{\lambda_1}(\theta \mid y),\, E_{\lambda_2}(\theta \mid y),\, \ldots,\, E_{\lambda_l}(\theta \mid y)\right)$$

Now, consider the loss function

$$L(\theta_{\mathrm{true}}, \hat{\theta}) = |\theta_{\mathrm{true}} - \hat{\theta}|$$

where θ_true is the value of θ we used to generate x and y. We might ask for which λ we have minimal loss, i.e. for which λ

$$\lambda^{*} = \operatorname{argmin}_{\lambda_i}\, L(\theta_{\mathrm{true}}, E_{\lambda_i}(\theta \mid y))$$

and how this changes for different sample sizes and σ. To get a sense of what's happening, we can implement this setup in R as follows:

Set n and sigma; generate y = x + epsilon, where epsilon has variance sigma^2 * I
Define grids for theta and lambda
Calculate the matrix Z
For each lambda (each column of Z):
    compute the posterior mean of theta
Return the lambda which minimizes L(theta_true, E_lambda(theta | y))
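A hedged R implementation of this outline is given below; the forward-difference discretization of D, the step size (chosen so that L_θ stays invertible on the grid) and the use of dmvnorm from the mvtnorm package are choices of this sketch rather than of the original script, which can be found in the appendix.

library(mvtnorm)                                # provides dmvnorm

optimal_lambda <- function(n = 10, sigma = 0.5, theta_true = -3,
                           theta_grid  = seq(-10, -0.1, .5),
                           lambda_grid = seq(1, 200, .5)) {
  dt <- 0.15                                    # keeps L_theta invertible on the grid
  t_grid <- seq(dt, by = dt, length.out = n)
  y <- rnorm(n, exp(theta_true * t_grid), sd = sigma)

  D <- matrix(0, n, n)                          # discretized d/dt as before
  for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
  D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt

  lik <- function(theta, lambda) {              # p(y | theta, sigma) for this lambda
    L <- D - theta * diag(n)
    dmvnorm(y, sigma = sigma^2 * diag(n) + solve(lambda * crossprod(L)))
  }
  Z <- outer(theta_grid, lambda_grid, Vectorize(lik))   # Z_ij = p(y | theta_i, lambda_j)

  # Posterior mean of theta per column (per lambda), flat prior on theta.
  post_mean <- apply(Z, 2, function(col) sum(theta_grid * col) / sum(col))
  lambda_grid[which.min(abs(theta_true - post_mean))]
}

set.seed(7)
optimal_lambda()                                # one optimal lambda for one dataset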


If we repeat this a number of times and save the results in an array we get a vector of lambdas for which the loss function was minimized. This script can be run for different values of σ and n. The results can then be visualized by a histogram.

We can see that in this case, more often than not, the optimal value for λ is found between zero and fifty. We find a similar pattern if we repeat the script for different n and different σ (the R code can be found in the appendix).

Now, suppose that we have a set of optimal lambdas λ^opt = {λ_1^opt, ..., λ_m^opt}. Which λ_i^opt do we pick? We could use the element of λ^opt that occurs most often; we thus take the mode of the set {λ_1^opt, ..., λ_m^opt}, yielding λ_mode^opt. An estimate for θ is then E_{λ_mode^opt}(θ | y). If the mode is not unique, we pick one at random. We can see how this setup performs in R. We choose θ_true = −3 and σ = 0.5. Then we repeat the following 100 times. We generate 10 datapoints y = (y_1, ..., y_10). We calculate the matrix Z by setting θ_intv = seq(-10, -0.1, .5) and λ_intv = seq(1, 200, .5) (the choice of θ_intv is rather arbitrary, but since y shows decreasing samplepoints it is reasonable to assume θ will be negative). Each time an optimal lambda is found and stored in an array. Afterwards, the mode of this array is computed to obtain λ_mode^opt. In this simulation the mode was found to be λ_mode^opt = 54. In this case the (approximate) posterior mean of θ is equal to

$$E_{\lambda=54}(\theta \mid y) = -2.443943$$

If we repeat this experiment for different n, we obtain the following estimates for θ:

n     λ_mode   E_λ(θ|y)
10    54       -2.443943
30    31       -1.96768
50    47       -2.394646
100   41       -3.023044

which are arguably reasonable estimates.

3.4 Application to a nonhomogeneous ODE

We now try to apply the previous framework to a nonhomogeneous ordinary differential equation. Specifically, we consider a differential equation which was also discussed in [8]:

$$\dot{\mu}(t) = p(t) - \delta\mu(t)$$

where δ is a constant. As we've seen before, this ODE models protein activity in a cell. The function p(t) is given by Michaelis-Menten kinetics:

$$p(t) = \beta\, \frac{\nu(t)}{\gamma + \nu(t)} + \alpha$$

and this gives as general solution

$$\mu(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta \int_0^t e^{-\delta(t-\tau)}\, \frac{\nu(\tau)}{\gamma + \nu(\tau)}\, d\tau$$

In [8] the function ν(t) is taken to be piecewise constant. In our case we shall keep things simpler, for ease of exposition: we take ν(t) = 1 for all t ∈ Dom(ν) and µ_0 = 0. The parameters we have to estimate are now θ = (α, β, γ, δ) (if one wishes, one could append µ_0 and the piecewise constant ν(t_i) to θ). The explicit solution can be approximated by (see [8], section 4.3)

$$\mu(t) = -\frac{\alpha}{\delta} e^{-\delta t} + \frac{\alpha}{\delta} + \frac{\beta}{\delta} e^{-\delta t} \sum_{j=1}^{n-1} \frac{e^{\delta t_{j+1}} - e^{\delta t_j}}{1 + \gamma}$$

With these modifications our ODE is

$$\dot{\mu}(t) = \frac{\beta}{1 + \gamma} + \alpha - \delta\mu(t)$$

which implies

$$[D + \delta I]\, \mu = \left(\frac{\beta}{1 + \gamma} + \alpha\right) \mathbf{1}_n$$
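The homogeneous construction then carries over by penalizing ‖(D + δI)µ − b‖² with b = (β/(1+γ) + α)1_n, which suggests a Gaussian prior on µ with mean L^{-1}b and covariance (λLᵀL)^{-1}. The sketch below is my extrapolation of the preceding sections, with made-up parameter values, not code from the thesis.

n <- 50; dt <- 0.1
alpha <- 0.5; beta <- 2; gamma <- 1; delta <- 0.8; lambda <- 50

D <- matrix(0, n, n)                            # discretized d/dt as before
for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt

L <- D + delta * diag(n)                        # operator of the forced ODE
b <- rep(beta / (1 + gamma) + alpha, n)         # constant forcing term
prior_mean <- solve(L, b)                       # discrete solution of the ODE
Sigma      <- solve(lambda * crossprod(L))      # prior covariance
mu_draw <- prior_mean + t(chol(Sigma + 1e-10 * diag(n))) %*% rnorm(n)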
