
Estimation of parameters of ordinary differential equations

Stijn de Vos 1541234

Master Thesis in Mathematics

December 19, 2012


Estimation of parameters of ordinary differential equations

Summary

The problem of inferring parameters of differential equations from noisy data is of great importance in the study of gene-regulatory networks. In this thesis we look at the problem of parameter estimation. First, three methods are reviewed, each from a different discipline (frequentist, Bayesian, machine learning). Then we propose and investigate a Bayesian method, which we apply to various toy problems.

Master Thesis in Mathematics
Author: Stijn de Vos (1541234)
First supervisor: E.C. Wit
Second supervisor: W.P. Krijnen
External supervisor:
Date: December 19, 2012

Institute of Mathematics and Computing Science
P.O. Box 407
9700 AK Groningen
The Netherlands


Contents

1 Introduction
  1.1 Differential Equations and Modelling in Biology
  1.2 Mathematical formulation
  1.3 Types of inference
    1.3.1 Frequentist methods
    1.3.2 The likelihood principle
    1.3.3 Bayesian methods
    1.3.4 Markov Chain Monte Carlo Methods
    1.3.5 Machine learning in statistics

2 Comparison of methods
  2.1 A frequentist approach
    2.1.1 General framework
    2.1.2 Application in a Single Input Motif
  2.2 A Bayesian approach
  2.3 Machine Learning methods
    2.3.1 Smoothing method

3 A different Bayesian method
  3.1 Reproducing Kernel Hilbert Space
  3.2 Application to parameter estimation
  3.3 Experimentation in R
  3.4 Application to a nonhomogeneous ODE

4 Multidimensional systems and model selection
  4.1 Systems of ODEs
  4.2 Bayesian information criterion and Bayes factors
  4.3 Conclusion

A R source code

Bibliography


Chapter 1

Introduction

From a scientific point of view, it is of great importance to understand how genes activate or repress each other. It could explain why humans and mice share about 85% of their genes, and yet are completely different organisms. It could also help us understand how to treat genetic diseases more efficiently. The collection of all gene interactions is called the gene regulatory network. The mapping and understanding of this network is a challenge: although an exact count is not known, the human genome is estimated to carry between 20,000 and 30,000 genes. To make things worse, any number of genes may jointly affect any number of other genes. Also, determining which genes influence each other is one thing, but determining in what way they do so is another matter. The problem of modelling gene regulatory networks is being investigated in a number of ways. Some common methodologies are ([21]): network topology models (e.g. [10]), network control logic models (e.g. [25]), and dynamic models (e.g. [3]).

In this thesis we will focus on a mathematical problem connected to the dynamic modelling approach. First, we shall review the difference between frequentist, Bayesian and machine learning methods. In chapter two we will review a number of articles that present a solution to the problem (see below). In chapter three we will consider technical and computational aspects of Bayesian methods. Finally, we will attempt to improve the Bayesian approach given by [2] in chapter four.

1.1 Differential Equations and Modelling in Biology

Since the rise of systems biology, mathematics has had an increasing influence on tackling biological problems. As can be seen in [19], research has been done on the modelling of gene-regulatory networks and intracellular biochemical reactions in general. A popular method of expressing biochemical reactions mathematically is the Power Law Formalism, which is a system of nonlinear ordinary differential equations written in the form of a Law of Mass Action:

$$\dot{x}_i(t) = F_i(x_1, \ldots, x_n) = \alpha_i \prod_{m=1}^{N} x_m^{P_{im}} - \beta_i \prod_{m=1}^{N} x_m^{Q_{im}}$$

where i ∈ {1, ..., N}, the α_i, β_i are kinetic rates and P_im, Q_im are matrices of stoichiometric coefficients. This formalism was applied to a variety of biological problems. Later this framework was generalized and dubbed S-Systems. Not only was this framework applicable to a variety of biological and mathematical problems, it also provided biologists with a way to investigate a complex system in detail or as a whole. In other words, this framework could be applied to a single cell, a network of cells or a whole organism.
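To make this concrete, a small two-species power-law system can be simulated numerically in R; the sketch below assumes the deSolve package is available, and the rates and exponent matrices are made-up illustrative values.

library(deSolve)

# dx_i/dt = alpha_i * prod_m x_m^P[i,m] - beta_i * prod_m x_m^Q[i,m]
alpha <- c(2, 1); beta <- c(1, 1)                 # made-up kinetic rates
P <- matrix(c(0, 1,
              1, 0), nrow = 2, byrow = TRUE)      # production exponents
Q <- matrix(c(1, 0,
              0, 1), nrow = 2, byrow = TRUE)      # degradation exponents

s_system <- function(t, x, parms) {
  prodP <- apply(P, 1, function(p) prod(x^p))
  prodQ <- apply(Q, 1, function(q) prod(x^q))
  list(alpha * prodP - beta * prodQ)
}

out <- ode(y = c(x1 = 1, x2 = 0.5), times = seq(0, 10, by = 0.1),
           func = s_system, parms = NULL)
head(out)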

Unfortunately there are limitations to using ordinary differential equations. At the microscopic level, reality is of a stochastic nature. This stochasticity has several origins:

• Randomness on the molecular level; during the formation of heavy molecules, each molecule can assume a different quantum state.

• Randomness on the chemical level; a chemical reaction may depend on any number of complex processes depending on internal variables. In large systems this stochasticity is of no consequence, but in living cells the number of biomolecules may not be as large, and the resulting process may show large fluctuations.

• Randomness in gene expression; in an individual cell, the production of proteins and enzymes was discovered to follow a stochastic pattern with a random distribution of intensities and durations. The transcription factor concentrations and the rate of protein production fluctuate between individual cells as well (see [1] and [20]).

So the use of ordinary differential equations is not always legitimate. In some scenarios one should resort to partial or stochastic differential equations.

1.2 Mathematical formulation

In a dynamic modelling setting, we use an ordinary, partial or stochastic differential equation (ODE, PDE, SDE, respectively) to model gene activity. Here, gene activity can be defined as, for example, the number of mRNA molecules present in a cell. We let x_i denote the activity of gene i, i ∈ {1, ..., n}. In the ODE case, the differential equation governing the gene activities is denoted by

$$\dot{x}(t) = f(x(t), t \mid \theta)$$

Here, x(t) = (x_1(t), x_2(t), ..., x_n(t)) ∈ R^n is the state variable and f : R^n → R^n. We let θ = (θ_1, θ_2, ..., θ_p) ∈ R^p denote a vector of parameters, such as gene-specific parameters, activity levels of other genes, et cetera. We now have to infer what these parameters are.

Since accurately measuring the number of mRNA molecules is hard to accomplish, the data we are given contain noise. This means that instead of observing the state vector x(t), we have noisy measurements y(t) = x(t) + ε(t). The problem, then, is this: suppose we have a noisy data set {y_i}_{i=1}^n, consisting of noisy observations of a state vector x(t) that satisfies some differential equation. How can we infer the parameters θ_1, ..., θ_p of this differential equation?

Of course, reality could make our life even harder. It could be that our differential equation has no closed-form solution, for example. Or maybe we can only observe a subvector of the state vector x(t). Or maybe we are not dealing with an ordinary differential equation but with a stochastic differential equation. In chapter two we will encounter a Bayesian method that deals with the case of a missing closed-form solution. The other two complications remain largely untouched in this thesis.


1.3 Types of inference

We look at three different types of inference: frequentist, Bayesian and machine learning methods.

1.3.1 Frequentist methods

Although the way frequentist statistics is taught today suggests that the discipline has always had one unified view on inference, this is not the case. Most people will associate frequentist methods with concepts like hypothesis testing, confidence intervals and significance levels. These were, however, ideas originating from different views on hypothesis testing [11]. In the twentieth century there was a lot of debate between two camps. On the one hand there were Neyman and Pearson, who are now well known for their null- and alternative-hypothesis setup. On the other hand there was R. A. Fisher, who introduced the concepts of significance levels and mathematical likelihood. The debate between these two camps focused, among other things, on the role of probability theory in statistics. Sometimes Fisher's likelihood approach is not considered part of the frequency tradition. Nowadays, statisticians use a mix of these concepts, and the distinction between the different views on frequentist statistics is not always made. In general, frequentists view the concept of probability as a long-run frequency. That is, the probability of an event is the proportion of occurrences of that event in an infinite number of trials. In statistics, to adhere to the frequentist tradition usually means that one considers the obtained data to be random and the to-be-estimated parameters to be fixed. We then formulate a test statistic and a null hypothesis. If, under the null hypothesis, the test statistic crosses a pre-determined threshold, we reject the null hypothesis; otherwise we fail to reject it. Consider the following example.

Example Suppose we obtain data {x_i}_{i=1}^n, which we think are independent and identically distributed (iid) from a N(µ, σ²) distribution where σ² is known. We want to infer the population mean µ and formulate the hypothesis that µ = µ₀. Since we know the variance, we can use a z-test (with the variance unknown and a small sample, a t-test would be appropriate). A z-test uses a test statistic T, which is a function of our data {x_i}_{i=1}^n. This T is in a way a summary of our data and, if the population is normally distributed, also follows a normal distribution. Given T and its standard error s, we compute the z-score z = (T − µ₀)/s, from which we can compute a p-value P(|Z| ≥ |z|). Given some significance level, we then reject or fail to reject the hypothesis that µ = µ₀.
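As a minimal illustration, this z-test takes only a few lines of base R; the sample, µ₀ and σ below are made-up values.

set.seed(1)
sigma <- 2; mu0 <- 5
x <- rnorm(30, mean = 5.5, sd = sigma)            # sample whose true mean is 5.5

z <- (mean(x) - mu0) / (sigma / sqrt(length(x)))  # z-score of the sample mean
p_value <- 2 * pnorm(-abs(z))                     # two-sided p-value
c(z = z, p = p_value)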

1.3.2 The likelihood principle

Statistical inference makes use of a couple of elementary principles. One of these originated with Fisher but was championed by statisticians such as Edwards and Birnbaum. Edwards ([4]) stated the likelihood principle as follows:

Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data. (...) For a continuum of hypotheses, this principle asserts that the likelihood function contains all the necessary information.


Authors sometimes distinguish between a weak and a strong version of this principle. The weak version is formulated much like Edwards' statement above, while the strong version also refers to sequential experiments. According to the strong likelihood principle, if two experiments yield equivalent likelihood functions, then we should make the same inference. Here, a likelihood function is called equivalent to another likelihood function if they differ only by a constant multiple. It is especially the strong version that is controversial. Consider the following problem. Suppose we want to test the fairness of a coin, i.e. we want to test whether P(heads) = θ = 1/2. In a frequentist hypothesis test, we could take H₀: θ = 1/2 versus H₁: θ > 1/2. Suppose we tossed the coin 12 times and observed 9 heads (and thus 3 tails). We could perform this experiment in the following ways:

(1) The number of tosses is predetermined, i.e. we decided before running the experiment that we were going to toss the coin 12 times. Then the number of heads X has a binomial distribution B(12, θ) with likelihood

$$l_1(k \mid \theta) = \binom{12}{k} \theta^{k} (1-\theta)^{12-k}, \qquad l_1(9 \mid \theta) = 220\, \theta^{9} (1-\theta)^{3}$$

The p-value is

$$P(X \geq 9 \mid H_0) = \sum_{k=9}^{12} \binom{12}{k} \left(\frac{1}{2}\right)^{12} = 0.073$$

If we take the very common significance level of α = 0.05, then the hypothesis that the coin is fair is not rejected.

(2) The number of tails is predetermined, i.e. we decide to continue tossing the coin until we reach 3 tails. Now the number of heads X has a negative binomial distribution NB(3, 1 − θ). The likelihood function is

$$l_2(\theta \mid k = 9) = \binom{11}{2} \theta^{9} (1-\theta)^{3} = 55\, \theta^{9} (1-\theta)^{3}$$

which is a constant multiple of l_1, so l_2 is equivalent to l_1. However, in this case the p-value is

$$P(X \geq 9 \mid H_0) = \sum_{x=9}^{\infty} \binom{3+x-1}{x}\, 2^{-x-3} = 0.0327$$

leading to the rejection of H₀. So even though the likelihood functions are equivalent, they lead to different conclusions. Although this might make one wonder whether to adopt the likelihood principle or not, Birnbaum presented a proof in 1962 that claimed to show that the likelihood principle can be derived from two other, less controversial, statistical principles. The first one, the sufficiency principle, states:

Let two observations x and y be such that T(x) = T(y) for some sufficient statistic T for the family f(·|θ). Then the inferences on θ based on x and y should be the same.

The second one is the conditionality principle, which says that if we pick an experiment concerning the inference on some parameter θ from a collection of possible experiments, then any experiment not chosen is irrelevant to the inference. These principles are arguably less controversial, so it would seem that we are justified, by Birnbaum's proof, in accepting the likelihood principle. However, the proof is not accepted by everyone. See, for instance, the critique of Birnbaum's proof given by Deborah Mayo in [12].
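Returning to the coin example: both p-values can be reproduced in a few lines of base R, which makes the conflict between the two designs concrete.

# Binomial design: 12 tosses fixed, 9 heads observed.
sum(dbinom(9:12, size = 12, prob = 0.5))     # 0.073

# Negative binomial design: toss until the 3rd tail, counting heads.
# pnbinom(q, size, prob) gives P(X <= q) for the number of heads
# ('failures') before the 3rd tail ('success').
1 - pnbinom(8, size = 3, prob = 0.5)         # 0.0327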

1.3.3 Bayesian methods

Bayesian statistics, founded on a subjective view of probability theory, differs fundamentally from frequentist statistical methods. A key difference is the view that the data is not random and that we can express our knowledge about underlying models and their parameters in terms of probabilities. These probabilities are taken to be conditional on the given data. A probability is not a proportion or frequency, but a measure of knowledge. The famous Bayes' Theorem has been known since 1763, but only became famous after Pierre-Simon Laplace generalized Bayes' original findings in 1774 [5]. Bayesian statistics fell out of favor at the start of the twentieth century when Fisher presented his framework for mathematical statistics, both for philosophical and practical reasons. This changed as the century progressed, however, in part because of advances in computer technology and research into methods that could overcome the computational difficulties associated with Bayesian statistics, most notably Markov Chain Monte Carlo (MCMC) methods. In Bayesian statistics, one computes the posterior distribution of a to-be-inferred parameter θ, given some dataset {x_i}_{i=1}^n and a prior distribution, which has to be supplied. Using Bayes' theorem, these concepts are related via

$$p(\theta \mid \{x_i\}_{i=1}^n) = \frac{p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)}{p(\{x_i\}_{i=1}^n)}$$

where p(A|B) is the probability density function (pdf) of A, conditional on B. The expression p({x_i}_{i=1}^n) can be expanded as

$$p(\{x_i\}_{i=1}^n) = \int p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)\, d\theta$$

The pdf p({x_i}_{i=1}^n | θ) is also called a likelihood, so, heuristically, Bayes' theorem can be expressed as

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Marginal}}$$

The expression $\left(\int p(\{x_i\}_{i=1}^n \mid \theta)\, p(\theta)\, d\theta\right)^{-1}$ serves as a normalizing constant and is often not really of interest. It is sometimes ignored, and Bayes' theorem is expressed as

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

This method of doing statistics brings its own difficulties. For one, the specification of a prior is often non-trivial, and even the use of a 'non-informative', 'flat' or 'vague' prior such as a uniform distribution is not obvious. Another issue is constituted by the integrals that have to be computed in Bayesian calculations. In the case of high-dimensional data, for example, the integrals involved can become intractable very quickly. But even in one-dimensional problems we might need computational methods to approximate integrals [17].


Example Suppose we have iid (meaning "independent and identically distributed") data {x_1, ..., x_n} drawn from a N(µ, σ²) distribution where µ is unknown and σ² is known. We want to infer µ from the data {x_i}_{i=1}^n. This means that we want to derive the posterior distribution p(µ | {x_1, ..., x_n}). To do this we need a prior, which we choose to be N(0, 1) distributed, and the likelihood:

$$p(\{x_1,\ldots,x_n\} \mid \mu) = \prod_{i=1}^n p(x_i \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\} = (2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\}$$

So we have

$$p(\mu \mid x_1,\ldots,x_n) = \frac{p(x_1,\ldots,x_n \mid \mu)\, p(\mu)}{\int p(x_1,\ldots,x_n \mid \xi)\, p(\xi)\, d\xi}$$

and after plugging in the prior, likelihood and marginal we find the posterior density of µ given {x_1, ..., x_n} to be

$$p(\mu \mid \{x_1,\ldots,x_n\}) = \frac{\exp\left\{-\frac{\mu^2}{2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\}}{\int \exp\left\{-\frac{\xi^2}{2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\xi)^2\right\}\, d\xi}$$

which is also a normal distribution. This is not a coincidence. In general, if x | µ ~ N(µ, σ²) and µ ~ N(τ, γ²), then

$$\mu \mid x \sim N\!\left(\frac{\gamma^2}{\sigma^2+\gamma^2}\,x + \frac{\sigma^2}{\sigma^2+\gamma^2}\,\tau,\ \left(\frac{1}{\gamma^2} + \frac{1}{\sigma^2}\right)^{-1}\right)$$

In this case, we call the prior on µ a conjugate prior: a distribution p(θ) which is in the same family of distributions as its posterior distribution p(θ | x). Conjugate priors make life easier from a computational point of view, as they give a closed-form expression for the posterior distribution. We are not always this lucky, however. In higher dimensions it can be the case that the integral in the denominator has no closed-form expression; in such a case we have to resort to numerical methods. Even in one-dimensional cases we may have to resort to numerical simulation, for example in the case of a poly-t distribution [16].
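As a quick numerical illustration, the n-observation version of this conjugate update can be checked in base R; the prior N(0, 1), σ = 1 and the true mean 2 are arbitrary choices.

set.seed(1)
sigma <- 1; tau <- 0; gamma <- 1                  # known noise sd, prior N(tau, gamma^2)
x <- rnorm(20, mean = 2, sd = sigma)              # data with true mu = 2
n <- length(x)

# Closed-form conjugate update for n iid observations:
post_var  <- 1 / (n / sigma^2 + 1 / gamma^2)
post_mean <- post_var * (sum(x) / sigma^2 + tau / gamma^2)
c(mean = post_mean, sd = sqrt(post_var))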

1.3.4 Markov Chain Monte Carlo Methods

Doing Bayesian statistics often entails approximating complicated, high-dimensional integrals, such as

$$E[g(X)] = \int g(x)\, p(x)\, dx$$

where X is a multivariate random variable. This is not always easy to do, and numerically evaluating this integral can cost a lot of computer resources. To combat this problem, one can use methods that exploit the probabilistic nature of these integrals. These methods are called Markov Chain Monte Carlo (MCMC) methods and are based on Monte Carlo integration. The idea behind Monte Carlo integration is to use the approximation

$$E_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$

Since, by the Law of Large Numbers,

$$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} f(x_i) = E[f(x)] \qquad \text{almost surely,}$$

we know that if we draw enough iid samples, we can approximate the true expected value. The Markov Chain variant differs in that the samples X_1, ..., X_n are no longer independent and identically distributed but are dependent. Specifically, the fact that they are drawn from a Markov chain means that

$$E(X_n \mid X_{n-1}, \ldots, X_1) = E(X_n \mid X_{n-1}).$$

A well known implementation of an MCMC method is the Metropolis algorithm, which works as follows. Suppose we want to draw samples from a target distribution p(x) that we can evaluate, possibly only up to a normalizing constant, and suppose we have a symmetric proposal density q(x | x_i). Then do:

1. Set i = 0 and draw an initial sample x^(0);

2. generate a proposed new sample x* from q(x | x_i);

3. calculate the ratio a = p(x*)/p(x_i);

4. if a ≥ 1 then set x_{i+1} = x*; else accept x* with probability a, and otherwise set x_{i+1} = x_i. Increase i and repeat from step 2.

The idea of accepting x* with probability a < 1 is that we might 'jump' out of a local maximum of p(x), so that we may find a better candidate.
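A minimal base-R implementation of this algorithm, targeting a standard normal density known only up to its normalizing constant (the step size and starting point are arbitrary choices):

target <- function(x) exp(-x^2 / 2)               # unnormalized target density

metropolis <- function(n_samples, x0 = 0, step = 1) {
  x <- numeric(n_samples)
  x[1] <- x0
  for (i in 2:n_samples) {
    proposal <- rnorm(1, mean = x[i - 1], sd = step)   # symmetric proposal
    a <- target(proposal) / target(x[i - 1])
    x[i] <- if (runif(1) < a) proposal else x[i - 1]   # accept with prob min(a, 1)
  }
  x
}

samples <- metropolis(10000)
c(mean(samples), sd(samples))                     # should be close to 0 and 1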

1.3.5 Machine learning in statistics

The field of machine learning deals with the computational aspects of learning and could be considered a subfield of artificial intelligence. One of the main goals of this field is to develop and research algorithms that enable a machine or artificial agent to learn from the input it receives. Although machine learning does not always focus on probabilistic models, it does have a strong connection with statistics. The treatment of data is a bit different: a data set is divided into a training set, a validation set and a test set. The training set is used to fit models, the validation set is used to estimate prediction error, and the test set is used to assess the generalization error of the chosen model. A key concept in machine learning is that of a loss function. In a generic machine learning setup we have a target variable Y, a vector of inputs X and some prediction model f̂(X). The loss function L(Y, f̂(X)) then gives a measure of error between Y and f̂(X). Common choices of loss functions are

$$-2 \cdot \log P(Y \mid X) \qquad \text{(log-deviance)}$$

$$\sum_{i=1}^{n} (Y_i - \hat{f}(X_i))^2 \qquad \text{(quadratic loss)}$$

$$\sum_{i=1}^{n} |Y_i - \hat{f}(X_i)| \qquad \text{(absolute loss)}$$

If one is interested in robust estimation, in the sense that the effect of outliers should be reduced, one can use loss functions such as the Huber loss function, given by

$$L(Y, \hat{f}(X)) = \begin{cases} (Y - \hat{f}(X))^2 & \text{for } |Y - \hat{f}(X)| \leq \delta \\ 2\delta\, |Y - \hat{f}(X)| - \delta^2 & \text{otherwise} \end{cases}$$

After we've picked a model f̂ we can calculate quantities like the test error, defined as E[L(Y, f̂(X)) | T] for a training set T, or the training error, defined as (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
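The robustness of the Huber loss is easy to see numerically; a small base-R sketch comparing the losses on residuals with one outlier (δ = 1 is an arbitrary choice):

quadratic_loss <- function(r) r^2
absolute_loss  <- function(r) abs(r)
huber_loss     <- function(r, delta = 1)
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)

residuals <- c(-0.3, 0.1, 0.5, 8)                 # the last residual is an outlier
sapply(list(quadratic = quadratic_loss,
            absolute  = absolute_loss,
            huber     = huber_loss),
       function(L) sum(L(residuals)))
# The outlier dominates the quadratic loss but is damped by the Huber loss.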

Example A common problem in machine learning is classification. The goal is to learn the class to which some datapoint x belongs. In a generic classification problem we start with a training set D = {(x_i, y_i)}_{i=1}^n. Here x_i ∈ X for some sample space X, and y_i ∈ Y is an element of a label space Y. In this training set each input x_i is labelled by y_i, making clear to which class an input belongs. After a classifying algorithm is trained on this training set, its purpose is to classify new datapoints.

A more concrete example is given by linear support vector machines. Here the label space is often binary, say Y = {−1, 1}. The goal of a linear SVM is to find a hyperplane that separates the datapoints in D in some optimal fashion. Such a hyperplane can be described by {x | w · x + b = 0}, where w is a vector perpendicular to the hyperplane and |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin. These w and b are learned from the training data. A new test point x₀ is then classified as belonging to +1 or −1 by the decision rule sign(w · x₀ + b). This is of course a very simple example. It could be that the data is not perfectly separable by a single hyperplane, in which case we could try to fit piecewise hyperplanes. In more general settings we can also use nonlinear decision boundaries. By using kernel functions, one can project data that is not linearly separable into a higher-dimensional space to make it separable.
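For completeness, a linear SVM on toy two-dimensional data; this sketch assumes the e1071 package (an R interface to libsvm) is installed.

library(e1071)

set.seed(42)
X <- rbind(matrix(rnorm(40, mean = -1), ncol = 2),   # class -1 cloud
           matrix(rnorm(40, mean =  1), ncol = 2))   # class +1 cloud
y <- factor(rep(c(-1, 1), each = 20))

model <- svm(x = X, y = y, kernel = "linear", scale = FALSE)
predict(model, matrix(c(0.5, 0.5), nrow = 1))        # classify a new point

# The hyperplane parameters can be recovered from the fitted model
# (sign conventions follow libsvm's internal class ordering):
w <- t(model$coefs) %*% model$SV
b <- -model$rho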


Chapter 2

Comparison of methods

In this chapter we review some methods proposed over the last decade. We begin with a frequentist method with a direct application in gene activity inference. We then move on to a Bayesian method and a machine learning method, focusing on the mathematical problem itself.

2.1 A frequentist approach

We look at a frequentist (in Fisherian spirit) approach given in [8] and [9]. The statistical framework provided in these articles is applied to a simple gene regulatory network, called a SIM (single input motif), in a species of antibiotic-producing bacteria (Streptomyces coelicolor).

The article proceeds as follows. First, a dynamical model of gene expression is given which takes into account transcription, decay rate, network structure and stochastic effects. In this model, gene expression is defined as the number of transcribed mRNA molecules at time t, denoted by µ(t). The rate of change is then described by the ordinary differential equation

$$\dot{\mu}(t) = p(t) - \delta\mu(t) \tag{2.1}$$

which has as a general solution

$$\mu(t) = \mu_0 e^{-\delta t} + \int_0^t e^{-\delta(t-\tau)}\, p(\tau)\, d\tau \tag{2.2}$$

Here, p(t) is a production term and δ is some constant, to be interpreted as a degradation rate. The production term p(t) depends on the type of regulation that occurs. In [8], Michaelis-Menten kinetics, having been used before in modelling enzyme reactions, model the production rate, taking into account saturation effects that can occur.

2.1.1 General framework

In a gene regulatory network, we consider k genes and M regulators in general. Each gene i has a gene expression described by µ_i(t). Each µ_i(t) in turn depends on gene-specific parameters θ_i, the decay rate δ_i and the production rate p_i(t). Because each gene is assumed to be regulated by M regulators, µ_i also depends on the regulator activities η_1, ..., η_M. What we want to do now is infer the regulator activity profiles η_i from a set of observations at time points t_1, ..., t_n. Since gene activity measurements are subject to noise, we have to specify how we believe the noise is distributed. Denote by g_icr(t) the observed gene expression of gene i under condition c from replicate r; the condition c expresses whether our gene is of wild type or of mutant type. Let us assume that these g_icr(t) are log-normally distributed, and denote the mean and variance of the log expression of gene i under condition c by m_ic(t) and σ_i². By specifying a probability distribution we've introduced new parameters. To distinguish between the parameters we're originally interested in and the parameters that arise from our statistical model, we sometimes call the former structural parameters and the latter nuisance parameters. In this case, the kinetic parameters are the structural parameters.

Now that we've assumed a distribution from which our observations are drawn, we want to use likelihoods (in true Fisher fashion) to find estimates of the kinetic parameters. Given an observation g_icr(t) and the fact that it is log-normally distributed, we have the likelihood

$$l(m_{ic}(t), \sigma_i^2 \mid g_{icr}(t)) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{(\log(g_{icr}(t)) - m_{ic}(t))^2}{2\sigma_i^2}}$$

This isn't quite what we want, since we want to estimate the kinetic parameters θ_i, not the nuisance parameters m_ic(t) and σ_i². However, since E(g_icr(t)) = µ_ic(t) and µ_ic(t) depends on the θ_i, this is not a problem and we have a likelihood of the θ_i given g_icr(t).

2.1.2 Application in a Single Input Motif

In [8] the statistical framework is applied to a Single Input Motif (SIM). This is a pattern in a gene regulatory network where there is one regulatory gene which regulates k genes. The Michaelis-Menten model gives an expression for the production rate in a cell:

$$p(t) = \beta\, \frac{\eta(t)}{\gamma + \eta(t)} + \alpha$$

and the general expression for the gene activity of gene k is

$$\mu_k(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta \int_0^t e^{-\delta(t-\tau)}\, \frac{\eta(\tau)}{\gamma + \eta(\tau)}\, d\tau$$

Because we have no way of knowing what η(t) looks like, we assume it can be approximated by a piecewise constant function, i.e. η = (η_0, ..., η_{N−2}), where η_j is constant on the time interval (t_j, t_{j+1}), j = 0, ..., N − 2. After plugging this into µ_k(t) we get

$$\mu(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta e^{-\delta t}\, \frac{1}{\delta} \sum_{j=0}^{N-2} \left(e^{\delta t_{j+1}} - e^{\delta t_j}\right) \frac{\bar{\eta}_j}{\gamma + \bar{\eta}_j}$$
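The piecewise-constant approximation of µ(t) is straightforward to evaluate numerically; a minimal R sketch (the parameter values and the regulator profile below are made up):

mu_approx <- function(t, mu0, alpha, beta, gamma, delta, tgrid, eta_bar) {
  j <- which(tgrid[-1] <= t)                 # intervals (t_j, t_{j+1}] below t
  s <- sum((exp(delta * tgrid[j + 1]) - exp(delta * tgrid[j])) *
             eta_bar[j] / (gamma + eta_bar[j]))
  (mu0 - alpha / delta) * exp(-delta * t) + alpha / delta +
    beta * exp(-delta * t) * s / delta
}

tgrid   <- seq(0, 10, by = 1)
eta_bar <- rep(1, length(tgrid) - 1)         # constant regulator activity
mu_approx(5, mu0 = 0, alpha = 0.1, beta = 2, gamma = 1, delta = 0.5,
          tgrid = tgrid, eta_bar = eta_bar)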

This statistical model is applied to a cluster of genes which may or may not be regulated by a regulator, in this case cdaR. This gene and the gene cluster it belongs to are associated with the production of naturally derived antibiotics. The goal of the experiment is to find out which genes in the cluster are regulated by cdaR, given some expression data. This is done by taking samples from two different replicates: one has cdaR in the cluster and one has cdaR removed. In the experiment the data is partitioned as g_ctr = µ + κ_c + τ_t + ε_ctr, where κ_c is a term accounting for the presence of the regulator and τ_t is a possible time effect. Using ANOVA, the significance of the presence of cdaR is then investigated. With p-values < 0.01, 17 genes in the cdaR cluster are found to be differentially expressed. Using maximum likelihood estimation the activity profile η̄_t for cdaR is then inferred, and the confidence bounds for η̄_t are obtained through a Wilks procedure.


2.2 A Bayesian approach

For a Bayesian approach to parameter inference we review an article by [2]. Let's first review how one would cast our problem into a Bayesian framework. As before, we have our differential equation ẋ(t) = f(x, t|θ) and our noisy dataset y(t) = x(t) + ε(t), y(t) = (y_1(t), ..., y_n(t)), where we assume ε(t) is N(0, σ²I) distributed for some unknown σ². Since we take a number of measurements at timepoints [t_1, ..., t_m], we can summarize everything by Y = X + E, where Y, X, E are matrices with one row per state component and one column per timepoint:

$$Y = \begin{pmatrix} y_1(t_1) & y_1(t_2) & \cdots & y_1(t_m) \\ y_2(t_1) & y_2(t_2) & \cdots & y_2(t_m) \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix}$$

Remember that we want to infer the parameters θ given the data Y. In Bayesian statistics, one wants to calculate the probability density of the parameters θ given Y. We are therefore interested in finding p(θ | Y). We could find this by marginalizing the distribution of all unknowns in this problem, namely θ, σ and x_0 (in some cases x_0 is given, but let us assume that we are not so lucky). Then by Bayes' theorem we have

$$p(\theta, \sigma, x_0 \mid Y) = \frac{p(\theta, \sigma, x_0)\, p(Y \mid \theta, \sigma, x_0)}{\int p(Y \mid \theta, \sigma, x_0)\, p(\theta, \sigma, x_0)\, d\theta\, d\sigma\, dx_0}$$

The integral in the denominator is of course essential for normalization of the posterior distribution, but nevertheless cumbersome and not really of interest, since it only serves as a normalizing constant. We can summarize what we know by writing

$$p(\theta, \sigma, x_0 \mid Y) \propto p(\theta, \sigma, x_0)\, p(Y \mid \theta, \sigma, x_0)$$

and if we assume that θ, σ and x_0 are independent (which certainly does not seem unreasonable, since θ is obtained from nature and σ is imposed by the measuring system), then we obtain

$$p(\theta, \sigma, x_0 \mid Y) \propto p(\theta)\, p(\sigma)\, p(x_0)\, p(Y \mid \theta, \sigma, x_0)$$

Here, p(θ), p(σ), p(x_0) are priors that we have to specify. Now we can obtain p(θ | Y) by integrating over σ and x_0:

$$p(\theta \mid Y) \propto \int\!\!\int p(\theta)\, p(\sigma)\, p(x_0)\, p(Y \mid \theta, \sigma, x_0)\, d\sigma\, dx_0$$

where we again omitted the normalizing constant $\int p(Y \mid \theta, \sigma, x_0)\, p(\theta, \sigma, x_0)\, d\theta\, d\sigma\, dx_0$.

So far so good, but there is a catch. Since p(Y | σ, θ, x_0) = N(X, σ²I), we need the solution of the differential equation ẋ(t) = f(x, t|θ) to obtain X. This is not a problem for simple differential equations, but in real life we are often faced with differential equations that have no closed-form solution. One would then have to approximate this solution numerically, which can be computationally very expensive. It is this problem that [2] tries to tackle. The main idea behind the presented method is as follows. Since the parameters θ and x_0 are considered random, we can also consider the state variable x(t) to be random. By interpreting x(t) as a random variable we could try to find its distribution and thereby circumvent the need for an explicit solution to the ODE. The distribution of x(t) depends on the function f(x, θ, t) and the priors p(θ), p(x_0).

We cannot just pick any prior for θ and x_0, since the differential equation must be satisfied and hence ẋ(t) must exist. So we have to pick a class of functions which is hopefully rich enough. It turns out that we can accomplish this by using Gaussian processes (see also [15]).


Definition A Gaussian process is a collection of random variables such that any finite number of these random variables have a joint Gaussian distribution.

A Gaussian process can be described by a function f(x) such that, if we take some values x_1, ..., x_n from the domain of f, the vector (f(x_1), ..., f(x_n)) is jointly Gaussian distributed. A Gaussian process is completely determined by its mean function and covariance function (sometimes called a covariance kernel), given by m(x) = E[f(x)] and k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))] respectively. We then write f(x) ~ GP(m(x), k(x, x′)). Note that if the domain of f is finite, we have a Gaussian distribution, so Gaussian processes are more general than Gaussian distributions. Working in the Gaussian setting has other mathematical advantages; the Gaussian family is closed under addition, multiplication, conditioning, and marginalization.
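To build some intuition, a Gaussian process can be sampled at finitely many points by drawing from the corresponding multivariate normal; a small base-R sketch with a squared-exponential kernel (the length-scale of 1 is an arbitrary choice):

t_grid <- seq(0, 5, length.out = 100)
k <- function(s, t) exp(-(s - t)^2 / 2)            # squared-exponential kernel
K <- outer(t_grid, t_grid, k) + 1e-8 * diag(100)   # small jitter for stability
f <- t(chol(K)) %*% rnorm(100)                     # one sample path of the GP
plot(t_grid, f, type = "l")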

Why are we interested in Gaussian processes? Recall that we assumed θ, σ and x_0 to be random, with priors p(θ), p(σ) and p(x_0). Interpreting the state variable as a Gaussian process has the advantage that ẋ(t) exists and is a Gaussian process too:

Theorem Given a Gaussian process {X_t | t ∈ T} for some domain T, pick any subset {t_1, ..., t_n} of T. Then the vector (Ẋ(t_1), ..., Ẋ(t_n)) is normal.

For more details, see [23]. In other words, Gaussian processes have the nice property of ensuring differentiable sample paths.

We assume x(t) is a Gaussian process with x_i(t) ~ GP(0, k_i(s, t)), where k_i(s, t) is some covariance function. This gives a distribution for X, namely p(X | φ) = N(0, C_φ), where C_φ is a covariance matrix whose entries are k_i(t_i, t_j) for time points t_i, t_j and which depends on some parameters represented by a vector φ. Since we have a distribution for X, we also have p(φ, σ | Y), p(X | Y, σ, φ) and, since Ẋ exists, the distribution p(Ẋ | Y, φ, σ). Next, the authors of [2] define a new model on Ẋ given X, θ, γ as p(Ẋ | X, θ, γ) = N(f(X, t|θ), γI).

Since X is now a random variable, we can find a distribution p(θ, σ, φ, X, γ | Y) and then marginalize over every random variable except θ to find p(θ | Y). We can expand p(θ, σ, φ, X, γ | Y) by writing

$$p(\theta, \sigma, \phi, X, \gamma \mid Y) = p(\theta, \gamma \mid X, Y, \phi, \sigma)\, p(X \mid Y, \phi, \sigma)\, p(\phi, \sigma \mid Y)$$

and the distributions on the right-hand side can be determined to find the desired posterior.

The idea that one does not require the solution of the differential equation is an attractive one, but one can raise some questions about the proposed method. The authors use what is called a Product of Experts (PoE) to marry the models p(Ẋ | Y, θ, φ, σ) and p(Ẋ | X, θ, γ). A PoE is a model used in machine learning to combine several marginal distributions into a joint distribution, via

$$P(x \mid \{\theta_j\}) = \frac{1}{Z} \prod_{j=1}^{M} f_j(x \mid \theta_j) \qquad \text{where} \qquad Z = \int \prod_{j=1}^{M} f_j(x \mid \theta_j)\, dx$$

is a normalizing constant. This seems ad hoc: using this model looks like a technical convenience rather than a model founded on empirical observations. In machine learning, where the emphasis is on prediction and the training of programs, this is not an issue. But in problems such as gene activity inference one should have a solid justification for invoking a new model.

2.3 Machine Learning methods

We review a method which could be categorized under Machine Learning.

2.3.1 Smoothing method

We review a method presented in [14]. The problem under consideration is nearly the same as with the previous two methods, with only a few differences. We consider the differential equation

$$\dot{x}(t) = f(x, u, t \mid \theta)$$

where u = u(t) is a vector of input functions and θ is a vector of parameters we want to estimate. The goal is to build an approximation of x and 'tweak' this approximation according to some optimization criterion. In [14] the approximation x̂ of x is given by a basis function expansion. This means that for every x_i, i ∈ {1, ..., n}, we have x̂_i = c_i^T φ_i(t), where c_i is a vector of scalars and φ_i(t) is a vector of basis functions. We hope that we can pick basis functions such that x̂ approximates x reasonably well. As before, we have measurements y_ij at timepoints t_ij, where i ∈ {1, ..., n} and j ∈ {1, ..., m}. So we have y_ij = x_ij + ε_ij, where ε_ij is the error associated with variable x_i at timepoint j. We could also assume we can only partially measure x, but this only makes the notation more cumbersome and we shall ignore this possibility. These errors are assumed to have joint probability density p(ε_i | σ_i), where σ_i is a vector of density-dependent parameters (in the case of normally distributed errors, the σ_i would denote variances, for example). In this setup, our structural parameters are θ and σ, and the nuisance parameters are represented by the vectors c_i. We now let c be the vector one obtains by concatenating all the c_i, and denote by Φ_i the matrix consisting of φ_k(t_ij), the values of φ_k at timepoints t_ij. By writing

$$\Phi = \begin{pmatrix} \Phi_1 & 0 & \cdots & 0 \\ 0 & \Phi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_d \end{pmatrix}$$

we have as our estimator, given our measurements,

$$\begin{pmatrix} \hat{x}_1 \\ \hat{x}_2 \\ \vdots \\ \hat{x}_d \end{pmatrix} = \begin{pmatrix} \Phi_1 & 0 & \cdots & 0 \\ 0 & \Phi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Phi_d \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_d \end{pmatrix}$$

To reiterate, we want to estimate c, σ and θ given our dataset and noise model. The estimator ĉ of c is defined to be a function of σ and θ, so we have ĉ = ĉ(θ, σ | λ). We optimize this ĉ through the use of an optimization criterion, denoted by J(ĉ | σ, θ, λ). This criterion must penalize ĉ to the extent that x̂ = ĉᵀφ fails to satisfy the differential equation ẋ(t) = f(x, u, t | θ). The vector λ consists of parameters λ_i which control the amount of regularization, i.e. it expresses the extent to which a deviation from satisfying the differential equation is penalized. These λ_i can be set manually or according to another optimization criterion F(λ).

The structural parameters θ and σ are fitted too. This is done via a data-fitting criterion H(θ, σ | λ), where λ again controls the amount of regularization. H can be any measure of fit of θ and σ to the data y. For example, one could use the (negative) log-likelihood or the error sum of squares. If we take the negative log-likelihood as a measure of fit to the data, we get

$$H(\theta, \sigma \mid \lambda) = -\sum_{i \in I} \ln\left[g(e_i \mid \sigma_i, \theta, \lambda)\right]$$

where I ⊆ {1, ..., d} and the errors are e_ij = y_ij − ĉ_i(σ_i, θ; λ)^T φ_i(t_ij). We can see that H depends on σ and θ both directly and indirectly, namely through the error density g(e_i | σ_i, θ, λ) and the error y_ij − ĉ_i(σ_i, θ | λ)^T φ_i(t_ij) respectively.

So what does J look like? As mentioned before, we want J to have a penalty term which penalizes ĉ to the extent that x̂ fails to satisfy the differential equation. We can express such a penalty term by introducing

$$L_{i,\theta}(x_i) = \dot{x}_i - f_i(x, u, t \mid \theta)$$

which of course is zero when L_{i,θ} is evaluated at the true x_i, but is in general non-zero at an estimator x̂_i. A penalty could then be defined by

$$\mathrm{PEN}_i(\hat{x}) = \int \{L_{i,\theta}(\hat{x}_i(t))\}^2\, dt$$

which accumulates all deviations from the differential equation over a time interval wherein we take measurements. Other penalties can be defined if desired, such as PEN_i(x̂) = ∫ |L_{i,θ}(x̂_i(t))| dt.

These index-dependent penalties can be combined into an overall penalty term via

$$\mathrm{PEN}(\hat{x} \mid \lambda) = \sum_{i=1}^{d} \lambda_i\, \mathrm{PEN}_i(\hat{x})$$

where the λ_i can be adjusted to give additional weight to some penalty terms PEN_i. Finally, the authors combine this penalty term and the data-fitting criterion to obtain

$$J(\hat{c} \mid \sigma, \theta, \lambda) = -\sum_{i \in I} \ln\left[g(e_i \mid \sigma_i, \theta, \lambda)\right] + \mathrm{PEN}(\hat{x} \mid \lambda)$$

We can see that there is a kind of chain reaction in this method. Through H(θ, σ|λ) we obtain a fit for σ, θ which in turn gives an estimate ˆc for c.

Given this setup, we can now obtain an estimate for θ. Noting the dependence of H on the data y, we write H(θ, σ | y, λ). The authors now derive an estimator θ̂(y) of θ as the solution to the equation

$$\frac{\partial H(\sigma, \theta \mid y, \lambda)}{\partial \theta} = 0$$

where ∂H(σ, θ | y, λ)/∂θ can be expressed by calculating its total derivative and applying the implicit function theorem.
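To make the two-level scheme concrete, here is a hedged R sketch for the scalar ODE ẋ = θx: the inner level fits the coefficients c by minimizing a quadratic-loss version of J for fixed θ, and the outer level fits θ through the data-fitting criterion of the profiled estimate ĉ(θ). The polynomial basis, the fixed λ and the quadrature grid are simplifying assumptions of this sketch, not choices prescribed by [14].

set.seed(1)
theta_true <- -1
t_obs <- seq(0, 2, length.out = 20)
y     <- exp(theta_true * t_obs) + rnorm(20, sd = 0.05)

# Polynomial basis and its derivative, plus a fine quadrature grid.
basis  <- function(t) outer(t, 0:5, `^`)
dbasis <- function(t) outer(t, 0:5, function(t, k) ifelse(k == 0, 0, k * t^(k - 1)))
t_fine <- seq(0, 2, length.out = 200)

# Inner criterion J: squared data error plus lambda * integral of (x' - theta x)^2.
J <- function(c_vec, theta, lambda) {
  fit   <- sum((y - basis(t_obs) %*% c_vec)^2)
  resid <- dbasis(t_fine) %*% c_vec - theta * (basis(t_fine) %*% c_vec)
  fit + lambda * mean(resid^2) * diff(range(t_fine))
}
c_hat <- function(theta, lambda)
  optim(rep(0, 6), J, theta = theta, lambda = lambda, method = "BFGS")$par

# Outer criterion: plain data fit of the profiled estimate c_hat(theta).
H <- function(theta, lambda = 100)
  sum((y - basis(t_obs) %*% c_hat(theta, lambda))^2)

optimize(H, interval = c(-5, 0))$minimum   # should land near theta_true = -1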


Chapter 3

A different Bayesian method

In the previous chapter we've seen a Bayesian method that uses Gaussian processes on the state variable x(t). One downside of that method was the invocation of a new model, the Product of Experts. In this chapter we look at a method which also uses the Bayesian framework and places GPs on the state variable, but which has no need for a Product of Experts or any other extraneous model.

3.1 Reproducing Kernel Hilbert Space

We first review some theoretical concepts that we shall need. The first is the notion of a Reproducing Kernel Hilbert Space (RKHS for short).

Definition Let H be a Hilbert space, i.e. a complete inner product space, of real functions f defined on a set X. Then H is called a reproducing kernel Hilbert space with inner product ⟨·, ·⟩_H if there exists a function

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

for which the following properties hold:

• for every x ∈ X the function k_x(y) = k(x, y) belongs to H;

• k is positive definite, i.e. Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) > 0 for all a_1, ..., a_n ∈ R not all zero;

• ⟨f(·), k(·, x)⟩_H = f(x) for f ∈ H.

The latter property is sometimes known as the reproducing property. The function k is called a kernel function, or sometimes just a kernel. The kernel is characteristic for an RKHS, as is explicated by the following theorem (see also [15]).

Theorem 3.1.1. (Moore-Aronszajn theorem) Let X be a set. Then for every positive definite function k(·, ·) on X × X there exists a unique RKHS, and for every RKHS there is a unique kernel function k.

There is a close connection between the kernel k of an RKHS and a corresponding covariance operator/covariance matrix: an n × n matrix C such that f^T C f > 0 for all nonzero f ∈ H, where f = (f(x_1), ..., f(x_n)).

The kernel function is a covariance operator by definition, since

$$f^{T} C f = \sum_{i,j=1}^{n} f(x_i)\, f(x_j)\, k(x_i, x_j) > 0$$

Conversely, a covariance operator induces a kernel function through C_ij = k(x_i, x_j), which is positive definite. We now review Lemma 5 from [22], which gives us some important facts about kernels and covariance operators.

Lemma 3.1.2. The following hold for any RKHS (H, ⟨·, ·⟩_S).

1. The function s : X × X → R defined by s(x_i, x_j) = ⟨S_{x_i}, S_{x_j}⟩_S is a kernel function;

2. any inner product ⟨f, g⟩_S, with f, g ∈ H, can be uniquely expressed in the form f^T T g, where T is a positive definite operator;

3. let S be the operator S : H → H defined by S_ij = s(x_i, x_j); then S = T^{-1}.

The reader is referred to [?] for another overview of the relation between RKHSs and the estimation of parameters of differential equations. What we can take away from this lemma is that for any covariance operator C there is an RKHS with inner product ⟨f, g⟩_C = f^T C g for f, g ∈ H. Conversely, given an inner product from a given RKHS we can construct a unique covariance operator. We now use the connection between the inner product of an RKHS and a covariance operator in our original problem.

3.2 Application to parameter estimation

We look at a simple toy problem. Consider, for t ∈ T ⊆ R and θ ∈ R, θ ≠ 0, the ordinary differential equation

$$\dot{x}(t) = \theta x(t)$$

Suppose we have n samples at timepoints t_1, ..., t_n. We can rewrite this equation as

$$\left(\frac{d}{dt} - \theta\right) x(t) = 0 \qquad \forall t \in T$$

This equation must hold at all timepoints t_i, i.e.

$$\left(\left.\frac{d}{dt}\right|_{t=t_i} - \theta\right) x(t_i) = 0$$

where d/dt|_{t=t_i} means the derivative with respect to t, evaluated at the point t = t_i. We can rewrite this as a matrix equation. Denote by D the matrix

$$D = \begin{pmatrix} \left.\frac{d}{dt}\right|_{t=t_1} & 0 & \cdots & 0 \\ 0 & \left.\frac{d}{dt}\right|_{t=t_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \left.\frac{d}{dt}\right|_{t=t_n} \end{pmatrix}$$


Then we have (D − θI)x = 0, where x = (x_1, ..., x_n) = (x(t_1), ..., x(t_n)). Define L_θ := D − θI. This gives L_θ x = 0, which is equivalent to saying that

$$\langle L_\theta x, L_\theta x \rangle = 0 \iff (L_\theta x)^{*} L_\theta x = 0$$

where A^* denotes the Hermitian adjoint of the operator A. Denote K_θ^{-1} = L_θ^* L_θ. We do not know whether K_θ^{-1} exists in this case. However, we will look at a discretized version of the ODE. In general, discretizing an ODE is not a problem if the solution is sufficiently continuous, since the ODE can be approximated by piecewise-linear approximations (see [?], section 1.1). In the case of our toy problem, we transform the ODE into a difference equation by considering a discretized version of the differential operator D, namely the matrix

$$D_{\mathrm{discr}} = \frac{1}{\Delta t} \begin{pmatrix} -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{pmatrix}$$

In this case, (D_discr − θI) is just a matrix, and instead of (L_θ x)^* L_θ x = 0 we have (L_θ x)^T L_θ x = x^T L_θ^T L_θ x = 0.

Notice that L_θ is invertible whenever all timepoints t_i ≠ 0, since (d/dt)(e^{θt})|_{t=t_i} − θ = 0 if and only if t_i = 0. This implies that L_θ^T L_θ is positive definite. Thus, by the previous lemma, K_θ^{-1} induces an inner product, i.e.

$$x^T L_\theta^T L_\theta x = x^T K_\theta^{-1} x = \langle x, x \rangle_{K_\theta^{-1}} = \|x\|^2_{K_\theta^{-1}}$$

So this K_θ^{-1} is also a kernel function and hence a covariance matrix. Note that

$$x^T K_\theta^{-1} x = 0 \iff x^T (\lambda K_\theta)^{-1} x = 0$$

for λ ∈ R, λ > 0. We view x as a zero-mean Gaussian process, which we approximate by a multivariate normal distribution of sufficient dimension. We view our n samples as coming from this multivariate normal distribution; they are thus multivariate normal themselves, since the normal family is closed under marginalization. Since we are looking at a discretized version of the ODE, we let

$$(x_1, \ldots, x_n)^T \sim N(0, (\lambda K_\theta)^{-1}) \qquad \text{where} \qquad K_\theta^{-1} = L_\theta^T L_\theta$$

is built from the discretized version of L_θ. Placing a prior on x has been done before, but in this case we also allow λ to control the extent to which x can deviate from satisfying the differential equation. Note that if λ → ∞ then (λK_θ)^{-1} → 0, and x will center around the 'true' solution. The idea behind using (λK_θ)^{-1} is that we want to find a good choice of λ to obtain estimates for θ. We shall try to find a relationship between λ and an estimate for θ.
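The covariance construction is easy to set up in base R; a minimal sketch (the grid, θ and λ are arbitrary, and the backward difference in the last row of D is my choice to keep the matrix square):

n <- 50; dt <- 0.1
theta <- -3; lambda <- 50

D <- matrix(0, n, n)                            # discretized d/dt
for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt       # backward difference, last row

L <- D - theta * diag(n)                        # L_theta = D - theta * I
K_inv <- crossprod(L)                           # K_theta^{-1} = L^T L
Sigma <- solve(lambda * K_inv)                  # prior covariance (lambda K_theta)^{-1}
x_draw <- t(chol(Sigma + 1e-10 * diag(n))) %*% rnorm(n)   # one draw from the prior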


Now, suppose that we have a noisy dataset y, consisting of measurements at timepoints t_i, i ∈ {1, ..., n}. Suppose these measurements contain additive, normal, iid noise ε ~ N(0, σ²). We then have

$$y_i = x_i + \epsilon_i$$

for all i. Since x ~ N(0, (λK_θ)^{-1}), we have that y | x ~ N(x, σ²I) and y ~ N(0, σ²I + (λK_θ)^{-1}), where I is the identity matrix of appropriate dimension. We now go back to our original problem, i.e. the estimation of θ. We again do this in a Bayesian setting. We would like to obtain, for some fixed λ > 0, the distribution p(θ, σ | y). By Bayes' theorem, this is

$$p(\theta, \sigma \mid y) = \frac{p(y \mid \theta, \sigma)\, p(\sigma, \theta)}{p(y)}$$

Suppose we place a noninformative prior on (σ, θ), i.e. p(σ, θ) ∝ 1. Then we have

$$p(\theta, \sigma \mid y) \propto p(y \mid \theta, \sigma) = N(0, \sigma^2 I + (\lambda K_\theta)^{-1})$$

The joint distribution of x and y is given by

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} (\lambda K_\theta)^{-1} & (\lambda K_\theta)^{-1} \\ (\lambda K_\theta)^{-1} & \sigma^2 I + (\lambda K_\theta)^{-1} \end{pmatrix} \right)$$

We will now investigate the effectiveness of this framework in tackling the aforementioned estimation problem using R.

3.3 Experimentation in R

Since we know that p(θ, σ | y) ∝ p(y | θ, σ) = N(0, σ²I + (λK_θ)^{-1}), we might ask how the likelihood p(y | θ, σ) varies with θ and σ for varying values of λ. We can do this in R as follows. First we generate samplepoints from the true solution of ẋ = θx, giving x = (e^{θt_1}, ..., e^{θt_n}). A noisy dataset is then created via y <- rnorm(n, x, sd = .5), where sd is the standard deviation, which we set to 0.5. Setting the true value of θ to θ = −3, the likelihood p(y | θ, σ) can then be plotted over a grid of (θ, σ) values for several fixed values of λ.

We can see that, as λ gets bigger, the likelihood surface becomes flat in the θ direction for any fixed σ: every θ gives about the same likelihood p(y | θ, σ). Since we want to find an estimate for θ, this is not desirable. To circumvent this problem we can estimate σ before we consider the likelihood. We do this by finding a smoother for y and then using this smoother to estimate σ. Given a smoother S such that ŷ = Sy, an unbiased estimator for σ² is

$$\hat{\sigma}^2 = \frac{1}{n - \mathrm{df}(S)} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where df denotes the degrees of freedom. Now that we have a predetermined estimate of σ, we can investigate the likelihood for varying θ and λ. Given grids θ_intv = (θ_1, ..., θ_k) and λ_intv = (λ_1, ..., λ_l), we evaluate the likelihood for all θ_i ∈ θ_intv and λ_j ∈ λ_intv. This gives us a matrix Z with elements

$$Z_{ij} = p(y \mid \theta_i, \lambda_j)$$


Given this matrix, we might ask how an estimate for θ varies with λ. One possible way of obtaining such an estimate is through the posterior mean of θ given y:

$$E_\lambda(\theta \mid y) = \int \theta\, p(\theta \mid y)\, d\theta$$

where p(θ | y) is the posterior density of θ given y. The subscript λ emphasizes that this estimate depends on our choice of λ. We can approximate this posterior mean by changing the integral to a sum. However, since the posterior is determined only up to a constant, we have

$$E_\lambda(\theta \mid y) \approx \sum_{i=1}^{k} \theta_i\, q(y \mid \theta_i)$$

where q(y | θ_i) = p(y | θ_i)/K for some constant K > 0. To make sure the q(y | θ_i) factors are properly scaled, we let K = Σ_{i=1}^k p(y | θ_i). If we compute this posterior mean for every column of Z (that is, for each λ_j), we get a vector of posterior means:

$$\left(E_{\lambda_1}(\theta \mid y),\, E_{\lambda_2}(\theta \mid y),\, \ldots,\, E_{\lambda_l}(\theta \mid y)\right)$$

Now, consider the loss function

$$L(\theta_{\mathrm{true}}, \hat{\theta}) = |\theta_{\mathrm{true}} - \hat{\theta}|$$

where θ_true is the value of θ we used to generate x and y. We might ask for which λ we have minimal loss, i.e. for which λ

$$\lambda^{*} = \operatorname{argmin}_{\lambda_i}\, L(\theta_{\mathrm{true}}, E_{\lambda_i}(\theta \mid y))$$

and how this changes for different sample sizes and σ. To get a sense of what's happening, we can implement this setup in R as follows:

Set n and sigma; generate y = x + epsilon, where epsilon has variance sigma^2 * I
Define grids for theta and lambda
Calculate the matrix Z
For each lambda (each column of Z):
    compute the posterior mean of theta
Return the lambda which minimizes L(theta_true, E_lambda(theta | y))
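A hedged R implementation of this outline is given below; the forward-difference discretization of D, the step size (chosen so that L_θ stays invertible on the grid) and the use of dmvnorm from the mvtnorm package are choices of this sketch rather than of the original script, which can be found in the appendix.

library(mvtnorm)                                # provides dmvnorm

optimal_lambda <- function(n = 10, sigma = 0.5, theta_true = -3,
                           theta_grid  = seq(-10, -0.1, .5),
                           lambda_grid = seq(1, 200, .5)) {
  dt <- 0.15                                    # keeps L_theta invertible on the grid
  t_grid <- seq(dt, by = dt, length.out = n)
  y <- rnorm(n, exp(theta_true * t_grid), sd = sigma)

  D <- matrix(0, n, n)                          # discretized d/dt as before
  for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
  D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt

  lik <- function(theta, lambda) {              # p(y | theta, sigma) for this lambda
    L <- D - theta * diag(n)
    dmvnorm(y, sigma = sigma^2 * diag(n) + solve(lambda * crossprod(L)))
  }
  Z <- outer(theta_grid, lambda_grid, Vectorize(lik))   # Z_ij = p(y | theta_i, lambda_j)

  # Posterior mean of theta per column (per lambda), flat prior on theta.
  post_mean <- apply(Z, 2, function(col) sum(theta_grid * col) / sum(col))
  lambda_grid[which.min(abs(theta_true - post_mean))]
}

set.seed(7)
optimal_lambda()                                # one optimal lambda for one dataset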


If we repeat this a number of times and save the results in an array we get a vector of lambdas for which the loss function was minimized. This script can be run for different values of σ and n. The results can then be visualized by a histogram.

We can see that in this case, more often than not, the optimal value for λ is found between zero and fifty. We find a similar pattern if we repeat the script for different n and different σ (the R code can be found in the appendix).

Now, suppose that we have a set of optimal lambdas λ^opt = {λ_1^opt, ..., λ_m^opt}. Which λ_i^opt do we pick? We could use the element of λ^opt that occurs most often; we thus take the mode of the set {λ_1^opt, ..., λ_m^opt}, yielding λ_mode^opt. An estimate for θ is then E_{λ_mode^opt}(θ | y). If the mode is not unique, we pick one at random. We can see how this setup performs in R. We choose θ_true = −3 and σ = 0.5. Then we repeat the following 100 times. We generate 10 datapoints y = (y_1, ..., y_10). We calculate the matrix Z by setting θ_intv = seq(-10, -0.1, .5) and λ_intv = seq(1, 200, .5) (the choice of θ_intv is rather arbitrary, but since y shows decreasing samplepoints it is reasonable to assume θ will be negative). Each time an optimal lambda is found and stored in an array. Afterwards, the mode of this array is computed to obtain λ_mode^opt. In this simulation the mode was found to be λ_mode^opt = 54. In this case the (approximate) posterior mean of θ is equal to

$$E_{\lambda=54}(\theta \mid y) = -2.443943$$

If we repeat this experiment for different n, we obtain the following estimates for θ:

n     λ_mode   E_λ(θ|y)
10    54       -2.443943
30    31       -1.96768
50    47       -2.394646
100   41       -3.023044

which are arguably reasonable estimates.

3.4 Application to a nonhomogeneous ODE

We now try to apply the previous framework to a nonhomogeneous ordinary differential equation. Specifically, we consider a differential equation which was also discussed in [8]:

$$\dot{\mu}(t) = p(t) - \delta\mu(t)$$

where δ is a constant. As we've seen before, this ODE models protein activity in a cell. The function p(t) is given by Michaelis-Menten kinetics:

$$p(t) = \beta\, \frac{\nu(t)}{\gamma + \nu(t)} + \alpha$$

and this gives as general solution

$$\mu(t) = \left(\mu_0 - \frac{\alpha}{\delta}\right) e^{-\delta t} + \frac{\alpha}{\delta} + \beta \int_0^t e^{-\delta(t-\tau)}\, \frac{\nu(\tau)}{\gamma + \nu(\tau)}\, d\tau$$

In [8] the function ν(t) is taken to be piecewise constant. In our case we shall keep things simpler, for ease of exposition: we take ν(t) = 1 for all t ∈ Dom(ν) and µ_0 = 0. The parameters we have to estimate are now θ = (α, β, γ, δ) (if one wishes, one could append µ_0 and the piecewise constant ν(t_i) to θ). The explicit solution can be approximated by (see [8], section 4.3)

$$\mu(t) = -\frac{\alpha}{\delta} e^{-\delta t} + \frac{\alpha}{\delta} + \frac{\beta}{\delta} e^{-\delta t} \sum_{j=1}^{n-1} \frac{e^{\delta t_{j+1}} - e^{\delta t_j}}{1 + \gamma}$$

With these modifications our ODE is

$$\dot{\mu}(t) = \frac{\beta}{1 + \gamma} + \alpha - \delta\mu(t)$$

which implies

$$[D + \delta I]\, \mu = \left(\frac{\beta}{1 + \gamma} + \alpha\right) \mathbf{1}_n$$
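The homogeneous construction then carries over by penalizing ‖(D + δI)µ − b‖² with b = (β/(1+γ) + α)1_n, which suggests a Gaussian prior on µ with mean L^{-1}b and covariance (λLᵀL)^{-1}. The sketch below is my extrapolation of the preceding sections, with made-up parameter values, not code from the thesis.

n <- 50; dt <- 0.1
alpha <- 0.5; beta <- 2; gamma <- 1; delta <- 0.8; lambda <- 50

D <- matrix(0, n, n)                            # discretized d/dt as before
for (i in 1:(n - 1)) { D[i, i] <- -1 / dt; D[i, i + 1] <- 1 / dt }
D[n, n - 1] <- -1 / dt; D[n, n] <- 1 / dt

L <- D + delta * diag(n)                        # operator of the forced ODE
b <- rep(beta / (1 + gamma) + alpha, n)         # constant forcing term
prior_mean <- solve(L, b)                       # discrete solution of the ODE
Sigma      <- solve(lambda * crossprod(L))      # prior covariance
mu_draw <- prior_mean + t(chol(Sigma + 1e-10 * diag(n))) %*% rnorm(n)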
