Approximate Bayesian inference in semi-mechanistic models

(1)

University of Groningen

Approximate Bayesian inference in semi-mechanistic models

Aderhold, Andrej; Husmeier, Dirk; Grzegorczyk, Marco

Published in:

Statistics and Computing

DOI:

10.1007/s11222-016-9668-8

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2017

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Aderhold, A., Husmeier, D., & Grzegorczyk, M. (2017). Approximate Bayesian inference in semi-mechanistic models. Statistics and Computing, 27(4), 1003-1040. https://doi.org/10.1007/s11222-016-9668-8

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

(2)

DOI 10.1007/s11222-016-9668-8

Approximate Bayesian inference in semi-mechanistic models

Andrej Aderhold1 · Dirk Husmeier1 · Marco Grzegorczyk2

Received: 11 May 2015 / Accepted: 5 May 2016 / Published online: 16 June 2016 © The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract Inference of interaction networks represented by

systems of differential equations is a challenging problem in many scientific disciplines. In the present article, we follow a semi-mechanistic modelling approach based on gradient matching. We investigate the extent to which key factors, including the kinetic model, statistical formulation and numerical methods, impact upon performance at network reconstruction. We emphasize general lessons for computa-tional statisticians when faced with the challenge of model selection, and we assess the accuracy of various alternative paradigms, including recent widely applicable information criteria and different numerical procedures for approximat-ing Bayes factors. We conduct the comparative evaluation with a novel inferential pipeline that systematically disam-biguates confounding factors via an ANOVA scheme.

Keywords Network Inference· Semi-mechanistic model ·

Bayesian model selection· Widely applicable information criteria (WAIC, WBIC )· Markov jump processes · ANOVA· Systems biology

B

Marco Grzegorczyk m.a.grzegorczyk@rug.nl Andrej Aderhold andrej.aderhold@glasgow.ac.uk Dirk Husmeier dirk.husmeier@glasgow.ac.uk

1 _{School of Mathematics and Statistics, Glasgow University,}

Glasgow, UK

2 _{Johann Bernoulli Institute (JBI), Groningen University,}

Groningen, The Netherlands

1 Introduction

A topical and challenging problem for computational statis-tics and machine learning is to infer the structure of complex systems of interacting units. This research area has been particularly motivated by the cognate research discipline of computational systems biology, where researchers aim to reconstruct the structure of biopathways or regulatory net-works from postgenomic data; see e.g.Smolen et al.(2000), De Jong (2002) and Lawrence et al. (2010). Two princi-pled approaches can be distinguished. The first paradigm aims to apply generic models like sparse Lasso-type regres-sion, Bayesian networks, or hierarchical Bayesian models. A recent overview and comparative evaluation was pub-lished by Aderhold et al. (2014). The advantage of this approach is that the computational complexity of inference is comparatively low, and the application of these methods to problems of genuine interest is computationally feasible. The disadvantage is that interactions are modelled at a high level of abstraction, which ignores the detailed nature of the underlying mechanisms. The second paradigm is based on mechanistic models and the detailed mathematical descrip-tion of the underlying interacdescrip-tion processes, typically in the form of ordinary or stochastic differential equations (DEs). Two pioneering examples of this approach were published byVyshemirsky and Girolami(2008) andToni et al.(2009). The advantage of this paradigm is a more detailed and faith-ful mathematical representation of the interactions in the system. The disadvantage is the substantially higher com-putational costs of inference, which stem from the fact that each parameter adaptation requires a numerical integration of the differential equations. A novel approach, presented by Oates et al.(2014) and termed ’chemical model averaging’ (CheMA), aims for a compromise that combines the strengths of both paradigms. The underlying principle is that of

(3)

gradi-Fig. 1 Overview of the presented work and how it extends the CheMA

model ofOates et al.(2014)

ent matching, first proposed byRamsay et al.(2007). Given the concentration time series of some quantities whose inter-actions are to be inferred, the temporal derivatives of the concentrations are directly estimated from the data. These derivatives are then matched against those predicted from the DEs. Formally, on the assumption that the mismatch can be treated like observational noise of known distrib-utional form, we can derive the likelihood and thus apply standard statistical inference techniques. The model is effec-tively a non-linear regression model, whose computational complexity of inference sits between the two paradigms dis-cussed above: it is lower than for proper mechanistic models, since the DEs do not have to be integrated numerically; it is higher than for standard models of the first category, since the model is non-linear in its parameters and an analytic mar-ginalization is intractable. Henceforth, we refer to it as a ’semi-mechanistic’ model.

This article takes the work ofOates et al.(2014), which won the best paper award at the European Conference on Computational Biology (ECCB) in 2014, further in four respects, related to accuracy and efficiency, model selection, benchmarking, and application expansion. An overview can be found in Fig.1.

1.1 Accuracy and efficiency

Robust gradient estimation is absolutely critical for semi-mechanistic modelling. The numerical differentiation pro-posed in Oates et al. (2014) is known to be susceptible to noise amplification. We here propose the application of Gaussian process (GP) regression and the exploitation of the fact that under fairly general assumptions, GPs are closed under differentiation. Our approach effectively implements a low-pass filter that counteracts the noise amplification of the differentiation step, and we quantify the boost in network

reconstruction accuracy that can be achieved in this way. We further critically assess the influence of the parameter prior in the underlying Bayesian hierarchal model. In particular, we compare the g-prior with the ridge regression prior (see e.g. Chapter 3 inMarin and Robert(2007)) in the context of the proposed semi-mechanistic model and demonstrate that the latter significantly improves both accuracy and compu-tational efficiency.

1.2 Model selection

Network reconstruction is effectively based on statistical model selection. The model selection paradigm applied in Oates et al. (2014)—computing the log marginal likeli-hood (MLL) with Chib’s method—is not uncontroversial. Conceptually, alternatives to the MLL based on predictive performance have been promoted (see e.g. Sect. 7.4 in Gel-man et al. (2014a)). Numerically, Chib’s method can give inaccurate results, as discussed e.g. inMurphy(2012), Chap-ter 24. In this article, we assess four numerical approximation procedures for the MLL in the context of semi-mechanistic models: Chib’s original method (Chib and Jeliazkov 2001), Chib’s method with a numerical stabilization, thermody-namic integration (Friel and Pettitt 2008), and a numerically stabilized version of thermodynamic integration (Friel et al. 2013). We further carry out a comparative evaluation between the MLL and four information criteria (IC) as approximations to the predictive performance paradigm promoted inGelman et al.(2014a): ’divergence IC’ (DIC), ’widely applicable IC’ (WAIC), ’cross-validation IC’ (CVIC), and ’widely applica-ble Bayesian IC’ (WBIC).

1.3 Benchmarking

Assessing methodological innovation calls for an objec-tive performance evaluation. We have carried out a com-prehensive comparative evaluation of the proposed semi-mechanistic model with 11 state-of-the-art network inference methods from computational statistics and machine learning, based on a realistic stochastic process model of the under-lying molecular processes (Guerriero et al. 2012) and six distinct regulatory networks with different degrees of con-nectivity. The analysis of such a complex simulation study is hampered by the influence of various confounding fac-tors, which tend to blur naive graphical representations. We therefore apply an ANOVA scheme, which enables us to disentangle the various effects and thereby extract clear trends and patterns in the results. In this way we can show that by integrating prior domain knowledge via a system-specific mathematical representation, the resulting semi-mechanistic model can significantly outperform state-of-the-art generic machine learning and computational sta-tistics methods. We provide an application pipeline (Fig.2)

(4)

Methods iCheMA Lasso ElasticNet GGM Tesla SBR BGe HBR BSA MBN GP Marcov Jump Process (Data Generator) mRNA Profile Difference Quotient Gaussian Process Edge Ranking AUROC AUPREC Gradients Scoring nmgl g m n nmgl N M G y prot transcr rad transcr mRNA G g g n G G G G 2 ) ( ) ( 8 8 4 deg Networks N_n g G m M mRNA Profile Difference Quotient Gaussian Process GradientsGg ANOVA Model

Fig. 2 Overview of the ANOVA pipeline. Based on a mathematical

formulation in terms of Markov jump processes, realistic mRNA con-centration time series are generated from a collection of gene regulatory networks, and the transcription rate (temporal gradient) is computed with two alternative methods: numerical differentiation versus GP regression. The proposed iCheMA model is compared with 10 estab-lished state-of-the-art network reconstruction methods, each predicting

a ranking of the potential interactions (edges) in the network. From these rankings, the areas under the ROC (AUROC) and precision-recall curve (AUPRC) are computed, and provide a score for the accuracy of net-work reconstruction. An ANOVA scheme is applied to disentangle the effects of network topology, gradient computation and reconstruction method, for clearer recognition of trends and patterns

with which a user can objectively quantify this performance gain.

1.4 Application extension

Finally, we adapt the method from the modelling of protein signalling cascades inOates et al.(2014) to transcriptional gene regulation and include an explicit model of transcrip-tional time delays. In Appendix 4 we provide a novel application of the proposed semi-mechanistic model to plant systems biology, where the objective is to infer the struc-ture of the key gene regulatory network controlling circadian regulation in Arabidopsis thaliana.

2 Model

2.1 Interaction model

Inference of networks from data has become a topical theme in various scientific disciplines, particularly in systems biology. Here, rather than merely aiming for a descriptive representation of associations, the objective is a quantitative mathematical description of the processes that lead to the formation of an interaction network (e.g. a ‘biopathway’ or a ‘biochemical reaction network’). A standard approach is to model this network with a system of ordinary differential equations (ODEs):

d xi(t)

dt |t=t = Fi(πi(t

_{), θ)} ₍₁₎

Here, i ∈ {1, . . . , n} denotes one of n components of the system, which is called a ‘node’. In systems biology, this

is typically a gene or a protein. The variable xi(t) denotes

a measurable concentration of node i at time t. This can, for instance, be a gene expression or mRNA concentration. The vectorπi(t) contains the concentrations of the

regula-tors of node i . In a network, the regularegula-tors of node i are those nodes with a directed edge (or arrow) pointing to node i . Finally, the differential equations depend on a parameter vectorθ. In systems biology, these parameters are typically reaction rates that determine the kinetics of the underlying reactions. A specific example, taken from Barenco et al. (2006), is given in Sect.2.3. What makes network inference in the context of such a mechanistic description particularly challenging is the fact that the parameters θ are typically not measurable, or that only a small fraction of them can be measured. Hence, the elucidation of the interaction net-work structure requires these parameters to be inferred from concentration time series, which are typically sparse and noisy. To avoid the computational complexity of numeri-cally solving the ODEs, we followOates et al.(2014) and use gradient matching. The idea, first proposed byRamsay et al.(2007), is to estimate the time derivativesd xi

dt directly

from the data, then treat the problem as nonlinear regres-sion. On the assumption that the estimated derivatives can be treated like noisy data distributed around the predicted deriv-atives, and this distribution is iid normal, we obtain for the likelihood: p(D|θ) = n i=1 T j=1 N (yi(tj)|Fi(πi(tj), θ), σi2) (2) where yi(tj) =d x_dti(t)|t=tj, andN (.|μ, σ 2_{) is a normal}

dis-tribution with meanμ and variance σ2.Oates et al.(2014) obtained the temporal derivatives yi(tj) ( j = 1, . . . , T ) by

(5)

σ2 i Vi Di πi ν Ki yi Ki∼ N{Ki≥0}(1, νI) σ2 i ∝σ12 i Vi∼ N{Vi≥0}(1, T σ2i(DiDi)−1) yi∼ N (DiVi, σi2I) Di=D_i(K_i) is the

T -by-(|πi| + 1) design matrix withT rows, defined in (5) The regulator set

πiis kept fixed. bσ aσ aδ bδ σ2 i δ2i Vi Di πi ν Ki yi Ki∼ N{Ki≥0}(1, νI) σ2 i ∼ IG(aσ, bσ) δ_i2∼ IG(a_δ, b_δ) Vi∼ N{Vi≥0}(1, σ2iδi2I) yi∼ N (DiVi, σ2iI) Di=D_i(K_i) is the

T -by-(|πi| + 1) design matrix withT rows, defined in (5) The regulator set

πiis kept fixed.

Fig. 3 Probabilistic graphical model representation of semi-mechanistic models. The figure shows a probabilistic graphical model representation of the semi-mechanistic models investigated in our study. Top panel CheMA, as proposed byOates et al. (2014).

Bottom panel The new variant of CheMA (iCheMA), proposed here

differencing the time series xi(t1), . . . , xi(tT), based on the

Euler equation. However, differencing is known to lead to noise amplification (see e.g.Chatfield(1989)). In the present work, we apply a GP to smooth interpolation and exploit the fact that GPs are closed under differentiation, i.e. provided the kernel is differentiable, the derivative of a GP is also a GP, and its covariance matrix can be derived (Solak et al. 2002;Holsclaw et al. 2013).1_{We provide more details in the} following section.

1_{Within this paper we refer to the resulting gradients as the numerical}

(Oates et al. 2014) and the analytical gradient, proposed here.

2.2 Rate (or gradient) estimation

The fundamental concept of the interaction model is the matching of gradients between the regulator variables on the right-hand side and the rate of mRNA concentration change

d xi(t)

dt on the left-hand side of Eq. (1). Since direct

mea-surements of these rates are typically missing, we derive a rate estimate from the available concentration measurements at the time points t ∈ {t1, . . . , tT}. A common procedure,

which is also used byOates et al.(2014), is to calculate the slope of the concentration change at each time point twith the finite difference quotients

d xi(t) dt t_=t ≈ xi(t+ δt) − xi(t− δt) 2δt (3) This numerical procedure can yield good approximations to the true rates of concentration change if the data xi is

rela-tively precise, i.e. the signal-to-noise ratio is high. If the data is noisy, however, the rates from the difference quotient are susceptible to distortions as a consequence of noise amplifi-cation, as mentioned above. We here propose the application of GP regression to counteract this noise amplification. A GP defines a prior distribution over functions g(.) that transform input data points, defined here as time points t= (t1, . . . , tT),

into output data points, defined here as the concentration vector xi = (xi(t1), . . . , xi(tT)) for species i such that

xi(t∗) = g(t∗). The joint prior distribution over the functions

p(g(t1), . . . , g(tT)) is Gaussian distributed and commonly

has a zero mean and a covariance matrix G with independent and identically distributed (iid) additive noiseσn2:

p(g|t) = N (0, G + σ_n2I) (4)

where I is the identity matrix. The key idea of the GP is that the elements p, q ∈ {1, . . . , T } of the covariance matrix G are calculated from a kernel function with Gpq = κ(tp, tq),

which is typically chosen in such a way that for similar points tp and tq, the corresponding values xi(tp) and xi(tq) are

stronger correlated than for dissimilar arguments. Widely used kernel functions that are also applied in this paper are the radial basis function (RBF), the periodic function (PER), or the Matérn class function (MAT); see Chapter 4 in Rasmussen and Williams(2006) for the explicit mathemat-ical expressions. By taking the first derivative of the kernel functionκ(tp, tq) we obtain a prior distribution p(g) over

functions that define the first temporal derivative, i.e. the con-centration gradients, for each of the time points in t. Provided the kernel function is differentiable, this is again a valid GP. The simplest approach is to compute the expectation over these functions and thus obtain a mean estimate of the ana-lytical solution for the gradients at each time point. For the explicit mathematical expression, see e.g. Eq. (1) inHolsclaw

(6)

et al.(2013). This acts as a proxy for the missing rates yi(t)

on the left-hand side of Eq. (1).2In Appendix 2 we describe the details of the GP application and the software we used.

2.3 Model and prior distributions

Equation (1) typically takes the form (Barenco et al. 2006) d xi(t)

dt |t=t = ci− v0,ixi(t

_{) + f}

i(πi(t), θ) (5)

Setting ci = 0 and employing Michaelis–Menten kinetics

yields the CheMA approach (Oates et al. 2014):3 d xi(t) dt |t=t = −v0,ixi(t ₎ + u∈πi vu,i Iu,ixu(t) + (1 − Iu,i)ku,i xu(t) + ku,i (6)

where the sum is over all species u that are in the set of regu-latorsπiof species i , and the indicator functions Iu,iindicate

whether species u is an activator (Iu,i = 1) or inhibitor

(Iu,i = 0). The first term, −v0,ixi(t), takes the

degrada-tion of xi into account, whilevu,iand ku,iare the maximum

reaction rate and Michaelis–Menten parameters for the reg-ulatory effect of species u ∈ πi on species i , respectively.

Equation (6) represents the typical form of transcriptional regulation without complex formation; see e.g. the supple-mentary material inPokhilko et al.(2010) andPokhilko et al. (2012). We discuss the limitations caused by complex forma-tion in Sect.6.5. Without loss of generality, we now assume thatπi is given byπi = {x1, . . . , xs}. Eq. (6) can then be

written in vector notation: d xi(t)

dt |t=t = D

i,tVi (7)

where Vi = (v0,i, v1,i. . . , vs,i)is the vector of the

maxi-mum reaction rate parameters, and the vector Di_,t depends

on the measured concentrations xu(t) and the Michaelis–

Menten parameters ku,i (u∈ πi) via Eq. (6):

2 _{Using the whole distribution, as on page 58 of}_{Holsclaw et al.}₍₂₀₁₃_),

would give us additional indication of uncertainty akin to a distribution of measurement errors. Due to the increased computational costs (addi-tional matrix operations) this has not been attempted, though.

3_{In the original CheMA model, a different decay term is used (for}

pro-tein dephosphorylation), and no inhibitory interactions are included for numerical reasons. The linear decay term in Eq. (6) is more appropriate for transcriptional regulation, and the inclusion of inhibitory interac-tions achieves better results (as shown in Appendix 2).

D_i_,t = −xi(t), I1,ix1(t) + (1 − I1,i)k1,i x1(t) + k1,i , . . . ,Is,ixs(t) + (1 − Is,i)ks,i xs(t) + ks,i (8)

We combine the Michaelis–Menten parameters ku,i(u∈ πi)

in a vector Ki, and we arrange the T row vectors Di,t(t∈

{t1, . . . , tT}) in a T -by-(|πi|+1) design matrix Di = Di(Ki).

The likelihood is then: p(yi|Ki, Vi, σi2) = (2πσ 2 i)− T 2_e − 1 2σ2i (yi−DiVi)(yi−DiVi) (9) where yi := (yi(t1), . . . , yi(tT))is the vector of the rates

or gradients for species i . To ensure non-negative Michaelis– Menten parameters, truncated Normal prior distributions are used:

Ki ∼ N{Ki≥0}(1, νI) (10)

where 1 is a vector of ones, I is the identity matrix,ν > 0 is a hyperparameter, and the subscript,{Ki ≥ 0}, indicates

the truncation condition, i.e. that each element of Ki has

to be non-negative. In the original CheMA model (Oates et al. 2014) a truncated g-prior is imposed on the maximum reaction rate vectors Vi:

Vi|σi2, Ki ∼ N{Vi≥0}

1, T σ_i2(DT_iDi)−1

(11) where Di = Di(Ki), and a Jeffrey prior is used for the noise

variance: p(σ2

i) ∝ σi−2. In Sect. 6.2we demonstrate an

intrinsic shortcoming of the g-prior, and we show that the model can be significantly improved by employing a trun-cated ridge regression prior instead:

Vi|σi2, δi2∼ N{Vi≥0}

1, δ2iσi2I

(12) whereδ_i2is a new hyperparameter which regulates the prior strength. For σ_i2 and δ_i2 we use inverse Gamma priors, σ2

i ∼ I G(aσ, bσ) and δ2i ∼ I G(aδ, bδ). Graphical model

representations for both models CheMA and iCheMA are provided in Fig.3. For inference of the iCheMA model the MCMC sampling scheme inOates et al.(2014) has to be mod-ified. The new full conditional distributions can be derived from the equations in Sect. 3.2 ofMarin and Robert(2007). The details are given in Sect.3.1, and pseudo-code of the MCMC algorithm is provided in Table1. Table2shows that the replacement of Eq. (11) by Eq. (12) yields a substantial reduction of the computational costs of the MCMC scheme. Figure6in Sect.6.2shows that this replacement can also lead to a significantly improved network reconstruction accuracy.

(7)

Table 1 Pseudo Code: MCMC sampling scheme for the iCheMA model

Table 2 Computational costs for CheMA and iCheMA

Dimensions 1 2 3 4

CheMA (s) 5.81 13.39 47.57 585.43

iCheMA (s) 7.61 7.93 9.62 13.39

The runtimes are given in seconds for 1000 MCMC iterations (the effec-tive sample sizes for the two methods were not significantly different). The increase of the computational costs for CheMA when increasing the dimension of Vi is discussed in Appendix 1. Both methods were implemented in Matlab, and the MCMC simulations were run on an Intel(R) Core(TM) E6850 with 3GHz

3 Inference

3.1 Posterior inference

We refer to the proposed new variant of the CheMA model, which employs an analytical rather than a numerical gradient

and replaces the truncated g-prior in Eq. (11) by the truncated ridge regression prior in Eq. (12), as the improved CheMA (iCheMA) model. For iCheMA, as outlined in Sect. 2.3, the Metropolis-within-Gibbs Markov Chain Monte Carlo (MCMC) sampling scheme proposed byOates et al.(2014) has to be modified. In the new variant (iCheMA) we replace the truncated g-prior on Viby the truncated ridge regression

prior, we use a conjugate inverse Gamma prior rather than a Jeffrey’s prior for the noise varianceσ_i2, and we introduce a new hyperparameterδ_i2. We thus have to revise the sampling steps of the original MCMC inference algorithm. We also replace the approximate collapsed Gibbs sampling step for the noise varianceσ_i2 fromOates et al.(2014) by an exact uncollapsed Gibbs sampling step. For computing the poste-rior distribution of the noise varianceσ_i2,

p σ2 i |Ki, yi ∝ pyi|σi2, Ki p σ2 i (13)

(8)

Oates et al.(2014) approximate the marginalization integral p yi|σi2, Ki = p yi|Vi, σi2, Ki p Vi|σi2, Ki dVi (14) with the closed form solution fromMarin and Robert(2007), Chapter 3. This is exact if there are no restrictions on the integration bounds. However, given the underlying positiv-ity constraint for Vi, symbolically{Vi ≥ 0}, the integral is no

longer analytically tractable and the expressions for Eqs. (13, 14) used inOates et al.(2014) become an approximation.4 We therefore switch to an uncollapsed Gibbs sampling step, where σ_i2 is sampled from the full conditional distribu-tion p(σ2

i |Ki, Vi, yi) and the marginalization integral from

Eq. (14) becomes obsolete.5

For species i and a given regulator set πi we have to

sample the maximum reaction rate vector Vi, the Michaelis–

Menten parameter vector Ki, the noise varianceσi2, and the

new hyperparameterδ2_i from the posterior distribution: pVi, Ki, σi2, δ2i|yi ∝ pyi|Ki, Vi, σi2 pVi|σi2, δi2 pδ_i2pKi pσ_i2 (15) where yi := (yi(t1), . . . , yi(tT))is the vector of rates or

gradients. For the full conditional distribution of Vi we get:

pVi|Ki, σi2, δi2, yi ∝ pyi|Ki, Vi, σi2 pVi|δi2, σi2 (16) Since Ki,σ_i2, andδ_i2are fixed in Eq. (16) and the (truncated)

Gaussian prior on Vi from Eq. (12) is conjugate to the

like-lihood in Eq. (9), we obtain:

Vi|Ki, σi2, δ 2 i, yi ∼ N{Vi≥0} ˜μ, ˜Σ (17) where ˜Σ = δ_i2(I + δ2_iD_i Di)−1, ˜μ = ˜Σ(δ_i−21+ D_i yi), and Di = Di(Ki) is the design matrix, built from the rows given

in Eq. (8). For the full conditional distribution ofδ_i2we have: pδ2_i|Vi, Ki, σi2, yi

∝ pVi|σi2, δi2

pδ_i2 (18) As Vi andσ_i2are fixed in Eq. (18) and the inverse Gamma

prior onδ_i2is conjugate for p(Vi|σ_i2, δ2_i), defined in Eq. (12),

we obtain: δ2

i|Vi, Ki, σi2, yi ∼ I G

˜aδ, ˜bδ (19)

4_{This issue applies to the CheMA model, as proposed by}_{Oates et al.}

and to the new improved variant (iCheMA), proposed here. For both models we implement the exact uncollapsed rather than the approximate collapsed Gibbs sampling step forσ_i2.

5_{The truncation of Vi}_{is then automatically properly taken into account,}

as it will only be conditioned on Vithat fulfil the required constraint.

with ˜b_δ = b_δ + 1₂σ_i2(Vi − 1)(Vi − 1), and ˜a_δ = a_δ + 1

Inverse–Gamma prior on (Vi, σ_i2) is conjugate for the

likelihood in Eq. (9), we get: σ2

i ∼ I G

˜aσ, ˜bσ (21)

where˜bσ = bσ+12[(yi−DiVi)(yi−DiVi)+δ2i(Vi−1)(Vi−1)], and ˜aσ = aσ+1₂(T + |πi| + 1). For the mathematical details see, e.g., Chapter 3 ofMarin and Robert(2007).

The full conditional distribution of Kicannot be computed

in closed-form so that the Michaelis–Menten parameters have to be sampled by Metropolis–Hastings (MH) MCMC steps. Given the current vector Ki, a realization u from a

multivariate Gaussian distribution with expectation vector 0 and covariance matrixΣ = 0.1 · I is sampled, and we pro-pose the new parameter vector K_i = Ki + u subject to a

reflection of negative values into the positive domain. The MH acceptance probability for the new vector K_i is then

A(Ki, Ki) = min 1, R(Ki, Ki) , with RKi, Ki = exp −1 2_σ2 i yi−Di K_iVi yi−Di K_iVi exp −1 2σ2 i (yi−Di(Ki)Vi)(yi−Di(Ki)Vi) ·P R · H R (22)

where the Hastings-Ratio (HR) is equal to one, and the prior probability ratio (PR) depends on the model variant.6For the original CheMA model (Oates et al. 2014) we obtain from Eq. (11): PRCheMA= P_{Vi≥0} Vi|σ_i2, K_i P_{V_i_≥0}Vi|σ_i2, Ki P{Ki≥0} K_i P_{Ki≥0} Ki (23)

For the proposed new variant (iCheMA) we get from Eq. (12):

PRiCheMA= P{

K_i≥0}(Ki)

P_{Ki≥0}(Ki)

(24) If the move is accepted, we replace Kiby Ki, or otherwise we

leave Ki unchanged. Pseudo code of the MCMC sampling

6 _{The HR is equal to 1, as the proposal moves are symmetric. New}

candidates K_i with negative elements are never accepted, as they have the prior probability zero, P(K) = 0.

(9)

scheme for the new model variant (iCheMA) is provided in Table1. Table2shows that the replacement of Eq. (23) by Eq. (24) yields a substantial reduction of the computational costs of the MCMC inference.

The ultimate objective of inference is model selection, i.e. to infer the n regulator setsπi (i= 1, . . . , n) of the interaction

processes described by Eq. (6). We compare and critically assess five alternative paradigms: the divergence information criterion (DIC), proposed bySpiegelhalter et al.(2002), the ’widely applicable information criterion’ (WAIC), proposed byWatanabe(2010), the ’widely applicable Bayesian infor-mation criterion’ (WBIC), proposed byWatanabe(2013), the cross-validation information criterion (CVIC) , proposed byGelfand et al.(1992), and the marginal likelihood (also called ’model evidence’). For the ’marginal likelihood’ par-adigm, we compare two numerical methods: Chib’s method (Chib), proposed byChib and Jeliazkov (2001), and ther-modynamic integration (TI), proposed byFriel and Pettitt (2008). For the latter, we have further assessed the numerical stabilization of the numerical integration proposed byFriel et al.(2013) in a simulation study, which can be found in Appendix 3.

3.3 Posterior probabilities of interactions

For the improved variant of the CheMA model (iCheMA) we followOates et al.(2014) and perform ’model averag-ing’ to compute the marginal posterior probabilities of all regulator-regulatee interactions (i.e. the ’edges’ in the inter-action graph). The marginal posterior probability for species u being a regulator of i is given by:

p(u → i|D) = π i∈Π(u→i)p(D|π i )p(πi) πi∈Π p(D|π i)p(πi) (25)

whereΠ is the set of all possible regulator sets πi of species

i , and Π(u→i) is the set of all regulator sets πi of i that

contain the regulator u. For simplicity, we chose a uniform prior forπi subject to a maximum cardinality of 3 for the set

of regulators (’parents’) of a node.

3.4 Network inference scoring scheme

For the CheMA model (Oates et al.(2014)) and the novel model variant (iCheMA) the marginal interaction posterior probabilities in Eq. (25) can be used to rank the network inter-actions in descending order. If the true regulatory network is known, this ranking defines the receiver operating character-istic (ROC) curve (Hanley and McNeil 1982), where for all

possible threshold values, the sensitivity (or recall) is plotted against the complementary specificity. By numerical integra-tion we then obtain the area under the curve (AUROC) as a global measure of network reconstruction accuracy, where larger values indicate a better performance, starting from AUROC = 0.5 to indicate random expectation, to AUROC = 1 for perfect network reconstruction. A second well estab-lished measure that is closely related to the AUROC score is the area under the precision recall curve (AUPREC), which is the area enclosed by the curve defined by the precision plot-ted against the recall (Davis and Goadrich 2006). AUPREC scores have the advantage over AUROC scores that the influ-ence of large quantities in false positives can be identified better through the precision. These two scores (AUROC and AUPREC) are widely applied in the systems biology com-munity to score the global network reconstructions accuracy (Marbach et al. 2012).

3.5 Causal sufficiency

In Appendix 1 we discuss causal sufficiency and how its vio-lation affects the inference of regulatory network structures.

4 Evaluation

4.1 ANOVA

Like other network reconstruction models, CheMA and the novel iCheMA model yield a ranking of the regulatory inter-actions. Hence, if the true interaction network is known, AUROC and AUPREC scores can be computed, as explained in Sects.3.3and3.4. For our performance evaluation on real-istic network data, described in Sect.5.2, we were running simulations for different settings, e.g. related to different reg-ulatory network structures, shown in Fig. 4, and different inference methods. In order to distinguish the relevant effects from the confounding factors, we adopted the DELVE evalu-ation procedure for comparative assessment of classificevalu-ation and regression methods in Machine Learning (Rasmussen 1996;Rasmussen et al. 1996). That is, we set up a multi-way analysis of variance (ANOVA) scheme to disentangle the fac-tors of interest. For instance, if there are three main effects ( A, B, and C), and yi j kl is the l-th AUROC (or AUPREC)

score obtained for the constellation A= i, B = j and C = k, then we set up a 3-way ANOVA scheme:

yi j kl= Ai+ Bj + Ck+ εi j kl (26)

whereεi j kl∼N(0, σ2) is zero-mean white additive Gaussian

noise. We then computed ANOVA confidence intervals, e.g. for the effects of A = 1, A = 2, . . . on the AUROC scores (i.e. confidence intervals for the parameters A1, A2, . . .); see,

(10)

Fig . 4 Hypothetical gene re gulatory n etw o rk of the circadian clock in A. thaliana , b ased on Pokhilk o et al. ( 2010 ). The n etw o rk displayed in the top left p anel is the P 2010 netw ork p roposed by Pokhilk o et al. ( 2010 ). The o ther netw orks are p runed v ersions of that netw ork, corresponding to certain protein knock-do wns, as noted in each netw ork title. T hese netw or ks are u sed for the realistic d ata simulations, d escribed in Sect. 5.2 . Gr ey boxes group sets of re gulators o r re gulated components. Arr o ws symbolize acti v ations and bar s inhibitions. Solid lines sho w protein–gene interactions; dashed lines sho w protein interactions; and the re gulatory influence o f light is symbolized by a sun symbol . F igure reproduced from Aderhold et al. ( 2014 )

(11)

Thereby the main effects of the ANOVA models varied in dependence on the addressed research question. Throughout Sect.6we use the following five main effect symbols:

– Nn(n= 1, . . . , 6) is the effect of the Network structure,

see Fig.4for the six structures,

– Kk (k = 1, . . . 4) is the effect of the GP Kernel, used

for the computation of the analytical gradient (’RBF’, ’PER’, ’MAT32’, or ’MAT52’),

– Gg (g = 1, 2) is the effect of the Gradient type, i.e.

numerical (’difference quotient’) versus analytical (’GP interpolation’),

– Mm (m = 1, . . . , 12) is the effect of the inference Method, i.e. the iCheMA model and eleven competing

methods, listed in Table5,

– and Pp( p= 1, 2) is the effect of the Prior on Vi(i.e. the

g-prior from Eq. (11) versus the ridge regression prior from Eq. (12).

In Appendix 2, we test the statistical assumptions of the ANOVA scheme in Eq. (26), and we include additional results related to the improvement of GP regression over numerical differentiation, and the influence of the network topology.

4.2 Simulation details

In our study we have included the four IC: DIC, WAIC,

CVIC, and WBIC, and we have employed four different

numerical methods to approximate the marginal likelihood: Chib’s original method (Chib and Jeliazkov 2001) (Chib

naive), a stabilized version of Chib’s method (Chib),

pro-posed here, thermodynamic integration with the trapezoid rule (TI), see Eq. (38), and thermodynamic integration with the numerical correction (TI-STAB), see Eq. (41).

The computation of the model selection scores (DIC, WAIC, CVIC, WBIC and the MLL with both Chib’s method and TI) requires MCMC simulations; pseudo code can be obtained from Table1. We monitored the convergence of the MCMC chains with standard convergence diagnostics based on potential scale reduction factors (Gelman and Rubin 1992). The application of Chib’s method is based on the selection of a particular ’pivot’ parameter vector ˜θ, as described under Eq. (34). Initially, we chose ˜θ to be the MAP (maximum a posteriori) estimator from the entire MCMC simulation. This was found to lead to numerical instabilities, though, as seen from the right panel of Fig.8. We found a way to numerically stabilize Chib’s method, which we dis-cuss in Appendix 1. We also found that the transition from TI to TI-STAB proposed byFriel et al.(2013) can be counter-productive, as seen from Table4. We have investigated this unexpected effect in more detail in a simulation study in Appendix 3. We also found that the improved variant of

Table 3 Parameter settings for the synthetic data of Sect.5.1. The parametersv0,yandv2,yare the maximum reaction rates

Parameter 1 2 3 4 5 6 7 8 9

v0,y 1 0.5 1.5 2 0.2 2 3 0.2 0.1

v2,y 1 1 1 1 1 0.2 0.1 2 2

CheMA (iCheMA) substantially reduces the computational costs of the MCMC-based inference, as shown in Table2and discussed in more detail in Appendix 1.

5 Data

5.1 Synthetic data

We generate T = 240 data points xs(t1), . . . , xs(tT) for

n = 4 species (s = 1, . . . , 4) from iid standard Gaussian distributions. Subsequently, to obtain non-negative concen-trations, the observations of each individual species are shifted such that the lowest value is equal to 0, before we followOates et al.(2014) and re-scale the observations of each species to mean 1. With x1taking the role of the

degra-dation process and x2being an activating regulator (I2,y= 1)

of a target species, whose gradient y(.) we here assume to be directly observable, we generate target observations y(tj)

with Eq. (6). We then have for j = 1, . . . , T : y(tj) = −v0,yx1(tj) + v2,y

x2(tj)

x2(tj) + k2,y + tj

(27)

where Vy = (v0,y, v2,y)Tis the vector of maximum reaction

rate parameters, Ky = (k2,y) contains the Michaelis–Menten

parameter(s), andtj ∼ N (0, σ

2_{) is additive iid Gaussian}

noise ( j = 1, . . . , T ). We keep the Michaelis–Menten para-meter fixed at k2,y= 1, while we vary the rates v0,yandv2,y,

as indicated in Table3. Our goal is to infer the set of regula-torsπyof y out of all subsets of{x2, x3, x4}, where πy = {x2}

is the true regulator set. The effect of the degradation, taken into account by x1, is included in all 8 models.

5.2 Realistic data

For an objective model evaluation, we use the benchmark data fromAderhold et al.(2014), which contain simulated gene expression and protein concentration time series for ten genes in the circadian clock of A. thaliana. The time series correspond to measurements in 2-h intervals over 24 h, and are repeated 11 times, corresponding to differ-ent experimdiffer-ental conditions. We use time series generated from six variants of the circadian gene regulatory network in A. thaliana, shown in Fig. 4; these variants correspond

(12)

Table 4 Score differences for iCheMA, applied with thermodynamic

integration: TI versus TI-STAB

Spread-factor (sf ) (δ2= s f, ν = 0.5) (δ2= s f, ν = s f ) TI TI-STAB TI TI-STAB 0.01 39 33 166 156 0.1 38 33 76 68 1 27 23 27 22 10 8 5 8 6 100 4 5 4 4 10000 5 119 5 119

1e+08 52 3.1e+10 52 1.7e+08

1e+16 52 9.8e+24 53 2.7e+24

1e+20 52 3.2e+32 51 3.3e+34

The table shows the MLL differences between a true and an over-complex model for the synthetic data from Sect.5.1for different spread factors s f . TI from Eq. (38) and the stabilized variant TI-STAB from Eq. (41) were applied using K= 10 discretization points and the power

m= 8 in Eq. (42). The setting of the hyperparametersδ2in Eq. (12) and the prior varianceν in Eq. (10) is shown in the first row of the table. The diffuseness of the corresponding prior distribution(s) increases with the spread factor (s f ). The score differences for TI-STAB sharply increase for s f > 100

to different protein, i.e. transcription factor, knock-downs. The molecular interactions in these graphs were modelled as individual discrete events with a Markov jump process, using the mathematical formulation from Guerriero et al. (2012) and practically simulated with Biopepa (Ciocchetta and Hillston 2009), based on the Gillespie algorithm ( Gille-spie 1977). For large volumes of cells, the concentration time series converge to the solutions of ODEs of the form in Eq. (5). However, for smaller volumes, time series simulated with Markov jump processes contain stochastic fluctuations that mimic the mismatch between the ODE model and gen-uine molecular processes, and the volume size was chosen as described inGuerriero et al. (2012) so as to match the fluctuations observed in real quantitative reverse transcrip-tion polymerase chain reactranscrip-tion (qRT-PCR) profiles. For the network reconstruction task, we only kept the gene expres-sion time series and discarded the protein concentrations; this emulates the common problem of systematically miss-ing values for certain types of molecular species (in our case: protein concentrations).

5.3 Real data

We have applied the iCheMa model to gene expression time series obtained with real-time polymerase chain reaction experiments to predict the circadian regulatory network in Arabidopsis thaliana. All details and the results can be found in Appendix 4.

Table 5 State-of-the-art network reconstruction methods

Abbreviation Full name

HBR Hierarchical Bayesian regression

Lasso Sparse regression with L1penalty

ElasticNet Sparse regression with L1

and L2penalty

Tesla Sparse regression with time-varying

change-points

GGM Graphical Gaussian models

SBR Sparse Bayesian regression

BSA Bayesian spline autoregression

SSM State-space models

GP Gaussian processes

MBN Mixture Bayesian networks

BGe Gaussian Bayesian networks

The table shows a list of the methods that were included in the compar-ative evaluation study with the realistic network data from Sect.5.2. A detailed description of these methods can be found inAderhold et al. (2014), and references therein

6 Results

This section discusses the effect of gradient approximation (Sect. 6.1), the influence of the prior (Sect.6.2), the accu-racy of model selection (Sect.6.3), the relative performance compared to the current state of the art (Sect.6.4), and the problem of model mismatch (Sect.6.5).

6.1 Evaluating the effect of the gradient computation

To illustrate the difference in the accuracy of network infer-ence between a numerically calculated gradient using the difference quotient defined in Eq. (3) and an analytical gradi-ent using a GP, we applied both gradigradi-ent types to the realistic data of Sect.5.2and evaluated the performance of the meth-ods listed in Table5together with iCheMA. The difference quotient was calculated with a time difference ofδt = 2 h7,

and the analytical gradient was calculated with a GP using a RBF kernel as described in Sect.2.2. The results of a pre-liminary study, in which we investigated the effect of the GP kernel on the network reconstruction accuracy, can be found in Appendix 2. We have recorded the AUROC and AUPREC scores for all the different conditions mentioned in Sect.5.2, and summarize the outcome with an ANOVA analysis that treats the different conditions and methods as distinct effects. Fig.5a shows that the results for the data with the analytical rate estimation significantly improves the performance over the numerical difference quotient while taking all methods 7 _{In the realistic data study, we assume gene measurements to take}

place in intervals ofδt = 2 h. This mimics typical rates for the qRT-PCR sampling experiments inFlis et al.(2015).

(13)

0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 AUROC AUPREC numerical analytical (0.76, 0.59) (0.80, 0.66) (a) 0.81 0.815 0.82 0.825 0.83 0.835 0.84 0.845 0.85 0.855 0.86 0.68 0.69 0.7 0.71 0.72 0.73 0.74 AUROC AUPREC numerical analytical (0.82, 0.70) (0.85, 0.72) (b)

Fig. 5 Effect of the gradient type: numerical versus analytical. For

the realistic network data from Sect.5.2we compared the network reconstruction accuracy of a numerical and an analytical gradient. The numerical gradient was computed with the difference quotient, as pro-posed inOates et al.(2014), and a time difference ofδt = 2 h. The analytical gradient, proposed here, was derived from the derivative of a Gaussian process (GP) using the radial basis function kernel with opti-mized parameters, as implemented in the Matlab library gpstuff. Both panels show the mean AUROC and AUPREC scores with confi-dence intervals for the effect of the gradient type derived from ANOVA models. In a all methods listed in Table5and iCheMA were included so that the ANOVA model has 3 effects: gradient type Gg, network structure Nn, and method Mm: ygnml = Gg+ Nn+ Mm+ εgnml, see Sect.4.1for details. b shows the confidence intervals for iCheMA only, i.e. for an ANOVA model with Mmbeing removed

of Table5 and iCheMA into account. The same trend can be observed in Fig.5b, which only considers the results for iCheMA. Hence, we conclude that network reconstruction accuracy significantly improves when an analytically derived gradient using a GP is used instead of the numerical differ-ence quotient used inOates et al.(2014). This can be observed for both a broad variety of different network reconstruction methods as well as specifically for the iCheMA method.

6.2 Evaluating the influence of the parameter prior

To evaluate the influence of the parameter prior on model selection, we computed the MLL for the g-prior in Eq. (11) and the ridge regression prior in Eq. (12) for the synthetic data from Sect.5.1. For each of the 9 parameter settings, shown in Table3, and 4 different noise variancesσ2we generated 10 independent data instantiations from Eq. (27). We then applied the stabilized Chib method (Chib) for approximating the MLLs for all possible regulator–regulatee configurations, and computed the logarithmic Bayes factors (i.e. the differ-ences of the MLLs) between the true and each wrong model. Fig.6gives the average logarithmic Bayes factors (averaged across the 9×4 = 36 parameter settings) for 10 independent replications. In Fig.6the distributions of the MLL score dif-ferences are represented with boxplots, where positive values indicate that the true structure is correctly selected, whereas negative values indicate that the wrong alternative structure is erroneously selected. It can be seen that regulator sets that do not contain the true regulator (corresponding to the three rightmost box plots) are clearly rejected. However, for the over-complex alternative models (three leftmost box plots), which contain spurious regulators, the MLL score difference obtained with the g-prior fails to consistently favour the true model. Two out of three differences are negative in Fig.6a, indicating that an over-complex model is preferred to the true one. For the proposed ridge regression prior, on the other hand, the MLL score difference does succeed in consistently favouring the true structure, as displayed in Fig.6b. We had a closer look at the results for the individual parameter sets, shown in Fig.7. This figure reveals that the g-prior performed well for some parameter settings, but not for others. In par-ticular, we found that the g-prior systematically fails when ’v0,y< 1 ≤ v2,y’ (see Table3). In Appendix 1 we provide a

theoretical explanation for this trend.

We used the synthetic data from Sect.5.1to cross-compare the performance of the model selection schemes; for an overview see Appendix 1. The MLL based selection pro-cedures score the individual models with respect to the differences in the MLL (or log Bayes factors). We compare these results with the score differences of the various IC. For inference we used the novel iCheMA model, and we var-ied the hyperparameters of the prior distributions so as to obtain increasingly diffuse prior distributions. In a first sce-nario we set the hyperparameterδ2in Eq. (12) to a value s f , which we refer to as ’spread-factor’, and we keptν = 0.5 in Eq. (10) fixed. In the second scenario we set both hyper-parametersδ2andν to the spread factor s f . Figure8shows boxplots of the log score differences between the true model and an over-complex alternative model (with one redundant

(14)

[2,3] [2,4] [2,3,4] [3] [4] [3,4] −10 0 10 20 30 40 50 60 70 80 90 100 MLL Difference to Tr u e Parent Set (a) [2,3] [2,4] [2,3,4] [3] [4] [3,4] −10 0 10 20 30 40 50 60 70 80 90 100 MLL Difference to Tr u e Parent Set (b)

Fig. 6 Comparison of CheMA and iCheMA: effect of the prior. The

plots show the differences in the log marginal likelihoods (MLL) between the true regulator–regulatee model and six alternative wrong models, obtained with CheMA/iCheMA for the data from Sect.5.1. There are three potential regulators{x2, x3, x4} in the system, with

πy = {x2} (’[2]’) being the true regulator of the response y. The

configurations on the horizontal axis define alternative regulator con-figurations. The log score differences have been averaged across the nine parameter configurations, shown in Table3, and four noise set-tings (σ2= 0.05, 0.1, 0.2, 0.4). The box plots show the distributions of the average log score differences for 10 independent data instantiations. Positive values indicate that the true model was identified correctly; for negative differences, the wrong model had a higher score and would thus be erroneously selected. The results for the g-prior (CheMA) from Eq. (11) are shown in (a); the results for the ridge regression prior (iCheMA) from Eq. (12) are shown in (b)

regulator-variable) for both scenarios and increasing spread factors (s f ). Again positive differences indicate that the true structure is correctly selected, whereas negative differences indicate that the alternative over-complex structure is erro-neously selected. The overall trend revealed in Fig.8is that the log score difference decreases with the spread-factor (i.e. with the diffuseness of the prior) for most of the model selec-tion criteria. 1 2 3 4 5 6 7 8 9 −50 0 50 100 difference score to tr u e parent set [2]

parent set order: [2,3] [2,4] [2,3,4] [3] [4] [3,4] (a) 1 2 3 4 5 6 7 8 9 0 20 40 60 80 100 120 140 160 180 difference score to tr u e parent set [2]

parent set order: [2,3] [2,4] [2,3,4] [3] [4] [3,4] (b)

Fig. 7 Detailed Chib log marginal likelihood (MLL) scores for the

original CheMA method with g-Prior (a) and a modified method with a ridge prior (b). Detailed plots from which the average score distribu-tions in Fig.6are derived. Each of the nine slots along the horizontal

axis corresponds to one parameter configuration of(v0,y, k2,y, v2,y) as displayed in Table3and used for the synthetic data in Sect.5.1. The slots contain the Chib MLL difference scores between a wrong parent configurations (the parent configurations are aligned from left to right in each slot according to the order in the legend) and the true parent set ([2]). Positive values indicate that the true parent configuration receives a higher score

The parameter priors in Eqs. (10) and (12) are Gaus-sians centred on μ = 1, with different variances. For low spread-factors s f (i.e. for low prior variances), both groups of criteria (IC and MLL) clearly favour the true model, since the prior ’pulls’ the spurious interaction parameter from its true value of zero towards a wrong value ofμ = 1. As the prior becomes more diffuse, the score differences become less pronounced, but still select the true model up to spread factors of about s f ≈ 100. As the prior becomes more dif-fuse, with the spread factor exceeding s f > 100, the IC occasionally fail to select the correct model. A more detailed representation focusing on the IC and larger spread factors is given in Fig.9. It is seen that among the IC it is mainly DIC that repeatedly fails to select the true model (the

(15)

cen-−30 −20 −10 0 10 20 30 40 50 60 DIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBIC

Difference to true parent setup 0.01 0.10 1 10 100 10000 1e+08 1e+16 1e+20 −250 −200 −150 −100 −50 0 50 100 150 200 DIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBICDIC WAICCVIC Chib Chib−naivTI−4 TI−8 WBIC

Difference to true parent setup 0.01 0.10 1 10 100 10000 1e+08 1e+16 1e+20

Fig. 8 Score difference between the true and an over complex

regulator-regulatee configuration for iCheMA, applied with different model selection schemes. Each box plot includes the average results from 10 independent data instantiations for the model of Sect.5.1. The over-complex configuration includes a spurious regulator. Positive values indicate that the true model is favoured over the over-complex model. DIC, WAIC, CVIC, and WBIC are the IC described in Appen-dix 1; the methods for calculating the MLL are: a naive implementation of Chib (Chib-naiv), a stabilized version of Chib’s method (Chib), and thermodynamic integration in two variants (TI-4 and TI-8). Both

TI-variants were applied with K = 10 discretization points and differ w.r.t. the power m∈ {4, 8} in Eq. (42). The panels demarcated by hori-zontal lines correspond to different prior distributions of the interaction parameters, characterized by the spread factor s f , ranging from 0.01 to 1e+20. Left panel The prior variance of the Michaelis–Menten parame-ters in Eq. (10) was kept fixed atν = 0.5. The spread hyperparameter δ2 for the prior of the reaction rate parameters in Eq. (12) was varied, i.e.

δ2_{= s f . Right panel Both δ}2_and_{ν take the value s f , i.e. δ}2_{= ν = s f .}

A higher resolution for the IC (plotted on a different scale) is available from Fig.9

tral inter-quartile range of the score difference distribution, between the first and third quartile, includes negative values), whereas for the other information criteria the selection of the wrong model is relatively unlikely (the central inter-quartile range does not include negative values). Two of the four MLL methods, namely TI-8 and Chib, start to increasingly

favour the true model as the spread factor further increases beyond s f > 1000. This is a consequence of Lindley’s para-dox, whereby MLL increasingly penalizes the over-complex model for increasingly vague priors. TI-4, in principle, shows a very similar trend but the score difference is lower than for

(16)

−1 −0.5 0 0.5 1 1.5 2 2.5 DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC

Difference to true parent setup

(a)

−1 0 1 2 DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC DIC WAIC CVIC

Difference to true parent setup

(b)

Fig. 9 Score differences for iCheMA, for different information

crite-ria. This figure replicates the results from Fig.8for the information criteria DIC, WAIC and CVIC, on a different scale for improved reso-lution, for spread factors (s f ) ranging from 10 to 1e+ 20. a is extracted

from the left panel of Fig. 8, and b is extract from right panel of Fig.8. Negative score differences indicate that the over-complex model is favoured over the true one

(i.e. the applied temperature ladder) implied by the power m∈ {4, 8} in Eq. (42) can critically affect the result.

Among the MLL methods, the naive application of Chib’s method, Chib naive, as proposed in Chib and Jeliazkov (2001), shows a completely different pattern and systemati-cally fails to select the correct model for large spread factors. A theoretical explanation for this instability is provided in Appendix 1. We achieve a stabilization of Chib’s method, referred to as Chib, by selecting the pivot parameter set ˜θ with the highest posterior probability within the set of actu-ally sampled parameters (excluding the parameter states from the burn-in phase). We refer to Appendix 1 for details.

The left panel of Fig.8shows that the different ways of computing the MLL give very similar results up to a prior spread factor of about 1e+08. For spread factors exceeding this value, the results differ. The MLL computed with Chib’s method (Chib) monotonically increases, as expected from Lindley’s paradox. The MLL computed with TI is obtained without numerical stabilization and reaches a plateau, with different values obtained for different trapezium sum dis-cretization schemes (determined by m in Eq. (42)). This is a numerical discretization error that results from the form of the integrand in Eq. (37), which has most of its area concentrated on values near τ = 0. We tried to stabilize TI with the corrected trapezium rule, replacing Eq. (38) by Eq. (41). Interestingly, this transition from TI to TI-STAB

turned out to be occasionally counter-productive, as shown in Table4. We have investigated the effect of the numerical “sta-bilization” more thoroughly in a simulation study based on a Bayesian linear regression model. We found that TI-STAB can fail for small numbers of discretization points and diffuse priors. The study and its results can be found in Appendix 3. In our subsequent simulations we use the numerically stabi-lized variant of Chib’s method (Chib), which has a 10-fold lower numerical complexity compared to TI (because we used 10 different temperaturesτ for TI).

6.4 Comparison with state-of-the-art network reconstruction methods

We have compared the prediction accuracy of the pro-posed new novel variant (iCheMA) of the semi-mechanistic CheMA model of Oates et al. (2014) with the original CheMA model and 11 state-of-the-art machine learning methods, assessed in Aderhold et al.(2014). These meth-ods are listed in Table5 and were applied as described in Aderhold et al.(2014). The network reconstruction accuracy performance was tested on the realistic gene expression pro-files from Sect.5.2, based on the six network structures from Fig.4. The results in terms of AUROC and AUPREC scores are shown in Fig.11and demonstrate that the iCheMA model outperforms all alternative methods. The CheMA model of

(17)

0.70 0.72 0.74 0.76 0.78 0.80 0.82 0.6 8 0.70 0.72 0.74 AUROC A UPREC iCheMA analytical gradient iCheMA numerical gradient CheMA analytical gradient CheMA numerical gradient

Fig. 10 Network reconstruction for the wildtype network. The scatter

plot shows the network reconstruction accuracy for the CheMA model and its new variant iCheMA, using realistic data (see Sect.5.2) gener-ated from the wildtype network in Fig.4. Both methods, CheMA and iCheMA, were applied with both a numerical and an analytical gradient

Oates et al.is not included in this comparison. Due to the substantially higher computational costs, the simulations for all six networks in Fig.4 would require several weeks of computing time on a medium-size cluster, as indicated by Table2. In order to keep the computational complexity man-ageable, we compared CheMA and iCheMA in a separate study, using only data from the wildtype network, proposed byPokhilko et al. (2010) and shown in the top left panel of Fig.4. Note that for all methods included in Fig.11, the gradient for the response d xi(t)

dt of Eq. (6) is derived from

an analytical solution of the derivative using a GP with RBF kernel, as described in Sect.2.2. The original definition of CheMA, on the other hand, uses a numerical gradient esti-mation. To separate out the effects of gradient estimation and the other differences between the methods, we applied both CheMA and iCheMA with both gradients: the numer-ical and the analytnumer-ical gradient. The results are shown in Fig.10. The two findings of this study are: iCheMA consis-tently outperforms CheMA, and the analytical gradient leads to a significant improvement in the prediction accuracy over the numerical gradient.

6.5 Model selection for network identification

As a final test, we evaluated the accuracy of model selec-tion for network identificaselec-tion, using the data from Sect.5.2. These data contain gene expression time series from the six gene regulatory networks of Fig.4, which contain one wildtype and five mutant networks from protein knock-down experiments. We computed the MLL (with Chib’s

0.60 0.65 0.70 0.75 0.80 0.85 0.55 0.60 0.65 0.70 Method comparison AUROC A UPREC iCheMA HBR Lasso ElasticNet BGe BSA GGM GP MBN SBR SSM Tesla

Fig. 11 Comparison between the iCheMA model and state-of-the-art

network reconstruction methods. The scatter plot shows the network reconstruction accuracy for the novel iCheMA model and the eleven alternative methods from Table5, using the realistic network data from Sect.5.2, generated from the six network structures of Fig.4. All net-work reconstruction methods used the analytical gradient, obtained from GP regression with an RBF kernel, and the displayed AUROC and AUPREC values are the effects of the method, derived from an ANOVA with 2 effects: method Mm and network structure Nn:

ymnl= Mm+ Nn+ εmnl; see Sect.4.1for details

method) for each of the candidate networks, for each data set in turn. The evaluation was repeated over five indepen-dent data instantiations. For each data set, all models were ranked based on the MLL, and the ranks were averaged over all data sets. This leads to the six-by-six confusion matrix of Fig. 12a, where the rows represent the networks used for data generation, and the columns represent the can-didate networks ranked with the MLL. Note that we have emulated the mismatch between data generation and infer-ence characteristic for real applications. For data generation, molecular interactions corresponding to the edges in the net-work were modelled with complex Markov jump processes, as described in Sect. 5.2, with the intention to mimic real biological processes. For inference and model selection, interactions were modelled with Michaelis–Menten kinet-ics, corresponding to Eq. (6), and the interactions between different regulators were modelled additively, by adding the Michaelis–Menten terms—this reduction in complexity is required for general computational tractability and scalabil-ity. The results are shown in Fig.12a. The diagonal elements of the matrix show the average ranks for the correct network. Two of the six network structures are consistently correctly identified (average rank 1), but for the other four structures, the average ranks vary between 1.6 and 3. This failure to consistently identify the true network tallies with the fact