University of Groningen On a quest for metabolic fluxes: sampling and inference tools using thermodynamics, metabolome and labelling data Taborda Saldida Alves, Joana

(1)

On a quest for metabolic fluxes: sampling and inference tools using thermodynamics,

metabolome and labelling data

Taborda Saldida Alves, Joana

DOI:

10.33612/diss.157440136

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2021

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Taborda Saldida Alves, J. (2021). On a quest for metabolic fluxes: sampling and inference tools using thermodynamics, metabolome and labelling data. University of Groningen.

https://doi.org/10.33612/diss.157440136

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

(2)

(3)

3

CHAPTER 3 BENCHMARKING MONTE CARLO METHODS

FOR SAMPLING OF THERMODYNAMICALLY FEASIBLE FLUXES

Joana Saldida1_{, Daniele de Martino}2_{, Andreas Milias-Argeitis}1_,

Matthias Heinemann1

1_{Molecular Systems Biology, Groningen Biomolecular Sciences and Biotechnology}

Institute, University of Groningen, 9747 Groningen, The Netherlands

2_{Jozef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia; and Biofisika Institute}

(4)

ABSTRACT

Sampling the space of flux solutions that are in agreement with a set of physical con-straints and experimental data is essential to characterise the metabolic state of a cell. While sampling the flux space defined by linear constraints was previously applied and improved, the non-convex nature of thermodynamic constraints makes sampling mathematically challenging. To explore such challenges, we evaluated the efficiency of three different samplers based on the same core approach applied to two different metabolic networks. We found that a strategy based on Rejection Sampling was the fastest for the small network but inefficient for the larger one, where a Conditional Sampler was required to increase sampling efficiency. When an additional energy bal-ance constraint was included, only the Conditional Sampler with a parallel tempering scheme was applicable for both networks. Our analysis contributes to the applicabil-ity of sampling methods to characterise the thermodynamically feasible flux space.

INTRODUCTION

The study of fluxes through metabolic networks provides a global view on cellular behaviour. Diverse metabolic flux analysis (MFA) methods have been developed to estimate metabolic fluxes in steady-state (Sweetlove et al., 2014). MFA methods compute steady-state fluxes through computational models of metabolite balancing, using the stoichiometric matrix to represent the metabolic network. The metabolic network can be represented in the model partially, with only specific pathways of interest (Nidelet et al., 2016), or fully at a genome-scale (Gopalakrishnan & Maranas, 2015a). MFA involves solving an optimisation problem where a model is fitted to experimental data, often exchange fluxes (uptake, secretion, growth), or a metabolic

(5)

3

trait is optimised (Flux Balance Analysis, FBA) (Orth, Thiele, et al., 2010). Models used in MFA are flexible to include diverse types of constraints and data (e.g. exchange fluxes or metabolomics) to help find the optimal flux solution. However, MFA prob-lems are highly underdetermined due to a large number of degrees of freedom and measurement error (Mahadevan & Schilling, 2003) and thus, producing one single flux solution may not be a good representation of the metabolic state of the cell. Instead of attempting to find one optimal flux solution, we can define a feasible flux solution space as the space of flux solutions that are in agreement with a set of constraints and experimental data (Price et al., 2006). To be able to evaluate the uncertainty of each reaction flux regarding the given measurements, or the effect of the application of different constraints, it is necessary to characterise this feasible flux solution space through sampling. Other than exploring the flux solution space and quantifying un-certainty, sampling stands as a flexible tool that can be used to answer different types of metabolic questions, e.g. revealing transcriptional regulation in enzymes (Bordel et al., 2010), and to infer global network properties (Schellenberger et al., 2011), e.g. testing network robustness to parameter variation (Alves & Savageau, 2000).

Metabolic flux models described by mass balance constraints (stoichiometry) can be sampled with a Markov Chain Monte Carlo (MCMC) method for randomised sam-pling, also known as Hit-and-Run (HR) (Chen & Schmeiser, 1996). The computational complexity of a metabolic model typically increases with dimensionality, which is in turn related to the size of the network. Hence, techniques like artificial centering hit-and-run (ACHR) were developed to improve the solution space coverage during HR sampling in networks of considerable size (Schellenberger et al., 2011). Differ-ent approaches were also developed to explore the feasible flux solution space when certain fluxes are fixed based on an FBA solution (Binns et al., 2015; Chaudhary et al., 2016). As an alternative to the iterative collection of sampled points, a machine learning strategy based on Expectation Propagation can also be applied to analytically approximate the feasible flux solution space defined by linear constraints (Braunstein et al., 2017).

Important constraints enforced in MFA models derive from thermodynamics. The loop law imposes conservation of energy in reaction loops at network level (Beard et al., 2002), while the second law of thermodynamics relates directly to the direction of individual reactions (Henry et al., 2007). Both constraints were widely applied in MFA (Beard et al., 2002; Fleming et al., 2012; Henry et al., 2007; Kümmel et al., 2006; Noor et al., 2012; Schellenberger & Palsson, 2009), including in a recently

(6)

published model that integrates metabolite concentrations in a thermodynamic and stoichiometric constraint-based model (TSM) for a medium-sized S. cerevisiae network (Niebel et al., 2019). However, the solution space constrained with thermodynamics presents the added difficulty of non-convexity. Sampling techniques for sampling high-dimensional non-convex solution spaces are scarce. Existing approaches to sample the flux solution space constrained by thermodynamics applied the loop law alone and mostly employed rejection sampling techniques (Price et al., 2006; Saa & Nielsen, 2016; Schellenberger et al., 2011). With the increase in use of thermody-namic constraints in metabolic models, the development of a new efficient sampling technique is essential.

In previous work (Chapter 2) we addressed the problem of sampling the solution space of the TSM, i.e. the set of fluxes, metabolite concentrations and Gibbs energies constrained by stoichiometry, flux bounds and thermodynamics, through the splitting of this non-convex set into convex polytopes: a polytope of fluxes (flux polytope) and a polytope of Gibbs energies and metabolite concentrations (concentration polytope) for each flux point. After the splitting step, described in Chapter 2, we could sample the TSM flux polytope conditional on the existence of a non-empty concentration polytope. In this way, it is possible to apply different types of sampling methods to the TSM flux space, i.e. the set of fluxes consistent with all the TSM model constraints. However, it is not yet clear how different samplers behave in this constrained set with metabolic models of different organisms, size and complexity.

In this work, we analysed the performance of three potentially applicable methods to sample the TSM flux space of two networks of different size – an E. coli network of central metabolism with 82 reactions, and a yeast network of central and amino acid metabolism with 258 reactions and 2 compartments - constrained by stoichiometry, thermodynamics, physiology and metabolome data. We evaluated the performance of three samplers that explore the constrained set with the Hit-and-Run (HR) algorithm to characterise the TSM flux space of the two networks: Rejection Sampling, and Conditional Sampling with and without a parallel tempering scheme. We employed autocorrelation analysis and convergence testing to identify the most efficient sam-pler for each network. While for the smaller network the most efficient option was to use the Rejection Sampler, for the larger network this sampler was not applicable and the Conditional Sampler was the sampler choice with best performance. When an additional energy balance constraint was included in the models, only the parallel tempering version of Conditional Sampling was applicable for both networks.

(7)

3 RESULTS

We applied the three sampling techniques to the TSM flux space, described by the mathematical model developed in (Niebel et al., 2019), with two metabolic networks of different organisms and complexity. The target set to sample was defined as in Chapter 2, where, first, the TSM model was fitted to exchange fluxes, metabolite con-centrations and standard Gibbs energies (Δ_rG0_{, Figure 1a). Second, the limits of the}

space of fluxes (v), concentrations (in logarithm form, lnc) and Gibbs energies (Δ_rG)

were determined through variability analysis (Mahadevan & Schilling, 2003), with a fixed Δ_rG0_{and with the measured fluxes and concentrations bounded according to}

their respective standard deviation (Figure 1a). Finally, the TSM feasible space of (v,

lnc, Δ_rG) was broken down to the TSM flux space (Figure 1b) through a split into a

flux polytope, described by the stoichiometry and flux bounds of the TSM model, and a concentration polytope, described by thermodynamic constraints and lnc and Δ_rG

bounds. We performed sampling in the flux polytope conditional on the existence of a non-empty concentration polytope for each flux point, what we defined as the TSM flux space above. All the sampling approaches used a Hit-and-Run algorithm in which the proposed search directions were weighted by an ellipsoid enclosing the target set (Methods Chapter 2). Additionally, when the flux polytope contained a significant amount of thermodynamically infeasible loops, it was divided in flux sectors prior to sampling (Methods Chapter 2).

The energy balance equation (Eq. 4 in Chapter 2) was enforced during sampling in Chapter 2 due to technical reasons arising during the fitting procedure prior to sampling (Supplementary information Chapter 2). However, its enforcement is re-dundant in most cases of interest. Particularly, if we don’t enforce the energy balance in this case, it is only marginally violated (Chapter 2). Hence, since this constraint imposes extra complexity to the sampling problem, here we considered its removal from the TSM flux space in order to compare different sampling approaches for the most common case, where this constraint is not necessary. Following the evaluation of the samplers without energy balance, we included this constraint in the TSM flux space of the two network models to evaluate the application of the same sampling techniques on a more constrained space.

The small network considered represents the core metabolism of E. coli with a single compartment and 82 reactions of which 66 are independent reactions. The number of independent reactions is smaller than the total number of reactions in the network because several reactions occur in linear pathways and, therefore, have the

(8)

same flux. With 56 metabolites, the flux polytope of this network has 16 degrees of freedom while, for a given flux point, the concentration polytope has 56 degrees of freedom in 123 dimensions when excluding the energy balance. After fitting of the TSM model to exchange fluxes, metabolite concentrations and ∆_rG0_{and subsequent}

variability analysis with the E. coli network, we obtained flux bounds that allowed the identification of many flux directions. Out of the 66 independent reactions, only 9 were left with undetermined direction. We evaluated the possible flux sign configura-tions of this network with the 9 undetermined flux direcconfigura-tions through sampling and enumeration of thermodynamically infeasible loops (Methods Chapter 2) and found that such loops do not have a significant impact on sampling, thus, it was not neces-sary to divide the flux polytope in sectors.

The larger S. cerevisiae network (yeast) included core metabolism reactions as well as amino acid production pathways distributed over cytosol and mitochondria. This network of 258 reactions, of which 148 are independent, is the same used in the work of Chapter 2. The flux polytope of this network has 59 degrees of freedom and the concentration polytope for a flux point has 196 degrees of freedom and 435 dimen-sions. After the fitting and variability analysis steps, 52 reactions were found to have undetermined flux direction. An analysis of the flux sign configurations showed that it is very unlikely to choose a feasible flux point while sampling the flux polytope due to the large number of thermodynamically infeasible loops present. For this reason, the flux polytope for the yeast network was divided in sectors within which there were few loops left (Chapter 2). The sampling techniques used with this network were applied to one single flux sector as an example, since the subsequent generation of a sample over all the sectors has no impact on sampler performance.

Samplers

We selected three sampling techniques to apply to the TSM flux space. The first sam-pling method we employed is based on Rejection Samsam-pling. This strategy is usually employed when the shape of the target set makes it difficult to sample from directly, as it is the case for non-convex sets. Non-convex target sets can generally be sampled with a rejection strategy (Casella et al., 2004), where the non-convex set is enclosed in a convex set, the latter is sampled with Hit-and-Run and the points outside the non-convex target set are rejected a posteriori. Rejection sampling could be applied here to sample the TSM flux space (Figure 1b) by first sampling the flux polytope (blue set A in Figure 1b) and later rejecting points that have an empty concentration polytope

(9)

3

(yellow set B in Figure 1b). The movement of the HR to sample the flux polytope (set A) is very fast as all points are accepted and no complex transition rules are necessary. However, depending on the size and form of the target set (green set A∩B in Figure 1b) compared to the flux polytope, this method can spend a great amount of time exploring the infeasible part of the flux polytope, a problem which is aggravated by increasing the number of dimensions of the flux space. Moreover, even though HR generates a Markov chain in the flux polytope, the process generated by the rejection sampler in the target set is not a Markov chain.

Alternatively, a Markov Chain Monte Carlo (MCMC) method could be applied to directly sample the TSM flux space (set A∩B), in which the flux polytope (set A) is sampled with Hit-and-Run using a transition rule based on the fulfilment of the ther-modynamic constraints. Specifically, in this conditional approach, a proposed point in the flux polytope (set A in Figure 1b) is accepted if the thermodynamic constraints are respected (i.e. there is a non-empty concentration polytope B) for that flux point (Figure 1b). This method was here named Conditional Sampler and, as common in MCMC samplers, when a point is rejected the previous point is newly stored. This sampler uses transition rules that enable it to stay in the feasible set, which can allow the exploration of very constrained regions of the feasible set but can also get “stuck” in such regions, making feasible points very difficult to find.

For highly constrained high dimensional sets it was important to consider yet a third sampling option: parallel tempering (PT). In this approach, multiple replicas of a system with different relaxation degrees on constraints are sampled in parallel and feasible points are swapped between replicas, promoting a cascade of points that allows the target system to move easily (Swendsen & Wang, 1986). Applied to the problem of sampling the TSM flux space (set A∩B), each replica of the flux polytope (set A) was sampled with the Hit-and-Run algorithm and flux points were accepted according to different levels of relaxation, called temperatures, of the thermodynamic constraints (Methods Chapter 2, Figure 1b). Sampled flux points from consecutive replicas were swapped according to their feasibility in the neighbour replica. As the name indicates, parallel tempering performs sampling of different replicas in paral-lel. By using multiple CPU cores to sample the different replicas, we prevent a large increase in running time that could come with the increase in implementation com-plexity. Because each replica was sampled with the conditional sampler, we called this approach Conditional Sampler with parallel tempering (Figure 1b). This method was applied in Chapter 2.

(10)

Figure 1: Workflow of the evaluation of performance for three different samplers to sample the flux solution space constrained by thermodynamics. (a) The thermodynamic and

stoichio-metric metabolic model (TSM) (Niebel et al., 2019) is fitted to data: exchange fluxes (red arrow), metabolite concentrations (red full circle) and Gibbs standard energies of reaction (∆rG0). After

fitting, ∆rG0 is fixed, the rest of the measured variables are bounded according to their standard

deviation (Methods Chapter 2) and variability analysis is performed to find the limits of the TSM feasible space. (b) The TSM feasible space, defined by the TSM constraints and estimated bounds, is divided into flux polytope (A, in blue) and concentration polytope (B, in yellow) to form the TSM flux space (A∩B, in green). Sampling is performed, in the TSM flux space, by considering the existence of a non-empty B for each flux point in A, through three alternative sampling approaches. In Rejection Sampling, A is sampled (grey and black points) and, sequentially, the flux points that fulfil thermodynamic constraints (points in black, in green set) are selected (i.e. non-rejected) to compose the feasible flux sample. In Conditional Sampling, A is sampled conditionally on ther-modynamic feasibility by accepting (full arrow) feasible flux points that fulfil therther-modynamic con-straints (inside green set). Points outside the green set are rejected (dashed arrow) and the previous feasible flux point is stored again. In Conditional Sampling with parallel tempering (PT), the base method is the same as the Conditional Sampler but with several replicas with different relaxation levels (called temperatures) that swap flux points among them to increase set exploration. The lighter green colour represents the relaxed version of the feasible set in green. (c) The last section of the workflow includes applicability testing, autocorrelation analysis and convergence testing. All three samplers are subject to this last section independently from each other. The objective of the workflow is to determine which sampler is the most efficient for a given metabolic network and set of measurements. In applicability testing, the acceptance rate is used to test if a sampler is

(11)

3 Sampler testing criteria

The quality of a sampler can and should be visually and quantitatively assessed to verify that the generated sample forms a good empirical approximation of the target distribution. To evaluate the applicability and performance of the Rejection Sampler (RejS), Conditional Sampler (CondS) and Conditional Sampler with parallel tem-pering (CondS+PT) applied to our problem of metabolic networks constrained with thermodynamics, we exploited an applicability test (based on the acceptance rate), autocorrelation analysis and convergence diagnostics.

Applicability test

The first step in our analysis was to test the applicability of the samplers based on their acceptance rate. If the number of accepted sampled points over the total number of proposed points (i.e. acceptance rate) of a sampler is very low, the computational time needed to achieve a given sample size is high, which makes the sampler inef-ficient. In MCMC samplers there is a setting that can be adjusted and that modifies the acceptance rate of a sampler. This setting is the number of points that are skipped between two consecutive stored sampled points, which we will call the number of skips. In this applicability test we deemed samplers with an acceptance rate above a given threshold as applicable. Since the number of skips can affect the acceptance rate, the tuning and the applicability test were performed iteratively and here we present only the final results.

Autocorrelation analysis

Autocorrelation between consecutive sampled points is a problem that causes the sample produced by an MCMC sampler to not be “diverse” enough and therefore contain lower amount of information compared to an independent (i.e. non-correlated) sample. When sampling, we want to produce a sample that contains as much informa-tion per sample size as possible. Given a correlated sample obtained by an MCMC sampler, the size of an independent sample carrying the same amount of information with the MCMC sample is called the Effective Sample Size (ESS) (Givens & Hoeting, 2012). Given a dependent (correlated) sample, the size of the independent sample

efficient enough to produce a sample in reasonable running time. In autocorrelation analysis, the sampler is tuned and its Effective Sample Size is estimated. In convergence testing, the conver-gence and reproducibility of the sampler are verified.

(12)

that carries the same information as the dependent sample is represented by the ESS. In theory, the ESS calculation should take into account the many dimensions of the feasible space. In practice, however, the multi-dimensional ESS calculation is not yet consolidated, thus here we use the univariate approach (Methods). It is then important to highlight that the calculated ESS should not be taken in absolute terms and it is only used for comparisons.

Practically, we can avoid correlation between the sampled points by skipping points during the sampling. By increasing the number of skips we are usually decreasing autocorrelation. However, the relationship between number of skips and autocorrela-tion is not fully clear in high dimensions, which means that by increasing the number of skips we may not necessarily be decreasing autocorrelation. Additionally, the running time of a sampler increases with the number of skips, promoting a trade-off between running time and autocorrelation. To evaluate the efficiency of a sampler, we compared the running time required to generate a given number of independent points. This comparison was used in this work to tune the sampler, i.e. choose the best number of skips for each sampler, and to find out which sampler was the most efficient among the different algorithms employed. By comparing the running time necessary to generate samples with different number of skips for a given ESS, we could choose the most efficient number of skips for a given sampler. Subsequently, we compared the running time that different samplers took to generate a sample with similar amount of information, i.e. for a given ESS, to evaluate their performance against each other.

As in CondS+PT there are multiple chains in parallel, the tuning method for this sampler was different from the one for CondS. For CondS+PT, we determined the number of skips for each parallel chain such that each chain had an acceptance rate (independent of inter-chain swaps) of 20-40%. This procedure excluded the first (target) and last temperatures for which we defined 1 and 1000 as the number of skips, respectively, because the first could be very constrained and not be able to achieve such large acceptance rates, while the last temperature could be very free and produce higher acceptance rates. Swap rates (i.e. the fraction of accepted flux point swaps between replicas) that are too large indicate that the temperatures are too close in terms of relaxation level, while very low swap rates indicate the opposite. The number of skips, the number of temperatures and their relaxation degree were manually adjusted to produce swap rates of around 20-40% in all parallel chains (Methods Chapter 2; Table S2).

(13)

3

The autocorrelation analysis and calculation of ESS are devised for MCMC methods and, thus, are not applied to the Rejection Sampler. For this sampler we set a fixed number of skips and assumed that, due to the sampler construction, its ESS is equal to the length of the generated sample.

Convergence diagnostics

Several visual and quantitative tests can be used to prove that a sampler provided reasonable estimates of its target distribution statistics, i.e. the sampler converged to its target distribution. When certain statistics (e.g. the mean) of the sampled distri-bution stop changing over the length of the sample, we can assume that the sampler has converged to its target distribution. A visual technique to verify convergence of a sampler as well as its reproducibility is to generate multiple independent sample paths starting from different initial guesses and compare their marginal distributions or statistical properties (e.g. the mean) of such distributions. Additionally, for MCMC methods, as it is the case of the two Conditional Samplers, convergence to a stationary distribution can be assessed through different empirical diagnostics. Since no existing diagnostic is proven to be infallible, here we employed two commonly used conver-gence diagnostics (Methods): Geweke (Geweke, 1992) and Gelman-Rubin (Gelman & Rubin, 1992). Reproducibility was verified for all samplers and the convergence diagnostics were only applied to CondS and CondS+PT.

Sampling of the TSM flux space with the E. coli network model

To evaluate the applicability of the three sampling methods with the E. coli network, we ran the samplers multiple times with different number of skips and calculated the acceptance rate. If the acceptance rate was above the defined threshold of 0.5% for any number of skips, we deemed the sampler applicable to this network. Excluding the energy balance from the model, all three samplers were considered applicable to the E. coli network model (Table 1).

We tuned the samplers based on autocorrelation analysis to find the number of skipped points that produced the lowest running time for a given ESS. We found that 100 skips would produce the lowest running time for the Conditional Sampler (Figure S1). Notably, we found that only two temperatures in the CondS+PT produced a swap rate of 70%, which indicates that the two replicas were similarly explored and that parallel tempering may not significantly help the exploration of the flux space in this case. Since MCMC methods suffer from autocorrelation between sampled points, the

(14)

effective sample sizes of the Conditional Samplers were much lower than the ESS of the Rejection Sampler, although being slightly larger for the sampler with the parallel tempering scheme (Figure 2a).

After the tuning of the samplers, we needed to verify their convergence to the target distribution. To prove reproducibility of all samplers, we ran each sampler 6 times with different starting points and compared the flux means of the different runs for each sampler. The Pearson correlation coefficient (r) between flux means of the different runs was generally higher for the Rejection Sampler than for the Conditional Samplers, but above 0.994 for all samplers and comparisons, which indicates the reproducibility of all samplers (Figure 2b). As additional convergence testing, we applied the Geweke and Gelman-Rubin diagnostics to the two Condi-tional Samplers. Since both convergence diagnostics are univariate, to account for all dimensions we considered that if at least one of the diagnostics showed convergence for at least 90% of the dimensions in the sample, the sample path converged to its target distribution. Both MCMC samplers (Conditional Sampler with and without parallel tempering) tested positive for convergence for over 90% of dimensions with one of the diagnostics and over 80% of dimensions converged with both diagnostics (Table S1).

Ultimately, the performance of the three samplers was compared through estimation of the running time to produce an independent sample of 100 points according to the estimated ESS (presented in Figure 2a). Since the Rejection Sampler produced much larger ESS and, computation-wise, it is the simplest method of the three, it is the ideal choice for sampling the TSM flux space of the E. coli model, with the shortest running time (Figure 2c). As it was expected, parallel tempering did not provide additional benefits compared to the Conditional Sampler, as we could infer by comparing their

Sampler

Rejection Conditional

Conditional + Parallel Tempering

AR

4% 13% 41%

Table 1: Applicability of different sampling methods to the TSM flux space with the

E. coli metabolic network excluding the energy balance constraint. AR is the

ac-ceptance rate of the samplers for the chosen number of skips in the tuning – 100 skips for Rejection and Conditional Samplers and 20 skips for each temperature in the Condition-al Sampler with ParCondition-allel Tempering. Samplers with AR > 0.5% are deemed applicable.

(15)

3

running times (Figure 2c). The latter method is usually beneficial because it can be parallelised, which is of weak help if we only use two temperatures. Regardless of their performance, the samplers seem to have achieved the same results, as the means of the different dimensions did not deviate among the samplers (Figure 2d). In conclusion, the Rejection Sampler greatly outperformed the Conditional Samplers in sampling the TSM flux space with the E. coli netwok model.

Figure 2: Comparison of sampler performance when sampling the TSM flux space with the E. coli network model. (a) Effective sample size calculated from a sample

of size 10000 for Rejection Sampler (RejS, red), Conditional Sampler (CondS, green) and Conditional Sampler with parallel tempering (CondS+PT, yellow). (b) Pearson cor-relation coefficient (r) for comparison of flux means between 6 sampler runs (for each sampler independently, x-axis) with different starting points. Flux means compared for all possible combinations among the 6 runs (15 combinations). (c) Running time for the samplers to generate the equivalent of an independent sample of 100 points, cor-responding to 100, 6000 and 5300 points for RejS, CondS and CondS+PT respectively. Running time of CondS+PT corresponds to maximum parallelisation: 2 temperatures and 2 cores. (d) Means of all dimensions compared among the three sampling methods. The identity line in light grey is included to ease axis comparison. Pearson correlation r represented in the bottom right for the comparisons of means: CondS vs RejS (green) and CondS+PT vs RejS (yellow). Flux means in mmol/g_DW/h.

(16)

When we include the energy balance in the TSM model, the TSM flux space becomes significantly more constrained, and consequently it becomes more difficult to find feasible points during sampling. This is because the energy balance equality involves most of the variables in the model. In this case, only the most sophisticated sampler, the Conditional Sampler with parallel tempering, was capable of exploring the set with 20% acceptance rate and it was therefore the only method deemed applicable, as opposed to the other samplers with 0% acceptance rate that were not able to move in such a constrained set. In this case, four temperatures were necessary to provide a good exploration of the TSM flux space (Table S2). We considered that this sampler converged to its target distribution as 81 and 92% of dimensions converged according to the Geweke and Gelman-Rubin diagnostics, respectively (Table S1). Reproducibility was also proven by comparing the flux means of six sampling runs, producing r values (Pearson correlation) above 0.85.

Summarising, when sampling the TSM flux space with the E. coli network while excluding the energy balance, the three samplers were considered applicable and all three were proven reproducible and produced similar flux results regarding flux means. With respect to performance, the Rejection Sampler was found to be the most efficient sampler with a large ESS and a lower running time compared to the other samplers. Contrarily, when including the energy balance constraint in this flux space, only the Conditional Sampler with parallel tempering was deemed able to perform sampling.

Sampling of the TSM flux space with the yeast network model

To evaluate the performance of the different sampling methods individually and against each other, when sampling the TSM flux space with the yeast network model excluding the energy balance, we followed the same approach described before for the E. coli network model. For the larger network, sampling with the Rejection Sam-pler resulted in an extremely low acceptance rate, below the 0.5% defined threshold, potentially due to the large difference in size of the target set (A∩B) compared to the size of the flux polytope (Figure 1b). Thus, this sampler was considered not applicable and therefore not included in further analyses. On the other hand, the Conditional Sampler, with and without parallel tempering, was capable of exploring the set and finding thermodynamically feasible points (Table 2) with acceptance rates around 15%. For the Conditional Sampler (CondS) we found that the best number of skips was 100 (Figure S2). Through tuning the number of temperatures and number of skips of the Conditional Sampler with Parallel Tempering, we achieved four temperature

(17)

3

levels each with an average acceptance rate around 17%. As it turned out, the effec-tive sample size was roughly the same for the Conditional Sampler with and without parallel tempering (Figure 3a), which indicates a similar level of autocorrelation and, consequently, that Parallel Tempering had a small additional effect on reducing the autocorrelation between sampled points.

In terms of convergence, we, again, ran each sampler 6 times with diverse starting points and compared the means of the resulting flux distributions. The Pearson cor-relation coefficient (r) showed a strong corcor-relation between means of different runs for both samplers (Figure 3b), with r distributions above 0.999 (excluding outliers). Regarding convergence diagnostics, the percentage of converged reactions was over 80% with the Gelmin-Rubin diagnostic and over 90% with the Geweke diagnostic for both Conditional Samplers (Table S1). With these results, we concluded that both samplers converged to their stationary distribution.

Finally, we compared the performance of CondS with CondS+PT in sampling the TSM flux space with the yeast network model. With a similar effective sample size for the two methods, the complexity of the algorithm had an important impact on the running time difference. Even while using full parallelisation power (one CPU core per temperature), the Conditional Sampler with parallel tempering was over two times slower than without parallel tempering (Figure 3c). In this case, even though there were four temperature levels swapping points, parallel tempering seemed to bring no benefit in terms of performance (Figure 3c). Comparing the flux means from the distributions produced by the two samplers, a high degree of correlation, indicated by the large r, and points gathered on the identity line showed that both methods converged to similar flux distributions (Figure 3d).

Sampler

Rejection Conditional

Conditional + Parallel Tempering

AR

0% 15% 16%

Table 2: Applicability of different sampling methods to the TSM flux space with the yeast metabolic network excluding the energy balance constraint. AR is the

ac-ceptance rate of the samplers for the chosen number of skips in the tuning, or for any at-tempted number of skips in the case of the Rejection Sampler. The number of skips was 100 for the Conditional Sampler and 100 for the four temperatures in the Conditional

(18)

Similarly to the case of the E. coli network model, the Conditional Sampler with Parallel Tempering was the only method capable of sampling the more constrained set of the TSM flux space including the energy balance constraint in the yeast network model. Since the Rejection Sampler was not capable of sampling the less constrained set without the energy balance, it was expected that it would produce an

Figure 3: Comparison of sampler performance when sampling the TSM flux space with the yeast network model. (a) Effective sample size calculated from a sample

of size 30000 for Conditional Sampler (CondS, green) and Conditional Sampler with Parallel Tempering (CondS+PT, yellow). (b) Pearson correlation r value for comparison of flux means between 6 sampler runs (for each sampler independently) with different starting points. Flux means compared for all possible combinations among the 6 runs (15 combinations). Crosses represent outlier points. (c) Running time for the samplers to generate the equivalent of an independent sample of 10 points, corresponding to 4690 and 4800 points for CondS and CondS+PT respectively. Running time of CondS+PT corresponds to maximum parallelisation: 4 temperatures and 4 cores. (d) Means of all dimensions compared among the three sampling methods. The identity line in light grey is included to ease axis comparison. Pearson correlation r represented in the bot-tom right for the comparison of means between CondS and CondS+PT. Flux means in mmol/gDW/h.

(19)

3

even smaller acceptance rate (0%) in the more constrained set. Also, the Conditional Sampler without parallel tempering was not able to move in the more constrained set, producing a 0% acceptance rate as well. Interestingly, even though this network is significantly larger than the E. coli network, parallel tempering is still capable of greatly improving the efficiency of the Conditional Sampler by allowing it to find feasible points in this more constrained, high-dimensional set. We found that six temperature levels were required to achieve an efficient swap of sampled points (Table S2). In terms of reproducibility, the Pearson correlation coefficients of the comparisons between the flux means of different runs were all above 0.99, indicating the sampler’s reproducibility.

In summary, the Rejection Sampler was deemed not applicable in the case of the yeast network model, and the Conditional Sampler was found to be the most efficient sampler compared to the Conditional Sampler with parallel tempering (CondS+PT), when sampling the TSM flux space of the yeast network excluding the energy balance constraint. Similarly to the smaller E. coli network, only the Conditional Sampler with Parallel Tempering was able to move in the highly con-strained flux space when the energy balance was included in the TSM model of the yeast network model.

DISCUSSION

We applied three sampling methods based on the split of the TSM feasible space into convex polytopes to two metabolic networks of different sizes and discussed their benefits and hindrances in each case. The three samplers studied in this work can be used to characterise the feasible thermodynamic set of a metabolic network constrained by measurement data if there is no need to enforce the energy balance constraint during sampling, which is the most common case in flux inference problem formulations. While all three methods were applicable to the E. coli smaller network, the most efficient approach for this network was undoubtedly the Rejection Sampler. Through selection of feasible points post-exploration, this sampler avoids autocorrela-tion and due to its simple movement and acceptance mechanism, it is fast compared to the other methods. However, for a network with roughly three times the number of reactions, it becomes too inefficient in finding feasible points in the TSM flux space. This is likely due to the relative size of the TSM flux space compared to the size of the flux polytope, which is much smaller in the larger network. The Conditional Sam-pler with and without parallel tempering could be applied to the yeast network and

(20)

produced similar Effective Sample Sizes. However, the implementation complexity introduced by the parallel tempering scheme increased the running time and, in this case, it was not compensated by parallelisation. We can therefore conclude that, for networks of similar size and complexity as the ones used in this work, if the energy balance equation is excluded, parallel tempering does not increase the efficiency of the Conditional sampler.

Even though the energy balance is regularly implicitly fulfilled through other con-straints, there are cases where it is important to include it in the problem formulation. We may, for example, want to consider the variability in Δ_rG0_{values instead of using}

them as fixed parameters. When we introduce this constraint into our sampling prob-lem, we create a much more constrained set because this equation is an equality that involves many variables in the model. According to our analysis, only the Conditional Sampler with Parallel Tempering was able to move in this constrained set efficiently enough to produce a feasible sample in reasonable running time. Notably, this was true for both networks of different sizes as, even for the smaller network, parallel tempering was required to move in the constrained set. Furthermore, when we moved to a larger network we were still able to use this sampling approach. The Conditional Sampler with parallel tempering is therefore an efficient tool to characterise the fea-sible set of metabolic fluxes constrained with thermodynamics when including the energy balance constraint.

We expect that our analysis contributes to increase the confidence in sampling the TSM flux space with the approach exploited in this work. Thus far, avail-able methods to sample the flux space with thermodynamic constraints consider exclusively the loop law, do not take advantage of the information contained in metabolite concentration measurements, and are mostly applied to small metabolic networks (Bordel et al., 2010; Chaudhary et al., 2016; Price et al., 2003; Saa & Nielsen, 2016). Even though the feasible space of (v, lnc, ∆_rG) is non-convex, we

can still sample the flux space conditional on the feasibility of thermodynamic constraints and hence overcome the non-convexity issue through reformulation of the sampling problem. We show that, within the new problem formulation, several alternative sampling approaches can be used according to the size and complexity of the given metabolic network. We envision that for an even larger network than the yeast network used in this work, parallel tempering will be required to increase the efficiency of the Conditional Sampler to provide a thorough characterisation of the TSM flux space.

(21)

3 METHODS

Metabolic networks and experimental data used in the steps

of fitting and variability analysis

Prior to sampling, we fitted the TSM model to exchange fluxes, metabolite concen-trations and standard Gibbs energies (Δ_rG0_{), as done in Chapter 2. The Δ}

rG0 values used in the fitting were obtained with the Component Contribution method (Noor et al., 2013). In the case of the yeast model, the exchange fluxes and metabolite concentrations were obtained experimentally as described in Methods of Chapter 2. The yeast network and the data on glucose growth used for the fitting are specified in Chapter 2. For the E. coli model, the metabolic network is from (Orth, Flem-ing, et al., 2010) (Table S1-5) and the data from (Gerosa et al., 2015). The TSM model of E. coli was fitted to one single carbon-source condition with glucose as substrate, three measured exchange fluxes, 38 metabolite concentrations and 72 Δ_rG0_{(Table S6-8).}

Effective Sample Size estimation

Sampling algorithms need to be tuned to the problem they are applied to. Skipping a number of points between two consecutive sampled points is required to improve the quality of the sample by lowering autocorrelation. For a given Markov chain we can calculate the minimum number of steps (τ) over which the autocorrelation of the chain becomes equal to zero, implying that values of the chain separated by more than

τ steps can be considered to be uncorrelated. This number of steps is often referred to

as the autocorrelation time τ and it is calculated as τ = 1 + 2 ∑_kρ(k) where ρ denotes

the autocorrelation of lag k and 0 ≤ ρ ≤ 1. The effective sample size is then inversely proportional to the autocorrelation time, ESS = N/τ with N as the sample size. The definition of τ (and ESS) is well defined for scalar (one-dimensional) Markov chains, and thus we determined it separately for each dimension of our multi-dimensional Markov chains. Autocorrelation estimates obtained from Markov chain sample paths are inevitably noisy, and may not reach zero due to stochasticity. For this reason we needed to define a threshold below which autocorrelation is considered to be sufficiently low. As a threshold for low autocorrelation, we verified when the 10% confidence interval of the autocorrelation estimate contained the value 0. Thus, to calculate τ we sum the autocorrelation until the threshold is reached. Subsequently, to account for all dimensions, we defined the overall autocorrelation time (T) to be the median of τ over all dimensions, and the overall ESS was calculated from T.

(22)

To estimate the optimal number of skips for the Conditional Sampler, we compared the running times required to generate a sample of size φ while varying the number of skips. We used φ = n × T for each sampler run, where n is 100 for the TSM flux space of the E. coli network and 10 for the yeast network. The optimal choice of number of skips was the one that produced the lowest running time. For the RejS, which is not a Markov process, we assume that the sampled points are independent of each other and therefore the effective sample size was equal to the size of the sample. Prior to tuning, a sampler may generate a very low acceptance rate and therefore be deemed not applicable. However, after tuning it may be that its acceptance rate increases significantly. Therefore, tuning and applicability were tested in an iterative way and here we only presented the final result of the iterations.

Convergence testing

After tuning a sampler, it is necessary to verify that it converges to its stationary dis-tribution. At initialisation, the sampler may take some time to enter a high-probability region on its stationary distribution. For this reason, a first set of points, the so-called burn-in sample, was always discarded from any generated sample path. To verify the convergence for MCMC samplers we can use different diagnostics, from which we chose two. Additionally, we wanted to show that each sampler reproducibly converges to the same stationary distribution. This test could be also applied to the Rejection Sampler that is not an MCMC sampler. To verify sampler reproducibility, we compared the sampled means of fluxes from different sampler runs, and assessed their similarity. The Geweke convergence diagnostic (Geweke, 1992) compares the first 10% of a sample path to segments of the last 50% of the same sample path with a z-test for difference in means for each segment. A z-test compares two distributions regard-ing their means, through z-scores - number of standard deviations away from mean. Failing to reject the null hypothesis of the z-tests (equal means), provides evidence of convergence. In other words, we can conclude that the sampler run has converged to its target distribution if the mean of the first part of the sample path is not significantly different from the mean of the last part of the sample path. We considered that a sam-pler run had converged if the z-cores of the z-test were below 1.98 for all segments, which represents a significance level of 2%.

The Gelman-Rubin convergence diagnostic (Gelman & Rubin, 1992) is applied to multiple sample paths (multiple runs of the same sampler) and it is based on a com-parison of within-chain (W) and between chain (B) variances. Within-chain variance

(23)

3

is calculated through , where M is the number of chains and s is the variance of each chain. Between-chain variance is calculated by

, where N is the chain length, θ the parameter (flux) and is the mean of each of the

M means. The estimated variance of θ is and the potential scale reduction factor (R), which is the score of the Gelman-Rubin diagnostic, is

R should be close to 1 which indicates that the between chain variance is small,

meaning the chains are mixing around the stationary distribution. Here, we ran the sampler three times (M) and determined R from within and between-chain variances. We assumed that for R lower than 1.2 (Gelman & Rubin, 1992) per dimension, we obtained convergence.

AUTHOR CONTRIBUTIONS

JS, AMA and MH designed the study. JS ran all simulations, performed all analysis and made all the figures. DM provided the method to divide the flux polytope in sec-tors. JS, AMA and MH wrote the manuscript.



(24)

Alves, R., & Savageau, M. A. (2000). Comparing systemic properties of ensembles of biological networks by graphical and statistical methods. Bioinformatics, 16(6), 527–533. https://doi.org/10.1093/bioinformatics/16.6.527

Beard, D. A., Liang, S., & Qian, H. (2002). Energy Balance for Analysis of Complex Metabolic Networks. In Biophysical Journal (Vol. 83, Issue 1, pp. 79–86). https:// doi.org/https://doi.org/10.1016/S0006-3495(02)75150-3

Binns, M., de Atauri, P., Vlysidis, A., Cascante, M., & Theodoropoulos, C. (2015). Sampling with poling-based flux balance analysis: optimal versus sub-optimal flux space analysis of Actinobacillus succinogenes. BMC Bioinformatics, 16, 49. https://doi.org/10.1186/s12859-015-0476-5

Bordel, S., Agren, R., & Nielsen, J. (2010). Sampling the solution space in genome-scale metabolic networks reveals transcriptional regulation in key enzymes. PLoS

Computational Biology, 6(7), e1000859–e1000859. https://doi.org/10.1371/

journal.pcbi.1000859

Braunstein, A., Muntoni, A. P., & Pagnani, A. (2017). An analytic approximation of the feasible space of metabolic networks. Nature Communications, 8, 14915. https://doi.org/10.1038/ncomms14915

Casella, G., Robert, C. P., & Wells, M. T. (2004). Generalized Accept-Reject Sampling Schemes. Lecture Notes-Monograph Series, 45, 342–347. http://www.jstor.org/ stable/4356322

Chaudhary, N., Tøndel, K., Bhatnagar, R., Martins, dos S., & Puchałka, J. (2016). Characterizing the optimal flux space of genome-scale metabolic reconstruc-tions through modified latin-hypercube sampling. Molecular BioSystems, 12(3), 994–1005. https://doi.org/10.1039/C5MB00457H

REFERENCES

(25)

3

Chen, M.-H., & Schmeiser, B. W. (1996). General Hit-and-Run Monte Carlo sampling for evaluating multidimensional integrals. Operations Research Letters, 19(4), 161–169. https://doi.org/https://doi.org/10.1016/0167-6377(96)00030-2

Fleming, R. M. T., Maes, C. M., Saunders, M. A., Ye, Y., & Palsson, B. Ø. (2012). A variational principle for computing nonequilibrium fluxes and potentials in genome-scale biochemical networks. In Journal of Theoretical Biology (Vol. 292, pp. 71–77). https://doi.org/https://doi.org/10.1016/j.jtbi.2011.09.029

Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Statist.Sci., 7(4), 457–472. https://doi.org/10.1214/ss/1177011136” Gerosa, L., Haverkorn van Rijsewijk, B. R. B., Christodoulou, D., Kochanowski,

K., Schmidt, T. S. B., Noor, E., & Sauer, U. (2015). Pseudo-transition Analysis Identifies the Key Regulators of Dynamic Metabolic Adaptations from Steady-State Data. Cell Systems, 1(4), 270–282. https://doi.org/https://doi.org/10.1016/j. cels.2015.09.008

Geweke, J. (1992). Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments. IN BAYESIAN STATISTICS, 169–193. Givens, G. H., & Hoeting, J. A. (2012). Integration and Simulation; Numerical

Integration; Simulation and Monte Carlo Integration; Markov Chain Monte Carlo; Advanced Topics in MCMC. In Computational Statistics;

Computa-tional Statistics; ComputaComputa-tional Statistics; ComputaComputa-tional Statistics; Compu-tational Statistics (pp. 127; 129; 151; 201; 237–127; 149; 199; 235; 283).

https://doi.org/doi:10.1002/9781118555552.part2; doi:10.1002/9781118555552. ch5; doi:10.1002/9781118555552.ch6; doi:10.1002/9781118555552.ch7; doi:10.1002/9781118555552.ch8

Gopalakrishnan, S., & Maranas, C. D. (2015). 13C metabolic flux analysis at a genome-scale. In Metabolic Engineering (Vol. 32, pp. 12–22). https://doi.org/https://doi. org/10.1016/j.ymben.2015.08.006

Henry, C. S., Broadbelt, L. J., & Hatzimanikatis, V. (2007). Thermodynamics-based metabolic flux analysis. Biophysical Journal, 92(5), 1792–1805. https://doi. org/10.1529/biophysj.106.093138

Kümmel, A., Panke, S., & Heinemann, M. (2006). Putative regulatory sites unraveled by network-embedded thermodynamic analysis of metabolome data. Molecular

Systems Biology, 2, 2006.0034-2006.0034. https://doi.org/10.1038/msb4100074

Mahadevan, R., & Schilling, C. H. (2003). The effects of alternate optimal solutions in constraint-based genome-scale metabolic models. Metabolic Engineering, 5(4),

(26)

264–276. https://doi.org/https://doi.org/10.1016/j.ymben.2003.09.002

Nidelet, T., Brial, P., Camarasa, C., & Dequin, S. (2016). Diversity of flux distribution in central carbon metabolism of S. cerevisiae strains from diverse environments. In Microbial Cell Factories (Vol. 15, Issue 1). https://doi.org/10.1186/s12934-016-0456-0

Niebel, B., Leupold, S., & Heinemann, M. (2019). An upper limit on Gibbs energy dissipation governs cellular metabolism. Nature Metabolism, 1(1), 125–132. https://doi.org/10.1038/s42255-018-0006-7

Noor, E., Haraldsdottir S., H., Milo, R., & Fleming, R. M. T. (2013). Consistent Esti-mation of Gibbs Energy Using Component Contributions. PLoS Computational

Biology, 9(7), e1003098. https://doi.org/10.1371/journal.pcbi.1003098

Noor, E., Lewis, N. E., & Milo, R. (2012). A proof for loop-law constraints in stoi-chiometric metabolic networks. BMC Systems Biology, 6(1), 140. https://doi. org/10.1186/1752-0509-6-140

Orth, J. D., Fleming, R. M. T., & Palsson, B. Ø. (2010). Reconstruction and Use of Microbial Metabolic Networks: the Core Escherichia coli Metabolic Model as an Educational Guide. EcoSal Plus, 4(1). https://doi.org/10.1128/ecosalplus.10.2.1 Orth, J. D., Thiele, I., & Palsson, B. Ã. (2010). What is flux balance analysis? Nature

Biotechnology, 28(3), 245–248. https://doi.org/10.1038/nbt.1614

Price, N. D., Papin, J. A., Schilling, C. H., & Palsson, B. O. (2003). Genome-scale mi-crobial in silico models: the constraints-based approach. Trends in Biotechnology,

21(4), 162–169. https://doi.org/https://doi.org/10.1016/S0167-7799(03)00030-1

Price, N. D., Thiele, I., & Palsson, B. Ø. (2006). Candidate States of Helicobacter pylori’s Genome-Scale Metabolic Network upon Application of “Loop Law” Thermodynamic Constraints. In Biophysical Journal (Vol. 90, Issue 11, pp. 3919–3928). https://doi.org/https://doi.org/10.1529/biophysj.105.072645 Saa, P. A., & Nielsen, L. K. (2016). ll-ACHRB: a scalable algorithm for sampling the

feasible solution space of metabolic networks. Bioinformatics (Oxford, England),

32(15). https://doi.org/https://doi.org/10.1093/bioinformatics/btw132

Schellenberger, J., Lewis, N. E., & Palsson, B. Ã. (2011). Elimination of thermody-namically infeasible loops in steady-state metabolic models. Biophysical Journal,

100(3), 544–553. https://doi.org/10.1016/j.bpj.2010.12.3707

Schellenberger, J., & Palsson, B. Ã. (2009). Use of Randomized Sampling for Analysis of Metabolic Networks. Journal of Biological Chemistry, 284(9), 5457–5461. http://www.jbc.org/content/284/9/5457.abstract

(27)

3

Sweetlove, L. J., Obata, T., & Fernie, A. R. (2014). Systems analysis of metabolic phenotypes: what have we learnt? Trends in Plant Science, 19(4), 222–230. https:// doi.org/http://dx.doi.org/10.1016/j.tplants.2013.09.005

Swendsen, R. H., & Wang, J.-S. (1986). Replica Monte Carlo Simulation of Spin-Glasses.

(28)

SUPPLEMENTARY FIGURES

Figure S1: Autocorrelation analysis on Conditional Sampler to sample the TSM flux space of the E. coli network excluding the energy balance constraint. (a):

Ef-fective sample size distribution over dimensions, colour-coded for different number of skips. (b) Running time for the sampler to generate a sample path of size 100 times the overall autocorrelation time (T) for different number of skips. Overall autocorrelation time (T) is represented above plot bars for each number of skips. Sample size (N) for Effective Sample Size calculation is 10000.

(29)

3

Figure S2: Autocorrelation analysis on Conditional Sampler to sample the TSM flux space of the yeast network excluding the energy balance constraint. (a)

Ef-fective sample size distribution over dimensions, colour-coded for different number of skips. (b) Running time for the sampler to generate a sample path of size 10 times the overall autocorrelation time (T) for different number of skips. Overall autocorrelation time (T) is represented above plot bars for each number of skips. T and running time for yeast network with 5000 skips are 3096 and 75 min. Sample size (N) for Effective Sample Size calculation is 30000.

(30)

SUPPLEMENTARY TABLES

Table S1: Convergence diagnostics results for the different sampling methods in

E. coli and Yeast networks. According to two different diagnostics – Geweke and

Gel-man-Rubin – chains are proved to converge for most of their dimensions at different sample sizes. Size represents sample path size. Incl. EB indicates that the energy balance constraint was included to define the set; when not specified, the energy balance was left out. Rejection Sampling is not a MCMC method thus it is not included in the table.

Table S2: Temperature levels and respective number of skips used in Conditional Sampling with Parallel Tempering for the two metabolic networks. Incl. EB

indi-cates that the energy balance constraint was included to define the set; when not speci-fied, the energy balance was left out.

93 90 100 94 81 91 Geweke 81 83 86 83 92 78 Gelman-Rubin CondS CondS CondS+PT CondS+PT CondS+PT (incl. EB) CondS+PT (incl. EB)

Sampler E. coli Yeast E. coli Yeast E. coli Yeast Network 5000 32000 3000 7000 8000 18000

Size % dimensions converged

E. coli E. coli incl.EB Yeast Yeast incl. EB Model 1, 1000 1, 10, 10 , 1000 1, 100, 100, 1000 1, 5, 5, 20, 1000 Number of skips 1, 900 1, 5.5, 13, 900 1, 4.5, 26, 900 1, 7, 10, 14, 23, 900 Temperature levels

(31)

3

Table S3: Metabolites of the E. coli network model.

13dpg 2pg 3pg 6pgc 6pgl ac acald accoa acon-C actp adp akg amp atp charge cit co2tot coa dhap e4p etoh f6p fdp for fru fum g3p g6p glc-D gln-L glu-L glx C00236 C00631 C00197 C00345 C01236 C00033 C00084 C00024 C00417 C00227 C00008 C00026 C00020 C00002 C00158 C00288 C00010 C00111 C00279 C00469 C00085 C00354 C00058 C00095 C00122 C00661 C00092 C00031 C00064 C00025 C00048 C3H8O10P2 C3H7O7P C3H7O7P C6H13O10P C6H11O9P C2H4O2 C2H4O C23H38N7O17P3S C6H6O6 C2H3O5P C10H15N5O10P2 C5H6O5 C10H14N5O7P C10H16N5O13P3 C6H8O7 HCO3 C21H36N7O16P3S C3H7O6P C4H9O7P C2H6O C6H13O9P C6H14O12P2 CH2O2 C6H12O6 C4H4O4 C3H7O6P C6H13O9P C6H12O6 C5H10N2O3 C5H9NO4 C2H2O3

Metabolite ID Formula KEGG ID

3-Phospho-D-glyceroyl phosphate 2-Phospho-D-glycerate 3-Phospho-D-glycerate 6-Phospho-D-gluconate D-Glucono-1,5-lactone 6-phosphate Acetate Acetaldehyde Acetyl-CoA cis-aconitate Acetyl phosphate ADP 2-Oxoglutarate AMP ATP unspecific charge Citrate HCO3-CoA Glycerone phosphate D-Erythrose 4-phosphate Ethanol D-Fructose 6-phosphate D-Fructose 1,6-bisphosphate Formate D-fructose Fumarate D-Glyceraldehyde 3-phosphate D-Glucose 6-phosphate D-Glucose L-Glutamine L-Glutamate Glyoxylate Name

(32)

h h2o icit lac-D mal-L nad nadh nadp nadph nh3 o2 oaa pep pi pyr q8 q8h2 r5p ru5p-D s7p succ succoa xu5p-D C00080 C00001 C00311 C00256 C00149 C00003 C00004 C00006 C00005 C00014 C00007 C00036 C00074 C00009 C00022 C00399 C00390 C00117 C00199 C05382 C00042 C00091 C00231 H H2O C6H8O7 C3H6O3 C4H6O5 C21H28N7O14P2 C21H29N7O14P2 C21H29N7O17P3 C21H30N7O17P3 NH3 O2 C4H4O5 C3H5O6P H3PO4 C3H4O3 C14H18O4(C5H8)n C14H20O4(C5H8)n C5H11O8P C5H11O8P C7H15O10P C4H6O4 C25H40N7O19P3S C5H11O8P

Metabolite ID Metabolite ID KEGG ID

H+ H2O Isocitrate D-lactate (S)-Malate NAD+ NADH NADP+ NADPH NH3 Oxygen Oxaloacetate Phosphoenolpyruvate Orthophosphate Pyruvate Ubiquinone Ubiquinol alpha-D-Ribose 5-phosphate D-Ribulose 5-phosphate Sedoheptulose 7-phosphate Succinate Succinyl-CoA D-Xylulose 5-phosphate Number of skips

Table S4: Names and stoichiometry of the reactions of the E. coli network model. The

square brackets indicate the compartmental location of the metabolites, i.e. extracellular space (e) or cytosol (c).

PGI PFK FBA Reaction ID Glucose-6-phosphate isomerase Phosphofructokinase Fructose-bisphosphate aldolase Reaction name g6p[c] <=> f6p[c] atp[c] + f6p[c] <=> adp[c] + fdp[c] fdp[c] <=> dhap[c] + g3p[c] Stoichiometry

(33)

3

TPI FBP GAPD ENO PYK PGL RPI TKT1 RPE TALA TKT2 PDH CS ICDHyr AKGDH FRD7 FUMA MDH ICL MALS PPC Reaction ID Triose-phosphate isomerase Fructose-bisphosphatase Glyceraldehyde-3-phosphate dehydrogenase, phosphofructokinase Phosphoglycerate mutase, enolase Pyruvate kinase Glucose 6-phosphate dehydrogenase, 6-phosphogluconolactonase, phosphoglycerate kinase Ribose-5-phosphate isomerase Transketolase Ribulose 5-phosphate 3-epimerase Transaldolase Transketolase Pyruvate dehydrogenase Citrate synthase Isocitrate dehydrogenase (NADP) Oxoglutarate dehydrogenase, succinate--CoA ligase

(ADP-forming) Succinate dehydrogenase Fumarase Malate dehydrogenase Isocitrate lyase Malate synthase Phosphoenolpyruvate carboxylase Reaction name dhap[c] <=> g3p[c] fdp[c] + h2o[c] <=> f6p[c] + pi[c] g3p[c] + nad[c] + pi[c] + adp[c] <=> 3pg[c]

+ nadh[c] + atp[c] 3pg[c] <=> h2o[c] + pep[c] adp[c] + pep[c] <=> atp[c] + pyr[c]

g6p[c] + h2o[c] + (2) nadp[c] <=> co2tot[c] + (2) nadph[c] + ru5p-D[c] r5p[c] <=> ru5p-D[c] r5p[c] + xu5p-D[c] <=> g3p[c] + s7p[c] ru5p-D[c] <=> xu5p-D[c] g3p[c] + s7p[c] <=> e4p[c] + f6p[c] e4p[c] + xu5p-D[c] <=> f6p[c] + g3p[c] coa[c] + nad[c] + pyr[c] + h2o[c] <=> accoa[c] + co2tot[c]

+ nadh[c]

accoa[c] + h2o[c] + oaa[c] <=> icit[c] + coa[c] icit[c] + nadp[c] + h2o[c] <=> akg[c] + co2tot[c] +

nadph[c]

akg[c] + adp[c] + pi[c] + nad[c] + h2o[c] <=> co2tot[c] + nadh[c] + atp[c] + succ[c]

fum[c] + q8h2[c] <=> q8[c] + succ[c] fum[c] + h2o[c] <=> mal-L[c] mal-L[c] + nad[c] <=> nadh[c] + oaa[c]

icit[c] <=> glx[c] + succ[c]

accoa[c] + glx[c] + h2o[c] <=> coa[c] + mal-L[c] co2tot[c] + pep[c] <=> oaa[c] + pi[c]

(34)

PPS PPCK ACKr ACALDDH PFL ALCD2x GLUSy GLNS GLUN GLUDy ME1 ME2 ATPHYD ADK1 NADTRHD MALt2_2_-2 MALt2_2_-2+2H SUCCt2_2_-2 SUCCt2_2_-2+2H ACt2r_-1 ACt2r_-1+1H AKGt2r_-2 AKGt2r_-2+2H CO2t_0 ETOHt2r_0 Reaction ID Phosphoenolpyruvate synthase Phosphoenolpyruvate carboxykinase Acetate kinase, phosphotransacetylase Acetaldehyde dehydrogenase (acetylating) Pyruvate formate lyase Alcohol dehydrogenase (ethanol) Glutamate synthase (NADPH) Glutamine synthetase Glutaminase Glutamate dehydrogenase (NADP) Malic enzyme (NADH) Malic enzyme (NADPH)

ATP hydrolysis Adenylate kinase NAD transhydrogenase Malate transport Malate transport Succinate transport Succinate transport Acetate transport Acetate transport 2-Oxoglutarate transport 2-Oxoglutarate transport Carbon dioxide transport

Ethanol transport

Reaction name

atp[c] + h2o[c] + pyr[c] <=> amp[c] + pep[c] + pi[c] atp[c] + oaa[c] + h2o[c] <=> adp[c] + co2tot[c] + pep[c]

ac[c] + atp[c] + coa[c] <=> accoa[c] + pi[c] + adp[c] acald[c] + coa[c] + nad[c] <=> accoa[c] + nadh[c]

coa[c] + pyr[c] <=> accoa[c] + for[c] etoh[c] + nad[c] <=> acald[c] + nadh[c] akg[c] + gln-L[c] + nadph[c] <=> (2) glu-L[c] + nadp[c] atp[c] + glu-L[c] + nh3[c] <=> adp[c] + gln-L[c] + pi[c]

gln-L[c] + h2o[c] <=> glu-L[c] + nh3[c] glu-L[c] + h2o[c] + nadp[c] <=> akg[c] + nadph[c] +

nh3[c]

mal-L[c] + nad[c] + h2o[c] <=> co2tot[c] + nadh[c] + pyr[c]

mal-L[c] + nadp[c] + h2o[c] <=> co2tot[c] + nadph[c] + pyr[c]

atp[c] + h2o[c] <=> adp[c] + pi[c] amp[c] + atp[c] <=> (2) adp[c] nad[c] + nadph[c] <=> nadh[c] + nadp[c]

mal-L[e] <=> mal-L[c] mal-L[e] <=> mal-L[c] succ[e] <=> succ[c] succ[e] <=> succ[c] ac[e] <=> ac[c] ac[e] <=> ac[c] akg[e] <=> akg[c] akg[e] <=> akg[c] co2tot[e] <=> co2tot[c] etoh[e] <=> etoh[c] Stoichiometry

(35)

3

FORt2_-1 FORt2_-1+1H FUMt2_2_-2 FUMt2_2_-2+2H GLCpts_0 H2Ot D_LACt2_-1 D_LACt2_-1+1H PIt2r_-2 PIt2r_-2+2H PYRt2r_-1 PYRt2r_-1+1H O2t_0 CYTBD THD2 NADH16 ATPS4r Ht_1 Cex nh3t_0 EX_ac EX_co2tot EX_etoh EX_for EX_glc EX_h EX_h2o EX_nh3 EX_o2 EX_pi Reaction ID Formate transport Formate transport Fumarate transport Fumarate transport D-glucose transport via pep:pyr PTS (periplasm) H2O transport Lactate transport Lactate transport Phosphate transport Phosphate transport Pyruvate transport Pyruvate transport Oxygen transport Cytochrome oxidase bd (periplasm) NAD(P) transhydrogenase (periplasm) NADH dehydrogenase (periplasm) ATP synthase (periplasm)

H+ transport Charge transport NH3 transport Acetate exchange HCO3- exchange Ethanol exchange Formate exchange D-glucose exchange H+ exchange H2O exchange NH3 exchange O2 exchange Phosphate exchange Reaction name for[e] <=> for[c] for[e] <=> for[c] fum[e] <=> fum[c] fum[e] <=> fum[c] glc-D[e] + pep[c] <=> g6p[c] + pyr[c]

h2o[e] <=> h2o[c] lac-D[e] <=> lac-D[c] lac-D[e] <=> lac-D[c] pi[e] <=> pi[c] pi[e] <=> pi[c] pyr[e] <=> pyr[c] pyr[e] <=> pyr[c] o2[e] <=> o2[c] (0.5) o2[c] + q8h2[c] <=> h2o[c] + q8[c] nadh[c] + nadp[c] <=> nad[c] + nadph[c]

nadh[c] + q8[c] <=> nad[c] + q8h2[c] adp[c] + pi[c] <=> atp[c] + h2o[c]

nh3[e] <=> nh3[c] ac[e] <=> co2tot[e] <=> etoh[e] <=> for[e] <=> glc-D[e] <=> h[e] <=> h2o[e] <=> nh3[e] <=> o2[e] <=> pi[e] <=> Stoichiometry

(36)

EX_succ EX_C Biomass Reaction ID Succinate exchange Charge exchange Reaction name succ[e] <=> Charge[e] <=> (1.496) 3pg[c] + (3.7478) accoa[c] + (22.873) atp[c] + (0.361) e4p[c] + (0.0709) f6p[c] + (0.129) g3p[c] + (0.205) g6p[c] + (0.2557) gln-L[c] + (4.9414) glu-L[c] + (22.873) h2o[c] + (3.547) nad[c] + (13.0279) nadph[c] + (1.7867) oaa[c] + (0.5191) pep[c] + (2.8328) pyr[c] + (0.8977) r5p[c]

<=> (22.873) adp[c] + (4.1182) akg[c] + (3.7478) coa[c] + (3.547) nadh[c] + (13.0279) nadp[c] + (22.873) pi[c]

Stoichiometry

Table S5: Stoichiometry for the reactions of the E. coli network model which contain transport processes. MALt2_2_-2 MALt2_2_-2+2H SUCCt2_2_-2 SUCCt2_2_-2+2H ACt2r_-1 ACt2r_-1+1H AKGt2r_-2 AKGt2r_-2+2H CO2t_0 ETOHt2r_0 FORt2_-1 FORt2_-1+1H FUMt2_2_-2 FUMt2_2_-2+2H GLCpts_0 H2Ot Reaction ID glc-D[c] + pep[c] <=> g6p[c] + pyr[c] Chemical conversion mal-L{-2}[e] <=> mal-L{-2}[c] (2) h{1}[e] + mal-L{-2}[e] <=> (2) h{1}[c] + mal-L{-2}[c] succ{-2}[e] <=> succ{-2}[c]

(2) h{1}[e] + succ{-2}[e] <=> (2) h{1}[c] + succ{-2}[c] ac{-1}[e] <=> ac{-1}[c]

ac{-1}[e] + h{1}[e] <=> ac{-1}[c] + h{1}[c] akg{-2}[e] <=> akg{-2}[c]

akg{-2}[e] + (2) h{1}[e] <=> akg{-2}[c] + (2) h{1}[c] co2tot{0}[e] <=> co2tot{0}[c]

etoh{0}[e] <=> etoh{0}[c] for{-1}[e] <=> for{-1}[c] for{-1}[e] + h{1}[e] <=> for{-1}[c] + h{1}[c]

fum{-2}[e] <=> fum{-2}[c]

fum{-2}[e] + (2) h{1}[e] <=> fum{-2}[c] + (2) h{1}[c] glc-D{0}[e] <=> glc-D{0}[c]

h2o{0}[e] <=> h2o{0}[c]

Metabolite transport

(37)

3

Table S6: General parameters used in the E. coli network model. D_LACt2_-1 D_LACt2_-1+1H PIt2r_-2 PIt2r_-2+2H PYRt2r_-1 PYRt2r_-1+1H O2t_0 CYTBD THD2 NADH16 ATPS4r Ht_1 Cex nh3t_0 (0.5) o2[c] + q8h2[c] <=> h2o[c] + q8[c] nadh[c] + nadp[c] <=> nad[c] + nadph[c] nadh[c] + q8[c] <=> nad[c] + q8h2[c] adp[c] + pi[c] <=> atp[c] +

h2o[c]

lac-D{-1}[e] <=> lac-D{-1}[c] h{1}[e] + lac-D{-1}[e] <=> h{1}[c] + lac-D{-1}[c]

pi{-2}[e] <=> pi{-2}[c]

(2) h{1}[e] + pi{-2}[e] <=> (2) h{1}[c] + pi{-2}[c] pyr{-1}[e] <=> pyr{-1}[c]

h{1}[e] + pyr{-1}[e] <=> h{1}[c] + pyr{-1}[c] o2{0}[e] <=> o2{0}[c] (4) h{1}[c] <=> (4) h{1}[e] (1) h{1}[e] <=> (1) h{1}[c] (1.5) h{1}[c] <=> (1.5) h{1}[e] (4) h{1}[e] <=> (4) h{1}[c] h{1}[c] <=> h{1}[e] charge{1}[c] <=> charge{1}[e] nh3{0}[e] <=> nh3{0}[c] Reaction ID

Chemical conversion Metabolite transport

Stoichiometry of: Temperature Nh biomass Δ_fG0_biomass Extracellular pH Cytosolic pH Extracellular ionic strength

Cytosolic ionic strength

http://www.ncbi.nlm.nih.gov/pubmed/20506321 From Battley 1991; ΔfGbm = -71.075 (KJ/cmol)

K mmol/gCDW kJ/gCDW M M 310.15 74 -2.69223 7 7.6 0.2 0.15

(38)

Table S7: Gibbs standard energies (Δ_rG0_{) and estimation errors in kJ/mol from}

Component Contribution Method, used in the fitting of the Thermodynamic and Stoichiometric Model (TSM) with the E. coli network model.

PGI PFK FBA TPI FBP GAPD ENO PYK PGL RPI TKT1 RPE TALA TKT2 PDH CS ICDHyr AKGDH FRD7 FUMA MDH ICL MALS PPC PPS PPCK ACKr ACALDDH PFL ALCD2x GLUSy Reaction ID 2.53 -20.5775 19.82423 4.3 -9.65577 -17.3007 0.03 -22.6746 -181.613 2.06 -5.31601 -3.38 0.806008 -11.41 -45.4723 -33.2018 -4.08648 -37.3886 76.56 -3.44 24.62352 9.172115 -38.6807 -31.1089 -6.63978 0.875637 5.466335 -22.4123 -20.9358 15.97352 -43.8203 ΔrG0 0.388208 0.684862 1.203987 1.203159 0.70193 1.284491 0.46544 0.429747 0 0.775731 2.261145 1.170106 1.942152 2.261145 0 0.550348 0 0 0 0.282179 0.309209 1.370614 2.112774 0 0.505101 0 0.38625 1.219705 1.513539 0.327926 0.957255 ΔrG0 error