Uncertainty Quantification to show Data Limitations for building a System Dynamics Model of Complex Diseases like Obesity



Viktor O. van der Valk
Computational Science, University of Amsterdam
TNO Healthy Living
viktorvandervalk@gmail.com

Daily Supervisor: H.M. Wortelboer, PhD
TNO Healthy Living
heleen.wortelboer@tno.nl

Examiner: Prof. R. Quax
Computational Science Lab, UvA
r.quax@uva.nl

Second Assessor: L. Crielaard, MSc
Institute for Advanced Study (IAS)
l.crielaard@amsterdamumc.nl

ABSTRACT

Obesity is a disease characterized by a plethora of factors that play a role in disease progression and also show complex interactions with each other. Treatments for obesity are still only moderately successful, and a thorough understanding of the dynamics of obesity is still far away. Individual characteristics and genetic differences in the obese population might further encumber the understanding of the obesity dynamics.

System dynamic modelling is a type of modelling that can describe complex phenomena, like obesity, with a combination of structural and differential equations, and might therefore help in understanding the complex obesity dynamics. However, the use of system dynamic models (SDMs) for complex diseases like obesity is relatively unexplored, since SDMs require adequate data sets to be either built or validated. Unfortunately, obesity data is not yet collected for the purpose of making an SDM. An attempt to build an SDM of the determinants involved in obesity was therefore made here with the best available data set, the Whitehall II cohort data set [1]. The resulting SDM was then used to study the potential and limitations of the data set by means of uncertainty quantification (UQ), such that a future study could start collecting a more ideal data set for building an SDM.

UQ was shown to be successful in pointing out the limitations in data quantity (number of participants) and data quality. A UQ method to show the limitations in the data frequency (number of measurements per time span) was also proposed, but could not be validated due to the high number of missing values in the data set. A method to clearly distinguish limitations in data individuality (number of data points per individual) and dimensionality (number of different variables measured) from the other possible limitations in a data set was not found. The data individuality and dimensionality should therefore be determined based on indicators other than UQ, such as expert opinion or literature. The path to building the perfect SDM will likely be an iterative procedure in which collecting data and quantifying an SDM and its imperfections alternate.

Keywords System Dynamics · Causal loop diagram · Data Limitations · Symbolic Regression · Uncertainty Quantification · Health · Obesity · Whitehall II · Simulation

1 Introduction

Obesity as a complex disease Obesity rates are still increasing in the Netherlands; since 2018, more than 15.0% of the Dutch population has been obese [2]. Even though there is a wide consensus about the major negative health consequences of obesity, there is no thorough understanding of the disease dynamics. Diet and exercise are the focus of most studies on obesity or weight loss, but even though short-term weight loss is usually achieved, it is rarely maintained in the long term (> 1 year) by the vast majority of participants [3, 4]. Nonetheless, years of research have revealed a diverse range of factors that might influence obesity. Clear determinants of obesity include, among others, high energy intake, sleep deprivation, depression, stress, medication and lack of exercise [5–13]. This plethora of factors makes obesity a complex disease whose dynamics are difficult to capture.

What complicates the understanding of the disease dynamics in obesity even further are the individual, possibly genetic, differences in these dynamics. For example, where some individuals gain weight rapidly in depressed periods, others might instead lose weight when depressed [14]. More and more studies are showing individual or genetic differences in the way, for example, food intake, mood, exercise, elevated cortisol levels or other hormones influence body weight [12, 15].

Modelling Obesity This combination of complexity and individuality makes the dynamics of obesity difficult to capture with common analysis methods like regressions, t-tests or ANOVAs [16–18]. These methods are used to analyze the influence of only a few factors on a certain outcome, rather than the plethora of factors that was shown to influence obesity. Nevertheless, complexity science offers advanced computational models that could aid in the analysis of complex and individual diseases like obesity. A system dynamic model (SDM) is one such model; it has previously been used to analyse complex dynamics in, among others, finance, biology and ecology. An SDM tries to capture the dynamics of a system with a combination of structural and differential equations, often visualized as a stock-and-flow diagram. The use of SDMs in health sciences, and in obesity specifically, however, is relatively unexplored, partly because adequate data sets are currently lacking. Building an SDM involves knowing the causal relationships between all the factors in a system, but also quantifying all these relations and possible interaction effects. Especially this quantification step is difficult, since it requires either knowing the equations in advance or obtaining them by fitting the SDM to a data set. This data set then needs to contain all variables in the SDM, which is often not the case since data is generally not collected for the quantification of SDMs.

There have, however, been some studies that did try to model obesity. The majority of these studies focus on conceptual modelling, rather than physical modelling. A conceptual model of obesity is commonly referred to as a Causal Loop Diagram (CLD). CLDs are a good way to visualize, in a network, all factors involved in a complex phenomenon. Several attempts have been made to map all factors influencing obesity and their relations in a CLD [19–22]. However, the scarcity of data on obesity and the variety of factors influencing it complicate the quantification of CLD relations. Some studies have made an attempt at quantification, but the quantified CLDs or SDMs are small and often do not include important factors like sleep, anxiety, depression, education, medication and financial situation [23–25].

The number of factors measured in a data set is not the only limitation for building an SDM. In general, the ideal data set to build an SDM with should be of high quality, have a short interval between time points (high frequency), include many participants (high quantity), include all relevant factors in obesity (high dimensionality) and include many time points per participant (high level of individuality). However, such a data set for obesity does not exist, since health data is often both expensive and hard to collect.

A first step on the way to building an SDM for obesity would therefore be to identify the limitations of current data sets. These limitations could then help future research focus on the aspects of a data set that are most important to improve. The path to building the perfect SDM will likely be an iterative procedure in which collecting data and quantifying an SDM and its imperfections alternate.

Different data limitations Each limitation in the 5 pillars of data limitations introduced above will contribute to the noise in the SDM. When the data frequency is too low, the dynamics on a time span shorter than the measurement interval are averaged out, resulting in noise, since information is lost in the process of averaging.

A low data quantity increases the noise that is already present in the SDM as a result of the other four data limitations. A high data quantity is necessary to distinguish noise from true effect, as averaging over different data points gives a more robust estimate. Here, averaging will also result in information loss, but when this information is noise (from, for example, noisy measurements), the information loss only increases the predictive quality of the SDM. However, when information about, for example, interpersonal differences is lost by averaging, it decreases the predictive quality of the SDM, especially when these interpersonal differences are important for the disease dynamics.

For complex diseases in which interpersonal differences are important, a high level of data individuality could be used to personalize the SDM. Separate models for subtypes, or even personal SDMs (with personal parameter values), would then be needed to capture the disease dynamics of different individuals.

A high level of data dimensionality could also help to understand these possible interpersonal differences. A higher data dimensionality could mean that extra mediating factors are measured. These mediating factors could possibly explain the differences in dynamics between different people or subtypes. Besides adding extra mediating factors, a higher data dimensionality also decreases noise, since more variables contain more information, and more information means that more of the noise in the SDM can be explained. The data dimensionality is, however, constrained by the knowledge about factors that influence obesity. Moreover, not every factor that is known to influence obesity is easily measurable.

The last limiting factor in a data set is the data quality. A lower data quality means that the data contains less useful information, hence more noise in the SDM.

Uncertainty Quantification in SDMs The noise in the SDM, as introduced by the data limitations, is measured by means of uncertainty quantification (UQ). For this, the uncertainty in the SDM is divided into the structural and the parameter uncertainty.

Structural uncertainty regarding SDMs can be subdivided into the uncertainty in the topology of the model and the uncertainty in the functional form of the interactions of the variables in the model. The first is beyond the scope of this research and will not be further addressed. The focus here is on the uncertainty in the functional form, as this uncertainty is often ignored: assuming linear relations without interaction effects is a common practice in SDMs when no equations are available in literature. In this study, however, the functional form of the SDM equations is inferred from the data. If the data has limitations, this inference will likely suffer and the structural uncertainty will increase.

The second form of uncertainty, the parameter uncertainty, is the uncertainty in the parameter values of the SDM equations. These values could be the same for everyone, but the possibility that different people have different parameter values is very high, given the different characteristics and genetic make-up within our population. Especially in these cases, a data set should consist of many data points per individual to infer these personal parameter values. For little variation within the population, however, a high-quantity data set is sufficient to infer a general parameter value set.

Another way to quantify the uncertainty in the SDM is to use the coefficient of determination, which is a goodness-of-fit indicator for regression-like analyses. Every equation of the SDM is then analysed separately. The effects of the different data limitations could be established with the coefficient of determination by, among others, varying the magnitude of the different limitations whenever possible.

Study Design In order to use UQ to show the different data limitations, a first exploratory SDM will be built, following a procedure similar to the one proposed in Crielaard et al. (2019). Firstly, a CLD will be made based on expert knowledge and available literature. This CLD will then be contracted to include only factors that are measured in the obese (WHtR > 0.57) participants of the Whitehall II data set [1]. This data set covers most of the important factors in obesity mentioned above and can therefore be used for quantification of an adequate obesity model. The SDM, and in particular the equations of which it consists, will be used to analyze all the different sources of uncertainty in the quantification steps. These sources of uncertainty will then be attributed to the different limitations of the data set.

This study describes a first step towards building an SDM that can be used to analyse the dynamics of obesity. Obesity was studied because it can be considered a complex disease for which an understanding of the disease dynamics is currently lacking. Furthermore, data is a limiting factor when making an SDM of obesity. However, a similar workflow could be used for other complex diseases in which the disease dynamics are also not well understood and available data sets are too limited to directly develop a useful SDM.

2 Literature Review

Causal Loop Diagrams The best-known obesity CLD is the Foresight Map as described by Finegood et al. (2010) [20]. This CLD was built by a group of domain experts in 4 so-called 'system workshops'. The CLD covers a broad spectrum of factors that influence obesity, including stress, physical activity and education level. This makes the diagram useful for raising awareness and for mapping all the possible factors that could influence obesity. The number of factors, and the fact that most factors are not (easily) measurable, makes quantification of the CLD difficult. For this reason it was likely not attempted by the researchers.

Allender et al. (2015) made a similarly extensive CLD of factors that play a role in childhood obesity, but instead of experts, the community was consulted, again in multiple workshops [21]. This diagram has a purpose similar to the Foresight Map, and similar limitations to quantification. In a follow-up study, a network analysis of the CLD was conducted [19]. This resulted in a quantification based on the topology of the network; however, apart from an importance ranking of the factors based on their degree, no dynamics or simulations were described.

Van Wietmarschen et al. (2015), on the other hand, were able to produce simulations from a semi-quantified CLD [26]. This CLD was again made in multiple workshops with domain experts. However, it did not only include factors that influence obesity, because it was made to map all factors that influence human health in general. By assigning every relation a positive or negative strength (very weak, weak, normal, strong, very strong) based on expert knowledge, the researchers were able to get simple simulations without the need for an extensive data set that included all factors in the CLD. These simulations, however, had little predictive power and could only be used to simulate a population mean, rather than an individual.

System Dynamic Models The availability of the right data is essential for building an SDM from scratch or for quantifying an existing CLD. Ideally, the data includes (frequent) measurements of all the variables in the SDM or CLD. Most, if not all, data sets, however, are not collected for modelling. They are collected either for statistical analyses of a randomized controlled trial (RCT), in which predominantly the influence of only 1 intervention is analyzed, or for epidemiological studies focusing on the analysis of correlations [27]. SDMs are therefore not frequently made for the analysis of obesity, and when SDMs are made, they are normally limited due to the data constraints.

Madahian et al. (2012) did make such a limited model, focusing on different forms of energy intake and energy expenditure. The model did not include important factors that influence obesity like depression, anxiety, stress, medication and sleep [23]. For quantification of the model, data from an intervention study in 8-10-year-old African-American girls was used. The model predictions, however, were not very accurate, even though the researchers concluded otherwise. The model was able to predict BMI after 2 years with an accuracy of 10% in only 84% of the girls. This seems accurate, but considering that almost 50% of girls this age are within 10% of the mean BMI, the accuracy is not high [28].


Fallah-Fini et al. (2014) made a similar SDM, but used a data set from an American adult population survey [24]. The SDM was again relatively simple, but what was interesting about the modelling approach is the split of the population into different subgroups based on BMI, gender and race/ethnicity. By making this split, the researchers accounted for the fact that not everyone is the same with regard to the progression of obesity. As validation, the researchers showed that the model was able to predict a BMI distribution similar to the data, 3 to 4 decades after the start of the study, for every subgroup and sub-population. This, however, is a poor measure of validation, since the model uses this same data to calibrate its parameter values. Nevertheless, they clearly showed that different parameter values are needed to simulate different sub-populations or subgroups.

Abdel-Hamid (2003) made a physiological SDM of the individual obesity case [29]. In the model, fat-free mass (FFM) and fat mass (FM) were taken as the main outcome variables. Exercise and diet restriction were taken as the main independent variables. Quantification was done based on the available literature about the physiological processes in the model. Since this was a more theoretical model, no validation against data was done. The researchers did, however, run simulations of theoretical interventions in order to try to explain "the lack of consensus among researchers and practitioners with respect to the effects of exercise training on body weight", which can be seen as a sort of validation.

Most of the CLDs made are extensive and big, which makes finding a data set to quantify them impossible; most of the SDMs, on the other hand, are too small, which makes them unable to capture the complex dynamics and interactions of a disease like obesity. Building a CLD from scratch, contracting it to an SDM and quantifying it with data according to Crielaard et al. (2019) therefore seems the best way to get an adequate SDM, which can be used for UQ.

Uncertainty Quantification and Data Limitations Different approaches to quantifying the uncertainty in SDMs have been taken. None of the studies on UQ in SDMs was done in health. Therefore, methodological studies and studies in other research fields that could be applied to a health case will be discussed here.

Alvin et al. (1998) give a road map of the whole process of building a dynamic model, including the quantification of the different uncertainties in the model, such as the structural and parameter uncertainty [30]. Similar to the more recent paper of Chrebtii et al. (2016), they used a Bayesian approach to quantify the model uncertainty [31]. The prior distribution of the outcome variable was used and updated to obtain a posterior distribution over the outcome variable, which reflects the uncertainty of the equation used to predict the outcome variable. This was done for every equation in the model. Neither study used actual data sets to build an SDM; instead they made use of stochastic processes and their likelihood estimates, which were known in literature.

Arhonditsis et al. (2005) and Qi et al. (2011) took a different approach and used the coefficient of determination to quantify the uncertainty in the model equations [32, 33]. Similar to the Bayesian approach, every equation of the model was analysed separately, but instead of a posterior distribution, the coefficient of determination gives an indication of the amount of variance of the outcome variable that the model equation can explain. Moreover, both of these studies use data to fit the models. Arhonditsis et al. (2005) used a model calibration method, which is not further specified in the paper, to fit an ecological model. Qi et al. (2011) used multiple regression to fit the model equations of an urban environmental model.

None of the studies mentioned above, however, translates the uncertainty in the model equations to the limitations in the data used to fit the model. Limitations in a data set are usually only acknowledged in a paper when the data set is used to build a model. The assessment of the data limitations is then done by comparing the characteristics of the data set with the characteristics that are assumed necessary to fit a certain model. Other methods for the assessment of data limitations are, among others, subjective assessments, such as the multiple expert assessments described by Weidema (1998), or automated rule-based assessments as described by Savchenko et al. (2003) [34, 35]. These are, however, not model specific, but more general data quality assessments.

There seems to be a gap in the feedback modelers give when a model is made with a certain data set. More often than not, and especially in health sciences, the resulting model is far from perfect, among others due to the limitations in the data set that was used. The focus of modelers is more on finding workarounds for data limitations than on giving adequate insight into these limitations, such that future data collection can be improved. This study therefore aims to close this gap by proposing an iterative loop which entails both building an SDM with data and giving feedback on the possible limitations of this data.

3 Methods

In order to show the limitations of a data set, the workflow shown in figure 1 was followed. First, a CLD of obesity was built, based on expert interviews and literature. This CLD was then used to make an SDM, which was fit to the best available data set of obesity. Finally, the SDM was used to quantify the uncertainties in the model, which could be attributed to the limitations in the data set, namely the limitations in frequency, quality, quantity, individuality and dimensionality as explained in the introduction.

Figure 1: Workflow for getting data limitations by means of UQ. SDM = system dynamic model, CLD = causal loop diagram, UQ = uncertainty quantification

The workflow shown in figure 1 is part of the iterative procedure, introduced here, to build and improve an SDM. The procedure is shown in figure 2. Besides the workflow to build an SDM and quantify its uncertainties, the iterative procedure also entails feeding the data limitations back to the researchers. This feedback can be used to improve the data quality, quantity, dimensionality, individuality or frequency of future research. The data from the Whitehall II cohort study was used here as a first data set in this iterative procedure; however, the same steps could be used to identify the limitations of other (future) data sets [1].

Figure 2: Iterative procedure for improving an SDM. SDM = system dynamic model, CLD = causal loop diagram, UQ = uncertainty quantification

3.1 Expert interviews and literature towards a Causal Loop Diagram

Interviewing Experts In order to get a wide theoretical view of obesity, a diverse set of experts was interviewed and literature research was done. Details about the experts that were interviewed are shown in table 1. Apart from scientists, doctors and dietitians, several (former) obese/overweight patients were interviewed to gather detailed personal experiences with obesity as well. The first part of the interviews consisted of more general questions about the factors that influence obesity. The second part consisted of more detailed questions about the causality, the type of relationship and possible mediating or confounding factors. Since this SDM focuses on the personal factors that contribute to obesity, the influences of external factors and environment were not taken into account in the interviews. The interviews were all personal interviews done by the same interviewer, who had a background in biology and modelling, but did not qualify as an obesity expert. During the interviews, interviewees were asked for references to scientific papers where available.

Temporal Diagram The first parts of the interviews were ordered in a temporal diagram (TD). In a TD the factors are mapped based on the temporal scale on which they change. Change in this case means relevant change with regard to obesity; fat mass, for example, even though it has hourly fluctuations, is a factor that shows relevant change on a daily/weekly basis.

The SDM was made to examine the influence of factors within a person that influence obesity. The influence of the environment was therefore considered constant and was not taken into account. All the factors in the TD were thus assumed to occur on the same spatial scale. This is the reason that a temporal diagram, instead of the usual spatio-temporal diagram, was chosen here [36]. The TD is an important diagram when building an SDM, since it is used to deduce which factors can be considered constants, auxiliaries (dynamic variables) or stocks (most important main outcome variables) in an SDM with a certain timescale.


Table 1: Expert Knowledge Overview

Research areas covered: Glucose Metabolism, Nutrition, Physical Activity, Sleep, Mental Health, Medication, Genetics.

Experts interviewed (name, level of expertise):
Liesbeth van Rossum, Professor
Bibian van Voorn, Postdoc
Eline van der Valk, PhD, MD
Joelle Oosterman, PhD, MD
Suzan Wopereis, PhD
Wilrike Pasman, PhD
Anita Tump, Dietitian
Tesse Leermakers, Dietitian
Lenneke Elderbroek, Dietitian
Stephan Patz, Lifestyle Coach
7 (former) overweight/obese participants, Personal Experience

In the original table, a bold 'x' refers to a high level of expertise (PhD, papers published, specific job experience) and a non-bold 'x' refers to a lower level of expertise (personal experience, work-related general experience and knowledge) in that particular research area in relation to obesity.

Causal Loop Diagram From the second part of the interviews, a CLD was gradually built in Vensim PLE 7.3.5 (Ventana Systems Inc.). Since the interviews were held over a time span of approximately 5 months, the later interviewees were, at the end of the interview, confronted with a draft version of the CLD, on which they were asked to give feedback. Two experts were interviewed twice, specifically to obtain feedback on the draft CLD.

3.2 Data set: the Whitehall II study

Whitehall II Data from the Whitehall II cohort study was used here as an example data set for which the data limitations can be analysed [1]. The Whitehall II study was a cohort study in which 10,314 British civil servants, two-thirds men and one-third women, aged between 35 and 55, were measured every 5 years. The study ran from 1984 to 2011. This data set was chosen because it contains multiple measurements in time of the majority of the (measurable) factors in the CLD. Among others, these factors include medicine use, depression, anxiety, fasting glucose level, physical activity, waist-to-height ratio (WHtR), weight, sleep, alcohol use, education level, financial situation and age. Only the obese participants of the data set were used for analysis; a WHtR > 0.57 was considered obese (n=1687) [37].

Of the data set, only 4 consecutive time points were available at TNO for analysis. The set of variables that was measured changed over time, which meant that not all 4 consecutive time points contained the same measurements. Especially the first and the last of the 4 time points lacked the measurements of some important variables.

Fat-free mass (FFM) and fat mass (FM) were not measured in the Whitehall II study, but were estimated with the following formulas given by Swainson et al. [38]:

FM = 99.7 * WHtR − 24.7
FFM = Weight − FM

The basal metabolic rate (BMR) was also not measured, but was estimated with the formulas given by Sabounchi et al. [39]:

male: BMR = 898 − 3.32 * Age + 14.3 * FFM + 6.46 * FM
female: BMR = 682 − 3.08 * Age + 12.9 * FFM + 5.9 * FM
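For concreteness, a minimal sketch of how these estimates could be computed with pandas; the column names ('whtr', 'weight', 'age', 'sex') are illustrative, not the actual Whitehall II variable names:

```python
import numpy as np
import pandas as pd

def estimate_body_composition(df: pd.DataFrame) -> pd.DataFrame:
    """Estimate FM, FFM and BMR for every participant (sketch).

    Assumes columns 'whtr', 'weight' (kg), 'age' (years) and
    'sex' ('male'/'female'); these names are our own.
    """
    out = df.copy()
    # Fat mass and fat-free mass from WHtR and weight (Swainson et al.)
    out["fm"] = 99.7 * out["whtr"] - 24.7
    out["ffm"] = out["weight"] - out["fm"]
    # Sex-specific basal metabolic rate (Sabounchi et al.)
    male = out["sex"].eq("male")
    out["bmr"] = np.where(
        male,
        898 - 3.32 * out["age"] + 14.3 * out["ffm"] + 6.46 * out["fm"],
        682 - 3.08 * out["age"] + 12.9 * out["ffm"] + 5.9 * out["fm"],
    )
    return out
```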

3.3 System Dynamics Model

Topology In order to make a quantified SDM, the final CLD was first contracted such that it only contained variables that were measured in the Whitehall II data set [1]. This was done by putting direct links between measured variables when their relation was mediated by a variable that was not measured. For example, when ’Energy Intake’ mediated the relation between ’Alcohol use’ and ’WHtR’, but ’Energy Intake’ was not measured, a direct link was put between ’Alcohol use’ and ’WHtR’ in the SDM.

For the SDM a time-step of 1 month was chosen, which meant that all the variables in the contracted CLD that had measurable change on a timescale larger than a year were considered constants. All the variables that had measurable change on a timescale smaller than a month were averaged over monthly periods.


FFM and WHtR were chosen to be the stocks in the SDM. This was done because both variables have a stock-like capacity and because the change of these variables, rather than their actual amount, is relevant in an SDM with the chosen timescale. Moreover, both are important predictors of obesity. WHtR is on its way to replacing BMI as indicator for obesity, because BMI, in contrast to WHtR, depends on muscle mass. For example, a bodybuilder, who is not obese, will have a high BMI, because his weight is high, but a low WHtR, since his waist is still small. Besides that, WHtR is a good indicator for visceral fat mass, which is found to be more correlated with obesity-related problems, like type II diabetes and cardiovascular diseases, than BMI, non-visceral or total fat mass [37, 40]. FFM, on the other hand, is an indicator for muscle mass, which is also an important predictor of obesity. Muscle mass influences the basal metabolic rate, which is an important part of the energy expenditure of a person. Moreover, the absence of muscle mass is related to all-cause mortality in obese patients [41, 42]. The rest of the variables were considered auxiliaries, meaning that they were updated directly during the simulation, instead of via a differential equation update like the stocks.
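The distinction between stocks and auxiliaries can be summarized in a schematic monthly update step; the sketch below assumes each SDM equation is available as a callable and all names are illustrative:

```python
def sdm_step(stocks, constants, aux_equations, rate_equations, dt=1.0):
    """One monthly update of the SDM (schematic sketch).

    stocks/constants: dicts of current values, e.g. {'WHtR': ..., 'FFM': ...}
    aux_equations/rate_equations: dicts mapping a variable name to a
    callable of the current state; all names here are our own.
    """
    state = {**stocks, **constants}
    # Auxiliaries are updated directly from the current state ...
    for name, equation in aux_equations.items():
        state[name] = equation(state)
    # ... while stocks are updated via their rate (differential) equations.
    new_stocks = {
        name: stocks[name] + dt * rate(state)
        for name, rate in rate_equations.items()
    }
    return new_stocks, state
```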

Data on medicine use was only available in the data set as a general yes/no question, with the exception of a few common medicine groups which had their own yes/no question. In order to prevent medication from influencing the relations in the model, it was decided to exclude participants that used any form of medication.

Logistic Transformation The SDM now consists of a network of just stocks, auxiliaries, constants and rates, which is referred to as the stock-and-flow diagram of an SDM. The SDM is the set of equations behind this stock-and-flow diagram. Every auxiliary or rate, together with its incoming arrows, is an equation in the SDM. This equation describes the way the incoming variables are combined to get the new value for the auxiliary or rate (with or without delay). The rates represent the change in the stocks, which is applied at the end of every time step. The functional form of such an equation in most (smaller) SDM models is normally known from literature or is assumed to be linear. In this SDM, however, most causal relations are based on RCTs, which usually only take into account two variables [27]. The causal relation found in an RCT is often just positive or negative, only sometimes dose dependent or linear, and rarely given as a full functional form that includes multiple predictors and their interaction terms. This poses a problem, since these functional forms are necessary for a working SDM. A common workaround in complexity science is to assume linear relations and thus use a linear functional form. This, however, neglects the possibly more complicated relations and interactions found in nature. Therefore, different functional forms were investigated in this study, by inferring possible functional forms from data. First of all, to account for the fact that most of the variables in the data set are bounded to a specific domain, the (linear) equations were transformed such that they contained a logistic bound. The bound prevented the variables from increasing beyond their maximum value or decreasing below their minimum value. Given the variable y, its minimum value y_min, its maximum value y_max and the linear equation to predict the variable,

y = a*x1 + b*x2 + c*x3 + ...,

the transformation is given by:

f(y) = (y_max − y_min) / (1 + e^(−4(y − 0.5(y_max + y_min)) / (y_max − y_min))) + y_min

with:

df(y)/dy = 4 e^(−4(y − 0.5(y_max + y_min)) / (y_max − y_min)) / (1 + e^(−4(y − 0.5(y_max + y_min)) / (y_max − y_min)))^2

which gives:

f(0.5(y_max + y_min)) = (y_max − y_min) / (1 + e^0) + y_min = 0.5(y_max + y_min)

lim_{y→∞} f(y) = (y_max − y_min) / (1 + 0) + y_min = y_max

lim_{y→−∞} f(y) = y_min

and:

df(y)/dy at y = 0.5(y_max + y_min): 4 / (1 + e^0)^2 = 1

lim_{y→±∞} df(y)/dy = 0

The logistic bounded (linear) equation gives a sigmoidal function with a slope and value equal to the slope and value of the original function when y is around 0.5(y_max + y_min) ≈ y_mean for a normally distributed variable. When y goes to ∞ or −∞, df(y)/dy gradually goes to 0 and f(y) goes to y_max or y_min, respectively.

Whenever y was normally distributed but appeared to be cut off by a bound introduced by the measurement method, the extreme value at the cutoff side was set such that y_median = 0.5(y_max + y_min). This was done in order to keep the slope of the transformed function equal to the slope of the original function around y_mean. An example of the logistic transformation can be seen in figure 3.

The motivation for choosing this logistic transformation is that most variables in the data set, like depression, food quality or physical activity, have a natural bound. Physical activity, for example, cannot be negative nor bigger than a certain maximum. Furthermore, these transformations still give intuitive parameter values, because of the linear behaviour near the mean (see figure 3).


Figure 3: The logistic transformation
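In code, the transformation amounts to a rescaled sigmoid; a minimal sketch (the function name is ours):

```python
import numpy as np

def logistic_bound(y, y_min, y_max):
    """Logistic transformation of a linear prediction y onto [y_min, y_max].

    Around the midpoint 0.5*(y_min + y_max) the value and slope equal
    those of the untransformed linear equation (see figure 3).
    """
    mid = 0.5 * (y_max + y_min)
    rng = y_max - y_min
    return rng / (1.0 + np.exp(-4.0 * (y - mid) / rng)) + y_min

# At the midpoint the transformation is the identity with slope 1:
# logistic_bound(0.5, 0.0, 1.0) -> 0.5
```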

Candidate Functional Form Selection Next to this logistic transformation, a deterministic symbolic regression (SR) approach according to Schmelzer et al. (2019) was tried to select candidates for more complex functional forms that include interaction terms [43]. This regression algorithm uses Elastic Net regression to select variables from a library of possible function terms. Elastic Net regression is a combination of lasso and ridge regression, in which both L1 and L2 regularization are combined to select variables [44].

The library that was used here consisted of the normal variable (A), the inverse variable (A^-1), the product (A*B) and both quotients (A/B, B/A) of all possible combinations of incoming connections to a variable. For every combination of L1 and L2 regularization terms, the algorithm gives a candidate functional form. These candidate functional forms were considered to be representative of the distribution over possible functional forms of the equations. Bootstrapping was used to sample from the data set in order to increase the size of the functional form distribution [45]. The most frequently selected candidate functional form per regression analysis was used as the best new functional form. This gives 1 new functional form per equation. These new functional forms were then compared with their linear equivalents in a model selection procedure, which is described in the 'Model Selection' paragraph. Whenever the new functional form gave better predictions, the linear equation was replaced by the new functional form in the SDM.
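A minimal sketch of this selection step; the paper does not name its Elastic Net implementation, so scikit-learn's ElasticNet is assumed here, and the helper names are ours:

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet

def term_library(X: pd.DataFrame) -> pd.DataFrame:
    """Library of candidate terms: every predictor A, its inverse 1/A,
    and the product and both quotients of every predictor pair.
    (Zeros in the data would need guarding before taking inverses.)"""
    lib = {}
    for a in X.columns:
        lib[a] = X[a]
        lib[f"1/{a}"] = 1.0 / X[a]
    for a, b in itertools.combinations(X.columns, 2):
        lib[f"{a}*{b}"] = X[a] * X[b]
        lib[f"{a}/{b}"] = X[a] / X[b]
        lib[f"{b}/{a}"] = X[b] / X[a]
    return pd.DataFrame(lib)

def candidate_form(X, y, alpha, l1_ratio):
    """One candidate functional form: the library terms that keep a
    non-zero coefficient after Elastic Net regression."""
    lib = term_library(X)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
    model.fit(lib, y)
    return tuple(lib.columns[np.abs(model.coef_) > 1e-8])
```

Repeating candidate_form over a grid of (alpha, l1_ratio) values and over bootstrapped samples, and counting the resulting term tuples (e.g. with collections.Counter), then yields the functional form distribution from which the most frequent candidate is taken.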

For all algorithms and code described in this study, Python 3.7 with the Spyder IDE from the Anaconda Distribution was used. The packages numpy and pandas, as part of the Anaconda Distribution, were used extensively; whenever additional packages were used, this is indicated and appropriately referenced [46, 47].

Parameter Value Quantification The quantification of the equations mentioned above was done with the Levenberg-Marquardt (LM) algorithm, which is an iterative nonlinear least squares optimization algorithm [48]. A version of the LM implementation found in the SciPy library was regularized and used here [49]. The regularization was done to shrink the parameter values that would otherwise overfit due to collinearity in the predictor variables of the equations. Collinearity refers to correlations among predictor variables, which leads to overfitting in unregularized optimizations. Collinearity is clearly present between Anxiety Score and Depression Score, but a lesser, more hidden degree of collinearity, between for example sex and FFM, might also influence the optimization. Ridge regression is the common workaround to deal with collinearity in the predictors in regression-like analysis [50]. Ridge regression imposes a penalty or regularization term (α) on the least squared error that is proportional to the parameter values of the model. This regularization term was added to the LM optimization algorithm to find the best parameter values for the equations.
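A ridge penalty can be combined with a SciPy least-squares/LM optimization by appending sqrt(α)·parameters to the residual vector, so that the summed squared residuals become ||y − ŷ||² + α||θ||². A minimal sketch, with illustrative names (the thesis does not show its exact implementation):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_regularized_lm(predict, theta0, X, y, alpha):
    """Ridge-regularized Levenberg-Marquardt fit of one SDM equation.

    `predict(theta, X)` evaluates the equation; appending
    sqrt(alpha)*theta to the residuals makes the summed squares equal
    ||y - predict(theta, X)||^2 + alpha*||theta||^2.
    """
    def residuals(theta):
        return np.concatenate([y - predict(theta, X),
                               np.sqrt(alpha) * theta])
    return least_squares(residuals, theta0, method="lm").x
```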

The optimization was done on the normalized data set, to prevent the influence of the magnitude of the measurement scales. Whenever a causal relationship was shown to be delayed in literature, the value of the delayed outcome variable was imputed by linear interpolation of the 2 time points before and after the time point needed. This was necessary since the Whitehall II data set only contained 1 time point per 5 years, and the delays were either 1 or 6 months.

Model Selection (comparison of functional forms) This model selection procedure was used to select the best functional form for every equation in the SDM, which is either the linear functional form or the new functional form inferred from the data by symbolic regression. The selection was done by comparing the R^2 scores of the functional forms in a 10,000-fold cross-validation scheme for different regularization values. The R^2 score is the proportion of the variance in the outcome that can be explained by the predictor variables, thus the proportion of the variance in the outcome that the equation explains. The R^2 score normally ranges between 0 and 1, but can be negative if the variance in the predictions is higher than the original variance in the outcome. The regularization values, α, were chosen between 0 and 250, since the R^2 values of the optimizations decreased rapidly for higher levels of regularization (α > 250). This wide range was necessary since the level of collinearity in the predictors of the equations used in the SDM was unknown in advance and different for every equation. In the 10,000-fold cross-validation scheme, the available data per optimization was split into a train and a test set in a 9:1 ratio.
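The cross-validation scheme could look as follows; this is a sketch in which `fit` and `predict` stand for the regularized LM optimization of one equation at one value of α:

```python
import numpy as np

def cross_validated_r2(fit, predict, X, y, n_folds=10_000, seed=0):
    """Mean out-of-sample R^2 over random 9:1 train/test splits.

    `fit(X, y)` returns parameters and `predict(theta, X)` evaluates
    the equation; both wrap the regularized LM optimization above.
    """
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(n_folds):
        idx = rng.permutation(n)
        test, train = idx[: n // 10], idx[n // 10:]
        theta = fit(X[train], y[train])
        residual = y[test] - predict(theta, X[test])
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)  # can be negative
    return float(np.mean(scores))
```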

3.4 Uncertainty Quantification to Data Limitations

To get an idea of the uncertainty in the final SDM, 2 indicators of uncertainty were analysed and translated to possible limitations in the data set.

The first indicator of uncertainty is the maximum R^2 value of every separate, regularized and cross-validated (see Model Selection) equation in the SDM. These R^2 values give an indication of the quality of the equations in the SDM. Some equations might be of better quality than others; 'Depression Score' might, for example, be easier to predict than 'Food Quality'. Since the equations in the SDM are fit on the data, the quality of these equations can be translated to limitations in this data. To illustrate, when the quality (R^2 values) of all SDM equations is high (≈ 1), the data has no limitations and the SDM is perfect. However, when the R^2 values are not close to 1, several limitations in the data, or a combination of these limitations, could be the reason. As already mentioned in the introduction, a data set can be limited by its quality (quality of measurements), quantity (number of participants), individuality (number of time points per participant), frequency (time between time points) and dimensionality (number of different variables measured).

To attribute the uncertainty in the equations to these different data limitations, the influence of the data set size (the data quantity) on the equation quality will be examined as well as the influence of time between the predictor measurements and the outcome measurements (the data frequency). The data quality will be analysed by means of a leave-one-out procedure, in which one of the predictors will be left out of the equation, which will then again be fit to the data.

The effect of the limitations due to the data individuality could not be established in this study, since the number of data points per individual was at most 4, which was too few to fit the SDM or any of the SDM equations on. However, the data points were clustered, in order to get enough similar individuals to fit the SDM on, to get an idea of the heterogeneity of the population. The analysis was done to establish whether clusters regarding the SDM parameter values existed within the data set. The alternative would be that the parameter values for the SDM are normally distributed in the population, or even that everyone has the same parameter value set for the SDM. In both cases, clusters regarding the SDM parameter values will not exist within the data set.

The data quality was analysed with the second indicator of uncertainty, the parameter value distributions, as well. These distributions reflect the uncertainty in the parameter values of every equation in the SDM.

The equation quality The equation quality of every separate equation of the SDM will be compared and checked for significance. The significance of the equation quality depends not only on the R^2 value itself, but also on the number of predictors used in the equation. Simple equations (few predictors) are more significant at similar R^2 values than more complex equations (more predictors).


An equation with a low equation quality makes noisy predictions, which means that the predictions do not contain much information about the outcome. This is likely a result of the limitations of the data. Which of the data limitations, as described above, is the most important limiting factor cannot be concluded from analysis of the equation quality alone, since all the limitations in the data result in a decrease in equation quality.

Low predictive quality of equations in the SDM could be the result of low data quality, since low quality data contains less information about the true variables (both predictors and outcome), which results in low predictive quality of the equations.

Low quality of the SDM equations can also be attributed to low data quantity, since having too few data points to fit the equations on means that the equations are prone to over-fit or under-fit, which decreases the predictive quality of the equations.

Furthermore, low equation quality can be attributed to low data individuality, especially when the true diversity of the parameter values of the SDM is high in a population. High diversity in parameter values means that everyone has their own unique set of parameter values. Trying to fit 1 equation for everyone would then result in an average equation, which would not predict individual outcomes accurately. The only way to get these individual parameter values would then be to fit the equations on individual (or clustered) data, which means that a considerable amount of data per person is needed (high individuality).

Low dimensionality in the data could also be the reason why the quality of the equations is low. Due to the necessary contraction of the CLD, the equations of the SDM have fewer predictors than was originally intended. Fewer predictors mean lower quality predictions, since fewer predictors contain less information about the outcome variable.

Last but not least, a low data frequency could cause low predictive quality of the equations. The predictive capacity decreases with the time between the measurements of the predictors and the measurement of the outcome: predicting 'Depression Score' one year from now is more difficult than predicting 'Depression Score' tomorrow. This low frequency will especially play a role in the stock equations, which try to predict the WHtR gain and the FFM gain, since these equations predict a process of change in the future, rather than making direct predictions like the other auxiliary equations.

The influence of the data set size on the equation quality In order to analyse the influence of the data quantity alone on the equation quality, samples of different quantity were taken from the original data set. The optimization procedure described in the 'Parameter Value Quantification' section was used to obtain the equation quality for these sample sizes. Per sample size, 1000 samples were taken from the data set, to get a cross-validated estimate of the true equation quality. The maximum data quantity that could be used for the optimizations differed per equation, because the variables in the equations have different numbers of missing values.
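A sketch of this subsampling analysis, with `fit` and `predict` as in the earlier sketches and all helper names our own:

```python
import numpy as np

def r2_vs_quantity(X, y, fit, predict, sizes, n_samples=1000, seed=0):
    """Mean R^2 per sample size: draw n_samples subsamples of each
    size, fit on the subsample and evaluate on the left-out rows."""
    rng = np.random.default_rng(seed)
    n, results = len(y), {}
    for size in sizes:
        scores = []
        for _ in range(n_samples):
            idx = rng.choice(n, size=size, replace=False)
            rest = np.setdiff1d(np.arange(n), idx)
            theta = fit(X[idx], y[idx])
            residual = y[rest] - predict(theta, X[rest])
            ss_tot = np.sum((y[rest] - y[rest].mean()) ** 2)
            scores.append(1.0 - np.sum(residual ** 2) / ss_tot)
        results[size] = float(np.mean(scores))
    return results
```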

The influence of time on the equation quality To get an indication of the influence of the data frequency, the time between the predictor measurements and the outcome measurement was increased in steps of 1 month. The data needed for this analysis, namely a data point every month, was created by linear interpolation, based on data of the outcome variable before and after the required time point. Again, the equation quality for every extra month between predictor and outcome measurements was estimated by a 1000-fold cross-validation scheme in combination with the optimization procedure described in the 'Parameter Value Quantification' section. Given the low data frequency in the Whitehall data set (once per 5 years), the auxiliary equations were used here to estimate the effect of time between predictors and outcome. The influence of time on the equation quality of the differential equations of the stocks in the SDM could not be estimated this way: the rate of change in the stock variables is already an average of the rate of change over 5 years, so interpolation would result in a rate of change that is the same for every month of those 5 years, and time would be of no influence within these 5 years. Comparing the predictive quality of the differential equations for the rate of change in the next 5 years did not give useful results either, since the predictive quality of the differential equations for the average change in both the first and the second 5 years was zero.
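The interpolation step itself reduces to a linear blend between two consecutive measurements; a sketch assuming the 5-year (60-month) Whitehall II interval:

```python
def outcome_at_offset(y0, y1, months, span_months=60):
    """Linearly interpolated outcome `months` after the first of two
    measurements y0 and y1 that lie `span_months` apart (5 years in
    Whitehall II). Refitting the equation against this interpolated
    outcome for months = 1, 2, ... traces R^2 as a function of the
    predictor-outcome interval."""
    return y0 + (y1 - y0) * months / span_months
```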

The influence of the single predictors on the equation quality In the leave-one-out procedure, noise was added to one of the predictors in the equation. By adding sufficient noise, the predictor becomes a random variable without any predictive capacity. This way, the importance of that predictor in the equation can be established. When adding noise to a predictor does not decrease the predictive performance of the equation, the predictor is of no importance in the equation. This indicates that the predictor variable does not directly affect the outcome variable (an effect via other variables could still be possible), which means that either the quality of the predictor variable measurement is very low or the true direct effect of predictor on outcome is indeed absent (or extremely low). The latter, however, is unlikely, since the predictors are selected based on the effect, established in literature, that they have on the outcome variable.
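A sketch of this noise-injection procedure, reusing cross_validated_r2 from the model selection sketch above (all names are illustrative):

```python
import numpy as np

def predictor_importance(X, y, fit, predict, col, noise_sd=10.0, seed=0):
    """Drop in cross-validated R^2 after drowning one (normalized)
    predictor column in Gaussian noise; a near-zero drop suggests the
    predictor has no direct effect (or a very noisy measurement)."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    X_noisy[:, col] = X[:, col] + rng.normal(0.0, noise_sd, size=len(X))
    return (cross_validated_r2(fit, predict, X, y)
            - cross_validated_r2(fit, predict, X_noisy, y))
```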


Cluster Analysis of the quality of the equations A combination of the k-nearest neighbor cluster algorithm and LM optimization, which will be called k-LM clustering, similar to Spath (1985), was used here to examine how individuals might differ with regard to the SDM [51].

The algorithm works as follows:

• The algorithm was randomly initialized by assigning equal parts of the data to every cluster.
• Loop:
  – A set of parameters was calculated for each cluster with the LM optimization.
  – Every data point was then reassigned based on the cluster plane it was closest to (least squared error).
• This loop was iterated until the cluster assignments no longer changed.

This procedure was done for every equation in the SDM. Accordingly, the possible existence of clusters could be established for every equation of the SDM. The improvement in equation quality was again used to evaluate the use of clusters in an equation. However, the cross-validation of the equation quality is different for the cluster analysis, since the cluster numbers of the data points in the test set have to be inferred from the training set. The test set can therefore only contain data points of individuals that already have a data point in the training set. This training-set data point can then be used to infer the cluster number of the individual, which is assumed to stay the same over the different time points.
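A minimal sketch of the k-LM clustering loop (illustrative names; empty clusters and ties are not handled here):

```python
import numpy as np

def k_lm_cluster(X, y, fit, predict, k=2, max_iter=50, seed=0):
    """k-LM clustering (sketch): alternate between fitting one set of
    equation parameters per cluster (LM optimization) and reassigning
    every data point to the cluster whose fitted equation gives it the
    smallest squared error."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(y))  # random initial assignment
    for _ in range(max_iter):
        thetas = [fit(X[labels == j], y[labels == j]) for j in range(k)]
        errors = np.stack([(y - predict(t, X)) ** 2 for t in thetas], axis=1)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments converged
            break
        labels = new_labels
    return labels, thetas
```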

Uncertainty in the parameter values To obtain a distribution over the parameter values of the equations in the SDM, which could be used for parameter uncertainty quantification, the equations were fit on 10,000 bootstrapped samples of the data set. Very uncertain (wide) distributions or clustered distributions can indicate a high diversity of parameter values in the population. When the diversity of parameter values in a population is high, it means that everyone has a more unique set of parameter values in the SDM. To obtain these unique sets, multiple data points per individual would be needed (a high individuality). Very uncertain parameter value distributions could, however, also reflect the noise in the data (a low data quality) or a low data quantity. Besides that, the parameter value distributions can say something about the data quality alone: especially when the distributions are not in line with what literature and experts say, the data quality might be low. The results of the parameter uncertainty distribution analysis can be found in Appendix A.
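A sketch of the bootstrap procedure (illustrative names; `fit` as in the earlier sketches):

```python
import numpy as np

def bootstrap_parameters(X, y, fit, n_boot=10_000, seed=0):
    """Parameter value distributions from fitting one SDM equation on
    bootstrapped (resampled with replacement) copies of the data set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        draws.append(fit(X[idx], y[idx]))
    return np.asarray(draws)  # shape: (n_boot, n_params)
```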

4 Results

4.1 Causal Loop Diagram

Temporal Diagram The TD in figure 4 shows all the variables that were identified during the interviews, with their approximate temporal scales. The temporal scale of the SDM is months, so all the variables with a temporal scale smaller than 1 month will be averaged on a monthly basis and all the variables with temporal scales larger than 1 month will be considered constant. A variable like 'Medicine Use' could change on a whole variety of timescales, depending on the type of medicine. 'Alcohol Use', on the other hand, is a variable that changes daily, but its average is a habit that is rather constant for years, and it will therefore be considered a constant in this model.

Causal Loop Diagram Figure 5 shows the final CLD as the result of all the interviews. Table 2 shows the literature that was used to support all the correlations or causal relations. The arrows in figure 5 show a causal relationship between two variables as assumed by the experts. The aim was to find reviews to support every causal relation, but when a review could not be found, reviews or other papers that supported a correlation instead were used as evidence for the causal relationship, as indicated by the experts.

Figure 4: Temporal Diagram: The x-axis indicates the time scales (seconds, day, month, year, lifetime) on which the variables approximately change. The variables shown are Energy Intake, Energy Expenditure, BMR, WHtR, Weight, Fat Mass, Muscle Mass, Sleep Quality, Physical Activity, Food Quality, Alcohol Use, Stress, Gut Health, Coping Capacity, Age, Anxiety Score, Depression Score, Fasting Glucose Levels, Medicine Use, Eating Disorders, Financial Situation, Education and Sex. BMR = basal metabolic rate, WHtR = waist-to-height ratio


Table 2: Literature for the links in the CLD

From To Literature/Evidence

Education Level Food Quality [52]

Financial Situation Food Quality [52]

Eating disorders Food Quality [53, 54]

Sleep Quality Food Quality [12, 55–58]

Coping Capacity Food Quality [59–65]

Physical Activity Sleep Quality [66, 67]

Depression Score Sleep Quality [68, 69]

Anxiety Score Sleep Quality [68, 69]

Alcohol Sleep Quality [70, 71]

Age Sleep Quality [72, 73]

Education Level Physical Activity [74]

Financial Situation Physical Activity [74, 75]

Age Physical Activity [75, 76]

Sleep Quality Physical Activity [58]

Anxiety Score Physical Activity [77–79]

Depression Score Physical Activity [77–79]

Fasting Glucose Levels Physical Activity [80, 81]

Sex Physical Activity [75]

Fat Mass Physical Activity [82]

Muscle Mass Physical Activity [82]

Sleep Quality Gut Health [55, 56, 83]

Food Quality Gut Health [84, 85]

Physical Activity Gut Health [85, 86]

WHtR Gut Health [87, 88]

Stress Gut Health [89, 90]

Alcohol Gut Health [91, 92]

Anxiety Score Coping Capacity [53, 54, 93]

Depression Score Coping Capacity [53, 54, 93]

Sex Coping Capacity [93]

Stress Coping Capacity [53, 54, 93]

Food Quality Energy Intake [12, 94]

Eating Disorders Energy Intake [12]

Sleep Quality Energy Intake [12, 55–57]

Coping Capacity Energy Intake [95]

Gut Health Energy Intake [96]

Alcohol Energy Intake [12, 97]

Insulin Energy Intake [12, 98, 99]

Antidepressants Energy Intake [12, 98, 99]

Other Medication Energy Intake [12, 98, 99]

Stress Energy Intake [12, 95, 100]

Physical Activity Energy Intake [12, 82, 94]

Age Energy Intake [94]

Sleep Quality Anxiety Score [69, 101]

Gut Health Anxiety Score [85, 102]

Financial Situation Anxiety Score [103, 104]

Alcohol Anxiety Score [105, 106]

Stress Anxiety Score [107]

Physical Activity Anxiety Score [77, 79, 108, 109]

WHtR Anxiety Score [5–7, 10]

Sleep Quality Depression Score [69, 101]

Gut Health Depression Score [85, 102]

Financial Situation Depression Score [103, 104, 110–112]

Alcohol Depression Score [113–115]

Stress Depression Score [107]

Physical Activity Depression Score [77, 79, 108, 109]

WHtR Depression Score [5–7, 10]

Physical Activity Stress [116]

Food Quality Fasting Glucose Levels [117–119]

Physical Activity Fasting Glucose Levels [107]

Stress Fasting Glucose Levels [10, 120–123]

Sex Fasting Glucose Levels [124, 125]

WHtR Fasting Glucose Levels [107]

FFM/Muscle Mass Fasting Glucose Levels [107]

Physical Activity Energy Expenditure [82, 126]

BMR Energy Expenditure by definition [127, 128]

Physical Activity FFM/Muscle Mass [94]

Energy Intake FFM/Muscle Mass [94]

Energy Expenditure FFM/Muscle Mass [94]

WHtR Fat Mass [40]

FFM/Muscle Mass BMR Equation [39] and [127, 128]

Age BMR Equation [39]

Sex BMR Equation [39]

Fat Mass BMR Equation [39]

Other medication BMR [12, 98]

Stress WHtR [8, 10, 11, 15, 120–123]

Physical Activity WHtR [12, 58]

Energy Expenditure WHtR [94, 126, 129–132]

Energy Intake WHtR [94, 126, 129–132]

Fasting Glucose Levels WHtR [81]

Corticoids WHtR [12, 98, 99, 133]

Antihistamines WHtR [12, 98, 99, 133]

Antipsychotics WHtR [12, 98, 99, 133]

Birth control WHtR [12, 98, 99, 133]

Other medication WHtR [12, 98, 99, 133]

FFM/Muscle Mass Weight by definition

Fat Mass Weight by definition

References to the scientific papers, preferably reviews, that prove the causal link between two factors as given by the experts. For some sets of factors only correlation was proven and causality was assumed by experts, but for most links causality was proven in literature.

4.2 System Dynamics Model

Topology Figure 6 shows the SDM that was made from the CLD in figure 5. In the SDM, different delay times are indicated with colors. These delays are taken from the CLD in Van Wietmarschen et al. (2015) [26]. Some variables have multiple influences on the same variable with different delays; this is the result of the contraction of the CLD. Since these multiple influences with different delays have a high temporal correlation, fitting data to an equation that included all these delays resulted in multicollinearity problems. To prevent this multicollinearity in the predictors of the LM algorithm, the delay of the predictor with the strongest effect on the outcome variable according to Van Wietmarschen et al. (2015) was chosen as the only delay of that predictor [26].

Since the data frequency (once every 5 years) was too low to fit the SDM including the delays, new data points were created by linear interpolation. Wherever possible, new data points at 6 months before, 1 month before, 1 month after and 6 months after each required time point were created by linear interpolation between the values of the variables measured before and after that time point.
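A minimal sketch of this interpolation step, assuming measurement times in years and one series per participant; the offsets of plus or minus 6 months and 1 month are expressed as fractions of a year:

```python
import numpy as np

def interpolate_offsets(times, values, offsets=(-0.5, -1/12, 1/12, 0.5)):
    """Create extra data points around each wave by linear interpolation.

    `times` and `values` are one participant's measurement times (in years)
    and the corresponding variable values; `offsets` are in years."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    new_times = []
    for t in times:
        for off in offsets:
            t_new = t + off
            # interpolate only inside the observed range, never extrapolate
            if times.min() <= t_new <= times.max():
                new_times.append(t_new)
    new_times = np.array(sorted(set(new_times)))
    return new_times, np.interp(new_times, times, values)

# example: waves at 0, 5 and 10 years for a WHtR-like series
t_new, v_new = interpolate_offsets([0, 5, 10], [0.52, 0.55, 0.56])
```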

Functional Form Candidate Distributions from SR Analysis The SR analysis resulted in a distribution of candidate functional forms. Of these, the most frequent functional form per outcome variable is shown below; each of these functional forms will be compared with its linear equivalent for usability in the SDM.

Sleep Quality = s1*Anxiety + s2*Age*Anxiety + s3*Depression/Anxiety
Anxiety Score = a1*WHtR + a2*Physical Activity + a3*Sleep + a4*Financial Situation + a5*Sex + a6*FQ + a7*Alcohol
Depression Score = d1*Financial Situation/Physical Activity
Fasting Glucose Level = g1*Sex + g2*Sex^(-1)
Physical Activity = p1*Age + p2*Anxiety + p3*Age*Financial Situation + p4*Education/Sleep
WHtR gain = w1*Sleep*Anxiety
FFM gain = m1*Age + m2*Age*BMR

Of these most frequent functional forms, some are very intuitive or even identical to their linear equivalent (the 'Anxiety Score' equation), whereas others, like the 'Fasting Glucose Level' function, are less intuitive. The 'Fasting Glucose Level' function most likely resulted from the absence, in the Whitehall II data set, of any causal relation with the variables established in the literature that is stronger than the given combination of gender predictors, which is why the SR algorithm selected gender as the strongest predictor. Furthermore, the algorithm did produce multiple interesting age-dependent interactions (Age*Anxiety, Age*BMR and Age*Financial Situation) and some non-intuitive interaction terms, such as Sleep*Anxiety, Depression/Anxiety and Education/Sleep. The optimal parameter values for these equations are discussed in the 'Parameter Value Distributions' section in Appendix A.
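How such candidate forms can be generated is sketched below with gplearn's SymbolicRegressor; the specific SR implementation, settings and data are illustrative assumptions here, not the configuration used in this study. The function set is restricted to the operators that appear in the forms above.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# toy data standing in for one outcome and its predictors,
# e.g. WHtR gain ~ Sleep * Anxiety
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # columns: Sleep, Anxiety
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=500)

sr = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div"),   # matches the forms above
    parsimony_coefficient=0.01,                  # penalize long expressions
    random_state=0,
)
sr.fit(X, y)
print(sr._program)   # e.g. mul(X0, X1), i.e. Sleep*Anxiety
```

Repeating such runs and tallying the returned expressions yields a frequency distribution from which a most frequent form per outcome can be taken.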


Table 3: Mean R² for 1000-fold cross-validated LM optimizations per variable∗

α     Sleep Quality  Food Quality  Depression  Anxiety   Fasting Glucose Level  Physical Activity  WHtR gain  FFM gain
0     0.03735        0.02349       -0.005020   0.06518   -0.04735               0.05419            -0.2147    -0.03662
1     0.03812        0.02387       -0.005138   0.06559   -0.04437               0.05599            -0.2075    -0.03534
2.5   0.04020        0.02268       -0.005583   0.06153   -0.04406               0.05527            -0.2037    -0.03487
5     0.03817        0.02297       -0.005963   0.06422   -0.03840               0.05801            -0.1989    -0.03735
7.5   0.03948        0.02550       -0.005475   0.06396   -0.04079               0.05410            -0.1799    -0.03723
10    0.03890        0.02448       -0.005272   0.06471   -0.03960               0.05991            -0.1727    -0.03646
25    0.04032        0.02475       -0.002155   0.06466   -0.02970               0.05861            -0.1517    -0.03310
50    0.03880        0.02518       0.004230    0.06673   -0.02695               0.06605            -0.1398    -0.03607
75    0.03837        0.02386       0.002924    0.06396   -0.02480               0.06412            -0.1602    -0.04036
100   0.03760        0.02484       0.0006839   0.06042   -0.03113               0.06747            -0.1606    -0.04182
250   0.02822        0.01585       -0.02493    0.006105  -0.02888               0.03218            -0.1525    -0.04246

∗α is the regularization term.

Table 4: Mean R² for 1000-fold cross-validated LM optimizations of the most frequent functional forms of the SR analysis∗

α∗∗   Sleep Quality  Food Quality  Depression  Anxiety   Fasting Glucose Level  Physical Activity  WHtR gain  FFM gain
0     0.02206        0.01161       -0.07252    0.06556   -0.005194              0.07853            -0.02672   0.004358
1     0.02168        0.01160       -0.06964    0.06379   -0.004887              0.07550            -0.02623   0.002479
2.5   0.02219        0.01162       -0.07644    0.06646   -0.005306              0.07668            -0.02647   0.003011
5     0.02138        0.01049       -0.07771    0.06485   -0.005379              0.07548            -0.02623   0.002818
7.5   0.02167        0.01082       -0.07413    0.06411   -0.005235              0.07803            -0.02585   0.004676
10    0.02224        0.01164       -0.07281    0.06505   -0.005265              0.07737            -0.02561   0.002334
25    0.02209        0.01089       -0.06824    0.06796   -0.005064              0.08225            -0.02201   0.005686
50    0.02212        0.01104       -0.07095    0.06850   -0.005324              0.07572            -0.01974   0.001880
75    0.02160        0.01009       -0.07181    0.06201   -0.004815              0.07582            -0.01910   -0.001520
100   0.02121        0.01043       -0.05866    0.06273   -0.005292              0.07240            -0.01891   -0.0068199
250   0.008272       0.003940      -0.04586    0.006791  -0.005624              0.03433            -0.01879   -0.02110

∗The optimizations were done on a limited data set in order to compare the most frequent functional forms of the SR analysis equally with their linear equivalents.
∗∗α is the regularization term.

Regularized Functional Form Comparison Tables 3 and 4 show the results of the 1000-fold cross-validation for different regularization values for, respectively, the LM optimizations of the linear functional forms and the LM optimizations of the most frequent functional forms from the SR analysis.
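The text does not restate which penalty the regularization term α applies; assuming an L2 (ridge) penalty and repeated random train/test splits, the per-cell scores of Tables 3 and 4 can be reproduced along these lines:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def cv_r2(X, y, alpha, n_splits=1000, test_size=0.2, seed=0):
    """Mean out-of-sample R² over repeated random splits: one value per
    (functional form, alpha) cell as in Tables 3 and 4."""
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    scores = []
    for train, test in splitter.split(X):
        model = Ridge(alpha=alpha).fit(X[train], y[train])
        scores.append(r2_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))

alphas = [0, 1, 2.5, 5, 7.5, 10, 25, 50, 75, 100, 250]
# best_alpha = max(alphas, key=lambda a: cv_r2(X, y, a))  # per equation
```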

Table 5: Functional Form Comparison. Mean R² for 1000-fold cross-validated LM optimization with optimal α [95% CI]

Equation               Linear                           Symbolic Regression
Sleep Quality          0.03901 [0.03887; 0.04178]       0.02224 [0.02131; 0.02318]
Food Quality           0.02550 [0.02408; 0.02692]       0.01164 [0.01074; 0.01252]
Depression             0.004230 [0.001637; 0.006832]    -0.04586 [-0.04909; -0.04264]
Anxiety                0.06673 [0.06352; 0.06994]       0.06850 [0.06609; 0.07148]
Fasting Glucose Level  -0.02480 [-0.02736; -0.02225]    -0.004815 [-0.005266; -0.004364]
Physical Activity      0.06747 [0.064120; 0.07083]      0.08225 [0.07887; 0.08564]
WHtR gain              -0.1398 [-0.1438; -0.1359]       -0.01879 [-0.01938; -0.01820]
FFM gain               -0.03310 [-0.03468; -0.03152]    0.005686 [0.004475; 0.006907]

The two functional forms per equation, the linear form and the functional form from the SR analysis, each with their optimal regularization parameter (α), are compared in table 5. As can be concluded from the analyses, the predictive quality of the most frequent functional form from the SR analysis is better than that of the linear equivalent for 'Fasting Glucose Level', 'Physical Activity', 'WHtR gain' and 'FFM gain'. The functional form from the SR analysis for the 'Anxiety Score' equation also achieves a higher R² score, but the difference falls within the 95% confidence interval. This was to be expected, since the functional form for the 'Anxiety Score' equation found in the SR analysis was exactly the same as the linear functional form. The 4 linear functional forms with a worse predictive quality than their SR equivalent were replaced by that equivalent in the SDM.

4.3 Uncertainty Quantification

The equation quality Table 6 shows the equation quality scores per optimization. The quality of the equations is very low in general; for some equations the R² is even negative or insignificant. This means that the data set is indeed too limited to build a useful SDM. Especially the 'WHtR gain', 'FFM gain', 'Fasting Glucose Level' and 'Depression Score' variables seem very hard to predict with this data set. In the case of both 'WHtR gain' and 'FFM gain', the data frequency might be a limiting factor, since both variables represent the gain over a period of 5 years. However, the data frequency is not the only limiting factor in the data set, since the other optimizations, which do not suffer from this low frequency, also give very low R² values.

Table 6: Equation quality for 10,000-fold cross-validated LM optimizations per equation∗

Variable               R²         p-value
Physical Activity      0.07729    <0.00001
Anxiety Score          0.06638    <0.00001
Sleep Quality          0.03876    <0.00001
Food Quality           0.02419    0.00004314
Depression Score       0.004291   0.9306
FFM gain               0.003098   0.2622
Fasting glucose level  -0.005115  -
WHtR gain              -0.01208   -

∗The equations are either of the new functional form found in the symbolic regression (WHtR gain, FFM gain, Physical Activity and Fasting glucose level) or of the original linear functional form (Sleep Quality, Food Quality, Depression Score and Anxiety Score).

The influence of the data set size on the equation quality Another possible limitation of the data set could have been the data quantity. However, the graphs in figure 7 show that the quality of most of the equations has already reached its maximum value when the full data set is used in the optimizations. If the trends visible in the graphs are extrapolated, it can be concluded that more data would not improve the quality of any of the equations by a substantial amount. Especially the optimizations that used >1000 data points will not benefit substantially from more data. The optimizations that used around 500 data points, due to the higher number of missing values in the data, would improve with more data points, but given the trend in the graphs the improvement would be minimal. It can therefore be concluded that the data quantity is not an important limiting factor in this data set. This conclusion is confirmed by the simulation analysis described in Appendix B.
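The analysis behind figure 7 amounts to a learning curve: refit each equation on random subsets of increasing size and record the cross-validated R². A minimal sketch, with the ridge penalty and 5-fold scheme again assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def learning_curve_r2(X, y, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Mean cross-validated R² versus number of data points; a curve that
    is already flat at the full sample size indicates that data quantity
    is not the binding limitation."""
    rng = np.random.default_rng(seed)
    curve = {}
    for frac in fractions:
        n = max(10, int(frac * len(y)))
        idx = rng.choice(len(y), size=n, replace=False)
        curve[n] = cross_val_score(Ridge(), X[idx], y[idx],
                                   scoring="r2", cv=5).mean()
    return curve
```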

The influence of time on the equation quality As already mentioned in the 'Data set' section, the Whitehall II data set did not contain measurements of all important variables at every time point. This made establishing the influence of time difficult, since that requires measurements at consecutive time points. Imputation of measurements for all participants at one time point was considered too uncertain for a robust analysis. The analysis was therefore done with the optimizations of equations that had enough measurements of variables at consecutive time points. 'Enough' here meant that the equation quality resulting from the optimization was at least above 0. Only the optimizations for the 'Anxiety' and 'Physical Activity' equations contained enough data for a robust analysis, as shown in figure 8.

That the data frequency is a limiting factor in this data set intuitively makes sense, given the 5 years between the time points in the Whitehall II data set. Figure 8 confirms that the data frequency is indeed a limitation. It shows the equation quality for the two equations that had sufficient data points. After 5 years, the equation quality of both equations is around 0. The graph shows different trajectories towards an equation quality of 0 after 5 years for 'Anxiety' and 'Physical Activity'; however, the analyses of the trajectories are based on imputed time points, which makes them less robust. These time points had to be imputed because the original data set only contained a time point every 5 years. This could have caused the initial improvement of the 'Physical Activity' equation quality.

How the analyses of the effect of time on two of the auxiliary equations, as shown in figure 8, translate to the effect of time on the rate equations of the stocks in the SDM is uncertain. However, it seems likely that a data frequency of once per 5 years is a limiting factor when building an SDM with the Whitehall II data set.
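The lag analysis of figure 8 can be reproduced along the following lines, assuming a hypothetical long-format DataFrame with 'participant' and 'time' columns; the ridge penalty and 5-fold scheme are again illustrative choices:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def r2_vs_lag(panel, predictors, outcome, lags_years=(0, 1, 2, 3, 4, 5)):
    """Cross-validated R² when the outcome is measured `lag` years after
    the predictors. `panel` has columns 'participant', 'time', the
    predictors and the outcome."""
    scores = {}
    for lag in lags_years:
        future = panel[["participant", "time", outcome]].copy()
        future["time"] -= lag   # outcome at t+lag lines up with predictors at t
        merged = panel[["participant", "time"] + list(predictors)].merge(
            future, on=["participant", "time"]).dropna()
        X = merged[list(predictors)].to_numpy()
        y = merged[outcome].to_numpy()
        scores[lag] = cross_val_score(Ridge(), X, y, scoring="r2", cv=5).mean()
    return scores
```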


[Figure 7: eight panels plotting R² against the number of data points for the Sleep Quality, Food Quality, Physical Activity, Depression Score, Fasting Glucose Level, Anxiety Score, WHtR gain and FFM gain optimizations.]

Figure 7: The relation between the equation quality and the data set size for different optimizations: The number of data points that could be used for the optimizations differed per optimization, since some variables (used in the optimizations) contained more missing data points than others, as already indicated in the 'Data set' section. The similar trends shown by all the graphs indicate that the data quantity is not a severely limiting factor in the Whitehall II data set when building an SDM of obesity.


[Figure 8: R² score plotted against the time (in years, 0 to 5) between predictor and outcome measurements for the Anxiety and Physical Activity (PA) equations.]

Figure 8: The influence of time between predictor and outcome measurements on the equation quality: The different colors show the influence of time on the quality of 2 of the auxiliary equations. The quality of both equations shows a decrease when the time between measurement of the predictors and the outcome is increased to 5 years. The data frequency of once per 5 years is therefore likely a limiting factor in the Whitehall II data set when building an SDM of obesity. PA = Physical Activity

The influence of the single predictors on the equation quality Tables 7 to 10 show how leaving one of the predictors out of the equation affects the equation quality of the significant (p-value of the R² score < 0.05) equations of the SDM. From table 7 it can be concluded that the Anxiety Score is clearly the most important predictor of Sleep Quality, since leaving Anxiety out of the optimization decreases the equation quality almost to 0. Physical Activity and Alcohol, on the other hand, barely seem to help in predicting Sleep Quality. This indicates that either the quality of the Physical Activity and Alcohol measurements is very low or that their direct effects on Sleep Quality are indeed very small.

Table 7: Mean R² score for LM optimizations of Sleep Quality without the indicated predictor

Omitted predictor  none     Physical Activity  Alcohol  Depression Score  Age      Anxiety Score
Sleep Quality      0.03876  0.03875            0.03802  0.03762           0.03677  0.004729

Tables 8 and 9 show a more evenly distributed importance among the predictors of, respectively, Physical Activity and the Anxiety Score. Leaving any one predictor out of the equation always leads to an inferior prediction. Nonetheless, Age is by far the most important predictor of Physical Activity.

Table 8: Mean R² score for LM optimizations of Physical Activity without the indicated predictor

Omitted predictor  none     Anxiety  Financial Situation  Education  Sleep Quality  Age
Physical Activity  0.07729  0.07339  0.07017              0.06175    0.05008        0.004465

Table 9: Mean R² score for LM optimizations of the Anxiety Score without the indicated predictor

Omitted predictor  none     Alcohol  WHtR     Food Quality  Financial Situation  Physical Activity  Sleep Quality  Sex
Anxiety Score      0.06638  0.06316  0.05780  0.05463       0.04843              0.04344            0.03945        0.03639


Table 10: Mean R² score for LM optimizations of Food Quality without the indicated predictor

Omitted predictor  none     Sleep Quality  Anxiety Score  Financial Situation  Depression Score  Education
Food Quality       0.02419  0.02432        0.02279        0.02252              0.01541           -0.003488

Table 10 shows a result similar to table 7. Education contains the most information about Food Quality, followed by the Depression Score. On the other hand, variables like Sleep Quality and, to a lesser extent, the Anxiety Score and Financial Situation do not affect the predictive quality of the equation, which means that, again, either the true effect of these variables on Food Quality is very low, or the quality of the measurements of Food Quality, or of Sleep Quality, the Anxiety Score and Financial Situation, is low.
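The leave-one-predictor-out analysis behind tables 7 to 10 can be sketched as follows (ridge penalty and 5-fold scheme assumed, as before); note that a small drop in R² cannot by itself separate a weak true effect from a poorly measured predictor:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def predictor_importance(X, y, names):
    """Return the full-model R² and, per predictor, the R² obtained when
    that predictor is left out of the design matrix."""
    full = cross_val_score(Ridge(), X, y, scoring="r2", cv=5).mean()
    reduced = {}
    for j, name in enumerate(names):
        X_without = np.delete(X, j, axis=1)   # drop column j
        reduced[name] = cross_val_score(Ridge(), X_without, y,
                                        scoring="r2", cv=5).mean()
    return full, reduced
```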

Cluster Analysis of the quality of the equations Table 11 shows the equation quality when different numbers of clusters are used to fit the equations. As indicated in the 'Method' section, the way of splitting the data set into a train and a test set was different for the cluster analysis. For some of the equation optimizations, the number of data points that could be used in the test set was very small due to the complete absence of measurements of some of the variables at certain time points. This might have influenced the overall equation quality of the analysis. The cluster analysis nevertheless shows that the use of clusters is not beneficial for the equation quality of most of the SDM equations. Only the 'Depression Score' and 'Anxiety Score' equations seem to benefit from the use of clusters. Unfortunately, both of these analyses suffered from the high number of missing data points, as can be concluded from the negative R² values. The use of the cluster analysis for SDMs as proposed in this study could therefore not be completely validated here.

Table 11: Mean R² with 95% CI of 1000-fold cross-validation for different numbers of clusters per SDM equation

Equation               1 cluster                2 clusters                  3 clusters
Sleep                  0.231 [0.164; 0.299]     -7.802 [-8.078; -7.527]     -
Food Quality           -0.938 [-1.067; -0.808]  -1.915 [-2.055; -1.774]     -
Depression Score       -0.810 [-0.970; -0.651]  -0.092 [-0.148; -0.0358]    -1.279 [-1.352; -1.206]
Anxiety Score          -6.416 [-6.662; -6.170]  -0.439 [-0.510; -0.368]     -0.259 [-0.309; -0.209]
Fasting Glucose Level  -0.653 [-0.555; -0.751]  -2.676 [-2.530; -2.821]     -
Physical Activity      0.160 [0.0894; 0.232]    -0.0770 [-0.144; -0.00948]  -
WHtR Gain              -0.355 [-0.461; -0.249]  -6.908 [-7.176; -6.641]     -
FFM Gain               -1.690 [-1.877; -1.503]  -7.263 [-7.632; -6.894]     -

Equation               4 clusters                5 clusters
Sleep                  -                         -
Food Quality           -                         -
Depression Score       -                         -
Anxiety Score          -0.124 [-0.168; -0.0799]  -0.575 [-0.631; -0.520]
Fasting Glucose Level  -                         -
Physical Activity      -                         -
WHtR Gain              -                         -
FFM Gain               -                         -
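A sketch of the per-cluster fitting idea behind table 11 is given below; clustering on the predictor values with k-means and pooling the per-cluster test predictions into one R² are assumptions here, as the exact clustering features and scoring are described in the 'Method' section rather than in this sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def clustered_r2(X, y, n_clusters, seed=0):
    """Cluster the participants, fit one equation per cluster, and score
    the pooled out-of-sample predictions."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    y_true, y_pred = [], []
    for c in range(n_clusters):
        Xc, yc = X[labels == c], y[labels == c]
        if len(yc) < 10:          # skip clusters too small to split
            continue
        Xtr, Xte, ytr, yte = train_test_split(Xc, yc, test_size=0.2,
                                              random_state=seed)
        y_true.extend(yte)
        y_pred.extend(Ridge().fit(Xtr, ytr).predict(Xte))
    return r2_score(y_true, y_pred)
```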

Summary of results Table 12 summarizes the results of this study. For every data limitation, it indicates how the influence of that limitation was analysed and whether it could be distinguished from the other possible limitations in the data set. The data limitations are ordered from easy to analyse and to distinguish from the other limitations, to hard to analyse and to distinguish from the other limitations.
