Bayesian optimisation for modular design: BOMoD.jl

Academic year: 2021

Share "Bayesian optimisation for modular design: BOMoD.jl"

Copied!
113
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst


BAYESIAN OPTIMISATION FOR MODULAR DESIGN: BOMOD.JL

Word count: 21,574

Sam Dierickx

Student ID: 01804175

Supervisors: Prof. dr. Bernard De Baets, dr. ir. Michiel Stock

Tutor: ir. Bjorn Criel

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Bioinformatics: Bioscience Engineering



This page is not available because it contains personal information.

Ghent University, Library, 2021.


"A failure of Bayesian prediction is an opportunity to learn"
— Edwin Thompson Jaynes, 1986

During this dissertation, I discovered a new world of Bayesian statistics. In this world, learning is a constant process of updating your knowledge based on successes and failures. Both were encountered last year: the successes were documented, the failures were used to learn from.

Learning a new programming language, processing all new information and documenting all gained insights would not have been possible without the help of several people. First, I would like to thank prof. dr. Bernard De Baets and dr. ir. Michiel Stock, my two promotors. Professor De Baets, thank you for supervising this project and providing me with feedback on my writings.

Michiel, I could count on you the entire year for many different problems. You enlightened me with the new Julia language. I appreciate the feedback you gave me on my text and on the Julia code of the package.

Secondly, I would like to thank ir. Bjorn Criel, my tutor; you supported me during the entire process of this work. Your good feedback on my writings pushed this work to a higher level. Providing me with ready-to-use data is also something most data scientists can only dream of.

I want to thank the staff members of the "Taalonthaal" of Ghent University for providing a course on academic writing.

Generally, I would like to thank all my teachers, assistants and professors who educated me and inspired me to keep on learning every day.

Finally, I would like to thank my family and friends for all the support during the project. I would especially like to thank Ellen: your encouragement and assistance were of great value to me throughout the entire year.


CONTENTS

Acknowledgements
Contents
Abstract
Nederlandstalige samenvatting
List of figures
List of tables
List of abbreviations

1 Research objectives and outline
   1.1 Introduction
   1.2 Problem statement and objectives

2 Bayesian optimisation for modular design
   2.1 Modular designs
      2.1.1 Modular designs in biology
      2.1.2 Combinatorial options
   2.2 Mathematical optimisation
   2.3 Bayesian optimisation
      2.3.1 Bayesian optimisation: example
      2.3.2 First sampling step
      2.3.3 Surrogate model: Gaussian Process
      2.3.4 Acquisition functions and sampling methods
      2.3.5 Termination of the algorithm

3 The BOMoD.jl package
   3.1 Introduction
   3.2 Constructing design space
      3.2.1 Set design inputs
      3.2.2 Constraints
      3.2.3 Construct design
   3.3 The space object
   3.4 Bayesian optimisation
      3.4.1 First sampling step
      3.4.2 Surrogate model
      3.4.3 Batch sampling
   3.5 Iteration
   3.6 ".jl", Why Julia?

4 Results and Discussion: Case studies
   4.1 Set-up of the analysis
   4.2 Simulated data
      4.2.1 Dataset
      4.2.2 Results
   4.3 Enzybiotics
      4.3.1 Dataset
      4.3.2 Results
   4.4 Synthetic co-cultures
      4.4.1 Dataset
      4.4.2 Results
   4.5 Kernels based on a graph structure

5 Conclusions and future perspectives
   5.1 Conclusions
   5.2 Future Perspectives

References

Appendix A Graphs
   A.1 Graphs

Appendix B
   B.1 Simulated data
   B.2 Enzybiotics


ABSTRACT

In multiple industries, new products are constructed in a modular way and new prototypes are obtained after swapping a single module. The optimisation process of these modularly designed products is hampered by a potential combinatorial explosion. This results in high research costs to make and evaluate new prototypes. Optimisation commonly is an iterative process where new variants are selected based on the evaluation of previous experiments. Conventional methods such as rational design are not efficient and integrate only a fraction of all valuable information from the previous experiments. In this work, a new data-driven alternative is proposed based on an iterative batch sampling algorithm in order to reduce the optimisation cost. The algorithm is built on a Bayesian optimisation (BO) framework, which has become a popular method to solve combinatorial optimisation problems. It integrates all currently evaluated data points in a surrogate model, a Gaussian process (GP), and selects the best next data point by balancing exploration of the design space and exploitation of the underlying prediction model. The standard BO pipeline is not suitable to handle modular design optimisation problems efficiently, given the specific data input and the sequential sampling algorithm. Two notable improvements were proposed. First, GPs were adapted to handle a character vector input using different string kernels based on the cosine similarity and the Levenshtein distance. Secondly, the sequential sampling step of BO was extended towards a flexible batch sampling method, which allows researchers to evaluate multiple combinations simultaneously. The proposed batch sampling algorithms are extensions of standard acquisition functions used in the BO framework. The algorithm is integrated into a new Julia package, "BOMoD.jl", which makes the tool available to researchers. Besides the new BO algorithm, the package provides an efficient method to generate and interact with the large combinatorial design spaces. The new optimisation method is evaluated on a real biological problem: the design of enzybiotics against Enterococcus sp., a possible alternative to antibiotics.

Key Words: Bayesian Optimisation, Batch sampling, Modular design, BOMoD.jl, Gaussian processes


SAMENVATTING

Several research domains require the optimisation of products with a modular design. These products are characterised by their specific structure, in which individual modules can be swapped to obtain new variants. The optimisation process carries a high investment cost, in both time and money, since multiple iterations are needed. Each time, new prototypes are made based on information from the previous experiments. Current optimisation methods rely mainly on expertise, so that not all information from previous experiments can be used. In this work, an iterative batch sampling algorithm is proposed. This algorithm is based on Bayesian optimisation (BO), a popular method for solving combinatorial optimisation problems. BO integrates all data from previous experiments in an underlying model, a Gaussian process (GP). It selects the best combinations to be tested by trading off the predicted values of the model against the uncertainty on these predictions. The classical BO algorithm cannot handle the modular input data well and can only propose one single new combination, which prevents grouping different experiments and thereby increases costs. In this work, two improvements are therefore proposed. First, new kernels are proposed for the GP, so that these models make more efficient use of the modular character of the input data. Secondly, in every iteration of the algorithm a batch of new combinations is proposed, so that multiple combinations can be evaluated simultaneously. The proposed algorithm was integrated in a new Julia package, "BOMoD.jl", making it available to researchers. Besides this new BO algorithm, the package offers an efficient method to virtually generate all possible combinations of modules. The new method was validated in the search for the optimal structure of modular enzybiotics against Enterococcus sp. as an alternative to current antibiotics.

Keywords: Bayesian optimisation, batch prediction, modular design, BOMoD.jl, Gaussian processes


LIST OF FIGURES

1.1 Illustration of a modular design. The first four blocks represent single modules. The modules can be combined to form unique constructs. Various design rules are possible, given the setup of the experiment. The construct can then be evaluated to obtain an activity. The objective is to find the construct with the highest activity.

2.1 Advantages of a co-culture for multiple-step conversions. S is the start substrate and P is the end product. To obtain P, the intermediate products X1 and X2 are produced. a) When a pure culture is used, one organism is responsible for all conversions and needs to produce all enzymes. b) In a multi-species approach the labour is divided over multiple organisms; in this way every organism is responsible for a single part in the overall pathway. Illustration from Brenner et al. (2008).

2.2 The combinatorial explosion of possible combinations. n is the given number of available modules, k is the size of the allowed combinations. Top-left: permutations with replacement. Bottom-left: combinations with replacement. Top-right: permutations without replacement. Bottom-right: combinations without replacement.

2.3 Example of the BO framework. Evaluation of f(x) = sin(2x) − cos(x) with x ∈ [0, 7]. Iterations one to five are shown. In every plot the grey line is the true objective function, which is normally unknown. The dark blue line indicates the predictions made by a probabilistic surrogate model. The light blue area indicates the uncertainty on these predictions; larger blue areas indicate that the model is more uncertain about the given predictions. The green dots and the red cross are the sampled data points used to fit the model. The red cross is the last sampled data point, obtained from the previous iteration. The green line is the acquisition function made by using the model above it. The maximum of this function is indicated with a red triangle; this is the next data point used to update the model.

2.4 An example of a GP model. A GP with mean prior function equal to 0 and a squared exponential kernel with hyperparameters ℓ = 1 and σn² = 10⁻⁶. The bold blue line is the mean predicted value from the GP model for every x. The colored band represents the uncertainty on the predictions. The red dotted lines are random samples from the GP model. The yellow line equals f(x) = sin(2x) − cos(x). The green dots are the samples used to fit the GP.

2.5 Examples of various kernels. Five random samples from a GP with, from left to right: the linear kernel, the squared exponential kernel and the periodic kernel. All hyperparameters are set to 1 and σn² = 10⁻⁶.

2.6 Effect of the hyperparameter of a kernel. Five random samples from a GP with a squared exponential kernel with a different hyperparameter ℓ, from left to right ℓ = [0.5, 1, 3], σn² = 10⁻⁶.

2.7 Examples of composed kernels. Five random samples from a GP with compositions of basic kernels, from left to right: linear + squared exponential, linear · squared exponential, squared exponential · periodic + linear. All hyperparameters are set to 1 and σn² = 10⁻⁶.

2.8 Effect of the hyperparameter on the uncertainty of GPs. The same GP model as in Figure 2.4 was used. Only the hyperparameter ℓ was changed between the different plots. From left to right ℓ = [2, 0.692, 1/3]. The GP model in the central panel has an optimal ℓ based on the maximum marginal likelihood.

2.9 Evaluation of BO algorithm after 15 iterations. The current highest found f(x⁺) as a function of the number of iterations. If the line goes up, a new data point with a higher activity value was found. A flat line over multiple iterations could indicate that a highly active combination was found, or that the model is exploring the design space.

3.1 Illustration of a single iteration through the BOMoD pipeline. The blue terms are the most important functions of the package. The green terms are the most important types used in the package.

3.2 The official logo of Julia.

4.1 Results of the simulated dataset. The given input parameters are: n = 5, b = 5 and r = 50. Left: the fraction of repeats that found the highest active combination at every iteration. Right: the highest observed value was stored at every iteration and averaged over all repeats. The colored band represents the standard error on the mean. The dashed line represents the random baseline.

4.2 Violin plot of the enzybiotics dataset. Every construct was evaluated three times; the averaged rABC value of every construct is visualised in the plot.

4.3 Results of the enzybiotic dataset: large batch. Different sampling algorithms are combined with the different kernels as given in the legend. The given input parameters are: n = 40, b = 20 and r = 50. Left: the fraction of repeats that found the highest active enzybiotic construct at every iteration. Right: the highest observed rABC value at every iteration, averaged over all 50 repeats. The colored bands represent the standard error of the mean. The dashed line represents the random baseline.

4.4 Results of the enzybiotic dataset: small batch. Different sampling algorithms are combined with the different kernels as given in the legend. The given input parameters are: n = 10, b = 10 and r = 50. Left: the fraction of repeats that found the highest active enzybiotic construct at every iteration. Right: the highest observed rABC value at every iteration, averaged over all 50 repeats. The colored bands represent the standard error of the mean. The dashed line represents the random baseline.

4.5 Results of the enzybiotic dataset: comparison. The enzybiotic dataset evaluated with a GP-UCB sampling algorithm. Two settings are evaluated, one with b = 10 and one with b = 20. Left: the fraction of repeats that found the highest active combination in terms of the number of experiments. Right: the maximum observed value in terms of the number of experiments, averaged over 50 repeats. The colored bands represent the standard error of the mean.

4.6 Violin plot of the synthetic co-culture dataset. The percentage of mineralization of every co-culture.

4.7 Results of the synthetic co-culture dataset. Different sampling algorithms combined with the cosine kernel as given in the legend. The given input parameters are: n = 5, b = 5 and r = 100. Left: the fraction of repeats that found the highest active co-culture at every iteration. Right: the highest observed percentage of mineralization at every iteration, averaged over 100 repeats. The colored bands represent the standard error of the mean. The dashed line represents the random baseline.

4.8 Results of the kernel using an underlying graph structure. Simulation based on the enzybiotic dataset. Different sampling algorithms are combined with the different kernels as given in the legend. The given input parameters are: n = 10, b = 10 and r = 50. Left: the fraction of repeats that found the highest active enzybiotic construct at every iteration. Right: the highest observed rABC value at every iteration, averaged over all 50 repeats. The colored bands represent the standard error of the mean. The dashed line represents the random baseline.

A.1 Example of a weighted undirected graph. Graph with 5 vertices A, B, C, D, E and 6 edges (A,B), (A,C), (B,C), (B,D), (C,D), (D,E), with a weight attributed to every edge.

B.1 Fraction of iterations that found the 1% best combinations. Different sampling algorithms combined with the linear kernel as given in the legend. The given input parameters are: n = 5, b = 5 and r = 50. The dashed line represents the random baseline.

B.2 Best or second best data point: one. Different sampling algorithms are combined with the different kernels as given in the legend. The given input parameters are: n = 40, b = 20 and r = 50. The fraction of repeats that found the best or second best data point. The dashed line represents the random baseline.

B.3 Best or second best data point: two. Different sampling algorithms are combined with the different kernels as given in the legend. The given input parameters are: n = 10, b = 10 and r = 50. The fraction of repeats that found the best or second best data point. The dashed line represents the random baseline.


LIST OF TABLES

2.1 Formulas for the different combinatorial designs. n is the number of unique modules that can be used and k is the desired size of the product.
2.2 Common stationary kernels.
2.3 Common non-stationary kernels.
2.4 Example of spectra, p = 2.
2.5 Example spectrum kernel. The Gram matrix obtained based on the inner product of the corresponding spectra from Table 2.4.


LIST OF ABBREVIATIONS

BAM      2,6-dichlorobenzamide
BO       Bayesian Optimisation
CBD      Cell Wall Binding Domain
CDF      Cumulative Distribution Function
DoE      Design of Experiment
EAD      Enzymatically Active Domain
EI       Expected Improvement
GP       Gaussian Process
GP-UCB   Gaussian Process Upper Confidence Bound
LBC      Lower Bound Criteria
MAP      Maximum A Posteriori
ML-II    Type II Maximum Likelihood
PDF      Probability Density Function
PI       Probability of Improvement
rABC     relative Area Between Curves
TS       Thompson Sampler


RESEARCH OBJECTIVES AND OUTLINE

1.1 Introduction

Innovation is the driving force of our economy and can lead to new medicines, machines and food technologies. One major drawback for the development of new technologies is that it requires a significant investment of time and money. For example, in the pharmaceutical sector, it requires on average thirteen years and 1,778 million US dollars to develop a new drug (Paul et al., 2010). This high R&D cost hampers the development of new drugs. For new antibiotics, research has been dramatically reduced because their development has a high risk of failing compared to their economic return (Butler et al., 2013; Towse et al., 2017). This indicates the need to develop new approaches to reduce the resources needed for innovation.

Today, many products are built up in a modular way. This lowers production costs and facilitates the innovation and development of new products (Ulrich, 1995). The idea of modular design started in the construction of houses, but was picked up in many different fields such as biotechnology and electronics (Garcia, 2019). Modularly designed products are built by combining single elements called modules, where every module facilitates a specific task and is connected to other modules. For example, a small controller lamp is a single module connected by wires to the switch and the batteries; these three modules form the complete product (Chun-Che Huang, 1998). Different configurations of a single product can be made by changing the individual modules. One specific configuration is referred to as a construct, as illustrated in Figure 1.1. To evaluate and compare different constructs, a method is needed to measure their activity. This activity, e.g., the activity of a medicine, gives the construct a specific value, which is used to determine which construct is superior regarding the foreseen task. As a result, modular product innovation is summarised by the following optimisation problem: "Which combination of available modules results in the construct with the highest activity?"


The most straightforward way to identify the best construct is to empirically test all variants. As the number of individual modules grows, there is a combinatorial explosion of possible constructs (Faticoni, 2013). As a consequence, evaluating all possible combinations of modules requires a tremendous amount of resources, and alternative approaches need to be considered.


Figure 1.1: Illustration of a modular design

The first four blocks represent single modules. The modules can be combined to form unique constructs. Various design rules are possible, given the setup of the experiment. The construct can then be evaluated to obtain an activity. The objective is to find the construct with the highest activity.

A proposed alternative is the use of mathematical optimisation algorithms to guide the researchers to the optimal combination of modules. Optimisation is a well-studied field in mathematics, with a large number of different algorithms and theorems (Kochenderfer and Wheeler, 2019).

A mathematical optimisation algorithm known as Bayesian Optimisation (BO) has the potential to reduce the number of experiments needed to find a superior construct. This data-driven approach has been tested in different fields, for example in materials science for the design of nanostructures (Packwood, 2017), in the field of synthetic biology to optimise the synthetic design of genes (González et al., 2015b), or in environmental applications to find the best co-culture to perform wastewater treatment (Daly et al., 2019).

This BO framework proposes the next best data point based on all previously known data points. Afterwards, its activity is determined and added to the other known data points. In this way, the BO algorithm tries to lead the researchers along the shortest road to the optimal construct, without exhaustively screening all variants.

The process of BO could be beneficial for many other modular design research problems, but the classical version of BO has some limitations. First, BO typically uses continuous data, whereas a modular design is a discrete optimisation problem. Secondly, each new data point is proposed sequentially, which is in many cases not the most economical procedure. Thirdly, there is no easy-to-use software available that makes these mathematical models applicable for researchers working in the lab without the need to understand all the mathematical background. Finally, there is the practical problem that when the number of individual modules increases, the number of different constructs that can be generated is too large to handle efficiently. Many of these problems are general and are currently being tackled by researchers all over the world. In this dissertation, the goal is to make user-friendly software that provides a BO pipeline for modular design optimisation problems and deals with these limitations. This will aid the optimisation process of modular design problems and reduce the cost of development.

The focus of this dissertation is on biotechnological applications. To demonstrate and evaluate the software, a case study of modular protein design is used. The considered proteins are modular enzybiotics. Their modules originate from phage lytic proteins that are used by bacteriophages to cross the cell wall of their bacterial hosts. They enzymatically degrade the structural component of the cell wall, peptidoglycan. Phage lytic proteins are very diverse because of their capability to target a specific host pathogen (Fernández-Ruiz et al., 2018). Many enzybiotics have a modular design of their own. They consist of different modules, each with a specific function (Díaz et al., 1990). The domains are divided into two subtypes. The first are the Enzymatically Active Domains (EADs), responsible for the degradation of the peptidoglycan in the cell wall of the pathogen. Various EAD types are known, each able to break a specific bond in the peptidoglycan structure. The second domain type are the Cell Wall Binding Domains (CBDs). These domains interact with the host cell and have two distinct tasks. Firstly, they ensure that the EADs can be active by providing close contact with the pathogen cell. Secondly, these CBDs determine the specificity of the interaction (Fernández-Ruiz et al., 2018). In nature, various EADs and CBDs are known; synthetic combinations of different EADs and CBDs result in proteins ranging from non-active to more active than natural variants. The goal is to develop and test new combinations of known domains to obtain a new highly active variant, which could lead to new potential drugs (Gerstmans et al., 2018).

1.2 Problem statement and objectives

BO has proven to be a useful tool to solve modular design optimisation problems. Still, the application is limited in practice due to the absence of software combining the following elements:


1. the ability to efficiently generate and handle large combinatorial design spaces;

2. the ability to deal with a discrete design space represented as a string of characters, containing no numerical information, as is the case for modular design problems;

3. the ability to use a BO algorithm that samples multiple new data points simultaneously.

The goal of this dissertation is to provide an encapsulated solution for these different problems in one user-friendly Julia package, BOMoD.jl. This package should allow researchers in different fields to speed up their research concerning modular designs. After development, the package was tested and evaluated on a real-world biological problem regarding the optimisation of modular enzybiotics. Finally, proper documentation is provided to allow other researchers to use or even improve this package. The dissertation is built up in five chapters:

1. Research objectives and outline

2. Bayesian optimisation for modular design

This chapter groups the literature review as well as the materials and methods. It gives a definition of a modular design and focuses on biological applications where BO is useful. Afterwards, the theoretical background of BO is discussed, together with extensions known from the literature.

3. The BOMoD.jl package

This chapter gives an overview of the BOMoD.jl functionalities and closely resembles the documentation of the package.

4. Case studies

This chapter applies the package to different case studies to evaluate the performance of the algorithm.

5. Conclusions and future perspectives


BAYESIAN OPTIMISATION FOR MODULAR DESIGN

This chapter encompasses the literature review and the materials and methods. It starts with a formal definition of a modular design, its applications in biology and the combinatorial explosion of possibilities. Secondly, mathematical optimisation is introduced, with an elaborate part about Bayesian optimisation and adaptations currently proposed in the literature that improve the BO algorithm for modular design problems.

2.1 Modular designs

2.1.1 Modular designs in biology

Over time, many definitions of a "modular design" were proposed (Salvador, 2007), because modular designs appear in many different fields of engineering: chemical, civil, software, biology, etc. (Garcia, 2019). Miller and Elgard (1998) defined a modular structure as:

A modular structure is a structure consisting of self-contained, functional units (modules) with standardised interfaces and interactions in accordance with a system definition. Replacing one module with another creates a new variant of the product.

The focus of this work is to aid the research of biological modular design problems, which are present in all domains of biology, starting at the molecular level going all the way up to multi-species communities (Garcia, 2019). In what follows, both ends of the spectrum are discussed in more detail.


Modularity at molecular level

At the molecular level, an entirely new field of synthetic biology emerged out of this modular thinking. The modular idea was introduced with the research on the lac operon by François Jacob and Jacques Monod in 1961 (Jacob and Monod, 1961). The starting point of synthetic biology came almost 40 years later, after the development of many new technological and molecular tools (Cameron et al., 2014). Today, researchers, engineers and computer scientists are working closely together in this interdisciplinary field. With the aid of a large molecular toolbox and many computational algorithms, they can make new-to-nature cellular compounds with all sorts of applications (Garcia, 2019). Examples are: biosensors (Feng et al., 2018), the development of enzymes that facilitate the production of biofuels (Jagadevan et al., 2018) and protein engineering for the development of novel antimicrobial drugs (Schmelcher et al., 2012).

Modularity in multi-species communities

Modularity was first introduced in ecology when studying the interactions between different species in nature (Garcia, 2019; Grilli et al., 2016), which is not of interest in this work.

The second type of modularity focuses on the composition of synthetic co-cultures of microorganisms. These co-cultures can be used for many complex processes, e.g., wastewater treatment, production of bulk chemicals and production of biofuels. In all these processes it is beneficial to use a combination of multiple microorganisms, known as a co-culture (Jiang et al., 2019; He et al., 2004; Zhang et al., 2015). These co-cultures have multiple advantages over pure cultures, which are based on a single species. One of the main advantages is that the labour is divided over the different organisms, as visualised in Figure 2.1. In this way, every organism is engineered and optimised for one or several steps of the entire pathway. For a pure culture, one organism has to provide all enzymes of the pathway, which causes a high metabolic burden. Additionally, the optimal environment is different for each step from substrate to product. The use of co-cultures provides a tool to approach the optimal conditions for every reaction, whereas in pure cultures the conditions are sub-optimal regarding the different steps of the conversion (Brenner et al., 2008; Shong et al., 2012; Zhang and Wang, 2016).

Linking back to the definition of a modular design, a microorganism can be seen as the functional unit with a specific task. If a species is swapped with another variant, this results in a new variant of the co-culture with different properties. The hard part is optimising the interactions between the microorganisms, because swapping microorganisms is not a plug-and-play system. The interactions between the microorganisms are very complex, and not all microorganisms are willing to live in accordance with other organisms in one culture. The various organisms need to live together in a synthetic symbiosis, where the relative concentration of the different organisms is important to obtain a good yield. There has to be a good match between the different species in the co-culture to provide a robust system over time (Jones and Wang, 2018; Zhang and Wang, 2016).

Efforts are made to control the ratios between different organisms (Stephens et al., 2019). In the end, a lot of combinations need to be tested to obtain the optimal co-culture.

Figure 2.1: Advantages of a co-culture for multiple-step conversions
S is the start substrate and P is the end product. To obtain P, the intermediate products X1 and X2 are produced. a) When a pure culture is used, one organism is responsible for all conversions and needs to produce all enzymes. b) In a multi-species approach the labour is divided over multiple organisms; in this way every organism is responsible for a single part in the overall pathway. Illustration from Brenner et al. (2008).

2.1.2 Combinatorial options

Every modular design problem starts with defining the available individual modules. Afterwards, researchers would like to know how many variants can be made given the input modules. To start, some additional boundary conditions are set, as discussed below; then the number of variants can be calculated with basic combinatorics.

First, the size of the product needs to be defined, which is the number of modules used in a single variant; e.g., this is the number of different species used in one co-culture. If multiple sizes are allowed, one can determine the number of variants for every size separately and then take the total to obtain all possibilities (Mladenović, 2019). Secondly, an important boundary condition is the relevance of the order of the different modules. This means that the relative position of different modules influences the performance of the product. As a consequence, combinations with the same modules but in a different order will have a different activity and are seen as two different configurations (Mladenović, 2019). For example, in the design of enzybiotics, two proteins that are built out of the same domains but in a different order will have different activity values. In the case of co-cultures, there is no order between the used microorganisms; all species are inoculated together in one fermentor. Thirdly, a design problem can allow for replacement or not. This refers to the fact that the same module can be used multiple times in one product variant.

Given these different boundary conditions, general formulas are known to calculate the number of combinations, as given in Table 2.1 (Faticoni, 2013; Mladenović, 2019).

Table 2.1: Formulas for the different combinatorial designs
n is the number of unique modules that can be used and k is the desired size of the product.

                                     With replacement        Without replacement
Order important (permutations)       $n^k$                   $\frac{n!}{(n-k)!}$
Order unimportant (combinations)     $\binom{n+k-1}{k}$      $\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$

These formulas indicate the explosion of possible combinations that can be made with only a limited number of modules available. This is known as the Curse of Dimensionality (Bellman, 2015). Figure 2.2 illustrates the problem: if the number of available modules increases, the number of possible variants grows quickly.
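To make the counting concrete, the following is a minimal Julia sketch of the four formulas in Table 2.1 (the function names are illustrative; big() guards against integer overflow for larger n):

n_perm_rep(n, k) = big(n)^k                                   # order important, with replacement
n_perm(n, k)     = factorial(big(n)) ÷ factorial(big(n - k))  # order important, without replacement
n_comb_rep(n, k) = binomial(big(n + k - 1), big(k))           # order unimportant, with replacement
n_comb(n, k)     = binomial(big(n), big(k))                   # order unimportant, without replacement

# the combinatorial explosion for n = 20 modules and constructs of size k = 5:
n_perm_rep(20, 5), n_perm(20, 5), n_comb_rep(20, 5), n_comb(20, 5)
# → (3200000, 1860480, 42504, 15504)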

Finally, a modular design problem can have some constraints, such as specific combinations of modules that are not allowed. These design rules can be based on previous research or on physical restrictions. Such constraints decrease the number of possible combinations, which makes it easier to find the optimal combination. The downside is that a unique formula to calculate all allowed combinations is not trivial. In most cases, the unconstrained design is generated and the unwanted combinations are filtered out afterwards. As a consequence, the entire design space still needs to be generated.


Figure 2.2: The combinatorial explosion of possible combinations
n is the given number of available modules, k is the size of the allowed combinations. Top-left: permutations with replacement. Bottom-left: combinations with replacement. Top-right: permutations without replacement. Bottom-right: combinations without replacement.

2.2 Mathematical optimisation

There is a large diversity in the optimisation techniques available to solve an optimisation problem, and covering every single one is almost impossible. There is not just one algorithm to solve all problems, which is known as the "No-Free-Lunch" theorem (Yang, 2018). This implies that for every specific set of problems, a new algorithm is developed, and the diversity among them is large. In this work, the main focus is on the BO algorithm, which matches the modular design problem well (Daly et al., 2019; Packwood, 2017). This first section gives an overview of the variety of optimisation methods and focuses on the general idea behind mathematical optimisation. The general optimisation problem is given as (Kochenderfer and Wheeler, 2019):

$$\max_{\mathbf{x} \in S} f(\mathbf{x}) \tag{2.1}$$

where $\mathbf{x}$ is the data point, evaluated by the objective function $f$. Generally $\mathbf{x}$ is a vector, corresponding to a p-dimensional data point, which is written as:

$$\mathbf{x} = [x_1, x_2, \cdots, x_p]^T . \tag{2.2}$$

$S$ is the search space, limiting the possible values of $\mathbf{x}$. A solution of the given problem is noted as $\mathbf{x}^*$. Others define the problem as a minimisation problem. Different formulations of mathematical optimisation can be found in textbooks (Nocedal and Wright, 2006; Kochenderfer and Wheeler, 2019; Snyman and Wilke, 2018; Yang, 2018). Maximisation and minimisation problems can easily be reformulated into one another by a simple transformation (Brochu et al., 2010):

$$g(\mathbf{x}) = -f(\mathbf{x}) . \tag{2.3}$$

Using this transformation, minimising $g(\mathbf{x})$ is equal to maximising $f(\mathbf{x})$. To reduce the complexity, in the remainder of this work all problems will be treated as maximisation problems.

A single generic algorithm to solve Eq. 2.1 for all optimisation problems is not possible, because every optimisation problem has specific characteristics and requires a different approach. A complete classification of all different algorithms is beyond the scope of this work.

First, it is important to know whether one has to deal with a single-objective or a multi-objective optimisation problem. In single-objective optimisation only one objective is optimised, e.g., in the case of enzybiotics only the activity of the enzyme. However, it is possible to also take into account the (thermo)stability, salt tolerance, optimal pH or the activity against multiple bacterial strains. This results in a multi-objective optimisation problem where multiple objectives are optimised. For multi-objective optimisation, various formulations are known, depending on the relative importance of the different objectives; they are described by Kochenderfer and Wheeler (2019). Multi-objective optimisation problems are much harder to solve than their single-objective variants. This work focuses on single-objective optimisation.

After selecting which objective is optimised, it is important to know whether there is a mathematical relation between the input vector $\mathbf{x}$ and the measured activity $f(\mathbf{x})$. For example, if $f(x_1, x_2) = 3x_1 - 2x_1x_2 + x_2 + 1$ is known, then this relation can be exploited to find $\mathbf{x}^*$. Such a relation allows calculating first- or second-order gradients, which are required and used in many optimisation algorithms. If $f(\mathbf{x})$ is unknown or gradients cannot be computed, then an approximation of $f(\mathbf{x})$ or a black-box model can be used. Generally, gradient-based techniques outperform the others, so if this information is known, it is advised to use it (Audet and Hare, 2017).

Besides the mathematical description of the objective, it is important to evaluate the search space $S$. In a discrete optimisation problem, the number of possible solutions is finite or countably infinite (Kongkaew, 2017). As in the case of enzybiotics, the number of unique combinations is limited. On the other hand, if the feasible set is uncountably infinite, it is a continuous optimisation problem (Nocedal and Wright, 2006).

Additionally, the search space $S$ can be constrained, which prevents the optimisation algorithm from returning an infeasible solution that cannot be used in reality. There are two main types of constraints: inequality constraints and equality constraints. For example, when optimising the operating conditions of a bioreactor there are boundary conditions for temperature, pressure, etc. that can be handled by the system. It is important that the obtained results are within these boundaries. An equality constraint can be used when optimising the concentrations of different products at equilibrium; the obtained solution should meet the mass balance of the system. Generally, a constrained optimisation problem is written as (Kochenderfer and Wheeler, 2019; Yang, 2018):

$$\begin{aligned} \max_{\mathbf{x}} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_j(\mathbf{x}) \le 0 \quad \text{with } j = 1, 2, \cdots, m \\ & h_i(\mathbf{x}) = 0 \quad \text{with } i = 1, 2, \cdots, l \end{aligned} \tag{2.4}$$

Finally, an important characteristic of optimisation problems is whether the objective is deterministic or stochastic. In deterministic optimisation, all objective parameters are fixed. As a consequence, the output of the algorithm is only determined by the initial starting point. For a stochastic objective function, some variables are set after sampling from an underlying distribution, which introduces randomness into the model. Therefore, using the same initial starting point could result in a different solution over multiple repeats (Nocedal and Wright, 2006; Yang, 2018).

2.3 Bayesian optimisation

2.3.1 Bayesian optimisation: example

BO is a black-box method that tries to find the optimum $\mathbf{x}^*$ of an unknown objective function $f(\mathbf{x})$. To improve our knowledge of $f(\mathbf{x})$, data points are sampled sequentially from $f(\mathbf{x})$ and evaluated afterwards. A BO framework selects the newly sampled data points with a specific strategy, as given in pseudo-code in Algorithm 2.1. The idea of BO is to use a probabilistic surrogate model that captures all known information on $f(\mathbf{x})$ and is used to select the next best data point. Based on this new data point, containing new information on $f(\mathbf{x})$, the model is updated. An illustration of BO is given in Figure 2.3.

Algorithm 2.1: Classic Bayesian optimisation

# design space
S = collect(0:0.05:7)   # a uniform grid over [0, 7]
# number of starting points
n = 3
# number of repetitions
R = 5
# sample the first data points
x_n = firstSampleStep(S, n)
# evaluate the first data points
y = f.(x_n)
# perform R cycles of (re)modelling and sampling
for i in 1:R
    # fit a surrogate model to the given data
    f̂ = surrogateModel(copy(x_n), y)
    # use the model to obtain the acquisition function
    acquisition_response = acquisitionFunction(f̂, S, x_n, y)
    # optimise the acquisition function
    x_next, acquisition_max = acquisitionOptimisation(acquisition_response, S)
    # add the new data point to the other data points
    push!(x_n, x_next)
    push!(y, f(x_next))
end
# obtain x_star, the best data point found
y_max, index_max = findmax(y)
x_star = x_n[index_max]

The functions firstSampleStep, surrogateModel, acquisitionFunction and acquisitionOptimisation represent the four important steps in the BO process. For all four functions, multiple options exist and need to be chosen by the user. In the remainder of this work, all steps are explained, together with adaptations that make them fit the modular design optimisation problem.
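As an illustration only, the following hypothetical stand-ins (none of them part of BOMoD.jl) make Algorithm 2.1 runnable end-to-end; realistic choices for every step are discussed in the remainder of this chapter:

f(x) = sin(2x) - cos(x)                               # the (normally unknown) objective
firstSampleStep(S, n) = rand(S, n)                    # random first sample (repeats possible)
surrogateModel(x, y) = z -> y[argmin(abs.(x .- z))]   # nearest-neighbour stand-in for a GP
acquisitionFunction(f̂, S, x, y) = [f̂(z) for z in S]   # pure exploitation of the model
acquisitionOptimisation(resp, S) = S[argmax(resp)], maximum(resp)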


Figure 2.3: Example of the BO framework
Evaluation of f(x) = sin(2x) − cos(x) with x ∈ [0, 7]. Iterations one to five are shown. In every plot the grey line is the true objective function, which is normally unknown. The dark blue line indicates the predictions made by a probabilistic surrogate model. The light blue area indicates the uncertainty on these predictions; larger blue areas indicate that the model is more uncertain about the given predictions. The green dots and the red cross are the sampled data points used to fit the model. The red cross is the last sampled data point, obtained from the previous iteration. The green line is the acquisition function made by using the model above it. The maximum of this function is indicated with a red triangle; this is the next data point used to update the model.


2.3.2 First sampling step

First sampling

Before sampling can start, the search space S needs to be constructed. For modular design problems, S is referred to as a design space because it contains all possible designs. Constructing this design space is done using the following steps (Gujral et al., 2018), sketched in code after the list:

1. Select the objective that needs to be evaluated and optimised.

2. Determine the variables of the problem. For the modular design optimisation problem this means: selecting the different modules, selecting the allowed size of the constructs, and determining whether replacement is allowed and whether there is an order effect.

3. Determine the possible constraints.

4. Define a well-suited response variable that allows the evaluation of different constructs.
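A minimal sketch of steps 2 and 3 for an ordered design without replacement; the module names, the construct size and the use of Combinatorics.jl are illustrative assumptions, not the BOMoD.jl interface (Chapter 3):

using Combinatorics  # provides permutations

modules = ["EAD1", "EAD2", "CBD1", "CBD2"]   # hypothetical module labels
k = 3                                        # allowed construct size
# order important, without replacement: 4!/(4−3)! = 24 allowed constructs
design_space = [join(p, "-") for p in permutations(modules, k)]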

When the design space is set, a first sampling step can be performed. The goal is to find which points provide the most information about the search space within the budget. First, sampling methods are based on the principle of a proper design of experiment (DoE). In DoE, the main idea is to explore the design space (Jocqué et al., 2019), because there is almost no prior information available on the objective function. DoE generally has one sampling step resulting in a batch containing n data points limited by the budget. Afterwards, these n data points are evaluated.

In contrast to DoE, BO uses a sequential sampling scheme that allows for the incorporation of the new information of the preceding n data points. In theory, BO could start with only one random data point. Still, in most applications, BO starts with a small data set, because the surrogate model needs some minimal data points to give useful predictions.

Closely related to BO is the field of active learning. As in BO, only a small start set is used to construct the machine learning model. New samples are proposed based on the performance of the model. The significant difference between BO and active learning is that the sampled data points are chosen to improve the prediction capacities of the model instead of looking for the optima of a function (Brochu et al., 2010). Different sampling schemes are possible. A commonly used sampling strategy is the random sampling approach. An oracle generates n different constructs with $\mathbf{x}_i \in S$. In this approach, the probability of sampling every construct is equal, but repetition of the same construct is avoided (Chauvet and Guillaume, 2020). This method is often used mainly due to its simplicity, because it avoids long calculations of different distances between all samples in S.

In many other sampling strategies, the distance between samples is an important parameter. This distance represents how dissimilar two variants are and allows maximising the covered part of the design space. Calculating distances between coordinates in a Cartesian coordinate system is straightforward based on the different $L_p$ norms; the $L_2$ or Euclidean distance is used most often. The distance between sets or between strings is more complex and is handled in Section 2.3.3. Based on a given string distance, the most space-filling subset can be obtained (Kochenderfer and Wheeler, 2019). Unfortunately, the exact calculation of the most space-filling subset requires a lot of computational resources. Therefore, heuristic approaches are used to get a relatively good subset within a limited time frame. In this work, a random sampling step is used to avoid the long calculations of heuristic optimisation algorithms.
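A minimal sketch of such a random first sampling step, assuming StatsBase.jl and the illustrative design_space vector from the sketch above:

using StatsBase

# every construct has equal sampling probability; replace = false avoids repeats
x_start = sample(design_space, 3; replace = false)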

2.3.3 Surrogate model: Gaussian Process

Surrogate models

After the data sampling, the selected constructs are evaluated: $\mathcal{D}_{1:n} = \{\mathbf{x}_{1:n}, f(\mathbf{x}_{1:n})\}$, where $\mathbf{x}_i$ is the $i$-th sample corresponding to the observation value $f(\mathbf{x}_i)$. Based on these data points, a surrogate model $\hat{y} = \hat{f}(\mathbf{x})$ is constructed that mimics the true unknown objective function $f(\mathbf{x})$. In BO this surrogate model is a probabilistic model, which implies that not only the value $\hat{y}$ is estimated, but a complete distribution is obtained for every point in the design space. The main advantage of these distributions is that the uncertainties of the predictions are estimated (Brochu et al., 2010).

The choice of a surrogate model has significant consequences in the BO pipeline. Different models are known and, regarding the characteristics of the problem and the given data, the best model is selected (Shahriari et al., 2016). A simple model is Bayesian linear regression with a Gaussian prior on the weights, with mean vector $\mu$ and covariance matrix $\Sigma$:

$$Y = X^T \beta \tag{2.5}$$
$$\text{with } \beta \sim \mathcal{N}(\mu, \Sigma) . \tag{2.6}$$

Because the application of this model is limited, the Bayesian linear regression model is extended to a Gaussian Process (GP) (Rasmussen and Williams, 2006) by:

$$Y = \phi(X)^T \beta \tag{2.7}$$

where $\phi(X)$ is a feature mapping function that gives GP models the possibility to adapt to different optimisation problems (Shahriari et al., 2016). A more in-depth discussion is given below.

Besides their flexibility, GP models have additional advantages. GP models use a well-defined Gaussian distribution, resulting in a well-defined mathematical expression, and all solutions are analytically tractable, which avoids the need for sampling algorithms (Rasmussen and Williams, 2006). Additionally, GPs can be fitted on a small initial data set, and good software is available to fit GP models efficiently (Willtebbutt, 2020). Finally, GPs are commonly used in a BO framework and are proven to yield good results (Brochu et al., 2010). Given these arguments, GPs are chosen as the surrogate model to avoid a rigorous model evaluation procedure.

The biggest drawbacks of using GPs are the sensitivity towards outliers and the inefficiency on big data sets, given the polynomial time complexity of the algorithm. Alternatives to GPs are Student's-t processes, which use the Student's-t distribution and are less sensitive to outliers (Tracey and Wolpert, 2018), or neural processes, which handle big data sets better (Garnelo et al., 2018).

Gaussian Processes

Because a GP model forms the core of the proposed BO pipeline, a small introduction is given and adaptations towards modular designs are proposed. To start, an alternative definition of Eq. 2.7 is given.

Definition from Rasmussen and Williams (2006):

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution

Before going into the mathematics behind GPs, an intuitive example is given in Figure 2.4, which models f(x) = sin(2x) − cos(x), represented by the yellow dashed line in the plot. In a real-world application, this function is unknown and needs to be approximated based on the data points sampled from it. In this example, three data points are randomly sampled and plotted as green dots. To simplify the example, no noise was added. A GP model based on these three points is constructed. In short, a Gaussian distribution is calculated for every point x in [0, 9]. For every point, a mean μ and standard deviation σ are estimated. The predicted means form the dark blue line in Figure 2.4. The blue area is the mean plus or minus twice the standard deviation. Because there is a distribution for every data point, random functions can be sampled. The three red dotted lines illustrate this property and were all randomly sampled from the GP. This indicates the flexibility of a GP model.


Figure 2.4: An example of a GP model
A GP with mean prior function equal to 0 and a squared exponential kernel with hyperparameters ℓ = 1 and σn² = 10⁻⁶. The bold blue line is the mean predicted value from the GP model for every x. The colored band represents the uncertainty on the predictions. The red dotted lines are random samples from the GP model. The yellow line equals f(x) = sin(2x) − cos(x). The green dots are the samples used to fit the GP.

Gaussian Process: mathematics

There are different ways to approach GPs: the weight-space view and the function-space view. Both are equivalent and give the same results. In the context of BO the function-space approach is more intuitive and is discussed below (Rasmussen and Williams, 2006).

Formally a GP is written as:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(\mu(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right) . \tag{2.9}$$

The formula indicates that a GP has only two function arguments, $\mu(\mathbf{x})$ and $k(\mathbf{x}, \mathbf{x}')$, that completely describe the process. These are respectively the mean and the covariance function. If $\mathbf{x}$ and $\mathbf{x}'$ are two input vectors, then these functions are given as (Rasmussen and Williams, 2006):

$$\mu(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] \tag{2.10}$$
$$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\left[(f(\mathbf{x}) - \mu(\mathbf{x}))(f(\mathbf{x}') - \mu(\mathbf{x}'))\right] . \tag{2.11}$$

The goal of using a GP is to predict the objective value $\hat{y}$ for all $t$ unseen data points $\mathbf{x} \in S \setminus \mathbf{x}_{1:n}$, commonly referred to as test data. The model is fitted on the given training data $\mathcal{D}_{1:n} = \{\mathbf{x}_{1:n}, f(\mathbf{x}_{1:n})\}$. In the remainder of this work, all vectors $\mathbf{x}_{1:n}$ are aggregated in a design matrix $X$ with dimensions $p \times n$, where $p$ is the dimension of a single input point and $n$ is the number of training data points. The vectors $\mathbf{x}_{n+1:n+t}$ are aggregated in the design matrix $X'$ with dimensions $p \times t$. Predictions are made using the joint distribution:

$$\begin{bmatrix} \mathbf{y} \\ \hat{\mathbf{y}} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(X) \\ \mu(X') \end{bmatrix}, \begin{bmatrix} K(X, X) & K(X, X') \\ K(X', X) & K(X', X') \end{bmatrix} \right) \tag{2.12}$$

$\mu(X)$ evaluates the mean function for every value $\mathbf{x}$ in $X$:

$$\mu(X) = [\mu(\mathbf{x}_1), \cdots, \mu(\mathbf{x}_n)] \tag{2.13}$$

$K(X, X)$ evaluates the covariance function $k(\mathbf{x}_i, \mathbf{x}_j)$ between all data points in $X$:

$$K(X, X) = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{bmatrix} \tag{2.14}$$


The conditional distribution is then given as:

$$\hat{\mathbf{y}} \mid \mathbf{y}, X, X' \sim \mathcal{N}(\mu_{con}, \Sigma_{con}) \tag{2.15}$$
$$\mu_{con} = K(X', X)\,K(X, X)^{-1}\mathbf{y} \tag{2.16}$$
$$\Sigma_{con} = K(X', X') - K(X', X)\,K(X, X)^{-1}K(X, X') \tag{2.17}$$

In most real-world modelling applications, the training data $\mathbf{y}$ contains some noise $\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$. As a result, the observations $\mathbf{y}$ are not directly sampled from $f(\mathbf{x})$ but are given as $y = f(\mathbf{x}) + \varepsilon$. Given that the noise is independent of $\mathbf{x}$, the joint distribution of Eq. 2.12 with noise is:

$$\begin{bmatrix} \mathbf{y} \\ \hat{\mathbf{y}} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(X) \\ \mu(X') \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X') \\ K(X', X) & K(X', X') \end{bmatrix} \right) , \tag{2.18}$$

where $I$ is the identity matrix with dimensions $n \times n$. The conditional distribution is then given as:

$$\hat{\mathbf{y}} \mid \mathbf{y}, X, X' \sim \mathcal{N}(\mu_{con}, \Sigma_{con}) \tag{2.19}$$
$$\mu_{con} = K(X', X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y} \tag{2.20}$$
$$\Sigma_{con} = K(X', X') - K(X', X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}K(X, X') \tag{2.21}$$
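A minimal Julia sketch of Eqs. 2.19-2.21, assuming a zero prior mean as in the equations above; gp_posterior and se are illustrative names, not the package API:

using LinearAlgebra

# k: kernel, X: p × n training inputs, Xs: p × t test inputs, y: n noisy observations
function gp_posterior(k, X, y, Xs; σn2 = 1e-6)
    n, t = size(X, 2), size(Xs, 2)
    K   = [k(X[:, i],  X[:, j])  for i in 1:n, j in 1:n]   # K(X, X)
    Ks  = [k(Xs[:, i], X[:, j])  for i in 1:t, j in 1:n]   # K(X′, X)
    Kss = [k(Xs[:, i], Xs[:, j]) for i in 1:t, j in 1:t]   # K(X′, X′)
    A = K + σn2 * I                                        # K(X, X) + σn²I
    μ = Ks * (A \ y)                                       # Eq. 2.20
    Σ = Kss - Ks * (A \ Ks')                               # Eq. 2.21
    return μ, Σ
end

# toy usage with the squared exponential kernel from Table 2.2 (ℓ = σ = 1)
se(x, x′) = exp(-norm(x - x′)^2 / 2)
X  = reshape([1.0, 2.5, 4.0], 1, 3)          # three 1-D training points
y  = sin.(2 .* vec(X)) .- cos.(vec(X))       # noise-free toy observations
Xs = reshape(collect(0.0:0.5:7.0), 1, 15)    # test grid
μ, Σ = gp_posterior(se, X, y, Xs)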

Covariance functions

As seen above, GPs are defined by two functions: the mean function $\mu(\cdot)$ and the covariance function $k(\cdot, \cdot)$. While $\mu(\cdot)$ is in most cases zero or a constant value, the covariance function is more complicated and provides the GP with its flexibility. To use a GP in a modular design problem, the covariance function needs to be adapted. The covariance function $k(\mathbf{x}, \mathbf{x}')$ is referred to as the kernel of the GP. Different kernels exist, all with their specific properties. The general idea of a kernel is that $\text{Cov}(f(\mathbf{x}), f(\mathbf{x}')) = k(\mathbf{x}, \mathbf{x}')$, which means that the covariance between the values $f(\mathbf{x})$ and $f(\mathbf{x}')$ can be calculated based only on the feature representations $\mathbf{x}$ and $\mathbf{x}'$. In practice, this kernel gives the similarity between the two vectors $\mathbf{x}$ and $\mathbf{x}'$ of the input space, which is calculated by the inner product $\langle \mathbf{x}, \mathbf{x}' \rangle$. For non-linear problems, this representation can be improved by using an unknown function $\phi(\cdot)$ that maps the feature vectors $\mathbf{x}$ and $\mathbf{x}'$ to a higher-dimensional feature space, referred to as a Hilbert space. That $\phi(\cdot)$ is unknown is not a problem, because only the inner product needs to be calculated: $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = k(\mathbf{x}, \mathbf{x}')$ (Rasmussen and Williams, 2006).
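A small numerical check of this idea (an illustrative sketch, not from the dissertation): for the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}')^2$, the explicit map $\phi(\mathbf{x}) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$ gives the same inner product:

using LinearAlgebra

φ(x) = [x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2]   # explicit feature map
x, x′ = [1.0, 2.0], [3.0, 0.5]
dot(φ(x), φ(x′)) ≈ dot(x, x′)^2                  # true: k is evaluated without φ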

A nice property of this inner product is that it can be calculated without evaluating $\phi(\cdot)$, which is described in the literature as the kernel trick (Rasmussen and Williams, 2006; Van Hauwermeiren et al., 2020; Yu et al., 2011). The kernel trick is valid as long as $k(\cdot, \cdot)$ is a symmetric function that is positive definite. Various types of kernels exist, e.g., isotropic stationary kernels and non-stationary kernels. The most commonly known kernels are isotropic stationary kernels, which are suited for continuous data input. These kernels are defined in terms of $|\mathbf{x} - \mathbf{x}'|$, which hints at the fact that kernels can be interpreted as some similarity between two vectors. Some commonly used isotropic stationary kernels are given in Table 2.2 and some well-known non-stationary kernels in Table 2.3, with $r = |\mathbf{x} - \mathbf{x}'|$ and $\bar{\mathbf{x}} = [1, x_1, \ldots, x_n]^T$ an augmented input vector. Figure 2.5 illustrates some examples of different kernels used in a one-dimensional GP. Most kernels contain additional input arguments besides the two input vectors. These are the hyperparameter(s) of the kernels and influence different properties of the proposed model. For example, the parameter $\ell$ of the squared exponential kernel influences the smoothness of the curve, as illustrated in Figure 2.6. These hyperparameters are discussed in more detail below.

Based on the sum or product of two kernels, a new kernel can be obtained which combines the properties of both. These kernels are known as composed kernels. An illustration is given in Figure 2.7.
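To make this concrete, the sketch below defines a few basic one-dimensional kernels in plain Julia and combines them by summing and multiplying, mirroring the compositions in Figure 2.7 (all hyperparameters fixed to 1; the function names are illustrative):

```julia
# Basic one-dimensional kernels with all hyperparameters set to 1.
k_lin(x, x′) = x * x′                      # linear
k_se(x, x′)  = exp(-(x - x′)^2 / 2)        # squared exponential
k_per(x, x′) = exp(-2sin(π * (x - x′))^2)  # periodic (period p = 1)

# Sums and products of valid kernels are again valid kernels:
k_sum(x, x′)  = k_lin(x, x′) + k_se(x, x′)   # linear + squared exponential
k_prod(x, x′) = k_lin(x, x′) * k_se(x, x′)   # linear · squared exponential
```

Evaluating, e.g., `k_sum(0.5, -1.0)` returns the covariance that a GP with the composed kernel assigns to that pair of inputs.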

Table 2.2: Common stationary kernels

name | formula | hyperparameters
Squared Exponential | $\sigma^2 \exp\left(-\frac{r^2}{2\ell^2}\right)$ | $\ell, \sigma$
Matérn(3/2) | $\sigma^2 \left(1 + \frac{\sqrt{3}\,r}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,r}{\ell}\right)$ | $\ell, \sigma$
Rational Quadratic | $\sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}$ | $\ell, \alpha, \sigma$
Periodic | $\sigma^2 \exp\left(-\frac{2\sin^2(\pi r / p)}{\ell^2}\right)$ | $\ell, \sigma, p$

Table 2.3: Common non-stationary kernels

name | formula | hyperparameters
Neural network | $\frac{2}{\pi} \sin^{-1}\left(\frac{2\,\bar{x}^T \Sigma \bar{x}'}{\sqrt{(1 + 2\,\bar{x}^T \Sigma \bar{x})(1 + 2\,\bar{x}'^T \Sigma \bar{x}')}}\right)$ | $\Sigma$
Linear | $\sigma_0^2 + x \cdot x'$ | $\sigma_0$

Hyperparameter optimisation

Hyperparameters give flexibility to a given kernel. Still, for every problem an optimal parameter combination needs to be obtained. The parameters can be set by the user, but in most cases the optimal parameters are derived from the available data. The process of selecting the best kernel with the optimal hyperparameters is known in machine learning as model selection. Multiple strategies exist to accomplish this. To illustrate the effect of optimal and sub-optimal hyperparameter combinations, three GPs are modelled in Figure 2.8, where only the hyperparameter ℓ is changed, affecting the uncertainty of the model. Two model evaluation techniques are discussed below: cross-validation and Bayesian model selection.


Figure 2.5: Examples of various kernels

Five random samples from a GP with, from left to right: the linear kernel, the squared exponential kernel and the periodic kernel. All hyperparameters are set to 1 and σₙ² = 10⁻⁶.

Figure 2.6: Effect of the hyperparameter of a kernel

Five random samples from a GP with a squared exponential kernel, with a different hyperparameter ℓ from left to right: ℓ = [0.5, 1, 3]; σₙ² = 10⁻⁶.



Figure 2.7: Examples of composed kernels

Five random samples from a GP with a composition of basic kernels, from left to right: linear + squared exponential, linear · squared exponential, squared exponential · periodic + linear. All hyperparameters are set to 1 and σₙ² = 10⁻⁶.


Figure 2.8: Effect of the hyperparameter on the uncertainty of GPs

The same GP model as in Figure 2.4 was used; only the hyperparameter ℓ was changed between the different plots. From left to right, ℓ = [2, 0.692, 1/3]. The GP model in the central panel has an optimal ℓ based on the maximum marginal likelihood.

Cross-validation

A basic introduction is given here; a more elaborate discussion can be found elsewhere (Gareth James and Tibshirani, 2013). To decide which configuration of kernels and hyperparameters is superior, the model performance needs to be quantified. This is done by an evaluation metric. Multiple metrics are available; some need to be minimised, others maximised. A standard metric for regression is the Mean Squared Error (MSE), which, calculated for m data points, gives:

\epsilon_{\mathrm{CV}} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - \hat{f}(x_i) \right)^2.   (2.22)

The MSE needs to be minimised to obtain the best model. The training set is the part of the available data that is used to fit the model. If, for every kernel, the entire data set is used as training set and the MSE is utilised to select the best one, the model will be prone to overfitting. In this case the best kernel and corresponding hyperparameter(s) will have a low MSE on the training set, but the performance will drop steeply if the model is tested on unseen data.

Cross-validation tries to avoid this by splitting the data into a training set and a test set. In standard machine learning models, k-fold cross-validation is used. In this technique, the training set is randomly divided into k equal parts. For every combination of a given kernel and corresponding hyperparameter(s), k models are computed: the model is fitted on k − 1 parts of the data and the evaluation metric is calculated on the withheld part. This is repeated until all k parts have been used as the withheld set. To compare different model configurations, the mean value over the k parts is calculated, as in the sketch below.
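The following Julia sketch is a generic k-fold loop; `fit` and `predict` are hypothetical placeholders standing in for fitting a GP with one kernel/hyperparameter configuration and evaluating it:

```julia
using Random, Statistics

# k-fold cross-validated MSE (Eq. 2.22) for one model configuration.
# `fit(X, y)` and `predict(model, X)` are assumed, hypothetical helpers.
function kfold_mse(fit, predict, X, y; k = 5, rng = Random.default_rng())
    idx   = shuffle(rng, 1:length(y))    # random permutation of the data
    folds = [idx[i:k:end] for i in 1:k]  # k (near-)equal parts
    scores = map(folds) do test
        train = setdiff(idx, test)       # fit on the other k − 1 parts
        model = fit(X[train], y[train])
        ŷ = predict(model, X[test])
        mean((y[test] .- ŷ) .^ 2)        # MSE on the withheld part
    end
    return mean(scores)                  # mean over the k folds
end
```

The configuration with the lowest returned value would then be selected.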

Still, the given method has some disadvantages, and therefore this strategy is not preferred in the remainder of this work. The first disadvantage is that not all available data is used to fit the model, while in the first iterations of the BO algorithm only a small fraction of data points is known. Secondly, this technique requires much computational time because multiple models have to be constructed. Thirdly, in the case of modular designs the model has to predict the interaction effects between different modules; the standard cross-validation scheme is not always reliable here, and more accurate cross-validation schemes were developed to evaluate the performance of the given models (Daly et al., 2019; Park and Marcotte, 2012; Stock et al., 2018).

Bayesian Model Selection

Models can be compared in a Bayesian way (Barber, 2012). Using Bayes' rule, the posterior of the model $M_i$, given the data set $\mathcal{D}_{1:n} = \{x_{1:n}, f(x_{1:n})\} = \{X, y\}$, is:

p(M_i \mid X, y) = \frac{p(y \mid M_i, X)\, p(M_i)}{p(y \mid X)},   (2.23)

where $p(M_i \mid X, y)$ is the posterior, $p(y \mid M_i, X)$ the likelihood, $p(M_i)$ the prior and $p(y \mid X)$ the marginal likelihood.

The best model configuration is obtained after maximisation of the posterior.

M_{\mathrm{MAP}} = \underset{M_i}{\operatorname{argmax}}\; p(M_i \mid X, y).   (2.24)

Direct calculation of Eq. 2.24 is not trivial in most cases. However, a full model evaluation is not required; only the evaluation of models with various hyperparameters is needed. In this context, the marginal maximum likelihood, also known as type II maximum likelihood (ML-II), can be used to obtain good hyperparameters; it exploits the hierarchical structure of the GP model. The hierarchical idea is that the model is built upon different inference levels. At the top level, there is the structure of the model M_i. At the second level, it is assumed that the true model structure M_i is known and the hyperparameters θ are estimated based on the data. At the bottom level, assuming that all the levels above are true, the parameters β of the model can be determined using the given data. More information on this hierarchical structure is given by Rasmussen and Williams (2006). As a result, at the bottom level of inference, Bayes' rule is:

p(\beta \mid X, y, \theta, M_i) = \frac{p(y \mid X, \beta, M_i)\, p(\beta \mid \theta, M_i)}{p(y \mid X, \theta, M_i)}.   (2.25)

The marginal likelihood of the bottom level is maximised to obtain a good estimate of the hyperparameters for the given GP model. The marginal likelihood is obtained after evaluation of:

p(y|X, θ, M,) = Z

p(y|X, β, M)p(β|θ, M)dβ . (2.26) In most cases, the log marginal likelihood is maximised to facilitate the optimisation procedure. Given the Gaussian prior used in the GP model with a noise factor σn2 an analytic solution exists:

\log p(y \mid X, \theta, M_i) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\left| K_\theta(X, X) + \sigma_n^2 I \right| - \frac{1}{2}\left(y - \mu(X)\right)^T \left( K_\theta(X, X) + \sigma_n^2 I \right)^{-1} \left(y - \mu(X)\right)   (2.27)

The formula has three terms. The first is a normalisation constant, independent of the model and the chosen parameters. The second term penalises models that are too complex. The final term uses the observed data points to quantify how well the model fits the data. These three terms show that the marginal likelihood provides a trade-off between a good fit and the complexity of the model. This is a general property of the marginal likelihood, known as the Occam's razor effect (MacKay, 1992). Other approximations to evaluate Eq. 2.23 exist but are not discussed in detail; e.g., the Laplace method can be applied, or sampling methods like Markov chain Monte Carlo (MCMC) can be used (Rasmussen and Williams, 2006).
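Eq. 2.27 is cheap to evaluate once the Gram matrix is built. Below is a minimal Julia sketch, assuming a zero mean function and using a Cholesky factorisation for the determinant and the solve:

```julia
using LinearAlgebra

# Log marginal likelihood of a zero-mean GP (Eq. 2.27).
# Kθ is the Gram matrix K_θ(X, X) for one hyperparameter setting θ.
function log_marginal_likelihood(Kθ, y; σₙ² = 1e-6)
    n = length(y)
    C = cholesky(Symmetric(Kθ + σₙ² * I))  # K_θ(X, X) + σₙ²I
    logdetK = 2sum(log, diag(C.L))         # log|K_θ(X, X) + σₙ²I|
    α = C \ y                              # (K_θ(X, X) + σₙ²I)⁻¹ y
    return -n/2 * log(2π) - logdetK/2 - dot(y, α)/2
end
```

Maximising this value over θ, e.g., with a gradient-based optimiser, yields the ML-II estimate of the hyperparameters, as used for the central panel of Figure 2.8.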

Discrete GP

Above, a general introduction to GPs was given. To apply a GP model to a combinatorial problem, one has to take the discrete nature of the problem into account. The model can only be evaluated at certain discrete values of X = [x₁, …, xₙ] with corresponding y = [y₁, …, yₙ] (Fortuin et al., 2018). The posterior of Eq. 2.18 can still be calculated for all x ∈ S. Also, the maximal marginal likelihood can be used for the optimisation of the hyperparameters. The major difference is that alternative kernels are required because the input data is a vector of strings, representing the chosen modules.
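As a purely illustrative example of such an alternative kernel (an assumption for exposition, not necessarily the kernel used in BOMoD.jl), a simple positive definite similarity between two equal-length module sequences can be built from the number of mismatching positions:

```julia
# Illustrative kernel on equal-length module sequences (vectors of strings):
# an exponentiated Hamming similarity with length scale ℓ.
function k_modules(x::Vector{String}, x′::Vector{String}; ℓ = 1.0)
    mismatches = count(((a, b),) -> a != b, zip(x, x′))
    return exp(-mismatches / ℓ)
end

k_modules(["A", "B", "C"], ["A", "B", "D"])  # two of the three modules match
```

Because it factorises over positions, this function is a valid (positive definite) kernel, so the GP machinery above carries over unchanged to the discrete setting.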
