
GABED: A Genetic Agent-Based framework for solving individual rules in Emergent phenomena Data


Academic year: 2021




Diepgrond, Erwin

Lees, Michael

March 31, 2020


Contents

0.1 Abstract . . . 3

1 Introduction 4

2 Literature review 6
2.1 Agent-based modeling . . . 6
2.1.1 Agent-based models . . . 6
2.1.2 Classic Agent-based models . . . 6
2.1.3 Seminal agent-based models . . . 7
2.2 Evolutionary Algorithms . . . 7
2.2.1 Evolutionary Systems . . . 7
2.2.2 Genetic Algorithms . . . 7
2.2.3 Genetic Programming . . . 8
2.3 Genetic Programming applied to agent-based models . . . 8
2.4 Research challenge . . . 9
2.4.1 Challenges for GP and ABM . . . 9
2.4.2 Contribution of the thesis . . . 10

3 Framework and Implementation 12
3.1 Framework repository . . . 12
3.2 Framework . . . 12
3.2.1 Parameters . . . 13
3.2.2 Workflow . . . 13
3.3 Implementation . . . 14
3.3.1 Changes to the packages . . . 15
3.3.2 Fitness . . . 15
3.3.3 Data retention . . . 16

4 Simple Hill-climbing: The effects of discontinuous space and agent-density 18
4.1 Discontinuous space and model quality . . . 18
4.1.1 Introduction . . . 18
4.1.2 Model . . . 18
4.1.3 Method . . . 19
4.1.5 Discussion . . . 23
4.2 Exploring the effects of agent density . . . 26
4.2.1 Introduction . . . 26
4.2.2 Model changes . . . 27
4.2.3 Method . . . 27
4.2.4 Results . . . 30
4.2.5 Discussion . . . 30

5 Vicsek Model: Fitting in continuous space with discontinuous functions 37
5.1 Introduction . . . 37
5.2 Model . . . 38
5.3 Method . . . 38
5.4 Results . . . 39
5.5 Discussion . . . 40

6 Sugarscape: Analysing stochastic influences on model identification 42
6.1 Introduction . . . 42
6.2 Model . . . 42
6.3 Method . . . 43
6.3.1 Comparing sugar runs: Hellinger distance . . . 44
6.4 Results . . . 48
6.5 Discussion . . . 56


0.1 Abstract

Since the dawn of computing, researchers have used computers to simulate and explain phenomena. Many of the developed methods rely on an understanding of the system to develop models and implement the simulations. This is also true for agent-based systems. Although some papers have proposed and implemented robust data-driven methods to infer models and simulations, most of these methods are specialized. In this thesis, we propose a new genetic agent-based system to infer models of emergent phenomena. A few well-known, basic models are used to demonstrate this new framework, called GABED, and to discuss its benefits as well as the downsides and pitfalls of such a method.


Chapter 1

Introduction

Agent-based systems are a popular modeling method in many fields of science, as well as an approach for analyzing complex systems [Jennings, 2001, Crespi et al., 2008, Kida et al., 2007, Keller and Hu, 2016, Keller and Hu, 2019]. The most common approach to agent-based system design, especially for complex systems, is the bottom-up approach, where the agents' behavior is defined from individual behavioral knowledge. The problem with the bottom-up approach is that an understanding of the process is needed to make an initial guess at what the individual behavior should be. From there on, constant refinement of the agent rules is needed: running the simulation of agents and comparing the results to some real-world data. This heavily hands-on approach is labor-intensive and prone to assumptions and over-fitting.

Emergent phenomena are complex phenomena which, as the name implies, emerge from parts that are often simple in nature. A well-known expression that applies is "The whole is greater than the sum of its parts". When observing emergent phenomena, the main goal is to deduce which parts make up the phenomenon, but this is often a challenging task. To understand emergent phenomena and their underlying dynamics, an understanding of the second level, as defined by Darley, is needed [Darley, 1994].

In emergent phenomena systems, a subset of complex systems, the difficulty rises sharply. For example, in crowd dynamics many social factors play a role in how individuals behave, as well as their interaction with the immediate surroundings. Complex systems have so many interacting components that creating the system by hand is a near-impossible task. To solve this problem, data-driven methods have been developed to automate the process of identifying and creating rules for agent-based systems [Lee et al., 2007, Zhong et al., 2014]. Lee et al. show how to use aerial video data of crowds to infer agent rules using locally weighted linear regression.

Genetic programming, a subcategory of evolutionary techniques, is a popular technique for fitting equations. Using so-called 'nodes' as building blocks, many types of problems can be fitted and solved. It has been demonstrated that this technique also works for agent-based systems [Keijzer et al., 2004, Zhong et al., 2014]. Zhong et al. show that a genetic programming algorithm can evolve a function to evaluate each possible choice per step, using crowd-based models in an agent-based system.

While genetic programming and agent-based models have been used to solve crowd dynamics, the combination has not yet been applied to emergent phenomena in general [Zhong et al., 2017, Zhong et al., 2015]. In this thesis a framework is proposed that uses genetic programming to create, fit and refine equations for agent-based models, fitting a dataset of observations of generic emergent phenomena. The method uses agent-based modeling (ABM), in which the behavior of the agents is set by an equation determined in the evolutionary process. The ABM thus generates a dataset which can then be compared to the observations of the original phenomenon, resulting in a metric of difference.

For this purpose the GABED framework has been written: the Genetic Agent-Based inference of Emergent Data toolkit. Python is used for its ease of use, expandability, packages and widespread usage in distributed computing. For the evolutionary component the package DEAP [Gagné, 2012] is used, which provides the base framework for genetic programming as well as distributed evaluations. The problems are formulated as agent-based problems, so an agent-based system is needed to evaluate the genomes. For this purpose Mesa [Masad and Kazil, 2018] is used, which provides base classes for agents, models and space. The most important aspect of Mesa is its widespread usage in agent-based modeling, since an agent-based model has to be made for each problem. Mesa makes the creation of an agent-based model very user friendly.

In this framework, 'individuals' is used to describe unique genomes in the evolutionary population. The agents in the agent-based system will be referred to as agents.

This thesis aims to build such a framework and to investigate various challenges regarding it.


Chapter 2

Literature review

2.1 Agent-based modeling

2.1.1 Agent-based models

When using agent-based models, one decomposes a phenomenon into smaller interacting components. The components, agents, are autonomous entities that often perform simple actions (e.g. moving, eating, trading, etc.) and form a complex whole through their interactions [Gilbert, 2011].

Agent-based systems are made up of individuals called agents. The agents follow simple rules to interact with the world they are in. The world can be anything from a simple graph to a complex 3D world. By following these simple rules, agents can exhibit complex behaviors, as in the classic ant model. Many ABMs aim to simulate social and economic situations and to model human interactions in order to recreate the complex systems that arise in our daily lives.

The agents in a system are of course not limited to one type, the homogeneous agent-based system. Many systems incorporate different types of agents. Some systems also have agents that are capable of learning, can adapt to their environment or can leave behind offspring.

2.1.2 Classic Agent-based models

Some well-known early agent-based models are Sugarscape and Schelling's model of segregation [Epstein and Axtell, 1996, Schelling, 1971]. Both models show complex behavior formed from simple rules. Sugarscape in particular is well known and has been heavily extended to incorporate many features and behaviors, up to simulating complex economic principles. Both models stay true to the central ideas of agent-based models and demonstrate complexity and emergence.

Sugarscape [Epstein and Axtell, 1996, Castellano et al., 2009] is a well-known modular complex artificial society created in 1996 by Epstein and Axtell. Agents live in a world filled with a limited amount of 'sugar'. Sugar is the world's currency and agents require it to survive. Agents can also store surplus sugar and thus accumulate wealth. The agents' rules are initially quite simple: move towards the highest sugar amount. Each agent has a few initial attributes which are randomly generated upon initialization: the agent's view range, its sugar consumption per tick, its maximum age and its initial wealth. In its most basic form, the model shows the emergent phenomenon of economic inequality, where rich agents become richer and poor agents stay poor.
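The movement rule above can be sketched as follows. This is a minimal illustration of the classic rule, not the thesis's implementation; the `sugarscape_move` helper and its dictionary grid representation are assumptions made for this example.

```python
def sugarscape_move(pos, vision, sugar):
    """Sketch of the classic movement rule: look along the four lattice
    directions up to `vision` cells, then move to the nearest cell holding
    the most sugar. `sugar` maps (x, y) grid cells to sugar amounts."""
    x, y = pos
    candidates = [pos]
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        for step in range(1, vision + 1):
            cell = (x + dx * step, y + dy * step)
            if cell in sugar:
                candidates.append(cell)
    best = max(sugar.get(c, 0) for c in candidates)
    # among ties for the highest sugar, prefer the closest cell
    return min((c for c in candidates if sugar.get(c, 0) == best),
               key=lambda c: abs(c[0] - x) + abs(c[1] - y))
```

An agent at (0, 0) with vision 2 on a strip of increasing sugar would move two cells east, towards the highest visible amount.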

2.1.3 Seminal agent-based models

There are also models that can be interpreted as agent-based models or are heavily influenced by agent-based systems. One such model is Vicsek's model [Vicsek et al., 1995]. Vicsek's model runs, in contrast to the earlier discussed models, in continuous space. It is a system of particles with a fixed absolute velocity v in which the particles show local interactions. The local interactions follow the rule that each particle's direction is an aggregate of the directions of all particles within a fixed radius ρ. A random term is also included, drawn from the interval [−η/2, η/2]. This leaves the model with three free parameters: ρ, v, and η. Given certain values for ρ, v, and η, the particles start to form groups. Given the few parameters, the simplicity, the criticality and the agent-based nature (particles), this model is well suited as a test model where the parameters can be well controlled.
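A single update of this rule can be sketched as below; `vicsek_step` is a hypothetical helper written for illustration only, using the parameters ρ (`rho`), v and η (`eta`) as described above.

```python
import math
import random

def vicsek_step(positions, angles, v, rho, eta, rng=None):
    """One synchronous update of the Vicsek model: every particle adopts
    the mean heading of all particles within radius rho (itself included),
    plus uniform noise from [-eta/2, eta/2], then moves at fixed speed v."""
    rng = rng or random.Random(0)
    new_angles = []
    for xi, yi in positions:
        sx = sy = 0.0  # aggregate heading via a vector sum of unit vectors
        for (xj, yj), aj in zip(positions, angles):
            if (xj - xi) ** 2 + (yj - yi) ** 2 <= rho ** 2:
                sx += math.cos(aj)
                sy += math.sin(aj)
        new_angles.append(math.atan2(sy, sx) + rng.uniform(-eta / 2, eta / 2))
    new_positions = [(x + v * math.cos(a), y + v * math.sin(a))
                     for (x, y), a in zip(positions, new_angles)]
    return new_positions, new_angles
```

With η = 0, two nearby particles heading at 0 and π/2 both align to the mean heading π/4 after one step, which is the grouping mechanism in miniature.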

2.2 Evolutionary Algorithms

2.2.1 Evolutionary Systems

Evolutionary algorithms, like other machine learning algorithms, find many solutions where we humans see only a few. This often leads to strange and unexpected results. This way of finding solutions is important, as it introduces less bias when generating equations. Many scientists who propose a model for a phenomenon hold a biased view of that phenomenon. Using an evolutionary algorithm may provide a fresh look at a problem due to its 'unbiased' way of finding solutions. Evolutionary algorithms are not completely unbiased, however, as their implementation (parameters, fitness function, underlying principles, etc.) also plays an important role. It is therefore important that the underlying system be as general as possible to prevent bias [Eiben and Smith, 2015, Janga Reddy and Nagesh Kumar, 2012]. Of course, bias cannot be completely removed from the system.

2.2.2 Genetic Algorithms

With evolutionary algorithms, a generalized algorithm is used to determine offspring, mate and select. The basis of evolutionary algorithms is initialization, selection, crossover, mutation, evaluation and finally termination. The algorithm loops over selection, crossover, mutation and evaluation until it reaches a stopping criterion, usually a target fitness value or stagnation. Commonly used evolutionary strategies are the classic "select, vary, evaluate", (µ, λ), where only the offspring are considered for selection, and (µ + λ), where both the parents and the offspring are considered for selection.
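The loop described above can be sketched generically. This is a toy (µ + λ) loop written for illustration, not DEAP's implementation; the names `mu_plus_lambda`, `init` and `mutate` are assumptions made for this example.

```python
import random

def mu_plus_lambda(fitness, init, mutate, mu=5, lam=10, generations=50,
                   rng=None):
    """Minimal (mu + lambda) loop: each generation, lam offspring are
    produced by mutation, and the best mu of parents + offspring survive."""
    rng = rng or random.Random(42)
    population = [init(rng) for _ in range(mu)]
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(lam)]
        # (mu + lambda): parents and offspring compete together;
        # a (mu, lambda) strategy would select from the offspring only
        population = sorted(population + offspring, key=fitness)[:mu]
    return population[0]

# toy usage: minimize (x - 3)^2 over a one-number "genome"
best = mu_plus_lambda(
    fitness=lambda x: (x - 3.0) ** 2,
    init=lambda rng: rng.uniform(-10.0, 10.0),
    mutate=lambda x, rng: x + rng.gauss(0.0, 0.5),
)
```

Because parents survive into the next generation, (µ + λ) is elitist: the best solution found so far can never be lost, unlike in a pure (µ, λ) strategy.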

2.2.3 Genetic Programming

There are many different types of evolutionary algorithms. A well-known evolutionary algorithm for fitting equations is genetic programming. Genetic programming is a technique in which programs are evolved, usually equations (which are considered programs) [Poli et al., 2008]. The equations are represented in tree form, which facilitates both the evolutionary process and ease of use when programming. The tree nodes are operations, constant values or input variables. From the leaf nodes, values move upwards through the operations, resulting in a final output at the root node. Evolution occurs by merging and morphing trees with mutations and crossovers.
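A minimal sketch of such an expression tree, using a hypothetical `Node` class (not DEAP's PrimitiveTree):

```python
import operator

# A GP individual as a tree: internal nodes hold operations, leaves hold
# constants or input variables; evaluation flows from the leaves to the root.
class Node:
    def __init__(self, op, children=(), value=None, var=None):
        self.op, self.children, self.value, self.var = op, children, value, var

    def evaluate(self, inputs):
        if self.var is not None:      # input-variable leaf
            return inputs[self.var]
        if self.op is None:           # constant leaf
            return self.value
        return self.op(*(c.evaluate(inputs) for c in self.children))

# the tree for x * (x + 2)
tree = Node(operator.mul, (
    Node(None, var="x"),
    Node(operator.add, (Node(None, var="x"), Node(None, value=2))),
))
```

Mutation would replace a subtree with a new random one, and crossover would swap subtrees between two such trees, which is why the tree representation is so convenient for the evolutionary operators.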

This method of evolving programs has proven itself in numerous agent-based problems such as decision support systems (in agent-wise fashion), trading and business schools, land-use and land-cover modeling, rule identification for crowd models and mobile agent-based rule inference [Zhong et al., 2017, Manson, 2005, Chen and Yeh, 2001, Keller and Hu, 2016, Keller and Hu, 2019, Krzywicki et al., 2014, Byrski et al., 2015].

2.3 Genetic Programming applied to agent-based models

Bonabeau [Bonabeau, 2002] describes agent-based models as systems well suited for capturing emergent phenomena, being flexible and providing natural descriptions of systems. Issues are also discussed in Bonabeau's work. Firstly, agent-based models have to serve a purpose: general-purpose agent-based models cannot work, as each level of the model has to be crafted specifically to suit the problem. Secondly, in some systems, specifically those that try to simulate systems with complex agents (e.g. humans), agent-based models may not accurately quantify the details of the system due to patterns arising from irrational behavior, subjective choices and complex psychology. This problem is most noticeable in systems with heterogeneous agents, as all agents react differently. Although internal state values and heterogeneous agents may reduce the problem significantly, capturing all the details of complex behavior (such as human behavior) is still not possible. The last major issue described by Bonabeau is that ABMs will almost always be more computationally expensive than models that can be described at a higher level instead of through low-level individual components.

Krzywicki et al. show a method of using genomic data inside agents to optimize decision-related problems in computer systems. This is similar to evolving a model, but the model can be directly applied to, for example, multi-core versus single-core usage, providing significant gains in processing speed. Chen et al. proposed a new method of analyzing artificial stock markets by evolving a single-population trading model. With single-population models, each individual genome is mapped to an agent in the same model/simulation. Chen et al. showed that various economic questions could be answered with the internal equations of the agents.

Manson et al. combined a cellular-automata world with agent-based actors to model decision making in human-environment (with institutions) interactions. Their system also uses a single-population model in which each agent evolves its own equation.

Keller et al. propose a more general approach with mobile agents. Their system is designed to fit the movement of crowds or similar concepts, relying on a data-driven approach that creates models by generating possible models for a given dataset and checking for flexibility, comprehensibility, controllability, composability and robustness. Through the use of System Entity Structures (SES), Keller et al. allow a modeler with incomplete knowledge of the emergent factors to create a formally defined problem space. The SES is then used as a basis for evolving behavioral groups, for example one group being the directional behavior group and another being a speed behavior group; a solution is generated for each group.

Zhong et al. proposed a data-driven modeling framework for the inference of crowd behavior. The model is constructed with two layers: a social-force layer for non-directional movement (e.g. avoidance) and a top layer for directional movement towards a goal. The latter is a rule derived from video data of crowds using genetic programming, evolving an equation. With this two-layer approach, the burden on the GP algorithm is lightened, as the low-level movement is already incorporated and only the goal-oriented behavior has to be evolved. The downside is the assumption that social-force avoidance drives the non-directional movement.

2.4 Research challenge

2.4.1 Challenges for GP and ABM

While the previous works were all related to data-driven rule inference, they mostly focus on specific use cases and models. Keller et al. proposed a more general method of rule inference, not fixed to one model, but their work focuses on one type of agent: a mobile agent with vector-based movement. While focusing on one model or method has many advantages, such as being able to verify models more accurately and being able to implement model-specific analysis in the framework, it is not yet general purpose and assumes a certain type of agent.

Based on the literature, there are important challenges that need to be addressed. Noise is always an issue in both agent-based systems and evolutionary systems [Eiben and Smith, 2015, Byrski et al., 2015]. The evolutionary algorithm that will be used also brings various challenges, such as parameter optimization, algorithm choices, and individual evaluation and rejection. The number and quality of observations will also have a big impact on convergence.

Stochastic processes also play an important role in agent-based models, and together with chaotic parameter and fitness spaces, they increase the difficulty of converging to correct models [Calvez and Hutzler, 2006]. Another challenge in the fitness space is posed by discontinuous models: when an agent-based model is built from trigonometric functions, the fitness landscape will be discontinuous in nature.

Many issues in evolutionary algorithms arise from convergence to local optima. Improved sampling methods for the generation and selection of evolutionary genomes can often increase the performance of the algorithm. Methods such as Latin hypercube sampling increase population variety and decrease the chance of converging to a local optimum. This challenge brings along another challenge of its own: genomes can be represented in a multidimensional uniform space, but uniform sampling of this space leads to many inviable individuals. Identifying these regions and generalizing a method to remove them from the sampling space would greatly increase the efficiency of the evolutionary algorithms.

Finally, the performance of such a framework is important as well. While parallelizing evolutionary evaluations is trivial in most cases, the fact remains that many evaluations are required. Reducing the run time of a single evaluation will have a huge impact on the speed of the system as a whole.

2.4.2 Contribution of the thesis

In this thesis, the focus will be on noise, stochasticity and discontinuous models. These issues will be tested using the newly developed framework GABED [Diepgrond, 2020].

In the first experimental chapter (ch. 4), a simple hill-climbing model is used to explore the effects of discrete spaces and of agent density in the model. Chapter 5 explores the effects of continuous space combined with discontinuous functions using Vicsek's model, an often-found combination due to angle calculations and other trigonometric functions. Finally, Chapter 6 looks at a stochastic model using Sugarscape, exploring the effects of known stochastic variables in both the dataset and the model.

Each chapter will also implicitly address the challenges related to the evolutionary algorithm, such as evaluation and rejection. The Sugarscape chapter (ch. 6) will go into more depth on the challenge of (stochastic) evaluation and early rejection.

The other challenges noted earlier (performance, dataset quality, parameter optimization) are out of the scope of this thesis. Initial genome sampling is also outside the scope of this thesis, but as stated earlier, early rejection and adaptive sampling for stochastic agent-based models are implemented and discussed. While these methods attack a different issue, they ultimately have the same effect of increasing the efficiency of the evolutionary algorithm, making the time spent on bad solutions significantly less.


Chapter 3

Framework and Implementation

3.1 Framework repository

The framework, including a README, can be found at https://gitlab.computationalscience.nl/evosim/gabed . The repository contains all code used to generate the figures and data of this thesis. Note that in cases of stochastic processes the data may vary between runs. For more information on parameters and code, consult the README.

3.2 Framework

The framework is a three-part modular framework. The three modules are common, evolution and ABS (agent-based system). While both the evolutionary and the ABS modules depend on the common module, they do not depend on each other. This means that the modeler is not forced to use the provided agent-based system and can use any existing model or framework to evaluate the evolutionary individuals.

The evolutionary framework is derived from the DEAP module and incorporates many changes for ease of use and extended functionality; more details can be found in the Implementation section. These three modules combined aim to provide the modeler with an easy-to-use framework that requires minimal input to work. The framework provides an environment for fitting emergent phenomena, using genetic programming, to homogeneous agent-based models with minimal user input. It provides functions to determine sample requirements and for parallel evaluation of individuals, as well as population generators and heterogeneous genome populations. Basic mutation operators are provided and more can easily be added. Genomes are automatically limited in height to prevent enormous equations, and equations are strongly typed. Evaluated individuals are stored and recalled instead of re-evaluated every time, and duplicates are automatically removed and replaced by new randomized individuals.

3.2.1 Parameters

In this thesis, evolutionary parameters are chosen using sensible defaults and some tweaking. While these parameters are important, they are not within the scope of this thesis, and the basis of these parameters is thus mostly omitted. The chapters will refer to the relevant code in the git repository. Parameters for the agent-based systems are the original defaults where available. As Sugarscape and Vicsek are not based on a single rule by default, rules are crafted to mimic the original behavior as closely as possible. The parameters used for the evolutionary system were kept constant over most experiments, as was the algorithm used. The one exception is the first experiment, crosshill, which uses the simple evolutionary algorithm (DEAP's eaSimple) with a crossover probability of 50% and a mutation probability of 10%. All other experiments make use of the (µ, λ) paradigm with a crossover probability of 10% and a mutation probability of 40%. No dynamic probabilities were used.

Within experiments, all parameters are fixed with the exception of the parameter of interest.

3.2.2 Workflow

When starting to infer the individual rules of agents from a dataset of emergent behavior, one follows a logical order of thought. This order of thought has, hopefully, been replicated in the framework. The following gives a brief description of the processes involved.

Initially, one starts with a dataset in which emergence is assumed. First this emergence must be decomposed to create an agent-based model, with its parameters, which can be used as a projection of the world that created the emergent phenomenon. For example, let's imagine a simple cellular automaton which gives rise to emergent behavior. We first establish that the agent-based world which best reflects the properties of the cellular automaton is a grid world with a Moore neighborhood. Given the Moore neighborhood, we create eight choice objects, each with its own utility and action code, such that given an agent as a function parameter, it returns a utility for that choice and handles the action for that choice respectively. By design, out-of-bounds and other movement-related problems are also handled by the choice objects. If the model cannot be expressed using discrete choice, another method has to be implemented inside the agents (e.g. for continuous models like the Vicsek model) which handles the process.

Choice objects have two main functions, evaluate and execute, called Choice.utility and Choice.make respectively. The agent simply has to rank the choices by their utility and execute (Choice.make) one. This way, the design makes a clear separation between the perceived effectiveness of a choice and its actual action.
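The separation can be sketched as follows. This is an illustration of the pattern, not the actual GABED classes; only the method names `utility` and `make` follow the text, and everything else is an assumption made for this example.

```python
# A choice scores itself for an agent (utility) and carries its own
# action code (make); the agent executes the highest-utility choice.
class Choice:
    def __init__(self, name, utility_fn, make_fn):
        self.name = name
        self._utility_fn = utility_fn
        self._make_fn = make_fn

    def utility(self, agent):
        return self._utility_fn(agent)   # perceived effectiveness

    def make(self, agent):
        self._make_fn(agent)             # actual action

class Agent:
    def __init__(self, x, choices):
        self.x = x
        self.choices = choices

    def step(self):
        best = max(self.choices, key=lambda c: c.utility(self))
        best.make(self)

# toy usage: a 1-D agent choosing between moving left and right,
# with a utility that prefers positions closer to x == 5
def move(dx):
    return Choice(f"move {dx:+d}",
                  utility_fn=lambda a: -abs(a.x + dx - 5),
                  make_fn=lambda a: setattr(a, "x", a.x + dx))

agent = Agent(0, [move(-1), move(+1)])
for _ in range(5):
    agent.step()
```

In GABED the utility function would not be hand-written as here, but supplied by the evolved genome; the action code stays fixed.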


The equation determined by the genome, which in the case of discrete choice is the utility equation, has to be well thought out. The parameters and primitives have an enormous impact on the evolutionary process and its outcome. The equation consists of one or multiple input parameters and primitives which manipulate those parameters, and finally produces a single output, which can be an array. The primitive trees are, following DEAP's own terminology, strongly typed and as such will handle only the specified parameters or intermediate results (e.g. float, int, array, class, etc.). Primitives can be any arbitrary function in Python. The individual, which houses the genome and thus the equation, is passed to each agent, which enables unique usages of the genome on a per-agent basis.

When the world, agents[, choices] and primitives have been defined, they are added to the EvolutionPool. The world and agent classes are added to the EvolutionPool via the WorldFactory, which handles the initialization of the world, agents and choices for each evaluation of an individual. This process requires the evolutionary algorithm to be specified, along with the number of agents, the world parameters, etc. Finally, the evolution process can be started and the resulting data can then be analysed in multiple ways.

3.3 Implementation

The implementation of the modules comes together as two main components and a supporting module: the first component is the evolution, the second is the agent-based system, and the supporting component is called 'common'.

The evolutionary component, based on the DEAP library, consists of the (evolution) Pool, individuals, algorithms and helper functions. Individuals are the GABED equivalent of DEAP's PrimitiveTree class (from which they inherit), but for ease of use as well as flexibility they incorporate many of the functions which apply to them (kept separate in DEAP) as class members, such as mate, mutate and generatePopulation. This adds the flexibility to have heterogeneous populations with different mating and mutation operators, which in turn can be inherited through mating. This also makes creating an individual trivial.

The agent-based system has been subdivided into three main components and a factory. The components are the world, the agent and the choice (object). The easiest way of creating an ABM is to define a world and choices; the agents are then only required to rank the choices and choose. The choice object incorporates both the utility calculation (which is in turn defined by the evolutionary genome) and the action. The world is written as a Mesa model, except for its initialization, which is handled by the WorldFactory class. The world factory handles initialization such that each world has no cross-references and is correctly instantiated. World parameters are given to the factory, as well as the classes 'World' and 'Agent' and an optional choice array.


3.3.1 Changes to the packages

To get full usage out of the packages, a few changes had to be made to them. First of all, the primitive tree from DEAP: much of this class has been rewritten, especially the ephemeral constants and the tree generator. Due to their implementation in DEAP, ephemeral constants are incompatible with pickling (saving to file), and the same ephemeral function cannot be added twice in the same Python session, as they were stored in the unit file's global storage. This is addressed by creating fixed ephemeral classes which can be found, saved and loaded by pickle systems.

The PrimitiveTree from the DEAP package is extended with the EvoIndividual class in population. Mutation and mating operators are now class specific and are included in the EvoIndividual class, enabling heterogeneous individuals each with their own operators. The population generators are also included in EvoIndividual as static methods. The generator used in the modified evolution algorithms is generatePopLike, which takes all the settings from the given individual as well as their base class. The evaluate method, with its sub-method run world, has also been moved to this class. Simple implementations for functional difference, structural difference and content difference have also been supplied.

3.3.2 Fitness

Every evaluation of an individual requires a run of an agent-based system, collection of data and its subsequent analysis. The first two processes, the ABM run and the collection of data, are nearly the same for every type of problem, excluding the model itself. The latter process of analyzing the generated data and comparing it to the original dataset, containing the real-world observations, will most likely require a unique solution. The solutions that might be used can be roughly subdivided into two categories: deterministic and stochastic. The analysis of deterministic simulations is usually straightforward: the data is generated (taken directly from the ABM) and converted to reflect the original dataset (e.g. density measurements from agent positions). From there on, a simple comparison, like the squared error, can be used to derive a measure of fitness.

For stochastic systems the problem is somewhat more complicated. The stochasticity can come from various factors, from simply the initial positions to complex interdependent stochastic processes. The main issue is that an individual cannot be evaluated on a single run and thus requires multiple runs per evaluation. The next issue arises when these multiple runs per evaluation are put into the context of genetic programming, where a population size proportional to the search space is required, which can quickly lead to huge computational requirements. Aside from problem-specific solutions, the only general solution is to determine the statistical certainty while the data is generated and stop when enough certainty has been reached. Early-stop mechanisms can also be integral to reducing the computational intensity but are often more specific to the problem.


Chapter 6 (Sugarscape) introduces equation 6.2, which is used to calculate the number of samples needed for a given individual based upon its standard deviation. The standard deviation requires a minimum of 2 samples to produce a usable, finite number. The equation is implemented as a common function in statistics called samplesNeeded, which takes a list of values, the required confidence interval and the margin of error.
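A sketch of how such a function might look, assuming the standard sample-size formula n = (z·s / E)². The thesis's equation 6.2 is not reproduced in this section, so the body of `samples_needed` and its z-score table are assumptions made for this example.

```python
import math
import statistics

# z-scores for common confidence levels (avoids a scipy dependency)
Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def samples_needed(values, confidence=0.95, margin=0.1):
    """Estimate how many runs are needed so that the mean of `values`
    is within `margin` of the true value at the given confidence level,
    using n = (z * s / margin)^2 with s the sample standard deviation."""
    if len(values) < 2:
        return float("inf")   # std. dev. is undefined below two samples
    s = statistics.stdev(values)
    return math.ceil((Z[confidence] * s / margin) ** 2)
```

The evaluation loop would keep generating runs until the number of collected samples reaches the returned estimate, which shrinks as the observed spread shrinks.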

The next issue regarding stochastic systems is how to analyse them. The analysis should be applicable to both the original observations and the agent-based data. A two-step method is proposed. This two-step method, which can be considered a meta-analysis, consists of transforming the data into a statistical measure such as a histogram. This statistical measure is determined for multiple runs of an individual as well as for the original dataset. To get a measurement of similarity (or fitness), a second statistical measure is used. In the case of histograms the Hellinger distance can be used; it provides a robust method of comparing models to data where the models are not strictly correct [Beran, 1977, Lindsay, 1994]. An example and usage of the Hellinger distance can be seen in Chapter 6.
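The Hellinger distance between two histograms can be computed as follows (a self-contained sketch; the framework's own implementation may differ in detail):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two histograms of equal length.
    Both are normalized to probability distributions first; the result
    lies in [0, 1], where 0 means identical distributions."""
    sp, sq = float(sum(p)), float(sum(q))
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))
```

Because the result is bounded in [0, 1], it can be used directly as a fitness value (lower is better) when comparing the histogram of a simulated run against the histogram of the observations.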

3.3.3 Data retention

Individuals contain data about their fitness, function, possible parameters and more. The Individual class can also be used to store any amount of extra information regarding an individual. The individuals are used to analyse evolutionary runs and thus must be stored. Python's default storage mechanism is to pickle items or arrays of items. This method, however, cannot be used to read and write on demand and is fixed once written. Storing every individual in its own file is possible but poses problems as population sizes and generation counts increase. Not only does this result in increased storage usage, especially on block devices with large cluster sizes, but the large number of individual files also tends to overwhelm the operating system with file requests. Some other storage solutions exist for Python that use a single file as a filesystem, but these are read-only filesystems.

To address these issues, the framework uses a filesystem written especially for these types of storage problems. The class that handles such a filesystem is called streamDict and behaves mostly like a regular dictionary. All saved items are pickled by dill and then compressed using zlib. They are then stored as separate byte sections, each with its own header, and together make up a solid data archive. Items can be recalled, edited and saved again without intervention. In-place editing of items is supported as long as the blob size of the item does not increase; items with a blob size larger than previously stored are appended at the end of the archive. Dead blobs (overwritten items whose larger replacements were written elsewhere) can be removed by calling prune on the streamDict. Note that the prune operation is akin to a defragmentation and is disk intensive.

While the streamDict is instantiated, memory holds a dictionary containing all the headers of the data, loaded at initialization. This makes data lookups fast, especially when only the existence of an item has to be checked.

During analysis, the streamDict can be instantiated as usual and handled as a regular dictionary.
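The core idea can be sketched as a minimal append-only keyed blob archive (illustrative only: the real streamDict uses dill, stores per-blob headers on disk, and supports in-place edits and pruning):

```python
import pickle
import zlib

class StreamDict:
    """Append-only keyed blob archive: values are pickled, compressed
    with zlib and written as raw byte sections; an in-memory index maps
    keys to (offset, size) so items can be read back on demand."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> (offset, size)
        open(path, "ab").close()        # make sure the archive exists

    def __setitem__(self, key, value):
        blob = zlib.compress(pickle.dumps(value))
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(blob)
        self.index[key] = (offset, len(blob))  # old blob becomes "dead"

    def __getitem__(self, key):
        offset, size = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return pickle.loads(zlib.decompress(f.read(size)))

    def __contains__(self, key):        # fast: only touches the index
        return key in self.index
```

Overwriting a key appends a new blob and leaves the old one dead, which is exactly what the prune operation cleans up in the full implementation.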


Chapter 4

Simple Hill-climbing: The effects of discontinuous space and agent-density

In this chapter, using a simple hill-climbing model, discrete space and agent density are discussed and their effect on the quality of the generated models is examined. To start off simple, the model is first examined with a single agent. With the single-agent model, the effect of perturbations in the ground truth on the quality of the resulting models is analyzed. In the multi-agent case, the effects of increasing the density on the quality of the models are explored.

4.1 Discontinuous space and model quality

4.1.1 Introduction

The most important feature of genetic programming (GP) (or evolutionary systems) is its ability to converge to a solution or optimum. In the case of a deterministic model, an exact solution can be determined, given enough data. The framework is applied to a deterministic model: a hill-avoidance agent-based model with a single agent following a simple rule. The model dynamics as well as the solutions and their quality are discussed.

This algorithm follows the book Evolutionary Computation 1, chapter 7 [Fogel, David B, 2003].

4.1.2 Model

The single-agent deterministic model is a grid-based world with one or more Gaussian-bell-shaped hills. An agent is set at (−40, 0) and must travel to (40, 0), for which it has four choices per step; each valid choice is called an option. Each


Figure 4.1: Crosshill simulations with n_a = 1. (a) α = 0; (b) α = 1. The black dot symbolizes the goal; the path displayed is the path taken by the agent given the specified α with the utility equation 4.1.

option represents a cardinal direction in the von Neumann neighborhood. Each option has an associated cost, which is calculated by

c_c = δd_goal + α · δh    (4.1)

where c_c is the cost calculated per option per step, δd_goal ∈ [−1, 1] is the difference in distance to the goal compared to the current cell, α is the weight of the hill term and δh ∈ [−s, s] is the difference in height between the current position of the agent and the position the option indicates (the slope). s denotes the highest absolute slope in the grid.
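The per-option cost of eq. 4.1 can be sketched as follows (the function name and the `height` callable are illustrative, not the thesis's actual API):

```python
import math

def option_cost(pos, option, goal, height, alpha):
    """Cost of moving from `pos` to the neighboring cell `option`,
    per eq. 4.1: c_c = delta_d_goal + alpha * delta_h. Lower is better.
    `height` is a callable (x, y) -> terrain height."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    delta_d = dist(option, goal) - dist(pos, goal)  # toward goal -> negative
    delta_h = height(*option) - height(*pos)        # uphill -> positive
    return delta_d + alpha * delta_h
```

With α = 0 the agent simply greedily minimizes distance to the goal; larger α makes uphill steps increasingly expensive.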

Hills are defined as a set of coordinates, height and deviation (x, y, h, σ) and the height of a cell is calculated as a Gaussian curve

H(d, h, σ) = h e^(−d²/(2σ²))    (4.2)

with d as the distance from the hill center, h as the height at the hill center and σ as the spread, or standard deviation, of the hill. When calculating the actual height for a cell, we take the height from all the mountain definitions and take the highest at that location:

H = max({ h_n e^(−d_n²/(2σ_n²)) : n = 1, …, n_h })    (4.3)

with n_h as the number of hills in the world and n as the n-th hill.
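Eqs. 4.2 and 4.3 combine into a short height lookup (a sketch; the hill tuple layout is an assumption):

```python
import math

def cell_height(x, y, hills):
    """Height of cell (x, y) per eqs. 4.2-4.3: every hill
    (hx, hy, h, sigma) contributes a Gaussian bell and the cell takes
    the maximum contribution over all hills."""
    def contribution(hx, hy, h, sigma):
        d2 = (x - hx) ** 2 + (y - hy) ** 2
        return h * math.exp(-d2 / (2.0 * sigma ** 2))
    return max((contribution(*hill) for hill in hills), default=0.0)
```

Taking the maximum rather than the sum means overlapping hills do not stack; the terrain at a cell is just the tallest bell covering it.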

4.1.3 Method

The model was created as a discrete world with a von Neumann neighborhood, which translates into four choices. Equation 4.1 determines the cost of moving towards a given choice, given the parameters δd_goal and δh. The costs of the


choices in the neighborhood were then sorted and the lowest-cost option was taken and executed.

Agents were protected by a loop detection which checks whether the agent has moved a distance Y after X steps. If the agent has no option (the agent has to stay put), the loop detection is reset. If after X steps the agent has not moved more than distance Y, the agent is deemed stuck and is removed from the world. The world was limited to a maximum number of iterations to prevent a never-ending run: there are cases in which the loop detection cannot detect a loop, in which case the world has to end itself. Generally, the world should end before the maximum number of iterations; a world exceeding this number already has a low fitness.
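The loop detection can be sketched as follows (names and the exact reset rule are illustrative; X and Y are the step and distance thresholds from the text):

```python
import math

def make_loop_detector(x_steps, y_dist):
    """Returns a per-agent check: every `x_steps` moves, an agent that
    has strayed no more than `y_dist` from its reference position is
    deemed stuck. A step with no available option resets the detector."""
    state = {"ref": None, "count": 0}

    def check(pos, had_option=True):
        if not had_option:                      # forced to stay put
            state["ref"], state["count"] = pos, 0
            return False
        if state["ref"] is None:
            state["ref"] = pos
        state["count"] += 1
        if state["count"] >= x_steps:
            moved = math.hypot(pos[0] - state["ref"][0],
                               pos[1] - state["ref"][1])
            state["ref"], state["count"] = pos, 0
            return moved <= y_dist              # True -> agent is stuck
        return False
    return check
```

Each agent holds its own detector; a True return removes the agent from the world.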

Hills were created randomly with a fixed seed to ensure the world stays the same across experiments. This fixed seed is given on function call, so multiple seeds may be used in experimental repeats.

Each time frame, the fitness function creates a matrix over all coordinates, adds 1 to each position where it finds an agent in the given dataset and subtracts 1 from each position where it finds an agent in the generated dataset. It then sums the absolute values of the matrix as the fitness value (lower is better) of the individual.

Equation 4.4 shows how the fitness was calculated, with D^given and D^generated as the given dataset and the dataset generated by evolution, respectively. Both datasets are indexed by the time step t, where t_max denotes the highest time step of the dataset with the most time steps. The indices i and j denote the coordinates of the system, and s_x and s_y denote the size of the system.

fitness = Σ_{t=0}^{t_max} Σ_{i=0}^{s_x} Σ_{j=0}^{s_y} |D_{tij}^{given} − D_{tij}^{generated}|    (4.4)
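Eq. 4.4 reduces to a triple loop over occupancy grids (a sketch assuming equal-length datasets for brevity):

```python
def pattern_fitness(d_given, d_generated):
    """Eq. 4.4: sum over all time steps and cells of the absolute
    difference in agent counts; 0 means the patterns match exactly.
    Each dataset is a list of 2-D grids (lists of lists of counts)."""
    total = 0
    for grid_g, grid_e in zip(d_given, d_generated):
        for row_g, row_e in zip(grid_g, grid_e):
            total += sum(abs(a - b) for a, b in zip(row_g, row_e))
    return total
```

Because only counts per cell are compared, the measure is a pure pattern comparison and never identifies individual agents.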

When analyzing the results it is important to get an indication of the accuracy of the solutions: how well the solutions reflect the ground truth of the form d + αh. Given the sheer number of solutions generated to obtain a good statistical measure, evaluating each solution by hand is not possible. Therefore the mean sum of squares was used as a measure of accuracy, calculated as

MSB = (1/n_e) Σ_{n_e} (1/n_s) Σ_{n_s} (f(b, h)_solution − f(b, h)_true)²    (4.5)

where n_e is the number of experimental runs, n_s is the number of solutions, and f_solution and f_true are the generated solution and the ground truth, respectively. Both f_solution and f_true were evaluated with the values −1, 1, 0.5 and −0.5 for b as well as for h, for each value of b, resulting in a total of 16 evaluations. These values are all within the domain of both δd and δh as specified in eq. 4.1.
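For a single experimental run, the accuracy measure can be sketched as follows (here the per-solution error is averaged over the 16 evaluation points; the outer averaging over runs in eq. 4.5 is left to the caller):

```python
from itertools import product

def msb(solutions, truth, points=(-1.0, 1.0, 0.5, -0.5)):
    """Mean squared deviation (cf. eq. 4.5) of the evolved solutions
    from the ground truth, evaluated over a grid of (b, h) points
    (16 combinations by default). Both `truth` and every entry of
    `solutions` are callables f(b, h)."""
    grid = list(product(points, repeat=2))
    total = sum((f(b, h) - truth(b, h)) ** 2
                for f in solutions for b, h in grid)
    return total / (len(solutions) * len(grid))
```

An MSB of 0 means every solution agrees with the ground truth at all 16 evaluation points.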

The measurement of accuracy, while a good verification for this example, does not work for any real world model as the ground-truth equation is not known.


Figure 4.2: The number of evaluations required per population size for all used values of α. Note that the evaluations scale with the population size as every generation all unevaluated individuals are evaluated.

4.1.4 Results

In this example, various values for the population size and α were used. The results show convergence, but the quality of the solutions varies wildly. Figure 4.2 shows the number of evaluations required to find a solution that fits the dataset exactly while varying the number of evolutionary individuals (evo agents); this number of evaluations is considered a measure of difficulty. Figure 4.2 also shows that the number of evaluations scales directly with the number of individuals for larger populations, indicating that the solution was within the initial population. Smaller populations show a bigger spread with more outliers, which is especially prominent with 10 evolutionary individuals: in two cases they reach over 600 evaluations before finding a solution. From 50 individuals onwards, the number of evaluations tends to be fairly constant at around one generation, which is the initial population evaluation followed by the evaluation of all new individuals. Note that solutions were taken from the initial population, but the algorithm only stops after the first generation because ending the algorithm is not part of the bootstrap.

Figure 4.4 shows how much the equations deviate from the ground-truth equation for varying values of α. The figure shows solutions for α = 0.5 and α = 1.0 reaching perfect solutions while all other values of α do not. The figure also shows that for every case, solutions were found which differ wildly from the ground-truth equation (MSB > 10).


Figure 4.3: The accuracy of the solutions compared to the ground truth (eq. 4.5) against the population size. Only α = 0.0, 0.5 and 1.0 reach an MSB of 0.0. The whiskers denote the 1.5 IQR. Three outliers, (0.0, 0.0), (0.66, 6.3 × 10³⁰) and (1.33, 6.3 × 10¹⁴), are omitted to keep the graph clear.

Figure 4.4: The accuracy of the solutions compared to the ground truth (eq. 4.5) against the different α values. The whiskers denote the 1.5 IQR. Two outliers, (0.66, 6.3 × 10³⁰) and (1.33, 6.3 × 10¹⁴), are omitted to keep the graph clear.


4.1.5 Discussion

The framework, applied to a trivial deterministic model, shows quick convergence. The dynamics of the model are simple, but the results nonetheless show that the quality of most solutions is marginal at best.

Figure 4.2 clearly shows that, given enough individuals, the evolution converges in about one or two generations. This indicates that the evolutionary process does not actively search: the solution is too easy, or has so many possible answers that one is nearly always found in the initial population.

Many systems using evolutionary strategies only require a few generations for an initial solution to be found. Following this initial solution, later generations are responsible for further optimizing it. Two examples of this can be found in a report on evolving steering behavior [Vogel, 2004] as well as a book on evolving flight control system design [Bourmistrova and Khantsis, 2010]. Both show similar evolutionary patterns where a fairly good solution is found within the first 1 to 10 generations; further generations only serve to enhance the solution.

For the hill-climb model, solutions are found within the first few generations. The issue here is related to the actual equation, which requires only two components in a fixed ratio. The required ratio was set by the equation d + 1.333333h. This ratio can quickly be found when the initial population contains linear combinations of d and h, such as d + d + h + h + h. While this is not an exact solution, the simulated system will assign it a good fitness and might even assign it the best possible value of 0 ("no difference found"). If the difference found is not exactly zero, the process goes on to further refine the best genome. Due to the nature of evolving equations, this final optimization process is rare, as the steps taken in the evolution are fairly discrete.

The only way to prevent such early solutions is to increase the complexity of the model such that it, for example, contains a number of interacting variables, crafting a fitness landscape with many hills and valleys. Unfortunately, classical models in their base form are usually limited to a single interaction, and increasing complexity moves away from the known model. Increasing complexity thus results in models that are harder to predict and explain. This thesis requires manual evaluation of the results as well as of the model behavior, and such complexity would severely impact our ability to perform these evaluations. So while these results may not be an optimal representation of complex systems, they still provide valuable information about the GABED system with understandable and predictable results.

Figure 4.4 shows the inaccuracy of most solutions, meaning that many models fit our truth dataset. When many models fit the ground-truth dataset, the true equation cannot be determined without additional information about the system. The problem is usually a lack of constraints from the data, which can occur for multiple reasons. First and foremost is the issue of not having enough sample points in the ground-truth dataset: a small ground-truth dataset can often allow multiple models to be fitted.


Constraints can narrow the search to a specific model, but having more data to increase the number of constraints is always preferred, as manual constraints require additional (biased) input. Secondly, when the observed data points show only a limited subset of the choices that can be made, the resulting model can only be fitted using those choices, resulting in inaccurate models. This example model of hill climbing has four choices, where each step the agents get a subset of choices as options. The agent chooses the option with the highest utility, which depends on the difference between the outcomes of the equation for each option. If too small a subset of choices is observed, the quality of the solutions will suffer. This is especially prominent in the next chapter, where it is discussed in more detail. The dynamics of the model itself may also play a role in the accuracy of the solutions. While the agent models in these examples are all exact matches to the dataset, this might not hold true for cases where unknown processes are observed.

Finally, the number of possible states observed from the phenomenon might be lacking. In this example, the positions of the agents cover a very small portion of the whole map. More complex paths and better utilization of the space, through longer paths or more agents, will ensure that more information is available about the dynamics of the system and the options chosen by the agents.

The nature of the evolved equations, and the way they are used, can also affect the quality of the results. In this example, because the equation is used as a utility function that is evaluated for every option, constants that can be taken out of the equation (i.e. play no role in the difference f(x₁) − f(x₂)) have no effect on the model but do have an effect on the MSB. This is thought to be the main source of high MSB values in this example, because retrieving the original (correct) equation is trivial. The other possible source of high MSBs is the samples lying within a small subset of the larger domain of the equation. Just like a zoomed-in circle r² = x² + y² looks like a straight line y = x, the individual might be correct within the domain of the (evaluated) model, but when the conditions of the model change, the individual may no longer be correct. In this specific model, most solutions will be near the ground truth, but there will be some extremes which contain powers (e.g. d + h²) yet still accurately represent the dataset in this specific configuration. This shows the necessity of verifying the results by holding back some of the data while training (or evolving) when working with systems that can show similar behavior.

Constraints can be added at various stages of the process and are often an effective method of improving evolutionary performance and reducing the number of incorrect results. Most constraints are placed on the evolutionary individual: the individual can be declared invalid when its genome has certain combinations of genes or is missing certain genes. This method is similar to the 'compositional constraints' found in nature, which constrain genomes by invalidating individuals (e.g. birth defects or stillbirth) [Arnold, 1992, Bernardi and Bernardi, 1986, Montana, 2008]. Another method of evolutionary constraints is phenotype constraints, where the behavior of the evolved function is checked to be within certain parameters and domains given an input. To constrain the building of genomes, an effective method specific to genetic programming is strong type


constraints. These assign types to the inputs, function parameters and outputs such that only correct combinations of inputs and primitives can be used. These evolutionary constraints are effective at directing the evolution and preventing incorrect models and an excess of improbable models which would otherwise be evaluated. While effective at constraining evolutionary behavior, they require knowledge of the system that is to be fitted. Having a solid ground-truth dataset therefore remains a requirement, while constraints help the system converge more efficiently.

When analyzing the results, the MSB for different values of α showed that for some values the framework could find exact solutions while for others it could not. It was hypothesized that α = 0.0, 0.5, 1.0 and 1.5 would be the easiest problems, but notably an α of 1.5 did not produce exact solutions while 0.0, 0.5 and 1.0 did. Upon closer inspection, the solutions often contained multiple α terms added together, as the constant value 1.5 is not possible given the ephemeral constant primitive. The requirement of multiple α terms, which also required correct constants, made the correct solution harder to find, as simpler equations were good enough to fit the system, leading into a local optimum.

Looking at figure 4.4, it again shows an interesting spread for α = 1.0 and 1.5, in this case denoting the quality of the solutions. While the worst solution found was at α = 1.0, the best solution (an exact solution, as its MSB is 0) was also found in the same category. Note that all these solutions produce the exact pattern of the ground truth, and thus the evolutionary program cannot differentiate between these MSBs.

Figure 4.3 shows the effect of population size on the distribution of solution quality (MSB). For bigger population sizes there will generally be more solutions, as more solutions arise in a single generation. The medians in the figure show a trend similar to an optimization curve, where a population size of 50 shows the worst median solution quality. Why both higher and lower population sizes show a better median is unknown; the difference is small in any case. Post-hoc analysis also shows no correlation between the MSB and the number of evaluations, negating the argument that population size 50 has the fewest evaluations on average.

When working with limited datasets, many solutions may be found. When gathering more data is impossible, additional constraints may be required. This, however, requires knowledge about the system.

Basic constraints, constraints which do not require extra knowledge about the system, are usually desired. These constraints are usually applied to the genome structure and effective domain. Further constraints can be applied by limiting the number of primitives used as well as using complex primitives of known equations such as basic geometry rules.


4.2 Exploring the effects of agent density

4.2.1 Introduction

In the world of agent-based systems, a model with a single agent rarely occurs. Therefore, in this section, the hill-climb model with multiple agents will be explored. The dynamics and settings are kept mostly the same and the effect of having multiple agents will be analysed.

The agents in the simulations were still not identified in the fitness calculation (i.e. eq. 4.4), as the goal is to replicate observing an emergent phenomenon rather than individual behavior. When increasing the number of agents, it is more likely that agents will neighbor each other and thus reduce the number of possible options per agent. This in turn will result in the data only showing a subset of choices per step. In effect, due to the lack of individual identification, the data will be more uncertain the more agents there are, resulting in less accurate solutions.

When observing options, agents are not individually identified but are instead analysed using the global state change. The pattern of the whole field changes, and this pattern change encodes all the individual choices. But this encoding is not lossless, and thus there may appear to be more options than there are choices when agents are adjacent to each other. These "phantom choices" or "phantom options" will follow an inverted U-curve as the number of agents increases. A single agent cannot create phantom options, and with just two agents the chance of being adjacent and taking a similar move is low. A similar effect happens when the number of agents approaches the area of the world, n_a → A_world. In this situation the options that remain can be tracked by comparing the space that was occupied to the space now empty. Basically, the first agent that moves is the only agent that moves.

A drawback of discrete simulations is that their behavior does not change on a continuous scale while the equations used to control the behavior do. Therefore, the evolutionary algorithm will find solutions which generate the exact same discrete pattern but with a different equation compared to the ground truth (the equation used to produce the original dataset). While these equations can be similar to the ground truth, they may also contain higher-order terms that have the same behavior in the domain used in the simulation of the agent-based system.

Another effect that will play a major role is the amount of data fed into the system. More agents result in more data to fit against, which in turn provides more constraints for the final solution. Combined with the aforementioned problems regarding an increase in the number of agents, it is expected that the quality of the solutions will follow an inverted U-curve with the number of agents in the system. The two situations that will result in bad behavior are thus overcrowded and undercrowded situations. In terms of observations, in both situations the observed options do not complete the set of choices available in the system.


This section will apply the framework to the generated ground-truth datasets and show an analysis of the effects of increasing the number of agents. With an increase in the number of agents, more data will be available, which in turn increases the number of constraints upon the model, as discussed in the previous chapter. In the single-agent case, the number of choices observed was limited by the simplicity of the agent's path. With more agents the number of choices observed will go up; this is expected to improve the quality of the solutions up until a certain point, where the effects of an increase in the number of agents become detrimental.

4.2.2 Model changes

The extensions to the agent-based system discussed in 4.1.2 relate only to a multi-agent environment. The first extension is that no more than one agent may occupy a cell at any point in time. Secondly, a choice that considers an occupied cell does not have a utility and is thus discarded, as the option is impossible. The positions of the agents are randomly initialized with a fixed seed to ensure repeatability of the model. The same initial position seeds are compared between the ground truth and the model.

To give an indication of how the agents behave in the simulation, a vector field has been made (fig. 4.5), with each vector being ⟨U_{1,0} − U_{−1,0}, U_{0,1} − U_{0,−1}⟩, where U is the utility and its subscript is the direction. The resulting vector gives a continuous indication of the discrete options, with the option boundaries lying at 45°, 135°, 225° and 315°. If the vector has any of these angles, it is considered critical, as a small perturbation in the utility equation might result in a different choice.
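The per-cell vector can be sketched as (the `utility` callable is illustrative; it maps a cardinal offset to the utility of stepping there):

```python
def utility_vector(utility):
    """Continuous indication of the discrete choice at a cell:
    <U(E) - U(W), U(N) - U(S)>, where `utility` is a callable
    (dx, dy) -> utility of the option in that cardinal direction."""
    return (utility(1, 0) - utility(-1, 0),
            utility(0, 1) - utility(0, -1))
```

Evaluating this at every cell yields the field plotted in fig. 4.5; vectors whose angle falls exactly on a 45° diagonal mark the critical cells.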

To visualize the number of changes in the preferred choice as α is perturbed over the range 0 to 3, figure 4.6 has been generated, showing the dynamics of the overall system. Note the lull in the system between α = 0.5 and 1.0: any α value within this range, given a fixed equation, will result in the same system dynamics regardless of the positions of the agents, unless agents do not have access to the preferred choice due to adjacent agents. Small perturbations in the system (i.e. changes of 1) might have no effect depending on the initial positions of the agents; increasing the number of agents thus also increases the chance of the perturbation being found along some agent's path, as more agent positions are sampled. A system with only one agent might never see any changes, and thus the accuracy of the final solution might be off by a fair margin.

4.2.3 Method

To test the performance of the evolution, 50 separate evolutions were run for different numbers of agents, up to the maximum number of agents possible. The number of generations it took before an acceptable solution was found was recorded. An acceptable solution is a solution where the datasets match, regardless of the equation found. To reduce the time the evolution runs


Figure 4.5: Three panels showing the utility vector for each cell (black) and the discrete choice vector (green). The white star shows the position of the goal. Shown are α = 0, 1.33 and 3 for (a), (b) and (c) respectively.


Figure 4.6: The number of changes in the chosen directions in the field shown in fig. 4.5, with α ∈ [0, 3] in steps of 0.01; each discrete change in direction is one unit. (a) The cumulative number of changes in the chosen direction from α = 0.00 onwards; (b) the number of changes in the chosen direction.

could take, the number of generations was limited to 200 per run. Another measure to reduce the run time of the simulation was to keep the width and height of the agent-based system small; both were set to 15.

The primitives for the evolutionary process were multiply, add, subtract and a modified version of divide; the divide primitive was modified so that a division by zero returns 0. The final primitive was the ephemeral constant, defined as a random number between −1 and 1, rounded to two decimal places. The rounding reduces the number of possible primitive nodes, while two decimal places still provide enough accuracy. An important factor in this experiment was that the problem be hard enough not to be solved in the initial population, and thus an α of 1.3333333333 was chosen. With this value, the solution depends neither solely on the random number generator for constants (ephemeral constants) nor on the ordering of the nodes of the equation, as it can be approached both as d + 1.33h (where 1.33 is an ephemeral constant) and as 3d + 4h (a simple linear combination of nodes without the need for a constant). Another consequence of taking α = 1.3333333333 is that, due to the limited ephemeral constant, an exact solution is impossible. This is not a problem, as the program will try to find the closest approximation, and the solutions can still reach a perfect score because the decisions do not change beyond two decimals.
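The primitive set can be sketched as follows (names are illustrative; in the framework these would be registered with a DEAP-style primitive set):

```python
import operator
import random

def protected_div(a, b):
    """Divide primitive modified so that division by zero returns 0."""
    return a / b if b != 0 else 0

def ephemeral():
    """Ephemeral constant: random in [-1, 1], rounded to two decimal
    places to limit the number of distinct constant nodes."""
    return round(random.uniform(-1.0, 1.0), 2)

# The four function primitives used for the evolved utility equations.
PRIMITIVES = [operator.mul, operator.add, operator.sub, protected_div]
```

Because `ephemeral` can never yield 1.5 (or 4/3) directly, ratios outside [−1, 1] must be built from multiple terms, which is exactly what makes α = 1.3333333333 a harder target.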

To get a clear picture of the difficulty for every number of agents, the average number of options per choice made was calculated in the original simulations of the agent-based system. This is calculated by taking, for each tick in the simulation, the number of possible choices for all agents and taking the grand average. This metric should give a better indication of the difficulty of the problem as a function of the number of agents: the more the evolutionary process can observe of the choice set, and the more distinct the choices are, the better its performance.
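The metric reduces to a single grand average (a sketch; the input layout is an assumption):

```python
def average_options_per_choice(ticks):
    """Difficulty metric: the grand average of the number of options
    available per agent choice over the whole simulation. `ticks` is a
    list of per-tick lists with one option count per agent."""
    counts = [n for tick in ticks for n in tick]
    return sum(counts) / len(counts) if counts else 0.0
```

A value near 4 means agents usually had the full Von Neumann choice set available; values near 1 indicate heavily constrained (overcrowded) agents.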

The fitness was calculated by comparing the field per time frame generated by evolution to the dataset (eq. 4.4). Noteworthy is that no agent was identified; the fitness calculation thus used pure pattern comparison.

4.2.4 Results

The resulting data of the 200 runs for different numbers of agents is shown in fig. 4.7. The graph clearly shows that the problem is solved quickly (<10 generations) when the number of agents is below 10, but the difficulty quickly increases. From 25 agents onward, every setting has at least one capped run (200 generations). Unfortunately the number of capped runs is obscured by the nature of the plot, as the markers are plotted over each other.

Figure 4.8 clearly shows a non-linear relation between the subset of choices observed each step and the number of agents in the model. The MSB metric is always calculated from the original dataset as an average over multiple runs with differing initial positions. Figure 4.9 gives a clearer picture of the number of capped runs given the reduction in the average number of options per choice made per step: it shows the percentage of capped simulations for the average number of options per choice made in the simulation. This figure hints at the difficulty of finding either a good or a bad solution. Figure 4.10 shows bar plots of the data binned in steps of 0.5 of the average number of choices. Following the medians, the graph shows a curve similar to a right-tailed bell curve and tells a similar story to figure 4.9.

For the found solutions (i.e. fitness of 0), we also have the MSB measure of the actual correctness compared to the ground truth. Figure 4.11a shows the minimum, maximum and median MSB of the found solutions. Figure 4.11b shows the same solution accuracy against the average number of choices observed per step, a measure of difficulty. In addition, figures 4.12 and 4.13 show how both the equation (in MSB) and the path differ: the red curve shows the MSB while the other lines show the path difference, both in relation to the ground truth of α = 1.3333333333. Figure 4.12 shows the path difference as a metric of how much of the path has changed. This is comparable to what the evolutionary system sees as fitness, except that in this instance each agent's path is known.

4.2.5 Discussion

In this chapter, the example of the discrete hill-climb model using single and multiple agents was given. In the previous section, the stability of the system was discussed, and it was stated that using multiple agents should increase the number of constraints and therefore increase the solution quality (MSB) and the convergence time (in generations).

Figure 4.7: The number of generations until a solution was found for different numbers of agents in the model. The boxplot is binned on the number of agents with a bin size of 20 and centered on the X axis. Note that all runs are capped at 200 generations.

Figure 4.7 shows that the number of generations required to find a solution drastically increases from having a single agent. The median, however, stabilizes from 50 agents onwards while fluctuating heavily, showing no trend at all. Figure 4.9 indicates that the cap of 200 generations is often reached with higher densities of agents. Both figures indicate that going from one agent to 50 agents has a huge impact on the number of generations required, but high densities of agents show no discernible pattern. The increase in the number of generations leading up to 50 agents is most likely due to the number of constraints, as more of the world in the agent-based system is sampled and more options are observed. That increasing beyond 50 agents leads to no incremental trend might be because any additional data does not add to the number of constraints, as enough of the world and the options are already sampled. This is similar to how specifying more than two points on a straight line, while knowing the property of the line being straight, does not further reduce the possible number of lines (i.e. one line) going through the points.
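The straight-line analogy can be made concrete with a minimal sketch (NumPy assumed; the points and coefficients are illustrative): fitting a first-degree polynomial to two points or to ten collinear points on the same line recovers identical coefficients, so the extra points add no new constraints.

```python
import numpy as np

# Fit a line through exactly two points on y = 2x + 1.
two = np.polyfit([0.0, 1.0], [1.0, 3.0], deg=1)

# Fit a line through ten collinear points on the same line.
xs = np.arange(10.0)
ten = np.polyfit(xs, 2.0 * xs + 1.0, deg=1)

# Both fits recover the same coefficients: once the line is determined,
# additional collinear samples constrain nothing new.
print(two, ten)  # both approximately [2.0, 1.0]
```

The same saturation appears in the experiments above: beyond roughly 50 agents, additional observations sample regions of the model that are already constrained.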

Going up to the maximum number of agents in the system, the number of options per agent is reduced to one or zero options per step. In this case, the system is not being sampled: the observed options are chosen only because they are the sole option and that step must be taken. The rule of first come, first served also applies here: if the first agent makes a step, even a backward one, while the other agents could have made progress by stepping into that one free square, the first agent will still move. This high-density data, when the density nears 1.0 (e.g. every cell occupied except for the goal), does not add to the constraints. Only in the latter part of the simulation, where the agents disappear one by one from the system via the goal, might there be enough options per agent to add to the constraints of the system again. Overall, this results in a decrease in the number of generations, used here as a metric for the number of constraints.

Figure 4.8: Showing the relation between agents and the average number of choices observed, this figure displays the density-freedom relation of the agents: the more agents there are, the higher the chance that an agent has fewer than its four cardinal directions to choose from.

Figure 4.8 shows the reduction in the average number of options per choice made for an increasing number of agents. It was initially hypothesized that a higher agent density, and thus a lower average number of observed options, leads to a harder-to-solve problem due to the effect of phantom choices and the reduction in the observed subset of choices per step. A higher agent density also increases the constraints on the models, as they have to be fitted against more significant data points.
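The "average number of options per choice" metric of figure 4.8 can be sketched as follows; the function name and the grid representation (a set of occupied cells) are assumptions for illustration. Each agent's options are its free cardinal neighbours within the grid bounds.

```python
def avg_options(positions, width, height):
    """Average number of free cardinal moves per agent on a bounded grid.

    positions: set of (x, y) cells currently occupied by agents.
    """
    cardinal = ((1, 0), (-1, 0), (0, 1), (0, -1))
    total = 0
    for (x, y) in positions:
        for dx, dy in cardinal:
            nx, ny = x + dx, y + dy
            # A direction counts as an option only if it stays on the
            # grid and the target cell is not occupied by another agent.
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in positions:
                total += 1
    return total / len(positions) if positions else 0.0

# A lone agent in the centre of a 3x3 grid sees all four options.
print(avg_options({(1, 1)}, 3, 3))  # 4.0
```

Averaging this quantity over every step of a run yields the x-axis of figures 4.9 and 4.10: crowding and world borders both push the value below 4.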

Figures 4.9 and 4.10 show the effects of the average number of options available to the agents on the number of generations before a solution was found. While all configurations could sometimes find a solution within 10 generations, the median for most lies above 50 generations. The simulations with the highest average number of observed options are also the easiest to solve. This effect is mainly due to the low number of agents, as the effect there is quite extreme (fig. 4.7). At the other end of the spectrum, where the number of options for the agents is low, the number of generations required is also lower. This is the second effect that causes a reduced number of generations: overcrowding. The optimum seems to lie around 2.0, where the solution is the hardest to find, with a median of ~160 generations.

Figure 4.9: This graph shows the percentage of evolutionary runs that were terminated after reaching the cap of 200 generations and thus did not find a satisfying solution, plotted against the average number of options per choice made, where 4, at the rightmost part of the axis, marks the case where the agent(s) always have all four possible choices to choose from.

The hypothesized effect was that the increasing number of agent-agent interactions, as well as the higher data density, should result in better-constrained models. However, figures 4.3 and 4.11b show that, while the best solutions indeed become better with more agents, the median stays roughly equal. This effect could be attributed to various factors. The first issue might lie in the definition of the base model, the hill-climb model, and its world. The world used is small and might not be enough to thoroughly weed out the bad models. The second issue is that agent position tracking, as done in this and all previous chapters, is not a suitable dataset. As discussed earlier, the paths are often insensitive to changes and do not constitute a well-defined fitness landscape. While the second issue and the first issue are closely related, as a better or more complex world should result in better sensitivity to perturbations, these are two complex problems often found in the real world. Every real-world phenomenon can be captured in multiple ways. For this framework, the best dataset is the dataset with the greatest variety of metrics that are the most sensitive to changes in the system.


Figure 4.10: The number of generations binned, plotted against the average number of options per choice made, as a box-plot showing the median, the upper and lower quartiles, the maxima and minima, and the outliers. The whiskers show 1.5 IQR.

A sensitivity analysis was performed to see the potential issues for this system. The red curve is the actual difference in the equation's output domain, and the other lines are the path difference per agent per tick. When the path differences are low, the system is insensitive. The same holds for the MSB curve: when the curve is flat, differences are less noticeable. The MSB is calculated using the fixed equation y = δ_dgoal + x · δ_h, where x is varied.

In both figures, a transition can be observed when passing 25 agents, or a density of about 0.11 (25/225). The transition between the low and high agent-density conditions is characterized by the slope of the observed difference in the model. When the agent density is lower than 11%, the observable difference is much smaller for every change in alpha than at the higher densities above 11%. The higher densities, however, differ from the transition state of exactly 11% density (25 agents): the observed difference drastically increases but then fluctuates around a fixed difference value, whereas the transition state shows a more gradual slope. One of the underlying reasons for this early transition is the explored space. When the agent density is higher, more of the model space is explored, ultimately making the model more sensitive to perturbations. This explanation is supported by the earlier graphs, which show a sharp increase in the number of generations required when the average number of choices observed decreases from 3.0 to 2.5, which is 25 to 50 agents, indicating a higher number of constraints (Figures 4.8 and 4.10). The higher number of constraints is indicative of a higher sensitivity of the model, as the equation is restricted by more observations.

(a) This graph shows the median, maximum and minimum accuracy against the number of agents in the simulation. The MSB is defined in eq. 4.5.

(b) This graph shows the median, maximum and minimum accuracy against the average number of observed choices. The MSB is defined in eq. 4.5.

Figure 4.11: This figure shows how the agent density affects the quality of the solutions. The agent density is shown in two metrics: the average number of observed options (or choices) and the number of agents.

Figure 4.12: The difference in MSB and path plotted for different values of alpha compared to alpha 1.3333333333 (ground truth). The difference in path is the average number of changes per agent per tick compared to the ground truth. The red curve is the MSB.


Figure 4.13: The difference in MSB and path plotted for different values of alpha compared to alpha 1.3333333333 (ground truth). The difference in path is the average distance per agent per tick, where the distance is the Euclidean distance between the agent positions per tick under the ground truth and under the other alpha value. The red curve is the MSB.
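The per-tick path-distance metric of figure 4.13 can be sketched as a small helper; the function name and the trajectory layout (arrays of shape (ticks, agents, 2)) are assumptions for illustration.

```python
import numpy as np

def avg_path_distance(truth, other):
    """Average Euclidean distance per agent per tick between two runs.

    truth, other: arrays of shape (ticks, agents, 2) holding agent
    positions for the ground-truth alpha and an alternative alpha.
    """
    truth = np.asarray(truth, dtype=float)
    other = np.asarray(other, dtype=float)
    # Euclidean distance between matching agent positions at every tick.
    per_position = np.linalg.norm(truth - other, axis=-1)
    return float(per_position.mean())

# Shifting every position by (3, 4) gives a uniform distance of 5.
print(avg_path_distance(np.zeros((5, 3, 2)), np.zeros((5, 3, 2)) + [3.0, 4.0]))  # 5.0
```

A flat response of this measure to changes in alpha is exactly the insensitivity discussed above: perturbed models produce nearly identical paths.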

A flat difference curve indicates that the model is not well defined, as multiple solutions will fit because the differences between the correct and incorrect solutions are not significant enough.

In conclusion, the most important factor in the convergence of accurate models is the dataset. Care should be taken when measuring phenomena or selecting data. The model to be fitted should be analysed in advance for its sensitivity to ensure a fluid and continuous fitness landscape. Finally, constraints should be added to the evolutionary process to prevent bogus models and to keep the equations within the domain used by the agent-based system.
