
Improving the Quality of Grammars for Procedural Level Generation

A Software Evolution Perspective

Quinten Heijn

samuel.heijn@gmail.com

August 30, 2018, 46 pages

Supervisor: Riemer van Rozen Second Reader: Rafael Bidarra

Host organisation: Ludomotion, Joris Dormans

This document should be read in colour.

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
1 Introduction
  1.1 Problem Statement
    1.1.1 Research Questions
    1.1.2 Contributions
    1.1.3 Case Study
  1.2 Related Work
    1.2.1 Earlier Work
    1.2.2 Metrics for Analyzing Content Generators
    1.2.3 Languages for Generative Grammars
    1.2.4 Answer Set Programming
  1.3 Outline
2 Background
  2.1 Software Evolution
  2.2 Metrics
  2.3 Grammars
3 Ludoscope Lite
  3.1 Ludoscope
  3.2 Important Design Decisions
    3.2.1 Using the DSL of Ludoscope
    3.2.2 Using Rascal
    3.2.3 Test Driven Development
    3.2.4 Focus on Tile Maps
  3.3 Design Overview
  3.4 Running Example
    3.4.1 The Pipeline
    3.4.2 Repairing Problematic Results
    3.4.3 Writing Generative Grammars
4 Metric of Added Detail
  4.1 Calculating MAD Scores
  4.2 Implementation of MAD in Rascal
  4.3 Extracting the Detail Hierarchy
  4.4 Detail Hierarchy for the Boulder Dash Example
5 Specification Analysis Reporting
  5.1 Defining Level Properties
  5.2 Quality Assurance with SAnR
  5.3 Root Cause Analysis with SAnR
  5.4 Returning to the Example
6 Case Study: Boulder Dash
  6.1 Boulder Dash
  6.2 The Pipeline
  6.3 From Constraints to Properties
    6.3.1 Game Mechanics Constraints
    6.3.2 Design Constraints
  6.4 Design Iterations
    6.4.1 Iteration 1
    6.4.2 Iteration 2
    6.4.3 Iteration 3
    6.4.4 Iteration 4
  6.5 Results
7 Discussion
  7.1 Metric of Added Detail
  7.2 Specification Analysis Reporting
  7.3 Threats to Validity
8 Conclusion


Abstract

Grammar-based level generation increases the productivity of game designers, but comes at the cost of quality assurance. Currently, there is a lack of tools and techniques for debugging and testing generative grammars. This research explores how software evolution techniques can be used to improve the quality of grammar-based content generators for tile maps.

Two solutions are proposed. The first is the Metric of Added Detail (MAD), which shows whether rewrite rules add or remove detail. This is based on the notion that problems can occur when rules accidentally remove detail. The second solution is Specification Analysis Reporting (SAnR), which allows level designers to define level properties. The properties are used both for quality assurance and for debugging. Both solutions are implemented in a level generator called Ludoscope Lite (LL).

MAD and SAnR are evaluated with a case study on the video game Boulder Dash. With the case study we demonstrate that SAnR was able to express realistic level properties and was vital in improving the quality of the generative grammar. We also find possible improvements for both MAD and LL. We conclude that MAD and SAnR are promising first steps towards answering the need for tools that improve the quality of procedural level generators.


Chapter 1

Introduction

The rising cost and size of games are an important issue in game development [Kos18]. Procedural content generation (PCG) plays an important role in solving this problem. Generating content with algorithms is generally cheaper and faster than hiring artists and game designers [STN16]. However, when PCG is used to generate content while the game is being played, assuring quality becomes an issue. This research addresses this challenge from the perspective of software evolution. The solutions proposed in this research are based on two techniques from software evolution: the use of metrics and the use of source code analysis. The research will focus on grammar-based content generation for tile maps.

Grammar-based content generation was originally adopted from biology [KS+97], where it was used to simulate the growth of plants. Since then, the technique has been built upon and is now used for a wide range of content [DDRT11, ALY15, MBMT15, Dor10].

The research was performed in collaboration with Ludomotion. This company has created a state-of-the-art tool for writing and executing generative grammars for video games, called Ludoscope. The two solutions that are presented in this research are implemented and tested using a lightweight version of Ludoscope, called Ludoscope Lite (LL).

1.1

Problem Statement

There is a lack of tools and techniques for debugging and testing generative grammars. Several factors complicate their evaluation.

• Testing is complicated by the combinatorial explosion of possibilities. For most generators it is simply not possible to test all possible outcomes.

• Predicting how each grammar rule impacts the overall level quality is difficult. Currently the only way to find out which rules are responsible for creating problematic results is by examining the generation process step by step. Furthermore, even when the developer finds the problematic rules and replaces them, there is no way to check that the problematic content can no longer be generated.

• Generative grammars are often under-specified. When a designer writes a grammar, he or she tries to capture a set of design constraints in the grammar rules. However, since the rules only capture the constraints implicitly, it can be difficult to tell whether the constraints are actually expressed by the rules. In practice this leads to generated levels that are bad with respect to the design constraints.


1.1.1

Research Questions

This research addresses the following research questions:

Q1 How can software evolution techniques be used to iteratively improve the quality of grammar-based tile map generators?

sq1 How can metrics be used to evaluate grammar-based tile map generators?

sq2 How can origin tracking be used for root cause analysis of quality issues in grammar-based tile map generators?

1.1.2

Contributions

The contributions of this research consist of three solutions:

1. Ludoscope Lite (LL): an implementation of the domain specific language (DSL) used in Ludoscope. Like other DSLs, this programming language offers appropriate notations, abstractions and expressive power focused on grammar-based content generation [VDKV00]. Ludoscope was used in the development of the game Unexplored and is still used in game development today. The DSL makes use of advanced concepts for generative grammars that increase both the expressiveness and the maintainability of these grammars. However, the original implementation of this DSL was not designed for research. The implementation created for this research was designed to make it easier to prototype new tools. This was mainly done by reducing the number of features that were implemented and by focusing on tile maps.

2. Metric of Added Detail (MAD): a metric for generative grammars. MAD shows if a rewrite rule adds or removes detail when applied. This is based on the notion that problems may occur when rules accidentally remove detail. Because the huge possibility space cannot be covered by tests, an interesting approach is static code analysis, which analyzes the code without running it. Any problems that can be found before testing can save a lot of resources in development.

3. Specification Analysis Reporting (SAnR): a technique that uses a DSL to define the properties that the outcome should satisfy. The designers' intent is only implicitly defined in the rewrite rules. SAnR allows designers to make these intentions explicit, by declaring them as level properties. This enables quality assurance, since levels that do not satisfy the properties can be filtered out. SAnR also supports dynamic code analysis, by finding the source code responsible for problems in generated levels.

1.1.3

Case Study

The solutions are evaluated with a case study on generative grammars for the video game Boulder Dash. We try to improve the grammar through an iterative design process, where changes are made based on the feedback provided by MAD and SAnR. The case study is used to answer the following questions:

• Can the solutions help with implementing realistic design goals?
• What are inherent shortcomings of the solutions?

• What are possible ways to improve the solutions?

In our limited assessment, we find that SAnR is able to express realistic level properties and is helpful for improving the quality of the generative grammar. We also find possible improvements for both MAD and LL.


1.2

Related Work

1.2.1

Earlier Work

This thesis elaborates on "Measuring Quality of Grammars for Procedural Level Generation", a paper published at the PCG workshop of the Foundations of Digital Games Conference 2018. The paper gives an overview of both MAD and SAnR and argues how they could help the designers of generative grammars [vRH18]. This report supplements the paper with:

• A description of the Ludoscope DSL and its implementation in LL.

• A description of the technical details of the implementation of both MAD and SAnR.
• An evaluation of MAD and SAnR with a case study.

1.2.2

Metrics for Analyzing Content Generators

Metrics have been proposed to help evaluate content generators. Summerville et al. compare several metrics for evaluating the quality of platform game levels on their ability to capture difficulty, visual aesthetics and enjoyment [SMnS+17]. Smith et al. propose a method for analyzing the expressive range of procedural level generators, focusing on the variety of generated levels and the impact of changing input parameters [SW10].

MAD differs, since it is not applied to the generated content, but directly to the source code. An obvious advantage is that no content has to be generated, which saves resources. It also relates any problems directly to their source, making targeted improvements easier.

1.2.3

Languages for Generative Grammars

Ludoscope is a model-driven content generation system for level design. It is built around the idea that various aspects of a game's content can be represented by models that can be transformed with the use of rewrite rules. A wide variety of data types is supported, including strings, tile maps, graphs, shapes and Voronoi diagrams. These data types can be used for many content types, like missions, stories, names, terrains, encounters, etc. The DSL that is used to describe the generative grammars is called Phantom Grammar. Chapter 3 will describe this DSL in more detail as we discuss its implementation in LL.

Apart from Ludoscope, there are other tools for writing generative grammars with varying purposes. Tracery is an author-focused generative text tool that uses the grammar-based approach. It has been used for generating names, descriptions and stories in poetry, art, Twitter bots and games [CKM15]. Genr8 is a design tool for architects that uses a grammar-based approach for generating surfaces [HO04]. PuzzleScript is a language that uses rewrite rules to define mechanics for puzzle games [Lav15]. While this research focuses on tile maps, the notions and techniques behind MAD and SAnR could also be adapted for improving these tools.

1.2.4

Answer Set Programming

Answer set programming (ASP) is an approach to logic programming, where constraints and logical relations are declared in a Prolog-like language. Levels and mazes can be generated by describing the design spaces explicitly [SM11, STN16]. Van der Linden et al. did a comparative study on different PCG methods, including both the grammar-based approach and this constraint-based approach. A key observation is that generative grammars use a vocabulary that is intuitive to game designers, while designers have had difficulty translating gameplay concepts (like pacing or difficulty) into constraints [vdLLB14].

ASP is related to SAnR, because in both cases the level constraints are declared explicitly. With ASP, the constraints are used to generate the levels. SAnR uses constraints only as a filter, which enables quality assurance and enhances debugging. From this perspective SAnR allows level designers to enjoy the authoring benefits of generative grammars, as well as a degree of control over quality as provided by ASP.


1.3

Outline

This report starts with a background on the topics addressed in this research. This is followed by three chapters on the three different parts of the solution. Chapter 3 explains the syntax of the Ludoscope DSL and the design of Ludoscope Lite, which implements this DSL. The chapter ends with the introduction of a simple generative grammar, which is used in Chapter 4 and Chapter 5 to demonstrate the solutions addressed in those chapters. Chapter 4 discusses the Metric of Added Detail and Chapter 5 discusses Specification Analysis Reporting. In Chapter 6 the solutions are evaluated with a case study on generative grammars for Boulder Dash.


Chapter 2

Background

2.1

Software Evolution

The research field of software evolution concerns itself with both the normative question of how the quality of software changes as the software evolves over time and the prescriptive question of how to modify evolving software to conform to changing requirements. An important challenge in software evolution is the lack of empirical research. Obtaining statistically significant results requires access to software systems over long time spans. This is not always easy in an industrial setting [MWD+05].

This challenge also applies to this research, as there is no data available for a statistical analysis of the effectiveness of the proposed solutions.

2.2

Metrics

Software metrics are measures that quantify software qualities. In practice metrics are not used to locate problems, but as markers of pieces of code that developers should keep an eye on. For example, while duplicating code is sometimes the best solution, it is often bad practice when not used with care [KG06]. Keeping track of how metrics develop over time can give insight into how the quality of the software changes.

Common metrics are Lines of Code (LOC), which measures the volume of code, and Cyclomatic Complexity (CC), which measures the number of branch points in the control flow. The original goal of CC was to count the number of distinct paths through the code, providing an indication of the number of test cases needed. The meaning of the numbers provided by these metrics depends on the language to which they are applied. In both cases it is unclear whether they are suited for analyzing generative grammars.

Heitlager et al. describe four requirements for metrics for the SIG maintainability model [HKV07]:

1. Metrics should be technology independent, so they can be applied to systems that use various kinds of languages and architectures.

2. Metrics should have a straightforward definition, so they are easy to compute.

3. Metrics should be easy to explain and understand, so they facilitate communication between various stakeholders in the system.

4. Metrics should enable root cause analysis, relating source code properties to system qualities.

2.3

Grammars

Formal grammars were originally introduced as a way to describe language [Cho56]. Later they were adopted in biology to model the growth of algae and plants [Lin68]. A formal grammar consists of a set of rewrite rules. Each rule consists of a left hand side and a right hand side. A rule is applied to a string by replacing a symbol or sequence of symbols that matches its left hand side with its right hand side.

There are several important variations in how grammars are used. Grammars can be used for both parsing and generating content. Parsing is the process of analyzing whether a sequence of symbols conforms to the rules of the grammar. With generative grammars, the desired sequence of symbols is produced by the rules of the grammar.

While grammars were originally applied to strings, they are not restricted to this type of representation. Since then, grammars have been applied to numerous data types, including tile maps, graphs and shapes. The type of data that is rewritten can impose constraints. For example, the rewrite rules of grammars for tile maps cannot change the size of the content. This means that the left and right hand side of a rule always have the same dimensions.

Grammars can be executed sequentially or in parallel. With sequential rewriting, a rule is applied as soon as its left hand side matches. Parallel rewriting first gathers all the symbols or sequences of symbols that match a left hand side; all the rewriting is then done at the same time.

Grammars can be deterministic or nondeterministic. In the case of deterministic grammars, it is unambiguous how a sequence of symbols is rewritten: there is always only one rule that applies to each symbol or sequence of symbols, and each rule has only one right hand side. In nondeterministic grammars, multiple rules can apply to the same sequence of symbols and a rule can have multiple right hand sides. Which rule or right hand side is applied is chosen at random, so multiple outcomes are possible every time the grammar is applied.
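To make this concrete, consider Lindenmayer's original algae grammar: a deterministic grammar with the parallel rules A -> AB and B -> A. A minimal Rascal sketch of one parallel rewrite step on a string (illustrative only, not part of LL):

module AlgaeSketch

import String;

// One parallel rewrite step: every symbol is replaced simultaneously.
// Symbols without a matching rule are copied unchanged.
str step(str s, map[str, str] rules)
  = ("" | it + (rules[s[i]] ? s[i]) | int i <- [0 .. size(s)]);

// step("A", ("A": "AB", "B": "A")) == "AB"
// Repeated application yields "ABA", "ABAAB", "ABAABABA", ...

Because there is exactly one rule per symbol and each rule has a single right hand side, every run produces the same sequence; a nondeterministic variant would pick among alternative right hand sides at random.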

For a more in-depth description of grammars and how they are used in games, readers are referred to Chapter 5 of "Procedural Content Generation in Games" [vdLLB14].


Chapter 3

Ludoscope Lite

In this chapter we describe both Ludoscope and its implementation in LL. LL was created to simplify the research and prototyping of new tools for generative grammars. Ludoscope was designed for the development of games, resulting in many features. LL implements a subset of these features. As many features as possible are excluded, while retaining the ability to create realistic levels and answer the research questions.

This chapter begins by describing the core concepts of Ludoscope. To the best of our knowledge, this informal definition is the first publicly available Ludoscope language definition.

3.1

Ludoscope

Table 3.1 compares the DSL with the visual notation that will be used in the rest of this report. There are five different file types in which this DSL is used. Table 3.2 lists which features are associated with each file type. The exact syntax is beyond the scope of this report, but can be looked up in the code of LL.

Description | Visual representation & code

Alphabets define sets of symbols that can be used in the transformation process. Each symbol has a name and some additional information about how it should be visually represented. The code snippet below defines two symbols. The first is called 'dirt', which is abbreviated as "D" (abbreviation="D") and is represented as a brown tile (fill=#996633) with a black outline (color=#000000). The second symbol is called 'boulder', which is abbreviated as "B" and is represented as a gray tile (fill=#B3B3B3), also with a black outline.

= dirt, = boulder

dirt(color=#000000, fill=#996633, abbreviation="D")
boulder(color=#000000, fill=#B3B3B3, abbreviation="B")

Expressions describe the content that will be transformed. Expressions start with the content type that is described. In this report all expressions are of the type "TILEMAP". In the case of tile maps this is followed by the dimensions and the symbol on each location. The locations are numbered from top left to bottom right. Expressions are used to describe the input, the output and the patterns in the rewrite rules.


Rewrite rules are the most fundamental concept in the language, since they transform the content. Every rule has one left hand expression and one or more right hand expressions. The left hand expression contains a pattern. If this pattern matches part of the content, the content gets replaced with one of the right hand expressions. Which of the right hand expressions is selected is chosen at random.

r1: or

rule: r1(width=1, height=2) = /* Left hand */
TILEMAP 1 2 0:dirt 1:dirt > /* Right hand */
{0 = TILEMAP 1 2 0:dirt 1:boulder} | {1 = TILEMAP 1 2 0:boulder 1:dirt}

Rule modifiers are syntactic sugar that automatically add transformed copies of the rewrite rule to the grammar. Three transformations are supported:

(1) horizontal mirroring (H)
(2) vertical mirroring (V)
(4) rotations (90°, 180° and 270°) (R)

Which transformations are applied is defined by the number that follows "gt=". This number is the sum of the indices of the chosen transformations, where the index is the number preceding each transformation in the list above: 0 means no transformations are applied and 7 means all transformations are applied. In the code snippet a vertically mirrored copy of the rule is added with "gt=2".

r1: (V)

rule: r1(width=1, height=2, gt=2) = TILEMAP 1 2 0:dirt 1:dirt >
{0 = TILEMAP 1 2 0:dirt 1:boulder}

Recipes are lists of instructions that specify how the rewrite rules are applied. This is useful because it gives the designer more control over the transformation process. The most important instructions are:

• iterateRule, which attempts to apply a certain rule a single time.
• executeRule, which attempts to apply a certain rule a specified number of times.
• splitTiles, which increases the size of a tile map by replacing each tile with multiple copies of itself.

r1: (1x)
r2: (5x)

IterateRule("r1")
ExecuteRule("r2", 5)


Modules encapsulate grammars¹ (sets of rules) and recipes. Modules enable developers to construct transformation pipelines where each module represents a separate concern. There are three separate situations with regard to the input of a module:

1. The module uses a predefined starting expression.
2. The module uses the output from another module as input.
3. The module uses the output from multiple modules as input. In this case there are merging functions that can be used to combine the expressions into a single expression.

If there is no recipe defined for the module (as is the case with "m1" in the code snippet), the rewrite rules are applied randomly.

Module m1 *a grammar*
Module m2 *some recipe*

module:
name: "m1"
alphabet: "AlphabetName"
type: Grammar
grammar: true

module:
name: "m2"
alphabet: "AlphabetName"
type: Recipe
inputs: "m1"
grammar: true
recipe: true

Registers can be used to store design parameters and make them available in the modules. In the code snippet three registers are set: "w" contains an integer, "requests" a list of strings and "setRequests" a boolean. This feature is not implemented in LL.

register: w 9
register: requests ["extraMonsters", "lessTime"]
register: setRequests false

Member values are used to annotate symbols. The member values can also be used in the left hand expressions of rewrite rules to check for certain values, greatly reducing the number of symbols that is required. In the code snippet a rewrite rule changes the annotation "open" from true to false. This feature is not implemented in LL.

rule: r1(height=1, width=1) =
TILEMAP 1 1 0:dirt[open==true] > {1 = TILEMAP 1 1 0:dirt[open=false]}

Table 3.1: Overview of the core concepts of Ludoscope.
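To illustrate the gt encoding from the rule modifiers row of Table 3.1: gt=3 combines horizontal (1) and vertical (2) mirroring, and gt=7 adds the rotations (4) as well. A small Rascal sketch of decoding such a value (a hypothetical helper, not part of Ludoscope or LL):

// Decode a gt value into its transformation flags (H=1, V=2, R=4).
tuple[bool h, bool v, bool r] decodeGt(int gt)
  = <gt % 2 == 1, (gt / 2) % 2 == 1, (gt / 4) % 2 == 1>;

// decodeGt(2) == <false, true, false>: vertical mirroring only, as in the snippet above.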

File type  Description                   Features
.lsp       Defines a Ludoscope project.  Modules, register and options.
.grm       Defines a grammar.            Starting expression, rewrite rules.
.xpr       Defines a model.              Model.
.alp       Defines an alphabet.          Symbols.
.rcp       Defines a recipe.             Instructions.

Table 3.2: Different file types used in Ludoscope.

¹This report will use "generative grammar" to refer to a pipeline of grammars, instead of the grammars used within the individual modules.


3.2

Important Design Decisions

3.2.1

Using the DSL of Ludoscope

The DSL of Ludoscope was reused instead of writing a new DSL. Because a number of Ludoscope's features are not implemented, it would have made sense to use a DSL that also does not define these features. It would also have been an opportunity to reevaluate some of the decisions made in the original DSL. However, there were a number of reasons to stick with the DSL of Ludoscope:

• The visual editor of Ludoscope can be used to edit the grammars.
• It is easier to reuse the solutions for the original implementation.

• Grammars can be run in both implementations, making it easier to compare the functionality of both implementations.

3.2.2

Using Rascal

LL is developed in the meta-programming language Rascal. Rascal is used because it has great support for parsing, data transformation and data analysis. It was developed to function as a language workbench [KVDSV09] and was effective in similar projects [VRD14].

3.2.3

Test Driven Development

The development of LL was test driven from the outset. This means that every implemented feature is supported by a test that checks if the functionality works as expected. A number of reasons support this decision:

• Changes and additions can be made more quickly, because it is easy to check if everything still works as expected.

• The implementation becomes more robust, making it more suitable for collaboration.

• The expected results are made explicit, which helps other researchers to understand and use the implementation.

3.2.4

Focus on Tile Maps

The Ludoscope DSL supports the following content types: strings, tile maps, graphs, shapes and Voronoi diagrams. In order to answer our research questions we focus the scope of this work on tile maps. Not only are tile maps used in a great number of games, but the rewrite rules that transform tile maps are also easy to understand. This is because they are easy to visualize and the size of the expression remains the same during transformations.

3.3

Design Overview

Figure 3.1 gives a visual overview of the design of LL, which consists of three components: parsing, execution and a graphical user interface (GUI). The GUI functions as an integrated development environment (IDE) where a user can write grammars. The rewrite rules are also visually displayed together with their MAD scores. The GUI uses Salix, a library for Elm-style web GUIs. Ludoscope projects can be executed and analyzed with SAnR from the GUI.

Both MAD and SAnR were designed as separate libraries. This was done to make the solutions more general than the implementation of Ludoscope Lite.


Figure 3.1: Overview of Ludoscope Lite, showing the system components (parsing, execution, GUI) and the libraries they use (MAD, SAnR, Salix).

3.4

Running Example

A simple example will be used to illustrate how these generative grammars work. This example is also used to explain MAD (in Section 4.4) and SAnR (in Section 5.4).

The example is based on the same game that will be used for the evaluation of both solutions: Boulder Dash. Chapter 6 will discuss this game in more detail. For now it is important to know that Boulder Dash is a top-down game where players dig their way through dirt to collect diamonds in order to complete the level.

3.4.1

The Pipeline

As stated in Table 3.1, a generative grammar consists of a pipeline of modules. Figure 3.2 shows the pipeline for this example. Both the input and output of this pipeline are a 6x6 tile map that represents a very simple level for Boulder Dash, as shown in Figure 3.3. The input is altered by two modules:

Module m1: add start and end
r1: (1x)
r2: (1x)
(a) Adding an entrance (r1) and an exit (r2) to the level

Module m2: add boulders and diamond
r3: (3x)
r4: (1x)
(b) Adding three boulders (r3) and a diamond (r4)

=entrance, =exit, =wall, =dirt, =boulder, =diamond

Figure 3.2: Level transformation pipeline consisting of two modules


(a) Input: dirt surrounded by steel walls

(b) Output: level with content

Figure 3.3: Tile maps that are input and output of the pipeline

Module 1 is responsible for adding an entrance and exit to the level. This is done with two rules:

Rule 1 replaces one of the walls at the north side of the map with a tile that indicates where the player will enter the level.

Rule 2 replaces one of the walls at the east side of the map with a tile that indicates where the player can exit the level after he/she has collected the diamond.

Module 2 adds both the goal and the challenge to the level. This is done with two rules:

Rule 3 replaces three random dirt tiles with boulders.

Rule 4 replaces one random dirt tile with a diamond.

3.4.2

Repairing Problematic Results

Some of the outputs of our example pipeline can be considered incorrect. For example, the output in Figure 3.3b is problematic, because the player cannot dig through boulders. This means that this level is impossible to finish.

Module m3a: remove obstacles
r5: (1x)
r6: (1x)
(a) Remove boulders that block the start (r5) or end (r6)

Module m3b: move obstacles
r7: (M,1x)
r8: (M,1x)
(b) Move boulders that block the start (r7) or end (r8)

Figure 3.4: Two attempts to fix the design of the pipeline

To prevent these outputs we can add a module that replaces problematic patterns. In Figure 3.4 we can find two possible additions to the pipeline:

Module 3a solves the problem by removing any boulders that block the entrance or exit. Figure 3.5a is the result of this transformation. While this ensures that there are no boulders blocking the path, it also influences the difficulty of the level. By removing boulders, the player has an easier time maneuvering past the boulders without getting crushed.

Module 3b preserves the number of boulders by simply moving the boulder away from the entrance or exit. Figure 3.5b is the result of this transformation. While this approach may be less naive than module 3a, it still does not completely solve our problems. Figure 3.6 displays two possible outputs from module 2 that cannot be fixed with module 3b.


(a) Module 3a removed a boulder at R
(b) Module 3b moved boulder M
Figure 3.5: Repairing the example level of Figure 3.3b in two different ways

(a) No space to move boulder 1 away from start A
(b) Boulder 1 has to be moved to spot A, or there will be no space to move boulder 3.
Figure 3.6: Levels that cannot be repaired by m3b

Neither of the solutions considers the entire path through the level. It is possible that the tile directly in front of the entrance or exit is accessible, while the path between the entrance and exit is still blocked. For example, in Figure 3.6b the path from the entrance to the rest of the map is blocked indirectly.

A third approach would be to make the path from the entrance to the exit explicit. By replacing a number of dirt tiles with 'path' tiles, we make sure that no boulders will be placed on the path from the entrance to the exit. At a later stage the 'path' tiles can be replaced with dirt or some other traversable tiles. However, the module needed to add this path to the map is too complex to discuss in this example. Figure 3.7 gives an indication of what this solution would look like.

Figure 3.7: Output of a pipeline using a path, with the new tile 'P' that cannot be replaced with boulders.

3.4.3

Writing Generative Grammars

Even though the levels from this example are a lot simpler than real Boulder Dash levels, we already encountered some problems that were not trivial. Not only are the problems for larger grammars a lot more complicated, they are also obscured by the size of the grammar and the number of different outcomes. In Section 4.4 and Section 5.4 we will analyze this example in more detail, to show how MAD and SAnR can help developers write generative grammars.


Chapter 4

Metric of Added Detail

As stated in Section 1.1, one of the problems with evaluating generative grammars is their large possibility space. Because of this, evaluating a generative grammar by analyzing the generated content costs a lot of resources and is almost never conclusive. One way to circumvent this problem is to analyze the grammar statically, i.e. without executing the rules, using metrics instead.

This chapter introduces a novel metric for grammars that generate tile maps, called the Metric of Added Detail (MAD). MAD is based on the notion that designers should be aware of whether a rule adds detail to the map or removes it, since in general details are gradually added. The hypothesis is that there are at least two types of possibly problematic rules that do not add detail:

• Patches: rules that fix an issue introduced by earlier rules, instead of fixing the earlier rules (for example Figure 3.4b).

• Bugs: rules that remove critical content by accident (for example Figure 3.4a).

Instead of predicting level quality directly, MAD could be used to detect these problematic rules. We will take a look at how we can measure the amount of detail that is added or removed by a rule. Evaluating the relation between problematic rules and the detail they add is not in the scope of this evaluation, but is part of future work.

4.1

Calculating MAD Scores

The MAD score of a rule is calculated by comparing each tile on the left hand side with the tile at the same location on the right hand side. This comparison is based on a predefined detail hierarchy. A detail hierarchy is a relation on grammar symbols that specifies how detailed each symbol is compared to the others. In Figure 4.1 we can see a simple example with just two symbols: (empty) and (filled). In this example, the detail hierarchy defines that filled tiles are more detailed than empty tiles.

MAD score +3
(a) Rule adding detail

MAD score -1
(b) Rule removing detail

MAD score -1 (+1-2)
(c) Rule removing detail

MAD score 0 (+1-1)
(d) Rule turning a shape 90°

Alphabet: = empty, = filled. Detail hierarchy: >

Figure 4.1: Grammar rules that add and remove details

There are three possible scores when comparing the tile on the left (l) with the tile on the right (r):

-1: if l has a higher position in the detail hierarchy than r.
0: if l has the same position in the detail hierarchy as r.
+1: if l has a lower position in the detail hierarchy than r.

The final MAD score is the sum of the scores for each tile. For example, in Figure 4.1a four tiles are compared: three are replaced with a tile that has a higher rank in the detail hierarchy and one remains unchanged. This adds up to a MAD score of +3. If a rule has multiple right hand sides, there is a separate score for each right hand side.

4.2

Implementation of MAD in Rascal

Figure 4.2 shows the concise implementation of MAD in Rascal. A few aspects of this implementation are worth discussing. Notably, it does not assume any implementation details of the language that is used to write the generative grammar.

1  module util::mad::Metric
2  alias Detail = rel[str greaterSymbol, str lesserSymbol];
3  alias Rule = lrel[str lhs, str rhs];
4  alias RuleScore = lrel[str lhs, str rhs, int score];
5
6  RuleScore getRuleScore(Rule r, Detail d)
7    = [<lhs, rhs, getTileScore(lhs, rhs, d)> | <lhs, rhs> <- r];
8
9  int getTileScore(str lhs, str rhs, Detail d) { // rewriting a tile
10   if (<lhs, rhs> in d) return -1;     // removes detail
11   else if (<rhs, lhs> in d) return 1; // adds detail
12   else return 0;                      // retains detail
13 }

Figure 4.2: Metric of Added Detail as a Rascal program. Code also available on GitHub (MAD-Level-Design).

The input for MAD is a rule and a detail hierarchy. The rule is abstracted to a list of 2-tuples¹ of the left and right hand tiles that are compared (Figure 4.2, lines 3-4). The detail hierarchy is represented by a set of 2-tuples of symbols, each indicating which of the two has the higher rank (Figure 4.2, line 2). Symbols are referenced with a string.

The output is a list of 3-tuples, where the first two values are the left and right hand tile and the third value the score (-1, 0 or 1) of those tiles. As stated in Section 4.1, the actual MAD score is the sum of the separate tile scores. However, in practice it is not always clear which replacement is responsible for the loss of detail, so the separate scores can be useful.
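To illustrate how these types fit together, the following usage sketch scores a rule like r5 from the running example against the detail hierarchy of Chapter 3; the module name, madScore and the concrete values are ours, not part of LL:

module util::mad::Example

import util::mad::Metric;
import List; // for sum()

// Detail hierarchy of the running example: a boulder is more detailed than dirt.
Detail d = {<"boulder", "dirt">};

// A rule like r5: one tile loses a boulder, the other keeps its dirt.
Rule r = [<"boulder", "dirt">, <"dirt", "dirt">];

// The final MAD score is the sum of the individual tile scores: here -1 (-1+0).
int madScore(Rule rr, Detail dd)
  = sum([s | <_, _, int s> <- getRuleScore(rr, dd)]);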

4.3

Extracting the Detail Hierarchy

In order to use MAD, a detail hierarchy is needed. While the detail hierarchy can be given as input by the developer, it is also possible to derive the hierarchy if the generative grammar uses a pipeline. Since modules in the pipeline already imply a natural hierarchy, we can group the symbols based on this hierarchy. We will use LL to describe how this hierarchy can be extracted, but the method only assumes the use of a pipeline.

In the case of LL the pipeline is declared implicitly, since only the inputs of each module are declared. The hierarchy of the modules is extracted with the following approach:

¹Rascal defines lists of tuples as list relations, with the keyword "lrel".


Algorithm 1: Extracting the module hierarchy.ᵃ
Data: Modules: a list of all modules
Result: Hierarchy: a list of sets that represents the module hierarchy

1 Hierarchy = [];
2 while Modules != [] do
    /* getReadyModules(hierarchy, modules) returns all the modules of which the input is already part of hierarchy. */
3   ReadyModules = getReadyModules(Hierarchy, Modules);
4   Hierarchy += toSet(ReadyModules);
5   Modules -= ReadyModules;
6 return Hierarchy;

ᵃ For the implementation see the LL repository on GitHub.
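A compact Rascal sketch of this loop, using a hypothetical module representation (LL's actual data types differ):

module HierarchySketch

// Hypothetical module representation: a name plus the names of its inputs.
data Mod = mod(str name, set[str] inputs);

// Algorithm 1: repeatedly collect the modules whose inputs are already
// part of the hierarchy, until no modules are left.
list[set[Mod]] moduleHierarchy(set[Mod] modules) {
  list[set[Mod]] hierarchy = [];
  set[str] placed = {};
  set[Mod] todo = modules;
  while (todo != {}) {
    set[Mod] ready = {m | Mod m <- todo, m.inputs <= placed};
    if (ready == {}) break; // guard against cyclic or missing inputs
    hierarchy += [ready];
    placed += {m.name | Mod m <- ready};
    todo -= ready;
  }
  return hierarchy;
}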

Once we have the module hierarchy, we can use it to extract the detail hierarchy by looking at where in the pipeline each symbol occurs for the first time in the right hand side of a rewrite rule:

Algorithm 2: Extracting the detail hierarchy.ᵃ
Data: ModuleHierarchy: a hierarchy of all the modules
Result: SymbolHierarchy: a hierarchy of all the symbols

   /* The symbols used in the tile maps that are the preset input of the first modules are considered the least detailed and are added first. */
1  StartingStates = ModuleHierarchy[0].startingStates;
   /* extractSymbols(tileMaps) returns a set of the symbols that are used in the tile maps. */
2  AllUsedSymbols = extractSymbols(StartingStates);
3  SymbolHierarchy = AllUsedSymbols;
   /* All the symbols used in each module group are extracted and added as a separate set. */
4  foreach ModuleGroup ∈ ModuleHierarchy do
5    NewSymbolGroup = {};
6    foreach Module ∈ ModuleGroup do
7      foreach Rule ∈ Module do
         /* Only the symbols that have not been added in earlier module groups are added. */
8        NewSymbolGroup += extractSymbols(Rule.RightHands) - AllUsedSymbols;
9    SymbolHierarchy += NewSymbolGroup;
10   AllUsedSymbols += NewSymbolGroup;
11 return SymbolHierarchy;

ᵃ For the implementation see the LL repository on GitHub.
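The same idea in Rascal, reusing the Mod type from the previous sketch and assuming a helper rhsSymbols that collects the symbols used in the right hand sides of a module's rules (all names hypothetical):

// Symbols grouped by the module group that first introduces them.
list[set[str]] detailHierarchy(list[set[Mod]] moduleHierarchy,
                               set[str] startSymbols,
                               set[str] (Mod) rhsSymbols) {
  list[set[str]] hierarchy = [startSymbols];
  set[str] seen = startSymbols;
  for (set[Mod] group <- moduleHierarchy) {
    // Only symbols not introduced by earlier groups are added.
    set[str] fresh = {*rhsSymbols(m) | Mod m <- group} - seen;
    hierarchy += [fresh];
    seen += fresh;
  }
  return hierarchy;
}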

While this method does not require the designer to declare the detail hierarchy explicitly, the result contains groups of symbols that are considered to be of the same detail. The relative detail of these symbols cannot be determined, because the order in which the rules within a module are executed is not always deterministic. The order in which modules are executed is not deterministic either when the pipeline branches.

There are situations where it is appropriate to have multiple symbols that share the same spot in the hierarchy, for example when variations of the same tile are placed. But when all these symbols are considered to be of the same detail, we can no longer detect rules that overwrite tiles that were introduced within the same module. There are multiple alternatives that can be used to make the extracted detail hierarchy more precise:

1. Allow users to improve the extracted hierarchy. In this case the extracted hierarchy could be used as a first draft. It is hard to estimate how much work this approach would save designers, compared to letting designers declare the entire hierarchy.


2. Derive it from an explicit rule ordering such as a Ludoscope recipe. A side effect of this approach is that it forces the designer to think about the order in which the rules should be executed. This approach does not solve the problem of modules that are executed in a nondeterministic order.

3. Assume that the addition of detail follows the order of the rules in the module. For this approach to be effective, the designer would have to declare the rules in the order in which he/she intends them to be executed. However, even when this is not the case, assuming such a hierarchy is still more effective than using the default hierarchy. This approach can also be used for modules that are executed in a nondeterministic order, by using the order in which they are declared.

4. Consider symbols with the same rank in the hierarchy as more detailed. While this is not very intuitive, it does raise a flag for rules that rewrite content that was written within the same module, without any extra effort. It is not clear if the MAD score would be correct when the order in which modules are executed is nondeterministic.

4.4

Detail Hierarchy for the Boulder Dash Example

We will end this chapter by returning to the example from Section 3.4 and taking a look at how MAD applies to the problems we encountered there. The first step is defining a detail hierarchy for the pipeline in Figure 3.2. Since the module hierarchy is declared, we can simply use Algorithm 2:

1. Add the symbols that are introduced by the starting map: { (dirt), (wall) }.
2. Add the symbols introduced by module 1: { (start), (end) }.
3. Add the symbols introduced by module 2: { (boulder), (diamond) }.
4. No symbols are introduced by module 3a or module 3b.

This results in the following detail hierarchy:

{ (boulder), (diamond) } > { (start), (end) } > { (dirt), (wall) }

With this hierarchy we can calculate the MAD scores of the individual rules. Rules 1-4 all have a score of +1; the rules from module 3a and module 3b, however, are more interesting. Let us start by looking at module 3a in Figure 4.3.

Module m3a: remove obstacles
r5: MAD score -1 (+0-1)
r6: MAD score -1 (-1+0)
Figure 4.3: MAD scores for r5 and r6

The rules in module 3a were added in an attempt to remove the obstacles that could block the entrance and exit of the level. However, by doing so, they remove detail from the map, resulting in a negative MAD score. In the beginning of this chapter we classified this type of rule as a bug. While it depends on the context whether these rules are actually problematic, it is important that a designer is aware that they could be.


Module m3b: move obstacles
r7: MAD score 0 (-1+1)
r8: MAD score 0 (+1-1)
Figure 4.4: MAD scores for r7 and r8

The rules in module 3b tried to solve the problem without removing any content. Instead they moved the obstacle by swapping it with the symbol of an adjacent tile. In Figure 4.4 we see that this results in a MAD score of 0. In the beginning of this chapter we classified this type of rule as a patch. While these rules are probably less problematic, preventing the use of patches is preferable. This is illustrated by the alternative approach from Figure 3.7, where the intent of the designer is made explicit. The rules used in that approach also have a positive MAD score, since replacing the dirt tiles with path tiles adds detail to the map.


Chapter 5

Specification Analysis Reporting

The solution presented in this chapter is a technique called Specification Analysis Reporting (SAnR) that can be used to define level properties and analyze level generation histories. As stated in Section 1.1, generative grammars are often under-specified, since they sometimes generate levels that are bad with respect to design constraints. This is related to the problem that testing cannot be automated without a specification of when a level is considered broken. SAnR addresses both problems. By defining design constraints as level properties, we can filter out any generated levels that do not satisfy these properties. The same specification can also be used for test automation: large quantities of levels can be analyzed, relating the broken properties to rewrite rules.

The details of how SAnR enables quality assurance and root cause analysis will be discussed in Section 5.2 and Section 5.3. We will first take a look at the syntax that is used to define level properties.

5.1

Defining Level Properties

SAnR properties consist of two parts: a tile set and the size of that tile set. A tile set is defined with the name of a symbol from the alphabet that is used in the grammar. Here it is important to note the difference between a tile and a symbol. A tile is a location on the map, which is occupied by a symbol. While the symbols are changed by the rewrite rules, the locations of the tiles remain the same. In principle, all the tiles with that symbol belong to the tile set. However, there are two ways to filter the tile set:

• "tileSet adjacent to symbolName" filters out all the tiles that are not adjacent to at least one tile occupied by a symbol with the name 'symbolName'. Two tiles are only adjacent when they share a side.

• "tileSet in ruleName" filters out all the tiles that have never been touched by a rule with the name 'ruleName'. Here, 'touched by' means something slightly different than 'changed by': a tile can be part of the left and right hand of a rewrite rule while the symbol that occupies the tile is not changed. In other words, 'touched by' is more inclusive than 'changed by'.

An important difference between the filters is that "adjacent to" only considers the current state of the map, while "in" considers the history of the generation process. The filters can also be used together.

The size of the tile set can be expressed in three ways:

• An exact size: "10x tileSet", which is true when the size of the tile set is exactly 10.

• An upper bound: "at most 10x tileSet", which is true when the size of the tile set is less than or equal to 10.

• A lower bound: "at least 10x tileSet", which is true when the size of the tile set is equal to or greater than 10.


The exact syntax of this language can be found in the SAnR repository.
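Putting these constructs together, a property specification for the running example of Section 3.4 could look as follows (a sketch; the exact concrete syntax is defined in the SAnR repository):

1x diamond
3x boulder
0x boulder adjacent to start
0x boulder adjacent to end
at least 1x dirt

The first four properties reappear in Table 5.1; the last one uses the lower bound form to require that some dirt is left for the player to dig through.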

5.2

Quality Assurance with SAnR

Using the level properties for quality assurance is straightforward. Any level that does not satisfy its properties is considered broken. When a broken level is generated, we simply generate a new one. Of course, this can have a significant impact on the efficiency of the generator, depending on the percentage of levels that are broken. Luckily, SAnR provides insight into how many levels are broken and how they could be fixed, as will be discussed in Section 5.3.

Generating levels that satisfy the properties could be made faster by incorporating SAnR in the generation process. Instead of analyzing the generation process once it has finished, the properties can be checked during generation. Once a property is broken, the generator could backtrack to the decision that broke the property and try an alternative option.
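A sketch of the basic generate-and-filter loop in Rascal (all names are hypothetical placeholders; LL's actual interfaces differ):

// Placeholder types and stubs for illustration.
alias Level = list[str];
alias Property = str;
bool satisfied(Level l, Property p) = true; // stub: SAnR property check
Level runPipeline() = [];                   // stub: execute the grammar

// Keep generating until a level satisfies all properties, up to maxTries.
Level generateValid(list[Property] props, int maxTries) {
  for (_ <- [0 .. maxTries]) {
    Level l = runPipeline();
    if (all(Property p <- props, satisfied(l, p))) return l;
  }
  throw "no valid level found in <maxTries> tries";
}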

5.3

Root Cause Analysis with SAnR

SAnR enables level designers to analyze the root cause of a problem. Designers author properties to define the problem. SAnR uses this definition to find bugs in generated levels and relates them to the rules in the generative grammar. This analysis is performed on the generation history of a level, which contains the information of every step in the generation process. Every unique combination of a broken property and the rule that broke it is considered a bug. These bugs are found with the following algorithm:

Algorithm 3: Collecting the bugs from a generation history. For the implementation see the LL repository on GitHub.
Data: History: a data structure that contains the information about each step in the generation.
Data: Properties: the list of SAnR level properties.
Result: Bugs: a list of 2-tuples of a property that was broken by a specific rule and the step in which it was broken.

1 Bugs = [];
  /* Store the last step in the generation process. */
2 LastStep = last(History);
3 foreach Property ∈ Properties do
    /* PropertyIsSatisfied(stepA, propertyB) only returns true when propertyB is satisfied in stepA. */
4   if !PropertyIsSatisfied(LastStep, Property) then
      /* For every broken property, FindStep is used to get the step in the generation process where the property was broken (Algorithm 4). */
5     Step = FindStep(History, Property);
6     Bugs += (Step, Property);
7 return Bugs;


Algorithm 4: Finding the step where the property was broken. For the implementation see the LL repository on GitHub.
Data: History: a data structure that contains the information about each step in the generation.
Data: Property: a broken SAnR level property.
Result: Step: a step from the generation history.

1 PreviousStep = first(History);
2 foreach Step ∈ History do
    /* Return the current step when the property was satisfied in the previous step, but is not in the current one. */
3   if PropertyIsSatisfied(PreviousStep, Property) && !PropertyIsSatisfied(Step, Property) then
4     return Step;
5   PreviousStep = Step;
  /* If the property was never satisfied, the first step is returned. */
6 return first(History);
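Algorithm 4 translates almost directly into Rascal (again with placeholder types; note the explicit update of the previous step):

// Placeholder: a step is the tile map after one rule application.
alias Step = list[str];
bool satisfiedAt(Step s, str property) = true; // stub: SAnR property check

// Return the first step where the property flips from satisfied to broken.
Step findStep(list[Step] history, str property) {
  Step prev = history[0];
  for (Step s <- history) {
    if (satisfiedAt(prev, property) && !satisfiedAt(s, property)) return s;
    prev = s;
  }
  return history[0]; // the property was never satisfied
}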

There are several edge cases when it comes to reporting bugs:

• When a property is never satisfied in the generation history, there is no rule to relate to the bug. In this case the initial step is used to represent this fact.

• Broken properties can be repaired during the generation process. A broken property is only considered a bug when it is still broken in the end result.

• Properties can break multiple times. SAnR will only notify the user of the first time the property was broken.

Figure 5.1 shows the interactive interface in LL that is used to display the SAnR reports. The interface allows the user to generate a report for any given number of generated levels. LL groups the same bugs over different levels. Users can select any of the generated levels where the bug was found, and LL will bring the user to the execution history of that level. As shown in Figure 5.2, all the steps in the generation history are displayed and the step associated with the bug is highlighted. This is a vital part of debugging with SAnR, since the designer needs insight into how the bug originated in order to fix it. Going through the steps of the execution history helps achieve this insight.


Figure 5.2: A view of the execution history in LL

5.4

Returning to the Example

We conclude this chapter by returning to the example introduced in Section 3.4 and taking a look at how SAnR applies to the problems we encountered in the analysis of that example. We begin with defining a list of level properties, which are displayed in the first column of Table 5.1.

Property                      Example  +m3a  +m3b  +path
1x diamond                    A        A     A     A
3x boulder                    A        S     A     A
0x boulder adjacent to start  S        A     S     A
0x boulder adjacent to end    S        A     S     A

Table 5.1: Expected results of how the different variations of the example satisfy the defined properties. S means the property is sometimes satisfied and A means that every generated level satisfies the property.

The properties "1x diamond" and "3x boulder" define the number of diamonds and boulders that should be present in every generated result. The properties "0x boulder adjacent to start" and "0x boulder adjacent to end" define that the entrance and exit of the level should not be blocked by boulders.

Table 5.1 also shows the expected results, based on the analysis of the different solutions and problems in Section 3.4.


Chapter 6

Case Study: Boulder Dash

In this chapter we discuss a case study on a generative grammar for Boulder Dash. The goal of this case study is to demonstrate how both MAD and SAnR improve the process of authoring generative grammars. This is done through an iterative design process, where changes are made based on insight provided by both tools. We find that SAnR was able to express realistic level properties and was vital in improving the quality of the generative grammar. We also find possible improvements for both MAD and LL.

Because the outcome of generative grammars is random, the exact outcome of this experiment cannot be replicated. However, all the different versions of the grammar used are available on GitHub (LudoscopeLite). Readers can use SAnR to analyze these grammars for similar results.

To provide some context to this case study, we will first take a look at Boulder Dash and the pipeline created to generate its levels. The interpretation of the symbols used in this chapter is given in Figure 6.1a.

6.1

Boulder Dash

Boulder Dash is a top-down game where a player digs through dirt to collect diamonds. The exit of the level only appears after the player has collected enough diamonds. The challenge is avoiding falling boulders that crush the player and avoiding computer-controlled enemies that chase the player. Readers are advised to play the game with one of the generated levels. Boulder Dash is well suited for this case study, because it has few mechanics and uses tile maps. These maps are also self-contained, i.e. all the information about a level can be shown with a single image. Figure 6.1 shows a Boulder Dash level from the original game.

6.2

The Pipeline

The generative grammar was written for this case study. The approach is similar to the one used in the game Spelunky [Yu16]. Instead of applying transformations to the entire map, the map is split up into different sections. A random template is assigned to each section. The templates consist of a number of tiles that are predefined and tiles that are determined by later transformations. With this approach, the designer can ensure structure while making the levels look different. Figure 6.1 displays how the map evolves through the different steps in the pipeline. The walls at the edge of the map are automatically added when the map is loaded in the game.


Figure 6.1: A complete Boulder Dash level

=big template corner, =medium template corner, =variable, =empty

=start, =brick, =dirt, =boulder, =diamond, =monster

(a) Legend of the used symbols.

(b) Output from module 1: Starting map of 20x40 is taken as input. Every section of 10x10 gets assigned a large template.


(c) Output from module 2: The large templates are picked. Some of the templates use medium templates (4x4) and variables (1x1).

(d) Output from module 3: The medium sized templates are picked, which consist of a mixture of fixed tiles and variables.


(e) Output from module 4: The variables are picked. At this point in the pipeline, the map consists entirely of tiles that are actually used in the game.

(f) Output from module 5: Small variations are added to blur the lines between different templates. This can be done by swapping some tiles with each other or replacing homogeneous patterns (like 3x3 chunks of the same tile) with something more diverse.

Figure 6.1: Boulder Dash map going through the pipeline.

6.3

From Constraints to Properties

In this section we will define the constraints for this case study and see how SAnR can be used to express these constraints as level properties. We separate the constraints into two categories: game mechanics constraints and design constraints. The game mechanics constraints are derived from the rules of the game and are the same for each level. The design constraints are derived from the design goals for a certain level. For an overview of all the properties we define in this section, readers are referred to the appendix.


6.3.1

Game Mechanics Constraints

We can identify a number of constraints that are a consequence of the game mechanics:

C1: Every level should have an entrance, so the player can start the level.

C2: Every level should have an exit, so the player can finish the level.

C3: There should be enough diamonds to finish the level, since the exit only appears after five diamonds have been collected. However, if a level contains too many diamonds it would be too easy for the player to complete.

C4: There should be at least one possible path that can be followed in order to complete the level.

C5: Levels should only contain symbols that can be interpreted by the game engine.

In the following sections we will try to define SAnR properties to express these constraints.

Constraints 1 & 2: Entrance and Exit: Constraints 1 and 2 are easy to define with SAnR, as was already demonstrated in Section 5.4, and we can reuse those properties. The property "1x rockford" defines that there needs to be exactly one location where the player starts. The property "1x exit" defines that there needs to be exactly one spawn location for the exit.

Constraint 3: Enough Diamonds: SAnR can only be used to check static tile maps and does not consider the mechanics of the game. While the number of diamonds that are initially present can be expressed, the varying number of diamonds that enemies leave behind when they die cannot. SAnR can, however, specify a range: the property "at least 5x diamond" defines the lower bound and the property "at most 15x diamond" defines the upper bound.
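As an illustration, a count-based property such as this range can be checked with a few lines of code. The sketch below uses Python and an invented encoding of the property (symbol name plus bounds); it is not SAnR's actual syntax or implementation.

def count_symbol(tile_map, symbol):
    # Total occurrences of a symbol in a tile map (a list of rows).
    return sum(row.count(symbol) for row in tile_map)

def check_count_property(tile_map, symbol, at_least=0, at_most=None):
    # Properties like "1x rockford" are exact counts (at_least == at_most);
    # "at least 5x diamond" and "at most 15x diamond" each fix one bound.
    n = count_symbol(tile_map, symbol)
    return n >= at_least and (at_most is None or n <= at_most)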

Constraint 4: A Path through the Level: Section 5.4 showed how SAnR can be used to check whether a single tile is blocked off. Defining whether there is a path from the start to the finish of the level is more complicated. In Figure 6.2 we can see that initially there is a path from the start to both the goal (the diamond) and the finish (the exit). However, in practice this level is impossible to finish, as demonstrated in Figure 6.2b. In Boulder Dash, recognizing a path requires a dynamic analysis that takes the mechanics of the game into account. SAnR seems inherently unfit to define this type of dynamic property. No properties were defined for this constraint.
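To make the limitation concrete, the sketch below shows the kind of static reachability check a tool like SAnR could at best offer. It is a hedged Python illustration with an assumed set of walkable symbols; note that it would report the level in Figure 6.2a as solvable, because it treats boulders as static obstacles and ignores that they fall.

from collections import deque

PASSABLE = {"space", "dirt", "diamond", "exit"}  # assumed walkable symbols

def reachable(tile_map, start, goal):
    # Breadth-first search over walkable tiles; ignores all game dynamics.
    queue, seen = deque([start]), {start}
    while queue:
        y, x = queue.popleft()
        if (y, x) == goal:
            return True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < len(tile_map) and 0 <= nx < len(tile_map[0])
                    and (ny, nx) not in seen
                    and tile_map[ny][nx] in PASSABLE):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return False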

Constraint 5: Excluding Undefined Symbols: Since the game engine uses a predefined set of symbols, we need to make sure that only those symbols are present in the generated level. Some of the symbols used in our pipeline are only there to support the generative process. For example, the symbol "dirtOrSpace" should be changed to either "dirt" or "space" in module 4 (see Figure 6.1). With the property "0x dirtOrSpace" we can express that this symbol should not be present in the final result.
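Equivalently, one could check the whole map against the engine's symbol set at once, rather than writing a "0x ..." property per generator-only symbol. A minimal Python sketch, assuming a hypothetical ENGINE_SYMBOLS set:

ENGINE_SYMBOLS = {"space", "dirt", "brick", "boulder", "diamond",
                  "rockford", "exit", "firefly"}  # assumed engine symbol set

def undefined_symbols(tile_map):
    # All symbols present in the map that the game engine cannot interpret.
    return {tile for row in tile_map for tile in row} - ENGINE_SYMBOLS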

(a) Initial state. (b) The player (P) has picked up the diamond, but is blocked by the boulders; the path of the player is indicated with arrows.

Figure 6.2: A Boulder Dash level that cannot be completed.


6.3.2 Design Constraints

Not all properties are derived from constraints that are a consequence of the mechanics. Many properties are part of the design goals created by the designer. We take a look at two design goals and try to define SAnR properties that express them.

Design Goal 1: Creating Maze Templates: The first design goal we will analyze is the addition of maze-like structures to the game. This goal is based on the observation that enemies are fun to outmaneuver in structures where the path is only one tile wide. For these mazes to function properly, each maze should contain a firefly, to make the maze challenging, and a diamond, to give the player an incentive to enter the maze. Figure 6.3 shows a maze that was randomly generated by our grammar. The entrance and the exit of the maze are closed off with dirt so the enemy cannot leave the maze before the player enters.

We can define several properties for this design goal. With the properties "1x firefly in AddTemplateMaze" and "1x diamond in AddTemplateMaze" we can ensure that both the challenge and the reward are present in the maze. With the property "2x dirt in AddTemplateMaze" we try to express that the entrance and the exit of the maze should be locked off with a block of dirt. This is not very precise, since the property only defines that two dirt tiles should be present somewhere in the maze. In hindsight, this property could be made precise by adding the entrance and the exit with separate rules called "AddMazeEntrance" and "AddMazeExit". This way we could define the properties "1x dirt in AddMazeEntrance" and "1x dirt in AddMazeExit", which actually define our constraint.
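One plausible way to evaluate such rule-scoped properties is to record the rectangle that each rule application touched and count symbols only inside it. The Python sketch below is our own illustration; the record format is hypothetical and need not match how LL stores its derivation history.

def count_in_region(tile_map, symbol, top, left, height, width):
    # Occurrences of a symbol inside one rectangular rule application.
    return sum(tile_map[y][x] == symbol
               for y in range(top, top + height)
               for x in range(left, left + width))

def check_scoped_count(tile_map, applications, rule, symbol, expected):
    # applications: (rule_name, top, left, height, width) records, one per
    # rule application; e.g. rule = "AddTemplateMaze", symbol = "firefly".
    return all(count_in_region(tile_map, symbol, t, l, h, w) == expected
               for name, t, l, h, w in applications if name == rule)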

Figure 6.3: Randomly generated maze with a diamond (D) and an enemy (E). The maze is closed off with dirt; therefore, the enemy cannot escape the maze before the player enters.

Design Goal 2: Creating Puzzle Templates: The second design goal is the addition of a puzzle. While the structure of the puzzle does not change, some of the tiles are variable. In Figure 6.4 we can see an instance of the puzzle. The puzzle makes use of the fact that boulders can roll off edges.

Figure 6.4a shows the initial state of the puzzle. To enter the room, the player needs to push the boulder to the spot marked with an "X". When the player removes the dirt on spot "Y", the boulder will start to roll off the edge. Three things can happen:

1. The player is crushed by the boulder.

2. The player locks him/herself in the closed-off area with the diamond.

3. The player cannot get to the diamond any longer.

For this puzzle to work, it is important that the player can enter the room. This is only possible when there is an empty space on the spot marked with an "X" in Figure 6.4a, because the boulder cannot be pushed through dirt. We can define this with the property "1x boulder adjacent to space in AddMediumTemplatePuzzle".
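An adjacency property of this kind could be checked by scanning the region the puzzle rule produced for a boulder with an orthogonally neighbouring empty space. Again a hedged Python sketch, not SAnR's implementation:

def boulders_adjacent_to_space(tile_map, top, left, height, width):
    # Count boulders in the rule's region that touch at least one "space".
    hits = 0
    for y in range(top, top + height):
        for x in range(left, left + width):
            if tile_map[y][x] != "boulder":
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < len(tile_map) and 0 <= nx < len(tile_map[0])
                        and tile_map[ny][nx] == "space"):
                    hits += 1
                    break
    return hits == 1  # the property demands exactly one such boulder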


(a) The initial state of the puzzle. (b) Failed attempt at the puzzle; the path of the player is indicated with arrows. (c) Successful attempt at the puzzle; the path of the player is indicated with arrows.

Figure 6.4: A puzzle. If the boulder rests on spot X and there is no dirt on spot Y, it will roll off the edge, blocking access to the reward.

6.4 Design Iterations

Here we discuss the iterative design process. For each iteration, SAnR analyzed 100 generated levels; each iteration took an average of three hours. Based on the report produced by SAnR, changes were made to improve the quality of the grammar. We will only discuss the highlights of each report; a textual version of each report can be found in the appendix. For the actual debugging, the interactive version of the report in LL was used, since it provides better insight, as discussed in Section 5.3. After four iterations the quality of the grammar was deemed sufficient.
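The analysis loop of one iteration can be pictured as follows. The Python sketch assumes two hypothetical helpers, generate_level() and broken_properties(level), standing in for LL's generator and SAnR's property checker respectively.

from collections import Counter

def run_iteration(generate_level, broken_properties, runs=100):
    # Generate `runs` levels and tally how often each property is broken,
    # which is essentially the summary a SAnR report provides.
    tally = Counter()
    for _ in range(runs):
        for prop in broken_properties(generate_level()):
            tally[prop] += 1
    return tally  # e.g. Counter({"at most 15x diamond": 31, ...})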

6.4.1 Iteration 1

The first analysis led to the following observations:

• All the generated levels still contained symbols that were not defined in the game. This was the result of the rules of module 4, responsible for picking the variables, only being executed 300 times. Increasing this number to 800 should solve the problem, since a 20x40 map contains at most 800 symbols.

• Some of the levels had more than 15 diamonds, breaking the property "at most 15x diamond". This could be fixed in several ways:

– Reducing the number of diamonds in the large templates.

– Reducing the number of diamonds that are picked from the variable symbols.

– Reducing the number of diamonds that are added by module 5, which adds variation.

• In some of the levels the constraint for the puzzle was broken by a rule from module 5 that added variation. Figure 6.5 shows the steps that led to this property being broken.

• The property that was part of the maze was never broken.

There were several noteworthy results from calculating the MAD scores. The first was that none of the rules had a negative score. This is probably a result of the grammar being designed with the metric in mind.

It was also striking that many of the rules in the last three modules had a score of 0. For the last module this is the expected result, since its rules do not add detail. However, intuitively, the rules that replace templates and variables with terminal symbols do add detail. This is a consequence of the automatically generated detail hierarchy: since the templates and variables were placed in the same module as the terminal symbols (dirt, brick, space, etc.), they got the same spot in the hierarchy.

The MAD scores were also taken into consideration during the other iterations. However, there were no negative scores and MAD did not provide a basis for changing the grammar during any of the iterations.


(a) The template that was used to generate the puzzle. (b) The puzzle after the variables are picked. (c) A rule that adds variation ('SwapSpaceAndDirt2x2') swaps the dirt and the empty space.

Figure 6.5: Step-by-step visualization of how the property "1x boulder adjacent to space in AddMediumTemplatePuzzle" was broken. The tiles with question marks are variables that can either be dirt or space.

Changes:

• Increasing the number of times that the rules in module 4 are executed from 300 to 800. This should reduce the number of non-terminal symbols in the result to zero.

• Reducing the chance that a diamond is picked from a variable from 1/2 to 1/3. This should reduce the number of levels that have more than 15 diamonds.

• Changing the template for the puzzle as shown in Figure 6.6. This should ensure that the boulder is adjacent to an empty space.

(a) Template used in the first analysis. (b) Template used in the second analysis.

Figure 6.6: The tile marked with an 'S' was changed to prevent the bug shown in Figure 6.5. The tiles with question marks are variables that can either be dirt or space.

6.4.2 Iteration 2

The second analysis led to the following observations:

• As expected, none of the levels contained non-terminal symbols.

• The number of levels that broke the property "at most 15x diamond" dropped from 31 to 13.

• The number of levels that did not satisfy the property "1x boulder adjacent to space in AddMediumTemplatePuzzle" dropped from 13 to 4. Figure 6.7 shows how this property could still be broken.


(a) The puzzle after the variables that were introduced by the template are picked. (b) A rule that adds variation ('SwapSpaceAndDirt2x2') swaps the dirt and the empty space. (c) The rule swaps the dirt and the empty space again, blocking the entrance.

Figure 6.7: Step-by-step visualization of how the property "1x boulder adjacent to space in AddMediumTemplatePuzzle" was still broken after the change to the template.

Changes:

• The variables that added diamonds were removed completely from the generator. Instead, every template contains a fixed number of diamonds. With either one or two diamonds in every template and eight templates being used in total, the number should always be between 5 and 15.

• The template for the puzzle was changed again, as shown in Figure 6.8.

(a) Template used in the second analysis. (b) Template used in the third analysis.

Figure 6.8: The tile marked with an 'S' was changed to prevent the bug shown in Figure 6.7. The tiles with question marks are variables that can either be dirt or space.

6.4.3 Iteration 3

The third analysis led to the following observations:

• The property "at most 15x diamond" was still broken. One of the templates still contained some variables that could add diamonds to the level. This template was overlooked in the changes that were made before the analysis.

• The property "at most 15x diamond" was also broken by a rule that added variation by adding diamonds to large homogeneous blocks of dirt.

Changes:

• The variables in the template that generated diamonds were replaced with boulders.

• The rule that broke up homogeneous blocks of dirt with diamonds now uses boulders instead.


6.4.4 Iteration 4

The fourth analysis did not report any broken properties. This, of course, does not mean that the grammar is bug-free.

6.5 Results

We can conclude that SAnR was helpful for improving the quality of the generative grammar. A number of relevant constraints could be expressed as SAnR properties. With these properties SAnR helped with resolving three different problems in our grammar:

1. In the first iteration we found that there were symbols in the end result that were not defined in the game. Since it was present in every generated level, this problem could also have been found without the help of SAnR.

2. We also found that the entrance to the puzzle was blocked off (see Figure 6.5) in some of the levels generated by our initial grammar. SAnR also provided insight into how to fix the problem. It was especially vital when the problem still occurred in the second iteration, after changes had been made to the puzzle template to prevent it (see Figure 6.7). Since the problem was assumed to be solved and only appeared in 4 of the 100 generated levels, it could easily have been overlooked without the ability to generate and automatically analyze large quantities of levels.

3. Lastly, we found that our grammar generated too many diamonds. After multiple iterations we found a solution that worked for all of the generated levels.

The properties "1x firefly in AddTemplateMaze", "1x diamond in AddTemplateMaze" and "2x dirt in AddTemplateMaze" were always satisfied and did not inspire any changes.

We also observed some shortcomings that were inherent to SAnR and MAD, and possible ways to improve both solutions.

• In this case study the automatically generated detail hierarchy was not sufficient for producing useful MAD scores.

• In § 6.3.1 we concluded that SAnR is not suitable for defining any properties that require a dynamic analysis.

• Each iteration cost an average of three hours. By optimizing the algorithm in LL responsible for generating the levels, this execution time could be reduced considerably.

• In § 6.3.2 we noted that the property "2x dirt in AddTemplateMaze" was not properly expressing our constraint. This stresses the fact that SAnR is only as strong as the properties that are defined by its users.

For the most part the case study was successful in answering our questions: we found that SAnR can help with implementing realistic design goals, and we identified shortcomings and possible improvements. The evaluation of MAD remains inconclusive.
