Memory leak Detection and Automatic Garbage Collection for the C-Based Compiler Framework CoCoNut


Bachelor Informatica


Mehmet Faruk Unal

January 20, 2020

Supervisor(s): Clemens Grelck

Informatica, Universiteit van Amsterdam


Abstract

The CoCoNut framework is built to support users in developing their own C-based compiler. It does this by providing an abstraction layer for interacting with the abstract syntax tree (AST). However, if memory-leaks occur, there are currently no means to identify or clean them up automatically.

In this thesis we extend the CoCoNut framework with a memory-leak detector for the set of data-objects which form the AST. This is done by creating a wrapper which manages all dynamically allocated data-objects that belong to the AST. Because the CoCoNut framework generates all the constructor and destructor functions of the nodes in the AST, the implementation and operation of this wrapper can be hidden from the user. With the wrapper in place, the framework has more control over the data-objects which form the AST, which makes an automatic garbage collector or a memory-leak reporter possible. With both implemented, the CoCoNut framework is able to support users in finding the causes of memory-leaks without additional effort on their part, which helps them develop their compiler more productively.


Introduction

To run code written in a high-level language there are two options: translating the code with a compiler, or running it through an interpreter. An interpreter is a program which executes instructions written in a high-level language. A compiler reads code written in a high-level language and generates semantically equivalent code in a target language.

Developing a compiler for any given platform requires knowledge of the following elements:

• The high-level language which it is going to parse.

• A method to parse the high-level language.

• A way to store the parsed information in the compiler.

• Comprehension of the target language the developer wants the compiler to translate to.

• Finally, a method to generate semantically equivalent code for the target language from the stored information.

With these challenges in mind, the CoCoNut framework is designed to assist the user by providing tools to create a compiler in the C language. The framework provides methodological support for compiler construction to reduce errors and boost productivity. One way it reduces errors is by reducing the boilerplate code the user would have to write. It does this with its own domain-specific language (DSL), in which the user describes properties of the compiler, traversals, phases and the model of the abstract syntax tree (AST). These descriptions are then processed by the meta-compiler, which generates the boilerplate code for the common structure.

Once the user has completed the compiler, it is able to read the programming language it was built to parse. From the parsed code the compiler builds an intermediate representation (IR), which is an AST consisting of nodes described in the DSL. These nodes are dynamically allocated, since the number of nodes depends on the parsed code. During the run-time of the compiler, traversals are performed on the IR. A traversal can transform the IR, and such transformations can detach nodes from the AST. When detached nodes are not freed due to user negligence, they end up as garbage-objects/memory-leaks which the compiler can no longer reach.

A regular C program has no built-in feature which detects and collects these garbage-objects/memory-leaks, because in the C programming language the user is responsible for the de-allocation of data-objects stored on the heap. If the user neglects to free these objects, they become unreachable for the program and end up as memory-leaks. Unlike a regular C program, the CoCoNut framework does provide the means to support the implementation of a memory-leak detector and, by extension, a garbage collector. The intermediate representation which the compiler uses can be traversed with its traversal system, and the constructor functions which build the nodes of the intermediate representation are generated by the meta-compiler. These features overlap with some of the features a memory-leak detector and garbage collector require to function. To make them operate properly, however, certain choices have to be made and implemented, which are discussed in this thesis.

1.1 Research Question

The CoCoNut framework has the necessary tools to support a garbage collector. To make this possible, the following research questions have to be answered:

• In what way can memory-leak detection and automatic garbage collection be implemented in the CoCoNut framework, without adding any more workload for the user of the framework?

• With the information available from the CoCoNut framework, what reporting can be provided to users of the framework to help them find their memory-leaks?

1.2 Methodology

To answer the research questions, the following steps have to be taken. First, the current feature set which can support the operation of a garbage collector has to be determined. Second, the available garbage collection techniques have to be researched, together with their benefits and disadvantages. Third, a viable garbage collection algorithm has to be chosen for implementation in the CoCoNut framework. Fourth, the features which are missing to make that algorithm work have to be determined and implemented in the framework. Finally, the garbage collector itself has to be built on top of those features.

As an alternative to garbage collection, memory-leaks can instead be reported to the user by a memory-leak reporter. The CoCoNut framework already has some degree of reporting, but what information it can currently provide has to be determined. That information then has to be recorded so that it can be reported to the user. The features required to record it overlap with the features required by the garbage collector. Since those features are already built at that point, they can be extended to record additional information so that a detailed report can be provided to the user. In the end the user has the option to use either the garbage collector or the memory-leak reporter.

1.3 Organisation of thesis

This thesis is organised as follows. Chapter 2 provides the background information necessary to follow the thesis. Chapter 3 introduces memory management for the intermediate representation (IR). Chapter 4 presents the garbage collector and the memory-leak reporter.


CHAPTER 2

Background

To help answer the research questions, information about the CoCoNut framework and garbage collection concepts has to be presented. Section 2.1 presents the features of the CoCoNut framework, to show what it can currently provide in assisting the implementation of a garbage collector.

Garbage collectors are a means to reclaim data which the application no longer needs. Leaving such data behind can be intentional, as in Java, or unintentional, as in C. In order for a garbage collector to reclaim garbage-objects, it first needs to identify the data-objects which are no longer reachable. In this thesis we present two methods to identify garbage-objects in memory: reference counting, presented in section 2.3, and tracing, presented in section 2.4.

When the garbage collector reaches the reclamation stage, the options to reclaim the garbage-objects depend on the method used. Reclamation of garbage-objects with reference counting is presented in section 2.3. When the garbage collector uses tracing to identify garbage-objects, reclamation can be handled differently depending on the strategy used by the garbage collector. Section 2.5 presents three strategies which can be used to manage the memory.

2.1 CoCoNut

CoCoNut was introduced in 2017 as a new framework from a joint effort of Maico Timmerman and Lorian Coltof under the supervision of Clemens Grelck. The framework provides methodological support for compiler construction to reduce errors and boost productivity. It makes use of a domain-specific language (DSL) to describe properties of the compiler, traversals, phases and the model of the AST. From this a compiler is built which converts programs written in the experimental programming language to the C programming language. This gives some guarantee of the performance for which C is known. Because a C compiler is available on a wide range of systems, it also allows the user to test the compiler on many platforms without writing a direct machine or assembly language converter for each of them, since these are very platform dependent. Figure 2.1 presents an abstract overview of the several stages in the CoCoNut framework.

Subsection 2.1.1 explains the AST concept which is used by the CoCoNut framework. Subsection 2.1.2 presents the custom domain-specific language (DSL) with which the AST model is described. The nodes which the user can specify with the DSL, and which form the AST, are presented in subsection 2.1.3. Subsections 2.1.4 and 2.1.5 present the descriptors which the user specifies to be able to interact with the AST through functions. The final descriptor in the DSL is the phase, presented in subsection 2.1.6, which controls the order in which traversals, passes and phases run. After all the features of the AST model are specified, the meta-compiler presented in subsection 2.1.7 generates code as presented in subsection 2.1.8. These topics are the features which are relevant in constructing answers to the research questions.

Figure 2.1: Abstract overview of the several stages in the CoCoNut framework, where blue denotes user-defined files, green generated files, white programs, yellow linked libraries and red the final executable. (Republished with permission of Damian Frolich.)

2.1.1 Abstract Syntax Tree

Using an abstract syntax tree (AST) in a compiler is common practice. With an AST, code written in a high-level language is made abstract by keeping only the content-related details and leaving out the syntax specific to the high-level language. Figure 2.2 displays an example. The structure which results from parsing the code makes it possible to navigate and transform the nodes. The tree is built from the nodes which are available in the model. Taking the figure as an example, every node in that AST is described in the AST model. To extend the types of nodes which can be part of the AST, the model has to be changed to accommodate the new node type.

[Figure 2.2: example AST containing the nodes Statement, Assignment, Var 'foo', BinOp '-', Num '8', Num '5' and MonOp 'OP_neg'.]

2.1.2 Domain-specific Language

During the development of a compiler, the model of the AST can go through multiple changes. These changes can be significant and could require an overhaul of the existing structures which represent the nodes of the AST. The changes would also have to be applied to the functions which operate on these data-structures, namely the functions which construct, destruct, copy and traverse the nodes of the AST. To make this process easier, the framework introduces a custom domain-specific language (DSL). With it, the user can define the structures and instructions which interact with the AST. The resulting specification file is processed by the meta-compiler to generate all the necessary instructions and data-structures to build and interact with the AST.

With the DSL, specifications of nodes, nodesets, enums, passes, traversals and phases can be described. Each of these elements is described in a syntax which is somewhat similar to how structures are defined in the C language. As in C, it is also possible to use single-line or multi-line comments. The DSL requires that the user gives every entry in the specification a unique name. These names are used to create identifiers in the generated C code, such as the names of functions, structs, enums and values. Additionally, for documentation purposes the user has the option to provide text in the info property. The general format of the different entries can be seen in listing 2.1.

    [<modifiers>] <type> <name> {
        [info = "<entry information>",]
        = <value>,
        ...
        {
            ...
        },
        ...
    };

Listing 2.1: General format of the different entry types in the DSL

The design of this custom domain-specific language was established in a joint effort by Timmerman 2017 and Coltof 2017, who worked on the development of the CoCoNut framework at an earlier stage. The design principles to which the language has to adhere are as follows:

• It should be consistent across the different entry types.

• The syntax should be minimal and not unnecessarily verbose.

• It should be easy to extend the syntax while keeping backwards-compatibility.

Later the design of this custom domain-specific language was expanded by Frolich 2019. One of the aims of his thesis was to increase the expressiveness of the DSL. This was accomplished by introducing set expressions, which make it possible to re-use a nodeset in another nodeset or to combine nodesets. This addition made the DSL more agile and reduced the number of lines needed to describe the same behaviour. Listing 2.2 shows an example of how the DSL looks with the new expressions.

    nodeset Decl = { GlobalDecl, FunDecl,
                     VarDecl };

    traversal buildSymbolTable {
        prefix = BST,
        nodes = Decl | { For }
    };

Listing 2.2: The DSL using set expressions

Similar expansions have been made to the traversal by introducing a shorthand notation for traversals and non-traversals, as shown in listing 2.3.

    phase initializeSymbolTable {
        info = "Initialize the symbol table",
        prefix = IST,

        actions {
            pass readInSourceCode = doPassAndParse;
            traversal unpackForLoop = { For, Fundef };
            buildSymbolTable;
        }
    }

    traversal buildSymbolTable = Decl | { For };

Listing 2.3: Example of the inline notation for traversals and non-traversals.

For a further exploration of the custom domain-specific language designed by Timmerman 2017 and Coltof 2017 and expanded by Frolich 2019, we refer to the thesis of Frolich 2019, which provides an updated formal definition of the DSL in its appendix A and an example in its appendix B.

2.1.3 Nodes

Nodes are the primary objects used in the construction of the AST. The body of a node can contain children and attributes, and at least one of the two has to be present for the node to be valid. The syntax of such a description can be seen in listing 2.4.

In the scope of this thesis, the relevant elements of a node are its children and two of the possible attribute types: string and link. A child and a link attribute of a node are both pointers referencing a node. A string attribute is a char pointer which during code generation is used to hold the names of variables.

For more detail regarding the node and its attributes, the work of Timmerman 2017 contains a detailed description of the node and the other elements of the AST.

    [root] node <name> {
        [children {
            <child1>,
            <child2>
        },]
        [attributes {
            <attribute1>
            <attribute2>
        }]
    }

Listing 2.4: Syntax to describe a node in the current CoCoNut framework.

2.1.4 Passes

Every pass defined in the DSL specification gets a corresponding function generated by the meta-compiler. The function is either named after the pass or, by using the func property, given a different name, as shown in listing 2.5. When the pass is called in the finished compiler, it executes the corresponding function.

    pass <name>;

    pass <name> {
        [info = <string>,]
        [prefix = ,]
        func = <function name>
    }

Listing 2.5: Syntax to describe a pass in the current CoCoNut framework.

2.1.5 Traversals

A traversal is a pass which operates on the nodes of the AST. When describing a traversal, the user can specify which nodes are involved, as shown in listing 2.6. If the user does not specify any nodes, the meta-compiler involves every node described in the DSL. With that information the meta-compiler generates the corresponding functions. This way, transformations can be applied to the AST.

    traversal <name>;

    traversal <name> {
        [info = <string>,]
        [prefix = ,]
        nodes {
            <node1>, <node2>, ..
        }
    }

Listing 2.6: Syntax to describe a traversal in the current CoCoNut framework.

2.1.6 Phases & Cycles

The phase description contains actions which are executed in the order in which they are described. These actions can be traversals, passes or other phases. To specify in which order the compiler should execute the phases, one phase in the DSL has to be marked with the keyword root, as shown in listing 2.7. When the compiler starts, it calls the actions described in the root phase from top to bottom.

    root phase Rootphase {
        actions {
            readInSourceCode;
            unpackForLoop;
            buildSymbolTable;
        }
    }

Listing 2.7: Syntax to describe a root phase in the current CoCoNut framework.

In the phase description it is also possible to select a node type which acts as the root during that phase. The actions of that phase then effectively run on the subset of nodes which is reachable from the specified root type. Listing 2.8 shows how a basic phase or cycle can be specified.

    (phase | cycle) <name> {
        [info = <info>,]
        [prefix = ,]
        [root = <node identifier>,]
        actions {
            <action1>;
            <action2>;
        }
    }

Listing 2.8: Syntax to describe a phase or cycle in the current CoCoNut framework.

Cycles, as seen in listing 2.8, use the same syntax. The only difference is that a cycle executes its actions a specified number of times.


2.1.7 Meta-compiler

After the specifications are written in the DSL, the specification file is given as input to the meta-compiler. With tools such as Flex and Bison the specification is parsed, after which semantic checks are performed and references to other entries are resolved (Coltof 2017).

After this, the meta-compiler is ready to generate C files in accordance with the DSL specification. If at a later stage of development the user decides to change the DSL specification of the AST model, the code generation removes the previously generated files to prevent them from being included in user source code or in the compilation.

2.1.8 Code Generation

The code which results from the meta-compiler closely mirrors the DSL specification. Nodes are represented as C structs whose body contains pointers to the struct types of their children. Attributes are represented as members with C types; of the two attributes mentioned in subsection 2.1.3, both the link and the string are pointers.

For every node, which is now a C struct, the meta-compiler provides a constructor, a destructor and a copy function. The constructor function allocates the respective struct and fills its fields with default values. The destructor is provided in two versions: one which only frees the node itself and one which continues the free operation on its child nodes. The copy function creates an exact duplicate of a structure and also builds a hashmap of the nodes which are being copied. If a copied node links to a node which has also been copied, the hashmap is used to give the copy a pointer to the correct, copied node. With these functions an abstraction layer is provided with which users can build their compiler. This abstraction layer reduces errors and boosts productivity.
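To make the shape of such generated code concrete, the following is a minimal sketch of a node struct with a constructor and the two destructor variants. The type BinOp, the function names and the simplification that both children have the same struct type are assumptions made for illustration; they are not the actual output of the meta-compiler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node type: two children and one string attribute.
 * The names are illustrative and not taken from the generated code. */
typedef struct BinOp {
    struct BinOp *left;   /* child */
    struct BinOp *right;  /* child */
    char *op;             /* string attribute, owned by the node */
} BinOp;

/* Constructor: allocate the struct and fill it with default values. */
BinOp *create_BinOp(const char *op)
{
    BinOp *node = malloc(sizeof *node);
    if (node == NULL) return NULL;
    node->left = NULL;
    node->right = NULL;
    node->op = NULL;
    if (op != NULL) {
        node->op = malloc(strlen(op) + 1);
        if (node->op == NULL) { free(node); return NULL; }
        strcpy(node->op, op);
    }
    return node;
}

/* Destructor, version 1: free only this node, children are untouched. */
void free_BinOp_node(BinOp *node)
{
    if (node == NULL) return;
    free(node->op);
    free(node);
}

/* Destructor, version 2: free this node and continue on its children.
 * Returns the number of nodes freed, for illustration only. */
int free_BinOp_tree(BinOp *node)
{
    if (node == NULL) return 0;
    int freed = free_BinOp_tree(node->left) + free_BinOp_tree(node->right);
    free_BinOp_node(node);
    return freed + 1;
}
```

If a child is detached from its parent and neither destructor is called on it, the detached node is exactly the kind of leak this thesis is concerned with.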

2.2 Memory Leak

When an allocated data-object is never properly de-allocated but is no longer needed by the program, it becomes a memory-leak. There are two types of memory-leaks which can exist in a program. The first type occurs when an allocated data-object is no longer reachable from the root set. This set consists of the global pointers reachable from any location in the program, the local pointers in the activation stack and any registers used by active procedures. The second type occurs when a data-object grows in size because, due to a programming error, entries which are no longer required are never de-allocated from it (Jump and McKinley 2007).

2.3 Reference Counting

Garbage collection through reference counting associates a count with each allocated data-object, which records the number of references pointing to that data-object. Each time a new pointer points to the data-object, its count is incremented. When a pointer stops pointing to the data-object, its count is decremented. When the count of a data-object reaches zero, no references to it are left and it can be de-allocated.

Reference counting can be implemented, for example, by keeping the count of references in the data-object itself. In figure 2.3 each arrow represents a pointer reference to an object, and the count of an object is equal to the number of arrows pointing to it.

The advantage of reference counting is that, due to its incremental nature, it can operate without a special sequence which pauses the execution of the program. This makes reference counting a viable method for real-time applications where response time is critical.

Figure 2.3: A reference counting example

However, relying on the count of a data-object reaching zero before it can be de-allocated prevents the garbage collector from removing data-objects which have a circular relation. Here, multiple data-objects reference each other directly or indirectly, which forms a cycle. When such a subset of data-objects becomes unreachable from the root set, it continues to exist because its counts never drop to zero, as shown in figure 2.4.

Figure 2.4: A reference counting example with a detached circular relation

Another disadvantage of reference counting is that its operating cost is proportional to the number of allocations, de-allocations and pointer manipulations. These actions almost always require a count to change, and they can also change the counts of related objects. For example, if the program makes use of a significant number of short-lived variables, this causes a great deal of overhead due to the work required for setting up and removing their counts (Wilson 1992).
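The counting scheme of this section and its weakness with circular relations can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not code from the CoCoNut framework: the type Obj, the functions obj_new, obj_retain, obj_release and obj_set_ref, and the restriction to one outgoing reference per object are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Obj {
    int count;          /* number of pointers referencing this object */
    struct Obj *ref;    /* at most one outgoing reference, for brevity */
} Obj;

static int live_objects = 0;   /* bookkeeping for the demonstration */

Obj *obj_new(void)
{
    Obj *o = malloc(sizeof *o);
    assert(o != NULL);
    o->count = 1;       /* the pointer returned to the caller */
    o->ref = NULL;
    live_objects++;
    return o;
}

void obj_retain(Obj *o) { if (o != NULL) o->count++; }

/* Drop one reference; an object whose count reaches zero is freed and
 * the reference it held is dropped in turn. */
void obj_release(Obj *o)
{
    while (o != NULL && --o->count == 0) {
        Obj *next = o->ref;
        free(o);
        live_objects--;
        o = next;
    }
}

/* Make `from` reference `to`, adjusting the counts of both targets. */
void obj_set_ref(Obj *from, Obj *to)
{
    Obj *old = from->ref;
    obj_retain(to);
    from->ref = to;
    obj_release(old);
}

/* A chain a -> b is reclaimed completely once the root pointers are dropped. */
int demo_chain(void)
{
    int before = live_objects;
    Obj *a = obj_new(), *b = obj_new();
    obj_set_ref(a, b);
    obj_release(b);     /* b survives, a still references it */
    obj_release(a);     /* a reaches zero, which in turn frees b */
    return live_objects - before;
}

/* A cycle a -> b -> a never reaches zero on its own. */
int demo_cycle(void)
{
    int before = live_objects;
    Obj *a = obj_new(), *b = obj_new();
    obj_set_ref(a, b);
    obj_set_ref(b, a);
    obj_release(a);     /* drop both root pointers ... */
    obj_release(b);
    int leaked = live_objects - before;   /* ... yet both objects remain */
    obj_set_ref(b, NULL);   /* break the cycle by hand so the demo cleans up */
    return leaked;
}
```

Here demo_chain leaves no objects behind, while demo_cycle observes two objects that are still allocated after every root pointer has been dropped. They are only reclaimed because the demo breaks the cycle by hand.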

2.4 Tracing

Tracing is the second method we present in this thesis to identify garbage-objects. A traversal starts at the root set and follows the graph of pointers reachable from it, using either a depth-first or a breadth-first traversal algorithm. With this method all the active data-objects are reached, and the unreachable data-objects can be considered garbage-objects.

During this traversal, data-objects can be marked so that the garbage collector knows which data-objects were reachable and which were not. Figure 2.5 shows an example of objects which were not reached and thus were not marked.
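The marking step described above can be sketched as a depth-first traversal. This is an illustrative sketch only: the type Object, the fixed number of children and the function names are assumptions and do not come from the CoCoNut framework.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 2

typedef struct Object {
    bool marked;
    struct Object *children[MAX_CHILDREN];
} Object;

/* Depth-first marking of everything reachable from `obj`. The check on
 * `marked` also makes the traversal terminate on circular relations. */
void mark(Object *obj)
{
    if (obj == NULL || obj->marked) return;
    obj->marked = true;
    for (size_t i = 0; i < MAX_CHILDREN; i++)
        mark(obj->children[i]);
}

/* Start a marking traversal from every pointer in the root set. */
void mark_from_roots(Object **roots, size_t n_roots)
{
    for (size_t i = 0; i < n_roots; i++)
        mark(roots[i]);
}

/* Four objects: a reachable cycle a <-> b and a detached pair c -> d.
 * Returns how many of them were marked. */
size_t demo_mark(void)
{
    Object objs[4] = {0};
    objs[0].children[0] = &objs[1];   /* a -> b */
    objs[1].children[0] = &objs[0];   /* b -> a */
    objs[2].children[0] = &objs[3];   /* c -> d, not reachable from a root */
    Object *roots[] = { &objs[0] };
    mark_from_roots(roots, 1);

    size_t marked = 0;
    for (size_t i = 0; i < 4; i++)
        if (objs[i].marked) marked++;
    return marked;
}
```

In demo_mark only the two objects reachable from the root are marked. The detached pair stays unmarked and would be treated as garbage by a later reclamation stage.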


Figure 2.5: The objects with a gray color were reached and marked by the algorithm

With tracing, a detached subset of data-objects which forms a circular relation is identified as garbage. This is unlike reference counting, which allows such a relation to persist because the counts of those objects never reach zero.

However, unlike reference counting, tracing which runs concurrently with the program requires additional safeguards to make sure all reachable data-objects are reached. During run-time the location of a reference to a data-object can change. This can cause a pointer which is located deep in the tree to be moved to a location higher in the tree, in a part which tracing has already traversed. Figure 2.6 illustrates this possibility.

Figure 2.6: While tracing is performed on the set of data-objects, concurrently data-object A which is deeper in the tree gets relocated to a location higher in the tree by the program

As a result, data-object A does not get marked and is considered a garbage-object until the next iteration of tracing starts. This scenario can cause unexpected behaviour, which is why safeguards are necessary if tracing runs concurrently with the program (Demers et al. 1989, Wilson 1992).

2.5 Reclamation of Garbage objects

After the garbage-objects have been identified through tracing, the next stage is to reclaim them from the program. How the garbage collector handles the live and garbage objects differs per strategy. In this section three strategies are presented: sweeping, compacting and copying (Wilson 1992).

2.5.1 Sweeping

After the live data-objects have been marked with the tracing algorithm described in section 2.4, the garbage collector goes through all the data-objects which are known to it. Because all reachable data-objects were marked, every data-object which is still unmarked can be considered a garbage-object. With that information the garbage collector frees the unmarked objects and leaves the reachable data-objects untouched. This is also known as a Mark&Sweep garbage collector. In the long run, however, this approach can cause memory fragmentation, as shown in figure 2.7, because data-objects are not necessarily de-allocated in the order in which they were allocated. This leaves the memory with gaps which might be too small to hold new data-objects, which makes it difficult for the allocator to place new objects efficiently (Wilson 1992).
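A sweep over all known data-objects can be sketched as follows. The sketch assumes that the collector links every allocation into one list. The names gc_alloc, gc_mark and gc_sweep and the single child pointer per object are hypothetical simplifications, not part of the CoCoNut framework.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Object {
    bool marked;
    struct Object *child;       /* single outgoing reference, for brevity */
    struct Object *next_alloc;  /* links every allocated object */
} Object;

static Object *all_objects = NULL;   /* everything the collector knows about */

/* Allocate an object and register it with the collector. */
Object *gc_alloc(void)
{
    Object *o = calloc(1, sizeof *o);
    assert(o != NULL);
    o->next_alloc = all_objects;
    all_objects = o;
    return o;
}

/* Mark everything reachable from `o` by following the child pointers. */
void gc_mark(Object *o)
{
    while (o != NULL && !o->marked) {
        o->marked = true;
        o = o->child;
    }
}

/* Free every unmarked object and reset the mark of the survivors.
 * Returns the number of objects that were reclaimed. */
size_t gc_sweep(void)
{
    size_t freed = 0;
    Object **link = &all_objects;
    while (*link != NULL) {
        Object *o = *link;
        if (o->marked) {
            o->marked = false;       /* ready for the next collection */
            link = &o->next_alloc;
        } else {
            *link = o->next_alloc;   /* unlink first, then free */
            free(o);
            freed++;
        }
    }
    return freed;
}
```

A collection then consists of calling gc_mark on every root followed by one call to gc_sweep. The survivors stay at their original addresses, which is what allows the fragmentation described above.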

Figure 2.7: A visual representation of memory-space before and after mark&sweep. The legend distinguishes the root set, marked data-objects, unmarked data-objects and free memory-space.

2.5.2 Compacting

One strategy to prevent fragmentation in the long run is to move the data-objects which remain after garbage collection. The remaining data-objects are shifted in the memory-space to “squeeze” out the empty spaces, so that all allocated data-objects are compacted into a contiguous area, as shown in figure 2.8. This allows more effective use of the available memory-space. However, because compacting changes the location of a data-object, its memory address also changes. The garbage collector therefore needs to update the existing pointers to the new addresses (Wilson 1992).
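The need to update pointers after moving objects can be illustrated with a small sliding compactor. The sketch below assumes a heap of fixed-size slots and a preceding mark phase which has set the live flag. The names heap, compact and forward are hypothetical, and the three-pass structure is one common way to implement the idea, not necessarily the one a particular collector uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SLOTS 8

typedef struct Object {
    bool live;               /* result of a preceding mark phase */
    int value;               /* payload */
    struct Object *ref;      /* reference to another live object, or NULL */
    struct Object *forward;  /* new address, computed during compaction */
} Object;

static Object heap[HEAP_SLOTS];

/* Slide all live objects to the start of the heap and fix up pointers.
 * Returns the number of live objects, i.e. the index of the first free slot. */
size_t compact(Object **roots, size_t n_roots)
{
    size_t next = 0;

    /* Pass 1: assign each live object its new address. */
    for (size_t i = 0; i < HEAP_SLOTS; i++)
        if (heap[i].live)
            heap[i].forward = &heap[next++];

    /* Pass 2: redirect the root set and the references between objects. */
    for (size_t i = 0; i < n_roots; i++)
        if (roots[i] != NULL)
            roots[i] = roots[i]->forward;
    for (size_t i = 0; i < HEAP_SLOTS; i++)
        if (heap[i].live && heap[i].ref != NULL)
            heap[i].ref = heap[i].ref->forward;

    /* Pass 3: move the objects. Sliding towards lower addresses never
     * overwrites an object that still has to be moved. */
    for (size_t i = 0; i < HEAP_SLOTS; i++) {
        if (!heap[i].live) continue;
        Object *dest = heap[i].forward;
        if (dest != &heap[i]) {
            *dest = heap[i];
            heap[i].live = false;   /* the old slot is now free */
        }
    }
    return next;
}
```

With live objects in slots 1, 4 and 6, a call to compact moves them to slots 0, 1 and 2, and a root which pointed at slot 6 points at slot 2 afterwards.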

2.5.3 Copying

Instead of shifting the data-objects to “squeeze” out the empty space, a different strategy is to divide the heap into two contiguous semi-spaces. With this strategy the garbage collector uses only one of the two semi-spaces at a time to store the data-objects of the program. When the semi-space in use reaches a predetermined threshold, the garbage collector uses the tracing algorithm to copy all the data-objects it can reach to the other semi-space, as shown in figure 2.9.

In this figure the data-objects of contiguous semi-space 2 are copied to contiguous semi-space 1. After the root set has been changed to point to the data-objects in semi-space 1, the copy is complete. This way, instead of spending computing time on de-allocating the garbage-objects, these objects are simply overwritten in the next copy cycle (Wilson 1992).

However, as with compacting, the addresses of all the data-objects change, and this change has to be applied to all variables in the root set and to the data-objects which are copied to the other semi-space. In addition to the requirement to change the addresses, the usable memory-space is divided by the number of contiguous semi-spaces the copying strategy uses.

Figure 2.8: A visual representation of memory-space before and after compacting. The legend distinguishes the root set, data-objects and free memory-space.


Figure 2.9: A visual representation of memory-space before and after copying. Above the dotted line is the memory-space before copying is applied, below the dotted line the memory-space after copying. Each half shows the root set and contiguous semi-spaces 1 and 2, and the legend distinguishes reachable from unreachable data-objects.


CHAPTER 3

Memory Management and Memory Leak Detection for the AST

The C programming language does not provide the user with a tool to obtain a list of all pointers which belong to the root set. It also does not keep track of all data-objects which are dynamically allocated. This makes it difficult to perform tracing as discussed in section 2.4. To make it possible for the garbage collector to distinguish the live data-objects from the garbage data-objects, two features are required: access to the variables which make up the root set, and a data-structure which has entries for all the dynamically allocated data-objects.

3.1 Methodology

As presented in section 2.1, the CoCoNut framework builds an AST as its IR. The method to interact with the AST is a traversal. A traversal starts from the root of the AST and goes through its child nodes until it has traversed all the accessible nodes. With this feature, tracing as presented in section 2.4 can be implemented by supplying the CoCoNut framework with instructions to build a traversal which calls a function that marks every node it traverses.

The nodes which are part of the AST are allocated by constructor functions generated by the meta-compiler, as presented in subsection 2.1.7. These constructor functions obtain their dynamically allocated memory-space through a wrapper, mem_alloc. Constructing a new wrapper which is responsible for allocating data-objects specifically for the AST gives the CoCoNut framework knowledge of all nodes which were dynamically allocated for the purpose of being a data-object in the AST.
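The idea of an allocation wrapper which remembers what it hands out can be sketched as follows. This is not the wrapper of the CoCoNut framework: the names ast_alloc, ast_free and ast_bytes_in_use are hypothetical, and a simple linked list of records stands in for whatever bookkeeping structure the framework uses.

```c
#include <assert.h>
#include <stdlib.h>

/* One record per allocation made on behalf of the AST. */
typedef struct AllocRecord {
    void *ptr;
    size_t size;
    struct AllocRecord *next;
} AllocRecord;

static AllocRecord *ast_allocations = NULL;

/* Hypothetical wrapper: allocate memory and remember the allocation. */
void *ast_alloc(size_t size)
{
    void *ptr = malloc(size);
    AllocRecord *rec = malloc(sizeof *rec);
    if (ptr == NULL || rec == NULL) {
        free(ptr);
        free(rec);
        return NULL;
    }
    rec->ptr = ptr;
    rec->size = size;
    rec->next = ast_allocations;
    ast_allocations = rec;
    return ptr;
}

/* Free memory obtained from ast_alloc and drop its record.
 * Returns 0 on success, -1 if the pointer is unknown to the wrapper. */
int ast_free(void *ptr)
{
    for (AllocRecord **link = &ast_allocations; *link != NULL;
         link = &(*link)->next) {
        AllocRecord *rec = *link;
        if (rec->ptr == ptr) {
            *link = rec->next;
            free(rec->ptr);
            free(rec);
            return 0;
        }
    }
    return -1;
}

/* Number of bytes still registered, i.e. not yet freed through ast_free. */
size_t ast_bytes_in_use(void)
{
    size_t total = 0;
    for (AllocRecord *rec = ast_allocations; rec != NULL; rec = rec->next)
        total += rec->size;
    return total;
}
```

Because every allocation passes through one function, the set of records is a complete list of the data-objects a tracing traversal has to account for. A pointer that is freed without being known to the wrapper can be reported as an error.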

The addition of a new traversal and a wrapper allows the framework to host a memory-leak detector, which expands the features of the CoCoNut framework. Besides making garbage collection or the reporting of memory-leaks possible, it also enables additional error checks. These checks allow the CoCoNut framework to verify that there are no duplicate pointers to data-objects and no pointers which are unknown to the memory-leak detector.

Section 3.2 presents the wrapper and its functions, which hold the information that makes the detection of errors and leaks possible. Section 3.3 presents the detection of leaks, and section 3.4 presents the additional error checks.


3.2 AST Memory manager

To make the distinguishing of live nodes through tracing possible, all data-objects allocated for the purpose of being a part of the AST has to be known to the CoCoNut framework. This is achieved by using a bucket data structure as presented in listing 3.1. Each index of the bucket data structure holds values which describe an allocated data-object.

The address of the allocated data-object is stored as a void*, which serves as the key to request or change the information of a specific entry. The size is stored to have direct access to the size of an allocated data-object for the purpose of reporting. Each entry is in 1 of 5 states: empty, unmarked, marked, new and leak. When a data-object gets allocated, its status is initially set to new. The statuses unmarked and marked are used by the memory-leak detector to record the findings of its traversal. The status leak is used by the memory-leak reporter to let the CoCoNut framework know that it has processed the leak. The variable type identifies the entry as a string or a node of the AST, as these are the two possible dynamic allocations in the AST. The variable build_entry records during which pass the entry was added; this field is empty while the status of the entry is new. The variable leak_path only holds information once the entry has become unmarked: it records after which traversal the data-object became a memory leak or garbage-object, for reporting purposes. Finally, to make the recording of this information unbounded, each bucket data structure refers to the next bucket data structure.

    enum ast_status { empty, unmarked, marked, new, leak };
    enum entry_type { empty, string, node };

    struct ast_memory_manager
    {
        void *pointer[AST_ENTRYSIZE];            /* address of the data-object, used as key */
        size_t size[AST_ENTRYSIZE];              /* size of the data-object, for reporting */
        enum ast_status status[AST_ENTRYSIZE];
        enum entry_type type[AST_ENTRYSIZE];
        char *build_entry[AST_ENTRYSIZE];        /* pass during which the entry was added */
        char *leak_path[AST_ENTRYSIZE];          /* traversal after which the entry leaked */
        struct ast_memory_manager *next;         /* next bucket */
    };

Listing 3.1: AST memory management

Before the AST is built, this structure is initialised so that all data-objects belonging to the AST can be managed. The entries of the AST memory manager presented in listing 3.1 are managed through the functions presented in listing 3.2.

Function AST_alloc requires 2 arguments: the size of the data-object and the type of the data-object. The first argument is needed to request the required size for the data-object with malloc. The second argument is for internal use by the AST memory manager. The AST consists of 2 types of data-objects: nodes, which are described in section 2.1, and strings. Because no fixed length is specified for the strings which hold the description of an AST node, these strings are also stored in the AST memory manager. Internally, AST_alloc calls an additional function which instructs the AST memory manager to find a free entry in linear time. If it reaches the end of the data structure without finding a free entry, it expands the AST memory manager by adding another bucket, which can be accessed through the current bucket. After the entry is made, AST_alloc returns an address at which the node or string can store its data.
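The allocation path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: the global manager_head, the helper register_entry, the prefixed enumerator names and the small bucket size are all hypothetical, and the build_entry and leak_path fields are left out for brevity.

```c
#include <stdlib.h>

#define AST_ENTRYSIZE 4 /* assumed bucket size, kept small to show chaining */

/* Enumerator names are prefixed here (an assumption) so both enums compile together. */
enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED, ST_NEW, ST_LEAK };
enum entry_type { TYPE_EMPTY, TYPE_STRING, TYPE_NODE };

struct ast_memory_manager {
    void *pointer[AST_ENTRYSIZE];
    size_t size[AST_ENTRYSIZE];
    enum ast_status status[AST_ENTRYSIZE];
    enum entry_type type[AST_ENTRYSIZE];
    struct ast_memory_manager *next;
};

/* Hypothetical head of the bucket chain. */
static struct ast_memory_manager *manager_head = NULL;

/* Record ptr in the first free slot found by a linear scan over all buckets.
 * When every bucket is full, a new bucket is appended to the chain. */
static int register_entry(void *ptr, size_t size, enum entry_type type)
{
    struct ast_memory_manager **link = &manager_head;
    for (;;) {
        if (*link == NULL) {
            /* calloc zeroes the statuses, so every slot starts as ST_EMPTY. */
            *link = calloc(1, sizeof **link);
            if (*link == NULL)
                return -1;
            (*link)->next = NULL;
        }
        struct ast_memory_manager *bucket = *link;
        for (size_t i = 0; i < AST_ENTRYSIZE; i++) {
            if (bucket->status[i] == ST_EMPTY) {
                bucket->pointer[i] = ptr;
                bucket->size[i] = size;
                bucket->status[i] = ST_NEW; /* freshly allocated entries start as new */
                bucket->type[i] = type;
                return 0;
            }
        }
        link = &bucket->next;
    }
}

/* Allocate a data-object and make it known to the memory manager. */
void *AST_alloc(size_t size, enum entry_type type)
{
    void *ptr = malloc(size);
    if (ptr == NULL)
        return NULL;
    if (register_entry(ptr, size, type) != 0) {
        free(ptr);
        return NULL;
    }
    return ptr;
}
```

With a bucket size of 4, a fifth call to AST_alloc finds no free slot in the first bucket and lands in slot 0 of a newly appended bucket.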

Function AST_free, besides freeing the data-object, calls a function that removes the entry of that data-object from the AST memory manager by freeing its strings and resetting the values of that entry to NULL or empty. If it is unable to find the entry in its data structure, then something has gone wrong and an error is triggered, as presented in section 3.4.
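A minimal sketch of this behaviour is given below. It deviates from the signature in listing 3.2 on purpose: the hypothetical ast_free_sketch takes the head of the bucket chain as a parameter and returns a status code, so that the unknown-pointer case is visible to the caller. The prefixed enumerator names are likewise assumptions made for this example.

```c
#include <stdio.h>
#include <stdlib.h>

#define AST_ENTRYSIZE 8 /* assumed bucket size */

enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED, ST_NEW, ST_LEAK };

struct ast_memory_manager {
    void *pointer[AST_ENTRYSIZE];
    size_t size[AST_ENTRYSIZE];
    enum ast_status status[AST_ENTRYSIZE];
    char *build_entry[AST_ENTRYSIZE];
    char *leak_path[AST_ENTRYSIZE];
    struct ast_memory_manager *next;
};

/* Free the data-object behind ptr and clear its entry.
 * Returns 0 on success and -1 when no entry for ptr exists. */
int ast_free_sketch(struct ast_memory_manager *head, void *ptr)
{
    for (struct ast_memory_manager *b = head; b != NULL; b = b->next) {
        for (size_t i = 0; i < AST_ENTRYSIZE; i++) {
            if (b->status[i] == ST_EMPTY || b->pointer[i] != ptr)
                continue;
            /* Release the strings owned by the entry, then the object itself. */
            free(b->build_entry[i]);
            free(b->leak_path[i]);
            free(b->pointer[i]);
            /* Reset the slot so it can be reused by a later allocation. */
            b->pointer[i] = NULL;
            b->size[i] = 0;
            b->status[i] = ST_EMPTY;
            b->build_entry[i] = NULL;
            b->leak_path[i] = NULL;
            return 0;
        }
    }
    /* The pointer was never registered: the unknown-pointer error case. */
    fprintf(stderr, "unknown pointer %p passed to free\n", ptr);
    return -1;
}
```

Skipping empty slots during the search matters: a reset slot holds NULL, and it should never match a pointer that a caller passes in.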


Function AST_copy duplicates a data-object. The duplicated data is added to the AST memory manager like a new entry, and the function returns the address of the duplicated data-object.

    void *AST_alloc(size_t size, enum entry_type type);
    void AST_free(void *ptr);
    void *AST_copy(const void *src, size_t size);

Listing 3.2: AST functions

By default the CoCoNut framework functions without the AST memory manager and its related functions. If the user wants to make use of this feature, one of 2 arguments has to be passed to the meta-compiler: “automatic-garbage-collector” or “memory-leak-reporter”. This changes the generated constructor and destructor functions to use the respective functions presented in listing 3.2 instead of the current implementation.

3.3 Memory leak detection of the AST

With the data structure from section 3.2 set up, it is now possible to interact with all the data-objects allocated for the AST. All entries which hold a pointer to a valid data-object start this traversal with the status unmarked. The traversal feature described in subsection 2.1.5 behaves similarly to how tracing operates as described in ??. To distinguish live data-objects from garbage-objects, a special traversal is run. This traversal executes the function shown in listing 3.3 for each node it traverses and for each of that node's strings. The function updates the status of each entry it is called for to marked.

    int mark_AST_memory_manager_entry(void *ptr);

Listing 3.3: marking function
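The marking traversal can be illustrated with a small sketch. The node layout (struct node with two children and one string), the flat table that stands in for the bucket chain, and the function mark_traversal are assumptions made for this example; only the name of the marking function is taken from listing 3.3.

```c
#include <stddef.h>

#define MAX_ENTRIES 16 /* assumed table size */

enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED };

struct entry {
    void *pointer;
    enum ast_status status;
};

/* Flat stand-in for the chained buckets of the AST memory manager. */
static struct entry table[MAX_ENTRIES];

/* Hypothetical AST node: two children and one dynamically allocated string. */
struct node {
    struct node *left;
    struct node *right;
    char *name;
};

/* Set the entry that belongs to ptr to marked.
 * Returns 0 on success and -1 when ptr has no entry. */
int mark_AST_memory_manager_entry(void *ptr)
{
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (table[i].status != ST_EMPTY && table[i].pointer == ptr) {
            table[i].status = ST_MARKED;
            return 0;
        }
    }
    return -1;
}

/* Visit every node reachable from n and mark the node and its string.
 * Returns the number of entries that were marked. */
int mark_traversal(struct node *n)
{
    if (n == NULL)
        return 0;
    int marked = 0;
    if (mark_AST_memory_manager_entry(n) == 0)
        marked++;
    if (n->name != NULL && mark_AST_memory_manager_entry(n->name) == 0)
        marked++;
    return marked + mark_traversal(n->left) + mark_traversal(n->right);
}
```

An entry that is registered in the table but no longer reachable from the root keeps the status unmarked after the traversal, which is exactly what identifies it as detached.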

By default this traversal executes after every pass or traversal. However, if a pass or traversal in the DSL specification belongs to a phase whose root attribute has a value as described in listing 2.8, the traversal is executed once at the end of that phase instead of after every pass or traversal within that phase.

At the end of that traversal, the data structure described in section 3.2 holds entries which are either marked or unmarked, ignoring the empty and leak entries. The marked entries are all the nodes and strings which were reachable through the traversal of the AST. The entries which still have the status unmarked became detached from the AST during the user-specified traversal. Depending on the argument given when building the compiler, the compiler either calls the garbage collector presented in subsection 4.1.1 or calls additional functions to store additional information for the memory-leak reporting presented in section 4.2.

3.4 Detection of pointer errors in the AST

Besides distinguishing live data-objects from garbage-objects for the garbage collector or memory-leak reporter, the framework with the addition of the AST memory manager is now able to check whether the following errors exist in the AST. Each of these errors stops the execution of the compiler:

Duplicate child pointer

A node should only be referenced once as a child. When multiple nodes have a child pointer to the same node, the compiler risks run-time errors. This could indicate that different branches refer to the same sub-branch at a certain point, or that a node has a child node which is higher in the tree, which means there is a cycle.

Duplicate string pointer


This check prevents run-time errors: when a string that is referenced more than once gets freed somewhere, another node would still point to the now invalid string.

Unknown node pointer/string

This error occurs when a node or string from the AST is being marked during the traversal of the AST memory manager, but the entry of that node or string cannot be found in the data structure. This means that the node or string was added to the AST without using its corresponding constructor function.

The duplicate pointer errors are generated during the traversal presented in section 3.3. When the mark function receives a pointer which it has already marked during the same traversal, there are elements in the AST referring to the same node or string.

The unknown node or string errors are generated when the free or mark function of the AST memory manager is called and the AST memory manager is unable to retrieve the entry from its data structure, which means the data-object was allocated incorrectly.
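Both error classes fall out of the marking function almost for free, as the following sketch suggests. The result enum, the flat entry table and the function name mark_checked are hypothetical; the framework may report these errors differently.

```c
#include <stddef.h>

enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED };
enum mark_result { MARK_OK, MARK_DUPLICATE, MARK_UNKNOWN };

struct entry {
    const void *pointer;
    enum ast_status status;
};

/* Mark the entry for ptr and classify the outcome:
 *   MARK_OK        - the entry was unmarked and is now marked
 *   MARK_DUPLICATE - the entry was already marked in this traversal,
 *                    so two elements of the AST refer to the same object
 *   MARK_UNKNOWN   - no entry exists, so the object was not allocated
 *                    through the memory manager */
enum mark_result mark_checked(struct entry *table, size_t count, const void *ptr)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].status == ST_EMPTY || table[i].pointer != ptr)
            continue;
        if (table[i].status == ST_MARKED)
            return MARK_DUPLICATE;
        table[i].status = ST_MARKED;
        return MARK_OK;
    }
    return MARK_UNKNOWN;
}
```

The duplicate check relies on every valid entry being reset to unmarked before a traversal starts; otherwise a mark left over from the previous round would be misread as a duplicate.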


CHAPTER 4

Automatic Garbage Collector and Memory Leak Reporter

With the AST memory manager described in section 3.2 and the tracing traversal described in section 3.3 in place, the garbage collector and the memory-leak reporter have the means to operate. Section 4.1 presents the challenge of making copying or compacting work, as presented in subsection 2.9 and 2.8 respectively. For the reasons explained in that section mark&sweep is chosen, and subsection 4.1.1 presents its implementation. Section 4.2 presents the implementation of the memory-leak-reporter, which runs if the user wants leaks reported instead of automatically collected.

4.1 Garbage Collector

There are a number of concepts to choose from for the implementation of the garbage collector. Given the AST memory manager built in section 3.2 and the CoCoNut feature used to apply tracing to the data-objects as presented in section 3.3, only concepts which build on distinguishing by tracing are an option. For reclamation this leaves mark&sweep, mark&compact and copying, as presented in section 2.5.

With any option other than mark&sweep, it becomes necessary for the framework to additionally manage its own dynamic heap memory, and a simple wrapper like the one in section 3.2 is insufficient. Mark&sweep can operate without a contiguous memory space, unlike mark&compact, copying and their derivations, which require one or more contiguous memory spaces to operate. The latter options therefore have the following consequences. The size of the contiguous memory space which the compiler requires for hosting the AST varies, so the contiguous memory it manages must be able to grow. Additionally, copying would require a number of additional contiguous memory spaces to operate, effectively dividing the available memory by the number of contiguous memory spaces it requires.

Taking this into consideration, a mark&sweep garbage collector does not require the framework to manage its own memory space. A further implication of copying or compacting is that the addresses of the data-objects change, which would require the framework to make an additional traversal through all the nodes in the AST with a translation table to change the pointers to their new addresses.

4.1.1 Mark&Sweep Implementation

Following the arguments presented prior to this subsection, a mark&sweep garbage collector is implemented in the framework. The biggest challenge for this scheme is the distinguishing stage. With the AST memory manager from section 3.2 and the traversal from section 3.3, which marks the live data-objects, collecting garbage-objects is now possible. After the marking is done, the garbage collector is called and starts traversing every entry in the AST memory manager.

Each valid entry has 1 of 2 states: unmarked or marked. When the garbage collector reaches an entry with the status unmarked, it executes AST_free on that entry. The AST has only 2 types of allocations: nodes, and strings which are descriptions of their respective node. The AST memory manager knows both types of entries, which allows the garbage collector to call AST_free without using the respective destructor function of that node or string.

During the same process of freeing unmarked entries, the garbage collector resets the entries which are marked to unmarked to prepare them for the next iteration of marking. At the end of this iteration of garbage collection, the data-objects which were detached from the AST have been removed from the heap, which saves the user from the penalties of memory leaks. In figure 4.1 the state machine of the automatic garbage collector is displayed.

[Figure: state diagram with the states empty, Unmarked and Marked; the transitions are labelled "A new Node is allocated", "Marking algorithm" and "Garbage Collector".]

Figure 4.1: A state machine of when the automatic garbage collector works with the AST memory manager
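The sweep step of subsection 4.1.1 can be sketched in a few lines. The flat entry table and the function name sweep are assumptions for this example; in the framework the same loop would run over the chained buckets and call AST_free.

```c
#include <stdlib.h>

enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED };

struct entry {
    void *pointer;
    size_t size;
    enum ast_status status;
};

/* One sweep over all entries after the marking traversal has finished.
 * Unmarked entries are garbage: their data-object is freed and the slot
 * is emptied. Marked entries survive and are reset to unmarked so that
 * the next marking traversal starts from a clean state.
 * Returns the number of data-objects that were freed. */
size_t sweep(struct entry *table, size_t count)
{
    size_t freed = 0;
    for (size_t i = 0; i < count; i++) {
        switch (table[i].status) {
        case ST_UNMARKED:
            free(table[i].pointer);
            table[i].pointer = NULL;
            table[i].size = 0;
            table[i].status = ST_EMPTY;
            freed++;
            break;
        case ST_MARKED:
            table[i].status = ST_UNMARKED;
            break;
        default:
            break; /* empty slots are skipped */
        }
    }
    return freed;
}
```

Freeing and resetting happen in the same loop, so a collection costs one pass over the entries on top of the marking traversal.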

4.2 Memory Leak Reporter

For the memory leak reporter to give complete information about a leak, some additional information has to be stored in the AST memory manager. When the compiler is built with the automatic-garbage-collector argument, not all status values and variables of the data structure in the AST memory manager are used. Garbage collection does not need to know when a node got attached to or detached from the AST; that information is only relevant to the user of the framework. So when a compiler is built with the CoCoNut framework using the memory-leak-reporter argument, more information is recorded in the AST memory manager.

For this purpose the statuses new and leak, introduced in section 3.2, are used in addition to unmarked and marked. With these additional statuses the memory-leak-reporter can record additional information in the entry of a specific node. The status new lets the memory-leak-reporter know that the entry still lacks the information of during which pass, traversal or phase it was allocated. The status leak lets the memory-leak-reporter know that the entry has already been processed and can be ignored. In figure 4.2 the state machine of entries using the memory-leak-reporter is displayed.

Like the garbage collector, the memory-leak-reporter relies on the tracing traversal described in section 3.3. During that traversal all the data-objects which are reachable get their respective entry in the AST memory manager marked. After this process the memory-leak-reporter is called, which goes through the AST memory manager and saves the name of the phase and pass/traversal in the entries with the status new and unmarked. Entries with the status new have their variable build_entry set by allocating a string which stores the name of the phase and pass/traversal during which they were allocated. Entries with the status unmarked have their status changed to leak and their variable leak_path set by allocating a string with the name of the phase and pass/traversal during which they became detached from the AST.

[Figure: state diagram with the states empty, New, Unmarked, Marked and Leak; the transitions are labelled "A new Node is allocated", "Marking algorithm" and "Memory-Leak detector" (twice).]

Figure 4.2: A state machine of when the memory-leak-detector works with the AST memory manager
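The reporter pass described in this section can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: entries with the status new move to unmarked once build_entry has been recorded, and marked entries are reset to unmarked for the next round, as the garbage collector does. The names report_pass and copy_string and the flat entry table are hypothetical as well.

```c
#include <stdlib.h>
#include <string.h>

enum ast_status { ST_EMPTY, ST_UNMARKED, ST_MARKED, ST_NEW, ST_LEAK };

struct entry {
    void *pointer;
    enum ast_status status;
    char *build_entry; /* phase and pass/traversal of allocation */
    char *leak_path;   /* phase and pass/traversal after which it leaked */
};

/* Heap-allocated copy of s, or NULL when out of memory. */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Run after the marking traversal that follows the pass named by path.
 * Returns the number of entries newly classified as leaks. */
size_t report_pass(struct entry *table, size_t count, const char *path)
{
    size_t leaks = 0;
    for (size_t i = 0; i < count; i++) {
        struct entry *e = &table[i];
        switch (e->status) {
        case ST_NEW:
            e->build_entry = copy_string(path);
            e->status = ST_UNMARKED; /* assumption: joins the normal cycle */
            break;
        case ST_UNMARKED:
            e->leak_path = copy_string(path);
            e->status = ST_LEAK; /* processed, ignored from now on */
            leaks++;
            break;
        case ST_MARKED:
            e->status = ST_UNMARKED; /* assumption: reset for the next round */
            break;
        default:
            break; /* empty and leak entries are left untouched */
        }
    }
    return leaks;
}
```

Unlike the garbage collector, this pass does not free the leaked data-object, so its contents remain available when the report is written at the end of the compilation.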

After the compiler has finished its task, the reporting of the memory leaks begins. In the previous implementation, started by Frolich (2019), the user was made aware of how much memory was allocated and freed during a traversal. To provide more insight, additional information is recorded which gives the user more detail about what happens during each phase or pass/traversal. The report which is printed in the terminal when the compiler finishes now displays the number of nodes which were allocated, freed, lost and reached, and the size of the nodes which were allocated, lost and freed. Listing 4.1 shows an example with the information of a single pass.

    Name: Optimisation1
    Path: RootPhase.Optimisations
    Allocated num: 3
    Free num: 6
    Marked num: 233
    Leaks num: 0
    Allocated size: 96
    Free size: 168
    Leak size: 0

Listing 4.1: memory reporting 1

Expanding the existing reporting system helps to get a better understanding of what is happening in the background, but if something went wrong it does not provide adequate information to narrow down where the cause occurred. Therefore, besides the reporting shown in listing 4.1, a txt document is generated which lists the information of each data-object with the status leak. In listing 4.2 the BinOp node, which was described in the DSL specification as presented in subsection 2.1.3, has all its values written to that txt document.

    Leak 7 of 26
    Pointer: 0x5233060
    Type: NT_BinOp
    Op: BO_and
    Allocated: RootPhase.LoadProgram
    Leak: RootPhase.Optimisations
    Child Left: 0x5232fa0
    Child Right: 0x5231cb0

Listing 4.2: memory reporting 2

Printing the content of the child nodes might have been useful for the user of the CoCoNut framework. However, at the time of reporting this node, its child nodes might already have been deleted. By reporting the addresses of the memory leaks instead, the user has the option to manually confirm whether only this node leaked or more: by going through the list of all the leaks, the user can check whether any other leak has one of the child pointers as its address.

This level of reporting should allow the user to narrow down the functions which handle the transformations insufficiently, and to improve these functions so that no further leaks occur.


CHAPTER 5

Related Work

5.1 Boehm-Demers-Weiser Garbage collector

In the C programming language the user is responsible for the de-allocation of data-objects stored in the heap. If the user neglects to free these objects, they become unreachable for the program and end up being memory leaks. One way to capture these memory leaks in the C programming language is to use the Boehm–Demers–Weiser garbage collector (Boehm and Weiser 1988).

The Boehm–Demers–Weiser garbage collector keeps a list of all allocated chunks of data-objects. To determine whether these allocated chunks are accessible by the program, it requires information about where all the pointers are pointing. However, a compiled C program has no run-time information about whether any of its variables are pointers. Therefore, the Boehm–Demers–Weiser garbage collector treats any data item directly accessible to the program as a potential pointer. These data items consist of the contents of a register or a word on the stack. If the value of a data item corresponds to an entry in the list, that data item is assumed to be a pointer. The data items which are assumed to be pointers can be considered the root set. The objects which are accessible from the root set are similarly checked for data items which correspond to an entry in the list of all allocated chunks. With this information, the garbage collector uses the mark&sweep algorithm to pass through all the objects accessible from the root set and clear the objects which were not reached (Boehm and Weiser 1988).
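The idea of treating every word as a potential pointer can be illustrated with a short sketch. This is not code from the Boehm–Demers–Weiser collector: the chunk list, the function names and the use of a plain array of words as a stand-in for registers and stack are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* One allocated chunk known to the collector. */
struct chunk {
    void *base;
    size_t size;
    int marked;
};

/* Return the chunk whose address range contains word, or NULL. */
static struct chunk *find_chunk(struct chunk *chunks, size_t n, uintptr_t word)
{
    for (size_t i = 0; i < n; i++) {
        uintptr_t lo = (uintptr_t)chunks[i].base;
        if (word >= lo && word < lo + chunks[i].size)
            return &chunks[i];
    }
    return NULL;
}

/* Conservative marking: every word is treated as a potential pointer.
 * A word that falls inside a known chunk marks that chunk, and the
 * contents of the chunk are then scanned in the same way.
 * Chunks are assumed to be suitably aligned for uintptr_t. */
void scan_words(struct chunk *chunks, size_t n,
                const uintptr_t *words, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        struct chunk *c = find_chunk(chunks, n, words[i]);
        if (c != NULL && !c->marked) {
            c->marked = 1;
            scan_words(chunks, n, (const uintptr_t *)c->base,
                       c->size / sizeof(uintptr_t));
        }
    }
}
```

Because an integer that happens to look like an address also marks a chunk, this approach can keep garbage alive, but it never frees an object that is still reachable.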

The Boehm–Demers–Weiser garbage collector, while able to function on any C program, does not help the user in solving the memory leaks. With the implementation done for the CoCoNut framework, the user is able to narrow down during which phase, pass or traversal the memory leaks occur. With that information the user can fix the problems which cause the leaks and recompile their compiler without the garbage-collector or memory-leak-reporter, gaining efficiency with regards to computing power and memory usage.


CHAPTER 6

Conclusions

With the addition of the AST memory manager to the CoCoNut framework, additional checks can now be applied to all dynamically allocated data-objects which are part of the AST, improving the quality of the compiler. One set of improvements catches duplicate and unknown pointers in the AST to prevent possible run-time errors. The other set of improvements allows the user to activate the garbage collector or the memory leak reporter. With the garbage collector activated, the compiler checks for garbage-objects after every traversal to keep the total memory usage of the compiler down. With the memory leak reporter activated instead, the user can examine the memory leaks in more detail to narrow down the part of the code which can be improved. These additions allow the user to boost their productivity by checking for additional errors which previously could not be checked.

While the garbage collector implementation for the CoCoNut framework can be transferred to a different project, it is designed to make use of existing features of the CoCoNut framework. To make it work in a different C program, similar features would be required, such as a tree data structure and a method to traverse the nodes of that tree data structure. The memory leak reporter cannot be transferred in a similar manner, since it is more intertwined with the CoCoNut framework.


Bibliography

Boehm, Hans-Juergen and Mark Weiser (1988). “Garbage collection in an uncooperative environment”. In: Software: Practice and Experience 18.9, pp. 807–820.

Coltof, Lorian (2017). CoCoNut: a Metacompiler-based Framework for Compiler Construction in C: High-productivity, Traversal Optimization, and AST Serialization. BSc informatica, University of Amsterdam. University of Amsterdam.

Demers, Alan et al. (1989). “Combining generational and conservative garbage collection: Framework and implementations”. In: Proceedings of the 17th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. ACM, pp. 261–269.

Frolich, Damian (2019). CoCoNut: a Metacompiler-based Framework for Compiler Construction in C: High-productivity, Traversal Optimization, and AST Serialization. BSc informatica, University of Amsterdam. University of Amsterdam.

Jump, Maria and Kathryn S. McKinley (Jan. 2007). “Cork: Dynamic Memory Leak Detection for Garbage-collected Languages”. In: SIGPLAN Not. 42.1, pp. 31–38. issn: 0362-1340. doi: 10.1145/1190215.1190224. url: http://doi.acm.org/10.1145/1190215.1190224.

Timmerman, Maico (2017). CoCoNut: A metacompiler-based framework for compiler construction in C: Scalability, modularity, space leak detection and garbage collection. BSc informatica, University of Amsterdam. University of Amsterdam.

Wilson, Paul R. (1992). “Uniprocessor garbage collection techniques”. In: Memory Management. Ed. by Yves Bekkers and Jacques Cohen. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 1–42. isbn: 978-3-540-47315-2. doi: 10.1007/BFb0017182.
