
Solving Hofstadter's Analogies using Structural Information Theory


Layout: typeset by the author using LaTeX.


Geerten Rijsdijk (11296720)

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. G. Sileno
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020


Abstract

Because of their importance in cognition, analogies have been a topic of interest in both psychology and artificial intelligence for decades. To simplify the form of analogies, Hofstadter proposed a micro-world in which analogies consist of strings of letters. An example of such an analogy is abc:abd::pqr:?, which should be read as abc is to abd like pqr is to what?. Structural Information Theory, or SIT, is a theory on perceptual organization, centered around the SIT coding language, which defines a way of describing string structure by compression. This research project explores the possibility of utilizing the SIT coding language to solve Hofstadter's analogies. First, the minimal coding algorithm PISA is implemented, which is currently the fastest algorithm for string compression using the SIT language. Second, an analogy solving algorithm is investigated, which uses PISA compression in its solving process. Comparison with human answers shows that the algorithm generally produces answers similar to those of humans and achieves better results than one of the current best methods. Nevertheless, the results also show areas where there is still room for improvement with regard to the methods used to generate and rank the answers.


Contents

1 Introduction
  1.1 Hofstadter's Analogies
  1.2 Structural Information Theory
  1.3 Research Project
2 Background and Representations
  2.1 The SIT Coding Language
  2.2 PISA
  2.3 Hyperstrings
    2.3.1 S-graphs
    2.3.2 A-graphs
3 Compression and Decompression
  3.1 Decompressor
  3.2 Brute-force Compressor
  3.3 PISA-based Compressor
    3.3.1 QUIS
    3.3.2 The Graph Class
    3.3.3 Constructing and Using S-graphs
    3.3.4 Constructing and Using A-graphs
    3.3.5 The Compression Algorithm
    3.3.6 Complexity Metric
    3.3.7 Differences with PISA
4 Analogy Solving
  4.1 Overview
  4.2 Symbol Distance
  4.3 Symbol Substitution
    4.3.1 Representation by Compression
    4.3.2 Representation by Chunking
  4.4 Structure in Parameters
5 Results
  5.1 Compressor Comparison
  5.2 Analogy Solving
    5.2.1 Murena's dataset
    5.2.2 New Dataset
    5.2.3 On Measuring Complexity
6 Discussion and future developments


1 Introduction

Analogies are a common part of life; our ability to understand them is critical in problem solving, humor, metaphors and argumentation [6]. Analogical reasoning is also one of the predominantly measured abilities on IQ tests. In psychology, an analogy is the process of understanding new information through its structural similarity with old information [8]. Schematically, any analogy can be expressed as A is to B what C is to D.

Because of their importance to cognition, analogies have interested researchers in the field of artificial intelligence. Systems for analogical computation have been created for many different purposes, such as solving puzzles based on objects in images [3], obtaining information by inference [7], understanding the development of analogical reasoning in children [13], or even as support in suggesting specialized care for patients with dementia [18].

1.1 Hofstadter's Analogies

To model analogical reasoning, Douglas Hofstadter proposed a micro-world for analogy-making [12]. Here, the elements of the analogies are strings of letters. An example of such an analogy is:

ABC:ABD::BCD:? which should be read as: ABC is to ABD like BCD is to ?.

In order to predict a human answer to such an analogy, Hofstadter created a computer program called Copycat [12]. To complete a given analogy, the program works with agents, which gradually build up structures representing the understanding of the problem, eventually reaching a solution. Later, the Copycat program was improved upon with Metacat [11], which adds a memory, allowing the program to prevent itself from performing actions it has previously tried. Metacat, which was last updated in 2016, plausibly represents the state of the art of algorithms available for this problem.

Both Copycat and Metacat are based on the idea that there are structures on one side of the analogy that need to be replicated on the other side. This project is based on the intuition that Structural Information Theory provides an alternative way of describing such structure.

1.2 Structural Information Theory

Structural Information Theory, or SIT, is a theory about perception with roots in Gestalt psychology. Central to SIT is the simplicity principle. This principle can be seen as a formalization of Occam's razor, the idea that the simplest explanation for data is likely the correct one [9]. The SIT coding language was created as a tool to work with the simplicity principle. The language applies to strings, and is used to describe those strings using regularities. A representation of a string in terms of regularities is called an encoding. Such an encoding can also be seen as a compression, since describing strings using regularities can greatly decrease the amount of memory needed to store them (both in computers and in one's mind). The regularities used in the SIT coding language are iteration, symmetry and alternation:

Iteration: 3*(A) ⇒ AAA
Symmetry: S[(A)(B),(C)] ⇒ ABCBA
Alternation: <(A)>/<(B)(C)> ⇒ ABAC

These operations were chosen due to two properties they uniquely share, which will be discussed in more detail later. Empirically, SIT has been validated in several cognitive experiments with human participants [9][17].

SIT proposes to map the application of the simplicity principle to the Minimal Encoding Problem using the SIT language: given a string, use regularities to find an encoding with as little complexity as possible [9]. For small strings, such encodings can be found without much effort, but as strings get longer this problem can become very complex.

1.3 Research Project

This project explores the possibility of using Structural Information Theory to solve Hofstadter's analogies. The bridge between these two concepts is compression: string analogies rely on an underlying structure in the strings, this structure can be captured by an adequate compression of the strings, and the SIT coding language defines an (empirically validated) way to perform string compression.

The project is divided in two parts. First, it focuses on implementing the current best methods to solve the minimal coding problem in the SIT coding language. Second, it explores and elaborates on the use of this language to create an algorithm that solves Hofstadter's string analogies. Because SIT is concerned only with structural information, while Hofstadter's analogies intuitively also rely on some metrical information (e.g. A:B::C:?), the SIT encoding needs to be extended in an appropriate way to actually be used for solving analogies. The document ends with an evaluation section, in which the answers generated by the solving algorithm are compared against human answers, as well as answers generated by Metacat.


2 Background and Representations

This section gives a more detailed description of the SIT coding language. In addition, the theory behind the PISA algorithm is explained, as well as the representations that are needed for it to work.

2.1 The SIT Coding Language

The SIT coding language was created to formalize the simplicity principle in vision [17], that is, the idea that our visual system tends to select the simplest explanation for given perceptual stimuli. The SIT coding language makes this idea more tractable by representing visual stimuli as strings of characters. Now, a simplest explanation of a string can be represented as a simplest code, or a code with minimal complexity. Such a code can also be called an encoding or a compression. The process of turning a string into an SIT code is called encoding or compressing, while the process of turning a code back into the original string is called decoding or decompressing. The coding language used by SIT encodes strings using three regularities: iteration (I-form), symmetry (S-form) and alternation (A-form), which together are called ISA-forms. The language is formally defined as follows:

I-form: n*(ȳ) ⇒ yy...y (n times, n ≥ 2)
S-form: S[(x̄0)(x̄1)...(x̄n),(p̄)] ⇒ x0x1...xn p xn...x1x0
S-form: S[(x̄0)(x̄1)...(x̄n)] ⇒ x0x1...xn xn...x1x0
A-form: <(ȳ)>/<(x̄0)(x̄1)...(x̄n)> ⇒ yx0 yx1 ... yxn
A-form: <(x̄0)(x̄1)...(x̄n)>/<(ȳ)> ⇒ x0y x1y ... xny
Otherwise: D(t) ⇒ t

In this definition, (ȳ), (x̄i) and (p̄) are called chunks. Specifically, the chunk (ȳ) in the I- and A-forms is called a repeat, the chunk (p̄) in the first S-form is called a pivot, and the chunk (x̄i) is called an S-chunk or an A-chunk, depending on whether it occurs in an S- or A-form. Chunks can consist of one or more symbols. The last line in the definition refers to the fact that any string that is not one of the ISA-forms does not change when decompressed [17].

Theoretically, strings that are compressed using these rules can consist of any symbols. However, for simplification, this project will use only strings of alphabetic characters.

It has been proven that the regularities used in the SIT coding language are the only transparent holographic regularities [1]. Here, holographic refers to the regularities being invariant under growth: a repetition of a symbol can have that same symbol appended indefinitely and will remain a repetition. Similarly, a symbol string that forms a symmetry can have the same symbol chunk added to both sides and will remain a symmetry. Transparent refers to the arguments of the regularities being visible in the original string:

• in 3*(a), (a) occurs in aaa.
• in S[(a)(b),(c)], (a)(b)(c) occurs in abcba.
• in <(a)>/<(b)(c)(d)>, (b)(c)(d) occurs in abacad (although alternated with the repeat).

Given a string, there are generally several codes that compress it. To find the code with minimal complexity, a way of calculating complexity must be defined. There are multiple ways of doing this, but the most used method is the I_new load [1]. This method calculates complexity by taking the number of symbols in the code and adding the number of chunks in the code that contain more than one element and are not S-chunks. This complexity metric is also the one used in PISA.
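As a worked example of this definition: the code 2*(ab) contains two symbols (a and b) and one chunk with more than one element that is not an S-chunk, namely (ab), so its I_new load is 2 + 1 = 3. The code S[(a)(b),(c)] contains three symbols and only single-element chunks, so its load is also 3. Note that under this metric the codes a and 9*(a) both receive load 1, a property returned to in section 3.3.6.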


2.2 PISA

A string of length N can be represented by up to a superexponential O(2^(N log N)) number of codes [4]. To find the one with the lowest complexity, one could generate each possible code and compare all of these. For long strings this can be very time consuming.

To efficiently find the minimal coding of a string in the SIT coding language, Peter A. van der Helm designed the PISA (Parameter load plus ISA-rules) algorithm [4]. PISA is significantly faster than a method which generates all possible encodings, being only weakly exponential.

2.3 Hyperstrings

PISA uses a concept called transparallel processing. This form of processing is similar to distributed processing in that it uses a special data distribution to be able to perform many subtasks simultaneously. But where distributed processing performs the subtasks on different processors, transparallel processing allows the subtasks to be done simultaneously on one processor as if there were only one subtask. In the SIT minimal coding problem, this means being able to encode multiple (sub)strings simultaneously.

To apply transparallel processing to the SIT minimal coding problem, van der Helm created the concept of hyperstrings. A formal definition is given in the paper Transparallel processing by hyperstrings [5]. Here, the fundamentals are explained.

A hyperstring is a representation of a string as a directed, acyclic graph with exactly one source and one sink. Here, "directed" means that every edge in the graph has a direction, and "acyclic" that no path in the graph has the same start and end node. A hyperstring has exactly one Hamiltonian path, i.e. a path from source to sink that visits every vertex exactly once.

Figure 1: A hyperstring of the string aabcba (nodes 0-6; the edges shown include 2*(a), S[(a)(b),(c)], S[(b),(c)] and <(b)>/<(c)(a)>). Not all possible edges are shown.

Edges of a hyperstring represent parts of the original string. They can be labeled with substrings of the original string, or with encodings of those substrings. Hyperstrings allow many different encodings of a string to be represented in one data structure. For example, the hyperstring of figure 1 represents, among others, the code aabcba with the path (0,1,2,3,4,5,6), the code 2*(a)<(b)>/<(c)(a)> with the path (0,2,6) and the code aS[(a)(b),(c)] with the path (0,1,6).

Because this data structure represents many different codes at once, encoding a part of the hyperstring (for example, by adding an edge from 1 to 5 with label <(a)(c)>/<(b)> to encode abcb) encodes this part for many different codes simultaneously. This 'simultaneous encoding' is fundamentally what makes PISA efficient.

One of the challenges in the minimal coding problem is the fact that the arguments of symmetries and alternations can often be hierarchically recoded. In other words, the arguments of a symmetry or alternation that is used to compress a code could themselves also be compressible. Take for example the symmetry S[(a)(b)(a)(b), (c)]. Here, the arguments (a)(b)(a)(b) can be compressed into 2*((a)(b)), resulting in the symmetry S[2*((a)(b)), (c)]. This possibility means that an algorithm for this problem has to be called recursively on arguments. Hyperstrings can be used to simultaneously encode these arguments in so-called S-graphs and A-graphs.

2.3.1 S-graphs

S-graphs are graphs that represent possible symmetries in a string. Similar to how hyperstrings can represent many different codes at once, these graphs represent many different symmetries at once; one S-graph can represent all possible symmetries of a (hyper)string, for one pivot point.

Note that a pivot point is not the same as a pivot. Where a pivot is the chunk of characters that all S-chunks are the same distance away from (i.e. have the same number of symbols separating them), a pivot point is a single point (either at a character or in between two characters) that all S-chunks are the same distance away from. For example, the string abcba can be represented as the symmetries S[(a)(b), (c)] and S[(a), (bcb)]. The former has pivot (c), the latter has pivot (bcb), but both have their pivot point at c. In a case like S[(a), (bc)], the pivot is (bc), but the pivot point is between b and c.

Figure 2: S-graphs for the string ababpbaba (left) and the string ababpabab (right), with the pivot point at p.

Figure 2 shows examples of S-graphs for two similar strings. In S-graphs, the S-chunks are represented by edges between the nodes (unbroken arrows). In addition to the normal nodes and edges, S-graphs also have a pivot node, which represents the pivot point. Edges towards the pivot node (dashed arrows) represent the pivots of the symmetries. In figure 2, the pivot edges are not labeled, but their labels can be inferred from the strings.

Any path from any node to the pivot node that visits at least one node in between represents a unique symmetry. For example, in the left graph of figure 2, the path (0,1,2,3,4,5) represents the symmetry S[(a)(b)(a)(b), (p)] and the path (0,3,5) represents the symmetry S[(aba), (bpb)]. In the right graph, the path (0,2,4,5) represents the symmetry S[(ab)(ab), (p)] and the path (1,3,5) represents the symmetry S[(ba), (bpa)]. The last example shows that symmetries do not necessarily have to cover the entire string.

As mentioned before, a pivot point can also be in between two characters. When this is the case, pivot edges can have empty labels. For example, the left graph in figure 2 could also be an S-graph of the string ababbaba, with the pivot point in between the two b's. In that case, the path (0,1,2,3,4,5) would represent the symmetry S[(a)(b)(a)(b)] and the path (0,3,5) would represent the symmetry S[(aba), (bb)].

Note that S-graphs themselves are not actually hyperstrings, but rather consist of one or more independent hyperstrings. In figure 2, the left graph consists of one hyperstring with nodes 0, 1, 2, 3 and 4. The right graph, however, consists of two independent hyperstrings: one with nodes 0, 2 and 4, and one with nodes 1 and 3 (node 5 is not included because edges towards 5 should not be encoded).

2.3.2 A-graphs

A-graphs are graphs that represent many different alternations. Similar to how a (hyper)string has different S-graphs for different pivot points, a (hyper)string also has different A-graphs for different repeat lengths. A-graphs come in two types: one for left alternation and one for right alternation.

Figure 3: A left alternating A-graph (left) and a right alternating A-graph (right) for the string abacbac. The repeat length is 1.

Figure 3 is an example of the two types of A-graphs for a string. In A-graphs, the A-chunks and the repeats are still together in the labels. In figure 3, these are separated by a vertical line for clarity.

In A-graphs, there will often be edges labeled with (a part of) a repeat and an empty A-chunk (fig. 3, edges 6→7 and 0→1). These so-called pseudo A-chunks are necessary to maintain the structure of the A-graphs. They should, however, not be used to construct an alternation.

In an A-graph, any path that has at least two edges and does not contain a pseudo A-chunk represents an alternation. For example, in the left graph in figure 3, the path (0,2,5,7) represents the alternation <(a)>/<(b)(cb)(c)> and the path (1,4,7) represents the alternation <(b)>/<(ac)(ac)>. In the right-hand graph, the path (1,3,6) represents the alternation <(b)(cb)>/<(a)>.

Like S-graphs, A-graphs themselves are not hyperstrings. Instead, A-graphs consist of multiple, connected hyperstrings. In left alternating A-graphs, these hyperstrings are connected at the sink. In right alternating A-graphs, the connection is at the source. The left graph of figure 3, for example, consists of a hyperstring with nodes 0, 2, 5 and 7, one with nodes 1, 4 and 7, and a third with nodes 3, 6 and 7.


3 Compression and Decompression

The following section outlines the algorithms created to perform compression and decompression of strings using the SIT coding language. In addition to the decompressor and the PISA-based compressor, the initial brute-force compressor is also described, which was used in the project before being replaced by the PISA-based compressor.

3.1 Decompressor

A decompressor is a function that takes as input an SIT code and returns the string that was encoded. Algorithm 1 shows the structure of the decompressor written for this project.

The decompressor continuously looks for all lowest-level operators (or lowest-level ISA-forms) in the code (line 3). Here, lowest level refers to an operator not having any other operators in its arguments. In lines 4-6, these are replaced with their decompression, which is a simple process given that all arguments of the ISA-forms are strings. This single-layer decompression step may cause other operators to become lowest level, and is repeated until there are no operators left in the code (line 2).

Algorithm 1: Decompressor

1 Function decompress(code)
2   while there are operators left in code do
3     find all lowest level operators
4     for op in operators do
5       decompress_lowest_operator(op)
6       replace operator in code with its decompression
7     end
8   end
9   return code
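As an illustration, a minimal Python sketch of this lowest-level expansion strategy is given below. It is not the thesis code: it assumes codes over lowercase symbols whose operator arguments are plain symbol chunks (hierarchically recoded arguments such as 2*((a)(b)) are not handled), and only the name decompress mirrors the pseudocode.

```python
import re

CHUNKS = r"((?:\([a-z]+\))+)"  # one or more chunks of plain symbols

def _chunks(s):
    """Split '(a)(bc)' into ['a', 'bc']."""
    return re.findall(r"\(([a-z]+)\)", s)

# Each lowest-level ISA-form, matched only when its arguments are plain symbols.
PATTERNS = [
    # Iteration: n*(y) => y repeated n times
    (re.compile(r"(\d+)\*\(([a-z]+)\)"),
     lambda m: int(m.group(1)) * m.group(2)),
    # Symmetry with pivot: S[(x0)...(xn),(p)] => x0..xn p xn..x0
    (re.compile(r"S\[" + CHUNKS + r",\(([a-z]+)\)\]"),
     lambda m: "".join(_chunks(m.group(1))) + m.group(2)
               + "".join(reversed(_chunks(m.group(1))))),
    # Symmetry without pivot: S[(x0)...(xn)] => x0..xn xn..x0
    (re.compile(r"S\[" + CHUNKS + r"\]"),
     lambda m: "".join(_chunks(m.group(1)))
               + "".join(reversed(_chunks(m.group(1))))),
    # Left alternation: <(y)>/<(x0)...(xn)> => y x0 y x1 ... y xn
    (re.compile(r"<\(([a-z]+)\)>/<" + CHUNKS + r">"),
     lambda m: "".join(m.group(1) + x for x in _chunks(m.group(2)))),
    # Right alternation: <(x0)...(xn)>/<(y)> => x0 y x1 y ... xn y
    (re.compile(r"<" + CHUNKS + r">/<\(([a-z]+)\)>"),
     lambda m: "".join(x + m.group(2) for x in _chunks(m.group(1)))),
]

def decompress(code):
    """Expand lowest-level ISA-forms until no operators remain."""
    changed = True
    while changed:
        changed = False
        for pattern, expand in PATTERNS:
            new_code = pattern.sub(expand, code)
            if new_code != code:
                code, changed = new_code, True
    return code

print(decompress("2*(a)<(b)>/<(c)(a)>"))  # -> 'aabcba'
```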

3.2 Brute-force Compressor

The initial method of generating encodings of strings was a brute-force compressor. This compressor generates all possible encodings of a string. Algorithm 2 shows the structure of this compressor.

The algorithm works by processing splits of the string, i.e. ways the string can be cut up into components. For example, abc can be cut into [abc], [ab,c], [a,bc] and [a,b,c]. Using splits allows the program to consider every possible code with every possible grouping of characters.
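Splits can be enumerated by choosing cut points between characters. A small sketch of this (the helper name splits is hypothetical, not from the thesis):

```python
from itertools import combinations

def splits(s):
    """All ways to cut a string into contiguous components, e.g.
    splits('abc') -> [['abc'], ['a','bc'], ['ab','c'], ['a','b','c']]."""
    n = len(s)
    result = []
    for k in range(n):                          # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([s[i:j] for i, j in zip(bounds, bounds[1:])])
    return result
```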

In lines 2-4, lists are created to keep track of which codes have been found (line 2), which splits have been processed (line 3) and which splits still need to be processed (line 4).

The algorithm runs as long as there are still splits to be processed. In lines 7-9, the current split is tested against the list of already processed splits, to prevent the same work from being done more than once. Line 10 joins all elements in a split to create an SIT code.

In line 11, new splits are created from the current split and added to the list of splits to be processed. A new split is added for every way a part of the split can be compressed using a single operator. For example, the split [a, a, b, a, a] yields the new splits [2*(a), b, a, a], [a, S[(a),(b)], a] and [a, a, b, 2*(a)]. When a split cannot be compressed further, nothing new is added to the list of splits.


As mentioned before, whenever a symmetry or an alternation is created, the arguments of this regularity could also be compressible. Therefore, the compressor function is recursively called on these arguments. For every way the arguments can be recoded, a new split is added to the splits list.

Algorithm 2: Brute-force compressor

1 Function compress_bf(characters)
2   all_codes = []
3   processed_splits = []
4   create list of splits of characters splits
5   while splits is not empty do
6     take split s out of splits
7     if s in processed_splits then
8       skip the rest of this loop
9     end
10    create a code from s and add it to all_codes
11    create new splits from s and add these to splits
12  end
13  return all_codes

As a string of length N can be represented by a superexponential O(2^(N log N)) number of codes [4], and this compressor generates each of these codes, it has a superexponential worst-case time complexity.

3.3 PISA-based Compressor

Because of the poor time complexity of the brute-force compressor, it was decided to implement PISA. There were, however, several uncertainties regarding the exact workings of PISA. Answers to these uncertainties were not found in the literature [17], and contact with the original author of PISA could not be established. Because of this, some of the methods used in this version of PISA are not the same as the ones PISA uses. These changes are described at the end of this subsection.

3.3.1 QUIS

When compressing a (hyper)string, it is often necessary to compare the identities of (hyper)substrings. In Python, string comparison using the == operator iterates over pairs of string characters until a distinct pair is found or one or both strings end [15]. This means that in the worst case, comparing strings of length N takes O(N) time.

However, with compression, extra information is available: all strings that could be compared are substrings of a given string. This makes it possible to represent substrings as integers in an N by N triangular matrix. In this matrix, the N rows represent the N possible starting indexes a substring can have, and the N columns represent the N possible lengths a substring can have. Note that a string of length N always has N possible starting indexes and N possible substring lengths. By assigning identical integers to identical substrings in the columns of this matrix, and calculating this matrix before the comparisons are needed, substring comparisons can be reduced to an O(1) integer comparison.


For this task, Peter A. van der Helm, in collaboration with Peter Desain, developed the QUIS (QUick Identification of all Substrings) algorithm, which he describes in his book Simplicity in Vision [17]. The algorithm takes as input a (hyper)string and a triangular matrix labels, initialized with labels[b,k] = b for each pair b, k. Note: b is used to indicate the start index of a substring, and k is used to indicate the length.

The QUIS algorithm uses linked lists to keep track of which substrings are identical. This is done using 3 arrays (which are just lists, but array is used to avoid confusion with the name linked list), which contain numbers corresponding to character indices of the string (here, strings are indexed starting with 1):

• NextList, which keeps track of the starting points of all linked lists. A value in NextList indicates three things simultaneously: the first value of the current linked list, the position of the second value of the current linked list in NextOcc, and the position of the start of the next linked list in NextList.

• NextOcc, which keeps track of the rest of the linked lists. A value in NextOcc indicates two things: a value in a linked list, and the index in NextOcc at which the next value of that same linked list occurs.

• LastOcc, which keeps track of which substrings were last expanded with which characters. LastOcc[a] = b indicates that the last substring that was expanded by an element with label a was the substring at label b. This array starts out empty whenever a new linked list is looked at.

Algorithm 3 shows the structure of the QUIS algorithm, as described in Simplicity in Vision [17].

Algorithm 3: QUIS

1 Function QUIS(string, labels)
2   Initialize lists for 1-element substrings
3   while there are still lists do
4     Remove lists of length 1
5     for list in NextList do
6       for b in list do
7         update labels with b
8         expand substring at b and update the three lists accordingly
9       end
10    end
11  end

In line 2, the algorithm starts by initializing the three arrays for 1-element substrings. For example, for the string abacab, the arrays would look like this:

• NextList: [1, 2, 4, None, None, None, None]
• NextOcc: ['-', 3, 6, 5, None, None, None, None]
• LastOcc: initially empty (it is reset whenever a new linked list is looked at)

Note that the first elements of NextOcc and LastOcc are never used, since the strings are indexed starting at 1.

These lists describe the linked lists (1,3,5), (2,6) and (4), corresponding to the substrings a, b and c, respectively.


In line 4, linked lists of length 1 are removed. In the example, this means the list consisting of only 4.

Lines 5-6 loop over all linked lists that are left. In line 7, labels[b, k] is updated to have the value of the first item in the linked list, ensuring that for the current substring length, identical substrings have identical integers. Lastly, line 8 expands the substrings. When substrings expand with different elements, the linked list is split into multiple linked lists, one for each unique substring. In the example lists, an expansion to substrings of length 2 would lead to the following lists:

• NextList: [1, 2, 3, 6, None, None, None]
• NextOcc: ['-', 5, None, None, None, None, None, None]

These lists describe the linked lists (1,5), (2), (3), and (6), corresponding to the substrings ab, ba, ac, and b, respectively. Note that there is still a substring of length 1. This is because the substring is the last character of the string, and could not be expanded. Furthermore, substring ca is not present, since the list corresponding to c was removed earlier. The algorithm runs until no linked lists remain.

Running the algorithm on the example string results in the following matrix:

1 1 1 1 1 1
2 2 2 2 2 0
1 3 3 3 0 0
4 4 4 0 0 0
1 1 0 0 0 0
2 0 0 0 0 0

In this matrix, the identical integers at positions (1,1), (3,1) and (5,1) (with 1-based indexing, where the first row/column is indicated with 1, the second with 2, etc.) indicate that the substrings of length 1 starting at positions 1, 3 and 5 are identical. Indeed, in the string abacab, this is the case. The identical integers at positions (1,2) and (5,2) indicate that the substrings of length 2 starting at positions 1 and 5 are identical. Once again, this is indeed the case, as both of these substrings are ab.
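For illustration, the same labels matrix can be built naively, without QUIS's linked lists and at a higher cost. This sketch (the helper name is hypothetical) only reproduces QUIS's end result:

```python
def substring_labels(s):
    """Triangular identity matrix: labels[b][k-1] is equal for two start
    positions b exactly when their length-k substrings are identical.
    Labels are the 1-based index of a substring's first occurrence."""
    n = len(s)
    labels = [[0] * n for _ in range(n)]
    for k in range(1, n + 1):            # substring length
        first_seen = {}
        for b in range(n - k + 1):       # start index (0-based here)
            sub = s[b:b + k]
            first_seen.setdefault(sub, b + 1)
            labels[b][k - 1] = first_seen[sub]
    return labels

for row in substring_labels("abacab"):
    print(row)   # reproduces the matrix above
```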

3.3.2 The Graph Class

The PISA compressor makes use of graphs to represent hyperstrings, S-graphs and A-graphs. In the version of the algorithm developed for this project, graphs are objects, implemented as Python classes. This is different from the original implementation of PISA, which was written in C and did not use objects.

The variables of such a Graph class are a list of the nodes of the graph, as well as a dictionary of edges between the nodes. These edges consist of a label and a load, or complexity.

The class has the following functions:

• next(n): Returns the node with the smallest label that node n has an edge toward.
• get_edge(n, n2): Returns the label and load of the edge between nodes n and n2.
• subgraph(n, n2): Returns a graph object with all nodes in between nodes n and n2 and all edges between these nodes.
• get_hamil_path(): Returns the Hamiltonian path of the graph.
• path_len(n, n2): Returns the length of the Hamiltonian path between nodes n and n2.
• find_shortest_path(n, n2): Uses Dijkstra's shortest path algorithm [2] to find the path between nodes n and n2 with the lowest load.
• find_all_paths(n, n2): Returns a list of all paths between nodes n and n2.
• start_nodes(): Returns every node without incoming edges. These nodes are the sources of individual hyperstrings in the graph.
• split_hyperstrings(): Returns a list of graph objects, one for each hyperstring in the graph. If the graph is a hyperstring, returns itself.
• find_best_iteration(n, n2): Searches for a path between node n and node n2 that can be represented as an iteration. The function first looks at edges of length 1 and increases this length by one until an iteration is found, or the length is greater than half the number of nodes between n and n2. If one is found, an iteration is constructed and returned.
• clear(): Removes all edges from the graph.
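A condensed sketch of such a class is shown below, covering only a few of the listed functions; the edge representation and other details are assumptions, not the thesis code:

```python
import heapq

class Graph:
    """Nodes plus labeled, weighted edges; edge values are (label, load)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.edges = {}                       # (n, n2) -> (label, load)

    def add_edge(self, n, n2, label, load):
        self.edges[(n, n2)] = (label, load)

    def get_edge(self, n, n2):
        return self.edges.get((n, n2))

    def start_nodes(self):
        """Nodes without incoming edges: sources of individual hyperstrings."""
        targets = {n2 for (_, n2) in self.edges}
        return [n for n in self.nodes if n not in targets]

    def find_shortest_path(self, n, n2):
        """Dijkstra's algorithm on edge loads (linear edge scan for clarity)."""
        best = {n: 0}
        queue = [(0, n, [n])]
        while queue:
            load, u, path = heapq.heappop(queue)
            if u == n2:
                return path
            for (a, b), (_, w) in self.edges.items():
                if a == u and load + w < best.get(b, float("inf")):
                    best[b] = load + w
                    heapq.heappush(queue, (load + w, b, path + [b]))
        return None
```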

3.3.3 Constructing and Using S-graphs

S-graphs are implemented as child classes of the Graph class, meaning that they inherit the variables and functions of the Graph parent class. Algorithm 4 explains how an S-graph is constructed from a hyperstring, a pivot point and an identity matrix Q created by the QUIS algorithm. The algorithm functionally maps to the definition of an S-graph provided in Transparallel processing by hyperstrings [5].

The algorithm starts by finding the distance around the pivot point that should be looked at (e.g. for the string abcdedc with a pivot point at e, the distance is 2, since a and b are too far away to be part of a symmetry). In lines 3-6, a graph is initialized with the correct nodes. Lines 7-8 iterate over all possible substrings that could be an S-chunk. Line 9 calculates the starting position of the substring on the other side of the pivot that it needs to be compared to. In line 10, the QUIS matrix is used to check whether these substrings are identical. If this is the case, lines 11-14 add the correct edges to the graph.


Algorithm 4: Constructing an S-graph

1 Function construct_Sgraph(hyperstring, p, Q)
2   dist = min(len(hyperstring) - p - 1, p)
3   g = copy(hyperstring)
4   remove all edges from g
5   remove all nodes from g that are higher than the pivot point p
6   add pivot node to g
7   for b in p - dist ... p do
8     for k in 1 ... p - b do
9       b2 = 2*p - b - k + 1
10      if Q[b, k] == Q[b2, k] then
11        edge1 = edge of hyperstring at b → b+k
12        add edge1 to g at b → b+k
13        edge2 = edge of hyperstring at b+k → b2
14        add edge2 to g at b+k → pivot node
15      end
16    end
17  end
18  return g

After the S-graph has been constructed, it can be used to find a symmetry in the hyperstring the S-graph was constructed from. This is implemented as the function get_symmetry(n, n2) of the S-graph class, which works in the following way:

Suppose you want to find a symmetry in the original hyperstring between nodes n and n2. It is first checked whether the pivot of the S-graph is exactly in between the two nodes; if it is not, no symmetry is possible. As explained in section 2.3.1, S-graphs have a node representing the pivot point. The find_shortest_path function of the parent class Graph is used to find the best path between node n and this pivot node. From this shortest path, the symmetry can be constructed. Each edge in the path represents an S-chunk, except for the edge towards the pivot node, which represents the (possibly empty) pivot.

3.3.4 Constructing and Using A-graphs

Like S-graphs, A-graphs are implemented as child classes of the Graph class. As stated in section 2.3.2, there are two types of A-graphs: one for left alternation (left A-graph) and one for right alternation (right A-graph). Algorithm 5 explains how a left A-graph is constructed from a hyperstring, a repeat length and an identity matrix Q created by the QUIS algorithm. The algorithm is functionally an implementation of the definition of an A-graph provided in Transparallel processing by hyperstrings [5].

In lines 3-4, a Graph class is initialized with the correct nodes. Lines 5-6 iterate over every possible combination of indices in the (hyper)string. In line 7, the QUIS matrix is used to compare the substrings at the indices with the specified repeat length. If they are the same, lines 8-11 add the correct edges to the graph.


Algorithm 5: Constructing a left A-graph

1 Function construct_LeftAgraph(hyperstring, rep_len, Q)
2   N = number of nodes in hyperstring
3   g = copy(hyperstring)
4   remove all edges from g
5   for b in 0 ... N - 1 do
6     for b2 in b+1 ... N do
7       if Q[b, rep_len] == Q[b2, rep_len] then
8         edge1 = edge in hyperstring at b → b2
9         add edge1 to g at b → b2
10        edge2 = edge in hyperstring at b2 → sink
11        add edge2 to g at b2 → sink
12      end
13    end
14  end
15  return g

The code for constructing a right A-graph is very similar to the code for the left A-graph. The main differences are that the edges used are source→b and b→b2, rather than b→b2 and b2→sink, and that the comparison in line 7 is done by looking at the ends of substrings, rather than at the beginnings.

After the A-graph has been constructed, it can be used to find an alternation in the hyperstring it was constructed from. The function get_alternation(n, n2) does this.

Suppose you want to find an alternation in the original hyperstring between nodes n and n2. This is simply done by using the find_shortest_path function of the Graph class to find the best path between nodes n and n2. From this path, if it exists, an alternation can be constructed. Each edge in the path represents an A-chunk as well as the repeat, which can be separated to find the code for the alternation.

3.3.5 The Compression Algorithm

The general outline of the PISA-based compressor can be seen in Algorithm 6. Unlike the brute-force compressor, the PISA-based compressor takes as input a graph object. This is because, like the brute-force compressor, this compressor is recursive, and will be called on S-graphs and A-graphs.

The algorithm processes each hyperstring in the input graph separately (line 3). In line 4, the QUIS algorithm is called for each hyperstring. The resulting matrix is used in lines 5 and 7 to create S- and A-graphs for the hyperstring. These graphs are themselves also encoded using this function. Lines 6 and 8 loop over every combination of two nodes (v, w) in the hyperstring. For each pair, line 9 looks for the best possible code for the substring between these two nodes. This best code is chosen from:

• The current code.

• The best possible iteration, if any.

• The best possible symmetry, if any, calculated using the S-graph that has a pivot halfway in between the two nodes.
• The best possible alternation, if any, calculated using the A-graphs.


When the best code has been found, line 10 adds an edge representing the code to the hyperstring. Next, line 11 iterates over every node that comes before node v. Line 12 looks at the complexities of the codes between u, v and w. If the complexity of the edge u→w can be reduced by creating a combination of the codes in edges u→v and v→w, this is done.

At the end of the algorithm, all encoded hyperstrings are recombined into a new graph. This graph is then returned.

Algorithm 6: PISA-based compressor

1 Function compress_pisa(graph)
2   new_hyperstrings = []
3   for hyperstring h in graph do
4     Q = QUIS(h)
5     Create and encode S-graphs of h
6     for w in h.nodes[1 ... N] do
7       Create and encode A-graphs of h, up to node w
8       for v in h.nodes[w ... 0] do
9         find best possible code for v→w
10        add best code to h as edge v→w
11        for u in h.nodes[0 ... v] do
12          if c(u, w) > c(u,v) + c(v,w) then
13            new_code = u→v + v→w
14            add new_code to h as edge u→w
15          end
16        end
17      end
18    end
19    add h to new_hyperstrings
20  end
21  g = combine(new_hyperstrings)
22  return g

When this algorithm is used on a hyperstring that represents a string, the same hyperstring is returned, with added edges that represent the best code for each substring. The code of the edge connecting the first and last node of the hyperstring will be the best encoding of the entire string. Although this is the optimal encoding, every path in the hyperstring represents a code, which consists of several optimally encoded substrings that together form the entire string.

3.3.6 Complexity Metric

The complexity metric used by this compressor is not the I_new load that PISA uses. Instead, the complexity metric was created specifically for the analogy solving purposes of this project, and is based on the more general principles of Kolmogorov complexity [16].

This change was made for multiple reasons. For one, the I_new load does not really allow for adjusting parameters to align with human answers. Furthermore, the I_new load seems somewhat unintuitive, assigning for example the same complexity to the codes a and 9*(a). At a more fundamental level, SIT was conceived for structural information, but analogies also require looking at metrical information. The complexity metric considered here calculates a load by taking the number of symbols in a code and adding, for each operator in the code, a certain value. This value might differ between the different types of operators, and can be adjusted later to find the values for optimal performance of the analogy solving algorithm.
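A sketch of such a tunable metric is given below; the weights and the regex-based operator counting are illustrative assumptions, not the thesis implementation:

```python
import re

# One adjustable weight per operator type; the values here are placeholders.
OP_WEIGHTS = {"I": 1.0, "S": 1.0, "A": 1.0}

def load(code):
    """Number of symbols plus a per-operator-type weight for each operator."""
    symbols = len(re.findall(r"[a-z]", code))
    n_iter = len(re.findall(r"\d+\*", code))   # iterations: 'n*(...)'
    n_sym = code.count("S[")                   # symmetries
    n_alt = code.count(">/<")                  # alternations
    return (symbols + OP_WEIGHTS["I"] * n_iter
            + OP_WEIGHTS["S"] * n_sym + OP_WEIGHTS["A"] * n_alt)
```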

3.3.7 Differences with PISA

This compressor relies heavily on the same theoretical concepts of hyperstrings, S-graphs and A-graphs as PISA does. However, as stated at the start of this subsection, some uncertainty pertaining to the exact workings of PISA meant that new ideas were necessary to fill in the blanks. The following list outlines the major differences between the two algorithms.

• The explanation of PISA [17] mentions ‘updating its database of S- and A-graphs’ at the end of the first for loop. It is, however, not clear how this update is done. In the proposed compressor the graphs are not updated, but recreated each time.

• PISA updates the A-graphs at the end of the first for loop, while this compressor recreates the A-graphs at the start of the first for loop. A small exception to this is present in the code: at the end of each v loop, the algorithm does update the repeats of right A-graphs with the encodings of the v→w edge. This is not necessary for left A-graphs, because the algorithm encodes the string from left to right.

• PISA updates the S-graphs at the end of the first for loop, while this compressor creates the S-graphs before the first for loop.

• PISA always returns the one code with the lowest complexity, while this compressor returns a Graph object. In this object, the edge connecting the first and last nodes represents the code with the lowest complexity, but other paths represent other codes, which consist of optimally encoded substrings that together form the whole string.


Figure 4: Outline of the process used to answer an analogy of the form A:B::C:?.

4 Analogy Solving

Analogies rely on a perceived underlying structure. This project started from the intuition that this structure could be captured by an adequate compression of the strings, and that SIT provides us with a way of describing such structure in the case of string analogies. The following section will present the analogy solving algorithm created for this project.

4.1 Overview

The algorithm proposed here is based on the following idea: in an analogy of form A:B::C:?, it is expected that there is a certain structure in the left-hand side (A:B). In practice, the structure of A:B can be found using the compression methods discussed in section 3, focusing for instance on the concatenation A + B.¹ By applying this same structure to the right-hand side, decompressing the resulting code and taking away the part C, an answer to the analogy can be found. The application of the structure to the right-hand side requires two additional steps: processing symbol distances and using a symbol replacement method (of symbols in A with symbols in C), both of which will be discussed in the next sections. Figure 4 outlines the general structure of the algorithm. In more detail, the algorithm has the following general structure:

¹The idea of compressing A and B together, rather than separately, was inspired by the technique used in the famous paper The Similarity Metric [10], in which concatenations of two DNA sequences were compressed to find a measure of similarity between the two associated species.


Algorithm 7: Analogy solving algorithm

1 Function solve_analogy(analogy, n_answers, parstruct)
2   split analogy into A, B and C
3   lcodes = compress_pisa(graph(A + B))
4   if parstruct then
5     solves = try parameter structure approach
6   else
7     solves = []
8   end
9   for code in lcodes do
10    add distances to code
11    substitute symbols in code with symbols in C
12    for new_code in substitutions do
13      result = decompress(new_code)
14      if result starts with C then
15        remove C from result
16        add result to solves
17      end
18    end
19  end
20  sort solves by complexity of respective new_code
21  return top n_answers solves

When A:B::C:? does not produce sensible results, it is also possible to rearrange the analogy to hopefully obtain better answers. This rearrangement is based on the equality between the following two formulas.

a/b = c/d ⇔ a/c = b/d

With fractions of these forms, parts b and c can be swapped while the equality is maintained. A similar idea seems to apply to analogies. While the two forms of an analogy might often result in the same answers, one of the two forms might be more solvable using the SIT approach. An example is the analogy abac:adae::baca:?. In this analogy, A + B has a structure of form <(a)>/<(.)>, while the right-hand side (C + D) has a structure of form <(.)>/<(a)>. This change in structure is a problem for our algorithm, since it tries to apply the same structure to C + D. Changing the analogy to abac:baca::adae:? results in the structure S[(a)(bac)], which the solver can deal with more easily.
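A sketch of this rearrangement (the function name is assumed):

```python
def rearrange(analogy):
    """Swap parts B and C of A:B::C:?, mirroring a/b = c/d <=> a/c = b/d."""
    left, right = analogy.split("::")
    a, b = left.split(":")
    c = right.split(":")[0]
    return f"{a}:{c}::{b}:?"

print(rearrange("abac:adae::baca:?"))  # -> 'abac:baca::adae:?'
```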

4.2 Symbol Distance

The SIT coding language allows for the representation of structure in analogies. To apply this structure from one side (A + B) of the analogy to the other side (C+?), this structure should first be defined as a function of symbols in A. This will allow for easier symbol substitution in the next subsection.

To define a structure of A + B as a function of symbols in A, there needs to be a way to generate the symbols in B from symbols in A, and therefore there needs to be a method of describing functional relationships between symbols. The SIT coding language cannot do this; the only relationship this language can define is the identity relationship. For instance, 3*(a) establishes an identity relationship between all elements of aaa.

The easiest way to define relationships between symbols is by defining them to have (directional) distances between each other. For alphabetic characters, the distance between symbols can be defined as the distance between the positions of the symbols in the alphabet. So, for example, the distance between a and b is 1, and the distance between b and z is 24. This allows symbols in B to be replaced by symbols in A plus a distance.
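In code, this directional distance is simply a difference of alphabet positions (the helper name is hypothetical):

```python
def distance(x, y):
    """Directional distance from symbol x to symbol y in the alphabet."""
    return ord(y) - ord(x)

print(distance("a", "b"))  # 1
print(distance("b", "z"))  # 24
```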

A challenge that arises with this approach is the question of which symbol the distance should be calculated from. There are multiple possible approaches to this problem, each useful in different situations; the ones explored in this project are highlighted here.

A straightforward way of deciding which symbol to calculate the distance from is to choose the previous symbol. For example, in the analogy a:bcd::i:?, the left part abcd could be described as a($ + 1)($ + 1)($ + 1), where $ refers to the last symbol used in the code. This same structure applies to the intuitive right-hand side i:jkl. There are, however, many analogies where this approach fails. Take the analogy aba:aca::ada:?, in which abaaca gets encoded as S[(a), (b)] S[(a), (c)]. When distances are applied, the code becomes S[(a), (b)] S[(a), (($ + 2))]. Now, when the symbol b is substituted by the symbol d (S[(a), (d)] S[(a), (($ + 2))]), the decompressed code becomes adaaca, resulting in the solution aca, while aea is a more intuitive answer.

Another way of doing this is to choose the last new symbol used in the organization extracted using SIT. In the previous example, the last new symbol would have been b. The code S[(a), (b)] S[(a), (($ + 1))] would be the result of applying this method. Substituting the b for the d would result in S[(a), (d)] S[(a), (($ + 1))], decompressing into the expected adaaea.

One piece of information the last new symbol approach does not use is the actual position of symbols in the input strings. However, analogies can be constructed on more complex positions, for instance ae:bd::cc:?. Here the change applied to the string is an increase by 1 for the first element and a decrease by 1 for the second. In a case like this, a position-based approach would be useful, resulting in code ae(a + 1)(e − 1) or ae($ + 1)($ − 1). Unfortunately, this approach quickly becomes difficult to use when parts A, B and C of an analogy have different lengths.

To conclude, analogies come in many different forms, and different analogies require different strategies to compute metrics. In absence of a general theory, this project will exploit both the last new symbol and position-based methods, depending on which is more appropriate.

4.3 Symbol Substitution

When a compression of A + B has been defined using only symbols in A, a substitution (or replacement) can be done with the symbols in C. To perform such a substitution, A and C need to be represented in ways that allow them to correspond one to one. Initially, A and C are represented as strings of symbols, in which each symbol counts as one element. However, unless A and C share the same length, this representation does not easily allow for substitution. Therefore, different ways of representing A and C are used.

4.3.1 Representation by Compression

For C, the number of elements in the representation can be reduced by compressing the string and dividing it into symbols and highest level operators. A highest level operator is an operator which is not inside an argument of another operator. For example, ijjkkk can be compressed into i2*(j)3*(k), which is then split into i, 2*(j) and 3*(k). The number of symbols plus the number of highest level operators in a code of C is always equal to or less than the number of symbols in C. This representation cannot be used for A, since A is already encoded in the compression of A + B.
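Splitting a code into symbols and highest level operators can be done by tracking bracket depth. A sketch, under the assumption of well-formed codes over lowercase symbols (not the thesis code):

```python
def top_level_elements(code):
    """Split e.g. 'i2*(j)3*(k)' into ['i', '2*(j)', '3*(k)']."""
    elements, i = [], 0
    while i < len(code):
        if code[i].islower():              # a plain symbol
            elements.append(code[i])
            i += 1
            continue
        j = i                              # an operator: scan to its end
        while True:
            depth = 0
            while True:                    # consume one bracketed group
                if code[j] in "([<":
                    depth += 1
                elif code[j] in ")]>":
                    depth -= 1
                j += 1
                if depth == 0 and code[j - 1] in ")]>":
                    break
            if j < len(code) and code[j] == "/":
                j += 1                     # A-forms: two halves joined by '/'
            else:
                break
        elements.append(code[i:j])
        i = j
    return elements

print(top_level_elements("i2*(j)3*(k)"))   # ['i', '2*(j)', '3*(k)']
```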

When a highest level operator has been used to replace a symbol, this operator itself will count as one symbol for the purposes of calculating new symbols from distances. If a distance was calculated from a symbol that has since been replaced by an operator, the entire operator will be carried over as the new element, and each individual symbol in this operator is increased by the distance.

The following example shows this:

analogy: abc:abd::ijjkkk:?
structure of A + B: <(ab)>/<(c)(($ + 1))>
structure of C: i 2*(j) 3*(k)
substitution: (a→i), (b→2*(j)), (c→3*(k))
structure of C + D: <(i2*(j))>/<(3*(k))(($ + 1))>
distances removed: <(i2*(j))>/<(3*(k))(3*(l))>
decompression: ijjkkkijjlll
result: ijjlll

4.3.2 Representation by Chunking

Alternatively, A and C can be represented in terms of chunkings, meaning divisions of the symbol strings into chunks. Each chunk can then be replaced as if it were a single element. For example, abcd could be chunked into [ab, cd], [a, bc, d], [a, b, c, d], and so on. For A, there is often not much choice in how it can be chunked; chunkings of A are given by the compression of A + B.

Some examples:

• [a, b, c] is a chunking of abc in the compression S[(a)(b), (c)]c.
• [ab, c] is a chunking of abc in the compression ab2*(c)($ + 1)($ + 1).
• [ab, c] is a chunking of abc in the compression <(ab)>/<(c)(($ + 1))>.
• [abc] and [a, b, c] are chunkings of abc in the compression 2*(abc).

For C, there is no structure present that determines how it should be chunked. Therefore, it is possible to chunk C in such a way that it best corresponds to a chunking of A.

This process of creating a chunking of C as similar as possible to a chunking of A is called chunking element matching, and works as follows: a list is created of the number of symbols in each element of the chunking of A (e.g. [a, bc, def] results in [1, 2, 3]). Next, the largest number in this list is increased or decreased so that the sum of the numbers equals the number of symbols in C. Now, this list of numbers can be used to split C into chunks, which can be used to substitute the original elements of the chunking of A.
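A sketch of chunking element matching (names assumed), which reproduces the example that follows:

```python
def match_chunking(chunking_a, c):
    """Adjust the largest chunk length so the lengths sum to len(c),
    then cut C into correspondingly sized chunks."""
    lengths = [len(chunk) for chunk in chunking_a]
    lengths[lengths.index(max(lengths))] += len(c) - sum(lengths)
    chunks, i = [], 0
    for k in lengths:
        chunks.append(c[i:i + k])
        i += k
    return chunks

print(match_chunking(["ab", "c"], "ijklm"))  # ['ijkl', 'm']
```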


Example:

analogy: abc:abd::ijklm:?
structure of A + B: <(ab)>/<(c)(($ + 1))>
chunking of A in A + B: [ab, c]
lengths of chunking elements of A: [2, 1]
matching chunking lengths of C: [4, 1]
chunking of C: [ijkl, m]
substitution: (ab→ijkl), (c→m)
structure of C + D: <(ijkl)>/<(m)(($ + 1))>
distances removed: <(ijkl)>/<(m)(n)>
decompression: ijklmijkln
result: ijkln

A special type of chunking is a consecutive chunking (called sequential chunking in the code). A chunking is consecutive if it meets the following two requirements:

• All elements of the chunking contain exactly 1 symbol.
• The elements of the chunking together form all arguments of a single operator.

The following list shows some examples of consecutive chunkings:

• [a, b, c] is a consecutive chunking of abc in the code <(a)(b)(c)>/<(d)>.
• [a, b, c] is a consecutive chunking of abc in the code S[(a)(b), (c)].

Consecutive chunkings can be substituted in a special way: instead of replacing individual symbols or elements, the entire chunking can be replaced by a new chunking. The new chunking has one element for every symbol in C. When this consecutive chunking is applied, the entire argument string forming A is replaced with this new chunking (with the corresponding number of parentheses per element).

Example:

analogy: abc:cba::ijklm:?
structure of A + B: S[(a)(b)(c)]
consecutive chunking of A in A + B: [a, b, c]
consecutive chunking of C: [i, j, k, l, m]
substitution: ([(a), (b), (c)]→[(i), (j), (k), (l), (m)])
structure of C + D: S[(i)(j)(k)(l)(m)]
decompression: ijklmmlkji
result: mlkji

4.4 Structure in Parameters

Up until now, the analogy solving used in this project has only looked at the structure of, and the relationships between, symbols. However, this is not always sufficient. Take for instance the analogy aaabb:aabbb::eeeeef:?. Intuitively, it seems like the analogy simply swaps the number of times the first symbol occurs with the number of times the second symbol occurs. However, PISA assigns A + B the structure 3*(a)S[2*((b))(a)]b, which does not seem to represent this change, and there seems to be no symbol substitution that results in the expected answer efffff.

The core of this problem does not lie in the structure of the symbols, but in the structure of the structure. By looking at the structure as a series of iterations, this becomes clear: in 3*(a)2*(b)2*(a)3*(b), the parameters of the iterations form a symmetry, namely S[(3)(2)]. It is this symmetry that forms the basis of this analogy.

The function written to solve these types of analogies is separate from the rest of the algorithm. It recodes A + B as a sequence of iterations. Next, the parameters of these iterations are compressed, distances are added and symbol substitution is performed in essentially the same way as described before. This results in new parameters which, combined with symbol substitution on the actual symbols, can produce an answer to the analogy.
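Recoding A + B as a sequence of iterations is essentially a run-length encoding; a sketch (the function name is assumed):

```python
from itertools import groupby

def as_iterations(s):
    """Run-length encode a string, e.g. 'aaabbaabbb' (A + B for the analogy
    aaabb:aabbb) -> [(3, 'a'), (2, 'b'), (2, 'a'), (3, 'b')].
    The parameters [3, 2, 2, 3] can then themselves be compressed, S[(3)(2)]."""
    return [(len(list(group)), symbol) for symbol, group in groupby(s)]
```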

Problems of this type can easily become more complex. When the parameters of a structure can have a structure, the parameters of that structure could also have a structure, which could again have parameters with some structure. With larger codes, this 'parameter depth' could become exceedingly large.

Furthermore, relationships between parameters at different 'depths' are also possible. Take for instance the analogy abc:aaabbbccc::abcd:?. The structure of the parameters could be written as 3*(1)3*(3). To get to the plausible answer aaaabbbbccccdddd, there would need to be a relationship between the two 3's outside the brackets and the 3 inside the brackets, which are at different parameter depths.

In short, structure in parameters can be very complex. In this project, it has only been explored at a surface level, accounting only for relationships between parameters that occur in the structure of the original analogy. Other configurations are left to future work.


5 Results

This project consists of two parts, the compression algorithm and the analogy solving algorithm. The results follow the same division. In the first part of this section, the brute-force compressor will be compared to the PISA-based compressor in terms of speed. In the second part, the answers generated by the analogy solving algorithm will be compared against two sets of human answers obtained in different experiments. Furthermore, the answers generated by the proposed solver will also be compared to the answers generated by Metacat [11].

5.1 Compressor Comparison

The brute-force compressor and the PISA-based compressor were compared by measuring the times it took them to compress strings of different lengths, up to a longest length k. For each of these k lengths, N random strings are compressed and the time taken is averaged.

The strings used for comparison consist of the symbols a, b, c and d. Using such a small set of symbols ensures that there will be some regularity in the strings that the compressors can use. The strings are not randomly generated for each length, but are randomly expanded; every time a new length is measured, the strings of the previous length are extended with one random symbol. This is to ensure there are no large fluctuations in the difficulties of the strings between different lengths.
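The incremental expansion of the benchmark strings can be sketched as follows (the function name is assumed):

```python
import random

def extend_strings(strings, alphabet="abcd"):
    """Append one random symbol to each test string, so the length-(k+1)
    set is an extension of the length-k set rather than a fresh sample."""
    return [s + random.choice(alphabet) for s in strings]
```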

The following graph shows a time comparison of the brute-force compressor and the PISA-based compressor. The maximum string length compared is 14, and for each length the times of 5 random strings are averaged.

Figure 5: Comparison of the brute-force and PISA-based compressors, k = 14, N = 5.

From the graph, it is very clear that the brute-force compressor quickly becomes very slow, taking upwards of 2.5 minutes for strings of length 14, while the PISA-based compressor needs less than a second.


From figure 5 alone it is difficult to assess the performance of the PISA-based compressor. The following graph shows the times needed for compression of strings of lengths up to 30 symbols, for only the PISA-based compressor.

Figure 6: PISA-based compressor, N = 10, k = 30.

As can be seen in the graph, the time the PISA-based compressor needs for compression increases with length far more gradually than the brute-force compressor, although it is still clear in the graph that the time needed for compression increases exponentially.

5.2 Analogy Solving

To test the analogy solving algorithm, the solutions generated by it are compared to human answers.

5.2.1 Murena’s Dataset

The table in Figure 7, published by Murena et al. [14], shows an example set of human answers for an analogy test. In the experiment used to obtain this data, 68 participants were asked to solve analogies following the template ABC:ABD::X:?; the left-hand side of the analogy remained the same throughout the experiment, but the X changed in every test.

For each X, the data shows the two most common answers given by participants, as well as the percentage of participants that chose each answer. Note that in the original data, some questions were repeated to see the influence of having previously faced similar problems. Since the solving algorithm in this project runs independently of previous answers, repeated questions were omitted here. The original data also contained problems with numeric symbols. Because these were not used in this project, such problems were either converted to alphabetic ones (by taking the numbers as indices into the alphabet) or omitted completely.
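The conversion amounts to the following (a trivial sketch; the helper name is hypothetical):

    from string import ascii_lowercase

    def to_letters(numbers):
        # Treat numbers as 1-based indices into the alphabet,
        # e.g. [1, 2, 4] -> 'abd'.
        return "".join(ascii_lowercase[n - 1] for n in numbers)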

The final two columns in the table show the performance of the analogy solving algorithm proposed in this project (Ps) and of Metacat (PM): for each answer, the value in the corresponding column indicates the position at which that answer was generated by the method (1 means it was the best generated answer, 2 means it was the second best generated answer, etc.).


    Problem   Proposed solution   Proportion   Ps   PM
    IJK       IJL                 93%          1    1
              IJD                 2.9%         -    -
    BCA       BCB                 49%          3    2
              BDA                 43%          1    1
    AABABC    AABABD              74%          1    1
              AACABD              12%          -    -
    IJKLM     IJKLN               62%          1    1
              IJLLM               15%          -    -
    KJI       KJJ                 37%          1    1
              LJI                 32%          -    2
    ACE       ACF                 63%          1    1
              ACG                 8.9%         -    -
    BCD       BCE                 81%          2    1
              BDE                 5.9%         1    -
    IJJKKK    IJJLLL              40%          1    2
              IJJKKL              25%          2    1
    XYZ       XYA                 85%          1    -
              IJD                 4.4%         -    -
    RSSTTT    RSSUUU              41%          1    1
              RSSTTU              31%          2    -
    MRRJJJ    MRRJJK              28%          2    1
              MRRKKK              19%          1    2

Figure 7: Human answers to analogies of form ABC:ABD::X:? from the Murena dataset [14], along with the positions at which the same answers were given by the solving algorithm proposed in this project and by Metacat. The first column indicates the string at X.

A dash indicates the answer was not generated at all. As for speed, the lack of a built-in way to measure the time Metacat needs made an empirical speed comparison difficult. However, when working with Metacat, it was clear that this method is much slower than the solver implemented here, sometimes taking upwards of 10 minutes to generate a single answer. For this reason, only two answers per question were generated using Metacat.

The table shows that, for this dataset, the most common answer to each problem is always generated by our solver. Furthermore, the top answer generated by the solver is always one of the two most common participant answers. Both of these observations also hold for Metacat, with the exception of the problem XYZ, for which it produced neither of the answers given by participants.

However, there are also answers given by human participants that the solvers did not generate. Overall, the most common human answer matched the top generated answer 8/11 times (72.7%), both for the solver and for Metacat. The most common participant answer was in the top 2 generated answers 10/11 times (90.9%) for both algorithms. Answers given by participants were generated 16/22 times (72.7%) by the solver and 14/22 times (63.6%) by Metacat. The two best answers of both methods for each problem in this set can be seen in Appendix A.


    Problem                    Proposed solution   Proportion   Ps   PM
    ABA:ACA::ADA:?             AEA                 97.1%        1    1
                               AFA                 2.9%         -    -
    ABAC:ADAE::BACA:?          DAEA                60%          2    -
                               BCCC                28.6%        21   -
    AE:BD::CC:?                DB                  68.5%        3    1
                               CC                  17.1%        -    2
    ABBB:AAAB::IIIJJ:?         IIJJJ               57.1%        1    -
                               JJIII               14.3%        -    -
    ABC:CBA::MLKJI:?           IJKLM               88.6%        1    1
    ABCB:ABCB::Q:?             Q                   100.0%       1    -
    ABC:BAC::IJKL:?            JIKL                54.3%        -    -
                               KIJL                14.3%        2    -
    ABACA:BC::BACAD:?          AA                  57.1%        1    -
                               BCD                 31.4%        -    -
    AB:ABC::IJKL:?             IJKLM               85.7%        1    1
                               IJKLMN              11.4%        -    -
    ABC:ABBACCC::FED:?         FEEFDDD             91.4%        2    1
    ABC:BBC::IKM:?             JKM                 57.1%        7    -
                               KKM                 37.1%        2    -
    ABAC:ACAB::DEFG:?          DGFE                68.6%        2    -
                               FGDE                14.3%        1    -
    ABC:ABD::CBA:?             DBA                 51.4%        1    2
                               CBB                 45.7%        2    1
    ABAC:ADAE::FBFC:?          FDFE                94.3%        1    -
                               FDFA                2.9%         -    -
    ABCD:CDAB::IJKLMN:?        LMNIJK              80.0%        -    -
    ABC:AAABBBCCC::ABCD:?      AAABBBCCCDDD        74.3%        1    1
                               AAAABBBBCCCCDDDD    17.1%        -    -
    ABC:ABBCCC::ABCD:?         ABBCCCDDDD          85.7%        -    -
                               ABBCCCDDD           8.6%         1    -
    ABBCCC:DDDEEF::AAABBC:?    DEEFFF              77.1%        1    -
                               DCCDDF              8.6%         -    -
    A:AA::AAA:?                AAAAAA              62.8%        1    -
                               AAAA                25.7%        2    1
    ABBA:BAAB::IJKL:?          JILK                71.4%        -    -
                               JIJM                11.4%        5    -

Figure 8: Human answers to the analogies collected in this project’s experiment, along with the positions at which the same answers were given by the solving algorithm proposed in this project and by Metacat.


5.2.2 New Dataset

The Murena testset is quite small and the analogies it presents are all fairly similar. For this reason, a second testset of 20 more complex analogies was constructed specifically for this project. 35 participants (18 male, 17 female, average age 26.8) were asked to solve the analogies in the test; the results can be seen in Figure 8. In this table, the top two answers given by participants are presented, as well as the percentage of participants that gave each answer. In some cases, only the top answer is given: either all participants gave the same answer, or multiple answers with only one participant each were tied for second place.

In this testset, the most common answer given by participants was generated by the solver 16/20 times (80%), whereas it was generated by Metacat only 7/20 times (35%). The top answer given by the solver was in the top two participant answers 13/20 times (65%), whereas the top answer generated by Metacat was in the top two participant answers 8/20 times (40%). The most common participant answer matched the top generated 10/20 times (50%) for the solver, and 6/20 times (30%) for Metacat. The two best answers of the two methods for each problem in this set can be seen in Appendix B.

5.2.3 On Measuring Complexity

The weights of the iteration, symmetry and alternation operators were chosen to optimize the results of the solver on the two testsets. These weights were tested with values ranging from 0.8 to 1.2, in steps of 0.1. Each possible combination of values was tested on how highly it ranked the participant-given answers amongst all generated answers (a sketch of this search is given below). Overall, it was found that small differences in the values often changed little about the rankings. On a larger scale, the following requirements seem to yield the best results:

• The weight of iterations should be less than 1.
• The weight of symmetries should be less than 1.
• The weight of alternations should be more than 1.

The final weights chosen were 0.85 for iterations, 0.9 for symmetries and 1.1 for alternations.
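The search itself can be sketched as follows (Python; rank_score is a hypothetical stand-in for the evaluation described above):

    from itertools import product

    def grid_search(rank_score, values=(0.8, 0.9, 1.0, 1.1, 1.2)):
        # rank_score(w_iter, w_sym, w_alt) -> how highly the participant
        # answers are ranked under these operator weights (higher is better).
        # Try every (iteration, symmetry, alternation) combination and keep
        # the best one.
        return max(product(values, repeat=3), key=lambda w: rank_score(*w))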

6 Discussion and future developments

The goal of this project was to implement an efficient compressor for the SIT coding language, and to use SIT compression as the basis for an analogy solving algorithm.

The PISA-based compressor has been shown to be significantly more efficient than a brute-force approach to compression. Where the brute-force compressor already became intractable on strings of length 14, the PISA-based compressor worked within reasonable time for strings of up to 30 elements. However, this compressor is still significantly slower than the original PISA. Where this version reaches times of 3 seconds around string lengths of 25, the original PISA can process strings of one hundred symbols in a little under 3 seconds. It is hypothesized that this difference can be attributed to a number of factors. Firstly, the PISA-based compressor is written in Python, whereas the original PISA is written in C; Python is an interpreted language, and interpreted languages are generally much slower than compiled languages like C. It seems likely, however, that the most important difference lies in the constant reconstruction of hyperstrings that the PISA-based compressor in this project performs. As stated before, this was done because it was unclear how the updating of hyperstrings, which the original PISA does, is performed. To further improve this algorithm in the future, it could be useful to look into implementing the original PISA, which would likely lead to a better time performance.

As for the analogy solving algorithm, the solver has shown promising results on the testset created by Murena et al. [14]. However, this testset is very limited, having only 11 questions. Furthermore, each question is based on the same left-hand side ABC:ABD. The lack of variety and complexity in this testset makes it difficult to evaluate the analogy solving algorithm by this set alone. This motivated the creation of the second testset.

When compared to Metacat, the solving algorithm has been shown to achieve similar results on the first testset, and drastically better results on the second. It should, however, be noted that Metacat uses randomness in its procedure to generate answers; different runs of the algorithm on a problem could therefore result in different, and possibly better, answers. Furthermore, for this comparison, only the first two answers generated by Metacat were used, while Metacat can often generate more answers than that. The choice to consider only the top two answers was made due to time constraints: generating a single answer using Metacat can, in some cases, take upwards of 10 minutes.

It is important to note that problems in the second testset were created after the implementation of the solving algorithm, and with the capabilities of this solver in mind. Because of this, for many of the questions it was clear beforehand whether the solver would produce intuitive answers. Therefore, the percentages of correct answers should not weigh heavily in the evaluation of the algorithm. Instead, the set should be used as a showcase of what types of analogies the algorithm can and cannot deal with.

Answers of the solver are ranked by the complexities of the codes they originate from. The results suggest that this might, on its own, not be a good way of ranking the answers. This is indicated by the answer BCCC (28.6%) to ABAC:ADAE::BACA:?, the answer DB (68.5%) to AE:BD::CC:?, the answer JKM (57.1%) to ABC:BBC::IKM:? and the answer JIJM (11.4%) to ABBA:BAAB::IJKL:?. Despite being chosen by (fairly) significant percentages of the participants, these answers do not rank highly amongst the answers generated by the solver.

The reasoning behind these answers (most likely) relies on applying positional distances in the left-hand side of the analogy to the right-hand side. For instance, DB plausibly follows from the per-position distances in AE:BD (A to B is +1, E to D is -1) applied to CC. However, this does not require any structure, and therefore does not involve compression. This means that the resulting code is not compressed, and therefore has a high complexity, causing it to rank poorly. The most significant case of this is the answer BCCC to ABAC:ADAE::BACA:?, which the solver ranks lower than 20 other solutions, despite it being picked by over a quarter of the participants. Future work on this project could look at alternate measures which could, either in combination with complexity or on their own, rank the answers generated by the solver in a way that corresponds better to human answers.

The answer LMNIJK (80.0%) to ABCD:CDAB::IJKLMN:? might tell us something about the cognitive equivalent of what in this project is called chunking (section 4.3). The structure that seems to best correspond to participants’ interpretation of this problem is S[(ab)(cd)], which essentially represents a swapping of ab and cd in part A to get part B. The same structure in the right-hand side of the analogy that corresponds to the top answer is S[(ijk)(lmn)].

This way of symbol substitution corresponds to the chunking element matching described in section 4.3, but uses a different method. Where the method used in this project tries to keep as many elements of the chunking as possible at the same length (which results in structures like S[(ij)(klmn)] or S[(ijkl)(mn)]), the 3-3 division suggests a preference for maintaining the same ratio between the chunking elements (see the sketch at the end of this section). Future work on this approach could look at improving the way chunking element matching is performed to better correspond to the human method.

Finally, other answers that cannot be found by the algorithm are the ones discussed in section 4.4, where there are relationships between iteration parameters at different levels. In the test, such relationships are (most likely) used for the answer AAAABBBBCCCCDDDD (17.1%) to the problem ABC:AAABBBCCC::ABCD:?, and the answer ABBCCCDDDD (85.7%) to the problem ABC:ABBCCC::ABCD:?. These answers suggest that such relationships are indeed understood and used by participants, although this raises the question of how complex these relationships can be before participants no longer base their answers on them. Future work could expand on the still very rudimentary exploration of this problem in this project.
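A ratio-preserving matching step, as suggested by the LMNIJK answer above, could be sketched as follows (a possible refinement, not the method implemented in this project; the helper name is hypothetical):

    def split_like(source_chunks, target):
        # Split `target` into chunks with the same length ratio as
        # `source_chunks`, e.g. ['ab', 'cd'] and 'ijklmn' -> ['ijk', 'lmn'].
        total = sum(len(c) for c in source_chunks)
        out, pos = [], 0
        for c in source_chunks:
            step = round(len(c) / total * len(target))
            out.append(target[pos:pos + step])
            pos += step
        out[-1] += target[pos:]  # absorb any rounding remainder
        return out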

7 Conclusion

The PISA algorithm, on which the compressor used in this project is based, is currently the best method for compression using the SIT coding language. Overall, the transparallel processing method used by PISA has shown itself to be a significant improvement over a naive brute-force approach. The compressor based on PISA in this project is, however, still significantly slower than the original PISA on longer strings. Given that the strings in analogies are generally fairly short, this has not posed a large problem for the overall performance of the algorithm.

Structural Information Theory has shown itself to be a useful tool for analogy solving, although it cannot solve analogies on its own. The lack of metrical information, or a way to define relationships between symbols, created the need to define symbols as distances from other symbols, as well as a way of choosing which symbol to calculate the distance from. There are multiple ways of doing this, but the methods used in this project (last new symbol and positional) seem to work well for the majority of analogies that have been tried.

The necessity of applying structure from one part of an analogy to another led to the need for symbol substitution. As it turned out, the different ways of representing the symbol strings that should be substituted often result in many different ways the substitution can be done. It seems, however, that there is still room for improvement here. Participant answers to one question in the test data hint at ways of improving symbol substitution. Furthermore, a better way of ranking the answers generated by the solver would allow it to rank answers that do not rely on structure more appropriately.

As it turned out, the structure in an analogy does not always have to apply to the symbols; as shown in section 4.4, sometimes the structure of the symbols has structure itself. The test data suggests that participants do indeed use these relationships in their answers. In this project, this has been explored only at a surface level, and analogies using these kinds of structures in a more complex way cannot be solved by the algorithm. This could be addressed by an extension of the current functionality.
