
Unsupervised compression-based Topic Modelling of legal text documents

Jelle van Noord
10797483

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. A.W.F. Boer
Leibniz Center for Law
Faculty of Law
University of Amsterdam
Nieuwe Achtergracht 166
1018 WV Amsterdam


Abstract

Supervised compression-based classification has proved to be a promising method for text classification, due to its solid foundation in Information Theory. Such a classifier captures similarities between documents using the Normalized Compression Distance, an approximation of the Information Distance. In this thesis, a compression-based classifier is applied to Dutch legal documents, resulting in quite impressive classification performance. Furthermore, the computational problem that arises when searching through all possible combinations of the topic model is discussed. To avoid it, multiple alternative approaches are developed and implemented. Firstly, representing the supervised training data by the most similar pairs of documents is shown to be an unsatisfying topic model algorithm. Secondly, representing this training data by the least similar set of documents yields some valuable topic models, though more often worse topic models were obtained. Finally, a Steepest Ascent Hill Climbing approach, which minimises the Normalized Compression Distance divided by the Shannon entropy, is implemented. This algorithm outperformed all previously implemented methods of this thesis and often resulted in the best topic model.


Acknowledgements

I would like to express my sincere gratitude towards my supervisor, Dr. Alexander Boer, for his guidance, support and feedback throughout this entire thesis, both during the research and the writing. Besides, I want to thank the coordinator, Dr. Sander van Splunter, for his support and feedback at the necessary times.


Contents

1 Introduction
2 Background
  2.1 Theoretical Foundation
    2.1.1 Information Theory
    2.1.2 Compression
    2.1.3 Topic Modelling
    2.1.4 Hill Climbing
  2.2 Related Work
3 Approach
4 Experiments and Results
  4.1 Data and Pre-processing
  4.2 Supervised Classification
  4.3 Topic Modelling
5 Discussion
6 Conclusion
7 Future Work
References
Appendix A Supervised Classification Results
Appendix B Worst Set Topic Modelling


1 Introduction

Text classification is a machine learning task in which unlabelled text documents are classified according to some categorisation, for instance topic or author, provided by the training documents. Throughout many studies a range of different methods has been used for supervised text classification, including support vector machines (SVM), k-Nearest-Neighbour (kNN) and Naive Bayes (Yang and Liu, 1999; Sebastiani, 2002). Despite the proven effectiveness of these supervised classification methods in multiple domains, other non-standard methods are investigated.

One of these is compression-based classification, which uses compression programs to assign unlabelled documents to one of the provided classes. Firstly, a model or dictionary is generated from the files of each class, which is then used to compress an unlabelled document multiple times, each time with the dictionary or model of a different class. The document is then assigned to the class that yields the highest compression rate, which corresponds to the lowest cross-entropy between the document and the class.

A major benefit of compression-based classification is its ability to capture document features other than words, such as punctuation, subword (stems) and superword (relations between words) features (Marton et al., 2005). The insensitivity to these features is a well-known problem in topic models such as Latent Dirichlet allocation. Furthermore, these methods are easy to apply, for instance the method used by Benedetto et al. (2002), and require almost no pre-processing of the data. However, these methods also have a major disadvantage, namely speed, which makes them unsuitable in cases where speed is important. Thus, compression-based classification is well suited for applications where low-cost classification is needed and speed is unimportant. To reduce the costs even further, unsupervised classification can be applied, because this makes the manual labelling of training documents unnecessary.

In this thesis, supervised compression-based classification, using RAR and the Approximate Minimum Description Length (ADML), is performed on legal verdicts of the Open Data van de Rechtspraak1 (ODR). The results of this classification will be evaluated, to ensure the usability of the data. Thereafter, three different methods of unsupervised compression-based classification, Best Sets, Worst Set and Hill Climbing, are developed and applied to this data. The Best and Worst Set methods generate an approximation of the training documents of supervised learning, whilst the Hill Climbing method searches for the best distribution of documents with regard to the Normalized Compression Distance (NCD) by applying the generate-and-test principle.

This yields the following research question of this thesis:

How can the topics of legal text documents be modelled with an unsupervised compression-based method?

The creation and evaluation of this classifier can be broken down into different methods, which leads to the following subquestions:

1. Which performance rate can be achieved with supervised compression-based classification on the legal documents?

1https://www.rechtspraak.nl/Uitspraken-en-nieuws/Uitspraken/Paginas/Open-Data.


2. To what extent does unsupervised compression-based topic modelling, using the best or worst sets of documents, yield a valuable topic model?

3. Does a compression-based Steepest Ascent Hill Climbing approach result in an acceptable topic model?

The layout of this thesis is as follows: in the next section, the theoretical foundation is provided and related work is examined. Section 3 consists of an explanation of the approach. The layout of the experiments, together with their results, is presented in section 4, while the results are discussed in section 5. Finally, section 6 presents a conclusion, followed by some proposals for future work in section 7.

2 Background

Firstly, this section provides a theoretical foundation as substantiation for the approach. Secondly, related work on compression-based classification is presented.

2.1 Theoretical Foundation

Compression uses concepts from Information Theory, so some useful concepts and formulas are analysed. Firstly, the Shannon entropy, since this will be used in the topic model experiments to value the distribution of files across the classes. Secondly, the Kolmogorov complexity is laid out. Subsequently, the (Normalized) Information Distance and the Normalized Compression Distance are clarified.

2.1.1 Information Theory

Shannon established the discipline of Information Theory with his paper "A Mathematical Theory of Communication" (Shannon, 1948). In this paper a measure of the information in a distribution, called entropy, was proposed. The entropy H of a distribution P measures the uncertainty in P, or in other words the information that is gained when the outcome is discovered. The definition of Shannon's entropy is as follows:

H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i) \qquad (1)

where X is a random variable with possible values \{x_1, \ldots, x_n\} and with probabilities P = \{p(x_1), \ldots, p(x_n)\}.
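
To make the definition concrete, the entropy of a distribution of documents over classes can be computed directly from the class counts. The sketch below is a minimal illustration in Python; the function name and the example counts are illustrative and not taken from the thesis code.

```python
import math

def shannon_entropy(counts):
    """Entropy in bits of a distribution given by absolute counts per class."""
    total = sum(counts)
    probabilities = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)

# A balanced split of 50 documents over two classes gives the maximal entropy of 1 bit,
# while a very uneven split gives an entropy close to 0.
print(shannon_entropy([25, 25]))  # 1.0
print(shannon_entropy([49, 1]))   # ~0.14
```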

The Kolmogorov complexity K(x) of a finite object x can be defined as the length of the shortest binary description of x (Kolmogorov, 1968), although the most common definition is the one in which K(x) is seen as the shortest computer program that produces x as output. A detailed explanation of the Kolmogorov complexity and its equation is given by Vitanyi et al. (2008). It is mentioned in this thesis since it forms the foundation of the Information Distance and the Normalized Information Distance (NID).


While the Kolmogorov complexity measures the absolute information content of individual objects, the absolute information distance between individual objects is desired for the purpose of classification. Therefore Vitanyi et al. (2008) formulated the Information Distance and derived the NID from it. The definition of the Information Distance between two individual objects x and y is given by the following equation:

E(x, y) = \max\{K(x|y), K(y|x)\} \qquad (2)

The Information Distance E is universal: of all admissible distances it is always the least, and therefore it can also be called the universal information distance. However, to obtain a universal similarity metric a normalization is necessary. This normalization should return 0 when the objects are maximally similar and 1 when the objects are maximally dissimilar, resulting in the following definition of the NID:

e(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \qquad (3)

Despite the fact that the NID is theoretically appealing, it is impractical, since it cannot be computed (Terwijn et al., 2011). Therefore an approximation of the NID is required to make it practically applicable. An approximation that is useful for this thesis is the Normalized Compression Distance (NCD) (Vitanyi et al., 2008). The NCD is constructed with two important deductions from the NID. Firstly, the compound max{K(x|y), K(y|x)} can be rewritten as K(xy) − min{K(x), K(y)}, which eliminates all conditional complexity terms. Secondly, an efficient real-world compressor Z, for instance RAR, gzip or bzip2, can be used to provide an upper bound of K(x), resulting in the following definition of the NCD:

e_Z(x, y) = \frac{Z(xy) - \min\{Z(x), Z(y)\}}{\max\{Z(x), Z(y)\}} \qquad (4)

where Z(x) indicates the compressed file size of string x using compressor Z. Thus, e_Z depends on the quality of Z, i.e. the better Z is, the closer e_Z approaches the NID.
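
As an illustration of Equation (4), the sketch below computes the NCD with Python's built-in bz2 module as a stand-in compressor; the thesis itself uses RAR, so the resulting values (but not the idea) will differ. The helper names and the example texts are assumptions of this sketch.

```python
import bz2

def compressed_size(data: bytes) -> int:
    """Size in bytes of the bz2-compressed representation of `data`."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance e_Z(x, y) of Equation (4)."""
    zx, zy, zxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

text_a = b"De rechtbank verklaart het beroep ongegrond. " * 20
text_b = b"De rechtbank verklaart het beroep gegrond. " * 20
text_c = b"The quick brown fox jumps over the lazy dog. " * 20
print(ncd(text_a, text_b))  # expected to be clearly smaller (similar texts) ...
print(ncd(text_a, text_c))  # ... than the distance between dissimilar texts
```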

2.1.2 Compression

Data compression is the process of encoding information to reduce the number of bits used by the original representation (Mahdi et al., 2012). In text compression there are roughly two approaches to this encoding: dictionary-based and entropy-based. Dictionary-based compression techniques build up a dictionary of strings occurring in the text and achieve compression by replacing all following occurrences of such a string with a reference to the dictionary. The Lempel-Ziv algorithms, LZ77 (Ziv and Lempel, 1977) and LZ78 (Ziv and Lempel, 1978), are a classic example of this kind of compression. These algorithms go through the file and, for every position, look at the string starting at the current position and search for the longest match to that string in the preceding part of the file, outputting a reference to this match. Furthermore, LZ77 is characterised by its sliding window, which means it only looks back for matches within a fixed window preceding the current position. LZ78, on the other hand, has no sliding window, but starts with a dictionary containing an entry for every character. During compression it only adds a string to the dictionary if the prefix that is one character shorter is already in the dictionary, so, for example, "this" will only be appended to the dictionary if "thi" is already in it. Whenever a new string is appended to the dictionary, the prefix of that string is replaced by its index in the dictionary. This type of compression eliminates the requirement to add the dictionary to the compressed file, since the dictionary can be reconstructed from it.
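
The dictionary growth described above can be illustrated with a small LZ78-style encoder. The sketch below follows the common textbook formulation that starts from an empty dictionary and emits (index, next character) pairs, which differs slightly from the variant described above that pre-loads all single characters; it is illustrative only.

```python
def lz78_encode(text: str):
    """Simplified LZ78: emit (index of longest known prefix, next character) pairs."""
    dictionary = {"": 0}           # index 0 stands for the empty prefix
    output, phrase = [], ""
    for char in text:
        if phrase + char in dictionary:
            phrase += char         # keep extending the current phrase
        else:
            output.append((dictionary[phrase], char))
            dictionary[phrase + char] = len(dictionary)  # new dictionary entry
            phrase = ""
    if phrase:                     # flush a trailing phrase that is already known
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output, dictionary

codes, dictionary = lz78_encode("this this this this")
print(codes)   # repeated phrases are replaced by references to earlier entries
```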

Entropy-based compression techniques, on the other hand, generate a probability distribution for the upcoming character based on the input seen so far, after which this distribution is used to encode the actual character. The underlying idea of this type of encoding is that frequently occurring characters are encoded with fewer bits than rarely occurring characters, which use more. A well-known application of this type of compression is Prediction by Partial Matching (PPM), which has set the performance standard for text compression (Cleary and Witten, 1984). PPM generates a conditional probability for each character based on the last n characters, i.e. an n-th order Markov model is used for the prediction of the characters. The most basic way to implement this idea is a dictionary that keeps track of each string s of length n and stores a count for each character x that follows s. However, the PPM algorithm is a bit more complicated, since it also uses strings shorter than n. It is therefore able to match a string of one character shorter when the string of length n is absent from the dictionary. This process, known as an escape event, is repeated for shorter lengths until a match is found. It seems likely that the higher the value of n, the better the compression, though there is a limit to n due to the fact that the amount of storage required grows exponentially with n.
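
The escape mechanism can be sketched with a toy order-n context model: character counts are stored for every context of length 0 to n, and prediction falls back to shorter contexts until one with observations is found. Real PPM additionally assigns explicit escape probabilities and feeds the estimates into an arithmetic coder; the class and method names below are illustrative.

```python
from collections import defaultdict, Counter

class ContextModel:
    """Toy order-n context model with PPM-style fallback (no arithmetic coding)."""

    def __init__(self, order: int = 2):
        self.order = order
        # counts[context][char] = how often `char` followed `context` in the training text
        self.counts = defaultdict(Counter)

    def train(self, text: str):
        for i in range(len(text)):
            for n in range(self.order + 1):      # record contexts of length 0..order
                if i - n >= 0:
                    self.counts[text[i - n:i]][text[i]] += 1

    def predict(self, context: str):
        """Distribution over the next character, escaping to shorter contexts if needed."""
        for n in range(min(self.order, len(context)), -1, -1):
            seen = self.counts.get(context[len(context) - n:])
            if seen:                             # a context with observations was found
                total = sum(seen.values())
                return {char: count / total for char, count in seen.items()}
        return {}

model = ContextModel(order=2)
model.train("de rechtbank verklaart de verdachte schuldig")
print(model.predict("de"))   # characters observed after the context "de"
```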

RAR is a publicly available compression program2. According to Marton et al. (2005) both PPM and Lempel-Ziv algorithms are implemented in RAR, and the program chooses one of these methods based on the input data; for text, PPM-based compression is usually used, particularly a version of the PPMII algorithm (Shkarin, 2001).

2.1.3 Topic Modelling

Topic modelling algorithms are able to discover general topics in a collection of documents (Blei, 2012). These collections can consist of different kinds of data, such as text or images. Topic modelling algorithms are based on the idea that when a document belongs to a particular topic, some strings appear more or less frequently. If a document, for example, has soccer as its major topic, words such as "goal", "ball" and "pass" are more likely to occur than words like "court" or "jury", which are more likely to appear in legal documents.

Latent Dirichlet allocation (LDA) is one of the simplest, but perhaps the most common, topic model (Blei et al., 2003). LDA is a so-called bag-of-words method, i.e. documents are represented as a bag of words. Furthermore, topics are represented as a probability distribution over words. LDA assigns documents to several topics.


2.1.4 Hill Climbing

Hill Climbing is an iterative local search algorithm, which starts by selecting a random solution to the problem. From there on, at each iteration, it modifies the solution to obtain a better one. Hill Climbing was originally introduced as a solution for producing school timetables, and possibly other timetables (Appleby and Hall, 1961).

There are generally three types of Hill Climbing: Simple, Steepest Ascent and Stochastic. The first searches the possible successors of the current solution state and selects the first one that is closer to the solution. Steepest Ascent, on the other hand, compares all possible successors and selects the one that is closest to the solution. Simple as well as Steepest Ascent Hill Climbing terminate when no state closer to the solution can be found; because of this, a local optimum will sometimes be presented as the solution instead of the global optimum. Stochastic Hill Climbing randomly selects a successor at each iteration; if that move is an improvement it keeps it, otherwise it examines another random successor. A generic sketch of the Steepest Ascent variant is given below.
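
The sketch below shows the Steepest Ascent variant on an arbitrary objective; the function names and the toy objective are illustrative and unrelated to the thesis implementation.

```python
import random

def steepest_ascent(initial_state, successors, score, max_iterations=1000):
    """Generic Steepest Ascent Hill Climbing: at every step move to the best
    neighbour, and stop as soon as no neighbour improves on the current state."""
    state = initial_state
    for _ in range(max_iterations):
        best = max(successors(state), key=score, default=None)
        if best is None or score(best) <= score(state):
            return state                     # local (possibly global) optimum
        state = best
    return state

# Toy example: maximise -(x - 3)^2 over the integers by stepping left or right.
result = steepest_ascent(
    initial_state=random.randint(-10, 10),
    successors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
print(result)   # 3
```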

2.2 Related Work

The use of compression algorithms for natural language tasks has been tested on different tasks, such as entity extraction (Witten et al., 1999). Frank et al. (2000) were presumably the first to perform topic classification with a compression-based method. They listed some benefits of a compression model compared to "bag of words" methods, including the ability to make an overall judgement of the document, in contrast to pre-selected features. Moreover, it avoids the problem of defining word boundaries, deals uniformly with morphological variants of words and can take account of phrasal effects that span word boundaries. However, after performing classification on the Reuters-21578 corpus3, they concluded that, despite outperforming Naive Bayes in multiple categories, the results of PPM were inferior to the state-of-the-art machine learning techniques for classification. This inferiority was attributed to the insensitivity of compression methods to subtle differences between articles that do belong to a class and those that do not.

Notwithstanding the rather inferior results of Frank et al. (2000), more research on compression-based classification was performed, including the PhD thesis of Thaper (2001), who used character-based and word-based PPM for authorship attribution and language identification, at which character-based PPM turned out to be the most effective method. According to the results, compression-based classification is highly reliable for language identification, even with small document sizes, while for authorship attribution the accuracy strongly depends on the document size: high accuracy was achieved with large documents, but it degrades strongly if the document size is below a certain threshold. Like Thaper, Kukushkina et al. (2001) also used compression for authorship attribution, but with different, more standard, compression methods. Their experiment showed that compression algorithms assign the correct author quite often.

The Best-Compression Neighbour (BCN) method was developed and reviewed, on language recognition and authorship attribution, by Benedetto et al. (2002). BCN is a 1-Nearest Neighbour method with the difference in compression sizes as metric. For every test file X and training file A_i, the difference C(A_i + X) − C(A_i) is calculated, and test file X is subsequently assigned to the class with the lowest difference in compression sizes. According to Benedetto et al. (2002) the performance was excellent, since all files were classified to the right language or author.

However, there are not only proponents of compression-based classification, as Goodman (2002) expressed some criticism on the publication of Benedetto et al. (2002). He argued that their idea was not innovative, since previous literature had suggested similar ideas, i.e. MDL-based methods. Besides, Goodman (2002) showed that Naive Bayes outperformed a compression method on the text classification task, both in speed and accuracy.

Yet, Khmelev and Teahan (2003) were able to produce accurate author classifications with compression-based methods, especially with RAR, which outperformed SVM. Moreover, they discovered that the Reuters Corpus Volume 1 contains 3% duplicates, which even had different labels in half of the cases. According to Khmelev and Teahan (2003), this is the reason for the poor compression-based performance for topic classification reported by Frank et al. (2000). They conclude that the PPM approach is more suitable than SVM for classification problems where the texts come from a single source, such as texts written by a single author or belonging to a single language. However, SVM is more suitable for classification problems in which the presence of single words determines the class to which a text belongs, such as the Reuters-21578 corpus.

Teahan and Harper (2003) also achieved accurate classifications with compression-based methods, comparing a PPM-based method with Naive Bayes and Linear Support Vector Machines (LSVMs). They concluded that their PPM method outperformed Naive Bayes in almost all categories, but was still inferior to LSVM in the majority of the categories. Besides the comparison, they also improved the PPM algorithm by performing feature selection based on maximising precision/recall through the code length difference, which performed slightly better than PPM without feature selection.

Marton et al. (2005) compared topic classification of different corpora with RAR, LZW and gzip using the ADML and BCN algorithms, at which RAR combined with ADML was superior. Furthermore, they investigated whether compression-based topic classification benefits from punctuation, subword and superword features. Their results showed that in some cases compression algorithms may benefit from punctuation and subword information while performing classification, though RAR benefits considerably from superword features.

Classification has also been applied to the law domain, for instance by Governatori (2009). Their maximum accuracy was only 63.6%; while this was better than their simple baseline and random classification, it is not enough to be of practical use. Furthermore, their results indicate that law documents are too large for classification, since they are too diffuse to assign to a class.

Moreover, topic models have also been applied to law documents. Boer and Winkels (2016) applied LDA to a cold start recommendation problem. Their results confirmed that content-based classification approaches do not perform well in legal settings, since their users mostly avoided the folders generated by the topic model and instead preferred user-selected folders.


3 Approach

In this section the decisions required to answer the research question are examined and substantiated. First, a type of classification and topic modelling is chosen. Secondly, a supervised classification algorithm is determined for the first experiment. Subsequently, a solution metric for the Best Sets, Worst Set and Hill Climbing topic modelling is defined. Finally, an evaluation method for the supervised as well as the topic model experiments is determined.

The classification of the documents will be performed in a multi-class setting; however, all classifications will be single-label, i.e. each document will be assigned to one and only one category (Sebastiani, 2002). This restriction is imposed to preserve simplicity.

In section 2.2 multiple compression-based classification setups have been discussed. Generally speaking, the principal techniques are standard MDL, ADML or BCN, which can all be combined with different types of compression or compression programs. In the desire to keep the program simple and therefore allow general usage, an off-the-shelf compression program will be used. Due to this decision, standard MDL will not be possible, since it requires a modification of the compression program in order to extract the dictionary from the archive. Remaining are ADML and BCN, whose classification performances are almost equal; however, BCN requires many more compressions, because it is a 1-Nearest Neighbour approach, hence ADML will be used during the experiments. Marton et al. (2005) reported that they achieved the best performance with ADML in combination with RAR, which is freely available on multiple operating systems, so the method used here could be broadly adopted.

To generate a valuable topic model, the documents in the classes have to be similar, thus the overall information distance has to be minimised. As the Information Distance cannot be computed, the best solution is to use the NCD, because the modelling is already performed with a compression-based method. Thus a good topic model minimises the NCD, although this also raises a problem, since the NCD is a similarity measure between two documents, while the classes will consist of more than two documents. The probably most intuitive concept of averaging all pairwise NCDs is used, though this is presumably not the most computationally efficient solution. The formula for a state that contains c categories, which themselves contain n_{c_i} subsets of two documents, is:

e_Z(\text{state}) = \frac{1}{c} \sum_{i=1}^{c} \frac{1}{n_{c_i}} \sum_{j=1}^{n_{c_i}} \frac{Z(x_j) - \min\{Z(x_j[0]), Z(x_j[1])\}}{\max\{Z(x_j[0]), Z(x_j[1])\}} \qquad (5)

Here n_{c_i} is the number of subsets of two documents for category i, Z(x_j) is the compressed size of subset j, Z(x_j[0]) is the compressed size of the first file of subset j, and Z(x_j[1]) is the compressed size of the second file of this subset. Whilst this provides a solution for the NCD over multiple documents, minimising the NCD yields another problem, namely the degeneration that occurs when most documents are drawn into a single category. This is undesirable, since a topic model with almost all files in one category has no value. Thus, a situation is desired in which each category holds almost the same number of documents as the others; in other words, each category has an almost equal chance of being allocated a new document. The Shannon entropy is therefore the ideal measurement for this distribution of the documents across the categories. The entropy increases with more evenly distributed classes, so a maximisation of the entropy is pursued. These two measurements can be combined to obtain a value measurement of a topic model state, which results in the following formula:

\text{state value} = e_Z(\text{state}) / H(\text{state}) \qquad (6)
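
A minimal sketch of Equations (5) and (6) is given below, again with bz2 as a stand-in for RAR. A state is represented as a list of classes, each holding the raw bytes of its documents; every class is assumed to contain at least two documents and at least two classes are assumed to be non-empty. All helper names are illustrative.

```python
import bz2
import math
from itertools import combinations

def compressed_size(data: bytes) -> int:
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    zx, zy, zxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (zxy - min(zx, zy)) / max(zx, zy)

def average_ncd(state):
    """Equation (5): mean over classes of the mean NCD over all document pairs in a class."""
    per_class = []
    for documents in state:                       # each class must hold at least two documents
        pairs = list(combinations(documents, 2))
        per_class.append(sum(ncd(a, b) for a, b in pairs) / len(pairs))
    return sum(per_class) / len(per_class)

def entropy(state):
    """Shannon entropy of the distribution of documents over the classes."""
    total = sum(len(documents) for documents in state)
    return -sum((len(d) / total) * math.log2(len(d) / total) for d in state if d)

def state_value(state):
    """Equation (6): the quantity that the Hill Climbing topic model minimises."""
    return average_ncd(state) / entropy(state)    # requires at least two non-empty classes
```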

For completeness, the following evaluation methods, common in text classification tasks, will be used: accuracy, precision, recall and the F1 score. Detailed descriptions and formulas are given by Sokolova and Lapalme (2009). Since the classification will be performed with multiple classes, the averages of precision, recall and the F1 score over these classes are taken as the general values.

Just like the supervised experiment, the topic model experiments will be evaluated as a multi-class classification. However, "exact" classes cannot be determined, therefore the above evaluation methods will not be used; instead the result of the topic models will only be evaluated based on the distribution of the files across the categories. In other words, a good result is one in which each category consists only of documents of the same "type". These types will therefore be determined in section 4.1, and thus before the experiment is performed.

4 Experiments and Results

This section continues with a description of the data, followed by the pre-processing of the documents. Subsequently, the layout of the three experiments is described. All of these experiments are implemented with Python 3.6.4 and RAR 5.5, whose archives are constructed from the terminal without any adjustment of the parameters.

4.1 Data and Pre-processing

As stated before, the classification will be performed on legal verdicts from the ODR. These are available in an archive4, which is updated monthly and consists of numerous folders corresponding to the years of the verdicts. These folders themselves contain archives for each month of that year in which verdicts were published.

At the time of this thesis, the downloaded archive consisted of folders ranging from 1913 to 2018 and contained a total of 353,683 documents, which is too many for the experiments, so the subset of documents from January 2018 was used, which narrowed the number down to 11,336. Yet, this data is not ready to be used in the experiments, because the documents are still in XML format and therefore contain numerous tags and additional information besides the actual verdict. Thus, the actual verdict is extracted from the XML documents and saved as a text file. During this process, it appeared that 8552 documents are anonymous and therefore lack a textual verdict, so the remaining 2784 files could be used for the supervised classification.

Since the verdicts of the ODR are available in XML format, they contain several characteristics, of which some can be used to perform classification, namely: jurisdiction, organisation and document type. These three presumably represent the general topics of the documents. Other characteristics may also be a clear representation of the documents and therefore result in an accurate classification, but these are not used in this thesis, since the topic model is considered the main objective.

4At the time of the thesis, this archive could be downloaded at the following URL: static.

The documents in the corpus belong to 23 different jurisdictions, though not every jurisdiction has a sufficient number of files. Therefore all jurisdictions containing fewer than 25 files do not participate in the classification. The remaining jurisdictions, with their number of documents, are listed in Table 1, where indented items are sub-jurisdictions.

Jurisdiction                        # documents
Bestuursrecht                               420
— Ambtenarenrecht                            76
— Belastingrecht                            184
— Omgevingsrecht                             48
— Socialezekerheidsrecht                    300
— Vreemdelingenrecht                        199
Civiel Recht                                575
— Arbeidsrecht                               57
— Burgerlijk procesrecht                     29
— Insolventierecht                           25
— Personen- en familierecht                 139
— Verbintenissenrecht                        80
Strafrecht                                  558

Table 1: Number of documents in the corpus per jurisdiction

For the same reason as the reduction of the jurisdictions, only the top five organisations, based on the number of files they contain, take part in the classification. Thus, the number of organisations is reduced from 27 to five; these are listed in Table 2. Since there are only two document types, which both contain sufficient files, all types are used in the classification process. The distribution of documents across these types is displayed in Table 3.

Organisation                              # documents
Centrale Raad van Beroep (CRVB)                   332
Gerechtshof Amsterdam (GHAMS)                     176
Gerechtshof Arnhem-Leeuwarden (GHARL)             196
Raad van State (RVS)                              353
Rechtbank Den Haag (RBDHA)                        228

Table 2: Number of documents in the corpus per organisation


Type            # documents
Conclusie               123
Uitspraak              2661

Table 3: Number of documents in the corpus per type

4.2 Supervised Classification

The supervised classification experiment is performed to verify the usability of the corpus; additionally, the performance of the different characteristics is compared in order to select the most descriptive one, which will be used for the data of the topic model experiments.

The classification consists of three stages: retrieval of the documents, partitioning into training and test documents, and assignment of the test files to a class. Firstly, the documents are retrieved from their folders. Each jurisdiction, organisation and type has its own folder, which provides a label for the documents at the time of retrieval. Secondly, the documents are split into a training batch per class and an overall test batch. The size of the training batches varies and is a user-adjustable parameter, so different values of this size will be evaluated. Finally, each test file T is assigned to the class C_i that minimises the compressed size difference v_i (Marton et al., 2005):

v_i = |A_i T| - |A_i| \qquad (7)

where |A_i| is the compressed file size of the combined file of all documents in class C_i, and |A_i T| is the compressed file size of the combined file of A_i together with test file T.
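
A sketch of this assignment rule is given below, with bz2 compressing concatenated byte strings in memory as a stand-in for the RAR archives that are built on disk in the thesis; the labels and texts are hypothetical.

```python
import bz2

def compressed_size(data: bytes) -> int:
    return len(bz2.compress(data))

def classify(test_document: bytes, classes: dict) -> str:
    """Assign `test_document` to the class C_i minimising v_i = |A_i T| - |A_i| (Equation 7).

    `classes` maps a class label to the concatenated training documents of that class."""
    differences = {
        label: compressed_size(training + test_document) - compressed_size(training)
        for label, training in classes.items()
    }
    return min(differences, key=differences.get)

# Hypothetical usage with training texts concatenated per jurisdiction.
classes = {
    "Bestuursrecht": b"...concatenated Bestuursrecht training verdicts...",
    "Strafrecht": b"...concatenated Strafrecht training verdicts...",
}
print(classify(b"...an unlabelled verdict...", classes))
```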

The classification was first performed on the main jurisdictions "Bestuursrecht", "Civiel Recht" and "Strafrecht", the top-5 organisations and the document types. For each of these, two different corpus sizes were used: either the full number of documents or one hundred documents per class, hereafter called the whole and partial size. The partial size is used to look into the classification behaviour on a small corpus, since the topic model will be performed on a corpus of only 50 documents. This results in the following performance scores, where the left charts show the performance on the partial size and the right charts the performance on the whole size. This applies to Figures 1 to 3.

Figure 1: F1 score of supervised classification across the main jurisdictions

Figure 2: F1 score of supervised classification across the organisations

Figure 3: F1 score of supervised classification across the types

Only the F1 score is reviewed in this section, because this is the standard measurement for supervised classification; all four performance scores are provided in Appendix A. Firstly, the results show that the F1 score degrades significantly when all documents are classified, which could be due to the small size of the training set. Secondly, the performance of the characteristics differs considerably, since classification on jurisdiction gives a significantly better result than on organisation or type, which performs worst. This difference is especially pronounced when the complete corpus is used. Thus, jurisdiction is the most representative characteristic of the corpus, followed by organisation and type. For this reason some more classifications were performed on the sub-jurisdictions of "Bestuursrecht" and "Civiel Recht", as well as on the combination of these sub-jurisdictions. For readability, these results are only included in Appendix A. They appear, however, to perform slightly worse than the jurisdiction classes shown in Figure 1.

4.3 Topic Modelling

Since the topic model will be evaluated as a multi-class classification, but the "exact" class of each document is not known, the result will be evaluated based on the distribution of similar files across the classes. Thus, to make this evaluation useful, the data needs to be equally distributed. The results of the previous experiment have shown that jurisdiction is the most representative characteristic, followed by organisation. Therefore, 25 documents are selected from "Bestuursrecht", while the other 25 belong to "Strafrecht". However, these 50 documents are not chosen randomly from these jurisdictions, because the organisation of a document could also influence the topic model; therefore the most frequent organisations in the partial data of both classes are used, corresponding to "College van Beroep voor het bedrijfsleven" (CBB) for "Bestuursrecht" and "Gerechtshof Amsterdam" (GHAMS) for "Strafrecht". Based on a small graph of NCD values between six documents, the assumption that these two classes each form a separate cluster can be manually validated: Figure 4 displays that the three CBB documents form a cluster, as do the three GHAMS documents.

Figure 4: NCD connected graph of documents

The optimal solution of the topic model with n classes can be found by generating all possible states with n subsets of the set of files, under two conditions: each subset must contain at least one file and each file may occur only once. Thus, a document cannot occur in two subsets of the same state and no empty subsets are allowed. This solution is, however, incomputable due to the number of possible states, especially on general-purpose computers, where a lack of computation power and memory will occur. Consider, for example, a topic model of 8 classes on a corpus of 50 documents, which is a small corpus compared to corpora available online. This topic model can be constructed with almost equally balanced class sizes or with unbalanced class sizes, so all combinations of subsets of different sizes of the corpus have to be investigated. Table 4 displays the increase in possible subsets of a certain size (these counts can be reproduced with the small check below). This leads to the conclusion that computing all possible states is impossible, because a combination of the numbers of different subsets has to be calculated, which is simply too large.
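
The entries of Table 4 are binomial coefficients, the number of ways to choose a subset of the given size out of 50 documents, and can be verified in two lines (math.comb requires Python 3.8 or later):

```python
from math import comb

# Number of distinct subsets of the given size in a corpus of 50 documents.
for size in range(2, 7):
    print(size, comb(50, size))   # 1225, 19600, 230300, 2118760, 15890700
```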


Number of classes    Number of sets
2                              1225
3                             19600
4                            230300
5                           2118760
6                          15890700

Table 4: Increase in possible solutions with a corpus of 50 documents

Thus, multiple alternative topic model algorithms will be analysed, among which Best Sets, Worst Set and Hill Climbing. First of all, the Best Sets (BS) solution is implemented, which is the concept of representing the training data of a supervised classification by the most similar pairs found in the corpus. Such a training data representation leaves some unassigned files, which are assigned to the selected pairs as in a supervised classification, thus via the ADML algorithm. This eliminates the computational problem, since only subsets of size 2 need to be generated. The implementation of the BS algorithm is given in Algorithm 1 and the topic model resulting from its application is shown in Figure 5.

Algorithm 1 Best Sets Algorithm

procedure BS(text directory, files, subset directory, number of classes)
    initial classes, ncds ← [ ]
    subsets ← getSubsets(files, size = 2)
    for subset in subsets do
        ncd ← calculateAverageNCD(subset)
        append ncd to ncds
    for number of classes times do
        index ← getMinNCD(ncds)
        append subsets[index] to initial classes
        remove index from subsets and ncds      ▷ so the next pick is the next most similar pair
    return initial classes

Figure 5: Number of files, labelled according to their organisation, per class

The resulting topic model of the BS algorithm indicates that this method will not yield useful topic models, since there is no guarantee that the selected pairs come from different topics. Thus, it is quite possible that more than half of the matches come from one topic, which leads to an unbalanced topic model. This resulted in the implementation of an alternative approach based on the assumption that documents that are the least similar, i.e. have the highest NCD, will not be in the same class. This approach is named Worst Set (WS), because it selects the documents with the worst Information Distance. So, for n classes a set of n files with the highest average NCD is generated. These n files then correspond to the initial classes, to which the remaining files will be assigned. This concept is, however, also incomputable, because when the number of classes increases, the number of possible worst sets increases even more, as displayed in Table 4. Because of this increase, only the Worst Set topic model with two classes is generated (Figure 6).

Figure 6: Worst Set Topic Model

According to the expectations, one class should contain almost only CBB documents, while the GHAMS documents belong to the other class. Since this is not the case in the topic model shown in Figure 6, it is not considered accurate. However, to verify the performance of the WS approach completely, a topic model with more than two classes should be evaluated. To realise this, two alternative WS algorithms are implemented. The first one, named Random Iterative Worst Set (RIWS), performs x iterations for a topic model with n classes. These iterations consist of selecting a random subset of n documents from the corpus, after which the average NCD of this subset is calculated and stored. When the x iterations have finished, RIWS selects the stored subset with the highest NCD and uses the n files of this subset as the initial classes of the topic model. Just like BS and WS, the remaining files are assigned to one of these classes. RIWS is a rather naive solution, since it could return a set which is not even close to the worst set, so a second algorithm, named Hill Climbing Worst Set (HCWS), is implemented. As the name suggests, this is a Steepest Ascent Hill Climbing approach to finding the worst set. For a topic model of n classes, HCWS starts by randomly selecting n files from the corpus. Thereafter it performs a maximum of x iterations, each of which consists of retrieving all possible new sets in which one document of the set is replaced by a document from the corpus that is not already in the set. For every possible new combination the average NCD is calculated and stored, and subsequently the combination with the highest NCD is selected as the new solution. This process terminates when HCWS reaches its maximum number of iterations or when the newly selected solution is worse, i.e. has a lower NCD than the previous solution. Just like BS, WS and RIWS, the remaining files are assigned to these classes. A clear outline of the algorithms discussed above is included in Appendix B.

(a) RIWS Topic Models

(b) HCWS Topic Models

Figure 7: Number of files, labelled according to their organisation, per class

method        State value               NCD
              minimum      average      minimum      average
RIWS 2        1.152747     2.162216     0.604396     0.761927
RIWS 8        0.355250     0.425651     0.620649     0.735371
HCWS 2        1.152713     1.392561     0.531245     0.805554
HCWS 8        0.329399     0.422235     0.616214     0.728939

Table 5: Performance of 20 different RIWS and HCWS topic models

The topic models in Figure 7 are the best results, i.e. the results that minimise the state value of Equation 6, out of 20 tries of RIWS and HCWS. For completeness, the distributions of all these 20 tries are included in Appendix B. Moreover, Table 5 shows the performance of these 20 topic models. These results indicate that RIWS as well as HCWS can produce valuable topic models; however, they could also return worthless topic models at any moment, since the averages of the state values and NCD are significantly higher than their minima. Thus, a more reliable topic model algorithm is required to produce valuable topic models every time.

So finally, the Steepest Ascent Hill Climbing algorithm is applied to the whole corpus, searching for the state that minimises the state value. First, the algorithm randomly assigns the files of the corpus to one of the n classes, where each class contains at most one file more than any other class, so that the size of the classes will not influence the outcome of the topic model. Subsequently, all possible successor states are determined based on 1-versus-1 or 2-versus-1 swaps of files between classes. Finally, the current state is replaced by the successor that minimises the state value. In this case, the NCD alone is not the desired metric to minimise, since 2-versus-1 swaps are allowed, which would otherwise result in a shift of all files towards one class. The process of adjusting the state by swapping files terminates when the state value can no longer be reduced or when the user stops the algorithm manually, both of which result in a topic model. A detailed implementation of the Hill Climbing approach is given in Algorithm 2, while the results are displayed in Figure 8 and Table 6.

Algorithm 2 Steepest Ascent Hill Climbing Algorithm

procedure HillClimbing(directory, files, number of classes, iterations)
    state ← initializeRandomStart(directory, files, number of classes)
    for iterations do
        values ← [ ]
        successors ← getPermutations(state)
        for node in successors do
            ncd ← calculateAverageNCD(node)
            H ← calculateEntropy(node)
            append ncd / H to values             ▷ state value of Equation 6
        if min of values < state value of state then
            index ← index of min of values
            state ← successors[index]
        else
            break
    return state
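
The successor generation based on 1-versus-1 and 2-versus-1 swaps could look as sketched below; the representation of a state as a list of lists of file names, and the function name, are assumptions of this sketch, not the thesis code.

```python
from itertools import combinations, permutations

def swap_successors(state):
    """Generate successor states by swapping files between two classes.

    `state` is a list of classes, each a list of file names."""
    successors = []
    # 1 versus 1: exchange one file of class i with one file of class j.
    for i, j in combinations(range(len(state)), 2):
        for a in state[i]:
            for b in state[j]:
                new = [list(c) for c in state]
                new[i].remove(a); new[j].remove(b)
                new[i].append(b); new[j].append(a)
                successors.append(new)
    # 2 versus 1: move two files of class i to class j in exchange for one file of class j.
    for i, j in permutations(range(len(state)), 2):
        for a1, a2 in combinations(state[i], 2):
            for b in state[j]:
                new = [list(c) for c in state]
                new[i].remove(a1); new[i].remove(a2); new[j].remove(b)
                new[j].extend([a1, a2]); new[i].append(b)
                successors.append(new)
    return successors

# Each successor would then be scored with the state value of Equation (6),
# and the best-scoring one replaces the current state.
print(len(swap_successors([["d1", "d2", "d3"], ["d4", "d5"]])))   # 15
```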

Figure 8: Number of files, labelled according to their organisation, per class of the Hill Climbing topic models


method           2 classes                                8 classes
                 minimum     average     maximum      minimum     average     maximum
RIWS             1.152747    2.162216    5.007450     0.355250    0.425651    0.513576
HCWS             1.152713    1.392561    3.771658     0.329399    0.422235    0.584586
Hill Climbing    1.152713    1.155446    1.170769     0.304473    0.308137    0.313778

Table 6: State values of 20 different tries of RIWS, HCWS and Hill Climbing topic models

The Hill Climbing algorithm with two topics, displayed in Figure 8, presents a topic model similar to those of RIWS and HCWS, while the topic model with eight topics is substantially different. The latter could be considered more valuable than those of RIWS and HCWS, since it is slightly better distributed. The minimal state values, highlighted in Table 6, also indicate that the Hill Climbing topic model is more accurate. Because this algorithm also outperforms both other methods on the average and maximum state value, it is more reliable. All in all, the results indicate that the Hill Climbing approach is a valid method for compression-based topic modelling.

5 Discussion

While the approach with the average NCD turned out to work quite well, it nevertheless has some poor properties, namely its computability and its insensitivity to variance in the data. The average NCD is computed over all possible pairs of documents in a class, so with a larger corpus the number of required archives increases significantly, which makes the average NCD far from ideal in terms of computation. Furthermore, taking an average imposes insensitivity to variance, because a good average can exist alongside high variance. Thus, the average NCD does not exclude the existence of substantial variance in a state or class, which could result in a non-ideal topic model.

Moreover, the assumption that the data selected for the topic model experiments is a good representation of the topics may not hold for this or any other corpus. Combined with the evaluation of the topic models as if they were supervised classifications, this does not guarantee that a good topic model as result also implies that the used method is valid: a wrong assumption about the data could still produce the expected topic model, in which case the result, and therefore the topic model algorithm, could be considered wrong. Thus, an expected topic model as result does not necessarily imply the validity of the algorithm.

The presented performance of the supervised compression-based classifier on the whole amount of data might not be representative of the abilities of such a compression-based algorithm, since the ratio of training data to test data is not optimal in the experiments with the whole size. Larger amounts of training data will presumably increase the performance.

6 Conclusion

Based on the results discussed in the previous section, some conclusions can be drawn. The first is that the data of the ODR is suitable for classification, since the classification performance on the partial data was quite impressive, although classification on the whole corpus resulted in significantly lower performance, which could be due to the training/test data ratio explained in the previous section. Secondly, based on the performance rates of the different characteristics of the documents, it can be concluded that jurisdiction is the most representative, followed by organisation and finally type.

Furthermore, not all possible combinations of the compression-based topic model can be evaluated, due to the lack of computation power.

Next, the Best Sets method will not result in a valuable topic model, since multiple of the most similar pairs of documents could belong to one specific topic, which is a poor representation of the supervised training data. Thus, it can be concluded that a compression-based Best Sets topic modelling algorithm will not produce valuable topic models.

The set of documents that are the least similar, i.e. the Worst Set, cannot be computed when the number of desired classes exceeds four. However, this set can be approximated with RIWS and HCWS, which both can result in a valuable topic model, but this is not guaranteed. HCWS performs slightly better than RIWS, since its average state value is lower. Thus, it can be concluded that a compression-based Worst Set topic modelling algorithm will often not result in qualitative topic models, though it sometimes produces a valuable one.

Lastly, the Steepest Ascent Hill Climbing approach to compression-based topic modelling, based on minimisation of Equation 6, often results in well-distributed topic models. The best topic model as well as the average one outperformed both RIWS and HCWS. Thus, it can be concluded that the described algorithm provides quite impressive and valuable topic models and could therefore be used in practice or act as a foundation for further research on compression-based topic modelling.

7 Future Work

Future research could implement simulated annealing or backtracking in the Hill Climbing algorithm, to avoid finding local optima instead of the global optimum. Additionally, the algorithm could be extended so that it can also assign documents to multiple topics, just like LDA does. Subsequently, the computational aspect could be improved, for instance by replacing the average NCD with a different NCD metric for multiple files and classes, or by adjusting the creation of the archives such that they do not have to be saved to disk. Besides these improvements, the Hill Climbing algorithm could also be applied to different or larger corpora, in order to verify whether the same results can be achieved on those documents. Finally, the performance of compression-based topic modelling could be compared to the current state-of-the-art topic model algorithms.

References

Appleby, B. and Hall, N. A. (1961). Techniques for producing school timetables on a computer and their application to other scheduling problems. OR, 12(3):201–201.


Benedetto, D., Caglioti, E., and Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4).

Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Boer, A. and Winkels, R. (2016). Making a cold start in legal recommendation: An experiment. In F. Bex & S. Villata (Eds.), Legal Knowledge and Information Systems, JURIX 2016: The Twenty-Ninth Annual Conference, pages 131–136. Amsterdam: IOS Press.

Cleary, J. and Witten, I. (1984). Data compression using adaptive coding and partial string matching. Communications, IEEE Transactions on, 32(4):396–402.

Frank, E., Chui, C., and Witten, I. H. (2000). Text categorization using compression models.

Goodman, J. (2002). Extended comment on language trees and zipping.

Governatori, G. (2009). Exploiting properties of legislative texts to improve classification accuracy. In Legal Knowledge and Information Systems: JURIX 2009, the Twenty-Second Annual Conference, volume 205, page 136. IOS Press.

Khmelev, D. and Teahan, W. (2003). A repetition based measure for verification of text collections and for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR '03, pages 104–110. ACM.

Kolmogorov, A. N. (1968). Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1-4):157–168.

Kukushkina, O., Polikarpov, A., and Khmelev, D. (2001). Using literal and grammatical statistics for authorship attribution. Problems of Information Transmission, 37(2):172–184.

Mahdi, O. A., Mohammed, M. A., and Mohamed, A. J. (2012). Implementing a novel approach an convert audio compression to text coding via hybrid technique. International Journal of Computer Science Issues, 9(6):53–59.

Marton, Y., Wu, N., and Hellerstein, L. (2005). On compression-based text classification. In European Conference on Information Retrieval, pages 300–314. Springer.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423.


Shkarin, D. (2001). Improving the efficiency of the PPM algorithm. Problems of Information Transmission, 37(3):226–235.

Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing and Management, 45(4):427–437.

Teahan, W. J. and Harper, D. J. (2003). Using compression-based language models for text categorization. In Language Modeling for Information Retrieval, pages 141–165. Springer.

Terwijn, S. A., Torenvliet, L., and Vitányi, P. M. (2011). Nonapproximability of the normalized information distance. Journal of Computer and System Sciences, 77(4):738–742.

Thaper, N. (2001). Using compression for source-based classification of text. PhD thesis, Massachusetts Institute of Technology.

Vitanyi, P. M. B., Balbach, F. J., Cilibrasi, R. L., and Li, M. (2008). Normalized information distance.

Witten, I. H., Bray, Z., Mahoui, M., and Teahan, W. J. (1999). Using language models for generic entity extraction. In Proceedings of the ICML Workshop on Text Mining. Citeseer.

Yang, Y. and Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’99, pages 42–49. ACM.

Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression. Information Theory, IEEE Transactions on, 23(3):337–343.

Ziv, J. and Lempel, A. (1978). Compression of individual sequences via variable-rate coding. Information Theory, IEEE Transactions on, 24(5):530–536.


Appendix A Supervised Classification Results

(a) Partial Size

(b) Whole Size

Figure 9: Accuracy, F1 score, Precision and Recall of the classification of Bestuursrecht, Civiel Recht and Strafrecht


(a) Partial Size

(b) Whole Size

Figure 10: Accuracy, F1 score, Precision and Recall of the classification of the organisations


(a) Partial Size

(b) Whole Size

Figure 11: Accuracy, F1 score, Precision and Recall of the classification of the types


(a) Partial Size

(b) Whole Size

Figure 12: Accuracy, F1 score, Precision and Recall of the classification of the "Bestuursrecht" sub-jurisdictions


(a) Partial Size

(b) Whole Size

Figure 13: Accuracy, F1 score, Precision and Recall of the classification of the "Civiel Recht" sub-jurisdictions


(a) Partial Size

(b) Whole Size

Figure 14: Accuracy, F1 score, Precision and Recall of the classification of all the sub-jurisdictions


Appendix B Worst Set Topic Modelling

Algorithm 3 Random Iterative Worst Set Algorithm

procedure RIWS(text directory, files, number of classes, iterations)
    states, ncds ← [ ]
    for iterations do
        state ← number of classes random elements of files
        ncd ← calculate NCD of state
        append state to states
        append ncd to ncds
    return element of states with the highest value in ncds      ▷ the worst (least similar) set


Algorithm 4 Hill Climbing Worst Set Algorithm

procedure HCWS(text directory, files, number of classes, iterations)
    state ← number of classes random items from files
    statencd ← calculate NCD of state
    for iterations do
        successors ← all permutations where 1 item is removed from state and 1 is appended from files
        ncds ← calculate NCD of each item in successors
        if max of ncds > statencd then
            state ← item of successors with the max value of ncds
            statencd ← max of ncds
        else
            return state, statencd
    return state, statencd


Appendix C Steepest Ascent Hill Climbing Topic Modelling

