
Multi-label Text Classification for Ground Lease Documents

Submitted in partial fulfilment of the requirements for the degree of Master of Science

Rouel de Romas
11073837

Master Information Studies (Data Science)
Faculty of Science, University of Amsterdam

June 23, 2019

              Internal Supervisor   External Supervisor 1    External Supervisor 2
Title, Name   dhr. Ziming Li        Ramses Oomen             Jorg Meurkes
Email         z.li@uva.nl           r.oomen@amsterdam.nl     j.meurkes@amsterdam.nl
Affiliation   UvA                   Gemeente Amsterdam       Gemeente Amsterdam


Multi-Label Text Classification for Ground Lease Documents

Rouel de Romas

rouel.deromas@student.uva.nl 11073837

Universiteit van Amsterdam Amsterdam

ABSTRACT

The focus of this paper is to predict the judicial destinations in ground lease documents in Amsterdam. Currently, the municipality extracts judicial destinations from ground lease documents manually. This is time consuming, since information from a total of 180,000 ground lease documents needs to be extracted. This paper looks at using machine learning techniques to predict the judicial destinations of ground lease documents. Furthermore, different methods were explored to represent these documents, and different evaluation metrics were utilized to compare the performances of the classification methods. The combination of Doc2Vec and Multi-Layer Perceptrons (MLP) achieves the highest overall performance. Another finding is that predicting rarely seen judicial destinations is very challenging with the settings of this research.

KEYWORDS

text classification, multi-label classification, natural language processing, data science, ground lease documents

1 INTRODUCTION

Ground lease is a type of lease present in Amsterdam where a person gets the right to use land that is owned by someone else [18]; in Amsterdam, most of the land is owned by the municipality. The right is valid for 50 or 75 years (continuous ground lease), and the holder of the right pays either a single payment or periodical payments. After the 50 or 75 year period, the value of the ground is re-calculated, which in turn influences the amount that will be paid for the upcoming period. As of the 28th of June 2017, the municipality introduced a new policy regarding ground lease. It is now possible for ground lease holders to switch from a continuous ground lease to a perpetual ground lease. In a perpetual ground lease, the value of the ground is only calculated once; as a result, the amount that will be paid is not recalculated for each period.

Due to the introduction of the policy, it was necessary for the municipality to process the change to perpetual ground lease for the leaseholders. Specific information must be extracted from the ground lease documents, such as the surface area, but most importantly the judicial destinations of the grounds. The judicial destinations determine what is allowed to be built on a ground. The extraction process is time consuming and labor-intensive for the municipality. Currently, the information extraction processes by the municipality are performed manually by iterating over the documents and entering selected terms in a search engine. Afterwards, the results of the search engine are evaluated for their relevance in future applications.

The goal of this research is to extract the judicial destinations from ground lease documents, which leads to the research question that will be answered:

How can the judicial destinations of Dutch ground lease documents be predicted using text classification methods?

The main research question will be answered with the use of the following sub questions:

• How can features be constructed from domain knowledge?
• Which classification methods are applicable to predict the judicial destinations?
• How good is the performance of predicting judicial destinations?

The remainder of this thesis proceeds as follows: the first part describes literature related to this thesis. Secondly, the methodology of this research is described, which focuses on the document representation, domain knowledge features, and the classification methods. Thirdly, the data that was used is described. In the fourth part, the experimental setup is described. The fifth part shows the results of the experiments. The sixth part discusses the results shown in the previous part. The last part describes the conclusion of this thesis.

2 RELATED WORK

This section describes literature related to text classification. Data is available that indicates which judicial destinations are mentioned in the ground lease documents, which makes the task suitable for text classification.

2.1 Machine Learning in the legal domain

As of today, little attention has been given to multi-label text classification on legal documents; therefore several papers on machine learning and legal documents will be discussed to highlight what has already been established in this field. In one study that focused on using machine learning for legal practice, text summarization was applied to legal documents [9]. Kanapala et al. discuss different text summarization techniques.

Purpura and Hillard [19] involved machine learning in the legal domain with an automated classification system that categorized legislative text, United States Congressional bills specifically, into specific topics. Purpura and Hillard described that legislative texts have different language patterns and characteristics compared to other texts such as news stories. Their approach to the classification problem was to use Support Vector Machines (SVM), because of the good performance of this classification method on different kinds of tasks.


2.2 Multi-label classification approaches

As mentioned in the previous section, little attention has been given to multi-label text classification in the legal domain. To give a perspective on how multi-label classification is applied to text, this section discusses several papers on multi-label text classification approaches for text documents.

Al-Salemi et al. [1] applied multi-label text classification to categorize Arabic text into different topics. Al-Salemi et al. accomplished this by using a dataset of 23,837 Arabic news articles and over 40 labels, on several multi-label learning algorithms. In the paper, four transformation approaches were described: binary relevance, classifier chains, calibrated ranking, and label powerset. The transformation approaches provide an opportunity to use single-label classifiers, with SVM, k-Nearest-Neighbors, and Random Forest as base classifiers, which are classifiers usually used for multi-class classification. SVM performed best among the base classifiers when the transformation approaches were used.

Another possible classifier for multi-label classification is a neural network. Zhang and Zhou [26] used the well-known Reuters-21578 dataset to experiment with neural networks on a multi-label classification problem. Zhang and Zhou found that the performance of a neural network approach to multi-label classification is better than that of already known multi-label methods.

3 METHODOLOGY

In this section the methods used for this research are described. Figure 1 shows a pipeline indicating the steps that were taken for predicting the judicial destinations of ground lease documents. Firstly, the processed data is transformed into document representations, either with the vector space model with TF-IDF [13] weighting or with Doc2Vec [11]. Secondly, there is an option to include or exclude domain knowledge features in the document representations. Lastly, with the document representation, three classifiers were implemented: Support Vector Machines, Logistic Regression, and Multi-layer Perceptrons.

3.1 Document Representation without Domain Knowledge

3.1.1 TF-IDF. A method to represent the raw text of the ground lease documents is by using a vector space model. The representation of documents as vectors in a common vector space is considered the vector space model [20]. The vector space model contains documents d_i, where each d_i is identified by terms t_i. The terms t_i can either be weighted or unweighted. In this research, the terms t_i are weighted with TF-IDF scores. Each document d_i is represented by a t-dimensional vector d_i = (t_i1, t_i2, ..., t_it).

Term Frequency Inverse Document Frequency (TF-IDF) can be used as weighting for the terms used in the vector space model [13]. The method consists of two parts: Term Frequency (TF) and Inverse Document Frequency (IDF). The full formula of TF-IDF can be seen below:

TF-IDF_{t,d} = TF_{t,d} × IDF_t    (1)

1www.daviddlewis.com/resources/testcollections/reuters21578/

Figure 1: Methodology pipeline

where t represents the term and d represents the document. TF can be calculated by counting the frequency of term t occurring in document d. IDF can be calculated by the formula below:

IDF_t = log(N / df_t)    (2)

where N is the total number of documents in the collection and df_t is the document frequency, which is calculated by counting the number of documents in which the term is present. The purpose of TF-IDF is to determine how relevant given words are in a particular document. The TF-IDF score will be high when a term t is common in a single document or a small group of documents, while the TF-IDF score will be low when a term t is common across the majority of documents or is present in all documents. Each document d can be viewed as a vector with each component being a term t in the vocabulary V, with a weight for each t computed with equation (1). When a term t is not present in document d, the weight will be zero.
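As an illustration, the sketch below builds TF-IDF document vectors with scikit-learn's TfidfVectorizer; the example documents are hypothetical placeholders, not texts from the thesis corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed ground lease texts (placeholders).
documents = [
    "erfpacht woning berging amsterdam",
    "erfpacht parkeerplaats amsterdam",
    "splitsing appartement woning berging",
]

# Each document becomes a |V|-dimensional vector of TF-IDF weights;
# terms absent from a document get weight zero, as in equation (1).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print(X.shape)                                  # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # first terms of the vocabulary V
```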

3.1.2 Doc2Vec. Another method to represent texts is Doc2Vec, which builds upon word embeddings. This method was first introduced by Le and Mikolov, with 'Paragraph Vector' as the initial name for the method [11]. With Doc2Vec it is possible to learn the vector representation of texts with no fixed length, such as sentences and documents. The advantage of using Doc2Vec is that it is an unsupervised method, which lets Doc2Vec learn from unlabeled data. Doc2Vec is based on the method of creating word representations with Word2Vec, which will be described briefly. For each word i in the context of the next word to be predicted there is a vector w_i, which is a column in the matrix W. The summation or concatenation of the word vectors of the context can be seen as features for predicting the next word in the sentence. These features are input for a three-layer neural network, with as output the probabilities of words being the next word in the sentence. The neural network is trained using stochastic gradient descent. As a result, the trained word vectors can be used as features for the word that follows in the sentence.

There are two methods of applying Doc2Vec: distributed memory (DM) and distributed bag of words (DBOW). The method used in this research is DBOW. The advantage of using DBOW over DM is that it is a simpler method, as well as being less memory expensive. With DBOW, words in a specific text window are predicted. For each document d, there is a unique vector, which is a column in the matrix D. Given a document vector, word i in a text window must be predicted, with the text window being sampled from the same document. The document vectors can therefore be seen as features for the three-layer neural network, with as output the probabilities of the words being in the document. After training the neural network with stochastic gradient descent, the document vectors are also trained. These trained document vectors can then be used as document representations. When using DM, word vectors are also included to predict a word in the sentence, making it more memory expensive. For this method, the concatenation or summation of the document and word vectors forms the features for the neural network, which is also trained with stochastic gradient descent.
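A minimal sketch of learning DBOW document vectors with gensim's Doc2Vec (version 4.x API) is shown below; the toy corpus, vector size and number of epochs are illustrative assumptions, not the settings used in this research.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical tokenized ground lease texts (placeholders).
corpus = [
    ["erfpacht", "woning", "berging"],
    ["erfpacht", "parkeerplaats"],
    ["splitsing", "appartement", "woning"],
]
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# dm=0 selects the distributed bag of words (DBOW) variant.
model = Doc2Vec(tagged, dm=0, vector_size=100, min_count=1, epochs=40)

print(model.dv[0])                                 # trained vector of the first document
print(model.infer_vector(["erfpacht", "woning"]))  # vector inferred for an unseen document
```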

3.2 Document Representation with Feature Engineering

Using domain knowledge about the ground lease documents, it is possible to expand the document representation. Three features can expand the document representation: the type of ground lease document, the presence of destinations, and the document year. A sketch of how such features can be appended is given after Section 3.2.3.

3.2.1 Type of ground lease documents. One feature that may help in distinguishing ground lease documents on judicial destinations is the type of ground lease document. For example, the ground lease document type "Akte van Uitgifte" has a section called "Bijzondere bepalingen" from which the judicial destination can be retrieved. Documents of the type "Akte van Splitsing" are relatively longer than other documents, since these documents usually contain lists of destinations that a ground is divided into. An in-depth explanation of how the different ground lease documents differ from each other is given in Section 4.1.

3.2.2 Presence of destinations. Another feature that may help distinguish ground lease documents is the presence of all the destinations. In ground lease documents there is a difference between judicial destinations and non-judicial destinations. Judicial destinations are needed for the process of changing from continuous to perpetual ground lease, while non-judicial destinations are destinations that are written by notaries due to, for example, their relation to the judicial destinations. Knowing which destinations are present in the ground lease documents may tell which judicial destinations are definitely not present in a ground lease document.

3.2.3 Created time of ground lease documents. Lastly, a feature that may help distinguish ground lease documents is the year in which the documents were created. Documents that were created recently are more structured than relatively old documents. Moreover, older documents may use different wording than more recent documents.
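To make the feature expansion concrete, the sketch below appends three hypothetical domain knowledge features (a one-hot encoded document type, presence flags for destinations, and a scaled document year) to an existing document vector. The encodings, label set and scaling are assumptions for illustration only, not the encoding used in the thesis.

```python
import numpy as np

# Hypothetical document representation (e.g. a Doc2Vec vector).
doc_vector = np.random.rand(100)

# Assumed one-hot encoding of the ground lease document type.
doc_types = ["Akte van Uitgifte", "Akte van Splitsing", "Akte van Levering", "Akte van Wijziging"]
doc_type = "Akte van Splitsing"
type_one_hot = np.array([1.0 if t == doc_type else 0.0 for t in doc_types])

# Assumed presence flags for destinations found in the text (illustrative label set).
destinations = ["woning", "berging", "parkeerplaats"]
present = {"woning", "berging"}
presence_flags = np.array([1.0 if d in present else 0.0 for d in destinations])

# Document year, crudely scaled so it stays in a comparable range.
year = np.array([2005 / 2019])

# The expanded representation simply concatenates all feature groups.
expanded = np.concatenate([doc_vector, type_one_hot, presence_flags, year])
print(expanded.shape)  # (100 + 4 + 3 + 1,)
```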

3.3 Classification

3.3.1 Binary Relevance. Binary relevance (BR) is an approach for handling multi-label classification [15]. It is commonly used as a baseline method for multi-label classification tasks. Given a set with m labels, BR assigns a binary classification task per label. There will be m hypotheses h_1, h_2, ..., h_m, where every hypothesis is responsible for predicting the relevance of one label in the dataset. The advantage of using this strategy is its simplicity. A drawback of BR is that label dependency is not taken into account. As a result the classifier might fail to predict some label combinations where dependencies are present.
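Binary relevance can be implemented, for example, with scikit-learn's OneVsRestClassifier, which fits one independent binary classifier per label column. The data below is a toy illustration, not data from this research.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy document vectors and a binary indicator matrix (m = 3 labels, one column per label).
X = np.random.rand(8, 10)
Y = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 0, 0]])

# One independent SVM per label: label dependencies are ignored, as noted above.
br = OneVsRestClassifier(SVC(kernel="linear"))
br.fit(X, Y)
print(br.predict(X[:2]))  # one 0/1 prediction per label
```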

In this research three machine learning methods were compared to each other to identify which method yields the best performance.

3.3.2 Support Vector Machines. One machine learning technique that can be used is Support Vector Machines (SVM). For the problem of text classification, SVM is suitable considering that SVM prevents overfitting, even when the number of features is high, which occurs frequently in document representations [8]. When using SVM, vectors are mapped into a high-dimensional feature space and used as input to solve, for example, a binary classification problem [6]. The simplest form of SVM is the linear SVM. In linear SVM the algorithm tries to find the hyperplane for vectors x of dimension n that belong to either class A or class B [3]. The input is a set of examples x_i with labels y_i:

(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)    (3)

where y_i = 1 if x_i is in class A and y_i = -1 if x_i is in class B. A decision surface D(x), also called a hyperplane, is then constructed that aims to separate the classes optimally. The best hyperplane is the one with the maximum margin between the vectors from class A that are closest to D(x) and the vectors from class B that are closest to D(x). These vectors are called the support vectors. The classification of the test examples is predicted with:

x ∈ A if D(x) > 0, x ∈ B otherwise    (4)

where D(x) can be defined as:

D(x) = Σ_{i=1}^{N} w_i φ(x) + b    (5)


SVM takes several hyperparameters. One of the important hyperparameters is the kernel function, with the popular kernels being the radial basis function (RBF), polynomial and linear kernels, where both the RBF and polynomial kernels are non-linear kernels suitable for non-linear hyperplanes. The degree (used for polynomial kernels) and gamma (used for RBF kernels) are hyperparameters that influence the flexibility of the hyperplane [2]. Another important hyperparameter is denoted by C, which, according to Chapelle et al. [5], controls the trade-off between margin maximization of the support vectors and minimization of the error.
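For illustration, these hyperparameters correspond to arguments of scikit-learn's SVC constructor; the values shown are placeholders rather than tuned settings from this research.

```python
from sklearn.svm import SVC

# kernel selects a linear, RBF or polynomial decision surface;
# gamma applies to the RBF kernel, degree to the polynomial kernel,
# and C trades off margin maximization against training error.
linear_svm = SVC(kernel="linear", C=1.0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.1)
poly_svm = SVC(kernel="poly", C=1.0, degree=2)
```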

3.3.3 Logistic Regression. The main characteristic of logistic regression (LR) is the logit, which is the natural logarithm of an odds ratio [17]. Logistic regression is best applied to classification problems that are binary; in the case of this research, either belonging to a certain judicial destination or not. The classifier tries to predict the probability given an example [14, 25]:

Pr(y|x) = 1 / (1 + exp(−y(w^T x + b)))    (6)

where x = (x_1, x_2, ..., x_m) is the vector of a document, m is the number of features in the vector and y ∈ {+1, -1}. w = (w_1, w_2, ..., w_m) is the weight vector and b is the intercept of the decision hyperplane. The objective is to find the weight vector that maximizes the log likelihood:

l(w) = − Σ_{i=1}^{N} log(1 + exp(−y_i(w^T x_i + b)))    (7)

There are a few hyperparameters that can be set when using LR. Optimization methods can be specified to maximize (7). One of the optimization methods is called LIBLINEAR, which excels when a large number of features is used [7]. Another method is called L-BFGS, a quasi-Newton optimization method whose main characteristic is simplicity in solving the optimization problem [12]. Regularization can be applied to prevent overfitting, with either L1 or L2 regularization [16, 23]. Lastly, just like SVM, logistic regression also has a C parameter that can be specified.
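These options correspond to arguments of scikit-learn's LogisticRegression; the combinations and C value below are only examples, not the tuned settings of this research.

```python
from sklearn.linear_model import LogisticRegression

# liblinear supports both L1 and L2 penalties; lbfgs supports only L2.
lr_liblinear = LogisticRegression(solver="liblinear", penalty="l1", C=1.0)
lr_lbfgs = LogisticRegression(solver="lbfgs", penalty="l2", C=1.0)
```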

3.3.4 Multi-layer Perceptron. The multi-layer perceptron (MLP) is a type of feed forward neural network. It is a neural network that incorporates simple interconnected nodes [22]. The network consists of an input layer, an output layer and a hidden layer. For each node i there is a subset Γ_i^{-1} ⊆ V containing all the predecessor nodes of i, where V is the set of all nodes. Each node in a layer is connected with all the nodes in the next layer. A connection between nodes is characterized by a weight coefficient w_ij, which determines the importance of the connection, where i is a node in a layer and j is a node in the previous layer. The output value of the i-th node, x_i, can be computed with:

x_i = f(z_i)    (8)

where

z_i = b_i + Σ_{j ∈ Γ_i^{-1}} w_ij x_j    (9)

f(z) is the activation function, which must be chosen beforehand. The activation function introduces nonlinearity into the algorithm.

There are several hyperparameters that can be tuned for MLP. For this research, only a few hyperparameters were considered. Firstly, there are several options for the activation function. One of them is the logistic sigmoid function, which is also applied in logistic regression. Another activation function is the hyperbolic tangent function, which shares the characteristics of the logistic sigmoid function [10]. The difference between these two activation functions is the output range, which is between 0 and 1 for the logistic sigmoid function and between -1 and 1 for the hyperbolic tangent function. ReLU is another activation function, which prunes the negative part of the output of a node to zero and retains its positive part [24]. Another hyperparameter is the strength of the L2 regularization. As in logistic regression, MLP can also use the optimization method L-BFGS. Another optimization method is stochastic gradient descent, which is a gradient descent method that incorporates randomness for speed [4].
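scikit-learn's MLPClassifier exposes these hyperparameters directly; the hidden layer size, alpha and iteration limit below are illustrative assumptions.

```python
from sklearn.neural_network import MLPClassifier

# activation: 'logistic', 'tanh' or 'relu'; alpha is the L2 regularization strength;
# solver: 'lbfgs' or 'sgd' (stochastic gradient descent).
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    alpha=0.01, solver="lbfgs", max_iter=500)
```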

4 DATA

The dataset used in this research contains two sub datasets, with each sub dataset being stored separately. One of the sub datasets contains Dutch ground lease documents as PDF files. These files are the text documents that will be used for predicting the judicial destinations. The second sub dataset is an Excel file containing the registered judicial destinations of the ground lease documents. These registered judicial destinations are the labels that must be predicted from the ground lease documents. How the two sub datasets were processed is described in Section 5.

4.1 Ground Lease Documents

In this section a description is given of how ground lease documents are organized. Ground lease documents are documents that contain the decision of the municipality to lease out a piece of ground to a person or organization. The municipality must also define what the piece of ground will be used for (the judicial destination). For example, if a person wants to use a specific ground of the municipality to have a house built on it, the municipality has to give permission to have the ground used only for building a house.

There are different types of ground lease documents (used as domain knowledge features, see Section 3.2) for granting lease by the municipality:

• Akte van Uitgifte: The first type of ground lease document is called "Akte van Uitgifte". This type of ground lease document is the original granting of a leasehold for a specific ground.
• Akte van Splitsing: This ground lease document is applicable for ground lease holders who split the specific ground into parts. To give an example, a leaseholder can use a leased ground to build an apartment building. To enable Amsterdam citizens to buy an apartment, the ground lease holder must first split the apartment building into separate apartments, which is registered in the "Akte van Splitsing".
• Akte van Levering: This type of ground lease document is usually for the buyer of an apartment from the ground lease holder. The document contains the rights of the apartment buyer to use the apartment for its specific purpose only.
• Akte van Wijziging: Another type of ground lease document is the "Akte van Wijziging". With this document the municipality agrees to change the judicial destination of a ground.

All these ground lease documents are linked to each other, with the "Akte van Uitgifte" as its main ground lease document since in this document the original judicial destinations are registered by the municipality.

For every granting of a judicial destination, a corresponding ID, called "E-dossier" will be attached to the judicial destination. For example, the municipality can give the leaseholder the right to use ground for storage and a parking spot. For this right an E-dossier will be given. When dealing with an "Akte van Splitsing", a ground can be split up into numerous storages and parking spots. For every split, there will be an E-dossier attached to the split.

4.2 Exploratory Data Analysis

In this section the analysis of the data is discussed. This analysis gives more insight into the data that is used for this research. The exploratory data analysis is done on a subset of the datasets given by the municipality, since this subset contains data that is applicable for this research. How the subset was chosen is discussed in Section 5.1.

4.2.1 Labels. When looking at Figure 2, the majority class is "Koopwoning" (owner-occupied house), which is not unusual since most of the grounds that are leased out by the municipality are used for owner-occupied houses. Ground lease documents containing "Sociale Huurwoning" (social housing) occur the least. A reason for the low number of documents for social housing is that the change to perpetual ground lease mostly applies to owner-occupied houses. In the case of social housing, the change to perpetual ground lease only occurs as an exception.

When looking at Figure 3, it can be seen that a document can have up to three distinct labels. Documents containing only one label occur most frequently, while documents containing two or three labels are notably less common.

4.2.2 Document pages. When looking at Figure 4, it can be seen that most of the documents contain fewer than 50 pages. The reason for documents having more than 50 pages may be that multiple E-dossiers are present in a ground lease document, which can lead to many pages listing all the judicial destinations of the E-dossiers. Another reason may be the inclusion of images and floor plans.

4.2.3 Word frequencies. Most of the ground lease documents contain fewer than 10,000 words (see Figure 5). When looking at Figure 6, it can be seen that the most frequent word appears approximately 140,000 times in the whole corpus, that word being "Amsterdam".

5 EXPERIMENTAL SETUP

In this section the three main steps in which the experiment was done are explained. Firstly, the data was processed in order for the data to be clean. Secondly, the hyperparameters of the classification methods were set. Lastly, evaluation metrics were used to assess the performances of the classifiers.

Figure 2: Distribution of labels

Figure 3: Distribution of amount of labels per document

5.1 Data Processing

5.1.1 Processing PDF files. In the text processing phase a PDF parser called Tika (https://github.com/chrismattmann/tika-python) was used. This PDF parser is a Python library whose function is to extract text and metadata from PDF files. The PDF files that were used for the research are ground lease documents from the year 2000 onwards. Older ground lease documents were handwritten, meaning that the PDF parser could not extract text from them. Another complication that led to a reduction of the number of usable PDF files is the occurrence of PDF files with ground lease documents that were scanned and stored as images instead of text.

(7)

Figure 4: Bar chart: Amount of pages

Figure 5: Amount of words per document

Figure 6: Top 100 Frequencies of words (without stopwords)

Figure 7: Data processing steps

After obtaining the extracted texts, preprocessing techniques were applied with the Natural Language Toolkit (NLTK, https://www.nltk.org/): tokenization, lowercasing, punctuation removal and the removal of numbers. Lemmatization and stopword removal were not applied in order to preserve the original semantics of the text. From the PDF files, the E-dossiers were also extracted from the PDF filenames. The PDF filenames also contained the date and the type of ground lease document. The resulting file is a CSV file with a column for the E-dossier, the PDF filename, and the extracted text.
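A sketch of this extraction and preprocessing step is shown below, assuming the tika-python and NLTK packages are installed; the file path is a placeholder and the helper function is hypothetical.

```python
import string
from tika import parser
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' tokenizer data


def extract_and_preprocess(pdf_path: str) -> list[str]:
    """Extract text from a ground lease PDF and apply the preprocessing steps."""
    raw = parser.from_file(pdf_path).get("content") or ""
    tokens = word_tokenize(raw.lower(), language="dutch")   # tokenization + lowercasing
    # Drop punctuation tokens and tokens containing digits.
    return [t for t in tokens
            if t not in string.punctuation and not any(c.isdigit() for c in t)]


tokens = extract_and_preprocess("example_akte.pdf")  # placeholder filename
```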

5.1.2 Processing Dataset Labels. Since the labels were present in another dataset, the dataset in which the labels were present also needed to be preprocessed. This dataset is an Excel file containing a column in which the judicial destinations are registered, corresponding to their specific E-dossier. Further columns were present that were deemed unnecessary for the research, such as the ID of the employees responsible for registering the judicial destination. The labels were mainly worded as sentences; as a result, tokenization, punctuation removal and the removal of numbers were also applied here.


After discussions with the employees of the municipality about how employees extract judicial destinations from ground lease documents, the conclusion was that there are three categories of main judicial destinations and five categories of sub judicial destinations. A document was provided which indicates what type of judicial destination belongs to which category. The labels were processed according to this information, so that the judicial destinations worded as sentences were transformed into lists of judicial destinations with a length of one word per judicial destination. The resulting file is a CSV file with a column for E-dossiers and a column for judicial destinations.

5.1.3 Combining the two datasets. Combining the two datasets was accomplished by transforming the two CSV files into Pandas Dataframes and merging the Dataframes together based on the E-dossier. The resulting Dataframe still needed to be processed, since it contained duplicate rows in terms of text. The judicial destinations of all the duplicates were combined, which can result in various judicial destinations for one ground lease document. This is the reason why this thesis is about multi-label text classification instead of multi-class text classification.
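The combination step can be sketched with pandas as follows; the file names and column names are assumptions based on the description above, not the actual layout of the municipality's files.

```python
import pandas as pd

# Assumed CSV layouts: texts.csv with columns [e_dossier, filename, text],
# labels.csv with columns [e_dossier, destination] (one destination per row).
texts = pd.read_csv("texts.csv")
labels = pd.read_csv("labels.csv")

merged = texts.merge(labels, on="e_dossier")

# Duplicate texts are collapsed by taking the union of their judicial
# destinations, which yields the multi-label targets.
multilabel = (merged.groupby(["e_dossier", "text"])["destination"]
                    .apply(lambda d: sorted(set(d)))
                    .reset_index(name="destinations"))
```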

5.2 Setup for the classifiers

In this research, three popular classification methods were applied, namely Support Vector Machines, Logistic Regression and MLP. The best hyperparameters for each classifier were determined by using GridSearchCV (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). With this function from Scikit-Learn (https://scikit-learn.org/stable/), it is possible to search for the best performing parameters, given a range of parameters specified beforehand. In this research cross-validation is used in order to avoid bias when the data is split into the training and test set. The method that is used is k-fold cross-validation, with which the dataset is split into k mutually exclusive subsets. While using GridSearchCV, 5-fold cross-validation is used.

5.2.1 Support Vector Machines. The hyperparameters used for support vector machines were C, the kernel, gamma and the degree. C indicates the penalty parameter of the error term, with the range being 0.1-10. For the kernel, there are three options: linear, RBF and polynomial. When non-linear hyperplanes are used, gamma can be set as hyperparameter, with the range being 0.1-10. Lastly, the hyperparameter degree, which is only applicable when using the polynomial kernel, indicates the nonlinearity of the polynomial kernel. The range for this hyperparameter is set to 1-3.

5.2.2 Logistic Regression. The hyperparameters used for lo-gistic regression for the GridSearchCV were C, the penalty and the optimization solver. The range for C was in this situation 0.001-1000. The options for the optimization solver were LIBLINEAR and L-BFGS. The penalty that would be used for logistic regression was either L1 or L2 regularization; however, L-BFGS could only be applied with L2 regularization.

5.2.3 Multi-layer Perceptron. The hyperparameters used for MLP were the activation function, the strength of the L2 regularization, and the optimization method for back propagation. For the activation function, the options were a sigmoid activation function, a hyperbolic tangent function, or a ReLU activation function. The range for the strength of the L2 regularization was 0.01-10. Lastly, the options for the optimization method were L-BFGS or stochastic gradient descent.
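A sketch of this tuning procedure for one of the classifiers is given below. The parameter grid mirrors the ranges mentioned in Section 5.2.1 but is not necessarily the exact grid used in the research, and X and Y stand for the (assumed) document representation matrix and binary label matrix.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Binary relevance wrapper around an SVM.
br_svm = OneVsRestClassifier(SVC())

# Illustrative grid following the ranges described above; parameters of the
# wrapped estimator are addressed with the "estimator__" prefix.
param_grid = {
    "estimator__C": [0.1, 1, 10],
    "estimator__kernel": ["linear", "rbf", "poly"],
    "estimator__gamma": [0.1, 1, 10],
    "estimator__degree": [1, 2, 3],
}

search = GridSearchCV(br_svm, param_grid, cv=5)  # 5-fold cross-validation
# search.fit(X, Y)            # X: document representations, Y: binary label matrix (assumed)
# print(search.best_params_)
```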

5.3 Evaluation

For multi-label classification, evaluation can be grouped into three categories, depending on the type of research problem: evaluating partitions, evaluating rankings, and using label hierarchy [21]. Since the category evaluating partitions assesses the quality of the classifications that are made, only the first category is used for this research. The second category deals with ranking problems, while the third category deals with research problems that contain hierarchical structures of the labels.

Within this first category, metrics can be used that also credit partially correct results, which is less strict than metrics that only count fully correct results. An example of a metric that counts results as either fully correct or fully incorrect is Exact Match:

Exact Match = (1/n) Σ_{i=1}^{n} I(Y_i = Z_i)    (10)

where n indicates the number of instances. Let Y be the predicted labels and Z the ground truth. I is the indicator function of whether Y_i and Z_i are equal. This metric is the strictest of all the metrics used for multi-label classification.

When only Exact Match is used, it is not clear whether incorrect predictions are fully incorrect or only partially incorrect. To indicate whether predictions are partially correct, the evaluation metrics Precision, Recall, F1-measure, and Hamming loss can be included. With Precision as evaluation metric, the proportion of correctly predicted labels relative to the total number of labels is calculated, which is then averaged over the total number of instances that are predicted. Precision is calculated with the following equation:

Precision = (1/n) Σ_{i=1}^{n} |Y_i ∩ Z_i| / |Z_i|    (11)

Recall is calculated similarly, as the proportion of correctly predicted labels relative to the total number of predicted labels, again averaged over the total number of instances. Recall is calculated with the following equation:

Recall = (1/n) Σ_{i=1}^{n} |Y_i ∩ Z_i| / |Y_i|    (12)

The F1-measure is calculated with both Precision and Recall. With the F1-measure it is possible to evaluate the effectiveness for large classes in the collection (Powers, 2011):

F1-measure = (1/n) Σ_{i=1}^{n} 2|Y_i ∩ Z_i| / (|Y_i| + |Z_i|)    (13)

The Hamming loss is also an evaluation metric that includes partially incorrect labels, suitable for multi-label classification approaches. It takes into account labels that are incorrectly predicted and labels that are missing, with the most optimal score being 0:


Hamming Loss = (1/(kn)) Σ_{i=1}^{n} Σ_{l=1}^{k} [I(l ∈ Z_i ∧ l ∉ Y_i) + I(l ∉ Z_i ∧ l ∈ Y_i)]    (14)
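These metrics can be computed, for example, with scikit-learn on binary indicator matrices: accuracy_score then corresponds to Exact Match, average="samples" gives the instance-averaged precision, recall and F1, and hamming_loss matches equation (14). The matrices below are toy values, not results from this research.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, hamming_loss)

# Toy ground truth (Z) and predictions (Y) as binary indicator matrices.
Z = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0]])

print("Exact Match: ", accuracy_score(Z, Y))                      # subset accuracy
print("Precision:   ", precision_score(Z, Y, average="samples"))  # instance-averaged
print("Recall:      ", recall_score(Z, Y, average="samples"))     # instance-averaged
print("F1-measure:  ", f1_score(Z, Y, average="samples"))         # instance-averaged
print("Hamming loss:", hamming_loss(Z, Y))                        # fraction of wrong labels
```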

6 RESULTS

In this section the results are shown. First, the prediction performance when the document representations do not include the features extracted with domain knowledge is shown. Afterwards, the domain knowledge features are taken into account and the corresponding classification performance is given.

6.1 Prediction without domain knowledge

Table 1 and Table 2 show the results with TF-IDF representations and Doc2Vec representation respectively.

When looking at Table 1, the performance with MLP was the highest of the three classifiers, with a difference to the second best performing classifier of 1,65% for Exact Match, 3,6% for precision, 11,15% for recall, 11,6% for F1, and 0,0073 for Hamming loss. When looking at Table 2, the performances with MLP and SVM were the highest of the three classifiers. While classification with SVM performed better in terms of Exact Match, precision and Hamming loss, classification with MLP performed better in terms of recall and F1.

Overall, the performance of classification with Doc2Vec was better than with TF-IDF. The differences were not big, except for precision and F1-measure; for these two evaluation measures, Doc2Vec performed clearly better. Further analysis of the performances of Doc2Vec and TF-IDF is given in Section 7.1.

Table 1: TF-IDF without domain knowledge

      Exact Match   Precision   Recall    F1-measure   Hamming loss
SVM   68,49%        44,83%      26,47%    28,42%       0.0721
LR    65,21%        14,70%      16,67%    15,62%       0.0817
MLP   70,14%        48,43%      37,62%    40,02%       0.0648

Table 2: Doc2Vec without domain knowledge

      Exact Match   Precision   Recall    F1-measure   Hamming loss
SVM   70,96%        76,56%      32,16%    39,79%       0.0689
LR    66,03%        37,20%      18,64%    19,23%       0.0804
MLP   70,14%        59,12%      44,48%    50,14%       0.0731

6.2 Prediction including domain knowledge

In this section the performance of the classifiers with domain knowledge is presented (see Table 3 and Table 4). When looking at Table 3, the performance with MLP was the highest of the three classifiers. In terms of Exact Match, using SVM had the same performance as using MLP.

When looking at Table 4, the performance with MLP was also the highest of the three classification methods. While classification with MLP performed better in terms of Exact Match, recall, F1 and Hamming loss, classification with SVM performed better in terms of precision.

Overall, the performance of classification with Doc2Vec was better than with TF-IDF. The exception was the Hamming loss, for which classification with TF-IDF performed better.

Table 3: TF-IDF with features

      Exact Match   Precision   Recall    F1-measure   Hamming loss
SVM   70,68%        55,97%      35,54%    39,27%       0.0630
LR    65,21%        31,37%      17,07%    17,42%       0.0813
MLP   70,68%        63,49%      43,62%    47,22%       0.0598

Table 4: Doc2Vec with features

      Exact Match   Precision   Recall    F1-measure   Hamming loss
SVM   71,23%        76,31%      30,87%    37,82%       0.0694
LR    66,58%        37,56%      21,49%    23,23%       0.0776
MLP   71,51%        56,52%      47,09%    50,14%       0.0653

7 DISCUSSION

This section interprets the results of the application of the document representation techniques, the document representation including domain knowledge, and the classification methods. Insights about the evaluation metrics are discussed as well. Lastly, the limitations of this research are described.

7.1 Document Representation

To apply a predictive model to text documents, document representations have to be made for each text document. Two methods for document representation were applied. For the comparison of the performances of the two methods to represent documents, only the classifiers SVM, LR and MLP were used. When looking at the performances of the two document representations, it can be seen that the performances when using Doc2Vec were in most cases better than when TF-IDF was used. One explanation is that, according to Le & Mikolov [11], Doc2Vec learns the semantics of the documents that are used as input through training a neural network. This is in contrast to TF-IDF: it is possible for documents with the same distinct words and word counts to receive the same vector space representation, even though the sequence in which the words occur may be entirely different. The purpose of TF-IDF is mainly to represent documents based only on the importance of words in the document and the corpus. The disadvantage of using Doc2Vec, however, was that the computation time is longer than with TF-IDF.


7.2 Domain Knowledge

In general, adding domain knowledge features to the classifiers yielded better performance than excluding them. Even though the results with the inclusion of domain knowledge features were higher than without, the differences in performance were not great. It can therefore be assumed that the type of ground lease document, the year of creation of the ground lease documents, and the indication of destinations occurring in the documents did not help greatly in distinguishing ground lease documents on judicial destination.

7.3 Classifiers

When looking at the results of the different classifiers, it can be said that MLP performed the best in comparison to SVM and LR. As mentioned in Section 3.3.4, this may be due to the ability of neural networks to train on input data that is either incomplete or imprecise. Nevertheless, in terms of precision, F1, and especially recall, low performances were still achieved. This may be due to the low quality and quantity of the data, which is described in more detail in Section 7.5.

7.4 Evaluation metrics

There were different evaluation metrics used in this research. The most popular evaluation metric is accuracy, which is Exact Match in a multi-label classification scenario. The disadvantage of only looking at this metric is that it does not represent the performances on the minority classes.

The other evaluation metrics try to measure the performance for partially correct labels, which does take the minority classes into account. The only situation where partially correct labels are acceptable for the municipality is when the predictions include at least one correctly predicted judicial destination while lacking additional judicial destinations. For example, a ground lease can have owner-occupied house, storage and parking spot as judicial destinations. If a classifier predicts only owner-occupied house and storage, this can be seen as an acceptable partially correct instance, because both owner-occupied house and storage are correctly predicted; the only judicial destination that is missing is parking spot, which is lacking but not falsely predicted. In the case where the classifier predicts other judicial destinations than the correct ones, this may have a negative impact for the municipality. An incorrectly given judicial destination could cause confusion for future employees working with a prediction model for ground lease documents. As a result, the prediction model could be considered unreliable. With the Hamming loss, these problems could be detected, since the Hamming loss takes into account labels that are missing or are falsely predicted; however, looking at the results, the Hamming loss measures were close to 0, even though precision, recall and F1 indicate bad performances. This may be due to the classifiers mostly returning the majority class in all cases.

7.5 Limitations

One of the reasons the performances were not high may be the quality of the data. Ground lease documents consist of sections that are often unnecessary for determining what the judicial destinations of a ground lease are. However, extracting only the sections relevant for determining the judicial destinations is a difficult task. Ground lease documents are constructed by notaries. Since each notary and each organization in which a notary is active differs, this may lead to different constructions of the ground lease documents. Combined with the fact that each ground lease document has a specific type, leading to different types of required information, this makes the task of determining the relevant section more difficult. Subsequently, when whole documents are used, because only relevant sections cannot be isolated, the non-relevant sections act as noise when classifying ground lease documents. Another type of noise that may have played a role in the performance of the classifiers is noise generated by the PDF parser. With the PDF parser Tika, all the text is extracted from the PDF files. Examples of noise in this context are titles, administrative text, text from images, etc. These types of text contribute to the amount of noise in the texts.

Another potential reason for the performance of the classifiers is the number of instances used to train and test the classifiers. Only ground lease documents from the year 2000 onwards were used, considering that ground lease documents created before 2000 are mostly scanned as images, which include handwritten ground lease documents as well. Furthermore, many E-dossiers link to just one ground lease document, which further reduced the number of instances that could be used. Lastly, there were many instances where the judicial destinations were not "standard", but specific to just one case of ground lease, making these instances not useful for classification.

8 CONCLUSION

In this research the aim was to answer the following research question: "How can the judicial destinations of Dutch ground lease documents be predicted using text classification methods?". First, machine learning techniques for document representation were applied to make the ground lease documents applicable for machine learning tasks. Two methods were used for document representation: TF-IDF and Doc2Vec. To extend the document representation, domain knowledge features were added. After obtaining the document representations, three classifiers were applied: Support Vector Machines, Logistic Regression and Multi-layer Perceptron. When looking at the results of the classifiers, it can be concluded that using document representations with Doc2Vec slightly outperformed using document representations with TF-IDF. Furthermore, it can also be concluded that classification with MLP performed the best; however, the difference in performance was not big. When looking at the evaluation measures, the exact match measure indicates decently performing classification models; however, the precision and recall measures indicate badly performing classification models, especially for recall.

A future step that could be made to deal with the limitations caused by non-relevant sections is the identification of the sections relevant for determining the judicial destinations of ground lease documents. Successful results in this step may lead to a better distinction of ground lease documents in terms of judicial destinations. Another


step that could be made is the use of the newly registered judicial destinations for the E-dossiers that have been added by the municipality after the registered judicial destinations were obtained for this research. This would provide more instances, which may help the classifiers determine the judicial destinations better.

REFERENCES

[1] Al-Salemi, B., Ayob, M., Kendall, G., and Noah, S. A. M. Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. Vol. 56.
[2] Ben-Hur, A., and Weston, J. A user's guide to support vector machines. Data Mining Techniques for the Life Sciences 338 (2010), 223–239.
[3] Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. pp. 144–152.
[4] Bottou, L. Large-scale machine learning with stochastic gradient descent. COMPSTAT'2010.
[5] Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for support vector machines. Machine Learning 46, 1-3.
[6] Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning.
[7] Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research (2008), 1871–1874.
[8] Joachims, T. Text categorization with support vector machines: Learning with many relevant features. pp. 137–142.
[9] Kanapala, A., Pal, S., and Pamula, R. Text summarization from legal documents: a survey. Artificial Intelligence Review.
[10] Karlik, B., and Olgac, A. V. Performance analysis of various activation functions in generalized MLP architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems 1.
[11] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), 1188–1196.
[12] Liu, D. C., and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1-3.
[13] Manning, C., Raghavan, P., and Schütze, H. Introduction to Information Retrieval, vol. 11.
[14] Minka, T. Algorithms for maximum-likelihood logistic regression.
[15] Montañés, E., Senge, R., Barranquero, J., Quevedo, J. R., del Coz, J. J., and Hüllermeier, E. Dependent binary relevance models for multi-label classification. Pattern Recognition.
[16] Park, M. Y., and Hastie, T. L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B 69.
[17] Peng, C. Y. J., Lee, K. L., and Ingersoll, G. M. An introduction to logistic regression analysis and reporting. The Journal of Educational Research 96, 1 (2002), 3–14.
[18] Ploeger, H., and Bounjouh, H. The Dutch urban ground lease: A valuable tool for land policy? Land Use Policy 63, 12 (2017), 78–85.
[19] Purpura, S., and Hillard, D. Automated classification of congressional legislation. Digital Government Society of North America.
[20] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Communications of the ACM 18, 11, 613–620.
[21] Sorower, M. S. A literature survey on algorithms for multi-label learning. 1–25.
[22] Svozil, D., Kvasnicka, V., and Pospichal, J. Introduction to multi-layer feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems 39.
[23] Wager, S., Wang, S., and Liang, P. S. Dropout training as adaptive regularization. Advances in Neural Information Processing Systems (2007), 351–359.
[24] Xu, B., Wang, N., Chen, T., and Li, M. Empirical evaluation of rectified activations in convolutional network.
[25] Zhang, J., Jin, R., Yang, Y., and Hauptmann, A. Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. The Twentieth International Conference on Machine Learning (ICML 2003) (2003).
[26] Zhang, M. L., and Zhou, Z. H. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 10, 1338–1351.

Appendices

A PRACTICAL INFORMATION

The GitHub repository that was used for this research can be found at: https://github.com/rouelderomas/Master_Thesis. This repository contains all the code, written in Python.
