Text Classification in Citizen Incident Reports

submitted in partial fulfillment for the degree of master of science

Mink Rohmer

10578633

master information studies

data science

faculty of science

university of amsterdam

Date of defence: 2nd of July, 2018

Internal Supervisor: Maartje ter Hoeve (UvA, ILPS)
External Supervisor: Maarten Sukel (Municipality of Amsterdam)
Second Reader: Stevan Rudinac (UvA, ABS)


Submitted in partial fulfillment for the degree of master of science

MINK ROHMER,

University of Amsterdam

In this paper we compare multiple methods to classify the priority of citizen incident reports. The techniques discussed are tf-idf, BM25, word2vec and par2vec for feature extraction, and support vector machines (SVM) and character level convolutional neural networks (CLCNN) for classification. Results show that the novel character level convolutional neural network performs on par with tf-idf in combination with SVM on several metrics. Using categorical features alongside textual features increased the performance of the tf-idf classifier.

ACM Reference Format:

Mink Rohmer. 2018. Text Classification in Citizen Incident Reports: Submitted in partial fulfillment for the degree of master of science. 1, 1 (June 2018), 18 pages. https://doi.org/10.475/123_4

1 INTRODUCTION

It has become increasingly easy for customers of online services to get in contact with service providers. Many organizations offer ways to directly approach them with complaints, questions and suggestions. This leads to increased customer satisfaction and shorter response times from companies [1].

With the growth of the number of customer reports, the amount of data that companies have to process is also expanding rapidly. It is becoming increasingly difficult for organizations to handle the influx of information [2]. Several solutions have been proposed to solve this problem.

An example of such a solution is the frequently asked questions (FAQ) page on a company website. Here companies provide a static webpage with answers to common questions that customers have. This is a relatively simple way to solve simple customer problems without the need for interaction with the customer. However, previous research has shown that FAQs are not very effective for increasing customer satisfaction [3]; for this reason, organizations have been looking for better ways to deal with customer problems.

A more sophisticated solution that has been gaining a lot of traction recently is the chatbot [4]. A chatbot is a software system that can interact or chat with a human user in natural language. Chatbots can in some instances replace human effort when interacting with customers. A limitation of chatbots is that they are merely meant for providing information to customers [4]. They can take little or no significant action to help a user solve a problem.

While chatbots and FAQ pages can help with solving some issues for customers, other problems still require human (inter)action. For these types of customer issues, the problem of the volume of reports cannot be solved by means of chatbots.

A big concern for customers is whether their issue is being tended to in a timely manner. When companies take too long to respond to customers, customer satisfaction could deteriorate [1].

A solution that we propose to avoid lowering customer satisfaction is to automatically prioritize customer complaints. Not all complaints require the same degree of immediate attention. If problems can be automatically prioritized, more urgent complaints can be handled immediately and less urgent complaints can be handled at a more convenient time. This way, customer satisfaction can remain high even with an increase in the volume of customer issues. This is a novel way of approaching the improvement of customer satisfaction with the use of text classification techniques.

1.1 Research Questions

In this research we investigate the effectiveness of several machine learning techniques for automatically detecting the priority of customer complaints. The following research question is answered:

How effectively can the priority of customer complaints be predicted?

To answer this question, the following sub-questions are addressed:

• How effective are the vector-based methods tf-idf, BM25, word2vec and par2vec for predicting priority?

• How effective are character level convolutional neural networks for predicting priority?

• Can the classifier created with the vector-based methods be improved by adding non-textual features? Can the non-textual features be improved by imputing missing values using Multiple Imputation by Chained Equations (MICE)?

1.2 Contributions

We contribute to the field in several ways. Firstly, we introduce a novel way to approach the automation of a process for improving customer satisfaction, to be used in a real-time production environment.

Secondly, we use the BM25 scoring function in a non-conventional way to use it as a classification algorithm instead of an information retrieval technique.

Third, we apply the technique of CLCNN to a real-world dataset, whereas previous research on CLCNNs has focused only on academic datasets.

A fourth, more practical contribution is that we provide a system that can be used in real-world applications and that can function in near real-time.

2 PROBLEM SETTING

This research has been conducted in collaboration with the municipality of Amsterdam. The city has adopted a citizen feedback system where citizens can submit complaints, suggestions and incident reports. Previously these reports were made in person at one of the municipality locations, or by telephone. Recently the municipality has adopted an online reporting system, which has led to a great influx of new users of the feedback system.

With the new feedback system, up to 1200 reports are submitted daily. This volume has proven too large for the municipality to maintain a clear overview of which complaints are urgent.

The current state of affairs is that all reports are put into a queue. This means that every complaint is assigned the same level of urgency: all complaints are required to have some sort of follow-up from the municipality within three working days. This leads to dissatisfaction among citizen reporters, since some complaints require more immediate action than the allotted three working days. For example, if a citizen reports a publicly intoxicated person causing nuisance, it does not make sense for the municipality to act three days later: the nuisance will be long gone by then. On the other hand, a complaint about a wall sprayed with graffiti could easily be handled after three days.

Because of the nature of the system, the main objective of the classification system is to recognize cases where the priority is high. This means that recall of the high priority notifications is more important than other metrics, even when that means having a slightly lower precision on the high priority class.

To increase citizen satisfaction, a system that can automatically determine the priority of a report is expected to be of great value. This research focuses on creating such a system that can work in a real-time production environment.

3 RELATED WORK

Detecting the priority of a citizen incident report is a text classification problem. A lot of research has been dedicated to text classification; for an overview see [5]. Text classification can be divided into two steps: (1) finding a representation for the text and (2) training a classifier to classify the generated representation into labelled categories. In the following section we describe several methods used for text classification.

3.1 Text Representation and Classification

When dealing with text data in machine learning settings, it is crucial to transform the textual representation into something that can be used with a classification algorithm. Machine-interpretable representations of text have been a subject of research since the 1960s [6], and many different types of representations have been formulated over the years. We describe some of these techniques in the following section.


3.1.1 tf-idf. One of the most basic concepts in the field is the bag-of-words representation and its extension, tf-idf [7]. The tf-idf method extends the bag-of-words concept with term weighting based on the relative frequency of a word in a collection of documents. While simple in concept and application, it has proven very effective for various tasks in both information retrieval and text classification [8] [9] [10]. For example, Zhang et al. [8] have shown that tf-idf representations in combination with a Support Vector Machine (SVM) as a classification algorithm are effective for the task of text categorization. Other research has proven tf-idf in combination with SVM to be effective for assigning categories to news articles [11]. The proven effectiveness of tf-idf suggests that it might also be suitable for detecting the priority of citizen incident reports.

As proposed in [12], bag-of-words representations with tf-idf weighting can be generated with:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \qquad (1)$$

whereby

$$\mathrm{tf}(t, d) = f_{t,d} \qquad (2)$$

and

$$\mathrm{idf}(t, D) = \log\left(\frac{|D| + 1}{|\{d \in D : t \in d\}| + 1}\right) \qquad (3)$$

where $f_{t,d}$ denotes the number of occurrences of term $t$ in document $d$ and $D$ denotes the complete collection of documents.
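To make the weighting scheme concrete, the following minimal sketch implements Eqs. (1)-(3) directly. The tokenized example documents are hypothetical and not part of the thesis code.

```python
import math
from collections import Counter

def tfidf(term, doc, docs):
    """tf-idf as in Eqs. (1)-(3): raw term frequency times smoothed idf."""
    tf = Counter(doc)[term]                     # f_{t,d}, Eq. (2)
    df = sum(1 for d in docs if term in d)      # |{d in D : t in d}|
    idf = math.log((len(docs) + 1) / (df + 1))  # Eq. (3)
    return tf * idf                             # Eq. (1)

# Hypothetical tokenized notifications.
docs = [["volle", "vuilnisbak", "hoek"],
        ["vuilnisbak", "kapot"],
        ["overlast", "dronken", "persoon"]]
print(tfidf("vuilnisbak", docs[0], docs))  # 1 * log(4/3)
```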

3.1.2 BM25. An extension of the tf-idf weighting scheme is the Okapi BM25 algorithm [13]. This algorithm extends the tf-idf method with additional weighting for document length. Previous research in information retrieval has shown that BM25 can be even more effective than plain tf-idf [13]. To score query-document pairs, the BM25 scoring algorithm uses:

$$\mathrm{score}(q, d) = \sum_{i=1}^{|q|} \mathrm{idf}(q_i) \cdot \frac{\mathrm{tf}(q_i, d) \cdot (k_1 + 1)}{\mathrm{tf}(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)} \qquad (4)$$

where $\mathrm{idf}(q_i)$ is an inverse document frequency weight for query term $q_i$, $|d|$ is the length of document $d$, $avgdl$ is the average document length in the collection, and $k_1$ and $b$ are free parameters.

3.1.3 word2vec. Another approach to text representation, called word2vec, has been proposed by Mikolov et al. [14]. The word2vec model creates a vector space in which the semantic value of words is captured to some degree. This method has proven valuable for different types of tasks, such as text classification and sentiment analysis [15] [16]. Rexha et al. [16] have shown that word2vec in combination with SVM can be used to classify the sentiment of tweets. Citizen incident reports are similar in nature to tweets: both are short, semi-formal pieces of text. This suggests that these techniques might also be effectively applied to detect priority in citizen incident reports.

3.1.4 par2vec. A method closely related to word2vec is a technique known as par2vec or doc2vec. Also developed by Mikolov and colleagues [17], it uses a similar approach to create a vector space for words. The extra value of par2vec is that it is able to map not only words to the vector space, but also entire documents. This creates a vector space in which each document can be represented with a single fixed-length vector. This is especially useful because it takes away the task of finding a suitable way to combine the word vectors of a document into a single representation, which is crucial but difficult when using word2vec for text classification. Previous research has shown that par2vec can be used for sentiment classification of text [18] [17]. This suggests that it might also be useful for other types of text classification problems.

3.1.5 Character Level Convolutional Neural Networks. A more novel approach in the domain of text classification is that of the character level convolutional neural network (CLCNN) [19]. The main idea behind this model is to leave the text input as low-level as possible and to let the neural network learn the relationship between the text input and the label. A significant difference between this approach and the above-mentioned approaches is that the CLCNN works on the character level instead of on the word level. This increased level of granularity could in theory be an advantage over the word-based methods [20]. For example, where a word-level model such as tf-idf would not be able to detect that the words trash and trashcan are related, a character-level model can detect these similarities. Previous research by Zhang et al. [19] has shown that the CLCNN outperforms tf-idf on various text classification tasks. Other research has shown that neural language models on the character level can outperform other language models in a word-prediction type task [20]. Whereas previous research with CLCNNs has shown that they are effective on tried and tested academic datasets, little is known about the use of this technique on real-world datasets. Our research will shed light on the effectiveness of CLCNNs in a real-world setting.

3.2 Categorical Features for Classification

Previous research has shown that providing other categorical features besides textual input can improve the accuracy of a classification system [21].

When dealing with features used for classification, it is important that no features have missing values. If an entry in the feature vector for a sample has missing values, it cannot be used to train a classification model. Unfortunately, in practice it is not unusual for data to contain missing values. To combat the problem of missing data, there are various techniques to impute missing values. A common way to impute missing values is to simply use the mean or median value for a certain variable [22]. While this is sufficient in some cases, there are more advanced techniques that can yield better results [23].

One of these techniques is multiple imputation by chained equations (MICE) [24]. The idea behind MICE is that, instead of using mean or median imputation, a regression or classification method is used to predict the missing values. These regression or classification methods can be any algorithm normally used for such tasks, such as linear regression and logistic regression. The imputation can be done with multiple different variables, where all variables that contain missing values are imputed one by one. This provides an improved method of imputation over taking the mean or median value of a variable. MICE is a technique that is quite commonly used in psychological research, where missingness is common [25], but no research has been done on a dataset that deals with data from the public domain such as the dataset used for our research.

4 EXPERIMENTAL SETUP

4.1 Description of the data

The data for this research was collected by the municipality of Amsterdam through their digital citizen feedback reporting system. Data was collected from the 1st of January 2017 through the 31st of December 2017. In total, 151,573 citizen reports were collected.

Each notification consists of a short textual description (max 1024 characters), the category, the location and the time of the incident. Some reports contain additional information, such as additional comments made by employees that responded to the report, the date the report was eventually resolved, whether the report was made anonymously and the specific municipality department the report was assigned to.

4.2 Labelling of Data

To generate labels for training our models, we created an online labelling tool where labels could be assigned interactively by users. 29 employees of the municipality were recruited to assign labels to notifications. These employees all had experience with and knowledge of the resolution of notifications and thus were deemed domain experts. For images of the labelling tool, see Appendix A.

Out of all reports, a random sample of 12,584 reports was taken to be manually assigned a priority label. Priority was divided into three categories: low, medium and high. The raters were given specific instructions on how to label the reports, to ensure that labels were assigned in the same manner across all reports. For the instructions, see Appendix A.

To ensure standardization between the different raters, 1906 notifications were labelled by multiple raters. To evaluate the agreement across raters, Cohen's kappa coefficients [26] were computed between the different raters. According to previous research [27], kappa scores over 0.4 are acceptable. Using this threshold, three raters were excluded from the final dataset because they had low kappa scores compared to the other raters. This resulted in the removal of 1321 labelled notifications. For a full overview of inter-rater agreement, see Appendix A.
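As an illustration of how such agreement scores can be computed, the sketch below applies scikit-learn's cohen_kappa_score to two hypothetical raters; the thesis does not state which implementation was actually used.

```python
from sklearn.metrics import cohen_kappa_score

# Priority labels assigned by two raters to the same notifications (hypothetical).
rater_a = ["low", "high", "middle", "low", "low", "middle"]
rater_b = ["low", "middle", "middle", "low", "high", "middle"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # raters with mean kappa < 0.4 were excluded
```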


Fig. 1. Distribution of different levels of priority over all notifications [bar chart: percentage of notifications per priority level (low, middle, high)]

4.2.1 Distribution of Data. Figure 1 shows the distribution of the different levels of priority assigned by annotators. The largest numbers of notifications fall into the low and middle categories, at 68% and 25% respectively, and only about 7% falls into the high category. This has implications for learning the proposed models, since the least frequent category is the most important one to predict correctly. These implications are further discussed in section 6.

Figure 2 shows the distribution of the different levels of priority over the 7 possible categories of notifications. It shows that a very large number of notifications fall into the garbage category, and that the vast majority of those notifications is of low and middle priority. This is because these types of incidents generally involve things such as overfull garbage bins, which do not require an immediate reaction from the municipality. Notable is also that categories that involve nuisance have a higher proportion of high priority than the other categories. This is mainly because these types of notifications regard incidents where the nuisance is temporary.

The labelled dataset was split into three parts: a training set (70%), a validation set (15%) and a test set (15%). The training set was used to train the models, the validation set to tune the hyperparameters of the models, and the test set to evaluate the performance of individual models and to compare different models.

4.3 Methods

4.3.1 Pre-processing. Before applying each of the methods, the text of each notification was cleaned to reduce noise in the data. Punctuation, stopwords and HTML tags resulting from the web form were removed, and all text was converted to lower case. Conversion to lowercase was done so that capitalized words were not identified as different words by the vector-based methods. Moreover, Zhang et al. [19] found that converting text to lowercase outperformed leaving it capitalized in CLCNNs.
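A minimal sketch of this cleaning step is shown below. It assumes NLTK's Dutch stopword list; the thesis does not name the stopword list or libraries actually used.

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

DUTCH_STOPWORDS = set(stopwords.words("dutch"))  # assumption: Dutch-language notifications

def clean_notification(text: str) -> str:
    """Lowercase the text, strip HTML tags and punctuation, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags from the web form
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation removal + lowercasing
    return " ".join(t for t in text.split() if t not in DUTCH_STOPWORDS)

print(clean_notification("<p>De vuilnisbak op de hoek is VOL!</p>"))
```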

4.3.2 tf-idf. Using the formula for tf-idf from section 3, each notification was represented by a vector of size N, where N denotes the size of the vocabulary of the complete collection of notifications.

4.3.3 word2vec. A word2vec model was trained on a corpus of text from in total 450,000 notifications made over the years 2013-2017. To train this model, the gensim implementation of the word2vec model for Python was used [28]. Models with vector sizes 50, 100, 200 and 300 were trained and evaluated on the classification task. Sizes 100, 200 and 300 performed similarly, and slightly better than size 50. For this reason, a vector size of 100 was used in further comparisons, since it requires the fewest parameters while offering performance similar to the larger sizes. For the full comparison of the results and a more in-depth review of the word2vec model, see Appendix B.

The word2vec model was trained using the skip-gram model rather than the continuous bag-of-words model, because according to Mikolov et al. [14] it is better suited for relatively small corpora such as the one used in this study.

Fig. 2. Distribution of different levels of priority across the different categories [bar chart: number of complaints per priority level (low, middle, high) for the categories Garbage, Nuisance Companies, Public Greenery, Nuisance Public Space, Streets and Traffic, Other, Animal Nuisance and Nuisance by People]

With the trained word2vec model, each word is mapped to the vector space and each notification is transformed into a matrix M of size d × l, where d is the size of the word2vec vectors and l is the length of the notification. After transforming a notification into a matrix M, the notification is reduced to a single vector that can be fed to a classification algorithm. There are multiple ways to combine the vector matrix into a single vector. Previous research has shown that addition of vectors along each axis outperforms other methods such as taking the mean, maximum or product of vectors [29]. For this reason, the representation used for a single notification was the addition of the first 10 word vectors of the notification along each axis. Only the first 10 words were used because fewer than 30% of notifications contained fewer than 10 words. This ensured that as much information as possible was retained for notifications with a high number of words, while not disadvantaging notifications with fewer than 10 words too much. For the distribution of the number of words over the notifications, see Figure 3.
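The sketch below illustrates this pipeline with the current gensim API (code written in 2018 would have used the older size= parameter instead of vector_size=). The toy corpus and the helper function are illustrative, not the thesis code.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical tokenized corpus standing in for the 450,000 historical notifications.
corpus = [["volle", "vuilnisbak", "op", "de", "hoek"],
          ["dronken", "persoon", "veroorzaakt", "overlast"]]

# Skip-gram (sg=1) with 100-dimensional vectors, as selected above.
model = Word2Vec(sentences=corpus, vector_size=100, sg=1, min_count=1)

def notification_vector(tokens, model, max_words=10):
    """Sum the vectors of the first max_words in-vocabulary words (axis-wise addition)."""
    vecs = [model.wv[t] for t in tokens[:max_words] if t in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(notification_vector(corpus[0], model).shape)  # (100,)
```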

4.3.4 par2vec. A par2vec model was trained on the same corpus described above, also using the gensim implementation for Python. Models with vector sizes of 50, 100, 200 and 300 were again trained, with size 100 being the most suitable candidate. Since par2vec embeds the entire notification as a fixed-length vector, no aggregation is needed after the embedding to be able to use the vectors with a classification algorithm.
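A corresponding sketch for par2vec, using gensim's Doc2Vec, is shown below; the corpus and tags are hypothetical.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical tokenized notifications, each tagged with its index.
corpus = [["volle", "vuilnisbak"], ["dronken", "persoon", "overlast"]]
documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# 100-dimensional document vectors: one fixed-length vector per notification.
model = Doc2Vec(documents, vector_size=100, min_count=1, epochs=20)

train_vec = model.dv[0]                                    # vector of a training notification
new_vec = model.infer_vector(["kapotte", "lantaarnpaal"])  # vector for an unseen notification
```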

Fig. 3. Distribution of the number of words over all notifications

4.3.5 SVM. Each of the above-mentioned methods for representing text as vectors was used to produce input features for training an SVM classifier [30]. Specifically, the one-versus-rest implementation for multi-class classification was used [31], as provided by scikit-learn for Python [32].
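A minimal sketch of this setup, pairing a tf-idf representation with a one-versus-rest linear SVM in scikit-learn, might look as follows; the example texts, labels and default hyperparameters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hypothetical cleaned notification texts with priority labels.
texts = ["volle vuilnisbak op de hoek",
         "dronken persoon veroorzaakt overlast",
         "graffiti op de muur"]
labels = ["low", "high", "low"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = OneVsRestClassifier(LinearSVC()).fit(X, labels)  # one binary SVM per class
print(clf.predict(vectorizer.transform(["vuilnisbak kapot"])))
```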

4.3.6 Baseline Model. A baseline model to compare other methods with was created using only the category that was assigned to the notifications as a predictor variable in an SVM. There were a total of 8 different categories, such as ‘Garbage’, ‘Nuisance in the public space’ and ‘Nuisance caused by animals’.

4.3.7 BM25. Using scoring function (4) described in section 3, each notification in the test dataset was compared to the notifications in the train dataset. Each notification in the test set was treated as a 'query' and the labelled notifications in the train set as 'documents'. Each query-document pair received a score, and for each query these scores were ranked from high to low. Each of the categories was assigned a score, based on the sum of the scores of documents of that category among the 10 highest-ranked documents. The query was then classified as the category that received the highest score. This is based on the assumption that notifications that are lexically similar are also similar in their level of priority.
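The sketch below illustrates this classification scheme. The scoring function follows Eq. (4), using a common idf variant and the default parameters k1 = 1.2 and b = 0.75; the exact parameter choices of the implementation linked below are not stated in the thesis.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one query-document pair following Eq. (4)."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    tf_doc = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)  # common BM25 idf variant
        tf = tf_doc[term]
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def classify(query, train_docs, train_labels, k=10):
    """Assign the class whose documents accumulate the highest summed BM25
    score among the k highest-ranked training notifications."""
    scores = [(bm25_score(query, d, train_docs), y)
              for d, y in zip(train_docs, train_labels)]
    votes = Counter()
    for score, label in sorted(scores, reverse=True)[:k]:
        votes[label] += score
    return votes.most_common(1)[0][0]

train_docs = [["vuilnisbak", "vol"], ["dronken", "persoon"], ["graffiti", "muur"]]
train_labels = ["low", "high", "low"]
print(classify(["dronken", "man"], train_docs, train_labels, k=2))  # "high"
```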

The implementation used can be found on GitHub: https://github.com/fanta-mnix/python-bm25.

4.3.8 CLCNN. The character-level convolutional neural network used for this research is based on the model used by Zhang et al. [19]. The main component of this network is a temporal convolutional module, which computes a 1-D convolution. The network uses six convolutional layers interleaved with max-pooling layers, followed by three dense layers. The non-linearity used in the model is the Rectified Linear Unit, also known as ReLU [33]. For a graphical overview of the model, see Figure 4. For a more in-depth review of the model, including mathematical formulations, we refer to [19].

Several hyperparameters were tuned, such as the number of filter layers, the embedding sizes and the size of the dense layers. For a review of the hyperparameters considered, see Appendix C. While the original architecture used an input size of 1012 characters, the final model used for this research used 512. A smaller input layer drastically reduces the time needed for training. All notifications that exceeded the maximum character length of 512 were truncated to fit the 512-character limit. Truncation only affected about 5% of the notifications; the distribution of the number of characters of the notifications can be seen in Figure 5.

The network was implemented using the Keras [34] library for the Python programming language.
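A sketch of this architecture in the modern tf.keras API is given below, using the final configuration described in Appendix C (embedding size 256, the convolutional layers of Table 3, dense layers of size 512 with dropout 0.5). The alphabet size, optimizer and loss function are assumptions, not choices stated in the thesis.

```python
from tensorflow import keras
from tensorflow.keras import layers

ALPHABET_SIZE = 70  # assumption: number of distinct characters after cleaning
SEQ_LEN = 512       # notifications truncated/padded to 512 characters
NUM_CLASSES = 3     # low / middle / high

model = keras.Sequential([
    layers.Input(shape=(SEQ_LEN,)),        # integer-encoded characters
    layers.Embedding(ALPHABET_SIZE, 256),  # character embedding of size 256
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 7, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.Conv1D(256, 3, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```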

4.3.9 MICE. The implementation of MICE used in our research is similar to that used in [35]. The number of imputation cycles was 10 and for both variables the imputation technique was multi-class logistic regression. The fields that were imputed in the variable matrix were 'Anonimity' and 'Neighborhood': the first being whether a notification was made anonymously and the latter the neighborhood the notification was made in. These variables were chosen because they had suitable amounts of missing values for imputation with MICE [25], with 'Anonimity' having 12% missing values and 'Neighborhood' 5%.
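The MICE code itself is not published with the thesis. As an illustrative stand-in, the sketch below uses scikit-learn's IterativeImputer, which, like MICE, iteratively regresses each variable with missing values on the others; note that its default regressor is BayesianRidge rather than the multi-class logistic regression described above, so imputed categorical values would still need to be mapped back to valid categories.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical encoded feature matrix: column 0 is anonymity (0/1),
# column 1 is an integer-coded neighborhood; np.nan marks missing entries.
X = np.array([[1.0, 3.0],
              [np.nan, 5.0],
              [0.0, np.nan],
              [1.0, 3.0],
              [0.0, 5.0]])

imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
print(np.round(X_imputed))  # round continuous imputations back to categories
```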

4.4 Model Comparison

To compare whether the performance of models differed significantly, McNemar's tests [36] were conducted. McNemar's test is a contingency-table test often used for comparing the performance of two machine learning models [37].

To avoid the problem of inflated Type 1 errors when performing multiple statistical tests [38], a Bonferroni correction was used [39].

The implementation of McNemar's test comes from the StatsModels package [40] for Python.

5 RESULTS

Each of the methods mentioned above was evaluated on the labelled notifications. A baseline model was created where the only variable used was the category of a notification. Results on the test dataset can be found in Table 1. The metrics evaluated were precision, recall and F1 scores [41]. These metrics are reported for the overall result across all classes and separately for the high priority class, because this class was of extra importance for our problem setting.

In the table, the highest achieved values are shown in bold font.


Fig. 4. Illustration of the architecture of the CNN model, taken from Zhang et al. [19]

Table 1. Results for the various methods. P stands for precision, R for recall and F1 for the F1 score. A * after the method indicates that the classification algorithm used is SVM. tf-idf + categorical indicates tf-idf in combination with the category of the notification as an extra feature. tf-idf + imputed indicates tf-idf in combination with the category of the notification and the imputed variables of anonymity and neighborhood.

Method                 P_all  R_all  F1_all  P_high  R_high  F1_high
Baseline*              0.73   0.63   0.65    0.48    0.58    0.53
tf-idf*                0.83   0.83   0.83    0.55    0.61    0.61
BM25                   0.73   0.74   0.73    0.54    0.45    0.49
word2vec*              0.71   0.73   0.71    0.40    0.28    0.33
par2vec*               0.79   0.78   0.78    0.52    0.58    0.55
CLCNN                  0.83   0.83   0.83    0.68    0.51    0.59
tf-idf + categorical*  0.85   0.85   0.85    0.67    0.60    0.64
tf-idf + imputed*      0.84   0.83   0.83    0.66    0.58    0.62

Fig. 5. Distribution of the number of characters over all notifications

To compare whether differences between the performance of the different methods were significant, McNemar's tests [37] were conducted between the different methods. For a table with the significance results, see Appendix D.

The most notable results are that there is no significant difference in performance between BM25 and the baseline method, or between the CLCNN and tf-idf. All other differences in precision between methods are significant. It is worth noting that McNemar's test compares the overall precision of two methods, not the recall of a specific class.

6 DISCUSSION AND FUTURE WORK

The most salient and promising finding of our research is that the CLCNN performs on par with the most successful 'traditional' text representation, tf-idf. While the performance of CLCNNs has been researched to some extent on academic datasets, no research had been done on their application to real-world data. While the CLCNN did not perform as well as other methods on the metric of choice for this research, recall of the high class, it did perform well on overall precision. It is encouraging that the performance of this model on these types of user-generated texts is satisfactory. These results are promising and provide a hopeful outlook on the future of using neural networks for low-level input classification. An interesting direction for future research would be to add additional layers such as recurrent or LSTM layers, which have proven useful in text classification problems [20] [42].

While the scores of the CLCNN on the evaluation metrics are quite close to those of tf-idf with categorical features combined with an SVM, there is a vast difference in the computational complexity of these two models. Where the CLCNN took over 10 minutes to train and 1 minute to predict on the entire test set (an Nvidia TitanX GPU was used for computational speedup during both training and testing), the SVM took 2 minutes for training and 1 second for prediction. This is a considerable difference in the usage of computing resources, and something to take into consideration from both a sustainability and an economic perspective.

The result we found with respect to BM25 was unexpected. Since BM25 is an extension of tf-idf, one would intuitively expect similar or only slightly better or worse results. The results from our research indicate that BM25 performs much worse on our task than tf-idf in combination with SVM. This could be because the classification scheme used with BM25 was a different, more information-retrieval-oriented approach, which is likely the main cause of the discrepancy between the performance of BM25 and tf-idf. For future research a different approach to implementing BM25 could be chosen, where some form of document-length normalization is applied to the tf-idf vectors instead of using a query-document scoring function. This way a BM25-like vector representation could be formulated that can then be used in an SVM, thus eliminating the effect of the different classifiers.

An unexpected finding is that word2vec representations of notifications yielded the worst performance of all methods on many metrics. Previous research on using word2vec for text classification has found it to be a reliable technique for representing text in classification problems; our findings contradict this. As can be seen from the various plots and descriptions in Appendix B, the trained word2vec model seems to produce reliable word embeddings: words that are semantically similar are mapped to nearby locations in reduced 2D t-SNE space. However, when notifications are represented as their summed word2vec vectors and plotted with colour based on priority, it is hard to distinguish regions in the vector space where different levels of priority are separable. This suggests that this representation is not suitable for detecting the priority of notifications. A cause could be that summing the vectors along each axis is a poor representation, even though previous research indicates otherwise [29]. For future research an interesting approach could be to use k-means clustering [43] on all the word vectors in a notification, and then use the k centroids of these clusters as the representation of the notification.

Another remarkable finding is that using the variables imputed with MICE does not improve performance. The most likely reason is that the variables chosen for imputation with MICE have no relation to the priority to begin with. This makes imputing their missing values to better predict priority a futile exercise. Thus no real conclusions can be drawn about whether MICE is effective for improving the performance of the classifier.

Besides worsening performance in the prediction of the priority, there are also ethical concerns regarding the use of the ‘Neighborhood’ variable. If the model were to be used in production, it would be undesirable to base the priority that is given to a notification on the neighborhood it was made in. We would like to express clearly that the investigation into the predictive power of this variable was only made for scientific purposes and that such practices should not be used in real-world applications.

7 CONCLUSION

In this research we assessed the effectiveness of several techniques for classification of the level of priority of customer complaints. The customer complaints that we specifically looked at were notifications about the public space from the municipality of Amsterdam.

Results from our research indicate that predicting the priority of customer complaints is a problem that can be solved reasonably well. While the performance can be improved upon, it is good enough for a typical production environment in a real-world setting.

We looked at several different techniques, of which tf-idf in combination with categorical features proved to be the most effective. Research remains to be done to improve the other classification methods. This is especially the case for word2vec and BM25, which performed poorly, possibly due to the limitations mentioned in section 6.

This research takes the field of automating processes to increase customer satisfaction a step further. An interesting direction for future research could be to investigate how our findings translate to other settings with different types of customer complaints.

REFERENCES

[1] Anna S Mattila and Daniel J Mount. The impact of selected customer characteristics and response time on e-complaint satisfaction and return intent. International Journal of Hospitality Management, 22(2):135–145, 2003.
[2] Angela Edmunds and Anne Morris. The problem of information overload in business organisations: a review of the literature. International Journal of Information Management, 20(1):17–28, 2000.
[3] Hans H Bauer, Mark Grether, and Mark Leach. Customer relations through the internet. Journal of Relationship Marketing, 1(2):39–55, 2002.
[4] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. A new chatbot for customer service on social media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 3506–3510. ACM, 2017.
[5] Charu C Aggarwal and ChengXiang Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer, 2012.
[6] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[7] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52, 2010.
[8] Wen Zhang, Taketoshi Yoshida, and Xijin Tang. A comparative study of tf*idf, lsi and multi-words for text classification. Expert Systems with Applications, 38(3):2758–2765, 2011.
[9] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 133–142, 2003.
[10] Stephen Robertson. Understanding inverse document frequency: on theoretical arguments for idf. Journal of Documentation, 60(5):503–520, 2004.
[11] Franca Debole and Fabrizio Sebastiani. Supervised term weighting for automated text categorization. In Text Mining and its Applications, pages 81–97. Springer, 2004.
[12] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[13] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
[14] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[15] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. Support vector machines and word2vec for text classification with semantic features. In Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on, pages 136–140. IEEE, 2015.
[16] Andi Rexha, Mark Kröll, Mauro Dragoni, and Roman Kern. Polarity classification for target phrases in tweets: a word2vec approach. In International Semantic Web Conference, pages 217–223. Springer, 2016.
[17] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
[18] Natalia Maslova and Vsevolod Potapov. Neural network doc2vec in automated sentiment analysis for short informal texts. In International Conference on Speech and Computer, pages 546–554. Springer, 2017.
[19] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
[20] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.
[21] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30. Springer, 2004.
[22] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
[23] Joseph L Schafer and John W Graham. Missing data: our view of the state of the art. Psychological Methods, 7(2):147, 2002.
[24] Patrick Royston et al. Multiple imputation of missing values. Stata Journal, 4(3):227–241, 2004.
[25] Melissa J Azur, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1):40–49, 2011.
[26] Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213, 1968.
[27] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.
[28] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[29] Thijs Scheepers, Efstratios Gavves, and Evangelos Kanoulas. Improving word embedding compositionality using lexicographic definitions. 2017.
[30] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[31] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5(Jan):101–141, 2004.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[33] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[34] François Chollet et al. Keras. https://keras.io, 2015.
[35] Ian R White, Patrick Royston, and Angela M Wood. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine, 30(4):377–399, 2011.
[36] Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
[37] Thomas G Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998.
[38] Richard A Armstrong. When to use the Bonferroni correction. Ophthalmic and Physiological Optics, 34(5):502–508, 2014.
[39] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64, 1961.
[40] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference, volume 57, page 61. SciPy Society Austin, 2010.
[41] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[42] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630, 2015.
[43] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.


APPENDICES

A LABEL TOOL

Fig. 6. The instruction page of the labelling tool [Dutch language]


Fig. 8. Kappa scores between raters. Greyed-out squares indicate that two raters had no mutually labelled notifications.

In Figure 8 the Kappa scores between the different raters can be seen. Raters with a mean kappa score lower than 0.4 were excluded from the final dataset.

If a notification received different labels from different raters, the most frequent label was chosen. In case of a tie, the notification was excluded from the dataset.

Fig. 9. Embedded words in 2D t-SNE space [Dutch language]


Table 2. Results for the various word2vec vector sizes. P stands for precision, R for recall and F1 for the F1 score.

Vector size  P_all  R_all  F1_all  P_high  R_high  F1_high
50           0.70   0.69   0.69    0.31    0.22    0.27
100          0.71   0.73   0.71    0.40    0.28    0.33
200          0.71   0.73   0.71    0.40    0.28    0.33
300          0.71   0.73   0.71    0.40    0.28    0.33

Fig. 11. par2vec embedded notifications in 2D t-SNE space

B WORD2VEC MODEL

B.1 Analysis of word2vec model

In Figure 9 the word2vec vectors for a subset of words can be seen in reduced two-dimensional t-SNE space. In this vector space, words that are semantically related are indeed close to each other. This suggests that the trained model does indeed capture the meaning of words.

In Figure 10 the summed word2vec vector representations can be seen in reduced two-dimensional t-SNE space. The points in the figure are coloured by level of priority. It is difficult to find a pattern in this space that makes the points separable by priority level, suggesting that this representation makes it difficult to classify priority.

These findings suggest that the summed vector representation may not be well suited for representing the notifications.

In Figure 11 the par2vec representations of notifications can be seen in reduced two-dimensional t-SNE space. The points in the figure are coloured by level of priority. The different levels of priority are clearly more easily separable here than in Figure 10. This is also reflected in the results.


C CHARACTER LEVEL CONVOLUTIONAL NEURAL NETWORK COMPARISONS

Several configurations of the CLCNN were considered. The parameters that were tuned were the embedding sizes, the convolutional layers and the fully connected layers. The sizes considered for the embedding were 128 and 256. For the fully connected layers, the sizes 256, 512 and 1024 were considered. For the convolutional layer configurations, see Table 3 and Table 4. To prevent overfitting, each configuration featured a dropout module after each fully connected layer with a dropout probability of 0.5. All models were trained for 50 epochs. The final configuration with the best performance on the F1 score of the high priority class was that with embedding size 256, the convolutional layer configuration of Table 3 and fully connected layers of size 512.

Full comparison results on all metrics, such as provided in Appendix B, were unfortunately not saved, and could not be produced again due to time constraints.

Table 3. Configuration 1 of the convolutional layers (C1)

Layer  Feature size  Kernel size  Pool
1      256           7            3
2      256           7            3
3      256           3            N/A
4      256           3            N/A
5      256           3            N/A
6      256           3            3

Table 4. Configuration 2 of the convolutional layers (C2)

Layer  Feature size  Kernel size  Pool
1      256           7            3
2      256           7            3
3      256           3            N/A


D MCNEMAR'S TEST RESULTS

Results of McNemar’s test [37] can be found below. We use a bonferroni correction withα = 0.0528 It is to be noted that McNemar’s test uses correctly / incorrectly classified samples between both models as values to be compared. This means that it compares on precision, and not recall of a specific class, which is the desired metric for our problem setting. This means that these statistical tests are to be seen in a more general context than only our problem setting.

Table 5. P-value results of McNemar's test for comparison between methods. Text printed in bold indicates p < 0.05/28 and row model performance > column model performance. Text printed in bold and italic indicates p < 0.05/28 and column model performance > row model performance.

                      Baseline  tf-idf  BM25   word2vec  par2vec  CLCNN  tf-idf+categorical  tf-idf+imputed
Baseline              1
tf-idf                0.000     1
BM25                  0.045     0.000   1
word2vec              0.000     0.000   0.000  1
par2vec               0.000     0.000   0.000  0.000     1
CLCNN                 0.000     0.021   0.000  0.000     0.000    1
tf-idf + categorical  0.000     0.000   0.000  0.000     0.000    0.000  1
tf-idf + imputed      0.000     0.000   0.000  0.000     0.000    0.000  0.000               1
