
Detecting Duplicate Citizen Reports

Submitted in partial fulfilment for the degree of Master of Science

Joost Johannes van Eck

11838841

Master Information Studies

Data Science track

Faculty of Science

University of Amsterdam

Date of defence: 2018-06-28

Internal Supervisor: dr S. Rudinac (UvA, FEB)

Second Reader: dr K. Beelen (UvA, FGW)

External Supervisor: Maarten Sukel (City of Amsterdam)


Abstract

The City of Amsterdam maintains a citizen report system for incidents in the public space. A substantial share of these reports is estimated to cover the same incidents. In this thesis we investigate the possibility of using both textual and spatiotemporal features to detect these duplicate reports. Several approaches are explored to capture and model multiple similarity measures based on nearly a full year of data. The results show that detecting these duplicates can be done reasonably well on spatiotemporal features alone, which can be complemented with linguistic features.

1

Introduction

A citizen report system has been in use by the City of Amsterdam for inhabitants of the Amsterdam municipality for a number of years, under the name Meldingen Openbare Ruimte Amsterdam (MORA). Incidents in the public space can be reported at the dedicated website, through contacting a public servant or webcare, or through verbeterdebuurt.nl¹. These reports include incidents such as littered streets, broken asphalt, noise nuisance, and many more non-emergency incidents.

Each submitted report has a textual description, ranging from short and generic to long and detailed. The reports are associated with a variety of metadata, inserted either by the citizen or the public servant, or generated by the MORA system. This data includes fields such as several location details, submission time, reporter details, category, priority, images, and more.

¹ verbeterdebuurt.nl

MORA processes well over 150,000 reports per year, and this number has been growing. All reports are dealt with one at a time. However, it is estimated that a significant portion of these reports are duplicates, with estimates ranging from 10% to 20%. Two or more reports are said to be duplicates when they describe the exact same incident. This can be one person reporting the same incident multiple times using nearly identical phrasing, or multiple people using quite different phrasing. This redundancy has implications; among others, it takes up resources that could be spent more effectively.

The system handles non-emergency incidents, which suggests that most of these incidents have a small and local impact on the city. Reports in each other's spatiotemporal proximity could therefore be more likely to describe the same incident. The textual description given by the reporter can give deep insight into what the incident is about. Duplicate detection might therefore yield good results by looking at both of these feature types.

1.1 Research question

In this thesis we seek to answer the following research question: Can duplicate reports effectively be recognised based on content analysis and contextual features? To answer this question the following sub-questions will be used:

• What text analysis techniques and features can be used to determine if two reports are duplicates?

• What is the usefulness of spatiotemporal features in predicting duplicates?

• Can these two approaches complement each other in predicting duplicates?

Overview of thesis In Section 2 related work is discussed, followed by the data and methodology in Sections 3 and 4 respectively. Section 5 details the results and Section 6 discusses them. The thesis concludes in Section 7. The Appendix in Section 8 contains additional figures and tables.

2 Related Work

Duplicate detection is an important factor in many situations. In any system that uses a database, duplicate records take extra time to process and hence consume more resources. Detecting duplicates and either bundling or deleting them can make a process more efficient and effective. Measuring textual similarity is a classic natural language processing problem, a field that has been explored quite extensively [2].

Comparing different similarity measures on sentences is done in [1]. Measuring cosine similarity between tf-idf vectors is among their applied methods. tf-idf is a widely used word weighting scheme [19]. It has several variations, but the underlying goal is to penalise words that are frequent across documents and to give more weight to infrequent words. It converts any document to a weighted bag-of-words vector. In [1] they find that tf-idf performs reasonably well; semantic measures are, however, still able to outperform it. Looking at the share of overlapping words also appeared to be a strong indicator of textual similarity.

On Facebook multiple pages can refer to the same geographical place or entity. Detecting these duplicates is investigated in [10]. By calculating similarity measures for both relative term frequencies per so-called tile, small square geographical regions, and textual features in the names of these places, they are able to outperform simple Levenshtein edit distance and tf-idf, thereby showing that textual and spatial features combined can be used for duplicate detection.

Okapi BM25 is a notable improvement of tf-idf [19], which takes into account additional factors, such as document length, while not introducing too many new parameters into the model. Okapi BM25 proved to be a very effective weighting scheme for Information Retrieval, as concluded in [30]. This suggests that it might prove useful in other areas of natural language processing.

One shortcoming of a word weighting scheme is that it cannot account for different phrases conveying the same meaning. To capture this, Word Mover's Distance is introduced in [16]. It approaches the difference between two documents as a transportation optimisation problem from all the words in one document to the most semantically similar words in the other document. This similarity is based on word embeddings [20], where "a word is characterised by the company it keeps": a technique that uses a shallow neural network to capture the semantic similarity of words by mapping the contexts in which they appear.

The use of citizen reports and how to facilitate them is explored in [21]. It details Open311, a "collaborative model and open standard for civic issue tracking [in the public space]" primarily employed in several large cities in the USA². The paper describes not only the relevance of such a system, but also investigates how its design influences its use and adoption.

In [26] the use of machine learning techniques for citizen reporting systems was explored, specifically for MORA. Methods based on textual features to automatically find the correct category of a report proved useful. Spatiotemporal features proved less useful there, even though they were able to capture the correct category of each incident report to a certain degree. Still, this suggests that time differences and spatial distances might prove useful for the duplicate detection task.

Automatically detecting regions within cities based on various features is explored in [24]. Latent Dirichlet Allocation (LDA) was used to detect latent topics in the textual information of Flickr³ images originating in Amsterdam. These topics prove useful in capturing various semantically rich aspects of the city. LDA was first presented in [6] as a generative probabilistic model where documents are represented as a distribution over n latent topics and a topic is a distribution over the unique words in the corpus. These latent topics don't have to be strongly defined semantically; they can be seen as collections of lexical terms, based on term co-occurrence.

3 Data

3.1 MORA data

² open311.org
³ flickr.com

The data-set used for this thesis covers every incident report submitted through MORA in 2017, from February 15th to December 31st. It totals 150764 records, with 93 fields per report. Said fields include information as given by the reporter and as generated by the MORA system. A subset of these fields is used in this thesis. Included are various spatiotemporal fields, main and sub category, and the incident description. Other fields detail, for example, how incidents were resolved; those will not be used in this research. A fake report example with some of the used fields can be found in Table 1. The 8 main categories and their corresponding numbers of reports are shown in Fig. 1, and histograms detailing the number of characters and words per report in Fig. 2a and 2b. A bar chart for the 61 sub categories (Fig. 4) and a table with translations (Table 3) can be found in the Appendix.

Field | Value
Description | The downstairs neighbours are playing loud music again
Time | 12:34am, January 1st, 2017
Address | Stationsplein, 1012 AB Amsterdam
Coordinates | 52.38 latitude, 4.9 longitude
Category | Nuisance from groups or persons

Table 1: Report example

Figure 1: Main categories with number of reports

Figure 2: Histograms of (a) character counts and (b) word counts, logarithmic scale

The average number of characters and words per report is 170.2 and 27.5 respectively. This means that the textual descriptions might not always be as rich in information as we would like. That is where additional spatiotemporal features might prove useful.

3.2 Duplicate labels

Evaluating predicted duplicates requires labelled examples. As these were not available, they had to be acquired as follows. A labelling tool was built so that civil servants could annotate report pairs; a screenshot can be found in Fig. 5 in the Appendix. Each time two reports would be sampled, paired, and presented to the civil servant, who judged whether or not the pair described the same incident. There was also an option to skip a presented pair whenever it was unclear; those were left out of the analysis. Comparing each report with every other report would have resulted in 150764² pairs, which would have been unfeasible and sparse in duplicates to begin with. Report pairs were therefore filtered on the following criteria to root out true negatives:

• Same neighbourhood
• Same main category
• A maximum of 48 hours apart
• A maximum of 250 great-circle meters apart

This resulted in 317860 pairs of records to sample from for the labelling tool. In total 5004 labels were acquired (1314 duplicates vs 3690 non-duplicates). Each item in the acquired data-set now looks like this:

(Report-ID1, Report-ID2, label)

The label is a binary indicator of whether or not the two reports are actually about the same incident. Each report ID can be linked to the actual report and its data, which will be used in predicting whether or not the two reports are duplicates.
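The four filtering criteria above can be sketched as a pair generator. This is an illustrative sketch, not the thesis code: the field names (`neighbourhood`, `category`, `time`, `lat`, `lon`) are hypothetical, and the great-circle distance is approximated with an equirectangular projection, which is adequate at the 250-meter scale used here.

```python
import math
from datetime import datetime
from itertools import combinations

def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of great-circle distance in meters;
    adequate for the sub-kilometer scales filtered on here."""
    r = 6371000.0  # mean Earth radius in meters
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)

def candidate_pairs(reports, max_hours=48, max_meters=250):
    """Yield ID pairs of reports that satisfy all four filtering criteria."""
    for a, b in combinations(reports, 2):
        if a["neighbourhood"] != b["neighbourhood"]:
            continue  # criterion 1: same neighbourhood
        if a["category"] != b["category"]:
            continue  # criterion 2: same main category
        if abs((a["time"] - b["time"]).total_seconds()) > max_hours * 3600:
            continue  # criterion 3: at most 48 hours apart
        if approx_distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) > max_meters:
            continue  # criterion 4: at most 250 meters apart
        yield a["id"], b["id"]
```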


3.3 Data validity

The labels should be consistent between the civil servants, so we tested for inter-user agreement. A report pair that had already been labelled would be presented again with a 50% probability, resulting in 360 report pairs with 2 labels. Since this was implemented relatively late in the labelling process, the effective share of double-labelled pairs is lower. To evaluate agreement we used Cohen's Kappa score [28], along with the agreement rate. κ = −1 indicates total disagreement, κ = 0 agreement by chance, and κ = 1 total agreement. The results can be found in Fig. 6 in the Appendix; the names of the civil servants have been anonymised. A common rule of thumb holds that Kappa scores above 0.6 are substantial and above 0.8 almost perfect [17]. Since almost all scores are 0.74 or higher (Fig. 6a) we can conclude that there is enough inter-user agreement, with a few exceptions. Three Kappa scores are 0.0, but on inspection the number of shared report pairs for these never exceeds 3 (Fig. 6c), so they are not reliable enough to act on. The agreement rates show similarly promising results (Fig. 6b). Out of 360 report pairs there was disagreement about 24 pairs (48 labels total); those were removed. This leaves 1302 duplicates and 3654 non-duplicates.
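Cohen's Kappa can be computed directly from the two annotators' label lists. A minimal stdlib sketch (scikit-learn's `cohen_kappa_score` implements the same statistic):

```python
def cohen_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement: fraction of items both annotators label the same
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement by chance, from each annotator's label distribution
    classes = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes)
    if p_e == 1.0:  # both annotators constant and identical
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```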

3.4 Transitivity detection

There is a substantial class imbalance in the labels (roughly 1 to 3). Transitivity detection is a way to counter this imbalance, at least partially. Its formal definition is given in Equation 1: where items a, b, and c are members of set X, a and b have relation R, and b and c have relation R, we can deduce that a and c also have relation R.

∀a, b, c ∈ X : (aRb ∧ bRc) ⇒ aRc    (1)

If we say that any member of X is any report in the MORA system and relation R is any two reports being about the same incident, then we can enlarge our set of positive example pairs. Out of 1302 labelled positive pairs we can derive 29 new ones, a 2.23% increase. While not a vast increase, it is a small step in countering class imbalance.
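The transitive closure of Equation 1 over the labelled positives amounts to finding connected components among the reports: a sketch using a small union-find, assuming pairs are given as hashable report IDs.

```python
from itertools import combinations

def transitive_pairs(positive_pairs):
    """Return the full set of duplicate pairs implied by transitivity:
    every pair within a connected component of the positive pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in positive_pairs:
        parent[find(a)] = find(b)  # union the two components

    components = {}
    for node in parent:
        components.setdefault(find(node), []).append(node)

    closed = set()
    for members in components.values():
        for a, b in combinations(sorted(members), 2):
            closed.add((a, b))
    return closed
```

Subtracting the original labelled pairs from the closure yields the newly derived positives.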

4 Methodology

This section details the techniques used to measure distances and similarities between pairs of reports. Lexical, syntactic, and semantic elements are covered by the linguistic features. The spatiotemporal features are split into time difference, walking distance, and great-circle distance. These distances and similarities will be used as features for binary classification, both in separate classifiers and in an ensemble.

4.1 Linguistic features

This subsection describes the textual techniques used to capture similarities between the descriptions of two reports. Report descriptions were preprocessed by removing punctuation, lower-casing, and tokenising with the Dutch tokeniser from the Natural Language Toolkit [5].

4.1.1 N-grams

An n-gram is a contiguous sequence of n words from any given document [8]. N-grams can be used to measure both lexical and syntactic similarity between documents, by looking at the proportion of overlapping n-grams over the total number of n-grams. This is a rather straightforward algorithm to implement, since it doesn't require a model. Values 1 through 5 will be compared for n.
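As a sketch, the overlap measure can be read as the Jaccard coefficient over the n-gram sets of the two tokenised descriptions (one plausible reading of "proportion of overlapping n-grams over the total number of n-grams"):

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(tokens_a, tokens_b, n):
    """Share of n-grams the two texts have in common (Jaccard)."""
    grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    total = grams_a | grams_b
    if not total:  # texts shorter than n tokens
        return 0.0
    return len(grams_a & grams_b) / len(total)
```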

4.1.2 tf-idf & Okapi BM25

A tf-idf definition as given by [19] can be found in Equation 2. For both tf-idf and Okapi BM25⁴ the Gensim implementation will be used [23]. Every report description will be vectorised to both tf-idf and Okapi BM25 representations, and cosine similarity (Eq. 3) will be used to measure how similar the vectorised descriptions are. Cosine similarity measures the angle between two non-zero vectors in multi-dimensional space.

⁴ Gensim has no implementation for BM25. A custom one, built upon Gensim's tf-idf implementation, will be used. It can be found at github.com/lum4chi/mygensim


tf-idf_{t,d} = log(tf_{t,d}) · log(N / df_t)    (2)

similarity = cos(θ) = ( Σ_{i=1}^{n} A_i B_i ) / ( √(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²) )    (3)
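A stdlib sketch of the weighting and similarity measure above. Note the `1 +` smoothing on the log term frequency, a common variant from [19] that keeps single-occurrence terms from getting zero weight; the thesis experiments use Gensim's implementation instead.

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Map terms to (1 + log tf) * log(N / df) weights as a sparse dict.
    df is a mapping from term to document frequency over the corpus."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * math.log(n_docs / df[t])
            for t, c in tf.items() if df.get(t, 0) > 0}

def cosine(u, v):
    """Cosine similarity (Eq. 3) between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```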

4.1.3 Word Mover’s Distance

WMD measures the sum of minimum Euclidean distances between the words of one text and another, based on their word embeddings. This can be an exhaustive algorithm, but on the scale of the acquired labelled set it is still computationally manageable. Embedding sizes of 50, 100, 200, and 300 will be compared to see how well they capture the full semantic space of the corpus.

Since we are dealing with a domain-specific corpus, we might not capture a broad enough semantic space. To account for this another corpus will be used, not based on MORA data. The use of a corpus based on a wide range of sources of Dutch and Flemish language, primarily news articles and Wikipedia, is explored in [27]. The language in that corpus tends to be more formal; it was used to train a word embedding of size 160. It will be interesting to see how the Word Mover's Distances based on this embedding compare to those based on the MORA corpus. Gensim's implementation of word embeddings and WMD will be used [23].

4.1.4 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) captures the probabilistic word occurrence for any number of latent, or abstract, topics in a corpus. In this method a topic is a distribution over all words in the corpus, based on word co-occurrence, and a document is a distribution over n topics. Measuring similarity between the topic distributions of two documents, or, in the scope of this thesis, reports, can be done using cosine similarity (Eq. 3). Several numbers of topics will be compared to see how well they capture the true latent topic space: 50, 100, 200, and 300 topics will be investigated.

4.2 Spatiotemporal Features

Spatiotemporal features are features concerning space and time. In the context of this thesis these are the distances in both space and time between reports. Space and time will be used separately. The temporal distance is the difference in submission time, measured here in seconds.

A spatial distance can be calculated in a few ways given two coordinates. First we use the great-circle distance, the shortest path connecting two points on a sphere, in this case the Earth: the path is straight on the surface but moves through curved space. The Haversine formula is one way to measure this distance [25].

The second spatial metric is walking distance. This accounts for the fact that a transportation route between the locations of two reports might be longer than the great-circle distance, since the direct path can be blocked by some obstacle, be it a waterway, a highway, a building, etc. The Google Maps Distance Matrix API is used to compute this distance [11].
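The Haversine formula itself is compact; a sketch in plain Python, assuming a spherical Earth of radius 6371 km:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude)
    points, on a sphere of radius 6371 km."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of longitude along the equator comes out at roughly 111.2 km.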

4.3 Binary Classifiers

Since we are dealing with two classes, two reports either being about the same incident or not, binary classifiers will be used to predict duplicates. The Scikit-Learn API [7] contains a range of machine learning tools, such as preprocessing methods, classifiers, and evaluation methods. The following four classifiers will be used as implemented by Scikit-Learn: Logistic Regression [29], Support Vector Machine [9], Decision Trees [3], and Random Forests [14]. All models will find or approximate their optimal hyper-parameters through a grid search over the possible combinations of hyper-parameters. Grid search can be an exhaustive algorithm, but for this data computation times are still manageable. Performance will be evaluated through 5-fold cross validation and optimised based on the ROC scores. The majority class (non-duplicates) will be undersampled; oversampling the minority class (duplicates) can result in over-fitting, something we need to avoid. Training and testing of all models will be done with a train-test split of the data (80% vs 20%). All features will be normalised by subtracting the mean and scaling to unit variance.
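The training setup can be sketched with scikit-learn on synthetic data. This is an illustrative stand-in, not the thesis code: the real features are replaced by `make_classification` and the hyper-parameter grid is deliberately tiny.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def undersample(X, y, rng):
    """Randomly drop majority-class rows until the classes are balanced."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# synthetic 6-feature data with a 3:1 class imbalance, as in the labels
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.75, 0.25], random_state=0)
X, y = undersample(X, y, rng)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# standardise (zero mean, unit variance), then grid-search a classifier
# with 5-fold cross validation optimised on ROC AUC
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(random_state=0))
grid = GridSearchCV(pipe,
                    {"randomforestclassifier__n_estimators": [50, 100]},
                    cv=5, scoring="roc_auc")
grid.fit(X_tr, y_tr)
test_auc = grid.score(X_te, y_te)
```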

4.4 Resulting Features

In total there are 5 values for n-grams, 4 embedding sizes to base the Word Mover's Distance upon (50, 100, 200, 300) plus one based on Dutch corpora [27], and 4 different numbers of topics for LDA (50, 100, 200, 300). Together with tf-idf, Okapi BM25, and the three spatiotemporal features this makes for 19 different features. Since several features are strongly correlated with other features, this will most likely influence model performance and reliability. From the correlation matrix in Figure 7 in the Appendix we make the following observations about (multi-)collinearity:

• Haversine and walking distance correlate nearly maximally
• All LDA and n-gram features seem to correlate similarly with all other features
• Of all LDA models, 300 topics seems to correlate best with the label
• Of all n-grams, 1-gram seems to correlate best with the label
• tf-idf and Okapi BM25 correlate maximally
• All MORA-based WMDs correlate maximally
• The Dutch-corpora-based WMD correlates with the label slightly less than the MORA-based WMDs

4.5 Combining

The different classifiers map the input features onto different spaces, and these spaces might capture the variance differently. Therefore, an ensemble will be investigated. As concluded in [12], combining classifiers trained on the same feature set improved performance somewhat, while combining classifiers trained on different feature sets performed even better. Here we will do the former, with a majority vote on summed probabilities. Combining classifiers can help errors in the individual models balance out and thus yield more accurate predictions.
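One reading of "majority vote on summed probabilities" is: per report pair, sum each classifier's class probabilities and pick the class with the larger sum (with equal weights this matches scikit-learn's `VotingClassifier(voting="soft")`). A minimal sketch:

```python
def soft_vote(probs_per_classifier):
    """Return 1 (duplicate) if the summed duplicate probability wins.
    probs_per_classifier holds one (p_non_dup, p_dup) pair per classifier."""
    sum_neg = sum(p[0] for p in probs_per_classifier)
    sum_pos = sum(p[1] for p in probs_per_classifier)
    return int(sum_pos > sum_neg)
```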

5 Results

The following set of 6 features was used for training and testing the models: time difference, Haversine distance, LDA (300), 1-gram, tf-idf, and WMD (50). Haversine distance is used over walking distance, since it is substantially quicker to compute. Feature selection both decreases collinearity and training times, and increases model interpretability [15].

Table 2 displays the performance of the classifiers expressed in F1-scores [22]. The Random Forest classifier was used to predict duplicates for each of the 8 categories separately; those results can be found in Table 4. All computations were executed on DAS-4 [4].

Classifier Ling. Spatiotemp. Both

Log. Regression 0.75 0.85 0.88

SVM 0.77 0.86 0.87

Decision Trees 0.76 0.86 0.88

Random Forests 0.77 0.85 0.88

Ensemble 0.76 0.86 0.88

Table 2: F1-scores for various classifiers based on only linguistic or spatiotemporal features and on both.

6 Discussion

Table 2 clearly shows that both the linguistic and the spatiotemporal features perform well, both separately and combined. The spatiotemporal features, however, perform markedly better than the linguistic ones. It could be that duplicate reports only appear in each other's spatiotemporal proximity and that there are not many non-duplicate reports in that proximity. Linguistic features are complementary to spatiotemporal ones in duplicate detection. Looking at the boxplot of the Haversine distance (Fig. 3), any distance greater than 0.1 km is already an outlier for the duplicate labels. This could change if the number of reports in each other's proximity increases rapidly; in such a scenario linguistic features will most likely increasingly become the decisive factor, rather than spatiotemporal features.

Figure 3: Boxplots of spatial features in kilometers

As can be seen in Table 4 in the Appendix, the spatiotemporal features perform better than the linguistic features per category as well. There are minor differences, but nothing substantial. The category 'Openbaar groen en water' does not perform as well as the other categories. Various factors could explain this, one of them being that reports in this category come from larger geographical areas, where it is harder to indicate a specific location.

Looking at sub categories we find that duplicate reports have a 73.1% chance of sharing the same sub category, as opposed to 30.5% for non-duplicates, hinting at its use as a feature for duplicate detection. However, using this as a binary feature did not improve performance in any of the classifiers. This could mean that the system is not consistent in sub category use, that sub categories can have overlapping semantics, or, most likely, that the variance explained by sub categories is already modelled through one or more of the other features.

Combining different classifiers did not improve performance, nor did it worsen it. The classifiers in the ensemble all mapped the same feature set onto different spaces, but modelled the same variance. Combining classifiers based on different feature sets could prove more fruitful.

One such approach could be to train classifiers solely on semantic similarity. A k-means word embedding centroid [13] or a full document embedding [18] could be determined for each report. Instead of taking a similarity measure captured in one value, say cosine similarity, the element-wise absolute difference of the two vectors could be used as input for a classifier.

Duplicate detection could also support priority detection. Some reports should be acted upon quicker than others; if multiple reports describe the same incident, its impact may be bigger and it should be addressed sooner.

7 Conclusions

In this thesis we calculated and modelled similarity measures between pairs of public-space reports in order to predict whether or not they describe the same incident. We found that this can be done reasonably well. Spatiotemporal features turn out to be a stronger indicator of duplicates than linguistic features; combined, however, they perform best. Furthermore, all classifiers performed roughly equally well, but combining classifiers based on the same feature set did not improve performance.

7.1 Acknowledgements

A special thanks to dr. Rudinac for his valuable feedback and to Maarten Sukel for the daily supervision during the project. A final thanks to both Mink Rohmer and Michelle Koks, who also wrote their theses for the MORA project.


References

[1] Palakorn Achananuparp, Xiaohua Hu, and Xiajiong Shen. The evaluation of sentence similarity measures. In International Conference on Data Warehousing and Knowledge Discovery, pages 305–316. Springer, 2008.

[2] Charu C Aggarwal and ChengXiang Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer, 2012.

[3] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.

[4] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54–63, May 2016.

[5] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

[6] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[7] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[8] William B Cavnar, John M Trenkle, et al. N-gram-based text categorization. Ann Arbor MI, 48113(2):161–175, 1994.

[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[10] Nilesh Dalvi, Marian Olteanu, Manish Raghavan, and Philip Bohannon. Deduplicating a places database. In Proceedings of the 23rd International Conference on World Wide Web, pages 409–418. ACM, 2014.

[11] Google Developers. Google Maps Distance Matrix API. https://developers.google.com/maps/documentation/distance-matrix. Accessed: 2018-06-01.

[12] Robert PW Duin and David MJ Tax. Experiments with classifier combining rules. In International Workshop on Multiple Classifier Systems, pages 16–29. Springer, 2000.

[13] John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.

[14] Tin Kam Ho. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.

[15] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated, 2014.

[16] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[17] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.

[18] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[19] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[21] Dietmar Offenhuber. Infrastructure legibility: a comparative analysis of open311-based citizen feedback systems. Cambridge Journal of Regions, Economy and Society, 8(1):93–112, 2015.

[22] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.

[23] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[24] Stevan Rudinac, Jan Zahálka, and Marcel Worring. Discovering geographic regions in the city using social multimedia and open data. In International Conference on Multimedia Modeling, pages 148–159. Springer, 2017.

[25] Roger W Sinnott. Virtues of the haversine. Sky and Telescope, 68:159, 1984.

[26] M. Sukel. Using machine learning to improve a citizen feedback system. Master's thesis, University of Amsterdam, 2017.

[27] Stéphan Tulkens, Chris Emmery, and Walter Daelemans. Evaluating unsupervised Dutch word embeddings as a linguistic resource. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 1–7, 2016.

[28] Anthony J Viera, Joanne M Garrett, et al. Understanding interobserver agreement: the kappa statistic. Fam Med, 37(5):360–363, 2005.

[29] Strother H Walker and David B Duncan. Estimation of the probability of an event as a function of several independent variables. Biometrika, 54(1-2):167–179, 1967.

[30] Hugo Zaragoza, Nick Craswell, Michael J Taylor, Suchi Saria, and Stephen E Robertson. Microsoft Cambridge at TREC 13: Web and HARD tracks.


8 Appendix

8.1 Figures

Figure 6 panels: (a) Cohen's Kappa scores, (b) agreement rate, (c) number of double-labelled pairs

8.2 Tables

Dutch | English
Afval | Garbage
Wegen, verkeer, straatmeubilair | Roads, traffic, street furniture
Overlast in de openbare ruimte | Nuisance in public space
Openbaar groen en water | Public green and water
Overlast bedrijven en horeca | Commercial and nightlife nuisance
Overig | Other
Overlast van dieren | Animal nuisance
Overlast van en door personen of groepen | Nuisance from groups or persons

Table 3: Translations of the main categories

Category Linguistic Spatiotemporal Both Support

Afval 0.81 0.90 0.91 2780

Wegen, verkeer, straatmeubilair 0.84 0.88 0.90 902

Overlast in de openbare ruimte 0.84 0.85 0.89 690

Overlast Bedrijven en Horeca 0.85 0.91 0.93 268

Openbaar groen en water 0.79 0.83 0.85 129

Overig 0.85 0.84 0.90 112

Overlast van dieren 0.87 0.94 0.94 59

Overlast van en door personen of groepen 0.89 0.90 0.95 45

Table 4: F1-scores (Random Forest) per main category, with support
