
Predicting Liveability in Metropolitan Areas Using

Social Multimedia Data

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

LEX HAITSMA MULIER

10791345

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

July 29, 2015

Supervisor: Dr Stevan Rudinac

Second examiner: Dr Mihir Jain


Predicting Liveability in Metropolitan Areas

Using Social Multimedia Data

Lex Haitsma Mulier

University of Amsterdam, Science Park 904, Amsterdam

Lex.haitsmamulier@student.uva.nl

ABSTRACT

This paper presents a model that determines the quality of life within large metropolitan areas. The model analyses images and associated annotations from the user-driven image-sharing platform Flickr. We correlate the visual appearance encoded in the images, and the human perceptions reflected in the comments, with factors that influence the liveability of different neighbourhoods of a city, such as safety, crime rates and wealth. The output of the model is a prediction of the presence of these factors. The classifier uses separate approaches to analyse text and image data and eventually fuses the results from these approaches to form a final prediction. We find that the majority of our model's predictions agree with the official measurements. Our research shows convincing evidence of a relationship between a city's appearance and its implicit attributes. Ultimately, we propose two prototype applications that use our model for liveability estimation.

Categories and Subject Descriptors

I.5.4 [Applications]: Computer vision

I.5.2 [Design Methodology]: Classifier design and evaluation

General Terms

Experimentation, performance, algorithms.

Keywords

City liveability, urban computing, social multimedia, multi-modal analysis, classification.

1. INTRODUCTION

Is a big city a good place to live? There exist many different initiatives ranking metropolitan areas according to their quality of life. The European Commission, for example, monitors the evolution of public opinion on liveability in European cities [1]. Whether derived from large surveys or statistical facts, quality of life estimations for cities are generally based on a similar set of city-attributes, including topics such as wealth, property prices, healthcare, recreation and safety. A great deal of valuable information on those factors sits nested in the appearance of a city [2]. Some of this information is visible and quickly identified, e.g. that some streets or areas within a city are filthier or busier than others.

1 https://www.flickr.com

2 http://picasa.google.com

The ‘broken windows’ theory by Wilson and Kelling suggests that the appearance of physical environments directly relates to criminal behaviour [3]. Contrarily, some of this information is not as obviously extracted by simply observing city scenery. For example, implicit relations between neighbourhood appearance and city attributes have been found for depression levels [4], littering [5] and vandalism rates [3]. Aside from visual clues, people's perceptions also yield information concerning such inferred city attributes [6], [7]. Comments made on the ambiance, safety or aesthetics of street scenery could tell us how people feel. While estimating the quality of life by jointly performing analysis of visual appearance and human perceptions may be uncommon at the moment, this could change with modern technology. With automatic computation we can perform large-scale analysis of the factors involved in estimating quality of life without surveying people and without manually observing a city's appearance. Social multimedia websites such as Flickr1, Picasa2 and Panoramio3 can be used to find extensive amounts of data on city environments. This data not only consists of images, but also of additional comments added by users in the form of descriptions and tags. The purpose of this thesis is to answer the question how this social multimedia content can be used to predict the quality of life in a city and its neighbourhoods.

We will use Flickr content, which is visual data and the corresponding textual annotations, to predict the presence of city-attributes that influence quality of life. Content on Flickr can be pinned to a geographic location, commonly referred to as geo-tagging, providing us with a large dataset of spatial information. We build an attribute classification system using robust machine learning algorithms. A portion of the data is used to train the classifier to recognize distinctive aspects for city-attributes in the images and descriptions. Our assumption is that we can apply this classifier to new data to successfully identify the same distinctive aspects and make accurate predictions on the area's quality of life. The classifier is a generic model that can be applied to analyse diverse city-attributes. Its predictive power and accuracy come from exploiting both a visual and a textual modality to complement and validate one another.


Data will be collected for three cities in the Netherlands. The main focus will lie on the capital Amsterdam, which is rated 11th in the worldwide quality of living rankings [8]. Additionally, we perform the analysis for Groningen and Utrecht, which do not appear in the rankings.

In Section 2 we reflect on the related work. In Sections 3 and 4 we describe our setup, the data preparation process and our classification approach. The prediction results are evaluated in Section 5, and two prototype applications are proposed in Section 6. Finally, we discuss our approach and results in Section 7.

2. RELATED WORK

With the emergence of big data in recent years, doors to new possibilities have been opened for research on urban environments. A number of studies have used machine learning algorithms to analyse the visual appearance of a city. Khosla et al. [9] show that a street scene yields more information than what is visible in an image, and prove that computers can extract information from images that can be used to predict distances to nearby establishments and estimate crime rates. Similarly, Arietta et al. [10] perform an analysis on images of street sceneries, resulting in a city map with crime rates and other attributes such as average house pricing and population density. Quercia et al. [11] focus more on a city's aesthetics, and build a classifier that recognizes what visual aspects make London look beautiful, quiet and happy. Doersch et al. [12] also aim to analyse a city's architecture, and use computer vision technology to automatically seek visual elements of buildings that are distinctive for certain areas. Their classifier is able to distinguish different cities by analysing Google Street View data. Zhou et al. [13] build a similar model. Their model uses a large set of images obtained from the social image sharing platform Panoramio. Other studies go further and seek to recognize visual elements of buildings to the extent that the exact location where a query image was taken can be found by matching it to a large dataset of geo-tagged images [14], [15]. Undoubtedly, visual analysis of city environments has already produced promising results.

Besides an analysis of visual appearance, we will also base estimations of attributes on textual data. Text mining of social media data is a popular topic in today's research. For example, Asur and Huberman [16] demonstrate that messages on Twitter can be mined to predict box-office revenues for movies; their model outperformed market-based predictors. Regarding predictions on liveability, social media analysis also makes it possible to predict air quality, crime rates and other city-related attributes. For example, Wang et al. [6] and Gerber [17] used text mining of Twitter posts to predict potential crimes. Malleson and Andresen [18] show that textual analysis of Twitter messages can be used to map and predict crime rate hot spot shifting. Kay et al. [7] use Weibo, a Chinese Twitter alternative, in combination with a variety of other sources to trace the trends in air pollution and to determine and predict the air quality for Chinese cities.

Multimodal approaches to classification problems are also well-researched. Guillaumin et al. [19], for example, have a set of images with corresponding tags and descriptions and use both as input to test various multimodal classification schemes. Xu et al. [20] apply multimodal learning to a real scenario and analyse both visual and textual elements found within images to filter out spam messages. Although the authors in these papers use multimodal classification, to our knowledge such techniques have not yet been applied in urban computing research.

What sets us apart from related work is the use of both text and images from social multimedia content to make estimations of city-attributes. We use a separate approach for each of the two modalities and eventually fuse the results to produce a final prediction. Furthermore, the analysis of the individual attributes supports a broader purpose: we use it to estimate quality of life in cities.

3. INDICATORS OF LIVEABILITY

The Rijksoverheid (Dutch central government) evaluates quality of life according to several attributes [21]. In this thesis we adopt a selection of the city-attributes used in their analysis. Different neighbourhoods within a city may experience contrasting levels of liveability. We measure quality of life by the degree of presence, or possibly absence, of these attributes in the neighbourhoods and for the entire city. In order to train the classifier and evaluate the accuracy of the predictions, we use attribute values as ground truth data for each of our attributes. This data is retrieved from the Dutch government [22], and consists of official measurements for each of the city's neighbourhoods. The ground truth data is collected for the past five years (2010-2014), because our multimedia data is from the same time span, as will be discussed in Section 4.1.

Instead of making predictions with the continuous ground truth values, we turn our classification problem into a binary one, as the purpose of this research is to find out whether a city or a neighbourhood has a high or low quality of life, and not to determine the exact levels of the attributes. Technically speaking, we split the ground truth values based on an appropriate threshold, such as a point set by an agency or, if not available, the mean value. This forms a set of positive labels, which are all the attribute values that lie above the threshold, and a set of negative labels, which are all the attribute values that lie below the threshold.
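The thresholding step above can be sketched as follows. This is a minimal illustration in Python; the function name and the example values are ours, with only the fallback-to-the-mean rule and the subjective safety threshold of 100 taken from the text.

```python
def binarize(values, threshold=None):
    """Split continuous ground-truth values into binary labels.

    If no official threshold is available, fall back to the mean of the
    values, as described above. Returns 1 (positive label) for values
    strictly above the threshold and 0 (negative label) otherwise.
    """
    if threshold is None:
        threshold = sum(values) / len(values)
    return [1 if v > threshold else 0 for v in values]

# Hypothetical subjective-safety index values for four neighbourhoods,
# split at the official neutral index of 100:
labels = binarize([87.0, 121.5, 104.2, 95.3], threshold=100)
```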

The included attributes are as follows:

• Crime rates. The number of incidents per year, as provided by the Amsterdam Police unit, normalized for the size of the area in which they occurred. We make a distinction between acceptable (negative set) and intolerable (positive set) crime rates by setting a threshold at 15 incidents per acre per year.

• Subjective safety. Aside from the actual crime rate, we use an index that tells us how safe inhabitants of the neighbourhoods feel in their living area. In 2003, the index for a neutral situation was set at 100, which is also our threshold: a value above this index indicates a relatively unsafe area, and a score below the index means it is perceived as safe.

• Housing prices. The average housing price per area as determined by the government for tax-paying purposes. The label threshold is set at 231,000 euros, the average housing price in Amsterdam.

• Recreational sites. These include areas such as parks, playgrounds, plantation, sports fields, allotments and daytrip recreational areas. Absolute values are normalized for area size. We set the label threshold at 0.03 acre of recreational area per acre.

• Population density. This attribute is obtained by dividing the number of inhabitants per neighbourhood by its area size in acres. The threshold is set at the mean value of the Amsterdam neighbourhoods.

• Average income. No normalization is required for this attribute, as the averaged values are provided by the city of Amsterdam. The income statistics are from the year 2014, and the division between positives and negatives is set at an income of 30,500 euros per year, the national average from the same year.

The crime rate attribute is also collected for the cities of Utrecht and Groningen, allowing us to evaluate the performance of classification for these cities as well.

One can already make hypotheses on which attributes are expected to be more present in the textual and visual modalities. For example, the presence of recreational areas is probably easier to estimate based on visual appearance than the crime rates.

4. LIVEABILITY PREDICTION APPROACH

The process of building the classifier consists of four stages:

1. We gather social multimedia data, prepare it for classification, and assign the ground truth labels to train the classifiers.

2. Visual descriptors are extracted from the images, and used as input to train Support Vector Machines (SVM).

3. Textual descriptors are extracted, assigned weights, and then used as input to train SVMs.

4. We fuse the predictions from the visual and textual SVMs and use this as input to train a classifier that will be used for the final predictions.

Finally, we apply the trained model to new data.

4.1 Preparing the Data

As data we use content from the image sharing website Flickr. All of the images in our dataset are registered under a Creative Commons license, allowing us to use them in research [23]. The platform provides an elaborate and functional API to retrieve data. All of the items are geotagged to a specific location. Images from Flickr can be pinned to any location in a city; they are not limited to being taken from the road, unlike data from Google Street View, as used in [10] and [12]. Moreover, we find additional benefits in using multiple modalities and combining results to form more reliable predictions. A similar choice for social media content instead of Google Street View data is made in [13]. However, a particular challenge associated with the use of Flickr is the creative nature of its users; images may be taken at artistic angles or may have been modified using image editing software. We filter multimedia content on Flickr using search terms and tags that indicate the presence of streetscape scenery in the images. We use terms such as street, urban, building and architecture in several different languages. We further filter on the date the image was taken, and we limit the results to items that are geotagged within the selected areas. This search results in datasets of roughly 43,000, 6,000 and 5,500 items for Amsterdam, Utrecht and Groningen respectively. As opposed to Google Street View panoramas, content on Flickr consists more of images showing famous landmarks. As a consequence, our data is not proportionately distributed over the city, as can be seen in the mapping of all the items' geotags in Figure 1. The red areas mark the official boundaries of the cities, obtained from [24], [25] and [26].

Figure 1. Geotag mapping of the (a) ~43,000 results in Amsterdam, (b) ~6,000 results in Utrecht, and (c) ~5,500 results in Groningen. Blue dots represent geotagged images; the red area marks the official boundaries of the city.

Each item in the dataset consists of a visual component (the image), a textual component (the associated description and tags), spatial metadata (the geographical location) and temporal information (the date the image was taken). We start to prepare the dataset for classification. In order to avoid patterns within the datasets, the items are randomly shuffled. This is done because users tend to upload multiple items at once, for example with the same descriptions and tags. We then use the city's official district boundaries to determine within which neighbourhood the image was taken. For each city-attribute, we already know which neighbourhoods have a positive or negative label (cf. Section 3). We then assign these labels to the training data to perform supervised learning, as will be discussed in the next two sections. Once the items are assigned to the neighbourhoods, the exact location is disregarded.
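The shuffling and neighbourhood-assignment steps can be sketched as follows, assuming the district boundaries are available as simple polygons of (longitude, latitude) vertices. The ray-casting containment test and all names below are illustrative, a simplified stand-in for the GIS tooling one would use in practice.

```python
import random

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point (lon, lat) inside the polygon
    given as a list of (lon, lat) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge straddle the horizontal ray through the point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def assign_neighbourhoods(items, boundaries, seed=42):
    """Shuffle the items to break per-user upload patterns, then map
    each geotag to the first neighbourhood polygon containing it.
    `boundaries` is a hypothetical dict: name -> list of (lon, lat)."""
    random.seed(seed)
    random.shuffle(items)
    for item in items:
        item["neighbourhood"] = next(
            (name for name, poly in boundaries.items()
             if point_in_polygon(item["lon"], item["lat"], poly)),
            None)  # item falls outside all known district boundaries
    return items
```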

4.2 Visual classification

Instead of extracting and combining features from both modalities, we construct separate probabilistic models for the visual and textual channels and eventually combine scores in a final classification to produce a conclusive prediction. According to [27] our approach belongs to the group of late fusion approaches.

For each attribute value we need to discover visual elements in the scenery that are discriminative for that particular attribute, and which reoccur in the images we are making predictions for. First we extract 128-dimensional SIFT descriptors (also referred to as features) from the images and from these learn a codebook of 64 visual words using k-means clustering. We then encode the descriptors using a vector quantization technique called Vector of Locally Aggregated Descriptors (VLAD) [28]. Using VLAD adds more information to the feature vectors and thereby more discriminative power for classification. In this technique the descriptors are matched to their nearest clusters, and for each cluster we store the sum of the differences between the descriptors assigned to that cluster and its centroid. We then normalize the VLAD vectors by square rooting and apply L2 normalization, so that Euclidean distances between the features are well-defined. A subset of the Amsterdam dataset is used to train the model. Doersch et al. use roughly 10,000 panoramas during the training phase of their model [12]. We use an equally sized set of images and build our classifier as follows. First, we find the most discriminative descriptors for the city attributes. We reduce the feature vector size by selecting the features that are the most effective in describing the images. This is done by calculating the F-scores for the features using ANOVA, where only the 40 percent of the features with the highest statistical significance remain in the feature vector. For each attribute we train Radial Basis Function (RBF) SVMs. RBF Gaussian kernels have proven to perform well in image classification problems for datasets of around 10,000 images [29]. The SVMs, using the positively and negatively labelled images, are trained with 5-fold cross-validation. We use automatic grid search, a technique that compares accuracy results for all combinations of parameter settings, to select the optimal parameters per attribute.
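The VLAD encoding step above (nearest-centroid assignment, residual aggregation, square-root and L2 normalization) can be sketched as follows. This assumes a precomputed k-means codebook; in our setup that would be 64 centroids over 128-dimensional SIFT descriptors, although the toy shapes in the usage below are arbitrary. An optimised library would be used in practice.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Encode local descriptors (n x d array) against a k-means codebook
    (k x d array) into a single k*d VLAD vector, followed by signed
    square-rooting and global L2 normalization. Illustrative sketch."""
    k, d = centroids.shape
    # Assign every descriptor to its nearest centroid (visual word).
    dists = np.linalg.norm(
        descriptors[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    vlad = np.zeros((k, d))
    for i, desc in enumerate(descriptors):
        # Accumulate the residual between the descriptor and its centroid.
        vlad[nearest[i]] += desc - centroids[nearest[i]]
    vlad = vlad.ravel()
    # Signed square-root normalization, then global L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy example: three 2-D descriptors against a codebook of two centroids.
v = vlad_encode(np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]),
                np.array([[0.0, 0.0], [4.0, 4.0]]))
```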

4.3 Textual classification

Performing ‘late fusion’ provides the advantage of approaching the classification of each individual modality in an optimized form. Thus, the textual model differs from the visual model. The textual data consists of the descriptions and tags associated with the images.

We use the same items from the training set to train the textual classifier. Term frequency–inverse document frequency (tf-idf) is used to extract descriptors, resulting in a vocabulary of 62,000 terms. Tf-idf is a weighting scheme in which both the term frequency in a document and its reoccurrence within the entire collection are considered [30]. Tf-idf ignores the exact ordering of the words in the document, allowing us to merge the descriptions and tags. Furthermore, with tf-idf it is not necessary to approach different languages in the text data with a separate model, as the meaning of the term is irrelevant. A disadvantage of this approach is that we cannot ensure that the semantics of the terms are related to the topic of the attribute. A plausible consequence is that such a model may not achieve high accuracy in cities for which it hasn't been trained, which will be discussed in Section 5.4. Some of the terms turn out to be related to the location where the image was taken. The highest weighted terms in the dataset, according to the tf-idf algorithm, are sometimes names of highlights in Amsterdam (the top terms contain, for example: Nemo, IJburg, Rijksmuseum, Vondelpark) or general terms used to describe Amsterdam (Museum, Canal, West, Zuid). However, other terms that have received high weights from tf-idf are not location related (for example: church, graffiti, private, streetculture).

The textual feature vectors do not require any dimensionality reduction in the form of feature selection, because the most discriminative features have already been assigned a heavier weight by tf-idf. We also use L2 normalization on the textual features, so that Euclidean distances are well-defined. We then use a linear kernel SVM for classification of our text data. Text classification problems are generally linearly separable, often achieving an equal or sometimes higher accuracy than with a non-linear kernel. Finally, we again use automatic grid search to select the optimal parameters achieving the highest accuracy per attribute.
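The tf-idf weighting can be illustrated with a minimal implementation over the merged description-and-tag token lists. The idf variant shown, log of the collection size over the document frequency without smoothing, is one common formulation; the exact variant used by our tooling may differ.

```python
import math
from collections import Counter

def tfidf(documents):
    """Compute tf-idf weights for a collection of token lists (here:
    the merged description and tags of each Flickr item). Minimal
    sketch using idf = log(N / df), without smoothing."""
    n_docs = len(documents)
    # Document frequency: in how many items does each term occur?
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

# Two hypothetical items; a term occurring in every item gets weight 0.
docs = [["canal", "museum", "amsterdam"],
        ["graffiti", "streetculture", "amsterdam"]]
w = tfidf(docs)
```

Note how the collection-wide term "amsterdam" is weighted down to zero, while item-specific terms such as "graffiti" keep a positive weight; this is exactly why generic location terms do not dominate the feature vectors.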

4.4 Fusing the predictions

For each item in the training set the visual and textual classifiers produce a prediction. These predictions come in the form of probability scores, which are numbers between 0 and 1. The closer the number is to 1, the more confident the classifier is that the item belongs to the positive set, and vice versa. We perform multimodal feature combination by using the probability scores from both classifiers as features to train the final classifier. Thus, per item there are two features, one probability score for each modality. This approach is analysed by Snoek et al. [27]. They conclude that late fusion with probability scores as features tends to give better performance, but its learning effort is higher than for early fusion. Finally, a linear kernel SVM is used to classify the probabilities.
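The fusion step can be sketched as follows: the two per-item probability scores simply become a two-dimensional feature vector. The hand-set linear decision function below merely stands in for the linear-kernel SVM that is actually trained; its weights are arbitrary.

```python
def fusion_features(visual_probs, textual_probs):
    """Stack the per-item probability scores from the two modality
    classifiers into two-dimensional feature vectors for the final
    (late fusion) classifier."""
    return [[v, t] for v, t in zip(visual_probs, textual_probs)]

def fused_predict(features, w=(0.5, 0.5), bias=-0.5):
    """Illustrative linear decision over the fused features; in practice
    the weights come from training a linear-kernel SVM on these vectors."""
    return [1 if w[0] * v + w[1] * t + bias > 0 else 0 for v, t in features]

# Two hypothetical items: both modalities confident vs. both doubtful.
feats = fusion_features([0.9, 0.2], [0.8, 0.4])
preds = fused_predict(feats)
```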

A visualization of the entire model can be seen in Figure 2. The blue path illustrates the visual classifier, the red path shows the textual classifier, and in the green boxes the data from the two modalities are combined. For each city-attribute the same model is used, with optimized parameters.

4.5 Applying the model to new data

In the final three stages we have constructed and trained a multimodal classifier. The model can now be applied to classify new data that is unseen in the training process. This includes all but the training data from the Amsterdam set, and the entire Utrecht and Groningen datasets. The features are extracted in the same manner as in the training process. The trained SVMs in both modalities then classify the items as either positive or negative and pass the probabilities on to the fusion classifier, which produces the final predictions.

5. RESULTS

Section 5.1 presents the overall prediction results for the entire city of Amsterdam. To evaluate the performance per modality, the scores are discussed separately in Section 5.2. Then, in Section 5.3, we zoom in and show that the classifier can also produce neighbourhood-specific predictions. Finally, in Section 5.4 we apply the classifier to data from Utrecht and Groningen.

5.1 Amsterdam prediction accuracy and scores

Table 1 presents the overall results for the city-attributes in Amsterdam. As mentioned in the previous section, all models are trained on a dataset of 10,000 items, and the remaining data is used as completely new input for the classifiers to make these predictions on.

The ‘Accuracy’ column in Table 1 depicts the score for a direct comparison between the ground truth labels and the predicted labels. Note that in the case of a random prediction a score of 50% can be expected. Therefore, a trained classifier predicting a score between 50% and 100% is considered discriminative, with a value closer to 100% suggesting a higher accuracy.

Measuring prediction accuracy alone can give a limited perspective on our classifier's performance. For example, if we have a hypothetical dataset of a hundred items with only one positive item, and we predict all items as negative, the accuracy would still achieve the high score of 99% (incorrectly predicting only the single positive item). Therefore, we want to know if we predicted all the positive items correctly, and vice versa. We use a metric called precision: it tells us how many of the items that the classifier predicted as positive are truly positive according to the ground truth data (i.e. are all our positively labelled items actually positive?). The score for recall expresses the number of positive items found by the classifier relative to the total number of positive items in the dataset (i.e. have we found all of the positive items out there?), where a higher score also indicates a better result; recall is equivalent to the True Positive Rate (TPR). Its counterpart, the False Positive Rate (FPR), is the fraction of actually negative items that are incorrectly predicted as positive. Additionally, Figure 3 presents and explains the associated Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).

Table 1. Overall prediction results for the city of Amsterdam. The scores are the combined scores of the visual and textual modalities.

                     Accuracy  Precision  Recall  AUC
Crime rates          0.899     0.954      0.954   0.903
Subjective Safety    0.836     0.688      0.592   0.835
Housing Prices       0.951     0.983      0.984   0.921
Recreational Sites   0.834     0.922      0.872   0.900
Population Density   0.758     0.791      0.734   0.835
Average Income       0.780     0.835      0.751   0.846
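These metrics can be computed directly from the binary labels; the degenerate hundred-item example from the text illustrates why accuracy alone is not enough. A minimal sketch (AUC is omitted, as it requires the raw probability scores rather than hard labels):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall from binary label lists, matching
    the definitions used for Table 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# The degenerate example from the text: 100 items, one positive,
# everything predicted negative -> 99% accuracy but zero recall.
y_true = [1] + [0] * 99
m = classification_metrics(y_true, [0] * 100)
```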

The results in Table 1 are produced using the same model with optimal parameter settings per attribute. For every attribute the model achieves an accuracy score of over 75%, well above the 50% of random classification. The highest accuracy of 95.1% is obtained for the housing prices. Arietta et al. [10], who focus solely on visual appearance in combination with city attributes, also found the highest accuracy when determining housing prices. The worst result is obtained for population density, which was 75.8% accurate. One can imagine that images of streetscapes and corresponding user annotations bear more discriminative cues for housing prices than for population density, as will be discussed in Section 5.2.

Figure 2. A visualization of the entire classification model. The input is the Flickr multimedia data and the output is the final predictions.

Precision and recall scores are again highest for housing prices, with both statistics attaining results of over 98%. Despite the above-average accuracy for subjective safety, the precision and recall are substantially lower than for the other attributes. There are several factors that could influence the performance in this case, and they differ per attribute. The first has to do with the main topic of this research, estimating the quality of life for various neighbourhoods. The division between positive and negative labels is often based on a value retrieved from official instances (e.g. reports by the municipality of Amsterdam set the subjective safety index at 100). This may result in an uneven spread of labels among the items. The binary separation for housing prices, for example, is set at 231,000 euros, meaning that 88% of the items have a positive label. When only a small portion of the labels is negative (or positive), missing or incorrectly predicting only a few of them quickly has a large impact on the precision and recall scores. A second (related) factor is the spread of image locations. Because of the user-driven nature of the data source, the majority of the images are taken in city centres or at prominent locations, where housing prices, for instance, are high. Despite such challenges in using social multimedia data, the classifier performs rather robustly and reliably.

Figure 3 confirms that all attributes perform well on accuracy at the optimal threshold for the ROC, the lowest area under the ROC curve being 0.835 for subjective safety and population density. One can see that at a low threshold the classifier trained for population density has more trouble labelling the actual positive items as positive, as opposed to the rest of the classifiers. At around a TPR threshold of 0.8 the population density classifier starts classifying better than the model trained on subjective safety. The latter has trouble finding positive items when the positive rate is already high.

5.2 Examining the modalities separately

In this section we analyse the visual and textual modalities individually. Table 2 shows the accuracy scores per modality. The textual channel scores higher than the visual one for every single attribute, making it the stronger modality. The housing prices attribute scored best in the fused prediction, and unsurprisingly, the unmerged predictors achieve the highest results as well.

Most of the attribute accuracy scores increase when the two modalities' predictions are fused. The crime rates attribute's prediction accuracy is boosted by almost 7% on top of its textual accuracy. However, for the population density and subjective safety attributes this is not the case. Textual and overall prediction accuracy on an area's population density are equal. For subjective safety the final prediction has a lower accuracy than the textual modality, which suggests that the addition of the visual predictions actually damages the final score.

5.3 Prediction maps

In Section 5.1 we presented the overall scores for Amsterdam, based on all the items in the test set. Figure 4 shows the predictions for individual neighbourhoods, for different city attributes. The maps in the top row illustrate the actual labels based on the ground truth data. The bottom row contains our predictions. The map is constructed as follows: for each neighbourhood we look at the predictions for all the items that are assigned to that neighbourhood. When more than half of the items in a particular neighbourhood are predicted as positive, the neighbourhood is also labelled positive (red), and vice versa (green).
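The majority-vote aggregation behind these maps can be sketched as follows; the neighbourhood names in the example are hypothetical.

```python
from collections import defaultdict

def neighbourhood_labels(item_predictions):
    """Aggregate per-item binary predictions into one label per
    neighbourhood: positive (1) when more than half of the items
    assigned to it are predicted positive, else negative (0)."""
    buckets = defaultdict(list)
    for neighbourhood, pred in item_predictions:
        buckets[neighbourhood].append(pred)
    return {n: int(sum(preds) > len(preds) / 2)
            for n, preds in buckets.items()}

def map_accuracy(predicted, ground_truth):
    """Share of neighbourhoods whose aggregated label matches the
    ground-truth map."""
    return (sum(predicted[n] == ground_truth[n] for n in ground_truth)
            / len(ground_truth))

labels = neighbourhood_labels(
    [("centrum", 1), ("centrum", 1), ("centrum", 0),
     ("noord", 0), ("noord", 1)])
```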

With a prediction for each individual neighbourhood we can calculate the map accuracy by comparing a map with the corresponding ground truth map. In the bottom left corner of Figure 4 we show the results of this neighbourhood prediction accuracy. The highest score is for the recreational sites attribute: in this case 93.5% of the neighbourhoods are predicted correctly.

Figure 3. ROC curves and AUC for all the attributes, in Amsterdam. ROC curves are used to test the performance of binary classifiers, where one can read the TPR against the FPR for various thresholds. The dotted red line illustrates the 0.5 benchmark. The AUC (exact scores in Table 1) is equal to the probability that a randomly chosen positive element has a higher probability of being positive than a randomly chosen negative element [31].

Table 2. Accuracy scores for prediction in the two modalities.

                     Visual Acc.  Textual Acc.  Fused Acc.
Crime rates          0.772        0.830         0.899
Subjective Safety    0.770        0.846         0.836
Housing Prices       0.889        0.947         0.951
Recreational Sites   0.606        0.832         0.834
Population Density   0.540        0.758         0.758
Average Income       0.694        0.771         0.780

In this research we make the assumption that city attributes are deducible from visual appearance and the corresponding textual annotations. The significant overlap between the prediction maps and the ground truth indicates that this assumption is at least partly valid. For the incorrectly predicted areas we can assume that the visual and textual appearance does not yield discriminative information for the presence or absence of the attribute.

One interesting discrepancy can be found for the housing prices attribute. While it achieves the highest accuracy on overall prediction in the previous section, it scores the lowest on neighbourhood prediction accuracy. An explanation for this gap might be that the classifier performs exceptionally well on predicting labels for items in the city centre (seeing many similar houses and descriptions), but has trouble predicting labels for outer-city areas where item availability is low and the appearance is significantly different (and possibly unknown to the model).

5.4 Cross-city accuracy

After demonstrating that the model performs fairly well in Amsterdam, we investigate how well it performs on other cities. In other words: is the model that is trained on data from Amsterdam usable in other Dutch cities? To test this we use the crime rate attribute, as its ground truth data is available for all three cities. However, as the statistics are compiled by different agencies4,5, the dataset composition differs slightly per city (e.g. the data from Utrecht is averaged over three years, 2012-2014). Nonetheless, we believe that the overlap is sufficient to produce reliable results. Table 3 shows that all accuracy scores lie above the 50% benchmark. The model can therefore be considered cross-city discriminative. The classifier performs best in Utrecht,

4 http://data.groningen.nl/datasets/

achieving an overall accuracy of 64.4%. When we look at the modalities separately, the visual classifier trained on Amsterdam data achieves better accuracy for Groningen than for Utrecht, namely 2.7% higher. Additionally, the textual classifier performs better in Utrecht, scoring 3.9% higher accuracy.

Table 3. Overall prediction accuracy results for cross-city classification. The classifier is trained on multimedia and ground truth (crime rates) from Amsterdam; the model is then applied to media from other cities.

            Visual Acc.  Textual Acc.  Fused Acc.
Utrecht     0.611        0.663         0.644
Groningen   0.638        0.624         0.595

We see a drop of ~15% in accuracy for visual classification compared to Amsterdam. Regarding textual classification, there is a bigger gap between the accuracy for Amsterdam and the other two cities, approximately ~20%. A larger decrease for the textual modality is plausible considering some of the terms extracted by tf-idf are location related (cf. Section 4.3). The visual classifier even outperforms the textual classifier on the Groningen dataset, something that has not occurred for any other attribute or city.

Interestingly, the fusion of the two modalities results in a lower final prediction accuracy for both cities. Text classification accuracy outperforms the fused model in Utrecht. For the Groningen data both classifiers achieve higher accuracy separately than after their fusion. As explained in Section 4.4, the label probabilities from the separate models are used as feature vectors for the final predictor. A quick glance at the produced label probabilities tells us that the models are less confident in the label assignment for unknown cities than for the Amsterdam dataset.
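The late-fusion scheme of Section 4.4 — per-modality label probabilities stacked into a feature vector for a final predictor — can be sketched with scikit-learn as follows. The synthetic features, split sizes and classifier choices here are illustrative assumptions, not the actual thesis pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-ins for the two modalities: 200 items with binary labels.
y = rng.integers(0, 2, 200)
X_visual = y[:, None] + rng.normal(0.0, 1.0, (200, 5))   # noisier signal
X_textual = y[:, None] + rng.normal(0.0, 0.5, (200, 5))  # stronger signal

# One classifier per modality, each producing label probabilities.
clf_v = SVC(probability=True).fit(X_visual[:150], y[:150])
clf_t = SVC(probability=True).fit(X_textual[:150], y[:150])

def fused_features(Xv, Xt):
    """Late fusion: concatenate the per-modality label probabilities
    into one feature vector per item."""
    return np.hstack([clf_v.predict_proba(Xv), clf_t.predict_proba(Xt)])

# The final predictor is trained on the fused probability vectors.
fusion = LogisticRegression().fit(
    fused_features(X_visual[:150], X_textual[:150]), y[:150])
acc = fusion.score(fused_features(X_visual[150:], X_textual[150:]), y[150:])
print(round(acc, 2))
```

This construction makes the failure mode visible: when both base classifiers emit probabilities close to 0.5, the fused feature vectors carry little information and the final predictor can do worse than the better single modality.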

5 http://utrecht.buurtmonitor.nl/

Figure 4. Our predictors are able to rather accurately predict the expected labels for 97 different neighbourhoods in Amsterdam. The top row illustrates the actual situation; the bottom row shows our predictions. Each column represents one of the city attributes. White areas either lack a ground truth value or sufficient multimedia data, and therefore also a prediction.


Figure 5 shows that the neighbourhood prediction maps made for Utrecht and Groningen are accurate when determining low crime rates in neighbourhoods (green areas). However, for both cities, the areas where normalized crime rates are high (red areas) are often predicted incorrectly. When we look at the recall scores, Groningen scores 40%, and Utrecht has a recall of just 7.1%. This means that the classifier in Utrecht only managed to label 7.1% of the actually positive neighbourhoods as positive, which boils down to a single neighbourhood. Looking at the Utrecht maps in Figure 5, we can indeed see that just one area (located in the upper-middle part) is marked red in both the ground truth map and the prediction map.

6. APPLICATIONS

In this section we demonstrate how our model can be used for various applications.

6.1 Neighbourhood selection based on personal preference weighting

When moving to a new city, people have different preferences about the neighbourhood to settle in. Priorities are personal; some might prefer an area with a higher average income while others may desire more recreational areas nearby. We have built an application that allows people to assign weights to our city-attributes in order to construct their personal quality of life heatmap, based on the predictions from the classifier. For each city-attribute the application asks the user what weight they would like to assign to it.

As explained earlier, the model predicts one of two labels. For this application these labels are 1 and -1. The label is then multiplied by the weight. For example, the application could ask: "Are low housing prices important?", with a weight range from 0 to 10. It does this for all attributes to build the weights list. Figure 6 shows two example maps

for different parameter settings. The potential of this application is that it could be used to select a potential settling neighbourhood for newcomers in cities which lack official liveability statistics. A model can be trained in Amsterdam (or any other city) and employed on Flickr data from other cities. More city-attributes can be included in the calculation to give more accurate advice, which is discussed in Section 7.
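The weighting scheme can be sketched as follows (attribute names, labels and weights are invented for illustration):

```python
def neighbourhood_score(labels, weights):
    """Personal liveability score for one neighbourhood: each
    attribute's predicted label (+1 or -1) is multiplied by the
    user-chosen weight (0-10) and the products are summed."""
    return sum(labels[a] * weights.get(a, 0) for a in labels)

labels = {"low_crime": 1, "low_housing_prices": -1, "recreation": 1}
weights = {"low_crime": 8, "low_housing_prices": 5, "recreation": 3}
print(neighbourhood_score(labels, weights))  # 8 - 5 + 3 = 6
```

Computing this score for every neighbourhood and colouring the map by the result yields a heatmap like the ones in Figure 6.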

6.2 Attribute value prediction

Because this research focuses on whether cities experience high or low liveability, it has produced binary predictions. However, in this section we would like to demonstrate the performance of the model when applied in a multi-label situation. This application could be used to predict actual values of city-attributes, making it a different tool.

Instead of dividing the dataset into a positive and a negative set, we assign the neighbourhood's original ground truth value as the label to all items in the training set that lie within said neighbourhood. For this experiment we use the ground truth data from the housing prices city-attribute to train the classifier and make predictions. Any other city-attribute could be used; this choice is simply made to demonstrate that attribute value prediction is possible.
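This relabelling step can be sketched as follows (the neighbourhood names, ground truth values and item features are invented for illustration):

```python
def assign_value_labels(items, ground_truth):
    """Replace binary labels by the neighbourhood's actual attribute
    value (e.g. average housing price), turning the binary task into a
    97-way classification problem (one class per neighbourhood value)."""
    return [(features, ground_truth[nbhd]) for features, nbhd in items]

gt = {"Jordaan": 5200, "Bijlmer": 1800}   # illustrative EUR/m2 values
items = [([0.1, 0.9], "Jordaan"), ([0.7, 0.2], "Bijlmer")]
print(assign_value_labels(items, gt))
# [([0.1, 0.9], 5200), ([0.7, 0.2], 1800)]
```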

As before, the classifier is trained on 10,000 random samples from the Amsterdam dataset, and used to predict labels for the remaining items. We achieve an overall accuracy of 45.6%. Note that in this experiment 50% is not the chance baseline, considering there are 97 labels (a ground truth value for each of the 97 neighbourhoods in Amsterdam). The textual classifier, which achieved 54.8%, outperformed the visual classifier's 20.3% accuracy. After fusion the accuracy was 10% lower than for the initial textual classification. This is understandable, as the visual classifier performed much worse than the textual classifier, bringing the probability scores down in the fusion classifier.

Figure 6. Two example heatmaps with different weights assigned to the attributes. The bottom left corner shows the selected weights (out of 10). Blue areas are excluded from the application due to insufficient predictions.

Figure 5. Predictions of crime rate labels for Utrecht and Groningen, by a model trained on data from Amsterdam. White areas either lack sufficient multimedia data or ground truth values.

While not designed to do so, the model can be used for (more or less) continuous classification, with decent overall accuracy. For example, it could be used to estimate housing prices (or housing price ranges) in new cities.

7. DISCUSSION

In this research we have used social multimedia to predict quality of life in cities and neighbourhoods. Although cues for the factors involved are only implicitly encoded in visual appearance and human perceptions, they can be extracted using machine learning. With an average prediction accuracy of 84.3% for the six city-attributes in Amsterdam and an average of 62% accuracy for the two cross-city predictions, the classifiers performed effectively. We argue that the classifiers for city-attributes with lower scores had more trouble finding and classifying distinctive descriptors, as these attributes have a weaker correlation with the visual and textual appearance. The textual classifier produces better results than the visual classifier; only for crime rates in Groningen did the visual classifier outperform the textual predictions. Additionally, in several cases the prediction accuracy after fusion was lower than the separate classifiers' predictions. Fusion suffers when the modalities do not agree on the label predictions or when the doubt for some individual items is so high that it brings down the merged confidence.

Often this is not the case, and the model benefits from the fusion. The gains from adding the visual modality can be small, adding two or three percent on top of the text predictions. However, the visual classifier should be considered an essential part of the model, as its performance is more consistent than that of the text classifier, which becomes evident when we apply the model to different cities. Text descriptors are more location dependent in our model, whereas visual descriptors perform more or less consistently for cross-city predictions.

However, the classifier did have trouble with some aspects of cross-city classification. Accuracy was high for both Groningen and Utrecht, but recall was low (40% and 7.1% respectively). It sometimes failed to correctly predict the labels for the areas that matter most: the few with high crime rates. We argue that these two cities have very different cues than Amsterdam that are discriminative for recognizing crime rates.

A large portion of the data was not used in training, just as would be the case when applying the model in a real-world scenario. Therefore, we argue that the classifiers are not overfitted on the training data: they perform only about 10% better on training data than on unseen data. This raises confidence in the legitimacy of our results and the conclusions we draw from them.

7.1 Limitations and challenges

Several aspects have influenced the results in this thesis. As already mentioned, the neighbourhoods on the outskirts of the cities have significantly less geotagged material. This makes the neighbourhood prediction probabilities less confident. Despite this limitation, the classifier seemed to perform well, even for the neighbourhoods on the outskirts of Amsterdam.

Only six city-attributes are used to estimate liveability, while official committees base their estimations on more factors (e.g. in the Amsterdam survey 24 are used). In this research we use data on the neighbourhood scale, and unfortunately not all sources that distribute statistics on the involved factors provide their data in the same format. It was not always possible to transform the data into a comparable format. Nevertheless, we have succeeded in our intention to demonstrate that liveability estimations can also be achieved through machine learning.

The dataset containing acres of recreational sites within Amsterdam was missing ground truth values for 14 of the 97 neighbourhoods. Coincidentally, these areas accounted for about half of the geotagged images in the dataset, reducing its size to just 20,566 items for the recreational sites attribute.

8. CONCLUSION

This thesis has shown that social multimedia data, obtained from the image-sharing website Flickr, can be used to make estimations on the quality of life in cities. We have extracted unique descriptors from images taken in Amsterdam, Groningen and Utrecht and their user-added annotations, to find discriminative aspects that reveal something about the city-attributes influencing liveability. We can conclude that the visual appearance and human perceptions of a city do contain information about the implicit qualities of its neighbourhoods. Some attributes are easier to extract (e.g. in 9 out of 10 cases we are able to correctly predict low or high housing prices), and some are harder to extract from the data.

Applying the model to make predictions for different cities drops the accuracy to some extent. Every city has its own distinctive characteristics, making it challenging for the classifier to find similar distinctive cues that depict the presence of city-attributes. Despite these difficulties, the classifier achieves promising results, making it suitable for cross-city application.


9. ACKNOWLEDGMENTS

I would like to thank Dr Stevan Rudinac from the University of Amsterdam, for his supervision, guidance and comments on this thesis and the process.

10. REFERENCES

[1] European Commission, "Quality of life in cities," 2013. [Online]. Available: http://ec.europa.eu/regional_policy/sources/docgener/studies/pdf/urban/survey2013_en.pdf

[2] R. Weber, J. Schnier, and T. Jacobsen, "Aesthetics of streetscapes: influence of fundamental properties on aesthetic judgments of urban space," Percept. Mot. Skills, vol. 106, no. 1, pp. 128–146, Feb. 2008.

[3] J. Q. Wilson and G. L. Kelling, "Broken Windows," Atl. Mon., vol. 249, no. 3, pp. 29–38, 1982.

[4] C. Latkin and A. Curry, "Stressful neighborhoods and depression: a prospective study of the impact of neighborhood disorder," J. Health Soc. Behav., vol. 44, no. 1, pp. 34–44, 2003.

[5] R. B. Cialdini, R. R. Reno, and C. A. Kallgren, "A focus theory of normative conduct: Recycling the concept of norms to reduce littering in public places," J. Pers. Soc. Psychol., vol. 58, no. 6, pp. 1015–1026, 1990.

[6] X. Wang, M. Gerber, and D. Brown, "Automatic Crime Prediction Using Events Extracted from Twitter Posts," in Social Computing, Behavioral-Cultural Modeling and Prediction, vol. 7227, S. Yang, A. Greenberg, and M. Endsley, Eds. Springer Berlin Heidelberg, 2012, pp. 231–238.

[7] S. Kay, B. Zhao, and D. Sui, "Can Social Media Clear the Air? A Case Study of the Air Pollution Problem in Chinese Cities," Prof. Geogr., pp. 1–13, 2014.

[8] Imercer, "Quality Of Living City Rankings," 2015. [Online]. Available: https://www.imercer.com/content/quality-of-living.aspx

[9] A. Khosla, B. An, J. J. Lim, and A. Torralba, "Looking Beyond the Visible Scene," in CVPR, 2014.

[10] S. M. Arietta, A. A. Efros, R. Ramamoorthi, and M. Agrawala, "City Forensics: Using Visual Elements to Predict Non-Visual City Attributes," IEEE Trans. Vis. Comput. Graph., vol. 20, no. 12, pp. 2624–2633, Dec. 2014.

[11] D. Quercia, N. K. O'Hare, and H. Cramer, "Aesthetic Capital: What Makes London Look Beautiful, Quiet, and Happy?," in Computer Supported Cooperative Work & Social Computing, 2014, pp. 945–955.

[12] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, "What makes Paris look like Paris?," ACM Trans. Graph., vol. 31, no. 4, pp. 1–9, 2012.

[13] B. Zhou, L. Liu, A. Oliva, and A. Torralba, "Recognizing City Identity via Attribute Analysis of Geo-tagged Images," in ECCV, Springer, 2014, pp. 519–534.

[14] J. Knopp, J. Sivic, and T. Pajdla, "Avoiding confusing features in place recognition," in Computer Vision, Springer, 2010, pp. 748–761.

[15] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in CVPR'07, 2007, pp. 1–7.

[16] S. Asur and B. A. Huberman, "Predicting the Future with Social Media," WI-IAT '10, vol. 1, pp. 492–499, 2010.

[17] M. S. Gerber, "Predicting crime using Twitter and kernel density estimation," Decis. Support Syst., vol. 61, pp. 115–125, May 2014.

[18] N. Malleson and M. A. Andresen, "The impact of using social media data in crime rate calculations: shifting hot spots and changing spatial patterns," Cartogr. Geogr. Inf. Sci., pp. 1–10, Apr. 2014.

[19] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in CVPR, 2010, pp. 902–909.

[20] C. Xu, K. Chiew, Y. Chen, and J. Liu, "Fusion of Text and Image Features: A New Approach to Image Spam Filtering," in Practical Applications of Intelligent Systems, vol. 124, Springer Berlin Heidelberg, 2012, pp. 129–140.

[21] VROM, "Leefbaarheid van wijken," 2004. [Online]. Available: http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2004/03/01/leefbaarheid-van-wijken/wonen4007.pdf

[22] Gemeente Amsterdam, "Amsterdam in cijfers 2014," 2014. [Online]. Available: http://www.ois.amsterdam.nl/publicaties/

[23] Flickr, "Flickr Creative Commons," 2014. [Online]. Available: https://www.flickr.com/creativecommons/

[24] Amsterdam Open Data, "Gebiedsindeling Amsterdam," 2013. [Online]. Available: http://data.amsterdamopendata.nl/dataset/gebiedsindeling_amsterdam

[25] Gemeente Utrecht, "Grenzen," 2015. [Online]. Available: https://opendata.utrecht.nl/dataset/grenzen

[26] Groningen Open Data, "Buurten Gemeente Groningen," 2014. [Online]. Available: http://data.groningen.nl/buurten-gemeente-groningen/

[27] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders, "Early versus Late Fusion in Semantic Video Analysis," in Proc. 13th Annu. ACM Int. Conf. Multimedia, 2005, pp. 399–402.

[28] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in CVPR, 2010, pp. 3304–3311.

[29] O. Chapelle, P. Haffner, and V. N. Vapnik, "Support vector machines for histogram-based image classification," Neural Networks, vol. 10, no. 5, pp. 1055–1064, 1999.

[30] C. D. Manning, P. Raghavan, and H. Schütze, "Scoring, term weighting, and the vector space model," in Introduction to Information Retrieval, 2009, pp. 118–132.

[31] T. Fawcett, "An introduction to ROC analysis," Pattern Recognit. Lett., vol. 27, pp. 861–874, 2006.
