Measuring social economic inequality:

Street view imagery and Deep learning

Franklin Willemen
10992693

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors
Dr. S. Ghebreab & T. Alpherts MSc
Civic Artificial Intelligence Lab (CAIL)
Informatics Institute
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

This thesis reproduces a method for measuring social-economic inequality. The method uses street view imagery and deep learning to classify a location's position on the spectrum of social-economic inequality. The thesis reflects on the viability of applying this method in a public context. Viability is determined by whether the method is reproducible, functional and interpretable. Mean absolute error (MAE), recall and a confusion matrix are used to measure whether the method is reproducible and functional. To determine whether the method is interpretable, correctly and incorrectly classified street view imagery is manually inspected and deep learning features are visualized. The results of this thesis suggest that the method is neither reproducible, functional nor interpretable. The MAE found in this thesis is significantly higher than that reported in the original work. The recall shows that the method's ability to correctly predict a location's social-economic status is only slightly better than random assignment. The confusion matrix shows that the method classifies locations within a fixed range of the deprived-privileged spectrum. Manually inspecting the street view imagery of correctly and incorrectly classified locations does not allow for interpretation, and neither do the visualized deep learning features. This thesis conducted a preliminary test of the viability of this method for measuring social inequality; the results suggest that the method is not viable for public use.

Keywords


Contents

1. Introduction
   1.1 Urban Inequality
   1.2 Inequality and Machine Learning
   1.3 Street View Imagery and Deep Learning
   1.4 Reproducibility
2. Background Information
   2.1 Measuring social economic inequalities
   2.2 Research questions
3. Method
   3.1 Socioeconomic data Amsterdam
   3.2 Street view panoramas
   3.3 Feature extraction and Interpretability
   3.4 Network Weights
   3.5 Ordinal and Softmax
   3.6 MAE, Confusion Matrix and Recall
4. Results
   4.1 MAE
   4.2 Confusion Matrix
   4.3 Recall
   4.4 Interpretability
5. Discussion
6. Conclusion
References


List of Tables

4.1 Average MAE scores


List of Figures

2.1 Overview of the method.
2.2 VGG16 architecture.
2.3 ImageNet: example root-to-leaf branch, mammal subtree.
2.4 Cross-entropy.
3.1 Pearson correlation matrix of the selected social economic attributes, before and after converting the data to ordinal categories (left: before; right: after).
3.2 MIH attribute preprocessing to decile.
3.3 Street View Panorama Amsterdam.
3.4 Straightened Street View Panorama.
4.1 Baseline Ordinal classifier MAE scores.
4.2 Pretrained Ordinal classifier MAE scores.
4.3 Baseline Softmax MAE.
4.4 Pretrained Softmax MAE.
4.5 Confusion matrix MIH Pretrained Ordinal and Softmax.
4.6 Confusion matrix ARV Pretrained Ordinal and Softmax.
4.7 Confusion matrix NWIRP Pretrained Ordinal and Softmax.
4.8 Confusion matrix SSC Pretrained Ordinal and Softmax.
4.9 Attributes recall per decile for Ordinal and Softmax classifiers.
4.10 ARV decile 1, correct and incorrect classifications.
4.11 ARV decile 10, correct and incorrect predicted locations.
4.12 Pretrained VGG16 hidden layer features.
4.13 Street view imagery and their fc6 feature counterparts.


1. Introduction

1.1 Urban Inequality

In Europe, 75% of the population lives in urban areas, which means Europe has 552 million urbanites. Urban populations typically have higher average economic status, education and better average health than their rural counterparts [13]. However, there are significant inequalities in income, education, living environment, health and security among the urbanites themselves [7]. These statistics place cities at the centre of efforts to better understand and combat social-economic inequality.

To inform and assess policies that aim to reduce social-economic inequality, measurements at high spatial and temporal resolution are crucial. Various city characteristics can be calculated, such as crime rates and housing prices. By evaluating these characteristics, policymakers can establish successful strategies to address these increasing social-economic disparities [11]. However, most local governments do not have these metrics or the means to collect high-resolution data: collecting data at high spatial and temporal resolution can be a complicated and expensive endeavour [9][12].

1.2 Inequality and Machine Learning

Today, large amounts of data are produced daily from heterogeneous sources at an unprecedented pace. Because useful insights can be extracted from large quantities of data, many academic fields have started using machine learning to explore and extract value from their data. This trend is also evolving in social economics, with a focus on measuring inequality. Blumenstock [3] demonstrates that it is possible to use mobile phone metadata to infer socioeconomic status. In turn, when taken for millions of people, these inferred attributes enable an accurate recreation of the distribution of resources of an entire nation. Moreover, Dong et al. [5] have shown that it is possible to estimate a handful of socioeconomic attributes using only a single location attribute, such as restaurant data.

1.3 Street View Imagery and Deep Learning

The visual appearance of the urban landscape also holds the potential to measure city attributes. If city characteristics such as crime rates and housing prices could be related to visual appearance, it would be possible to measure them for all cities worldwide, regardless of whether prior measurements are available. Image data is widely available and frequently collected, allowing for more local and regular measurements at a much lower cost than traditional methods [12]. Arietta et al. [1] concentrated on visual attributes that discriminate for social and economic attributes derived from street imagery. For example, their method shows that for San Francisco, visual features such as fire escapes on the fronts of buildings, ramshackle convenience store signs, and a particular roofing style predict violent crime rates with 73% accuracy. Gebru et al. [6] show that there is a strong correlation between vehicle distribution and various socioeconomic factors. They utilize Google Street View (GSV) imagery and object characteristics present in those images to infer social-economic statistics.

Suel et al. [9] concentrated on measuring a more comprehensive range of attributes that define inequalities. They predicted the spatial distribution of income, education, unemployment, housing, living environment, health and crime outcomes using GSV imagery.

1.4 Reproducibility

This thesis tested Suel et al.'s [9] method for measuring urban inequality in a different city, Amsterdam. The focus of this thesis is on the reproducibility of this method for civic use. For this method to be applied successfully in a civic context, it needs to be transparent and relatively easy to implement².

This paper is organized as follows. Section 2 gives more information on the method used by Suel et al. and the research questions tackled in this thesis. Section 3 describes the method, the different experiment setups and the architecture of the neural network. Section 4 presents the results of the different setups. The discussion and conclusion can be found in sections 5 and 6, respectively.

² https://nos.nl/artikel/2366864-fraude-opsporen-of-gevaar-van-discriminatie-gemeenten-gebruiken-slimme-algoritmes.html


2. Background Information

2.1 Measuring social economic inequalities

Suel et al. used GSV imagery and deep learning to measure social-economic inequality in London. They took various aspects of human wellbeing into account, including income, health, education, employment, crime, housing and living environment. Local government statistics, or city attributes, were collected at the scale of areas of 1,614 inhabitants on average, known as Lower Layer Super Output Areas (LSOA). For each of these city attributes, deciles were calculated, with 1 to 10 running from most deprived to most privileged.

Figure 2.1: Overview of the method.

Their network used all four images from each location jointly in four channels, as shown in figure 2.1. The outputs were aggregated and passed through a final layer that applied the sigmoid function, resulting in a value P between 0 and 1.

Given the ordinal nature of the city attributes, an ordinal classification method is applied, as proposed by Beckham and Pal [2]. Specifically, the value P is interpreted as the probability with which Bernoulli trials are performed, i.e. tosses of a coin. For the ordinal decile classes, ten coin tosses are hypothetically performed, and the probabilities of obtaining between 1 and 10 heads are calculated. In other words, given a sample n, the probability of it belonging to the kth decile is derived by calculating the probability of exactly k successes (heads), with the probability of success being P.
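As a minimal sketch of this decision rule (scipy's binomial distribution is used here purely for illustration; any binomial PMF would do):

```python
import numpy as np
from scipy.stats import binom

def ordinal_decile_probs(p: float, n_deciles: int = 10) -> np.ndarray:
    """Map the network output p to one probability per decile: decile k
    gets the binomial probability of exactly k heads in 10 coin tosses,
    each with success probability p, as described above."""
    ks = np.arange(1, n_deciles + 1)
    return binom.pmf(ks, n=n_deciles, p=p)

# Example: P = 0.62 is assigned the decile with the highest probability.
probs = ordinal_decile_probs(0.62)
print(probs.round(3), "-> decile", int(probs.argmax()) + 1)
```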

Their method eliminates the need to manually annotate the image data with predefined features (e.g. roof types, trees, cars). To achieve this, they used a technique called transfer learning. Transfer learning [10] is an essential tool in machine learning for dealing with insufficient training data. The target domain model does not need to be trained from scratch, which can significantly reduce the various complexities associated with large-scale machine learning. Transferring knowledge from a source domain to the target domain in this way is also referred to as feature extraction.


The features are extracted from the fc6 layer of a pre-trained VGG16 network. The fc6 layer is the first layer of the fully connected section of the VGG16 architecture, see figure 2.2. The VGG16 architecture serves as the backbone architecture for this method.

VGG16 is a convolutional neural network (CNN) architecture proposed by K. Simonyan and A. Zisserman [8]. Its main strength was its increased depth compared to the alternatives when it was first proposed. The model achieved state-of-the-art accuracy in the ImageNet Large-Scale Visual Recognition Challenge of 2014 (ILSVRC14).

Figure 2.2: VGG16 architecture.

The backbone architecture is trained on more than 1.3 million images from ImageNet [4] and can distinguish between a thousand different objects. The ImageNet dataset is a large-scale ontology of images based on the WordNet hierarchical structure. For each meaningful concept in WordNet, also called a synset, ImageNet aims to attach 500-1000 images. Figure 2.3 shows an example of an ImageNet root-to-leaf branch.

Figure 2.3: ImageNet: example root-to-leaf branch, mammal subtree.

At the time of writing, the ImageNet dataset consists of around fourteen million human-annotated images covering a thousand object classes, ranging from animals to musical instruments. The network was trained by optimizing the cross-entropy cost function (figure 2.4), where w are the network weights, y_n is a label vector, and p_nm the probability of the nth sample belonging to the mth decile.

Figure 2.4: Cross-entropy.
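Written out, and assuming the standard multi-class form over N samples and 10 deciles (a reconstruction from the variable definitions above):

$$\mathcal{L}(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{m=1}^{10} y_{nm}\, \log p_{nm}(\mathbf{w})$$

where y_nm = 1 if sample n belongs to decile m and 0 otherwise.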

The truth labels correspond to the deciles associated with the LSOA to which the sample belongs. By training on 80% of the data (i.e. image and true label pairs) and testing on the remainder, Suel et al. measured how well the network uses images to predict which decile a location belongs to. They repeated this five times, each time sampling a different test set (i.e. 20% of the data). The mean value computed for each LSOA was converted to a decile and compared to the actual label. For evaluation, Suel et al. used Pearson's correlation coefficient, Kendall's tau coefficient, Cohen's kappa and mean absolute error (MAE). Suel et al. also evaluated how well their model predicted the deciles using street view imagery from other cities in the UK. For this, they applied the London-pretrained network directly to the alternative city's images, fine-tuned the network weights using subsets of the target city's data, and finally trained the network from scratch using only data from the target city.
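That aggregation step can be sketched as follows; the dataframe layout and column names are hypothetical, and pd.qcut is one plausible reading of the "converted to a decile" step:

```python
import pandas as pd

def area_deciles(predictions: pd.DataFrame) -> pd.Series:
    """Aggregate per-image predictions to one decile per area.

    `predictions` is a hypothetical dataframe with one row per image and
    columns 'lsoa' (area id) and 'pred' (the network's continuous output).
    """
    means = predictions.groupby("lsoa")["pred"].mean()
    # Rank first so ties cannot produce unequal decile bins.
    return pd.qcut(means.rank(method="first"), q=10, labels=range(1, 11))
```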

2.2 Research questions

Suel et al. have shown that applying deep learning to street imagery allows for the measurement of London's city attributes. Moreover, they found that their method performs better for some attributes, such as income and living deprivation. Suel et al. also demonstrated that training in one city can be transferred to predictions in other cities in the same country. This transferability suggests that, if replicated in additional cities in other countries, the method might reveal universal visual features of deprived and privileged locations.

This thesis reproduces the method and applies it to the capital of the Netherlands, Amsterdam. It also makes a small modification to the method and notes its effect. Because the method promises to assist municipalities with data-driven civic surveillance, this thesis primarily reflects on its viability to do so. The method is viable for public use if it is reproducible, functional and interpretable. Thus the main research question is: is this method viable for public use? To answer this, the following subquestions are posed:

1. Is the method reproducible?
2. Is the method functional?
3. Is the method interpretable?


3. Method

3.1 Socioeconomic data Amsterdam

Amsterdam's socioeconomic data is made publicly available by the national statistical office of the Netherlands (CBS)¹. This thesis used Amsterdam statistics measured in the year 2016. The CBS measurements are available for each postcode in the Netherlands, 460,478 postcodes in total, of which 17,528 are in Amsterdam. The CBS data consists of 130 attributes, including statistics on gender, age, ethnicity, housing, median income, social welfare and distance to various services such as supermarkets. Of all the attributes, only a small portion was selected for further processing. A short description is provided for each of the selected attributes:

• Median income per household (MIH) values are defined as eleven categories ranging from the lowest to the highest measurements.

• Social security counts (SSC) describes the number of people over the age of 65 who are receiving social security benefits; a person receiving more than one kind of benefit is still counted once.

• The average market value of real estate (ARV) gives insight into the average value of the properties used for residential purposes at a given postcode.

• Non-western immigrant residents percentage (NWIRP) is the percentage of people living at a given postcode who have a non-western immigrant background.

Inspecting the Pearson correlation between the selected attributes, figure 3.1 shows a positive correlation between ARV and MIH and also between NWIRP and SSC, while the correlations between the remaining attribute pairs are negative. As this thesis is not focused on the social sciences, these correlations will not be examined in depth. When it is possible to connect measurements to individuals, the attribute values are obfuscated or left out entirely, hence not all data points could be used. This is the primary reason the selected attributes have different numbers of data points; only 6,859 postcodes had measurements for all four attributes. The measurement counts are 15,293, 7,562, 16,089 and 11,392 for MIH, SSC, ARV and NWIRP, respectively. For this reason, the attributes were processed separately and not as a set.

Besides removing obfuscated values, other preprocessing steps include converting the ordinal categorical values to numerical ones and dividing the data into ten equal parts, so that each part represents one-tenth of the measurements.
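A minimal sketch of this preprocessing, assuming a raw CBS dataframe; the sentinel value for privacy-suppressed cells is an assumption for illustration, not a documented constant:

```python
import pandas as pd

SENTINEL = -99997  # assumed marker for obfuscated/privacy-suppressed cells

def to_deciles(df: pd.DataFrame, attribute: str) -> pd.Series:
    """Drop obfuscated measurements and bin one attribute into deciles
    labelled 1 (most deprived) to 10 (most privileged), each holding
    one-tenth of the remaining postcodes."""
    values = df.loc[df[attribute] != SENTINEL, attribute].dropna()
    # Rank first so ties cannot produce unequal decile bins.
    return pd.qcut(values.rank(method="first"), q=10, labels=range(1, 11))
```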

¹ https://www.cbs.nl/nl-nl/dossier/nederland-regionaal/geografische-data/gegevens-per-postcode


Figure 3.1: Pearson correlation matrix of the selected social economic attributes, before and after converting the data to ordinal categories (left: before; right: after).

Figure 3.2: MIH attribute preprocessing to decile.

Extra care was taken to preserve the ordinal information of the attributes. The attribute values had to be sorted, ranked, and converted back to a numerical data type. Figure 3.1 (right) shows that the correlations between attributes are mostly preserved; some correlations have increased or diminished, but the overall relations between the attributes remain the same.

To correlate these attributes with street images, the attributes were paired with their respective image IDs. The process of gathering street view images is discussed in the next section; due to this pairing, the total number of data points decreased. The remaining measurement counts for MIH, ARV, SSC and NWIRP are 9,408, 9,931, 4,703 and 7,056, respectively.

Finally, the newly paired data was split into a train and test set: a 0.7/0.3 split for MIH, ARV and NWIRP, and a 0.8/0.2 split for SSC to accommodate the difference in measurement counts.


3.2 Street view panoramas

The street view panoramas used in these experiments were collected by the local government and were accessed through their API; the panoramas are freely available². Pulling panoramas from this API requires latitude and longitude coordinates, so the Google Geocoding API was used to convert postcodes to their respective coordinates. As a side note, despite there being open-source options for geocoding, this thesis found that those options were not adequate.

These coordinates allow for retrieving the panoramas' image IDs; this was also done through an API maintained by the municipality. The API requires a radius to be defined, such that all panorama URLs and IDs within that range are returned. The panoramas were collected within a radius of 25 metres, and only the closest panorama within this radius was used for further processing. If there was none within the 25-metre radius, the postcode was dropped.
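A sketch of this retrieval chain. The Google Geocoding API call follows its documented interface; the query parameters of the municipal panorama endpoint ("near", "radius") and the assumption that its results come back sorted by distance are guesses, not documented fact:

```python
from typing import Optional, Tuple
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
PANORAMA_URL = "https://panorama.data.amsterdam.nl/panorama/"

def postcode_to_coords(postcode: str, api_key: str) -> Tuple[float, float]:
    """Resolve a postcode to (lat, lon) with the Google Geocoding API."""
    resp = requests.get(
        GEOCODE_URL,
        params={"address": f"{postcode}, Amsterdam, Netherlands", "key": api_key},
    )
    location = resp.json()["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

def nearest_panorama(lat: float, lon: float, radius: int = 25) -> Optional[dict]:
    """Return the closest panorama record within `radius` metres, or None.

    Parameter names are hypothetical; consult the API documentation for
    the actual interface.
    """
    resp = requests.get(
        PANORAMA_URL, params={"near": f"{lon},{lat}", "radius": radius}
    )
    results = resp.json().get("results", [])
    return results[0] if results else None  # assumes distance-sorted results
```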

Figure 3.3: Street View Panorama Amsterdam.

The street view panoramas of Amsterdam were retrieved using the collected URLs. These equirectangular panoramas were re-projected and straightened with the use of an open-source Python script³. The straightened panorama was cut into five equally sized images, with part of the sky and the car on which the camera sits removed. The result for figure 3.3 can be viewed in figure 3.4.

Figure 3.4: Straightened Street View Panorama.
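The slicing step can be sketched with Pillow; the vertical trim fractions used to remove the sky and the camera car are illustrative guesses, not the thesis's exact values:

```python
from typing import List
from PIL import Image

def slice_panorama(path: str, n_slices: int = 5) -> List[Image.Image]:
    """Cut a straightened panorama into n_slices equally wide images,
    trimming the sky (top) and the camera car (bottom)."""
    pano = Image.open(path)
    width, height = pano.size
    top, bottom = int(0.25 * height), int(0.80 * height)  # illustrative bounds
    slice_width = width // n_slices
    return [
        pano.crop((i * slice_width, top, (i + 1) * slice_width, bottom))
        for i in range(n_slices)
    ]
```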

² https://panorama.data.amsterdam.nl/panorama/
³ https://github.com/bhautikj/vrProjector


Suel et al. used only four images per location. To test whether the method is reproducible, this thesis also used four images per location; the fifth image was dropped for all locations.

3.3 Feature extraction and Interpretability

The features are extracted from the fc6 layer of a pre-trained VGG16 network. As depicted in figure 2.1, four images are passed through the network, and for each image a 4096-dimensional vector is extracted. These vectors contain the learned patterns that enabled the model to achieve state-of-the-art accuracy in distinguishing between one thousand objects.
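A sketch of the fc6 extraction with torchvision's pretrained VGG16; in torchvision's layout, fc6 corresponds to the first linear layer of the classifier block:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained VGG16 (the `weights` argument follows recent torchvision releases).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def fc6_features(image: Image.Image) -> torch.Tensor:
    """Return the 4096-dimensional fc6 vector for one street view image."""
    x = preprocess(image).unsqueeze(0)      # (1, 3, 224, 224)
    x = vgg.features(x)                     # convolutional backbone
    x = torch.flatten(vgg.avgpool(x), 1)    # (1, 25088)
    return vgg.classifier[0](x).squeeze(0)  # fc6: first linear layer
```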

Interpretability is the degree to which a human can understand the cause of a decision; in other words, it is the degree to which a human can consistently predict the model's result. One way interpretability is measured in this thesis is by manually comparing street view imagery of the predictions made. More specifically, correct and incorrect predictions by the Softmax classifier for the ARV attribute are compared. In addition to comparing street view imagery, the extracted features are visualized. These visualizations are generated through a feature visualization technique called feature inversion; the PyTorch library Lucent⁴ was used for this.

⁴ https://github.com/greentfrapp/lucent
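A sketch of how such visualizations can be produced with Lucent; the layer name and channel index are arbitrary examples, and whether this exactly matches the thesis's feature-inversion setup is an assumption:

```python
import torchvision
from lucent.optvis import render
from lucent.modelzoo.util import get_model_layers

model = torchvision.models.vgg16(pretrained=True).eval()

# Lucent flattens module names (e.g. "features_0" ... "features_30");
# list them to pick a layer to visualize.
print(get_model_layers(model))

# Optimize an input image that maximizes one channel of a chosen layer;
# "features_28:42" (layer:channel) is an arbitrary example pair.
images = render.render_vis(model, "features_28:42", show_image=False)
```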

3.4 Network Weights

The fully connected network, figure 2.1, is the only part that has learnable parameters: weights and biases. Each layer of the network consists of neurons, or nodes, each holding weight values. As input enters a neuron, it is multiplied by those values; the weight parameters transform input data within the network's hidden layers. The fine-tuning of these weights is what is referred to as training the neural network.

This thesis sets out to test the performance of two setups that are differentiated solely by their weight initialization. A neural network is initialized with randomized weights by default. This default initialization functions as the first setup, a baseline for comparison purposes. In the second setup, the network is trained on Amsterdam data and its performance is tested on data not included in the training set.


3.5 Ordinal and Softmax

Given the ordinal nature of inequality, nominal classification would miss the inherent information that comes with the deciles' ordering. Like Suel et al., as described in section 2.1, this thesis uses the machine learning paradigm proposed by Beckham and Pal [2], which is intended for multi-class classification problems where the classes are ordered.

The feedforward neural network shown in figure 2.1 is fed four fc6 feature vectors for each location. After three fully connected layers, the results are aggregated, and in the final layer a single continuous value P between 0 and 1 is computed by applying the sigmoid function. The value P is interpreted as the probability with which Bernoulli trials are performed. Ten coin tosses are considered, and the corresponding probabilities of obtaining H heads are computed for H = 1, ..., 10. The highest of these probabilities indicates the decile to which the location belongs.
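A sketch of this head in PyTorch. The thesis specifies the 4096-dimensional input, the three fully connected layers, the aggregation and the sigmoid, but not the hidden widths or the aggregation function, which are assumed here:

```python
import torch
from torch import nn

class OrdinalHead(nn.Module):
    """Three fully connected layers per fc6 vector, aggregation over the
    four views, and a sigmoid producing a single value P in (0, 1).
    Hidden widths (1024, 256) and mean-aggregation are assumptions."""

    def __init__(self, in_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, 4, 4096), one fc6 vector per image of a location
        scores = self.mlp(views)          # (batch, 4, 1)
        aggregated = scores.mean(dim=1)   # aggregate the four views
        return torch.sigmoid(aggregated)  # (batch, 1): the value P

# Example: P = OrdinalHead()(torch.randn(8, 4, 4096))  -> shape (8, 1)
```

The resulting P then feeds the binomial decision rule sketched in section 2.1.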

This thesis also tests how the Softmax function performs in comparison to the ordinal classifier on this task; for practical reasons, the Softmax⁵ was easier to use. The Softmax activation function transforms the previous layer's output into a vector of ten probabilities: it normalizes its input into a probability distribution that sums to 1. The maximum value in this vector decides which decile the location is classified as.

3.6 MAE, Confusion Matrix and Recall

The mean absolute error (MAE), recall and a confusion matrix are used to measure whether the method is reproducible and functional. As the name suggests, the MAE represents the average of the absolute differences between paired observations⁶. Recall⁷ is the fraction of relevant classifications that were predicted; it is calculated by dividing the number of correctly predicted instances of a given class by all instances of that class. A confusion matrix⁸ visualizes the performance of a model: it shows whether, and by how much, the model confuses classes.
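All three measures are available in scikit-learn; a minimal sketch with toy decile arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_absolute_error, recall_score

y_true = np.array([1, 4, 6, 6, 9, 10])  # toy true deciles
y_pred = np.array([2, 6, 6, 5, 6, 6])   # toy predicted deciles

# Average absolute difference between paired observations.
mae = mean_absolute_error(y_true, y_pred)

# Per-decile recall: correct predictions of a decile / all its instances.
per_decile_recall = recall_score(
    y_true, y_pred, labels=list(range(1, 11)), average=None, zero_division=0
)

# Rows: true deciles; columns: predicted deciles.
cm = confusion_matrix(y_true, y_pred, labels=list(range(1, 11)))
```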

⁵ https://en.wikipedia.org/wiki/Softmax_function
⁶ https://en.wikipedia.org/wiki/Mean_absolute_error
⁷ https://en.wikipedia.org/wiki/Precision_and_recall
⁸ https://en.wikipedia.org/wiki/Confusion_matrix


4. Results

4.1 MAE

The network tries to estimate values for each of the socioeconomic attributes based on the four input images. MAE is used to measure whether the network makes good estimations. One way in which this thesis's results are compared with those of Suel et al. is through the MAE metric; this makes it possible to determine whether the method is reproducible. MAE measures the average of the absolute errors between the predicted and actual values; the lower its value, the better.

               Ordinal                  Softmax
Attributes   Baseline  Pretrained   Baseline  Pretrained
ARV            3.08      2.28         3.45      2.21
MIH            3.09      2.38         3.23      2.65
NWIRP          2.94      2.34         3.19      2.70
SSC            2.96      2.49         3.32      3.10

Table 4.1: Average MAE scores.

Table 4.1 shows the average MAE score for each attribute. An MAE score is calculated per batch; a batch consists of 20 postcodes, hence for a thousand samples there are 50 MAE scores. The average of all these MAE scores is presented in table 4.1.

The lowest average MAE was achieved by the pre-trained Softmax setup on the ARV attribute; the worst was the baseline Softmax setup, also on the ARV attribute. Overall, the Softmax classifier on the ARV attribute shows the largest performance increase when switching to the pre-trained setup.

Figure 4.1 shows the baseline ordinal setup's MAE scores for all attributes. There does not appear to be any significant improvement w.r.t. the MAE: all attributes stagnate within an MAE range of 2.9 to 3.2 after 500 samples.

Figure 4.2 shows that the pretrained ordinal setup achieves a lower MAE score for all attributes overall, meaning that in general a pre-trained network results in lower MAE scores. Comparing attributes, ARV seems to perform best; however, compared to its baseline it shows no significant improvement beyond the initial drop. As with the baseline, all attributes seem to stagnate within the same range.

Figure 4.3 shows the MAE scores for the baseline Softmax setup. Comparing this setup with the baseline ordinal setup shows that the overall MAE is higher. The ARV attribute has a significantly higher MAE score overall than the other attributes. Although ARV deviates from the rest, the overall trend still suggests that no significant improvement is achieved.


Figure 4.1: Baseline Ordinal classifier MAE scores.

Figure 4.2: Pretrained Ordinal classifier MAE scores.

Figure 4.3: Baseline Softmax MAE.

Figure 4.4 shows the MAE scores for the pre-trained Softmax setup. Comparing these results with those of the pre-trained ordinal setup shows that the MAE scores for the SSC and ARV attributes differ significantly from each other and from the MIH and NWIRP attributes. The SSC attribute consistently scored a higher MAE, while the ARV attribute scored lower. The ARV attribute has the lowest MAE score overall and shows the largest performance gain compared to its baseline scores.

Despite the differences between the setups, no significant improvement was realized.

Figure 4.4: Pretrained Softmax MAE.

4.2 Confusion Matrix

The confusion matrix shows how much the model confuses classes and gives insight into whether the method is functional. Figures 4.5, 4.6, 4.7 and 4.8 show the confusion matrices for the pre-trained setup for both the ordinal and Softmax classifiers. Because the pretrained setup achieved an overall better MAE score than the baseline setup, the baseline is not included in this section.

Figure 4.5: Confusion matrix MIH Pretrained Ordinal and Softmax.

Comparing the confusion matrices of the ordinal and the Softmax setups shows that the Softmax classifications are spread more widely than the ordinal ones. The ordinal classifications are, without exception, concentrated between decile 5 and 8.

Figure 4.6: Confusion matrix ARV Pretrained Ordinal and Softmax.

For all attributes, the ordinal classifier predominately confuses other deciles as belonging to decile 6. The Softmax classifier shows similar behaviour for the MIH and NWIRP attributes, where classifications are concentrated in one decile; for the ARV and SSC attributes, it has a smoother spread.


Figure 4.7: Confusion matrix NWIRP Pretrained Ordinal and Softmax

Figure 4.8: Confusion matrix SSC Pretrained Ordinal and Softmax.

4.3 Recall

Figure 4.9 shows the recall per decile for each attribute, for both the ordinal and Softmax classifiers. The recall table shows, in percentages, how accurately the classifiers classified each decile. The recall also gives an indication of whether the method is functional.

The recall table clearly shows that the ordinal classifier classifies exclusively within the range of decile 5 to 8. The recall for the ordinal classifications is zero for deciles outside that range: it classifies all instances of those deciles incorrectly.

Figure 4.9: Attributes recall per decile for Ordinal and Softmax classifiers.

The table also shows that the Softmax classifier can correctly classify at least some instances for all deciles. The highest recall for the Softmax classifier is achieved for decile 5 of the MIH attribute.

Noteworthy is that the Softmax classifier achieves its highest recall for deciles 1 and 10 of the ARV attribute, which might indicate that it is better able to distinguish between the most deprived and most privileged locations. The Softmax classifier also has the highest average recall for the ARV attribute compared to the other attributes. This might suggest that the ARV attribute has the strongest correlation with the visual features, which would make intuitive sense because real estate's visual appearance is closely related to its market value.

4.4 Interpretability

Visually inspecting the results is a way to test whether the method is interpretable. The chosen examples are from opposite ends of the inequality spectrum, which allows for easier assessment of interpretability. The examples are from the pretrained Softmax setup on the ARV attribute, because this setup was best able to distinguish between deciles one and ten. The top rows of figures 4.10 and 4.11 display the imagery of locations that were predicted correctly; the bottom rows display the imagery of incorrectly classified locations.

Figure 4.10: ARV decile 1, correct and incorrect classifications.

In figure 4.10, two things immediately stand out. First, the imagery of the incorrectly classified location is much brighter than its counterpart. Second, besides being brighter, there also seems to be more greenery present in the imagery. This was the only example of the most deprived being classified as the most privileged w.r.t. the ARV attribute.

Figure 4.11: ARV decile 10, correct and incorrect predicted locations.

Figure 4.11 shows an example of correct and incorrect classification for the most privileged locations in Amsterdam w.r.t. the ARV attribute. Despite the bottom imagery being brighter and containing more greenery than the top, it was still classified as the most deprived. Another difference is that the top imagery seems to have more architectural variation, e.g. in colour and objects. Because both the misclassified deprived and privileged examples share the same differences with their counterexamples, these indications of interpretability are dismissable.

Figure 4.12 shows the extracted features of the hidden layers of the backbone network. These layers come before the fc6 layer, hence they are not directly used for training or testing. However, the fc6 layer can be understood as an aggregation of all these features.

Figure 4.12: Pretrained VGG16 hidden layer features.

The first six layers, CONV1-1 to CONV3-2, displayed in figure 4.12 show directly interpretable patterns; these features are still recognizable by the naked eye as patterns inherent in the image. From the seventh layer onward, CONV4-1 to CONV5-3, the learned patterns are abstract and not easily interpretable by the naked eye.

Figure 4.13 displays street view imagery and their fc6 feature counterparts. For all street view imagery, similar features are extracted and used for training and testing. While these visualizations are interesting to look at, they are not interpretable: it is unclear to the naked eye what these features represent.

Figure 4.13: Street view imagery and their fc6 feature counterparts.

5. Discussion

The MAE scores show that a pre-trained model performs better for both the ordinal and Softmax classifiers. The ARV attribute achieves the lowest average MAE score with the pre-trained Softmax setup. Despite the variations and improvements, they are not significant. The lowest MAE achieved for Amsterdam was twice that of the lowest for London. There are a couple of reasons why there is such a significant difference. First, there was significantly less data available for Amsterdam than for London. Second, Amsterdam is a much smaller city than London; thus, there might be less variation in Amsterdam's visual features than in London's. Regardless, it is clear that the method does not have reproducible results w.r.t. the MAE.

The confusion matrix and recall table show that the ordinal classifier predicts, without exception, within a fixed range of four deciles. The ordinal classifier predominately confuses most deciles as belonging to decile six. The Softmax classifier has a better recall than the ordinal classifier and correctly classifies some instances for all deciles. Despite this difference, the average recall for most attributes is not much better than random assignment. Overall, both the confusion matrix and the recall table show that the method misclassifies most of the time; hence the method does not function.

The examples were handpicked from the classifications made by the pre-trained Softmax setup on the ARV attribute. Intuitively, it makes sense that the most deprived and most privileged locations should also be the easiest to distinguish visually, and the ARV attribute had the best recall for deciles 1 and 10 of all city attributes. Comparing street view imagery of correctly and incorrectly classified locations showed that visual inspection does not allow for interpretability. This test is questionable for one good reason: handpicked examples cannot support a general claim. However, given the definition of interpretability, the analysis was found to be useful. It might also support the intuition that with street view images, too many factors could contribute to a particular classification. Furthermore, it can be argued that this method might not allow for this kind of interpretability at all, because it does not use predefined features.

Instead of using predefined features, this method uses a pre-trained VGG16 network to extract features. The first layers of the pre-trained VGG16 are somewhat interpretable with the naked eye, albeit not allowing for anything conclusive to be said about them. The fc6 features used for training and testing are not interpretable. Other architectures have much more interpretable features, suggesting that VGG16 is not the best choice if interpretability is essential. Furthermore, the VGG16 used by this method is trained to distinguish between 1000 objects, most of which are not present in street view imagery.

The method, as laid out by this thesis, is not viable for public use. That being said, there are some limitations to the statements and conclusions drawn in this thesis. Reproducing the method used by Suel et al. for Amsterdam was not easy. The difficulty can be boiled down to two points: data and code legacy.

First, collecting the correct social-economic data was difficult. Amsterdam's publicly available data somewhat lacked diversity in city attributes and number of measurements. Fortunately, the CBS published a more complete and up-to-date dataset during this thesis's duration. Despite being recently published, it still had some issues; for example, some attributes were absent despite the documentation claiming otherwise.

Second, the original code provided by Suel et al. has become legacy code. While it did work after much debugging, it did not achieve similar results for the paper's transferability experiments. Reproducing the London results was complicated further because Suel et al.'s street view imagery was not open source, and this thesis did not have the funding required to purchase the millions of GSV images used in their work. Suel et al. were kind enough to help by providing an up-to-date repository. However, this code was still under review and thus not guaranteed to work as intended. Even with the legacy issue addressed, London's results were not reproduced during this thesis due to communication delays and other issues. For these reasons, any findings of this thesis should be seen solely as a preliminary inquiry into this method's viability for public use.


6. Conclusion

This thesis tested the method used by Suel et al. to measure social-economic inequality on Amsterdam and reflected on whether it is viable for public use. The method is not reproducible, because the lowest MAE scores for Amsterdam were twice those registered for London. The method is not functional for two reasons. First, the confusion matrix shows that the method classifies predominately into one decile. Second, the recall table shows that the method's accuracy is, on average, similar to random assignment. The method is not interpretable, because a human cannot understand the cause of its decisions: neither visually inspecting and comparing the classifications made by the method nor visualizing the features it uses allowed for interpretability. This method is not viable for public use, given that it is not reproducible, functional or interpretable.


References

[1] Sean M. Arietta et al. "City forensics: Using visual elements to predict non-visual city attributes". In: IEEE Transactions on Visualization and Computer Graphics 20.12 (Dec. 2014), pp. 2624-2633. doi: 10.1109/TVCG.2014.2346446.

[2] Christopher Beckham and Christopher Pal. "Unimodal Probability Distributions for Deep Ordinal Classification". In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. Sydney, Australia: PMLR, June 2017, pp. 411-419. url: http://proceedings.mlr.press/v70/beckham17a.html.

[3] Joshua Blumenstock, Gabriel Cadamuro, and Robert On. "Predicting poverty and wealth from mobile phone metadata". In: Science 350.6264 (2015), pp. 1073-1076. doi: 10.1126/science.aac4420.

[4] Jia Deng et al. "ImageNet: A large-scale hierarchical image database". In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255. doi: 10.1109/cvpr.2009.5206848.

[5] Lei Dong, Carlo Ratti, and Siqi Zheng. "Predicting neighborhoods' socioeconomic attributes using restaurant data". In: Proceedings of the National Academy of Sciences 116.31 (July 2019), pp. 15447-15452. doi: 10.1073/pnas.1903064116. url: https://www.pnas.org/content/116/31/15447.

[6] Timnit Gebru et al. "Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States". In: Proceedings of the National Academy of Sciences 114.50 (Dec. 2017), pp. 13108-13113. doi: 10.1073/pnas.1700035114. url: https://www.pnas.org/content/114/50/13108.

[7] Martin Ravallion, Shaohua Chen, and Prem Sangraula. "New evidence on the urbanization of global poverty". In: Population and Development Review 33.4 (Dec. 2007), pp. 667-701. doi: 10.1111/j.1728-4457.2007.00193.x.

[8] Karen Simonyan and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: International Conference on Learning Representations. 2015.

[9] Esra Suel et al. "Measuring social, environmental and health inequalities using deep learning and street imagery". In: Scientific Reports 9.1 (2019), pp. 1-10. doi: 10.1038/s41598-019-42036-w.

[10] Chuanqi Tan et al. "A Survey on Deep Transfer Learning". arXiv preprint arXiv:1808.01974. 2018.

[11] Eddy van Doorslaer and Ulf G. Gerdtham. "Does inequality in self-assessed health predict inequality in survival by income? Evidence from Swedish data". In: Social Science and Medicine 57.9 (Nov. 2003), pp. 1621-1629. doi: 10.1016/S0277-9536(02)00559-2.

[12] Scott Weichenthal, Marianne Hatzopoulou, and Michael Brauer. "A picture tells a thousand... exposures: Opportunities and challenges of deep learning image analyses in exposure science and environmental epidemiology". In: Environment International (Jan. 2019). doi: 10.1016/j.envint.2018.11.042.

[13] Alwyn Young. "Inequality, the Urban-Rural Gap, and Migration". In: The Quarterly Journal of Economics 128.4 (Nov. 2013), pp. 1727-1785. doi: 10.1093/qje/qjt025. url: https://academic.oup.com/qje/article/128/4/1727/1850694.
