
Police Investigation of Mobile Phone Images

submitted in partial fulfillment for the degree of Master of Science

Bas Lindeboom

10777539

Master Information Studies

Data Science

Faculty of Science

University of Amsterdam

2018-07-19

Role                 Name              Affiliation
Internal Supervisor  Thomas Mensink    UvA, FNWI
External Supervisor  Gijs Smit         Politie Amsterdam
Second reader        Maarten de Rijke  UvA, FNWI


Police Investigation of Mobile Phone Images

Bas Lindeboom

University of Amsterdam
baslindeboom96@hotmail.com

ABSTRACT

When the police department of Amsterdam wants to analyze the contents of a confiscated phone, that phone can easily contain thousands of images. Since some images could contain valuable information for investigators, it is not desirable to review all of these images manually. With the existing classification methods available, this time-consuming process can be improved relatively simply. In this paper the performance of models trained on the ImageNet data set is compared. Furthermore, methods are evaluated to use a pre-trained model more broadly. A method to sort images on relevance is described and evaluated, using the output of such pre-trained models. Binary classifiers were used to learn which prediction scores determine whether an image is relevant. The WordNet hierarchy has also been used in combination with the standard ImageNet concept labels, to see if it makes a difference when labels are taken a few steps higher up in the hierarchy and prediction scores are combined. The method of Norouzi et al. (2013) is used to rank images based on textual input and to label image clusters, where the images were clustered based on the prediction scores of a pre-trained model. The results indicate that the presented methods can indeed be used in police investigations to find relevant images faster.

Keywords

Image Classification, Object Recognition, ImageNet, Zero-shot Learning, WordNet, Image Clustering

1. INTRODUCTION

In this digital era, a single smartphone easily contains thousands of images. For the purpose of a police investigation, it can be useful to know what is on the images of a certain phone, and these images can be analyzed using machine-learning techniques. Every case can take a lot of time, for example when someone is arrested or when digital devices have been seized, and reviewing all the images manually takes a large share of it. Using machine learning for image classification tasks like object detection or face recognition, a computer can quickly indicate what is on the images in general. A suspect could have a few images of weapons in an overall collection of thousands of pictures. Another example is that detectives working on a case could ask the Data Analytics team to find all pictures containing a specific object, like a container in a drug trafficking case. Machine learning is already used occasionally by the police to make some analytical tasks easier and less time consuming. However, it remains unclear which methods perform best. Within the field of image classification there are multiple tasks, such as face recognition, face detection, object recognition and optical character recognition (OCR). This paper focuses on the detection of objects.

With all the existing image classification methods available, the time-consuming process of reviewing all the images on a confiscated phone can be improved. This paper will therefore answer the following research question: To what extent can models that were already trained on ImageNet identify relevant images in investigations of confiscated phones within the police department of Amsterdam? To answer this main research question, the following subquestions will be covered:

• SQ1 Which pre-trained model performs best in recognizing the 1000 ImageNet concepts in images?

• SQ2 Can the output of these models be used to distinguish relevant from irrelevant images?

• SQ3 Can a textual query be used to find relevant images containing concepts that are known and unknown for these models?

• SQ4 Can image clustering, based on the prediction scores of these models, help to find images that are relevant to the Amsterdam police?

Thesis structure.

Firstly, work related to the subquestions will be discussed. Secondly, the visual data used in this research will be described, along with how it was gathered. Thirdly, the experiments will be described and evaluated on their performance. The conclusions based on these evaluations are covered in the last section.


2. RELATED WORK

2.1 Supervised image classification

Image classification is not only an interesting topic for the police department of Amsterdam, but also in the academic world, where a lot of research has been done on this topic over the years. Within this topic, a distinction is made between supervised and unsupervised classification. Supervised classification makes use of labeled data, so the model can explicitly learn class names by analyzing the samples first. Unsupervised classification does not use annotated data but groups similar images into the same class.

Every year the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) is organized. This competition is one of the main reasons why there are many research papers concerning supervised image classification. Attendees have used different approaches to solve this classification problem. Convolutional Neural Networks (CNNs) have proven to be a suitable method for image classification (Krizhevsky, Sutskever, & Hinton, 2012) and object recognition specifically. In addition, CNNs have become the dominant machine learning approach for visual object recognition. However, there are many variants and applications of these neural networks, which raises the question which network actually is the best. In 2016 several methods were analyzed by Canziani, Paszke, and Culurciello (2016), but not in the domain of the police of Amsterdam.

The paper of Canziani et al. (2016) compares and evaluates several neural networks that were trained on ImageNet. Their analysis includes VGG, Inception and ResNet. To compare the performance of the classification models, the top-1 accuracy was used because this score was reported in all papers. The Inception models seem to be the best performing classifiers according to the accuracy scores. However, the scores that were compared are just the evaluation scores that were mentioned in the literature (Canziani et al., 2016). The models were evaluated with the test set of the ILSVRC challenge, but only the set of 2010 is publicly available, and DenseNet (Huang, Liu, van der Maaten, & Weinberger, 2017) is not included in the comparison.

2.2 Image filtering on relevance

As previously stated, image classifiers can be used to indicate what is on the images of an image collection. A combination of such a classifier and a binary classifier can predict the relevance of an image (Nguyen, Alam, Ofli, & Imran, 2017). This reduces the workload of the police when the images are ranked on relevance, so that the most relevant images come first. By determining which images in a data set are relevant for the police and which are not, a binary classifier can be trained to predict relevance. The promising evaluation results in the research of Nguyen et al. (2017) suggest that there is a lot of potential in using the output of an ImageNet model. In the paper of Nguyen et al. (2017) a method is described to filter the large amount of images on social media networks. They used a data set of images, annotated by human workers, to train supervised machine learning models to recognize the features of relevant images. The data was labeled into three categories: severe damage, mild damage and none. The authors propose mechanisms to purify the noisy social image data by removing duplicate, near-duplicate, and irrelevant image content (Nguyen et al., 2017). They show that state-of-the-art computer vision deep learning models can be adapted successfully to image relevancy classification problems. A real-time classifier was developed to find damaged buildings or infrastructure for humanitarian organizations working in disaster areas. The goal is not just to look at relevancy, but also at the damage level of buildings or infrastructure in the image. The authors used the predictions of a re-trained VGG16 model that was initially trained on ImageNet.

2.3 Zero-shot learning

The pre-trained concepts can be used for non-ImageNet concepts with zero-shot classification models (Norouzi et al., 2013). It is likely that a confiscated phone contains images that were never seen by the classifier at training time. Therefore, it is useful to apply zero-shot learning to still be able to label these images. The paper of Xian, Lampert, Schiele, and Akata (2017) compares the state-of-the-art zero-shot learning methods, i.e., methods for classifying images for which labeled training data is lacking. The authors propose a unified evaluation protocol which makes it possible to compare the several methods. They make a distinction between the approaches of early work and of recent work. In early work, zero-shot learning made use of the attributes of images within a two-stage approach, for example in the proposal of Lampert, Nickisch, and Harmeling (2014): firstly, the attributes of an input image are predicted, and then the class label is inferred by searching the class which attains the most similar set of attributes. This second stage has been extended in several methods as well, for when attributes are not available. These methods first predict seen-class posteriors and then project the image feature into the Word2Vec space by taking the convex combination of the top T most probable seen classes, like Norouzi et al. (2013). The semantic space can also be a mapping from an image feature space; this mapping can then be learned directly for zero-shot learning, according to most recent zero-shot learning proposals (Akata, Perronnin, Harchaoui, & Schmid, 2016; Bucher, Herbin, & Jurie, 2016). One of the conclusions stated in the paper of Xian et al. (2017) is that the Convex combination of semantic embeddings (ConSE) method of Norouzi et al. (2013) performs slightly better than the others.

Norouzi et al. (2013) propose a simple method for constructing an image embedding system from any existing n-way image classifier and a semantic word embedding model. Their method maps images into the semantic embedding space via a convex combination of the class label embedding vectors, and requires no additional training. The authors claim that their proposed method outperforms state-of-the-art methods on the zero-shot learning task of ImageNet. Their method combines any existing probabilistic n-way image classifier with an existing word embedding model that contains the n class labels in its vocabulary. According to this paper, a key to zero-shot learning is the use of a set of semantic embedding vectors associated with the class labels. Instead of learning a regression function explicitly, the classic machine learning approach is used and a classifier is learned from training inputs to training labels. The model predicts a semantic embedding vector for the top T predictions of the classifier for an image. This semantic vector that describes the image is the convex combination of the semantic embeddings weighted by their corresponding probabilities.

2.4 Image Clustering

Since the research of Nguyen et al. (2017) suggests that pre-trained models can be used to identify the relevance level of an image, it might be interesting to look at an unsupervised approach. Clustering the retrieved prediction probabilities should give insight into the contents of a confiscated phone. Based on the assumption that images with interesting content are most likely not very common, the anomalies might reveal interesting content. For example, it is assumed that there will not be many images of corpses on a phone. Since this might not apply to all interesting concepts, it is important to know what kind of images belong to the clusters. Therefore, generating labels for clusters might also reveal whether there are clusters containing relevant images. Since the approach is unsupervised, clustering the images of a confiscated phone can also reveal interesting images that the detective had not yet searched for.

Density-based Spatial Clustering of Applications with Noise (DBSCAN) clusters similar data points, but also detects noise in the data set (Ester, Kriegel, Sander, & Xu, 1996). The algorithm detects clusters because the typical density of points within a cluster is substantially higher than outside it, while the density within areas of noise is lower than in any of the clusters. A variation of this algorithm is HDBSCAN (Hierarchical Density-based Spatial Clustering of Applications with Noise), which uses the same principle as DBSCAN but also takes hierarchy into account. HDBSCAN performs DBSCAN over varying epsilon values and integrates the results to find a clustering that gives the best stability over epsilon (McInnes, Healy, & Astels, 2017). Not much research has been done on unsupervised image clustering, particularly in combination with ImageNet. The paper of Agrawal and Karnick (2009) describes how images can be clustered on their semantics. The described approach involves extracting interesting patches and segmenting images at different scales. Using this approach the authors aim to find differences and similarities in image patches within the image set to make clusters. The approach was designed based on the assumption that the semantic information of images can be determined by the presence or absence of a specific set of objects (Agrawal & Karnick, 2009). Since the pre-trained models predict probabilities for all 1000 concepts, these scores might be very useful for image clustering. However, the algorithm created by Agrawal and Karnick (2009) was never tested quantitatively; nevertheless, on two data sets a semantic bias was obtained in the clustering.

3. VISUAL DATA

Table 1: Data overview

Image set            Training images   Test images   Classes
Google Images        -                 4980          996
Relevancy set        1642              412           2
Police concepts set  7021              -             50

3.1 ImageNet

The pre-trained models were trained using the ImageNet image collection (Deng et al., 2009). The image collection was built by querying several search engines with a set of WordNet synonyms (a synset). This query set was expanded by appending the queries with the word from parent synsets. Translations into multiple languages were also used as queries to ensure diversity in the data set. The creators of the data set strived for approximately 500-1000 clean images per synset; nowadays the data set contains even more images per synset.

3.1.1 WordNet

The labels of the ImageNet concepts are obtained from the WordNet data set (Deng et al., 2009). This gives the potential advantage of using the hierarchy of the ImageNet concepts and narrowing their range. The goal of narrowing the concept range is to make it easier to 'whitelist' images. For example, the label 'dog' can 'whitelist' all underlying child classes. Besides, predictions become more certain because the prediction errors caused by similarities within the same parent class disappear. It might thus be possible to use these concepts to 'whitelist' images in a data set that are potentially not interesting enough for the police; for example, images of plants or animals might not be really interesting. ImageNet used labels for its concepts that occur in the hierarchy of WordNet. Synsets that are located in the WordNet hierarchy above a word are called hypernyms. Some hypernyms were not taken into account, since their high position in the WordNet hierarchy makes them too abstract, like 'entity', 'physical entity', 'object', 'whole', 'living thing' and 'organism'. Using this hierarchy the images were labeled on 1000, 608 and 389 classes, corresponding to 0, 1 and 2 steps higher up in the hierarchy. The prediction probabilities of labels that share the same parent class were added together. When a word did not have any hypernyms, the word and its corresponding probability remained unmodified.
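As an illustration, below is a minimal sketch of this roll-up (not the thesis code), assuming NLTK's WordNet interface and a `predictions` dictionary mapping labels to probabilities; a real pipeline would resolve ImageNet synset offsets rather than look up label strings.

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from collections import defaultdict
from nltk.corpus import wordnet as wn

# Hypernyms considered too abstract to use as merged labels.
TOO_ABSTRACT = {"entity", "physical_entity", "object", "whole",
                "living_thing", "organism"}

def roll_up(predictions):
    """Merge prediction probabilities one hypernym step up the hierarchy."""
    merged = defaultdict(float)
    for label, prob in predictions.items():
        synsets = wn.synsets(label.replace(" ", "_"), pos=wn.NOUN)
        hypernyms = synsets[0].hypernyms() if synsets else []
        if hypernyms and hypernyms[0].lemma_names()[0] not in TOO_ABSTRACT:
            merged[hypernyms[0].lemma_names()[0]] += prob  # sum sibling classes
        else:
            merged[label] += prob  # no usable hypernym: keep label unmodified
    return dict(merged)

# Example: probabilities of two retriever breeds collapse into 'retriever'.
print(roll_up({"golden_retriever": 0.4, "Labrador_retriever": 0.3}))
```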

3.1.2 Google Image set

For the first experiment, which compares pre-trained models, a new set of images was created with Google Images. For each ImageNet concept label, the top 10 query results from Google Images were obtained, and five random images of those top 10 were used to test each pre-trained model. Using the class label as search query ensures that all these images can be labeled as that class. Only 997 concepts were initially used for testing: three labels occurred twice in the class list, differing only in an initial capital letter, a distinction that cannot be made in a Google search query, so only one of each pair was used. Additionally, querying the label 'fig' did not produce any results, leaving 996 classes. The test set thus consists of 4980 images in total: 5 images for each of the 996 remaining classes.

3.2 Police data

3.2.1 Relevancy set

An image set of a confiscated phone containing 2054 images was divided manually into two classes: relevant and irrelevant. The train set contains 1270 irrelevant images and 372 relevant images. The test set contains 206 random images of each class (412 in total), which is approximately 20% of the overall image set. This set of images was used because it contains relatively many different relevant images and is representative of image sets of confiscated phones in related cases. Images containing weapons, military uniforms and personal belongings were considered relevant, as were screenshots of chats and map locations.

This set of images was provided by the police of Amsterdam and was retrieved from the phone of a person potentially involved in the Syrian war. This data set was used to evaluate the performance of several relevance ranking methods.

The images were converted into six different vector representations. The output of the pre-trained ResNet50 and DenseNet201 initially contains 1000 probability scores for the 1000 ImageNet concepts. Using WordNet, the hypernyms were derived from the original 1000 labels to convert the labels one (608 classes) or two steps (389 classes) higher in the WordNet hierarchy, yielding vectors of 1000, 608 and 389 probabilities for each model.

The prediction scores of relevant and irrelevant images were investigated. Figures 1 and 2 show the 50 ImageNet concepts with the highest average prediction probabilities. As shown in Figures 1 and 2, some concepts occur in the top for both relevant and irrelevant images. For example, 'web site' is a label that is often assigned to images that mainly contain text, like screenshots of WhatsApp messages or memes.

Figure 1: Top 50 classes with their average prediction probability

Figure 2: Top 50 classes with their average prediction probability

Mutual information.

Mutual information can be used to select features for supervised neural network learning (Battiti, 1994). This measure describes the mutual dependence between two variables; specifically, it describes how much information can be derived about one random variable by observing another. The idea is that observing the value of one variable reduces the uncertainty about the relevancy of an image. Therefore, the mutual information was calculated for all ImageNet concept combinations over the images in the Relevancy set with the formula

I(X; Y) = \int_Y \int_X p(x, y) \log\left(\frac{p(x, y)}{p(x)\, p(y)}\right) dx\, dy.

This gives insight into which class combinations say a lot about the relevance of an image and which do not. Since the experiments of SQ2 and SQ3 (tables 9 and 12) indicate that ResNet50 generalizes better, its prediction scores were used to calculate the mutual information. For each potential concept pair, the mutual information was calculated (figures 3a and 3b). Also, for each concept the top 10 combinations with the highest mutual information scores were obtained, and it was counted how often a concept occurs in the top 10 of these pairs for relevant and for irrelevant images (table 3).
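A minimal sketch of such a pairwise estimate (not the thesis code), assuming the prediction probabilities are discretized into bins before estimating the mutual information; note that the full 1000-concept grid yields roughly 500,000 pairs.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def concept_pair_mi(scores, n_bins=10):
    """MI for every concept pair; `scores` is (n_images, n_concepts)."""
    # Discretize each concept's probability column into equal-width bins.
    binned = np.digitize(scores, np.linspace(0.0, 1.0, n_bins))
    mi = {}
    for a, b in combinations(range(scores.shape[1]), 2):
        mi[(a, b)] = mutual_info_score(binned[:, a], binned[:, b])
    return mi

# Example on toy data: 100 "images", 5 concepts.
rng = np.random.default_rng(0)
pairs = concept_pair_mi(rng.random((100, 5)))
print(max(pairs, key=pairs.get))  # concept pair with the highest MI
```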

Figure 3: Mutual Information scores for all ImageNet class combinations. (a) Irrelevant images; (b) Relevant images.

Table 2: Class combinations with the highest mutual information score for irrelevant (0) and relevant (1) images

Top 10 Class Combinations (0)     Top 10 Class Combinations (1)
(garter snake, web site)          (jacamar, military uniform)
(garter snake, diaper)            (jacamar, assault rifle)
(garter snake, bassinet)          (jacamar, web site)
(garter snake, cradle)            (jacamar, rifle)
(garter snake, bonnet)            (jacamar, restaurant)
(garter snake, television)        (jacamar, bulletproof vest)
(garter snake, sleeping bag)      (jacamar, parachute)
(garter snake, book jacket)       (jacamar, suit)
(garter snake, pajama)            (jacamar, kimono)
(garter snake, bib)               (jacamar, fur coat)

Table 2 shows that there is a clear difference in the 10 combinations with the highest mutual information. Not completely unexpectedly, the right column mainly contains army-related concepts. The concept 'jacamar' is a specific type of bird and occurs in every combination in the top 10; one possible explanation is the bird that occurs on the military uniforms. These combinations are in line with figures 1 and 2.

Table 3 shows that there are some differences in mutual information between relevant and irrelevant images, but also some similarities. A high mutual information score for relevant images can also mean that the feature has a high impact on the irrelevancy of an image. For example, the probability score for the concept "yellow lady's slipper" probably says a lot about the irrelevancy of an image.

Table 3: Most frequent ImageNet concepts in top 10 mutual information scores of each concept

      Irrelevant images                 Relevant images
      Concept                   Freq    Concept                   Freq
1     "yellow lady's slipper"   991     "yellow lady's slipper"   993
2     'disk brake'              538     'odometer'                685
3     'fox squirrel'            334     'espresso maker'          550
4     'damselfly'               166     'siamang'                 368
5     'Sealyham terrier'        158     'lycaenid'                265
6     'Japanese spaniel'        151     'Japanese spaniel'        151
7     'red-backed sandpiper'    139     'flatworm'                110
8     'white wolf'              124     'jacamar'                 95
9     'harvestman'              61      'harvestman'              70
10    'garter snake'            57      'water ouzel'             20
11    'common newt'             26      'military uniform'        16
12    'web site'                18      'web site'                16
13    'diaper'                  15      'rifle'                   15
14    'jacamar'                 14      'restaurant'              14
15    'bassinet'                13      'assault rifle'           13

3.2.2 Police concepts set

This image set contains images that were used to train and test the image classification models of the police of Amsterdam, so this image collection was also provided by the police of Amsterdam. It contains a total of 7021 annotated images of 50 concepts that are of interest to the police, like images of police cars, official documents and scooters. This set was used to measure the performance of the pre-trained ImageNet models in identifying (un)known concepts. The images within this data set were also represented as vectors of 1000, 608 and 389 probability scores for the three different hierarchy levels.

Table 4 shows what kind of images occur in the Police concepts set and how many images exist for each concept.

Table 4: Police concepts images

Concept label         Images   Concept label        Images
Acetone               11       Gun in hand          127
Afghan flag           190      Gun pistol           303
Baseball bat          261      Gun pointed          29
Baseball caps         158      License plate        35
Bomb vest             32       Masked face          22
Butcher shop          85       Money                238
Car                   339      Moroccan flag        113
Caravan               14       Mosque blue          27
Car passengers        43       Official document    102
Cash machine (ATM)    39       Oxygen cylinder      19
Church                162      Party                464
Cobra fireworks       97       Passport photo       284
Cooking gas cylinder  120      Passport type I      473
Debit card            59       Passport type II     26
Drivers license       31       Police announcement  18
Drugs: MDMA           72       Police car           138
Drugs: weed           25       Police uniform       101
Drugs: weed plants    28       Refugee boat         222
ID card               237      Scooter              48
Iraqi eagle           31       Sea container        175
Iraqi flag            243      Selfies              429
IS flag               16       Syrian flag          196
Gas mask              163      Truck                45
German police car     202      Watch                365
Gun automatic         284      WTC 911              80

4. EXPERIMENTS

4.1 SQ1 - Comparison of pre-trained models

This experiment is intended to roughly determine which neural networks trained on ImageNet should be used for the other experiments. At this moment there are several models available that are pre-trained on the ImageNet data set. The performance of ResNet50, DenseNet201, InceptionV3, VGG16 and VGG19 was evaluated. The images within the Google Image set were used to measure the performance of the models with the top-5 accuracy. This performance measure was frequently used in previous research concerning image classification (Canziani et al., 2016). However, in the comparison of Canziani et al. (2016) DenseNet was not included and the performance of the models was only compared with the top-1 accuracy.
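A minimal sketch of this comparison for one model (not the thesis code), assuming the Keras pre-trained ImageNet weights and a list of (path, true label) pairs; the file name in the example is hypothetical.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")  # predicts the 1000 ImageNet concepts

def top5_accuracy(labeled_paths):
    """Fraction of images whose true label appears in the top-5 predictions."""
    hits = 0
    for path, true_label in labeled_paths:
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))
        preds = decode_predictions(model.predict(x), top=5)[0]
        if true_label in {label for (_, label, _) in preds}:
            hits += 1
    return hits / len(labeled_paths)

print(top5_accuracy([("weapon.jpg", "assault_rifle")]))  # hypothetical file
```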

Table 5: Performance of pre-trained ImageNet models

Pre-trained model   Top-5 accuracy (%)
VGG16               93.2
VGG19               93.3
InceptionV3         94.0
ResNet50            94.2
DenseNet201         95.6

As shown in Table 5, the top-5 accuracy scores of the pre-trained models on the generated Google Image data set are very close to each other. The table shows that DenseNet201 performed best on this data set. Nevertheless, the performance of ResNet50 and DenseNet201 will both be compared in the experiments for subquestions 2 and 3, since the differences are not very large.

As expected, all models performed much better than a baseline that randomly selects a fixed top-5 of predictions once and uses it for every image. Since the accuracy scores do not differ very much and the models might be overfitting on the ImageNet data, the best performing model after DenseNet201 will be used as well; according to table 5 this is ResNet50. While comparing the pre-trained models, it also came to light that ResNet50 was much faster in classifying images than DenseNet201. This is an additional advantage, since the contents of a phone can then be analyzed even faster.

4.2 SQ2 - Rank images on relevance certainty

This experiment is intended to see whether images can be classified as relevant or irrelevant. A confiscated phone can easily contain thousands of images, most of which are not relevant for the investigator. In this experiment there are just two classes: relevant and irrelevant. The binary classifiers give a confidence score for the given relevance prediction and the images are then ranked on this score. Such a ranking makes it easier to spot relevant images within the contents of a confiscated phone, since the most irrelevant images will occur at the bottom of the ranking. This gives the investigator a quick overview of the contents of a confiscated phone without having to search too specifically; additionally, it can bypass any confirmation bias that may be present. It has been examined whether the hierarchy level of the prediction labels affects the performance and which binary classifier performs best.

4.2.1 Models

The Relevancy set was used for this experiment. Using the output of DenseNet201 and ResNet50, containing the probabilities for every ImageNet concept, a binary classifier was trained to predict whether an image is relevant. A Random Forest with 100 estimators and random state 0, a Support Vector Machine (SVM) and a Logistic Regression classifier were trained on the 1000, 608 and 389 probability scores. These models were chosen because they do not require the parameters to be changed for each prediction representation. Due to the assumption that it might be hard to distinguish relevant images from irrelevant images, the images were ranked on the confidence that an image is relevant; images with the lowest confidence, which are therefore most likely irrelevant, are ranked at the bottom.
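A minimal sketch of these classifiers and the confidence ranking, assuming `X_train` and `X_test` hold the probability vectors and `y_train` the manual relevance labels (1 for relevant, 0 for irrelevant); this is an illustration, not the thesis code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),  # probability=True enables confidence scores
}

def rank_by_relevance(clf, X_train, y_train, X_test):
    """Return test-image indices ordered by predicted relevance confidence."""
    clf.fit(X_train, y_train)
    confidence = clf.predict_proba(X_test)[:, 1]  # P(relevant)
    return np.argsort(-confidence)  # most confidently relevant images first
```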

4.2.2 Evaluation method

The models were evaluated on their performance with the 1000, 608 and 389 predictions using the metrics precision@k, recall@k, average precision and mean average precision. The mean average precision was calculated by averaging the three average precision scores of the different hierarchy levels. The k for precision and recall was set to k = 150, based on the assumption that the police will look at least at the first 150 results. Recall is an important measure for the Amsterdam police department, because it is crucial not to miss anything. Since only 150 of the 412 test images appear in the ranking result, the maximum attainable recall@150 is 72.8% (150/206 relevant images).
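The ranking metrics themselves can be sketched directly, assuming `ranking` is the list of 0/1 relevance labels in ranked order (most confident first) and contains at least one relevant image.

```python
def precision_at_k(ranking, k=150):
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k=150):
    return sum(ranking[:k]) / sum(ranking)

def average_precision(ranking):
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / hits

# Example: 3 relevant images at ranks 1, 3 and 4 of a ranking of 5.
print(average_precision([1, 0, 1, 1, 0]))  # ≈ 0.806
```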

Table 6: Random Forest ranking evaluation

Model               P@150   R@150   AP
DenseNet201 - 1000  84.7    61.7    84.4
DenseNet201 - 608   83.3    60.7    83.8
DenseNet201 - 389   84.0    61.2    84.2
ResNet50 - 1000     95.3    69.4    94.4
ResNet50 - 608      97.3    70.9    95.2
ResNet50 - 389      96.7    70.4    94.5

Table 7: Logistic Regression ranking evaluation

Model               P@150   R@150   AP
DenseNet201 - 1000  82.0    59.7    82.1
DenseNet201 - 608   82.0    59.7    82.1
DenseNet201 - 389   80.7    58.7    81.7
ResNet50 - 1000     90.7    66.0    90.2
ResNet50 - 608      90.0    65.5    89.4
ResNet50 - 389      86.7    63.1    86.4

Table 8: SVM ranking evaluation

Model               P@150   R@150   AP
DenseNet201 - 1000  80.7    58.7    78.8
DenseNet201 - 608   80.7    58.7    80.3
DenseNet201 - 389   79.3    57.8    80.0
ResNet50 - 1000     90.7    66.0    89.8
ResNet50 - 608      90.7    66.0    90.3
ResNet50 - 389      85.3    62.1    85.5

From table 6 it can be derived that, for the Random Forest, the combination of ResNet50 with the 608 prediction scores performs best. Table 7 shows that the Logistic Regression model performs best in combination with the 1000 prediction scores of the pre-trained ResNet50 model. In table 8 it can be seen that the SVM performs best with the combination of DenseNet201 and the 608 prediction scores; for ResNet50 the SVM also ranks best with the 608 predictions. It seems that the output of ResNet50 ensures a better relevancy ranking, since for all three binary classifiers the performance scores are higher.

Table 9: Mean Average Precision of binary classifiers

Model        Random Forest   Logistic Regression   SVM
DenseNet201  84.1            82.0                  79.7
ResNet50     94.7            88.7                  88.5

It can be concluded that using the output of ResNet50 ensures better performance, since table 9 clearly shows that all the mean average precisions for ResNet50 are higher. Both models perform better than the 50.0% (206/412) mean average precision baseline obtained when the images are ordered randomly. However, the recall@150 of both models remains below its upper bound of 72.8% (150/206 × 100).

4.3 SQ3 - Rank images on similarity to textual query

The previous experiment only ranks the images based on relevance predictions. However, an investigator may want to find all images containing a specific concept, for example by ranking the images based on their similarity with the query 'weapon'. It is not inconceivable that the concept the investigator is interested in did not occur in the training set of the classification model. This experiment therefore investigates the possibility of searching for images containing known and unknown concepts based on text, ranking the images of the confiscated phone based on the textual input of a user.

4.3.1 Models

For this experiment the zero-shot learning method of Norouzi et al. (2013) was used to assign a vector to every image in the image set. A skip-gram language model trained on Wikipedia was used to generate the word vectors: the latest English pages were downloaded from the Wikipedia dump of 2 May 2018 and used to train a Word2Vec skip-gram model with the same parameters as described in Norouzi et al. (2013); this language model was used for subquestions 3 and 4 to retrieve the word vectors of predicted labels and query terms. The words or labels that could be assigned to the generated image vector are not relevant here; it is only important that the vector of the query term is similar to the image vector. After converting the query to a word vector with the language model, the cosine similarity was used to find the images most similar to the query. The vectors for the images were generated according to the equation

f(x) = \sum_{t=1}^{T} p(y(x, t) \mid x) \cdot s(y(x, t)).

As described in the paper of Norouzi et al. (2013), the normalization parameter can be neglected since Word2Vec uses the cosine similarity as similarity measure. In this equation y(x, t) is the t-th most likely predicted label for image x. The probability of every label in the top T predictions is multiplied with the word vector retrieved from the Word2Vec model (s) for that label. Finally, all these weighted vectors are added together and form a new vector that describes the semantics of the image. For this experiment T was set to 15.
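A minimal sketch of this convex-combination embedding and the query ranking, assuming a trained gensim Word2Vec model `w2v` whose vocabulary contains all labels (out-of-vocabulary labels need the fallback described below); `predictions` is assumed to be a list of (label, probability) pairs from the pre-trained classifier.

```python
import numpy as np
from numpy.linalg import norm

def embed_image(predictions, w2v, top_t=15):
    """Convex combination of the top-T label vectors, weighted by probability."""
    vec = np.zeros(w2v.wv.vector_size)
    for label, prob in sorted(predictions, key=lambda p: -p[1])[:top_t]:
        vec += prob * w2v.wv[label]  # assumes label is in the vocabulary
    return vec

def rank_by_query(query, image_vectors, w2v):
    """Order image indices by cosine similarity to the query's word vector."""
    q = w2v.wv[query]
    sims = [np.dot(q, v) / (norm(q) * norm(v)) for v in image_vectors]
    return np.argsort(sims)[::-1]  # most similar images first
```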

For example, after classifying an image with a pre-trained model, the top 3 predictions of the image are obtained (Figure 4). Subsequently, the labels are converted to word vectors using the trained language model. These vectors are multiplied by their corresponding predicted probabilities and then all added together. This new word vector describes the contents of the image and can be compared with the word vector of a textual query. The vector of this example will probably be close to that of 'parking lot' or 'driveway'.

Figure 4: Example of the zero-shot method of Norouzi et al. (2013)

To implement this, the models ResNet50 and DenseNet201 were used instead of the neural network of Krizhevsky et al. (2012) used by Norouzi et al. (2013). In principle, the 1000 probabilities of the models were used. However, not all predicted labels of these models were present in the vocabulary of the trained Word2Vec model; this problem occurred mainly with labels consisting of multiple words. To solve this, the label was first split up and the word vectors of its parts were added together. Some parts of these divided labels did not occur in the vocabulary either, so for those words the WordNet data set was used to find a synonym. In some cases no synonym was available either; the WordNet hierarchy was then used, and the hypernym closest to the predicted label was converted to a word vector instead. A sketch of this fallback is given below.
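A simplified sketch of this fallback, assuming NLTK's WordNet and a gensim model `w2v`; the helper name is hypothetical and the real procedure also applies the synonym step to individual word parts.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def label_vector(label, w2v):
    """Word vector for a label, with split / synonym / hypernym fallbacks."""
    if label in w2v.wv:
        return w2v.wv[label]
    parts = label.split("_")
    if all(p in w2v.wv for p in parts):            # 1. split multi-word label
        return sum(w2v.wv[p] for p in parts)
    synsets = wn.synsets(label, pos=wn.NOUN)
    if synsets:
        for lemma in synsets[0].lemma_names():     # 2. try a WordNet synonym
            if lemma in w2v.wv:
                return w2v.wv[lemma]
        for hyper in synsets[0].closure(lambda s: s.hypernyms()):
            for lemma in hyper.lemma_names():      # 3. climb to a hypernym
                if lemma in w2v.wv:
                    return w2v.wv[lemma]
    return None  # no vector available for this label
```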

4.3.2 Evaluation method

To evaluate the performance of this image retrieval system, the Police concepts set was used. This set contains concepts that are known and concepts that are unknown for the two ImageNet models. Table 10 shows which concepts were considered relevant for each query. The performance of DenseNet201 and that of ResNet50 were also compared. The recall@k (k=200, k=500 and k=1000) and average precision were calculated for the similarity ranking (table 11). The mean average precision was calculated for DenseNet201 and ResNet50 by taking the mean of the average precision scores over all queries (table 12).

Table 10: Relevant classes for each textual query

Query            Relevant concepts                                                     Amount
bomb vest        Bomb vest                                                             32
caravan          Caravan                                                               14
container        Sea container                                                         175
document         Passport type I, Passport type II, Drivers license, ID card,
                 Official document                                                     869
drivers license  Drivers license                                                       31
drugs            Drugs: MDMA, Drugs: weed plants, Drugs: weed                          125
flag             Afghan flag, Iraqi flag, IS flag, Moroccan flag, Syrian flag          758
gas cylinder     Cooking gas cylinder, Oxygen cylinder                                 139
police           German police car, Police announcement, Police car, Police uniform    459
scooter          Scooter                                                               48
uniform          Police uniform                                                        101
weapon           Gun automatic, Gun in hand, Gun pistol, Gun pointed                   743

From table 11 it can be derived that, in general, ResNet50 performs better in ranking images on query similarity. Table 12 below shows that the mean average precision of ResNet50 on the 13 queries is also much higher. A plausible explanation is that DenseNet201 overfits on the ImageNet concepts and generalizes worse. However, both models perform better than the 4.1% baseline derived from the average precision scores for each query when the images are ordered randomly.

Table 11: Evaluation scores of DenseNet201 (DN) and ResNet50 (RN) on textual queries

Query            Recall@200     Recall@500     Recall@1000    Average Precision
                 DN     RN      DN     RN      DN     RN      DN     RN
bomb vest        0.0    15.6    3.1    25.0    12.5   37.5    0.4    2.4
caravan          7.1    7.1     14.3   14.3    21.4   42.9    0.3    0.07
container        4.0    25.1    17.1   34.3    18.9   48.0    3.2    27.3
document         1.0    16.1    1.6    36.5    7.9    68.0    11.8   60.0
drivers license  3.2    0.0     9.7    0.0     9.7    3.2     0.4    0.5
drugs            16.8   1.6     20.8   5.6     29.6   11.2    7.1    1.6
flag             5.1    13.5    19.5   30.2    35.9   41.6    20.9   31.4
gas cylinder     3.6    10.9    5.8    18.8    7.2    31.2    1.7    6.2
identification   0.0    22.0    43.8   43.8    57.8   57.8    52.2   52.2
police           0.5    42.8    2.0    60.3    5.9    66.2    10.9   63.5
scooter          1.5    91.7    3.1    93.8    10.9   95.8    5.6    87.5
uniform          4.2    21.0    6.2    26.0    10.4   35.0    0.7    6.6
weapon           3.0    25.6    5.9    56.4    12.9   73.5    1.2    74.4

Table 11 also shows that queries that are more related to the original ImageNet labels yield better results; for example, ImageNet contains images of guns and automatic rifles, which are weapons. 'Drivers license' seems to be a difficult query, since the word 'driver' is close to 'car' and the ranking shows mostly car images at the top. From table 11 and the research of Norouzi et al. (2013) it can be concluded that the distance of the query term to the ImageNet labels influences the ranking performance: the closer the query term is to an ImageNet label, the better the ranking.

Table 12: Mean Average Precision on the 13 queries

       ResNet50   DenseNet201
MAP    31.9       5.4

4.4 SQ4 - Image clustering

This experiment is based on the assumption that images that are relevant in police investigations do not occur frequently on the same phone. For example, the contents of a confiscated phone will most likely include more images of pets than of corpses, which indicates a good chance that the anomalies will contain relevant images. However, some concepts of interest, such as weapons, may be more common. Therefore, it is relevant to know what kind of content was clustered together. This experiment investigates the relevance of the detected anomalies and the labeling of the generated clusters.

4.4.1 Cluster algorithms

DBSCAN is a cluster algorithm that categorizes data points as cluster members or as anomalies. This makes it relatively easy to identify noise or anomalies in the data, which is useful for detecting anomalies in an image set. DBSCAN does not require the number of clusters in the data to be specified a priori, as opposed to the k-means algorithm; for mapping the contents of a confiscated phone this is very useful, since the number of clusters is unknown. The algorithm arbitrarily selects a point P and retrieves all points density-reachable from P with respect to epsilon and the minimum number of cluster points. When P is a core point, a cluster is formed. If P is a border point, no points are density-reachable from P and DBSCAN visits the next point in the data. This process continues until all points have been processed. Epsilon specifies how close points should be to each other to be considered part of a cluster; the minimum number of cluster points specifies how many neighbors a point should have to be included in a cluster.

A variation of this method is HDBSCAN, which also takes hierarchy into account. Both methods were used to cluster the images within the Police concepts set. This image set contains concepts that are relevant for the police, so it is important that the cluster algorithm is able to distinguish these concepts. Using the train and test sets from the Relevancy set, it was determined whether the detected anomalies contain relevant images; both the train and test images were used for clustering. Both algorithms were tested on the Police concepts set and the Relevancy set.

Since the results of SQ2 and SQ3 indicate that ResNet50 generalizes better than DenseNet201, only ResNet50 was used to cluster the images for SQ4. The predictions of ResNet50 were used in combination with DBSCAN and HDBSCAN. For both algorithms a configuration was found that seemed to work well: DBSCAN was used with the cosine metric, a minimum of 15 samples and an epsilon of 0.1; for HDBSCAN the minimum number of samples was set to 10 and the minimum cluster size to 20.
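A minimal sketch of the two configurations, with a random stand-in for the (n_images, 1000) probability matrix; the `hdbscan` package is the McInnes et al. (2017) implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan

# Toy stand-in for the (n_images, 1000) ResNet50 probability matrix.
X = np.random.default_rng(0).random((500, 1000))

db = DBSCAN(eps=0.1, min_samples=15, metric="cosine").fit(X)
hdb = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=20).fit(X)

# Label -1 marks noise; these are the anomalies inspected for relevance.
print((db.labels_ == -1).sum(), (hdb.labels_ == -1).sum())
```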

4.4.2 Evaluation of cluster quality

The cluster algorithms were evaluated with the metrics purity, Rand Index, precision and recall (Manning, Raghavan, & Schütze, 2008). To determine which cluster algorithm performs best, both were also qualitatively compared. When determining the purity of the clusters, the noise was not taken into account since noise is not really a cluster. The purity was calculated as

purity(C, X) = \frac{1}{N} \sum_k \max_j |c_k \cap x_j|,

where C is the set of clusters and X is the set of classes. Purity averages, over all clusters, the maximum number of images with the same class that were clustered together.

The Rand Index was also calculated for each method, following the equation

RI = \frac{TP + TN}{TP + FP + FN + TN},

where the numbers of positives and negatives were determined by analyzing the possible pairs within the clustering (Manning et al., 2008), as can be seen in table 13. When two data points in the same cluster have the same ground-truth label, the pair counts as a true positive (TP). All positives (TP + FP) can be derived from the total number of possible pairs within a cluster. The true negatives can be derived from the total number of possible pairs in the data minus the number of positives (TP + FP) (Manning et al., 2008). This measure is the percentage of correct decisions.

Table 13: Determining the negatives and positives of a clustering

                 Same cluster           Different cluster
Same class       True Positives (TP)    False Negatives (FN)
Different class  False Positives (FP)   True Negatives (TN)
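Both measures can be sketched directly from predicted cluster labels and ground-truth classes; noise, labeled -1, is excluded from purity as described above.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Average share of the majority class over all non-noise clusters."""
    total, n = 0, 0
    for c in set(clusters) - {-1}:
        members = [cls for k, cls in zip(clusters, classes) if k == c]
        total += Counter(members).most_common(1)[0][1]  # majority class count
        n += len(members)
    return total / n

def rand_index(clusters, classes):
    """Fraction of correct same/different decisions over all point pairs."""
    tp = fp = fn = tn = 0
    for (k1, c1), (k2, c2) in combinations(zip(clusters, classes), 2):
        if k1 == k2:
            tp += (c1 == c2); fp += (c1 != c2)
        else:
            fn += (c1 == c2); tn += (c1 != c2)
    return (tp + tn) / (tp + fp + fn + tn)
```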

Additionally, the Friedman test was used to find significant differences between the individual performance measures of the algorithms and the hierarchic representations. This is a suitable test to compare the results of more than two classifiers when the samples are related (Stapor, 2017). Using the Friedman test, the evaluation results for the different hierarchy levels (table 14) were tested for significance, for both DBSCAN and HDBSCAN. For DBSCAN, significant differences were found only on the purity results, which suggests that the hierarchy level does not really matter for DBSCAN. The analysis of the HDBSCAN results, however, suggests that the 1000 prediction scores ensure the best cluster performance: for HDBSCAN there were significant differences for purity and precision, and in all three cases the use of the 1000 prediction scores performed best. Therefore, the performance of HDBSCAN and DBSCAN will be compared qualitatively using the 1000 prediction scores.

Table 14: Evaluation scores for DBSCAN and HDBSCAN with different levels of hierarchy

Method           Precision   Cluster Recall   Rand Index   Purity
DBSCAN - 1000    37.2        71.7             93.3         62.1
DBSCAN - 608     31.8        77.2             91.8         56.1
DBSCAN - 389     24.8        72.6             90.0         47.5
HDBSCAN - 1000   56.6        65.4             96.0         72.3
HDBSCAN - 608    54.5        64.3             95.9         69.1
HDBSCAN - 389    52.0        62.8             95.7         65.9

Comparing the performance measures of DBSCAN and HDBSCAN on the Police concepts set, only the recall is lower for HDBSCAN. This means that, on these measures, HDBSCAN produces the better clusters. Furthermore, DBSCAN is faster than HDBSCAN, since HDBSCAN performs DBSCAN for varying epsilon values whereas DBSCAN uses just one epsilon value. This varying epsilon is probably also the reason for the difference in cluster performance.

Table 15: Anomaly evaluation of relevancy set clustering

                  DBSCAN - 1000   HDBSCAN - 1000
Relevance ratio   29.1            27.9
Total Recall      66.6            72.3

For the clustering of the images within the Relevancy set, the best hierarchy representation of the ResNet50 predictions was used: the 1000 prediction scores. Again the two described configurations of DBSCAN and HDBSCAN were used. Next, the ratio of relevant images among those clustered as anomalies by HDBSCAN and DBSCAN was calculated. Subsequently, the total recall was calculated by comparing the total number of relevant images in the Relevancy set with those clustered as anomalies. Table 15 shows the relevance of the anomalies. HDBSCAN has a higher total recall, which means that fewer relevant images are missed; the missed relevant images are assigned to a cluster. When all images are classified as noise, the relevance ratio is 28.2 (578/2051) and the total recall is 100.0, since the noise then contains all relevant images in any case. This means that only DBSCAN performs better on relevance ratio than this baseline. Since the relevance ratio of HDBSCAN is lower than that of DBSCAN but its total recall is higher, it is plausible that HDBSCAN simply detected more noise. Nevertheless, the difference between the recall scores is relatively large and the ratio difference is not. Therefore, it seems that the noise detected by HDBSCAN contains more relevant images, and HDBSCAN is thus favoured for this task as well.

4.4.3 Generating cluster labels

To generate labels for the predicted clusters, two methods were designed and evaluated in combination with the 1000 prediction scores of the Police concepts set, since the 1000 prediction scores work best. As the zero-shot method of Norouzi et al. (2013) can be used to predict labels of images, it was tested whether this method can also be used to predict a cluster label. These predicted labels were compared against labels created from the WordNet hypernyms of the top 3 predictions of the cluster images, which served as a baseline for the zero-shot method.

WordNet hypernym cluster labels.

Firstly, the labels and corresponding WordNet hypernyms of the top 3 predictions were counted. The idea is that the most common hypernym will describe the similarity between the cluster members. However, when using just the frequency of hypernyms or labels, overlap occurs between cluster names. Therefore, TF-IDF was used, because it takes the rarity of terms into account. The assumption is that this results in better cluster labels, since the TF-IDF values indicate which terms best represent a cluster and distinguish it from the others. In this case every cluster c within the set of clusters C was treated as a document to determine the inverse document frequencies (IDF) of labels and their hypernyms. The frequency f of a term t was normalized by dividing it by the total frequency of the labels within the cluster c. The term frequency (TF) was calculated as

tf(t, c) = \frac{f_{t,c}}{\sum_{t' \in c} f_{t',c}}.

Subsequently, the IDF was determined with

idf(t, C) = \log \frac{N}{|\{c \in C : t \in c\}|},

which describes the rarity of a term across all clusters. Merging both equations gives the TF-IDF of a cluster label:

tfidf(t, c, C) = tf(t, c) \cdot idf(t, C).

Using these TF-IDF values to determine which hypernyms describe a cluster best, the three hypernyms with the highest values were picked as labels.
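A minimal sketch of this labeling step, assuming `cluster_terms` maps each cluster id to the list of label and hypernym strings collected from its members' top-3 predictions.

```python
import math
from collections import Counter

def tfidf_labels(cluster_terms, top_n=3):
    """Pick the top_n terms per cluster by TF-IDF over hypernym counts."""
    n_clusters = len(cluster_terms)
    # Document frequency: in how many clusters does each term occur?
    df = Counter(t for terms in cluster_terms.values() for t in set(terms))
    labels = {}
    for cluster, terms in cluster_terms.items():
        counts, total = Counter(terms), len(terms)
        scores = {t: (f / total) * math.log(n_clusters / df[t])
                  for t, f in counts.items()}
        labels[cluster] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return labels
```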

Zero-shot cluster labels.

The zero-shot method of Norouzi et al. (2013) was also used to determine cluster labels. The top 3 predictions of all images in the cluster were used in combination with the trained Word2Vec model. For these top 3 predictions the TF-IDF values were calculated and used in the zero-shot method instead of the prediction probabilities of the labels:

f(x) = \sum_{t=1}^{T} tfidf(y(x, t)) \cdot s(y(x, t)).

Converting the individual predicted image labels to word vectors and adding these vectors together gives a new vector. It was then determined which three words are closest or most similar to this vector; Word2Vec determines the distance between vectors with the cosine similarity.

4.4.4 Evaluation of generated cluster labels

For each cluster, three labels were predicted with both methods. To measure the quality of a generated label, the average similarity was calculated: the cosine similarity between the word vector of the predicted label and the word vectors of all ground-truth labels of the images within the cluster. From these similarity scores the average label similarity (ALS) of a cluster label can be derived:

ALS = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_similarity}(wv\_true_i, wv\_pred),

where N is the number of images within a detected cluster, i ranges over these images, wv_true_i is the word vector of the ground-truth label of image i, and wv_pred is the word vector of the predicted label.

After determining all average similarity scores for the predicted cluster labels, the mean over all clusters was calculated. Following the formula

MALS = \frac{1}{C} \sum_{c=1}^{C} ALS_c,

the mean average label similarity (MALS) was determined for each of the three predicted labels, where C is the number of clusters detected by DBSCAN or HDBSCAN. This results in table 16, which shows the mean average label similarity for each labeling method.
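A minimal sketch of these two measures, assuming the trained gensim Word2Vec model `w2v` and that every label is in its vocabulary.

```python
import numpy as np
from numpy.linalg import norm

def cos(a, b):
    return float(np.dot(a, b) / (norm(a) * norm(b)))

def als(pred_label, true_labels, w2v):
    """Average similarity of one predicted label to all ground-truth labels."""
    p = w2v.wv[pred_label]
    return np.mean([cos(w2v.wv[t], p) for t in true_labels])

def mals(clusters, w2v):
    """Mean ALS; `clusters` is a list of (pred_label, true_labels) tuples."""
    return np.mean([als(p, ts, w2v) for p, ts in clusters])
```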

Table 16: Mean average label similarity of predicted cluster labels

              Label 1               Label 2               Label 3
Label method  DBSCAN    HDBSCAN     DBSCAN    HDBSCAN     DBSCAN    HDBSCAN
WordNet       29.3      27.9        25.0      32.8        24.2      22.0
Zero-shot     30.2      33.6        26.6      25.3        28.3      34.4

Table 16 shows that the zero-shot labels generated for the DBSCAN clusters are more similar to the ground truth than the labels generated with WordNet. Furthermore, the zero-shot method predicts better cluster labels in 5 out of 6 cases.

5. CONCLUSIONS

The main research question, 'To what extent can models that were already trained on ImageNet identify relevant images in investigations of confiscated phones within the police department of Amsterdam?', can now be answered using the answers gathered with the experiments.

From the results it can be concluded that ResNet50 performs better at classifying images beyond the 1000 ImageNet concepts. Predicting the relevance of images and using a textual query work better in combination with the 608 prediction scores of ResNet50. Furthermore, the combination of ResNet50 and the zero-shot method of Norouzi et al. (2013) to rank images on similarity with a textual query performs better than the combination with DenseNet201. In addition, ResNet50 was also much faster at classifying images than DenseNet201.

Ranking the images on relevance works best with a random forest model. This makes it easier to review the contents of a phone, since the least important images occur at the bottom of the ranking. It gives a quick overview of potentially relevant items on a confiscated phone, without having to search too specifically.

The presented textual query method works well for classes related to the ImageNet concepts. However, it is harder to find relevant images for classes that are not really related to the ImageNet concepts. For example, the images of bomb vests and drugs didn’t really occur at the top of the ranking. Concepts like scooter and document were relatively easy to retrieve.

Image clustering seems to be a good method to find relevant images within the contents of a confiscated phone. Based on the obtained results it can be concluded that HDBSCAN produces better clusters. In addition, the noise it detects contains more relevant images, which means that fewer images are missed. Finally, the zero-shot method of Norouzi et al. (2013) can be used for retrieving images with textual queries as well as for cluster labeling; in combination with DBSCAN it performs better than generating labels with WordNet hypernyms.

All in all, it can be concluded that ResNet50 pre-trained on ImageNet can indeed be used more broadly. Concepts related to the ImageNet concepts can relatively easily be retrieved with a textual query, and the predictions can be used to train a random forest model. Clustering the images based on the prediction scores can also give valuable insights into the contents of a confiscated phone, by looking at the anomalies and the generated labels.

5.1 Future work

The conclusions only apply to object recognition, since the models used were trained on ImageNet. To be able to find all relevant images in a specific case it might be useful to create a pipeline that also works with persons or faces. OCR could also be used to retrieve valuable information from an image set. Future work should therefore look into a pipeline that uses a combination of recognition tasks.

The binary models that predict whether an image is relevant are most likely case-specific: a different model will probably be needed for cases of a different nature. The images that are relevant in a human-trafficking case are most likely not interesting for a case related to fugitives. Further research will have to confirm whether a specific relevance classifier should be trained for each case-related topic. This method should also be tested with more data from multiple confiscated phones.

Replacing the softmax layer of the pre-trained model and re-training the network weights with police-related concepts might improve the ranking of image similarity to the textual query. Enriching the Word2Vec model with documents or articles related to crime and terrorism might also provide better predictions.

5.2 Acknowledgements

Firstly, I would like to thank Dominique Roest for giving me the opportunity to do an internship within Team Rendement Operationele Informatie (TROI). She also made it possible for me to present my research at the Data Science meetup of the Landelijke Eenheid and at a team meeting of team Pre-Development.

Secondly, I would like to thank my external supervisor Gijs Smit and internal supervisor Thomas Mensink for providing valuable feedback. I would also like to thank my fellow students who were supervised by Thomas Mensink, because they gave me valuable insights during the peer sessions.

Finally, I would like to thank Maarten de Rijke for being my second reader on such short notice.

References

Agrawal, A., & Karnick, H. (2009). Unsupervised image clustering (Unpublished doctoral dissertation). Indian Institute of Technology, Kanpur.

Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2016). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425-1438.

Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537-550.

Bucher, M., Herbin, S., & Jurie, F. (2016). Improving semantic embedding consistency by metric learning for zero-shot classification. In European Conference on Computer Vision (pp. 730-746).

Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009) (pp. 248-255).

Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, pp. 226-231).

Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (Vol. 1, p. 3). IEEE.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3), 453-465.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY, USA: Cambridge University Press.

McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11), 205.

Nguyen, D. T., Alam, F., Ofli, F., & Imran, M. (2017). Automatic image filtering on social networks using deep learning and perceptual hashing during crises. arXiv preprint arXiv:1704.02602.

Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., ... Dean, J. (2013). Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650.

Stapor, K. (2017). Evaluating and comparing classifiers: Review, some recommendations and limitations. In International Conference on Computer Recognition Systems (pp. 12-21).

Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2017). Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600.
