
A low-cost, privacy-preserving machine learning approach for designated loading area state classification

Vincent Kieberl (11011041)

Bachelor thesis
Credits: 18 EC

Bacheloropleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
M. Groen MSc
Dr. N. Piersma
Urban Analytics
Faculty of Technology
Amsterdam University of Applied Sciences
Weesperzijde 190
1097 DZ Amsterdam


Abstract

Traffic data analysis is of vital importance for gaining insights into traffic flow, through which traffic bottlenecks can be identified and addressed accordingly. It is, however, of great importance that the privacy of the people filmed is preserved. We present an algorithm that can determine the state (occupied or not occupied) of a designated loading area (a designated parking space where delivery vehicles can load and unload goods) within ±5% of the ground truth in 90.77% of normal cases, while being able to run on a Raspberry Pi and processing all data locally on the device. The algorithm is implemented using a relatively simple multi-layer perceptron classifier. Due to its low complexity compared to other machine learning approaches, the algorithm is suitable for real-time use on low-performance computing devices such as the Raspberry Pi, making it portable while preserving privacy, because all image data is processed locally. Special cases, i.e. when the view of the designated loading area is obstructed, still require further research to be classified accurately.


Acknowledgements

Firstly, I would like to thank the Amsterdam University of Applied Sciences’ Urban Analytics research group as a whole for granting me the opportunity to work on this interesting project.

I would like to offer my special thanks to my supervisors, Dr. Nanda Piersma and Maarten Groen BSc, for all the advice, moral support and guidance I received from them during the writing of this thesis. Both of you have been extremely helpful and supportive; even in times when you were busy, you still took time to help me.

I am particularly grateful for the assistance offered by Maro van Andel of the Ambassade Hotel, who let us use their spaces to set up the recording devices used to gather the data used in this thesis, and by the technicians at the Ambassade Hotel who helped us set up the recording devices.

Heartfelt thanks to my friend Hanna van Aller for her willingness to proofread this thesis and for her emotional support during long days (and nights) in the university library.

Lastly, I would like to thank Inge Schuddeboom for making the time to proofread this thesis on short notice.


Contents

1 Introduction
2 Related work
  2.1 Vehicle counting
  2.2 Automatic Number Plate Recognition (ANPR)
  2.3 Incident detection
3 Approach & results
  3.1 Data acquisition
  3.2 Full-image object detection approach
    3.2.1 The Tensorflow Detection Model Zoo
    3.2.2 The SSD-Mobilenet v1 model
  3.3 3-class subframe-based approach
    3.3.1 Algorithm
    3.3.2 Preprocessing the images
    3.3.3 Scikit-Learn MLP classifier
    3.3.4 Results on the preliminary test set
    3.3.5 Results on the day 1 test set & day 2 and 5
    3.3.6 Pre-processing with histogram equalization
    3.3.7 Adding training data from day 1
    3.3.8 Changing the model hyperparameters
  3.4 Binary subframe-based approach
  3.5 The final output algorithm
4 Discussion
5 Conclusion
6 Future work


List of Figures

1  The data acquisition set-up
2  Sample image of acquired data from RPi
3  Tensorflow ssd_mobilenet_v1_coco sample image
4  Sample subframes for the Designated Loading Area
5  Approach diagram
6  Confusion matrix (CM) for the 3-class MLPC on preliminary test data
7  CMs for 3-class MLPC on day 1 & day 2, 5 test data
8  CM for the 3-class MLPC on day 1, 2 and 5 with histogram equalization
9  CM for the 3-class MLPC on day 1, 2 and 5 with extra training data from day 1
10 CM for the 3-class MLPC on day 1, 2 and 5 with extra training data from day 1 (SGD solver & log. act.)
11 CM for binary MLPC on day 1, 2 and 5 test data (without extra day 1 data, SGD solver & log. act.)
12 Selection of subframes misclassified as 'empty' by binary MLPC
13 Selection of subframes correctly classified as 'full' by MLPC, similar to misclassified subframes


1 Introduction

Amsterdam consists of bustling and lively neighborhoods where people live and work within the same direct vicinity. This causes not only commuter traffic, but also supply traffic from and to businesses, and deliveries to residents. It is important for the local government as well as local businesses to gain insights from data to determine where traffic bottlenecks emerge. With this data, actions can be undertaken to reduce congestion, such as the intelligent use of designated loading areas and the combination of supply traffic runs with waste collection runs. These measures are becoming ever more important given, for example, the increase in tourist numbers in recent years [1], which logically causes larger and more frequent deliveries for local businesses as profit increases.

The Amsterdam University of Applied Sciences' Urban Analytics group aims to develop methods to automatically gather and process data about the city's traffic flows, in order to develop algorithms that can predict traffic flows and thereby provide insight into where the municipality can effectively change the infrastructure to improve traffic flow.

This thesis is aimed at exploring the possibilities of automating the process of gathering and analyzing raw data about a street in the city center using visual data (images). It intends to serve as preliminary research towards the development of algorithms that fully automate the process of traffic data analysis while preserving privacy, and the prediction of traffic flows. Due to the nature of the input data (images) and their information sensitivity, it is imperative that the algorithms to be devised do not store images (not in the cloud, and not even on the device itself). The algorithms should process the images in real-time to obtain anonymous data only about the state of the designated loading area (DLA), yielding a solution that is designed with privacy in mind (privacy by design [2]). Another constraint is that the algorithms should be able to run (in real-time) on a Raspberry Pi 3 model B, in order to create a low-cost solution that can be easily mounted on, for example, lamp posts. All data should be processed locally, thus not requiring an internet connection, to reduce the probability of raw image data falling into the wrong hands, maintaining an approach that incorporates privacy by design. We aim to answer the following research question:

How can we develop an algorithm, running on a Raspberry Pi with PiCamera, that can accurately determine in real-time whether the designated loading area is full or empty, and for what percentage of a 15-minute interval it has been in that state, while incorporating the privacy-by-design principle?

We will discuss the methods we have explored to automatically determine the presence of a vehicle in a predefined area, which might be universally applicable with little pre-processing and training.


2 Related work

Automatic traffic video analysis has been a fast-emerging field in artificial intelligence ever since hardware costs for surveillance cameras dropped, sparking their large-scale deployment. Most research, however, is centered around vehicle counting, Automatic Number Plate Recognition (ANPR) and incident detection [3]. We will give a concise summary of the most relevant recent work on these topics in the sections below.

2.1 Vehicle counting

Vehicle counting techniques are most commonly used to gather data about road usage and traffic flow, and were originally implemented using inductive-loop traffic detectors. These detectors are installed in the road pavement and detect traffic using electromagnetic fields: the presence of a vehicle above the detection loop changes the alternating-current properties of the circuit [4]. These detection loops are, however, high in installation and maintenance costs, and they generally do not discriminate between different types of vehicles. A visual approach is beneficial, seeing that it would theoretically be able to discriminate between vehicle classes while using hardware (cameras) that is already in place. This is the reason why, in the last decade, significant research has been carried out on computer vision techniques for urban traffic analysis, and specifically vehicle counting. Research has involved the use of image processing and statistical models like Gaussian mixture models (see [5]) and the Scale Invariant Feature Transform (SIFT, see [6] and [7]), but also, increasingly in recent years, machine learning techniques such as k-nearest neighbor algorithms (see [8]) and Convolutional Neural Networks (CNNs, see [9]). The references above concern research conducted on daytime images. There has also been extensive research into night-time traffic surveillance, mostly using vehicle features such as headlights to guide classifications (e.g., [10], [11]). In [12], research was carried out towards the privacy protection of filmed people in visual traffic monitoring systems.

2.2 Automatic Number Plate Recognition (ANPR)

Automatic Number Plate Recognition (ANPR) refers to systems that are able to automatically register vehicle number plate digits and letters from visual data. ANPR techniques are widely used today for highway speed control and the localization of major criminals [13], [14]. However, research is still being conducted into using ANPR techniques for actual vehicle detection and counting (see [15]) and the implementation of intelligent collision warning systems in vehicles (see [16]).

2.3 Incident detection

Incident detection in the context of automatic traffic video analysis refers to the detection of traffic accidents using images from traffic cameras, and to collision warning systems in vehicles. For the scope of this thesis, we will only discuss research conducted towards the automatic detection of traffic accidents using images from traffic cameras.

Substantial research has been carried out in recent years, mostly using a combination of different machine learning techniques. For example, [17] use hybrid Support Vector Machines (SVMs) to classify and track vehicles, deep learning neural networks to classify vehicles as ambulances, and multinomial logistic regression (MLR) to detect traffic accidents. Sameen & Pradhan [18] have used Recurrent Neural Networks (RNNs) to predict the severity of traffic accidents using image data.

To conclude this section: the research above looks extremely promising for the future, but it is not entirely in line with our research. It either uses a communications system to transfer data from a recording device to a computer with more processing power (making the approach inherently more vulnerable to attacks, potentially leading to the invasion of privacy), or uses a high-performance computer that is directly connected to the recording device, rendering the approach not portable. However, [19] have conducted research towards automated vehicle counting using Raspberry Pi computers, although their approach did not use machine learning techniques and did not discriminate between vehicle classes.


3 Approach & results

This section outlines our approach: we describe how the research was conducted, what decisions were made and why, and the results of our experiments. Considering our research question, we have chosen a machine learning-oriented approach that uses Tensorflow object detection models based on Region-based Convolutional Neural Networks (R-CNNs), as well as less complex Multi-Layer Perceptron Classifiers (MLPCs), to process input image data and to obtain text-based output data that informs the user about the time intervals in which the Designated Loading Area (DLA) was occupied or not occupied.

3.1 Data acquisition

Machine learning models require training data to be able to accurately classify new, unseen data. Seeing that there was no data available to us to test (and train) our models, we first acquired this data ourselves.

Two Raspberry Pi (RPi) model 3B computers were installed at a hotel in the center of Amsterdam, one equipped with a PiCamera v2 and the other with a PiNoir v2 night camera. These RPis were installed in a window sill on the first floor of the hotel (about 2.5 meters above ground level), where the cameras had a view of the DLA and the street in front of the hotel. Both RPis were separately connected to a Western Digital Elements 1 TB external hard disk drive that was used to store the captured images. They were connected to the hotel's secure wireless network to allow system time synchronization, and ran Raspbian Stretch (kernel 4.14). In the production stage, however, the RPis should obtain the correct time from a real-time clock (RTC) module to adhere to the privacy by design principle.

The RPis were configured to automatically start a system service (using systemd) after boot. This system service consisted of a Python script that would:

1. wait 60 seconds in order to ensure a wireless network connection would have been established by the operating system;

2. check whether a network connection was established successfully by creating a connection to google.com;

3. request the current time from the NTP.org Europe server (europe.pool.ntp.org) and compare this time to the system time in order to ensure that images that would be saved later on would have the correct timestamp;

4. check the available disk space on the mounted external hard drive;

5. if all above checks succeeded, the script would initiate the process of capturing frames from the PiCamera at a rate of about one frame per second (see footnote 2).

If any of the checks performed at boot failed, the script would send an email to the administrator with a detailed error message and abort.
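As an illustration, a service script along these lines could look as follows. This is our reconstruction of the described behavior, not the thesis script: the data directory, the sleep value, and the notify_admin helper are hypothetical, and the NTP and disk-space checks are only indicated.

import socket
import time
from datetime import datetime

from picamera import PiCamera  # camera library available on Raspbian

DATA_DIR = "/mnt/usbdrive"  # hypothetical mount point of the external HDD

def notify_admin(message):
    # Stub: the real script e-mailed the administrator a detailed error report.
    print("ERROR:", message)

def network_up(host="google.com", port=80):
    # Check 2: verify connectivity by opening a TCP connection to google.com.
    try:
        socket.create_connection((host, port), timeout=10).close()
        return True
    except OSError:
        return False

def capture_loop():
    # Check 5: capture roughly one frame per second from the PiCamera.
    with PiCamera(resolution=(1920, 1080)) as camera:
        while True:
            stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
            camera.capture(f"{DATA_DIR}/{stamp}.jpg")
            # The thesis tuned the sleep value experimentally to compensate
            # for capture overhead; 0.9 s is a placeholder.
            time.sleep(0.9)

def main():
    time.sleep(60)  # check 1: let the OS join the wireless network
    if not network_up():
        notify_admin("no network connection at boot")
        return
    # Checks 3 and 4 (NTP time comparison, free disk space) would go here.
    capture_loop()

if __name__ == "__main__":
    main()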

These checks were implemented to guarantee the success of the data acquisition process on the first try, so that no time would be lost restarting the entire process if any kind of system failure arose (e.g. the system time not being correct, or insufficient disk space on the external hard drive), especially given the time constraint of this research project.

Footnote 2: The frame rate is not exact because it was implemented with the Python time module's sleep function, which pauses a program for a number of seconds set by the programmer. The amount of time it takes for the program to run until it reaches this sleep command depends on system factors such as current CPU load, other running programs, and bus speed, which is why we have experimentally determined the input value for the sleep function on our RPis.

Figure 1: The data acquisition set-up we used to obtain the dataset.

The RPis each successfully gathered about 660 GB of image data during 5 days (from 15:00 on day 1 until 15:00 on day 6; see footnote 3) at the end of April 2018, consisting of about 457,000 image files with an average file size of 1.44 MB. The images were saved in a JPEG container at a resolution of 1920 by 1080 pixels. After inspecting the captured data, we discovered that the night-time images from the RPi with the PiNoir camera were of no use, since the PiNoir camera requires infrared LEDs to function as an actual night camera. As a consequence, our usable data was significantly reduced to about 13 hours per day, and only included the data obtained from the PiCamera v2, as the daytime images taken by the PiNoir camera were essentially the same as those taken by the RPi with the regular PiCamera.

Footnote 3: Exact dates have been omitted in this thesis to ensure complete privacy for all people that have been filmed.

3.2 Full-image object detection approach

In order to keep our approach as simple as possible, we first explored the possibilities of using a highly complex object detection algorithm with the whole image (1920 by 1080 pixels) as input. The method, advantages, limitations, and drawbacks of this approach are described below.

Figure 2: A sample image from the dataset we acquired from the Raspberry Pi with PiCamera v2, taken around noon on a partly cloudy day. Some parts of the image are blurred for privacy reasons.

3.2.1 The Tensorflow Detection Model Zoo

Tensorflow is an open-source machine learning platform that is often used for state-of-the-art machine learning applications. It provides a model zoo that contains a collection of pre-trained object detection models that can be downloaded for use on new test data (see footnote 4). These pre-trained models make Tensorflow easily accessible to users who do not wish to, or cannot, train their own models (e.g. due to time or labor constraints). For this research project, we have chosen not to train our own Tensorflow R-CNN models, as this not only requires access to high-grade computing servers, but is also very labor-intensive: for an object detection model, all training and test data must be annotated with the bounding boxes of the objects that the algorithm should detect. For a model that is only trained to output a classification, the annotation task is much easier, considering that training and test data must only be classified per image, instead of per object in the image.

Footnote 4: See https://github.com/tensorflow/models/blob/master/research/object_

3.2.2 The SSD-Mobilenet v1 model

We have used the ssd_mobilenet_v1_coco model for whole-image object detection on some of the data that we acquired from the PiCamera. This model is based on the Single Shot MultiBox Detector (SSD) by [20] and the MobileNets class of efficient convolutional networks for mobile vision applications by [21]. The model was trained on the Microsoft Common Objects in Context (COCO) dataset (see [22]), which contains 80 classes (see footnote 5).

Footnote 5: See https://github.com/nightrome/cocostuff#labels for a full list of classes that the model can detect.

We have tested the model on 100 random images from the obtained dataset. These images were not rescaled or cropped and thus had a resolution of 1920 by 1080 pixels. The model took about 3.2 seconds per processed image on average on a 2015 MacBook Pro (Intel® Core™ i7-5557U CPU @ 3.10GHz × 4, 16 GB RAM, Intel® Iris 6100 integrated graphics) running Ubuntu 16.04 LTS. Tensorflow was built from source on this machine to make optimal use of the available CPU instruction sets. Our performance was much worse than that advertised in the Tensorflow model zoo documentation, which states that the ssd_mobilenet_v1_coco model processes a 600 by 600 pixel image in 30 ms on an Nvidia® GeForce™ GTX TITAN X graphics card. Naturally, this is to be expected, seeing that the machine we tested on does not feature a dedicated graphics card and that our input image is more than twice the size of the 600 by 600 pixel image that the model zoo documentation describes.
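For reference, running a frozen model zoo graph on one frame looks roughly like the sketch below. It is written against the Tensorflow 1.x API in use at the time and follows the object-detection API's tensor naming convention; the model path and frame file name are placeholders, and this is not necessarily the exact test script used in the thesis.

import numpy as np
import tensorflow as tf  # 1.x API
from PIL import Image

PATH = "ssd_mobilenet_v1_coco/frozen_inference_graph.pb"  # model zoo download

# Load the frozen detection graph.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

# Run one 1920x1080 frame through the detector.
with tf.Session(graph=graph) as sess:
    image = np.array(Image.open("frame.jpg"))  # RGB uint8 array
    boxes, scores, classes, num = sess.run(
        ["detection_boxes:0", "detection_scores:0",
         "detection_classes:0", "num_detections:0"],
        feed_dict={"image_tensor:0": image[None, ...]})  # add batch dimension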

From these intermediate results we conclude that if running this algorithm on a high-grade laptop takes around 3.2 seconds on average, it will definitely not run in real-time on a Raspberry Pi (the RPi does not have a dedicated GPU, and its CPU is not nearly as powerful as the MacBook Pro CPU). The time performance of the algorithm might have been better if the model were trained only on classes that are of interest to our case (cars, buses, trucks), seeing that this would decrease the number of computations required. We leave this as a recommendation for future work.

Figure 3: Sample output image from the Tensorflow ssd_mobilenet_v1_coco model, showing the bounding boxes for each detected object.

The results obtained from the approach above (illustrated in figure 3) stress the need for a simpler approach that requires less computational power, which we describe in the next section.

3.3 3-class subframe-based approach

3.3.1 Algorithm

We have explored a method that is computationally less expensive than the whole-image object detection approach, making it potentially suitable for use on low-performance computing devices such as the Raspberry Pi. Our method consists of three parts that are executed sequentially for each image. First, a subframe is extracted from the input image. This subframe contains only a small part of the designated loading area, chosen such that it always contains a part of the vehicle if the DLA is occupied, as in figure 4.

(a) DLA empty. (b) DLA occupied. (c) DLA state unknown.

Figure 4: Sample subframes for the Designated Loading Area.

Subsequently, this subframe is used as input for a multi-layer perceptron (MLP) classifier that classifies the subframe as 'empty', 'full' or 'unknown'. We have opted for the MLP classifier because it requires very little processing power compared to whole-image object detection approaches, while still being able to find complex non-linear relations in input data.

The 'unknown' class is assigned to images that do not show a clearly 'empty' or 'full' DLA (e.g. when a vehicle is in the process of parking on the DLA, or when the DLA is occluded by passing cars or trucks). If a subframe is classified as 'empty' or 'full', the time stamp of the original image is recorded in order to later determine the time interval during which the DLA was occupied. Initially, our approach was to pass a subframe on to the more complex Tensorflow SSD MobileNet v1 COCO object detection model if it was classified as unknown, hypothesizing that this would be feasible for low-performance devices, considering that the subframe is much smaller than the original image (and would thus be processed much faster). However, the coming sections will clarify why a different approach was eventually chosen.

Figure 5: Our approach, involving data acquisition, then subframe extraction, and finally classification.

3.3.2 Preprocessing the images

We have used data from days 2, 3, 4, and 5 to train, validate and (preliminarily) test our model. Data from day 1 was deliberately omitted in order to leave part of the obtained data untouched for later testing. The dataset from days 2 to 5 consisted of 2318 images. These images were annotated by hand with the classes 'empty', 'full' and 'unknown'. Of this dataset, 1265 images were annotated as 'empty', 853 as 'full', and 164 as 'unknown'. The class ratio in the dataset was determined by the natural availability of the different classes. As described above, the images were then preprocessed by creating subframes from the originals. These subframes were created using the Python Image Library (PIL), which produced subframes with a resolution of 352 by 140 pixels. Internally in Python, these images were represented as a 352 by 140 by 3 Numpy matrix; the third dimension is of size 3 because the images were processed in RGB format. Subsequently, the subframes were normalized using the scikit-image library, so that all RGB values would be represented as floating point numbers between 0.0 and 1.0 instead of integers between 0 and 255. Lastly, each image was reshaped to a vector of dimensions 1 by 352 × 140 × 3 = 147,840. This vector was then appended to a matrix that at the end of the process contained all input images as rows. For a complete dataset of 2318 images, this produces a matrix of shape 2318 by 147,840 (one image per row).

3.3.3 Scikit-Learn MLP classifier

The annotated images were used to train a multi-layer perceptron (MLP) classifier (see footnote 7) using the Python Scikit-Learn library [23]. The Scikit-Learn implementation of the MLP classifier features different settings that users may adjust. For this experiment, we have used the LBFGS algorithm (see footnote 8) as the solver (the optimization algorithm), 1 × 10⁻⁵ as the regularization term α (Scikit-Learn's alpha parameter), (5, 2) as the hidden layer sizes, and the value 1 as the random state (see footnote 9). For all other available settings, the Scikit-Learn default values were used. The Scikit-Learn library ships with built-in train-test splitting support, enabling us to specify a percentage of our data set to be used for testing purposes, while the remainder is used as training data. The rows of the input data matrix are selected randomly, with a random state seed of 42. We have used this Scikit-Learn function to divide our data set into 70% training data and 30% test data. During the model training phase, we performed k-fold cross-validation (CV) with k = 5 on the training data to guard against overfitting. This cross-validation produced the accuracy scores in table 1.

fold accuracy score

1 0.9755

2 0.9815

3 0.9876

4 0.9690

5 0.9814

Table 1: Accuracy scores for the k-fold CV on the 3-class MLPC with k = 5.

Footnote 7: See http://scikit-learn.org/stable/modules/neural_networks_supervised.html for a detailed description of the Scikit-Learn MLPC implementation.

Footnote 8: Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm; a quasi-Newton optimization method that has been specifically designed for limited-memory computer systems. For more information, please see [24].

Footnote 9: The seed used by the random number generator for generating a starting point in the cost minimization.


3.3.4 Results on the preliminary test set

After training, the model was tested on the remaining 30% of the data set (the preliminary test set). This test set contained 389 samples annotated as 'empty', 266 'full' samples, and 40 samples that the algorithm should classify as 'unknown'. The results are shown in the confusion matrix in figure 6.

Figure 6: The normalized confusion matrix for the 3-class MLP classifier on the preliminary test data (total accuracy = 0.9899).

These results show good performance for the 'empty' and 'full' cases, but also that the algorithm still needed improvement on the 'unknown' cases. This might be due to the ambiguity of a subframe that should be classified as unknown, as we have defined these cases to essentially be the 'rest' bucket for cases that fit neither 'empty' nor 'full': they contain images in which the DLA state is unclear, either due to occlusion or due to a vehicle being in the process of parking.

3.3.5 Results on the day 1 test set & day 2 and 5

We then tested the model using the previously omitted data from day 1 (this data was in neither the training set nor the preliminary test set). The set contained 552 images to be classified as 'empty', 188 as 'full', and 4 as 'unknown'. The model accuracy on this test set was extremely low compared to the accuracy on the preliminary test set (0.2500 vs. 0.9899). The confusion matrix is shown in figure 7a.

We hypothesized that the images from day 1 might have been brighter than the images from the other days. To test this hypothesis, we also ran the algorithm on test images taken from leftover data from days 2 and 5 (naturally, this data was not used anywhere before). This produced an even lower accuracy score of 0.0471; the confusion matrix is shown in figure 7b. These results are surprising, as parts of days 2 and 5 were also in the training set: if the model's failure to accurately predict the day 1 data were caused by a difference in brightness and/or contrast, it should have been able to (at least partially) correctly predict data from days 2 and 5.

Figure 7: Normalized confusion matrices for the day 1 test data (a, accuracy 0.2500) and the day 2 & 5 test data (b, accuracy 0.0471) on the 3-class MLP classifier.

3.3.6 Pre-processing with histogram equalization

In order to completely rule out brightness and/or contrast as a factor in algorithm performance, we re-ran the complete training and testing pipeline, this time applying histogram equalization to all input images. The results on the preliminary test set were nearly the same as without equalization: the accuracy score was 0.9931. However, the accuracy on the test set from day 1 and the test set with leftover data from days 2 and 5 did not significantly improve (0.1345). The confusion matrix is shown in figure 8.
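Histogram equalization can be slotted into the preprocessing step from section 3.3.2 with one extra call; a minimal sketch follows. Equalizing each RGB channel independently is our assumption, since the thesis does not specify how color images were equalized.

import numpy as np
from skimage import exposure

def equalize(pixels):
    # Equalize each RGB channel independently; input and output are float
    # arrays in [0.0, 1.0] with shape (height, width, 3).
    return np.dstack([exposure.equalize_hist(pixels[..., c])
                      for c in range(3)])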

Figure 8: The normalized confusion matrix for the 3-class MLP classifier on the test data from days 1, 2 and 5 with histogram equalization (total accuracy = 0.1345).

3.3.7 Adding training data from day 1

We assumed that the model had probably overfitted on the training data, seeing that it performed extremely well on data that was close (in capture time) to the training data, but was unable to generalize to new input data from day 1 and to new input data from other parts of days 2 and 5. To combat this issue, we expanded the training set with 1183 new images of the 'empty' class derived from the day 1 data subset. We only added images of the 'empty' class because this was the class that previously produced 0.0 accuracy. We then re-initialized the entire model, which produced an accuracy of 0.9857 on the preliminary test set (slightly higher than the model without the extra data), but still a meager 0.4657 accuracy on the test data from days 1, 2 and 5. Cross-validation accuracy scores are displayed in table 2, and the confusion matrix in figure 9.

fold accuracy score

1 0.9880

2 0.9840

3 0.9820

4 0.9710

5 0.990

Table 2: Accuracy scores for the k-fold CV (k = 5) on the 3-class MLPC that was trained with the extra day 1 data.


Figure 9: The normalized confusion matrix for the 3-class MLP classifier that was trained with extra data from day 1, run on the test data from days 1, 2 and 5 (total accuracy = 0.4657).

3.3.8 Changing the model hyperparameters

Considering that none of the above tests yielded a truly positive result on the final test data, we decided to look more closely at the model hyperparameters to prevent the overfitting observed before. The default activation function for the Scikit-Learn MLPClassifier class is the rectified linear unit (ReLU), which returns the value of the hidden layer if it is greater than zero, and zero otherwise. For the next test, we used the logistic (sigmoid) activation function instead. As described in section 3.3.3, LBFGS was used as the optimization algorithm in the results above; we changed this to stochastic gradient descent (SGD). Moreover, we changed the initial learning rate to 1 × 10⁻⁴ and set the learning rate mode to adaptive, meaning that Scikit-Learn keeps the learning rate constant as long as the loss keeps decreasing. If the solver fails to decrease the loss for two consecutive epochs by at least a pre-defined tolerance threshold tol (the default in the Scikit-Learn implementation is 1 × 10⁻⁴), the learning rate is divided by 5.
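In Scikit-Learn terms, the revised configuration corresponds roughly to the following sketch; the parameter names are the library's, and any settings not mentioned in the text are assumed to keep their defaults.

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver="sgd",              # stochastic gradient descent
                    activation="logistic",     # sigmoid instead of ReLU
                    learning_rate="adaptive",  # divide rate by 5 on stagnation
                    learning_rate_init=1e-4,
                    hidden_layer_sizes=(5, 2),
                    random_state=1,
                    max_iter=200)              # Scikit-Learn default cap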

The algorithm ran for 200 iterations (the Scikit-Learn default maximum number of iterations) on the training data, which included the extra data from day 1. At 200 iterations, the reported loss was 0.2881. Cross-validation scores were high across all folds (see table 3). Accuracy on the preliminary test set with these settings was 0.9467, while the accuracy on the day 1, 2 and 5 test data was 0.9921.


fold accuracy score

1 0.9530

2 0.9530

3 0.9510

4 0.9490

5 0.9530

Table 3: Accuracy scores for the k-fold CV (k = 5) on the 3-class MLPC that was trained with the extra day 1 data, using the SGD solver and the logistic activation function.

Figure 10: The normalized confusion matrix for the 3-class MLP classifier that was trained with extra data from day 1, run on the test data from days 1, 2 and 5 using the SGD solver and the logistic activation function (total accuracy = 0.9921).

As can be seen in the confusion matrix in figure 10, the results for the 'empty' and 'full' classes look fairly promising, although the results for the 'unknown' class are very poor, with 0.0 accuracy. The algorithm seemed unable to accurately predict the unknown cases, possibly because the images that should be classified as 'unknown' fluctuate greatly in image composition. A solution might be to remove the generic 'unknown' class and add two classes: one representing images that contain a vehicle that is currently parking, and one representing images in which the DLA is not visible due to occlusion.

3.4 Binary subframe-based approach

Despite the option of adding classes being a potential solution to the problem of low accuracy on the 'unknown' class, we opted for the opposite solution: using only the 'empty' and 'full' classes. We considered this viable because our research problem concerns a DLA that is used by delivery vehicles to park for short periods of time to load and unload supplies, meaning that any occlusion due to passing traffic, or due to parking movements by the vehicle itself, only concerns a small fraction of the captured frames. We clarify this with an example. Assume that a delivery vehicle is parked on the DLA for a period of 10 minutes, and that at the start and at the end of this period the vehicle is in the process of parking for 3 seconds each, bringing the total time that the algorithm should classify as 'unknown' to 6 seconds. Assume also that the recording device takes pictures at a frame rate of 1 frame per second: we then have 6 images out of 10 × 60 = 600 total that should be classified as 'unknown', or 1% of the images taken in that 10-minute window. This means that if the algorithm is only trained on 'empty' and 'full' images, the maximum accuracy loss would be 1%, with the advantage of a lower-complexity model.

To test whether the model would run with minimal loss of accuracy, we modified the training and test data to remove the 'unknown' class images. This produced a data set of 3331 images in total, of which 2448 were to be classified as 'empty' and 883 as 'full'. The dataset was automatically split into 70% training data and 30% preliminary test data by the same Scikit-Learn function used for the other models (with the same random state, 42). From the day 1, 2 and 5 test data, the 'unknown' images were also removed, leaving 1394 images in total, of which 1206 were to be classified as 'empty' and 188 as 'full'.

Figure 11: The normalized confusion matrix for the binary MLP classifier that was trained on the original training data, run on the test data from days 1, 2, and 5 using the SGD solver and the logistic activation function (total accuracy = 0.9871).

The results for this model show potential, but it is notable that with the binary MLP, the accuracy on the 'full' images is lower than with the 3-class model. Closer inspection of the test data and the model predictions reveals that the images misclassified as 'empty' are frames in which a person is standing next to the vehicle (see figure 12). However, not all frames in which a person is standing next to the vehicle were misclassified: the samples in figure 13 were correctly classified by the model as 'full'.


(a) (b) (c)

(d) (e) (f)

Figure 12: A selection of the subframes that were incorrectly classified as ’empty’ by the binary MLP classifier.

(a) (b) (c)

(d) (e) (f)

Figure 13: A selection of correctly classified subframes (’full’) that are visually similar to the frames that the binary MLP classifier misclassified.

Unfortunately, we do not have a clear explanation for why this binary model misclassified some of these images while still correctly classifying other, visually similar images.

3.5 The final output algorithm

Our project aim was to develop an algorithm that outputs statistics about the state of the designated loading area. We have devised a program that uses the MLP model described in the previous section to determine the DLA state, and uses that information to generate a report on what percentage of each t_fold = 15 minutes the loading area has been occupied or not occupied.

In the future, our program should read input directly from the Raspberry Pi camera, but for testing purposes the input data is currently read from a user-specified folder. The program assumes that data is stored in a day-hour subfolder structure (i.e. /data/1jan/10/ for the data from 10:00:00 to 10:59:59 on the 1st of January). The program reads the images per hour folder, iterating through all day and hour folders and generating predictions for images in blocks of t_fold. This means that after every t_fold minutes, the algorithm updates the statistics data structure (a Python dictionary) with the predictions from the last t_fold minutes, yielding a percentage per 15 minutes of the DLA state. We have run the statistics generation algorithm on part of the test data; this produced the results shown in table 4. Note that these results were generated using all images in the hour and day folders, meaning that the MLP classifier classified the DLA state approximately every second. For real-life situations this would probably be excessive, seeing that the DLA state is not bound to change every second. Instead, it would be more useful to record and classify one image every 10 or 30 seconds. The downside of that approach, however, is that the accuracy of the statistics generation program would depend more on chance. For example, if by coincidence many frames in one time window were taken while the DLA was occluded by a passing car, accuracy would decrease significantly, since the results in table 4 show that, in situations in which the DLA is occluded, images are often misclassified as empty.

Algorithm 1 Generate statistics about DLA state based on input images

1:  function getStatistics
2:    for every day do
3:      for every hour in day do
4:        t_start ← hour
5:        Δ ← 0
6:        predictions ← empty list
7:        for every image in hour do
8:          t_curr ← getImageTimestamp(image)
9:          if current image is first image in hour then
10:           t_last ← t_curr
11:         prediction ← classify(image)
12:         predictions.append(prediction)
13:         Δ ← Δ + (t_curr − t_last)
14:         t_last ← t_curr
15:         if (Δ ≥ t_fold) ∨ image.isLastImageInHour then
16:           updateStatistics(predictions, day, t_start)
17:           predictions ← empty list
18:           t_start ← t_start + t_fold
19:           Δ ← 0
20:   return statistics
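A Python rendering of Algorithm 1 could look like the sketch below. It is illustrative rather than the thesis program: the folder layout follows the /data/<day>/<hour>/ convention described above, while the classify and get_timestamp callables and the shape of the statistics dictionary are hypothetical stand-ins.

import os
from collections import Counter

T_FOLD = 15 * 60  # window length in seconds

def get_statistics(root, classify, get_timestamp):
    # classify(path) -> 'empty' or 'full'; get_timestamp(path) -> seconds.
    statistics = {}
    for day in sorted(os.listdir(root)):
        for hour in sorted(os.listdir(os.path.join(root, day))):
            folder = os.path.join(root, day, hour)
            images = sorted(os.listdir(folder))
            predictions, delta, t_start, t_last = [], 0, None, None
            for i, name in enumerate(images):
                path = os.path.join(folder, name)
                t_curr = get_timestamp(path)
                if t_last is None:  # first image in this hour
                    t_start, t_last = t_curr, t_curr
                predictions.append(classify(path))
                delta += t_curr - t_last
                t_last = t_curr
                if delta >= T_FOLD or i == len(images) - 1:
                    counts = Counter(predictions)
                    full = 100.0 * counts["full"] / max(1, len(predictions))
                    statistics[(day, t_start)] = full  # % full in this window
                    predictions, delta = [], 0
                    t_start += T_FOLD
    return statistics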


Table 4: Results from the statistics generation algorithm on part of the test data.

Day  Time   n_empty  n_full  % full MLPC  % full actual  Δ      Note
1    15:00  4        934     99.57        100.00         0.43
1    15:15  5        932     99.47        100.00         0.53
1    15:45  2        930     99.79        100.00         0.21
1    16:00  782      152     16.27        16.44          0.17
1    16:15  933      0       0.00         0.00           0.00
1    16:30  930      0       0.00         0.00           0.00
1    16:45  926      0       0.00         0.00           0.00
1    17:00  929      0       0.00         0.00           0.00
1    17:15  952      0       0.00         0.00           0.00
1    17:30  973      0       0.00         0.00           0.00
1    17:45  939      0       0.00         0.00           0.00
1    18:00  940      0       0.00         0.00           0.00
1    18:15  941      0       0.00         0.00           0.00
1    18:30  951      0       0.00         0.00           0.00
1    18:45  739      212     22.29        28.55          6.26
1    19:00  900      54      5.66         11.00          5.34
1    19:15  946      0       0.00         0.00           0.00
1    19:30  914      0       0.00         0.00           0.00
1    19:45  939      0       0.00         0.00           0.00
1    20:00  198      732     78.71        79.00          0.29
1    20:15  1        896     99.89        100.00         0.11
1    20:30  6        872     99.32        100.00         0.68
1    20:45  858      0       0.00         0.11           0.11
2    07:00  984      0       0.00         0.00           0.00
2    07:15  734      259     26.08        26.00          -0.08
2    07:30  352      646     64.73        100.00         35.27  1
2    07:45  180      813     81.87        100.00         18.13  1
2    08:00  354      647     64.64        100.00         35.36  1
2    08:15  501      501     50.00        100.00         50.00  1
2    08:30  115      884     88.49        100.00         11.51
2    08:45  42       951     95.77        100.00         4.23
2    09:00  90       909     90.99        100.00         9.01
2    09:15  38       957     96.18        100.00         3.82
2    09:30  69       925     93.06        100.00         6.94
2    09:45  11       981     98.89        100.00         1.11
2    10:00  479      518     51.96        52.22          0.26
2    10:15  437      556     55.99        63.44          7.45
2    10:30  24       971     97.59        100.00         2.41
2    10:45  14       978     98.59        100.00         1.41
2    11:00  541      456     45.74        100.00         54.26  2
2    11:15  212      785     78.74        80.66          1.92
2    11:30  9        988     99.10        100.00         0.90
2    11:45  167      826     83.18        100.00         16.82  3
2    12:00  291      707     70.84        100.00         29.16  4
2    12:15  10       985     98.99        100.00         1.01
2    12:30  6        987     99.40        100.00         0.60
2    12:45  11       982     98.89        100.00         1.11
2    13:00  7        990     99.30        100.00         0.70
2    13:15  4        992     99.60        100.00         0.40
2    13:30  7        989     99.30        100.00         0.70
2    13:45  8        985     99.19        100.00         0.81
2    14:00  16       980     98.39        100.00         1.61
2    14:15  5        991     99.50        100.00         0.50
2    14:30  8        986     99.20        100.00         0.80
2    14:45  9        985     99.09        100.00         0.91
2    15:00  4        994     99.60        100.00         0.40
2    15:15  8        988     99.20        100.00         0.80
2    15:30  13       984     98.70        100.00         1.30
2    15:45  1        991     99.90        100.00         0.10
2    17:00  2        973     99.79        100.00         0.21
2    17:15  4        969     99.59        100.00         0.41
2    17:30  586      380     39.34        40.11          0.77
2    17:45  953      0       0.00         0.00           0.00
2    18:00  963      0       0.00         0.00           0.00
2    18:15  977      0       0.00         0.00           0.00
2    18:30  628      346     35.52        36.66          1.14
2    18:45  961      0       0.00         0.00           0.00
3    12:00  991      0       0.00         0.00           0.00
3    12:15  993      0       0.00         0.00           0.00
3    12:30  992      1       0.10         0.00           -0.10
3    12:45  987      0       0.00         0.00           0.00

Note 1: The images from these time windows show glare from the morning sun shining through the window. This might be a possible explanation for the lower accuracy in these time frames.

Note 2: The images from this time window show that the DLA was occluded for a longer period of time (8 minutes) due to a truck unloading a delivery on the street in front of the DLA.

Note 3: The images from this time window show that the DLA was occluded for about 2.5 minutes due to a garbage truck emptying the bins.

Note 4: The images from this time window show that the DLA was occluded for about 4.5 minutes due to a delivery van unloading a delivery on the street in front of the DLA.

The average difference between the percentage generated by the program and the actual percentage of t_fold in which the DLA was occupied was 4.42%, with a standard deviation of 10.90%. 90.77% of the '% full MLPC' percentages generated by the statistics generation algorithm were within ±5% of the ground truth ('% full actual') for the folds in which the DLA was not obstructed for longer periods of time (the folds in table 4 without notes). 87.69% of these folds were within ±2.5% of the ground truth.

We have also tested our algorithm on a Raspberry Pi model 3B, to ensure that it can process images in real-time on a device with little processing power. The Raspberry Pi took 13 minutes and 1.819 seconds to process 3742 full-size images (1 hour of data) stored on an external hard drive. On average, processing one frame thus took 208.9 milliseconds, making our algorithm suitable for real-time processing on a Raspberry Pi. This time includes the initialization phase in which the model (16.9 MB) was loaded from the RPi's SD storage. In real-time applications, we expect the processing time per frame to be even lower, because images will then be obtained directly from the camera instead of from an external USB hard drive, which is faster: the camera's CSI-2 interface supports a maximum throughput of 2 Gbit/s (although practically limited to 800 Mbit/s) [25], while USB 2.0 only supports speeds of up to 480 Mbit/s [26], and the external hard drive loses time to disk seek operations.

$ time python generate.py
3742 of 3742 images processed

real    13m1.819s
user    10m28.311s
sys     1m7.950s


4 Discussion

We have attempted to use Tensorflow object detection models to determine the DLA state. This approach proved unsuccessful due to its high computational load, which makes it unsuitable for use on a Raspberry Pi. We have also used different styles of MLP classifiers to determine the DLA state. The results show that our binary MLP classifier can fairly accurately detect the DLA state when it is not occluded. Despite these results, the algorithm still needs improvement in exceptional cases, i.e. when the DLA is occluded for extended periods of time, which is often caused by stationary vehicles directly in the camera's line of sight to the DLA. This results in increased misclassifications, skewing the accuracy of the model. We have tested this algorithm on a Raspberry Pi model 3B, which processed one image in 208 ms, confirming that our approach is suitable for real-time use on the RPi.

Despite the positive final results, it is important to acknowledge that they do not imply that our approach is a viable method for real-life applications, as we cannot assume that the accuracy on new data would be of the same order as the accuracy on the test set. Our test set was quite limited, seeing that it only contained data from 5 days. In order to establish the real-life usability of our approach, further testing on unseen data is needed, especially since real-life applications will feature events that the model has never seen before (i.e. we cannot train the model for every possible event, but we should test it on as many as possible). Additionally, it has not been tested whether our proposed solution would retain its high accuracy in adverse weather conditions such as rain, snow, and fog. We also acknowledge that, considering that our algorithm requires separate training for every new location, it is not yet universally applicable.

Noting the limitations described above, we feel these findings are valuable for the development of advanced traffic analysis systems, and in particular privacy-preserving, portable systems for detecting whether parking spaces and loading areas are occupied or not.

5 Conclusion

The aim of our research was to develop an algorithm that runs on a Raspberry Pi computer with PiCamera and can accurately determine in real-time whether the DLA is occupied or not, and with that, for what percentage of a 15-minute interval it has been in that state, while incorporating privacy by design. We have developed an algorithm that can determine the Designated Loading Area (DLA) state within ±5% of the ground truth in 90.77% of normal cases. Special cases, such as when a vehicle obstructs the view of the DLA for longer periods of time, still require special attention, to be addressed in future work. The algorithm is suitable for use on a Raspberry Pi, being able to classify images in real-time while safeguarding the privacy of the people filmed by processing all data locally into statistics, after which the images can be deleted immediately.


6 Future work

Although our research has proven to be valuable, there are still improvements that could be made. Firstly, it is not yet clear how our classifier will respond to more test data, seeing that we have only been able to test it on five days of data, which we believe does not accurately represent the real world.

Secondly, the algorithm has not been tested on images obtained in adverse weather conditions, such as snow, fog, or heavy rain. It is therefore not yet known how the algorithm will respond to these circumstances, despite these weather conditions being quite common in reality.

Furthermore, at the end of section 3.3.8, we briefly discussed the option of using four classes instead of the 3-class and binary MLP classifiers used in our research. Most misclassifications we observed occurred when the image showed either a vehicle currently in the process of parking, or the DLA view being obstructed by a passing or standing large vehicle. Hence, it might prove valuable to explore extending the binary MLP model to a model that uses four classes ('empty', 'full', 'parking', and 'occluded') to further reduce classification errors. The misclassifications that occur due to a passing vehicle could also be diminished by a 'flattening' mechanism: an algorithm that iterates over the classifications generated by the MLP classifier, checking for every classification whether it deviates from the classifications made around that time. If a classification deviates from, for example, all classifications made in the previous and next 30 seconds, the mechanism would 'flatten' this classification, setting its value to that of the neighboring classifications. A sketch of such a mechanism is given below.
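A minimal sketch of the proposed flattening mechanism, assuming one classification per second and a window of 30 seconds on either side (both values are illustrative, not prescribed by the thesis):

def flatten(labels, window=30):
    # labels: chronological list of 'empty'/'full' predictions, one per second.
    # A label that disagrees with every neighbor within `window` entries on
    # both sides is treated as a glitch and replaced by the neighbors' value.
    smoothed = list(labels)
    for i, label in enumerate(labels):
        before = labels[max(0, i - window):i]
        after = labels[i + 1:i + 1 + window]
        neighbors = before + after
        if neighbors and all(n != label for n in neighbors):
            smoothed[i] = neighbors[0]
        # if the neighbors disagree among themselves, nothing is flattened
    return smoothed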

Another option worth exploring is training our own Tensorflow object detection model that only contains the classes we are interested in (cars, trucks, and vans), to see whether that decreases the time and processing power needed to detect objects in an image. Optionally, a subframe that only contains the area around the DLA could be used instead of the whole image to further decrease the required processing power. If this proves to be a practical approach, it could supply us with additional data, such as the types of vehicles that use the DLA and at what particular times.


References

[1] Municipality of Amsterdam, Gemeente Amsterdam: Onderzoek, Informatie en Statistiek - Dashboard toerisme, 2018. [Online]. Available: https://www.ois.amsterdam.nl/visualisatie/dashboard_toerisme.html (visited on 14/06/2018).

[2] J. van Rest, D. Boonstra, M. Everts, M. van Rijn and R. van Paassen, "Designing privacy-by-design", in Privacy Technologies and Policy, B. Preneel and D. Ikonomou, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 55-72, ISBN: 978-3-642-54069-1.

[3] N. Buch, S. A. Velastin and J. Orwell, "A review of computer vision techniques for the analysis of urban traffic", IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 3, pp. 920-939, Sep. 2011. DOI: 10.1109/tits.2011.2119372.

[4] Traffic Detector Handbook, 3rd ed., ch. 2, United States Federal Highway Administration (FHWA), 2009.

[5] E. Bas, A. M. Tekalp and F. S. Salman, "Automatic vehicle counting from video for traffic flow analysis", in 2007 IEEE Intelligent Vehicles Symposium, IEEE, Jun. 2007. DOI: 10.1109/ivs.2007.4290146.

[6] D. G. Lowe, "Distinctive image features from scale-invariant keypoints", International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004. DOI: 10.1023/b:visi.0000029664.99615.94.

[7] T. Moranduzzo and F. Melgani, "Automatic car counting method for unmanned aerial vehicle images", IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 3, pp. 1635-1647, Mar. 2014. DOI: 10.1109/tgrs.2013.2253108.

[8] S. Bouaich, M. A. Mahraz, J. Riffi and H. Tairi, "Vehicle counting system in real-time", in 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), IEEE, Apr. 2018. DOI: 10.1109/isacv.2018.8354033.

[9] J. Zheng, Y. Wang and W. Zeng, "CNN based vehicle counting with virtual coil in traffic surveillance video", in 2015 IEEE International Conference on Multimedia Big Data, IEEE, Apr. 2015. DOI: 10.1109/bigmm.2015.56.

[10] K. Robert, "Night-time traffic surveillance: A robust framework for multi-vehicle detection, classification and tracking", in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, IEEE, Sep. 2009. DOI: 10.1109/avss.2009.98.

[11] G. Salvi, "An automated nighttime vehicle counting and detection system for traffic surveillance", in 2014 International Conference on Computational Science and Computational Intelligence, IEEE, Mar. 2014. DOI: 10.1109/csci.2014.29.

[12] H. Xie, L. Kulik and E. Tanin, "Privacy-aware traffic monitoring", IEEE Transactions on Intelligent Transportation Systems, vol. 11, no. 1, pp. 61-70, Mar. 2010. DOI: 10.1109/tits.2009.2028872.

[13] Automatic Number Plate Recognition, United Kingdom National Police. [Online]. Available: https://www.police.uk/information-and-advice/automatic-number-plate-recognition/ (visited on 23/06/2018).

[14] Automatic Number Plate Recognition, Dutch National Police. [Online]. Available: https://www.politie.nl/themas/anpr.html (visited on 23/06/2018).

[15] Z. Yang and L. S. Pun-Cheng, "Vehicle detection in intelligent transportation systems and its applications under varying environments: A review", Image and Vision Computing, vol. 69, pp. 143-154, Jan. 2018. DOI: 10.1016/j.imavis.2017.09.008.

[16] O. Alpar and R. Stojic, "Intelligent collision warning using license plate segmentation", Journal of Intelligent Transportation Systems, vol. 20, no. 6, pp. 487-499, Nov. 2015. DOI: 10.1080/15472450.2015.1120674.

[17] V. C. Maha Vishnu, M. Rajalakshmi and R. Nedunchezhian, "Intelligent traffic video surveillance and accident detection system with dynamic traffic signal control", Cluster Computing, vol. 21, no. 1, pp. 135-147, Jun. 2017. DOI: 10.1007/s10586-017-0974-5.

[18] M. Sameen and B. Pradhan, "Severity prediction of traffic accidents with recurrent neural networks", Applied Sciences, vol. 7, no. 6, p. 476, Jun. 2017. DOI: 10.3390/app7060476.

[19] M. Kochláň, M. Hodoň, L. Čechovič, J. Kapitulík and M. Jurecka, "WSN for traffic monitoring using Raspberry Pi board", in Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, IEEE, Sep. 2014. DOI: 10.15439/2014f310.

[20] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu and A. C. Berg, "SSD: Single Shot MultiBox Detector", 2016. [Online]. Available: http://arxiv.org/abs/1512.02325.

[21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications", 2017. ePrint: arXiv:1704.04861.

[22] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár, "Microsoft COCO: Common objects in context", 2014. ePrint: arXiv:1405.0312.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, "Scikit-learn: Machine learning in Python", Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.

[24] G. Andrew and J. Gao, "Scalable training of L1-regularized log-linear models", in Proceedings of the 24th International Conference on Machine Learning - ICML '07, ACM Press, 2007. DOI: 10.1145/1273496.1273501.

[25] P. J. Vis, Raspberry Pi CSI-2 Connector Specifications. [Online]. Available: https://www.petervis.com/Raspberry_PI/Raspberry_Pi_CSI/Raspberry_Pi_CSI-2_Connector_Specifications.html (visited on 25/06/2018).

[26] USB 2.0 Specification, Rev. 2, Universal Serial Bus (USB) Organization, 27 Apr. 2000. [Online]. Available: http://sdphca.ucsd.edu/Lab_Equip_Manuals/usb_20.pdf.
