Predicting the usage of electric vehicle charging stations


Tim van Loenhout

10741577

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dhr. Dr. S. van Splunter
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

July 2nd, 2017


INDEX

INDEX
ABSTRACT
INTRODUCTION
THEORETICAL FOUNDATION
    RELATED WORK
    DATA
        Transaction history
        Demographic characteristics
    MACHINE LEARNING ALGORITHMS
        Generalized linear models
        K-nearest neighbors regression
        Random forest regression
        Multi-layer Perceptron regressor
    EVALUATION METHOD
METHOD
    DATA PREPARATION
        Creating the output matrix
        Creating the input matrix
        Creating the training, the cross validation and the test set
    ALGORITHM OPTIMISATION
RESULTS
    ALGORITHM OPTIMISATION
    ALGORITHM TESTING AND COMPARING
    DATA REDUCTION
        Session reduction
        Feature reduction
EVALUATION
CONCLUSION AND DISCUSSION


WITH SPECIAL THANKS TO

Over Morgen,

Gijs van der Poel,

Tom van Goeverden,

Tim Tensen

&


ABSTRACT

The increasing popularity of electric vehicles in The Netherlands requires an efficient charging station infrastructure. In order to support municipalities during the decision-making process of constructing such an infrastructure, this research investigates whether it is possible to predict the usage of an electric vehicle charging station based on the demographic characteristics of its location. This question is answered using a machine learning approach, from which it can be concluded that 13 percent of the variance in the energy usage between different electric vehicle charging stations can be explained by the average income of the households that are located within a 200-meter radius of these charging stations. This proportion was found by training a Bayesian ridge regression algorithm on a subset of the data and then using the resulting prediction model to make predictions on the remaining data. Furthermore, by integrating multiple demographic features into the prediction algorithm, the predictive power could be increased by an additional 3 percent. These predictions can provide useful insights into which areas demand more or fewer charging opportunities. These insights can subsequently be used to minimise the number of both overstaffed and understaffed charging stations, resulting in a more efficient future charging infrastructure.


INTRODUCTION

In recent years, electric vehicles have become increasingly popular in The Netherlands (Beltramo et al., 2017). Due to both economic factors such as rising fuel costs and environmental concerns about greenhouse gas emissions, electric vehicles are widely promoted as a promising alternative to conventional petroleum-based vehicles (Mehta, Srinivasan, Khambadkone, Yang & Trivedi, 2015). According to Cuijpers, Staats, Bakker & Hoekstra (2016), the electrification of the Dutch transportation market will gain even more momentum in the future, demanding a significant increase in the number of electric vehicle charging stations (Sweda & Klabjan, 2015). Furthermore, considering the many charging stations in The Netherlands that are either overstaffed or understaffed (van Montfort, Visser, van der Poel & van den Hoed, 2016), this requires an efficient charging station infrastructure. In order to support municipalities during the process of constructing such an infrastructure, van Montfort et al. (2016) have created a model to discover correlations between several demographic factors and the daily number of transactions provided by a charging station. This model indicates that there is a significant correlation between the average income in a zipcode-5 area and the number of vehicles in that area that use a charging station. Building on this insight, this research will address the following question:

Is it possible to predict the usage of an electric vehicle charging station based on the demographic characteristics of its location?

In order to answer this question, two sub-questions will be addressed:

1. What proportion of the variance in energy usage between different electric vehicle charging stations can be explained by the average income of the households that are located within a 200-meter radius of these charging stations?

2. Can this proportion be increased by integrating additional demographic characteristics?

Based on the correlation discovered by van Montfort et al. (2016), the hypothesis is that it is possible to predict the amount of energy provided by an electric charging station based on the average household income. Furthermore, the inclusion of multiple features is expected to have a positive impact on the accuracy of this prediction. This hypothesis will be tested using a machine learning approach. Instead of analysing each demographic feature individually, machine learning techniques will be used to combine different features into a single prediction algorithm. Such an algorithm could contribute to a more precise and automated method for predicting the usage of future charging stations. The next section will discuss the theoretical foundation for this research, which is followed by an elaboration on the applied method, which consists of two parts: data preparation and algorithm training. Subsequently, the results of this method will be evaluated, and finally the answer to the research question is stated in the conclusion and discussion section.

THEORETICAL FOUNDATION

RELATED WORK

Cuijpers, Staats, Bakker & Hoekstra (2016) have studied the electrification of cars in the Netherlands. Based on several trends such as the energy transition and the development of shared economies, four possible future scenarios were constructed, each indicating a significant expansion of the market for electric vehicles by the year 2020. Cuijpers et al. (2016) therefore emphasize the importance of an efficient charging station infrastructure. Constructing such an infrastructure requires reliable predictions on the usage of potential future charging stations.

According to Lampropoulos, Vanalme & Kling (2010), such predictions require insights into the charging behaviour of small consumer groups. On top of that, a profound knowledge base on the factors influencing the presence of electric vehicles is required. Several studies have addressed such factors. For instance, Sweda & Klabjan (2015) have built an agent-based information system that illustrates a correlation between the number of charging stations in a given area and the presence of electric vehicles. The system also identifies patterns between environmental awareness and the willingness to invest in an electric vehicle. Sweda & Klabjan (2015) claim that this study has demonstrated that an efficient infrastructure does indeed affect consumers’ decisions regarding electric vehicles.

Furthermore, according to Xi, Sioshansi & Marano (2013), the usage of a charging station partially depends on the presence of certain facilities. For instance, a charging station is expected to be used more frequently when located near a university, workplace or shopping mall. These findings were verified by the regression model constructed by van Montfort, Visser, van der Poel & van den Hoed (2016). Moreover, this model also demonstrates that the average income of households surrounding a charging station positively influences the daily number of cars using this charging station (figure 1).


FIGURE 1. CORRELATION BETWEEN THE AVERAGE MONTHLY HOUSEHOLD INCOME (X-AXIS) AND THE NUMBER OF CARS USING AN ELECTRIC VEHICLE CHARGING STATION (Y-AXIS) (VAN MONTFORT ET AL., 2016)

From this research, it can therefore be concluded that there are significant correlations between charging behaviour and several demographic factors. Using these insights, municipalities can be advised on how to set up an effective charging station infrastructure. However, by basing the decision-making process on the expected number of vehicles using a charging station, differences in individual charging behaviour (Lampropoulos et al., 2010) are ignored. Due to differences in the frequency and duration of charging sessions between individual users (Franke & Krems, 2013) and the variance in battery capacities for different electric vehicle models (Fotouhi, Auger, Propp, Longo & Wild, 2016), the number of kilowatt-hours provided by a charging station is different for each transaction. Considering this variance in the frequency and magnitude of transactions, the number of cars using a charging station might not always be an accurate representation of the actual amount of energy provided by a charging station. Predictions on the amount of energy provided by a charging station could therefore be a useful complement to the current regression model by van Montfort et al. (2016).

Furthermore, certain parts of the model by van Montfort et al. (2016) are constructed from a knowledge-based approach. For instance, the optimal degree for the polynomial in figure 1 is determined using explanations from a socio-economic perspective on the dynamics of electric vehicle consumerism. As opposed to this knowledge-based approach, this research will apply a more exact data-based method for determining the model’s parameter values by using a training, a cross validation and a test set. On top of that, insights obtained from the model by van Montfort et al. (2016) solely serve as guidelines during decision-making processes. Making actual predictions on potential future charging stations requires an automated prediction algorithm which combines and directly integrates any discovered correlations without relying on human interpretations.


DATA

This research is based on electric vehicle charging behaviour in the city of The Hague in the Netherlands, whose high-density charging station infrastructure makes it one of Europe’s front runners in the adoption of electric vehicles (van Montfort, Kooi, van der Poel & van den Hoed, 2016). Data on this charging behaviour is provided by Over Morgen and consists of two parts: the transaction history of each public charging station in The Hague, and the demographic characteristics of the residents that are located within a 200-meter radius of a charging station.

TRANSACTION HISTORY

Over the period January 2014 - March 2017, 688,855 transactions took place at a total of 1006 charging stations (figure 2), 695 of which are property of Alphen, whereas the other 311 stations belong to Ecotap. For each of the 39 months, data on all the charging sessions during that month are stored in a separate XLS file for both ICU and Ecotap charging stations. Data on each session consists of the provided number of kilowatt-hours, the identification code of the charging station, the identification code of the user and the session’s start and end date and time.

FIGURE 2. LOCATIONS OF THE PUBLIC ELECTRIC VEHICLE CHARGING STATIONS IN THE HAGUE (JANUARY 1, 2016) (VAN MONTFORT ET AL., 2016).


DEMOGRAPHIC CHARACTERISTICS

For each of the 65,567 residences that are located within a 200-meter radius of a charging station, 14 demographic characteristics are stored in an XML file (table 1).

| Index | Feature | Scale | Year |
|---|---|---|---|
| 1* | Number of stations within a 200m radius | - | 2017 |
| 2* | Monthly per capita income (based on total area population) | Postal code | 2008 |
| 3* | Annual per capita income (x1000 euro) (based on total income recipients) | Neighbourhood | 2014 |
| 4* | Annual per capita income (x1000 euro) (based on total area population) | Neighbourhood | 2014 |
| 5* | Low income (%) | Neighbourhood | 2014 |
| 6* | High income (%) | Neighbourhood | 2014 |
| 7* | Low income (%) | 500 m² | 2011 |
| 8* | High income (%) | 500 m² | 2011 |
| 9* | Number of vehicles | Residence | unknown |
| 10* | Own entranceway (1 = yes, 0 = no) | Residence | unknown |
| 11* | Surface area | Residence | unknown |
| 12* | Rented apartments (%) | Neighbourhood | 2014 |
| 13* | Residence density | km² | unknown |
| 14* | Residence value (x1000 euro) | Neighbourhood | 2014 |

TABLE 1. USED INFORMATION FOR EACH RESIDENCE IN THE HAGUE.

For each feature, the index number is used for references in the results and evaluation sections, the scale corresponds to the area over which the feature value is determined and the year indicates the time of data acquisition.

MACHINE LEARNING ALGORITHMS

The programming language used for this research is Python 3.6.1. A large number of packages are available for this widely used high-level programming language, including scikit-learn, which is an open-source data science package that provides simple tools for data mining and analysis (Pedregosa et al., 2011). From this package, seven supervised machine learning algorithms are used, which are briefly described below.

GENERALIZED LINEAR MODELS

In generalized linear models, the target value is modelled as a linear combination of the input features (Pedregosa et al., 2011). The core method underlying this group of algorithms is linear regression, in which a linear model is fit to the data in such a manner that it minimizes the residual sum of squares between the values predicted by the model and the observed values in the data set. The optimisation of the parameters for the model is done using ordinary least squares, which determines a vector $w$ containing the coefficient values according to the following formula: $w = \arg\min_{w} \lVert y - Xw \rVert_2^2$, in which $y$ is a vector containing the observed values and $X$ a matrix containing the input data (Ng, 2012). During this research, several variants of this method will be used, including ridge regression, lasso regression and Bayesian ridge regression. In these algorithms, a regularization penalty on the coefficients is added to the cost function, resulting in slightly higher bias, yet lower variance in the predicted values. In the case of lasso regression, this penalty is equal to the product of a complexity parameter that controls the penalty size and the L1 norm of the coefficient vector, whereas for ridge regression the squared L2 norm of the coefficient vector is used. In Bayesian ridge regression, on the other hand, the coefficients in $w$ are treated as random variables with a prior distribution, and their values are estimated from the posterior distribution obtained using Bayes’ theorem (Michel, 2010).
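The effect of an L2 penalty can be illustrated for the simplest case of a single input feature. The sketch below is not the scikit-learn implementation used in this research; it assumes one feature, an unpenalised intercept, and hypothetical toy values for income and usage. The name `fit_line` and the `penalty` parameter are illustrative only.

```python
def fit_line(xs, ys, penalty=0.0):
    """Fit y = w * x + b by least squares; penalty > 0 adds ridge (L2) shrinkage on w."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + penalty)  # penalty = 0 reduces to ordinary least squares
    b = y_mean - w * x_mean      # the intercept itself is not penalised
    return w, b


# Hypothetical toy data: normalised income versus monthly energy usage (kWh).
income = [0.1, 0.3, 0.5, 0.7, 0.9]
usage = [120.0, 150.0, 210.0, 240.0, 300.0]

w_ols, b_ols = fit_line(income, usage)                    # slope 225.0
w_ridge, b_ridge = fit_line(income, usage, penalty=0.5)   # slope shrunk to 100.0
```

The penalised slope is smaller in magnitude than the ordinary least squares slope, which is the bias-for-variance trade described above.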

K-NEAREST NEIGHBORS REGRESSION

The K-nearest neighbors regression algorithm computes the value of a sample based on the values of its k nearest neighbors. Which k neighbors are considered the nearest is determined using the Euclidean distance. Subsequently, the target values of these neighbors are averaged, either uniformly or weighted. In the latter case, the contribution of each neighbor is weighted by the inverse of its distance to the target sample (Pedregosa et al., 2011).
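The difference between uniform and distance weighting can be sketched in a few lines of plain Python. This is an illustrative re-implementation under simple assumptions (small in-memory lists, Euclidean distance via `math.dist`, Python 3.8 or later), not the scikit-learn estimator itself; the function name and toy values are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Predict a target as the (weighted) mean target of the k nearest training samples."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = by_distance[:k]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weighting: an exact match decides the prediction outright.
    for d, y in nearest:
        if d == 0.0:
            return y
    weighted = [(1.0 / d, y) for d, y in nearest]
    return sum(w * y for w, y in weighted) / sum(w for w, _ in weighted)


# Hypothetical one-feature training set.
train_x = [(0.0,), (1.0,), (3.0,)]
train_y = [10.0, 20.0, 40.0]

uniform_pred = knn_predict(train_x, train_y, (1.5,), k=2)                       # 15.0
distance_pred = knn_predict(train_x, train_y, (1.5,), k=2, weights="distance")  # 17.5
```

With distance weighting the closer neighbor (target 20.0) pulls the prediction towards its own value.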

RANDOM FOREST REGRESSION

The random forest algorithm builds on the decision tree algorithm, which predicts a target sample by generating a tree-based structure from decision rules derived from the features in the data set (California State University Northridge, 2011). Following these rules, the tree is split at each level based on the feature with the highest predictive power. This predictive power is determined using the information gain, a measure that uses entropy to quantify the degree of disorganisation in a system. Instead of creating one such tree, the random forest algorithm grows multiple trees based on different (overlapping) subsets of the data. Furthermore, in order to generate unique trees, for each split only a subset of the available features is taken into account. Subsequently, the mean of the values predicted by the individual trees is assigned to the target sample.

MULTI-LAYER PERCEPTRON REGRESSOR

The multi-layer perceptron regressor uses backpropagation to train a neural network to generate a set of continuous values (Pedregosa et al., 2011). This neural network is based on the biological structure of neurons in the brain and consists of an input layer, one or more hidden layers and an output layer. The number of hidden layers depends on the complexity of the system. During this research only one hidden layer will be used due to the relative simplicity of the task, for which the number of nodes is determined by the following formula: $N_h = \frac{N_s}{\alpha (N_i + N_o)}$, in which $N_h$ is the number of hidden nodes, $N_s$ is the number of samples in the training set, $N_i$ is the number of input nodes, $N_o$ is the number of output nodes and $\alpha$ is the regularization parameter. This parameter adds an L2 penalty to the cost function in order to avoid overfitting. The input layer also contains several nodes, one for each of the features in the data set. The output layer, on the other hand, only contains one node, which outputs a value based on the activations of the nodes in the hidden layer. These activations are based on an activation function, which during this research is set to either the linear function (identity) or the rectified linear unit function (ReLU). Identity computes the function $f(x) = x$, whereas ReLU computes the function $f(x) = \max(0, x)$ and thus only activates when $x$ is positive.
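As a worked illustration of the hidden-node formula and the two activation functions, the sketch below evaluates them on hypothetical numbers (600 training samples and a value of 2 for alpha are assumptions, not figures from this research). The forward pass omits bias terms for brevity, and all names are illustrative.

```python
def hidden_nodes(n_samples, n_inputs, n_outputs, alpha):
    """Number of hidden nodes: N_h = N_s / (alpha * (N_i + N_o)), rounded, at least 1."""
    return max(1, round(n_samples / (alpha * (n_inputs + n_outputs))))


def identity(x):
    return x


def relu(x):
    return max(0.0, x)


def forward(features, hidden_weights, output_weights, activation):
    """One hidden layer, one output node, no bias terms (simplification)."""
    hidden = [activation(sum(w * f for w, f in zip(row, features)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))


n_hidden = hidden_nodes(n_samples=600, n_inputs=14, n_outputs=1, alpha=2)  # 20

# Two hypothetical hidden nodes whose weighted inputs are -1.0 and 1.0.
features = [1.0, -2.0]
hidden_weights = [[1.0, 1.0], [1.0, 0.0]]
output_weights = [1.0, 1.0]

out_identity = forward(features, hidden_weights, output_weights, identity)  # 0.0
out_relu = forward(features, hidden_weights, output_weights, relu)          # 1.0
```

The two outputs differ because ReLU suppresses the hidden node with a negative weighted input, whereas identity passes it through unchanged.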

EVALUATION METHOD

During this research, the efficiency of the previously described algorithms is determined based on the duration of the training process and how well the generated prediction model fits the data. There are different methods for determining this goodness of fit, including the coefficient of determination ($R^2$), which is the proportion of the variance in the dependent variable that can be explained by the independent variables. This proportion is calculated using the formula $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, in which $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ is the residual sum of squares and $SS_{tot} = \sum_i (y_i - \bar{y})^2$ the total sum of squares. By taking the squares of the residuals, outliers have a higher contribution to the total prediction error (Walpole, Myers, Myers & Ye, 2011). Furthermore, by also taking into account the total sum of squares, the size of the prediction error is put into perspective, as the magnitude of this error alone is not sufficient to determine the quality of the prediction model due to its dependency on the variance and mean of the data set.
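The coefficient of determination can be computed directly from its definition. The following is a minimal sketch with hypothetical usage values; the function name `r_squared` is illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical monthly usage values (kWh) and two sets of predictions.
observed = [100.0, 200.0, 300.0]
good_fit = r_squared(observed, [110.0, 190.0, 310.0])  # 1 - 300 / 20000 = 0.985
mean_fit = r_squared(observed, [200.0, 200.0, 200.0])  # 0.0
```

A model that always predicts the mean of the observed values scores exactly 0, which is why $R^2$ puts the raw prediction error into perspective.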

According to Walpole et al. (2011), the $R^2$ criterion can be a dangerous measure for comparing different models, as by increasing the model’s complexity, the residual sum of squares can be decreased until an abnormally high $R^2$ is reached. The result of this artificial increase of $R^2$ is that the model might focus too much on the specific characteristics of the training data instead of learning the more general underlying relationships in the data. In order to prevent this overfitting, the data is divided into a training, cross validation and test set. The training set accounts for approximately 60 percent of the data and is used to train the prediction model. Subsequently, using this model, predictions are made on another 20 percent of the data: the cross validation set. Based on these predictions, the model’s parameters are adjusted until an optimised model with a maximum $R^2$ on the cross validation set is achieved. After this optimisation, the quality of the prediction model is determined by the $R^2$ on the final 20 percent of the data: the test set. In order to use all the data in the data set for both training and evaluation, k-fold cross validation is used. This approach repeats the previously described process while using different subsets of the data for each iteration. Subsequently, the average over the k iterations is used to determine the final $R^2$.
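The idea of k-fold cross validation, in which every sample is used for evaluation exactly once, can be sketched as an index generator. This is an illustrative version only (contiguous folds, no shuffling); the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that every sample is held out exactly once."""
    # Distribute the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
# First fold: samples 0 and 1 are held out, samples 2..9 are used for training.
```

The final score is then the average of the per-fold scores, as described above.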


METHOD

The method consists of two parts: data preparation and algorithm optimisation. In the first part, the raw data files are transformed into suitable data matrices, which in the second part are used for finding the optimal parameters for the applied machine learning algorithms. Finally, using these optimal parameters, the predictive power of each algorithm can be determined.

DATA PREPARATION

The conversion of the raw data files into a suitable data matrix is done in three steps. First, the average monthly energy usage for each charging station is stored in an output matrix. Secondly, an input matrix is created that contains, for each station, the demographic characteristics of the area in which it is located. In the final step, these two matrices are merged and then divided into a training, a cross validation and a test set. The different steps in this process will be explained in more detail below.

CREATING THE OUTPUT MATRIX

In order to create the output matrix, for each month the XLS file containing the transaction history is converted to a CSV file and read into a Python script. In this script, the information on all charging sessions is stored in a single data matrix. Based on the feature containing the starting date and time of each session, a function is built which reduces the matrix to only those sessions that have taken place after 4pm and before 4am (van Montfort et al., 2016). This function is an attempt to separate the transactions belonging to the area’s inhabitants from those that are the result of commuter traffic (Lampropoulos et al., 2010), as people who work in an area are not represented by the stored demographic characteristics for that area. In order to also exclude any occasional visitors (Lampropoulos et al., 2010), a second function is built, which uses the user identification code to keep only those sessions that belong to users who have used the same charging station at least five times a month (van Montfort et al., 2016). However, if certain spatial features such as the presence of a university, workplaces or a shopping mall were to be included in the demographic characteristics data set, the exclusion of these sessions would be expected to have a negative impact on the predictions (Xi, Sioshansi & Marano, 2013). This research, on the other hand, solely focuses on the demographic characteristics derived from the area’s inhabitants. Therefore, training an algorithm on both the whole and the reduced data set is required to evaluate the effect of this method.

Afterwards, for each remaining charging station, all its sessions are combined into a total energy usage value. During this process, sessions containing any noisy data in the form of a missing, negative or unrealistically high usage value are removed from the data matrix and thus ignored during the summation. Furthermore, some stations contain multiple connection points, which are referred to by an extra term at the end of the charging station identification code. Despite these different identification codes, they refer to the same charging station. Therefore, such extra terms are removed, as a result of which all the relevant sessions add up to the same total energy usage value.

However, the total energy usage value of a station is not necessarily an accurate indication of its popularity, because the size of this total value is also largely dependent on the amount of time over which it was calculated. Hence, in order to derive a more reliable estimation of the station’s usage, the total energy value is divided by the number of months the charging station has been in use. The result is a two-dimensional array in which each row contains a charging station identification code in the first column and its corresponding average monthly energy usage in the second column.
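The filtering and aggregation steps described above can be sketched as follows. This is not the script used in this research: the session layout (dictionaries with `station`, `user`, `start` and `kwh` keys), the `max_kwh` threshold of 100 and the use of distinct months with valid sessions as a stand-in for the number of months in use are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def is_evening_or_night(start):
    """True when a session starts after 4pm or before 4am."""
    return start.hour >= 16 or start.hour < 4


def keep_frequent_users(sessions, min_sessions=5):
    """Keep sessions of users with at least min_sessions at one station in one month."""
    def key(s):
        return (s["station"], s["user"], s["start"].year, s["start"].month)
    counts = Counter(key(s) for s in sessions)
    return [s for s in sessions if counts[key(s)] >= min_sessions]


def average_monthly_usage(sessions, max_kwh=100.0):
    """Average monthly kWh per station, skipping missing, negative or too-high values."""
    totals = defaultdict(float)
    months = defaultdict(set)
    for s in sessions:
        kwh = s["kwh"]
        if kwh is None or kwh < 0 or kwh > max_kwh:
            continue  # noisy value: ignored during the summation
        totals[s["station"]] += kwh
        months[s["station"]].add((s["start"].year, s["start"].month))
    return {station: totals[station] / len(months[station]) for station in totals}


# Hypothetical sessions: one regular evening user, one visitor, one daytime, one noisy.
sessions = [
    {"station": "A", "user": "u1", "start": datetime(2016, 1, day, 18, 0), "kwh": 10.0}
    for day in range(1, 6)
]
sessions.append({"station": "A", "user": "u2", "start": datetime(2016, 1, 7, 19, 0), "kwh": 8.0})
sessions.append({"station": "A", "user": "u1", "start": datetime(2016, 1, 8, 9, 0), "kwh": 12.0})
sessions.append({"station": "A", "user": "u1", "start": datetime(2016, 1, 9, 20, 0), "kwh": -1.0})

evening = [s for s in sessions if is_evening_or_night(s["start"])]
regular = keep_frequent_users(evening)
usage = average_monthly_usage(regular)  # {"A": 50.0}
```

Only the five valid evening sessions of the regular user contribute to the result; the visitor, the daytime session and the negative reading are all excluded.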

Finally, a third function is built which keeps only those charging stations that have been in use for at least a given number of months. The idea behind this function is that the energy usage during the first months after a charging station is installed might not be representative of its long-term popularity, as the neighbourhood in which it is located might need time to adapt to its presence. That is, following the new charging opportunity, the average monthly usage may increase over time due to an increasing adoption of electric vehicles in the area. However, as with the other functions, this assumption needs to be verified by experimenting with different values for the minimal required usage time before it can be applied during the ultimate prediction process.

CREATING THE INPUT MATRIX

Contrary to the multiple transaction files, the demographic characteristics are stored in a single XML file, as for this data there is no distinction between different months. After converting this file into a CSV file and reading it into a Python script, all the information is stored in a data matrix consisting of one row for each residence and 15 columns: 14 for the features mentioned in the theoretical foundation and 1 containing the identification code of the charging station within whose radius the residence is located. However, residences that are located within the 200-meter radius of more than one charging station are represented by multiple rows, each containing identical demographic feature values, yet a different charging station identification code.

Afterwards, entries referring to the same charging station are merged into a single entry containing, for each feature, the average value calculated over all the relevant residences. As in the output matrix preparation, before this merging process any extra terms at the end of a charging station identification code are removed in order to treat different connection points as a single charging station. Unlike in the output preparation, however, any noisy data is replaced by a NaN value instead of removing the whole corresponding entry from the matrix, as this would also result in the loss of the other feature values for that entry. Furthermore, each demographic feature is normalised using feature scaling, as some machine learning algorithms are influenced by the varying orders of magnitude between the different features. For instance, a feature containing relatively high values would contribute much more to the calculated Euclidean distance in the K-nearest neighbors regression algorithm than a feature containing relatively low values. By applying this normalisation, a two-dimensional input matrix is constructed in which each row contains a charging station identification code in the first column and a value between 0 and 1 for the corresponding demographic features in the remaining 14 columns.
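A minimal sketch of the per-station averaging and the scaling to the range 0 to 1 is given below, for a single feature. Ignoring NaN values when averaging is an assumption about how the noisy entries were handled; the function names and income values are hypothetical.

```python
import math


def nan_mean(values):
    """Mean over the non-NaN values; NaN when no valid value remains."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


def min_max_scale(column):
    """Rescale a feature column to [0, 1], leaving NaN entries untouched."""
    valid = [v for v in column if not math.isnan(v)]
    low, high = min(valid), max(valid)
    if high == low:
        return [v if math.isnan(v) else 0.0 for v in column]
    return [v if math.isnan(v) else (v - low) / (high - low) for v in column]


# Hypothetical income values of the residences around three stations.
income_by_station = {"A": [20.0, math.nan, 30.0], "B": [40.0, 50.0], "C": [10.0]}
stations = sorted(income_by_station)
station_means = [nan_mean(income_by_station[s]) for s in stations]  # [25.0, 45.0, 10.0]
scaled = min_max_scale(station_means)  # station B maps to 1.0, station C to 0.0
```

After scaling, every feature spans the same range, so no single feature dominates a Euclidean distance purely because of its unit.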

CREATING THE TRAINING, THE CROSS VALIDATION AND THE TEST SET

In order to combine the input and output values into a synchronized data matrix, every row in the input matrix is attached to its corresponding row in the output matrix based on their matching charging station identification code. Afterwards, this matrix is divided into a training, a cross validation and a test set, which consist of 60, 20 and 20 percent of the data, respectively. However, before this data partition, all rows are put into a random order so that different subsets are obtained during each iteration in the algorithm optimisation process. Finally, in order to provide the machine learning algorithms with the appropriate data formats, each subset is separated into a two-dimensional matrix consisting of input values and a one-dimensional array containing the accompanying output values.
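The shuffle and 60/20/20 partition can be sketched as below. The row layout (station code first, usage value last) and the helper names are assumptions made for illustration.

```python
import random


def split_60_20_20(rows, seed=None):
    """Shuffle the rows and split them into training, cross validation and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    n_val = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def to_xy(rows):
    """Separate rows of (station_id, feature..., usage) into inputs X and outputs y."""
    X = [list(row[1:-1]) for row in rows]
    y = [row[-1] for row in rows]
    return X, y


# Hypothetical merged rows: station code, two scaled features, average monthly usage.
rows = [("S%02d" % i, i / 10, 1 - i / 10, 100.0 + i) for i in range(10)]
train, val, test = split_60_20_20(rows, seed=42)
X_train, y_train = to_xy(train)
```

Calling the split again with a different seed (or none) yields a different partition, which is what allows the optimisation process to average over many random subsets.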

ALGORITHM OPTIMISATION

For each of the machine learning algorithms described in the theoretical foundation, its predictive power on the data is determined using a two-step optimisation cycle (figure 3). This cycle starts by setting the initial values for the algorithm’s parameters. Step two consists of a sub cycle in which first the generated training, cross validation and test matrices are imported. During the preparation of these matrices, all charging sessions are taken into account, meaning no data reduction has yet been applied using the functions described in the previous section. Subsequently, the input and output values of the training matrices are used for training the algorithm, which is afterwards used for making predictions on the cross validation input values. By taking the sum of the squared differences between the predicted values and the corresponding cross validation output values, the residual sum of squares is determined. At the same time, the total sum of squares is calculated using the same approach, but instead of using the predicted values as generated by the algorithm, each prediction is set to be the mean of the output values in the training set. Dividing the residual sum of squares by this total sum of squares and subtracting the result from 1 gives $R^2$, the evaluation measure for the algorithm. This sub cycle is repeated 500 times, each time using a different training, cross validation and test set due to the shuffling in the final part of the data preparation process, after which the average $R^2$ is stored along with its corresponding parameter values. These parameters are then updated, before re-entering the second step in the cycle. Finally, the parameters resulting in the highest $R^2$ are used for the comparison to the other algorithms.
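The structure of this cycle can be sketched with a deliberately small stand-in model. The example below uses a one-feature nearest-neighbour predictor with a single parameter `k`, synthetic data and 20 repetitions instead of 500; all of these are assumptions chosen to keep the sketch short, and only the loop structure mirrors the procedure described above.

```python
import random


def knn_1d(train, x, k):
    """Mean target of the k training rows whose feature is closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def validation_r2(train, val, k):
    """R^2 on the validation rows, with SS_tot taken around the training mean."""
    train_mean = sum(y for _, y in train) / len(train)
    ss_res = sum((y - knn_1d(train, x, k)) ** 2 for x, y in val)
    ss_tot = sum((y - train_mean) ** 2 for _, y in val)
    return 1 - ss_res / ss_tot


def optimise(rows, candidates, repeats, seed=0):
    """Average the validation R^2 over repeated shuffles for every candidate value."""
    rng = random.Random(seed)
    scores = {}
    for k in candidates:
        total = 0.0
        for _ in range(repeats):
            shuffled = rows[:]
            rng.shuffle(shuffled)
            n_train = len(shuffled) * 60 // 100
            n_val = len(shuffled) * 20 // 100
            train = shuffled[:n_train]
            val = shuffled[n_train:n_train + n_val]
            total += validation_r2(train, val, k)
        scores[k] = total / repeats
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic data: a linear trend plus uniform noise (hypothetical values).
data_rng = random.Random(1)
rows = [(i / 50, 100 + 200 * (i / 50) + data_rng.uniform(-20, 20)) for i in range(50)]
best_k, scores = optimise(rows, candidates=[1, 5, 25], repeats=20)
```

The test set is left untouched during this search; it is only used afterwards, when the optimised algorithms are compared.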

FIGURE 3. ALGORITHM OPTIMISATION CYCLE

This algorithm comparison is done by having each optimised algorithm make 500 predictions on a test set that consists of a different 20 percent subset of the data for each iteration. The average $R^2$ over these 500 predictions forms the algorithm’s test statistic. Comparing the different test statistics between the applied algorithms can indicate which algorithm performs best. However, when an algorithm has a higher average $R^2$ than another algorithm, this is not sufficient to conclude that it indeed performs better, as the observed difference in $R^2$ might just be a coincidence. Therefore, in order to determine whether the observed values are meaningfully different, a statistical significance test needs to be conducted. The test used for this research is the two-sample t-test at a 95% confidence level. This test investigates whether the true means of two independent data sets are identical. This is done by calculating the probability (p-value) of observing a difference at least as large as the measured one if the null hypothesis, which states that both true means are the same, were true. If this p-value is less than 0.05, the null hypothesis can be rejected in favour of the alternative hypothesis, which for this research states that the observed differences in $R^2$ between two algorithms are significant.
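A two-sample comparison of this kind can be sketched with the standard library alone. The version below computes Welch's t statistic (unequal variances, which is an assumption, since the exact variant of the t-test is not specified here) and approximates the two-sided p-value with the standard normal distribution, which is only reasonable for large samples such as 500 values per algorithm. The $R^2$ values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Welch's t statistic with a normal approximation of the two-sided p-value."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # large-sample approximation
    return t, p


# Hypothetical R^2 values from 500 test iterations of two algorithms.
r2_first = [0.16 + 0.01 * ((i % 5) - 2) for i in range(500)]
r2_second = [0.13 + 0.01 * ((i % 5) - 2) for i in range(500)]

t_stat, p_value = welch_t_test(r2_first, r2_second)
significant = p_value < 0.05
```

For small samples, the p-value should instead be taken from the t distribution with the appropriate degrees of freedom.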

Using the best performing algorithm, the impact of applying the data reduction functions described in the data preparation section is tested. This is done by calculating the test statistic using different values for the minimal required monthly number of sessions by a single user and for the minimal required number of months a charging station has to be in use, and by reducing the sessions to only those that have started between 4pm and 4am. Furthermore, the predictive power of an algorithm might be influenced by which demographic features are present in the input matrix. Some algorithms, such as the random forest algorithm, have an incorporated method for automatically comparing different features and deciding which are most important during the training process. Other algorithms, such as the K-nearest neighbors regression algorithm, do not apply such a feature selection method and thus consider every feature of equal importance. In order to manually apply this feature selection, the predictive power of each feature is determined by calculating its individual $R^2$ using a regression analysis. On top of that, this value is combined with the feature’s information gain to get a more multi-perspective indication of its predictive power. However, in order to weight both measures equally before merging them, they need to be normalised, as the information gain often has a relatively high value compared to $R^2$, which is always a number between 0 and 1. Whether ignoring certain features based on their predictive power has any beneficial effects on the prediction accuracy needs to be determined by using different feature subsets.
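The normalise-then-merge step can be sketched as follows. The individual $R^2$ of a single feature in a simple linear regression equals its squared Pearson correlation with the target, which is what `univariate_r2` computes. Averaging the two normalised scores is an assumption about how the measures are combined, and the feature names and score values are hypothetical.

```python
def univariate_r2(xs, ys):
    """R^2 of a one-feature linear regression (squared Pearson correlation)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


def min_max(values):
    """Rescale a list of scores to [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def rank_features(r2_scores, gain_scores):
    """Rank features by the mean of their normalised R^2 and information gain."""
    names = list(r2_scores)
    r2_norm = min_max([r2_scores[name] for name in names])
    gain_norm = min_max([gain_scores[name] for name in names])
    combined = {name: (a + b) / 2 for name, a, b in zip(names, r2_norm, gain_norm)}
    return sorted(names, key=combined.get, reverse=True)


# Hypothetical scores for three features.
r2_scores = {"income": 0.13, "density": 0.02, "vehicles": 0.05}
gain_scores = {"income": 2.0, "density": 1.5, "vehicles": 0.5}
ranking = rank_features(r2_scores, gain_scores)  # ["income", "density", "vehicles"]
```

Without the normalisation, the information gain values would dominate the combined score simply because they are on a larger scale.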

This same feature reduction approach is used to estimate the predictive power when solely using the feature containing the monthly per capita average income (2*). That is, by reducing the data to this single column, the proportion of the variance in energy usage between different electric vehicle charging stations that can be explained by the average income of the households that are located within a 200-meter radius of these charging stations is computed.


RESULTS

This section gives an overview of the results of the previously described method. First, for each algorithm the optimal parameters based on the optimisation cycle are discussed, then the test set results of each optimised algorithm are compared and finally the best performing algorithm is used to illustrate the impact of applying data reduction functions during the data preparation.

ALGORITHM OPTIMISATION

The optimisation is applied to the following seven algorithms: k-nearest neighbors regression, random forest, multi-layer perceptron regressor, linear regression, ridge regression, lasso regression and Bayesian ridge regression. For the latter four algorithms, adjusting any parameters did not have any positive impact on the prediction accuracy. For the former three algorithms, on the other hand, R2 did increase by setting certain parameter values. The results from this parameter optimisation are described below.

For the K-nearest neighbors regression algorithm, setting the number of nearest neighbors to approximately 30 and applying a weight to each neighbor based on its inverse Euclidean distance resulted in the highest average R2 (figure 4) (table 2).

FIGURE 4. VISUALISATION OF THE PARAMETER OPTIMISATION FOR THE K-NEAREST NEIGHBORS REGRESSION ALGORITHM


NUMBER OF NEIGHBORS  9     13    17    21    25    29    33    37    41    45    49
UNIFORM              0.11  0.12  0.13  0.13  0.13  0.13  0.13  0.13  0.13  0.13  0.13
DISTANCE             0.13  0.14  0.14  0.15  0.15  0.15  0.16  0.16  0.15  0.16  0.15

TABLE 2. R2 VALUES AS A RESULT OF THE DIFFERENT PARAMETER COMBINATIONS FOR THE K-NEAREST NEIGHBORS REGRESSION ALGORITHM
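The difference between the two weighting schemes in table 2 can be illustrated with a minimal plain-Python sketch of k-nearest neighbors regression. This is not the implementation used in this research; the function name and the toy data are hypothetical, and the handling of zero distances is an assumption.

```python
import math


def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Predict a value for `query` from its k nearest training samples."""
    # Pair every training target with its Euclidean distance to the query.
    neighbors = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    if weights == "uniform":
        return sum(y for _, y in neighbors) / len(neighbors)
    # Assumption: exact matches (distance 0) determine the prediction alone.
    exact = [y for d, y in neighbors if d == 0]
    if exact:
        return sum(exact) / len(exact)
    # Inverse-distance weighting: closer neighbors count more.
    total_weight = sum(1 / d for d, _ in neighbors)
    return sum(y / d for d, y in neighbors) / total_weight


# Hypothetical one-dimensional toy data.
train_x = [(0.0,), (1.0,), (4.0,)]
train_y = [0.0, 10.0, 40.0]
uniform = knn_predict(train_x, train_y, (3.0,), k=2, weights="uniform")
distance = knn_predict(train_x, train_y, (3.0,), k=2, weights="distance")
```

In this toy case the uniform prediction is 25 and the distance-weighted prediction is 30, because the nearer sample (target 40, distance 1) receives twice the weight of the farther one (target 10, distance 2).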

For the random forest algorithm, setting no limit to the maximum tree depth and setting the maximum number of features considered for each split to be equal to the square root or the binary logarithm of the total number of features, resulted in the highest average R2 (figure 5) (table 3).

FIGURE 5. VISUALISATION OF THE PARAMETER OPTIMISATION FOR THE RANDOM FOREST ALGORITHM

MAXIMUM TREE DEPTH  31    41    51    61    71    81    91    101   111   121   131
N                   0.07  0.07  0.08  0.08  0.08  0.08  0.08  0.08  0.09  0.09  0.09
SQRT(N)             0.10  0.10  0.11  0.11  0.11  0.11  0.10  0.11  0.11  0.11  0.12
LOG(N)              0.10  0.10  0.10  0.11  0.10  0.10  0.11  0.11  0.11  0.11  0.11

TABLE 3. R2 VALUES AS A RESULT OF THE DIFFERENT PARAMETER COMBINATIONS FOR THE RANDOM FOREST ALGORITHM
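The similarity between the SQRT(N) and LOG(N) rows in table 3 is less surprising when the resulting number of candidate features per split is written out. The sketch below assumes an input matrix with the 14 features listed in table 6, and assumes that non-integer results are truncated to a whole number, as scikit-learn is understood to do. The function name is hypothetical.

```python
import math


def candidate_features(n_features):
    """Number of features considered per split under each setting."""
    return {
        "n": n_features,
        # Assumption: fractional results are truncated, with a minimum of 1.
        "sqrt": max(1, int(math.sqrt(n_features))),
        "log2": max(1, int(math.log2(n_features))),
    }


settings = candidate_features(14)
```

With 14 features both the square root and the binary logarithm truncate to 3 features per split, so under these assumptions the two rows would differ only through the randomness of the forest.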

For the multi-layer perceptron regressor, while the ReLU and identity activation functions resulted in a similar optimal R2, the identity function provided more constant results when iterating over different values for alpha.


the number of nodes in the hidden layer, the optimal parameters are considered to be a hidden layer size of 1 with a corresponding alpha of 30 and the activation function set to identity (figure 6) (table 4).

FIGURE 6. VISUALISATION OF THE PARAMETER OPTIMISATION FOR THE MULTI-LAYER PERCEPTRON REGRESSOR ALGORITHM

ALPHA     4      5      6      7      8     9     10     30    100   200
RELU      -0.56  -0.28  -0.10  -0.03  0     0     -0.01  0.12  0.13  0.13
IDENTITY  0.14   0.14   0.14   0.14   0.14  0.13  0.14   0.15  0.14  0.14

TABLE 4. R2 VALUES AS A RESULT OF THE DIFFERENT PARAMETER COMBINATIONS FOR THE MULTI-LAYER PERCEPTRON REGRESSOR ALGORITHM
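The negative values in the ReLU row of table 4 follow from the definition of R2, assuming the usual coefficient of determination R2 = 1 - SS_res / SS_tot is used as the test statistic: a model that predicts worse than the mean of the observed values yields a negative score. A minimal sketch with hypothetical toy data:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]
perfect = r_squared(actual, [1.0, 2.0, 3.0])    # exact predictions
mean_only = r_squared(actual, [2.0, 2.0, 2.0])  # always predicts the mean
reversed_ = r_squared(actual, [3.0, 2.0, 1.0])  # worse than the mean
```

The three cases give 1, 0 and -3 respectively, which shows that the score has no lower bound on a test set.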

ALGORITHM TESTING AND COMPARING

When using the optimised parameters to do 500 predictions over random test sets, the k-nearest neighbors regression algorithm performs best with an average R2 of 0.1537 (table 5). However, according to the t-test, when comparing the results from the k-nearest neighbors regression to those of the second and third best performing algorithms, namely Bayesian ridge regression and ridge regression, the p-value is 0.44 in both cases, and when comparing the latter two to each other, the p-value is 0.98. At the 95% confidence level, the null hypothesis of equal performance can thus not be rejected for any of the three tests, making the differences in average R2 between the three best performing algorithms insignificant. In contrast, when comparing each of these algorithms to any of the remaining algorithms, the p-value is always lower than 0.05, indicating a significant difference in R2. This, in combination with their relatively fast computation time compared to the random forest and multi-layer perceptron regressor algorithms, makes each of them equally superior for this prediction problem.

ALGORITHM                         AVERAGE R2  VARIANCE  COMPUTATION TIME
LINEAR REGRESSION                 0.1436      0.0026    Fast
LASSO REGRESSION                  0.1414      0.0013    Fast
RIDGE REGRESSION                  0.1515      0.0017    Fast
BAYESIAN RIDGE REGRESSION         0.1516      0.0015    Fast
K-NEAREST NEIGHBORS REGRESSION    0.1537      0.0023    Fast
RANDOM FOREST                     0.1107      0.0022    Slow
MULTI-LAYER PERCEPTRON REGRESSOR  0.1445      0.0020    Slow

TABLE 5. TEST STATISTIC (R2), VARIANCE AND (RELATIVE) COMPUTATION TIME FOR EACH OPTIMISED ALGORITHM
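The reported p-values can be approximated from the summary statistics in table 5 alone. The sketch below assumes 500 runs per algorithm, interprets the variance column as the variance of R2 over those runs, uses the unequal-variance (Welch) form of the two-sample test statistic, and replaces the t distribution by a normal distribution, which is a close approximation at this sample size. These choices are assumptions for illustration, as the exact test variant is not specified here, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_p_value(mean_a, var_a, mean_b, var_b, n=500):
    """Two-sided p-value for equal means, computed from summary statistics."""
    standard_error = math.sqrt(var_a / n + var_b / n)
    t_statistic = (mean_a - mean_b) / standard_error
    # Assumption: normal approximation to the t distribution (large n).
    return 2 * (1 - NormalDist().cdf(abs(t_statistic)))


# (average R2, variance) pairs taken from table 5.
knn = (0.1537, 0.0023)
bayesian_ridge = (0.1516, 0.0015)
ridge = (0.1515, 0.0017)
random_forest = (0.1107, 0.0022)

p_knn_vs_bayes = two_sample_p_value(*knn, *bayesian_ridge)
p_ridge_vs_bayes = two_sample_p_value(*ridge, *bayesian_ridge)
p_knn_vs_forest = two_sample_p_value(*knn, *random_forest)
```

Under these assumptions the first two comparisons give p-values of roughly 0.45 and 0.97, close to the reported 0.44 and 0.98, while the comparison with the random forest gives a p-value far below 0.05.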

DATA REDUCTION

Two types of data reduction are tested: the reduction in sessions and the reduction in features. For testing the impact of these reductions, the Bayesian ridge regression algorithm is used. There is no particular reason for using this algorithm instead of the ridge regression algorithm, while the argument for not using the k-nearest neighbors algorithm is its parameter dependency, due to which it would require a new optimisation process for every new reduced data set.

SESSION REDUCTION

When applying the function that only keeps a session if it belongs to a user who has used the same charging station at least 5 times during the same month, the average R2 over 500 predictions on a random test set is 0.1175, as opposed to the prior average R2 of 0.1516. The prediction accuracy decreases even more after reducing the number of sessions to only those that have started between 4pm and 4am, as after applying this function the average R2 becomes 0.0875. According to the t-test, both decreases are significant. In contrast to these decreases, setting a minimum of 6 to the number of months a charging station has to be in use before being taken into consideration increases the average R2 significantly from 0.1516 to 0.1611 (figure 7) (table 6). A peak is also visible at a minimal number of months of 16. However, when applying a t-test to this point and its predecessor and successor, the increase turns out to be insignificant and is thus attributed to chance.


FIGURE 7. VISUALISATION OF THE OPTIMISATION OF THE MINIMUM NUMBER OF MONTHS A CHARGING STATION HAS TO BE IN USE.
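The three session reduction functions can be sketched as simple filters. The record layout (dictionaries with `user_id`, `station_id` and `start` keys), the function names, and the interpretation of "months in use" as the number of distinct calendar months with at least one session are assumptions made for this example, not the implementation used in this research.

```python
from collections import Counter, defaultdict
from datetime import datetime


def _month_key(session):
    """Calendar month in which a session started, as a (year, month) pair."""
    return (session["start"].year, session["start"].month)


def keep_frequent_users(sessions, min_sessions=5):
    """Keep sessions of users with enough sessions at one station in one month."""
    counts = Counter(
        (s["user_id"], s["station_id"], _month_key(s)) for s in sessions
    )
    return [
        s for s in sessions
        if counts[(s["user_id"], s["station_id"], _month_key(s))] >= min_sessions
    ]


def keep_evening_sessions(sessions):
    """Keep sessions that start from 4pm up to, but not including, 4am."""
    return [s for s in sessions if s["start"].hour >= 16 or s["start"].hour < 4]


def keep_established_stations(sessions, min_months=6):
    """Keep sessions of stations that were active in at least `min_months` months."""
    months = defaultdict(set)
    for s in sessions:
        months[s["station_id"]].add(_month_key(s))
    return [s for s in sessions if len(months[s["station_id"]]) >= min_months]


# Hypothetical toy data: user "u1" charges five evenings in January at
# station "A", user "u2" charges there once in the morning.
sessions = [
    {"user_id": "u1", "station_id": "A", "start": datetime(2016, 1, day, 18, 0)}
    for day in range(1, 6)
] + [{"user_id": "u2", "station_id": "A", "start": datetime(2016, 1, 7, 9, 30)}]

frequent = keep_frequent_users(sessions)
evening = keep_evening_sessions(sessions)
established = keep_established_stations(sessions)
```

In the toy data, the first two filters each keep the five sessions of user "u1", while the third removes everything because station "A" has only been active in a single month. This also illustrates how quickly such reductions can shrink the number of samples.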

FEATURE REDUCTION

According to the information gain and individual regression analysis, the worst predicting feature is the fraction of households having their own entranceway (10*) (figure 8). When this feature is removed from the data, the average R2 becomes 0.1616 after initially being 0.1611. However, according to the t-test, this increase is insignificant. Furthermore, removing any more features only results in a lower average R2.

Lastly, when doing predictions based on only the monthly per capita income (2*), the average R2 over 500 predictions is 0.1231 with a variance of 0.00121 when using the whole data set and 0.1302 with a variance of 0.00146 when setting a minimum of 6 to the required number of months a station needs to be in use.
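The procedure of averaging R2 over repeated predictions on random test sets can be sketched for a single input feature as follows. Ordinary least squares stands in for the Bayesian ridge regression used in this research, and the test fraction, the seed, the function names and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def r_squared(actual, predicted):
    """Coefficient of determination on a held-out set."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def repeated_holdout(xs, ys, runs=500, test_fraction=0.2, seed=0):
    """Mean and variance of the test-set R2 over random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(2, int(len(xs) * test_fraction))
    scores = []
    for _ in range(runs):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predicted = [intercept + slope * xs[i] for i in test]
        scores.append(r_squared([ys[i] for i in test], predicted))
    return mean(scores), pvariance(scores)


# Hypothetical, noise-free toy data: the fitted line predicts it exactly.
incomes = [float(i) for i in range(1, 21)]
usage = [2.0 * x + 1.0 for x in incomes]
average_r2, variance_r2 = repeated_holdout(incomes, usage, runs=50)
```

On noise-free data every split yields an R2 of 1 with no variance between runs. On real data the spread between runs is what the reported variance would capture under this reading.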


FIGURE 8. NORMALISED R2 AND INFORMATION GAIN OF EACH INDIVIDUAL FEATURE

EVALUATION

Approximately 13 percent of the variance in energy usage between different electric vehicle charging stations can be explained by the average income of the households that are located within a 200-meter radius of these charging stations when using the Bayesian ridge regression algorithm. Despite an increase of 3 percentage points when integrating additional demographic features, the predictive power remains rather low. Yet, except for the number of vehicles (9*), the presence of own entranceways (10*) and the percentage of rented apartments (12*), all features are significantly correlated with the monthly energy usage (table 6), as indicated by the low p-value (<0.05). However, these correlations are apparently not sufficient for making accurate predictions.


The main reason for this is the considerable amount of variance in the output data. That is, the average monthly energy usage per charging station is 269 kWh with a standard deviation of 171 kWh. Due to this high standard deviation, the standard error is relatively high for each of the correlations between the average monthly energy usage and the individual features (table 6). This standard error is the square root of the residual sum of squares divided by the degrees of freedom, and is thus very similar to the typical difference between the actual charging station usage and the usage as predicted by a linear regression function based on a single feature.
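The link between the high standard deviation and the high standard errors can be made explicit. For a linear regression, the residual standard error is approximately the standard deviation of the output multiplied by the square root of 1 - R2, when the small degrees-of-freedom correction is ignored. The sketch below applies this approximation to the rounded values reported for the income feature (2*); the function name is hypothetical.

```python
import math


def approx_standard_error(std_dev, r2):
    """Approximate residual standard error: std_dev * sqrt(1 - R2)."""
    return std_dev * math.sqrt(1 - r2)


# Standard deviation of the monthly energy usage (kWh) and R2 of feature 2*.
estimate = approx_standard_error(171, 0.1231)
```

This gives roughly 160 kWh, in line with the standard error of 160.1658 reported for feature 2* in table 6. With an R2 close to zero, the standard error is necessarily close to the standard deviation itself.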

FEATURE INDEX  COEFFICIENT  STANDARD ERROR  R-SQUARED   P-VALUE     INFORMATION GAIN
1*             27.3976      166.1411        0.0550      1.7023e-09  9.1578
2*             0.0524       160.1658        0.1231      4.6263e-20  9.2968
3*             4.5244       160.8716        0.1140      1.2362e-18  9.0061
4*             6.8144       159.9077        0.1246      2.4980e-20  8.7644
5*             -1.5693      169.3736        0.0179      6.7207e-04  7.6194
6*             2.8753       166.1778        0.0546      1.9687e-09  7.4467
7*             -2.0299      168.4415        0.0295      1.2118e-05  9.0974
8*             2.8961       166.1104        0.0562      1.2177e-09  9.1021
9*             -28.2080     170.7417        0.0019      0.2648      9.2968
10*            -10.0810     170.9034        4.4922e-05  0.8652      5.9082
11*            1.0435       165.2228        0.0654      4.4643e-11  9.3030
12*            -0.6249      170.6321        0.0032      0.1505      7.5652
13*            -0.0120      169.4056        0.0175      7.6559e-04  9.2832
14*            0.4604       163.0326        0.0900      7.2825e-15  7.8976

TABLE 6. STATISTICAL DATA FROM THE LINEAR REGRESSION ANALYSIS ON THE CORRELATIONS BETWEEN THE INDIVIDUAL FEATURES AND THE MONTHLY ENERGY USAGE.

The high standard error for the average income (2*) (3*) (4*) (table 6) (figure 9) is especially unexpected, because this error is much lower when using the same feature for predicting the number of vehicles (van Montfort et al., 2016) instead of the amount of energy. The substantial variance in the output values that causes this relatively high error could be explained by the theory that the frequency and magnitude of charging sessions differ per user and vehicle (Lampropoulos et al., 2010). This would mean that besides the variance in the number of vehicles, there is also an additional variance in energy usage between stations with the same number of vehicles.

Furthermore, besides this explanation, the standard error is also expected to be lower for the number of vehicles than for the energy usage because of the absolute differences in output values. These differences are significantly smaller for the number of vehicles than for the energy usage, as the former has an average of 0.89 cars with a standard deviation of 1.25, whereas the latter has an average of 269 kWh with a standard deviation of 171 kWh. Nonetheless, a standard error of approximately 160 on a data set with a standard deviation of 171 is relatively high compared to a standard error of 3.78e-5 on a data set with a standard deviation of 1.5, which makes it plausible that one of the standard errors is incorrect. As the research question and hypothesis are mainly built on the tight correlation between the average income and the number of vehicles, it is decided, in consultation with Over Morgen, to validate this correlation.

FIGURE 9. VISUALISATION OF THE DISTRIBUTION OF THE DATA AND THE LINEAR CORRELATION BETWEEN THE MONTHLY PER CAPITA INCOME (2*) AND THE MONTHLY PROVIDED ENERGY IN KWH.

For the validation, the same data preparation is used as described in the method section, with only the average income (2*) of the households that are located within a 200-meter radius of a charging station used as input data. Furthermore, the output value is not set to be the average monthly energy usage of a charging station. Instead, the monthly number of vehicles using a charging station is counted, assuming that each user identification code corresponds to one vehicle. During the session reduction, the same functions have been applied as in the research by van Montfort et al. (2016). That is, while no requirement is set for the minimum number of months a charging station needs to be in use, the minimum number of monthly sessions for each user is set to 5 and the sessions are limited to only those starting between 4pm and 4am. Applying linear regression analysis on this data set returns a correlation with a coefficient of 3.85e-4, which is slightly less steep than the coefficient of 7.41e-4 as discovered by van Montfort et al. (2016). This slight difference in steepness can be explained by the difference in scope for the output values. Van Montfort et al. (2016) used all zipcode-5 areas in The Hague, including those in which no charging stations are located and thus also having samples with zero vehicles, whereas during this research no such areas were used, resulting in a slightly higher minimum output value of 1 for each sample. Nevertheless, being of the same order of magnitude, both outcomes indicate a rather similar correlation, each having a p-value below 0.05 and thus being significant.

This similarity also becomes apparent when looking at the polynomial correlation between the average income and the number of vehicles (figure 1) (figure 10). For both cases, a third-degree polynomial seems to describe the data best, resulting in a similar curve around the maximum value of approximately three vehicles. However, while during this research the optimal degree is determined using a training, a cross validation and a test set, van Montfort et al. (2016) based the similar optimal curve on the reasoning that as the income increases, personal interest instead of affordability becomes the main driver behind the decision on whether or not to buy an electric vehicle. This decreasing dependency on money during the decision-making process is visible in the decreasing rate of the positive slope at the left side of the maximum. Furthermore, they explain the negative slope on the right side of the maximum by the theory that once a certain income is reached, it becomes possible to purchase a private charging station, resulting in a relatively infrequent use of public charging stations in the richest areas (van Montfort et al., 2016).

FIGURE 10. VALIDATION OF THIRD DEGREE POLYNOMIAL CORRELATION BETWEEN THE MONTHLY PER CAPITA INCOME (2*) AND THE MONTHLY NUMBER OF VEHICLES.


More important than this similar correlation, however, is the discovered standard error of 0.88, which is much larger than the standard error of 3.78e-5 as discovered by van Montfort et al. (2016). The reason for this significant difference cannot be visually derived by looking at the position of the individual samples relative to the regression line, as the visualisation of the correlation by van Montfort et al. (2016) does not contain a scatterplot. Nevertheless, there seems to be an explanation for the seemingly low standard error, as this value is very close to the standard error of the slope coefficient discovered during the validation, which is 4.78e-5. The standard error as indicated by van Montfort et al. (2016) might thus not refer to the variance in the data, but to the uncertainty in the steepness of the slope, which is very small due to the sufficient number of samples in the data set. In conclusion, due to a different interpretation of the term standard error, the vertical distance between the samples and the regression line turns out to be significantly larger than initially assumed for the correlation between the average income and the number of vehicles. Nevertheless, the hypothesis that the varying charging frequency and magnitude between different users makes it more difficult to predict the energy usage than the number of vehicles seems to be correct, as when using all features in the database, predicting the number of vehicles results in an R2 of 0.30 as opposed to the R2 of 0.16 when predicting the actual average energy usage.

STATISTIC                   VAN MONTFORT ET AL. (2016)  VALIDATION
COEFFICIENT                 7.41e-4                     3.8e-4
P-VALUE                     <0.0001                     5.66e-15
STANDARD ERROR OF ESTIMATE  3.78e-5                     4.78e-5
STANDARD ERROR              unknown                     0.88

TABLE 7. VALIDATION OF THE LINEAR CORRELATION BETWEEN THE MONTHLY PER CAPITA INCOME (2*) AND THE MONTHLY NUMBER OF VEHICLES.
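The two meanings of standard error distinguished above can be computed side by side for a simple linear regression. The sketch uses the textbook formulas: the residual standard error s = sqrt(RSS / (n - 2)) and the standard error of the slope s / sqrt(S_xx), where S_xx is the sum of squared deviations of the input from its mean. The function name and the toy data are hypothetical.

```python
import math


def regression_errors(xs, ys):
    """Return slope, residual standard error and standard error of the slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_se = math.sqrt(rss / (n - 2))    # spread of samples around the line
    slope_se = residual_se / math.sqrt(s_xx)  # uncertainty of the slope itself
    return slope, residual_se, slope_se


slope, residual_se, slope_se = regression_errors([1, 2, 3, 4], [1, 3, 2, 5])
```

Because S_xx grows with both the number of samples and the spread of the input, the standard error of the slope can be orders of magnitude smaller than the residual standard error, as with the values 4.78e-5 and 0.88 in table 7.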

CONCLUSION AND DISCUSSION

The main question of this research was whether it is possible to predict the usage of an electric vehicle charging station based on the demographic characteristics of its location. In order to answer this question, two sub-questions have been addressed. From the first sub-question, it can be concluded that 13 percent of the variance in the energy usage between different electric vehicle charging stations in The Hague can be explained by the average income of the households that are located within a 200-meter radius of these charging stations. This proportion was discovered by training a Bayesian ridge regression algorithm over a subset of the data and then using the generated prediction model to do predictions over the remaining data set. The second sub-question concludes that this R2 of 0.13 could be increased to 0.15 when including an additional 13 demographic features in the data set. This proportion could even be increased to 0.16 when only considering charging stations that have been in use for at least six months. Further data reduction, on the other hand, had a negative impact on the predictive power due to the accompanying significant reduction in the number of samples.

According to these answers to the sub-questions, the initial hypothesis that it is possible to predict the usage of an electric vehicle charging station based on the demographic characteristics of its location is thus correct. The generated prediction algorithm could support municipalities during their decision-making process on where to install new charging stations, by providing an automated method for doing predictions on the energy demand in each of the potential locations. However, these predictions are merely estimations, as the actual energy usage can significantly deviate, which is indicated by the relatively low predictability of 16 percent. Nevertheless, the predictions can provide useful insights into which areas demand more or fewer charging opportunities, which can be used to more effectively minimise the number of both overused and underused charging stations, resulting in a more efficient charging infrastructure.

However, a higher predictive power was expected based on the assumed strong correlation between the average income in an area and the number of vehicles in that area using a charging station as discovered by van Montfort et al. (2016). The main reason for the relatively low predictability despite this significant correlation is the high variance in the data, which causes a large standard error. However, this large standard error is not referred to in the article by van Montfort et al. (2016), as the partial validation in this research indicates that their stated standard error refers to the standard error of the slope coefficient instead of the fitting error. Consequently, there indeed is a significant correlation between the average income in an area and the number of vehicles in that area using a charging station. However, due to the high variance in the data, this correlation is not sufficient for doing accurate predictions.

The high variance and its negative impact on the predictive power of the model are partially a consequence of unavoidable limitations due to the psychological nature of the problem. That is, due to both contextual and personal factors (Lampropoulos et al., 2010), individual charging behaviour is difficult to predict. Nevertheless, there might also be more avoidable causes for this variance, such as the temporal discrepancy between the output and input data. For instance, the monthly per capita income (2*) dates back to 2008, whereas the output values are derived from transactions done over the period 2014-2017. Using more recent values for the demographic characteristics might therefore result in stronger correlations and thus a more reliable prediction model.


Furthermore, for each charging station, the input features are based on the residences located within a 200-meter radius, whose demographic characteristics, on the other hand, are determined on a much larger scale. While only the average income derived in 2008 is based on the relatively small postal code scale, the other features are considerably less precise. Only the number of vehicles and the possible presence of an own entranceway are available on a residential scale. However, these values are only estimates, which are in turn also based on other larger level features. Consequently, despite considering different residences for each charging station, many have similar input values. Nonetheless, each of them has its own specific energy usage, resulting in a variance in output values amongst samples with similar input values. Apart from using more precisely determined input values, this spatial discrepancy could also be solved by combining multiple charging stations and thus doing larger scale predictions. In conclusion, closing the spatial and temporal discrepancy in the data may lead to more accurate predictions in the future. Furthermore, besides using better data, future research could also take into account other cities to either expand the data set or to validate the current model and thus increase its reliability.

REFERENCES

Beltramo, A., Julea, A., Refa, N., Drossinos, Y., Thiel, C., & Quoilin, S. (2017). Evaluating the impact of EV charging demand on the Dutch energy system: A case study in The Netherlands. In Proceedings of the 14th International Conference on the European Energy Market.

California State University Northridge. (2011). Data Mining: week 4 [Class lecture slide]. Retrieved from http://www.csun.edu/~twang/595DM/Slides/

Cuijpers, M., Staats, M., Bakker, W., & Hoekstra, A. (2016). Eindrapport - Toekomstverkenning elektrisch vervoer, 71. Retrieved from http://www.ecofys.com/files/files/ecofys-2016-eindrapport-toekomstverkenning-elektrisch-vervoer.pdf

Fotouhi, A., Auger, D. J., Propp, K., Longo, S., & Wild, M. (2016). A review on electric vehicle battery modelling: From Lithium-ion toward Lithium–Sulphur. Renewable and Sustainable Energy Reviews, 56, 1008-1021.

Franke, T., & Krems, J. F. (2013). Understanding charging behaviour of electric vehicle users. Transportation Research Part F: Traffic Psychology and Behaviour, 21, 75-89.


Lampropoulos, I., Vanalme, G. M., & Kling, W. L. (2010, October). A methodology for modeling the behavior of electricity prosumers within the smart grid. In Innovative Smart Grid Technologies Conference Europe (ISGT Europe), 2010 IEEE PES (pp. 1-8). IEEE.

Mehta, R., Srinivasan, D., Khambadkone, A. M., Yang, J., & Trivedi, A. (2016). Smart Charging Strategies for Optimal Integration of Plug-in Electric Vehicles within Existing Distribution System Infrastructure. IEEE Transactions on Smart Grid.

Michel, V. (2010). Regularized Regression: A Bayesian point of view [PDF]. Retrieved from http://www.unicog.org/pmwiki/uploads/Main/PresentationMM_02_10.pdf

Ng, A. (2012). Lecture notes 1: Supervised Learning, Discriminative Algorithms [PDF]. In Stanford CS229: Machine Learning. Retrieved from http://cs229.stanford.edu/notes/cs229-notes1.pdf

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Sweda, T. M., & Klabjan, D. (2015). An Agent-Based Information System for Electric Vehicle Charging Infrastructure Deployment. Journal of Infrastructure Systems, 21(2), 4014043. https://doi.org/10.1061/(ASCE)IS.1943-555X.0000231

van Montfort, K., Kooi, M., van der Poel, G., & van den Hoed, R. (2016). Which factors influence the success of public charging stations of electric vehicles? 6th Hybrid and Electric Vehicles Conference (HEVC 2016), 11-16. 10.1049/cp.2016.0971

van Montfort, K., Visser, J., van der Poel, G., & van den Hoed, R. (2016). Voorspellen van benodigde infrastructuur van publieke laadpalen voor elektrische auto’s. Tijdschrift Vervoerwetenschappen, (2), 2–13. Retrieved from www.vervoerswetenschap.nl

Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2011). Probability and statistics for engineers and scientists (9th edition). Boston: Pearson Education, Inc.

Xi, X., Sioshansi, R., & Marano, V. (2013). Simulation–optimization model for location of a public electric vehicle charging infrastructure. Transportation Research Part D: Transport and Environment, 22, 60–69. https://doi.org/10.1016/j.trd.2013.02.014