## The effect of background knowledge on the demand prediction of Bikeshare from two different points of view

Kevin Kleijer, 11811927

Faculty Economics and Business, University of Amsterdam 6013B0354Y: Bachelor’s Thesis and Thesis Seminar Econometrics

Dr. Samaneh Khoshrou June 30, 2021

### Statement of Originality

This document is written by Student [Kevin Kleijer] who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

UvA Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

### Abstract

CitiBike has operated a bike-sharing program in New York City since 2013. The use of bikeshare is often affected by weather characteristics and calendar effects. In this work, we address the differences in demand prediction from two points of view, namely that of the customers and that of the authority. From the user’s point of view, a certain number of bikes is available to rent the following day. This availability is expressed as a chance of renting a bike, and the effect of background knowledge on this chance is discussed in the methodology and results, using a weighted least squares regression to predict the number of bikes. This leads to the question: How does background knowledge such as weather and calendar information affect the chance of renting a bike? From an authoritative point of view, the trips going in and out of a station are discussed, since an imbalance between them can result in a bike deficit. The trip predictions were performed using linear regressions.

The following question was asked: How does background knowledge affect trips starting/ending? This demand prediction is studied at 3 different bikeshare stations with different characteristics. Results showed that background knowledge such as weather and calendar information does have a significant effect on both points of view.

### 1 Introduction

The bike-sharing system started with the White Bike Plan in Amsterdam in 1965, the first time a bike-sharing program was introduced. However, the program did not go as planned because bikes were often damaged or stolen (DeMaio, 2009). Bike-sharing has received more and more attention over the past few years. The bikes are now of greater quality and are better protected against theft. Bike-sharing programs are currently active in more than 1900 cities around the world and still growing; the number of bikes in use worldwide is around 9 million (Meddin et al, 2021). A bike-sharing system allows citizens and tourists to rent a bicycle at one station and drop it off at another station in a different location. The program often has a subscription option: members can make unlimited use of the bikes for a subscription fee. It is also possible to pay for a single one-way trip, in which case the bike may be used for a limited time to complete the trip.

The bike-sharing system has a positive effect on the cycling population (DeMaio, 2009). There are several reasons why the system increases the demand for cycling. The bike-sharing system can be used to solve the so-called last-mile problem. The last-mile problem arises when citizens travel by some transportation method and have no option other than walking for the last part of the trip. With bikeshare, citizens can use a bike for this last part of their trip, which can be faster than walking and saves time (Ma, 2018). The convenience of the bike is also an important reason to use bike sharing (Fishman et al, 2015). First of all, the users do not have to worry about the bike getting stolen or damaged after they have returned it to the station, in contrast with using a bike of their own. Secondly, the bike can at times be a faster way to move around in the city than any other method of transport. Jensen et al (2010) conclude from a study on travel behavior in Lyon that moving by bicycle for a short trip is faster than moving by car, which can result in citizens having a preference for bicycles. A third reason is the fun that bikeshare users are having, specifically tourists and casual users, the so-called non-members. Lastly, the installation of a bike-sharing system has a positive effect on the public health of the users and reduces car usage, which results in less air pollution in the city (Wang & Chen, 2020).

CitiBike in New York is the largest bike-sharing program in the nation. New York has around 20000 classic bikes in use and over 1300 stations to grab or drop off a bike (CitiBike New York City, n.d.). CitiBike is also one of the first programs to let customers use e-bikes. In this paper, the weather reports of 2016 from New York are used in the model for the prediction of available bikes. There are several variables, for instance the daily temperature, that may affect the number of bikes available the next day. While the interest in bikeshare is growing, there is some missing knowledge about the relation between the weather and the bikes used at a certain station. For instance, how will a weather report change the number of bikes used in the bike-sharing program? In this paper, this relation is examined at three separate bikeshare stations in Manhattan, New York City. The stations this study looks into are: ’Central Park S & 6 Ave’, ’E 47 St & Park Ave’, and ’St James Pl & Pearl St’. Figure B.1 shows the locations of the stations (Google, n.d.).

One of the most important factors when riding a bike is the weather, because the rider is outside in the environment. Environmental changes have a significant effect on the use of bikeshare and on the prediction of how many bikes are available and how many trips are being made (Wang & Chen, 2020). This makes the bike-sharing system an interesting case for predicting the number of bikes that are used and the number of trips being made every day. In this project, we look at two different points of view for the demand prediction of bikeshare.

The prediction of bikeshare raises questions for the customers: on a sunny day, are there enough bikes available to use? Is this affected by certain factors around the stations? Customers would like to know whether a bike will be available the next morning or midday at one of the stations where CitiBike is active.

The first point of view is that of the user. The research question for the customers is the following: ’How does background knowledge like weather and calendar information affect your chance of renting a bike?’.

The same prediction raises questions for the authority of CitiBike. Are there enough bikes at the stations? How many trips end and start at a certain station? For the authority’s point of view, a second question can be formulated, namely: ’How does background knowledge like weather and calendar information affect trips being made by bikeshare?’. From an authority point of view, it is important to know how many bikes are going in and out of the stations. These questions are examined at the three different stations.

The remainder of the paper is structured as follows. The second section reviews the previous literature on bikeshare prediction in line with the research questions of this study. The third section presents the dataset. The fourth section presents the theoretical model and framework used to explain the variables of interest. The fifth section shows the results for the methods described in the fourth section. Finally, the sixth section gives the conclusion and suggests further research.

### 2 Literature Review

Bike-sharing is a well-studied topic in the literature. Several factors affect the use of bikeshare. A bikeshare station next to another method of transport, for instance Grand Central Terminal in New York, complemented the use of bikeshare (Ma et al, 2018). Later research contradicted this and suggested that the different methods of transport are substitutes (Wang & Chen, 2020). Wang and Chen (2020) concluded that higher density areas such as Times Square and Central Park in Manhattan, New York, have a positive effect on the use of bikes at the bikeshare stations there. The capacity of each station was significantly associated with the passenger demographic and the vicinity of public transport around these stations. Trip data at the station level revealed this evidence (Wang & Chen, 2020). Gehrke et al (2021) were in line with this and state that built environment predictors of population increased bike access and usage. The earlier study by Ma et al (2018) had the same findings as the research of Wang and Chen (2020) and showed that most of the trips in the morning and evening are for commuting purposes. Their study showed that trips during 07:00-10:00 and 16:00-20:00 are the most common. This resulted in a preference for bikeshare stations near a public method of transport, together with complaints about the low supply of bicycles at these stations.

It is important to know whether there are enough bikes in the city to let the system work. Goit and Cherrier (2014) concluded that it is possible to predict the use of the bike-sharing system 24 hours ahead, which leads to several advantages. The first purpose of event detection on bikeshare was better planning and management of the system (Fanaee & Gama, 2013). Different models have been used for this prediction. Goit and Cherrier (2014) predicted the usage of the bikeshare system for the upcoming day. They made use of a dataset covering a period of 2 years in Washington DC. Their analysis was based on 5 different regressions: Ridge, Adaboost, Support Vector, Random Forest, and Gradient Tree Boosting regressions to predict the number of bicycles. Data at time t was used to predict the use of bikeshare at time t + 1.

In a different paper, Fanaee and Gama (2013) made use of the Washington DC dataset over a period of 2 years (the same dataset that Goit and Cherrier used). They tried to detect events by making use of the bike-sharing system. Furthermore, they predicted the use of bike-sharing with a regression method different from the five that Goit and Cherrier (2014) used, namely a regression tree model. Goit and Cherrier (2014) approached the analysis of the bike-sharing system by extracting features from the data and then ranking them by significance. Multiple regressions were used to find the most significant features. Similarly, Yang et al (2020) used stability analysis to find the most significant parameters for estimating the use of bikeshare. A combination of the two models and estimation of parameters can lead to the requested model for this paper.

Citizens would like to make use of the bike-sharing system, but several factors may affect their decision to do so. Gebhart and Noland (2014) studied the impact of weather on the bikeshare trips made at that moment. They made use of the dataset of trips measured in Washington DC. Their analysis showed a positive correlation between bikeshare trips and the average temperature. Goit and Cherrier (2014) used the same dataset and concluded that the temperature is very relevant (almost a necessity) for the prediction of bikeshare usage 24 hours ahead; however, they wrote that the weather and humidity variables had a lesser impact on the bikeshare usage. On the other hand, Gebhart and Noland (2014) stated that the variables rainfall and humidity (both weather conditions in the model of Goit and Cherrier) did have a significant impact on the usage of bikeshare because they were negatively correlated with the number of trips and the duration of trips. Their approach to data cleaning was to create dummy variables for data that is difficult to handle, for instance, whether there is a storm coming or not (Gebhart & Noland, 2014). An analysis of the bike-sharing model therefore requires dummy variables to obtain a useful model. Finally, a significance analysis was used to check whether the dataset in use contains the important factors for the model. This may lead to event detection having a more important role. Event detection as mentioned in Fanaee and Gama (2013) also had a second purpose: it can warn people or suggest that they do not make use of the bike-sharing system because of the upcoming weather conditions.

From an authority point of view, there were conflicts between the government and the bike-sharing companies (Yang et al, 2020). The bike-sharing companies would like to maximize their profit, while the government is responsible for the security and stability of the bike-sharing system. Yang et al (2020) made use of a sensitivity analysis that showed the most sensitive parameters of the model, in order to compare companies against government regulations. This showed that bike-sharing companies are most sensitive to producing low-quality bicycles because of the maintenance costs, and therefore reminded the government to control this through regulation. An earlier study by Lohry and Yiu (2015) showed, by comparing government-run bike-sharing systems against private companies, that the private companies had bikes of worse quality and were less effective in achieving higher bikeshare utilization. The reason for the government systems being more effective was again the conflict between the bike-sharing companies and the government, which Yang et al (2020) also discussed. CitiBike in New York City makes use of employees or members of CitiBike to move bikes to stations that are often used (CitiBike, 2013), so that very busy stations do not run out of bikes for the customers. This is a point for both the authority and the bikeshare company to work on together, so that the users of bikeshare do not run into the problem of not having a bike at their desired station.

In this study, the focus was on predicting bikeshare usage ahead of time while making use of the most significant weather variables in the dataset. There are a large number of stations in New York where CitiBike is active. We focused on three stations with different characteristics and predicted the chance of renting a bike. Then, from an authoritative point of view, the trips starting and ending at the stations were examined and predicted. Based on the literature above, several questions were raised. These questions were put as hypotheses. The problem was seen from two different perspectives: a user’s point of view and an authority point of view. From a user’s point of view, the following questions were asked: When do the users have a higher chance of renting a bike? Did a higher temperature or a weekday affect this chance for the users? The chance of renting a bike is studied for the two timeframes that the previous literature associates with commuting purposes, 07:00-10:00 and 16:00-20:00. Secondly, for the authority: When does the authority need more bikes to keep the customers satisfied? Did an increase in temperature or a weekday result in more trips? These hypotheses are set out in detail in the methodology part of the research.

### 3 Data Set

Since there is no public dataset available for exploring the research questions for particular stations, a new dataset was created by combining complementary sources of information: weather, calendar information, the daily demand of bikes, and the capacity/availability of bikes. The timespan for the research runs from the 22nd of January up until the 31st of December 2016. The weather variables include the temperature (°C), the dew point (°C), the humidity (%), the wind speed (mph), the pressure (Hg), and the precipitation (in). This data is from Weather Underground (url: https://www.wunderground.com/history/monthly/us/ny/new-york-city/KLGA), which shows a summary of these variables for the whole selected month in New York City. The calendar information is obtained for 2016 and, together with the weather information, defined as background knowledge.

We focused on the following three stations: ’Central Park S & 6 Ave’, ’E 47 St & Park Ave’, and ’St James Pl & Pearl St’. The total number of docks is 49 for ’Central Park S & 6 Ave’, 55 for ’E 47 St & Park Ave’, and 27 for ’St James Pl & Pearl St’. A figure in Appendix B shows where the stations are located on the map. First, the ’Central Park S & 6 Ave’ station is a large station close to Central Park and can be used to ride through Central Park or to move further into the city. This is different from the second station, ’E 47 St & Park Ave’. This station is also fairly large and close to Grand Central Terminal in Manhattan, and is therefore often used during weekdays to go to work or to get home from work. Lastly, there is a smaller station near the Brooklyn Bridge: ’St James Pl & Pearl St’. This station is used less often but might reveal differences when the weather and seasonal information is compared with the trips from the station and the bikes used at the station. The study of Wang and Chen (2020) examined station-wise trips, but it focused on the effect of trip attractions and not on how many trips were made each day. Therefore no research has been done on these specific stations in particular.

The variables of interest were the number of bikes used during each whole day, between 07:00 and 10:00, and between 16:00 and 20:00. For these timeframes the number of unique bikes that were used was calculated for each day. Between 07:00 and 10:00 most people in the city start their day, so it is interesting to take a closer look at how many bikes are used then to go to work or to other locations of interest. Between 16:00 and 20:00 most people end their workday and go home or start a trip to go somewhere in the evening. These time frames were the same as the ones Wang and Chen (2020) made use of. The starting values for the number of bikes at a certain station were obtained from NYC Open Data. This data lists all the stations in New York on the 22nd of January and gives the total number of docks, the number of bikes available that day, and the number of free docks in which to place a bike.

The CitiBike trips made in New York for a certain month are available on the CitiBike New York website (url: https://www.citibikenyc.com/system-data). The data has information about individual trip history. In this dataset, the starting point and endpoint of a bikeshare trip are shown, the trip duration, the bike-id, and the gender of the individual using the bike.

The trips starting and ending were listed as the variables TripsStart_i and TripsEnd_i, where i stands for one of the 3 stations. Then, for each day starting from 22 January 2016, the number of bikes at each station was calculated. For the three stations, the number of bikes used between 07:00 and 10:00, between 16:00 and 20:00, and during the whole day was calculated for trips starting at that station. With this data available it was possible to check the differences in bike usage in the given time frames at the three stations. The number of unique bikes each day and within the time frames was calculated from the same dataset as the trip data, the CitiBike trips from their website. In this dataset, the bike id is given for each trip, and the number of unique values of this bike id variable was calculated for the given time frames. Table 1 and Table 2 give a summary of the descriptive variables.

| Variable | Temperature | Dew Point | Humidity | Wind Speed | Pressure | Precipitation |
| --- | --- | --- | --- | --- | --- | --- |
| Mean | 16.327 | 7.646 | 59.707 | 16.808 | 986.197 | 2.592 |
| Spring (1) mean | 15.120 | 5.874 | 58.012 | 16.610 | 1013.467 | 2.166 |
| Summer (2) mean | 25.970 | 17.624 | 61.950 | 15.165 | 1012.130 | 2.234 |
| Fall (3) mean | 12.103 | 4.119 | 60.699 | 18.030 | 1017.906 | 3.703 |
| Winter (4) mean | 10.243 | 0.944 | 57.613 | 17.734 | 871.084 | 2.199 |

Table 1: Descriptive variables for Weather.

Note: In Tables 1 and 2, the first 21 days of January are missing from the data set, and the missing values described below are not included when calculating the mean.

| Mean | Trips Starting | Trips Ending | #Bikes | #Bikes 7-10 | #Bikes 16-20 |
| --- | --- | --- | --- | --- | --- |
| ’Central Park S & 6 Ave’ | 209.991 | 198.336 | 166.124 | 17.724 | 60.896 |
| ’E 47 St & Park Ave’ | 207.467 | 187.064 | 199.420 | 46.206 | 104.081 |
| ’St James Pl & Pearl St’ | 30.670 | 30.910 | 29.675 | 6.365 | 9.536 |

Table 2: Descriptive variables for stations.
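The counting of unique bikes per day and per time frame described in this section can be illustrated with a minimal sketch. The trip records, the field layout, and the function name `unique_bikes` are hypothetical, and the sketch assumes that a time frame such as 07:00-10:00 includes the start hour and excludes the end hour; the thesis does not state how the boundaries were treated.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trip records: (start time, start station, bike id).
TRIPS = [
    ("2016-01-22 07:15:00", "E 47 St & Park Ave", 101),
    ("2016-01-22 08:40:00", "E 47 St & Park Ave", 101),
    ("2016-01-22 09:05:00", "E 47 St & Park Ave", 202),
    ("2016-01-22 17:30:00", "E 47 St & Park Ave", 303),
    ("2016-01-22 12:00:00", "E 47 St & Park Ave", 404),
    ("2016-01-23 07:45:00", "E 47 St & Park Ave", 101),
]


def unique_bikes(trips, station, start_hour=0, end_hour=24):
    """Count distinct bike ids per day for trips starting at `station`
    whose start hour lies in [start_hour, end_hour) (assumed boundaries)."""
    bikes_per_day = defaultdict(set)
    for started_at, start_station, bike_id in trips:
        ts = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S")
        if start_station == station and start_hour <= ts.hour < end_hour:
            bikes_per_day[ts.date().isoformat()].add(bike_id)
    return {day: len(ids) for day, ids in sorted(bikes_per_day.items())}


station = "E 47 St & Park Ave"
whole_day = unique_bikes(TRIPS, station)        # all trips of the day
morning = unique_bikes(TRIPS, station, 7, 10)   # 07:00-10:00
evening = unique_bikes(TRIPS, station, 16, 20)  # 16:00-20:00
```

Using a set per day means that a bike rented twice in the same window is counted once, which matches the "number of unique values of the bike id" described in the text.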

Figure 1: graph comparing the temperature and dew point with the trips starting

From figure 1 it can be seen that Temperature and Dew Point in Celsius follow roughly the same pattern, which indicates a correlation between Temperature and Dew Point. The trips starting/ending at the stations follow this same pattern, which indicates a positive relation between trips starting/ending and Temperature.

Some missing data needed to be handled. On the 24th, 25th, and 26th of January, there were no bikeshare trips listed for CitiBike. This may be due to an error in the CitiBike data. Because there are 0 trips and 0 bikes used on those days, the data does not represent a true relation between the trips or bikes used and the weather or seasonal information, hence these days were not taken into account. The 27th of January had some values equal to 0 and some very low values, so this data was probably corrupt and was also not added into the model. The same goes for the 24th until the 31st of August: the number of trips and bikes was 0 on all these days, hence this data was also removed from the model. Lastly, for the station ’Central Park S & 6 Ave’ there was some missing data from the 21st of November until the 24th of November, so these days were removed for this station only. For all these data points, the number of trips and unique bikes were removed because, if they remained as 0, they would give the model a wrong number.

For the calendar information, another explanatory variable that is interesting to take a closer look at is the holidays. The variable holiday was introduced as a dummy variable into the model. It is equal to 1 if the day is part of a holiday and 0 otherwise. The season might show differences in the structure of breaks and the use of bikeshare, and therefore seasonal variables were added into the data. These were coded from 1 up until 4, where 1 stands for spring, 2 for summer, 3 for fall, and 4 for winter. Lastly, it might be interesting to compare the trips or bikes used in the weekend, and on a certain day of the week, with the rest of the data. So as the last two variables the weekend and the day were added into the data frame. The weekend variable is equal to 1 if it is Saturday or Sunday and 0 otherwise. The days were coded from 1 up till 7, where the 1 stands for a Sunday and the 7 for a Monday. The monthly weather reports of the 12 months of 2016 were then added into the data frame, plus the number of bikes for the whole day, between 7 and 10, and between 16 and 20. Then the trip data for each station together with the holiday and seasonal information was added. These components form a new data frame that is used to introduce and explain the model. Tables 3-5 summarize the calendar information station-wise.

| Mean | Trips Starting | Trips Ending | #Bikes | #Bikes 07:00-10:00 | #Bikes 16:00-20:00 |
| --- | --- | --- | --- | --- | --- |
| Weekdays | 187.504 | 180.796 | 150.768 | 17.614 | 59.248 |
| Weekend | 265.869 | 241.919 | 204.283 | 18.000 | 64.990 |
| Holiday | 251.520 | 241.031 | 196.474 | 21.000 | 70.608 |
| Spring (1) | 225.097 | 212.892 | 173.580 | 16.323 | 68.215 |
| Summer (2) | 277.138 | 262.032 | 219.223 | 24.298 | 76.117 |
| Fall (3) | 214.944 | 200.500 | 172.911 | 18.088 | 61.156 |
| Winter (4) | 89.956 | 87.515 | 73.544 | 10.073 | 29.500 |

Table 3: Calendar information for ’Central Park S & 6 Ave’.

| Mean | Trips Starting | Trips Ending | #Bikes | #Bikes 07:00-10:00 | #Bikes 16:00-20:00 |
| --- | --- | --- | --- | --- | --- |
| Weekdays | 274.561 | 246.411 | 263.963 | 58.264 | 141.467 |
| Weekend | 40.747 | 39.596 | 39.040 | 16.242 | 11.181 |
| Holiday | 251.958 | 208.639 | 241.938 | 52.443 | 134.412 |
| Spring (1) | 196.172 | 180.559 | 188.172 | 45.581 | 94.376 |
| Summer (2) | 258.117 | 211.149 | 247.531 | 50.160 | 137.074 |
| Fall (3) | 240.022 | 227.777 | 232.133 | 52.933 | 125.166 |
| Winter (4) | 109.809 | 108.779 | 105.000 | 32.691 | 43.838 |

Table 4: Calendar information for ’E 47 St & Park Ave’.

| Mean | Trips Starting | Trips Ending | #Bikes | #Bikes 07:00-10:00 | #Bikes 16:00-20:00 |
| --- | --- | --- | --- | --- | --- |
| Weekdays | 32.069 | 31.256 | 31.008 | 7.638 | 10.345 |
| Weekend | 27.192 | 30.051 | 26.366 | 3.202 | 7.525 |
| Holiday | 36.072 | 36.340 | 34.886 | 8.412 | 10.876 |
| Spring (1) | 32.602 | 32.280 | 31.580 | 6.978 | 10.828 |
| Summer (2) | 38.915 | 38.798 | 37.649 | 7.798 | 12.021 |
| Fall (3) | 30.322 | 31.144 | 29.267 | 6.267 | 9.077 |
| Winter (4) | 17.088 | 17.824 | 16.588 | 3.676 | 4.941 |

Table 5: Calendar information for ’St James Pl & Pearl St’.

Note: In Tables 1 and 3-5, the first 21 days of January are missing from the data set for the winter (4) mean, hence this mean does not represent the whole winter of 2016.
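The calendar coding used in this section (holiday dummy, season code, weekend dummy) can be sketched as follows. The holiday list and the helper names are hypothetical, and the sketch assumes meteorological seasons (March-May, June-August, September-November, December-February) because the exact season boundaries are not stated in the text. The day-of-week code is left out for the same reason.

```python
from datetime import date

# Hypothetical holiday list for illustration only; the thesis does not
# list which days were treated as holidays.
HOLIDAYS = {date(2016, 7, 4), date(2016, 12, 25)}


def season_code(d):
    """Return 1 = spring, 2 = summer, 3 = fall, 4 = winter.
    Assumes meteorological seasons; the thesis may use other boundaries."""
    if 3 <= d.month <= 5:
        return 1
    if 6 <= d.month <= 8:
        return 2
    if 9 <= d.month <= 11:
        return 3
    return 4


def calendar_features(d):
    """Build the dummy and categorical calendar variables for one day."""
    return {
        "holiday": 1 if d in HOLIDAYS else 0,
        "season": season_code(d),
        # date.weekday(): Monday = 0 ... Sunday = 6
        "weekend": 1 if d.weekday() >= 5 else 0,
    }


saturday_in_winter = calendar_features(date(2016, 1, 23))
independence_day = calendar_features(date(2016, 7, 4))
```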

### 4 Methodology

In this section, the methods used to perform the research are discussed. The experimental testing was set up in the following way: the data was split into two groups, 70% training data and 30% test data, because this provided the best accuracy for the model. The data was split into two groups to obtain a realistic evaluation of the model; otherwise the evaluation would not reveal whether the linear regression is over-fitted. From a user’s point of view, 3 predictions were made for the estimation of the number of bikes. The target Y was the number of bikes used during the whole day, during the timeframe 07:00-10:00, and during the timeframe 16:00-20:00, for all the stations. For all these predictions the 3 models described below were employed and compared using the evaluation criteria. As described later in this section, the chance of renting a bike for the customers can then be defined. From an authoritative point of view, 2 predictions were performed. Here Y was the number of trips going out of and the number of trips going into the station at the three different stations. For this point of view, the first two regressions were compared: OLS and WLS.

The predicted difference between the trips going out and in at the stations becomes visible for the authority when predicting the number of trips starting/ending. The predictions were made as real numbers rather than integers, since the regressions produce real-valued output. These real numbers can be rounded without losing too much precision.
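The 70%/30% split described above can be sketched as below. Whether the split in the thesis was random or chronological is not stated, so the random shuffle, the fixed seed, and the function name `train_test_split` are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` and cut it into a training and a test part.
    A random split is an assumption; a chronological split would simply
    skip the shuffle."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


days = list(range(1, 101))  # stand-in for 100 daily observations
train, test = train_test_split(days)
```

Fitting on `train` and scoring on `test` keeps the evaluation separate from the data used for estimation, which is the purpose of the split described in the text.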

Method

To predict the daily demand of bikes for both points of view we employed three different regression models for the users and two different regression models for the authority. The first model employed was an Ordinary Least Squares (OLS) regression. The mathematical equations for the OLS regression can be written as follows:

$$Y_{j,k} = \alpha + \beta_i X_{j,k} + \varepsilon_i \tag{1}$$

$$X_{j,k} = (x_{j1,k}, x_{j2,k}, \ldots, x_{jn,k}) \tag{2}$$

Here j = {1, 2, 3}, where 1 stands for the station ’Central Park S & 6 Ave’, 2 stands for the station ’E 47 St & Park Ave’, and 3 stands for the station ’St James Pl & Pearl St’. The index k stands for the day in the training data, and X is the set of explanatory variables used to estimate the number Y. The OLS regression results in interpretations of the required estimators for our research question.

The second model employed was a Weighted Least Squares (WLS) regression. When the model was tested using the OLS regression, the Breusch-Pagan test at a 5% significance level indicated heteroscedasticity in the model for the stations when predicting the number of bikes. This means that the OLS estimators are no longer the most efficient estimators, leading to an inefficient model; using WLS solves this problem. The weights v_i follow from the OLS regression in (1). The transformed model can then be written as follows:

$$\frac{Y_{j,k}}{\sqrt{v_i}} = \frac{\alpha}{\sqrt{v_i}} + \beta_i \frac{X_{j,k}}{\sqrt{v_i}} + \frac{\varepsilon_i}{\sqrt{v_i}} \tag{3}$$

The estimators were obtained by minimizing the following criterion:

$$S(\beta) = \sum_{i=1}^{k} \frac{(y_i - x_i'\beta)^2}{v_i} \tag{4}$$

Observations with a smaller variance were assigned a larger weight in determining the estimates. Because there is less uncertainty around observations with a smaller variance, these observations are more important for the estimation.
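The weighting idea can be illustrated with a minimal sketch for a single regressor, which minimizes the weighted sum of squared residuals with weights 1/v. The data, the variance values `v`, and the function name `wls_fit` are made up for illustration; the thesis derives its weights from the OLS regression in (1), and that construction is not reproduced here.

```python
def wls_fit(x, y, v):
    """Minimise sum_k (y_k - a - b * x_k)^2 / v_k for one regressor.
    Returns the intercept a and the slope b."""
    w = [1.0 / vk for vk in v]  # smaller variance -> larger weight
    sw = sum(w)
    x_bar = sum(wk * xk for wk, xk in zip(w, x)) / sw  # weighted means
    y_bar = sum(wk * yk for wk, yk in zip(w, y)) / sw
    sxy = sum(wk * (xk - x_bar) * (yk - y_bar) for wk, xk, yk in zip(w, x, y))
    sxx = sum(wk * (xk - x_bar) ** 2 for wk, xk in zip(w, x))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 17.0]  # last point is far off the line y = 1 + 2x

# Equal variances reproduce the OLS fit.
a_ols, b_ols = wls_fit(x, y, [1.0, 1.0, 1.0, 1.0])

# A very large variance for the last observation nearly removes its influence.
a_wls, b_wls = wls_fit(x, y, [1.0, 1.0, 1.0, 1e6])
```

With equal variances the last observation pulls the slope up to 5, while the weighted fit stays close to the line through the first three observations.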

Based on the research question from a user’s point of view and the characteristics of the training dataset, a ridge regression was employed for the estimation of the number of bikes due to density. The OLS regression minimizes the residual sum of squares (RSS):

$$\text{RSS} = \sum_{m=1}^{k} \Big( y_{j,m} - \alpha - \sum_{i=1}^{n} \beta_i x_{ji,m} \Big)^2 \tag{5}$$

For the ridge regression there is a slightly different minimizing problem. There is a penalty being added into the minimizing of the sum squared residuals. This can be written in the following way:

$$\text{RSS} = \sum_{m=1}^{k} \Big( y_{j,m} - \alpha - \sum_{i=1}^{n} \beta_i x_{ji,m} \Big)^2 + \lambda \sum_{i=1}^{n} \beta_i^2 \tag{6}$$

Here λ ≥ 0 is the tuning parameter that controls the shrinkage penalty. When λ is equal to 0 the penalty term has no effect and OLS is performed. The indices i, j, k represent the same values as in the OLS regression. The ridge regression was performed using the model of (1) and (2), but with this different minimization problem. The Y variables were the same as described for the OLS/WLS regression. For the ridge regression an optimal λ needs to be chosen, and this was done by cross-validation, again with 70% training and 30% test data. The ridge regression was fitted on the training data while cross-validating for λ. The value of λ was then chosen such that it minimizes the MSE, with the coefficients obtained by minimizing formula (6).

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - f(x_i))^2 \tag{7}$$
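The choice of λ can be sketched for a single regressor, where the ridge solution has a simple closed form. This is a simplified illustration under stated assumptions: it uses one hold-out validation set instead of the cross-validation used in the thesis, the intercept is left unpenalized, and the data, the λ grid, and the function names are hypothetical.

```python
def ridge_fit(x, y, lam):
    """Minimise sum_m (y_m - a - b * x_m)^2 + lam * b^2 for one regressor.
    The intercept a is not penalised."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xm - x_bar) * (ym - y_bar) for xm, ym in zip(x, y))
    sxx = sum((xm - x_bar) ** 2 for xm in x)
    b = sxy / (sxx + lam)  # lam = 0 gives the OLS slope
    a = y_bar - b * x_bar
    return a, b


def mse(x, y, a, b):
    """Mean squared error of the fitted line on the given data."""
    return sum((ym - (a + b * xm)) ** 2 for xm, ym in zip(x, y)) / len(x)


def choose_lambda(train, valid, grid):
    """Pick the lambda from `grid` with the lowest validation MSE."""
    x_tr, y_tr = train
    x_va, y_va = valid
    return min(grid, key=lambda lam: mse(x_va, y_va, *ridge_fit(x_tr, y_tr, lam)))


train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
valid = ([4.0, 5.0], [9.0, 11.0])
best_lam = choose_lambda(train, valid, [0.0, 1.0, 10.0])
```

In this toy data the relation is exactly linear, so no shrinkage is needed and the smallest λ in the grid is selected; with noisy or collinear regressors a positive λ can give a lower validation MSE.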

Evaluation Criteria

From figure 2 and table 6 it can be seen that Temperature and Dew Point are highly correlated, which leads to multicollinearity if both variables are used in the model. Hence only Temperature is added as an explanatory variable. Another variable that caused multicollinearity when added was the Pressure. Hence Pressure is also left out of the model. The models for both points of view were then tested for serial correlation using the Breusch-Godfrey test of order 1 at a 5% significance level. This indicated serial correlation in some of the models, hence a lag variable was used. From a user’s point of view this is a variable that indicates the number of bikes used the day before, throughout the whole day or within the given timeframe depending on the selected target variable. By predicting the number of bikes for the following day using the value from the previous day, the serial correlation in the model was avoided. This lag variable was used in all the models for the stations when predicting the number of bikes. From an authoritative point of view, these lag variables were the number of trips starting/ending on the previous day. For the first two stations, the lag variable was added into the models. For the last station, ’St James Pl & Pearl St’, this lag variable was left out due to insignificance of the estimator.

Figure 2: Correlations diagram

| | Temperature | DewPoint |
| --- | --- | --- |
| Temperature | 1.000000 | 0.921752 |
| DewPoint | 0.921752 | 1.000000 |

Table 6: Correlations table
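The two preparation steps described above, checking the correlation between two weather variables and building a one-day lag variable, can be sketched as follows. The numbers and function names are hypothetical, and the lag construction assumes the observations are consecutive days; days removed as missing would need separate handling.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def add_lag(series):
    """Pair each day's value with the previous day's value.
    The first day has no predecessor and is dropped."""
    return [(series[t - 1], series[t]) for t in range(1, len(series))]


temperature = [1.0, 2.0, 3.0, 4.0]   # hypothetical daily values
dew_point = [2.0, 4.0, 6.0, 8.0]     # perfectly collinear toy example
corr = pearson(temperature, dew_point)

bikes = [10, 12, 9]                  # hypothetical daily bike counts
lagged = add_lag(bikes)              # (previous day, current day) pairs
```

A correlation close to 1, as in Table 6, signals that one of the two variables carries little additional information, which is the reason for keeping only Temperature.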

To compare the 3 models described above there are 3 main evaluation measures. To determine the best modeling technique for our problem, the estimation of bikes used throughout the whole day was used, since the timeframes are a proportion of these values. From a user’s point of view, using the predictions as mentioned in the model, the evaluation measures are summarized in Table 7. The first measure was the R-squared of the model, which describes the proportion of the variability in the dependent variable that can be explained by the model. This was calculated in the following way:

R^{2}= 1 −RSS

T SS= 1 −∑^{k}_{i=1}(y_{i,k}− ˆy_{i,k})^{2}

∑^{k}_{i=1}(yi,k− ¯yi,k)^{2} (8)

We preferred the model with the highest R-squared, see Table 7. The R-squared improved when using a WLS regression, indicating that this model explains more of the variance than OLS. Compared with OLS, the residual standard error decreased strongly, which indicated that the predictions are much closer to the actual values. For the ridge regression, the residual standard error was not reported, because it was only used for the comparison of OLS versus WLS, which showed that WLS predicted better than OLS when dealing with heteroscedasticity. The goal of the prediction was to be as close as possible to the actual values, hence we also looked for the model with the lowest possible MSE/MAE. Compared with the Ridge regression, WLS has a higher R-squared for all the stations.

The second measure was the Mean Square Error (MSE), which measures the squared deviation of the predicted results from the actual numbers (formula (7)). The MSE increased slightly when using WLS compared with OLS, but this increase is so small that it was neglected in the choice of our prediction model. The Ridge regression has a higher MSE than WLS (see Table 7).

The last measure is the Mean Absolute Error (MAE). This is a similar measure to the MSE but gives smaller penalties to larger prediction errors. The MAE is calculated in the same way, except that the square used in formula (7) is replaced by the absolute value. The MAE also increased slightly when using WLS, but this is again a very small increase. When using a ridge regression the MAE increased compared with the WLS regression. Overall the ridge regression predicted worse than the other two methods. The model employed from a user’s point of view when estimating the number of bikes was therefore the WLS regression model.
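The three evaluation measures can be sketched as follows, assuming the standard definitions (R-squared as in formula (8), MSE as the mean of the squared errors, MAE as the mean of the absolute errors). The function names and the example numbers are illustrative assumptions and do not come from the thesis data.

```python
def r_squared(actual, predicted):
    """1 - RSS/TSS: share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    tss = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - rss / tss


def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean of the absolute prediction errors."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


# Hypothetical daily bike counts and predictions (illustrative values only).
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 230]

print(r_squared(actual, predicted), mse(actual, predicted), mae(actual, predicted))
```

Because the MSE squares each error, the single error of 20 in this example contributes four times as much as each error of 10, whereas in the MAE it contributes only twice as much.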

| Regression | Station | R-Squared | MSE | MAE | Residual standard error |
| --- | --- | --- | --- | --- | --- |
| OLS | ’Central Park S & 6 Ave’ | 0.6826 | 3127.86 | 41.70 | 57.3 |
| OLS | ’E 47 St & Park Ave’ | 0.7772 | 4728.14 | 53.32 | 70.45 |
| OLS | ’St James Pl & Pearl St’ | 0.5968 | 59.6989 | 6.084 | 7.916 |
| WLS | ’Central Park S & 6 Ave’ | 0.7023 | 3154.93 | 41.99 | 1.353 |
| WLS | ’E 47 St & Park Ave’ | 0.7825 | 4744.98 | 53.43 | 1.313 |
| WLS | ’St James Pl & Pearl St’ | 0.6444 | 59.8429 | 6.083 | 1.282 |
| Ridge | ’Central Park S & 6 Ave’ | 0.6796 | 3156.99 | 41.92 | - |
| Ridge | ’E 47 St & Park Ave’ | 0.7735 | 4807.74 | 53.87 | - |
| Ridge | ’St James Pl & Pearl St’ | 0.5939 | 60.1421 | 6.125 | - |

Table 7: Evaluation Criteria output

From an authoritative point of view, when the number of trips starting/ending was predicted, the following two regressions were compared: OLS and WLS. These were performed using the method discussed. First OLS was performed and then tested for heteroscedasticity using the Breusch-Pagan test at a 5% significance level. When this test indicated heteroscedasticity, a WLS regression was used because it improved the R-Squared and Residual standard error. When this was not the case OLS was used, because WLS did not improve the R-Squared or MSE/MAE for these predictions of trips. The models used for each individual station predicting the trips starting/ending are in the results section. From an authoritative point of view the importance of the estimators was also assessed. This was done with a significance analysis based on a forward step-wise regression method, which allowed us to see the most significant estimators when predicting the number of trips starting/ending.
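To illustrate the difference between OLS and WLS, the sketch below fits a weighted least squares line with a single regressor in closed form. It is a simplification: the models in this paper contain many regressors, and the weights used here are hypothetical, since the weighting scheme is defined in the methodology and not repeated in this section. Setting all weights to 1 reduces the fit to OLS.

```python
def wls_fit(x, y, w):
    """Weighted least squares for y = a + b * x with observation weights w.

    Minimizes sum(w_i * (y_i - a - b * x_i) ** 2) and returns (a, b).
    """
    total_weight = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_weight  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_weight  # weighted mean of y
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: temperature (x), trips (y), and weights (w).
# The weights are an assumption, e.g. the inverse of an estimated error variance.
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 5.0, 8.0, 11.0]
w = [1.0, 2.0, 3.0, 4.0]

print(wls_fit(x, y, w))               # weighted fit
print(wls_fit(x, y, [1.0] * len(x)))  # equal weights, i.e. OLS
```

Observations with a larger weight pull the fitted line more strongly, which is why down-weighting high-variance days can improve the fit under heteroscedasticity.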

Hypothesis

To answer the research questions for both points of view, multiple hypotheses are taken into account. These were tested using the selected model and the dataset described above. First, from the user’s point of view, there were several important cases. This is introduced by asking: When do the users have a higher chance of renting a bike? This can be written as the following hypothesis, with different variables to take a closer look:

H_{0a}: An increase in Variable_{i} lowers your chance of renting a bike the following day vs. H_{1a}: An increase in Variable_{i} improves your chance of renting a bike the following day.

Where i = {1, 2, 3}: 1 stands for the Temperature (°C), 2 stands for the Weekday (Monday until Friday), and 3 stands for the time frame (07:00-10:00 or 16:00-20:00). On weekdays, bikeshare users might be commuters who only use bikeshare to get to their workplace. Therefore a closer look is taken at the difference between weekdays and weekends. Finally, for the users it is important to know which timeframe is the best to rent a bike and whether one time frame is busier than the other. The time-frame hypothesis checks which of the time frames gives the user a higher chance of renting a bike, so that the user can take this into account when planning a ride. This is checked with the rest of the day as the reference group, using the unique number of bikes being used.

Next, a closer look is taken at the authority point of view. Here it is important to know whether there are enough bikes at the stations and to check the difference between the trips going out of the station and the trips going into the station. Because the CitiBike program moves bikes itself, it could then know, for each of the selected variables, whether it should move bikes or not. The hypothesis can be raised in question form: When does the authority need more bikes to keep the customer satisfied? This is written as the following hypothesis, where several variables should be taken into account:

H_{0b}: An increase in Variable_{i} results in more trips being made vs. H_{1b}: An increase in Variable_{i} results in fewer trips being made.

Where i = {1, 2}: again 1 stands for the Temperature (°C) and 2 stands for the Weekday (Monday until Friday). This is examined for the three stations and showed interesting behavior for the trips being made. For the weekdays and weekends, the authority needs to see whether it should have more employees or volunteers to move bikes to busy stations during weekdays compared with weekends.

In the research question for the user’s point of view, the chance of renting a bike (∆_{j,t}) was introduced. Because there are different time frames, a closer look was taken at the number of bikes used for the whole day, and these numbers were compared with the number of bikes used within 07:00-10:00 and 16:00-20:00. The chance of renting a bike is defined for the two separate timeframes. It was calculated by dividing the number of bikes used within the timeframe by the number of bikes used for the whole day, which gives the percentage of the bikes used within one of the two timeframes. Then 1 minus this percentage is the share of that day’s bikes that is not used in the timeframe and that citizens can still rent, hence a chance of renting a bike within that timeframe, relative to the whole day. In words, this is 1 minus the number of bikes used in a given timeframe (B_{j,t,k}) divided by the total number of bikes (TotB_{j,k}) used that day. This is given by the following formula:

$$\Delta_{j,t} = 1 - \frac{\#B_{j,t,k}}{\#TotB_{j,k}} \qquad (9)$$

where j = {1, 2, 3}: 1 stands for the station ’Central Park S & 6 Ave’, 2 stands for the station ’E 47 St & Park Ave’, and 3 stands for the station ’St James Pl & Pearl St’. t = {1, 2}, where 1 stands for the time frame 07:00-10:00 and 2 for the time frame 16:00-20:00. Finally, k stands for the given day.
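Formula (9) can be sketched as a small function. The function name `chance_of_renting` and the example counts are hypothetical and only serve as illustration.

```python
def chance_of_renting(bikes_in_timeframe, total_bikes_day):
    """Chance as in formula (9): 1 minus the share of the day's bikes used in the timeframe."""
    if total_bikes_day <= 0:
        raise ValueError("total_bikes_day must be positive")
    return 1 - bikes_in_timeframe / total_bikes_day


# Hypothetical day: 30 of the 120 bikes used that day were used between 07:00 and 10:00.
print(chance_of_renting(30, 120))
```

A timeframe that accounts for a larger share of the day's usage therefore receives a lower chance.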

### 5 Results

In this section, the model described in the methodology (section 4) was estimated using the dataset described in section 3. The predicted numbers of bikes, trips starting, and trips ending were calculated as real numbers, to show how the predicted chance and the predicted deficit in bikes change due to the estimator effects. Conclusions were then drawn from these effects.

Results Analysis Users Point of View

The resulting outputs of the prediction of bikes being used from a user point of view are given in Tables 8-10 (one table for each station). The standard errors of the estimators are given in parentheses. The target variable is given in the header row: the number of bikes for the whole day, for 07:00-10:00, or for 16:00-20:00. A * represents significance of the estimator at a 5% level and ** represents significance at a 10% level. To answer the research question from a user’s point of view, the results are discussed in a later section together with the hypotheses being tested.

| Variable | WLS, Y = #Bikes | WLS, Y = #Bikes 07:00-10:00 | WLS, Y = #Bikes 16:00-20:00 |
| --- | --- | --- | --- |
| Intercept | 130.779* (22.946) | 15.542* (3.502) | 54.003* (10.556) |
| Temperature | 5.653* (0.774) | 0.705* (0.111) | 1.732* (0.327) |
| Humidity | -1.409* (0.267) | -0.179* (0.041) | -0.448* (0.120) |
| Wind Speed | -1.447* (0.625) | -0.044** (0.097) | -0.455** (0.283) |
| Precipitation | 0.472 (0.516) | 0.032 (0.077) | 0.152 (0.229) |
| Holiday | -6.451 (9.776) | -1.408 (1.501) | -2.329 (4.394) |
| Weekend | 52.114* (8.553) | 0.287** (1.244) | 6.737** (3.664) |
| Summer (2) | 6.280 (13.714) | 2.785 (2.125) | 0.546 (6.079) |
| Fall (3) | 19.712** (10.294) | 2.229 (1.562) | 1.693 (4.615) |
| Winter (4) | -3.155 (12.076) | 0.238 (1.837) | -7.889 (5.533) |
| Lag(Y) | 0.269* (0.056) | 0.138* (0.057) | 0.274* (0.0608) |

Table 8: prediction output for # Bikes for ’Central Park S & 6 Ave’.

| Variable | WLS, Y = #Bikes | WLS, Y = #Bikes 07:00-10:00 | WLS, Y = #Bikes 16:00-20:00 |
| --- | --- | --- | --- |
| Intercept | 259.141* (30.279) | 65.053* (10.752) | 152.899* (20.965) |
| Temperature | 5.705* (0.876) | 0.799* (0.313) | 2.551* (0.590) |
| Humidity | -1.764* (0.347) | -0.455* (0.125) | -1.006* (0.239) |
| Wind Speed | -0.845* (0.829) | -0.179 (0.297) | -0.776 (0.574) |
| Precipitation | -0.193 (0.669) | 0.581* (0.224) | 0.119 (0.469) |
| Holiday | -1.012 (12.547) | 3.961 (4.278) | 4.462 (8.891) |
| Weekend | -214.179* (10.272) | -39.308* (4.294) | -116.539* (6.971) |
| Summer (2) | 6.451 (16.535) | -1.738 (5.604) | 4.526 (11.707) |
| Fall (3) | 49.756* (13.286) | 7.085 (4.534) | 22.716* (9.301) |
| Winter (4) | -17.402 (16.042) | -6.118 (5.933) | -18.997** (10.775) |
| Lag(Y) | 0.151* (0.037) | 0.192* (0.054) | 0.146* (0.044) |

Table 9: prediction output for # Bikes for ’E 47 St & Park Ave’.

The estimated Temperature coefficient was statistically significant in the WLS regressions for all the timeframes and stations. The Temperature estimators all had a positive effect on the number of predicted bikes, supporting the hypothesis that Temperature has a positive effect on the number of bikes. This can be interpreted as follows (see Table 8): if the temperature increases by 1 °C, the number of bikes predicted during the whole day increases by 5.653, ceteris paribus. For the other tables, this can be interpreted in the same way. The Humidity and Wind Speed both showed a negative effect on the number of bikes predicted. However, the Wind Speed estimator is only significant for the ’Central Park S & 6 Ave’ station, at a 5% or 10% level depending on the timeframe. This meant that for the other two stations the Wind Speed did not affect the number of bikes predicted significantly. When the Humidity increased by 1% the number of bikes predicted during the day decreased by 1.409, ceteris paribus (see Table 8). If the Wind Speed increased by 1 mph the number of bikes predicted during the day decreased by 1.447, ceteris paribus (see Table 8). For the other two stations, this was interpreted in the same way. The Precipitation is mostly not significant and also showed an unexpected effect: a positive effect on the predicted number of bikes.
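The ceteris paribus interpretation can be made concrete with a short sketch that uses the whole-day WLS coefficients from Table 8 as a linear predictor. The coefficients are copied from Table 8; the function name `predict_bikes` and the values of the example day are hypothetical and not taken from the data.

```python
# Whole-day WLS coefficients copied from Table 8 ('Central Park S & 6 Ave').
TABLE8_WHOLE_DAY = {
    "Intercept": 130.779,
    "Temperature": 5.653,
    "Humidity": -1.409,
    "Wind Speed": -1.447,
    "Precipitation": 0.472,
    "Holiday": -6.451,
    "Weekend": 52.114,
    "Summer": 6.280,
    "Fall": 19.712,
    "Winter": -3.155,
    "Lag(Y)": 0.269,
}


def predict_bikes(coefs, day):
    """Linear prediction: intercept plus coefficient times value for each variable.

    Variables missing from `day` are treated as zero (e.g. Spring, the reference season).
    """
    return coefs["Intercept"] + sum(coefs[name] * value for name, value in day.items())


# Hypothetical input day (illustrative values, not from the thesis data).
example_day = {"Temperature": 20.0, "Humidity": 60.0, "Wind Speed": 5.0,
               "Weekend": 1, "Lag(Y)": 300.0}
warmer_day = dict(example_day, Temperature=21.0)  # same day, 1 °C warmer

effect = predict_bikes(TABLE8_WHOLE_DAY, warmer_day) - predict_bikes(TABLE8_WHOLE_DAY, example_day)
print(round(effect, 3))
```

Whatever values are chosen for the other variables, the difference between the two predictions equals the Temperature coefficient, which is exactly what the ceteris paribus reading states.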

| Variable | WLS, Y = #Bikes | WLS, Y = #Bikes 07:00-10:00 | WLS, Y = #Bikes 16:00-20:00 |
| --- | --- | --- | --- |
| Intercept | 33.286* (3.346) | 5.813* (1.128) | 10.844* (1.686) |
| Temperature | 0.754* (0.101) | 0.168* (0.024) | 0.244* (0.047) |
| Humidity | -0.204* (0.036) | -0.006 (0.011) | -0.054* (0.019) |
| Wind Speed | -0.169* (0.085) | 0.022 (0.017) | -0.055 (0.044) |
| Precipitation | 0.108 (0.074) | 0.017 (0.058) | 0.039 (0.038) |
| Holiday | -1.458 (1.357) | 1.837* (0.675) | -1.501* (0.681) |
| Weekend | -3.758* (1.074) | -3.457* (0.408) | -2.009* (0.544) |
| Summer (2) | 2.430 (1.910) | -3.767* (0.644) | 0.267 (1.003) |
| Fall (3) | 0.628 (1.413) | -1.110* (0.498) | -0.849 (0.746) |
| Winter (4) | -4.335* (1.654) | -2.669* (0.514) | -1.853* (0.865) |
| Lag(Y) | 0.068 (0.059) | 0.006 (0.051) | 0.081 (0.060) |

Table 10: prediction output for # Bikes for ’St James Pl & Pearl St’.

A second point for background knowledge was the calendar information. The season Spring (1) is used as the reference group, hence it was not included in the regression. The season dummies are in some cases insignificant but were kept in the model to avoid a structural break in the data. The Holiday group contained few observations and is hence often not significant. The effect of a Holiday can be either positive or negative on the prediction of bikes, depending on the timeframe and station. The Weekend estimator is significant for all 3 stations and timeframes at a 5% or 10% level. For ’Central Park S & 6 Ave’, when it is a Saturday or Sunday the number of bikes predicted during the day increases by 52.114, ceteris paribus (see Table 8). The Weekend thus had a positive effect on the number of bikes predicted for the station ’Central Park S & 6 Ave’, but for the other two stations it had a negative effect, also for the individual timeframes. The lag variable is significant for the first two stations and has a positive effect on the number of bikes predicted for the given timeframe.

Authoritative point of view

From the authority point of view, the predictions of the trips starting/ending are in Tables 11-12. The standard errors of the estimators are given in parentheses and the significance of the estimators is indicated by the * in the tables. The regressions were chosen based on the testing discussed in the methodology. This resulted in an OLS regression for the station ’Central Park S & 6 Ave’ and a WLS regression for the station ’E 47 St & Park Ave’. The models for the smaller station ’St James Pl & Pearl St’ did not include lag variables, because there was no serial correlation in the model and the lag variable was insignificant. For the trips starting at this station, an OLS regression was used and for the trips ending at this station, a WLS regression was used.

| Variable | OLS, Y = Trips Starting_i, ’Central Park S & 6 Ave’ | WLS, Y = Trips Starting_i, ’E 47 St & Park Ave’ | OLS, Y = Trips Starting_i, ’St James Pl & Pearl St’ |
| --- | --- | --- | --- |
| Intercept | 162.596* (33.053) | 287.192* (30.29) | 35.473* (3.535) |
| Temperature | 7.041* (1.055) | 5.482* (0.852) | 0.817* (0.100) |
| Humidity | -1.764* (0.380) | -2.064* (0.352) | -0.199* (0.041) |
| Wind Speed | -1.829* (0.905) | -1.026 (0.808) | -0.160 (0.097) |
| Precipitation | 0.426 (0.744) | -0.111 (0.703) | -0.079 (0.078) |
| Holiday | -11.197 (13.438) | -7.188* (12.501) | -2.076 (1.451) |
| Weekend | 89.093* (11.229) | -202.537* (10.419) | -4.666* (1.199) |
| Summer (2) | 5.192 (17.754) | 13.784 (18.456) | 3.285** (1.917) |
| Fall (3) | 12.728 (14.109) | 41.377* (13.946) | 1.116 (1.524) |
| Winter (4) | -13.280 (17.959) | -17.938 (15.764) | -5.405* (1.926) |
| Lag(Y_i) | 0.279* (0.055) | 0.154* (0.039) | - |

Table 11: prediction output for Trips Starting.

The Temperature had a statistically significant positive effect on the number of trips starting/ending at the three stations. The hypothesis that a Temperature increase results in more trips was supported. When the temperature increases by 1 °C the predicted number of trips starting at ’Central Park S & 6 Ave’ increases by 7.041, ceteris paribus (see Table 11). Humidity had a significant negative effect on the number of trips starting/ending at the three stations. This was interpreted as follows: when the Humidity increased by 1% the predicted number of trips starting at ’Central Park S & 6 Ave’ decreased by 1.764, ceteris paribus. The Wind Speed is only significant for the station ’Central Park S & 6 Ave’, where it is negatively related to the predicted number of trips starting/ending. The Precipitation estimator is not significant for any of the stations. The precipitation also had a positive effect on the number of trips starting and ending at some of the stations, which was unexpected.

| Variable | OLS, Y = Trips Ending_i, ’Central Park S & 6 Ave’ | WLS, Y = Trips Ending_i, ’E 47 St & Park Ave’ | WLS, Y = Trips Ending_i, ’St James Pl & Pearl St’ |
| --- | --- | --- | --- |
| Intercept | 155.569* (31.126) | 263.618* (26.091) | 32.243* (3.112) |
| Temperature | 6.692* (1.005) | 4.508* (0.726) | 0.882* (0.085) |
| Humidity | -1.664* (0.357) | -1.827* (0.301) | -0.191* (0.036) |
| Wind Speed | -1.643* (0.851) | -0.983 (0.693) | -0.126 (0.084) |
| Precipitation | 0.267 (0.700) | -0.170 (0.601) | -0.009 (0.072) |
| Holiday | -7.378 (12.635) | -14.441* (10.769) | -1.028* (1.297) |
| Weekend | 70.251* (10.498) | -183.583* (8.968) | -0.852* (1.028) |
| Summer (2) | -3.881 (16.686) | 1.554 (15.803) | 2.485 (1.922) |
| Fall (3) | 14.223 (13.274) | 43.454* (11.997) | 1.614 (1.429) |
| Winter (4) | -11.240 (16.864) | -11.009 (13.509) | -3.992* (1.643) |
| Lag(Y) | 0.297* (0.058) | 0.178* (0.037) | - |

Table 12: prediction output for Trips Ending.

The season spring (1) was used as the reference group, hence it is not included in the model. The seasonal dummies were added to avoid a structural break. Some seasonal estimators were significant. Summer and fall mostly had more predicted trips starting/ending at the stations than spring, whereas winter had fewer. The Holiday was not significant for the station ’Central Park S & 6 Ave’ but is for the other two. The Holiday lowered the predicted number of trips starting/ending. The Weekend estimator was significant for all the stations. For the station ’Central Park S & 6 Ave’ the Weekend had a positive effect on the number of trips starting/ending: on a weekend day the predicted number of trips starting increases by 89.093, ceteris paribus (see Table 11). This can be interpreted in the same way for the other stations. For the other two stations, the Weekend had a significant negative effect on the number of trips starting/ending. A thing to notice from Tables 11-12 is that the weekend estimator had a very large negative effect on the number of trips for the station ’E 47 St & Park Ave’, indicating that the station is used mostly during weekdays and not so much during weekends. The lag variable is significant for the two bigger stations and positively correlated with the number of trips starting/ending.

Using the significance analysis discussed in the methodology (see section 4), we sorted the parameters by their relevance for predicting the trips starting/ending. The analysis was produced for the trips starting and ending and for each of the stations separately. The estimators were sorted into three groups: most relevant (assigned 8-10* in most cases), relevant (4-7*), and not relevant (1-4*).

Most Relevant:

• Temperature

• Weekend

• Humidity

Relevant:

• Wind Speed

• Trips starting/ending previous day

Normal:

• Precipitation

• Holiday

• Fall(3)

• Winter(4)

• Summer(2)

When analyzing whether there are enough bikes the following day, the most relevant estimators were the Temperature, Weekend, and Humidity. These were most important when estimating the number of trips starting/ending. The Wind Speed and the lag variable were relevant estimators when used in the prediction.

Results Discussion Users Point of View

From a user point of view, we would like to know how the chance of renting a bike is affected by the parameters used in the estimated models. This chance is calculated as described in the methodology. This raised the hypothesis question: When do the users have a higher chance of renting a bike? To answer it, the chance for the given timeframes is first calculated using the intercepts of the regression models from the analysis of the results. Using the intercepts while setting the other variables to zero gives the baseline chance of renting a bike as described in the methodology. Then the estimator of the variable of interest is added to the intercept (or subtracted, when it is negative), keeping the other variables constant. This results in a higher or lower chance than previously calculated, which is summarized in Table 13, where 1 stands for ’Central Park S & 6 Ave’, 2 for ’E 47 St & Park Ave’, and 3 for ’St James Pl & Pearl St’. The effect of Wind Speed/Holiday on the chance is also calculated and included in the table, although for some of the stations it was not significant.

| Chance | ’1’ 07-10 | ’1’ 16-20 | ’2’ 07-10 | ’2’ 16-20 | ’3’ 07-10 | ’3’ 16-20 |
| --- | --- | --- | --- | --- | --- | --- |
| Temperature | Decrease | Increase | Increase | Increase | Decrease | Decrease |
| Humidity | Increase | Decrease | Increase | Decrease | Decrease | Increase |
| Wind Speed | Decrease | Decrease | Decrease | Increase | Decrease | Increase |
| Holiday | Increase | Decrease | Decrease | Decrease | Decrease | Increase |
| Weekend | Increase | Increase | Decrease | Decrease | Increase | Increase |

Table 13: effect on the chance of renting a bike.

A decrease in the chance means that, when one of the variables increases, the predicted number of bikes used within that timeframe rises relatively more (or falls relatively less) than the predicted number of bikes used for the whole day. The chance is expressed relative to the total number of bikes predicted throughout the day. Suppose that the Temperature increases by 1 °C for the 1st station: the number of bikes used between 07-10 increases relatively more than the number of bikes used throughout the whole day, hence the decrease in your chance of renting a bike (see Table 13). The share of bikes used between 07-10 is first 11.884% and after the Temperature change 11.908% (see Table 8), hence relatively more bikes are used between 07-10, which results in a slimmer chance of renting a bike during this timeframe. Because the Citibike program does move bikes we can make these calculations. The Temperature increase thus leaves relatively fewer bikes available during 07-10 at the 1st station, hence a lower chance of renting a bike.
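The two percentages quoted above can be reproduced from the Table 8 coefficients. The sketch below is an illustration of that arithmetic; the variable and function names are hypothetical.

```python
# Coefficients copied from Table 8 ('Central Park S & 6 Ave').
INTERCEPT_DAY, INTERCEPT_MORNING = 130.779, 15.542  # whole day, 07:00-10:00
TEMP_DAY, TEMP_MORNING = 5.653, 0.705               # effect of +1 °C


def share_used(bikes_timeframe, bikes_day):
    """Share of the day's predicted bikes that is used within the timeframe."""
    return bikes_timeframe / bikes_day


share_before = share_used(INTERCEPT_MORNING, INTERCEPT_DAY)
share_after = share_used(INTERCEPT_MORNING + TEMP_MORNING, INTERCEPT_DAY + TEMP_DAY)

# The chance as in formula (9) is 1 minus the share used in the timeframe.
chance_before = 1 - share_before
chance_after = 1 - share_after

print(f"share used 07-10: {share_before:.3%} -> {share_after:.3%}")
print(f"chance of renting: {chance_before:.3%} -> {chance_after:.3%}")
```

The share rises from about 11.884% to about 11.908%, so the chance falls slightly, which corresponds to the "Decrease" entry for Temperature at station 1 during 07-10 in Table 13.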

The hypotheses for the user’s point of view described in the methodology are now compared with the findings. Temperature does have a significant effect on the prediction of bikes but does not consistently raise or lower your chance of renting a bike, hence we can reject the hypothesis stating that Temperature has a negative effect on your chance of renting a bike. This can be seen from the table of effects (Table 13). The Temperature effect differs between the stations because of their different characteristics. For the station ’Central Park S & 6 Ave’ a temperature increase results in relatively more predicted bikes during 07-10 than during the whole day, hence a lower chance of renting a bike (explained above). For the later timeframe, it results in relatively fewer bikes compared with the whole day, hence an increase in the chance. For the humidity variable, this is calculated and interpreted in the same way. The weather variables do not have one fixed effect on this chance; the effect is unique for each station and timeframe due to the characteristics of the stations.

The Weekend has the same direction of effect in both timeframes at each station, namely either a decrease or an increase of the chance of renting a bike. For the stations ’Central Park S & 6 Ave’ and ’St James Pl & Pearl St’, the weekend makes the number of bikes predicted throughout the day larger relative to the number of bikes within the timeframes. This results in a higher chance of renting a bike during the weekend within these timeframes. At the ’Central Park S & 6 Ave’ station more bikes are predicted during the weekend, but the increase within these timeframes is relatively smaller, so bikes remain available for a ride through Central Park. The ’E 47 St & Park Ave’ station is mostly used during the weekdays for commuting purposes. Still, the chance of renting a bike during the weekend decreases. This is because the number of bikes predicted during the whole day decreases relatively much more than the number used within the timeframes. Hence the chance is lower, because at the other times of the day this station is predicted to be used rarely during the weekend.

From the user’s point of view, in which time frame is your chance of renting a bike higher? If all the variables are kept constant, the number of bikes used between 16-20 is higher than between 07-10 for the three stations. So by the definition of the chance discussed in the methodology, relatively more of the day’s bikes are used between 16-20. This means that fewer bikes are left to use at other times of the day. During 07-10, fewer bikes are predicted overall for all 3 stations, indicating that this is indeed a less popular time to rent a bike. As a result, the timeframe 16-20 has a lower chance of renting a bike compared with 07-10. Hence, if a person wants to use a bike for the whole day and take it at one of the three stations discussed in this paper, it is better to grab the bike early in the morning than late in the day.
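The comparison of the two timeframes can be checked with the intercepts from Tables 8-10, that is, with all other variables set to zero. This is a minimal sketch of that calculation; the dictionary layout and the function name `chance` are hypothetical.

```python
# Intercepts copied from Tables 8-10: (whole day, 07:00-10:00, 16:00-20:00).
INTERCEPTS = {
    "Central Park S & 6 Ave": (130.779, 15.542, 54.003),
    "E 47 St & Park Ave": (259.141, 65.053, 152.899),
    "St James Pl & Pearl St": (33.286, 5.813, 10.844),
}


def chance(bikes_timeframe, bikes_day):
    """Chance as in formula (9): 1 minus the share of the day's bikes used in the timeframe."""
    return 1 - bikes_timeframe / bikes_day


# Baseline chance per station for the morning and the evening timeframe.
chances = {
    station: (chance(morning, day), chance(evening, day))
    for station, (day, morning, evening) in INTERCEPTS.items()
}

for station, (chance_morning, chance_evening) in chances.items():
    print(f"{station}: 07-10 {chance_morning:.3f}, 16-20 {chance_evening:.3f}")
```

For all three stations the baseline chance is higher in the morning timeframe than in the evening timeframe, in line with the conclusion drawn above.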

Authoritative Point of View

Now from an authoritative point of view, the prediction of trips starting/ending is analyzed and the estimated parameters are discussed. This is done by examining the question from the methodology: When does the authority need more bikes to keep the customer satisfied? The effects of the different estimators are summarized in Table 14. The estimator of the variable of interest is added to the intercept (or subtracted, when it is negative) to get an estimated number of trips starting/ending. The numbers 1-3 again stand for the same stations mentioned earlier. According to the significance analysis, the most relevant estimators are the Temperature, Humidity, and Weekend. When there are shocks in these variables, the Citibike company has to take this into account when determining the number of people it should employ. From the intercepts, it can be seen that at the stations ’Central Park S & 6 Ave’ and ’E 47 St & Park Ave’ it is more popular to start trips than to end them. For the last station, ’St James Pl & Pearl St’, this is roughly the same.

| Deficit | Trips ’1’ Starting | Trips ’1’ Ending | Trips ’2’ Starting | Trips ’2’ Ending | Trips ’3’ Starting | Trips ’3’ Ending |
| --- | --- | --- | --- | --- | --- | --- |
| Temperature | 169.637 | 162.261 | 292.674 | 268.126 | 36.29 | 33.125 |
| Humidity | 160.832 | 153.905 | 285.128 | 261.791 | 35.274 | 32.052 |
| Wind Speed | 160.767 | 153.926 | 286.166 | 262.635 | 35.313 | 32.117 |
| Holiday | 151.417 | 148.191 | 280.004 | 249.177 | 33.397 | 31.215 |
| Weekend | 251.689 | 225.82 | 84.655 | 80.035 | 30.807 | 31.391 |

Table 14: effect on the trips starting/ending.

For the stations ’Central Park S & 6 Ave’ and ’E 47 St & Park Ave’, when the temperature increases, the number of trips starting increases more than the number of trips ending at the stations (see Tables 11-12 and 14). As a result, more bikes leave these stations than come in, meaning there is a deficit in bikes. CitiBike therefore has to move bikes to these stations to resolve the deficit. So from an authoritative point of view, a temperature increase makes it harder to keep bikes available at the bigger stations in New York. For the smaller station ’St James Pl & Pearl St’ this is the opposite: when the Temperature increases, the number of trips starting at the station increases slightly less than the number of trips ending there. On most days this will not lead to a large deficit in bikes at this station, so for this smaller station the CitiBike program probably doesn’t have to move bikes very often.
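The deficit reasoning can be illustrated with the intercepts and Temperature coefficients from Tables 11-12. The sketch below computes the predicted trips starting minus trips ending before and after a 1 °C increase. The term "net outflow" and all names in the code are hypothetical labels, and the calculation ignores rebalancing and docking capacity.

```python
# (intercept, Temperature coefficient) copied from Table 11 (starting) and Table 12 (ending).
STARTING = {
    "Central Park S & 6 Ave": (162.596, 7.041),
    "E 47 St & Park Ave": (287.192, 5.482),
    "St James Pl & Pearl St": (35.473, 0.817),
}
ENDING = {
    "Central Park S & 6 Ave": (155.569, 6.692),
    "E 47 St & Park Ave": (263.618, 4.508),
    "St James Pl & Pearl St": (32.243, 0.882),
}


def net_outflow(trips_starting, trips_ending):
    """Predicted trips starting minus trips ending; positive means bikes drain from the station."""
    return trips_starting - trips_ending


results = {}
for station in STARTING:
    start_base, start_temp = STARTING[station]
    end_base, end_temp = ENDING[station]
    baseline = net_outflow(start_base, end_base)
    warmer = net_outflow(start_base + start_temp, end_base + end_temp)  # +1 °C
    results[station] = (baseline, warmer)
    print(f"{station}: baseline {baseline:.3f}, +1 °C {warmer:.3f}")
```

The values with the Temperature effect added match the Temperature row of Table 14: the net outflow grows at the two bigger stations and shrinks slightly at ’St James Pl & Pearl St’.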

The calendar also affects the number of trips starting and ending at the stations, which is important from an authoritative point of view. The trips starting at station ’Central Park S & 6 Ave’ on the weekend increase by a larger amount than the number of trips ending there. This means that the authority should also deploy people to move bikes on the weekend, because otherwise there would be too few bikes at this station. The weekend estimator has a negative effect at the other two stations. For the station ’E 47 St & Park Ave’ fewer trips end and fewer trips start there on the weekend. The number of starting trips decreases by much more than the number of ending trips, hence compared with weekdays the station can become overfull on the weekend and bikes should be moved away from it. For the last station, ’St James Pl & Pearl St’, the number of trips starting decreases by a larger amount than the number of trips ending, hence more bikes accumulate there on the weekends than on the weekdays. In this smaller station with few docking places, this can result in too many bikes, and CitiBike needs to move bikes away to free up docking places.

### 6 Conclusions

This research examines the effect of background knowledge on bikeshare from two points of view. Several methods were compared, leading to a WLS regression method to predict the number of bikes. For the second research question, a combination of OLS and WLS is used to predict the number of trips starting/ending. These models use time-series data but are limited to linear regression. This can be extended by exploring non-linear regression methods for predicting these measures. The data is limited to the year 2016, which was a very dry year in New York, hence the insignificance and unexpected sign of the precipitation estimator. The data also contain few holidays, hence for these estimators the prediction was limited and we cannot draw a significant conclusion. Finally, the model was limited by explaining the chances using only the spring intercept, because the other seasonal dummies were often not significant, which would lead to insignificant findings when expressing these chances for the other seasons.

From a user’s point of view, several factors affect the number of bikes being used throughout a day. Temperature, as Gebhart and Noland (2014) already suggested, has a positive effect on the number of bikes predicted. But this does not necessarily mean that your chance of renting a bike becomes slim when the temperature increases. This is examined in this paper and the results are shown in Table 13. The background knowledge discussed in the research question affects your chance both positively and negatively, depending on the timeframe in which you want to grab a bike. Due to the different characteristics of the stations, the chance is also affected in different ways at each station. The regression output also shows that for the station ’E 47 St & Park Ave’, next to Grand Central Terminal, the number of predicted bikes is overall very high. This suggests that Ma et al. (2018) are right and that another method of transport complements the use of bikeshare. The ’Central Park S & 6 Ave’ station is suggested to also have a higher number of bikes predicted due to the positive effect of the area Central Park (Wang and Chen, 2020). The results indeed show a large number of predicted bikes at this station, although this is not necessarily the case at all times. The last, smaller station ’St James Pl & Pearl St’ seems to be affected least by the background knowledge, so that the chance is most of the time the same and a bike is often available there. With these bike-sharing predictions, customers could check the weather for the next day to make a bike prediction and assess the effect on the chance. This is in line with the alarming or suggestion purposes introduced by Fanaee and Gama (2013).

From an authoritative point of view, the background knowledge mentioned in the research question affects the number of trips starting/ending both positively and negatively. The results show that Temperature has the expected positive effect and that Humidity and Wind Speed both have a negative effect. The most important factors when deciding the number of people the authority should employ are Temperature, Humidity and whether it is a weekday or not. Somewhat unexpectedly, Holiday has a negative effect on the number of trips starting/ending. This could be due to insignificance, or it could suggest that these 3 stations are mostly used for commuting, hence fewer trips starting/ending during holidays. The Weekend has a positive effect on the number of trips at the station ’St James Pl Pearl St’. This positive effect is larger for the number of trips starting there, hence CitiBike has to move bikes towards this station on weekends to maintain the right number of bikes and keep customers satisfied. For the other two stations, the weekend has a large negative effect, indicating that these stations are mainly used for commuting. The authority can use weather reports to find out how many people it should employ on a given day to move bikes. The suggestion purposes introduced by Fanaee and Gama (2013) can be formulated in the same way for this point of view: the authority knows how many bikes to move from one station to another because of the predicted trips starting/ending. CitiBike especially has to keep an eye on the bigger stations, which require more work because more trips often start than end there.
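To illustrate how the predicted trips could be turned into a rebalancing decision, a minimal sketch in Python is given here. It is not part of the analysis in this paper: the function name `bikes_to_move`, the rounding rule and the example numbers are assumptions made for illustration only.

```python
import math


def bikes_to_move(predicted_starts: float, predicted_ends: float) -> int:
    """Hypothetical helper: net number of bikes to bring to a station.

    A positive value means more trips are predicted to start than to end
    (a bike deficit), so bikes should be moved towards the station.
    A negative value means a surplus, so bikes can be moved away.
    """
    net_outflow = predicted_starts - predicted_ends
    # Round up so that a fractional deficit is still covered by a whole bike.
    return math.ceil(net_outflow)


# Made-up predictions for one station on one day (not results from this paper).
print(bikes_to_move(predicted_starts=120.5, predicted_ends=95.0))
print(bikes_to_move(predicted_starts=80.0, predicted_ends=100.0))
```

In such a sketch the inputs would be the fitted values of the two linear regressions (trips starting and trips ending) for the next day, given the weather forecast and calendar information.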

For further research, this dataset can be expanded with a longer time period, more information, and more stations in New York. For a more significant and explanatory model, the number of estimators could be increased with more weather information, such as dummies for specific weather events. The observation period could also be extended by 1 or 2 years to obtain more realistic results. With more stations within New York, the effects can be studied to see whether the same or different results are obtained, and these can be compared with the results found in this paper. A more flexible prediction method, such as non-linear regression, could make the prediction more accurate. In 2021 CitiBike is still adding stations and New York City is still promoting the use of bikes; for instance, there is a new bike lane on the Brooklyn Bridge, which raises the question of what the effect will be on the stations close to this bridge.

### References

[1] CitiBike New York City. (n.d.). Explore New York. https://www.citibikenyc.com

[2] CitiBike. (2013). In Wikipedia. https://en.wikipedia.org/wiki/Citi_Bike

[3] DeMaio, P. (2009). “Bike-Sharing: History, Impacts, Models of Provision, and Future.” Journal of public transportation 12.4: 41–56. Web.

[4] Fanaee-T, H. & Gama, J. (2013). “Event labeling combining ensemble detectors and background knowledge.” Prog Artif Intell 2, 113–127.

[5] Fishman, E., Washington, S. & Haworth, N. (2015). “Bikeshare’s Impact on Active Travel: Evidence from the United States, Great Britain, and Australia.” Journal of transport health 2.2: 135–142. Web.

[6] Gebhart, K. & Noland, R.B. (2014). “The Impact of Weather Conditions on Bikeshare Trips in Washington, DC.” Transportation (Dordrecht) 41.6: 1205–1225. Web.

[7] Gehrke, S.R., Sadeghinasr, B., Wang, Q. & Reardon, T.G. (2021). “Patterns and predictors of dockless bikeshare trip generation and duration in Boston’s suburbs.” Case Studies on Transport Policy, Volume 9, Issue 2: Pages 756-766.

[8] Giot, R. & Cherrier, R. (2014). “Predicting Bikeshare System Usage Up to One Day Ahead.” IEEE Symposium Series in Computational Intelligence 2014 (SSCI 2014). Workshop on Computational Intelligence in Vehicles and Transportation Systems (CIVTS 2014). France. pp. 1-8.

[9] Google. (n.d.). [Google Maps Manhattan, New York City, New York, United States]. Retrieved 2 June, 2021, from https://www.google.com/maps/place/Manhattan,+New+York+City,+New+York,+Verenigde+Staten

[10] Heij, C., Boer, P. de, Franses, P.H., Kloek, T. & Dijk, H.K. van. (2004). "Econometric Methods with Applications in Business and Economics." Oxford University Press.

[11] James, G., Witten, D., Hastie, T. & Tibshirani, R. (2017). "An Introduction to Statistical Learning." Springer.

[12] Jensen, P., Rouquier, J.-B., Ovtracht, N. & Robardet, C. (2010). “Characterizing the speed and paths of shared bicycle use in Lyon.” Transportation Research Part D: Transport and Environment, vol. 15, no. 8, pp. 522–524.

[13] Larose, D.T. & Larose, C.D. (2014). "Discovering knowledge in data: an introduction to data mining." John Wiley & Sons, Inc., Hoboken, New Jersey.

[14] Lohry, F.G. & Yiu, A. (2015). “Bikeshare in China as a Public Service: Comparing Government-run and Public-private Partnership Operation Models.” Natural resources forum 39.1: 41–52. Web.

[15] Ma, X., Ji, Y., Yang, M., Jin, Y. & Tan, X. (2018). “Understanding Bikeshare Mode as a Feeder to Metro by Isolating Metro-Bikeshare Transfers from Smart Card Data.” Transport policy 71: 57–69. Web.

[16] Meddin, R., DeMaio, P., O’Brien, O., Rabello, R., Yu, C., Seamon, J., Benicchio, T., Han, D. (ITDP) & Mason, J. (ITDP). (2021). "The Meddin Bike-sharing World Map." http://bikesharingworldmap.com/.

[17] NYC Open Data. (n.d.). Open Data for All New Yorkers. https://opendata.cityofnewyork.us/

[18] Purdue Online Writing Lab. (n.d.). General Writing FAQs. Purdue Online Writing Lab. https://owl.purdue.edu/owl/general_writing/general_writing_faqs.html

[19] Wang, K. & Chen, Y.-J. (2020). “Joint Analysis of the Impacts of Built Environment on Bikeshare Station Capacity and Trip Attractions.” Journal of transport geography 82: 102603. Web.

[20] The Weather Company. (2021). https://www.wunderground.com/history/monthly/us/ny/new-york-city/KLGA. An IBM Business.

[21] Yang, H., Hu, Y., Qiao, H., Wang, S. & Jiang, F. (2020). “Conflicts between business and government in bike sharing system.” International Journal of Conflict Management. 31. 463-487.