Predicting revenue for a large company

Gidion van Kempen

10610898

Thesis MSc in Econometrics
Track: Econometrics, free track
December 2018

Supervisor: Hans van Ophem
Second reader: Katarzyna Łasak
University of Amsterdam (UvA)

Abstract

This thesis addresses the question of how to best predict the revenue of a company in a situation where some very large revenues occur. The company in question is a large freight forwarder with thousands of destinations and three modes of transport. The methods used for building models for each mode of transport are ordinary least squares (OLS), segmented ordinary least squares using breakpoints (Cost+), and multivariate adaptive regression splines (MARS). The methods are compared in two ways: (1) the quality of the predictions, by estimating the models on part of the data and then comparing the predicted revenue with the actual revenue using the mean squared prediction error (MSPE), and (2) the ability of the methods to fit the data, using a generalized cross validation (GCV) criterion. It turns out that for the best fit, MARS is the best model to choose (43.8% of the lanes), closely followed by Cost+ (42.3%). For prediction MARS is also the best method (41.2%), but OLS (30.9%) and Cost+ (27.9%) come close.


Table of contents

1. Background 3

2. Overview of data 10

3. Overview of methods used 14

4. Process 20

5. Results and analysis 23

6. Conclusion 38

References 39

Appendix 40

Declaration of originality:

This document is written by the student Gidion van Kempen who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document (except this paragraph) are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


1. BACKGROUND

1.1 Introduction

This thesis tries to answer the question with which econometric model a freight forwarding company can best estimate and predict its future financial data. The main purpose is to predict the revenue based on the orders that came in, which contributes to getting an accurate estimate of the financial situation each month. For companies that ship goods around the globe, a big problem can be that large but scarce orders account for the biggest part of the revenue, while there is no good way to predict the eventual revenue for these kinds of orders when they come in. A company has multiple reasons to want a clear picture of its own financial situation as fast as possible, in order to remain competent and competitive. Several methods for building a model for the relationship between revenue and order size (i.e. weight) will be compared, and subsequently it will be evaluated which method might be best for predicting the revenue of a company based on a measure of the weight of the packages, which is the primary factor used in international shipping to determine the price of a shipment.

Using data of a large freight forwarding company, it will be evaluated which of the chosen estimation methods performs best for fitting the data and for predicting future revenues, using primarily weight as the explanatory variable. The estimation methods will be compared with each other using the mean squared prediction error to assess the quality of the prediction and a generalized cross validation criterion to compare the fit. As evaluation, it will be counted how many times each method turns out to be the best across the shipment possibilities. Afterwards, possible alterations and potential improvements to these methods will be discussed.

In the first chapter the background of the company and the prediction problems encountered will be discussed. The second chapter gives a description of the data used for estimating the models. The estimation methods and the relevant literature on these methods will be discussed in the third chapter. Then the process and a full outline of the methods used for comparing the models will be described. Subsequently the results of this process will be discussed, together with some possible alterations. Finally a conclusion with recommendations will be provided.

1.2 Background, Context, & Subject

It will be attempted to predict the revenue of the large freight forwarding company using its past data. In operational processes correctness, timeliness and completeness of the data are essential elements for the quality of the predictions based on that data. By using suitable analytical methods, useful business steering processes can be found for operational analysis. Good data quality is key in this respect. Data of companies can be organized in two categories:


operational data and financial data. Most companies first deliver goods or services before the service is invoiced and moves through the financial chain. This means that operational data are available earlier than the full financial statements. These operational data might not completely describe the eventual situation, since the actual costs and revenue might differ from what is known at first. To have a provisional overview of the finances of the new orders and fast insight into the (new) financial situation, a prediction for the final revenue has to be made.

There are multiple reasons a firm wants to have good predictions of its future revenue. On the internal side, the firm needs predictions of its revenue so it can act quicker in case something goes wrong. For example, suppose there is a big flaw in the administration; it would be possible to find it with mathematical models. Of course, it is preferred that these kinds of flaws or errors are tracked down as soon as possible so that the company has better insight into its own financial situation. Furthermore, if a flaw is tracked down quicker it will be less work to correct and extra costs can be prevented, since the longer a flaw remains uncorrected, the more expensive the mistake can become. On the external side, a firm wants to have its administration in order quickly so that it can close its books as fast as possible every month, in order to be regarded as a reliable company by its shareholders. Furthermore, the company can show it complies with certain financial rules. Even though the provisional data might be flawed, it is still useful to have an indication of the future financial situation. In order to do all of this, the firm needs good predictions of its revenue. By closely analyzing past transactions, econometric models can make these predictions of the final revenue based on operational business processes. Once all bills are paid and all flaws in the administration are tracked down, there is access to the financial data, i.e. data that tell the firm the real financial situation. However, the operational data have to be used by the firm for short-term decisions. There are strict rules and guidelines a firm has to abide by, so it is of the utmost importance that the predictions are reliable. Since there is a link between the operational data and the financial data, one could use a time series or a cross-section analysis to determine the accuracy of the predictions in the operational data. Due to limitations in the structure of the data sets of the firm it is not possible to perform a time series analysis for this thesis, because the data sets are not specified in such a way that they can be linked over time.

Besides the problem that in the operational data the actual revenue of a shipment is not yet (fully) known (i.e. delayed) or is out-of-period income, there might be an additional problem: there can be an administrative or input error. An example of the first problem could be that a shipment still needs to be paid for by the client, so there is a risk the firm will not get the expected eventual revenue. An example of a mistake in the second category is when


an employee puts an extra zero in the revenue or enters the wrong currency. So when analyzing the data, these possible problems need to be taken into account.

After the forecasts have been made and evaluated, the company itself uses the results to recognize possible input errors that could not easily be detected automatically by the input software, and also to make predictions of the total revenue. This is done by comparing a prediction with the reported revenue in the operational data (which might still be flawed) and checking whether there is an error, i.e. an inserted value that should actually have been different and will almost certainly change later, provided that the flaw is found. Besides that, the firm can use the results to predict the future revenue of all shipments combined, so that at the beginning of each month it has an estimate of the revenue at the end of the month. This is more reliable than using just the average of past values, since possible uncommon shipments can now be accounted for. So the goal of this thesis is to find a model that can estimate and predict revenue (or the margin) for a company that does not have flawless operational data in its books.

In the transport sector the revenues and costs are typically very close, so there is usually a relatively small profit margin on each shipment. The focus is to make that margin as large as possible. One big challenge is that in this sector the biggest margins are achieved for the largest shipments. However, these are rare in comparison to the smaller shipments, so there are fewer data to assess the correctness of the values provided. On the other hand, precisely because these shipments are big, it might be easier to evaluate the correctness of the reported values. This problem will be explained in depth in the next section.

The data used in this thesis are from an international freight forwarder. This is a company that buys capacity for the transport of goods on airplanes, ships, or trucks, in order to ship its clients' packages, ranging in size from small to very big, around the globe. The freight forwarding company ranks as one of the world's top ten largest in its sector. It is a global company active in more than 80 countries, employing 25,000+ people across its affiliates. The company has at least 5000 different shipment lane combinations between the cities it is active in. A lane is the connection route between two cities or places, which can be connected via air, by road using trucks, or via ocean transport. Each transport method is considered a different lane, even if it is between the same cities. Of course not all three transport methods are possible for every city-to-city connection. The analysis will be done on each lane separately, such that a specific prediction can be made for each individual lane.


1.3 Example lane

Figure 1: data of an example lane

To illustrate the difficulties faced in finding relationships between the variables, the graph in Figure 1 depicts as an example the relationship between weight and revenue for one of the lanes, JFK (New York) to Tel Aviv, for the most recent dataset. This is one of the 9024 lanes for which a model is estimated. The weight of the shipment is on the horizontal axis and the revenue on the vertical axis. As is visible in the graph in Figure 1, the higher values do not appear to be out of the ordinary, with the exception of one, which has a much larger revenue than the rest and can be considered an outlier. In Figure 2 this outlier is ignored. There appear to be multiple possible relationships, at least for the values close to the origin. However, when the weight increases, there is no clear relationship anymore. Moreover, it even seems possible that there is a log-linear relationship instead of a linear one, or a break in the relationship. Different possible relationships are depicted in the graph in Figure 2 below, where the green line represents one possible linear relationship, the black line a logarithmic relationship, and the orange line a relationship with a break point, indicated by the red dashed line.


Figure 2: same data, but now with possible relationships

Since there seem to be multiple options for (log-)linear relationships, a model is needed that can also, and especially, be used for weights where the observations are large and scarce. These high values are important, since they can account for a big share of the revenue of a company and should therefore be estimated with care. So the prediction for lanes with scarce observations at high values is the main challenge if the goal is to build a model for each lane.

1.4 Workplan: Starting models

Of course, the most important step in predicting the revenue is finding a suitable model to do so. For this study, three different methods will be considered. This section explains them briefly; they will be expanded upon in chapter three. The methods are:

a) OLS. For each lane, use past revenue as the dependent variable and some measure of weight and other explanatory variables to estimate the revenue. The estimated model can be used to predict the revenue of new instances in the future. For example: 𝑟𝑒𝑣𝑒𝑛𝑢𝑒 = 𝛼 + 𝛽1𝑤𝑒𝑖𝑔ℎ𝑡 + 𝛽2𝑥2 + . . . + 𝜀

A variation of this model is to use the logarithm of either the dependent variable or of an explanatory variable. At first the standard OLS model will be used; later on the log-variation models will also be estimated, and the results will be compared.

b) Cost+. The name is taken from a model used in economics, but here it is reworked to a model using (known) chosen threshold points (also called breakpoints), where the coefficients can change. This is called a segmented or piecewise regression, or least squares with breakpoints. Besides that, some other constraints can be imposed, such as that the next slope has to be at most as large as the previous. This way the revenue for a small weight can be calculated with a different formula than for a heavy weight.


For example:
𝑟𝑒𝑣𝑒𝑛𝑢𝑒 = 𝛼 + 𝛽1𝑤𝑒𝑖𝑔ℎ𝑡 + 𝜀 if 𝑤𝑒𝑖𝑔ℎ𝑡 ∈ 𝐴
𝑟𝑒𝑣𝑒𝑛𝑢𝑒 = 𝛼 + 𝛽2𝑤𝑒𝑖𝑔ℎ𝑡 + 𝜀 if 𝑤𝑒𝑖𝑔ℎ𝑡 ∈ 𝐵
etc.

The problem with this model is that the thresholds are fixed, so it might come across as arbitrary, and they might not be the best threshold points; further investigation has to show that. On top of that, it is impossible to manually select the points for each lane.

c) MARS, Multivariate adaptive regression splines, introduced by Friedman (1991). This technique takes into account that there can be break points in the data, basically by estimating those break points where the coefficient for an explanatory variable changes, so it can account for the different effects small and large values of weight (or other variables (𝑥2, etc.)) have on the revenue.

For example: 𝑟𝑒𝑣𝑒𝑛𝑢𝑒 = 𝛼 + 𝛽1max(0, 𝑤𝑒𝑖𝑔ℎ𝑡 − 𝑐1) + 𝛽2max(0, 𝑥2− 𝑐2) + . . . +𝜀 The MARS algorithm estimates both all 𝛽𝑘 and all 𝑐𝑘.

All these models and possible additional ones will be discussed in chapter three, including their advantages and disadvantages for the problem at hand. The models have to work well for high values, since the biggest challenge for the company is estimating the revenue when an order with a big weight comes in. The models will be built using different variables, but always including the relationship between revenue and weight. Depending on the model type, several other variables can be added. This will be explained in depth later. After estimation there are several different estimates, and it has to be decided which one to use as the best possible prediction of the revenue.

The models will be estimated in two ways: on the full data, and on part of the data so that it can be evaluated how well they work in predicting. The results of the comparison of the models for both of these approaches will be discussed in the fifth chapter.

1.5 Economic theory

OLS is a well-established method for modeling data. Cost+ can be seen as a more generalized version of OLS, but at the same time it can be seen as a simpler version of MARS. The MARS method has been used for prediction multiple times before, including for forecasting revenue. For example, Siddappa et al. (2008) use MARS to estimate airline revenue, by optimizing airplane overbooking. Lu et al. (2012) use MARS to forecast sales for computer wholesalers in Taiwan, with the purpose of making better sales management decisions. Their investigation concludes that MARS outperforms other kinds of prediction methods, including linear regression. The forecast performance of the methods is compared using variations of the root mean squared error. Zareipour et al. (2006) use MARS to model the short-term hourly energy price in Ontario. They find that the MARS method works well for creating a forecasting model when provided with extra variables, like information about past prices and


demand. The same method is also used by Lin et al. (2011) to forecast tourism demand in Taiwan. They compare MARS with alternative methods using the root mean squared error. However, they find that ARIMA outperformed MARS in a time series analysis. De Andrés et al. (2011) compare MARS with another forecasting regression method based on clustering to predict bankruptcy of Spanish companies. Lastly, MARS is also used to forecast stock prices by Rounaghi et al. (2015) for companies on the Tehran stock exchange, making use of 40 variables, including accounting and economic variables like various index values, ratios, stock prices, returns, and interest rates. They compare the prediction ability of MARS with other regression spline methods using a variant of the root mean squared error. This shows that MARS is an often used method for predicting various kinds of economic activity, and in research it is often compared to other methods.

1.6 Objectives

The main objective is to make the best forecast for each transportation lane. The best possible outcome is that one method is the preferred method for forecasting for all lanes (instead of sometimes one method being preferred and in other cases another, without any clear indication why), so that we can conclude that this method should be used for prediction in the future. However, another outcome could be that the best prediction method differs between transport lanes, and every method is the best option in at least some situations. To see which is in fact the case, it should be evaluated whether one method is indeed preferred over another. When selecting the best model to use for that forecast, there are two distinct requirements to keep in mind.

The first requirement is that, using the selected prediction, we can spot whether there might be an error in the reported value. Instead of generating a warning when someone enters the amount, based on some simple rules about the minimum or maximum, we are looking for a prediction to compare with the already entered data, to see if there might be an input flaw.

Another requirement is that the estimated model is able to predict the total revenue of an entire month. So the estimated model should be able to make a prediction at the beginning of the month, in order to estimate what the revenue is going to be at the end of that month. This has to be calculated at a time when the final income is not yet fully known for each shipment in the data set. It is up to the company itself to choose whether to make a prediction for each transport lane alone, to use this to make a prediction for the total aggregate revenue of that month, or a combination of both.


2. OVERVIEW OF DATA

2.1 Data

The data used in this thesis consist of six separate data sets, all from separate dates in 2014, 2015, or 2016. Each of these sets covers four months of data. Each data set contains the following variables:

- Orig_city: the city of origin of the shipment.

- Dest_city: the place of destination. Both of these are of course very important for building the models; for each lane (i.e. origin + destination) a separate model will be estimated.

- Orig_country: the country of origin. In case there are too few data points for a certain city-to-city shipment route, a model for the lane between countries can be used instead.

- Dest_country: the country of destination.

- Mode: this indicates whether the shipment is done by ship, by airplane, or by road (by truck). This is a very important variable, since the data are separated into these categories and the models are estimated for each of them. The preceding five variables define the lanes.

- gad_weight: the gad weight, also known as the dimensional weight or the volumetric weight, which is a measure widely used in the freight transport business as an alternative to the actual weight. It is a calculated value, making use of the height, width and length of the package that is to be shipped. The company has its own formula to calculate this value. For example, a simple formula could be 1 dm3 = 1 kg (a short illustrative sketch follows below, after the variable lists).

- teu: TEU or ‘twenty-foot equivalent unit’. This is a measure used in shipping via sea as an alternative for expressing the weight in kg. It describes the capacity of one container.

- Product_group: code of the carrier for the product group. There are several different product groups; each shipment has to have one of them. The possible product groups are: AIR/SEA, BREAKBULK (break bulk cargo), CHARTER, ECONOMY, EXPRESS, FCL (full container load), FCLFF (alternative full container load), FTLROAD (full truck load, via road), LCL (less-than-container load, so only a box), LTLROAD (less than truckload, via road), NFO (next flight out), SEA/AIR, STANDARD, TSAIR (transshipment via air), TSOCEAN (transshipment via ocean), and TSROAD (transshipment via road). This indicates how a shipment for this product is carried out, for example a transshipment via road from Amsterdam to Paris with a stop at Brussels.


- Payment_term_code: indicates the payment term. There are 3 different payment term arrangements; each shipment has to have one of them. The possible choices are COL (collected), OTH (other), and PPD (postponed). Each one occurs about equally often in the data.

- incoterm: ‘International Commercial Terms’, an international standard that sets out the rights and obligations of the buyer and seller. It basically describes at which point of the shipment the ownership of the goods transfers from the seller to the buyer. There are 15 different options, and each shipment has to have one: CFR, CIF, CIP, CPT, DAF, DAP, DAT, DDP, DDU, DEQ, DES, EXW, FAS, FCA, and FOB. These are the technical names of the different possible incoterms.

- REV: the revenue that is known to the company (for standard shipping).
- REV_EX_DEST: the revenue for express shipping.

In addition, there are several other variables available in the data sets that will not be used:
- Cost: the costs known to the company.
- Margin: the difference between the costs and the revenue.
- Movement_id: an id for each separate shipment order.
- Piece_count: number of pieces the shipment consists of.
- Actual_weight: the real weight of the shipment in kg.
- volume: the volume of the shipment.
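To make the dimensional-weight idea concrete, a minimal sketch in R follows. The company's actual formula is not disclosed, so the divisor below (corresponding to 1 dm3 = 1 kg) and the function name are assumptions purely for illustration.

```r
# Hypothetical dimensional (gad) weight calculation; the divisor is an
# assumption (1 dm^3 = 1 kg, i.e. dividing cm^3 by 1000), not the company's formula.
dim_weight <- function(length_cm, width_cm, height_cm, divisor = 1000) {
  (length_cm * width_cm * height_cm) / divisor
}

dim_weight(120, 80, 100)   # a pallet-sized package -> 960 "kg" dimensional weight
```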

Lanes will be estimated using both REV (standard shipping) and REV_EX_DEST (express shipping) as dependent variable (see below for an explanation of the usage). Throughout this thesis both are referred to as just ‘revenue’ in the models described. In principle the main explanatory variable used is either gad_weight or teu; both are referred to as just ‘weight’ for short in this thesis. Both gad weight and TEU are widely used in international shipping as the primary determining factor for the price of transportation. Gad weight is used for transport by road, train and airplane, while TEU is used for transport via sea. However, besides weight other variables are used to make sure the revenue is explained as well as possible, namely:

1) Incoterms (incoterm); this seems relevant since the incoterm denotes the part of the lane that is handled by the firm, i.e. the part that is the firm's responsibility. It is possible that another part of the transport is carried out by another company or by the client himself. Revenue will depend on whether the firm has to arrange transport on the entire lane or only part of it. Instead of considering every part as a different lane, incoterm is used as a variable to prevent the creation of 15 times as many lanes, some of which would barely have any observations.


2) Payment term code; which denotes how and when the payment will be made, with the choices being a collective payment (i.e. immediately paid), a postponed payment, or a category for others. This variable is deemed of lesser importance for the revenue, but it is still included because the company deems it important enough to include when making predictions. For example, it could be that postponed payments generally result in lower revenue.

3) Product group; this variable indicates, for example, what kind of container or truck is used for a shipment. It is included since the kind of product being shipped is deemed important for the level of the revenue.

It does not seem necessary to use the other variables from the data set, since they are either administrative indicators, or, in the case of volume and actual weight, redundant because dimensional weight is already derived from the volume and thought to be a good substitute for actual weight. Since the purpose is to estimate a model for each lane, which is already defined as transport using a specific means of transportation between two cities, each lane has a fixed distance and a specified transport method, so a variable for distance or transport method would be meaningless. Since the main interest is predicting revenue, revenue is chosen as the dependent variable and not cost or margin. A problem with costs is, for example, that if the company rents a container to send goods for three different clients, the costs for that one container get divided in a way that is not always consistent and can be arbitrary. Besides, the costs may sometimes be higher due to external factors like temporary non-availability of containers, delayed shipment, or poor judgment in the procurement of container space that results in significantly higher costs. The margin variable is by definition derived from costs. Revenue is therefore deemed more reliable and is used as the dependent variable. The two revenue variables that can be used, standard and express revenue, are available for almost every lane, with a few exceptions where there is not enough information for that particular lane. Standard revenue is the revenue gained for a normal shipment, while express revenue is that of a shipment that is faster. Both options are estimated for every lane, so they could be considered separate lanes although they have the same begin and end points. As can be expected, for express shipment the revenue is higher than for standard shipment, but that difference is not expected to have a large effect on the other variables.


Dataset number Dataset name Period Number of rows

1 20160402 2016-01-01 to 2016-04-01 247095

2 20160314 2015-12-01 to 2016-03-14 281259

3 20160121 2015-10-01 to 2016-01-21 313298

4 20151121 2015-08-01 to 2015-11-21 331437

Table 1: overview of data sets

Dataset nr Mean of gad weight Stdev of gad weight Min of gad weight Max of gad weight Median of gad weight

1 31977.8 2616631.6 0.001 1099411000 398

2 31902.6 2557044.5 0.001 1099411000 400

3 33152.2 2753432.5 0 1099411000 393

4 32181.3 2560293.0 0 1000000000 421.8

Table 2: statistics about dimensional weight

Dataset nr Mean of TEU stdev of TEU Min of TEU Max of TEU Median of TEU

1 0.97 106.5 0 47800.4 0.0282

2 0.92 98.9 0 47800.5 0.0291

3 0.97 103.4 0 47800.5 0.0292

4 1.62 96.5 0 26625.0 0.0307

Table 3: statistics about TEU

Dataset nr Mean of REV stdev of REV Min of REV Max of REV Median of REV

1 1406.09 3977.45 0 843758.7 440.4

2 1455.06 4043.32 0 844115.0 457.7

3 1505.81 3979.37 0 348855.9 468.4

4 1555.78 4248.34 0 673965.7 483.0

Table 4: statistics about revenue

As is clear from Tables 2, 3, and 4, the four data sets are comparable in mean, minimum, maximum and median of the variables. There is a very big difference between the minimum and maximum of all three variables. The maximum for gad weight seems very high, while a minimum of zero is (too) low, since this would mean there are packages without weight and shipments without revenue. These obvious flaws are removed from the data set. An important side note is that gad weight is not measured in actual kilograms, but is a calculated measure based on the volume. Furthermore, the standard deviation for gad weight is very large, and to a certain extent this holds for the other variables as well. The median gives a different picture than the mean: the maximum is very far from the median, which indicates why the standard deviations are this large.
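A minimal sketch of removing these obvious flaws is given below, assuming the data are held in a data frame `raw_data` (a hypothetical name) with the column names listed in section 2.1; the analogous filter would apply to REV_EX_DEST for express lanes.

```r
# Sketch: drop rows with zero weight (both weight measures) or zero revenue.
clean <- subset(raw_data, (gad_weight > 0 | teu > 0) & REV > 0)

nrow(raw_data) - nrow(clean)   # number of removed observations
```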


3. OVERVIEW OF METHODS USED

In this chapter the methods used for estimating the models and predicting the revenue will be discussed: OLS, Cost+ and MARS.

3.1 OLS

The ordinary least squares method is well-known in econometrics and will not need a lot of introduction. In this case the OLS model used for each lane can be written as:

𝑟𝑒𝑣𝑒𝑛𝑢𝑒 = 𝛼 + 𝛽1 𝑤𝑒𝑖𝑔ℎ𝑡 + 𝛽2𝑥2 + . . . + 𝜀

where revenue denotes the vector of revenues and weight the vector with a measure of weight. Besides that, other variables are included in 𝑥 as explanatory variables: the dummies for the ‘incoterm’ categories, ‘payment term code’, and ‘product group’.

There are several well-known advantages of using OLS. One advantage is that this model is simple and probably quite intuitive for everybody. However, recalling Figure 1 above, a big disadvantage is that there might not be such a simple relationship, or the relationship might be non-linear. When it is clear that there are two or more separate relationships, it is important that the relevant explanatory variables are used to explain the distinction between them. A possible remedy for different relationships between high and low values of weight would be to estimate different OLS models for low and high values separately. Furthermore, there might be few data points for shipments with a much larger weight than the others. For that category there might be very few observations, so it might be difficult to estimate the actual relationship for these large values this way. A possible solution for this problem might be found in the Cost+ and MARS models. Besides that, another option could be to use the natural logarithm of the revenue or of the weight, or to use a polynomial model with squared or even cubed variables, or to include interaction terms. Some of these alterations of OLS will also be implemented and estimated instead of the standard OLS specification, and subsequently the results will be compared.
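A minimal sketch of the per-lane OLS specification (and one log variant) in R follows; the data frame `lane_data` is a hypothetical name for the observations of a single lane, with the column names from chapter 2, and factor() creates the dummies for the categorical variables.

```r
# Per-lane OLS with weight and the categorical dummies.
ols_fit <- lm(REV ~ gad_weight + factor(incoterm) + factor(Payment_term_code) +
                factor(Product_group),
              data = lane_data)

# A log variant: revenue and weight in natural logarithms.
ols_log <- lm(log(REV) ~ log(gad_weight) + factor(incoterm) +
                factor(Payment_term_code) + factor(Product_group),
              data = lane_data)

summary(ols_fit)
```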

3.2 COST+

Since the OLS model is estimated on all the data of a lane, the model might not work well for predicting the revenue of shipments. Therefore the following model is introduced. The model, called Cost+, short for cost-plus pricing, is named after a model used in economics as a simple method for setting prices for products.

In the model used here there are predefined break points, and for each interval between two break points, the weight variable can have a different coefficient. What this


method basically comes down to is a segmented (piecewise) regression. In principle a segmented regression model can be written as:

$$revenue = \alpha_n + \beta_{1,n}\, weight + \beta_{2,n}\, x + \dots + \varepsilon_n \qquad \text{with } n = 1 \text{ if } weight \in [p_1, p_2),\; n = 2 \text{ if } weight \in [p_2, p_3), \text{ etc.}$$

However, in order to make it a continuous model, the model is specified as:

$$revenue = \alpha + \sum_{n=1}^{N} \beta_n \, [\, weight - p_n \,]_+ + \beta_{N+1}\, x + \dots + \varepsilon$$

where 𝑛 denotes the region between two breakpoints, 𝑁 the number of regions, and $[x]_+$ denotes the maximum of 0 and 𝑥. The breakpoints used here are initially set as 𝑝 = (0, 50, 500, 1000, 5000). Zero counts as the first break point. These points are chosen to reflect the spread of the observations as well as possible, by looking at graphs of weight and revenue. However, any choice will be perceived as somewhat arbitrary. Later on it can be tested whether the results are very different when other break points are used. Besides the model above, several other restrictions can be imposed, for example that a coefficient is only allowed to be at most as large as the previous coefficient. So as a model restriction:

𝛽1,𝑛 ≥ 𝛽1,𝑛+1.

This is done in order to implement the reasoning that the larger the package, the relatively cheaper the shipment. So it would not be advantageous to divide a package into two packages and try to ship them separately.

Like in OLS, the revenue is the dependent variable, and the explanatory variables are the weight variable and the dummies for incoterm, payment term code, and product group.
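A minimal sketch of this segmented (Cost+) specification with the fixed breakpoints follows, again using the hypothetical per-lane data frame `lane_data`. Note that this plain least-squares fit does not impose the restriction that each slope is at most as large as the previous one; enforcing that would require constrained least squares.

```r
# Cost+: one hinge term [weight - p_n]_+ per breakpoint, plus the dummies.
breakpoints <- c(0, 50, 500, 1000, 5000)
hinges <- sapply(breakpoints, function(p) pmax(lane_data$gad_weight - p, 0))
colnames(hinges) <- paste0("h", breakpoints)

cp_data <- cbind(lane_data, hinges)
costplus_fit <- lm(REV ~ h0 + h50 + h500 + h1000 + h5000 +
                     factor(incoterm) + factor(Payment_term_code) +
                     factor(Product_group),
                   data = cp_data)
```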

This model might work better for large values than OLS and MARS, because this estimation method is forced to estimate a separate coefficient exclusively for large weights. Of course a big drawback of this model is that the choices for the breakpoints have to be made beforehand, so the 𝑝𝑘 are chosen manually and not by some smart selection mechanism like an algorithm. The MARS algorithm does implement this. The advantage of estimating the model this way is that it provides an intuitive and easy to understand relationship between weight and revenue, like OLS, but with more room for different parameters depending on the weight. Another alternative is to use a polynomial, with squared or cubed variables.

It can be tested whether the Cost+ model has a significantly better fit than the OLS model. For this purpose a test for multiple break points is used, devised by Bai and Perron (1998, p. 56) as an extension of the Chow break test. For the Cost+ model, the test can be written as

$$F = \frac{SSR_{OLS} - SSR_{C+}}{(N-1)\,q} \cdot \frac{T - Nq - p}{SSR_{C+}}$$


where 𝑇 is the number of observations, 𝑁 is the number of regions (so 𝑁 − 1 is the number of breakpoints), 𝑞 is the number of coefficients that are new or can change due to the breaks, and 𝑝 is the number of coefficients that do not change (so in this case the number of other variables, the dummies). The sum of squared residuals (SSR) of the full OLS regression is compared with the SSR of the Cost+ regression. This way it can be tested whether the break points of Cost+ make the model better than the OLS model without break points. For this test the null hypothesis states that the coefficient(s) of the weight variable in OLS are the same as the coefficients in Cost+. A significance level of 5% is used for the test.
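A sketch of how this break test could be computed from the two fits of the earlier sketches follows; the normalization by (N − 1)q is my reading of the Bai and Perron statistic given the definitions of T, N, q and p above, and the way p is obtained from the fitted object is an assumption.

```r
# Compare the SSR of the plain OLS fit with the SSR of the Cost+ fit.
ssr_ols <- sum(residuals(ols_fit)^2)
ssr_cp  <- sum(residuals(costplus_fit)^2)

T_obs <- nrow(lane_data)                 # number of observations
N     <- 5                               # number of regions (N - 1 breakpoints)
q     <- 1                               # coefficients changing per break (the weight slope)
p     <- length(coef(ols_fit)) - 1 - q   # unchanging coefficients (the dummies); an assumption

F_stat <- ((ssr_ols - ssr_cp) / ((N - 1) * q)) / (ssr_cp / (T_obs - N * q - p))
F_stat > qf(0.95, df1 = (N - 1) * q, df2 = T_obs - N * q - p)   # TRUE -> reject at the 5% level
```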

3.3 MARS

The multivariate adaptive regression splines method was developed by Friedman (1991). The purpose of this method is to overcome the difficulties encountered by more basic forms of regression analysis. MARS can be seen as a version of regression spline methods, or as an extension of the Cost+ method. It also uses partitions of the data, in which a parameter can have a different value, but here the breakpoints are not chosen beforehand but are determined by an algorithm. Moreover, it can also pick the relevant variables and add cross-terms. The structure of this method is that it uses an algorithm to determine so-called product spline basis functions, which apply between certain boundaries (i.e. breakpoints) of the data. A basis function consists of one term or a product of terms, each consisting of a variable minus a break point and its corresponding sign.

The MARS model has the following form:

$$y = \alpha + \sum_{m=1}^{M} \beta_m \prod_{k=1}^{K_m} \left[\, s_{mk} \left( x_{v(m,k)} - t_{mk} \right) \right]_+ + \varepsilon \qquad (3.3.1)$$

To give a quick summary of the symbols used: 𝑠 denotes a sign, 𝑡 denotes a break point or splitting point, 𝑀 denotes the number of basis functions, and $K_m$ denotes the number of (cross-)terms in basis function 𝑚. For the variable 𝑥, $v(\cdot)$ is a function that denotes which variable is needed in that specific term. The MARS algorithm estimates all elements that make up the basis functions. The parameters of the model are denoted by 𝛼 and the $\beta_m$. Further, $[x]_+$ denotes the maximum of zero and 𝑥. So the model consists of a sum of parameters times products of terms, each term consisting of a variable minus or plus a break point; the choice between minus and plus is denoted by the sign. For example, a model could be $y = \alpha + \beta_1 (x_1 - c_1) + \beta_2 (c_2 - x_2) + \beta_3 (c_3 - x_2)(c_4 - x_1) + \varepsilon$. This model includes two terms (with parameters $\beta_1$ and $\beta_2$) with a break point (denoted by $c_1$ and $c_2$), and a part with a cross-term (with parameter $\beta_3$).
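A small worked illustration of the truncated ("hinge") terms of equation 3.3.1 follows, using the structure of the example model above; the break points and coefficients in the sketch are made up purely for illustration.

```r
# Truncated basis terms: [z]_+ = max(0, z).
hinge <- function(z) pmax(z, 0)

c1 <- 100; c2 <- 3; c3 <- 5; c4 <- 200          # hypothetical break points
f_example <- function(x1, x2) {
  10 +                                          # alpha
    2.0 * hinge(x1 - c1) +                      # beta1: term with break point c1
    -1.0 * hinge(c2 - x2) +                     # beta2: mirrored term with break point c2
    0.5 * hinge(c3 - x2) * hinge(c4 - x1)       # beta3: cross-term of x1 and x2
}

f_example(x1 = 150, x2 = 1)                     # 10 + 100 - 2 + 100 = 208
```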


Starting point: In this section it will be explained step-by-step how this formula was

derived. The starting point is to assume that the data can be described with a model that can be written as the following formula:

𝑦 = 𝑓(𝑥) + 𝜀

With 𝑓(𝑥) a function to be found (such that it can be written as equation 3.3.1) and 𝑥 the explanatory data. One possibility for finding 𝑓(𝑥) is piecewise polynomial parameter fitting. This means that for different sections of the data the parameter estimates can take on different values. A well-known example of this is splines, where a polynomial consists of different partitions that have to be estimated with regression. Using recursive partitioning regression and its own algorithm, the MARS method uses splines to find a function 𝑓(𝑥). This method is explained briefly in Friedman (1993); for the full explanation of the method see Friedman (1991).

The challenge now consists of two parts where the MARS algorithm can help: first to use a strategy to find the best locations for the partitioning (the so called breakpoints), and second to estimate the parameters inside the partitions.

Recursive partitioning regression: Strategies for partitioning that try to approximate

functions in multiple dimensions are based on adaptive computation. An adaptive computation strategy is a strategy that dynamically makes adjustments, taking into account the behavior of the problem, like the functions to be estimated (Friedman 1991). An example is the recursive partitioning regression method, on which the rest of this section is based. This basically comes down to continuously splitting the partitions and subsequently testing whether the newly split regions have to be split again, and whether two newly created regions have to be combined. The simplest form of such a strategy is obtained if 𝑓(𝑥) is assumed to be an expansion of different functions depending on the region the variable is in. When the functions are written as 𝑔𝑚, the partitioning of 𝑓(𝑥) can be written as:

𝑓(𝑥) = 𝑔𝑚(𝑥|𝛽𝑚) if 𝑥 ∈ 𝑅𝑚

where 𝑅𝑚 indicates the subregion 𝑚, and 𝛽𝑚 denotes the parameters used in function 𝑔𝑚. For simplicity, the functions 𝑔𝑚 can be defined as just constants, so 𝑔𝑚(𝑥|𝛽𝑚) = 𝛽𝑚. In that case the model has almost no flexibility, so it has to be expanded.

Generally the course of action hereafter is splitting the subregions into two parts, estimating the model for both subregions and subsequently carrying out a goodness-of-fit test. Some regions will have to be combined, so that the fit is optimal based on a lack-of-fit test. This is done recursively until a certain criterion is reached to prevent overfitting, such as a maximum number of subregions that can be generated or a maximum number of splits that can be made. Friedman observes two problems with this approach: it is discontinuous at the boundaries, and certain types of simple functions are difficult to approximate; therefore this method has to be expanded. Continuity is needed since it is not permissible to have jumps in the


formula for the price of a shipment (and thus revenue) when it is based on weight: one should not be able to increase (or decrease) the weight by a small amount and have the costs change completely by jumping to a higher (or lower) segment. As a consequence, for the models that are needed to estimate the relationship between revenue and weight we would like to have a continuous function.

Adaptive regression splines: For the MARS method, Friedman rewrites the previous

formula for the recursive partitioning regression using the set of basis functions the following way:

$$f(x) = \sum_{m=1}^{M} \beta_m B_m(x)$$

Here $B_m(x)$ is called a basis function, and $\{\beta_m\}_1^M$ are the coefficients 1 to 𝑀 (the number of basis functions). When the basis function is chosen as an indicator function like $B_m(x) = I(x \in R_m)$, we get the same formula as defined above with constant 𝑔𝑚, and thus 𝛽𝑚 functions as a constant. However, the basis function can also be defined as a more specific function, which will be needed for MARS.

The main goal of recursive partitioning is not just to modify the coefficients to find the best fit for the data, but also to get a good set of basis functions. The method up to here uses regions defined before the estimation. If one wants to use regions that fit the data better, different basis functions have to be estimated. The recursive partitioning algorithm for MARS generates the following basis functions:

$$B_m(x) = \prod_{k=1}^{K_m} I\left[\, s_{km} \cdot \left( x_{v(k,m)} - t_{km} \right) \ge 0 \,\right]$$

with indicator function 𝐼(⋅) equal to 1 if its argument is zero or positive and 0 otherwise. The MARS algorithm determines the optimal $K_m$, $s_{km}$ and $t_{km}$. Here 𝑡 represents the break or splitting point, i.e. the point where a region has to be split into two subregions, and 𝑠 represents the sign, indicating which part (left or right) of the split is meant. The function 𝑣(𝑘, 𝑚) properly labels the predictor variables based on 𝑘 and 𝑚, and $K_m$ indicates the number of terms that construct the basis function $B_m$ of region 𝑚. This is the first part of the procedure explained above: the continuous splitting. After this the second part can be carried out: the backward stepwise procedure that decreases the number of subregions, such that there is no overfitting. This is done by means of lack-of-fit testing.

Continuity: As stated before, Friedman describes that a fundamental problem of

recursive partitioning is the lack of continuity. The previously mentioned basis functions are sharply discontinuous at the splitting points of the regions. To tackle this continuity problem, which at this point still arises at the boundaries used in the previous model, Friedman modifies the model by replacing the discontinuous indicator function with a different


function that makes the model continuous. To this end he creates adjusted ‘truncated’ basis functions based on ordered splines.

Besides the lack of continuity, another problem is the inability to provide good approximations for some classes of simple but often encountered functions. Besides replacing the indicator function, two other measures are proposed by Friedman to create the MARS algorithm: firstly, after a parent basis function $B_m(x)$ is split into two partitions, the parent basis function is not removed, thereby making the parent basis function and both of its children eligible for further splitting; secondly, the product inside each basis function is restricted to factors with distinct predictor variables.

The MARS algorithm: When these three changes are implemented, the result of the

recursive partitioning algorithm, the MARS backward and forward stepwise algorithms, can be found in the article by Friedman (1991). It uses a lack-of-fit function to determine whether a region should be split further or whether certain partitioned regions should be combined again. Friedman also discusses how one can set $M_{max}$, the maximum number of regions. Besides that, he provides measures for the degree of continuity and breakpoint optimization. This algorithm is used for recursive partitioning, and it also produces the right signs, $s_{km}$, for each partition (Friedman 1991). Finally, if we apply the MARS algorithm in model form, inserting the adjusted basis function $B_m(x)$, the resulting model can be written as formula 3.3.1.

For this model, revenue is chosen as the dependent variable and the MARS mechanism is supplied with the weight measure, product group, payment term code, and incoterm as explanatory variables. The MARS algorithm itself decides whether it deems it necessary to use the variables and dummies. On top of that, it can also add cross-terms of variables, meaning a dummy for a categorical variable times a basis function. These cross-terms can also be chosen to be of higher order, meaning a product of more variables. It is possible that the MARS algorithm decides that eventually not all of these dummies end up in the final model for a specific lane, since some can be left out to avoid overfitting. So for each lane the variables used in the end might differ. The degree of interaction, that is the maximum number of variables that are allowed to interact, is initially set to 2. This is done in accordance with Friedman (1991, p. 31), where he calls this 𝑚𝑖. When the degree is set to 1, interaction between variables is prohibited: only the constant basis function is allowed to appear in products with the univariate spline basis functions, so the model is additive. The maximum degree of interaction is the number of variables used. The degree is not set to that maximum in order to prevent overfitting, and also to make sure the computation time does not get out of hand. Therefore a degree of 2 is used, which only allows two-variable interactions in the model. The MARS algorithm is implemented in R in the package earth (Milborrow 2018). It provides


the user with a report of the variables eventually used in the full model and the estimated parameters.
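A minimal sketch of fitting such a model for one lane with the earth package follows; the data frame `lane_data` and its column names are the same assumptions as in the earlier sketches.

```r
# MARS for one lane with earth (Milborrow 2018); degree = 2 allows at most
# two-variable interactions, as described above.
library(earth)

mars_fit <- earth(REV ~ gad_weight + factor(incoterm) + factor(Payment_term_code) +
                    factor(Product_group),
                  data = lane_data, degree = 2)

summary(mars_fit)   # lists the selected terms, their break points and the GCV
```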

The big advantage of using MARS, especially compared to OLS, is that due to the interaction terms, and hence more parameters, there can be a better fit locally. A disadvantage might be that the way the splines are chosen by the algorithm is very complicated. Besides that, there is no guarantee this method will work well for large weights. Therefore the Cost+ method is used as well, which forces there to be a region for large weights.

4. PROCESS

The process of handling the data can be divided into two distinct phases. The first phase is essentially making sure the data are in the correct format. The second phase is the actual comparison between the different methods. Both phases can be divided into several smaller steps. This thesis focuses on these two phases, since they are the ones relevant for the econometric models.

4.1 The first phase

After the preparation, which consists of cleaning the data thoroughly and making sure that there are no obvious mistakes, the actual process can be started. When estimating the models, it has to be made sure that only the data applying to the specific lane are loaded for that model. For the estimation of all these models, the data have to be divided into their corresponding groups (i.e. lanes), and subsequently the models are estimated for the methods mentioned above.
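A minimal sketch of this grouping step follows, assuming the cleaned data frame `clean` from the earlier sketch and the column names of chapter 2.

```r
# Split the cleaned data into lanes: one group per origin city, destination
# city and transport mode; each group can then be handed to the estimation routines.
lanes <- split(clean, list(clean$Orig_city, clean$Dest_city, clean$Mode), drop = TRUE)

length(lanes)               # number of lanes found in this data set
head(sapply(lanes, nrow))   # observations per lane, first few lanes
```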

The first phase consists of estimating every model for each method. Each lane is estimated on city-to-city values if possible, but if there are too few observations country-to-country values have to be used. To make sure either of them is available, both are estimated as distinct lanes. Later on it can be tested whether these two models give widely different results, and a recommendation can be made to use one of the two, so that the company can subsequently make its own choice which one to use. The country-level model could be used if the country is small enough that its transport hubs can be seen as one. Considering that transport is possible with three transport methods (via air, by road, or by ship), it is important to note that transport lanes for the three methods are considered different lanes. The cities that are used as starting points or end points of lanes are main transport hubs. Furthermore, not all city-to-city combinations can be served by all three methods. For example, New York to Tel Aviv is not available via road. Of course, these three methods can


be grouped afterwards when summarizing and presenting the results for convenience, but not when making forecasts.

The estimation is done in two separate manners: first, models are estimated on the full dataset of the lane, and in the next phase these can be analyzed using cross validation. The other manner is to take part of the data set, estimate the models on that part, and subsequently predict the values for the other part of the data set and compare the predictions with the observed values. Both are explained in the next section.

4.2 The second phase

The second phase consists of analyzing the quality of the predictions based on the estimations. For this phase the stored estimated model results and their parameters are used. The results are assessed in two different ways: a cross validation to compare all the models is carried out, and a comparison of the prediction errors is made.

As a well-known and often used measure of the quality of the estimated model, the mean squared error (MSE) could be calculated. This can be done for the full model of each lane, but also for the regions between the breakpoints for the MARS and Cost+ models, using the following formula:

$$MSE_R = \frac{1}{n_R} \sum_{i \in R} (y_i - \hat{y}_i)^2$$

Here $n_R$ is the number of observations in region 𝑅, $y_i$ is the dependent variable and $\hat{y}_i$ is the fitted value. With this measure it is possible to rank the methods for each lane. It is also possible to make this comparison when the MSE is calculated only for the regions between the chosen breakpoints for each model. Besides counting the best method over all lanes as if they are equivalent, it is also possible to distinguish between lanes with different transport methods or revenue types (standard or express), to see whether there are differences in model choice.
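A minimal sketch of this per-region calculation follows; the region boundaries mirror the Cost+ breakpoints, and the object names in the usage comment are the hypothetical ones from chapter 3.

```r
# MSE per region: given observed and fitted revenue plus the weights,
# compute MSE_R for every region between the breakpoints.
mse_by_region <- function(y, y_hat, weight,
                          breaks = c(0, 50, 500, 1000, 5000, Inf)) {
  region <- cut(weight, breaks = breaks, right = FALSE)   # [0,50), [50,500), ...
  tapply((y - y_hat)^2, region, mean)                     # mean squared error per region
}

# e.g. with the OLS fit from section 3.1:
# mse_by_region(lane_data$REV, fitted(ols_fit), lane_data$gad_weight)
```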

Friedman (1991) uses a criterion based on the MSE to compare the quality of the fit of the models. The models for the full lane data set will be compared using a so-called generalized cross validation (GCV) criterion that is based on the method proposed by Golub, Heath & Wahba (1979). The latter define the GCV minimizer as:

$$GCV(\lambda) = \frac{\frac{1}{n} \left\| \left(I - A(\lambda)\right) y \right\|^2}{\left( \frac{1}{n} \operatorname{tr}\left(I - A(\lambda)\right) \right)^2}$$

with $A(\lambda) = X (X^T X + n \lambda I)^{-1} X^T$. Here 𝑋 is the matrix of explanatory variables and 𝑦 the vector of the dependent variable. The criterion in this form can be used for ridge regressions, but when the parameter 𝜆 is set to 0, it can be used for OLS regressions. Friedman (1991)


makes use of an adapted version of this GCV criterion to rank different models. It basically comes down to the mean squared error (MSE) divided by a correction term. He writes the criterion as:

$$GCV(M) = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\left( 1 - \frac{C(M)}{n} \right)^2}$$

where $C(M) = \operatorname{tr}\left(B (B^T B)^{-1} B^T\right) + 1$, with 𝐵 the (𝑀 × 𝑛) matrix of the 𝑀 basis functions and 𝑛 observations. According to Friedman, $C(M)$ is equal to the number of linearly independent basis functions, and hence $C(M)$ can be regarded as the number of parameters being fit. The GCV criterion specified by Friedman will be used to compare the quality of the fit of the models. Although this criterion is used to rate the fit of the models, it does have a correction for the number of parameters. In principle it is expected that MARS always produces the best fit, since Cost+ can be seen as a simplified version of it, and OLS as an even more simplified version of that. Also, MARS can take into account some of the non-linearities in the data. However, since both MARS and Cost+ can have more parameters than OLS due to the cross terms, it could be that another method than MARS has the lowest GCV criterion value. In addition, a different method is used to evaluate the quality of the predictions of the models. This time the models are all estimated on only part of the dataset. The estimated model is used to predict the values for the holdout sample, the other part of the data set. These predictions of the revenue are compared to the observed values in the data set. This comparison is evaluated by calculating the more often used mean squared prediction error (MSPE), computed in the following way:

$$MSPE_R = \frac{1}{n_{R(p)}} \sum_{i \in R(p)} (y_i - \hat{y}_i)^2$$

Here 𝑅 denotes the region used, and 𝑝 indicates the part of the dataset not used for estimating the model but used for comparison with the prediction; thus $n_{R(p)}$ denotes the number of observations in region 𝑅 that are not used in the estimation. Again, $y_i$ denotes the dependent variable and $\hat{y}_i$ the predicted value. Like with the GCV, this can be calculated for the full lanes, but it can also be split into different regions to see which model performs best in each region between the break points. Subsequently the methods can be ranked to tell which one performs best in predicting. The formula and method can be found for example in Heij et al. (2004, p. 280). For prediction it is in principle expected that MARS always outperforms Cost+ and OLS, for the same reason as above. However, since the prediction is compared with a different part of the data set, it could be that the other methods have a lower MSPE. So for this reason OLS or Cost+ could be preferred over MARS.
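A minimal sketch of this evaluation for one lane follows: a random one-third holdout (an assumption; the thesis uses roughly a 66.6%/33.3% split), the MSPE on the holdout, and Friedman's GCV computed from the in-sample errors and the number of fitted parameters. A simple OLS on weight stands in for any of the three methods, and `lane_data` is again a hypothetical object name.

```r
# Holdout split, MSPE on the holdout, and GCV on the training fit.
set.seed(1)
idx   <- sample(nrow(lane_data), size = round(nrow(lane_data) / 3))
train <- lane_data[-idx, ]
test  <- lane_data[idx, ]

fit  <- lm(REV ~ gad_weight, data = train)
mspe <- mean((test$REV - predict(fit, newdata = test))^2)

gcv <- function(y, y_hat, c_m) {
  n <- length(y)
  mean((y - y_hat)^2) / (1 - c_m / n)^2        # GCV(M) as defined above
}
gcv(train$REV, fitted(fit), c_m = length(coef(fit)))
```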


5. RESULTS & ANALYSIS

In this section we look at the results of the process of implementing the previously discussed models. The analysis in section 5.5 will explain why some results do not correspond entirely with the expectations.

5.1 Example lane

First we continue with the same example lane as in section 1.3 (JFK to Tel Aviv), to illustrate the results of the model estimation. For this lane, after estimating the three models mentioned above, the GCV criterion indicates that MARS is the best method for this lane (the GCV for MARS is 215336.6, Cost+ has 245353.2, and OLS has 217575.2). It is also interesting to see which variables the MARS algorithm picks in this case. Here the model has the form

$$y = \alpha + \beta_1 \max(0, c_1 - x_3) + \beta_2 \max(0, x_3 - c_1)\, d(x_4)_1 + \beta_3 \max(0, x_3 - c_1)\, d(x_4)_2 + \beta_4 \max(0, x_3 - c_2) + \beta_5\, d(x_4)_3$$

where $d(x_i)_j$ is a dummy for value 𝑗 of variable 𝑖, $x_3$ stands for the gad weight and $x_4$ for the incoterm. With the coefficients filled in it is

$$y = 2192.2 - 1.9\,(1104 - x_3)_+ + 1.4\,(x_3 - 1104)_+\, x_4(EXW) + 1.5\,(x_3 - 1104)_+\, x_4(CIP) + 0.9\,(x_3 - 1910)_+ - 251.6\, x_4(CPT)$$

Note that not all variables are used, which means that the MARS algorithm concluded that those variables are not important for this model. To visualize how the break point affects the relationship, we can take part of the formula, ignoring the dummy variables and cross terms for a moment, and only consider $y = \alpha + \beta_1 \max(0, c_1 - x_3) + \beta_4 \max(0, x_3 - c_2)$; this relation can then be depicted in a graph. In Figure 3 below, the MARS relationship is drawn in orange, with the dashed red line indicating the break at 1104.


5.2 Most recent data set

Now the results for the full data sets will be discussed. We start with the most recent data set and subsequently present the results for the older data sets. Data set number 1, the most recent one, consists of 4876 lanes, for which the models have to be estimated. Most of the models will be estimated both with standard revenue and with express revenue as dependent variable, if the required variables are available. Due to the technical limitation that the two types of revenue are not paired in the data set (each is seen as a separate lane), it cannot be tested whether there is a significant difference between those two types of lanes. However, they will be compared by discussing the preferred methods for either.

Comparing the prediction with MSPE: Firstly, to assess which of the three methods is

best for predicting, the models are estimated with a subsample of 66.6% of the data, and subsequently the other 33.3% of the data are predicted using those models. Then, by calculating the MSPE, the different methods can be compared with each other on prediction. The partition has to be chosen such that there are still enough observations in lanes with few observations to carry out the estimation. Out of the 4876 lanes we find that according to the MSPE, overall MARS performs best in 41.2% of the cases and Cost+ in 27.9% of the cases, while OLS is best in 30.9%. If we distinguish between standard shipping and express shipping, we find that for standard shipment MARS is also the best, with 40.1% of the lanes, Cost+ with 28.1%, and OLS with 31.7%, while for express shipment MARS is the best in 42.4% of the cases, Cost+ in 27.9%, and OLS in 30.0%. The difference between the two types of revenue is very small and shows there is no clear difference between standard and express shipment.
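To make explicit how such percentages could be tallied, a small sketch follows; `mspe_table` and its column names are hypothetical, standing for a data frame with one row per lane and one MSPE column per method.

```r
# Per lane, the method with the lowest MSPE is counted as the winner.
winners <- apply(mspe_table[, c("OLS", "CostPlus", "MARS")], 1,
                 function(r) names(which.min(r)))

round(100 * prop.table(table(winners)), 1)   # share of lanes won by each method
```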

The data can also be split between the three travel modes, which gives similar results (see Table 5b below): for air the best model is MARS in 41.4% of the cases, Cost+ in 27.5%, and OLS in 31.1%; for ocean MARS in 41.3%, Cost+ in 28.9%, and OLS in 29.8%; for road MARS in 38.9%, OLS in 34.9%, and Cost+ in 26.2%. Note that the transport modes are not equally represented in the data set: air is by far the most common (air 62.4%, ocean 31.8%, and road 5.6%). For each of the three transport modes MARS clearly performs best overall. However, it is not universally the best for each lane, since a substantial number of lanes are better predicted with Cost+ or OLS. To find the reason for this, we can split the data into weight regions (i.e. the intervals between breakpoints) and see which model performs best in each region, again according to the lowest MSPE. The results are presented in Table 5 below.


MSPE data1 Full 0-50 50-500 500-1000 1000-5000 5000+

Cost+ 27.9% 31.2% 28.0% 25.7% 23.7% 22.1%

MARS 41.2% 45.4% 44.1% 47.5% 49.3% 50.8%

OLS 30.9% 23.4% 27.9% 26.8% 27.1% 27.1%

NA 0.0% 4.8% 33.6% 42.9% 46.2% 78.5%

Total 4876

Table 5: results for prediction with the most recent dataset

Standard: Express: Air: Ocean: Road:
Cost+ 28.1% 27.6% 27.5% 28.9% 26.2%
MARS 40.1% 42.4% 41.4% 41.3% 38.9%
OLS 31.7% 30.0% 31.1% 29.8% 34.9%

Table 5b

In Table 5, NA refers to non-available values, which occur when there are not enough observations in a particular region to estimate a model properly. This is especially the case for shipments with weights above 5000: these heavy shipments occur only sparsely. In the calculation of the percentages for Cost+, MARS, and OLS, the non-available estimates are ignored. Note that for the estimation of Cost+ the data are already split into these same regions, so within a region Cost+ can be seen as a separate OLS for those values; the OLS column refers to the model estimated on all data of a lane. Table 5 shows that MARS performs best in every region, and Table 5b that it performs best in every category. However, the relative frequency with which MARS predicts best is only slightly higher in the higher regions than in the first one. The same holds for OLS, although there the pattern is even less clear, so this has to be compared with the results of the other data sets to see whether the effect is clearer there. Perhaps most surprising, Cost+ performs worse in the higher regions, in contrast with the first two regions, where Cost+ is more often the best method than OLS.

Since the higher values are of particular importance, it is worth noting that MARS still outperforms the other two methods there. However, the last column of Table 5 shows that both OLS and Cost+ still perform relatively well in that region, so based on this data set we cannot definitively say that MARS will always be the best method for predicting the revenue of transport companies.

Comparing the fit with GCV: The next step is to investigate whether the results differ if we estimate the models on the full data set and compare the fit, using the GCV criterion based on the MSE. For the overall result for the 4876 lanes we find that MARS gives the best fit for 43.8% of the lanes, closely followed by Cost+ with 42.3%, and then OLS with 13.9%. Again MARS seems to be the best, but this time Cost+ is quite close, and OLS is still the preferred method for a considerable number of lanes. Splitting between standard and express revenue (see Table 6b) gives MARS: 43.4%, Cost+: 43.4%, and OLS: 13.3% for standard shipment, and MARS: 44.1%, Cost+: 41.3%, and OLS: 14.6% for express shipment; once again there does not seem to be a disparity between standard and express shipping revenue. Splitting on travel mode gives for air MARS: 41.6%, Cost+: 52.6%, OLS: 5.9%; for ocean MARS: 48.1%, Cost+: 23.8%, OLS: 28.1%; and for road MARS: 43.6%, Cost+: 33.5%, OLS: 22.9%. This differs from the prediction comparison, especially for Cost+ and OLS: MARS is no longer always leading, since Cost+ is preferred for transport via air, where OLS performs poorly. For the other two transport modes Cost+ performs considerably worse than MARS and is in both cases even surpassed by OLS, so Cost+ stands out mainly for the air lanes.

The Chow test is used to test whether the OLS and Cost+ models differ significantly. For 76.9% of the lanes the null hypothesis that the models have the same coefficients can be rejected; only the 52 lanes with more than 1000 observations are considered for this test. Hence for most of these lanes there is a clear difference between the two methods.
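For reference, a minimal two-regime version of the Chow statistic could look as follows. The actual test compares the pooled OLS model with the Cost+ segmentation over several regions, so the sketch below, with a single breakpoint and hypothetical variable names, is only meant to show the mechanics under those simplifying assumptions.

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of an OLS fit (intercept included via a column of ones)."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid), Xc.shape[1]

def chow_test(X, y, weight, breakpoint):
    """F-statistic for equal coefficients below and above a single weight breakpoint."""
    low, high = weight < breakpoint, weight >= breakpoint
    rss_pooled, k = rss(X, y)
    rss_low, _ = rss(X[low], y[low])
    rss_high, _ = rss(X[high], y[high])
    n = len(y)
    f_stat = ((rss_pooled - (rss_low + rss_high)) / k) / ((rss_low + rss_high) / (n - 2 * k))
    p_value = 1 - stats.f.cdf(f_stat, k, n - 2 * k)
    return f_stat, p_value
```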

Now, once we split the data according to weight categories and compute which model might be best in the regions between the breakpoints, we get the results presented in Table 6 below.

GCV data1 Full 0-50 50-500 500-1000 1000-5000 5000+
Cost+ 42.3% 48.4% 64.0% 42.0% 46.5% 23.5%
MARS 43.8% 36.4% 22.7% 43.3% 46.2% 69.9%
OLS 13.9% 15.2% 13.3% 14.7% 7.3% 6.6%
NA 0.0% 5.6% 33.4% 40.2% 43.5% 75.4%
Total 4876

Table 6: results for fitting with the most recent dataset

Standard: Express: Air: Ocean: Road:
Cost+ 43.3% 41.3% 52.6% 23.8% 33.5%
MARS 43.4% 44.1% 41.6% 48.1% 43.6%
OLS 13.3% 14.6% 5.9% 28.1% 22.9%

Table 6b

One very striking result is that in multiple regions between the breakpoints MARS is not the best method: Cost+ is better in the lower regions, and in the second region it wins by far. In the first region the score for Cost+ is also higher, but the share of MARS is still substantial. The second region thus breaks the general pattern that Cost+ works best in the lower regions and MARS in the higher ones, so it remains to be seen whether the results for the second region are comparable in the other data sets, or whether this extremely high value does not occur again.

Another notable result is that OLS is the optimal model for fitting the data for a lower percentage of lanes, which is in accordance with the expectation. It is remarkable that the share of lanes for which OLS is the best model seems to decrease towards the regions with larger weights. This could indicate that OLS indeed does not fit high values well when most observations are concentrated in the lower region. One might even expect OLS never to be the optimal method for fitting the data, yet Table 6 shows that the share of lanes for which OLS is best is larger than zero. This may be because with more parameters the denominator of the GCV formula decreases, so the GCV itself increases. Cost+ always has more parameters than OLS, and MARS sometimes has more (it differs per lane), which may explain why OLS sometimes still has the lowest GCV and therefore appears to be the best method for fitting.
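For reference, a common form of the GCV criterion that is consistent with this reasoning is given below; the exact effective-parameter count per method follows the definition used earlier in this thesis, so this is a sketch rather than the precise criterion applied. With $n$ observations, fitted values $\hat{y}_i$, and $p$ (effective) parameters:

$$\mathrm{GCV} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\left(1 - \tfrac{p}{n}\right)^2}.$$

A larger $p$ shrinks the denominator and inflates the criterion, which is why OLS, with the fewest parameters, can occasionally attain the lowest GCV even though it is the least flexible fit.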

5.3 Other data sets

It is interesting to see whether data sets from a different period give a different picture. The data sets are deliberately not combined, in order to be able to investigate whether there are any differences over time. Since the methods should work independently of the time period, the results should be comparable; if they are not, this may be due to different data gathering methods or to the fact that the sets stem from different months of the year. Since each data set may contain different lanes, it cannot be formally tested whether the results differ. The results for the other data sets are included in the appendix (Tables A1 to A6); a summary is shown in Table 7 below.

Dataset Method MARS Cost+ OLS NA Size

1 MSPE 41.2% 27.9% 30.9% 0.0% 4876
2 MSPE 41.6% 26.4% 31.9% 0.1% 5431
3 MSPE 41.7% 27.7% 30.7% 0.1% 5890
4 MSPE 40.6% 27.6% 31.7% 0.1% 6201
1 GCV 43.8% 42.3% 13.9% 0.0% 4876
2 GCV 41.8% 45.0% 13.2% 0.0% 5431
3 GCV 42.4% 45.3% 12.3% 0.0% 5892
4 GCV 41.8% 45.1% 13.1% 0.0% 6204

Table 7: summary of the results for all data sets


The results in the summary table indicate that in every data set MARS seems to be the best method for prediction. However, the other methods do well too, and when we distinguish regions, other methods also come out best for some regions, so it is hard to pinpoint a single method as the best. The table does not show much difference between the data sets. It is worth pointing out that every data set produced virtually no non-available results for the methods. For fitting the data, a notable result is that the GCV criterion ranks the Cost+ method much higher than the MSPE does, which on average prefers OLS over Cost+ for predicting. Moreover, the share of Cost+ is sometimes even higher than that of MARS, and OLS is the best method for a substantial part of the lanes. This only compares the methods with each other and does not tell us much about the absolute quality of the estimation methods. The scores of Cost+ and OLS seem to be affected considerably by the choice of evaluation criterion, while MARS is the preferred method for both fitting and predicting in almost half of the cases.

5.4 Changes

It is also interesting to look at what happens when one or more changes are made to the models. In this section the following changes are explored: using different breakpoints in Cost+, changing the number of cross-terms the MARS algorithm is allowed to use, and changing the specification of the OLS model by making it a polynomial model or by taking the logarithm of the dependent or explanatory variables.

Different breakpoints: For Cost+ an important change would be to pick different breakpoints. Since out of all the changes this is the most straightforward one, it was chosen to carry out the same procedure again with new breakpoints. As indicated above, all data sets give similar results, so the procedure is only carried out for one data set. The new breakpoints used for estimating Cost+ are p = (0, 100, 1000, 10000) instead of p = (0, 50, 500, 1000, 5000). (These are not the same breakpoints used to divide the results into regions.) The choice is again somewhat arbitrary, but this time the breakpoints grow on a logarithmic scale; it is assumed that most observations are covered below 10000, so everything above that forms the region for the very high values. A minimal sketch of such a segmented fit is given below the results table. Using the new breakpoints the results are:

MSPE data1 Full 0-50 50-500 500-1000 1000-5000 5000+

Cost+ (new b.p.) 29.3% 30.2% 29.5% 26.0% 27.1% 22.6%

MARS 41.4% 47.3% 44.3% 48.3% 48.0% 51.2%

OLS 29.3% 22.5% 26.2% 25.7% 25.0% 26.3%

NA 0.0% 4.4% 33.8% 42.9% 46.4% 78.9%

Total 4876
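As referenced above, a minimal sketch of a Cost+-style segmented fit with the new breakpoints: a separate OLS model is estimated within each weight region, and observations above the last breakpoint form their own region. Function and variable names are illustrative assumptions, and regions with too few observations are skipped, mirroring the NA entries in the tables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

BREAKPOINTS = [0, 100, 1000, 10000]  # new Cost+ breakpoints; weights above 10000 form the top region

def fit_segmented(weight, y):
    """Fit a separate OLS of revenue on weight within each region defined by the breakpoints."""
    edges = BREAKPOINTS + [np.inf]
    models = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (weight >= lo) & (weight < hi)
        if mask.sum() >= 2:  # skip regions with too few observations
            models[(lo, hi)] = LinearRegression().fit(weight[mask].reshape(-1, 1), y[mask])
    return models

def predict_segmented(models, weight):
    """Predict each observation with the model of the region its weight falls into."""
    pred = np.full(len(weight), np.nan)
    for (lo, hi), model in models.items():
        mask = (weight >= lo) & (weight < hi)
        pred[mask] = model.predict(weight[mask].reshape(-1, 1))
    return pred
```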
