
Robust Approximation Models for the Vehicle Routing Problem with Time Windows

M.A. (Jari) Meevis


February 28, 2020

Abstract

The Vehicle Routing Problem with Time Windows (VRPTW) can be described as the problem of finding a set of routes of minimum length from a depot to a set of geographically scattered customers, such that each customer's demand is satisfied and each customer is serviced within its time window. In practice, it is not always necessary to know each route precisely, as long as the total length of the optimal solution can be approximated. Current approximation models perform well on particular sets of VRPTW instances but struggle to produce reliable approximations outside these sets. In this paper, we test the robustness of current route length approximation models and propose three new and more robust regression models for approximating the route length of arbitrary VRPTW instances. Additionally, we show that it is possible to incorporate time window constraints into regression terms related to inter-customer distances and that these terms are useful for creating robust approximation models.

1 Introduction

The Vehicle Routing Problem with Time Windows (VRPTW) is concerned with finding a set of routes with minimum total length from a depot to a set of known customer locations, such that each customer's demand is satisfied and each customer is serviced within the applicable time window. However, many decisions may need to be taken before it is known which customers to visit on any particular day. In such cases, a method that can determine the expected total route length of the VRPTW can be beneficial.


1984; Figliozzi, 2009; Nicola et al., 2019). To properly position a depot, we need not minimize the route length on any particular day, but rather need to minimize the expected route length over all days. A typical algorithm that aims at determining a good depot placement repeatedly calculates this expected route length for various possible depot placements. Hence, a method to determine the expected route length preferably also has very short calculation times.

The expected route length typically cannot be determined exactly. Only Dijkstra (2019) presents a method for determining the exact expected route length, for a special case of the Travelling Salesman Problem (TSP) in warehouses. Other approaches aim at approximating the expected route length. The approximation literature for the TSP with uniformly distributed customers follows two tracks: a mathematical approach using limit theory (i.e., Beardwood et al., 1959), sometimes enhanced with empirically estimated parameters (Christofides and Eilon, 1969); and an empirical approach using regression analysis (e.g., Chien, 1992; Kwon et al., 1995). When customers are not uniformly distributed, the mathematical approach no longer yields results and only the empirical track persists (e.g., Hindle and Worthington, 2004; Çavdar and Sokol, 2015; Nicola et al., 2019).

Similarly, the literature on approximation methods for the expected route length of the Vehicle Routing Problem (VRP) shows both analytical (e.g., Daganzo, 1984) and empirical approaches (e.g., Webb, 1968). Notably, the assumption that customers are uniformly distributed is absent in both the analytical and empirical tracks.

The addition of time windows to the VRP, yielding the VRPTW, displays the same two approaches to route length approximation. Figliozzi (2009) establishes the analytical approach, while the empirical approach is used by Figliozzi (2008) and Nicola et al. (2019). Interestingly, the empirical track focuses on only one set of benchmark instances (see Section 2.2.1 for details). The question therefore arises to what extent these approximations are robust, i.e., whether they apply to instances other than the exact instances evaluated in the literature.

In this paper, we aim to add to the literature on approximating the expected VRPTW route length by creating several new robust approximation models. We ensure that our approximation models perform well on VRPTW instances with different characteristics and that the models only use information known before solving the instances, i.e., information that is typically available at the time of deploying the models: information about the customers (e.g., locations, time windows), the depot and the fleet of vehicles.


the robustness of our models by providing performance measures that show that the models perform well both on similar and dissimilar instances.

This is different from the existing literature, which considers only one set of benchmark instances (e.g., Figliozzi, 2008; Figliozzi, 2009; Nicola et al., 2019), or considers different problems where the number of stops (Daganzo, 1984; Daganzo, 1987) or the number of routes (Figliozzi, 2009) is known.

The performance of the approximation models reported by Figliozzi (2008), Figliozzi (2009), and Nicola et al. (2019) is untested for instances outside the benchmark set they use for fitting. Our models will be tested on different datasets containing more instances with different characteristics. This allows us to create a more robust model and, more importantly, to validate that the models are indeed robust.

The models reported by Daganzo (1984) and Figliozzi (2008) use information that is not usually available at the time of deployment of the models. Daganzo (1984) considers a problem in which the maximum number of stops is known (and constant), while Figliozzi (2008) uses the number of routes in the (optimal) solution.

Our models only use information known before solving the instances, i.e., information about the customers, the depot and the fleet of vehicles. Our main contribution to the literature is thus to provide robust approximations for VRPTW instances with a large variety of different characteristics, using only information about the customers, the depot and the fleet of vehicles (and not about the routes themselves). This allows our models to be used without (partially) solving the instances that are being approximated. Additionally, we use our three regression models to show that it is possible to incorporate time window constraints directly into terms relating to distances between customers and that these terms can be used to create a robust model. The initial seed for this consolidation was planted by Nicola et al. (2019), but no analysis was provided.

In line with most VRPTW literature (e.g., Figliozzi, 2009; Nagata and Bräysy, 2009; Nagata, Bräysy, and Dullaert, 2010; Nicola et al., 2019), we consider the objective of the VRPTW to be hierarchical. That is, we first minimize the number of routes and then minimize the total route length.

The remainder of this paper is structured as follows: in Section 2, we review existing literature concerning approximation models for the TSP and the VRP. Section 3 details our methodology. Section 4 identifies issues with existing models in the literature and derives insights from them, leading us to propose three new models in Section 5 and to present our results in Section 6. Finally, Section 7 concludes the paper and provides some possible directions for future research.

2 Literature Review


without time windows. The terms used in VRP and VRPTW approximation are often derived from analytical insights into the simpler TSP. Additionally, similarities between the TSP, VRP, and VRPTW give us insights into different aspects of these problems.

2.1 Travelling Salesman Problem

The first and most well-known route length approximation for the TSP is provided by Beardwood et al. (1959). Their analytical result proves that the length of a TSP tour through n independent and uniformly distributed customers in a compact and convex area of size A asymptotically converges to c√(nA) as n → ∞, with c some constant. This constant c depends on the norm used to determine distances and satisfies 0.625 ≤ c ≤ 0.922 for the Euclidean norm (Beardwood et al., 1959). A slight improvement on these bounds is achieved by Steinerberger (2015). Christofides and Eilon (1969) enhance the approximation by augmenting it with an empirically estimated parameter.
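This limiting behaviour can be checked numerically. The sketch below is our own illustration (not from the paper): it uses a greedy nearest-neighbour tour as a cheap stand-in for the optimal TSP tour, so the printed ratio settles near a constant somewhat above the 0.625–0.922 range that holds for optimal tours:

```python
import math
import random

def nn_tour_length(points):
    """Length of a closed greedy nearest-neighbour tour through `points`."""
    unvisited = list(points[1:])
    current, length = points[0], 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda p: math.dist(current, p))
        length += math.dist(current, nxt)
        unvisited.remove(nxt)
        current = nxt
    return length + math.dist(current, points[0])  # close the tour

random.seed(0)
side = 100.0
A = side * side  # customers uniform in a 100 x 100 square
for n in (100, 400, 1600):
    pts = [(random.uniform(0, side), random.uniform(0, side)) for _ in range(n)]
    print(f"n={n:5d}  length/sqrt(nA) = {nn_tour_length(pts) / math.sqrt(n * A):.3f}")
```

As n grows the ratio stabilizes, which is exactly the c√(nA) behaviour described above.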

The first regression-based approximation for uniformly distributed customers is provided by Chien (1992). Their model is shown in Equation 2.1.

TSP(n) = 2.10 D̄ + 0.67 √((n − 1) A₀)    (2.1)

Here, D̄ is the average distance between a customer and the depot, A₀ is the size of the smallest area that covers only the customers, and n is the number of customers (including the depot). Their approximation splits the total TSP tour length into two components: the customer-depot distances and the inter-customer distances. To account for the possibility that the depot is located far away from the customers, they modify the Beardwood et al. (1959) term to exclude the depot (i.e., one less customer and a differently sized area). The term √((n − 1) A₀) thus measures the inter-customer distances only, while the term D̄ measures the customer-depot distance.
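Equation 2.1 needs only the customer and depot coordinates, so it can be evaluated directly. A minimal sketch under our own reading of the terms (we take A₀ to be the smallest axis-aligned rectangle enclosing the customers, and n to count the depot, as in the text):

```python
import math

def chien_tsp_estimate(depot, customers):
    """Chien (1992), Eq. 2.1: TSP(n) ≈ 2.10 * Dbar + 0.67 * sqrt((n - 1) * A0)."""
    n = len(customers) + 1  # n includes the depot
    d_bar = sum(math.dist(depot, c) for c in customers) / len(customers)
    xs = [x for x, _ in customers]
    ys = [y for _, y in customers]
    a0 = (max(xs) - min(xs)) * (max(ys) - min(ys))  # bounding rectangle, depot excluded
    return 2.10 * d_bar + 0.67 * math.sqrt((n - 1) * a0)
```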

Kwon et al. (1995) augment this model by making the coefficient of the √(nA) term dependent on the number of customers n and the shape of the area S, where S measures the length-width ratio of the area. In contrast with Chien (1992), they place the depot on the corners of or inside the area where the customers are located, so they use the original term (i.e., √(nA)). Their model is shown in Equation 2.2.

TSP(n) = 0.4147 D̄ + (0.7754 − 0.0008 n + 0.9027 S/n) √(nA)    (2.2)
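Equation 2.2 can be sketched the same way. Whether n counts the depot and how the enclosing rectangle is chosen are not spelled out above, so the choices below (n = number of customers, axis-aligned bounding rectangle for A) are our assumptions:

```python
import math

def kwon_tsp_estimate(depot, customers):
    """Kwon et al. (1995), Eq. 2.2:
    TSP(n) ≈ 0.4147 * Dbar + (0.7754 - 0.0008 n + 0.9027 S/n) * sqrt(nA)."""
    n = len(customers)
    d_bar = sum(math.dist(depot, c) for c in customers) / n
    xs = [x for x, _ in customers]
    ys = [y for _, y in customers]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    s = max(width, height) / min(width, height)  # length-width ratio of the rectangle
    area = width * height
    return 0.4147 * d_bar + (0.7754 - 0.0008 * n + 0.9027 * s / n) * math.sqrt(n * area)
```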

The results above indicate that the total TSP route length can be split into two components. The first component is the customer-depot distance (measured by D̄). The second component is the travel distance between the customers (measured by √(nA) or a similar term). These components are also very important for VRP(TW) approximations, and similar terms will be used in our approximation models.


up of smaller square subareas. Each of these subareas can have a different customer intensity. To account for the fact that some subareas might contain no customers, they use the perimeter of the convex hull as a term. This term is more closely related to the relevant area of the instance than the size A or A₀. Secondly, to account for the fact that each subarea can have a different customer intensity, they include a measure of customer isolation. This term allows them to assess the impact of these different customer intensities. In contrast with previous work, they include an intercept in their regression model. Most other authors (e.g., Chien, 1992; Kwon et al., 1995; Nicola et al., 2019) note that a tour through zero customers has length zero, and thus do not include an intercept. While we will not use the terms of Hindle and Worthington (2004) directly, we note that identifying customer isolation (and clustering) and measuring the useful area of the instance can lead to a more accurate and more robust model.

A second approximation for non-uniformly distributed customers is provided by Çavdar and Sokol (2015), who propose an estimation based solely on the coordinates of the customers. The strength of their approximation lies in the fact that it approximates route length for customers that are located in different spatial patterns. For example, their model performs well on instances where the customer intensity is biased or skewed in various ways. The main terms used in their model are based on standard deviations of customer coordinates and the size of the instance (measured similarly to Chien (1992) and Kwon et al. (1995)). They show that using customer coordinates and related standard deviations to measure spatial differences can be used instead of the more direct isolation measure used by Hindle and Worthington (2004).

Finally, Nicola et al. (2019) employ a stepwise regression approach to identify variables that drive TSP route length. They start with a large list of terms and select the most promising (based solely on significance) to determine their model. They show that terms that are hard to interpret, or that measure multiple characteristics simultaneously, can be useful for creating an accurate model. For example, they introduce terms related to the distances between all neighbours, terms related to each customer's nearest and furthest neighbour, and terms related to the minimum and maximum of such terms.

Drawing inspiration from Çavdar and Sokol (2015) and Nicola et al. (2019), we will use terms relating to inter-customer distances and customers' nearest neighbours to incorporate constraints that are hard to quantify otherwise, such as the spatial distribution of the customers. Additionally, we note that it might be possible to use some of these terms as a replacement for √(nA) when the shape of the area of size A is not compact and convex.

2.2 Vehicle Routing Problem (with Time Windows)


The first analytical approach to route length estimation is provided by Daganzo (1984), who proposes a method to manually create tours for irregularly shaped zones with non-uniformly distributed customers and a maximum number of C stops per vehicle. Based on this method and insights from the TSP, he provides a straightforward approximation of the total tour length. This approximation is shown in Equation 2.3.

VRP(n) = 2 D̄ n/C + 0.57 √(nA)    (2.3)
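Equation 2.3 can likewise be evaluated from instance data alone once the maximum number of stops C is known. A minimal sketch (reading A as the bounding rectangle of the customers is our assumption):

```python
import math

def daganzo_vrp_estimate(depot, customers, stops_per_vehicle):
    """Daganzo (1984), Eq. 2.3: VRP(n) ≈ 2 * Dbar * n / C + 0.57 * sqrt(nA)."""
    n = len(customers)
    d_bar = sum(math.dist(depot, c) for c in customers) / n
    xs = [x for x, _ in customers]
    ys = [y for _, y in customers]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))  # assumed bounding rectangle
    return 2.0 * d_bar * n / stops_per_vehicle + 0.57 * math.sqrt(n * area)
```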

This approximation is very similar to those used by Chien (1992) and Kwon et al. (1995) for the TSP, but with some differences to account for the increased number of routes. The first difference is that the component measuring the travel distance between the depot and the customers (D̄) has an analytical coefficient (2, for driving to and from the depot). The second difference is that the same travel distance is multiplied by the number of vehicles that are expected to travel to and from the depot (i.e., n/C). The inter-customer travel distance term is identical to the term derived by Beardwood et al. (1959).

Robusté et al. (2004) propose adjustments to this model for elliptical areas, based on the shape of the ellipse. This term is similar to the length-width ratio used by Kwon et al. (1995) for approximating the TSP tour length. This indicates that models can be enhanced by adjusting for the spatial distribution of the customers, which we also saw for the TSP (both by using a term that measures shape directly and by using terms based on the customer locations).

The first analytical approach for the VRPTW was provided by Figliozzi (2009), who estimated the change in route length due to the existence of time windows. The author showed that the total demand and the vehicle capacity can together be used to approximate the number of routes.

Figliozzi (2008) creates a model that incorporates time windows by using the number of routes m directly in the route length approximation. Their best model is shown in Equation 2.4.

VRPTW(n) = 2 m D̄ + 1.45 ((n − m)/n) √(nA)    (2.4)
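Given the number of routes m, Equation 2.4 is equally mechanical to evaluate. A sketch under the same bounding-rectangle assumption for A as above:

```python
import math

def figliozzi_vrptw_estimate(depot, customers, m):
    """Figliozzi (2008), Eq. 2.4: VRPTW(n) ≈ 2 m Dbar + 1.45 ((n - m)/n) sqrt(nA).
    Note that m, the number of routes in the solution, must be known in advance."""
    n = len(customers)
    d_bar = sum(math.dist(depot, c) for c in customers) / n
    xs = [x for x, _ in customers]
    ys = [y for _, y in customers]
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))  # assumed bounding rectangle
    return 2.0 * m * d_bar + 1.45 * ((n - m) / n) * math.sqrt(n * area)
```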

They show that it is not necessary to use terms that directly measure time windows, as long as some of the terms in the model are related to or impacted by the existence of time windows. In this model, that term is the number of routes m. A clear drawback is that the number of routes needs to be known, which is not always possible in practice. More importantly, we again see that the total route length can be dissected into terms related to customer-depot travel distances and inter-customer travel distances, as we also saw in Daganzo (1984) and (for the TSP) in Chien (1992) and Kwon et al. (1995).


Figliozzi (2009) on an identical instance set. Their main contribution to the literature is incorporating time window constraints into the terms relating to inter-customer distances. They do this by only counting an inter-customer distance if one customer can feasibly be visited after the other. Note that this is similar to how Figliozzi (2009) used the number of routes as a term in the model, which allows the model to incorporate the complex time window constraints into other terms. For our models, this insight shows us that it is possible to fold time-window-related information into other, more straightforward terms.

2.2.1 Benchmarks

This section details the benchmarks used in recent VRPTW literature.

The benchmark used by Figliozzi (2009) and Nicola et al. (2019) is the Solomon benchmark (Solomon, 1987). The Solomon benchmark has a hierarchical objective of minimizing first the number of vehicles and second the route length. The instances and optimal solutions can be found at http://web.cba.neu.edu/~msolomon/home.htm. Additional solutions can be found in Baldacci et al. (2011).

The Solomon benchmark contains 166 instances with 25, 50 or 100 customers, where the smaller instances consist of the first 25 or 50 customers of the 100-customer instances. The benchmark is separated into 6 different types: R1, R2, RC1, RC2, C1, and C2. The R types contain dispersed customers, the C types contain clustered customers and the RC types contain a mix of both. The type 1 instances have a short planning horizon, while the type 2 instances have a long horizon. The planning horizon is measured as the difference between the earliest possible departure time (from the depot) and the latest possible arrival time (at the depot).

A similar benchmark with identical instance types was created by Gehring and Homberger (1999). The G&H benchmark contains 300 instances consisting of 200, 400, 600, 800 or 1000 customers. The instances and related papers can be found at http://www.vrp-rep.org/variants/item/vrptw.html. This benchmark has not been used in approximation literature.

All recent experimental results for VRPTW approximations are fitted and tested only on the Solomon benchmark (i.e., Figliozzi, 2009; Nicola et al., 2019). The authors provide different models for different instance types: Figliozzi (2009) provides a different model for each of the 6 instance types, while Nicola et al. (2019) merge the type 1 and 2 instances by including variables that account for their differences. We deliberately refrain from creating separate models for different instance types, as we believe that any robust model should be able to handle different types of instances.

3 Methodology


used to approximate the route length of a VRPTW instance.

To accurately fit our regression models, we require a large number of solved VRPTW instances with known (near-)optimal route lengths. We find the desired route lengths by applying the Adaptive Large Neighbourhood Search (ALNS) meta-heuristic. The choice of this algorithm and its implementation are detailed in Section 3.1.

To achieve the desired level of robustness, the VRPTW instances used for fitting the regression models must exhibit a large variety of characteristics. We use a mix of instances from the literature and instances of our own design to achieve the desired level of variety. Section 3.2 describes how we create a dataset with both a large number of instances and a large variety of characteristics.

To determine the terms of our regression models, we draw inspiration from Nicola et al. (2019). We start with a large list of possible terms and create our models by combining them in various ways. In contrast with their approach, we do not focus solely on significance but also use insights from the literature to determine our definitive models. The terms we have considered are shown in Section 3.3. Not all terms are used in our models.

Finally, to verify that our models are robust, we employ various performance measures: several from the literature, and one of our own design that measures the worst approximation in a set of instances. Section 3.4 describes the performance measures.
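Section 3.4 is not reproduced in this excerpt, so the exact measures are unknown to us; the sketch below shows two plausible stand-ins: the mean absolute percentage error widely used in this literature, and a worst-case variant in the spirit of the "worst approximation in a set of instances" measure mentioned above:

```python
def mape(actual, predicted):
    """Mean absolute percentage error over a set of instances."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def worst_ape(actual, predicted):
    """Largest absolute percentage error, i.e. the single worst approximation."""
    return 100.0 * max(abs(a - p) / a for a, p in zip(actual, predicted))
```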

3.1 Solution Method

To accurately fit our regression models, we require a large number of solved VRPTW instances with a large variety of characteristics. Not all of these instances have been solved before, as some are of our own design (see Section 3.2 for details). To solve these instances quickly and to (near-)optimality, we have selected the Adaptive Large Neighbourhood Search (ALNS) meta-heuristic as our solution method.


of solutions. Additionally, the heuristic is still used successfully for similar problems (e.g., Liu et al., 2019; Avci and Avci, 2019; Sacramento et al., 2019). Our motivation for using ALNS as our solution method is thus three-fold: firstly, the heuristic is very robust and performs well on instances with different characteristics; secondly, the heuristic is very fast compared to other solution methods; thirdly, the heuristic has been shown to perform well on large instances, and optimally or near-optimally on smaller instances.

All of this indicates that ALNS is a good choice for obtaining near-optimal solutions to a large number of VRPTW instances with different characteristics. It is important to note that, for comparison with Figliozzi (2009) and Nicola et al. (2019), we have not solved the Solomon instances ourselves but have used the known exact solutions, as described in Section 2.2.1.

For information on how the algorithm was tuned, see Appendix A.

We do note that the use of a heuristic implies that the solutions found might not be optimal. While we have selected the ALNS algorithm based on its good performance on instances with known optimal solutions, there are no guarantees that it performs as well on new instances. This means that our models will approximate the route lengths of solutions that can be found using the ALNS algorithm. In a practical sense, this is often exactly the information that is required. Take, for example, the salesperson involved in a bidding process, mentioned at the start of Section 1. If their models approximate an optimal route length, but the actual routes created by their scheduling program are non-optimal, they might underestimate the route length and thus the associated costs. For them, it is better to approximate the route lengths that can be achieved with their scheduling application. We thus note that for most practical applications the models are appropriate, but that differences may exist between the estimated models when another algorithm is used to solve the VRPTW instances.

3.2 Instance Types

To achieve a regression model that performs well on instances of different types, the model must both be fitted to and tested on a large number of instances with a large variety of characteristics. To achieve this, we combine benchmarks from the literature with custom-designed instances. In Section 4 we show that using only one set of benchmark instances, the Solomon benchmark, is not enough to achieve a robust model.

The characteristics we consider important to vary are the following: the area and shape of the instances, the location of the depot, the number of customers, the size of the planning horizon, the size of the time window for each customer, the overlap of the time windows, the spatial distribution of the customers, the capacity of the vehicles, the service times of each customer, the demand of each customer and whether clustered customers share characteristics (such as time windows, service times and demand) or not.


(2009) and Nicola et al. (2019). The G&H instances have not been used in the VRPTW approximation literature. A major drawback of these benchmarks is that there is little variance in some characteristics of the instances. Consider the Solomon benchmark, where the number of customers takes only 3 different values (25, 50, 100), or where all 23 R-type instances with 100 customers have an identical average demand. This lack of variety can lead to a less robust approximation model. Similar patterns arise within the G&H benchmark.

To counteract this drawback, we design a new set of 1500 instances based on the Solomon benchmark instances. This dataset is called 'Subsetted Solomon' and is created by selecting random customers from the Solomon instances. This has a few advantages: firstly, we create slightly more variance than is available in the original benchmark (e.g., we have instances of all sizes between 25 and 100), which allows the regression analysis to achieve a better fit. Secondly, it allows us to remove variables that are only significant by chance, i.e., due to the limited number of distinct values. Finally, we create a larger dataset, as we now have 1666 instances derived from a benchmark. This, in turn, allows us to add more instances of our own design without the benchmark-derived instances being crowded out.
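The subsetting step itself is straightforward; a minimal sketch (the 25–100 size bounds mirror the Solomon instance sizes and are our assumption, as the text gives the procedure only in outline):

```python
import random

def subset_instance(customers, rng, min_n=25, max_n=100):
    """Create a 'Subsetted Solomon' instance by sampling a random number of
    customers from a parent instance; depot, fleet and planning horizon are
    inherited from the parent unchanged."""
    n = rng.randint(min_n, min(max_n, len(customers)))
    return rng.sample(customers, n)

rng = random.Random(42)
parent = list(range(100))            # stand-in for a 100-customer Solomon instance
child = subset_instance(parent, rng)
```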

Most of the Subsetted Solomon instances still share many similarities (e.g., the planning horizon is defined by the opening and closing times of the depot, which is unchanged for each instance). To counteract this, we create 2 separate datasets in which all the characteristics mentioned at the start of this section are randomly and uniformly selected. The first dataset contains 1491 'Clustered' instances, created using a Matérn Clustering Process. The second dataset contains 1536 non-clustered instances (called 'Uniform'), created using a simple random process. All other characteristics mentioned above are selected uniformly at random between upper and lower bounds, in such a way that all instances have a valid solution and that the variety is large, but still small enough that we can find patterns using regression analysis. This is necessary because letting all characteristics attain all possible values would lead to an extremely sparse 'parameter space', where most instances share almost no characteristics. This would make it impossible to extract patterns and approximations from them using regression analysis. We ensure a large enough variety by selecting upper and lower bounds that are usually within one order of magnitude of the upper and lower bounds found in the G&H benchmark. The exact implementation and the chosen bounds are detailed in Appendix B.
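A Matérn cluster process is simple to sample: Poisson-distributed parent points, each surrounded by a Poisson number of children placed uniformly in a disc. The parameter values below are illustrative assumptions, not the bounds from Appendix B:

```python
import math
import random

def poisson(rng, lam):
    """Sample a Poisson(lam) variate (Knuth's multiplication method)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def matern_cluster(rng, side=100.0, parent_rate=0.001, mean_children=10, radius=8.0):
    """Customer locations from a Matérn cluster process on a side x side square."""
    points = []
    for _ in range(poisson(rng, parent_rate * side * side)):
        px, py = rng.uniform(0, side), rng.uniform(0, side)
        for _ in range(poisson(rng, mean_children)):
            r = radius * math.sqrt(rng.random())   # uniform over the disc
            theta = rng.uniform(0, 2 * math.pi)
            points.append((px + r * math.cos(theta), py + r * math.sin(theta)))
    return points
```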

For ease of reporting, we have also created 2 datasets that are combinations of the previous datasets. The first contains the instances we created from scratch. This dataset is called 'Custom' and consists of the Clustered and Uniform datasets. It is the most dissimilar from the Solomon and Solomon-derived instances and can be used to test the robustness of models fitted to the Solomon dataset. The second dataset contains all instances (and is called 'All') and can be used to assess the overall performance of each model.


Table 1: A summary of the datasets used for performance testing of the models. The column ‘New’ indicates if the dataset was created by us.

Dataset           | Size | New | Description
Solomon           | 166  | No  | Contains the Solomon (1987) instances. For fitting and testing, we use the 162 instances with exact optimal solutions.
G&H               | 300  | No  | Contains the Gehring and Homberger (1999) instances. Solutions are found using the ALNS heuristic.
Subsetted Solomon | 1500 | Yes | Custom instances created by subsetting the Solomon benchmark. Solutions are found using the ALNS heuristic.
Clustered         | 1491 | Yes | Custom instances created using a Matérn Clustering Process. Solutions are found using the ALNS heuristic.
Uniform           | 1536 | Yes | Custom instances where the customers are located uniformly at random throughout the instance. Solutions are found using the ALNS heuristic.
Custom            | 3027 | Yes | A combination of the Clustered and Uniform datasets.
All               | 4989 | Yes | A combination of the Solomon, G&H, Subsetted Solomon, Clustered, and Uniform datasets.


3.3 Features

We provide an overview of the terms we considered for our regression models. This overview contains both terms previously considered in the literature and terms we created based on insights from it. Not all terms are used in our models. An elimination process based on insights from the literature, stepwise regression and out-of-sample performance was used to arrive at our final models.

The list of terms previously mentioned in the literature can be found in Table 2. The terms are ordered chronologically, by their first mention in TSP, VRP or VRPTW approximation literature (as far as we know).

Each term containing the word Reachable uses the modification of Nicola et al. (2019), where distances between customers are only measured if one customer can feasibly be visited after the other by the same vehicle. This is determined by their time windows and the distance between the customers. This is not necessarily a symmetric relationship: it might be possible that customer A can be visited after customer B, but not vice versa. For more details and a mathematical description, refer to Nicola et al. (2019).
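The reachability test itself reduces to a single inequality. The sketch below is our own paraphrase, not Nicola et al. (2019)'s exact formulation: we assume customer j is reachable after customer i when the earliest possible departure from i (ready time plus service time) plus the travel time still meets j's due time:

```python
def reachable(i, j, dist, ready, due, service):
    """True if customer j can feasibly be visited directly after customer i.
    Not symmetric: j may be reachable after i while i is not reachable after j."""
    return ready[i] + service[i] + dist(i, j) <= due[j]
```

For example, with a travel time of 20, a customer whose window opens at time 0 can precede one whose window closes at 60, but not the other way around if the second window only opens at 50 and the first closes at 10.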

Table 2: A summary of terms from literature considered for the regression models.

Term           | Source    | Description
n              | Beardwood | Number of requests
A              | Beardwood | Size of the compact and convex area containing the customers
√(nA)          | Beardwood | Limiting factor of travel distance for the TSP
D̄              | Chien     | Average customer-depot distance
A₀             | Chien     | Smallest rectangle containing all customers (excluding the depot)
√((n − 1) A₀)  | Chien     | Chien (1992)'s modification of Beardwood et al. (1959)'s limiting factor
S              | Kwon      | The length-width ratio of the rectangle wherein the customers are located
√(A/n)         | Kwon      | Unexplained term used by Kwon et al. (1995)
Q              | Figliozzi | Vehicle capacity
sumDemand      | Figliozzi | Total demand
⌈sumDemand/Q⌉  | Figliozzi | Expected number of routes based on capacity constraints
n/Q            | Nicola    | Customer-capacity ratio. Mentioned as being from previous literature in Nicola et al. (2019)
avgDemand      | Nicola    | Average demand


minCustDist          | Nicola | Smallest distance between two customers
maxCustDist          | Nicola | Largest distance between two customers
avgCustDist          | Nicola | Average distance between each pair of customers
varCustDist          | Nicola | Variance of distance between each pair of customers
varXVarY             | Nicola | Product of variance of x and y coordinates of customers
sumDepotDist         | Nicola | Sum of customer-depot distances
varDepotDist         | Nicola | Variance of customer-depot distances
sumTW                | Nicola | Sum of time window lengths
varTW                | Nicola | Variance of time window lengths
sumReachableCustDist | Nicola | Sum of distances between two reachable customers
minReachableCustDist | Nicola | Smallest distance between two reachable customers
maxReachableCustDist | Nicola | Largest distance between two reachable customers
avgReachableCustDist | Nicola | Average distance between each pair of reachable customers
varReachableCustDist | Nicola | Variance of distance between each pair of reachable customers
sumTWOverlap         | Nicola | Sum of time window overlaps between each pair of reachable customers
avgTWOverlap         | Nicola | Average of time window overlaps between each pair of reachable customers
varTWOverlap         | Nicola | Variance of time window overlaps between each pair of reachable customers

Note: Beardwood means Beardwood et al. (1959), Chien means Chien (1992), Kwon means Kwon et al. (1995), Figliozzi means Figliozzi (2009) and Nicola means Nicola et al. (2019).

The list of terms we designed can be found in Table 3. These terms are often inspired by or derived from the previously mentioned literature. Sometimes we replace an unknown value by a naive expectation (e.g., figsqrtAN and EDepTravel, where the number of routes is replaced by the expected number of routes based on capacity constraints), sometimes we change a standard deviation (from Çavdar and Sokol, 2015) to a variance (e.g., varX, varY, varCentralX, varCentralY), and sometimes we scale a value to make it less dependent on the magnitude (e.g., avgTWScaled, avgTWOverlapScaled).


The terms sumMinReachableCustDist and sumMaxReachableCustDist were defined by Nicola et al. (2019) for the TSP. It is unclear from their paper whether these terms were considered with or without taking the 'reachability' of customers into account when used in their VRPTW approximation.

The term ANNI has been used in the VRPTW literature (Mei, 2015). However, it has not been used as a regression term in approximation models.

Table 3: A summary of terms of our own design considered for the regression models.

Term                       Description

depotCentral               Location of the depot in the center of the area, or in the corner
sumX, sumY                 Sum of X, Y coordinates of customers
avgX, avgY                 Average of X, Y coordinates of customers
varX, varY                 Variance of X, Y coordinates of customers. Inspired by Çavdar and Sokol (2015)
maxX, maxY                 Maximum of X, Y coordinates
minX, minY                 Minimum of X, Y coordinates
maxTW                      Latest arrival time at depot. Equivalent to the planning horizon (since the depot always opens at time 0)
minTWOverlap               Minimum overlap in time windows across all pairs of customers
maxTWOverlap               Maximum overlap in time windows across all pairs of customers
varCentralX                Variance of distance from each customer to the central X axis. Inspired by Çavdar and Sokol (2015)
varCentralY                Variance of distance from each customer to the central Y axis. Inspired by Çavdar and Sokol (2015)
ANNI                       Average Nearest Neighbour Index, a measure of clustering
sumMinReachableCustDist    Sum of distance to the nearest reachable neighbour for each customer
sumMaxReachableCustDist    Sum of distance to the furthest reachable neighbour for each customer
EDepTravel                 Average customer-depot distance, multiplied by the expected number of routes based on capacity constraints (see Table 2)
figsqrtAN                  Beardwood et al. (1959)'s limiting factor multiplied by the term (n − m)/n from Figliozzi (2009), where instead of m we use the expected number of routes based on the capacity constraint (see Table 2)
avgTWOverlapScaled         The average overlap in time window per pair of customers, divided by maxTW
avgTWScaled                The average time window, divided by maxTW
sqrtSumCustDist            The square root of sumCustDist
sqrtSumReachableCustDist   The square root of sumReachableCustDist
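The ANNI term can be computed as follows. The paper does not spell out its definition, so this sketch assumes the standard one: the observed mean nearest-neighbour distance divided by the expected mean 0.5·√(A/n) under complete spatial randomness, where values below 1 indicate clustering and values above 1 indicate dispersion.

```python
import math

def anni(points, area):
    """Average Nearest Neighbour Index (assumed standard definition)."""
    n = len(points)
    # Observed mean distance from each point to its nearest neighbour.
    observed = sum(
        min(math.dist(p, q) for q in points if q != p) for p in points
    ) / n
    # Expected mean nearest-neighbour distance for a random pattern.
    expected = 0.5 * math.sqrt(area / n)
    return observed / expected

# Four customers on the corners of the unit square: perfectly dispersed.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
index = anni(corners, area=1.0)
```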

3.4 Performance Measures

For measuring the performance of our models, we use the same approach as commonly used in the literature (e.g., Figliozzi, 2009; Nicola et al., 2019; Kwon et al., 1995). We provide the Mean Percentage Error (MPE), Mean Absolute Percentage Error (MAPE) and the maximum Absolute Percentage Error (APEmax) (see Equations 3.1-3.3). We compare models based on the MPE and MAPE and use the adjusted R² as a measure of fit. It is not reported when comparing different models or comparing performance on different datasets. The APEmax measures the worst approximation on a dataset. It has not been used in the literature before but provides us with a straightforward way of checking the robustness of our models. A number closer to zero indicates a better performance for all three performance measures.

MPE = (1/k) Σ_{i=1..k} (D_i − E_i)/D_i × 100%    (3.1)

MAPE = (1/k) Σ_{i=1..k} |D_i − E_i|/D_i × 100%    (3.2)

APEmax = max_{i ∈ dataset} ( |D_i − E_i|/D_i × 100% )    (3.3)

where D_i is the total route length obtained by the solution method, E_i is the approximation obtained by the regression models (both for instance i), and k is the number of instances.
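Equations 3.1-3.3 can be computed directly; a minimal sketch with toy route lengths (not data from the paper):

```python
# d[i]: route length from the solution method, e[i]: the approximation.

def mpe(d, e):
    """Mean Percentage Error: signed, so over- and underestimates cancel."""
    return sum((di - ei) / di for di, ei in zip(d, e)) / len(d) * 100

def mape(d, e):
    """Mean Absolute Percentage Error."""
    return sum(abs(di - ei) / di for di, ei in zip(d, e)) / len(d) * 100

def ape_max(d, e):
    """Worst (maximum) Absolute Percentage Error over the dataset."""
    return max(abs(di - ei) / di for di, ei in zip(d, e)) * 100

d = [100.0, 200.0]   # actual route lengths (toy values)
e = [110.0, 190.0]   # approximations: one 10% too high, one 5% too low
```

With these toy values the MPE is -2.5% (the errors partly cancel), the MAPE is 7.5% and APEmax is 10%.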


4 A Critical Review of Recent Approximation Models

In this section, we critically review three approximation models created by Nicola et al. (2019) and show that these models are not robust. We then show that some steps can be taken to rectify the lack of robustness. From these steps, we derive two insights that we use when creating our own models in Section 5. The first insight is that to create a robust model, a large dataset containing instances with many different characteristics is necessary. The second is that a model based on a strong theoretical background is more robust than a model lacking one.

The models proposed by Nicola et al. (2019) are all fitted to the Solomon benchmark with known optimal solutions. The three models correspond to the three instance types of the Solomon benchmark (type R, RC, and C). Each model can be used to predict route lengths in instances of one type. We call the models Nicola R, Nicola RC and Nicola C. The authors use an empirical method to create their models: they initially consider a large number of terms, which they whittle down using stepwise regression to arrive at their final models. The stepwise regression procedure iteratively adds the most significant term to a model (for forward selection) or removes the least significant term from a model (for backward elimination) and stops at some pre-determined point, usually when an optimum has been achieved. It is unclear from the paper whether the authors employed forward or backward stepwise regression. To test the performance of their models, Nicola et al. (2019) split the Solomon dataset into a training set and a testing set. This allows them to report in-sample and out of sample performance for each model. Because the training and testing sets are tailored to each model (e.g. the Nicola C model is fitted to and tested on instances of type C), the total number of instances used for fitting and testing is different for each model and ranges from 45 to 68 per type.

4.1 Lack of Robustness

We first show that the significance and coefficients of the models proposed by Nicola et al. (2019) are strongly dependent on the training set. This indicates that the models are not very robust. Secondly, we show that the performance of the models deteriorates on instances that are not similar to the Solomon benchmark. From this, we conclude that the models created by Nicola et al. (2019) are not robust.

We start by recreating the models by Nicola et al. (2019). From their paper, it is unclear what split they used for their training and testing set. We use an 80%-20% split of training and testing data. Because we have a different training set, we refit their models to our training set.


Table 4: Performance of the refitted Nicola models for all instance types of the Solomon benchmark. The best performance per performance measure per instance type is bolded.

Type  Perf.    Nicola R   Nicola RC   Nicola C   Nicola All

R     MPE       -1.42%     -11.53%     43.87%     -0.51%
      MAPE       4.17%      16.11%     43.87%      6.12%
RC    MPE      -86.07%      -4.17%      8.10%     -0.76%
      MAPE      87.33%       6.66%     21.06%      6.51%
C     MPE     -303.60%     -84.56%     -0.09%     -1.32%
      MAPE     303.60%      84.94%      7.25%     12.21%

One coefficient even changes sign in the Nicola RC model: in the refitted model it is 20.944, while in their original model it is -18.713. Both are significant at the p < 0.05 level. Other coefficients change in magnitude (e.g. demandDivQ in the R model, called Q/Cap in their paper) or in significance (e.g. varTWOverlap in the RC model, called varOverlap in their paper). The fact that the coefficients change in sign, magnitude and/or significance indicates that the coefficients are highly dependent on the training set. This is our first indication that their models are not robust.

We now show that the models' performance deteriorates on instances unrelated to the Solomon benchmark. The performance of the Nicola R, RC and C models on the Solomon benchmark is displayed in Table 4, while their performance on all available datasets is displayed in Table 5. Both tables contain one or more models (Nicola All and/or Beardwood) that will be used in Section 4.2 to show how the performance can be improved. They can be ignored for now.

We first note that the performance of each individual model is best on the instances it was fitted to, which is as expected. The performance on other instances is substantially worse. This is especially evident when looking at the Nicola R model's performance on C type instances, where the model attains an MPE of around -300%. Similar patterns are visible for all models. The Nicola C model performs best overall, with a MAPE and MPE of less than 45% on all instances.


Table 5: Performance of the Nicola R, RC, C, All and Beardwood models on different datasets.

Dataset Perf. Nic. R Nic. RC Nic. C Nic. All Beard.


over 300% on the complete dataset. Comparing worst performance, as measured by APEmax, we see that all three models predict route lengths that are more than 30 times as high or as low as the actual route length in some instances. The Nicola C model even predicts a route length that is almost 260 times lower or higher than the actual route length. It is clear that while these models perform reasonably on the Solomon and Subsetted Solomon instances, this does not generalize to other instances. The models are not robust at all.

4.2 Improving Robustness

We use two new models fitted to all types of Solomon instances to show two possible approaches that can be used to make models more robust. The first model uses the same approach as used by Nicola et al. (2019) to show that models fitted to a larger number of instances with more variety achieve a more robust result on other instances. The second model uses a theoretical result to show that this stronger background also leads to a more robust model.

4.2.1 More Instances and Larger Variety

Here we describe how having a larger number of instances with more variety in their characteristics allows us to create a more robust model.

We create a new model, called Nicola All, using the stepwise regression approach used by Nicola et al. (2019). We use forward selection stepwise regression to create the new model. We use the Akaike information criterion (AIC) as a stopping criterion, i.e. we stop the procedure when adding another term increases the AIC. The optimal AIC was achieved at 10 terms, which is only slightly more than the number of terms each of their models use (6-8 each). The coefficients of the new model are shown in Appendix C, Table 12. Table 4 displays the performance of the refitted Nicola R, C, RC models and the new Nicola All model. This table indicates that the new model performs comparably to the other models on each instance type. Its performance is worst on the C type instances, where its absolute relative error is 12.21%, which is 4.96% higher than the best performing model. Its MAPEs are better than the best MAPEs achieved by the other models on instance types other than their own. For example, the Nicola RC model has a MAPE of 16.11% on type R instances. The Nicola All model achieves a MAPE of 6.12%, which is the second-best. It is second to the model that was explicitly created for type R instances, i.e. Nicola R with a MAPE of 4.17%.
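The forward selection procedure with an AIC stopping criterion can be sketched as follows. This is a generic illustration, not the authors' exact implementation; the AIC is computed for an OLS fit up to an additive constant, and the synthetic data at the bottom is purely illustrative.

```python
import numpy as np

def aic(y, X):
    """OLS AIC up to an additive constant: n*ln(RSS/n) + 2*(k + 1)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * (k + 1)

def forward_stepwise(y, terms):
    """Greedily add the term that lowers the AIC most; stop when none does."""
    chosen, remaining = [], dict(terms)
    n = len(y)
    best = aic(y, np.ones((n, 1)))                 # intercept-only model
    while remaining:
        scores = {
            name: aic(y, np.column_stack([np.ones(n)]
                                         + [terms[c] for c in chosen]
                                         + [col]))
            for name, col in remaining.items()
        }
        name = min(scores, key=scores.get)
        if scores[name] >= best:                   # no improvement: stop
            break
        best = scores[name]
        chosen.append(name)
        del remaining[name]
    return chosen

# Synthetic example: y depends on x1 only; x2 is a noise candidate.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 3.0 * x1 + 0.1 * rng.normal(size=50)
selected = forward_stepwise(y, {"x1": x1, "x2": x2})
```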


We also note that refitting the model using a different training set does not change the sign, magnitude or significance of any of the coefficients, which is another indication that we have created a more robust model.

From the fact that this model is more robust, we conclude that having a large number of instances with a large amount of variety in their characteristics allows us to more easily create robust models.

4.2.2 A Stronger Theoretical Background

We show how a model with a stronger theoretical background is more robust than a model selected using only significance as a criterion.

We create a new model based on the first and most straightforward route length approximation in the literature. This model is called Beardwood and contains one term: √nA, which was proven by Beardwood et al. (1959) to be the factor that determines route length for the TSP. We fit the model to a training set containing all types of Solomon instances. Its only coefficient is equal to 1.21171, significant at p < 0.001 level.

The performance of the model is displayed in Table 5. This table indicates that the performance of this model is slightly worse than the Nicola All model on the Solomon and G&H datasets, with a MAPE that is between 2-3 times higher than the Nicola All model. However, their performances are similar on the Subsetted Solomon dataset and between 2-2.5 times better on the Custom and All datasets (e.g. compare a MAPE of 39.8% to a MAPE of 81.7% on the All dataset). It is especially interesting that the model performs the best out of all models on the Custom dataset, which contains instances that are least similar to instances it was fitted to. This indicates that it is indeed a more robust model.

From this, we conclude that a model with a stronger theoretical background, like the Beardwood model, performs well on all kinds of instances. We do note that to achieve the best performance on specific instances, the empirical approach used by Nicola et al. (2019) is unmatched.

5 New Models

In this section we propose three robust approximation models for VRPTW route lengths and explain why we propose them. Most of the explanations are based on the insights gained in Section 4. Section 6 details the experiments used to test their performance and robustness.

5.1 Basic Model


Many models in the literature decompose the total route length into two parts: the distance travelled between the customers and the depot, and the distance travelled from customer to customer. Our first model uses the same structure.

The commonly used term for inter-customer distances is given by Beardwood et al. (1959) as √nA for the TSP with uniform customers. For the VRPTW, we propose using the term sumMinReachableCustDist by Nicola et al. (2019). This term sums the distance from each customer to his nearest reachable neighbour. Recall that 'reachable' here means that the neighbour can be serviced within their time window by some vehicle after that vehicle has serviced the first customer as early as possible. As an example, consider a customer and their nearest neighbour. If the time window of the nearest neighbour ends before the time window of the customer opens, there exist no routes in which the neighbour is visited after the other customer and that neighbour is thus not the nearest reachable neighbour. We expect that in an optimal solution, the nearest reachable neighbour will often be visited directly after another customer since that minimizes the length of a route. We thus expect that sumMinReachableCustDist can be a replacement of the √nA term and that it is more appropriate for the VRPTW.
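A hedged sketch of how sumMinReachableCustDist could be computed. The paper does not give a formal definition of "as early as possible", so we simplify it to starting service at the opening of the first customer's time window; the coordinates, windows and service times below are illustrative.

```python
import math

def sum_min_reachable_cust_dist(coords, windows, service):
    """Sum over customers of the distance to the nearest reachable neighbour.

    Simplification: j counts as reachable from i if a vehicle that starts
    service at i when i's window opens, serves i, and travels to j at unit
    speed can still arrive before j's window closes:
        e_i + s_i + t_ij <= l_j
    """
    total = 0.0
    for i in range(len(coords)):
        dists = []
        for j in range(len(coords)):
            if i == j:
                continue
            t_ij = math.dist(coords[i], coords[j])
            if windows[i][0] + service[i] + t_ij <= windows[j][1]:
                dists.append(t_ij)
        if dists:
            total += min(dists)
    return total

# Three collinear customers with wide-open windows: every pair is reachable,
# and each customer's nearest reachable neighbour is 5 units away.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
windows = [(0.0, 100.0)] * 3
service = [0.0, 0.0, 0.0]
value = sum_min_reachable_cust_dist(coords, windows, service)
```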

Similarly, for the customer-depot distance, we usually see the term D̄ in the literature, where D̄ is the average distance between the customers and the depot. For the VRPTW, where this distance is often travelled by multiple vehicles, we propose using the term EDepTravel, defined as D̄ × sumDemand / Q, where sumDemand is the total demand and Q is the vehicle capacity. This term multiplies the average customer-depot distance by a simple estimate for the number of routes based solely on capacity constraints. This is similar to the term used by Daganzo (1984), where the number of customers divided by the maximum number of stops serves as an estimate for the number of routes. The term is also similar to Figliozzi (2009)'s approximation model, where the author uses the known number of routes instead of this straightforward approximation. The first approximation model we propose is shown in Equation 5.1. The model is called Basic.

VRPTW(n) = β1 × EDepTravel + β2 × sumMinReachableCustDist    (5.1)

where the coefficients β1, β2 are to be estimated via regression analysis.
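Fitting the Basic model then amounts to a two-column OLS regression without an intercept; a minimal sketch with synthetic feature values (not data from the paper):

```python
import numpy as np

# Columns: EDepTravel, sumMinReachableCustDist (toy values per instance).
X = np.array([[120.0, 300.0],
              [ 80.0, 210.0],
              [150.0, 340.0],
              [ 60.0, 180.0]])
# Synthetic route lengths generated from known coefficients (2.0, 1.9),
# so the regression should recover them exactly.
y = X @ np.array([2.0, 1.9])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # no-intercept OLS fit
approx = X @ beta                              # approximated route lengths
```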

5.2 An Extension to the Basic Model

To make the model proposed in the previous section more robust to instances with different numbers of customers, we propose an adjustment to account for these size differences.


We expect that combinatorial effects might become less pronounced as the number of customers (and/or routes) increases and the solution becomes 'smoother'. We thus expect to see some interaction effects.

The new model, which we call Basic-N, is shown in Equation 5.2.

VRPTW(n) = β1 × EDepTravel + (β2 + β3·n) × sumMinReachableCustDist    (5.2)

where the coefficients βi, 1 ≤ i ≤ 3, are to be estimated via regression analysis.

5.3 Distance Based Model

Thirdly, we propose a model that augments the Basic model by including more terms that are related to inter-customer distance. The idea behind this is that the inter-customer distance makes up a significant part of the total route length, but that it is hard to correctly approximate it. We, therefore, add more terms that might be able to help the approximation.

We add two terms related to inter-customer distances. The first term is sumCustDist, which provides a measure of the size of the instance by summing all inter-customer distances. This term scales with the area of the instance and with the number of customers squared. The second term is sumReachableCustDist, which provides a counterpoint by measuring the distance to all reachable neighbours. This term scales with the area of the instance and (approximately) with the number of customers squared. However, it also incorporates time windows.

We hope that these two terms can be used in our approximation by creating an interplay between the size (in area and customers) and the time windows of the instance. Together with the previously defined terms, these terms might improve the approximation for inter-customer distances.

This together yields the Distances model, shown in Equation 5.3.

VRPTW(n) = β1 × EDepTravel + β2 × sumMinReachableCustDist + β3 × sumCustDist + β4 × sumReachableCustDist    (5.3)

where the coefficients βi, 1 ≤ i ≤ 4, are to be estimated via regression analysis.

6 Experimental Results

In this section we fit the models designed in Section 5. We report their performance on the various datasets and compare them to the best performing models from the literature. Additionally, in Section 6.1 we validate the robustness of the models by fitting them to each available dataset and presenting their performance on all datasets.


The percentage of clustered instances is approximately 1491/4989 ≈ 30% in both the testing and training set.

The coefficients of the models are shown in Table 6.

We note that all coefficients are highly significant and have the expected sign. From the difference in the coefficient of sumMinReachableCustDist between the Basic and Basic-N model and the significant interaction term in the latter model, we infer that there does indeed exist an interaction effect between sumMinReachableCustDist and the number of customers n. When an instance contains more customers, the distance travelled between the customers becomes larger. We suspect this is because it is no longer possible to always visit nearest reachable neighbours in sequence, as there might exist more 'overlap' between nearest neighbours, i.e. multiple customers consider the same neighbour as their nearest reachable neighbour but only one route can visit them in sequence. Notably, the coefficient corresponding to EDepTravel is very similar to its expected value 2 from analytical insights (e.g., Daganzo, 1984; Figliozzi, 2008). When looking at the Distances model, we note that we see the interplay between the size of the instance and its openness (w.r.t. time windows) that we hoped to achieve. The positive coefficient of sumCustDist, which measures the size of the instance, is moderated by the negative coefficient of sumReachableCustDist, which measures the openness. Both scale similarly with the number of customers and the area. Together, they improve the performance of the model beyond the Basic or Basic-N models (judging from both the MPE and MAPE). The performance of the models on all different datasets is displayed in Table 7.

For comparison, we have added the Beardwood model (see Section 4) as it had the best performance on the ‘All’ dataset out of all models that we considered in Section 4. Notably, the Beardwood model performed better when fitted to the Solomon benchmarks than when fitted to the ‘All’ dataset, which is why we used the Solomon-fitted version of the model in the comparison.

All models perform well on all datasets, except for the Basic model. The Basic model struggles to perform on datasets containing smaller instances (i.e. Solomon, Subsetted Solomon), which we suspect is due to the OLS procedure which minimizes absolute errors and thus gives more weight to the instances with more customers because they have larger absolute errors. Disregarding this exception, the models perform similarly on all datasets, with a MAPE ranging from 10.9-19.3%. This is up to 2.5 times better than the range achieved by the Beardwood model (21.6-50.2%) and up to 6 times better than the range achieved by the Nicola All model (7.8-121.4%, see Table 5), which only performs well on the Solomon benchmark.

Out of the three models, the Basic-N model seems the most robust with an absolute relative error of at most 16%. Similar to the Basic model, the Distances model performs best on datasets containing larger instances (i.e. Custom, G&H, All).


Table 6: Coefficients of the Basic, Basic-N and Distances models, fitted to a training set created from the ‘All’ dataset.

                               Basic       Basic-N     Distances

sumMinReachableCustDist        1.943***    1.371***     1.558***
                              (0.016)     (0.043)      (0.022)
EDepTravel                     2.022***    2.016***     1.996***
                              (0.004)     (0.004)      (0.004)
sumMinReachableCustDist:n                  0.001***
                                          (0.0001)
sumReachableCustDist                                   -0.00004***
                                                       (0.00000)
sumCustDist                                             0.0001***
                                                       (0.00000)

Adjusted R²                    0.991       0.991        0.992
MPE (in-sample)               -5.49%       5.5%         1.05%
MAPE (in-sample)              15.59%      12.94%       12.59%
MPE (out of sample)           -5.13%       5.85%        1.40%
MAPE (out of sample)          14.21%      12.19%       11.55%


Table 7: Performance of the new Basic, Basic-N and Distances models on various datasets. The models were fitted to a training set created from the 'All' dataset. The reported performance is out of sample. For comparison, the performance achieved by the Beardwood model, described in Section 4, is shown.

Dataset     Perf.     Basic    Basic-N   Distances   Beardwood

Solomon     MPE      -28.8%     -1.8%      -9.1%       -7.8%
            MAPE      34.6%     15.4%      19.3%       23.4%
            APEmax   106.3%     61.1%      72.6%       80.3%
G&H         MPE       -4.7%     -1.3%      -4.8%       31.6%
            MAPE      13.8%     12.1%      13.4%       35.0%
            APEmax    67.4%     61.5%      79.1%       68.9%
Subsetted   MPE      -17.8%      7.6%       0.8%       -0.6%
Solomon     MAPE      21.1%     16.0%      14.5%       21.6%
            APEmax   104.4%     58.9%      70.6%       80.8%
Custom      MPE        1.9%      5.6%       2.4%       45.2%
            MAPE      11.6%     11.1%      10.9%       50.2%
            APEmax    85.2%     85.2%      81.7%      121.8%
All         MPE       -5.1%      5.8%       1.4%       29.0%
            MAPE      14.2%     12.2%      11.5%       39.8%
            APEmax   106.3%     85.2%      81.7%      121.8%

This is likely because the Distances model contains terms that do not scale linearly with the number of customers (sumCustDist and sumReachableCustDist).

However, all models perform better than models from the literature, with a MAPE of 11.5-14.2% on the complete dataset. The best model considered in Section 4, Beardwood, achieves a MAPE of 39.8% on the same dataset. Other models considered in the literature (e.g., Chien, 1992; Daganzo, 1984; Figliozzi, 2009) achieve a MAPE between 23-168%.

6.1 Robustness

We test the robustness of the models by fitting them to the five different datasets (see Section 3.2 for details) and analyzing their performance on the other datasets.

6.1.1 Basic


The best performance on each dataset is achieved when the model is fitted to a training set created from that same dataset. This is in line with our expectations. Secondly, we see that the performance of the model is best when it is fitted to a dataset similar to the dataset it was tested on. For example, the performance achieved on the Solomon dataset is best when the model is fitted to either the Solomon dataset itself or to the Subsetted Solomon dataset (MAPE of 17.6%, resp. 22.1%). Similarly, the performance on the Custom dataset is best when the model is fitted to the Custom dataset itself or to the G&H dataset (MAPE of 10.8%, resp. 3.1%). In a similar vein, we find that the performance is worst if the datasets are dissimilar. For example, the model fitted to Subsetted Solomon and tested on Custom yields an MPE and MAPE of 27.0%, the worst score achieved by this model on the Custom dataset. Similarly, the worst performance is achieved on the Solomon dataset, when fitted to the G&H (or Custom) dataset, with a MAPE of 40.1% (or 34.8%). However, these scores are much better than the scores achieved by models in the literature (for example, the Beardwood model achieves a MAPE of 323.8% on the Solomon dataset when fitted to the Custom dataset and a MAPE of 50.2% when it is the other way around).

It does underline once more that the Basic model is sensitive to the size of the instance and achieves the best performance on instances that are similar to the training set, especially with respect to the number of customers. The model remains more robust than the best current models.

The overall performance on the Solomon benchmark is not very good. We suspect that the reasons for this are that the Solomon benchmark is small and exhibits little variation in some characteristics. This indicates that the optimal solution is strongly dependent on combinatorial effects and less influenced by the characteristics of the instance. This makes it difficult to correctly approximate the expected route length.

6.1.2 Basic-N

The performance of the Basic-N model, when fitted to and tested on various datasets is shown in Table 9.


Table 8: Performance of the new Basic model on each dataset, when fitted to each dataset. Reported performance is out of sample. Data when the model is fitted to the ‘All’ dataset is reported in Table 7

Dataset used for fitting

Dataset Perf. Solomon G&H Sub. Sol. Custom


Table 9: Performance of the new Basic-N model on each dataset, when fitted to each dataset. Reported performance is out of sample. Data when the model is fitted to the ‘All’ dataset is reported in Table 7

                      Dataset used for fitting
Dataset     Perf.     Solomon    G&H      Sub. Sol.   Custom

Solomon     MPE        -2.8%     -1.1%     -12.4%      -8.8%
            MAPE       15.4%     15.2%      23.2%      19.2%
            APEmax     56.1%     61.2%      69.2%      73.6%
G&H         MPE        23.8%      1.0%      76.8%      -5.0%
            MAPE       25.0%     13.3%      76.8%      12.7%
            APEmax     61.0%     58.3%     167.2%      68.0%
Subsetted   MPE         8.3%      8.0%      -2.0%       1.0%
Solomon     MAPE       15.6%     16.0%      12.8%      14.7%
            APEmax     54.8%     59.1%      69.8%      71.4%
Custom      MPE        20.7%      9.2%      53.9%       3.5%
            MAPE       21.2%     12.5%      54.5%      11.2%
            APEmax     87.8%     85.9%     150.2%      85.0%
All         MPE        16.5%      8.0%      36.1%       1.8%
            MAPE       19.6%     13.6%      42.6%      12.6%
            APEmax     87.8%     85.9%     167.2%      85.5%

The errors are all of similar size when the model is fitted to datasets containing instances with both few and many customers (i.e. G&H, Custom, All). The MAPE varies from 12.5-16.0% (11.2-19.2%) when fitted to the G&H (Custom) dataset, but from 12.8-76.8% when fitted to the Subsetted Solomon dataset (which contains instances of no more than 100 customers). That the errors are of similar size indicates that the model performs well for instances of various sizes, but that it does need instances with a large range of customers to correctly estimate the interaction effect.

We summarize this model as follows: it performs very well and is very robust as long as the coefficients (and especially the interaction effect) have been estimated accurately.


6.1.3 Distances

The performance of the Distances model, when fitted to and tested on various datasets is shown in Table 10.

Similar to what we saw for the Basic and Basic-N models, the performance is best on datasets that are similar to the dataset the model was fitted to. On dissimilar datasets, the performance is worse. A major example of this is the performance of the model on the G&H and Custom datasets when fitted to the Solomon or Subsetted Solomon datasets. This performance is much worse than the performance of the other models, with a MAPE of up to 136.2%. The explanation for this has to do with the scaling of the terms sumCustDist and sumReachableCustDist. These terms grow (approximately) with the number of customers squared. If the coefficients are estimated on small instances, the performance on larger instances deteriorates very quickly as the terms grow more than linearly. A slight improvement can be achieved by replacing these terms by their square root (the MAPE of the model fitted to the Subsetted Solomon dataset and tested on the G&H dataset improves from 136.2% to 20.7%), but the performance of that model on the 'All' dataset degrades from 11.5% to 22.5%. Without further research, it is hard to increase the performance of the model on instances of all sizes by transforming these terms. The model is less robust than we hoped and no straightforward way to improve its robustness seems apparent.

To summarize: while the model can outperform the other new models when fitted to an appropriate dataset consisting of very similar instances, the model is not robust enough to use in all cases due to the non-linear scaling of sumCustDist and sumReachableCustDist.

7 Conclusion

We have created three robust route length approximation models for the VRPTW. Each of these models is more robust than the current approximation models.

The most robust model is the Basic-N model, which can be used as-is, using the coefficients displayed in Table 6. The Basic and Distances model can achieve better performance on a dataset if enough representative instances to fit the model are available. Of these, the Distances model can achieve better performance but is significantly less robust and should only be used when a training set containing very similar instances is available. Its performance quickly deteriorates if the number of customers or the area of the instances lies outside the training set.

However, each of our three models is more robust than current route length approximation models on instances of varying sizes, with an average absolute error of 10-15%, compared to an average absolute error ranging from 30-950% for models in the literature.


Table 10: Performance of the new Distances model on each dataset, when fitted to each dataset. Reported performance is out of sample. Data when the model is fitted to the ‘All’ dataset is reported in Table 7

Dataset used for fitting

Dataset Perf. Solomon G&H Sub. Sol. Custom


We have also shown that the sumMinReachableCustDist term (summing the distance from each customer to the nearest neighbour reachable by a vehicle) can be used in VRPTW approximation as a replacement of the √nA term (used in TSP approximation). However, a notable interaction effect with the number of customers does occur.

There are two interesting limitations to our approach: the first is that the quality of the solutions found might depend on the algorithm used. While we have selected the ALNS heuristic for its good performance on these problems, using another algorithm might lead to solutions of a different quality, which in turn could impact the estimated coefficients of our approximation models. Further research could look into this in more detail.

Our second limitation pertains to the robustness of the models. While we have tried to make our models as robust as possible, by fitting them to and testing them on datasets with a large variety of characteristics and using terms with a high explanatory value, it is impossible to exhaustively cover all possible VRPTW instances. This is another avenue for further research: testing the performance of the models on instances with an even larger variety of characteristics. We expect that the models will hold up well, as they also held up well when fitted to each of the datasets and tested on all other datasets, but we might be mistaken.

A third avenue for further research might focus on exploring the sumReachableCustDist term in more detail or determining whether similar terms can be used to incorporate complex (time window) constraints into other terms.

A final idea to create more robust approximation models stems from the procedure used in fitting regression models. The Ordinary Least Squares (OLS) regression, which was used in this paper, estimates regression coefficients by minimizing the sum of squared errors. However, this is a measure of absolute deviation and not of relative deviation. Our performance measures are based on percentage error, which is a form of relative deviation. The OLS regression thus weighs instances with a larger route length more than those with a smaller route length, as the squared errors of these instances are often larger. To create models that are more robust across instances of varying route lengths, it might be interesting to adopt a regression procedure that minimizes the relative deviation.
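One straightforward way to implement this idea is to divide each instance's regression row by its route length D_i, which turns OLS on the scaled system into a weighted regression minimizing the summed squared relative errors. A sketch with synthetic data (not from the paper):

```python
import numpy as np

def relative_ols(X, d):
    """Minimize sum_i ((d_i - X_i @ beta) / d_i)^2 instead of absolute errors.

    Scaling row i of X and the target d_i by 1/d_i makes ordinary least
    squares on the scaled system equivalent to weighted least squares
    with weights 1/d_i^2, i.e. a relative-deviation fit.
    """
    w = 1.0 / d
    beta, *_ = np.linalg.lstsq(X * w[:, None], d * w, rcond=None)
    return beta

# Synthetic check: route lengths generated exactly as d = X @ [2, 3],
# so the relative-deviation fit should recover those coefficients.
X = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 2.0]])
d = np.array([8.0, 9.0, 10.0])
beta = relative_ols(X, d)
```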

References

Avci, Mualla Gonca and Mustafa Avci (2019). "An adaptive large neighborhood search approach for multiple traveling repairman problem with profits". In: Computers & Operations Research 111, pp. 367–385. issn: 0305-0548. doi: 10.1016/j.cor.2019.07.012. url: http://www.sciencedirect.com/science/article/pii/S030505481930187X.


Beardwood, Jillian, John H Halton, and John Michael Hammersley (1959). “The shortest path through many points”. In: Mathematical Proceedings of the Cambridge Philosophical Society. Vol. 55. 4. Cambridge University Press, pp. 299–327.

Çavdar, Bahar and Joel Sokol (2015). "A distribution-free TSP tour length estimation model for random graphs". In: European Journal of Operational Research 243.2, pp. 588–598. issn: 0377-2217. doi: 10.1016/j.ejor.2014.12.020. url: http://www.sciencedirect.com/science/article/pii/S0377221714010212.

Chien, T.William (1992). “Operational estimators for the length of a traveling salesman tour”. In: Computers & Operations Research 19.6, pp. 469–478. issn: 0305-0548. doi: 10.1016/0305-0548(92)90002-M. url: http://www. sciencedirect.com/science/article/pii/030505489290002M.

Christofides, Nicos and Samuel Eilon (1969). “Expected distances in distribution problems”. In: Journal of the Operational Research Society 20.4, pp. 437– 443.

Daganzo, Carlos F. (1984). "The length of tours in zones of different shapes". In: Transportation Research Part B: Methodological 18.2, pp. 135–145. issn: 0191-2615. doi: 10.1016/0191-2615(84)90027-4. url: http://www.sciencedirect.com/science/article/pii/0191261584900274.

Daganzo, Carlos F (1987). "Modeling distribution problems with time windows: Part I". In: Transportation Science 21.3, pp. 171–179.

Dijkstra, Arjan Stijn (2019). "Order fulfillment: warehouse and inventory models". English. PhD thesis. University of Groningen. isbn: 978-94-034-1760-8.

Figliozzi, Miguel Andres (2008). "Planning Approximations to the Average Length of Vehicle Routing Problems with Varying Customer Demands and Routing Constraints". In: Transportation Research Record 2089.1, pp. 1–8. doi: 10.3141/2089-01. url: https://doi.org/10.3141/2089-01.

Figliozzi, Miguel Andres (2009). "Planning approximations to the average length of vehicle routing problems with time window constraints". In: Transportation Research Part B: Methodological 43.4, pp. 438–447.

Gehring, Hermann and Jörg Homberger (1999). "A parallel hybrid evolutionary metaheuristic for the vehicle routing problem with time windows". In: Proceedings of EUROGEN99. Vol. 2. Citeseer, pp. 57–64.

Hindle, Adam and Dave Worthington (2004). "Models to estimate average route lengths in different geographical environments". In: Journal of the Operational Research Society 55.6, pp. 662–666.

Kwon, Ohseok, Bruce Golden, and Edward Wasil (1995). "Estimating the length of the optimal TSP tour: An empirical study using regression and neural networks". In: Computers & Operations Research 22.10, pp. 1039–1046. issn: 0305-0548. doi: 10.1016/0305-0548(94)00093-N. url: http://www.sciencedirect.com/science/article/pii/030505489400093N.


issn: 0305-0548. doi: 10.1016/j.cor.2018.08.002. url: http://www.sciencedirect.com/science/article/pii/S030505481830220X.

Mei, Xi (2015). “Approximating the Length of Vehicle Routing Problem Solu-tions Using Complementary Spatial Information”. PhD thesis. George Mason University. url: http://ebot.gmu.edu/handle/1920/9623.

Nagata, Yuichi and Olli Bräysy (2009). "A powerful route minimization heuristic for the vehicle routing problem with time windows". In: Operations Research Letters 37.5, pp. 333–338.

Nagata, Yuichi, Olli Bräysy, and Wout Dullaert (2010). "A penalty-based edge assembly memetic algorithm for the vehicle routing problem with time windows". In: Computers & Operations Research 37.4, pp. 724–737. issn: 0305-0548. doi: 10.1016/j.cor.2009.06.022. url: http://www.sciencedirect.com/science/article/pii/S0305054809001762.

Nicola, D., R. Vetschera, and A. Dragomir (2019). "Total distance approximations for routing solutions". In: Computers & Operations Research 102, pp. 67–74. issn: 0305-0548. doi: 10.1016/j.cor.2018.10.008. url: http://www.sciencedirect.com/science/article/pii/S0305054818302673.

Pisinger, David and Stefan Ropke (2007). "A general heuristic for vehicle routing problems". In: Computers & Operations Research 34.8, pp. 2403–2435.

Robusté, F, M Estrada, and Andrés López-Pita (2004). "Formulas for estimating average distance traveled in vehicle routing problems in elliptic zones". In: Transportation Research Record 1873.1, pp. 64–69.

Ropke, Stefan and David Pisinger (2006). "An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows". In: Transportation Science 40.4, pp. 455–472.

Sacramento, David, David Pisinger, and Stefan Ropke (2019). "An adaptive large neighborhood search metaheuristic for the vehicle routing problem with drones". In: Transportation Research Part C: Emerging Technologies 102, pp. 289–315. issn: 0968-090X. doi: 10.1016/j.trc.2019.02.018. url: http://www.sciencedirect.com/science/article/pii/S0968090X18303218.

Solomon, Marius M (1987). "Algorithms for the vehicle routing and scheduling problems with time window constraints". In: Operations Research 35.2, pp. 254–265.

Steinerberger, Stefan (2015). "New bounds for the traveling salesman constant". In: Advances in Applied Probability 47.1, pp. 27–36. doi: 10.1239/aap/1427814579. url: https://doi.org/10.1239/aap/1427814579.


A ALNS tuning

The ALNS algorithm can be tuned to achieve better performance. This appendix details how we have tuned the algorithm.

To tune the algorithm, we followed the approach set out by the original authors (see Ropke and Pisinger, 2006; Pisinger and Ropke, 2007). We tuned the cooling rate c and the initial temperature parameter w separately for the vehicle minimization stage (stage 1) and the distance minimization stage (stage 2), and selected (w1, w2) = (0.35, 0.5). Because of the varying sizes of the instances, a better result was achieved by making the cooling rate dependent on the initial temperature in the second stage. The cooling rate was selected such that the ending temperature would be 0.1 after 25000 iterations of the algorithm. We thus have ci = (1 / (10 × t0,i))^(1/δi), where t0,i is the starting temperature in stage i and δi is the number of iterations left in stage i. Since the number of iterations in stage 1 is uncertain, we selected δ1 = 2500. Additionally, we selected the penalty cost for each unserved customer (γ) as a function of the cost of the solution at the start of each stage. Values that worked well turned out to be (γ1, γ2) = (2.5 × c(s0,1), 0.1 × c(s0,2)), where c(·) is the cost function of the solution (in this case defined as the total route length) and s0,1 and s0,2 are the initial solutions in stages 1 and 2, respectively. We also incorporated the authors' suggestion to divide the starting temperature by the number of customers for instances of size n > 100.
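The stage-2 cooling schedule can be sketched as follows; the starting temperature used here is a hypothetical value for illustration, not one of the tuned parameters.

```python
def cooling_rate(t0: float, delta: int, t_end: float = 0.1) -> float:
    """Cooling rate c such that t0 * c**delta == t_end, i.e.
    c = (t_end / t0) ** (1 / delta) = (1 / (10 * t0)) ** (1 / delta)
    for the default target ending temperature of 0.1."""
    return (t_end / t0) ** (1.0 / delta)

t0 = 50.0       # hypothetical starting temperature
delta = 25000   # iterations in the distance-minimization stage
c = cooling_rate(t0, delta)

# After delta geometric cooling steps the temperature lands on 0.1.
t_final = t0 * c ** delta
print(f"c = {c:.8f}, ending temperature = {t_final:.4f}")
```

Because c depends on t0, instances with very different initial temperatures still end the stage at the same temperature, which is the point of the schedule.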

B Instance Creation

Benchmarks The Gehring & Homberger and Solomon benchmarking instances are used as is.

Subsetted Solomon We have created additional instances by subsetting the Solomon benchmark. To create one such instance, we first determine the type of the instance (R, C or RC) and consider a random instance of that type (of size 100). We then draw a uniformly distributed random number n for the size of the new instance (25 ≤ n ≤ 100) and randomly select n customers from the selected instance. These customers, together with the depot, form the instance. No time windows, demands or service times are adjusted.
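The subsetting procedure can be sketched as follows, assuming a minimal dictionary representation of an instance (depot plus a list of customer records); the actual Solomon file format, time windows, demands and service times are not modeled and carry over unchanged.

```python
import random

def subset_instance(instance, rng):
    """Draw a uniform size n in [25, 100] and sample n customers;
    the depot and all customer attributes are kept as-is."""
    n = rng.randint(25, 100)  # inclusive on both ends
    customers = rng.sample(instance["customers"], n)
    return {"depot": instance["depot"], "customers": customers}

# Hypothetical 100-customer instance for demonstration.
base = {"depot": (40, 50), "customers": [{"id": i} for i in range(1, 101)]}
small = subset_instance(base, random.Random(42))
print(len(small["customers"]))
```

Repeating this draw against randomly chosen base instances of each type yields a pool of instances with varying sizes but the original Solomon geometry and time-window structure.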
