VRP solution using GPGPUs

(1)

VRP solutions using GPGPUs

Student: Mihai Onofrei

mihonofrei@gmail.com

Host Organization: System and Network Engineering Research Lab

Supervisor : Ana Lucia Varabanescu

A.L.Varbanescu@uva.nl

Second reviewer: Souley Madougou

s.madougou@uva.nl

(2)

A ride sharing systems aims to match customers that share similar routes in hopes of lowering the price of transport and minimize the number of vehicles on public roads, thus enabling economic savings and lowering environmental degradation. In this study we set out to determine the re-quirements that are a fundamental for a ride sharing system, by analyzing previous research in the fields of vehicle routing problems and vehicle sharing. The literature review has also allowed us to define the constraints for a custom ride sharing scenario that takes into account the needs of both customers and small businesses alike.

Given a set of requirements for the system, we have implemented a task-based heterogeneous computing software prototype that is able to process a real world representation of a road network, process a batch of customer requests in order to determine if their itineraries match to some extent and assign compatible request to vehicles. Our work focuses on analyzing the impact of using GPGPUs on the design, implementation, and performance of the system and the results prove that the GPU accelerator can indeed have a positive impact on the performance of VRP systems.

Acknowledgements

I would like to thank dr. ir. A.L. (Ana) Varbanescu for the constant help and guidance that allowed me to start from simple ideas and gradually shape them into a complete project.

(4)

Chapter 1

Introduction

Nowadays, logistics and transportation (L&T) activities increase in volume and complexity as a result of globalization of producers and consumers [1]. Inherently, more complexity is added to transportation networks of all kinds, while high efficiency and low cost remain essential.

Exactly because efficiency and cost are essential, one of the recurring topics in the L&T literature is that of modeling and optimizing tour assignments of vehicles, also known as the Vehicle Routing Problem (VPR) [1].

The core problem of VRP is finding a least-cost set of routes to service a number of customers from a central depot, given a cost matrix specifying the travel cost between all customer/depot locations[2]. The problem can be further complicated by adding read-world constraints that will amount to multiple VRP variants[2]. Examples of such real world constraints are:

• The amount of goods that can be loaded on each vehicle.

• The customer can ask for both delivery and/or collection of goods. • Time windows for customer visits.

All of these constraints and many more can be freely combined to model actual use cases[2]. The main challenge of VRP is scalability, as VRP is a combinatorial optimization problem, where the number of feasible solutions for the problem increases exponentially with the number of customers to be serviced [3]. Most VRPs variants are NP-Hard, thus exact solution methods are confined to limited-size problem instances[4]. Due to the ever increasing number of constraints presented earlier, the complexity of the problems becomes even larger[4]. The combined size and complexity challenges recommend a high performance computing (HPC) solution to the VRP problem.

The purpose of this project is to create a system that is able to improve the performance of VRP by using HPC. The impact of our system is, to a large degree, dependent on its performance, which in turn is dependent on the selected optimization method, the implementation of this method on the targeted hardware, and the computational performance of that hardware[5]. Given that a graphical processing unit (GPU) is a readily available and low cost solution for time-consuming parallel computing tasks[6], we have selected the GPU as our target HPC hardware for performance improvement. General purpose GPUs (GPGPUS) are a modern approach to HPC.

It is challenging to address VRP using parallel processing and GPUs, because most proposed so-lutions utilize graphs [7] and irregular algorithms, which represent a challenge for parallel computing[8]. In fact, most approaches to solve VRPs do not take advantage of parallel computing at all [6]. The ones that do take advantage of it have the following limitations :

• They do not use GPUs efficiently [2, 9]. • They are limited to a single node.

• They provide no thorough analysis and/or model of their scalability and performance. Our main goal is to improve this state-of-the art in VRP, by enabling the efficient use of GPUs. In this context, the main research question of this project is:

(5)

What is the impact of using GPGPUs on the design, implementation, and perfor-mance of a VRP system in the context of a ride-sharing scenario?

We tackle the problem by formulating and subsequently answering the following subquestions: • SQ1: What are the requirements of a HPC VRP system?

• SQ2: How can we design and implement a VRP system compliant with the requirements? • SQ3: What is the performance of such an implementation when using GPUs?

1.1 Approach

In order to answer the main research question, our work focuses on creating a software prototype that is capable of solving a specialized type of VRP problem with added constraints imposed by a ride sharing scenario. The prototype is capable of handling four main activities:

1. Process an input graph that depicts the road network and generate the least cost route between any two points of the graph.

2. Process a batch of requests from multiple clients and determine common road segments that amount to a shared travel cost.

3. Computes a compatibility score between requests.

4. Assign a fleet of vehicles to handle the requests based on the shared travel cost and compat-ibility score.

Our software prototype brings empirical evidence that an efficient VRP solution can be imple-mented using GPUs.

We assert the functional correctness of the results by selecting a set of graphs (i.e., the road networks) and ground-truth scenarios (user requests with specific features and a known optimal solution). By comparing our results against the nvgraph1 graph library (first activity) and the ground-truth (second, third and fourth activity), we are able to assess if our prototype solves the VRP problem adequately.

We further analyze the performance of the software prototype in correlation with metrics such as cost and scalability for different types of scenarios. We aim to show that using GPGPUs is suitable in processing large amounts of data in the compute intensive operations required by the first three activities.

1.2 Contributions and outline of the thesis

The following list contains the main contributions of this work.

1. Based on the work of [10], we have implemented a parallel algorithm that runs on GPUs and solves the core SSSP (Single-Source Shortest Path) problem. Furthermore, we have devised a GPUs parallel algorithm that deals with the intricateness of identifying the series of nodes present on minimum cost path from any node to the starting node used by the SSSP. 2. We propose a custom ride sharing scenario that takes into account the research in the field

and real-world scenarios. We further propose a cost metric to quantify ride sharing and optimize the vehicle assignment.

3. We designed a parallel algorithm that computes the shared travel cost and compatibility score between requests based on the rules imposed by the ride sharing scenario. We provide a GPU implementation of this algorithm.

(6)

The remainder of this document is structured as follows. Chapter2presents the background in several ares of interest: vehicle routing problems, ride sharing services, GPGPUs, parallel graph processing, representation and analysis, the CUDA architecture and libraries used in the study. Chapter3 presents the requirements of the system. Chapter 4 presents the design overview and the implementation details for the system. Chapter5presents various performance measurements of different tasks performed by the system as well as the worst case scenario analysis and prototype validation procedure. Chapter6 present the conclusions and future study directions.

(7)

Chapter 2

Background and related work

In this chapter we present relevant related work and background information regarding the concepts and ideas to be found further in this thesis. Specifically, we briefly present the vehicle routing and ride sharing problems, GPGPU and heterogeneous processing, and parallel graph processing.

2.1 Vehicle Routing Problems (VRPs)

Vehicle routing problems can be categorized as static or dynamic. Static VRPs assume the travel routes can be computed apriori because all the problem instances are known in advance[7]. In contrast, dynamic VRPs assume that some data inputs are revealed during the execution of the plan [11]. In addition, for each category, the problem may also be deterministic or stochastic [6]. Deterministic implies that all of its inputs are known with certainty and there are no stochastic inputs [6].

As mentioned in Section1, real world constraints can be added to the base problem. In this study we chose to explore a ride sharing scenario that involves solving a specialized set of constraints for the VRP, where we have to take into account the capacity of the vehicles (number of seats available for passengers) as well as the pick-up and delivery (drop-off) points.

2.2 Ride Sharing Services

Ride sharing services aim to bring together travelers with similar itineraries and schedules[12]. Despite the obvious cost saving benefits, effective usage of empty car seats by ride-sharing may represent an important opportunity to increase occupancy rates, and could substantially increase the efficiency of urban transportation systems, potentially reducing traffic congestion, fuel con-sumption, and pollution[12].

Ride sharing services can be split up in two main categories, traditional and dynamic. Traditional ride sharing services link riders and drivers who travel between the same paces at the same time and assumes that users have a fixed schedule and fixed origin and destination points[13]. In contrast, dynamic ride sharing systems consider each trip individually and are designed to accommodate trips to random points at random times by matching user trips without regard to trip purpose[13].

In this study consider the problem of matching vehicles and clients in a dynamic setting.

2.3 GPGPU computing

GPGPU computing implies the use of GPUs to perform general computational tasks usually han-dled by the CPU [14]. GPGPU programming involves developing kernels, which are the part of the program that run on the GPU. These kernels are managed by the CPU, effectively leading to a separation of tasks between the two devices. The programmer still needs to explicitly manage the execution of the kernel (i.e., launching a number of threads to execute the kernel), and the

(8)

all machines are heterogeneous. Using this heterogeneity, however - that is, combining the use of CPUs and GPUs - depends on the application at hand. In this project, we assign different tasks to the two different processors (CPU and GPU). Thus, we employ task-based heterogeneous computing.

2.4 Parallel graph processing

Graph processing requires high performance for the increased size and complexity of graphs and their analyses. A lot of effort has been spent on parallelizing graph processing, despite the chal-lenges this approach brings [16]. In this section we highlight a couple of these efforts, specifically focusing on algorithms and their performance. For a detailed analysis on graph processing systems (with or without GPUs), we refer the reader to [17,18,19].

Although such systems do provide both productivity and performance for many graph algo-rithms, they are too complex for our case, because we only use a single algorithm, which is inte-grated in a larger workflow which would be impossible to implement inside such frameworks.

2.4.1 The "Gap Benchmark Suite"

The GAP benchmark suite1_{is a proposed benchmark that provides highly optimized parallel graph}

processing algorithms for the CPU. In this thesis, we have used one of their algorithms - namely, the SSSP parallel implementation - as reference for the performance to be achieved on the CPU. For more information on the optimization and analysis of these algorithms, we refer the reader to [20].

2.4.2 Graph representation on the GPU

The most commonly used space-efficient format to represent graphs in memory is CSR (Compressed Sparse Row). Due to the (traditional) memory limitations of GPUs, CSR is the format of choice for representing all the graphs used in the prototype described in this thesis.

CSR is a space efficient method to represent sparse matrices in GPU memory. If we consider V to be the number of vertices and E the number of edges we can represent a directed and weighted graph the CSR format using three arrays:

1. The column indices array is a set of the destination vertex for each edge, thus it has a total of E elements.

2. The source offset array is a set of the starting vertex of each edge and has the size of V + 1 elements.

3. The weight array contains the weight of each weight, thus it has a total of E elements. Using these data structures, the total space needed to store a graph G = (V, E) in CSR format is: (2 * E + V + 1) elements.

For a more detailed description of CSR we refer the reader to [10].

2.4.3 Parallel graph algorithms in CUDA

Compute Unified Device Architecture (CUDA) represents the programming model developed by Nvidia for it’s underlying GPUs [21]. As commodity graphics has become a cost-effective hard-ware parallel platform to solve general purpose computing tasks, there have been may efforts to implement efficient parallel graph algorithms using CUDA [21, 22]. All these papers focus on the difficulties of combining GPU processing and graph processing, because the massively parallel, lock-step architecture of GPUs exacerbates the challenges in graph processing [16].

(9)

2.4.4 Nvgraph

The Nvidia Graph Analytics library (Nvgraph) comprises of parallel algorithms for high perfor-mance analytics on graphs with up to 2 billion edges. Nvgraph makes it possible to build interactive and high throughput graph analytics applications. The library is closed source and provides a vari-ant of the SSSP algorithm capable of processing a graph network represented in the CSR format and computes the cost of the shortest path between a starting node and all the other nodes in the network.

We use the Nvgraph SSSP algorithm to check the correctness of our custom GPU SSSP imple-mentation, as well as for comparison in the performance analysis of the algorithms. The Nvgraph library aims to address the known issues of parallel graph processing by using a semi-ring model with automatic load balancing for any sparsity pattern.

2.5 VRP and GPGPUs

All the approaches that aim to solve VRPs are known to scale poorly when the problem size is increased (i.e., number of vehicles, number of deliveries, area of interest, etc. increase) [6]. As in many other fields, the use of the graphical processing units (GPUs) can provide a significant speed increase in solving VRP problems [6] even up to 40 times faster than the sequential versions [2]. Very few attempts to solve VRP problems using GPUs have been proposed. For example, in [9], the authors discuss a potential solution, which is unfortunately limited to special cases and, most importantly, it is not thoroughly evaluated in terms of performance and efficiency.

To the best of our knowledge, at the time of this writing, there is no research that aims to solve the VRP problem in the context of a ride sharing service using GPGPUs. Therefore, the focus of our work is to quantify the impact of GPGPUs on the ride-sharing variant of VRP.

(10)

Chapter 3

The Requirements of a HPC VRP

system (SQ1)

In order to determine the requirements for a HPC VRP system we conducted a literature study on vehicle routing problems, ride sharing systems, and HPC. Based on the core problem definition for VRP provided in Chapter1, we can derive that our first dependency for solving any VRP problem is a cost matrix that specifies the travel cost between all possible pick-up/delivery points. In the majority of papers, including the ones that focus explicitly on literature reviews of VRPs[7], the problem is generally defined on a graph G = (V , E, C), where V = {v1, ..., vn} is a set of vertices;

E = {(vi, vj)|(vj, vi)} ∈ V2, i 6= j} is the edge set; and C = (cij)(vi,vj)∈E is a cost matrix defined

over E, representing distances, travel times or travel costs [7]. From this definition we can derive the first requirement :

• REQ1: Use a graph structure that represents the road network and contains the travel cost between each pair of vertices in the graph.

Given that the first requirement is satisfied, the system needs to be able to select the most efficient travel route between any nodes of the road network. The system must use the information contained in the graph structure in order compute the optimal solution. Therefore, the minimum cost between any pair of vertices must be known. In addition, the system must also be able to compute the vertices that are part of the least cost route. This amounts to two more requirements:

• REQ2: Determine the minimum cost between any pair of vertices.

• REQ3: Determine all the vertices that are part of the least cost route from the starting vertex to the destination vertex of the path.

Supplementary to the base VRP, additional constraints are imposed by the ride sharing system. The main purpose of this type of system is to match clients that travel similar itineraries and schedules. In this study, we will focus solely on matching clients that have similar itineraries. In our ride-sharing system, each customer provides a pick-up and destination vertex, and the system will decide the matching.

• REQ4: Analyze and solve a batch of travel requests that contain the pick-up and delivery location of each customer.

The system must match the requests that share common road segments(edges) and must assign the request to vehicles. When considering the vehicle fleet that will service the customers we must take into account that the number of seats available for customers is finite. We can derive the following requirement:

• REQ5: Match the requests that share common road segments(edges) and assign the requests to vehicles.

Although we assume that the vehicle fleet is not limited in its capacity, we strive to group customers in cars such that less cars are used and less kilometers are ridden, but the user comfort and satisfaction is not compromised. Thus, to assess the correctness of the algorithm, we add the following requirement:

(11)

• REQ6: Minimize the number of cars used to transport the customers.

Finally, the VRP and vehicle assignment needs to be fast, even for large numbers of requests. We strive to achieve a minimum throughput even when the number of requests grows significantly. Specifically, our performance-bound requirement is:

• REQ7: The solution should provide a throughput of at least 1000 requests/second.

These seven requirements guide the design, implementation, and analysis of our system and its prototype.

(12)

Chapter 4

The Design and Implementation of a

HPC VRP system (SQ2)

In this chapter we will discuss the design and implementation details that enabled us to build a VRP system that is compliant with the requirements described in Chapter3.

4.1 Design overview

In order to satisfy REQ1, we searched graph structures that depict real-world road networks. We selected a total of six graphs from two sources: the Transportation Networks for Research1 _from

the Ben-Gurion University of Negev, and from the Stanford Large Network Dataset Collection2

from Standford University. From the former source we selected graph networks that depict the road networks for three cities: Birmingham from United Kingdom (14,639 nodes), Sydney from Australia (33,113 nodes), and Philadelphia (13,389) from the United States of America. From the latter source we selected graph networks that depict the road networks that encompass larger geographical areas composed of three states from the United States of America: California (1,965,206 nodes), Pennsylvania (1,088,092 nodes), and Texas (1,379,917 nodes).

All six graph networks are directed and represented using an edge list that depicts the start and the end node of each connection. In this study, the weight of each edge expresses the time (in minutes) needed to traverse the edge from start to finish, and it is considered constant. As the edge weight data is generally omitted from the graph networks, we generate a random cost, between 1 and 10 minutes, for the weight of each edge.

In order to satisfy REQ2, we need a method to determine the least cost between a given pair of nodes in the road graph. Computing the least cost path in a graph is one of the most common problems in computer science and network optimization [23]. The Single-Source Shortest Path (SSSP) algorithm computes the weight of the shortest path (SP) from a specific vertex (source) to all other vertices [23].

SSSP considers all the nodes in the graph as potential destinations. Therefore, it is beneficial in a ride sharing scenario: all requests that have the same starting node can use the results of a single SSSP run.

A disadvantage of using the SSSP algorithm, and a general challenge for VRP, is the scalability of the solution for large graphs and large numbers of requests from different pick-up points. Our solution needs to take into account two important dimensions of scalability: cost (expressed in time) and space (expressed in memory footprint). In order to balance out the two dimensions, we have considered two scenarios based on the size of the input graph.

• Scenario 1: small road networks. If the size of the road network graph is limited to a single city, it is both cost and space-effective to run the SSSP algorithm from every vertex in order to compute the least cost between any pair of vertices. This is a form of another algorithm, called All Pairs Shortest Paths (APSP). The least cost data that results after the APSP algorithm is ran, allows the ride sharing system to quickly access cost information regardless of the starting or destination node.

1_{https://github.com/bstabler/TransportationNetworks} 2_{https://snap.stanford.edu/data/}

(13)

• Scenario 2: large road networks. If he road network is large (i.e„ over a million nodes), the cost and the space needed to run the APSP algorithm would be too high for a dynamic ride sharing system. A full APSP on a graph with a couple of million vertices and tens of millions edges can run for hours, and the space needed to store the data could amount to tens or hundreds of gigabyes of memory. Therefore, a more conservative approach should be taken: once a batch of requests is gathered (say, in the order of tens of hundreds), we run the SSSP algorithm from the starting node of each request, avoiding duplicates.

To address REQ3, we have extended the core SSSP algorithm to also compute the nodes that constitute the shortest path. Throughout this study we will refer to this algorithm as SSSP with path reconstruction. Our extension allows the system to reconstruct the least-cost path. We have considered a space efficient method to to store the succession of nodes that are part of the least cost path. It implies storing the previous neighbor needed to get from each vertex to the staring vertex [24]. We obtain an array equal in size to the number of nodes in the graph, as opposed to storing the series of nodes needed to get from each node to the starting node. In total, the space needed to process and store the results of the SSSP algorithm with path reconstruction is 2 × |V |. For REQ4, the system needs to be able to analyze a batch of requests from multiple customers. Each request contains a pick-up and a destination point. In scenarios that utilize very large graphs, the system is capable of running the SSSP algorithm for every unique starting location vertex of the requests in the batch. We use the cost and path retrieval data generated by the SSSP algorithm to retrace the vertex succession of the least-cost path. We further take into account the cost of each edge that is part of the path. The system uses this aggregated information in order to match requests from clients and, as a result, group them as a shared ride.

In order to satisfy REQ5, the system needs to match requests that have common road segments (edges) and assign the requests to the vehicle fleet. We have created a custom request matching (RM) algorithm that abides to a series of constraints imposed by the ride sharing scenario. The RM algorithm selects the request that share a common route and computes a score for each such pair (i.e, we provide a compatibility metric and a procedure to compute it). The requests with the best compatibility will share the same vehicle. Please note that we impose one restriction for compatible rides: no deviation is allowed for picking up new passengers. This restriction ensures that in case a client fails to arrive at the pick-up location, the system does not impose further delays on the customers that are already present in the vehicle.

For REQ6, the system needs to minimize the number of cars used to transport the customers. We have created a custom algorithm that assigns customers to vehicles based on the compatibility metric described in REQ5. We consider that the fleet of vehicles is not limited in its capacity, but each vehicle has a finite number of available seats for customers. In the first phase the assignment strategy starts out by assigning a vehicle for the first(i.e, top) request in the batch. In the second phase, compatible requests(i.e, the ones with the highest compatibility rating) are assigned to the vehicle until no available customer sets are left. In the third and final phase the requests that were assigned to the vehicle are removed from the batch and their compatibility values are set to zero. This ensures that subsequent requests assignments will not consider unavailable ride partners. The three phases are repeated until there are no request left in the batch.

REQ7 is a quality requirement and concerns the performance (throughput) of the solution. We will demonstrate that this requirement is met by analyzing the performance throughput expressed as the number of requests assigned per second.

4.2 A High-level Architecture of the System

Effectively, the high-level architecture of the system has six stages: (1) read the graph to memory, (2) read and batch requests, (3) compute SSSP or APSP, (4) compute the common cost metric to determine ride compatibility, (5) apply the ride-sharing algorithm to rank the current batch of requests, and (6) assign requests to vehicles. The process repeats from stage (2) as long as the system is operational (e.g., for 12h or 24h worth of requests). Figure 4.1 depicts the high level architecture.

(14)

Figure 4.1: High level architecture

4.3 Implementation details

In this section we will discuss the most interesting implementation challenges and our proposed solutions for the ride-sharing system.

4.3.1 Heterogeneous computing and task assignment

Heterogeneous computing strives to assign the right application tasks to the best suited architec-ture. In order for the ride sharing prototype to take advantage of both the GPU and the CPU processors, we set out to identify which tasks exhibit a high degree of parallelism. We have identi-fied two suitable candidates - the SSSP algorithm and the ride matching algorithm, which are to be executed on the GPU.

Running SSSP on the GPU requires the graph to be stored in the GPU memory. While the best data structrue for SSSP is an adjacency matrix, this method of storage requires O(n2_{) memory}

and is not efficient for sparse graphs [10]. Moreover, it limits the size of the graphs that can fit into the limited GPU memory. Therefore, we opt for a more space-efficient solution, and use the CSR format (see section2.4.2).

In order to analyze the customer requests on the GPU, the ride matching algorithm needs to have access to the series of vertices that constitute the SSSP from the starting node to ending node of each request, as well as the cost of each edge on the path. We have created a custom format the meets our needs using three arrays: (1) the vertices array contains the succession of nodes that constitute the result of the SSSP for each request; (2) the offset array contains the indexes that mark the position of the start vertex of every request inside the vertices array; (3) the weight array contains the weight for each edge between every successive pair of vertices in the vertex array.

As for the CPU tasks, the CPU has to read the graph from the file (edge-based format) and store it using the CSR format in memory. We use the networkx library3 _{to convert the road}

network graph represented as an adjacency list to the CSR format. This is considered a one-time only operation, as the CSR information remains in memory for the whole duration of the processing. The CPU will also read and pre-process requests to be ready for the common-cost algorithm. Next, it will execute the ride matching and finalize the allocation. Finally, because we want to run the SSSP and common cost algorithms on the GPU, the CPU has to also perform several device-management tasks: (1) memory allocation on both the host and the device, (2) data transfers to and from the GPU, and (3) trigger the actual execution of the SSSP and RM algorithms, respectively.

In summary, Figure 4.2 depicts a task-level view of the implementation of our ride-sharing system. We indicate the suitable processor architecture for each task: the blue tasks are ran using the CPU, and green tasks are ran on the GPU. The following paragraphs dive into the details of the two GPU algorithms we have designed and implemented.

(15)

Figure 4.2: Task Distribution

4.3.2 The SSSP Algorithm

In this subsection we analyze the implementation of the SSSP algorithm that runs on the GPU. The algorithm solves the core SSSP problem described in section4.1and utilizes an array of integer values to store the cost of each vertex.

The SSSP algorithm is divided into two separate kernels: a data initialization kernel and a cost minimization kernel. The pseudo code of the algorithm (see Algorithm1) describes the initialization kernel, which sets the value of the staring vertex of the SSSP to 0 while all the other values are set to ∞ (maximum integer value).

Algorithm 1 Algorithm used to initialize the the cost array.

Input: Start vertex s, array vertexCost[0..n − 1] holding the cost of each vertex; Output: Array vertexCost[0..n − 1] holding the updated cost of each vertex;

1: _{procedure initSSSPValues(s, vertexCost)}

2: v ← threadID . One thread per vertex.

3: if v = s then . check if the current vertex is the starting vertex

4: vertexCost[v] ← 0

5: else

6: vertexCost[v] ← ∞

The pseudocode in Algorithm 2 describes the cost minimization kernel. It is inspired by the vertex parallel method for the breadth first search algorithm [10]. The algorithm assigns a thread to each vertex and then loops through all the neighbors of the vertex in order to minimize their cost. The algorithm computes the cost of traveling from the vertex to each of its neighbors. If the cost of traveling to the neighbor is lower that the current cost of the neighbor then the cost of the neighbor, it is updated using an atomic operation. This method prevents race conditions that may occur if multiple threads try to update the cost of the neighbor at the same time. Finally if any of the threads minimize the cost of a vertex, the loop variable is set to the value of 1, indicating that a further iteration of the algorithm may be able to minimize the cost of vertices further.

The number of iterations needed to minimize all the vertices in the graph is not know beforehand. Therefore, after each run of the algorithm, the CPU checks whether the loop variable is 0 or 1 in order to assert if further cost minimizations are possible.

(16)

Algorithm 2 Algorithm used to analyze the neighbors of each vertex and minimize their cost Input: Array vertexCost[0..n − 1] holding the cost of each vertex, row offset array R, column indices array C, weight array W, loop value indicates if further cost minimizations are possible. Output: Array vertexCost[0..n − 1] holding the cost of each vertex.

1: _{procedure analyzeNeighbourVPM(vertexCost, R, C, W, loop)}

3: if vertexCost[v] 6= ∞ then . Prevent vertex imbalance.

4: for r ← R[v] to R[v + 1] by 1 do . For every neighbor of the vertex

5: neighbor ← C[r] . neighbor vertex id

6: cost ← vertexCost[v] + W [r] . the cost of traveling to the neighbor vertex

7: ret ← atomicM in(vertexCost[neighbor], cost) . atomicMin prevents race conditions

8: if ret > cost then . check if the cost was updated

9: loop ← 1 . loop one more time

4.3.3 SSSP algorithm with path reconstruction

We further analyze the implementation of the SSSP algorithm with path reconstruction that runs on the GPU. The algorithm aims to extend the core SSSP problem presented in section4.3.2 by additionally computing the least cost path from each vertex to the starting vertex of the SSSP.

The pseudo-code presented in Algorithm 3 describes the cost minimization kernel with path retrieval, and builds upon algorithm2.

The difference between the two algorithms is that algorithm3introduces two arrays (equal in size to number of edges in the graph): the previousN eighbors array and the previousN eigborsCost array. Due to the structure of the road graph we have to take into account that the same vertex may have multiple previous neighbors that lead towards the source vertex. Due to this reason we will store all the possible previous neighbors ids in the previousN eighbors array. In order to select the previous neighbor that is part of the shortest path we must consider the minimization cost of that particular neighbor. We will store the minimization cost of cost of each neighbor in the previousN eigborsCost array.

Algorithm 3 Algorithm used to analyze the neighbors of each vertex and minimize their cost and path retrieval

Input: Array vertexCost[0..n − 1] holding the cost of each vertex, array previousN eighbours holding the previous vertex one would have to follow to get to the source vertex, previousN eighboursCost holding the previous vertex cost, row offset array R, column indices array C, weight array W.

Output: Array vertexCost[0..n − 1] holding the cost of each vertex, array previousN eighbours holding the previous vertex that one would have to follow to get to the source vertex, array previousN eighboursCost holding the previous vertex cost.

1: _{procedure analyzeNeighbourVPMPreviousNeighbor(vertexCost, previousN eighbours,}

previousN eighboursCost, R, C, W, loop)

3: if vertexCost[v] 6= ∞ then . Prevent vertex imbalance.

4: for r ← R[v] to R[v + 1] by 1 do . For every neighbor of the vertex

5: n ← C[r] . neighbor vertex id

6: cost ← vertexCost[v] + W [r] . the cost of traveling to the neighbor vertex

7: ret ← atomicM in(vertexCost[n], cost) . atomicMin prevents race conditions

8: if ret > cost then . check if the cost was updated

9: previousN eighbours[r] = v; . update the previous neighbor

10: previousN eighboursCost[r] = cost; . update the previous neighbor cost

11: loop ← 1 . loop one more time The pseudo code in Algorithm 4 describes the processNeighbors kernel, which is responsible for processing the previous node id of every neighbor of each vertex and selecting the least cost

(17)

previous vertex. Due to the parallel nature of algorithm3, the previous node id may be the same or different for multitude of neighbors. Table4.1 presents an example of such a case extracted form a real world test graph. The correct previous neighbor is the one that has the least cost. In the cases where there are multiple distinct previous neighbors with the same cost, the algorithm will select the first one. The algorithm will loop through all the previous neighbor values and will determine the least cost neighbor.

Vertex Previous neighbor Minimization cost 189 186 6

189 224 8

Table 4.1: Previous neighbor id and cost for vertex 189

Algorithm 4 Algorithm used to analyze the neighbors of each vertex and determine the least cost neighbor

Input: Number of edges e, array vertexCost[0..n − 1] holding the cost of each vertex, array previousN eighbour holding the previous vertex that one would have to follow to get to the source vertex, previousN eighbours holding the previous vertex one would have to follow to get to the source vertex, previousN eighboursCost holding the previous vertex cost, column indices array C. Output: array previousN eighbours holding the previous vertex that one would have to follow to get to the source vertex, array previousN eighboursCost holding the previous vertex cost.

1: _{procedure processNeighbors(e, vertexCost, previousN eighbour, previousN eighbours,}

previousN eighboursCost, C)

3: prevN eighbour ← (−1) . set default value

4: bestM atchCost ← ∞ . set default value

5: for index ← 0 to e by 1 do . For every index

6: tmpN eighbour ← previousN eighbours[index] . set default value

7: if tmpN eighbour 6= (−1) then . unreachable vertex

8: if C[index] == v then . vertex is found

9: if previousN eighboursCost[index] < bestM atchCost then . update cost

10: prevN eighbour ← tmpN eighbour . update previous neighbor

11: bestM atchCost ← previousN eighboursCost[index] . update bestMatchCost

12: if prevN eighbour > (−1) then . previous neighbor found

13: previousN eighbour[v] ← prevN eighbour . update previous neighbor

4.3.4 Ride matching algorithm

In this subsection we analyze the implementation of the ride matching algorithm that runs on the GPU. The pseudo-code in algorithm5 describes the ride matching kernel, which is responsible for processing the common cost and compatibility metric for a batch of requests.

The algorithm utilizes a custom format to store the request described in subsection4.3.1, where three distinct arrays O, V, W hold the offsets for the vertices of each request, the succession of vertices for every request, and the weight of every edge in the V array, respectively. Each thread compares two requests, and loops through the vertices of the first request to check if any of them is equal to the first vertex of the second request. This is done to ensure that the vehicle will not have to diverge from the shortest path in order to pick up another customer. If matching vertices are found, the algorithm determines the maximum number of common vertices left for both requests, then loops through each remaining vertex to ensure that no further path divergence occurs while adding up the common cost of each common edge. The algorithm finishes by computing the compatibility metric. This is done by dividing the common cost with the cost of the highest cost

(18)

Algorithm 5 Algorithm used to compute the common cost and compatibility metric

Input: Number of threads tn, array O holding the offset for the vertices of each request, array V [0..n2_{− n] holding the vertices of all the requests that will be analyzed, array W holds the weight}

of every edge in the V array, array costArray holds the total cost of the request, array f irstReq holds the vertices of the first request in the pair, array secondReq holds the vertices of the second request in the pair, array requestSharedCost will hold the shared cost, array requestCompatibility holds the compatibility metric.

Output: Array requestSharedCost will hold the shared cost, array requestCompatibility will hold the compatibility metric.

1: _{procedure commonCost(tn, O, V, W, costArray, f irstReq, secondReq, requestSharedCost,}

requestCompatibility)

2: idx ← threadID . One thread per request.

3: if idx < tn then

4: sharedCost ← 0

5: requestDiverge ← f alse

6: requestOne ← f irstReq[idx] 7: requestT wo ← secondReq[idx]

8: requestOneStartOf f set ← O[requestOne]

9: requestOneEndOf f set ← O[requestOne + 1]

10: requestT woStartOf f set ← O[requestT wo]

11: requestT woEndOf f set ← O[requestT wo + 1]

12: requestT woSize ← requestT woEndOf f set − requestT woStartOf f set

13: requestT woStartV ertex ← V [requestT woStartOf f set]

14: for index ← requestOneStartOf f set to requestOneEndOf f set by 1 do

15: if V [index] = requestT woStartV ertex then

16: maxCommonV ertices ← min(requestOneEndOf f set − index, requestT woSize)

17: while (maxCommonV ertices − 1) > 0 do

18: if V [index + 1] = V [requestT woStartV ertex + 1] then

19: sharedCost ← sharedCost + W [index − 1]

20: else

21: requestDiverge ← true

22: break . Exit the while loop

23: break . Exit the for loop

24: if requestDiverge 6= true then

25: requestSharedCost[idx] ← sharedCost . update the shared cost

26: maxCost ← f maxf (costArray[requestOne], costArray[requestT wo])

27: requestCompatibility[idx] ← _f div_rd((f loat)sharedCost, maxCost)

4.3.5 Ride assignment algorithm

The ride assignment algorithm runs entirely on the CPU and is responsible for assigning requests to vehicles based on the compatibility metric array obtained using algorithm5. The requests are processed in order of entry, from the first request to the last one in the batch. A vehicle is assigned for each request. The algorithm then loops through the compatibility metric array indexes of the the request and assigns compatible requests in decreasing compatibility order until there are no more passenger seats left in the vehicle. The final step consists in removing the all the requests that have been assigned to a vehicle. This is done to ensure that subsequent request assignments will not consider request that have already been assigned to vehicles.

In summary, our ride-sharing system uses both the CPU and GPU to process a batch of requests on a given road network. The system uses four GPU kernels: one for initialization, two for the SSSP with path reconstruction, and one for the computation of the common cost and compatibility metric. The following chapter focuses on the performance of these kernels and the system.

(19)

Chapter 5

Performance Measurement and

Analysis (SQ3)

In this chapter we present the experiments we have designed and performed to assess the perfor-mance of our ride-sharing system. We further analyze the results we have observed.

5.1 Experimental setup

All experiments have been run on a computer with the following specifications: • CPU: Intel Core i5-6600 Processor1

• RAM Memory: 16 Gb2

• GPU: Nvidia 1060 GTX 6GB3

• OS: Ubuntu 16.04, 64 bits4

• C/C++: NVCC compiler V8.0.61, optimization level 3 • OpenMP library: GOMP version 5.4.0

• CUDA: version 8.05

While measuring the performance of our application, we made sure all unnecessary tasks in the system were shut off. Each experiment was run four times, and we report here the average of the observed time. The variation in performance between runs was negligible (below 2%). Thus, we have not included any error bars in the graphs.

We have used six different datasets, representing six different road networks. Table5.1presents their properties: number of vertices, number of edges, average degree, and diameter (longest SP).

City Number of vertices Number of edges Average Degree Diameter Birmingham 14,639 33,937 2.32 75 Philadelphia 13,389 40,003 2.99 76 Sydney 33,113 75,379 2.28 164 Pennsylvania 1,088,092 3,083,796 2.83 786 California 1,965,206 5,533,214 2.82 849 Texas 1,379,917 3,843,320 2.78 1,054

Table 5.1: Properties of the road graphs

In order to determine the graph diameter for the road networks presented in5.1we have used the Stanford Network Analysis Platform C++ library6_{. Specifically, we used the netstat utility to}

(20)

plot the hop-graphs (number of reachable pairs of nodes in the graph) for all datasets. They are presented in Figures5.1,5.2,5.3,5.5,5.6, and5.4.

Figure 5.1: Birmingham hop-plot graph.

Figure 5.2: Philadelphia hop-plot graph.

Figure 5.3: Sydney hop-plot graph

Figure 5.4: California hop-plot graph

Figure 5.5: Pennsylvania hop-plot graph

Figure 5.6: Texas hop-plot graph

5.2 Results for the SSSP algorithm

We further present the performance results of running the SSSP algorithms1 and2 on the GPU. We have compared the performance of our GPU SSSP algorithm against two other SSSP imple-mentations: a parallel CPU version provided in the the Gap benchmark suite [25] and a GPU version offered by the proprietary graph analytics library from Nvidia[2.4.4].

The execution time we report as the performance of our SSSP algorithm is the combined time taken by the initialization and the distance calculation (i.e., Algorithms 1 and 2, respectively). Figures5.7, 5.8, 5.9, 5.10 5.11, 5.12illustrate the experimental results derived from running the SSSP algorithm on the road graphs for Birmingham, Philadelphia, Sydney, California, Pennsylva-nia, and Texas, respectively. In all these figures, each three-bar group represents the execution of one complete SSSP run from one start node; inside each group, each bar stands for one version of the SSSP algorithm: the GAP version running on the CPU (blue), ours (orange), and the nvgraph one (grey).

(21)

Figure 5.7: The performance of running SSSP for the Birmingham dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

Figure 5.8: The performance of running SSSP for the Philadelphia dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

(22)

Figure 5.9: The performance of running SSSP for the Sydney dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

Figure 5.10: The performance of running SSSP for the California dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

(23)

Figure 5.11: The performance of running SSSP for the Pennsylvania dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

Figure 5.12: The performance of running SSSP for the Texas dataset. Each bar represents the execution time of a single SSSP, starting from one node. Different colored bars stand for different versions of the code.

(24)

algorithms, and the SSSP algorithm from the nvgraph framework2.4.4. City Vertices CPU GPU GPU nvgraph

(s) (s) (s) Philadelphia 13,389 22.09 13.09 51.89 Birmingham 14,639 19.64 19.50 71.51 Sydney 33,113 98.15 86.26 325.75

Table 5.2: Total time needed to run the SSSP algorithms for city road graphs. State Vertices CPU GPU GPU nvgraph

(s) (s) (s) Pennsylvania 100 2.89 24.90 63.00 Texas 100 3.47 30.93 84.57 California 100 4.45 36.45 96.33

Table 5.3: Total time needed to run the SSSP algorithms for the state road graphs. Finally, table5.4 contains the average execution time and standard deviation for running one SSSP for each of the three algorithms under consideration.

City/ Vertices CPU GPU GPU nvgraph Standard State (ms) (ms) (ms) deviation Philadelphia 13,389 1.65 0.98 3.88 1.51 Birmingham 14,639 1.34 1.33 4.88 2.04 Sydney 33,113 2.96 2.60 9.84 4.08 Pennsylvania 100 28.88 249.08 630.12 304.18 Texas 100 34.73 309.28 845.71 412.47 California 100 44.53 364.46 963.27 466.37

Table 5.4: Average time needed to run the SSSP algorithms for the Philadelphia, Birmingham and Sydney, Philadelphia, Texas, California road graphs

We make the following observations regarding the results:

• For all datasets, the nvgraph performance is significantly worse than the other two versions. As the library is closed source we can only make some assumptions about the results. Firstly, Nvgraph SSSP algorithm uses the CSC format. We used a Nvgraph graph topology conver-sion function to convert from the CSR topology to the CSC topology. Secondly, the Nvgraph SSSP algorithm requires the edge weights to be expressed as floating point numbers. We suspect that the different data types influences the differences in running time between the two GPU implementations of the SSSP algorithm.

• For the city graphs, our implementation is competitive against the highly optimized CPU version. For the state graphs, however, the CPU parallel implementation is significantly faster. This happens because of two reasons: (1) the data structure we use (the CSR) is more complex and requires more memory accesses with less locality, and (2) their algorithm is an optimized version of the delta-stepping algorithm, a different approach than the one we took, much more suitable for CPUs than GPUs, and extremely useful for graphs with large diameters.

• For all versions, the execution time is highly dependent on the starting node, best seen by the large standard deviation reported in Table5.4.

• For all versions, the difference in the execution time of SSSP varies significantly per graph. For example, the average execution time of SSSP for Texas is over 10 times larger than for Pennsylvania, and 100 times larger than for Sydney.

(25)

5.3 Results for the SSSP algorithm with path reconstruction

We move on and discuss the results of our experiments with SSSP with path reconstruction. We report two different execution times - the SSSP itself (Algorithm3,) and the path reconstruction (Algorithm 4), the sum of which represents the time taken by the complete SSSP with path reconstruction.

We present the results in two tables: one focusing on the overall execution of all SSSP’s from the graphs, and the second one focusing on the average execution time (reporting also its standard deviation). The data is summarized in Table5.5and Table 5.6, respectively.

City Vertices Old SSSP SSSP Process Neighbors Total (s) (s) (s) (s) Philadelphia 13,389 13.09 13.63 125.91 139.54 Birmingham 14,639 19.50 19.97 119.96 139.93 Sydney 33,113 86.26 90.77 1347.37 1438.14

Table 5.5: Time needed to run the SSSP and previous neighbor algorithms for the Philadelphia, Birmingham and Sydney road graphs. Old SSSP is added for comparison, and it is the version from Algorithm2.

City Vertices Old SSSP SSSP Process Neighbors Total Standard (ms) (ms) (ms) (ms) deviation Philadelphia 13,389 0.98 1.03 9.40 10.43 0.03 Birmingham 14,639 1.33 1.37 8.19 9.56 0.02 Sydney 33,113 2.60 2.75 40.69 43.44 0.10 Table 5.6: Average time needed to run the SSSP and previous neighbor algorithms for the Philadel-phia, Birmingham and Sydney road graphs. Old SSSP is added for comparison reasons, and it is the version from Algorithm2.

Based on these results, we make the following observations:

• The new version of SSSP is not affected significantly by the additional computation. • The kernel analyzing the neighbors is very time-consuming. This happens because the

algo-rithm does not take full advantage of the GPU. The computational imbalance is created due to the fact that some threads will would have to perform more computations when analyzing the value of each vertex, while other threads will finish earlier and remain idle.

• No data about the large graphs is available because in this study we are mainly concerned about city wide graphs. They represent the most realistic target for an dynamic ride sharing service because the number of compatible request would be on average much higher than on state wide graphs. We also assume that clients expect the taxi to arrive in a short period of time to the pick-up point.

5.4 Results for the common cost algorithm

We continue the presentation with the performance results for the common cost algorithm (i.e.,5) running on the GPU. To further understand the performance of the GPU implementation, we have also implemented two distinct CPU variants of the common cost algorithm. The CPU variants utilizes the same logic code as the GPU variant. The sequential CPU variant of algorithm5 loops sequentially through all the requests. The parallel CPU variant adds an OpenMP7for loop pragma with a dynamic scheduler, due to the fact that the size of each request is of random length.

Our measurements are based on a set of random requests, grouped in batches of different sizes - from 20 to 1000. We have considered the city-wide road networks: Birmingham, Sydney and

(26)

the results for the other two networks (Philadelphia and Sydney, respectively). In all tables we have also included the CPU to GPU memory transfer in order to aid the performance analysis between the different algorithms. The memory transfer time takes into account the time needed to transfer memory from the CPU to the GPU and the time needed to transfer the results computed in GPU memory back to CPU memory.

Figure 5.13: The performance of running the ride sharing algorithm on the Birmingham dataset.

Number of CPU-sequential CPU-parallel GPU Memory transfer requests (ms) (ms) (ms) (ms) 20 0.19 0.01 0.03 0.62 40 0.12 0.02 0.03 0.65 60 0.30 0.04 0.03 0.64 80 0.39 0.06 0.03 0.66 100 0.41 0.09 0.04 0.44 200 4.85 0.31 0.04 0.55 300 9.00 0.71 0.08 1.01 400 19.27 7.76 0.13 1.54 500 25.66 10.58 0.19 1.88 600 40.43 11.56 0.25 2.61 700 65.93 14.07 0.33 3.06 800 87.02 15.02 0.44 3.62 900 108.29 27.83 0.56 4.34 1000 121.39 40.66 0.65 4.86

(27)

Figure 5.14: The performance of running the ride sharing algorithm on the Philadelphia dataset.

(28)

Figure 5.15: The performance of running the ride sharing algorithm on the Sydney dataset. Number of CPU-sequential CPU-parallel GPU Memory transfer

requests (ms) (ms) (ms) (ms) 20 0.12 0.03 0.04 0.61 40 0.17 0.04 0.04 0.65 60 0.25 0.06 0.05 0.89 80 0.42 0.09 0.06 0.69 100 0.48 0.13 0.05 0.68 200 1.54 0.41 0.07 0.82 300 7.10 0.97 0.14 1.06 400 16.61 11.11 0.20 1.37 500 34.79 13.65 0.32 2.05 600 51.63 15.62 0.42 2.81 700 63.82 16.66 0.57 3.33 800 93.54 17.36 0.71 4.14 900 106.96 41.90 0.90 4.53 1000 139.71 46.22 1.06 5.14

Table 5.9: The performance of running the ride sharing algorithm on the Sydney dataset. Based on the data, we note that the GPU outperforms the CPU only when we reach a batch of 80 requests. When we take into account the supplementary time added by memory transfers between the two processors, we can see that the GPU is able to outperform the CPU only when we reach a batch of 400 requests.

5.5 Worst case scenario for the ride matching algorithm

The performance of the ride matching algorithm is a major contributor in the success or failure to meet the performance demands imposed by REQ7. In order to assert the performance of the common cost algorithm, we have designed a worst-case scenario - i.e., a scenario with high computation cost: given a SSSPath that contains 2000 vertices, we analyzed the time needed to run the common cost algorithm using a variable number of requests (up to a maximum of 1000 requests). Each request is identical and the start vertex of the request is the first vertex of the SSSPath, while the destination vertex of the request is the final vertex of the SSSPath. This setup

(29)

represents the worst case scenario because the algorithm will access each vertex id on the SSSPath and will add up the cost of each edge on this path.

In order to test the performance of the common cost on the CPU, we use both the serial and parallel CPU implementation of Algorithm5 described in Section4.3.4.

Figure 5.16depicts the worst case scenario, where the blue line represent the time needed to to run the ride matching algorithm using the serial CPU implementation of the algorithm, while the The yellow and the red lines represent the time needed to to run the same algorithm using the parallel CPU implementation and the GPU version respectively.

Figure 5.16: Worst case ride matching scenario. Table5.10contains a summary of the experimental data itself.

(30)

As described in subsection5the algorithm utilizes three distinct arrays (O,V,W) for the custom format, with the following space considerations:

The following enumeration we will use requestN umber in order to symbolize the total number of requests.

• Array O has requestN umber + 1 elements.

• Array V has is equal in size to the total number vertices that form the shortest path for every request.

• Array W array is equal in size to array V .

The results of the computation requires 2 ∗ (requestN umber2_{− requestN umber) elements.}

When we consider the space requirements for a worst case scenario where every request has a shortest path that has 2000 elements we can derive that the total number of required elements can be computed using : 1 + 3999 ∗ requestN umber + 2 ∗ requestN umber2

If we consider that the size of an element is represented as an integer and requires 4 bytes of memory we can deduce that the maximum number of request that can be processed using a GPU with a total of 6 gigabytes(i.e, the experimental setup is described in section5.1) of memory is 26404.

We observe from the data that the GPU outperforms the CPU when we reach 80 requests even when we take into account the memory transfers between the CPU and GPU. When we reach 1000 request the GPU is 2.6x faster than the CPU. We must also note that under a high workload, the parallel version of the ride matching often fails to perform better then the serial version of the same algorithm.

5.6 Assignment of requests to vehicles

Finally, this section presents the experimental results of running the ride sharing prototype. We present the running time of the prototype considering random requests included in batches of different sizes (from 20 to 1000) for the three city-wide road graphs described in Section5.1. The experiments will consider the tasks described in Section4.2, starting with the analyze request task and ending with the assign requests to vehicles task. Figure5.17and table5.11present the results of the experiment.

(31)

Number of Birmingham Philadelphia Sydney requests (ms) (ms) (ms) 20 0.68 0.71 1.01 40 0.78 0.84 0.91 60 1.01 1.06 1.54 80 1.46 3.73 1.55 100 1.81 1.77 2.19 200 31.82 39.81 31.86 300 69.51 68.41 67.42 400 104.82 109.49 111.68 500 160.79 159.88 162.77 600 257.53 251.88 252.57 700 365.34 366.11 369.62 800 527.67 524.62 529.34 900 729.52 731.15 726.30 1000 976.76 980.77 975.25 Table 5.11: Assignment of requests to vehicles

We observe that we are able to deliver on the performance restriction imposed by REQ7, by assigning 1000 random request to vehicles in under one second. The ride sharing prototype benefits greatly from the use of the GPU accelerator only when the number of request is relatively high and is able to outperform the CPU by 2.6x times in a worst case scenario situation. Overall, we believe that the ride matching algorithm represents an advantage when executed on the GPU, as we believe that in a real world scenario the requests will be more condensed that in a random experimental setup, and as such the performance demands will certainly tip towards the worst case scenario that the random one.

5.7 Prototype validation

In this section we discuss the validation methodology behind the SSSP algorithm and the common cost algorithm.

In order to validate the correctness of the SSSP algorithm we used the nvgraph graph library supplied by Nvidia with the CUDA 8.0 development toolkit. The library is closed source and offers parallel algorithms for high performance graph analytics including a variant of the SSSP algorithm that computes the minimum cost needed to travel from a starting vertex to all other vertices in a given graph. The nvgraph library is able to process CSR representation of the road graphs, as such given the same input data representation we were able to test if the output of the algorithms is identical.

The nvgraph SSSP algorithm does not offer any functionality that can be used to retrieve the nodes that constitute the path of the SSSP. In order to determine that the SSSP is correct, we implemented a custom test that traverses the nodes of the SSSP and computes the cost of the path. We then cross reference the added cost with the total cost of the path in order to assert that correctness of the results. The test design takes into account the parallel nature of the SSSP algorithm. If we consider the graph depicted in figure5.18we can see that there are two possible cost identical paths from vertex 1 to vertex 4. The SSSP algorithm will try to minimize the distance in parallel between nodes 2, 3 and 4, and the first thread that will minimize the distance first will hold the value of the previous neighbor to the source vertex 1. As such both node 2 and 3 are valid solutions for the previous neighbor associated to node 4.

In order to validate the common cost algorithm we ran custom test scenarios. A directed graph containing a succession of connected vertices from 0 to n+1 where each edge weight is 1 provides a high degree of flexibility needed to create a multitude of SSSPs. In order to test vertex divergence scenario we added an extra edge from vertex n-1 to vertex n+1. A visual representation of a trivial graph example is provided in figure5.19. We have considered the following scenarios:

(32)

• The two SSSPs share a single common vertex along their paths. • The two SSSPs do not share any common vertices along their paths.

Figure 5.18: Random neighbor

Figure 5.19: Trivial test graph for the common cost algorithm

In summary, we have stress-tested our algorithms and found they are functioning correctly. Combining this validation with the performance evaluation, we have provided empirical evidence for the feasibility of a HPC ride-sharing system using GPUs.

(33)

Chapter 6

Conclusion and Future Work

With the increase in complexity of all transportation networks, efficient logistics is an important challenge, impacting many aspects of our lives (e.g., from package delivery to traveling).

An example of a problem identified as critical in the context of logistics and transportation is vehicle route planning (VRP). In this thesis, we focused on analyzing a potential HPC solution for this problem: the use of GPU computing. Specifically, we pursued an answer to the following research question:

What is the impact of using GPGPUs on the design, implementation, and perfor-mance of a VRP system in the context of a ride-sharing scenario?

We based our approach on an empirical analysis, where we proposed and implemented the prototype of a GPU-enabled VRP system in the context of a ride-sharing scenario. Our prototype is based on a three stage algorithm, where the first stage consists of running our custom SSSP implementation capable of computing the elements that are part of the least cost path. Retrieving the least cost path is our main contribution to the core SSSP algorithm, and the implementation proved challenging to due to the parallel nature of the algorithm, and the restrictions of the GPU hardware.

The second stage consists of computing the common cost and establishing the compatibility metric for a batch of requests. The high number of computations in the second stage allowed us to take advantage of the GPU. Our core contribution here was to define a ride compatibility scenario and a metric for it. Moreover, to the best of our knowledge, there is no previous research regarding ride sharing systems that utilizes GPUs.

The third stage consists of the assignment of requests to vehicles, based on the compatibility metric obtained in the previous stage. The algorithm enforces the rules defined by our ride sharing scenario (i.e., we aim to group requests based on compatibility, while minimizing each traveler’s delay and the number of cars used to transport the customers).

We have used the prototype for a thorough analysis of GPU-enabled VRP, using six different real-world road networks. Our main findings, performance-wise, are:

• Our system requirements are a good fit for the ride sharing system, and have provided enough architectural flexibility to implement a task-based heterogeneous computing prototype and answer our main research question.

• Our prototype implementation focuses on the flexibility, performance and scalability of the solution. We strive to keep the solution as flexible and reusable as possible: tasks can be easily replaced or optimized without compromising the overall use of the system (see4.2). The prototype only depends on the CUDA toolkit and the C++11 standard (i.e., there are no dependencies on external libraries). From a scalability standpoint, our system is able to handle large graphs and, more importantly, will increase in performance in the future as the GPU and CPU hardware will evolve in terms of memory and computational power.

(34)

implementations we have used showed significant limitations in performance, but some of these can be blamed on the naive versions of the code. More optimizations are needed before a thorough comparison can be made. When we consider the ride matching task, the results prove that the GPU accelerator provides a constant performance for random batches of request up to 1000 elements. When considering the worst case scenario, a very demanding use case (the diameter of the test graph is twelve times larger than the diameter of Sydney network), we can see that the GPU significantly outperforms the CPU when the number of requests exceeds 100. Thus - for large numbers of requests, the CPU is no longer a suitable target.

• The performance of the software prototype proves that the system is able to meet the demands of REQ7 (i.e., a throughput in excess of 1000 requests/s) when considering the average use case.

Based on these findings, we believe that GPUs can indeed have a positive impact on the perfor-mance of VRP systems.

6.1 Threats to validity

We identify in this study three major concerns that could be labeled as threads to validity. We discuss them briefly in the following paragraphs.

The GPU code is not fully optimized.

Our work can benefit from future optimizations in several areas of the GPU and CPU code. The previous neighbor computation for the SSSP algorithm that is executed on the GPU could benefit from a higher degree of parallelization by analyzing each of the previous neighbor candidates in parallel and utilizing atomic operations to select the least cost alternative.

The CPU specific ride assignment algorithm could be parallelized to search the the best suitable candidates in the compatibility metric array in parallel, and aggregate the results afterwards. These possible enhancements that can still be applied to our prototype will only increase the performance of our GPU-based VRP. As such, the performance results we have presented so far are likely to be conservative.

Datasets weights are based on random values.

The graphs used in this paper lacked weight information. Therefore, we chose to use randomly generated weights as integers from 1 to 10. When considering real weights in a static or dynamic scenario, we are confident that the proposed solution can be easily adapted with minimal perfor-mance impact. A large data type will however increase the amount of memory needed to store the weights and cost arrays of the SSSP algorithm. It is also important to note that the current implementation can only handle weights of integer types as the atomic operations used in the SSSP algorithm currently do not support floating point types. Adding support for floating point weights would require some implementation changes in the SSSP algorithm.

We used a simplified scenario for ride sharing.

The ride sharing scenario abides to a set of rules that aim to match customers with the highest degree of compatibility (share the longest common path) in order to reduce cost and minimize the number of vehicles. While the system has originally been intended to handle a dynamic ride sharing scenario, we have not tested any such cases. Moreover, there are several limitations that may hinder its usage in more dynamic settings. For example, one or several clients might opt out of the proposed planning. The system would then have to adapt dynamically to changes in customer availability, in order to avoid unnecessary stops that would lead to delays and customer dissatisfaction. Another limitation is that the system does not currently inform the customers of vehicle arrival times or shared cost. We do however compute the shared cost and the time needed to travel to a any customer can be accessed from the leas travel cost computed by the SSSP algorithm. This information can be easily correlated into an custom pricing and pick-up/delivery scheme.

VRP solution using GPGPUs

VRP solutions using GPGPUs

Student: Mihai Onofrei

mihonofrei@gmail.com

Host Organization: System and Network Engineering Research Lab

Supervisor : Ana Lucia Varabanescu

A.L.Varbanescu@uva.nl

Second reviewer: Souley Madougou

s.madougou@uva.nl

Contents

Abstract

Acknowledgements

Chapter 1

Introduction

1.1

Approach

1.2

Contributions and outline of the thesis

Chapter 2

Background and related work

2.1

Vehicle Routing Problems (VRPs)

2.2

Ride Sharing Services

2.3

GPGPU computing

2.4

Parallel graph processing

2.4.1

The "Gap Benchmark Suite"

2.4.2

Graph representation on the GPU

2.4.3

Parallel graph algorithms in CUDA

2.4.4

Nvgraph

2.5

VRP and GPGPUs

Chapter 3

The Requirements of a HPC VRP

system (SQ1)

Chapter 4

The Design and Implementation of a

HPC VRP system (SQ2)

4.1

Design overview

4.2

A High-level Architecture of the System

4.3

Implementation details

4.3.1

Heterogeneous computing and task assignment

4.3.2

The SSSP Algorithm

4.3.3

SSSP algorithm with path reconstruction

4.3.4

Ride matching algorithm

4.3.5

Ride assignment algorithm

Chapter 5

Performance Measurement and

Analysis (SQ3)

5.1

Experimental setup

5.2

Results for the SSSP algorithm

5.3

Results for the SSSP algorithm with path reconstruction

5.4

Results for the common cost algorithm

5.5

Worst case scenario for the ride matching algorithm

5.6

Assignment of requests to vehicles

5.7

Prototype validation

Chapter 6

Conclusion and Future Work

6.1