Benchmarking Deep Neural Networks Frameworks

Layout: typeset by the author using LaTeX.
Cover illustration: unknown artist. Source: www.global-engage.com

Benchmarking Deep Neural Networks Frameworks for Distributed Training

Pim P.L. van Helvoirt
10546413

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor:
dr. ir. A.L. (Ana) Varbanescu
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020

Acknowledgements

First, I would like to thank Ana, for her continuing (online) supervision, patience and guidance where it was necessary. Additionally, I would like to thank Matthijs for his invaluable contribution in setting up and troubleshooting the required environment for this benchmark, and Valeriu for additional help with the configuration of the Lisa Cluster. Furthermore, I would like to thank SURFsara for making their computing resources available for this research. Additionally, I would like to thank Bo for providing me with a steady supply of fresh coffee to keep me going.

Finally, I would like to thank my partner Annabel for supporting me, and my family for helping me from a distance during the writing process.


Abstract

Within Deep Learning (DL), Deep Neural Networks (DNNs) and the datasets used for their training have been rapidly increasing in size. To keep the training of these large models and datasets manageable, Distributed Training Frameworks (DT Frameworks) have been developed. Most research done on DNNs, however, does not concern the performance of DT Frameworks during training. This thesis focuses on this computational performance.

For distributed training, there are three different theoretical approaches: data-parallelism, model-parallelism and pipelining. In theory, data-parallelism is the best performing, but whether this holds in practice depends on the actual models and the implementation of the frameworks. The main goal of this research is to propose a systematic evaluation method, through a benchmark, to aid choosing between different DT Frameworks that use different distributed training approaches. To build such a benchmark, we first select relevant models from different classes, frameworks, and representative datasets, all based on a survey of state-of-the-art scientific literature. For this benchmark, we selected throughput and memory usage as metrics, based on scientific material on benchmarking in High Performance Computing (HPC). The proposed architecture is partially implemented in a prototype, which enabled us to provide an empirical evaluation (and ranking) of three of the chosen DT Frameworks: PyTorch, Horovod and PipeDream, using ResNet, ResNeXt and EfficientNet and the Cifar10 and MNIST datasets.

Our main contributions are the benchmark architecture, and the prototype implementation of this benchmark that provides a ranking of the frameworks. We further discuss possible extensions of this benchmark, and raise awareness of the challenges in the implementation of diverse models in different frameworks. Finally, our benchmarking results demonstrate that, with MNIST, PipeDream, in contrast to theory, performs better than Horovod. We have provided several explanations for this result.

Keywords: Benchmarking, Distributed Training Frameworks (DT Frameworks), Deep Neural Networks (DNNs), Deep Learning (DL), High Performance Computing (HPC), Ranking


Contents

1 Introduction
  1.1 Research Question and Approach
  1.2 Thesis Outline

2 Background and Related Work
  2.1 Background
      2.1.1 Deep Neural Networks
  2.2 Theoretical Methods of Distributed Training
  2.3 Related Work

3 Design and Implementation
  3.1 Selecting Networks
  3.2 Selecting Frameworks
      3.2.1 PyTorch
      3.2.2 Horovod
      3.2.3 PipeDream
      3.2.4 GPipe
  3.3 Selecting Metrics
  3.4 Benchmark Architecture
  3.5 Implementation
      3.5.1 Implementing the Frameworks
      3.5.2 Implementation for Different Datasets

4 Benchmark Results
  4.1 Experimental setup
  4.2 General Remarks
  4.3 Results per Model
      4.3.1 ResNet 18, ResNet 34, ResNet 50
      4.3.2 ResNet 101 and ResNet 152
      4.3.3 ResNeXt 29 with 2 x 64d, 4 x 64d and 32 x 4d
      4.3.4 ResNeXt 29 with 8 x 64d
      4.3.5 EfficientNet
  4.4 Final Ranking

5 Extending the Benchmark
  5.1 Limitations and Imminent Extensions
  5.2 Other Desirable Extensions
  5.3 Extensibility and Maintainability Issues

6 Conclusion

Appendices
A Overview of a ResNet18 model
B Full Benchmark Results
  B.1 Model graphs
C Code used for the Benchmark
  C.1 Run script
      C.1.1 Environment Setting Script
  C.2 Framework Code
      C.2.1 Sequential Pytorch
  C.3 Pipedream
      C.3.1 Profiler
      C.3.2 Model training
  C.4 Model Code
      C.4.1 ResNet models
      C.4.2 ResNeXt models
      C.4.3 EfficientNet model
  C.5 PascalVOC code


Chapter 1

Introduction

In the data-driven world of today, Machine Learning (ML) models are used to understand and draw conclusions from data. An important part of ML is Deep Learning (DL) using Deep Neural Networks (DNNs). DNNs are models that, inspired by the human brain, are built as networks of nodes; they have been successfully applied in a multitude of fields, including computer vision, speech recognition, natural language processing, audio recognition, social network filtering and machine translation [1].

All these models need training, and it is a common assumption that more data leads to a better trained model. As input data has been rapidly increasing in size, methods to address the scale and speed of training on these massive data collections have become an active topic of research. One of these methods is the distributed training of DNNs [1]. There are, however, many Distributed Training Frameworks (DT Frameworks) with different approaches to distributed training.

With respect to the performance on a given model (i.e., the speed of training), choosing between these DT Frameworks is non-trivial, and impossible to do without extensive testing.

1.1 Research Question and Approach

In this research, we help users choose between frameworks by answering the following research question:

Can we design and build a benchmarking suite to empirically compare the performance of DT Frameworks for DNNs?


To answer this research question, several milestones need to be reached. To this end, we propose the following research sub-questions.

• What representative networks, frameworks and datasets should be included for testing? To answer this question, we performed a literature survey, and selected the networks and datasets that are considered representative in current research. To select the framework(s), we reviewed recent scientific material and technical documentation, assessing the features of existing frameworks, and selected the most representative ones for state-of-the-art distributed training.

• What are representative metrics that characterise the performance of DT Frameworks? We selected benchmark metrics based on scientific literature regarding benchmarking in the High Performance Computing environment.

• What is the architecture of a benchmark for ranking DT Frameworks? We designed the benchmark based on a brief list of requirements (derived from the previous two questions). We further implemented a prototype of the benchmark, and provide empirical evidence that the benchmark design can illustrate performance differences, in terms of the proposed metrics, between its selected components: DT Frameworks, models, and datasets.

1.2 Thesis Outline

The remainder of this thesis is organised as follows. In Chapter 2 we introduce background and related work. To this end, we briefly discuss what training of a DNN entails, and discuss theoretical distributed training methods. In Chapter 3, we present the selection process for the networks and frameworks included in this benchmark. Next, we discuss the metrics for this benchmark. We further discuss the proposed benchmark architecture, and include a brief description of its implementation.

Results are included in Chapter 4, where we show results for the benchmark prototype, and provide a ranking of the frameworks based on these results. In Chapter 5 we discuss possible extensions of the benchmark, and present preliminary results for some of these extensions.

Finally, we conclude this research in Chapter 6, with an overview of our most important results, and an answer to our research question. We also discuss possible directions for future work.


Chapter 2

Background and Related Work

In this chapter, we provide a brief overview of DNNs, and we discuss three theoretical methods of distributed training. We also present a brief survey of related work.

2.1 Background

2.1.1 Deep Neural Networks

A Neural Network is a structure of interlinked nodes, organised in layers, that is used to learn one or multiple features from a data set and, based on these learned features, can, for example, classify objects into categories. A DNN is a Neural Network that has multiple hidden layers. However, up until now, researchers have not defined any lower/upper bound on the number of hidden layers required for a network to be classified as deep [2].

Currently, many frameworks are proposed to allow users to design, train, and

potentially use DNNs. Examples include TensorFlow [3], Caffe [4] and Keras¹.

Besides providing users with an environment for network "management", they also provide a lot of backend functionality, as they manage the required computation on the provided hardware infrastructure (e.g., CPUs, one or more GPUs, and even multiple nodes in a cluster).

To train a DNN using a DT Framework, we first prepare a dataset for training, generally including a transformation of the data to fit it on the input layers of the model.

¹Keras does not have an official research paper, but the GitHub repository can be found here.


Additionally, other transformations can be used, like artificial noise and/or cropping, generally to enhance the dataset and ultimately improve the correctness of the model. After creating the model, it is time for training on the data. The data is split into two parts: a training set and a validation set. The training happens in training epochs. One training epoch is defined as a forward pass and a backward pass over the entire dataset, with the data partitioned in batches determined by the batch-size. During training, the parameters of the model are subject to change: the data is passed forward through the model in batches, and the output of the DNN given this data is used to compute an update of these parameters on a backward pass, reducing the error on the training set and optimising the training loss. After the training epoch comes the validation phase. This phase is purely used to track training progress. The output from data that is passed forward through the model is compared to the true output. Based on the difference between the output from the model and the true output, we can employ different metrics to calculate how accurately the model performs; we expand on this in Section 3.3.

The model is trained on the training set and subsequently validated on the validation set, until a certain goal is reached. The goal depends on the aim for using the model. Examples of common goals are: reaching a required level of accuracy on the validation set, training until the training error stops decreasing, or training for a preset number of epochs.
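To make this training and validation cycle concrete, the sketch below shows a minimal (non-distributed) PyTorch epoch loop; the model, dataset and hyper-parameter values are illustrative only and happen to mirror the default settings reported in Chapter 4, rather than being the actual benchmark code.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet18

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MNIST images are converted to 3 channels so a stock ResNet18 accepts them.
transform = transforms.Compose([transforms.Grayscale(num_output_channels=3),
                                transforms.ToTensor()])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
val_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

model = resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

for epoch in range(5):
    model.train()
    for images, labels in train_loader:                  # one training epoch
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)          # forward pass
        loss.backward()                                  # backward pass
        optimizer.step()                                 # parameter update

    model.eval()                                         # validation phase
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    print(f"epoch {epoch}: validation accuracy {correct / total:.3f}")
```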

2.2 Theoretical Methods of Distributed Training

Within distributed training, there are multiple approaches to reduce the training time or to allow for better scaling using multiple machines for a given DNN. In this thesis, we focus on three of these: data-parallelism, model-parallelism, and pipelining.

The first method for distributed training is data-parallelism. Using this technique, the dataset is distributed over different devices, each of which has the complete model. During the training stage, different mini-batches of data are forwarded in parallel through the model, and parameters learned from this data are averaged to reflect learning for all parallelised mini-batches. Naturally, this approach imposes communication overhead, compared to regular model training, for averaging the parameters across multiple devices.
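As a purely illustrative, single-process sketch of this idea (not how any of the DT Frameworks implement it), the toy example below shards a small regression dataset over four simulated devices, lets each replica compute a gradient on its own shard, and averages the gradients before every shared parameter update.

```python
import numpy as np

# Toy data-parallel SGD on a linear model y = w * x.
rng = np.random.default_rng(0)
x = rng.normal(size=1024)
y = 3.0 * x + rng.normal(scale=0.1, size=1024)

num_devices = 4
shards = np.array_split(np.arange(x.size), num_devices)   # the data is distributed
w, lr = 0.0, 0.1                                           # the model is replicated

for step in range(100):
    # Each replica computes a gradient on its own shard of the data...
    local_grads = [np.mean(2 * (w * x[s] - y[s]) * x[s]) for s in shards]
    # ...and an all-reduce-style average combines them before the shared update.
    w -= lr * np.mean(local_grads)

print(f"learned w ~= {w:.3f}")   # approaches the true value 3.0
```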

The second method is pipelining. A model is split into different parts, which are then distributed over different devices. The data flows through these devices sequentially. After piping the data through the model in sequential order, the parameters of the model are updated on the backward pass in reversed sequential order. This imposes a communication overhead between the different parts of the split model. Additionally, how to exactly partition a model is not a trivial task [5].

The last theoretical method to distribute the training of DNNs is model-parallelism. With model-parallelism, a model is duplicated over different computational devices. These different devices then try to estimate different parameters of the given problem, using the same data. Models synchronise repeatedly between iterations to keep the parameters of the model up to date. Theoretically, this allows scaling up, because data can be split and trained in parallel over the replicated models. Additionally, this approach allows a model that does not fit on one device to still be used. However, this method of distributed training generally leads to either under-utilisation of compute resources or re-computation of gradients, and an overhead in averaging gradients of the model across different devices [6].

Best Performing Theoretical Method in Distributed Training

In principle, the best performing theoretical approach should be data-parallelism, due to less overhead than the other two approaches. Moreover, model-parallelism should be outperformed by the other methods. Second in the ranking would be pipelining.

However, in these HPC environments we have to deal with a lot of hardware complexity. This means that we should empirically verify these theoretical principles. With regard to using frameworks for DNNs, the most limiting factors in performance should be a combination of the model and the framework used. Therefore, if the model is the same in two different runs with two different DT Frameworks, the two runs reflect the differences in performance between these two frameworks.

2.3 Related Work

A lot of research has been done in the last decade on DNNs. Some of this research concerns the development of new DNN models or updates to existing ones, for example Chen et al. [7], Tan et al. [8], Xie et al. [9], Zhu et al. [10], Karras et al. [11], Ronneberger et al. [12], Turk et al. [13], or Karras et al. [14]. Other research focuses on the application of DNNs to a new domain, e.g. speaker recognition [15]. We have used some of these papers for insight into the models used in DNN-related research and for metrics that are relevant for specific domains. These papers are all relevant as isolated case studies, but do not provide a good overview of DNN training. This is due to the fact that this scientific body of research, in most cases, reduces performance reporting to only reporting accuracy (or a more domain-specific replacement for the accuracy metric), an overview of the layers of the model, and the infrastructure the model was trained on. For performance research we would, however, require more insight into the training, e.g. reporting on throughput and execution speed. In contrast, the body of scientific work on the topic of benchmarking different DT Frameworks for DNNs itself is very small. Benchmarking is only done when proposing a new framework, e.g. with GPipe [6], and not in a comparative way. We were only able to find the research of Zhu et al. [16] as related work. In their research, they provide a toolchain for analysing memory performance during the training of DNNs. Additionally, they also provide some recommendations with regard to further research on the optimisation of DNN training. We assess that the body of research is very small because both the models and the frameworks are very new.


Chapter 3

Design and Implementation

In this chapter, we first present how networks and the frameworks were selected for the benchmark. Afterwards, we discuss the metrics we propose for the benchmark. Finally, we introduce the benchmark architecture, and briefly discuss the implementation challenges of that design.

3.1 Selecting Networks

Deep Neural Network       Class
ResNet [17]               classification
EfficientNet [8]          classification
ResNeXt [9]               classification
CycleGAN [10]             generation
Progressive GAN [11]      generation
StyleGAN(2) [14]          generation
Deeplabv3 [7]             semantic segmentation
UNet [12]                 semantic segmentation
UNet3D [13]               semantic segmentation

Table 3.1: Overview of the DNNs for the benchmarking suite, divided into different classes.

In Table 3.1 we present an overview of the selected networks, along with the classes they belong to. These DNNs were selected based on a survey of articles from the period 2015–2020. We have made this rigid classification of DNNs in this benchmark because we were not able to compare the different tasks of the classes of networks one-to-one. Therefore, we distinguish between these different classes in our benchmark. For the classification networks ResNet and ResNeXt, we can vary the depth of the networks by adding more layers, e.g. we have several ResNet models from ResNet18 to ResNet152¹. For the segmentation model Deeplabv3, we can change the backbone from a ResNet50 to a ResNet152 model.

In this selection, networks were chosen from different application domains in DL, i.e. Classification Models (CMs), Generative Models (GMs) and Semantic Segmentation Models (SSMs), to give the benchmark suite a large scope. The ResNet-type models are less state-of-the-art than the other selected models (stemming from 2015) and, as tried and proven models, were selected to provide a performance reference within the classification domain.

¹A table of the layers of a ResNet18 network can be found in Appendix A as an example.

3.2 Selecting Frameworks

To cover all three approaches, we selected Distributed Training Frameworks (DT Frameworks) for this benchmarking suite. These are current, state-of-the-art DT Frameworks found in research from 2018 and 2019. Specifically, the DT Frameworks compared in this benchmark are PyTorch [18], PipeDream [5] and Horovod [19]. These frameworks cover the theoretical approaches as follows: PyTorch and Horovod are data-parallelism frameworks, and PipeDream and GPipe are pipeline-parallelism frameworks. Model parallelism is in some form supported in all of these frameworks: PyTorch and Horovod offer no automatic partitioning over devices, but partitioning over multiple devices can be done manually in the model itself (see the sketch below), while PipeDream and GPipe are only able to emulate the behaviour of model parallelism. In the following paragraphs, we first describe how the selected frameworks are implemented and used for the training of a DNN. Afterwards, the chosen frameworks are explained briefly, with a focus on the similarities and differences between them.
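As an illustration of such manual partitioning (a minimal sketch, assuming two GPUs are visible; this is not code from the benchmark), a PyTorch model can be split by placing its parts on different devices and moving the activations between them in the forward pass:

```python
import torch
from torch import nn

class TwoDeviceNet(nn.Module):
    """A toy model split by hand over two GPUs."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(784, 256), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(256, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))    # activations cross devices here

model = TwoDeviceNet()
out = model(torch.randn(64, 784))            # loss.backward() then works as usual
```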

3.2.1 PyTorch

PyTorch is entirely written in Python, with an underlying implementation in C++ for performance optimization [18]. Additionally, PyTorch maintains a strict distinction between its control flow and data flow. One of the core design principles of PyTorch is to keep its internal implementation simple. PyTorch provides native support for model- and data-parallelism. In data parallelism with PyTorch, data is trained in different buckets, in which mini-batches of data are forward-passed through the model. After all the buckets are done computing (this implies a waiting time), an asynchronous all-reduce algorithm computes an update of the gradients of the model. This averaged gradient is used in each bucket to compute an update of the parameters of the model. In our implementation we use Stochastic Gradient Descent (SGD) with momentum. In PyTorch this differs from Sutskever et al. [20] and implementations in some other frameworks. Considering the specific case of momentum, the update can be written as

v_{t+1} = μ · v_t + g_{t+1},
p_{t+1} = p_t − lr · v_{t+1},

where p, g, v and μ denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et al. [20] and other frameworks, which employ an update of the form:

v_{t+1} = μ · v_t + lr · g_{t+1},
p_{t+1} = p_t − v_{t+1}.
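The practical consequence of this difference is easiest to see in code. The sketch below re-implements both update rules with plain Python arithmetic; it is a didactic illustration, not the optimizer code of either framework. With a constant learning rate the two rules produce identical trajectories, but they diverge as soon as the learning rate is changed during training.

```python
def pytorch_style_step(p, v, g, lr, mu):
    # PyTorch SGD: the learning rate scales the whole velocity term.
    v = mu * v + g
    return p - lr * v, v

def sutskever_style_step(p, v, g, lr, mu):
    # Sutskever et al.: the learning rate scales only the gradient contribution.
    v = mu * v + lr * g
    return p - v, v

p1 = p2 = 1.0
v1 = v2 = 0.0
for step in range(10):
    lr = 0.01 if step < 5 else 0.001          # learning-rate change mid-training
    p1, v1 = pytorch_style_step(p1, v1, g=0.5, lr=lr, mu=0.9)
    p2, v2 = sutskever_style_step(p2, v2, g=0.5, lr=lr, mu=0.9)

print(p1, p2)   # the parameters differ once the learning rate has changed
```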

3.2.2 Horovod

Horovod is the most popular framework for data-parallelism [19]. It is a standalone Python package and can easily be used to extend other frameworks. A key difference compared to PyTorch's data parallelism is the communication between the data-parallel threads. In Horovod, instead of an all-reduce, a ring-all-reduce algorithm is used to average gradients. In this algorithm, a thread only communicates with its two neighbours, instead of one asynchronous update averaging everything (as done in PyTorch), reducing communication. Additionally, this algorithm computes updates across all threads twice: in the first stage, updated gradients are placed in a buffer for each thread, and in the second stage the new gradient updates overwrite the gradients already placed in the buffer.
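A minimal sketch of how Horovod extends an existing PyTorch script is shown below; the model and data are toy placeholders, and the script is assumed to be launched with something like `horovodrun -np 4 python train.py`.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())      # one GPU per worker process

model = torch.nn.Linear(784, 10).cuda()      # toy stand-in for a real DNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

# Gradients are averaged with ring-all-reduce; every worker starts from the
# same initial parameters and optimizer state.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each worker trains on its own shard of the data.
data = torch.utils.data.TensorDataset(torch.randn(1024, 784),
                                      torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    data, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(data, batch_size=64, sampler=sampler)

loss_fn = torch.nn.CrossEntropyLoss()
for x, y in loader:
    optimizer.zero_grad()
    loss_fn(model(x.cuda()), y.cuda()).backward()
    optimizer.step()
```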

3.2.3 PipeDream

The third framework in this benchmark, PipeDream, uses a combination of the outlined methods that Harlap et al. named pipeline-parallel training (a combination of data-parallelism and pipelining) to speed up model training [5]. In PipeDream, a model is first profiled with a small dataset to observe where the most computationally expensive layers reside. Another use of profiling before commencing training is splitting a model that is too large for a single Graphics Processing Unit (GPU) over multiple ones. After the profiling stage, a model is split into different parts, in which computationally expensive layers are parallelised using model parallelism. After this structure is built, mini-batched data is pipelined through the structure in sequential order. To update the parameters of the model, the gradients are put through the model in reverse sequential order. We observe an initial difference between PipeDream and PyTorch/Horovod: the initial overhead of profiling the model and the requirement that models be sequential. We have chosen not to include this overhead in this benchmark.

3.2.4 GPipe

GPipe is an open-source framework for DNNs that also uses pipeline-parallelism, like PipeDream [6]. GPipe also supports model-parallelism to overcome memory limitations with large models. In contrast to PipeDream, mini-batched data is further divided into micro-batches to split the computation. These micro-batches are trained in parallel, and after a backward pass through each micro-batch, gradients are averaged across these micro-batches. Like PipeDream, GPipe requires models to be sequential.
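For reference, the sketch below shows how the PyTorch port of GPipe (the torchgpipe package) is typically used; the exact signature may differ per version, and this is an assumption-laden illustration rather than code from our prototype, which did not reach a working GPipe implementation.

```python
import torch
from torch import nn
from torchgpipe import GPipe   # PyTorch GPipe port; assumed installed

# The model must be an nn.Sequential so it can be cut into pipeline stages.
net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
# Two stages (the balance entries sum to the number of child modules); each
# mini-batch is split into 4 micro-batches that flow through the pipeline.
net = GPipe(net, balance=[2, 1], devices=["cuda:0", "cuda:1"], chunks=4)

x = torch.randn(64, 784).to(net.devices[0])   # inputs live on the first stage
out = net(x)                                  # outputs appear on the last stage
```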

3.3 Selecting Metrics

An application's performance depends on many aspects, i.e., the application, the input, the compiler, the runtime environment, and the machine. These different aspects are hard to keep consistent when running programs in large environments, as is the case for DNNs, hence reproducibility is difficult in most cases. Therefore, the interpretability of the metrics chosen for a given benchmark is most important [21]. This benchmark follows this principle, by keeping the metrics simple and using default hyper-parameters.

In general, speed and resource usage are both of importance when measuring performance in High Performance Computing (HPC). To have a high level of generalisation, we should choose metrics that are as consistent as possible across different classes of models. For DNNs, computation speed is measured as throughput. For this benchmark, we have chosen the metrics samples/sec and sec/epoch for this. These metrics are averaged over all training epochs and averaged across the different devices, to provide a representative overview of the entire training procedure. Resource usage is measured in terms of memory usage, as proposed in [22]. This memory usage is measured as memory allocated and memory cached, and it is averaged in the same way as the throughput.
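A hedged sketch of how these per-epoch metrics could be collected on a single device with PyTorch's built-in counters is shown below; the benchmark's own run scripts (Appendix C) may record them differently.

```python
import time
import torch

def timed_epoch(loader, step_fn, device=0):
    """Run one epoch via step_fn and return throughput and memory metrics."""
    torch.cuda.reset_peak_memory_stats(device)
    samples, start = 0, time.time()
    for inputs, targets in loader:
        step_fn(inputs, targets)        # one forward/backward/update step
        samples += inputs.size(0)
    elapsed = time.time() - start
    return {
        "sec_per_epoch": elapsed,
        "samples_per_sec": samples / elapsed,
        "memory_allocated": torch.cuda.max_memory_allocated(device),
        "memory_cached": torch.cuda.max_memory_reserved(device),  # "cached" in older PyTorch
    }
```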

We have discussed adding energy consumption as an additional metric, following the suggestion of [22]. This was however deemed too complicated, as isolating the energy consumption for the DNN training, and splitting it from the power requirement of the cluster on which it was computed, required significantly more tooling for this research project. However, such measurements would be a relevant, high-priority extension of the benchmark features.

There are a number of other parameters that have to be set before the training of a DNN can be started. These hyper-parameters are momentum, learning rate, and batch-size. However, the relation of these hyper-parameters to the performance of the model on the validation set is complex [23][24][25]. Thus, in cases where the model is optimised for accuracy on the validation set, multiple runs are done tweaking momentum, learning rate, and batch-size, to optimise performance on the validation set and find optimal values for these hyper-parameters. In this context, we deemed correctness of the model, rather than optimal performance, a precondition. Therefore, we chose to set these hyper-parameters to default (known) values, and not to include them as metrics.

Besides hyper-parameters, transformations are generally used in the training of DNNs to augment the data. This can be done using a multitude of different transformations, e.g. blurring, resizing, cropping and normalisation. Such data processing serves multiple purposes, potentially speeding up training or enhancing the robustness of the model, and thus avoiding errors. We chose to keep these transformations consistent across the different frameworks, and to exclude them from the metrics we measure.

Class             Metric(s)
Classification    accuracy
Segmentation      accuracy, mean IoU
Generation        Fréchet Inception Distance

Table 3.2: Overview of the metrics used for different classes of DNNs.

The main validation metric for segmentation models and classification models is accuracy. Accuracy is calculated, roughly, as the number of correctly (semantically) classified images divided by the total number of images in the validation set. For different classes of models, different validation metrics apply, as shown in Table 3.2. Note that we omitted accuracy for generative models, as accuracy cannot be measured when generating new data. In the following paragraphs, a brief explanation is provided of the metrics that are less commonly used.

For the classification models, we only need to measure how correct the classification of the image is. This can be done in the standard way, by an accuracy metric based on performance on the validation set.

For the segmentation class of models, the Intersection over Union (IoU) is a metric that expresses how well the area defined by the network for an object overlaps with the validation area of that object. The mean IoU refers to the IoU averaged across the different classes of objects. It is calculated as follows:

IoU = (target ∩ prediction) / (target ∪ prediction)
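As an illustration (a minimal sketch, not the benchmark's unfinished segmentation code), the mean IoU over a set of classes can be computed from integer label maps as follows:

```python
import torch

def mean_iou(pred, target, num_classes):
    """pred and target are (H, W) tensors of class indices."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = (pred_c | target_c).sum().item()
        if union == 0:                       # class absent in both: skip it
            continue
        inter = (pred_c & target_c).sum().item()
        ious.append(inter / union)
    return sum(ious) / len(ious)

pred = torch.randint(0, 3, (4, 4))
target = torch.randint(0, 3, (4, 4))
print(mean_iou(pred, target, num_classes=3))
```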

For the generative class of models, we chose the Fréchet Inception Distance (FID) (also known as Wasserstein-2 distance) [26], calculated as shown below.

FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}),

where X_r ∼ N(μ_r, Σ_r) and X_g ∼ N(μ_g, Σ_g). This metric reflects how well the distribution of the generated data matches the distribution of the validation set of the dataset, by measuring the distance between the two Gaussians that represent these datasets [27].
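As a minimal sketch of this formula (applied to arbitrary feature vectors rather than actual Inception activations, and not part of the benchmark prototype):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """FID-style distance between two sets of feature vectors."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):             # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) +
                 np.trace(sigma_r + sigma_g - 2 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 8)),
                       rng.normal(1.0, 1.0, size=(256, 8))))
```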

3.4 Benchmark Architecture

The benchmark designed in this bachelor thesis aims to determine which framework performs best on the selected metrics, when given a class of models. To do so, it provides a quantitative comparison between the examined frameworks for the defined classes of models. Additionally, it constructs a ranking of these different frameworks, to determine which framework performs best given the metrics we selected.

The proposed architecture of this benchmark is presented in Figure 3.1. The networks and frameworks it includes have been discussed in Sections 3.1 and 3.2, respectively, while the datasets under consideration are presented in Section 3.5.2.

Figure 3.1: Diagram overview of the benchmark architecture.

In this benchmark, a DNN is trained on hardware, using a framework with default settings of the hyper-parameters (see Figure 3.1). After final validation, the frameworks are ranked for a given model, based on the metrics that were discussed in the previous section. Based on these obtained metrics and a ranking rule, a ranking of DT Frameworks is created for each model. Using these multiple rankings, we can create a final ranking of the frameworks. In the following paragraphs, this process is covered in more detail.

The benchmark implements a factorial design, as suggested in [21]. Specifically, the measurements are conducted per model of a class, in which the hyper-parameters of the model training, i.e. the learning rate, batch size, momentum and weight-decay, have been set. For training, we have set the end point of the training to be the completion of a set number of epochs. For each model, the hardware platform can be tuned in terms of the number of available GPUs, thus changing the parallel infrastructure. For each model, a non-parallel version of PyTorch is also run, to serve as a point of reference (i.e., the baseline) for computation speed. Based on the results we gather from these experiments, a ranking of the frameworks is created per model within a class. In this research, we choose to prioritize throughput over memory. So, if a framework did not execute a model properly, given the memory constraints of the number of devices, we rank that framework last for this model. Next, we order the frameworks in terms of performance, measured as throughput (thus, higher is better). In cases of equivalent performance with varying memory footprint, we choose the framework with the lower memory footprint. The rank of a framework is thus:

R_Framework = R_Throughput

We have considered calculating the rank as a weighted sum of throughput and memory footprint, but due to the complex relation between the two in these results, selecting coefficients for the weighted sum required further analysis of the results, which did not fit in the time constraints of this research project.

Based on this ranking per model, frameworks are ranked within a class. To find an overall winner within a class, we chose Copeland's rule [28]. An explanation of Copeland's rule is given below.

1. Do pairwise majority contests between the alternatives. (3.1)
2. Each alternative gets +1 for a win and −1 for a loss. (3.2)
3. Select as winner the alternative with the most points. (3.3)

Potential caveats of using Copeland’s rule are that it is very likely to produce ties. Additionally, it puts a big emphasis on quantity of victories and defeats, forgetting about their magnitudes. We think, however, that these constraints do not limit the application of the rule in this context. We make this claim because all these models are of equal importance, and, in our view, any increase in performance, within the memory requirement, no matter how small, should make one framework rank higher than another.

To find an overall winner across the different classes, we intended to select the framework that won most often based on Copeland's rule per class.
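To make the ranking rule concrete, the sketch below applies Copeland's rule to a set of per-model rankings; the framework names are real, but the rankings shown are illustrative placeholders, not the benchmark results.

```python
from itertools import combinations

# Each model maps to a list of frameworks ordered best-to-worst (illustrative).
rankings = {
    "model_a": ["PipeDream", "Horovod", "PyTorch"],
    "model_b": ["PipeDream", "Horovod", "PyTorch"],
    "model_c": ["Horovod", "PyTorch", "PipeDream"],
}

frameworks = ["PyTorch", "Horovod", "PipeDream"]
scores = {f: 0 for f in frameworks}

for a, b in combinations(frameworks, 2):
    # Pairwise majority contest: who ranks above whom in more models?
    a_wins = sum(r.index(a) < r.index(b) for r in rankings.values())
    b_wins = len(rankings) - a_wins
    if a_wins > b_wins:
        scores[a] += 1; scores[b] -= 1
    elif b_wins > a_wins:
        scores[b] += 1; scores[a] -= 1

print(max(scores, key=scores.get), scores)   # Copeland winner and point totals
```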

3.5 Implementation

We conclude this chapter with a few details on the implementation challenges we faced when implementing a first prototype, following the architecture presented in Figure 3.1.

3.5.1 Implementing the Frameworks

A general choice we made early in the implementation phase, was to write all frameworks as an extension of PyTorch (v1.5.0, with TorchVision version 0.6.0 and NCCL 2.5.6 with CUDA version 10.1.243) to maintain comparability between the different approaches, and keep the running environment as consistent as possible, based on the work of M. Jansen [29].

For PyTorch, we first have a non-parallel implementation. Initially we intended to also provide data-parallelism with PyTorch, but due to lack of time in this research project, this implementation remains unfinished.


Horovod was built on top of PyTorch by adapting the non-parallelised version of PyTorch to distributed training over multiple GPUs. This was done with minimal changes to the code, and was relatively easy to implement. This process did deliver on the authors’ promise of Horovod being a portable framework extension that can be used on top of other frameworks, like PyTorch and TensorFlow.

In contrast, PipeDream first required a custom build of a prior release of PyTorch (a commit just before v1.1.0). For this, we again started from the work of M. Jansen [29]. Additionally, PipeDream requires models to be rewritten so that they do not use PyTorch's functional library. Using an earlier build of PyTorch also meant that newer out-of-the-box models, e.g. the ResNeXt models, and newer convenience functions, e.g. in the functional library, are not available. Additionally, due to a known issue with the PipeDream port to PyTorch, the PipeDream framework gets stuck in an infinite loop during the initial profiling stage with larger models. Therefore, our current version of PipeDream is unable to run larger models (like ResNet101 and larger) at all. Unfortunately, this means that PipeDream cannot be fully tested and compared to the other frameworks in this benchmark; this is reflected, in our results, in PipeDream losing the direct comparison against all the other "competitors" when the models grow large.

For the implementation of GPipe, models also had to be rewritten, but now in a sequential manner, using a specific class from PyTorch. Initial results from running the GPipe framework pointed to bugs judged difficult to solve. Due to time constraints in this research project, our current benchmark prototype does not include a working version of GPipe, and therefore no GPipe results can be presented. Links to the code used to gather the results can be found in Appendix C.

3.5.2 Implementation for Different Datasets

The models were initially divided into three classes: classification, semantic segmentation and generative models, as listed in Table 3.1.

The current prototype only includes the classification models, i.e. ResNet, ResNeXt, and EfficientNet, implemented on the Cifar10 [30] and MNIST [31] datasets. To accomplish the implementation of these models, we adapted code from GitHub user kuangliu for the ResNeXt models and the EfficientNet model. This code was initially written for the Cifar10 dataset, but we rewrote it so that the models can be trained on MNIST as well.

For the other classes of models we did not provide a full implementation, but we selected the relevant datasets. For the semantic segmentation model we have selected the PascalVOC [32] dataset, with an unfinished implementation. For the generative models we selected the celebA [33] dataset, without an implementation.


Chapter 4

Benchmark Results

In this chapter, we introduce the platforms used to validate our benchmark prototype. Next, we present the results and analyze them in detail. We conclude this chapter with a final ranking of the frameworks, as provided by the benchmark.

4.1 Experimental setup

To collect the results for the classification models, we trained the models with a Stochastic Gradient Descent optimiser for 5 epochs, using a momentum of 0.5, a batch-size of 64 and a learning rate of 0.01. During implementation we noticed that the variability between runs was very low, and we therefore based these results on a single run of the benchmark for each model and framework. The collection of results was done on 1 node of SURFsara's Lisa system, using the maximum of 12 CPU cores¹. For this benchmark we used node number 23; the specifications for this node are listed in Table 4.1².

In all experiments, we used a single node and scaled the number of GPUs from 1 to 4, to test the parallelism ability of the different DT Frameworks.

¹An extensive description of the Lisa system can be found on their website.
²More specifications of the CPUs can be found on the website of Intel.

4.2 General Remarks

For all results using the MNIST dataset we were able to obtain a very high accuracy on all models, ranging from a maximum of 100% on the ResNet 18 model to a minimum of 95% on a ResNeXt model with a cardinality of 4 and a bottleneck of 32³. The MNIST dataset can therefore be useful to measure throughput through the DNNs, also because the computation is fast in comparison to larger datasets, like Cifar10 and ImageNet [22].

Specification for Node 23
Number of CPU cores         12
CPU type                    Intel Xeon Bronze 3104 processors
Clock speed                 1.70 GHz
Memory                      256 GigaByte (GB), UPI 10.4 GT/s
Sockets                     2
Cache size                  8.25 MegaByte (MB)
Number of GPUs              4
GPU type                    GeForce 1080Ti with 11 GB GDDR5X
Inter-machine connection    40 Gbit/s ethernet

Table 4.1: Specifications of Lisa system node 23.

In terms of general trends, we observe a clear speedup for both frameworks relative to the sequential version of PyTorch: the number of seconds/epoch decreases with an increasing number of devices. The number of samples/second (averaged per device) also decreases, indicating that the overhead of the parallelism increases with an increasing number of devices. In both seconds/epoch and samples/second we observe that PipeDream outperforms Horovod. With regard to memory (cached and allocated) we observe a close to linear increase when increasing the number of GPUs.

If we consider the ResNet models in the order ResNet 18 to ResNet 152, we observe a gradual increase in memory consumption for a fixed number of GPUs. We also observe a throughput decrease when moving from smaller to larger models, for a fixed number of GPUs. For a given model, we obtain the maximum throughput by using the maximum number of GPUs.

ResNeXt models do not show a straightforward growth in size, in contrast to regular ResNet models. We observe that if we scale in cardinality, the memory consumption increases by a large margin. Additionally, if we scale in cardinality, we see bigger differences in minimal throughput when using the maximum number of GPUs. This is to be expected, as we essentially multiply the size of the model by increasing the cardinality.

³We do not further report on the accuracy of the DNN models, but only on throughput and memory usage.

In the results for EfficientNet, we observe a relatively small memory footprint with a throughput comparable to that of a ResNet101 model.

If we compare the memory consumption of DT Frameworks, we observe that PipeDream allocates a large amount of memory during training and a low amount during validation. For Horovod, the allocation pattern is opposite: it allocates most memory during validation, and little memory during training. We observe, however, that the peak memory consumption of Horovod and PipeDream are close to each other. Looking at memory cached only, we observe that during training, as well as during validation, less memory is cached using PipeDream as opposed to Horovod. This gap increases with the increase in parallelism. The non-parallel version of PyTorch uses the least amount of memory.

We observe that PyTorch uses the least amount of memory. Additionally, we note that PipeDream uses the most memory during training, and Horovod uses the most memory during validation. If we look at the overall memory allocation (summing the allocation graphs), we see that PipeDream uses less memory overall. With regard to memory caching, we see that PipeDream caches more memory during training and validation. The results look similar for ResNet 18 and ResNet 34, only differing in magnitude.

4.3 Results per Model

To calculate the values for throughput, we averaged the results obtained per GPU, and then averaged them across all epochs. We assumed the results were normally distributed. To calculate the memory values for each model, we observed that the allocated and cached memory was consistent for each device. Therefore, we summed the memory allocation and caching values of each device, and reported this sum as the overall memory consumption of the model.
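As a small illustration of this aggregation (with made-up numbers, not measured values):

```python
import numpy as np

# One throughput value per (epoch, GPU) and one memory value per GPU.
samples_per_sec = np.array([[1500., 1480., 1510., 1490.],    # epoch 1, GPUs 0-3
                            [1520., 1495., 1505., 1500.]])   # epoch 2, GPUs 0-3
mem_allocated = np.array([0.45e9, 0.46e9, 0.44e9, 0.45e9])   # bytes per GPU

throughput = samples_per_sec.mean()   # averaged over GPUs and then over epochs
total_mem = mem_allocated.sum()       # summed over devices
print(throughput, total_mem)
```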

4.3.1 ResNet 18, ResNet 34, ResNet 50

In Figure 4.1 we show the results for the throughput of the ResNet 50 model⁴. The results show similar trends for ResNet 18 and ResNet 34, only differing in magnitude. These results follow the general trends we described in Section 4.2, in which PipeDream outperforms Horovod.

Figure 4.1: Throughput results (seconds/epoch and samples/second against the number of GPUs, for PyTorch, Horovod and PipeDream) for running the benchmark with a ResNet 50 model.

In Figure 4.2 we present the performance of the models relating to memory. We observe that PyTorch uses the least amount of memory. These results look similar for ResNet 18 and ResNet 34, only differing in magnitude, and follow the general trends we described in Section 4.2.

Figure 4.2: Overview of results relating to memory caching and allocation for a ResNet 50 model; left, memory allocated for training; middle, memory allocated for validation; right, memory cached for both validation and training.

⁴In this figure, and the remaining graphs in this chapter, the standard deviation is not reported.

4.3.2 ResNet 101 and ResNet 152

In Figure 4.3, we present the measured throughput for the ResNet 101 model. Please note that the measurements for PipeDream are missing, because this run resulted in errors, as described in Section 3.5.1. Due to PipeDream not being able to run, we can only observe that Horovod is faster than the sequential version of PyTorch. The results look similar for the ResNet 152 model, only differing in magnitude.

Figure 4.3: Throughput results for running the benchmark with a ResNet 101 model.

Figure 4.4: Overview of results relating to memory caching and allocation for a ResNet 101 model; left, memory allocated for training; middle, memory allocated for validation; right, memory cached for validation and training.

As shown in Figure 4.4, we can only observe that Horovod uses more memory than sequential PyTorch, due to PipeDream not being able to run.

4.3.3 ResNeXt 29 with 2 x 64d, 4 x 64d and 32 x 4d

In Figure 4.5 we show the throughput measured for the ResNeXt 29 model with a cardinality of 2 and a bottleneck of size 64. We observe a speedup relative to the sequential version of PyTorch for both frameworks, with Horovod still slower than PipeDream. The results look similar for the ResNeXt 4 × 64d and 32 × 4d models, only differing in magnitude.

Figure 4.5: Throughput results for running the benchmark with a ResNeXt 29 model with 2 × 64d.

Figure 4.6: Overview of results relating to memory caching and allocation for a ResNeXt 29 model with 2 × 64d; left, memory allocated for training; right, memory allocated for validation; below, memory cached for validation and training.

As shown in Figure 4.6, we observe that the memory usage for the ResNeXt models is larger than for the other models. The results look similar for the ResNeXt 4 × 64d and 32 × 4d models, only differing in magnitude.

4.3.4 ResNeXt 29 with 8 x 64d

In Figure 4.7 results are shown for the throughput of the ResNeXt 29 model with a cardinality of 8 and a bottleneck of size 64. With this model, the Horovod framework could not be run due to memory limitations. Additionally, the PipeDream framework could not be run on a single GPU, also due to memory limitations. If we compare sequential PyTorch to PipeDream, we again see a speedup in terms of seconds/epoch.

Figure 4.7: Throughput results for running the benchmark with a ResNeXt 29 with 8 × 64d model.

Figure 4.8: Overview of results relating to memory caching and allocation for a ResNeXt 29 with 8 × 64d model; left, memory allocated for training; middle, memory allocated for validation; right, memory cached for validation and training.

If we observe the performance of the models relating to memory, as shown in Figure 4.8, we can only observe that PipeDream uses more memory than sequential PyTorch, due to Horovod not being able to run.

4.3.5 EfficientNet

In Figure 4.9 results are shown for the throughput of the EfficientNet model. PipeDream is not included in this figure as the run resulted in errors, as described in Section 3.5.1. Due to PipeDream not being able to run, we can only observe that Horovod is faster than the sequential version of PyTorch.

Figure 4.9: Throughput results for running the benchmark with an EfficientNet model.

Figure 4.10: Overview of results relating to memory caching and allocation for an EfficientNet model; left, memory allocated for training; middle, memory allocated for validation; right, memory cached for validation and training.

As shown in Figure 4.10, we observe a similar trend for the memory usage of the EfficientNet as with the ResNet 101. We can only conclude that Horovod uses more memory than sequential PyTorch, due to PipeDream not being able to run.

4.4 Final Ranking

To provide a final ranking across all models, we calculated the number of wins (i.e., a framework ranking first for a specific model) across the classification models category.

Framework            Wins   Losses   "Crash"
Horovod              3      7        1/10
PipeDream            7      3        3/10
sequential PyTorch   0      10       0/10

Table 4.2: Number of wins and losses per framework for all of the classification models (a total of 10), on MNIST, based on ranking first or not.

These results are shown in Table 4.2. We observe a clear trend here: in all cases where PipeDream ran without errors, it outperforms Horovod. Moreover, in all cases, sequential PyTorch is outperformed by either Horovod or PipeDream; as sequential PyTorch acts as a reference and provides correct results for all models, we chose to include it in the ranking.

These results are surprising, given the theory from Section 2.2, which suggests that Horovod should outperform PipeDream. There are multiple possible explanations for this result. First, the limited size of the MNIST images (only 28 × 28, which is very small compared to other datasets) could expose underlying overhead in the DT Frameworks. That would mean that this contrasting result disappears given another dataset with a larger image size, e.g. ImageNet or Cifar10. Secondly, we have a slight difference in environment between the two frameworks, due to the fact that a custom build of PyTorch was used for PipeDream. Differences in this custom build, in comparison to the stable version of PyTorch we used for Horovod, could also have influenced these findings.

In summary, based on our empirical results, we can build our ranking using Copeland's rule as follows:
1. PipeDream
2. Horovod
3. sequential PyTorch


Chapter 5

Extending the Benchmark

This chapter discusses the benefits and challenges of extending our benchmark prototype.

5.1 Limitations and Imminent Extensions

Our benchmark design includes, for the classification models, multiple datasets: ImageNet, MNIST and Cifar10. In theory, moving a model from one dataset to another should be relatively simple: it only requires that the input and output layers of that given model are adapted (after loading the dataset into the framework). In practice, however, moving from Cifar10 (32 × 32 × 3 channel image size) to MNIST (28 × 28 × 1 channel image size) turned out to be more complicated. After the correct changes were made to the input layers, a forward pass using the data could still not be completed. Only after halving the kernel size (from 8 to 4) of the last 2-dimensional average pooling layer in the classification models were we able to train successfully. However, halving the kernel size is not something that scales automatically with different input images. Overall, this was unexpected behaviour, and debugging this issue took a significant amount of time. We found out this change had to be made because the tensor sizes in the DNN were affected differently throughout the model as a result of the changes in the input layers; the sketch below illustrates these two changes.
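The sketch uses a hypothetical, much smaller Cifar10-style classifier (not one of the actual benchmark models) to show the two knobs that had to change: the number of input channels and the kernel of the final average-pooling layer.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self, in_channels=3, pool_kernel=8):
        super().__init__()
        # Two stride-2 convolutions downsample 32x32 -> 8x8 (or 28x28 -> 7x7).
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AvgPool2d(pool_kernel)   # must reduce the map to 1x1
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(torch.flatten(x, 1))

cifar_model = TinyNet(in_channels=3, pool_kernel=8)   # 32x32x3 -> 8x8 -> 1x1
mnist_model = TinyNet(in_channels=1, pool_kernel=4)   # 28x28x1 -> 7x7 -> 1x1

print(cifar_model(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
print(mnist_model(torch.randn(2, 1, 28, 28)).shape)   # torch.Size([2, 10])
```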

Currently, our benchmark prototype does provide a working implementation of the Cifar10 dataset for the classification models, but we ran out of time to include these results in the final report. An example of raw results can be viewed in our GitHub repository linked in Appendix C.1. Upon inspection, our preliminary Cifar10 results are very similar to those for the MNIST dataset, which is expected given that changing from MNIST to the Cifar10 dataset is a move from grayscale to colour images with a slightly larger size. Thus, our conclusions seem to hold.

However, adding the final results for Cifar10, and further adding a correct implementation for ImageNet and the corresponding results, are required extensions that would enable a more informed ranking of the analysed frameworks in practice.

After the implementation of the classification models, we further attempted the implementation of the segmentation model Deeplabv3. For the Deeplabv3 model, we used code from GitHub user chenxi116². Additionally, we cloned and adapted code from GitHub user VainF³, to load PascalVOC into PyTorch and perform data transformations. Code from GitHub user jfzhang95⁴ provided examples during this implementation phase. When attempting to use the PascalVOC segmentation dataset, we noticed it was not natively supported in PyTorch. This meant custom DataLoader and Transformation classes had to be implemented. Additionally, for a segmentation task, we need to ensure that the same transformation is applied to the training image as to the validation image (with the bounding box) to achieve correct results, in contrast to a classification task (where there is no such validation image). This validation did not seem to be natively supported by PyTorch either, and again this meant that we had to rely on a custom implementation. Furthermore, PyTorch also does not natively support metrics for segmentation tasks, so again a custom implementation had to be made. All in all, during our implementation work, we ran into memory limitations for the batch-size of 64 we had chosen. Reducing the image dimensions eventually led to a running implementation, but the initial results were underwhelming: after 5 epochs, we were only able to reach an IoU of around 0.02, and an accuracy of 0.3119. These results indicate very poor performance, so we could not guarantee this implementation performs correctly. Thus, due to time limitations, this implementation remains work-in-progress. However, adding segmentation models like Deeplabv3 to our benchmark prototype is a mandatory extension to ensure the promised diversity of the networks and models.

In this research, we also attempted the implementation of a distributed version of PyTorch, to test the native data-parallelism capabilities of PyTorch itself. In PyTorch, data-parallelism can be achieved in two ways: using the DataParallel class or the DistributedDataParallel class. The first provides single-machine data-parallelism, whilst the latter provides additional support for multi-node distributed training. The PyTorch documentation itself states that the first is outperformed by the latter in all use cases.

¹A direct link to an example of results obtained for a Cifar10 Horovod run can be found here.
²The Deeplabv3 repository can be found here.
³The second Deeplabv3 repository can be found here.
⁴The GitHub repository of jfzhang95 can be found here.

We experimented with implementations using both, but ultimately chose to pursue the DistributedDataParallel solution. We attempted this implementation based on code of GitHub user narumiruna. However, we were unable to complete this implementation, as we encountered infrastructure configuration issues (i.e., setting up a default process group was impossible on the Lisa node/cluster). Finishing this implementation should be possible with additional time investment, and would provide (a) a third data-parallel framework to be included in the ranking, and (b) a direct competitor for Horovod's take on data parallelism.
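For completeness, a minimal sketch of the DistributedDataParallel setup this implementation targets is shown below; it assumes one process per GPU started by a launcher such as torchrun (which sets the usual environment variables), and it is not the unfinished benchmark code itself.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with e.g. `torchrun --nproc_per_node=4 train.py`.
dist.init_process_group(backend="nccl")          # the default process group
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(784, 10).cuda(local_rank)     # toy stand-in for a DNN
model = DDP(model, device_ids=[local_rank])           # gradients all-reduced per step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

x = torch.randn(64, 784).cuda(local_rank)
y = torch.randint(0, 10, (64,)).cuda(local_rank)
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
dist.destroy_process_group()
```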

Additionally, we chose not to implement the GPipe framework in this first benchmark prototype, as we pointed out in Section 3.5.1. Including this implementation in the final benchmark prototype will enable an even wider comparison of the frameworks, covering all the state-of-the-art DT Frameworks.

5.2 Other Desirable Extensions

We also included generative models as part of our benchmark. However, it quickly became apparent that this is a very difficult task in terms of implementation, and could not fit in the time allocated for this project. We believe the diversity of the benchmark will be significantly improved when adding such models. Moreover, it would be interesting to have them implemented in order to assess whether our benchmark architecture is generic enough.

Other possible extensions include adding more frameworks to this benchmark. PyTorch is only one of the actively used frameworks for DL. Popular frameworks, like Keras, TensorFlow or Caffe, could also be included in the benchmark, further enlarging the comparison. However, we foresee that such an expansion would require a lot more (custom) coding. In Section 5.3 we explore what this means for our benchmark.

As for extending our benchmark design, expanding its body of models remains a relevant direction for future extensions. For the selection of this body of models, a selection method different from the one we used might be more suitable. For example, if the purpose of updating this benchmark is to give a broader overview of new models, the selection process might be adapted to reflect this intent more accurately.

Another direction we have reflected upon, but have not explored, is the effect of hyper-parameters on the benchmark results. As we noted in Section 3.3, the relation between these parameters and performance is complex. One direction of exploration is to estimate or measure how large this impact is for the different DT Frameworks, and to compare this impact across frameworks.
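As a starting point for such an exploration, the sketch below times a short training pass and records peak GPU memory for a range of batch sizes, i.e. the two metrics used in this benchmark. It assumes a CUDA-capable GPU; the ResNet18 model, the synthetic dataset and the batch sizes are placeholders, and the snippet is not part of the current prototype.

```python
import time
import torch
import torchvision


def measure(model, loader, device="cuda"):
    """Returns (images/second, peak GPU memory in MB) for one pass over loader."""
    model = model.to(device).train()
    optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    torch.cuda.reset_peak_memory_stats(device)
    seen, start = 0, time.time()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimiser.zero_grad()
        criterion(model(images), labels).backward()
        optimiser.step()
        seen += images.size(0)
    throughput = seen / (time.time() - start)
    peak_mb = torch.cuda.max_memory_allocated(device) / 2 ** 20
    return throughput, peak_mb


# Hypothetical sweep for one framework; the same loop could be repeated for
# Horovod or PipeDream runs to compare the size of the impact across frameworks.
dataset = torchvision.datasets.FakeData(size=512, transform=torchvision.transforms.ToTensor())
for batch_size in (32, 64, 128):
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    print(batch_size, measure(torchvision.models.resnet18(), loader))
```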

5.3 Extensibility and Maintainability Issues

Aside from the extensions we labelled as imminent, extending the benchmark further remains a hard task to accomplish.

The first issue arising for a larger benchmark, with more models and datasets, is the engineering effort required: more code requires more engineering. In some cases native support is available, but in most cases a benchmark has to rely on custom code.

A second issue is that by extending the benchmark, there is more code to maintain: with every addition, the maintenance burden grows, and a complex benchmark is, of course, more difficult to keep up-to-date. Even without extensions, this benchmark requires structural maintenance, because functions get deprecated and new code is developed to improve performance. This maintenance is made increasingly demanding by the pace of change of the DT Frameworks, looking for example at the pace of change of PyTorch.

Third, a larger body of benchmark code is more complex to keep aligned with its theoretical design, and with extensions this problem grows in size too. Fourth, using custom code to ensure a consistent environment is, next to the effort of writing this code, vulnerable to configuration issues, like we observed when using PipeDream in this work. Most of the time, using such a framework requires a custom build, which leads to limitations in some form, e.g. in terms of program versions, by down- or upgrading either of the frameworks. These limitations and requirements often lead to a reduction in usability or performance.

Finally, these frameworks and models are hard to work with precisely because they are state of the art. The readily available body of up-to-date examples is very small. Additionally, the documentation for some frameworks (or parts thereof) is lacking, which limits their application greatly and redirects focus from applying the framework to first understanding its code before it can be applied in practice. This hinders attempts at a scientific comparison between frameworks. As users of these frameworks, we believe releasing DT Frameworks open-source should only be the first step, and extensive documentation should be the second.


Chapter 6

Conclusion

Due to the rapidly increasing scale of inputs and Deep Neural Networks, training these networks becomes prohibitively expensive in terms of time. Thus, several solutions are envisioned, where exploiting parallelism enables faster training and/or support for larger and larger models. However, when faced with a large DNN to train, a regular user does not know which of these solutions is, in practice, the faster (feasible) one to use. Therefore, it is important to determine how these solutions compare against each other.

We argue that the choice for using one framework over another should be result- and performance-based. Without benchmarking research, this choice can only be made arbitrarily and without proper argumentation.

In this thesis, we propose a benchmark architecture, partially implemented in a functioning prototype, that enables a performance-based ranking of distributed frameworks for training DNNs. We have further commented on the challenges of designing and extending such a benchmark, as well as on the massive engineering challenges that need to be overcome to fully implement it.

Using this prototype, we determine the ranking of three frameworks when executing a set of state-of-the-art classification models (ResNet, ResNeXt and EfficientNet) on the MNIST dataset. Our final ranking, built specifically with throughput as the metric of interest, reads as follows:

1. PipeDream
2. Horovod
3. sequential PyTorch

As we pointed out in Section 4.4, these results are surprising. In theory, Horovod should outperform PipeDream, but our analysis shows that PipeDream is faster than Horovod. This behaviour, highlighted by our benchmark, could be due to the dataset (small image sizes) or to the slightly different runtime configuration; more investigation is needed to bring empirical evidence to support these hypotheses. Moreover, additional research is needed to verify empirically whether Horovod is outperformed by PipeDream in more cases.

Finally, we revisit our main research question:

Can we design and build a benchmarking suite to empirically compare the performance of DT Frameworks for DNN?

This thesis shows that such a benchmark can be designed and partially implemented, with several caveats. As we pointed out in Sections 5.3 and 3.5.1, implementation is complicated for many non-standard frameworks and models, and a lot of structural maintenance is required to keep the benchmark up-to-date. We have already sketched several immediate and long-term directions of research to further improve the proposed benchmark. We believe such a benchmark is relevant for the future of DNN research, especially in terms of efficient training of large models, but also in terms of efficient hardware resource utilisation. One final future-work direction is to actually test this hypothesis and to submit the benchmark for scrutiny (and improvement) to the community itself, as this is the only way benchmarks become relevant and survive the test of time.


Bibliography

[1] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis”, ACM Comput. Surv., vol. 52, no. 4, Aug. 2019, issn: 0360-0300. doi: 10.1145/3320060. [Online]. Available: https://doi.org/10.1145/3320060.

[2] J. Schmidhuber, “Deep learning in neural networks: An overview”, Neural networks, vol. 61, pp. 85–117, 2015.

[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning”, in 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), 2016, pp. 265–283.

[4] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding”, in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675–678.

[5] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, and P. B. Gibbons, “Pipedream: Fast and efficient pipeline parallel DNN training”, CoRR, vol. abs/1806.03377, 2018. arXiv: 1806.03377. [Online]. Available: http://arxiv.org/abs/1806.03377.

[6] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism”, in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., 2019, pp. 103–112. [Online]. Available: http://papers.nips.cc/paper/8305-gpipe-efficient-training-of-giant-neural-networks-using-pipeline-parallelism.pdf.

[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation”, arXiv preprint arXiv:1706.05587, 2017.


[8] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks”, arXiv preprint arXiv:1905.11946, 2019.

[9] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.

[10] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks”, in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.

[11] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation”, arXiv preprint arXiv:1710.10196, 2017.

[12] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation”, in International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.

[13] F. Turk, M. Luy, and N. Barisci, “Comparison of unet3d models for kidney tumor segmentation”, 2020.

[14] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.

[15] O. Kudashev, S. Novoselov, T. Pekhovsky, K. Simonchik, and G. Lavrentyeva, “Usage of dnn in speaker recognition: Advantages and problems”, in International Symposium on Neural Networks, Springer, 2016, pp. 82–91.

[16] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee, B. Schroeder, and G. Pekhimenko, “Benchmarking and analyzing deep neural network training”, in 2018 IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2018, pp. 88–100.

[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition”, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library”, in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Eds., Curran Associates, Inc., 2019, pp. 8026–8037. [Online]. Available: http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

[19] A. Sergeev and M. D. Balso, “Horovod: Fast and easy distributed deep learning in TensorFlow”, arXiv preprint arXiv:1802.05799, 2018.

[20] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks”, in Advances in neural information processing systems, 2014, pp. 3104–3112.

[21] T. Hoefler and R. Belli, “Scientific benchmarking of parallel computing systems: Twelve ways to tell the masses when reporting performance results”, in SC ’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.

[22] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey”, Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.

[23] P. M. Radiuk, “Impact of training set batch size on the performance of convolutional neural networks for diverse datasets”, Information Technology and Management Science, vol. 20, no. 1, pp. 20–24, 2017.

[24] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the effects of data parallelism on neural network training”, arXiv preprint arXiv:1811.03600, 2018.

[25] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay”, arXiv preprint arXiv:1803.09820, 2018.

[26] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan”, arXiv preprint arXiv:1912.04958, 2019.

[27] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium”, in Advances in neural information processing systems, 2017, pp. 6626–6637.

[28] A. Copeland, “A ‘reasonable’ social welfare function”, Seminar on Mathematics in Social Sciences, University of Michigan, 1951. Cited indirectly from its mention by Luce and Raiffa (1957), p. 358.

[29] M. Jansen, “Recommender system for distributed deep neural networks”, Master’s thesis, Universiteit van Amsterdam, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands, Jul. 2020.


[30] A. Krizhevsky, G. Hinton, et al., “Learning multiple layers of features from tiny images”, 2009.

[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[32] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective”, International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.

[33] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild”, in Proceedings of International Conference on Computer Vision (ICCV), Dec. 2015.


Appendices


Appendix A

Overview of a ResNet18 model
