
Optimization of Adaptive Spiking Neural Networks on GPU's

Thesis BSc Artificial Intelligence

Student: Jaquim Cadogan (UvAID: 10709649)
Credits: 18 EC

1st Supervisor: Ana Lucia Varbanescu, UvA
2nd Supervisor: Sander Bohte, CWI


Abstract

Adaptive Spiking Neural Networks (ASNNs) have been shown to outperform state-of-the-art Spiking Neural Networks. An Adaptive Spiking Neuron (ASN) can effectively be used as a drop-in replacement for a ReLU unit in conventional (deep and/or convolutional) neural networks, yielding a spiking ReLU neuron. Biologically plausible, these units encode information in a highly efficient fashion. Characteristic of ASNs is that they compute event-driven, highly sparse input for downstream convolutional layers in the network. The convolution operation in these layers is highly suitable for parallelization, which has accounted for significant speed-up in processing state-of-the-art neural networks. These convolutional operations, however often highly optimized, do not account for the sparse input pattern computed by an ASN. One might have the intuition that the sparser the input becomes, the fewer elements contribute to the computation of a convolution, and the faster the computation should occur. By proposing a kernel that accounts for input in sparse format, I try to expose whether there is indeed such a space-time trade-off. Furthermore, due to the bounded memory bandwidth of conventional processors, these networks are not able to run in real time. With the significantly higher memory bandwidth of the Graphical Processing Unit (GPU) and the parallel nature of its architecture, exploiting that sparse behaviour through a suitable data structure to arrive at an efficient sparse convolution may ultimately enable a significant reduction of memory overhead and a speed-up in operation time. Deployment of ASNNs in real-time filtering settings is then likely to follow.


Contents

1 Introduction
2 Theoretical foundation
  2.1 Earlier work on sparse exploitation and related CNN optimization
  2.2 Analysis of the MNIST dataset
  2.3 Spiking Neural Networks
    2.3.1 Adaptive Spiking Neural Networks
    2.3.2 Activity of an ASNN
    2.3.3 Convolutional layers in an ASNN
  2.4 Sparse representation of input
    2.4.1 The CSR format
  2.5 Available frameworks
3 Experimental setup
4 Results
  4.1 PyOpenCL for non-optimized convolution
  4.2 PyOpenCL for optimized convolution by accounting for input in CSR format
5 Conclusion
  5.1 Discussion of findings
  5.2 Future work
6 Appendices
  6.1 Proposal for kernel that expects input in CSR format
  6.2 Customized convolve function in gputools
  6.3 Batch function for different degrees of sparsity
  6.4 Script for the convolution for different filter sizes for different devices
  6.5 Python code for an ASN layer for different input currents

List of Figures

1 Energy of convolutional layers versus feed-forward layers
2 Fragments of the MNIST dataset
3 Average fire rate over time of an adaptive spiking network, per layer, for changing input
4 Evolving parameters of an adaptive spiking neuron over time
5 Average sparsity of 784 adaptive spiking neurons over time
6 Visualization of performing a convolution
7 CSR representation of a dense matrix
8 Convolution of different degrees of sparsity by Theano and SciPy
9 Convolution with a custom kernel on sparse input for different devices in OpenCL
10 Standard convolution on sparse input for different devices in OpenCL

List of Tables

1 MNIST dimensions and statistics
2 Variables for characterizing a Spiking Neural Network
3 The parameters involved in performing a convolution
4 An overview of popular Machine Learning frameworks


List of Algorithms

1 A convolution with a kernel that is in CSR format

1 Introduction

Based on recent insights within the domain of neuroscience, an Adaptive Spiking Neural Network (ASNN) [Zambrano and Bohte, 2016] has been presented. As opposed to conventional Artificial Neural Networks (ANNs), time is added as an extra dimension, a key component within the paradigm of Spiking Neural Networks (SNNs) [Maass, 1997], [Ponulak and Kasinski, 2010], [Vreeken et al., 2002], [Grüning and Bohte, 2014]. Because of this, a powerful concept of a time-dependent neural coding scheme emerges, in line with incorporating biologically plausible realism into neuro-cognitive modelling. Given input, adaptive spiking neurons compute spike trains according to this coding scheme and emit these trains as input for subsequent, downstream layers of neurons. Of particular interest is that in biological neurons these trains are homeostatically optimized for the rate at which pulses are emitted, which leads to information being encoded in a highly efficient fashion. The power of these networks lies in being able to represent an input space as a (highly) sparse feature space varying over time. Shown to respond up to an order of magnitude faster while using up to an order of magnitude fewer spikes, ASNNs outperform state-of-the-art SNN implementations.

The presented ASNN can therefore be considered a novel paradigm for neural coding with spiking neurons, with an almost direct correspondence to biological spiking neurons. Among the wide variety of available datasets, the MNIST handwritten digit dataset is used to train an ASNN for the problem of handwritten digit recognition and classification. Lately, classification accuracy has improved significantly as networks have grown deeper, a tendency that relies on building networks with a great number of stacked layers, in a downstream and often feed-forward fashion: so-called 'deep' networks [Sze et al., 2017].

The neurons in these stacked layers often pass their output to layers that compute and communicate by means of convolution; such networks are therefore known as deep Convolutional Neural Networks (CNNs). In a convolution, a kernel essentially weighs two-dimensional spatial input into single output values. The input thereby becomes 'abstracted', and the deeper a network is designed, the more abstracted the computed output will be. This is inspired by the notion of how the human visual system is assumed to process input. Current deep CNN implementations are heavily optimized for performing convolutions on GPUs, an operation which closely fits their highly parallel architecture [Nageswaran et al., 2009a], [Nageswaran et al., 2009b].

Indeed, the presented model has been proven to map classical ANNs and CNNs to ASNNs in a one-to-one fashion and without loss of performance. However, as CNNs go deeper they do get more accurate, but their memory usage increases. Similarly, as the mapped ASNNs have sub-optimal memory usage, there is increased computational overhead. Compared to classical ANNs, where neural units propagate analog values in a continuous fashion, the spike trains emitted to subsequent layers within ASNNs show a high and time-varying degree of sparsity. As these neurons are adaptive, the trains gradually become sparser over time.

Here, sparsity is interpreted as the ratio of zero-valued output (a neuron being inactive at a given time step) over the outputted ones (a neuron being active) in a two-dimensional square spatial matrix. This matrix is therefore often referred to as the 'activation matrix'. For sparsely active neurons in neural networks, where most neurons remain inactive at any given time step, novel approaches need to be developed: since typically only a subset of neurons is active for any stimulus, only a subset of the input (the activation matrix) effectively contributes to the output of a convolutional operation. As neurons in ASNNs emit sparse activation matrices, potential speed-up most likely lies in representing the input or activation matrices in sparse format.
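As a concrete illustration of this definition, the following NumPy sketch computes the sparsity ratio of a binary activation matrix; the matrix itself is a randomly generated, hypothetical stand-in for real ASN output.

import numpy as np

# hypothetical 28x28 binary activation matrix of a layer of spiking neurons:
# 1 = neuron fired at this time step, 0 = neuron stayed silent
activations = np.random.choice([0, 1], size=(28, 28), p=[0.95, 0.05])

# sparsity as defined above: the fraction of zero-valued (inactive) entries
sparsity = 1.0 - np.count_nonzero(activations) / activations.size
print("sparsity: %.3f" % sparsity)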

Graphical Processing Units (GPUs) offer orders of magnitude more floating-point performance than conventional processors, and their memory bandwidth is significantly higher. For densely connected ANNs with a large fan-in and fan-out, reading weights from memory is typically bounded by limited bandwidth. Given that an ANN can be converted into an ASNN, and that ASNNs are suitable for deployment on the GPU, the sparse behaviour within ASNNs may offer an opportunity to reduce computational overhead for both ANNs and CNNs significantly. To truly exploit the efficiency of sparsely active and asynchronous spiking neural networks, efficient GPU implementations that rely on adequate data structures for sparse input are likely to contribute to the reduction of the


computational overhead. This may eventually lead to the development and deployment of models on low-powered hardware, and enable models to operate in real-time streaming and filtering settings [Veenboer, 2013], [Ślażyński and Bohte, 2012].

I therefore aim to expose and, potentially, quantify how exploiting the characteristics of ASNNs on the parallel architecture of the GPU enhances their performance. Specific interest lies in quantifying the space-time trade-off obtained by exploiting the sparse activity of ASNNs for space-efficient computation. To make such quantification possible, a literature study will first review what has already been done on exploiting sparse computation. Next, I will analyze how the variables that describe an ASNN behave over simulation time. Second, quantifying the degree of sparsity in this behaviour will provide insight into how sparse the emitted spike trains are. That is followed by defining the convolution operation. Subsequently, an analysis of the operation time on the CPU of convolution-like operations in the available frameworks, given different degrees of sparse spatial input, will reveal which of these frameworks already account for sparse input. Finally, moving these operations to the GPU and running them with suitable data structures will make it possible to establish, and quantify, the relation between the (operation) time complexity of convolutional operations and the degree of sparsity.

2 Theoretical foundation

2.1 Earlier work on sparse exploitation and related CNN optimization

As Shi and Chu [2017] and Sze et al. [2017] suggest, a high speed-up of convolutional operations can be realized by omitting the insignificant calculation of multiplying by zero. Implementing this idea opens up opportunities such as energy savings, a reduction of multiply-accumulate operations (MACs) and fewer calls to (often expensive, off-chip) memory registers [Reagen et al., 2016]. Speed-up from such an implementation might seem inevitable. However, while omitting zero multiplications by simply mapping the operation onto a dense-sparse multiplication algorithm may look trivial, dense-dense multiplication is often faster due to highly optimized caching schemes and contiguous memory access by means of memory coalescing and memory alignment. Simply moving the input to a sparse representation therefore does not obtain the speed-up one might trivially expect. Particular interest therefore lies in how to exploit these sparse data structures effectively.
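This trade-off is easy to observe on the CPU. The micro-benchmark below is an illustrative sketch (timings are hardware dependent, and the matrix size and density are arbitrary choices): even with roughly 80% zeros, the CSR product is usually not faster than the dense, BLAS-backed product.

import time

import numpy as np
from scipy.sparse import csr_matrix

# a moderately sparse operand (roughly 80% zeros) and a dense weight matrix
dense = np.random.rand(1024, 1024) * (np.random.rand(1024, 1024) > 0.8)
weights = np.random.rand(1024, 1024)
sparse = csr_matrix(dense)

t0 = time.time()
_ = dense.dot(weights)    # dense-dense: contiguous access, BLAS-backed
t1 = time.time()
_ = sparse.dot(weights)   # sparse-dense: indirect access through index arrays
t2 = time.time()

print("dense-dense: %.4fs, sparse-dense: %.4fs" % (t1 - t0, t2 - t1))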

Park et al. [2016] focus on exactly this potential speed-up by using a sparse data representation, and realize it by representing the kernel of the convolution in sparse format; see the pseudocode for a convolution with such a sparse kernel in algorithm 1. While this speed-up proves very effective for large filter sizes, the filter dimensions typically used in state-of-the-art CNNs are often smaller than the dimensions of the input they compute upon. As mentioned in section 1, the neural units in ASNNs, grouped together, emit sparse activation matrices; potential speed-up therefore most likely lies in representing the input or activation matrices in sparse format and exploiting this specific operand of the convolution.


Figure 1: Energy of convolutional layers versus feed-forward layers.

Furthermore, Park et al. [2016] and Sze et al. [2017] both emphasize the importance of focusing on the convolutional layers when optimizing these networks. State-of-the-art CNNs often contain a high number of parameters with near-zero values, the majority of which are contained in the feed-forward layers. Pruning these parameters by retraining the networks with a regularization factor does reduce the size of these networks, but the resulting speed-up in inference is often negligible. In addition, the computationally heavy part is mostly accounted for by the convolutional layers. Figure 1 shows the energy consumption of convolutional layers versus that of the feed-forward layers, which further emphasizes that most potential for speed-up lies in the efficient exploitation of the operands of a convolution.

2.2 Analysis of the MNIST dataset

With the increasing accuracy of image classification and recognition models, there is a growing availability of image datasets. Among this range of available datasets is the widely used MNIST dataset. Table 1 shows an overview of the dimensions and statistics of the images contained in the MNIST dataset. It is a subset of a larger dataset known as NIST. The images in these sets are centered and size-normalized. By using this dataset one is not concerned with preprocessing and formatting, while still having the advantage of training on real-world handwritten data. The training and test sets are disjoint, and together contain images from around 250 individual writers.

The pixels in the images are single-channel intensity (grey) values in the range 0-255 (for 8-bit grayscale images, 2^8 = 256), where 0 is black and 255 is white at maximum intensity. Normalizing this range according to equation 1 yields a range of 0-1. How this normalization decision directly affects the sparse behaviour of an ASN is elaborated upon in section 2.3.1. As mentioned, the images are centered and have a black background, as visible in figure 2a. Since the images are centered, it is mostly the center part that actually contains the encoded information. Table 1 shows that on average approximately 150 elements encode the non-zero intensity information.

One can, however, not disregard the spatial dimensions, as the locations of the black values (approximately 635 pixels per image) also contribute to the information about the shapes depicted in an image. A model trained on the MNIST data therefore receives sparse spatial input.


| N (training) | N (test) | Dimensions             | Representation      | µ non-zero elements | µ % non-zero elements |
|--------------|----------|------------------------|---------------------|---------------------|-----------------------|
| 60K          | 10K      | 28 × 28 (= 784 pixels) | 0-255 (grey values) | 149.9 (σ = 41.46)   | 19.12                 |

Table 1: MNIST dimensions and statistics

A large part of the numbers in a given image is zero; encoding this yields a low fire rate, resulting in emitted spike trains that are even sparser. For any spiking model trained on this type of dataset there is thus a combination of sparse spatial input and the sparse temporal dynamics it induces. For ease of iteration one can make use of a flattened representation (a 1 × 784 vector); this representation, however, has to be reshaped to its original dimensions before being presented to a convolution operation.

$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1}$$
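Equation 1 amounts to a single NumPy expression; the array below is a made-up stand-in for a batch of MNIST images.

import numpy as np

# made-up stand-in for a batch of 8-bit MNIST images, shape (N, 28, 28), values 0-255
images = np.random.randint(0, 256, size=(10, 28, 28)).astype(np.float32)

# equation 1: min-max normalization of the grey values to the range [0, 1]
z = (images - images.min()) / (images.max() - images.min())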

(a) A single image from the MNIST dataset with dimensions of 28 by 28 (by 1).

(b) A subset from the MNIST dataset.

Figure 2: Fragments of the MNIST dataset

2.3 Spiking Neural Networks

There have been great advances in the development of Artificial Neural Networks. To incorporate more biological resemblance to the human brain into these networks, there is joint research between the domains of (computational) neuroscience and machine learning. The human brain comprises an intricate system of interconnected neurons, and the transmission and processing of the spikes emitted by biological neurons is a widely studied concept within neuroscience.

Outgoing spikes travel along axons to influence the state of a synapse. Depending on that state, a synapse has either a suppressing or an inciting effect on the membrane potential of a postsynaptic neuron. That effect is communicated to neurons by the release of neurotransmitters. The potentials that increase (depolarize) and decrease (hyperpolarize) the membrane potential of a postsynaptic neuron are called Excitatory Postsynaptic Potentials (EPSPs) and Inhibitory Postsynaptic Potentials (IPSPs) respectively. When a postsynaptic neuron is sufficiently stimulated by receiving EPSPs, its membrane potential reaches a threshold, which causes the neuron to generate a spike itself. That generated spike is in turn sent over axons as input for successive neurons. As soon as a spike is emitted, the membrane potential is reset and for a fixed time constant the neuron is not receptive to incoming potentials. During that refractory period the membrane potential is in its resting state.

While the above is a general description of the dynamics through which neurons communicate, the actual behaviour exhibited by real-life neurons may vary greatly. Great effort has gone into describing those dynamics by modelling the behaviour of neurons as dynamical systems of differential equations. Approaches that do not rely on differential equations, but on the summation of integral kernels, are known as Spike Response Models; they essentially map incoming spike trains to outgoing spike trains by operating as a filter on the input (the incoming potentials of synapses). Both approaches contribute to the research into the dynamic behaviour of neurons.


| Denotation | Represents                                        | Consists of  |
|------------|---------------------------------------------------|--------------|
| N          | Number of neurons                                 |              |
| S_neuron   | Bytes required for the storage of a single neuron |              |
| S_network  | Bytes required for the storage of a network       | N × S_neuron |
| C_neuron   | Computational cost of a neuron                    |              |
| C_network  | Computational cost of the network                 | N × C_neuron |

Table 2: Variables for characterizing a Spiking Neural Network

The capability of a model to describe the dynamics of a neuron's behaviour in a specific and detailed fashion often comes paired with an increased computational complexity of the overall model. The incorporation of neuroscientific realism is therefore dependent on, and often at the expense of, the computational power of the available hardware resources. Relevant parameters that can be used to characterize a typical neural network are summarized in table 2.

Given the MNIST input, the encoding of an image proceeds by mapping a single neuron to every pixel in the image. N is thus equal to 784, which corresponds to the length of the flattened representation of an MNIST image (28 × 28). This information can be encoded using a single bit for every neuron. Hence one obtains an array of ones and zeros of length N, which encodes the activity of a given layer within the complete network in the distribution step.
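A minimal sketch of this encoding, together with the storage relation S_network = N × S_neuron from table 2. The 10% activity level is an arbitrary placeholder, and one byte rather than one bit is used per neuron for simplicity.

import numpy as np

N = 28 * 28                        # one neuron per MNIST pixel
active = np.random.rand(N) < 0.1   # placeholder: roughly 10% of neurons spiking

# activity of one layer at one time step: an array of ones and zeros of length N
activity = active.astype(np.uint8)

# storage relation from table 2: S_network = N * S_neuron
# (here one byte per neuron; a bit-packed encoding would need N / 8 bytes)
S_neuron = activity.itemsize
S_network = N * S_neuron
print(S_network, "bytes per time step")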

2.3.1 Adaptive Spiking Neural Networks

The input layer of an Adaptive Spiking Neural Network comprises ASNs mapped one-to-one to the pixels of the input. Then, for each time step of a given simulation, typically steps of one millisecond over 500 milliseconds of simulation time, the state of the corresponding neuron is updated. When the membrane potential is sufficiently depolarized, i.e. the fire threshold is met, a spike is generated for that time step. For a thorough introduction to Adaptive Spiking Neural Networks see [Zambrano and Bohte, 2016].

A neural unit in an ASNN, the adaptive spiking neuron, can effectively be used as a drop-in replacement for the conventional ReLU unit used in conventional and deep neural networks. The drop-in unit is a spiking ReLU unit that is biologically plausible, as is the learning rule to which the unit adapts. As the presented model is a direct mapping of state-of-the-art (convolutional) neural networks, only feed-forward architectures have been mapped so far; other possibilities, such as recurrent loops or hybrid architectures of feed-forward and recurrent connections, are not implemented yet. In the mapped network the neurons are replaced by computational units of the ASN kind. The output of an ASN layer is input for a successive convolutional layer (see section 2.3.3 for a detailed description of the convolution operation), and the output of a convolutional layer is fed to a ReLU activation unit, producing output for another layer of ASNs.

Performing the computation according to the mechanics described above, one observes the behaviour of the variables over time shown in figure 4. In this figure the choice to normalize the grey values of the MNIST dataset reveals itself: the higher the presented input current, the higher the average fire rate over time. One thus has a data structure of 28 by 28 neurons over 500 milliseconds, 784 neurons being either active or inactive per time step, and hence a total of 392,000 measurements. We therefore encounter a data structure with a very high degree of sparsity, which varies over time by becoming even sparser.
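The resulting data structure can be sketched as a boolean tensor of 500 time steps by 28 by 28 neurons. The fixed spiking probability below is a stand-in for the real, adapting neuron dynamics.

import numpy as np

T = 500                                     # milliseconds of simulation time
spikes = np.random.rand(T, 28, 28) < 0.05   # placeholder for the real, adapting ASN output

# 500 x 784 = 392,000 binary measurements in total
print(spikes.size)                          # -> 392000

# sparsity per time step: the fraction of inactive neurons in each 28 x 28 slice
sparsity_per_step = 1.0 - spikes.reshape(T, -1).mean(axis=1)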

2.3.2 Activity of an ASNN

The activity of the trained ASNN model per layer is shown in figure 3. This figure shows bursts directly after a new stimulus is presented and a decay of the fire rate over time. Indeed, the lower this fire rate, the sparser the data emitted as input to successive layers becomes. These bursts therefore should, and do, relate directly to the bursts seen in figure 5.

For 10 ms of stimulus time, one sees that the sparsity ratio of the time slices lies in δ ∈ (0.835, 1), where the lower end occurs only directly after stimulation and only persists for a time interval


of around 10 ms. For the remainder of the simulation time, 400 ms, the range is closer to δ ∈ (0.93, 0.99). So for every 784 pixels inputted to 784 ASNs there are, over 1000 images, at most 130 active neurons; this activity quickly drops to a fluctuation of 70 down to 10 neurons actively contributing to the computation. Zero-valued input simply results in a flat spike train. According to table 1, around 150 pixels have a positive value, so of the pixels that actually encode information, between 10 and 70 are active at a time, which gives an effective sparsity range of (1 − 70/150, 1 − 10/150) ≈ (0.53, 0.93).

The time constants involved in a neuron's state are variable; by tweaking them, a variable degree of sparsity is emitted over time. In addition, although the patterns are similar, the deeper a layer lies, the sparser a neuron's output is. This disposes towards a focus on the last fully connected layers of large networks, as these layers are highly sparse and have a high number of connections per neuron (a large fan-in fan-out factor).

Figure 3: Average fire rate over time of an adaptive spiking network, per layer, for changing input.

2.3.3 Convolutional layers in an ASNN

To make use of highly optimized linear algebra libraries such as BLAS, LAPACK and Intel MKL on conventional CPUs, and of GPUs through CUDA, cuBLAS and OpenCL, a convolution is often mapped to a matrix multiplication. By doing so, the advantage of SIMD processing is exploited and the convolution is often highly parallelized. The variables that characterize a convolution operation are listed in table 3. A vanilla, straightforward convolution consists of H_out × W_out × R × S × K × C floating-point multiplications and additions; figure 6b visualizes the dimensions. This count could readily be reduced by only performing the calculations that contribute to the output of a convolution, i.e. by accounting for the sparse structure of the input.
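The mapping of a convolution onto a matrix multiplication can be sketched with the usual im2col construction. The following is a minimal single-channel version, assuming unit stride and no padding; it is not the implementation used by any of the libraries mentioned above.

import numpy as np

def im2col_conv(image, kernel):
    # single-channel convolution as a matrix product (unit stride, no padding)
    R, S = kernel.shape
    H_out = image.shape[0] - R + 1
    W_out = image.shape[1] - S + 1
    # each row of cols holds one R x S patch of the input, flattened
    cols = np.empty((H_out * W_out, R * S), dtype=image.dtype)
    for y in range(H_out):
        for x in range(W_out):
            cols[y * W_out + x] = image[y:y + R, x:x + S].ravel()
    # the convolution is now a single, highly parallelizable matrix product
    return cols.dot(kernel.ravel()).reshape(H_out, W_out)

out = im2col_conv(np.random.rand(28, 28), np.ones((3, 3)))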

A convolution is in general computed according to equation 2. Figure 6a shows the essence of performing a convolution. For a convolution on the GPU one can assign each pixel of the input to a copy of a kernel: to obtain parallelism, a kernel (such as the one proposed in appendix 6.1) is assigned to every pixel of the input. For every pixel, the filter weights compute their contributions, which are summed and written to the corresponding pixel of an output matrix. This achieves highly parallel execution, as opposed to conventional processors, which often compute in a pipelined fashion. To achieve high parallel performance, GPUs use a large number of units that execute in SIMD fashion. In packages such as CUDA and OpenCL, threads are grouped into blocks, which are in turn organised into grids; within some limits it is left to the end user to determine how to assign threads so as to fine-tune for specific applications. For a thorough tutorial on sparse structures and a range of kernels that perform mathematical operations on them, see Bell and Garland [2008].


Figure 4: Conversion of an ASNN model from Matlab to Python: the evolving parameters of an adaptive spiking neuron over time, for a given input current.


| Parameter | Represents                        |
|-----------|-----------------------------------|
| C         | # of input feature maps           |
| H_in      | Height of the input image         |
| H_out     | Height of the output image        |
| W_in      | Width of the input image          |
| W_out     | Width of the output image         |
| K         | # of output feature maps          |
| R         | Height of the filter kernel       |
| S         | Width of the filter kernel        |
| P         | Padding of the input image        |
| T         | Stride of the convolution filter  |

Table 3: The parameters involved in performing a convolution.

$$O_{k,y,x} = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} W(k, c, r, s)\, I(c,\ y+r,\ x+s) \tag{2}$$
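Read literally, equation 2 translates into the nested loops below; this is a plain reference implementation with illustrative shapes, useful only for checking faster kernels against.

import numpy as np

def conv_reference(I, W):
    # direct evaluation of equation 2; I has shape (C, H_in, W_in), W has shape (K, C, R, S)
    K, C, R, S = W.shape
    H_out = I.shape[1] - R + 1
    W_out = I.shape[2] - S + 1
    O = np.zeros((K, H_out, W_out), dtype=I.dtype)
    for k in range(K):
        for y in range(H_out):
            for x in range(W_out):
                # O(k, y, x) = sum_c sum_r sum_s W(k, c, r, s) * I(c, y + r, x + s)
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            O[k, y, x] += W[k, c, r, s] * I[c, y + r, x + s]
    return O

O = conv_reference(np.random.rand(1, 28, 28), np.random.rand(1, 1, 3, 3))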

Data: 2-dimensional activation matrix
for each output channel n do
    for j in [W.rowptr[n], W.rowptr[n+1]) do
        off   = W.colidx[j]
        coeff = W.value[j]
        for y = 0 to H_OUT - 1 do
            for x = 0 to W_OUT - 1 do
                out[n][y][x] += coeff * in[off + f(0, y, x)]
            end
        end
    end
end

Algorithm 1: A convolution with a kernel that is in CSR format, after Park et al. [2016]. For the MNIST dataset the number of output channels n is equal to 1, as the images are two-dimensional grayscale; RGB images would have three channels, holding the red, green and blue values per pixel.

2.4 Sparse representation of input

2.4.1 The CSR format

As has become clear, the two-dimensional matrices computed by layers of ASNs are highly sparse: roughly 10 to 70 neurons out of 784 contribute, at any given time step, to the computation in the convolutional layers. How can one account for only the contributing information? That is, is there a way to account for the active neurons (indices that contain a one in the matrix) and omit the inactive neurons (indices that contain a zero)?

In the compression of data structures there have been a great many proposals on how to do this effectively and reliably. Each of these options can be characterized by its storage and computational requirements and by its manipulation and access properties, and each is often designed and chosen with regard to a specific pattern in the data to be compressed. Such patterns may concern the diagonal of a matrix, or for example its upper or lower triangle. For arbitrary input patterns one usually resorts to general-purpose formats. In a two-dimensional matrix where all entries are stored explicitly, every entry claims storage in the memory of the device. For sparse structures, however, it is often the case that (significantly) large parts of the entries are identical and therefore redundant; explicitly storing those elements in a dense format is inefficient with regard to memory usage.

A popular choice among general-purpose compression formats is CSR.


(a) A filter is essentially placed on the input image and slides over it from top-left to bottom-right. Notice that the filter height and width are usually equal and odd, so that neighbouring pixels are weighed around a center pixel, whose value the weighted sum replaces.

(b) A visualization of the dimensions involved in a convolution

Figure 6: Performing a convolution with a filter (a kernel) and an input image. The computed output is a weighted sum of the pixels surrounding a source pixel. A kernel or filter is therefore often regarded as a 'weight matrix', as each entry of this weight matrix computes the contribution of a surrounding pixel to the computed output. One can impose various constraints on these filters to obtain different features to be learnt. The output then replaces the source pixel. With padding and stride equal to zero and one respectively, the output image keeps the same dimensions.


By the CSR specification, the non-zero elements are stored in a dense vector (data); a second vector (ind), of the same length as the data vector, stores the column index of each non-zero element; and a third array of row pointers (rowptr) carries offsets into data corresponding to the first element of each row. The representation of a dense matrix in CSR format is visualized in figure 7.

Figure 7: Representation of a dense matrix A in the components defined by the CSR specification. Note that for an ASNN the activation matrix at a given time step comprises only zeros and ones.

The Compressed Sparse Row (CSR) format thus enables one to store a dense structure, typically a two-dimensional matrix, by specifying only the indices that contain information. By moving from a dense matrix to a sparse format such as CSR, one effectively purges the elements that are not defined. For the output computed by a layer of ASNs, one only wants to specify explicitly which neurons, at indices corresponding to pixels of the input image, are active at a given time step: one wants to specify the ones, and not the zeros.
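SciPy exposes exactly these three arrays, so the decomposition of figure 7 can be reproduced directly; the example matrix here is made up.

import numpy as np
from scipy.sparse import csr_matrix

# a small binary 'activation matrix', as emitted by a layer of ASNs
A = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])

A_csr = csr_matrix(A)
print(A_csr.data)     # non-zero values             -> [1 1 1 1]
print(A_csr.indices)  # column index per non-zero   -> [1 3 0 2]
print(A_csr.indptr)   # row pointers (offsets)      -> [0 1 2 4]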

2.5 Available frameworks

With a growing ecosystem of Machine Learning frameworks there is a wide variety to choose from, and for each programming language there is in turn a variety of packages that build upon these frameworks.

Such packages provide a customized wrapper, an API, that makes a framework usable in the programmer's language of preference. Various such handles have been written for the Python programming language. Python has a growing scientific and industrial community, so being able to develop at the frontier of a topic of interest in this language is of value to research disciplines. Different handles suit different subdomains of the Machine Learning domain.

| Framework  | Language        | Object access level | Implementation                        | Sparsity                       | CUDA/GPU       | Documentation on sparsity | Purpose             |
|------------|-----------------|---------------------|---------------------------------------|--------------------------------|----------------|---------------------------|---------------------|
| Theano     | Python          | Low level           | Stand-alone                           | SciPy CSC/CSR format           | Yes/Single-GPU | Elaborate                 | Multi               |
| MxNet      | Python optional | High level          | Stand-alone                           | SciPy CSC/CSR format           | Yes            | Poor                      | Multi               |
| Keras      | Python          | High level          | On top of either Tensorflow or Theano | SciPy CSC/CSR format           | Yes/Single-GPU | Poor                      | Multi               |
| Tensorflow | Python          | High level          | Stand-alone                           | Sparse tensor object (CSC/CSR) | Yes/Single-GPU | Elaborate                 | Multi               |
| Torch      | Python optional | High level          | Stand-alone                           | Sparse tensor object (CSC/CSR) | Yes            | Medium                    | Multi               |
| CNTK       | Python API/C    | High level          | Stand-alone                           | N/A                            | Yes/Multi-GPU  | N/A                       | Multi               |
| DSSTNE     | C/LUA           | High level          | Stand-alone                           | Optimized                      | N/A            | Poor                      | Recommender systems |

Table 4: An overview of popular Machine Learning frameworks, their Python support and their account for sparse functionality.

As an API operates on top of a framework, one can regard such an API as operating at a 'high level'. Different high-level APIs in different programming languages may be implemented upon the same framework, itself written in yet another language; these frameworks can be regarded as operating at a 'low level'. High-level wrappers thus inherit their functionality from the framework at their core. To provide new functionality to the Machine Learning community, one obtains the most impact if that functionality is implemented at a low level. The great boost in computation speed and accuracy of state-of-the-art convolutional networks rests on the support of GPU computing.


For the CPU, conventional and highly optimized linear algebra functions are provided by the so-called BLAS and LAPACK libraries. Two major frameworks that provide such implementations for the GPU are OpenCL and CUDA, and they are supported by practically all Machine Learning frameworks. The examination of popular frameworks in table 4 provides insight into which frameworks are available, have a Python handle, and account for any functionality regarding sparsity.

Using the batch function of appendix 6.3 to generate batches of 28 by 28 matrices of different degrees of sparsity, figure 8 shows the operation time of performing a convolution on these matrices, on the CPU, for two popular libraries in Python. The operation time reveals itself to be constant and thus not dependent on the degree of sparsity of the input frame. So although a library such as SciPy offers a sparse representation, explicitly accounting for the sparse structure is imperative to benefit from it. Most of the frameworks rely on the SciPy Python package [3]; a sparse convolution kernel that is either compatible with or directly integrated into SciPy would therefore be a great contribution to the scientific community.
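The flat curves of figure 8 can be reproduced with a few lines on the CPU. The sketch below follows the same batching idea as appendix 6.3, using scipy.signal.convolve2d; the batch size and the sampled sparsity values are illustrative choices.

import time

import numpy as np
from scipy.signal import convolve2d

h = np.ones((3, 3), dtype=np.float32)

for sparsity in np.linspace(0.05, 0.99, 5):
    batch = [np.random.choice([0, 1], size=(28, 28),
                              p=[sparsity, 1 - sparsity]).astype(np.float32)
             for _ in range(100)]
    ts = time.time()
    for m in batch:
        convolve2d(m, h, mode='same')
    # the per-batch time stays roughly flat: convolve2d is oblivious to sparsity
    print("sparsity %.2f: %.4fs" % (sparsity, time.time() - ts))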

Among the great benefits of packages implemented upon frameworks such as OpenCL and CUDA for GPU computing is their modularity in changing an existing function. The functions are so-called kernels, and these kernels can be altered as desired: a kernel can be replaced in a manner that does not alter the computational output of the function, but performs the computation in a different and often optimized fashion, given new insights in a computational domain. The proposed kernel for convolution on a sparse input structure in CSR format is found in appendix 6.1.

3 Experimental setup

Querying the PyOpenCL engine for the devices available for computation results in table 5. One can see that it reveals one CPU and two GPUs at our disposal, together with their properties. The Iris Pro is an onboard processing unit, whereas the AMD Radeon unit is off-chip; off-chip access and transfer of operands is often expensive in terms of time and energy consumption.

To confirm that OpenCL likewise has no optimizations for sparse input structures, figure 10 in section 4 shows that the operation time is indeed constant over the varying degrees of sparsity, and that the off-chip processing unit is slowest overall, in line with the cost of off-chip access and transfer of operands.

With the batch function of appendix 6.3 integrated, and the script of appendix 6.4 calling a convolution kernel implemented according to the proposal in appendix 6.1, the operation time is measured. Note that this kernel assumes that an ASNN does not output analog values, but that its neural units emit binary output. As a consequence, the actual values of the activation matrix need not be passed to the proposed kernel: the kernel loops over the indices, and a boolean check on the column and row-pointer components that make up the CSR format suffices to retrieve whether a zero or a one is present at a specific place in the input matrix. For a linear range of near-zero [4] to one hundred percent sparsity, batches of size one hundred comprised of matrices with dimensions of 28 by 28 are presented to a convolution operation with conventional filter sizes of three, five and seven. As appendix 6.4 shows, the solution relies on tweaking the Python package gputools, which has the option to modularly insert customized kernels; this way one is able to run a customized script on the available GPUs present in the device. Gputools is called through Python, which in turn calls OpenCL [5], which in turn relies on C.

[3] Well maintained documentation of SciPy can be found at: https://www.scipy.org/
[4] Zero output is not included, as empty arrays are passed to the kernel in this case.
[5] The repository for gputools and PyOpenCL can be found at https://github.com/maweigert/gputools and
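The central trick of the proposed kernel, that the activation matrix is binary, so only the indices and rowptr arrays are shipped to the device and a membership test replaces the multiplication, can be restated in plain Python. The function below is a simplified, unoptimized re-statement of that idea for a 'same'-size convolution; it is not the OpenCL kernel of appendix 6.1 itself.

import numpy as np
from scipy.sparse import csr_matrix

def sparse_binary_conv(indices, rowptr, shape, h):
    # 'same'-size convolution of a binary CSR matrix; the values array is omitted
    Ny, Nx = shape
    R, S = h.shape
    out = np.zeros(shape, dtype=np.float32)
    for y in range(Ny):
        for x in range(Nx):
            acc = 0.0
            for r in range(R):
                row = y + r - R // 2
                if 0 <= row < Ny:
                    # column indices of the non-zeros on this row, via the CSR arrays
                    cols = indices[rowptr[row]:rowptr[row + 1]]
                    for s in range(S):
                        col = x + s - S // 2
                        # membership test instead of a multiplication by zero or one
                        if 0 <= col < Nx and col in cols:
                            acc += h[r, s]
            out[y, x] = acc
    return out

A = (np.random.rand(28, 28) < 0.05).astype(np.float32)
A_csr = csr_matrix(A)
out = sparse_binary_conv(A_csr.indices, A_csr.indptr, A.shape, np.ones((3, 3)))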


(a) Performance of the SciPy package

(b) Performance of the Theano framework

Figure 8: Convolution operation time as a function of different degrees of sparsity for different filter sizes, for SciPy and Theano, on the CPU.


Platform: Apple. OpenCL runtime and version: PyOpenCL 2017.1.1 (OpenCL 1.2).

| Property            | Intel Core i7-4870HQ (8 cores) @ 2.50GHz (CPU) | Iris Pro (GPU)  | AMD Radeon R9 M370X (GPU) |
|---------------------|------------------------------------------------|-----------------|---------------------------|
| GLOBAL MEM SIZE     | 17.18 GB                                       | 1.61 GB         | 2.15 GB                   |
| MAX MEM ALLOC SIZE  | 4.29 GB                                        | 400 MB          | 530 MB                    |
| LOCAL MEM SIZE      | 32768                                          | 65536           | 32768                     |
| IMAGE2D MAX WIDTH   | 8192                                           | 16384           | 16384                     |
| IMAGE2D MAX HEIGHT  | 8192                                           | 16384           | 16384                     |
| IMAGE3D MAX WIDTH   | 2048                                           | 2048            | 2048                      |
| IMAGE3D MAX HEIGHT  | 2048                                           | 2048            | 2048                      |
| IMAGE3D MAX DEPTH   | 2048                                           | 2048            | 2048                      |
| MAX WORK GROUP SIZE | 1024                                           | 512             | 256                       |
| MAX WORK ITEM SIZES | [1024, 1, 1]                                   | [512, 512, 512] | [256, 256, 256]           |

Table 5: Overview of the available devices in OpenCL.

4 Results

Generating data again according to appendix 6.3, for different degrees of sparsity, batch sizes of one thousand, and input dimensions of 28 by 28, figures 10 and 9 show the results of running the convolution kernels on input in normal dense format and on input in CSR format, respectively.

4.1 PyOpenCL for non-optimized convolution

See figure 10 for the results of running a non-optimized kernel.

4.2 PyOpenCL for optimized convolution by accounting for input in CSR format

See figure 9 for the results of running the kernel that handles input in CSR format, as proposed in appendix 6.1.

Figure 9: Runtime of convolution for different filter sizes on different CPU and GPU devices, for the proposed kernel that accounts for input in CSR format. Batch size is 1000 and input dimensions are 28 by 28.

5 Conclusion

The aim of this project has been to expose and quantify how exploiting the characteristics of an ASNN on the parallel architecture of the GPU affects the performance of these ASNNs. This has


(a) Performance of the 3×3 filter

(b) Performance of the 5×5 filter

(c) Performance of the 7×7 filter

Figure 10: Runtime of convolution for different filter sizes on different CPU and GPU devices. These convolutions are performed on input matrices (in dense format) of an increasing degree of sparsity, with filters of the indicated size. One might expect a faster runtime on the GPU devices than on the CPU; however, transferring the operands from the host to the (off-chip) GPU accounts for most of the measured operation time. Batch size is 1000 and input dimensions are 28 by 28.


been done with a specific focus on quantifying the space-time trade-off by accounting for efficient sparse computation that exploits the sparse activity of ASNNs. The literature study reveals that the most computationally expensive component in state-of-the-art networks is processing the convolutional layers; enhancing the execution time of state-of-the-art networks therefore most likely relies directly on optimizing their convolutional computations. The potential for a speed-up in ASNN performance has become highly likely, as the analysis of the variables that describe an ASN, and of their behaviour over time, reveals that the sparsity of the output computed by a layer of ASNs is above 90% for most of the simulation time; this in turn exposes the urgency of efficiently accounting for the sparse structure passed to the convolutional layers. As discussed in section 2.3.2, the deepest layers in the ASNN architecture exhibit the highest degree of sparsity. This disposes towards a focus on the last fully connected layers of large networks, as these layers are highly sparse and have a high number of connections per neuron (a large fan-in fan-out factor); these layers are potentially most affected by a convolution kernel that accounts for sparse input. The analysis of the available Machine Learning frameworks exposes that most of them lack implementations that exploit sparse input, despite having a sparse representation of data; sections 3 and 4 indeed show a constant relation between operation time and sparsity for input in dense format. With the proposed kernel that performs a convolution on a sparse input structure in CSR format, it is indeed shown that the operation time of a convolution on a sparse structure drops by up to 20-30% as the structure gradually becomes sparser, with the highest speed-up obtained at 95% or more sparsity in the input. This suggests that there is indeed a space-time trade-off of linear form for performing a convolution on input data of an increasing degree of sparsity.

5.1 Discussion of findings

The neural units in an ASNN have shown activity in a sparsity range above 90%: the matrices that result from this activity at every time step of simulation time almost always consist of zeros for the majority of their entries. These matrices serve as input for the convolutional layers in these networks. The intuition that the sparser an input structure for a convolution gets, the fewer multiplications contribute to the computed output, and hence the faster the overall computation could occur, is partially confirmed by the implementation of a kernel that accounts for sparse input.

With a growing number of Machine Learning frameworks there is a wide variety to choose from. It is a time-expensive task to get these frameworks to behave in a uniform manner in the same setting, given the many options for different devices and the corresponding configuration requirements; almost all frameworks rely on their specific dependencies, which in turn have their own configurations. Furthermore, the available hardware resources imposed further restrictions on whether a customized script could run at all, eventually making it possible to run the script on only one of the devices. A great deal of this project has been spent reviewing these requirements and hardware resources, leaving less room for the evaluation and analysis of the presented results. Although the presented results are modest, they suggest a confirmation of the intuition mentioned above; a more careful analysis, review and evaluation of the results would make it possible to establish and quantify how the operation time of a convolution is affected by an increasing degree of sparsity presented in sparse format.

While getting these frameworks to operate in a uniform fashion is a time-expensive step, specifically regarding sparse functionality, almost all frameworks rely on and inherit from the sparse implementation of SciPy. Providing a low-level sparse convolution kernel to a low-level library like SciPy is therefore likely to propagate sparse functionality to all the frameworks that depend on SciPy. This is potentially of great value to the scientific community.

5.2 Future work

Although it has been shown that the operation time of a convolution is reduced linearly with an increasing degree of sparsity, there is room for improvement in the mechanics of the presented kernel. Optimal use of the GPU, for example by means of better threading and parallelization, is highly likely to help obtain the full benefit of sparse input on the GPU's parallel architecture. The presented kernel may inspire, and serve as the foundation for, such optimization.


For ASNNs to perform real-time, high-quality video filtering, where the input typically arrives at a high number of frames per second (typically 60) and at conventional video dimensions, there are large numbers of consecutive images, often of significantly larger dimensions than an image in the MNIST dataset. To facilitate such settings on devices such as GPUs, an optimal kernel that operates on sparse input, and thus provides computation on compressed data, memory reduction and a reduced operation time for convolutional layers, is indispensable.

Finally, while exposing this space-time trade-off is insightful for the potential speed-up of the convolutional layers of an ASNN, sparse input is frequently encountered across the computational sciences in general. Exposing this relation is therefore of value to the scientific community, especially with the emerging and ongoing tendency of high-performance computing to rely on computation performed on the GPU and its parallel architecture.


6 Appendices

6.1 Proposal for kernel that expects input in CSR format

// Note that no values array is passed to this kernel: looping over the column
// indices and row pointers is enough to check whether an entry exists, and
// since the activation matrices contain only binary information, a boolean
// check suffices.
//
// Input:
//   h:      filter in dense format
//   output: array in dense format, same dimensions as the dense input

__kernel void convolve_sparse_buf(__constant float * h,      // dense format
                                  __global float * output,   // dense format, same dims as input
                                  const int Nhy,             // filter dim y
                                  const int Nhx,             // filter dim x
                                  __constant int * indices,  // CSR column indices
                                  __constant int * rowptr,   // CSR row pointers
                                  const int nnz,             // # non-zero elements
                                  const int num_rows,        // input dim x (symmetric)
                                  const int num_rowptr)      // # row pointers
{
    // get the indices of the pixel in the input image:
    // the row coordinate of this work item ...
    int i = get_global_id(0) / 28;
    // ... and the column coordinate, by reshaping the flat global id
    int j = get_global_id(0) % 28;

    // input dimensions
    int Nx = 28;
    int Ny = 28;

    // accumulator for the convolution output
    float res = 0.f;

    // compute the x range of the filter
    const int hx_start = ((i + Nhx / 2) >= Nx) ? i + Nhx / 2 + 1 - Nx : 0;
    const int hx_end   = ((i - Nhx / 2) < 0)   ? i + Nhx / 2 + 1     : Nhx;
    const int startx   = i + Nhx / 2;

    // compute the y range of the filter
    const int hy_start = ((j + Nhy / 2) >= Ny) ? j + Nhy / 2 + 1 - Ny : 0;
    const int hy_end   = ((j - Nhy / 2) < 0)   ? j + Nhy / 2 + 1     : Nhy;
    const int starty   = j + Nhy / 2;

    // we only visit the rows we care about: hx_start ... hx_end
    for (int htx = hx_start; htx < hx_end; ++htx) {
        // coordinate in the input matrix where the filter row lands
        int coord = startx + htx;
        int row_start = rowptr[coord];
        int row_end   = rowptr[coord + 1];

        // take the first non-zero column on the row we care about
        int ind = indices[row_start];

        // loop over the filter indices
        for (int hty = hy_start; hty < hy_end; ++hty) {
            // as long as we are at lower column indices, keep advancing
            while (ind < starty + hty) {
                if (row_start < row_end) {
                    ++row_start;                 // advance to the next stored non-zero
                    ind = indices[row_start];
                } else {
                    // leave the loop ONLY once we are at hty or higher
                    break;
                }
            }
            if (ind == starty + hty) {
                // we are at hty and found a non-zero, so we use it;
                // this only works for binary values in the dense input 'activation' matrix
                bool multiply = (ind == hty);
                res += h[htx + hty * Nhx] * multiply;
                row_start += multiply;
                ind = indices[row_start];
            }
        }
    }

    // place the weighted sum at the right index in the output structure
    output[i + j * Nx] = res;
}

6.2 Customized convolve function in gputools

from __future__ import print_function, unicode_literals, absolute_import, division

import logging
logger = logging.getLogger(__name__)

import os
import sys
import time

import numpy as np
import pyopencl as cl
# import pyviennacl as vcl
from scipy.sparse import csr_matrix

from gputools import OCLProgram, OCLArray, OCLImage, get_device
from gputools.core.ocltypes import assert_bufs_type
from gputools.utils.tile_iterator import tile_iterator

from ._abspath import abspath


def convolve(data, h, res_g=None, sub_blocks=None, sparse_input_repr=False, **kwargs):
    """
    Convolves 1d-3d data with kernel h.

    data and h can either be numpy arrays or gpu buffer objects
    (OCLArray, which must be float32 then).
    """
    if not len(data.shape) in [1, 2, 3]:
        raise ValueError("dim = %s not supported" % (len(data.shape)))

    if len(data.shape) != len(h.shape):
        raise ValueError("dimension of data (%s) and h (%s) are different"
                         % (len(data.shape), len(h.shape)))

    if isinstance(data, OCLArray) and isinstance(h, OCLArray):
        return _convolve_buf(data, h, res_g, sparse_input_repr, **kwargs)
    elif isinstance(data, np.ndarray) and isinstance(h, np.ndarray):
        if sub_blocks == (1,) * len(data.shape) or sub_blocks is None:
            return _convolve_np(data, h)
        else:
            # cut the image into tiles and operate on each of them
            N_sub = [int(np.ceil(1. * n / s)) for n, s in zip(data.shape, sub_blocks)]
            Npads = [int(s / 2) for s in h.shape]
            res = np.empty(data.shape, np.float32)
            for data_tile, data_s_src, data_s_dest \
                    in tile_iterator(data, blocksize=N_sub, padsize=Npads, mode="constant"):
                res_tile = _convolve_np(data_tile.copy(), h)
                res[data_s_src] = res_tile[data_s_dest]
            return res
    else:
        raise TypeError("unknown types (%s, %s)" % (type(data), type(h)))


def _convolve_np(data, h):
    """
    numpy variant
    """
    data_g = OCLArray.from_array(data.astype(np.float32, copy=False))
    h_g = OCLArray.from_array(h.astype(np.float32, copy=False))
    res_g, op_time = _convolve_buf(data_g, h_g)
    return res_g.get(), op_time


def _convolve_buf(data_g, h_g, res_g=None, sparse_input_repr=False, **kwargs):
    """
    buffer variant
    """
    nnz = None
    if sparse_input_repr is True:
        # build the CSR components from a dense host-side copy of the input
        num_rows = kwargs['dense_copy'].shape[0]
        csr = csr_matrix(kwargs['dense_copy'])
        num_rowptr = len(csr.indptr)
        if csr.nnz > 0:
            data_g = OCLArray.from_array(csr.data.astype(np.float32, copy=False))
            indices = OCLArray.from_array(csr.indices.astype(np.int32, copy=False))
            rowptr = OCLArray.from_array(csr.indptr.astype(np.int32, copy=False))
            nnz = csr.nnz

    assert_bufs_type(np.float32, data_g, h_g)

    prog = OCLProgram(abspath("kernels/convolve.cl"))

    if res_g is None:
        res_g = OCLArray.empty(data_g.shape, dtype=np.float32)

    Nhs = [np.int32(n) for n in h_g.shape]

    if sparse_input_repr is True and nnz is not None:
        kernel_name = "convolve_sparse_buf"
        try:
            ts = time.time()
            prog.run_kernel(kernel_name, data_g.shape[::-1], None,
                            h_g.data, res_g.data, *Nhs,
                            indices.data, rowptr.data,
                            np.int32(nnz), np.int32(num_rows), np.int32(num_rowptr))
            te = time.time()
            op_time = te - ts
            print(op_time)
        except cl.cffi_cl.LogicError as e:
            # this catches the LogicError if the kernel is too big for constant memory
            if e.code == -52:
                ts = time.time()
                kernel_name = "convolve%sd_buf_global" % (len(data_g.shape))
                prog.run_kernel(kernel_name, data_g.shape[::-1], None,
                                data_g.data, h_g.data, res_g.data, *Nhs)
                te = time.time()
                op_time = te - ts
            else:
                raise e
    else:
        kernel_name = "convolve%sd_buf" % (len(data_g.shape))
        try:
            ts = time.time()
            prog.run_kernel(kernel_name, data_g.shape[::-1], None,
                            data_g.data, h_g.data, res_g.data, *Nhs)
            te = time.time()
            op_time = te - ts
        except cl.cffi_cl.LogicError as e:
            # this catches the LogicError if the kernel is too big for constant memory
            if e.code == -52:
                kernel_name = "convolve%sd_buf_global" % (len(data_g.shape))
                ts = time.time()
                prog.run_kernel(kernel_name, data_g.shape[::-1], None,
                                data_g.data, h_g.data, res_g.data, *Nhs)
                te = time.time()
                op_time = te - ts
            else:
                raise e

    return res_g, op_time

6.3 Batch function for different degrees of sparsity

import collections

import numpy as np


def gen_rand_sparse_data(batch_size=1000, randint=2,
                         min_range=0.01, max_range=0.995,
                         sample_freq=50, size=(28, 28)):
    # ordered mapping from a degree of sparsity to a batch of random binary matrices
    sparse_data_dict = collections.OrderedDict()
    for sparsity in np.linspace(min_range, max_range, num=sample_freq):
        # every entry is zero with probability `sparsity` and one otherwise
        sparse_data_dict[sparsity] = np.array(
            [np.random.choice([0, 1], size=size, p=[sparsity, 1 - sparsity])
             for _ in range(batch_size)],
            dtype='float32')
    return sparse_data_dict

6.4 Script for the convolution for different filter sizes for different devices

from gputools import convolve, OCLArray
from helpers import gen_rand_sparse_data

import pyopencl as cl   # import the OpenCL GPU computing API
import numpy as np      # import numpy number tools
import pandas as pd
import collections
import gputools
import time
import os

os.environ["PYOPENCL_COMPILER_OUTPUT"] = str(1)

filter_dims = [3, 5, 7]


def GPU_convolution(r_data, dF, np_repr=True, **kwargs):
    # select the device (CPU, onboard GPU or off-chip GPU) to run on
    gputools.config.init_device(**kwargs)
    dev_name = gputools.config.ocl_globals.device.get_info('NAME')

    # dF x dF filter of ones
    h = np.ones((dF, dF), dtype=np.float32)
    h = OCLArray.from_array(h)

    time_ps = list()
    for sp_p, data in r_data.items():
        sp_p_r_list = list()
        for d in data:
            if not np_repr:
                c = d.copy()
                d = OCLArray.from_array(d)
            ts = time.time()
            res = convolve(d, h, sparse_input_repr=True, dense_copy=c)
            te = time.time()
            t = te - ts
            sp_p_r_list.append(t)
        # mean runtime for this degree of sparsity
        time_ps.append(np.mean(sp_p_r_list))

    frame = pd.DataFrame([list(r_data.keys()), time_ps]).T
    frame.columns = ['percentage', 'runtime']
    frame.index.name = (dev_name + ' @ %s') % dF
    # frame.to_excel('./' + frame.index.name + '.xlsx')


data = gen_rand_sparse_data(batch_size=10, sample_freq=3)

for filter_d in filter_dims:
    # CPU
    GPU_convolution(data, filter_d,
                    np_repr=False, id_device=0, id_platform=0, use_gpu=0)
    # onboard GPU
    GPU_convolution(data, filter_d,
                    np_repr=False, id_device=0, id_platform=0, use_gpu=1)
    # off-chip GPU
    GPU_convolution(data, filter_d,
                    np_repr=False, id_device=1, id_platform=0, use_gpu=1)

6.5 Python code for an ASN layer for different input currents

from __future__ import division
from functools import wraps
import collections
import time

import numpy as np


def timeit(f):
    # minimal timing decorator, assumed to be defined elsewhere in the original project
    @wraps(f)
    def wrapper(*args, **kwargs):
        ts = time.time()
        result = f(*args, **kwargs)
        print('%s took %.4fs' % (f.__name__, time.time() - ts))
        return result
    return wrapper


class ASN(object):

    def __init__(self, t=500, theta_base=0.1, input_curr=2, dtype=np.float32):
        # variable parameters
        self.theta_base = dtype(theta_base)
        self.mult_fact = dtype(theta_base ** 2)
        self.input_curr = dtype(input_curr)
        # exponent parameters
        self.theta = self.theta_base
        self.th0 = self.theta_base
        self.th1 = 0
        self.epsi = dtype(0)


class Signal(ASN):

    def __init__(self, t=500, input_curr=2, tstep=10, tend=400, dtype=np.float32):
        self.sim_t = np.arange(t)
        self.dtype = dtype
        self.input_curr = dtype(input_curr)
        self.tstep = tstep
        self.tend = tend
        self.epsi = 0
        self._set_filter_params()
        self._set_signal()
        self._set_approximation()

    def _set_filter_params(self):
        # fixed parameters
        self.theta_dec = self.dtype(15)
        self.eta_dec = self.dtype(50)
        self.tau_fil = self.dtype(2.5)
        self.a1 = np.exp(-1 / self.tau_fil)
        self.dth1 = np.exp(-1 / self.theta_dec)
        self.dEta = np.exp(-1 / self.eta_dec)
        self.theta_track = np.full(len(self.sim_t), self.dtype(0.1))

    def _set_signal(self):
        # step current: on from tstep until tend
        self.input_signal = np.zeros(len(self.sim_t))
        self.input_signal[self.tstep:self.tend] = self.input_curr

    def _set_approximation(self):
        self.y = np.zeros(len(self.input_signal) + 1)
        self.yhat = np.zeros(len(self.input_signal) + 1)
        self.spike_train = np.zeros(len(self.input_signal) + 1)
        self.spike_time = list()


@timeit
def simulate(asn, signal):
    for time_step in range(len(signal.sim_t)):
        # low-pass filter the input current
        asn.epsi = asn.epsi + (1 - signal.a1) * (signal.input_signal[time_step] - asn.epsi)
        signal.y[time_step] = asn.epsi
        # difference between the smoothed input and its spike-train approximation
        v = signal.y[time_step] - signal.yhat[time_step]
        if v > (asn.theta_base / 2):
            # threshold crossed: emit a spike and raise the adaptive threshold
            signal.spike_train[time_step] = 1
            signal.spike_time.append(time_step)
            signal.yhat[time_step] += asn.theta
            fmult = asn.theta
            asn.th1 += (fmult * asn.theta_base)
        # exponential decay of the approximation and of the threshold
        signal.yhat[time_step + 1] = signal.yhat[time_step] * signal.dEta
        asn.th1 *= signal.dth1
        asn.theta = asn.th1 + asn.theta_base
        signal.theta_track[time_step] = asn.theta
    return asn, signal


def _ASN_layer(input_range=[0.01, 0.1, 1, 10, 100]):
    for i, value in enumerate(input_range):
        asn = ASN(input_curr=value)
        signal = Signal(input_curr=value)
        signal._set_filter_params()
        signal._set_signal()
        signal._set_approximation()
        simulate(asn, signal)
        u = np.array(signal.y - signal.yhat)
        spikes = signal.spike_train
        spikes[spikes == 0] = np.nan


References

Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on cuda. Technical report, Nvidia Technical Report NVR-2008-004, Nvidia Corporation, 2008.

Ahmed H El Zein and Alistair P Rendell. Generating optimal cuda sparse matrix-vector product implementations for evolving gpu hardware. Concurrency and Computation: Practice and Experience, 24(1):3-13, 2012.

André Grüning and Sander M Bohte. Spiking neural networks: Principles and challenges. In ESANN, 2014.

Wolfgang Maass. Networks of spiking neurons: the third generation of neural network models. Neural networks, 10(9):1659–1671, 1997.

Jayram Moorkanikara Nageswaran, Nikil Dutt, Jeffrey L Krichmar, Alex Nicolau, and Alexander V Veidenbaum. A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors. Neural networks, 22(5):791–800, 2009a.

Jayram Moorkanikara Nageswaran, Nikil Dutt, Yingxue Wang, and Tobi Delbrueck. Computing spike-based convolutions on gpus. In Circuits and Systems, 2009. ISCAS 2009. IEEE International Symposium on, pages 1917–1920. IEEE, 2009b.

Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster cnns with direct sparse convolutions and guided pruning. 2016.

Filip Ponulak and Andrzej Kasinski. Introduction to spiking neural networks: Information processing, learning and applications. Acta neurobiologiae experimentalis, 71(4):409–433, 2010.

Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Miguel Hernández-Lobato, Gu-Yeon Wei, and David Brooks. Minerva: Enabling low-power, highly-accurate deep neural network accelerators. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 267-278. IEEE Press, 2016.

Shaohuai Shi and Xiaowen Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.

Leszek Ślażyński and Sander Bohte. Streaming parallel gpu acceleration of large-scale filter-based spiking neural networks. Network: Computation in Neural Systems, 23(4):183-211, 2012.

Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel Emer. Efficient processing of deep neural networks: A tutorial and survey. arXiv preprint arXiv:1703.09039, 2017.

Bram Veenboer. Gpu accelerated spiking neural networks for video classification. Master Thesis, Department of Computer Science, Vrije Universiteit Amsterdam/CWI, 2013.

Jilles Vreeken et al. Spiking neural networks, an introduction. Institute for Information and Computing Sciences, Utrecht University Technical Report UU-CS-2003-008, 2002.

Davide Zambrano and Sander M. Bohte. Fast and efficient asynchronous neural computation with adapting spiking neural networks. CoRR, abs/1609.02053, 2016. URL http://arxiv.org/abs/1609.02053.
