
COMPUTATIONAL AND REPRESENTATIONAL EFFICIENCY THROUGH ADAPTIVE BASIS FUNCTIONS IN CONVOLUTIONAL NEURAL NETWORKS

Leonardo Romor

An investigation on the application of basis expansion methods in Convolutional Neural Networks


Leonardo Romor: Computational and Representational Efficiency through Adaptive Basis Functions in Convolutional Neural Networks, An investigation on the application of basis expansion methods in Convolutional Neural Networks, © August 2020.

Supervisors: Erik Bekkers
Location: Amsterdam
Time frame: August 2020


CONTENTS

I  INTRODUCTION, BACKGROUND AND RELATED WORK
1  Introduction
   1.1  Contributions
2  Background and Related Work
   2.1  Atrous convolutions
   2.2  Deformable convolutions
   2.3  Basis expansion methods
   2.4  B-Spline CNNs on Lie Groups
   2.5  Relationship between self-attention and convolutional layers
   2.6  PDE-based Group Equivariant CNNs
   2.7  Semantic Image Segmentation

II  METHODOLOGY, EXPERIMENTS AND CONCLUSION
3  Methodology
   3.1  Functional representation
   3.2  Group Convolutions
   3.3  Function Approximation
        3.3.1  Cardinal B-splines and Proposed Parametrization
   3.4  Cardinal B-Splines Convolution (CBSConv)
   3.5  Implementation
4  Experiments
   4.1  Experimental Design
   4.2  Classification
        4.2.1  Wide Residual Networks (WideResnet)
        4.2.2  Training
   4.3  Segmentation
        4.3.1  DeepLabV3+
        4.3.2  Training
5  Results
   5.1  Classification
        5.1.1  Analysis
   5.2  Segmentation
        5.2.1  Analysis
   5.3  Discussion
6  Conclusion and Future Work
   6.1  Future Work

Bibliography

III  APPENDIX
A  Groups
B  CBSConv Time Complexity
C  Univariate B-splines
D  Input-expansion
E  Training Curves
   E.1  Classification Experiments
        E.1.1  Loss curves
        E.1.2  Accuracy curves
   E.2  Segmentation Experiments
        E.2.1  Loss curves
        E.2.2  mIoU curves
        E.2.3  Pixel accuracy curves


ABSTRACT

Image pattern recognition relies on Deep Convolutional Neural Networks (DCNNs) to achieve state-of-the-art accuracies. Classical CNNs make use of discrete, learnable filters to perform convolutions, often struggling to preserve generic spatial transformations of the input across the convolutional layers. To lift this limitation, recent models infuse equivariance properties into their architectures. Another limitation of classical convolutions is the elevated computational cost of large, d-dimensional convolutional filters. Large kernels are often needed, for instance, in segmentation tasks, where they are replaced with sparser convolutions, called atrous or dilated convolutions. A growing number of recent architectures have shifted the classical convolution paradigm by representing the convolutional filters as continuous functions, using basis expansion methods and, in general, powerful tools from calculus, to achieve both higher degrees of sparsity and transformation equivariance. Following these steps, we study the consequences of choosing B-splines, as basis functions with learnable parameters, to construct convolutional filters. The potential of this approach was studied by performing B-spline convolutions for both classification and segmentation tasks, using different initializations and training procedures. The results showed that B-spline convolutions can generalize previous models by reducing the number of parameters required while preserving the model performance, making them a valuable choice also for sparsifying classical convolutions.


Part I

INTRODUCTION, BACKGROUND AND RELATED WORK


1 INTRODUCTION

Current image recognition and segmentation rely on different types of Convolutional Neural Networks (CNNs) to achieve state-of-the-art performance[30][5][20][25][12].

Classical CNNs perform discrete convolutions between input features and a set of learnable filters to generate intermediate, translation-equivariant, abstract representations which locally describe the input image.

Together with convolutions, CNNs are often equipped with pooling layers. Pooling layers aim to downsample input representations while introducing scale-equivariance and translation-invariance properties.

To capture long-distance spatial dependencies, large fields of view (FOVs) are required. Current models effectively increase FOV sizes by chaining multiple convolutional and pooling layers but, often, this results in reduced spatial definition and increased computational cost. These shortcomings sparked research into sparser convolutional kernels [16][15], both to better tackle the exponential growth of the computational complexity in the case of generic d-dimensional kernels and to find ways to preserve generic spatial transformations of the input across the convolutional layers.

Chen et al. [3] suggested the use of dilated (atrous) convolutions, which increase FOVs by spreading a small set of weights over a large, downsampled kernel support. This method effectively combines convolution and pooling techniques, reducing the computational cost with respect to a comparable dense convolution while mitigating the loss of local information.

A denser but less flexible alternative to atrous convolutions is found in separable convolutions. In Peng et al. [22], Global Convolution Networks (GCNs) are introduced to solve, at the same time, the classification and localization tasks required in the context of semantic segmentation. GCNs perform global convolutions to generate high-resolution feature maps which both encode large FOVs and better retain high-frequency information. Global convolutions require large filters. GCNs make use of large separable filters, greatly reducing their computational cost while improving final accuracies.

In Chen et al. [5], Atrous Spatial Pyramid Pooling (ASPP) is used for context embedding. To perform per-pixel classification, Deep Convolutional Neural Networks (DCNNs) are often employed as encoders. Due to the operations of max-pooling and downsampling, DCNNs achieve invariance at the cost of degrading the amount of encoded detailed information[4]. ASPP alleviates the loss of detailed information related to object boundaries by performing several parallel atrous convolutions at different rates, while employing depth-wise separable convolutions for lower-cost, denser representations.



A generalization of dilated convolutions are deformable convolutions (Dai et al. [11]). Deformable convolutions allow the field of view of the channels to be increased, as well as the density and position of the filter parameters to be specialized during the learning process, dependently on the input. This is done by lifting the restriction that the parameters lie on a regular grid. When used in conjunction with previous state-of-the-art architectures, deformable convolutions were found to improve accuracy, advancing over previous state-of-the-art models[11]. The application of deformable/atrous convolutions is found especially useful in segmentation tasks. Deformable convolutions trade off dense representations against computational complexity. In order to lift the restriction that the parameters lie on a regular grid, deformable convolutions need to define offsets at fractional positions using a continuous representation. The toll of this choice is the necessity to use bilinear interpolation to sample the kernel values on a grid, biasing their estimation. This work aims to generalize the deformable convolution approach with a continuous and more flexible formulation which directly embeds a learnable interpolation in the model, but learns the parameter positions independently of the input.

Examples of continuous approaches to convolutions and their benefits can be found in Fey et al. [14], Weiler et al. [29] and Sosnovik, Szmaja, and Smeulders [25]. In Fey et al. [14], B-splines are utilized to construct a continuous formulation, resulting in a new type of graph CNN capable of dealing with non-Euclidean data (graphs/discrete manifolds). B-splines were used to interpolate kernel functions with a fixed amount of kernel parameters. In Weiler et al. [29], an analytical representation of the input and the filters was essential to perform the continuous mappings required by the method. In this case, spherical harmonics modulated by an arbitrary continuous radial function were used to accommodate the spherical symmetry of the formulation[29]. Similarly, in Sosnovik, Szmaja, and Smeulders [25], filters are represented as Hermite polynomials with a 2D Gaussian envelope to enable their sampling at different scales, demonstrating good results[25]. In Cordonnier, Loukas, and Jaggi [10], the self-attention mechanism found in Natural Language Processing was adapted to entail sparse, adaptive representations of kernel functions. In their setup, a set of parametrized radial basis functions adapts during training, resulting in attention patterns similar to biological vision systems and drawing a connection between self-attention and deformable convolutions.

Continuous representations lift limitations of classical convolutions, such as learning the parameters on a fixed, d-dimensional grid, and are essential to enable convolutions on generic smooth manifolds, enabling the potential usage of the powerful tools of differential geometry and group theory. Group/steerable CNNs[2][29][25], B-spline convolutions[14] and deformable convolutions[11] are all examples of these achievements.

To tackle data complexity and the robustness of deep learning models, equivariance and invariance properties are often sought after. Classical convolutions are already



natively equipped with translational equivariance; nonetheless, inputs often carry, a priori, other types of global and local symmetries. Common examples are rotated pictures or symmetries carried by complex geometries, like protein structures. To accommodate this need for embedding equivariance properties in neural networks, a new class of techniques was developed, called group-equivariant CNNs or G-CNNs[8][7][29][9].

Bekkers [2] generalized the concept of atrous and deformable convolutions using B-spline representations of convolutional kernels while enabling equivariant (w.r.t. generic Lie groups) convolutions, unifying previous approaches and laying the foundation of the present investigation.

1.1 Contributions

This investigation is built on top of the results achieved in Bekkers [2]. The novelty of this work lies in examining a new algorithmic implementation of the B-spline representation method by exploiting specific properties of convolutions, as well as in studying the efficiency of the model under different types of parametrizations and initializations.

It will be shown how discrete filters can be represented as sampled continuous functions. Using notions from approximation theory, we will discuss the implications of a function basis expansion using B-splines, and how this affects the discrete convolution operation, leading to a potentially more efficient implementation. Finally, the model will be tested on classification and segmentation tasks with increased kernel sizes. More specifically, our interest focuses on what types of model parameter initializations and training precautions might be required. The questions this investigation tries to answer are:

1. Provided a model implementation of a cardinal B-splines convolution layer (CBSConv), how does the new model compare with the baselines found in the literature?

2. Given the increased flexibility, what kinds of precautions are necessary for stable and efficient learning?

3. How do different initializations change the model performance, and why?

4. Does reducing the number of parameters lead to similar or perhaps higher accuracy?

By answering these questions, we aim to provide insights into the training, initialization and evolution of B-spline continuous representations, as well as to discover more properties and methods which might improve their efficacy as a generic framework.


2 BACKGROUND AND RELATED WORK

In this chapter, the current state of the art and recent investigations are presented to give an overview of similar or related research.

2.1 Atrous convolutions

Figure 2.1: (Atrous or dilated convolution) Graphical representation of atrous convolutions. Each output value is computed from a weighted sum of an à-trous (with holes) patch of the input.

In Chen et al. [3], atrous convolutions (Figure 2.1) replace the subsampling operations to compute denser scores. In Chen et al. [5], atrous convolutions were employed to support Spatial Pyramid Pooling[18], which enables the model to learn atrous filters both at different scales and in parallel. Drawbacks of this method are the need to sample at different scales and the impossibility of adapting the kernels' parameters around specific regions of interest.

2.2 Deformable convolutions

Atrous convolutions can be seen as a special case of deformable convolutions. Deformable convolutions were first introduced in Dai et al. [11]. This type of convolution lifts the constraint of learning a small set of parameters on a fixed upsampled grid by allowing them to adapt their positions using fractional offsets (Figure 2.2). A different set of offsets is generated by convolving each input map. An important consequence of this choice is that the offsets depend on the individual input maps, which is a strong difference with respect to our work, where the corresponding offsets are directly parametrized instead of being generated. Together with ROI pooling, this



Figure 2.2: Example of deformable convolution filters. A set of parametrized points in a domain is used to sample a bilinearly interpolated filter. On the left, the green dots represent the initial sampling points of a classical filter. In the other pictures, the blue dots represent the sampling points of a deformable convolution, with the arrows representing the offsets from their original positions in the grid.

led to more powerful and localized convolutions, while keeping the amount of parameters to be learned the same as for atrous convolutions.

A drawback of deformable convolutions is the need to rely on bilinear interpolation, since the parameter coordinate values need to be in R for the gradient to be computable. The method shown in this investigation will be proven to generalize deformable convolution kernels, where bilinear interpolation can be thought of as a special case of a B-spline interpolation of an input and a kernel function over a pixel basis.

2.3 Basis expansion methods

Fey et al. [14] introduced B-splines as an interpolation method for non-Euclidean data (graphs/discrete manifolds). B-splines were used as an interpolation method for the kernel functions due to their local support property. The approach consisted of mapping a normalized adjacency matrix into a Euclidean space and then performing efficient convolutions by means of B-splines as a kernel basis. For 3D classification tasks (on protein structures), Weiler et al. [29] used a special type of basis functions: spherical harmonics. These allow for steerable implementations of group convolutions, by which intermediate sampling on the grid can be avoided. The choice was made in the context of steerable CNNs, where spherical harmonic rotations, due to their symmetries, can be easily sampled. A downside is that this choice of basis functions does not easily transfer to different types of groups or data structures; moreover, sticking to steerable functions limits the types of non-linearities that can be used. In Weiler and Cesa [28], an extensive evaluation of 2D Euclidean group steerable neural networks was performed, in which regular group convolutional neural networks (so not necessarily steerable) worked best. In regular group convolutions it is sufficient to have a continuous description of the kernel. From here we can conclude



that B-splines are a very suitable choice: thanks to their compact, local support they can be easily parametrized and made adaptive. In a way, the most optimal basis function can then be learned as part of the optimization procedure. In this thesis we explore the potential of adaptive B-splines.

2.4 B-Spline CNNs on Lie Groups

In Bekkers [2], convolutions are modelled as kernel operators where the kernel function is treated analytically as an element of a function space spanned by B-spline basis functions. Having analytical smooth functions enables the tractable continuous mappings over complex manifolds required to embed equivariance properties in the model. The study then concentrates on the development of the theoretical framework and the formalization of B-splines on Lie groups. Previous implementation efforts were not fully exploiting useful properties of these methods, resulting in computationally demanding models and sparking the interest in developing a sparse, separable convolution.

2.5 Relationship between self-attention and convolutional layers

Cordonnier, Loukas, and Jaggi [10] extend the self-attention mechanism of natural language processing with a set of parametrized Gaussians. The parametrization was made to accommodate the Natural Language Processing (NLP) formalism and establish a parallel. Finally, a strong connection between self-attention and convolution is proven, concluding that a self-attention layer generalizes convolutions in a similar way to deformable convolutions[10]. This is done through convolution kernels constructed as sums of shifted Gaussians, which is a result of using an adaptive quadratic encoding of positions in multiple attention heads.

2.6 PDE-based Group Equivariant CNNs

In Partial Differential Equation (PDE) based CNN layers[24], a similar mechanism takes place, with convolutional kernels built from a superposition of localized but shifted basis functions. A main component of their work is the linear combination of convection-diffusion PDEs to parametrize the layers. In their study, they show how these layers are equivalent to convolutions with kernels that are a weighted sum of the corresponding Green's functions, which are shifted (due to convection) Gaussians with some scale (due to diffusion).



2.7 Semantic Image Segmentation

For semantic image segmentation, large kernels become especially useful due to the need to classify an input image per-pixel. Each output pixel needs a large FOV to take into account both local and global features. The per-pixel classification is usually done by upsampling the feature maps using deconvolutions, unpooling, or atrous convolutions. With ablation experiments, Peng et al. [22] argue that the problem with standard cone-shaped architectures is the tendency to have densely connected classifiers at the expense of losing spatial context. For this reason, segmentation tasks are considered a good fit to test our model with large kernels. This principle was further extended with the introduction of Atrous Spatial Pyramid Pooling which, thanks to atrous convolutions at different rates, can efficiently compute high-resolution feature maps at different input scales.


Part II

METHODOLOGY, EXPERIMENTS AND CONCLUSION

B-splines were chosen because of their great flexibility and useful properties. In the next part we study the implications of these properties when dealing with standard convolutional layers.


3 METHODOLOGY

In this chapter we introduce the theoretical framework, together with the novelties necessary to answer the research questions, and justify the experimental setup.

Convolutional layers generate feature maps which capture abstract local correlations of an input representation. In a convolutional layer, kernels, also called filters, are the subjects of the learning process and are generally treated with discrete parametrizations. In this work, similarly to Bekkers [2], kernels are not treated as a discrete set of parameters but as analytical multivariate functions to be sampled. Furthermore, the linearity properties of convolutions will be studied, as they form the basis for the formulation of a separable B-splines convolution. In a second stage, a comparison will be made to link the current method with deformable and atrous convolutions simply by changing the kernel initialization, showing how the current method constitutes a powerful generalization.

3.1 Functional representation

The current work requires the representation of input features and kernel functions as elements of functional spaces. To formalize the model under study, we start by defining the mathematical framework necessary to handle such entities.

The input images and the kernel functions can be seen as continuous functions sampled on a discrete, pixel-based domain.

Classical Artificial Neural Network (ANN) layers can be seen as learnable vector-valued functions mapping an input $x \in \mathcal{X} = \mathbb{R}^{N_x}$ to the corresponding output vector $y \in \mathcal{Y} = \mathbb{R}^{N_y}$ with the following general relation:

$$y = \phi(\mathcal{K}_w x + b), \qquad (3.1)$$

with $\mathcal{K}_w \in \mathbb{R}^{N_y \times N_x} : \mathcal{X} \to \mathcal{Y}$ a linear map parametrized by $w$, $b$ the bias term, and $\phi$ usually a non-linear mapping (e.g. ReLU). In classical ANNs, the linear mapping is defined by a matrix, usually dense for fully connected layers. During training, the model learns, using gradient descent techniques, the fittest set of matrix coefficients mapping the input to a better representation that minimizes a loss function. When dealing with structured inputs, for instance images or time series, a sparser representation that efficiently retains the ordered local structure of the data is needed. In the last decade, CNNs fulfilled this requirement



by employing discrete convolutions between a discrete input and a set of discrete learnable filters:

$$y = \phi(x \ast \kappa_w + b). \qquad (3.2)$$

Figure 3.1: Example of a discrete 2D kernel $\kappa_w$.

Figure 3.2: Example of a 2D discrete convolution with the kernel shown in Figure 3.1. In blue, the discrete input. In green, the final output.

The classical d-dimensional discrete convolution is defined as:

$$(x \ast \kappa)_j = \sum_{i=-\infty}^{\infty} x_i\, \kappa_{j-i}, \qquad (3.3)$$

where $i$ and $j$ are indices denoting the discrete values of the input and the output vector respectively. A graphical representation of a 2D convolution with the filter shown in Figure 3.1 is shown in Figure 3.2.
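To make the operation in Equation 3.3 and Figure 3.2 concrete, here is a small NumPy sketch (our own illustration, with an arbitrary toy input) of a "valid" 2D discrete convolution, implemented as the cross-correlation that deep learning libraries actually compute:

```python
import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """'Valid' 2D cross-correlation: out[r, c] = sum over the kernel window."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

x = np.random.randint(0, 4, size=(6, 6)).astype(float)   # toy discrete input
k = np.array([[0., 1., 2.], [2., 2., 0.], [0., 1., 2.]])  # a 3x3 kernel as in Figure 3.1
print(conv2d_valid(x, k).shape)  # (4, 4)
```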



A generalized representation of the discrete case models the discrete signals as samples of functions on a domain. In this setup, the input vector and the feature maps are represented as samples on a domain $\mathcal{X}$ of elements of the space of square-integrable functions, $f \in L_2 : \mathcal{X} \to \mathbb{R}^N$, or more simply $f \in (L_2(\mathcal{X}))^N$. The space of square-integrable functions $L_2$ (a Hilbert space) is a vector space equipped with a norm and an inner product. For this class of functions, convolution is always defined. Since these new objects are now functions between spaces, as opposed to just vectors of values, the linear mapping between the input and the output is converted into a linear operator, which maps functions to functions: $\mathcal{K} : (L_2(\mathcal{X}))^{N_x} \to (L_2(\mathcal{Y}))^{N_y}$.

By expressing the kernel as an element of a function space, under a set of assumptions we are now allowed to use powerful tools from functional analysis, such as defining smooth transformations, which are otherwise unfeasible in a discrete setting. In Bekkers [2], kernel functions were approximated using basis-expansion methods, enabling smooth group transformations on Lie groups. An important result of this type of formulation, which follows from the Dunford-Pettis theorem, see e.g. Bekkers [2, Theorem 1], is that in the setting of square-integrable functions any linear operator $\mathcal{K}$ is a kernel operator of the form:

$$(\mathcal{K}f)(y) = \int_{\mathcal{X}} \kappa(x, y)\, f(x)\, dx. \qquad (3.4)$$

Taken together with the other results of Bekkers [2, Theorem 1], this constitutes the building blocks necessary to perform group convolutions. In the next section, we will introduce the advantages of group convolutions and show how classical convolutions are a special case of group convolutions with translation-equivariant properties, reducing their formulation to the classical one.

3.2 Group Convolutions

Sometimes, it is possible to establish a priori that a dataset might be biased towards certain transformations. This type of inductive bias can be imposed on a model, increasing its predictive power. As an example, in a classification task we often require a model to capture the semantics of an object representation, regardless of a priori known transformations that it might carry. If we try to classify, for instance, an image containing a generic object, we would like the model to always output the same result, regardless of the object's position in the image. We call this property invariance. After applying other, unconstrained linear transformations for feature extraction, the output feature map often loses the input structural information, making it impossible to disentangle the object from its initial transformation and making any invariance property unfeasible. To solve this problem, equivariant kernel operators were introduced.

Before proceeding, we restrict our formulation to a special type of transformation biases, called Lie groups. When continuous, smooth transformations on a set have the structure of a group, they are called Lie groups (Appendix A). Bekkers [2, Theorem 1] provides a method to specialize a linear kernel operator $\mathcal{K}$ (with kernel $\kappa(x, y)$) to be equivariant to transformations of Lie groups. Examples of these transformations are translations, scalings or rotations.

Standard convolutions can be seen as a special case of the generic kernel operator $\mathcal{K}$ of Equation 3.4, but equivariant to translations. Before proving this statement, it is necessary to define how groups interact with functions. To define how group elements $g \in G = (\mathcal{X}, \cdot)$ act on a function $f \in L_2(\mathcal{X})$, group representations are necessary. A group representation maps a group element $g \in G$ to an operator $\mathcal{L}_g^{G \to L_2(\mathcal{X})}$. If we consider the group of translations $G = (\mathbb{R}^d, +)$ and $f : \mathbb{R}^d \to \mathbb{R}$, with translations $g, g' = x, x' \in G$, the left-regular representation of translations is defined as:

$$\left(\mathcal{L}_g^{G \to L_2(\mathcal{X})} f\right)(g') = f(g^{-1} \cdot g') = f(x' - x), \qquad (3.5)$$

where $g' \cdot g = x' + x$ denotes the group product (Appendix A). We can now formally introduce the concept of equivariance, in the context of group transformations, as the property that makes the composition of a group transformation and a generic transformation commutative:

Definition (Equivariance). Let $\mathcal{K} : L_2(\mathcal{X}) \to L_2(\mathcal{Y})$ be given by Equation 3.4 and let $G$ be a group with representations $\mathcal{L}_g^{G \to L_2(\mathcal{X})}$ and $\mathcal{L}_g^{G \to L_2(\mathcal{Y})}$. $\mathcal{K}$ is equivariant to $G$ if:

$$\mathcal{L}_g^{G \to L_2(\mathcal{Y})} \circ \mathcal{K} = \mathcal{K} \circ \mathcal{L}_g^{G \to L_2(\mathcal{X})}, \quad \forall g \in G. \qquad (3.6)$$

To constrain group equivariance on the kernel function $\kappa(x, x')$, we use the second point of Bekkers [2, Theorem 1]:

$$\kappa(x, x') = \frac{1}{|\det g'|}\,\kappa(g'^{-1} \cdot g) = \kappa(x - x'), \qquad (3.7)$$

and we can now write the new linear kernel operator, which becomes the continuous and more general version of Equation 3.3:

$$(\mathcal{K}f)(x') = \int_{\mathbb{R}^d} \kappa(x - x')\, f(x)\, dx. \qquad (3.8)$$

The resulting kernel operator is then translation-equivariant and we recover the classical cross-correlation formulation. In the real field, the cross-correlation is equivalent to a convolution with the reflected kernel $\kappa(-x)$. To show that this type of kernel is indeed equivariant, we test the formulation by translating the input function by an arbitrary element of the translation group, $g'' = c \in G = (\mathbb{R}^d, +)$, i.e. $f(g''^{-1} \cdot g') = f(x' - c)$:

$$(\mathcal{K} \circ \mathcal{L}_c^{(\mathbb{R}^d,+) \to L_2(\mathbb{R}^d)} f)(x') = \int_{\mathbb{R}^d} \kappa(x - x')\, f(x - c)\, dx. \qquad (3.9)$$

By means of a simple change of variables $x'' = x - c$, we obtain:

$$\int_{\mathbb{R}^d} \kappa(x'' - (x' - c))\, f(x'')\, dx'' = (\mathcal{K}f)(x' - c) = (\mathcal{L}_c^{(\mathbb{R}^d,+) \to L_2(\mathbb{R}^d)} \circ \mathcal{K} f)(x'), \qquad (3.10)$$

showing that any translation of the input function results in the same translation of the output feature map.

Now that we have shown that classical convolutions retain the translation information of the input, one last step is still required to enable invariance in a model. Since we are often no longer interested in the specific position of an object, we can now discard its translation information. This is easily done in CNNs by means of pooling operators, which project the feature map to a lower-dimensional and transformation-invariant space, e.g. by $\max_x (f \ast \kappa)(x)$.
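The following small PyTorch check (an illustration of Equations 3.6 and 3.10 under our own assumptions: circular padding and circular shifts to avoid boundary effects) verifies numerically that convolution commutes with translations, and that a global max pooling of the result is translation invariant:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)   # input feature map f
k = torch.randn(1, 1, 3, 3)     # kernel kappa
shift = (3, 5)                  # translation c

def conv(t):
    # circular padding keeps the translation exact on a periodic domain
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), k)

shifted_then_conv = conv(torch.roll(x, shifts=shift, dims=(2, 3)))
conv_then_shifted = torch.roll(conv(x), shifts=shift, dims=(2, 3))

print(torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-5))  # True: equivariance
print(torch.allclose(conv(x).amax(dim=(2, 3)),
                     shifted_then_conv.amax(dim=(2, 3)), atol=1e-5))    # True: invariance after max pooling
```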

3.3 Function Approximation

Previously, we changed the input and weights from standalone discrete values to samples of continuous functions defined in the functional space $L_2$. In this section, we discuss the benefits of this choice and how it affects the training procedure, regardless of the group convolution structure that can be built on top. An important property of $L_2$ spaces is that all their elements can be represented as a linear combination of a chosen basis $\{\phi_i\} \subset L_2(\mathcal{X})$ and a set of coefficients $c_i$ over the chosen field:

$$f(x) = \sum_i c_i\, \phi_i(x). \qquad (3.11)$$

In our setting, the basis of choice is the same as in Bekkers [2, Section 3.2]: Cardinal B-splines (also called uniform B-splines), a special case of B-splines (Appendix C). For finite domains these functions form a basis for square-integrable functions, and they will be the center of the current investigation.

3.3.1 Cardinal B-splines and Proposed Parametrization

Cardinal B-splines are a special type of piecewise polynomial with compact support and a fixed distance between breakpoints.

Definition (Cardinal B-spline on $\mathbb{R}^d$). A univariate Cardinal B-spline of order $k$ is defined as the function $B^k : \mathbb{R} \to \mathbb{R}$:

$$B^k(x) := \left(\mathbb{1}_{[-\frac{1}{2},\frac{1}{2}]} \ast^{(k)} \mathbb{1}_{[-\frac{1}{2},\frac{1}{2}]}\right)(x), \qquad (3.12)$$

where $\ast^{(k)}$ denotes the $k$-fold convolution of the indicator function $\mathbb{1}_{[-\frac{1}{2},\frac{1}{2}]}$. To extend the basis to the multivariate case, we can make use of the tensor product:

$$B^{\mathbb{R}^d,k}(x) := \left(B^k_1 \otimes \cdots \otimes B^k_d\right)(x_1, \ldots, x_d). \qquad (3.13)$$

Figure 3.3: Example of univariate cardinal B-splines. In the top-left corner, the partition of unity property constrains the linear combination of the basis functions to sum to one in the interpolation region. In the bottom-left, an arbitrary function is reconstructed as a random linear combination of scaled B-splines. On the right, the basis functions of order k = 0, 1, 2 are sampled.

From the basis expansion formulation (3.11), every $f \in \operatorname{span}\{B^{\mathbb{R}^d,k}_i\}_{i=1}^{N} \subset L_2(\mathbb{R}^d)$ can then be expressed as a finite linear combination of $d$-dimensional shifted and scaled basis functions of order $k$:

$$\mathcal{F}(x) := \sum_{i=1}^{N} c_i\, B^{\mathbb{R}^d,k}(x;\, \hat{x}_i, s_i) = \sum_{i=1}^{N} c_i\, B^{\mathbb{R}^d,k}\!\left(\frac{x - \hat{x}_i}{s_i}\right), \qquad (3.14)$$

where the parameters $\hat{x}_i$ and $s_i$ respectively determine the shift and the scaling of each of the $N$ basis functions. To simplify the model, avoid possible degenerate cases during learning[10], and maintain the (almost) isotropy of multidimensional B-splines, the scaling parametrization was kept as a single scalar per basis function.
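The following NumPy sketch (our own construction, not the thesis code; the names bspline and kernel_fn are ours) illustrates Equations 3.12-3.14 in one dimension: a centered cardinal B-spline built with the standard uniform-B-spline recursion, and a kernel function expressed as a linear combination of shifted and scaled basis functions:

```python
import numpy as np

def bspline(k: int, x: np.ndarray) -> np.ndarray:
    """Cardinal B-spline of degree k, centered at 0, support [-(k+1)/2, (k+1)/2]."""
    def N(m, t):  # uniform B-spline of order m (degree m-1), support [0, m]
        if m == 1:
            return ((t >= 0) & (t < 1)).astype(float)
        return (t / (m - 1)) * N(m - 1, t) + ((m - t) / (m - 1)) * N(m - 1, t - 1)
    return N(k + 1, x + (k + 1) / 2.0)

def kernel_fn(x, centers, scales, coeffs, k=2):
    """Continuous 1D kernel F(x) = sum_i c_i * B_k((x - xhat_i) / s_i), as in Eq. 3.14."""
    return sum(c * bspline(k, (x - xc) / s) for c, xc, s in zip(coeffs, centers, scales))

x = np.linspace(-2, 2, 401)
centers = np.array([-1.0, 0.0, 1.0])   # learnable shifts  xhat_i
scales  = np.array([ 0.8, 1.0, 0.8])   # learnable scalings s_i
coeffs  = np.array([ 0.5, -1.0, 0.7])  # learnable weights  c_i
print(kernel_fn(x, centers, scales, coeffs).shape)  # (401,)
```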



The choice of Cardinal B-splines was justified by their local support[23], meaning that the basis functions evaluate to zero for all inputs outside a known interval, which is an advantageous property for efficient computation, scalability[14] and for locally mapping manifolds[2].

By expressing the kernel as an analytical linear combination of functions, it is now possible to induce a smooth parametrization of the basis. This enables learning not only of the coefficients ($c_i$) but also of the basis function center positions ($\hat{x}_i$) and scalings ($s_i$). This can be seen as a more general formulation of deformable convolutions, where the scalings are fixed to a constant value, the filter weights are the $c_i$, and the encoded learnable offsets map to the center positions $\hat{x}_i$. In the next section, we will proceed by substituting the kernel function of a translation-equivariant convolution with a B-spline basis expansion and study its properties.

3.4 Cardinal B-Splines Convolution (CBSConv)

By substituting in Equation 3.8 the kernel function with a d-dimensional B-spline expansion (3.14), the formulation becomes:

$$\begin{aligned}
(\mathcal{K}f)(y) &:= \int_{-\infty}^{\infty} f(x)\,\kappa(x-y)\,dx = \int_{-\infty}^{\infty} f(x) \sum_{i=1}^{N} c_i\, B^{\mathbb{R}^d,k}(x-y;\, \hat{x}_i, s_i)\,dx \\
&= \sum_{i=1}^{N} c_i \int_{-\infty}^{\infty} f(x)\, B^{\mathbb{R}^d,k}(x-y;\, \hat{x}_i, s_i)\,dx \\
&= \sum_{i=1}^{N} c_i \int_{-\infty}^{\infty} B^{k}(x_1-y_1;\, \hat{x}_{1i}, s_i) \cdots \int_{-\infty}^{\infty} B^{k}(x_d-y_d;\, \hat{x}_{di}, s_i)\, f(x_1,\ldots,x_d)\,dx \\
&= \sum_{i=1}^{N} c_i \left( B^{k}_{\hat{x}_{1i}, s_i} \ast \cdots \ast B^{k}_{\hat{x}_{di}, s_i} \ast f \right)(y). \qquad (3.15)
\end{aligned}$$

Thanks to the properties of linearity, this formulation has the benefit of expressing the original convolution as a sum of multiple (smaller) convolutions. Another benefit arises from the properties of the tensor product, making each of these convolutions separable. Following from (3.10), another interesting formulation arises when $s_i = s$, by applying the equivariance properties previously shown:

$$(\mathcal{K}f)(y) = \sum_{i=1}^{N} c_i \int_{-\infty}^{\infty} f(x)\, B^{\mathbb{R}^d,k}(x-y;\, \hat{x}_i, s)\,dx = \sum_{i=1}^{N} c_i \left( f \ast B^{\mathbb{R}^d,k}_{\hat{x}_i, s} \right)(y) = \sum_{i=1}^{N} c_i \left( f \ast B^{\mathbb{R}^d,k}_{0, s} \right)(y - \hat{x}_i). \qquad (3.16)$$

In this setting, the convolution is expressed as a linear combination of a convolution with a single basis function, shifted multiple times.

To express the convolution above in a discrete setting, we need to define a discrete domain on which our kernel and input feature functions will be sampled. From this point on, we will always consider as domain a centered grid whose vertices are the sampling points. Starting from Equation 3.15, by making the parametrization explicit and by expressing the integrals as finite sums, we get:

$$(\mathcal{K}f)(y_j) = \sum_{i=1}^{N} c_i \sum_{x_1 \in L_1} B^{k}\!\left(\frac{x_1 + \hat{x}_{1i} - y_1}{s_i}\right) \cdots \sum_{x_d \in L_d} B^{k}\!\left(\frac{x_d + \hat{x}_{di} - y_d}{s_i}\right) f(x_1, \ldots, x_d) = \sum_{i=1}^{N} c_i \left( B^{\mathbb{R}^d,k}(\,\cdot\,;\, \hat{x}_i, s_i) \ast f \right)(y_j). \qquad (3.17)$$

In summary, given Equation 3.15, Equation 3.16 and Equation 3.17, it is now possible to observe that:

1. D-dimensional cardinal B-splines result in a separable convolution due to the properties of tensor products (3.15).

2. Due to the linearity of the kernel operator, the resulting convolution consists of a linear combination of N smaller separable convolutions (3.15).

3. Due to the properties of linearity and translational equivariance, each of the basis convolutions can be computed as a single convolution with a centered basis, shifted by an amount equal to the basis center coordinates (3.16).

4. In a discrete setting, if the centers are part of a square grid, only a single convolution is required to compute the final feature map (3.17).



To appreciate how this representation generalizes previous models found in the literature, we will show how these models can be easily recovered by enforcing different types of center initializations. Given a set of centers arranged in a grid, it is possible to make the model behave as a classical convolution, while still maintaining the same number of parameters. If the number of parameters is reduced while still forming a grid, atrous convolutions are recovered instead. If we instead allow the model to learn the centers' positions, deformable convolutions with directly learnable offsets can be achieved, enabling sparser convolutions.

3.5 Implementation

Implementation-wise, a number of precautions were necessary to implement effective cardinal B-splines convolutions. To enable the adaptability of the bases during training, it is necessary to propagate the convolution gradient to the basis parametrization. The gradient of the chosen parametrization for a single basis is trivially computed:

$$\nabla_{s, \hat{x}}\, B^{\mathbb{R}^d,k}\!\left(\frac{x - \hat{x}}{s}\right) = \left\{ -B'^{\,\mathbb{R}^d,k}\!\left(\frac{x - \hat{x}}{s}\right)\frac{x - \hat{x}}{s^2},\; -B'^{\,\mathbb{R}^d,k}\!\left(\frac{x - \hat{x}}{s}\right)\frac{1}{s} \right\}, \qquad (3.18)$$

where $B'^{\,\mathbb{R}^d,k}$ denotes the first derivative of a $k$-order, $d$-dimensional B-spline ($B^{\mathbb{R}^d,k}$).
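As a quick sanity check of Equation 3.18 (our own example, assuming a univariate quadratic B-spline and scalar inputs), automatic differentiation recovers the same gradients as the analytic expressions:

```python
import torch

def B2(t):  # centered quadratic cardinal B-spline (k = 2), support [-1.5, 1.5]
    t = t.abs()
    return torch.where(t < 0.5, 0.75 - t**2,
           torch.where(t < 1.5, 0.5 * (1.5 - t)**2, torch.zeros_like(t)))

x = torch.tensor(0.3)
xhat = torch.tensor(0.1, requires_grad=True)   # basis center
s = torch.tensor(0.8, requires_grad=True)      # basis scaling

B2((x - xhat) / s).backward()

with torch.no_grad():
    u = (x - xhat) / s
    dB = -2 * u                                # B2'(u) on |u| < 0.5
    print(torch.isclose(xhat.grad, -dB / s))               # True: -B'(u) / s
    print(torch.isclose(s.grad, -dB * (x - xhat) / s**2))  # True: -B'(u)(x - xhat) / s^2
```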

CNN layers, instead of training a single filter, increase their capacity (i.e. the number of degrees of freedom or parameters) by stacking multiple filters in a single layer. Since the kernel's expressivity in cardinal B-splines is also determined by the basis parameters, we decided to include a hyperparameter to trade off expressivity against capacity. This was done by subdividing each set of filters with an independent set of basis parameters. The subdivision value was encoded by a parameter we named basis groups.

Usually, CNNs make use of limited-size and often square kernels. Depending on the problem size, standard convolutions are often implemented using direct or Fast Fourier Transform (FFT) methods. Assuming inputs of size $M^d$ and filters of size $K^d$, with $K < M$, the time complexity of a convolution operator is $O(M^d K^d)$ for the direct implementation and $O((M+K)^d \log (M+K)^d)$ for the FFT one. When dealing with cardinal B-splines convolutions, sparsity and separability can be exploited to achieve faster convolutions. If we consider a kernel with $N$ bases, each with a support of size $B^d$, $B < K$, the theoretical time complexity of a forward pass becomes $O(d \cdot N \cdot M^{d-1}(M+B)\log(M+B) + N \cdot (M-K)^d)$ (see Appendix B; the $K$ term is factored out in $M^d + K$), making the CBSConv an efficient algorithm for applications with large and sparse kernels at inference time. When the centers lie on a grid, the complexity is expected to drop further by setting $N = 1$ in the time-complexity formula.



Memory-wise, the proposed naïve implementation resulted in the full storage of each separate convolution, making the training of the separable implementation inefficient and memory consuming. A more direct implementation consisted of directly sampling the virtual kernel on the grid domain. This solution was found more convenient overall, due to the small kernel sizes in common architectures and to the fact that libraries are better optimized for standard convolutions.
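A condensed PyTorch sketch of this second strategy is given below. It is our own reading of the approach, not the thesis implementation: the virtual kernel is sampled on its grid from learnable coefficients, centers and per-basis scalings (quadratic B-splines and a single basis group are hardcoded for brevity), and then passed to a standard convolution so that the basis parameters receive gradients through the sampling step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bspline2(t):  # centered cardinal B-spline of degree 2, support [-1.5, 1.5]
    t = t.abs()
    return torch.where(t < 0.5, 0.75 - t**2,
           torch.where(t < 1.5, 0.5 * (1.5 - t)**2, torch.zeros_like(t)))

class CBSConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=5, n_centers=9):
        super().__init__()
        self.kernel_size = kernel_size
        half = (kernel_size - 1) / 2.0
        # learnable per-basis weights c_i (one set per in/out channel pair)
        self.coeffs = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, n_centers))
        # learnable centers xhat_i (random init inside the virtual kernel) and scalings s_i (Eq. 4.1 style)
        self.centers = nn.Parameter(torch.empty(n_centers, 2).uniform_(-half, half))
        self.scalings = nn.Parameter(torch.full((n_centers,), (kernel_size**2 / n_centers) ** 0.5))

    def forward(self, x):
        half = (self.kernel_size - 1) / 2.0
        grid = torch.arange(-half, half + 1, device=x.device)       # sampling points of the virtual kernel
        gy, gx = torch.meshgrid(grid, grid, indexing="ij")
        # sample each separable basis on the grid: shape (n_centers, K, K)
        by = bspline2((gy.unsqueeze(0) - self.centers[:, 0, None, None]) / self.scalings[:, None, None])
        bx = bspline2((gx.unsqueeze(0) - self.centers[:, 1, None, None]) / self.scalings[:, None, None])
        basis = by * bx
        # virtual kernel = sum_i c_i * B_i, shape (out_ch, in_ch, K, K)
        kernel = torch.einsum("oin,nhw->oihw", self.coeffs, basis)
        return F.conv2d(x, kernel, padding=self.kernel_size // 2)

layer = CBSConv2d(3, 8, kernel_size=5, n_centers=9)
print(layer(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 32, 32])
```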


4 EXPERIMENTS

To assess the capabilities of B-splines convolutions and their behavior, the new convolutional layer was tested on different tasks and architectures. In this chapter we go through the rationale behind the experimental choices and how they address our research questions from Section 1.1.

4.1 Experimental Design

As we recall from Section 1.1, we are interested in comparing the performance of standard convolutional layers with our cardinal B-splines convolutions. Convolutions are often applied to different types of structured data, from images and audio signals to time series. We decided to restrict the application of B-splines convolutions to computer vision tasks, as they constitute a very popular choice with a thriving research community and many available datasets and applications.

Our interest focused on selecting the most promising architectures for which the properties of sparsity and attention-like mechanisms could enhance the current state of the art. Even though this model should theoretically perform best with large kernels, we are still interested in its application to feature extraction, which often requires small kernels and deep architectures.

Our first aim is to generically compare the results with similar research. In Sosnovik, Szmaja, and Smeulders [25], Wide Residual Networks are used as the baseline architecture for the scale-invariant convolutional layer on the STL10 dataset[6], making it a good candidate for our investigation. DCNN architectures are often designed to minimize the kernel size while increasing the depth. This is done to increase the effective FOV and enhance their representation power thanks to the stacked non-linearities. For this reason, popular classification architectures like Resnet select small kernel sizes and exploit striding, pooling layers and depth to enlarge the effective FOVs. Replacing striding and pooling layers with large B-splines convolutional layers could be a great application which, however, could not be included, since preliminary tests resulted in a strong accuracy degradation due to the very large kernels required, suggesting the need for a proper architecture redesign which we could not investigate due to time constraints. Finally, to answer our questions, we decided for our classification experiments to simply replace the convolutional layers of a Wide Residual Network baseline architecture, with the toll of training over small kernel sizes, on the STL10 dataset.



In Dai et al. [11], deformable convolutions are used instead of atrous convolutions for segmentation tasks on Pascal VOC. Deformable convolutions allowed a great reduction of the number of parameters required to compute a convolution over large kernel sizes. A drawback of this type of implementation is that the support of these moving sampling points is very limited. Our B-splines convolutions can cover large FOVs while keeping the amount of parameters unchanged, resulting in denser convolutions. Their experiments were performed by substituting some convolutional layers of the DeepLab (Chen et al. [4]) architecture. More recently, Atrous Spatial Pyramid Pooling was employed to achieve higher accuracies in DeepLabV3 (Chen et al. [5]) by making use of atrous convolutions at different rates, providing an ideal setting for employing our model. For this reason we decided to test our model on a segmentation task using DeepLabV3 as the baseline architecture, trained on the augmented[17] Pascal VOC 2012 dataset[13].

In Bekkers [2], B-splines convolution learning procedures were not fully explored. The second target of this work was to research the requirements for stable and efficient learning during the training of this type of convolution. In preliminary tests, we noticed that the basis parametrizations are sensitive to the learning rate dynamics. In Liu et al. [19], the variance observed in the early stages of training is studied. This was seen as especially impactful for transformer architectures. In our setup, a warmup of 5 epochs was included whenever CBSConv layers were trained, as it was shown to result in more stable learning. To increase the convolutions' expressivity without greatly increasing their capacity, basis groups were used. Basis groups were included to help different sets of filters specialize on different regions, resulting in behavior similar to having denser kernels while keeping a similar amount of parameters (see Section 3.5).

A third goal was to discover which types of initializations might benefit the training. In summary, the questions we aim to answer are:

1. Does initializing the set of filters in a grid fashion improve the training with respect to a random initialization?

2. What types of scalings of the bases are necessary?

To answer these questions, the layer was made parametrizable by multiple factors and initializations. For the basis centers, the following initializations were chosen:

• grid(mxn): Centers are positioned in a dense, centered d-dimensional mxn grid within the virtual kernel.

• random(mxn): Centers are randomly initialized within a d-dimensional mxn grid within the virtual kernel.

The scaling value is an important parameter to be tuned, also for computational reasons, as it defines the basis functions' support. For the scalings, the following initializations were chosen:



• uniform: Bases share the same initial scaling value, computed as:

$$s = \sqrt[d]{\frac{V_k}{N_c}}, \qquad (4.1)$$

where $s$ is the initial scaling value of a d-dimensional kernel of volume $V_k$ with $N_c$ centers (a small sketch of these initializations is given after this list).

• random: Scalings s are initialized as in uniform and then perturbed: s ← s·w, with w ∼ U(0.5, 1.5).

• scaled: Scalings s are initialized as in uniform and then uniformly scaled by 1.5: s ← 1.5·s.
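A minimal sketch of these three initialization modes, under our assumption that $V_k$ is the kernel volume (kernel size to the power d):

```python
import numpy as np

def init_scales(kernel_size: int, n_centers: int, d: int = 2,
                mode: str = "uniform", rng=np.random.default_rng(0)) -> np.ndarray:
    s = (kernel_size ** d / n_centers) ** (1.0 / d)   # Eq. 4.1: s = (V_k / N_c)^(1/d)
    if mode == "uniform":
        return np.full(n_centers, s)
    if mode == "random":
        return s * rng.uniform(0.5, 1.5, size=n_centers)
    if mode == "scaled":
        return np.full(n_centers, 1.5 * s)
    raise ValueError(mode)

print(init_scales(kernel_size=5, n_centers=9))  # about 1.67 for a 5x5 kernel with 9 centers
```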

In our experiments, we decided to test different combinations of these initializations while toggling their adaptivity.

Finally, we designed an ablation experiment to investigate whether reducing the network capacity leads to similar accuracies thanks to the adaptivity, and what the effect of enlarging the kernel sizes of the baseline FOVs is.

4.2 Classification

The task of classification using supervised models aims to train a model to classify the represented content of n-channel images into a fixed number of classes. The metric commonly used to assess a classifier's capabilities is accuracy, i.e. the number of correctly classified images divided by the total number of trials.

As discussed in the previous section, the chosen image dataset for our classification task was STL10[6]. STL10 contains only 5000 labeled training images and 8000 test images. Each image is an RGB 96x96 image belonging to one of 10 possible classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck. Due to its low number of training images, STL10 forces a model into a low-data regime, testing its generalization qualities.

The current state of the art in classification on the STL10 dataset relies on DCNNs to achieve accuracies above 90%. To achieve such high accuracies, models[20][27][1] often rely on external datasets or utilize other learning methodologies such as semi-supervised learning. In this work, we restrict ourselves to the class of supervised learning methods in order to have a fairer comparison with previous models. Sosnovik, Szmaja, and Smeulders [25] made use of the same setup presented in [12] but with the added equivariance properties introduced by scale-equivariant networks, based on Hermite polynomial basis functions, reaching an accuracy on STL10 of 91.49%. Even though our B-spline framework enables the usage of similar equivariance properties, we decided to restrict our study to standard convolutions, choosing as baseline the same architecture shown in DeVries and Taylor [12]. The architecture involves a WideResnet with cutout augmentation of the dataset. The cutout augmentation was shown to improve the final accuracy by over 8%, reaching 87% on the test set. Other notable results made use of Deep Hybrid Networks[21]. The architecture utilized is a so-called WideResnet, a residual deep neural network with an increased number of channels. For the classification experiments we will make use of the same baseline.

4.2.1 Wide Residual Networks (WideResnet)

The Residual Network (ResNet) family of architectures tries to address the problem of vanishing gradients by exploiting skip connections, which sum the input of a block to the output of its sequence of convolutional layers, boosting the signal of the original input.

WideResnet is a special type of residual network with a greatly increased number of channels per layer, stacked residual blocks and a pooling layer at the end, before the final fully connected layer. Each convolutional block also contains Dropout layers to improve the generalization of the model[26] and batch normalization to improve the training.

The cone-shaped architecture is meant to increasingly downsample the initial input before it enters a fully connected layer. Each block of the architecture contains a set of residual blocks. Each residual block contains two stacked convolutional layers and a skip connection with an optional convolution. After entering the network, the input is passed through the convolutional blocks, resulting in high-level, translation-equivariant feature extraction, and finally enters a pooling layer to capture long-distance correlations. The final layer is usually fully connected and maps the extracted features to the appropriate logits to compute the final cross-entropy loss.
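The sketch below (a generic residual block, not the exact WideResnet code) illustrates the structure described above: two stacked 3x3 convolutions with batch normalization and dropout, plus a skip connection with an optional 1x1 convolution when the shapes change.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, p_drop=0.3):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.Dropout(p_drop),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # optional 1x1 convolution on the skip path when channel count or resolution changes
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride=stride))

    def forward(self, x):
        return self.body(x) + self.skip(x)

print(ResidualBlock(16, 32, stride=2)(torch.randn(1, 16, 32, 32)).shape)  # (1, 32, 16, 16)
```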

4.2.2 Training

The same training conditions as in DeVries and Taylor [12] were recreated. We used a cross-entropy loss with SGD as the optimizer, with different per-parameter settings. The weight decay was set to 1e-4 for all the parameters except the basis centers and scalings. Nesterov optimization with momentum 0.9 was also active.

To achieve the best performance the dataset was augmented with cutout and random flips and rotations for the training phase. Cutout is a simple but effective technique which removes a rectangular patch of the input. This results in an improved generalization, fundamental for low-data regimes.

The experiments were run with a batch size of 128 on a single Titan 2080 RTX GPU until reaching 1000 epochs.

When training with the B-splines convolutions, a warmup regime over the first 5 epochs was added to stabilize the learning of the basis parameters. This was added to ensure a smooth transition away from the non-optimal landscape generated by the random weights, which often pushes the centers outside the virtual kernel region. We found that the optimal number of basis groups, trading off computation and accuracy, was 4.
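A hedged sketch of the per-parameter-group setup described above (SGD with Nesterov momentum, weight decay disabled for centers and scalings, and a 5-epoch linear warmup); the parameter-name filter ("centers", "scalings") is an assumption about how the layer exposes its parameters:

```python
import torch

def make_optimizer(model, lr=0.1):
    basis, regular = [], []
    for name, p in model.named_parameters():
        (basis if ("centers" in name or "scalings" in name) else regular).append(p)
    opt = torch.optim.SGD([
        {"params": regular, "weight_decay": 1e-4},
        {"params": basis,   "weight_decay": 0.0},   # no weight decay on basis parameters
    ], lr=lr, momentum=0.9, nesterov=True)
    # 5-epoch linear warmup (further decay schedules omitted)
    warmup = torch.optim.lr_scheduler.LambdaLR(opt, lambda epoch: min(1.0, (epoch + 1) / 5))
    return opt, warmup

model = torch.nn.Conv2d(3, 8, 3)       # stand-in for the WideResnet with CBSConv layers
opt, warmup = make_optimizer(model)
```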

4.3 Segmentation

The task of semantic segmentation aims to perform per-pixel classification. Recent segmentation architectures[5] consist of three subsystems: a backbone, a pyramid pooling block and a decoder. As the main feature extraction part, the backbone with its large FOVs takes care of the classification task, convolving high- and low-level features, and is often pre-trained on an external dataset. A multiscale Spatial Pyramid Pooling[18] is a set of convolutional/pooling layers made to capture high-level features at different scales from the backbone output. Finally, for the localization task, a decoder is assigned to classify and upsample the previous feature maps to a set of per-class embeddings via soft-max. The resulting output is a mask which augments the input by associating each pixel with its corresponding class. The metric commonly utilized for this type of task is the mean Intersection over Union (mIoU) between the predictions and the labels. Other metrics include per-class and pixel accuracies, which estimate the percentage of correctly classified pixels.
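For reference, a compact sketch (our own illustration) of the mIoU metric: per-class intersection over union computed from a confusion matrix and averaged over the classes that appear:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, n_classes: int) -> float:
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (target.ravel(), pred.ravel()), 1)           # confusion matrix
    inter = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - np.diag(cm)
    valid = union > 0                                          # skip absent classes
    return float((inter[valid] / union[valid]).mean())

pred = np.random.randint(0, 21, size=(4, 64, 64))    # toy predictions, 21 Pascal VOC classes
target = np.random.randint(0, 21, size=(4, 64, 64))  # toy labels
print(mean_iou(pred, target, n_classes=21))
```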

As investigated by Peng et al. [22], segmentation requires large kernels to capture contextual information. Current implementations employ dilated convolutions as a means to reduce the high computational demand of an otherwise very large kernel size and amount of parameters to store. This is an important difference compared to the previous task of classification, and it is the reason why the segmentation task is thought of as a promising application for deformable convolutions. To compute these large-kernel convolutions while keeping the number of parameters low, dilated or atrous convolutions are employed. Dilated convolutions trade off computation time against the density of the filters, resulting in very coarse convolutions. B-spline convolutions could highly mitigate this issue, not only by capturing a more precise convolution thanks to their larger support, but also by lifting the constraint of having the filter weights arranged in a grid-like structure, allowing each filter to adapt to different regions and specialize on regions which would otherwise not be taken into account.

4.3.1 DeepLabV3+

During training, an input image is passed through a backbone architecture. The chosen backbone architecture was a ResNet-101 stripped of its last fully-connected layer. For this task, it is also important to retain the low-level features. For this reason, together with the last-layer feature map, previous layers are directly forwarded to the decoder, which sequentially performs a set of deconvolutions. To control the resolution of the backbone output features, an output striding value can be set. The striding value describes the ratio between the input and output resolution. To keep the whole network light, we chose an output stride of 16, at the cost of reducing the spatial resolution. With this output stride, the Atrous Spatial Pyramid Pooling (ASPP) module performs atrous convolutions with rates 1, 6, 12, and 18.
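The following condensed sketch (an illustration of the configuration just described, not the actual DeepLabV3+ code; the branch structure is simplified) shows an ASPP-style block with parallel atrous convolutions at rates 1, 6, 12 and 18 applied to backbone features at output stride 16:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        # one atrous 3x3 branch per rate; padding = dilation keeps the resolution
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)  # fuse branches

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feats = torch.randn(1, 2048, 33, 33)    # e.g. ResNet-101 features at output stride 16
print(MiniASPP()(feats).shape)           # torch.Size([1, 256, 33, 33])
```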

4.3.2 Training

We tried to recreate the same training conditions as in Chen et al. [5]. We used a 2-dimensional cross-entropy loss with SGD as the optimizer and different per-parameter settings. The backbone was trained with a learning rate of $\frac{1}{10}\gamma$, where $\gamma = 0.007$ is the learning rate of both the ASPP and the decoder. When training with the CBSConv layers, a warmup until the first epoch and a learning rate of 0.07 were used. Again, the weight decay was set to 1e-4 for all the parameters except the basis centers and scalings. The scalings were initialized to 1 in all the experiments. Without scale adaptivity this essentially corresponds to deformable convolution kernels using cubic (k = 3) B-spline interpolation.

To achieve the best performance, the dataset was augmented with random flips and scalings for the training phase. The experiments were run with a batch size of 16 on four Titan 2080 RTX GPUs until reaching 60 epochs.


5 RESULTS

In this section we summarize the results of the classification and segmentation experiments and address the research questions discussed in the previous chapter.

5.1 Classification

For classification, each experiment was trained three times using three different seeds (0, 1, 2). During training, a subset of the convolution layer parameters was recorded to study their dynamics. The results are reported in Table 5.1. A small sample of the filters' dynamics is visually represented by the heatmaps shown in Figure 5.1.

5.1.1 Analysis

Baseline

The baseline resulted in a slightly increased mean accuracy compared to the one found in DeVries and Taylor [12] (87.26±0.23). The second experiment, with a kernel size of 5, resulted in a statistically non-significant increase in accuracy.

Fixed grid, k = 0, leads to the same results as the baseline

The first experiment of this group produced an accuracy in accordance with the baseline, confirming that the model can indeed emulate standard convolutions by setting the B-spline order to 0 and the centers on a grid layout. When increasing the B-spline order, the basis support increases, introducing coupling effects between the bases. This effect was tested in the second experiment and resulted in increased variance and degraded accuracy. This is possibly due to the cumulative effects of the optimizer regularization term and the coupling effect, which might also contribute to not fully understood, local regularization effects.

Adaptivity with fewer basis parameters leads to higher accuracies

In the first experiment, initializing 5 basis centers randomly without any adaptive behavior during training resulted in a sharp accuracy degradation. Adaptivity of the centers resulted in a statistically significant accuracy increase.

Reducing the number of basis functions from 5 to 3, but with adaptive centers, resulted in a degradation identical to the fixed case.


#  | Bases parameters initialization | Dynamic centers | Dynamic scalings | Kernel size | #Centers | Accuracy (t-test 95%)

Baseline
1  | -                    | -   | -   | 3 | -  | 87.6 ± 0.5
2  | -                    | -   | -   | 5 | -  | 87.7 ± 0.4

Does fixed grid, k = (0, 2), lead to the same results as the baseline?
3  | grid-uniform         | no  | no  | 3 | 9  | 87.7 ± 0.1
4  | grid-uniform         | no  | no  | 3 | 9  | 86.3 ± 1.4

Does adaptivity with fewer basis parameters lead to higher accuracies?
5  | random-uniform       | no  | no  | 3 | 5  | 84.4 ± 0.4
6  | random-uniform       | yes | no  | 3 | 3  | 84.0 ± 0.5
7  | random-uniform       | yes | no  | 3 | 5  | 85.2 ± 0.6
8  | random-uniform       | yes | yes | 3 | 5  | 87.3 ± 1.3

Does increasing the kernel size lead to higher accuracies?
9  | grid-uniform         | yes | yes | 3 | 9  | 87.8 ± 0.8
10 | grid-uniform         | yes | yes | 5 | 25 | 88.0 ± 1.3
11 | random-uniform       | yes | yes | 5 | 9  | 87.6 ± 1.8

Does changing the scaling initialization using more parameters lead to higher accuracies?
12 | random-random        | yes | yes | 5 | 25 | 84.2 ± 1.3
13 | random-uniform-large | yes | yes | 5 | 25 | 87.7 ± 0.4

Does reducing the number of parameters lead to similar accuracies?
14 | random-uniform       | yes | yes | 3 | 7  | 87.6 ± 0.1
15 | random-uniform       | yes | yes | 3 | 5  | 87.4 ± 1.7
16 | random-uniform       | yes | yes | 3 | 3  | 86.9 ± 0.8

Table 5.1: Summary of the accuracies of the performed experiments. The first column shows the experiment number. The second column presents the type of basis initialization in the form CentersDescriptor-ScalingDescriptor. Dynamic centers and Dynamic scalings report whether the bases' parameters were trained together with the weights. Kernel size describes the virtual-kernel size and #Centers the number of centers used to compose the kernel. Finally, the Accuracy column reports the resulting accuracies with their sample mean confidence interval. The confidence interval is estimated via a t-test at 95% confidence.

(33)

(Heatmap panels: Frontal convolution and Back convolution.)

Samples of B-spline filters and their dynamics from experiment 10.

(Heatmap panels: Frontal convolution and Back convolution.)

Samples of B-spline filters and their dynamics from experiment 14.

Figure 5.1: Examples of cardinal B-spline filters at the end of the training procedure. The centered grid with black crosses represents the virtual kernel with its sampling points. Red dashed lines indicate each basis' scale and position. A red line traces the trajectory of the centers during training.

Enabling both dynamic centers and scalings with 5 basis functions resulted in a statistically significant increase in accuracy, reaching a slightly, though not significantly, lower accuracy than the baseline.

Increasing the kernel size leads to higher accuracies with higher variance

Due to the random initializations, larger kernel sizes resulted in increased sample mean accuracies but also higher variance. The first experiment of this group achieved a slight change in average accuracy, possibly due to an adaptive regularization effect caused by the coupling between centers. On a larger grid, with 25 centers, training resulted in a slight increase of the mean accuracy. Finally, a random initialization resulted in an accuracy similar to the baseline with kernel size 5, but with a lower number of weights (9 against 25).


Experiment | 1   | 14 | 15 | 16
#params    | 11M | 8M | 6M | 3M

Table 5.2: Number of trainable parameters for experiments 1, 14, 15, and 16.

Changing the scaling initialization using more parameters degrades the performance

Utilizing random scalings resulted in a performance degradation of almost 4%. Increasing the scalings did not result in a statistically significant change.

Reducing the number of parameters leads to similar accuracies

This ablation experiment resulted in accuracies similar to the baseline but with higher variance. Halving the number of parameters of the convolution architecture resulted in a degradation of only 0.2% for 5 centers, and in a degradation of 0.8% when reducing the number of weights to a third. The number of parameters for these experiments is detailed in Table 5.2.


# | Centers layout | Dynamic centers | Dynamic scalings | Kernel size | #Centers | mIoU (t-test 95%) | Px-Acc (t-test 95%)

Baseline
1 | -      | -   | -   | 3 | - | 75.2 ± 0.5   | 93.5 ± 0.14

Does a fixed grid lead to the same results as the baseline?
2 | grid   | no  | no  | 3 | 9 | 74.76 ± 0.29 | 93.5 ± 0.14

Does adaptivity of the bases' parameters lead to higher accuracies?
3 | grid   | yes | no  | 3 | 9 | 75.05 ± 3.17 | 93.5 ± 0.52
4 | grid   | yes | yes | 3 | 9 | 74.9 ± 0.64  | 93.5 ± 0.52
5 | random | yes | yes | 3 | 9 | 74.7 ± 3.17  | 93.4 ± 0.52

Table 5.3: Summary of the mean Intersection over Union and pixel accuracies of the performed experiments. The first column presents the layout initialization, random or grid. In this case, the scalings were always uniform and initialized to 1. Dynamic centers and Dynamic scalings report whether the bases' parameters were trained together with the weights. Kernel size describes the virtual-kernel size and #Centers the number of centers used to compose the kernel. Finally, the mIoU and Px-Acc columns report the mean Intersection over Union and pixel accuracies with their confidence intervals. The confidence interval is estimated via a t-test at 95% confidence.

5.2 s e g m e n t a t i o n

For segmentation, each experiment was run only two times, using different random seeds. The results are reported in Table 5.3. A small sample of the filters' dynamics is visually represented by the heatmaps shown in Figure 5.2.

5.2.1 Analysis

Baseline

The baseline accuracy resulted in a lower mIoU compared to the original model [5] (78.85). The original value falls outside the confidence interval, implying that we could not reproduce the original results.

Does a fixed grid lead to the same results as the baseline?

We observed a slight reduction in the mean mIoU, which nonetheless intersects the confidence interval of the baseline. The pixel accuracies were identical.


ASPP filter (size=15) for experiment 2.


ASPP filter (size=27) for experiment 2.

ASPP filter (size=15) for experiment 4.


ASPP filter (size=27) for experiment 4.


Samples of ASPP adaptive filters.

Figure 5.2: Examples of cardinal B-spline filters of the ASPP module at the end of the training procedure. The centered grid with black crosses represents the virtual kernel with its sampling points. Red dashed lines indicate each basis' scale and position. A red line traces the trajectory of the centers during training.

Does adaptivity of the bases' parameters lead to higher accuracies?

Due to the low number of runs, it was not possible to statistically detect improvements below 1.5%. An important difference from the classification experiments is that the scalings were initialized to 1.0, resulting in basis supports only slightly larger than those of the original atrous convolution.

5.3 d i s c u s s i o n

For classification, the results have shown higher accuracy given specific settings, and similar accuracies when trained with an equal or lower number of parameters. The difference between adaptive and non-adaptive centers shows that adaptivity has a fundamental role in maximizing the accuracy (Section 5.1.1). This could be explained


by the ability of the basis functions to retain local information, due to their extended but compact support, together with their deformability, which allows the filters to specialize at different locations. In Figure 5.1, it is possible to notice that the center dynamics often result in short trajectories. This could be explained by a possibly too long warmup or by other, not yet understood, training dynamics. The bases also had the tendency to overlap their centers, decreasing the overall parameter efficiency.

For segmentation, the low number of experiments resulted in large confidence intervals. This led to inconclusive results across the different experiment configurations. The filters' dynamics (shown in Figure 5.2) also exhibited short trajectories, which appear even more limited in the heatmaps due to the large kernel sizes.


6

C O N C L U S I O N A N D F U T U R E W O R K

The main contribution of this thesis focused on re-implementing and studying part of the theoretical framework introduced in Bekkers [2]. B-spline convolutions were studied not only in their role of supporting group convolutions, but as a more general method of expressing convolutional kernels as samples of full-fledged functions that, at the same time, provides sparser and deformable convolutions.

We went through the basics of convolutional layers and generalized their representation using continuous functions. We discussed the notion of functional space and the basics of approximation theory, introducing a special type of basis functions with compact support, called B-splines, and establishing the main connection with Bekkers [2].

On top of this theory, we expressed standard convolutions as an instance of kernel operators with translation-equivariant properties, and approximated their kernels as linear combinations of B-splines. The connection between group convolutions and standard ones was made explicit by showing that the latter is an instance of the former. This led to a convenient formulation of B-spline convolutions as a linear combination of pre-blurred input features.
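As a compact restatement of that formulation (the symbols w_i, c_i, s_i and B^n are illustrative and may differ from those used in the methodology chapter), a kernel expanded as

k(x) = \sum_{i=1}^{N} w_i \, B^{n}\!\left(\frac{x - c_i}{s_i}\right)

yields, by linearity of the convolution,

(f \ast k)(x) = \sum_{i=1}^{N} w_i \, \bigl(f \ast B^{n}_{c_i, s_i}\bigr)(x),
\qquad B^{n}_{c_i, s_i}(x) := B^{n}\!\left(\frac{x - c_i}{s_i}\right),

so the input is first convolved ("pre-blurred") with each shifted and scaled B-spline, and the learnable weights only enter through the final linear combination.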

A naive implementation of separable B-splines was tested but resulted in an inefficient use of resources. The separable implementation was then disfavoured and replaced by a standard convolution with a pre-sampled full-sized kernel.¹

To be able to sparsify this representation, these basis functions were parametrized by their centers and scalings. We showed how the chosen parametrization can be included in the training by computing the respective parameters' gradients to enable back-propagation, and studied how this impacted the models' performance through a set of experiments. The models were benchmarked by replacing convolutions at different levels of classification and segmentation architectures. Furthermore, experiments with different parameter initializations were performed to investigate their effects on such models.
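The sketch below illustrates (in PyTorch, assuming that is the underlying framework) how such a layer can be assembled: learnable centers, scalings, and weights define a kernel that is sampled on the virtual grid and then passed to an ordinary convolution, so the gradients with respect to centers and scalings come for free from automatic differentiation. The class name, the cubic-spline choice, and the initialization are illustrative assumptions, not the thesis implementation.

import torch
import torch.nn.functional as F

def cubic_bspline(x):
    """Cardinal cubic B-spline centered at 0 (support |x| < 2), differentiable in x."""
    ax = x.abs()
    inner = 2.0 / 3.0 - ax ** 2 + 0.5 * ax ** 3
    outer = ((2.0 - ax).clamp(min=0.0) ** 3) / 6.0
    return torch.where(ax < 1.0, inner, outer)

class CBSConv2dSketch(torch.nn.Module):
    """Sketch of a B-spline-parametrized convolution: N basis functions with
    learnable centers and scalings are sampled on a virtual kernel grid,
    combined with learnable weights, and used as an ordinary conv kernel."""
    def __init__(self, in_ch, out_ch, kernel_size=5, n_centers=9):
        super().__init__()
        half = (kernel_size - 1) / 2.0
        grid = torch.linspace(-half, half, kernel_size)
        gy, gx = torch.meshgrid(grid, grid, indexing='ij')
        self.register_buffer('grid', torch.stack([gx, gy], dim=-1))      # (K, K, 2)
        self.centers = torch.nn.Parameter(torch.empty(n_centers, 2).uniform_(-half, half))
        self.scalings = torch.nn.Parameter(torch.ones(n_centers))
        self.weight = torch.nn.Parameter(torch.randn(out_ch, in_ch, n_centers) * 0.01)

    def forward(self, x):
        # Sample each (shifted, scaled) separable basis on the virtual grid: (N, K, K)
        diff = (self.grid[None] - self.centers[:, None, None, :]) / self.scalings[:, None, None, None]
        basis = cubic_bspline(diff[..., 0]) * cubic_bspline(diff[..., 1])
        # Linear combination with the weights gives a pre-sampled full-sized kernel
        kernel = torch.einsum('oin,nkl->oikl', self.weight, basis)        # (O, I, K, K)
        return F.conv2d(x, kernel, padding=kernel.shape[-1] // 2)

layer = CBSConv2dSketch(3, 8)
print(layer(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 8, 32, 32])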

When applied to classification tasks, replacing the standard convolutions in a WideResnet architecture has shown that it is possible to recover the original baseline accuracies by simply expressing the B-splines as order-0 piecewise polynomials placed on a kernel grid.

Given the same capacity, B-spline convolutions did not introduce statistically significant improvements over the baseline architecture when kernel sizes were increased.

1 More details can be found in Appendix C.


When the parameters were randomly initialized, adaptivity, achieved by the simple means of gradient descent optimization, positively impacted the overall accuracy.

The ablation experiments have shown the potential of B-spline convolutions to increase sparsity and theoretically decrease the computational cost. When adaptivity is enabled, we were able to reach a third of the original number of parameters with an accuracy reduction of just 0.8%.

The results did not show any relevant difference in terms of accuracy between grid and random layout initializations. This could be justified by the fact that relatively small kernel sizes were used, so the centers can easily adapt to fit such a small number of pixels.

Scaling initialization is an important factor when training with this type of layer. The bases' scaling contributes to the learning process by increasing or decreasing the effective receptive field of each basis. Random initialization has shown increased accuracy degradation, possibly due to a lower parameter efficiency.² A solid modeling of these dynamics is left to further research.

Finally, the following conclusions can be drawn from the original research questions (Section 1.1):

1. The application of adaptive cardinal B-spline convolutions as a replacement of standard convolutional layers in baseline classification and segmentation architectures resulted in comparable final accuracies. Random initializations without adaptivity resulted in higher variance and degraded performance. Non-adaptive B-splines of order 0 constitute a generalization of classical convolutions.

2. The application of a learning-rate scheduler with warmup was found to be important for stable learning of the bases' parameters.

3. Random and grid center initializations, with adaptivity enabled, did not result in statistically significant differences.

4. With adaptivity enabled, reducing the number of parameters led to a slightly, oftentimes not significantly, degraded accuracy compared to the baseline.

Regardless of the fact that the application of B-spline convolutions in the studied architectures did not result in improved accuracies, B-spline convolutions proved to be a powerful alternative to standard convolutions. Thanks to their robustness to different initializations, their separable form, which enables lower computational complexity, and the possibility of providing favorable tradeoffs between sparsity and accuracy, they can be seen as an ideal candidate for large-kernel efficient convolutions and equivariant deep learning applications.
