Transformation Equivariant Models

Daan Ferdinandusse
11345705

Bachelor Thesis
Credits: 18 EC

Bachelor's Programme in Artificial Intelligence (Bachelor Opleiding Kunstmatige Intelligentie)

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
MSc. I. Sosnovik
UvA-Bosch Delta Lab
University of Amsterdam
Science Park 904, Room C3.201
1098 XH Amsterdam


Abstract

The convolutional layers in convolutional neural networks (CNNs) make use of the translation symmetry to correctly classify translated versions of the input. This makes these networks very efficient to train on large amounts of data. A CNN is able to detect all translated versions of the input, so it can correlate the training set with the test set. To utilise more of such symmetries, Cohen and Welling [7] created group equivariant convolutional neural networks. These networks have shown state-of-the-art performance on the rotated MNIST dataset. In this research we dive deeper into transformation equivariant models. We demonstrate that a model that is equivariant to transformations of its input classifies more cases correctly than a model that is not equivariant to these transformations.


Contents

1 Introduction
2 Group Theory
   2.1 Symmetry
   2.2 Defining Groups
   2.3 Permutation Groups
   2.4 Finite Groups
   2.5 Subgroups
3 Neural Networks
   3.1 Artificial Neural Networks
   3.2 Layers
   3.3 Activation Function
   3.4 Convolutional Neural Networks
   3.5 Convolutional Layer
   3.6 Pooling
   3.7 Back propagation
4 Group Equivariant CNN
   4.1 Invariance and Equivariance in CNNs
   4.2 Related work
   4.3 Rotation group P4 as a function
   4.4 Equivariance in Convolution
   4.5 G-CNN convolution layer
5 Implementation
   5.1 G-CNN implementation
   5.2 Generated datasets
6 Results
7 Conclusion
8 Discussion


1 Introduction

Deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance in a wide range of object recognition tasks. The main advantage that CNNs have over other neural networks is that CNNs make effective use of weight sharing. This greatly reduces the number of trainable parameters compared to a fully connected neural network. CNNs are much faster at learning a training set, which makes them able to learn from larger datasets. When optimizing artificial models, it is highly beneficial to train on larger datasets, because the higher variety in training data makes the model more robust to changes in the test data. The MNIST dataset [15] is frequently used for training and testing of these CNN models. The dataset contains a large number of handwritten digits, which emulates the real-life problem of reading someone's handwriting. However, nowadays many computer vision architectures are successful at completing this task with relatively high accuracy. Because the digits are fairly similar, most models are able to exploit the lack of variety. Earlier research shows that increasing the complexity of the task greatly reduces the performance of deep CNNs [14]. Artificial models benefit from being as general as possible, so that they can be applied to a larger variety of tasks.

In physics, symmetry groups are often used to create a mathematical framework that can describe the connections between objects. These connections range from reflections to rotations. CNNs make use of the symmetry over all translations Z². Because of a translation symmetry in the features of an image, CNNs are able to reduce the complexity while staying invariant to shifts. This means that objects can be identified even if they appear in different parts of the image. Convolutional neural networks are said to have translation equivariant layers throughout the whole network. This means that feeding a translated input to one of the convolution layers is the same as feeding the original input through the CNN and then shifting the result. However, CNNs are not equivariant to all transformations. By transforming the input of a CNN, the model becomes less accurate. CNNs are unable to translate such transformations into a correct classification. One of the solutions to this problem is to create larger and more complex training sets that cover a larger variety of object instantiations. However, in the field of artificial intelligence (AI), it is more desirable to generalize over all cases.

Cohen and Welling [7] created a model called a group equivariant convolutional neural network (G-CNN). G-CNNs provide a framework that generalises the symmetry groups in convolutions. Where CNNs are translation equivariant, G-CNNs are equivariant to a transformation group G. This means that the model is able to generalise the transformations of the input if the transformations are part of the group G. This generalisation of convolution limits the model to the group G that is implemented in the convolution.

In this research we demonstrate that models that are equivariant to transformations of the input are able to identify objects more accurately than models which are not equivariant to these transformations.


2 Group Theory

Vvedensky and Evans [25] was used as the main source of information about group theory.

2.1 Symmetry

An object is said to be symmetric when the object is invariant to a certain transformation. This means that when applying that transformation, the object's appearance does not change [25]. Examples of such transformations are reflections or rotations. To illustrate this effect we take a look at Figure 1, which shows an equilateral triangle. When rotated 120 degrees we end up with the exact same triangle. An equilateral triangle is therefore said to be symmetric under a 120 degree rotation around the center of the triangle. A square is symmetric under a 90 degree rotation around its center, and a circle is symmetric under all rotations around its center.

Figure 1: Three geometric shapes. Image is taken from [25]

2.2 Defining Groups

To describe symmetric transformations we shall define a concept called a group. A group is an abstract set together with a certain binary operation; in the definition below we state the rules of a group.

Definition: A group G is a set of elements a, b, c, ... together with a binary composition law, called multiplication, which has the following properties:

1. Closure. The composition of any two elements a and b in G, called the product and written ab, is itself an element c of G: ab = c.

2. Associativity. The composition law is associative, i.e., for any elements a, b, and c in G, (ab)c = a(bc).

3. Identity. There exists an element, called the unit or identity and denoted by e, such that ae = ea = a for every element a in G.

4. Inverses. Every element a in G has an inverse, denoted by a⁻¹, which is also in G, such that a⁻¹a = aa⁻¹ = e.

The so-called "multiplication" or "product" in these rules does not necessarily correspond to the mathematical product. The product in this context is the binary operation between two elements of the group that generates another element of the group. Even so, the mathematical product is an operator for the group of non-zero real numbers. The closure rule states that the product of two elements of a group always results in another element of that group. This gives a good test to check whether a set with an operation is a valid group: take two arbitrary elements, apply the operator, and check whether the result is still an element of the set. Because of the associativity rule, the product abc is unambiguous: the order of the multiplications does not matter, (ab)c is the same as a(bc). Yet a group does not need to be commutative. The product ab does not need to be equal to ba for the group to be valid. Whenever these quantities are equal for all elements, we call the group a commutative group. Furthermore, the group has an identity element such that the product between any element and the identity element results in that original element. Finally, for every element in the set there exists an inverse element in the set, such that combining them in either order produces the identity element. An example of an inverse can be given in the group of non-zero real numbers with the multiplication operator: here 1/x is the inverse of x, for multiplying 1/x with x results in the identity element 1.

In AI, group elements often correspond to coordinate transformations of the objects in an image, where the composition law corresponds to matrix multiplication, so the associativity property is guaranteed. Looking at the closure rule, the product of two symmetry transformations results in another symmetry transformation of the same group. The identity transformation of such a group is performing no transformation at all, the identity matrix. The inverse of a symmetry transformation corresponds to the reverse of that transformation.

Consider the group of 2 × 2 matrices with real entries such that the determinant is non-zero. The composition law of the group is regular matrix multiplication. This set of matrices is a valid group because the product of two matrices with a non-zero determinant is again a matrix with a non-zero determinant, since det(A) det(B) = det(AB) for any matrices A and B with non-zero determinants. The identity element of the group is the identity matrix

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

for multiplying any matrix with the identity matrix results in the original matrix. The inverse of every element of the group exists and is equal to

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

since the determinant ad − bc is non-zero.
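As a small numerical illustration of these group axioms (my own sketch, not part of the thesis), the snippet below checks closure, identity and inverses for random 2 × 2 matrices with non-zero determinant using NumPy; the random matrices and tolerances are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_invertible(size=2):
    # Rejection-sample a 2x2 matrix until its determinant is clearly non-zero.
    while True:
        m = rng.normal(size=(size, size))
        if abs(np.linalg.det(m)) > 1e-6:
            return m

A, B = random_invertible(), random_invertible()
I = np.eye(2)

# Closure: the product of two invertible matrices is invertible,
# and det(A) * det(B) == det(A @ B).
assert abs(np.linalg.det(A @ B)) > 1e-12
assert np.isclose(np.linalg.det(A) * np.linalg.det(B), np.linalg.det(A @ B))

# Identity: multiplying with the identity matrix leaves the element unchanged.
assert np.allclose(A @ I, A) and np.allclose(I @ A, A)

# Inverses: A^{-1} A == A A^{-1} == identity.
A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, I) and np.allclose(A @ A_inv, I)
```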

2.3 Permutation Groups

A permutation of a set is a rearrangement of the elements of the set. When giving the permutations of a set a group structure, we define the composition law as performing successive permutations: the objects are rearranged according to the first permutation, and the result is then used as the reference order to rearrange the objects according to the second permutation. These permutation groups are denoted by Sn. By taking a look at an example of the structure of S3, we are able to see its use when working with objects like an equilateral triangle.

The group S3 is the set of all permutations of three different objects, where each element of the set represents a permutation of the three objects, with the first element equal to the reference order. The group S3 consists of 6 elements due to the ordering of the objects: the first object can be put in any of the three positions, the second object can be put in two of the remaining positions and the third object can be put in only one position, which results in 3! = 6. The elements of S3 are listed as follows:

$$e = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix} \quad a = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 3 \end{pmatrix} \quad b = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix} \quad c = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix} \quad d = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix} \quad f = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \end{pmatrix}$$

Here the top line represents the reference order and the bottom line the permutation. The product of two permutations results in another permutation of the group. Consider the product of permutation a and permutation d. Element a puts the first object in the second position, which puts the first object of d in the second position. Following this logic:

$$ad = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix} = b$$

Taking the product in the reverse order,

$$da = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{pmatrix} = c$$

shows that S3 is not a commutative group.
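A minimal sketch (mine, not thesis code) of this composition law in Python, using 0-indexed tuples and the thesis convention that in the product ad the permutation a is applied first:

```python
# Permutations of S3 as 0-indexed tuples: p[i] is the image of object i.
e, a, b, c, d, f = (0, 1, 2), (1, 0, 2), (0, 2, 1), (2, 1, 0), (2, 0, 1), (1, 2, 0)

def compose(p, q):
    # Product pq in the thesis convention: apply p first, then q.
    return tuple(q[p[i]] for i in range(len(p)))

print(compose(a, d) == b)  # True: ad = b
print(compose(d, a) == c)  # True: da = c, so S3 is not commutative
```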

The different symmetry transformations of an equilateral triangle can also be represented by S3 (Figure 2). In this representation element e is the reference order of the triangle. The elements a, b and c correspond to reflections through the lines indicated in the figure: a corresponds to the flip through line 3, b corresponds to the flip through line 1 and c corresponds to the flip through line 2. The elements d and f correspond to clockwise rotations by 120 and 240 degrees respectively. Now all elements of S3 can be represented by transformations of the equilateral triangle, where a in S3 is equal to a of the equilateral triangle.

Consider again the earlier example of the product between a and d: the result is the reference order rotated by d and then flipped by a, as shown in Figure 3. The result is the same as element b of the set, and the result in S3 was also b.

Figure 2: Symmetry transformations of an equilateral triangle. Image is taken from [25].

Figure 3: Transformation ad of the equilateral triangle. Image is taken from [25].

Similarly, the result of da corresponds to c for both groups. Two groups that share the same algebraic structure are called isomorphic groups. This means that they are structurally identical to one another. This makes these geometric structures and their transformations easy to work with when written as a permutation group.

2.4 Finite Groups

Groups can be divided into two general categories: discrete groups and continuous groups. Here we consider only discrete ones due to their relative simplicity. Discrete groups can be of two types: finite and infinite. Finite groups have a finite number of elements while infinite groups have an infinite number of elements. The number of elements is called the order of a group and is denoted by |G|. So the order of S3, the group of all permutations of three elements, is six, and the order of the group of integer numbers Z is infinite. The basic definitions apply to both kinds of groups, yet finite groups have some extra properties that do not apply to infinite groups. For example, multiplying an element g by itself enough times will eventually result in the element g again. Take for example the group C4, the group of all 90 degree rotations around the center. The group's elements can be denoted as {e, r1, r2, r3}, where e is the identity transformation (no rotation at all), r1 is the rotation by 90 degrees around the center, r2 is the rotation by 180 degrees and r3 is the rotation by 270 degrees. Multiplying r1 with itself 5 times results in r1, as is shown in Figure 4.

Figure 4: Discrete group transformation
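A tiny illustration of the C4 example (my own sketch): realising r1 as a 90 degree rotation of an array with np.rot90, four applications give the identity, so applying it five times gives r1 again.

```python
import numpy as np

image = np.arange(16).reshape(4, 4)

r1 = lambda x: np.rot90(x)  # rotation by 90 degrees (element r1 of C4)
rotated_5_times = image
for _ in range(5):
    rotated_5_times = r1(rotated_5_times)

# Four rotations give the identity, so five rotations equal a single one: r1^5 = r1.
assert np.array_equal(rotated_5_times, r1(image))
```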

2.5 Subgroups

A group is said to have subgroups if elements of the set form a group of their own under the same composition law as the group itself. Following the definition of the identity element, there is always one subgroup in a group, namely the identity element by itself. The group itself is also a subgroup. These two subgroups are called improper subgroups and are generally of no use. Proper subgroups, on the other hand, show the symmetry between the elements in the subgroup and can therefore give valuable information about the group. The group S3 contains a number of proper subgroups: {e, a}, {e, b}, {e, c}, {e, d, f}. The subgroups containing a, b or c are valid groups because these elements swap two objects every time the transformation is performed, so taking the product of the same transformation twice results in the identity element. The subgroup {e, d, f} is a rotation subgroup with 120 degree rotations around the center. These subgroups show us the symmetries of the possible transformations of the equilateral triangle, which can be seen in Figure 2.

3 Neural Networks

We used Subramani Palanisamy [23] as the main source of information about neural networks.

3.1 Artificial Neural Networks

Artificial Neural Networks (ANNs) are computer systems created to effectively perform complex tasks through repeated learning. ANNs are inspired by the biological neural networks that form the brains of animals. ANNs, like biological neural networks, consist of a large number of neurons that work together in blocks called layers or perceptrons. These layers are chained together to form a neural network. ANNs have proven to be very powerful in a wide range of tasks. One of the most famous examples of such a task is the recognition of handwritten digits. Trained on a large dataset of handwritten digits with their corresponding labels, an ANN is able to learn to recognize handwritten digits of a similar category that it has not seen before in training.


3.2 Layers

ANNs consist of hierarchical layers with connections between them, where information passes from one layer to another in order to make the final decision. Figure 5 shows a simple ANN. The ANN consists of an input layer, represented by a 1-d vector, one hidden layer and an output layer where the output is based on the input from the hidden layer. The connections between the layers are a set of weights and biases that determine the output of the next layer based on the input layer. The weights between the layers are frequently updated between iterations of the training process to best fit the ground truth of the training set.

Figure 5: Simple representation of an ANN. Image is taken from [23].

The input layer of an ANN is the layer containing the input. When training, the input is one of the training samples of the training dataset. In the case of handwritten digits, the input is one of the images. The input is most of the time connected to a hidden layer, but it can also be directly connected to the output layer, in which case the network is called a single-layer network.

The hidden layers follow the input layer and send their output either to another hidden layer or to the output layer. Each neuron in the hidden layer collects the input of the connected neurons in the previous layer and their respective weights to calculate the weighted sum Σ wᵢxᵢ. Here wᵢ corresponds to the i-th input weight and xᵢ corresponds to the i-th input value. The neuron in the hidden layer becomes active once a threshold is met. The threshold is applied by the activation function, which will be discussed in the next section. With binary data, if the weighted sum is greater than the threshold the neuron is activated and set to 1, otherwise the neuron is not activated and set to 0. In a single feed-forward neural network, only one hidden layer is used, as shown in Figure 5. However, a neural network can contain any number of hidden layers with any number of neurons. When using more than one hidden layer it is called a deep neural network, as shown in Figure 6. This architecture is used for more complex problems.
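A minimal sketch (mine, with arbitrary toy values) of such a neuron: it computes the weighted sum of its inputs plus a bias and applies a hard threshold, as described above.

```python
import numpy as np

def neuron(x, w, bias, threshold=0.0):
    # Weighted sum of the inputs plus a bias, followed by a hard threshold:
    # the neuron fires (returns 1) only if the sum exceeds the threshold.
    weighted_sum = np.dot(w, x) + bias
    return 1 if weighted_sum > threshold else 0

x = np.array([0.2, 0.8, 0.5])   # input values x_i
w = np.array([0.4, -0.1, 0.9])  # learned weights w_i
print(neuron(x, w, bias=-0.3))  # -> 1, since 0.2*0.4 - 0.8*0.1 + 0.5*0.9 - 0.3 = 0.15 > 0
```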

The output layer processes the data from the previous layer just like the hidden layers do, with its respective weights. The output layer maps the previous layers to the output classes. In the case of handwritten digits, the output classes are the numbers represented by the images. With binary data the output class will either be 0 or 1, and with non-binary data the output is a probability for each output class.


Figure 6: Multi layer ANN. Image is taken from [23].

3.3 Activation Function

The activation function is the function that decides whether a neuron is active or not. The activation function also gives the layers in our network non-linear properties. To map complex problems to a neural network, a non-linear functional mapping between the inputs and the respective output is needed. One of the more popular activation functions used to be the sigmoid function. The sigmoid function creates a bounded version of the weighted sum of the input within the range of 0 to 1. However, this can create the vanishing gradient problem, which keeps weights stuck around a certain value. Because of this, certain weights can never reach their optimal value and so the network is not optimized for its problem. The currently most used activation function is the ReLU function, which maps the weighted sum of the input to a range from 0 to the actual value of the input. Some other activation functions with their respective graphs are shown in Figure 7.

Figure 7: Respective graphs of different activation functions. Image is taken from [23].
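For reference, a small sketch (mine) of the sigmoid and ReLU activation functions discussed above:

```python
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged and clips negative values to 0.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))  # [0.119 0.5   0.953]
print(relu(z))     # [0. 0. 3.]
```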


3.4 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a very popular type of neural network. They have proven to perform particularly well in image recognition tasks. In particular, CNNs have proven to be very useful for driving autonomous cars and for medical image analysis. CNNs are able to process a larger amount of training samples compared to traditional computer vision algorithms due to their unique way of processing an image. A CNN is structured around three main components: a convolution layer, a pooling layer and a fully connected layer. The convolution layer and the pooling layer are alternated to systematically learn the features of the image. The number of alternating layers can be increased according to the complexity of the task. To get the final classification, CNNs use a number of hidden layers similar to ANNs. An abstract version of a CNN is shown in Figure 8.

Figure 8: Representation of a CNN. Image is taken from [26]

3.5 Convolutional Layer

The convolution layer analyses the pixels of the image with a kernel that strides along the image to create a feature map of the image. The kernel of the convolution layer is a filter of a certain size that looks at a different part of the image at each step. This filter provides the CNN with a pixel-wise weight estimation for the respective filter kernel, as shown in Figure 9. Wherever the kernel extends beyond the input image, the image is padded, usually with zeros. An image can contain multiple channels, as is the case with color images. Whenever an image contains multiple channels, the kernel strides simultaneously over all channels and combines the contributions of all channels into the output feature map. The convolution of x and w, denoted as x ⋆ w, is

$$s(t) = (x \star w)(t) = \sum_a x(a)\,w(a - t) \qquad (1)$$
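A direct sketch (mine) of Equation 1 for a small 1-D signal, evaluating s(t) = Σₐ x(a) w(a − t) at the shifts t where the filter stays inside the signal; NumPy's correlate gives the same result.

```python
import numpy as np

def correlate_1d(x, w):
    # s(t) = sum_a x(a) * w(a - t), evaluated for every shift t
    # at which the filter w fits entirely inside the signal x.
    return np.array([
        sum(x[a] * w[a - t] for a in range(t, t + len(w)))
        for t in range(len(x) - len(w) + 1)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0])
print(correlate_1d(x, w))                # [-2. -2.]
print(np.correlate(x, w, mode='valid'))  # same result
```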

Convolution has three main advantages:

1. The weight sharing ability in the feature maps significantly reduces the parameters needed to learn

2. Because the kernel looks at more than 1 pixel at a time, the net is able to learn correlations between the pixels


3. The convolution layer makes the model able to learn features invariant to the location in the image.

Figure 9: Visualisation of a convolution step. Image is taken from [2].

3.6 Pooling

The pooling layer follows the convolution layer in the CNN architecture. The pooling layer reduces the computational power required through dimensionality reduction. Similarly to the convolution layer, the pooling layer uses a kernel that strides over the image. There are two types of kernels for pooling layers: max kernels and average kernels. A max kernel takes the maximum value of the window to generate the new feature value, while an average kernel takes the average value of the window to generate the new feature value. Pooling layers thus reduce dimensionality by averaging or maximizing over local neighbourhoods, which makes the net invariant to local transformations. This can be useful when the existence of a feature is more important than the exact location of the feature.
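A minimal sketch (mine) of 2 × 2 max and average pooling using PyTorch, the framework used later in this thesis:

```python
import torch
import torch.nn.functional as F

# A single-channel 4x4 feature map: shape [batch, channels, height, width].
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_pooled = F.max_pool2d(x, kernel_size=2)  # keeps the maximum of every 2x2 window
avg_pooled = F.avg_pool2d(x, kernel_size=2)  # keeps the average of every 2x2 window

print(max_pooled.shape)  # torch.Size([1, 1, 2, 2])
print(max_pooled)        # [[ 5.,  7.], [13., 15.]]
print(avg_pooled)        # [[ 2.5,  4.5], [10.5, 12.5]]
```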

3.7 Back propagation

In order for a model to learn, the model has to update its parameters to better fit the training data. This procedure is called back propagation. With back propagation, the model compares the ground truth to the predicted value of the model and improves its parameters. To calculate how far off the model is from the ground truth, a loss function is used. One of the most popular loss functions for classification problems is the cross-entropy loss function shown in Equation 2.

$$H_p(q) = -\sum_{c=1}^{C} p(y_c) \cdot \log q(y_c) \qquad (2)$$

Here q(y_c) is the prediction and p(y_c) is the ground truth for the class denoted by c. The cross-entropy loss function calculates the difference between the estimated values of our model and the ground truth over all classes. To minimize the loss function, the model uses a gradient descent method. The gradient descent method finds the change of the parameters that decreases the loss function the most. By decreasing the loss function with every training step, the model eventually finds a local minimum of the loss function, as shown in Figure 10.

Figure 10: Visualisation of gradient descent. Image is taken from [1].

In the example of the handwritten digits, an optimal output would be a prediction of 1.0 for the output class representing the number in the image and a prediction of 0.0 for all other output classes. When an output class does not match the optimal output, the weights of the nodes responsible for that output are adjusted to come closer to the optimal output. The weights of the nodes which contributed more to the output are adjusted more. Weights are either decreased if the prediction is too high or increased if the prediction is not high enough. All output predictions are taken into consideration through the average of the adjustments to the weights. This yields a desired output for one of the hidden layers, which can then back propagate this desired output to the previous layer.
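A compact sketch (mine, with an arbitrary toy model) of one such training step in PyTorch: the cross-entropy loss of Equation 2 is computed from the model's predictions and the ground-truth labels, and a gradient descent step updates the weights.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier for 28x28 digits
loss_fn = nn.CrossEntropyLoss()                              # Equation 2, applied to logits
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)      # plain gradient descent

images = torch.randn(8, 1, 28, 28)   # a fake mini-batch of 8 images
labels = torch.randint(0, 10, (8,))  # their ground-truth digit classes

logits = model(images)               # forward pass: predictions per class
loss = loss_fn(logits, labels)       # how far off the model is from the ground truth

optimizer.zero_grad()
loss.backward()                      # back propagation: gradients of the loss w.r.t. the weights
optimizer.step()                     # gradient descent: move the weights downhill
```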

4 Group Equivariant CNN

We used Cohen and Welling [7] as the main source of information about G-CNNs.

4.1 Invariance and Equivariance in CNNs

Deep neural networks create translation invariant representations of features, using a series of parameterized functions that map the input to progressively more abstract representations. Through every layer the translation invariant property is kept. Invariance makes it impossible to determine whether higher level features are in the right spatial configuration. The minimal internal structure of the representation spaces makes deep neural networks susceptible to changes in the input space.

G-CNNs use a representation space called a linear G-space, for some group G. In contrast to the linear space of regular neural networks, the representations of G-convolutions are associated with a pose, which corresponds to the transformations of some group G. This additional structure allows the data to be modeled more effectively. The filters of a G-CNN are able to detect feature constellations in every pose through an operation called G-convolution. In order to keep this structure of the representation space in the higher level layers of the network, the layers that map the different poses need to be structure preserving. Where most models are invariant to changes such as rotation, G-CNN layers are equivariant to these changes. This means that the structure of one layer carries over to the other layers, as shown in Equation 3.

$$\phi(T_g x) = T'_g\, \phi(x) \qquad (3)$$

Here x is the input, T_g transforms the input by the transformation g of group G, T'_g is the corresponding transformation of the output, and φ is the mapping performed by the layer. Equivariance means that transforming the input and then applying the mapping gives the same result as applying the mapping first and then transforming the output.

Invariance is a special kind of equivariance, namely the case where T'_g in Equation 3 is the identity transformation for every g: the transformation of the input then has no impact on the output, which makes the output invariant to that transformation. Transformation invariance loses information about the input, whereas equivariance maintains the transformation through the layers of the CNN. Knowing the instantiation of features is often more valuable than knowing only the presence of a feature. Equivariant CNNs constrain the symmetry transformations of the network in a way that aids generalization over the input. A non-injective network may map two instances of a face to a single output vector indicating the presence of a face; the G-transformed versions of the two instances are then also mapped to the same (transformed) output, so the symmetry transformations are preserved by the equivariant mapping.

4.2 Related work

A large number of studies work on invariant representations. One invariant approach is the pose normalization of Lowe [17]. This method uses a large database of features that are grouped together to match an object. Another invariant approach averages a nonlinear function over a number of similarities between objects (Reisert [18]; Skibbe and Reisert [22]).

Scattering CNNs use wavelet convolutions, non-linearities and group averaging to produce stable invariants (Bruna and Mallat [5]). To further improve scattering CNNs, small deformations of features are calculated to improve object and texture recognition (Sifre and Mallat [21]).

Sabour et al. [19] created a capsule network to extend CNNs with equivariant properties. The capsules in the network are able to find the instantiation parameters of the desired object features. A number of other equivariant models have appeared since: equivariant Boltzmann machines (Kivinen and Williams [12]), equivariant descriptors (Schmidt and Roth [20]) and equivariant filtering (Skibbe and Reisert [22]).

For deep convolutional neural networks, equivariance is a good inductive bias. To support this, Lenc and Vedaldi [16] show that AlexNet, created by Krizhevsky et al. [13], learns representations equivariant to flips, scaling and rotation when trained. Agrawal et al. [3] created an unsupervised CNN equivariant to ego-motion, which showed similar representations for the transformations.

Gens and Domingos [11] created symnets, deep symmetry networks that generalise the convolution process to the notion of groups. This architecture can map high-dimensional feature maps to the transformations of a group. By rotating feature maps, Dieleman et al. [9] use the rotation symmetry of CNNs to learn an equivariant representation of the feature map. Later this work was continued to implement the method on cyclic symmetries for various CNNs (Dieleman et al. [10]).

Cohen [6] uses a Bayesian conjugacy relation on a group of rotations and cyclic translations of an image. This produces an invariant-equivariant representation of the data while reducing the number of operators needed. Later this notion of disentangling was extrapolated to the concept of decorrelation (Cohen [8]).

4.3 Rotation group P4 as a function

In Section 2 we discussed what groups are; in this section we discuss the rotation group p4, which we will use to create the G-CNN for our research. Ideally we would like to create a network equivariant to all possible rotations and all possible translations, which is a complicated or even impossible task. Therefore, we consider the subgroup p4 of this group. The transformation group p4 consists of all possible translations and the rotations by multiples of 90 degrees around the center of a square. The transformation matrix of p4 is given in Equation 4, where r ranges from 0 to 3, representing the rotations by 90 degrees, and (u, v) represents all possible translations in Z².

$$g(r, u, v) = \begin{pmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{pmatrix} \qquad (4)$$

The composition law of group p4 is matrix multiplication. When p4 acts on the pixels in an image, the homogeneous coordinates of a pixel are multiplied by the transformation matrix of p4, as shown in Equation 5.

$$gx = \begin{pmatrix} \cos(r\pi/2) & -\sin(r\pi/2) & u \\ \sin(r\pi/2) & \cos(r\pi/2) & v \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \qquad (5)$$
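A short sketch (mine) of the p4 element of Equation 4 as a NumPy matrix: composition is plain matrix multiplication, and applying an element to the homogeneous coordinates of a pixel implements Equation 5.

```python
import numpy as np

def p4_element(r, u, v):
    # Rotation by r * 90 degrees around the origin followed by a translation (u, v),
    # written as a 3x3 matrix acting on homogeneous pixel coordinates.
    c, s = np.cos(r * np.pi / 2), np.sin(r * np.pi / 2)
    return np.array([[c, -s, u],
                     [s,  c, v],
                     [0,  0, 1]])

g = p4_element(r=1, u=2, v=0)       # rotate 90 degrees, then shift 2 pixels
h = p4_element(r=3, u=0, v=1)

gh = g @ h                          # composition law: matrix multiplication
pixel = np.array([5, 3, 1])         # homogeneous coordinates (u', v', 1)
print(np.round(g @ pixel))          # Equation 5: the transformed pixel coordinate
print(np.round(np.linalg.det(gh)))  # 1.0: the product is again a rigid motion
```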

In regular CNNs the convolution layer maps the input image to a K-dimensional feature map by a function f, f : Z² → R^K. The function f works on a rectangular plane and maps every pixel coordinate (p, q) ∈ Z² to a feature vector. The dimensionality of the feature map depends on the number of kernels applied to the input image, denoted by K.

Defining the convolution as a function extends its domain to infinity. While the image is still a finite array that is mapped to a finite feature map, for the function the pixel coordinates beyond the image can be seen as zero padding. The notion of a function simplifies the mathematical analysis of the convolution.


When using the symmetry groups for the convolution, our feature map is transformed. We denote the transformation of feature maps in Equation 6, where the transformation g acts on the function f of our convolution.

$$[L_g f](x) = [f \circ g^{-1}](x) = f(g^{-1}x) \qquad (6)$$

The points of the original feature map f are mapped to the transformed feature map L_g f by g. This means that point x of the transformed feature map can be found on the original feature map at point g⁻¹x. Let g be a simple translation t = (u, v) ∈ Z²; then g⁻¹x corresponds to x − t. This means that coordinates are shifted in the positive direction when transformed by a positive translation. The inverse of g ensures that the maps of two transformations g and h compose as a homomorphism (shown in Equation 7), even if the transformations do not commute.

$$L_{gh} = L_g L_h \qquad (7)$$

Regular feature maps of CNNs can be viewed as functions on the group Z²; G-CNNs instead create feature maps that are functions on the group G. The transformation of the feature maps of G-CNNs is the same as for regular CNNs, but the pixel coordinates denoted by x are replaced by an element of the group G denoted by h, as shown in Equation 8.

$$[L_g f](h) = [f \circ g^{-1}](h) = f(g^{-1}h) \qquad (8)$$

To visualize a p4 feature map generated from an image, Figure 11 (left) shows four patches corresponding to the four 90 degree rotations of our group. The coordinates of the feature map are indicated in the same way as the parameters of the transformation matrix shown in Equation 4. The rotation coordinate of each pixel in the feature map corresponds to a patch of the group.

As we will discuss later on, a p4 convolution layer often rotates this function by 90 degrees. As shown in Figure 11 (right), when the function is rotated the order of the patches also shifts by 1 (mod 4). On top of that, the patches themselves are rotated by 90 degrees. This rotation changes the result of the convolution considerably, since a different filter is now convolved with each patch of the input.

Figure 11: Representation of Z² to p4 (left), and a rotation of p4 by 90 degrees (right).
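A small sketch (mine, assuming the p4 feature map is stored as an array of shape [4, H, W] whose first axis indexes the four rotation patches) of this 90 degree rotation: every patch is rotated spatially and the rotation index is shifted by 1 (mod 4).

```python
import numpy as np

def rotate_p4_feature_map(fmap):
    # fmap: [4, H, W], one spatial map per 90-degree rotation of the group p4.
    # Rotating the whole feature map by 90 degrees rotates every patch spatially
    # and shifts the rotation index by 1 (mod 4).
    spatially_rotated = np.rot90(fmap, k=1, axes=(1, 2))
    return np.roll(spatially_rotated, shift=1, axis=0)

fmap = np.random.rand(4, 5, 5)
rotated = rotate_p4_feature_map(fmap)
print(rotated.shape)  # (4, 5, 5): same shape, patches permuted and rotated
```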

4.4 Equivariance in Convolution

Convolutional layers are equivariant to translations but not equivariant to rotations. In this section we mathematically demonstrate why this is the case. This will give us a good understanding of how to prove the equivariance of the G-convolution layers. In regular CNNs the convolution layer takes a stack of feature maps and convolves it with a set of convolution filters, as shown in Equation 9.

$$[f \star \psi^i](x) = \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y)\, \psi^i_k(y - x) \qquad (9)$$

When we add a translation t to the convolution, we substitute y with y + t. In Equation 10 we see that a translation followed by a convolution is the same as a convolution followed by a translation. This means that the convolution operation is equivariant to translations; in other words, convolution and translation commute.

$$\begin{aligned}
[[L_t f] \star \psi](x) &= \sum_{y} \sum_{k} f_k(y - t)\, \psi_k(y - x) \\
&= \sum_{y} \sum_{k} f_k(y)\, \psi_k(y + t - x) \\
&= \sum_{y} \sum_{k} f_k(y)\, \psi_k(y - (x - t)) \\
&= [L_t[f \star \psi]](x)
\end{aligned} \qquad (10)$$

Although convolution is equivariant to translations, it is not equivariant to other transformations such as rotations. As shown in Equation 11, rotating an image and then correlating it with a filter is not the same as first correlating and then rotating the result. Correlating a rotated image with a filter is actually the same as rotating the correlation of the original image with the inverse-rotated filter.

$$[[L_r f] \star \psi](x) = L_r[f \star [L_{r^{-1}}\psi]](x) \qquad (11)$$
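A numerical check (mine; circular padding and circular shifts are used so that boundary effects do not spoil the comparison) of both claims: a standard convolution layer commutes with translations but not with rotations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
x = torch.randn(1, 1, 8, 8)

translate = lambda z: torch.roll(z, shifts=(2, 3), dims=(-2, -1))  # L_t
rotate    = lambda z: torch.rot90(z, k=1, dims=(-2, -1))           # L_r

# Translation equivariance: convolving the shifted input equals shifting the output.
print(torch.allclose(conv(translate(x)), translate(conv(x)), atol=1e-6))  # True

# No rotation equivariance: convolving the rotated input differs from rotating the output.
print(torch.allclose(conv(rotate(x)), rotate(conv(x)), atol=1e-6))        # False (in general)
```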

4.5 G-CNN convolution layer

The convolution layers of regular CNNs shift a filter over the input to create feature maps that are used as the input of the next layer. In the G-convolution, the shift is replaced by the more general transformations of a group G. The generalized convolution of the first layer of a G-CNN is shown in Equation 12.

$$[f \star \psi^i](g) = \sum_{y \in \mathbb{Z}^2} \sum_{k=1}^{K^l} f_k(y)\, \psi^i_k(g^{-1}y) \qquad (12)$$

In the first layer, the input of our convolution layer is the image, so the convolution is performed on the image plane Z². This creates a feature map that is a function on our group G. In the higher convolution layers of the network the input consists of these feature maps, so, as shown in Equation 13, our convolution changes to operate on functions on G.

$$[f \star \psi^i](g) = \sum_{h \in G} \sum_{k=1}^{K^l} f_k(h)\, \psi^i_k(g^{-1}h) \qquad (13)$$

To demonstrate the equivariance of the G-convolution, we apply a transformation u and derive the result in the same way as in Equation 10. We see that a transformation followed by the convolution is the same as the convolution followed by the transformation (Equation 14, where we substitute h with uh).

$$\begin{aligned}
[[L_u f] \star \psi](g) &= \sum_{h \in G} \sum_{k} f_k(u^{-1}h)\, \psi_k(g^{-1}h) \\
&= \sum_{h \in G} \sum_{k} f_k(h)\, \psi_k(g^{-1}uh) \\
&= \sum_{h \in G} \sum_{k} f_k(h)\, \psi_k((u^{-1}g)^{-1}h) \\
&= [L_u[f \star \psi]](g)
\end{aligned} \qquad (14)$$

Because transforming a small filter is more efficient than transforming a large feature map, the filter transformation is the form used to compute f ⋆ ψ in the G-convolution.

5 Implementation

For our research we created a fully connected neural network, a CNN and a G-CNN with PyTorch. The fully connected neural network is a reference model without any use of symmetry groups. For our CNN we use a very basic architecture with three convolutional layers. For our implementation of the G-convolution we follow the steps of Cohen and Welling [7]. All our models are trained on a DAS-4 server [4].

The G-CNN structure allows for larger feature maps. This in turn creates more trainable parameters in a G-CNN model compared to a regular CNN with an equal number of input units, meaning that the complexity of our models would be unequal. Since we want to demonstrate the benefit of symmetry groups in CNNs and not the perks of having more trainable parameters, we reduce the number of input units for the G-convolution layers and increase the number of input units for the layers in the fully connected neural network.

5.1 G-CNN implementation

In group theory, a group is called a split group when the transformations of the group G can be decomposed into a translation and another transformation that leaves the origin unchanged [7]. In the group p4 we can split each element of G into a translation t and a rotation r. Using the homomorphism property of our group (Equation 7), we can rewrite the G-convolution of Equations 12 and 13 into the form shown in Equation 15, where X is the group Z² in the first layer and the group G in the other layers.

$$[f \star \psi^i](tr) = \sum_{h \in X} \sum_{k=1}^{K^l} f_k(h)\, L_t[L_r \psi^i_k(h)] \qquad (15)$$

The cost of the filter transformation L_r ψ_k is negligible compared to the rest of the convolution. By first computing the filter transformations for all four rotations, we ensure a computation of our model that is about as fast as a regular convolution.

The pseudo-code of the G-convolution in the first layer is given by Equation 16; note that this layer takes a regular image as input. In Equation 17 we see the pseudo-code for the other G-convolution layers.

ConvZP4(f, K) = squeeze(conv2D(f, transform(K)))  (16)

ConvP4P4(f, K) = squeeze(conv2D(transform(f), transform(K)))  (17)

The filters of layer l are stored in an array of shape K^l × S^{l−1} × K^{l−1} × n × n, where K^l is the number of output channels and K^{l−1} the number of input channels. S^{l−1} is the number of transformations in the input group: in the first layer the input lives on Z², so S^{l−1} = 1, while in all other layers the input lives on the group p4, so S^{l−1} = 4. Finally, n × n is the spatial size of a filter.
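A minimal PyTorch sketch (mine, not the code used in the thesis; it assumes odd filter sizes and 'same' padding) of the Z² → p4 lifting convolution of Equation 16: the four rotated copies of every filter are stacked and a single conv2d call produces the four rotation channels of the p4 feature map.

```python
import torch
import torch.nn.functional as F

def transform_filters(weight):
    # weight: [K_out, K_in, n, n] -> [4 * K_out, K_in, n, n],
    # stacking the filters rotated by 0, 90, 180 and 270 degrees.
    return torch.cat([torch.rot90(weight, k=r, dims=(-2, -1)) for r in range(4)], dim=0)

def conv_z2_p4(f, weight):
    # f: [B, K_in, H, W] -> p4 feature map of shape [B, K_out, 4, H, W].
    out = F.conv2d(f, transform_filters(weight), padding=weight.shape[-1] // 2)
    b, _, h, w = out.shape
    return out.view(b, 4, -1, h, w).transpose(1, 2)

f = torch.randn(8, 1, 28, 28)       # a batch of single-channel images
weight = torch.randn(16, 1, 3, 3)   # 16 learnable 3x3 filters
print(conv_z2_p4(f, weight).shape)  # torch.Size([8, 16, 4, 28, 28])
```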

5.2 Generated datasets

The MNIST dataset is one of the most popular datasets for experimental computer vision. Yet most models perform very well on this dataset, to the point that improvements become negligible. For our experiments we therefore transform the MNIST dataset to make it more challenging.

The MNIST dataset contains images of size 28 × 28 where the handwritten digit takes up most of the image. To make sure our transformations do not move the digit out of the image, we pad all images to size 56 × 56. Our first dataset, on which we train our models, is the Padded MNIST dataset, where we only pad the MNIST images to the desired size. Our second dataset is the Translated MNIST dataset, where we pad the images and also translate them by up to 25% of the width and height of the image. Our third dataset is the Rotated MNIST dataset, where we pad and translate the images and additionally rotate them by up to 180 degrees around the center. The datasets are shown in Figure 12 and summarised in Table 1.
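A sketch (mine; the exact transform parameters used in the thesis are not given, so the values below simply mirror Table 1) of how such datasets can be generated with torchvision:

```python
from torchvision import datasets, transforms

# Pad the 28x28 MNIST digits to 56x56 first, then (optionally) translate and rotate them.
padded     = transforms.Compose([transforms.Pad(14), transforms.ToTensor()])
translated = transforms.Compose([transforms.Pad(14),
                                 transforms.RandomAffine(degrees=0, translate=(0.25, 0.25)),
                                 transforms.ToTensor()])
rotated    = transforms.Compose([transforms.Pad(14),
                                 transforms.RandomAffine(degrees=180, translate=(0.25, 0.25)),
                                 transforms.ToTensor()])

train_set = datasets.MNIST('data', train=True, download=True, transform=rotated)
image, label = train_set[0]
print(image.shape, label)  # torch.Size([1, 56, 56]) and the digit class
```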

Data set            Padding, px   Translation, %   Rotation, °
Padded MNIST        14            0                0
Translated MNIST    14            25               0
Rotated MNIST       14            25               180

Table 1: Dataset properties.

Figure 12: Top left: regular MNIST; top right: Padded MNIST; bottom left: Translated MNIST; bottom right: Rotated MNIST.

6 Results

We tested our three models on all three of our datasets six times. Each time we trained a model, we looped over the entire training dataset 60 times. All the error rates of our models are shown in Table 2.

Our fully connected model scored increasingly worse on the harder datasets. When the model was trained on the Padded MNIST dataset, the error was around 3.6%, which is not very accurate. For our Rotated MNIST dataset the result is close to an educated guess. Our CNN model scores an error of around 1% on the Padded MNIST and Translated MNIST datasets. When the model is trained on the Rotated MNIST dataset, the error increases to around 8.4%. We trained our G-CNN in the same way as our CNN to keep the results comparable: for every dataset, training six times and looping over the entire dataset 60 times for each training of the model. Our G-CNN varied more between the Padded MNIST and Translated MNIST datasets but still scored an error of around 1%. The G-CNN scored an improved accuracy on the Rotated MNIST dataset, with the error dropping to around 5.1%.

Model             Padded      Translated   Rotated      # Params
Fully connected   3.6 ± 0.1   28.5 ± 0.5   68.0 ± 0.8   440 K
CNN               0.9 ± 0.1   1.0 ± 0.1    8.4 ± 0.3    441 K
G-CNN             0.9 ± 0.1   1.3 ± 0.1    5.1 ± 0.1    442 K

Table 2: Test error rates (%) of the three models on the three datasets.


7 Conclusion

From our results we can see that models which use symmetry groups score better on the dataset transformed by that group. Our fully connected model uses no symmetries. Therefore, when it is trained on a transformed dataset, the accuracy drops significantly, to the point of educated guesses. This is an expected result, because the fully connected model is unable to find the correlation between a digit of the training data and a digit of the test data. The fully connected model can only correctly classify digits that are very similar to the data it is trained on.

Our CNN model is equivariant to all translations of the input. The translation symmetry between the Padded MNIST and Translated MNIST datasets is captured within our convolution architecture. This makes a CNN able to correctly classify digits translated differently than in the training data. From our results we can see that translating the input does not significantly affect the accuracy of our model. We can clearly see the difference with the result of our fully connected model, where the results differ strongly between the two datasets. However, our CNN model is not equivariant to rotations. For our Rotated MNIST dataset, the rotations are not in the symmetry group of the convolution. Therefore the CNN is unable to relate these rotations to the training data if they differ too much from it. From our results it is clear that a translated and rotated version of our input greatly reduces our accuracy. Yet compared to our fully connected model, the accuracy is still a large improvement.

The results of our G-CNN are an improvement over those of our regular CNN, yet the accuracy is still reduced when the input is transformed. Because the translation group Z² is a subgroup of the group p4 used for our G-convolution, our results show that translating the input does not significantly reduce our accuracy on the Translated MNIST dataset compared to the Padded MNIST dataset. Comparing our results on the Rotated MNIST dataset, we can see that our rotation equivariant model is significantly more accurate than the models that are not equivariant to rotations. Yet there is still a decrease in performance compared to the Padded MNIST and Translated MNIST datasets.

From our results we can see that models which use these symmetries are more accurate in classifying digits of the transformed input. The transformations can be captured and applied to data that has not been seen before. This shows that equivariant models are able to generalise transformations of the input. The more symmetries that can be captured by these models, the better the model will be at classifying data it has never seen before.

8 Discussion

Our experiments have shown that equivariant models perform better on transformed datasets than models that are not equivariant to these transformations. Yet this has only been shown for transformed versions of the MNIST dataset. To what degree the same results are obtained with different datasets is still to be discovered. Another popular dataset is the CIFAR-10 dataset, which consists of a large number of pictures ranging from animals to transport vehicles. To compare the results, the same training conditions must be met. However, exploring different conditions for our experiments could have produced different results. Suggestions for further experiments would be to try more training time or a larger number of trainable parameters for our models.

Increasing the number of symmetries used in our model seems to positively impact the accuracy. Therefore, substituting CNNs with a G-CNN whose group consists of all possible symmetry transformations could be beneficial. Yet for deep neural networks it has been shown that there is a limit to the increase in accuracy from deeper models [24]. Further research could show whether there is a limit to the number of symmetries in the group that still increases the accuracy of a model.

One of the limitations of G-CNNs is that they only operate on discrete groups. In our Rotated MNIST dataset we used all rotations up to 180 degrees around the center. This could explain why we still found an accuracy drop compared to the Padded MNIST dataset. To generalise rotations, the implementation should use a continuous group of all rotations around the center, though this might not be feasible. One alternative could be to create a group consisting of a larger number of rotations, to see whether there is a correlation between the number of elements in our rotation group and the decrease in accuracy.

Sabour et al. [19] proposed another solution for creating equivariant networks, in the form of capsule networks. By using dynamic routing and a method of agreement, equivariant features can be passed to higher layers in the network. A combination of G-convolution and dynamic routing could further improve the accuracy of our models. For future networks this possibility would be worth exploring.

References

[1] Batch, mini batch, stochastic gradient descent.

[2] An introduction to different types of convolutions in deep learning. 7 Feb. 2018.

[3] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. 2015-05-07.

[4] Henri Bal, Dick Epema, Cees de Laat, Rob van Nieuwpoort, John Romein, Frank Seinstra, Cees Snoek, and Harry Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. Computer, 49(5):54–63, 2016-05. ISSN 0018-9162.

[5] Joan Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013-08. ISSN 0162-8828.

[6] T. Cohen. Learning the irreducible representations of commutative lie groups. 2014. ISSN 1938-7288.


[7] Taco S. Cohen and Max Welling. Group equivariant convolutional networks. 2016-02-24.

[8] T.S. Cohen. Transformation properties of learned visual representations. 2015.

[9] Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 450(2):1441–1459, 2015-04-25. ISSN 0035-8711.

[10] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. 2016-02-08.

[11] Robert Gens and Pedro M Domingos. Deep symmetry networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2537–2545. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5424-deep-symmetry-networks.pdf.

[12] J.J. Kivinen and C.K.I. Williams. Transformation equivariant Boltzmann machines. Volume 6791, pages 1–9, 2011. ISBN 9783642217340.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017-05-24. ISSN 00010782.

[14] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, volume 227 of ICML '07, pages 473–480. ACM, 2007-06-20. ISBN 9781595937933.

[15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[16] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. 2014-11-21.

[17] David Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004-11. ISSN 0920-5691.

[18] Marco Reisert. Group integration techniques in pattern analysis–a kernel view. 2008.

[19] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. 2017-10-26.

[20] U. Schmidt and S. Roth. Learning rotation-aware features: From invariant priors to equivariant descriptors. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2050–2057. IEEE, 2012-06. ISBN 9781467312264.


[21] Laurent Sifre and Stephane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1233–1240. IEEE, 2013-06. ISBN 9780769549897.

[22] Henrik Skibbe and Marco Reisert. Rotation covariant image processing for biomedical applications. Computational and Mathematical Methods in Medicine, 2013:19, 2013. ISSN 1748-670X.

[23] Harisubramanyabalaji Subramani Palanisamy. Risk assessment based data augmentation for robust image classification: using convolutional neural network, 2018.

[24] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Gener-alizing residual architectures. 2016-03-25.

[25] Dimitri D Vvedensky and Tim Evans. Group theory, 2005.

[26] Orhan Gazi Yal. Image classification in 10 minutes with mnist dataset. 5 Sept. 2019.
