
MSc Artificial Intelligence Master Thesis

Unsupervised Image Classification And Hashing With Binary Representations

by

Vasileios Charatsidis

12148288

February 16, 2021

48 01/01/2020 - 30/01/2021

Supervisor:

Victor Garcia Satorras
Dr Priyank Jaini

Assessor:

Dr Stratis Gavves


Abstract

Supervised learning has proven a solid methodology for tasks like image classification. However, the labeling of the data that is required for supervised learning is a laborious and costly task. As a result, self-supervised systems that use contrastive learning have become increasingly popular, since they do not need labeled data for training. Most of the research so far focuses on contrastive learning with positive pairs of length two, while the end-to-end clustering approaches use the mutual information between the class assignments of the positive pair. In an attempt to use positive pairs of length greater than two, we simplify the end-to-end clustering approaches to be based only on the probability agreement of the soft class assignments of the positive pair.

In the unsupervised representation learning approach, where it is attempted to map images into a lower-dimensional embedding space, most of the research uses variants of the InfoNCE loss to learn dense float representations. We devised a family of loss functions that assigns multiple binary class assignments, producing sparse binary representations. Mapping images to binary strings is a long-standing problem in image retrieval, since sparse binary representations can be used for fast image retrieval with the Hamming distance.

The main contribution of this paper is a novel family of loss functions based on the concept of probability agreement and contrastive learning. Besides very competitive results in unsupervised representation learning (and state-of-the-art results for the SVHN data-set), we show that one of the above loss functions can further be used in supervised learning, achieving better results than the standard multi-class cross-entropy loss on the CIFAR-100 and SVHN data-sets, with 3% and 0.5% improvements in accuracy, respectively. Finally, we show that the binary representations achieve state-of-the-art results in unsupervised image hashing on the CIFAR-10 data-set, improving the mean average precision @1000 and top-1 precision by 10.5% and 8%, respectively.

Contents

1 Introduction
2 Relevant Background
3 Relevant work
4 Methods for unsupervised end-to-end clustering through probability agreement
  4.1 Product-Agreement loss (PA)
  4.2 Penalized-Product-Agreement loss (PPA)
5 Methods for unsupervised sparse binary representation learning
  5.1 Binary-Contrastive loss function (BC)
  5.2 Binary-Penalty loss function (BP)
6 Augmentations strategy
  6.1 Limitations of contrastive learning
7 Experiments
  7.1 Experiments with unsupervised end-to-end clustering
    7.1.1 Product-Agreement Experiments
    7.1.2 Penalized-Product-Agreement Experiments
    7.1.3 Comparing PA and PPA losses with other methods
  7.2 Experiments for unsupervised representation learning
    7.2.1 Experimenting with varying number of epochs
    7.2.2 Experimenting with sparsity
    7.2.3 Experimenting with the length of the binary vector
    7.2.4 Binary-Contrastive and Binary-Penalty performance compared to other methods
    7.2.5 Experimenting with the Binary-Contrastive queue
    7.2.6 Binary-Contrastive vs Multi-class-cross-entropy
  7.3 Image Retrieval Experiments
  7.4 Transfer learning for Binary-Contrastive
8 Conclusion

1 Introduction

An image is worth a thousand words; as the saying suggests, images carry a lot of information. The useful information of an image contains patterns that are attributed uniquely to a specific class of objects, while the noise information consists of other emerging patterns that are common to semantically non-meaningful classes. Examples of noisy information are the pose and rotation of the depicted object, or the brightness and the background of the image.

Supervised systems learn through labeled examples which patterns are uniquely attributed to semantically meaningful classes. For every object there are several structural patterns, invariant to the conditions under which the image was taken, that reveal the identity of the object. For example, an airplane always has wings and a car always has wheels.

However, acquiring labels is a costly and laborious task, and avoiding it is a long-standing problem called unsupervised learning: learning without the need for labels. In unsupervised learning there are two main approaches: unsupervised representation learning and unsupervised end-to-end clustering.

Considering unsupervised representation learning, again, there are two mainstream approaches: generative and discriminative. Generative approaches model pixels, which is computationally expensive and learns representations that might carry redundant information. Discriminative approaches use a pretext task that forces a neural network to learn low-dimensional representations that can be used for other tasks.

In unsupervised end-to-end clustering the network produces class assignments that can be used straightforwardly to classify the images without extra processing.

Both of the above approaches utilize a very prominent technique called contrastive representation learning, or contrastive learning, a paradigm shift from architecture to data engineering. In contrastive learning the system learns by comparing images. Initially, different augmentations of the same image are created, called positive pairs. The positive pairs, so far in the literature, consist of just 2 images; we define this as length 2. The augmentations are crafted such that they include variance in the viewing conditions, for example rotating, scaling, horizontal flipping, and changing the color and brightness of the image. The system must then produce similar representations for the positive pairs and thus learn not to pay attention to noise. Lately many successful systems, as we will present in the related work section, use contrastive learning both for unsupervised representation learning and for unsupervised end-to-end clustering.

Considering unsupervised end-to-end clustering, many of the methods use the joint distribution between the class assignments of the positive pairs. We propose two loss functions that need fewer computations, utilizing the agreement of the class assignments, which is just the diagonal of their joint distribution, with the ability to use positive pairs of length greater than two.

In representation learning most of the research so far uses variations of the InfoNCE loss [14] and positive pairs of length two. Furthermore, all the recently proposed methods are based on dense float representations. We propose a novel loss function that performs contrastive learning differently, producing sparse binary representations, learned in a supervised, unsupervised or a mixed manner. Sparse binary representations are considered to be the data structure of the brain, since in the brain only a few neurons fire at a time. Biological neurons either fire or not, in contrast with artificial synapses that are activated in a continuous way from 0 up to a number. Furthermore, previous research [1] has demonstrated that sparse binary representations can be more robust to noise than dense representations. In addition, sparse binary representations have some nice properties that make them practical for solving real-life problems like image retrieval from a large database. With high compactness and efficient bit-wise calculation, it is easier to compare images using the Hamming distance (XOR them and then count the remaining ones) in a sparse binary representation than in the traditional floating-point representation. Additionally, they can be stored and transferred easily, since we can save the array with only the indexes that point to "1" elements. As a matter of fact, it is possible to learn very big representations since their storage space grows sub-linearly. For example, a vector of length 65536 with sparsity of 99.6% needs only 262 integers to be stored.
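To make the storage and comparison argument concrete, the following minimal sketch (not part of the thesis code) stores a sparse binary vector as the indices of its "1" elements and compares two vectors with the Hamming distance; the vector length and sparsity level are chosen only to match the example above.

```python
import numpy as np

def to_indices(binary_vec):
    # Store only the positions of the "1" elements of a sparse binary vector.
    return np.flatnonzero(binary_vec)

def hamming_from_indices(idx_a, idx_b):
    # Hamming distance of two binary vectors stored as index sets:
    # count the positions set in exactly one of the two vectors.
    return len(np.setxor1d(idx_a, idx_b))

rng = np.random.default_rng(0)
length, ones = 65536, 262                      # ~99.6% sparsity, as in the example above
a = np.zeros(length, dtype=np.uint8)
b = np.zeros(length, dtype=np.uint8)
a[rng.choice(length, ones, replace=False)] = 1
b[rng.choice(length, ones, replace=False)] = 1

idx_a, idx_b = to_indices(a), to_indices(b)
# The dense XOR-and-count computation agrees with the index-based one.
assert hamming_from_indices(idx_a, idx_b) == int(np.logical_xor(a, b).sum())
print(len(idx_a), hamming_from_indices(idx_a, idx_b))
```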

Last but not least, we show how the Binary-Contrastive loss function, as we call it, outperforms multi-class cross-entropy in supervised learning on 2 data-sets.

More concretely our contributions are:

1. Two novel loss functions, called Probability-Agreement (PA) and Penalized-Probability-Agreement (PPA), that achieve competitive results in unsupervised end-to-end clustering on the CIFAR-100, CIFAR-10 and STL-10 data-sets by contrasting positive pairs of length greater than 2 (more than two augmentations of the same image at the same time).

2. Two novel loss functions, called Binary-Contrastive (BC) and Binary-Penalty (BP), that both create highly transferable dense float and sparse binary representations in an unsupervised manner by contrasting positive pairs of length greater than 2. They both demonstrate competitive accuracy in unsupervised representation learning, evaluated by the linear protocol, on the CIFAR-100 and CIFAR-10 data-sets, while the Binary-Contrastive achieves state-of-the-art results on the SVHN data-set, improving by 1.5% over the second-best method. Moreover, it is worth noticing that the Binary-Penalty operates without the need for negative sampling, a technique that many other methods in the literature require [3], [4].

3. As a side effect we show that the Binary-Contrastive, which can operate in an unsupervised, supervised or mixed manner, surpasses the multi-class cross-entropy loss in supervised learning by a margin of 3% accuracy on the CIFAR-100 data-set and 0.5% accuracy on the SVHN data-set.


4. Finally, as a side effect, the binary representations learned by the Binary-Contrastive loss achieve state-of-the-art results in unsupervised image hashing, improving mean average precision and top-1 precision by the large margins of 12% and 8%, respectively, on the CIFAR-10 data-set.

2 Relevant Background

Recently, the most competitive methods for unsupervised learning have been using self-supervised contrastive learning. These methods use neural networks to learn a low-dimensional embedding of data by a “contrastive” loss which pushes apart different images (negative pairs) while pulling together different views of the same image (positive pairs).

A very influential contrastive loss to learn representations in an unsupervised manner is InfoNCE [14], which maximizes a lower bound on the mutual information between the representations of two images of a positive pair. In practice, given some images x_1, ..., x_n, their augmentations x_{1a}, ..., x_{na}, and a given distance function f, the InfoNCE loss is optimized to create image representations that have a minimum distance for the positive pairs and a maximum distance for N negative pairs. For the positive pair x_1, x_{1a} the InfoNCE becomes:

$$L_{NCE:x_1,x_{1a}} = -\mathbb{E}\left[\log\frac{e^{f(x_1, x_{1a})}}{\sum_{j=1}^{N} e^{f(x_1, x_j)}}\right]$$

Where N is the batch size and the expectation is with respect to the batch. One of the design choices in contrastive learning is the selection of the positive and negative pairs. The standard approach for generating positive pairs without additional annotations is image augmentation. Negative pairs can either be generated using the sampled batch images, as in SimCLR [3], or be maintained in a queue, thus decoupling the batch size from the number of negatives, as in MoCo [4].
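As a rough illustration, a minimal PyTorch sketch of an InfoNCE-style objective, assuming the score f is a dot product of L2-normalized embeddings scaled by a temperature; the encoder, batch shapes and temperature are placeholders, not the setup of [14].

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_aug, temperature=0.1):
    """z, z_aug: [N, D] embeddings of N images and of their augmentations.
    Row i of z is positive with row i of z_aug; the other rows act as negatives."""
    z = F.normalize(z, dim=1)
    z_aug = F.normalize(z_aug, dim=1)
    logits = z @ z_aug.t() / temperature                # pairwise scores f(x_i, x_j)
    targets = torch.arange(z.size(0), device=z.device)  # the positive is on the diagonal
    # Cross-entropy with the diagonal as target equals
    # -E[ log exp(f(x_i, x_ia)) / sum_j exp(f(x_i, x_j)) ].
    return F.cross_entropy(logits, targets)

# usage with random embeddings standing in for an encoder output
print(info_nce(torch.randn(8, 128), torch.randn(8, 128)).item())
```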

MoCo

Momentum Contrast for unsupervised visual representation learning [4] is another successful instance discrimination technique. It characterizes contrastive learning as a dictionary lookup task. When we want to identify whether two images are augmentations of the same image, we present one image as a query and the other as a key. Furthermore, we present a set of other images (keys) that serve as negative samples, and the system has to find the correct key that matches the query. It therefore encodes the query and the keys and computes the pairwise similarities. A schematic of the architecture is shown in Figure 1.

The key idea of MoCo is that it maintains a buffer of previous embeddings as negative samples, decoupling the negative sampling from the batch-size thus allowing smaller batch-sizes. For that buffer to function well the encoder of the keys (negative samples) needs to be a historical average version of the encoders used, since if the current encoder is used then the previous embeddings are not relevant anymore.

Figure 1: Momentum Contrast [4]. The τ parameter in the equation is a temperature parameter.
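The two MoCo ingredients described above, a momentum-averaged key encoder and a fixed-size queue of past keys, can be sketched as follows; the momentum value and queue size are arbitrary placeholders and this is only an illustration of the idea in [4], not its reference code.

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder is kept as a slowly moving average of the query encoder,
    # so embeddings stored in the queue stay comparable across iterations.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue_dequeue(queue, new_keys, max_size=4096):
    # Append the newest keys and drop the oldest so the number of negatives
    # stays fixed and decoupled from the batch size.
    queue = torch.cat([queue, new_keys.detach()], dim=0)
    return queue[-max_size:]
```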

Recent methods belonging to the contrastive representation learning family use different ways to construct z_i and z_j. CPC [14] and CPCv2 [6] leverage spatio-temporal co-occurrence. MoCo [4] and SimCLR [3] apply data augmentation to the same image to obtain positive pairs, while CMC [16] employs natural image channels.

SeLa

Another interesting approach with very competitive results is "self-labelling via simultaneous clustering and representation learning", or SeLa [2]. The method maximizes the information between labels and input data indices. The standard cross-entropy loss is transformed into a minimization of the optimal transport problem, which is solved using a fast variant of the Sinkhorn-Knopp algorithm. The method can self-label millions of images and learn competitive representations.

BYOL

Another new approach is Bootstrap Your Own Latent (BYOL) [5]. BYOL uses two neural networks, referred to as the online and target networks. It starts from an augmented view of an image and trains its online network to predict the target network's representation of another augmented view of the same image.

In practice, BYOL generalizes this bootstrapping procedure by iteratively refining its representation, using a slowly moving exponential average of the online network as the target network instead of fixed checkpoints. While fixed target networks are more common in deep RL, BYOL uses a weighted moving average of previous networks in order to provide smoother changes in the target representation. The loss function is defined as the mean squared error between the normalized predictions and target projections.

DRC

Deep Robust Clustering [18], which reports state-of-the-art results in unsupervised end-to-end clustering on the CIFAR-100, CIFAR-10 and STL-10 data-sets, is a combination of methods. It uses both a contrastive loss on representations and a maximization of the mutual information between the class assignments of each pair. Their loss function consists of three parts, L_{AF}, L_{AP} and L_{CR}.

$$L_{AF} = -\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{e^{z_i^\top z_i'/T}}{\sum_{j=1}^{N} e^{z_i^\top z_j'/T}}\right)$$

The contrastive loss of image representations, where z_i and z_i' are representations of augmentations of the same image produced by a neural network and N is the batch size.

$$L_{AP} = -\frac{1}{K}\sum_{i=1}^{K}\log\left(\frac{e^{q_i^\top q_i'/T}}{\sum_{j=1}^{K} e^{q_i^\top q_j'/T}}\right) \quad (1)$$

The contrastive loss of class assignments, where q_i and q_i' are the cluster assignments for two augmentations of the same image and K is the number of clusters.

$$L_{CR} = \frac{1}{N}\sum_{i=1}^{K}\left(\sum_{j=1}^{N} z_i(j)\right)^2 \quad (2)$$

And a cluster regularization term to avoid degenerate solutions where all the images are placed in the same cluster.

3 Relevant work

In this section we will present the methods that inspired ours for unsupervised end-to-end clustering and representation learning.

IIC

IIC was the inspiration for our PA method. In an attempt to simplify the IIC loss function such that it trains faster and can process more than 2 augmentations of the same image at the same time, we came up with our first loss function, Product-Agreement.

"Invariant Information clustering" or IIC [9] has as objective the maximization of the mutual information between the class assignments of each positive pair. The mutual information between X and Y , with joint density p(x, y) and marginal densities p(x) and p(y), is defined as the Kullback–Leibler divergence between the joint and the product of the marginals:

$$I(x, y) = D_{KL}\big(p(x, y)\,\|\,p(x)\cdot p(y)\big) \quad (3)$$

$$= \mathbb{E}_{p(x,y)}\left[\log\frac{p(x, y)}{p(x)\cdot p(y)}\right] \quad (4)$$

A more concrete explanation of how IIC works is the following. We create positive pairs x_i, x_j with different augmentations and pass them through the encoder Φ; instead of representations we get a class probability vector over z ∈ {1, . . . , C}, where C is the number of classes. We attain the C × C matrix P by marginalizing over the batch, where each element P_{rc} at row r and column c is the mean probability that the first augmentation is of class r and the second augmentation is of class c.

$$P = \frac{1}{N}\sum_{i=1}^{N}\Phi(x_r)\cdot\Phi(x_c)^\top \quad (5)$$

Where N is the batch size. Then the objective function is:

$$I(z_r, z_c) = \sum_{r=1}^{C}\sum_{c=1}^{C} P_{rc}\cdot\log\frac{P_{rc}}{P_r\cdot P_c} \quad (6)$$

Although we do not use negative sampling, degenerate solutions (where all images are predicted to be in one and only one class) are avoided because the mutual information expands to I(z_i, z_j) = H(z_i) − H(z_i | z_j). Maximizing this quantity trades off minimizing the conditional cluster assignment entropy H(z_i | z_j), which dictates that 2 augmentations of the same image must be in the same class, and maximizing the individual cluster assignment entropy H(z), which forces the encoder to give predictions for every class available, thus avoiding degenerate solutions. The smallest value of H(z_i | z_j) is 0, obtained when the cluster assignments are exactly predictable from each other. The largest value of H(z) is ln C, obtained when all clusters are equally likely to be picked. This occurs when the data is assigned evenly between the clusters.
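A short PyTorch sketch of the IIC objective described above, building the C × C joint matrix over a batch and computing the mutual information of Equation (6); the symmetrization of P is a common implementation detail and is included here as an assumption.

```python
import torch

def iic_loss(p_r, p_c, eps=1e-8):
    """p_r, p_c: [N, C] soft class assignments of two augmentations of the same images."""
    N, C = p_r.shape
    P = (p_r.unsqueeze(2) * p_c.unsqueeze(1)).sum(dim=0) / N   # C x C joint matrix, Eq. (5)
    P = (P + P.t()) / 2                                        # symmetrize (assumption)
    Pr = P.sum(dim=1, keepdim=True)                            # marginal over rows
    Pc = P.sum(dim=0, keepdim=True)                            # marginal over columns
    mi = (P * (torch.log(P + eps) - torch.log(Pr + eps) - torch.log(Pc + eps))).sum()
    return -mi                                                 # maximize MI = minimize -MI

p_r = torch.softmax(torch.randn(16, 10), dim=1)
p_c = torch.softmax(torch.randn(16, 10), dim=1)
print(iic_loss(p_r, p_c).item())
```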

SimCLR

SimCLR or "A Simple Framework for Contrastive learning of Visual representations" [3] is another recent technique which achieves very good results. Although our unsupervised representation learning method, Binary-Contrastive, has many differences with SimCLR it was inspired by it and uses a similar pipeline, shown in the Figure 2, which differentiates mainly in the loss function.

Informally, we randomly sample a batch of images. From every image we derive a pair of 2 different augmentations. The augmentations are chosen randomly from an augmentation pool. For every pair of augmentations we extract their representations (x_i, x_j) using a neural network. This pair of representations is called a positive pair since it is derived from the same source image. The representations pass through a projection head, which is a multi-layer perceptron that maps representations to the space where the contrastive loss is applied (z_i, z_j). Then the Normalized Temperature-Scaled Cross Entropy Loss (NT-Xent) for a positive pair is defined as:

$$\ell_{i,j} = -\log\frac{\exp(\text{cosine-similarity}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N}\mathbb{1}_{[k\neq i]}\exp(\text{cosine-similarity}(z_i, z_k)/\tau)} \quad (7)$$

Where τ denotes a temperature parameter and N the batch size. Minimizing the above loss is equivalent to maximizing a lower bound on the mutual information between z_i and z_j:

$$I(z_i; z_j) \geq \log(k) - L_{\text{NT-Xent}}$$


Figure 2: Schematic representation of the above explained mechanism from simCLR [3].
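For completeness, a compact PyTorch version of the NT-Xent loss of Equation (7), assuming the 2N projections of a batch are stacked so that rows i and i + N are the two views of the same image; this follows the published formulation rather than the exact SimCLR reference code.

```python
import torch
import torch.nn.functional as F

def nt_xent(z, temperature=0.5):
    """z: [2N, D] projections; rows i and i + N are the two views of image i."""
    z = F.normalize(z, dim=1)
    n2 = z.size(0)
    sim = z @ z.t() / temperature                  # cosine similarities divided by tau
    sim.fill_diagonal_(float('-inf'))              # implements the indicator 1[k != i]
    targets = torch.arange(n2, device=z.device).roll(n2 // 2)  # positive of i is i +/- N
    return F.cross_entropy(sim, targets)

print(nt_xent(torch.randn(16, 64)).item())         # 2N = 16, i.e. a batch of 8 images
```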

4 Methods for unsupervised end-to-end clustering through probability agreement

In this section we will present two new loss functions, Probability-Agreement and Penalized-Probability-Agreement, that are based on the probability agreement of the class assignments of a positive pair.

Our first method has as input an image and as output a class assignment vector in [0, 1]^C. The main concept is to force the same class assignment for different augmentations of the same image (positive pairs), while enforcing predictions for every class to avoid degenerate solutions, where all images would be assigned to the same class.

4.1 Product-Agreement loss (PA)

Properties of Product-Agreement loss

The Product-Agreement loss has similar properties to IIC [9] and does 2 things. It minimizes the entropy of the individual predictions, such that they look like a one-hot vector, pointing to one class rather than being scattered probabilities across all possible classes. It maximizes the entropy of the batch-mean prediction such that it looks like scattered probabilities across all classes. Ideally, the batch-mean prediction will equal the prior probabilities of the classes. The agreement is maximized when the individual predictions of a positive pair are one-hot-like and point at the same class, while there are predictions for all possible classes.

Benefits of Product-Agreement loss

While other systems like IIC [9] and DRC [18] maximize mutual information between positive pairs of length 2, our loss function maximizes the agreement between semantic labels of Z-positive tuples, which helps for faster convergence during training. Furthermore, it applies more pressure in minimizing the prediction entropy because it enforces all the probability mass of the joint distribution to be concentrated at the diagonal, where the predictions of 2 different augmentations agree. Finally, the Product-Agreement loss is faster since it is not required to operate with the whole joint distribution, which is a C × C matrix, but only with its diagonal, which is a vector of dimensionality C.

It is very easy to implement and it achieves competitive scores on multiple data-sets. Using a very simple encoder (3 convolutional layers and 1 linear layer) trained on a common 4GB RAM GPU, it achieves 98% on MNIST, 46% on CIFAR-10 and 41.6% on STL-10, completely unsupervised.

Figure 3: A schematic representation of the forward pass of the Product-Agreement loss in a very simple setup: batch size N = 2, Z = 2 (augmentations compared) and C = 2 classes to split the data.

Method Product-Agreement

The predictions are soft class assignments with dimensionality equal to the number of clusters we want to cluster the data into. The agreement of the predictions is simply the point-wise multiplication of the Z predictions. This makes sense since the point-wise product of two or more predictions is the diagonal (or hyper-diagonal) of their joint distribution. The diagonal is where the predictions coincide, meaning that the predictions agree on the class.

For example, if we have 4 classes and the soft predictions for an augmentation are [a1, a2, a3, a4], and the soft predictions for a different augmentation of the same image are [b1, b2, b3, b4], where a1 to a4 and b1 to b4 represent probabilities and sum up to 1 respectively, then the joint probability distribution of the two predictions is depicted in Table 1.

class predictions | b is class 1 | b is class 2 | b is class 3 | b is class 4
a is class 1      | a1*b1        | a1*b2        | a1*b3        | a1*b4
a is class 2      | a2*b1        | a2*b2        | a2*b3        | a2*b4
a is class 3      | a3*b1        | a3*b2        | a3*b3        | a3*b4
a is class 4      | a4*b1        | a4*b2        | a4*b3        | a4*b4

Table 1: Joint probability distribution of the soft class assignments of 2 augmentations of the same image.

As we can see, the diagonal elements are maximized when the predictions of the two images point at the same class. Furthermore, the magnitude of the product [a1, a2, a3, a4] ∗ [b1, b2, b3, b4] = [a1b1, a2b2, a3b3, a4b4] is maximized when both predictions show probability 1 for the same class.

The only caveat is that the above is trivially solved by producing predictions that map to just one class for every given image. To mitigate this we have to enforce that the mean probability (of a given batch of images) should be a vector with the highest entropy possible which is achieved when all values of the vector are 1/C. Both of the above objectives occur naturally with the Product-Agreement loss.

Let x be N images, where N is the batch size, and x_1, x_2, ..., x_Z be Z different transformations of x. Let Φ be our encoder, a neural network that has as input a batch of images and as output the aforementioned prediction vectors. All Φ(x_1), Φ(x_2), ..., Φ(x_Z) are N × C matrices. We encode the transformations x_1, x_2, ..., x_Z, using Φ, into the respective predictions p_1, p_2, ..., p_Z. We calculate the element-wise (⊙) product of the predictions, A, which stands for agreement:

$$A = p_1 \odot p_2 \odot ... \odot p_Z \quad (8)$$

This pushes together the predictions while minimizing their entropy. Then we calculate the batch-mean prediction and take the minus log of it. This step enforces predictions for all classes by maximizing the batch mean prediction entropy. Finally, we take the class mean of the log vector to make it scalar:

$$L_{PA} = \frac{1}{C}\sum_{j=1}^{C}\left(-\log\frac{1}{N}\sum_{i=1}^{N} A_{ij}\right) \quad (9)$$

A simple schematic of the PA loss pipeline is displayed in Figure 3.

Limitations of Product-Agreement loss

One big limitation of the Product-Agreement loss is that it requires a large batch-size-to-classes ratio, since the batch class distribution must match the prior data-set class distribution for good results. For example, if there are no instances of a class in a batch then the product loss will return a wrong signal, since we need to predict instances of all classes. As a matter of fact, it would not work well with small batch sizes or data-sets with many classes like ImageNet.

Algorithm 1 Product-Agreement loss with Z = 4 augmentations
input: batch size N, classes C, Φ
1: for sampled mini-batch {x_i}_{i=1}^{N} do
2:   for all i ∈ {1, . . . , N} do
3:     create 4 augmentations x_{i1}, x_{i2}, x_{i3}, x_{i4} and encode them into probability vectors
4:     p_{ij} = Φ(x_{ij})
5:     calculate the tuple-wise prediction agreement, which is the product P_i = ∏_{j=1}^{Z} p_{ij}
6:   end for
7:   L_{PA} = (1/C) Σ_{j=1}^{C} ( − log( (1/N) Σ_{i=1}^{N} P_i ) )
8:   update network Φ to minimize L_{PA}
9: end for
10: return Φ

A PyTorch implementation can be found in Appendix 1.
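As a rough, non-authoritative sketch (the implementation in Appendix 1 is the reference), Equation (9) and Algorithm 1 can be written in a few lines of PyTorch; the small epsilon is an added assumption for numerical stability.

```python
import torch

def product_agreement_loss(preds, eps=1e-8):
    """preds: list of Z tensors of shape [N, C], softmax class assignments
    of Z augmentations of the same batch of images."""
    A = preds[0]
    for p in preds[1:]:
        A = A * p                                  # element-wise agreement, Eq. (8)
    batch_mean = A.mean(dim=0)                     # [C], mean agreement per class
    return (-torch.log(batch_mean + eps)).mean()   # Eq. (9)

# usage: Z = 4 augmentations, batch of 8 images, 10 clusters
preds = [torch.softmax(torch.randn(8, 10), dim=1) for _ in range(4)]
print(product_agreement_loss(preds).item())
```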

4.2 Penalized-Product-Agreement loss (PPA)

In an attempt to mitigate the limitations of the Product-Agreement loss, we change the axis of the regularizing mechanism. Instead of applying the log function to the batch-mean prediction (batch axis), we now apply the log function to each penalized positive-pair agreement individually. The penalty mechanism penalizes the classes that are used most often per batch.

Benefits of the Penalized-Product-Agreement loss

The main advantage of the Penalized-Product-Agreement loss is that it is more flexible and forgiving of sampling imperfections. When we sample images, the sample is not stratified to reflect the prior class distribution of the data-set. The Product-Agreement loss penalizes the imperfections heavily. If there are no instances of a class in the sample then the Product-Agreement loss will generate a huge error (log(0)), while the Penalized-Product-Agreement loss will generate a milder error, since in case instances of a class are absent the only consequence is that the penalty of the other classes will be a bit higher.

Method Penalized-Product-Agreement

Informally, we sample a batch of images and create 2 different transformations x_1 and x_2 for every image. We pass the transformations through our encoder Φ and get the respective soft class assignments p_1 = Φ(x_1) and p_2 = Φ(x_2), both [N, C] matrices. As in the PA loss, we define the agreement of the predictions as their element-wise product: A = p_1 ⊙ p_2. Then we define the penalty as the sum of the predictions at the batch level: P = \frac{1}{2}\sum_{n=1}^{N}(p_1 + p_2). The penalized agreement is then A/P. Finally, we sum every individual penalized agreement, take the log of it and the mean over all of them:

$$L_{PPA} = \frac{1}{N}\sum_{i=1}^{N} -\log\left(\sum_{c=1}^{C}\frac{A}{P}\right) \quad (10)$$

$$= \frac{1}{N}\sum_{i=1}^{N} -\log\left(\sum_{c=1}^{C}\frac{p_1 \odot p_2}{P}\right) \quad (11)$$

Figure 4: A schematic representation of the forward pass of the Penalized-Product-Agreement loss with a batch size of 2 and Z of 2.

Stacked Penalized-Product-Agreement loss

Unfortunately, to generalize the Penalized-Product-Agreement loss to Z > 2 we cannot simply make the product larger by multiplying more augmentation class assignments, like in the Product-Agreement loss, since the product will become very small in contrast with the penalty and the learning will stop. What we can do though is to share a common penalty and calculate all pairwise Penalized-Product-Agreement losses for all of our augmentations. The penalty then becomes:

$$P = \frac{1}{Z}\sum_{n=1}^{N}(p_1 + p_2 + ... + p_Z) \quad (12)$$

And the equation 11 becomes:

$$L_{PPA} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t_1=1}^{Z-1}\sum_{t_2=t_1+1}^{Z} -\log\left(\sum_{c=1}^{C}\frac{p_{t_1} \odot p_{t_2}}{P}\right) \quad (13)$$


Algorithm 2 Penalized-Product-Agreement loss with Z augmentations
input: batch size N, classes C, x_1, x_2, ..., x_Z, Φ
1: for sampled mini-batch {x_i}_{i=1}^{N} do
2:   for all i ∈ {1, . . . , N} do
3:     create Z augmentations x_{i1}, x_{i2}, ..., x_{iZ}
4:     attain the predictions p_{ij} = Φ(x_{ij})
5:   end for
6:   calculate the penalty P = (1/Z) Σ_{i=1}^{N} Σ_{j=1}^{Z} p_{ij}
7:   for all i ∈ {1, . . . , N} do
8:     calculate the sum of pair-wise penalized agreements A_i = Σ_{t1=1}^{Z−1} Σ_{t2=t1+1}^{Z} − log( Σ_{c=1}^{C} ( p_{i t1} ⊙ p_{i t2} / P ) )
9:   end for
10:  L_{PPA-Z} = (1/N) Σ_{i=1}^{N} A_i
11:  update network Φ to minimize L_{PPA-Z}
12: end for
13: return Φ
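A minimal PyTorch sketch of the stacked PPA loss of Equation (13), following Algorithm 2; the vectorized implementation referenced below (Appendix 2) is the authoritative version, and the epsilon constant is an added assumption.

```python
import torch
from itertools import combinations

def ppa_loss(preds, eps=1e-8):
    """preds: list of Z tensors [N, C] with softmax class assignments."""
    # Shared penalty: per-class usage summed over the batch, Eq. (12).
    P = torch.stack(preds).sum(dim=(0, 1)) / len(preds)             # [C]
    loss = 0.0
    for p1, p2 in combinations(preds, 2):                           # all pairs t1 < t2
        penalized = (p1 * p2) / (P + eps)                           # [N, C]
        loss = loss + (-torch.log(penalized.sum(dim=1) + eps)).mean()
    return loss

preds = [torch.softmax(torch.randn(8, 10), dim=1) for _ in range(4)]
print(ppa_loss(preds).item())
```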

A PyTorch implementation of the PPA loss can be found in Appendix 2.

Limitations of Penalized-Product-Agreement loss

The Penalized-Product-Agreement loss solves some issues of the Product-Agreement loss but it still has some limitations. The penalty is not normalized according to the combination of clusters-to-split and batch size, affecting the learning in a counter-intuitive way for different batch sizes. We notice that when the penalty gets very high it impairs the training. When more than 2 augmentations are used, the loss function cannot incorporate them naturally like the Product-Agreement loss does, and it needs to be stacked, calculating the pairwise losses.

5 Methods for unsupervised sparse binary representation learning

In another attempt to solve the limitations of the Product-Agreement loss and to expand the probability-agreement family to representation learning, the idea of sparse binary features emerged. The main concept is that instead of attributing one label to an image we attribute a set of binary labels. These binary labels represent local characteristics of the image that either exist or not. The motivation comes from the fact that humans, in an attempt to characterize a complex object, break it down into local features. For example, a car should have wheels and a tree should have leaves. Both wheels and leaves are local characteristics of the objects car and tree, respectively.

The way to train a neural network to attribute sparse binary representations to images is similar to the other methods. There are two approaches that both use augmentations of the same image as positive pairs. The first approach, called Binary-Contrastive, needs negative sampling while the second, called Binary-Penalty, does not. Both approaches boil down to maximizing the similarity of the representations of the positive pairs while either minimizing the similarity of the representations of negative pairs or using a class regularization parameter based on batch statistics.

Both approaches share the following steps. Let x be N images, where N is the batch size, and x_1, x_2, ..., x_Z be Z different transformations of x. Let Φ be our encoder, a neural network that has as input a batch of images and as output a vector of configurable length C consisting of numbers from 0 to 1. This encoding is technically not in binary form yet, but we call it binary nonetheless. When the training begins the strings consist of numbers from 0 to 1; in the later steps of the training most of the strings consist of either 0 or 1, but there are still some remnants of floats in between. During evaluation we round the predictions to make them binary.

5.1 Binary-Contrastive loss function (BC)

Benefits of the Binary-Contrastive loss

A big advantage of the Binary-Contrastive loss is that it can operate in an unsupervised, supervised or mixed manner with very few adjustments. Furthermore, it can produce, without the use of any labels, high quality dense float and binary sparse representations that show high transferability between different data-sets.

One new peculiarity of the binary representations is the sparsity of the embedding vector. The sparsity can be configured with a hyper-parameter and it has an impact on the quality of the binary and non-binary representations (the representations of the last convolutional layer).


Figure 5: Schematic representation of the forward pass of the Binary-Contrastive loss with a batch size of 2 and Z = 2. In case of a queue mechanism (explained later), all 4 binaries 1a, 2a, 1b, 2b additionally repel all the stored binaries in the queue.

High sparsity is beneficial since, using the trick of storing only the indices of the "1s", we can efficiently store or compare the binary representations. We noticed that when the sparsity is high, the quality of both the binary representations and the representations extracted from the last convolutional layer goes up. In our experiments we achieve good representations with the sparsity reaching up to 99.5%.

The end goal of the Binary-Contrastive loss function is to produce the same binary string for the positive tuples while producing different binary strings for the negative tuples.

Method Binary-Contrastive

Informally, we sample a batch of images and we create Z different transformations for every image. Then we stack the Z batch-sized [N, C] transformations into one big batch of size [V, C], where V = Z × N. We pass the transformations to our encoder Φ and get the respective binary representations B. Then we calculate the matrix G, G ∈ R^{V×V}:

$$G = -\log\big(R \odot M + A \odot (1 - M)\big) \odot \Omega \quad (14)$$

Where ⊙ stands for the element-wise multiplication,

$$A = \frac{B \cdot B^\top}{\Sigma + 1} \quad (15)$$

stands for the attraction of the positive pairs, and Σ = \sum_{i=1}^{C} B_i is a normalizing vector consisting of the sums of the binary representations. The maximization of A achieves the alignment of the binaries of the positive pairs while at the same time pushing for sparsity and for zero entropy (binarization). The numerator of A pushes the values to become 1 and to be aligned, while the denominator Σ normalizes the product, allowing for sparsity, since equally aligned representations of different sparsity can achieve the same score. Furthermore, it penalizes misalignment, since misalignment makes the numerator bigger while the denominator stays the same.

$$R = \frac{B \cdot (1 - B)^\top}{\Sigma + 1} \quad (16)$$

Which stands for the repulsion of the negative pairs, or pushing negative pairs apart. The maximization of R achieves the alignment of the binaries of the first part of a negative pair with the reversed binary of the other part. It achieves three things: similarly to A, it pushes the entropy to zero and pushes for sparsity. In contrast to A, it pushes for diversity of the binary strings, since when the binary and the reversed binary align it means that the binaries of the negative pair are misaligned.

M stands for the adjacency matrix that represents the relationships between all the pairs. It consists only of 0s and 1s: there is a 0 for a positive pair and a 1 for a negative pair. The adjacency matrix is going to be used as a mask in later steps, dictating when we attract and when we repel. Most of the elements of the mask are ones since Z << N.

As shown in Table 2, from one batch of 4 images we derive 4 batches with different augmentations (Z-1 to Z-4). On the big diagonal (green) the images are compared with themselves, which only helps to drive the entropy of the representations to zero, while on the small diagonals (red) different augmentations of the same images are fully attracted (the value is 0). In every other spot augmentations of different images are compared and they are fully repelled (the value is 1).

Augmentations        Z-1        Z-2        Z-3        Z-4
Images            1 2 3 4    1 2 3 4    1 2 3 4    1 2 3 4
Z-1   1           0 1 1 1    0 1 1 1    0 1 1 1    0 1 1 1
      2           1 0 1 1    1 0 1 1    1 0 1 1    1 0 1 1
      3           1 1 0 1    1 1 0 1    1 1 0 1    1 1 0 1
      4           1 1 1 0    1 1 1 0    1 1 1 0    1 1 1 0
Z-2   1           0 1 1 1    0 1 1 1    0 1 1 1    0 1 1 1
      2           1 0 1 1    1 0 1 1    1 0 1 1    1 0 1 1
      3           1 1 0 1    1 1 0 1    1 1 0 1    1 1 0 1
      4           1 1 1 0    1 1 1 0    1 1 1 0    1 1 1 0
Z-3   1           0 1 1 1    0 1 1 1    0 1 1 1    0 1 1 1
      2           1 0 1 1    1 0 1 1    1 0 1 1    1 0 1 1
      3           1 1 0 1    1 1 0 1    1 1 0 1    1 1 0 1
      4           1 1 1 0    1 1 1 0    1 1 1 0    1 1 1 0
Z-4   1           0 1 1 1    0 1 1 1    0 1 1 1    0 1 1 1
      2           1 0 1 1    1 0 1 1    1 0 1 1    1 0 1 1
      3           1 1 0 1    1 1 0 1    1 1 0 1    1 1 0 1
      4           1 1 1 0    1 1 1 0    1 1 1 0    1 1 1 0

Table 2: Adjacency matrix for batch size N = 4 and number of different augmentations Z = 4.

Ω is a bonus coefficient applied to the attraction results, where Ω = M + (λ(N − 1)(1 − M))_{diag=1} and λ is a constant. Since the positive pairs are N − 1 times fewer than the negative pairs, we multiply the positive pairs with a large bonus coefficient to balance the attract and the repel. This bonus has an effect on the sparsity of the representations. We experimented with λ in the range [0.5, 2]. The bonus is applied only on the small diagonals of the matrix M, denoted with red in Table 2, since the attraction of an image with itself (big diagonal) is trivial. Then we can compose the loss function:

$$L_{bin} = \frac{1}{V^2}\sum_{i=1}^{V}\sum_{j=1}^{V} G_{ij} \quad (17)$$

$$= -\log\big(R \odot M + A \odot (1 - M)\big) \odot \Omega \quad (18)$$

$$= -\log\big(R \odot M + A \odot (1 - M)\big) \odot \underbrace{\big(M + (\lambda(N-1)(1 - M))_{diag=1}\big)}_{\text{Attraction bonus}} \quad (19)$$

$$= \frac{1}{V^2}\sum_{i=1}^{V}\sum_{j=1}^{V}\Big(-\log\big(r_{ij}\, m_{ij} + a_{ij}(1 - m_{ij})\big)\Big)\big(m_{ij} + \lambda(N-1)(1 - m_{ij})_{diag=1}\big) \quad (20)$$

$$= \frac{1}{V^2}\sum_{i=1}^{V}\sum_{j=1}^{V}\left(-\log\frac{B_i \cdot (1 - B_j)^\top m_{ij} + B_i \cdot B_j^\top (1 - m_{ij})}{\sigma_i + 1}\right)\big(m_{ij} + \lambda(N-1)(1 - m_{ij})_{diag=1}\big) \quad (21)$$

In Figure 6 we show another schematic of the forward pass. We will provide two examples of how the numbers of the matrix G in Figure 6 occurred. The cell in the top left corner comes from the encodings of the first augmented image with itself. The cell is a positive pair so we have attraction. We have B_i = [1, 0], B_j = [1, 0]; with a binary length of C = 2 the denominator becomes σ_1 + 1 = (1 + 0) + 1 = 2. Substituting into equation 15 we have:

$$Cell_{1,1} = -\log\left(\frac{1 \cdot 1 + 0 \cdot 0}{2}\right) = -\log\frac{1}{2} = 0.3 \text{ after rounding}$$

Figure 6: A simple schematic of the forward pass of the Binary-Contrastive with a batch size of N = 2 images, Z = 2 augmentations and a binary string of length 2. In the right side we see the matrix G where the number b is the attraction bonus.

Cell_{1,2} is a negative cell so we have repulsion. We have B_i = [1, 0], B_j = [0.25, 1]; with a binary length of C = 2 the denominator becomes σ_1 + 1 = (1 + 0) + 1 = 2. Substituting into equation 16 we have:

$$Cell_{1,2} = -\log\left(\frac{1 \cdot (1 - 0.25) + 0 \cdot (1 - 1)}{2}\right) = -\log\frac{0.75}{2} = 0.43 \text{ after rounding}$$

The loss function is minimized when:

• The parts of the positive tuples have exactly the same representations and their entropy is 0 (so all the elements must be either 0 or 1).

• All the parts of the negative pairs have completely different representations, namely there should be no dimension that has a non zero value for both binary representations of the parts of a negative pair at the same time. In addition, as in the positive pairs, the parts of the negative pair have 0 entropy.

• The binary representations are as sparse as possible, such that the values of the summation matrix Σ in the denominator become as small as possible, while the parts of the negative pairs have as few aligned 1s as possible.

Queue

We further enhanced the loss function by storing the batch predictions in order to use them as negative samples in future iterations. We experimented with a queue size S of up to 390 iterations in the past. We define B_{old}, an [S × N, C] matrix; we store only the binaries of 1 transformation for every image, since the binaries of the other transformations are going to be similar anyway. The old binaries are detached from the graph so we do not back-propagate through them. Then the additional part is:

$$L_{old} = \frac{1}{ZN^2S}\sum_{i=1}^{V}\sum_{j=1}^{SN}\left(-\log\frac{b_i \cdot (1 - b_{old_j})^\top}{\sigma_i + 1}\right)$$

$$L_{BC} = L_{bin} + L_{old}$$

Algorithm 3 Binary-Contrastive loss with Z = 4 augmentations and S = 50
input: batch size N, classes C, Φ, old binaries O [50 × N, C], λ
1: define the adjacency matrix Q = 1 − I_N [N, N], with ones everywhere and zeros on the diagonal
2: stack 4 × 4 copies of Q across both dimensions (width and height) to acquire M [4N, 4N]
3: for sampled mini-batch {x_i}_{i=1}^{N} do
4:   for all i ∈ {1, . . . , N} do
5:     create 4 augmentations x_{i1}, x_{i2}, x_{i3}, x_{i4}
6:     calculate the binaries b_{ij} = Φ(x_{ij})
7:   end for
8:   stack the binaries into one big batch [4N, C]
9:   calculate the attract and repel matrix G:
10:  for all i ∈ {1, . . . , 4N} do
11:    for all j ∈ {1, . . . , 4N} do
12:      calculate the sum of binary i: σ_i = Σ_{c=1}^{C} b_{i,c}
13:      set a_{ij} = 0 and r_{ij} = 0
14:      if m_{ij} = 0 then
15:        calculate the dot product pa_{ij} = b_i · b_j^T and the agreement a_{ij} = − log( pa_{ij} / (σ_i + 1) )
16:      else
17:        calculate the dot product pr_{ij} = b_i · (1 − b_j)^T and the repel r_{ij} = − log( pr_{ij} / (σ_i + 1) )
18:      end if
19:    end for
20:  end for
21:  for all i ∈ {1, . . . , 4N} do
22:    for all o ∈ O do
23:      calculate the dot product pr_{io} = b_i · (1 − b_o)^T and the repel r_{io} = − log( pr_{io} / (σ_i + 1) )
24:    end for
25:  end for
26:  L_{BC-4} = (1 / 16N²) Σ_{i=1}^{4N} Σ_{j=1}^{4N} ( a_{ij} · λ · (N − 1) + r_{ij} ) + (1 / 200N²) Σ_{i=1}^{4N} Σ_{o=1}^{50N} r_{io}
27:  update network Φ to minimize L_{BC-4}
28:  add the binaries b_{i1} to O
29:  if O is full, remove the oldest stacked batch of binaries [N, C]
30: end for
31: return Φ

A vectorized PyTorch implementation of the BC loss can be found in Appendix 3.
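A simplified matrix-form sketch of Equations (14)-(21), without the queue term; the treatment of the main diagonal and the small epsilon inside the logarithm are assumptions made for this sketch, and the vectorized implementation in Appendix 3 remains the reference.

```python
import torch

def binary_contrastive_loss(binaries, batch_size, lam=1.0, eps=1e-8):
    """binaries: [V, C] with V = Z * N, the stacked encoder outputs (values in [0, 1])
    for Z augmentations of a batch of N images."""
    V, C = binaries.shape
    Z = V // batch_size
    # Adjacency mask M: 1 for negative pairs, 0 for positive pairs (Table 2).
    eye = torch.eye(batch_size, device=binaries.device)
    M = (1.0 - eye).repeat(Z, Z)                               # [V, V]
    sigma = binaries.sum(dim=1, keepdim=True) + 1.0            # Sigma + 1, one value per row
    A = binaries @ binaries.t() / sigma                        # attraction, Eq. (15)
    R = binaries @ (1.0 - binaries).t() / sigma                # repulsion, Eq. (16)
    G = -torch.log(R * M + A * (1.0 - M) + eps)                # Eq. (14)
    # Attraction bonus: lambda * (N - 1) on positive pairs, 1 on negatives,
    # with the trivial self-pairs on the main diagonal weighted 1 (assumption).
    omega = M + lam * (batch_size - 1) * (1.0 - M)
    omega.fill_diagonal_(1.0)
    return (G * omega).mean()                                  # Eqs. (17)-(21)

binaries = torch.rand(4 * 8, 32)    # Z = 4 augmentations, N = 8 images, C = 32 bits
print(binary_contrastive_loss(binaries, batch_size=8).item())
```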

5.2 Binary-Penalty loss function (BP)

In this section, we will further simplify the BC loss (Equation 21) with an aim to make the resultant model train faster. To achieve this we revisited the penalty mechanism once again. The main concept is that it brings together positive pairs while penalizing predictions that occur often in the batch. In other words, the Binary-Penalty loss function uses the batch statistics instead of contrasting each and every negative pair. Informally, we achieve this by first sampling a batch of images. Subsequently, we create two different transformations for each image in the sampled batch. Next, we pass these transformed images through the encoder Φ to obtain binary representations B_1, B_2. Finally, we apply the following


loss function to these binary representations:

$$L_{BP} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\sum_{c=1}^{C} A/P}{\Sigma} \quad (22)$$

Where A = B_1 ⊙ B_2, A ∈ R^{N×C}, stands for the attraction of the positive pairs. As in the Binary-Contrastive loss, the maximization of A achieves the alignment of the binaries of the positive pairs, while at the same time pushing the binaries to sparsity and zero entropy. P = \frac{1}{2}\sum_{n=1}^{N}(B_1 + B_2) is the penalty, attained by summing at the batch level; the higher the sum per column, the more predictions use that column. Since we want to maximize the sparsity, diversity and discriminative power of the binary representations, all of the columns must be represented equally in the predictions.

Σ = \sum_{c=1}^{C}(B_1 + B_2) + λ_S is a scalar that penalizes the number of 1s in the binary representation, thus pushing for sparsity. λ_S stands for a sparsity coefficient which controls the sparsity of the binary vector. If λ_S is less than 1, the penalties are strong enough to collapse the binary representations into higher-entropy forms (instead of 0 and 1 the values will be between 0 and 1), which is unwanted. Composing the loss:

$$L_{BP} = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\sum_{c=1}^{C} A/P}{\Sigma} \quad (23)$$

$$= \frac{1}{N}\sum_{i=1}^{N} -\log\left(\frac{\sum_{c=1}^{C}\left(\frac{B_1 \odot B_2}{P}\right)}{\sum_{c=1}^{C}(B_1 + B_2) + \lambda_S}\right) \quad (24)$$

Stacked BP

To generalize to more transformations we need to use a stacked form of the BP loss function where we calculate a shared penalty and then all pairwise BP losses of the combinations of the augmentations. The penalty becomes:

$$P = \frac{1}{Z}\sum_{n=1}^{N}(B_1 + B_2 + ... + B_Z) \quad (25)$$

The equation 24 becomes:

$$L_{BP} = \sum_{t_1=1}^{Z-1}\sum_{t_2=t_1+1}^{Z}\frac{1}{N}\sum_{i=1}^{N} -\log\left(\frac{\sum_{c=1}^{C}\left(\frac{B_{t_1} \odot B_{t_2}}{P}\right)}{\frac{1}{2}\sum_{c=1}^{C}(B_{t_1} + B_{t_2}) + \lambda_S}\right) \quad (26)$$

Algorithm 4 Stacked Binary-Penalty loss with Z = 4 augmentations
input: batch size N, classes C, Φ, sparsity coefficient λ_S
1: for sampled mini-batch {x_i}_{i=1}^{N} do
2:   for all i ∈ {1, . . . , N} do
3:     create 4 augmentations x_{i1}, x_{i2}, x_{i3}, x_{i4}
4:     b_{ij} = Φ(x_{ij})
5:   end for
6:   calculate the penalty P = (1/4) Σ_{i=1}^{N} Σ_{j=1}^{4} b_{ij}
7:   for all i ∈ {1, . . . , N} do
8:     calculate the sum of pair-wise penalized agreements A_i = Σ_{t1=1}^{3} Σ_{t2=t1+1}^{4} − log( Σ_{c=1}^{C}( b_{i t1} ⊙ b_{i t2} / P ) / ( (1/2) Σ_{c=1}^{C}(b_{i t1} + b_{i t2}) + λ_S ) )
9:   end for
10:  L_{BP-4} = (1/N) Σ_{i=1}^{N} A_i
11:  update network Φ to minimize L_{BP-4}
12: end for


A vectorized PyTorch implementation of the stacked BP loss can be found in Appendix 4.
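Similarly, a minimal sketch of the stacked Binary-Penalty loss of Equation (26); the epsilon constant is an assumption for numerical stability and the vectorized implementation in Appendix 4 is the reference.

```python
import torch
from itertools import combinations

def binary_penalty_loss(binaries, lambda_s=1.0, eps=1e-8):
    """binaries: list of Z tensors [N, C] with values in [0, 1]."""
    # Shared penalty: per-column usage summed over the batch, Eq. (25).
    P = torch.stack(binaries).sum(dim=(0, 1)) / len(binaries)       # [C]
    loss = 0.0
    for b1, b2 in combinations(binaries, 2):
        numer = ((b1 * b2) / (P + eps)).sum(dim=1)                   # [N]
        denom = 0.5 * (b1 + b2).sum(dim=1) + lambda_s                # [N]
        loss = loss + (-torch.log(numer / denom + eps)).mean()
    return loss

binaries = [torch.rand(8, 32) for _ in range(4)]
print(binary_penalty_loss(binaries).item())
```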

6 Augmentations strategy

Our augmentation strategy includes 19 different augmentations which are chosen randomly every time. First we define the pre-processed image as the original image horizontally flipped with 50% chance and with its brightness, contrast, saturation and hue randomly changed. All of the augmentations below, including random crops, are applied to the pre-processed image unless stated otherwise. All the random crops are of height and width 2/3 of the pre-processed image unless stated otherwise.

1. The pre-processed image as is.
2. The pre-processed image scaled down by 25%, with the frame blackened.
3. The pre-processed image rotated between -46 and 46 degrees.
4. The original image rotated between -46 and 46 degrees.
5. The pre-processed image with the values of the pixels reversed. The starting values are between 0 and 1, so we execute values = 1 - values.
6. The pre-processed image after applying Sobel filters in the x or y or both directions.
7. The pre-processed image with Gaussian blur.
8. Various random crops of the pre-processed image with different sizes (from 1/2 to 2/3) and processing; for example, some of them are Gaussian-blurred or Sobel-filtered.
9. The pre-processed image with a patch erased (blackened).
10. The pre-processed image with salt-and-pepper noise.

Figure 7: 4 batches with different augmentations of the same original (middle) images of CIFAR-100.

We find that the most important augmentations are the random crops, because they help the network learn scale-invariant characteristics. Furthermore, they help the network correlate global and local features. The pre-processing of color, pose and blurriness is also crucial to help the network ignore viewing conditions. For all of our augmentations we use the torchvision transforms package of PyTorch.
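A small illustration of how such an augmentation pool can be assembled with torchvision transforms; the jitter strengths, crop scales and kernel sizes below are placeholders and not the exact settings of the thesis, except where they repeat the list above.

```python
import random
import torchvision.transforms as T

# Pre-processing: random horizontal flip plus brightness/contrast/saturation/hue jitter.
preprocess = [
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
]

# A pool of augmentations, one of which is drawn at random for every view of an image.
augmentation_pool = [
    T.Compose(preprocess + [T.ToTensor()]),                                  # pre-processed image as is
    T.Compose(preprocess + [T.RandomRotation(46), T.ToTensor()]),            # rotated in [-46, 46] degrees
    T.Compose(preprocess + [T.RandomResizedCrop(32, scale=(0.25, 0.45)),
                            T.ToTensor()]),                                  # random crop
    T.Compose(preprocess + [T.GaussianBlur(kernel_size=3), T.ToTensor()]),   # Gaussian blur
    T.Compose(preprocess + [T.ToTensor(), T.RandomErasing(p=1.0)]),          # erased (blackened) patch
]

def random_view(pil_image):
    # Pick one augmentation from the pool for this particular view.
    return random.choice(augmentation_pool)(pil_image)
```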


6.1 Limitations of contrastive learning

A limitation of contrastive learning is that some engineering and human intelligence needs to be invested in the creation of the augmentation pool. Some of the augmentations might have positive effects on almost every data-set, like for example scaling, rotation and brightness jitter, since these are attributes of every imaginable object, but some augmentations seem to benefit specific data-sets and can even be harmful for other data-sets. For example, if we use a vertical flip on the SVHN data-set, which depicts images of street numbers, vertically flipped sixes would look like nines, leading the network in the wrong direction.

7 Experiments

In this section we will present the experiments, which we split across 4 sub-sections: starting with "Experiments with unsupervised end-to-end clustering", which includes experiments for the PA and the PPA losses; "Experiments for unsupervised representation learning", which includes experiments with the BC and the BP losses; "Image Retrieval Experiments", which includes experiments with the BC loss on the task of unsupervised image hashing; and finally the sub-section "Transfer learning for Binary-Contrastive", which, as the name suggests, includes experiments regarding the transferability of the representations learned with the BC loss from one data-set to another.

In all the experiments we use Adam as the optimizer [11]. The learning rate, batch size and number of epochs trained vary somewhat from experiment to experiment.

Model choice

In unsupervised end-to-end clustering it is common practice that no validation set is used [18], [7], [15]. As shown in [15], some researchers ([18], [7]) even use the test set for training their models, since they do not use or interact with the labels of the data. We confirm that this is not far off with the results of Table 3, where the accuracies on the train sets are a bit lower than those on the test sets, while in supervised learning the accuracies achieved on the train sets are a lot higher since the networks extract knowledge from the labels.

We elect not to use a validation set and we pick the model whose loss is the lowest on the test set (which we do not use for training). Our loss functions do not require interaction with the test labels, so we do not pick the models with the highest accuracy. Furthermore, we try to use the same hyper-parameters for all the data-sets and experiments, unless the experiment involves studying a hyper-parameter. Notice also that we evaluate the average test loss only at the end of every epoch.

Data-sets 3-conv-1-linear 4-conv-1-linear 5-conv-1-linear 5-conv-2-linear

CIFAR-100-100 test-set 0.238 0.233 0.222 0.217

CIFAR-100-100 train-set 0.234 0.224 0.211 0.212

Table 3: Comparing accuracies of the Penalized-Product-Agreement loss on the test and train sets. The batch size is 128 and training took place for 200 epochs.

Data-Sets

We experimented with several commonly used data-sets of natural images. Every data-set has a different peculiarity that makes the tasks of representation learning and end-to-end clustering challenging.

• CIFAR-100: A natural image data-set with 50,000/10,000 32 × 32 samples from 100 classes. The images have also a hyper-class label where the hyper-classes are 20. We denote CIFAR-100-100 when we assess the accuracy considering 100 classes and CIFAR-100-20 when considering 20 classes.

• CIFAR-10: The CIFAR-10 data-set consists of 60000 32 × 32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6000 images per class. There are 50000 training images and 10000 test images.

• STL-10: (Coates, Ng, and Lee 2011) Contains images from ImageNet with 500/800 training/test images from each of the 10 classes (airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.) and an additional 100,000 unlabeled images. These 96 × 96 images are extracted from a similar but broader distribution of images. For instance, it contains other types of animals (bears, rabbits, etc.) and vehicles (trains, buses, etc.) in addition to the ones in the labeled set. Specifically in STL-10 we scale down the images to 32 × 32 for all experiments.

• SVHN: A Street View House Numbers data-set including 10 classes of 32 × 32 digit images. It contains 73257 labeled digits for training, 26032 labeled digits for testing, and 531131 additional unlabeled images. The peculiarity is that many of the images contain more than one number. The correct label is the number in the middle.


7.1 Experiments with unsupervised end-to-end clustering

In this section we will talk about the evaluation method for end-to-end unsupervised clustering results, the technical details of our encoder and how the results of our methods PA and PPA compare with other methods. Furthermore, we will display how PA and PPA fare with different configurations.

Evaluation for unsupervised end-to-end clustering

To evaluate the results of our models on the task of unsupervised end-to-end clustering we use the accuracy, which is computed by assigning to each cluster created by our model the dominating class label and taking the average correct classification rate as the final score.
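A minimal sketch of this evaluation, mapping each predicted cluster to its dominating ground-truth label through a confusion matrix; this is the simple majority-label assignment described above, not the Hungarian matching that some other works use.

```python
import numpy as np

def cluster_accuracy(cluster_ids, labels, num_clusters, num_classes):
    """cluster_ids, labels: integer arrays of equal length."""
    # Confusion matrix: rows are predicted clusters, columns are true classes.
    confusion = np.zeros((num_clusters, num_classes), dtype=np.int64)
    for c, y in zip(cluster_ids, labels):
        confusion[c, y] += 1
    # Each cluster takes its dominating class; count the images that match it.
    correct = confusion.max(axis=1).sum()
    return correct / len(labels)

# toy usage
pred = np.array([0, 0, 1, 1, 2, 2, 2])
true = np.array([3, 3, 1, 0, 2, 2, 1])
print(cluster_accuracy(pred, true, num_clusters=3, num_classes=4))   # 5/7
```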

Encoder

As the encoder we used a very simple 3-layer convolutional neural network that ends with one fully connected linear layer of dimensionality C, which varies per data-set.

Figure 8: Schematic of our encoder used for end-to-end clustering.
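A sketch of such an encoder in PyTorch; the channel widths, kernel sizes and pooling choices are illustrative assumptions, and Figure 8 shows the architecture actually used.

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """Three convolutional layers followed by one linear layer that outputs a soft
    class assignment of dimensionality C (layer sizes here are placeholders)."""
    def __init__(self, num_clusters: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(256, num_clusters)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return torch.softmax(self.head(h), dim=1)

model = SmallEncoder(num_clusters=100)
print(model(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 100])
```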

7.1.1 Product-Agreement Experiments

In this subsection we are going to present the experiments regarding the Product-Agreement loss function. The experiments include an ablation study that shows the benefit of using a positive pair of length greater than two, an experiment that shows the impact of the batch size when we use the PA loss, and finally an experiment that shows the potential of the PA loss with a minimal augmentation pool on the MNIST data-set.

Since the main argument for exploring the Product-Agreement loss is the possibility to compare more than 2 augmentations of the same image at the same time, we experimented with positive pairs of length 2 and 4. In an attempt to show that it is more efficient to compare more augmentations at the same time, we train the 4-augmentation variant for 100 epochs to match the extra computations that happen for the image transformations. The results are displayed in Table 4; the batch size is fixed at 256 and the learning rate at 4e-4.

Data-sets 2-augments-200-epochs 4-augments-100-epochs

CIFAR-100-100 0.192 0.212

Table 4: Accuracies for the PA when using positive pairs of length 2 and 4.

The results confirm our hypothesis that it is more efficient to have positive tuples of length larger than two. We might need more computations to create the transformations, but as we can see, 4 augmentations in 100 epochs outperform 2 augmentations in 200 epochs with the same batch size and learning rate, even though there are twice as many gradient updates when the model trains for 200 epochs. This is an important finding that might be applicable to other methods too. For the record, the accuracy of PA with 4 augmentations for 200 epochs was 0.235.


We noticed that PA-4 (PA using 4 augmentations) is favored when a larger batch size is used. This makes sense since the larger the batch size, the closer it is to the a priori class distribution, so that the sample of images is more balanced with respect to the classes. The comparison of a 128 and a 256 batch size is displayed in Table 5.

Data-set 128 256

CIFAR-100-100 0.189 0.235

Table 5: Accuracy of the PA-4 loss trained for 200 epochs for different batch-sizes.

In the last experiment we explored how the PA-4 loss fares with a minimal augmentation pool on the MNIST data-set. The MNIST data-set consists of images of handwritten digits. The images contain minimal information since the background is always the same and the pixels are of the same value. Furthermore, all pixels contribute to a monolithic structural pattern, as opposed to real images that might contain objects irrelevant to the dominant object of the image that we try to classify.

The PA-4 loss using a minimal transformation pool containing only 3 simple transformations (scale down, rotate 20 degrees to the left, rotate 20 degrees to the right) and the original image achieves 98% accuracy completely unsupervised within only 12 epochs with a batch size of 100 and a learning rate of 1e-4.

7.1.2 Penalized-Product-Agreement Experiments

In this subsection we are going to present the experiments regarding the Penalized-Product-Agreement loss function. The experiments include a study that compares the impact of different encoder architectures, an experiment that shows the impact of the batch size (this is an important experiment since unsupervised techniques use batch information either by comparing the images of the batch or by using the batch statistics), an experiment that shows the impact of over-clustering, an experiment that shows the impact of auxiliary over-clustering, and finally an experiment that shows how PPA fares with transfer learning. Notice that with PPA-4 we denote the Penalized-Product-Agreement loss contrasting 4 augmentations at the same time. The augmentations are drawn from the pool of augmentations as explained in Section 6.

In the first experiment we compared how PPA-4 performs with several encoder architectures with a varying number of convolutional or linear layers on the CIFAR-100-100 data-set. The batch size is fixed at 128, the learning rate at 2e-4 and we train for 200 epochs. As we can see in Table 6, the shallower architectures are favored for the PPA loss function on the CIFAR-100-100 data-set.

Data-sets 3-conv-1-linear 4-conv-1-linear 5-conv-1-linear 5-conv-2-linear

CIFAR-100-100 0.235 0.233 0.222 0.217

Table 6: Accuracy of the Penalized-Product-Agreement loss with different encoder architectures.

Next we investigate how the batch size affects the accuracy of PPA-4. For the batch size of 64 we use a learning rate of 1e-4; for a fair comparison we double the learning rate for every doubling of the batch size, since the number of gradient updates is halved for every doubling. For the batch size of 128 we use a learning rate of 2e-4 and, finally, for the batch size of 256 we use a learning rate of 4e-4. All the models were trained for 200 epochs.

Data-set        Batch-size 64   Batch-size 128   Batch-size 256

CIFAR-100-100   0.242           0.251            0.244

Table 7: Accuracies for the Penalized-Product-Agreement loss for different batch-sizes.

Counter-intuitively, the batch-size of 128 performs best. We would expect a bigger batch-size to contribute to a more accurate penalty, so that the batch-size of 256 would be beneficial. It turns out that the unnormalized penalty of Penalized-Product-Agreement is a flaw, since it can grow very large depending on the combination of the number of classes C and the batch-size N and thereby affect training. On the other hand, PPA-4 has the flexibility to train with small batch-sizes, which is not very common in unsupervised learning, where most methods harness the batch information and benefit from a larger batch-size.

In the next experiment we show the impact of over-clustering, that is, splitting the data into more virtual clusters than the number of apriori-clusters. We define the term apriori-clustering as clustering the data into exactly the number of apriori-clusters. A natural hypothesis is that over-clustering will achieve higher accuracies at the cost of diverging from the target (which is to cluster the data into the number of apriori-clusters). Nevertheless, it is interesting to show how the accuracies change when we cluster the data into more clusters. In Table 8 we compare the accuracies achieved when clustering into the apriori-cluster number with the accuracies achieved when clustering the data into 100 and 1000 clusters.

All the models were trained with a batch-size of 128, a learning rate of 1e-4 and a Z of 4. The number of apriori-clusters is noted in parentheses in the apriori-clustering column. Our hypothesis is confirmed by the results: the more clusters we split the data into, the more homogeneous the clusters become.

Data-sets apriori-clustering over-clustering 100 over-clustering 1000

CIFAR-100-20 0.260 ( 20 ) 0.421 0.513

STL-10 0.385 ( 10 ) 0.510 0.635

Table 8: Accuracies of the Penalized-Product-Agreement loss for different numbers of clusters C in CIFAR-100 and STL-10 data-sets.

Auxiliary over-clustering is the concept where auxiliary linear heads (the layers denoted as "Overclustering head" in Figure 9) are used in parallel with the main linear layer (denoted as "Apriori clustering" in Figure 9) after the convolutions during the training phase. The extra linear layers over-cluster the data into more clusters than the original layer does. This can be considered an ensemble of linear layers and it improves the quality of the convolutional representations. The explanation is that clustering the data into more classes requires more specific information from the images, so the convolutional layers need to learn richer representations. The auxiliary heads that over-cluster the data are discarded after training is done and are not used during testing. As we can see in Table 9, auxiliary over-clustering boosts the accuracy slightly, from 26% to 27%. However, it does not pose a very interesting research topic in itself, since it is a form of model ensembling, in this case an ensemble of the linear heads. A sketch of this head arrangement is given after Figure 9 below.

Figure 9: Schematic of the architecture that involves auxiliary over-clustering when using the PPA-4 loss.
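A minimal sketch of the head arrangement in Figure 9, assuming a generic convolutional encoder that produces a flat feature vector; the module and argument names are hypothetical and only meant to illustrate the parallel heads.

```python
import torch
import torch.nn as nn

class ClusteringModel(nn.Module):
    """Shared convolutional encoder followed by an apriori-clustering head and
    auxiliary over-clustering heads that are only used during training."""

    def __init__(self, encoder, feature_dim, num_classes=20, overcluster_sizes=(100,)):
        super().__init__()
        self.encoder = encoder                                     # shared convolutional trunk
        self.apriori_head = nn.Linear(feature_dim, num_classes)    # kept for testing
        self.over_heads = nn.ModuleList(
            [nn.Linear(feature_dim, k) for k in overcluster_sizes] # discarded after training
        )

    def forward(self, x):
        features = self.encoder(x)                                 # (N, feature_dim)
        apriori = torch.softmax(self.apriori_head(features), dim=1)
        over = [torch.softmax(head(features), dim=1) for head in self.over_heads]
        return apriori, over
```

During training the loss can be applied separately to the apriori head and to each over-clustering head and the terms summed; at test time only the apriori head is used.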

Data-sets with auxiliary over-clustering without auxiliary over-clustering

CIFAR-100-20 0.27 0.26

Table 9: Accuracy with and without auxiliary over-clustering when using PPA-4.

Finally, we experimented with the ability of PPA-4 to train models that give accurate predictions when trained on data from a similar but slightly different distribution than the test data. More specifically, when the model is trained with PPA-4 on the unlabeled data (100k images) it achieves a lower accuracy than when trained on the labeled data (5k images, still used without labels). This should not be the case, since the labeled split contains 20 times fewer images. At this point we should remind the reader that the unlabeled data contain images from a slightly different distribution than the test data, while the labeled data come from the same distribution as the test data, as explained in the sub-section "Data-sets" of section 7. The results are displayed in Table 10. They show that PPA-4 does not fare very well when the train and test data come from different distributions. For this experiment we used a batch-size of 16 and a learning rate of 2e-4.

Data-sets unlabeled samples(100k) labeled samples(5k)

STL-10 0.401 0.422

Table 10: Accuracies of PPA-4 trained on the unlabeled split (100k) and on the labeled split (5k, still without labels) of the STL-10 data-set.

7.1.3 Comparing PA and PPA losses with other methods

The best performance for unsupervised end-to-end clustering among our methods was achieved by PPA-4.

Method        CIFAR-10   CIFAR-100-20   CIFAR-100-100   STL-10

K-means       0.229      0.130          -               0.192
SC            0.247      0.136          -               0.159
AC            0.228      0.138          -               0.332
NMF           0.190      0.118          -               0.180
AE            0.314      0.165          -               0.303
DAE           0.297      0.151          -               0.302
GAN           0.315      0.151          -               0.298
DeCNN         0.282      0.133          -               0.299
VAE           0.291      0.152          -               0.282
JULE          0.272      0.137          -               0.277
DEC           0.301      0.185          -               0.359
DCCM          0.623      0.327          -               0.482
IIC           0.617      0.257          -               0.610
PICA          0.696      0.337          -               0.713
GATCluster    0.610      0.281          -               0.588
ConCUR        0.693      0.363          -               0.611
DRC           0.727      0.367          -               0.747
Ours PA-4     0.486      0.220          0.235           0.411
Ours PPA-4    0.561      0.270          0.252           0.422

Table 11: Accuracies achieved by different methods on different data-sets. The table is from [18]. DRC, PICA and ConCUR use test data during training; the ConCUR paper [15] reports accuracies of the same methods without the test data being used (only slightly lower).

PA-4 was trained with a batch-size of 256 and a learning rate of 4e-4 for all data-sets. For PPA-4 we unfortunately have to adjust the batch-size from data-set to data-set, since it affects the penalty and the training. The best results were achieved with a batch-size of 16 for CIFAR-10 and STL-10, a batch-size of 32 for CIFAR-100-20 and a batch-size of 128 for CIFAR-100-100. The learning rates were adjusted accordingly.

When comparing PA and PPA with the other methods we notice that they achieve worse results and can serve only as baselines. However, note that the other methods used more powerful neural networks as encoders (for example, DRC [18] used a variant of ResNet). Furthermore, our methods show that they might have potential when the data-set has many classes, since their results on CIFAR-100-20 and CIFAR-100-100 do not differ much, while one would expect clustering the data into 100 classes to be a more challenging task. Finally, we conclude that PPA performs decently even with a very small batch-size and outperforms PA on every data-set.

7.2 Experiments for unsupervised representation learning

In this section we discuss the evaluation of the representations learned without supervision, the technical details of our encoder and the results of our methods Binary-Contrastive (BC) and Binary-Penalty (BP).

Linear evaluation

A very common protocol adopted by many researchers [14], [6] is to train, in a supervised manner, a linear layer using as input the representations learned from unsupervised training. More concretely, we freeze the encoder and train a supervised linear classifier (a fully-connected layer followed by a softmax) using either the predictions of the encoder or the features derived from the 5th convolutional layer (conv5), which after flattening is a 4096-dimensional vector. A schematic of the linear classifier is shown in Figure 10, and a sketch of the protocol is given below. All accuracies reported in this section are the accuracies that this linear layer achieves when trained with supervision on top of the different representations acquired by unsupervised training.
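A minimal sketch of this protocol, assuming the trained encoder exposes its flattened conv5 features via a hypothetical `encoder.conv5_features` method; the training loop is abbreviated.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, feature_dim=4096, num_classes=100,
                      epochs=100, lr=1e-3, device="cuda"):
    """Freeze the encoder and train a supervised linear classifier on its conv5 features."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                            # the encoder stays frozen

    classifier = nn.Linear(feature_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()                      # the softmax is folded into the loss

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = encoder.conv5_features(images)  # flattened 4096-d conv5 activations
            logits = classifier(features)
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```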

Figure 10: Schematic of the linear classifier that is used to evaluate the learned representations.

Encoder

The neural network architecture used for the Binary-Contrastive and Binary-Penalty experiments is the one depicted in Figure 11, unless stated otherwise. A PyTorch implementation of our encoder can be found in Appendix 5.

Figure 11: Schematic of our encoder used for unsupervised representation learning.

7.2.1 Experimenting with varying number of epochs

In this experiment we compare the binary representations and the conv5 representations using the Binary-Contrastive and the Binary-Penalty losses, for training runs ranging from 200 to 400 epochs. The evaluated model is the same for CIFAR-100-100 and CIFAR-100-20 in all tables.

Data-set        Representations   Loss   200 epochs   300 epochs   400 epochs
CIFAR-100-100   Binaries          BC     41.54%       43.25%       44.44%
CIFAR-100-100   Binaries          BP     41.49%       42.36%       42.34%
CIFAR-100-100   Conv5             BC     50.00%       52.33%       53.30%
CIFAR-100-100   Conv5             BP     51.97%       53.61%       53.94%
CIFAR-100-20    Binaries          BC     54.73%       56.78%       57.61%
CIFAR-100-20    Binaries          BP     55.10%       55.61%       56.13%
CIFAR-100-20    Conv5             BC     61.28%       62.66%       63.09%
CIFAR-100-20    Conv5             BP     61.43%       62.44%       63.04%

Table 12: Accuracies for the Binary-Contrastive and the Binary-Penalty losses on the CIFAR-100-100 and CIFAR-100-20 data-sets after training the encoder for different numbers of epochs, with a fixed binary size of 4096 and a batch-size of 128. The Binary-Contrastive attraction-bonus coefficient is fixed at 0.4 (sparsity 94.9%) and the Binary-Penalty sparsity coefficient is fixed at 5 (sparsity 99.6%).

As we can see, the conv5 representations outperform the binary representations, as expected, since earlier layers carry more information. Furthermore, we see that Binary-Contrastive and Binary-Penalty obtain similar results for the conv5 representations, while Binary-Contrastive shows better results for the binary representations. Finally, we notice that Binary-Contrastive scales better with more epochs of training than Binary-Penalty.

7.2.2 Experimenting with sparsity

Furthermore, we explored how the sparsity of the binary representations affects the accuracies.

Data-set        Attraction bonus coefficient (λ)   Sparsity   Accuracy linear/binaries   Accuracy linear/conv5
CIFAR-100-100   0                                  97.8%      37.23%                     37.35%
CIFAR-100-100   0.5                                93.5%      34.84%                     51.52%
CIFAR-100-100   1.2                                87.7%      31.91%                     50.63%
CIFAR-100-20    0                                  97.8%      49.62%                     55.71%
CIFAR-100-20    0.5                                93.5%      49.22%                     61.13%
CIFAR-100-20    1.2                                87.7%      47.49%                     60.59%

Table 13: Accuracies and sparsity for the Binary-Contrastive loss on the CIFAR-100-100 and CIFAR-100-20 data-sets after training the encoder for 200 epochs.

Data-set        Sparsity coefficient (λS)   Sparsity   Accuracy linear/binaries   Accuracy linear/conv5
CIFAR-100-100   0                           99.95%     24.18%                     49.35%
CIFAR-100-100   1                           99.86%     34.24%                     52.02%
CIFAR-100-100   5                           99.60%     41.49%                     51.97%
CIFAR-100-100   120                         98.33%     38.35%                     50.75%
CIFAR-100-100   500                         96.00%     33.13%                     50.52%
CIFAR-100-20    0                           99.95%     35.82%                     59.72%
CIFAR-100-20    1                           99.86%     50.89%                     60.84%
CIFAR-100-20    5                           99.60%     55.10%                     61.44%
CIFAR-100-20    120                         98.33%     53.07%                     61.43%
CIFAR-100-20    500                         96.00%     48.69%                     60.68%

Table 14: Accuracies and sparsity for the Binary-Penalty loss on the CIFAR-100-100 and CIFAR-100-20 data-sets after training the encoder for 200 epochs.

As we can see in Tables 13 and 14, a better accuracy using the binary representations does not always lead to a better accuracy for the conv5 representations. Furthermore, we notice that the Binary-Contrastive binary representations achieve higher accuracies when there is more sparsity, while for the binary representations of Binary-Penalty the ideal sparsity seems to be around 99.6%. Finally, we notice that the Binary-Penalty loss creates very sparse binary representations, which may be convenient to store and compare in practice. The sketch below illustrates how such sparsity figures can be measured.
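For clarity, the sparsity figures in the tables can be read as the fraction of zero bits in the binarized codes. A minimal sketch of such a measurement is shown below, assuming the codes are obtained by thresholding soft outputs at 0.5 (the exact binarization rule of our method is described in section 5, so this threshold is an illustrative assumption).

```python
import torch

def code_sparsity(soft_codes, threshold=0.5):
    """Fraction of zero bits after binarizing the soft codes.
    soft_codes: tensor of shape (N, code_length) with values in [0, 1]."""
    binary = (soft_codes > threshold).float()
    return 1.0 - binary.mean().item()

# Example: a batch of 128 soft codes of length 4096
codes = torch.rand(128, 4096)
print(f"sparsity: {code_sparsity(codes):.1%}")
```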

7.2.3 Experimenting with the length of the binary vector

Another interesting experiment is to explore how the length of the binary vector affects the accuracies of the binary and conv5 representations. While experimenting with varying lengths we need to report the sparsity and bonus coefficients, since they have a different impact for different lengths. The results are shown in Tables 15 and 16.

Length   Bonus coefficient   Sparsity   Accuracy linear/binaries   Accuracy linear/conv5
64       0                   87.50%     29.02%                     30.29%
64       0.5                 88.50%     27.05%                     49.16%
64       1                   76.00%     25.45%                     49.34%
1024     0.5                 92.40%     34.57%                     52.21%
4096     0.5                 93.50%     34.84%                     51.52%
8192     0.5                 94.00%     35.52%                     51.19%

Table 15: Accuracies and sparsity for the Binary-Contrastive loss with varying binary lengths in CIFAR-100-100 data-set after training for 200 epochs and a batch size of 128.

length sparsity coefficient sparsity accuracy linear/binaries accuracy linear/conv5

64 3 96.30% 15.28% 49.39%

64 10 93.00% 14.01% 47.95%

1024 5 99.10% 31.04% 50.68%

4096 5 99.60% 41.49% 51.97%

8192 10 99.61% 43.17% 51.92%

Table 16: Accuracies and sparsity for the Binary-Penalty loss for varying binary lengths in CIFAR-100-100 data-set after training for 200 epochs and a batch size of 128.

The conclusion from this experiment is that the Binary-Contrastive loss performs much better in binary accuracies for small code-lengths (64), while the Binary-Penalty loss performs better in binary accuracies for larger code-lengths (8192). The accuracies of the conv5 representations are similar. The impact of the code length on image retrieval is displayed in the Image Retrieval Experiments section, in Table 27; a sketch of how such codes can be used for Hamming-distance retrieval follows below.
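Since these binary codes are ultimately meant for retrieval with the Hamming distance, a minimal sketch of such a lookup is shown here; the function name and the random example data are illustrative only.

```python
import torch

def hamming_retrieve(query_code, database_codes, top_k=10):
    """Return the indices of the top_k database codes that are closest to the
    query in Hamming distance. Codes are 0/1 tensors of shape (code_length,)
    and (num_items, code_length) respectively."""
    distances = (database_codes != query_code).sum(dim=1)   # Hamming distance per item
    return torch.topk(distances, k=top_k, largest=False).indices

# Example with random 4096-bit codes
database = torch.randint(0, 2, (10000, 4096), dtype=torch.uint8)
query = torch.randint(0, 2, (4096,), dtype=torch.uint8)
nearest = hamming_retrieve(query, database, top_k=5)
```

Sparse codes make this comparison cheap in practice, since the bits can be packed and compared with bitwise operations.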

7.2.4 Binary-Contrastive and Binary-Penalty performance compared to other methods

In Table 17 we show how Binary-Contrastive and Binary-Penalty perform in representation learning in comparison with other state-of-the-art methods from the literature on various data-sets.

Method                    CIFAR-100-100   CIFAR-10   SVHN

Split-Brain               39.0            67.1       77.3
Counting                  18.2            50.9       63.0
DeepCluster               41.9            77.9       92.0
Instance                  39.4            70.1       89.3
AND                       47.9            77.6       93.7
PAD                       58.6            84.7       93.2
SeLa                      57.4            83.4       94.5
Ours Binary-Penalty       54.2            83.1       95.5
Ours Binary-Contrastive   56.3            84.0       96.0
Supervised                69.7            91.8       96.1

Table 17: Accuracies for different methods in various data-sets using the Linear classifier / conv5 protocol. The table is from [8] with the addition of PAD and SeLa scores.

Since the other methods used AlexNet as an encoder, for a fair comparison in this particular experiment both of our loss functions trained an AlexNet [12] without the dropout layers, with a binary code-length of 4096, a batch-size of 128 and a learning rate of 2.2e-4 for 200 epochs. Furthermore, for the Binary-Penalty loss we used a sparsity coefficient λsc = 5, while for the Binary-Contrastive loss we used a bonus coefficient λ = 0.5, and the positive tuples consisted of 4 augmentations. It is worth mentioning that AlexNet achieves approximately 3% more accuracy than our original encoder but requires much more time to train (3 days versus 1 day per experiment).

As we can see the results are quite competitive. It is noteworthy that we did not do any hyper-parameter tuning for the AlexNet; we only ran it twice, once with dropout and once without, and kept the latter, so it is very possible that better results could be obtained with further tuning. A sketch of how the dropout layers can be stripped from AlexNet is shown below.
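A minimal sketch of removing the dropout layers from the torchvision AlexNet, assuming the standard torchvision implementation in which the classifier contains two nn.Dropout modules (older torchvision versions take pretrained=False instead of weights=None); adapting the last layer to the binary code length is shown for completeness.

```python
import torch.nn as nn
from torchvision.models import alexnet

model = alexnet(weights=None)   # train from scratch, no pretrained weights

# Replace every Dropout module in the classifier with an identity mapping.
model.classifier = nn.Sequential(
    *[nn.Identity() if isinstance(m, nn.Dropout) else m for m in model.classifier]
)

# Adapt the last linear layer so the output matches the binary code length (4096 bits here).
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 4096)
```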
