Discrete Parameter Autoencoders for Semantic Hashing

Robbert van Ginkel 10352600

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervised by:
Peter O'Connor
Max Welling
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2015


Contents

1 Introduction
2 Theoretical foundation
  2.1 Semantic hashing
  2.2 Autoencoders
  2.3 Expectation Backpropagation
    2.3.1 Binary weight restriction
3 Expectation Backpropagation for Autoencoders
4 Experiments & Results
  4.1 MNIST
5 Conclusions
6 Discussion
  6.1 Future work

Abstract

Semantic hashing is a method for creating descriptive binary similarity hashes from feature vectors. The semantic hashing function is usually learned by training an autoencoder neural network and applying a trick to binarise the middle layer. This study investigates the novel training algorithm Expectation Backpropagation (EBP) for training such a network. EBP is able to binarise the code layer without special additions to the algorithm and is capable of restricting the possible weights for the neural connections. The latter could be used to implement the network efficiently in hardware or to use it on embedded platforms, as the weight matrix storage space is reduced. Using the EBP algorithm, a binary weight autoencoder network is trained on the MNIST dataset. Results indicate that 28-bit hashes created from the MNIST dataset are sufficiently distinctive to aid similarity search, either directly or as a way of pruning the search space.


1 Introduction

A common task in information retrieval involves searching a large dataset for entries which are similar to a given query entry. A common approach to this problem is extracting distinctive feature vectors from each sample in the dataset and using these feature vectors for comparison. This procedure allows data of various sizes and content to be compared and usually outperforms comparing raw datapoints such as word counts or pixel values.

When querying a large dataset for similar entries, speed and accuracy are two important components of a system's usability. The accuracy depends on the quality of the feature vectors. A high quality feature vector should be sufficiently distinctive to capture the essence of the original data, but also sufficiently general so that data with similar content (e.g. all pictures containing bridges) map to similar feature vectors. The retrieval speed depends on the comparison metric and the size of the search space. An optimal search system should make use of high quality feature vectors that can be compared in a search space which is independent of the database size. When binary feature vectors are used, these vectors can be interpreted as hashes of the data. Such hashes allow a search operation to generate hashes similar to the query hash by changing a few random bits and to look up the associated entries in a database in O(1).

Feature vectors can be constructed easily from raw data. Simple feature vectors for textual documents can be created by transforming the document into a bag-of-words or a TF-IDF vector. For images, a feature vector could simply be the raw pixel values or a set of descriptors obtained from vision algorithms like SIFT, FREAK or GIST. These approaches have some problems: using a bag of words or pixel values results in large feature vectors unsuited for comparison; image descriptors need to be manually created; and all methods require that a comparison is made against each entry in the dataset when searching for similarities. To avoid these problems, a method for creating high quality feature vectors should fulfil the following requirements:

1. Compose a low dimensional feature vector, as small feature vectors are computationally less expensive and reduce noise. This requirement alone can be achieved by using a dimensionality reduction method such as Principal Component Analysis (PCA). PCA, however, is only able to capture linear correlations in the data, and it should be possible to create more informative feature vectors through non-linear methods. If these low dimensional feature vectors are binary and constructed according to the idea of locality sensitive hashing [1], they can be used to efficiently search through a hashed database in O(1).

2. Automatically find hidden structure in the data. Handcrafting descriptors results in descriptors which are only able to describe structure that humans know how to describe and which are typically only suitable for one type of data. It has been shown that learned features can perform as well as handcrafted features on datasets for which the features were created and that they perform better than handcrafted features in situations for which they are less suited [2]. Ideally, the procedure creating the feature vector would be able to find such latent structure automatically, regardless of the datatype.

A method that satisfies all requirements is semantic hashing [3], a dimensionality reduction approach based on autoencoder neural networks [4], [5]. Krizhevsky and Hinton [6] demonstrated the usage of semantic hashing on images from the CIFAR-10 dataset and showed that the qualitative and quantitative performance of this approach is significantly better than comparing raw pixel values.

Although semantic hashing works well, there is a limitation that holds back its application on embedded devices. Especially on a mobile phone with limited bandwidth, it would be cheaper to send a small hash for comparison than to upload a whole image. Unfortunately, the large deep architectures required for computing feature vectors need an amount of storage space which is too large for applications on mobile phones [7].

In recent research, Soudry, Hubara, and Meir [8] proposed a novel algorithm for training neural networks. Expectation Backpropagation (EBP) is a Bayesian training algorithm with an option to restrict the weights to a discrete set. Using this algorithm it is possible to train a neural network with weights from the set {−1, 1}. These binary weights require 32 times less storage space than 32-bit floating point numbers and allow a potentially fast implementation in hardware.

Another benefit of using EBP for training a semantic hashing function arises from the probabilistic nature of the algorithm. It allows the usage of the sign activation function, something which is not possible when training a network through regular gradient descent, as that activation function does not have a usable gradient. Because of this property, EBP automatically trains towards binary activations and does not require extra effort to binarise the hash layer.

The aim of this study is to investigate the usage of the EBP training algorithm to create a semantic hashing autoencoder. The following section discusses the theoretical context of autoencoders, semantic hashing and the EBP algorithm. In section 3, an extension to the EBP algorithm is proposed to make it suitable for training an autoencoder. Section 4 describes experiments and results of applying this modified algorithm to the creation of semantic hashes from MNIST. The final sections discuss these results and present ideas for future work.


2 Theoretical foundation

This section will introduce the concepts of semantic hashing, autoencoders and the EBP algorithm.

2.1 Semantic hashing

Semantic hashing is a method to create binary hashes from input vectors in such a way that similar vectors map to similar hashes [3]. These hashes can be used to quickly search for similar vectors by restricting the search space to all input vectors whose similarity hash differs by only a few bits. This requires some pre-computation, as each searchable entry needs to be hashed and stored in a database with that hash as an index, but this is a negligible extra cost considering that the retrieval time for similar entries becomes completely independent of the database size. A schematic representation of semantic hashing can be found in figure 1.


Figure 1: Overview of using semantic hashing for similarity searching
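To make the lookup procedure in figure 1 concrete, the sketch below indexes entries by their integer-encoded hash and searches by flipping up to a few bits of the query hash, so every bucket lookup is an O(1) dictionary access. This is an illustrative sketch, not code from the thesis; the function names, the 28-bit width and the search radius are assumptions.

```python
from itertools import combinations

def build_index(hashes, entries):
    """hashes: list of ints (e.g. 28-bit codes); entries: the associated items."""
    index = {}
    for h, e in zip(hashes, entries):
        index.setdefault(h, []).append(e)
    return index

def query(index, query_hash, n_bits=28, radius=2):
    """Return every entry whose hash is within `radius` bit flips of query_hash."""
    results = list(index.get(query_hash, []))
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            candidate = query_hash
            for b in bits:
                candidate ^= 1 << b          # flip bit b of the query hash
            results.extend(index.get(candidate, []))
    return results
```

The number of buckets probed grows with the radius (sum of binomial coefficients), but it does not depend on the number of entries stored in the database.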

The goal of semantic hashing is similar to that of Locality Sensitive Hashing (LSH), another method which aims to create hashes with a high probability of collision for similar original vectors [1]. However, their methods for creating the similarity hash are very different. LSH tries to approximate nearest-neighbour matching in the original vector space, whereas semantic hashing aims to find latent structure in the original dataset and exploit it to generate a compressed version containing the relevant information. This is achieved by training a deep autoencoder neural network to learn the structure of the original dataset and using it to create a hashing function.


It has been shown that semantic hashing performs well on text and image retrieval tasks [3], [6]. In both studies the semantic hashing approach outperformed the comparison of the raw feature vectors. The increased performance is attributed to the autoencoders used to create the hashes, as these are able to discover latent structure in the original dataset.

2.2 Autoencoders

Dimensionality reduction of high-dimensional datasets has many uses. The resulting low-dimensional version of the input data can be used for visualisation if the reduced vectors are 2D or 3D, or for better performance when vectors are compared against each other. Lower dimensional feature vectors are usually preferred over their original counterparts when used for comparison, as smaller vectors require less memory, compare faster and can be used as hashes. Depending on the dimensionality reduction method, a low dimensional feature vector can simply be a vector in which some of the original values are discarded (feature selection) or an entirely new vector in which each value consists of a combination of the old values (feature extraction).

An autoencoder is a neural network designed to reduce dimensionality through feature extraction [4]. It is an unsupervised learning algorithm that aims to recover the original high dimensional input data after it has been reduced to a low dimensional code in the middle layer of the neural network. After the network has been trained, a forward pass up to the middle layer can be performed to obtain a low dimensional representation of the input. See figure 2 for a typical autoencoder structure.


Figure 2: Autoencoder neural network structure

More formally, an autoencoder can be defined as follows: Given an input vector x, an autoencoder maps it to a hidden vector y, which is mapped to a reconstruction vector z. The activation for the middle layer y is calculated from x by recursively calculating the activations for the hidden layers up to the middle layer:

y = f_{\Theta_1}(x) = s(W_M \, s(W_{M-1} \, s(\cdots W_1 x + b_1 \cdots) + b_{M-1}) + b_M)    (1)

in which s is a chosen activation function, M is the index of the middle layer and \Theta_1 is the parameter set \{W_1, \ldots, W_M, b_1, \ldots, b_M\}.

The reconstruction from y to z is calculated by recursively calculating the activations for the hidden layers up to the final layer:

z = g_{\Theta_2}(y) = s(W_F \, s(W_{F-1} \, s(\cdots W_{M+1} y + b_{M+1} \cdots) + b_{F-1}) + b_F)    (2)

in which s is a chosen activation function, F is the index of the final layer and \Theta_2 is the parameter set \{W_{M+1}, \ldots, W_F, b_{M+1}, \ldots, b_F\}.

When training the autoencoder, each training sample x^{(i)} is mapped to a corresponding low-dimensional representation y^{(i)} and a reconstruction z^{(i)}. The goal of training is then to find the optimal set of parameters such that the reconstruction is closest to the original input. More formally:

\Theta_1, \Theta_2 = \operatorname{argmin}_{\Theta_1, \Theta_2} \frac{1}{n} \sum_{i=1}^{n} C(x^{(i)}, z^{(i)}) = \operatorname{argmin}_{\Theta_1, \Theta_2} \frac{1}{n} \sum_{i=1}^{n} C(x^{(i)}, g_{\Theta_2}(f_{\Theta_1}(x^{(i)})))    (3)

in which C is a cost function, such as the mean squared error. These parameters can be found through regular gradient descent backpropagation, although for deep networks with a large number of hidden layers it might be required to do layer-wise pre-training first [9].
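As an illustration of equations (1)-(3), the following numpy sketch implements the forward pass and mean squared error objective of a plain real-weight autoencoder (not the EBP network used later); the parameter layout and the tanh activation are illustrative choices.

```python
import numpy as np

def forward(params, x, s=np.tanh):
    """Return (code y, reconstruction z) for a column vector x.

    params is a list of (W, b) pairs; the first half is the encoder f_theta1
    (equation 1), the second half the decoder g_theta2 (equation 2)."""
    h = x
    mid = len(params) // 2
    for i, (W, b) in enumerate(params):
        h = s(W @ h + b)
        if i == mid - 1:
            y = h                      # middle-layer code, equation (1)
    return y, h                        # h is now the reconstruction z, equation (2)

def mse_cost(params, X):
    """Average reconstruction error over the columns of X, as in equation (3)."""
    cost = 0.0
    for x in X.T:
        _, z = forward(params, x[:, None])
        cost += np.mean((x[:, None] - z) ** 2)
    return cost / X.shape[1]
```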

The basic autoencoder can be adapted to make it better suited for specific tasks. One of these tasks is de-noising: the recovery of the original input from a corrupted input. This can be achieved by training an autoencoder on corrupted input. An autoencoder trained with this method is called a denoising autoencoder, and it has been shown that these produce more robust features that generalise better on the test set than a regular autoencoder [5]. Denoising autoencoders are also well suited for layer-by-layer pre-training and can be stacked to create a deep autoencoder [10].

Autoencoders for semantic hashing: binarising the middle layer To use an autoencoder for semantic hashing, the code layer must have a binary output. A simple way of achieving this is by applying the sign(·) function to the output of the code layer. A better result can be achieved if this desired outcome is taken into account while training the model. The best result will be produced if the binarisation step loses as little information as possible. To achieve this, the continuous activation of a node should be as close to the binary value as possible. One method to accomplish this is the injection of Gaussian noise into the code layer [3]. The most reliable way to transport information through the code layer in the presence of noise is to have a very large positive or negative activation, such that the result of applying the sigmoid activation function is not distorted much by the noise.

The code layer can also be binarised by rounding the code layer to the nearest binary value in the forward pass and ignoring this rounding during backpropagation [6].
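The two binarisation tricks described above can be sketched as follows, assuming a sigmoid code layer; the noise level is an illustrative value, not one taken from [3] or [6].

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def noisy_activation(pre_activation, sigma=4.0):
    """Trick 1 (noise injection): add Gaussian noise to the code-layer pre-activation,
    so only strongly positive or negative units pass information through reliably.
    The value of sigma here is an illustrative assumption."""
    return sigmoid(pre_activation + rng.normal(0.0, sigma, pre_activation.shape))

def rounded_code(activation):
    """Trick 2 (rounding): binarise to the nearest binary value in the forward pass;
    during backpropagation the rounding would simply be ignored (not shown here)."""
    return np.round(activation)
```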

2.3 Expectation Backpropagation

Expectation Backpropagation is a recent algorithm for training neural networks [8]. The algorithm is based on Bayesian statistics, in contrast to the frequentist foundation of regular gradient descent backpropagation. An essential difference between EBP and normal BP is that EBP treats possible weight values as distributions, whereas normal BP uses single point values for its weights.

The Bayesian foundation gives EBP several advantages over regular BP: (1) training is parameter-free (i.e. no learning rate) and less prone to overfitting due to the regularising effect of using priors and treating unknown variables as distributions; (2) EBP allows the possible weights to be restricted to discrete values. Neural networks with trained discrete weights can be embedded in hardware and would allow for very time and energy efficient computation of the output of the network. Overall, EBP seems like a promising algorithm for training neural networks.

The algorithm as proposed works for neurons with binary deterministic activations (±1). This is a desired property for semantic hashing, as the activation of the code layer should be binary. EBP offers this for free, without any change to the algorithm.

Algorithm Expectation backpropagation is an algorithm which can be used for training a general feed-forward multilayer neural network. For a network with binary activation neurons, the output of the network can be defined as:

v_L = g(x, \mathcal{W}) = \operatorname{sign}(W_L \operatorname{sign}(W_{L-1} \operatorname{sign}(\cdots W_1 x)))    (4)

in which \mathcal{W} is the set of binary weight matrices and x is an input vector. The goal of EBP is to find P(\mathcal{W}|D), the posterior probability over the weights given the data. Given this distribution and an input vector, it is possible to approximate the expected outcome of a deterministic forward pass. This is achieved by treating each hidden node as a binary random variable and calculating the associated probability distribution to determine its expected output. By recursively calculating the expected outputs for the hidden layers, the distribution of the output in the final layer can be calculated and compared to the target output. In a fashion very similar to gradient descent backpropagation, the discrepancy between the target output and the expected output can be backpropagated to determine how \mathcal{W} must be updated to increase the probability of the target vector given the input vector.

The distribution over the output of the probabilistic forward pass should be calculated while training, but its expected value can also be used directly as an output of the network. This is defined as the probabilistic output of the network (EBP-P). After training, a binary network can be sampled from the trained weight distributions to calculate a deterministic output of the network according to equation 4. This is defined as the deterministic output of the network (EBP-D), and the average of multiple deterministic outputs is expected to approximate the probabilistic output. A detailed description of the training procedure can be found in Soudry, Hubara, and Meir [8].
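A rough numpy sketch of such a probabilistic (EBP-P) forward pass is given below. It assumes ±1 activation units, per-layer weight means and variances (given in closed form in section 2.3.1), no biases, and a Gaussian approximation of each pre-activation, so that the expected sign output is 2Φ(mean/std) − 1 (the expression also used in section 3). The exact recursion is derived in [8]; treat this only as an illustrative approximation, not a reference implementation.

```python
import numpy as np
from scipy.stats import norm

def probabilistic_forward(w_means, w_vars, v):
    """Mean-field sketch of the EBP-P forward pass for +/-1 activation units.

    w_means / w_vars: per-layer expected values and variances of the weights.
    v: expected value of the (binary, +/-1) input units."""
    for w_mean, w_var in zip(w_means, w_vars):
        u_mean = w_mean @ v
        # Var(sum_j W_kj v_j) = sum_j (E[W^2] E[v^2] - E[W]^2 E[v]^2), with E[v^2] = 1
        u_var = (w_var + w_mean ** 2) @ np.ones_like(v) - (w_mean ** 2) @ (v ** 2)
        # Gaussian approximation of the pre-activation: expected sign = 2*Phi(mu/sigma) - 1
        v = 2.0 * norm.cdf(u_mean / np.sqrt(u_var + 1e-12)) - 1.0
    return v
```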

2.3.1 Binary weight restriction

When EBP is restricted to binary weights, any weight w_{ij,l} is taken from the set {−1, 1}. In Soudry, Hubara, and Meir [8], the distribution of W_{ij,l} is parameterised by a value h_{ij,l} in such a way that:

P(W_{ij,l} \mid D_n) = \frac{e^{h^{(n)}_{ij,l} W_{ij,l}}}{e^{h^{(n)}_{ij,l}} + e^{-h^{(n)}_{ij,l}}}    (5)

This parametrisation is chosen because the expected value and variance of W_{ij,l}, which are required for performing a probabilistic forward pass, are now easily calculated as \langle W_{ij,l} \rangle = \tanh(h_{ij,l}) and \operatorname{Var}(W_{ij,l}) = \operatorname{sech}^2(h_{ij,l}). After training the network, the Maximum A Posteriori (MAP) weight estimate W^*_{ij,l} can be obtained by clipping h_{ij,l}:

W^*_{ij,l} = \operatorname{sign}(h_{ij,l})

The performance of the EBP-P algorithm can be approximated by averaging the outcome of multiple networks that use the EBP-D algorithm. The weights for each network in the EBP-D ensemble are sampled from the distribution defined by equation 5:

W^*_{ij,l} \sim 2 \operatorname{Bernoulli}\!\left(\tfrac{1}{2}\tanh(h_{ij,l}) + \tfrac{1}{2}\right) - 1

where Bernoulli(x) is a function sampling a single Bernoulli trial with probability x of success. Empirical results show that using 16 or more binary networks with sampled weights approximates the probabilistic output well [8, Appendix G].
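The two ways of reading out the trained parameters h can be sketched as follows: the MAP clipping W* = sign(h) and the Bernoulli sampling of equation (5), averaged over several EBP-D networks as in [8, Appendix G]. The helper names and the generic `forward` argument are illustrative, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def map_weights(h):
    """MAP estimate of the binary weights: W* = sign(h)."""
    return np.sign(h)

def sample_weights(h):
    """Draw binary weights in {-1, +1} with P(W = 1) = (tanh(h) + 1) / 2,
    i.e. the distribution of equation (5)."""
    p = 0.5 * (np.tanh(h) + 1.0)
    return 2.0 * rng.binomial(1, p) - 1.0

def averaged_ebp_d_output(forward, H, x, n_networks=20):
    """Approximate the EBP-P output by averaging the deterministic outputs of
    several sampled binary networks (EBP-D). `forward` is a deterministic
    sign-network forward pass in the style of equation (4)."""
    outputs = [forward([sample_weights(h) for h in H], x) for _ in range(n_networks)]
    return np.mean(outputs, axis=0)
```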


3 Expectation Backpropagation for Autoencoders

Although the binary activations of nodes trained with EBP are useful for creating hashes from the middle layer, they are very restrictive when used to create an autoencoder. The work as presented only allows EBP to be trained on binary target vectors. This limits the algorithm to classification tasks or to an autoencoder for a binary dataset. While such datasets exist (for example, binary bag-of-words vectors based on the appearance of a word), this is a crucial restriction of the algorithm. It would be favourable to extend the algorithm to cope with real-valued target vectors, such as bag-of-words vectors, TF-IDF vectors or image pixel value vectors. In this section, a modification to EBP is proposed to support training with continuous-valued target vectors.

The restriction arises from the usage of the sign(·) function in the target layer. By removing this activation function from the last layer, the last layer can be interpreted as a linear activation layer. The activation value should then be interpreted as a continuous random variable defined by \mu_L and \sigma_L with expected value \mu_L, rather than a binary random variable with expected value 2\Phi(\mu_L/\sigma_L) - 1. To incorporate this change into the EBP algorithm, the cost function must be redefined to reflect the log probability of the target vector y given a normally distributed random variable:

\ln P(v_L = y) = \ln P(y \mid \mu_L, \sigma_L^2) = \ln \prod_r \frac{1}{\sqrt{2\pi\sigma_{r,L}^2}} e^{-\frac{(y_r - \mu_{r,L})^2}{2\sigma_{r,L}^2}} = -\frac{1}{2} \sum_r \left( \ln\left(2\pi\sigma_{r,L}^2\right) + \frac{(y_r - \mu_{r,L})^2}{\sigma_{r,L}^2} \right)    (6)

which is the log of the Gaussian probability density function. The gradient initialisation is based on the cost function and must also be redefined to reflect the changes in the final layer:

\Delta_{k,L} = \frac{\partial \ln P(v_L = y)}{\partial \mu_{k,L}} = \frac{\partial}{\partial \mu_{k,L}} \left[ -\frac{1}{2} \sum_r \left( \ln\left(2\pi\sigma_{r,L}^2\right) + \frac{(y_r - \mu_{r,L})^2}{\sigma_{r,L}^2} \right) \right] = \frac{y_k - \mu_{k,L}}{\sigma_{k,L}^2}    (7)

The rule for backpropagating the gradient to earlier layers does not need to be changed, because it is based on the already redefined gradient initialisation and the activation functions in those layers stay the same.
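Equations (6) and (7) written out directly as a small numpy check (variable names are illustrative):

```python
import numpy as np

def linear_layer_log_likelihood(y, mu, var):
    """Equation (6): log probability of the target y under independent Gaussians
    with mean mu and variance var (the linear final layer)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mu) ** 2 / var)

def initial_gradient(y, mu, var):
    """Equation (7): derivative of the log likelihood with respect to mu,
    used to initialise the backward pass."""
    return (y - mu) / var
```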


4 Experiments & Results

To verify that EBP can be used to train an autoencoder for semantic hashing, the following properties must be investigated: (1) an autoencoder trained using EBP can create good reconstructions of the input; (2) a semantic hashing autoencoder trained using EBP does not need tricks to binarise the middle layer, as the expected activations of the code layer are mostly close to −1 or 1; (3) the hashes created by the semantic hashing autoencoder are a good representation of the original data.

To test these assumptions, the MNIST dataset is explored. CIFAR-10 was also considered but proved to be too difficult to train using fully connected layers.

4.1 MNIST

MNIST is a database containing handwritten digits. The set is randomly divided into 50,000 training images and 10,000 test images. The data was preprocessed by centring it (mean = 0), as recommended for backpropagation [11]. The data is not normalised as recommended, because EBP is invariant to input scaling [8, Appendix F]. The presented results are from training an autoencoder with layer sizes 784-128-64-28-64-128-784 for 200 epochs. The training was done on Amazon EC2 g2.2xlarge GPU instances using Theano [12]. Figure 3 shows the reconstruction of a set of digits randomly drawn from MNIST.

Figure 3: Reconstruction of random samples from the MNIST dataset by an autoencoder with code layer size 28.

To check whether EBP is able to create hashes with a good representation of the data, three different representations of the data will be compared. The first representation is the 784-dimensional vector of normalised pixel values of MNIST. It is used as a reference representation, as it contains all possible information about a sample. The second representation is the expected code layer activation generated by the EBP-P network. This real-valued 28-dimensional vector will be referred to as the EBP-P outcome, and its signed version will be referred to as the EBP-P hash. The last representation is created by signing the average result of 20 EBP-D networks with sampled weights and should be an approximation of the EBP-P hashes.

Histogram of code layer activations To generate informative binary codes in the middle layer of the EBP-P network, their expected value should be close to either −1 or 1, so that no significant information is lost during binarisation. Figure 4 shows a histogram of the observed activations in the code layer over the training set. The histogram clearly shows that the activations in the trained network are mostly near −1 and 1.


Figure 4: The distribution of activations in the code layer.

Qualitative analysis: same class retrieval A quantitative and qualitative analysis of the retrieval performance is done using the k-nearest neighbour algorithm. For any query image in the test set, the k nearest neighbours are retrieved, and it is computed how many of the k neighbours are of the same class (number) as the query image. Note that the test set contains a total of 10,000 images with 1,000 images in each class. Figure 5 shows the first 9 neighbours for a test image. It is clear that the pixel space neighbours are perceptually very similar to the query image: a five with a relatively large upper stroke and a smaller rounding in the bottom left corner. The hash space neighbours are more varied but also contain some images of other classes that exhibit similar characteristics: a thick stroke or a slightly tilted top stroke.
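The same-class retrieval measure used in figures 6 and 7 can be sketched as follows, assuming the hashes are stored as an array of ±1 values; for such codes the Hamming distance is half the number of bits minus half the dot product. Names are illustrative.

```python
import numpy as np

def same_class_hit_rate(hashes, labels, k):
    """For every sample, retrieve its k nearest neighbours under Hamming distance
    and return the average fraction of neighbours that shares the query's class."""
    dots = hashes @ hashes.T
    hamming = (hashes.shape[1] - dots) / 2.0        # pairwise Hamming distances
    np.fill_diagonal(hamming, np.inf)               # exclude the query itself
    neighbours = np.argsort(hamming, axis=1)[:, :k]
    hits = labels[neighbours] == labels[:, None]
    return hits.mean()
```

The same function can be used for the pixel space baseline by replacing the Hamming distances with Euclidean distances.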

Quantitative analysis: same class retrieval Figure 6 shows the fraction of nearest neighbours which are in the same class as the tested sample, as a function of the value used for k. The graph was generated using all 10,000 samples from the test set. Euclidean distance in pixel space performs a bit better initially, but is worse than the hash space similarities for large k. Figure 7 shows the same, but is generated from a subset of the MNIST digits. It is less likely that this 200-sample dataset contains a neighbour which is nearly identical in pixel space. The hashes, which represent a more generalised version of the input data, perform better on this subset.


(a) Pixel space using Euclidean distance.

(b) EBP-P hash space using Hamming distance

(c) EBP-D hash space using Hamming distance

Figure 5: The 9 nearest neighbours for a query image in different representations.

The proximity of the EBP-P and EBP-D results shows that the probabilistic output is closely approximated by averaging the output of several binary networks.

[Figure 6 plots the same category hit rate against the number of nearest neighbours (up to 200) for the EBP-P hash, EBP-D hash and pixel space representations.]

Figure 6: Same category retrieval performance on the MNIST dataset. The EBP-D error bars indicate a 95% confidence interval for the result of the sampled networks.

Cluster visualisation To investigate the amount of distinctive information retained in the hashes, a visualisation of the local structure of the data is created in hash and pixel space. The visualisation is created by t-Distributed Stochastic Neighbour Embedding (t-SNE), a visualisation method designed to show the local structure of high dimensional datasets in 2D or 3D [13]. t-SNE is not a dimensionality reduction method that creates a mapping from high dimensional vectors to low dimensional vectors, but rather a technique that starts with all high dimensional datapoints on a low dimensional plane and moves them around until the local similarities found in the high dimensional representation are optimally retained in the low dimensional visualisation.


[Figure 7 plots the same category hit rate against the number of nearest neighbours (1 to 8) for the EBP-P hash, EBP-D hash and pixel space representations.]

Figure 7: Same category retrieval performance on a 200-sample subset of the MNIST dataset. The EBP-D error bars indicate a 95% confidence interval for the result of the sampled networks.

This technique allows us to visualise and evaluate whether the hashes capture enough distinctive information about the original inputs. All plots have been generated with the default scikit-learn parameters: a perplexity of 30, an early exaggeration of 4.0 and a learning rate of 1000, for a maximum of 1000 iterations. Figure 8 shows the embedding of the original 784-dimensional vectors, figure 9 shows the embedding of the 28-bit hashes generated by EBP-P and figure 10 shows the embedding of the 28-bit hashes generated by EBP-D. Although the difference between classes is slightly more clear in pixel space, the local similarities are also clearly visible in hash space.
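A sketch of the t-SNE call with the parameters listed above follows. The `hashes` and `labels` arrays are random placeholders here, and the argument names are those of the 2015-era scikit-learn release; newer versions rename `n_iter` to `max_iter` and use different defaults.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for the (n_samples, 28) array of +/-1 hashes
# and the corresponding digit classes.
rng = np.random.default_rng(0)
hashes = np.sign(rng.standard_normal((1000, 28)))
labels = rng.integers(0, 10, size=1000)

# Parameters as listed in the text.
tsne = TSNE(n_components=2, perplexity=30, early_exaggeration=4.0,
            learning_rate=1000.0, n_iter=1000)
embedding = tsne.fit_transform(hashes)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='tab10', s=5)
plt.show()
```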


Figure 8: 2D embedding of the MNIST dataset pixel values using t-SNE. MNIST classes are colour coded.


Figure 9: 2D embedding of the 28-bit EBP-P hashes using t-SNE. MNIST classes are colour coded.


Figure 10: 2D embedding of the 28-bit EBP-D hashes using t-SNE. MNIST classes are colour coded.


5 Conclusions

Using the linear last-layer modification to the EBP algorithm, it is possible to train an autoencoder on the MNIST dataset. The binary hashes generated by the autoencoder are usable as semantic hashes for quickly searching through a large database in O(1), independent of the database size. It has been shown that these hashes can also be generated by networks with binary weights. By sampling multiple networks and averaging their output, a result very similar to the real-valued probabilistic outcome is obtained. These results indicate that EBP is an excellent algorithm for training an autoencoder from which a very efficient (in both computational complexity and storage space) semantic hashing function can be derived.

6 Discussion

The same class retrieval results on MNIST show that the hash spaces perform worse than pixel space similarity for a small number of neighbours in a dataset with near identical images. If the dataset contains few nearly exact pixel space matches and class similarities require a more abstract representation of the original data, then the hash space performs better. It should be noted that similarity retrieval based on digit classes is not a flawless evaluation metric, as digits of the same class are not necessarily more perceptually similar than images of different classes. It is possible that a poorly drawn three looks more like an eight than like another three, as the algorithm is never trained with any target classes and has no notion that there are 10 digits. The results should be interpreted as an indication that the hashes contain a compressed form of the input and that hashes within a close distance represent original images with a large similarity. It is expected that in most cases this similarity is a result of the digits being from the same class, but this is not necessarily so. Regardless, the results indicate that the hashes could be used to filter for possible similarity candidates in the queried database, in order to have fewer comparison candidates when using a more intensive comparison metric.

Although the EBP-D results are only satisfactory after sampling multiple networks and averaging over their outputs, the need for computing over twenty forward passes should not be considered a problem. The computation of multiple of these networks can be implemented in hardware and is embarrassingly parallelisable. The computations mostly involve elementary matrix operations, and performing multiple deterministic passes might even be faster than a single probabilistic pass, which uses more computationally intensive calculations like trigonometric functions and logarithms. From a mobile platform perspective, twenty sampled networks with binary weights require less storage space than the 32-bit float values required for real weight storage. Because the expected values of the weights are biased towards binary weights, a large part of the weight matrices is expected to be the same. By only storing a complete matrix for the most common weight configuration and storing sparse difference matrices for the rest, the required storage space will probably shrink even more.
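As a hypothetical illustration of this storage idea (not something implemented in the thesis), the sampled weight matrices could be stored as one dense base matrix plus sparse per-network differences:

```python
import numpy as np
from scipy import sparse

def compress_sampled_weights(weight_matrices):
    """Store one dense base matrix (here the element-wise majority sign across the
    sampled networks) plus a sparse difference matrix per network. Because the
    sampled weights concentrate around the MAP values, the differences are
    mostly zero and compress well."""
    base = np.sign(np.mean(weight_matrices, axis=0))
    diffs = [sparse.csr_matrix(W - base) for W in weight_matrices]
    return base, diffs

def restore(base, diff):
    """Recover one sampled weight matrix from the base and its difference."""
    return base + diff.toarray()
```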

6.1 Future work

The presented work indicates that EBP is an excellent algorithm for training an autoencoder to create a semantic hashing function, but there are still some areas for further research. As MNIST is a relatively simple dataset, it would be interesting to see how well the hashes created using a binary network perform on a more complicated dataset, like CIFAR-10. To successfully train on a more complex image dataset, it might be beneficial to study EBP in the context of convolutional neural networks. For this, a generalisation of the EBP algorithm should be made to support different layers in the network, such as layers with varying activation functions or layers with shared weights. The current study evaluates semantic hashes of 28 bits; further research could investigate how the performance of the hashes changes when larger or smaller hashes are used.


References

[1] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions", in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG '04, Brooklyn, New York, USA: ACM, 2004, pp. 253–262, ISBN: 1-58113-885-7. DOI: 10.1145/997817.997857. [Online]. Available: http://doi.acm.org/10.1145/997817.997857.

[2] K. Kavukcuoglu, M. A. Ranzato, R. Fergus, and Y. LeCun, "Learning invariant features through topographic filter maps", in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 1605–1612.

[3] R. Salakhutdinov and G. Hinton, "Semantic hashing", International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009, Special Section on Graphical Models and Information Retrieval, ISSN: 0888-613X. DOI: 10.1016/j.ijar.2008.11.006. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0888613X08001813.

[4] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, vol. 313, no. 5786, pp. 504–507, 2006.

[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders", in Proceedings of the 25th International Conference on Machine Learning, ser. ICML '08, Helsinki, Finland: ACM, 2008, pp. 1096–1103, ISBN: 978-1-60558-205-4. DOI: 10.1145/1390156.1390294. [Online]. Available: http://doi.acm.org/10.1145/1390156.1390294.

[6] A. Krizhevsky and G. E. Hinton, “Using very deep autoencoders for content-based image retrieval.”, in ESANN, Citeseer, 2011.

[7] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization", arXiv preprint arXiv:1412.6115, 2014.

[8] D. Soudry, I. Hubara, and R. Meir, “Expectation backpropagation: parameter-free training of multilayer neural networks with continuous or discrete weights”, in Advances in Neural Information Processing Systems, 2014, pp. 963–971.

[9] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al., “Greedy layer-wise training of deep networks”, Advances in neural information processing systems, vol. 19, p. 153, 2007.


[10] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion”, The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[11] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop", in Neural Networks: Tricks of the Trade, Springer, 2012, pp. 9–48.

[12] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: a CPU and GPU math expression compiler", in Proceedings of the Python for Scientific Computing Conference (SciPy), Oral Presentation, Austin, TX, Jun. 2010.

[13] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE", Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
