
MBCS Research Report: Object Representations in Deep Convolutional Neural Networks and Their Behavioural Implications

Author:

Lucas Weber

University of Amsterdam

Supervisors:

Dr. Tim C. Kietzmann

MRC Cognition and Brain Sciences Unit, University of Cambridge

Dr. Nikolaus Kriegeskorte

Zuckerman Institute, Columbia University in the City of New York

Collaborator:

Johannes Mehrer, MSc

MRC Cognition and Brain Sciences Unit, University of Cambridge

August 16, 2018

Abstract

After revolutionizing the field of artificial intelligence, and in particular computer vision, deep convolutional neural networks (DCNNs), biologically inspired machine learning algorithms, are finding their way back into computational neuroscience. Currently, image representations in DCNNs trained on visual object recognition show greater similarity to the human ventral visual stream than any prior model. However, it remains an open question whether the representations in these models can also predict human perceptual decision making, as has been shown for neuroimaging data from the inferior temporal lobe in humans. In the current study, we predict two different human behavioural datasets from rapid visual categorization tasks using the representations of two DCNNs trained on different image datasets. We found that the networks are able to explain reaction times and accuracies and that their predictive power increases with network depth. In a linear combination of network layers, intermediate layers added the largest portion of explanatory merit, which is in line with previous studies. This indicates that, even though object representations in DCNNs match the ventral visual stream well, networks trained solely on object recognition tasks may not be able to completely map the terminal areas of the ventral visual stream.

Keywords: deep convolutional neural networks, vision, object recognition, ventral stream

1 Introduction

The human brain is one of the most complex entities known. A means to make complex systems understandable is to abstract away from dispensable detail by building simplified models. The field of computational neuroscience builds mathematical models describing the neural responses to sensory stimulation, the internal representational transformations of different cognitive subsystems, their interactions, and how representations are translated into observable behaviour. A fundamental subsystem is the ventral visual stream, which achieves visual object recognition (Goodale & Milner, 1992).

Ventral visual stream and its early models. Even though few things appear as natural to humans as seeing an object and classifying it as a dog rather than a banana or an umbrella, under the hood, visual object recognition is a complex task to accomplish. In the hierarchically-organized ventral visual stream, information flows from low-level structures in the striate cortex, calculating simple Gabor-like functions (Hubel & Wiesel, 1962; Jones & Palmer, 1987), to highly category-selective neural populations in the inferior temporal lobe (IT) (Gross et al., 1972; Hung et al., 2005). While low-level feature detectors are organized retinotopically and have relatively small receptive fields, higher visual areas become increasingly indifferent to location and more affected by object category in their response.

Early models of human visual object recognition were guided by neuroscientific insight, trying to reverse-engineer the functionality of the visual stream. Model design in this top-down manner requires handcrafting of visual feature detectors and becomes increasingly impractical with greater feature complexity in later processing stages. In the past, this limited models to explaining only low-level visual processing or to using toy datasets. An example is HMAX and its subsequent revisions (Riesenhuber & Poggio, 1999; Serre & Riesenhuber, 2004), formerly often referred to as the 'standard model' of vision (Cadieu et al., 2004). HMAX was successful in showing scale, viewpoint and location invariance for datasets of paperclips (Serre & Riesenhuber, 2004) and matched human performance on a superordinate discrimination task using natural images (animate vs. inanimate; see Serre et al. (2005)). However, when using more complex datasets, it failed to account for the categorical clustering in intermediate and high-level visual cortex (Khaligh-Razavi & Kriegeskorte, 2014) that is typical for humans.

Short introduction to DCNNs. When aiming to model more complex high-level visual processing, top-down model design involving a lot of handcrafting might be the wrong approach. Instead, models that let complexity emerge from data seem more promising. Artificial neural networks (ANNs) are such emergent models and have recently received increased attention in the modeling of brain function.

ANNs are machine learning algorithms inspired by the way biological brains are thought to process information. Nodes in ANNs correspond to biological neurons and are organized in layers. Proceeding through the layers, the networks calculate non-linear transformations by weighting the output of the respective previous layer and applying a non-linearity. Owing to large amounts of computational resources becoming widely accessible in recent years, stacking up greater numbers of neural layers to form so-called deep learning architectures is now the approach of choice for many machine learning problems. While deep and shallow networks are both universal function approximators, deeper networks are usually able to model functions of equal complexity with a smaller number of optimizable parameters than shallow networks (Bengio, 2009). Another common theme in state-of-the-art neural network architectures besides deep learning is the use of convolutional layers (LeCun et al., 1989). Motivated by the insight that features in one location of an image might also be useful in another, convolutional layers apply learned local feature detectors over the whole visual field, reducing the number of optimizable parameters drastically. These more efficient deep convolutional neural network architectures (DCNNs) revolutionized the field of artificial intelligence research in recent years and became the central approach to many machine learning problems (e.g. computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013), i.a.).

In computer vision, DCNNs have recently surpassed human object recognition abilities (He et al., 2015). Even though there are many aspects in which DCNN architectures have simplified and abstracted away from their biological archetype, their functionality in principle still mimics the information processing of biological neural networks, giving us reason to believe that their learned features can be matched onto the human ventral visual stream.

Object representations in DCNNs and area IT. Evidence from neuroscientific research indeed suggests representational similarity between visual DCNNs and human neuroimaging data of the ventral stream. Even though not implemented with the idea of maximizing biological plausibility but object classification performance, the first DCNNs in computer vision straightaway explained representations in area IT better than any biologically-motivated model (Khaligh-Razavi & Kriegeskorte, 2014). Yamins et al. (2014) and Khaligh-Razavi & Kriegeskorte (2014) demonstrate that representational similarity between DCNN models and area IT increases with the model's object classification performance. Interestingly, not only does representational similarity between categories increase in better performing models, but also within categories, which is remarkable since the models were not constrained in their within-category representational organization. This is good news, suggesting that great model similarity can be obtained by training on real-world behaviour instead of constraining the model with neural data, which are generally limited by relatively small dataset sizes (Kietzmann et al., 2018). DCNNs trained in this way are able to explain up to 50% of the variance in IT spike counts (Yamins et al., 2014), which is comparable to Gabor filters in V1 (Olshausen & Field, 2005).

Object representations and behaviour. Human visual decision-making behaviour is related to IT's representational geometry (Carlson et al., 2014). An accurate model of the ventral stream and area IT should thus also be able to explain actual real-world behaviour of human beings.

When decoding information from the brain, researchers often concentrate on what information they are extracting, thus on the interpretation of the data. More seldom do researchers explicitly address the question of how the rest of the brain actually makes use of this information (de Wit et al., 2016). An implicit assumption when focusing on the interpretation of the data is that if the system holds the information, it will also use it to implement a certain behaviour. This is not necessarily true. Even though the information is decodable, it could be merely epiphenomenal or used for a different behaviour. For example, mostly mobile animate objects and mostly stationary inanimate objects could elicit different patterns of activation in the dorsal visual stream (which is, amongst others, occupied with the perception of movement). A researcher decoding animate/inanimate information from neuroimaging data may misinterpret the dorsal visual stream as an object classifier, while in reality the decodability is just an epiphenomenon caused by the objects' movement. Grootswagers et al. (2018) put this idea to the test by decoding, amongst others, animate vs. inanimate information from fMRI data recorded all over the brain while human subjects were classifying photographs of objects. While they found multiple areas within the cortex that show differences between animate and inanimate activation patterns, only the read-outs from area IT were able to predict categorization behaviour. The question then becomes: do representations in DCNNs not only resemble representational geometries in human area IT, but are they, like IT, also capable of explaining behavioural data?

Distance-to-boundary measure. Just like in multi-voxel pattern analysis, Grootswagers et al. (2018) train a linear decoder to discriminate between neural patterns produced by different experimental conditions. The classifier's decision boundary in high-dimensional activation space is then used to estimate the certainty with which an activation pattern belongs to a category. Stimuli that are closer to this boundary in representational space are less easily distinguishable than those farther away (see also figure 1). The approach is inspired by signal detection theory (SDT) from classical psychophysics (Green & Swets, 1966). In SDT, a signal is easiest to detect when its output distribution is distant from the decision criterion. The easier it is to detect, the faster, more accurate and more confident the judgements about the nature of the signal will be. The distance-to-boundary measure in activation space is therewith, in some sense, a multi-dimensional generalization of SDT (Ritchie & Carlson, 2016). When SDT was developed, the perceptual evidence utilized by the observer was thought to be connected to some unknown neural correlate (Swets et al., 1961; Werner & Mountcastle, 1963). Boundaries in neural representational patterns may corroborate this assumption made half a century ago. Generalizing SDT to our multidimensional activation space, activation patterns distant from the classifier's boundary are more clearly attributable to a certain category; accordingly, reaction times (RTs) should decrease while classification accuracy increases. Distances to the boundaries in activation space should thus be negatively correlated with RTs and positively correlated with categorization accuracies.
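As a toy illustration of this logic (not the study's actual pipeline; all data and names below are synthetic), the following sketch trains a linear decoder on two 2D "activation" clusters and derives the signed distance-to-boundary just described:

```python
# A minimal 2D sketch of the distance-to-boundary idea (cf. figure 1).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two toy "activation" clusters: inanimate (label 0) and animate (label 1).
X = np.vstack([rng.normal(-1.0, 0.6, (50, 2)),   # inanimate
               rng.normal(+1.0, 0.6, (50, 2))])  # animate
y = np.repeat([0, 1], 50)

svm = LinearSVC(C=1.0).fit(X, y)

# Signed distance of each sample to the decision hyperplane w.x + b = 0.
raw = svm.decision_function(X)               # > 0 predicts animate
dist = np.abs(raw) / np.linalg.norm(svm.coef_)  # geometric distance
correct = (raw > 0).astype(int) == y
signed_dist = np.where(correct, dist, -dist)  # negative iff misclassified

# SDT prediction: larger signed distance -> faster, more accurate decisions,
# i.e. a negative rank-correlation with RTs (tested with real RTs below).
print(signed_dist[:5])
```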

Carlson et al. (2014) were the first to come up with this method of linking representations and perceptual decision-making behaviour and tested it using neuroimaging techniques. They found that distance-to-boundary measures from a classifier trained on fMRI data from area IT were able to predict human accuracies and RTs during an animate vs. inanimate classification task. Ritchie et al. (2015) successfully transferred the method to MEG data.

The distance-to-boundary measure can also be applied to the activations of different layers of visual DCNNs. Eberhardt et al. (2016) used the activations of pre-trained networks of different depths to train linear support vector machines (linear SVMs) to discriminate between animate and inanimate stimulus representations. They found the distances between boundary and stimulus representations to be correlated with human accuracy data from a perceptual decision task using the same stimuli. This correlation increased with model depth, peaking at 0.7 relative network depth and subsequently declining until the output layer was reached.

Animate vs. inanimate object representations. The previously cited experiments train their decoders to discriminate between animate and inanimate representations. The animate/inanimate distinction has been repeatedly shown to be at the top level of the hierarchical organization of category representations in human IT.


Figure 1: A 2D activation space for object representations. A decoder (linear SVM) trained on animate/inanimate discrimination draws a decision boundary into activation space. We project the activation patterns of individual stimuli onto a discriminant axis, orthogonal to the decision boundary, to determine their distance to the boundary. We consider the signed distance from the decision boundary as a measure of how ambiguous the perceptual evidence from a sample is and therefore how confident the model is in the classification. Image a) is correctly classified as an animate exemplar; a high distance value indicates high confidence. b) is correctly classified as inanimate, while the negative sign for c) indicates a misclassification.

Warrington & Shallice (1984) discovered a double dissociation between semantic memory for animate and inanimate objects in semantic anomia patients that suffered brain lesions due to encephalitis caused by the type 1 herpes simplex virus. Warrington and Shallice's patients had problems naming or visually identifying either animate or certain groups of inanimate objects, depending on the part of area IT that was lesioned. The dissociation is supported by Kiani et al. (2007) using neural recordings of monkey IT cells and by Kriegeskorte (2009) using the same dataset and comparing it to fMRI data from human IT. While animate objects form a relatively closely related category, inanimate objects span every other category, putting, e.g., tools, fruits and vegetables, and buildings into a single group. Due to their exclusive definition, inanimate object representations are usually less homogeneous, reflected in smaller representational similarity. Accordingly, Carlson et al. (2014) and Ritchie et al. (2015) found that the predictability of behaviour using distance-to-boundary measures in human area IT was largely driven by the animate portion of the dataset. A decision boundary thus seems to be less meaningful for the negated, exclusively defined category. If DCNNs pose a good model of ventral stream representations, we would expect a stronger correspondence between model prediction and human behaviour for animate compared to inanimate stimuli.

The current study lays out the steps of an analysis to estimate RT and accuracy data of humans performing an animate/inanimate rapid categorization task from raw neural network activations. To this end, we use two deep convolutional neural networks trained on different image sets and compare the predictions calculated from their representational geometries to two different behavioural datasets, using the previously described distance-to-boundary measure.

2 Methods

2.1 Ethical approval

Ethical approval for the behavioural datasets used here was granted by the Cambridge Psychology Research Ethics Committee and Brown University's IRB, respectively. For the study at hand, no ethical approval was needed due to its exclusively computational nature.


2.2 Materials

2.2.1 Datasets for neural network training

Prior to analysis, we trained two separate neural networks on different large-scale image datasets. The first network was trained on the widely used ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset (Russakovsky et al., 2015), comprising >1.28 million images in 1000 image categories with 732–1300 images per category in the training set. The selection of ImageNet categories is intended to allow challenging fine-grained categorization at a subordinate level. For example, the ILSVRC-2012 subset comprises 120 different dog breeds. In this way, ImageNet lends itself to revealing subtle differences in categorization performance between models, but it results in a biased collection of categories.

The second dataset used for network training was eco-set (Mehrer et al., 2017), comprising >1.5 million images in 565 image categories. The training set contains between 600 and 4900 images per category. The motivation behind the design of eco-set is to more closely match the natural human visual diet. To achieve better ecological validity, Mehrer et al. (2017) selected categories to be concrete rather than abstract (based on Brysbaert et al. (2014)) and to be more frequent in linguistic occurrence statistics (using the British National Corpus). In this way, training models on eco-set promises to better capture human representational geometries of complex visual objects than training on a set of images with an allegedly smaller ecological relevance, such as ImageNet.

2.2.2 Neural Networks

Figure 2: An illustration of the deep convolutional neural network architecture used. The 128 x 128 pixel input with 3 colour channels is processed in 7 successive feed-forward convolutional layers alternating with intermediate 2x2 max-pooling layers. Convolutions use a stride of 1. The network's output is given by a softmax function with 565 dimensions for eco-set and 1000 dimensions for ImageNet.

The neural networks were trained using Python's open-source software library TensorFlow (Abadi et al., 2016).

The networks were composed of seven convolutional layers with decreasing kernel sizes and a max-pooling operation after every convolutional layer (see figure 2 for a more detailed description of the model architecture). The readout layer consisted of a softmax function with 1000 or 565 dimensions for ImageNet and eco-set, respectively. During training, the networks were regularized using Gaussian dropout. Instead of setting the activation of some units to zero as in Bernoulli dropout, Gaussian dropout adds multiplicative noise to every unit in the network (Srivastava et al., 2014). The variance we used in Gaussian dropout (σ² = 0.25) corresponds to a keep-probability of p = 0.8 in Bernoulli dropout (via σ² = (1 − p)/p; Srivastava et al., 2014). Weights and biases were initialized using the Xavier initializer (Glorot & Bengio, 2010). To avoid representational differences that might emerge from different initial weight settings, they were set to be the same for both networks. The learning rate was adapted during training using the Adam optimizer (Kingma & Ba, 2014). The distribution of training samples over the categories in the training sets of both image sets is not uniform. To adjust for the different number of images per category, the loss function was weighted by the inverse of the total number of images of a given category. Both networks reached convergence after being trained for 60 epochs.
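A minimal sketch of such a network in modern tf.keras follows. The original models were built directly in TensorFlow, and the exact kernel sizes and feature counts are not reported here, so the values below are placeholders (the feature counts are merely chosen to be consistent with the pre-PCA dimensionalities in table 1):

```python
import tensorflow as tf

def build_network(n_classes=565):          # 565 for eco-set, 1000 for ImageNet
    inp = tf.keras.Input(shape=(128, 128, 3))
    x = inp
    filters = [96, 128, 256, 512, 512, 1024, 1024]  # placeholders
    kernels = [7, 7, 5, 5, 3, 3, 3]                 # "decreasing kernel sizes"
    for f, k in zip(filters, kernels):
        # Convolution with stride 1 and Xavier ("Glorot") initialization.
        x = tf.keras.layers.Conv2D(f, k, strides=1, padding='same',
                                   activation='relu',
                                   kernel_initializer='glorot_uniform')(x)
        # Gaussian dropout: rate 0.2 gives multiplicative noise with variance
        # rate / (1 - rate) = 0.25, i.e. the reported sigma^2.
        x = tf.keras.layers.GaussianDropout(0.2)(x)
        x = tf.keras.layers.MaxPool2D(pool_size=2)(x)  # 2x2 max-pooling
    out = tf.keras.layers.Dense(n_classes, activation='softmax')(
        tf.keras.layers.Flatten()(x))
    return tf.keras.Model(inp, out)

model = build_network()
model.compile(optimizer=tf.keras.optimizers.Adam(),  # adaptive learning rate
              loss='sparse_categorical_crossentropy')
# The class imbalance can be handled by weighting the loss, e.g. (hypothetical
# n_images lookup): model.fit(data, epochs=60,
#                             class_weight={c: 1 / n_images[c] for c in range(565)})
```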

2.2.3 Behavioural data

The behavioural data used in this study were provided by collaborators (Charest, in preparation, and Eberhardt et al., 2016). Behavioural dataset (1) consisted of reaction times (RTs) of twenty participants who were asked to categorise 96 colour photographs (250 x 250 pixels) of isolated real-world objects on a grey background (see figure 3a) according to the animate/inanimate dichotomy. Every subject completed 6 trials for every image, resulting in a set of 576 RTs per subject. The 96 images are a subset of the stimuli used in Kiani et al. (2007), displaying 48 animate and 48 inanimate objects, and were first used in this composition by Misaki et al. (2010). Of these 96 stimuli we analyzed a subset of 92 used by many other studies in the field (i.a. Jozwik et al. 2017; Khaligh-Razavi & Kriegeskorte 2014; Kriegeskorte et al. 2008).

Prior to analysis, behavioural dataset (1) was pre-processed by excluding RTs of misclassified images (about 5% of trials) and the four unused stimuli mentioned above. Additionally, the RTs from experimental trials using the same stimuli were averaged within and across subjects to achieve a greater signal-to-noise ratio.

Behavioural dataset (2) consisted of the averaged accuracies of 281 participants who were asked to categorise 1500 grey-scaled photographs (256 x 256 pixels) of real-world objects on natural backgrounds according to the animate/inanimate dichotomy (see figure 3b). All images were sampled from ImageNet; image categories were 50% animate and 50% inanimate and further balanced across 14 basic categories (using the invertebrate, bird, amphibian, fish, reptile, mammal, domestic cat, dog, structure, instrumentation, consumer goods, plant, geological formation, and natural object subtrees of ImageNet). Data were collected using the Amazon Mechanical Turk (AMT) platform. Each AMT worker classified 271 images on average, resulting in 76,201 responses in total. On average, the accuracy of a single image was thus based on 51 responses. For more details of the experimental design and procedure we refer to Eberhardt et al. (2016).

Figure 3: Sample of experimental stimuli. (a) exemplifies inanimate and animate images used to obtain behavioural dataset (1), (b) shows inanimate and animate images used for behavioural dataset (2).

2.2.4 Data analysis

The overall goal of our data analysis was to predict behavioural data from the distance-to-boundary measure. To do so, we extracted activations from our neural networks, reduced them in dimensionality, fitted a linear decoder on these activations and calculated distances between the decoder hyperplane and the networks' activations in response to our experimental stimuli. Afterwards, we estimated how well the distances-to-boundary explain human behavioural data.

The dimensionality of the network activations in response to 6,600 photographs (coloured for dataset 1; grey-scaled for dataset 2) of real-world objects (50% animate, 50% inanimate) was reduced by performing principal component analysis (PCA, scikit-learn implementation (Pedregosa et al., 2011)) across images for every layer, thereby preserving 98% of the variance.

Dimensionality reduction is necessary since otherwise the excessive number of features would make our classifier prone to overfitting and consequently its hyperplane less meaningful. Further, generalizing SDT to a multi-dimensional space requires the dimensions to be independent of each other (Ritchie & Carlson, 2016). While the dimensions of the original activation space might be highly correlated, principal components are by definition uncorrelated.
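A minimal sketch of this step with scikit-learn, using random stand-ins for the real layer activations (variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_layer(train_acts, stim_acts, var_kept=0.98):
    """Fit PCA on the independent image set and project the experimental
    stimuli onto the same components, keeping 98% of the variance."""
    pca = PCA(n_components=var_kept, svd_solver='full')  # float: variance fraction
    train_pcs = pca.fit_transform(train_acts)
    stim_pcs = pca.transform(stim_acts)  # same components for the stimuli
    return train_pcs, stim_pcs

# Toy stand-ins for one layer's activations (n_images x n_units):
rng = np.random.default_rng(1)
train_pcs, stim_pcs = reduce_layer(rng.normal(size=(660, 400)),
                                   rng.normal(size=(92, 400)))
print(train_pcs.shape, stim_pcs.shape)
```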

To obtain decision boundaries in activation space, we trained a support vector machine (SVM, scikit-learn implementation (Pedregosa et al., 2011)) on the activations of every convolutional layer to perform a binary classification task (animate/inanimate). The SVMs used a linear kernel, since prior research has shown that decoders using more complex kernels for read-outs on neuroimaging data are likely to yield worse results due to overfitting (Misaki et al., 2010). To further reduce overfitting, the SVMs were l2-regularized, with C-parameters optimized by an interval-halving procedure using 20-fold random-split cross-validation. Subsequently, the activations for both experimental stimulus sets were reduced in dimensionality by applying the same principal components as for the SVM training set, and distances-to-boundary were calculated. Distances-to-boundary were coded positive if the respective SVM classified the sample correctly and negative if it was classified wrongly. The obtained distances were then correlated with the pre-processed behavioural data using Spearman rank correlations.
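The following sketch approximates this pipeline for a single layer. The toy data and the plain grid search (standing in for the interval-halving C optimization) are assumptions, not the original implementation:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.svm import LinearSVC

# Toy stand-ins: PCA-reduced activations, animacy labels and mean RTs.
rng = np.random.default_rng(2)
train_pcs, train_labels = rng.normal(size=(600, 50)), rng.integers(0, 2, 600)
stim_pcs, stim_labels = rng.normal(size=(92, 50)), rng.integers(0, 2, 92)
rts = rng.normal(600, 80, 92)  # hypothetical mean reaction times (ms)

# l2-regularized linear SVM; C tuned with 20-fold random-split cross-validation.
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
search = GridSearchCV(LinearSVC(penalty='l2'),
                      {'C': np.logspace(-4, 2, 7)}, cv=cv)
search.fit(train_pcs, train_labels)
svm = search.best_estimator_

# Distance to the hyperplane, coded positive iff classified correctly.
raw = svm.decision_function(stim_pcs)
dist = np.abs(raw) / np.linalg.norm(svm.coef_)
correct = (raw > 0).astype(int) == stim_labels
signed_dist = np.where(correct, dist, -dist)

# Rank-correlation with behaviour; a negative rho is expected for RTs.
rho, p = spearmanr(signed_dist, rts)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```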

Finally, the explanatory power of the networks across all layers was estimated by linearly combining the distances of all layers to predict the behavioural data. We did so by fitting a general linear model (GLM) to predict the behavioural data using the distance-to-boundary measures of the different layers as predictors. The resulting beta-parameter distribution can be used to compare the explanatory power of different layers. To interpret and compare beta values of different features with each other, it is necessary for them to have the same mean and variance, since different scaling in the features would lead to the same scaling differences in the betas, rendering them incomparable. For this reason, we z-transformed the distances-to-boundary within layers before regression.
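A minimal sketch of this linear combination, again with synthetic stand-ins for the per-layer distances and the behavioural vector:

```python
import numpy as np
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

# Toy stand-ins: one signed-distance vector per layer, one behavioural vector.
rng = np.random.default_rng(3)
dists = {layer: rng.normal(size=92) for layer in range(1, 8)}
behaviour = rng.normal(size=92)

# z-transform distances within layers so the betas are comparable across layers.
X = np.column_stack([zscore(dists[layer]) for layer in sorted(dists)])

glm = LinearRegression().fit(X, behaviour)
print("betas per layer:", np.round(glm.coef_, 2))
print("explained variance (R^2):", round(glm.score(X, behaviour), 2))
```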

All analysis steps were identical for both behavioural datasets, except that activations from the coloured image set were used to estimate the principal components and train the SVMs for behavioural dataset (1), and activations from the same but grey-scaled image set for behavioural dataset (2).

3 Results

Both networks converged after being trained for 60 epochs on their respective training sets, bringing the top-1 training classification error down to 39.4% for both networks, and the error on the validation set to 43.5% for the eco-set-trained network and 47.7% for the ImageNet-trained network, respectively.

After convergence, both neural networks' activations in response to the independent stimulus set of 6600 images were extracted to be used later for our decoder training. PCA was applied to every layer separately to reduce the dimensionality of the activation space, keeping 98% of the activation variance in each layer (for the dimensionality of every layer pre- and post-PCA see table 1). Remarkably, all early and intermediate layers can be described by approximately the same number of dimensions, independent of their initial dimensionality. This means that the principal components of the first layer of an exemplar network preserve 98% of the variance in the activations using only 0.33% of the initial dimensionality, compared to 4.4% in layer 4 and 55.3% in layer 7. Early layers thus show an extreme redundancy in their units. It is an interesting question where this redundancy comes from, since reducing the input image size (and thereby the number of units in early layers), as well as reducing the number of features or the kernel size of early layers, is likely to reduce network performance.

Dimensionality of activation spaces

layer   dimensions pre-PCA   post-PCA (coloured images)   post-PCA (grey-scaled images)
1       1,572,864            5,309                        5,144
2         524,288            5,724                        5,620
3         262,144            5,821                        5,723
4         131,072            5,807                        5,687
5          32,768            5,183                        4,991
6          16,384            4,525                        4,239
7           4,096            2,330                        2,185

Table 1: Dimensionality of the activation spaces of the eco-set-trained DCNN before and after projection onto the principal components.



Figure 4: Illustration of the regularized decoder performances (linear SVMs) trained on the different layers' activation patterns, for both neural networks and both the coloured and grey-scaled image sets. Performance estimates were obtained using 20-fold random-split cross-validation.

The next step was to calculate a hyperplane in activation space. Support vector machines (SVMs) were trained as decoders on the activations of every layer of both networks to perform a binary classification task (animate/inanimate). After l2-regularization, the decoders show an increase in performance with network depth (see figure 4). Performances of the decoders were assessed using nested twenty-fold cross-validation with random splits of the dataset into training and validation sets, retraining and validating the decoder every time (as described in Varoquaux et al. 2017). This cross-validation method has been described as more stable and unbiased compared to the popular 'leave-one-out' strategy (see Varoquaux et al. 2017). The results were validated using l2-regularized logistic regression. Predictably, the decoders trained on activations of coloured images have an advantage over those trained on grey-scaled images, by having access to additional colour information. Decoder performance increases monotonically with model depth, with no difference between networks. Model confidence for individual stimuli was estimated by the distance to the SVM hyperplane. Distances were signed such that positive values indicate correct classification and negative values indicate a misclassification; small values indicate low model confidence in the classification, while high values indicate high model confidence (according to Ritchie & Carlson 2016; see also figure 1).

We computed layer-wise rank correlations between the distances-to-boundary and both behavioural datasets. The rho coefficients are shown in figure 5. Behavioural dataset (1) showed relatively high correlations in early layers, increasing up to layer 5, to then drop off. This finding is in line with Eberhardt et al. (2016), who demonstrated that correlations of model confidence and human accuracies increase up to 0.7 relative model depth, to then decline again. Splitting up the data reveals that the higher correlations in intermediate layers are attributable to the animate portion of the dataset, while the inanimate portion mainly drives the drop-off in later layers.

Behavioural dataset (2) exhibits small correlations in early layers, which increase with model depth and level off in the final layers of the network. Interestingly, in contrast to the results from behavioural dataset (1), splitting the data shows the correlations of both portions of the dataset increasing with network depth, with animate correlations moderately outperforming their inanimate counterparts. To estimate how well the network as a whole is able to describe the behavioural data, we linearly combined the individual layers' distances to predict behaviour. The explained variance of the linear combination is shown in figure 6. Similar to the rank correlations, the separation of the datasets into animate and inanimate parts exposes that explanatory power is mainly driven by the animate portion, just like in humans (see e.g. Carlson et al. 2014; Ritchie et al. 2015). Additionally, we see that the networks seem to be able to better explain behavioural dataset (1). However, the results for this dataset also seem far less consistent, since we find large differences between networks.

Figure 5: (a) and (b) illustrate rank correlations between model confidence, estimated by the distance-to-boundary, and behavioural data (RTs and accuracies, respectively) for both behavioural datasets and both neural networks. (c) and (d) show the beta values of GLMs trained to predict the behavioural data based on the different layers' distances-to-boundary. Correlations and betas for animate/inanimate stimuli are either displayed combined (black) or separated by category affiliation (green for animate; red for inanimate).

The inconsistency of behavioural dataset (1) repeats itself in the beta coefficients (figure 5c/d). Behavioural dataset (1) again shows less congruence between networks and between the animate/inanimate divisions. In behavioural dataset (2), beta values approximately increase with model depth, meaning that later layers have higher explanatory power for the behavioural data. This matches our prior observation that the correlation between distances-to-boundary and subjects' averaged accuracies increases over layers. A commonality across the beta distributions can be seen in the peak at layer 5, suggesting that this layer adds great explanatory merit.


Figure 6: Illustration of how much variance in the behavioural data the distances-to-boundary from all layers are able to explain when linearly combined, for both networks and both behavioural datasets. The amount of explained variance is mainly driven by the animate portion of the dataset.

(10)

4 Discussion

Models in neuroscience are an essential means to understand how the brain processes information. Recently, biologically-inspired DCNN algorithms from the domain of artificial intelligence entered neuroscience as computational models of visual processing in humans. Previous research has shown that object representations in DCNNs trained on visual object recognition bear greater resemblance to human ventral stream representations than any earlier ventral stream model, especially for higher processing stages in area IT (Khaligh-Razavi & Kriegeskorte, 2014).

The goal of the current study was to substantiate this finding by investigating whether representations in DCNNs can explain behavioural data in humans. For this purpose, we trained two DCNNs on different large-scale image databases (ImageNet and eco-set) and predicted human behavioural data based on their inner object representations, using a distance-to-decoder-hyperplane measure (Carlson et al., 2014).

We found that the distance-to-boundary measure in DCNNs can explain the accuracies and RTs of humans in rapid visual categorization. Predictive power increases with model depth and peaks at the intermediate layer 5 for behavioural dataset (1), while increasing monotonically for behavioural dataset (2). Additionally, we observed peaks in explanatory power (through the beta values of the linear combination of layers) in the same layer for both behavioural datasets. The correlation peak is in line with prior findings of Eberhardt et al. (2016), who reported peaks in correlations of DCNN representations and human behavioural data at about 0.7 of relative network depth.

Eberhardt et al.'s explanation for this is that human rapid categorization utilizes representations of intermediate processing stages and that later layers of DCNNs exceed the complexity of those representations. This explanation, however, is opposed by the observation that read-outs from human area IT are better at predicting human behavioural decision-making than intermediate processing stages (Carlson et al., 2014). At the same time, IT shows greater representational similarity with deeper, not intermediate, DCNN layers (Cichy et al., 2016a,b; Khaligh-Razavi & Kriegeskorte, 2014). This suggests that later layers of a DCNN model should be better at explaining behavioural data. So why do we observe peaks in intermediate and not the final layers? Considering the decoder performances, we see a monotonic increase with network depth (figure 4). If categorization performance were the only optimization criterion throughout the whole ventral stream, the final layers with the highest categorization performance should also be best at predicting human behavioural data, which contradicts our findings. Thus, it appears that representational geometries in IT are constrained by objectives other than the optimization of categorization performance.

Huth et al. (2012) show that besides shape information, semantic knowledge might be represented in IT as well. Currently, DCNNs can solely represent shape information, since they are neither constrained by causal relations of objects in the real world that would give them access to semantic knowledge, nor is any context provided in the training images that would help in learning semantic relations (objects are mostly presented in cropped images with little context information, such as co-occurrence with other objects). This lack of top-down knowledge about object relations, which shapes human IT representations, may be the missing constraint for later network layers to develop more human-like, partly semantic representations. Partly semantic representations in the final layers of DCNN models could help straighten out the dip in the final layers' predictive power.

If the DCNN has no access to semantic information, how do we explain that animate object representations are better at explaining behavioural data than their inanimate counterparts in both datasets? Animacy is a semantic concept and, in a shape-based system, should not cluster together any better than inanimacy. However, animate objects share more perceptual features than inanimate objects, such as fur, eyes, and body shape. These features can make animate object representations more homogeneous at an intermediate level while being based purely on object shape.

Activations in adjacent neural network layers are usually highly correlated, since they are directly dependent on each other. Simple correlations of individual layers' distances therefore give us no estimate of how much of the behavioural data our network can explain as a whole, since different layers are likely to explain similar portions of the variance. Using a linear combination of all layers' distances, we demonstrated how much overall variance the network model can account for. The explanatory power of the whole network exceeds that of all individual layers by a considerable margin, suggesting that different layers explain different parts of the behavioural data. Since all distances were normalized before linearly combining the different layers, the beta values reflect the respective layer's influence on the prediction of the GLM, attributing greater predictive power to layers with high beta values compared to layers with low values.

Considering the beta values in figure 5c/d, we see for behavioural dataset (1) that early layers already have a large influence on the model's prediction, while later layers contribute significantly less. For behavioural dataset (2), on the other hand, the influence of the different layers on the model's prediction approximately increases with network depth. The differences between the datasets may be attributable to differences in the experimental stimuli. In behavioural dataset (1), objects are separated from their natural background, while in behavioural dataset (2) objects are shown in context. The context causes great amounts of object-unrelated noise in the activations of early layers of the DCNN, making the classification of the image more complicated. In behavioural dataset (1), all activations are object-related and therefore may be easier to classify based on the noise-sensitive earlier layers. Besides this difference between the datasets, we can observe similar peaks in beta values at layer 5 and a dip at layer 6 in both. These observations, however, are difficult to interpret without deeper knowledge about what kind of computations are performed in these layers. Taking a look into the 'black box' through visualization techniques (e.g. deconvolution; Zeiler & Fergus 2013) could help us understand what kind of features are processed in layers 5 and 6.

By describing the higher-level visual representations of area IT using DCNNs and using these descriptions to explain decision making, computational visual neuroscience is beginning to connect with modeling research in other domains. One of the most common classes of models in perceptual decision making are sequential sampling models (SSMs), with drift diffusion models considered a 'standard' model in decision-making neuroscience (Forstmann et al., 2016). In SSMs, noisy perceptual evidence is accumulated until a threshold is crossed and a decision is reached. SSMs have few free parameters and are considered a natural extension of SDT (Pleskac & Busemeyer, 2010), of which the distance-to-boundary measure used here is a generalization. This contact point between the paradigms offers a promising opportunity to merge the fields. The advantage of SSMs over the plain correlations or linear regression models used in the current study is their ability to estimate multiple behavioural variables at once (e.g. reaction times and accuracies), thereby also taking interactions between those variables into account (e.g. the speed-accuracy trade-off in perceptual decision tasks; Heitz 2014). Tremel & Wheeler (2015) estimate human reaction times in image classification based on brain activity of area IT recorded in fMRI BOLD responses using SSMs. As suggested by Carlson et al. (2014), representational distances-to-boundary may be translated into free model parameters of SSMs (e.g. drift rate, distance to decision threshold), thereby integrating the different frameworks.
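As a purely hypothetical illustration of this proposed link (no such analysis was performed in this study), the sketch below maps a distance-to-boundary value onto the drift rate of a simple drift-diffusion process and shows the expected effect on simulated accuracies and RTs:

```python
import numpy as np

def simulate_ddm(drift, threshold=1.0, noise=1.0, dt=0.001, max_t=3.0, rng=None):
    """Accumulate noisy evidence until +/- threshold is crossed; return
    (decision, reaction time in seconds)."""
    rng = rng or np.random.default_rng()
    x, t = 0.0, 0.0
    while abs(x) < threshold and t < max_t:
        x += drift * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    return (1 if x > 0 else 0), t

rng = np.random.default_rng(4)
# Larger distance-to-boundary -> larger drift -> faster, more accurate choices.
for dist in (0.2, 1.0, 3.0):
    outcomes = [simulate_ddm(drift=dist, rng=rng) for _ in range(200)]
    acc = np.mean([o[0] for o in outcomes])
    rt = np.mean([o[1] for o in outcomes])
    print(f"distance {dist}: accuracy = {acc:.2f}, mean RT = {rt:.2f}s")
```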

Another interesting topic for future research might be the use of dynamic, recurrent neural networks. The current study as well as prior research have shown that feedforward DCNNs can explain neuroimaging and behavioural data from rapid categorization tasks. In real life, such tasks are the exception rather than the rule. For recognition in a constant stream of visual input, anatomical (Felleman & Van Essen, 1991; Sporns & Zwi, 2004) and functional (Sugase et al., 1999; Brincat & Connor, 2006; Freiwald & Tsao, 2010; Carlson et al., 2013) evidence suggests that vision is a dynamic, recurrent process that unfolds over time. Recurrent processing is thought to be of special merit when objects are presented in suboptimal, degraded conditions (Wyatte et al., 2014), as they commonly are in everyday life. Utilizing DCNNs with local recurrent connections (see e.g. Spoerer et al. 2017) may therefore capture human visual object recognition in more ecologically plausible situations. Additionally, since evidence in those models builds up over multiple time steps, they may be a natural match for the previously mentioned perceptual decision-making theory of evidence accumulation.


References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., & Research, G. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

Brincat, S. L. & Connor, C. E. (2006). Dynamic Shape Synthesis in Posterior Inferotemporal Cortex. Neuron, 49(1), 17–24.

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911.

Cadieu, C., Kouh, M., Riesenhuber, M., & Poggio, T. (2004). Shape representation in V4: Investigating position-specific tuning for boundary conformation with the standard model of object recognition. Journal of Vision, 5(8), 671–671.

Carlson, T., Tovar, D. A., Alink, A., & Kriegeskorte, N. (2013). Representational dynamics of object vision: The first 1000 ms. Journal of Vision, 13(10), 1–1.

Carlson, T. A., Ritchie, J. B., Kriegeskorte, N., Durvasula, S., & Ma, J. (2014). Reaction Time for Object Categorization Is Predicted by Representational Distance. Journal of Cognitive Neuroscience, 26(1), 132–142.

Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016a). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6(1), 27755.

Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016b). Deep Neural Networks predict Hierarchical Spatio-temporal Cortical Dynamics of Human Visual Object Recognition.

de Wit, L., Alexander, D., Ekroll, V., & Wagemans, J. (2016). Is neuroimaging measuring information in the brain? Psychonomic Bulletin & Review, 23(5), 1415–1428.

Eberhardt, S., Cader, J., & Serre, T. (2016). How Deep is the Feature Analysis underlying Rapid Visual Categorization?

Felleman, D. J. & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, N.Y. : 1991), 1(1), 1–47.

Forstmann, B., Ratcliff, R., & Wagenmakers, E.-J. (2016). Sequential Sampling Models in Cognitive Neuroscience: Advantages, Applications, and Extensions. Annual Review of Psychology, 67(1), 641–666.

Freiwald, W. A. & Tsao, D. Y. (2010). Functional Compartmentalization and Viewpoint Generalization Within the Macaque Face-Processing System. Science, 330(6005), 845–851.

Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.

Goodale, M. A. & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in neurosciences, 15(1), 20–5.

Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks.


Green, D. & Swets, J. (1966). Signal detection theory and psychophysics. Oxford: John Wiley.

Grootswagers, T., Cichy, R. M., & Carlson, T. A. (2018). Finding decodable information that can be read out in behaviour. NeuroImage, 179, 252–262.

Gross, C. G., Rocha-Miranda, C. E., & Bender, D. B. (1972). Visual properties of neurons in inferotemporal cortex of the Macaque. Journal of Neurophysiology, 35(1), 96–111.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.

Heitz, R. P. (2014). The speed-accuracy tradeoff: history, physiology, methodology, and behavior. Frontiers in Neuroscience, 8, 150.

Hubel, D. H. & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional archi-tecture in the cat’s visual cortex. The Journal of physiology, 160(1), 106–54.

Hung, C. P., Kreiman, G., Poggio, T., & DiCarlo, J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science (New York, N.Y.), 310(5749), 863–6.

Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–24.

Jones, J. P. & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1187–1211.

Jozwik, K. M., Kriegeskorte, N., Storrs, K. R., & Mur, M. (2017). Deep Convolutional Neural Networks Outperform Feature-Based But Not Categorical Models in Explaining Object Similarity Judgments. Frontiers in Psychology, 8, 1726.

Khaligh-Razavi, S.-M. & Kriegeskorte, N. (2014). Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLoS Computational Biology, 10(11), e1003915.

Kiani, R., Esteky, H., Mirpour, K., & Tanaka, K. (2007). Object Category Structure in Response Patterns of Neuronal Population in Monkey Inferior Temporal Cortex. Journal of Neurophysiology, 97(6), 4296–4309.

Kietzmann, T. C., McClure, P., & Kriegeskorte, N. (2018). Deep Neural Networks in Computational Neuroscience.

Kingma, D. P. & Ba, J. (2014). Adam: A Method for Stochastic Optimization.

Kriegeskorte, N. (2009). Relating population-code representations between man, monkey, and computational models. Frontiers in Neuroscience, 3(3), 363–373.

Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., & Bandettini, P. A. (2008). Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey. Neuron, 60(6), 1126–1141.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, 1097–1105.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551.

Mehrer, J., Kietzmann, T., & Kriegeskorte, N. (2017). Deep neural networks trained on ecologically relevant categories better explain human IT.

Misaki, M., Kim, Y., Bandettini, P. A., & Kriegeskorte, N. (2010). Comparison of multivariate classifiers and response normalizations for pattern-information fMRI. NeuroImage, 53(1), 103–118.

Olshausen, B. A. & Field, D. J. (2005). How Close Are We to Understanding V1? Neural Computation, 17(8), 1665–1699.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

Pleskac, T. J. & Busemeyer, J. R. (2010). Two-stage dynamic signal detection: A theory of choice, decision time, and confidence. Psychological Review, 117(3), 864–901.

Riesenhuber, M. & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.

Ritchie, J. B. & Carlson, T. A. (2016). Neural Decoding and "Inner" Psychophysics: A Distance-to-Bound Approach for Linking Mind, Brain, and Behavior. Frontiers in Neuroscience, 10, 190.

Ritchie, J. B., Tovar, D. A., & Carlson, T. A. (2015). Emerging Object Representations in the Visual System Predict Reaction Times for Categorization. PLOS Computational Biology, 11(6), e1004316.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision.

Serre, T., Kouh, M., Cadieu, C., Knoblich, U., Kreiman, G., & Poggio, T. (2005). A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex.

Serre, T. & Riesenhuber, M. (2004). Realistic Modeling of Simple and Complex Cell Tuning in the HMAX Model, and Implications for Invariant Object Recognition in Cortex.

Spoerer, C. J., McClure, P., & Kriegeskorte, N. (2017). Recurrent Convolutional Neural Networks: A Better Model of Biological Object Recognition. Frontiers in psychology, 8, 1551.

Sporns, O. & Zwi, J. D. (2004). The Small World of the Cerebral Cortex. Neuroinformatics, 2(2), 145–162.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958.

Sugase, Y., Yamane, S., Ueno, S., & Kawano, K. (1999). Global and fine information coded by single neurons in the temporal visual cortex. Nature, 400(6747), 869–873.

Swets, J., Tanner, W. P., & Birdsall, T. G. (1961). Decision processes in perception. Psychological review, 68, 301–40.

Tremel, J. J. & Wheeler, M. E. (2015). Content-specific evidence accumulation in inferior temporal cortex during perceptual decision-making. NeuroImage, 109, 35–49.

Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2017). Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage, 145, 166–179.

Warrington, E. K. & Shallice, T. (1984). Category Specific Semantic Impairments. Brain, 107(3), 829–853.

Werner, G. & Mountcastle, V. B. (1963). The variability of central neural activity in a sensory system, and its implications for the central reflection of sensory events. Journal of Neurophysiology, 26(6), 958–977.

Wyatte, D., Jilk, D. J., & O’Reilly, R. C. (2014). Early recurrent feedback facilitates visual object recognition under challenging conditions. Frontiers in Psychology, 5, 674.


Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 111(23), 8619–24.
