
Title

Data-Driven Vision Research: Naturalistic versus Synthetic Research Paradigms

Question

Is sensory neuroscience ready to employ rich natural stimuli to enhance our understanding of the visual system? Computer vision models and their impact on visual neuroscience.

Simon Hofmann | Master Brain & Cognitive Sciences, Cognitive Science Track | July 05, 2017

Author | University of Amsterdam | Student ID: 10865284

Dr. Marcel A. J. van Gerven | Radboud University, Nijmegen
Supervisor | Donders Institute for Brain, Cognition and Behaviour | Computational Cognitive Neuroscience Lab

Dr. H. Steven Scholte | University of Amsterdam
Co-Assessor | Department of Brain & Cognition | Spinoza Centre


Abstract

Vision is the supreme perceptual modality for the majority of life forms. More than 50% of the human cortex is devoted to vision; consequently, the striate cortex is one of the best-studied areas in the human brain. Nevertheless, our knowledge about visual processes remains sparse. Models, derived from previous neuroscientific findings, were not able to scale up to real world scenarios. Meanwhile, the field of computer vision made a huge leap into a new era of neural network modelling. Current deep learning approaches surpass human-level performance in many vision tasks, as well as in tasks of other sensory domains. Although the biological validity of these models is still debated, recent research demonstrates that artificial neural networks explain more variance in the human visual system than previous models that were proposed by neuroscientists.

This review paper provides an overview of (1) current movements in sensory and computational neuroscience of vision and (2) the parallel rapid developments in artificial intelligence, with a focus on the success-story of the deep learning approach. The driving force behind this thesis is the ongoing debate about the usage of controlled synthetic stimuli versus rich naturalistic stimuli in experimental setups for vision research. It will be argued that the novel computational models cannot just be utilized in classical research paradigms, but moreover play a fundamental role for the field in moving towards incorporating more ecological validity. Furthermore, (3) a new development in visual media technology known as virtual reality (VR) will be discussed that can close the gap between classical and naturalistic paradigms. This data-rich technology allows for the employment of deep learning techniques and the presentation of stimuli ranging from simple to near-natural, while being fully controllable.


CONTENTS

Abstract
1. Introduction
1.1. A short history of vision research
1.2. Technological progress gives rise to algorithmic and virtual paradigms
2. Will we understand the visual system through synthesized visual stimuli?
2.1. How close are we to understanding the visual system?
2.2. The debate about synthetic controlled stimuli versus natural rich stimuli
3. Neural network models and their impact on vision research
3.1. Computers are getting close to human-level performance: Artificial neural network models of vision
3.2. Neural networks in computational neuroscience
4. Virtual Reality, controlled rich environments: a new potential for vision research
4.1. Bringing multimodal vision research to the third dimension
5. Conclusion
Literature


1. Introduction

1.1. A short history of vision research

541 million years ago, an outburst of life forms spread over the planet. The Cambrian explosion is today one of the big miracles in the evolutionary history of life. One of the leading explanations for this mass event is the development of vision (Parker, 2003), which allowed for extended exploration of the environment and heated up the arms race between predators and prey. Vision is the evolutionary success-story of perception, and it is present in nearly every higher life form. In humans, 50% of cortical tissue is devoted to vision (Nakayama et al., 1995). Consequently, human and non-human vision is one of the best-studied modes of perception, with theories going back to Euclid and Aristotle. During the Renaissance, it was Leonardo da Vinci who discovered the spatial properties of the ocular system and understood the difference between foveal and peripheral vision. Newton (1704) presented one of the first structured analyses of colour vision. And at the end of the 19th century, Hermann von Helmholtz conducted the first modern experiments on visual perception. The constraints of the ocular system he discovered led him to the conclusion that a crucial component of perceptual processes must be some form of unconscious inference (Helmholtz, 1860). Still today, Helmholtz's insights influence a dominant conceptualization of a functional principle of the brain known as predictive processing (Friston, 2003; Friston & Kiebel, 2009; Clark, 2013). Predictive processing is a framework which, on a lower level, explains redundancy reduction and efficient coding in neural information processes, but the theory also promotes the inference machinery of higher neural organizations (Rao and Ballard, 1999; Huang and Rao, 2011). Predictive processing is also an answer to the explanatory limitations of experimental techniques that thrived in the mid-20th century. In particular, the methodological progress of single-cell recordings in the 1960s gave rise to a more reductionist understanding of vision. With their pioneering studies, Hubel and Wiesel (1962, 1968) revealed the sensitivity of single neurons to visual stimulus features. Stimuli in these experiments were primarily simple bars, stripes or corners. Subsequent experiments of this kind tackled phenomena like stereopsis (Barlow et al., 1967) and colour vision (DeValois et al., 1967). Even higher-level visual information could be detected in single cells; for instance, in their famous study, Gross et al. (1972) found neurons, located deep in the visual stream of macaque monkeys, which responded to hands and hand-like features. But are such single-cell recordings sufficient to explain the function of the cortex, and ultimately vision?

Information theory account of vision

It was David Marr (1982) who generalized the findings of Hubel and Wiesel, claiming that cerebral neurons are flexible learners of the statistical structure of their input patterns. Nevertheless, he was critical of the reductionist conclusions many scientists drew based on these findings. In Marr's view, vision is an information-processing system, which can only be analysed and understood at three complementary levels:


(1) Computational level: what is done by the system and why.

(2) Algorithmic or representational level: how are the necessary computations performed? What representations are used and how are they manipulated?

(3) Implementational or physical level: the biological underpinnings of vision, e.g. neural structure.

These levels reflect how vision has been approached by many disciplines, such as philosophy, psychology, biology and others. Each discipline encompasses sound and robust theories about the phenomenological and optical sides of vision. Despite that, the underlying neural processes in the brain of the perceiver remain difficult to uncover. Marr's levels of analysis represent a useful and widely accepted distinction, which particularly guides computational theories in sensory neuroscience. But more crucially, his underlying assumption that visual perception is an information process has been strongly confirmed in recent years. This is mirrored in the breakthrough of modern computer vision, which has caused a flurry of scientific papers and a rapidly growing field of research.

We will see that sensory neuroscience has promising steps ahead, fostered by this development. However, the bright outlook is not just due to the rapid development of computational models, but is also nurtured by technological progress in general.

1.2. Technological progress gives rise to algorithmic and virtual paradigms

Major progress in computer vision models: the success story of deep neural nets

It is part of our modus operandi to copy successful adaptations of nature into our technological equipment. With the rise of computers, many scientists devoted their careers to letting these machines join the visual world (see: Papert, 1966). But biological vision evolved over millions of years in different species, and it is therefore not surprising that computer vision is a hard problem and far from trivial. However, in 2012, the field had a major breakthrough. With the employment of deep neural networks trained on huge datasets, machines were now able to find structure and invariances in their visual input and could accurately label new objects they had not seen before. The trained models are not just inspired by the human visual system (Fukushima, 1980), but also show comparable structure in their evolved feature space, as we will see later. This opens up a completely new range of research possibilities in sensory neuroscience. Where we face insurmountable hurdles in biological experiments, these new computer models enable us to probe theories of visual processing from an engineering perspective. At the same time, the field of artificial intelligence (A.I.) moves independently, but rapidly, forward, mostly driven by engineering goals such as higher accuracy rates and benchmark achievements. Nonetheless, the improvements brought by new architectures and processing strategies could, in the end, turn out to be very informative about biologically grounded visual information processing.


Trends in visual media technology: high-resolution stereoscopic vision with head-mounted virtual reality displays

At this point, another technological trend will be introduced, which is a promising candidate to merge elegantly with the data-hungry deep learning methodology, while providing a research toolset that could overcome many limitations of former vision research paradigms.

2012 was also the year of another technological change. Palmer Luckey developed his first prototype of head-mounted virtual reality (VR) glasses and founded the company Oculus VR. VR systems allow the user to enter and navigate through an environment virtually in a highly immersive way. Next to Oculus VR (now: Oculus Rift), other major tech companies are joining the market with their own head-mounted systems, spending billions of dollars on their development. Increasing processing power, following Moore's law, makes high-resolution 3D rendering of videos and virtual environments more and more feasible for portable units. At the same time, prices for VR glasses drop continuously. Consequently, the technology is becoming affordable for a wide consumer market. Media producers like the gaming and film industries use this trend and extensively create virtual content. Graphics engines are sometimes even freely accessible, and companies work hard on solutions for 360-degree camera systems.

As we will see later, these methodological and technological trends create entirely new research possibilities in sensory neuroscience and promise to enrich and deepen our understanding of vision as well as of the other modalities of perception and, more crucially, their symphonic interplay. In this paper, the focus will lie on visual perception, but analogous to the major progress there, significant steps have been made for other perceptual modalities as well. In the next chapter, we will ask how far we have come in understanding the visual system. We will encounter one of the fundamental debates of vision science, about the usage of experimental stimuli. Two articles, by Olshausen and Field (2005) and by Rust and Movshon (2005), which are exemplary for this debate, will be discussed. They represent the opinions of central figures in vision neuroscience regarding the state of their own field a decade ago. However, current methodological and technological progress could turn the tide in this debate.

2. Will we understand the visual system through synthesized visual stimuli?

2.1. How close are we to understanding the visual system?

More than a decade ago, in 2005, Olshausen and Field quantified the knowledge we have about the primary visual cortex (V1) in the occipital lobe. The authors claimed that we understand only around 10-20% of how V1 actually operates under natural conditions (Olshausen & Field, 2005). In their view, classical visual neuroscience faces five major problems that limit not just our understanding, but also future progress of the field. First, they argue that there is a biased sampling of neurons in the widely used paradigm of single-cell recordings. For instance, neurons with large cell bodies or higher firing rates are more attractive for examination, whereas neurons that do not show the expected behaviour might be discarded as being "visually unresponsive". Second, single-cell recordings ignore interdependencies between neurons in the striate cortex, i.e. V1, and the strong synchronous activities of cell populations, which are still poorly understood. It was shown that cell behaviour in V1 is highly influenced not just by the surrounding neurons, but also by cells in the lateral geniculate nucleus (LGN) (Peters & Payne, 1993) and by high-level contextual information about the natural scenery (Vinje & Gallant, 2000). The functionality of the different cell types (simple cells, complex cells) proposed by Hubel and Wiesel is, hence, not an inherent property of the cell itself, but emerges from the general encoding of visual information (Olshausen & Field, 1996; Hyvärinen & Hoyer, 2001). Third, the authors criticize the categorization into simple and complex cells as simplified and misleading. Fourth, they state that present models of vision generalize poorly to a diverse range of ecological conditions. This limitation is rarely reported, but of significant importance. Even if tested under natural conditions, the models are subject to interpretation, since the relevant responses are hardly traceable. Lastly, and in line with the previous points, Olshausen and Field question the experimental stimuli their predecessors predominantly used. After the great success of the studies by Hubel and Wiesel, reduced synthetic stimuli like bars, spots or stripes were common in experimental paradigms. One reason is that those low-dimensional stimuli induce robust narrow-band oscillations in neural populations. In contrast, naturalistic stimuli provoke time-varying and much more complex response patterns, which are barely tractable with conventional methods of analysis (Singer, 2013; Griffiths et al., 2016). However, simple stimuli characterize linear systems. According to Olshausen and Field1, successful generalization to real-world scenarios through an arbitrary combination of such stimuli is highly unlikely, since linear combinations would not allow for approximations of nonlinear, dynamical systems (see also Kayser et al., 2004). Already Reid et al. (1992) argued that the highly nonlinear processing of V1 can only be revealed when the nonlinearities of natural scenes are adopted in experimental designs (see also: Dan et al., 1996; Gallant et al., 1998). Evidence is also provided by a study of Lehky et al. (1992), who trained an artificial neural network (ANN) with simple to complex image patches that were also shown to monkeys. The network could predict firing rates of single neurons in the monkey striate cortex, in particular when presented with complex stimuli, in contrast to impoverished stimuli. Furthermore, the limitations of synthetic stimuli also become evident if one considers the sensitivity of neurons in V1 to visual properties of surfaces, like occlusion, figure-ground relationships and border ownership (e.g., Sugita, 1999). Surface processing is an essential part of perceiving the three-dimensional world (Gibson, 1950; Nakayama et al., 1995). On the one hand, perceiving surfaces is considered a simple bottom-up process, which allows for rapid visual processing. One argument in favour of this theory is the autonomously driven mechanism of figure-ground separation (e.g., see: Rubin, 1921; vase-face distinction), which is independent of object-level knowledge (Nakayama et al., 1995). On the other hand, information about object surfaces highly depends on top-down knowledge and the context of the scenery (Sekuler et al., 1994). Ambiguity must be resolved in feedback loops from higher-level processes (disambiguation)2.

Last but not least, Olshausen and Field stress the point that present models of V1 are not able to explain why visual representations are invariant to position, scale, rotation and other deformations. They conclude that the possibilities of experiments with synthetic, simplistic stimuli are nearly exhausted.

1 See also: Felsen & Dan (2005)

2 The predictive coding framework formulates exactly such feedback loops.


Consequently, the authors suggest an unbiased sampling of neurons of all types, investigating larger cell populations via multiunit recordings and, most crucially, testing those under natural viewing conditions. Ultimately, this requires more sophisticated and generalisable models to understand the feedback connections from other areas to the striate cortex.

The article was written more than a decade ago. Since then, new and insightful discoveries have been made, and we will see that some of the formulated demands are about to be satisfied. However, Olshausen and Field only addressed the primary visual cortex, which is, within the visual pathway, one of the best-studied brain areas. Further down the visual stream, our understanding becomes more and more sparse, due to increased multimodal and top-down influences, which are difficult to segregate. It also remains unclear how early visual low-level features (V1) are reassembled to create more high-level representations (Riesenhuber & Poggio, 1999). Hence, even a decade later we are far from having a comprehensive understanding of the visual system (Movshon & Simoncelli, 2014).

Nonetheless, in the same year that Olshausen and Field called for a paradigm shift, contrary voices were raised. Rust and Movshon (2005) advocated the indispensable and fundamental role of controlled, synthetic stimuli in experimental paradigms.

2.2. The debate about synthetic controlled stimuli versus natural rich stimuli

As we have just seen, proponents of the naturalistic approach like Olshausen and Field (2005) claim that experiments with synthetic and impoverished stimuli fail to reveal the mechanisms of natural vision, primarily because such stimuli ignore the rich interdependencies of statistical invariances in natural scenes. On the other hand, advocates of the classical view like Rust and Movshon (2005) argue that such methods have proven to reveal fundamental and widely accepted mechanisms of the primary visual cortex (V1), leading to what they call the "standard model". The authors admit that this model of visual cortical cells was in need of an update, which took place around 2005. The new "standard model" incorporates multiple initial filters, which are combined by a non-linearity. Another adaptation addresses the outdated Poisson spiking model, which is replaced by more realistic neural models, e.g. by Hodgkin and Huxley (1952), which allow for a description of the spike generation of neurons over short time scales (<50 ms). Rust and Movshon underline that the new "standard model" is built purely on studies that employed synthetic stimuli. By contrast, experiments with natural stimuli have not contributed to detecting fundamental visual mechanisms like gain controls for luminance, contrast or temporal dynamics, which are implemented in the model. Naturalistic paradigms usually apply either of the following two strategies. The first is to reverse-engineer brain mechanisms for visual coding that would best process natural scenes. One such approach is based on the sparse-coding conceptualization, that is, the reduction of redundancies in visual processing as a form of downsampling of the high-dimensional sensory input data (Atick, 1992; Olshausen & Field, 1996, 2004; Hyvärinen & Hoyer, 2001; Simoncelli & Olshausen, 2001). According to Rust and Movshon, the flaw is that it is far from clear what is to be optimized for. The second strategy tests the visual system directly, with natural images. The authors state that in light of the unknown statistical structures of naturalistic stimuli it is impossible to untangle which mechanisms in the visual system are elicited by them (but see: Scholte et al., 2009). A comprehensive neural model should be employed to analyse the experimental data, instead of being derived from the data. Rust and Movshon conclude that only experiments with controlled synthetic stimuli are able to refine the "new standard model" and consequently increase our understanding of mechanisms in the primary visual cortex.

As has been argued above, synthetic, impoverished stimuli have been a crucial tool to unveil early visual processes in V1. Nevertheless, even Movshon eventually came to the conclusion that more naturalistic stimuli are fundamental to examining processes further down the visual stream (Movshon & Simoncelli, 2014). Movshon and Simoncelli introduce a texture model, which is supposed to capture intermediate stages in the visual processing of V2. The texture stimuli of the experiment remain synthetically produced, but share statistical features of natural images. In their concluding remarks, the authors underline similarities to modern computer vision models with respect to the evolved visual features (see Chapter 3.2.2). But in comparison with such hierarchical models, it remains unclear how Movshon and Simoncelli want to integrate their approach into a more general framework of the visual pathway. In light of the vast spectrum of processing steps, their focus on the texture model as a fundamental building block of vision feels almost arbitrary. For instance, it has been shown earlier that surface processing precedes texture processing (Nakayama et al., 1995). Ultimately, the research presented here still resembles the classical mechanistic approach, which has so far not been able to reassemble its parts into a coherent model that holds under naturalistic conditions.

Another aspect of the debate is the unsolved question of why single neurons in the primary visual cortex respond with such high variability to classical, synthesized stimuli, whereas these neurons show sparse and temporally precise responses to natural stimuli (Froudarakis et al., 2014; Ravello et al., 2016; but see: Singer, 2013). Recently, Kremkow et al. (2016) claimed that their model clarifies this diverging behaviour, by revealing thalamic fast synaptic depression and push-pull inhibition as a form of effective feed-forward inhibition. This mechanism, also known from other sensory modalities, explains the contextual reshaping of V1. The findings underline that even primary visual processing is optimized for natural scenes rather than artificial stimuli and, more crucially, that this mechanism was discovered by studying V1 not with simple but with naturalistic stimuli. Pinto et al. (2008) raise awareness that, in order to scrutinize the visual system and visual models adequately, natural image sets should be chosen carefully. By taking real-world variations in object pose, position and scale into account, they could show that standard V1-like models were insufficient to explain the variance of the data.

Finally, in order to understand the visual system it is not enough to detect feature sensitivities, that is, a subfunction of sensory processing. It also must be shown how the brain solves fundamental visual operations like object recognition. Even Rust agrees that complex image datasets and modern algorithmic approaches are needed to make progress in understanding such capacities (DiCarlo, Rust et al., 2012). In the following, we will see that there are promising candidate algorithms, which are suggested to process visual information of natural scenes similarly to the human brain, with evolving visual features at different levels of abstraction (e.g., LeCun et al., 2015). In contrast to former vision models, these representations are not handcrafted anymore, but learned from naturalistic input. So far, we have seen that we are in the middle of a paradigm shift within sensory neuroscience. The proposal to turn to more naturalistic experimental setups, put forward by Olshausen & Field (2005) and others before (see Chapter 2.1), now finds more and more application in experimental designs and modelling approaches. This trend is primarily fuelled by the tremendous success of the thriving field of deep learning (Kriegeskorte, 2015; van Gerven, 2017), as we will see in the next chapters.

3. Neural network models and their impact on vision research

Once inspired by the findings of neuroscientists, neural models evolved relatively independently into powerful processing algorithms with the ability to learn and adapt to an infinite variety of tasks, such as self-driving cars (Huval et al., 2015), face recognition (Sun et al., 2014a,b; Parkhi et al., 2015), predicting effects of gene mutations (Leung et al., 2014; Xiong et al., 2015), stock-market trajectories (Ding et al., 2015) and many more. In the following chapter we will see that these seemingly independently developed artificial counterparts of biological neural networks provide novel insights into neural processes and coding schemes of the brain.

3.1. Computers are getting close to human-level performance: Artificial neural network models of vision

Though ANNs are applicable to many forms of sensory data, the focus of this paper is on visual processing. As we have seen in the beginning of this paper, vision is one of the evolutionary success-stories of perception; it developed over millions of years in different species and has, consequently, been subject to intensive research for many years now. A comparable fascination among scientists was evoked by the field of computer vision, which initialized the revolution of neural network modelling in the previous years, also referred to as the third wave of A.I. (Goodfellow et al., 2016). The developed network architectures are claimed to reach human-level performance on restricted vision tasks (Kriegeskorte, 2015; Russakovsky et al., 2015). Thus, we have seemingly achieved that artificial systems are able to visually navigate through an unknown environment and, hence, are part of our visual world. Crucially, the fundamental principles of these artificial counterparts are inspired by biological neurons.

3.1.1. The roots of neural network modelling and the renaissance of deep neural nets in computer vision

Before we come to modern computer vision models and how they interrelate to and can inform us about their biological relatives, a brief review of the history of neural models is provided. Initially, neural networks were biologically inspired and they were studied extensively in different periods of the 20th century; these periods are referred to as the three waves of neural modelling3.

The first wave started with the simplified neural model by McCulloch & Pitts (1943), who showed that neuron-like processing elements are able to calculate Turing-computable functions. A few years later, Hebb (1949) introduced his influential neural learning theory (Hebbian learning, captured by the phrase: neurons that fire together, wire together), followed by the first neuron model that incorporated a learning mechanism, known as the perceptron (Rosenblatt, 1958). A perceptron is a single neuron that sums over its received input vector, multiplied by the input weights, plus a certain bias. When this sum reaches a certain threshold, the neuron fires; otherwise it remains silent (see Figure 1). Around the time of the discoveries by Hubel and Wiesel, Seymour Papert (1966) called out for the summer vision group at MIT to shift the focus of the young field of A.I. to the challenge of building an artificial visual system. Ironically, it was the book "Perceptrons", which he had written together with his colleague Marvin Minsky (Minsky and Papert, 1969), that induced the first neural network winter. The book was perceived as demonstrating the limitations of perceptrons, which reflects the unawareness of that time regarding the computational capabilities of non-linear multilayer perceptrons (MLPs)4,5. An MLP is constituted of a defined number of nodes (perceptrons) connected together to make up a network of hierarchical feedforward layers (see Figure 2). In such form, MLPs are able to classify non-linearly separable data, in contrast to the single perceptron. Later, it was shown that MLPs are universal approximators of any given continuous function (Hornik et al., 1989).

[Figure 1: Perceptron]
[Figure 2: MLP with two hidden layers (1 & 2)]
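To make this concrete, here is a minimal sketch of a perceptron forward pass (Python/NumPy; the AND example and all variable names are illustrative additions, not taken from the cited literature):

```python
import numpy as np

def perceptron(x, w, b):
    """Classic perceptron: weighted sum of the inputs plus a bias,
    passed through a hard threshold (Rosenblatt, 1958)."""
    z = np.dot(w, x) + b          # pre-activation: w . x + b
    return 1 if z >= 0 else 0     # fire (1) if the threshold is reached, else stay silent (0)

# Toy usage: a perceptron implementing logical AND
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```

A single such unit can only draw one linear decision boundary, which is exactly why it fails on non-linearly separable problems like XOR; stacking units into an MLP removes this limitation.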

After the first neural network winter, the introduction of the backpropagation algorithm (Rumelhart et al., 1985, 1986) caused a revival of neural networks – the second wave, in the 80s and 90s. Backpropagation makes use of the differentiable nature of neural networks, allowing for efficient supervised training of neural weights by propagating an error signal back through the network. Each weight in the network can then be updated according to its proportionate contribution to the overall error. Despite this intriguing learning algorithm, the field of A.I. also brought forward other methods (e.g. support vector machines), which were computationally more efficient and therefore more suitable to the processing power of that time, consequently leading to a lack of interest in the more complex neural nets.
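As an illustration of the principle, a toy two-layer MLP trained on XOR with hand-coded backpropagation (a sketch with illustrative hyperparameters; with this setup the network usually converges after a few thousand steps):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic problem a single perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with sigmoid activations
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)          # network output
    err = out - y                       # error signal

    # backward pass: the chain rule distributes the error to each weight
    d_out = err * out * (1 - out)       # through the output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)  # through the hidden layer

    # update each weight in proportion to its contribution to the error
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(0)

final = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(final.round(2))  # typically close to [[0], [1], [1], [0]]
```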

Finally, with the continuously increasing processing power6 and access to large-scale datasets7, the third wave – the era of deep learning – started, providing models by Hinton & Salakhutdinov (2006), Hinton et al. (2006), Bengio & LeCun (2007) and others the necessary conditions to be scaled up to real-world problems. After Krizhevsky et al. (2012) won the ImageNet competition – an image classification task with over 15 million pictures in 22,000 classes8 (Deng et al., 2009; Russakovsky et al., 2015) – by a large margin, ANNs have celebrated a renaissance while revolutionizing computer vision and related fields (see also9: Le et al., 2012).

3 Some also speak of the two "A.I. winters" between these three waves.
4 It has been argued that it was also a strategic move by Minsky and Papert to promote their symbolic understanding of A.I.
5 Nevertheless, in 1950 Minsky was the first to build a randomly wired neural network learning machine (SNARC) (see: Russell & Norvig, 2010, p. 16).
6 In particular, the development of graphical processing units (GPUs) was crucial for rapid parallel processing.
7 It was in the late 90s, when the Internet moved to Web 2.0, i.e. the modern Internet we know, going along with the rise of online storage, that large image collections became possible. In the year 2000 the Olympus E-10, the first amateur digital camera, came on the market, and it took another 3-4 years before these cameras really reached a wide market.

New and essentially deeper network architectures were presented. This trend was summarized under the term "deep learning" (DL). Different layer types, recurrent connections and other operations were incorporated (for an overview, see: LeCun et al., 2015; Schmidhuber, 2015; and the book on deep learning by Goodfellow et al., 2016). But also operations that had already been formally introduced and analysed in the early 90s10 could now unfold their full potential. For instance, convolutional network layers (LeCun et al., 1998, 2010) were the crucial component for the success of Krizhevsky's AlexNet in the ImageNet competition (Krizhevsky et al., 2012) and its better-performing successors (Sermanet et al., 2013; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016). The underlying idea of convolutional layers is to combine single units into local receptive fields that share their weights (see Figure 3). This is often followed by a pooling layer, which summarizes a local field in one value. The approach is biologically inspired, based on the discoveries of local receptive fields and complex cells by Hubel and Wiesel11 (1962), and was already probed by others before (see: the "Neocognitron" by Fukushima, 1975, 1980; or LeCun et al., 1989). Nonetheless, early attempts by Fukushima (1980) did not yet incorporate end-to-end supervised learning (e.g., backpropagation); therefore they were not scalable to practical tasks.
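A minimal sketch of these two operations (plain NumPy, single channel, unit stride, no padding; the edge-detector kernel is an illustrative assumption):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same kernel (shared weights) is slid
    over the image, each position defining one local receptive field."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max pooling: summarize each local field by its largest value."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[1., 0., -1.]] * 3)        # crude vertical-edge detector
print(max_pool(conv2d(image, edge_kernel)).shape)  # (3, 3)
```

Because the kernel weights are shared across all positions, the number of parameters is independent of the image size, which is one reason ConvNets scale to large images where fully connected MLPs do not.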

Ever since, computer vision models have constantly improved in performance; in many vision tasks they reach and even outperform human-level accuracy (Russakovsky et al., 2015). Consequently, the field is moving on and incorporating new domains, e.g. language. Instead of using true-false questions (binary classification) or multiple-choice questions (softmax) to predict correct image labels, new approaches use natural language as the label space. Applications are in image captioning (Johnson et al., 2016; Andreas et al., 2016a,b), text-to-image synthesis (Reed et al., 2016: generative adversarial approach; Mansimov et al., 2016: generating images from captions with attention), and relational reasoning (Hu et al., 2017; Johnson et al., 2017; Santoro et al., 2017: visual question answering; see also: Watters et al., 2017: predicting dynamics of physical systems via observation). This also brought about new forms of datasets, such as Visual Genome (Krishna et al., 2017), CLEVR (Johnson et al., 2016) or Visual Madlibs, and new tasks, for instance the COCO (Common Objects in Context; Lin et al., 2015) contest by Microsoft12. Lastly, recent work combines even more modalities in a unified deep learning model. Kaiser et al. (2017) incorporate several building blocks of multiple domains in the architecture of their MultiModel. The model nearly reaches state-of-the-art results in several benchmark tasks for speech recognition, machine translation, and image classification. Moreover, the authors show that the multi-modality approach not only leads to a partially increased performance when compared to single models, but also deals better with data sparsity.

8 The free ImageNet dataset was created out of millions of images found on the Internet, which were labelled by humans using the online service Amazon Mechanical Turk (Deng et al., 2009).
9 Le et al. (2012) employed an unsupervised and parallelized learning regime on images from YouTube videos, resulting in nearly the same performance as Krizhevsky et al. (2012) on the ImageNet classification task.
10 In the early 90s, ConvNets were already used commercially, for instance in check-reading ATM machines.
11 LeCun et al. (2010) also called the class of ConvNets Multi-Stage Hubel-Wiesel Architectures.

Especially in light of the development in computer vision, the history of neural network models unfolds intriguingly. Biologically inspired operations like convolutional filters were crucial for the breakthrough of deep learning (DL) algorithms. But there are more types and components of architectures that contributed to this development. Therefore, a brief outline of different models as well as learning regimes shall be given in the next section.

3.1.2. Different network types and developments in computer vision models

A fundamental assumption behind DL is that high-dimensional data of the same class lie on a much lower-dimensional data-manifold, i.e. an n-dimensional topological latent structure in the possible data space (Zetzsche et al., 1999; Roweis & Saul, 2000; Wiskott & Sejnowski, 2002; Lee et al., 2003; Olshausen & Field, 2004). For instance, a photo of a person is m-dimensional, where m is the number of pixels. If one now considers other photos of the same person in different poses or under different lighting conditions, the manifold hypothesis claims that these photos are topologically close to each other and are embedded on a specific n-dimensional manifold, where n ≪ m (see Figure 4). A deep neural network that is to be trained to classify the depicted person correctly among other people (classes) would have to untangle the different class-manifolds through its non-linear processing steps in order to separate the given classes from each other13.

12 http://mscoco.org

13 See Colah’s Blog for a concise discussion on the topic (http://colah.github.io/posts/2014-03-NN-Manifolds-Topology)

[Figure 4: Data-Manifold (Image by Pascal Vincent, Deep Learning Summer School, 2015)]
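As a toy illustration of the manifold idea (a hypothetical example, not from the thesis): images of a single pattern that only shifts its position have one latent degree of freedom, and a PCA spectrum shows that they occupy only a small subspace of pixel space:

```python
import numpy as np

# 100 "photos" (m = 64 pixels each) of the same 1-D Gaussian blob,
# varying only in its position -> one latent degree of freedom (n = 1)
m, positions = 64, np.linspace(10, 50, 100)
photos = np.array([np.exp(-0.5 * ((np.arange(m) - p) / 3.0) ** 2)
                   for p in positions])

# PCA via SVD: the singular-value spectrum decays quickly, i.e. the data
# hug a low-dimensional manifold embedded in the 64-dimensional pixel space
centered = photos - photos.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
print((s / s.sum()).round(2)[:5])  # a few components carry most of the variance
```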


But which mechanisms cause the neural network to adjust its connection strengths properly?

Different Learning Paradigms

The most common and efficient way to train a neural network is under supervision. The network receives labelled training data, i.e. the target vector y_i for a given input vector x_i is known. Consequently, the error or loss can be calculated precisely for any given input. Via backpropagation (the chain rule), this error can be attributed to each neural weight proportionately. Then the network parameters can be updated accordingly to minimize the error (LeCun et al., 2015). The ImageNet competition is one example of a supervised learning task. Despite its effectiveness, supervised learning is heavily criticized. First, this learning regime seems biologically implausible (Lake et al., 2016; but see: Lillicrap et al., 2016). In the developmental phase, humans are rarely provided with the corresponding labels of objects they see. Second, even if the plausibility criteria are not of interest, annotated datasets are rare, in most cases they demand huge amounts of labour to be created and, more crucially, it is often far from clear what the correct labels are.

Consequently, unsupervised learning is an active field of research; in the field of A.I. it is considered to "become far more important in the longer term" (LeCun et al., 2015, p. 442). In essence, learning is unsupervised if the target vector y_i of a given input vector x_i is unknown (see: Hastie et al., 2009). The goal is to unveil latent structures or relationships of the input data without pre-imposing specific assumptions on such inputs. This can be done, for instance, with a Hebbian learning approach (Hebb, 1949; Oja, 1982), as sketched below. Semi-supervised learning brings together the best of both worlds, supervised and unsupervised learning: only for a few input vectors x_i are target vectors y_i given. This approach seems to reflect the human way of learning and overcomes the drawbacks of expensive, fully labelled datasets.
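A minimal sketch of such a Hebbian, label-free update is Oja's rule (Oja, 1982), here letting a single linear neuron converge towards the first principal component of its inputs (the data and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabelled 2-D inputs with most variance along one latent direction
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])

w, lr = rng.normal(size=2), 0.01
for x in X:
    y = w @ x                  # neuron output (Hebbian: pre * post)
    w += lr * y * (x - y * w)  # Oja's rule: Hebbian growth plus a decay
                               # term that keeps |w| bounded (~1)

print(w / np.linalg.norm(w))   # approaches the first principal component of X
```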

Finally, reinforcement learning (Sutton & Barto, 1998) is an approach inspired by psychology (behaviourism). It is mostly applied when a system needs to learn about its actions and states in a dynamic environment, i.e. the environment is constantly changing, also as a result of the system's own actions. In most dynamic situations it is not clear what the optimal action is, since it is computationally too expensive to predict all future scenarios. Instead, the system tries to maximize the future reward of a small subset of possible actions. The most famous example of reinforcement learning in the previous years was the success of the ANN called AlphaGo: it defeated the best human players of the Chinese board game Go (Silver et al., 2016).

Different Architectures

Layers of neurons can be connected in different ways. MLPs are also called feedforward neural networks (FNNs); they are called deep if they have more than one hidden layer. Layers are stacked on top of each other and the signal successively passes through the network from the input layer through the hidden layers to the output layer (see Figure 2). Though feedforward designs are considered rather simple, they have proven to be powerful assets in computer vision, in particular in the form of convolutional neural networks (ConvNets, e.g. AlexNet by Krizhevsky et al., 2012).


A more biologically plausible type of architecture is the recurrent neural network (RNN). Here, the output of a layer is not just sent to the following units, but also to previous layers, or it is incorporated into the layer's own input during subsequent time steps (see Figure 5). Therefore, RNNs can be applied to sequential data like sound, videos or other signals that unfold over time. This is a compelling advantage over feedforward designs, since RNNs are able to explore interdependencies over time. Furthermore, it was shown that RNNs are universal approximators of dynamical systems (Funahashi & Nakamura, 1993). Many applications speak in favour of this network type. For instance, Long Short-Term Memory (LSTM) RNNs (Hochreiter & Schmidhuber, 1997) are state-of-the-art models for speech recognition (Sak et al., 2014), machine translation (Sutskever et al., 2014), social signal classification (Brueckner & Schulter, 2014) and other domains. Like FNNs, RNNs are also trainable using backpropagation. In the learning phase, the recurrent network gets unfolded over time, i.e. sequential time steps resemble the hierarchical structure of feedforward models, and the error signal is propagated backwards in time. This process is known as backpropagation through time (BPTT) (see: Robinson & Fallside, 1987; Werbos, 1988).
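A minimal sketch of this recurrence and its unfolding over time (NumPy; an untrained vanilla RNN with illustrative dimensions, meant only to show how the hidden state carries information across time steps and which states BPTT would traverse):

```python
import numpy as np

rng = np.random.default_rng(2)

# Vanilla RNN cell: the hidden state h feeds back into its own input
W_xh = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(4, 4))   # hidden -> hidden (recurrence)
b_h = np.zeros(4)

def rnn_forward(sequence):
    """Unfold the recurrence over time, as BPTT would during training."""
    h = np.zeros(4)
    states = []
    for x_t in sequence:                    # one step per sequence element
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)                    # the stored states are what the
    return states                           # backward pass would traverse

sequence = rng.normal(size=(5, 3))          # 5 time steps, 3 features each
for t, h in enumerate(rnn_forward(sequence)):
    print(t, h.round(2))
```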

Another distinction to be made is between discriminative and generative neural models (Kriegeskorte, 2015). Discriminative networks are used to classify objects, reveal structures or predict future events based on given data, whereas generative networks (re-)construct data that ultimately can be compared to natural data. Generative models often have the advantage of being independent of labelled target data; hence they can be trained without supervision. Models of this type are autoencoders (AEs, which try to optimize the reconstruction of their input via an encoding and decoding process) like (Restricted) Boltzmann Machines (Ackley et al., 1985; Salakhutdinov et al., 2007) and Variational Autoencoders (VAEs) (Kingma & Welling, 2014), autoregressive models like PixelRNNs (which predict and generate the next pixel of an image based on previously seen pixels; van den Oord et al., 2016a), and Generative Adversarial Networks (GANs: Goodfellow et al., 2014a, 2017; LAPGANs: Denton et al., 2015; DCGANs: Radford et al., 2016). GANs are intriguing for a couple of reasons. First of all, the fundamental learning principle is unsupervised. GANs consist of two separate adversarial parts, a generative and a discriminative neural network, which compete against each other. The discriminator network receives either a real sample, drawn from a dataset, or a sample generated by its opponent network. The discriminator optimizes for detecting the real samples, whereas the generator network optimizes for fooling the discriminator. GANs are faster than other models that generate their entries sequentially (PixelRNN, WaveNet; see: van den Oord et al., 2016b; and others). A drawback is the difficulty of reaching the global optimum with current algorithms. The unstable optimum that is to be reached is also called a Nash equilibrium, i.e. a state where each party has developed a strategy that could not be further improved without the other party changing its own strategy (for a geometrical understanding of this form of saddle point, see Figure 6). Moreover, GANs have worked well for generating images but did not do well on discrete data, for instance text. Nevertheless, among leading machine learning experts, GANs are considered one of the most promising paradigms of the past years for unsupervised learning14, and recent results of variational image generation seem to be not far from fooling even human eyes (see: Radford et al., 2016).
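A minimal sketch of this adversarial game (PyTorch, fitting a one-dimensional Gaussian; the network sizes and hyperparameters are illustrative assumptions, not taken from the cited papers):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))   # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1),
                  nn.Sigmoid())                                     # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    real = torch.randn(64, 1) * 0.5 + 2.0       # "dataset": N(2, 0.5)
    fake = G(torch.randn(64, 4))                # generator samples from noise

    # Discriminator: detect real (label 1) vs. generated (label 0) samples
    d_loss = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into labelling fakes as real
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(fake.mean().item(), fake.std().item())    # should drift towards (2, 0.5)
```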

Though there are diverging architecture types, all neural networks are faced with an optimization problem. Either they are trained to minimize an objective function, i.e. a loss or cost function, or to maximize a utility function, i.e. a fitness function. These functions are defined, hence handcrafted. However, the corresponding optimization process itself remains difficult to analyse15. Therefore the question arises: what do we actually know about the inner computations and representations of ANNs?

3.1.3. Opening the brain: Methods to analyse the features of neural networks

Various experiments with network architectures, designs of optimization functions and learning algorithms led to increasing performance in a variety of tasks. However, this trend also shows that little is known about the interrelationships between particular hyper-parameter decisions. In practice, finding the right configuration is mostly based on heuristics (Walczak & Cerpa, 1999) and vague empirical approaches (trial and error).

Nonetheless, in the previous years new and insightful methods have been developed to analyse ANNs. For instance, Zeiler and Fergus (2013) used a visualization technique to diagnose weaknesses of AlexNet and, ultimately, won the ImageNet competition with their ZF Net – an improved version of AlexNet. The method reveals the hidden features of ConvNets (see Figure 7) by applying a deconvolutional network, i.e. a network that reverses the convolutional and pooling feedforward processes of the primary network (Zeiler et al., 2010, 2011; Zeiler & Fergus, 2013). Similarly, Yosinski et al. (2015) provide the Deep Visualization toolbox for live visualisation of the feature activations in different layers of ConvNets.

Dosovitskiy & Brox (2015) used another method (up-convolution; see also: Dosovitskiy et al., 2015) to show that feature maps in all layers preserve colour information and object positions. Yet another empirical approach is offered by Li, Yosinski et al. (2015), who confirmed the common assumption of convergent learning, that is, that different networks learn a similar feature space even though they start from distinct random initializations. The authors found matching neural clusters (populations) and partly even one-to-one neuron alignment in distinct neural networks. However, Kindermans et al. (2017) recently revealed some explanatory drawbacks of previous visualization techniques (e.g., Zeiler and Fergus, 2013; Springenberg et al., 2015; Bach et al., 2015; Radford et al., 2016). These methods implicitly try to detect the signal of interest, i.e. the informative part of the input with respect to the image class, in the direction of the network's weight vectors. Yet, Kindermans and colleagues demonstrate that the weight vector in fact captures the relation between the signal and the distracting noise contribution, and is thus not informative about the relevance of features in the target DNN (see also: Haufe et al., 2014). Therefore, the authors introduce a novel two-component signal estimator that is approximated neuron-wise in each layer (PatternNet), leading to more accurate signal visualizations (see Figure 8). Moreover, in a second step they estimate the signal-function interaction (PatternLRP), revealing the pixel-wise contributions to the decision of the target DNN for specific classes (see also: Bach et al., 2015). Zintgraf et al. (2017) propose a similar method, which depicts the relevance not only of each pixel but also of each feature for a specific image class (via heatmaps). Besides that, Karpathy et al. (2015) used language models to analyse representations, predictions and error types of LSTMs. Others applied gradient-based approaches to synthesize adversarial images, which "fool" neural nets into crucial classification mistakes (Simonyan et al., 2014; Szegedy et al., 2014; Nguyen et al., 2015). For humans, such images look either like random pixel noise or like a natural image that underwent unnoticeable changes from a given original image; the ANN misclassifies both types with high confidence. This method revealed limitations of current ConvNets, but also inspired new network architectures (e.g., GANs; Goodfellow et al., 2014, 2015). More recent gradient-based approaches that incorporate novel regularization methods (activation maximization: Erhan et al., 2009) are able to generate not only naturalistically appearing images, but also various instances of the same image class (deep generator networks: Nguyen et al., 2016a,b). With such a visualization technique, Nguyen and colleagues showed that single higher-level nodes encode information of multiple facets; similar findings were presented for human cortical cells (Quiroga et al., 2005).

14 Yann LeCun considers GANs the "most interesting idea in the last 10 years in ML".
15 A common critique of neural networks is that their operations are untraceable, hence a black box for investigations.

In Chapter 3.1.1, we have seen that A.I., and in particular computer vision, has been inspired by findings in neuroscience (MLPs, ConvNets). Nonetheless, the fields are currently diverging due to the rapid developments in deep learning (see: Cox & Dean, 2014), accompanied by a lack of concern for biological plausibility among many A.I. researchers, but also due to methodological limitations in neuroscience (Olshausen & Field, 2005; see: Chapter 2.1). However, at the same time, the deep learning movement has provided new analytical tools for computational neuroscience (see: Cox & Dean, 2014); and the previous years have shown that new paradigms thrive on this synergetic relationship.

3.2. Neural networks in computational neuroscience

Before we turn our attention to the impact of deep neural networks on neuroscience, we need to understand that there is a general term for computational models that translate complex stimuli to neural data: encoding models (van Gerven, 2017). These models provide insights into neural feature representations of rich naturalistic stimuli and the corresponding functional organization of neural structures, in particular of neural populations.

3.2.1. Encoding and decoding the brain

Encoding models are based on the still-debated theoretical concept of population receptive fields (pRFs), which describe the selective activation-responses of populations of neurons to given stimuli16. Encoding models try to explain, first, how stimuli modulate the activity of neuronal populations and, second, how population responses affect data recordings. Next to sensory processing, encoding models can also be used to test effects of attention, expectations, task demands, learning effects, developmental changes or neurodegeneration (see: van Gerven, 2017). These models do not just take the current stimulus into account but also previous activations of the neural population of interest, in order to predict current, highly non-linear response properties as well as the corresponding measurement signals. Furthermore, encoding models are capable of unveiling the impact of top-down executive processes and bottom-up sensory evidence on population responses in naturalistic experimental conditions. This process can also be reversed into a decoding procedure. Decoding is the prediction of the stimuli given the measurements, which allows for testing the representational content of neural recordings (e.g., Horikawa et al., 2013; Güçlütürk & Güçlü et al., 2017; Horikawa & Kamitani, 2017). The validation of this process is simply done by comparing the given stimuli with the reconstruction (e.g., via correlation, structural similarity or, if feasible, behavioural testing). Finally, models can be compared with respect to their predictive power. This gives rise to hypotheses on how particular brain regions are functionally organized and what they are specialized for (van Gerven, 2017). For instance, Khaligh-Razavi and Kriegeskorte (2014) examined 37 such computational models and compared their internal representations to those in the inferior temporal (IT) cortex, which is located in the ventral visual pathway. It turned out that models with better performance on a naturalistic visual classification task were more similar to IT with respect to the emerged representations. The most successful model was the supervised deep convolutional network by Krizhevsky et al. (2012). Models that were trained in an unsupervised fashion could not explain the variance in IT. Unsupervised models might cluster more on the basis of statistical information of the visual data, regardless of its semantic content. The superior explanatory power (with respect to IT) of supervised models seems plausible, since the categories of the training data (targets) have a semantic component. An intriguing conclusion that can be drawn from this finding is that object categories have an influence on how we see the world and consequently on how the brain processes visual information. Hence, categories are also derived from the affordances of the class-instances and not just from their visual aspects (see also: Mur et al., 2013). This goes along with the finding that IT appears to be both visual and semantic (Kriegeskorte et al., 2008a; Connolly et al., 2012; Huth et al., 2012; Carlson et al., 2013). Nonetheless, one fundamental assumption of the encoding approach is that the brain is optimized to extract statistical invariances from the world. The success of deep neural networks as encoding models, which is reflected in high correlations between model and brain responses and high reconstruction accuracies (van Gerven, 2017), indicates that this assumption holds. As we have seen in the previous chapter, neural nets are powerful tools to explore statistical invariances in any given form of data17.

16 This is conceptually derived from the idea of the receptive field (RF), i.e. the selective activation-response of a single neuron to given stimuli.

Therefore, they bear many insights about their biological counterpart.
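To illustrate the encoding/decoding logic, a minimal sketch with synthetic data (NumPy; ridge regression stands in for the linearized feature-to-response mapping commonly used in this literature, and all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: 200 stimuli described by 50 model features (e.g. DNN
# layer activations), measured in 10 "voxels" with a linear response + noise
features = rng.normal(size=(200, 50))
true_map = rng.normal(size=(50, 10))
voxels = features @ true_map + 0.5 * rng.normal(size=(200, 10))

# Encoding: ridge regression from stimulus features to voxel responses
lam = 1.0
W = np.linalg.solve(features.T @ features + lam * np.eye(50),
                    features.T @ voxels)

# Held-out prediction quality per voxel (the correlation between predicted
# and measured responses is a common validation measure)
test_feat = rng.normal(size=(100, 50))
test_vox = test_feat @ true_map + 0.5 * rng.normal(size=(100, 10))
pred = test_feat @ W
for v in range(3):
    r = np.corrcoef(pred[:, v], test_vox[:, v])[0, 1]
    print(f"voxel {v}: r = {r:.2f}")

# Decoding reverses the direction: predict stimulus features from voxels
W_dec = np.linalg.solve(voxels.T @ voxels + lam * np.eye(10),
                        voxels.T @ features)
```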

3.2.2. What do we learn from artificial neural network models?

Deep neural networks (DNNs) are intriguing for a couple of reasons. The most interesting fact is that the evolved feature maps in each layer of convolutional feedforward neural networks (ConvNets) such as AlexNet partly correspond to the hierarchy of visual features in different stages of the primate visual cortex (Agrawal et al., 2014; Cadieu et al., 2014; Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014; Güçlü & van Gerven, 2015; Kriegeskorte, 2015; Cichy et al., 2016, 2017; Güçlü & van Gerven, 2017a). Though ConvNets do not explain the full variance of activation in the visual pathway, they outperform all their predecessors (e.g., Leibo et al., 2017). For instance, features of layer 1 in AlexNet (Zeiler & Fergus, 2013) show strong similarities to the simple stimuli presented by neuroscientists like Hubel and Wiesel (1962), which induced neural activity in V1. Representations of layer 3 bear high resemblance to the synthetic texture stimuli created by Movshon & Simoncelli (2014) – corresponding to activity at the border area of V1 and V2. And upper layers of the artificial network reveal features of concepts or objects, for instance of faces and animals, but also of text and characters. Such high-level features correspond to activity found down the ventral stream, more specifically in the inferior temporal (IT) cortex and the fusiform gyrus (Fusiform Face Area, FFA) (Gross et al., 1972; Logothetis & Sheinberg, 1996; Quiroga et al., 2005). Similar discoveries were made for sound and speech processing. Lee et al. (2009) showed that in their unsupervised hierarchical deep belief network for audio classification, low-level features evolved that correspond to phonemes, which are discriminately encoded in the human auditory cortex (Mesgarani et al., 2008) depending on the linguistic environment (Lisker & Abramson, 1964; Manca & Grimaldi, 2016). Similarly, Güçlü et al. (2016) found a representational gradient along the superior temporal gyrus, which corresponds to the levels of a DNN trained on musical data. For a higher level of representation, Huth et al. (2016) used a generative model to map audio stimuli to a semantic atlas of the cortex. They fed their model with fMRI data collected from subjects who listened to hours of narrative stories.

This brings us to the second intriguing property of DNNs – their universality18. The networks are not just used for computer vision, but also for auditory data, language processing, abstract decision processes and so on. This fundamental principle is shared with the biological analogue and therefore could provide a computational explanation for phenomena such as neural rewiring (Sharma et al., 2000; von Melchner et al., 2000) and neural reuse (Anderson, 2010), among others.

17 Computer vision models are not just applied to natural images, but are a prospering tool for any form of image data. For instance, in …
18 This universality is most profound in RNNs.

Thirdly, despite their computational complexity in comparison to other machine learning paradigms, particular operations of DNNs underline the importance of efficient coding. For instance, convolutional layers exploit the spatial correlations of image data: the weights of each convolutional filter are shared across the whole image, which does not just reduce the number of parameters substantially, but also improves performance in vision tasks. Similarly, dropout layers (Hinton et al., 2012) lower computational costs while fostering more robust feature maps, i.e. they prevent the network from overfitting. Pooling layers come with similar benefits by scaling down the dimensionality of their input (a toy sketch of these three operations is given below).

So what can we learn from such models? Marblestone et al. (2016) argue that if we assume that the computations of the brain strive to optimize an ensemble of objective functions, we can only effectively analyse the brain from a higher level of organization ("top down") (see also: Robinson, 1992). ANNs teach us that in order to understand network properties, we need to extract what they optimize for. The synergetic case of computer vision and vision neuroscience is an insightful illustration of that. It is therefore not surprising that researchers consider FNNs and RNNs well suited for modelling brain processes (Kriegeskorte, 2015; Heeger, 2017). The claim is that the brain employs both types of networks: forward passes (FNNs) compute quick guesses, while recurrent loops (RNNs) reassure decisions and categorizations by incorporating response predictions (priors) via top-down signals with corresponding latencies, ultimately converging on an interpretation and/or a motor program/behaviour.
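Returning to the efficiency mechanisms above, the minimal PyTorch sketch below combines all three operations. The layer sizes and the 28x28 input are arbitrary, hypothetical choices; the point is how few parameters the shared convolutional filters require.

```python
import torch
import torch.nn as nn

# Toy network: weight sharing (Conv2d), down-scaling (MaxPool2d)
# and dropout, followed by a linear read-out.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 8 filters, shared across the image
    nn.ReLU(),
    nn.MaxPool2d(2),                            # halves both spatial dimensions
    nn.Dropout(p=0.5),                          # randomly silences units during training
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # read-out over the pooled maps
)

x = torch.randn(1, 1, 28, 28)  # one hypothetical grey-scale image
print(net(x).shape)            # torch.Size([1, 10])

# Weight sharing keeps the convolutional layer tiny:
# 8 filters * (3*3*1 weights) + 8 biases = 80 parameters, whereas a dense
# layer mapping 28x28 inputs to 8x28x28 outputs would need ~4.9 million.
print(sum(p.numel() for p in net[0].parameters()))  # 80
```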

How can we compare computational models with real neural processes? There are many approaches to applying computational models to the analysis of brain data in sensory neuroscience (for a review, see: van Gerven, 2017). Next to classical attempts like statistical parametric mapping (SPM) (Friston, 1995) and recent techniques like multivariate pattern analysis (MVPA) (e.g., Haxby et al., 2001) and population receptive field mapping (pRF) (Wandell & Winawer, 2011; Wandell et al., 2015), there are also representational approaches. For instance, Kriegeskorte and colleagues (2008b) suggest an analytical framework that allows for the comparison of the representational structure of computational models and responses in specific brain regions. Representational similarity analysis (RSA) integrates data from the three major branches of systems neuroscience: brain-activity measurement, behavioural measurement, and computational modelling. The authors call this matching process a second-order isomorphism, as opposed to the first-order isomorphism, that is, the relationship between a stimulus property and the corresponding (e.g., neural) activity. The second-order isomorphism is studied by relating the similarity structure of the presented objects to the similarity structure of the representations (in each modality); a minimal sketch of this procedure is given below. Therefore, RSA is neither a decoding nor an encoding model itself, but a form of meta-analysis of different information processes. The approach is intriguing because it allows for (1) relating regions, subjects, species, and modalities of brain-activity measurement, (2) relating brain and behaviour and, most crucially for the focus of this paper, (3) the integration of computational models like DNNs into the analysis of brain-activity data. Hence, RSA is an essential analytical method to compare the representational structures of computer vision models to those of the cortical visual pathway (for a comparison with other representational models, see: Diedrichsen & Kriegeskorte, 2017).
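The following Python sketch illustrates the second-order logic of RSA under simplified assumptions: the model activations and "brain" responses are random, hypothetical stand-ins, one representational dissimilarity matrix (RDM) per modality is built from pairwise correlation distances, and the two RDMs are then rank-correlated.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Hypothetical responses to the same 20 stimuli:
# rows = stimuli, columns = model units or voxels.
rng = np.random.default_rng(seed=0)
model_acts = rng.standard_normal((20, 100))
brain_acts = model_acts[:, :50] + rng.standard_normal((20, 50))  # noisy overlap

# First-order structure: one RDM per modality
# (condensed vector of 1 - Pearson correlation for each stimulus pair).
rdm_model = pdist(model_acts, metric="correlation")
rdm_brain = pdist(brain_acts, metric="correlation")

# Second-order isomorphism: rank-correlate the two RDMs.
rho, _ = spearmanr(rdm_model, rdm_brain)
print(f"representational similarity (Spearman rho): {rho:.3f}")
```

Because only the dissimilarity structures are compared, the two modalities may have entirely different dimensionalities, which is what makes RSA applicable across models, brains and behaviour.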

However, despite such representational similarities, the employment of DNNs in neuroscience is still debated because of their limited biological plausibility. The next section presents the most common points of criticism.

3.2.3. The biological plausibility of neural nets

Frequently raised issues concern the dependency of most DNNs on supervised learning, i.e. given target labels, and on training datasets that are as large as possible, often containing millions of exemplars (Lake et al., 2016). In contrast, even in their developmental phase, humans rarely have access to such labels and supervision. Young children already show the ability to learn new concepts from very sparse data (one-shot learning) and are able to make meaningful generalizations (ibid.). Although there are unsupervised learning approaches for ANNs, their emergent feature maps could not explain the variance of real neural data (Khaligh-Razavi & Kriegeskorte, 2014), nor did they suffice to perform well in ecologically valid tasks. Unsupervised learning mechanisms iteratively update the model statistics such that they match the statistics of the input, which mostly do not reflect the semantic content of the data.

In addition, most computational models ignore essential aspects of biological neural networks. For instance, in vision research, Khaligh-Razavi & Kriegeskorte (2014) argue that the feedforward processing of CNNs might reflect rapid and task-independent recognition processes, but not temporally dynamic processes such as active exploration and disambiguation. Only recurrent neural networks (RNNs) could encode the latter (see also: Elman, 1990; Heeger, 2017). However, previous versions of RNNs did not sufficiently account for the spatial aspects of real neurons and their long- and short-range connectivity across networks (see: VanRullen, 2017). Furthermore, RNNs encountered difficulties with long-term time-dependencies within given data (Bengio et al., 1994; Hochreiter & Schmidhuber, 1997). Most critically, many neuroscientists (e.g., Crick, 1989; Stork, 1989) see no possible biological analogue for the backbone of algorithmic learning, namely backpropagation, i.e. iterative, gradual error propagation via the chain rule (Rumelhart et al., 1985, 1986).
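To spell out what is meant by error propagation via the chain rule, consider a single sigmoid unit with a squared-error loss; the sketch below (plain Python/NumPy, all values arbitrary) factorizes the weight gradient into a chain of local derivatives and verifies it numerically.

```python
import numpy as np

# One unit: y = sigmoid(w*x + b), loss L = (y - t)^2.
# Chain rule: dL/dw = dL/dy * dy/dz * dz/dw = 2(y - t) * y(1 - y) * x
x, t = 1.5, 1.0   # input and target
w, b = 0.2, 0.1   # weight and bias

z = w * x + b
y = 1.0 / (1.0 + np.exp(-z))
grad_w = 2 * (y - t) * y * (1 - y) * x

# Numerical check via a finite difference on w:
eps = 1e-6
y_eps = 1.0 / (1.0 + np.exp(-((w + eps) * x + b)))
grad_num = ((y_eps - t) ** 2 - (y - t) ** 2) / eps
print(grad_w, grad_num)  # the two values agree closely
```

Backpropagation applies exactly this factorization recursively, layer by layer, which is what critics find hard to map onto biological circuitry.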

Nonetheless, VanRullen (2017, p. 3) points out that recent work in A.I. has brought forward new architectures that tackle problems such as one-shot learning (Rezende et al., 2016; Santoro et al., 2016), achieve "good representations" via unsupervised learning (Anselmi et al., 2014, 2016; Doersch et al., 2015; Wang & Gupta, 2015) and reveal a high correspondence to their biological counterparts (Leibo et al., 2017). RNNs were trained via reinforcement learning, which is inspired by psychological findings (Schultz, 2015), to model attentional processes (Mnih et al., 2014; Ba et al., 2015). Semi-supervised learning is also a plausible model of human learning, since small children do receive a certain amount of instruction. Ladder networks (Valpola, 2015), i.e. autoencoders that conduct denoising steps in each layer, need only a handful of target labels to accurately classify given data, e.g., images (Rasmus et al., 2015). More generally, combining different learning regimes (e.g., semi-supervised plus reinforcement learning) in DNNs is thought to be the most rational track towards creating biologically plausible systems (Bengio et al., 2016a,b; Finn et al., 2017).

Real neurons exhibit characteristic spiking behaviour, and spike trains are believed to encode and decode information very efficiently (Maass, 1997); a minimal integrate-and-fire sketch is given below. Consequently, spiking ANNs are becoming more and more popular (Yu et al., 2013; Diehl et al., 2015; Thalmeier et al., 2015; Abbott et al., 2016; DePasquale et al., 2016; Hunsberger & Eliasmith, 2016; Lee et al., 2016; Zambrano & Bohte, 2016) and are successfully combined with unsupervised learning mechanisms (Kheradpisheh et al., 2016a). In addition, new architectures incorporate connections that could reflect the far-reaching apical dendrites of pyramidal cells (Pascanu et al., 2014; Huang et al., 2016a,b; Zilly et al., 2017). We can also reject the critique that biological neurons are either inhibitory or excitatory (Dale's principle; Eccles et al., 1954), whereas most artificial nodes are both, since both network types have comparable function-approximation capacities (Parisien et al., 2008; Tripp & Eliasmith, 2016)19.
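As a minimal illustration of the spiking dynamics referred to above, the following sketch simulates a single leaky integrate-and-fire (LIF) neuron; all constants and the input current are arbitrary, hypothetical values.

```python
import numpy as np

# Leaky integrate-and-fire neuron: the membrane potential v leaks
# towards rest, integrates input current, and emits a spike (with a
# reset) whenever it crosses the threshold.
dt, tau = 1.0, 20.0                         # time step and membrane time constant (ms)
v_rest, v_thresh, v_reset = 0.0, 1.0, 0.0

rng = np.random.default_rng(seed=0)
current = rng.uniform(0.0, 0.12, size=500)  # hypothetical input current per ms

v, spike_times = v_rest, []
for t, i_in in enumerate(current):
    v += dt / tau * (v_rest - v) + i_in     # leaky integration (Euler step)
    if v >= v_thresh:
        spike_times.append(t)               # record the spike...
        v = v_reset                         # ...and reset the membrane

print(f"{len(spike_times)} spikes in {len(current)} ms")
```

Networks built from such units communicate through discrete events rather than continuous activations, which is the efficiency argument made by Maass (1997).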

Moreover, Hinton and others have recently and convincingly rejected objections within neuroscience that the brain would not be able to apply backpropagation as a learning regime (Hinton, 2007; Bengio et al., 2016a; Guergiuev et al., 2016; Hinton, 2016; Lillicrap et al., 2016; Scellier & Bengio, 2016; Bengio et al., 2017). For instance, Hinton argues that spike-timing-dependent plasticity (STDP) mechanisms are a form of derivative filter that approximates backpropagation (temporal derivatives code error derivatives) and that spike coding (as opposed to real-valued coding) can be seen as a strong regularizer. If one considers random top-down connection weights in an autoencoder architecture, the bottom-up weights adapt such that the fixed top-down weights become approximately their pseudo-inverse near the data manifold; hence they approximate the true error derivative (Lillicrap et al., 2014; Liao et al., 2016) (a toy sketch of this "feedback alignment" idea follows below). Moreover, addressing higher levels of functionality, Kheradpisheh et al. (2016b,c) showed that current deep convolutional neural networks (DCNNs) have similar difficulties as humans during object recognition when confronted with variant views (occlusion, rotation in depth and in the plane, scale, position, illumination, background clutter, etc.), a further indication that such models resemble human feed-forward vision.
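The sketch below implements this idea in plain NumPy under simplified assumptions: a two-layer network is trained on a toy, hypothetical task, but the error is sent backwards through a fixed random matrix B rather than through the transpose of the forward weights, in the spirit of feedback alignment (Lillicrap et al., 2014).

```python
import numpy as np

# Feedback alignment: the backward pass uses a fixed random matrix B
# instead of W2.T, yet the forward weights still learn.
rng = np.random.default_rng(seed=0)
n_in, n_hid, n_out = 4, 16, 1

W1 = 0.5 * rng.standard_normal((n_in, n_hid))
W2 = 0.5 * rng.standard_normal((n_hid, n_out))
B = 0.5 * rng.standard_normal((n_out, n_hid))   # fixed feedback weights

X = rng.standard_normal((200, n_in))
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)  # toy target

lr = 0.05
for _ in range(2000):
    h = np.tanh(X @ W1)                       # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2)))     # sigmoid output
    err = out - y                             # output error
    delta_h = (err @ B) * (1.0 - h ** 2)      # error routed through B, not W2.T
    W2 -= lr * (h.T @ err) / len(X)
    W1 -= lr * (X.T @ delta_h) / len(X)

print("final accuracy:", ((out > 0.5) == y).mean())  # well above chance
```

That learning still succeeds without transporting the forward weights backwards is precisely what makes backpropagation-like credit assignment biologically conceivable.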

19 This direct comparison might be misleading anyway. Others suggest that single biological neurons are rather
