Deep learning to probe neural correlates of music processing


Jordy Thielen

Radboud University Nijmegen, Donders Centre for Cognition, Artificial Intelligence Department

Master of Science Programme

MSc Thesis

On-site supervision: Dr. Marcel van Gerven and Umut Güçlü

Abstract

Sensory cortices process their domain-specific information hierarchically, with increasingly more complex features represented at more downstream areas along the cortical sheet. This organization is well established for the visual cortex, but much less so for the auditory cortex. Recent advances in artificial neural networks allow the end-to-end learning of models for solving problems such as automated music tag prediction. Here, we trained a residual neural network to predict tags of natural music stimuli. In turn, the trained model was used to probe neural representations of music across the cortical sheet. Using a searchlight representational similarity analysis, we revealed a representational gradient across the Superior Temporal Gyrus (STG). This gradient extended from Planum Polare, which was more sensitive to complex feature representations, to central STG, which was more sensitive to simple feature representations, to Planum Temporale, which was again more sensitive to complex features. The results imply low-level processing around primary auditory cortex, with a broad auditory association area around it along STG.

Keywords: deep learning, music processing, functional magnetic resonance imaging, representational similarity analysis

Introduction

The inverse problem is a central theoretical concept in sensory neuroscience: how does the brain build sensible percepts from a limited, noisy, and ambiguous source of sensation, such as visual projections onto the retina or sound waves entering the eardrum? In particular, ambiguities arise when multiple sources can cause similar sensations. In sound perception (e.g., speech, music) a multitude of information is used, namely: amplitude information (i.e., intensity), spectral content (i.e., pitch), temporal organization (i.e., rhythm), and the ability to distinguish instances (e.g., instruments and voices) while intensity and pitch remain equal (i.e., timbre).

Raw audio signals are caught by the pinna of the outer ear and travel as waves into the ear canal, a tube that narrows along the way and thereby amplifies the signal (Gazzaniga et al., 2002). When the waves hit the eardrum, they cause vibrations of the ossicles (i.e., the hammer, anvil, and stirrup). The ossicles in turn cause vibrations that travel through the middle ear to the fluid inside the cochlea. These vibrations propagate up the cochlea, from base to apex, stimulating hair cells that are sensitive to particular frequencies (i.e., a gradient from base to apex represents high to low frequencies). In this way, frequencies are extracted along the cochlea.

The bending of the hair cells' cilia causes the flow of ionic currents through non-linear channels, initiating electrical potentials. These potentials are transferred by individual bipolar cells, which bundle together in the auditory nerve. The auditory nerve in turn connects to the cochlear nucleus at the level of the brainstem, which integrates a multitude of activations and projects these to the superior olive. Both the cochlear nucleus and the superior olive project to the inferior colliculus (Kolb & Whishaw, 2001).

Two pathways emerge at the inferior colliculus. One projects to the ventral part of the Medial Geniculate Body (MGB), which projects to the Primary Auditory Cortex (PAC): Heschl’s Gyrus (HG) at the anterior Superior Temporal Gyrus (STG). The other projects to the dorsal part of the MGB, which projects to nearby auditory areas: Planum Temporale (PT) and posterior STG. These ventral and dorsal pathways are believed, as is the case for the visual counterpart, to represent ’what’ and ’where’ pathways (Romanski et al., 1999; Ahveninen et al., 2006).

The auditory cortex is mainly localized along STG, the gyrus between the Lateral Sulcus (LS) and the Superior Temporal Sulcus (STS) (see Figure 1). PAC is mainly located along HG, around the intersection of LS and the Central Sulcus (CS). PAC interprets the one-dimensional spectral content and expands it into a three-dimensional representation with tonotopic, scale (local bandwidth), and symmetry (local phase) axes. PAC is thought to hold a multi-scale representation at various degrees of spectral and temporal resolution. This may result in perceptual invariances, for instance the ability to recognize patterns like melodies despite changes in their rate of delivery or external noise. The secondary auditory cortex is located posterior to HG, along PT.

The majority of the work on functional auditory cortical representations has been limited to hand-designed low-level representations, such as spectro-temporal models (Santoro et al., 2014), spectro-location models (Moerel et al., 2015), power-spectral profiles (Hu et al., 2016), timbre, rhythm, tonality (Alluri et al., 2012, 2013; Toiviainen et al., 2014), melodic contour (Lee et al., 2011), rhythm (Chen et al., 2008), and pitch (Patterson et al., 2002; Griffiths, 2003), or high-level representations such as music genre (Casey et al., 2012) and sound categories (Staeren et al., 2009). Despite the importance of such controlled studies, hand-crafted features might bias or limit the experimental manipulation to the hypothesis space of the experiment. Additionally, we do not live in a world of simple and abstract constructs, but in a dynamic and complex one. Hence, the representations of simple and abstract concepts might not capture the full extent to which complex stimuli are represented (i.e., the mapping might not be simply linear).

Figure 1. A cortical surface plot (left) and a coronal view of the auditory cortex (right). CS: Central Sulcus, CG: Cingulate Gyrus, PP: Planum Polare, PT: Planum Temporale, STG: Superior Temporal Gyrus, STS: Superior Temporal Sulcus, hA1: first human Auditory area, and HG: Heschl's Gyrus. Figure taken from Brewer & Barton (2016).

Instead of hand-crafted controlled abstract stimuli, we used complex dynamic natural stimuli: real music clips. The features that we extracted from these music clips were the representations within a data-driven, task-optimized artificial neural network. In this way we did not severely restrict the hypothesis space, other than to the extent of the task that was learned by the neural network.

Neural networks have a long history, starting with the perceptron (Rosenblatt, 1958), a unit capable of learning patterns for linear binary classification. However, the perceptron was soon proven incapable of distinguishing data that are not linearly separable (Minsky & Papert, 1988). With the advent of error back-propagation, multilayer perceptrons made it possible to distinguish such data (Rumelhart et al., 1985). Basically, each layer within a feed-forward network weights its inputs and passes the result through some non-linearity (e.g., a sigmoid activation function). Such a network is trained in a supervised manner on input-output pairs, adapting the weights so as to minimize the error between the predicted and true outputs. With insights from neuroscience about simple and complex cells in the visual system, whose receptive fields become more complex up the hierarchy (Hubel & Wiesel, 1959), neural networks with multiple units and especially multiple layers became the standard in deep learning. For an overview of deep learning, see LeCun et al. (2015).
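To make this forward pass concrete, here is a minimal numpy sketch (illustrative only; the thesis itself used Chainer, and all names here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    # Element-wise logistic non-linearity.
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    # Each layer weights its inputs and passes the result through a
    # sigmoid non-linearity, as described above.
    activation = x
    for W, b in zip(weights, biases):
        activation = sigmoid(W @ activation + b)
    return activation

# Toy example: 2 inputs -> 3 hidden units -> 1 output (random weights).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
print(mlp_forward(np.array([0.5, -1.0]), weights, biases))
```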

Deep neural networks (DNNs) could provide a powerful approach to construct and test alternative hypotheses about what information is represented across the cortical sheet. On the one hand, a task-optimized DNN learns a hierarchy of nonlinear transformations with the objective of solving a particular task. On the other hand, functional magnetic resonance imaging (fMRI) measures local changes in blood-oxygen-level dependent (BOLD) haemodynamic responses to sensory stimulation. Subsequently, any subset of the DNN representations that emerge from this hierarchy of nonlinear transformations can be used to probe neural representations by comparing DNN and fMRI responses to the same sensory stimuli. Considering that sensory systems are biological neural networks that routinely perform the same tasks as their artificial counterparts, it is not inconceivable that DNN representations are suitable for probing neural representations. Additionally, since their resurgence, DNNs have achieved many breakthroughs in, for instance, image classification (Krizhevsky et al., 2012), object detection (Girshick et al., 2014), and semantic segmentation (Long et al., 2015).

Indeed, this approach has been shown to be extremely successful in visual neuroscience. To date, several task-optimized DNN models have been used to accurately model visual areas along the dorsal and ventral streams (Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014; Seibert et al., 2016), revealing representational gradients where deeper layers of the neural network map to progressively more downstream areas along the visual pathways (Güçlü & van Gerven, 2015a,b).

The current study expands on this line of research by modeling how the human brain responds to music. We achieved this by probing neural representations of music features using a deep neural network optimized for music tag prediction. We used the representations that emerged after training a DNN to predict tags of musical excerpts as candidate representations for different areas of the brain in a searchlight representational similarity analysis (RSA). We observed that different DNN layers correspond to different locations within and outside the auditory cortex. In particular, deeper neural network layers mapped to auditory brain regions more distant from primary auditory cortex.

Methods

Natural music stimuli

We used the MagnaTagATune dataset (Law et al., 2009), which contained 25,863 music clips. Each clip was a 29-second excerpt from one of 5,223 songs, 445 albums, and 230 artists. The clips spanned a broad range of genres, such as Classical, New Age, Electronica, Rock, Pop, World, Jazz, Blues, Metal, and Punk. The motivation behind using such a dataset was to have stimuli that span a wide range of natural, instrumental, as well as vocal sounds, leading to highly rich representations.

Along with the music excerpts, each audio clip was supplied with a vector of binary annotations of 188 tags. These annotations were obtained from humans playing the two-player online TagATune game. In this game, the two players were either presented with the same or a different audio clip. Subsequently, they were asked to come up with tags for their specific audio clip. Afterward, players viewed each other's tags and had to decide whether they had been presented with the same audio clip. Tags were only assigned when more than two players agreed. The annotations included tags like 'singer', 'no singer', 'violin', 'drums', 'classical', 'jazz', et cetera.

The entire MagnaTagATune dataset was used to train a DNN, as outlined below. The dataset came in sixteen parts, of which parts one to twelve were used for training the model, part thirteen for validation, and parts fourteen to sixteen for testing. The DNN was trained to predict the 50 most frequent tags within the full MagnaTagATune dataset, following Dieleman & Schrauwen (2014).
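A minimal sketch of this split and tag selection, with a hypothetical binary annotation matrix and per-clip part index standing in for the real dataset files:

```python
import numpy as np

# Hypothetical inputs: binary clip-by-tag matrix and the part (1-16) of each clip.
n_clips, n_tags = 25863, 188
rng = np.random.default_rng(0)
annotations = (rng.random((n_clips, n_tags)) < 0.05).astype(int)
part = rng.integers(1, 17, size=n_clips)

# Keep the 50 most frequent tags over the full dataset.
top50 = np.argsort(annotations.sum(axis=0))[::-1][:50]
labels = annotations[:, top50]

# Parts 1-12 train, part 13 validation, parts 14-16 test.
train, valid, test = part <= 12, part == 13, part >= 14
print(labels[train].shape, labels[valid].shape, labels[test].shape)
```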


We restricted the clips presented during the experiment to the Magtag5k dataset (Marques et al., 2011), a cleaned version of the MagnaTagATune dataset. This dataset removed several issues from the original dataset, such as misspellings (e.g., merging tags like 'clasical' and 'classical'), synonymy (e.g., merging tags like 'classical' and 'classic'), compatibility (e.g., removing clips tagged with both 'drums' and 'no-drums'), trivial cases (e.g., tags like 'silence' were removed), sparseness (e.g., removing clips without any tag), and duplication (i.e., for each song, only the clip with the maximum number of tags was kept; others were removed). The Magtag5k dataset contained 5,259 clips from 230 artists, annotated with a vocabulary of 137 tags.

A subset of clips was selected from the Magtag5k dataset to be used in the experimental procedure. As we had a session with eight runs of sixteen unique stimuli each (i.e., the training session) and a session with eight runs of the same sixteen stimuli each (i.e., the testing session), a total of 9 × 16 = 144 stimuli were required (see experimental details below). In the selection procedure, we ignored clips at the start of songs, to ensure that clips contained sound throughout the entire 29 seconds. We hierarchically clustered the clips into sixteen clusters using complete linkage, based on their tag assignments. Subsequently, we selected nine times sixteen clips iteratively, where each iteration selected one representative clip from each cluster, as sketched below. A representative clip was defined as the one that minimized the mean correlation with any already selected clip. First, this procedure ensured that the subset of selected clips contained a broad range of tags. Second, it biased subsequent selections towards clips containing less-frequent tags, thereby balancing the distribution of tag assignments. The first iteration of sixteen selections formed the testing set for the testing session; the others formed the training set for the training session.
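A sketch of this clustering-and-selection procedure, under stated assumptions: a hypothetical binary tag matrix stands in for the real tag assignments, and scipy's complete-linkage clustering is one plausible implementation (the thesis does not name one):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
tags = (rng.random((600, 137)) < 0.1).astype(float)  # hypothetical clip-by-tag matrix

# Hierarchically cluster clips into sixteen clusters by their tag assignments.
clusters = fcluster(linkage(tags, method="complete"), t=16, criterion="maxclust")

corr = np.corrcoef(tags)  # pairwise correlations between clips' tag vectors
selected = []
for _ in range(9):                    # nine iterations of sixteen picks
    for c in range(1, 17):            # one representative clip per cluster
        members = [i for i in np.flatnonzero(clusters == c) if i not in selected]
        if not members:               # guard: cluster exhausted
            continue
        if not selected:              # very first pick: any member will do
            selected.append(members[0])
            continue
        # Pick the clip minimizing the mean correlation with clips selected so far.
        mean_corr = [corr[i, selected].mean() for i in members]
        selected.append(members[int(np.argmin(mean_corr))])

test_set, train_set = selected[:16], selected[16:]  # first iteration is the test set
```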

Deep neural network

We trained a residual neural network (ResNet) which, unlike a standard DNN, does not directly learn an underlying mapping H(x) given input x, but instead fits a residual mapping F(x) = H(x) − x (He et al., 2015a). The function to be learned then becomes H(x) = F(x) + x, which was realized by adding shortcut connections over stacks of layers. Such residual nets benefit from increased performance when more layers are added, while avoiding the saturation and degradation of model performance that otherwise occur with increasing depth. Added layers can always fall back on learning an identity mapping, so a deeper residual net should perform at least as well as a shallower one. The benefit of residual networks is shown by numerous breakthroughs in image classification (Szegedy et al., 2016; He et al., 2016) and object detection (Liao & Poggio, 2016; He et al., 2016).

Here, the ResNet was an adapted version of the eighteen-layer ResNet in He et al. (2015a). Both their ResNet and ours contained eighteen layers, of which eight formed blocks with shortcut connections, each short-cutting a stack of two layers (see Table 1 and Figure 2). Since dimensions differed from stack to stack, layers conv3_1, conv4_1, and conv5_1 performed down-sampling by learning a projection mapping (a 1 × 1 convolution): y = F(x, W_i) + W_s x. Batch normalization was performed between each convolution and its activation function (Ioffe & Szegedy, 2015). All artificial neurons were rectified linear units (ReLUs). The output layer was a fully-connected layer performing average pooling followed by sigmoid activation functions.

Our ResNet differed from the original ResNet of He et al. (2015a) in a few respects. Because our input was a one-dimensional signal (i.e., the raw time-domain audio), we used one-dimensional filters of size 1 × 49 in the first layer and 9 × 1 elsewhere for the convolutional layers, instead of square filters of size 7 × 7 or 3 × 3. Additionally, the pooling kernels and strides were adapted accordingly. Finally, because our classification problem was an independent multi-label problem (i.e., multiple tags may be assigned to a single clip), the activation function of the output layer was a sigmoid instead of a softmax, and the network was therefore optimized by minimizing the binary cross-entropy loss. For full details, see Table 1.
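A minimal sketch of one such 1-D residual block with a projection shortcut, written in PyTorch purely for illustration (the thesis used Chainer; all names are illustrative, and stride/padding follow Table 1):

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """Two 1-D convolutions with a shortcut connection: y = F(x) + x.

    When the stride or channel count changes, the shortcut becomes a
    1x1 projection (W_s x), as in layers conv3_1, conv4_1, and conv5_1.
    """
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=9, stride=stride, padding=4)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=9, stride=1, padding=4)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + self.shortcut(x))

# Example: the conv3_1 block, down-sampling 64 -> 128 channels with stride 4.
block = ResidualBlock1D(64, 128, stride=4)
print(block(torch.randn(1, 64, 3136)).shape)  # torch.Size([1, 128, 784])
```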

The neural net was implemented in Chainer (Tokui et al., 2015), a framework for neural nets based on the Python programming language. The neural net was initialized with random weights as in He et al. (2015b). The model was trained using Adam (Kingma & Ba, 2014), with parameters α = 0.0002, β1 = 0.5, β2 = 0.999, and ε = 1e−8, with mini-batches of size 36, minimizing the binary cross-entropy loss function. We did not use dropout (Hinton et al., 2012), because batch normalization already regularizes the network sufficiently during the learning procedure (Ioffe & Szegedy, 2015).
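A minimal training-step sketch under these settings (again PyTorch for illustration, since Chainer is no longer maintained; `model` is assumed to end in a sigmoid output layer as described above):

```python
import torch
import torch.nn as nn

# Settings as in the text: Adam with alpha = 2e-4, beta1 = 0.5, beta2 = 0.999,
# epsilon = 1e-8, and mini-batches of 36 clips.
def make_optimizer(model):
    return torch.optim.Adam(model.parameters(), lr=2e-4,
                            betas=(0.5, 0.999), eps=1e-8)

def train_step(model, optimizer, audio, tags):
    """One mini-batch update for tag prediction.

    audio: (36, 1, 50176) raw waveforms; tags: (36, 50) binary targets.
    The model ends in a sigmoid, so binary cross-entropy applies directly.
    """
    criterion = nn.BCELoss()
    optimizer.zero_grad()
    loss = criterion(model(audio), tags.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```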

Table 1
The architecture of the residual neural network. In total, the network had 18 layers, containing eight blocks with shortcut connections (i.e., conv2_x to conv5_x). In layers conv3_1, conv4_1, and conv5_1 the shortcut connections were realized by a projection mapping because of a change in dimensionality. Each conv block lists its two stacked convolutional layers.

layer name   architecture
conv1        1 × 49, 64, stride 4, padding 24
             1 × 9, max pool, stride 4, padding 4
conv2_1      1 × 9, 64, stride 1, padding 4
             1 × 9, 64, stride 1, padding 4
conv2_2      1 × 9, 64, stride 1, padding 4
             1 × 9, 64, stride 1, padding 4
conv3_1      1 × 9, 128, stride 4, padding 4
             1 × 9, 128, stride 1, padding 4
conv3_2      1 × 9, 128, stride 1, padding 4
             1 × 9, 128, stride 1, padding 4
conv4_1      1 × 9, 256, stride 4, padding 4
             1 × 9, 256, stride 1, padding 4
conv4_2      1 × 9, 256, stride 1, padding 4
             1 × 9, 256, stride 1, padding 4
conv5_1      1 × 9, 512, stride 4, padding 4
             1 × 9, 512, stride 1, padding 4
conv5_2      1 × 9, 512, stride 1, padding 4
             1 × 9, 512, stride 1, padding 4


Figure 2. A schematic visualization of the architecture of the residual neural network. All layers are convolutional layers, except the last layer, which is fully-connected. Max pooling follows layer 1, and the last layer implements average pooling. Colors show identical building blocks. The shortcut connections are illustrated by arrows with a plus sign skipping one layer within building blocks. The numbers below layers denote the number of feature maps within that layer.

During training of the neural net, we used excerpts of clips of approximately 3 seconds (i.e., 50,176 samples at 16,000 hertz). To predict the tags of a 29-second excerpt in the test split of the dataset, the full 29-second clip was slid through the network, and the predictions were averaged by the fully-connected final layer (i.e., average pooling). From these predictions we computed the area under the receiver operating characteristic (ROC) curve (AUC) for performance evaluation. The ROC represents the performance (i.e., true positive rate versus false positive rate) of a binary classifier as its discrimination threshold is varied between 0 and 1. The AUC then gives the probability that a random positive example is scored higher than a random negative example.
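A sketch of this evaluation with scikit-learn, using hypothetical predictions and ground-truth tags in place of the real test split:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(2000, 50))  # hypothetical binary test tags
y_score = np.clip(0.6 * y_true + 0.5 * rng.random((2000, 50)), 0.0, 1.0)  # hypothetical sigmoid outputs

# AUC per tag: the probability that a random positive clip is scored higher
# than a random negative clip for that tag.
per_tag_auc = [roc_auc_score(y_true[:, t], y_score[:, t]) for t in range(50)]
print(f"mean AUC over tags: {np.mean(per_tag_auc):.4f}")
```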

For each clip in the experimental set, we extracted the artificial neuron activities (i.e., feature representations) within individual (stacked) layers. These transformations resulted in m matrices of size k × p_i, where m = 10 is the number of (stacked) layers, k = 16 is the number of clips, and p_i is the number of artificial neurons in the i-th layer (64, 64, 64, 128, 128, 256, 256, 512, 512, and 50 for i = 1, ..., 10, respectively).

Participants

Data were collected from eight participants (5 male, 25 ± 2.5 years). All participants were healthy and had normal hearing. The experimental protocol was approved by the local ethical committee of the Radboud University Nijmegen. Written informed consent was obtained from all participants prior to the experiment. Data from participant 2 were discarded because of extensive movement during the scanning period.

Experimental procedure

Participants completed two separate sessions: a training session and a testing session. Both sessions contained eight runs, and each run contained sixteen trials. Stimuli were presented using Presentation software over MRI-compatible in-ear earphones.

In the training session, each run presented sixteen unique clips; thus, the training session counted 8 × 16 = 128 unique clips. In the testing session, each run presented the same sixteen clips; thus, the testing session counted only 16 unique clips. Over the entire experiment, participants listened to 9 × 16 = 144 unique clips, sixteen of which were repeated eight times. The order of the clips in the testing runs was counter-balanced by a Latin square design, to prevent order and carry-over effects among clips. The same ordering was used for the training runs, to prevent any effect of the iterative selection procedure.

Before starting a session, participants completed a practice run in which they were asked to adjust the sound volume to an appropriate level. As a reference, participants had listened to the two extremes outside the scanner. The practice run was completed inside the scanner, so that the volume could be adjusted relative to the scanner noise.

During the experiment, a trial started with a 29-second natural music stimulus from the experimental set, while a fixation cross was presented on a screen at the bore of the scanner. Participants were asked to fixate on the cross while carefully listening to the clip, and to answer two questions post-trial (see Figure 3). These two questions required the participant to rate the likability and complexity of the presented clip. Ratings were given by selecting one of seven points on a horizontal line, where left was negative and right was positive. Participants could move the marker to the left or right by pressing one of two buttons with their right index finger or right ring finger, respectively, and could accept their choice with a third button pressed with their right middle finger. To prevent motor preparation, the order of the two questions, as well as the starting point of the selection, were randomized within and between trials. After the last question had been answered, the next trial started immediately.

Figure 3. A visualization of one particular run, which contained sixteen trials. A trial contained a stimulation period during which the 29-second audio was presented. After stimulation, the participants answered two questions about the likability and complexity of the perceived stimulus, in random order. Participants answered by moving a randomly positioned bar to the left (negative) or right (positive) by pressing one of two buttons, and could accept their choice with a third button. There were two sessions, each containing eight of these runs. In the training session, all stimuli were unique. In the testing session, sixteen unique stimuli were repeated over runs.


MRI data collection

MRI data were collected in a 3T Siemens MAGNETOM scanner at the Donders Center for Cognitive Neuroimaging using a 32-channel Siemens volume coil. Functional scans were collected using a multi-band sequence with repetition time (TR) = 735 ms, echo time (TE) = 39 ms, multi-band acceleration factor = 8, voxel size = 2.4 mm isotropic, 64 slices, slice-thickness = 2.4 mm, 0% slice gap, and field of view = 210 mm × 210 mm. Structural data were collected using a T1-weighted multi-echo MP-RAGE sequence in the same 3T scanner.

Preprocessing

Preprocessing of the data was carried out using the Statistical Parametric Mapping toolbox (SPM12, http://www.fil.ion.ucl.ac.uk/spm). Functional scans were realigned to the first volume of the first run and then to the mean scan, by translation and rotation transformations, to correct for motion during the scan period. Functional scans were then slice-time corrected to account for the sampling of the BOLD response at different time-points within a volume. The anatomical scan was co-registered to the mean functional scan. Finally, the realigned and slice-time corrected functional scans were normalized to MNI space to facilitate group analysis.

Representational Similarity Analysis

We used representational similarity analysis (RSA) to investigate the correspondence between the representations in several computational models and observed brain activity (Kriegeskorte et al., 2008; Nili et al., 2014). In RSA, a (computational or brain) model is characterized by a representational dissimilarity matrix (RDM). An RDM captures the internal structure of a model's representation as dissimilarities between pairs of categories (here, the presented audio clips). In turn, the overlap between the computational model (i.e., the candidate model/RDM) and the neural activity (i.e., the target model/RDM) can be estimated by computing the similarity between them (i.e., their RDMs). The estimated similarity provides evidence about how well a particular model explains the response patterns in a particular brain region.

We ran a searchlight RSA using the CoSMoMVPA toolbox (Oosterhof et al., 2016), a Matlab toolbox for multivariate pattern analysis of neuroimaging data. Within the searchlight RSA, a target RDM was generated by computing the dissimilarity (1 − Spearman correlation coefficient) between brain activity observed within a spherical neighborhood of 100 voxels at pairs of TRs. Then, for each searchlight, the Spearman correlation was computed between each of ten candidate RDMs and the target RDM. The searchlight RSA returned, for each voxel, a vector containing the similarity values between the computational models and the brain activity, for each participant individually.

In this study, we restricted our analysis to the testing session only. The functional scans were masked with a gray-matter mask. For each run, we linearly detrended and z-scored the voxel time courses. We averaged repetitions over runs to increase the signal-to-noise ratio (SNR). Finally, we removed the first six seconds of each trial to account for the delay of the blood-oxygen-level dependent (BOLD) response and stacked all trials. This procedure resulted in n matrices of size v × t, where n is the number of participants, v the number of voxels, and t the number of TRs. The target RDM was generated by computing the dissimilarity (1 − Spearman correlation coefficient) between pairs of TRs. Thus, the target RDM for a spherical neighborhood of 100 voxels was a square matrix of size t × t.
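A minimal sketch of the target RDM for one searchlight sphere, using hypothetical data in place of the preprocessed voxel time courses:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
sphere = rng.standard_normal((100, 320))  # hypothetical: 100 voxels x t TRs

# Spearman correlation between the 100-voxel patterns at every pair of TRs;
# the target RDM is 1 minus that correlation (t x t).
rho, _ = spearmanr(sphere)  # correlates columns, i.e. TRs: (t x t) matrix
target_rdm = 1.0 - rho
print(target_rdm.shape)     # (320, 320)
```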

In this study, several (10) different computational models were employed, namely the layers within the ResNet. For each of these computational models, a candidate RDM was created by computing the distance (1 − Spearman correlation coefficient) between the model's representations of pairs of TRs. Specifically, we extracted the artificial neuron activations of individual clips and individual layers, aligned with the TR. These activations were convolved with the haemodynamic response function (HRF). After convolution, we removed the first six seconds to account for the delay of the BOLD response. Subsequently, we stacked all clips' feature representations. This procedure resulted in m matrices of size p_i × t, where m is the number of layers, p_i is the number of artificial neurons in the i-th layer, and t is the number of TRs. Thus, each candidate RDM was a square matrix of size t × t.
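A corresponding sketch for a candidate RDM of one layer; the HRF here is a hypothetical canonical double-gamma shape, as the thesis does not specify its HRF implementation:

```python
import numpy as np
from scipy.stats import gamma, spearmanr

TR = 0.735  # seconds

def hrf(times):
    # Hypothetical canonical double-gamma HRF (peak ~5 s, undershoot ~15 s).
    return gamma.pdf(times, 6) - gamma.pdf(times, 16) / 6.0

rng = np.random.default_rng(0)
features = rng.standard_normal((512, 640))  # hypothetical: p_i units x t TRs

# Convolve each unit's time course with the HRF, trimming back to t samples.
kernel = hrf(np.arange(0, 30, TR))
convolved = np.apply_along_axis(
    lambda u: np.convolve(u, kernel)[: u.size], axis=1, arr=features)

# Candidate RDM: 1 - Spearman correlation between layer patterns at TR pairs.
rho, _ = spearmanr(convolved)  # columns are TRs -> (t x t)
candidate_rdm = 1.0 - rho
print(candidate_rdm.shape)     # (640, 640)
```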

Statistical Analysis

First, we computed the value and index of the layer that revealed the maximum correlation across layers. Second, we used a one-sample t-test (right-sided) to test whether the maximum representational similarities were significantly higher than zero, for each voxel independently. We used the false discovery rate (FDR) with q = 0.05/10 = 0.005 to account for multiple comparisons, both over voxels and over layers. Voxels with p-values below the FDR threshold were maintained; all others were assumed to be non-significant and removed. Finally, the grand-average activity map (i.e., the significant maximum correlation values) and layer map (i.e., the index of the layer where the maximum was found) were defined as the average over participants.
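A minimal sketch of this voxel-wise test; the Benjamini-Hochberg procedure from statsmodels stands in for whatever FDR implementation was actually used:

```python
import numpy as np
from scipy.stats import ttest_1samp
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Hypothetical maximum similarities: n = 7 participants x v = 20000 voxels.
max_sim = 0.02 + 0.05 * rng.standard_normal((7, 20000))

# Right-sided one-sample t-test against zero, per voxel.
t_vals, p_vals = ttest_1samp(max_sim, popmean=0.0, axis=0, alternative="greater")

# FDR over voxels at q = 0.05 / 10 layers = 0.005.
significant, _, _, _ = multipletests(p_vals, alpha=0.005, method="fdr_bh")
print(f"{significant.sum()} of {significant.size} voxels survive FDR")
```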

Results

Prediction of music tag assignments

The performance of the ResNet on predicting the tag assignments for individual tags was quantified as the area under the receiver operating characteristic (ROC) curve (AUC). The overall performance of the ResNet was 89.88%, significantly higher than chance level (p < 0.05, Z-test). To the best of our knowledge, this is the highest automated music tag prediction performance of an end-to-end model evaluated on the same split of the same dataset (Dieleman & Schrauwen, 2014).

We then compared the performance of the ResNet for individual tags (see Figure 4). Visual inspection did not reveal a prominent pattern in the performance distribution over tags. The performance was not significantly correlated with tag category or frequency (p > 0.05, Student's t-test). The performance for positive tags was significantly higher than that for negative tags (p < 0.05, Z-test).

Figure 4. Performance of the ResNet for automated music tag prediction for individual tags. The red line indicates the mean accuracy over tags.

Representational dissimilarity matrices

We constructed candidate RDMs of the (stacked) layers within the ResNet (see Figure 5). Visual inspection did not reveal any categorical structure in the similarity patterns (i.e., preferences for natural, instrumental, or vocal sounds). It can be observed that the layers within identical blocks (i.e., conv2_x, conv3_x, conv4_x, conv5_x) code for highly similar representations. Specifically, layers 2 and 3, 4 and 5, 6 and 7, and 8 and 9 each look similar in terms of their pairwise dissimilarities over time (i.e., TRs).

Figure 5. Candidate RDMs of the individual layers of the residual neural network. Each cell within a matrix shows the dissimilarity (i.e., 1 − Spearman's correlation coefficient) between the layer's representations of pairs of samples from audio clips. Squares within a matrix denote trials (i.e., single audio clips). Note that color scaling differs between matrices.

Representational similarity

We performed a searchlight RSA to investigate the correspondence between each of ten candidate models (i.e., the layers within the ResNet), and the target RDMs of spherical neighborhoods within a gray-matter brain mask. For each searchlight sphere (i.e., each sphere centered around a voxel), we assigned the value (i.e., activation map) and index (i.e., layer map) of the candidate model with maximum correlation. These maps were FDR corrected and averaged over participants.

First, we estimated the distributions of (FDR-uncorrected) representational similarities across layers (see Figure 6). These distributions show a strong positive bias. Additionally, the Spearman correlation coefficients between individual candidate RDMs and all searchlight spheres were most prominent for the deeper ResNet layers. This was revealed by an increase in average correlation with increasing layer depth; specifically, the deeper the layer, the more right-shifted the distribution became.

Figure 6. Distributions of the Spearman correlation coefficients between individual candidate RDMs and all spherical neighborhood target RDMs. The red line shows the fit of a Gaussian function to these distributions.

Second, we estimated the distributions of (FDR-corrected) layer assignments (see Figure 7). Visual inspection revealed a right-shifted distribution, where deeper layers were more prominent than shallow layers, though the peak was found at mid-level layers. Specifically, layers 6 to 9 were assigned to most voxels. The shallowest layer was not represented at all. Additionally, an asymmetry in cortical representations was found between the left and right hemispheres: the left hemisphere routinely represented deeper layers, whereas the right hemisphere revealed a shallower distribution.

Figure 7. Distributions of the layer assignments to (FDR-corrected) significant voxels. The blue bars represent the distribution for the left hemisphere, the yellow bars the right hemisphere.

Third, the layer map revealed a prominent gradient along STG (see Figure 8). This gradient reflected deep layers represented at posterior as well as anterior areas of STG (i.e., more red), while shallow layers were represented at the center of STG (i.e., more green). Specifically, STG reflected a high-low-high gradient of complexity, where PT showed a clear sensitivity to the deepest layers (i.e., layers 9 and 10), decreasing towards the central areas of STG (i.e., layers 5 to 7), and increasing again towards PP (i.e., layers 7 to 10).

Figure 8. Activation maps (top) and layer maps (bottom) for the left hemisphere (left) and right hemisphere (right), averaged over participants. The activation map shows the Spearman correlation coefficients between the maximum candidate RDM and the target RDM. The layer map shows the index of the candidate RDM where the maximum was found, delineating the most representative layer at each voxel (FDR corrected).

Finally, apart from auditory cortex, frontal, parietal, occipital, and inferior temporal representations were also found. Particularly, the frontal and inferior temporal representations showed higher similarities than the auditory areas. Notably, the bilateral parietal and left occipital and inferior temporal areas showed higher sensitivity to complex feature representations, whereas again the right hemisphere seemed dominated by sensitivity to lower-level feature representations. In addition, outside auditory cortex, several areas showed sensitivity to very low-level representations, predominantly around the bilateral Cingulate Gyrus (CG).

Discussion

We employed a supervised residual neural network for end-to-end learning of automated music tag prediction. The artificial neural network learned hierarchical and increasingly complex feature representations in a data-driven manner. In turn, we compared these features to the representations across the brain to search for correspondence between the artificial neural activities and biological neural activities, both in response to natural music clips. For this purpose, we performed a searchlight RSA across gray-matter voxels and investigated, for each voxel, which artificial neural layer was best represented by means of maximum correlation.

We found a representational gradient along STG where both ends (i.e., PP and PT) revealed sensitivity to deeper (i.e., more complex) layers, whereas the center of STG revealed a higher sensitivity to shallow (i.e., simpler) layers. The primary auditory cortex is represented at the center of STG along HG, contrasted with surrounding areas along both ends of STG representing secondary areas, also called the auditory association cortex. Additionally, the posterior area of left STG also covers Wernicke's Area, which is known to be involved in the understanding and comprehension of both speech and music (Koelsch et al., 2002). Finally, these results are analogous to results found in visual cortex, where downstream brain areas were more sensitive to deeper layers for both the ventral and dorsal streams (Güçlü & van Gerven, 2015a,b).

The central to posterior part of the representational gradient is in line with literature on hierarchical organization related to temporal complexity (Chevillet et al., 2011; Patterson et al., 2002). Pure tone responsive areas were found to be most dominant around the auditory core, located along HG (Sweet et al., 2005; Hackett et al., 2001). In contrast, the auditory belt, most responsive to band-passed noise, is located at the anterior and posterior banks of HG (Sweet et al., 2005). Finally, auditory para-belt regions that are most responsive to species-specific vocalizations are located predominantly around PT (Sweet et al., 2005). These results also showed a gradient in complexity from center STG (i.e., HG) to posterior STG (i.e., PT), in line with the representational gradient found here.

Both the center-to-anterior and center-to-posterior gradients are also in line with the hypothesis that there exists a 'what' pathway extending from anterolateral HG to posterior PP, and a 'where' pathway extending from posterior STG to PT (Ahveninen et al., 2006). These results were obtained by a dissociation between sounds that were manipulated either phonetically ('what') or spatially ('where'). Also, it has been shown that the anterior parts of STG are more responsive to changes in the temporal domain, whereas posterior parts of STG are more sensitive to changes in the spectral domain (Samson et al., 2011). Importantly, HG was found to be most responsive to a combination of both temporal and spectral content. From the artificial neural network layers we cannot directly infer any differentiation between 'what' and 'where' representations; nevertheless, such independent streams again suggest increasingly complex streams from HG to both PP and PT.

The asymmetry found in this study might be related to the left-hemispheric specialization for speech (Samson et al., 2011), though we did not investigate this pattern further. Nevertheless, the left anterior STG (i.e., belt area) has been found to be predominantly sensitive to temporal changes, whereas the right hemisphere was more sensitive to spectral changes (Schönwiesner et al., 2005; Zatorre et al., 2002). These hemispheric differences might reflect specializations for speech and music in the left and right hemispheres, respectively. Further investigation of the features represented in the individual layers of the neural net is required to conclude that the asymmetry found in this study is related to a specialization for speech and music in the two hemispheres.

Although the results look convincing, it should be noted that measuring auditory responses using fMRI is difficult. The difficulty arises from the substantial background noise coming from the MRI scanner, which is directly related to the EPI sequence (Peelle, 2014). EPI sequences generate periodic sound with high sound levels and a complex spectrum (Hedeen & Edelstein, 1997). This may cause participants to perform a different task than passive listening, as they also have to extract the audio signal from the background noise, a so-called masking effect (Belin et al., 1999). Additionally, the constant activation of auditory cortex by the scanner noise might cause saturation of the auditory response, and thereby reduce the sensitivity to the actual experimental task (Bandettini et al., 1998). Specifically designed sparse recording schemes allow one to record in between the EPI bursts, though they then require longer TRs (Di Salle et al., 2001). Additionally, as fMRI suffers from low temporal resolution, fast changes in the musical content might not be captured. These issues might explain the absence of the shallow artificial neural layers in the representational gradients found in this study.

Conclusion

We have shown that a residual neural network achieves state-of-the-art performance in automated music tag prediction. Additionally, we have shown that the representations across layers of the residual neural network revealed a representational gradient along STG. Specifically, PP (anterior STG) together with PT (posterior STG) revealed higher sensitivity to complex stimulus features (deep artificial neural layers), whereas central STG (involving HG) revealed higher sensitivity to low-level stimulus features (shallow artificial neural layers). These results, in conjunction with previous results on visual and auditory cortical representations, suggest the existence of multiple representational gradients that process increasingly complex information as one moves downstream along the sensory hierarchies of the human brain.


Acknowledgments

I would like to thank both on-site supervisors, Marcel van Gerven and Umut Güçlü, for their support and advice throughout my internship. Finally, I would like to thank fellow students, friends, and family for their help, guidance, and support, on both an academic and a personal level.

References

Ahveninen, J., Jääskeläinen, I. P., Raij, T., Bonmassar, G., Devore, S., Hämäläinen, M., ... others (2006). Task-modulated 'what' and 'where' pathways in human auditory cortex. Proceedings of the National Academy of Sciences, 103(39), 14608-14613.

Alluri, V., Toiviainen, P., Jääskeläinen, I. P., Glerean, E., Sams, M., & Brattico, E. (2012). Large-scale brain networks emerge from dynamic processing of musical timbre, key and rhythm. NeuroImage, 59(4), 3677-3689.

Alluri, V., Toiviainen, P., Lund, T. E., Wallentin, M., Vuust, P., Nandi, A. K., ... Brattico, E. (2013). From Vivaldi to Beatles and back: predicting lateralized brain responses to music. NeuroImage, 83, 627-636.

Bandettini, P. A., Jesmanowicz, A., Van Kylen, J., Birn, R. M., & Hyde, J. S. (1998). Functional MRI of brain activation induced by scanner acoustic noise. Magnetic Resonance in Medicine, 39(3), 410-416.

Belin, P., Zatorre, R. J., Hoge, R., Evans, A. C., & Pike, B. (1999). Event-related fMRI of the auditory cortex. NeuroImage, 10(4), 417-429.

Brewer, A. A., & Barton, B. (2016). Maps of the auditory cortex. Annual Review of Neuroscience.

Casey, M., Thompson, J., Kang, O., Raizada, R., & Wheatley, T. (2012). Population codes representing musical timbre for high-level fMRI categorization of music genres. In Machine Learning and Interpretation in Neuroimaging (pp. 34-41). Springer.

Chen, J. L., Penhune, V. B., & Zatorre, R. J. (2008). Listening to musical rhythms recruits motor regions of the brain. Cerebral Cortex, 18(12), 2844-2854.

Chevillet, M., Riesenhuber, M., & Rauschecker, J. P. (2011). Functional correlates of the anterolateral processing hierarchy in human auditory cortex. The Journal of Neuroscience, 31(25), 9345-9352.

Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 6964-6968).

Di Salle, F., Formisano, E., Seifritz, E., Linden, D. E., Scheffler, K., Saulino, C., ... others (2001). Functional fields in human auditory cortex revealed by time-resolved fMRI without interference of EPI noise. NeuroImage, 13(2), 328-338.

Gazzaniga, M. S., Ivry, R. B., & Mangun, G. R. (2002). Cognitive neuroscience: The biology of the mind. New York: WW Norton.

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).

Griffiths, T. D. (2003). Functional imaging of pitch analysis. Annals of the New York Academy of Sciences, 999(1), 40-49.

Güçlü, U., & van Gerven, M. A. (2015a). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of Neuroscience, 35(27), 10005-10014.

Güçlü, U., & van Gerven, M. A. (2015b). Increasingly complex representations of natural movies across the dorsal stream are shared between subjects. NeuroImage.

Hackett, T. A., Preuss, T. M., & Kaas, J. H. (2001). Architectonic identification of the core region in auditory cortex of macaques, chimpanzees, and humans. Journal of Comparative Neurology, 441(3), 197-222.

He, K., Zhang, X., Ren, S., & Sun, J. (2015a). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

He, K., Zhang, X., Ren, S., & Sun, J. (2015b). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027.

Hedeen, R. A., & Edelstein, W. A. (1997). Characterization and prediction of gradient acoustic noise in MR imagers. Magnetic Resonance in Medicine, 37(1), 7-10.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hu, X., Guo, L., Han, J., & Liu, T. (2016). Decoding power-spectral profiles from fMRI brain activities during naturalistic auditory experience. Brain Imaging and Behavior, 1-11.

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. The Journal of Physiology, 148(3), 574-591.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11), e1003915.

Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Koelsch, S., Gunter, T. C., Cramon, D. Y. v., Zysset, S., Lohmann, G., & Friederici, A. D. (2002). Bach speaks: a cortical "language-network" serves the processing of music. NeuroImage, 17(2), 956-966.

Kolb, B., & Whishaw, I. Q. (2001). An introduction to brain and behaviour. New York: Worth Publishers.

Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).

Law, E., West, K., Mandel, M. I., Bay, M., & Downie, J. S. (2009). Evaluation of algorithms using games: The case of music tagging. In ISMIR (pp. 387-392).

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Lee, Y.-S., Janata, P., Frost, C., Hanke, M., & Granger, R. (2011). Investigation of melodic contour processing in the brain using multivariate pattern-based fMRI. NeuroImage, 57(1), 293-300.

Liao, Q., & Poggio, T. (2016). Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640.

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

Marques, G., Domingues, M. A., Langlois, T., & Gouyon, F. (2011). Three current issues in music autotagging. In ISMIR (pp. 795-800).

Minsky, M., & Papert, S. (1988). Perceptrons. MIT Press.

Moerel, M., De Martino, F., Uğurbil, K., Yacoub, E., & Formisano, E. (2015). Processing of frequency and location in human subcortical auditory structures. Scientific Reports, 5.

Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553.

Oosterhof, N. N., Connolly, A. C., & Haxby, J. V. (2016). CoSMoMVPA: multi-modal multivariate pattern analysis of neuroimaging data in Matlab/GNU Octave. bioRxiv, 047118.

Patterson, R. D., Uppenkamp, S., Johnsrude, I. S., & Griffiths, T. D. (2002). The processing of temporal pitch and melody information in auditory cortex. Neuron, 36(4), 767-776.

Peelle, J. E. (2014). Methodological challenges and solutions in auditory functional magnetic resonance imaging. Frontiers in Neuroscience, 8.

Romanski, L. M., Tian, B., Fritz, J., Mishkin, M., Goldman-Rakic, P. S., & Rauschecker, J. P. (1999). Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience, 2(12), 1131-1136.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (Tech. Rep.). San Diego, CA: University of California, San Diego, Institute for Cognitive Science.

Samson, F., Zeffiro, T. A., Toussaint, A., & Belin, P. (2011). Stimulus complexity and categorical effects in human auditory cortex: an activation likelihood estimation meta-analysis. Frontiers in Psychology, 1, 241.

Santoro, R., Moerel, M., De Martino, F., Goebel, R., Ugurbil, K., Yacoub, E., & Formisano, E. (2014). Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLoS Computational Biology, 10(1), e1003412.

Schönwiesner, M., Rübsamen, R., & Von Cramon, D. Y. (2005). Hemispheric asymmetry for spectral and temporal processing in the human antero-lateral auditory belt cortex. European Journal of Neuroscience, 22(6), 1521-1528.

Seibert, D., Yamins, D. L., Ardila, D., Hong, H., DiCarlo, J. J., & Gardner, J. L. (2016). A performance-optimized model of neural responses across the ventral visual stream. bioRxiv, 036475.

Staeren, N., Renvall, H., De Martino, F., Goebel, R., & Formisano, E. (2009). Sound categories are represented as distributed patterns in the human auditory cortex. Current Biology, 19(6), 498-502.

Sweet, R. A., Dorph-Petersen, K.-A., & Lewis, D. A. (2005). Mapping auditory core, lateral belt, and parabelt cortices in the human superior temporal gyrus. Journal of Comparative Neurology, 491(3), 270-289.

Szegedy, C., Ioffe, S., & Vanhoucke, V. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

Toiviainen, P., Alluri, V., Brattico, E., Wallentin, M., & Vuust, P. (2014). Capturing the musical brain with Lasso: Dynamic decoding of musical features from fMRI data. NeuroImage, 88, 170-180.

Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: a next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) at the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS).

Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619-8624.

Zatorre, R. J., Belin, P., & Penhune, V. B. (2002). Structure and function of auditory cortex: music and speech. Trends in Cognitive Sciences, 6(1), 37-46.

