
The Role of Hierarchical Visual Features in the Attentional Blink

Sophie Berkhout
Bachelor Thesis
Supervised by: Daniel Lindh

University of Amsterdam

Abstract

The attentional blink (AB) is a reduced ability to report a second target (T2) when it is presented within 200-500 milliseconds after the first target (T1) in a rapid serial visual presentation (RSVP). Because it induces controlled lapses of attention, the AB is a popular tool for studying conscious access to visual stimuli. Prominent theories posit that the AB follows a two-stage process, in which stimuli are first processed up to a semantic level, followed by frontal and parietal engagement in which object representations are consolidated into working memory. The attentional depletion caused by T1 is believed to impair this second stage of processing for T2, leaving semantic processing intact. The current study examines how the AB affects natural images in different ways, hypothesizing that any difference in the AB across images can be related to differences in the high-level features of each image. To define a full feature space of progressively more complex features, we employed a Deep Convolutional Neural Network (DCNN) trained on millions of images to classify objects at near-human performance. Results show that a multivariate, cross-validated regression model could not predict the AB magnitude of an image from either the high-level or the low-level features provided by the DCNN. We conclude that we did not find a prominent role for high-level features in the AB. The usefulness of DCNNs in predicting neural responses as well as human behaviour is discussed.

The attentional blink (AB) is a well-studied phenomenon that occurs when people are presented with two targets embedded in a rapid serial visual presentation (RSVP) together with distractor masks (Raymond, Shapiro & Arnell, 1992). The ‘blink’ refers to a reduced ability to report the second target (T2) when it is presented within 200-500 ms after the first target (T1). The second target is shown for the same duration as the other images, but it is not consciously perceived and can therefore not be reported. The AB thus presents a useful tool for inducing controlled lapses of conscious experience, making it possible to study the different stages of processing required for conscious access to visual stimuli.

Leading theories of the AB state that processing proceeds in two stages. Stimuli are first processed on a semantic level, after which frontal and parietal regions consolidate object representations into working memory (Colzato, Spapé, Pannebakker, & Hommel, 2007). While encoding of the first target occupies attentional resources, the representation of the second target is susceptible to decay or to being overwritten by masks (Nieuwenstein, Van Der Burg, Theeuwes, Wyble & Potter, 2009). Processes that limit performance in reporting the T2 include attentional selection, working memory encoding, episodic registration, response selection, attentional enhancement and engagement, and distractor inhibition (Dux & Marois, 2009).

Chun and Potter (1995) proposed the two-stage model of the AB. In the first stage, stimuli rapidly activate short-lived conceptual representations in the brain, whose identities are only briefly available. When interfered with by the other RSVP stimuli, these representations are vulnerable to rapid forgetting unless they are selected by attentional resources for further processing and consolidation. However, the brain can only access a target consciously if it is processed through the second stage. This second stage transfers the short-lived representation of a target stimulus into a more durable representation in working memory. It is capacity-limited and triggered by an attentional response that selects a specific target. When the second target appears before the second stage has finished processing T1, stage-two processing of T2 is delayed; consequently, T2 can only be processed in the first stage. The second stage is thus only accessible through a bottleneck, and attentional resources are necessary to enter it. Research has shown that visual areas in the human brain respond to missed as well as reported second targets, while parietal-frontal regions only respond to reported second targets (Luck, Vogel & Shapiro, 1996; Dux & Marois, 2009). Another study found that consciously perceiving a target amplified activation of the medial temporal cortex (Marois, Yi & Chun, 2004).

Krizhevsky, Sutskever, and Hinton (2012) introduced a way to define a full feature space of progressively more complex features, analogous to processing in the human visual stream, by creating a Deep Convolutional Neural Network (DCNN). DCNNs are built for object recognition in images and consist of successive layers of processing; such models reach classification accuracy near human performance. The DCNN is particularly useful because of its neural relevance: it describes a vast feature space from which the feature activations of each layer can be extracted for every image.

The DCNN corresponds well to the hierarchy of the visual system: processing complexity increases along the layers of the model, analogous to the progression downstream through the visual areas of the brain (Wen, Shi, Chen & Liu, 2017; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Güçlü & Van Gerven, 2015). This means that earlier stages of processing in the human brain correspond to the first convolutional layers of the DCNN, while later stages of processing correspond most to the final fully-connected layers. Furthermore, Wen, Shi, Zhang, Lu & Liu (2016) found that the DCNN reliably predicts fMRI data from humans. Because of this association, the DCNN can be used as a biologically relevant way to explore the neural processing of object recognition by examining different layers of the model.

AB studies generally use images of letters and numbers. However, these symbols do not represent the stimuli humans encounter in the real world, which makes them less ecologically valid. Identification and localization of objects in images of natural scenes are more attention-demanding (Evans & Treisman, 2005). Moreover, Sha et al. (2015) found that inanimate stimuli are treated differently in the ventral vision pathway than animate stimuli: images are represented along a so-called animacy continuum, ranging from least to most animate. Carlson, Ritchie, Kriegeskorte, Durvasula & Ma (2014) found that differences in representational space in the ventral vision pathway cause variation in reaction times for object recognition, and concluded that the form of the sensory representation of images is key to object categorization. These studies show that the use of symbols such as letters and numbers cannot fully explain the processes behind object recognition. Animate images also show different AB behaviour: Guerrero and Calvillo (2016) showed that animate images were reported more often than inanimate objects at both lags of the AB task, suggesting that animate objects capture attention more easily. Moreover, according to Evans and Treisman (2005), the visual stream treats the diverse features that are shared within a category differently; they showed that performance in reporting the T2 was better when both targets were of the same category than when they were of different categories.

This study investigates how the AB interferes with the conscious perception of different types of natural images. The hypothesis is that differences in the AB across images correspond to the high-level features within each image, rather than to low-level features such as orientations and contrasts. We tested the attentional blink with images of animate as well as inanimate natural scenes and used the DCNN model to define a full feature space, extracted from different layers of the model. A key assumption in this study is that if the deeper layer predicts the AB magnitude (ABM) better than the first layer of the DCNN, this reflects the second stage of processing in the two-stage model.

Method

Participants

For this experiment, 22 psychology students were recruited through a participant pool at the University of Amsterdam. One participant was removed due to below-chance T1 accuracy. The remaining participants were between 18 and 26 years old (M = 20.38, SD = 2.01), and 16 were female. Participants received credits for participation, which are required for the completion of their first year of study. The ethics committee of the University of Amsterdam approved the study and all participants signed informed consent.

Apparatus and Stimuli

The attentional blink task used 16 images from eight different categories: bears, monkeys, beetles, butterflies, cars, airplanes, cabinets, and chairs. The first half of these categories are animate, the other half inanimate. The task consisted of 25 blocks, each containing two runs of 24 trials. Within each block, the images were randomised such that every image was shown three times. The T1 was randomly picked for each trial, with the only constraint that it could not be the same as the T2. Each trial started with a 500-millisecond fixation point, followed by a rapid serial visual presentation (RSVP) stream of 19 images. Each image was presented for 20 milliseconds, followed by an 80-millisecond blank, giving a stimulus-onset asynchrony (SOA) of 100 milliseconds. The experiment consisted of three conditions, lag-1, lag-2, and lag-7, where the lag is the number of positions between the targets. In all conditions, the T2 was presented at the 13th position of the RSVP. In the lag-1 condition the T1 was placed at the 12th position, in the lag-2 condition at the 11th position, and in the lag-7 condition at the 6th position. Figure 1 illustrates the RSVP for the lag-2 and lag-7 conditions.
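A minimal sketch of one trial's stream layout under this design, using 1-based positions as in the text (the real task drew distractors from separate mask images, which is simplified here):

```python
import random

# T2 is always at stream position 13; the T1 position depends on the lag condition
LAG_T1_POS = {1: 12, 2: 11, 7: 6}
IMAGE_MS, BLANK_MS, SOA_MS = 20, 80, 100  # per-item presentation timing

def build_trial(images, t2, lag):
    """Lay out one 19-item RSVP stream for a given T2 image and lag condition."""
    t1 = random.choice([im for im in images if im != t2])     # T1 may not equal T2
    stream = [random.choice(images) for _ in range(19)]       # simplified distractor fill
    stream[LAG_T1_POS[lag] - 1] = t1   # convert 1-based positions to 0-based indices
    stream[13 - 1] = t2
    return stream, t1
```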

Figure 1. RSVP of the AB task, with lag-2 on the left and lag-7 on the right.

After each RSVP, the screen displayed a response menu, separately for the T1 and the T2. Each menu presented four labels in a two-by-two arrangement, of which only one matched the target while the others were randomly drawn from the remaining labels (see Figure 2). Participants responded by pressing the key corresponding to the image label (‘S’, ‘X’, ‘M’, or ‘K’).


Figure 2. Example of the layout of the response menus. Note that the original menus were presented in Dutch.

The feature space of the images was derived from the AlexNet DCNN (Krizhevsky, Sutskever, & Hinton, 2012). The network consists of five convolutional layers and three fully-connected layers of artificial neurons. The neurons in the convolutional layers correspond to features across spatial locations; the fully-connected layers, on the other hand, are connected to all features of the previous layer. Figure 3 shows a visual representation of the DCNN.

The first layer of the model corresponds to low-level processing of orientations and contrasts, whereas the seventh, fully-connected layer is associated with high-level processing. Therefore, the features per image were retrieved from the first and the seventh layer of the model. Every image has a unique activation pattern in each layer, which the DCNN uses to determine the label of the image.
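A minimal sketch of this layer-wise feature extraction, assuming torchvision's pretrained AlexNet as a stand-in for the original network and the standard ImageNet preprocessing (the thesis does not specify its exact pipeline):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained AlexNet; .eval() disables dropout for deterministic activations
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# Standard ImageNet preprocessing (assumed). The stimuli were greyscale,
# so the single channel is replicated to the three channels AlexNet expects.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def layer_features(image_path: str):
    """Return flattened activations of layer 1 (conv1) and layer 7 (fc7)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        conv1 = model.features[:2](x)                 # first conv + ReLU
        pooled = model.avgpool(model.features(x)).flatten(1)
        fc7 = model.classifier[:6](pooled)            # up to and including the fc7 ReLU
    return conv1.flatten(1).squeeze(0), fc7.squeeze(0)
```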

Figure 3. Composition of the AlexNet DCNN, including eight layers (Cichy et al., 2016, p. 2).

Procedure

First, participants read an instruction sheet, after which they were asked to sign a consent form. After participants gave consent, the experimenter gave verbal instructions on how to perform the task. The task was presented at the centre of the screen at a visual angle of 5 degrees. The targets and the distractors were all converted to greyscale and all images were square.

The experiment lasted two hours. Participants sat alone in a room with a computer for the duration of the session; the experimenter was always present in the adjacent room.

Analysis

To establish an AB, a one-way repeated measures ANOVA over lag-2 and lag-7 performance was conducted. We then examined the effect of animacy on T2 performance with a one-way repeated measures ANOVA.
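As a minimal, runnable sketch of such a repeated measures ANOVA using statsmodels (the tiny data frame is synthetic, purely for illustration):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long format: one row per participant x lag, holding mean T2 accuracy
df = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3],
    "lag":         ["lag2", "lag7"] * 3,
    "t2_acc":      [0.80, 0.93, 0.78, 0.90, 0.85, 0.94],
})
# One-way repeated measures ANOVA with lag as the within-subject factor
print(AnovaRM(df, depvar="t2_acc", subject="participant", within=["lag"]).fit())
```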

The AB magnitude (ABM) is the proportion of correct T2 reports at lag-7 minus the proportion of correct T2 reports at lag-2. Computing the ABM per participant per image resulted in a 16 × 21 matrix of ABM values.
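A minimal sketch of the ABM computation, assuming per-lag accuracy matrices of proportion-correct T2 reports (shapes and names are illustrative):

```python
import numpy as np

def ab_magnitude(acc_lag7: np.ndarray, acc_lag2: np.ndarray) -> np.ndarray:
    """ABM per participant per image: lag-7 accuracy minus lag-2 accuracy.

    Both inputs have shape (n_participants, n_images), here (21, 16).
    """
    return acc_lag7 - acc_lag2
```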

Each image creates a unique neuron activation pattern in the DCNN. The features used in the analysis are these activation patterns in the first and seventh layer. To test the predictive power of the first and seventh layer of the DCNN, a multivariate, cross-validated, linear regression was conducted per participant, producing 16 leave-one-out linear models per participant. The predictors were features selected from each layer, using a wrapper method around a Random Forest classification algorithm for feature selection. The outcome variable of the regression was the ABM per image of the corresponding participant. The performance of the linear model was tested with leave-one-out cross-validation: each model was trained on 15 images, and the ABM of the left-out image was then predicted using the trained model.
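A minimal sketch of this procedure for one participant and one layer, assuming a feature matrix X of shape (16, n_features) and the participant's 16 ABM values y; a random-forest importance ranking stands in here for the wrapper feature selection, and all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_predict(X: np.ndarray, y: np.ndarray, n_keep: int = 50) -> np.ndarray:
    """Leave-one-image-out ABM predictions from one DCNN layer's features."""
    preds = np.empty_like(y, dtype=float)
    for train, test in LeaveOneOut().split(X):
        # Select features on the 15 training images only, so the held-out
        # image never influences which features enter the linear model.
        rf = RandomForestRegressor(n_estimators=200, random_state=0)
        rf.fit(X[train], y[train])
        keep = np.argsort(rf.feature_importances_)[-n_keep:]
        lm = LinearRegression().fit(X[train][:, keep], y[train])
        preds[test] = lm.predict(X[test][:, keep])
    return preds
```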

We tested the accuracy of the predicted ABMs by correlating them with the actual ABMs per participant. This resulted in one correlation value per person, which was then Fisher transformed. These transformed correlation coefficients were tested for whether they differed significantly from zero.
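A minimal sketch of this group-level test, under the same illustrative shapes:

```python
import numpy as np
from scipy import stats

def group_test(actual: np.ndarray, predicted: np.ndarray):
    """Per-participant correlation of predicted vs. actual ABMs,
    Fisher z-transformed and tested against zero.

    actual, predicted: arrays of shape (n_participants, n_images).
    """
    rs = np.array([np.corrcoef(a, p)[0, 1] for a, p in zip(actual, predicted)])
    zs = np.arctanh(rs)                # Fisher z-transformation
    return stats.ttest_1samp(zs, 0.0)  # one-sample t-test against zero
```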

Results

The one-way repeated measures ANOVA over lag-2 and lag-7 showed a significant effect of lag on performance, F(1, 20) = 58.45, p < .001. Lag-2 performance (M = 0.82, SD = 0.39) was lower than lag-7 performance (M = 0.92, SD = 0.28). Figure 4 shows the attentional blink.

Figure 4. Average performance per lag.

We tested the effect of animacy on T2 reporting using a one-way repeated measures ANOVA. Animacy had a significant effect, F(1, 20) = 7.42, p = .01: participants reported animate objects (M = 0.88, SD = 0.32) more frequently than inanimate objects (M = 0.85, SD = 0.36). Figure 5 shows the AB behaviour of each of the eight categories.


Figure 5. Average performance at lag-7 and lag-2 per image category. The differences between the lags are the attentional blink magnitudes.

The first multivariate, cross-validated, linear regression was trained to predict the ABM using low-level features from the first layer of the DCNN. The second regression used high-level features from the seventh layer to predict the ABMs.

The predicted ABM values from the first regression showed no significant correlation with the actual ABM values, r = -.05, t(20) = -0.51, p = .616. This means that the features from the first layer of the DCNN were not accurate predictors of the AB.

The same analysis for the predictions of the second regression, based on features of the seventh layer, likewise found that the predicted ABMs did not significantly correlate with the actual ABMs, r = .13, t(20) = 2.04, p = .054. This layer was therefore not an accurate predictor of the attentional blink either.

Figure 6 shows the average ABM and the average predicted ABM for both layers. The figure illustrates that the predictions do not fit the data well: both show outliers and no correlation.


Discussion

The present study tested the effects of the AB for natural images. The AB task used 16 images from eight different categories. To define the feature space of these images, a visual model with neural relevance was used.

This study shows that an attentional blink is still present when images of natural scenes are used. Such images are more ecologically valid than letters or numbers, indicating that the attentional blink phenomenon also holds for natural images.

We found that second targets depicting animate objects were reported more often than those depicting inanimate objects. This is in accordance with the findings of Guerrero and Calvillo (2016), who suggested that animate images capture attention more easily.

The multivariate, cross-validated regression model based on high-level features could not predict the ABM of an image, and low-level features could not predict it accurately either. Thus, from this study it cannot be concluded that the bottleneck of the attentional blink occurs in the later stages of processing; the two-stage model could not be verified on the basis of this analysis.

A reason why the DCNN model could not explain the ABM may be the limited number of images in the attentional blink task. Because of this, the regression model was trained on only 15 images per cross-validation fold and may therefore not have had enough features overlapping with the left-out 16th image; the left-out image could have features that the other images do not have.

It is possible that the DCNN will yield better predictions when more images are included, because there is evidence that the DCNN has predictive power for the spatio-temporal neural dynamics of object recognition. A number of studies show the relevance of the features in the DCNN, finding that it accurately models neural responses across the ventral stream in the human brain (Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Güçlü & Van Gerven, 2015).

Another reason for the lack of predictive power of the DCNN could be that it has too many parameters, resulting in low generalizability due to overfitting of the training data. Since the advent of AlexNet, competing DCNNs such as ResNet have been shown to outperform it in predicting neural responses of the ventral visual stream (He, Zhang, Ren & Sun, 2016).

Finally, the AlexNet model is trained to label the appearance of objects. However, research has shown that the function of an object, rather than its appearance, plays an important role in human object recognition (Grabner, Gall & Van Gool, 2011): knowing the function of an object makes it easier to recognise. A deep convolutional neural network built on the recognition of object function instead of object appearance, or perhaps both, might perform better and correspond more closely to human object recognition.

In conclusion, images of natural scenes do show an attentional blink, and objects of different categories show differences in processing for conscious awareness. This could mean that the stimuli used in previous studies lack generalizability; the findings of those studies would then apply specifically to those stimuli and require careful interpretation.

How this AB occurs, however, could not be explained by the DCNN, so the analysis could not demonstrate the importance of high-level features for the AB. This does not, however, give cause to understate the usefulness of the DCNN in predicting neural responses, and in turn human behaviour. Future research should use more images and competing DCNNs to further investigate neural responses throughout the visual stream.


Literature

Carlson, T. A., Ritchie, J. B., Kriegeskorte, N., Durvasula, S., & Ma, J. (2014). Reaction time for object categorization is predicted by representational distance. Journal of Cognitive Neuroscience, 26(1), 132-142.

Chun, M. M., & Potter, M. C. (1995). A two-stage model for multiple target detection in rapid serial visual presentation. Journal of Experimental Psychology: Human Perception and Performance, 21(1), 109.

Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of Deep Neural Networks to Spatio-Temporal Cortical Dynamics of Human Visual Object Recognition reveals Hierarchical Correspondence. Scientific Reports, 6, 27755.

Colzato, L. S., Spapé, M. M., Pannebakker, M. M., & Hommel, B. (2007). Working Memory and the Attentional Blink: Blink size is predicted by individual differences in operation span. Psychonomic Bulletin & Review, 14(6), 1051-1057.

Dux, P. E., & Marois, R. (2009). The Attentional Blink: A review of data and theory. Attention, Perception, & Psychophysics, 71(8), 1683-1700.

Evans, K. K., & Treisman, A. (2005). Perception of Objects in Natural Scenes: is it really attention free?. Journal of Experimental Psychology: Human Perception and Performance, 31(6), 1476.

Grabner, H., Gall, J., & Van Gool, L. (2011). What makes a chair a chair?. Computer Vision and Pattern Recognition, 2011 IEEE Conference, 1529-1536.

Güçlü, U., & van Gerven, M. A. (2015). Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. Journal of Neuroscience, 35(27), 10005-10014.

Guerrero, G., & Calvillo, D. P. (2016). Animacy increases second target reporting in a rapid serial visual presentation task. Psychonomic bulletin & review, 23(6), 1832-1838.


He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems (pp. 1097-1105).

Luck, S. J., Vogel, E. K., & Shapiro, K. L. (1996). Word Meanings can be Accessed but not Reported during the Attentional Blink. Nature, 383(6601), 616-618.

Marois, R., Yi, D. J., & Chun, M. M. (2004). The neural fate of consciously perceived and missed events in the attentional blink. Neuron, 41(3), 465-472.

Nieuwenstein, M., Van der Burg, E., Theeuwes, J., Wyble, B., & Potter, M. (2009). Temporal constraints on conscious vision: On the ubiquitous nature of the attentional blink. Journal of Vision, 9(9), 18-18.

Raymond, J. E., Shapiro, K. L., & Arnell, K. M. (1992). Temporary suppression of visual processing in an RSVP task: An attentional blink?. Journal of Experimental Psychology: Human Perception and Performance, 18(3), 849.

Sha, L., Haxby, J. V., Abdi, H., Guntupalli, J. S., Oosterhof, N. N., Halchenko, Y. O., & Connolly, A. C. (2015). The animacy continuum in the human ventral vision pathway. Journal of Cognitive Neuroscience.

Wen, H., Shi, J., Chen, W., & Liu, Z. (2017). Deep Residual Network Reveals a Nested Hierarchy of Distributed Cortical Representation for Visual Categorization. bioRxiv, doi:10.1101/151142

Wen, H., Shi, J., Zhang, Y., Lu, K. H., & Liu, Z. (2016). Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cerebral Cortex, doi:10.1093/cercor/bhx268
