
Where do we look next? : image complexity and salience as possible explanations of visual attention allocation in infants


Academic year: 2021


Master thesis of Adam Sasiadek

Universiteit van Amsterdam

Abstract

In this report, we present two approaches in which we examine the role of bottom-up image features in guiding infant attention in natural scenes. In our first approach we examined the influence of a biologically plausible measure of natural visual complexity on the overall attention spent on a scene. We expected infants to seek out optimal complexity. In our second approach we examined the strength of the influence of contrast, edge-content, chromaticity and luminance on attention allocation in the scene. We expected an increase in strength of influence with age. The infants studied were in the 3 to 15 month age range and analyses were conducted on eye-tracking data. The results for approach one suggest that infants attend less to complex scenes than to simple scenes. The results for approach two show that all image features were discriminative of fixations and non-fixations and suggest a phase in which infants could be guided more by image features than adults. Future possibilities and limitations of our method are discussed.

Over the last decades, there has been a considerable increase in knowledge about which bottom-up image features influence and guide attention, and how they do so (Engel, Zhang, & Wandell, 1997; Giesbrecht, Woldorff, Song, & Mangun, 2003; Maunsell & Treue, 2006). However, most of the resulting computational models are aimed at understanding adult visual processing, and comparatively little is known about image-feature driven attention in infants or its development. Additionally, most infant research is conducted with abstract, static or moving stimuli (Bronson, 1994). Bornstein, Mash, and Arterberry (2011), however, have shown that infants show different looking behaviour when viewing more natural, context-rich scenes. Finally, most of the published infant research relies on the subjective scoring of attention by experts. The goal of this study was therefore to analyse global and local aspects of image-feature driven attention in more realistic, natural scenes. To heighten objectivity, we relied on eye-tracking data. In this report, we present two approaches in which we examine the role of bottom-up image features in guiding infant attention in natural scenes.

Examining the features that influence and explain different aspects of infant attention

I would like to thank Ingmar Visser and Maartje Raijmakers for supervising this project with so much patience, good ideas, encouragement and understanding.


attention and searching also begin in this age range. At around 6 to 9 months acuity is known to develop quickly to near mature levels and by two years the myelinization of the optic nerve is completed (Daw & Daw, 2006). We thus assumed that infants would be capable of exploring the stimuli, albeit with individual differences caused by their developmental state.

In our first approach, we examined the influence of the overall visual complexity of a natural scene on the attention infants would spend while exploring it.

That visual complexity plays a meaningful role in visual processing has been shown in a study by Scholte, Ghebreab, Waldorp, Smeulders, and Lamme (2009). Scholte et al. (2009) suggest that our brain has evolved to efficiently compress incoming sensory information by exploiting the statistical regularities of our natural surroundings. In particular, they show that the parameters of the Weibull distribution, beta and gamma, can easily be estimated by X and Y-Cells found in the early visual cortex and that these parameters show a high covariance with EEG firings in the visual cortex when participants were viewing images of natural scenes. The beta and gamma parameters have been shown to be a naturally occurring, sparse representation of the contrast distribution in our natural surroundings. Higher beta and gamma values go along with more image cluttering and more random patterns. Lower beta and gamma values, on the other hand, encode images which consist of more uniform surfaces or single objects.

Also, when visually exploring their surroundings, infants have been reported to seek out an optimum of informational complexity. Kidd, Piantadosi, and Aslin (2010) show that infants attend longer to sequences of stimuli when the informational complexity is neither too high nor too low. In the study, infants were presented with a sequence of toys on a computer screen. By varying the probability of the first toy re-appearing, the informational complexity of the sequence changed. In this context, totally random sequences were seen as very complex, whereas sequences in which the toy re-appeared more often were considered less complex.

In both studies, complexity is conceptually related to Solomonoff's algorithmic complexity theory (Solomonoff, 1985). In algorithmic information theory, information complexity rises with the amount of information, or bits, that needs to be transferred in order to fully reproduce an original message. The more random a sequence of bits is, the more bits are needed to exactly reproduce the original. If there are regularities in the message, however, it is possible to use much less information for the transfer and the complexity is lower. Given these conceptual parallels, we hypothesized that infants could show a preference for optimal complexity in visual information, similar to the one reported by Kidd et al. (2010). If the scene is very simple, infants would not spend a lot of attention on exploring it. If the scene becomes too cluttered and complex, however, a similar drop in attention should be observable. We also assumed that age and the sequence of presentation would influence the total amount of attention. We expected age to influence the individual differences in the total attention spent by each infant. The later an image was presented in the experiment, however, the lower the looking time was expected to be.
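The relation between regularity and description length can be illustrated with a general-purpose compressor, used here as a rough, computable stand-in for algorithmic complexity. The sketch below is only an illustration: the use of zlib and the two toy byte sequences are assumptions and were not part of the study.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Number of bytes zlib needs to encode `data` (a crude complexity proxy)."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)  # fixed seed so the example is repeatable
n = 4096

regular = bytes([0, 1] * (n // 2))                   # strictly alternating pattern
noisy = bytes(rng.randrange(256) for _ in range(n))  # no exploitable regularity

size_regular = compressed_size(regular)
size_noisy = compressed_size(noisy)
print(size_regular, size_noisy)
```

The regular sequence shrinks to a small fraction of its length, whereas the random sequence cannot be shortened, which mirrors the low and high ends of the complexity range described above.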

In our second approach we explore the influence of image features which are involved in the local allocation of attention.

Local bottom-up attention is thought to be guided by external influences and physiological research suggests that the visual system evolved to pick up an array of visual features. In particular, ganglion cells of the retina have been found to be tuned to contrast, luminance and chromaticity. This neuronal tuning becomes increasingly specialised with the progression from lower to higher-order visual systems and there are neurons responding only to corners, junctions, shape-from-shading cues or even specific objects (Itti & Koch, 2001). There is also evidence that attention is guided towards these image features. Reinagel and Zador (1999) have shown that the locations adults fixate in a scene have higher contrast than the non-fixated locations. Based on this knowledge, Koch and Ullman (1987) propose that after early processing, low-level scene properties are represented in our brain as a salience-map, a physical manifestation of the non-uniform distribution of salience in our surroundings. The idea of a salience-map has led to an easy-to-use computational model by Itti and Koch (2000) and has been extensively studied. However, most of the research influencing and informing the development of the salience map model has been conducted with adults. Much less is known about whether the same image features that influence adults also influence infant eye-guidance.

To lay the foundations for building a comparable model of infant eye guidance and local visual attention, we aimed to determine the strength of the influence of an array of biologically plausible image features on attention allocation. In particular, we investigated whether contrast, chromaticity and edge-content are discriminatory between fixated and non-fixated locations. Tatler, Baddeley, and Gilchrist (2005) have shown that these image features are discriminatory of fixated and not-fixated locations in adults viewing natural scenes.

To validate our findings and create a point of reference, we first replicated Tatler et al.'s (2005) findings in an adult sample. In a second step, we aimed to investigate possible changes in the strength of the influence of image features on attention allocation. Amso, Haas, Tenenbaum, Markant, and Sheinkopf (2014) have shown that bottom-up saliency is a better predictor for attention allocation for adults than for children. This is in line with findings of Althaus and Mareschal (2012), who report that 12-month-old infants are more responsive to salience than 4-month-old infants. The mentioned studies, however, used the Itti and Koch salience model and did not allow for inferences about the strength of the influence of the separate aforementioned image features. The results of this approach were aimed at giving insight into the perceptual capabilities and development of bottom-up driven viewing. In line with the empirical evidence, we hypothesized that the strength of the influence of image features on attention would rise with age.


Figure 1. Three sample scenes from the free-viewing dataset.

Method

Participants and data recording

Forty-three infants (mean age = 8.4 months, SD = 3.7, range = 3 to 15 months) and forty-seven adults (mean age = 21.7 years, SD = 4.5, range = 17 to 39 years) participated in a natural scene viewing experiment by Scott Johnson. The data of this experiment were kindly made available to us. In the experiment, the participants were presented with 28 pictures of everyday, natural scenes for a duration of four seconds each (see Figure 1 for examples of the scenes). Eye movements were recorded monocularly with an EyeLink eye-tracker operating at 500 Hz.

Eye-tracking data processing

To differentiate saccades and fixations, we used a novel algorithm by Mould, Foster, Amano, and Oakley (2012). The algorithm makes use of a special characteristic of eye movements, namely that eyes make many very quick and short movements during fixations, but slower and longer movements during saccades. To make use of this characteristic for differentiation, the algorithm uses the local speed maxima of the eye and separates them by finding an optimal threshold. Additionally, it is necessary to separate fixations from noise. This is achieved by calculating the frequencies of the non-saccadic durations and separating them at an optimum. The frequencies of non-saccadic durations resemble a bimodal mixture-like distribution. In the Mould algorithm, the minimum between the two modes is used to separate the shorter-lasting noise from the longer-lasting fixations. The advantage of the Mould algorithm is that these separations are calculated for each trial and participant, thereby taking into account individual differences between the participants and trials. This characteristic is especially useful when studying infant data.

Approach 1: Visual complexity as a predictor of total attended time

Visual complexity. The cluttering of a scene was quantified by the sum of the beta and gamma parameters of the Weibull distribution fitted to the edge histogram of each scene separately. In order to calculate the beta and gamma parameters we followed the method described by Scholte et al. (2009). First, gradient images were derived by filtering each image with differently oriented Gaussian derivative filters. After combining the resulting outputs, an edge histogram was generated on the basis of the gradient magnitude of the image. Finally, the parameters were estimated by fitting the Weibull distribution to the edge histogram using maximum likelihood.

Figure 2. An example of a mapping of images into beta-gamma space. The axis from bottom left to top right shows increased image cluttering and thus complexity. The axis from top left to bottom right shows a change in image texture, with images going from more dissimilar textures to more similar textures. The image is taken from Scholte et al. (2009).

Looking time. Total attended time was quantified by taking the sum of all fixation times in the image boundaries per participant. The Mould, Foster, Amano, and Oakley (2012) algorithm results in a fixation-onset time and a fixation-end time for each fixation. The difference between those two values was summed for each scene.
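The looking-time measure can be sketched as follows. This is a minimal illustration, assuming each fixation is stored as an onset time, an end time and a pixel position; the tuple layout, the function name and the toy trial are hypothetical and not taken from the original analysis code.

```python
from typing import List, Tuple

# One fixation: (onset_ms, end_ms, x_px, y_px). This layout is hypothetical.
Fixation = Tuple[float, float, float, float]


def total_attended_time(fixations: List[Fixation], width: int, height: int) -> float:
    """Sum the durations of all fixations that land inside the image."""
    total = 0.0
    for onset, end, x, y in fixations:
        if 0 <= x < width and 0 <= y < height:
            total += end - onset
    return total


trial = [
    (0.0, 310.0, 400.0, 300.0),     # inside the image, 310 ms
    (350.0, 900.0, 120.0, 80.0),    # inside the image, 550 ms
    (940.0, 1200.0, -15.0, 200.0),  # left of the image, ignored
]
looking_time = total_attended_time(trial, width=800, height=600)
print(looking_time)  # 860.0
```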

We also accounted for the influence of age and the sequence of presentation on looking time.

Approach 2: Salience as a predictor of fixation location

Modelling the salience properties. We have chosen to assess four biologically plausible image properties: contrast, luminance, edge-content and chromaticity. The models used to derive the image properties are described below. An important characteristic of the visual system is that it operates at different spatial resolutions or scales, for reasons of efficiency. When the processed information is too detailed, eye guidance will be off; the same applies to information that is too coarse. Which spatial scale is processed by the visual system depends on receptor properties and the task at hand. In order to derive the optimal spatial scales for infants and adults, we calculated each image property at 13 biologically plausible spatial scales, ranging from 0.42 to 10.8 cycles per degree (cpd).

Figure 3. The stimuli used in this experiment mapped into beta-gamma space.

The process used to derive the salience properties of each picture needed multiple steps. The first step for the features of luminance, contrast and edge-content was to convert the pictures to grey-scale, using a standard MATLAB function. For all images, it was necessary to reduce artefacts caused by the convolution along the edges of the pictures. To achieve this, the pixels at the end of each row and column of an image were averaged and the mean intensity was then streamed out from the end of the column or row. The images were extended by eight times the standard deviation of the used filter. At the corners, pixel intensity was calculated by the nearest neighbouring pixel intensities. After convolution, the pictures were cropped to their original size again.

To model receptor non-linearities, the grey scale image was log-transformed. For more details about the reasoning behind this step, see Tatler et al. (2005) and Valeton and van Norren (1983). Finally, the images were convolved using the different filters.

Luminance was extracted by convolving images with a Gaussian filter as described in Equation 1, where x and y specify the coordinates of each pixel in the image.

\[
f(x, y) = \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{1}
\]

Contrast was modelled with a difference-of-Gaussians filter, as in Equation 2. The ratio of surround to centre was set to 3.88.
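A Gaussian kernel and a difference-of-Gaussians kernel with the stated surround-to-centre ratio can be sketched as below. This is an illustration under assumptions: each Gaussian is normalised to sum to one, the support is truncated at four surround standard deviations, and the centre sigma of 2 pixels is arbitrary. None of these choices is specified in the text, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma: float, radius: int):
    """Sample exp(-(x^2 + y^2) / (2 sigma^2)) on a square grid, normalised to sum 1."""
    axis = range(-radius, radius + 1)
    kernel = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma)) for x in axis]
              for y in axis]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def dog_kernel(sigma_centre: float, ratio: float = 3.88):
    """Centre Gaussian minus a surround Gaussian that is `ratio` times wider."""
    sigma_surround = ratio * sigma_centre
    radius = math.ceil(4 * sigma_surround)  # assumed truncation of the support
    centre = gaussian_kernel(sigma_centre, radius)
    surround = gaussian_kernel(sigma_surround, radius)
    return [[c - s for c, s in zip(rc, rs)] for rc, rs in zip(centre, surround)]


dog = dog_kernel(sigma_centre=2.0)
mid = len(dog) // 2
dog_sum = sum(map(sum, dog))
print(len(dog), dog[mid][mid] > 0, abs(dog_sum) < 1e-9)
```

With both Gaussians normalised, the kernel sums to zero, so a uniform image region produces no response and only local differences between centre and surround are picked up.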

Figure 4. Examples of feature maps at 0.75, 1.5 and 10.8 cpd for one of the scenes viewed by the participants. (a) The original scene, (b) edge-content, (c) contrast, (d) chromaticity, (e) luminance.

of the carrier is defined by θ2 and was set to 0.4σ. All of the parameters were chosen to be within plausible biological ranges.

\[
f(x, y) = \sin\left(\frac{1}{\theta_{2}}\left(x \sin\theta_{1} + y \cos\theta_{1}\right)\right) \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right) \tag{3}
\]

To capture the unsigned difference in the pictures, the mean value of each feature map was subtracted from the map and the output was squared. This way, it is possible to capture both extremes of salience, e.g. "brightness" and "darkness" for the luminance feature maps. Finally, to allow for meaningful comparisons between the images, the salience maps were standardized by dividing by the standard deviation of the output.
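The post-processing of a filtered image (subtract the mean, square, divide by the standard deviation of the output) can be sketched as below. The function name and the toy input are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math


def unsigned_standardised(feature_map):
    """Square each value's deviation from the map mean, then divide the squared
    map by its own (population) standard deviation."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    squared = [[(v - mean) ** 2 for v in row] for row in feature_map]
    flat = [v for row in squared for v in row]
    sq_mean = sum(flat) / len(flat)
    sd = math.sqrt(sum((v - sq_mean) ** 2 for v in flat) / len(flat))
    return [[v / sd for v in row] for row in squared]


filtered = [[0.0, 1.0, 2.0],
            [1.0, 1.0, 1.0],
            [2.0, 1.0, 0.0]]  # toy filter output with mean 1
salience = unsigned_standardised(filtered)
print(salience[0][0] == salience[0][2], salience[1][1])
```

In the toy map the darkest and the brightest corner receive the same salience, while pixels at the mean receive zero, which is the intended effect of squaring the deviation.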

Chromaticity was modelled with a crude approximation of the MacLeod-Boynton color space (MacLeod & Boynton, 1979). In this space, chromaticity is represented by two channels, the difference between L and M receptors, and the difference between the S and a combined L and M channel. We approximated the channels by using the RGB channels. For L we used the red channel, for M the green channel and for S the blue channel. This way, large differences in the MacLeod-Boynton color space are also large differences in our color space. The chromaticity filters therefore measure difference from average color, irrespective of the actual colors.

To derive the chromaticity, we used a slightly different approach from the other features. First, the image was split into its red, blue and green channels. Each channel was then log transformed, extended and convolved separately. Next, the green channel was subtracted from the red and the sum of the green and red channels from the blue channel (subtraction, because the image was log-transformed). The next step was to subtract the mean and square the output of each map. The final map was then produced by combining the two feature maps by using the maximum value for each pixel. Finally, the output was normalized by dividing by the standard deviation. Figure 4 shows examples of each feature map at three different spatial scales.

Measuring the difference between fixated and non-fixated locations

After the construction of the feature maps, the image statistics at the fixation points had to be extracted. The statistics were extracted by centering a box of 1° around the fixation centre and by taking the maximum value of the feature in the patch. Fixations outside the stimulus were not processed.
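The extraction step can be sketched as below. The size of 1° in pixels depends on the viewing geometry, which is not given here, so `box_px` is a hypothetical parameter; clipping the box at the image border is also an assumption.

```python
def patch_maximum(feature_map, fx, fy, box_px):
    """Largest feature value in a square box of side `box_px` centred on the
    fixation (fx, fy). Returns None when the fixation lies outside the map."""
    height, width = len(feature_map), len(feature_map[0])
    if not (0 <= fx < width and 0 <= fy < height):
        return None
    half = box_px // 2
    rows = range(max(0, fy - half), min(height, fy + half + 1))
    cols = range(max(0, fx - half), min(width, fx + half + 1))
    return max(feature_map[r][c] for r in rows for c in cols)


toy_map = [[r * 10 + c for c in range(10)] for r in range(10)]  # value = 10*row + col
print(patch_maximum(toy_map, fx=4, fy=5, box_px=3))   # 65
print(patch_maximum(toy_map, fx=0, fy=0, box_px=3))   # 11, box clipped at the border
print(patch_maximum(toy_map, fx=12, fy=5, box_px=3))  # None, fixation off the image
```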

To make a meaningful comparison between fixated and non-fixated locations in the stimuli, it was necessary to take into account multiple peculiarities of the collected data. First, because of the large amount of data collected, usual hypothesis testing methods can lead to high statistical significance for irrelevant effects. Therefore, a statistical method is needed that takes into account the magnitude of the effect and stays independent of the number of data points used in its calculation. Second, many of the derived feature statistics and their variance are non-normally distributed (Baddeley, 1996). Parametric statistics are thus not suitable for the analysis of the data. The chosen statistical method should not be affected by such aspects of the data, preferably by being non-parametric. Finally, the method should not be affected by two other idiosyncrasies stemming from the stimulus material and the observer. First, most photographs have a higher accumulation of salience in the centre of the image. This is caused by, e.g., the sky being in the upper part of the picture or the photographer's tendency to put objects of interest in the centre of the picture. Second, there is also a reliable oculomotor bias towards making the first couple of fixations in the centre of the image, regardless of salience (Tatler & Vincent, 2009; Tseng, Carmi, Cameron, Munoz, & Itti, 2009). If not accounted for, the central fixation bias, together with the heightened salience values in the centre of the image, leads to an over-estimation of the predictive value of the salience statistics.

Given the first two considerations, the chosen statistic is the area under the receiver operating characteristic (ROC) curve (Metz, 1978). The ROC curve describes the performance of a binary classifier system given a varying threshold. The curve is created by plotting the true-positive rate against the false-positive rate for a range of thresholds. In our case, a false-positive means labelling a non-fixated location as fixated and a true-positive means labelling a fixated location as fixated. The result is a curve showing the performance of the classifier at different threshold settings. To summarize the performance of the classifier, the area under the curve is calculated. When two distributions are indistinguishable by the classifier, the ROC area will be 0.5. When the distributions can be perfectly separated by the classifier, the area will be 1.0, and when the classifier is predicting worse than chance, the area is smaller than 0.5. The ROC area is independent of the underlying distribution of the feature statistics and their variance and gives an interpretable magnitude of the effect size, making it a suitable choice for the task at hand. To make further statistical analysis possible, the 99% confidence intervals of the ROC area were calculated using the bootstrap technique (Efron & Tibshirani, 1994). This was done by sampling 1000 times with replacement from the original data set and by calculating the ROC areas from each of the samples. The distribution of the resulting values was then used to calculate the confidence intervals.
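The ROC area and its bootstrap interval can be sketched as below. The area is computed through its equivalent rank formulation (the probability that a fixated value exceeds a non-fixated one, with ties counted as one half). The percentile interval, the resampling of both groups independently and the simulated feature values are assumptions; the text does not specify these details.

```python
import random


def roc_area(fixated, control):
    """Probability that a fixated value exceeds a control value (ties count half).
    This equals the area under the ROC curve."""
    wins = 0.0
    for f in fixated:
        for c in control:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / (len(fixated) * len(control))


def bootstrap_ci(fixated, control, n_boot=1000, level=0.99, seed=0):
    """Percentile bootstrap interval for the ROC area."""
    rng = random.Random(seed)
    areas = []
    for _ in range(n_boot):
        f = [rng.choice(fixated) for _ in fixated]  # resample with replacement
        c = [rng.choice(control) for _ in control]
        areas.append(roc_area(f, c))
    areas.sort()
    lower = areas[int((1 - level) / 2 * n_boot)]
    upper = areas[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


data_rng = random.Random(1)
fixated = [data_rng.gauss(0.3, 1.0) for _ in range(40)]  # simulated feature values
control = [data_rng.gauss(0.0, 1.0) for _ in range(40)]
area = roc_area(fixated, control)
lower, upper = bootstrap_ci(fixated, control)
print(round(area, 3), round(lower, 3), round(upper, 3))
```

An interval that excludes 0.5 would indicate that the feature discriminates fixated from non-fixated locations at the chosen confidence level.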

To account for the central fixation and salience bias, fixations were compared with non-fixations chosen from a distribution of fixations occurring at the same time, but in other trials. This way, the non-fixations at the beginning of the trial were matched to other fixations roughly in the same central area, thus correcting for the higher salience values during the first fixations.
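The selection of control locations can be sketched as below. "The same time" is interpreted here as the same ordinal position of the fixation within a trial, which is an assumption, as are the dictionary layout and the function name.

```python
import random


def matched_controls(trials, trial_id, rng):
    """For every fixation in `trial_id`, pick the location of a fixation with the
    same ordinal position from a randomly chosen other trial."""
    controls = []
    for k in range(len(trials[trial_id])):
        candidates = [fixations[k] for tid, fixations in trials.items()
                      if tid != trial_id and len(fixations) > k]
        if candidates:  # skip positions that no other trial reached
            controls.append(rng.choice(candidates))
    return controls


# Toy fixation locations (x, y) per trial, in viewing order.
trials = {
    "scene_a": [(400, 300), (420, 310), (600, 150)],
    "scene_b": [(390, 305), (200, 400)],
    "scene_c": [(410, 295), (500, 500), (100, 100)],
}
controls = matched_controls(trials, "scene_a", random.Random(0))
print(controls)
```

The feature values at these control locations are then read from the feature map of the trial under analysis, so that early, central fixations are compared with equally central control locations.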

In order to validate our approach we have chosen to first replicate the part relevant to our research question from Tatler et al. (2005) with our adult participant data. Figure 5 shows the ROC area values derived by Tatler given his experimental set-up.

Figure 5. ROC area values for luminance, chromaticity, contrast and edge-content of fixated vs. non-fixated locations. Error bars indicate 99% confidence intervals. The y-axis is log scaled. Taken from Tatler et al. (2005).

Results

Drop out

One infant participant has been left out of all following analyses due to only two recorded fixations during the entire recording time, leaving us with a total of 42 infants for the analysis. All adult data was used for the experiment.

Approach 1: Image complexity as a predictor of total attended time

Just as in Scholte et al. (2009), beta and gamma showed a high correlation in our stimulus material, r(1113) = .74, p < .001, suggesting a successful replication of the calculations with our stimuli. The range of the beta and gamma values, however, is noticeably smaller than in the original experiment. Figure 3 shows a mapping of our stimuli into the beta-gamma space. The infants' summed fixations per picture averaged 2970 ms (SD = 877), suggesting that infants were overall attentive to the stimuli.

In order to test the hypothesis that higher complexity means longer looking times, we used a multilevel modelling approach. Our final model consisted of two levels. Level one was the picture level, level two was the subject level. We hypothesized that the average looking time per picture would be explained by complexity and the sequence of presentation of the picture, whereas the variance in the average looking times would be explained by the age of each participant. For such a model, it is necessary to free the intercept parameter. The multilevel model took the following form, with the subscript i indicating images and subscript j indicating each subject:

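\[
\begin{aligned}
\text{Level 1:} \quad & \text{looking time}_{ij} = \beta_{0j} + \beta_{1}\,\text{complexity}_{i} + \beta_{2}\,\text{sequence}_{i} + e_{ij} \\
\text{Level 2:} \quad & \beta_{0j} = \gamma_{00} + \gamma_{01}\,\text{age}_{j} + u_{0j}
\end{aligned}
\tag{4}
\]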

Table 1 shows the results of the stepwise model procedure we used to validate our model. In total, we fitted five models. Model 1 is the baseline model, which does not contain any explanatory variables and consists only of a random intercept. Model 2 adds complexity as a level one predictor. Model 3 adds sequence of presentation as a level one predictor. Model 4 adds age as a level two predictor. Model 5 additionally contains a quadratic complexity term. All of the models have a random intercept.

Table 1: Dependent variable: looking time

                        (1)           (2)           (3)           (4)           (5)
complexity                         −110.66∗∗∗    −106.48∗∗∗    −106.78∗∗∗    −442.05∗∗
                                    (36.07)       (35.68)       (35.68)      (222.62)
complexity²                                                                   115.55
                                                                              (75.60)
sequence                                           −6.28∗∗∗      −6.29∗∗∗      −6.30∗∗∗
                                                    (1.29)        (1.29)        (1.29)
age                                                              30.89∗∗∗      31.22∗∗∗
                                                                 (10.67)       (10.62)
Constant              1,476.54∗∗∗   1,639.18∗∗∗   1,723.34∗∗∗   1,462.93∗∗∗   1,693.62∗∗∗
                        (42.71)       (68.09)       (69.85)      (112.52)      (191.20)

Observations            1,115         1,115         1,115         1,115         1,115
Log Likelihood        −8,175.50     −8,170.82     −8,159.10     −8,155.27     −8,153.92
Akaike Inf. Crit.     16,357.01     16,349.64     16,328.20     16,322.54     16,325.84
Bayesian Inf. Crit.   16,372.06     16,369.70     16,353.28     16,352.64     16,370.99

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01. Standard errors in parentheses.

The intraclass correlation of ρ = 0.37 indicated that 37% of the variance in looking time is at the subject level. This is moderately high and justifies using a random intercept model. In the second model, complexity is added as an explanatory variable and leads to a better fit of the model in comparison with the baseline model, χ2(1) = 9.37, p < .01. Adding sequence into the model improves the fit again in relation to the previous model, χ2(1) = 23.44, p < .001. Adding the age variable at level two improves the fit of the model again, χ2(1) = 7.65, p < .01. Finally, adding the polynomial term does not lead to an improvement in model fit, χ2(3) = 2.7, p = .44. These tests, together with the other fit criteria in Table 1 (AIC and BIC), imply that the linear model structure and chosen variables are a better choice to explain looking time in the given context.
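The model comparisons for models 1 to 4 can be retraced from the log-likelihoods in Table 1 with a likelihood ratio test. The sketch below assumes that each step adds exactly one parameter (one degree of freedom), for which the chi-square tail probability has a closed form; the function name is hypothetical. Small deviations from the reported statistics stem from the rounding of the tabulated log-likelihoods.

```python
import math


def likelihood_ratio_test(loglik_small, loglik_large):
    """Chi-square statistic and p-value (one degree of freedom) for two nested models."""
    stat = 2.0 * (loglik_large - loglik_small)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival function, df = 1
    return stat, p_value


# Log-likelihoods of models (1) to (4) as reported in Table 1.
loglik = [-8175.50, -8170.82, -8159.10, -8155.27]
results = [likelihood_ratio_test(small, large)
           for small, large in zip(loglik, loglik[1:])]
for stat, p in results:
    print(f"chi2(1) = {stat:.2f}, p = {p:.4f}")
```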

Considering the parameter estimates, the effects of age and sequence are in the expected direction. Thus, the later a scene is shown in the experiment, the less time children spent looking at it. Also, when children get older, they tend to spend more time looking at the pictures. What is surprising, however, is that increasing complexity leads to a

the pattern is roughly the same: The ROC areas are generally lower at the low spatial frequencies and rise with higher spatial frequencies until a maximum is reached. Thereafter a slight decline can be seen. When looking more closely, however, there are numerous differences. First, Tatler found that at the lowest spatial frequencies, prediction is below chance level, whereas in our case the areas indicate prediction above chance level. Second, there are differences between the optimal cpd for maximal predictability for contrast, edge-content and luminance. Given our data, the highest ROC area values for contrast, edge-content and luminance are at 2.2, 0.85, and 10.8 cpd, respectively, whereas in Tatler the peaks for contrast and edge-content are at 5.2 cpd and the peaks for luminance and chromaticity are at 10.8 cpd. Third, luminance shows a pattern of decline from the lowest spatial frequencies towards 1.5 cpd, where it falls under chance level. Thereafter, the ROC areas rise back to above chance level prediction and peak at 10.8 cpd. This sudden decline pattern does not appear in Tatler's results. The fourth difference is the dominance of chromaticity as a predictor of fixation allocation in our experimental set-up. Tatler finds a clear pattern of contrast and edge-content as being the most predictive image features overall, whereas chromaticity and luminance come third and fourth. The ranking of these image features in our data set is different, with chromaticity as first and contrast, edge-content and luminance in declining order. Finally, maximal ROC areas for each image feature are overall lower than what has been found by Tatler. We found a maximum discriminability of 55% and 54% for chromaticity and contrast, and for edge-content and luminance we found 53% and 54%, respectively. This is lower than the values of 63% for contrast and edge-content and 57% for luminance and chromaticity which have been documented in Tatler's work. All in all, our findings suggest that the replication of Tatler's findings was only partially successful and questions remain as to the reasons for these differences.

To assess the spatial scale at which infants and adults select image features for fixations, it was necessary to look at the predictive performance of each image feature at the 13 different spatial scales for both of them. ROC area values above 0.5 suggest that the image feature is selected for fixation guidance at a given spatial scale. Figure 7 shows the ROC area values for all image features and spatial scales derived for all infants. Just as with adults (Fig. 6), the ROC areas for the different image features peak at not just one, but different spatial scales. However, given the large confidence intervals, it is not possible to exactly determine one single spatial scale as being dominant for both infants and adults. Overall, the higher spatial scales do show better predictive performance than the lower ones, with a possible threshold at 2.2 cpd. At and above this threshold, the image features in general show a higher performance than below the threshold. As to discriminability, all image features at and above the threshold show above chance performance. Still, the effect sizes are not very large for either adults or infants. For infants, effect sizes above the threshold are in a 52% to 56% range, whereas for adults, the effect size ranges from 51% to 55%. These effects are not very strong.

Figure 6. Replication of Tatler et al. (2005) with adult data. ROC area values (y-axis) for luminance, chromaticity, contrast and edge-content feature values in fixated compared to non-fixated locations at 13 different spatial scales (x-axis, in cpd). A ROC area value of 0.5 indicates no discriminability. The error bars indicate 99% confidence intervals.

For subsequent testing it was necessary to select the spatial scale with the highest overall predictability. Because of the wide confidence intervals, it was not possible to determine which of the three highest spatial scales is best. Therefore, we chose the scale that proved to be the most predictive in Tatler's work, namely 5.4 cpd, to make comparisons between our findings and the original possible.

To test if infants develop towards more image feature driven looking behaviour, we split the infant data into two parts at the median age. This way we created two almost equally sized data sets. The expectation was that for the older infants image features would be more predictive of fixation location than for the younger infants. In order to test this hypothesis, we compared the ROC areas for each image feature with one-sided hypothesis tests, using DeLong's method as implemented in the pROC package for R. We assumed significance at a Bonferroni corrected α ≤ .0025. In line with our hypothesis, we found a difference in contrast (p = .0003), edge-content (p = .023) and luminance (p = .023). No difference in chromaticity was found. It thus appears that children aged 8 to 15 months rely more on image features for attention allocation than children aged 3 to 8 months.

Figure 7. Replication of Tatler et al. (2005) with infant data. ROC area values (y-axis) for luminance, chromaticity, contrast and edge-content feature values in fixated compared to non-fixated locations at 13 different spatial scales (x-axis, in cpd). A ROC area value of 0.5 indicates no discriminability. The error bars indicate 99% confidence intervals.

Discussion

Approach 1: Image complexity as a predictor of total attended time

In this study, we investigated whether infants seek out an optimum of complexity when viewing natural scenes. Our results indicate that infants in the 3 to 15 month age range spend less time attending to complex stimuli than to simple stimuli. Age and image sequence were found to influence looking time. The older infants spent more time attending than the younger infants. The later a scene was presented on the screen, the less attention the infants spent on it.

Our results do not support our hypothesis that infants in the 3 to 15 month age range seek out an optimal amount of complexity as encoded by the parameters of the Weibull distribution. Instead, our findings suggest that infants attend more to images of lower complexity than to images of higher complexity. The results, however, have to be treated with some reservation because of the restricted range of complexity in our stimulus material.

Figure 8. Comparison of ROC area values (y-axis) for the image features chromaticity, contrast, edge-content and luminance at 5.4 cpd for adults, older infants in the 8 to 15 month range and younger infants in the 3 to 8 month range (x-axis: group). The error bars depict 99% confidence intervals.

Despite using the original code and a second check by the authors of the original calculations, our stimuli surprisingly only cover one tenth of the range of the beta parameter and half of the range of the gamma parameter found in the original study by Scholte et al. (2009). Such a restriction often leads to a suppression of the strength of linear effects. However, in this case it could be that we only observe the declining part of the proposed parabola. Repeating the approach with a wider range of image complexity could lead to the detection of the proposed relationship. To investigate this issue, a new study is required with a set of stimuli that cover a more complete range of complexity.
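The point about observing only the declining part of a parabola can be illustrated with a small Python sketch. It assumes a purely hypothetical inverted-U relation between complexity and attention, with complexity scaled to the interval 0 to 1 and a peak at 0.5; none of these values come from our data. Fitting a straight line to the full range gives a slope of zero, whereas a restricted range to the right of the peak gives a clearly negative slope.

```python
def inverted_u(x, peak=0.5):
    """Hypothetical attention score: highest at `peak`, lower on both sides."""
    return 1.0 - (x - peak) ** 2


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


# Full complexity range versus a restricted range right of the peak.
full_range = linspace(0.0, 1.0, 101)
restricted_range = linspace(0.7, 0.8, 101)

slope_full = ols_slope(full_range, [inverted_u(x) for x in full_range])
slope_restricted = ols_slope(restricted_range, [inverted_u(x) for x in restricted_range])

print(f"slope over full range:       {slope_full:.3f}")
print(f"slope over restricted range: {slope_restricted:.3f}")
```

Under these assumptions, a negative linear effect in a narrow stimulus range is compatible with an underlying optimum that lies outside the sampled range.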

On the theoretical level, the question whether beta and gamma are a realistic influence on the deployment of attention needs to be asked. Answering this question is not easy however, as image complexity is notoriously difficult to define (Donderi, 2006). Computational approaches to define visual complexity are based on Solomonoff's (1985) algorithmic information theory (AIT). Beta and gamma are closely related to these concepts. According to AIT, the complexity of information corresponds to the length of the bit-string that is needed to encode the information. Obviously, there is no conscious translation of images into bit-strings, but on the neuronal-level such translation could be plausible. How the higher neuronal information transfer translates to seeking out an optimum of attention however needs further investigation. Another difficulty is that AIT based complexity can be seen as counter-intuitive to subjective notions of complexity. In case of an image filled completely with white noise, the complexity is considered high. Complexity is very low on the other hand, when the image is entirely uniform or empty. Images of white noise thus need a large amount of information to be encoded, whereas uniform images need very little. Donderi (2006) points out that random images may appear less complex than non-random images and that some images generated by simple rules can appear chaotic. The code needed to generate a pattern does therefore not necessarily determine the amount of processing required by the visual system. In future, a more detailed and thorough theory of [...] heighten the credibility of this image feature as an influence on global attention.

Approach 2: Salience as a predictor of fixation location

In our second approach, we investigated the strength and the development of the influence of local image features on infant attention allocation in natural scenes. Overall, all image features were predictive above chance level for adults and for infants in both the 3 to 8 and the 8 to 15 month age range. Infants in the 8 to 15 month age range were found to rely more on feature driven attention allocation overall than infants in the 3 to 8 month range and adults. We also succeeded in replicating Tatler's (2005) results in both our adult and pooled infant dataset, albeit with minor differences. Chromaticity instead of contrast appeared to be the most predictive image feature for adults and infants in our replication attempt. Contrast, edge-content and luminance, on the other hand, maintained the same ordering as in Tatler. Also, the predictive strength of the image statistics was generally lower.

Our findings about bottom-up driven viewing are mostly in line with earlier research by Amso et al. (2014) and Althaus and Mareschal (2012). However, our findings extend current knowledge by showing that infants in the 8 to 15 month age range tend to be guided more by the tested image features than adults and infants in the 3 to 8 month age range.

The results can be taken as cautious support for the theory that the perceptual capabilities for bottom-up driven eye-guidance are still developing in the 3 to 8 month range and that this phase is followed by a phase in which attention is driven more by image features than in adults. A reason for this could be that adults rely more on top-down influences to guide their viewing than older infants. Younger infants, on the other hand, could be less sensitive, less coordinated or slower in reacting to bottom-up image features.

Additionally, it will be necessary to investigate why, in the results of our replication, chromaticity instead of contrast was found to be most predictive of fixated locations in adults and in the pooled infant data. The reason for this difference could be that our image material differs from that used in the original experiment. It is possible that our image material contained more highly salient locations distinguished by colour than the scenes used by Tatler. However, we did not investigate this possibility further at the time of writing this report. At this moment, it is not common to report any statistics concerning the distribution and intensity of image statistics in the stimulus material, nor is there a standardized set of images for saliency research purposes. To rule out the possibility of a skewed distribution of image statistics in the future, the average statistical intensity values per pixel could be reported and compared to prior research, and the possibility of an influence of the stimulus material on the predictive strength of each image statistic should be explored.
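As an illustration of what such reporting could look like, the following Python sketch computes two simple per-image summary values. It assumes a grayscale image given as rows of luminance values between 0 and 1, and it uses the mean pixel value and the RMS contrast (standard deviation of pixel values) as stand-ins. These definitions and the function names are assumptions made for this example and are not the feature definitions used in our analysis.

```python
from math import sqrt


def mean_luminance(image):
    """Average pixel value of a grayscale image given as a list of rows."""
    pixels = [value for row in image for value in row]
    return sum(pixels) / len(pixels)


def rms_contrast(image):
    """Root-mean-square contrast: standard deviation of the pixel values."""
    pixels = [value for row in image for value in row]
    mu = sum(pixels) / len(pixels)
    return sqrt(sum((value - mu) ** 2 for value in pixels) / len(pixels))


# Hypothetical 2 x 2 stimuli: one half dark and half bright, one uniform.
stimuli = {
    "split": [[0.0, 0.0], [1.0, 1.0]],
    "uniform": [[0.3, 0.3], [0.3, 0.3]],
}

# One line per stimulus, suitable for a supplementary table.
for name, image in stimuli.items():
    print(name, round(mean_luminance(image), 3), round(rms_contrast(image), 3))
```

Reporting such values for every stimulus would allow a direct comparison of image sets across studies.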

Another surprising finding is the overall lower predictive strength in our adult replication. One reason for the differences in predictive strength could be the longer presentation of each scene in the original experiment. However, Tatler shows that the strength of the influence of the studied image features should not change over time. Another reason could be that the participants were instructed to memorize the scenes in the original, whereas in our experiment the participants were allowed to explore each scene freely. To speculate briefly, it could be that a memory task activates more top-down strategies, which could also rely on fixating highly salient locations.

Finally, it must be noted that, albeit present, the effect of image features on attention allocation appears relatively small, ranging from 3% to 6% above chance. This implies that there must be other influences on local attention allocation for adults and infants alike. In the future, the influence of more image features, such as intensity, orientation and motion, and of interactions between different image features needs to be investigated to create a more complete overview of the factors influencing infant bottom-up guided attention.

By charting out infant bottom-up attention, new insights and possibilities for the study of the visual system will arise. Disentangling bottom-up and top-down influences in adults has been notoriously difficult. If our theory about a phase of heightened bottom-up influences and lower top-down influences in infancy is true, studying infants in the given age range could allow bottom-up visual attention to be studied in greater isolation.

The presented findings are a first step towards understanding more about bottom-up driven infant visual attention in natural scenes, and much remains to be done in this area to catch up with adult research. However, what we know now is that contrast, edge-content, chromaticity and luminance are predictive of where infants allocate their attention in natural scenes and that the influence of these image features varies with age.

References

Althaus, N., & Mareschal, D. (2012). Using saliency maps to separate competing processes in infant visual cognition. Child development , 83 (4), 1122–1128.

Amso, D., Haas, S., Tenenbaum, E., Markant, J., & Sheinkopf, S. J. (2014). Bottom-up attention orienting in young children with autism. Journal of autism and developmental disorders, 44 (3), 664–673.

Baddeley, R. (1996). Searching for filters with 'interesting' output distributions: an uninteresting direction to explore? Network: Computation in Neural Systems, 7 (2), 409–421.

Bornstein, M. H., Mash, C., & Arterberry, M. E. (2011). Young infants' eye movements over natural scenes and experimental scenes. Infant Behavior and Development, 34 (1), 206–210.

Bronson, G. W. (1994). Infants’ Transitions toward Adult-like Scanning. Child Development , 65 (5), 1243–1261.

Daw, N. W. (2006). Visual development (Vol. 9). Springer.

Donderi, D. C. (2006). Visual complexity: a review. Psychological bulletin, 132 (1), 73.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

Engel, S., Zhang, X., & Wandell, B. (1997, July). Colour tuning in human visual cortex measured with functional magnetic resonance imaging. Nature, 388 (6637), 68–71.

Giesbrecht, B., Woldorff, M. G., Song, A. W., & Mangun, G. R. (2003, July). Neural mechanisms of top-down control during spatial and feature attention. NeuroImage, 19 (3), 496–512.

Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual [...]

[...] of equal luminance. JOSA, 69 (8), 1183–1186.

Maunsell, J. H. R., & Treue, S. (2006, June). Feature-based attention in visual cortex. Trends in Neurosciences, 29 (6), 317–322.

Metz, C. E. (1978, October). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8 (4), 283–298.

Mould, M. S., Foster, D. H., Amano, K., & Oakley, J. P. (2012). A simple nonparametric method for classifying eye fixations. Vision research, 57 , 18–25.

Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network: Computation in Neural Systems, 10 (4), 341–350.

Scholte, H. S., Ghebreab, S., Waldorp, L., Smeulders, A. W. M., & Lamme, V. A. F. (2009, April). Brain responses strongly correlate with Weibull image statistics when processing natural images. Journal of Vision, 9 (4), 29.

Shannon, C. (1948, July). A mathematical theory of communication. The Bell System Technical Journal, 27 (3), 379–423.

Solomonoff, R. J. (1985). The Application of Algorithmic Probability to Problems in Artificial Intelligence. In UAI (pp. 473–494).

Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. (2005). Visual correlates of fixation selection: Effects of scale and time. Vision research, 45 (5), 643–659.

Tatler, B. W., & Vincent, B. T. (2009). The prominence of behavioural biases in eye guidance. Visual Cognition, 17 (6-7), 1029–1054.

Tseng, P.-H., Carmi, R., Cameron, I. G., Munoz, D. P., & Itti, L. (2009). Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of vision, 9 (7), 4.

Valeton, J. M., & van Norren, D. (1983). Light adaptation of primate cones: an analysis based on extracellular data. Vision Research, 23 (12), 1539–1547.
