
MSc Artificial Intelligence

Master Thesis

Refining Grey Pixel Selection for

Improving Colour Constancy

by

Gaurav M. Kudva

12205583

August 25, 2020

48 EC November 2019 - August 2020

Supervisor:

Dr. Sezer Karaoglu

Examiner:

Dr. Theo Gevers

Assessor:

MSc. Partha Das


Acknowledgements

For this thesis project, I would like to thank the many people who have played pivotal roles in helping me complete the project as a whole. First of all, I would like to thank my supervisor Dr. Sezer Karaoglu (3DUniversum) and my examiner Prof. Dr. Theo Gevers (3DUniversum, IvI) for their constant guidance and support and for giving me the opportunity to pursue such a challenging and interesting topic in the field of Computer Vision.

Next, I would like to thank MSc. Partha Das (3DUniversum, IvI), Dr. Yang Liu (3DUniversum), and MSc. Masoumeh Bakhtiariziabari (3DUniversum) for their valuable insights, code assist, and collaboration.

I would like to thank my colleagues at 3DUniversum for fostering a healthy working environment and for making the entire thesis journey fun and less stressful.

I would also like to thank SurfSara for providing access to efficient compute resources in order to carry out my thesis experiments, which made the work a smooth process and less tedious.

Last but not least, I would like to extend my thanks to my parents, who kept me focused and motivated during the entire course of the project. Finally, I would like to thank the almighty God, without whose blessings it would not have been possible for me to find the necessary inspiration and determination to complete this project.


Abstract

Colour Constancy (CC) is the task of maintaining the relative constancy of the perceived colour of objects under varying illumination conditions. Although a multitude of algorithms have been proposed for this task, the majority target the single-illuminant setting. The global illumination assumption and the availability of proper datasets are key factors behind its popularity. The multi-illuminant setting, on the other hand, is less frequently targeted and more challenging, as it lacks both of these factors. In recent years, a new line of approaches targeted at both the single and multi-illuminant settings focuses on the task of grey pixel detection and shows its potential by obtaining comparable or state-of-the-art results on benchmark datasets.

The main drawback of these approaches is the selection of inherently coloured surfaces as grey surfaces. By targeting the task of grey pixel detection, this report tackles the task of CC with a greater emphasis on the refinement of grey pixel selection. As a solution to the above drawback, this report proposes the usage of two existing illumination invariant measures as constraints for the refinement of grey pixel selection. In light of the advancements made by deep learning in Computer Vision in recent years, this report also presents a deep-learning framework for the further refinement of grey pixel detection that utilizes the above proposed constraints. Developed from a deployment perspective, the framework is lightweight, efficient, and incorporates pixel adaptive convolutions for the easy integration of more such constraints. Results from experiments show that the framework is able to maintain a robust selection of grey pixels under varying illumination conditions, thus overcoming the drawback of coloured surface inclusion to a large extent. Furthermore, the framework obtains results comparable to the state-of-the-art on the MIMO dataset [9] while it outperforms the same on the extended CubePlus dataset [20].


Contents

1 Introduction
  1.1 Research Question(s)
2 Literature Review
  2.1 Image Formation Process
    2.1.1 Dichromatic Reflection Model (DRM)
  2.2 Colour Constancy (CC)
    2.2.1 Fundamental approach towards colour constancy
    2.2.2 Colour Correction Process
  2.3 Static Colour Constancy Methods
    2.3.1 Methods based on low-level statistics
    2.3.2 Physics-based methods
  2.4 Gamut-Based Colour Constancy Methods
  2.5 Learning-Based Colour Constancy Methods
    2.5.1 Machine-Learning Based
    2.5.2 Deep-Learning Based
  2.6 Illumination Invariance
3 Methodology
  3.1 Overview
  3.2 Fully Convolutional Architecture
  3.3 A shallow architecture for model compactness and faster inference
    3.3.1 Application in this context
  3.4 Pixel adaptive convolutions for the effective integration of refinement constraints
    3.4.1 Application in this context
    3.4.2 Greyness Index (GI) as an adaptive feature
    3.4.3 Cross Colour Ratio L2 norm as an adaptive feature
  3.5 Saturation as the ground-truth
4 Experimental Setup
  4.1 Training / Validation Dataset
  4.2 Hyper-parameter Configuration
5 Experimentation and Analysis
  5.1 Grey Surface Detection Test
    5.1.1 Motivation
    5.1.2 Evaluation Dataset
    5.1.3 Results
  5.2 Multi-Illuminant Multi-Object (MIMO) Dataset
    5.2.1 Quantitative Comparison
    5.2.2 Qualitative Analysis
  5.3 2nd International Illumination Estimation Challenge
    5.3.2 Qualitative Analysis
    5.3.3 Error Analysis
6 Addressing the Research Question(s)


1 Introduction

Colour Constancy (CC) is defined as maintaining the relative constancy of the perceived colour of objects under varying illumination conditions. The human visual system has been able to achieve this to a great extent, but the same cannot be said when it comes to machines. Figure 1 illustrates how the visual sensation of an object can vary depending on its inherent characteristics and the illumination conditions of its environment. The problem of CC arises in various tasks like semantic segmentation, object tracking, etc., where varying illumination conditions can significantly affect the accuracy of intended approaches. For example, [2] shows that strong colour casts can negatively impact the performance of deep neural networks intended for image segmentation and classification.

Figure 1: The cross-piece illusion example illustrating the need for CC (left: varying illumination conditions, right: original colour). The images have been taken from [51]. In the first half of the left image, the cross-piece looks bluish-grey while it looks yellow in the other half. In reality, the cross-piece has precisely the same colour in both cases, as shown by the image on the right.

A large number of CC algorithms have been proposed over the last two decades. These approaches can be classified into two main categories: static and learning-based. Static or learning-free methods rely only on the reflectance distribution of the image in question. They further utilize certain assumptions or constraints derived from physical light reflection models to refine their performance. Learning-based methods rely on machine-learning models or neural networks that are trained on related data. Their advantage over static methods is that they are able to capture more detailed structural information of surface reflectance and further improve CC, but most of them are dataset-oriented and normally depend on an extensive training phase.

In recent years, a new line of static approaches has been proposed for the single and multi-illuminant settings that targets the task of grey pixel detection. Owing to their equal reflectance in the red, green, and blue colour spectra, grey pixels reflect the colour of the illumination under which they are captured. Therefore, if detected accurately, grey pixels can serve as essential cues towards estimating the illumination colour. With necessary constraints derived


from physical light reflection models, works such as [66], [54], and [55] formulate a greyness index (GI) that quantitatively indicates the degree of greyness of a particular pixel. By obtaining comparable or state-of-the-art results on benchmark datasets, they show the potential that such constraints can have towards the refinement of grey pixel detection, which further facilitates CC. However, the constraints used in these works can only improve grey pixel detection to a certain extent as they suffer from certain limitations. One of the main limitations is that they mistake inherently coloured surfaces as grey surfaces. This highlights the need for the inclusion of more robust constraints.

By targeting grey pixel detection, this report tackles the task of CC with a greater emphasis on the refinement of grey pixel selection. It has the following contributions: (1) In order to overcome the drawback of the selection of inherently coloured surfaces, this report proposes the usage of two existing illumination invariant measures as constraints for the refinement of grey pixel selection. (2) This report also presents a deep-learning framework for the further refinement of grey pixel detection that utilizes the above proposed constraints. Developed from a deployment perspective, the framework is lightweight, efficient, and incorporates pixel adaptive convolutions [61] for the easy integration of more such constraints.

Results from experiments show that the framework is more robust under varying illumination conditions as compared to the state-of-the-art grey pixel detection methods. Moreover, it is able to overcome the drawback of coloured surface inclusion to a large extent. Furthermore, the framework obtains results comparable to the state-of-the-art on the MIMO dataset [9] while it outperforms the same on the extended CubePlus dataset [20].

1.1 Research Question(s)

The constraints employed in recent grey pixel detection methods such as [66], [54], and [55] suffer from the limitation that they permit the selection of pixels from inherently coloured surfaces as grey pixels. In that regard, this report aims to answer the following questions:

1. What other constraints can be formulated for the further refinement of grey pixel selection?

2. Can a learning-based approach further refine the selection results through the integration of such constraints in the training/inference pipeline?

2 Literature Review

2.1 Image Formation Process

Depending on an object’s characteristics, when light falls on it, a part of it is absorbed by the object, a part of it is transmitted through the object, and the rest is reflected by the object’s surface. The reflected light is what


our eyes as well as digital cameras capture and perceive.

The process of capturing a colour image of an object by a digital camera is mainly dependent on the spectral power distribution of the illuminant, the surface reflectance of the object, and the spectral sensitivity of the camera’s sensors. Mathematically, this is formulated as

I(x, y) = m(\vec{n}, \vec{s}) \int_{\lambda \in \omega} \rho(x, y, \lambda)\, e(x, y, \lambda)\, f(\lambda)\, d\lambda \qquad (1)

where \vec{n} (surface normal of the object) and \vec{s} (direction of incident light) correspond to the object and viewing geometry, \omega refers to the visible spectrum of wavelengths, and \lambda corresponds to the wavelength of the incoming light. \rho represents the surface albedo of the object, e corresponds to the spectral energy density of the illuminant, and f refers to the spectral sensitivity of the camera’s sensor.

Based on the physical interaction of light with an object’s surface, many models have been proposed that mathematically attempt to explain the process of light reflection. The most notable are the bi-directional reflectance model [17], Phong’s reflectance model [53], and the dichromatic reflection model [57]. However, this report only explains the dichromatic reflection model in detail as it forms the basis of many CC approaches.

2.1.1 Dichromatic Reflection Model (DRM)

The standard DRM [57] was proposed to explain the reflection of light at the surface of an inhomogeneous dielectric object. It describes the reflected light as a linear combination of two components, namely body (diffuse) reflection and interface (specular) reflection.

Consider an infinitesimal surface patch of an inhomogeneous dielectric object illuminated by a spectral power distribution of incident light denoted by e(\lambda). To obtain the corresponding image using the red, green, and blue sensors with spectral sensitivities given by f_R(\lambda), f_G(\lambda), and f_B(\lambda) respectively, the measured sensor values are given by

I_C = m_b(\vec{n}, \vec{s}) \int_{\lambda} f_C(\lambda) e(\lambda) c_b(\lambda)\, d\lambda + m_s(\vec{n}, \vec{s}, \vec{v}) \int_{\lambda} f_C(\lambda) e(\lambda) c_s(\lambda)\, d\lambda \qquad (2)

where C = \{R, G, B\} corresponds to the response of the respective sensor. c_b(\lambda) (surface albedo) and c_s(\lambda) (Fresnel reflectance) are known as the composition terms that only depend on the wavelength and are independent of geometry. They are responsible for the perceived colour of the respective component that they represent. \lambda denotes the wavelength, \vec{n} is the surface patch normal, \vec{s} is the illuminant source direction, and \vec{v} is the direction of the viewer. Magnitude terms m_b and m_s correspond to the geometric dependencies of the body and surface reflection components, respectively. They only depend on the geometry and are responsible for the perceived intensity of the respective component that they represent.


The DRM assumes that the reflected light results from two independent reflection processes, and that each has a characteristic colour whose intensity, but not spectral distribution, varies with the viewing and illumination directions. It also incorporates the neutral interface reflection (NIR) assumption, which states that the Fresnel reflectance is constant over the visible spectrum range and that its colour corresponds to the colour of the light source. This changes equation 2 to

I_C = m_b(\vec{n}, \vec{s}) \int_{\lambda} f_C(\lambda) e(\lambda) c_b(\lambda)\, d\lambda + m_s(\vec{n}, \vec{s}, \vec{v})\, c_s \int_{\lambda} f_C(\lambda) e(\lambda)\, d\lambda \qquad (3)

2.2 Colour Constancy (CC)

CC is a perceptual phenomenon where the perceived colour of objects remains relatively constant regardless of the illumination conditions they are viewed under. It is also known as chromatic adaptation and, in photography jargon, as white-balancing. For instance, an orange appears more or less “orange” regardless of whether it is viewed under a fluorescent tube in the kitchen or under natural sunlight. Even though the orange reflects different light spectra under these different conditions, we are still able to perceive its inherent material characteristics.

Unlike size or mass, colour is not a physical quantity per se but rather a subjective aspect that is dependent on the visual system of the entity that perceives it. In other words, colour constancy is not a property of objects but an outcome of how the biological visual system works. Even though its exact functioning remains unknown, there are certain agreed mechanisms that involve comparing the reflected light from different locations across the scene. All of these suggest that colour constancy is a fundamentally contextual phenomenon. Figure 2 illustrates such an instance.

Figure 2: Checker-shadow illusion [1] (left: checker board, right: actual colour). Tile A appears to have a darker shade of grey than tile B when both actually have the same shade of grey.

Spatial contrast is one such mechanism that has been broadly proposed under the Retinex theory [49]. Retinex is a portmanteau term that represents the


collective functioning of the retina and the cerebral cortex. The Retinex compares the red light reflected from a particular surface with the spatial average of red light reflected from the surrounding surfaces. It does the same for the green and blue wavelengths. These three relative reflectance components are responsible for the colour that is perceived.

Even though the human visual system has been able to achieve colour constancy to a satisfactory extent, it is not perfect. If that were the case, then we would completely discard changes in the illumination colour and perceive objects as if they were illuminated by white light. Yet, we still notice that the colours of object surfaces slightly alter under different illumination conditions.

2.2.1 Fundamental approach towards colour constancy

The foundational approach towards CC consists of the following two steps:

1. Estimating the colour of the illuminant

2. Colour-correction using the estimated illuminant colour

The second step can be seen as an instantiation of chromatic adaptation [21] which formulates the problem as a linear transformation:

I = W ∗ e (4)

where I is the image captured under an unknown light source, W is the image captured under a reference light (canonical light source), and e is the illuminant colour. The aim is to correct the image such that only the chromaticity of the illumination conditions is accounted for while everything else remains unchanged [32]. This is an under-constrained problem as we only have knowledge about I: we neither know how W should appear nor have information pertaining to e. Thus, most CC approaches either perform the above two steps or directly estimate the white-balanced image.

2.2.2 Colour Correction Process

This subsection describes the general approach towards performing colour correction [50]. This is often modelled through a linear transformation, which is further simplified through a diagonal transformation when certain conditions are met [32]. Based on this approach, methods such as Von Kries, Bradford, and XYZ scaling are generally utilized, which mainly differ in the transformation matrices M_A used. Let

• (X_R, Y_R, Z_R) denote the XYZ values of a colour under the reference illuminant

• (X_T, Y_T, Z_T) denote the XYZ values of the same colour under the test illuminant

• (X_{WR}, Y_{WR}, Z_{WR}) denote the XYZ values of white colour under the reference illuminant

• (X_{WT}, Y_{WT}, Z_{WT}) denote the XYZ values of white colour under the test illuminant

The XYZ values of the reference and test white are first converted into corresponding values in the cone response domain (\rho, \gamma, \beta) by means of the transformation matrix M_A. This is mathematically formulated as

\begin{pmatrix} \rho_R \\ \gamma_R \\ \beta_R \end{pmatrix} = [M_A] \begin{pmatrix} X_{WR} \\ Y_{WR} \\ Z_{WR} \end{pmatrix} \qquad (5)

\begin{pmatrix} \rho_T \\ \gamma_T \\ \beta_T \end{pmatrix} = [M_A] \begin{pmatrix} X_{WT} \\ Y_{WT} \\ Z_{WT} \end{pmatrix} \qquad (7)

Subsequently, the corresponding (ρ, γ, β) values are used to determine the transformation matrix M which is mathematically formulated as

M = [M_A]^{-1} \begin{pmatrix} \rho_T/\rho_R & 0 & 0 \\ 0 & \gamma_T/\gamma_R & 0 \\ 0 & 0 & \beta_T/\beta_R \end{pmatrix} [M_A] \qquad (8)

which is finally used to perform the chromatic adaptation

\begin{pmatrix} X_R \\ Y_R \\ Z_R \end{pmatrix} = [M] \begin{pmatrix} X_T \\ Y_T \\ Z_T \end{pmatrix} \qquad (9)
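For illustration, below is a minimal NumPy sketch of the colour correction process transcribed from equations 5-9. The Bradford cone-response matrix is used here as one common choice of M_A (an assumption; the thesis does not fix a particular matrix at this point), and the function and variable names are hypothetical.

```python
import numpy as np

# Bradford cone-response matrix, one common choice of M_A (assumption).
M_A = np.array([[ 0.8951,  0.2664, -0.1614],
                [-0.7502,  1.7135,  0.0367],
                [ 0.0389, -0.0685,  1.0296]])

def chromatic_adaptation(xyz_test, white_ref, white_test, M_A=M_A):
    """Transcription of equations 5-9: map an XYZ colour measured under the
    test illuminant to its corresponding value under the reference illuminant."""
    rho_R, gamma_R, beta_R = M_A @ white_ref    # eq. 5: cone response of reference white
    rho_T, gamma_T, beta_T = M_A @ white_test   # eq. 7: cone response of test white
    D = np.diag([rho_T / rho_R, gamma_T / gamma_R, beta_T / beta_R])
    M = np.linalg.inv(M_A) @ D @ M_A            # eq. 8
    return M @ xyz_test                         # eq. 9
```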

2.3 Static Colour Constancy Methods

Static CC methods are those that only rely on the statistics of the image in question. They further utilize certain constraints or assumptions derived from physical light reflection models to refine their performance. As they are devoid of machine-learning models or neural networks, they are also referred to as learning-free methods.

2.3.1 Methods based on low-level statistics

Due to its simplicity and very low computational cost, the grey-world (GW) algorithm [13] is the most well-known method under this category. It is based on the assumption that, under a neutral light source, the average reflectance of a scene is achromatic. In other words, the average intensity of each colour channel of an image should have an equal value. Any deviation from achromaticity in the average scene colour is assumed to have been caused by the effects of the illuminant. However, this assumption does not hold on scenes involving large uniformly coloured patches. Overcoming this limitation to a certain extent, [29] shows an improvement in the performance of the GW algorithm when it is applied on the segmented components of an image rather than the entire image itself.

Another well-known method is the white-patch (WP) algorithm [49], which states that the maximum response in the RGB channels is caused by a perfect reflectance. Since a perfect reflectance fully reflects the range of light that falls on it, its colour is the same as that of the illuminant under which it is captured. Therefore, for a scene captured under a neutral light source, it assumes the colour of a perfect reflectance present in the scene to be white, and any non-white perfect reflectance must be due to the illuminant colour. In practice, this assumption is relaxed by considering the colour channels separately, resulting in the max-RGB algorithm, whose accuracy degrades when applied on noisy images. [34] and [19] have shown that smoothing the image before applying the WP algorithm leads to an improvement in performance.

Furthermore, [24] mathematically shows that the GW and WP algorithms are special instantiations of the Minkowski framework. It results in the GW algorithm when a norm of 1 is used and the WP algorithm when a norm of ∞ is used. Based on the assumption that the average reflectance of a scene is a shade of grey, they propose the Shades of Grey (SoG) algorithm with an intermediate norm of 6, which works better than both the GW and WP algorithms for a large calibrated dataset.
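As a rough illustration of the Minkowski framework described above (a sketch, not the reference implementation of [24]; the helper name is hypothetical):

```python
import numpy as np

def minkowski_illuminant(img, p=6):
    """Minkowski-norm illuminant estimate: p = 1 gives Grey-World,
    p -> infinity gives White-Patch / max-RGB, and p = 6 gives Shades of Grey."""
    flat = img.reshape(-1, 3).astype(np.float64)
    if np.isinf(p):
        e = flat.max(axis=0)                        # White-Patch behaviour
    else:
        e = np.mean(flat ** p, axis=0) ** (1.0 / p)
    return e / (np.linalg.norm(e) + 1e-12)          # normalised illuminant colour
```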

Unlike the earlier methods which are based on the distribution of colours in an image, [64] proposes a framework called grey-edge (GE) that utilizes higher-order image statistics such as first-order and second-order image derivatives. Based on the observation that the information conveyed might differ based on the edge type, [37] proposes the weighted GE framework by introducing physics-based weighting schemes that weight edges based on how important they can be.

2.3.2 Physics-based methods

Methods under this category utilize information from the physical interaction of light with the surfaces of objects. Along the lines of [30], many methods are based on the Lambertian reflection model (LRM) [46] while others exploit the DRM to constrain the illuminants, similar to [65].

According to the DRM, the colour of the specular reflection component is the same as that of the illuminant. In that regard, methods such as [63] make use of specular reflections or highlights, assuming that essential cues regarding the illumination colour can be obtained from pixels where the magnitude of the body reflectance is close to 0. However, they suffer from the difficulty of retrieving specular reflections.

[66] proposes a novel illuminant estimation approach for the single as well as the multi-illuminant scenario. They show that almost all natural images captured under a neutral light source contain some grey or approximately grey pixels. In order to quantitatively indicate how grey a pixel could be, they propose a luminance-based GI function that makes use of an illuminant-invariant


measure in the logarithm space. They reason that, for a particular pixel, the lower the value of GI, the closer it is towards resembling a grey pixel. Owing to its low computational cost, they highlight its efficiency and feasibility, as it can be applied to any image in a camera-agnostic manner. However, it suffers from certain limitations. It fails to perform well on images that do not contain a sufficient amount of detectable grey pixels. Moreover, being based on the LRM, the approach is limited in the number of constraints that it can impose for the refinement of grey pixel detection. As a result, it mistakes coloured pixels as grey pixels, which skews the illuminant estimation process.

Based on the observation that true-grey pixels are aligned towards a single direction in the colour space, [54] proposes a statistical approach similar to [66] but makes use of a more robust illuminant-invariant measure and mean-shift clustering to estimate the illumination colour. The work highlights three properties that a GI function should meet and employs the angular error function in that regard. They also show a mathematical correspondence between their GI function and that used in [66], where the latter is a product of the former and a luminance-dependent factor. Intended for the single-illuminant scenario, their approach fails on images captured under multiple illumination conditions. Similar to [66], it also performs poorly when there are no or only a limited number of detectable grey pixels.

Also based on grey pixel detection, [55] proposes a learning-free approach for the single as well as the multi-illuminant scenario. Derived from the DRM, they formulate a novel GI function that overcomes the limitations of those proposed in [66] and [54] to a certain extent. Once the grey pixels have been detected, the work employs the same single or multi-illuminant estimation approach as described in [66]. On standard single illuminant estimation benchmarks under the camera-agnostic setting, their approach outperforms prior state-of-the-art learning-free and learning-based methods. The work also reports state-of-the-art results on the standard multi-illuminant estimation benchmark (MIMO dataset [9]).

2.4 Gamut-Based Colour Constancy Methods

Gamut-based CC methods operate on the assumption that a real-world image captured under a particular illuminant can only contain a limited range of colours. Any deviation of colour from this range must be due to a change in the illuminant. The colour gamut of the illuminant in consideration is referred to as the canonical gamut (CG) and the illuminant is accordingly known as the canonical illuminant (CI). The CG is learned in a training phase by observing as many surfaces under the CI as possible.

[26] proposes a novel gamut-mapping algorithm to estimate the colour of an unknown illuminant under which a given image is captured. It assumes that the range of colours present in the input image is the gamut pertaining to the unknown illuminant. Then it learns a set of feasible mappings that transform the gamut of the unknown illuminant to a gamut that lies entirely within the CG. An estimator is subsequently used that selects one of these feasible mappings


to be applied on the CI to get an estimate of the unknown illuminant. The selection is based on the heuristic that the mapping which results in the most colourful scene (diagonal matrix with the largest trace) is to be selected. However, this approach suffers from the limitation that it can result in an empty set of feasible mappings if the transformation is not able to properly fit the gamut of the input image within the CG.

The above limitation is overcome by [7] and [8] by incrementally augmenting the input gamut until a non-empty set of feasible mappings is found. [6] systematically extends the CG by learning it not only from surfaces that are illuminated by the CI, but also from surfaces that are captured under different illuminants, which are mapped to the CI using the transformation model.

By incorporating the differential nature of images, [38] extends the gamut-mapping algorithm to show that the gamut-mapping framework is able to incorporate any linear filter output and that derivative-based gamut-mapping will not result in empty feasible mapping solution sets.

In what is known as gamut-constrained illuminant estimation, [22] transforms the problem of illuminant estimation into that of illuminant classification. It learns a gamut for each one of a limited set of possible CIs. The unknown illuminant of the input image is estimated as the CI whose gamut best matches the gamut of the input image.

2.5 Learning-Based Colour Constancy Methods

Methods under this category estimate the illuminant through models that are trained on related data. Even though gamut-based methods fall under this category, they have been described in a separate subsection due to their specific restricted colour range assumption towards illuminant estimation.

2.5.1 Machine-Learning Based

Referred to as colour by correlation, [23] proposes an approach similar to [22] that performs illuminant classification instead of illuminant estimation. Utilizing the chromaticity space, a correlation matrix is computed for each one of a set of possible CIs. Then the information extracted from the given image is matched with that contained in each correlation matrix to obtain a respective score. The score indicates the probability that the given image was captured under the respective CI. The most appropriate CI is chosen using maximum-likelihood estimation or the Kullback-Leibler divergence, as has been done in [56].

Works such as [35] and [36] propose approaches that determine the CC algorithm to be applied on a given image based on its statistics. The statistics are determined using Weibull parameterization. They also show that the corresponding learned parameters correlate with the image attributes that low-level feature based CC methods are sensitive to.

[9] proposes an approach for the single as well as the multi-illuminant scenario with a more dominant focus on the latter. Known as “multi-illuminant random field” (MIRF), they adapt conditional random fields (CRFs) for CC


under non-uniform illumination. They account for the estimation of not only the colour of the illuminants but also their spatial distribution through an energy minimization framework. They show that MIRF can express several static CC methods as an error minimization problem. The work also introduces a dataset consisting of images captured under two dominant illuminants, catered towards benchmarking multi-illuminant CC solutions. Known as the multi-illuminant multi-object (MIMO) dataset, it contains 58 laboratory images and 20 real-world images.

2.5.2 Deep-Learning Based

Using a shallow neural network, [14] proposes an approach to estimate the chromaticity of the illuminant acting on a given image. The work makes use of a multi-layer perceptron with 2 hidden layers that takes a large binarized chromaticity histogram as input and subsequently outputs the two chromaticity values of the estimated illuminant. They use the chromaticity space as it ranges from 0 to 1 (from a pre-processing perspective) and removes any dependency on the intensity information. The work shows the potential of such a simple network in accurately estimating the illuminant even when only a few distinct surfaces are present in an image. However, it suffers from the requirement of a large amount of training data.

The work of [12] is the first to utilize Convolutional Neural Networks (CNNs) towards CC. They make use of a 4-layer CNN that consists of a single convolutional layer and 2 fully-connected (FC) layers. The network takes each non-overlapping patch of an RGB image as input and outputs the RGB values of the illuminant that it estimates to be acting on that patch. Before being provided to the network, the patches are subjected to contrast normalization by means of histogram stretching. From the local patch estimates, the illuminant estimate for the entire image can be obtained. They show that their network obtains state-of-the-art performance on a standard dataset of RAW images, surpassing prior state-of-the-art static and learning-based methods.

[52] proposes an approach that utilizes a deep CNN architecture for illuminant estimation. Adopting the AlexNet architecture [47], their approach takes an RGB image as input and outputs the RGB values of the estimated illuminant. In order to adapt the network for the purpose of CC, they propose a sequential training scheme owing to the large number of parameters and the absence of a very large training dataset for CC. They show that their framework obtains accurate results on real-world datasets and outperforms state-of-the-art methods in the inter-dataset cross-validation setting.

Prior CNN-based methods such as [12] sequentially process the local patches of an image to get local illuminant estimates which are subsequently aggregated to obtain a global illuminant estimate for the entire image. However, they suffer from the limitation that not all the patches contain valuable semantic information, resulting in ambiguities that can skew the global illuminant estimate. Overcoming this limitation, [42] proposes a fully-convolutional network (FCN) architecture that learns to assign importance weights to each local patch


based on the information they contribute towards the global illuminant estimate. The local patch illuminant estimates and their respective importance weights are combined in a novel weighted-pooling layer that learns which local patches should be considered and which should be ignored.

With the advent of Generative Adversarial Networks (GANs) [41], the task of CC has been further simplified. By learning a mapping from the colour-biased domain to the white-balanced domain, GANs can directly produce the white-balanced image, thus discarding the need for illuminant estimation. In that regard, [18] and [59] are the first works to propose CC as an image-to-image translation problem. [18] compares 3 state-of-the-art GAN architectures (Pix2Pix [43], CycleGAN [68], StarGAN [15]) through an extensive qualitative and quantitative survey describing how each architecture can be adapted for CC. [59] proposes a novel GAN architecture (AngularGAN) that adapts the Pix2Pix architecture to additionally estimate an illuminant map and in turn be supervised through an angular loss with the ground-truth illuminant map. By posing the task of shadow detection and removal as an instance of multi-illuminant CC, the work also introduces the largest shadow removal dataset and shows the potential of the proposed architecture in that scenario.

2.6 Illumination Invariance

Illumination invariance can be seen as an instance of CC which seeks to formulate descriptors that maintain stable representations of scene intrinsics under varying illumination conditions. However, these descriptors are not aimed at recovering the colour of the scene illumination.

In what is known as Colour Indexing, [62] proposes a descriptor to represent and match images on the basis of colour histograms. In other words, they perform indexing based on the actual RGB values. Being robust to the viewing and object geometry, the work makes a significant contribution in introducing colour for object recognition. However, it suffers from the limitation that the recognition accuracy degrades significantly when the illumination circumstances vary in the images to be compared.

Based on the Retinex theory, [27] improves upon the approach proposed in [62] by indexing on illumination-invariant descriptors instead of the actual RGB values. By reasoning that the illumination conditions remain constant over a small local region, they show that ratios of neighbouring pixels in each colour channel can act as illuminant-invariant surface descriptors, which are referred to as Colour Ratios. They are formulated as follows

r_1 = \frac{R_{x_1}}{R_{x_2}} \quad (10) \qquad r_2 = \frac{B_{x_1}}{B_{x_2}} \quad (11) \qquad r_3 = \frac{G_{x_1}}{G_{x_2}} \quad (12)


where x_1 and x_2 correspond to two neighbouring pixel locations. In the logarithm space, these equations are transformed to

\log(r_1) = \log(R_{x_1}) - \log(R_{x_2}) \qquad (13)

\log(r_2) = \log(B_{x_1}) - \log(B_{x_2}) \qquad (14)

\log(r_3) = \log(G_{x_1}) - \log(G_{x_2}) \qquad (15)

which is equivalent to taking the derivative of the logarithm of each colour channel. However, the descriptors are derived based on the underlying assumption that the neighbouring points in consideration have the same surface normal. This, in turn, makes them highly sensitive to changes in object geometry.

Overcoming the drawback of [27], the work by [31] introduces surface descriptors that are derived from the DRM. The work shows that invariance to illumination conditions, viewing geometry, and object geometry can be achieved if the ratios of neighbouring pixels are computed in an inter-channel manner rather than an intra-channel manner (as has been done in [27]). These are referred to as Cross Colour Ratios (CCRs) and are formulated as follows

m_1 = \frac{R_{x_1} G_{x_2}}{R_{x_2} G_{x_1}} \quad (16) \qquad m_2 = \frac{R_{x_1} B_{x_2}}{R_{x_2} B_{x_1}} \quad (17) \qquad m_3 = \frac{G_{x_1} B_{x_2}}{G_{x_2} B_{x_1}} \quad (18)

where x_1 and x_2 correspond to two neighbouring pixel locations. In the logarithm space, these equations are transformed to

\log(m_1) = \log\!\left(\frac{R_{x_1}}{G_{x_1}}\right) - \log\!\left(\frac{R_{x_2}}{G_{x_2}}\right) \qquad (19)

\log(m_2) = \log\!\left(\frac{R_{x_1}}{B_{x_1}}\right) - \log\!\left(\frac{R_{x_2}}{B_{x_2}}\right) \qquad (20)

\log(m_3) = \log\!\left(\frac{G_{x_1}}{B_{x_1}}\right) - \log\!\left(\frac{G_{x_2}}{B_{x_2}}\right) \qquad (21)

This is equivalent to computing the derivative of the ratio of each colour channel pair (R/G, R/B, G/B) in the logarithm space.
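A small sketch of how the log-space CCRs above could be computed between horizontally neighbouring pixels (the choice of right-hand neighbour and the epsilon are illustrative assumptions; the function name is hypothetical):

```python
import numpy as np

def log_cross_colour_ratios(img, eps=1e-4):
    """Equations 19-21: log-space CCRs between each pixel and its right neighbour."""
    logI = np.log(img.astype(np.float64) + eps)        # avoid log(0)
    R, G, B = logI[..., 0], logI[..., 1], logI[..., 2]
    m1 = np.diff(R - G, axis=1)                        # eq. 19
    m2 = np.diff(R - B, axis=1)                        # eq. 20
    m3 = np.diff(G - B, axis=1)                        # eq. 21
    return m1, m2, m3
```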

3 Methodology

The main limitation of recent grey pixel detection methods such as GI [55] is that they mistake pixels from inherently coloured surfaces as grey pixels. In light of the advancements brought about by deep learning in computer vision, this section proposes a deep-learning framework for the further refinement of grey pixel selection, in order to overcome this limitation.


3.1 Overview

Figure 3 illustrates the detailed architecture of the proposed framework.

Figure 3: Proposed network architecture. PAC stands for Pixel Adaptive Convolution. The other four numbers shown in the convolutional blocks represent Filter Size 1 × Filter Size 2 × Input Channel × Output Channel.

The framework is a fully convolutional architecture incorporating pixel-adaptive convolutions [61] that make use of pixel-adaptive features. The Greyness Index (GI) [55] and the L_2 norm of the CCR components (m_1, m_2) computed from the colour-biased image are passed as adaptive features to the network. The purpose of the CCR L_2 norm is to provide a more robust prior of pixel greyness to mitigate the drawback of GI to a certain extent. In order to train the framework, we use Saturation (computed from the white-balanced image) as the ground-truth pertaining to the true greyness of pixels in the scene. ReLU is used as the activation function for the first three layers while the Sigmoid activation function is used for the final layer. The following subsections explain the motivation behind the adopted network architecture.

3.2 Fully Convolutional Architecture

The motivation behind the fully convolutional aspect of the proposed architecture is two-fold:

• The task of grey pixel detection involves estimating a greyness measure for each pixel in an image. Thus, it would be preferable for intended deep-learning architectures to preserve the spatial dimensions of their input and output.

• A fully convolutional architecture enables a model to process images of any spatial dimensions. For example, this can be quite useful when processing high resolution images where the information in the images is to be kept as intact as possible or where accuracy is paramount. In such a scenario, without the fully convolutional aspect, the images need to be down-sampled to the input image size required by the model. This would lead to a loss of information, which is highly undesirable.

3.3 A shallow architecture for model compactness and faster inference

Known as Convolutional Mean, [40] proposes an approach for illumination estimation with a focus on the consumer application scenario. They reason that models intended for deployment onto embedded platforms should ideally have a fast inference speed and a small loading time (fewer model parameters). Their proposed network utilizes only 2 convolutional layers and a weighted global-average pooling layer that is inspired by the functioning of the Grey World [13] and Grey Edge [64] algorithms. Owing to the small number of parameters (∼ 1000) and the requirement of thumbnail images as input, the authors argue that their network is several times more efficient than state-of-the-art learning-based methods such as [42]. On standard single-illuminant benchmark datasets, the network obtains results that are comparable to the state-of-the-art.

3.3.1 Application in this context

In this context, following the design paradigm of Convolutional Mean, the network consists of a total of 4 convolutional layers, with the first three consisting of 3 × 3 convolution kernels and the last layer consisting of a 1 × 1 convolution kernel. The number of filters in the first three layers is increased by a factor of 2, starting from 8 filters in the first layer and ending with 32 filters in the third layer. This is done in order for the model to learn more high-level features with each increasing level of abstraction. The effect of the first three layers can be seen as learning rich feature representations of the input image instead of hand-crafted variants. The effect of the last layer can be seen as a weighted combination of the 32 feature maps resulting from the third layer. Each feature map can be seen as representing some aspect of the greyness quality for all the pixels in an image that ultimately contributes to their final saturation value. Unlike Convolutional Mean, the model does not contain any pooling layers, in order to preserve the input-output spatial dimensions. The model contains a total of 6065 parameters.
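For illustration, a minimal PyTorch sketch of this layer configuration is given below. It uses plain convolutions in place of the pixel adaptive convolutions of the actual framework (introduced in the next subsection), and the class name is hypothetical, but it reproduces the stated layer sizes and the total of 6065 parameters.

```python
import torch.nn as nn

class GreyPixelNet(nn.Module):
    """Hypothetical name; shallow fully convolutional network of Section 3.3.1."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1),  nn.ReLU(),   # 3x3x3x8   -> 224 parameters
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),   # 3x3x8x16  -> 1168 parameters
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # 3x3x16x32 -> 4640 parameters
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),           # 1x1x32x1  -> 33 parameters
        )

    def forward(self, x):        # x: colour-biased RGB image of any spatial size
        return self.net(x)       # per-pixel greyness (saturation) estimate

# Parameter check: 224 + 1168 + 4640 + 33 = 6065
```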


3.4 Pixel adaptive convolutions for the effective integration of refinement constraints

Owing to the limitations of the spatially-shared and content-agnostic nature of the standard convolution operation, [61] proposes a content-adaptive convolution operation called Pixel Adaptive Convolution (PAC). They adapt a standard spatially invariant convolution filter at each pixel by multiplying it with a spatially varying filter which they refer to as the adapting kernel. The adapting kernel has a pre-defined form and depends on an additional set of pixel features. They mainly use the Gaussian parametric form e^{-\frac{1}{2}||f_i - f_j||^2}, where f_i \in \mathbb{R}^d is a d-dimensional feature at the i-th pixel. These pixel features are referred to as adapting features, which can either be hand-crafted or learned end-to-end. The authors show that PAC has a diverse use-case owing to its flexibility and can be seen as a generalisation of several widely-used filters (e.g. the bilateral filter). Moreover, [61] also shows that state-of-the-art results are obtained in deep joint image-upsampling when PAC is incorporated.

3.4.1 Application in this context

In this context, PAC can be seen as a natural fit for the effective integration of the refinement constraints within the inference pipeline of the framework. The PAC layers enable the model to adaptively process each pixel location in an image using the refinement constraints of GI and the CCR L_2 norm as the adaptive features. Moreover, the learned feature representations at each layer can be seen as being further refined by the adaptive features.

3.4.2 Greyness Index (GI) as an adaptive feature

For the baseline constraint, the GI function of [55] is considered as it obtains state-of-the-art results on single and multi-illuminant benchmark datasets. Using assumptions based on the DRM, the authors of [55] mathematically show that, for grey pixels

M_r = C\{\log(I_r) - \log(I)\} = 0 \qquad (22)

where I_r is the red colour channel, I is the intensity, and C is a local contrast operator (Laplacian of Gaussian). However, as equation 22 by itself is not a sufficient condition for detecting grey pixels, the authors also extend the conditioning to the blue colour channel

M_b = C\{\log(I_b) - \log(I)\} = 0 \qquad (23)

reasoning that the spectral responses of the red and blue colour channels rarely overlap in sensors [55]. They compute the GI as

GI = \sqrt{M_r^2 + M_b^2} \qquad (24)

In order to ensure that the candidate grey pixels are not selected from flat patches (without any spatial cues), the authors propose

C\{I_i\} > \epsilon \qquad (25)

where i \in \{R, G, B\} and \epsilon corresponds to a pre-defined threshold. In other words, only those pixels are considered whose regions have a local contrast above a certain threshold in each colour channel. As a final step, they perform a 7 × 7 average filtering to weaken the effect of isolated grey pixels. From equations 22, 23, and 24, it follows that the lower the value of GI, the greyer a pixel is considered to be.
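The following is a hedged NumPy sketch of the GI computation of equations 22-25. The Laplacian-of-Gaussian scale, the contrast threshold, and the definition of the intensity I as the channel mean are illustrative assumptions rather than the exact settings of [55].

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, uniform_filter

def greyness_index(img, eps=1e-4, contrast_thresh=1e-2, sigma=0.5):
    """Sketch of GI (eqs. 22-25) for a float RGB image in [0, 1] of shape (H, W, 3)."""
    img = img.astype(np.float64) + eps               # avoid log(0)
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    intensity = (R + G + B) / 3.0                    # assumed definition of I

    C = lambda x: gaussian_laplace(x, sigma)         # local contrast operator

    Mr = C(np.log(R) - np.log(intensity))            # eq. 22
    Mb = C(np.log(B) - np.log(intensity))            # eq. 23
    gi = np.sqrt(Mr ** 2 + Mb ** 2)                  # eq. 24
    gi = uniform_filter(gi, size=7)                  # 7x7 average filtering

    # eq. 25: ignore flat patches (no spatial cues) in any colour channel
    flat = (np.abs(C(R)) <= contrast_thresh) | \
           (np.abs(C(G)) <= contrast_thresh) | \
           (np.abs(C(B)) <= contrast_thresh)
    gi[flat] = np.inf                                # never selected as grey
    return gi                                        # lower value => greyer pixel
```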

One limitation of this GI function is that it also selects pixels from intrinsically coloured surfaces, which skews the illuminant estimation process. Figure 4 shows such an example.

Figure 4: The top 10% pixels that are considered to be grey by the GI function of [55] on the cameras image of the MIMO dataset [9]. Pixels from the top-right and bottom-left surfaces are still selected even though the surfaces are intrinsically dark blue in colour. Moreover, pixels from the intrinsically yellow coloured filter of the projector are also selected.

3.4.3 Cross Colour Ratio L2 norm as an adaptive feature

According to equation 19, the CCR component m_1 can be seen as the derivative of R/G in the logarithm space. Adopting the convention of [55], it can be written as

m_1 = C\{\log(I_r) - \log(I_g)\} \qquad (26)

where C indicates a first-order or second-order derivative. Following the mathematical formulation of the DRM adopted in [55], it can be shown that for grey pixels

m_1 = C\{\log(I_r) - \log(I_g)\} = 0 \qquad (27)

where I_r and I_g represent the red and green colour channels respectively.

Similarly, it can be shown that, for grey pixels, the remaining CCR components should be equal to 0.

Theoretically, according to [31], in the logarithm space, for pixels from a uniformly coloured region, all three CCR components would be equal to 0 as the two neighbouring pixels would have the same corresponding RGB values. At least one of the three components would be non-zero for pixels on locations where two regions of distinct surface albedo meet [31]. In other words, CCRs give a high response for pixels that lie on the boundaries of surfaces and a low response for those that lie within. Thus, theoretically, CCRs serve a dual purpose: for a pixel having m_1 = m_2 = m_3 = 0, it can either signify that the pixel is grey or that it originates from a uniform region (achromatic or chromatic).

In that regard, the L_2 norm of the CCR components could be seen as an indicator of the greyness of a pixel or the uniformity of the region from which a pixel originates. Moreover, it correlates with GI in that the lower the value, the greyer a pixel is considered to be. As it is computed from illumination invariant measures, it can be provided as a robust prior pertaining to the greyness of pixels to an intended approach for the refinement of grey pixel detection. Table 1 summarizes the L_2 norms of the various combinations of the CCR components. It can be observed that the L_2 norm of (m_1, m_2) and that of (m_2, m_3) obtain mean angular errors lower than GI (3.79 as given in [55]). The L_2 norm pertaining to the combination of (m_1, m_2) is provided as the constraint in all further experimentation.

Combination                  Mean Angular Error
L_2 norm (m_1, m_2)          3.71
L_2 norm (m_1, m_3)          3.83
L_2 norm (m_2, m_3)          3.77
L_2 norm (m_1, m_2, m_3)     3.83

Table 1: L_2 norms of the various combinations of the CCR components. The errors have been obtained on the real-world category of the MIMO dataset [9]. The lowest error is indicated in bold.
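A small sketch of how the (m_1, m_2) CCR L_2-norm feature could be computed in this convention; the Laplacian-of-Gaussian scale and epsilon are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def ccr_l2_norm(img, eps=1e-4, sigma=0.5):
    """L2 norm of the (m1, m2) CCR components (eqs. 26-27),
    using a Laplacian-of-Gaussian as the contrast operator C{.}."""
    logI = np.log(img.astype(np.float64) + eps)
    R, G, B = logI[..., 0], logI[..., 1], logI[..., 2]
    C = lambda x: gaussian_laplace(x, sigma)
    m1 = C(R - G)                        # analogue of eq. 26 for R/G
    m2 = C(R - B)                        # analogue for R/B
    return np.sqrt(m1 ** 2 + m2 ** 2)    # lower value => greyer / more uniform
```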

3.5 Saturation as the ground-truth

Saturation indicates how chromatic a colour is: a value of 0 indicates a shade of grey while 1 indicates a hue in its entirety. Theoretically, for a given surface, the closer its saturation value is to 0, the more intrinsically grey it is. According to [31], saturation is invariant to the illumination intensity, viewing geometry, and object geometry for matte surfaces under white illumination. However, it suffers from the drawback that it is sensitive to the illumination colour. As formulated in [60], for a pixel, saturation is computed as:

\mathrm{Saturation} = \frac{\mathrm{Max}(R, G, B) - \mathrm{Min}(R, G, B)}{\mathrm{Max}(R, G, B)} \qquad (28)

For a scene, the intrinsically grey surfaces in the white-balanced image take on the full effect of the illumination colour, appearing as coloured surfaces in the colour-biased image. Figure 5 illustrates such an instance.

Figure 5: White-balanced image and the corresponding colour-biased image. The intrinsically grey regions take on the full effect of the illumination colour, as can be seen in the colour-biased image.

Figure 6: Saturation maps computed from the respective white-balanced and colour-biased images. As can be observed, the intrinsically grey regions have a high saturation value in the colour-biased image.

As a result of this, when saturation is computed from the colour-biased image, the intrinsically grey regions correspond to a very high saturation value, thereby being mistaken as coloured surfaces. Therefore, Saturation cannot be provided as an input to intended approaches since it could act as an incorrect prior pertaining to the greyness of surfaces. On the other hand, owing to its invariance to illumination intensity and scene geometry under white illumination, saturation computed from the white-balanced image can serve as a reasonable ground-truth for the greyness of surfaces. This could be quite useful for training learning-based grey pixel detection approaches. Figure 6 shows such an instance.
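A minimal sketch of the ground-truth computation described above, assuming equation 28 and a small constant to avoid division by zero (names are hypothetical):

```python
import numpy as np

def saturation_map(img, eps=1e-8):
    """Per-pixel saturation (eq. 28) of a float RGB image with values in [0, 1]."""
    mx = img.max(axis=-1)
    mn = img.min(axis=-1)
    return (mx - mn) / (mx + eps)

# Supervision scheme sketched above: the network receives the colour-biased
# image (plus the GI and CCR-norm adaptive features) and regresses the
# saturation map computed from the corresponding white-balanced image.
# target = saturation_map(white_balanced_img)
```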


4 Experimental Setup

4.1 Training / Validation Dataset

As mentioned in the previous sections, research in the multi-illuminant setting is limited by the lack of a large and proper training set for developing intended learning-based approaches. As it can be hard to manually capture a large number of images under the multi-illuminant setting, the general pipeline would be to subject existing white-balanced images to synthetically created multiple illumination conditions.

Intended for the single-illuminant setting, the INTEL TAU dataset [48] contains 7022 images in total, making it the largest available high-resolution dataset for illumination estimation research. The images are captured using 3 different cameras, namely Canon 5DSR, Nikon D810, and Mobile Sony IMX135, facilitating camera invariance evaluation. Moreover, the images cover a diverse range of indoor and outdoor scenes shot in 17 different countries, thus allowing for scene invariance evaluation. In compliance with GDPR, privacy masking is also applied on all sensitive information.

The dataset contains four categories of images per camera, namely Field 1, Field 3, Lab Printouts, and Lab Real Scenes. The Field 1 category contains unique field images captured by the camera. The Field 3 category consists of images of common scenes captured by all the cameras. Lab Printouts consists of lab printouts while Lab Real Scenes consists of real lab scenes. For the purpose of this research, in order to avoid redundancy in the training dataset, only images from the Field 1 category have been chosen for each camera, resulting in a total of 4852 images. They have been further colour-corrected in order to obtain the white-balanced counterparts.

In order to simulate the multiple illumination setting under two illuminants, the set of real-world illuminants is first obtained from the SFU Greyball [16] and ReCC [58] datasets. In order to make the illuminant distributions adhere more to the semantics of the image, binary semantic masks are created using the imsegkmeans() function of MATLAB with the cluster number set to 2. Then, to each cluster mask, a random Gaussian distribution of two randomly sampled illuminants with a random mean and variance is applied in order to generate the final tint map. Colour-biased images are created by tinting the white-balanced images with the tint maps using the inverse von Kries transform. Figure 7 illustrates an example.


Figure 7: An example from the synthetic multi-illumination dataset (white-balanced image, semantic binary mask, tint map, and colour-biased image).
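A hedged sketch of the two-illuminant tinting pipeline described above is given below. The clustering call (scikit-learn in place of MATLAB's imsegkmeans), the Gaussian falloff, and the sampling ranges are illustrative assumptions, and all names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_colour_biased(wb_img, illuminants, rng=None):
    """Sketch: tint a white-balanced image (H, W, 3) in [0, 1] with two
    illuminants sampled from a real-world pool of RGB illuminants (N, 3)."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = wb_img.shape

    # Two-cluster "semantic" mask (stand-in for MATLAB's imsegkmeans with k = 2)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(
        wb_img.reshape(-1, 3)).reshape(H, W)

    # Sample two illuminants and blend them with a random Gaussian falloff
    ill = illuminants[rng.choice(len(illuminants), 2, replace=False)]
    yy, xx = np.mgrid[0:H, 0:W]
    centre = rng.uniform([0, 0], [H, W])
    sigma = rng.uniform(0.2, 0.6) * max(H, W)
    w = np.exp(-((yy - centre[0]) ** 2 + (xx - centre[1]) ** 2) / (2 * sigma ** 2))

    tint = np.empty_like(wb_img)
    for k in range(2):                       # per-cluster illuminant mixture
        mix = w if k == 0 else 1.0 - w
        tint[labels == k] = (mix[labels == k, None] * ill[0]
                             + (1 - mix[labels == k, None]) * ill[1])

    # Inverse von Kries: multiply the white-balanced image by the tint map
    return np.clip(wb_img * tint, 0.0, 1.0), tint
```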

4.2 Hyper-parameter Configuration

The resulting images are shuffled and then split into training and validation sets using a 90:10 split. Table 2 shows the hyper-parameter configuration used for the training phase.

Hyper-Parameter           Value
Optimizer                 Adam [45]
Loss Function             L1 Loss
Batch Size                32
Learning Rate             1e-5
Epochs                    100
Weight Initialization     Xavier normal distribution [39]

Table 2: Hyper-parameter configuration for the training phase.

As the loss would be computed on values that range from 0 to 1, using the L_2 loss in this case would have some demerits as it involves squaring. Since squaring values between 0 and 1 leads to even smaller values, the error propagated to the network would be too small for the parameters to be updated reasonably. As the L_1 loss does not involve squaring, it can be used as a reasonable alternative.
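For concreteness, a minimal training-loop sketch matching Table 2 is shown below; the model class (from the sketch in Section 3.3.1), the data loader, and the variable names are hypothetical placeholders.

```python
import torch
import torch.nn as nn

model = GreyPixelNet()                      # hypothetical network from the Section 3.3.1 sketch
for p in model.parameters():                # Xavier normal initialization [39] for weight tensors
    if p.dim() > 1:
        nn.init.xavier_normal_(p)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.L1Loss()

for epoch in range(100):
    for biased_img, target_saturation in train_loader:   # assumed DataLoader, batch size 32
        loss = criterion(model(biased_img), target_saturation)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```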


5 Experimentation and Analysis

In order to determine the effectiveness and robustness of the proposed architecture in the detection of grey pixels, the model is evaluated in a couple of scenarios which are described in the following subsections. As a baseline comparison, the GI function proposed in [55] is considered as it is the current state-of-the-art grey pixel detection method.

5.1 Grey Surface Detection Test

5.1.1 Motivation

In the multi-illuminant setting, using the estimated spatial illumination map as a means of evaluating the performance of grey pixel detection algorithms can have some demerits. These demerits mainly arise from the multi-illumination estimation pipeline proposed in [55], as follows:

• As reported in [55], for some scenes with complex geometry, the use of the Euclidean distance can result in spatial illumination maps where the predictions may not be sharp. This can act as a potential source of error during comparison with the ground-truth illumination maps, which capture the sharp changes in illumination.

• It is important to choose the optimal number of clusters M, as each cluster can be seen as representing an illuminant or a mixture of illuminants. If the number of clusters does not cover the possible illuminants and their mixtures, it can greatly influence the quality of the resulting spatial illumination map. This is supported by [66], where they report an improvement in performance on the MIMO dataset [9] when the number of clusters M is moderately increased over the number of illuminants in the scene. They attribute this to the fact that some image regions are actually lit by a mixture of two illuminants. This also applies to settings with more than two illuminants.

In other words, even if an intended algorithm detects grey surfaces to a satisfactory extent, the limitations of the multi-illumination estimation pipeline can result in a wrong evaluation of its performance. Therefore, it would be desirable to evaluate an intended algorithm directly based on its accuracy of grey pixel selection rather than through the accuracy of the estimated spatial illumination map.

5.1.2 Evaluation Dataset

From the datasets introduced in [10], [3], and [5], a total of 20 white-balanced images are chosen consisting of 10 indoor and 10 outdoor scenes. Figure 8 shows the selected images. Then, for each image, 5 variants are created which are described below


• 2 variants where the image is subjected to a two-illuminant setting.

• 1 variant where the image is subjected to a three-illuminant setting.

• 2 variants where the image is subjected to a two-illuminant setting consisting of spatially varying intensities of the involved illuminants.

This results in a total of 100 evaluation images for intended algorithms to be tested upon. The motivation here is to test the selection robustness of an intended algorithm under different settings of illumination and intensity.

The variants under the two- and three-illuminant settings are obtained using the synthetic multi-illumination generation pipeline described in section 4.1. In order to obtain the varying spatial intensity variants, only one additional step needs to be performed. A random Gaussian distribution of random mean and variance is generated and multiplied with the maximum intensity (255). This is then applied to the two-illuminant tint map to generate a tint map with spatially varying intensities, which is later applied to the white-balanced image.

The next step would be to obtain the ground-truth data that represents the actual grey surfaces in the images. For this, the white-balanced images are manually annotated into achromatic and chromatic regions resulting in binary masks where white regions denote achromatic regions and black regions denote chromatic regions. The image annotation tool presented in [67] is used for the purpose of manual annotation. Figure 9 shows the 5 variants as well as the ground-truth mask for one of the images.

Figure 9: The 5 variants ((a) ml21, (b) ml22, (c) ml31, (d) vi21, (e) vi22) and (f) the ground-truth binary mask for one of the images in the evaluation dataset. “ml” and “vi” stand for multiple lights and varying intensity respectively. The next digit indicates the number of light sources present in the scene. The last digit indicates the variant number.


5.1.3 Results

The evaluation is performed on three metrics namely Dice Coefficient, Precision, and Recall. The Dice Coefficient is a measure of overlap between 0 and 1 where a value of 1 indicates perfect and complete overlap. It is computed as

Dice = 2|A ∩ B| / (|A| + |B|)    (29)

where |A ∩ B| represents the number of common elements between A and B, and |A| and |B| represent the number of elements in A and B respectively. As the metric is intended for binary data, it requires both the ground-truth and the prediction to be binary masks. However, both GI and the proposed model output greyness measure maps whose values are not binary.

In order to overcome this limitation, [44] presents a variant of the dice coefficient called the Soft Dice Score, which makes direct use of the greyness measure values without the need to threshold and convert them to binary values. It does, however, require the greyness measure values to be rescaled to the range [0, 1]. They also need to be inverted so that higher scores represent higher degrees of greyness. It is computed as

Soft Dice Score = 2 ∗ sum(A ◦ B) / (sum(A) + sum(B))    (30)

where A represents the rescaled greyness measure map and B represents the ground-truth binary mask. ◦ indicates element-wise multiplication while sum(.) denotes the sum of all the elements of the matrix. Table 3 shows the soft dice scores obtained on the proposed architecture and GI.
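A minimal sketch of this metric, assuming NumPy arrays, is given below. The rescaling and inversion follow the description above; the function name soft_dice and the invert flag are illustrative.

```python
import numpy as np

def soft_dice(greyness_map, gt_mask, invert=True, eps=1e-8):
    """Soft Dice Score between a continuous greyness measure map and a
    binary ground-truth mask (1 = achromatic, 0 = chromatic)."""
    # Rescale the greyness measure to the range [0, 1].
    g = greyness_map.astype(np.float64)
    g = (g - g.min()) / (g.max() - g.min() + eps)

    # Invert when lower values indicate greyer pixels (as with GI), so that
    # higher values represent higher degrees of greyness.
    a = 1.0 - g if invert else g

    b = gt_mask.astype(np.float64)
    return 2.0 * np.sum(a * b) / (np.sum(a) + np.sum(b) + eps)
```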

              Overall    Two-Illuminant   Three-Illuminant   Varying Intensity
GI            0.50716    0.49535          0.50073            0.52219
Proposed      0.50966    0.51537          0.51916            0.49920

Table 3: Soft Dice Scores for the proposed architecture and GI. Values in bold indicate best performing scores.

From Table 3, it can be observed that, overall, the proposed architecture performs slightly better than GI. Moreover, it outperforms GI in the two- and three-illuminant settings, indicating its more robust nature over GI under multiple lighting conditions. However, GI outperforms the proposed architecture in the varying intensity setting. This could be attributed to the fact that GI transforms the input colour-biased image into an intensity-invariant space, which could make it more robust towards changes in illumination intensity. Even though the proposed architecture makes use of illumination invariant features, it still observes the colour-biased image as its main input, which could skew its grey pixel estimation.

For Precision and Recall, evaluation is performed on the top 1%, top 5%, and top 10% candidate grey pixels as determined by the greyness measure maps. To obtain the corresponding binary mask, locations of the candidate grey pixels are set to 1 while the rest are set to 0. Tables 4, 5, and 6 show the precision and recall scores obtained on the top 1%, top 5%, and top 10% pixels for each variant category respectively.
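The following is a minimal sketch, assuming NumPy, of how the top-k% selection could be converted into a binary mask and scored against the annotated ground-truth; the function and parameter names are illustrative rather than taken from the actual evaluation code.

```python
import numpy as np

def topk_precision_recall(greyness_map, gt_mask, top_percent=10, invert=True):
    """Precision and recall of the top-k% candidate grey pixels against a
    binary ground-truth mask (1 = achromatic, 0 = chromatic)."""
    g = greyness_map.astype(np.float64).reshape(-1)
    if invert:  # lower values indicate greyer pixels (as with GI)
        g = -g

    # Binary prediction: 1 for the top-k% greyest pixels, 0 elsewhere.
    k = max(1, int(g.size * top_percent / 100))
    pred = np.zeros_like(g)
    pred[np.argsort(g)[-k:]] = 1

    gt = gt_mask.astype(np.float64).reshape(-1)
    tp = np.sum(pred * gt)
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return precision, recall
```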

              Overall             Two-Illuminant      Three-Illuminant    Varying Intensity
              Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
GI            0.79780    0.02846  0.75950    0.02640  0.82183    0.02907  0.82408    0.03023
Proposed      0.89751    0.04269  0.86202    0.04126  0.87115    0.04203  0.94618    0.04445

Table 4: Precision and Recall scores on the top 1% pixels. Values in bold indicate the best performing scores.

              Overall             Two-Illuminant      Three-Illuminant    Varying Intensity
              Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
GI            0.74927    0.11965  0.71607    0.10947  0.75140    0.11719  0.78141    0.13107
Proposed      0.85389    0.18344  0.81745    0.17559  0.82823    0.18036  0.90315    0.19283

Table 5: Precision and Recall scores on the top 5% pixels. Values in bold indicate the best performing scores.

              Overall             Two-Illuminant      Three-Illuminant    Varying Intensity
              Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
GI            0.69614    0.20384  0.67569    0.19133  0.69652    0.20038  0.71639    0.21808
Proposed      0.81252    0.31944  0.79168    0.31226  0.80004    0.31761  0.83959    0.32879

Table 6: Precision and Recall scores on the top 10% pixels. Values in bold indicate the best performing scores.

From tables 4, 5, and 6, it can be observed that the proposed architecture has a better precision of grey pixel selection than GI, highlighting its more robust nature under varying illumination conditions. Moreover, the higher recall of the proposed architecture hints towards its better reliability, in the sense that a higher fraction of the total number of true grey pixels is selected when this architecture is used.

Figures 10, 11, and 12 show the top 1%, top 5%, and top 10% grey pixel selection results on an image of the evaluation dataset. They illustrate an instance of the overall trend observed on the evaluation dataset. From the images, it can be understood that the lower precision and recall of GI stem from its selection of coloured surfaces under all the selection scenarios. This is in accordance with the coloured surface selection drawback of GI as reported in the earlier sections. The proposed architecture, on the other hand, overcomes this drawback to a large extent, as can be seen from its more robust selection of grey pixels under varying illumination conditions.


Figure 10: Top 1% grey pixel selection for the 5 variants of an image of the evaluation dataset. The top row shows the GI selection results while the bottom row shows the selection results of the proposed architecture.

Figure 11: Top 5% grey pixel selection for the 5 variants of an image of the evaluation dataset. The top row shows the GI selection results while the bottom row shows the selection results of the proposed architecture.

Figure 12: Top 10% grey pixel selection for the 5 variants of an image of the evaluation dataset. The top row shows the GI selection results while the bottom row shows the selection results of the proposed architecture.


Row 1: WB, GT MASK, and selection results with precision scores 0.20430, 0.24214, 0.24626
Row 2: WB, GT MASK, and selection results with precision scores 0.06851, 0.11550, 0.21346

Figure 13: Top 10% selection results on some failure case images resulting in the proposed architecture not achieving an overall perfect precision. Values below the images indicate their respective precision scores. WB refers to the white-balanced image and GT MASK refers to the ground-truth annotated mask.

Figure 13 shows some failure cases explaining why the proposed architecture does not achieve an overall perfect precision on the top 10% selection. For the images on the first row, it can be observed from WB that the low precision scores arise from the inherent nature of the scene. The scene by itself does not contain many grey pixels. Thus, when the top 10% pixels are selected, only a small portion of the top 10% can be covered by true grey pixels, and pixels from coloured surfaces are included to cover the remainder. For the images on the second row, the low precision scores arise from the annotation choices made during the creation of the ground-truth masks. Physical grey surfaces are the primary focus of the manual annotation process, as a result of which the sky region is not annotated as an achromatic region. As the proposed model selects a majority of the top 10% pixels from the sky region, the precision is negatively affected.

Two ablation studies are conducted to evaluate the individual effect of each refinement constraint on the performance of the proposed architecture. One study focuses on the impact of the adaptive features, while the other focuses on the impact of using Saturation as the ground-truth. In total, 4 variants of the model are evaluated, as given in table 7.

Variant       Description
M GI          Only GI as the adaptive feature
M CCR NORM    Only CCR L2 norm as the adaptive feature
M NORMAL      Both GI and CCR L2 norm as the adaptive features
M ACHR        Model variant trained on the Achromatic Loss [11]

Table 7: Variants considered for the ablation studies. Kindly note that M ACHR does not make use of Saturation as the ground-truth.


Evaluation for both ablation studies is performed on the top 10% candidate grey pixels. Table 8 summarizes the results on the adaptive features using the model trained on Saturation with the L1 loss (as proposed under 3.1). Table 9 summarizes the results on the ground-truth / training-loss used.

              Overall             Two-Illuminant      Three-Illuminant    Varying Intensity
              Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
GI            0.69614    0.20384  0.67569    0.19133  0.69652    0.20038  0.71639    0.21808
M GI          0.81229    0.31986  0.79144    0.31216  0.79937    0.31735  0.83959    0.32882
M CCR NORM    0.81230    0.31987  0.79146    0.31217  0.79937    0.31734  0.83960    0.32882
M NORMAL      0.81252    0.31994  0.79168    0.31226  0.80004    0.31761  0.83959    0.32879

Table 8: Precision and recall scores for the ablation study on the adaptive features. Best scores are highlighted in bold. The scores using the GI function [55] are reported for the sake of comparison.

From table 8, it can be seen that the model variant M GI significantly outperforms GI across all the categories of illumination settings. This is in accordance with the hypothesis put forth in the earlier sections that the drawback of GI could be overcome upon its proper integration within a deep learning framework, resulting in a further refinement of grey pixel selection. Moreover, M CCR NORM performs slightly better than M GI under all the illumination settings, justifying the choice of the CCR L2 norm as a slightly more robust alternative to GI. Using GI and the CCR L2 norm in tandem leads to the best performance of the model under all the categories of illumination conditions except for the Varying Intensity setting, where it performs slightly worse than using GI or the CCR L2 norm individually. This indicates the higher degree of robustness obtained when using both refinement constraints as adaptive features to the network.

              Overall             Two-Illuminant      Three-Illuminant    Varying Intensity
              Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
M NORMAL      0.81252    0.31994  0.79168    0.31226  0.80004    0.31761  0.83959    0.32879
M ACHR        0.73233    0.26934  0.72733    0.26456  0.72902    0.26534  0.73898    0.27611

Table 9: Precision and recall scores for the ablation study on the ground-truth / training-loss used. Best scores are highlighted in bold.

From table 9, it can be seen that the model variant M NORMAL, which makes use of Saturation as the ground-truth, significantly outperforms M ACHR under all the illumination settings. One reason could be that the achromatic loss [11] makes direct use of the colour distribution of the ground-truth white-balanced image in order to train a model for grey pixel selection. Despite being white-balanced, the colours of the objects in the scene can still be affected by shadows, inter-reflections, etc., which can result in a less robust ground-truth pertaining to the greyness of pixels. Saturation, on the other hand, is invariant to the illumination intensity and scene geometry under white illumination [31]. Thus, it acts as a more robust ground-truth, resulting in a better training process for the model.
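A minimal sketch of deriving such a saturation-based greyness target from a white-balanced image is given below. The HSV-style saturation used here is an assumption; the exact saturation definition and whether the network regresses saturation directly or its inverse are details of the actual training setup not shown in this sketch.

```python
import numpy as np

def saturation_target(wb_image, eps=1e-8):
    """Derive a greyness target from a white-balanced HxWx3 image: pixels
    with low saturation are treated as grey. Uses the HSV-style saturation
    S = (max - min) / max as an assumed definition."""
    img = wb_image.astype(np.float64)
    cmax = img.max(axis=2)
    cmin = img.min(axis=2)
    saturation = (cmax - cmin) / (cmax + eps)

    # Invert so that higher values indicate greyer (less saturated) pixels
    # (an assumption about the target convention).
    return 1.0 - saturation
```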


5.2 Multi-Illuminant Multi-Object (MIMO) Dataset

[9] proposes two high-quality datasets under the multi-illuminant scenario. One contains 58 images taken under a controlled laboratory setting, while the other consists of 20 real-world indoor and outdoor scenes. All these images have been captured under two dominant illuminants. Each set includes complex scenes with multiple reflectances and specularities, and a variety of lighting conditions and illuminant colours are present. Moreover, the pixel-wise ground-truth illuminants are also provided. The main difference between the two datasets is that the laboratory images are exactly in the form extracted from the camera, while the real-world images have been colour post-processed with sharpening software to emulate realistic camera output. Due to their small size, the datasets are only suitable for testing purposes and not for training.

5.2.1 Quantitative Comparison

In real life, one would generally not encounter scenes captured under controlled settings such as those in the lab category, so the main focus is on the performance on the real-world category. The quantitative results on the lab category are still included for the sake of completeness. The results of the other evaluated methods are obtained from [55].
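The reported scores are the median and mean of the per-pixel angular errors between the estimated and ground-truth spatial illumination maps. A minimal sketch of this metric, assuming NumPy and HxWx3 illumination maps, is given below; the function name is illustrative.

```python
import numpy as np

def angular_error_stats(est_illum, gt_illum, eps=1e-8):
    """Median and mean per-pixel angular error (in degrees) between an
    estimated and a ground-truth spatial illumination map (HxWx3)."""
    est = est_illum.reshape(-1, 3).astype(np.float64)
    gt = gt_illum.reshape(-1, 3).astype(np.float64)

    # Cosine of the angle between estimated and true illuminant vectors.
    cos = np.sum(est * gt, axis=1) / (
        np.linalg.norm(est, axis=1) * np.linalg.norm(gt, axis=1) + eps)
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.median(angles), np.mean(angles)
```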

Multi-Illuminant          Laboratory (58)     Real-world (20)
Method                    Median    Mean      Median    Mean
Doing Nothing             10.5      10.6      8.8       8.9
Gijsenij et al. [33]      4.2       4.8       3.8       4.2
CRF [9]                   2.6       2.6       3.3       4.1
GP (best) [66]            2.20      2.88      3.51      5.68
GI (best) [55]            2.09      2.66      3.32      3.79
Proposed                  2.97      3.61      3.65      3.93

Table 10: Quantitative comparison on the MIMO dataset. Values in bold indicate the best scores.

From the results, it can be observed that the proposed method obtains the second-best mean angular error on the real-world category. This is in accordance with the expectation that training the proposed architecture on a set of diverse images would allow it to generalize better to other real-world images. However, it gives a comparatively worse performance on the lab category. This can be attributed to the fact that the images in the lab category are captured under relatively darker settings involving sharper changes in illumination, whereas the training dataset of the proposed method mostly consists of images captured under relatively brighter conditions and contains very few images with sharp illumination changes. This could be due to the following two factors:

• Inherently, the images in the INTEL TAU dataset involve smooth illumination transitions, since a majority of them have been captured outdoors under relatively brighter conditions.


• Moreover, the synthetic multi-illuminant generation pipeline used (as described under section 4.1) produces smooth illumination tint maps, which fail to account for cases of sharp changes in illumination.

5.2.2 Qualitative Analysis

In this subsection, the qualitative analysis is performed only on images of the real-world category. The main intention is to determine whether the proposed architecture is able to overcome the drawback where pixels from inherently coloured surfaces are mistaken as grey pixels (as encountered in the GI function). Figure 14 shows 3 images on which the proposed architecture obtains a lower angular error than GI. Figure 15 shows 3 images on which the proposed architecture obtains a higher angular error than GI.

Angular errors — GI: Darktools (2.9145), Extinguisher (2.1089), Screens (1.2606); Proposed: Darktools (1.8953), Extinguisher (1.2803), Screens (1.2283)

Figure 14: Qualitative analysis on 3 images of the real-world category where the proposed architecture performs better than GI. The top row shows the top 10% selection results of GI while the bottom row shows the top 10% selection results of the proposed architecture.


Angular errors — GI: All Grey (1.6723), Cameras (6.1286), Detergents (2.9785); Proposed: All Grey (1.8798), Cameras (9.3225), Detergents (7.5908)

Figure 15: Qualitative analysis on 3 images of the real-world category where the proposed architecture performs worse than GI. The top row shows the top 10% selection results of GI while the bottom row shows the top 10% selection results of the proposed architecture.

From figure 14, it can be observed that the proposed architecture overcomes the drawback of GI, as it does not select pixels from inherently coloured surfaces. The refined selections are also accompanied by a reduction of the angular error. However, in figure 15, the proposed architecture obtains a higher error than GI even though it selects grey pixels and avoids coloured surfaces to a greater extent. This can be attributed to the fact that the model selects grey pixels from spatially clustered regions rather than from regions spread out across the scene. As a result, the model might be selecting grey pixels that are under the influence of only one of the illuminants. Thus, during the estimation process, only one of the illuminants gets reliably estimated, leaving little scope for reliably estimating the other.

5.3 2nd International Illumination Estimation Challenge

As an extension to the CubePlus dataset, the recently hosted challenge [25] introduces a large and diverse dataset in order to promote the development of novel algorithms for the single and multi-illuminant setting. The dataset contains around 5000 images captured on cameras with the same sensor (Canon 600D and Canon 550D), where each image is labelled with metadata. The metadata contains estimates of the two light sources in the scene (ground-truth) and additional information on the scene content.

The challenge consists of 3 tracks namely General, Indoor, and Two-Illuminant. Images in the Indoor track form a subset of the General track. Although all the images in the challenge dataset have been captured under two dominant light
