
OADS: a new data set of high resolution RAW images

Luna Blumberg
August 19th, 2020

Student number: 11587377
Supervisor: Steven Scholte


Abstract - Deep neural networks can match human performance on tasks like object recognition. However, these networks are very susceptible to sources of noise. Recently, researchers have tried to achieve greater noise robustness by adjusting the training of the network, including filters, or adding various pre-processing steps. One possible pre-processing step is a retina inspired filter. However, images filtered by a retinal ganglion model might be significantly distorted. Modern cameras can capture pictures of higher resolution, which contain more information than the images currently used in computer vision. Here, it is investigated whether high resolution images filtered by a retinal ganglion model yield better images than low resolution images filtered by the same model. To test this, a new image data set, OADS, was created containing high resolution RAW images. It is shown that details are better preserved in the high resolution images than in the low resolution images. A deep neural network enhanced with a retina inspired pre-processing step would therefore be expected to perform better when trained on the high resolution images than when trained on the low resolution images. This new data set can be used to try to enhance the robustness of a DNN that contains a retina inspired pre-processing step against sources of noise.

1 Introduction

Deep neural networks (DNNs) have long been used to model the behaviour of the brain (14). A main focus for researchers working with DNNs has been object recognition and, more recently, object detection. As (18) points out, object detection is an important aspect of machine learning that not only classifies the objects in an image, but also determines the location of a given object within that image. Object detection has many applications, such as face recognition (19) and autonomous driving (2).

Recent advances in object recognition have permitted DNNs to match human classification performance on image sets with hundreds of object categories (12). Nevertheless, these networks have proven to be very susceptible to sources of noise. When images are distorted or degraded, DNNs cannot compete with humans on an image recognition task (5; 4). Even some types of noise that are generally imperceptible to humans can be detrimental to the performance of a DNN. These images, called adversarial samples, are generated by intentionally adding small but worst-case noise to the images such that the classification prediction is incorrect (8). However, adversarial samples are created to intentionally cause the DNN to make mistakes, and they are not likely to be encountered by a network in practical scenarios.

On the other hand, images can contain varying types of distortions caused by artifacts from image acquisition or storage. During acquisition, the camera sensor can exhibit noise in varying light conditions, or the image can be blurred if the camera is moving. In storage, packet loss can cause missing regions of the image or missing frequencies. Distortions like additive noise or blur can severely impact the performance of DNNs (3). Recent research has tried to make DNNs more robust against noise. (6) showed that training networks on noisy images improved the robustness of these networks to that type of noise. However, DNNs that are trained on one type of noise still fail to generalize to different types of noise. Others apply various types of filters or pre-processing steps to attempt to improve the noise robustness of DNNs, with different degrees of success (1; 9; 17).

One pre-processing step that could be used is to apply a retinal ganglion filter to the images before training. Retinal processing in the biological visual system is fundamental for further cortical interpretation. It shows features like band-pass filtering and gain control that play an important role in the processing of noise in visual systems (16). Adding a retina inspired pre-processing step has been shown to improve the robustness of DNNs in face recognition tasks, even under the most challenging illumination conditions (11). However, a problem when using a retina inspired pre-processing step for object detection is that filtered images might be significantly distorted, and too many details might be lost for a DNN to train on these images.

Recent developments in technology have made it possible to capture and manipulate uncompressed images in RAW format. A RAW image file contains the full resolution data directly from each of the camera's image sensor pixels (Sony). Additionally, modern cameras can capture images containing up to 400 million pixels. However, commonly used data sets in computer vision, like ImageNet (12), contain images of 256 by 256 pixels. Images of higher resolution contain more information. It is therefore interesting to investigate whether a high resolution image filtered by a retinal ganglion model yields a better image than an image of low resolution.

It is expected that high resolution images will yield better images when filtered by a retinal ganglion model than low resolution images. To test this, a novel image data set will be created containing high resolution RAW images. Images from ImageNet and images from the new data set will be filtered by a retinal ganglion model, and the model's responses on the two sets will be compared. The prediction is that details are better preserved in the high resolution images from the new data set than in the low resolution images from ImageNet.

2 Materials and methods

2.1 Creating the image set

To create a new image data set, pictures had to be taken of different object categories. To choose the categories, the Common Objects In Context (COCO) data set was considered (10). This data set is commonly used for object detection tasks, and it is therefore useful to create a data set with similar categories. Initially, eight of the 91 object categories from the COCO data set were chosen, all of them prevalent on the street: bicycle, car, boat, truck, motorcycle, bench, traffic sign, and traffic light.

However, the category "boat" was the only category photographed in a scene with water. (15) shows that contextual information affects the efficiency of the search for and recognition of objects. Thus, the category "boat" was divided into three different categories: cargo boat, house boat, and pleasure boat. These three "boat" categories were later dropped, because it was not possible to photograph an adequate number of boats, especially cargo boats and house boats. For the same reason, the category "motorcycle" was changed to "scooter". Because it is useful to create a data set with enough categories, the categories "bin", "tree" and "lamppost" were added. Additionally, the category "car" was separated into the categories "compact car" and "SUV", and the category "bicycle" was divided into two common types of bicycles, namely "carrier bike" and "city bike". Later, it was found that one or more vans appeared in many images, so the category "van" was added. The categories that were eventually used to create the data set are: bench, bin, carrier bike, city bike, compact car, lamppost, scooter, SUV, traffic light, traffic sign, tree, truck, and van.

To take the pictures, instructions were written to ensure that, if multiple people took pictures for the data set, it would not be possible to distinguish who took which picture. The full manual can be found in appendix 1. The most important instructions are as follows:

1. Objects should be positioned naturally in the scene that is being photographed. The whole scene should be photographed instead of just a centered object. Thus, the image set contains images of natural street scenes. Researchers across many fields use natural scenes in their studies (7).

2. The zoom function on the camera should not be used, as this could decrease the quality of the image.

3. The angle and position of objects should vary between pictures. This way, a more diverse data set is created, which provides better training data for a DNN.

4. Pictures should only be taken when it is sunny outside. This yields pictures that are overexposed or that contain a lot of contrast between parts of the scene in shadow and parts in direct sunlight. Such images are interesting to study, especially because retina filters have been shown to improve illumination robustness in face recognition (11).

Using these instructions, images were taken in Amsterdam.

2.2 OADS

A new image data set, the Open Amsterdam Data Set or OADS, was created using a Sony Cyber-shot DSC-RX100 digital camera. The camera takes pictures in RAW format (Sony ARW 2.3 format). The size of the images is 5496 by 3672 pixels. The photographs were taken in the city of Amsterdam according to a photography manual (see section 2.1 and appendix 1). OADS contains 1017 annotated images of 13 object categories in natural street scenes. Figure 1 shows some example images from the data set. Table 1 shows the categories and the number of objects in the images.

Figure 1: Three example images of OADS

To label the objects in the images, the online tool Supervise.ly was used. To upload the images to the labeling tool, the images were converted to JPEG format using the Python package rawpy (version 0.14.0), and the images were downscaled by a factor of 4 to accelerate the uploading process.
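The exact conversion script is not included here; the snippet below is a minimal sketch of the described pipeline (RAW to RGB, downscale by a factor of 4, save as JPEG) using rawpy and Pillow. The file names and the JPEG quality setting are illustrative assumptions.

    import rawpy
    from PIL import Image

    def arw_to_jpeg(raw_path, jpeg_path, factor=4):
        # Demosaic the Sony .ARW file into an 8-bit RGB array,
        # using the white balance recorded by the camera.
        with rawpy.imread(raw_path) as raw:
            rgb = raw.postprocess(use_camera_wb=True)
        img = Image.fromarray(rgb)
        # Downscale by the given factor (4x in section 2.2) to
        # speed up uploading to the labeling tool.
        img = img.resize((img.width // factor, img.height // factor),
                         Image.LANCZOS)
        img.save(jpeg_path, format="JPEG", quality=90)  # quality is an assumption

    arw_to_jpeg("DSC00001.ARW", "DSC00001.jpg")  # hypothetical file names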


Category          Number of objects
Bench             131
Bin               179
Carrier bike      98
City bike         405
Compact car       657
Lamppost          592
Scooter           202
SUV               206
Traffic light     123
Traffic sign      536
Tree              631
Truck             51
Van               202

TOTAL number of images: 1017

Table 1: All categories and the number of objects in OADS

Because natural scenes were photographed, some images contained a number of objects that were difficult to differentiate from each other, for example when a group of bikes was parked in a bike rack. Such a group clearly contained bikes, but it was impossible to draw bounding boxes around each bike individually. Since it could confuse a DNN if there were no bounding boxes around an object it was trained to recognize, a separate category (MASK) was created to refer to these groups of objects. A total of 358 masks were used in OADS. These masks can be used to let a DNN ignore certain parts of an image during training, as sketched below.
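How the MASK annotations are consumed during training is not specified here; one common approach, sketched below under that assumption, is to zero the per-pixel loss inside masked regions so that the network receives no gradient from them.

    import numpy as np

    def masked_mean_loss(loss_map, ignore_mask):
        # loss_map: per-pixel training loss, shape (H, W)
        # ignore_mask: boolean array, True inside MASK regions
        weights = (~ignore_mask).astype(loss_map.dtype)
        total = (loss_map * weights).sum()
        # Average only over the pixels that are not masked out.
        return total / np.maximum(weights.sum(), 1.0)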

2.3 Retinal ganglion filter

To test the retinal ganglion response, images containing cars were taken from OADS and ImageNet, since both data sets contain a car category. The sides of the images from OADS are cropped so that the resulting images are square, with a size of 3672 by 3672 pixels, making them easier to compare to the square images from ImageNet. The images are then filtered by the retinal ganglion model, and the resulting responses are compared side by side.
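The implementation of the retinal ganglion model is not detailed here. Classical first-order models describe retinal ganglion receptive fields as a center-surround difference of Gaussians, which produces the band-pass behaviour noted in (16); the sketch below uses that stand-in, with illustrative sigma values.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def center_crop_square(img):
        # Crop the longer dimension symmetrically, e.g. the
        # 5496 x 3672 OADS images become 3672 x 3672 (section 2.3).
        h, w = img.shape[:2]
        s = min(h, w)
        top, left = (h - s) // 2, (w - s) // 2
        return img[top:top + s, left:left + s]

    def dog_response(gray, sigma_center=1.0, sigma_surround=5.0):
        # Difference of Gaussians: a classical center-surround
        # approximation of retinal ganglion cell responses.
        gray = gray.astype(np.float64)
        return (gaussian_filter(gray, sigma_center)
                - gaussian_filter(gray, sigma_surround))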

3 Results

The size of images from OADS was compared to the size of images from ImageNet. Even when cropped, images from OADS contain more than 205 times as many pixels as images from ImageNet. It is important to note, however, that objects in ImageNet are centered within the image, while objects in OADS can be placed anywhere within the image. This means that, even though the difference is still very large, the number of pixels that constitute a car in an image from OADS is less than 205 times the number of pixels that constitute a car in an image from ImageNet. Figure 2a compares the size of an image from OADS with the size of an image from ImageNet.
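The factor follows directly from the image dimensions: (3672 × 3672) / (256 × 256) = 13,483,584 / 65,536 ≈ 205.7.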

Figure 2b shows the modelled retinal ganglion response of an image containing a car from OADS and the response of an image of a car from ImageNet. The car in the OADS image is more recognizable than the car in the ImageNet image. More details are preserved in the modelled retinal ganglion response of the image from OADS than in the response of the image from ImageNet.


Figure 2: Comparison of an image from OADS and an image from ImageNet. A) An image containing a car from OADS (left) and an image from ImageNet containing a car, shown at scale relative to the OADS image (right). The image from ImageNet is 256 by 256 pixels, while the image from OADS is 3672 by 3672 pixels. B) The modelled retinal ganglion response of the image from OADS (left) and of the image from ImageNet (right). More details are preserved in the image from OADS than in the image from ImageNet.

4 Discussion

The results show that details are better preserved in the high resolution images from OADS than in the low resolution images from ImageNet when filtered by the retinal ganglion model. Thus, high resolution images filtered by a retinal ganglion model yield better images than images of low resolution. A DNN with a retina inspired pre-processing step would therefore be expected to perform better when trained on images from OADS than when trained on images from ImageNet.

The novel data set presented here, OADS, can be used to try to enhance the robustness of a DNN that contains a retina inspired pre-processing step against sources of noise. To properly train a DNN, however, the image set should be expanded to contain more images.

References

[1] Borkar, T. S. and Karam, L. J. (2019). Deepcorrect: Correcting dnn models against image distortions. IEEE Transactions on Image Processing, 28(12):6022–6034.

[2] Chen, C., Seff, A., Kornhauser, A. L., and Xiao, J. (2015). Deepdriving: Learning affordance for direct perception in autonomous driving. CoRR, abs/1505.00256.

[3] Dodge, S. F. and Karam, L. J. (2016). Understanding how image quality affects deep neural networks. CoRR, abs/1604.04004.

[4] Dodge, S. F. and Karam, L. J. (2017). Can the early human visual system compete with deep neural networks? CoRR, abs/1710.04744.

[5] Geirhos, R., Janssen, D. H. J., Schütt, H. H., Rauber, J., Bethge, M., and Wichmann, F. A. (2017). Comparing deep neural networks against humans: object recognition when the signal gets weaker. CoRR, abs/1706.06969.

[6] Geirhos, R., Temme, C. R. M., Rauber, J., Schütt, H. H., Bethge, M., and Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. CoRR, abs/1808.08750.

[7] Geisler, W. S. (2008). Visual perception and the statistical properties of natural scenes. Annual Review of Psychology, 59(1):167–192.

[8] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.

[9] Hossain, M. T., Teng, S. W., Zhang, D., Lim, S., and Lu, G. (2019). Distortion robust image classification using deep convolutional neural network with discrete cosine transform. In 2019 IEEE International Conference on Image Processing (ICIP), pages 659–663.

[10] Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: common objects in context. CoRR, abs/1405.0312.

[11] Vu, N.-S. and Caplier, A. (2009). Illumination-robust face recognition using retina modeling. In 2009 16th IEEE International Conference on Image Processing (ICIP), pages 3289–3292.

[12] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. (2014). Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575.

[Sony] Sony. About RAW data. https://support.d-imaging.sony.co.jp/www/disoft/int/idc/intro/raw.html. [Online; accessed 06-August-2020].

[14] Storrs, K. R. and Kriegeskorte, N. (2019). Deep learning for cognitive neuroscience.

[15] Torralba, A., Oliva, A., Castelhano, M., and Henderson, J. (2006). The role of global features in object search. Psychological Review, 113.

[16] Wohrer, A. (2008). The vertebrate retina: a functional review. Research Report RR-6532, INRIA.


[17] Yin, B., Schaafsma, S., Corporaal, H., Scholte, H. S., and Bohte, S. M. (2019). Local-norm: Robust image classification through dynamically regularized normalization. CoRR, abs/1902.06550.

[18] Zhao, Z., Zheng, P., Xu, S., and Wu, X. (2018). Object detection with deep learning: A review. CoRR, abs/1807.05511.

[19] Yang, Z. and Nevatia, R. (2016). A multi-scale cascade fully convolutional network face detector. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 633–638.


Appendix 1: Photography manual

Categories

The following object categories should be photographed:

• bench

• bin (no containers)

• tree (stand-alone trees that you can find on the streets and not, for example, a group of trees in the park)

• traffic sign (no stop signs)

• traffic light

• lamppost

• bicycles (like trees, bicycles should stand alone. A group of bicycles may be present in the image, but it is not a target)

  – city bike
  – carrier bike
  – scooter

• cars

  – truck
  – SUV
  – compact car
  – van

Camera settings

To set the image quality and size:

1. Click [menu]
2. Go to camera settings
3. Change [Aspect Ratio] to 1:1
4. Change [Quality] to RAW & JPEG
5. Change [Image Size] to L:13M

To set the camera to Auto mode:


Notes on taking the pictures

• You should not zoom in when taking a picture. You should use the W/T zoom lever to zoom out as far as possible by pushing the lever to the left (W) until the zoom scale on the display of the camera shows 28 mm.

• To make sure the images are focused: Press the shutter button halfway down to focus. When the image is in focus, a beep sounds. After the beep sound, press the shutter button fully down to take the image.

• When taking pictures you should photograph the whole scene instead of just a centered object. The image should focus on a scene and not on an object.

• The object should be positioned naturally within the scene. For example, when you are walking down the street, street signs will not appear in the center of the street, but more to the left or right. Thus, a street sign in the image should be to the left or right and not in the center of the image.

• You should only take pictures when the sun is shining.

• Avoid photographing (fast) moving cars/bicycles/motorcycles, as they will appear blurry.

• Sometimes many bicycles are standing very close together. Avoid photographing the whole group of bicycles, as they may not be clearly recognizable.

• Multiple objects/categories can be in the same picture, but make sure all objects are clearly recognizable.

• The angle and place of objects should vary in the pictures.

Saving the images

The images can be exported directly from the SD card of the camera to your computer. For every image, there are two files with the same name. One file is the image in RAW format (".ARW"), and the other is the JPEG image.
