
Eindhoven University of Technology

MASTER

Smart cropping of image based on saliency detection

Manjunath Shetty, A.

Award date:

2019


Disclaimer

This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.


Smart Cropping of Images based on Saliency Detection

Aishwarya Manjunath Shetty

Department of Mathematics and Computer Science
Security and Embedded Networked Systems Research Group

Supervisors:

Selcuk Sandikci (Naspers AI)
dr. Dmitri Jarnikov (TU/e)

Eindhoven, August 2019


Abstract

The digital world has seen immense development in the field of computer vision due to the use of images on most platforms, especially online marketplaces. Image quality plays an important role on these websites since it enhances the visual appeal of a listing and thus improves the buyer experience. These websites use thumbnail images to accommodate multiple ad-postings on a single webpage. The traditional method of downsizing the image for thumbnail generation reduces the image resolution and makes the object of interest less apparent to buyers. Smart cropping of the image modifies the composition of the image around the object of interest, increasing the visibility of the object and helping the buyer select the most interesting ad-posting.

This thesis focuses on developing a smart cropping application for online marketplaces based on saliency detection in images. A state-of-the-art saliency detection approach is implemented as a baseline and the results are analyzed to identify the image characteristics that lead to failure cases. Different crop quality metrics to evaluate the saliency map are proposed and analyzed. A novel saliency detection approach is designed by re-purposing the Mask R-CNN framework, the state-of-the-art method for instance segmentation. A number of saliency detection approaches are proposed based on different variants of the Mask R-CNN framework. A user study is performed to subjectively measure the crop quality for each approach. Our novel Mask R-CNN based approach significantly outperformed the baseline approach, improving the mean user opinion score by 42.7% and the overall cropping success rate by 30%.


Preface

The thesis titled Smart Cropping of Images based on Saliency Detection has been conducted to fulfill the graduation requirements of the Master's degree in Embedded Systems at the Eindhoven University of Technology. The research has been conducted in the Artificial Intelligence team of Naspers between February and July 2019.

I would like to thank dr. Dmitri Jarnikov from TU/e and Selcuk Sandikci from Naspers AI for their insight, guidance, feedback and ideas throughout the project. Without these people, this thesis would not have been possible.

I would like to also thank Sandor Akszenovics, Alexey Grigorev, Carmine Paolino from OLX group for their continuous support and insights from the product perspective.

I also extend my gratitude to all my colleagues at Naspers AI and fellow master students, for the inspired discussions and support during the thesis.


Contents

Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Challenges
    1.2.1 Research Questions
  1.3 Contribution
  1.4 Outline of the thesis
2 Literature Review
  2.1 Content unaware smart image cropping
    2.1.1 Aesthetics based image cropping
    2.1.2 Attention based image cropping
  2.2 Content aware smart cropping
    2.2.1 Mask R-CNN
3 Content Unaware Smart Cropping - Baseline Implementation
  3.1 Saliency detection
  3.2 Post-processing for image cropping
  3.3 Experimental results
  3.4 Analysis of the results
4 Content Aware Smart Cropping
  4.1 Motivation
  4.2 Smart cropping of images using segmentation masks from Mask R-CNN
    4.2.1 Baseline Implementation
    4.2.2 Late fusion of the outputs of Mask R-CNN and Saliency detection network (MRCNN+SAL)
  4.3 Repurposing Mask R-CNN for saliency detection
    4.3.1 Approach 1 - Training Mask R-CNN with Saliency Dataset (MRCNN SAL)
    4.3.2 Approach 2 - Training Mask Scoring R-CNN with Saliency Dataset (MSRCNN SAL)
    4.3.3 Approach 3 - Using Saliency Detection Network as backbone of Mask Scoring R-CNN (MODIFIED MSRCNN SAL)
  4.4 Results and Evaluation
    4.4.1 Results Summary - Saliency maps
    4.4.2 User study
    4.4.3 Results
5 Conclusions
  5.1 Contribution
  5.2 Future Work
Bibliography
Appendix
A Crop quality metrics
  A.1 Need for crop quality metrics
  A.2 Analysis of Saliency maps
  A.3 Metric Evaluation
    A.3.1 Data for evaluation
    A.3.2 Laplacian variance / Sharpness Score of the image
    A.3.3 Aesthetics score / NIMA
    A.3.4 Ratio of Salient area to total image area
    A.3.5 Average of all salient pixels
  A.4 KL Divergence of image classification network predictions
B COCO-style training Dataset preparation

List of Figures

1.1 Illustration of different image enhancement techniques
1.2 Example of thumbnails generated by compressing the images
1.3 Image cropping based on rule of thirds: The image is cropped such that the salient object is placed on the intersection of lines through the thirds of the image
1.4 Image cropping based on saliency: The image is cropped around the salient object to enhance the visibility of the object
2.1 Example of unsuccessful aesthetics based cropping on an image from online marketplace
2.2 The side outputs of different layers of HED network [21]. The outputs are arranged in an increasing order of layers from left to right
2.3 The DSS network structure for saliency detection [21]
2.4 Examples of unsuccessful cropping using object detection
2.5 Mask R-CNN Framework
2.6 Network Structure of Mask Scoring R-CNN [22]
3.1 Block diagram of smart cropping application
3.2 Example output saliency maps from DSS
3.3 Images containing multiple objects and their corresponding saliency maps
3.4 Original images, corresponding saliency maps and successfully cropped images
3.5 Original images, corresponding saliency maps and unsuccessfully cropped images
3.6 Crop success rate per object category on marketplace websites based on manual evaluation
3.7 Example of unsuccessful crops per characteristic of their original images
4.1 Example results from Mask R-CNN for saliency detection
4.2 Steps involved in merging the outputs of Mask R-CNN and DSS network
4.3 Example results from MRCNN+SAL
4.4 Loss curves for classification and segmentation
4.5 Example results from MRCNN SAL
4.6 Loss curves for classification, segmentation and mask score prediction
4.7 Example results from MSRCNN SAL
4.8 Feature Pyramid network hierarchy [30]
4.9 DSS and FPN network structure
4.10 Modified FPN structure
4.11 Loss curves for classification, segmentation and mask score prediction
4.12 Example results from MODIFIED MSRCNN SAL
4.13 Summary of the resulting saliency maps
4.14 Screenshot from Amazon SageMaker Ground Truth to evaluate the crop bounding box
4.15 Mean user opinion score per approach
4.16 Correlation of Mask Scores from MSRCNN SAL with mean user opinion score
4.17 Selection of threshold for the metric
4.18 Confusion matrix for the metric - MaskIoU score from MSRCNN SAL
4.19 Correlation of Mask Scores from MODIFIED MSRCNN SAL with mean user opinion score
4.20 Selection of threshold for the metric
4.21 Confusion matrix for the metric - MaskIoU score from MODIFIED MSRCNN SAL
A.1 Images and corresponding saliency maps for positive and negative cropping results
A.2 Success rate of cropping per sub-category of the Fashion category from OLX
A.3 Laplacian kernel
A.4 Density plot of Laplacian variance values for successful and unsuccessful crops
A.5 Precision-recall curve, AP is the average precision
A.6 Confusion matrix of metric Laplacian variance with threshold value of 100
A.7 Density plot of NIMA values for successful and unsuccessful crops
A.8 Precision-recall curve, AP is the average precision
A.9 Confusion matrix of metric NIMA score with threshold value of 0.6
A.10 Density plot of ratios of salient area to total image area for successful and unsuccessful crops
A.11 Precision-recall curve, AP is the average precision
A.12 Confusion matrix of metric ratio of salient area with threshold value of 0.3
A.13 Examples of saliency maps that will be rejected based on the metric, salient area ratio
A.14 Density plot of average of salient pixel values for successful and unsuccessful crops
A.15 Precision-recall curve, AP is the average precision
A.16 Confusion matrix of metric average salient pixel with threshold value of 210
A.17 Examples of saliency maps of negative results that the metric average salient pixel value will not avoid
A.18 Steps to calculate the metric - KL Divergence of image classification network predictions
A.19 Density plot of KL Divergence values for successful and unsuccessful crops
A.20 Precision-recall curve, AP is the average precision
A.21 Confusion matrix of KL divergence metric with threshold value of 0.05
B.1 Example of object segmentation using COCO annotations
B.2 Example of salient object segmentation using annotations created for saliency dataset


List of Tables

3.1 Number of images per characteristic of input images of negative results
4.1 User study results
A.1 Percentage of successful and unsuccessful crops


Chapter 1

Introduction

Currently, digital images are vital for many businesses, from social networking sites to online marketplaces. Image quality is one of the most important aspects for these businesses as it is highly correlated with the visual appeal of the images. Image quality depends on various factors such as the dynamic range and resolution supported by the device capturing the image, the bandwidth of the transfer (in the case of online acquisition of the image), the lighting conditions and the perspective of the person capturing the image. Various image quality enhancement techniques are used to increase the visual appeal of the image in terms of resolution [47], color [48], exposure [49] or semantics [1].

Figure 1.1 shows some examples of different image enhancement techniques. The low resolution of the camera and the low bandwidth of online transfer often result in lower image resolution. The technique used to upscale the image resolution is referred to as super resolution [27] and is illustrated in Figure 1.1a. Deep learning can also be used to adjust the image contrast and brightness automatically to a visually appealing level [34], as shown in Figure 1.1b. A salient region in an image is the region that stands out most visually. Image cropping and background removal are the image enhancement techniques in which the region of saliency needs to be identified. These techniques are used to further bias the focus of the image towards the salient region [42, 18]. Figure 1.1d shows an example of smart cropping of an image and Figure 1.1c illustrates background removal.

Online marketplaces and e-commerce websites use thumbnail images to accommodate multiple ad-postings on a single page, and the images are downsized to generate thumbnails, which results in reduced image resolution. Figure 1.2 shows examples of thumbnails from an online marketplace generated by downsizing the image. In this case, the object to be sold may not be visible to the buyer and the ad-posting may not be clicked on. This reduces the seller experience on these platforms, and to improve it, smart cropping of images is essential for generating thumbnails on these websites. Thus, the focus of this thesis is the image enhancement technique of smart cropping.


Figure 1.1: Illustration of different image enhancement techniques. (a) Image super resolution; (b) Color adjustment using deep learning; (c) Background removal on boxbrownie.com; (d) Smart cropping by Twitter.

Figure 1.2: Example of thumbnails generated by compressing the images


1.1 Background

Image cropping is one of the image quality enhancement techniques used to adjust the photo composition by removing unnecessary or distracting regions of the image. Photo re-composition improves the aesthetics of the image and helps focus attention on the most important regions of the image, the regions of saliency. This makes photo re-composition especially important for generating thumbnail or preview images. Traditionally, thumbnails are generated either by reducing the size of the image, which compromises the clarity of the image, or by cropping the image in a predefined way (e.g. a center crop), which may remove the most prominent region of the image. Thus, smart image cropping, retaining the most important parts of the image, plays a major role in image quality enhancement.

Automatic cropping of images is commonly implemented on social networking websites, and the most common aim of the cropping is to improve the aesthetic quality of the images or to give more visibility to the most essential regions of the image. Smart cropping of images can be broadly classified as content aware and content unaware cropping, i.e. cropping based on an information perspective and a visual perspective respectively. Content aware cropping is usually based on detecting objects in the image and cropping the image along the bounding box of the object or objects. Content unaware cropping is further classified into aesthetics based cropping and attention based cropping. Several rules have been proposed to evaluate the aesthetic quality of images; the rule of thirds is one of the most prominent [2]. The rule states that, if vertical and horizontal lines are placed through the thirds of an image, the salient object should be placed on the lines or on their intersections, and such images are said to be aesthetically pleasing, as illustrated in Figure 1.3.

Figure 1.3: Image cropping based on rule of thirds: The image is cropped such that the salient object is placed on the intersection of lines through the thirds of the image. (a) Original Image; (b) Cropped Image.
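As a small illustration of the rule of thirds described above, the sketch below computes a crop window that places a given salient point on the nearest thirds intersection of the crop. The salient point coordinates, the crop size and the function name are hypothetical inputs for illustration, not part of the thesis pipeline.

import numpy as np

def rule_of_thirds_crop(image, salient_xy, crop_w, crop_h):
    """Crop `image` so that the salient point lands near a thirds
    intersection of the crop window (illustrative sketch)."""
    img_h, img_w = image.shape[:2]
    sx, sy = salient_xy
    best, best_dist = None, np.inf
    # The four intersections of the lines through one-third and two-thirds.
    for fx in (1 / 3, 2 / 3):
        for fy in (1 / 3, 2 / 3):
            # Top-left corner that puts the salient point at (fx, fy) of the crop.
            x0 = int(round(sx - fx * crop_w))
            y0 = int(round(sy - fy * crop_h))
            # Clamp the crop window to the image borders.
            x0 = min(max(x0, 0), img_w - crop_w)
            y0 = min(max(y0, 0), img_h - crop_h)
            # Distance between the desired and achieved intersection point.
            dist = abs(sx - (x0 + fx * crop_w)) + abs(sy - (y0 + fy * crop_h))
            if dist < best_dist:
                best, best_dist = (x0, y0), dist
    x0, y0 = best
    return image[y0:y0 + crop_h, x0:x0 + crop_w]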

Another type of image cropping is based on the most interesting region of the image according to a saliency map, and is called attention-based cropping. A saliency map is an image indicating, for every pixel of the original image, the probability of that pixel belonging to the foreground rather than the background. Attention-based cropping can use either of two approaches: a top-down approach or a bottom-up approach. The bottom-up approach is driven by the visual stimulus and identifies what visually pops out in the image using low-level features such as edge detection or contrast variation, depending on the application. In the top-down approach, the region of cropping is first narrowed down using the high-level semantic information of the image and then the saliency in that region is identified. Figure 1.4 provides an example of attention based cropping.


Figure 1.4: Image cropping based on saliency: The image is cropped around the salient object to enhance the visibility of the object. (a) Original Image; (b) Saliency Map; (c) Cropped Image.

1.2 Challenges

As discussed in Section 1.1, there is a strong need to intelligently crop the thumbnail images on online marketplaces. The main assumption is that the most salient object in an image from a marketplace is the object to be sold. The challenge is to develop a robust saliency detection framework to be used in cropping the image. Image cropping based on saliency detection sometimes results in decreased visual appeal of the image, so to improve the buyer experience the aesthetics of the image also need to be taken into consideration.

Moreover, online marketplaces host images of a huge variety of objects with a high degree of visual variability. This poses a significant challenge for saliency detection.

Negative results from image cropping may lead to buyers not clicking on the listing, thus reducing the seller experience on the platform. The negative results must therefore be avoided as far as possible for the application to be used in production; the exact percentage of unsuccessful crops that must be avoided is domain dependent. The goal is to reduce the negative results as much as possible while not discarding good crops. A robust metric should be designed to reject the negative results of image cropping based on a certain threshold.

The metric should accurately capture the quality and confidence of the cropping operation: failure cases should result in low values, whereas successful cropping should give rise to high values. The metric should be computationally inexpensive and a lightweight addition to the smart cropping application, and the smart cropping application as a whole should be computationally efficient.

1.2.1 Research Questions

The challenge discussed in the section above is translated into three research questions as follows:

• How do current state-of-the-art saliency detection methods perform on images from online marketplaces (i.e. in the wild)?

• Is there a robust metric to capture the quality of the cropping?

• Can existing segmentation frameworks be used for the detection of saliency in an image? Which variant of segmentation frameworks is the most suitable for saliency detection with a reliable metric to evaluate the image crops?


1.3 Contribution

In this thesis, a smart cropping application based on saliency detection is developed. The results are evaluated per category and observations are made to identify the possible reasons for failure cases. Since the saliency detection network does not provide a confidence score for the saliency map, we propose a number of hypotheses based on these observations and investigate metrics based on image quality, the area of the salient region, salient pixel values and the coherency of the predictions of an image classification network before and after cropping. We conclude that these metrics do not capture the cropping confidence sufficiently. In order to design a robust confidence metric, we follow a machine-learning based approach in which we train a neural network to predict salient regions and the confidence of these regions simultaneously.

A user study is conducted to evaluate the results; based on this study, the proposed methodology improves the overall cropping success rate by 30% compared to the state-of-the-art methodology.

1.4 Outline of the thesis

The outline of this thesis report is as follows. Chapter 1 gives the background of the problem and defines the research questions. In Chapter 2 we study existing work related to our research questions. Chapter 3 explains the baseline implementation of saliency detection and the post-processing used to crop the image, and the results are presented and analyzed. Chapter 4 discusses the main contribution of the work, where different approaches to re-purpose Mask R-CNN for saliency detection are proposed and the results are presented. Chapter 5 concludes this work and proposes a few possible directions for future work to extend it to other applications.


Chapter 2

Literature Review

This chapter provides a detailed literature survey on smart cropping of images. Section 2.1 elaborates on the different techniques for content unaware cropping. The baseline saliency detection network used in this thesis, deeply supervised salient object detection with short connections, is discussed in Section 2.1.2. Content aware cropping is discussed in Section 2.2, with a detailed explanation of Mask R-CNN and Mask Scoring R-CNN, frameworks for instance segmentation which are further repurposed for saliency detection in this thesis.

2.1 Content unaware smart image cropping

Content unaware smart cropping can be further classified into aesthetics based smart cropping, where the image is cropped based on photo-composition rules in order to enhance its visual appeal, and attention based smart cropping, where the image is cropped around the regions of saliency.

2.1.1 Aesthetics based image cropping

Depending on the application the image is intended for, the cropping methodology varies between attention based cropping and aesthetics based cropping. Aesthetics based cropping focuses on mimicking the human interpretation of visually pleasing images. Many different approaches have been proposed to assess the aesthetic quality of images. The paper [12] surveys different approaches to the aesthetics assessment of an image; these approaches further act as a part of image cropping by evaluating the aesthetic quality of different anchors of the image. Early work on image aesthetic quality assessment involved manual evaluation of the image based on low level features such as lighting, contrast and background blurring, and high level features such as edge distribution and colour distribution throughout the image [25]. The paper [13] proposes several high level features to assess the aesthetics of an image, broadly classified as compositional attributes, which relate to the layout, content attributes of the objects within the image, and illumination of the image.

Some of the recent methods use deep learning to assess pictorial aesthetics and achieve promising results. The work [31] uses a supervised approach with a deep convolutional neural network (CNN) for image aesthetic quality assessment, using a double-column CNN to take both the global and local view of the image into consideration. The image composition in such methods is compromised because the CNN requires fixed-size inputs. The paper [32] describes an approach to overcome this problem by adding an adaptive spatial pooling layer to the CNN to fit inputs of any dimension, thus preserving the composition of the image.

Aesthetics aware photo cropping selects among various cropping candidates based on the aesthetic quality of the crop. The aesthetic quality of the crop is usually evaluated with various photo-composition rules (for example, the rule of thirds). The paper [16] proposes a supervised cropping methodology in which a CNN outputs two predictions to move the crop bounding box so as to increase the aesthetic quality of the crop; the two predictions of the network move either the top-left or the bottom-right corner of the bounding box by a fixed length. The paper [28] proposes a sequential decision process in which an agent decides on a series of actions to optimize a target. The process starts with the whole image and, in every iteration, the network predicts an action to change the cropping window. Every action is rewarded based on the aesthetic quality of the resulting crop, with the goal of optimizing the reward. The process stops when the reward is lower than the reward for the preceding crop. This method achieves significantly promising results compared to the state-of-the-art.

A significant amount of work on aesthetics based image cropping has been done to assist professional photographers. However, images on marketplace websites are not captured professionally and the object to be sold is usually placed at the center of the image. Applying photo composition rules may sometimes crop away part of the salient object, as shown in Figure 2.1a, resulting in loss of information, or may not crop the image sufficiently to increase the object's visibility, as shown in Figure 2.1b. Thus, attention based cropping is used on these websites.

Figure 2.1: Example of unsuccessful aesthetics based cropping on images from an online marketplace. (a) Example of loss of information; (b) Example of an image not cropped based on the rule of thirds.


2.1.2 Attention based image cropping

Attention based image cropping involves cropping the image around the most visually popping-out region of the image, also called the salient region. Saliency detection is a computer vision technique that is mainly used in attention based image cropping, background removal and object detection. Early work on automatic cropping relies mostly on sliding-window based saliency detection and cropping around the salient region. The paper [39] proposes a simple image-processing based method in which small segments of the image are compared against each other, and the segment that differs from the others is considered foreground and the rest background.

The paper [40] evaluates thumbnail generation methods using saliency detection and face detection based on a user study. It evaluates several saliency detection methods such as [24], [11], [26] and [10]. The user study was conducted with about 20 participants, using 3 image datasets (the Animal and Face sets prepared by the authors and the Corbis set [6]) and 3 types of thumbnail generation: plain shrinking without cropping, saliency based cropping and face detection based cropping. The user study showed that the latter two methods produced significantly better thumbnails than the former method.

The paper [45] proposes a method of image cropping based on the salient region where the network learns the change in the image after every iterative crop based on exclusion features, which refer to the regions of the image that are cropped out. The network also considers low level features such as the color, texture and shape of the objects in the image to separate foreground and background. The network is trained with 1000 pairs of original images and ground truth crops.

An automatic thumbnail generation technique is proposed in the paper [15], using feed-forward networks. The technique is supervised: the network takes the dimensions of the target thumbnail and crops the input based on different filters learnt for different aspect ratios. The method does not use the regions of saliency as dictated by a saliency map. The paper [29] proposes a saliency detection methodology using a fully convolutional neural network which predicts the regions of saliency, with segment-wise spatial pooling to find segment-wise features from the image and remove discontinuities from the saliency map at the boundary regions.

Most weakly supervised cropping methods depend on a sliding-window mechanism since they are not supervised with bounding boxes. The paper [51] proposes a technique to transfer image-level semantics to region-level semantics in order to overcome the problems that the semantics of the image are not standardized (they depend on the dataset or its designers) and that the aesthetics are not preserved in the cropped photo.

Some recent works combine attention based and aesthetics based photo cropping. The paper [17] proposes a novel cropping methodology which combines visual composition, boundary simplicity and content preservation models. The visual composition model takes care of the image composition after cropping and is trained using both positive and negative samples, where the negative samples are obtained by randomly cropping the positive samples. The boundary simplicity model takes care of a smooth boundary of the image and thus avoids cutting through any object; this is done by analyzing the image gradient, where low gradients are considered to indicate a good boundary. The content preservation model employs saliency detection to confirm the presence of the essential content in the cropped image. A saliency score is calculated, which is the ratio of the salient energy in the crop to that in the input image, and this score must be high.

The paper [43] proposes a model which is a cascade of attention based cropping and aesthetics driven crop window selection. A set of cropping candidates is generated using the attention-aware cropping network and one of them is selected based on the aesthetics score provided by the aesthetics assessment network. The whole network achieves high computational efficiency by selecting the cropping candidates only around the attention box, sharing several convolutional layers between the attention detection and aesthetics scoring models, and extracting the features before the generation of cropping candidates.

The saliency detection methods described above are limited with respect to the objects that can be detected as salient, depending on the training data. The wide variety of object categories in online marketplaces therefore poses a significant challenge for saliency detection. In this thesis, the baseline saliency detection on images from marketplace websites is implemented using the state-of-the-art saliency detection approach by Hou et al. [21], explained in Section 2.1.2.

Baseline for saliency detection - Deeply supervised salient object detection with short connections (DSS)

Saliency detection is the first step in smart cropping of images from marketplace websites. The work by Hou et al. [21] is a state-of-the-art method and is selected as the baseline implementation of the smart cropping application in this thesis. The baseline is selected based on the accuracy of the network on multiple representative images from a marketplace website.

The paper [21] extends the idea of the Holistically Nested Edge Detector (HED) architecture [44] to obtain the saliency map of the input image. In general, the deeper side outputs of fully convolutional neural networks (FCN) contain high level semantic information about the image, such as the location of salient objects, and the shallower side outputs contain low level information, such as spatial information about the boundary of salient objects. Due to the down-sampling in FCNs, the deeper side outputs lack spatial information and the resulting map is very blobby, as shown in Figure 2.2.

Figure 2.2: The side outputs of different layers of HED network[21]. The outputs are arranged in an increasing order of layers from left to right.

The proposed architecture uses the HED architecture with deeply supervised skip-layers. The multi-level features are combined by introducing connections between the side outputs of the shallower and deeper layers. As a result, the high level features are transferred to the shallower side layers, aiding them with better localization of the salient object, and the saliency map predicted by the deeper side outputs is refined. Figure 2.3 shows the network architecture of the baseline network used for saliency detection.

The outputs from higher layers of the FCN are concatenated with the outputs from lower layers using bilinear interpolation for upsampling and 1x1 convolutions to reduce the number of channels. The network outputs a saliency map with a resolution of 300 x 400 pixels. The network requires pixel level supervision, which means that for every RGB training image there should be a ground truth binary mask.

Figure 2.3: The DSS network structure for saliency detection [21]
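The fusion step described above can be sketched as follows: a deeper side output is upsampled with bilinear interpolation, its channels are reduced with a 1x1 convolution, and it is concatenated with a shallower side output before predicting a side saliency map. The sketch is written in PyTorch although the thesis implementation uses TensorFlow; the layer names and channel counts are illustrative assumptions, not the exact DSS configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SideFusion(nn.Module):
    """Fuse a deep, low-resolution side output into a shallow, high-resolution one."""
    def __init__(self, deep_channels, shallow_channels):
        super().__init__()
        # 1x1 convolution to reduce the channels of the upsampled deep feature map.
        self.reduce = nn.Conv2d(deep_channels, shallow_channels, kernel_size=1)
        # 1x1 convolution turning the fused features into a single-channel saliency map.
        self.predict = nn.Conv2d(2 * shallow_channels, 1, kernel_size=1)

    def forward(self, deep_feat, shallow_feat):
        # Bilinear upsampling of the deep side output to the shallow resolution.
        up = F.interpolate(deep_feat, size=shallow_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        up = self.reduce(up)
        # Concatenate along the channel dimension and predict the side saliency map.
        fused = torch.cat([up, shallow_feat], dim=1)
        return torch.sigmoid(self.predict(fused))

# Example: fuse a 20x20 deep map into a 160x160 shallow map.
fusion = SideFusion(deep_channels=512, shallow_channels=64)
saliency = fusion(torch.randn(1, 512, 20, 20), torch.randn(1, 64, 160, 160))
print(saliency.shape)  # torch.Size([1, 1, 160, 160])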

2.2 Content aware smart cropping

Since marketplace websites contain images of objects to be sold, object detection frameworks like Faster R-CNN [37] or YOLO [36] can be used to crop the images along the bounding box around the detected objects. An image segmentation framework like Mask R-CNN [19] can also be used to crop the image, since it provides a segmentation mask which can also serve as a saliency map when the object classes are only foreground and background.

However, the first drawback of these frameworks is that they detect only the objects they are trained for. Secondly, when there are multiple objects in a single image, the object detection framework may not detect all of them, as shown in Figure 2.4.

Training the framework with generic objects aids in detecting various objects, and merging the bounding boxes of all the instances of the object can help in detecting multiple objects in the image.


Figure 2.4: Examples of unsuccessful cropping using object detection

In this thesis, the framework for instance segmentation by He et al. [19], Mask R-CNN, explained in Section 2.2.1, is used to detect the salient object in the image. The saliency maps generated by Mask R-CNN are evaluated using the Mask Scoring R-CNN approach by Huang et al. [22], which is also explained in Section 2.2.1.

2.2.1 Mask R-CNN

Mask R-CNN [19] is the current state-of-the-art framework for instance segmentation, which achieves simultaneous detection, classification and segmentation of objects in images. Saliency detection can be expressed in terms of instance segmentation with two classes, foreground and background. This thesis investigates the possibility of re-purposing Mask R-CNN for saliency detection. The Mask R-CNN framework has two stages: the first stage extracts features from an input image and proposes the regions likely to contain objects, and the second stage classifies the objects, refines the bounding boxes and predicts segmentation masks.

Figure 2.5: Mask R-CNN Framework


Figure 2.5 shows the different stages of the Mask R-CNN framework. The backbone network of the Mask R-CNN framework is usually a typical convolutional neural network (ResNet), the initial layers of which extract low level features with high resolution, while the later layers detect features with high semantic accuracy. The next stage of the framework is the Feature Pyramid Network (FPN) [30]. This network passes the high-level semantic information contained in the higher layers of the CNN to the lower layers with high resolution to accurately detect small objects.

Region Proposal Networks (RPN) [38] are used to scan the input feature map using a sliding-window method and generate proposals for regions likely to contain an object. The feature maps are shared among the regions of interest, thus reducing redundant computations. The RPN scans all the windows, referred to as anchors, in parallel and generates two outputs for every anchor: the anchor class, indicating whether the region contains a foreground object or not, and the bounding box refinement, a refinement factor to align the anchor more precisely over the object. Non-max suppression is used to remove the anchors with low foreground scores and the top anchors are passed on to the next stage. Based on the size of the region proposal created by the RPN, the feature map of the appropriate scale from the FPN is selected.

The proposed Region of Interest (ROI) is passed on to the ROI classifier and bounding box regression stage [38]. This stage has two outputs, similar to the RPN. The first output is the class of the object in the ROI: the ROI classifier classifies the object into one of multiple object classes and, if the ROI contains no object, classifies it as the background class. The second output is a further refinement of the bounding box to contain the complete object. The classifier stage handles only fixed-size inputs, but the bounding box regressor of the RPN adjusts the bounding box to fit the object, causing the bounding boxes around different ROIs to be of variable sizes. ROI pooling is the technique used to convert variable-size ROIs into fixed-size inputs of H x W (the target output height and width) for the classifier. H and W are hyperparameters of the layer and are independent of any particular ROI. An ROI of size h x w is divided into a grid of H x W cells and the values in every cell are max-pooled to get the corresponding output value. The sub-windows have a size of h/H x w/W and the cell boundaries are forced to align with the boundaries of the input feature map, making the target cells unequal in size. Mask R-CNN introduces a new technique called ROIAlign, in which the cell boundaries are not quantized and bilinear interpolation is used to calculate the feature map values within each cell [19].

Mask R-CNN includes an additional head for instance mask generation. The masks are generated by a fully convolutional network head and are of size 28 x 28. The generated masks are soft masks, represented by floating point numbers, where each pixel in the mask denotes the probability of the pixel belonging to the foreground object, thus holding more detail even though the masks are small. These masks are scaled up to fit the object size in the original image during inference [19].
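A small sketch of the mask pasting step just described: a 28 x 28 soft mask predicted for a detection box is rescaled with bilinear interpolation to the box size and placed into a full-image binary mask. The function name, variable layout and the 0.5 binarization threshold are illustrative assumptions, not the reference implementation.

import numpy as np
import cv2

def paste_soft_mask(soft_mask_28, box, image_shape, threshold=0.5):
    """Scale a 28x28 soft mask up to the detection box and paste it into
    a full-resolution binary mask (illustrative sketch)."""
    h, w = image_shape[:2]
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    # Clamp the box to the image borders.
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, w), min(y1, h)
    box_w, box_h = max(x1 - x0, 1), max(y1 - y0, 1)
    # Bilinear upsampling of the soft mask to the size of the bounding box.
    resized = cv2.resize(soft_mask_28.astype(np.float32), (box_w, box_h),
                         interpolation=cv2.INTER_LINEAR)
    full = np.zeros((h, w), dtype=np.uint8)
    # Threshold the per-pixel foreground probabilities into a binary mask.
    full[y0:y0 + box_h, x0:x0 + box_w] = (resized >= threshold).astype(np.uint8) * 255
    return full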

Mask Scoring R-CNN

The paper on Mask Scoring R-CNN [22] shows that the classification confidence from the classifier stage of Mask R-CNN does not correlate with the quality of the segmentation mask. The paper introduces an additional head to the Mask R-CNN framework called the MaskIoU head, which scores the segmentation mask quality based on the Intersection over Union (IoU) between the ground truth and predicted masks. The network architecture is illustrated in Figure 2.6.

Figure 2.6: Network Structure of Mask Scoring R-CNN [22]

The MaskIoU head predicts a mask quality score whose ideal value is the IoU between the ground truth mask and the predicted mask. The mask score should be positive for the object class and zero for all other classes, which requires the masks to be classified into classes; the output of the classifier stage of Mask R-CNN is directly used for this task. IoU regression is the next step to obtain the score for the predicted mask. The features from the ROIAlign layer and the predicted mask (max-pooled with a kernel size of 2 and a stride of 2) are concatenated and given as input to the MaskIoU head. The MaskIoU head consists of 4 convolutional layers with a kernel size of 3 and 256 filters and 3 fully connected layers, where the outputs of the first two fully connected layers are set to 1024 and the output of the final layer is set to the total number of object classes.

For training the MaskIoU head, the RPN proposals whose IoU with the ground truth box is greater than 0.5 are used as training samples. During inference, the MaskIoU is predicted for the top-k boxes (typically k = 100) from the RPN, multiplied by the classification confidence score and given as the mask quality score.
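Following the layer description above, a PyTorch sketch of a MaskIoU-style head: the ROI features and the predicted mask (max-pooled with kernel size 2, stride 2) are concatenated and passed through four 3x3 convolutions with 256 filters and three fully connected layers. The ROI feature size, the stride-2 final convolution and all names are assumptions of this sketch, not the authors' reference code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskIoUHead(nn.Module):
    """Sketch of a MaskIoU-style head: regresses the IoU between the predicted
    mask and the ground-truth mask, one score per object class."""
    def __init__(self, num_classes, roi_channels=256, roi_size=14):
        super().__init__()
        convs = []
        # The ROI features (256 channels) are concatenated with the pooled mask (1 channel).
        in_ch = roi_channels + 1
        for i in range(4):
            # Four 3x3 convolutions with 256 filters; the last one halves the
            # spatial size (the stride-2 choice is an assumption of this sketch).
            stride = 2 if i == 3 else 1
            convs.append(nn.Conv2d(in_ch, 256, kernel_size=3, stride=stride, padding=1))
            convs.append(nn.ReLU(inplace=True))
            in_ch = 256
        self.convs = nn.Sequential(*convs)
        flat = 256 * (roi_size // 2) * (roi_size // 2)
        self.fc1 = nn.Linear(flat, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.fc3 = nn.Linear(1024, num_classes)

    def forward(self, roi_features, predicted_mask):
        # roi_features: (N, 256, 14, 14); predicted_mask: (N, 1, 28, 28)
        pooled_mask = F.max_pool2d(predicted_mask, kernel_size=2, stride=2)
        x = torch.cat([roi_features, pooled_mask], dim=1)
        x = self.convs(x).flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # one MaskIoU estimate per class

head = MaskIoUHead(num_classes=2)
scores = head(torch.randn(4, 256, 14, 14), torch.randn(4, 1, 28, 28))
print(scores.shape)  # torch.Size([4, 2])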


Chapter 3

Content Unaware Smart Cropping - Baseline Implementation

This chapter elaborates on the implementation of the content unaware smart cropping application based on saliency detection using [21], as mentioned in Chapter 2. The salient regions of the image are identified and the image is cropped along the bounding box around the salient region. Section 3.1 describes the implementation of saliency detection and Section 3.2 describes the post-processing techniques applied to crop the image around the regions of saliency. The results of the implementation are discussed in Section 3.3. The results are analyzed and observations on the possible factors influencing cropping are listed in Section 3.4.

Figure 3.1 shows the various stages involved in the implementation of smart cropping based on saliency detection.

Figure 3.1: Block diagram of smart cropping application

3.1 Saliency detection

The network architecture (DSS) described in Section 2.1.2 is used for the baseline implementation of saliency detection. A pre-trained model, trained on the MSRA-10K [41] and MSRA-B [41] datasets for 24000 iterations, is used for inference. The MSRA-10K dataset consists of 10000 images and the MSRA-B dataset consists of 5000 images of various categories with corresponding ground-truth binary saliency maps. The deep learning framework used in the implementation is TensorFlow [8].


During inference, the test image is resized to 320x320 pixels and fed into the network.

The saliency map generated by the network has a resolution of 320x320 pixels and is then resized to the original input size. The saliency map is a grayscale image whose pixel values vary with the probability of saliency of the corresponding pixel. Figure 3.2 shows the output of the saliency detection network for input images of different categories.

Figure 3.2: Example output saliency maps from DSS
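The inference steps just described can be sketched as follows; `run_dss` stands in for the actual TensorFlow model call and is a hypothetical placeholder, as is the assumption that it returns values in [0, 1].

import cv2
import numpy as np

def detect_saliency(image_bgr, run_dss):
    """Resize the input to 320x320, run the DSS network, and resize the
    resulting saliency map back to the original resolution (sketch)."""
    orig_h, orig_w = image_bgr.shape[:2]
    # The network expects a fixed 320x320 input.
    net_input = cv2.resize(image_bgr, (320, 320), interpolation=cv2.INTER_LINEAR)
    # `run_dss` is assumed to return a 320x320 map with values in [0, 1].
    saliency_320 = run_dss(net_input)
    # Bring the saliency map back to the original image size and scale to 0-255.
    saliency = cv2.resize(saliency_320, (orig_w, orig_h), interpolation=cv2.INTER_LINEAR)
    return (saliency * 255).astype(np.uint8)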

3.2 Post-processing for image cropping

The post-processing of the image is done using the Python OpenCV library [23]. The saliency map, a grayscale image, is converted to a binary image with a threshold value of 50: any pixel value in the saliency map below 50 is set to 0 and any pixel value above it is set to 255. This binary image is then subjected to contour detection. Contours are closed curves of the same color or intensity; in this case, all closed curves of bright pixels, which indicate the regions of saliency, are detected. Noise in the image is removed by eliminating contours with an area below a predefined threshold. Each remaining contour is then bounded by a bounding box along its perimeter.

The post-processing of the image is application dependent. The requirement for this application is to crop the thumbnail image of an advertisement listing on a marketplace website around the object to be sold. Sometimes thumbnails contain multiple objects of interest. Figure 3.3 shows some examples of images with multiple objects of interest and their respective saliency maps. The saliency maps of such images also contain independent contours. In this case, the individual bounding boxes around the contours are merged to form one outer bounding box along which the image is cropped. To enhance the aesthetics of the cropped image, the bounding box is placed with a gap from the perimeter of the contour. The gap is set to 8% of the corresponding contour dimension, based on experimentation to find the optimal gap.

Figure 3.3: Images containing multiple objects and their corresponding saliency maps
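A minimal OpenCV sketch of the post-processing pipeline described in this section: binarize the saliency map at a threshold of 50, find contours, drop small ones, merge the remaining bounding boxes into one outer box, and add an 8% margin before cropping. The minimum-area value and function interface are hypothetical; the thesis applies the 8% gap per contour dimension, whereas this sketch applies it to the merged box.

import cv2
import numpy as np

def crop_around_saliency(image, saliency_map, min_area=500, margin=0.08):
    """Crop `image` around the union of salient contours in `saliency_map` (sketch)."""
    # Binarize the grayscale saliency map: pixels above 50 become 255.
    _, binary = cv2.threshold(saliency_map, 50, 255, cv2.THRESH_BINARY)
    # OpenCV 4.x return signature: (contours, hierarchy).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Discard small contours, which are treated as noise.
    contours = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not contours:
        return image  # nothing salient found, keep the original image
    # Merge the individual bounding boxes into one outer bounding box.
    boxes = [cv2.boundingRect(c) for c in contours]
    x0 = min(x for x, y, w, h in boxes)
    y0 = min(y for x, y, w, h in boxes)
    x1 = max(x + w for x, y, w, h in boxes)
    y1 = max(y + h for x, y, w, h in boxes)
    # Add an 8% margin around the merged box for a more pleasing composition.
    dx, dy = int(margin * (x1 - x0)), int(margin * (y1 - y0))
    img_h, img_w = image.shape[:2]
    x0, y0 = max(x0 - dx, 0), max(y0 - dy, 0)
    x1, y1 = min(x1 + dx, img_w), min(y1 + dy, img_h)
    return image[y0:y1, x0:x1]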


3.3 Experimental results

The images for testing the performance of the application were obtained from the OLX marketplace website. The test images were categorized into 43 parent categories from OLX and 359 sub-categories. The smart cropping application was tested on 3900 thumbnail images of ad listings posted on the website, with at most 20 images per category.

The resulting crops are evaluated manually with respect to their original images. The evaluation is binary, categorizing the crops as successful or unsuccessful. If the object of interest is completely bounded by the bounding box along which the cropping is performed, the crop is considered successful. If any part of the object of interest is cropped off, the cropping is considered unsuccessful, irrespective of the size of the cropped part. Sample results are shown in the figures below: Figure 3.4 shows examples of successful crops and Figure 3.5 shows examples of unsuccessful crops.

Figure 3.4: Original images, corresponding saliency maps and successfully cropped images


Figure 3.5: Original images, corresponding saliency maps and unsuccessfully cropped images


The chart in Figure 3.6 shows the average success rate of cropping per category of images from the OLX website. Categories with a well-defined object of interest, such as animals, fashion accessories and vehicles, have a significantly higher success rate than categories with landscapes, such as land and farm, and the interiors of houses, apartments and garages. In order to be used in production, however, the smart cropping application should have a high success rate for the majority of categories.

Figure 3.6: Crop success rate per object category on marketplace websites based on manual evaluation

3.4 Analysis of the results

The original images of unsuccessful crops are analyzed to identify the possible reasons for the failure of cropping. Table 3.1 shows the number and percentage of unsuccessful crops per image characteristic of the original images that were observed in most cases of negative results.

A description of each characteristic is outlined in the following points.

• Low image quality in terms of image sharpness, exposure and contrast: The original image is either blurry or of low contrast in about 15% of the cases. The low lighting conditions of the image also have an effect on cropping, since 16% of the cropping failures are due to low exposure of the image.

• Completeness of the object in the image: If the object of interest in the original image is cut off or is not completely visible due to the lighting, the results are negative. About 23% of the original images of unsuccessful crops have this characteristic.


Characteristic        Number of unsuccessful crops    Percentage of unsuccessful crops (%)
Low image quality     257                             29
Incomplete object     191                             22
Text                  157                             18
Others                135                             15
Landscape             108                             12
Thin edges            33                              4

Table 3.1: Number of images per characteristic of input images of negative results

• Text: Most images containing text are not cropped successfully. The text in the images is not identified as salient, and about 19% of the original images of unsuccessful crops contain text that is ignored.

• Thin edges: If the original image contains objects with thin edges, the saliency detection network fails to identify such objects. About 4% of the original images of unsuccessful crops contain objects with thin edges.

• Landscape images: The images from the land and farm, and houses and apartments categories are cropped unsuccessfully since there is no object that is salient to the network. However, the images from these categories do not need to be cropped, since the information is contained in the entire image and cropping such images may result in the loss of significant information.

Figure 3.7 shows examples of unsuccessful crops categorized by the characteristics defined above.

In order to use the smart cropping application in production, the negative results must be avoided. A metric indicating the quality of the cropping is essential to avoid negative results by not cropping the image when the metric value is below a certain threshold. Appendix A elaborates on the analysis of input images and saliency maps to obtain a suitable metric for the application.
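The gating behaviour described above can be sketched as a simple guard around the cropping step: if the quality metric falls below a threshold, the original image is kept. The functions `crop_fn` and `crop_quality` and the threshold value are hypothetical placeholders for whichever cropping routine and metric are chosen.

def safe_crop(image, saliency_map, crop_fn, crop_quality, threshold=0.5):
    """Only return the cropped image when the quality metric clears the threshold."""
    cropped = crop_fn(image, saliency_map)
    score = crop_quality(image, saliency_map, cropped)
    # Fall back to the original image when the metric signals a likely bad crop.
    return cropped if score >= threshold else image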


Figure 3.7: Example of unsuccessful crops per characteristic of their original images


Chapter 4

Content Aware Smart Cropping

In this chapter, Mask R-CNN, a framework used for instance segmentation, is re-purposed for saliency detection and the results are evaluated through a user study. Section 4.1 discusses the motivation to use Mask R-CNN for saliency detection. Section 4.2 describes the baseline implementation of smart cropping using the segmentation masks from Mask R-CNN and the late fusion of the outputs of Mask R-CNN and the DSS network. Section 4.3 describes the approaches proposed to tailor Mask R-CNN for saliency detection and presents their results. The evaluation results based on a user study are presented in Section 4.4.

4.1 Motivation

As discussed in Appendix A, the quality of the saliency map directly affects the quality of the crop. Thus, the saliency maps must be improved significantly compared to the results of the baseline implementation presented in Section 3.3. Content aware smart cropping can be used to preserve the semantic information contained in the image and crop the image accordingly. In this direction, Mask R-CNN, the state-of-the-art instance segmentation framework, is re-purposed for saliency detection. The complete Mask R-CNN framework is explained in Section 2.2.1. Instance segmentation is the task of classifying every pixel to its corresponding object class. Saliency detection in an image is the task of assigning every pixel a probability of being foreground. Saliency detection can therefore be defined as instance segmentation with two classes, foreground and background, and probabilities for the two classes. This is the main idea behind using Mask R-CNN for saliency detection.

The Mask R-CNN framework provides the bounding box around each object instance, classifies the object with a prediction probability and also provides the segmentation mask. Thus, using Mask R-CNN for saliency detection provides the object class confidence and the segmentation mask from the same network, which simplifies the image cropping pipeline.

Section 2.2.1 explains Mask Scoring R-CNN; the authors of the paper [22] have shown that the Mask R-CNN prediction probability does not match the segmentation mask quality and therefore introduce a new head to the Mask R-CNN network to calculate a mask quality score. We therefore decided to employ this variant, as it matches our requirement of quantifying saliency map quality very well.


4.2 Smart cropping of images using segmentation masks from Mask R-CNN

4.2.1 Baseline Implementation

In this experiment, the Mask R-CNN network is trained on the COCO dataset, a large-scale object detection, segmentation and captioning dataset [3]. Every prediction from Mask R-CNN is considered an object belonging to the foreground class. The object class confidence threshold is set to 0.3 to obtain a higher number of predictions per image.

Results

Figure 4.1 shows some results from Mask R-CNN used for saliency detection, where all the predicted segmentation masks are treated as foreground.

Figure 4.1: Example results from Mask R-CNN for saliency detection

4.2.2 Late fusion of the outputs of Mask R-CNN and Saliency detection network (MRCNN+SAL)

In this experiment, the Mask R-CNN network is trained on the COCO dataset, a large-scale object detection, segmentation and captioning dataset [3]. The saliency map for every input image is generated by the DSS network. For every object prediction made by Mask R-CNN on the input image (with a probability higher than 0.3, to obtain a higher number of predictions), the Intersection over Union (IoU) of the predicted mask and the saliency map is calculated. The mask with the highest IoU is selected. The outputs from Mask R-CNN and the DSS network are both converted to binary and merged using a bitwise OR to obtain the union of the masks. Figure 4.2 shows the steps involved in merging the outputs of the Mask R-CNN and DSS networks.

The Mask R-CNN network does not output any mask if the prediction probability is below the threshold. In that case, taking the intersection of the segmentation mask and the saliency map would result in a loss of saliency information, which is why the union of the maps is used. The Intersection over Union can be used as a metric to quantify the quality of the resulting saliency map in this case, since it estimates the overlap between the saliency map and the segmentation mask.

Figure 4.2: Steps involved in merging the outputs of Mask R-CNN and DSS network
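A numpy sketch of the late fusion just described: for each Mask R-CNN instance mask, the IoU with the binarized DSS saliency map is computed, the best-overlapping mask is selected, and the two binary maps are merged with a bitwise OR. The saliency binarization threshold mirrors the value used earlier in the text; the function interface is illustrative.

import numpy as np

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union > 0 else 0.0

def late_fusion(instance_masks, dss_saliency, saliency_threshold=50):
    """Merge the best-overlapping Mask R-CNN mask with the DSS saliency map (sketch)."""
    dss_binary = dss_saliency >= saliency_threshold
    if not instance_masks:
        # No Mask R-CNN prediction above the confidence threshold: keep the DSS output.
        return dss_binary, 0.0
    # Select the instance mask with the highest IoU against the saliency map.
    ious = [iou(m, dss_binary) for m in instance_masks]
    best = int(np.argmax(ious))
    # Union of the two maps, so no salient region is lost.
    fused = np.logical_or(instance_masks[best], dss_binary)
    return fused, ious[best]  # the IoU doubles as a crop quality estimate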

Results

Figure 4.3 shows some results from MRCNN+SAL, where the saliency map is obtained by combining the segmentation masks from Mask R-CNN and the saliency map from the DSS network.


Figure 4.3: Example results from MRCNN+SAL

4.3 Repurposing Mask R-CNN for saliency detection

The following approaches are proposed to improve the saliency maps of the images by leveraging the segmentation masks output by Mask R-CNN.

• Training Mask R-CNN with a saliency dataset: As mentioned above, Mask R-CNN trained on the COCO dataset classifies instances into different classes. Using the saliency datasets MSRA-10K and MSRA-B to train Mask R-CNN instead enables the network to learn the background and foreground regions of the image.

• Training Mask Scoring R-CNN with a saliency dataset: This approach allows us to exploit the mask score provided by Mask Scoring R-CNN as a metric to evaluate the saliency maps.

• Using the saliency detection network as the backbone of Mask Scoring R-CNN: The saliency detection network is specifically designed to capture the saliency in the image, both semantically and structurally. Modifying Mask R-CNN along the lines of the saliency detection network can increase the success rate of saliency detection.

4.3.1 Approach 1 - Training Mask R-CNN with Saliency Dataset (MRCNN SAL)

This approach is proposed to obtain the saliency map from a single network, thus reducing the cost in terms of time and computational complexity. The ROI classifier stage, which succeeds the RPN stage of the Mask R-CNN framework as described in Section 2.2.1, classifies the object into different classes and, if the ROI contains no object, classifies it as the background class. Training the Mask R-CNN framework on the saliency datasets MSRA-10K and MSRA-B enables the network to learn the features of any object in the ROI as a foreground object.
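A hedged sketch of the idea behind this approach using torchvision's Mask R-CNN implementation: the box and mask predictors are replaced so the network distinguishes only background and a single "salient object" foreground class. The thesis trains a ResNet-101 based Mask R-CNN on COCO-style MSRA data; the ResNet-50 FPN model below is only a stand-in to illustrate the two-class setup, not the implementation used here.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Two classes: background and a single "salient object" foreground class.
NUM_CLASSES = 2

# Stand-in backbone (ResNet-50 FPN); the thesis uses ResNet-101.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box classification head for the two-class setting.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Replace the mask prediction head accordingly.
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, NUM_CLASSES)

# The model can then be trained on the MSRA images converted to COCO-style
# annotations, so every annotated object is learned as "salient foreground".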


Training

The saliency datasets are used to train the Mask R-CNN network end-to-end. In order to train the Mask R-CNN network, the datasets must be converted to COCO-style datasets; the training dataset preparation is described in Appendix B. The backbone network used is ResNet-101 [20]. The training set consists of 15000 images. The network is trained for 90000 iterations on a single Nvidia GPU. The loss curves for classification and segmentation are shown in Figure 4.4.

Figure 4.4: Loss curves for classification and segmentation. (a) Classification loss; (b) Segmentation loss.

Inference

Figure 4.5 shows some inference results from MRCNN SAL. The saliency maps are clean and thus lead to successful crops.


Figure 4.5: Example results from MRCNN SAL

4.3.2 Approach 2 - Training Mask Scoring R-CNN with Saliency Dataset (MSRCNN SAL)

Mask Scoring R-CNN outputs a mask score along with the segmentation mask; the mask score predicts the mask quality and can thus be used as a metric to evaluate the saliency map.

Training

The Mask Scoring R-CNN is trained with the saliency datasets MSRA-10K and MSRA-B converted to COCO-style as described for the previous approach. The training set consists of 15000 images. The network is trained for 480000 iterations on a single Nvidia GPU. The base learning rate is 0.0025 and the batch size is 2 images. The loss curves are shown in Figure 4.6.

Figure 4.6: Loss curves for classification, segmentation and mask score prediction. (a) Classification loss; (b) Segmentation loss; (c) MaskIoU loss.


Inference

Figure 4.7 shows some results from MSRCNN SAL.

Figure 4.7: Example results from MSRCNN SAL

4.3.3 Approach 3 - Using Saliency Detection Network as backbone of Mask Scoring R-CNN (MODIFIED MSRCNN SAL)

This proposed approach uses the idea from DSS to tailor the Mask Scoring R-CNN framework specifically for saliency detection. The main idea of the DSS network for saliency detection is connecting the feature maps from higher layers to lower layers in order to transfer semantic information to the lower layers, as described in Section 2.1.2. In Mask Scoring R-CNN a similar structure is present in the Feature Pyramid Network (FPN) [30], which is used to detect objects of different scales. The FPN is composed of a bottom-up pathway and a top-down pathway. The bottom-up pathway is a convolutional network used for feature extraction; the semantic value increases in the higher layers, while the spatial dimension is halved at every immediate higher layer. The top-down pathway passes the rich semantic information contained in the higher layers to the lower layers to build higher resolution layers. Figure 4.8 shows the Feature Pyramid Network hierarchy.


Figure 4.8: Feature Pyramid network hierarchy [30]
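For contrast with the modification introduced below, the following is a minimal PyTorch sketch of the standard FPN top-down merge, in which each level receives the up-sampled level directly above it by element-wise addition; channel counts and layer names are illustrative.

```python
import torch.nn.functional as F
from torch import nn

class FPNTopDown(nn.Module):
    """Standard FPN top-down step: each level is the 1x1-reduced backbone
    feature plus the up-sampled feature of the level above (addition)."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.output = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats are backbone features C2..C5, highest spatial resolution first.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):
            # Only the immediate higher level is merged into each level.
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [out(x) for out, x in zip(self.output, laterals)]
```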

The main difference between the FPN and the DSS network is that the DSS network introduces skip connections from the higher layers to the lower layers. Skip connections are connections between different layers of a network that skip one or more layers in between. In FPN, however, the information from the higher layers is passed on only to the immediate lower layer. Furthermore, the feature maps from higher layers are concatenated with the feature maps of lower layers in DSS, whereas in FPN the feature maps are added. Figure 4.9 shows the structures of the FPN and DSS networks.

Figure 4.9: DSS and FPN network structure

In order to tailor Mask Scoring R-CNN for saliency detection, such skip connections are added to the existing FPN architecture. The higher layers are up-sampled using bilinear interpolation and concatenated with the lower layers along the channel dimension, skipping the immediate lower layer. In order to reduce the number of channels, a 1x1 convolution is used. The modified FPN architecture is shown in Figure 4.10. The MaskIoU head from Mask Scoring R-CNN is used to obtain the mask scores.

Figure 4.10: Modified FPN structure
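A minimal sketch of the modified merge step is given below, assuming 256-channel FPN levels: a semantically rich higher level skips the intermediate level, is bilinearly up-sampled, concatenated with the lower level along the channel dimension and reduced back with a 1x1 convolution. The module and variable names are illustrative, not taken from the actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class DSSStyleMerge(nn.Module):
    """Illustrative DSS-style skip merge for two FPN levels:
    `high` skips the intermediate level and is fused directly into `low`."""
    def __init__(self, channels=256):
        super().__init__()
        # 1x1 convolution reduces the concatenated channels back to `channels`.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, low, high):
        # Bilinearly up-sample the semantically rich (higher) level to the
        # spatial size of the lower level, then concatenate along channels.
        high_up = F.interpolate(high, size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([low, high_up], dim=1)
        return self.reduce(fused)

# e.g. p2 = DSSStyleMerge()(p2, p4)   # P4 skips P3 and feeds P2 directly
```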


Training

The modified Mask Scoring R-CNN is trained with the saliency datasets MSRA-10K and MSRA-B, converted to COCO style as described in MRCNN SAL. The network is trained for 650000 iterations on a single Nvidia GPU, with a base learning rate of 0.0025 and a batch size of 2. The loss curves are shown in Figure 4.11.

(a) Classification loss (b) Segmentation loss (c) MaskIoU loss

Figure 4.11: Loss curves for classification, segmentation and mask score prediction

Inference

Figure 4.12 shows some results from MODIFIED MSRCNN SAL.

Figure 4.12: Example results from MODIFIED MSRCNN SAL


4.4 Results and Evaluation

4.4.1 Results Summary - Saliency maps

Figure 4.13 compares sample saliency maps for all the approaches described in Section 4.2.2 and Section 4.3 with those of DSS. MODIFIED MSRCNN SAL achieves the best saliency maps in comparison to the other techniques.

Figure 4.13: Summary of the resulting Saliency maps

4.4.2 User study

A user study is conducted using Amazon SageMaker Ground Truth [4], a labeling service for building high-quality training datasets for machine learning models. The dataset used for evaluation consists of 500 images from various categories of the OLX marketplace, so that the application can be scaled to all categories. The cropping results obtained using the saliency maps from DSS and from all the Mask R-CNN based approaches for saliency detection are the input for the user study; thus, a total of 2500 images are used as input. Every image is assessed by 5 users and each user rates the image on a scale of 0 to 4. Once an image has been evaluated by 5 users, it is removed from the evaluation list. Figure 4.14 is a screenshot from Amazon SageMaker Ground Truth for evaluating our crop bounding box.


Figure 4.14: Screenshot from Amazon SageMaker Ground Truth for evaluating the crop bounding box

4.4.3 Results

Since the user study experiment has not been completed yet, 466 of the 2500 images have so far been evaluated by all 5 users. We report our results over this subset of the dataset.

The users rate each image on a scale of 0 to 4, with 0 corresponding to the worst crop and 4 corresponding to the best crop. For the analysis of the results, any image with a score greater than 2 is considered a successful crop, and images with a score less than or equal to 2 are considered unsuccessful crops. Table 4.1 shows, per approach, the number of images evaluated, the percentage of successful and unsuccessful crops, and the mean user opinion score.

Approach            | Images evaluated | Successful crops | Successful (%) | Unsuccessful crops | Unsuccessful (%) | Mean user opinion score
DSS                 | 102              | 52               | 50.09          | 50                 | 49.91            | 2.12
MRCNN+SAL           | 100              | 64               | 64             | 36                 | 36               | 2.57
MRCNN SAL           | 86               | 54               | 62.7           | 32                 | 37.3             | 2.46
MSRCNN SAL          | 88               | 58               | 65.9           | 30                 | 34.1             | 2.61
MODIFIED MSRCNN SAL | 90               | 72               | 80             | 18                 | 20               | 3.04

Table 4.1: User study results
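For reference, the success rates and mean opinion scores in Table 4.1 can be computed from the raw ratings roughly as sketched below; the data layout (one row per evaluated image with the mean of its five user ratings) is an assumption.

```python
import pandas as pd

# Hypothetical layout: one row per evaluated image, with the approach name
# and the mean of its five user ratings (scale 0 to 4).
ratings = pd.DataFrame({
    "approach": ["DSS", "DSS", "MRCNN_SAL"],
    "mean_rating": [2.4, 1.8, 3.2],
})

# A mean rating above 2 counts as a successful crop.
ratings["successful"] = ratings["mean_rating"] > 2

summary = ratings.groupby("approach").agg(
    evaluated=("mean_rating", "size"),
    success_rate=("successful", "mean"),
    mean_opinion_score=("mean_rating", "mean"),
)
summary["success_rate"] *= 100
print(summary)
```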

It is clear from Table 4.1 that all the approaches based on Mask R-CNN outperform the DSS network in terms of cropping success based on user opinion. Cropping using the saliency maps from the DSS network results in a success rate of 50.09%. Merging the Mask R-CNN segmentation output with the saliency map from the DSS network (MRCNN+SAL) increases the percentage of successful crops by about 14 percentage points. This is a significant improvement; however, the technique is computationally intensive since it requires outputs from two networks.

Training Mask R-CNN with the saliency dataset (MRCNN SAL) also increases the percentage of successful crops, by about 13 percentage points. The advantage in this case is that it is computationally efficient compared to the previous technique of merging the outputs of the Mask R-CNN and DSS networks.

In order to obtain a robust crop quality metric, we use the Mask Scoring R-CNN approach and train Mask Scoring R-CNN with the saliency dataset (MSRCNN SAL), which increases the percentage of successful crops by about 16 percentage points compared to the DSS network.

The Mask Scoring R-CNN with the DSS-style backbone (MODIFIED MSRCNN SAL) has the highest success rate and improves the cropping success rate by 30 percentage points compared to the DSS network. The techniques based on Mask R-CNN outperform the DSS network because Mask R-CNN contains a Region Proposal Network stage in which the entire feature map at different scales (the outputs of the FPN) is scanned to detect objects in the image.

Figure 4.15: Mean user opinion score per approach

Figure 4.15 shows the mean user opinion score per approach using Mask R-CNN and DSS networks. The mean user opinion score for crops using the saliency maps from the DSS network is 2.12. The MODIFIED MSRCNN SAL has a significantly higher mean user opinion score of 3.04, improving the average crop quality by 42.7% compared to the DSS baseline.

Mask Scores to evaluate crop quality

To use the application in production, unsuccessful crops need to be detected and avoided. To evaluate the quality of a crop, the MaskIoU score from Mask Scoring R-CNN is used. Figure 4.16b shows the 2-D histogram of the MaskIoU score from MSRCNN SAL versus the mean user opinion score; the brighter a bin in the histogram, the more data points it contains. It is clear that for user opinion scores greater than 3, the mask quality score is also above 0.9, so a high mask quality score indicates a good crop.

(a) Density plot of mask score values for successful and unsuccessful crops
(b) 2-D histogram of MaskIoU score vs. mean user opinion score

Figure 4.16: Correlation of mask scores from MSRCNN SAL with mean user opinion score

Figure 4.16a shows the density distribution of mask scores over successful and unsuccessful crops based on the user study. The x-axis of the graph represents the mask scores from MSRCNN SAL; the y-axis shows the estimated density, normalized so that the area under each curve integrates to 1.

The receiver operating characteristic (ROC) curve is shown in Figure 4.17a and the precision-recall curve is shown in Figure 4.17b. Since the requirement is to reduce the false positive rate, we select a threshold of 0.8, which gives a low false positive rate while not reducing the true positive rate too much. As shown in the confusion matrix in Figure 4.18, a MaskIoU threshold of 0.8 rejects 73.4% of the negative crops and accepts 70.6% of the positive crops, while still letting through 26.6% of the negative crops.


(a) ROC curve

(b) Precision-recall curve; AP is the average precision

Figure 4.17: Selection of threshold for the metric

Figure 4.18: Confusion matrix for the metric - MaskIoU score from MSRCNN SAL
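As an illustration of how this operating point could be derived, the sketch below computes the ROC curve and the confusion-matrix figures for a given threshold from arrays of MaskIoU scores and binary success labels; the helper name and data format are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve

def evaluate_threshold(mask_scores, labels, threshold=0.8):
    """mask_scores: MaskIoU scores per image; labels: 1 = successful crop
    (mean user opinion score > 2), 0 = unsuccessful crop."""
    fpr, tpr, thresholds = roc_curve(labels, mask_scores)
    preds = (np.asarray(mask_scores) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
    return {
        "rejected_negatives": tn / (tn + fp),   # share of bad crops filtered out
        "accepted_positives": tp / (tp + fn),   # share of good crops kept
        "roc_points": list(zip(fpr, tpr, thresholds)),
    }
```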

The DSS network is used as the backbone of the Mask R-CNN network in MODIFIED MSRCNN SAL. Figure 4.19b shows the 2-D histogram of the mask score from MODIFIED MSRCNN SAL versus the mean user opinion score. As before, the brighter a bin in the histogram, the more data points it contains. It is clear that for user opinion scores greater than 3, the mask quality score is also above 0.8.


(a) Density plot of mask score values for successful and unsuccessful crops
(b) 2-D histogram of MaskIoU score vs. mean user opinion score

Figure 4.19: Correlation of mask scores from MODIFIED MSRCNN SAL with mean user opinion score

Figure 4.19a shows the density distribution of mask scores over successful and unsuccessful crops based on the user study. The x-axis of the graph represents the mask scores from MODIFIED MSRCNN SAL; the y-axis shows the estimated density, normalized so that the area under each curve integrates to 1.

The receiver operating characteristic (ROC) curve is shown in Figure 4.20a and the precision-recall curve is shown in Figure 4.20b. Since the requirement is to reduce the false positive rate, we again select a threshold of 0.8, which gives a low false positive rate while not reducing the true positive rate too much. As shown in the confusion matrix in Figure 4.21, a MaskIoU threshold of 0.8 rejects 82.4% of the negative crops and accepts 69.86% of the positive crops, while still letting through 17.6% of the negative crops. Thus, we can conclude that the MaskIoU score from MODIFIED MSRCNN SAL is the most robust metric, both compared to the metrics discussed in Appendix A and compared to the MaskIoU score from MSRCNN SAL.


(a) ROC curve

(b) Precision-recall curve; AP is the average precision

Figure 4.20: Selection of threshold for the metric

Figure 4.21: Confusion matrix for the metric - MaskIoU score from MODIFIED MSRCNN SAL
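In a deployed cropper, the MaskIoU score can act as a gate on the crop, for example as in the sketch below; the model interface and the fallback to the original image when the score is below the threshold are assumptions rather than behaviour described in this chapter.

```python
import numpy as np

def smart_crop(image, saliency_model, score_threshold=0.8):
    """Gate the crop on the MaskIoU score (threshold from the ROC analysis).

    `saliency_model` is assumed to return a binary saliency mask and a
    MaskIoU-style quality score for the input image (hypothetical interface).
    """
    mask, mask_score = saliency_model(image)
    if mask_score < score_threshold or not mask.any():
        return image                                 # reject low-quality crops
    ys, xs = np.nonzero(mask)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```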


Chapter 5

Conclusions

This thesis proposes a smart cropping application for images uploaded by users on online marketplaces. The baseline saliency detection network (DSS) was used to obtain the saliency map, smart cropping was implemented by applying post-processing techniques to this map, and the results were analyzed. The success rate of the baseline implementation is 50.1% based on the user study. Cropping mostly failed when the original image was of low quality, when the object in the image was not complete, when the image contained text or thin edges, or when the image showed land, houses, apartments, garages and similar scenes. Based on these observations, and given the need to evaluate the saliency map before using the application in production, a number of hypotheses were proposed to design metrics for the evaluation of saliency maps. A novel approach of using Mask R-CNN for saliency detection was proposed in order to improve the success rate of cropping and to obtain a crop quality score. The DSS network is used as the backbone of the Mask R-CNN network for saliency detection, and it shows a significant improvement in detecting the salient regions in the image. The success rate of cropping the image using the modified Mask R-CNN is 80% based on the user study.

5.1 Contribution

Thumbnail images on marketplace websites make it possible to fit multiple ad-postings on a single page. The traditional method of downsizing the image to create the thumbnail reduces the resolution of the image, and the object being sold may not be apparent to the user. To address this, we develop a smart cropping application based on saliency detection that re-composes the image in order to increase the visibility of the salient object.

The baseline saliency detection network (DSS) is implemented to obtain the saliency map of the image. The image is post-processed to crop around the salient region based on the saliency map obtained. The results are manually evaluated as successful and unsuccessful crops, and the corresponding original images and saliency maps are analyzed to detect the anomalies which result in unsuccessful crops. Based on the analysis, different hypotheses are proposed to avoid the negative results and the corresponding quality metrics are validated. Among the metrics proposed, the 'Average salient pixel' metric performs comparatively better than the rest of the metrics discussed. However, the metric does not perform well if the saliency detection network detects only certain regions of the image with high confidence. Thus, with the need to design and validate a robust metric and to improve saliency detection across all categories, the Mask R-CNN framework is repurposed for saliency detection and its mask score is used as the crop quality metric.
