
CHAPTER 2. LITERATURE REVIEW

2.2 Content aware smart cropping

Since marketplace websites contain images of objects to be sold, object detection frameworks like Faster R-CNN [37] or YOLO [36] can be used to crop the images along the bounding boxes of the detected objects. An instance segmentation framework like Mask R-CNN [19] can also be used to crop the image, since it provides a segmentation mask which can also serve as a saliency map when the only object classes are foreground and background.
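As a minimal sketch of this cropping step, assuming the detector returns boxes as (x1, y1, x2, y2) pixel coordinates (the function name is illustrative, not from any particular framework):

```python
import numpy as np

def crop_to_box(image, box):
    """Crop an image (H x W x C array) to a detector bounding box.

    `box` is (x1, y1, x2, y2) in pixel coordinates; the coordinates
    are clipped to the image extent before slicing.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, x2 = max(0, int(x1)), min(w, int(x2))
    y1, y2 = max(0, int(y1)), min(h, int(y2))
    return image[y1:y2, x1:x2]

# Example: crop a 100 x 100 image to a detected region.
image = np.zeros((100, 100, 3), dtype=np.uint8)
crop = crop_to_box(image, (10, 20, 60, 80))
print(crop.shape)  # (60, 50, 3)
```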

However, the first drawback of these frameworks is that they detect only the objects that they are trained for. Secondly, when there are multiple objects in a single image, an object detection framework may not detect all of them, as shown in Figure 2.4.

Training the framework on generic objects helps it detect a wider variety of objects, and merging the bounding boxes of all detected instances helps cover multiple objects in the image.
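The box-merging idea amounts to taking the union of the per-instance boxes, i.e. the smallest single box covering every detected instance. A sketch, with the same (x1, y1, x2, y2) convention as above:

```python
def merge_boxes(boxes):
    """Union of per-instance boxes (x1, y1, x2, y2): the smallest
    single box that covers every detected instance."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)

# Two detected instances merged into one crop region.
merged = merge_boxes([(10, 10, 50, 50), (40, 5, 90, 60)])
print(merged)  # (10, 5, 90, 60)
```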


Figure 2.4: Examples of unsuccessful cropping using object detection

In this thesis, the instance segmentation framework by He et al. [19], Mask R-CNN, explained in Section 2.2.1, is used to detect the salient object in the image. The saliency maps generated by Mask R-CNN are evaluated using Mask Scoring R-CNN by Huang et al. [22], which is also explained in Section 2.2.1.

2.2.1 Mask R-CNN

Mask R-CNN [19] is the current state-of-the-art framework for instance segmentation, achieving simultaneous detection, classification and segmentation of objects in images. Saliency detection can be expressed as instance segmentation with two classes: foreground and background. This thesis investigates the possibility of re-purposing Mask R-CNN for saliency detection. The Mask R-CNN framework has two stages: the first stage extracts features from an input image and proposes regions likely to contain objects, and the second stage classifies the objects while refining the bounding boxes and predicting segmentation masks.

Figure 2.5: Mask R-CNN Framework


Figure 2.5 shows the different stages of the Mask R-CNN framework. The backbone network of Mask R-CNN is typically a convolutional neural network (ResNet), whose initial layers extract low-level features at high resolution, while the later layers detect features with high semantic accuracy. The next stage of the framework is the Feature Pyramid Network (FPN) [30]. This network passes the high-level semantic information contained in the higher layers of the CNN down to the lower, high-resolution layers so that small objects can be detected accurately.

Region Proposal Networks (RPN) [38] scan the input feature map with a sliding-window method and generate proposals of regions likely to contain an object. The feature maps are shared among the regions of interest, reducing redundant computation. The RPN scans all the windows, referred to as anchors, in parallel and generates two outputs for every anchor: the anchor class, indicating whether the region contains a foreground object, and a bounding box refinement, a correction factor that aligns the anchor over the object. Non-max suppression removes anchors with low foreground scores, and the top anchors are passed on to the next stage. Based on the size of the region proposal created by the RPN, the feature map of the appropriate scale from the FPN is selected.
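The non-max suppression step mentioned above can be sketched as a greedy loop over score-sorted boxes (a simplified single-class version; real frameworks use optimized implementations):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: repeatedly keep the highest-scoring
    box and drop remaining boxes whose IoU with it exceeds the
    threshold. Boxes are rows of (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
keep = nms(boxes, scores)
print(keep)  # [0, 2] -- the near-duplicate of box 0 is suppressed
```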

The proposed Region of Interest (ROI) is passed on to the ROI classifier and bounding box regression stage [38]. This stage has two outputs, similar to the RPN. The first output is the class of the object in the ROI: the ROI classifier assigns the object to one of multiple object classes, or to a background class if the ROI contains no object. The second output is a further refinement of the bounding box so that it contains the complete object.

The classifier stage handles only fixed-size inputs, but the bounding box regressor of the RPN adjusts each box to fit its object, so the bounding boxes of different ROIs vary in size. ROI pooling is the technique used to convert variable-size ROIs to fixed-size inputs (H x W, where H is the height and W the width of the pooled feature map) for the classifier. H and W are hyperparameters of the layer and are independent of any particular ROI. An ROI of size h x w is divided into an H x W grid and the values in every grid cell are max-pooled to produce the corresponding output value. The sub-windows have a size of h/H x w/W, and because the cell boundaries are forced to align with the boundaries of the input feature map, the cells are not all of equal size. Mask R-CNN introduces a new technique called ROI Align, in which the cell boundaries are not quantized and bilinear interpolation is used to compute the feature map values within each cell [19].
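The bilinear interpolation at the heart of ROI Align can be sketched as follows, for a single-channel feature map and one sampling point (a real implementation samples several points per output cell and averages them):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous (y, x)
    position: the operation that lets ROI Align evaluate the feature
    map at non-integer cell boundaries instead of quantizing them."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    # Weighted combination of the four surrounding grid values.
    return (feature[y0, x0] * (1 - dy) * (1 - dx)
            + feature[y0, x1] * (1 - dy) * dx
            + feature[y1, x0] * dy * (1 - dx)
            + feature[y1, x1] * dy * dx)

feature = np.array([[0.0, 1.0],
                    [2.0, 3.0]])
v = bilinear_sample(feature, 0.5, 0.5)
print(v)  # 1.5, the average of the four neighbours
```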

Mask R-CNN includes an additional head for instance mask generation. The masks are generated by a fully convolutional network head and are of size 28 x 28. The generated masks are soft masks, represented by floating point numbers, where each pixel denotes the probability of that pixel belonging to the foreground object; they therefore hold considerable detail despite their small size. During inference, these masks are scaled up to fit the object size in the original image [19].
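A sketch of this inference-time upscaling step, using nearest-neighbour resizing for brevity where the framework uses interpolation (function and parameter names are illustrative):

```python
import numpy as np

def paste_mask(soft_mask, box_h, box_w, threshold=0.5):
    """Upscale a small soft mask (e.g. 28 x 28 foreground
    probabilities) to the object's box size and threshold it into a
    binary mask. Nearest-neighbour resize, for simplicity."""
    ys = (np.arange(box_h) * soft_mask.shape[0] / box_h).astype(int)
    xs = (np.arange(box_w) * soft_mask.shape[1] / box_w).astype(int)
    resized = soft_mask[np.ix_(ys, xs)]
    return (resized >= threshold).astype(np.uint8)

soft = np.random.rand(28, 28)          # stand-in for a predicted soft mask
binary = paste_mask(soft, 56, 70)      # box of size 56 x 70 in the image
print(binary.shape)  # (56, 70)
```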

Mask Scoring R-CNN

The paper on Mask Scoring R-CNN [22] shows that the classification confidence from the classifier stage of Mask R-CNN does not correlate with the quality of the segmentation mask.

The paper introduces an additional head to the Mask R-CNN framework, called the MaskIoU head, which scores segmentation mask quality based on the Intersection over Union (IoU) between the ground truth and the predicted mask. The network architecture is illustrated in Figure 2.6.

Figure 2.6: Network Structure of Mask Scoring R-CNN [22]

The MaskIoU head defines a mask quality score whose ideal value is the IoU between the ground truth mask and the predicted mask. The mask score should be positive for the object class and zero for all other classes, which requires the masks to be assigned to classes; the output of the classifier stage of Mask R-CNN is used directly for this task. IoU regression is then performed to obtain the score for the predicted mask. The features from the ROI Align layer are concatenated with the predicted mask (downsampled by max pooling with a kernel size of 2 and a stride of 2) and given as input to the MaskIoU head. The MaskIoU head consists of 4 convolutional layers with a kernel size of 3 and 256 filters, followed by 3 fully connected layers, where the outputs of the first two fully connected layers are of size 1024 and the final layer outputs the total number of object classes.
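The regression target of the MaskIoU head, the IoU between the predicted and ground truth masks, can be computed directly for binary masks:

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between a binary predicted mask and the ground truth mask:
    the ideal value the MaskIoU head is trained to regress."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

pred = np.zeros((4, 4)); pred[:2, :] = 1   # predicted: top two rows
gt = np.zeros((4, 4)); gt[1:3, :] = 1      # ground truth: middle two rows
iou = mask_iou(pred, gt)
print(iou)  # 4 / 12 = 0.333...
```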

For training the MaskIoU head, the RPN proposals whose IoU with the ground truth box is greater than 0.5 are used as training samples. During inference, the MaskIoU is predicted for the top-k boxes (typically k = 100) from the RPN, multiplied by the classification confidence score, and reported as the mask quality score.
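The final scoring step is an elementwise product of the two signals, which can be sketched as:

```python
import numpy as np

def mask_quality_scores(cls_scores, mask_ious):
    """Final mask quality for the top-k detections: classification
    confidence multiplied by the predicted MaskIoU."""
    return np.asarray(cls_scores) * np.asarray(mask_ious)

# A confident detection with a poor mask scores lower than a slightly
# less confident detection with an accurate mask.
quality = mask_quality_scores([0.9, 0.8], [0.5, 0.95])
print(quality)  # [0.45 0.76]
```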

Chapter 3

Content Unaware Smart Cropping