
4.3 Repurposing Mask R-CNN for saliency detection

The following approaches are proposed to improve the saliency maps of the images by leveraging the segmentation masks output by Mask R-CNN.

• Training Mask R-CNN with a Saliency Dataset: As mentioned above, Mask R-CNN trained on the COCO dataset classifies instances into different classes. Training Mask R-CNN on the saliency datasets MSRA-10K and MSRA-B instead enables the network to learn the background and foreground regions of the image.

• Training Mask Scoring R-CNN with a Saliency Dataset: This approach allows the mask score output by Mask Scoring R-CNN to be used as a metric to evaluate the saliency maps.

• Using a Saliency Detection Network as the Backbone of Mask Scoring R-CNN: A saliency detection network is designed specifically to capture the saliency in an image, both semantically and structurally. Modifying Mask Scoring R-CNN along the lines of a saliency detection network can therefore increase the success rate of saliency detection.

4.3.1 Approach 1 - Training Mask R-CNN with a Saliency Dataset (MRCNN SAL)

This approach is proposed to obtain the saliency map from a single network, thus reducing the cost in terms of time and computational complexity. The ROI classifier stage, which follows the RPN stage of the Mask R-CNN framework as described in Section 2.2.1, classifies each object into one of the classes; if an ROI contains no object, it is classified as the background class.

Training the Mask R-CNN framework on the saliency datasets MSRA-10K and MSRA-B enables the network to learn the features of any object in an ROI as a foreground object.


Training

The saliency datasets are used to train the Mask R-CNN network end-to-end. In order to train the network, the datasets must first be converted to COCO-style datasets.
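Concretely, the conversion wraps each binary ground-truth mask as a single-category COCO annotation. Below is a minimal sketch, assuming MSRA-style image/mask pairs on disk; the directory layout, file naming and the generic "foreground" category are illustrative assumptions (the actual preparation pipeline is described in Appendix B):

```python
import json
import os

import cv2

# Hypothetical paths; MSRA-style datasets ship as image/binary-mask pairs.
IMAGE_DIR = "MSRA10K/images"
MASK_DIR = "MSRA10K/masks"

coco = {
    "images": [],
    "annotations": [],
    # A single category: every salient object is treated as generic foreground.
    "categories": [{"id": 1, "name": "foreground"}],
}

ann_id = 1
for img_id, fname in enumerate(sorted(os.listdir(IMAGE_DIR)), start=1):
    stem = os.path.splitext(fname)[0]
    mask = cv2.imread(os.path.join(MASK_DIR, stem + ".png"), cv2.IMREAD_GRAYSCALE)
    if mask is None:
        continue
    h, w = mask.shape
    coco["images"].append({"id": img_id, "file_name": fname, "height": h, "width": w})

    # Binarise the ground-truth map and trace the object boundaries as polygons,
    # which is the segmentation format COCO-style training code expects.
    _, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for cnt in contours:
        if cv2.contourArea(cnt) < 16:  # skip tiny speckles in the mask
            continue
        x, y, bw, bh = cv2.boundingRect(cnt)
        coco["annotations"].append({
            "id": ann_id,
            "image_id": img_id,
            "category_id": 1,
            "segmentation": [cnt.flatten().astype(float).tolist()],
            "bbox": [float(x), float(y), float(bw), float(bh)],
            "area": float(cv2.contourArea(cnt)),
            "iscrowd": 0,
        })
        ann_id += 1

with open("msra10k_coco.json", "w") as f:
    json.dump(coco, f)
```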

The training dataset preparation is described in Appendix B. The backbone network used is ResNet-101 [20]. The training set consists of 15000 images. The network is trained for 90000 iterations on a single Nvidia GPU. The loss curves for classification and segmentation are shown in Figure 4.4.

Figure 4.4: Loss curves for classification (a) and segmentation (b)

Inference

Figure 4.5 shows some inference results from MRCNN SAL. The saliency maps are clear and thus result in successful crops.
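For illustration, the sketch below shows one way a crop can be derived from such a saliency map: take the tight bounding box of the salient pixels and pad it by a small margin. The margin heuristic is an assumption, not the exact cropping rule used in this work:

```python
import numpy as np

def crop_from_saliency(image: np.ndarray, saliency: np.ndarray,
                       margin: float = 0.1) -> np.ndarray:
    """Crop `image` to the salient region predicted by the network.

    `saliency` is a binary (H, W) mask, e.g. a thresholded MRCNN SAL output;
    `margin` is a hypothetical padding fraction around the tight box.
    """
    ys, xs = np.nonzero(saliency)
    if len(ys) == 0:          # nothing salient detected: keep the full frame
        return image
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    # Pad the tight bounding box by a fraction of its size, clipped to the image.
    dy = int((y1 - y0) * margin)
    dx = int((x1 - x0) * margin)
    h, w = saliency.shape
    y0, y1 = max(0, y0 - dy), min(h, y1 + dy)
    x0, x1 = max(0, x0 - dx), min(w, x1 + dx)
    return image[y0:y1, x0:x1]
```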


Figure 4.5: Example results from MRCNN SAL

4.3.2 Approach 2 - Training Mask Scoring R-CNN with a Saliency Dataset (MSRCNN SAL)

Mask Scoring R-CNN outputs, along with the segmentation mask, a mask score that predicts the quality of the mask and can therefore be used as a metric to evaluate the saliency map.
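As a sketch of how the mask score can serve as such a metric, the following keeps only the predicted masks whose mask score passes a threshold and merges them into a single saliency map; the threshold value and the merge-by-union step are assumptions:

```python
import numpy as np

def saliency_from_scored_masks(masks: np.ndarray, mask_scores: np.ndarray,
                               score_threshold: float = 0.5) -> np.ndarray:
    """Merge predicted instance masks into one saliency map.

    `masks` is an (N, H, W) array of binary instance masks from Mask Scoring
    R-CNN, and `mask_scores` holds the corresponding predicted mask-quality
    scores; the threshold value here is a hypothetical choice.
    """
    keep = mask_scores >= score_threshold
    if not keep.any():
        # Fall back to the single best-scored mask if none pass the threshold.
        keep = mask_scores == mask_scores.max()
    # The union of the retained masks acts as the saliency map; the mask
    # score itself doubles as a confidence estimate for the final map.
    return masks[keep].any(axis=0).astype(np.uint8)
```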

Training

Mask Scoring R-CNN is trained with the saliency datasets MSRA-10K and MSRA-B, converted to COCO style as described in the previous approach. The training set consists of 15000 images. The network is trained for 480000 iterations on a single Nvidia GPU. The base learning rate is 0.0025 and the batch size is 2 images. The loss curves are shown in Figure 4.6.
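Since the reference implementation of Mask Scoring R-CNN builds on maskrcnn-benchmark, these solver settings could be expressed roughly as follows; the config file name and the two-class override are assumptions:

```python
# Hypothetical maskrcnn-benchmark-style override of the solver settings
# quoted above; exact keys depend on the code base actually used.
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/e2e_ms_rcnn_R_101_FPN_1x.yaml")  # assumed config file
cfg.merge_from_list([
    "SOLVER.BASE_LR", 0.0025,               # base learning rate from the text
    "SOLVER.IMS_PER_BATCH", 2,              # batch size of 2 images
    "SOLVER.MAX_ITER", 480000,              # 480000 training iterations
    "MODEL.ROI_BOX_HEAD.NUM_CLASSES", 2,    # background + foreground only
])
```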

Figure 4.6: Loss curves for classification (a), segmentation (b) and mask score prediction (c)


Inference

Figure 4.7 shows some results from MSRCNN SAL.

Figure 4.7: Example results from MSRCNN SAL

4.3.3 Approach 3 - Using a Saliency Detection Network as the Backbone of Mask Scoring R-CNN (MODIFIED MSRCNN SAL)

This approach uses the idea from DSS to tailor the Mask Scoring R-CNN framework specifically for saliency detection. The main idea of the DSS network is to connect feature maps from higher layers to lower layers in order to transfer semantic information to the lower layers, as described in Section 2.1.2. Mask Scoring R-CNN follows a similar structure in its Feature Pyramid Network (FPN) [30], which detects objects at different scales. The FPN is composed of a bottom-up pathway and a top-down pathway. The bottom-up pathway is a convolutional network used for feature extraction; the semantic value increases in the higher layers, while the spatial dimension is halved at every layer. The top-down pathway passes the rich semantic information contained in the higher layers down to the lower layers to build higher-resolution feature maps. Figure 4.8 shows the Feature Pyramid Network hierarchy.


Figure 4.8: Feature Pyramid Network hierarchy [30]
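A minimal PyTorch sketch of this top-down pathway is given below; the channel counts are illustrative and the 3x3 smoothing convolutions of the full FPN are omitted:

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """A minimal sketch of the FPN top-down pathway described above.

    `in_channels` lists the channel counts of the bottom-up feature maps
    (lowest to highest layer); the values here are illustrative.
    """

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions project every bottom-up map to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feats):
        # feats: bottom-up maps C2..C5, spatial size halving at each level.
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outs = [laterals[-1]]  # start from the semantically richest map
        for lat in reversed(laterals[:-1]):
            # Standard FPN merge: upsample the higher level by 2 and *add* it
            # to the lateral projection of the immediate lower level.
            top = F.interpolate(outs[0], scale_factor=2, mode="nearest")
            outs.insert(0, lat + top)
        return outs  # P2..P5, all with out_channels channels
```

Note that the element-wise addition requires both maps to have the same number of channels, which is exactly what the 1x1 lateral convolutions guarantee.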

The main difference between the FPN and the DSS network is that the DSS network introduces skip connections from the higher layers to the lower layers. Skip connections are connections between layers of a network that bypass one or more layers in between.

In the FPN, by contrast, the information from a higher layer is passed only to the immediate lower layer. Furthermore, in DSS the feature maps from higher layers are concatenated with the feature maps of lower layers, whereas in the FPN the feature maps are added. Figure 4.9 shows the structures of the FPN and DSS networks.

Figure 4.9: DSS and FPN network structure
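The two merge operations differ in a single line, as the following sketch with illustrative tensor shapes shows:

```python
import torch
import torch.nn.functional as F

# Illustrative feature maps: a higher (coarser) and a lower (finer) level,
# both already projected to 256 channels.
low = torch.randn(1, 256, 64, 64)
high = torch.randn(1, 256, 32, 32)

up = F.interpolate(high, scale_factor=2, mode="nearest")

fpn_merge = low + up                      # FPN: element-wise addition, 256 channels
dss_merge = torch.cat([low, up], dim=1)   # DSS-style: concatenation, 512 channels
```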

In order to tailor Mask Scoring R-CNN for saliency detection, skip connections are added to the existing FPN architecture. The higher layers are up-sampled using bilinear interpolation and concatenated with the lower layers along the channel dimension, skipping the immediate lower layer. A 1x1 convolution is then used to reduce the number of channels. The modified FPN architecture is shown in Figure 4.10. The MaskIoU head from Mask Scoring R-CNN is retained to obtain the mask scores.
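A minimal PyTorch sketch of this modified merge step is shown below; the channel count is illustrative and only one skip connection is shown:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipConcatMerge(nn.Module):
    """Sketch of the modified FPN merge: a higher-level map skips the
    immediate lower level, is bilinearly up-sampled to the target level,
    concatenated along the channel dimension, and reduced back with a
    1x1 convolution. The channel count is illustrative."""

    def __init__(self, channels=256):
        super().__init__()
        # Concatenation doubles the width, so a 1x1 conv restores it.
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, lower, higher):
        # `higher` sits two levels above `lower` (the immediate lower layer
        # is skipped), so it is bilinearly up-sampled to `lower`'s resolution
        # before concatenation along the channel dimension.
        up = F.interpolate(higher, size=lower.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.reduce(torch.cat([lower, up], dim=1))
```

The 1x1 reduction keeps the output width of the modified FPN identical to the original, so the downstream RPN and ROI heads remain unchanged.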

Figure 4.10: Modified FPN structure


Training

The modified Mask Scoring R-CNN is trained with the saliency datasets MSRA-10K and MSRA-B, converted to COCO style as described for MRCNN SAL. The network is trained for 650000 iterations on a single Nvidia GPU. The base learning rate is 0.0025 and the batch size is 2. The loss curves are shown in Figure 4.11.

Figure 4.11: Loss curves for classification (a), segmentation (b) and mask score prediction (c)

Inference

Figure 4.12 shows some results from MODIFIED MSRCNN SAL.

Figure 4.12: Example results from MODIFIED MSRCNN SAL
