An Analysis of the Characteristics and Limitations of Image Data Sets affecting U-Net-based Metal Recognition


Layout: typeset by the author using LaTeX.


An Analysis of the Characteristics and Limitations of Image Data Sets affecting U-Net-based Metal Recognition

Lisa A. Hooftman

12063207

Bachelor thesis

Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam

Faculty of Science

Science Park 904

1098 XH Amsterdam

Supervisor

Dhr. dr. S. van Splunter

Informatics Institute

Faculty of Science

University of Amsterdam

Science Park 904

1098 XH Amsterdam

February 2021


Abstract

In this thesis, an analysis is performed of the influence of eight data set characteristics on a model's metal segmentation performance. The study was performed by training a U-Net-based architecture on state-of-the-art data sets containing images and ground truth masks of metal, i.e. the Flickr Material Database (FMD) and the Materials in Context database (MINC). To further determine the data sets' relevant characteristics and limitations, an X-ray data set (XRAYS) was created specifically for this study. The trained models were tested on a variety of test samples from the data sets, and the characteristics and limitations of the data sets were related to the results the trained models yielded. The results show that image data with large metal segments, a clear distinction between the object of interest and its background, and the depiction of full metal objects positively influence a model's performance on metal recognition. Objects blending in with their cluttered surroundings and data poorly labelled by human observers negatively influence metal recognition by a U-Net. The size of a data set, occurrences of occlusion, and occurrences of images fully depicting metal do not necessarily affect metal recognition by a U-Net in a specific manner.


Contents

1 Introduction
  1.1 Objectives
  1.2 Background
2 Data Sets
  2.1 Flickr Material Database (FMD)
  2.2 Materials in Context Database (MINC)
  2.3 Materials in Context Database Cropped
  2.4 X-ray Data Set (XRAYS)
  2.5 Characteristics Overview
3 Methodology
  3.1 Performance Measurements
4 Experimental Set-up
  4.1 Data Generation
    4.1.1 General Preprocessing
    4.1.2 Preprocessing FMD
    4.1.3 MINC and MINC cropped
    4.1.4 X-ray Data Set
  4.2 Data Augmentation
  4.3 U-Net Based Deep Convolutional Networks
  4.4 Experiment Specifications
5 Results
6 Analysis
  6.1 Trained on Single Data Sets
    6.1.1 Trained on FMD
    6.1.2 Trained on MINC
    6.1.3 Trained on MINC Cropped
    6.1.4 Trained on XRAYS
  6.2 Trained on Mixed Data Sets
    6.2.1 In Proportion
    6.2.2 Original Sizes
  6.3 Characteristics Overview
7 Conclusion
8 Discussion
  8.1 Future Research


1 Introduction

The daily interactions that we as humans have with the world around us shape our ability to recognize the materials objects are made of [1]. We combine all of our senses to recognize materials by their specific attributes. Shiny objects may appear to be made of metal to the naked eye; however, our capability of correlating the way an object feels, sounds when touched, or smells helps us determine whether the object is made of the material our eyes perceive it to be made of [5].

Various studies in the field of computer vision have attempted to use models for performing and understanding material recognition [2][7][12][14]. Automating material recognition for this thesis is inspired by the company FloatScans [6]. FloatScans has developed a 3D scanning technology that captures real-world objects and artefacts indistinguishably from reality. Scanning objects in 3D preserves them digitally, which creates possibilities for using them in video games or VFX productions. To create a digital object that reacts to light rays similarly to the real-world object, classifying the materials of the scanned object is critical. For instance, metal objects tend to reflect light rays differently than non-metal objects. Albedo and specular reflectance are the specific properties of metal objects that distinguish them from non-metal objects [5]. This thesis aims to determine which attributes of data sets affect a model's performance on recognizing metal parts in visual data. In this paper specifically, the challenge is to use a model that distinguishes metal from non-metal components, and to compare how various aspects of data sets influence this model's capability of doing so.

Images carry a generous amount of information, which is stored in a large number of parameters, or pixels. Convolutional Neural Networks (CNNs) are suitable for tackling computer vision problems because of their power in analyzing visual imagery by reducing the large number of parameters of an image (known as dimensionality reduction) [10], and their translation-invariant characteristics [15]. For this thesis, a U-Net [11] (a type of Fully Convolutional Network (FCN) [19]) was constructed to detect metal segments in an end-to-end setting. To accomplish this, four data sets of images containing metal objects were fed into the U-Net, which was trained by consistently comparing its predictions with binary masks depicting the true metal regions in images. The data sets used are the Flickr Material Database (FMD) [13], the Materials in Context Database (MINC) [2], a cropped variation on MINC, and a personally created data set containing images and corresponding X-rays of several objects. While training, the U-Net detected patterns in metal segments through attributes like texture and reflectance. Once the model was trained on a data set, it was tested on images from the various data sets. It created binary metal segmentation maps based on its knowledge of metal depictions from training. This thesis focuses on the comparison of features and limitations of data sets given a certain model, not on creating a model that detects metal segments as well as possible.


1.1 Objectives

Despite rapid advancements in computer vision and material recognition, metal recognition is still an evolving field [14]. A model's architecture affects its performance in metal recognition, but the data used for training also appears to shape this performance. This thesis aims not to create a learning algorithm that recognizes metal as accurately as possible, but to analyze the determining features and limitations of state-of-the-art data sets on a model's performance for metal recognition (see Section 1.2, Background). Furthermore, a newly created, unique data set consisting of images and corresponding X-rays of several objects was used to determine whether its new features affected the model's performance compared to the other data sets. This exploration supports the development of better guidelines for creating a diverse and high-quality data set for future metal recognition problems.

From the stated objectives, the following research questions can be derived:

1. Which characteristics and limitations of data sets affect the performance of a U-Net on metal recognition?

1.1 Which data sets are available for metal recognition, and what are their characteristics and potential limitations?

1.2 How can an X-ray data set be created for metal recognition, and what are its characteristics and potential limitations?

1.3 Assuming a U-Net is trained on either a single data set or a combination of data sets, how accurately can a U-Net detect metal in them?

1.4 How can the characteristics and limitations of the single and combined data sets be related to the performance in segmentation by the U-Net?

1.2 Background

This thesis utilizes various data sets for training a model to locate metal components in an image. The aim of this thesis is to compare a model's performance in metal recognition in correlation with the characteristics of the data used for training the model. The representations of metals in images are not bound to a particular shape. Moreover, metal objects or metal components often find themselves in an image composition containing numerous other materials. Therefore, classification should not merely be done at image level, as regular Convolutional Neural Networks are capable of, but rather at pixel level. The concept of semantic image segmentation assigns each pixel of an image to a corresponding class, in this case: metal or non-metal. A suitable model for this issue is one that learns to segment images in an end-to-end setting, i.e. a raw image is entered as input, and the result is a segmentation map. This segmentation map is a binary representation in which a pixel with a value of 1 corresponds with a pixel satisfying a class, and a pixel with a value of 0 corresponds with a pixel not satisfying this class. In the case of this project, the target class is metal, as opposed to non-metal.

Ronneberger et al. [11] introduced a skip-architecture-based learning model called a U-Net, initially designed for the segmentation of biomedical images. The concept of a skip architecture was proposed by Long et al. [8], where superficial information of an image from shallow encoding layers is combined with a high-level representation from decoding layers. A U-Net utilizes this concept, bringing a downsampling (encoder) path together with an upsampling (decoder) path. The downsampling path is used to capture the context in an image and determine what is present. Convolutional layers make up the downsampling path, each followed by a rectified linear unit (ReLU) and a max-pooling layer. The upsampling path uses up-convolutional layers to gradually increase the image size again and recover where the information that was found is specifically located. By combining the outputs of the downsampling and upsampling paths, information is effectively connected, and a detailed, substantiated segmentation can be predicted [11]. Unlike regular CNNs, a U-Net can accept any image size, and it is memory efficient because of the absence of dense layers.

2 Data Sets

The Flickr Material Database [13], the Materials in Context Database [2] and the cropped Materials in Context Database are introduced and analyzed in Sections 2.1, 2.2 and 2.3 respectively. This section on data sets aims to answer research question 1.1. In Section 2.4, the X-ray data set newly created for this thesis is introduced and analyzed, which answers research question 1.2.

2.1 Flickr Material Database (FMD)

The Flickr Material Database (FMD) [13] was designed to study the speed and accuracy of human perception in material categorization. It consists of 100 images of surfaces and segmentation masks per material category, including fabric, foliage, glass, leather, metal, paper, plastic, stone, water, and wood. Within each category, there is an even distribution of 50 close-ups and 50 regular scenes.

FMD was built with the particular objective of capturing the general scope of material attributes, i.e. a variety of colours, sizes, objects and surface roughnesses. This diversity ensures that within the metal category, for instance, rusty metal surfaces as well as shiny, grey metals are both included, which is visible in the sample images in Figure 1. This intentional variety aims to make distinguishing material classes, by humans or computers, invariant to low-level information. In this way, the chances are reduced that a model becomes highly biased by learning metal aspects through recognizing only a specific type of object. Object segmentation, as opposed to segmentation of metal parts against non-metal parts, seems to be avoided by the visual distribution of metal objects present in the image data. However, FMD contains many compositions where objects are positioned against a clear background (as shown in the first and fourth example in Figure 1). Therefore, the FMD data set could still make a model prone to object segmentation, instead of segmentation of metal against non-metal. Moreover, every object of interest in FMD fully consists of metal. In reality, it is highly plausible that objects consist of various materials. Therefore, a model trained on FMD could be prone to failing at accurately segmenting objects with mixed compositions. As shown in Table 2, on average over 80% of the pixels in the images represent metal. Large parts of images can then be analyzed and used for learning about metal features. Some FMD images fully represent metal, where the ground truth is entirely white (labelled as metal). This characteristic could deteriorate a model's segmentation capability, as it does not learn to segment metal against non-metal in these images.

Previous applications of FMD in computer vision problems reach approximately 60% accuracy at identifying FMD categories [13]. In comparison, humans achieved an accuracy of 84.9%. In the study by Schwartz and Nishino on obtaining per-pixel material information, it was shown that the property 'metallic' scored lowest of all attributes when training a model on FMD to recognize material attributes. The accuracy of this model was 66.4% on the attribute 'metallic'. Thus, based on the results of training models on FMD, computer vision systems do not nearly meet human performance on metal recognition.

Figure 1: Example images and their corresponding mask from the Flickr Material Database (FMD).

2.2 Materials in Context Database (MINC)

The Materials in Context Database (MINC) [2] was made to showcase objects in everyday scenes. In the natural world, objects consist of various materials, find themselves in numerous lighting conditions, and are surrounded by clutter. The factor of materials blending in with their context makes segmenting material representations challenging, since much information is present in the image data. However, this feature of MINC also makes the representation of metal objects realistic.

MINC contains 228 images with metal components and their respective ground truth masks. Bell et al. [2] argue that the image data is diverse and well-sampled. This statement is accurate regarding camera angle and distance. The data set contains mostly long shots of metal objects and a few medium shots and close-up shots. The large distance from the camera to the metal objects of interest causes only 5.42% of the pixels in images to be labelled as metal (as shown in Table 2). When cropping the images to the area of interest only (see Figure 3b), this value increases to 24.63% (Table 2). There is a lack of diversity in the subjects presented in the image data, as objects of interest are mostly situated in kitchen-like environments. Consequently, most metal objects in the image data are kitchenware and kitchen appliances, which share similar characteristics such as a grey colour and a shiny surface. Therefore, a learning model can become biased and score poorly on different metal representations. Furthermore, annotators created MINC's ground truth masks based on their human perception of metal. While for FMD this seems to be no issue for the accuracy of the ground truth masks, it appears that for cluttered compositions in image data from MINC it is. Many segments that observers labelled as metal are indeed metal components of the image, yet many other metal parts have been ignored (see Figure 2). This error could confuse a model trying to learn metal attributes from visual data. If this is considered a criterion to filter the data on, only 138 pairs of images and masks are of adequate quality, implying that 90 pairs are of low quality.

Figure 2: An example of a poorly annotated image (l) and its mask (r) from the MINC data set, where only a small metal part was labelled.

MINC-2500 is a specific data set made up of 2500 image patches per material category. Using the GoogLeNet CNN model [17], 77.7% of metal patches were predicted correctly. However, this thesis does not train and test on patches, but rather on images from MINC depicting full scenes along with their corresponding ground truth masks.


2.3 Materials in Context Database Cropped

An alternative variation on the MINC data set was used to further analyze MINC's characteristics and their impact on metal recognition. The images were cropped to the area of interest, i.e. the segments where the main metal object is situated. In Section 4.1.3, the implementation details are described. Cropped images portray the same objects as the original images from MINC. They share the same principles as the original MINC data set, as most items are still grey kitchenware (see Figure 3 for examples). The main difference between the cropped version and the original version is the cluttered background. By zooming in, the main part of the image is now the metal object of interest. Even though this defeats the original purpose of MINC, which is to showcase metal objects in context, it enables an analysis of how much the cluttered background impacts the ability to do metal recognition. As shown in Table 2, images from the MINC cropped data set contain, on average, around 25% pixels that represent metal.

(a) Original images.

(b) Images cropped to area of interest.

Figure 3: Example images and their corresponding masks from the Materials in Context Database (MINC), where 3b shows the original images from 3a cropped to the area of interest only.


2.4 X-ray Data Set (XRAYS)

In both the FMD [13] and MINC [2] data sets, human annotators have created segmentation masks based on the colours, shapes and reflectivity which they associate with metal.

The presented U-Net requires binary segmentation masks for training on image data. Each pixel in a segmentation map is assigned a value of either 1 or 0. This value is interpreted as a classification of the corresponding pixel in the original image, representing either metal (pixel value of 1) or non-metal (pixel value of 0). The concept that metal objects light up when placed in an X-ray machine, because they occlude the rays, founded the approach of creating a new and uniquely annotated data set. This goal was achieved by photographing and scanning several household objects with an X-ray machine. This approach automatically creates a segmentation map for objects. It is simply another, unique approach to creating ground truth segmentation masks, independent of the human perception of metal. Please refer to Section 4.1.4.1 for example images. All items were selected to show the difference between a full metal object and an object made up of various materials. As shown in Table 2, on average 24% of the pixels present in the images are metal. A limitation of the data set is that it is small, which means few training samples are available. Furthermore, the data set is somewhat biased, since all images are close-ups with white backgrounds. Another weakness is that X-rays show the occlusion of metal parts in the objects, even when these are not visible to the naked eye.

2.5 Characteristics Overview

Table 1 shows a set of assembled characteristics C1 ... C8. When a characteristic holds for a data set, this is visualized with a ×. In Table 2, the average percentage of pixels representing metal in images is shown per data set.


Characteristics                                       FMD   MINC   MINC (cropped)   XRAYS
C1 Occlusion                                                                          ×
C2 >50% metal (on average)                             ×
C3 Clear distinction between objects of interest
   and backgrounds                                     ×                              ×
C4 Objects of interest blend in with background              ×      ×
C5 Objects of interest are fully made up of metal      ×
C6 Occurrences where the full image portrays metal
   (ground truth mask is fully white)                  ×
C7 Ground truth masks constructed based on
   human perception                                    ×     ×      ×
C8 Data set size >100 images and masks                 ×     ×      ×

Table 1: Characteristics of the data sets, where × is used when a characteristic holds for a data set and C1 ... C8 identify each characteristic.

Data set       Average percentage of metal
FMD            80.24%
MINC           5.42%
MINC cropped   24.63%
XRAYS          24.39%

Table 2: Average percentage of pixels representing metal in the image data per data set.

3 Methodology

This thesis’s primary goal is to place data sets and their characteristics in proximity to the performance of training a U-Net-based model on them. The proposed method was tested and evaluated on the Flickr Material Database (FMD), Materials in Context Database (MINC), a separate data set where MINC images were cropped, and a newly created X-ray data set. The four data sets each contain 2D, multichannel RGB images of metals, along with their respective segmentation masks. In both the model training and the final segmentation evaluation, these masks are used as ground truths representing metal parts in

(14)

the images. A pixel value equal to ‘0’ in a segmentation mask indicates a pixel classified as non-metal in the corresponding image. Similarly, a pixel value equal to ‘1’ in a mask indicates a pixel classified as metal in the corresponding image. This combination shows a clear distinction between metal parts and non-metal parts in the images, as metal segments light up as white segments in the masks. Every data set considered in this project was constructed for a different purpose. FMD serves to study the human ability to make distinctions between materials based on their perception of specific material attributes. MINC was created for analysing metal objects in more realistic, cluttered settings. These show metal surfaces in a real-life context. These original intentions for data sets have resulted in several differences between them. Variations between data sets are discussed in section1.2, such as the average amount of metal visible in images or how objects are positioned in an image. These variations are studied by feeding the data to a model. The model attempts to find a blueprint of metal segments in the training data to, in turn, find segments in a new image that comply with these learned features. It is not a comparison of learning algorithms that is the main focus of this project, but rather comparing and analysing the various data sets to feed a model. To compare the strengths and weaknesses of data sets used for metal recognition, a U-Net is used. This type of Convolutional Neural Network outputs a high-resolution prediction map, where each pixel is classified as depicting metal or non-metal, rather than classification over a full image. Since image data used for training can display other materials, the ability to do segmentation instead of global classification is a prerequisite of the learning model.

Data augmentation was applied to enable more extensive training on the available training data. In a study on biomedical segmentation by Dong et al. [4], elastic warp is used to deform training data [20]. Elastic distortion changes the way the edges of an object appear. Metal objects typically have smooth edges, not irregular or wavy edges. Therefore, it was chosen not to apply elastic warp to deform the training images but rather to shear them. When shearing, images are slanted horizontally and vertically, but the edges stay intact. Rotating, flipping and adjusting the brightness are also means of augmenting the data without harming typical metal attributes.

As mentioned above, a variation on the MINC data set was created by cropping the images and masks. The original characteristic of MINC is that metal objects are not the main subject of images but rather part of a broad scene. The MINC images and masks were cropped to the objects of interest to study the influence of a metal object’s surroundings. It can then be analysed whether parts of an image around the object of interest interfere with training a model.

The additional XRAYS data set was constructed to have a means of comparison to the other data sets. Ground truth masks were created by making X-rays of objects containing metal. The data set's annotation approach is unique, as the other data sets are annotated based on the human perception of metal. This unique feature opens up new points of comparison and a more extensive analysis.

In the experiment, the U-Net architecture was trained on FMD, MINC, MINC cropped, and XRAYS separately. Each of the four trained models was tested separately on a few test images of each data set. In other words, a model trained on one data set resulted in four different scores, based on the four different test sets from FMD, MINC, MINC cropped and XRAYS. In this way, the impact of a single data set can be analysed in relation to a model's metal recognition performance on a variety of unseen images. The models are thus tested both on data sets that share characteristics and on contrasting data sets. The experiment was repeated for combinations of FMD, MINC, MINC cropped and XRAYS. When combining the sets, it can be determined which characteristics partly cancel out another set's limitations, or how characteristics of different data sets can conspire to make accurate predictions. The data sets are combined in proportion and by using their original sizes (see Section 4.4). These results can be compared to see how strongly the data sets' features influence the learning model's outcome and whether the ratio between data sets is influential.

3.1 Performance Measurements

The Sørensen-Dice similarity coefficient [3][16] is a suitable metric for measuring image segmentation performance. It calculates the pixel-wise similarity between two binary images, in this case the predicted segmentation map and its ground truth. The parts predicted as metal are compared to the ground truth metal parts (the intersection between the two, see Formula 4) and observed in relation to the sum of all pixels labelled as metal in both the prediction map and the ground truth mask. The formula returns a value between 0 and 1, which can be rewritten as a percentage. These percentages are then an intuitive indicator of how well metal parts were predicted.

dice(A, B) = 2 · |A ∩ B| / (|A| + |B|)

Figure 4: The formula for calculating the dice coefficient over image A and image B.

Accuracy is a primary evaluation metric used in machine learning applications. Pixel accuracy is defined as the percentage of pixels classified correctly over all pixels in the image; thus it considers every pixel labelled as metal, as well as every pixel labelled as non-metal. However, accuracy is not an appropriate metric for a situation where predicted image segmentations are compared to their ground truths. Consider the MINC data set, where on average 5% of the pixels in ground truth masks are classified as metal. If a prediction mask from the model were 100% black, the pixel accuracy would still be 95%, which is the proportion of pixels that it classified correctly (namely the black pixels). This example shows that high pixel accuracy does not necessarily represent a strong segmentation ability. The dice coefficient solely considers pixels labelled as metal and compares this information with pixels classified as metal in the ground truth mask. Therefore, it is a much more intuitive evaluation metric that illustrates well how accurately a model can detect metal in images [18]. In the situation sketched above, the dice coefficient would approach a value of 0, which is much more informative.
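For concreteness, a minimal NumPy sketch of this computation is shown below; the thesis does not publish its implementation, so the function and variable names here are illustrative.

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Pixel-wise Dice similarity between two binary masks (values 0/1).

    eps keeps the ratio defined when both masks are empty.
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection) / (pred.sum() + truth.sum() + eps)

# The all-black prediction from the example above scores close to 0, not 0.95:
# dice_coefficient(np.zeros((256, 256)), ground_truth_mask)
```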

4 Experimental Set-up

The following set-up was constructed to answer research question 1.3 on how accurately a U-Net can detect metal in images from various data sets.

4.1 Data Generation

In this section, the preprocessing steps applied to the data sets are described in Subsections 4.1.1, 4.1.2 and 4.1.4.1. The remaining data processing steps are covered in Sections 4.1.3 and 4.1.4.

4.1.1 General Preprocessing

Every image from the data sets is normalized as a general preprocessing step. From each pixel, the image’s mean is subtracted, after which the result is divided by the image’s standard deviation. Every image and mask is resized to a smaller, square shape of (256, 256). As a consequence, the images are then shaped (256, 256, 3). These are three-channel RGB images. The masks are shaped (256, 256, 1), thus one-channel, greyscale images. All pixels in the masks are rescaled from the range of [0,255] to [0,1]. This rescaling step is required for the combination of a U-Net with dice coefficient loss.
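A sketch of these steps with OpenCV and NumPy is given below; the thesis specifies the operations but not the code, so the helper names and the nearest-neighbour resizing of masks are assumptions.

```python
import cv2
import numpy as np

def preprocess_image(image):
    """Normalize an RGB image (zero mean, unit variance per image) and
    resize it to the (256, 256, 3) shape fed into the U-Net."""
    image = image.astype(np.float32)
    image = (image - image.mean()) / (image.std() + 1e-7)
    return cv2.resize(image, (256, 256))

def preprocess_mask(mask):
    """Resize a greyscale mask and rescale its pixels from [0, 255] to
    [0, 1], as required for the dice coefficient loss."""
    # Nearest-neighbour interpolation keeps the mask strictly binary.
    mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
    return (mask.astype(np.float32) / 255.0)[..., np.newaxis]  # (256, 256, 1)
```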

4.1.2 Preprocessing FMD

For the sake of metal recognition, as opposed to general material classification in FMD, solely the images and masks from the metal category were used for training and testing the model. Out of the 100 available metal images and masks, one mask did not match the metal parts in its image. Therefore, this pair was removed from the data set to ensure more accurate training on the set.

4.1.3 MINC and MINC cropped

The MINC data set has undergone the general preprocessing steps discussed above. These images were used for training and are referred to in this thesis as simply the MINC data set. The images from the MINC data set were also cropped to the area of interest only. The area of interest is cropped by a square bounding box, where the shape of the box is based on the largest horizontal or vertical distance between white pixels. This additional variation on the original MINC data set is referred to as MINC cropped. Both the original MINC data set and the cropped set were used for training and testing the model.
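One possible reading of this cropping rule is sketched below; the exact implementation is not given in the thesis, so the centring and border clamping choices are assumptions.

```python
import numpy as np

def crop_to_interest(image, mask):
    """Crop an image/mask pair to a square bounding box around the white
    (metal) pixels, sized by the largest horizontal or vertical extent."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:                       # no metal labelled: leave untouched
        return image, mask
    side = max(xs.max() - xs.min(), ys.max() - ys.min()) + 1
    cy = (ys.min() + ys.max()) // 2        # centre of the white region
    cx = (xs.min() + xs.max()) // 2
    y0 = max(cy - side // 2, 0)
    x0 = max(cx - side // 2, 0)
    y1 = min(y0 + side, mask.shape[0])
    x1 = min(x0 + side, mask.shape[1])
    return image[y0:y1, x0:x1], mask[y0:y1, x0:x1]
```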

4.1.4 X-ray Data Set

For the X-ray approach, a surgical C-arm machine was used from one of the operating theatres in the Sint Franciscus Gasthuis hospital in Rotterdam. In total, 31 household objects were collected, ranging from kitchenware and utensils to items from a toolbox. Of these, 12 items consist of only metal, and 19 items are fabricated of metal and other materials (such as glass, wood or plastic). The items were placed on a disk on top of the X-ray tube (for a reference image of a surgical C-arm machine with its labelled components, please consult Figure 14 in the Appendix). This disk has a diameter of 19 cm and is located at a distance of 35 cm from the image intensifier. At the display monitor, the received X-rays were saved to an external USB stick. A photograph was taken of every object from the same position as the image intensifier. For this, a 40-megapixel camera was used. The images were taken with a shutter speed of 1/120 s, an ISO of 80, a focal length of 27 mm and an F-stop of f/1.6. As a result, images and corresponding X-rays were retrieved for 31 objects.

4.1.4.1 Preprocessing XRAYS

The images and X-rays were loaded into Adobe Photoshop to precisely map them to each other and resolve the slight disparity between the images and X-rays. The text stamped onto the X-rays by the C-arm was removed, and the X-rays were further preprocessed by thresholding each pixel to create a binary map. A threshold value t was manually set at t = 0.5, meaning that every pixel greater than the constant t resulted in a white pixel in the final segmentation map. The resulting masks are shown in Figure 5.


Figure 5: Examples of mapped images, mapped X-rays and their corresponding mapped masks from the X-rays data set.
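The thresholding step amounts to a single comparison per pixel. A minimal sketch, assuming 8-bit greyscale X-rays so that t = 0.5 corresponds to a pixel value of about 127:

```python
import numpy as np

def xray_to_mask(xray, t=0.5):
    """Binarize a registered X-ray: pixels brighter than the threshold t
    (metal lights up in the X-ray) become white (1), the rest black (0)."""
    scaled = xray.astype(np.float32) / 255.0   # assume 8-bit greyscale input
    return (scaled > t).astype(np.uint8)
```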

4.2 Data Augmentation

The data was augmented to feed the U-Net more training data. Firstly, the training data was flipped horizontally and vertically, and the original sample was kept as well. The resulting images and masks were rotated along three random angles in multiples of ten degrees. Then, random shearing was applied three times to the image data, where borders were reflected at the edges. For more details on the data augmentation applied, please consult Table 3. The stated modifications were done with Python's OpenCV image processing library. A random brightness shift was applied to each image while loading every batch, as the Keras Image Data Generator was used to load data into the model.

Method       Range
Flip         Horizontally, vertically
Rotation     10°, 20°, ..., 360°, i.e. i ∈ {10k | k ∈ N, 1 ≤ k ≤ 36}
Shear        j ∈ [−6, 6] where j ∉ [−2, 2]
Brightness   γ ∈ [0.8, 1.2]

Table 3: Summary of data augmentation methods, where γ is defined as the shift in brightness of the outputs, j is the amount and direction of shear, and i is defined as the rotation angle in degrees.

Since each sample was kept in three flip variants (original, horizontal, vertical), then rotated three times, and afterwards sheared three times, the amount of data was increased 27 times (3 · 3 · 3 = 27). This augmented the data sets to the sizes declared in Table 4.

Amount   FMD    MINC   MINC cropped   XRAYS
None     94     223    223            26
27x      2538   6021   6021           702

Table 4: Data set sizes before and after 27x data augmentation.
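The rotation and shear steps could look roughly as follows with OpenCV. The thesis names the library but not the code, and treating the shear parameter j from Table 3 as tenths of the image width is an assumption.

```python
import random
import cv2
import numpy as np

def warp_pair(image, mask, M):
    """Apply an affine warp to an image/mask pair, reflecting borders at
    the edges; nearest-neighbour keeps the mask binary."""
    h, w = image.shape[:2]
    return (cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT),
            cv2.warpAffine(mask, M, (w, h), borderMode=cv2.BORDER_REFLECT,
                           flags=cv2.INTER_NEAREST))

def random_rotation(image, mask):
    """Rotate by a random multiple of 10 degrees (Table 3)."""
    angle = 10 * random.randint(1, 36)
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return warp_pair(image, mask, M)

def random_shear(image, mask):
    """Horizontal shear with j in [-6, 6] but outside [-2, 2] (Table 3)."""
    j = random.choice([-1, 1]) * random.uniform(2, 6)
    M = np.float32([[1, j / 10.0, 0], [0, 1, 0]])   # j read as tenths
    return warp_pair(image, mask, M)
```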

4.3 U-Net Based Deep Convolutional Networks

The U-Net architecture is inspired by Ronneberger et al. [11]. More specifically, the implementation by Zhixuhao [21] was used as a reference. It consists of an encoding path and a decoding path, as shown in Figure 6. The encoding path consists of five convolutional blocks. A convolutional block contains two convolutional layers, each followed by a batch normalization layer. The convolutional layers have a filter size of 3 × 3 and use a ReLU activation function. The blocks are followed by a max-pooling layer with stride 2 × 2, except for the fifth block, since this is where the decoding path starts. The decoding path consists of four deconvolutional blocks, each with two convolutional layers with a filter size of 3 × 3. Instead of max-pooling layers, each block is now followed by an up-convolutional layer. This layer doubles the size of the feature maps. In every block, the encoding path's feature maps are concatenated with the up-convolutional feature maps (portrayed by grey horizontal arrows in Figure 6). The two convolutional layers in each block reduce the number of feature maps from this concatenation. Lastly, a 1 × 1 convolutional layer is applied to map the final feature maps to a binary segmentation image. Every convolutional layer used 'same' padding and a he_normal initializer. No fully connected layer is present in the architecture, since it is designed to be a fully convolutional network. For more details and a more intuitive view of the U-Net, please refer to Figure 6.


Figure 6: U-Net architecture used for training and predicting on data sets [11].
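A minimal Keras sketch of the blocks described above is given below. It follows the description in this section rather than the exact reference implementation by Zhixuhao [21], and the sigmoid on the final 1 × 1 convolution is an assumption made for the binary masks.

```python
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions (ReLU, he_normal, same padding), each
    followed by batch normalization."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same",
                          kernel_initializer="he_normal")(x)
        x = layers.BatchNormalization()(x)
    return x

def encoder_block(x, filters):
    """Convolutional block followed by 2x2 max-pooling; the pre-pooling
    feature maps are kept as the skip connection."""
    skip = conv_block(x, filters)
    return layers.MaxPooling2D(pool_size=2)(skip), skip

def decoder_block(x, skip, filters):
    """Up-convolution doubles the feature-map size, then the encoder's
    feature maps are concatenated in (the grey arrows in Figure 6)."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same",
                               kernel_initializer="he_normal")(x)
    x = layers.concatenate([x, skip])
    return conv_block(x, filters)

# Final layer: layers.Conv2D(1, 1, activation="sigmoid") maps the feature
# maps to the binary segmentation image.
```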

4.4 Experiment Specifications

For testing purposes, five images were taken from every data set, as shown in Table 5. An equal number of test images is taken from each data set, since the test sets should be proportionate for an accurate comparison between each trained model's test scores. The specific choice for five images is due to the small size of the X-ray data set. Of the remaining data, 90% is used for training, and the other 10% is used for validation. The U-Net requires an optimization method to minimize the cost in relation to its parameters. The adaptive moment estimator (Adam) was employed to estimate the parameters. The network was trained for 100 epochs on batches of 16 images (see Table 6). For the sake of memory efficiency, the Keras Image Data Generator was used to load one batch of images at a time. In this way, memory was only allocated for the number of images per batch instead of for all training data. The data was shuffled after each epoch to take random data for training each time. The data augmentation step where brightness is randomly adjusted also takes place in the Image Data Generator.
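Put together, Table 6 translates into roughly the following Keras training configuration. This is a sketch under stated assumptions: `unet` stands for the model from Section 4.3, and `train_images` and `train_masks` for the preprocessed arrays; none of these names come from the thesis.

```python
import tensorflow.keras.backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def dice_loss(y_true, y_pred, eps=1e-7):
    """1 minus the (soft) dice coefficient, minimized during training."""
    intersection = K.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (
        K.sum(y_true) + K.sum(y_pred) + eps)

# `unet` is assumed to be the model assembled from the blocks in Section 4.3.
unet.compile(optimizer=Adam(learning_rate=1e-4), loss=dice_loss)

# Brightness is shifted while loading each batch (Table 3); masks are passed
# as labels and therefore stay untouched by the brightness augmentation.
generator = ImageDataGenerator(brightness_range=(0.8, 1.2))
unet.fit(generator.flow(train_images, train_masks, batch_size=16,
                        shuffle=True),
         epochs=100)
```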

The U-Net training was executed on the single data sets, i.e. FMD, MINC, MINC cropped and XRAYS, as well as on combinations of these (see Table 7). A distinction was made for training on combined data sets. When combining two data sets in proportion, the smallest data set defined the number of samples to use from the other data set. When a data set is combined with the small XRAYS data set, many of its data samples are therefore not used for training. The data sets are also combined while keeping their original sizes, where they are merged without considering the ratio between the two sets. Both of these combinations were used for training.

The test sets were constructed from the single data sets only, thus FMD, MINC, MINC cropped and XRAYS. The dice scores were calculated based on a model's segmentation performance on these test sets. An additional overall score was calculated by taking the mean over the results on the test sets. In this way, conclusions can be drawn at the data set level and about the ability to generalize over several data sets.

Amount                  FMD   MINC   MINC (cropped)   XRAYS
Original                99    228    228              31
Test images             5     5      5                5
Excluding test images   94    223    223              26

Table 5: Amount of data set aside for testing.

Parameter       Value
Epochs          100
Batch size      16
Learning rate   Adam (lr = 0.0001)
Image size      256 × 256 × 3

Table 6: Training parameters used for training the U-Net.

Single     FMD; MINC; MINC (cropped); XRAYS
Combined   In proportion: FMD & MINC, FMD & XRAYS, MINC & XRAYS
           Original sizes: FMD & MINC, FMD & XRAYS, MINC & XRAYS

Table 7: Data sets and combinations of data sets used for training, where in proportion signifies a corresponding number of training samples between the two sets, and original sizes signifies that the original sizes of the data sets were used.

5 Results

In this section, research question 1.3 is answered. Models were trained on single and combined data sets. These models were tested on the four separate test sets, and their dice scores (calculated with Formula 4) are shown below in Table 8. The overall score column on the right is used to determine how well a model has generalized over all test sets.

                                   Tested on
Trained on            FMD     MINC    MINC (cropped)   XRAYS   Overall score

Single
FMD                   0.794   0.276   0.396            0.534   0.500
MINC                  0.044   0.146   0.101            0.123   0.104
MINC (cropped)        0.183   0.005   0.313            0.009   0.128
XRAYS                 0.669   0.243   0.410            0.569   0.473

Combined, in proportion
FMD & MINC            0.578   0.382   0.256            0.430   0.412
FMD & XRAYS           0.713   0.272   0.536            0.709   0.558
MINC & XRAYS          0.330   0.165   0.369            0.514   0.345

Combined, original sizes
FMD & MINC            0.513   0.374   0.186            0.494   0.392
FMD & XRAYS           0.648   0.240   0.535            0.618   0.511
MINC & XRAYS          0.248   0.175   0.258            0.409   0.273

Table 8: U-Net results in dice scores, where in proportion signifies a corresponding number of training samples between the two sets, and original sizes signifies that the original sizes of the data sets were used.

Figure 7: A bar plot visualizing the results from Table 8 obtained by training on single data sets (dice scores per train set, grouped by test set).


Figure 8: A bar plot visualizing the results from Table 8 obtained by training on combined data sets in proportion (dice scores per train set, grouped by test set).

Figure 9: A bar plot visualizing the results from Table 8 obtained by training on combined, original-sized data sets (dice scores per train set, grouped by test set).

6 Analysis

The following section discusses research question 1.4 on how a data set's characteristics and limitations can impact a U-Net's performance. For clarity, Table 1 is repeated below. First, the results retrieved from training on single data sets are analyzed in Section 6.1. Then the results from training on combined data sets are analyzed in Section 6.2. Lastly, the characteristics C1 ... C8 from Table 1 are discussed separately in Section 6.3.


Characteristics                                       FMD   MINC   MINC (cropped)   XRAYS
C1 Occlusion                                                                          ×
C2 >50% metal (on average)                             ×
C3 Clear distinction between objects of interest
   and backgrounds                                     ×                              ×
C4 Objects of interest blend in with background              ×      ×
C5 Objects of interest are fully made up of metal      ×
C6 Occurrences where the full image portrays metal
   (ground truth mask is fully white)                  ×
C7 Ground truth masks constructed based on
   human perception                                    ×     ×      ×
C8 Data set size >100 images and masks                 ×     ×      ×

Table 1: For clarification, Table 1 is shown again in this section. It shows the characteristics of the data sets, where × is used when a characteristic holds for a data set and C1 ... C8 identify each characteristic.

6.1 Trained on Single Data Sets

This section discusses how the observed features of single data sets relate to the results shown in Section 5. Each subsection analyses training a U-Net on a single data set, thus FMD, MINC, MINC cropped and XRAYS.

6.1.1 Trained on FMD

The results show that a model trained on FMD scores well on its own test set and decently on XRAYS (0.794 and 0.534 respectively), compared to the other scores shown in Table 8. When looking at the specific test images, FMD scores especially well on images where the colour and texture of the metal object in the foreground are distinguishable from the background (see Figure 10).


Figure 10: Image results of metal objects against a clear background, where the model was trained on FMD.

Most images in FMD's training set are full metal objects against a background, as was discussed in Section 2.1. Therefore, it seems that FMD is good at simply segmenting objects from their backgrounds, since this is often true for the data it is trained on. This notion also explains the low scores when testing on the MINC data set (uncropped). MINC typically contains images of large scenes, where the metal objects are present in a messy background that shares similar colours, textures and shapes with the metal objects. In other words, image data from MINC is dissimilar from FMD training data. MINC cropped is more comparable to FMD, as a larger part of the image is metal. When a model trained on FMD was tested on MINC cropped, it yielded higher results. However, the predictions were not necessarily more accurate. On both sets, prediction maps were almost entirely white. Too much of the image was predicted to be metal, as shown in Figure 11. In the FMD training data, various samples are images fully depicting metal, which could explain the tendency of a model trained on FMD to predict images as fully metal. Simply because the cropped MINC images contain more metal, the dice scores for MINC cropped are higher than for the original MINC set. FMD is thus accurate for object segmentation, which works well on its own test data. The XRAYS test samples are also objects against a simple background, but testing on them yielded slightly lower results than testing on FMD. The XRAYS data set items contain other materials besides metal, unlike FMD, which could confuse the model when trained on FMD (see Table 1). Thus, FMD does a decent job at object detection but not at practically recognizing metal objects. It predicts poorly on objects situated in a cluttered image.

Figure 11: Image result samples when tested on MINC (l) and MINC cropped (r), where the model was trained on FMD.


6.1.2 Trained on MINC

Figure 12: Image result samples when tested on MINC (l) and FMD (r), where the model was trained on MINC.

It is observed in Section 2.2 and Table 1 that humans annotated the MINC data set poorly. From both Table 8 and Figure 7, it is observed that MINC scored very poorly on every test set, with an overall dice score of 0.104. As shown in Table 2, only 5.42% of pixels in the image data from MINC represent metal. Therefore, it is likely that a model trained on MINC learns to predict that merely small parts of images are metal. This behaviour could be adequate if the model still accurately predicted the smaller metal parts in, for instance, MINC's own test set. This test set is also where it scored the highest out of all test sets. However, it is still a very disappointing score of 0.146. When only labelling a small part of an image as metal, chances are very high that the prediction misses the actual metal part, which immediately results in a low score. The model trained on MINC scored especially low on FMD, which contains over 80% metal pixels on average. The same issue as mentioned above could explain this: only small segments of the image are labelled. Examples of this statement are shown in Figure 12. Here a small part of the fridge from MINC in the left image was predicted, and in general small parts were labelled as metal in an image from FMD on the right.

6.1.3 Trained on MINC Cropped

In Section 2.3, MINC cropped was described as sharing the same features as the original MINC data set, except for the full, cluttered background. The objects depicted are still mostly grey kitchenware, and small parts of messy backgrounds remain. However, the central part of the image is now the metal object of interest. The experiment results show that a model trained on MINC cropped scores higher on its own test data and on FMD than a model trained on the original MINC data set. MINC cropped is more comparable to FMD, as both contain larger metal segments than the original MINC set. It scores very low on the original MINC data set, which in general depicts small metal segments; it is assumed that this is for the same reason as mentioned in Section 6.1.2. A model trained on data where a metal object is the main part of the image is assumed to score best on unseen data. MINC cropped obtained poor results when testing on XRAYS, a dice score of 0.009. This would seem to disprove the statement, since in both MINC cropped and XRAYS around 25% of pixels are labelled as metal. However, this could be because metal objects in MINC cropped still find themselves within a cluttered context, without clear separation from their backgrounds. This makes training on them complicated and very different from images from the XRAYS data set.


6.1.4 Trained on XRAYS

Figure 13: Image result samples when tested on XRAYS (l) and MINC (r), where the model was trained on XRAYS.

In the XRAYS data set, it was observed that there is a clear distinction between objects of interest and their backgrounds (Table 1), similar to FMD. The experiment results show that a model trained on the XRAYS data set scores similarly to a model trained on FMD. It scores high on test sets with a clear distinction between object and background, and low on test sets where objects are in chaotic contexts, as shown in Figure 13. It is assumed that this is for the same reasons as stated above for training a model on FMD. As indicated in Table 1, objects from the XRAYS data set contain other materials besides metal, and the ground truth masks deal with occlusion. It is assumed that this confuses the model when predicting on test sets where these properties are not present in the data, such as FMD and MINC. A model trained on the XRAYS data set scores somewhat lower than FMD on test data from FMD and MINC, which underpins this statement. Training on XRAYS yields a higher score on the XRAYS test set than training on FMD did. During training, a model takes into account occlusion and objects composed of several materials, which is necessary for accurate segmentation of XRAYS test samples. A model trained on ground truth masks where occlusion plays a role also detects metal parts in images when they are not visible to the human eye, such as in Figure 13 (l) for the scissors. However, it is likely that, similar to FMD, XRAYS mostly applies object detection rather than material recognition. The ground truth masks then confirm the prediction of invisible parts, and therefore these occluded parts of the object are 'correctly' labelled as metal as well.

6.2 Trained on Mixed Data Sets

This section discusses how the analyzed characteristics of data sets come together and relate to the results shown in Section 5. Each subsection is an analysis of training a U-Net on a combination with one of the data sets in proportion, thus x & MINC, x & FMD, and x & XRAYS. These results are then compared in Section 6.2.2 to training a U-Net on combinations of data sets where their original sizes were preserved.

6.2.1 In Proportion

In general, what can be observed in the results is that the spread between the overall scores is smaller than for the single data sets in Table 8. Combinations of data sets do not score as exceptionally well or as exceptionally badly overall as single data sets do. Furthermore, training on a combination of two data sets, for instance FMD and MINC, shows that the model scores better when testing on FMD than when it was trained on just MINC. Logically, when testing on a specific data set, including images of this data set in the training data will improve the model's performance; the model is then already familiar with specific aspects of this data set.

6.2.1.1 Trained on a Data Set x with MINC

Training on a data set together with MINC (thus FMD & MINC or MINC & XRAYS) shows that testing on MINC and MINC cropped improves compared to training on MINC only (dice scores of 0.382 and 0.256 vs 0.146 and 0.101 respectively). The model is trained together with data sets that have already obtained a higher score than a MINC-trained model when testing on MINC and MINC cropped (see the single rows for FMD and XRAYS in Table 8). Training the model on a combination with MINC does, however, deteriorate the testing scores on FMD and XRAYS compared to the scores yielded when a model is trained on only FMD or XRAYS. Based on these observations, it is assumed that FMD and XRAYS have characteristics that enable better and more generalizable metal recognition than MINC. Characteristics that XRAYS and FMD share are the larger amounts of pixels per image that represent metal, as well as a clear distinction between objects and their backgrounds. MINC does not share these features, and it seems that training on MINC confuses the model rather than improving metal recognition.

6.2.1.2 Trained on a Data Set x with FMD

Training on a data set together with FMD (thus FMD & MINC or FMD & XRAYS) generally increases the score on the prediction of the test sets, compared to training on only MINC or XRAYS. However, scores do not increase when testing the combined model on FMD itself, compared to the score of training and testing on the single FMD set (0.794), which is the best-obtained score of all trials. From this, it is assumed that FMD is a robust data set for training a segmentation model. FMD’s property of a clear depiction of the metal object is considered to be a vital characteristic for learning metal attributes, even when testing on more cluttered images.

6.2.1.3 Trained on a Data Set x with XRAYS

In general, training on a data set together with XRAYS (thus FMD & XRAYS or MINC & XRAYS) slightly improves the scores on the test sets compared to single data sets. The combination of MINC and XRAYS obtains higher results on every test set than solely training on MINC. Training on FMD and XRAYS achieves similar results to training on single FMD or single XRAYS when testing on MINC. However, the score when testing on XRAYS increases from 0.534 (for training on FMD) and 0.569 (for training on XRAYS) to 0.709 (for training on FMD & XRAYS). Training samples of FMD are more comparable to those of XRAYS, and therefore it makes sense that by learning from this extra and diverse data from FMD, the XRAYS test images can be more accurately segmented by the model. The XRAYS data set seems to have the same power as FMD to generally increase the prediction scores on test sets. A logical explanation could be that FMD and XRAYS both have the characteristics of objects clearly positioned against a background and large parts of the images representing metal. FMD still scores better on its own test set when trained on just FMD. The XRAYS data set seems to confuse the model slightly in this case, likely because of occlusion and a lack of diversity in objects in the training samples.

6.2.2 Original Sizes

In general, the statements made in Section 6.2.1 also hold for the combinations of data sets when preserving their respective original sizes. To summarize, training on a data set in combination with MINC improves segmentation on MINC's test samples (compared to training on single MINC) but highly confuses the model for other test sets. When training in combination with FMD, test scores on other data sets increase, but for FMD itself, the combination is detrimental. Lastly, when training on a data set combined with XRAYS, overall scores increase slightly and prediction on MINC test sets is improved.

The combinations of training sets keeping their original sizes overall score lower compared to the combinations of training sets in equal proportion. This is clearly visible when comparing Figures 8 and 9. The larger data set of the two has the most influence when working with their original sizes. For example, MINC is larger than the XRAYS data set, which is why the combination of the two sets scores better in proportion than by using their original sizes.

The combination of the original sizes of FMD and MINC scores lower on every test set than a combination of them in proportion, with the exception of the XRAYS test set. MINC has more power in this proportion, as it is the largest data set. From the lower scores compared to a model trained on the sets in proportion, it can be assumed that MINC confuses the model more when it has more influence on training.

When combining FMD and XRAYS in their original sizes, the scores are comparable to the scores yielded by combining them in proportion. FMD is much larger than XRAYS, which negatively influences testing on the XRAYS test set. It is assumed that this is caused by XRAYS being underrepresented relative to FMD in the training data (from Table 4, a ratio of 2538:702 (FMD:XRAYS) is observed after data augmentation).

The combination of MINC and XRAYS scores lower when using their original sizes compared to combining them in proportion. Again, the MINC data set is much larger than XRAYS (6021:702, as can be read from Table 4). Since it has been discussed that MINC is in general not a robust data set for training, because of its small metal segments and cluttered backgrounds, it is no surprise to observe worse results when training on more MINC samples. The model scores slightly better on MINC's own test data than the combination in proportion, i.e. a dice score of 0.175 instead of 0.165. It could be the case that the left-out images that were not used when training in proportion, but were added when training on all MINC images, carried interesting information for predicting better on unseen MINC images.

6.3 Characteristics Overview

The statements made about the data sets and their results in the analysis above relate to the eight characteristics defined in Table 1. Therefore, in order to concisely answer research question 1.4, the characteristics are discussed separately below on whether they improve, deteriorate, or do not influence a model's performance on metal recognition.

• C1: Occlusion from invisible metal parts (from X-rays) in data samples does not influence a model's metal recognition performance to a great extent.

• C2: Large representations of metal in data samples have been shown to improve a model's metal recognition performance.

• C3: A clear distinction between the object of interest and the background has been shown to improve a model's metal recognition performance significantly.

• C4: Objects of interest blending in with their background in data samples confuse a model and negatively influence its metal recognition performance.

• C5: Image data samples in which objects are fully made up of metal have been shown to improve a model's metal recognition performance.

• C6: Training on images fully depicting metal has been shown to make a model biased. The model then works decently on images illustrating large segments of metal but worse on images illustrating small segments of metal.

• C7: A data set that contains ground truth masks constructed based on human perception only confuses a model when the ground truth segmentations are done poorly.

• C8: A data set larger than 100 images and masks does not necessarily influence a model's metal recognition performance.


7 Conclusion

This paper analysed the influence of eight characteristics of data sets on a model's metal segmentation performance. The Flickr Material Database (FMD) [13], the Materials in Context database (MINC) [2], a modified, cropped version of the Materials in Context database, and an X-ray data set created for this project were used in various ways as training data and test data. The main research question was: Which characteristics and limitations of data sets affect the performance of a U-Net on metal recognition? (1).

All in all, image data with large metal segments (C2), a clear distinction between the object of interest and its background (C3), and the depiction of full metal objects (C5) have been shown to influence a model's performance on metal recognition positively. These are characteristics of the strongest data sets for training, FMD and XRAYS. Characteristics of image data such as objects blending in with their cluttered surroundings (C4) and data poorly labelled by human observers (C7) negatively influence metal recognition by a U-Net. These characteristics belong to the MINC and MINC cropped sets. The size of the data set (C8), the occurrence of occlusion in ground truth masks (C1), and the occurrence of images fully depicting metal (C6) do not necessarily affect metal recognition by a U-Net, or show mixed results. These were described as limitations of XRAYS and FMD. This paper's analysis of data sets lays the foundation for developing guidelines for creating effective data sets for automated metal recognition.

8 Discussion

In this section, decisions made in the approach of this thesis that may have influenced the results unfavorably are discussed. Firstly, five images per data set were used for testing, a number dictated by the smallest data set, XRAYS. This is quite a low number, which might have biased the final dice scores calculated over the test sets: the chance is high that the test sets do not represent their data sets accurately. For instance, in the MINC test set, two of the five images depicted metal fridges. Across all data sets combined, the test data does show more variety. Secondly, a U-Net was chosen for its end-to-end setting: an image goes in, and a prediction map comes out. It was, however, originally created for biomedical segmentation and applied to colourless images of human tissue, image data with contrasting properties compared to the data used for the purpose of this paper. The data used in this paper could therefore be too complex, which could explain the generally low dice scores.


8.1 Future Research

As stated in Section 2.1, humans are capable of identifying material categories with an accuracy of 84.9%. Applications of computer vision on FMD have shown results below 60% accuracy [13]. One explanation could be that computer vision systems lack four of the five senses that humans use for cognitive tasks such as recognizing materials. Furthermore, humans associate objects with certain materials. Physical properties of materials, such as thermal conductivity, which can be felt, or density, which can be heard when the object is touched, cannot be derived from the visual aspects of materials alone. Recognition should therefore probably not be based solely on the visual aspects of materials, but combined with another approach. For further research, it could thus be valuable to combine the concept of object recognition with material recognition. A similar U-Net approach could be used with images and segmentation maps. When combining the original U-Net principle with object recognition, an object's label could be included along with information on which materials are often associated with that object label. A deeper understanding of the object's identity could form the basis for further progress in material segmentation. Moreover, more segmentation methods can be applied before attempting semantic segmentation: based on the simple, non-local material properties discussed in the introduction (colour, illumination and shapes), rough segments of an image can be created as preparation for final material segmentation, as sketched below.
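As an illustration of such a rough pre-segmentation step, the sketch below groups pixels into colour-coherent regions with SLIC superpixels from scikit-image before any semantic labelling. The library choice, parameter values and input file name are illustrative assumptions, not part of the experiments in this thesis.

from skimage.io import imread
from skimage.segmentation import slic

image = imread("sample.jpg")  # hypothetical input image
# Group pixels into roughly 100 colour-coherent regions as a first segmentation.
segments = slic(image, n_segments=100, compactness=10, start_label=1)
# 'segments' assigns every pixel a region label; each coarse region could then
# be classified as metal or non-metal instead of labelling pixels one by one.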

Research concerning data sets can be extended by training a model on one data set, saving the resulting weights, and using these as the starting point for training on new data. With this pipeline, a model is exposed to various data sets, which might make it generalize better and leave it well trained for material recognition.
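A minimal Keras sketch of such a pipeline is given below, assuming a build_unet constructor; the constructor, the data arrays and the weight file name are hypothetical placeholders for the thesis' actual model and data.

# Hypothetical sequential-training pipeline; build_unet, the data arrays
# and the weight file name are assumptions for illustration.
model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(fmd_images, fmd_masks, epochs=20)   # train on the first data set
model.save_weights("unet_fmd.h5")             # persist the learned weights

next_model = build_unet()                     # fresh model, same architecture
next_model.compile(optimizer="adam", loss="binary_crossentropy")
next_model.load_weights("unet_fmd.h5")        # resume from the saved weights
next_model.fit(minc_images, minc_masks, epochs=20)  # continue on a new data set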

References

[1] Edward H. Adelson. "On seeing stuff: the perception of materials by humans and machines". In: Human Vision and Electronic Imaging VI. Ed. by Bernice E. Rogowitz and Thrasyvoulos N. Pappas. Vol. 4299. International Society for Optics and Photonics. SPIE, 2001, pp. 1–12. doi: 10.1117/12.429489. url: https://doi.org/10.1117/12.429489.

[2] Sean Bell et al. "Material Recognition in the Wild with the Materials in Context Database". In: Computer Vision and Pattern Recognition (CVPR) (2015).

[3] Lee R. Dice. "Measures of the Amount of Ecologic Association Between Species". In: Ecology 26.3 (1945), pp. 297–302. doi: 10.2307/1932409.

[4] Hao Dong et al. "Automatic Brain Tumor Detection and Segmentation Using U-Net Based Fully Convolutional Networks". In: Communications in Computer and Information Science (2017), pp. 506–517. doi: 10.1007/978-3-319-60964-5_44.

[5] Roland W. Fleming. "Visual perception of materials and their properties". In: Vision Research 94 (2014), pp. 62–75. doi: 10.1016/j.visres.2013.11.004.

[6] FloatScans. 2020. url: https://www.floatscans.com/.

[7] C. Liu et al. "Exploring features in a Bayesian framework for material recognition". In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2010, pp. 239–246. doi: 10.1109/CVPR.2010.5540207.

[8] Jonathan Long, Evan Shelhamer, and Trevor Darrell. "Fully Convolutional Networks for Semantic Segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[9] LBN Medical. When to Use Mobile C-arm Machines? (Guide). 2019. url: https://lbnmedical.com/when-to-use-c-arm-machines/.

[10] Prafful Mishra. Why are Convolutional Neural Networks good for image classification? 2019. url: https://medium.com/datadriveninvestor/why-are-convolutional-neural-networks-good-for-image-classification-146ec6e865e8.

[11] O. Ronneberger, P. Fischer, and T. Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation". In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). Vol. 9351. LNCS. (Available on arXiv:1505.04597 [cs.CV]). Springer, 2015, pp. 234–241. url: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a.

[12] Gabriel Schwartz and Ko Nishino. "Recognizing Material Properties from Images". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 42.8 (2020), pp. 1981–1995. doi: 10.1109/tpami.2019.2907850.

[13] Lavanya Sharan, Ruth Rosenholtz, and Edward H. Adelson. "Accuracy and speed of material categorization in real-world images". In: Journal of Vision 14.10 (2014).

[14] Lavanya Sharan et al. "Recognizing Materials Using Perceptually Inspired Features". In: International Journal of Computer Vision 103.3 (2013), pp. 348–371. doi: 10.1007/s11263-013-0609-0.

[15] Divyanshu Soni. Translation Invariance in Convolutional Neural Networks. 2021. url: https://divsoni2012.medium.com/translation-invariance-in-convolutional-neural-networks-61d9b6fa03df.

[16] Thorvald Sorensen. "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons". In: Kongelige Danske Videnskabernes Selskab 5.4 (1948), pp. 1–34.

[17] Christian Szegedy et al. "Going deeper with convolutions". In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

[18] Ekin Tiu. Metrics to Evaluate your Semantic Segmentation Model. 2020. url: https://towardsdatascience.com/metrics-to-evaluate-your-semantic-segmentation-model-6bcb99639aa2.

[19] Sik-Ho Tsang. Review: FCN - Fully Convolutional Network (Semantic Segmentation). 2019. url: https://towardsdatascience.com/review-fcn-semantic-segmentation-eb8c9b50d2d1.

[20] Volkan YILMAZ. Elastic Deformation on Images. 2019. url: https://towardsdatascience.com/elastic-deformation-on-images-b00c21327372.

[21] Zhixuhao. Implementation of deep learning framework – Unet, using Keras.


Appendix

C-arm Machine Components

Figure 14: A representation of a surgical C-arm machine and its labelled components [9].
