
Master Artificial Intelligence

Detection of shipping

containers in video for 3D model

reconstruction

Ruben van Erkelens

January 12, 2021

Supervisor(s): dr. Leo Dorst

Informatica

Universiteit van Amsterdam


Abstract

In this thesis a novel method for shipping container detection in video is presented. On individual frames, a deep learning segmentation network trained on shipping container data computes a probability map, from which an optimization scheme detects container faces and corners using the prior knowledge about the cuboidal shape of a container. This method is shown to be robust for various container poses and can detect container corners located outside of the image. It is compared to a state-of-the-art cuboid detector, which it beats on shipping container detection. The temporal order in video is used to disambiguate labels for faces that falsely appear as identical due to symmetry. Also, a method is presented for texturing a simple 3D model of a container, by extracting face detections, filtering on the most visually appealing ones, and applying a homography. Furthermore, detailed experiments on real data motivated the different design choices.


Contents

1 Introduction
  1.1 Relevance of this Thesis
  1.2 Related Work
    1.2.1 Container detection
    1.2.2 Cuboid detection
  1.3 Outline

2 Methodology
  2.1 Introduction
  2.2 Deep Learning segmentation
  2.3 Quadrilateral fitting
  2.4 Vanishing point constraints
  2.5 Resolving label ambiguity
  2.6 Homography
  2.7 3D container model

3 Experiments
  3.1 Data and implementation
  3.2 Container detector pipeline experiments
    3.2.1 Pipeline steps evaluation
    3.2.2 Padding evaluation
    3.2.3 Quadrilateral initialization evaluation
  3.3 Comparing the container detector
  3.4 3D model texturing

4 Conclusion
  4.1 Summary and conclusion
  4.2 Discussion and future research
  4.3 Recommendations


CHAPTER 1

Introduction

1.1 Relevance of this Thesis

In 2019, 8,781,185 shipping containers passed through the port of Rotterdam [1]. With this many containers come many liability claims for damages, which can be a major cost for ports. These claims are not always legitimate, though. It is thus desirable for ports to have evidence of whether a container was already damaged when it came in. Automated damage checking would greatly reduce the labor time involved. A first step towards automated damage detection is to detect the container in images. This focuses the attention of a model on the container rather than the background, as damage detections in the background are not desirable. Also, as ports often want each damage labeled with its location on a face, it would be beneficial to first precisely detect the faces and corners of the container. Afterwards, it should be possible for an employee to confirm the damage detections by eye. Instead of going through loads of images, or going to the physical containers, 3D model representations of the containers can quickly show any container from any angle. It is therefore beneficial to also create digital assets in the form of 3D models of containers.

Besides damage detection, container detection in images has plenty of potential use-cases for ports. It has become a trend for ports to automate their operations. Some ports go as far as being completely controlled robotically; they are often called ‘ghost ports’. Detecting faces and corners of containers is essential for these ports to function, for example for grabbing a container or reading the container code. These ports, however, often rely on expensive equipment, such as lasers, to detect containers. Detecting from images requires only cameras, which are relatively cheap. If this works, it can be a driving factor for labor-intensive ports to start automating.


1.2 Related Work

1.2.1 Container detection

There has been previous interest in container localization and detection in images. Shen et al. [6] use the color changes between container and background in the S-channel of HSV to compute horizontal and vertical histograms of gradients. From the histograms the container edges can be located, under the assumption that the background contains no objects colored similarly to the container. This method only works with a consistent camera pose, because the container needs to be aligned horizontally and vertically for the histograms, which is done by a perspective transform. Also, the method requires that containers face the camera perpendicularly. Mi et al. [5] detect container castings (the corners of the container) by applying a Support Vector Machine classifier to histogram-of-gradients features extracted from an image. They only use images with containers standing straight and facing the camera perpendicularly. This has the benefit of symmetry, which they use to quickly find the left corner casting given a right corner casting. The goal for Yoon et al. [12] is to detect the container and extract depth for an auto-landing system. For a container image, they generate candidate detections, which combine extracted lines representing container edges, found with the Hough transform, and extracted holes representing container castings, found by color changes. They then choose the detection that best fits the geometric properties of a container. Several top-view cameras then extract depth by correlation-based stereo vision. Atienza et al. [2] strive to segment containers on moving trucks with optical flow, as the container moves but the background does not. Note that, contrary to the previously mentioned methods, this method does not work on a single image, but requires several temporally spaced images.

1.2.2 Cuboid detection

Another approach for detecting containers in images is to treat it as a cuboid detection problem. A cuboid is the geometrical shape of a box, which is essentially what a container is, but note that the following methods are not exclusive to containers. Xiao et al. [9] approach the problem of cuboid detection by exhaustively testing cuboid corner proposals that are geometrically consistent with a cuboid in a bounding box window around the object. The proposals are scored by HOG descriptors and terms describing the displacement, edges, and shape of the cuboid proposal. The score components are weighted by a trained SVM. Yang et al. present CubeSLAM [10], a method for cuboid detection and multi-view SLAM. For an image, they first make proposals using a 2D object detection model, which gives a bounding box, after which they sample vanishing points which define a cuboid. Proposals are scored based on image edges. The problem of cuboid detection is approached with Deep Learning by Dwibedi et al. [4]. Their approach is based on the object detection model Faster RCNN. They change the object classes to cuboid and non-cuboid, and add a layer that regresses vertex


offsets with respect to the region of interest, which give the cuboid corner locations. The model is then used iteratively: transforming the regions of interest with the vertex offsets gives refined input features, from which refined vertex offset regressions are found.

1.3 Outline

3D model reconstruction methods that rely on feature matching tend to fail when modelling containers from an unordered set of images, as features on opposite sides may be confused due to the symmetry of the container. Also, these methods do not use prior knowledge of the shape of the container, which, to a certain extent, is always the same. The method presented in this thesis handles the symmetry issue by using video. It first detects each face of the container in video frames, then uses the temporal order of the video to resolve the symmetry issue.

To accurately detect the different faces of a container, a multi-step process was chosen, starting with pixel-wise segmentation. The segmentation network is discussed further in section 2.2. The segmentations by themselves would result in arbitrary shapes, but the shape of a container is always that of a cuboid. A post-processing step in the prediction-probability space fits quadrilaterals representing container faces (section 2.3). If more than one face is found, quadrilaterals are combined to represent a cuboid (section 2.3), after which vanishing point constraints ensure that the shape is geometrically consistent with a cuboid (section 2.4). Afterwards, ambiguous labels for faces are disambiguated given their place in the video (section 2.5), and the faces are projected using the homography matrix (section 2.6). The projected faces are used to texture a 3D model of a container, after selecting the most visually attractive textures (section 2.7). Because the container shape is a simple cuboid and container dimensions are assumed to be known, the 3D model of choice is manually defined.

The different steps and design choices of the detection pipeline are motivated with experiments in section 3.2. The method is then compared to a state-of-the-art cuboid detector in section 3.3. Lastly, experiments are done on texture selection, and results for 3D models shown and discussed in section 3.4.

The contributions of this thesis are three-fold:

1. To present a novel method for container detection in images.

2. To present a method for 3D model reconstruction and texture selection of containers.


CHAPTER 2

Methodology

2.1 Introduction

A method for detecting containers in individual video frames should be able to locate and label each corner correctly, such that ultimately these detections can be used to texture a 3D model. To accomplish this, a segmentation model which strives to segment each individual face yields a probability map. From this information-heavy probability map, an optimization scheme can find the container using prior knowledge about the shape of a container. The method should be robust to various container poses and to frames in which the container is partly outside of the image, as these conditions were prevalent in the available data. After applying the detection method on several sampled frames, face labels that are ambiguous because of container symmetry can be disambiguated. Only one texture per face is necessary to texture a simple 3D model of a container consisting of 8 vertices, but several face textures may be available given multiple frames. The challenge is to select the most visually appealing textures, which is done by a simple rule based on the container orientation.

2.2 Deep Learning segmentation

Given training data consisting of images x and ground truth labels l_gt, deep learning networks learn parameters w that minimize the negative log-likelihood on the training data:

L(w) = − E_{x, l_gt ∼ p̃_data} [ log p_model(l_gt | x; w) ],   (2.1)

with p_model(l_gt | x; w) the model prediction for the ground truth label.

For segmentation problems, the goal is to assign a semantic label l_i to each pixel i. This means that the output of the last layer, p_model(l | x; w), is a tensor of dimensions h × w × C, with h and w the height and width of the image respectively, and C the number of possible semantic labels. During inference, the output is generally chosen as a hard assignment of a label to each pixel. This is done with an argmax operation:

l_i^pred = argmax_{l_i} p_model(l_i | x; w).   (2.2)
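In code, the hard assignment of equation 2.2 is a single argmax over the channel axis of the probability map. A minimal sketch with a random map (the shapes are purely illustrative):

```python
import numpy as np

# Illustrative probability map: height 4, width 6, C = 3 labels
rng = np.random.default_rng(0)
prob = rng.random((4, 6, 3))
prob /= prob.sum(axis=-1, keepdims=True)  # normalize per pixel

# Equation 2.2: hard label assignment per pixel
labels = prob.argmax(axis=-1)
```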

There are many different design choices in segmentation networks, but most revolve around an encoder with convolution layers and a decoder. Convolution layers are beneficial in vision tasks as they keep the spatial information during the forward pass. The encoder produces an information-dense bottleneck with low-resolution features, after which the decoder projects these back to pixel space, at the same resolution as the input image. The encoders are similar in architecture to classification networks up to the bottleneck, and due to the extensiveness of classification datasets, they are often pretrained on classification and then used for segmentation tasks.

The segmentation network of choice for this thesis is DeepLab with a ResNet-101 encoder, as it has been shown to beat the state of the art on many segmentation tasks [3].

2.3 Quadrilateral fitting

The argmax operation (equation 2.2) over the prediction probabilities given by the network works on a per-pixel level. This means that the shape of the segmentations is not necessarily that of an image of a cuboid. The prior knowledge of the general shape of a container can be used by replacing the argmax operation with a quadrilateral fitting algorithm based on simple maximization. This replacement is expected to increase segmentation accuracy around the object of interest, remove artifacts, and allow for easy extraction of corner casting coordinates. The applicable prior knowledge is that an image of a container always contains either 1, 2, or 3 container faces, and that a container face can be defined as a convex quadrilateral. The vertex positions of the quadrilateral were chosen as the parameters θ of the algorithm. Optimal parameters maximize the joint probability of the pixels inside and outside of the quadrilateral. Each pixel can be modelled by a Bernoulli distribution, with p_i(l) the prediction probability for pixel i being inside the quadrilateral corresponding to label l. This gives the log-joint probability:

P = Σ_{i ∈ I} [ I_i(θ) log(p_i(l)) + (1 − I_i(θ)) log(1 − p_i(l)) ],   (2.3)

with I_i(θ) an indicator function giving 1 if pixel i is inside the quadrilateral defined by parameters θ, and 0 otherwise. P is maximized for one parameter, and thus one vertex coordinate, at a time by varying this parameter while fixing the others. This is repeated for each parameter. As the parameters are correlated, this process has to be iterated until convergence. The algorithm is defined to converge when the relative improvement in log-joint probability from before to after optimizing each parameter is below a certain threshold (chosen as 10^-3).
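This coordinate-wise maximization can be sketched as follows. The function names (`quad_mask`, `log_joint`, `fit_quad`) and the fixed step size are illustrative assumptions; the actual implementation must additionally enforce convexity and handle multiple face labels:

```python
import numpy as np

def quad_mask(h, w, quad):
    """Boolean mask of pixels inside a convex quadrilateral.
    quad: (4, 2) array of (x, y) vertices in a consistent winding order."""
    ys, xs = np.mgrid[0:h, 0:w]
    inside = np.ones((h, w), dtype=bool)
    for k in range(4):
        x0, y0 = quad[k]
        x1, y1 = quad[(k + 1) % 4]
        # a pixel is inside iff it lies on the same side of every edge
        inside &= (x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0) >= 0
    return inside

def log_joint(p, quad, eps=1e-9):
    """Log-joint probability of equation 2.3 for one face label."""
    m = quad_mask(p.shape[0], p.shape[1], quad)
    return float(np.sum(np.where(m, np.log(p + eps), np.log(1 - p + eps))))

def fit_quad(p, quad, step=1, iters=50, tol=1e-3):
    """Coordinate-wise hill climbing over the 8 vertex parameters."""
    best = log_joint(p, quad)
    for _ in range(iters):
        prev = best
        for k in range(4):          # one vertex at a time
            for d in range(2):      # x coordinate, then y coordinate
                for delta in (-step, step):
                    cand = quad.copy()
                    cand[k, d] += delta
                    s = log_joint(p, cand)
                    if s > best:
                        best, quad = s, cand
        if abs(best - prev) <= tol * abs(prev):
            break
    return quad, best
```

On a synthetic probability map that is high inside a square region, starting from a smaller quadrilateral, the hill climbing expands the vertices until they match the high-probability region.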


Figure 2.1: Given a probability map of dimensions h (height) by w (width) by C (Channels, number of labels), an output segmentation of size h by w can be found by an argmax operation. An alternative is to use prior knowledge about the quadrilateral shape of a container. This possibly improves segmentation accuracy.

Before optimization, the quadrilaterals and corresponding labels must be initialised. Firstly, the labels must be chosen. They are ordered by the sum of their prediction probabilities Σ_i p_i(l). A container can have a maximum of three visible faces, so up to three labels with the highest sums are chosen. If a sum is below a certain threshold, the label will not be chosen, to prevent artifacts from being considered a face. To initialize the quadrilaterals corresponding to the labels, four possible methods were tried, of which the latter two use the information from the probability space:

• Random: Initialize each quadrilateral as a random convex quadrilateral.

• Generic: Initialize each quadrilateral as a large rectangle in the middle of the image.

• Gradient-based: Calculate gradient histograms in x- and y-direction from the label probability map, and define a rectangle by the maxima and minima.

• Contour-based: Find a contour in the label probability map by first thresholding the map, then dilating it, after which a contour is calculated (for the contour estimation algorithm, refer to [7]). The contour is a polygon with an unspecified number of vertices. The Douglas–Peucker algorithm estimates a polygon with fewer vertices from this contour [8]. A parameter specifying the approximation accuracy is varied until the algorithm outputs a quadrilateral.

The search space can be restricted by allowing only convex quadrilaterals. This will speed up the algorithm and remove local maxima from the search space. A convex quadrilateral contains no interior angles greater than or equal to 180 degrees.


Without further improvements, the optimization would not be robust to container faces that are not fully captured in the image. A simple solution is to add padding to the matrix p(l). This makes it possible for quadrilaterals to have vertices outside of the image, such that corners that are not visible can still be located. As each pixel is modelled by a Bernoulli distribution, padded pixel values are set to 0.5, because this neither encourages nor discourages a quadrilateral from containing padded pixels.
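In code, this amounts to a single constant pad of the per-label probability map; a sketch (the padding width of 100 pixels matches the value used in the experiments of chapter 3, the map dimensions are illustrative):

```python
import numpy as np

p = np.random.rand(400, 600)  # probability map for one label (h x w)
pad = 100                     # padding width in pixels, as in the experiments

# 0.5 neither rewards nor penalizes covering padded pixels: at p = 0.5,
# log(p) = log(1 - p), so the log-joint term is the same inside or outside
p_padded = np.pad(p, pad, mode="constant", constant_values=0.5)
```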

Thus far, each quadrilateral is optimized individually, but a container is one object with connected faces. This knowledge can be used to disambiguate container casting locations and possibly improve segmentation accuracy. A simple approach is to look at all possible configurations and average the vertices from different quadrilaterals that correspond to the same corner casting in a configuration, resulting in a new set of quadrilaterals. To evaluate, the log-joint probabilities are computed for each quadrilateral in a set. These are summed (equation 2.4) and the highest value gives the best set of connected quadrilaterals.

P_sum = Σ_{Θ, l_q ∈ Q} Σ_{i ∈ I} [ I_i(Θ) log(p_i(l_q)) + (1 − I_i(Θ)) log(1 − p_i(l_q)) ]   (2.4)

A connected set of quadrilaterals might not yet be optimal though, as the landscape of the sum of log-joint probabilities has different local optima than the individual log-joint probabilities, and it is unlikely that an averaging operation yields parameters exactly at an optimum. Fortunately, it is straightforward to optimize a connected set of quadrilaterals. Each pair of connected vertices can be seen as one vertex, described by one pair of x,y-parameters. This means that varying one parameter can change multiple quadrilaterals at once, as they are connected. The new set of parameters is optimized on the sum of log-joint probabilities (equation 2.4), such that an optimal connected set of quadrilaterals is found that represents the cuboid in the image.

2.4

Vanishing point constraints

The faces of a container are generally rectangular in 3D space. This means that a container has 3 sets of 4 parallel edges. After projecting parallel lines in 3D to 2D, these lines will have a common vanishing point. A vanishing point is defined as the intersection of its corresponding projections of parallel lines. Given that two lines are enough to find an intersection, this means that two projected parallel lines add a constraint to the possible orientations of other projected parallel lines. We can differentiate between three different kinds of images of a container (see also figure 2.2):

• One container face is visible, in which case there are two pairs of visible parallel lines. Two vanishing points follow from the intersections of these lines and enforce no further constraints. The last vanishing point cannot be found due to a lack of depth information.


• Two container faces are visible, with 6 vertices and 12 parameters. There are two pairs of projected parallel lines, and one set of three projected parallel lines, which is defined by all 12 parameters. The set of three enforces a constraint on one parameter, such that 1 parameter follows from the other 11.

• Three container faces are visible, with 7 vertices and 14 parameters. There are three sets of three projected parallel lines. Each set enforces a constraint on one of the parameters that define its lines, such that 3 parameters follow from the other 11.

(a) One visible face. Two vanishing points can be found, but the third vanishing point and the rest of the cuboid remain undetermined.

(b) Two visible faces. With the parametrization of 5 vertices and a direction (11 parameters total) all vertices can be found with vanishing point constraints.

(c) Three visible faces. With the parametrization of 5 vertices and a direction (11 parameters total) all vertices can be found with vanishing point constraints.

Figure 2.2: Three cases for container images, with one, two or three visible faces respectively, showing parametrizations for applying vanishing point constraints. Parallel lines are shown in the same color (red, green or blue). White lines connect to the vanishing points. For thick lines all vertices are given parameters, arrows indicate that only the starting vertex and the direction are given parameters, and thin lines are undetermined but can be found from the vanishing point constraints.

It is apparent that 11 parameters are sufficient to fully describe a geometrically consistent container. An example parametrization is that of 5 vertices and one direction, such that all vanishing points can be found.

The knowledge that a container is geometrically equivalent to a cuboid and must thus follow vanishing point constraints is utilised to further improve the segmentations of connected quadrilaterals. The set of parameters to optimize is reduced to 11; other relevant parameters follow from the constraints, as can be seen in figure 2.2. Starting from a set of connected quadrilaterals, the parameters are optimized on equation 2.4. This step will from now on be called cuboid optimization with VP constraints.
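The vanishing point of two projected parallel edges can be computed conveniently in homogeneous coordinates, where a line through two points and the intersection of two lines are both cross products. A small sketch (the function name is illustrative):

```python
import numpy as np

def vanishing_point(p1, p2, q1, q2):
    """Intersection of the lines p1-p2 and q1-q2, i.e. the vanishing point
    of two projected parallel edges. Points are (x, y) tuples."""
    to_h = lambda pt: np.array([pt[0], pt[1], 1.0])  # homogeneous coordinates
    l1 = np.cross(to_h(p1), to_h(p2))  # line through p1 and p2
    l2 = np.cross(to_h(q1), to_h(q2))  # line through q1 and q2
    v = np.cross(l1, l2)               # intersection of the two lines
    if abs(v[2]) < 1e-12:              # lines are parallel in the image
        return None
    return v[:2] / v[2]
```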


2.5 Resolving label ambiguity

Figure 2.3: Two opposing views of a container. The left and right of a container look very similar and are therefore hard to distinguish. The front contains the door, which makes it easy to distinguish from the back.

Ideally, found quadrilaterals will have labels distinguishing 5 kinds of faces: ‘Front’, ‘Back’, ‘Left’, ‘Right’, and ‘Top’. The front face was chosen as the face with the container door, which is visually distinctive from the back face, and is therefore assumed to be detectable. However, the left and right of a container look very similar, as can be seen in figure 2.3. As there is no clear distinction between the visual features of these faces, a Deep Learning segmentation model would not be able to learn to distinguish them. Therefore, only one label was used for the left and right. This results in an ambiguity that has to be resolved. Besides distinguishing the faces visually, the left and right can be distinguished by their position relative to the front or back face. However, this is not always possible, as an image may contain neither a front nor a back face. A solution is to use video instead of randomly ordered images. The camera pose in such a video is assumed to rotate around a z-axis with the container at the origin and the z-axis perpendicular to the ground, and this rotation may not change direction. The pose around the x- and y-axis may vary between -90 and 90 degrees. These assumptions ensure that every video has a temporal order in visible container faces. Every label can then be found: the left and right labels in frames that also include front or back labels are known from their relative position; this gives enough information to disambiguate between clockwise and counterclockwise rotation, after which the remaining left and right labels can be found by using the temporal order from frame to frame. Figure 2.4 visualizes this process.
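The disambiguation logic can be sketched as a toy function. Everything here is an illustrative assumption: the `resolve_sides` name, the frame dictionaries mapping labels to x-centers, the left-of-front convention, and the nearest-resolved-frame fallback (the actual method tracks the rotation direction explicitly rather than copying from the nearest resolved frame):

```python
def resolve_sides(frames):
    """frames: ordered list of dicts mapping a face label to the x-center of its
    quadrilateral; the ambiguous left/right faces carry the label 'side'.
    Returns, per frame, 'left', 'right', or None if no side face is visible."""
    resolved = [None] * len(frames)
    # 1) Frames that also contain a front or back face: resolve the side label
    #    by its position relative to that face.
    for i, faces in enumerate(frames):
        if "side" not in faces:
            continue
        for anchor in ("front", "back"):
            if anchor in faces:
                side_left_of_anchor = faces["side"] < faces[anchor]
                # Assumed convention: seen from outside, a side face appearing
                # to the left of the front face is the container's right face.
                if anchor == "front":
                    resolved[i] = "right" if side_left_of_anchor else "left"
                else:
                    resolved[i] = "left" if side_left_of_anchor else "right"
                break
    # 2) Remaining frames: copy the label of the nearest resolved frame. This
    #    simplification holds while the camera stays within the same side arc.
    known = [i for i, r in enumerate(resolved) if r is not None]
    for i, faces in enumerate(frames):
        if "side" in faces and resolved[i] is None and known:
            j = min(known, key=lambda k: abs(k - i))
            resolved[i] = resolved[j]
    return resolved
```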

2.6 Homography

With the estimation of quadrilaterals, the coordinates of corners are also found. After matching them to the corners of the physical container, estimated container faces can be projected to a plane with dimensions proportional to the physical container face dimensions via a homography. A homography can be calculated from four point correspondences, so an estimated quadrilateral with matched corners is


Figure 2.4: The temporal order in which faces occur is used to disambiguate ambiguous left/right labels. The left column of frames shows images including all ambiguous left/right labels (question marks). In the second column, frames including a front/back label disambiguate the left/right by relative spatial orientation. In the third column, frames including only an ambiguous left/right label are disambiguated by using the temporal order of the video, that is, the direction of rotation and the last or next known left/right label.

enough to warp the corresponding face. An example can be found in figure 2.5. For more details about the homography calculation see appendix A.
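A homography from four point correspondences can be estimated with the standard direct linear transform (DLT); a numpy sketch, which may differ in details from the derivation in appendix A:

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: H such that dst ~ H @ src in homogeneous
    coordinates. src, dst: (4, 2) arrays of corresponding points."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # each correspondence contributes two linear equations in the 9 entries of H
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A, float))
    H = vt[-1].reshape(3, 3)      # null-space vector of A
    return H / H[2, 2]

def warp_point(H, pt):
    """Apply homography H to a 2D point."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]
```

With an estimated face quadrilateral as `src` and a rectangle with the physical face proportions as `dst`, every pixel of the face can be warped to the texture plane.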


Figure 2.5: The face given by the green points is transformed by a homography to a plane which is dimensionally proportional to a container face in 3D space. The four point correspondences are enough to define the homography.

2.7 3D container model

Finally, it is possible to texture a 3D model given the warped container faces. First, however, a 3D model has to be defined. Due to the simplicity and similarity of containers, the same basic 3D model was used for all containers. This 3D model is defined as a box with 8 vertices and dimensions proportional to those of a standard 20-foot container, which is 6.1 m long, 2.44 m wide, and 2.59 m high. Each face that appeared in a video can be textured by taking a warped image and mapping it onto the corresponding face of the 3D model. A face will very likely appear more than once in a video though, so there is an opportunity to choose the best texture. It is speculated that texture quality improves as the camera pose gets closer to facing the container face perpendicularly. There are two reasons for this speculation:

• The number of visible pixels of a container face decreases as the camera angle moves further from perpendicular. This effect is greatest near the furthest edge of the face, meaning the container face will have a relatively low resolution there. After the pixels have been warped and interpolated, they will be spread out across the image, resulting in a low-quality texture. It also means that any error in the estimation of the furthest corners will be amplified by the warping.

• Deep Learning segmentation might be easier for the model when the camera faces the container face perpendicularly, because there would be only one container face visible, and the visual features distinguishing container from background are more distinctive than those distinguishing different faces. More accurate segmentations lead to more accurate container corner estimations, leading to better looking textures.

Such container faces can easily be found, as their opposing edges must have similar lengths. Note that a chosen container face must also be fully visible, or else part of the texture would be missing.
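This selection rule can be sketched as follows. The scoring function and names are illustrative assumptions; the rule simply compares opposing edge lengths and discards faces with corners outside the image:

```python
import numpy as np

def frontality_score(quad):
    """1.0 when opposing edges have equal length (face viewed head-on),
    lower as the face is viewed more obliquely.
    quad: (4, 2) array of corners ordered around the face."""
    e = [np.linalg.norm(quad[(k + 1) % 4] - quad[k]) for k in range(4)]
    return (min(e[0], e[2]) / max(e[0], e[2])) * (min(e[1], e[3]) / max(e[1], e[3]))

def pick_texture(candidates, h, w):
    """candidates: list of (quad, frame) pairs. Keep only fully visible faces,
    then take the most perpendicular-looking one."""
    visible = [(q, f) for q, f in candidates
               if (q[:, 0] >= 0).all() and (q[:, 0] < w).all()
               and (q[:, 1] >= 0).all() and (q[:, 1] < h).all()]
    if not visible:
        return None
    return max(visible, key=lambda qf: frontality_score(qf[0]))
```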

This finalizes the pipeline. A schematic overview of the full pipeline can be found in figure 2.6.


CHAPTER 3

Experiments

3.1 Data and implementation

The available annotated data contains 921 images of different containers, and 325 frames from three videos of containers. The videos were chosen primarily for their representativeness of the available unannotated video data, and secondly to be able to test the method on varying cases. They fall into three categories:

• An easy case, consisting of 128 frames, for which the top face is never visible and the whole container is always visible.

• A semi-hard case, consisting of 81 frames, for which the top face is visible in all but two frames. In merely four frames is the whole container visible; in the others part of the container falls outside of the image.

• A hard case, consisting of 116 frames, for which the top is visible in all but nine frames. In none of the frames is the whole container visible, and in some frames larger parts of the container faces are unseen than are seen.

The DeepLab model was trained on the data images and video frames. When validating on a specific set of video frames, these were left out of the training data. All images were scaled to a size of 600 by 400.

3.2 Container detector pipeline experiments

3.2.1 Pipeline steps evaluation

In this section the different steps of the pipeline responsible for container detection are evaluated. Specifically, deep learning segmentation, quadrilateral fitting, combining quadrilaterals, and cuboid optimization are compared. For cuboid optimization, two versions were compared: optimization with VP constraints and optimization without VP constraints. All steps are compared on segmentation output with the accuracy and mIOU metrics. The steps that aim to fit a cuboid


are compared on the task of container corner localization, which is essentially keypoint localization. A metric that fits this task is the probability of correct keypoint (PCK), which is often used in the human pose estimation literature [11]. For this metric, a keypoint is considered correct if the Euclidean distance to the ground truth keypoint is smaller than α · max(height, width), with α a constant set to 0.1, and height, width the height and width of the minimal bounding box that contains all ground truth keypoints. In total there are 677 keypoints for the easy video, 386 for the semi-hard video, and 489 for the hard video. 150 of these keypoints were found outside of the image by line intersection. Results of the experiment can be seen in table 3.1.

video            method                        accuracy  mIOU   PCK
easy video       DL segmentation               0.953     0.711  -
easy video       quadrilateral fitting         0.956     0.828  -
easy video       combining quadrilaterals      0.953     0.826  0.817
easy video       optimizing cuboid without VP  0.957     0.835  0.839
easy video       optimizing cuboid with VP     0.957     0.830  0.830
semi-hard video  DL segmentation               0.951     0.837  -
semi-hard video  quadrilateral fitting         0.950     0.831  -
semi-hard video  combining quadrilaterals      0.947     0.829  0.777
semi-hard video  optimizing cuboid without VP  0.952     0.837  0.819
semi-hard video  optimizing cuboid with VP     0.952     0.836  0.811
hard video       DL segmentation               0.947     0.846  -
hard video       quadrilateral fitting         0.943     0.847  -
hard video       combining quadrilaterals      0.943     0.846  0.581
hard video       optimizing cuboid without VP  0.945     0.850  0.595
hard video       optimizing cuboid with VP     0.942     0.845  0.618
average          DL segmentation               0.950     0.791  -
average          quadrilateral fitting         0.950     0.835  -
average          combining quadrilaterals      0.948     0.834  0.733
average          optimizing cuboid without VP  0.952     0.841  0.757
average          optimizing cuboid with VP     0.950     0.837  0.758

Table 3.1: Resulting accuracy, mIOU and PCK for the different parts of the container detection pipeline.
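The PCK computation described above amounts to the following sketch (array shapes are illustrative):

```python
import numpy as np

def pck(pred, gt, alpha=0.1):
    """Probability of correct keypoint. pred, gt: (N, 2) arrays. A prediction
    counts as correct if it lies within alpha * max(height, width) of its
    ground truth, with height and width taken from the tight bounding box
    around all ground-truth keypoints."""
    extent = (gt.max(axis=0) - gt.min(axis=0)).max()  # max(height, width)
    threshold = alpha * extent
    dists = np.linalg.norm(pred - gt, axis=1)
    return float((dists < threshold).mean())
```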

The results show that the quadrilateral fitting step improves upon DL segmentation on the easy video, but not on the semi-hard and hard videos. One possible explanation is that quadrilateral estimation becomes harder when more corners lie outside of the image. Another explanation is that these videos view the container from various angles. This means that the orientation of some faces is often such that an artifact in the probability map can lead to a diamond-shaped quadrilateral. An example of this is figure 3.1d, in which the probability map for the top face overestimates near the top left. As the number of pixels belonging to the top face shrinks towards the top right, the two top-right corners can be generalized by a single point without hurting the likelihood score too much. The result is a diamond-shaped quadrilateral. Poorly estimated quadrilaterals, such as diamond-shaped quadrilaterals, also appear to worsen other estimations through the averaging operation of the combining quadrilaterals step. This could explain why this step performs the worst on all videos. Looking once more at the diamond-shaped green quadrilateral in figure 3.1d, it appears to worsen the estimations from the red and blue quadrilaterals after combining and averaging. Another example is figure 3.1f, in which the estimations from the green quadrilateral worsen those from the blue quadrilateral. These mistakes can be corrected by the cuboid optimization step, either with or without VP. These steps reach the highest scores on the metrics, both on the task of segmentation and on keypoint localization. Note however that these improvements over DL segmentation are questionable for the semi-hard and hard videos. This suggests that the method is suboptimal when corners lie outside of the image.

(a) Frame from the easy video. The green quadrilateral reaches far onto the other face due to wrong segmentations. Optimization with VP constraints partly fixes this issue.

(b) Frame from the semi-hard video. Due to the orientation and the segmentation of the face segmented in purple, the fitted quadrilateral is off. This error remains with optimization without VP constraints, but optimization with VP constraints estimates the bottom left corner correctly.

(c) Frame from the semi-hard video. The segmentation for the top face reaches too far at the end, leading to a poorly estimated far-back corner. This error is corrected when using VP constraints.

(d) Frame from the hard video. The orientation of the top face leads to a diamond-shaped quadrilateral. This poor estimation is fixed by using VP constraints.

(e) Frame from the easy video. The probabilities for the face segmented in red are such that the estimated face is too large in all steps. In order to still obtain a valid cuboid, the VP constraints enforce an incorrect position upon the far-back top corner.

(f) Frame from the semi-hard video. Part of the container falls outside of the image, leading to a diamond-shaped quadrilateral for the top face. The resulting cuboid after merging is such that the vanishing points are far from correct. Optimization with VP leads to a poor local optimum, whereas optimization without VP does not.

(g) Frame from the hard video. The bottom right of the image is incorrectly segmented. The error remains in the resulting cuboids, but worsens the other corner estimations for optimization with VP constraints.

Figure 3.1: Results with a padding of 100 for the various steps. From left to right: deep learning segmentation, quadrilateral fitting, combining quadrilaterals, optimization without VP constraints, optimization with VP constraints.

The results also show that optimization without VP scores slightly better on accuracy and mIOU, and slightly worse on PCK, than optimization with VP constraints. Figure 3.1 shows examples of cases in which either optimization with or without VP constraints yields the best results. Subfigures a to d are examples in which the VP constraints visibly improve the estimated cuboid. It seems that whenever one corner is poorly estimated, the other corners can force its correctness given VP constraints. In subfigures e and g, however, optimization with VP constraints yields cuboids that are worse. In these cases several poor corner estimations worsen other corner estimations via the VP constraints. The reduced performance with VP constraints is consistent with the results by Dwibedi et al. [4]; they blamed the propagation of error from detected vertices to inferred vertices, a phenomenon similar to that seen in subfigures e and g. This effect might be more drastic for their method, as it does not learn from the inferred corners, whereas the thesis detector uses a deep learning segmentation network that learns from any visible corners, and instead applies VP constraints afterwards in the cuboid optimization, where inferred vertices are used in the optimization score definition. This means that for their method one erroneous vertex directly propagates an error, whereas the method of this thesis can fix one erroneous vertex via the constraints, but fails nonetheless when several vertices are erroneous.
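The vanishing point constraints rest on the standard projective fact that parallel cuboid edges meet at a common vanishing point in the image. As an illustrative sketch (not the thesis implementation), a vanishing point can be computed from two imaged edges using homogeneous coordinates: the cross product of two points gives the line through them, and the cross product of two lines gives their intersection.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(edge1, edge2):
    """Intersection of two image line segments, each given as a pair of
    endpoints. For parallel 3D cuboid edges this is their vanishing point."""
    v = np.cross(line_through(*edge1), line_through(*edge2))
    if abs(v[2]) < 1e-12:   # lines parallel in the image: VP at infinity
        return None
    return v[:2] / v[2]     # dehomogenize to pixel coordinates

# Two edges converging towards (10, 0):
vp = vanishing_point(((0.0, 2.0), (5.0, 1.0)), ((0.0, -2.0), (5.0, -1.0)))
# vp == (10.0, 0.0)
```

Checking whether the edges of an estimated cuboid meet near common vanishing points is one way such constraints can be scored.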

3.2.2 Padding evaluation

The previous experiment was done with a padding of 100. As 205 out of 325 video frames contain faces that fall outside of the image, it is impossible to perfectly fit quadrilaterals and cuboids without padding. To motivate the use of padding, the method was compared with the same method without padding. Results from this experiment can be found in table 3.2. Results for the first video were excluded because it includes no frames with faces outside of the image, and the DL segmentation step was excluded because padding only affects the subsequent steps. The remaining steps without padding consistently perform worse on the semi-hard and hard videos. This is to be expected, as estimations are limited to the size of the image, making it impossible to perfectly fit some quadrilaterals and cuboids. Examples of this are shown in figure 3.2.
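Conceptually, the padding amounts to surrounding each per-face probability map with a zero-probability border, so that quadrilateral and cuboid corners can be optimized to positions outside the original image bounds. A minimal sketch (array shapes and names are illustrative; the padding of 100 matches the experiments):

```python
import numpy as np

def pad_probability_maps(prob_maps, padding=100):
    """Extend each per-face probability map with a zero border.

    prob_maps: array of shape (num_faces, H, W) with per-pixel face
    probabilities. Corners estimated in the padded coordinate frame may lie
    up to `padding` pixels outside the original image; subtracting `padding`
    from the coordinates maps them back to image coordinates.
    """
    return np.pad(prob_maps, ((0, 0), (padding, padding), (padding, padding)),
                  mode="constant", constant_values=0.0)

maps = np.random.rand(3, 400, 600)   # e.g. 3 face classes on a 600x400 frame
padded = pad_probability_maps(maps)
assert padded.shape == (3, 600, 800)
```

Because the border has probability zero, enlarging a quadrilateral into the padded region never adds true-positive mass; it only becomes attractive when the face genuinely extends past the image edge.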

                              accuracy             mIOU                 PCK
step                          padding  no padding  padding  no padding  padding  no padding
quadrilateral fitting         0.946    0.942       0.840    0.834       -        -
combining quadrilaterals      0.945    0.940       0.839    0.831       0.667    0.642
optimizing cuboid without VP  0.948    0.945       0.845    0.838       0.694    0.666
optimizing cuboid with VP     0.946    0.939       0.841    0.826       0.703    0.641

Table 3.2: Accuracy, mIOU and PCK for the different parts of the container detection pipeline with and without padding, averaged over the semi-hard and hard video frames. Results would be identical for the easy video and for the DL segmentation step, so they were excluded from this comparison. The method with padding scores better on every step on every metric.

3.2.3 Quadrilateral initialization evaluation

Furthermore, the quadrilateral initialization method is important for the outcome of the quadrilateral fitting step. The random, generic, contour-based, and gradient-based quadrilateral initialization methods were compared. Results of these experiments can be seen in table 3.3, and examples of resulting quadrilaterals in figure 3.3.

(a) Frame from the semi-hard video. The bottom corner falls outside of the image. With padding (top row of images) this corner is estimated correctly in all steps, but without padding (bottom row of images) the estimation is limited to the bounds of the image.

(b) Frame from the hard video. Three out of eight corners fall outside of the image. Using padding (top row of images), the visible information is enough to make a decent estimation, but without padding (bottom row of images) it is impossible to make a proper fit for quadrilaterals and cuboids.

Figure 3.2: Examples of results with and without padding

• The random initialization performed the worst. In figure 3.3, the leftmost pair of images shows some of these random initializations. The green quadrilateral for the easy video frame, the red quadrilateral for the semi-hard video frame, and the red quadrilateral for the hard video frame were too far away from, or contained too few pixels of, the corresponding container face. In such cases, the locally optimal quadrilateral is one that is as small as possible, as this minimizes the number of false negatives, which have a negative effect on equation 2.3. Especially in the case of a small container face, the effect of including true positives will often be outweighed by the effect of excluding false negatives.

• Generic rectangle initialization performs better, but still suffers from many of the same problems as random initialization. The container is always approximately in the middle of the images, but some faces can be located at the edge of the image such that they are excluded by the rectangle. This is the case for the red quadrilateral for the semi-hard video frame and the red quadrilateral for the hard video frame.

video            method               accuracy  mIOU
easy video       random               0.875     0.604
easy video       generic              0.931     0.691
easy video       contour              0.937     0.721
easy video       gradient             0.957     0.778
easy video       contour or gradient  0.957     0.831
semi-hard video  random               0.841     0.536
semi-hard video  generic              0.879     0.568
semi-hard video  contour              0.948     0.831
semi-hard video  gradient             0.945     0.823
semi-hard video  contour or gradient  0.951     0.838
hard video       random               0.794     0.559
hard video       generic              0.849     0.634
hard video       contour              0.940     0.843
hard video       gradient             0.937     0.833
hard video       contour or gradient  0.944     0.847
average          random               0.838     0.571
average          generic              0.889     0.640
average          contour              0.941     0.792
average          gradient             0.947     0.809
average          contour or gradient  0.951     0.838

Table 3.3: Resulting accuracy and mIOU for different kinds of initializations for the quadrilateral fitting step

• Contour based initialization appears to solve some of these issues, and as expected it reaches higher accuracy and mIOU. Often the initialization is already similar to the final quadrilateral. In some cases, however, the method initializes around an artifact. An example can be found in the third column of images for the easy video frame, for which the green quadrilateral is initialized on part of a background container. The resulting quadrilateral is similarly wrong. These cases have very low accuracy and mIOU.

• Gradient based initialization is more consistent than contour based initialization. Even though the initial quadrilaterals are not as precise as those from the contour based method, they appear to always lie around the corresponding container face. However, for poses viewed at large angles this method can result in flawed quadrilaterals. This can be seen in the fourth column of images for the hard video frame, where the resulting green quadrilateral has a wrong orientation, possibly because the bottom left horizontal parameter was optimized prior to the top left parameter, resulting in an incorrect local optimum.

The contour and gradient based methods each have different pros and cons, which leads to the belief that a combination might improve results. First, a quadrilateral is initialized with the contour based method, but if it is too small (less than 5% of the image), it is discarded and the gradient based method is used instead. This approach outperforms all other methods, and was therefore the method of choice during the other experiments.
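The combined scheme can be sketched as follows. Note this is a simplified stand-in: the contour-based step is approximated here by taking the extreme points of the thresholded probability map as four initial corners (the thesis uses proper contour extraction [7, 8]), and the gradient-based method, whose details are not repeated here, is passed in as a callable.

```python
import numpy as np

def extreme_point_init(prob_map, thresh=0.5):
    """Simplified stand-in for contour-based initialization: threshold the
    face probability map and use its top/right/bottom/left-most pixels as
    the four initial corners. Returns (corners, area_fraction)."""
    ys, xs = np.nonzero(prob_map > thresh)
    if len(xs) == 0:
        return None, 0.0
    corners = np.array([
        (xs[np.argmin(ys)], ys.min()),   # topmost pixel
        (xs.max(), ys[np.argmax(xs)]),   # rightmost pixel
        (xs[np.argmax(ys)], ys.max()),   # bottommost pixel
        (xs.min(), ys[np.argmin(xs)]),   # leftmost pixel
    ], dtype=float)
    area_frac = len(xs) / prob_map.size  # rough size of the candidate face
    return corners, area_frac

def init_quadrilateral(prob_map, gradient_init, min_area_frac=0.05):
    """Contour-style initialization first; if the candidate covers less than
    5% of the image, discard it and fall back to the gradient-based method."""
    corners, area_frac = extreme_point_init(prob_map)
    if corners is None or area_frac < min_area_frac:
        return gradient_init(prob_map)
    return corners
```

The 5% threshold mirrors the rule described above; the fallback compensates for cases where the thresholded region is a small artifact rather than a face.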



(a) Easy video frame

(b) Semi-hard video frame

(c) Hard video frame

Figure 3.3: Initialized quadrilaterals (top row of each subfigure) and resulting quadrilaterals (bottom row of each subfigure) for different initialization methods. From left to right: random, generic, contour, gradient, and contour or gradient initialization.

3.3 Comparing the container detector

To evaluate the proposed container detection method, it was compared to the Deep Cuboid Detector by Dwibedi et al. [4]. This deep learning based method beats the state of the art on cuboid detection. Not only can a Deep Cuboid Detector trained on generic cuboids be used for the task of container detection, the model can also be trained specifically on containers, which might improve container detection results. As the authors did not release code publicly, the method first had to be implemented from the description in the paper. It was implemented in PyTorch on top of an existing Faster R-CNN implementation, which the method closely resembles. The final implementation can be found at github.com/rubenve95/Deep-Cuboid-Detection.

To confirm that the implementation is correct, the results of the paper were reproduced by training and validating on the SUNPrimitive dataset, which consists of 3516 annotated images of cuboids. The authors compare on Average Precision (AP), Average Precision of Keypoints (APK), and Probability of Correct Keypoint (PCK). The implementation reached an AP of 75.1 and a PCK of 40.4, compared to the original AP and PCK of 75.47 and 38.27 respectively, confirming that it is correct. Note that differences in outcome could exist due to differences in the pretrained weights of the network's VGG16 encoder.

Afterwards, the method was applied to container data. The data first had to be converted from face segmentations to cuboid vertices with bounding boxes. As the method outputs all 8 vertices of a cuboid, vanishing point constraints were used to find the ground truth vertices which are not directly visible in the image. This is impossible if only one face is visible, so such images were excluded. Images in which part of the container falls outside of the image were also excluded, as the conversion of annotation type introduced inaccuracies. This leaves 504 annotated container images, of which 4 are from the semi-hard video and 82 from the easy video; the latter were used for validation. A model was trained on this dataset, initialized with the model pretrained on SUNPrimitive. For further comparison, the model trained only on SUNPrimitive was also tested on container data.

In order to compare test results to the container detector presented in this thesis, the annotation type had to be converted back from vertex coordinates to segmentations, such that the accuracy and mIOU metrics become relevant. Note that AP is not relevant for the thesis container detector, as this metric evaluates bounding box estimations. PCK was calculated over all keypoints, in contrast to the previous container experiments, where only keypoints of visible faces were evaluated. Results of these tests can be found in table 3.4.
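For reference, PCK counts a predicted keypoint as correct when it lies within a threshold distance of the ground truth. A common choice, assumed here, normalizes the threshold by the larger image dimension; the exact normalization used in [4] may differ.

```python
import numpy as np

def pck(pred, gt, img_w, img_h, alpha=0.1):
    """Probability of Correct Keypoint.

    pred, gt: arrays of shape (N, 2) with (x, y) keypoint coordinates.
    A keypoint counts as correct if its Euclidean distance to the ground
    truth is below alpha * max(img_w, img_h).
    """
    dists = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return float(np.mean(dists < alpha * max(img_w, img_h)))

gt = np.array([[100, 100], [200, 150], [300, 300], [50, 350]])
pred = gt + np.array([[5, 0], [0, 10], [80, 0], [0, 0]])
# Threshold 0.1 * 600 = 60 px on a 600x400 frame: 3 of 4 keypoints correct.
score = pck(pred, gt, 600, 400)   # 0.75
```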

Model                                              AP            PCK    accuracy  mIOU
Deep Cuboid Detector trained only on SUNPrimitive  96.4          0.346  0.826     0.533
Deep Cuboid Detector trained on container dataset  100           0.62   0.878     0.652
Thesis Container Detector                          not relevant  0.654  0.948     0.778

Table 3.4: Test results on 82 container images. The thesis container detector outperforms the Deep Cuboid Detector on segmentation and vertex localization.

Training the Deep Cuboid Detector on container data has improved the results on all metrics. Note that the AP is 100, meaning that it always finds the correct bounding box. This high score can be explained by the data, as there is always exactly one bounding box for a container in the middle of the image, and this is learned by the model. Looking at the scores for vertex localization (PCK) and segmentation (accuracy and mIOU), it is apparent that the thesis container detector outperforms the Deep Cuboid Detector.

3.4 3D model texturing

For texturing the simple 3D model, the results from the cuboid optimization with VP constraints step were used. The corners were labeled and the faces were warped with homographies. That leaves the task of choosing the textures among the warped faces. One approach is to select the container faces based on the length ratios of their opposing edges. Two ratios were tested:

• One-on-one for both horizontal and vertical edges. This filters on faces that face the camera perpendicularly.

• Two-on-one for the vertical and one-on-one for the horizontal edges. This filters on faces that are looked at from an angle horizontally.
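This ratio-based filtering, combined with preferring detections with the most visible corners, can be sketched as a sort over candidate face detections; the candidate representation and names below are illustrative, not the thesis implementation.

```python
import numpy as np

def ratio_error(corners, target_vertical=1.0, target_horizontal=1.0):
    """Deviation of the opposing-edge length ratios of a quadrilateral from
    the target ratios. Corners ordered top-left, top-right, bottom-right,
    bottom-left."""
    tl, tr, br, bl = np.asarray(corners, float)
    top, bottom = np.linalg.norm(tr - tl), np.linalg.norm(br - bl)
    left, right = np.linalg.norm(bl - tl), np.linalg.norm(br - tr)
    vert = max(left, right) / min(left, right)    # e.g. 2.0 for a 2:1 view
    horiz = max(top, bottom) / min(top, bottom)
    return abs(vert - target_vertical) + abs(horiz - target_horizontal)

def best_textures(candidates, top_k=3, **ratio_targets):
    """candidates: list of (num_visible_corners, corner_coords) per detection.
    Sort by visible corners (descending), then by ratio error (ascending)."""
    return sorted(candidates,
                  key=lambda c: (-c[0], ratio_error(c[1], **ratio_targets)))[:top_k]
```

For the two-on-one vertical criterion one would pass `target_vertical=2.0`; the default targets correspond to the one-on-one filter.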

The best textures are considered the ones that primarily contain the most visible corners and secondarily come closest to these ratios. In figure 3.4, the top 3 textures for each face are shown for the easy video for both ratios. Firstly, from the corresponding original images it becomes clear that the ratios indeed filter on straight and angled views. Secondly, the textures were compared qualitatively and quantitatively. For the latter, a normalized Euclidean distance metric inspired by the PCK metric was used, defined as

\[
d(\mathrm{corner}) = \frac{d_{\mathrm{Euclidean}}\left(H(\mathrm{point}_{\mathrm{corner}}),\, H(\mathrm{point}^{gt}_{\mathrm{corner}})\right)}{\mathrm{width} \cdot \mathrm{height}}, \tag{3.1}
\]

with H the homography matrix, point_corner and point^gt_corner the corner coordinates of the estimation and the ground truth respectively, and width, height the width and height of the texture. This score was summed over the corners of each face and averaged over the top three textures for each face for both ratios in table 3.5. The ratio of 1:1 scores better on all faces. This can be confirmed visually, as the resulting textures found with a ratio of 1:1 contain faces that fit more neatly and contain less blur. The blur is more prominent near the corners furthest from the camera, which agrees with the hypothesis that the back pixels are interpolated over a relatively large area, causing a low resolution, and also that a small error in the furthest corners gets magnified by the homography. The latter effect is well visible in the second right texture with ratio 1:2; even though the front-left-top corner is only slightly off in the corresponding image, the loss in sharpness at the left side of the texture is great. In comparison, the front-right-top corner of the same texture is off by a greater absolute amount, yet causes less loss in sharpness at the right. Another reason why the textures with a 1:2 ratio are worse is that these generally contain two visible faces instead of one. The DL segmentation model might have learned to distinguish between container and background more convincingly than between individual faces. Uncertainty between faces means that the cuboid estimation is based upon an ambiguous probability space. Indeed, the closest two corners often appear to be poorly estimated, resulting in faces that do not fit the textures well. Note also that this effect is less clear for images containing the front face, possibly because this face contains more visually distinctive features than the back face, as it contains the door.
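Equation 3.1 can be implemented directly; the sketch below applies a 3×3 homography to the estimated and ground-truth corners and accumulates the normalized distances over one face (names are illustrative):

```python
import numpy as np

def warp_point(H, p):
    """Apply a 3x3 homography to a 2D point and dehomogenize."""
    v = H @ np.array([p[0], p[1], 1.0])
    return v[:2] / v[2]

def face_distance_score(H, corners, corners_gt, tex_w, tex_h):
    """Sum of equation 3.1 over the corners of one face: the Euclidean
    distance between warped estimated and ground-truth corners, normalized
    by the texture area (width * height)."""
    return sum(
        np.linalg.norm(warp_point(H, p) - warp_point(H, q)) / (tex_w * tex_h)
        for p, q in zip(corners, corners_gt)
    )
```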

(a) Front faces found with ratio 1:1
(b) Front faces found with ratio 1:2
(c) Right faces found with ratio 1:1
(d) Right faces found with ratio 1:2
(e) Back faces found with ratio 1:1
(f) Back faces found with ratio 1:2
(g) Left faces found with ratio 1:1
(h) Left faces found with ratio 1:2

Figure 3.4: Each subfigure shows the top three faces found with either a ratio of 1:1 or 1:2. Each top row shows the original images with the estimated corners in green, and each bottom row shows faces warped by homography to fit into the texture.

The end result is a simple 3D model using the best textures given the number of visible corners and the 1:1 ratio score. The 3D models for each video can be seen in figure 3.5. For the semi-hard and hard videos the number of visible corners was of primary importance, as this maximizes the textured area. For the hard video, there were no left and back faces that contained all corners, and so part of the texture is black. None of its textures appears to fit well. The textures for the semi-hard video seem to fit well, except for the top. Note also that due to a dent the top could not fit perfectly. This is a limitation of this simple approach to 3D model reconstruction, as it assumes that the container is a perfect cuboid, even when due to damage it is not. The 3D model for the easy video looks decent, but contains some background pixels in the textures. It also has a completely black top face, as this remained unseen in the video.


face     distance score with ratio 1:1  distance score with ratio 1:2
front    0.223                          0.226
right    0.061                          0.321
back     0.140                          0.831
left     0.067                          0.227
average  0.123                          0.401

Table 3.5: Accumulated distance scores for warped corners of textures from the easy video. The textures were chosen with length ratios closest to 1:1 or 1:2.

(a) Back and right view for a 3D model of the easy video

(b) Front and left view for a 3D model of the easy video

(c) Back and right view for a 3D model of the semi-hard video

(d) Front and left view for a 3D model of the semi-hard video

(e) Back and right view for a 3D model of the hard video

(f) Front and left view for a 3D model of the hard video

Figure 3.5: 3D models predefined as a standard 20ft. container. Textured with warped faces that contain as many corners as possible and have opposing edges close to a ratio of 1:1.


CHAPTER 4

Conclusion

4.1 Summary and conclusion

In this thesis a novel method for detection and 3D model reconstruction of shipping containers was presented. The detection pipeline is based on a deep learning segmentation network, for which the pixel-wise semantic assignment by an argmax operation on the probability space is replaced by a quadrilateral/cuboid fitting algorithm using the prior knowledge of the shape of a container. Experiments in section 3.2.1 showed that the method can improve upon its initial segmentations, though the improvements are minor. These experiments also showed that the use of vanishing point constraints in cuboid optimization is overall not an improvement compared to optimization without vanishing points, though it can improve estimations in some cases. Specifically, it was observed that when one corner is wrongly estimated the vanishing point constraints can correct this error, but when several corners are wrong their errors propagate to the inferred corner.

The use of padding was motivated in section 3.2.2, where the capability of the method to estimate cuboids with some of their corners located outside of the image was shown. Also, experiments with different kinds of initializations were done in section 3.2.3, which showed a clear preference for methods that use information from the probability space, such as the contour-based and gradient-based initializations, over random or generic quadrilateral initializations. The latter two appeared to be more prone to poor local optima. A combination of the contour-based and gradient-based methods proved superior, as they compensate for each other's weaknesses. The detection algorithm was then shown to beat a state-of-the-art cuboid detection method on the shipping container detection task in section 3.3.

An advantage of the method over plain deep learning segmentation is that it is also capable of corner localization. This is essential as it allows for the calculation of a homography which warps the container faces to the dimensions of the textures used for a simple 3D model of a container. As assignment to the left or right texture of a container (both large faces) is ambiguous due to their visual similarity, the use of video was proposed. The temporal order of the faces in video, together with the orientation with respect to the front and back faces, which are not ambiguous, solves the issue of left or right face ambiguity. With multiple frames showing each face multiple times, there is an opportunity to select a face estimation for each texture. In section 3.4 an experiment on one video showed that faces estimated to be closest to perpendicular to the camera pose were visually the best textures. However, more experiments would have to be done to support this claim. Lastly, the end results of three 3D models were shown. They are simple and intuitive, but lack the full visual expression of a shipping container.

4.2 Discussion and future research

Even though the detection algorithm was generally rather accurate, there is room for improvement. For example, the vanishing point constraints are chosen such that the outermost corner of the smallest face in the image is given by the other parameters. In many cases this is wise, as it often appears to be the outermost corner of the top face, which often falls outside of the image and is therefore hard to predict. For images in which this is not the case, this parametrization is suboptimal. Another possibility would be to prioritize corners that are located inside the image as parameters whenever possible. This is left for future research.

For comparison with the state of the art, the method of this thesis was compared with the Deep Cuboid Detector [4], which it beat by a large margin. Note however that this is not a completely fair comparison, as the latter is a method for cuboid detection, not for shipping container detection. These two tasks seem equivalent, but they are not. The Deep Cuboid Detector has to work for every cuboid, regardless of the semantic meaning of the individual faces; instead it has labels associated with corners corresponding to their orientation in the image. In comparison, the detector of this thesis has labels which are associated with the visual features of the faces. It would be interesting to see if the detector of this thesis would work if labels were chosen by their orientation in the image, instead of by their actual semantic meaning. If this works, the method could be extended to all cuboids, as it would indicate that a deep learning model can learn from a dataset of generic cuboids. This would be an interesting addition to the field of robotics, where cuboid detection is relevant.

Though some angles of the shown 3D models were visually appealing, others proved lacking. Reasons include faces being poorly estimated, a lack of generality in the 3D model representation due to possible deformations, corners remaining unseen in the video, or even a full face remaining unseen. The latter is inevitable and is caused by the data, but the first three reasons are limitations of this method. Especially in the use case of damage detection, it is important that deformations can be modelled, which is not the case with a predefined 3D model. To alleviate these issues, one could consider structure-from-motion with bundle adjustment. This is expected to mistakenly confuse features from the left and right faces; however, one could disambiguate these features by the temporal order in which they appear in a video, similarly to what was done with face detections in this thesis.


4.3 Recommendations

Anyone who wishes to use this method should be mindful of the following:

• Results depend on the quality of the probability map, and so on the trained segmentation network. The network of this thesis was trained on a relatively small dataset (about 1250 images). However, this was enough for decent quality, because it had a wide variety of container poses. This is important because the data then includes images of containers showing several faces, which lets the network learn to distinguish these faces.

• The video frames were resized to 600 by 400, meaning that a camera with 480p resolution is sufficient for reliable results.

• Frames should be sampled from a video with a frequency that allows disambiguation between left and right faces; that is, at least one frame showing the front or back face should be taken between two frames showing only the left face after the right face, or the right face after the left face. This becomes especially relevant if one would prefer to limit computation time by taking as few frames as possible.

• For optimal results the method should be used on data without container parts outside of the image, or with at most one corner outside. In that case the method appears to be quite robust and consistent. Results will likely vary more when more corners are outside of the image.

• With the current implementation it takes about a minute per frame to go through the detection pipeline. The main bottlenecks are the quadrilateral fitting and cuboid optimization steps. Note however that computation time was not a priority, and it is expected that there is plenty of room for improvement in computational efficiency. Possibilities include calculating the difference in log-joint probability score between quadrilaterals after a parameter change, instead of recalculating the full score for every new quadrilateral, and GPU acceleration.


APPENDIX A

Derivation of the homography calculation

The homography matrix relates two planar surfaces in 3D space:

\[
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} \sim
\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{A.1}
\]

This yields the equations:

\[
x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \qquad
y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}
\]

Note that these equations still hold when multiplying all terms by a factor k. Assuming that h_{33} ≠ 0, this means that we can define the same homography with \(\hat{h}_{33} = k h_{33} = 1\):

\[
x' = \frac{k h_{11}x + k h_{12}y + k h_{13}}{k h_{31}x + k h_{32}y + k h_{33}}
   = \frac{\hat{h}_{11}x + \hat{h}_{12}y + \hat{h}_{13}}{\hat{h}_{31}x + \hat{h}_{32}y + 1}, \qquad
y' = \frac{k h_{21}x + k h_{22}y + k h_{23}}{k h_{31}x + k h_{32}y + k h_{33}}
   = \frac{\hat{h}_{21}x + \hat{h}_{22}y + \hat{h}_{23}}{\hat{h}_{31}x + \hat{h}_{32}y + 1}
\]

which can be rewritten as:

\[
\hat{h}_{11}x + \hat{h}_{12}y + \hat{h}_{13} - \hat{h}_{31}xx' - \hat{h}_{32}yx' = x'
\]
\[
\hat{h}_{21}x + \hat{h}_{22}y + \hat{h}_{23} - \hat{h}_{31}xy' - \hat{h}_{32}yy' = y'
\]

Setting \(\hat{h}_{33} = 1\) reduces the problem to 8 degrees of freedom, which means that four points are required to find a solution. In matrix form, the problem for four known point correspondences can be written as:

\[
\begin{pmatrix}
x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x'_1 & -y_1 x'_1 \\
0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y'_1 & -y_1 y'_1 \\
x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 x'_2 & -y_2 x'_2 \\
0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 y'_2 & -y_2 y'_2 \\
x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3 x'_3 & -y_3 x'_3 \\
0 & 0 & 0 & x_3 & y_3 & 1 & -x_3 y'_3 & -y_3 y'_3 \\
x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4 x'_4 & -y_4 x'_4 \\
0 & 0 & 0 & x_4 & y_4 & 1 & -x_4 y'_4 & -y_4 y'_4
\end{pmatrix}
\begin{pmatrix}
\hat{h}_{11} \\ \hat{h}_{12} \\ \hat{h}_{13} \\ \hat{h}_{21} \\ \hat{h}_{22} \\ \hat{h}_{23} \\ \hat{h}_{31} \\ \hat{h}_{32}
\end{pmatrix}
=
\begin{pmatrix}
x'_1 \\ y'_1 \\ x'_2 \\ y'_2 \\ x'_3 \\ y'_3 \\ x'_4 \\ y'_4
\end{pmatrix} \tag{A.2}
\]

This is a linear equation of the form \(Ah = b\), with least-squares solution \(h = (A^\top A)^{-1} A^\top b\).
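Under the assumption h33 ≠ 0 made above, the system A.2 translates directly into code; a sketch using NumPy's least-squares solver (not the thesis implementation):

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve A h = b (equation A.2) for the 8 homography parameters from
    four point correspondences, then reshape to 3x3 with h33 = 1."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); b.append(yp)
    h, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, p):
    """Map a 2D point through the homography and dehomogenize."""
    v = H @ np.array([p[0], p[1], 1.0])
    return v[:2] / v[2]

# Warp the unit square onto an arbitrary quadrilateral:
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(10, 10), (50, 12), (55, 60), (8, 52)]
H = homography_from_points(src, dst)
```

With exactly four correspondences the system is square and the least-squares solution is exact; the same routine also accepts more correspondences, in which case it minimizes the algebraic residual.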

Bibliography

[1] Port of Rotterdam official website. https://www.portofrotterdam.com/sites/default/files/facts-and-figures-port-of-rotterdam.pdf. Accessed: 2020-09-18.

[2] Vicente Atienza, Ángel Rodas, Gabriela Andreu, and Alberto Pérez. Optical flow-based segmentation of containers for automatic code recognition. Lecture Notes in Computer Science, 3686(PART I):636–645, 2005.

[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016.

[4] Debidatta Dwibedi, Tomasz Malisiewicz, Vijay Badrinarayanan, and Andrew Rabinovich. Deep cuboid detection: Beyond 2d bounding boxes. CoRR, abs/1611.10010, 2016.

[5] Chao Mi, Zhiwei Zhang, Youfang Huang, and Yang Shen. A fast automated vision system for container corner casting recognition. Journal of Marine Science and Technology (Taiwan), 24(1):54–60, 2016.

[6] Yang Shen, Weijian Mi, and Zhiwei Zhang. A positioning lockholes of con-tainer corner castings method based on image recognition. Polish Maritime Research, 24(S3):95–101, 2017.

[7] Satoshi Suzuki and Keiichi Abe. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.

[8] Mahes Visvalingam and J Duncan Whyatt. The douglas-peucker algorithm for line simplification: Re-evaluation through visualization. In Computer Graph-ics Forum, volume 9, pages 213–225. Wiley Online Library, 1990.

[9] Jianxiong Xiao, Bryan C. Russell, and Antonio Torralba. Localizing 3D cuboids in single-view images. Advances in Neural Information Processing Systems, 1:746–754, 2012.

[10] Shichao Yang and Sebastian Scherer. CubeSLAM: Monocular 3-D Object SLAM. IEEE Transactions on Robotics, 35(4):925–938, 2019.


[11] Yi Yang and Deva Ramanan. Articulated human detection with flexible mix-tures of parts. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2878–2890, 2013.

[12] Hee Joo Yoon, Young Chul Hwang, and Eui Young Cha. Real-time container position estimation method using stereo vision for container auto-landing sys-tem. ICCAS 2010 - International Conference on Control, Automation and Systems, pages 872–876, 2010.
