Height Estimation from Aerial Imagery with Stereo Matching Networks

J.C. Zuurmond

University of Amsterdam

Science Park 904, Amsterdam

cor.zuurmond@student.uva.nl

Abstract

While stereo matching is a common topic within autonomous driving, it is not commonly applied to aerial imagery. This research applies stereo matching methods to aerial imagery for height estimation. Height is a fundamental variable in a wide variety of geographic information system (GIS) applications. This research is a comparative study of stereo matching performance on aerial imagery with convolutional neural networks (CNNs). Two CNN-based approaches are benchmarked against Semi-Global Matching (SGM). Furthermore, a module is proposed to jointly predict an object mask and a disparity map. The fine-tuned Pyramid Stereo Matching (PSM) network has lower errors than SGM, with the exception of the 1-pixel error evaluated on pixels in the foreground. The fine-tuned PSM network can be used for applications which require matching performance similar to SGM. SGM predicts disparities for about 44% of the pixels, while the learning methods predict disparities for all pixels (dense disparity maps). The proposed mask module predicts inaccurate and noisy building masks. The disparity maps became more accurate when the network was jointly trained for predicting a building mask and disparity.

1. Introduction

While stereo matching is a common topic within autonomous driving [1, 8], this research is focussed on applying stereo matching methods to aerial imagery. The goal is to estimate height from aerial imagery with stereo matching networks to create a dense digital surface model (DSM). DSMs are recognized as probably the most fundamental and one of the most needed variables in geographic information systems [2]. DSMs are used in a wide variety of applications [31], among which climate studies [13], flood analysis [22] and agriculture applications [25]. Readaar, the company for which this research is conducted, will use DSMs for applications in real estate.

Figure 1. Schematic overview of the correspondence problem in stereo matching given a pair of rectified images.

In stereo matching, the objective is to estimate a disparity d for each pixel in the reference image. This goal can be translated to the following correspondence problem [5]: given a pair of rectified stereo images, for each pixel in the reference image find the conjugate pixel in the target image. The disparity is defined as the horizontal shift in pixels, such that if pixel i in the reference image located at (x_i, y_i) has disparity d_i, its conjugate is pixel i′ located at (x_i − d_i, y_i) in the target image, as shown in Figure 1.

Knowing the focal length f of the camera, the baseline B of the system set-up and the height h_cam at which the reference image is acquired (see Figure 2), the disparity can be transformed to a distance h with the following formula:

h = h_cam − (f · B) / d
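As an illustration, the conversion from disparity to height, and its inverse (used later when LiDAR heights are converted to ground-truth disparities, see Section 3), can be written as small helpers. The function names and example values are hypothetical and only restate the formula above.

```python
def disparity_to_height(d, f, B, h_cam):
    """Height corresponding to disparity d (pixels), given focal length f
    (pixels), baseline B and acquisition height h_cam of the reference image."""
    return h_cam - (f * B) / d

def height_to_disparity(h, f, B, h_cam):
    """Inverse mapping: the disparity corresponding to a point at height h."""
    return (f * B) / (h_cam - h)

# Hypothetical example values, not taken from the stereo10 metadata.
print(disparity_to_height(d=120.0, f=10000.0, B=2.0, h_cam=450.0))
```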

This research is a comparative study of two deep learning approaches to stereo matching (Sections 4.2 and 4.3) and one local-global method (Section 4.1) applied to aerial imagery. Additionally, a module (Section 4.4) is proposed in order to jointly predict a binary object mask with a disparity map.

2. Related Work

Figure 2. Schematic overview of a stereo vision system set-up for overhead imagery. The images are taken one after the other, from an airplane, with the camera looking downwards [33].

Scharstein et al. [27] presented a taxonomy of stereo matching algorithms to allow for comparison of different dense two-frame stereo matching algorithms. According to them, a stereo matching algorithm generally consists of the following steps (not necessarily in this order):

i. matching cost computation, for example a cost function based on the squared differences in intensity values;
ii. cost (support) aggregation, for example by summing cost over a window with constant disparity;
iii. disparity computation / optimization, for example by selecting the disparity which corresponds with the minimal aggregated cost at each pixel (winner takes all);
iv. disparity refinement, for example by enforcing a smoothness assumption.

Current literature focuses on step (i), computing matching cost, and step (iv), disparity refinement [3].

Besides these algorithmic steps, Scharstein et al. [27] divide stereo matching methods into two groups: local and global. Local (window-based) methods compute disparities based on values within a finite window. Global methods minimize a global cost function constrained by explicit global smoothness assumptions.

The related work is separated into three parts. Each part corresponds to one method used in this comparative research. In each Section, references to the framework proposed by Scharstein et al. are made in order to give context to each method. First local learning based methods are outlined (Section 2.1), then global learning based methods (Section 2.2) and finally local-global non-learning based methods (Section 2.3).

2.1. Local learning based methods

Zbontar et al. [35] were the first to propose a convolutional neural net (CNN) architecture to calculate the (i) matching cost. The architecture proposed by Zbontar et al. computes the matching cost by comparing patches (windows) from the reference image with patches from the target image. The patches are passed through a Siamese CNN architecture which evaluates patch descriptors. The similarity between the descriptors of the reference and target patch is the input for a matching cost function.

A year later, Zbontar et al. [36] improved the speed of their architecture, without compromising much of the accuracy, by computing the matching cost with an inner product. Several adjustments [4, 18] to Zbontar et al.'s architecture have been proposed which improve accuracy on multiple benchmark datasets [8, 21, 26].

The limited receptive field of local methods requires hand-crafted regularization [12, 30] and post-processing functions [34, 36] to be applied to (iv) refine the disparity map [28].

2.2. Global learning based methods

Recently, learning frameworks which include global information were proposed [3, 14, 20, 23]. These frameworks do not need hand-crafted regularization methods and post-processing functions.

The CNN architecture proposed by Chang et al. [3] captures context information with spatial pyramid pooling and a 3D CNN. Kendall et al. [14] incorporate contextual information with 3D convolutions; disparities are regressed over the cost volume. Mayer et al. [20] propose three synthetic videos on which they train their DispNet architecture, which is an adaptation of FlowNet [6] for disparity estimation. Pang et al. [23] propose a two-stage learning system where the first stage is DispNet and the second stage refines the disparity map generated by DispNet.

2.3. Local-global methods

Local-global methods, like Semi-Global Matching (SGM) by Hirschmüller [12], combine (i) matching cost computation with (iv) disparity refinement. SGM enforces a semi-global constraint along several 1D paths, which makes it a computationally efficient method. In Section 4.1, SGM is explained in more detail.

Another computationally efficient method is ELAS (Efficient LArge-scale Stereo) [9]. ELAS estimates a disparity map from support points with triangulation. Support points are pixels that can be robustly matched.

These methods do not learn (hyper)parameters. Tuning (hyper)parameters can be difficult and time-consuming [28].

3. Data

In this Section first the datasets are enumerated, then the data preparation process is explained.

Aerial imagery: The stereo10 dataset consists of aerial imagery with a ground resolution of 10 cm, covering all land mass of the Netherlands. This dataset is managed by “Beeldmateriaal Nederland”. See Figure 7 for some examples.


The imagery is acquired with a 60% forward and 30% side overlap and provided as multiple TIFF files of approximately 1.7 by 1.1 km. The imagery is acquired yearly in the leaf-off season (winter and early spring) on a day with favorable weather conditions to ensure maximum visibility of man-made structures. This research uses the imagery acquired in early 2017. Intrinsic and extrinsic parameters of the cameras were provided as metadata.

Building footprints: The footprints of buildings (and additional information) are provided by the “Basisregistratie Adressen en Gebouwen” (BAG). The BAG is the Dutch registration of addresses and buildings. The dataset is maintained by the Kadaster, a Dutch governmental organization.

The information about each building is gathered by the municipality in which the building is situated. The footprints have an absolute point precision of 20 cm in urban areas and 40 cm in rural areas. An extract of the BAG was taken at the beginning of 2018 from “Publieke Dienstverlening Op de Kaart” (PDOK). PDOK is a platform which distributes Dutch governmental geospatial data.

Municipalities: The topography of a municipality is provided by the “Basisregistratie Grootschalige Topografie” (BGT) and also acquired through PDOK.

Height data: Height information is acquired with a technique called “Light Detection And Ranging” (LiDAR), the same technique used in the KITTI dataset [8]. LiDAR is a height measuring technique based on lasers.

In the Netherlands a countrywide LiDAR dataset is publicly available, which is updated roughly every 7 years [32]. This dataset is commonly known as the “Actueel Hoogtebestand (Nederland)” (AHN). The AHN is provided as a point cloud and in a gridded format. This research uses the point cloud dataset acquired in early 2017.

Data preparation: The aerial imagery is tiled in tiles of 100 by 100 meters. The tiles are filtered such that they intersect with a certain municipality and each tile contains a part of a building footprint. Then stereo pairs are selected of imagery that covers the same tile and is acquired with the same camera. Pairs of stereo images are rectified, as shown in Figure 7 and explained in Section A. The AHN points and building footprints are transformed to disparities and a building mask for the reference image (see Figure 8), using the intrinsic and extrinsic parameters of the camera.

4. Methods

In this Section three methods are described: Semi-Global Matching by Hirschmüller [12] (Section 4.1); the patch network proposed by Luo et al. [18] (Section 4.2); and the pyramid stereo matching network proposed by Chang et al. [3] (Section 4.3).

4.1. Semi-Global Matching

With SGM [12], proposed by Hirschmüller, an energy function containing terms for the pixelwise matching cost and a global smoothness constraint is optimised. The energy function E(D) for disparity map D is defined as:

E(D) = Σ_p ( C(p, D_p) + Σ_{q∈N_p} P_1 δ[|D_p − D_q| = 1] + Σ_{q∈N_p} P_2 δ[|D_p − D_q| > 1] )

with C(p, D_p) the pixel matching cost for pixel p with disparity D_p. P_1 is a penalty for pixels q, with disparity D_q, in neighbourhood N_p of pixel p which have a slightly different disparity, i.e. an absolute difference of 1 pixel. P_2 is a penalty for pixels which have a larger absolute difference in disparity with pixels in their surroundings. δ[·] is the Kronecker delta function which evaluates to 1 if the condition inside the square brackets is satisfied, otherwise 0.

The energy E(D) is minimized through the use of an intermediate pixel-wise loss function:

L_r(p, d) = C(p, d) + min( L_r(p − r, d), L_r(p − r, d − 1) + P_1, L_r(p − r, d + 1) + P_1, min_{i ≠ d±1} L_r(p − r, i) + P_2 ) − min_k L_r(p − r, k)

with L_r the cost aggregated along a path traversed in direction r (see Figure 3). The last term is the minimum path cost, which is subtracted to avoid large values. The loss function is recursive, with p − r being the pixel before pixel p along path r.

A pixel p gets the disparity D_p that minimizes the aggregated loss over all directions r in R:

D_p = arg min_d Σ_{r∈R} L_r(p, d)

Hirschmüller advises the number of paths in R to be at least 8, and 16 for good coverage.
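A minimal sketch of the cost aggregation along a single path (left to right along image rows) is given below. It assumes a precomputed matching cost volume cost[y, x, d] and illustrative penalty values P1 and P2; it only illustrates the recursion above and is not the implementation used in this research (which relies on MATLAB's disparity function, see Section 5.1).

```python
import numpy as np

def aggregate_left_to_right(cost, P1=10.0, P2=120.0):
    """Aggregate matching cost along one path direction (left to right),
    following the recursion L_r(p, d). cost has shape (H, W, D)."""
    H, W, D = cost.shape
    L = np.zeros_like(cost)
    L[:, 0, :] = cost[:, 0, :]
    for x in range(1, W):
        prev = L[:, x - 1, :]                        # L_r(p - r, .)
        prev_min = prev.min(axis=1, keepdims=True)   # min_k L_r(p - r, k)
        same = prev                                  # same disparity
        plus = np.roll(prev, 1, axis=1) + P1         # disparity d - 1
        minus = np.roll(prev, -1, axis=1) + P1       # disparity d + 1
        plus[:, 0] = np.inf                          # no candidate below d = 0
        minus[:, -1] = np.inf                        # no candidate above d_max
        # larger disparity jumps, using the common simplification min over all i
        far = np.broadcast_to(prev_min + P2, prev.shape)
        L[:, x, :] = cost[:, x, :] + np.minimum.reduce([same, plus, minus, far]) - prev_min
    return L

# Full SGM repeats this aggregation for 8 or 16 path directions and assigns
# each pixel the disparity that minimises the summed path costs.
```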

4.2. Patch Network

Luo et al. propose a CNN architecture that calculates correlation between image patches [18], from now on referred to as the “patch” network. Given a pair of rectified stereo images, a square patch is extracted from the reference (left) image and a rectangular patch from the target (right) image, see Figure 4. With a Siamese CNN architecture, feature tensors for both patches are evaluated.

Figure 3. Schematic overview of the 8 paths from all directions r to be traversed in SGM.

Figure 4. Architecture overview of the patch network proposed by Luo et al. [18].

The feature tensor of the left patch is a vector describing its center pixel. The feature tensor of the right patch is a matrix describing its middle row. Correlation between the feature tensors is calculated with an inner product. The center pixel (x, y) of the left patch is considered to have disparity d if pixel (x − d, y) in the right patch has the highest correlation.

If the left patch has width 1 + 2W, the right patch has width 1 + 2W + d_max, where d_max is the maximum disparity. W is set to 18 and d_max is set to 192, thus the left patch is 37 by 37 pixels and the right patch 37 by 229 pixels.
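The matching step itself reduces to an inner product followed by a softmax over the disparity candidates. The snippet below is only a shape-level illustration of that step; the random tensors stand in for the outputs of the Siamese branches (“conv final” in Table 1) under the W = 18, d_max = 192 setting.

```python
import torch

d_max = 192
left_feat = torch.randn(64)        # descriptor of the centre pixel (1 x 1 x 64)
right_feat = torch.randn(193, 64)  # descriptors of the middle row (1 x 193 x 64)

# Inner product between the left descriptor and every candidate location in
# the right patch, followed by a softmax over the d_max + 1 disparity candidates.
scores = right_feat @ left_feat            # shape (193,)
probs = torch.softmax(scores, dim=0)
predicted_disparity = int(torch.argmax(probs))
```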

The network architecture is shown in Table 1. All layers in the patch network are followed by a batch normalization and a rectified linear unit (ReLU), except the last layer “conv final”, where the ReLU is omitted to not lose information from the negative values.

Loss: The cross-entropy loss given by Luo et al. [18] is minimized with respect to the model weights w for the output of the softmax layer p_d(y, w):

min_w − Σ_{d=1}^{d_max} p_gt(y_d) log p_d(y_d, w)

The target distribution p_gt(y_d) is symmetrical around the pixel which corresponds with the ground truth disparity d̂:

p_gt(y_d) = λ_1  if d = d̂
            λ_2  if |d − d̂| = 1
            λ_3  if |d − d̂| = 2
            0    otherwise

Table 1. Patch CNN architecture as proposed by [18]. Each layer in the CNN is followed by a spatial batch normalization and a rectified linear unit (ReLU). The final convolution layer is followed by a spatial batch normalization only. “pad” is short for padding. The output dimensions are given as [width x height when the left patch is the input / height when the right patch is the input x channels].

Name          Layer setting            Output dimension
Input                                  37 x 37 / 229 x 3
CNN
  conv1       5 x 5, 32, pad = none    33 x 33 / 225 x 32
  conv2       5 x 5, 32, pad = none    29 x 29 / 221 x 32
  conv3       5 x 5, 64, pad = none    25 x 25 / 217 x 64
  conv4       5 x 5, 64, pad = none    21 x 21 / 213 x 64
  conv5       5 x 5, 64, pad = none    17 x 17 / 209 x 64
  conv6       5 x 5, 64, pad = none    13 x 13 / 205 x 64
  conv7       5 x 5, 64, pad = none    9 x 9 / 201 x 64
  conv8       5 x 5, 64, pad = none    5 x 5 / 197 x 64
Features
  conv final  5 x 5, 64, pad = none    1 x 1 / 193 x 64
Correlation
  product                              1 x 193
  softmax                              1 x 193

As in [18], the hyperparameters are set to λ_1 = 0.5, λ_2 = 0.2 and λ_3 = 0.05.
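The target distribution and the resulting loss can be sketched as follows; the helper functions are hypothetical and only illustrate how λ_1, λ_2 and λ_3 smooth the target around the ground-truth disparity.

```python
import torch

def target_distribution(gt_disparity, d_max=192, lambdas=(0.5, 0.2, 0.05)):
    """Smoothed target p_gt over d_max + 1 disparity candidates, centred on
    the (integer) ground-truth disparity."""
    p = torch.zeros(d_max + 1)
    for offset, weight in zip((0, 1, 2), lambdas):
        for d in (gt_disparity - offset, gt_disparity + offset):
            if 0 <= d <= d_max:
                p[d] = weight
    return p

def patch_loss(scores, gt_disparity):
    """Cross-entropy between the target distribution and the softmax over the
    correlation scores, for a single training example."""
    log_probs = torch.log_softmax(scores, dim=0)
    target = target_distribution(gt_disparity, d_max=scores.numel() - 1)
    return -(target * log_probs).sum()
```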

4.3. Pyramid Stereo Matching Network

The pyramid stereo matching (PSM) network proposed by Chang et al. [3] is used as the global learning based method. Their network takes image patches of equal size as input.

As described in their paper, the network has two important parts (see Figure 5): a Spatial Pyramid Pooling (SPP) module and a 3D CNN module.

The SPP module captures information at different granularities with pooling layers. The layers before the SPP module are the first layers from ResNet [11]. After the SPP module, a 4D cost volume is created by concatenation of the features extracted from the left patch with the features extracted from the right patch, shifted for each disparity. All filters before the cost volume are shared.

The 3D CNN module takes in the cost volume, from which it computes disparities. There are three hourglass modules, in what is called a “stacked hourglass” architecture. From the output of each hourglass module an (intermediate) disparity map is regressed. The complete architecture is stated in Table 5.
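The regression step is not spelled out here, but in PSMNet-style networks it is typically the soft argmin of GC-Net [14]: a softmax over the (negated) cost volume followed by an expectation over the disparity indices. A minimal sketch, assuming a cost volume of shape (D, H, W), is:

```python
import torch

def regress_disparity(cost_volume):
    """Soft-argmin disparity regression over a cost volume of shape (D, H, W):
    each pixel's disparity is the expectation of the disparity indices under a
    softmax of the negated costs."""
    D, _, _ = cost_volume.shape
    probs = torch.softmax(-cost_volume, dim=0)                      # (D, H, W)
    disparities = torch.arange(D, dtype=probs.dtype).view(D, 1, 1)
    return (probs * disparities).sum(dim=0)                         # (H, W)
```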

Loss: Chang et al. propose an average smooth L1 loss because of its robustness and low sensitivity to outliers [10]:

L(D, D̂) = (1 / |N̂|) Σ_{p∈N̂} L_{1,smooth}(D_p − D̂_p)

with N̂ the set of pixels with ground-truth disparity and

L_{1,smooth}(x) = 0.5 x²      if |x| < 1
                  |x| − 0.5   otherwise

Figure 5. Architecture overview of the spatial pyramid network proposed by Chang et al. [3].
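This loss maps directly onto PyTorch's built-in smooth L1 loss; the sketch below assumes predicted and ground-truth disparity maps plus a boolean mask marking the pixels in N̂.

```python
import torch
import torch.nn.functional as F

def disparity_loss(pred, gt, valid_mask):
    """Average smooth L1 loss over the pixels with ground-truth disparity.
    pred, gt: (H, W) disparity maps; valid_mask: (H, W) boolean mask."""
    return F.smooth_l1_loss(pred[valid_mask], gt[valid_mask], reduction='mean')
```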

4.4. Object Mask

We propose an extension to the PSM network that predicts an object mask for the left patch. The mask module has the left patch, the output of the “conv1 1” block of the PSM network and the output of the SPP module of the PSM network as input (see Figure 5 and Table 5).

This module is inspired by the EdgeStereo architecture proposed by Song et al. [29]. They increase the accuracy of predicted disparities at object boundaries by integrating a module that detects edges, to include context information. Their edge detector is composed of a backbone of the disparity network. They observe that edge detection and stereo matching can help each other. We expect that including context information about buildings might improve the accuracy of the predicted disparities at the boundaries of the buildings.

The architecture is stated in Table 2. Each of the mask module layers is followed by a batch normalization and a ReLU, with the exception of the layers in the output blocks, again to not lose information in negative values. A binary mask is created by applying a sigmoid function after the output layers.

The values, before the sigmoid is applied on the output of scale 2, are upsampled with bilinear interpolation, such that they have the same dimensions as scale 1. This upsampled output is added to the output of scale 1, creating a skip connection, before the sigmoid is applied. The upsampling and the skip connection are repeated for scale 1 to 0. These skip connections and predictions at different scales are inspired by the Feature Pyramid Network proposed by Lin et al. [17].

Finally, to integrate the context information from the mask back into the PSM network, the layer “conv2 3” of the mask module is concatenated with the input of the “fusion” layer of the PSM architecture (see Table 5), for the reference patch only.

Loss: Inspired by Song et al. [29], a deep supervised smooth L1 loss is applied to train the mask module:

L_total = Σ_{k=0}^{K−1} (α_k / |N_k|) Σ_{p∈N_k} L_{1,smooth}(M_{p,k} − M̂_{p,k})

with K the total number of scales, α_k a weight for scale k, N_k the pixels at scale k, and M_k and M̂_k the predicted and (downsampled) ground truth mask at scale k, respectively. The k-th scale corresponds with the scale having size (1/2)^k times the dimensions of the left patch.

The weights α_k are set to α_0 = 6, α_1 = 3 and α_2 = 1.
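A sketch of the deep supervised loss over the three scales is given below. Downsampling the ground-truth mask with average pooling is an assumption made for the illustration; the text above only states that the ground truth is downsampled.

```python
import torch
import torch.nn.functional as F

def mask_loss(pred_masks, gt_mask, alphas=(6.0, 3.0, 1.0)):
    """Deep supervised smooth L1 loss over K scales.
    pred_masks: list of predicted masks, scale k having shape (1, 1, H/2^k, W/2^k);
    gt_mask: full-resolution ground-truth mask of shape (1, 1, H, W)."""
    total = 0.0
    for k, (alpha, pred) in enumerate(zip(alphas, pred_masks)):
        # Downsample the ground truth to scale k (assumed: average pooling).
        gt_k = F.avg_pool2d(gt_mask, kernel_size=2 ** k) if k > 0 else gt_mask
        total = total + alpha * F.smooth_l1_loss(pred, gt_k, reduction='mean')
    return total
```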

5. Experiments

The experimental settings are described in Section 5.1. The evaluation metrics are given in Section 5.2. The results for stereo matching are shown in Section 5.3 and for mask prediction in Section 5.4.


Table 2. The mask module is integrated with the PSM network. The mask module predicts a mask at different scales. The lowest scale (0) has the same dimensions as the left patch.

Name             Layer setting             Output dimension
Scale 2
  concat (PSM)                             1/4 H x 1/4 W x 320
  CNN2
    conv2 1      3 x 3, 128                1/4 H x 1/4 W x 128
    conv2 2      3 x 3, 64                 1/4 H x 1/4 W x 64
    conv2 3      3 x 3, 32                 1/4 H x 1/4 W x 32
  Output2
    output 2     1 x 1, 1                  1/4 H x 1/4 W x 1
Scale 1
  conv1 1 (PSM)                            1/2 H x 1/2 W x 32
  CNN1
    conv1 1      3 x 3, 32                 1/2 H x 1/2 W x 32
    conv1 2      3 x 3, 32                 1/2 H x 1/2 W x 32
    conv1 3      3 x 3, 32                 1/2 H x 1/2 W x 32
  Output1
    output 1     1 x 1, 1, add output 2    1/2 H x 1/2 W x 1
Scale 0
  Left patch                               H x W x 3
  CNN0
    conv0 1      3 x 3, 32                 H x W x 32
    conv0 2      3 x 3, 32                 H x W x 32
    conv0 3      3 x 3, 32                 H x W x 32
  Output0
    output 0     1 x 1, 1, add output 1    H x W x 1

5.1. Experiment details

The data which covers the municipality of Breda (Netherlands) is used to train the models. For evaluation, 5000 randomly selected images of the data covering the municipality of Roosendaal (Netherlands) are chosen. At these two locations the most recently acquired stereo10 and AHN datasets are available. Both are acquired in early 2017.

Note that SGM does not require training, but careful hyperparameter tuning. SGM is implemented with the “disparity” function of MATLAB's Computer Vision System Toolbox [19], which uses the sum of absolute differences [16] to compute the pixel matching cost.

As stated in Section 4.2, the patch network has left patches of 37 by 37 pixels and right patches of 37 by 229 pixels. The patches are randomly extracted from the image pairs, with the condition that the center pixel of the left patch has a known ground truth disparity.

The PSM network is trained with cropped images of 256 by 512 pixels. The left patch should contain at least one pixel with known ground truth disparity.

The patch and pyramid networks are implemented with PyTorch [24]. For data preprocessing, each image is color normalized. The whole network is trained end-to-end with the Adam optimizer [15] with parameters β_1 = 0.9 and β_2 = 0.999. The batch size is set to 3 and the architectures are trained on NVIDIA Tesla V100 GPUs.

For stereo matching, the patch and PSM network are trained for 5 epochs with a learning rate of 3 and 1e-6, respectively. For both architectures, d_max is set to 192. Additionally, the weights provided by Chang et al. [3] are used as initialisation, after which the network is trained end-to-end for 5 epochs with a learning rate of 1e-6, from now on referred to as the “fine-tuned PSM network”.
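Under these settings, a single training step would look roughly like the sketch below; model and loader are placeholders for the PSM network and a data loader over the Breda training tiles, and are not defined in the paper.

```python
import torch
import torch.nn.functional as F

# "model" and "loader" are illustrative placeholders, not defined in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, betas=(0.9, 0.999))

model.train()
for left, right, gt_disp, valid in loader:                 # batches of size 3
    optimizer.zero_grad()
    pred = model(left, right)                              # predicted disparity
    loss = F.smooth_l1_loss(pred[valid], gt_disp[valid])   # average smooth L1
    loss.backward()
    optimizer.step()
```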

To evaluate the object mask and the influence of the object mask on the disparity prediction, an experiment consisting of three parts is conducted:

I. The weights of the first layer in the fusion block (see Table 5) are set to random values, then the network is trained end-to-end with the average smooth L1 (disparity) loss for 5 epochs.

II. The weights of the first layer in the fusion block are set to random values; the network is trained end-to-end for 5 epochs with the deep smooth L1 (mask) loss and for 5 epochs with the average smooth L1 (disparity) loss.

III. The weights of the first layer in the fusion block are set to random values; the layer is adjusted to include context information from the mask module (as proposed in Section 4.4); the network is trained end-to-end for 5 epochs with the deep smooth L1 (mask) loss and for 5 epochs with the average smooth L1 (disparity) loss.

During each part the learning rate is set to 1e-6. Note that in experiments II and III the object mask is used for training and not provided during evaluation.

5.2. Evaluation

The following evaluation metrics are used:

Disparity: The predicted disparities are evaluated with the n-pixel error:

n-px error = (1 / |N̂|) Σ_{p∈N̂} δ[ |D(p) − D̂(p)| > n ]

where n is a threshold.

KITTI error: For comparison with the KITTI dataset [8], the error metric used to evaluate methods on the KITTI dataset is included:

d_error,KITTI = (1 / |N̂|) Σ_{p∈N̂} δ[ |D(p) − D̂(p)| > 3  or  |D(p) / D̂(p) − 1| > 0.05 ]

Figure 6. Two examples of evaluated disparity maps for each method: (a) SGM, (b) Patch, (c) PSM, (d) fine-tuned PSM. The upper left image is the reference image. The upper row (excluding the reference image) contains the evaluated disparity maps. White indicates high disparity, black low disparity and light purple no predicted disparity. The lower row shows the difference between the evaluated disparities and the ground truth, where blue indicates that the evaluated disparity is greater than the ground truth, red vice versa.

Mask: The predicted mask is evaluated with the intersection over union J(M, M̂):

J(M, M̂) = ‖M ∩ M̂‖ / ‖M ∪ M̂‖

Coverage: The coverage is defined as the number of pixels with a predicted disparity over the total number of pixels.
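These metrics translate directly into a few lines of NumPy; the functions below are an illustrative implementation, with valid marking the pixels in N̂.

```python
import numpy as np

def n_pixel_error(pred, gt, valid, n):
    """Fraction of valid pixels whose absolute disparity error exceeds n."""
    err = np.abs(pred[valid] - gt[valid])
    return np.mean(err > n)

def kitti_error(pred, gt, valid):
    """Fraction of valid pixels with an error above 3 px or a relative error
    above 5%, following the definition in Section 5.2."""
    p, g = pred[valid], gt[valid]
    return np.mean((np.abs(p - g) > 3) | (np.abs(p / g - 1) > 0.05))

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union

def coverage(has_prediction):
    """Fraction of pixels with a predicted disparity."""
    return has_prediction.mean()
```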

5.3. Stereo Matching Performance

Examples of results of the different methods after training for 5 epochs to predict disparity are shown in Figure 6. The images indicate that, of the learning methods, the PSM model with fine-tuned weights (d) outputs the sharpest disparity maps. The disparity map of the PSM model without fine-tuned weights (c) tends to underestimate the disparities. The patch network is accurate when there are clear visual references like edges of buildings or other discontinuities. There are some outliers (white speckles), which are expected without applying a smoothness function. SGM appears to predict accurate disparities. The SGM disparity map (a) is clearly sparse, which is indicated by the light purple.

In Table 3 the average errors over all 5000 evaluation images are shown. For comparison with the SGM method, the pixels are separated into the following categories: all pixels (all); pixels for which SGM predicted disparities (in SGM); and pixels for which SGM did not predict disparities (not in SGM). Furthermore, the pixels are separated into foreground (fg) and background (bg) using the ground-truth building mask. Pixels inside a building are considered to be foreground, pixels outside a building are background.

Table 3. Average n-pixel errors and KITTI error of the evaluated disparities for the test set for different methods. The pixels are considered to be in the foreground (fg) if inside a building according to the mask, otherwise background (bg). For comparison with the benchmark (SGM), pixels are also separated according to whether they are in the benchmark or not.

                                     > 1 px               > 2 px               > 3 px               KITTI error
                                     all    fg     bg     all    fg     bg     all    fg     bg     all    fg     bg
SGM                                  0.096  0.100  0.095  0.068  0.075  0.068  0.061  0.067  0.060  0.091  0.083  0.092
Patch (5 epochs)       all           0.526  0.529  0.525  0.424  0.407  0.426  0.389  0.362  0.393  0.497  0.422  0.508
                       in SGM        0.205  0.225  0.203  0.094  0.108  0.092  0.076  0.088  0.074  0.210  0.139  0.220
                       not in SGM    0.848  0.778  0.860  0.756  0.651  0.774  0.704  0.585  0.723  0.785  0.653  0.807
PSM (5 epochs)         all           0.570  0.637  0.560  0.400  0.444  0.394  0.328  0.342  0.326  0.517  0.465  0.525
                       in SGM        0.357  0.441  0.346  0.151  0.207  0.144  0.097  0.121  0.093  0.343  0.257  0.355
                       not in SGM    0.792  0.804  0.790  0.660  0.645  0.662  0.570  0.530  0.576  0.699  0.642  0.708
Fine-tuned PSM         all           0.387  0.419  0.382  0.307  0.292  0.310  0.270  0.236  0.275  0.337  0.296  0.343
(5 epochs)             in SGM        0.107  0.140  0.103  0.064  0.065  0.064  0.053  0.047  0.053  0.097  0.077  0.100
                       not in SGM    0.678  0.657  0.681  0.561  0.486  0.573  0.496  0.397  0.513  0.587  0.483  0.604
Fine-tuned PSM         all           0.327  0.329  0.327  0.261  0.228  0.266  0.229  0.186  0.235  0.282  0.235  0.289
(80 epochs)            in SGM        0.089  0.104  0.087  0.059  0.058  0.059  0.049  0.046  0.049  0.081  0.070  0.083
                       not in SGM    0.576  0.521  0.585  0.472  0.374  0.488  0.416  0.306  0.435  0.491  0.376  0.510

The Table shows that SGM has the lowest average 1-pixel errors. The fine-tuned PSM network has the lowest average 2- and 3-pixel errors. The fine-tuned network has a noticeably higher error on the pixels in the background with respect to the foreground when evaluating with the KITTI error.

A salient detail is that all learning methods have lower average errors for pixels in the foreground with respect to pixels in the background, for the pixels which are not in SGM. Qualitatively assessed, SGM often does not predict disparities for vegetation. (See SGM disparity map (a) in Figure 6, below the bottom right house in the upper image or right of the houses in the lower image.) Vegetation often consists of patterns, like leaves and branches, which makes matching ambiguous. SGM is tuned to not predict a disparity for ambiguous matches, therefore disparity is not predicted for these pixels. The learning methods predict a disparity for all pixels. The ground-truth has a higher chance to have outliers on vegetation in comparison with human-made objects, for example because leaves can create multiple scatters while human-made objects generally do not. For this reason there is a higher chance to predict an “erroneous” disparity, since the ground-truth has a higher chance to be noisy.

Between the patch network and PSM network, which are both trained for 5 epochs, the errors are comparable. The patch network has lower errors for the pixels which are in SGM, while the PSM network has lower errors for the pixels which are not in SGM. When evaluating the errors with all pixels, the patch network has lower 1-pixel and KITTI errors and the PSM network lower 2- and 3-pixel errors.

The learning methods have lower average errors for pixels which are in SGM than for pixels which are not. This indicates SGM predicts disparities for pixels which generally are less ambiguous to match. Of the learning methods, the PSM network with fine-tuned weights has overall the lowest errors.

SGM has a coverage of 44.3% of all pixels, 43.9% for foreground pixels and 44.4% for background pixels.

Finally, the fine-tuned PSM network is trained for 75 more epochs (80 in total). As can be seen in Table 3, the fine-tuned PSM network has lower errors than SGM in all cases except for the 1-pixel error evaluated on the pixels in the foreground.

5.4. Mask Module Performance

First the results of the mask prediction are interpreted, then the disparity results.

Mask prediction: Examples of the masks generated after conducting the experiments are shown in Figure 10. The intermediate results (a) are generated after training for 5 epochs with the mask loss only. This part is the same for the experiment (II) where the context information of the mask module is not integrated back into the PSM network and for the experiment (III) where the context information is integrated.

Figure 10 shows that the masks predicted by the mask module overestimate the building footprints. The predicted masks do not have the expected rectangular shapes and the masks are noisy, especially the mask corresponding with the experiment (II) for which the context information of the mask module is not integrated back into the PSM network (b). The module misclassified built environment around buildings, like paved roads, as buildings. The module falsely predicts that water (bottom example) is a building. The module is probably inaccurate here due to too few training examples with water and because the color of water is dark, which is generally similar to buildings and in contrast with vegetation.

The masks of the experiment (II), where the context information from the mask module is not integrated with the PSM network, visually look noisier than the masks generated after conducting the other experiment (III). The masks of experiment (II) actually do not look like masks, but rather look like edges. The masks corresponding with experiment (III) still overestimate the buildings, though the masks seem to be slightly sharper than the masks of the intermediate results, showing building edges more clearly. This suggests that when the (approximate) depth of an object is known, the model is able to more accurately predict the location of the object.

It is noteworthy that the ground-truth is not accurate. In the top example it is clear that the building footprint does not correspond with the location of the roof and that a building can be (partially) covered by a tree. In the bottom example there are parts of the building missing.

The intermediate results have an IoU of 0.344, the experiment (II) where the context information is not integrated back into the PSM network has an IoU of 0.097, and the last experiment (III) has an IoU of 0.285.

Disparity prediction: Examples of the disparity maps generated by the three experiments are shown in Figure 9; SGM is included for comparison. In both examples the SGM result is noticeably sparse. Qualitatively, the difference between the three experiments is subtle. The predicted disparities at the edges of buildings seem to be slightly more accurate for the experiment (III) where the context information of the mask module is integrated in the PSM network, in comparison with the other experiments.

In Table 4 the quantitative results are shown. The experiment where the context information of the mask module is integrated back into the PSM network has the lowest average error in all cases. The average errors of this experiment, computed for all pixels or for the pixels for which SGM did not predict a disparity, are lower than the errors of the fine-tuned PSM network (Table 3). The errors calculated for the pixels which are in SGM are comparable with the errors of the fine-tuned PSM network.

6. Conclusions

In this research a comparative study of three stereo matching techniques is conducted and an extension to the PSM network is proposed to simultaneously predict an object mask and a disparity map.

The fine-tuned Pyramid Stereo Matching (PSM) network, after training for 80 epochs, had lower errors than SGM, with the exception of the 1-pixel error evaluated on pixels in the foreground. The fine-tuned PSM network can be used for applications which require matching performance similar to SGM. SGM predicts disparities for about 44% of the pixels, while the learning methods predict disparities for all pixels (dense disparity maps).

The proposed mask module predicted inaccurate and noisy building masks. The results suggest that stereo matching and mask prediction become more accurate when combined.

For further research, using edge maps of roof segments instead of building footprints as ground-truth for training the mask module might further increase the accuracy of the predicted disparity map. The mask module does not need to be adjusted in order to do this.

Acknowledgement: This work is conducted under supervision of T. Mensink (University of Amsterdam) and S.A. Briels (Readaar). J.M.M.P Michel (Readaar) was second corrector.

References

[1] M. Achtelik, A. Bachrach, R. He, S. Prentice, and R. Nicholas. Stereo vision and laser odometry for autonomous helicopters in GPS-denied indoor environments. Volume 7332, pages 7332 – 7332–10, 2009.

[2] P. M. Atkinson. Surface modelling: What's the point? Transactions in GIS, 6(1):1–4.

[3] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. arXiv preprint arXiv:1803.08669, 2018.

[4] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang. A deep visual correspondence embedding model for stereo matching costs. 2015 IEEE International Conference on Computer Vision (ICCV), pages 972–980, 2015.

[5] O. Faugeras. Three-dimensional Computer Vision. The MIT Press, 1993.

[6] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. CoRR, abs/1504.06852, 2015.

[7] A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16–22, Jul 2000.

[8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite, 2012.

[9] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Computer Vision – ACCV 2010, pages 25–38, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[10] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[12] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. CVPR, pages 807–814. IEEE Computer Society, 2005.

[13] Y. Hong, H. A. Nix, M. F. Hutchinson, and T. H. Booth. Spatial interpolation of monthly mean climate data for China. International Journal of Climatology, 25(10):1369–1379.

[14] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. 03 2017.

[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[16] K. Konolige. Small vision system: Hardware and implementation. In Proc. of the Intl. Symp. of Robotics Research (ISRR), pages 111–116, 1997.

[17] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. CoRR, abs/1612.03144, 2016.

[18] W. Luo, A. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, volume 2016-January, pages 5695–5703. IEEE Computer Society, 1 2016.

[19] MATLAB and Computer Vision System Toolbox. Version 9.3.0 (R2017b) Update 7. The MathWorks Inc., Natick, Massachusetts, 2010.

[20] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016. arXiv:1512.02134.

[21] M. Menze and A. Geiger. Object scene flow for autonomous vehicles, 2015.

[22] M. Nishio and M. Mori. Hydrologic analysis of a flood based on a new Digital Elevation Model. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 127–134, June 2015.

[23] J. Pang, W. Sun, J. S. J. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. CoRR, abs/1708.09204, 2017.

[24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[25] P. Pilesjö, A. Persson, and L. Harrie. Digital elevation data for estimation of potential wetness in ridged fields – comparison of two different methods. Agricultural Water Management, 79(3):225–247, 2006.

[26] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In X. Jiang, J. Hornegger, and R. Koch, editors, GCPR, volume 8753 of Lecture Notes in Computer Science, pages 31–42. Springer, 2014.

[27] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.

[28] A. Seki and M. Pollefeys. SGM-Nets: Semi-global matching with neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[29] X. Song, X. Zhao, H. Hu, and L. Fang. EdgeStereo: A context integrated residual pyramid network for stereo matching. arXiv preprint arXiv:1803.05196, 2018.

[30] T. Taniai, Y. Matsushita, and T. Naemura. Graph cut based continuous stereo matching using locally shared labels, 2014.

[31] D. Tarboton. Terrain analysis using digital elevation models in hydrology, 2003.

[32] N. van der Zon. Kwaliteitsdocument AHN2 (Dutch). Actueel Hoogtebestand Nederland, 2013.

[33] M. Vermeer. Large-scale efficient extraction of 3D roof segments from aerial stereo imagery, 2018.

[34] Q. Yang. Stereo matching using tree filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):834–846, April 2015.

[35] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. CoRR, abs/1409.4326, 2014.

[36] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. CoRR, abs/1510.05970, 2015.

A. Rectification of image pairs

An image pair is rectified if the image planes are aligned such that the epipolar lines are parallel, the epipoles are at infinity and the epipolar lines are parallel to one of the image axes (often the horizontal axis) [5].

One way to align both image planes is to project both images on the same image plane. Then, to ensure that the epipoles are at infinity, the image plane must be parallel to the line passing through both optical centers (the baseline). Finally, to have the epipolar lines parallel to one of the image axes, the coordinate systems for the images must be chosen to be in the image plane.

There are an infinite number of planes parallel to the line passing through the epipoles of the images. Only the orientation of the plane is of importance, since the distance between the line and the image plane is a matter of scaling. The orientation can simply be imposed, but the orientation which results in minimum distortion is preferable.

In this research, rectification is done with the algorithm from Fusiello et al. [7].
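As an aside, a comparable planar rectification can be sketched with OpenCV; this is not the Fusiello et al. algorithm used in this research, and all parameter values below are placeholders rather than values from the stereo10 metadata.

```python
import cv2
import numpy as np

# Placeholder camera parameters: intrinsics K1/K2, distortion dist1/dist2,
# relative rotation R and translation T, image size (w, h).
K1 = K2 = np.array([[1000.0, 0.0, 512.0], [0.0, 1000.0, 512.0], [0.0, 0.0, 1.0]])
dist1 = dist2 = np.zeros(5)
R, T = np.eye(3), np.array([1.0, 0.0, 0.0])
w, h = 1024, 1024

R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, (w, h), R, T)
map1x, map1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, (w, h), cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, (w, h), cv2.CV_32FC1)
# left_rect = cv2.remap(left_image, map1x, map1y, cv2.INTER_LINEAR)
# right_rect = cv2.remap(right_image, map2x, map2y, cv2.INTER_LINEAR)
```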

B. Figures and Tables


(a) Non-rectified image pair (b) Rectified image pair

Figure 7. Examples of non-rectified and rectified image pairs extracted from the stereo10 aerial imagery dataset.

Table 4. N-pixel errors and KITTI error of the evaluated disparities for the three experiments (Section 5.1). The pixels are considered to be in the foreground (fg) if inside a building according to the mask, otherwise background (bg). For comparison with the benchmark (SGM), pixels are also separated according to whether they are in the benchmark or not.

                           > 1 px               > 2 px               > 3 px               KITTI error
                           all    fg     bg     all    fg     bg     all    fg     bg     all    fg     bg
Exp. (I)     all           0.390  0.425  0.385  0.305  0.288  0.307  0.265  0.227  0.271  0.340  0.297  0.346
             in SGM        0.118  0.159  0.112  0.067  0.072  0.066  0.054  0.049  0.054  0.108  0.090  0.111
             not in SGM    0.674  0.651  0.677  0.552  0.473  0.566  0.485  0.380  0.503  0.581  0.473  0.599
Exp. (II)    all           0.411  0.466  0.403  0.322  0.323  0.321  0.279  0.255  0.283  0.363  0.333  0.368
             in SGM        0.133  0.193  0.125  0.074  0.086  0.072  0.058  0.056  0.058  0.125  0.109  0.127
             not in SGM    0.701  0.700  0.701  0.580  0.525  0.589  0.510  0.424  0.524  0.611  0.525  0.625
Exp. (III)   all           0.379  0.411  0.374  0.297  0.278  0.300  0.259  0.220  0.265  0.328  0.283  0.335
             in SGM        0.109  0.148  0.104  0.064  0.066  0.063  0.051  0.046  0.052  0.099  0.080  0.101
             not in SGM    0.661  0.636  0.665  0.541  0.460  0.555  0.476  0.370  0.493  0.567  0.457  0.585


Figure 8. Examples of reference image (left) with sparse disparity map (middle) and building mask (right). The ground-truth disparities are extracted from the AHN. In the disparity map, white indicates a high disparity, black a low disparity and light purple indicates there is no data. The building masks are extracted from the BAG. In the mask, white pixels are buildings, black pixels are not.


(a) SGM (b) Experiment (I) (c) Experiment (II) (d) Experiment (III)

Figure 9. Two examples of evaluated disparities for the three experiments (Section 5.1), with SGM for comparison. The upper left image is the reference image. The upper row (excluding the reference image) contains the evaluated disparity maps. White indicates high disparity, black low disparity and light purple no predicted disparity. The lower row shows the difference between the evaluated disparities and the ground truth, where blue indicates that the evaluated disparity is greater than the ground truth, red vice versa.



(a) Intermediate results (b) Experiment (II) (c) Experiment (III)

Figure 10. Two examples of predicted masks for the three experiments (Section 5.1), with the intermediate results. The upper left image is the reference image. The upper row (excluding the reference image) contains the predicted masks. The lower row shows the difference between the predicted masks and the ground truth, where green indicates true positives, red false positives, blue false negatives and black true negatives. The intermediate results are generated after training for 5 epochs with the mask loss only.


Table 5. Layer settings of the PSM network architecture proposed by [3].

Name          Layer setting                                            Output dimension
Input                                                                  H x W x 3
CNN
  conv0 1     3 x 3, 32                                                1/2 H x 1/2 W x 32
  conv0 2     3 x 3, 32                                                1/2 H x 1/2 W x 32
  conv0 3     3 x 3, 32                                                1/2 H x 1/2 W x 32
  conv1 x     [3 x 3, 32; 3 x 3, 32] x 3                               1/2 H x 1/2 W x 32
  conv2 x     [3 x 3, 64; 3 x 3, 64] x 3                               1/4 H x 1/4 W x 64
  conv3 x     [3 x 3, 128; 3 x 3, 128] x 3, dila = 2                   1/4 H x 1/4 W x 128
  conv4 x     [3 x 3, 128; 3 x 3, 128] x 3, dila = 4                   1/4 H x 1/4 W x 128
SPP module
  branch 1    64 x 64 avg. pool, 3 x 3, 32, bilinear interpolation     1/4 H x 1/4 W x 32
  branch 2    32 x 32 avg. pool, 3 x 3, 32, bilinear interpolation     1/4 H x 1/4 W x 32
  branch 3    16 x 16 avg. pool, 3 x 3, 32, bilinear interpolation     1/4 H x 1/4 W x 32
  branch 4    8 x 8 avg. pool, 3 x 3, 32, bilinear interpolation       1/4 H x 1/4 W x 32
  concat      [conv2 16, conv4 3, branch 1, branch 2, branch 3, branch 4]   1/4 H x 1/4 W x 320
  fusion      3 x 3, 128; 1 x 1, 32                                    1/4 H x 1/4 W x 32
Cost volume   Concat left and shifted right                            1/4 D x 1/4 H x 1/4 W x 64
3D CNN (stacked hourglass)
  3Dconv0     [3 x 3 x 3, 32; 3 x 3 x 3, 32]                           1/4 D x 1/4 H x 1/4 W x 32
  3Dconv1     [3 x 3 x 3, 32; 3 x 3 x 3, 32]                           1/4 D x 1/4 H x 1/4 W x 32
  3Dstack1 1  [3 x 3 x 3, 64; 3 x 3 x 3, 64]                           1/8 D x 1/8 H x 1/8 W x 64
  3Dstack1 2  [3 x 3 x 3, 64; 3 x 3 x 3, 64]                           1/16 D x 1/16 H x 1/16 W x 64
  3Dstack1 3  deconv 3 x 3 x 3, 64, add 3Dstack1 1                     1/8 D x 1/8 H x 1/8 W x 64
  3Dstack1 4  deconv 3 x 3 x 3, 32, add 3Dconv1                        1/4 D x 1/4 H x 1/4 W x 32
  3Dstack2 1  [3 x 3 x 3, 64; 3 x 3 x 3, 64], add 3Dstack1 3           1/8 D x 1/8 H x 1/8 W x 64
  3Dstack2 2  [3 x 3 x 3, 64; 3 x 3 x 3, 64]                           1/16 D x 1/16 H x 1/16 W x 64
  3Dstack2 3  deconv 3 x 3 x 3, 64, add 3Dstack1 1                     1/4 D x 1/4 H x 1/4 W x 32
  3Dstack2 4  deconv 3 x 3 x 3, 32, add 3Dconv1                        1/4 D x 1/4 H x 1/4 W x 32
  3Dstack3 1  [3 x 3 x 3, 64; 3 x 3 x 3, 64], add 3Dstack2 3           1/8 D x 1/8 H x 1/8 W x 64
  3Dstack3 2  [3 x 3 x 3, 64; 3 x 3 x 3, 64]                           1/16 D x 1/16 H x 1/16 W x 64
  3Dstack3 3  deconv 3 x 3 x 3, 64, add 3Dstack1 1                     1/4 D x 1/4 H x 1/4 W x 32
  3Dstack3 4  deconv 3 x 3 x 3, 32, add 3Dconv1                        1/4 D x 1/4 H x 1/4 W x 32
  output 1    [3 x 3 x 3, 32; 3 x 3 x 3, 1]                            1/4 D x 1/4 H x 1/4 W x 1
  output 2    [3 x 3 x 3, 32; 3 x 3 x 3, 1], add output 1              1/4 D x 1/4 H x 1/4 W x 1
  output 3    [3 x 3 x 3, 32; 3 x 3 x 3, 1], add output 2              1/4 D x 1/4 H x 1/4 W x 1
Output
  output      [output 1, output 2, output 3]
  upsampling  Bilinear interpolation                                   D x H x W
