
True Multi-View Matching for Aerial Imagery

Author: Kyllian BROERS
Supervisor: Sven BRIELS MSc.

Co-Supervisor & Examiner:

Dr. Thomas MENSINK

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in the Computer Vision Group, Informatics Institute


Declaration of Authorship

I, Kyllian BROERS, declare that this thesis titled, “True Multi-View Matching for Aerial Imagery” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:


UNIVERSITY OF AMSTERDAM

Abstract

Faculty of Science, Informatics Institute

Master of Science

True Multi-View Matching for Aerial Imagery by Kyllian BROERS

This research addresses the problem of digital surface model (DSM) prediction from aerial photos. DSMs help to extract data such as roofs, building volumes and other data regarding urban planning and development. The proposed multi-view DSM predicting Deep Height Matching Network (DHMNet) works by projecting images back to world coordinates. This creates rasters over evenly spaced depth planes. This method allows for aggregation over a varying number of image features in an earlier stage of a matching network. This enables constructing DSM maps using multi-view matching without the use of a reference image and with a varying number of input images, making the method end-to-end without any need for pre- and post-processing. Other methods solve this by combining multiple binocular depth maps or by creating voxel volumes to create a 3D model. Results show increasing accuracy for depth predictions when using more images. Compared to the recent binocular Pyramid Stereo Matching Network (PSMNet), the error is around 50% higher. Yet efficiency increases from O(n!/(n−2)!) time complexity to linear O(n) over the number of input images. Also, DSM coverage increases to 100% by design of the network. More research into input-size invariant aggregation methods could further improve the accuracy of the network.


Acknowledgements

completely unfamiliar at the beginning of this thesis. My UvA supervisor, Thomas Mensink, I would like to thank for always being encouraging and patient while guiding me through the problems and challenges I faced. Furthermore I would like to thank my thesis coaching group headed by Arnoud Visser.


Contents

Declaration of Authorship
Abstract
Acknowledgements

1 Introduction
  1.1 Motivation for true multi-view stereo matching
  1.2 Objectives & research questions
  1.3 Thesis outline

2 Background on aerial photography and height reconstruction
  2.1 Aerial image characteristics
    2.1.1 Nadir & oblique
    2.1.2 Radial displacement, occlusion & scale
    2.1.3 Static & dynamic objects
    2.1.4 Texture and shading
  2.2 Image projection
    2.2.1 Inverse projection to a reference plane
    2.2.2 Plane sweeping
    2.2.3 Height resolution
    2.2.4 Digital surface model
    2.2.5 Accuracy
  2.3 Conclusion

3 Related work
  3.1 Data driven stereo matching
  3.2 Feature extraction
  3.3 (True) multi-view matching
  3.4 Cost volume generation
  3.5 (Multi) feature aggregation
  3.6 Cost volume regularization/refinement
  3.7 Depth prediction and loss
    3.7.1 Voxel methods
  3.8 Conclusion

4 Method
  4.1 Deep height matching network: true multi-view matching for aerial imagery
  4.2 Inverse projection
  4.3 Aerial images to DSM
  4.4 Network architecture
  4.5 Image features
  4.6 Feature fusion and 3D regularizer
    4.6.1 Up-sampling
  4.7 Depth reduction
  4.8 PSMNet

5 Implementation & experiments
  5.1 Aerial imagery and ground truth height data
  5.2 Training
    5.2.1 Training PSMNet
  5.3 Testing
    5.3.1 Depth planes
    5.3.2 Disparity map to DSM
  5.4 Experiments
    5.4.1 Effect of world based DSM prediction
    5.4.2 Effect of multi-view matching
    5.4.3 Qualitative evaluation of DHMNet compared to PSMNet

6 Conclusion, discussion & future work
  6.1 Conclusion
    6.1.1 Research questions
  6.2 Discussion & future work
    6.2.1 Improve feature fusion
    6.2.2 Improve memory efficiency


List of Figures

1.1 True multi-view matching for aerial images
1.2 Stereo matching
2.1 Flight path and overlap
2.2 Nadir and Oblique
2.3 Orthorectified image compared to perspective image
2.4 World to camera coordinates
2.5 Camera to image coordinates
2.6 Inverse projection
2.7 Projection distance
2.8 Height resolutions
3.1 Cost volume based stereo matching
3.2 Levels of feature encoding
3.3 PSMNet: spatial pyramid pooling and stacked hourglass
3.4 Inverse projecting to depth planes
4.1 DHMNet
4.2 Test set overview
4.3 Principal point update
5.1 Examples of tiles with missing objects and occlusion
5.2 Box plot of DHMNet_concat
5.3 Examples of DHMNet_concat with 2 images
5.4 Error per gradient for DHMNet_concat
5.5 Error per height for DHMNet_concat
5.6 Error per confidence for DHMNet_concat
5.7 Box plot of DHMNet_var_mean
5.8 Percentage of points outside error margin
5.9 Examples of DHMNet_concat compared to PSMNet


List of Tables

2.1 Max amount of overlapping images
4.1 DHMNet data dimensions
5.1 Percentage of points outside error margin
5.2 MAE and std over test set
5.3 Average coverage per DSM


List of Abbreviations

CNN     Convolutional Neural Network
DHMNet  Deep Height Matching Network
DSM     Digital Surface Model
GSD     Ground Sample Distance
LiDAR   Light Detection And Ranging
MVM     Multi-View Matching
PSMNet  Pyramid Stereo Matching Network
RGB     Red, Green and Blue
SHG     Stacked Hour Glass
SPP     Spatial Pyramid Pooling
WTA     Winner Takes All


1 Introduction

FIGURE 1.1: Using inverse projection and multi-view matching for depth reconstruction from aerial images. (A) Aerial images (B) Inverse projection (C) Depth reconstruction

Aerial photography captures images of the world in a top-down manner. Yet, when capturing a 3D scene in a 2D image, a dimension is lost. This is the depth of the view, which in this case is the height of structures. This is problematic for spatial analysis applications, e.g. property management, infrastructure design or urban planning (Maune, 2007), where height is an essential factor. This research will focus on reconstructing the height of the captured world scenes as a digital surface model from aerial images.

Using images for 3D reconstruction through stereo matching has been widely researched: first using traditional algorithmic techniques from image processing (Scharstein and Szeliski, 2002), and since 2015 enhanced with data-driven deep learning techniques (Zbontar and Lecun, 2015). Height can be reconstructed if a 3D world point is found in two or more images. In that case triangulation can be used to compute the 3D coordinate from the 2D coordinates in the images, provided the camera parameters and the cameras' positions relative to each other are known.

The challenge is in finding the same points from the images alone, known as the correspondence problem, as their placement depends on the unknown position of the 3D point relative to the camera. Image matching is not at all trivial: images can be large, making it hard to exhaustively search between all points; there can be many similar points; or there can be no matching points at all if an image point is not visible in another image, e.g. a shed visible in one image but blocked from view by a house in all others.

Furthermore, for homologous points to be accurately matched, lighting and reflection conditions can play a significant role, as these change the captured color of points in a scene. This can be a problem when photographing real world surfaces in fluctuating lighting conditions. A stereo matching method would thus need to solve these challenges for accurate 3D reconstruction.


This research will use data driven methods to learn these correspondences. In this thesis it is argued that finding these matches is the hardest part of any depth estimation network. While this is possible using only two overlapping images, more images strengthen the quality of data-driven matching. Therefore a model is proposed for true multi-view matching (MVM) and experiments are conducted to show the benefit of this for accurate 3D reconstruction.

1.1 Motivation for true multi-view stereo matching

Height prediction using stereo matching can be done using two or more images. Commonly, the latter is termed multi-view. If multiple views of the same scene are available, this can improve 3D reconstruction, both by providing more image points to improve matching accuracy and by increasing completeness through having more image points to match on, as shown by Okutomi and Kanade (1993).

To distinguish multi-view methods, Collins (1996) states three conditions for true multi-view methods, which "make full and efficient use of the geometric relationships between multiple images and the scene". These conditions are:

1. "the method generalizes to any number of images greater than two"
2. "the algorithmic complexity is O(n) in the number of images, and"
3. "all images are treated equally (i.e. no image is given preferential treatment)"

The first condition separates out methods which only take a fixed number of input images greater than two. These could also be called N-view and cannot process a variable number of images. Sometimes this is deliberately fixed, such as for trinocular stereo matching (Zhou et al., 2015). But with aerial images, the number of overlapping images can vary across the dataset. A model for aerial stereo matching should thus meet condition one to utilize all data for accurate and correct height reconstruction.

The second condition is stated for efficiency: any method which cannot scale linearly with the number of input images will take substantially longer to process. An example of this is the repeated application of N-view matching to incorporate all images, e.g. summing over the sum of squared differences for image pairs (Okutomi and Kanade, 1993). This would make it unusable for on-the-fly processing or become a costly factor when processing large areas such as entire municipalities.

The last condition prevents choosing a reference image. Such a method predicts depth for all points in the reference image (Kendall et al., 2017) and is thus termed image based depth. It uses rectified images so matching becomes a 1D search from a reference image to a target over epipolar lines, see figure 1.2. It will not reconstruct depth for points which are not in the reference image. This is called occlusion and can happen when objects in the scene block each other from view. To overcome occlusion, a method might repeatedly use each image as reference image to minimize the view blocking; this would however break condition two. The other option is to do height prediction from the reference frame of the scene, called object based. This aligns with the digital surface model as the height axis will be equal. These three conditions are therefore important for stereo matching with aerial imagery, and a true multi-view method could solve the difficulties in height prediction.


FIGURE 1.2: Stereo matching image points over lines

1.2 Objectives & research questions

This research will look at techniques to utilize the increasing availability of aerial images to reconstruct terrain and structure height. The main question is: does a true multi-view image matching method result in more efficient, accurate and complete digital surface model reconstruction? To answer this, a method complying with these conditions will be proposed and experiments are conducted to answer the following sub-questions:

• What is the effect of world coordinate based height prediction compared to image based?

• What is the effect of true multi-view matching compared to repeated stereo matching, and will the proposed true multi-view method give comparable quality height reconstructions?

1.3 Thesis outline

After this introduction the thesis will be structured in the following way:

Chapter 2 provides all necessary background on aerial photography and remote sensing techniques used for capturing and processing aerial photos.

Chapter 3 gives an overview of the related work on stereo and multi-view matching using data driven deep learning techniques.

Chapter 4 describes the proposed true multi-view matching model.

Chapter 5 states the implementation details used for experimentation, the results of the experiments and an analysis of the findings, relating them to the research questions.

Finally, chapter 6 concludes this thesis by answering the main research question based on the theory and experimental results. Furthermore, other findings of this research are discussed and suggestions are made for future research.


2 Background on aerial photography and height reconstruction

Aerial images can be taken by any airborne vehicle such as planes, helicopters or Unmanned Aerial Vehicles. These fly a predetermined flight plan to cover a particular area. This path usually consists of parallel flight lines going back and forth over the area to be captured. The photos are taken along the flight line at such a frequency that they overlap lengthwise, and flight lines are flown close enough to adjoining flight lines to overlap crosswise, see figure 2.1. Overlaps of 80% along the lines and 60% over the sides are not uncommon, making seven images overlap (table 2.1). Mostly, the overlap along the line is larger, as it only depends on the frequency at which the camera can shoot and the speed of the vehicle, whereas overlap on the side requires more flight lines, which is more costly. The overlap is used for matching to obtain height models. A larger overlap results in more images overlapping the same part of a scene, which in turn can be used by stereo matching methods to reduce matching ambiguity and increase visibility of all the points in the scene. This will lead to better height reconstruction of the photographed scenes.

Height models have a multitude of applications. Depending on the scene they capture, these applications can range from nature oriented, e.g. geomorphology, to man-made oriented, e.g. solar power potential on rooftops (Vermeer, 2018), or mixed, e.g. infrastructure planning. Furthermore, a height model can be used to geometrically correct perspective errors in aerial images to create an orthorectified image, see the orthographic view in figure 2.3. Such an image displays objects at their true location, which is required for spatial analyses.

This research will propose a method to use aerial imagery for height reconstruction. Height could be measured with height sensors such as LiDAR, RADAR or SONAR, but the availability of aerial images, their applications beyond height reconstruction, and the lower price and higher resolution of cameras compared to active sensors make imagery an attractive way to do height reconstruction. Height measurements from LiDAR are used only for accuracy assessment of DSMs. Therefore, this chapter will provide a background on the input aerial images and the output digital surface model. First, the basics of projecting a 3D scene on a 2D image are explained. Next are two topics: a clarification on how aerial images are captured and an analysis of the resulting visual effects due to projection. This is followed by the theory on inverse projection to 3D. Next, height models are discussed, in particular the desired digital surface model. The chapter is concluded with a summary of other remote sensing techniques used for 3D reconstruction and height modelling.

FIGURE 2.1: Flight path and overlap of aerial images

                       overlap
side overlap     <=50%    60%    80%
<=50%              3       4      6
60%                4       5      7
80%                6       7      9

TABLE 2.1: Max amount of overlapping images

2.1 Aerial image characteristics

2.1.1 Nadir & oblique

Aerial photography captures scenes from top down. Depending on the angle towards the scene, images can be divided in two types: nadir and oblique.

An aerial image is termed nadir if the optical axis is in theory orthogonal to the earth's surface and parallel to the plumb line from the camera to the earth (straight down in figure 2.2). However, as this is usually not practically feasible, any image with an optical axis approximating this within some degrees is also considered nadir. Yet it must be stated that the terms nadir line or plumb line and optical axis will not be used interchangeably. For urban areas this results in images mostly covering tops of structures and only some sides, because the planes taking these images fly at such an altitude that the angle to the structures is small.

When there is a larger angle α between the optical axis (P) and the plumb line (N), see figure 2.2, the image is termed oblique. Oblique imagery with an angle between 30 and 60 degrees is most common, as at this angle any structures in the scene will have both their facades and roofs clearly visible.

It thus depends on the application which type to use. When building a 3D model of a house, facades are important for a realistic reconstruction. However, this research is on constructing height models of larger urban areas, where the focus is on the highest points of structures. Therefore, only nadir imagery will be used and any reference to aerial imagery can be assumed to be nadir unless otherwise specified.


FIGURE 2.2: Nadir and oblique aerial imagery

FIGURE 2.3: Orthorectified image compared to perspective image. Panel labels: orthographic view, perspective view, image plane

2.1.2 Radial displacement, occlusion & scale

When capturing aerial images, points at ground level farther away from the nadir line have a larger angle to the lens. This angle also increases for any point above the ground point in a line parallel to the plumb line. Due to this, the base of a structure has a smaller angle to the nadir line compared to the top. The result is the top being projected farther away from the principal point on the image. This effect is termed radial displacement, as it occurs in circles diverging from the nadir ground point. Figure 2.3 shows in black how the sides of the structure move away from the center towards the blue top. When rectified, this results in the orthographic view of figure 2.3. This effect can also cause objects to be displaced in such a manner that they block other objects from view. These blocked objects will therefore not be visible in the image; this is termed occlusion. Occlusion is problematic for stereo matching as it can lead to points not being visible in other images for matching.

A third aspect of aerial imagery is scale. While the earth's curvature is negligible at the size of the images, differences in terrain and building height can cause differences in scale, as higher points will be closer to the camera and thus appear larger. This influences the Ground Sample Distance (GSD): the area which is captured by one pixel in the image. This can influence matching, as an object at a larger GSD is captured by fewer pixels than the same object at a smaller GSD. Any stereo matching method dealing with significant height, and thus also scale, differences would need to address this to avoid missing matches.

2.1.3 Static & dynamic objects

Aerial images contain a mix of nature, e.g. water, trees and vegetation, and man-made structures, e.g. buildings, roads and bridges. These are static, in that they do not move by themselves. However, objects such as cars, humans and animals can also be captured. As these objects can move by themselves, they can appear in different places in multiple images. This makes matching these points difficult, as triangulation assumes points to be static in the world. These objects will need to be removed from the images or dealt with accordingly.

2.1.4 Texture and shading

The final two characteristics to be discussed are texture and shading. Texture is the way an object reflects light over a larger surface. This can range from smooth (uniform, homogeneous) to rough (coarse, heterogeneous). However, this is subject to scale: a surface that seems smooth at a smaller scale may seem rough when captured in more detail. A smooth surface might be more difficult to match points over due to self-similarity. To avoid long shadows, aerial data is usually captured within two hours of solar noon. This is because shadows cast by objects might obscure other objects, making it harder to match points on them. On the contrary, some single image 3D reconstruction methods use these shadows as height cues.

All these characteristics need to be taken into account for accurate and complete stereo matching; the method proposed in this research will learn to deal with them. The radial displacement effect, however, can be utilized for stereo matching by exploiting that higher points get displaced more. The distance between matching points is essentially the displacement of the point relative to the camera axes combined with the different locations of the cameras.

2.2 Image projection

When capturing a photo, a 3D scene gets projected onto a 2D plane; this is called image projection. The mathematical principles behind this projection follow the pinhole camera model. This describes the geometric relation between 3D points in a scene and their projection onto a 2D image plane.

This relation consists of two parts: the relation between the reference frame of the scene and the reference frame of the camera, known as the extrinsic parameters, and the relation between the camera reference frame and the image plane, known as the intrinsic parameters.


FIGURE 2.4: World to camera coordinates

FIGURE 2.5: Camera to image coordinates

Starting from a point in the world P_w = (x_w, y_w, z_w), the projection into the camera reference frame P_cam = (x_cam, y_cam, z_cam) can be seen in figure 2.4 and is obtained by:

$$\begin{bmatrix} x_{cam} \\ y_{cam} \\ z_{cam} \end{bmatrix} = \left[ R \,|\, t \right] \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \tag{2.1}$$

where the camera extrinsics R and t are used on the homogeneous coordinates of P_w.

The camera point P_cam is projected to the image plane by:

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} -f & 0 & v_0 \\ 0 & -f & u_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_{cam} \\ y_{cam} \\ z_{cam} \end{bmatrix} = K \begin{bmatrix} x_{cam} \\ y_{cam} \\ z_{cam} \end{bmatrix} \tag{2.2}$$

where a point moves through the optical centre C to the plane, see figure 2.5, at a distance of the focal length f, shifted by the location of the principal point pp = (v_0, u_0), to obtain the image coordinate P_im = (u, v) up to a scaling factor λ.

Combining equations 2.1 and 2.2 gives the projection matrix PM:

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left[ R \,|\, t \right] \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} = PM \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix} \tag{2.3}$$
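For concreteness, a minimal Python sketch of this forward projection (the function name and argument conventions are illustrative assumptions, not code from the thesis):

```python
import numpy as np

def project(K, R, t, P_w):
    """Forward projection of a 3D world point to image coordinates (eq. 2.3).

    K: 3x3 intrinsic matrix of equation 2.2; R: 3x3 rotation and t: (3,)
    translation of equation 2.1; P_w: 3D world point (x_w, y_w, z_w).
    """
    PM = K @ np.hstack([R, t.reshape(3, 1)])  # projection matrix K[R|t]
    uvw = PM @ np.append(P_w, 1.0)            # homogeneous world coordinates
    return uvw[:2] / uvw[2]                   # divide out the scale factor lambda
```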

2.2.1 Inverse projection to a reference plane

Inverse projection is the projection of an image point back to the object reference frame, thus being the inverse of the forward projection described in section 2.2.


FIGURE 2.6: Inverse projection

As described, the intrinsic transformation projects a point through the optical center C, as can be seen in figure 2.5. Back projection thus projects an image point back along that line, as the exact height is not known. The next operation transforms that line from the camera reference frame to the object reference frame, being the inverse of figure 2.4.

Using another line, inverse projected from a homologous point on a different image to the same object reference frame, height can be determined from the intersection of the lines, see figure 2.6. This can be demonstrated by reverting the inverse projections to forward projections: a 3D point P in object space reflects light c onto an image plane at point p with color value image(p) = c and projection matrix PM: PM × P = p. The same goes for projection to the other image: PM′ × P = p′ and image′(p′) = c, provided both images have equal visibility and lighting conditions. Thus, homologous image points have the same value, which stereo matching can use to find similar points for triangulating height.

2.2.2 Plane sweeping

Inverse projection can also be done for all the points in the image plane to a common reference plane at once. Such a plane can be placed at different heights along the z axis. For each height it can then be tested whether the plane contains any intersecting inverse projected lines. This technique is known as plane sweeping and, with the availability of GPUs, it can be done efficiently in the following manner.

First a coordinate grid is created. This is a discretized raster over the reference plane. Each raster cell contains the x, y, z coordinates of its center. This creates a 3D matrix of Height × Width × 3, which a GPU can efficiently handle. Next, forward projection is used to compute the corresponding image points for all cells. This is done for each image. Per cell, it can then be determined whether the image points match. If so, it is the intersection of their lines and the coordinates of the cell are the coordinates of the 3D point.
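A minimal numpy sketch of this grid construction and batched projection (names and shapes are illustrative; in practice the same operations run as GPU matrix products):

```python
import numpy as np

def plane_grid(x_min, y_min, res, W, H, z):
    """Raster of cell-centre world coordinates for one reference plane at
    height z, as an H x W x 3 matrix."""
    xs = x_min + res * (np.arange(W) + 0.5)
    ys = y_min + res * (np.arange(H) + 0.5)
    gx, gy = np.meshgrid(xs, ys)                              # each H x W
    return np.stack([gx, gy, np.full_like(gx, z)], axis=-1)   # H x W x 3

def project_grid(PM, grid):
    """Forward project all cell centres with the 3x4 projection matrix PM,
    returning H x W x 2 image coordinates: one matrix product per plane."""
    H, W, _ = grid.shape
    pts = np.concatenate([grid.reshape(-1, 3), np.ones((H * W, 1))], axis=1)
    uvw = pts @ PM.T
    return (uvw[:, :2] / uvw[:, 2:3]).reshape(H, W, 2)
```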

FIGURE 2.7: Projection distance to nadir line on reference planes

FIGURE 2.8: Camera ray through raster cells of inverse projected planes at different height resolutions

Plane sweeping is especially useful on aerial images whose camera reference frame's x, y plane is close to parallel with the world x, y plane, as the effect of the radial displacement will scale proportionate to the height, see figure 2.8.

As the z axis of the camera points away from the lens, it will be the negative of the world z axis, which points up. This way the z axes of both the camera and the world are close to parallel to each other along the inverse projected planes over which heights are predicted, which gives the best matches as shown by Gallup et al. (2007). These properties make that projection between world and image coordinates results in less distortion and thus similar shape and structure of objects. This is beneficial as pixels will be equally spread over object shapes.

2.2.3 Height resolution

As mentioned in the previous paragraph, reference planes have to be at the correct heights to project homologous image points to the correct 3D point in a plane. However, since heights are continuous, a plane cannot be placed at all heights; therefore the height also needs to be discretized. Furthermore, the correct heights are not known, so a range of heights needs to be tested. This is usually done at evenly spaced or at inverse intervals.

Figure 2.7 shows planes at various heights where, due to the angle α, the line through the lens inversely projects at different distances d from the nadir line. Only the second plane from the top shows the corner of the building at the correct distance, and this is thus the correct height.

When choosing heights, the distance between them needs to be chosen carefully. Figure 2.8 shows four height planes, each divided in evenly spaced cells. Going too small (1) might result in a point being projected to the same coordinates at consecutive heights, making it harder for the model to find the correct one. Going larger and a point might miss the correct height (3).


2.2.4 Digital surface model

Height can be modelled for the earth's surface in different ways. Most common are either a collection of known points called a point cloud, a mesh of triangles creating a surface, or a raster of evenly spaced cells. The last is often used for stereo matching, as aerial images are already in a raster format. When this raster follows the world x, y coordinates this is termed a Digital Surface Model. The value of each cell denotes the highest point on the surface, including man-made structures such as buildings (Hirt, 2014). As stated, the scope of this research is urban areas. While there are many applications of DSMs for non-urban areas, these will not be considered here.

A DSM is considered 2.5D due to its 2D representation with one height value per raster cell. Yet sides of structures can be visible on aerial images due to radial displacement, again see the black parts of figure 2.3, so multiple points along the same height dimension are visible. This can result in multiple height planes having some images matching on a point. When creating a voxel grid, this would leave multiple matching voxels over the different planes. These can be found by selecting a threshold for a match. However, when creating a DSM a model should select the highest matching point, per the definition of a DSM, which maps the heights along a surface.

This could either be utilized to generate a 3D model also containing sides of objects, interpreting the planes and rasters as a voxel grid. Yet for the application of DSM creation this has to be dealt with: the height of interest is the highest matching height. In practice, however, the visibility and lighting conditions are not the same in each image, and simple color values are not very descriptive, as similar colors appear many times in the images, especially for areas of the same material.

2.2.5 Accuracy

The quality of a DSM is a measure of how accurate the elevation is at each pixel, i.e. absolute accuracy, and how accurately the morphology is represented, i.e. relative accuracy. The first can be evaluated by computing the error between the reconstructed heights and a set of control measurements. The second is less straightforward and depends on a subjective evaluation.

2.3 Conclusion

Aerial images and derived products have many applications, but height is lost due to projection to 2D. To reconstruct it, inverse projection using plane sweeping over a range of heights can be used. This makes use of the radial displacement to do stereo matching. The desired output is a digital surface model. The following chapter will go into detail on how this is done in related research.


3 Related work

3D reconstruction from images is mostly done using stereo matching, the technique of matching homologous image points over images. These matching points, combined with known camera parameters, can be used to compute depth. A much used traditional algorithm for this is Semi-Global Matching (Hirschmüller, 2005). This method performs pixel-wise matching over two images and approximates a 2D smoothness constraint by several 1D paths. Since 2015, data-driven methods for stereo matching have become popular, due both to the availability of large stereo datasets with ground truth needed for supervised data-driven methods, e.g. Mayer et al. (2016), Scharstein, Hirschmüller, et al. (2014), and Menze and Geiger (2015), and to improved results compared to traditional methods. Data-driven methods produce dense depth reconstructions, compared to sparse methods such as Semi-Global Matching, and show to be more robust to varying input quality due to the high expressive power of deep features. However, the focus in research is mostly on depth prediction for single objects, indoor or small scale outdoor scenes (Furukawa and Hernández, 2015), e.g. single rooms or buildings, and in the automotive industry, depth vision for (semi) self-driving vehicles (Zbontar and Lecun, 2015).

A challenge for stereo matching for aerial imagery is the scarcity of such open source datasets or benchmarks actively used to compare methods. While there are aerial image datasets available, most do not have ground truth height data. Another option is unsupervised learning: e.g. G. Yang et al. (2018), Batsos, Cai, and Mordohai (2018) and Khot et al. (2019) use the depth maps to project to the input images and train on photometric consistency. Yet the most successful models on standardised benchmarks are supervised end-to-end models. This section will give an overview of different models for stereo matching and their use for height prediction from aerial images.

3.1 Data driven stereo matching

The first end-to-end methods are based on intuitions from Convolutional Neural Networks (CNNs) for image encoding (Zbontar and Lecun, 2015) to match image patches. To find the disparities, i.e. distances between image points, between all points, Dosovitskiy et al. (2015) concatenate images and encode them. A more advanced version of this method first uses some encoding layers on the images separately with shared weights, followed by correlation instead. GC-Net (Kendall et al., 2017) then used this with the structure of traditional stereo matching methods. By rectifying images to have the same baseline, it shifts one image with respect to the base image over one dimension and correlates these. This creates a cost volume where the added dimension contains the disparities. Following traditional methods, this volume is spatially refined afterwards. Utilizing the traditional cost volume comes at the cost of more parameters and a longer run time.

FIGURE 3.1: Pipeline for cost volume based stereo matching: 2D feature extraction, feature aggregation into a cost volume, cost volume optimization, soft argmax, depth map

Early stereo matching networks for depth estimation are thus built on one of the following principles:

• Feature correlation
  – Encodes similarity into the feature channels
  – Faster run-time, but the real geometric context is lost

• 3D cost volume
  – Similarity costs as a third dimension
  – Slower run-time, but the real geometric context is maintained

While fast and competitive, feature correlation methods such as Mayer et al. (2016) are less accurate than hand designed pipelines following the traditional cost volume approach on popular benchmarks such as ETH3D (Schops et al., 2017), KITTI 2012 & 2015 (Geiger, Philip Lenz, and Urtasun, 2012; Geiger, Lenz, et al., 2015), SUN3D (Xiao, Owens, and Torralba, 2013) or Middlebury (Seitz et al., 2006). These pipelines use a similar structure generally consisting of four parts, see also figure 3.1:

1. 2D feature extraction
2. Feature aggregation into a cost volume
3. Cost volume optimization
4. Differentiable Winner-Takes-All (WTA): soft-argmax

As this has been shown to be a good structure, the proposed model will also use a cost volume based approach.

3.2 Feature extraction

Stereo matching uses the matching cost between image descriptors to determine correspondences. In its most basic form such a descriptor is the RGB color value of a single pixel. However, such a descriptor is not very discriminative. To better compare and determine matching between image points, more discriminative features can be extracted. However, these need to be invariant or partially invariant to the following (Guo, 2018):

1. translation
2. scale
3. rotation
4. 3D camera view point

FIGURE 3.2: Different levels of feature encoding (from Y. LeCun's deep learning slides)

FIGURE 3.3: Feature encoding and cost volume regularization from Chang and Y.-S. Chen (2018). (A) Spatial pyramid pooling (B) Stacked hourglass

These factors can vary over different images. A non-learned example of a feature adhering to the first three is a SIFT (Lowe, 1999) keypoint, constructed from a histogram of oriented gradients as a feature vector. However, dense learned features have been shown to be more useful for computer vision tasks, as shown by comprehensive experiments of Bengio, Courville, and Vincent (2013) and Penatti, Nogueira, and Dos Santos (2015), in domains such as image recognition (Sharif Razavian et al., 2014) and image segmentation (Ronneberger, Fischer, and Brox, 2015). Using CNNs, an encoder network learns kernels to recognize image features, figure 3.2. These range from low-level features in the first layers of the network, e.g. to recognize edges, to high-level features for specific objects (Zeiler and Fergus, 2014).

The first version of MC-CNN (Zbontar and Lecun, 2015) uses only one convolutional layer to create features of 32 channels; the later MC-CNN-fst and MC-CNN-acrt (Zbontar and LeCun, 2016) use multiple layers with non-linear activation functions. FlowNetCorr (Dosovitskiy et al., 2015) uses three convolutional layers, each with a pooling layer followed by a normalization layer. GC-Net (Kendall et al., 2017) adds skip connections to add earlier layers to later ones. Other networks propose a multitude of combinations of all these layers and skip connections.

A feature encoder for scale invariant encoding is the Spatial Pyramid Pooling (SPP) network (He et al., 2015). This uses various pooling sizes to extract features at different scales, figure 3.3a. It has been used successfully for stereo matching by Chang and Y.-S. Chen (2018) and Im et al. (2019), and is also incorporated in DeepLabv3+ (L. C. Chen et al., 2018).

While ablation studies do show the effectiveness of learned feature encoding, there are no comparative studies solely on feature encoding for stereo matching. As the SPP network has shown good results in multiple fields requiring image encoding, this research will adopt it as well.


3.3 (True) multi-view matching

As mentioned in the introduction, Collins (1996) states three conditions for true MVM: generalizing to any number of images, maintaining linear complexity in the number of input images and treating all images equally. While all multi-view networks, by definition, adhere to the first condition, the other two conditions are usually not met. This might be because they are either not deemed relevant to the problem domain or not considered a condition for the model to be termed multi-view. Some networks process multiple input views by repeatedly matching image pairs (Im et al., 2019; Choi, Kihong, and Sohn, 2018; Huang et al., 2018; Khot et al., 2019; Hou, Solin, and Kannala, 2019; Yao, Z. Luo, Li, Fang, et al., 2018; Yao, Z. Luo, Li, Shen, et al., 2019). These methods choose a particular reference view to predict disparities for. Due to this common reference frame, intermediate results can be aggregated. However, this breaks the third condition as one image gets preferential treatment. This has the downside of limiting the prediction to the visibility in the reference image.

To counter this, multiple reference images can be chosen and combined in one 3D reconstruction. Various techniques exist to choose an optimal set of reference images (Mostegel, Fraundorfer, and Bischof, 2018; Schönberger et al., 2016). Yet, this does not scale well: in the worst case all N images are chosen as reference, resulting in a complexity of O(N!/(N−2)!) total image pairs to be evaluated.

To counter both issues, an algorithm would need to find a multi-linear relation between any number of images while still being linear in complexity. This gives the benefit that, since no reference image is chosen, there is no bound reference frame. This makes it possible to choose the coordinate frame arbitrarily. It is then trivial to choose the object reference frame (Kar, Häne, and Malik, 2017; Choy et al., 2016; B. Yang et al., 2019; Hartmann et al., 2017). These networks inverse project all images to parallel planes in this reference frame and can therefore match all images for each plane. Due to these properties and the benefits of object based depth as discussed in section 2.3, the proposed model will also use inverse projection to the object space with differentiable plane sweeping.

3.4 Cost volume generation

A cost volume traditionally contains the matching cost for all points over a range of disparities or depth planes. Disparities are used when shifting two rectified images over epipolar lines (Dhond and Aggarwal, 1990; Kendall et al., 2017), see figure 1.2. Each shift aligns the image features for matching at one disparity. This results in a volume of the width and height of the reference image and, as third dimension, the range of disparities. When using plane sweeping, this third dimension contains all the depth planes the images are back projected to. This can be from an image reference frame (figure 3.4A), where the planes are parallel to the x, y dimension and range over the z dimension of the camera. Or the planes are from the object based reference frame (figure 3.4B), being parallel to the object x, y dimension and ranging over its z dimension. Notice the different order of the depths z. The main objective of generating a cost volume is to align features in a spatial representation so that the rest of the network can determine matches from these utilizing the geometrical context. As stated, the proposed model will use this as well.

FIGURE 3.4: Inverse projecting to depth planes. (A) Inverse camera based depth (B) Object based depth

3.5 (Multi) feature aggregation

With deep learning methods, matching does not necessarily result in a scalar cost, but can be a feature vector denoting the aggregation of features from multiple images. As the features are aligned in the cost volume, they need to be aggregated. When the number of input features is two, this can easily be done by operations such as the dot product or cross correlation (Zbontar and LeCun, 2016). Paschalidou et al. (2018) use the average inner product to compute the probability of a 3D raster cell being on the surface of the object. However, these can only be used on two image features. Extending this to a fixed number of two or more can make use of feature concatenation (Kendall et al., 2017). Combining a variable number of images, however, requires a mapping which can handle this. Eslami et al. (2018) use summation to aggregate a variable number of images for 3D prediction and Huang et al. (2018) use a max operator. Convolutional layers before and after a mean (Hartmann et al., 2017; Paschalidou et al., 2018) or variance (Yao, Z. Luo, Li, Fang, et al., 2018) operator are such mappings. As these are invariant to the input size and require no extra learned parameters, the proposed model will combine both the mean and variance. These are all hard attentive, without learning which information to preserve. The Learned Stereo Machine (Kar, Häne, and Malik, 2017) uses a 3D-CNN GRU to handle the features as a sequence. 3D-R2N2 (Choy et al., 2016) also makes use of recurrent feature aggregation, only it has an encoder-decoder structure more similar to FlowNetSimple (Dosovitskiy et al., 2015). This does not explicitly exploit the camera pose for geometric reconstruction and its results don't match up to the Learned Stereo Machine. Both report their best results using 3D-CNN-GRUs. However, a sequence is affected by the input order and can become inefficient as the number of input images increases, because it cannot be processed in parallel. Therefore this will not be used for the proposed model.

B. Yang et al. (2019) propose AttSets, a novel method using soft attention to aggregate a variable number of features. This method first computes a learned attention score per feature and uses these to do a weighted sum of the elements, without the use of a gating mechanism as in earlier research on attention based learning by Ilse, Tomczak, and Welling (2018). Soft attention is sequence invariant, efficient and able to learn what is important. Yet as this method is very recent, it has not been incorporated into the model.

3.6 Cost volume regularization/refinement

After features are aggregated at each depth to a volume of F × D × W × H, the aggregated features have to be reduced to a single value to construct a cost volume. A regularizer sub-network can do this for features using their geometrical context. There exist multiple different models to do this, mostly either using 2D convolutions over W × H planes at each depth d ∈ D, or using 3D convolutions to refine over D × W × H. This can smooth edges (Kendall et al., 2017) and planes of surfaces with hard to match features due to differences in albedo or non-Lambertian surfaces, leading to improved results according to multiple ablation studies (Chang and Y.-S. Chen, 2018). To preserve small details, EdgeStereo predicts both a depth map and an edge map (Song et al., 2019) and trains using a loss function which incorporates both. A downside of inverse projection is that refinements in the regularization step cannot use the input images directly as done in Yu et al. (2018) and Liang et al. (2018), a common technique called skip connections for deep networks to preserve gradients and input information throughout the network.

Again, as for feature encoding, it has been shown that using cost volume refinement improves depth prediction, but there are no comparative studies solely on depth refinement for stereo matching. The closest is the ablation study on PSMNet. As its Stacked Hour Glass (SHG) cost volume refinement has shown good results, this research will adopt it.

3.7 Depth prediction and loss

To predict a depth map from a cost volume, along the depth dimension of each location x_i ∈ H, y_j ∈ W the lowest matching cost is chosen in a Winner Takes All manner. This can be done using the arg-min function, but this is not differentiable and thus not suitable for training a neural network. Therefore GC-Net defines the soft argmin, equation 3.1. The right side of the multiplication performs a soft-min operation over the costs c_d to create a probability, which is multiplied with the disparity or depth d:

$$\hat{d} = \sum_{d=0}^{D} d \times \frac{e^{-c_d}}{\sum_{i=0}^{D} e^{-c_i}} \tag{3.1}$$

As this has been shown to be more robust than classification based methods (Zhang et al., 2019; W. Luo, Schwing, and Urtasun, 2016) and can generate sub-discretization depth accuracy, it will also be used for the proposed model.
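A minimal PyTorch sketch of this soft argmin, assuming a cost volume of shape D × H × W where a lower cost means a better match; the predicted index could then be mapped to a metric height as z_min + r_z · d̂:

```python
import torch

def soft_argmin(cost):
    """Differentiable WTA over the depth dimension (equation 3.1).

    cost: tensor of shape (D, H, W). Softmax over the negated costs gives
    a per-depth probability; the expectation over depth indices yields a
    sub-discretization depth estimate per (H, W) location.
    """
    D = cost.shape[0]
    prob = torch.softmax(-cost, dim=0)                        # (D, H, W)
    depths = torch.arange(D, dtype=cost.dtype).view(D, 1, 1)
    return (depths * prob).sum(dim=0)                         # (H, W)
```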

3.7.1 Voxel methods

The cost volume can also be interpreted as a voxel volume. The Learned Stereo Machine and 3D-R2N2 produce a voxel occupancy grid from the cost volume. This requires assessing each voxel as a Bernoulli distribution of being part of the object or not. Voxel methods such as Paschalidou et al. (2018), Ulusoy, Michael J. Black, and Geiger (2017), and Ulusoy, Geiger, and Michael J Black (2015) compute which image rays hit each voxel, and thus only predict for voxels on the surface. RayNet (Paschalidou et al., 2018) is a network which, instead of interpolating depth planes, uses depth rays per pixel in each input image and aggregates these depth distributions into a 3D reconstruction. As every ray has to be computed, this scales badly with bigger images.

3.8 Conclusion

An overview has been given of different ways to do data-driven stereo matching, and different multi-view methods and their relation to true MVM have been discussed. From this and other research, various modules for each part of the pipeline have been considered for the proposed model. The next chapter will go into depth about this model with the proposed modules.


4 Method

4.1 Deep height matching network: true multi-view matching for aerial imagery

This research proposes DHMNet, a model inspired by recent deep stereo matching methods and compliant with the defined constraints for true MVM, for optimal DSM prediction from nadir aerial images. An overview of the proposed model is given in figure 4.1. The core model uses inverse projection by plane sweeping to world coordinates; this makes it possible to do multi-view matching. A soft WTA strategy is used to find the best matching depth per point to create a DSM. The feature encoder, feature fusion and 3D regularizer are all modular deep networks of the model; any can be omitted or replaced by a network adhering to the correct input and output sizes.

4.2 Inverse projection

To be able to do multi-view matching and to align height prediction with the desired output frame, inverse projection using the plane sweep method is used. This requires defining the extent of the planes, as well as the heights to project to. As DSMs cover areas, e.g. figure 4.2, larger than what could fit entirely within the memory constraints of computers, areas are usually broken up in non-overlapping rectangular tiles (blue in figure 4.2), each bound by their [x_min, x_max] and [y_min, y_max]. Inverse projecting to such a tile also requires bounds on the height [z_min, z_max]. This 3D space is then discretized at resolutions r_x = 0.1, r_y = 0.1, r_z = 0.1. These are chosen to be equal to create a regularly spaced, three-dimensional grid of voxels of size W × H × D, where:

1. x_0 = x_min, x_W = x_max and W = (x_max − x_min) / r_x
2. y_0 = y_min, y_H = y_max and H = (y_max − y_min) / r_y
3. z_0 = z_min, z_D = z_max and D = (z_max − z_min) / r_z

The center of each voxel is at coordinate (x_i, y_j, z_k), where i ∈ [0..W], j ∈ [0..H], k ∈ [0..D]. Using the projection matrix of each image, these coordinates are projected to image coordinates

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = PM_{im_n} \begin{bmatrix} x \\ y \\ z_k \\ 1 \end{bmatrix}$$

to get the image feature for the voxel. Since these are not integer pixel coordinates but real numbers, bilinear interpolation is used to compute the feature at (x_i, y_j, z_k). Plane sweeping does this by projecting all voxels at each height Z = z_k at once, making use of the GPU's matrix processing capabilities.
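Combining equation 2.3, the voxel grid and bilinear interpolation, a minimal PyTorch sketch of the sweep for a single image (shapes, names and the u, v axis convention are assumptions, not the thesis code):

```python
import torch
import torch.nn.functional as F

def sweep(features, PM, xs, ys, zs):
    """Inverse project one image's feature map onto all height planes.

    features: (C, Him, Wim) feature map; PM: (3, 4) projection matrix K[R|t];
    xs (W,), ys (H,): world coordinates of the DSM grid; zs (D,): plane
    heights. Returns a feature volume of shape (C, D, H, W).
    """
    C, Him, Wim = features.shape
    D, H, W = zs.numel(), ys.numel(), xs.numel()

    # Voxel centre coordinates in homogeneous form: (D, H, W, 4)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    pts = torch.stack([x, y, z, torch.ones_like(x)], dim=-1)

    # Forward project all voxels at once and divide out lambda
    uvw = pts.reshape(-1, 4) @ PM.T          # (D*H*W, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Normalize to [-1, 1] for grid_sample (assumes u indexes image columns)
    grid = torch.stack([2 * uv[:, 0] / (Wim - 1) - 1,
                        2 * uv[:, 1] / (Him - 1) - 1], dim=-1)
    grid = grid.reshape(1, D, H * W, 2)

    # Bilinear interpolation at the non-integer image coordinates
    vol = F.grid_sample(features[None], grid, mode="bilinear",
                        padding_mode="zeros", align_corners=True)
    return vol.reshape(C, D, H, W)
```

Called once per overlapping image, this yields N aligned feature volumes that the fusion step can aggregate in O(N).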

FIGURE 4.1: Overview of the proposed model both without and with deep learning modules. Blue boxes denote operations. Green represents a 2D spatial relation, yellow a 3D spatial relation. The purple square is the 2.5D DSM.

4.3 Aerial images to DSM

As the area to be predicted is tiled, the images covering each tile are selected to be used for back projection. The green lines in figure 4.2 show the footprints of images, each covering large areas containing multiple tiles, and each tile is covered by multiple images to be used for matching. To minimize the memory footprint, each image is cut to the minimum (u_min, v_min) and maximum (u_max, v_max) image coordinates needed for inverse projection.

When cutting out part of an image, the intrinsic parameters change. That is, the principal point is at a different v, u location, shifted by the difference between the (0, 0) point of the cut and the full image, see figure 4.3. This is simply recalculated by subtracting the (0, 0) coordinate of the cut image from the u, v coordinates of the principal point:

$$\begin{bmatrix} pp_v \\ pp_u \end{bmatrix}_{cut} = \begin{bmatrix} pp_v \\ pp_u \end{bmatrix}_{image} - \begin{bmatrix} v_0 \\ u_0 \end{bmatrix}_{cut} \tag{4.1}$$
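A minimal sketch of this cut-and-update step (a hypothetical helper; placing v_0 and u_0 in the first and second row of K follows the convention of equation 2.2 and is an assumption):

```python
import numpy as np

def cut_image(image, K, u_min, u_max, v_min, v_max):
    """Crop an image to a tile's bounding box and shift the principal point
    per equation 4.1, so the cut can be used with the same projection code."""
    cut = image[v_min:v_max, u_min:u_max]  # assumes v indexes rows
    K_cut = K.copy()
    K_cut[0, 2] -= v_min   # v0 entry of K (equation 2.2 layout, assumed)
    K_cut[1, 2] -= u_min   # u0 entry of K
    return cut, K_cut
```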

The set of cut images covering a tile can then be used for image matching: first by inverse projecting to a range of height planes to align the images; next using a multi-view cost function to compute the matching cost of each voxel; last a WTA over the height dimension to output a single height prediction for all x, y world locations in the grid, resulting in a DSM.

FIGURE 4.2: Part of the test set with footprints of buildings (black), tiles (blue), images (green) and point clouds (red)

FIGURE 4.3: Relative location of the principal point

4.4 Network architecture

To improve stereo matching, deep learning modules can be added, see the bottom image of figure 4.1. The first is a feature encoder, to learn more descriptive features. Next is feature fusion, to improve matching of multiple features compared to a direct cost function. Last is 3D feature volume regularization to refine the fused features into a cost volume.

The general network dimensions are given in table 4.1. The image height and width depend on the cut sizes per image. After inverse projection, the height and width are those of the DSM grid of the area to be predicted. The scale accounts for the difference between the input image size and feature map size due to pooling layers, interpolation layers, convolutional strides or any other size changing operation in the feature encoder. The number of height planes depends on the range and resolution of Z. The regularizer needs to account for any scaling to produce a full sized DSM.


Name                   Dimensions                           Remarks
Image                  3 × H_im × W_im                      H_im: image height; W_im: image width
Feature map            C × (1/s)H_im × (1/s)W_im            C: feature channels; s: scale
Feature volume         C × (1/s)D × (1/s)H_w × (1/s)W_w     D: height planes; H_w: DSM height; W_w: DSM width
Fused feature volume   F × (1/s)D × (1/s)H_w × (1/s)W_w     F: fused feature channels
Cost volume            D × H_w × W_w
DSM                    H_w × W_w

TABLE 4.1: Data dimensions in the network

4.5 Image features

The first module is the image feature encoder. This takes the cut aerial images for a tile as input and constructs a feature map of each. As stated, this is used because each feature pixel can contain more information with regard to the surrounding pixels and the image, compared to simple RGB values.

For DHMNet, SPP (figure 3.3a) is used, as it has shown to be robust for constructing features of different sizes (He et al., 2015). This is due to the use of multiple parallel branches, each with average pooling of 8, 16, 32 and 64, to encode features over multiple scales. SPP will also be used to compare this model to the disparity predicting binocular stereo matching method PSMNet, proposed by Chang and Y.-S. Chen (2018), which topped the KITTI benchmark (Geiger, Lenz, et al., 2015) leader board at their time of writing.

4.6 Feature fusion and 3D regularizer

The next module is the feature fusion. This takes the feature volumes from the previous module and combines them to allow for true MVM adhering to the stated conditions. Since each volume is in the world coordinate system, their boundaries align, making it possible to reduce all volumes over the feature dimension to a fused feature of arbitrary size.

This is thus the core part of a true multi-view matching network, as it allows aggregating all images equally with linear complexity. The feature fusion is a mapping function which should be able to take in N ≥ 2 inputs A = {x_1, x_2, ..., x_N}, x_n ∈ R^{1×C}, and map these linearly, treating each input the same, into a fixed output y ∈ R^{1×F}, while being invariant to any permutation π:

$$f(x_1, ..., x_N) = f(x_{\pi(1)}, ..., x_{\pi(N)}) \tag{4.2}$$

Concatenation of two features will be used for comparison to PSMNet. This variant will be called DHMNet_concat. The use of concatenation breaks the first condition, but gives insight in the difference between using rectified images and predicting a DSM via disparities, compared to using raw images and predicting DSMs using back projection.


Next, the model will be tested with a combination of the variance and mean over the feature dimension of the volumes (DHMNet_var_mean), as a combination of Yao, Z. Luo, Li, Fang, et al. (2018) and Hartmann et al. (2017). These are permutation invariant as stated by equation 4.2.
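A minimal PyTorch sketch of this fusion (assumed shapes; here the fused output simply concatenates the per-voxel mean and variance over the views, giving F = 2C channels, independent of N and of the image order as equation 4.2 requires):

```python
import torch

def fuse_var_mean(volumes):
    """DHMNet_var_mean-style fusion of N aligned feature volumes.

    volumes: tensor of shape (N, C, D, H, W), one volume per input image.
    Returns (2C, D, H, W); works for any N >= 2 at O(N) cost.
    """
    mean = volumes.mean(dim=0)
    var = volumes.var(dim=0, unbiased=False)   # population variance over views
    return torch.cat([mean, var], dim=0)
```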

After the image features are fused to a single new feature, it is put through a regularizer. As proposed, the SHG module (figure 3.3b) will be used. Chang and Y.-S. Chen (2018) have shown that this encoder-decoder structure improves height prediction. This has two functions: going from an R^{1×F} sized fused feature to an R scalar value, and regularizing the spatial context. This has the effect of smoothing irregularities, as matching can never be perfect (Kendall et al., 2017). From the perspective of traditional stereo matching, the output of this can be viewed as the cost volume, where the scalar values represent the matching cost. Cost can be changed to score depending on whether the scalar is interpreted as being minimized or maximized to denote a better match.

4.6.1 Up-sampling

Since the dimensions are scaled by a factor s due to downsizing in the image feature encoding, the cost volume needs to be up-sampled to the descaled sizes. Up-sampling uses a predefined interpolation method: bi-linear interpolation, bi-cubic interpolation or nearest neighbour interpolation. Another option is transposed convolution, which uses learned parameters to interpolate the features. While Kendall et al. (2017) use 3D deconvolution to get the desired output size, Chang and Y.-S. Chen (2018) show that plain up-sampling does not perform worse when combined with the SHG.
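With plain interpolation this is a single operation; a sketch, assuming the scale factor s = 4 used later in section 5.3.1:

```python
import torch.nn.functional as F

def upsample_cost_volume(cost, s=4):
    """Up-sample a (B, 1, D/s, H/s, W/s) cost volume to (B, 1, D, H, W)
    with trilinear interpolation instead of learned deconvolution."""
    return F.interpolate(cost, scale_factor=s, mode='trilinear',
                         align_corners=False)
```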

4.7 Depth reduction

With a cost volume, the loss of the model can be computed as a classification or a regression problem. Classification computes the loss per depth array, where each depth can be a match or not. However, this ignores the spatial relation between depths. Regression does take this into account and uses the soft argmin function (equation 3.1) to predict a single depth based on the entire array, for which the loss can be computed.
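A sketch of the soft argmin reduction, assuming lower cost denotes a better match and heights spaced at the 0.1 m resolution used in this work:

```python
import torch

def soft_argmin(cost, z_res=0.1):
    """Reduce a (B, D, H, W) cost volume to a (B, H, W) height map.
    Softmax over the negated costs gives a probability per height
    plane; the predicted height is the probability-weighted mean,
    which stays differentiable and uses the entire depth array."""
    probs = torch.softmax(-cost, dim=1)
    d = cost.shape[1]
    heights = z_res * torch.arange(d, dtype=cost.dtype, device=cost.device)
    return (probs * heights.view(1, d, 1, 1)).sum(dim=1)
```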

4.8 PSMNet

As mentioned, the proposed network DHMNet will be compared to PSMNet. Both follow a similar pipeline using the SPP and SHG. However, PSMNet uses rectified images and shifts these over each other using concatenation to build the cost volume. This results in an image based depth map. Experiments will be conducted to show the effects of these differences.


5 Implementation & experiments

A methodical overview of the proposed DHMNet has been given and put in the context of the aerial domain. This chapter covers the implementation specifics for training the model, and the inference experiments that empirically demonstrate the benefit of the discussed conditions of a true MVM and its properties for DSM prediction using aerial imagery. The model will be compared to PSMNet as used by Zuurmond (2018) for aerial images.

5.1 Aerial imagery and ground truth height data

For the experiments, aerial imagery of the municipalities of Breda and Roosendaal in the Netherlands is used as training and testing set respectively. For these, 25000 tiles for Breda and 5000 for Roosendaal of 16 by 16 meter are defined, see figure 4.2. All tiles are completely covered by at least four images; for the test set, 1929 tiles are also covered by six images. As the use case is DSM prediction for urban areas, only tiles overlapping building footprints are defined; these are the black polygons in figure 4.2. Building footprints are collected from the "Basisregistraties Adressen en Gebouwen" dataset, which is maintained by the Dutch governmental organization "het Kadaster".

Aerial images are shot with a calibrated camera. The in- and extrinsics are known and adjusted for the cut out images following equation 4.1. The images are shot at an approximate 10 cm GSD, so each pixel covers approximately 10 × 10 cm when inverse projected at ground level.

For each tile, ground truth data is extracted from the publicly available Actueel Hoogtebestand Nederland 3 (AHN3) (2018) LiDAR dataset of the Netherlands. The LiDAR data is converted to a digital surface map discretized at 10 × 10 cm cells, the same as the GSD of the images.

The LiDAR is not as densely captured as the aerial images: it has a reading with a radius of ±20 cm every ±50 cm at ground level. Each point is converted by rounding both the x and y coordinate to the upper and lower 10 cm interval, giving four cells a ground truth value. Cells without a ground truth value in the DSM cannot be used for error assessment and will be masked out.
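A sketch of this conversion for a single tile, assuming points are given as (x, y, z) in meters relative to the tile origin; how overlapping writes between nearby points are resolved is an assumption here:

```python
import numpy as np

def rasterize_points(points, cells=160, res=0.1):
    """Assign each LiDAR point (x, y, z) to the four surrounding 10 cm
    cells by flooring and ceiling x and y to the grid. Cells that
    receive no point stay NaN and are masked out during evaluation."""
    dsm = np.full((cells, cells), np.nan)
    for x, y, z in points:
        for i in (int(np.floor(x / res)), int(np.ceil(x / res))):
            for j in (int(np.floor(y / res)), int(np.ceil(y / res))):
                if 0 <= i < cells and 0 <= j < cells:
                    dsm[j, i] = z  # later points simply overwrite earlier ones
    return dsm
```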

Due to height differences, the points are not evenly spread. See the second row of figure 5.1: only the colored points are readings that have a ground truth height value. Each tile has an average of 13179 ground truth readings over a raster of 160 × 160 cells, which is on average 51% of the tile.

The images and LiDAR data were chosen because they were collected in close succession in early 2017. Yet there are still some differences between the data collection moments, e.g. due to non-static objects; see columns one, two and three of figure 5.1. The first

FIGURE 5.1: Five examples of tiles with predictions from

with real aerial data. Both DHMNet and PSMNet will have to work with this noisy data.

5.2 Training

The two different versions, DHMNet_concat and DHMNet_var_mean, are both trained with a learning rate of 1e−5 to convergence on an NVIDIA RTX 2070 with 12 GB memory. The Adam optimizer (Kingma and Ba, 2014) is used with betas (β1 = 0.9, β2 = 0.999), following other research (Im et al., 2019; Chang and Y.-S. Chen, 2018). DHMNet_concat uses two images per tile when training and DHMNet_var_mean uses four.

As both the SPP and SHG use batch normalization, which requires mini-batches to find a good minimum (Masters and Luschi, 2018), a trade-off between tile size and batch size had to be made for the used hardware. Therefore the tiles are set to cover 16 × 16 meter so that, at a resolution of 10 cm, four tiles of 160 × 160 cells can be processed by the model on the GPU.

The SHG regularizing module has three outputs, one after each hourglass, see figure 3.3b. During training, the loss is computed as a weighted sum, using weights 0.5, 0.7 and 1.0 for the first, second and last output respectively, as proposed by Chang and Y.-S. Chen (2018). During inference, the last output is used as the prediction.

Again following suggestions by the previously mentioned papers, a smooth L1 loss function is used for training, also called the Huber loss:

$$\mathrm{loss}(x, y) = \frac{1}{n} \sum_i z_i, \qquad z_i = \begin{cases} 0.5\,(x_i - y_i)^2 & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5 & \text{otherwise} \end{cases} \tag{5.1}$$
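A sketch combining equation 5.1 with the weighted sum over the three SHG outputs, masked to cells that have a ground truth reading (function and argument names are illustrative):

```python
import torch.nn.functional as F

def training_loss(outputs, target, mask, weights=(0.5, 0.7, 1.0)):
    """Weighted sum of smooth L1 (Huber) losses, equation 5.1, over
    the three SHG outputs. `outputs` is a list of (B, H, W) height
    predictions, `target` the ground truth DSM, and `mask` selects
    cells that have a LiDAR reading."""
    return sum(w * F.smooth_l1_loss(out[mask], target[mask])
               for w, out in zip(weights, outputs))
```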

5.2.1 Training PSMNet

PSMNet is trained to convergence on the training data with a max disparity of 192, using the same learning rate and optimizer hyper-parameters as DHMNet. These settings are similar to the PSMNet in Zuurmond (2018), which also predicts disparity maps using a test dataset covering Roosendaal. The difference is that Zuurmond (2018) measures the accuracy over the disparity instead of the height. They report 57%, 40% and 32% of the disparities being outside 1, 2 and 3-pixel error margins respectively, which is comparable to height margins of 0.1, 0.2 and 0.3 meter. The percentages in table 5.1 state 52%, 33% and 25% of the heights outside the respective height margins. The better results in this research could be because, per DSM, all images are used as base image once and the predictions are combined, so that all images are utilized equally for comparison to DHMNet.


                              number of images
                         2                  4                  6
Method             >0.1  >0.2  >0.3   >0.1  >0.2  >0.3   >0.1  >0.2  >0.3
PSMNet             52%   33%   25%    46%   32%   26%    41%   28%   23%
DHMNet_concat      75%   57%   45%          N/A                N/A
DHMNet_var_mean    67%   46%   35%    59%   41%   34%    58%   39%   32%

TABLE 5.1: Percentage of points higher than error margin (m), lower is better.

5.3 Testing

5.3.1 Depth planes

DHMNet will predict over 192 depth planes. With the SPP module downsizing four times, the feature volume has a size D = 192/4 = 48, ranging over indices i = 0, . . . , 47. Thus, with a resolution of 0.1 m, the highest back projected height is 47 × 4 × 0.1 = 18.8 m. Therefore height is evaluated in the range [0, 18.8].
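The evaluated range follows directly from these numbers; a minimal check:

```python
import numpy as np

D, s, res = 192, 4, 0.1         # depth planes, SPP downscaling, resolution (m)
indices = np.arange(D // s)     # i = 0, ..., 47
heights = indices * s * res     # back projected plane heights
print(heights[0], heights[-1])  # 0.0 18.8 -> evaluation range [0, 18.8]
```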

5.3.2 Disparity map to DSM

As the outputs of PSMNet are image based, post processing is needed to create a DSM. Each point in the disparity map is triangulated to the DSM raster. As this is not a direct mapping, some cells might not be filled and will be masked out in the test data. Conversely, as multiple disparity maps are combined, some points will be triangulated to the same DSM cell. For these, the median is taken, as it is less sensitive to outliers than the mean.
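A sketch of this combination step, assuming the triangulated points are already given as (x, y, z) in tile coordinates:

```python
import numpy as np
from collections import defaultdict

def combine_to_dsm(points, cells=160, res=0.1):
    """Combine triangulated (x, y, z) points from multiple disparity
    maps into one DSM, taking the median where a cell receives several
    heights. Unfilled cells remain NaN and are masked in the tests."""
    buckets = defaultdict(list)
    for x, y, z in points:
        i, j = int(x / res), int(y / res)
        if 0 <= i < cells and 0 <= j < cells:
            buckets[(j, i)].append(z)
    dsm = np.full((cells, cells), np.nan)
    for (j, i), heights in buckets.items():
        dsm[j, i] = np.median(heights)
    return dsm
```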

5.4 Experiments

Experiments will be conducted to test the benefits of true MVM for aerial stereo matching. This is broken up into three experiments for the proposed DHMNet compared to PSMNet, based on the research questions in the introduction:

1. What is the effect of world based DSM prediction compared to image based prediction for two images?

2. What is the effect on multi-view DSM prediction of using size invariant variance and mean for DHMNet, compared to repeated DSM prediction using concatenation for PSMNet?

3. How does the quality of DHMNet compare to PSMNet?

5.4.1 Effect of world based DSM prediction

To analyse the accuracy of world based versus image based prediction, both DHMNet_concat and PSMNet are evaluated on the test set using two images. This gives an average error over all ground truth points of 0.57 m for PSMNet and 0.87 m for DHMNet_concat, as can be seen in table 5.2; thus PSMNet outperforms DHMNet_concat. To analyse this, figure 5.2 shows a box plot of the distribution of the error.

                        number of images
Method             2            4            6
DHMNet_var_mean    0.86±1.78    0.75±1.61    0.67±1.50

TABLE 5.2: Mean absolute error and standard deviation (m) over all points in the test set, lower is better.

FIGURE 5.2: Box plot of the absolute error of points for PSMNet and DHMNet_concat with 2 images. Green is the mean, orange the median. The box represents the first to third quartile. Whiskers denote the 5% and 95% percentiles.

First, it can be seen that both PSMNet and DHMNet_concat have their mean above the third quartile, and the 95% whisker far higher still. This means that more than 75% of the points have an error lower than the mean, but there are also points with a much higher error. Second, for DHMNet the third quartile is closer to the mean and the 95% mark is higher compared to PSMNet. Thus PSMNet does perform better overall.

From the left and right columns of the examples in figure 5.3, it can be seen that both networks can suffer from high errors due to trees. This is because, due to smoothing, the depth maps produce a tree as a solid object with smooth height changes, while the ground truth gives high and low readings scattered between each other, depending on whether the laser hit the tree or passed through.

This is confirmed by figure 5.4b, showing that points with large height gradients have a larger error for both PSMNet and DHMNet, yet slightly more so for DHMNet. This is also visible in the second and second to last columns of figure 5.3, showing high errors around the edges of the structures. It must be noted, however, that most of the points have a gradient lower than 0.5 (figure 5.4a), so most flat areas are predicted with an error below the average.

For the fourth and sixth row of figure 5.3, a negative error means the network predicted lower than the actual height, and this explains the high error for high points in figure 5.5b.


FIGURE 5.3: Examples ranging from where the MAE of DHMNet_concat is better (left) to where the MAE of PSMNet is better (right).

FIGURE 5.4: Error per gradient for DHMNet_concat. (A) Number of points per gradient. (B) Average absolute error per gradient.

FIGURE 5.5: Error per height for DHMNet_concat. (A) Number of points per height. (B) Average absolute error per height.

Positive errors mean the network predicted higher than the actual height. This could explain the increase in average error in figure 5.5b for points under 1.5 meter.

As a third metric, the error per confidence for DHMNet is examined. Confidence is measured as the variance over the depth axis of the cost volume after the softmax operation, relative to the maximum possible variance for that depth. As can be seen from figure 5.6b, points with low confidence show on average a higher error, which declines as the confidence increases. Due to the confidence distribution of the points (figure 5.6a), confidences above 0.9 cover fewer than 100 points. This makes the average more sensitive to outliers and could explain the higher average errors above a confidence of 0.9.
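A sketch of this confidence measure; the normalization assumes the maximum possible variance occurs when all probability mass falls on a single height plane, i.e. (D − 1)/D²:

```python
import torch

def confidence(cost):
    """Per-cell confidence for a (B, D, H, W) cost volume: variance of
    the softmax probabilities over the D height planes, divided by the
    maximum possible variance (one plane holds all probability mass)."""
    probs = torch.softmax(-cost, dim=1)  # sign assumes lower cost = better
    d = cost.shape[1]
    max_var = (d - 1) / d**2
    return probs.var(dim=1, unbiased=False) / max_var
```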

In summary, DHMNet and PSMNet show the same error patterns with respect to height and gradient, but the error of DHMNet is always higher. As the certainty about predictions decreases, the average error for both DHMNet and PSMNet increases, which shows that these points are also harder to predict accurately for PSMNet. As a final point, DHMNet does outperform PSMNet on completeness, which shows the usefulness of the design of the network.
