
Bachelor Informatica
Universiteit van Amsterdam

Multi-Frame Super-Resolution Image Enhancement for Autonomous Landing of Unmanned Aerial Vehicles

Herman Kelder

January 20, 2021


Abstract

Recent years have shown great technological advances in the field of autonomous vehicles. Major tech companies are researching the possibility of using Unmanned Aerial Vehicles (UAVs) in delivery services, so-called drone delivery services. Among the benefits that UAVs can bring are deliveries to hard-to-reach places and fast deliveries. UAVs currently use GPS as their main source of location data to navigate to their destination. However, GPS is a secondary source of information that can be unreliable, inaccurate and untrustworthy. Onboard cameras can leverage the power of computer vision to mitigate these security and integrity concerns for a UAV delivery service, while flying at high altitude is necessary to mitigate safety and privacy concerns.

The purpose of this research is to make use of Multi-Frame Super-Resolution (MFSR) to aid the autonomous landing system of the UAV independently of GPS. Multiple low-resolution images are used in the MFSR algorithm to create a single high-resolution image. MFSR will enable the UAV to identify the landing location indicated by the recipient whilst flying at a higher altitude. Implementation of the MFSR algorithm requires multiple stages, which include RAW image data conversion, geometric registration, image reconstruction, image sharpening, and identification of the indicated landing location. Experiments show that the MFSR algorithm outperforms traditional interpolation methods and is able to reconstruct a high-resolution image with greater detail. The MFSR algorithm significantly improves the identification rates of the landing location when compared to the original low-resolution images. An average altitude range extension of almost a factor of two is achieved by the MFSR algorithm.

This research shows that the MFSR algorithm is able to aid the autonomous landing system by increasing the altitude range at which the UAV is able to identify the landing location.


Contents

1 Introduction
   1.1 Research Questions
   1.2 Research Outline

2 Background
   2.1 Navigation
   2.2 Super-Resolution
       2.2.1 Taxonomy
       2.2.2 Single-Frame
       2.2.3 Multi-Frame
   2.3 Related Work

3 Design
   3.1 UAV
   3.2 Camera
   3.3 Marker
       3.3.1 Pixels
   3.4 Computer Vision

4 Implementation
   4.1 Initialisation and Pre-Processing
       4.1.1 Demosaicing
       4.1.2 Snippet
   4.2 Motion Model
       4.2.1 Geometric Parameters
   4.3 Geometric Registration
       4.3.1 Direct Registration
       4.3.2 ECC Image Alignment
       4.3.3 Affine Estimation
   4.4 Image Reconstruction
       4.4.1 Methodology
       4.4.2 Inverse Distance Weighting
       4.4.3 Sharpening
   4.5 Marker Identification

5 Experiments
   5.1 Method
       5.1.1 Alignment Accuracy
       5.1.2 Image Reconstruction Accuracy

6 Results
   6.1 Geometric Registration
   6.2 Reconstruction Accuracy
       6.2.1 Synthetic
       6.2.2 Real Data
   6.3 Identification Rate

7 Discussion
   7.1 Ethical Aspects
   7.2 Conclusion
   7.3 Future Research

A Parameters


CHAPTER 1

Introduction

Computer vision is everywhere. It has been involved in great technological developments over the past decades. Areas such as unmanned vehicles and automation have benefited from these developments. Amongst others, self-driving cars and augmented reality are well-known applications that effectively make use of computer vision.

Major tech companies such as Amazon and Tesla invest a lot of resources in research concerning autonomous unmanned vehicles. Part of the future technological advances in the area of autonomous unmanned vehicles is the utilisation of Unmanned Aerial Vehicles (UAVs), more commonly known as drones, in a variety of tasks. A recent trend is research into drone postal services, where computer vision can be utilised to a great extent. Delivery to hard-to-reach places and fast delivery are among the benefits that UAVs can bring to delivery services. Currently, however, there are no widespread commercial UAV delivery services available to the public.

In order to make commercial UAV delivery services possible, some problems need to be addressed. The areas of the UAV delivery service that require further research are automation, security, and integrity. A UAV currently uses GPS as its main source of location data in order to navigate to a destination and orient itself. However, GPS is a secondary information source and therefore not inherently trustworthy. GPS spoofing has been shown to be effective and thus violates the security and integrity requirements of a UAV delivery service [15] [20] [24]. Computer vision can be used to mitigate this issue, since it is independent of location services. Cameras carried by the UAV can leverage the power of computer vision to navigate to the destination.

Computer vision often benefits from more accurate or more detailed imagery. Navigation based on computer vision cannot be achieved by making UAVs fly at low altitude above city areas: UAVs flying at low altitude would form a safety hazard and raise privacy concerns. The UAV therefore needs to fly at a higher altitude to mitigate safety and privacy concerns. However, when a UAV flies at a higher altitude the camera loses detail due to its fixed resolution. Autonomous delivery systems require detail in order to navigate to the exact landing location indicated by the recipient. This loss of detail can be mitigated with a method called Super-Resolution.

Super-Resolution is a method that tries to increase the resolution of an image. A higher-resolution image can be created using a single image (Single-Frame Super-Resolution (SFSR)) or multiple images (Multi-Frame Super-Resolution (MFSR)). Over the last decade most research has focused on SFSR, due to the progress in deep learning. However, these SFSR methods "hallucinate" the higher resolution. This could make an SFSR method based on deep learning unsuitable for applications where unique identification is necessary. The integrity of a delivery system is at stake when the unique identification of a delivery location could be based on hallucinated information. MFSR increases the resolution by merging actual data from multiple frames (images), and may therefore be preferred as a solution for an autonomous delivery application.


1.1 Research Questions

Literature research into techniques used for autonomous landing systems for drones has not led to articles which indicate the use of MFSR techniques. To the best of my knowledge, MFSR techniques either have not been applied yet or information about their application has not been published. A drone delivery application needs to be able to function in the absence of reliable GPS information and requires identification of a delivery location from a high altitude. The use of MFSR techniques may significantly increase the altitude at which a drone can identify a delivery location.

This research project will look into how MFSR techniques can be utilised for the autonomous landing of a UAV on a marked (recognisable and identifiable) landing location. It is important for the delivery service application that the UAV can safely, accurately and securely deliver its package. The research thesis addresses the following main question:

How can Multi-Frame Super-Resolution Image Enhancement be used to aid an autonomous landing system for a UAV in unknown surroundings?

In order to address the main research question, it is divided into four sub-questions:

• RQ1: How to implement the MFSR algorithm? The MFSR algorithm consists of different components, and for each component multiple solutions are possible. As such, there are many ways of implementing an MFSR algorithm. The requirements of the application and its environment guide the selection of the right solution for each component. This research question addresses the selection of a proper technique for each component based on literature research and the outcome of field tests, and is dealt with in chapter 4.

• RQ2: How much - if at all - does the MFSR algorithm increase the resolution? MFSR is used to increase the resolution of the acquired images, but if the performance increase is minimal, it may not be desirable to use it for this application. The performance impact of the created MFSR system is quantified by the experiments described in chapter 5.

• RQ3: How does MFSR affect the performance of the detection and identification of the marked landing location? The MFSR algorithm may improve the resolution, but that does not necessarily imply an improvement in the performance of the marker detection and identification. The improvement in marker detection and identification with images from the MFSR algorithm is shown in chapter 6.

• RQ4: How does the MFSR algorithm affect the usability of a UAV for delivery applications? The performance of MFSR at different altitudes will determine the usability of future UAV delivery systems. The results of the applied MFSR algorithm are analysed to quantify the usability in a real-life environment in chapter 7.

1.2 Research Outline

This report starts by presenting background information on UAV navigation and Super-Resolution in chapter 2. The design of the MFSR system is discussed in chapter 3. That chapter describes the design choices in the hardware and software of the application.

The implementation of the MFSR system is described in chapter 4. The structure of the implementation and the required pre-processing is presented first. The different stages of the algorithm and their design choices are then explained in detail. The chapter concludes with a description of the marker identification that will be used to analyse the performance and usability of the system.

Chapter 5 presents the experiments performed and their purposes. The results of the experiments are shown in chapter 6. A discussion on the results and findings follows in chapter 7 to conclude on the research performed and propose future research.


CHAPTER 2

Background

The autonomous landing system is an essential part of a UAV delivery service. This chapter starts by introducing the navigation currently used by UAVs and explains why a computer vision method is needed. This is followed by an introduction to Super-Resolution, outlining the notion of Super-Resolution algorithms and their general inner workings. Lastly, related work that explores the autonomous landing of UAVs using computer vision is discussed briefly.

2.1 Navigation

Navigating the UAV and identifying the landing location are essential for a fully autonomous UAV delivery service. Location data from GPS is often used by UAVs to navigate to the delivery location. However, GPS can be unreliable, has limited accuracy and can be untrustworthy. GPS is a secondary source of information and offers malicious users the opportunity to spoof a signal [15]. Research has shown that the GPS signal received by a UAV can successfully be spoofed [20], making it possible to hijack the UAV and guide it somewhere else. GPS spoofing does not require great amounts of effort or money, with GPS spoofing devices available at relatively low prices [24].

Onboard cameras and computer vision can prevent reliance on unreliable and untrustworthy GPS. In this research, computer vision is used to aid the autonomous landing system of a UAV by identifying the landing location indicated by the recipient. As stated in the introduction, the UAV needs to fly at a higher altitude to mitigate safety and privacy concerns. Flying at a higher altitude reduces the detail in the images due to the fixed resolution of the camera, while computer vision requires detail to detect and identify the landing location indicated by the recipient. A high-resolution image will therefore be created using Super-Resolution, a class of algorithms designed to construct a high-resolution image from one or more lower-resolution images. Multi-Frame Super-Resolution (MFSR) will be researched to improve the resolution at higher altitudes using multiple low-resolution images.

2.2 Super-Resolution

In this research MFSR will be used to aid the detection and identification of the landing location at higher altitudes. Super-Resolution is a class of algorithms that create a high-resolution image from a single or multiple low-resolution images. This section will introduce the class of Super-Resolution and its different methodologies.

2.2.1 Taxonomy

The class of Super-Resolution algorithms can be divided into different categories based on the number of images used and the domain operated in. In order to visualise the structure of the different Super-Resolution methods available, the taxonomy in figure 2.1 was created. This taxonomy is based on [26]. The yellow path indicates the methodology deployed in this research.

[Figure 2.1: Super-Resolution taxonomy based on [26], split into the frequency domain (Fourier, Wavelet) and the spatial domain. The spatial domain contains the Single-Frame category (Learning, Reconstruction) and the Multi-Frame category (with, among others, Iterative Back Projections, Iterative Adaptive Filtering, Projection Onto Convex Sets, Probabilistic approaches such as Maximum Likelihood, Maximum A Posteriori and Non-Parametric, and the Direct approach with Shift and Add). The yellow blocks indicate the path to the strategy used in this research: Spatial Domain, Multi-Frame, Direct, Shift and Add.]

The Super-Resolution class is divided into two domains: the frequency domain and the spatial domain. Both refer to the domain in which the low-resolution images are processed and a high-resolution image is created. The spatial domain can be split into two categories based on the number of low-resolution images used to create a high-resolution image. When a single image is used, the algorithm is part of the Single-Frame category, and when multiple images are used it is part of the Multi-Frame category. The Super-Resolution algorithm that will be used in this research is part of the Multi-Frame category of the spatial domain. The following sections provide a brief overview of both the Single-Frame and Multi-Frame categories.

2.2.2 Single-Frame

Single-Frame Super-Resolution (SFSR) is part of the spatial domain and uses a single low-resolution image to create a high-resolution image. Most SFSR implementations make use of learning-based algorithms. The learning-based algorithms often make use of a training step to learn the relation between low-resolution images and their corresponding high-resolution image. In the learning-based strategy, classes are used to recover the high-resolution details. The class represents a certain type of content in the image, such as face images. A class consists of a priori known mappings from low-resolution images to high-resolution images. A relation between the low-resolution and high-resolution image can be based on mapping patches inside the image or mapping structures and features. Learning-based SFSR algorithms use the mapping of patches between low-resolution and high-resolution [26]. When features are used instead of patches, the method is part of the reconstruction-based category inside the SFSR algorithms [26]. The SFSR algorithms are also known as hallucination algorithms: details are hallucinated in the high-resolution image after training. This could make SFSR methods based on learning unsuitable for applications where unique identification is necessary.


2.2.3 Multi-Frame

Multi-Frame Super-Resolution (MFSR) is part of the spatial domain and uses multiple low-resolution images to create a high-resolution image. The taxonomy in figure 2.1 shows that the Multi-Frame category can be split into different types of algorithms. A direct MFSR algorithm will be implemented in this research. A general outline of the other approaches inside the Multi-Frame category can be found in [26].

The direct Multi-Frame approach is divided into two sub-categories. The first sub-category is the shift and add methodology, which is performed in two stages:

• Geometric registration
• Image reconstruction

The second sub-category is very similar to shift and add, but combines the steps of shift and add [26]. In this research the shift and add methodology is chosen to investigate the individual parts.

[Figure 2.2: Multi-Frame - Direct - Shift and Add methodology: geometric registration (direct registration or feature-based registration) followed by image reconstruction.]

The shift and add approach is based on aligning and merging the low-resolution images. The goal is to create a high-resolution image that contains more details. This can only be done if the low-resolution images contain different information. The differences are often referred to as sub-pixel shifts when motion is present between low-resolution images. In the geometric registration stage a reference image is chosen from the low-resolution images. All other low-resolution images are registered with respect to the reference. Registration aligns the low-resolution images with the reference.

There are two classes of geometric registration, direct and feature-based registration [34]. The direct approach warps the images relative to the reference and quantifies the correspondence with pixel-to-pixel matching. Error metrics are used to quantify the difference between the images. The goal is to retrieve the transformation that minimises the error function. Feature-based approaches extract features, such as edges or corners, from the images. The features from both the image and the reference are matched. Based on the mapping between the feature points, a transformation is defined from one image to the other.

After geometric registration, the transformation from a low-resolution image to the reference image is acquired. All low-resolution images are processed by the geometric registration stage. The low-resolution images are then warped into the high-resolution frame by scaling and applying the corresponding transformation. The high-resolution frame represents the pixel grid of the high-resolution image. Image reconstruction is performed inside this high-resolution frame.

Image reconstruction merges the registered images into a high-resolution image. The value at the discrete pixel locations is calculated with the pixels of the registered images. Interpolation is a common technique used to calculate the colour value at the pixel locations. Image reconstruction results in a high-resolution image. This high-resolution image will be used to detect and identify the landing location indicated by the recipient.
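As a conceptual illustration of the shift-and-add idea (not the implementation used later in this thesis), the sketch below places low-resolution frames with known sub-pixel shifts onto an upscaled grid and averages them. The known shifts, the nearest-neighbour binning and the factor-two upscaling are simplifying assumptions; chapter 4 uses ECC registration and Inverse Distance Weighting instead.

```python
import numpy as np

def shift_and_add(lr_frames, shifts, scale=2):
    """Toy shift-and-add: place shifted low-resolution frames on an upscaled grid and average."""
    h, w = lr_frames[0].shape
    acc = np.zeros((h * scale, w * scale), dtype=np.float64)  # accumulated intensities
    cnt = np.zeros_like(acc)                                   # number of samples per HR pixel
    ys, xs = np.mgrid[0:h, 0:w]
    for frame, (dy, dx) in zip(lr_frames, shifts):
        # Map low-resolution pixel centres (shifted by the known sub-pixel offset)
        # into the high-resolution grid and round to the nearest HR pixel.
        hy = np.clip(np.rint((ys + dy) * scale).astype(int), 0, h * scale - 1)
        hx = np.clip(np.rint((xs + dx) * scale).astype(int), 0, w * scale - 1)
        np.add.at(acc, (hy, hx), frame)
        np.add.at(cnt, (hy, hx), 1)
    return acc / np.maximum(cnt, 1)  # simple average; unvisited HR pixels remain 0
```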

2.3 Related Work

The aforementioned Super-Resolution class is an area of ongoing research. Recent research has shown that Super-Resolution can be used effectively with cameras inside mobile phones [17] [35]. The tremor of the human hand is exploited to create motion between images. The difference between the low-resolution images makes it possible to perform MFSR. In [17] it is used to improve the dynamic range of the image. There is currently no readily available software that performs MFSR. Both [17] and [35] only describe the idea behind their algorithms, but have not made their implementations publicly available.

Literature research into the autonomous landing systems of UAVs has not led to articles that indicate the use of MFSR techniques. There are a few research articles on the detection and identification of a landing location using some form of marker [1] [29]. All of these papers detect a relatively large marker at close range. However, in the case of delivery to a recipient, the UAV needs to be able to detect and identify its landing location at a high altitude. Some papers use the well-known heli-platform symbol to identify a landing location. In the case of UAV delivery to consumers, the use of a heli-platform symbol is not practical (size) and does not contain any unique features for identification. The delivery requires a way of uniquely identifying the marked landing location. This has led to the research questions as defined in section 1.1.


CHAPTER 3

Design

The implementation and testing of the Multi-Frame Super-Resolution (MFSR) algorithm require test hardware and software. This chapter provides an overview of the design choices made to arrive at a suitable test system. Multiple components are required in the system to convert low-resolution images into a single high-resolution image:

• UAV - An essential part is the UAV itself. The UAV will hover above the marker.

• Camera - The onboard camera of the UAV acquires the images and provides output in both RAW and JPEG format. Multiple images of the same scene will be taken in a short time period.

• Marker - A marker is used to identify the desired landing location. Characteristics of the marker influence the detection and identification performance.

• Computer Vision - An implementation of the MFSR algorithm is made, making optimal use of existing libraries, to prevent development from scratch.

3.1 UAV

A UAV (Unmanned Aerial Vehicle) is an autonomous or remotely controlled aircraft without a human pilot on board. The UAV used in this research is the consumer-grade DJI Mavic Air, shown in figure 3.1a. The DJI Mavic Air has a battery flight time of 21 minutes, which is sufficient for acquiring the images needed in the experiments. The DJI Mavic Air provides autonomous features to protect the aircraft and create photographic shots. Part of the autonomous system is the onboard sensing system used to avoid obstacles. The sensing system is able to detect obstacles in front of, behind and underneath the drone. Whenever the user is not navigating the drone, it automatically tries to hover at its current position, even in (heavy) wind conditions. Being able to steadily hover above the location where a potential marker might be is essential for the camera and the images. Slight movements are permitted, but heavy motion causes motion blur and possibly other artefacts in the image.

Figure 3.1b shows the remote controller. A live feed from the drone shows the user what the camera on board the drone captures. The live feed to the controller will be important to acquire the images used to test the MFSR algorithm.


(a) The DJI Mavic Air drone (b) The remote controller of the DJI Mavic Air

Figure 3.1: DJI equipment used to capture the images.

3.2 Camera

As shown in section 3.1, the DJI Mavic Air is used. Figure 3.1a shows that the front of the DJI Mavic Air features an onboard camera. The camera of the DJI Mavic Air is mounted on a 3-axis gimbal, which allows the camera to remain stable under the motion of the drone. Stabilising the motion of the drone prevents artefacts in the images. The camera can be tilted from straight down (−90°) to slightly pointed upwards (+17°). The ability to tilt the camera allows the drone to track the marker and hover above it. In this research the camera is tilted straight down (−90°).

The camera on board the DJI Mavic Air is used to capture images. Although no video footage will be used in this research, the camera is able to record at a resolution of 3840 by 2160 pixels at 30 frames per second. The camera contains a 12 MegaPixel sensor and has a Field of View (FoV) of 85°. The field of view is measured as the diagonal angle, shown later in figure 3.3. Two different aspect ratios can be used to capture images: the resolution of the images is 4056 by 3040 pixels for a 4:3 aspect ratio and 4056 by 2280 pixels for a 16:9 aspect ratio. Single images and burst shots (3/5/7 images) can be taken. The images taken with the camera can be saved as a JPEG and/or DNG (RAW format) file.

RAW formats originate directly from the image sensor and are unprocessed and therefore lossless. The RAW file size is large, as these files are normally uncompressed. Another benefit of the RAW format is that the original data from the image sensor is not destroyed; only the metadata that controls the rendering is altered. RAW formats are mainly used by photographers to have more freedom in post-processing the image. JPEG is a common file format used to store images. When an image is stored as a JPEG it is compressed to save space. The compression can be altered to adjust the quality of the image. In comparison to the RAW format there is a drawback in editing the image afterwards: editing JPEG images offers less flexibility and can degrade the image to a worse quality than the original when multiple edits are done [18]. In this research the RAW data is utilised instead of the JPEG image. Section 4.1 "Initialisation and Pre-Processing" will elaborate on utilising RAW instead of JPEG formatted images for this research.

3.3 Marker

In general, UAVs use location data from GPS to navigate to their destination. As described before, the goal is to autonomously navigate the UAV using its onboard camera(s) and make it independent of GPS. Computer vision is deployed to use the images from the camera and recognise the landing location. In this research the landing location is indicated with a marker. A marker should be unique, otherwise one landing location cannot be distinguished from another landing location nearby. When a drone delivery system is used for delivering packages to consumers, a marker shall also be small and easy to deploy by the consumer. A consumer should not be bothered with putting large heli-platform-like constructions in their back garden to be able to make use of a drone delivery service. It shall be quick and simple for a consumer to put a marker at the location where the goods they expect for delivery shall be dropped by the drone. To minimise the development time of the system it is desirable to use a proven marker for which software libraries are available. For ease of use, a consumer should be able to print the marker on a consumer printer and put it in a frame or holder provided by the delivery service (which could also be standardised). A unique marker could be printed and put in the frame or holder for each delivery. The marker shall thus be no bigger than A4 size. The goal is that computer vision can use the images from the drone and recognise a relatively small marker from a relatively high altitude.

An area of computer vision that makes use of markers is the field of Augmented Reality (AR). Augmented Reality applications often use markers to orient themselves and show the Augmented Reality animation. There is a wide variety of markers that can be used, each having its own features. A commonly used and well-supported marker is the ArUco marker [12]. Figure 3.2a shows an example of an ArUco marker. An ArUco marker is a black and white block pattern. The pattern can, for example, encode the id of a product. Based on the dimensions of the black and white block grid, multiple variants of the ArUco marker exist. A 7x7 ArUco grid with a two-block-wide border is displayed in figure 3.2a. The ArUco marker offers the ability to calculate the position of the camera relative to the marker, if the dimensions of the marker are known. In this research the ArUco marker is used to evaluate the proposed MFSR algorithm. It will not be investigated whether the ArUco marker is the best possible code for this application; it is a practical and available solution that will work for this test system and which complies with the requirements above.
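To indicate how such a marker is read in software, the sketch below shows a minimal detection call using OpenCV's aruco module (part of the opencv-contrib package). The 7x7 dictionary choice, the detector parameters and the file name are assumptions for illustration; the thesis does not prescribe these exact settings, and newer OpenCV releases expose the same functionality through an ArucoDetector class.

```python
import cv2

# Minimal ArUco detection sketch (assumed settings, legacy cv2.aruco API).
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_7X7_250)  # assumed 7x7 dictionary
parameters = cv2.aruco.DetectorParameters_create()

image = cv2.imread("snippet.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
corners, ids, rejected = cv2.aruco.detectMarkers(image, dictionary, parameters=parameters)

if ids is not None:
    print("Detected marker id(s):", ids.ravel())  # the id uniquely identifies a landing location
```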

Subsection 3.3.1 will estimate at what altitude an A4-sized ArUco marker could be recognised with the application of the MFSR algorithm, to determine whether this design choice is suitable. This requires a calculation of the ground area covered by a pixel, based on the properties of the camera and the altitude of the drone. The number of pixels that register the marker decreases when the UAV flies at a higher altitude, and a minimum number of pixels is needed to recognise an ArUco marker.

(a) A 7x7 ArUco marker (b) ArUco code setup used to acquire images

Figure 3.2: The ArUco marker used in this research.

3.3.1 Pixels

The size of the area captured by the onboard camera of the UAV can be estimated [32]. Height and width of the area captured by a UAV is estimated based on the altitude, Field of View (FoV) and the aspect ratio from the camera of the UAV. Figure 3.3 visualises the geometric calculations from the different perspectives. The following properties of the camera are used to estimate the size of the area captured:

y = distance from the ground
θ = field of view


[Figure 3.3: Visual representation of the geometry involved in the calculation of the area perceived by a camera pointed straight down. (a) Side-view of the UAV and its lines of sight; (b) the area captured by the camera, with sides a and b and diagonal d; (c) the triangle representing the side-view of the FoV at altitude y.]

The diagonal of the image is indicated by d in figure 3.3b. The side-view of the UAV shown in figure 3.3a can be split into the two right-angled triangles of figure 3.3c. Splitting the side-view into two right-angled triangles makes it possible to calculate the length of the diagonal:

d = 2y · tan(θ/2)

Aspect ratio r indicates the relative sizes of the sides a and b of the area captured:

r = b / a,  so  b = ra

The length of a and b can be defined as:

d² = a² + b² = a² + (ra)² = a²(1 + r²)

Therefore,

a² = d² / (1 + r²)

It follows that:

a = d / √(1 + r²)
b = rd / √(1 + r²)

a and b can be calculated as:

a = 2y · tan(θ/2) / √(1 + r²)    (3.1)
b = 2ry · tan(θ/2) / √(1 + r²)    (3.2)


Equations 3.1 and 3.2 make it possible to calculate the sides of the area captured. Based on the altitude, aspect ratio and the FoV, the number of pixels covering an A4 paper can be estimated. An example will show the effects of the different parameters involved.

When a UAV flies at an altitude of 50 metres with an aspect ratio of 16 : 9 and an FoV of 85 degrees, a ≈ 45 metres and b ≈ 80 metres. The number of centimetres covered by each pixel is calculated by dividing the length of a side by the number of pixels covering that side. An image with aspect ratio 16 : 9 and 4056 by 2280 resolution gives approximately 2.0 cm in width and height for each pixel. Standard A4 paper would only be represented by roughly 10 pixels in width and 15 pixels in height. This would not be enough to accurately identify a marker based on a single image.
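The worked example above can be reproduced with a few lines of Python. The helper below follows equations 3.1 and 3.2 and the camera properties given earlier (85° diagonal FoV, 16:9 aspect ratio, 4056 pixels on the long side); the function name and print-out are illustrative only and not part of the thesis code.

```python
import math

def ground_coverage(altitude_m, fov_deg, aspect_ratio):
    """Sides a (short) and b (long) of the ground footprint of a nadir-pointing camera.

    Implements equations 3.1 and 3.2: a = 2y tan(theta/2) / sqrt(1 + r^2), b = r * a.
    """
    d = 2.0 * altitude_m * math.tan(math.radians(fov_deg) / 2.0)  # diagonal of the footprint
    a = d / math.sqrt(1.0 + aspect_ratio ** 2)
    return a, aspect_ratio * a

# Worked example from the text: 50 m altitude, 85 degree diagonal FoV, 16:9 aspect ratio.
a, b = ground_coverage(50.0, 85.0, 16.0 / 9.0)
cm_per_pixel = (b * 100.0) / 4056.0           # long side mapped onto 4056 pixels
print(f"a = {a:.1f} m, b = {b:.1f} m, {cm_per_pixel:.2f} cm per pixel")
print(f"A4 marker: roughly {21.0 / cm_per_pixel:.1f} x {29.7 / cm_per_pixel:.1f} pixels")
```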

Figure 3.4 shows that an ArUco marker loses the necessary details at an altitude of 50 metres. The cropped ArUco marker in figure 3.4b shows that the rough estimate of 10 by 15 pixels for A4 paper at an altitude of 50 metres is a decent estimate.

More details, i.e. more pixels, are required for reading the marker. A higher-resolution camera or zoom lenses could be applied. Note that the resolution of the camera used is already quite high. When drone delivery systems are used commercially, the cost of the equipment will play an important role for the delivery service. Higher-resolution cameras and zoom lenses make the drone more expensive and more complex to operate and maintain. This research seeks to use consumer-grade cameras to keep cost within limits and to demonstrate that acceptable solutions are available based on relatively simple hardware. This research needs to demonstrate that by applying MFSR, the resolution of the marker can be increased such that it is possible to accurately and robustly detect, read and verify the marker at a relatively high altitude. The above calculation shows that an improvement in resolution is necessary for the drone to operate at a reasonable altitude.

(a) The ArUco marker as seen from an altitude of 50 metres (b) Cropped ArUco marker

Figure 3.4: An ArUco marker at an altitude of 50 metres. The left image shows the entire image captured by the drone and location of the marker indicated with a red circle. The right image shows the marker that was indicated by the red circle. Please note that the white shape on the right includes the mount of the marker.

3.4 Computer Vision

Before the MFSR algorithm can be implemented, design choices regarding the platform need to be made. Software Development Kits (SDKs) are available for the drone. However, using such an SDK would make the solution dedicated to the specific drone. Furthermore, learning the SDK and applying it cannot be achieved in the time given for the research project. The decision was made not to tailor the application to the drone, but to develop the Super-Resolution algorithm on the basis of a standard available software environment, to demonstrate that the MFSR algorithm can provide the necessary image resolution improvement. The choice was made to use the Python programming language. Python offers rapid prototyping and a large community. One of the great benefits offered by Python is the many libraries readily available.


Libraries are readily available for computer vision and related techniques. OpenCV [3] is the most popular computer vision library available. Its implementation in C++ is highly optimised. A Python wrapper is available which allows the use of the high-performance functionality in a fast prototyping environment. For this reason, OpenCV is chosen as the preferred library to perform the necessary functionality.

Other libraries besides OpenCV will also be used. Most notable are the Numerical Python (NumPy) [16] and RawPy libraries. NumPy is used for its mathematical and statistical functionality. The RawPy library is a Python wrapper around the LibRaw library and is used to convert RAW images to an RGB format.

OpenCV organisation: https://opencv.org/
NumPy organisation: https://numpy.org/
RawPy documentation: https://letmaik.github.io/rawpy/api/
LibRaw organisation: https://www.libraw.org/


CHAPTER 4

Implementation

This chapter provides an overview of the techniques used by the Multi-Frame Super-Resolution (MFSR) algorithm to create a high-resolution image. First, the initialisation and pre-processing stage of the implementation is discussed. In this stage the data is prepared for the main part of the MFSR algorithm. Then the MFSR algorithm is explored in more detail. A motion model is defined that is used in geometric registration, followed by the geometric registration of the low-resolution images and the reconstruction of a high-resolution image. The chapter concludes with the marker identification algorithm that will be used to analyse the performance and usability of the MFSR algorithm. A flow chart displaying the overall structure of the system is shown in figure 4.1.

[Figure 4.1: Outline of the proposed Multi-Frame Super-Resolution (MFSR) algorithm. Initialisation and pre-processing (convert RAW to RGB, extract sub-image) produces the reference image and the low-resolution images; geometric registration (initial estimation, geometric registration, transform into the high-resolution frame) is followed by image reconstruction (Inverse Distance Weighting interpolation, high-resolution reconstruction, sharpening) and marker identification (erosion, ArUco identification).]

4.1 Initialisation and Pre-Processing

In the initialisation and pre-processing stage, all data is prepared for the main MFSR algorithm. Preparation of the low-resolution images starts with the conversion of RAW sensor data to an RGB format which is supported by the applied software libraries. Only a small part of the image containing the marker is used as input for the main MFSR algorithm. In this section, the steps required in the pre-processing stage are discussed.

4.1.1 Demosaicing

The UAV described in section 3.1 provides both a RAW image and a JPEG image. A JPEG image encodes the RGB values of each pixel. RAW images contain the image sensor data. Inside the image sensor are arrays of individual sensors, the pixels. Between the image sensor and the lens is a Colour Filter Array (CFA). This CFA controls which colour is captured by the individual sensors. Multiple layouts of CFAs exist, each with different colours and patterns. Figure 4.2 illustrates the commonly used Bayer Colour Filter Array. Inside the Bayer filter pattern, half of the sensors capture green and only a quarter of the sensors capture red and blue. A higher sampling rate for the green colour is chosen on purpose. The human eye is more sensitive to the colour green. Brightness is also approximated well with the colour green [19].

(a) Bayer Colour Filter Array [4] (b) Sensor in Bayer Colour Filter Array [5]

Figure 4.2: The Bayer grid commonly used in Colour Filter Arrays (CFAs).

Most computer vision applications and libraries cannot process the Bayer filter pattern but require an RGB pattern. In this RGB pattern, each sample point has a Red, Green and Blue value. A RAW image does not have an RGB value for each sample point. Converting the colour channels in RAW images to an RGB pattern is called demosaicing. Various strategies for demosaicing can be deployed. Simple interpolation often causes colour artefacts at object borders. Adaptive Homogeneity-Directed (AHD) [19] has become a commonly used demosaicing algorithm in the industry. AHD will be used to convert the RAW image data to an RGB format that can be used in the MFSR algorithm.

In this implementation the RAW data is converted instead of using the JPEG provided by the drone. The main reason for converting RAW images is to prevent loss of detail. There are multiple reasons behind the loss of detail in the JPEG images provided by the drone. Firstly, JPEG is a lossy compression method to store images. Sharp transitions are often discarded. Losing sharp transitions could be detrimental for the reconstruction of an ArUco code. Using the RAW image makes it possible to utilise the details that are part of the scene. A second reason is the market orientation of the drone. The DJI Mavic Air is a consumer-oriented drone. As a result of this, the JPEG images are focused on making an image that appeals to the human eye. This means that the green colours are elevated and further assumptions about the scene are made to improve the visual quality. In the MFSR algorithm, visual appeal is not important. Reconstruction of the ArUco marker requires the details as they are measured by the image sensor (without adjustments). Figures 4.3a and 4.3b show the JPEG image generated by the drone.

Converting a RAW image to an RGB format can be done in many ways, even when the demosaicing strategy is chosen, as there are many characteristics that can be tuned. Demosaicing uses variables that can be tweaked to fit the environment in which the image is captured. Generic settings [14] as shown in figures 4.3c and 4.3d already show improvement when compared to the JPEG from the drone in figures 4.3a and 4.3b. For reading the marker it is important to have a high contrast between black and white. This can be achieved by optimising the gamma settings in the demosaicing algorithm. An important change is made to the generic settings concerning the gamma values. Gamma settings influence the luminance, more commonly known as brightness. More information about gamma correction can be found in [7] [25] [30]. The gamma settings are lowered to improve the contrast between white and black inside the marker. Black areas inside the marker are darker in tone when compared to the JPEG of the drone and conversion with generic settings. As a result of lower gamma settings, the image shown in figures 4.3e and 4.3f appears dark.

Figure 4.3 compares the JPEG image generated by the drone with the images converted from the RAW image. It is immediately noticeable that the JPEG from the drone has warm colours and the green colours are elevated to appeal to the human eye. Details of the marker, such as sharp transitions, are lost. Figures 4.3c and 4.3d show that converting from a RAW file with generic settings can make white less prominent and contain less fading. Contrast and darker tones are more prominent when using lower gamma values, as shown in figures 4.3e and 4.3f.

(a) JPEG image as provided by the drone (b) Marker as provided by the drone

(c) RGB image converted from a RAW file using generic settings

(d) Marker as converted from a RAW file using generic settings

(e) RGB image converted from a RAW file using lower gamma settings

(f) Marker as converted from a RAW file using lower gamma settings

Figure 4.3: A comparison of the JPEG image generated by the drone and the images converted from RAW. The conversions from RAW are performed using general settings and lower gamma settings. Each row represents a different setting of conversion parameters. The left image in each row shows the entire image and the location of the marker indicated with a red circle. The right image in each row shows the marker that was indicated by the red circle.
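As an illustration of the RAW-to-RGB conversion described in this section, a minimal RawPy call could look as follows. The file name and the exact gamma pair are assumptions made for this sketch; only the use of AHD demosaicing and lowered gamma settings follows from the text above, and the actual parameter values used in this research are not repeated here.

```python
import rawpy

# Minimal sketch of the RAW-to-RGB conversion (assumed file name and gamma values).
with rawpy.imread("frame_0001.dng") as raw:  # hypothetical DNG file from the drone
    rgb = raw.postprocess(
        demosaic_algorithm=rawpy.DemosaicAlgorithm.AHD,  # AHD demosaicing, as in section 4.1.1
        gamma=(1.0, 1.0),        # illustrative "lower gamma" choice; default is (2.222, 4.5)
        no_auto_bright=True,     # keep the measured sensor values, no automatic brightening
        use_camera_wb=True,      # white balance as recorded by the camera
        output_bps=8,            # 8 bits per channel
    )
# rgb is now an (H, W, 3) RGB uint8 array that NumPy and OpenCV can process further.
```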

4.1.2 Snippet

For the MFSR algorithm a large number of low-resolution images of the same object are merged into a single high-resolution image. Conversion of the RAW image to an RGB format makes it possible to process the image. However, not the entire image is needed. Only a small part containing the marker is of interest. Processing the entire image would significantly increase the computational costs and time. A small image (snippet) containing the marker is used as input to the MFSR algorithm. An example of such a snippet is shown in figure 4.3f. It is assumed that the MFSR algorithm has a function that can extract small images containing the marker from the low-resolution images. Automated marker snippet recognition has not been implemented in this research.

Another benefit of only using a small area inside the image is the prevention of artefacts. For merging the images a so-called motion model is used. A motion model combines images based on the type of motion expected between images. The next section describes a motion model that defines the type of motion that can occur between low-resolution images. Artefacts occur in the reconstruction when there is motion between the images that does not adhere to the motion model. For example, artefacts may appear when rotation occurs, while the motion model only accounts for translation. In this case, registration will fail, and artefacts appear in the reconstruction.

4.2 Motion Model

An MFSR algorithm combines multiple low-resolution images. The MFSR algorithm starts with geometric registration. Geometric registration finds the best fit between multiple images. A motion model is used in geometric registration to define the relation (motion) between the low-resolution images. Multiple motion parameters characterise the motion model. Geometric registration uses the motion parameters to transform all low-resolution images with respect to a common frame. The high-resolution image is reconstructed inside the common frame.

4.2.1 Geometric Parameters

A motion model describes the parameters that define the relation between images. One of the snippets is chosen as a reference and all other snippets are registered with respect to it. The low-resolution snippets are aligned relative to the reference with their respective motion parameters. After the alignment, all snippets are in a common frame where the high-resolution image will be formed. There are two categories of motion models that can be identified [36]: global and local. Global motion models define a set of parameters that represents the motion in the entire image. Local motion models consider the image as a set of sub-images. Each sub-image has a local set of parameters that map it to a corresponding sub-image of another image.

The motion model will define the relation between the snippets described in the previous section on pre-processing. The marker has a fixed location, is not moving and does not change shape. The low-resolution images are taken from almost the same drone location, with the drone only moving slightly; it is therefore assumed that no local motion is present between the low-resolution snippets, and a global motion model is more suitable. Motion models can represent different types of motion, from simple translation to a perspective transformation [34]. Motion between the low-resolution snippets is modelled as an affine transformation in this research. The affine transformation describes the relation between the low-resolution snippets with translation, rotation, scaling and shearing. Lines and parallelism are preserved in this motion model. The slight motion of the drone from a high altitude capturing a fixed marker makes it possible to use the affine transformation instead of a more complex model. Figure 4.4 illustrates the different types of motion that are included in the affine transformation.


(a) Translation of an object (b) Rotation of an object

(c) Scaling of an object (d) Shearing of an object

Figure 4.4: The geometric transformations involved in the affine geometric registration model. The transformation applied on the blue square results in the red shape.

The geometric registration used in this research will apply the affine motion model to define the transformation between images. This transformation will warp the low-resolution snippets into a common frame where the high-resolution image will be formed. A mathematical representation of the affine motion model needs to be defined before the parameters of the motion between snippets can be calculated.

The mathematical representation that defines the affine motion model uses a coordinate system described by an x, y and z axis. The x and y axes span the ground plane. On this ground plane is the stationary marker indicating the landing location. The altitude of the UAV is represented by the z axis. Motion between snippets is caused by motion of the UAV and its camera. The motion between snippets can be represented by an affine transformation of the x and y plane inside the coordinate system. For example, movement of the UAV parallel to the ground plane (without rotation) is visible as a translation along the x and y axes.

The affine transformation is used as a global motion model to define the relation between two images. This relation maps a point (x, y) from one image to (x', y') in the other image. Affine transformations from (x, y) to (x', y') can be described in matrix form as [2]:

    [x']   [a  b  c] [x]
    [y'] = [d  e  f] [y]    (4.1)
    [1 ]   [0  0  1] [1]

In this matrix equation, (x, y) denotes the coordinate inside the input image and (x', y') denotes the corresponding coordinate in the other image. The equations for x' and y' can be written as:

    x' = ax + by + c    (4.2)
    y' = dx + ey + f    (4.3)

The point correspondences between points (x_i, y_i) and (x'_i, y'_i) can be written in matrix form:

    [x'_i]   [x_i  y_i  1   0    0   0]
    [y'_i] = [ 0    0   0  x_i  y_i  1] · (a, b, c, d, e, f)^T    (4.4)

    q = M p    (4.5)


The affine transformation defines the mapping from a point (x, y) inside a snippet to (x', y') inside the reference. When all snippets are registered with respect to the reference snippet, they will be aligned inside a common frame where the high-resolution image is created. The parameters of the affine transformation are defined in p. This means that aligning snippets in a common frame is reduced to a parameter estimation problem. The next section on geometric registration will discuss how the parameters in p are approximated.
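Viewed as the parameter estimation problem of equations 4.4 and 4.5, the six affine parameters can be recovered from point correspondences by stacking them into q = Mp and solving in a least-squares sense. The sketch below uses synthetic correspondences purely to illustrate this formulation; the registration in section 4.3 estimates p iteratively with the ECC algorithm rather than from explicit point pairs.

```python
import numpy as np

# Illustration of q = M p (equations 4.4 and 4.5) with synthetic point correspondences.
src = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)           # (x_i, y_i)
dst = np.array([[1, 2], [11, 2.5], [0.5, 12], [10.5, 12.5]], dtype=float)   # (x'_i, y'_i)

rows = []
for x, y in src:
    rows.append([x, y, 1, 0, 0, 0])   # row producing x'_i
    rows.append([0, 0, 0, x, y, 1])   # row producing y'_i
M = np.array(rows)
q = dst.reshape(-1)                   # (x'_1, y'_1, x'_2, y'_2, ...)

p, *_ = np.linalg.lstsq(M, q, rcond=None)   # p = (a, b, c, d, e, f)
affine = np.array([[p[0], p[1], p[2]],
                   [p[3], p[4], p[5]],
                   [0.0,  0.0,  1.0]])      # 3x3 matrix of equation 4.1
print(affine)
```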

4.3 Geometric Registration

The first step of the MFSR algorithm is the geometric registration of low-resolution images. The motion model as described in section 4.2 will be used to align images in a common frame. A geometric transformation between an image and the reference image is estimated. This section provides a detailed overview of the geometric registration deployed.

4.3.1 Direct Registration

Geometric registration is used to estimate, with sub-pixel accuracy, the affine transformation between two images captured by the UAV. With the affine estimation and sub-pixel accuracy, the pixel colours of the high-resolution image can be estimated.

Section 2.2.3 on the background of the MFSR algorithm describes that there are two classes of geometric registration: direct and feature-based. Direct geometric registration can offer iterative convergence to an optimal estimation [34]. With the convergence to an optimal solution, direct geometric registration can achieve the sub-pixel accuracy that is needed to increase the resolution in the image reconstruction phase. Direct methods can have a higher computational complexity when a pixel-wise comparison is used, as described in literature [34] [36]. However, by using snippets instead of the entire image, as discussed in the pre-processing stage (section 4.1.2), the increased computational complexity is reduced. The pixel-wise comparisons make use of an error metric that quantifies the performance of the registration. Feature-based approaches often rely on detecting many features inside the image to be accurate [36]. Since the small low-resolution snippets used in this MFSR algorithm are not feature-rich, the feature-based approach is not well suited; a direct approach is therefore chosen to make use of the convergence to an optimal estimation and the sub-pixel accuracy.

A well-known direct geometric registration algorithm is the iterative Lucas-Kanade method by Bruce D. Lucas and Takeo Kanade [23]. Many variations have been made based on the original algorithm, improving the computational complexity and the alignment performance. The geometric registration methodology chosen in this research is of the same category as the Lucas-Kanade method [23]. It is called "Parametric Image Alignment Using Enhanced Correlation Coefficient Maximization", or ECC for short [11]. The choice for ECC is made based on its properties.

The ECC algorithm has fast convergence to an optimal solution. In [11] it is reported that the ECC algorithm converges faster to a slightly better estimation in comparison to iterative Lucas-Kanade and other methods. Faster convergence implies that the ECC algorithm requires fewer iterations to achieve an equal or better estimation when compared to the Lucas-Kanade method.

Another reason for choosing the ECC algorithm is the support offered by OpenCV. The computer vision library OpenCV provides the functionality to perform iterations of the ECC algorithm. Support by OpenCV facilitates the connection of the geometric registration with the other components of the MFSR algorithm, which primarily make use of OpenCV. Using ECC, an estimation is made of the affine transformation between an image and the reference image. This estimation is used in the image reconstruction of the MFSR algorithm.

4.3.2 ECC Image Alignment

This section provides a brief overview of the ECC algorithm from [11]. The ECC algorithm is a direct registration technique that uses gradient descent to estimate an optimal solution. Low computational costs make gradient descent approaches well suited to computer vision applications [11].

As explained in section 4.3.1 about direct geometric registration methods, error metrics are used to quantify the performance of the transformation with parameters p. The criterion in equation 4.6 is used by the ECC algorithm.

E_ECC(p) = ‖ ī_r/‖ī_r‖ − ī_w(p)/‖ī_w(p)‖ ‖²    (4.6)

The reference image is represented by i_r. The registered image warped with parameters p is denoted by i_w(p). The zero-mean versions of the reference image and the warped image are ī_r and ī_w(p) respectively. Criterion E_ECC is invariant to photometric distortions in contrast and/or brightness [11]. The optimal parameter values are computed by minimising the performance criterion E_ECC. Minimising E_ECC is the same as maximising the enhanced correlation coefficient in equation 4.7 [11].

ρ(p) = ī_r^T ī_w(p) / (‖ī_r‖ ‖ī_w(p)‖) = î_r^T ī_w(p) / ‖ī_w(p)‖    (4.7)

In equation 4.7 of the enhanced correlation coefficient, î_r = ī_r/‖ī_r‖ denotes the normalised version of the zero-mean reference vector, which is constant. Gradient descent is used to maximise ρ(p). To iteratively improve the estimate of p, gradient descent uses a parameter update rule. The update rule is defined as p = p̃ + ∆p, where p̃ is a nominal parameter vector and ∆p is a vector of perturbations. The coordinates of an image registered to the reference under the nominal parameter vector are ỹ = φ(x; p̃). With the perturbed parameter vector the warped coordinates are y = φ(x; p). A first order Taylor expansion with respect to the parameters is applied to the intensity of the warped image at coordinates y in equation 4.8 [11].

I_w(y) ≈ I_w(ỹ) + [∇_y I_w(ỹ)]^T (∂φ(x; p̃)/∂p) ∆p    (4.8)

Equation 4.8 can be applied for all coordinates, resulting in the image intensity vector warped with parameters p [11]:

ī_w(p) ≈ ī_w(p̃) + G(p̃)∆p    (4.9)

In equation 4.9 the Jacobian matrix (first-order partial derivatives) of the warped intensity vector with respect to the parameters is written as G(p̃). With equation 4.9 an approximation of equation 4.7 for the enhanced correlation coefficient can be made with the nominal parameter vector and its perturbations:

ρ(p) ≈ ρ(∆p | p̃) = î_r^T (ī_w(p̃) + G(p̃)∆p) / ‖ī_w(p̃) + G(p̃)∆p‖    (4.10)

Both G(p̃) and ∆p can be computed as shown in [11].

The algorithm to iteratively estimate the parameter vector is called the Forward Additive ECC iterative algorithm, as shown in table 4.1. With an estimate p_{j−1} it can compute ī_w(p_{j−1}) and G(p_{j−1}). An approximation of ρ(p) is made with ρ(∆p_j | p_{j−1}). The parameter update rule p_j = p_{j−1} + ∆p_j follows. The gradient descent approach of ECC iteratively updates the parameter vector to minimise the error metric. By minimising the error metric, the parameters of the affine transformation between an image and the reference are estimated. This affine transformation will be used to align the snippets in the common frame where the high-resolution image is formed.


Initialisation
  Use the reference image I_r to compute the zero-mean normalised vector î_r.
  Initialise p_0 and set j = 1.

Iteration Steps
  S1: Using φ(x; p_{j−1}), warp I_w and compute its zero-mean counterpart vector ī_w(p_{j−1}).
  S2: Using φ(x; p_{j−1}), compute the Jacobian G(p_{j−1}).
  S3: Compute the perturbations ∆p_j.
  S4: Update p_j = p_{j−1} + ∆p_j. If ‖∆p_j‖ ≥ T then j++ and go to S1; else stop.

Table 4.1: Outline of the forward additive ECC algorithm [11]

Figure 4.5 shows a visualisation of the iterations performed by the ECC algorithm. The reference image is warped with a random small affine transformation in figure 4.5b. Iterations of the estimated affine transformation by the ECC algorithm are displayed. Figure 4.5 shows that the ECC algorithm converges to an estimation of the affine transformation to the reference image.

(a) Reference image (b) Affine transformation applied to the reference image (c) 5 iterations of the ECC algorithm (d) 10 iterations of the ECC algorithm (e) 20 iterations of the ECC algorithm (f) 30 iterations of the ECC algorithm

Figure 4.5: Visualisation of the iterations by the ECC algorithm. After 30 iterations, the estimation is a close approximation of the reference image.

4.3.3 Affine Estimation

This section will briefly discuss the development of the affine estimation strategy deployed. ECC image alignment is used to estimate the affine transformation between a snippet and the reference snippet. With the estimated affine transformations all snippets are aligned in the common frame where the high-resolution image is created. While prototyping the geometric registration stage, multiple strategies were investigated. The first attempt was a straightforward approach where the ECC algorithm was used directly with the snippets and the reference. Occasionally it was able to estimate the affine transformation with a correlation coefficient of 0.95 and above, but registration of multiple images was very inconsistent and mostly resulted in low correlation coefficients. This strategy was not satisfactory and a different strategy was required to accurately estimate the affine transformation with the desired sub-pixel accuracy.

Other strategies were explored to mitigate the inconsistency and low correlation coefficients. Increasing the number of iterations did not have the desired effect: a higher number of iterations significantly increases the computational cost without a comparable increase in the correlation coefficient, and the inconsistency also remained present. Some estimations required many more iterations to achieve a high correlation coefficient (above 0.95). At first glance, the conclusion would be that the number of iterations needs to be increased even further. However, the increase in computational cost outweighs the slight increase in the correlation coefficient.

A different strategy was deployed to address the inconsistency and the unsatisfactory correlation coefficients. An initial estimate of the affine transformation can be given as input to the ECC algorithm. Multiple techniques can be utilised to obtain this initial estimate. Performing a couple of ECC iterations does not suffice; that is the same as performing more iterations without an initial estimate. Other geometric registration algorithms have successfully used a Gaussian pyramid to improve the convergence towards a high correlation coefficient [34] [36]. A Gaussian pyramid, as explained below, contains multiple levels of blur. The top level is a heavily blurred version of the image, and continuing down the pyramid decreases the amount of blur. Approaches that make use of such a pyramid to improve performance are often called coarse-to-fine approaches [36].

At first, a small Gaussian kernel (3x3) was used on the images, and the initial estimate was made with a Euclidean motion model (translation and rotation) on the slightly blurred images. Experiments showed that the Euclidean motion model is not well suited for an initial estimate: a slight shear or scale change between two images resulted in worse performance than the ECC algorithm without an initial estimate. The affine motion model is therefore also used for the initial estimate. Experiments further showed that a bigger Gaussian kernel with the same number of iterations increased the final correlation coefficient. The ECC algorithm converges faster to an initial estimate when fewer details are present while the general structure remains.

Figure 4.6: The Gaussian blur used in the initial estimation of the affine transformation. Panels: (a) greyscale image; (b)–(d) the same image blurred with Gaussian kernels of size 3, 9 and 21.

In the final implementation of the geometric registration, the initial estimate is based on blurred versions of the images. The blurred images are made with a 21x21 Gaussian kernel to reduce the amount of detail present in the original image. An initial estimate with blurred images allows the geometric registration to quickly approximate the affine transformation. After the initial estimate, more ECC iterations follow with the original images to further improve the affine estimation, as sketched below. Geometric registration uses the affine estimation to transform all low-resolution images into the common frame. Figure 4.7 shows that geometric registration of the snippets results in a non-uniform grid containing sub-pixel shifts; a non-uniform grid contains values at non-discrete locations. This non-uniform grid is used in the image reconstruction stage to create the high-resolution image.
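A minimal sketch of this coarse-to-fine strategy, reusing the hypothetical ecc_affine helper from the sketch in section 4.3.2 (the 21x21 kernel follows the description above; the iteration counts are placeholders, not the values used in the experiments):

```python
import cv2

def register_snippet(reference, snippet):
    """Coarse-to-fine affine estimation: first on heavily blurred copies,
    then refined on the original images (illustrative only)."""
    blur_ref = cv2.GaussianBlur(reference, (21, 21), 0)
    blur_snip = cv2.GaussianBlur(snippet, (21, 21), 0)

    # Coarse stage: affine ECC on the blurred images, starting from identity.
    _, coarse_warp = ecc_affine(blur_ref, blur_snip, iterations=50)

    # Fine stage: continue on the originals, initialised with the coarse estimate.
    rho, warp = ecc_affine(reference, snippet, iterations=100, init_warp=coarse_warp)
    return rho, warp
```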


Figure 4.7: Registration of pixels from low-resolution images in a high-resolution grid. Panels: (a) pixels of the reference image in a part of the high-resolution grid; (b) part of the high-resolution grid with registered pixels. A resolution increase of two is applied and only a part of the high-resolution grid is displayed. The big blue dots represent the reference pixels and the smaller coloured dots represent the registered pixels. Pixel locations of the high-resolution image are indicated with the x and y axes.

4.4 Image Reconstruction

After geometric registration of the low-resolution images, image reconstruction is performed. Image reconstruction will use the information present in the snippets to create a high-resolution image. This section provides an overview of the techniques used in the image reconstruction stage.

4.4.1 Methodology

Image registration aligns the low-resolution images in a common frame. As shown in section 4.3, geometric registration results in a non-uniform grid. The registered pixels of the low-resolution snippets show the sub-pixel shifts between the snippets when aligned in the non-uniform grid. For this reason, image reconstruction makes use of non-uniform interpolation. The non-uniform interpolation is used to convert the non-uniform grid to a uniform grid representing the high-resolution image. Non-uniform interpolation can be performed in different ways and by making use of different properties of the image data. Non-uniform interpolation ranges from simple nearest neighbour interpolation to interpolation based on triangulation [8] [28].

Two strategies were considered: Delaunay triangulation [8] with barycentric coordinates [25], and Inverse Distance Weighting (IDW). An example of Delaunay triangulation performed on a non-uniform grid created by the registration of images is shown in figure 4.8. Colour values at the discrete pixel locations of the high-resolution grid are calculated with barycentric interpolation. Barycentric interpolation is performed within the triangle that encloses the discrete pixel location, so the resulting pixel value is based on the three vertices of the enclosing triangle. An advantage of using Delaunay triangulation is that it is a relatively fast way to convert the non-uniform grid to a uniform grid. Please note that, according to the properties of the Delaunay triangulation, the three vertices of the enclosing triangle are not necessarily the three closest registered pixels. This can be seen in figure 4.8b.

In the case of Inverse Distance Weighting (IDW), the closest registered pixels are used. A predefined radius defines the maximum distance from a discrete pixel location. Only the registered pixels with a distance equal to or lower than the radius contribute to the resulting colour value. IDW is a form of weighted averaging. In the IDW approach, the inverse of the distance to the discrete pixel is used as a weight. Registered pixels further away have a lower weight and contribute less to the final pixel value. An advantage of the IDW approach is that more than three pixels can be used and the pixels involved are always the closest.


The IDW approach is favoured in this research. This choice is based on the properties of the ArUco marker, which has sharp edges inside its internal black and white grid. Consider a pixel on the side of a black cell near an edge with a neighbouring white cell. Both the Delaunay triangulation and the IDW approach will result in averaging, but Delaunay triangulation is more susceptible to positional errors and pixel value errors, because it only uses three pixels instead of a neighbourhood of pixels. Outliers also have a larger impact on the Delaunay triangulation than on the IDW approach: Delaunay triangulation can, for example, produce an incorrect pixel colour when a single white pixel lies in a neighbourhood full of black pixels, whereas with IDW outliers hardly influence the resulting colour as long as there are enough correct pixels around. Delaunay triangulation also results in grainy images, because the barycentric interpolation is not smooth across the transitions between triangles. By using IDW the resulting image has more consistent black and white areas. A sketch of the Delaunay alternative is given after figure 4.8.

Figure 4.8: Delaunay triangulation of the registered pixels in a higher-resolution grid. Panels: (a) registered pixels in a part of the high-resolution grid; (b) Delaunay triangulation of the registered pixels in a part of the high-resolution grid. A resolution increase of two is applied and only a part of the high-resolution grid is displayed. The big blue dots display the pixels of the reference low-resolution image and the smaller coloured dots represent the registered pixels of the remaining low-resolution images. Pixel locations of the high-resolution image are indicated with the x and y axes.
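For reference, the Delaunay/barycentric alternative discussed above can be prototyped with SciPy's LinearNDInterpolator, which triangulates the scattered points and interpolates with barycentric weights inside each triangle. The arrays coords and values below are toy stand-ins for the registered pixel positions and grey values produced by the registration stage:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Toy data standing in for the output of the geometric registration stage.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 16, size=(200, 2))   # (N, 2) registered (x, y) positions
values = rng.uniform(0, 255, size=200)       # (N,) grey values at those positions

interp = LinearNDInterpolator(coords, values)        # Delaunay + barycentric weights
xs, ys = np.meshgrid(np.arange(16), np.arange(16))   # discrete HR pixel locations
hr = interp(xs, ys)                                  # NaN outside the convex hull
```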

4.4.2 Inverse Distance Weighting

After geometric registration, all pixels of the snippets are placed in a high-resolution non-uniform grid. Inverse Distance Weighting is performed in the high-resolution frame g to convert the non-uniform grid to a uniform grid. Transforming the non-uniform grid to a uniform grid assigns a colour value to every discrete pixel location (i, j) in the regular grid. A radius r defines the maximum distance at which a registered pixel can influence the colour of a discrete pixel location. Another way of viewing this is by drawing a circle with radius r and its centre at a discrete pixel location. All registered pixels (x, y) for which the condition in equation 4.12 holds contribute to the pixel value at position (i, j):

$$d_{i,j}(x, y) = (x - i)^2 + (y - j)^2 \qquad (4.11)$$

$$d_{i,j}(x, y) \leq r \qquad (4.12)$$

Inverse Distance Weighting is a variation of the weighted averaging approach. The weights used in IDW are based on the distance $d_{i,j}(x, y)$ from the discrete pixel to the registered pixel. Weights control the contribution to the resulting pixel value. Weights $w_{i,j}(x, y)$ are assigned by taking the inverse of the distance between the discrete pixel location $(i, j)$ and the registered pixel $(x, y)$:

$$w_{i,j}(x, y) = \left(\frac{1}{d_{i,j}(x, y)}\right)^p \qquad (4.13)$$


Registered pixels closer to the discrete pixel location are assigned a higher weight. The fall-off of the weights with distance can be made steeper by raising the inverse distance to a power p > 1. A higher power benefits the scenario of the ArUco marker at the edges of its internal black and white grid: outliers and registered pixels further away have little influence on the final result.

The pixel value $g(i, j)$ of the discrete pixel $(i, j)$ is calculated as a weighted average. All registered pixels within the radius are multiplied by their corresponding weight and summed up. This sum is divided by the sum of the weights:

$$g(i, j) = \frac{\sum_k f(k) \cdot w_{i,j}(k)}{\sum_k w_{i,j}(k)} \qquad (4.14)$$

Please note that $k$ represents the non-uniform locations $(x, y)$ of the registered pixels and $\sum_k$ represents the sum over all registered pixels within the radius. Figure 4.9 shows examples of high-resolution reconstructions from multiple low-resolution images.

Figure 4.9: Examples of high-resolution reconstructions at different altitudes. Panels: low-resolution input and corresponding reconstruction at 5, 15, 25 and 35 meters.
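A direct, unoptimised sketch of equations 4.11–4.14 is given below. The arrays coords and values again stand in for the output of the geometric registration stage, and the radius and power are placeholders rather than the values used in the experiments:

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_reconstruct(coords, values, shape, radius=1.5, p=2.0):
    """Inverse Distance Weighting of registered pixels onto a uniform grid.

    coords : (N, 2) registered (x, y) positions in the high-resolution frame
    values : (N,)  grey values at those positions
    shape  : (height, width) of the high-resolution image
    """
    tree = cKDTree(coords)
    hr = np.zeros(shape, dtype=np.float64)
    for y in range(shape[0]):
        for x in range(shape[1]):
            idx = tree.query_ball_point([x, y], radius)      # registered pixels near (x, y)
            if not idx:
                continue                                     # no samples within the radius
            d = np.sum((coords[idx] - [x, y]) ** 2, axis=1)  # squared distance, as in eq. 4.11
            w = 1.0 / np.maximum(d, 1e-12) ** p              # inverse-distance weights (eq. 4.13)
            hr[y, x] = np.sum(w * values[idx]) / np.sum(w)   # weighted average (eq. 4.14)
    return hr
```

Note that query_ball_point uses a Euclidean radius, whereas equation 4.12 bounds the squared distance; the two only differ by a square root on the chosen threshold.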

4.4.3 Sharpening

Inverse Distance Weighting results in a discrete grid of interpolated values representing a high-resolution image. However, when looking at the image it appears blurred. The final step in creating a high-resolution image is sharpening the reconstructed image. Sharpening is used to improve the detection and identification of the ArUco marker, not the general appearance of the image.

The blurred appearance is indirectly caused by the image sensor inside the camera. An image is formed inside the image sensor. Capturing an image can be seen as sampling a continuous function. Pixels represent the samples and are the building blocks of an image. A common misconception is that a pixel is a little square, as seen when zooming in. A pixel is a point sample and an image consists of point samples [13]. Capturing images inside the image sensor is different from the theoretical single point sampling. In theory, a sample is taken from a single point with no area. In the image sensor, the individual sensors capture light from a small area. The resulting sample is an average from the light captured at the individual sensor [6] [22]. When the sensor captures light from both a black and white area, blurring occurs due to averaging.
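The effect can be illustrated with a one-dimensional toy example: point sampling preserves a hard black/white edge, whereas averaging over a small area, as the sensor elements do, introduces intermediate grey values:

```python
import numpy as np

# A hard black/white edge sampled two ways: point sampling keeps the hard
# transition, averaging over a "sensor area" produces an in-between grey.
signal = np.repeat([0.0, 255.0], [6, 10])      # 6 black samples, then 10 white

point_sampled = signal[::4]                    # point samples: [0, 0, 255, 255]
area_averaged = signal.reshape(-1, 4).mean(1)  # block means:   [0, 127.5, 255, 255]
```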

The blurring shall be resolved to improve the sharpness and contrast between black and white inside the ArUco marker. Improving the sharpness and contrast will benefit the detection and identification of the ArUco marker. Figure 4.10c shows the image used in the MFSR algorithm. Blurring caused by averaging is noticeable when it is compared with a synthetically downsampled image in figure 4.10a. The downsampled images are not blurred, because they are created from point samples of a high-resolution image without blur; the averaging that takes place in the image sensor does not occur in the downsampling. Figure 4.10d shows that the current state of the MFSR algorithm produces a blurry version of the marker, whereas the same process applied to the downsampled images produces a much sharper image, as shown in figure 4.10b. The final step in the MFSR algorithm is therefore sharpening the image.

Figure 4.10: Comparison between downsampled images and real images. Panels: (a) downsampled from a high-resolution image; (b) super-resolution of the downsampled images; (c) marker captured at 25 meters altitude; (d) super-resolution of the marker captured at 25 meters altitude.

A sharpening filter shall remove the fading between black and white inside the image of the ArUco marker. Due to averaging inside the image sensor, the white areas appear thicker than they are in reality. At first, the morphological erosion filter was tried to correct this averaging effect, but erosion was not the solution, because the fade remained. However, a morphological filter (or a combination thereof) can be used to sharpen the image [21] [31].

In [21] and [31] it is shown that the morphological erosion and dilation filters can be combined into a sharpening filter. Morphological filters use a structuring element, which defines how the neighbourhood around a pixel is used in the filter. A commonly used structuring element is a square kernel in which every pixel of the neighbourhood has the same weight. Both erosion and dilation are applied to a greyscale image using a square structuring element. Erosion replaces a pixel's grey value with the minimum grey value in its neighbourhood [31]. The morphological erosion operator is defined as:

$$F_{\ominus}(x, \rho) = (f \ominus c_\rho)(x) \qquad (4.15)$$

Dilation replaces a pixel's grey value with the maximum grey value in its neighbourhood [31]. The morphological dilation operator is defined as:

$$F_{\oplus}(x, \rho) = (f \oplus c_\rho)(x) \qquad (4.16)$$

In both equations $f(x)$ is the original function and $x$ is a pixel location. The structuring element defining the neighbourhood is $c_\rho(x)$, where $\rho$ indicates the scale. If the structuring element is of scale 0 the following holds:

$$F(x, 0) = F_{\ominus}(x, 0) = F_{\oplus}(x, 0) = f(x) \qquad (4.17)$$

Following the definition of [21] and [31], an image sharpening filter $\varepsilon$ can be constructed using dilation and erosion:

$$\varepsilon[f](x, \rho) =
\begin{cases}
F_{\ominus}(x, \rho), & F_{\oplus}(x, \rho) - F(x, 0) > F(x, 0) - F_{\ominus}(x, \rho), \\
F_{\oplus}(x, \rho), & F_{\oplus}(x, \rho) - F(x, 0) < F(x, 0) - F_{\ominus}(x, \rho), \\
F(x, 0), & \text{otherwise.}
\end{cases} \qquad (4.18)$$
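A sketch of this erosion/dilation sharpening filter using OpenCV's erode and dilate (the structuring element size is a placeholder, not the value used in the experiments):

```python
import cv2
import numpy as np

def morphological_sharpen(img, kernel_size=3):
    """Sharpening built from erosion and dilation, following eq. 4.18.

    img : greyscale image (uint8); kernel_size : side of the square
    structuring element, i.e. the scale rho.
    """
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(img, kernel)      # F_erosion(x, rho): local minimum
    dilated = cv2.dilate(img, kernel)    # F_dilation(x, rho): local maximum

    img16 = img.astype(np.int16)         # avoid uint8 wrap-around in the differences
    to_dilation = dilated.astype(np.int16) - img16   # F_dilation - F(x, 0)
    to_erosion = img16 - eroded.astype(np.int16)     # F(x, 0) - F_erosion

    # Pick erosion where it is closer, dilation where it is closer, else keep f(x).
    out = np.where(to_dilation > to_erosion, eroded,
          np.where(to_dilation < to_erosion, dilated, img))
    return out.astype(np.uint8)
```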

Using the image sharpening filter, as described above, makes it possible to further enhance the quality of the reconstruction. The size of the structuring elements used in the experiments
