
MSc Artificial Intelligence

Master Thesis

Lightweight and Unsupervised Flow

Estimation through Iterative Refinement

by

Shane Koppers

+31628653196

January 23, 2021

48 EC Nov 2019 - Jan 2021

Supervisor:

Dr S Karaoglu

Assessor:

Dr T Gevers

University of Amsterdam

3DUniversum


Abstract

In the domain of optical flow estimation, labeled data is scarce and the few datasets that exist contain data considerably different from video footage shot by consumers. To solve the problem of data scarcity and close the gap between training data and data used for inference, unsupervised learning is used. Current state-of-the-art unsupervised learning optical flow networks are of considerable size and not suitable for use on mobile devices where a small GPU memory poses a significant constraint. Recently, a new lightweight architecture, RAFT-S, has been published that has yet to be utilized for unsupervised training. This architecture uses a recurrent iterative refinement process to estimate optical flow. We successfully create UnRAFT by using unsupervised training to fine-tune RAFT-S on the two major optical flow benchmarks: Sintel and KITTI. In the process we introduce a new weighting scheme for applying loss to the intermediate output flow. We demonstrate the significance of backpropagation at coarse and intermediate flow output as well as introduce smoothness loss at all but the coarsest intermediate flow outputs. Lastly, we propose the use of an upsampling layer during training as it aids in the effectiveness of the photometric loss. We identify problems such as vanishing gradients in the recurrent unit and distortions introduced by the upsampling layer. UnRAFT is able to deliver promising results, with detailed motion boundaries and near-competitive scores on KITTI and Sintel.


Contents

1 Introduction
  1.1 Problem Statement
2 Background & Related Work
  2.1 Key components of optical flow networks
    2.1.1 Feature extraction
    2.1.2 Cost volume
    2.1.3 Loss functions
  2.2 Recent developments
  2.3 Improvements upon unsupervised learning
  2.4 Lightweight Models
  2.5 Training procedures
3 Methods
  3.1 Model Architecture
    3.1.1 Feature Extraction
    3.1.2 Correlation Block
    3.1.3 Update block
    3.1.4 Upsample layer
  3.2 Unsupervised Training
    3.2.1 Photometric Loss
    3.2.2 Smoothness Loss
    3.2.3 Self-Supervised Loss
4 Experiments
  4.1 Data
    4.1.1 Flying Chairs
    4.1.2 Flying Things
    4.1.3 Sintel
    4.1.4 KITTI
  4.2 General Setup
  4.3 Iterations
  4.4 Upsample Layer
  4.5 Qualitative analysis
  4.6 Comparison with State of the Art


Chapter 1

Introduction

In recent years, deep learning has revolutionized many computer vision tasks, one of which is optical flow estimation. Optical flow estimation is the problem of solving the pixel-wise disparity between a pair of consecutive frames: estimating, for every pixel, a 2D vector that points to the corresponding point in the next frame. An example can be seen in figure 1.1.

There are numerous applications that make great use of the produced flow field. Optical flow can be used for video super-resolution [33], exploiting the temporal information the flow vectors hold. It can be used for High Dynamic Range (HDR) imaging [32], where multiple Low Dynamic Range (LDR) images are captured, aligned and then stitched together to create one HDR image; optical flow is used to warp the pixels of the LDR frames into the right position to create a seamless stitch. It can also be used for frame interpolation [18], generating intermediate frames by gradually moving pixels along the estimated flow vectors.

As mentioned above, there are many applications for optical flow, and it can be used to improve the quality of video and image capture. This is especially true for mobile devices, where the size of the camera poses a significant constraint on the quality of the images it produces. It would be beneficial to apply these techniques on mobile devices; however, optical flow estimation can be a computationally expensive task. Many deep learning models are trained and optimized to be used on high-end graphics cards. In the domain of mobile phones, the restrictions on the GPU, namely a small GPU memory and slow execution time, can play an important role in the viability of post-processing techniques. In order to use deep learning models effectively, few parameters need to be used so that the model can fit on small phone GPUs. Current state-of-the-art models RAFT [31] and PWC-Net [29] come in at around 5M parameters, while in the past models such as SPyNet [26] and LiteFlowNetX [13] have been able to reach adequate accuracy using only around 1M parameters. However, they require labeled optical flow data and many training iterations while still underperforming compared to the current state of the art. Aside from RAFT, Teed and Deng also introduce a smaller model, RAFT-S, with only 1M parameters. Although this model shows promising results, the authors mostly focus on the full-size model in their paper.

Classical methods such as EpicFlow [27] do not have the issue of a large memory footprint as they by definition do not have learned parameters. However, these methods are very computationally expensive and thus difficult to run within an acceptable time frame in such a mobile environment. Deep neural networks have shown promising results in the field of optical flow. Not only do they provide better results than the classical methods, but, more importantly, the computational cost of inference is also far lower. A downside is that these networks need proper training before they can reach maximum performance.

Ground truth optical flow is typically difficult to obtain; manually annotating flow fields by hand is impossible to do accurately. However, there are three methods that have been used by current datasets to programmatically obtain accurate ground truth flow fields. The first method, as used by the Middlebury dataset [3], is filming a simple scene and reconstructing it by hand in a 3D modeling environment. From this reconstructed virtual environment, the ground truth optical flow can be obtained. The problem with this approach is that the scenes and movements have to be simple, as complex movements will be impossible to reconstruct accurately. The second method makes use of completely synthetic data. This data is able to display challenging scenes and movements; however, since it does not use real-life footage, differences such as lighting, noise and other irregularities exist between training data and data used for inference. Datasets that use such synthetic data are Flying Chairs [8] and Flying Things [23], which feature randomly moving chairs or household objects in front of a plethora of backgrounds. Although the images provide a considerable challenge, as can be seen in figure 1.2, there is a considerable gap between the data and footage one would find in real life.


Figure 1.1: Example of optical flow estimation. For better visualization only a sparse flow field is shown, however optical flow vectors are estimated for every pixel.

Another synthetic dataset, and currently the most popular benchmark, is Sintel [5]. Sintel is a CGI action/fantasy movie from which the ground truth optical flow has been extracted. Although the scenes provide realistic movements, they are still easy to distinguish from real footage and thus not a perfect reflection of real-world scenarios. The third method, as used by KITTI [9, 25], consists of using Lidar sensors to reconstruct points of the scene captured by the camera. This method is able to use real-life data without restrictions on complex movements; however, Lidar technology is not yet able to capture as many points as there are pixels at HD video quality. Therefore, flow fields are sparse, and although models generate per-pixel flow fields, only some of the flow vectors will have a ground truth counterpart. KITTI uses this technology to create ground truth optical flow from video footage taken from a driving car. Although this video footage is real imagery, it is very different from what you would find in the average video taken by consumers on their mobile phone. In figure 1.2, some example images can be seen from each dataset.

There is a considerable gap between the available annotated optical flow data and real data. To solve this issue, some models have instead opted to train on unlabeled data. Not only does this solve the problem of data scarcity but also lessens the gap between learning data and data used in practice. There is an abundance of real-life footage online which can be easily used to train on, if unsupervised learning proves to be effective. Previous researchers have already made unlabeled datasets by gathering publicly available videos [35, 18].

Currently, all state-of-the-art unsupervised optical flow networks are built with the widely used PWC-Net model as their foundation [20, 19, 22, 21, 38, 16]. Although this network proves to be effective, it has a sizable 5M parameters. ARFlow [20] improves upon the classic PWC-Net by making it lightweight, reducing dense connections and downsizing the model to 2.24M parameters.


Figure 1.2: Example frames of four popular optical flow datasets. Top left: Flying chairs, top right: Flying Things, bottom left: Sintel, bottom right: KITTI.

Although PWC-Net has been at the forefront of most deep learning methods, a completely new model has recently entered the scene: RAFT [31]. This model steps away from the pyramidal structure used in PWC-Net and delivers a brand-new architecture, outperforming many of the PWC-Net based networks. More importantly for our case, a small version of the model was published together with RAFT: RAFT-S. With a size of 0.99M parameters, it perfectly fits our use case. There is one problem to be solved, however: it has yet to be used with unsupervised learning.

1.1 Problem Statement

With parameter size in mind, we want to create an optical flow network that can be trained using unlabeled data. Although small models with parameter size around 1M exist, they are trained using labeled data. The recently published RAFT-S model seems to be a good fit based on performance and model size. However, since the architecture differs significantly from the PWC-Net model all state-of-the-art unsupervised networks are based on, there are challenges to overcome in order to use RAFT-S for end-to-end unsupervised training. The problem statement thus becomes as follows:

How can we effectively develop a lightweight model for unsupervised training of optical flow, while maintaining its small size?

In unsupervised learning of optical flow, different loss terms play an important role. However, the most significant term is the photometric loss, which behaves vastly differently from the supervised loss used by RAFT. To account for problems introduced by this change, we might need to make changes to the model and architecture. However, all decisions come at a price, and we especially need to be careful about the consequences of these decisions for the memory footprint.

We introduce a new model UnRAFT and a variation UnRAFT-U with slightly more parameters. Both models can fine-tune on unlabeled data and achieve close to state-of-the-art results.

The contributions of this thesis are as follows:

1. Introduction of unsupervised learning to the newly introduced RAFT architecture.
2. Use of an upsampling layer to aid unsupervised training of optical flow.
3. Smoothness loss at intermediate flow results to improve iterative refinement.


Chapter 2

Background & Related Work

Optical flow estimation can be formulated as follows: given two consecutive frames I1 and I2, can we find the dense flow field f12 where for every pixel coordinate p:

$$I_2(p) = I_1(p + f_{12}(p)) \tag{2.1}$$

In other words, we want to find for every pixel in frame I1 the flow vector that points to its corresponding point in the next frame I2. This formulation makes the brightness constancy assumption, which entails that the brightness of a point remains the same across frames. In reality, this is not the case because of occlusions and lighting changes such as moving shadows.

The problem of estimating optical flow has been around for a long time. One of the earliest works in this field was published in 1981 by Horn and Schunck [12]. They posed the determination of optical flow as an energy minimization problem using a brightness constancy and a spatial smoothness assumption. This approach is also used by later papers employing some form of coarse-to-fine scheme with decent results [27]. Although it has proven to deliver good results on small displacements, energy minimization is considerably more computationally expensive than machine learning implementations.

The recent developments in Convolutional Neural Networks (CNNs) and deep neural networks have provided great advancements in the field of optical flow. Fischer et al. introduce the hallmark FlowNet [8] architecture, providing two models: FlowNetS and FlowNetC. The former is a simple CNN consisting only of convolutional layers; the latter introduces a correlation layer, where each image starts in its own CNN branch before the branches are merged by the correlation layer. As a successor to FlowNet, FlowNet2 [15] was published, obtaining then state-of-the-art results and thus competing with classical methods for the first time. FlowNet2 is made by combining multiple FlowNet models; however, a significant part of the model's success is due to a sub-network specifically designed and trained for small movements. A major flaw of the network lies in the model size. FlowNetS and FlowNetC consist of around 39M parameters each. Since FlowNet2 is a composition of multiple FlowNetS and FlowNetC networks, the total model is considerably larger at 162M parameters.

The FlowNet models have become pioneers of solving optical flow using machine learning methods. Deep learning has proven to obtain great results in a wide range of AI fields, and after the success of the FlowNet models it has also been applied to the domain of optical flow. Ranjan and Black combine a spatial pyramid of images with deep learning to create SPyNet [26]. Published around the same time as FlowNet2, SPyNet reaches comparable or lower error than FlowNetC on standard benchmarks while being 97% smaller, a crucial improvement in model size. By using a coarse-to-fine spatial pyramid structure to learn the residual flow at each level of the pyramid, the model is able to learn both big and small movements alike. Flow is calculated for the top level of the pyramid, upsampled and then refined at the next level. For each level, a downsampled ground truth is used to calculate the loss of the corresponding network of that level.

2.1 Key components of optical flow networks

To properly understand many of the crucial improvements, we must first explore the core concepts used by optical flow networks. Although every model is different, these are the concepts that are used by most if not all optical flow networks.


2.1.1 Feature extraction

Instead of directly estimating flow on the image pairs as done by SPyNet, most models first use some kind of feature extraction. When comparing two images directly, corresponding points in the image can wildly differ in appearance because of noise, rotation and lighting differences. Features are extracted from the images so that learned features are compared rather than pixel values. This leads to better flow estimation in challenging imagery as these learned features should provide a more generalizable and stable representation of points in the images, which in turn leads to improved matching of corresponding points.

On top of learning features for each image, some models also use a context network [29, 31]. The context network usually is a feature encoder network similar or equal to the network used for regular feature encoding, however, only the first image is used to encode context features. These features are usually not processed in the same manner as the features representing the images, but are only used at the final stages of flow estimation. As the name suggests, these features should represent the context of the scene.

2.1.2 Cost volume

Sometimes called the correlation volume or layer, the cost volume is a matrix that holds, for every pixel in frame I1, the cost of matching it to a corresponding pixel in frame I2. In its purest form it is the volume that holds, for every pixel in frame I1, the matching cost for each pixel in frame I2. It is this volume from which the optical flow is ultimately derived, as finding the minimum matching cost for each pixel in I1 leads to finding the corresponding point in I2. The matching cost would normally be some form of measurement of difference between two pixels; however, when matching features, a matching cost is no longer necessary, as features can directly be multiplied with each other to create a more correctly named correlation volume. Instead of finding the lowest cost, the correlation volume is then processed by convolutions to derive the flow from it.

In optical flow, the subjects of the cost volume are two 2D images. If we constructed, for every pixel in the first frame, a cost for every pixel in the second frame, we would get a 4D cost volume. Most models use some kind of lookup radius in which every feature restricts its matching area to its closest neighbors [29, 15, 36]. A lookup radius of 3 results in a 7x7 lookup area, in which a matching cost is constructed between the pixel in frame I1 in the center of the lookup area and all pixels in the lookup area in frame I2. If the dimensions of I1 and I2 are given by H × W, the resulting cost volume is 3D and of size 49 × H × W. However, most state-of-the-art deep networks use some kind of processing to reshape the cost volume to 2D, with the lookup-area dimension put in the channels so that it can be used with 2D spatial processing [29, 15].
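To make this concrete, below is a minimal sketch (not the implementation of any of the cited networks; function and variable names are illustrative) of building such a local cost volume from two feature maps with a lookup radius of 3, producing a 49 × H × W volume per batch element:

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat1, feat2, radius=3):
    """Correlate each feature in feat1 with a (2*radius+1)^2 neighborhood in feat2.

    feat1, feat2: (B, C, H, W) feature maps.
    Returns a cost volume of shape (B, (2*radius+1)**2, H, W).
    """
    b, c, h, w = feat1.shape
    side = 2 * radius + 1
    # Pad feat2 so every shifted view stays within bounds.
    feat2_pad = F.pad(feat2, (radius, radius, radius, radius))
    costs = []
    for dy in range(side):
        for dx in range(side):
            shifted = feat2_pad[:, :, dy:dy + h, dx:dx + w]
            # Inner product over channels = correlation score for this displacement.
            costs.append((feat1 * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(costs, dim=1)  # (B, 49, H, W) for radius 3
```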

2.1.3 Loss functions

In order to make the networks end-to-end trainable, a loss function needs to be constructed. There are two types of loss functions that will be of importance to us: supervised and unsupervised. The former is very straightforward: the output flow is a collection of 2D flow vectors f. If we have the ground truth flow fgt, we can simply calculate the difference to get the error. A common supervised loss function is then:

$$\mathcal{L} = |f_{gt} - f| \tag{2.2}$$

Unsupervised losses are far more complex. The basic principle revolves around warping the source image to the target image using the output flow and then comparing the warped image with the target image.

Photometric loss through warping

If we take I1 and I2 as the input images and the output flow to be f12, we can employ backward warping by bilinearly sampling I2 into a new image Ĩ21 using f12. The result is a warped version of I2 where the pixels are repositioned so that they align with their corresponding positions in I1. We can then use some kind of pixel-wise comparison between the warped image Ĩ21 and the target I1 to get the photometric loss. Why backward warping is used over forward warping can be explained by looking at figure 2.1.
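As an illustration, a minimal PyTorch sketch of backward warping through bilinear sampling could look as follows (the names and the normalization convention are assumptions for this example, not code from the thesis):

```python
import torch
import torch.nn.functional as F

def backward_warp(img2, flow12):
    """Sample img2 at positions p + flow12(p), producing an image aligned with frame 1.

    img2:   (B, C, H, W) second frame.
    flow12: (B, 2, H, W) flow from frame 1 to frame 2, in pixels, channel order (x, y).
    """
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys)).float().to(img2.device)      # base pixel grid (2, H, W)
    coords = grid.unsqueeze(0) + flow12                        # sampling positions (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)      # (B, H, W, 2)
    return F.grid_sample(img2, grid_norm, mode="bilinear", align_corners=True)
```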

The concept itself is fairly simple; in practice, however, there are a few pitfalls to consider. The first is the brightness constancy assumption. Even when we use the ground truth flow and warp a pixel to its corresponding point in the next frame, the pixel in the next frame might have a different brightness because of lighting changes and noise. So even if the output flow is correct, the loss might indicate otherwise. A solution is to not compare pixel values directly, but to, for example, use SSIM [34], which computes a pixel-wise similarity index, or to first apply some kind of transform, like the census transform, introduced to unsupervised learning by Meister et al. [24].


Figure 2.1: Forward and backward image warping. The problem with forward warping is that some pixels in the warped image are not allocated a pixel. When backward warping, bilinear sampling can be used to sample from anywhere within the original image. (Source: [28])

The census transform converts every pixel into a byte which indicates, for each direct neighbor (the 8 surrounding pixels), whether its intensity is greater or less than the pixel itself, as described in equation 2.3. This operation makes it so that we no longer care about the absolute brightness of a pixel, but only about its brightness relative to its neighbors. Since the result of the census transform is a byte for every pixel, we can calculate the pixel-wise Hamming distance between the warped and the target image to get the loss.

$$\zeta(p, p') = \begin{cases} 0, & \text{if } p > p' \\ 1, & \text{if } p \leq p' \end{cases} \tag{2.3}$$

Figure 2.2: Example of census transform applied on a single pixel. (Source: [1])
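A simplified, non-differentiable sketch of the census transform and the Hamming-distance comparison, assuming grayscale input and ignoring border effects (in practice a soft variant is normally used so that gradients can flow), might look like this:

```python
import torch
import torch.nn.functional as F

def census_transform(img):
    """Compare each pixel against its 8 neighbors; 1 where the neighbor is >= the pixel.

    img: (B, 1, H, W) grayscale image. Returns (B, 8, H, W) binary descriptors.
    """
    patches = F.unfold(img, kernel_size=3, padding=1)               # (B, 9, H*W)
    b = patches.shape[0]
    patches = patches.view(b, 9, *img.shape[-2:])                   # (B, 9, H, W)
    center = patches[:, 4:5]                                        # the pixel itself
    neighbors = torch.cat([patches[:, :4], patches[:, 5:]], dim=1)  # its 8 neighbors
    return (center <= neighbors).float()

def census_hamming_loss(warped, target):
    """Mean pixel-wise Hamming distance between census descriptors of two images."""
    return (census_transform(warped) != census_transform(target)).float().sum(dim=1).mean()
```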

Occlusion estimation

The second major difficulty with the photometric loss is dealing with occlusion. If a pixel does not have a corresponding point in the next frame, no matter where we warp from, the result will be erroneous. To deal with this issue, an occlusion mask is estimated. This mask is a binary matrix sharing the width and height dimensions of the image pair I1 and I2, containing a 0 when a pixel is occluded and a 1 when it is not. By element-wise multiplying the photometric loss with this mask, we zero out the loss for the occluded pixels. The difficulty, however, is in obtaining the occlusion mask. There are two techniques that are often used: range-map and forward-backward consistency. The range-map technique uses the backward flow f21 to count, for every pixel in I1, how many flow vectors point to it, by assigning a score of 1 to the 4 pixels surrounding the end point of every flow vector. Pixels that have no vector pointing to them are considered to be occluded. An example of this technique can be seen in figure 2.4. The second technique uses the forward and backward flow to do a consistency check. By warping the backward flow using the forward flow and then adding the warped flow to the forward flow, we get the difference for each flow vector. We can then say that if the difference is above a certain threshold, the corresponding pixel is considered to be occluded. The forward-backward consistency check works well; however, there is one caveat: since the estimated flow at the start of training is often quite random, this consistency check produces near all-zero occlusion masks. A solution is to only start using it after a portion of the training schedule, using either the range-map technique or no occlusion estimation at all during the beginning of training.
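A rough sketch of the forward-backward consistency check, reusing the backward_warp helper from above (the threshold constants are illustrative; the thesis does not state them here):

```python
def fb_consistency_occlusion(flow12, flow21, alpha1=0.01, alpha2=0.5):
    """Estimate an occlusion mask for frame 1 from forward and backward flow.

    flow12, flow21: (B, 2, H, W). Returns a (B, 1, H, W) mask, 1 = not occluded.
    """
    # Bring the backward flow into frame 1 coordinates.
    flow21_warped = backward_warp(flow21, flow12)
    # For consistent, non-occluded pixels the forward and warped backward flow cancel out.
    diff = flow12 + flow21_warped
    diff_sq = (diff ** 2).sum(dim=1, keepdim=True)
    mag_sq = (flow12 ** 2).sum(dim=1, keepdim=True) + (flow21_warped ** 2).sum(dim=1, keepdim=True)
    # Occluded where the mismatch exceeds a flow-magnitude dependent threshold.
    occluded = diff_sq > alpha1 * mag_sq + alpha2
    return (~occluded).float()
```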

Smoothness loss

Apart from the photometric loss, most unsupervised models also use a smoothness loss as regularization [20, 17, 37]. When image patches lack structure or have repeating patterns, multiple flow fields can warp a pixel to a similar point in the target image.


Figure 2.3: Left: source, middle: target, right: warped source. A simple example of why an occlusion mask is needed. When warping, we don’t know what is behind the red square, leading to occlusion.

Figure 2.4: An example of the range-map technique used to estimate occlusion. The gray pixels indicate that they are assigned a score of 1 and thus considered to be not occluded.

The smoothness loss is a penalty to account for this ambiguity: it constrains flow vectors to be similar to their neighbors when no significant image gradient is present.

2.2 Recent developments

Sun et al. create PWC-Net [29], improving upon SPyNet by constructing a learnable feature pyramid instead of estimating flow from an image pyramid. Both images have a pyramid whose layers consist of features at different resolutions. Starting with the image as input, the finest, largest feature layer of the pyramid is constructed first. The other layers, each smaller than the one before, are then iteratively created by a series of convolution layers. The feature layers are then used in a coarse-to-fine manner: starting with the smallest feature layer, flow is estimated, upsampled and then used to refine the flow at the next layer, by using the upsampled output of the previous layer to warp the features of the current layer and estimating flow again. When properly trained, PWC-Net obtains state-of-the-art results, and many new models have been built with PWC-Net as their foundation [22, 36, 11, 17, 20].

Yang et al. create VCN [36] based on PWC-Net, proposing to use a 4D cost volume of dimensions H × W × U × V, where U and V are the horizontal and vertical search range. They argue that reshaping the cost volume into 2D, as done by PWC-Net and FlowNet, requires the network to memorize particular displacements seen during training, as the search-area information sits in the channel dimension. 4D cost volumes are able to generalize better to displacements not seen during training. However, the increase in dimensions comes with drawbacks such as computation time and memory requirements when processing the volume using 4D convolutions. To address these issues, VCN uses separable 4D convolutions to reduce the number of parameters needed to process the cost volume and to improve computation time.

Teed and Deng [31] argue that the pyramidal structure used by PWC-Net and many of its offspring is constrained by this design choice, as the coarse-to-fine approach may miss small fast-moving objects and has difficulty recovering from early mistakes. They introduce RAFT, which does not use a pyramidal structure and instead constructs per-pixel features. This is done by a feature extractor network that consists of residual units. However, the real ingenuity of RAFT lies in its cost volume, which is constructed by matching all pairs, i.e. a cost volume of dimensions H × W × H × W. Instead of using a search range during the construction of the cost volume, they create a full cost volume and then sample from it using a search range and the current flow estimate. The flow estimate is then iteratively refined using a recurrent network; each time the flow is updated, so is the sample taken from the cost volume.


Iterative refinement was seen before in IRR [14]. Here a framework is provided that iteratively refines the output flow of a given network by reusing a single network to refine the previous estimate. IRR uses either FlowNet2 or PWC-Net as its base optical flow network, which makes IRR limited by either network size or the pyramidal structure, respectively. Although IRR makes use of supervised learning, it does explicitly reason about occlusion: a dedicated decoder is used for occlusion estimation. This occlusion is then used in combination with bidirectional flow to refine the flow and obtain improved results. Flow estimation is usually done on downsampled images and then upsampled, as using the full resolution only gives a slight accuracy increase while having great computational cost. However, Hur and Roth find that naively upsampling the occlusion map does lead to significant accuracy loss. To counteract this, instead of simple bilinear upsampling, they use a simple CNN module combining multiple features to generate an improved upsampled occlusion map.

Although PWC-Net has been incredibly popular as a base architecture for optical flow since its release, RAFT has been able to improve results significantly by stepping away from the commonly used pyramid structure. Currently, in the domain of supervised optical flow, RAFT is the top performing model on Sintel.

2.3 Improvements upon unsupervised learning

Multiple approaches have been explored to improve unsupervised learning. One such approach is the use of other, similar learning tasks to aid optical flow estimation. Yin and Shi create GeoNet [37], which learns depth, optical flow and camera pose. The three properties are coupled by the nature of 3D scene geometry. This coupling is then used to reason about static and dynamic objects, which in turn is used to reason about where occlusion might occur. Jiang et al. develop SENSE, a shared encoder network for scene-flow estimation [17]. The network shares a feature encoder for four highly correlated tasks: optical flow estimation, disparity estimation, occlusion estimation and semantic segmentation. Built with PWC-Net as a starting point, the encoder is replaced, and optical flow, disparity and semantic segmentation each have a unique decoder. Occlusion estimation uses the output of the optical flow and disparity decoders.

Janai et al. argue that using multiple frames helps improve unsupervised learning of optical flow [16]. By using a minimum of three frames combined with explicit reasoning about occlusion, they decrease the photometric loss and increase performance. Similarly, SelFlow [22] uses a window of 3 images, as it provides more information about occluded pixels. Additionally, their approach involves self-supervised learning using a teacher-student model. They train two identical models; however, one model focuses on flow prediction of non-occluded pixels (NOC) while the second model learns to predict flow for all pixels (OCC). Before images are used for training the OCC model, occlusion is artificially created by injecting patches of the image with noise using superpixels [2]. Since the patches are not occluded in the NOC model, its prediction can be used as annotation to guide the OCC model. Once trained, only the OCC model is used at inference.

Jonschkowski et al. [19] perform an analysis on key components of unsupervised optical flow learning and introduce UFlow using the PWC-Net architecture. They introduce cost volume normalization which should solve the issue of very low values in the estimated cost volumes as a result of vanishing feature activations at higher pyramid levels. They confirm that for the pixel-wise comparison of the warped image with the target image, census loss performs better than SSIM loss and the generalized Charbonnier loss. A comparison is done on the two methods for occlusion estimation: range-map and forwards-backwards consistency. They conclude that range-map works best when stopping gradients at the occlusion mask, but that forwards-backwards consistency leads to better performance especially when only using it after 20% of the training steps. Lastly, they experiment with applying the smoothness loss at the level flow is estimated, which is at a quarter of the input resolution, rather than at the level of the upsampled output flow. They argue that because of the 4x upsampling, only every fourth pixel can possibly have a non-zero second order derivative, which might not be aligned with the corresponding image edge and thereby reduce the effectiveness of edge-aware smoothness. Their experiments show that applying smoothness loss at the level of flow estimation improves performance.

Liu et al. [20] create ARFlow, improving upon the teacher-student model proposed by SelFlow. While in SelFlow the teacher model is first trained and then only used for inference during training of the student model, in ARFlow both models are trained in conjunction. However, to achieve stability, gradients are stopped between the student and teacher model. On top of this, ARFlow also adds other augmentations besides the occlusion transformation to the student model, namely spatial and appearance transformations. For unsupervised learning, ARFlow is by far the best performing model on Sintel; the next best performing are SelFlow [22] and UFlow [19], but the performance gap is considerable.


2.4 Lightweight Models

To accommodate big movements, flow estimation networks like FlowNet2 have been very large. By using a spatial pyramid, SPyNet only needs to predict small movements at each pyramid level, decreasing the size of the total network. Although this approach greatly reduces the size of the network, it was unable to deliver results on par with FlowNet2. To address the size of FlowNet2, Hui et al. [13] introduce LiteFlowNet with 5.37M parameters while increasing performance on KITTI and Sintel. As a variation they also introduce LiteFlowNetX with 0.90M parameters. Although its performance decreases significantly, it is able to outperform SPyNet while using even fewer parameters.

The previously mentioned IRR-PWC [14] reduces model parameters by using an iterative residual refinement scheme with shared weights. Combined with refinement using bidirectional flow and occlusion estimation, it increases the accuracy of PWC-Net by 17.7% while also decreasing the number of parameters by 26.4%. ARFlow [20] modifies PWC-Net by reducing the dense connections and sharing the flow decoder across all levels of the pyramid. This change reduces the model parameters to 2.24M. Lastly, RAFT-S is introduced together with the standard RAFT architecture. While only using 0.99M parameters, it is able to outperform many full-size models such as VCN on the training data of Sintel and KITTI. As RAFT-S was not the focus of the paper, the possibilities of this smaller version were not fully explored: no fine-tuning on Sintel or KITTI was done, and the model was not submitted for evaluation on the test sets of these two benchmarks.

2.5 Training procedures

Recent work showed that besides architecture, training procedures also play a big role in achieving high accuracy. [15] and [30] introduce the now standard training procedure of pre-training on FlyingChairs [8] and fine-tuning on FlyingThings3D [15], followed by fine-tuning on either Sintel [5] or KITTI [9, 25], depending on which dataset the model will be evaluated on. [30] also shows that accuracy improvements can be obtained by simple procedures such as horizontal flips, not adding Gaussian noise and disrupting the learning rate.

As previously mentioned, ground truth optical flow is scarce. As a result, a common technique used in training is to take random crops of the images in a dataset to (1) create more training data and (2) reduce the size of the images so that more can fit in a batch. Bar-Haim and Wolf show that using fixed-size random crops causes a bias towards using the center pixels more often [4]. These center pixels often have slower movement and less occlusion than pixels near the edge. A more dynamic cropping method that provides an equal representation of all pixels would yield data more representative of the dataset. Building on top of IRR-PWC [14], ScopeFlow is created by using both cropping and zooming strategies combined with a relaxation of regularization and augmentation.


Chapter 3

Methods

In order to create a lightweight network that is able to estimate optical flow after training without labeled data, we combine the small model provided by RAFT with the training setup of ARFlow. Some slight changes are made to accommodate the differences between the model and training setup each component was originally designed to work with.

3.1 Model Architecture

RAFT-S consists of 3 main components: a feature extraction network consisting of bottleneck residual layers [10], a correlation volume and, lastly, a recurrent update operator using a ConvGRU, the convolutional equivalent of the Gated Recurrent Unit (GRU) [7, 6]. Optical flow is estimated at 1/8th of the original image resolution. To counteract inaccuracies introduced by bilinear upsampling, an upsampling layer can be used as an extension to the network. A general overview can be seen in figure 3.2. Features are encoded by the feature network for the first and second frame. The first frame is also used by the context network to encode context features. Next, the all-pairs correlation volume is constructed from the features of the image pair. By using average pooling, a correlation pyramid is made. As the all-pairs correlation volume is of considerable size, a lookup operation uses the current flow estimate to sample only a small part of it. The sampled correlation volume, the context features and the current flow estimate are then used in a recurrent unit to iteratively refine the flow estimate. We thus enter a loop of sampling from the correlation volume, refining the flow estimate and repeating until converged.

Figure 3.1: A bottle-neck residual unit

3.1.1 Feature Extraction

The feature extraction network is used twice in the model: once as a feature network to extract features of the input images, and once as a context network, the only difference being the dimensions. Both input images are used by the feature network to encode image features, however the first frame is also used by the context network to encode context features. The feature network consists of a regular convolution layer, followed by 3 pairs of bottleneck residual units, each pair having an increased hidden dimension. After passing through all pairs of residual units, a final convolution is applied.

A regular residual unit consists of 2 3x3 convolutions, each followed by normalization and an activation function. A bottleneck residual unit on the other hand uses a 1x1 convolution to reduce the dimensions, a 3x3 convolution, and lastly, a 1x1 convolution to increase the dimensions back to the input dimensions. The reducing and restoring of dimensions by the 1x1 convolutions is beneficial for the smaller model size as the 3x3 convolution now operates in a smaller dimension space. A figure of this can be seen in figure 3.1.

The residual units are used in combination with a regular convolution layer at the start and at the end to create the feature and context encoder as seen in figure 3.4. The only difference between the feature encoder and context encoder is the normalization. The feature network uses instance normalization while the context encoder uses no normalization.


Figure 3.2: An overview of the method used to compute optical flow.

3.1.2 Correlation Block

Once features are extracted from the image pair, they are passed to the correlation block, which first constructs an all-pairs correlation volume by taking the inner product of all pairs of feature vectors. If we take H as the height of the feature maps and W as the width, we create a correlation volume of size H × W × H × W. Secondly, by average pooling the last two dimensions with a stride of 2 three times and saving all intermediate results, we construct a correlation pyramid of frame I2 for every feature in frame I1. Although the first two steps are computationally expensive, they need only be executed once. Lastly, we define a lookup operator. The current flow estimate is used to create a local grid around the corresponding point in frame I2 of each feature in I1; this grid is then bilinearly sampled to create a smaller view of the total correlation volume. This is done for each pyramid level to return a set of correlation features within the lookup radius. A figure of this can be seen in figure 3.3. Notice how, although the lookup radius is constant, the number of pixel coordinates that fall within the radius increases as the pyramid level increases. In our implementation, 4 pyramid levels are constructed and a lookup radius of 3 is used. Since flow is computed at 1/8 resolution, even without using a correlation pyramid we already have an effective lookup radius of 24 pixels, since we consider up to 3 features away and each feature represents an 8 × 8 neighborhood of pixels. By using the correlation pyramid, we double the effective lookup radius with each level, resulting in an effective lookup radius of 192 pixels.

Although the correlation pyramid is only computed once, the lookup operator is used every iteration as the updated flow provides a new view of the unchanged correlation pyramid.

Figure 3.3: The correlation pyramid of a single feature f in frame I1. The dots represent pixel coordinates, while the red dot represents the corresponding point to f according to the current flow estimate. A colored square indicates a single pooled value. The dimensions of the pyramid levels, left to right, are 16 × 16, 8 × 8 and 4 × 4. Using a lookup radius of 1, a local grid is created around the corresponding point, indicated by the red outline. This grid is then used to bilinearly sample from the pyramid level.
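To make the lookup operator concrete, the following sketch samples a local grid from each level of a precomputed correlation pyramid; the shapes, names and the assumption of batch size 1 are my own simplifications and not the thesis implementation:

```python
import torch
import torch.nn.functional as F

def lookup(corr_pyramid, coords, radius=3):
    """Sample correlation features around the current flow estimate.

    corr_pyramid: list of tensors, level k of shape (H*W, 1, H/2^k, W/2^k),
                  i.e. one correlation map over frame 2 per source pixel in frame 1.
    coords: (1, 2, H, W) current target coordinates (source grid + flow), in pixels (x, y).
    Returns (1, num_levels * (2*radius+1)**2, H, W) correlation features.
    """
    _, _, h, w = coords.shape
    d = torch.arange(-radius, radius + 1, dtype=torch.float32)
    delta = torch.stack(torch.meshgrid(d, d, indexing="ij"), dim=-1)     # (2r+1, 2r+1, 2) as (dy, dx)

    out = []
    for k, corr in enumerate(corr_pyramid):
        # Center of the lookup window at this level; coordinates shrink by 2^k.
        centroid = coords.permute(0, 2, 3, 1).reshape(h * w, 1, 1, 2) / 2 ** k
        window = centroid + delta.flip(-1)                               # (H*W, 2r+1, 2r+1, 2) as (x, y)
        hk, wk = corr.shape[-2:]
        # Normalize to [-1, 1] for grid_sample.
        window[..., 0] = 2 * window[..., 0] / (wk - 1) - 1
        window[..., 1] = 2 * window[..., 1] / (hk - 1) - 1
        sampled = F.grid_sample(corr, window, align_corners=True)        # (H*W, 1, 2r+1, 2r+1)
        out.append(sampled.reshape(1, h, w, -1).permute(0, 3, 1, 2))
    return torch.cat(out, dim=1)
```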


Figure 3.4: The architecture of RAFT-S (Source: [31])

3.1.3 Update block

The recurrent update block estimates a sequence of flow estimates {f1, . . . , fN}, with f0 initialized as a zero matrix. Every iteration produces an update direction ∆f, which is added to the current flow estimate to create the new flow estimate: fk = fk−1 + ∆f. The current flow estimate, together with the sampled correlation features and the context features, is used in the update block, processed by convolutions and then fed into a recurrent ConvGRU block before being processed by a flow encoder to create the flow update ∆f.
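Putting these pieces together, the refinement loop can be sketched roughly as below; update_block stands in for the ConvGRU-based update operator, the lookup sketch from above is reused, and the batch size of 1 plus the context-derived initial hidden state are simplifying assumptions:

```python
import torch

def iterative_refinement(corr_pyramid, context, update_block, num_iters=6):
    """Sketch of RAFT-style iterative flow refinement at 1/8 resolution."""
    b, _, h, w = context.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(context.device).unsqueeze(0)  # pixel grid (1, 2, H, W)

    flow = torch.zeros(b, 2, h, w, device=context.device)    # f0 initialized to zero
    hidden = torch.tanh(context)                              # initial GRU hidden state
    predictions = []
    for _ in range(num_iters):
        corr_feats = lookup(corr_pyramid, base + flow)        # new view of the fixed pyramid
        hidden, delta_flow = update_block(hidden, corr_feats, context, flow)
        flow = flow + delta_flow                              # f_k = f_{k-1} + Δf
        predictions.append(flow)
    return predictions
```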

3.1.4 Upsample layer

The optional upsample layer consists of two convolutions and takes as input the hidden state of the update block. If the original image input size is (H × W), flow is estimated at (H/8 × W/8). Since we upsample by a factor of 8, every low-resolution flow vector is replaced by a field of 8 × 8 high-resolution flow vectors. We want every high-resolution flow vector to be a convex combination of the 9 neighboring low-resolution flow vectors. The upsampling layer outputs a mask of size (9 × 8 × 8 × H/8 × W/8), where the first dimension represents the 9 neighboring flow vectors and the second and third dimensions represent the field of high-resolution vectors. A softmax over the first dimension creates, for each high-resolution vector, the weights of the convex combination of the neighboring low-resolution vectors.
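A sketch of this convex upsampling step (following the scheme described above; the function is illustrative rather than the thesis code) is shown below. The mask is the raw output of the two convolutions; the softmax turns each column of 9 weights into a convex combination:

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask):
    """Upsample (B, 2, H/8, W/8) flow to (B, 2, H, W) with learned convex combinations.

    mask: (B, 9*8*8, H/8, W/8) raw weights predicted by the upsample layer.
    """
    b, _, h, w = flow.shape
    mask = mask.view(b, 1, 9, 8, 8, h, w)
    mask = torch.softmax(mask, dim=2)                        # convex weights over the 9 neighbors

    # 3x3 neighborhoods of the low-resolution flow, scaled to high-resolution units.
    up_flow = F.unfold(8 * flow, kernel_size=3, padding=1)   # (B, 2*9, H/8*W/8)
    up_flow = up_flow.view(b, 2, 9, 1, 1, h, w)

    up_flow = (mask * up_flow).sum(dim=2)                    # (B, 2, 8, 8, H/8, W/8)
    up_flow = up_flow.permute(0, 1, 4, 2, 5, 3)              # (B, 2, H/8, 8, W/8, 8)
    return up_flow.reshape(b, 2, 8 * h, 8 * w)
```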

3.2 Unsupervised Training

In unsupervised training of optical flow, there are two important terms in the total loss: the photometric loss and the smoothness loss. To improve learning from complex and difficult situations without the use of labels, ARFlow uses self-supervision by way of augmentations to guide training. The photometric loss is calculated at every iteration, while the smoothness loss and self-supervised loss are only calculated at the final iteration. However, experiments are also done on calculating smoothness at multiple iterations. Each iteration and loss term is given a weight. The final loss function is thus as follows:

$$\mathcal{L} = \sum_{i=1}^{N} w_i \left( \mathcal{L}_{ph} + \rho_i \lambda_{sm} \mathcal{L}_{sm} \right) + \lambda_{aug} \mathcal{L}_{aug} \tag{3.1}$$

where, in the standard case of only applying smoothness loss at the final iteration:

$$\rho_i = \begin{cases} 1, & \text{if } i = N \\ 0, & \text{otherwise} \end{cases} \tag{3.2}$$
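As a sketch, the loss of equation 3.1 can be assembled from per-iteration terms as follows (the λ values here are placeholders, not the weights used in the thesis):

```python
def total_loss(photo_losses, smooth_losses, aug_loss, weights,
               lambda_sm=50.0, lambda_aug=0.01, smooth_all_iters=False):
    """Combine per-iteration losses as in equation 3.1.

    photo_losses, smooth_losses: lists of scalar tensors, one per refinement iteration.
    weights: per-iteration weights w_i.
    """
    n = len(photo_losses)
    loss = 0.0
    for i, (w, l_ph, l_sm) in enumerate(zip(weights, photo_losses, smooth_losses)):
        rho = 1.0 if (smooth_all_iters or i == n - 1) else 0.0   # equation 3.2
        loss = loss + w * (l_ph + rho * lambda_sm * l_sm)
    return loss + lambda_aug * aug_loss
```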


3.2.1 Photometric Loss

As explained in section 2.1.3, the photometric loss consists of backward warping all pixels from frame I1 using the optical flow f12 to construct a warped image Ĩ12, which, if the optical flow is correct, should closely resemble frame I2. A pixel-wise comparison is done between the warped image Ĩ12 and the target frame I2, where the loss is multiplied with the occlusion mask to zero out occluded pixels.

Our proposed model uses the census loss and generates an occlusion map at the most refined iteration using the forward-backward consistency check. Since we need forward and backward flow for the consistency check, we calculate losses for both directions. We define the census transform of pixel p as CT(p) and the Hamming distance between x1 and x2 as HAM(x1, x2). The formulation for the one-way photometric loss then becomes:

$$\tilde{I}_{12}(p) = I_1(p + f_{12}(p)) \tag{3.3}$$

$$\mathcal{L}_{ph} = \sum_{p} \text{HAM}\left(\text{CT}(\tilde{I}_{12}(p)),\ \text{CT}(I_2(p))\right) \cdot O_{12}(p) \tag{3.4}$$

3.2.2 Smoothness Loss

Either first- or second-order edge-aware smoothness is used, calculated after upsampling the output flow. Which order is used depends on the dataset being trained on. Although for most training data first-order smoothness is used, for real footage such as KITTI, second-order smoothness results in better performance [19, 20]. The smoothness loss can be formulated as follows:

$$\mathcal{L}_{sm} = \sum_{p} |\nabla_x f_{12}(p)| \cdot \exp\left(-|\nabla_x I(p)|\right) + |\nabla_y f_{12}(p)| \cdot \exp\left(-|\nabla_y I(p)|\right) \tag{3.5}$$
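A minimal sketch of the first-order variant of equation 3.5 (the reduction to a mean and the per-channel averaging of image gradients are simplifications of my own):

```python
import torch

def edge_aware_smoothness(flow, img):
    """First-order edge-aware smoothness in the spirit of equation 3.5.

    flow: (B, 2, H, W) upsampled flow. img: (B, C, H, W) first frame.
    """
    # Spatial gradients of the flow field.
    flow_dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs()
    flow_dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs()
    # Image gradients, averaged over color channels, act as edge weights.
    img_dx = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(dim=1, keepdim=True)
    img_dy = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(dim=1, keepdim=True)
    # Down-weight the smoothness penalty where the image has strong edges.
    return (flow_dx * torch.exp(-img_dx)).mean() + (flow_dy * torch.exp(-img_dy)).mean()
```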

3.2.3 Self-Supervised Loss

A teacher-student model is used as in ARFlow [20]. Optical flow is first estimated normally, resulting in the teacher output flow ft. Photometric and smoothness loss are calculated on this output flow. Next, augmentations are applied and, if needed, as in the case of spatial transformations, the teacher output flow ft is augmented to match. Flow is now estimated on the augmented sample, which should introduce new challenges; however, we have the teacher output flow ft to guide the model and learn from these difficult inputs. Two types of augmentations are used: spatial transformation and appearance transformation. The spatial transformation consists of a random affine transformation where the image samples as well as the teacher output flow ft and occlusion mask Ot are transformed. The appearance transformation consists of random jitter, blur and gamma changes. As these operations do not affect the flow, only the image samples are transformed.

Once the student output flow fs of the augmented samples is estimated, a loss is calculated using the teacher flow as supervision, where ST() is the spatial transformation:

$$\mathcal{L}_{aug} = \left( \| \text{ST}(f_t) - f_s \| + 0.01 \right)^{0.4} \cdot \text{ST}(O_t) \tag{3.6}$$

Gradients are stopped at the teacher output flow, and the weight for the augmentation loss λaug is set to a low value.
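A sketch of the self-supervision term of equation 3.6, with spatial_transform standing in for the (not shown) affine warping helper and the teacher tensors detached to mimic the stopped gradients:

```python
import torch

def augmentation_loss(teacher_flow, teacher_occ, student_flow, spatial_transform):
    """Self-supervised loss of equation 3.6 (sketch).

    teacher_flow: (B, 2, H, W) flow estimated on the original sample.
    teacher_occ:  (B, 1, H, W) occlusion mask for that flow.
    student_flow: (B, 2, H, W) flow estimated on the augmented sample.
    spatial_transform: callable applying the same affine transform as the augmentation.
    """
    target = spatial_transform(teacher_flow).detach()        # stop gradients at the teacher
    valid = spatial_transform(teacher_occ).detach()
    diff = torch.norm(target - student_flow, dim=1, keepdim=True)
    return (((diff + 0.01) ** 0.4) * valid).mean()           # robust penalty, masked
```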


Chapter 4

Experiments

In the pursuit of creating an unsupervised version of the RAFT model, we are essentially taking an effective unsupervised loss and training setup and combining it with an accurate lightweight model. To analyze the effectiveness of this combination, we provide comparisons with each of the isolated components. However, simply combining the two does not guarantee adequate results, and such is the case for unsupervised RAFT. To find the cause of the issues we are faced with, we must compare our situation with the situation where the unsupervised loss works correctly and with the situation where RAFT works correctly. This means that two comparisons are important: firstly, the comparison between RAFT and the model the unsupervised loss is known to work well with, PWC-Net; secondly, the comparison between the supervised loss successfully used by RAFT and the unsupervised loss we are trying to apply. By finding the important differences, we can determine possible solutions and confirm them through our experiments.

4.1 Data

As mentioned, obtaining dense flow fields is difficult for real life footage. Therefore, most datasets concerning optical flow consist of synthetic data. In recent literature 4 datasets have been proven to be the most useful and are therefore also used here, to provide a proper and fair comparison. The four datasets are as follows:

4.1.1 Flying Chairs

Dosovitskiy et al. created the Flying Chairs dataset as part of FlowNet [8]. It consists of 22872 image pairs and corresponding flow fields. The images feature 3D renders of different types of chairs, floating in front of an image background. Both the chairs and the background are moved in random directions. This dataset is very simple in nature and is most often used as the first step in training [29, 19, 31]. It is supposed to help learn the basics of flow estimation, while not providing too much of a challenge.

4.1.2 Flying Things

Similar to Flying Chairs, this dataset consists of around 25000 image pairs, each containing multiple everyday items ranging from lamps to headphones to motorcycles. All objects move along random 3D trajectories. The background scenery is also generated: it consists of a plane populated with 200 randomly chosen and textured shapes, selected from cuboids and deformed cylinders. The scenes are very busy and populated, with random movement all around. Although the scenery is quite abstract, this dataset provides a hefty challenge. For this reason Flying Things is often used as the second step in the training regimen for supervised optical flow estimation [29, 31]. Because of the huge size of the dataset, which is close to 1 TB, we have chosen not to train on it.

4.1.3 Sintel

Created with Blender, an open-source 3D creation application, Sintel is an open-source animated short film. It is created by the community, much like open-source software. Because of this, there is free access to all assets, which led Butler et al. [5] to create an optical flow dataset using the scenes provided by the short film. Sintel features all kinds of movement in a natural setting, which makes it a good contender for comparisons to real-life situations. The dataset is split into two categories: final and clean. The final and clean passes consist of the same scenes; however, the clean pass has many effects and particles removed so as to make flow estimation easier.


Evaluation is done by measuring the average end-point-error between the output flow and the ground truth. Most of our evaluation will be done on this dataset.

4.1.4 KITTI

Using real-life camera footage from a driving car, KITTI is one of the few datasets that feature real footage. Although the footage has a lot in common with everyday camera footage, with distortions, noise and lighting difficulties, it is not a good representation of normal footage, as most of the movement is horizontal and the motion in general is very unlike that of a handheld camera. On top of this, the ground truth optical flow field is not dense. Evaluation uses multiple metrics; however, the most important metric is Fl, the percentage of optical flow outliers.

4.2 General Setup

We first pre-train the model using supervision on the Flying Chairs dataset for 100k iterations, following the procedures outlined in RAFT. This is to set a baseline as well as to get the first stage of training out of the way, so that we can start training with occlusion estimation based on the forward-backward consistency check. We apply normalization to the image features before they are used in the cost volume, per the recommendation of [19]; further analysis of this choice is done in experiment 2. Evaluation is done by fine-tuning on the Sintel dataset. The images are used at full resolution and normalized to have a mean of zero. We apply the following augmentations during training: random swapping of left and right images and random horizontal flips. We use a batch size of 8 unless stated otherwise, while training for 120k iterations. The learning rate is set to 10⁻⁴ for the first 100k iterations, after which it is exponentially decreased to 10⁻⁶ over the last 20k iterations. This learning rate schedule was inspired by [19] and was chosen here because a fixed learning rate did not seem to converge properly and regular step decay still led to over-fitting.

As training for KITTI usually requires its own set of hyperparameters, evaluation for the experiments is done only on Sintel to limit the search space of this thesis. The evaluation metric is the average end-point error (EPE) between the output flow and the ground truth, as given by equation 4.1. Final evaluation is done on both Sintel and KITTI; however, for both datasets the test set ground truth is not available, so evaluation is done on the training set. For KITTI, we train on an unlabeled extension of the training data and thus have a separation of training and evaluation data. For Sintel, however, we train and evaluate on the full training set. This is done because other literature does so as well, and doing the same makes for a fair comparison.

$$\text{EPE}\big(f_{gt}(u_{gt}, v_{gt}),\ f_o(u_o, v_o)\big) = \sqrt{(u_o - u_{gt})^2 + (v_o - v_{gt})^2} \tag{4.1}$$
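For reference, the average end-point error of equation 4.1 can be computed as in the following sketch, with an optional validity mask for sparse ground truth such as KITTI:

```python
import torch

def average_epe(flow_pred, flow_gt, valid=None):
    """Average end-point error between predicted and ground truth flow.

    flow_pred, flow_gt: (B, 2, H, W). valid: optional (B, 1, H, W) mask.
    """
    epe = torch.norm(flow_pred - flow_gt, dim=1, keepdim=True)  # per-pixel EPE
    if valid is None:
        return epe.mean()
    return (epe * valid).sum() / valid.sum().clamp(min=1)
```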

4.3 Iterations

The iterative refinement process of RAFT poses problems when used with an unsupervised loss. This section aims to explore what exactly those problems are and what possible solutions may be. A notable property of the unsupervised loss is its complexity: because of its multiple components, it requires more computational power and memory during training. For this reason, early experiments focused on the number of iterations to use and found that training with 6 iterations instead of 12 not only decreased memory requirements but also increased performance. In experiment 1, more experimentation is done around iterations; however, all other experiments use 6 iterations, unless stated otherwise.

Experiment 1: Iteration Weighting

In this experiment, we will look at different methods of assigning weights to the different iterations. There are three major differences to consider when comparing the unsupervised with the supervised loss. The first notable difference is that the unsupervised loss is far more complex and thus requires more computational power and memory during training. This means that while calculating the loss for many iterations scales easily with a supervised loss, it can pose memory problems with an unsupervised loss. The second and far more important difference is the precision of the loss functions. The supervised loss is extremely precise; importantly, it is also precise about the scale of the error. While the unsupervised loss can correctly reflect that a flow vector is wrong, it cannot reflect how wrong, as the photometric loss can only reflect how different a pixel is from the warped pixel. How far away that warped pixel is from the actual corresponding pixel is not, and cannot be, reflected in the loss. If a flow vector points to a pixel 20 pixels off from the ground truth, or to a pixel of the same intensity 100 pixels off, the supervised loss will show a greater penalty for the latter, while the unsupervised loss cannot discern between the two situations other than through the smoothness loss.


On top of that, even the ground truth flow will result in a non-zero loss because of the brightness constancy assumption and occlusion. This all leads to the photometric loss being very noisy: although performance on the evaluation set increases during training, the loss stays very similar.

Let us first take a look at how the supervised loss is used in RAFT. For every iteration there is an intermediate flow output to which the loss described in equation 2.2 is applied. The loss is multiplied by the weight for each iteration, and gradients are stopped between iterations. In figure 4.1, an overview is given of the raw loss, the weighting used and the final loss per iteration. Even though the first few iterations are given a very small weight, because the flow is less refined and the loss thus significantly higher, these first few iterations still contribute a big portion of the total loss. This is possible because, as mentioned before, the supervised loss reflects the scale of the error. It is precisely here where the unsupervised loss shows its differences. The error is much bigger in the first few iterations; however, this is not reflected in the unsupervised loss, as shown by figure 4.1. If we naively used the same weighting scheme as RAFT, the first few iterations would contribute only a small portion of the total loss, whereas in the supervised setting these iterations account for a greater fraction of the total loss. This is problematic, as it means that the update operator would mostly learn to improve updates at the finer iterations and not improve as much on the earlier iterations.

Experimental Setup

To show this, we run several experiments with different weighting schemes and evaluate the performance on Sintel. The weighting schemes used are as follows:

1. Exponential: the weighting scheme used by RAFT.
2. Fixed: all iterations use the same weight of 1.
3. Exponential Boost Coarse: the exponential weighting scheme, but with the first 2 iterations assigned a higher weight.
4. Parabolic: weights follow a parabolic curve, 0.05(x − N/2)² + 2, and are then normalized to have a sum of 4 (see the sketch after this list).
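The four schedules could be generated as in the sketch below; the decay factor γ and the boost value are assumptions for illustration, as they are not listed at this point in the text:

```python
import numpy as np

def iteration_weights(scheme, n_iters=6, gamma=0.8):
    """Per-iteration loss weights for the four schemes of experiment 1 (sketch)."""
    if scheme == "exponential":            # RAFT-style: w_i = gamma^(N - i)
        w = gamma ** np.arange(n_iters - 1, -1, -1, dtype=float)
    elif scheme == "fixed":
        w = np.ones(n_iters)
    elif scheme == "exp_boost_coarse":     # exponential, but boost the first two iterations
        w = gamma ** np.arange(n_iters - 1, -1, -1, dtype=float)
        w[:2] += 0.5                       # illustrative boost value
    elif scheme == "parabolic":
        x = np.arange(n_iters, dtype=float)
        w = 0.05 * (x - n_iters / 2) ** 2 + 2
        w = 4 * w / w.sum()                # normalize the weights to sum to 4
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w
```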

The first thing we noticed is that, while still using the exponential weighting scheme, using fewer iterations improved performance significantly. There are two hypotheses as to why that is. Firstly, since the unsupervised loss is quite noisy, applying it to too many iterations might result in the gradients counteracting each other. As lowering the number of iterations also solves the aforementioned problem of the high memory and computation cost of the unsupervised loss, all further experiments use 6 iterations during training unless stated otherwise.

The second hypothesis is as follows: with the exponential weighting scheme w_i = γ^(N−i), using a lower N (the number of iterations) gives iteration 1 a higher weight, providing more supervision to the early iterations. Based on this hypothesis, and on the weighted loss of supervised RAFT in figure 4.1, we introduce two new weighting schemes. Exponential Boost Coarse follows the RAFT weighting scheme but gives more weight to the first 2 iterations, to more closely resemble the weighting distribution seen in supervised learning. Parabolic weighting abandons exponential weighting completely to shape the weighted loss more aggressively into the distribution seen in supervised learning; the parameters of the parabolic curve are chosen to serve this purpose. Regular exponential weighting serves as a baseline, while fixed weighting is considered as it is the weighting used by ARFlow and UFlow on PWC-Net's pyramid levels.

Results

Weighting scheme      Sintel (clean)   Sintel (final)
Exponential (b)       3.81             4.88
Fixed                 3.80             5.04
Exp. Boost Coarse     3.73             4.88
Parabolic             3.60             4.75

Table 4.1: Results of the iteration weighting experiment. Exponential (b) is used as a baseline.

The results can be seen in table 4.1. Taking exponential weighting as our baseline, we can see that a fixed weighting scheme decreases performance significantly. Although the earlier iterations are now better represented in the total loss, the final iterations are under-represented, so learning focuses on improving the coarser iterations. Exponential Boost Coarse increases performance on the clean pass, which indicates that giving higher weights to the early iterations can help the refinement learned at the first few iterations. If we keep a high weight on the final iterations while also boosting the coarser part of refinement, we might strike a good balance. This is exactly what parabolic weighting does, and the results confirm it: we see an improvement on both the clean and the final pass.

Figure 4.1: Left: supervised loss, right: unsupervised loss. The difference in raw loss between iterations is greater for the supervised loss; using the same weighting scheme therefore leads to larger differences in the weighted unsupervised loss.

Experiment 2: Cost Volume Normalization

This experiment looks into a technique that might improve gradient flow: cost volume normalization. Jonschkowski et al. experiment with normalizing features before using them to construct the cost volume. They argue this helps address the problem of vanishing feature activations when only applying loss at the finest levels of flow. However, their situation is different from ours, as they use PWC-Net and are limited by the small image size of the coarsest pyramid levels. This poses a problem for unsupervised learning, as photometric consistency and other objectives work better at higher resolutions. RAFT does not suffer from this problem, since all intermediate flow outputs are at the same resolution. However, the output flow at earlier iterations is still very unrefined, and it might be beneficial to only apply loss at the more refined iterations. Because RAFT uses a recurrent update block, the more refined iterations are also more likely to suffer from vanishing gradients. Even when still applying loss at the intermediate flow outputs, cost volume normalization could help with gradient propagation from the refined iterations.
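A sketch of what cost volume normalization could look like in a RAFT-style all-pairs correlation is shown below. The exact normalization axes used by Jonschkowski et al. may differ, so the statistics here (per sample, over channel and spatial dimensions) are an assumption.

    import torch

    def normalize_features(fmap, eps=1e-6):
        # Zero mean / unit variance per sample, over channel and spatial dimensions.
        mean = fmap.mean(dim=(1, 2, 3), keepdim=True)
        var = ((fmap - mean) ** 2).mean(dim=(1, 2, 3), keepdim=True)
        return (fmap - mean) / (var.sqrt() + eps)

    def correlation_volume(fmap1, fmap2, use_cvn=True):
        # All-pairs correlation volume of shape (B, H*W, H, W), RAFT-style.
        if use_cvn:
            fmap1, fmap2 = normalize_features(fmap1), normalize_features(fmap2)
        b, c, h, w = fmap1.shape
        corr = torch.einsum("bcn,bcm->bnm", fmap1.view(b, c, -1), fmap2.view(b, c, -1))
        return (corr / c ** 0.5).view(b, h * w, h, w)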

Experimental Setup

To investigate whether cost volume normalization (CVN) improves gradient flow when the loss is not applied at every iteration, we use the following setups, all with cost volume normalization enabled:

1. Train with 6 iterations, apply loss at all iterations

2. Train with 8 iterations, apply loss at the finest 6

3. Train with 10 iterations, apply loss at the finest 8

4. Train with 12 iterations, apply loss at the finest 6

5. Train with 12 iterations, apply loss at iterations {1, 2, 4, 6, 8, 10, 12}

Firstly, we train with the regular 6 iterations and apply loss at each iteration, to show that cost volume normalization does not decrease overall performance in the normal setting. Setups 2 to 4 investigate whether we can skip backpropagation at the coarser iterations when using cost volume normalization. Setup 5 is meant to give insight into the effectiveness of sparse backpropagation: here we only backpropagate at 7 of the 12 iterations. If we accept the argument made in experiment 1 that 12 iterations could provide conflicting gradients, reducing the number of backpropagated iterations could solve this problem. This way we can obtain the refined flow of 12 iterations while still using few backpropagations.
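Setups 2 to 5 amount to masking which iterations receive a loss. A minimal sketch is given below, reusing the photometric_loss function from the earlier sketch and assuming the predictions have already been upsampled to image resolution.

    def masked_loss(flow_preds, img1, img2, weights, loss_iters):
        # flow_preds: intermediate predictions already upsampled to image resolution.
        # loss_iters: 1-based set of iterations that receive a loss (and thus gradients).
        total = 0.0
        for i, flow in enumerate(flow_preds, start=1):
            if i in loss_iters:
                total = total + weights[i - 1] * photometric_loss(img1, img2, flow)
        return total

    # Setup 5: 12 refinement iterations, loss applied at 7 of them.
    # loss = masked_loss(preds, img1, img2, w, loss_iters={1, 2, 4, 6, 8, 10, 12})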

Results

The results shown in table 4.2 indicate that using cost volume normalization with 6 iterations and applying loss at every iteration increases performance. However, looking at setups 2 to 5, performance decreases significantly when backpropagation is not applied at all used iterations. CVN might help gradient flow through the cost volume, but it does not seem to be enough to allow skipping backpropagation at some of the iterations. Setups 2 to 4 show that gradients propagating from the finer iterations alone are not enough to properly train the network; backpropagation is also needed at the coarser iterations, where vanishing gradients are less of an issue. Setup 5 shows that even when we backpropagate at intermediate iterations, the finer iterations still suffer from vanishing gradients and thus have less effect on training than the earlier iterations would. We conclude that applying loss at all iterations is important for an iterative model that relies on input from previous iterations.

CVN   Iterations   Loss at iterations           Sintel (clean)   Sintel (final)
off   6            all                          3.74             4.82
on    6            all                          3.68             4.80
on    8            {3, 4, 5, 6, 7, 8}           4.09             5.12
on    10           {3, 4, 5, 6, 7, 8, 9, 10}    4.34             5.46
on    12           {7, 8, 9, 10, 11, 12}        8.21             8.45
on    12           {1, 2, 4, 6, 8, 10, 12}      4.16             5.43

Table 4.2: Evaluation results of applying cost volume normalization in different settings.

Figure 4.2: An example of artifacts caused by bilinear upsampling to 8 times the original resolution.

4.4 Upsample Layer

A notable difference between UnRAFT and other optical flow networks such as PWC-Net and FlowNet is that while they both estimate optical flow at 1/4th of the image resolution, UnRAFT does so at 1/8th of the image resolution. Bilinear upsampling is used to upsample the flow to the image resolution; however, this introduces artifacts and imperfections around object edges, which get more pronounced the more upsampling needs to be done. An example of these artifacts can be seen in figure 4.2. The unsupervised loss is most precise around object edges, so it makes sense to also want our output flow to give precise estimations around these edges. This way we can get more precision out of the loss.

The upsampling layer, as explained in section 3.1.4, needs to produce a mask with a large number of channels. Even 2 convolution layers can therefore introduce a lot of new parameters. Because of this, the paper introducing RAFT does not use the upsampling layer for the small model, as it adds 0.5M parameters to the normal model, which is far too much considering the rest of the small model is only 1M parameters. However, by reducing the hidden dimension from 256 to 192, we already reduce the size of the upsampling layer to 0.25M parameters.
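A sketch of such a learned upsampling layer, in the style of RAFT's convex upsampling, is given below. The input channel count is an assumption (the GRU hidden state, 96 channels in RAFT-S), and hidden_dim is the bottleneck width that dominates the parameter count and is varied in the next experiment.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvexUpsampler(nn.Module):
        # Learned 8x upsampling: a mask head predicts, for every fine pixel,
        # convex combination weights over the 3x3 coarse neighbourhood.
        def __init__(self, in_dim=96, hidden_dim=192):
            super().__init__()
            self.mask_head = nn.Sequential(
                nn.Conv2d(in_dim, hidden_dim, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden_dim, 8 * 8 * 9, 1),
            )

        def forward(self, hidden, flow):
            b, _, h, w = flow.shape
            mask = self.mask_head(hidden).view(b, 1, 9, 8, 8, h, w)
            mask = torch.softmax(mask, dim=2)                  # convex weights over each 3x3 neighbourhood
            # Gather 3x3 neighbourhoods of the coarse flow; scale vectors by 8 for full resolution.
            nbrs = F.unfold(8 * flow, kernel_size=3, padding=1).view(b, 2, 9, 1, 1, h, w)
            up = torch.sum(mask * nbrs, dim=2)                 # (B, 2, 8, 8, H, W)
            return up.permute(0, 1, 4, 2, 5, 3).reshape(b, 2, 8 * h, 8 * w)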

Experiment 3: Upsample Layer

This experiment explores how small we can make the upsampling layer, adding as few parameters as possible to the model. Although RAFT has shown that the upsampling layer can improve accuracy, especially around motion boundaries, we have another important concern: model size. To minimize the downside of the extra parameters the upsampling layer brings, we seek the smallest possible addition that still gains as much accuracy as possible. Parameter size can be reduced by using a smaller hidden dimension for the upsampling layer, though we expect a trade-off with accuracy.


Figure 4.3: Output flow of the upsample layer model using 192 hidden dimensions. Notice the distortion in the upper left corner. Underneath each output flow is a magnified image of the 40 × 40 region in the upper left corner.

Upsample layer   Hidden dimension   Sintel (clean)   Sintel (final)
off              -                  3.74             4.82
on               128                3.75             4.83
on               192                3.64             4.78
on               256                3.70             4.90

Table 4.3: Evaluation results of the upsample layer experiment with varying hidden dimension.

Experimental setup

The setup is fairly straightforward: we train RAFT as per the general setup, but add an upsampling layer with varying hidden dimension. The hidden dimensions used are {128, 192, 256}.

Results

During training we noticed the model slowly diverging after around 30-40k iterations. Upon closer inspection of the output flows, we noticed a strange distortion in the corner of each flow field that seemed to grow as training continued. Examples of this distortion can be seen in figure 4.3. When examining flow outputs at different stages of training, we noticed that this distortion also starts occurring early; at first, however, it is not present in all flow outputs and is spread more around the image edges. At the end of training the distortion is present in all outputs and always in a 5x5 region in the upper left corner of the non-upsampled flow. The distortion is not present in the first iteration and develops more strongly as the iterations progress. This is an issue that obviously needs to be fixed; however, experimenting with replacing the affected region with flow from the first iteration, where the distortion is not yet present, shows that it only impacts the final EPE score on the scale of 0.01 to 0.03. The results can be seen in table 4.3. Score-wise, 192 hidden dimensions performs best. Improvement is seen especially on the clean pass, where object boundaries are more clearly defined as they are not obscured by overlaid effects.
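For reference, the patch used to test the impact of the distortion is a simple corner replacement on the non-upsampled flow. This is a diagnostic sketch only, not part of the model.

    def patch_corner(flow_final, flow_first, size=5):
        # Copy the undistorted upper-left corner of the first-iteration flow
        # (at 1/8 resolution) over the distorted region of the final flow.
        patched = flow_final.clone()
        patched[:, :, :size, :size] = flow_first[:, :, :size, :size]
        return patched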

Experiment 4: Smoothness Loss

The smoothness loss can be calculated at different resolution levels. Since the flow is estimated at 1/8th of the original resolution and then upsampled, we could apply the smoothness loss before or after upsampling. Jonschkowski et al. argue that the smoothness loss should be applied before bilinear upsampling [19], as the bilinearly upsampled flow might have edges that are misaligned with the original image. In our case, however, the upsampling layer should upsample the flow in such a way that the edges of the output flow align with the input image. Thus, we apply the smoothness loss after upsampling.

Jonschkowski et al. also argue that photometric consistency and other objectives work better at higher resolutions, and thus only apply the smoothness loss at the final pyramid level, which has the highest-resolution output flow. ARFlow does the same, although its reasoning is not mentioned. Since UnRAFT's intermediate flow outputs are all the same size, we do not have the issue that only our final iteration is at the highest resolution; we can therefore experiment with applying the smoothness loss at all iterations, which can help supervision at intermediate iterations and lead to better refinement.
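For illustration, a first-order edge-aware smoothness term on the upsampled flow can be sketched as below. The exponential edge weighting and its coefficient are assumptions for this sketch; the formulation actually used follows section 3.2.2.

    import torch

    def edge_aware_smoothness(img, flow, edge_weight=150.0):
        # First-order smoothness on the upsampled flow, down-weighted at image edges.
        dx = lambda t: (t[:, :, :, :-1] - t[:, :, :, 1:]).abs()
        dy = lambda t: (t[:, :, :-1, :] - t[:, :, 1:, :]).abs()
        w_x = torch.exp(-edge_weight * dx(img).mean(dim=1, keepdim=True))
        w_y = torch.exp(-edge_weight * dy(img).mean(dim=1, keepdim=True))
        return (w_x * dx(flow)).mean() + (w_y * dy(flow)).mean()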


ws   Only refined   Sintel (clean)   Sintel (final)
16   no             3.58             4.78
32   no             3.60             4.67
50   no             3.66             4.64
50   yes            3.58             4.63

Table 4.4: Evaluation results of the smoothness experiment. ws is the weight given to the smoothness loss. Only refined indicates whether the smoothness loss is only applied at iterations 3 and later.

Upsample layer   Parameters   Sintel (clean)   Sintel (final)
On               1.27M        3.58             4.63
Off              0.99M        3.81             4.88
Training only    0.99M        3.72             4.74

Table 4.5: Evaluation results of the disable upsample layer experiment.

Experimental Setup

We train using the general setup and apply the smoothness loss to every iteration's upsampled flow. The smoothness loss is multiplied by the same iteration weight that is used for the photometric loss. Because this introduces more smoothness loss overall, we also experiment with the smoothness weight, which was previously set to 50. Lastly, we also experiment with not applying the upsampling layer and the smoothness loss at the first 2 iterations, as the output flow there is so coarse that it might misguide the upsampling layer into optimizing the refinement of very unrefined flow.
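Putting the pieces together, the per-iteration training loss in this experiment can be sketched as follows, reusing the earlier photometric and smoothness sketches; skip_smooth_first=2 corresponds to the "only refined" setting, where the smoothness term is applied from iteration 3 onwards.

    def training_loss(upsampled_preds, img1, img2, weights, ws=50.0, skip_smooth_first=2):
        # Photometric + smoothness loss per iteration, both scaled by the iteration weight.
        loss = 0.0
        for i, flow in enumerate(upsampled_preds):
            loss = loss + weights[i] * photometric_loss(img1, img2, flow)
            if i >= skip_smooth_first:
                loss = loss + weights[i] * ws * edge_aware_smoothness(img1, flow)
        return loss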

Results

As seen in table 4.4, there is a trade-off: increasing the smoothness weight improves the score on the final pass but decreases the score on the clean pass. However, when we skip both the upsampling layer and the smoothness loss at the first two iterations, performance improves on the clean pass as well as on the final pass.

Experiment 5: Removing the Upsample Layer

This experiment investigates whether we can use the upsampling layer only during training, remove it for inference, and still improve accuracy compared to not having used the upsampling layer at all. With model size as a concern, this is an even more extreme reduction in the parameter cost of the upsampling layer: removing it completely during inference. The idea is that the upsampling layer can aid training, especially since much of the photometric loss comes from object edges and motion boundaries; using the upsampling layer to increase accuracy in these regions makes the photometric loss more accurate. Once training is complete, we should be able to remove the upsampling layer while still benefiting from the improved loss accuracy we had during training.
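At inference this amounts to a simple fallback: if the learned upsampler is dropped, the 1/8-resolution flow is bilinearly upsampled and its vectors rescaled. A minimal sketch:

    import torch.nn.functional as F

    def upsample_flow(flow, upsampler=None, hidden=None):
        # Use the learned upsampler when present; otherwise fall back to bilinear 8x upsampling.
        if upsampler is not None and hidden is not None:
            return upsampler(hidden, flow)
        # Flow vectors must also be rescaled by the upsampling factor.
        return 8 * F.interpolate(flow, scale_factor=8, mode="bilinear", align_corners=False)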

Experimental Setup

We take the parameters from the best performing model from experiment 4, disable the upsampling layer and evaluate it on Sintel.

Results

The results can be seen in table 4.5. As expected, keeping the upsampling layer enabled gives the best performance, but more importantly we see improved results even when it was only used during training. This means that when model size is of high importance, the upsampling layer can be used during training and removed for inference, avoiding its extra parameters while still benefiting from a performance increase.

4.5 Qualitative analysis

One of the primary reasons to use RAFT for unsupervised learning is that all current state-of-the-art methods use the pyramidal architecture, which, as argued by Teed and Deng [31], can miss small, fast-moving objects and has trouble correcting early mistakes. To show that our unsupervised version of RAFT is able to overcome these flaws of the pyramidal architecture, we perform a qualitative analysis in which we compare the output of our model with that of the current best-performing unsupervised model, ARFlow [20].
