
Temporally Stable

Automatic Video Colorization

Lysander G.B. de Jong

11788674

Bachelor thesis

Credits: 18 EC

Bachelor Kunstmatige Intelligentie

University of Amsterdam

Faculty of Science

Science Park 904

1098 XH Amsterdam

Supervisor

MSc. Jiaojiao Zhao

Informatics Institute

Faculty of Science

University of Amsterdam

Science Park 904

1098 XH Amsterdam

June 26th, 2020


Abstract

Independently colorizing each frame of a gray-scale image sequence results in temporal flickering due to changes in coloring from frame to frame. Recent works have utilized optical flow, temporal convolutions, regularization, and attention to improve temporal consistency in other computer vision applications. This work combines attention and regularization to improve the temporal consistency of color in video by anchoring a reference and diffusing its colors to a video. To color gray-scale video end-to-end, we propose a system consisting of two submodels. The first is a Reference Generating Network, which takes the first frame of the video and outputs a colorized version. This output is then used as the color reference for the second model, the Video Color Propagation Network, which colors one frame at a time based on the reference. The Video Color Propagation Network is evaluated using a custom test set taken from the stock video website Videvo. We show that our video model outperforms contemporary video colorization models in both qualitative and quantitative tests. We also evaluate the Reference Generating Network and find that it quantitatively underperforms compared to state-of-the-art image colorization models on ImageNet, COCO-stuff, and Places365, whilst being visually similar.


Contents

1 Introduction 3

2 Related Work 5

2.1 Single Image Colorization . . . 5

2.1.1 User-Guided Colorization . . . 5

2.1.2 Automatic Image Colorization . . . 5

2.2 Video Colorization . . . 7

2.2.1 Optical Flow based Video Colorization . . . 7

2.2.2 Temporal Convolutions based Video Colorization . . . 7

2.2.3 Regularization based Video Colorization . . . 8

2.2.4 Attention based Video Colorization . . . 8

3 Method 8

3.1 Reference Generating Network . . . 9

3.1.1 Encoding and Decoding Color . . . 9

3.1.2 Network Architecture . . . 10

3.1.3 Training and Optimization . . . 11

3.2 Video Color Propagation Network . . . 12

3.2.1 Network Architecture . . . 12

3.2.2 Attention Mechanism . . . 13

3.2.3 Regularization . . . 13

3.2.4 Training and Optimization . . . 14

4 Results 15

4.1 Reference Generating Network . . . 15

4.1.1 Quantitative Evaluation . . . 15

4.1.2 Qualitative Evaluation . . . 17

4.2 Video Color Propagation Network . . . 19

4.2.1 Quantitative Evaluation . . . 19

4.2.2 Qualitative Evaluation . . . 20

5 Integrating the models 22

6 Conclusion 22

7 Discussion 22

Appendices 28

A Augmentation table 28

B Custom video test set details 28


Figure 1: Examples of the output of the Reference Generating Network.

Figure 2: Examples of the output of the Video Color Propagation Network.

1 Introduction

Colorizing legacy videos is very challenging, though recent deep learning based methods have shown promise on image colorization (R. Zhang et al., 2016; Iizuka et al., 2016; Deshpande et al., 2017; Zhao et al., 2019; Vitoria et al., 2020), by learning the relation between color and semantics of objects. Applying single image colorization techniques to video yields color flickering and temporal instability. Approaches taken to solve this utilize optical flow (Lei & Chen, 2019; B. Zhang et al., 2019) or temporal convolutions (Iizuka & Simo-Serra, 2019) to impose temporal constraints on colorization. The former is computationally expensive, and the latter is memory hungry, limiting both the speed and resolution at which colorization can be performed. This paper proposes a lightweight framework consisting of a single image colorization and a video color propagation model capable of efficient and accurate video colorization.

Traditionally, image colorization methods rely on user input to guide the colorization process, either in the form of scribbles (Levin et al., 2004; Huang et al., 2005; Yatziv & Sapiro, 2006) or references (Welsh et al., 2002; Ironi et al., 2005; Tai et al., 2005). With the advent of deep learning, approaches shifted to convolutional neural networks that leverage large-scale image datasets to learn the relationship between objects and their color. This shift sparked a variety of new architectures. Some incorporate semantic features to gain scene understanding (Iizuka et al., 2016). Others predict a per-pixel color distribution to encompass multi-modality (R. Zhang et al., 2016). These results proved fruitful and were expanded upon and incorporated into generative models (Deshpande et al., 2017; Guadarrama et al., 2017; Zhao et al., 2019; Vitoria et al., 2020). The main problem with applying single image colorization models to video is that color is multi-modal while the model is forced to make a single prediction, even if it is capable of diverse colorizations. A slight change in the input image can therefore lead to a large change in the predicted color. When such a model is used on video, this results in temporal incoherence, usually expressed as flickering and inconsistent colorization. It thus becomes necessary to constrain the prediction of color in order to reduce these temporal artifacts.

Various authors have developed techniques to transition single image colorization models to ones suitable for video (Lai et al., 2018; Eilertsen et al., 2019). Others created models with video colorization in mind from the ground up (B. Zhang et al., 2019; Iizuka & Simo-Serra, 2019; Lei & Chen, 2019). All models for video colorization require a mechanism to keep colors consistent between frames. So far, three different approaches have emerged: optical flow, temporal convolutions, and self-regularization. Optical flow maps motion between two frames but is computationally expensive and prone to errors if two frames show large disparities (Lai et al., 2018; B. Zhang et al., 2019). Temporal convolutions convolve over a series of images, but may still be prone to drifting of the color over time, and their memory cost limits the number of frames that can be processed at once (Iizuka & Simo-Serra, 2019). Self-regularization comprises schemes that force a single image colorization model to color images depicting similar objects in the same way, either through data augmentation (Eilertsen et al., 2019) or more complex loss functions (Lei & Chen, 2019).

In contrast, this paper proposes an end-to-end system that colorizes a gray-scale video automatically. To obtain consistent colorization, a colored reference image is generated to guide the video colorization process. A Video Color Propagation Network establishes a pixel-wise correspondence between the supplied reference and the individual frames of a gray-scale image sequence, and color is diffused from the reference to each frame. Since all frames take their color from the same reference, the fully colored video is temporally stable in its colorization. To generate a high-quality reference from the first frame of a given video, we design a Reference Generating Network. Examples of the output of both networks can be seen in figures 1 and 2.

Both architectures are evaluated and compared against state-of-the-art colorization models. The Video Color Propagation Network is compared against other state-of-the-art video colorization models on a custom test set taken from the stock video website Videvo, while the Reference Generating Network is evaluated against contemporary image colorization models on three public datasets, namely ImageNet (Russakovsky et al., 2015), COCO-stuff (Caesar et al., 2018), and Places365 (Zhou et al., 2017). Both are evaluated on the quantitative metrics PSNR, SSIM, and LPIPS (R. Zhang et al., 2018), followed by a qualitative evaluation.

To summarize, this paper proposes an automatic video colorization system and performs a comprehensive evaluation against the state of the art. Section 2 provides an overview of related work in colorization. Section 3 discusses the architectures of the models and their respective training regimes. Section 4 covers the evaluations of both models. Section 5 describes the integration of the two proposed models, after which a conclusion and suggestions for future work are given.


2 Related Work

This section will briefly cover work by previous researchers relevant to the topic of single image and video colorization.

2.1 Single Image Colorization

2.1.1 User-Guided Colorization

User-guided colorization methods require, to varying degrees, human intervention to operate optimally. Early attempts at single image colorization used scribbles or hints in the form of color points and strokes to govern the colorization process (Levin et al., 2004). Colors from the scribbles are diffused under the assumption that neighboring pixels of similar intensity must have similar colors. Several improvements have since been proposed to reduce edge bleeding, for example utilizing edge information to stop propagation (Huang et al., 2005), using luminance-weighted chrominance blending to reduce the number of scribbles needed (Yatziv & Sapiro, 2006), or propagating color according to texture similarity in the gray-scale image. These techniques for colorization remain labor-intensive.

The second type of method requiring human intervention replicates the colors of a reference image and uses them to colorize a gray-scale image. One of the first approaches matched intensity and texture information between a reference and a gray-scale image but lacked spatial cohesion (Welsh et al., 2002). This method was improved by first segmenting the reference image to generate scribbles and then coloring based on those using the method of Levin et al. (2004) (Ironi et al., 2005). A different technique by Tai et al. (2005) generated segmentations in both the reference and the target and matched regions with similar statistics. Chia et al. (2011) segmented foreground objects and found suitable references by retrieving images from the internet. Another method of color matching was proposed by Gupta et al. (2012), who used corresponding superpixels in the reference and the input to inform their colorization. Lastly, Bugeau et al. (2013) proposed a variational model that chooses colors from the reference based on previous selections.

More recently, with the advent of deep learning, R. Zhang et al. (2017) combined convolutional neural networks, which perform a base colorization, with user-provided color hints to interactively fine-tune the colorization of an image, reducing the manual labor required. A model that works with a reference was created by M. He et al. (2018). They used the method of Liao et al. (2017) to establish a correspondence between the reference and the target, and from this map they predicted the final colors with a U-Net (Ronneberger et al., 2015) convolutional neural network. The aforementioned techniques attempt to establish a mapping between a reference and a gray-scale image in order to transfer color from one to the other. However, reference-based methods suffer from one flaw: the quality of the colorization is strongly linked to selecting a suitable reference, a task which is still challenging (Chia et al., 2011). Scribble-based methods, on the other hand, tend to be labor-intensive and require expert knowledge of colorization to avoid unrealistic results. This is why this paper favors a prior learned from a large-scale database over user intervention for generating the reference for video colorization; anything else would defeat the purpose of automatic colorization.

2.1.2 Automatic Image Colorization

In recent years, learning-based methods have proliferated and have outdistanced classical approaches. Due to increased computing capability and the availability of large-scale datasets, such as ImageNet (Russakovsky et al., 2015), researchers have proposed a variety of models for automatic image colorization.

One of the earliest models utilizing deep learning to color images was proposed by Cheng et al. (2015). Their model consisted of low, mid, and high-level feature descriptions of an image, which were fed into a multilayer perceptron to predict the final chrominance of the image using least-squares minimization. In a different work, the color was predicted by a linear parametric model using quadratic regression on an image dataset (Deshpande et al., 2015). These approaches were improved by involving convolutional neural networks and utilizing larger datasets. Larsson et al. (2016) fed a gray-scale image into a VGG network and created a hypercolumn out of the intermediate layers. From this hypercolumn two distributions, one for lightness and one for chroma, were constructed, and the final colors were sampled from these distributions. Another example, showing the importance of semantic information, jointly predicted the color output and classification by fusing high- and mid-level features (Iizuka et al., 2016). A third example tackled the multi-modal nature of colorization by predicting per-pixel color distributions. This network was trained with cross-entropy and rebalanced the color distributions, allowing uncommon colors to emerge (R. Zhang et al., 2016). Mouzon et al. (2019) improved upon the model of R. Zhang et al. (2016) by changing how colors are selected for the final output; they used a variational approach, yielding higher-quality colorization. Another category of deep learning colorization models are generative models. Isola et al. (2017) showed several applications of image-to-image translation using a conditional GAN, among which was colorization. Nazeri et al. (2018) extended this work to improve speed, resolution, and stability. Cao et al. (2017) also used a conditional GAN, but obtained multiple colorizations by resampling the input noise. Vitoria et al. (2020) took it one step further by jointly predicting color and class using a GAN. It is similar to the method of Iizuka et al. (2016), with the L2 norm in the loss replaced by a wGAN loss. Lastly, Antic (2018) proposed a new training method for GANs, resulting in more stable output suitable for both single image and video colorization.

The second type of generative models are variational auto-encoders (VAE) (Bugeau et al., 2013). Images are encoded into a regularized latent space from which the decoder samples and produces diverse colorized images (Deshpande et al., 2017). These generated images show spatial coherence, but can be blurry when compared to GANs.

The third type of generative models are autoregressive models. These models condition the generation of color on already predicted pixel values, building up the image one pixel at a time. A commonly used backbone for this task is PixelCNN (Van den Oord et al., 2016; Salimans et al., 2017), which has seen some adoption for image colorization (Royer et al., 2017; Guadarrama et al., 2017). These models solve the one-to-many problem by rendering diverse colorizations, but they suffer from spatial inconsistencies as the distance between pixels grows, a type of artifact from which other generative models suffer less. Zhao et al. (2019) improved upon this by incorporating pixel-level semantics, giving more realistic and finer results.

In summary, the advent of deep learning, and in particular convolutional neural networks, has led to a variety of approaches. And while most of the previously discussed approaches already produce satisfactory results, single image colorization is still an open problem. This stems from an issue all papers on the subject acknowledge: there does not yet exist a metric which allows for easy comparison between networks. For now, mostly PSNR and SSIM are used, but as has been pointed out several times, these do not take into account the multi-modality of color, thus favoring approaches that try to reconstruct the color of the original, as opposed to rendering diverse or pleasantly colored images. This paper opts to use the principles behind R. Zhang et al. (2016) while using a different network that is capable of superior semantics (Chen et al., 2018).

2.2 Video Colorization

Video colorization extends the problem of colorization to the temporal dimension. As has already been touched upon, single image colorization models cannot simply be used to colorize a gray-scale video: small changes in the input can lead to large changes in the output, and on top of that lies the multi-modality of color. Taken together, these sources of interference result in inconsistent colors between frames. In recent years, several new approaches to video colorization have been proposed.

2.2.1 Optical Flow based Video Colorization

In the context of temporal consistency, optical flow is a mapping that describes per-pixel motion vectors between two consecutive frames. Using these mappings, a frame can be warped to resemble its neighbors. Optical flow has been incorporated into style transfer networks to enforce temporal consistency (Ruder et al., 2016).

Lai et al. (2018) proposed a post-processing network, which took an already colorized sequence of images and the original sequence and output a temporally consistent sequence. The proposed network was blind and could be used for more applications besides colorization. The network calculated a perceptual and a temporal loss separately. The perceptual loss took layers from a pre-trained VGG network and minimized the distance between the ground truth and the processed image. The temporal loss was measured by warping the previously processed frame to the current one using optical flow and occlusion estimation, and then comparing the current frame to the warped previous frame. This created short-term temporal consistency. For long-term temporal consistency, the processed frame was compared to the first frame.

A more direct approach was taken by B. Zhang et al. (2019). Their network consists of three parts: a correspondence network, a colorization network, and a discriminator. The correspondence network was responsible for creating a mapping between the reference features and the image features, along with a confidence map. The colorization network used these maps and the previously processed frames to guide the coloration of the image. The discriminator was then given the task of determining whether the new frame is consistent with the previous one. While optical flow can enforce temporal stability across multiple frames, calculating these maps can be resource-intensive and can introduce errors if the disparity between consecutive frames is large. This downside is the main reason this technique was not utilized.

2.2.2 Temporal Convolutions based Video Colorization

Future-work sections commonly suggest utilizing temporal convolutions for video colorization. Temporal convolutions are convolutions made over a series of images as opposed to one image at a time. This allows multiple images to be processed at once, making consistent colorization more attainable.

Iizuka & Simo-Serra (2019) proposed a network not only capable of colorizing gray-scale video with references but also of rectifying common artifacts found in vintage footage. The network took in a sequence of gray-scale images and used a 3D convolutional auto-encoder to clean the luminance channel. 3D convolutions were then used again to form a feature representation, and the same was done for a set of reference images. The features were combined with a source-reference attention layer and self-attention layers. Finally, the sequence was upsampled and combined with the cleaned luminance channel to generate the colored output sequence.


Temporal convolutions show a lot of promise, but still suffer some drawbacks. One of these drawbacks is the memory requirement: processing a single image is possible even at higher resolutions, but temporal convolutions require more memory, limiting the resolution at which video can be processed. Whilst this method could have been utilized, it was not pursued due to these memory requirements.

2.2.3 Regularization based Video Colorization

Regularization in deep learning refers to techniques employed to reduce the chance of overfitting on the training data. The most common examples include data augmentation, where more examples are generated, and loss regularization, where models are penalized if their weights stray too far from zero.

Other regularization techniques can be employed to stabilize the output of a convolutional neural network. Eilertsen et al. (2019) proposed that the warping operation, usually performed with optical flow, can be replaced by a simulacrum: a geometric transformation which shifts, shears, and scales the image as if to represent a prior or subsequent frame. Both images are then colorized simultaneously, after which the geometric transformation is inverted. When comparing the two results against one another, a stable CNN should produce identical outputs.

Another regularization technique is diversity. Lei & Chen (2019) proposed a model that produces multiple different colorizations for a given input. This model takes in two consecutive frames and establishes a spatiotemporal link with k-nearest neighbors, which is then optimized using a perceptual loss with diversity to obtain multiple consistent colorizations. To improve their results, they employed a second network in which one frame is warped to resemble the other using optical flow and both frames are used to refine the final result.

Regularization can provide most of the benefits of more expensive methods such as optical flow and temporal convolutions. For this reason, this technique was chosen as part of our method for temporal stability.

2.2.4 Attention based Video Colorization

The attention mechanism was popularized by the seminal Transformer work of Vaswani et al. (2017), which greatly improved on the natural language models of the time. Since then it has seen widespread use in natural language processing. Outside of that use case, it has been found capable of creating correspondences between images.

It was utilized by Yang et al. (2019) to improve the long-term temporal dependencies of video object segmentation, and by Iizuka & Simo-Serra (2019) to selectively transfer the colors from references to a video output.

Taking after these implementations, this paper proposes a network that uses a colorized reference as an anchor from which the rest of the gray-scale video frames are colorized.

3 Method

This paper is interested in the task of automatic, temporally stable video colorization. The proposed method should therefore be able to colorize a video without user intervention and should not exhibit noticeable color flickering. To colorize the first frame and thereby generate a reference, a Reference Generating Network is proposed. We then design a Video Color Propagation Network, whose task is to diffuse the color of the generated reference to a set of video frames. The reason for using a reference is to ensure color stability over time, as all frames draw their color from this same source. The colors from this reference are propagated through the entire sequence, creating a temporally stable colorized video.

In order to reduce the computational burden, the CIELAB color space (McLaren, 1976) is utilized. This color space describes the color of a pixel using three components, similar to the RGB color space, but the components differ. The first component, L, describes the intensity or lightness of a pixel; the lightness channel alone therefore constitutes a gray-scale image. The second channel, a, represents the intensity of colors between green and red, while the third channel, b, expresses the intensity between blue and yellow. The color space was designed as a uniform approximation of the human visual system, in which changes in values correspond to similar perceived changes in an image. Using this color space provides two distinct advantages: it reduces the number of color channels to predict from three to two, and it is perceptually uniform, meaning that distances in Lab color space correspond to perceived distances in the human visual system, a property useful for measuring errors in color prediction.
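As a concrete illustration (our own sketch, not code from the thesis), scikit-image can be used to split an sRGB image into the L channel that serves as network input and the ab channels that serve as the prediction target:

```python
import numpy as np
from skimage import color

def split_lab(rgb: np.ndarray):
    """Split an sRGB image (H x W x 3, values in [0, 1]) into its Lab components."""
    lab = color.rgb2lab(rgb)   # convert to CIELAB
    L = lab[..., :1]           # lightness: the gray-scale network input
    ab = lab[..., 1:]          # chrominance: the two channels to predict
    return L, ab

def merge_lab(L: np.ndarray, ab: np.ndarray) -> np.ndarray:
    """Recombine predicted ab channels with the input lightness and return sRGB."""
    lab = np.concatenate([L, ab], axis=-1)
    return color.lab2rgb(lab)
```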

3.1 Reference Generating Network

The Reference Generating Network functions as a single image colorization model, taking a gray-scale image and outputting a colorized version. The network is fully convolutional and therefore capable of coloring images of any size, although in our approach it only serves to provide the Video Color Propagation Network with a reference. The approach taken is heavily inspired by R. Zhang et al. (2016): the Lab color space is quantized and per-pixel color distributions are predicted with a cross-entropy loss. From these distributions the final color is sampled and combined with the gray-scale input to form a reconstructed Lab image, which is then converted back to sRGB.

3.1.1 Encoding and Decoding Color

Given an image $X_{Lab} \in \mathbb{R}^{H \times W \times 3}$ in Lab color space, where $H$ and $W$ are the image dimensions, the luminance channel $X_L \in \mathbb{R}^{H \times W \times 1}$ is used as the input of the Reference Generating Network. The network is tasked with estimating the remaining channels $(Y_a \in \mathbb{R}^{H \times W \times 1}, Y_b \in \mathbb{R}^{H \times W \times 1})$, which are compared to the ground truth channels $(X_a \in \mathbb{R}^{H \times W \times 1}, X_b \in \mathbb{R}^{H \times W \times 1})$. Together they form a complete colored image $Y_{Lab} = (X_L, Y_a, Y_b)$, which can be transformed back into sRGB color space to be viewable on modern displays.

Given that the Lab color space is perceptually uniform, it would be natural to use the Euclidean distance (L2) loss to measure the difference between the ground truth and predicted colors. However, this is unwise, as the L2 loss cannot handle the multi-modality of color. Take, for example, the task of coloring a gray-scale picture of a flower: given knowledge of the colors of every flower in existence, the L2 norm will predict the average of those colors, resulting in a white or grayish output. The multi-modal color of a flower thus works directly against the L2 loss if the model is to predict colorful results. Man-made artifacts, such as clothing, compound this problem by being both multi-modal in color and less commonly represented in the datasets models learn from.

A way to obtain more vibrant results is therefore to use a classification loss. The classes the model is tasked with predicting are derived by quantizing the ab components of the Lab color space into bins of size 10 × 10. After removing bins which are out of gamut of the sRGB color space, Q = 313 bins are left.

With a classification loss, the model predicts a per-pixel distribution $\hat{Z} \in [0, 1]^{\frac{H}{4} \times \frac{W}{4} \times Q}$ at reduced resolution, exploiting the fact that the human visual system is less sensitive to color than to luminance (Curcio et al., 1990). For every pixel, Q log-probabilities are predicted, corresponding to the ab values associated with the respective bins. There are several ways in which these distributions can be converted back to $Y_{ab} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 2}$: one is to take the annealed mean (R. Zhang et al., 2016), another is to use a variational decoder (Mouzon et al., 2019). This paper uses the annealed-mean approach, which will be explored below.

First, a method is needed to convert the ground truth ab values of an image into a ground truth distribution Z. This is done using the soft encoding scheme proposed by R. Zhang et al. (2016): the image is downsampled and every pixel is encoded based on its five nearest color bins, weighted by the distance from the ground truth values using a Gaussian kernel with σ = 5.
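A minimal sketch of this soft encoding, assuming `ab` holds the flattened ground truth ab values and `bin_centers` the 313 in-gamut bin centers (both names are ours):

```python
import numpy as np

def soft_encode(ab: np.ndarray, bin_centers: np.ndarray, k: int = 5, sigma: float = 5.0) -> np.ndarray:
    """Soft-encode ab values (N x 2) against Q bin centers (Q x 2).

    Each pixel receives weight on its k nearest bins, using a Gaussian kernel."""
    # pairwise distances between pixels and bin centers: N x Q
    dist = np.linalg.norm(ab[:, None, :] - bin_centers[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest bins per pixel
    weights = np.exp(-np.take_along_axis(dist, nearest, axis=1) ** 2 / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)      # normalize per pixel
    Z = np.zeros((ab.shape[0], bin_centers.shape[0]))
    np.put_along_axis(Z, nearest, weights, axis=1)     # scatter weights into the Q bins
    return Z
```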

With the ability to construct the ground truth Z and the prediction $\hat{Z}$, the two can be compared using the multinomial cross-entropy defined in equation 1:

$$L_{cross\text{-}entropy}(Z, \hat{Z}) = -\sum_{h,w} v(Z_{h,w}) \sum_{q} Z_{h,w,q} \log(\hat{Z}_{h,w,q}) \quad (1)$$

Here, $v(Z_{h,w})$ is a term that reweights every pixel based on the rarity of its bin, as calculated over ImageNet (Russakovsky et al., 2015). The calculated bin weights are mixed with a uniform distribution following equation 2:

$$v(Z_{h,w}) \propto \left((1 - \lambda)\,\tilde{p} + \frac{\lambda}{Q}\right)^{-1} \quad (2)$$

where $\tilde{p}$ is the smoothed color distribution calculated from ImageNet and $\lambda \in [0, 1]$ determines the balance between the empirical distribution and the uniform distribution. This paper adopts the recommendation of $\lambda = 0.5$ from R. Zhang et al. (2016).

The rebalancing of these color classes is necessary to encourage the model to predict vivid, rare colors, because the distribution of color in natural images is biased toward low ab values due to background content such as clouds or blown-out highlights.

Lastly, a decision must be made on the manner in which $\hat{Z}$ is converted to $Y_{ab}$. As alluded to before, this paper takes the annealed mean of $\hat{Z}$ following equation 3:

$$\mathcal{H}(Z_{h,w}) = \mathbb{E}\left[f_T(Z_{h,w})\right], \qquad f_T(z) = \frac{\exp(\log(z)/T)}{\sum_q \exp(\log(z_q)/T)} \quad (3)$$

The annealed mean has a temperature parameter $T \in [0, 1]$. Setting $T \to 1$ samples colors closer to the mean, whilst $T \to 0$ samples closer to the mode of $\hat{Z}$. Visually, high values of T produce duller but spatially more consistent colors, whilst low values of T produce more vivid colors but can show more artifacts in the final result. This paper fixes the value of T to 0.38 per the suggestion of R. Zhang et al. (2016).
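A minimal sketch of the annealed-mean decoding in equation 3, assuming `Z_hat` holds the per-pixel probabilities over the Q bins and `bin_centers` the corresponding ab values (names are ours, not the thesis's):

```python
import numpy as np

def annealed_mean(Z_hat: np.ndarray, bin_centers: np.ndarray, T: float = 0.38) -> np.ndarray:
    """Decode per-pixel bin probabilities (H x W x Q) into ab values (H x W x 2).

    Interpolates between the mean (T -> 1) and the mode (T -> 0) of the distribution."""
    log_z = np.log(Z_hat + 1e-8) / T       # temperature-scaled log-probabilities
    f = np.exp(log_z)
    f /= f.sum(axis=-1, keepdims=True)     # re-normalized distribution f_T
    return f @ bin_centers                 # expectation over the Q bin centers
```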

3.1.2 Network Architecture

The architecture of the Reference Generating Network is a modified version of the DeepLabv3+ architecture (Chen et al., 2018) with an Xception backbone (Chollet, 2017). The final output layer is replaced by a convolution with an output size of $\frac{H}{4} \times \frac{W}{4} \times Q$, with Q = 313 as described above. We initialize this model using weights pre-trained on ImageNet and COCO. This improves performance in two ways: it reduces the time needed to fine-tune the network for colorization, and it reuses the semantic information already present within the weights. Both contribute to improved colorization performance. An evaluation of the model can be found in section 4.1.
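The thesis uses an Xception backbone, which torchvision does not ship; purely to illustrate swapping a segmentation head for a Q-way color classifier, a sketch using the ResNet-101 variant of DeepLabv3 could look as follows (note that torchvision upsamples the output to the input resolution, whereas the thesis predicts at quarter resolution):

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet101

Q = 313  # number of in-gamut ab bins

# Load a segmentation model pre-trained on a subset of COCO and replace its
# final 1x1 classifier so it predicts a Q-way color distribution per pixel.
model = deeplabv3_resnet101(pretrained=True)
model.classifier[4] = nn.Conv2d(256, Q, kernel_size=1)
model.eval()

gray = torch.randn(1, 3, 224, 224)   # gray-scale input repeated over three channels
logits = model(gray)["out"]          # 1 x Q x 224 x 224 color-bin logits
```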


Figure 3: Visual overview of the Reference Generating Network. A gray-scale image, repeated three times, is downsampled by atrous convolutions, and convolutions at different levels are combined by spatial pyramid pooling. The result is upsampled, concatenated with a skip connection, and used to produce a per-pixel distribution prediction. From this prediction, the annealed mean is taken to obtain the ab channels, which are combined with the gray-scale input to construct the colorized image. For more details on the network, we refer to Chen et al. (2018).

3.1.3 Training and Optimization

The Reference Generating Network is trained on 20% of ImageNet using the Adam optimizer with a learning rate of 1e-5, β1 = 0.9, and β2 = 0.999, without weight decay, for 120,000 iterations with a batch size of 12, using the loss from equation 1. Due to resource limitations, this paper employs gradient accumulation to simulate a larger batch size of 60. The evolution of the training process can be found in figure 4.

In order to get the most out of the roughly 260,000 images, this paper employs common image augmentation strategies, such as resizing, flipping, translation, rotation, and scaling. A complete overview of the augmentations can be found in appendix A in table 4. The model was trained on an NVIDIA GTX 1080 GPU for one week.
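A minimal sketch of the gradient accumulation scheme described above, with placeholder stand-ins for the network, loss, and data pipeline; five micro-batches of 12 images approximate an effective batch size of 60:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: the real model and data pipeline are described in the text.
model = nn.Conv2d(1, 313, kernel_size=1)       # placeholder for the colorization network
criterion = nn.CrossEntropyLoss()              # placeholder for the rebalanced loss (eq. 1)
dataloader = [(torch.randn(12, 1, 56, 56),     # micro-batches of 12 gray inputs
               torch.randint(0, 313, (12, 56, 56))) for _ in range(10)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
accum_steps = 5  # 5 micro-batches of 12 images simulate a batch size of 60

optimizer.zero_grad()
for step, (gray, target_bins) in enumerate(dataloader):
    loss = criterion(model(gray), target_bins) / accum_steps  # scale so accumulated gradients average
    loss.backward()                                           # gradients add up across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                      # update once per simulated large batch
        optimizer.zero_grad()
```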


Figure 4: Graph showing the training and validation loss over time. Note that the loss utilized is a summed loss as described in equation 1.

3.2 Video Color Propagation Network

The Video Color Propagation Network is responsible for stable video colorization. To facilitate this, we propose an architecture that takes as inputs a single frame from a gray-scale video and a reference frame obtained from the Reference Generating Network as described in section 3.1. The video model establishes a correspondence map between the reference and the frame. This map informs the model of the regions where the images are similar and which objects are to receive which color from the reference. We fix the reference over the course of an entire video clip. This causes the output to be consistent in color, as all frames are guided by the same reference, resulting in stable colorization over time.

The following sections discuss the architecture of the model, how regularization is performed, and how the model was trained and optimized.

3.2.1 Network Architecture

The architecture of the Video Color Propagation Network starts with a pre-trained ResNet (K. He et al., 2016) encoder without the classification head. From its output, a correspondence map between the reference and the input is calculated using an attention layer, followed by another attention layer performing self-attention. The output of the self-attention layer is then concatenated with a skip connection, decoded, and upsampled to produce the remaining two color channels. A visual representation can be found in figure 5.


Figure 5: Visual overview of the Video Color Propagation Network. A reference and a gray-scale image are fed into the network. The embedded features are mapped based on similarity, followed by self-attention. This is then concatenated with a skip connection and decoded using bilinear upsampling and convolutions. The output is the predicted ab channels, which are combined with the gray-scale input to reconstruct a fully colored image.

3.2.2 Attention Mechanism

The attention mechanism creates a correspondence map. It takes two inputs of any dimension, and the mapping shows which regions of the inputs are similar. This allows the colors of the reference to find corresponding spatial positions in the input image. We use two of these mechanisms in our model. The first establishes a mapping between the reference features and the gray-scale features, assigning which objects in the input receive which color from the reference. The second serves to smooth out color transitions by establishing a correspondence between similar parts of the same image, giving similar regions in the image a similar color.

Let the inputs of the layer be the input features $X_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, with $C_i$ the channels, $H_i$ the height, and $W_i$ the width of the input, and the reference features $R_r \in \mathbb{R}^{C_r \times H_r \times W_r}$.

First, the channel dimension is reduced by a factor of 8 to reduce the computational strain. Then, following equation 4, inspired by Iizuka & Simo-Serra (2019) and Yang et al. (2019), the attention is calculated:

$$\mathrm{Attention}(X_i, R_r) = X_i + \gamma \, R'_r \, \mathrm{softmax}\!\left(\frac{1}{\sqrt{C_i}} \, {R'_r}^{T} X'_i\right) \quad (4)$$

Here $X'_i \in \mathbb{R}^{C_i/8 \times H_i W_i}$ is a reduced form of $X_i$, and likewise $R'_r \in \mathbb{R}^{C_i/8 \times H_r W_r}$ for $R_r$. The factor $\frac{1}{\sqrt{C_i}}$ scales the dot product to keep the gradients within the well-behaved region of the softmax, and $\gamma$ is a learned parameter which scales the attention term relative to the input.
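A sketch of how such an attention layer could be implemented, under the assumption that 1x1 convolutions perform the channel reduction and that the reference also provides the values to be gathered (the module and variable names are ours):

```python
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    """Sketch of the source-reference attention described in equation 4."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        reduced = channels // reduction
        # 1x1 convolutions reduce the channel dimension by a factor of 8
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)   # from the input frame
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)     # from the reference
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # features to transfer
        self.gamma = nn.Parameter(torch.zeros(1))  # learned scale, starts at zero

    def forward(self, x_i: torch.Tensor, r_r: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x_i.shape
        q = self.query(x_i).flatten(2)             # B x C/8 x (H*W)
        k = self.key(r_r).flatten(2)               # B x C/8 x (Hr*Wr)
        v = self.value(r_r).flatten(2)             # B x C   x (Hr*Wr)
        # similarity between every input position and every reference position
        attn = torch.softmax(k.transpose(1, 2) @ q / (c ** 0.5), dim=1)  # B x (Hr*Wr) x (H*W)
        out = (v @ attn).view(b, c, h, w)          # features gathered from the reference
        return x_i + self.gamma * out
```

Self-attention then amounts to calling the same module with the input as its own reference.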

3.2.3 Regularization

Capitalizing on the ability of the attention layers to diffuse color, this paper uses strong image augmentation to simulate the reference being a related frame of the input. The method used is inspired by Eilertsen et al. (2019). The input and reference are transformed separately according to table 1.


| Parameter | Target | Probability | Min | Max |
|---|---|---|---|---|
| Horizontal Flip | I, R | 50% | - | - |
| Translate | I, R | 75% | -0.0625 | 0.0625 |
| Rotation | I, R | 50% | -45° | 45° |
| Scale | I, R | 50% | 0.9x | 1.1x |
| Crop | I, R | 50% | 160px | 256px |
| Resize | I, R | 100% | 224px | 224px |
| RGB channel shift | I | 50% | -20 | 20 |
| Noise | I | 33% | N(0, 0.01) | N(0, 0.05) |
| Brightness | I | 33% | 0.8 | 1.2 |
| Gamma | I | 33% | 0.8 | 1.2 |
| Saturation | I | 33% | 0.6 | 1.1 |

Table 1: Augmentation parameters and their ranges for the input and the reference for the Video Color Propagation Network. Here, I is the input and R is the reference. Augmentations are performed separately on both images.

This manner of regularization forces the network to learn how two frames correspond to one another and to render color consistently across an entire video sequence.
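As a sketch of this regularization, assuming PIL images and torchvision (the parameters only approximate table 1), the input and reference can be pushed through independently sampled augmentation pipelines:

```python
from torchvision import transforms

# Geometric transforms applied to both the input (I) and the reference (R),
# but sampled independently so the pair mimics two different frames of a clip.
geometric = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=45, translate=(0.0625, 0.0625), scale=(0.9, 1.1)),
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
])

# Photometric transforms applied to the input only.
photometric = transforms.RandomApply(
    [transforms.ColorJitter(brightness=0.2, saturation=(0.6, 1.1))], p=0.33
)

def make_training_pair(image):
    """Create a (reference, gray input) pair from a single training image."""
    reference = geometric(image)                              # perturbed colored reference
    gray_input = photometric(geometric(image)).convert("L")   # separately perturbed gray input
    return reference, gray_input
```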

3.2.4 Training and Optimization

The training of the model is done in two distinct phases. During the first phase, the model is pre-trained on 20% of ImageNet, where the perturbed ground truth is given as the reference and its separately perturbed gray counterpart is given as the input. The model is trained for approximately 40,000 iterations using the Adam optimizer with a learning rate of 2e-4.


Figure 6: Plots showing the training and validation loss of the Video Color Propagation Network. The training is done in two phases: in the first phase, the model is pre-trained on 20% of ImageNet; during the second phase, the dataset is switched to DAVIS. Both phases are trained with the Huber loss.

Subsequently, the training moves to the second phase, in which the model is fine-tuned: the training and test set are switched to DAVIS (Perazzi et al., 2016) and the learning rate is reduced to 1e-7. Instead of the perturbed ground truth, a randomly selected frame from the video clip is given as the reference. This fine-tuning phase is run for 25,000 iterations. Both phases use a batch size of 56 and the Huber regression loss, which is less sensitive to outliers and thus more conducive to colorful results. The combined training time was three days on an NVIDIA GTX 1080 GPU. Figure 6 shows how the loss evolves during both training phases.
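The Huber loss is quadratic near zero and linear for large errors, which is what makes it less sensitive to outlier colors; a brief, hypothetical sketch of applying it to predicted ab channels:

```python
import torch
import torch.nn as nn

criterion = nn.HuberLoss(delta=1.0)  # quadratic below delta, linear above it

predicted_ab = torch.randn(8, 2, 224, 224)  # network output for a batch of frames
target_ab = torch.randn(8, 2, 224, 224)     # ground-truth ab channels
loss = criterion(predicted_ab, target_ab)   # outlier colors contribute linearly, not quadratically
```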

4 Results

Both models are evaluated on the following metrics: PSNR, SSIM, and LPIPS (R. Zhang et al., 2018) with a VGG backbone. The reasons for using these quality metrics are three-fold. Firstly, they are commonly used in related works, keeping our results roughly comparable to others. Secondly, there exists no consensus on the best method to evaluate colorization quality. And thirdly, a user preference study could not be performed due to resource limitations. Important to note is the manner of calculating the metrics: PSNR and SSIM are calculated by averaging over individual color channels, while LPIPS is calculated over all channels at once.
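A sketch of how these metrics might be computed per frame, assuming scikit-image and the `lpips` package, with images as H x W x 3 float arrays in [0, 1]; PSNR and SSIM are averaged over the color channels as described above:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_vgg = lpips.LPIPS(net="vgg")  # LPIPS with a VGG backbone

def evaluate_frame(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute PSNR/SSIM averaged over color channels and LPIPS over all channels."""
    psnr = np.mean([peak_signal_noise_ratio(truth[..., c], pred[..., c], data_range=1.0)
                    for c in range(3)])
    ssim = np.mean([structural_similarity(truth[..., c], pred[..., c], data_range=1.0)
                    for c in range(3)])
    # lpips expects N x 3 x H x W tensors scaled to [-1, 1]
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_vgg(to_tensor(pred), to_tensor(truth)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```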

4.1 Reference Generating Network

The Reference Generating Network is evaluated on three different datasets: a subset of ImageNet, COCO-stuff, and Places365. The subset of the ImageNet (Russakovsky et al., 2015) test split is the one provided by Larsson et al. (2016). ImageNet is a commonly used dataset, so it is important to include it in comparisons, although its relative simplicity compared to natural scenes must be noted: the content of ImageNet consists mostly of single mundane objects or animals, grouped into 1000 classes taken from WordNet. The next benchmark is the validation split of COCO-stuff (Caesar et al., 2018). This dataset consists of complex natural images of both indoor and outdoor scenes, which are therefore more difficult to colorize. The last dataset used for evaluation is the validation split of Places365 (Zhou et al., 2017), made up of images depicting natural (landscapes) and man-made (buildings) scenes. Compared to COCO-stuff, these scenes should colorize more easily, as colors in landscapes such as the sky or the ground are the same across many scenes.

For a fair comparison, all images are resized to 256 × 256 before metric calculation. Calculations are performed as described in section 4. The proposed model follows the training strategy described in section 3.1.3. Section 4.1.1 discusses the colorization performance metrics compared to the state of the art, while section 4.1.2 assesses the colorization performance on a selection of random samples.

4.1.1 Quantitative Evaluation

Table 2 shows the quantitative results. As can be observed, the proposed model does not hold up against state-of-the-art networks, falling 1.0612 dB behind the baseline on PSNR and even further behind the other models on ImageNet. However, on SSIM our model is competitive with ChromaGAN (Vitoria et al., 2020), though not on LPIPS, where it trails even the baseline. On the COCO-stuff dataset, the performance of our model improves overall relative to ImageNet, but the other methods see a similar advancement; again, the proposed network only scores comparably on SSIM and lags on LPIPS. Places365 appears quantitatively to be the easiest to color, as all models see a slight uplift compared to the other two datasets. Even so, our model is still outdistanced on all metrics.


| Model | ImageNet ctest10k PSNR ↑ | SSIM ↑ | LPIPS ↓ | COCO-stuff PSNR ↑ | SSIM ↑ | LPIPS ↓ | Places365 PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Gray (baseline) | 21.7961 | 0.9116 | 0.2238 | 22.0604 | 0.9103 | 0.2119 | 22.4149 | 0.9218 | 0.2031 |
| R. Zhang et al. (2017) | 24.1471 | 0.9176 | 0.1843 | 24.0949 | 0.9144 | 0.1783 | 24.2183 | 0.9014 | 0.2228 |
| Vitoria et al. (2020) | 22.7703 | 0.8739 | 0.2379 | 22.5928 | 0.8608 | 0.2439 | 23.6093 | 0.9104 | 0.2125 |
| Antic (2018) | 23.5912 | 0.9091 | 0.2200 | 23.3931 | 0.8688 | 0.2640 | 23.7737 | 0.8866 | 0.2508 |
| Proposed Method | 20.7349 | 0.8689 | 0.2869 | 21.1737 | 0.8720 | 0.2759 | 21.1734 | 0.8593 | 0.2807 |

Table 2: Comparison overview of various image colorization models. Models are tested on a 10k subset of the ImageNet test set, the COCO-stuff validation split, and the Places365 validation split. The arrows next to the metrics PSNR, SSIM, and LPIPS (R. Zhang et al., 2018) indicate whether a high or low score is desirable.

However, the chosen metrics serve merely as a proxy for colorization performance, as a more suitable metric does not yet exist, and straying far from the ground truth is penalized by them. Therefore, more weight should be placed on the subjective evaluation in section 4.1.2.

A commonly taken approach is to measure naturalness by performing a user study asking participants whether they perceive an image to be natural. Due to resource constraints, such a study has not been performed.


4.1.2 Qualitative Evaluation


Figure 7: Comparison overview of a few sample images. From left to right: the input, the ground truth, the proposed method, Real-Time User-Guided Image Colorization with Learned Deep Priors by R. Zhang et al. (2017), ChromaGAN by Vitoria et al. (2020), and DeOldify "artistic" by Antic (2018).

Turning to the visual comparison in figure 7, it can be observed that the proposed model produces saturated imagery but is not always spatially consistent compared to the other models. This can be seen in the second example, the scene of the waterfall and the lake, and is mostly due to predicting per-pixel color distributions, causing pixel color values to be assigned irrespective of their neighbors. Another image of interest is the scene of a slug on a beach. Here, the proposed solution is alone in interpreting the sand of the beach as grass, thus coloring it green. A similar, but more nuanced, misinterpretation can be seen in the fifth image. In this case the common adage that the sky is blue does not hold, as there is cloud cover, which misleads the rest of the models as well to various extents.

Otherwise, for scenes depicting landscapes or outdoor imagery, all models produce plausible colorizations, as is clear from figure 7. When colorizing objects which do not have consistent colors in the real world, the models show their biases. A model able to produce plausible colorizations under these conditions can be understood as possessing higher colorization performance. Figure 8 below shows some examples of images that are difficult for all models; on these images the proposed model shows spatially inconsistent results, while the other models do not.


Figure 8: Handpicked examples of difficult images to colorize.

Figure 8 also reveals how the models deal with man-made artifacts and small objects. The proposed method biases towards brown and green tones, while the model of R. Zhang et al. (2017) desaturates the picture with gray hues. ChromaGAN (Vitoria et al., 2020) suffers less from bias but tints the entire image with a single color, frequently the most predominant hue. Lastly, DeOldify (Antic, 2018) biases subtly towards blue hues, as can be seen in the colors of the clothes in the first image. Finally, figure 9 shows some handpicked examples where our model misinterprets the content of the image, leading to unpleasing colorization, whereas the other models assign plausible colors.


Figure 9: Handpicked images in which the proposed model produces clearly unrealistic results.

Overall, the proposed model does not perform on par with other solutions in either the quantitative or the qualitative evaluation. It might, therefore, be more productive to integrate a third-party model into the complete automatic video colorization system.

4.2 Video Color Propagation Network

A selection of video colorization models is evaluated on a custom dataset of videos taken from the stock video website Videvo. The dataset was selected to provide multiple different types of clips to colorize. To this end, it contains a range of static to dynamic clips, usually longer than the clips provided by the DAVIS dataset. This also prevents the potential problem that one of the models has already seen these videos during training.

All images are resized to 854 × 480 before evaluation. Otherwise, results are calculated as described in section 4. First, section 4.2.1 makes a quantitative comparison against other state-of-the-art models, followed by a visual comparison in section 4.2.2.

4.2.1 Quantitative Evaluation

In contrast to the Reference Generating Network, the Video Color Propagation Network performs far better (4.105 dB higher than the next best in PSNR, 0.0084 higher in SSIM than the second best, and 0.061 better than the runner-up in LPIPS). As can be observed in table 3, the proposed technique outperforms comparable models on the evaluation dataset.

Videvo validation set

| Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| Gray (baseline) | 22.0894 | 0.9305 | 0.2429 |
| Iizuka & Simo-Serra (2019) | 23.9090 | 0.8986 | 0.1991 |
| Lei & Chen (2019) | 16.9073 | 0.6657 | 0.4314 |
| Antic (2018) | 22.7078 | 0.8649 | 0.3049 |
| Proposed Method | 28.0140 | 0.9389 | 0.1381 |

Table 3: Comparison overview of various video colorization models. Models are tested on 22 test videos taken from the stock video website Videvo. Models capable of accepting a reference are given the first ground truth frame of each video clip as the reference. The arrows next to the metrics PSNR, SSIM, and LPIPS (R. Zhang et al., 2018) indicate whether a high or low score is desirable. An overview and the sources of the video clips are provided in appendix B in table 5.


Figure 10 shows a per-video breakdown of how the tested networks perform. As can be observed from the figure, the proposed method scores better on all three metrics, even on more difficult image sequences.

[Figure 10: three charts, "PSNR per video", "SSIM per video", and "LPIPS per video", showing the average score per test video for Gray (baseline), the Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 10: The three figures show how well the models perform with respect to the different videos. These videos vary in content and length. A detailed overview of the videos can be found in appendix B in table 5.

Diving deeper, these metrics can be plotted per frame to show how they develop over time.

[Figure 11: three graphs, "PSNR over time in cafe", "SSIM over time in cafe", and "LPIPS over time in cafe", plotting the per-frame score against the frame number for Gray (baseline), the Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 11: A trio of graphs showing in what manner the metrics develop over time. The sources of the utilized videos can be found in appendix B in table 5. An overview graph for every video can also be found in appendix C.

As an example, a sliding shot taken in a cafe is decomposed in figure 11 to show per-frame metrics. Here, the proposed method can be seen worsening over time as transferring colors becomes more difficult, although it still starts and ends with higher scores, whilst the other models follow the baseline, as is to be expected. For both our model and DeepRemaster, the usefulness of a single reference quickly diminishes as the correspondence becomes harder to establish. Automatic models score more consistently but lower in this regard, as they cannot rely on hints to improve colorization performance.

4.2.2 Qualitative Evaluation

Figure 12 shows how the proposed model compares visually against other models. The figure includes both video colorization and single image colorization networks for easy comparison.

[Figure 12: frames T = 0, 108, 217, and 326 shown for the ground truth and each model.]

Figure 12: Comparison overview for the video colorization models. The frame number is shown along the top and the different models are shown along the side. From top to bottom: the ground truth, the proposed method, DeepRemaster (Iizuka & Simo-Serra, 2019), DeOldify "video" (Antic, 2018), Fully Automatic Video Colorization (Lei & Chen, 2019), and Real-Time User-Guided Image Colorization (R. Zhang et al., 2017).

In this particular video clip of a parrot (figure 12), it is evident that the proposed method performs well visually. Most of the other models, with the exception of B. Zhang et al. (2019), also show spatial consistency within individual frames. Strong temporal stability can be observed in the proposed model, while only weak temporal stability can be observed in the rest of the models, which subtly shift their colors over the course of the video.

The proposed method uses the ground truth reference to keep its colors consistent, leading to a favorable outcome in the quantitative evaluation, as colorizations close to the ground truth are favored and other diverse plausible colorizations are penalized by the metrics. Figure 12 also shows that learning a prior over a large dataset might not be enough to colorize all types of images, especially if recovery of the ground truth is the objective. Even so, reference-based models suffer from their own flaw: colors that are not in the reference are not inferred. Take the last column in figure 12: both reference-based models, ours and DeepRemaster (Iizuka & Simo-Serra, 2019), fail to predict the blue highlights in the feathers of the parrot. A problem all models suffer from is the difficulty of coloring small details, as models are trained on low-resolution imagery, most commonly 256 × 256. This demonstrates that neither image nor video colorization has been solved.

5 Integrating the models

The two models presented here can be coupled sequentially to efficiently color a gray-scale image sequence. The Reference Generating Network takes the first frame of the sequence and colorizes it. This frame serves as the reference for the Video Color Propagation Network, which combines the reference and a frame from the gray-scale image sequence to output a fully colored image. Once this feed-forward pass has been performed, the second frame in the sequence is colored using the same reference, and this manner of colorization continues until all frames are colored. The correspondence between the reference and the gray frames can carry consistent color over a long image sequence, given that there are no scene cuts or transitions.
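A sketch of this sequential coupling, assuming the two networks are available as modules `reference_net` and `propagation_net` (hypothetical names and signatures) that map L channels to ab channels:

```python
import torch

@torch.no_grad()
def colorize_video(gray_frames, reference_net, propagation_net):
    """Colorize a list of 1 x 1 x H x W gray-scale (L channel) frames.

    The first frame is colorized by the reference network; every frame is then
    colored by the propagation network against that single fixed reference."""
    first = gray_frames[0]
    ref_ab = reference_net(first)                  # colorize the first frame
    reference = torch.cat([first, ref_ab], dim=1)  # full Lab reference, fixed for the clip
    colored = []
    for frame in gray_frames:
        ab = propagation_net(frame, reference)     # diffuse reference colors to this frame
        colored.append(torch.cat([frame, ab], dim=1))
    return colored
```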

6 Conclusion

This paper proposes an automatic video colorization system that improves the temporal consistency of video colorization by reducing color flicker between frames. The system consists of two parts: a Reference Generating Network and a Video Color Propagation Network. The Reference Generating Network renders a reference frame based on the first frame of a gray-scale video, which the video model uses to propagate colors to the rest of the video. While the proposed Reference Generating Network underperforms against the state of the art, the Video Color Propagation Network manages to accurately diffuse colors over an entire video. Further improving the reference-generating model will strengthen the final video colorization, as the performance of the video model is heavily tied to the quality of the reference.

7 Discussion

From the results in sections 4.1 and 4.2, the observation can be made that the Reference Generating Network underperforms quantitatively whilst being visually competitive, and that the Video Color Propagation Network outperforms the state of the art both quantitatively and qualitatively. The complete system encompasses both models, although due to time constraints they have not been trained in an end-to-end fashion, leaving this avenue open for exploration.

Our model for single image colorization underperformed compared to contemporaries. This can be partially attributed to a lack of training: as shown in figure 4, the loss is still trending downwards, but training had to be cut short due to time constraints. The video model (figure 6), by contrast, received ample time to plateau.

No conclusive statement can be made that our proposed model for color propagation in video will generalize outside the constructed test set, the sources of which can be found in appendix B, table 5.

It is noted that the two models of our system are opposed in their performance: one scores relatively low and the other relatively high. This indicates that the framework used for the Reference Generating Network could use some improvement, while the video model shows a computationally efficient way to diffuse color over a video, giving it the possibility of colorizing high-resolution videos. Both techniques can be improved.

Here we make a few suggestions as to what avenues of research can be pursued.

One observation is that the metrics used to measure colorization performance are heavily biased towards recreating the ground truth, penalizing models which stray too far from it. This results in most models defaulting to gray tones when colorizing objects they do not recognize. There thus exists an unmet need for a metric that can capture the multi-modality of color.

Similarly, it is difficult to demonstrate that color in video is consistent over time, as the only effective way of showing improvements is to show the video itself. A dedicated metric for temporal consistency is therefore needed.

Whilst deep-learned priors help single image colorization perform better, most models are trained on relatively low-resolution imagery. It might therefore be useful to increase the image size during training in order to capture small details. Another open point is dealing with man-made artifacts: most models default to gray or brown when colorizing them. Further research might look at how to better capture the color of such objects.

We hope that our work and these suggestions inspire further research into the field of colorization, for as good as current results are, it remains far from a solved problem.


References

Antic, J. (2018). Deoldify: A deep learning based project for colorizing and restoring old images (and video!). Retrieved 25-05-2020, from https://github.com/jantic/DeOldify

Bugeau, A., Ta, V.-T., & Papadakis, N. (2013). Variational exemplar-based image colorization. IEEE Transactions on Image Processing, 23 (1), 298–307.

Caesar, H., Uijlings, J., & Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1209–1218).

Cao, Y., Zhou, Z., Zhang, W., & Yu, Y. (2017). Unsupervised diverse colorization via generative adversarial networks. In Joint european conference on machine learning and knowledge discovery in databases (pp. 151–166).

Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the european conference on computer vision (eccv) (pp. 801–818).

Cheng, Z., Yang, Q., & Sheng, B. (2015). Deep colorization. In Proceedings of the ieee international conference on computer vision (pp. 415–423).

Chia, A. Y.-S., Zhuo, S., Gupta, R. K., Tai, Y.-W., Cho, S.-Y., Tan, P., & Lin, S. (2011). Semantic colorization with internet images. ACM Transactions on Graphics (TOG), 30 (6), 1–8.

Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1251–1258).

Curcio, C. A., Sloan, K. R., Kalina, R. E., & Hendrickson, A. E. (1990). Human photoreceptor topography. Journal of comparative neurology, 292 (4), 497–523.

Deshpande, A., Lu, J., Yeh, M.-C., Jin Chong, M., & Forsyth, D. (2017). Learning diverse image colorization. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 6837–6845).

Deshpande, A., Rock, J., & Forsyth, D. (2015). Learning large-scale automatic image colorization. In Proceedings of the ieee international conference on computer vision (pp. 567–575).

Eilertsen, G., Mantiuk, R. K., & Unger, J. (2019). Single-frame regularization for temporally stable cnns. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 11176–11185).

Guadarrama, S., Dahl, R., Bieber, D., Norouzi, M., Shlens, J., & Murphy, K. (2017). Pixcolor: Pixel recursive colorization. arXiv preprint arXiv:1705.07208 .

Gupta, R. K., Chia, A. Y.-S., Rajan, D., Ng, E. S., & Zhiyong, H. (2012). Image colorization using similar images. In Proceedings of the 20th acm international conference on multimedia (pp. 369–378).

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 770–778).

He, M., Chen, D., Liao, J., Sander, P. V., & Yuan, L. (2018). Deep exemplar-based colorization.

Huang, Y.-C., Tung, Y.-S., Chen, J.-C., Wang, S.-W., & Wu, J.-L. (2005). An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th annual acm international conference on multimedia (pp. 351–354).

Iizuka, S., & Simo-Serra, E. (2019). Deepremaster: Temporal source-reference attention networks for comprehensive video enhancement. ACM Transactions on Graphics (TOG), 38 (6), 1–13.

Iizuka, S., Simo-Serra, E., & Ishikawa, H. (2016). Let there be color! joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (ToG), 35 (4), 1–11.

Ironi, R., Cohen-Or, D., & Lischinski, D. (2005). Colorization by example. In Rendering techniques (pp. 201–210).

Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1125–1134).

Lai, W.-S., Huang, J.-B., Wang, O., Shechtman, E., Yumer, E., & Yang, M.-H. (2018). Learning blind video temporal consistency. In Proceedings of the european conference on computer vision (eccv) (pp. 170–185).

Larsson, G., Maire, M., & Shakhnarovich, G. (2016). Learning representations for automatic colorization. In European conference on computer vision (pp. 577–593).

Lei, C., & Chen, Q. (2019). Fully automatic video colorization with self-regularization and diversity. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3753–3761).

Levin, A., Lischinski, D., & Weiss, Y. (2004). Colorization using optimization. In Acm siggraph 2004 papers (pp. 689–694).

Liao, J., Yao, Y., Yuan, L., Hua, G., & Kang, S. B. (2017). Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088 .

McLaren, K. (1976). Xiii—the development of the cie 1976 (l* a* b*) uniform colour space and colour-difference formula. Journal of the Society of Dyers and Colourists, 92 (9), 338–341.

Mouzon, T., Pierre, F., & Berger, M.-O. (2019). Joint cnn and variational model for fully-automatic image colorization. In International conference on scale space and variational methods in computer vision (pp. 535–546).

Nazeri, K., Ng, E., & Ebrahimi, M. (2018). Image colorization using generative adversarial networks. In International conference on articulated motion and deformable objects (pp. 85– 94).

Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 724–732).

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241).


Royer, A., Kolesnikov, A., & Lampert, C. H. (2017). Probabilistic image colorization. arXiv preprint arXiv:1705.04258 .

Ruder, M., Dosovitskiy, A., & Brox, T. (2016). Artistic style transfer for videos. In German conference on pattern recognition (pp. 26–36).

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . others (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115 (3), 211–252.

Salimans, T., Karpathy, A., Chen, X., & Kingma, D. P. (2017). Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517 .

Tai, Y.-W., Jia, J., & Tang, C.-K. (2005). Local color transfer via probabilistic segmentation by expectation-maximization. In 2005 ieee computer society conference on computer vision and pattern recognition (cvpr’05) (Vol. 1, pp. 747–754).

Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems (pp. 4790–4798).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

Vitoria, P., Raad, L., & Ballester, C. (2020). Chromagan: Adversarial picture colorization with semantic class distribution. In The ieee winter conference on applications of computer vision (pp. 2445–2454).

Welsh, T., Ashikhmin, M., & Mueller, K. (2002). Transferring color to greyscale images. In Proceedings of the 29th annual conference on computer graphics and interactive techniques (pp. 277–280).

Yang, Z., Wang, Q., Bertinetto, L., Hu, W., Bai, S., & Torr, P. H. (2019). Anchor diffusion for unsupervised video object segmentation. In Proceedings of the ieee international conference on computer vision (pp. 931–940).

Yatziv, L., & Sapiro, G. (2006). Fast image and video colorization using chrominance blending. IEEE transactions on image processing, 15 (5), 1120–1129.

Zhang, B., He, M., Liao, J., Sander, P. V., Yuan, L., Bermak, A., & Chen, D. (2019). Deep exemplar-based video colorization. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 8052–8061).

Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649–666).

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Cvpr.

Zhang, R., Zhu, J.-Y., Isola, P., Geng, X., Lin, A. S., Yu, T., & Efros, A. A. (2017). Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999 .


Zhao, J., Han, J., Shao, L., & Snoek, C. G. (2019). Pixelated semantic colorization. International Journal of Computer Vision, 1–17.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40 (6), 1452–1464.


Appendices

A Augmentation table

Parameter        Probability  Min      Max
Horizontal Flip  50%          -        -
Translate        50%          −0.0625  0.0625
Rotation         50%          −45°     45°
Scale            50%          0.9x     1.1x
Crop             50%          160px    256px
Resize           100%         224px    224px

Table 4: Augmentation parameters and their ranges for the Reference Generating Network.
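For reproducibility, the pipeline in Table 4 could be implemented, for example, with torchvision transforms. The sketch below is an assumed re-implementation (the thesis does not state which augmentation library was used), and the `RandomSizeCrop` helper is a hypothetical addition to realize the variable crop size.

```python
import random
import torchvision.transforms as T

class RandomSizeCrop:
    """Hypothetical helper: crop a random square between min_px and max_px pixels."""
    def __init__(self, min_px=160, max_px=256):
        self.min_px, self.max_px = min_px, max_px

    def __call__(self, img):
        size = random.randint(self.min_px, self.max_px)
        return T.RandomCrop(size, pad_if_needed=True)(img)

# Each augmentation from Table 4 is applied independently with its own probability.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                                          # Horizontal Flip, 50%
    T.RandomApply([T.RandomAffine(0, translate=(0.0625, 0.0625))], p=0.5),  # Translate, ±6.25%
    T.RandomApply([T.RandomAffine(degrees=45)], p=0.5),                     # Rotation, ±45°
    T.RandomApply([T.RandomAffine(0, scale=(0.9, 1.1))], p=0.5),            # Scale, 0.9x-1.1x
    T.RandomApply([RandomSizeCrop(160, 256)], p=0.5),                       # Crop, 160-256 px
    T.Resize((224, 224)),                                                   # Resize, always
])
```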

B Custom video test set details

Video Name | Subject | Original Dims | Resized Dims | Frames | URL
Tropical Bird Cleaning Itself 4K | Parrot | 4096x2160 | 910x480 | 435 | www.videvo.net/video/tropical-bird-cleaning-itself-4k/456027/
Bricklayer Mid Shot | Wall + Human | 1920x1080 | 853x480 | 434 | www.videvo.net/video/bricklayer-mid-shot/3678/
Cafe Table with Coffee and Cakes | Tables with objects | 4096x2160 | 910x480 | 263 | www.videvo.net/video/cafe-table-with-coffee-and-cakes/464213/
Car Driving Through Icelandic Landscape | Landscape with car | 1920x1080 | 853x480 | 222 | www.videvo.net/video/car-driving-through-icelandic-landscape/452786/
Silent Desert | Landscape | 1920x1080 | 853x480 | 288 | www.videvo.net/video/silent-desert/4395/
Child Running Through National Flags | Flags | 1920x1080 | 853x480 | 913 | www.videvo.net/video/child-running-through-national-flags/455837/
Flying Through Forest 1 | Forest landscape | 1920x1080 | 853x480 | 239 | www.videvo.net/video/flying-through-forest-1/4651/
Turia Fountain Neptune Statue | Fountain + Pigeons | 1920x1080 | 853x480 | 387 | www.videvo.net/video/turia-fountain-neptune-statue/458896/
Reveal Shot of Icelandic Glacier Lake | Glacier | 1920x1080 | 853x480 | 561 | www.videvo.net/video/reveal-shot-of-icelandic-glacier-lake/452834/
Door To Hell in Turkmenistan 01 | Fire | 3840x2160 | 853x480 | 387 | www.videvo.net/video/door-to-hell-in-turkmenistan-01/457998/
Private Jet Taxiing | Airplane | 1920x1080 | 853x480 | 434 | www.videvo.net/video/private-jet-taxiing/4130/
LA Transport 02 | Bus | 1280x720 | 853x480 | 754 | www.videvo.net/video/la-transport-02/2777/
Women Serving Food 07 | Human | 1920x1080 | 853x480 | 514 | www.videvo.net/video/women-serving-food-07/458531/
Marathon Runners Slow Motion | Humans | 1920x1080 | 853x480 | 346 | www.videvo.net/video/marathon-runners-slow-motion/4542/
Skateboarder Big Ollie | Skateboard | 1920x1080 | 853x480 | 657 | www.videvo.net/video/skateboarder-big-ollie/3554/
Flying Over Sunflower Field 1 | Plants | 1920x1080 | 853x480 | 419 | www.videvo.net/video/flying-over-sunflower-field-1/5187/
Nosy Be City Centre Taxi | Taxi | 1920x1080 | 853x480 | 228 | www.videvo.net/video/nosy-be-city-centre-taxi/4566/
Man Cycling With a Surfboard | Cityscape | 1920x1080 | 853x480 | 209 | www.videvo.net/video/man-cycling-with-a-surfboard/455997/
Tracking Shot Across Woman Using VR Headset | Human | 2048x1080 | 910x480 | 536 | www.videvo.net/video/tracking-shot-across-woman-using-vr-headset/7405/
Industrial Pipes and Valves | Colored Pipes | 1920x1080 | 853x480 | 153 | www.videvo.net/video/industrial-pipes-and-valves/456387/
Walking Along Woodpile Handheld Shot | Landscape | 1920x1080 | 853x480 | 229 | www.videvo.net/video/walking-along-woodpile-handheld-shot/4898/
Overhead Of Paperwork And Hands Of Business Colleagues Brainstorming Around Table | Landscape | 3840x2160 | 853x480 | 215 | www.videvo.net/video/overhead-of-paperwork-and-hands-of-business-colleagues-brainstorming-around-table/464906/

Table 5: Details of the custom video test set taken from Videvo.
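The resized dimensions listed above correspond to scaling each clip to a height of 480 pixels while preserving the aspect ratio. The exact tooling used to prepare the test set is not specified in the thesis; the OpenCV sketch below is only one possible way to perform this preprocessing.

```python
import os
import cv2

def extract_and_resize(video_path, out_dir, target_height=480):
    """Extract frames from a clip and rescale them to a fixed height (sketch)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        new_w = int(round(w * target_height / h))  # e.g. 1920x1080 -> 853x480
        frame = cv2.resize(frame, (new_w, target_height), interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(out_dir, f"{index:05d}.png"), frame)
        index += 1
    cap.release()
```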


C Individual metrics per video over time
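The figures below plot PSNR, SSIM, and LPIPS between each colorized frame and the ground truth. A minimal sketch of how such per-frame values can be computed is given here, assuming the scikit-image implementations of PSNR/SSIM and the lpips package accompanying Zhang et al. (2018); the evaluation code actually used in the thesis may differ.

```python
import lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance of Zhang et al. (2018)

def frame_metrics(pred, target):
    """PSNR, SSIM and LPIPS for one RGB frame pair given as uint8 HxWx3 arrays."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=2, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        dist = lpips_fn(to_tensor(pred), to_tensor(target)).item()
    return psnr, ssim, dist
```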

[Three line plots: PSNR, SSIM, and LPIPS per frame for bird, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 13: Frame by frame overview on how the metrics evolve over time for bird. See appendix B, table 5 for details.

[Three line plots: PSNR, SSIM, and LPIPS per frame for bricklaying, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 14: Frame by frame overview on how the metrics evolve over time for bricklayer. See appendix B, table 5 for details.

[Three line plots: PSNR, SSIM, and LPIPS per frame for cafe, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 15: Frame by frame overview on how the metrics evolve over time for café. See appendix B, table 5 for details.


[Three line plots: PSNR, SSIM, and LPIPS per frame for car_mountain, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 16: Frame by frame overview on how the metrics evolve over time for car mountain. See appendix B, table 5 for details.

[Three line plots: PSNR, SSIM, and LPIPS per frame for desert, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 17: Frame by frame overview on how the metrics evolve over time for desert. See appendix B, table 5 for details.

[Three line plots: PSNR, SSIM, and LPIPS per frame for flags, comparing Gray (baseline), Proposed Method, Iizuka & Simo-Serra (2019), Lei & Chen (2019), and Antic (2018).]

Figure 18: Frame by frame overview on how the metrics evolve over time for flags. See appendix B, table 5 for details.
