
Game of Pirates: Watermark Removal using Deep Learning-Based Collusion Attacks

Lukas Grisar

Student number: 01305554

Supervisors: Prof. dr. Peter Lambert, Prof. dr. ir. Glenn Van Wallendael
Counsellors: Ir. Hannes Mareen, Ir. Martijn Courteaux

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2019-2020


Preface

I hereby present my master’s dissertation, written to finalise my Master’s studies in Computer Science Engineering at Ghent University.

Firstly, I would like to thank my counsellor ir. Hannes Mareen for his guidance and patience. Every step of the way, he provided me with valuable feedback and gave me detailed answers to all my questions. Next, I would like to express my gratitude to Martijn Courteaux, who filled in for Hannes when he was abroad. Additionally, I would like to thank my supervisors prof. dr. Peter Lambert and prof. dr. ir. Glenn Van Wallendael for their time and valuable input. Lastly, I would like to thank my parents and sister: my parents for their patience and support throughout my academic career, and my sister for her useful feedback on my writing.

I would also like to mention that the computational resources (GPULab) used for this dissertation were kindly provided by IMEC and Ghent University.


Contents

List of Figures

List of Tables

Acronyms

1 Introduction

2 Secondary Watermark Signal
  2.1 Primary Watermark
  2.2 Secondary Watermark Signal
  2.3 Secondary Watermark Detection
  2.4 Room for attacks

3 State-of-the-Art Analysis
  3.1 Watermark attacks
    3.1.1 Removal Attacks
    3.1.2 Cryptographic Attacks
    3.1.3 Synchronisation Attacks
    3.1.4 Protocol Attacks
  3.2 Removal of Compression Artefacts
    3.2.1 CNN for JPEG Artefacts Removal
    3.2.2 Spatio-temporal Networks for Video
    3.2.3 CNNs Specifically for HEVC
  3.3 Artefact Reduction Temporal Network
    3.3.1 Preprocessing Stage
    3.3.2 Temporal Stage
    3.3.3 Aggregation Stage
    3.3.4 Merge Stage
    3.3.5 Results

4 Proposed Scheme
  4.1 Requirements
    4.1.1 Detection Failure
    4.1.2 Video Quality
  4.2 Attack Pipeline
    4.2.1 Data Pre-processing
  4.3 Neural Net
    4.3.1 Merge Stage
    4.3.2 Combine Stage
    4.3.3 Recompression
  4.4 Network Training
    4.4.1 Training Dataset
    4.4.2 Training Details

5 Evaluation
  5.1 Synchronisation Attacks
    5.1.1 Zoom Attack
  5.2 Removal Attacks
  5.3 Neural Network Attack
    5.3.1 Before Recompression
    5.3.2 Z-scores after Recompression
    5.3.3 Z-score for Different Input Qualities

6 Discussion
  6.1 Model Structure
  6.2 Model Training and Tuning
  6.3 Attack Performance

7 Conclusion

Bibliography

Appendices

A Error in PSNR and SSIM Values
  A.1 Origin of the Error
  A.2 Consequences of the Error


List of Figures

1.1 Impression of the title screen of the leaked Game of Thrones episodes

2.1 High-level diagram of Asikuzzaman’s watermark embedding algorithm

2.2 High-level diagram showing the differences between an unwatermarked and watermarked video

3.1 Framework overview of the ARTN network

3.2 Comparison of Inception network-in-network

4.1 RMSE values for the first frame of the BQTerrace sequence

4.2 Comparison of the two methods of z-score calculation

4.3 Overview of the network structure

4.4 Visualisation of how patches are merged

5.1 Crop rectangle for a zoom factor of 1.05×

5.2 Cumulative z-score for the BQTerrace sequence with a zoom factor of 1.02×

5.3 Cumulative z-score for the BQTerrace sequence with a zoom factor of 1.005×

5.4 Comparison of PSNR and SSIM

5.5 Quality comparison of frame 10 of the BQTerrace sequence

5.6 Quality comparison of frame 50 of BQTerrace

5.7 Quality comparison of frame 50 of BQTerrace, detail

5.8 Quality comparison of frame 50 of Cactus after compression

5.9 Visual quality comparison of frame 50 of the BQTerrace sequence with output Constant Rate Factor (CRF) of 22

5.10 Visual quality comparison of frame 50 of the BQTerrace sequence with output CRF of 27

5.11 Visual quality comparison of frame 50 of the BQTerrace sequence with output CRF of 32

5.12 Visual quality comparison of frame 50 of the BQTerrace sequence with output CRF of 37


List of Tables

5.1 Minimal z-score at frame 150 for sequence BQTerrace

5.2 Minimal z-score at frame 50 for sequence BQTerrace with input CRF 22

5.3 Mean frame quality for sequence BQTerrace with input CRF 22

5.4 Mean frame quality for unattacked watermarked videos with CRF 22

5.5 Minimal z-score for 200 frames for sequence BQTerrace at different values of α, CRF and number of input videos

5.6 PSNR and SSIM scores after recompression

5.7 Minimal z-score for 200 frames for sequence BQTerrace at different values of α, input and output CRFs, and number of input videos

5.8 PSNR and SSIM scores after recompression for different parameters for different input CRFs

5.9 Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) for different CRF values for the watermarked videos of the BQTerrace sequence


Acronyms

AR-CNN Artifacts Reduction Convolutional Neural Network

ARTN Artefact Reduction Temporal Network

AWGN Additive White Gaussian Noise

CNN Convolutional Neural Network

CRF Constant Rate Factor

DCN Deep Convolutional Network

DCT Discrete Cosine transform

DTCWT dual-tree complex wavelet transform

HEVC High Efficiency Video Coding

HVS Human Visual System

ME Motion Estimation

MSE Mean Squared Error

NCC Normalised Cross Correlation

PSNR Peak Signal to Noise Ratio

ReLu rectified linear unit

RMSE Root Mean Squared Error

SSIM Structural Similarity

STD standard deviation

TSS Three Step Search


1

Introduction

Game of Thrones was one of the most popular TV series of all time. The season 8 premiere had 17.4 million viewers in the first 24 hours after its release; in that same time span, it was illegally downloaded 55 million times [1]. People simply did not want to wait to see the new episode. It may then come as no surprise that, when the first four episodes of season 5 (2015) leaked before the official release, they were downloaded more than a million times before the official broadcast [2].

The four episodes that leaked were press screeners: episodes that are sent to press reviewers before the official release so they have time to write their reviews. One of the reviewers, however, was not as trustworthy as initially thought and put the first four instalments of the new season online. To deter people from doing exactly this, HBO marked each copy they sent out with a unique identifier that was visible on screen the whole time. A leaked episode could thus easily be traced back to the journalist. As the ID code did not move, the leaker simply blurred his or her ID to remain anonymous [3]. Figure 1.1 shows an impression of how this would have looked. This is a (bad) example of forensic watermarking.

Thorwirth et al. from the Streaming Video Alliance define watermarking as the technique of embedding data into video such that it can be reliably extracted, even after the asset has been modified [4]. The watermark is thus designed to be robust against changes to the asset, remaining intact itself.

Subsequently, Thorwirth et al. define forensic watermarking as a watermark that is intended to provide a means of identification of the source of leaked content. In the event of a leak, the watermark can be extracted and the culprit identified.

The watermark, the visible ID, used in the Game of Thrones screeners did not satisfy the requirement of modification resistance. But even then, it might be possible to recover the blurred ID. Mareen et al. proposed a technique which uses a secondary watermark signal as a fallback when the original watermark is destroyed [5]. A detailed explanation of how this technique works can be found in Chapter 2. In this master’s dissertation, the robustness of this


Figure 1.1: Impression of the title screen of the leaked Game of Thrones episodes. (a) The original image with the reviewer ID; (b) the attacked image with the identifier blurred.

technique will be examined for the case in which multiple pirates collude, i.e. when multiple people combine their watermarked videos to create one untraceable video.

The rest of this dissertation is organised as follows. Chapter 2 gives a detailed explanation of how the technique proposed by Mareen et al. works. The next chapter, Chapter 3, gives an overview of the state of the art regarding existing watermark removal techniques, split up into individual and collusion attacks. Additionally, different compression artefact removal methods are discussed.

Afterwards, Chapter 4 lists the requirements of a successful attack and proposes an attack scheme targeting Mareen’s secondary watermark signal. Next, the proposed scheme is thoroughly evaluated in Chapter 5. Then, Chapter 6 discusses whether the requirements are satisfied and suggests potential improvements to the proposed attack. Finally, Chapter 7 summarises the conclusions and the most important future research opportunities.


2

Watermarking Based on Compression Artefacts as Secondary Watermark Signal

This chapter gives a summary of how the secondary watermark signal is generated and how it can be extracted. This technique was proposed by Mareen et al. in a submitted but, at the time of writing, unpublished paper [6]. It is an extension of the technique Mareen et al. proposed earlier [5].

Before the secondary watermark is explained, the primary watermark is described in Section 2.1. Note that the technique proposed by Mareen et al. does not depend on any specific primary watermarking technique; this section merely explains the primary watermark used by Mareen. Next, the origin of the secondary watermark is clarified. Afterwards, the method for watermark extraction is discussed. Lastly, some possible attack vectors are proposed in Section 2.4; these are exploited in the method proposed in Chapter 4.

2.1 Primary Watermark

The original, primary watermark that is embedded in the currently unpublished paper by Mareen et al. was proposed by Asikuzzaman et al. It is a zero-bit watermark in the dual-tree complex wavelet transform (DTCWT) domain of the U-channel [7]; the Y- and V-channels remain unchanged. This can be seen in the block diagram in Figure 2.1. The watermark signal is pseudorandomly generated with the watermark ID as a seed and is transformed into the DTCWT domain. In order to reduce the number of consecutive frames that contain the same watermark, the watermark is changed after a set number of frames.

The watermark is embedded in the low-frequency DTCWT coefficients of the U-channel. The U-channel is used because the Human Visual System (HVS) is less sensitive to the chrominance channels (U and V) than to the luminance channel (Y); the watermark is thus less perceptible there. Due to this lower sensitivity of the HVS, the chrominance channels are compressed more heavily. The low-frequency coefficients of the chrominance channels contain more perceptible information and


Figure 2.1: High-level diagram of Asikuzzaman’s watermark embedding algorithm.

are therefore compressed less. Embedding the watermark there makes it more robust, but also more perceptible. In order to decrease the perceptibility, a perceptual mask is used.

For watermark extraction, similar but inverse operations are applied. First, the watermark is estimated from the DTCWT coefficients of the U-channel. Next, the Normalised Cross Correlation (NCC) is calculated between the original watermark and the estimated watermark, for every frame of the video. The NCC score is high when the watermark is present and low when it is absent. Peak values can then be compared to a threshold, chosen for a preferred false-positive probability, to decide whether the watermark is present.

As stated before, only the U-channel is modified. Converting the image to black and white, for example, which is the same as removing the U and V chroma (i.e. colour) channels, will thus also completely remove the watermark. How the watermark can be removed while keeping the U- and V-channels will not be discussed in this dissertation, but it is henceforth assumed to be removed¹. The secondary watermark method will still allow the recovery of the originally embedded watermark by exploiting small differences in compression artefacts.

2.2 Secondary Watermark Signal

The secondary watermark originates when the watermarked videos are compressed. Figure 2.2 shows a high-level diagram and more details will be given below.

As stated in the previous section, the Y-channels of the watermarked and unwatermarked videos are exactly the same before compression, as can be seen in Figure 2.2b.

After a video has been watermarked with Asikuzzaman’s method, it is compressed. In the example in Figure 2.2, a CRF of 22 was used². Due to this compression, compression artefacts are introduced in all video channels, as can be seen in Figures 2.2d and 2.2e. When the compressed videos are compared, many differences can be seen in the Y-channel (Figure 2.2c). Even though the Y-channels are identical before compression, their compressed counterparts thus exhibit many differences as a direct consequence of the embedded watermark.

The encoder used by Mareen et al. to compress the videos is the libx265 encoder, an H.265/High Efficiency Video Coding (HEVC) encoder, using default parameters and a set CRF.

¹In order to speed up the evaluation of the proposed technique, only the Y-channel is taken into account. Asikuzzaman’s watermark is thus also completely removed; only the secondary watermark signal is present.


Figure 2.2: High-level diagram showing the differences between an unwatermarked and watermarked video: (a) uncompressed unwatermarked YUV; (b) Y-channel differences before compression; (c) Y-channel differences after compression, i.e. the secondary watermark; (d) Y-channel unwatermarked compression artifacts; (e) Y-channel watermarked compression artifacts.

When videos are embedded with a different watermark ID, their compressed Y-channels will display analogous differences. These differences are proposed by Mareen et al. as a secondary watermark signal.

2.3 Secondary Watermark Detection

As the secondary watermark does not contain any encoded information, it is a zero-bit watermark. Since no information is embedded, there is no watermark payload to extract. It is, however, possible to detect the watermark’s presence. This is done by comparing the leaked video with all other compressed watermarked videos by calculating the Root Mean Squared Error (RMSE), which is defined in Equation 2.1. In the equation, o is the original video, w is the watermarked video and p is the number of pixels in the video.

\[ \mathrm{RMSE}(o, w) = \sqrt{\mathrm{MSE}(o, w)} = \sqrt{\frac{1}{p} \sum_{i=1}^{p} (o_i - w_i)^2} \tag{2.1} \]
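As an illustration, Equation 2.1 translates directly into a few lines of Python; the function name and the flat-list video representation are choices for this sketch, not part of Mareen et al.’s implementation:

```python
import math

def rmse(o, w):
    # Equation 2.1: Root Mean Squared Error between two equally sized
    # videos, represented here as flat lists of pixel values.
    assert len(o) == len(w), "videos must have the same number of pixels"
    p = len(o)
    return math.sqrt(sum((oi - wi) ** 2 for oi, wi in zip(o, w)) / p)
```

Two identical videos give an RMSE of exactly zero, which is the case in which the watermark should certainly be detected.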

When the RMSE is zero, the two compared videos are exactly the same and the watermark should be detected. A non-zero RMSE does not mean that the watermark is absent, since the RMSE increases when the video is attacked. A simple detection threshold on the RMSE can thus not be used. It is important to note, however, that when the watermark is present, the RMSE will be significantly lower than when it is not.

To make detection more statistically substantiated, a z-score is calculated for each RMSE value. The z-score measures how many standard deviations a value differs from the mean of a distribution. Here, the mean of the distribution is calculated over the RMSE values between videos that certainly do not contain the watermark. The equation for the z-score is given in Equation 2.2, in which r is the RMSE value being converted, and µa and σa are the mean and standard deviation (STD) of the RMSE values for which the watermark is absent³.

³Mareen et al. defined the absent set as all RMSE values except the lowest one. When collusion attacks are used, multiple videos will have lower RMSE values. Consequently, the resulting z-scores will all be higher. A more detailed explanation and a fix are given in Section 4.1.1.



\[ z(r, \mu_a, \sigma_a) = \frac{r - \mu_a}{\sigma_a} \tag{2.2} \]

Since the z-score normalises the collection of RMSE values corresponding to absent watermarks, the resulting distribution of scores has zero mean and unit variance. In other words, all z-scores corresponding to absent watermarks will have a value close to zero, and the score of the present watermark should be significantly lower than zero.

As the z-score is normalised and independent of arbitrary, absolute differences, it can be compared to a threshold to check whether a watermark has been detected. This can be done for a chosen False Positive (FP) probability \(P_{fp}\), from which the threshold is obtained using the Gaussian method stated in Equation 2.3, where \(T_F\) is the threshold and \(\mathrm{erfc}\) is the complementary error function. For example, if one desires the FP probability to be \(P_{fp} = 10^{-6}\), the corresponding threshold is \(T_F \approx -4.8\).

\[ T_F = -\sqrt{2}\, \mathrm{erfc}^{-1}(2 P_{fp}) \tag{2.3} \]
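The detection statistic can be sketched in pure Python; the helper names are my own, the population standard deviation is an assumption about how µa and σa are computed, and erfc⁻¹ is inverted numerically because the standard library offers only math.erfc:

```python
import math
import statistics

def z_scores(rmses):
    # Equation 2.2: convert RMSE values to z-scores. The "absent"
    # statistics are taken over all values except the lowest one,
    # following the definition used by Mareen et al.
    absent = sorted(rmses)[1:]
    mu = statistics.mean(absent)
    sigma = statistics.pstdev(absent)  # assumption: population STD
    return [(r - mu) / sigma for r in rmses]

def detection_threshold(p_fp):
    # Equation 2.3: T_F = -sqrt(2) * erfcinv(2 * p_fp). erfcinv is
    # obtained by bisection, since math.erfc decreases monotonically.
    lo, hi = 0.0, 10.0
    target = 2 * p_fp
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.erfc(mid) > target:
            lo = mid
        else:
            hi = mid
    return -math.sqrt(2) * lo
```

For \(P_{fp} = 10^{-6}\) this reproduces the threshold of roughly −4.8 mentioned above; an RMSE whose z-score falls below the threshold counts as a detection.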

2.4 Room for attacks

The secondary watermark signal proposed by Mareen et al. depends completely on the compression artefacts introduced by the re-compression after the primary watermark was embedded. If one succeeds in reducing or distorting these artefacts enough, the detection might fail, or at least become less confident.


3

State-of-the-Art Analysis

This chapter starts with an overview of existing watermark removal techniques. A distinction is made between single-source attacks and attacks that use multiple sources, when multiple people collude.

The second part of this chapter summarises state-of-the-art compression artefact removal techniques, as these are used in the proposed technique.

3.1 Watermark attacks

The purpose of digital watermarks is to let the copyright owner claim ownership of digital media; in this dissertation, the focus lies on video. The watermark is a signal embedded in the original data stream that can later be extracted, or at least detected, after the media has been distributed. When copyrighted material is misused, the original owner can prove ownership by showing the existence of the watermark in the material. In the case of screener episodes released to journalists, each journalist gets his or her own uniquely watermarked version. In the event the episode is leaked, the copyright owner can trace who leaked it.

Watermark attacks aim to remove the watermark, or at least distort it enough so that it can no longer be extracted or detected. Song et al. classify watermark attacks into four distinct categories, namely removal attacks, geometric attacks, cryptographic attacks and protocol attacks. These categories are explained in more detail in the next sections.

3.1.1 Removal Attacks

Removal attacks, as the name implies, aim to remove the watermarking information completely. No attempt is made to crack the security of the watermarking algorithm (see Section 3.1.2). Once the watermark is removed, no method can recover it, regardless of the complexity of the method used.


Single Attacks

The example given in the introduction in Chapter 1 is an example of a removal attack: the watermark is completely covered up by blurring the ID number. Other removal methods include sharpening the image, histogram equalisation and adding noise. Sharpening, histogram equalisation and blurring aim to distort the watermark enough that detection fails. By adding noise, an attacker hopes the noise masks the watermark signal.

In the case of Mareen’s secondary watermark signal, these attacks are hard to pull off, as the watermark is embedded in all channels of the image. The attacks mentioned above will not thwart the detection. Mareen specifically tested Additive White Gaussian Noise (AWGN) attacks and found that they were futile.

Collusion Attacks

When multiple videos can be combined to get a better estimate of the watermark, it becomes easier to distort the watermark enough to thwart detection. Watermarks that are embedded in transform coefficients (e.g. DTCWT or Discrete Cosine transform (DCT) coefficients) could be removed by averaging these coefficients across the multiple copies. The average could also be treated as a ground truth: the difference between a single copy and the average resembles that copy’s watermark, which can then in turn be subtracted from the copy. In Section 5.2, multiple simple removal attacks are evaluated, but none came even close to avoiding detection when a reasonable number of copies was used¹.
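The two collusion strategies above can be sketched as follows, operating directly on pixel values rather than transform coefficients for brevity; frames are flat lists, and the strength parameter is an illustrative knob, not something specified by the attacks described here:

```python
def average_collusion(copies):
    # Pixel-wise average of N watermarked copies; the individual
    # watermark signals partially cancel out in the average.
    n = len(copies)
    return [sum(px) / n for px in zip(*copies)]

def subtract_estimated_watermark(copy, average, strength=1.0):
    # Treat the average as a ground-truth estimate: the difference
    # between one copy and the average approximates that copy's
    # watermark, which is then (partially) subtracted again.
    return [c - strength * (c - a) for c, a in zip(copy, average)]
```

With strength = 1.0 the result collapses onto the average itself; smaller values trade watermark removal against fidelity to the single copy.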

3.1.2 Cryptographic Attacks

Cryptographic attacks attempt to crack the security methods of the watermarking scheme and thereby remove the embedded watermark information. Alternatively, one can try to embed a new, deceitful watermark, for example by brute-forcing the key used for the watermark. Due to the computational complexity, the practical use of these attacks is limited.

Single Attacks

If one knows the exact watermarking technique that was used to embed the watermark, one could try to brute-force the original key used for embedding. If the key is found, the watermark can be removed completely, although for a large keyspace this is computationally expensive. The secondary watermarking signal uses no key; it is therefore impossible to brute-force one, and these attacks cannot be used.

Collusion Attacks

Having more watermarked copies has no additional benefit for an attacker: all the keys in the copies are different and need to be extracted separately, so having more copies does not yield any speedup.

¹When the number of input videos is large enough, these attacks will succeed, but in practice, the number of available copies is limited.


Again, since the secondary watermark signal has no key, cryptographic collusion attacks cannot be used.

3.1.3 Synchronisation Attacks

Synchronisation attacks try to break the synchronisation between the watermark in the video and the watermark detector. Even though the watermark is still embedded in the video, the detector fails to extract it successfully as the watermark in a certain frame is not what the detector expects. The two main forms of synchronisation the detector relies on are spatial synchronisation and temporal synchronisation. Attacks on both kinds of synchronisation are discussed in the next subsections. Both kinds of attacks can be used at the same time.

Desynchronisation also has an impact on the viewing experience. If the strength of the attack needs to be high, it may result in a video that is awful to watch (e.g. losing a big part of the image by cropping, crooked angles by rotating, flickering induced by altering the frame order, etc.). This limits the effectiveness of these attacks.

Geometric Attacks

Geometric attacks aim to break the spatial synchronisation between the embedded watermark and the detector. By using one or more geometric transformations, the watermark can be distorted. Geometric attacks include scaling, rotating, shearing and cropping. Due to these transforms, certain pieces of watermarking information change place or are deleted (e.g. by cropping). Consequently, if the detector does not take these transformations into account, the watermark it extracts differs from what was originally embedded.
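A minimal zoom attack of this kind, cropping the centre of a frame and scaling it back up, can be sketched with nearest-neighbour resampling; a real attack would use a proper video scaler, and the default factor is only an example:

```python
def zoom_attack(frame, factor=1.02):
    # Crop the central (1/factor) portion of the frame and rescale it
    # back to the original size with nearest-neighbour sampling, so
    # every pixel shifts slightly and spatial synchronisation breaks.
    h, w = len(frame), len(frame[0])
    ch, cw = int(h / factor), int(w / factor)  # crop size
    oy, ox = (h - ch) // 2, (w - cw) // 2      # crop offset (centred)
    return [[frame[oy + y * ch // h][ox + x * cw // w]
             for x in range(w)] for y in range(h)]
```

A factor of 1.0 leaves the frame untouched; the larger the factor, the more of the border is lost and the stronger the desynchronisation.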

Temporal Attacks

Temporal attacks break the synchronisation among multiple frames. If the watermark changes over the duration of the video and this variation is important, breaking the order of the frames may result in detection failure. Desynchronisation can be achieved by speeding the video up or slowing it down, dropping or duplicating frames, switching the position of frames, and so forth.

Collusion Attacks

If an attacker has multiple watermarked versions of a video, he can combine them, for example by taking a frame from each video in turn, or by combining multiple versions of a frame into one, before adding further distortion with the aforementioned techniques. This may improve the chances of desynchronisation without needing to increase the strength of the transformations, resulting in a better viewing experience.

Synchronisation attacks are less useful in the context of Mareen’s secondary watermark use case. If a video has been leaked, it is of great importance to find the culprit(s), so the studios may direct a lot of resources at finding the source of the leak. Estimating the applied transformations may take a lot of computing power, but this might be worth it in the end. Detection therefore happens in two stages: the first stage estimates the transformations, either by brute force or by a more sophisticated technique such as gradient descent, before the second stage performs the actual watermark detection.



3.1.4 Protocol Attacks

Protocol attacks aim to break the entire concept of digital watermarking. Here, an attacker leaves the original watermark in place but tries to find a way to extract a different signal which he can claim as his own watermark. Both the owner and the attacker are then able to extract a watermark, creating ambiguity about who truly created the original work. The secondary watermarking technique’s main goal, however, is to identify who leaked copyrighted material; the watermark is not embedded to claim ownership. Therefore, even if an attacker succeeded in extracting some other watermark from the video, the copyright owner would still know who leaked the material.

Protocol attacks therefore do not apply as a valid attack strategy in this case.

3.2 Removal of Compression Artefacts

Mareen’s secondary watermark is based on compression artefacts. If one succeeds in removing, or at least reducing, these artefacts, the secondary watermark detection may fail. The latest methods in the state of the art often use Deep Convolutional Networks (DCNs) for artefact reduction. With the goal of reducing the artefacts, and thus defeating the secondary watermark signal, there exist three main flavours of usable networks.

Firstly, there are networks that remove JPEG compression artefacts, i.e. that use single frames as input. Next, there are spatio-temporal networks that also use neighbouring frames as input. Lastly, there are networks specifically trained for removing HEVC artefacts. These three kinds are discussed in the next subsections.

3.2.1 CNN for JPEG Artefacts Removal

One of the first networks to use a DCN for artefact removal was the AR-CNN network proposed by Dong et al. in 2015 [8]. This network was inspired by DCNs for super-resolution and used four convolutional layers with Mean Squared Error (MSE) as the loss function. Cavigelli et al. proposed a much deeper, 12-layer convolutional network with a multi-scale loss function [9]. Kim et al. modified the Inception module, proposed by Szegedy et al. for image classification [10][11], and showed that it could also be used for artefact reduction.

3.2.2 Spatio-temporal Networks for Video

As a video is a sequence of multiple, often very similar, frames, a network could extract more information if it takes these sequential frames into account. Such networks exploit temporal as well as spatial information and therefore take multiple consecutive frames as input. This approach is often used for denoising, prediction and super-resolution [12, 13, 14].

Even though temporal information is not used in the attack method proposed in Chapter 4, it might result in a better attack strategy.



Figure 3.1: Overall framework of the proposed scheme, where k, n, s above the convolution layers denote the kernel size, the number of output feature maps, and the convolution strides, respectively.

3.2.3 CNNs Specifically for HEVC

Some networks are trained specifically to reduce artefacts caused by HEVC compression. Some techniques change the encoder to gain better compression performance, e.g. Kim et al. [15]. Others add a Convolutional Neural Network (CNN) as a post-processing step after decoding without changing the encoder, e.g. Dai et al. [16].

Soh et al. proposed a deep temporal CNN which combines the techniques mentioned before [17]: a deep temporal neural network trained specifically for HEVC. The attack proposed in Chapter 4 uses an adapted version of this network, referred to as the Artefact Reduction Temporal Network (ARTN). This technique is explored in more detail in Section 3.3.

3.3 Artefact Reduction Temporal Network

By exploiting the temporal redundancy in consecutive frames, the network tries to recreate the original frame as it was before any compression. More formally, the network maps an input X to the output F(X, θ), where θ are the network parameters. The parameters are trained so that F(X, θ) is as close as possible to the original signal Y. In this case, the network uses data from the previous, current and next frame to enhance the current frame.

The scheme proposed by Soh et al. consists of four stages: a preprocessing stage, a temporal stage, an aggregation stage and a merge stage [17]. These stages are discussed in the next sections; an overview of the scheme is shown in Figure 3.1.

3.3.1 Preprocessing stage

The preprocessing stage splits each frame of the video into patches of 64×64 pixels, which is the size of the largest coding unit in the HEVC codec. Each patch overlaps its neighbouring patches by 16 pixels (1/4 of the patch), which corresponds to a stride of 48 pixels.
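The resulting patch grid can be computed as below; how frames whose dimensions are not a multiple of the stride are handled is not specified in the text, so the clamping of a final patch to the frame border is my assumption:

```python
def patch_grid(width, height, patch=64, stride=48):
    # Top-left coordinates of overlapping patches covering a frame:
    # 64x64 patches with a 48-pixel stride, i.e. a 16-pixel overlap.
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    # Assumed boundary handling: clamp a final patch to the frame edge
    # when the stride does not divide the frame size evenly.
    if xs[-1] + patch < width:
        xs.append(width - patch)
    if ys[-1] + patch < height:
        ys.append(height - patch)
    return [(x, y) for y in ys for x in xs]
```

For a 160×112 frame this yields a 3×2 grid of six patches.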

For each of these patches, the best matching patch is sought in both the previous and next frame. This can be done by full-search Motion Estimation (ME), but that is computationally very expensive. Soh et al. instead used the Three Step Search (TSS) algorithm, which is far


cheaper and has little impact on overall performance. TSS is a simple iterative block-matching algorithm. It samples 8 locations at a certain step size around the current block position, plus the current position itself. Among the 9 sampled locations, the one with the lowest cost (e.g. RMSE) is picked, and this location becomes the start of the next iteration. Each iteration, the step size is halved, and the algorithm stops after the step size reaches 1. This algorithm reduces the computation by a factor of 9.
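The TSS loop itself is only a few lines; here cost(x, y) stands for any block-matching cost at candidate displacement (x, y), e.g. the RMSE between the shifted patch and the reference patch, and the initial step size of 4 is an illustrative choice:

```python
def tss(cost, x0, y0, step=4):
    # Three Step Search: sample the current position plus its 8
    # neighbours at the current step size, move to the cheapest
    # location, halve the step, and stop after sampling at step 1.
    x, y = x0, y0
    while step >= 1:
        candidates = [(x + dx * step, y + dy * step)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        x, y = min(candidates, key=lambda c: cost(*c))
        step //= 2
    return x, y
```

With a starting step of 4 this evaluates three rings of 9 locations instead of an exhaustive search over the whole window.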

To cope with abrupt scene changes and ME failures, the previous (or next) frame’s patches are discarded and replaced by the current frame’s patch when the mean absolute difference of the patch exceeds a threshold.

As the ME can introduce a possible mismatch, the patches of the previous and next frames are rescaled to 32×32 pixels. This will stress the importance of the current frame’s patch.

3.3.2 Temporal Stage

The temporal stage consists of 3 temporal branches, i.e. one for the current, one for the previous and one for the next patch. Each of these branches consists of 3 convolutional layers, each followed by a rectified linear unit (ReLU). Each branch aims to extract features from the corresponding frame; therefore, it consists of simple convolution layers.

In addition to the rescaling mentioned in the previous subsection, the branches for the previous and next frame have only half the number of feature maps, which further increases the importance of the current frame.

3.3.3 Aggregation Stage

Next, the aggregation stage merges the features extracted by the temporal stage and enhances them. For this, an Inception-based network proposed by Kim et al. [10], which was shown to provide denoising performance comparable to other state-of-the-art methods, is used. It is a modification of the Inception module from GoogLeNet [18]. According to Szegedy et al., the network-in-network structure helps to extract rich features by using various kernel sizes while requiring fewer parameters [18]. The modification proposed by Kim et al. is to remove the Max Pooling layer and add a larger 7×7 kernel filter. Both architectures are shown in Figure 3.2. Using more Inception modules (or more convolutional layers in the temporal stage) might improve overall performance, but this number was chosen to keep the overall number of parameters similar to that of the Artifacts Reduction Convolutional Neural Network (AR-CNN), with which the ARTN is compared.
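The "fewer parameters" claim can be made concrete with a quick count. The channel numbers below are illustrative (not taken from Kim et al. or GoogLeNet): splitting 64 output channels over parallel 1×1/3×3/5×5/7×7 branches needs far fewer weights than a single 7×7 layer producing the same 64 channels:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of weights in a k×k convolution layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# Hypothetical channel counts: 64 input channels, 64 output channels total.
single_7x7 = conv_params(7, 64, 64)

# Inception-style: the 64 output channels are split over four parallel
# branches of 16 channels each, with kernel sizes 1, 3, 5 and 7.
inception = sum(conv_params(k, 64, 16) for k in (1, 3, 5, 7))

print(single_7x7, inception)  # 200768 86080
```

Despite covering the same range of receptive-field sizes, the parallel structure uses less than half the parameters in this example.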

3.3.4 Merge Stage

The final output frame is the weighted average of the patches outputted by the network. As stated in Subsection 3.3.1, a quarter of each patch overlaps with its neighbours. This means that parts of the image are the overlap of two patches (horizontal or vertical) or 4 patches (horizontal, vertical and 2 diagonals). The value of a pixel in an overlapping region is the sum of the overlapping pixel values, each multiplied by the Gaussian weight measured from the centre of its patch. The resulting value is then divided by the total weight for normalisation.


Figure 3.2: Comparison of the network-in-network structure from (a) the original Inception module and (b) the modification for artefact removal.

The Gaussian weight is defined as:

W(i, j) = 1/(√(2π)·σ) · exp(−d(i, j)²/σ²)    (3.1)

where d(i, j) is the Euclidean distance from the centre of the patch, taken as (0, 0), to the coordinate (i, j). By changing σ, one can define how quickly the weights decrease further from the centre.
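Equation 3.1 and the normalised overlap blending can be sketched in numpy. The function names, the default σ and the example patch layout are illustrative, not taken from the dissertation's code:

```python
import numpy as np

def gaussian_patch_weights(patch_size, sigma):
    """Per-pixel Gaussian weights W(i, j), with the patch centre at (0, 0)."""
    c = (patch_size - 1) / 2.0
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    d2 = (ys - c) ** 2 + (xs - c) ** 2  # squared Euclidean distance to centre
    return np.exp(-d2 / sigma ** 2) / (np.sqrt(2 * np.pi) * sigma)

def merge_patches(patches, positions, frame_shape, sigma=16.0):
    """Weighted average of overlapping patches: accumulate Gaussian-weighted
    pixel values, then divide by the total weight for normalisation."""
    acc = np.zeros(frame_shape)
    total_w = np.zeros(frame_shape)
    w = gaussian_patch_weights(patches[0].shape[0], sigma)
    for patch, (y, x) in zip(patches, positions):
        s = patch.shape[0]
        acc[y:y + s, x:x + s] += w * patch
        total_w[y:y + s, x:x + s] += w
    return acc / total_w
```

Because every pixel is divided by the sum of the weights that touched it, regions covered by one, two or four patches are all normalised consistently.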

3.3.5 Results

The experiments Soh et al. performed showed that their proposed network yields a gain of 0.23 dB over more conventional networks, e.g. AR-CNN [8], for HEVC video. They also found that flickering artefacts, which are often present in compressed video, are reduced. Finally, the ARTN complexity was chosen to be comparable to that of the AR-CNN network. Soh et al. believe that more gains could be achieved if deeper temporal branches were used.

4 Proposed Scheme

This chapter describes the proposed attack. Firstly, it lists the requirements a successful attack must meet. Next, the pipeline of the attack is discussed. Lastly, the structure of the neural network is explained in more detail.

4.1 Requirements

For an attack to succeed, some criteria must be met. First and foremost, the secondary watermark detection must fail. Secondly, the quality of the resulting video should be acceptable. The details of how these criteria are measured are given in the following sections.

4.1.1 Detection Failure

The most important criterion to be met is that watermark detection should fail. Detection of the watermark is analogous to the detection method proposed by Mareen et al., which is explained in Section 2.3. A false-positive probability P_fp of 10⁻⁶ will be used. Converting this to a z-score with Equation 2.3 results in a z-score threshold T_F ≈ −4.8. This means that if the z-score is −4.8, there is only a one-in-a-million chance that a watermark is detected when it is not present. The certainty of detection increases with lower z-scores.
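The threshold T_F can be reproduced from P_fp with the inverse CDF of the standard normal distribution; a quick check using Python's standard library (`statistics.NormalDist`, available since Python 3.8):

```python
from statistics import NormalDist

# z-score threshold for a false-positive probability of one in a million:
# the z value below which a standard normal variable falls with P = 1e-6.
p_fp = 1e-6
t_f = NormalDist().inv_cdf(p_fp)
print(round(t_f, 1))  # -4.8, matching the threshold used in the text
```

The exact value is about −4.753, which the text rounds to −4.8.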

Z-score Calculation

The z-score calculation itself is modified to better detect collusion attacks. The main structure of the formula remains the same as in Equation 2.2: r is the RMSE between the watermarked video and the original video (Equation 2.1), and µ_a and σ_a are the mean and standard deviation of the distribution of the RMSE values corresponding to absent watermarks. How this absence is determined, however, differs between Mareen's method and the method proposed here.

Mareen et al. assume that the lowest RMSE score corresponds to the video where the watermark is present, and thus that all other RMSE scores are of videos without the watermark. This is a good estimation when only one watermark is present in the watermarked video. However, when multiple videos are combined in a collusion attack, all their watermarks could be present in the combined video, which means that more than one of the RMSE values will be lower.

Figure 4.1: RMSE values for the first frame of the BQTerrace sequence.

Therefore, as this dissertation focuses on collusion attacks, a new method is needed. µ_a and σ_a should be calculated on videos of which one can be certain that the watermark is not present. This can be a list of watermarked videos that were never released¹. The dataset is thus split into videos that could possibly be attacked and baseline videos.

The equation for the z-score is thus adapted to use µ_b and σ_b, which are the mean and standard deviation of the distribution of the baseline videos. This is shown in Equation 4.1.

z(r, µ_b, σ_b) = (r − µ_b) / σ_b    (4.1)

The necessity of this new z-score formula is explained by the example in Figure 4.1. This figure shows the RMSE values for 20 watermarked videos converted to grayscale². It is clear that the RMSE values of videos 1, 2 and 3 are lower than the rest. These three videos were used in a very rudimentary collusion attack where the three videos were simply averaged.

When the z-scores are calculated for all but the lowest RMSE values, µ_a and σ_a are 0.009180 and 0.00085 respectively, while µ_b = 0.00947 and σ_b = 0.00002. The smaller mean and larger standard deviation of the all-but-one method have a great impact on the resulting scores. The scores for both methods are displayed in Figure 4.2. The z-scores for the all-but-one method are ≈ −2.9, which is higher than the threshold of −4.8 defined in the previous section. This means that this method would not detect the watermark.

¹ If one cannot trust or ensure that none of the baseline videos were released, one can generate new watermarked videos from random, unused seeds.

² The primary watermark as described in Section 2.1 is only embedded in the colour channels U and V. Converting to grayscale removes these channels and thus only keeps the Y-channel. The primary watermark is hereby removed. Only the secondary watermark as described in Section 2.2 remains.


(a) Z-scores with µa and σa. (b) Z-scores with µb and σb.

Figure 4.2: Comparison of the two methods of z-score calculation.

The method using the baseline videos has z-scores of ≈ −121 for videos 1, 2 and 3, which is far below the threshold.

Cumulative Z-score vs. Single Z-score

The z-score can be calculated on a per-frame basis or on a per-video basis. For the most accurate detection, the z-score must be calculated for the entire video at once, which means that the RMSE is calculated over the entire video. As these calculations are quite expensive, one can choose to calculate the z-score of only a single frame or of only a part of the video. Especially when no attack, or only a poor one, was applied, a single frame suffices to identify the culprit. When testing whether an attack is viable, calculating the RMSE of only a part of the video gives a quick estimate without the need to generate an attacked version of every frame in the video.

The cumulative z-score charts, which will be used in Chapter 5, show the evolution of the z-score. The cumulative z-score at frame x is the z-score based on the RMSE up to and including frame x. After a couple of frames, it is possible to predict the further evolution of the z-score: if a clear downward trend is visible in the first 50 frames, for example, one may assume that this trend will continue.
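The cumulative z-score can be sketched as follows. This is a simplification under one stated assumption: the baseline statistics are treated as scalars here, whereas in the dissertation they are themselves derived from the baseline videos over the same frame range:

```python
import numpy as np

def cumulative_z_scores(frame_sqerrs, baseline_mu, baseline_sigma):
    """Cumulative z-score at frame x: the z-score of the RMSE computed
    over frames 0..x (inclusive).  `frame_sqerrs` holds the per-frame
    mean squared error of the suspect video against the watermarked
    reference; the cumulative RMSE is the square root of the running
    mean of those squared errors."""
    n = np.arange(1, len(frame_sqerrs) + 1)
    rmse_up_to = np.sqrt(np.cumsum(frame_sqerrs) / n)
    return (rmse_up_to - baseline_mu) / baseline_sigma
```

Plotting the returned array over the frame index gives exactly the kind of cumulative z-score chart used in Chapter 5.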

4.1.2 Video Quality

Avoiding detection is in itself not hard to achieve: when the video is replaced by a black screen, it will not be detected, but the original data is also lost. This is where the next criterion comes in: the attacked video must be of acceptable video quality.

It is hard to define 'decent' video quality in terms of objective measures. PSNR and SSIM will be looked at, but the main criterion will be: 'is it subjectively enjoyable to watch?'. A pirate cares less about true image quality; he just wants to enjoy his content without waiting for an actual release or paying for it. The pirate also has no reference for how the video should look.

He only sees the video that he downloaded. If he can enjoy that video, it is a success in his eyes. Distracting artefacts such as colour banding, blocking or flickering should therefore be avoided, but loss of high-frequency detail is less important.

4.2 Attack Pipeline

The proposed attack is an adaptation of the ARTN proposed by Soh et al. [17], which is described in Section 3.3. Just as the ARTN, this attack only works on grayscale images. The reason for this is twofold. Firstly, as stated in Section 2.1, the primary watermark is only embedded in the colour channels of the video. The removal of the colour channels therefore also removes the primary watermark. Secondly, as only one of the three channels remains, the complexity of the attack is reduced significantly.

The attack pipeline is structured analogously to the ARTN. First, the data is pre-processed, that is, split into patches which will be fed to the network. Next, the network takes in these patches and combines them into one output patch. These output patches are then merged into frames. Of these merged frames, a weighted sum is taken with the average of the input frames. Lastly, all frames are combined into a single video in the recompression step. These stages are explained in detail in the next sections.

4.2.1 Data Pre-processing

Soh et al. extract a patch of 64×64 pixels from 3 consecutive frames, as they only have one video and try to exploit the temporal dependencies between consecutive frames. This collusion attack has multiple copies of the same frame, albeit each with a different watermark. There is thus no need to look for the best match in the previous and next frames; the same block of pixels can be taken from each of the input videos. The size of the patches was chosen to be 100×100 pixels with a stride of 90 pixels, which differs from the patch size in Soh's proposal. As will be stated in the next section, the neural network is fully convolutional. As a consequence, the input of the network can be any size.

The exact patch size has little impact on the effectiveness of the attack, as long as there is some overlap between neighbouring patches. In theory, it is possible to use the entire frame as input, but due to limited GPU memory this was not achievable in practice. Different patch sizes and strides do have an impact on the runtime. When using a small stride, more patches are extracted from a frame, which all need to be computed by the network. Hence, using as few patches as possible will execute faster. Equally, using larger patches will reduce the total number of patches.

For the experiments conducted, a patch size of 100×100 pixels and stride of 90 pixels was used. This strikes a balance between execution time and memory usage.
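Patch extraction with these numbers can be sketched as follows. The handling of the right and bottom edges (clamping a final patch to the frame border so no pixels are missed) is an assumption; the dissertation does not specify it:

```python
import numpy as np

def _starts(length, patch, stride):
    """Start offsets along one axis; a final patch is clamped to the edge
    so the whole axis is covered (an assumed edge-handling strategy)."""
    s = list(range(0, length - patch + 1, stride))
    if s[-1] != length - patch:
        s.append(length - patch)
    return s

def extract_patches(frame, patch_size=100, stride=90):
    """Split a frame into overlapping patches; with patch_size 100 and
    stride 90, neighbouring patches overlap by 10 pixels."""
    H, W = frame.shape
    return [((y, x), frame[y:y + patch_size, x:x + patch_size])
            for y in _starts(H, patch_size, stride)
            for x in _starts(W, patch_size, stride)]
```

For a 1920×1080 frame this yields 12 rows × 22 columns = 264 patches per frame, each of which must pass through the network, which is why larger strides execute faster.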

4.3 Neural Net

The structure of the neural network is very similar to the network proposed by Soh et al. The network can be split into a 'spatial' branch, which replaces the temporal branch of Soh's network, and an aggregation stage. A global overview is provided in Figure 4.3.


Figure 4.3: Overview of the network structure.

Spatial Stage

The spatial branch is very similar to the temporal branch in the ARTN. As opposed to the temporal branch of the ARTN, which extracts and exploits the temporal information of consecutive frames, the proposed network exploits information of the same patch in the different versions of a single video.

The stage consists of as many branches as there are input videos, i.e. the number of colluders. Each branch is made up of 3 convolutional layers, each followed by a ReLU. A difference between this network and Soh's ARTN is that, as this attack uses multiple versions of the same frame, all branches are equally important. Therefore, all branches have feature maps of the same size. The exact kernel sizes and strides are shown in Figure 4.3.

Aggregation Stage

The aggregation stage merges the features extracted by the spatial branch and enhances them. The stage is the same as the aggregation stage of Soh et al., that is, two Inception blocks followed by a final convolution layer. The structure of the Inception block can be seen in Figure 3.2 (b).

4.3.1 Merge Stage

The output patches of the neural network are combined in this step. The patches the network generates may have some artefacts at the edges, as the border pixels have less surrounding information. To prevent these artefacts from distorting the final image, the borders are dropped. Any artefacts that were encountered during testing were limited to just a few pixels from the edge. The overlap between neighbouring patches is more than a few pixels, so it is easy to mask these possible problems.

The overlap between neighbouring patches is patchsize − stride pixels. Only half of this overlap of each patch will be used in the final image. In other words, (patchsize − stride)/2 pixels will be dropped from the edges of each patch. A visualisation is available in Figure 4.4.
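The border-dropping merge can be sketched as follows (function name and signature are illustrative). For patch size 100 and stride 90, the margin is (100 − 90)/2 = 5 pixels, and patches touching the frame border keep all their pixels, as described for the frame edges below:

```python
import numpy as np

def merge_without_blending(patches, positions, frame_shape,
                           patch_size=100, stride=90):
    """Reassemble a frame from network output patches.  Each patch
    contributes only its centre region: (patch_size - stride) // 2 pixels
    are dropped from every interior edge, so neighbouring patches tile
    the frame without blending.  Patch edges that coincide with the frame
    border keep all their pixels."""
    margin = (patch_size - stride) // 2  # 5 pixels for 100/90
    H, W = frame_shape
    out = np.zeros(frame_shape)
    for patch, (y, x) in zip(patches, positions):
        t = 0 if y == 0 else margin                          # top crop
        l = 0 if x == 0 else margin                          # left crop
        b = patch_size if y + patch_size == H else patch_size - margin
        r = patch_size if x + patch_size == W else patch_size - margin
        out[y + t:y + b, x + l:x + r] = patch[t:b, l:r]
    return out
```

Unlike the Gaussian-weighted averaging of the ARTN, each output pixel here comes from exactly one patch, reflecting the deliberate choice not to blend the overlap.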

It was a deliberate choice not to blend the overlap together. Experiments have shown that blending the pixels together, by averaging for example, will increase the SSIM index and PSNR a bit, but will also worsen the z-score. This difference in quality is not noticeable with the naked eye. Perfect picture quality is also not the goal of this attack; the main goal is avoiding detection.

For the patches at the edges of the frame, where there are no neighbouring, overlapping patches, all pixels are used. If any artefacts were to occur there, they would not be intrusive, as they are at the very edge of the frame.

4.3.2 Combine Stage

Before the frames are recompressed into a video, each frame is summed with a weighted difference between the merged frame and a naively attacked frame. The formula is shown in Equation 4.2, in which the merged frame is the result of the previous merge stage and α is a weighting factor. Note that, when α is 0, the resulting attacked frame is the same as the merged frame, and when α is −1, it is the same as the naive frame.

attacked = merged + α·(merged − naive)
         = (1 + α)·merged − α·naive
         = merged + α·(merged − (1/n)·Σᵢ₌₁ⁿ videoᵢ)    (4.2)

The naive frame is the pixelwise mean of all the input video frames, which can be calculated by summing the videos pixelwise and dividing by the total number of videos (= the number of colluders).
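Equation 4.2 translates directly into numpy (the function name is illustrative):

```python
import numpy as np

def combine(merged, colluder_frames, alpha):
    """Equation 4.2: blend the network output with the naive average.
    alpha = 0 returns the merged frame; alpha = -1 returns the naive
    (pixelwise mean) frame."""
    naive = np.mean(colluder_frames, axis=0)  # pixelwise mean over colluders
    return merged + alpha * (merged - naive)
```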

As was shown in Section 3.1.3, averaging the frames is not a successful method of avoiding detection. Values of α close to −1 will result in a frame close to the naive frame, in which the watermarks will be detected. When α is smaller than −1 or larger than 0, the quality of the resulting attacked frame deteriorates, as the merged and naive frames are combined in unequal parts, that is, the total sum of the weights does not equal one. Consequently, α values should be kept close to the range [−1, 1]. This was also shown experimentally: the larger α, the worse the PSNR and SSIM became, but larger values improve the z-score. A balance between video quality and z-score must be struck. Different values of α are evaluated in the next chapter, Chapter 5.

4.3.3 Recompression

The final step is to combine the merged frames into a single video. This is done by encoding the frames with the HEVC encoder. The default HEVC parameters were used, with a CRF of 22, 27, 32 or 37. The goal of this recompression is not only to create a video from the separate frames, but also to embed a new secondary watermark. By using more compression (a higher CRF), more artefacts are introduced, which may hide the remnants of the original secondary watermark that remain after the neural network.


Figure 4.4: Visualisation of how patches are merged.

4.4 Network Training

4.4.1 Training dataset

The dataset consists of 5 video sequences from the JCT-VC test sequences: BQTerrace, Cactus, Kimono, Parkjoy and Parkscene. These are Full HD sequences with a resolution of 1920×1080 pixels. All sequences are 10 seconds long and contain 600, 500, 240, 500 and 240 frames respectively. For each sequence, a set of 20 watermarked videos was generated with four different CRF values: 22, 27, 32 and 37. All videos are encoded with the HEVC encoder and have a primary and secondary watermark embedded as described in Chapter 2. As stated before, the network is only trained on the Y-channel. For the 5 sequences, the original, uncompressed and unwatermarked version is used as the network target, i.e. the labels.

The watermarked input videos are split into patches of 64×64 pixels with a stride of 48. The label of each patch is the original, uncompressed, unwatermarked patch. This is done for every 50th frame of each training sequence. Soh et al. also found that using all patches as-is is not good for training: patches of flat regions have very little variance and few compression artefacts, while patches with high frequency and/or high variance contain many compression artefacts and are therefore better for training. Hence, patches with a variance lower than a threshold are dropped from the training set. A threshold of 0.002 (with pixel values normalised between 0 and 1) was chosen. As the last step, the dataset is augmented by rotating and flipping each patch, which increases the size of the dataset by a factor of 8.
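The dataset preparation above can be sketched as follows (the function name is illustrative, and the per-50th-frame selection is assumed to happen before this function is called):

```python
import numpy as np

def training_patches(frames, labels, patch_size=64, stride=48,
                     var_threshold=0.002):
    """Extract 64×64 training patches (stride 48), drop low-variance
    (flat) patches, and augment the rest by rotation and flipping (×8).
    Pixel values are assumed to be normalised to [0, 1]."""
    data = []
    for frame, label in zip(frames, labels):
        H, W = frame.shape
        for y in range(0, H - patch_size + 1, stride):
            for x in range(0, W - patch_size + 1, stride):
                p = frame[y:y + patch_size, x:x + patch_size]
                l = label[y:y + patch_size, x:x + patch_size]
                if p.var() < var_threshold:
                    continue  # flat patch: few compression artefacts
                for k in range(4):              # 4 rotations ...
                    for flip in (False, True):  # ... times 2 flips = ×8
                        pa, la = np.rot90(p, k), np.rot90(l, k)
                        if flip:
                            pa, la = np.fliplr(pa), np.fliplr(la)
                        data.append((pa, la))
    return data
```

Each surviving patch yields 8 augmented (input, label) pairs, and the same rotation/flip is applied to the input and its label so they stay aligned.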

4.4.2 Training Details

For each CRF value, each number of colluder videos, and each sequence, a new network was trained. Soh et al. found that a network specialised for a certain CRF value performs better than a general-purpose one [17]. The same reasoning goes for training a new network for every number of inputs. Due to the limited number of test sequences, a separate network was trained to attack each sequence. Each network is trained on the four other sequences and validated on itself.

The model was implemented in Keras and uses the Adam optimiser. The MSE was used as the loss function. The batch size was set to 64, as this was the largest size that could be used for a high number of input videos. The learning rate was kept constant at 0.001. Each network was trained for 10 epochs; training for longer was deemed futile as training and validation loss stagnated.

A new network was trained for each of the 5 sequences, for 2 to 6 input videos, for a CRF of 22, which totals 25 networks. For the BQTerrace sequence, a network was additionally trained for 2 to 6 input videos for each of the 4 CRF values: 22, 27, 32 and 37. This adds 15 more networks, for a total of 25 + 15 = 40 trained networks. Better optimisation of the network meta-parameters and deeper branches might improve the performance, but due to the large number of networks, training time was restricted. Training a single network takes more than 6 hours, which is more than 240 hours or 10 days for the 40 networks. As the meta-parameters were based on the parameters found by Soh et al., the hypothetical optimal parameters would not drastically improve overall performance. Even these sub-optimal networks should give enough insight into whether this method has any possibility of breaking the watermark.

5 Evaluation

This chapter evaluates the performance of the different watermark attacks that are discussed in Section 3.1. Before the CNN attack was devised, many simpler collusion attacks were attempted, which can be categorised as synchronisation and removal attacks. Firstly, synchronisation attacks are discussed. Next, the efficacy of simple removal attacks is evaluated. Lastly, the results of the method proposed in Chapter 4 are reported. As was stated in Section 3.1.2 and Section 3.1.4, cryptographic and protocol attacks are inapplicable to the secondary watermark signal and are therefore not evaluated.

Important notice

The PSNR and SSIM scores that are listed in this chapter contain an error. The scores listed here are lower than one would expect for the actual perceived video quality. While the scores are not comparable with external PSNR and SSIM values (the scores shown here are far lower), they can be compared to each other. A detailed explanation is given in Appendix A.

5.1 Synchronisation Attacks

This section evaluates some simple synchronisation attacks. Due to the inefficacy of these attacks, only one type, the zoom attack, is discussed; the other synchronisation attacks are analogous.

5.1.1 Zoom Attack

The zoom attack is a geometric synchronisation attack where the image is zoomed in or out by a certain factor. Zooming in can be seen as cropping the image and rescaling the result to its original dimensions. For a zoom factor of 1.02 and a 1920×1080 resolution, this means that 19 and 11 pixels are dropped from the left and right, and top and bottom respectively. This is visualised in Figure 5.1. The pixels inside the red rectangle are rescaled to the original dimensions of 1920×1080 pixels.

Figure 5.1: Crop rectangle for a zoom factor of 1.05×.

Figure 5.2: Cumulative z-score for the BQTerrace sequence with a zoom factor of 1.02×.

Even this slight zoom is enough to break the synchronisation, as can be seen in Figure 5.2. In this figure, the red line denotes the cumulative z-score of a single attacked, i.e. cropped, video. The green dots represent the cumulative z-scores of the baseline videos, with the green line being their mean value, which is always 0. The blue line at the bottom is the detection threshold of −4.8. The z-score of the attacked video stays far above the threshold, even after 200 frames, which means that the watermark was not detected. As the trend of the cumulative z-score is quite flat, 200 frames suffice to indicate detection failure.

While at first sight this might indicate a successful attack, this is not necessarily the case: image/video registration software can be used to detect and undo these geometric transforms.

Figure 5.3: Cumulative z-score for the BQTerrace sequence with a zoom factor of 1.005×.

Imperfect Resynchronisation

When one tries to resynchronise the attacked video, perfect synchronisation is not required: the secondary watermark tolerates slight desynchronisation. For example, if a zoom factor of 1.005× is used, the watermark is clearly detected, as can be seen in Figure 5.3, where the red line, denoting the cumulative z-score, dips far below the threshold of −4.8. A zoom factor of 1.005× corresponds to a crop of 5 pixels from the left and right, and 3 pixels from the top and bottom.

5.2 Removal Attacks

This section evaluates simple removal attacks. As was mentioned in Section 3.1.1, removal attacks try to remove the watermark. As the secondary watermark resides in the compression artefacts, removing the watermark is equal to removing these artefacts. Before the more complex proposed method is analysed in Section 5.3, this section evaluates simple and naive approaches to removing the artefacts. One of those naive methods is averaging the multiple copies of the different colluders frame by frame. As each copy of the video has slight variations in its compression artefacts due to the primary watermark, averaging the frames will reduce these differences.

Even though the differences are reduced, it is not nearly enough to thwart the watermark detection. Table 5.1 contains the z-scores at frame 150 of the BQTerrace sequence for five different removal attacks. When more videos are used, the z-score increases, which indicates that, when enough attackers collude, the attacks might succeed. However, gathering such a large group of nefarious reviewers is very unlikely in practice.

The attacks used in the table are performed as follows:

Mean frame: Calculate the mean value of every pixel across all watermarked copies.

Median frame: Calculate the median value of every pixel across all watermarked copies.

DTCWT average: Apply the DTCWT transformation to each of the copies. Take an average (the median in this case) of the DTCWT coefficients. Lastly, calculate the inverse transform of the averaged coefficients.

Subtract watermark: Estimate a ground truth using the different copies. Here the ground truth is the mean of all copies. The watermark is estimated to be the difference between one of the copies and that ground truth. Subtract the watermark from the copy.

Subtract watermark exclusive: Analogous to Subtract watermark, but the watermark is estimated on all but one of the different copies. The estimated watermark is then subtracted from the excluded copy.
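The mean-frame, median-frame and subtract-watermark attacks can be sketched directly in numpy (the DTCWT and exclusive variants are omitted; function names are illustrative). Note that, taken literally with the ground truth equal to the mean of all copies, subtracting the full estimated watermark reduces algebraically to the mean frame; the strength factor k below is therefore an illustrative generalisation, not something specified in the text:

```python
import numpy as np

def mean_frame(copies):
    """Mean value of every pixel across all watermarked copies."""
    return np.mean(copies, axis=0)

def median_frame(copies):
    """Median value of every pixel across all watermarked copies."""
    return np.median(copies, axis=0)

def subtract_watermark(copies, idx=0, k=1.0):
    """Estimate a ground truth as the mean of all copies, estimate the
    watermark of copy `idx` as its difference from that ground truth,
    and subtract k times the estimated watermark.  With k = 1 this
    reduces exactly to the mean frame."""
    truth = mean_frame(copies)
    return copies[idx] - k * (copies[idx] - truth)
```

`copies` is expected to be an array of shape (n_colluders, height, width) holding the grayscale frames of the different watermarked copies.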

Attack name              | 2 videos | 3 videos | 4 videos | 5 videos | 6 videos
Mean frame               |  -842.16 |  -592.95 |  -536.07 |  -413.12 |  -377.03
Median frame             |  -842.16 |  -526.72 |  -496.56 |  -379.75 |  -348.07
DTCWT average            |  -842.16 |  -534.76 |  -496.81 |  -373.90 |  -345.72
Subtract watermark       | -1647.24 | -1856.74 | -1964.00 | -1968.49 | -1991.82
Subtract watermark excl. | -1375.54 | -1649.71 | -1851.59 | -1887.93 | -1936.69

Table 5.1: Minimal z-score at frame 150 for sequence BQTerrace

While the z-score is not reduced enough to thwart detection, the quality does increase for some of the attacks. Figure 5.4 shows the evolution of the PSNR and SSIM scores for the different attacks, where the blue lines indicate the scores for the first 50 frames of the attacked video. The orange line is the score for the original watermarked video. For the three attacks where frames are averaged, both the PSNR and SSIM scores improve. Meanwhile, the scores for the two subtract-watermark attacks decrease. These scores are affected by the error explained in Appendix A; they are however affected equally and may be compared to one another.

These differences in quality are, however, not very noticeable in practice. Figure 5.5 shows a part of the tenth frame of the BQTerrace sequence. Subfigure 5.5a is a part of the original, unwatermarked frame. Subfigure 5.5b is a part of a compressed and watermarked frame without any attacks. Subfigures 5.5c and 5.5d show the same part of the frame for the subtract-watermark attack and the median-frame attack respectively. Even zoomed in this close, only small differences can be seen. These differences are barely noticeable when looking at a complete frame.

This reinforces the statement made in Section 4.1.2 that the PSNR and SSIM scores are not sufficient to validate visual quality. A visual check is needed to gauge the actual perceived quality.

5.3 Neural Network Attack

This section evaluates the attack using the pipeline described in Section 4.2. Firstly, the z-scores before recompression are discussed, as these scores hinted at the possibility of a successful attack. Afterwards, the scores after recompression are examined.


Figure 5.4: Comparison of PSNR and SSIM

5.3.1 Before Recompression

Z-score

Table 5.2 contains the minimal z-score (the worst z-score of all input videos used for the attack) for the five different sequences. The scores are shown for 2 through 6 colluders (which equates to 2 through 6 input videos) and for α-values of 0, 0.5 and 1. The scores are calculated based on 50 frames.

Important to note is that frames with different α-values are based on the same output of the neural net. The α-value is thus the only parameter influencing the score for the same sequence and number of inputs. Knowing this, it is clear that a higher α-value greatly reduces the magnitude of the z-score, bringing it closer to the threshold. The α-value is the blend factor between the frame produced by the neural network and a naively attacked frame, as explained in detail in Section 4.3.2.

Next, the effect of using more videos is also clear: when more input videos are used, the z-score improves.

In general, the scores are an order of magnitude better when using the proposed network than when using the naive averaging attacks of Section 5.2. A successful attack has a z-score higher than −4.8, which equates to a false-positive probability of 10⁻⁶. This hints at the possibility that recompression might improve the z-score enough to thwart the watermark detection, that is, if the quality of the generated frames permits it.


(a) Original (b) No attack

(c) Subtract watermark (d) Median frame

Figure 5.5: Quality comparison of frame 10 of the BQTerrace sequence

Sequence  | α   | 2 videos | 3 videos | 4 videos | 5 videos | 6 videos
BQTerrace | 0.0 |   -79.84 |   -55.58 |   -49.41 |   -38.00 |   -30.67
          | 0.5 |   -48.59 |   -34.59 |   -31.92 |   -24.36 |   -19.79
          | 1.0 |   -33.33 |   -24.51 |   -23.45 |   -17.76 |   -14.42
Cactus    | 0.0 |   -93.87 |   -70.11 |   -49.79 |   -37.80 |   -43.93
          | 0.5 |   -60.14 |   -46.81 |   -33.41 |   -25.82 |   -31.51
          | 1.0 |   -41.91 |   -34.41 |   -24.79 |   -19.51 |   -24.86
Kimono1   | 0.0 |   -42.09 |   -30.39 |   -21.48 |   -17.48 |   -14.19
          | 0.5 |   -26.15 |   -19.65 |   -13.73 |   -11.36 |    -9.31
          | 1.0 |   -18.08 |   -14.20 |    -9.81 |    -8.28 |    -6.84
ParkJoy   | 0.0 |  -163.53 |  -107.49 |   -84.88 |   -77.66 |   -59.20
          | 0.5 |  -103.22 |   -69.77 |   -56.19 |   -54.35 |   -40.22
          | 1.0 |   -68.39 |   -47.85 |   -39.90 |   -41.17 |   -29.76
ParkScene | 0.0 |  -135.46 |   -91.16 |   -65.09 |   -53.88 |   -48.13
          | 0.5 |   -86.81 |   -60.03 |   -43.37 |   -36.34 |   -32.05
          | 1.0 |   -61.99 |   -43.97 |   -32.09 |   -27.17 |   -23.83

Table 5.2: Minimal z-score at frame 50 for all sequences with input CRF 22

Sequence  | α   | PSNR (dB): 2 / 3 / 4 / 5 / 6 videos   | SSIM: 2 / 3 / 4 / 5 / 6 videos
BQTerrace | 0.0 | 35.80 / 35.96 / 36.28 / 36.11 / 36.50 | 0.943 / 0.947 / 0.948 / 0.947 / 0.950
          | 0.5 | 30.12 / 29.99 / 31.39 / 31.04 / 30.86 | 0.938 / 0.944 / 0.945 / 0.937 / 0.947
          | 1.0 | 25.41 / 25.27 / 26.44 / 26.18 / 25.95 | 0.927 / 0.937 / 0.938 / 0.918 / 0.941
Cactus    | 0.0 | 35.03 / 36.04 / 34.95 / 35.04 / 36.23 | 0.916 / 0.922 / 0.921 / 0.923 / 0.923
          | 0.5 | 30.15 / 31.10 / 29.94 / 29.92 / 31.42 | 0.909 / 0.916 / 0.915 / 0.915 / 0.919
          | 1.0 | 26.14 / 26.88 / 25.95 / 25.91 / 27.17 | 0.895 / 0.903 / 0.902 / 0.902 / 0.908
Kimono1   | 0.0 | 41.16 / 41.29 / 41.39 / 41.74 / 41.80 | 0.957 / 0.959 / 0.959 / 0.960 / 0.961
          | 0.5 | 33.44 / 34.12 / 33.96 / 33.54 / 33.24 | 0.949 / 0.953 / 0.952 / 0.954 / 0.953
          | 1.0 | 27.92 / 28.41 / 28.29 / 27.95 / 27.73 | 0.932 / 0.937 / 0.936 / 0.937 / 0.935
ParkJoy   | 0.0 | 35.58 / 35.85 / 35.98 / 36.06 / 35.85 | 0.944 / 0.947 / 0.949 / 0.949 / 0.949
          | 0.5 | 30.77 / 31.01 / 31.09 / 31.11 / 30.67 | 0.930 / 0.933 / 0.935 / 0.935 / 0.932
          | 1.0 | 26.15 / 26.33 / 26.38 / 26.39 / 26.04 | 0.899 / 0.904 / 0.907 / 0.905 / 0.900
ParkScene | 0.0 | 39.24 / 39.56 / 39.75 / 39.80 / 39.88 | 0.952 / 0.955 / 0.956 / 0.957 / 0.957
          | 0.5 | 33.47 / 33.38 / 33.25 / 32.87 / 33.33 | 0.946 / 0.948 / 0.949 / 0.949 / 0.951
          | 1.0 | 28.13 / 28.02 / 27.90 / 27.60 / 27.95 | 0.930 / 0.932 / 0.932 / 0.931 / 0.935

Table 5.3: Mean frame quality for all sequences with input CRF 22. These scores cannot be compared to other quality scores in this dissertation, as they are affected differently by the error.

Quality

Next, the quality is evaluated. The PSNR and SSIM scores are displayed in Table 5.3. In this table, one can see that for a given α-value, the quality scores remain fairly similar when more videos are used; the α-value itself, however, has a great impact on the quality. These scores are affected differently by the error discussed in Appendix A, and can therefore only be compared to other values in this table. For reference, Table 5.4 lists the mean frame quality of the unattacked watermarked videos; remarkably, at low α-values the attacked videos score considerably higher than the watermarked videos they were derived from.

Sequence    PSNR (dB)   SSIM
BQTerrace     26.00     0.93
Cactus        28.29     0.89
Kimono1       27.69     0.91
ParkJoy       26.26     0.85
ParkScene     27.54     0.90

Table 5.4: Mean frame quality for unattacked watermarked videos with CRF 22
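For readers who want to reproduce these measurements, a minimal sketch of the two metrics is given below. The PSNR follows the standard definition; the SSIM variant shown here is a simplified *global* one (a single window spanning the whole frame), whereas the commonly used metric averages over small sliding windows, so values will differ slightly from the tables.

```python
import numpy as np

def psnr(ref, dist, peak=255.0):
    """Peak signal-to-noise ratio between two 8-bit frames, in dB."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak * peak / mse)

def ssim_global(ref, dist, peak=255.0):
    """Simplified SSIM over one window spanning the whole frame.
    (The standard metric averages over small sliding windows instead.)"""
    x = ref.astype(np.float64)
    y = dist.astype(np.float64)
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))
```

Averaging these per-frame scores over all frames of a sequence then yields mean frame qualities such as those reported in Tables 5.3 and 5.4.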

This is a consequence of the network's original purpose: it tries to remove all compression artefacts and generate a frame that is as close as possible to the original, uncompressed frame. The network is quite successful in doing so, according to these measures. Comparing the different versions visually (see Figure 5.6) confirms these findings. The original frame has a quite flat, low-contrast look. The watermarked version¹ seems to have more contrast (e.g. a darker shadow under the bridge, a lighter building). This added contrast is due to the way the videos were read; see Appendix A.1 for the explanation. The network tones down this contrast

¹ The primary watermark is removed when the video was converted to black and white; see Chapter 2.

CHAPTER 5. EVALUATION

Figure 5.6: Quality comparison of frame 50 of BQTerrace: (a) Original, (b) Watermarked (PSNR 26.00 dB), (c) α-value 0 (PSNR 35.60 dB), (d) α-value 0.5 (PSNR 30.71 dB), (e) α-value 1 (PSNR 25.97 dB).

to match the original (α-value 0). When looking closer in Figure 5.7, the noise reduction of the network can be seen: noise is reduced significantly without removing many other details. The visual effect of the α-value is mostly a reduction in dynamic range; the minimum and maximum pixel values are mapped closer to middle grey (pixel value 127 in 8-bit colour). Since no other artefacts are introduced, even this version is still enjoyable to watch. The lower contrast is not bothersome, and as a pirate has no reference for how much contrast the original should have, he would fail to notice it.
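The dynamic-range reduction described above can be mimicked with a toy mapping that pulls every pixel toward middle grey. This is only an illustration of the observed visual effect, not the network's actual operation:

```python
import numpy as np

def toward_middle_grey(frame, strength):
    """Toy illustration: compress the dynamic range around middle grey (127).
    strength = 0 leaves the frame untouched; strength = 1 maps everything to 127."""
    f = frame.astype(np.float64)
    out = 127.0 + (1.0 - strength) * (f - 127.0)
    return np.clip(out, 0, 255).astype(np.uint8)
```

With `strength = 0.5`, pure black (0) and pure white (255) are mapped to roughly 63 and 191, matching the visible loss of contrast at higher α-values.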


Figure 5.7: (a) Original, (b) Watermarked, (c) α-value 0, (d) α-value 0.5, (e) α-value 1.


5.3.2 Z-scores after Recompression

In the previous subsection, it was established that the neural network attack improves the z-score considerably. However, recompressing the separate frames into a single video introduces new compression artefacts. These artefacts might predominate over the residual artefacts that trigger the watermark detection. This subsection evaluates the effect of recompressing the frames with different CRFs: 22, 27, 32 and 37.
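Such a recompression sweep could, for instance, be scripted around ffmpeg. The snippet below is a hypothetical sketch: the codec choice (libx265), file names and frame-pattern are assumptions, not the exact tool chain used in this dissertation.

```python
import subprocess  # only needed when actually running the commands

CRFS = (22, 27, 32, 37)

def recompress_cmd(frame_pattern, out_path, crf):
    """Build an ffmpeg command that encodes a frame sequence at a given CRF."""
    return ["ffmpeg", "-y", "-i", frame_pattern,
            "-c:v", "libx265", "-crf", str(crf), out_path]

for crf in CRFS:
    cmd = recompress_cmd("attacked_%03d.png", f"attacked_crf{crf}.mp4", crf)
    # subprocess.run(cmd, check=True)  # uncomment to actually encode
```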

Z-score

Table 5.5 contains the minimal z-scores calculated on 200 frames for the five sequences for different CRF and α-values. The minimal z-score is the lowest z-score of all videos used in the attack. None of the colluders may be detected, so if even one video's z-score drops below the threshold, the attack failed.

Values above the detection threshold of -4.8 are coloured green. For a given α-value, the z-score improves when the CRF and the number of input videos increase. Across α-values, the findings from the before-recompression analysis (Section 5.3.1) remain: a higher α-value improves the z-score.
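The success criterion described above amounts to a single comparison; the threshold value is taken from the text, while the z-score lists in the example are hypothetical:

```python
DETECTION_THRESHOLD = -4.8  # z-scores below this trigger watermark detection

def attack_succeeds(colluder_z_scores):
    """The collusion attack only succeeds if *every* colluder stays
    undetected, i.e. even the minimal z-score stays above the threshold."""
    return min(colluder_z_scores) > DETECTION_THRESHOLD
```

For example, `attack_succeeds([-3.2, -4.5, -2.9])` holds, while a single detected colluder (say, one score of -5.0 in the list) makes the whole attack fail.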

For all but one sequence, ParkJoy, at least one of the attacks was successful. The ParkJoy scores are far worse than those of the other sequences, which might indicate that the network training went wrong. For the other sequences, successful attacks exist, i.e. attacks with a z-score above -4.8. The next section evaluates whether these successful attacks are also of acceptable quality.

Quality

For multiple combinations of higher α-values and CRFs, the detection fails. But are these successful attacks also of reasonable quality? As explored in Section 5.3.1, a higher α-value is acceptable. So, what is the effect of a higher CRF? A higher CRF increases the amount of compression the encoder applies, and high compression results in many compression artefacts such as blocking and ringing.

The PSNR and SSIM scores for the different parameters are listed in Table 5.6. Especially the SSIM score decreases a lot when a higher CRF is used, which might indicate many artefacts. This can be verified visually. A section of frame 50 of the Cactus sequence is shown in Figure 5.8 for the four different CRF values. Figures 5.8a and 5.8b look almost identical and are of great visual quality. At CRF 32 (Figure 5.8c), the cacti and swinging tiger lose a lot of detail, and a few ringing artefacts can be seen in the playing card. At CRF 37, the artefacts become even worse and make the video very unpleasant to watch. These findings are analogous for the other sequences.

The highest usable CRF is therefore 32, albeit with noticeable but still acceptable compression. When applying this quality limit to the results, attacks were successful for the sequences BQTerrace, Cactus and Kimono1 when at least 3 attackers collude with an α-value of 1.

For the sequences Kimono1 and BQTerrace, attacks are also successful at an even lower CRF of 27. This lower compression requires at least 5 colluders with an α-value of 1, or 6 colluders with an α-value of 0.5.

Attacks were not successful for ParkJoy and ParkScene for a CRF of 32 or lower. Their z-scores before recompression were also lower than those of the other sequences.


Figure 5.8: Section of frame 50 of the Cactus sequence at the four evaluated CRF values: (a) CRF 22, (b) CRF 27, (c) CRF 32, (d) CRF 37.

