
IMAGE SIMILARITY FOR EMAIL RENDERING

submitted in partial fulfillment for the degree of master of science

Christian Diestlberger

11830832

master information studies

data science

faculty of science

university of amsterdam

2018-07-07

                 Internal Supervisor        External Supervisor
Name, Title      Anna Sepliarskaia, MSc     Aljar Meesters, PhD
Affiliation      UvA                        Copernica


ABSTRACT

Image similarity has been widely studied over the last few decades. Although much progress has been achieved, there are still specific use-cases where standard image similarity measures might not work. The aim of this thesis is to determine the acceptability of images when there are only minor variations between a set of images and a corresponding reference image. By conducting a user study, it is examined how human beings assess similarity in this specific use-case. Furthermore, it is investigated whether humans are consistent in their similarity assessments. The results from the user study are then combined with similarity scores obtained from several image processing algorithms. The ultimate aim is to find a threshold value that reflects human perception of acceptability.

KEYWORDS

Image Similarity, Image Ranking, Image Processing, Image Acceptability, Human Vision, Human Perception

1 INTRODUCTION

Copernica offers email marketing automation software that enables its users to send emails to large audiences. The outgoing emails are opened on many different platforms (e.g. GMail, Outlook, Thunderbird), and all these platforms render the email slightly differently. Currently, Copernica offers a test (Litmus check) that enables users to see how the email is rendered on these different platforms. Since the emails are opened on different operating systems, mail clients and devices, the rendered emails are almost never identical to the original one. However, most of the time there are only minor differences and the emails are still acceptably similar. Take for example a slight offset that is caused by minor differences in the size of certain elements.

As noted by Wang and Sheikh [1], the only valid method for quantifying whether images are similar or not is subjective evaluation. This is especially true for applications such as the email marketing automation tool from Copernica, where users themselves have to go through all the email previews in order to see if the email is acceptable on all existing platforms. Most of the time such a subjective evaluation is simply too inconvenient and time-consuming for users. In a blog post, Chad [2] highlights the complexity of email rendering. While website rendering is mostly affected by the operating system, browser and screen size used, for email rendering other aspects, such as the subscriber’s email client, also play a crucial role. Putting all these factors together leads to a vast number of potential renderings. In the case of Copernica, the Litmus check results in around 250 possible combinations.

The aim of this thesis is to enable an automation of this human check. Instead of using a costly and time-consuming subjective evaluation method, an appropriate objective evaluation method (i.e. assessment of similarity without human interference) has to be found. Therefore, a strong metric has to be identified that correlates with the subjective assessment and the human visual system in the specific use-case of email rendering.

Despite the fact that a lot of research has already been done in the area of image similarity, this is still not a trivial task. First, no ’best’ similarity metric exists, but rather a set of metrics, and the appropriate one always depends on the particular use-case (similar to a tool in a toolkit). This thesis project differs from past research in that it tries to determine whether an image is still similar enough to a reference image (i.e. the focus is on the acceptance of the user). Since rating similarity is a highly subjective task (especially in the case of often almost identical images), the existence of a threshold value that reflects human assessment of acceptability is investigated in more detail. There are several important aspects that help to answer this question:

• RQ1: Is there a common way for assessing similarity of almost identical images?

• RQ2: What is the relationship between similarity and acceptability for almost identical images?

• RQ3: How do human beings assess similarity for almost identical images in general?

• RQ4: How well do image processing algorithms reflect human perception in this specific use-case?

By answering these research questions, the ultimate goal of this thesis is to find the above-described threshold value.

2 RELATED WORK

Image similarity has been a popular and common research area over the last few decades. The major goal of image similarity assessment is to design algorithms for automatic and objective evaluation of similarity in a manner that is consistent with subjective human evaluation [3]. Advances in computer technology have led to the creation of more and more digital images that are stored in large databases. These large amounts of images also give reason to automate certain tasks. In this context, research has mainly focused on identifying duplicates [4], image categorization [5] and image queries [6]. These areas are of particular interest because they facilitate the search for specific images. However, the development of such algorithms also requires underlying similarity metrics that are consistent with human visual similarity perception [7].

In the beginning, approaches for defining image similarity metrics mainly used classical image processing techniques. The first approaches were based on a single discriminant (e.g. hash values or color histograms). Later approaches used image compression techniques that were able to obtain features corresponding to multiple aspects. These metrics have proven to perform well on tasks with different objects. However, it is unclear how well they relate to actual human perception [7]. More recently, models of the early human visual system have been developed in order to design quality metrics for certain applications. In this context, the ’Structural Similarity Index (SSIM)’ proposed by Wang and Sheikh [1] is one of the most popular examples in this area. Finally, in the era of deep learning, convolutional neural networks have also been applied successfully to replace manual feature extraction [8].

Since the aim of this thesis is to compare images with minor differences to a base image, the focus is on so-called full-reference image similarity. For this, one can apply either subjective or objective evaluation [1].

2.1 Subjective Evaluation

As noted by Al-Najjar and Soong [9], the mean opinion score is a popular subjective method for assessing image similarity. Although this method reflects human perception, it is argued that it is too expensive and time-consuming. Nevertheless, there has been quite a lot of research on the human assessment of image similarity [7, 10–12]. However, research in this area has focused mainly on making automatic systems understand image similarity and did not provide extensive analyses of the underlying measurements.

As opposed to the above-described literature, this paper provides a study on the consistency of image similarity as evaluated by humans on almost identical images. More specifically, it is investigated whether there is a natural threshold value for the acceptability of images in the context of email rendering.

2.2 Objective Evaluation

With regards to objective evaluation methods, Avcibas [13] categorizes these methods into six categories. For the purpose of this paper, similarity is assessed with image processing algorithms from the following two groups:

• Pixel difference-based measures: Mean Square Error (MSE), Normalized Root Mean Square Error (NRMSE) and Peak Signal-to-Noise Ratio (PSNR)

• Human visual system-based measures: Structural Similarity Index (SSIM)

With pixel difference-based measures, image similarity is determined from a comparison of pixel intensities. Since such intensity-based methods are point operations, the similarity evaluation is independent of all other pixels in an image, even though neighboring pixels are often highly correlated with each other. Furthermore, intensity-based methods do not account for properties of the human visual system [14]. Instead of focusing on single pixel values, ’SSIM’ [1] compares local patterns of pixel intensities and is therefore supposed to better reflect actual human perception.
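All four measures are available as off-the-shelf implementations. The following minimal sketch uses scikit-image, which is an assumption about tooling rather than the exact implementation used in this thesis; the file names are placeholders.

```python
from skimage import io
from skimage.metrics import (mean_squared_error, normalized_root_mse,
                             peak_signal_noise_ratio, structural_similarity)

def similarity_scores(reference_path: str, client_path: str) -> dict:
    """Compute the four full-reference measures for one image pair."""
    # as_gray=True yields float images in [0, 1]; both screenshots must have the same size.
    ref = io.imread(reference_path, as_gray=True)
    img = io.imread(client_path, as_gray=True)
    return {
        "mse": mean_squared_error(ref, img),                        # lower = more similar
        "nrmse": normalized_root_mse(ref, img),                     # lower = more similar
        "psnr": peak_signal_noise_ratio(ref, img, data_range=1.0),  # higher = more similar
        "ssim": structural_similarity(ref, img, data_range=1.0),    # 1.0 = identical
    }

# Hypothetical usage: similarity_scores("original_email.png", "gmail_render.png")
```

Note that the absolute MSE and PSNR values depend on the chosen pixel range (here [0, 1]), so scores are only comparable within one consistent setup.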

3 METHODOLOGY

This section gives an overview of the methodology of the user study. After describing the experimental settings (task 1 & task 2), the resources used (i.e. the data) and the subjects are presented in more detail. Finally, the data gathered from the user study is inspected.

3.1 Experimental Settings

Since no labeled dataset was available, the first step of this thesis project was to conduct a user study. Without the subjective feedback obtained from such a cognitive-psychological experiment, it would not be possible to make valid judgments on the acceptance level of human beings in the specific use-case of image similarity for email rendering.

In order to get a number of respondents large enough for significant results, the user study was performed using an online survey tool. The final version of the user study assigned subjects two main tasks: rating the similarity of email images in task 1 and choosing the most similar image in task 2. After an introduction about the research goals and the purpose of the online survey, a detailed description of task 1 was provided. In this task, participants were presented with 20 image pairs. Each image pair consisted of the ’original’ email image (created with Copernica’s marketing automation tool) and an email-client-specific image. Since there are often only minor differences between those images, participants could zoom in on the images by clicking on the magnifier symbol in the top right corner. After inspecting the images, the subjects could rate the similarity using a slide bar (ranging from ’completely different images’ to ’identical images’) that was displayed below the image pair. Figure 1 shows how this task looked.

Figure 1: User Study - Task 1

The order of the displayed image pair was randomized. In some cases the ’original’ email image was displayed on the left side, while in other cases it appeared on the right side. Before rating the similarity, subjects always first had to mark the ’original’ email image by clicking on it. This attention check was used to see if participants actually paid attention to the task. After rating the similarity of one image pair, the survey subjects had to answer some follow-up questions. Besides using thumb symbols (i.e. thumbs-up for ’acceptable’ and thumbs-down for ’not acceptable’) to rate the acceptability of the email-client image, participants also had to choose the image feature that had the greatest influence on their similarity rating. In this context, subjects could choose between the following answer choices:

• Color features

• Text features (e.g. different font, etc.)

• Position of elements

• Size of elements

• Other (please specify)


In order to overcome the bias (e.g. magnetism of the middle) that can result from the order in which items are presented, the above-presented answer choices were randomized, i.e. they were displayed in a random order each time the question was shown.

Once this task was completed for one pair of images, a new pair was presented. After rating the similarity of all 20 image pairs, subjects had successfully completed task 1. Before starting with the next questions, participants could take a short break while reading the instructions for task 2. The aim of this task was to rank images according to their similarity to a reference image (i.e. to the ’original’ email image). For this purpose, subjects were now presented with two pairs of images. Each pair consisted of the ’original’ email (on the left side) and a certain client version of the email (on the right side). Instead of rating the similarity (as in task 1), subjects now had to choose the image pair where the client image on the right side was more similar to the ’original’ image on the left side. In order to see the full image, participants had to use the zoom function by clicking on the magnifier symbol in the top right corner of each image pair. The image pair where the right image was more similar to the left one could be chosen by clicking on it. The corresponding image pair was then marked with a check mark in the top left corner. Figure 2 shows the interface used for this task.

Figure 2: User Study - Task 2

After choosing the image pair that was more similar, subjects were also asked to justify their decision by choosing the most important image feature. This question was based on the same answer choices that were already used in task 1. Once the task was completed for a given set of images, a new set was presented. In total, task 2 consisted of 12 image comparison questions. In order to enable a comparison with the results from task 1, the same images have been used for both tasks. This allows a comparison of the similarity scores from task 1 and the similarity choices from task 2. Significant differences in the scores from task 1 should also be reflected in the results from task 2. After completing the two experiments, subjects were asked to fill in a questionnaire about their demographic information (age, gender, education and employment).

3.2 Data Used For User Study

The images used for the experiments come from Copernica’s marketing automation tool. In more detail, Copernica’s ’Marketing Suite’ provides clients with 11 email templates for various scenarios (e.g. newsletters, advertisements, etc.). The email previews of those templates (provided by Copernica’s software) are later needed to enable a comparison of the originally created email with corresponding email images from various applications and platforms. In this context, Copernica offers a so-called ’view document in various clients’ function. This function is based on an integration with Litmus. Litmus offers previews of what emails will look like for more than 90 different applications and devices (e.g. GMail, GMX, AppleMail, etc.). For this purpose, a test email is sent to various physical machines that run different email clients. Litmus then instructs the email clients to open the test email and takes highly optimized screenshots. These screenshots are ultimately displayed in Copernica’s marketing automation tool and help clients to see how their email will be rendered. Figure 3 shows an example result of this Litmus check.

Figure 3: Example Litmus Screenshot

After deciding on the most suitable templates, 5 images per template have been selected to be part of the user study. This was done by using two of the most popular image similarity measures: Mean Squared Error (MSE) and Structural Similarity Index (SSIM). Since the email previews capture the whole desktop, they also contain client-specific surroundings. Since these surroundings would have an effect on both the calculated similarity scores and the similarity judgments of survey participants, all images have been cropped to only contain the actual email part. After applying the similarity measures, all images of each template were ranked and segmented into 5 categories, where each category reflects a different degree of similarity. This approach ensures a comparison of a wide spectrum of similarity scores. The full ranking for each template can be found in the appendix. Table 12 shows the results for the education template; the rankings for the other templates follow in the appendix.
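As an illustration of this selection step, a sketch along the following lines could rank one template’s renderings and keep one image per similarity band; the column names and the choice of the first image per band are assumptions made for this example, not the exact procedure used.

```python
import pandas as pd

def pick_survey_images(scores: pd.DataFrame, n_categories: int = 5) -> pd.DataFrame:
    """Rank all client renderings of one template and keep one image per similarity band.

    `scores` is assumed to have one row per client rendering with a 'ssim' column.
    """
    ranked = scores.sort_values("ssim", ascending=False).reset_index(drop=True)
    # Split the SSIM values into equally populated bands (degrees of similarity).
    ranked["category"] = pd.qcut(ranked["ssim"], q=n_categories, labels=False)
    # Keep one representative per band, here simply the most similar image of each band.
    return ranked.groupby("category").head(1)
```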

3.3 Subjects

In total, 100 subjects participated in the online survey. In order to recruit this number of subjects, the crowd-sourcing Internet marketplace ’Amazon Mechanical Turk’, also called MTurk, was chosen. MTurk has become an increasingly popular tool for researchers to conduct online experiments, since it is a fast way to get responses and enables screening for participants with specific qualifications [15].

For their participation, subjects were offered a compensation of 2.50 US dollars. Since the average time for completing the survey was around 15 minutes, this compensation corresponds to an hourly rate of 10 US dollars, which is considered an ethical payment for a task like this. All subjects were required to be native English speakers and to be at least 18 years old. The native-English-speaker requirement was enforced by setting the location of participants to the United States, which was intended to avoid any language bias. However, this restriction could also have an effect on the cultural homogeneity of the sample [16]. Apart from that, no other qualifications were specified in order to reflect the general population as much as possible.

To make sure that the participants of the online survey were paying attention and were not survey-taking bots, the user study was complemented with several attention-check items. Participants who failed more than 2 of these attention checks were disqualified. By applying this procedure, 23 out of 100 survey results had to be removed. This ensures that the final findings of the project are not negatively affected by biased data.
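Such a filter is simple to express on the raw survey export. The sketch below is purely illustrative; the column names are assumptions about how the responses might be structured.

```python
import pandas as pd

def drop_inattentive(responses: pd.DataFrame, max_failures: int = 2) -> pd.DataFrame:
    """Remove all rows of subjects who failed more than `max_failures` attention checks.

    `responses` is assumed to contain one row per (subject_id, attention-check item)
    with a boolean 'passed' column.
    """
    failures = (~responses["passed"]).groupby(responses["subject_id"]).sum()
    disqualified = failures[failures > max_failures].index
    return responses[~responses["subject_id"].isin(disqualified)]
```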

In the end, 77 subjects successfully completed the user study. With regard to gender, the distribution was almost equal (38 males, 39 females). The subject population also varied in other aspects such as age (from 18-24 to above 54), employment (majority employed full-time) and education (from no degree to Master’s degree).

3.4 Data Gathered In The Experiments

By conducting the experiments described in Section 3.1, the following types of data have been collected:

• Similarity scores between original and client-specific email images obtained in task 1

• Image ranking based on similarity scores from task 1

• Acceptability votes for images in task 1

• Image votes obtained from the image comparison in task 2

• Image feature votes obtained from tasks 1 & 2

• Demographic information of subjects

With regards to the similarity scores obtained from task 1, the decision was made to offer participants a slide bar for rating the similarity. This question type was used because traditional Likert scales produce ordinal data (i.e. numbers that have an order, but the differences between the numbers are not all equal). There have been many debates about what types of analyses are suitable for different kinds of scales. One of the earliest and most influential papers in this area comes from Stevens [17], who writes about the four levels of measurement. Stevens argues that certain kinds of calculations (e.g. mean calculations) are not permissible on anything less than interval data (i.e. equal distances between ordinal numbers). This restriction is problematic for many academics and researchers, because Likert scales (i.e. ordinal data) are a common rating format for all kinds of user studies, and since a majority of statistical tests use means and standard deviations in their calculations, these methods cannot be applied to data from Likert scales.

More recently, researchers have started to treat ordinal variables with five or more categories as continuous and showed that this approach did not harm their analyses [18–21]. In such cases, the corresponding variable is often denoted as an ’ordinal approximation of a continuous variable’. On the other hand, Elaine Allen and Christopher Seaman [22] support the traditional opinion that Likert scale data should not be treated as interval data, but also suggest that this ’ordinal-data problem’ can be solved by using slider bars to get responses on a continuous scale. In order to enable an appropriate data analysis, participants have therefore been able to indicate the similarity between images using such a slider between two extremes (least and highest similarity).

4 RESULTS

In the following section, the results obtained from the user study are presented. After analyzing the results from task 1, task 2 is evaluated and compared with task 1. Finally, the results from the user study are compared with a range of image processing algorithms.

4.1 Similarity Scores

By analyzing the similarity scores, it is investigated whether there is a common way for assessing the similarity of almost identical images (RQ1). Figure 4 presents the distribution of the similarity scores for all of the 20 tested images. For each of the 4 email templates used, the similarity of 5 images from different email clients had to be rated.

Figure 4: Distribution - Similarity Scores

The purpose of these boxplots is to give a first overview of the data. Since each template is inspected in a separate section in more detail, the labels on the x-axis have been removed from this plot.

As can be seen from Figure 4, 3 images in particular appear to have a significantly lower average similarity score. Besides that, the boxplots also show that there is a wide spread of similarity scores for each image. For several images the boxplots also show observations that lie an ’abnormal’ distance (so-called outliers) from other values in the sample. However, this wide distribution of similarity scores can be explained by the existence of different rating strategies, as also noted by Tirilly et al. [16]. Since rating the similarity is a very subjective task, it is likely that participants of the survey also use different rating scales with different mean values and different ranges. In many cases, the first image has a great influence on the rating of the subsequent images as it sets a reference score. This reference score can limit the scale that is available for all subsequent ratings. The following histogram (Figure 5) shows that the average similarity rating indeed differs among subjects and acts as evidence for the above-stated assumption.

Figure 5: Average Similarity Rating

It is possible to counteract (or at least reduce) the above-described subjectivity of rating scales by applying an appropriate subject-dependent normalization [16]. The formula for this kind of normalization looks as follows:

$NS_{si} = \frac{S_{si} - \mu(S_s)}{\sigma(S_s)}$   (1)

The normalized score $NS_{si}$ assigned to a specific image $i$ by a certain subject $s$ is calculated by first subtracting the mean score $\mu(S_s)$ (over all images rated by the corresponding subject) from the initial score $S_{si}$ and then dividing by the standard deviation $\sigma(S_s)$ of the scores assigned by the subject.
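In code, this subject-dependent normalization amounts to a z-score within each subject’s own ratings. The sketch below assumes a long-format table with one similarity score per (subject, image) pair; the column names are illustrative.

```python
import pandas as pd

def normalize_per_subject(scores: pd.DataFrame) -> pd.DataFrame:
    """Apply Equation (1): z-score every rating within its subject's own ratings."""
    grouped = scores.groupby("subject")["score"]
    out = scores.copy()
    # NS_si = (S_si - mean of subject s) / std of subject s
    out["normalized"] = (out["score"] - grouped.transform("mean")) / grouped.transform("std")
    return out
```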

Calculating normalized scores using this formula centers the distribution of scores assigned by a specific subject on that subject’s mean score. Furthermore, the distribution is also scaled according to the range of scores assigned by the subject. This helps to reduce the effects of subjective rating scales, while the shape of the score distribution per subject is preserved. After applying the subject-dependent normalization to the dataset, the distribution of the similarity scores for each email client looks as follows:

Figure 6: Distribution - Similarity Scores (normalized)

As can be seen from Figure 6, the range of the rating scale has been considerably reduced and the distribution of the normalized scores looks closer to a normal distribution. However, for the majority of images, the Shapiro-Wilk test shows a p-value clearly below 0.05. The null-hypothesis of this test is that the population is normally distributed. Based on the low p-values, this hypothesis can be rejected for most of the images. Figure 6 also illustrates that normalizing the data has increased the number of outliers. The resulting skewed distribution is likely the reason for this non-normality. For this reason, extreme outliers have been removed and the Shapiro-Wilk test was applied again. This time, most of the images have a p-value considerably higher than 0.05, which means that the null-hypothesis (normal distribution) cannot be rejected. The corresponding p-values are plotted in the following scatterplot:

Figure 7: pValues before and after Normalization

As can be seen from Figure 7, for most of the images the null-hypothesis of the Shapiro-Wilk test cannot be rejected anymore after removing outliers. A table showing all p-values before and after removing extreme data points can be found in the appendix in Table 16.
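The normality check itself is a single SciPy call. The following sketch runs it before and after removing extreme outliers; since the exact outlier rule is not stated in the text, the 3-IQR convention used here is only an assumption.

```python
import numpy as np
from scipy import stats

def shapiro_before_after(values: np.ndarray, k: float = 3.0):
    """Shapiro-Wilk p-value on the raw scores and after dropping extreme outliers."""
    p_before = stats.shapiro(values).pvalue
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Assumed rule: 'extreme' means more than k * IQR outside the quartiles.
    kept = values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]
    p_after = stats.shapiro(kept).pvalue
    return p_before, p_after
```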

In the literature, there is much debate on how best to deal with outliers in data. In most cases, the options discussed are to keep outliers, to remove them, or to change them to another value [23, 24]. Many statistical tests and/or parametric statistics (e.g. means, standard deviations, etc.) are highly sensitive to outliers. However, it is not acceptable to drop observations just because they are different from other ones. Often those observations are legitimate and the most interesting ones. For this reason, extreme outliers have not been removed from the data, but are inspected in more detail on a per-template level. If a template contains significant outliers, robust statistical methods are applied.

4.1.1 Template Education. In this section, the results of the first email template are analyzed in more detail. The analysis of the remaining 3 templates follows in the subsequent sections. Figure 8 shows the distribution of the normalized similarity scores for each email image of the corresponding template.

Figure 8: Distribution of Similarity Scores (normalized)

As can be seen from these boxplots, one email image has received a significantly lower rating. The other images seem to have similar ratings; only the score of ’applemail10’ appears to be not quite as positive. Taking a closer look at the actual images, particularly the ’freenet_explorer’ image has some clear deviations from the original email created with Copernica’s marketing tool. With regards to the image captured from ’applemail10’, there are some obvious differences in the displayed font. As already mentioned in section 4.1, for most images there are quite a few observations that lie at an ’abnormal’ distance from the rest of the observed data points (i.e. outliers). The Shapiro-Wilk test (Table 16) showed that these outliers have a negative influence on the normality of the underlying distribution. However, after inspecting the corresponding outliers in more detail, removing them seems like too naive an approach. First, removing outliers would lead to the elimination of 17 participants; real outliers should be rare and should not occur that often. Second, and most importantly, rating the similarity of images is a subjective task and the judgment of subjects in this matter can be quite different. For example, some participants of the user study might pay more attention to details like the font, while others focus more on the position of elements. Removing those data points would most likely alter the final findings of this user study and lead to wrong conclusions. In addition, attention checks have already been used to identify participants that did not focus on the rating task before analyzing the data, so misleading data points have already been removed in advance.

Based on the explanation given above, all further analyses (for each of the four email templates) are based on the results of all 77 subjects. Table 1 shows the average similarity score of each tested email client for the education template:

Mail Client         Similarity Score
aol_firefox         0.770514
gmx_firefox         0.681778
yahoo_firefox       0.586369
applemail10         0.260014
freenet_explorer    -2.004741

Table 1: Average Similarity Scores

Looking at the similarity scores, there seems to be a clear hierarchy between the images. While the difference between each of the first three images (aol_firefox, gmx_firefox, yahoo_firefox) is only around 0.1, the scores for ’applemail10’ and ’freenet_explorer’ are considerably lower.

In the next step, it is investigated whether there is a ’real’ difference (in terms of similarity) between the images or whether the difference appeared by chance in the sample selection. A paired t-test can determine whether the mean difference between paired observations is significant. In this context, observations can be paired because the same subjects had to rate all the different images. The paired t-test does not assume that observations within a group are normally distributed, only that the differences between pairs follow a normal distribution. However, after calculating the differences for each possible combination of image pairs, the Shapiro-Wilk test shows a p-value below 0.05 for most pairs. Therefore, the assumption of normality is not fulfilled and a paired t-test is not applicable. Instead, the Wilcoxon signed-rank test can be used as a non-parametric alternative to the paired t-test. However, conducting multiple comparisons goes along with the so-called multiplicity effect [25]. Not accounting for the combination of pairwise comparisons can lead to a loss of control over the so-called familywise error rate. Therefore, it is common procedure to first perform a test suited to multiple comparisons, followed by an appropriate post-hoc analysis to compare the image pairs. For this reason, first the Friedman test has to be performed. This test informs about the presence of differences among all groups, but gives no hint about the groups where the differences occur. If the overall Friedman test is significant, a series of Wilcoxon tests can be applied to identify where the specific differences lie. However, in order to control for inflation of type I error (i.e. rejection of a true null-hypothesis), the Bonferroni correction [26] is applied. This adjustment is done by dividing the initial significance level by the number of comparisons.
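A minimal sketch of this procedure with SciPy could look as follows; the subjects-by-images array layout is an assumption about how the normalized scores are organized, not part of the thesis.

```python
from itertools import combinations
import numpy as np
from scipy import stats

def compare_images(ratings: np.ndarray, alpha: float = 0.05) -> dict:
    """Friedman omnibus test followed by Bonferroni-corrected Wilcoxon post-hoc tests.

    `ratings` has one row per subject and one column per client image.
    """
    n_images = ratings.shape[1]
    friedman_p = stats.friedmanchisquare(*[ratings[:, i] for i in range(n_images)]).pvalue
    results = {"friedman_p": friedman_p, "pairwise": {}}
    if friedman_p < alpha:
        pairs = list(combinations(range(n_images), 2))
        corrected_alpha = alpha / len(pairs)  # Bonferroni: 0.05 / 10 = 0.005 for 5 images
        for i, j in pairs:
            p = stats.wilcoxon(ratings[:, i], ratings[:, j]).pvalue
            results["pairwise"][(i, j)] = {"p": p, "significant": p < corrected_alpha}
    return results
```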

The Friedman test results in a p-value clearly below 0.05, which means the null-hypothesis (no differences between related samples) can be rejected. In order to decide which images are significantly different from each other, in the next step (post-hoc analysis) the Wilcoxon signed-rank test is applied to all possible combinations. The null-hypothesis of the Wilcoxon test is that the medians of the two groups are equal. The full results of the Wilcoxon signed-rank test can be found in Table 17 in the appendix. Looking at the p-values, one can see that the null-hypothesis has been rejected for all possible combinations with ’freenet_explorer’. Therefore there are ’real’ differences among those image pairs. The same holds for all possible combinations with ’applemail10’. However, the null-hypothesis has not been rejected for comparisons among the first three ranked images.

4.1.2 Template Event. Figure 9 shows the distribution of the normalized similarity scores for the event template.

Figure 9: Distribution of Similarity Scores (normalized)

Again, one image (gmail_explorer) has received significantly lower ratings than the others. Furthermore, it seems that many people have dissenting opinions on the similarity of the image captured from mailru_firefox. Taking a closer look at the actual images, the low rating for the image from ’gmail_explorer’ is likely due to the different size of the images used in the email. With regards to ’mailru_firefox’, the text appears bolder than in the original email image. Looking at the boxplot, it seems that this was more of a problem for some subjects than for others. To enable a better comparison, Table 2 shows the average similarity score of each tested email client.

Mail Client        Similarity Score
thunderbird        0.44763
yahoo_firefox      0.37691
mailru_firefox     0.29058
comcast_chrome     0.18429
gmail_explorer     -1.56385

Table 2: Average Similarity Scores

The screenshot from ’thunderbird’ achieved the highest score, although the average similarity scores of the images are generally close together. Only for one image (’gmail_explorer’) is there a considerable (negative) difference. In order to see if the differences are significant, first the Friedman test was conducted. Since this test results in a p-value below the significance level, the pairwise Wilcoxon signed-rank test has been applied for the post-hoc analysis. For all possible comparisons with ’gmail_explorer’, the null-hypothesis (i.e. the medians of the two groups are equal) has been rejected. However, the differences between the higher-ranked images often do not seem to be significant. Table 18 in the appendix contains the full results of this test.

4.1.3 Template News. Figure 10 shows the distribution of the normalized similarity scores for the news template.

Figure 10: Distribution of Similarity Scores (normalized)

In contrast to the previous email templates, in this case there is no email image that has received considerably worse ratings than the others. However, when taking a closer look at the boxplots, two images (’yahoo_firefox’ and ’freenet_chrome’) seem to be more similar to the originally created email. When looking at the actual images, it seems that all the images have only minor deviations from the original email. However, as already seen in previous examples, some images differ in terms of font (’outlook2016’) or size of elements (’gmail_explorer’). Table 3 shows the average similarity score of each tested email client:

Mail Client        Similarity Score
yahoo_firefox      0.54763
freenet_chrome     0.33612
comcast_chrome     0.25375
gmail_explorer     0.14867
outlook2016        -0.01959

Table 3: Average Similarity Scores

As one could already expect from the boxplots, the screenshots of ’yahoo_firefox’ and ’freenet_chrome’ have achieved the highest average similarity, while ’outlook2016’ is the only one with a score below zero. By applying the Friedman test, it was evaluated whether there are ’real’ differences between the images in general. Since the null-hypothesis could be rejected, the Wilcoxon test was used to identify the images where those differences appear. This time, there are ’real’ differences for all combinations with the highest-ranked image (’yahoo_firefox’). Furthermore, the null-hypothesis has also been rejected when comparing ’freenet_chrome’ and ’outlook2016’. No ’real’ differences have been determined among the other image pairs. All corresponding p-values can be found in Table 19 in the appendix.


4.1.4 Template Webshop. Figure 11 shows the distribution of the normalized similarity scores for the webshop template.

Figure 11: Distribution of Similarity Scores (normalized)

The first thing to notice is the boxplot for ’aol_explorer’. As opposed to the other images of this template, the image from this email client has clearly received lower similarity scores. With regards to the remaining four images, the scores for the two Outlook versions (2013 and 2016) seem quite similar. The same holds for the screenshots taken from ’mailru_chrome’ and ’office365_chrome’. Taking a look at the images from the survey, it can be seen that the elements in the image of ’aol_explorer’ differ in size and position. Furthermore, both ’outlook’ versions show a different font. It is interesting to note that the image captured from ’outlook2016’ has some more obvious deviations (a different width of a bar at the bottom of the email). However, it seems that this did not have a great influence on the similarity scores and other factors played a more important role. An explanation could also be that subjects did not always use the zoom function in the online survey. Since the difference appears at the bottom of the image, the different width of the bar may not have been visible to all participants. Table 4 shows the average similarity score of each tested email client.

Mail Client         Similarity Score
office365_chrome    0.50178
mailru_chrome       0.31111
outlook2016         0.01747
outlook2013         -0.13424
aol_explorer        -1.99219

Table 4: Average Similarity Scores

The average similarity scores from Table 4 confirm the impression gained from the boxplots. While ’aol_explorer’ has a significantly lower rating than all the other images, the two Outlook versions have also achieved only low similarity scores. With regards to the two remaining images, ’office365_chrome’ has achieved a higher average rating.

The Friedman test is used to see whether the above-described differences among the images are significant or not. Since it results in a p-value considerably below the significance level of 0.05, the null hypothesis can be rejected. In order to find out exactly where the differences appear, the Wilcoxon test is used. For the majority of image pairs, the null-hypothesis of equal medians can be rejected, since the resulting p-values lie below the adjusted significance level of 0.005 (Bonferroni correction). However, for the following two image pairs the null-hypothesis could not be rejected:

• mailru_chrome and office365_chrome

• outlook2013 and outlook2016

With p-values above the significance level, the null-hypothesis cannot be rejected for these two image pairs. This result also matches the initial analysis of the boxplots. The full results with all corresponding p-values can be found in Table 20 in the appendix.

4.2 Acceptance Ratings

In task 1, subjects also had to decide whether the rated images from different email clients are acceptable for sending out to customers or not. Analyzing this task helps to understand the relationship between similarity and acceptability for almost identical images (RQ2). For each image, the proportion of positive acceptance ratings has been calculated. Figure 12 shows the relationship between those acceptance ratings and the similarity scores.

Figure 12: Similarity Score and Acceptance Rating

The scatterplot shows a positive relationship between the two variables. As the average similarity scores increase, the acceptance rate also tends to rise. The regression line indicates a high linear correlation between the two variables. In order to measure this linear correlation, the Pearson correlation coefficient has been applied. This test gives the following result:

Correlation           p-value
0.9887985123839993    2.5320490329338187e-16

Table 5: Pearson correlation coefficient

The Pearson correlation coefficient can take values from +1 to -1. The achieved value of around 0.989 therefore indicates an extremely strong positive relationship. Furthermore, based on the low p-value, the null hypothesis (true correlation coefficient equal to 0) can be rejected. This result is not surprising, since a strong association between the two variables could be expected. More interesting is the fact that opinions about the acceptability of images are largely consistent. Looking at Figure 12, the acceptability of images is either really high or really low; there are no data points that show a significant difference of opinion. Figure 13 shows the average acceptance rating for each tested image and supports this finding.

Figure 13: Average Acceptance Rating
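The correlation test is a single SciPy call. The sketch below assumes two aligned arrays holding, for each of the 20 images, the mean similarity score and the positive-acceptance rate; the variable names are illustrative.

```python
from scipy import stats

def acceptance_similarity_correlation(mean_similarity, acceptance_rate):
    """Pearson correlation between per-image similarity scores and acceptance rates."""
    # The p-value tests the null hypothesis that the true correlation is zero.
    r, p = stats.pearsonr(mean_similarity, acceptance_rate)
    return r, p
```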

4.3 Feature Ratings

For each similarity rating, participants of the user study also had to choose the image feature that had the greatest influence on their judgment. This section elaborates on the results of this question and helps to get a better understanding of how similarity is assessed in general (RQ3). The number of votes for each answer choice is displayed in Table 6. In order to identify whether there are any differences in feature votes between acceptable and non-acceptable images, the images were also split based on their acceptance rating. For this, an arbitrary threshold value of 0.9 has been chosen: images that received more than 90 percent positive acceptance votes belong to the ’+ images’ column, while images with fewer acceptance votes can be found in the ’- images’ column.

Feature                 All images   + images   - images
Color features          129          81         48
Position of elements    409          179        230
Size of elements        308          130        178
Text features           690          456        234

Table 6: Feature votes

As can be seen from this table, color features have received the fewest votes. This holds for both acceptable and non-acceptable images (9.57% and 6.15% respectively). With regards to acceptable images (+ images), in more than half of the cases (around 53.9%) subjects voted that text features such as a different font had the biggest influence on their similarity rating. It seems that minor differences in font do not have an effect on the acceptability of the emails. However, text features also appear to be one of the main reasons for non-acceptable (- images) email images: both text features and the position of elements received around 30% of the corresponding feature votes. Therefore, it appears that the acceptance ratings are based more on the degree of difference than on certain specific features.

4.4 Image Comparison

In task 2 of the user study, subjects had to rank images according to their similarity to the originally created email. Presented with two comparisons, survey participants had to choose the image pair with the higher similarity. As opposed to the other result sections, this task does not address a new research question, but rather acts as an extension of the similarity ratings from task 1 (see section 4.1), as it enables a comparison of the results.

4.4.1 Template Education. Table 7 shows the number of votes for each of the three comparisons related to the education template.

applemail10        freenet_explorer
64                 13

gmx_firefox        applemail10
59                 18

aol_firefox        yahoo_firefox
23                 54

Table 7: Similarity Votes

In the first two comparisons, a clear majority of subjects chose the same image. This result also matches the average similarity scores in Table 1. Furthermore, the Wilcoxon test confirmed that there are ’real’ differences between those image pairs. The last comparison yielded a different result than the similarity scores would indicate. However, in this case the null-hypothesis of the Wilcoxon test (i.e. the medians of the two groups are equal) could not be rejected. As opposed to task 1, in this task subjects could compare different images from one template next to each other. As argued by Averbach and Coriell [27], the visual process of human beings involves a buffer storage that also includes an erasure mechanism. This mechanism tends to erase previously stored information when new information comes in. Therefore, this mechanism might also lead to different similarity decisions in task 1.


4.4.2 Template Event. Table 8 shows the comparison votes for the event template.

yahoo_firefox      mailru_firefox
57                 20

gmail_explorer     thunderbird
6                  71

thunderbird        comcast_chrome
58                 19

Table 8: Similarity Votes

In the first comparison, ’yahoo_firefox’ received more votes than ’mailru_firefox’. This result matches the similarity scores in Table 2. While the Wilcoxon test in task 1 could not prove that the differences are significant, in the end there might be some ’real’ differences between those two images. The remaining two comparisons confirm the results from task 1.

4.4.3 Template News. Table 9 shows the results of the comparison task for the news template.

freenet_chrome     outlook2016
52                 25

yahoo_firefox      comcast_chrome
38                 39

gmail_explorer     outlook2016
41                 25

Table 9: Similarity Votes

The first comparison reflects the results from task 1 (a higher score for freenet_chrome and significant differences). However, the second comparison yields a different result. While the similarity rating (see Table 3) was higher for yahoo_firefox in task 1, in task 2 both received almost the same number of votes. Again, this might be because of the erasure mechanism of human vision. The last comparison matches the similarity scores from task 1, although the Wilcoxon test did not prove any significant difference.

4.4.4 Template Webshop. The last comparison task was based on images from the webshop template. Table 10 shows the number of votes for each corresponding comparison.

outlook2013        outlook2016
55                 22

aol_explorer       outlook2016
16                 61

office365_chrome   mailru_chrome
56                 21

Table 10: Similarity Votes

Compared to the similarity scores obtained in task 1, the first comparison now yields a different result. As already described in section 4.1.4, the result from task 1 was a bit surprising, since the image related to ’outlook2016’ appears to have clearer deviations than the one related to ’outlook2013’. In task 2, subjects now had to use the zoom function and were more likely to see the different width of the bar at the bottom of the email image. The second and third comparisons reflect the similarity scores obtained from task 1 (see Table 4). However, the Wilcoxon test has only proven ’real’ differences for the second comparison.

4.5 Image Similarity Metrics

In contrast to the previous sections, this part of the analysis focuses on evaluating image similarity by comparing the subjective ratings from the user study with a range of objective image similarity algorithms (MSE, NRMSE, PSNR, SSIM). This makes it possible to examine how well objective similarity methods reflect actual human perception of similarity (RQ4). Table 11 shows an overview of the achieved similarity scores.

mail client                  mse        nrmse    psnr     ssim
event_yahoo_firefox          79.7069    0.0497   29.1158  0.9583
education_yahoo_firefox      342.8915   0.0869   22.7792  0.9400
news_yahoo_firefox           209.9460   0.0641   24.9097  0.9096
education_gmx_firefox        587.1901   0.1137   20.4430  0.8504
webshop_office365_chrome     230.7903   0.0683   24.4986  0.8182
event_thunderbird            460.9372   0.1195   21.4944  0.7857
education_aol_firefox        889.8694   0.1400   18.6375  0.7805
webshop_outlook2013          393.8347   0.0893   22.1777  0.7268
news_freenet_chrome          1221.6462  0.1547   17.2613  0.7067
education_applemail10        1156.0544  0.1596   17.5010  0.6666
webshop_mailru_chrome        570.7548   0.1075   20.5663  0.6199
news_comcast_chrome          1342.8066  0.1622   16.8507  0.5899
event_mailru_firefox         1182.7716  0.1914   17.4018  0.5324
webshop_outlook2016          1204.6778  0.1561   17.3221  0.4409
news_outlook2016_windows     1948.1473  0.1954   15.2346  0.4159
news_gmail_explorer          2465.7896  0.2198   14.2112  0.3981
webshop_aol_explorer         2231.2882  0.2125   14.6452  0.3460
event_comcast_chrome         1750.9812  0.2329   15.6980  0.3421
education_freenet_explorer   3853.4483  0.2913   12.2723  0.2975
event_gmail_explorer         4502.0455  0.3735   11.5967  0.1649

Table 11: Results of Objective Methods


It can be seen from Table 11 that different image processing algorithms produce different similarity scores. The table is sorted by the ’ssim’ column from the highest to the lowest similarity score. In general, the applied image processing algorithms seem to deliver broadly similar results. However, when looking more closely at the individual scores, one can see that the ranking of the images would differ for each algorithm. In order to find the algorithm that reflects the subjective results best, the scores are now linked to the results from the user study. The comparison of these scores can be found in Table 21 in the appendix. This table is sorted by the acceptance rate in order to give a better overview of the relationship between acceptance rate and the similarity scores.

To find the strongest relationship between subjective and objective scores, the Pearson correlation coefficient has been applied to both the subjective similarity scores and the acceptance ratings. In both cases, ’MSE’ achieved a correlation coefficient higher than 0.81. While the Pearson correlation shows similar results for ’SSIM’ and ’NRMSE’ (between 0.70 and 0.74), ’PSNR’ only achieved around 0.63 in both tests. Based on this relationship, in the next step it is investigated whether a certain threshold value for the acceptability of emails can be determined. Comparing the scores in Table 21, one can see that images with a bad acceptance rating are rather easy for the algorithms to detect. Furthermore, it looks like there is a significant increase in similarity as soon as an acceptance rate of 0.95 has been achieved. This holds for both subjective and objective similarity scores. Therefore, for the further analysis, all images above an acceptance rate of 0.95 are classified as acceptable. In this context, a high rate makes sense as it helps reduce the risk of accepting images that are too different from the original one. In the case of uncertainty, it is better to investigate an image in more detail instead of automatically accepting it.

With regards to the Mean Squared Error (MSE), the best threshold value (based on the available data) would be around a score of 900. By setting the value at this level, everything below it could be automatically determined as acceptable, while images that get a score above this threshold should be further investigated. However, looking at the scores, this would also cause the false classification of two images. The similarity scores obtained from ’SSIM’ seem more appropriate for defining a threshold value: by setting the value at 0.75, all images are classified correctly.
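The resulting decision rule can be stated in a few lines. The sketch below only illustrates the rule derived from this dataset; the 0.75 threshold is specific to these images and not a general-purpose value.

```python
def is_acceptable(ssim_score: float, threshold: float = 0.75) -> bool:
    """Accept a client rendering automatically if its SSIM to the original
    email is at least the threshold; otherwise flag it for manual inspection."""
    return ssim_score >= threshold

# Example with scores from Table 11: is_acceptable(0.9583) -> True, is_acceptable(0.3460) -> False
```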

5 CONCLUSION

In this paper, several experiments have been performed. By conducting an online user study, it has been investigated how humans assess the similarity of images that have only minor differences. In more detail, the aim of this task was to find out whether there is a common way for humans to judge the similarity of email renderings. Since the images of emails from different clients are often almost identical and only have small variations, they were well suited for this task. The ultimate goal of the user study was to determine whether images are similar enough (i.e. acceptable) to a reference image.

First, it has to be noted that rating the similarity (especially in this use-case) is a highly subjective task. This subjectivity is also reflected in the results of the user study. Similarity ratings of images often showed a wide spread of scores. However, by applying an appropriate subject-dependent normalization, the effects of these subjective rating scales could be significantly reduced. In the end, significant differences between similarity scores have been detected for about half of the images. The many non-significant differences can be explained by the high similarity between the images. Furthermore, by letting survey participants vote on image features, it was also shown that subjects often pay attention to different aspects of an image.

Since the above-described subjective evaluation is very time-consuming and expensive in practice, the ultimate goal of this thesis was to come up with a threshold value that reflects human perception of acceptability in terms of image similarity. Therefore, subjects in the user study also had to vote on whether an image is acceptable or not. After finding a strong relationship between acceptance ratings and subjective similarity scores, different image processing algorithms have been applied. In this context, rather simple algorithms have been used, because they are not only computationally cheap but also deliver good results for the purpose of comparing almost identical images.

By comparing the scores obtained from the image processing algorithms with the scores from the subjective ratings, an attempt was made to find natural threshold values in the data that reflect the actual human perception of acceptability. Although ’Mean Squared Error’ (MSE) showed the strongest relationship with the subjective similarity and acceptability scores, the ’Structural Similarity Index’ (SSIM) appeared to be the best choice for this task. Defining acceptability as an acceptance rate of at least 0.95, all images could be classified correctly using ’SSIM’.

REFERENCES

[1] Zhou Wang and Hamid R. Sheikh. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4), 2004.

[2] Chad S. White. Why is email rendering so complex?, April 2017.

[3] Q. Liu, X. Y. Jing, R. M. Hu, Y. F. Yao, and J. Y. Yang. Similarity preserving analysis based on sparse representation for image feature extraction and classification. In 2013 IEEE International Conference on Image Processing, pages 3013–3016, September 2013.

[4] Richard Connor, Stewart MacKenzie-Leigh, Franco Alberto Cardillo, and Robert Moss. Identification of MIR-Flickr Near-duplicate Images - A Benchmark Collection for Near-duplicate Detection. Pages 565–571. SCITEPRESS - Science and Technology Publications, 2015.

[5] A. Bamidele and F. W. M. Stentiford. An attention based similarity measure used to identify image clusters. In The 2nd European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, 2005. EWIMT 2005. (Ref. No. 2005/11099), pages 67–71, November 2005.

[6] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, August 2002.

[7] Bernice E. Rogowitz, Thrasyvoulos N. Pappas, and Jan P. Allebach. Building bridges between human vision and electronic imaging: a ten-year retrospective. In Human Vision and Electronic Imaging III, volume 3299, pages 2–16. International Society for Optics and Photonics, July 1998.

[8] L. Hertel, E. Barth, T. Käster, and T. Martinetz. Deep convolutional neural networks as generic feature extractors. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4, July 2015.

[9] Yusra A. Y. Al-Najjar and Der Chen Soong. Comparison of Image Quality Assessment: PSNR, HVS, SSIM, UIQI. 3(8):5, 2012.

[10] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos. The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1):20–37, January 2000.

[11] Thomas Frese, Charles A. Bouman, and Jan P. Allebach. A Methodology for Designing Image Similarity Metrics Based on Human Visual System Models. page 13.

[12] Dirk Neumann and Karl R. Gegenfurtner. Image Retrieval and Perceptual Similarity. ACM Trans. Appl. Percept., 3(1):31–47, January 2006.

[13] İsmail Avcıbaş, Bülent Sankur, and Khalid Sayood. Statistical Evaluation of Image Quality Measures. Journal of Electronic Imaging, 11:206–223, 2002.

[14] M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey. Complex Wavelet Structural Similarity: A New Image Similarity Index. IEEE Transactions on Image Processing, 18(11):2385–2401, November 2009.

[15] Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science, 6(1):3–5, January 2011.

[16] Pierre Tirilly, Xiangming Mu, Chunsheng Huang, Iris Xie, Wooseob Jeong, and Jin Zhang. Image similarity as assessed by users: A quantitative study. Proceedings of the American Society for Information Science and Technology, 49(1):1–10.

[17] S. S. Stevens. On the Theory of Scales of Measurement. Science, 103(2684):677–680, June 1946.

[18] Gail M. Sullivan and Anthony R. Artino. Analyzing and Interpreting Data From Likert-Type Scales. Journal of Graduate Medical Education, 5(4):541–542, December 2013.

[19] Bruno D. Zumbo and Donald W. Zimmerman. Is the selection of statistical methods governed by level of measurement? Canadian Psychology/Psychologie canadienne, 34(4):390–400, 1993.

[20] Geoff Norman. Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education, 15(5):625–632, December 2010.

[21] David R. Johnson and James C. Creech. Ordinal measures in multiple indicator models: A simulation study of categorization error. American Sociological Review, 48(3):398–407, 1983.

[22] Elaine Allen and Christopher A. Seaman. Statistics Roundtable: Likert Scales and Data Analyses.

[23] R. (Richard) Lowry. Concepts and Applications of Inferential Statistics. http://vassarstats.net/textbook/, April 2014.

[24] Thomas V. Pollet and Leander van der Meij. To Remove or not to Remove: the Impact of Outlier Handling on Significance Testing in Testosterone Data. Adaptive Human Behavior and Physiology, 3(1):43–60, March 2017.

[25] Alexis Dinno. Nonparametric Pairwise Multiple Comparisons in Independent Groups Using Dunn’s Test. page 12.

[26] Robert J. Cabin and Randall J. Mitchell. To Bonferroni or Not to Bonferroni: When and How Are the Questions. Bulletin of the Ecological Society of America, 81(3):246–248, 2000.

[27] E. Averbach and A. S. Coriell. Short-Term Memory in Vision. Bell System Technical Journal, 40(1):309–328, 1961.


6 APPENDIX

rank  platform  ssim  mse
0  original  0.9471  166.7
1  yahoo_firefox  0.9400  342.9
2  freenet_firefox  0.9215  395.7
3  thunderbird  0.8734  414.6
4  webde_chrome  0.8553  516.0
5  gmx_firefox  0.8504  587.2
6  outlook2013  0.8418  506.5
7  aol_chrome  0.8415  580.8
8  gmail_firefox  0.8373  612.5
9  inbox_chrome  0.8349  621.5
10  gmail_chrome  0.8335  631.1
11  outlook2007  0.8252  578.6
12  outlook_chrome  0.8232  678.9
13  outlook2016  0.8224  432.4
14  freenet_chrome  0.8162  591.0
15  yahoo_chrome  0.8160  589.4
16  windows10  0.8156  564.2
17  inbox_firefox  0.8052  757.8
18  office365_chrome  0.8024  759.8
19  aol_firefox  0.7805  889.9
20  applemail11  0.6846  1014.8
21  outlook2016_mac  0.6818  993.1
22  applemail10  0.6666  1156.1
23  mailru_chrome  0.4910  3489.3
24  mailru_firefox  0.4711  1716.5
25  freenet_explorer  0.2975  3853.4
26  webde_explorer  0.2786  3929.9
27  gmail_explorer  0.2748  4075.5
28  office365_explorer  0.2734  4084.1
29  gmx_explorer  0.2721  4120.2
30  mailru_explorer  0.2654  4205.2
31  aol_explorer  0.2271  4312.1

Table 12: Ranking - Template Education
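
The ssim and mse values in Tables 12 to 15 compare each platform screenshot against the reference rendering. The following sketch shows how such scores can be computed with scikit-image; the file paths, the grayscale conversion and the resizing step are illustrative assumptions and not necessarily the exact pipeline used in this thesis.

    from skimage.io import imread
    from skimage.metrics import mean_squared_error, structural_similarity
    from skimage.transform import resize

    def compare_to_reference(reference_path, candidate_path):
        """Return (ssim, mse) for one platform rendering against the reference."""
        reference = imread(reference_path, as_gray=True)   # grayscale, float in [0, 1]
        candidate = imread(candidate_path, as_gray=True)
        # Renderings rarely share the exact pixel dimensions of the reference,
        # so the candidate is rescaled before comparison (an illustrative choice).
        candidate = resize(candidate, reference.shape, anti_aliasing=True)
        reference, candidate = reference * 255.0, candidate * 255.0  # 0-255 scale
        ssim = structural_similarity(reference, candidate, data_range=255)
        mse = mean_squared_error(reference, candidate)
        return ssim, mse

    # Example (hypothetical file names):
    # ssim, mse = compare_to_reference("education/original.png", "education/yahoo_firefox.png")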


rank  platform  ssim  mse
0  original  0.9900  20.7
1  yahoo_firefox  0.9583  79.7
2  yahoo_chrome  0.9086  162.4
3  comcast_firefox  0.8991  233.2
4  freenet_chrome  0.8979  208.0
5  applemail10  0.8816  226.2
6  inbox_firefox  0.8795  219.2
7  gmail_firefox  0.8685  270.6
8  webde_firefox  0.8643  283.2
9  gmx_firefox  0.8488  321.3
10  applemail11  0.8409  341.0
11  gmail_chrome  0.8204  375.8
12  windows10  0.8091  525.0
13  inbox_chrome  0.7883  399.3
14  aol_firefox  0.7867  502.5
15  thunderbird  0.7857  460.9
16  aol_chrome  0.7524  479.9
17  freenet_firefox  0.7448  566.5
18  outlook_firefox  0.6326  984.2
19  outlook2003  0.6253  1574.9
20  office365_chrome  0.6058  968.7
21  tonline_firefox  0.5986  1286.2
22  outlook_chrome  0.5884  1129.0
23  mailru_firefox  0.5324  1182.8
24  mailru_chrome  0.5025  1438.2
25  outlook2007  0.3973  1660.3
26  outlook2013  0.3552  1717.4
27  comcast_chrome  0.3421  1751.0
28  outlook_explorer  0.2493  3356.6
29  comcast_explorer  0.2417  3314.7
30  office365_explorer  0.2075  4311.7
31  tonline_explorer  0.1868  4420.0
32  freenet_explorer  0.1848  4463.4
33  gmail_explorer  0.1649  4502.0
34  gmx_explorer  0.1643  4535.6
35  webde_explorer  0.1631  4523.9
36  notes8  0.1484  4593.6
37  aol_explorer  0.1369  4109.1
38  mailru_explorer  0.1329  4159.2

Table 13: Ranking - Template Event


rank  platform  ssim  mse
0  original  0.9484  146.1
1  yahoo_firefox  0.9096  209.9
2  webde_firefox  0.8983  290.9
3  gmx_firefox  0.8740  357.3
4  gmail_chrome  0.8538  332.6
5  gmail_firefox  0.8492  431.3
6  webde_chrome  0.8486  304.6
7  aol_firefox  0.8460  417.9
8  inbox_firefox  0.8121  681.4
9  freenet_firefox  0.7861  821.1
10  inbox_chrome  0.7661  577.0
11  aol_chrome  0.7638  762.1
12  applemail11  0.7200  949.8
13  freenet_chrome  0.7067  1221.6
14  office365_firefox  0.7012  1204.0
15  outlook_firefox  0.6976  1209.1
16  office365_chrome  0.6970  1227.6
17  outlook2003  0.6762  1032.3
18  applemail10  0.6696  1339.1
19  yahoo_chrome  0.6402  1035.9
20  outlook_chrome  0.6063  1347.9
21  comcast_chrome  0.5899  1342.8
22  outlook_explorer  0.5803  2452.9
23  office365_explorer  0.5736  2449.9
24  comcast_firefox  0.5515  1306.7
25  tonline_firefox  0.5209  1579.0
26  notes8  0.4763  1682.4
27  notes9  0.4670  1807.5
28  mailru_firefox  0.4509  1686.3
29  windows10  0.4498  1912.5
30  mailru_chrome  0.4423  1629.6
31  outlook2016_windows  0.4159  1948.1
32  webde_explorer  0.4005  2413.0
33  outlook2013  0.3981  1772.3
34  gmail_explorer  0.3981  2465.8
35  mailru_explorer  0.3962  1921.5
36  outlook2007  0.3936  1730.3
37  gmx_explorer  0.3830  2516.8
38  comcast_explorer  0.3769  2638.2
39  freenet_explorer  0.3637  2516.0

Table 14: Ranking - Template News


rank  platform  ssim  mse
0  original  0.9664  52.7
1  office365_chrome  0.8182  230.8
2  outlook_chrome  0.8038  262.7
3  outlook2016_mac  0.7614  321.0
4  windows10  0.7566  306.2
5  outlook2017  0.7497  392.3
6  outlook2013  0.7268  393.8
7  freenet_firefox  0.6901  413.8
8  thunderbird  0.6828  419.5
9  yahoo_firefox  0.6361  663.5
10  mailru_chrome  0.6199  570.8
11  gmail_firefox  0.6062  675.5
12  aol_firefox  0.6017  703.5
13  mailru_firefox  0.5940  686.8
14  applemail10  0.5683  777.8
15  applemail11  0.5681  811.8
16  aol_chrome  0.5658  733.0
17  inbox_chrome  0.5513  863.5
18  inbox_firefox  0.5476  878.5
19  gmail_chrome  0.5333  770.5
20  outlook2016  0.4409  1204.7
21  aol_explorer  0.3460  2231.3
22  gmail_explorer  0.3458  2207.8
23  gmx_explorer  0.3344  2125.0
24  webde_explorer  0.3050  2237.9
25  office365_explorer  0.2925  2172.5
26  mailru_explorer  0.2784  2141.0
27  freenet_explorer  0.2515  2291.6

Table 15: Ranking - Template Webshop


score  p-value  p-value (outliers removed)
score_education_aol_firefox  5.39e-05  0.0393
score_education_applemail10  0.0003  0.2867
score_education_gmx_firefox  0.0124  0.4827
score_education_yahoo_firefox  0.4389  0.0130
score_education_freenet_explorer  0.0005  0.7543
score_event_mailru_firefox  0.1781  0.3941
score_event_yahoo_firefox  0.0366  0.0315
score_event_thunderbird  0.0001  0.0797
score_event_gmail_explorer  0.0457  0.3028
score_event_comcast_chrome  0.0007  0.0597
score_news_comcast_chrome  0.0309  0.2767
score_news_yahoo_firefox  0.0007  0.1792
score_news_gmail_explorer  0.0559  0.1574
score_news_outlook2016  0.0606  0.4396
score_news_freenet_chrome  0.0066  0.1346
score_webshop_mailru_chrome  0.0631  0.4212
score_webshop_aol_explorer  0.6204  0.3091
score_webshop_outlook2013  0.2339  0.0102
score_webshop_office365_chrome  0.0073  0.0315
score_webshop_outlook2016  0.8465  0.7142

Table 16: pValues before and after Normalization

comparison  statistic  p-value
score_education_aol_firefox vs score_education_freenet_explorer  0.0  2.46e-14
score_education_gmx_firefox vs score_education_freenet_explorer  0.0  3.61e-14
score_education_yahoo_firefox vs score_education_freenet_explorer  0.0  7.73e-14
score_education_applemail10 vs score_education_freenet_explorer  6.0  9.88e-14
score_education_aol_firefox vs score_education_applemail10  97.0  4.75e-07
score_education_applemail10 vs score_education_gmx_firefox  180.0  1.69e-05
score_education_applemail10 vs score_education_yahoo_firefox  139.0  0.0003
score_education_aol_firefox vs score_education_yahoo_firefox  233.0  0.0105
score_education_gmx_firefox vs score_education_yahoo_firefox  341.0  0.1110
score_education_aol_firefox vs score_education_gmx_firefox  336.5  0.3232

Table 17: Wilcoxon Test - Education Template

comparison  statistic  p-value
score_event_gmail_explorer vs score_event_comcast_chrome  21.0  8.28e-14
score_event_yahoo_firefox vs score_event_gmail_explorer  8.0  1.07e-13
score_event_mailru_firefox vs score_event_gmail_explorer  0.0  1.13e-13
score_event_thunderbird vs score_event_gmail_explorer  30.0  1.18e-13
score_event_thunderbird vs score_event_comcast_chrome  272.0  0.0020
score_event_yahoo_firefox vs score_event_comcast_chrome  247.0  0.0064
score_event_mailru_firefox vs score_event_thunderbird  606.0  0.1694
score_event_mailru_firefox vs score_event_yahoo_firefox  405.0  0.2041
score_event_yahoo_firefox vs score_event_thunderbird  410.0  0.4468
score_event_mailru_firefox vs score_event_comcast_chrome  495.5  0.4685

Table 18: Wilcoxon Test - Event Template


comparison  statistic  p-value
score_news_yahoo_firefox vs score_news_outlook2016  124.0  4.37e-07
score_news_outlook2016 vs score_news_freenet_chrome  77.0  4.63e-06
score_news_yahoo_firefox vs score_news_gmail_explorer  213.5  0.0001
score_news_comcast_chrome vs score_news_yahoo_firefox  203.0  0.0007
score_news_yahoo_firefox vs score_news_freenet_chrome  236.0  0.0042
score_news_comcast_chrome vs score_news_outlook2016  417.0  0.0132
score_news_gmail_explorer vs score_news_freenet_chrome  287.5  0.0155
score_news_gmail_explorer vs score_news_outlook2016  401.0  0.0845
score_news_comcast_chrome vs score_news_gmail_explorer  570.0  0.2785
score_news_comcast_chrome vs score_news_freenet_chrome  495.0  0.4653

Table 19: Wilcoxon Test - News Template

comparison  statistic  p-value
score_webshop_aol_explorer vs score_webshop_office365_chrome  0.0  3.61e-14
score_webshop_aol_explorer vs score_webshop_outlook2016  0.0  3.61e-14
score_webshop_mailru_chrome vs score_webshop_aol_explorer  0.0  3.61e-14
score_webshop_aol_explorer vs score_webshop_outlook2013  0.0  5.28e-14
score_webshop_outlook2013 vs score_webshop_office365_chrome  99.0  1.89e-08
score_webshop_mailru_chrome vs score_webshop_outlook2013  211.0  1.01e-06
score_webshop_office365_chrome vs score_webshop_outlook2016  242.0  2.03e-06
score_webshop_mailru_chrome vs score_webshop_outlook2016  493.0  0.0031
score_webshop_mailru_chrome vs score_webshop_office365_chrome  398.5  0.0333
score_webshop_outlook2013 vs score_webshop_outlook2016  399.0  0.1221

Table 20: Wilcoxon Test - Webshop Template
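
Tables 17 to 20 list pairwise Wilcoxon signed-rank tests on the per-platform user scores of each template. Comparable statistics and p-values can be obtained with scipy; the sketch below assumes the ratings are available as paired lists of equal length per platform, and the variable layout is hypothetical.

    from itertools import combinations
    from scipy.stats import wilcoxon

    def pairwise_wilcoxon(scores):
        """scores: dict mapping a score column name to the list of paired user ratings."""
        results = []
        for left, right in combinations(scores, 2):
            statistic, pvalue = wilcoxon(scores[left], scores[right])
            results.append((left, right, statistic, pvalue))
        # Sort by p-value, mirroring the ordering used in Tables 17-20.
        return sorted(results, key=lambda row: row[3])

    # Example (hypothetical data):
    # pairwise_wilcoxon({"score_event_thunderbird": [...], "score_event_comcast_chrome": [...]})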

platform  acceptance rate  subjective score  mse  nrmse  psnr  ssim
education_aol_firefox  1.0  0.77  889.87  0.14  18.64  0.78
education_yahoo_firefox  0.99  0.59  342.89  0.09  22.78  0.94
education_gmx_firefox  0.97  0.68  587.19  0.11  20.44  0.85
news_yahoo_firefox  0.97  0.55  209.95  0.06  24.91  0.91
event_thunderbird  0.96  0.45  460.94  0.12  21.49  0.79
event_yahoo_firefox  0.96  0.38  79.71  0.05  29.12  0.96
webshop_office365_chrome  0.95  0.5  230.79  0.07  24.5  0.82
event_comcast_chrome  0.94  0.18  1750.98  0.23  15.7  0.34
event_mailru_firefox  0.94  0.29  1182.77  0.19  17.4  0.53
news_comcast_chrome  0.92  0.25  1342.81  0.16  16.85  0.59
news_freenet_chrome  0.92  0.34  1221.65  0.15  17.26  0.71
webshop_mailru_chrome  0.9  0.31  570.75  0.11  20.57  0.62
webshop_outlook2016  0.88  0.02  1204.68  0.16  17.32  0.44
education_applemail10  0.87  0.26  1156.05  0.16  17.5  0.67
news_gmail_explorer  0.86  0.15  2465.79  0.22  14.21  0.4
webshop_outlook2013  0.81  -0.13  393.83  0.09  22.18  0.73
event_gmail_explorer  0.17  -1.56  4502.05  0.37  11.6  0.16
webshop_aol_explorer  0.08  -1.99  2231.29  0.21  14.65  0.35
education_freenet_explorer  0.08  -2.0  3853.45  0.29  12.27  0.3

Table 21: Comparison of Subjective and Objective Methods
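
Once a threshold on the objective metric is chosen, the acceptability check itself reduces to a single comparison per rendering. The fragment below only sketches that final step; the 0.4 SSIM cut-off is a placeholder and not a threshold established in this thesis.

    def acceptable_renderings(scores, threshold=0.4):
        """scores: dict mapping a platform name to its SSIM against the reference.
        The threshold is a placeholder and has to be calibrated on user-study data."""
        return {platform: ssim >= threshold for platform, ssim in scores.items()}

    # Example (values taken from Table 21):
    # acceptable_renderings({"education_aol_firefox": 0.78, "education_freenet_explorer": 0.30})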
