

Master Thesis

Data Science

Real time linear broadcast viewer prediction and analysis

Author:

Yu Ri Tan

Supervisors:

Dr. Thomas Mensink

Dr. Rogier van der Geer


Real time linear broadcast viewer prediction and analysis

Yu Ri Tan

University of Amsterdam
yuri_tan@outlook.com

ABSTRACT

NPO (the Dutch public television broadcaster) has access to a technology called HbbTV, which allows it to gather the viewing behavior of the three main channels in the Netherlands (NPO1, NPO2 and NPO3) from roughly 200,000 televisions in real time. This opens up new possibilities for viewing behavior analysis, such as real-time insight into viewing rates, which is valuable because the official viewing rates are only published the day after. This research tries to answer the following question: is it possible to predict linear television viewing rates based on the data from HbbTV-compatible televisions?

1 INTRODUCTION

Television ratings are extremely important for today's broadcasters. Ratings are used to assess the reach and popularity of television shows, but also to measure the reach of commercials and promotional videos. In the Netherlands, these ratings are calculated by an organization called Stichting Kijkonderzoek (SKO)¹. Besides

calculating viewing rates, the SKO measures and analyses viewing behavior as well. The data is automatically transferred once a day from a selected group of households to a central computer located at GfK² in Hilversum, the Netherlands, where it is stored and

analyzed [17].

To get the ratings of a specific show, one has to wait until the next day for the SKO to publish them. In this research we use a data-driven approach to improve the analysis of viewing behavior by predicting linear viewing rates in real time. Real-time insight into viewing rates opens up many opportunities for broadcasters to anticipate specific viewer behavior, but also for new business models or even new revenue streams.

The television signal can be received analog or digitally. In 2016, 88% of Dutch households received television digitally and 36.7% of Dutch households had connected their (smart) TV directly to the internet [16]. Combining the classic form of television with additional web services is spreading slowly across Europe, and since quite recently also in the Netherlands. HbbTV (Hybrid Broadcast Broadband Television) is a global initiative, founded as an association in 2009, aimed at harmonizing the broadcast and broadband delivery of web entertainment services to consumers through connected TVs, set-top boxes and multi-screen devices³. The Nederlandse Publieke Omroep (NPO), the Dutch public television broadcaster, has been able to use the HbbTV technology since March 2017.

¹ https://kijkonderzoek.nl/
² http://www.gfk.com/nl/
³ https://www.hbbtv.org/

Thanks to the HbbTV technology, the NPO is able to get real-time insights from roughly 200,000 (smart) TVs in the Netherlands. In this research, several models are introduced to improve real-time viewing rate predictions, using the new data gathered through the HbbTV technology.

When the term viewing rate is used in this research, it refers to the linear television viewing rate. Linear television is watching a TV program at a scheduled time and on the particular channel on which it is offered. An important distinction is the difference between linear television and linear streaming. As the name suggests, the latter service is also linear but uses streaming technology to deliver the content to a device such as a tablet or computer. This research focuses on predicting linear television viewing rates.

The goal of this research is to find out how accurately the linear viewing rates can be predicted using data from the current sample of HbbTV-compatible televisions. To that end, different models are used to solve this regression problem. Another element covered in this research is the method used by the SKO to calculate the national viewing rates in the Netherlands, which is described in section 3.1. A more detailed view of the data used in this research is given in section 3.2. Section 3.3 explains which regression models are used and how they are implemented. The experimental setup is described in section 3.4, and the experiments and results are covered in section 4. Finally, conclusions are drawn and some interesting points for possible improvement and future research are discussed.

2 RELATED WORK

The prediction of continuous values is typically done using regression. The prediction is based on several variables, called independent variables, that are used to predict the dependent variable (in this case: the official viewing rate). A regression analysis is a process to estimate the relationships between the independent and dependent variables. Linear models are often used in practice due to their simplicity. The simplest regression model is the (multiple) linear regression model, which assumes that there is a linear relationship between one or more independent variables and the dependent variable. Many linear regression variants were introduced long ago, such as the Moving Average (MA) and Auto-Regressive Moving Average (ARMA) models, the latter introduced in 1951 by Peter Whittle [21]. Many linear regression models are still used in time series forecasting, for example in forecasting the energy consumption of a household [9].

To improve the prediction accuracy of regression models, ensemble methods are often used. These ensemble methods often use trees that split the data based on (different combinations of) variables and the number of observations in each subset. Examples of ensemble


methods are boosting (introduced by Freund and Schapire in 1995 [8]) and bagging (introduced by Breiman in 1996 [3]). Both methods manipulate the training data to create different training instances, after which all instances are combined to predict in a voting format. Bagging creates sub-samples of the training set for each instance, and all instances get an equal vote in the final prediction. Boosting, on the other hand, uses all data for every instance, but adjusts the weights of the instances to influence the importance of each instance's vote. Popular and successful ensemble methods are XGBoost [5] (boosting) and the random forest [4] (bagging). Artificial neural networks (ANNs) are another type of machine learning model that has found acceptance in many disciplines for modeling complex real-world situations, such as regression tasks [1]. ANNs are adaptive, distributed computational models inspired by nature. These neural nets consist of neurons that are interconnected by links with associated coefficients [10]. The ANN model gained popularity for regression tasks because of its ability to capture subtle useful relationships in the data, even when the underlying relationships are unknown or hard to find [20, 22].
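As an illustration of the two ensemble styles contrasted above, the sketch below fits a bagging-based and a boosting-based regressor from scikit-learn on synthetic data; the data set and parameter values are illustrative only and are not taken from this research.

```python
# Minimal sketch contrasting a bagging ensemble (random forest) with a
# boosting ensemble (gradient boosting) on synthetic regression data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(size=(2000, 5))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.3, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: every tree sees a bootstrap sample, predictions are averaged.
bagging = RandomForestRegressor(n_estimators=128, random_state=0)
# Boosting: trees are fitted sequentially on the remaining residual errors.
boosting = GradientBoostingRegressor(n_estimators=128, random_state=0)

for name, model in [("bagging (random forest)", bagging),
                    ("boosting (gradient boosting)", boosting)]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: RMSE = {rmse:.3f}")
```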

There is often a temporal aspect in real-world data; think of weather, language, or the stock market. In this research, time-dependent (sequential) data is used as well. Such data is called time series data and has been a research topic for many years [6]. Due to the temporal or sequential nature of this data, recurrent neural nets (RNNs) became popular; these types of neural nets are also effective for forecasting tasks [14]. A popular variant of the RNN is the Long Short-Term Memory network (LSTM), invented by Hochreiter and Schmidhuber in 1997 [12]. The LSTM is widely used nowadays due to its superior performance in modeling short- and long-term dependencies [2].

3 METHODOLOGY

The methodology of this research is explained in several subsections. First, the methods used by the SKO to calculate the official viewing rates are discussed. The data involved in this research is described in subsection 3.2. Subsection 3.3 explains which models are used to solve our regression problem, and why. The last subsection (3.4) explains how the different models and features are compared.

3.1 Stichting Kijkonderzoek (SKO)

SKO calculates the viewing rate of a television show based on a sample of roughly 1250 households, comprising 2750 people in the Netherlands. This sample is composed using information from the Centraal Bureau voor de Statistiek (CBS, the Dutch bureau for statistics) together with the Media Standaard Survey (MSS). The MSS is performed on a yearly basis, using a country-wide sample of approximately 6000 households and 5100 individuals of at least thirteen years old. The final sample of 2750 people is chosen based on the results of the MSS. When a viewing rate is mentioned, this number represents the number of viewers who are at least six years old by default [17]. Also, the population size of television viewers in the Netherlands is estimated each

year and used as a constant. In 2017 the population size is set to 15,700,000 [17].

3.2 Data set

Since March 25th, 2017, NPO has been able to continuously collect data

through the HbbTV technology. This data consists of events containing a unique ID of the TV, the channel it is on, and a time stamp. Two types of events are collected: HbbViews and HbbHeartbeats. The former is sent the first time a TV switches to an NPO channel, whereas a HbbHeartbeat event is a periodic check whether the TV is still on the same channel as registered by the HbbView event. This check is executed and registered every minute. Also, the sum of the HbbViews and HbbHeartbeats is saved as the Hbb count, representing the total number of viewers at a point in time. These events are stored on NPO's HDFS⁴ and are accessed using

Spark⁵. The NPO is only able to see whether a TV is on NPO1, NPO2

or NPO3, which means that it is not possible to distinguish between a TV switching to a non-NPO channel and a TV being turned off. The HbbTV data used for this research was collected from the 25th of March until the 1st of June and therefore covers 68

days, where every row represents a minute in the given time period. For this research, the official viewing rates provided by the SKO are also collected for the same date range. To access the SKO data, specific software approved by the SKO is required. Since this raw data is not stored on HDFS, an export had to be made manually in CSV format. The exported data consists of a time stamp, a channel, and the number of people that were watching that channel in the Netherlands. Both the HbbTV viewing rates and the SKO viewing rates are registered for each minute of the day in the data set.

Besides the HbbTV and SKO data, NPO's linear streaming data is used to help predict the viewing rates. This data is also stored on HDFS. Even though these viewing rates are not derived from people watching television, the streams show the same content at the same time and might therefore improve the accuracy of the prediction. To emphasize the temporal nature of the data, periodical (one-hot encoded) features are added, such as day of the week and hour of the day.

In order to see what content is displayed on all three channels throughout the day, the so-called Asrun data is collected. This XML-formatted data contains metadata of every show, commercial or promo video that is broadcast throughout the day. After parsing this XML data, the start time, duration and show name are stored in a dataframe with the same structure as the previously mentioned data, where every row represents a minute. The content that occupies the most seconds of a minute is treated as the only content displayed that minute. Therefore a short clip with a duration of less than one minute might be overlooked, or a show might start a few seconds later (according to this dataframe) than it actually did. Since the HbbHeartbeats are sent every minute, it is not yet possible to improve the granularity of the

⁴ https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
⁵ https://spark.apache.org/


content displayed. As a result of using this Asrun data, the content displayed on each channel for every minute in the data set is known. All data is loaded into dataframes and merged based on the time stamp (rounded per minute). This dataframe is generated separately for each of the NPO channels due to the size of the dataframes. As a result, the dataframes consist of the following columns: HbbHeartbeats, HbbViews, Hbb count, Linear streaming count, day of the week (one-hot encoded), hour of the day (one-hot encoded) and the dependent variable: the official viewing rate (referred to as SKO count).
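The sketch below illustrates, under assumptions about the raw file layouts (the actual HbbTV event tables and the Asrun XML schema are not public, so tag and column names such as "start", "duration" and "title" are hypothetical), how the per-minute dataframe described above could be assembled: the Asrun XML is parsed, the dominant content per minute is determined, and the counts are merged with one-hot encoded time features.

```python
# Illustrative sketch of the per-minute dataframe construction described above.
# Tag and column names are assumptions, not the actual NPO schemas.
import pandas as pd
import xml.etree.ElementTree as ET

def parse_asrun(xml_path):
    """Return the dominant show per minute from an Asrun XML file."""
    rows = []
    for item in ET.parse(xml_path).getroot():          # one element per broadcast item
        start = pd.to_datetime(item.findtext("start"))
        dur = pd.to_timedelta(int(item.findtext("duration")), unit="s")
        rows.append({"start": start, "end": start + dur, "title": item.findtext("title")})
    asrun = pd.DataFrame(rows)

    # Expand every item to the minutes it covers and count overlapping seconds.
    per_minute = []
    for _, r in asrun.iterrows():
        minutes = pd.date_range(r["start"].floor("min"), r["end"].floor("min"), freq="min")
        for m in minutes:
            overlap = (min(r["end"], m + pd.Timedelta(minutes=1)) - max(r["start"], m)).total_seconds()
            per_minute.append({"timestamp": m, "title": r["title"], "seconds": overlap})
    per_minute = pd.DataFrame(per_minute)
    # Keep only the title that occupies the most seconds of each minute.
    idx = per_minute.groupby("timestamp")["seconds"].idxmax()
    return per_minute.loc[idx, ["timestamp", "title"]]

def build_features(hbb_counts, stream_counts, sko_counts, asrun_minutes):
    """Merge the per-minute sources and add one-hot encoded time features."""
    df = (hbb_counts.merge(stream_counts, on="timestamp", how="left")
                    .merge(asrun_minutes, on="timestamp", how="left")
                    .merge(sko_counts, on="timestamp", how="left"))
    df["day_of_week"] = df["timestamp"].dt.day_name()
    df["hour_of_day"] = df["timestamp"].dt.hour
    return pd.get_dummies(df, columns=["day_of_week", "hour_of_day"])
```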

3.3 Regression models

Using this data, three regression models are implemented to predict the viewing rates. The first model implemented and used in this research is the Ordinary Least Squares (OLS) linear regression model. This method is one of the most commonly used models in regression analysis and has been established for two centuries [7]. The OLS linear regression model fits a linear function by minimizing the squared error (residuals) between the observations and the function. This model is chosen due to its simplicity, both conceptually and implementation-wise.

The second, non-linear, model used in this research is the random forest regressor. A benefit of random forest regression models is that they allow non-linearities and interactions to be learned from the data without having to model them explicitly. Random forest models (unlike linear models) do not only focus on the mean and variance, but on other elements of the data as well [11]. Furthermore, random forest regression models are difficult to overfit and handle noise well [4].

Due to the temporal nature of the data, and the increasing use of neural networks in machine learning, a third model is implemented as well: a Long Short-Term Memory network (LSTM), which allows historic information to persist in the network. These recurrent neural nets can use relevant information from the (recent) past to perform a task in the present or future. This could be beneficial, since the popularity of a television show over the past minutes, the last week, or even the weeks before can be very useful for the model to make a more accurate prediction.

3.4 Evaluation measures

One of the metrics used to compare different models and features is the Root Mean Squared Error (RMSE), defined in formula 1, where ŷ represents the predicted value and y the ground truth.

RMSE = \sqrt{ \frac{1}{n} \sum_{t=1}^{n} \left( \hat{y}_t - y_t \right)^2 }    (1)

When squaring the viewing rates, the numbers can become quite large. Therefore, the RMSE is used instead of the Mean Squared Error (MSE) to get a feeling for the number of viewers by which a prediction is off. Since the order of magnitude of the viewing rates varies a lot, the RMSE is not always sufficient. For example, when taking the average RMSE of all predictions of a day, the metric is heavily affected by differences during peak hours, because the absolute differences between prediction and ground truth can be large (compared to off-peak hours), while the relative error does not necessarily have to be large as well. Therefore a relative error metric is also taken into account: the Symmetric Mean Absolute Percentage Error (SMAPE). SMAPE is a variant of the Mean Absolute Percentage Error (MAPE), but is more stable when there are zero values (or values close to zero) in the ground truth. SMAPE uses the average of the prediction and the ground truth to calculate the percentage error of the prediction, instead of just the ground truth. This reduces the influence of these zero (or close to zero) points in the data and therefore decreases the skewness of the overall error rate. SMAPE is described in formula 2. By definition this metric is bounded between 0 and 2 (200%) and slightly favors predictions lower than the ground truth over predictions higher than the ground truth, which is acceptable for this research.

SMAPE = \frac{1}{n} \sum_{t=1}^{n} \frac{ \left| \hat{y}_t - y_t \right| }{ \left( \hat{y}_t + y_t \right) / 2 }    (2)
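A minimal NumPy sketch of both evaluation metrics as defined in formulas 1 and 2 (the helper names are illustrative):

```python
# Minimal NumPy implementation of the two evaluation metrics (formulas 1 and 2).
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, bounded between 0 and 2."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true) / ((y_pred + y_true) / 2))

# Example: a prediction that is 10% too low everywhere.
y = np.array([1_000_000, 500_000, 50_000])
print(rmse(y, 0.9 * y), smape(y, 0.9 * y))  # SMAPE is roughly 0.105 (10.5%)
```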

4 EXPERIMENTS

The HbbTV events used in this research were collected over 68 days. This data needs to be split into a training, validation and test set, in such a way that enough training data is available while enough data is left for validation and testing. Because the official viewing rates arrive the day after, it is only necessary to predict one day ahead. To allow weekly shows to be analyzed properly, the size of the test set is set to three weeks of data, and the validation set size to one week of data. The maximum training size is therefore 40 days. To create a realistic experimental environment, each of the 28 validation and test days is predicted by a model trained on the 40 preceding days, moving the window of the train and test set forward by one day. This way 7 validation sets and 21 test sets are generated, each containing 40 days to train and 1 day to test or validate. These fixed train, validation and test set sizes are defined in order to treat each prediction equally, and they allow different predictions to be compared in a fair way. The validation set, containing one week of data in total, is used for testing different combinations of (hyper)parameters, models and features. Another important thing to mention is that a prediction is made for every minute, given the received HbbTV events for that minute (among other features). Since the HbbHeartbeat and HbbView events are received and stored continuously (each minute) throughout the day, there is no need to predict ahead of the given HbbTV events.
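A sketch of the rolling-window split described above (40 training days followed by a single evaluation day, shifted one day at a time; 7 validation days, then 21 test days). The date boundaries come from the text; the function name is illustrative.

```python
# Illustrative rolling-window splits: 40 training days followed by one
# evaluation day, shifted forward one day at a time.
import pandas as pd

def rolling_splits(start="2017-03-25", n_days=68, train_days=40, val_days=7):
    days = pd.date_range(start, periods=n_days, freq="D")
    splits = []
    for i in range(train_days, n_days):
        splits.append({
            "train": days[i - train_days:i],          # the 40 preceding days
            "eval": days[i],                          # the day to predict
            "role": "validation" if i < train_days + val_days else "test",
        })
    return splits

splits = rolling_splits()
print(sum(s["role"] == "validation" for s in splits),  # 7
      sum(s["role"] == "test" for s in splits))         # 21
```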

Implementing the models

All models are implemented in Python, but use different libraries. For the implementation of the OLS linear regression model, Scikit-learn's linear regression module⁶ is used. The random forest regressor uses Scikit-learn's RandomForestRegressor⁷, and the LSTM recurrent neural net is implemented using the Keras library⁸ with a TensorFlow backend.

⁶ http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
⁷ http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html



After implementing the models, the (hyper)parameters need to be tuned in order to perform optimally. The optimal parameter values differ in almost every situation, but there are some best practices on how to search for a good (hyper)parameter setup [18]. For some models, research has been done into parameter tuning. Oshiro et al. (2012) investigated the optimal number of trees in random forest models by comparing the accuracy on 29 large data sets, and concluded that using more than 128 trees or estimators (256, 512, 1024, 2048 and 4096) gave no significant improvement in accuracy [15]. This number is an indication rather than a ground truth, since it might be different for the data set at hand.
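A sketch of how such a tuning step could look when validating the number of trees on held-out data, in the spirit of Oshiro et al. (2012); the data here is synthetic and purely illustrative, not the validation set used in this research.

```python
# Sketch: validate the number of trees on held-out data. Synthetic data only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.uniform(size=(5000, 6))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=5000)
X_train, y_train, X_val, y_val = X[:4000], y[:4000], X[4000:], y[4000:]

for n_estimators in (10, 32, 64, 128, 256):
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=8,
                                  random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    print(f"n_estimators={n_estimators:4d}  validation RMSE={rmse:.4f}")
```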

To find the best performing combination, several parameters are tested to see which combination works best for this application. The performance of a parameter setup is tested using the validation set, which contains one week of data. For the OLS linear regression model, the parameters fit_intercept and normalize are tested with the values True and False. The best performing setup was fit_intercept=True and normalize=False, which also happen to be the default values. The regularization parameter alpha is tested for different values and performed best when set to 1e-5 (best of 10, 1, 1e-1, 1e-5, 1e-10).

The random forest regressor is tuned by testing different values of n_estimators (10, 32, 64, 128 and 256). For this setup, in general, the more estimators used, the more accurate the predictions became. As claimed by Oshiro et al. (2012), the accuracy did not improve significantly when using more than 128 estimators, so n_estimators is set to 128. The max_depth is set to 8 (best of 5, 8, 16, 32 and None), min_samples_split is set to 2 (best of 1, 2, 5, 10, 15), min_samples_leaf is set to 1 (best of 1, 2, 5, 10, 15) and finally max_features is set to None (best of log2, sqrt and None). The LSTM is configured by experimenting with different numbers of nodes, layers, activation functions, dropouts, loss functions and optimizers. For neural nets as well, there is no universally optimal architecture or hyperparameter setup; different combinations need to be tested in order to find the best architecture for the problem at hand. The best performing setup consists of an input layer, an output layer and three hidden LSTM layers of 128 units each. The number of input units is equal to the number of features, and the output layer contains one unit. Furthermore, the MSE metric is used as the loss function, the adam method is used as optimizer, and no dropout is used. The activation function used in the hidden layers is the tanh activation function, whereas the output layer uses a linear activation function. This LSTM is trained for 10 epochs, after which the training loss stabilizes.
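A sketch of the described LSTM architecture in Keras: three stacked LSTM layers of 128 units with tanh activations, a single linear output unit, MSE loss and the adam optimizer. The input window length, feature count and training data below are illustrative assumptions, not the actual setup.

```python
# Sketch of the LSTM architecture described above; shapes and data are dummies.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_timesteps, n_features = 60, 35          # assumed window length and feature count

model = Sequential([
    LSTM(128, activation="tanh", return_sequences=True,
         input_shape=(n_timesteps, n_features)),
    LSTM(128, activation="tanh", return_sequences=True),
    LSTM(128, activation="tanh"),
    Dense(1, activation="linear"),        # one output unit: the predicted SKO count
])
model.compile(loss="mse", optimizer="adam")

# Dummy data just to show the expected tensor shapes.
X = np.random.rand(256, n_timesteps, n_features)
y = np.random.rand(256, 1)
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```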

Experiment 0: Observations in the data

The first observation is that the viewing rates provided by the SKO fluctuate strongly because of their relatively small sample size compared to the group of HbbTV-compatible televisions (see figure 1).

⁸ https://keras.io/

Figure 1: TV ratings for all three NPO channels from 00:00 until 23:59 on the first of May 2017. At specific time intervals the SKO count fluctuates strongly and shows zero points due to the relatively small sample size.

Especially during the off-peak hours and during the night, many zero points are measured for NPO2 and NPO3. This fluctuation occurs because NPO2 and NPO3 attract fewer viewers on average than NPO1. In the HbbTV data there are no zero points, and a much smoother fluctuation is shown.

The sample size is large enough to be statistically representative, as stated by the SKO. But using this sample to approximate the number of viewers for the entire country comes with a statistical uncertainty. Calculating this uncertainty is difficult due to the type of data at hand. Since every measurement is unique, because of the varying times, days or even the content shown, but also due to the fluctuating order of magnitude of the viewing rates, calculating a mean or variance is almost impossible. The calculation of the official viewing rates can therefore be approached as a counting experiment: SKO counts the number of viewers in its sample and uses that count to approximate the viewing rate for the entire country. The probability distribution describing how well this counting experiment reflects the entire country is considered to be a Poisson distribution. Since we have no insight into the absolute number of viewers in the SKO sample (the counting experiment), an estimate has to be made of how many people in the Netherlands are represented by one person in the sample. This information can be found during the off-peak hours, where the number of viewers increases or decreases in vertical steps, as shown in figure 1. To make this a conservative estimate, the smallest vertical difference in the viewing rate data is chosen. This smallest difference is equal to 1000.


Model:                         LSTM              RANDOM FOREST     LINEAR REGRESSOR  BASELINE
Features / Channel / Interval  RMSE    SMAPE(%)  RMSE    SMAPE(%)  RMSE    SMAPE(%)  RMSE    SMAPE(%)

(1) Hbb count
NPO1  00:00-23:59              238700  68.88     63313   19.98     60981   33.13     81673   21.69
NPO1  18:00-23:59              690676  59.87     145040  8.45      120504  6.90      246697  13.10
NPO2  00:00-23:59              75975   77.66     23617   35.60     23851   45.80     24499   35.95
NPO2  18:00-23:59              214357  86.59     50462   15.74     45659   15.06     53058   15.11
NPO3  00:00-23:59              27808   48.28     25438   43.26     24170   45.82     25580   44.43
NPO3  18:00-23:59              60333   37.31     52902   30.43     47482   28.64     52000   29.20

(2) Hbb count, HbbHeartbeat count, HbbView count
NPO1  00:00-23:59              110636  38.65     56557   19.56     60392   32.22     81673   21.69
NPO1  18:00-23:59              282680  20.73     127090  7.37      122837  7.02      246697  13.10
NPO2  00:00-23:59              59614   67.73     22404   35.33     22733   44.66     24499   35.95
NPO2  18:00-23:59              158641  63.08     44648   14.37     42326   13.81     53058   15.11
NPO3  00:00-23:59              28815   51.33     23100   41.07     23853   45.78     25580   44.43
NPO3  18:00-23:59              61266   39.35     49178   28.98     45909   27.21     52000   29.20

(3) Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day
NPO1  00:00-23:59              55793   24.92     51806   18.10     60392   32.22     81673   21.69
NPO1  18:00-23:59              124230  7.00      117596  6.89      122837  7.02      246697  13.10
NPO2  00:00-23:59              21571   33.52     20699   33.91     22726   44.60     24499   35.95
NPO2  18:00-23:59              45167   13.95     41323   13.33     42333   13.80     53058   15.11
NPO3  00:00-23:59              23756   41.11     23088   41.04     23855   45.80     25580   44.43
NPO3  18:00-23:59              48789   26.60     46640   27.61     45926   27.21     52000   29.20

(4) Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day, Stream count
NPO1  00:00-23:59              60235   26.62     51575   17.30     57583   29.79     81673   21.69
NPO1  18:00-23:59              137992  7.14      123928  6.95      119463  6.83      246697  13.10
NPO2  00:00-23:59              21681   33.51     20341   34.10     21785   43.19     24499   35.95
NPO2  18:00-23:59              45901   14.05     44840   14.97     43333   14.06     53058   15.11
NPO3  00:00-23:59              24052   44.41     22215   40.18     23044   44.26     25580   44.43
NPO3  18:00-23:59              45954   25.38     49803   28.87     46822   27.78     52000   29.20

Table 1: Average error per minute of a day and during peak hours, reported on average over the validation data for both evaluation metrics, using different models and feature sets. The baseline is the Hbb count multiplied by the average fraction between the SKO count and the Hbb count.

Features used       Channel  Interval     RMSE   SMAPE (%)
Weighted Hbb count  NPO1     00:00-23:59  42858  16.21
Weighted Hbb count  NPO1     18:00-23:59  97947  5.27
Weighted Hbb count  NPO2     00:00-23:59  19929  32.68
Weighted Hbb count  NPO2     18:00-23:59  40175  12.62
Weighted Hbb count  NPO3     00:00-23:59  21627  37.80
Weighted Hbb count  NPO3     18:00-23:59  45552  25.82

Table 2: Average error per minute of a day and during peak hours, reported on average over the validation data for both evaluation metrics, using the weighted Hbb count (calculated with the TV weighting model) as the prediction.

If everyone in the sample represented only 1000 people in the Netherlands, viewing rates could only be estimated below 2,750,000, which in reality is far too low. This value is still used in the uncertainty calculation to show the minimal uncertainty of the official viewing rates, which can only be higher in reality.

The Poisson distribution is a discrete probability distribution for the counts of events that occur randomly in a given interval of time [13]. A property of the Poisson distribution is that the estimated mean over an interval of observations can be treated as the actual mean, which in this case is also equal to the variance (µ = σ²) [19]. The number of observations in the sample is estimated (conservatively) by dividing the SKO count at time t by 1000. With this information a relative uncertainty can be defined as 1/\sqrt{n}, where n represents the estimated number of observations in the sample. This means that when 1 million people are watching at a certain point in time, the true value is likely to lie within a range of 3.2% above and below the SKO count (1 million). But when only 50,000 people are watching, the true value is likely to lie within a range of 14.1% above and below the 50,000. When the actual number of observations in the sample is known, this uncertainty is likely to increase. Knowing this, when a prediction falls within that range but is not exactly equal to the viewing rate, it can still be a correct estimate. Since there is no actual ground truth available, SKO's viewing rates are used as the ground truth by broadcasters. Further experiments therefore still try to predict the SKO count.
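A worked numeric check of the conservative 1/√n uncertainty described above, using the assumption from the text that each sampled person represents at least 1000 viewers:

```python
# Conservative Poisson-style relative uncertainty of the SKO count: each
# sampled person is assumed to represent at least 1000 viewers, so an SKO
# count of V corresponds to roughly n = V / 1000 counted observations.
import math

def relative_uncertainty(sko_count, viewers_per_panel_member=1000):
    n = sko_count / viewers_per_panel_member   # estimated observations in the sample
    return 1 / math.sqrt(n)

for viewers in (1_000_000, 50_000):
    print(f"{viewers:>9,} viewers -> ±{relative_uncertainty(viewers):.1%}")
# 1,000,000 viewers -> ±3.2%
#    50,000 viewers -> ±14.1%
```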

4.1 Experiment 1: Comparing models and features

To see which models perform the best, four different combinations of features are used.

(1) Hbb count

(2) Hbb count, HbbHeartbeat count, HbbView count

(3) Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day

(4) Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day, Stream count

First there is the Hbb count, which is the sum of the HbbHeartbeats and HbbViews. This number represents the number of TVs watching at a point in time, regardless of whether that TV just turned on or is still watching.


Figure 2: Histogram of the weights of all televisions per NPO channel, averaged over all 21 test sets. Although the highest weight is also the most frequently assigned weight for all three channels, the distribution of the weights for NPO1 resembles a normal distribution, NPO2 is less normally distributed, and NPO3 is not normally distributed at all. The less normally distributed the weights are, the more the model depends on the highest-weighted televisions.

The second combination uses the Hbb count, the HbbHeartbeat count and the HbbView count. This setup is meant to see whether the model performs better if a distinction is made between the stable watchers and the new watchers. The third combination emphasizes the temporal aspect of the prediction by adding the day of the week and hour of the day features. These categorical features need to be one-hot encoded in order to be valuable for the model. In this combination, the Hbb count, HbbHeartbeat count and HbbView count are taken into account as well. The last combination uses all available features: it is the same as the third combination, but adds the stream count. This is the number of viewers watching linear content of the corresponding NPO channel through streaming on a tablet or computer. Although this information is not obtained from television viewing behavior, it could help the model predict the correct viewing rate. Furthermore, a baseline is added, which is calculated by taking the average fraction of the SKO viewing rates over the HbbTV viewing rates and multiplying every Hbb count by that average fraction. This baseline is chosen because it is the most basic way to translate HbbTV viewing rates into SKO viewing rates.
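A minimal sketch of this baseline, assuming aligned per-minute arrays of SKO counts and Hbb counts (the counts below are made up):

```python
# Baseline: scale the Hbb count by the average SKO/Hbb fraction observed on
# the training days, then apply that single factor to the day to predict.
import numpy as np

def baseline_prediction(hbb_train, sko_train, hbb_eval):
    fraction = np.mean(np.asarray(sko_train) / np.asarray(hbb_train))
    return fraction * np.asarray(hbb_eval)

# Example with made-up per-minute counts.
sko_train = np.array([800_000, 900_000, 1_000_000])
hbb_train = np.array([80_000, 95_000, 105_000])
print(baseline_prediction(hbb_train, sko_train, hbb_eval=np.array([90_000])))
```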

The validation set is used to test different combinations of features and models. The aggregated results are shown in table 1, reported on both RMSE and SMAPE. Per channel and time interval the best score is highlighted, and the best performing combination of model and features for a specific time interval and channel is highlighted across the row. When comparing the different combinations of features, it is clear that the optimal combination of features and model differs per channel and per time interval. As explained at the beginning of section 4, it is only necessary to predict one day in advance. Ideally, all 24 hours can be predicted accurately, but the most important part of the day is during the peak hours. This peak interval starts at 18:00 and ends at midnight. All combinations are reported for both intervals.

For NPO1, the random forest regressor using all features performs best when predicting the entire day. Another interesting observation is that

this random forest regressor is the only model performing better than the baseline method for all-day predictions. When predicting during peak hours, all three models perform better than the baseline in their optimal feature setup, but the linear regression model using all features performs best of all, even though it is a close call. This indicates that there is a linear relationship between the SKO count and the Hbb count for NPO1 during peak hours. Another indication is that even for NPO1 the viewing rates during the off-peak hours contain noise, since the baseline (the most basic form of prediction, without taking any of the extra features into account) performs better than the LSTM and the linear regression model. Because of the architecture of the random forest regressor, it is good at dealing with noise in the data; this property is noticeable when predicting viewing rates for NPO1. The noise just mentioned can be caused by the HbbTV data, the official viewing rates, or both. Since the sample of HbbTV-compatible televisions is much larger than the sample used by SKO, the noise could be coming from the SKO viewing rates. On the other hand, since the sample of HbbTV-compatible televisions is not selected based on demographics, it is also likely that the viewing behavior of the HbbTV-compatible televisions is skewed compared to the general Dutch viewing behavior. This last case will be discussed in section 4.2.

For NPO2, the best performing combination is different. The LSTM RNN performs poorly in general when using relatively few features, but performs much better when more features are added. The LSTM (using all features) even performs best when predicting the entire day. For the peak-hour predictions the random forest regressor performs best, with a SMAPE of 13.33% and an RMSE of 41323, using all features except the linear streaming count. For NPO3 the best performing models are the opposite of those for NPO2: the best performing model during peak hours is the LSTM using all features, and the best performing model for the entire day is the random forest regressor, also using all features. The fact that the linear regression model performs worse at predicting viewing rates for NPO2 and NPO3 compared to NPO1 indicates that the


relationship between HbbTV data and the SKO count for these channels is not linear.

Official viewing rates are generally reported per TV show by calculating the average viewing rate per minute over the minutes that a show was on. The RMSE and SMAPE scores can improve by doing the same for the predictions; this is done in subsection 4.3. First, however, a new model is introduced in the following subsection.

4.2 Experiment 2: Weighted HbbTV viewing rates

To improve the accuracy of the predictions so far, a new model is implemented. As mentioned in section 3.2, the fact that the sample of HbbTV-compatible televisions might not be a good sample of the general Dutch viewing behavior has not been taken into account so far. Therefore a method is created to emphasize or ignore certain televisions in the sample based on their viewing behavior.

Since every HbbTV-compatible TV has a unique ID, it is possible to track viewing behavior per television. By adding a weight to every unique ID, the HbbTV viewing rates can be adjusted based on the viewing behavior of the televisions. The weight can filter out the viewing behavior of certain TVs by lowering their weight, or make their viewing behavior more important by increasing the weight. When, for example, relatively fewer viewers are watching in a specific interval than measured by the SKO, the TVs that were active during that interval are underrepresented compared to the general Dutch viewing behavior and should therefore be weighted more heavily (and vice versa). The weights are optimized using a linear regression approach.

To mathematically define the described weighting method, several formulas are drafted. Assume that we have a vector a_i per television i, where T equals the total number of time steps; a_{it} = 1 if TV i is active at time t and zero otherwise. The weight of each television is represented by the vector w. Note that the weights are assigned per television, which makes the weights w not time dependent. The learning objective is to minimize the error between the SKO count and the weighted Hbb count. The error function is defined by formula 3, which calculates the relative error between the SKO viewing rate and the weighted Hbb count per minute. A relative error of 1 means that the viewing rate is equal to the weighted Hbb count. This method differs from a regular gradient descent approach in that it uses the relative error to adjust the weights, which increases the weights when the weighted viewing rates are lower than the SKO viewing rates and vice versa. The weight is derived from the average fraction over all minutes that a television was active. A single fraction at time t and iteration n is calculated by dividing the SKO count by the weighted Hbb count at time t (the weights times the active televisions). This fraction is the perfect translation from weighted Hbb count to SKO count for time t. If all televisions were active during every minute, a weight vector w with a learning objective of zero could be found. The difficult part is that every television is active for a different (arbitrary) number of minutes and that the weights must be averaged over the minutes it was active. Still, this approach tries to find optimal weights in as few steps as possible. The number of iterations matters because of the number of calculations that needs to be done for 40 training days, which consist of 57600 (60*24*40) time steps, for over 200,000 televisions. To get as close as possible to a relative error of 1, the objective function is defined as described in formula 4.

w = (w_1, w_2, \ldots, w_N)^\top, \qquad a_i = (a_{i1}, a_{i2}, \ldots, a_{iT})^\top

f(t) = \frac{ a_t^{sko} }{ \sum_i a_{it} \, w_i^n }    (3)

L = \min_w \frac{1}{T} \sum_t \left| f(t) - 1 \right|    (4)

As a starting point, all initial weights are set based on the average fraction of the SKO count over the Hbb count across all time steps in the training set. This is described in formula 5, where n represents the iteration number and the weight of TV i at iteration n is denoted w_i^n. After the initial weights are set, the optimal weights are calculated by minimizing the relative error through adjusting the weight of each television. The update of a weight depends on the average relative error, using the current weights, over all minutes that TV i was active. This fit step is described in formula 6:

w_i^0 = \frac{1}{NT} \sum_t f(t)    (5)

w_i^{n+1} = w_i^n \, \frac{ \sum_t f(t) \, a_{it} }{ \sum_t a_{it} }    (6)

The weights are calculated over multiple iterations for each of the 21 train and test sets. After implementing this model, the weights became very large for a small subset of the TVs. Therefore, the maximum weight is capped at 2 times the initial weight, to prevent the model from depending on a relatively small subset. This cap resulted in a larger sample with a relatively high weight and therefore makes the predictions more stable without losing too much accuracy. The fitting of the weights is done over 10 iterations for this experiment, after which the learning objective converges for all three channels. The error curve per iteration for the validation set is displayed in figure 4 in Appendix A.1.
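A compact NumPy sketch of formulas 3-6, including the cap at twice the initial weight. The activity matrix here is a small dense toy example, whereas the real data (tens of thousands of minutes and over 200,000 televisions) would require a sparse representation; the exact initialization in formula 5 is interpreted as the average SKO/Hbb fraction scaled by 1/N, which is an assumption.

```python
# Sketch of the television weighting model (formulas 3-6): iteratively scale
# each TV's weight by the average relative error over the minutes it was
# active, capping weights at twice their initial value. Dense toy data only.
import numpy as np

def fit_tv_weights(active, sko, n_iterations=10):
    """active: (N_tvs, T) binary activity matrix; sko: (T,) official counts."""
    N, T = active.shape
    hbb = active.sum(axis=0)                       # unweighted Hbb count per minute
    w0 = (sko / hbb).sum() / (N * T)               # formula 5 (assumed reading)
    w = np.full(N, w0)
    for _ in range(n_iterations):
        f = sko / (w @ active)                     # relative error per minute, formula 3
        w = w * (active @ f) / active.sum(axis=1)  # update step, formula 6
        w = np.minimum(w, 2 * w0)                  # cap at twice the initial weight
    return w

# Toy example: 4 TVs observed over 6 minutes.
active = np.array([[1, 1, 1, 1, 1, 1],
                   [1, 0, 1, 0, 1, 0],
                   [0, 1, 1, 1, 0, 1],
                   [1, 1, 0, 1, 1, 1]], dtype=float)
sko = np.array([70_000, 55_000, 90_000, 80_000, 60_000, 75_000], dtype=float)
w = fit_tv_weights(active, sko)
print("weighted Hbb count per minute:", w @ active)
```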

The distribution of the weights after 10 iterations is shown in figure 2. Interestingly, the highest weight is the most frequently assigned weight. Another observation is that the weight distribution of NPO1 looks a lot like a normal distribution; the weight distribution of NPO2 is less normally distributed, and that of NPO3 the least.

Using the same experimental setup as in section 4.1, the weighted HbbTV viewing rate model outperforms almost every model tested so far on both SMAPE and RMSE. The results are shown in table 2.


Features: Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day, Stream count, Weighted Hbb count

Model:                LSTM              RANDOM FOREST     LINEAR REGRESSOR  BASELINE
Channel  Interval     RMSE    SMAPE(%)  RMSE    SMAPE(%)  RMSE    SMAPE(%)  RMSE    SMAPE(%)
NPO1     00:00-23:59  54455   26.02     40011   16.26     43955   18.61     42858   16.21
NPO1     18:00-23:59  118651  6.33      86603   4.79      94450   5.22      97947   5.27
NPO2     00:00-23:59  21957   37.43     19530   32.81     20219   36.47     19929   32.68
NPO2     18:00-23:59  45767   14.54     38336   12.26     37717   12.32     40175   12.62
NPO3     00:00-23:59  22890   40.21     20127   37.09     20299   38.29     21627   37.80
NPO3     18:00-23:59  44437   25.48     39976   22.75     39754   23.73     45552   25.82

Table 3: Average error per minute of a day and during peak hours, reported on average over the validation data for both evaluation metrics, using different models and all features, including the weighted Hbb count. The baseline is set to the weighted Hbb count, which is calculated using the TV weighting model.

Features: Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day, Stream count, Weighted Hbb count

Model:                LSTM             RANDOM FOREST    LINEAR REGRESSOR  BASELINE
Channel  Interval     RMSE   SMAPE(%)  RMSE   SMAPE(%)  RMSE   SMAPE(%)   RMSE   SMAPE(%)
NPO1     00:00-23:59  41963  12.80     31332  7.94      37846  9.64       32772  8.38
NPO1     18:00-23:59  76891  5.35      58919  4.30      67640  5.08       67047  4.98
NPO2     00:00-23:59  15332  27.37     13688  23.83     15398  29.91      14653  26.03
NPO2     18:00-23:59  25474  11.31     22606  9.41      23418  9.73       24341  10.10
NPO3     00:00-23:59  17284  40.33     14897  37.93     15269  39.72      16367  38.91
NPO3     18:00-23:59  34604  20.22     31273  17.48     32012  18.62      39369  21.82

Table 4: Average error per TV show per day and during peak hours, reported on average over the validation data for both evaluation metrics, using different models and all features, including the weighted Hbb count. The baseline is set to the weighted Hbb count, which is calculated using the TV weighting model.

Only during peak hours on NPO3 was the SMAPE error lower when using a random forest regression model in combination with all the features mentioned in the previous experiment. This indicates that part of the noise mentioned in the introduction of section 4.1 can be caused by the skew of the HbbTV viewing behavior sample compared to the general Dutch viewing behavior.

By weighting the HbbTV televisions we do not only create a new model to predict the SKO count; as a result, a new input feature is created as well. Since this weighted viewing rate is corrected for the skew of the HbbTV viewing behavior, it could be a very useful feature to improve the accuracy of the models compared in section 4.1. The results of this experiment can be found in table 3. For this experiment the baseline is set to the weighted Hbb count, since using this weighted viewing rate as a feature should at least perform better than just taking the weighted Hbb count.

Using the weighted Hbb count as an extra input feature for the previously investigated models, the accuracy of the predictions improved for all models, but not all models performed better than the baseline. The random forest regressor performed best in general. The viewing rates predicted by the random forest model were the most accurate for NPO1 during peak hours and for NPO3 when predicting the entire day. For NPO1 and NPO2, when predicting the entire day, the differences in SMAPE and RMSE are small, just like when predicting during peak hours for NPO2 and NPO3. In the next experiment, the per-minute predictions are used to calculate an average viewing rate per television show. Since viewing rates are generally reported per TV show, the average score per TV show will be calculated first, before choosing the best model per channel for each of the two intervals.

4.3 Experiment 3: Reporting per TV show

Viewing rates are generally reported per TV show instead of per minute. Thanks to the Asrun data, it is possible to calculate the viewing rate of a television program instead of each minute of the day. After predicting per minute, an average prediction per TV show can therefore be made as well. The viewing rate of a TV show is calculated by taking the average over all minutes that the show was being broadcast; this method is also used by the SKO. Besides matching the way of reporting, there is another benefit of averaging the prediction over several minutes: when the model predicts higher than the SKO count for some minutes and lower than the SKO count for others, taking the average results in a more accurate viewing rate. To calculate this, the predictions per minute are grouped per TV show and averaged, as shown in the sketch below. The predictions used for this experiment are those made by the best performing model and feature combination found in the previous experiment. The average error is then reported using both the SMAPE and RMSE metrics for the average show per day per channel. Like the previous experiment, a distinction is made between all-day predictions and predictions during the peak hours of a day. The results are shown in table 5. In almost all situations, reporting the average per TV show resulted in quite a large improvement in accuracy on both RMSE and SMAPE, except for the predictions for NPO3 for the entire day and for NPO1 during peak hours.
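A minimal sketch of this per-show aggregation, assuming a dataframe of per-minute predictions already joined with the Asrun show titles; column names and the toy data are illustrative.

```python
# Aggregate per-minute predictions and SKO counts to per-show averages, then
# compute the signed percentage error per show (numerator prediction - SKO,
# SMAPE-style denominator). Column names are assumptions for illustration.
import pandas as pd

def per_show_scores(df):
    """df columns: 'show', 'prediction', 'sko_count' (one row per minute)."""
    shows = df.groupby("show")[["prediction", "sko_count"]].mean()
    shows["signed_error_pct"] = (
        (shows["prediction"] - shows["sko_count"])
        / ((shows["prediction"] + shows["sko_count"]) / 2) * 100
    )
    return shows

minutes = pd.DataFrame({
    "show": ["show A"] * 3 + ["show B"] * 2,
    "prediction": [900_000, 950_000, 920_000, 400_000, 380_000],
    "sko_count": [880_000, 990_000, 900_000, 420_000, 410_000],
})
print(per_show_scores(minutes))
```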

Now that the predictions are reported per TV show, different shows from the same season can be compared. For this plot, only shows that occur at least once a week (three times in the entire test set) after 18:00 are taken into account. Box plots of all television shows per day that occur at least once a week are shown in Appendix A.1. For this experiment, a slight adjustment is made to the SMAPE metric as formulated in equation 2: in order to determine whether a specific show is always predicted higher or lower than the SKO count, the numerator is changed to ŷ_t − y_t to allow negative percentage errors as well.


Channel  Interval     Best model                                               RMSE   SMAPE (%)
NPO1     00:00-23:59  All features (incl. weighted Hbb count), random forest   47488  12.79
NPO1     18:00-23:59  All features (incl. weighted Hbb count), random forest   65429  6.11
NPO2     00:00-23:59  All features (incl. weighted Hbb count), random forest   15875  28.68
NPO2     18:00-23:59  All features (incl. weighted Hbb count), random forest   22361  11.04
NPO3     00:00-23:59  All features (incl. weighted Hbb count), random forest   15049  45.16
NPO3     18:00-23:59  All features (incl. weighted Hbb count), random forest   27030  20.27

Table 5: Average error per TV show per day and during peak hours, reported on average over the test set for both evaluation metrics, using the best performing model per channel per time interval.

Figure 3: Box plot showing the percentage error per TV show broadcast in the test set, where the show was broadcast at least 3 times.

In figure 3, the difference in stability of the predictions per channel is clearly visible. For NPO1 the medians are all relatively close to 0, with the upper and lower quartiles often above and below zero respectively. Also, the upper and lower whiskers are a lot closer to each other compared to the other channels. There are, however, some shows that are systematically predicted lower or higher than the actual official viewing rate. NPO2 already shows more noise in the predictions, since the boxes are larger and the medians are farther from zero compared to NPO1. Like NPO1, NPO2 shows some systematic under- or over-prediction for some TV shows. NPO3 shows the most noise, which was expected given the SMAPE and RMSE scores from the previous experiments. Interestingly, the medians are not farther away from zero than for NPO2, but the distance of the upper and lower whiskers to the boxes is much larger. This indicates that there is more variation, since the outer 50% of the predictions differ more than the central 50% of the predictions. The varying predictions might partially be caused by the fact that NPO2 and NPO3 broadcast more TV shows per day (also during peak hours) for a relatively smaller audience, which also causes more unstable viewing behavior compared to NPO1.

For all three channels, TV shows can be found where the predictions are systematically lower or higher than the official viewing rates. Even though these predictions are partially based on the weighted Hbb count, a systematically lower prediction for a show means that its viewers are probably underrepresented in the HbbTV sample, and vice versa. Therefore it could be an improvement to apply a correction per television show afterwards to compensate for the bias in the prediction for that specific show.
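One way such a per-show correction could be implemented (a sketch of the suggested improvement, not something evaluated in this research) is to estimate a multiplicative factor per show from historical predictions and apply it to new predictions of the same show:

```python
# Sketch of a per-show bias correction: estimate a multiplicative factor per
# show from historical (prediction, SKO count) pairs and apply it to new
# predictions of the same show. Not evaluated in this research.
import pandas as pd

def fit_show_corrections(history):
    """history columns: 'show', 'prediction', 'sko_count'."""
    grouped = history.groupby("show")[["prediction", "sko_count"]].mean()
    factors = (grouped["sko_count"] / grouped["prediction"]).rename("correction")
    return factors.reset_index()          # columns: show, correction

def apply_corrections(predictions, corrections):
    """predictions columns: 'show', 'prediction'."""
    out = predictions.merge(corrections, on="show", how="left")
    out["correction"] = out["correction"].fillna(1.0)   # unseen shows: no correction
    out["corrected_prediction"] = out["prediction"] * out["correction"]
    return out
```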

5 CONCLUSION AND DISCUSSION

A data-driven approach is used to answer the question whether it is possible to predict the official viewing rates provided by SKO, using HbbTV data as input. Using this HbbTV data, among other (categorical) features, it is possible to predict viewing rates. In this research, four different models were implemented and trained to predict viewing rates for every minute per NPO channel. In general, more people tend to watch NPO1 than NPO2 and NPO3. This resulted


in more stable viewing behavior for NPO1, which often led to a more accurate prediction. Also, viewing rate predictions are more accurate during peak hours (18:00-23:59) for all three channels, for the same reason.

The first three models that were used are a linear regression model, a random forest regressor and an LSTM RNN. Since the sample of HbbTV-compatible televisions might not be a good sample of society, a fourth model was created to improve the viewing rates by giving a weight to every TV in the HbbTV sample. By comparing the viewing behavior of the televisions in that sample with the official viewing rates, certain television viewing behavior is emphasized or ignored. The weighted Hbb count by itself was more accurate than almost all of the predictions of the regression models implemented thus far. A good explanation for this is that the TV weighting approach reduces the skew of the viewing behavior of the HbbTV sample compared to the general Dutch viewing behavior. The TV weighting method also resulted in a new feature, which is used to improve the initial (three) regression models. By doing this, the most accurate combination to predict viewing rates for the entire day was found for all three channels. This model is the random forest regressor using the following features: Hbb count, HbbHeartbeat count, HbbView count, day of the week, hour of the day and the weighted Hbb count. When reporting the predictions of this model per TV show during peak hours, an average error of 5.52% for NPO1, 9.28% for NPO2 and 18.22% for NPO3 per television show is reached.

The HbbTV data has been collected since March 2017, so the sample of HbbTV-compatible televisions, currently including 200,000+ televisions, and the amount of data collected will grow. This will likely result in more stable and accurate models for two reasons. The first reason is the growing amount of training data over time, which will allow the models to find more accurate relations between variables. The second reason is that the sample of HbbTV-compatible televisions will likely grow, which will result in more stable viewing behavior being used as input data. Also, in this research the decision was made to train regression models to predict one day in advance, but it could be beneficial to predict on specific time intervals of a day. These improvements, however, will be limited by the official viewing rates: as long as the SKO uses a relatively small sample size, the off-peak hours will remain unstable and show fluctuating viewing rates that are very hard to predict.

REFERENCES

[1] I.A. Basheer and M. Hajmeer. 2000. Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods 43, 1 (2000), 3–31.
[2] Filippo Maria Bianchi, Enrico Maiorino, Michael C. Kampffmeyer, Antonello Rizzi, and Robert Jenssen. 2017. An overview and comparative analysis of Recurrent Neural Networks for Short Term Load Forecasting. arXiv preprint arXiv:1705.04378 (2017).
[3] Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (1996), 123–140.
[4] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[5] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[6] Thomas G. Dietterich. 2002. Machine Learning for Sequential Data: A Review. Springer Berlin Heidelberg, Berlin, Heidelberg, 15–30. https://doi.org/10.1007/3-540-70659-3_2
[7] Clara Dismuke and Richard Lindrooth. 2006. Ordinary least squares. Methods and Designs for Outcomes Research 93 (2006), 93–104.
[8] Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory. Springer, 23–37.
[9] Nelson Fumo and M.A. Rafe Biswas. 2015. Regression analysis for prediction of residential energy consumption. Renewable and Sustainable Energy Reviews 47 (2015), 332–343.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[11] Ulrike Grömping. 2009. Variable importance assessment in regression: linear regression versus random forest. The American Statistician 63, 4 (2009), 308–319.
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[13] Jonathan Marchini. 2008. The Poisson Distribution. In Lecture 5: The Poisson Distribution. Oxford Department of Statistics.
[14] L.R. Medsker and L.C. Jain. 2001. Recurrent neural networks. Design and Applications 5 (2001).
[15] Thais Oshiro, Pedro Perez, and José Baranauskas. 2012. How many trees in a random forest? Machine Learning and Data Mining in Pattern Recognition (2012), 154–168.
[16] SKO. 2016. MSS: Mediagebruik 2016. SKO, in collaboration with NLO, NOM and VINEX.
[17] Stichting Kijkonderzoek (SKO). 2017. Methodologische documenten.
[18] Abhishek Thakur. 2016. Approaching (almost) any machine learning problem. Kaggle (2016).
[19] Sylvain Veilleux. 2006. The Poisson Equation. In Some Statistical Basics. University of Maryland, Department of Astronomy.
[20] Alfredo Vellido, Paulo J.G. Lisboa, and J. Vaughan. 1999. Neural networks in business: a survey of applications (1992–1998). Expert Systems with Applications 17, 1 (1999), 51–70.
[21] Peter Whittle. 1951. Hypothesis Testing in Time Series Analysis. Vol. 4. Almqvist & Wiksells.
[22] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. 1998. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting 14, 1 (1998), 35–62.


APPENDIX

A TABLES AND FIGURES

A.1 Extra figures

Figure 4: Percentage error per training iteration per channel for 05-05-2017, using the television weighting model.


Figure 5: Box plot showing the percentage error per TV show broadcast in the entire test set (21 days), where the show was broadcast at least three times.


Figure 6: Box plot showing the percentage error per TV show broadcast in the entire test set (21 days), where the show was broadcast at least three times.


Figure 7: Box plot showing the percentage error per TV show broadcast in the entire test set (21 days), where the show was broadcast at least three times.
