
Forecasting TV-ratings using Web Data

Wyko Rijnsburger (10207120)

Information Studies

Faculty of Science

University of Amsterdam

Supervised by:

dr. T. Coenen

University of Amsterdam

June 2014

Abstract

The television industry relies on advertising to generate revenue, and forecasting the audience is key in selling TV-advertisements. This thesis describes the creation of an application capable of automatically retrieving and storing internet data related to TV-shows airing in the United States and using that data to forecast Nielsen ratings. A dataset of 109 episodes from 48 different TV-shows was collected and used to create a forecasting model in the form of a neural network. Techniques such as bootstrap aggregation and the implementation of a greedy algorithm were used to improve the results. The model is able to forecast the demography ("demo") Nielsen rating of a TV-show with an average forecast error of 0.65 Nielsen rating points, using the number of watchers for a TV-show on Trakt.TV as the data input.


Contents

1 Introduction
1.1 Related Work
1.2 Structure
2 Application Design
2.1 Technology Stack
2.2 Database Design
3 Data Sources
3.1 Data Presentation
4 Forecasting Model
4.1 Initial Network
4.2 Reducing Variance with Bootstrap Aggregation
4.3 Adding a Greedy Algorithm for Input Selection
4.4 Evaluating the Model
5 Discussion
References


1 Introduction

Despite the advent of online and subscription-based television alternatives, traditional network television remains the biggest platform for both viewership and advertising. Global television advertising revenues were valued at US$162.1 billion in 2012 and are forecast to pass US$200 billion in the next five years (PWC, 2014). Advertising time is purchased in advance and the price is based on the number of viewers that are forecast to watch the channel at that time.

The most common measurement of the size of TV-audiences is the Nielsen rating. Nielsen ratings are collected by the Nielsen Company from families who allow their viewing habits to be tracked and recorded in one of two ways:

1. Using viewer "diaries", in which the tracked family writes down which TV-shows they have watched and which members of the family were present during the viewing.

2. Using devices called Set Meters that are connected directly to the television. These devices automatically track and record the viewing habits.

The Nielsen ratings are used by the TV-networks to price their advertising time, which in turn weighs into the decision of whether TV-shows should be renewed for additional seasons and episodes. While there are alternative measurements offered by companies such as TiVo or comScore, Nielsen is the dominant player. NBC President of Research and Media Development Alan Wurtzel stated that Nielsen is a monopoly and that its ratings are the only currency within the TV-business (Herrman, 2011). A single Nielsen rating point represents one percent of all estimated TV-watching households in the United States, of which there were an estimated 115.6 million in 2014 (Wallenstein, 2013). Gensch and Shaman (1980) conclude through a time series analysis that Nielsen ratings provide accurate forecasts of the total number of viewers.

Meanwhile, people are increasingly logging their viewing behavior on the internet. Sites like Trakt.TV allow users to record their television and movie viewing habits; each month, over one million episodes are marked as "watched" on the site (Trakt.TV, 2014). Facebook users like the fanpages of their favorite shows: the most popular TV-show page on Facebook, The Simpsons, has been liked more than 70 million times (Facebook, 2014). Social media sites like Twitter and Tumblr have become instrumental in the way viewers experience and discuss television.

This thesis describes the creation of an application capable of automatically retrieving and storing internet data related to TV-shows and using that data to create a model capable of forecasting Nielsen ratings. Data is only collected for American TV-shows, because these shows have the largest audience and thus generate the largest amount of data. No forecast is made for repeat airings of episodes. The specific implementation of the model is done in the form of an artificial neural network. Through regression analysis, I attempt to forecast Nielsen ratings by creating a neural network with web-scraped data inputs. This forecast can be used by TV-networks to sell their advertising time in the most profitable way.

1.1 Related Work

There have been previous attempts to forecast television ratings. Danaher and Dagger (2012) proposed a nested logit model to forecast ratings. Their dataset consisted of 6,000 programs and more than 70,000 episodes that aired over a four-and-a-half-year period, which is much larger than the dataset used in this thesis. Two different types of input were used: time-based variables and program-based variables. The time-based variables were data such as the day of the week and the time period in which the program aired. A program's genre and duration were used as program-based variables. They concluded that their model was useful for both the modeling and forecasting of TV viewing behavior and that it achieved an average forecast error of 1.08 rating points.

Patelis et al. (2003) used a Bayesian approach. They only used time-based variables, such as the season of the year in which a program is airing and the historical market share of each TV channel for the time period in which a program is airing. The study did not provide an average forecast error. Neither study used data from the internet.

1.2 Structure

This thesis opens with a section describing the technology used in the application. The next section discusses the design of the application, including the way in which the collected data is presented and visualized. Then, the process of creating the forecasting model is described, with results presented for each step and approach. The thesis concludes with a discussion of the results, an analysis of the approach used and directions for future work.


2 Application Design

The application is hosted on the web at www.wykorijnsburger.nl and is password protected to prevent unauthorised access (username: scriptie, password: scriptie). It allows the user to explore a large database of TV-shows and the various data that is collected for those TV-shows. Data is periodically collected for all TV-shows that are added to its local database. New TV-shows can be added to the database manually or automatically: the user can search an external database for a TV-show and add it to the local database, but the application also keeps itself up to date with the most popular TV-shows of the moment. The collected data is presented in tables and graphs directly within the application. Every 24 hours, the data is used to forecast ratings for TV-shows that have yet to air. This section starts by describing the technology used to create the application. Then the data that is used in the model is discussed. Finally, the external presentation of the application and the visualization of the data is presented.

2.1 Technology Stack

A VPS (Virtual Private Server), rented at TransIP (http://transip.nl), hosts the application. Hosting on a pre-configured environment such as Heroku (http://heroku.com) or Amazon EC2 (http://aws.amazon.com/ec2/) was not possible, because the server needed to support periodical data collection through the use of cron jobs. Cron is a time-based job scheduler that runs commands on a user-defined schedule. In this case it is used to run Rails rake tasks containing the web-scraping methods.
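These cron entries are generated with the whenever gem, introduced below. A minimal sketch of a whenever config/schedule.rb; the rake task names are hypothetical stand-ins for the application's actual scraping tasks:

```ruby
# config/schedule.rb -- translated into a crontab by the whenever gem.
# The rake task names below are illustrative, not the application's own.

# Scrape episode stats for recently aired episodes every three hours.
every 3.hours do
  rake "scrape:episode_stats"
end

# Refresh show-level stats and add newly popular shows once a day.
every 1.day do
  rake "scrape:show_stats"
  rake "shows:update_popular"
end
```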

The VPS is configured in the following way. I chose the Linux-based operating system Ubuntu Server to run the server. It is the most used operating system on Amazon's hosting service EC2 (Market, 2014), which means there is a lot of support for it. Other advantages are that it is free and that it has good, up-to-date documentation. The application uses Nginx (http://nginx.org), configured with the Passenger module (https://www.phusionpassenger.com/index2), as the web server. I used Git to push and pull the project to a repository on GitHub and to deploy new versions of the application on the VPS (https://github.com/utwyko/TVRatings).

The application needs to be able to run periodical data collection jobs and present the collected data and associated forecasts in a user-friendly manner. To achieve this, I chose the Ruby on Rails framework. Ruby on Rails, released in 2005, is an open source web application framework built in the Ruby language. It is a full-stack framework in which both the front-end and the back-end can be programmed in the same language. This application uses Ruby version 2.0.0p451, which features many improvements compared to the older 1.9.3 version and was deemed stable enough for this project. The application uses Rails version 4.0.4; Rails 4.1 did not offer any new features that would make it worth trading away the stability of Rails 4.0.x.

Rails can be extended by add-ons called gems, and this application uses a wide variety of them. The most notable ones are Nokogiri (http://rubygems.org/gems/nokogiri) for HTML parsing, whenever (http://rubygems.org/gems/whenever) for the creation of cron jobs, HTTParty (http://rubygems.org/gems/httparty) for web scraping and ruby-fann (https://github.com/tangledpath/ruby-fann) for creating the neural network.
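As an illustration of how the scraping gems work together, a minimal sketch that fetches a page with HTTParty and parses a value out of it with Nokogiri; the URL and CSS selector are hypothetical, since the real selectors depend on the markup of the scraped sites:

```ruby
require 'httparty'
require 'nokogiri'

# Fetch a page and parse its HTML. The URL is a placeholder.
response = HTTParty.get('http://example.com/show/episode-stats')
document = Nokogiri::HTML(response.body)

# Extract the text of the first element matching a CSS selector
# (here a hypothetical table cell holding the seeder count).
cell = document.css('td.seeders').first
seeders = cell ? cell.text.to_i : 0
puts seeders
```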

2.2 Database Design

When a show is added to the database, either manually or automatically, information is retrieved and stored on three different levels: TV-show, season and episode. The database schema shown in figure 1 visualizes this. A TV-show is created and information such as the genre, the network on which it is airing and the country in which it is airing is added. Seasons are created for each season that the TV-show has aired. Seasons do not contain additional information but consist of one or more episodes. Episodes contain information such as the first air date, the title and a description. Collected data related to the TV-show as a whole is stored in a showstats entry, and data relevant to a specific episode is stored in an epstats entry. The exact contents of these entries are discussed in the next section.

It is possible to add a show manually by searching for it by title and selecting the "Add TV-show to database" button. The application, however, also needs to be able to keep its database up to date with the newest shows automatically. To achieve this, Trakt.TV is consulted: every 24 hours, the top 50 shows with the highest number of current viewers on Trakt.TV are added to the database if they have not been added previously.



Figure 1: Complete Database Model as generated by the gem Railroady.

3 Data Sources

The Online TV Database (TVDB) is used to populate the application with TV-shows. This site features an open database of a wide collection of TV-shows, editable by anyone. While the TVDB has its own API, the same data is also accessible through the Trakt.TV API (http://trakt.tv/api). Because Trakt.TV is also used to collect user data, I chose to access the data through the Trakt.TV API.

Information that is added to the database does not always remain up to date. It might be that an episode is created for a show that will not air for another year, which means that information such as the exact air date and the episode description is not yet available. To keep the information up to date, each TV-show is updated every 24 hours. This consists of adding or changing information for episodes and adding new episodes or seasons for TV-shows.

Every three hours, data is scraped for all episodes that have aired in the last four weeks. Information is added in three categories. First, an "epstats" entry is added for each episode. These episode stats consist of the following elements:

• The number of seeders on the torrent site KickAssTorrents (http://kickass.to).

• The number of leechers on the torrent site KickAssTorrents.

A "leecher" is a user who is currently downloading a file, in this case an episode of television, but has not yet completed the download. A user who has completed the download is called a "seeder".

KickAssTorrents is the second largest torrent site (Ernesto, 2014). The largest torrent site, ThePirateBay, has a complicated HTML structure, which makes it hard to scrape data from.

There has been no scientific investigation into the effect of torrent downloads on TV-ratings. Adalian (2013) suggests that the large increase in ratings through the years for the TV-show Breaking Bad might be related to the high popularity of the show on torrent sites. Breaking Bad was the 2nd most downloaded TV-show of 2013 (Ernesto, 2013).

• The number of viewers that have marked the episode as “Watched” on Trakt.TV.

Trakt.TV is a site on which users can keep track of their watching habits. Episodes of TV-shows can be marked as watched, either manually or automatically. Plugins can be downloaded for media centers such as XBMC (http://xbmc.org) and Plex (https://plex.tv) that automatically mark an episode as watched when a certain percentage of the duration of that episode has elapsed during playback. By default this percentage is set to 80%. Trakt.TV offers two statistics: the unique number of viewers and the total number of times the episode has been marked as watched. The second statistic includes users who have watched an episode multiple times. Watching an episode multiple times does not increase its Nielsen rating, so the first statistic is used.


• The user score for the episode on Trakt.TV.

Trakt.TV allows users to rate the episodes they have watched. Trakt offers two rating systems, advanced and simple, and users can choose which system they use. The advanced system is selected by default.

The advanced system lets users choose a number between 1 and 10. The simple system offers two options: like or dislike. Both systems are combined into a single user score that can range from 0 to 100: a like counts as 100, a dislike as 0, and the advanced ratings between 1 and 10 are multiplied by 10.

Every three hours, the Nielsen ratings for each episode are retrieved and a "ratings" entry is added for that episode. Nielsen only publishes the top 10 shows for each day publicly. Fortunately, there are several unofficial sites that report the Nielsen ratings. This application uses the SpoilerTVPlus+ Ratings Charts Playground as a source. This site has up-to-date Nielsen ratings for a large number of American shows. The following ratings are retrieved:

• The total rating in millions of people.

• The "demo" rating in millions of people.

Nielsen ratings consist of two numbers. The total rating is the total number of people that watch a certain show. The "demo" rating is the number of people within the 18 to 49 years old demographic. This is the most coveted demographic because viewers in this age group are the most lucrative target for advertisers. Often, this "demo" rating is more important than the total number of viewers (Nagourney, 2013). The forecasting algorithm currently only forecasts the "demo" rating. According to Jones and Fox (2009), over half of the adult internet population is between 18 and 44 years old; because the application collects web data for television shows, that data is mostly generated by people in the "demo" age group.

A "showstats" entry is added for each TV-show. These statistics are unlikely to fluctuate much over a small period of time and thus are only retrieved once per day:

• The number of likes the page for the show has received on Facebook.

Every popular TV-show (and thus every TV-show included in the data for this project) has an official page on Facebook. Facebook users like this page for two reasons: first, users who like a page see status updates for that TV-show in their personal feed; second, users like pages to showcase their personal interests to their Facebook friends (Bachrach et al., 2012). A study by Ungapen (2013) indicates that persons who have liked a Facebook page for a show or followed the Twitter account of a show are 75% more likely to actually watch that show.


The number of likes is retrieved through the Facebook Graph API (Weaver and Tarjan, 2013). Through the API, a search is made for a page with the title of the TV-show. This search returns a JSON response containing multiple pages, from which the official show page has to be selected. To do this, the JSON is first filtered to only include pages that belong to the category "Tv show". Then, the page that matches the exact title of the TV-show is selected. If there is no exact match, the first result is selected. The number of likes for this page is retrieved and stored.
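A minimal sketch of this selection step, assuming the search endpoint and response shape of the Graph API as it existed in 2014; the endpoint, parameters and token handling are illustrative rather than taken from the application's source:

```ruby
require 'httparty'

def facebook_likes(show_title, access_token)
  # Search the Graph API for pages matching the show title.
  response = HTTParty.get('https://graph.facebook.com/search',
                          query: { q: show_title, type: 'page',
                                   access_token: access_token })
  pages = response.parsed_response['data'] || []

  # Keep only pages categorized as TV shows.
  tv_pages = pages.select { |page| page['category'] == 'Tv show' }
  return nil if tv_pages.empty?

  # Prefer an exact title match; otherwise fall back to the first result.
  page = tv_pages.find { |p| p['name'] == show_title } || tv_pages.first

  # Fetch the like count for the chosen page.
  details = HTTParty.get("https://graph.facebook.com/#{page['id']}",
                         query: { access_token: access_token })
  details.parsed_response['likes']
end
```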

The only way to evaluate the accuracy of this method was to check all pages manually. Three out of 48 shows had an incorrect number of likes, caused by the naming of the official pages. The official page for the show Salem is called "Salem — WGN America", while the page that was retrieved was an unofficial page called "Salem". In the normal Facebook interface, official pages carry a verified checkmark to indicate to users that the page is the official fanpage. However, this information is not yet available through the Graph API. Until it is added, the risk of storing an incorrect number of likes remains. A temporary solution to this problem could be to manually edit the number of likes for the affected pages, but this would contradict the autonomous functionality of the application. Another solution would be to exclude these shows completely from the dataset. This would not be a permanent solution, because new shows that might also return an incorrect number of likes are automatically added to the application. Therefore the choice was made to continue using this approach while keeping in mind that a small percentage of the data is incorrect.

• The user score for the show on Trakt.TV.

Trakt.TV allows users to rate TV-shows in the same way that they can rate episodes. TV-shows generally have a higher number of votes than individual episodes.

• The number of watchers of a show on Trakt.TV.

This is the number of unique users that have marked at least one episode of a certain show as watched on Trakt.TV. A user who has only marked the first episode of a show as watched is thus counted as a watcher, just like someone who has marked all episodes as watched.

3.1 Data Presentation

The front-end of the application was created using the Twitter Bootstrap framework. Bootstrap offers many pre-styled components that do not require any additional CSS. Another advantage of Bootstrap is that it can be used to create a responsive design. Figures 2 and 3 demonstrate the different designs when visiting the application in a mobile or a desktop browser. Every page of the application is mobile-friendly except for the Charts pages. Because Bootstrap is so widely used, I did not want to use its default theme. The theme Flatly, which is freely available from the site Bootswatch, was applied to give the application a less standard look.

The application opens with the homepage, shown in figure 2. The homepage lists all shows that are currently in the database. A search field allows the user to search for a TV-show.

Figure 2: The homepage as viewed on a mobile device.

Each TV-show has its own page. This page, shown in figure 3, shows information about the TV-show such as the network, the runtime and the latest web-scraped data for the show. The bottom of the page lists all episodes, grouped by season.

Episodes also have their own pages. The episode page, shown in figure 4, shows information about the episode such as the description, the first air date and the web-scraped episode stats. Stats are only collected for episodes that have already aired. Unaired episodes show a message that there are no stats available yet.


Figure 3: Show page on a desktop.

Figure 4: Episode page on a desktop.

To analyze the collected data, several scatter plots are created on the charts page, shown in figure 5. These charts are automatically updated every 24 hours and visualize the relation between the data (e.g. the torrent downloads for an episode) and the demography or total rating. A periodical task collects all information that needs to be visualized and saves it in a Tab Separated Values (TSV) file. The values from this file are then loaded into JavaScript and visualized using the D3 JavaScript library (Bostock et al., 2011) in combination with the Dimple D3 API. D3 can be used to create tables or visualizations in JavaScript. Creating a simple scatter plot is relatively complicated using the standard D3 library, so Dimple was used to simplify this process. Dimple can create simple visualizations from a TSV file without much code.

Figure 5: Chart page showing a visualization plotting the Trakt Views for the previous episode against the demography rating.


4 Forecasting Model

The final dataset that was used to create the forecasting model consisted of 109 episodes from 48 different TV-shows. I chose to use an artificial neural network as the forecasting model and to train it through supervised learning. An artificial neural network (ANN) is a computational model inspired by biological neural networks (Yegnanarayana, 2009); the first such model was developed by Rosenblatt in 1958.

An artificial neural network consists of nodes and layers. The first layer is called the input layer and represents your input data. The last layer is called the output layer and represents your output data. The number of nodes in these layers depends on the number of input or output variables you supply to the network. Nodes in consecutive layers are linked to each other and each link is given a weight. When an input node is given a value, it transmits that information to all nodes it is connected to, taking into account the weight of the connection between the nodes to determine the exact value that is transmitted. Most neural networks do not consist of just input and output layers: new layers of nodes, called hidden layers, are put between them. These layers and nodes do not produce an actual output, but are used to help calculate the output. Figure 6 shows a simple neural network consisting of three input nodes, one hidden layer of four nodes and two output nodes. Neural networks can be used for both classification and, as in this case, regression analysis (Specht, 1991).
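This weighted transmission can be written compactly. A minimal formulation of the value computed by a single node $j$, using standard notation that is assumed here rather than taken from the thesis ($x_i$ for the incoming node values, $w_{ij}$ for the connection weights, $b_j$ for a bias term and $f$ for the activation function):

$$a_j = f\Big(\sum_i w_{ij} x_i + b_j\Big)$$

In a feedforward network this computation is applied layer by layer until the output layer is reached.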

Figure 6: Example neural network.

Supervised learning is an approach to machine learning where all training data has a known output (Rasmussen, 2006). In this case, the known output is the Nielsen rating for the episode. A trained forecasting model can forecast a value when it is provided with a set of input variables.

The neural network implemented in this application is a multilayer feedforward network, which is the most common type of neural network according to Svozil et al. (1997). In a feedforward neural network, all information moves in one direction, towards the output layer. This means that the network does not contain loops or cycles like other types of neural networks, such as recurrent neural networks, but that all connections point in one direction (Hornik et al., 1989). The information from one layer is transmitted to the nodes in the next layer, and so forth.

The specific implementation of the network is done using the Fast Artificial Neural Network library (Nissen, 2003). This is a library written in C that allows fast and easy implementation of multilayer feedforward neural networks. It has been ported to many languages, including Ruby, which is the port I used. The ruby-fann gem allows you to set up a neural network by specifying the number of input values, the number of output values, the number of hidden layers and the number of nodes these hidden layers should contain. This network can then be trained by supplying input variables along with their output variables. After the network has been trained, you can supply input data without output data and run the network. The output value is your result.
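A minimal sketch of this workflow with ruby-fann; the data values are invented for illustration, and the network dimensions match the ones chosen below (four inputs, one hidden layer of three nodes, one output):

```ruby
require 'ruby-fann'

# Normalized example inputs (four variables per episode) and the known
# demo ratings as outputs; the values are illustrative, not real data.
inputs  = [[0.2, 0.5, 0.1, 0.7], [0.8, 0.3, 0.6, 0.4]]
outputs = [[0.35], [0.62]]

train = RubyFann::TrainData.new(inputs: inputs, desired_outputs: outputs)

# Four input nodes, one hidden layer of three nodes, one output node.
fann = RubyFann::Standard.new(num_inputs: 4,
                              hidden_neurons: [3],
                              num_outputs: 1)

# Arguments: training data, max epochs, epochs between reports,
# desired mean squared error.
fann.train_on_data(train, 1000, 100, 0.001)

# Running the trained network on new input yields the forecast.
forecast = fann.run([0.4, 0.4, 0.3, 0.5])
puts forecast.first
```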

Heaton (2008) writes that for most practical problems, there is no reason to use more than one hidden layer. He also indicates that the number of hidden neurons should be between the size of the input layer and the size of the output layer. Using those rules of thumb, I decided to use a single hidden layer consisting of three nodes in my model.

4.1 Initial Network

The first version of my forecasting model consisted of a single neural network with a predefined set of input variables. Based on the scatter plots created from the collected data, I manually selected four input variables: torrent downloads for the previous episode, show user score on Trakt.TV, viewers for the previous episode on Trakt.TV and the number of Facebook likes.

The database is queried to retrieve episodes along with the values their associated entries had 24 hours before the original air date. For example, when retrieving the data for the Game of Thrones episode "Oathkeeper", which aired at 2014-04-28, 01:00 CEST, the epstats, ratings and showstats entries that were created at that time are retrieved. Forecasts are only made for episodes that have all those entries available, so the dataset only includes episodes whose previous episode aired after the application was up and running. If the episode to be forecast is the first episode of a season, the stats for the last episode of the previous season are retrieved. The retrieved episodes are stored along with all associated variables in a TSV file, which is simultaneously used as the data source for the charts described in the data presentation section and as the dataset for the model.

All episodes in the dataset are shuffled and divided into a training set and a test set. I chose to use an 80/20 percent training/test split. The training set is used to train the neural network and the test set is used to analyze its accuracy. Training is done by providing the network with the input data and the output data for each episode one by one. The FANN documentation recommends normalizing all input and output data to values between 0 and 1, so this is done before the data is fed to the network. With every episode, the network adjusts itself based on the supplied data. After the network is trained, the input data for the test set is sent to the network, which then provides a forecast for each episode.
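A sketch of the shuffle, split and min-max normalization steps; the data layout (input variables followed by the demo rating) and the values are assumptions for illustration:

```ruby
# Each row holds the input variables followed by the demo rating.
dataset = [
  [1200.0, 85.0, 0.51], [430.0, 72.0, 0.22],
  [980.0,  90.0, 0.47], [150.0, 65.0, 0.11],
  [2100.0, 95.0, 0.78]
]

rows = dataset.shuffle

# 80/20 training/test split.
split_index = (rows.size * 0.8).round
training = rows[0...split_index]
test     = rows[split_index..-1]

# Min-max normalization of a single value to the 0..1 range, using the
# minimum and maximum observed for that variable.
def normalize(value, min, max)
  (value - min) / (max - min)
end

mins = training.transpose.map(&:min)
maxs = training.transpose.map(&:max)
normalized = training.map do |row|
  row.each_with_index.map { |v, i| normalize(v, mins[i], maxs[i]) }
end
```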

The forecast results for the test set are saved in a separate TSV file. As an indicator of the accuracy of the forecasts, the mean squared error for the test set is calculated by comparing the forecasts to the actual ratings. The results are presented within the application by showing the mean squared error and a scatter plot with the forecast rating on the Y-axis and the actual demo rating on the X-axis.
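Written out, with $\hat{y}_i$ the forecast and $y_i$ the actual demo rating for episode $i$ of the $n$ test episodes, this error measure is the standard mean squared error:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$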

It quickly became apparent that this approach led to a lot of variance in the results, as shown in table 1: the mean squared error varied widely across runs. While this approach might have worked with a larger dataset, in this case a technique had to be used to reduce the variance in the results.

Run   Mean Squared Error
1     0.5830
2     1.3267
3     0.5231
4     0.3815
5     0.4717
6     0.5955
7     0.2362
8     0.4341
9     0.6033
10    0.2913

Table 1: Results of the initial network with predefined input variables.

4.2 Reducing Variance with Bootstrap Aggregation

To reduce the variance between different runs and to compensate for the small dataset, bootstrap aggregation (bagging) was applied. The bagging technique consists of generating N training sets from the existing training set and generating N versions of the model (in this case the neural network) using those training sets. The individual models combine to form an aggregated model (Breiman, 1996). Bauer and Kohavi (1999) determined that bagging reduces the variance of unstable methods.


The N training sets are created using the bootstrapping technique. From the existing training set, items are sampled with replacement and added to a new training set, which may thus contain duplicate items. Table 2 shows an example of the bootstrapping technique in practice (a code sketch of the full bagging procedure follows the table). These N new training sets are used to train N new neural networks. The entire test set, which remains the same, is run through all these networks. The individual networks each produce a forecast for each episode in the test set; these forecasts are combined and an average is calculated. Finally, the mean squared error is calculated for these averages. In this application, I chose to create 1000 individual training sets and thus 1000 individual networks.

Original training set   1 2 3
New training set 1      1 2 2
New training set 2      3 1 1
New training set 3      2 1 3

Table 2: Bootstrapping example.
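A sketch of the bagging procedure around ruby-fann; build_network and the data layout are hypothetical helpers standing in for the application's own code:

```ruby
require 'ruby-fann'

# Hypothetical helper: trains one network on one bootstrapped training set.
# Each row holds the input variables followed by the demo rating.
def build_network(inputs, outputs)
  train = RubyFann::TrainData.new(inputs: inputs, desired_outputs: outputs)
  fann = RubyFann::Standard.new(num_inputs: inputs.first.size,
                                hidden_neurons: [3], num_outputs: 1)
  fann.train_on_data(train, 1000, 0, 0.001)
  fann
end

def bagged_forecasts(training, test_inputs, n_models)
  networks = Array.new(n_models) do
    # Sample with replacement to form a bootstrapped training set.
    sample = Array.new(training.size) { training.sample }
    build_network(sample.map { |row| row[0..-2] },
                  sample.map { |row| [row[-1]] })
  end

  # Average the individual forecasts for each test episode.
  test_inputs.map do |episode|
    forecasts = networks.map { |net| net.run(episode).first }
    forecasts.inject(:+) / forecasts.size
  end
end
```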

Table 3 shows that the bagging approach did not significantly reduce the variance compared to the non-bagged results: the mean squared error per run still varied wildly, from 0.1888 to 0.8205. Still, figure 7 indicates that the bagged approach does reduce the variance for a single training set. This histogram shows the distribution of the mean squared errors of the individual networks, which varies from 0.02 to 2.15 per network. By taking the mean over the individual networks, as the bagging approach does, we essentially eliminate this variance. We can thus conclude that the remaining variance stems from the division between test set and training set: depending on which episodes are in the training set and which are in the test set, the performance of the network varies. This needs to be addressed, but first, we need to find a better approach to decide which input variables to use to get the best forecasts.

Run   Mean Squared Error
1     0.1888
2     0.6170
3     0.5169
4     0.7871
5     0.2440
6     0.7281
7     0.5062
8     0.2396
9     0.8205
10    0.3048

Table 3: Results of the bagged network with predefined input variables.

Figure 7: Histogram showing the distribution of mean squared errors for individual networks.

4.3 Adding a Greedy Algorithm for Input Selection

The input variables for the initial neural network were manually selected based on intuition and the generated scatter plots. This is not the optimal approach, as the network should use the variables that produce the best forecast. To achieve this, I implemented a greedy algorithm: an algorithm that solves a problem by making the locally optimal choice at each stage (Black, 2005).

The algorithm starts by selecting the best individual input variable. This is done by training and evaluating the network, including the bagging method, for each individual input variable: networks are created and the average mean squared error for those networks is calculated. Using this method, the application can determine the accuracy of a single variable or a combination of variables in the form of the mean squared error. The input variable with the lowest mean squared error is selected.

Next, the network is trained and evaluated with two variables. The first variable is the best individual variable and each remaining variable is tested in combination with it. If any of the networks with two input variables results in a lower mean squared error, that combination of input variables is now considered the best and a third variable will be added. This process is repeated until adding another variable does not improve the mean squared error or the maximum number of variables, in this case seven, is reached. An example of this approach is shown in Table 4, and a code sketch of the selection loop follows the table.


Input variables: 1, 2, 3

Single variables    Mean Squared Error
1                   0.6
2                   0.5
3                   0.4

Two variables       Mean Squared Error
3, 1                0.33
3, 2                0.4

Three variables     Mean Squared Error
3, 2, 1             0.45

Best input variables: 3, 1

Table 4: Greedy algorithm example.
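A sketch of this forward selection, assuming an evaluate helper that trains the bagged networks for a given set of variables and returns their average mean squared error; the helper is hypothetical, but the selection logic follows the description above:

```ruby
# Hypothetical: evaluate(variables) trains the bagged networks using only
# the given input variables and returns the average mean squared error.
def greedy_selection(all_variables, max_vars)
  best_set = []
  best_mse = Float::INFINITY

  while best_set.size < max_vars
    candidates = all_variables - best_set
    break if candidates.empty?

    # Try each remaining variable in combination with the current best set.
    scored = candidates.map { |var| [var, evaluate(best_set + [var])] }
    var, mse = scored.min_by { |_, m| m }

    # Stop when adding another variable no longer improves the error.
    break unless mse < best_mse

    best_set << var
    best_mse = mse
  end

  [best_set, best_mse]
end
```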

After the best input variables have been selected, the network is trained again using those variables. The chosen input variables are shown along with a scatter plot and the mean squared error on the forecast page, as seen in figure 8.

Figure 8: The forecast page within the application, showing the forecasted rating plotted against the actual rating.

This approach lowered the mean squared error substantially. The results of running the complete forecasting algorithm up to this point ten times are shown in Table 5. However, the best variables are different for each run, which indicates that overfitting is taking place. A learning algorithm is overfitted when it performs well on the test data but does not perform well on new, unknown data. In this case, certain input variables might be able to forecast the ratings for the episodes in the test set well, but when the test set is reshuffled, other variables perform better.

Run   Mean Squared Error   Best Variables
1     0.2202               fblikes, episodeuserscore, showuserscore
2     0.1979               fblikes, showviews, torrentdls
3     0.2298               showviews, showuserscore, episodeuserscore
4     0.2929               showviews, episodeviews, episodeuserscore
5     0.2113               fblikes, torrentdls, showviews, showuserscore
6     0.2551               showviews, showuserscore
7     0.8041               fblikes, episodeuserscore
8     0.3416               showviews, episodeuserscore
9     0.3585               showviews, episodeviews, torrentdls, episodeuserscore, showuserscore
10    0.4089               episodeviews, fblikes, episodeuserscore

Table 5: Forecasts using the greedy algorithm.

To verify this, I restructured the data into a training set, a test set and a validation set. The model is trained and tested using the training and test sets, and its performance on unknown values is then tested using the validation set. I used a 50/25/25 training/test/validation split. Table 6 shows that the mean squared error on the validation set is higher in nine out of ten runs. This meant that overfitting was definitely taking place and that another approach had to be taken to retrieve the best input variables.

Run   Test set MSE   Validation set MSE
1     0.3476         0.6265
2     0.2007         0.6925
3     0.4741         0.6004
4     0.0999         0.7304
5     0.2443         0.4423
6     0.2086         0.4336
7     0.1989         0.6146
8     0.2812         0.5806
9     0.7834         0.5595
10    0.2719         0.7060

Table 6: Forecasts using a test set and a validation set.

The new approach consisted of running the complete algorithm, including bagging and the greedy algorithm, 1000 times. Using this method, the best variables for 1000 different splits between training set and test set were calculated. The top ten most frequent combinations of best variables are shown in Table 7.

Place   Variable combination                       Frequency
1.      showviews                                  289
2.      fblikes, showviews                         288
3.      fblikes                                    193
4.      showviews, fblikes, torrentdls             73
5.      showviews, torrentdls                      65
6.      fblikes, torrentdls                        44
7.      torrentdls                                 30
8.      episodescore, fblikes, showviews           1
9.      episodescore, episodeviews, fblikes        1
10.     episodeviews, episodescore                 1

Table 7: Top ten most frequent combinations of best variables.

The number of unique watchers of a show on Trakt.TV (showviews) turned out to be the best input variable in the dataset for forecasting the demo Nielsen rating, which was surprising, as I did not include this variable in the initial selection of inputs. 995 out of 1000 combinations consisted solely of one or more of the following variables: showviews, fblikes and torrentdls.

4.4 Evaluating the Model

Having found the best input variable, I could finally test the performance of the forecasting model. This was done by executing the following procedure 1000 times (a code sketch of this evaluation loop follows the list):

1. Shuffle the dataset.

2. Split it into a training and test set with an 80/20 split.

3. Generate 100 new training sets out of the training set using bagging.

4. Create a neural network for each of these 100 training sets.

5. Let the 100 neural networks each forecast the rating for each episode in the test set.

6. Calculate the average of the forecasts for each episode.

7. Calculate the mean squared error for all forecasts.
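A sketch of this outer evaluation loop, reusing the hypothetical bagged_forecasts helper sketched in the bagging section; the final square root converts the mean of the mean squared errors into rating points:

```ruby
# Assumes `dataset` holds normalized rows [inputs..., demo_rating] and
# that bagged_forecasts is the helper sketched earlier.
mses = Array.new(1000) do
  rows = dataset.shuffle
  split = (rows.size * 0.8).round
  training, test = rows[0...split], rows[split..-1]

  forecasts = bagged_forecasts(training, test.map { |r| r[0..-2] }, 100)
  actuals   = test.map { |r| r[-1] }

  # Mean squared error over the averaged forecasts.
  errors = forecasts.zip(actuals).map { |f, a| (f - a)**2 }
  errors.inject(:+) / errors.size
end

mean_mse = mses.inject(:+) / mses.size
puts Math.sqrt(mean_mse) # average forecast error in rating points
```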

This resulted in 1000 mean squared errors, one for each run, varying from 0.29 to 1.74. The distribution of the values is plotted in figure 9. To conclude the whole process, the mean of all the mean squared errors is calculated: 0.429. This value is squared, however, and thus does not represent actual rating points. To solve this, we take the root of the mean squared error:

$$\sqrt{0.429} \approx 0.655$$

Figure 9: Histogram showing the distribution of mean squared errors for each execution of the described procedure.

5 Discussion

Despite the small dataset, the results indicate that the application is able to forecast demo Nielsen ratings through the use of a neural network, with an average forecast error of 0.655 Nielsen "demo" rating points. The number of users on Trakt.TV that have marked one or more episodes of a certain TV-show as watched is the best indicator for forecasting the Nielsen "demo" rating. Other good indicators are the number of torrent downloads for the previous episode and the number of Facebook likes, but these were not used in the final calculation. The application will continue to collect data and calculate the best combination of input variables, which might change in the future.

The previous study by Danaher and Dagger (2012) forecasted the previously discussed total Nielsen rating instead of the demo Nielsen rating. The total rating for all episodes in the dataset is on average 3.331 times higher than the demo rating, and higher values lead to a higher average forecast error. To compare the two results, we therefore multiply the forecast error for the demo rating by 3.331. This results in an average forecast error of 2.181, which is higher than the forecast error of 1.08 achieved by the model of Danaher and Dagger (2012). We do need to keep in mind that that study used a much larger dataset as well as techniques to model the total number of viewers, which this application does not.

The analyzed TV-shows only air once per week and most seasons consist of 13 or 24 episodes. This means that the collected data covers at most half a season. To get improved forecasts, the application will have to run for a full year; in that year, most TV-shows will have aired at least one complete season. Fortunately, the application is designed to keep running without extra user input. It will automatically add and keep track of new TV-shows, seasons and episodes. This makes it easy to evaluate whether the forecasts are improving and whether the best input variables change. When the dataset gets closer in size to the one used by Danaher and Dagger (2012), it will be interesting to compare both models again.

It would also be interesting to start collecting data related to the total number of viewers at a certain time, as in the Danaher and Dagger (2012) model. Currently, the application forecasts a number of viewers based on historical data, but there are more factors that should be considered. For example, an episode might air on a national holiday, such as Memorial Day in the U.S. This could mean that fewer people watch TV that night, leading to lower ratings. This is currently not reflected in the forecast, because the application looks solely at input data from the previous episode and at data such as Facebook likes that do not change for a holiday. Other data that might be valuable to collect are weather forecasts for the airtime of the episode.

There is also data that is already being collected but is not actually used in the forecasting model. The time of day at which a program airs influences its ratings (Danaher et al., 2011): a show airing at 21:00 will attract more viewers on average than a show airing at 04:00 in the morning. The total number of people watching television also differs per day of the week. On average, people watch more television on Sunday than on Tuesday, so the forecasted rating for a show airing on Sunday should be higher than for a show airing on Tuesday.

The algorithm could also be improved by taking into account which other programs are airing at the same time. For example, if Mad Men and Game of Thrones both air on Sunday at 21:00, a viewer who might normally watch both shows can only view one. If a Game of Thrones episode were to air at the same time as a rerun of a show with low ratings, the ratings for the Game of Thrones episode might be higher. Program ratings are also influenced by so-called lead-ins. A lead-in is the program that airs before a certain program. A certain percentage of the audience keeps watching the same channel, transferring part of the audience from one program to the next. In the model, a show airing after a show that is expected to have high ratings should have its forecasted audience increased. Implementing these improvements to the algorithm, adding new data sources and combining these with the data that is already being collected could greatly improve the forecasts.

But even in its current incarnation, the application is useful to television networks and their advertisers. It is able to give a forecast for the most important audience measurement, the demography rating, with an error of 0.65 rating points, based solely on data that is freely available on the internet. The application is self-sustaining and will automatically add and monitor new TV-shows. Based on the new data, it will adjust the composition of the input variables and improve the forecasting model. This way, the application remains a future-proof tool for audience measurement.


References

Adalian, J. (2013). What networks can learn from Breaking Bad's ratings explosion. http://www.vulture.com/2013/08/lessons-from-breaking-bads-ratings-explosion.html. Accessed: 2014-05-27.

Bachrach, Y., Kosinski, M., Graepel, T., Kohli, P., and Stillwell, D. (2012). Personality and patterns of Facebook usage. In Proceedings of the 3rd Annual ACM Web Science Conference, pages 24–32. ACM.

Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139.

Black, P. E. (2005). Greedy algorithm. http://www.nist.gov/dads/HTML/greedyalgo.html. Accessed: 2014-06-22.

Bostock, M., Ogievetsky, V., and Heer, J. (2011). D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Danaher, P. and Dagger, T. (2012). Using a nested logit model to forecast television ratings. International Journal of Forecasting, 28(3):607–622.

Danaher, P. J., Dagger, T. S., and Smith, M. S. (2011). Forecasting television ratings. International Journal of Forecasting, 27(4):1215–1240.

Ernesto (2013). Game of Thrones most pirated TV-show of 2013. http://torrentfreak.com/game-of-thrones-most-pirated-tv-show-of-2013-131225/. Accessed: 2014-05-27.

Ernesto (2014). Top 10 most popular torrent sites of 2014. http://torrentfreak.com/top-10-popular-torrent-sites-2014-140104/. Accessed: 2014-05-26.

Facebook (2014). The Simpsons. https://www.facebook.com/TheSimpsons. Accessed: 2014-05-17.

Gensch, D. and Shaman, P. (1980). Models of competitive television ratings. Journal of Marketing Research (JMR), 17(3).

Heaton, J. (2008). Introduction to Neural Networks with Java. Heaton Research, Inc.

Herrman, J. (2011). Why Nielsen ratings are inaccurate, and why they'll stay that way. http://splitsider.com/2011/01/why-nielsen-ratings-are-inaccurate-and-why-theyll-stay-that-way/. Accessed: 2014-07-02.

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Jones, S. and Fox, S. (2009). Generations online in 2009. Pew Internet & American Life Project, Washington, DC.

Market, T. C. (2014). The Cloud Market: EC2 statistics. http://thecloudmarket.com/stats. Accessed: 2014-05-24.

Nagourney, E. (2013). Why don't advertisers care about me anymore? Accessed: 2014-06-08.

Nissen, S. (2003). Implementation of a fast artificial neural network library (FANN). Report, Department of Computer Science, University of Copenhagen (DIKU), 31.

Patelis, A., Metaxiotis, K., Nikolopoulos, K., and Assimakopoulos, V. (2003). FORTV: Decision support system for forecasting television viewership. Journal of Computer Information Systems, 43(4):100–107.

PWC (2014). Global entertainment and media outlook: 2013-2017. http://www.pwc.com/gx/en/global-entertainment-media-outlook/index.jhtml. Accessed: 2014-05-17.

Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

Specht, D. F. (1991). A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568–576.

Svozil, D., Kvasnicka, V., and Pospichal, J. (1997). Introduction to multi-layer feed-forward neural networks. Chemometrics and Intelligent Laboratory Systems, 39(1):43–62.

Trakt.TV (2014). Statistics. http://trakt.tv/statistics. Accessed: 2014-05-17.

Ungapen, V. (2013). When networks network: TV gets social. http://vimninsights.viacom.com/post/61773538381/when-networks-network-tv-gets-social-in-our. Accessed: 2014-06-02.

Wallenstein, A. (2013). Nielsen reverses decline in U.S. TV homes. http://variety.com/2013/tv/news/nielsen-us-household-number-estimate-increase-1200471668/. Accessed: 2014-07-02.

Weaver, J. and Tarjan, P. (2013). Facebook linked data via the Graph API. Semantic Web, 4(3):245–250.
