Now-casting influenza-like illness cases based on online user behaviour and webpage content



Now-casting influenza-like illness cases based on online user behaviour and webpage content

Submitted in partial fulfillment for the degree of Master of Science

David C.M. Langerveld

10685456

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2017-09-06

             Internal Supervisor   External Supervisor   3rd Supervisor
Title, Name  Dr Ilya Markov        Aram Zegerius         Dr Thomas Mensink
Affiliation  UvA, FNWI, IvI        solvo                 UvA, FNWI, IvI


Now-casting influenza-like illness cases based on online user behaviour and webpage content

D.C.M. Langerveld

ABSTRACT

Influenza-like infections (ILI) can infect up to 3 million people each year in the Netherlands alone, causing further health problems and even death for thousands of people. In order to combat the spread of ILI through the country, accurate forecasting models can help health care providers deploy vaccinations and other (preventive) measures faster. Using the collected data on user behaviour on the solvo online health platform as input, and the historic records of the number of ILI-cases in the Netherlands provided by Nivel as ground truth, a system is proposed which extracts additional features from the analytics data based on keywords provided by domain experts. The proposed model is a more accurate predictor of ILI-cases than auto-regressive methods, and hints at the potential for further improvement towards even better and constant ILI monitoring.

ACM Reference format:

D.C.M. Langerveld. 2017. Now-casting influenza-like illness cases based on online user behaviour and webpage content. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17),10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

It is estimated that 1.5 to 3 million people get infected with influenza-like infections (ILI) each year in the Netherlands alone [18], (in)directly leading to an average of two thousand deaths per year [2]. These numbers place ILI among the top five afflictions in terms of incidence. This is widely acknowledged by governments, and since the pandemic of 2009 public health agencies have increased their surveillance efforts for the disease [2], leading to two monitoring systems in the Netherlands: the Great Influenza Survey [9], and a system based on sentinel general practitioners who monitor their ILI consultations [19].

By monitoring the spread of ILI-cases throughout the country, governments aim to respond to the disease more timely and prevent additional infections. In order to reduce some of the disadvantages of each monitoring system [6], this research aims to develop a model that can now-cast the number of ILI-cases in the Netherlands based on the online behaviour of consumers on health-related websites. To do so, five regression models are selected and trained to predict the number of ILI-cases in the Netherlands. By studying their behaviour and performance an attempt is made to discover

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Conference’17, July 2017, Washington, DC, USA © 2017 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 https://doi.org/10.1145/nnnnnnn.nnnnnnn

how accurately each of these models can predict the number of ILI-cases based solely on web-analytics data. Contrary to most other studies, this data is provided by a health-related content provider (solvo, see: https://solvo.nl) rather than taken from the query logs of a search engine, so that knowledge regarding the content of webpages is added to the acquired data. Based on this content, additional features for the regression models are calculated. These features are then used as input for five different regression models: AdaBoost regression, RandomForest regression, ElasticNet regression, GradientBoosting regression and Ridge regression. These models were selected because of their reported effectiveness in similar (now-casting) problems, as is further described in the Method section of this paper.

Additionally, by adding information on the historic ILI spread, the added value of regression models derived from web-analytics data as opposed to regression based on historic data alone can be determined. In doing so, an argument can be made for applying web-analytics data in the now-casting of ILI-cases in the Netherlands, as it provides for more accurate models. In order to effectively investigate the feasibility of now-casting the number of ILI-cases in the Netherlands based on web-analytics data, the following research questions are posed:

RQ1 Which of the five selected models is the best predictor of ILI-cases in the Netherlands based on R2, MAE, and RMSE scores?

RQ2 Which input features are most informative in now-casting ILI-cases in the Netherlands?

RQ3 Can content and web-analytics based now-casting models be a suitable alternative to auto-regressive models?

RQ4 How accurately can the number of ILI-cases in the Netherlands be predicted based on web-analytics data collected from solvo's online health platform?

2 RELATED WORK

In 2009 Ginsberg et al. described how they could monitor the number of ILI-cases in different regions of the United States using the relative frequency of certain search queries on the Google search engine, and reported a strong correlation between their predicted values and the actual numbers of ILI-cases as reported by the US Centers for Disease Control and Prevention (CDC), around r = 0.95 [8], using a simple linear regression model. Similar research was performed in Sweden by Hulth, Rydevik and Linde [10], who used a partial least squares regression model to predict ILI-cases in Sweden during the 2005/2006 and 2006/2007 influenza seasons. They conclude that web-analytics data can be used to estimate the development of influenza activity in a country, although their predictive errors appear too large to accurately predict or now-cast the number of cases.


Eysenbach (2006) showed that the number of clicks on an advertisement campaign he launched during the 2004/2005 season correlated strongly with the number of ILI reports in the Canadian sentinel program, and that this correlation was stronger than the correlation between ILI-cases reported in previous weeks and current weeks [4]. While the work by Eysenbach (2006), Ginsberg et al. (2009) and Hulth et al. (2009) suggests that web-analytics data for specific search engine queries are strong predictors, it must be mentioned that the datasets applied in these studies were relatively small, encompassing only one or two influenza seasons. Lampos et al. [12] use a larger dataset in their research describing their work on improving Google Flu Trends. In this paper they report state-of-the-art results using autoregressive models combined with query-based systems such as those applied by Ginsberg and Hulth [8] [10].

Other authors describe using similar methodologies to detect cases of other diseases as well. By investigating the contents of search queries, Paparrizos, White, and Horvitz (2016, 2017) have shown that web searches have the potential to be used for screening for various kinds of carcinoma [16] [21]. Similarly, White et al. [22] worked in 2016 on the identification of adverse drug reactions using search log data, and show how search logs can be applied to detect new adverse drug reactions. The success of these papers directed towards the identification and detection of cases of diseases using search logs is evidence of quantitative relations between online behaviour and larger trends in the number of cases of such diseases.

However, with the exception of Eysenbach [4], this research focuses on applying logs obtained from search engines such as Google or Bing, and on correlations between trends in the data and the number of (ILI-)cases obtained from an outside source. In this paper, by contrast, data is not acquired from a search engine: instead, the logs from a single Dutch website have been used. This provides additional information regarding the contents of visited webpages, as well as the behaviour of users on these pages. Another addition is the consideration of the entire year, rather than only the influenza season as is done in most of the cited research, in order to work with more data and to test the possibilities for year-round surveillance.

3 DATA

This section is split into two parts: the first describes the data collected as input for the model, while the second discusses the ground-truths used to validate and test the models.

3.1 Input

Now-casting the number of ILI-cases in the Netherlands will be performed using data collected from pages on the solvo platform (https://solvo.nl). Solvo is the owner of three different health-related websites (gezondheidsplein.nl, ziekenhuis.nl and dokterdokter.nl) and is the largest online medical platform in the Netherlands. In this research, pages from dokterdokter.nl were ignored due to the different structure of that website, which did not allow for monitoring in the same way gezondheidsplein.nl and ziekenhuis.nl provide. By using the analytics data from gezondheidsplein.nl and ziekenhuis.nl, a large portion of the total Dutch health information market is covered, similarly to applying query logs from a large search engine, while much more information on the visited webpages can be collected.

Pages were selected by keywords. Solvo tags each of its webpages with keywords, and by selecting all keywords relevant to ILI, a set of relevant webpages can be formed. The selection of keywords was performed by domain experts employed by solvo, who have knowledge of both the medical field (and thus are aware of relevant symptoms, medicines, and other treatments) and the website itself (and thus know which pages exist and how users are directed through them), in order to maximise the relevance of the selected keywords and webpages. This resulted in a subset of 162 webpages originating from gezondheidsplein.nl and ziekenhuis.nl.

In Table 1 the collected metrics relevant to this research are listed and briefly explained. Each of these features was recorded per day and per individual webpage. In order to use these metrics as input for our model, each metric (pageviews, sessions, users, and averagetime) had to be aggregated per week and over all pages, as this is the frequency at which ILI-cases are recorded. Data was collected for the period starting in week 36 of 2014 and ending in week 20 of 2017, covering a total of 142 weeks. The websites were redesigned and relaunched shortly before this period, making the 2014/2015 influenza season the first season fully covered by these new versions; applying data recorded before this period could have unwanted effects on the now-casting performance of the selected models, and such data was therefore ignored.

metric       definition
date         date of record
item         unique identifier of webpage in the logs
pageviews    number of raw pageviews to page on date
sessions     number of sessions on that page on date
users        number of different users who viewed page on date
averagetime  average duration of session on page on date (seconds)

Table 1: Gathered page-metrics

In order to provide a better indication of the shape of the recorded data, Table 2 shows the average, minimum, and maximum values for each of the metrics after aggregating to weekly time-periods, as this is the form in which the data is fed to the models. In this table, all values are rounded to the nearest integer, whereas in the real data averagetime is not necessarily an integer, and can have decimals.

metric       average   min    max
pageviews    10570     3174   21480
sessions     5526      1173   13024
users        6807      2328   13905
averagetime  41        35     47

Table 2: Shape of recorded data: metrics are aggregated for all ILI-related pages per week
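As an illustration of the weekly aggregation described above, the following sketch rolls the daily per-page metrics of Table 1 up into weekly totals using pandas. The file name, column names and the choice to average averagetime (rather than recompute it from session durations) are assumptions, not details taken from the paper.

import pandas as pd

# Hypothetical daily export of the per-page metrics listed in Table 1.
df = pd.read_csv("solvo_ili_pages.csv", parse_dates=["date"])

# Group by ISO year and week so that week 36 of 2014 up to week 20 of 2017
# lines up with the weekly Nivel ILI reports.
iso = df["date"].dt.isocalendar()
weekly = (
    df.assign(year=iso["year"], week=iso["week"])
      .groupby(["year", "week"])
      .agg(pageviews=("pageviews", "sum"),
           sessions=("sessions", "sum"),
           users=("users", "sum"),
           averagetime=("averagetime", "mean"))  # assumption: averagetime is averaged
)
print(weekly.describe().round())  # summary statistics of the weekly metrics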


3.2 Recorded ILI-cases

Ground-truths on the number of ILI-cases in the Netherlands were provided by Nivel [15]. Nivel monitors the number of ILI-cases in the Netherlands using the sentinel general practitioners program [19] for the RIVM and the Ministry of Health, Welfare and Sport. As ground-truth, the number of ILI-cases per one hundred thousand Dutch citizens, as recorded by Nivel, was used. While data is also reported for smaller regions, only nation-wide numbers were considered for this research, in order to maximise the amount of available data. All considered data ran from week 36 of 2014 to week 20 of 2017, again for a total of 142 weeks. To give an idea of the shape of the data, Table 3 is provided. The number of cases per one hundred thousand citizens is again rounded to the nearest integer.

metric                 average   min   max
ili_inc (per 100,000)  50        6     150

Table 3: Shape of ground-truths: number of weekly ILI-cases

4 METHOD

Now that the data has been described, the features which will be fed to the models can be discussed in more detail. Section 4.2 then describes the regression models selected for further experiments and provides the reasoning behind the selection of these specific models.

4.1 Features

The main advantage of the data gathered from solvo's domains is the added information on webpage content. Therefore, next to the four previously defined features (pageviews, sessions, users and averagetime), another set of features is computed. For the webpages marked as relevant by the domain experts, the set of all keywords attached to these webpages was formed. Out of this set, the keywords which were used to label a minimum of 10 webpages within the subset were kept. For each of these keywords a score was computed based on the tf-idf weighting often used in information retrieval:

\[ \textit{tf-idf}_k = (1 + \log(tf_k)) \cdot \log\left(1 + \frac{N}{n_k}\right) \]

where k is a keyword within the selected set, tf_k is the total number of pageviews of webpages relating to keyword k, N is the total number of webpages in the subset, and n_k is the number of webpages in the set which relate to keyword k.

This altered tf-idf feature allows for comparing the relevance of a keyword in the activity on the webpages, reflecting whether webpages relating to a keyword were visited more frequently or less frequently in one week compared to another. These features also allow for comparison between the different keywords, giving insight into the relative importance of each keyword in the composition of the content users have looked at during a given week.
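A minimal sketch of this keyword feature computation for a single week is given below; the data structures (a dict of pageviews per page and a dict of keyword tags per page) are assumptions about how the solvo logs and tags could be organised, not the author's actual implementation.

import math

def keyword_tfidf(weekly_pageviews, page_keywords):
    """Modified tf-idf score per keyword for one week.

    weekly_pageviews: dict mapping page id -> pageviews in that week
    page_keywords:    dict mapping page id -> set of keywords tagging the page
    (both layouts are assumptions, not the layout of the solvo logs)
    """
    n_pages = len(page_keywords)                 # N: webpages in the ILI subset
    all_keywords = set().union(*page_keywords.values())
    scores = {}
    for k in all_keywords:
        pages_k = [p for p, tags in page_keywords.items() if k in tags]
        n_k = len(pages_k)                       # webpages tagged with keyword k
        tf_k = sum(weekly_pageviews.get(p, 0) for p in pages_k)  # pageviews for k
        if tf_k > 0:
            scores[k] = (1 + math.log(tf_k)) * math.log(1 + n_pages / n_k)
        else:
            scores[k] = 0.0                      # no views for this keyword this week
    return scores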

4.2 Regression Models

Based on relevant literature a selection of five different regression models was formed: ElasticNet, AdaBoost regression, Random Forest regression, Gradient Boosting, and finally Ridge regression. This subsection further argues for the selection of these five methods.

The primary reason to select ElasticNet regression is its application by Lampos et al. [12], who describe it as “a more robust generalization of Lasso” (p. 2). ElasticNet regularisation applies both λ1 and λ2 regularisation. This makes Lasso regularisation [20], as used by Lampos and Cristianini in 2012 [13] to now-cast events using Twitter, an extreme case of ElasticNet regularisation in which λ2 = 0. Similarly, the often used Ridge regression is the opposite extreme, in which λ1 = 0. As ElasticNet was described as an improvement over Lasso regularisation, the latter was excluded from this research. However, no such claim was made about Ridge regression, which is why it was included for further analysis.

Mathematically, the objective function for ElasticNet regression can be defined as:

\[ \arg\min_{w,\beta} \sum_{t=1}^{T} \left(w^{T} x_t + \beta - y_t\right)^2 + \lambda_1 \sum_{j=1}^{F} |w_j| + \lambda_2 \sum_{j=1}^{F} w_j^2 \]

where w holds the weights for all features x_t, β is the intercept, y_t the target variable (the ILI-rate), and F the number of features. From this, the corresponding objective functions for Lasso and Ridge regression are obtained by setting either λ2 or λ1 to 0.
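For reference, the sketch below shows how this family of penalties maps onto scikit-learn's ElasticNet, whose (alpha, l1_ratio) parameterisation corresponds to the λ1 and λ2 weights above up to the library's scaling of the squared-error term; the concrete values are placeholders.

from sklearn.linear_model import ElasticNet, Lasso, Ridge

# scikit-learn writes the penalty as
#   alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2,
# so lambda_1 corresponds to alpha * l1_ratio and lambda_2 to
# 0.5 * alpha * (1 - l1_ratio), up to the library's scaling of the squared error.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)        # both penalties active

lasso_like = ElasticNet(alpha=1.0, l1_ratio=1.0)  # lambda_2 = 0: the Lasso extreme
ridge_like = ElasticNet(alpha=1.0, l1_ratio=0.0)  # lambda_1 = 0: the Ridge-style extreme

# scikit-learn also ships the two extremes as dedicated estimators.
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=1.0)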

In a paper from 2001, Breiman described a set of models called Random Forests [1]. In this paper he argues that these models do not overfit, and concludes that they operate as accurate classifiers and predictors. In 2016, in an attempt to improve an anomaly detection system for energy usage in large-scale buildings, Random Forest regression was used and found to perform better than alternative models [14] [11]. While the dataset used in that paper had significantly more datapoints, similar features were used for a similar problem as in this paper, suggesting that Random Forest regression may accurately now-cast ILI-cases as well.

The AdaBoost.R2 algorithm which was selected for use in further experiments was developed and explained by Drucker in 1997 [3]. In that paper an algorithm is described which fits a regression model and offers a predicted target value, h_x, for each instance in the training data. The error between h_x and the actual target value y is then calculated, and used to create a new subset of the data which contains relatively more of these high-error instances. On this subset a new version of the model is trained in order to obtain better performance on previously difficult instances. This process is iterated until a satisfying result is obtained, or the process is terminated by reaching the maximum number of iterations.

By combining a number of models fitted on different subsets of the data, it is presumed that accurate predictions can be made for each subset, whereas a single model for the entire set would have failed to do so, simply by retaining subtleties that would have been lost when only considering the entire dataset. Further, by weighing the predicted values of each model based on the confidence in their predictions, a weighted average can be formulated as a final ensemble prediction for any instance in the data. For a more in-depth description of the algorithm the reader is referred to the 1997 paper by Drucker [3].

An interesting capability of the AdaBoost.R2 algorithm is that the individual regression models used to formulate h_x can be any regression model. Therefore, a selection of models was used for AdaBoost, three of which were decision-tree based regressors (Decision Tree regression, Random Forest regression, as explained above, and Extra Tree regression), and two of which were not (Linear regression, and ElasticNet, see above). Each of these models is available in the scikit-learn package [17], and default hyper parameters were used. By adding two models which were already selected in this paper, the effect of boosting on their effectiveness in predicting ILI-cases can be detected.
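A sketch of this setup with scikit-learn's AdaBoostRegressor (which implements AdaBoost.R2) is shown below; the number of boosting iterations is a placeholder, and on scikit-learn releases older than 1.2 the keyword is base_estimator rather than estimator.

from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

# AdaBoost.R2 as implemented by AdaBoostRegressor, with interchangeable weak
# learners; all weak learners use their default hyper parameters.
weak_learners = {
    "decision_tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(),
    "extra_tree": ExtraTreeRegressor(),
    "linear": LinearRegression(),
    "elastic_net": ElasticNet(),
}

boosted = {
    name: AdaBoostRegressor(estimator=learner, n_estimators=50)  # placeholder count
    for name, learner in weak_learners.items()
}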

Another boosting method is gradient boosting, defined, among others, by Friedman in 2001 [5]. Its essential working is similar to that of AdaBoost: by combining the results of weak learners it aims to build a strong ensemble predictor. However, the approach varies. Whereas AdaBoost combines the predicted values produced by each individual model into a final ensemble prediction, gradient boosting aims at obtaining a single model, by training a series of weak learners, usually tree-based, each expanding the initial weak model. In general, the model is then described by:

\[ F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m) \]

in which F_m is the ensemble model at step m, F_{m-1} is the model at the previous step, β_m is the assigned step size, and h(x; a_m) is a (tree-based) weak learner with parameters a_m intended to reduce the error of the previous model. It is easy to see how each consecutive model is an extension of its predecessor for m = {1, 2, ..., M}. In the specific algorithm applied here in order to predict ILI-cases, the weak learners are all tree-based, and the chosen criterion for the loss function is the mean absolute error.

Each of the aforementioned models is available in the scikit-learn Python library [17], and these versions of each respective algorithm were used in further experimentation.
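As an example of the scikit-learn implementation referred to here, the sketch below instantiates a gradient boosting regressor with an absolute-error objective; the remaining hyper parameters are illustrative, and older scikit-learn releases name this loss "lad" rather than "absolute_error".

from sklearn.ensemble import GradientBoostingRegressor

# Additive tree ensemble: F_m(x) = F_{m-1}(x) + beta_m * h(x; a_m),
# fitted stage by stage against an absolute-error objective.
gbr = GradientBoostingRegressor(
    loss="absolute_error",  # mean absolute error objective ("lad" before v1.0)
    n_estimators=100,       # number of boosting stages M (placeholder)
    learning_rate=0.1,      # shrinkage applied to each stage
    max_depth=3,            # depth of the tree-based weak learners
)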

5 EXPERIMENTAL SETUP

In order to evaluate the performance of the previously described models on the now-casting of ILI-rates in the Netherlands, a set of three experiments has been set up. The first experiment tunes the hyper parameters of each model and addresses RQ1. In experiments two and three the robustness of the models and the relative importance of each feature to each model are tested, in order to answer RQ2. To answer RQ3, the results of experiments two and three are used to further detail the differences in performance between auto-regressive models and content-based models. These experiments are not only intended to compare the different models to each other, but also to evaluate the feasibility of now-casting the number of weekly ILI-cases in general.

Before running the experiments, the entire dataset was split into separate train and test sets. Out of the initial set of 142 weeks 75% was used for training, and the remaining 25% was used for testing.

5.1 Experiment 1

The first experiment compares the five selected models on their performance with all features, meaning the four directly logged by solvo as well as the “keyword tf-idf” features described in section 4.1. For each regression model, an attempt is made to optimise its hyper parameters by applying an extensive grid search. For each parameter in the set of hyper parameters used by a given algorithm, a range of values is selected for testing. An example is the number of estimators used by AdaBoost.R2, which was set to the range n_estimators = {10, 20, 30, 50, 75, 100, 150}. The grid search algorithm then tries every combination of the values provided for the hyper parameters of a given model, by fitting and testing the regression model using K-fold cross-validation (k = 3) on the training set. The optimal combination of hyper parameters is then returned and used to fit a model on the entire training set. This model is then used to predict the number of ILI-cases in week q of the test set given the provided feature values for week q, and the prediction is compared to the actual value as recorded by Nivel. Model performance is recorded as R2, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) scores, which together allow for the comparison between regression models, as well as provide an idea of how close predictions get to real values.
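The grid search procedure could look roughly like the sketch below; the random placeholder data, the choice of a non-shuffled split and the scoring metric are assumptions, and only the n_estimators range is taken from the text.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data standing in for the 142 weekly feature vectors and ILI-rates.
X = np.random.rand(142, 46)
y = np.random.rand(142) * 150

# 75/25 split of the weeks; shuffle=False keeps them in chronological order
# (the paper does not state how the split was made).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

# Grid of candidate hyper parameters; only the n_estimators range is from the text.
param_grid = {"n_estimators": [10, 20, 30, 50, 75, 100, 150]}
search = GridSearchCV(AdaBoostRegressor(), param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# Refitting on the full training set happens automatically (refit=True by default).
y_pred = search.best_estimator_.predict(X_test)
print("R2  ", r2_score(y_test, y_pred))
print("MAE ", mean_absolute_error(y_test, y_pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, y_pred)))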

5.2 Experiment 2

The second experiment takes a small step towards testing the robustness of the models, as well as the relative importance of each feature, by performing a leave-one-out experiment.

Each model is trained on the full training set with the exception of a single feature. The fitted model then again predicts the number of ILI-cases for all weeks in the test set, providing R2, MAE and RMSE scores. This is repeated for every feature, so that each model is fitted a total of 46 times. The applied hyper parameters are the set of hyper parameters selected as best performing in Experiment 1.

The results of this experiment can help identify the information an individual feature contains about the number of ILI-cases which is not reflected in other features: where a feature with a lot of information is removed, the reported scores should decrease compared to those in Experiment 1. However, if the information in this feature is repeated in other features, the reported R2, MAE, and RMSE scores will remain similar. Furthermore, in the unlikely event that a single feature is a relatively good predictor of the training set but not the test set, thereby causing overfitting, the results of this experiment can shed light on this.
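A sketch of this leave-one-feature-out loop is given below, assuming the features are held in pandas DataFrames and that estimator carries the hyper parameters tuned in Experiment 1; the helper name and data layout are hypothetical.

from sklearn.base import clone
from sklearn.metrics import mean_absolute_error

def leave_one_out_scores(estimator, X_train, y_train, X_test, y_test):
    """Refit the tuned model once per omitted feature and report the test MAE.

    X_train and X_test are assumed to be pandas DataFrames with one column per
    feature; estimator carries the hyper parameters selected in Experiment 1.
    """
    scores = {}
    for feature in X_train.columns:
        model = clone(estimator)  # fresh copy with identical hyper parameters
        model.fit(X_train.drop(columns=[feature]), y_train)
        pred = model.predict(X_test.drop(columns=[feature]))
        scores[feature] = mean_absolute_error(y_test, pred)
    return scores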

5.3 Experiment 3

The last experiment is the inverse of Experiment 2, in which only one feature is taken into consideration each time. Each model is fitted on that single feature's instances in the training set, and its performance on predictions based on that single feature in the test set is reported. The selected hyper parameters are again the results from Experiment 1. In this experiment, two features are added: t_1 and t_2, the previously recorded ILI-cases, shifted by 1 and 2 weeks respectively. By adding simple regressions on the historic data, a simple (auto-)regressive baseline is established, against which the other models can later be compared. Since Nivel reports with a delay of 2 weeks, now-casting by third parties could, in a real-life scenario, only be performed on 2-week-old data, such as is contained in t_2. However, t_1 is also included for good measure.

The reported scores will show the information regarding the number of ILI-cases in the Netherlands contained in each respective feature, regardless of whether it is repeated in other features as well, in order to identify which features are relatively important. In future experiments, this information can be used to prune the number of features, which could help improve the now-casting model.
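Constructing the auto-regressive features could be as simple as shifting the Nivel series, as in the sketch below; the DataFrame layout and the column name ili_inc are assumptions.

import pandas as pd

def add_lag_features(weekly: pd.DataFrame) -> pd.DataFrame:
    """Add t_1 and t_2: the Nivel ILI-rate shifted by one and two weeks.

    weekly is assumed to be indexed by week and to contain a column "ili_inc"
    with the recorded ILI-cases per 100,000.
    """
    out = weekly.copy()
    out["t_1"] = out["ili_inc"].shift(1)  # rate reported one week earlier
    out["t_2"] = out["ili_inc"].shift(2)  # rate reported two weeks earlier
    return out.dropna(subset=["t_1", "t_2"])  # first two weeks have no history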

6 RESULTS

6.1 Experiment 1

The results of Experiment 1 can be found in Table 4, and show a large distinction between RidgeRegression and the other algorithms. While the R2 scores for each of the first four algorithms fall between 0.81 and 0.85, RidgeRegression performs slightly below 0, suggesting its predictions are always around the mean of the target value, the number of ILI-cases in the Netherlands. Since this phenomenon repeated itself during the other experiments, in which all scores (R2, MAE, and RMSE) remained identical regardless of which features were used, it was discarded from further investigation. It should be noted that in Experiment 3 not all scores were exactly identical. However, as the scores improved by only about one ILI-case at best, with the best scores being 0.050, 31.0457, and 38.551 for R2, MAE, and RMSE respectively, these small improvements do not make further analysis worthwhile in the context of this paper.

Out of the other four algorithms, Table 4 indicates AdaBoost to be the best performing algorithm. The simple regression model used for its weak learners in the set of optimal hyper parameters is the ExtraTreeRegressor provided in scikit-learn, a tree-based regression model intended solely for use in ensemble methods, and which relies on strongly randomised tree development [7].

6.2 Experiment 2

By omitting each feature once, Experiment 2 was intended to find the relative information loss to the model caused by each feature. Features which contain unique information relevant to the model will cause a larger drop in performance than other features. Part of the results of the experiment can be found in Table 5. The displayed features are the features that impacted the MAE score of AdaBoost.R2 most. The focus on this model is based on the results of Experiment 1, in which it performed best.

The first three selected features contained the least unique information, as removing them as input impacted the results least. The remaining three features had the most negative impact on performance, and thus contained the most unique information compared to the other features.

Interestingly, the MAE scores in Table 5 are not always higher than they are in Table 4, which contains the results of Experiment 1. This could be caused by a form of overfitting caused by some of the features, which would be avoided by omitting such features. For example, the first three features in Table 5 are not just the features which cause the least negative impact on performance for the AdaBoost.R2 algorithm; removing them actually improves the performance of the model.

However, the table also shows that removing a specific feature does not have similar effects on each algorithm. Removing the feature “bijholteontsteking” increases the MAE for AdaBoost and RandomForest compared to the results in Experiment 1, as is expected, yet decreases the MAE for ElasticNet and GradientBoosting, which is the exact opposite effect. This behaviour suggests each algorithm relies on different features for its models. What causes these differences is unclear, but it could potentially be caused by the relative similarity between features, making them reasonably interchangeable between models.

6.3 Experiment 3

Table 6 shows the MAE scores of each model when trained on a single feature. The selection represents the three features which obtained the lowest MAE score for the AdaBoost model, combined with the three features that resulted in the highest MAE scores for that same model. As t_1 and t_2 were included as features in this experiment as well, the resulting table shows that t_2 was not one of the three best single-input features for the AdaBoost model. However, it is worth noting that for the other three models this was the case.

The results presented in Table 6 show that the AdaBoost algorithm never benefits from applying only a single feature as input. For each of those features the MAE is higher than it was in Experiment 1, the result of which is shown in the first column of Table 6. While this generally holds true for the other algorithms, their reference score in the all-features column showing a lower MAE than the other columns, there is an exception for the t_1 feature. As it directly represents the number of ILI-cases of the previous week, it is expected to be a strong feature, and it can be seen to provide better RandomForest and GradientBoosting models than their reference scores, which included all features extracted from the solvo data. Although these MAE scores are still higher than the score obtained by the AdaBoost model on all features, they do approach it. The last three features (averagetime, “hoofdpijn” and “zesde ziekte”) show that the difference between certain features as single input is rather large. The worst performing feature, “zesde ziekte”, has an MAE score which is 4.9 times as high as the reference AdaBoost performance. While there are rather large differences between the performance of models on the features “hoofdpijn” and “zesde ziekte”, none can report MAE scores approaching the reference scores.

7 DISCUSSION

In this section the findings reported previously are used to answer the posed research questions, in order to place them back into the larger context of related literature.



Model name        R2      MAE     RMSE
AdaBoost.R2       0.847   8.410   11.963
RandomForest      0.815   11.142  16.017
ElasticNet        0.837   10.267  13.859
GradientBoosting  0.834   11.191  15.662
RidgeRegression   -0.001  32.426  39.907

Table 4: Results of Experiment 1: optimised model performance using all input features

Model name        all features  bacteriele en virusinfecties  spierpijn  griep   mexicaanse griep  koorts  bijholteontsteking
AdaBoost.R2       8.410         7.814                         7.926      7.942   8.834             8.999   9.129
RandomForest      11.142        11.931                        9.268      10.635  11.343            10.865  11.985
ElasticNet        10.267        9.898                         9.959      10.008  9.810             10.357  10.028
GradientBoosting  11.191        11.000                        10.594     11.120  11.081            11.776  10.393

Table 5: Adjusted performance (MAE) after removing a single feature from the input: three features with the least negative impact, three with the most negative impact

Model name        all features  “keelontsteking”  t_1     “spierpijn”  averagetime  “hoofdpijn”  “zesde ziekte”
AdaBoost.R2       8.410         12.161            12.275  12.373       37.090       38.558       41.037
RandomForest      11.142        13.394            9.311   13.643       30.572       31.832       36.729
ElasticNet        10.267        16.590            10.415  18.503       32.666       32.587       29.382
GradientBoosting  11.191        17.448            8.825   13.695       34.185       28.753       31.410

Table 6: Model performance on single-feature input: three lowest and three highest MAE scores

7.1 RQ1

The first research question regards the five chosen regression models, and which of these is the best predictor of the number of ILI-cases in the Netherlands. Based on the results of Experiment 1, where the hyper parameters of each model were tuned on all available features, AdaBoost.R2 appears to be the best performing regression model, with the highest R2 score (0.847), as well as the lowest MAE (8.410) and RMSE (11.963) scores. Furthermore, in Experiment 2 the lowest reported MAE score (7.814), which is the lowest reported in all experiments combined, was similarly obtained by the AdaBoost model.

Other models outperformed AdaBoost in certain settings, especially in Experiment 3, where some models reported lower MAE scores for single-feature models than AdaBoost. However, the single-feature regression problems the models were applied to in Experiment 3 are not exemplary of the real-life problem that this paper aspires to help solve; outside of academic pursuit it would make sense to use more than a single feature. A curiosity in these findings is the fact that AdaBoost does not apply one of the complex models already used as individual regressors in these experiments, but instead selects a tree-based ensemble method. Considering the literature, a model such as ElasticNet would appear to have been a better suited weak learner.

7.2 RQ2

In order to identify the most informative input features in now-casting ILI-cases in the Netherlands, Experiments 2 and 3 tested the decrease in performance when omitting a feature, and the performance when using a single feature as input. The results of these experiments are not yet conclusive on the best features. Experiment 3 clearly shows that t_1 is a very strong predictor for the number of ILI-cases, reporting similar or better performance than the reference scores in three out of four models. However, there is no such conclusive evidence for the other features. While removing “spierpijn” as a feature in Experiment 2 improved the performance of all models over their reference scores in Experiment 1, it is also one of the strongest single features in Experiment 3. This might indicate that the feature contains relevant information for the prediction of the number of ILI-cases, yet is not unique in doing so, and that (a combination of) other features supersedes it when applied.

The results of Experiments 2 and 3 do show that the metrics directly provided by solvo (pageviews, averagetime, users, and sessions) are not particularly strong features. They are never reported as valuable features in either experiment, while averagetime is documented as close to being the worst single-feature input. This suggests that the added tf-idf based features are better suited to this type of problem than the more general metrics.

That being said, a lot more work can be done to improve the feature engineering for these kinds of problems. While the results show the importance of content-based features, a smaller subset of these features would likely improve model performance. Similarly, the addition of features that combine content with information other than pageviews (such as the number of users, sessions and the time spent on a page) could likely improve the models as well.


7.3 RQ3

In Experiment 3, the performance of single-feature based models was tested, and two of those features were auto-regressive, containing the number of ILI-cases in the Netherlands with a one or two week delay. To fully answer this research question, Table 7 lists the MAE of each model using only these features, as well as their reference MAE obtained in Experiment 1. From this table it is evident that while t_1 allows for improved performance in three of the chosen models, this does not hold for the t_2 feature, in which the number of ILI-cases is reported with a two-week delay. Models based on the t_2 feature are always outperformed by their respective reference scores.

To further identify the value of content and web-analytics based models over auto-regressive models, Figures 1 and 2 have been added. These show scatter plots of the predicted against the recorded ILI-cases for the test set, together with the line x = y. In Figure 1 the model with the lowest reported MAE score out of all three experiments has been used, which is the AdaBoost.R2 model with the “bacteriele en virusinfecties” feature removed. By examining the cases in which the predicted and recorded values deviate most from the x = y line, an indication of the flaws of the model can be given. In this plot, most predictions, especially in the lower regions, are relatively close to the line. However, the errors become somewhat larger as the number of ILI-cases increases. Although this makes intuitive sense, considering that data reporting a high number of ILI-cases is relatively rare, it does show that this model is likely to be led astray as the influenza season reaches its peak.
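A minimal matplotlib sketch of how such a predicted-versus-recorded scatter plot with an x = y reference line can be produced is shown below; the arrays y_test and y_pred are assumed to hold the recorded and predicted weekly ILI-rates for the test set.

import matplotlib.pyplot as plt

def plot_predicted_vs_recorded(y_test, y_pred, title):
    """Scatter plot of predicted against recorded ILI-cases with an x = y line."""
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_pred)
    lim = [0, max(max(y_test), max(y_pred))]
    ax.plot(lim, lim, linestyle="--")  # x = y reference line
    ax.set_xlabel("recorded ILI-cases per 100,000")
    ax.set_ylabel("predicted ILI-cases per 100,000")
    ax.set_title(title)
    return fig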

In Figure 2 the best performing auto-regressive model is shown in the same way. This is the GradientBoosting model based solely on the t_1 feature, which, as can be seen in Table 7, had the lowest score out of all the auto-regressive models. In this figure, the errors appear more evenly spread over the entire range: most points in the lower ranges deviate further from the line than they do in Figure 1, and there is a large deviation for a single point which is heavily overestimated by the model. Considering that the model here relies solely on a single feature, a generally higher deviation is to be expected, as the model cannot use other variables to differentiate between certain instances in the data, whereas the AdaBoost.R2 model plotted in Figure 1 had more than 40 features at its disposal.

Considering Table 7 together with Figures 1 and 2, the results of Experiments 2 and 3 indicate that content and web-analytics based models can perform similarly to models based on auto-regressive features. That being said, this is not yet conclusive evidence of the superiority of such systems. For both types of models, additional data could heavily improve performance, as could additional scrutiny in selecting hyper parameters. Since solvo, as a young company, has been through a few design overhauls, only a limited amount of data was directly relevant to the current platform. In order to make the content-based models as accurate as possible, the decision was made to ignore all data collected before September 2014. Using more historic data could improve the models, and would be especially useful for the single-feature models using t_1 and t_2. To keep the comparison of models fair, these were now only trained on the same amount of data as all other models.

Figure 1: Content based model predictions vs. recorded ILI-cases

7.4 RQ4

To answer the final research question, the best performing (content-based) model has to be further analysed. This is the specific application of AdaBoost.R2 in Experiment 2, in which the “bacteriele en virusinfecties” feature was removed, with a mean absolute error of 7.814.

Since the mean absolute error is expressed in the same unit of measurement as the target variable, this indicates that, on average, a prediction is off by 7.814 ILI-cases per one hundred thousand individuals, while the average number of ILI-cases throughout the dataset is 49.654. Considering these statistics, predicting ILI-cases is not yet accurate enough to be reliable, although Figure 1 does indicate that most of these errors originate in weeks with a high number of ILI-cases during peaks in the influenza season. Future research could further investigate the cause of these errors, and perhaps adjust for them. As this research has shown the value of using content and web-analytics based features to predict the number of ILI-cases in the Netherlands, further research should investigate the importance of features, and continue the application of these features to improve the now-casting of the number of ILI-cases.

8 CONCLUSION

As a first step towards accurate now-casting of the number of ILI-cases in the Netherlands based on content information and online user behaviour data, this paper described the computation of a set of features based on the tf-idf models that are popular in Information Retrieval systems, and applied these features to the regression of ILI-cases. Five models were selected based on literature and previous work, in order to empirically try and compare varying approaches. By performing a few small experiments based on excluding varying sets of features, the resilience of each model as an accurate predictor was tested, as well as the merits of the developed set of features compared to simple auto-regression. This has shown that content and user behaviour based systems can predict the number of ILI-cases more accurately than auto-regressive models, with a decrease in MAE of 11.5%.



Model name        all features  t_1     t_2
AdaBoost.R2       8.410         12.275  15.010
RandomForest      11.142        9.311   13.394
ElasticNet        10.267        10.415  14.109
GradientBoosting  11.191        8.825   13.354

Table 7: Mean Absolute Error of the chosen models using single auto-regressive features

Figure 2: Auto-regressive based predictions vs. recorded ILI-cases

The AdaBoost.R2 algorithm, while two decades old, has been shown to be the best performing method, especially in combination with the ExtraTree regressor as its weak learner.

In the future, the effects of pruning the number of features fed to the models should be further studied, as it has been shown in Experiment 2 that removing features can have positive effects on the performance of a model. Secondly, similar tf-idf based features, which were now calculated using only a single collected metric (pageviews), could be computed using each of the four metrics (pageviews, users, sessions, averagetime) in order to increase the number of potentially interesting features; this would make the need for feature selection only more relevant.

When considering the work of Lampos et al. in both 2012 [13] and 2015 [12], the effects of improvements to regression models are quite clear, and combined with adjustments to the features there is significant potential for improvement, indicating the potential to now-cast ILI-cases in the Netherlands far more accurately than has been shown so far. In doing so, the spread of influenza strains can be monitored better, allowing for a better response to, and management of, this disease in our society.

9 ACKNOWLEDGEMENTS

I thank solvo, and especially Aram Zegerius, for offering me an internship and access to their data. Additionally, Ilya Markov and his never-ending questioning of everything I reported must be acknowledged, for without them my thesis would have never come to fruition.

Approval of usage. This study has been approved according to the governance code of the NIVEL Primary Care Database, under number NZR-00317.014.

Dutch law allows the use of electronic health records for research purposes under certain conditions. According to this legislation, neither obtaining informed consent from patients nor approval by a medical ethics committee is obligatory for this type of observational study containing no directly identifiable data (Dutch Civil Law, Article 7:458).

REFERENCES

[1] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[2] C. C. van den Wijngaard, L. van Asten, M. P. G. Koopmans, W. van Pelt, N. J. D. Nagelkerke, C. C. H. Wielders, ..., and M. Kretzschmar. Comparing pandemic to seasonal influenza mortality: Moderate impact overall but high mortality in young children. PLoS ONE, 7(2):e31197, 2012.

[3] Harris Drucker. Improving regressors using boosting techniques. In ICML, volume 97, pages 107–115, 1997.

[4] G. Eysenbach. Infodemiology: tracking flu-related searches on the web for syndromic surveillance. AMIA Annual Symposium Proceedings, 2006:244, 2006.

[5] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

[6] I. H. M. Friesema, C. E. Koppeschaar, G. A. Donker, F. Dijkstra, S. P. van Noort, R. Smallenburg, W. van der Hoek, and M. A. B. van der Sande. Internet-based monitoring of influenza-like illness in the general population: Experience of five influenza seasons in the Netherlands. Vaccine, 27(45):6353–6357, 2009. ESWI - Third European Influenza Conference.

[7] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[8] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.

[9] griepmeting.nl. Grote Griepmeting.

[10] A. Hulth, G. Rydevik, and A. Linde. Web queries as a source for syndromic surveillance. PLoS ONE, 4(2):e4378, 2009.

[11] Michael J. Kane, Natalie Price, Matthew Scotch, and Peter Rabinowitz. Comparison of ARIMA and random forest time series models for prediction of avian influenza H5N1 outbreaks. BMC Bioinformatics, 15(1):276, 2014.

[12] V. Lampos, A. Miller, S. Crossan, and C. Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports, 2015.

[13] Vasileios Lampos and Nello Cristianini. Nowcasting events from the social web with statistical learning. ACM Trans. Intell. Syst. Technol., 3(4):72:1–72:22, September 2012.

[14] D. Langerveld. Reducing energy waste: Advancing a cost-efficient anomaly detection model to reduce energy consumption. Master's thesis, Amsterdam University College, 2016.

[15] Nivel. Nivel Zorgregistraties. NZR-00317.014, 2017.

[16] John Paparrizos, Ryen W. White, and Eric Horvitz. Screening for pancreatic adenocarcinoma using signals from web search logs: Feasibility study and results. Journal of Oncology Practice, 12(8):737–744, 2016.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[18] RIVM. Influenza: Cijfers & context.

[19] C. Schweikhardt, R. A. Verheij, G. A. Donker, and Y. Coppieters. The historical development of the Dutch sentinel general practice network from a paper based into a digital primary care monitoring system. Journal of Public Health, 2016.

[20] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

[21] Ryen W. White and Eric Horvitz. Evaluation of the feasibility of screening patients for early signs of lung carcinoma in web search logs. JAMA Oncology, 3(3):398–401, 2017.

[22] Ryen W. White, Sheng Wang, Apurv Pant, Rave Harpaz, Pushpraj Shukla, Walter Sun, William DuMouchel, and Eric Horvitz. Early identification of adverse drug reactions from search log data. Journal of Biomedical Informatics, 59:42–48, 2016.

My email: david.langerveld@student.auc.nl. My supervisor's email: i.markov@uva.nl.
