Faculty of Economics and Business
Amsterdam School of Economics
Requirements for the thesis, MSc in Econometrics.
1. The thesis should have the nature of a scientific paper. Consequently, the thesis is divided
up into a number of sections and contains references. An outline can be something like (this
is an example for an empirical thesis; for a theoretical thesis, have a look at a relevant paper
from the literature):
(a) Front page (requirements see below)
(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions
(i) References (compulsory)
If preferred, you can change the number and order of the sections (but the order you
use should be logical) and the headings of the sections. You have a free choice of how to
list your references, but be consistent. References in the text should contain the names
of the authors and the year of publication, e.g. Heckman and McFadden (2013). In
the case of three or more authors: list all names and the year of publication for the
first reference, and use the first name with et al. and the year of publication for subsequent
references. Provide page numbers.
2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that
actually matters is that your supervisor agrees with your thesis.
3. The front page should contain:
(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty
as in the heading of this document. This combination is provided on Blackboard (in
MSc Econometrics Theses & Presentations).
(b) The title of the thesis
(c) Your name and student number
(d) Date of submission of the final version
(e) MSc in Econometrics
(f) Your track of the MSc in Econometrics
Shilling attacks on recommender systems
with implicit feedback
Radim Kašpárek
11669799
MSc in Econometrics
Track: Big Data Business Analytics
Date of final version: 14th July 2018
Supervisor: prof. dr. Marcel Worring
Second reader: dr. Noud van Giersbergen
Statement of Originality
This document is written by Radim Kašpárek (11669799), who declares to take full responsibility for the contents of this document.
I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.
The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.
Abstract
Shilling attacks are efforts to impair a recommender or to sway its recommendations in a way beneficial to the attacker. Whereas shilling attacks against recommender systems with explicit feedback have been extensively studied, literature on shilling attacks against recommenders with implicit feedback is missing. In this thesis, various shilling attack scenarios were simulated and their effect on the performance of an implicit-feedback-based recommender assessed. The dataset used was obtained from a booking portal, where customers browse and book hotels. The only feedback available is whether the customer clicked on the detail page of a hotel and whether they booked it. We show that even very small attacks severely impair the recommender's performance. However, it is very difficult to sway the recommender so that it recommends a specific item more or less than before the attack.
Acknowledgment
I would like to express gratitude to my parents for their endless support during my studies, and to my friends Samuel Kozuch, Jan Hynek, and Stepan Svoboda, who helped me tremendously during the whole year. Lastly, I would like to thank my supervisor, prof. dr. Marcel Worring, for his supervision.
Table of contents

1 Introduction
2 Literature review
   2.1 Research question
3 The model
4 Dataset description
   4.1 Columns
      4.1.1 Search Criteria variables
      4.1.2 Static hotel characteristics variables
      4.1.3 Dynamic hotel characteristics variables
      4.1.4 Visitor information variables
      4.1.5 Competitive and other information variables
   4.2 Missing values
   4.3 Position Bias
5 Simulations
   5.1 Metrics
   5.2 Shilling attacks using clicks
   5.3 Shilling attacks using clicks and bookings
6 Results
   6.1 Effect on the overall performance of the model
      6.1.1 NDCG of the model under random attack
      6.1.2 NDCG of the model under average attack
      6.1.3 NDCG of the model under push attack
   6.2 Effect on the recommendation score of the target item
      6.2.1 Recommendation score of the target item under average attack
      6.2.2 Recommendation score of the target item under push attack
      6.2.3 Recommendation score of the target item under push-similar attack
      6.2.4 Recommendation score of the target item under nuke attack
7 Discussion
   7.1 Discussion of the results
      7.1.1 Effect on the overall performance of the model
   7.2 Limitations
   7.3 Further research
8 Conclusion
List of figures and tables

List of Figures

1 The correlation matrix for Search Criteria
2 The correlation matrix for Static hotel characteristics
3 The correlation matrix for Visitor info
4 Number of missing values in each column
5 Position bias in non-random sort
6 Position bias in random sort
7 Relative change in the nDCG of the model, random attack
8 Relative change in the nDCG of the model, average attack
9 Relative change in the nDCG of the model, push attack
10 Relative change in the RS of the target item, average attack
11 Relative change in the RS of the target item, push attack
12 Relative change in the RS of the target item, push-similar attack
13 Relative change in the RS of the target item, nuke attack
14 Random attack, click-only scenario
15 Average attack, click-only scenario
16 Push attack, click-only scenario
17 Push-similar attack, click-only scenario
18 Nuke attack, click-only scenario
19 Random attack, click-book scenario
20 Average attack, click-book scenario
21 Push attack, click-book scenario
22 Push-similar attack, click-book scenario
23 Nuke attack, click-book scenario
List of Tables
1 Columns and column types

1 Introduction
The rise of the internet allowed sales to move into the virtual environment. This environment enabled the emergence of massive e-shops, where thousands and thousands of items were available. This vast volume of goods made it infeasible for users to go through all the items, so filtering was implemented. However, as the number of items in each category grew, simple filtering was no longer sufficient. Therefore, retailers implemented other tools, namely recommender systems, to simplify navigation for customers.
The recommender systems quickly proved their value. It was shown that the effects of recommenders extend far beyond the direct extra revenue from items purchased (Dias et al., 2008). The authors demonstrated that substantial additional revenue is generated by introducing shoppers to new categories, where they then continue to purchase. Therefore, recommender systems were widely adopted and became essential parts of e-shops.
Having a high-quality recommender model became an important advantage over the competition and, conversely, having a model that performed poorly very effectively limited the ability to compete. Thus companies focus extensively on improving their models (Gomez-Uribe and Hunt, 2016). However, this fact also presents a clear incentive for impairing the quality of a competitor's recommender - so-called 'shilling attacks'. Shilling attacks against systems that use explicit feedback, where a user assigns an item a score directly, have been extensively studied (Gunes et al., 2014). It was shown that these systems are vulnerable to such attacks, and it was also demonstrated how effective attacks can be constructed. The knowledge of which attacks are effective and which are not can be a very useful guideline for system administrators who are trying to uncover such attacks and set up automated procedures for disclosing such efforts.
Shilling attacks against systems that use implicit feedback, where the sentiment of a user towards an item is deduced solely from their actions (click, buy, put into basket, etc.), have not been studied yet. Even though administrators of such systems definitely use some automated checks of data quality, the knowledge of which attacks are potentially dangerous for a system like theirs is missing.
In this thesis, shilling attacks against implicit-feedback-based systems were simulated and their efficiency assessed. The question of interest is how such attacks should be constructed and how extensive they have to be to impair the recommender. The findings of this thesis could be used for setting up security measures against shilling attacks.
2 Literature review
With the ever-increasing number of items in e-shops, the customer's navigation through the shop becomes more and more complicated. Therefore, online businesses often provide customers with recommendations, which reduce the time needed to find desired items and increase the conversion ratio. These recommended items are chosen by so-called recommender systems, which utilize the user's past behavior in combination with the general popularity of items. While it has been comprehensively described how to best design these systems to provide the most fitting recommendations for customers and to perform decently even with sparse data, the sensitivity of such systems to intentional attacks has not been rigorously explored. Given that recommendation systems are a key part of the overall competitiveness of an e-shop, it can be expected that such attacks are mounted, and thus how the systems react to such attacks should be thoroughly explored.
Recommender systems have become a very important business tool in e-commerce, as most online businesses have implemented this feature in their e-shops. These systems were originally designed to overcome the effort required to browse the ever-increasing number of items in online shops and to allow users to find the item they are looking for faster and with less difficulty. Recommender systems were shown to have a positive impact on sales (de Nooij, 2008), which created a strategic advantage for the early adopters. Due to this early success, recommender systems were soon widely implemented by other e-commerce businesses. While for a certain group of customers with very specific tastes and a clear vision of the desired product the recommenders can return inaccurate and irrelevant recommendations, for the majority of users the recommenders save a significant amount of time and effort. The recommenders arguably contribute to a positive user experience, which manifests itself in a growing base of loyal customers (Konstan and Riedl, 2012).
The first predecessor of modern recommender systems was introduced in 1992 (Goldberg et al., 1992). The authors noticed that the time needed to go through one's emails was sharply increasing as the accessibility of the internet grew and, as a consequence, the number of irrelevant emails rose as well. To combat this, they created a 'collaborative filtering' scheme, where the so-called 'eager readers', i.e. people who enjoyed going through emails and reading them, annotated the emails, and the rest of the users could filter their incoming messages based on these annotations. This approach was later built upon, and current recommender systems are significantly more sophisticated. Nevertheless, the general idea of utilizing other users' behavior to construct recommendations remains the same.
Current recommender systems can be divided into three groups based on the mechanism they employ. The first is called content-based filtering and uses user-item interactions such as ratings or buying behavior to construct recommendations. Based on these interactions, the system captures which items a user liked in the past and their characteristics. Next, according to a prespecified similarity measure, the similarity between candidate items and previously rated items is calculated. The similarity measures can be based on a classification of an item, i.e. the categories to which the item belongs, as well as on the similarity of the textual description of the item, e.g. a TF-IDF matrix. Subsequently, for each candidate item the user rating is predicted (commonly used basic methods for this prediction are content-based nearest neighbourhood, relevance feedback, or naive Bayes) and the items with the highest predicted ratings are recommended to the user. This method is very intuitive and the data needed are not so difficult to obtain (i.e. the item characteristics and the users' behaviour); however, it does not perform well for new users - it suffers from the so-called 'cold start' problem (Jannach et al., 2010; Ricci et al., 2011).
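As an illustration, the content-based ranking step can be sketched in a few lines. This is a toy sketch, not the model used in this thesis: the hotel names and descriptions are invented, and scikit-learn's TfidfVectorizer stands in for whatever item representation a real system would use.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (invented for illustration).
items = {
    "hotel_a": "beach resort pool family friendly",
    "hotel_b": "city centre business wifi conference",
    "hotel_c": "beach bungalow pool bar",
}
names = list(items)
tfidf = TfidfVectorizer().fit_transform(items.values())

# Items the user interacted with positively (e.g. clicked or booked).
liked = ["hotel_a"]
profile = np.asarray(tfidf[[names.index(n) for n in liked]].mean(axis=0))

# Rank the remaining candidates by cosine similarity to the user profile.
scores = cosine_similarity(profile, tfidf).ravel()
ranking = sorted((n for n in names if n not in liked),
                 key=lambda n: scores[names.index(n)], reverse=True)
print(ranking)  # hotel_c shares "beach" and "pool" with the liked hotel
```

A user with no interaction history yields an empty profile here, which is precisely the cold-start problem.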
The second class of methods is called the collaborative filtering scheme, and it utilizes the similarity between users. This system employs a similarity measure to identify customers that have similar preferences over items as the user in consideration and then, based on the ratings these similar users gave the item, calculates a predicted rating of that item for the user. The basic similarity measure employed to identify similar users is the Pearson correlation coefficient. However, it is necessary to account for the fact that some users consistently give lower ratings than others; therefore, some kind of normalization has to be applied. Also, certain items might be generally liked, and similarity in taste on these items does not carry much information. Therefore, methods called variance weighting are commonly used to determine the amount of information given by a similar rating of an item. Moreover, some users might rate very different sets of items with only a small overlap. Then, even though the ratings for these few items can be virtually the same, it should not be concluded that the users share a preference profile over items. Therefore, the number of items from which the similarity between two specific users was calculated should be incorporated in the formula. The collaborative filtering scheme suffers from the cold start problem in a similar way as the above-mentioned content-based filtering. Moreover, this method also performs poorly when there are many items and few users in the dataset, as the users' rating sets overlap only in a small number of items. Such datasets are commonly called 'low density datasets'. Lastly, with an increasing number of users, the computational power needed to find similar users increases sharply. The main advantage of this approach is that the item characteristics do not need to be known, which can be very useful in some cases (e.g. when the recommended content is created by a broad group of contributors) (Jannach et al., 2010; Ricci et al., 2011).
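The ingredients just described - mean-centering to offset rating scales, Pearson correlation on co-rated items, and down-weighting similarities computed from small overlaps - can be combined into a minimal user-based predictor. This is an illustrative sketch with an invented rating matrix, not the method used later in this thesis:

```python
import numpy as np

# Hypothetical user-item rating matrix; np.nan marks unrated items.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [5.0, 3.0, 1.0, 2.0],
    [1.0, 2.0, 5.0, 4.0],
])

def pearson_sim(u, v, min_overlap=2):
    """Pearson correlation on co-rated items, shrunk by overlap size
    (a simple stand-in for significance/variance weighting)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    n = int(mask.sum())
    if n < min_overlap:
        return 0.0
    uc, vc = u[mask] - u[mask].mean(), v[mask] - v[mask].mean()
    denom = np.linalg.norm(uc) * np.linalg.norm(vc)
    if denom == 0:
        return 0.0
    # Shrink similarities computed from few co-rated items.
    return (uc @ vc / denom) * (n / (n + min_overlap))

def predict(R, user, item):
    """Mean-centered, similarity-weighted average of neighbours' ratings."""
    mu = np.nanmean(R[user])
    num = den = 0.0
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue
        s = pearson_sim(R[user], R[v])
        num += s * (R[v, item] - np.nanmean(R[v]))
        den += abs(s)
    return mu if den == 0 else mu + num / den

print(predict(R, user=0, item=2))  # a low score: similar users disliked item 2
```

Note that only the rating matrix enters the computation; no item characteristics are needed, which is the main advantage mentioned above.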
There is also a third group of recommenders, referred to as hybrid recommenders, which seek to get the best of both worlds and combine the above-mentioned approaches. These systems leverage item-item as well as user-user similarity for constructing recommendations. The models used for this range from straightforward ensembles of the collaborative filtering scheme and content-based filtering to complicated machine learning algorithms such as the one used in (Wang et al., 2015). The authors noticed that the performance of collaborative filtering schemes drops significantly when the ratings are sparse. Therefore, they proposed a hierarchical Bayesian model called Collaborative Deep Learning (CDL). This model jointly performs collaborative filtering for the user feedback and deep representation learning for the content information. They demonstrated that this new approach was state-of-the-art at the time on three real-world datasets (citeulike-a, citeulike-t, and the dataset from the Netflix challenge). They concluded that the performance could be further improved by incorporating a convolutional neural network (CNN) model, which can explicitly take the context and order of words into account. This paper was the first bridge between state-of-the-art deep learning models and recommender systems (Wang et al., 2015) and opened a new direction in recommender systems research.
Furthermore, recommender systems can also be divided into two groups according to the data they use. The first class uses explicit feedback - users in this paradigm rate items on an ordinal scale (e.g. the IMDB or Rotten Tomatoes film databases). The second uses only implicit feedback. That means that the user's sentiment towards an item is deduced only from their actions, such as the number of views of an item, time spent on the detail page of an item, a save or buy action, and others. The latter class of recommender systems is more complicated to construct and evaluate, as the interpretation of the user's actions can be ambiguous. Nevertheless, since providing explicit feedback places additional time and cognitive load on users, the implicit-feedback-based systems have significantly more applied cases.
The recommended items in online stores are more likely to be sold simply because they are viewed by more people (Lin, 2014). Therefore, it is in the producer's interest to have their items recommended by e-commerce or booking portals. This fact presents an incentive to try to influence the recommender system in such a way that it will rank specific products higher. Additionally, it could be beneficial to the producer's sales if the greatest competitors' products are not recommended at all. For such efforts to sway the recommendations in a specific way, the name shilling attacks was adopted (Lam and Riedl, 2004).
Shilling attacks can have two objectives. The first is to increase the popularity of targeted items; in the existing literature, these are frequently referred to as push attacks (Gunes et al., 2014). The second objective is to decrease the popularity of other targeted items; these are referred to as nuke attacks (Gunes et al., 2014).
Several types of shilling attacks are described in the literature (Mobasher et al., 2007; Gunes et al., 2014). The basic types are:
• Random Attacks - this attack randomly chooses the items to rate and their ratings, except for the target item. This attack is easy to implement, but has only limited effectiveness.
• Average Attacks - this attack mimics the rating behavior of the system's users by drawing its ratings from the rating distribution associated with each item. These attacks are usually very effective, but require detailed knowledge of the dataset.
• Bandwagon Attacks - in this scenario, the targeted item is associated with a handful of well-known popular items. This attack is also fairly effective, but does not have such high requirements on knowledge about the underlying dataset.
• Segment Attacks - in these, the attackers aim to increase the recommendations of the target item to a specific group of users.
The existing literature describes shilling attacks on explicit-feedback-based systems well; however, literature on implicit-feedback-based recommender systems is very rare. Zhang investigated segment attacks on binary-feedback-based recommender systems (Zhang, 2016). His results demonstrate that the segment attack model is very effective against both collaborative-filtering-based and content-based algorithms. To measure the effects of such attacks, Zhang introduces a hit ratio measure - it denotes the probability that the system will recommend the pushed item. It is defined as the sum of the position numbers of all target items in the recommendation list across all test users, divided by the number of pushed items and the number of test users. Generally, the two algorithms have higher hit ratios as the attack size increases.
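Under one reading of that definition, the hit ratio can be computed as follows. The data and the exact averaging are illustrative assumptions, since the verbal formula can be parsed in more than one way:

```python
def hit_ratio(recommendations, target_items):
    """One reading of the hit ratio described above: the sum of the
    (1-based) list positions at which target items appear, averaged over
    the number of pushed items and the number of test users.
    `recommendations` maps each test user to their ranked item list."""
    total = 0
    for ranked in recommendations.values():
        for item in target_items:
            if item in ranked:
                total += ranked.index(item) + 1  # 1-based position
    return total / (len(target_items) * len(recommendations))

# Hypothetical toy example: two test users, one pushed item "h1".
recs = {"u1": ["h1", "h2", "h3"], "u2": ["h2", "h1", "h3"]}
print(hit_ratio(recs, ["h1"]))  # (1 + 2) / (1 * 2) = 1.5
```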
2.1 Research question
As the robustness of implicit-feedback-based hybrid recommender systems has not yet been extensively studied, various shilling attack scenarios will be tested in this paper and the robustness of the baseline implicit-feedback-based Gradient Boosting Classifier recommender assessed. This will be conducted in order to answer the main research question of the present study, which is how many fake searches must be included to sway the recommendations of this particular model. A second, follow-up question examines how such searches should be constructed so that the efficiency of the attack is the highest possible.
The answer to this question could be the first step in developing detection algorithms for shilling attacks against such systems and could improve the robustness of recommenders used in e-commerce, as these could possibly be targets of various shilling attack schemes.
Being able to identify such attacks is of the utmost importance for online retailers and booking sites, as the attacks decrease the quality of recommendations and, subsequently, the profit of the business as well as customer satisfaction.
3 The model
The model used as the base method in this study was used in (Liu et al., 2013). It was originally designed as an entry to the Kaggle competition Personalize Expedia Hotel Searches - ICDM 2013 (https://www.kaggle.com/c/expedia-personalized-sort), where it scored fifth on the private leaderboard. This approach was chosen because it was the highest-ranked method for which the authors made their code available (https://github.com/shawnhero/ICDM2013-Expedia-Recommendation-System). The model is a Gradient Boosting Classifier based on Classification and Regression Trees (CARTs), and it is further described below. Due to computational limitations, only one tenth of the dataset was used in our analysis; however, it has been shown that results obtained using a downsampled dataset do not differ greatly from the results obtained using the whole dataset (Liu et al., 2013).
The main idea behind gradient boosting of regression trees is well described in (Friedman, 2001). The gradient boosting model is an ensemble of weak predictors; in most cases, these weak predictors are CARTs. Whereas in bagging many independent learners are built and combined using some model averaging technique (weighted average, majority vote, etc.), in boosting the predictors are built sequentially (Christopher, 2016). After each iteration, i.e. the learning of one predictor, misclassified observations have their weights increased to emphasise the most difficult cases. In this way, subsequent learners will focus on these cases during their training. The contribution of subsequent learners is reduced through the learning rate, so it is not beneficial to increase the number of underlying learners past a certain unknown tipping point. Current research in the area focuses both on new applications of gradient boosting and on fast and efficient implementations. As of now, the most advanced implementation is the XGBoost package for the R programming language (Chen and Guestrin, 2016).
(Freund and Schapire, 1997) first implemented the idea of boosting by proposing a new algorithm called Hedge(β), in which they combined a set of weak PAC learning algorithms. The pseudocode for the algorithm is as follows:

Parameters:
• β ∈ [0, 1]
• initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1
• number of trials T

Do for t = 1, 2, ..., T:
1. Choose allocation p^t = w^t / Σ_{i=1}^N w_i^t
2. Receive loss vector ℓ^t ∈ [0, 1]^N from the model
3. Suffer loss p^t · ℓ^t
4. Set the new weight vector to w_i^{t+1} = w_i^t β^{ℓ_i^t}
This algorithm was shown to have its loss bounded from above, and this bound is reasonably low even for a large number of underlying weak PAC learning algorithms, as it depends only logarithmically on this number. Its greatest contribution at the time was the fact that it did not require any prior knowledge about the performance of the weak learning algorithm, whereas the preceding algorithms had required substantial prior knowledge about the problem at hand (Freund et al., 1993; Freund, 1995; Schapire, 1990). This paper was very influential and outlined a whole new direction of research (as of 7th of May 2018, it has been cited 15,367 times).
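For concreteness, Hedge(β) is short enough to state as running code. The loss sequence below is an invented toy example; in the boosting setting, each 'expert' would correspond to a weak learner:

```python
import numpy as np

def hedge(losses, beta=0.5):
    """Hedge(beta) from Freund and Schapire (1997): keep one weight per
    expert, allocate proportionally to the weights, and multiplicatively
    shrink the weights of experts that suffer loss.
    `losses` is a (T, N) array of per-trial, per-expert losses in [0, 1].
    Returns the cumulative suffered loss and the final allocation."""
    T, N = losses.shape
    w = np.full(N, 1.0 / N)           # uniform initial weights, sums to 1
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()               # step 1: allocation p^t
        total_loss += p @ losses[t]   # steps 2-3: suffer loss p^t . l^t
        w = w * beta ** losses[t]     # step 4: multiplicative update
    return total_loss, p

# Expert 0 is consistently good, expert 1 consistently bad: the
# allocation should concentrate on expert 0.
losses = np.tile([0.0, 1.0], (20, 1))
total, p = hedge(losses)
print(p)  # nearly all weight on expert 0
```

With β = 0.5 and a best expert of zero loss, the cumulative loss stays below the theoretical bound ln(N)/(1 − β) ≈ 1.39.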
The base model used in this analysis was implemented as the function
sklearn.ensemble.GradientBoostingClassifier from the sklearn package for the Python 2.7 programming language. The parameters were set as follows:
• loss='deviance' - this sets the loss function to 'deviance', which corresponds to logistic regression. The loss function is specified as follows:
L(y, p) = −(y log(p) + (1 − y) log(1 − p)), (1)
where y is the target variable and p is the predicted probability of y = 1.
• learning_rate=0.1 - the learning rate strongly affects the performance of the algorithm. It denotes the shrinkage of the contribution of each tree. It has been shown that an adaptive learning rate can improve the performance of a boosting algorithm (Culp et al., 2011).
• n_estimators=100 - this parameter denotes the number of boosting stages to perform. It has to be set with the learning rate in mind, as for a high learning rate the boosting stages have little to no benefit for the performance of the algorithm after exceeding some unknown number of stages. These extra stages not only have little to no benefit for the learning, but unnecessarily increase the computational demands and the time needed to train the model.
• subsample=1.0 - this parameter indicates the fraction of the dataset used for fitting each individual base learner. Setting subsample smaller than one leads to a reduction of variance and increased bias.
• min_samples_split=2 - the minimum number of observations that has to be in a node for it to be considered for splitting. When set to higher values, it controls for overfitting, as the relevance of relations that are highly specific to a sample is somewhat diminished. Too high values can lead to underfitting.
• min_samples_leaf=1 - the minimum number of observations in a terminal node. Like the previous parameter, it can be used to control for overfitting. This parameter is set to 1, as this classification task has highly imbalanced classes and the regions where the minority class lies can potentially be very small.
• max_depth=3 - the maximal depth of a regression tree. This parameter controls for overfitting, as 'shallow' trees do not learn very specific relations that could be particular to the training dataset.
• random_state=None - denotes the random seed that should be used. Generally, it is good practice to fix a random seed when tuning the parameters, as one can then easily compare the performance of a model with differing parameters.
• max_features=None - the number of randomly chosen features considered when looking for the best split.
Extensive documentation on this implementation of gradient boosting can be found in the scikit-learn documentation.
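For reference, the configuration above translates directly to current scikit-learn (Python 3); the synthetic, class-imbalanced data below merely stand in for the Expedia dataset. The loss argument is left at its default, which is the binomial deviance/log-loss described above (newer scikit-learn releases renamed 'deviance' to 'log_loss'):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data as a stand-in for the (click/no-click) task.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(
    # loss defaults to the binomial deviance / log-loss described above
    learning_rate=0.1,
    n_estimators=100,
    subsample=1.0,
    min_samples_split=2,
    min_samples_leaf=1,
    max_depth=3,
    random_state=None,
    max_features=None,
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```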
4 Dataset description
The dataset was provided for the Kaggle competition Personalize Expedia Hotel Searches - ICDM 2013 (the competition can be accessed via this link). The introduction text provided by Kaggle for the competition says:
Expedia is the world's largest online travel agency (OTA) and powers search results for millions of travel shoppers every day. In this competitive market, matching users to hotel inventory is very important, since users easily jump from website to website. As such, having the best ranking of hotels (sort) for specific users with the best integration of price competitiveness gives an OTA the best chance of winning the sale.
For this contest, Expedia has provided a dataset that includes shopping and purchase data as well as information on price competitiveness. The data are organized around a set of search result impressions, or the ordered list of hotels that the user sees after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the data contain impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel and/or a purchase of a hotel room.
4.1 Columns
The data were generated via the search page on the portal Expedia.com and contain 54 columns. The columns and their types can be found in Appendix A. The columns can be divided into 6 categories:
• Search Criteria
• Static hotel characteristics
• Dynamic hotel characteristics
• Visitor information
• Competitive information
• Other information
These columns are further described in the following sections, as the nature and the structure of the dataset in consideration are essential for this thesis.
4.1.1 Search Criteria variables
• date_time - Date and time of the search. The span is from 2012 January 1st to 2013 June 30th, i.e. 18 months.
• srch_destination_id - Search Destination ID. The dataset contains 12 189 unique locations that had been searched for. The mapping from IDs to locations was not provided.
• srch_length_of_stay - Length of stay in days. The length of stay spans from 1 to 32 days; 7 searches were for stays of 57 days.
• srch_booking_window - Length in days between the search day and the start of the stay. The minimal value is 0, the maximal is 488.
• srch_adults_count - Number of adult individuals specied in the search. The maximum value is 9.
• srch_children_count - Number of children specied in the search. The maxi-mum value is 9.
• srch_room_count - Number of rooms specied in the search. The maximal number in the dataset is 8.
• srch_saturday_night_bool - Boolean that indicates whether the stay includes Saturday night, starts from Thursday and is shorter than 5 days.
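The definition of srch_saturday_night_bool can be made precise in code. The reconstruction below is an assumption based on the description above (start on Thursday or later, at most 4 nights, covering a Saturday night); the exact rule used by Expedia may differ:

```python
from datetime import date, timedelta

def saturday_night_bool(check_in: date, length_of_stay: int) -> bool:
    """One plausible reading of the flag: the stay starts on Thursday or
    later, lasts at most 4 nights, and includes a Saturday night."""
    if check_in.weekday() < 3 or length_of_stay > 4:  # Mon=0 ... Thu=3
        return False
    nights = {(check_in + timedelta(days=i)).weekday()
              for i in range(length_of_stay)}
    return 5 in nights  # 5 = Saturday

# A Friday-to-Sunday weekend stay includes Saturday night
# (14 June 2013 was a Friday):
print(saturday_night_bool(date(2013, 6, 14), 2))
```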
The correlation matrix (see Figure 1) shows some obvious relations (such as the one between srch_adults_count and srch_room_count) as well as some non-trivial correlations. These support the use of srch_saturday_night_bool as a proxy for holiday (when +1, the stay is classified as a vacation; when 0, the stay is identified as a business trip; Netessine and Shumsky (2002)), as there is a slight positive correlation with srch_adults_count (people usually go on vacation in a bigger group than on a business trip).
Figure 1: The correlation matrix for Search Criteria
4.1.2 Static hotel characteristics variables
• prop_id - Property ID. There are 114,833 unique properties in the dataset.
• prop_country_id - Country the property is in. The properties are located in 159 countries.
• prop_starrating - Star rating of the hotel. Ranges from 0 to 5, with 0 indicating either unknown star rating or zero stars.
• prop_review_score - The mean of customer review scores, ranges from 0 to 5, rounded to 0.5 points. 0 indicates there have been no reviews, NA indicates that the information is unavailable.
• prop_brand_bool - Boolean that denotes affiliation to a major hotel chain. 63% of the hotels in the dataset are part of a major hotel chain.
• prop_location_score1 - The first score denoting the desirability of the property's location. The higher the value, the more desirable the location.
• prop_location_score2 - The second score denoting the desirability of the property's location.
• prop_log_historical_price - the logarithm of the mean price of the property in the last trading period. 0 denotes that the property was not booked in that period.
Of note is the negative correlation (see Figure 2) between prop_brand_bool and prop_location_score1 - it seems that in less desirable locations, more hotels belong to some major hotel chain. Apart from that, the correlation matrix does not show any surprising relations, as all the correlations have the expected sign.
Figure 2: The correlation matrix for Static hotel characteristics
4.1.3 Dynamic hotel characteristics variables
• position - Position on the results page. The maximum number of returned results for one search is 40.
• price_usd - Displayed price. Note, however, that different countries can have different conventions (taxes included/excluded, price per night/whole stay).
• promotion_flag - Boolean denoting whether the hotel has a sale-price promotion.
• gross_bookings_usd - Total value of the transaction.
• booking_bool - Boolean denoting whether the hotel was booked.
• srch_query_affinity_score - The logarithm of the probability a hotel will be clicked on in Internet searches. If zero, the hotel did not appear in any searches. The correlation matrix is not provided for this set of columns as its informative value is rather low here. However, it is worth mentioning that there is a rather strong negative correlation between position and click_bool/booking_bool.
4.1.4 Visitor information variables
• visitor_location_country_id - The ID of the country from which the customer accesses the webpage.
• visitor_hist_starrating - The mean star rating of hotels the customer previously booked. NA indicates that the customer's booking history is not available.
• visitor_hist_adr_usd - The mean price per night of hotels the customer previously booked. NA indicates that the customer's booking history is not available.
• orig_destination_distance - The physical distance between the customer and the hotel. NA indicates that the distance could not be calculated.
The correlation matrix does not uncover any surprising facts; the only one worth mentioning is that visitor_hist_starrating increases with the distance between the customer and the hotel. This seems reasonable, as people who are willing to pay for a longer trip to their destination are probably also willing to pay more per night and therefore book hotels with higher star ratings.
4.1.5 Competitive and other information variables
• comp1_rate - This variable describes whether Expedia has a higher, lower or the same price as competitor 1. NA indicates there is no data.
Figure 3: The correlation matrix for Visitor info
• comp1_inv - 1 if competitor 1 does not offer rooms in the hotel, 0 if both Expedia and competitor 1 have availability in the hotel. NA indicates there is no data.
• comp1_rate_percent_diff - The absolute percentage difference in prices if there is one, otherwise NA.
• srch_id - The ID of a search.
• random_bool - Boolean indicating whether the displayed sort was random or normal. A random order was presented to circa 30% of customers to allow for learning without position bias.
• site_id - This variable denotes the point of sale (Expedia.com, Expedia.co.uk, Expedia.co.jp, etc.).
The dataset contains variables for data on up to 8 competitors; however, it is very rare to have information about more than two competitors.
4.2 Missing values
Unfortunately, the dataset contains a lot of missing values. As the goal of this thesis is not to predict any variable, these will not be interpolated and will be treated as NAs, as in the base model from Liu et al. (2013). For the proportion of NAs in each column, see Figure 4.
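The per-column NA proportions underlying a plot like Figure 4 are a one-liner in pandas. A minimal sketch on a toy frame (the column names are from the dataset; the values and the NA pattern are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Expedia data; the real file would be
# loaded with pd.read_csv (file name assumed).
df = pd.DataFrame({
    "price_usd":            [89.0, 150.0, 60.0, 240.0],
    "prop_review_score":    [4.5, np.nan, 3.0, np.nan],
    "visitor_hist_adr_usd": [np.nan, np.nan, np.nan, 120.0],
})
# Proportion of NAs per column, the quantity shown in Figure 4.
na_share = df.isna().mean().sort_values(ascending=False)
print(na_share)
```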
Figure 4: Number of missing values in each column
4.3 Position Bias
The dataset shows strong position bias: hotels listed higher on the search results page receive significantly more clicks and bookings. This makes sense for the non-random sort, as Expedia had already implemented a recommender system before the competition took place; however, even for the random sort there is a clear preference of customers for hotels in the front positions. Another insight gained is that the recommender systems existing at the time had a significant positive influence on the conversion ratio (#bookings/#clicks).
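The position bias shown in Figures 5 and 6 can be quantified by grouping the interaction log on random_bool and position. A sketch on toy data with the dataset's column names (the rows are invented; only the computation is the point):

```python
import pandas as pd

# Toy interaction log with the dataset's column names.
df = pd.DataFrame({
    "random_bool":  [0, 0, 0, 0, 0, 0, 1, 1],
    "position":     [1, 1, 2, 2, 3, 3, 1, 2],
    "click_bool":   [1, 1, 1, 0, 0, 0, 1, 0],
    "booking_bool": [1, 0, 0, 0, 0, 0, 0, 0],
})
# Click-through and booking rate per displayed position, computed
# separately for normally (0) and randomly (1) sorted result lists.
bias = df.groupby(["random_bool", "position"])[["click_bool", "booking_bool"]].mean()
print(bias)
```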
Figure 5: Position bias in non-random sort
Figure 6: Position bias in random sort
4.4 Downsampling
The original dataset contains over 665 000 searches, which corresponds to approximately 16.5 million data points. Unfortunately, it is computationally infeasible to work with a dataset of this size on a common laptop, so downsampling had to be performed.
Downsampling does not result in a significant decrease in performance, as tested by Liu et al. (2013). The training set was downsampled to approximately 2M data points (corresponding to 79 869 unique searches), and approximately 1M data points were used as the test set. This allowed the analysis to be conducted on a common laptop with 8GB of RAM.
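Since each search is one ranking instance, downsampling has to keep whole searches rather than individual rows. A minimal sketch of this, assuming a pandas DataFrame with the dataset's srch_id column (the helper name is mine, not from the thesis):

```python
import numpy as np
import pandas as pd

def downsample_searches(df, n_searches, seed=42):
    """Keep every row of a randomly chosen subset of searches.

    Sampling whole srch_id groups rather than individual rows keeps the
    within-search ranking structure intact, which the learner needs.
    """
    rng = np.random.default_rng(seed)
    keep = rng.choice(df["srch_id"].unique(), size=n_searches, replace=False)
    return df[df["srch_id"].isin(keep)]

# Toy example: four searches with two result rows each.
df = pd.DataFrame({"srch_id": [1, 1, 2, 2, 3, 3, 4, 4],
                   "prop_id": range(8)})
small = downsample_searches(df, n_searches=2)
print(small["srch_id"].nunique())  # 2
```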
5 Simulations
`A shilling attack against a recommender system consists of a set of profiles added to the system by the attacker' (Long and Hu, 2010, p. 1247). In this specific case, where the underlying system is a booking page and only unary feedback is obtained, a shilling attack can be described as `conducting a number of specific searches with specific strategies for clicking hotels with the aim to influence the sorting algorithm'. In the first scenario, only fake clicks are considered, as booking hotels and the consequent cancellation requires additional effort from the attacker. In the second scenario, more elaborate attacks are considered and the attack simulations contain fake bookings as well.
It is important to note that the attacks were not tested on the actual search page; therefore, only existing searches were used in the simulations. Specific search queries cannot be constructed as the underlying sorting mechanism is not known. Using the actual search page would enable the use of searches adapted for the purposes of an attack and thus a higher efficiency of the attack. As this is not possible, only existing searches were replicated during the simulations and the `click' and `book' actions were assigned differently. The number of replications of each search was controlled for.
Based on the existing literature, five different shilling attacks will be tested:
• Random attack
• Average attack
• Direct push attack
• Direct push-similar attack
• Direct nuke attack
As shilling attacks against unary-feedback-based recommender systems have not yet been simulated, the methodology is based on shilling attacks against explicit-feedback-based recommenders.
5.1 Metrics
For the simulation assessment, two metrics will be used. The first is the Normalized Discounted Cumulative Gain (NDCG). The NDCG measures how successful the recommendation system is at constructing relevant recommendations; it is the ratio of the DCG to the IDCG. IDCG stands for `Ideal Discounted Cumulative Gain' and is the score achieved by the optimal sort - in this case, hotels that had been booked in the first positions, then hotels that had been clicked on, followed by the rest. DCG stands for `Discounted Cumulative Gain' and is the score actually achieved by the sorting algorithm. The NDCG and DCG are specified as follows:
\[ \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}, \qquad \mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)}, \]
where rel_i is equal to 5 for booked hotels, 1 for clicked hotels, and 0 for the rest (this setting is arbitrary; 5, 1 and 0 were used in the original competition), and p is the number of search results returned by the system. The NDCG is calculated for each search, and these values are later averaged to obtain an NDCG representative for the whole dataset.
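The per-search NDCG defined above can be sketched in a few lines of NumPy; the function names are mine, but the formula is exactly the one given here:

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain of a result list in the shown order."""
    rels = np.asarray(rels, dtype=float)
    positions = np.arange(1, len(rels) + 1)
    return np.sum((2.0 ** rels - 1.0) / np.log2(positions + 1))

def ndcg(rels):
    """NDCG of one search: DCG of the shown order over the ideal DCG."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Relevance 5 = booked, 1 = clicked, 0 = neither (competition setting).
print(ndcg([5, 1, 0, 0]))  # ideal order, so exactly 1.0
print(ndcg([0, 5, 1, 0]))  # booked hotel shown second, so below 1
```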
The second metric is the recommendation score of an item. It captures how much the system favors an item - in this case, how lucrative the positions it ascribes to the item on the search results page are.
A common method for scoring positions is the Dowdall system, a modification of the well-known Borda count (Fraenkel and Grofman, 2014). This system awards the first position one point, the second one half, the third one third, etc. Formally, the number of points awarded to position p in search s is specified as follows:
\[ \mathrm{Position\_Score}_{s,p} = \frac{1}{p}. \]
The Dowdall system accounts for the fact that the difference in benefit between two consecutive positions decreases with position number. The overall recommendation score of an item i over the whole dataset is obtained as follows:
\[ \mathrm{RS}_i = \sum_{s \in S} \mathrm{Position\_Score}_{s,p}, \]
where S is the set of all searches in the test set and p is the position of item i in search s.
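The Dowdall-based recommendation score can be sketched directly from this definition. A minimal version, assuming each search is represented by its ranked list of property IDs (the function name and data layout are mine):

```python
def recommendation_score(results, item):
    """Dowdall recommendation score of `item` over a set of searches.

    `results` maps srch_id -> list of prop_ids in displayed order; an
    item shown at (1-based) position p earns 1/p points in that search.
    """
    score = 0.0
    for ranked in results.values():
        if item in ranked:
            score += 1.0 / (ranked.index(item) + 1)
    return score

# Item 20 is 2nd, 1st and 2nd in three searches: 1/2 + 1 + 1/2 = 2.0
results = {1: [10, 20, 30], 2: [20, 10, 30], 3: [30, 20, 10]}
print(recommendation_score(results, 20))  # 2.0
```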
5.2 Shilling attacks using clicks
These attacks consist of conducting a number of fake searches and clicking on chosen hotels, while not clicking on other chosen hotels. Illustrative figures can be found in Appendix B. These attacks are not difficult to mount.
• Random attack - in this scenario, random searches were replicated and m `click' actions were assigned randomly. The question of interest was how many of these fake searches have to be added to severely decrease the performance of the algorithm. The proportions of fake searches tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were used.
• Average attack - at the beginning, one target item was chosen. Then n searches were randomly chosen with replacement from the set of searches where the target item appeared. Then m randomly chosen hotels from the search results were clicked on, and the target item was clicked on as well. The question was how big n had to be to significantly increase the recommendation score of the target item, how the choice of m affects this number and how the overall performance of the recommender system (NDCG measure) is affected. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were used.
• Direct push attack - at the beginning, one target item was chosen. Then n searches were randomly chosen with replacement from the set of searches where the target item appeared. Then m hotels from the very end of the search results list, as well as the target item, were clicked on. The question was how big n had to be to significantly increase the recommendation score of the target item, how the choice of m would affect this number, how the overall performance of the system would be affected and how this setting would differ from the previous scenario. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes were {0.05, 0.1, 0.2, 0.3, 0.5}.
• Direct push-similar attack - at the beginning, one target item was chosen. Then n searches were randomly chosen with replacement from the set of searches where the target item appeared. Then m/2 hotels listed directly above and m/2 hotels listed directly below the target item were clicked on. This scenario is based on the idea that hotels listed directly above and below the target item are in some way similar to it, so clicking on them should enhance the model's overall preference for such hotels. The question was how big n had to be to significantly increase the recommendation score of the target item, how the choice of m would affect this number, how the overall performance of the system would be affected and how this setting would differ from the previous scenario. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes were {0.05, 0.1, 0.2, 0.3, 0.5}.
• Direct nuke attack - at the beginning, one target item was chosen. Then n searches were randomly chosen with replacement from the set of searches where the target item appeared. Then m randomly chosen hotels from the search results, apart from the target item, were clicked on, while the target item was not clicked on. The question was how big n would have to be to significantly decrease the recommendation score of the target item, how the choice of m would affect this number, and how the overall performance of the system would be affected. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were used.
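The injection step shared by these scenarios (replicate searches containing the target, reassign clicks, append the fake rows to the training data) can be sketched for the average attack as follows. This is an illustrative sketch under assumed column names from the dataset, not the thesis's actual simulation code; the helper name is mine.

```python
import numpy as np
import pandas as pd

def average_attack(df, target, n_fake, m, seed=0):
    """Click-only average attack (a sketch): replicate searches that
    return the target item, click m random other results plus the target."""
    rng = np.random.default_rng(seed)
    candidate_ids = df.loc[df["prop_id"] == target, "srch_id"].unique()
    fakes, next_id = [], df["srch_id"].max() + 1
    for _ in range(n_fake):
        sid = rng.choice(candidate_ids)          # sample with replacement
        fake = df[df["srch_id"] == sid].copy()
        fake["srch_id"] = next_id                # the fake is a new search
        fake["click_bool"] = 0
        others = fake.index[fake["prop_id"] != target]
        clicked = rng.choice(others, size=min(m, len(others)), replace=False)
        fake.loc[clicked, "click_bool"] = 1      # m random co-clicks
        fake.loc[fake["prop_id"] == target, "click_bool"] = 1
        fakes.append(fake)
        next_id += 1
    return pd.concat([df] + fakes, ignore_index=True)

# Toy data: two searches, both returning the target property 7.
df = pd.DataFrame({"srch_id": [1, 1, 1, 2, 2],
                   "prop_id": [7, 8, 9, 7, 8],
                   "click_bool": 0})
attacked = average_attack(df, target=7, n_fake=2, m=1)
print(attacked["srch_id"].nunique())  # 4: two real + two fake searches
```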
5.3 Shilling attacks using clicks and bookings
These attacks consist of clicks and bookings. As the system allows for cancellation of bookings, fake booking actions can very well be used to affect the recommender. However, cancellation requires additional effort, so the attacker has to be strongly motivated to mount such an attack. The scenarios used were very similar to those for the attacks that use only clicks. Illustrative figures can be found in Appendix C.
• Random attack - in this scenario n random searches were replicated and the `click' and `book' actions were assigned randomly. For every search, one `book' and m `click' actions were assigned. To allow for comparison, the proportions of fake searches tested were the same as for `click only' attack, i.e. {0.01, 0.03, 0.05, 0.1, 0.2} for n and {0.05, 0.1, 0.2, 0.3, 0.5} for m.
• Average attack - before the simulation, one target item was chosen. Then n random searches from the set of searches returning the target item were replicated and m `click' actions were randomly assigned. Consequently, `click' and `book' actions were ascribed to the target item. The setting was again the same as in the simulation of the `click only' attack to allow for comparison.
• Direct push attack - this simulation aims to increase the recommendation score of the target item. Therefore n searches containing the target item were replicated, m hotels from the very end of the search results list were clicked on, and the target item was clicked on and booked. The question is how large n has to be to significantly increase the recommendation score and whether the additional booking affects the success of the attack. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were used.
• Direct push-similar attack - this simulation aims to increase the recommendation score of the target item. Therefore n searches containing the target item were replicated, m/2 hotels listed directly above and m/2 hotels listed directly below the target item were clicked on, and the target item was clicked on and booked. This scenario is based on the idea that hotels listed directly above and below the target item are in some way similar to it, so clicking on them should enhance the model's overall preference for such hotels. The question is how large n has to be to significantly increase the recommendation score and whether the additional booking affects the success of the attack. The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were used.
• Direct nuke attack - this simulation's goal is to decrease the recommendation score of the target item. n searches returning the target item were replicated, and the first result (the second if the first was the target item) was clicked and booked. Consequently, the m following results were clicked (again, if the target item was among them, it was skipped). The relative sizes of n tested were {0.01, 0.03, 0.05, 0.1, 0.2}; for m, the relative sizes {0.05, 0.1, 0.2, 0.3, 0.5} were chosen.
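The click-book variants differ from the click-only ones mainly in also setting booking_bool for the target. The direct push version can be sketched as follows; again an illustrative sketch with assumed column names and an invented helper name, not the thesis's simulation code.

```python
import numpy as np
import pandas as pd

def direct_push_with_booking(df, target, n_fake, m, seed=0):
    """Click-book direct push (a sketch): replicate searches returning
    the target, click m hotels at the very end of the result list, and
    click AND book the target item."""
    rng = np.random.default_rng(seed)
    candidate_ids = df.loc[df["prop_id"] == target, "srch_id"].unique()
    fakes, next_id = [], df["srch_id"].max() + 1
    for _ in range(n_fake):
        sid = rng.choice(candidate_ids)
        fake = df[df["srch_id"] == sid].sort_values("position").copy()
        fake["srch_id"] = next_id
        fake[["click_bool", "booking_bool"]] = 0
        tail = fake.index[fake["prop_id"] != target][-m:]  # end of the list
        fake.loc[tail, "click_bool"] = 1
        fake.loc[fake["prop_id"] == target,
                 ["click_bool", "booking_bool"]] = 1
        fakes.append(fake)
        next_id += 1
    return pd.concat([df] + fakes, ignore_index=True)

# Toy search: target property 5 shown first, three more hotels below it.
df = pd.DataFrame({"srch_id": 1, "prop_id": [5, 6, 7, 8],
                   "position": [1, 2, 3, 4],
                   "click_bool": 0, "booking_bool": 0})
attacked = direct_push_with_booking(df, target=5, n_fake=1, m=2)
fake = attacked[attacked["srch_id"] == 2]
print(int(fake["click_bool"].sum()))  # 3: target + two tail hotels
```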
6 Results
For every simulation, the nDCG and the recommendation score of the target items (except in the random scenario) were recorded. Three hotels were chosen randomly as target items. The relative difference from the baseline nDCG (or baseline recommendation score) was calculated for every item, and these differences were later averaged for each scenario.
In the first part, the effect on the overall performance of the model is reported; the performance change was assessed for the random, average and push attacks. In the second part, the effect of the simulations on the recommendation score of the target item is inspected; this effect was inspected for the push, push-similar, average and nuke attacks.
6.1 Effect on the overall performance of the model
The titles of the plots in the following section denote the scenario and the proportion of fake searches added to the original dataset, the x-axis shows the proportion of fake clicks and bookings in each fake search, and the two lines show how the nDCG was influenced by the corresponding click-only (blue line, `click') and click-book attacks (red line, `click_book'). This holds for all figures in this section. The relative difference from the baseline nDCG was calculated as:
\[ \frac{1}{3} \sum_{i=1}^{3} \frac{\mathrm{NDCG}_{i,\mathrm{attacked}} - \mathrm{NDCG}_{i,\mathrm{original}}}{\mathrm{NDCG}_{i,\mathrm{original}}}. \]
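This averaged relative difference is a one-line computation; the sketch below uses hypothetical nDCG values chosen only to illustrate the formula:

```python
import numpy as np

def mean_relative_change(attacked, original):
    """Average relative difference from the baseline, here over the
    three recorded values (one per target item)."""
    attacked = np.asarray(attacked, dtype=float)
    original = np.asarray(original, dtype=float)
    return float(np.mean((attacked - original) / original))

# Hypothetical example: three runs dropping from a baseline nDCG of 0.48.
print(mean_relative_change([0.36, 0.37, 0.35], [0.48, 0.48, 0.48]))  # ~ -0.25
```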
6.1.1 NDCG of the model under random attack
From Figure 7 it is apparent that the nDCG of the recommender is significantly negatively influenced even for very small proportions of fake searches. There is no further decrease in performance with increasing n or m. The relative difference between the nDCG of the attacked model and the original nDCG (≈ −0.25) is very close to the relative difference between the nDCG attained by a completely random sort and the original model (≈ −0.282). Therefore it can be concluded that the attacked recommender is not much better than choosing the recommendations randomly.

Figure 7: Relative nDCG under the random attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.1.2 NDCG of the model under average attack
The average attack significantly decreases the overall performance of the model as well (see Figure 8). The performance seems to be slightly worse for n = 0.1 and n = 0.2, so it can be tentatively concluded that an increase in the proportion of fake searches leads to a decrease in performance.
Figure 8: Relative nDCG under the average attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.1.3 NDCG of the model under push attack
The push attack makes the recommender nearly worthless as well, as can be seen in Figure 9. There is no notable difference between the effects of the click-only and click-book attacks.

Figure 9: Relative nDCG under the push attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.2 Effect on the recommendation score of the target item
To measure the change in the recommendation score, three items were randomly chosen and the recommendation score of each was recorded in every scenario. The results are reported as the average relative difference between the recommendation score of the item under the attacked model and under the non-attacked model:
\[ \frac{1}{3} \sum_{i=1}^{3} \frac{\mathrm{RS}_{i,\mathrm{attacked}} - \mathrm{RS}_{i,\mathrm{original}}}{\mathrm{RS}_{i,\mathrm{original}}}. \]
The titles of the plots denote the scenario and the proportion of fake searches added to the original dataset, the x-axis shows the proportion of fake clicks and bookings in each fake search, and the two lines show how the recommendation score changed relative to that of the non-attacked model for the click-only (blue line, `click') and click-book attacks (red line, `click_book'). This holds for all figures in this section.
6.2.1 Recommendation score of the target item under average attack

Figure 10 shows that the average attack does not significantly increase the recommendation score except in three scenarios: {both click-book and click-only, n = 0.01, m = 0.2} and {click-book, n = 0.1, m = 0.2}. However, it cannot be confirmed that these settings reliably lead to an increase in the recommendation score, as the number of simulations was too low to test the result statistically.

Figure 10: Relative recommendation score under the average attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.2.2 Recommendation score of the target item under push attack
The push attack was designed to increase the recommendation score of the target item. Figure 11 reveals that the simulations work best for smaller proportions of fake searches (0.01, 0.05). Also, the click-book scenario seems to yield better results than the click-only one.

Figure 11: Relative recommendation score under the push attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.2.3 Recommendation score of the target item under push-similar attack

This attack successfully increased the recommendation score of the target item; the effect was highest for n = 0.05, the click-book attack and m = 0.1 (see Figure 12). For other configurations the effect was somewhat lower, but in most cases the recommendation score was increased nevertheless.
Figure 12: Relative recommendation score under the push-similar attack (panels for n = 0.01, 0.05, 0.1, 0.2)

6.2.4 Recommendation score of the target item under nuke attack
The goal in this scenario was to lower the recommendation score of the target item. Clearly, this goal was not attained (see Figure 13).
Figure 13: Relative recommendation score under the nuke attack (panels for n = 0.01, 0.05, 0.1, 0.2)

7 Discussion
7.1 Discussion of the results
7.1.1 Effect on the overall performance of the model
The first research question was whether the overall quality of the recommendations could be affected by a shilling attack. For this purpose, several shilling attack scenarios were simulated and the nDCG scores of the resulting predictions were recorded. The results for three scenarios were reported in the previous section, namely for the random, average and push attacks.
The random attack caused the nDCG of the system to drop to ≈ 74% of the original value. This holds for every combination of n and m; increasing n does not decrease the nDCG further. This does not show that the model is somehow robust against such attacks; rather, it shows that the nDCG of a model is bounded from below by the nDCG yielded by a random sort, which is ≈ 71% of the original value. Therefore it can be concluded that adding even a very small proportion of random searches (≈ 0.01) is enough to severely impair the model's performance and render it more or less useless.
The average attack caused the nDCG to drop to the lower bound as well. This again holds for all combinations of n and m. There is no difference between the click-only and click-book scenarios, so there is no need to mount click-book attacks, as these require additional effort for either canceling the bookings or paying for the booked hotels.
The push attack also decreased the nDCG of the model, to ≈ 0.37 (from an original value of ≈ 0.48). Therefore it can be concluded that even attacks aimed at promoting a specific property rather than harming the model severely impair the quality of the recommendations.
As it was shown that it does not require much effort to significantly lower the quality of the recommendations in this case, companies should test their recommenders and take appropriate steps.
7.1.2 Effect on the recommendation score of the target item
The second research question was how the recommendation score of an item can be affected by a targeted shilling attack. Such an attack could be mounted by a property owner who wants to increase bookings for their property or to harm their competition. The results for four scenarios were reported in the previous section, namely for the average, push, push-similar and nuke attacks.
The average attack does not seem to have a reliable impact on the recommendation score. The notable peaks for n = 0.01 and m = 0.2 could indicate a very efficient configuration, but this needs to be tested statistically. The conclusion is that this scenario can potentially be used to mount very efficient attacks.
Subsequently, the impact of the push attack on the recommendation score of the target item was tested. There are again some notable peaks, but the statement from the previous paragraph holds: these need to be tested statistically to determine whether the configurations are highly efficient. For n = 0.01 and n = 0.05, the recommendation score is increased for almost all m. Nevertheless, the increase is mostly not of a large magnitude, so a possible attacker should carefully assess whether the costs associated with mounting such an attack do not exceed the additional profit. For these n, the click-book scenario is more efficient than the click-only one.
The push-similar attack was based on the idea that pushing items with similar characteristics should increase the recommendation score of the target item as well. This attack was most efficient for n = 0.05 and m = 0.1. Other configurations were not particularly efficient, and some were even counterproductive: the recommendation score was lowered in those cases.
The nuke attack failed to attain the pre-specified goal, i.e. to decrease the recommendation score of the target item. Even though for some combinations of the parameters the resulting recommendation score was lower than the original, it remains very close to the original value, so this attack cannot be used.
As the results are at best non-intuitive, it was decided to take a closer look at one specific attack, namely the click-book push attack with n = 0.01 and m = 0.2. It is beyond the scope of this paper to dive into the single CART trees, so the question of interest was how the positions in the search results were affected. The target item of the attack was returned in 6 searches. The positions ascribed to this item in these searches by the non-attacked model were {23, 8, 18, 9, 22, 26}; after the attack, the model ascribed the positions {12, 12, 18, 11, 27, 2} (12th in the first search, 12th in the second, etc.). The recommendation score of this item increased from ≈ 0.42 to ≈ 0.85, so it can be concluded that this configuration worked well for this item and successfully pushed it. As the nDCG in this scenario dropped significantly, the positions of other items were inspected as well; nevertheless, no pattern was found in the differences between the attacked and the non-attacked model.
Overall, the efforts to sway the recommendation score in a given direction were partly successful, but the results were not tested statistically. Specifically, it is possible to increase the recommendation score of an item, but it is not easy to decrease it. Moreover, it is not possible to do so without severely impairing the overall quality of the model, which effectively prevents these attacks from going unnoticed for a longer period of time.
Thus there is evidence that this model is not robust against such attacks. The question is whether the costs of waging such attacks would not exceed the additional profit. Nevertheless, there is still a potential incentive for waging such attacks, so companies should definitely assess the robustness of their recommenders and take appropriate steps to secure their systems.
7.2 Limitations
This research was mainly limited by two facts: the first was the inability to use the actual search page for constructing attacks, and the second was limited computational power.
The ability to use the actual search page would be of interest for creating better-targeted attacks. Targeted searches could create a better environment for pushing certain items and, on the other hand, for nuking the competition by deliberately not clicking on their hotels. This limitation would be best overcome by collaborating with the company owning the recommender, in this case Expedia, Inc.
The limited computational power significantly restricted the number of simulations run. Massive computational power would allow a large number of results to be gathered and, consequently, statistical tests such as the t-test to be used for testing the significance of an attack's effect on the model.
In this thesis a framework for testing shilling attacks was introduced. This framework allows companies to test the robustness of their recommender systems against shilling attacks. This information can further be used for improving the recommender systems as well as for setting up automatic recognition of malicious searches.
7.3 Further research
For further research it is recommended to gather sufficient computational power, run a large number of simulations and test the results with proper statistical methods. Another direction would be to focus on tailor-made searches via the actual search page and to assess how these can be used for decreasing the overall quality of the recommendations and for pushing/nuking a pre-specified item of interest. Based on the results of such further work, procedures for uncovering malicious searches could be developed.
8 Conclusion
The goal of this thesis was to test the robustness of the chosen recommender against shilling attacks and to find out how shilling attacks should be constructed, as well as how extensive they have to be to severely impair the recommender. The robustness was assessed by the relative change in the nDCG of the model and the relative change in the recommendation score (see Section 5) of a previously chosen target item.
The recommender used as the base model in this thesis was the Gradient Boosting Classifier presented as a solution to a Kaggle competition (Liu et al., 2013). It was shown that this recommender's performance decreases drastically even for very small shilling attacks (in terms of the number of fake rows added to the training dataset) and that all attacks cause the resulting recommendations to be only very slightly better than randomly generated ones.
There were some minor successes for attacks aimed at increasing the recommendation score; however, as the number of simulations run was not high enough, these were not tested statistically. Nevertheless, these results are definitely promising and such attacks should be inspected further. The efforts to decrease the recommendation score were not successful, and the results indicate that the model is robust against such attacks.
The simulations presented in this thesis can be used for testing various recommender systems with implicit feedback and for assessing their robustness. Based on the findings, administrators can construct effective security measures and automated procedures for disclosing such attacks.
Appendix A
Table 1: Columns and column types
column - column type

srch_id - Integer
date_time - Date/time
site_id - Integer
visitor_location_country_id - Integer
visitor_hist_starrating - Float
visitor_hist_adr_usd - Float
prop_country_id - Integer
prop_id - Integer
prop_starrating - Integer
prop_review_score - Float
prop_brand_bool - Integer
prop_location_score1 - Float
prop_location_score2 - Float
prop_log_historical_price - Float
position - Integer
price_usd - Float
promotion_flag - Integer
gross_bookings_usd - Float
orig_destination_distance - Float
srch_query_affinity_score - Float
srch_saturday_night_bool - Boolean
srch_room_count - Integer
srch_destination_id - Integer
srch_children_count - Integer
srch_length_of_stay - Integer
srch_adults_count - Integer
random_bool - Boolean
comp1_rate - Integer
comp1_inv - Integer
comp1_rate_percent_diff - Float
. . .
comp8_rate - Integer
comp8_inv - Integer
comp8_rate_percent_diff - Float
click_bool - Boolean
booking_bool - Boolean
Appendix B
These figures illustrate how the click-only attacks were simulated. One ordered set of hotels, which the user sees after searching for a hotel on the Expedia website, is presented (the data consist of such ordered sets). The columns in the middle show the original state; the columns on the right, between the green lines, show how the clicks and bookings were reassigned. Blue denotes the target item, yellow marks the actions ascribed to the target item, and red marks the actions ascribed to other items.
Appendix C
These figures illustrate how the click-book attacks were simulated. One ordered set of hotels that a user sees after searching for a hotel on the Expedia website is presented (the data consists of such ordered sets). The columns in the middle show the original state; the columns on the right, between the green lines, show how the clicks and bookings were reassigned. Blue denotes the target item, yellow marks the actions ascribed to the target item, and red marks the actions ascribed to other items.
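The click-book variant moves both actions to the target item. A minimal sketch under the same assumptions as before (pandas DataFrame with the Table 1 column names; the helper name `inject_click_book` and the attacked fraction are illustrative):

```python
import pandas as pd

def inject_click_book(df, target_prop_id, frac=0.1, seed=0):
    """Simulate a click-book attack: in a sampled fraction of the ordered
    search result sets, reassign both the click and the booking to the
    target item."""
    out = df.copy()
    attacked = pd.Series(out["srch_id"].unique()).sample(
        frac=frac, random_state=seed
    )
    mask = out["srch_id"].isin(set(attacked))
    target = mask & (out["prop_id"] == target_prop_id)
    for col in ("click_bool", "booking_bool"):
        out.loc[mask, col] = 0    # strip the action from other items
        out.loc[target, col] = 1  # ascribe it to the target item
    return out
```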