The effect of personalized product suggestions on customer purchases

N/A
N/A
Protected

Academic year: 2021


The effect of

personalized product suggestions

on customer purchases

MSc Thesis Econometrics

Fleur Wijtman

s2736403

Econometrics, Operations Research & Actuarial Studies

Faculty of Business and Economics

University of Groningen


The effect of personalized product suggestions

on customer purchases

Author: Fleur Wijtman

Abstract


Contents

1 Introduction
  1.1 Personalization
  1.2 Research goal
    1.2.1 Research questions
2 Literature
  2.1 Recommendation methods
    2.1.1 Method comparison
  2.2 Customer loyalty
    2.2.1 Purchase behaviour
  2.3 Model evaluation
    2.3.1 Predictive accuracy
    2.3.2 Practical performance
3 Research Design
  3.1 Market
    3.1.1 Implicit feedback
    3.1.2 Current shopping behaviour

Bibliography


1 Introduction

1.1 Personalization

A personalized shopping experience can be used to attract and retain customers amidst the ongoing war for their attention. Personalization implies knowing and activating customers: what do they want, when do they want it and how do they like to be informed? By anticipating those customers' needs, it could also enlarge customer loyalty. The degree of loyalty shows up, for example, in the purchase amount, but has to do with customer goodwill too (Dick and Basu, 1994). Ideally, an increase in this purchase amount is not random but actually due to a customer who is more satisfied with the brand or firm.

A way to promote customer purchase behaviour is by making use of a loyalty card program (Kang, Alejandro, and Groza, 2015). Although effectiveness seems to differ per market segment (Bijmolt, Dorotic, and Verhoef, 2010), many firms have adopted such a loyalty program to invest in a long-term relationship with their customers. With every purchase made and/or euro spent, the customer earns a (personalized) discount or some kind of loyalty points. Loyalty members are even willing to share some of their personal data in return. This data, combined with purchase history, can be used to gain more insight into customer behaviour. Hence, personalization can be optimized even further, which is beneficial for the customer.

Firms could highly benefit from this personalization if such customers indeed increase their spending due to the program, and hence revenue increases. To encourage members to purchase more, they should be triggered to buy additional or more expensive products. Using the correlations in the existing customer data, it might be possible to detect some latent demand and to generate corresponding product suggestions. Personalized product suggestions for new-to-customer products then aim at increasing the number of transactions and the basket value of loyalty members.

Many major companies have already applied a recommendation system in some form. For example, YouTube shows a list of video recommendations on the homepage, which accounts for 60% of video clicks on the homepage and for 30% of overall views (Zhou, Khemmarat, and Gao, 2010; Davidson, Liebald, et al., 2010). For online businesses, success seems to depend heavily on the power of recommendations: Netflix offered $1,000,000 to whoever could improve the performance of their current system by only 10% (Bell, Koren, and Volinsky, 2010).

1.2 Research goal


data from the Dutch drugstore Etos is used. Shopping frequency for drugstores is far lower than for supermarkets, while supermarkets also offer a range of the main health- and beauty-care products. So, it is of great importance to invest in customer loyalty and thus attract customers to either one of the hundreds of physical stores or the corresponding webshop.

The main challenge when creating a product recommender is choosing the personalization method which suits the firm best. This choice will be based on a trade-off between accuracy, scalability and efficiency. Although the possibilities regarding recommendation methods are endless, algorithms are often based on either collaborative filtering or the content-based approach (Jacobs, Donkers, and Fok, 2016), which hence will serve as the focus. The latter method has a rich history in marketing and is suited to include detailed characteristics. Collaborative filtering searches for relationships using explicit customer ratings or, for implicit data, by relying on, for example, the number of co-purchased items in history (Liu, Lai, and Lee, 2009).

1.2.1 Research questions

The main research question is:

What is the effect of personalized product suggestions on customer purchases?

This question is answered using two sub-questions. The first one is:

Which model makes the best recommendations in terms of customer responses?

As several methods exist to create personalized recommendations, several algorithms are proposed and their real-life performance is compared using A/B testing. Among them are a random recommender, the current method and two newly proposed methods. To determine online customer responses, the click and buying rates are used as a measure. Although these measures might not completely coincide, the most advanced method (collaborative filtering) is expected to perform best. The other point of interest lies in analysing subsequent consumer behaviour to answer the other sub-question:

Could these product recommendations be used to increase customer spending?

The effects of those recommendations on customer spending are then investigated using regression analysis, so that more factors influencing customer behaviour can be taken into account.


2 Literature

2.1 Recommendation methods

To enhance customer loyalty and satisfaction, it is essential to match consumers with the most appropriate products (Koren, Bell, and Volinsky, 2009). As firms providing a loyalty card program are able to match all transaction data with personal characteristics of loyalty members, their databases are huge. Because of this, Jacobs, Donkers, and Fok (2016) state that scalability should be of the same importance as the predictive performance of an algorithm.

A system recommending one of the most sold products probably already performs better than a randomly selected item. However, this procedure is not very precise and takes neither grey sheep nor the popularity bias into account (Abdollahpouri et al., 2020). Grey sheep refers to the group of customers who do not consistently (dis)agree with any other group of people. The larger the popularity bias, i.e. the more the popular items are suggested by default, the less appropriate the recommendations will be. In order to make more personalized decisions, the system must be able to compare users and rank products by predicting their utility. Moreover, in the case of purchase history, no explicit ratings or preferences exist. In particular, it is hard to say whether a customer dislikes a product when he or she has not bought it yet. Nevertheless, there exist several algorithms to calculate the utility of a product, which might also depend on other variables (Ricci, Rokach, and Shapira, 2011).

The most commonly used recommendation types are based on one of the following three techniques (Ricci, Rokach, and Shapira, 2011):

1. Content-based filtering (CB): items with the same characteristics as items which the user liked before are recommended, based on a kind of item or user profile.

2. Collaborative filtering (CF): this technique could be based on either items or users.

• User-based: takes users' preferences and characteristics to search for similar users, thereafter recommending items those users liked.

• Item-based: uses the users' feedback on items to find similar items which are still unrated. So, this method uses item similarity in terms of rating patterns, instead of the item characteristics used by CB techniques.


Figure 2.1: CF vs CB

3. Hybrid: an extension of CF and CB methods into a hybrid model could help to improve the recommendations. Such hybrid models try to fix some of the shortcomings of one method with the advantages of another (Thorat, Goudar, and Barve, 2015). For instance, the model could account for seasonal trends (context-awareness model) or avoid possibly offensive recommendations using age and gender characteristics (demographic-based model). Another option is to use a neural network, which might be able to model non-linear relationships in the data.

2.1.1 Method comparison

The performance of the above methods has been studied in different situations, but when comparing them, no single recommendation approach is found to be most promising. Study results turn out to be hard to compare due to the different evaluation strategies used, implying that the best choice depends on the research objectives (Beel, Langer, et al., 2013; Said and Bellogín, 2014; Beel, Gipp, et al., 2016). The relevant methods are now discussed in more detail, taking into account their advantages as well as their disadvantages.

Content-based filtering


information about their rated items. Next, this past behaviour of users is used to recommend products similar to past interactions. A drawback of this method is that the feedback of other users is not taken into account (Thorat, Goudar, and Barve, 2015), introducing the cold-start problem: new customers or items cannot be served by the system, as there is no historical data available yet (Pazzani, 1999).

Collaborative filtering

With collaborative filtering, the missing user-item ratings are calculated based on a user's feedback and an item-to-item similarity matrix, and stored in a corresponding user-item matrix (Hu, Koren, and Volinsky, 2008). There are two primary areas of CF to predict new ratings (Koren, Bell, and Volinsky, 2009; Thorat, Goudar, and Barve, 2015).

1. Neighbourhood methods (NBM): measure similarity between users and/or items (neighbours), trying to predict the utility for a specific user/item. This method is able to make diverse recommendations, but is not very efficient in terms of computation time. A solution might be to make these calculations in advance instead of in real time.

2. Latent factor models (LFM): try to describe users and/or products by a number of factors derived from the rating patterns. LFM then predicts for all customers the utility of their unrated items, using the so-called user-item matrix. An efficient method to handle this matrix factorization is Alternating Least Squares (ALS), which is able to deal with implicit feedback (Hu, Koren, and Volinsky, 2008; Takács and Tikk, 2012).

Latent factor models make use of matrix factorization to predict the missing ratings. In practice, the number of missing ratings is far larger than the number of observed ones, resulting in a highly sparse matrix. Luckily, LFM is able to handle sparse matrices. For efficiency, however, the missing ratings are often all weighted the same. This simplifies the algorithm, which improves the speed but worsens the predictions. To improve predictions, some methods have been proposed using alternative weighting schemes instead of standard matrix factorization methods like (weighted) ALS (Devooght, Kourtellis, and Mantrach, 2015; He, Zhang, et al., 2016; He, Tang, et al., 2019).

Neural networks


(McCulloch and Pitts, 1943). Neural networks are nowadays part of machine learning techniques and can be used in supervised learning problems, where observed values can be compared to predicted ones. The chance of overfitting the training data is however very high, and needs to be taken into account (Hastie, Tibshirani, and Friedman, 2009).

In the case of a multi-layer approach, the network consists of an input and an output layer with hidden layers in between. Those hidden layers try to model the non-linear relationships in the data, but also create a so-called black box: due to the hidden layers, no insights can be gained regarding the data structure. Furthermore, reviews of state-of-the-art methods showed that the use of neural networks rarely outperformed non-neural ranking methods (Dacrema, Cremonesi, and Jannach, 2019; Ludewig et al., 2019). However, those comparisons were solely based on baselines using nearest-neighbour methods instead of matrix factorization techniques. Besides, Ludewig et al. (2019) considered only session-based recommenders, which do not use purchase history but only make use of interactions within an online shopping session. The results of the neural CF algorithm introduced by He, Liao, et al. (2017) were also contradictory, although the method performed well on an implicit feedback dataset. This algorithm was revisited by Rendle et al. (2020), who argue that the multilayer perceptron used to replace the dot product is not a better default choice.

2.2 Customer loyalty

Loyalty can be measured along two dimensions, namely behavioural and affective loyalty (Gomez, Arranz, and Cillán, 2006). This implies, respectively, actual transactions versus the reasons why those transactions are made. A widely used measure of behavioural loyalty is the share-of-wallet, indicating the share of money spent at a specific store compared with the total spending in the market (Mägi, 2003; Fox and Thomas, 2006). The share-of-wallet might give insight into a consumer being a primary or secondary customer, but also requires market data. A higher share-of-wallet also results in a longer lifetime duration (Meyer-Waarden, 2007). However, these customers are also more likely to join the loyalty program, which might introduce a self-selection problem which needs to be accounted for in the analysis (Leenheer et al., 2007; Meyer-Waarden and Benavent, 2009).

2.2.1 Purchase behaviour


Besides, the increase in sales seems to be larger for less popular products, probably due to lower search costs in the case of well-known items (Chen, Wu, and Yoon, 2004). The effect on sales is however larger for more recent recommendations, which furthermore have a positive effect on the firm's cross-sell as well as on price-level perception. Recommendations moreover have a larger effect on sales than the customer reviews or ratings provided in a webshop (Pathak et al., 2010).

2.3 Model evaluation

Some potentially relevant properties of recommendations are utility, novelty, diversity, serendipity, coverage and unexpectedness (Silveira et al., 2019). The main need of customers hereby is utility, since this reflects to what extent customers like the recommendations. Utility defines the relevance of received recommendations as well as the order of consumption preference (Ricci, Rokach, and Shapira, 2011). Utility can then be measured by either implicit or explicit customer ratings. Specifically, in the case of purchase history, the rating is visible as the number of times a product is bought.

In order to evaluate the associated utility of the models of interest, the research objectives must be clear. At first, a suitable model needs to be found. Model parameters can be adapted to fit the data so that the predictive performance of the algorithm can be optimized. Another point of interest, however, lies in practical performance, like actual customer reactions. These two kinds of measures might yield contradicting results though (Garcin et al., 2014). Moreover, Beel, Genzmehr, et al. (2013) argue that focusing solely on predictive performance could ignore human factors.

2.3.1 Predictive accuracy

To determine model parameters, the historical data is split into two parts: a train and a test set. These are used to compare the performance of an algorithm in- and out-of-sample, respectively. Commonly, the focus is on predictive accuracy, i.e. the difference between the predicted rating $\hat{y}_i$ and the observed rating $y_i$. The most used error measures are listed below.

• Mean Absolute Error (MAE): measures the mean of the absolute differences.

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|. \qquad (2.1)$$

• Root Mean Squared Error (RMSE): a quadratic way to determine the magnitude of the error, penalizing large errors more than small errors. It takes the square root of the mean of the squared differences.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. \qquad (2.2)$$
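As an illustration, both error measures can be computed in a few lines of Python (a minimal sketch; the data values are made up):

```python
import math

def mae(y_true, y_pred):
    # Mean Absolute Error: mean of the absolute differences (eq. 2.1)
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: penalizes large errors more (eq. 2.2)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true))

observed = [3.0, 1.0, 4.0, 2.0]
predicted = [2.5, 1.5, 3.0, 2.0]
print(mae(observed, predicted))   # 0.5
print(rmse(observed, predicted))  # about 0.612
```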


in accurate rankings (Balakrishnan and Chopra, 2012; Zhang et al., 2016). Therefore, some proposed measures focus on the precision of the recommendation list.

• Top K accuracy (precision@K): the percentage of true items which are in the top K of the predicted/recommended list:

$$precision@K = \frac{\#\text{ bought items in top } K}{\#\text{ total items in top } K}. \qquad (2.3)$$

• Rank Order Error Metric (ROEM): a recall-oriented measure taking the actual place in the ranking list into account by calculating the percentile rank.

$$ROEM = \frac{\sum_{u,i} r^t_{u,i} \, rank_{u,i}}{\sum_{u,i} r^t_{u,i}}, \qquad (2.4)$$

where $t$ indicates that the rating comes from the test set. Furthermore, $rank_{u,i}$ represents the percentile ranking of item $i$ on the predicted (ordered) list of all items for user $u$. Hence, $rank_{u,i} = 0\%$ indicates that the item is on top of the preference list, while $rank_{u,i} = 100\%$ tells that the item is predicted to be least preferred (Hu, Koren, and Volinsky, 2008).
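Both list-based measures can be sketched in plain Python; the user, item ids and ratings below are illustrative, not from the thesis:

```python
def precision_at_k(bought, ranked_items, k):
    # Fraction of the top-K recommended items that were actually bought (eq. 2.3)
    top_k = ranked_items[:k]
    return sum(1 for item in top_k if item in bought) / k

def roem(test_ratings, ranked_lists):
    # test_ratings: {(user, item): test-set rating r^t_ui}
    # ranked_lists: {user: items ordered from most to least preferred}
    # Percentile rank: 0.0 = top of the list, 1.0 = least preferred (eq. 2.4)
    num = den = 0.0
    for (u, i), r in test_ratings.items():
        items = ranked_lists[u]
        rank = items.index(i) / (len(items) - 1)
        num += r * rank
        den += r
    return num / den

ranked = {"u1": ["a", "b", "c", "d", "e"]}
test = {("u1", "a"): 2.0, ("u1", "e"): 1.0}
print(precision_at_k({"a", "e"}, ranked["u1"], 3))  # only 'a' is in the top 3
print(roem(test, ranked))  # (2*0.0 + 1*1.0) / 3
```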

2.3.2 Practical performance

Error metrics can thus be used to predict and optimize the ranking of items, but they do not tell anything about user acceptance. Running several algorithms on different subgroups, for instance via A/B testing, makes it possible to compare customer reactions (Young, 2014; Gomez-Uribe and Hunt, 2015; Gilotte et al., 2018). By checking which customers actually interacted with the recommended product, the following measures can be calculated.

• Click-through-rate (CTR): represents the fraction of customers who opened an email that clicked through.

$$CTR = \frac{\#\text{ clicking customers}}{\#\text{ total customers}}. \qquad (2.5)$$

• Buying-rate (BR): represents the fraction of customers buying an offer out of the total number of customers who received the offer.

$$BR = \frac{\#\text{ buying customers}}{\#\text{ total customers}}. \qquad (2.6)$$


3 Research Design

3.1 Market

In this research, the Dutch drugstore chain Etos is considered. It serves over a million customers in more than 500 stores and its webshop. A few years ago, a loyalty card program was introduced in which loyalty points are earned with every euro spent. When a specific amount is reached, these points can be exchanged for free products or discounts. Currently, some segmented email marketing is taking place, but the goal is to make this communication more personal. A pilot has been done with a basic recommendation system, which already performed better than the manually selected recommendations that only differed for men, women and mothers. However, this recommendation method has not been put into use, as its performance should be investigated in more detail.

The goal of this personalization is to increase the spending of loyalty members by recommending products which have not recently been bought but are probably relevant. In short, transaction data of all customers over the past three years is available. However, only for loyalty members can behaviour be tracked over time. Personal recommendations are hence created for these loyalty members, who receive the recommendations via weekly (opt-in) emails. The acceptance of a recommendation should be visible as a product which is clicked through or even bought.

Recommendation systems were primarily introduced in the setting of entertainment products, such as Netflix' movies and series (Hallinan and Striphas, 2016). The use of these hedonic products is not strictly necessary, so the context of this research is somewhat different. Actually, a drugstore offers a wide variety of both luxury and pharmaceutical products. More recent studies also investigated the performance of recommender systems for (online) grocery shopping (Yuan et al., 2016; Mackenzie, 2018). Although supermarkets and drugstores partly offer the same range of products, supermarket studies often excluded health- and beauty-care products. Moreover, the shopping frequency and discount awareness differ between supermarkets and drugstores in the Netherlands (Van Lin and Gijsbrechts, 2016).


3.1.1 Implicit feedback

Transaction data can be used to determine how much a customer likes a product, since the number of interactions with each product is known. Relying on the number of times a product has been purchased is called implicit data, because it is not guaranteed to reflect customer satisfaction. In addition, the lack of a transaction does not mean that the customer dislikes the product, only that the customer has not bought it (in this store) yet. Moreover, the popularity of products differs a lot, and the average quantity needed differs per product, as the usage frequency differs. Hence, these product differences need to be taken into account by normalizing the rating variable.
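The text does not fix a normalization formula at this point; as one hypothetical scheme, purchase counts could be scaled by each product's average purchase count, so that frequently replenished products do not dominate the implicit rating:

```python
from collections import defaultdict

def normalize_ratings(purchase_counts):
    # purchase_counts: {(member, item): times bought}
    # Divide each count by the item's average purchase count, so that
    # fast-moving products (e.g. shampoo) and slow-moving products
    # (e.g. perfume) become comparable as implicit ratings.
    totals, buyers = defaultdict(float), defaultdict(int)
    for (m, i), c in purchase_counts.items():
        totals[i] += c
        buyers[i] += 1
    item_mean = {i: totals[i] / buyers[i] for i in totals}
    return {(m, i): c / item_mean[i] for (m, i), c in purchase_counts.items()}

counts = {("u1", "shampoo"): 6, ("u2", "shampoo"): 2, ("u1", "perfume"): 1}
print(normalize_ratings(counts))
# shampoo mean = 4, so u1 -> 1.5 and u2 -> 0.5; perfume mean = 1, so u1 -> 1.0
```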

3.1.2 Current shopping behaviour

The outbreak of COVID-19 during the spring of 2020 might have changed the circumstances, since the virus caused several unusual effects on customers' shopping behaviour. Due to the virus, people were advised to stay at home as much as possible, resulting in fewer shop visits and larger average spending per transaction. Some products like disinfectants were sold in enormous amounts, while they were not popular before March. Besides, this probably negatively affected the sales of products from the overlapping assortment. This idea is supported by the fact that Dutch supermarkets benefited from consumers' hoarding behaviour and generated more sales than ever. While this effect was largest in the first weeks of March, their sales kept increasing in the later months of 2020 (CBS, 2020). The drugstore moreover lost many brick-and-mortar customers, mainly for shops located in city centres. The virus is still influencing shopping behaviour, which in turn affects the recommendations. Next to this unusual time effect, weekly marketing campaigns might also give rise to momentary changes in customer behaviour. This explains the urgency to test all versions of the recommendation systems in the same week(s) by making use of A/B tests and representative subsets.

3.2 Data

The data used for this research consists of several datasets, whose combination will be used to get a complete overview of the loyalty members and (recommended) products. The data will now be described in more detail using some summary statistics, after which some constraints will be explained. These constraints will moreover determine the number of unique members and products to be used for this research.

3.2.1 Data description

The following datasets are processed and stored using Microsoft Azure.


2. Product data: extensive information on every single product which is or was sold in (some of) the stores.

3. Customer data: containing some demographic information on loyalty card members, but also some summary statistics about past purchase behaviour.

All these different data types are updated on a daily basis but are stored in different tables. Hence, the transaction, product and customer data need to be joined using unique identifiers for analysis. Examples are shown in the next sections.

Transaction lines

Transaction data of every purchase made in either the physical stores or the webshop is available. Basically, all information which is visible on a receipt, like product id, transaction amount, possible discount, and the timing and location of the purchase, is available. As can be seen in Tables 3.1 and 3.2, this information is stored in separate tables: the first one consists of all basic information, while the second table shows more details about the transaction.

Table 3.1: Transaction totals

TransactionId  Date        Online  StoreNbr  MemberId  Quantity  SalesEUR
000000001      01-01-2020  false   632       null      3         18.39
000000002      01-01-2020  true    783       12345678  5         33.50
000000003      01-01-2020  false   278       null      2         3.98

Table 3.2: Transaction lines

TransactionId  Date        MemberId  ItemId  Quantity  PriceEUR  SalesEUR
000000001      01-01-2020  null      12345   1         7.99      7.99
000000001      01-01-2020  null      67890   1         6.49      5.00
000000001      01-01-2020  null      L2468   1         5.40      5.40

If a transaction is made by a loyalty member, the transaction is registered by adding their unique member id. In this way, all transactions of a member can be analysed over time. On average, loyalty members make up between 20% and 30% of total sales and of the total number of transactions. This implies that the rest of the transactions are not linked to a specific customer and consequently cannot be used for analysis.
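Assuming the tables are loaded as pandas DataFrames, the join on MemberId might look as follows (a sketch with made-up rows; column names follow Tables 3.1 and 3.4):

```python
import pandas as pd

# Hypothetical miniature versions of the transaction and customer tables
transactions = pd.DataFrame({
    "TransactionId": ["000000001", "000000002"],
    "MemberId": [None, "12345678"],
    "SalesEUR": [18.39, 33.50],
})
customers = pd.DataFrame({
    "MemberId": ["12345678"],
    "B2B": [False],
})

# An inner join on MemberId keeps only transactions linked to a loyalty member,
# since anonymous transactions (MemberId = null) cannot be analysed over time.
member_sales = transactions.merge(customers, on="MemberId", how="inner")
print(member_sales[["TransactionId", "MemberId", "SalesEUR", "B2B"]])
```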

Product information


Table 3.3: Product data

ItemId  Brand  Online  Department  Class  Subclass  SalesEUR  Privacy
12345   Nivea  true    1           6      2         7.99      false
67890   Nivea  true    3           28     1         6.49      false
L2468   null   false   2           7      4         5.40      false

Customer characteristics

To register as a loyalty member, a full name and email address are required. Besides, one can choose to fill in information about date of birth, home address and newsletter interests. This information is of course private and not easy to access. In the customer database, only the MemberId is visible, and additional loyalty information is stored, like the duration of membership, the first transaction, being a business-to-business (B2B) customer, the number of visits and total spending. An example can be found in Table 3.4.

Table 3.4: Customer data

MemberId  B2B    FirstVisit  LastVisit   NrVisits  Quantity  SalesEUR  Active
12345678  false  05-06-2019  01-01-2020  11        15        100.00    true
23456789  false  12-12-2019  01-01-2020  2         6         41.14     true
34567890  false  01-01-2020  01-01-2020  1         1         9.99      true

Besides, some transaction information is calculated for all loyalty members. This includes, for instance, the average spending per transaction, the total number of transactions, the number of items bought and their preferred store. Some relevant variables are shown in Table 3.5. Note that these statistics are calculated over the whole membership length, and hence rows 1 and 2 indicate the total number of transactions and products bought. In row 3 of Table 3.5, the average sales per transaction during the membership are calculated, where negative spending indicates a transaction in which products are returned for money. Furthermore, row 4 shows the number of stores visited. Lastly, rows 5 and 6 show how many days ago the membership started and the number of days since the last transaction.

Table 3.5: Summary statistics over membership duration

Variable                  Mean     St.Dev  Min     25%    50%    75%    Max
TransactionCnt            18.82    31.05   0       0      6      25     3853
QuantityCE                123.37   194.24  1       18     61     161    53539
AverageSpendEUR           17.97    12.32   -31.51  11.22  15.51  21.52  2645.02
NrOfStores                2.24     2.57    0       0      6      25     78
DaysLoyaltyMember         1159.08  524.60  0       742    1288   1665   1962
DaysSinceLastTransaction  219.15   286.46  1       22     77     332    1242


3.2.2 Constraints

Some transformations of the data are needed to ensure that the results are representative for the entire customer base and hence suited to generate appropriate recommendations. As seen in the previous section, there are several situations or outliers which need to be accounted for.

Looking at Table 3.5, the maximum value of the average spending is huge. Customers making these huge purchases are possibly business-to-business customers, who do not reflect the needs of an average customer. Hence, customers with an average spending above €100 per basket are excluded from the analysis. Furthermore, customers who opt out of the weekly emails are not able to receive their personalized recommendations. However, the algorithm will still generate product recommendations for all loyalty customers. Including all customers will improve the performance of the algorithm, as it is more likely to find similar customers. Besides, this makes the algorithm future-proof in case a customer changes their mind.

To ensure that the recommendations are relevant, they must be up to date. Firstly, historical recommendations are stored in a separate table. This guarantees diversified recommendations, as items recommended before can in this way be filtered out of the recommendation list. In addition, including all available transactions might not represent today's preferences nor the current family situation. Someone could have been single three years ago but may now have a husband and a baby. Or the newborn who was still in diapers is a toilet-trained toddler by now. Because of this, only the last year's transactions (365 days) are included when comparing customers' buying behaviour. As mentioned before, an additional indicator for bought diapers is used before recommending baby products. This is important, since baby products make up a substantial amount of total sales, but could lead to offensive situations if not recommended properly.
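The two quantitative constraints, excluding baskets averaging above €100 and restricting to the last 365 days, can be sketched with pandas (illustrative data; column names follow Table 3.5):

```python
import pandas as pd

# Hypothetical customer summary with Table 3.5's column names
customers = pd.DataFrame({
    "MemberId": ["a", "b", "c"],
    "AverageSpendEUR": [15.51, 250.00, 21.52],
    "DaysSinceLastTransaction": [22, 10, 400],
})

# Constraint 1: drop likely B2B customers with an average basket above EUR 100
customers = customers[customers["AverageSpendEUR"] <= 100]

# Constraint 2: only compare buying behaviour over the last 365 days
recent = customers[customers["DaysSinceLastTransaction"] <= 365]
print(recent["MemberId"].tolist())  # ['a']
```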


Figure 3.1: Number of products reflecting total sales

This results in a total of 5874 products which could potentially be used as recommendations. However, online availability differs on a daily basis. And since recommendations are going to be included in the weekly emails with a corresponding link to the product page in the webshop, it is important that the product is indeed sold online. Hence, the set of final products is not constant over time and moreover differs per person, as personal history is taken into account.

3.3 Research methods

As the data is very sparse and there are far more customers than products, the item approach is preferred over the user approach. This holds for the algorithms based on both CB and CF. In practice, consumers might recognize their CB recommendations as products similar to what they bought before. CF recommendations might be less familiar, as these are based on the purchases of unknown but similar customers. Furthermore, both methods are able to deal with large datasets, so there should be no scalability or efficiency problems. The performance of the suitable recommender systems is tested by making use of A/B tests, together with some other versions which will be explained in more detail in the next sections.

3.3.1 Content-Based


words describing item i are proportional to the number of times that word occurs in all item descriptions. However, as stop and/or linking words are so common, the weight of these words needs to be reduced. A popular content-based technique which takes both these factors into account is called Term Frequency-Inverse Document Frequency (TF-IDF) (Beel, Gipp, et al., 2016).

TF-IDF

Mathematically, this method can be described as follows (Havrlant and Kreinovich, 2017). Let the term frequency for word t = 1, .., M in description d = 1, .., N be defined as tf (t, d). Correspondingly, the total of all descriptions d is reflected by D. Then, the number of descriptions which contains term t is referred to as document frequency df (t). Based on the total number of documents N , the inverse document frequency is calculated:

idf (t) := ln N

df (t). (3.1)

Keywords for a particular description d are then words t with the highest value of the following product:

tf -idf (t, d, D) := tf (t, d) · idf (t, D). (3.2) A high value of tf -idf is consequently obtained when the term frequency is high for term t in the given description d, but low for all other descriptions. This processing method can moreover be told to normalize characters and uppercase letters before tokenizing, as well skip words which are occurring in too less or many documents. Given the weights (3.2) on words t for descriptions d, similarity scores between two descriptions, and hence products, can be calculated. Each description is defined as an M-dimensional vector, where each dimension contains the weight of a partic-ular word (which is zero if word t is not in description d). The cosine similarity is a robust measure (Tata and Patel, 2007; Alodadi and Janeja, 2015) to reflect this degree of similarity between vectors, say, x and y, defined as:

cos(x, y) = (x · yᵀ) / (||x|| · ||y||) = (Σ_{t=1}^{M} x_t y_t) / (√(Σ_{t=1}^{M} x_t²) · √(Σ_{t=1}^{M} y_t²)). (3.3)

Calculation of (3.3) results in a matrix of shape D × D, which yields a cosine similarity score of each description d with every other description d (including itself, implying a similarity score of 1).
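The TF-IDF weighting (3.1)-(3.2) and the cosine similarity (3.3) can be illustrated with a small self-contained sketch (pure Python; the toy descriptions are made up, and raw term counts with a natural-log idf are assumed, without the normalization and frequency-cutoff options mentioned above):

```python
import math
from collections import Counter

# Toy item descriptions (hypothetical examples, not from the thesis data)
docs = ["mild shampoo for dry hair",
        "shampoo for oily hair",
        "whitening toothpaste fresh mint"]

N = len(docs)
tokens = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokens for t in doc))

# df(t): number of descriptions containing term t
df = {t: sum(t in doc for doc in tokens) for t in vocab}

def tf_idf_vector(doc):
    """Weight vector per (3.2): tf(t, d) * ln(N / df(t))."""
    tf = Counter(doc)
    return [tf[t] * math.log(N / df[t]) for t in vocab]

vecs = [tf_idf_vector(doc) for doc in tokens]

def cosine(x, y):
    """Cosine similarity as in (3.3)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# D x D similarity matrix; diagonal entries equal 1 (up to rounding)
sim = [[cosine(a, b) for b in vecs] for a in vecs]
```

The two shampoo descriptions share several terms and therefore get a positive score, while descriptions without common terms score exactly zero.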

Item descriptions


Figure 3.2: Vocabulary example

Next, the corresponding similarity matrix for all items is defined, with shape 5874 × 5874. Recommendations are then obtained by looking at the column with similarity scores for one specific item, sorting them in descending order and hence extracting the most similar items. An example of such output can be found in Figure 3.3, where first the item is displayed (similarity score 1.0) and thereafter the 10 most similar products. Note that the first column is the item description, but item id and similarity scores can be found between brackets.

Figure 3.3: Similarity scores based on item description

Item characteristics


Figure 3.4: Similarity scores based on item characteristics

Optimal values

As can be seen in Figure 3.4 (and Figure 3.3), the similarity scores are calculated for the same product. However, the output differs a lot. The recommended products have higher scores but are not as similar as in the previous case. Some products are not suited for men at all, while the latter six products all yield the same similarity score. Investigating more products results in the same pattern. Moreover, results did not improve by changing some of the model options or (combinations of) characteristics. Hence, the first method, based on item descriptions, seems more suitable and will be tested in practice.

3.3.2 Collaborative Filtering

As found in the literature section, it is important to allow for the implicit feedback and sparsity in the data. A suitable approach is Alternating Least Squares (ALS) which uses a Singular Value Decomposition (SVD) for the item-user matrix in the LFM.

Alternating Least Squares

First, some parameters are defined, following the approach of Hu, Koren, and Volinsky (2008). Let u, v indicate a user, i, j an item, and consequently r_ui is the observed rating from user u for item i. p_ui is then used as a binary indicator to distinguish between a known or unknown preference. So, p_ui = 1 if the implicit rating is known (r_ui > 0), whereas p_ui = 0 if not (r_ui = 0). However, zero interaction does not necessarily mean the customer does not like the product in the case of implicit feedback. At the same time, a consumer who bought a product just once (and was disappointed by it), shows way less preference than a consumer buying the item weekly. Hence, the degree of confidence differs between those ratings, but in general a higher value of r_ui indicates a higher confidence level in observing p_ui. This relationship can, for example, be captured linearly in the confidence variable

c_ui = 1 + α r_ui, (3.4)

where α controls the confidence rate. This transformation of r_ui into the two variables p_ui and c_ui should imply a better representation of the implicit data, improving the predictions.

Let f be the number of latent factors. Then, the goal is to find vectors x_u ∈ R^f ∀u with user-factors and y_i ∈ R^f ∀i with item-factors. For direct comparison, these vectors x_u and y_i will be used to map all users and items into a latent factor space. The rating estimation can be determined using the inner product of the vectors x_u and y_i as x_uᵀ y_i. Instead of measuring the ratings matrix R directly, the preference matrix P is calculated, consisting of the preferences p_ui. Now, a matrix factorization technique is used to predict the preferences for every user u = 1, .., m and every item i = 1, .., n, while accounting for the varying confidence of observed ratings. This amounts to minimizing the following objective function:

min_{x_u, y_i} Σ_{u,i} c_ui (p_ui − x_uᵀ y_i)² + λ (Σ_u ||x_u||² + Σ_i ||y_i||²). (3.5)

The second part of the sum is added to prevent the model from overfitting the training data (Zhou, Wilkinson, et al., 2008). This method is called Alternating Least Squares with Weighted-λ-Regularization (ALS-WR), where the value of λ depends on the dataset. Note that the objective function (3.5) contains m × n terms, which can be an enormous job in case of both a large number of customers and of products. Hence, the minimization process has to be handled efficiently. ALS fixes either the user or the item factors, which leads to a quadratic objective function whose global optimum can be reached much faster.

The optimization algorithm will use a stopping criterion based on the value of the RMSE, defined as in (2.2). In this case, the user-ratings are fixed and the distance between the predicted and the actual value of an item is calculated. For item-ratings the same formula (2.2) applies, so that y_i can be replaced by x_u and n by m. The ALS algorithm works iteratively. During each iteration, either factor matrix X or Y is held constant, while the other is determined using least squares. When this matrix is solved, it is held constant while optimizing the other matrix. Next to the hyperparameters α and λ, as used in (3.4) and (3.5) respectively, the ALS model makes use of several other input variables. The values of these parameters have default levels, but can be changed in order to optimize the prediction accuracy.

ALS algorithm works iteratively. During each iteration, either factor matrix X or Y is held constant, and in the mean time the other is determined using least squares. When this matrix is solved, it is held constant while optimizing the other matrix. Next to hyperparameters α and λ as used respectively in (3.4) and (3.5), the ALS model makes use of several other input variables. The value of these parameters do have a default level, but can be changed in order to optimize the prediction accuracy. • numBlocks: the number of user or product blocks to parallelize the

computa-tion. By default these are both set to 10.

• rank: the number of latent factors f, defaults to 10.

• maxIter: the maximum number of iterations, default is 10.

• regParam: the regularization parameter λ, defaults to 1.0.

• implicitPrefs: true in case of implicit feedback; the default is false, referring to explicit feedback.

• alpha: parameter reflecting the baseline confidence in implicit preferences, by default set to 1.0.


• coldStartStrategy: for users or items not present in the training and/or test set, the model generates NaN results by default. This will mess up the evaluation metric and hence model comparison. When this parameter is set to drop, the missing values are not taken into account.
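The alternating update can be illustrated with a toy numpy sketch: with Y fixed, each x_u in (3.5) has a closed-form weighted least-squares solution, following Hu, Koren, and Volinsky (2008). The matrix sizes and toy ratings below are illustrative, not the thesis implementation or its actual parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit ratings r_ui (purchase counts), 4 users x 3 items
R = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 0., 0.],
              [0., 1., 2.]])
alpha, lam, f = 10.0, 0.05, 2          # confidence rate, regularization, latent factors
P = (R > 0).astype(float)              # preferences p_ui
C = 1.0 + alpha * R                    # confidences c_ui per (3.4)

X = rng.normal(scale=0.1, size=(R.shape[0], f))   # user factors x_u
Y = rng.normal(scale=0.1, size=(R.shape[1], f))   # item factors y_i

def solve_users(Y, C, P, lam):
    """One half-iteration: with Y fixed, each x_u has the closed-form
    solution (Y' C^u Y + lam*I)^-1 Y' C^u p(u) of objective (3.5)."""
    X_new = np.empty((C.shape[0], Y.shape[1]))
    for u in range(C.shape[0]):
        Cu = np.diag(C[u])                            # c_ui on the diagonal
        A = Y.T @ Cu @ Y + lam * np.eye(Y.shape[1])
        b = Y.T @ Cu @ P[u]
        X_new[u] = np.linalg.solve(A, b)
    return X_new

def objective(X, Y, C, P, lam):
    """The weighted regularized objective (3.5)."""
    return np.sum(C * (P - X @ Y.T) ** 2) + lam * (np.sum(X**2) + np.sum(Y**2))

before = objective(X, Y, C, P, lam)
X = solve_users(Y, C, P, lam)          # alternate: next, fix X and solve Y analogously
after = objective(X, Y, C, P, lam)
```

Each half-iteration solves its subproblem exactly, so the objective can only decrease; alternating the analogous item update drives (3.5) down monotonically.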

Rating normalization

The rating variable is the last input needed for the algorithm. The number of times a product is bought can serve as the rating, but it is better to account for the average usage/purchase frequency of the item. This is mainly important when investigating whether buying a product once indicates a preference or not. However, if the number of times a product is bought by a customer is, for example, divided by the total number of times the product is bought, the value of the rating becomes very small. When dividing the number of times a product is bought by customer i by the average number of times the product is bought, the result is also more intuitive. For example, if someone bought the article once last year, but on average people buy it twice (a year), the corresponding rating equals 0.5 instead of 1. Then, if this rating is above 1, the customer buys the product more than average. Thus, these ratings are used.

Optimal values

All hyperparameters are optimized for this research setting. Obviously, implicitPrefs applies; besides, nonnegative can be set to true as all ratings will be nonnegative. For model comparison, the coldStartStrategy needs to drop the missing values. In this model comparison, 80% is used as training data and 20% as test data. The input dataset consists of a user and item id with the corresponding rating variable. If an item has not been bought yet, the model expects a rating of zero. Hence, these zero ratings are added manually to the dataset, so that each item is rated by every customer.
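The rating construction described above — each purchase count divided by the item's average purchase count per buyer, with explicit zeros for unbought items — can be sketched as follows (toy data and names, not the actual schema):

```python
from collections import defaultdict

users = ["u1", "u2", "u3"]
items = ["toothpaste", "shampoo"]

# (user, item) -> number of times the user bought the item (toy data)
counts = {("u1", "toothpaste"): 1, ("u2", "toothpaste"): 3,
          ("u1", "shampoo"): 4}

# Average number of times each item is bought per buying customer
totals, buyers = defaultdict(float), defaultdict(int)
for (u, i), n in counts.items():
    totals[i] += n
    buyers[i] += 1
avg = {i: totals[i] / buyers[i] for i in totals}

# Normalized rating: >1 means the customer buys the item more than average.
# Unbought items get an explicit zero, so every item is rated by every customer.
ratings = {(u, i): counts.get((u, i), 0) / avg[i] for u in users for i in items}
```

With toothpaste bought on average twice per buyer, a customer who bought it once gets the rating 0.5, matching the example in the text.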

To determine the optimal values of the other parameters, the model predictions need to be compared. However, as mentioned before, implicit feedback makes it harder to evaluate and predict preferences. Especially in case of dislikes, the predictions do not exactly match true ratings as in the explicit case. However, by making use of a test set, it can be checked whether the model gave higher predictions to products which are indeed bought more frequently. Hence, in the case of implicit ratings it is more convenient to use ROEM as defined in (2.4). Then, smaller values of ROEM indicate better predictions, as they are closer to the top of the list. Note that in the case of random predictions one expects ROEM = 50% (Hu, Koren, and Volinsky, 2008).
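(2.4) is defined in an earlier chapter; the sketch below assumes the standard expected-percentile-rank form of Hu, Koren, and Volinsky (2008), which matches the description here (lower is better, roughly 50% for random predictions):

```python
import numpy as np

def roem(test_ratings, predictions):
    """Expected percentile rank: sum_ui r_ui * rank_ui / sum_ui r_ui,
    where rank_ui is the percentile position (0% = top of the list) of
    item i in user u's predicted ranking. Lower is better."""
    num, den = 0.0, 0.0
    for u in range(test_ratings.shape[0]):
        order = np.argsort(-predictions[u])    # best-predicted item first
        n = len(order)
        rank = np.empty(n)
        rank[order] = np.arange(n) / (n - 1)   # percentile ranks 0 .. 1
        num += np.sum(test_ratings[u] * rank)
        den += np.sum(test_ratings[u])
    return 100.0 * num / den

# One user, one held-out purchase (item 0):
r = np.array([[1.0, 0.0, 0.0]])
good = roem(r, np.array([[0.9, 0.2, 0.1]]))   # purchase ranked first
bad = roem(r, np.array([[0.1, 0.2, 0.9]]))    # purchase ranked last
```

A model that puts the held-out purchase on top scores 0%, one that ranks it last scores 100%.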


Table 3.6: Values hyperparameters ALS

Parameter         Inputs                 Optimal
numBlocks         10                     10
rank              [10, 20, 30]           20
maxIter           [10, 20]               20
regParam          [.01, .05, .1, .15]    0.05
alpha             [10, 20, 50, 100]      100
implicitPrefs     true                   true
nonnegative       true                   true
% (train, test)   (.8, .2)               (.8, .2)

3.3.3 Neural network

An alternative to ALS collaborative filtering is neural collaborative filtering, which can be seen as a generalization of matrix factorization. A graphical representation of the neural approach can be found in Figure 3.5 (He, Liao, et al., 2017). As can be seen, the network consists of an input and output layer, with some additional layers in between.

Figure 3.5: Neural CF approach

In short, the inputs consist of (sparse) user and item identities. After this input layer, the embedding layer is reached. This layer can be compared to the latent vectors, where the embedding size can be seen as the number of latent factors. These item- and user-embeddings are then pushed through multiple neural CF layers, in which they are mapped to actual predictions. These neural layers could be used to gain insights in user-item interactions. Then, the output layer is the final prediction ŷ_ui, trained with target value y_ui.


Fitting the training data too closely, however, often results in an overfitted model, so some regularization might be needed. This can be done by penalizing the loss function or, indirectly, by adding an early stopping criterion to the training loop. Besides, the neural network has a lot of unknown parameters or weights, for which optimal values need to be found.

The optimization is done by using Stochastic Gradient Descent (SGD) via updates over the complete dataset, like batch learning (Hastie, Tibshirani, and Friedman, 2009). The model works through a number of training subsamples, called batches, before optimizing the parameter values. The batch size is however also adaptable, as it affects the accuracy of the algorithm. If the complete training data is analysed in these batches, one epoch is finished. The total number of epochs is also variable and often quite large, to reduce the loss as far as possible. The learning rate might even be the most important hyperparameter. It determines the change in the weights per update, often being a small value between 0 and 1. A large value might result in a suboptimal solution, while a small value worsens the speed of the algorithm. The speed is, besides, highly dependent on the batch size (larger is faster) and on the number of epochs (fewer is faster).
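The interplay of batches, epochs and the learning rate described above can be sketched with a generic minibatch SGD loop (numpy; a quadratic toy loss stands in for the network's loss, so this illustrates the training mechanics, not the NCF architecture itself):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data standing in for the (user, item) training pairs
X = rng.normal(size=(1000, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)                        # weights to be learned
lr, batch_size, epochs = 0.1, 50, 5    # learning rate, batch size, number of epochs

for epoch in range(epochs):
    idx = rng.permutation(len(X))      # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        # gradient of the mean squared error on the current batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad                 # update scaled by the learning rate

mse = np.mean((X @ w - y) ** 2)
```

One pass over all 1000/50 = 20 batches is one epoch; after five epochs the loss is essentially down to the noise level.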

An appropriate metric to evaluate performance is, for example, top-K ranking, as in (2.3). However, model fit heavily depends on the choice of the above values, and hence the optimal configuration is not always easily found. Bergstra and Bengio (2012) even show that randomly searching the optimal values can outperform grid or manual searches. The importance of the hyperparameters namely seems to vary between research settings, which makes it difficult to determine suitable starting values. Other possible issues include overparametrization, nonconvexity and instability of the model. Solutions might be to scale the inputs, regularize the weights or change the number of hidden layers.

Optimal values

All hyperparameters should be optimized for the research. Various configurations have been tried for this setting, of which the results can be found in the Appendix. In general, it turns out the networks overfit the data and hence cannot be used for prediction. Further optimization needs more research, but is outside the scope of this thesis. Therefore, the neural network version is no longer taken into account.

3.4 Research setup

Figure 3.6: email recommendation block

3.4.1 Test versions

The recommendation versions that will be tested online and at the same time are:

1. Random recommender

2. Current recommender

3. Content-Based recommender

4. Collaborative Filtering recommender

The first version will allocate random ratings to products, and hence no preference is taken into account. The proposed Content-Based and Collaborative Filtering recommendation versions are considered, and besides that, the current recommender is included to check whether the new versions are an improvement. But to guarantee that an effect is actually due to the experiment and not random, it is essential to make use of a control group. To analyse the effect of the general weekly emails, there already exists a random group of members who do not receive any emails for a period of four weeks. Hence, this group will serve as a control group when looking at the effects on total customer purchases.

3.4.2 Stratified samples

3.4.3 Final recommendations

The output of the algorithms consists of the top 200 recommendations per person. These products do not contain any privacy-sensitive or unpopular products, but do still contain some items which need to be filtered out. To increase the prediction performance, the training of the algorithm made use of all available items; however, these are not all suited to recommend. As introduced before, some filters and constraints are therefore applied after the training of the algorithm. For instance, the recommendations are required to be sold in the webshop, and baby products are only used in case a customer already bought some baby articles. Moreover, products which were already recommended or sold in the past three months will not be shown. The recommendations are now ready to be put into use, but just taking the top four products with the highest rating does not suffice. The goal of the recommendations is to encourage customers to buy new products. If someone only buys toothpaste, it is likely that the top recommendations contain toothpaste too, even after filtering out the most recent purchases. As there exist a lot of variations in brand and type, it could be the case that toothpaste even shows up more than once. However, it is more favorable to show diversified recommendations. Because of this, a new and final ranking is proposed which takes the highest ranking per product category. In this way, the four recommendations shown will contain four different kinds of products for sure. Table 3.7 shows the diversification per recommendation version in more detail. Columns 3-6 show the number of different products per ranking position. The second column furthermore indicates the number of unique members which got the email with that specific version. Note that only a subset of those customers will be used for analysis, as only a fraction of these totals will open the email and hence be exposed to the recommendations.

Table 3.7: Number of unique items per ranking position per version

Version   # sends   Rank 1   Rank 2   Rank 3   Rank 4
Random    121875    3253     3256     3258     3253
Current   115706    44       62       70       91
CB        136062    2555     2790     2900     2965
CF        122793    481      577      621      651
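The final diversification step described above — keeping only the highest-ranked product per category, then taking the top four — can be sketched as follows (field names and products are illustrative, not the actual data schema):

```python
# Diversified top-4: keep only the best-rated product per category
recs = [  # (rating, product_id, category), already filtered
    (0.97, "toothpaste-a", "toothpaste"),
    (0.95, "toothpaste-b", "toothpaste"),
    (0.90, "shampoo-a",    "shampoo"),
    (0.88, "razor-a",      "razors"),
    (0.85, "shampoo-b",    "shampoo"),
    (0.80, "soap-a",       "soap"),
]

def diversify(recs, k=4):
    """Walk the recommendations in descending rating order and keep the
    first (= highest-rated) product encountered in each category."""
    seen, out = set(), []
    for rating, pid, cat in sorted(recs, reverse=True):
        if cat not in seen:
            seen.add(cat)
            out.append(pid)
        if len(out) == k:
            break
    return out

top4 = diversify(recs)
```

The second toothpaste and the second shampoo are skipped, so the four slots end up holding four different kinds of products.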

3.5 Analysis

3.5.1 Customer responses

Two-proportions Z-test

To compare the click-through and buying rates per version, a statistical test is used. A z-test can determine whether the difference between two proportions is significant. Hence, the rates of interest are tested in all possible pairs using Pearson's chi-squared test without Yates' continuity correction (Howell, 2011). The corresponding test statistic is defined as follows (called prop.test in R):

z = (p_A − p_B) / √( p(1 − p) (1/n_A + 1/n_B) ), (3.6)

where

p_A = proportion in version A with sample size n_A,
p_B = proportion in version B with sample size n_B,
p = overall (pooled) proportion.

The z-test determines whether the observed percentages in group A and B are equal or not. This test can either be one- or two-tailed, where the null hypothesis is rejected at the 5% significance level if p < 0.05. The following hypotheses are proposed:

Two-tailed: H_{0,2}: p_A = p_B against H_{A,2}: p_A ≠ p_B, (3.7)

One-tailed: H_{0,1}: p_A ≥ p_B against H_{A,1}: p_A < p_B. (3.8)
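The statistic (3.6) and the p-values for (3.7)-(3.8) can be computed directly; a sketch using only the standard library (the counts below are illustrative, since the thesis reports rates rather than raw counts):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_prop_ztest(xa, na, xb, nb):
    """Two-proportion z-test as in (3.6), without Yates' continuity
    correction (equivalent to prop.test(..., correct = FALSE) in R,
    whose chi-squared statistic equals z squared).
    Returns z, the two-tailed p-value for (3.7), and the one-tailed
    p-value for (3.8) (H0: pA >= pB against HA: pA < pB)."""
    pa, pb = xa / na, xb / nb
    p = (xa + xb) / (na + nb)          # pooled proportion
    z = (pa - pb) / math.sqrt(p * (1 - p) * (1 / na + 1 / nb))
    return z, 2 * (1 - norm_cdf(abs(z))), norm_cdf(z)

# Hypothetical click counts: 75/10000 for version A vs 130/10000 for B
z, p_two, p_one = two_prop_ztest(75, 10000, 130, 10000)
```

With these toy counts, version A's rate is clearly lower, so both null hypotheses are rejected.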

Logistic regression

A logit model is used to determine which factors influence the decision whether or not to click on or buy the recommended products. This is a suitable method because the observed decision is either yes or no (binary) and it takes into account the latent factors causing the decision (Fok, 2017). The latent model can be described as

y_i* = x_i′β + u_i, (3.9)

where β is a vector containing the coefficients, x_i′ is the vector containing the individual values for person i, and u_i is the error term, with E(u_i) = 0. A logit link function is then used to interpret this generalized linear model. The latent observation is connected to the actual observation in the following way:

y_i = 1 if y_i* > 0, and y_i = 0 if y_i* ≤ 0.

This model for y_i can be used for both observed clicks and observed buys. Some relevant but uncorrelated characteristics are included, so that the final latent model can be described by

y_i* = α + β_1 Current + β_2 CF + β_3 CB + β_4 DaysLoyaltyMember + β_5 OnlineCustomer + β_6 B2B + β_7 DiapersBuyer + β_8 QuantityCE + β_9 NrOfStores + β_10 AverageSpendEUR + β_11 DaysSinceLastTransaction + u_i. (3.10)
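A logit model of this form can be estimated by maximizing the log-likelihood with Newton-Raphson; a minimal numpy sketch on simulated data (the single covariate stands in for the regressors of (3.10), and the coefficient values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated binary outcome: P(y = 1 | x) is logistic in one regressor
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
beta_true = np.array([-1.0, 0.6])
p = 1 / (1 + np.exp(-X @ beta_true))
y = (rng.random(n) < p).astype(float)

beta = np.zeros(2)
for _ in range(25):                     # Newton-Raphson on the log-likelihood
    mu = 1 / (1 + np.exp(-X @ beta))    # fitted probabilities
    W = mu * (1 - mu)                   # logistic variance weights
    grad = X.T @ (y - mu)               # score vector
    hess = X.T @ (X * W[:, None])       # negative Hessian (information matrix)
    beta += np.linalg.solve(hess, grad)

odds_ratio = np.exp(beta[1])            # effect-size interpretation per coefficient
```

The exponentiated coefficient is the odds ratio used later to interpret effect sizes.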


3.5.2 Effects on purchases

To analyse the effects on customer purchases in general, the total spendings per member or the number of bought categories can be used. These two kinds of data each need a different regression model, as the variables are continuous and count data, respectively. In both cases there are a lot of zeros, since some customers decide not to purchase at all in the given period. Hence, the data is censored from below at zero. Essentially, the models need to account for both the binary decision of buying or not, and the purchase quantity itself.

Tobit-2

For y_i being the dependent variable reflecting the observed sales amount per customer i, tobit can be seen as a mix between a probit model for y_i = 0 against y_i > 0 and a regression for y_i > 0. In this setting, it is not reasonable to assume that those decisions are influenced by exactly the same variables, but the two parts of the decision are naturally related to each other. Let the relationship between the latent variable y_i* and the observed variable y_i be captured in a selection and outcome equation, respectively:

Selection (part 1): y_i* = x_i′α + u_{1,i}, with u_{1,i} ∼ N(0, 1).

Outcome (part 2): y_i = 0 if y_i* ≤ 0, and y_i = x_i′β + u_{2,i} if y_i* > 0, with u_{2,i} ∼ N(0, σ²).

Here both error terms u_{1,i} and u_{2,i} might be correlated, and the set of explanatory variables x differs in at least one variable. Furthermore, the inverse Mills ratio is calculated with the fitted values of the selection equation and then added as an independent variable to the outcome part. A significant value of this ratio implies that both unobserved parts are indeed related and hence Tobit-2 is suitable (Davidson, MacKinnon, et al., 2004).
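The inverse Mills ratio used in the second step is λ(a) = φ(a)/Φ(a), evaluated at the fitted probit index a = x′α̂; a small sketch (the index values are illustrative, and the probit fit itself is taken as given):

```python
import math

def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a) for a fitted selection index a."""
    phi = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(a / math.sqrt(2)))            # standard normal cdf
    return phi / Phi

# Hypothetical fitted selection indices for three observed buyers
indices = [0.5, 1.0, 2.0]
mills = [inverse_mills(a) for a in indices]
# These values are appended as an extra regressor in the outcome (OLS) step;
# a significant coefficient on this column indicates sample selectivity.
```

The ratio shrinks as the selection index grows: customers who were almost certain to buy contribute little selection correction.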

Two-part count model

Poisson is the basic model to deal with count data, like in this case the number of categories bought. The main assumption of this model is however very strong: the mean and the variance need to be equal to each other. This does not hold in this setting, as µ = 1.40 and σ² = 7.74, implying overdispersion. To account for this inequality, negative binomial regression can be used. There, an additional parameter makes the model more flexible and is assumed to correctly specify the variance (Wooldridge, 2002). Often, no purchase is made, and consequently a zero is observed. To tackle this, a two-part model is proposed. The options when it comes to two-part models are:

1. Zero-inflated model: uses a negative binomial to estimate the counts including the zeros, and a logit model to estimate the probability that a zero is observed for the dependent variable.

2. Zero-altered model: uses a negative binomial model to estimate the counts excluding the zeros, and a logit model to estimate the probability that a zero is observed for the dependent variable.
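The overdispersion check (variance exceeding the mean) and the hurdle split underlying the zero-altered model can be sketched as follows (the toy counts are illustrative, chosen to be zero-heavy and overdispersed like the reported µ = 1.40, σ² = 7.74):

```python
import statistics

# Number of categories bought per customer (toy, zero-heavy counts)
counts = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 9]

mu = statistics.mean(counts)
var = statistics.pvariance(counts)
overdispersed = var > mu       # Poisson assumes var == mu

# Hurdle ("zero-altered") split: a logit models whether the count is zero,
# while a truncated negative binomial models the positive counts only
zeros = [c for c in counts if c == 0]
positives = [c for c in counts if c > 0]
```

Since the variance far exceeds the mean, the negative binomial part is preferred over a plain Poisson for the positive counts.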

4 Results

In this chapter, the performance of the recommender versions is analysed. The online performance is indicated by the click and buying rates, but to analyse the effects on purchases a suitable regression method is required.

4.1 Customer responses

4.1.1 Click rate

In Figure 4.1, the CTR are displayed as blue percentages. The block shows the format of Figure 3.6, with two click-through possibilities per product, namely the image itself and the black “show product” button. These percentages are unique clicks per link, but do not differ between the test versions.

Figure 4.1: CTR personalized recommendations


Figure 4.2: CTR non-personalized recommendations

In Table 4.1 the CTR per recommendation version is displayed. Here, the test groups are taken into account so that the rates can be compared between recommendation versions. The overall click rate (column 2) indicates a click anywhere on the page (including on the recommendations). In case of the recommendations (column 3), a click is counted when at least one out of four recommended products is clicked through. The last column shows the proportion of clicks on recommendations out of the total clicks.

Table 4.1: Click-trough-rates (CTR)

Version Overall CTR Recs CTR % Recs/Overall

Random 14.03% 0.75% 5.32%

Current 14.17% 0.91% 6.45%

CB 15.61% 2.95% 18.89%

CF 14.81% 1.32% 8.98%

Z-test


Table 4.2: Two-proportions z-test of Recs CTR

Version A   Version B   H0,2 p-value   H0,1 p-value
Random      Current     0.0142         0.0071
Random      CB          < 0.001        < 0.001
Random      CF          < 0.001        < 0.001
Current     CB          < 0.001        < 0.001
Current     CF          < 0.001        < 0.001
CF          CB          < 0.001        < 0.001

Logit

Results of the estimated logit model as in Equation (3.10) are displayed in column (3) of Table 4.3. Including gender and age however results in a major drawback of losing many observations. Adding your gender and birthday to your profile is namely not required, and hence these are not available for every observation. Therefore, column (2) of Table 4.3 shows estimates when ignoring these two variables. Again, some observations are lost (as compared to column (1), without including any explanatory variables). This is because no transactional information is available, implying these customers did not make a transaction yet. However, this loss is far smaller than in case of column (3). Hence, model (2) is used for further analysis.

Table 4.3: Logit estimates of clicking

Dependent variable: Click
                           (1)                  (2)                   (3)
Current                    0.204∗∗ (0.083)      0.154∗ (0.092)        0.270∗∗ (0.136)
CF                         0.583∗∗∗ (0.076)     0.599∗∗∗ (0.083)      0.703∗∗∗ (0.124)
CB                         1.397∗∗∗ (0.068)     1.398∗∗∗ (0.073)      1.501∗∗∗ (0.111)
DaysLoyaltyMember                               0.00004 (0.0001)      −0.0001∗ (0.0001)
OnlineCustomer                                  0.133∗∗ (0.059)       0.242∗∗∗ (0.083)
B2B                                             −5.822 (84.291)       −6.505 (196.968)
DiapersBuyer                                    −0.009 (0.066)        0.212∗∗ (0.096)
QuantityCE                                      0.0003∗∗ (0.0001)     0.0003∗ (0.0002)
NrOfStores                                      −0.002 (0.009)        0.029∗∗ (0.012)
AverageSpendEUR                                 −0.015∗∗∗ (0.003)     −0.017∗∗∗ (0.005)
DaysSinceLastTransaction                        −0.004∗∗∗ (0.0004)    −0.004∗∗∗ (0.001)
Age                                                                   0.030∗∗∗ (0.003)
Female                                                                −0.018 (0.170)
Constant                   −4.890∗∗∗ (0.061)    −4.528∗∗∗ (0.109)     −6.086∗∗∗ (0.294)
Observations               148,926              131,148               54,956
Log Likelihood             −11,559.970          −9,721.673            −4,364.026
AIC                        23,128               19,467                8,756
BIC                        23,168               19,585                8,881

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01


The coefficients on the recommendation versions are all positive and significant, where the random recommender serves as the reference case. Hence, the presence of one of the non-random recommendations implies an increase in the probability of observing a click, ceteris paribus. This positive effect is least significant for the current recommendation version (as compared to the random version), but highly significant for both CF and CB relative to the random recommender. Being an online customer also increases the probability of a click, as does an increase in the total quantity bought. However, for higher average spendings, the probability of observing a click decreases. This probability also decreases in case of more days since the last transaction. For all other variables, no significant effect was found.

To interpret the size of the effects, the odds ratio can be used. Corresponding values can be found in Table 4.4, where the same significance levels as in column (2) of Table 4.3 apply. Starting with the recommendation versions, all in reference to the random version, the current version implies the likelihood of clicking is 1.17 times larger than that of not clicking. This effect is somewhat larger in case of CF, where this likelihood is 1.82 times larger. In case of CB, the likelihood of clicking is even 4.05 times larger than the likelihood of not clicking. For the other variables, small yet significant odds ratios are found. Being an online customer increases the likelihood of clicking by a factor 1.14. Higher overall quantity is also significant, but the estimate of 1.000 does not suggest a strong relationship. Having higher average spendings and more days since the last transaction both imply a negative relationship.

Table 4.4: Odds of clicking

                           Odds
Current                    1.166∗
CF                         1.820∗∗∗
CB                         4.046∗∗∗
DaysLoyaltyMember          1.000
OnlineCustomer             1.142∗∗
B2B                        0.0029
DiapersBuyer               0.991
QuantityCE                 1.000∗∗∗
NrOfStores                 0.998
AverageSpendEUR            0.985∗∗∗
DaysSinceLastTransaction   0.996∗∗∗

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
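The odds ratios in Table 4.4 are simply the exponentiated logit coefficients of column (2) in Table 4.3, e.g.:

```python
import math

# exp(coefficient) from Table 4.3, column (2), reproduces Table 4.4
odds_current = math.exp(0.154)   # ~ 1.166
odds_cf = math.exp(0.599)        # ~ 1.820
odds_cb = math.exp(1.398)        # ~ 4.047 (the table's 4.046 stems from the
                                 #   unrounded coefficient)
```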

4.1.2 Buying rate

Table 4.5: Buying rates (BR)

Version Overall BR Recs BR % Recs/Overall

Random 33.07% 0.12% 0.35%

Current 33.91% 0.80% 2.37%

CB 33.26% 1.18% 3.56%

CF 33.34% 0.66% 1.99%

Z-test

To compare the buying rate of the recommended products, again a z-test (see (3.6)) is performed. The same hypotheses are used, namely (3.7) and (3.8). Results of these tests can be found in Table 4.6. As all p-values are smaller than 0.05, the null hypotheses can be rejected at the 5% significance level. This implies that the buying rate differs significantly between the test versions, with higher proportions in case of version B. It is notable however that regarding this customer buying rate, the current version performs better than CF. All other performances are in line with those of the CTR.

Table 4.6: Two-proportions z-test

Version A   Version B   H0,2 p-value   H0,1 p-value
Random      Current     < 0.001        < 0.001
Random      CB          < 0.001        < 0.001
Random      CF          < 0.001        < 0.001
Current     CB          < 0.001        < 0.001
CF          Current     0.0370         0.0185
CF          CB          < 0.001        < 0.001

Logit

To analyse which factors actually influence a buy, the logit model (3.10) is estimated. Results of this estimation can be found in Table 4.7. As can be seen in column (3), including all variables generates a huge loss in observations. Hence, column (2) is used for the interpretation. The coefficients in column (2) of Table 4.7 show that approximately the same variables as in the case of the clicks appear to be significant, having the same signs. Two differences are striking, namely that buying diapers in this case increases the probability of buying, whereas average spendings no longer affects this probability.


1.21 times. The total quantity bought shows no relationship, and having more days since the last transaction decreases the likelihood of buying by 5%.

Table 4.7: Logit estimates of buying

Dependent variable: Bought
                           (1)                  (2)                   (3)
Current                    1.937∗∗∗ (0.166)     1.998∗∗∗ (0.184)      1.804∗∗∗ (0.264)
CF                         1.745∗∗∗ (0.167)     1.800∗∗∗ (0.186)      1.783∗∗∗ (0.263)
CB                         2.329∗∗∗ (0.161)     2.430∗∗∗ (0.178)      2.405∗∗∗ (0.253)
DaysLoyaltyMember                               0.0001 (0.0001)       0.0001 (0.0001)
OnlineCustomer                                  −0.032 (0.086)        0.052 (0.120)
B2B                                             −6.217 (222.935)      −5.927 (324.744)
DiapersBuyer                                    0.193∗∗ (0.085)       0.091 (0.130)
QuantityCE                                      0.001∗∗∗ (0.0001)     0.0005∗∗∗ (0.0002)
NrOfStores                                      0.002 (0.011)         −0.001 (0.016)
AverageSpendEUR                                 0.002 (0.005)         −0.0002 (0.008)
DaysSinceLastTransaction                        −0.051∗∗∗ (0.003)     −0.049∗∗∗ (0.004)
Age                                                                   0.008∗∗ (0.004)
Female                                                                0.118 (0.298)
Constant                   −6.754∗∗∗ (0.154)    −5.892∗∗∗ (0.219)     −6.257∗∗∗ (0.497)
Observations               148,924              131,147               54,956
Log Likelihood             −6,095.824           −4,669.868            −2,145.769
AIC                        12,120               9,364                 4,320
BIC                        12,239               9,481                 4,444

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

4.2 Effects on purchases

In contrast to the above customer responses analysis, the control group is now added to the analysis. This control group did not receive any e-mails at all, so this group will now serve as the benchmark case (instead of the random recommender for the logit estimations before). To estimate the changes in customer purchases, two effects are investigated. First, the effects on total sales are considered. Thereafter, the number of product categories is used to determine the effects on the variation of purchases.

4.2.1 Sales amount

Tobit-2 models can be used to model the effects on total sales, where the sample of buying customers was determined by self-selection. As maximum-likelihood (ML) estimation can result in numerical issues, Heckman's two-step approach (Heckit) is used to gain first insights in the nature of the sample selectivity (Davidson, MacKinnon, et al., 2004). The estimation of this Heckit 2-part model can be found in Table 4.9, where the significant Inverse Mills Ratio confirms that this model fits the data.

The first column (“Bought”) refers to the probit selection equation. The estimates can be interpreted as follows. The positive and significant coefficients on the recommendation versions imply a higher probability of buying in case of a recommendation block in the mailings, ceteris paribus. Note that this effect is relative to the control group, i.e., the customers who did not receive any emails at all. Hence, this effect might also be due to the presence of the mail itself instead of the presence of recommendations in those emails. Moreover, a longer loyalty membership increases the probability of buying. As this duration is measured in days, the effect is small. Being an online customer moreover increases the probability of buying (compared to customers who did not shop online yet). B2B customers are also more likely to buy (but only significant at the 10% level), and so are women relative to men. Being older seems to increase the buying chance as well, although this effect is least significant and nearly zero, possibly because age is measured in years. The only negative relationship is found for the average spendings. Having higher average spendings seems to decrease the probability of buying. This might, for example, be because these customers need fewer visits to buy the same amount of products.


Buying diapers increases spendings with €43. The total quantity bought only has a small effect of €0.03 more per additional historic product. The more days since the last transaction, the lower the sales. This effect is nearly €0.43 per additional day, and might be explained by having lost this customer as a loyal shopper. Then, being one year older increases spendings with €0.15. Being a woman, compared to men, increases spendings with €19. All effects here hold under the ceteris paribus assumption.

Table 4.9: Heckit 2-step estimates

Dependent variable:
                           Bought                Total sales
Random                     0.3083∗∗∗ (0.012)     9,737.703∗∗∗ (882.377)
Current                    0.3296∗∗∗ (0.012)     10,320.420∗∗∗ (931.833)
CF                         0.3150∗∗∗ (0.012)     9,876.648∗∗∗ (896.373)
CB                         0.3170∗∗∗ (0.012)     9,961.061∗∗∗ (898.631)
DaysLoyaltyMember          0.0002∗∗∗ (0.000)     6.018∗∗∗ (0.582)
OnlineCustomer             0.0784∗∗∗ (0.008)     2,279.707∗∗∗ (281.521)
B2B                        1.309∗ (0.732)        187,678.400∗∗∗ (21,314.920)
DiapersBuyer               0.1567 (0.004)        4,372.767∗∗∗ (464.142)
QuantityCE                                       2.901∗∗∗ (0.111)
NrOfStores                                       7.376 (7.170)
AverageSpendEUR            −0.0044∗∗∗ (1.343)
DaysSinceLastTransaction                         −43.241∗∗∗ (1.709)
Age                        0.0005∗ (0.000)       15.033∗∗∗ (3.472)
Female                     0.0636∗∗∗ (0.008)     1,915.122∗∗∗ (302.225)
Constant                   −0.9604∗∗∗ (0.012)    −63,427.090∗∗∗ (5,552.716)
Inverse Mills Ratio                              42,158.950∗∗∗ (3,475.382)
Observations               170,567
R2                         0.282
Adjusted R2                0.282
ρ                          1.174

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01


Regarding the other estimates of column 1 (“Bought”), only B2B no longer has a significant effect on the probability of buying compared to the previous model in Table 4.9; the differences in the other coefficients are not noteworthy. The second column (“Total sales”) also shows quite similar results, as the signs and significance levels did not change. Most of the coefficients decreased somewhat, except for those of Female and the days since the last transaction, which slightly increased.

Table 4.10: Heckit 2-step estimates without control group

Dependent variable:

                              Bought                  Total sales

Current                       0.0213∗ (0.009)         580.153∗ (305.511)
CF                            0.0068 (0.009)          134.628 (296.579)
CB                            0.0086 (0.009)          213.326 (286.444)
DaysLoyaltyMember             0.0002∗∗∗ (0.000)       6.230∗∗∗ (0.635)
OnlineCustomer                0.068∗∗∗ (0.008)        1,955.418∗∗∗ (283.040)
B2B                           0.0748 (0.732)          92,916.880∗∗∗ (26,760.840)
DiapersBuyer                  0.1541∗∗∗ (0.009)       4,256.448∗∗∗ (487.083)
QuantityCE                                            2.724∗∗∗ (0.117)
NrOfStores                                            11.926 (7.322)
AverageSpendEUR               −0.0046∗∗∗ (0.000)
DaysSinceLastTransaction                              −44.402∗∗∗ (1.755)
Age                           0.0004∗ (0.000)         10.829∗∗∗ (3.480)
Female                        0.0646∗∗∗ (0.008)       1,972.175∗∗∗ (322.600)
Constant                      −0.7146∗∗∗ (0.014)      −53,834.170∗∗∗ (5,080.306)

Observations                  148,972
R2                                                    0.219
Adjusted R2                                           0.219
ρ                             1.174

Inverse Mills Ratio 42,262.540∗∗∗ (3,722.179)

Note: ∗p<0.1;∗∗p<0.05;∗∗∗p<0.01

4.2.2

Number of categories

The optimal count model turns out to be the zero-altered negative binomial regression. Regression results can be found in Table 4.11, where column “:1” shows the logit estimates and column “:2” reflects the negative binomial part; the two parts have different interpretations. The third value for the intercept is the dispersion parameter.
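The zero-altered (hurdle) structure combines a logit for the zero/positive decision with a zero-truncated negative binomial for the positive counts. A minimal log-likelihood sketch makes this concrete; the counts and parameter values below are invented for illustration:

```python
# Sketch of the zero-altered ("hurdle") negative binomial log-likelihood:
# a logit governs P(y = 0), a zero-truncated NB models the positive counts.
import numpy as np
from scipy.stats import nbinom
from scipy.special import expit

def hurdle_nb_loglik(y, logit_intercept, nb_mu, nb_alpha):
    """Hurdle NB log-likelihood with intercept-only parts (illustrative)."""
    p_zero = 1 - expit(logit_intercept)        # hurdle part: P(y = 0)
    size = 1 / nb_alpha                        # NB2 parameterisation
    prob = size / (size + nb_mu)
    ll = np.where(
        y == 0,
        np.log(p_zero),
        np.log(1 - p_zero)
        + nbinom.logpmf(y, size, prob)
        - np.log(1 - nbinom.pmf(0, size, prob)),  # truncation at zero
    )
    return ll.sum()

y = np.array([0, 0, 1, 2, 3, 5])  # hypothetical category counts
print(hurdle_nb_loglik(y, 0.0, 2.0, 0.5))
```

Unlike a zero-inflated model, the hurdle model attributes all zeros to the logit part, which is why the two columns of Table 4.11 carry separate interpretations.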


on the probability of buying. This result is, however, inconsistent with the effects found in the previous section, where receiving emails did seem to influence the binary decision. The other estimates show small but significant effects on the probability of buying one or more categories. The number of days of loyalty membership, buying diapers and being female show a negative relationship, whereas being an online customer and more days since the last transaction show a positive relationship.

The second column shows the estimates of the negative binomial regression, without taking the zeros into account. The only recommendation version with a significant coefficient relative to the random recommender is CB. This implies that, ceteris paribus, CB recommendations change the number of different categories bought relative to the random version by 100 · (e^(−0.0274) − 1) = −2.70%. Being an online customer, diapers buyer or female increases the number of different categories bought by 9.75%, 8.94% and 8.46%, respectively. The continuous variables furthermore yield small but significant results. For every additional day of being a loyalty member, the number of different categories bought increases by 0.01%, ceteris paribus. For each additional day since the last transaction, it decreases by 0.13%, and for every one-year increase in age it decreases by 0.1%, ceteris paribus.
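The percentage-change interpretation used above follows from exponentiating a negative binomial coefficient. A one-line helper makes the arithmetic explicit (the −0.0274 value is the CB coefficient quoted above):

```python
import math

# Percentage change in the expected count implied by a count-model
# coefficient: 100 * (exp(beta) - 1).
def pct_change(beta: float) -> float:
    return 100 * (math.exp(beta) - 1)

print(round(pct_change(-0.0274), 2))  # ≈ -2.70, the CB effect above
```

For small coefficients this is close to 100 · β itself, which is why the daily effects read as fractions of a percent.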

Table 4.11: Zero-altered negative binomial regression


5

Conclusion

A personalized shopping experience can be used to attract and retain customers, which can be promoted by making use of a loyalty card program. Introducing personalized product suggestions for new-to-customer products aims at increasing the basket value of loyalty members. Therefore, in this research, a recommendation system is proposed for the Dutch drugstore Etos. The corresponding research purpose is to investigate the effects of personalized product suggestions on customer purchases. The first challenge is to find the recommendation method that suits the firm best; second, the effect of recommendations on customer purchases is researched.

In order to compare the performance of recommendation algorithms, two suitable new recommenders are proposed and, together with a random and the current version, tested in practice. Via A/B testing these four versions are simultaneously exposed to stratified customer samples. Among the new algorithms is a collaborative filtering (CF) method based on a matrix factorization technique called Alternating Least Squares. This method searches for similar customers using purchase history, trying to predict which item one might like. The other algorithm is content-based (CB), using Term Frequency-Inverse Document Frequency to compare item descriptions and find items similar to those one bought before. The current algorithm, as well as the random algorithm, serves as a benchmark to check whether the new methods are actually an improvement. Furthermore, to check whether the use of recommendations actually changes customer purchase behaviour, an additional control group is taken into account. These members did not receive any emails and hence no recommendations.
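The ALS idea behind the CF recommender can be illustrated with a toy sketch. The 4×4 customer-item purchase matrix, the number of latent factors and the ridge penalty below are invented for illustration and do not reflect the implementation used in this research:

```python
# Toy sketch of Alternating Least Squares (ALS) matrix factorization:
# alternately solve ridge regressions for customer and item factors.
import numpy as np

R = np.array([[1, 1, 0, 0],      # rows: customers, columns: items
              [1, 0, 1, 0],      # 1 = bought, 0 = not (yet) bought
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
k, lam = 2, 0.1                  # latent factors, ridge penalty
rng = np.random.default_rng(1)
U = rng.normal(size=(R.shape[0], k))  # customer factors
V = rng.normal(size=(R.shape[1], k))  # item factors
reg = lam * np.eye(k)

for _ in range(20):
    # Alternate: fix V and solve for U in closed form, then vice versa
    U = R @ V @ np.linalg.inv(V.T @ V + reg)
    V = R.T @ U @ np.linalg.inv(U.T @ U + reg)

pred = U @ V.T  # predicted affinities; high scores on unbought items -> recommend
print(np.round(pred, 2))
```

Each half-step is a closed-form ridge solution, which is what makes ALS attractive for the large, sparse purchase matrices of a loyalty program.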
