
Next Purchase Prediction: Improving Customer Interaction By Using Transactional Data

submitted in partial fulfillment for the degree of master of science

Bas de Jong
11121610

master information studies: data science
faculty of science, university of amsterdam

2019-07-10

Internal Supervisor: Rolf Jagerman (UvA, FNWI, IvI)
External Supervisor: Bas Karsemeijer (HEMA)

Next Purchase Prediction: Improving Customer Interaction By Using Transactional Data

Bas de Jong

bas.dejong1@student.uva.nl Universiteit van Amsterdam

ABSTRACT

Predicting customer behavior is an important task for many businesses. While much research has been performed on recommending new products to a customer, there are few studies on predicting the next purchase of a consumer. To find the effect of showing customers their potential next purchase, transactional data was fed to multiple machine learning algorithms in order to estimate what the next purchase of any customer will be. Afterwards, this prediction was shown to a selection of customers in an email. This research presents the difference in performance when a personalized banner is added to an email for the Dutch retail company HEMA. A personalized banner was found to increase the click-through rate by a significant amount. While this did not significantly lead to people actually buying the proposed suggestions, it did lead to an overall increase in average order value compared to the control group. In conclusion, predicting the next purchase of a customer shows promising insights, since an increase in click-through and revenue was found.

KEYWORDS

Recommendation Engine, Recurrent Neural Network, LSTM Network, Purchase Prediction

1 INTRODUCTION

Recommendation systems are used by many companies to provide their customers with a personalized selection of products they might be interested in based on their purchase behaviour. A notorious case of recommendation systems in practice is when the American retail chain Target allegedly figured out a girl was pregnant before her father knew, due to her purchase behaviour.1 This thesis applies multiple recommendation algorithms to a subset of the transaction data of customers of HEMA. HEMA is a retail company based in the Netherlands with its own loyalty program, Meer HEMA, which it partially uses to personalize advertisements to the customer. Currently this is done through a tool which requires manual input for the most part and thus allows for a possible improvement through machine learning. Since it has been shown that email can be a very effective way for businesses to increase customer interaction if used correctly [5], this thesis will improve the suggestions given to customers via email in order to maximize customer interaction. This research will try to achieve this by combining the purchase data and the aggregated customer data to predict the next purchase a customer will make. Cho [6] has shown that customers

1 https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

with a higher product involvement are more likely to click on a banner, and this research will try to achieve the same results through personalization methods.

Most studies in the field of recommendation engines have focused on predicting new products and have not taken into account the possibility of showing a customer their own possible next purchase. To clarify, where this research differs from others is that it tries to predict the next purchase of a customer (as opposed to showing new products they have not yet bought) and the effect this might have on the click-through rate in email marketing and on the purchases of the customers. The obtained results will give new insights into the way recommendation engines are built.

The main research question is stated as follows:

How can next-purchase prediction help in maximizing customer interaction?

This will be answered through several sub research questions:

• RQ1: How do different methods compare?
• RQ2: How does the model perform for different user segments?
• RQ3: Along what dimensions does the model perform well?

The goal of the sub questions is to find the circumstances under which the model performs well and can possibly be improved upon. Different algorithms are evaluated because more complex models do not always guarantee better results than simple algorithms.

2 RELATED WORK

Recommendation systems have been widely studied in both academic and industry settings. Ricci and Shapira [19] have provided an overview of recommendation systems which can be used as a guideline for practitioners of such systems. Several frameworks have also already been suggested to implement such recommendation systems. Singhal et al. [22] proposed an online recommendation engine which utilizes deep learning to make timely personalized suggestions. Notable examples of recommendation systems used by big companies are Amazon [23] and Netflix [10].

However, the goal of this project is not necessarily to recommend new products, but to predict what the next purchase will be and show this to a customer. The most straightforward way to do this is through neural networks [11]. More advanced neural networks in the form of deep learning have also been studied in the field of recommendation [9, 24]. Collaborative filtering [16, 21] is a well-known recommendation algorithm, which has also been extended to include deep learning in the form of deep collaborative filtering [25].

Showing a customer their next purchase might not be the only situation that causes a customer to click through in an email. This has been evaluated by Bollen et al. [1], where the number of choices given to a customer was analyzed. Customer behaviour might also be an interesting metric to take into account, as portrayed by Boyer and Hult [3]. A somewhat more in-depth approach was performed by Buder [4], who combined the aspects of a personalized recommendation system with a psychological approach. If possible, this shall also be applied somewhere during the project. An approach that might prove useful with regards to experimenting is counterfactual learning and counterfactual evaluation. This approach has been explored by Bottou et al. [2] to show how to use the data from an old system to figure out what the new policy should be with regards to recommendation. If the current infrastructure is structured correctly, this could allow for an offline evaluation based on email logs to measure the effect on e.g. the click-through rate.

3 PROBLEM DEFINITION

Here we describe the task that will be solved during this project, along with how the data was retrieved and processed. Afterwards, the evaluation metrics are explained together with the setup of the A/B email test.

3.1 Research Task

The task of next-purchase prediction can be formulated as follows: given a user u and a sequence of t purchases {p1, p2, p3, . . . , pt}, we wish to predict the category y of the next purchase of that user. Four main approaches will attempt this, and the one that performs best offline will be used for an online A/B test phase. The four approaches that will be evaluated are a random forest multiregressor, a deep neural network, a deep recurrent neural network, and an experimental deep recurrent neural network. To ensure reproducibility, a specific random seed was set for all models. The reason simpler models are also evaluated is that the next purchase of a customer may well be easy to predict, and simpler models are generally less computationally expensive. Since scalability is an important factor for a big company such as HEMA, runtime will be taken into account alongside performance. If retrieving, loading, and processing the data takes over a day, new purchases will already have been made, which would make the available data incomplete. This would mean that promoting the next purchase of a customer to them would be too late: they might have already bought it. Sending an email based on outdated data is not very likely to give accurate predictions.

Another interesting notion to take into account is that this problem can be seen both as a regression problem and as a multiclassification problem, depending on the scope. One can choose to predict the total number of products bought for every category, or predict the single category from which a customer buys the most products in their next purchase.
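The two framings can be illustrated with a toy example (a minimal sketch; the 5-category count vector stands in for the 229 real categories):

```python
import numpy as np

# Toy next-purchase counts over 5 categories (the thesis uses 229).
next_purchase_counts = np.array([0, 3, 1, 0, 0])

# Regression framing: predict the full count vector, one value per category.
regression_target = next_purchase_counts

# Multiclassification framing: predict only the single category the
# customer buys the most products from.
classification_target = int(np.argmax(next_purchase_counts))

print(regression_target.tolist(), classification_target)  # [0, 3, 1, 0, 0] 1
```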

3.2 Data and Preprocessing

The task at hand is to predict the next purchase of a customer in the loyalty program given the data of all their previous purchases. This data consists of two main parts, namely the purchase data and the aggregated customer data. The purchase data contains the data of all purchases made by customers in the loyalty program since 2017. This includes metrics such as the customer id (if available), the ids of bought products, the number of bought products, the date, and the store.

The aggregated customer data contains a detailed summary per customer, which includes metrics such as the percentile of customers that make the most transactions and the subgroups into which they can be segmented. Combining these metrics might allow for some generalization for customers with a low number of purchases, as well as a high level of personalization for loyal customers. Combining these datasets also allowed for downscaling the problem, which will be explained further later on. The first step taken was preprocessing the available data. This research could be classified as a big data problem depending on the size of the scope, but this will be explored further later on.

Running everything locally on the full dataset would be a difficult task due to the sheer size of the data. To still use as much data as possible locally, the raw dataset was preprocessed in chunks to avoid memory bottlenecks. The result of the preprocessing can also be fed to a neural network in batches for the same reason.
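A minimal sketch of this chunked aggregation, assuming a pandas-style CSV export (the column names and the tiny in-memory file are illustrative, not the real HEMA schema):

```python
import io
import pandas as pd

# Stand-in for the raw transaction export; in practice this would be a
# file far too large to load into memory at once.
raw = io.StringIO(
    "customer_id,category\n"
    "1,Photos\n1,Party\n2,Photos\n2,Photos\n3,Party\n"
)

# Process the file in small chunks and aggregate incrementally, so
# memory usage stays bounded regardless of the file size.
counts = {}
for chunk in pd.read_csv(raw, chunksize=2):
    for cat, n in chunk["category"].value_counts().items():
        counts[cat] = counts.get(cat, 0) + int(n)

print(dict(sorted(counts.items())))  # {'Party': 2, 'Photos': 3}
```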

There were some strange outliers in the data, for example people buying over 300 of the same product at once. However, after asking around at the company, it is very likely that this data was entered into the system correctly and that transactions like this sometimes occur. Moreover, in most cases where people buy many products at once, the purchase is from the Photos category. This is very plausible, since people can purchase many photos with the end goal of putting them in a photo album. In the cases where the category is not Photos, it is highly likely that a franchiser or another company made the purchase, which again is usually not an error in the data. Since the outliers occur very infrequently and most have a reasonable explanation, such transactions were treated as normal data.

One of the largest difficulties is that product IDs cannot be quantified through a single integer output. A product cannot be predicted as one single integer value, so products can only be predicted through one-hot encoding. The same holds true for the inputs; giving the product ID as an integer value carries no real meaning. Every possible product becomes a new column in the datasets, which makes the number of learnable weights grow rapidly with every additional column. Given that HEMA has over 50.000 product IDs (consisting of 30.000 products in different shapes and colors), even a fully connected neural network with one hidden layer the size of the output layer would have 2.5 billion learnable parameters. The size of the dataset would also grow considerably, since every product would need a one-hot encoding to allow the model to process different products. With this many products, a one-hot encoding would have too high a dimensionality to be computationally feasible. This is why the focus lies on categories rather than individual products.
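The back-of-the-envelope arithmetic behind these numbers can be made explicit (a sketch of the parameter counting only; biases are ignored):

```python
# Weight count for one fully connected layer whose hidden size equals
# the output size: input_dim * hidden_dim.
n_products = 50_000    # one-hot over product IDs
n_categories = 229     # one-hot over categories instead

params_products = n_products * n_products
params_categories = n_categories * n_categories

print(f"{params_products:,}")    # 2,500,000,000 -> the 2.5 billion from the text
print(f"{params_categories:,}")  # 52,441
```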

There are a total of 229 categories, which is already significantly fewer than the 30.000 possible products mentioned earlier and should thus be more computationally viable. In addition, some categories are not available online at all, so these cannot be added to the emails during the testing phase, which lowers the number of possible outputs even more. Another upside to this approach is that it directly deals with the cold start problem [15] that new products and seasonal products suffer from. These products are still placed in categories based on where they fit best; this approach might therefore prove better at predicting them than trying to predict on a product level.

While experimenting with the network, if a user u has made a sequence of t purchases {p1, p2, p3, . . . , pt}, the network receives {p1, p2, p3, . . . , pt−1} as input and has to predict pt. For the final training stage (after the hyperparameters have been set), {p1, p2, p3, . . . , pt} was given to the network, and the network had to predict the unknown future purchase pt+1.
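This windowing can be sketched as follows (the purchase labels are placeholders for the per-purchase category vectors):

```python
# One customer's purchase history p1..pt.
purchases = ["p1", "p2", "p3", "p4"]

# Hyperparameter-tuning stage: the network sees p1..p(t-1) and must
# predict the held-out last purchase pt.
tuning_input, tuning_target = purchases[:-1], purchases[-1]

# Final training stage: the full history p1..pt is used; the target
# p(t+1) is the not-yet-made next purchase.
final_input = purchases

print(tuning_input, tuning_target)  # ['p1', 'p2', 'p3'] p4
```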

3.3 Evaluation Metrics

Several baselines were put in place during the hyperparameter tuning of the model to evaluate the difference in performance of the models with regards to predicting what the next purchase will be. The goal is to perform above the following simple baselines:

• Random guessing: randomly selecting a single output class.
• Majority class: always selecting the class that occurs the most in the dataset.
• Input equals output: selecting the previous purchase to also be the next purchase.

Since there can be a total of 229 output classes, most of which are zero for any given case, accuracy is a biased metric for both multiclassification and regression. To illustrate: if a customer on average buys products from three different categories per purchase, a model that claims everything is zero would achieve an accuracy of 98.7%. Initially this seems high, but in reality the model is not learning anything valuable. To counteract this issue, the main metric evaluated during the experimental process is precision, as this gives a more accurate representation of how good the model is at making useful predictions for this specific use case. Since precision only looks at the correct positive predictions, predicting only zeroes no longer yields a high percentage. A prediction was counted as a positive if it occurs in the purchase, since there can be multiple positives in a single transaction.
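This bias is easy to reproduce (a sketch with hypothetical category indices):

```python
import numpy as np

# 229 output classes; the customer buys from 3 categories, while the
# model predicts all zeros. Accuracy looks great, precision exposes it.
n_classes = 229
actual = np.zeros(n_classes, dtype=int)
actual[[5, 17, 42]] = 1              # three categories actually bought
predicted = np.zeros(n_classes, dtype=int)

accuracy = (predicted == actual).mean()          # 226/229 ~ 98.7%
true_pos = int(((predicted == 1) & (actual == 1)).sum())
pred_pos = int((predicted == 1).sum())
precision = true_pos / pred_pos if pred_pos else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.3f}")  # accuracy=0.987 precision=0.000
```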

3.4 Email Setup

The model that performs best in the experimental setup will be used for the online A/B test. A/B testing is needed because the already available data is biased to an extent: the choices made by the customers may rely on the suggestions given by the current model or on other external influences. Also, since the end goal is to maximize customer interaction, new predictions have to be given and tested; there is no other way to know whether the predictions actually influence the choice of a customer to click on a given suggestion.

To properly evaluate the effect of a personalized email on the click-through rate and purchase behaviour, the customers were randomly put into one of two groups: a test group and a control group. The control group got the same email as normal, while the test group got a modified email which included a personalized banner below the main advertisement. This banner was designed after the category that the model predicted for the customer. To prevent making the email feel intrusive to customers and to minimize the risk for the company, nothing in the email actively made it clear that the banner was personalized. The evaluation was performed a week later, since most people open their mail within the first few days after receiving it.

However, after implementing the heuristic, the model still chose the majority class often. A large part of this is due to the implemented heuristic for dealing with categories that cannot be displayed or were too uncertain. Removing these left very little to display aside from the two classes that occur the most often within this customer segment. While there is technically still some degree of personalization, choosing three of the most occurring categories is not a very high degree of personalization. Around 80 percent of the predictions came from the heuristic, because the model simply is not able to make predictions with a specific certainty.

4 METHODS

This section describes the algorithms applied during the research.

4.1 Multiregressor

A regression algorithm can be applied to estimate the number of products a customer will buy. The regression algorithm used is a random decision forest [12]. Since not one single value but multiple values need to be predicted, multiple random decision forests were combined into a random forest multiregressor which makes a prediction for every category. The reason a regression algorithm was used as opposed to a classification decision forest was to evaluate the possibility of predicting the number of products bought per category.

This approach however suffers from the fact that not all transactional data can be taken into account when learning. It is also possible to feed several transactions to the model with padding, but this would take up too much memory to feasibly compute. Therefore this technique could not be implemented properly without running into memory errors.
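A minimal sketch of such a multiregressor using scikit-learn's MultiOutputRegressor around RandomForestRegressor; the synthetic data, the 8-category dimensionality, and the "next purchase" pattern are illustrative assumptions, not the thesis setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(42)

# Toy data: 200 customers; input = counts bought per category in the
# previous purchase, output = counts per category in the next purchase.
n_categories = 8
X = rng.integers(0, 4, size=(200, n_categories))
y = np.roll(X, 1, axis=1)  # synthetic "next purchase" relation

# One random forest per output category; the fixed seed mirrors the
# reproducibility requirement stated in the text.
model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=25, random_state=0)
)
model.fit(X, y)

pred = model.predict(X[:1])   # predicted count for every category
print(pred.shape)             # (1, 8)
```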

4.2 Simple Neural Network

The first version of the simple neural network used only the last purchase to predict the next purchase. It is very likely impossible to achieve a perfect prediction rate with a simple neural network, since only the previous purchase is taken into account. Among the subset of people whose last purchase came from only one category, there can be many different outcomes for the next purchase. Intuitively the next purchase is not always a single fixed category, so this might be difficult for the network. In order to find more complex relations between variables, several Rectified Linear Units (ReLU) [17] have been implemented in the network.

Some similar problems arise as with the multiregressor. If a customer buys one product, the output will always yield the same result, while intuitively this is definitely not always the case. This approach also suffers from the fact that not all transactional data can be taken into account when learning, and (similar to the multiregressor) the aforementioned padding approach would not be feasible.
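A forward pass of such a network can be sketched in plain NumPy (randomly initialized weights; the hidden size and the example category indices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_categories, hidden = 229, 64

# One hidden ReLU layer: multi-hot previous purchase in, probability
# distribution over next-purchase categories out.
W1 = rng.normal(0, 0.05, (n_categories, hidden))
W2 = rng.normal(0, 0.05, (hidden, n_categories))

x = np.zeros((1, n_categories))
x[0, [5, 17, 42]] = 1.0  # previous purchase touched three categories

probs = softmax(relu(x @ W1) @ W2)
print(probs.shape, round(float(probs.sum()), 6))  # (1, 229) 1.0
```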


4.3 Recurrent Neural Network (LSTM)

To fulfill the goal of allowing all historical data of a customer to be used as input for the model, as opposed to only one transaction, a Recurrent Neural Network (RNN) was employed [20]. RNNs are usually better than traditional neural networks at modeling time series. An RNN maintains a hidden state that learns which data to store from previous inputs. However, it has been shown that RNNs can suffer from a vanishing or exploding gradient during training, which prevents the network from learning properly [18]. There are multiple ways to deal with this, one of which is implementing a specific nonlinearity in the network in the form of a tanh activation layer. Another, slightly more complex, way of dealing with this problem is implementing Long Short-Term Memory (LSTM) layers [13]. There are additional advantages of using LSTM over a traditional RNN. The most notable can be found in the name: storing both long term and short term memory. This allows for long term dependencies, something that is often not possible in a normal RNN. It is also possible to stack LSTM layers, essentially increasing the level of depth as in a deep neural network. This specific neural network looked exactly the same as the deep neural network, but with the addition of two LSTM layers at the start. Because of the way the problem is defined, the decision was made for the output of the model to be a softmax activation layer [8]. This gives a probability for every possible output class, where the largest output is taken as the prediction.

To summarize, to prevent the vanishing and exploding gradients that RNNs typically suffer from, multiple LSTM layers were used in the model. Gradient clipping was also investigated, but it seemed to negatively impact the results when attached to the loss function. Therefore, the clipper was removed for the final model, since the LSTM layers took care of the exploding and vanishing gradients.
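For illustration, a single LSTM cell step with a softmax readout can be sketched in NumPy (the dimensions and random initialization are assumptions; the thesis model stacks two LSTM layers inside a larger trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; W, U, b hold the four gates stacked."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    g = np.tanh(z[2 * H:3 * H]) # candidate cell state
    o = sigmoid(z[3 * H:])      # output gate
    c_new = f * c + i * g       # long-term memory update
    h_new = o * np.tanh(c_new)  # short-term (hidden) state
    return h_new, c_new

n_categories, hidden = 229, 32
W = rng.normal(0, 0.1, (4 * hidden, n_categories))
U = rng.normal(0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)
W_out = rng.normal(0, 0.1, (n_categories, hidden))

# A toy purchase history: each step is a multi-hot category vector.
sequence = [np.zeros(n_categories) for _ in range(5)]
for t, vec in enumerate(sequence):
    vec[(3 * t) % n_categories] = 1.0

h = np.zeros(hidden)
c = np.zeros(hidden)
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)

probs = softmax(W_out @ h)  # distribution over next-purchase categories
print(probs.shape, round(float(probs.sum()), 6))  # (229,) 1.0
```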

4.4 LSTM with Uncertainty Cutoff (LUC)

The LSTM with Uncertainty Cutoff (LUC) is an extension of the normal LSTM model. As mentioned before, the last layer in the LSTM model is a softmax activation [8]. This causes the output to be a probability distribution over the class predictions, where the sum of all the outputs equals one. Since the neural network provides probabilities, we can approximately measure how certain it is about its predictions. Sorting these values and plotting them against the precision at the corresponding index showed that the higher the prediction value, the more likely the model is to be correct. This can be seen in Figure 1. In the cases where the model is most certain, the precision is quite high; however, it eventually drops to around 20 percent. While this by itself is pretty self-explanatory, this characteristic can be exploited by only taking the predictions the model is most certain about. This should theoretically give better results, since the model should be correct more often.

To exploit this characteristic, a cutoff value was found based on the aforementioned validation set. Only the predictions above this cutoff value were taken into account for the test. For the other predictions, about which the model is not very certain, a simple heuristic is employed which suggests one of the two biggest classes within this customer segment, chosen based on how much of each has been bought previously. By doing this, the neural network only makes predictions when it is more certain, and the method falls back to a simple default heuristic in situations where the model is uncertain.

Figure 1: Sorted average precision on 10k validation samples
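The cutoff-plus-fallback logic can be sketched as follows (the cutoff value and the simplification of the heuristic to the single most frequent historical category are assumptions for illustration):

```python
import numpy as np

def luc_predict(probs, history_counts, cutoff=0.5):
    """Keep the model's prediction only when its top softmax score
    exceeds the cutoff; otherwise fall back to a heuristic based on
    how often each category was bought previously."""
    if probs.max() >= cutoff:
        return int(np.argmax(probs)), "model"
    return int(np.argmax(history_counts)), "heuristic"

hist = np.array([10, 0, 2])  # purchase counts per category

# Confident prediction -> the model's own output is used.
print(luc_predict(np.array([0.05, 0.85, 0.10]), hist))  # (1, 'model')

# Uncertain prediction -> heuristic fallback on purchase history.
print(luc_predict(np.array([0.40, 0.35, 0.25]), hist))  # (0, 'heuristic')
```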

5 RESULTS

In this section all the research questions will be answered one by one based on the experiments that were performed.

5.1 RQ1: How do different methods compare?

During training, some results could quickly be retrieved with regards to performance. As mentioned before, accuracy is a biased metric, so precision was taken as the evaluation metric.

The multiregressor suffered from a large problem, namely that training even a single tree took relatively long. It was decided that in order for the model to generalize well, the amount of data to be loaded was prioritized over a large number of trees. To allow the largest amount of data to be entered into the model in a reasonable timespan, the number of trees was limited to 25. Since the Mean Squared Error was higher than that of predicting every transaction to contain 0 items in any category, the category with the maximum predicted value was taken as the prediction.

For the first neural network, the results were somewhat better than the multiregressor when given the same amount of training time. This might be because this approach allows for some nonlinearity, but the results of this network were still not good enough. The LSTM network (without the cutoff method) outperformed the simple neural network by a small margin, which can largely be attributed to the LSTM layers. However, the LUC model beat all other models with regards to effectiveness when given the same amount of training time, and eventually reached a precision of 28% on a sample of the entire customer dataset.

After trying several different setups for the aforementioned approaches, they all seemed to suffer from the same problem: resorting to the majority class disproportionately often, most likely due to relations that could not be found. Nevertheless, the LUC was used for further experimentation and the eventual A/B test. Several deeper and wider models were also explored, but the eventual architecture of the model that performed best in the testing environment can be seen in Figure 2.

While it may be seen as bad practice to manually adjust the outcome of a model to suit specific needs, whilst experimenting with the LUC the manual cutoff proved to give better results than any of the other methods. The most confident predictions are still very accurate, which means there is a pattern that can be found for at least some customers, and the heuristic appears to be a decent backup option. This technique can be seen as a manual way of ensembling [26] that deals with unpredictable customers in a very computing-efficient manner. Ensembling is a popular method that combines multiple machine learning algorithms to perform better than a single one could; in this use case, a deep learning algorithm is combined with a simple heuristic.

Figure 2: LUC architecture

5.2 RQ2: How does the model perform for different user segments?

Unfortunately, the predictions retrieved from the LUC model were often not of much use to the company, for any of the applied methods. This is because most returned categories contained only (freshly made) food and drinks, which generally cannot be bought online. To deal with this and get the most meaningful results given the time and computing constraints, the decision was made to scope down to only customers who have Kids Apparel as their favorite world (which is defined as a large selection of products) and are in the top 20% of most valuable customers with regards to turnover in the last 12 months. The motivation for scoping down in this way was largely inspired by market research already performed by the company. This specific group has one of the highest percentages of omnichannel shoppers (customers who purchase products both online and offline), and still contains a large enough percentage of buyers from the total number of customers to draw meaningful conclusions from. Another reason this customer segment was chosen is that the market research showed these customers like to purchase discounted products. This is very easy to show in the emails that get sent out later on, provided the category can be predicted with high enough certainty.

By selecting from a predetermined group of people, it is more likely that these people have a more similar purchase pattern than two random customers from the entire customer base, since customers buying primarily from the same world implies they have similar needs. This in turn should result in a higher precision, since patterns can be found more easily with a simpler model. This is especially useful in the experimental setup for quickly trying out new models.

One interesting aspect to note when talking about result metrics is that the offline precision does not have to equate to the actual online precision, since the goal is to find out whether showing customers a personalized email changes their behavior. Showing people a specific product might prompt them to buy it or look at it, even if they originally did not intend to buy it. Similarly, the opposite can happen as well. This might cause the model to perform very differently than what the offline precision portrays.

5.3 RQ3: Along what dimensions does the model perform well?

The main results from the email test can be seen in Table 1. Campaign is the name of the campaign, where Test consists of Boys Clothing, Girls Clothing, and Party; these three groups were the only ones the model was able to make a prediction on. Accepted denotes the number of sent mails that were actually received by customers. Opens is the number of unique emails opened. Clicks denotes the number of unique emails that were clicked after opening. CTO (Click To Open Ratio) is the percentage of customers who clicked out of the customers who opened the email. CTR (Click-Through Ratio) denotes the percentage of people who clicked out of the customers who received the mail. Optout is the percentage of customers that, after seeing this mail, decided to unsubscribe from the email list. Sess. denotes the number of sessions. Tr. denotes the number of transactions made from the sessions. Rev. shows the direct revenue as a result of this email. AOV (Average Order Value) is the average income per order. Conv. is the conversion rate, the percentage of sessions that successfully generated a transaction. RPM (Revenue Per Mille) is the average revenue per 1000 emails sent. Transactions, Revenue, AOV, Conversion rate, and RPM have been turned into a relative increase/decrease.

Table 1: Email Test Results

Campaign        Accepted  Opens  Open Rate  Clicks  CTO    CTR    Optout  Sess.  Tr.    Rev.   AOV    Conv.  RPM
Control         22.326    5.853  26,22%     494     8,44%  2,21%  0,05%   680    1      1      1      1      1
Test            22.316    5.820  26,08%     561     9,64%  2,51%  0,2%    772    1,077  1,198  1,112  0,948  1,199
Boys Clothing   12.608    3.195  25,34%     311     9,73%  2,47%  0,25%   439    0,615  0,720  1,169  0,953  1,274
Girls Clothing  9.571     2.590  27,06%     247     9,54%  2,58%  0,15%   330    0,462  0,478  1,037  0,953  1,116
Party           137       35     25,55%     3       8,57%  2,19%  0,00%   3      0      0      0      0,00   0

Since the entire selected customer segment was sent an email, it is possible to draw meaningful conclusions from these results. In order to answer the main research question, the CTO and CTR from Table 1 were evaluated. To determine whether there is a difference in performance between a personalized email and a normal email, a two-sided statistical analysis was carried out with the intent of finding a possible significant difference and determining a 95% confidence interval. For both problems, α is set at 0,05. The null hypothesis (H0) and the alternative hypothesis (H1) to answer this question have been defined below.

Hypothesis H0: A personalized email does not yield a different percentage in CTO/CTR than a normal email.

Hypothesis H1: A personalized email yields a different percentage in CTO/CTR than a normal email.

The equations below were applied to calculate the statistical measures, with the goal of answering the main research question and finding other meaningful metrics.

H0: p̂_cont = p̂_test  (1)

H1: p̂_cont ≠ p̂_test  (2)

p̂_pooled = (positives_test + positives_cont) / (n_test + n_cont)  (3)

Z-Score: Z = (p̂_test − p̂_cont) / √( p̂_pooled (1 − p̂_pooled) (1/n_test + 1/n_cont) )  (4)

Point Estimate (PE): p̂_test − p̂_cont  (5)

Confidence Level (CL): Z_{α/2}  (6)

Margin of Error (MoE): √( p̂_cont (1 − p̂_cont) / n_cont + p̂_test (1 − p̂_test) / n_test )  (7)

Confidence Interval (CI): PE ± CL · MoE  (8)

The most important results are a 14,2% increase in CTO and a 19,9% increase in RPM. To find whether there is a significant difference in CTO, statistical analysis was performed. Applying formulas 1 to 4 to the available data with respect to CTO gives the following measures. The Z-Score is -2,2608. The corresponding value of p is 0,02382. The result is significant because p < α/2 (=0,025). H0 is rejected; it can be said with 95% confidence that there is a difference in proportions of CTO between p̂_test and p̂_control. Using formulas 5 to 8 gives a confidence interval of 95% that the true difference in proportions of CTO between the test group and the control group lies between 0,0016 and 0,0224. This further confirms the previous finding that the email the test group was sent performs better than the email the control group was sent.
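Formulas 1 to 8 can be reproduced in a few lines of Python. The counts below are hypothetical placeholders chosen only to illustrate the computation; the actual campaign figures are reported in aggregate by the email tool.

```python
import math

def two_proportion_z_test(pos_test, n_test, pos_cont, n_cont, z_crit=1.96):
    """Two-sided z-test for a difference in two proportions (formulas 1-8)."""
    p_test, p_cont = pos_test / n_test, pos_cont / n_cont
    # Pooled proportion under H0 (formula 3).
    p_pooled = (pos_test + pos_cont) / (n_test + n_cont)
    # Z-score using the pooled standard error (formula 4).
    z = (p_test - p_cont) / math.sqrt(
        p_pooled * (1 - p_pooled) * (1 / n_test + 1 / n_cont))
    # Point estimate, margin of error, and confidence interval (formulas 5-8).
    pe = p_test - p_cont
    moe = math.sqrt(p_cont * (1 - p_cont) / n_cont
                    + p_test * (1 - p_test) / n_test)
    return z, (pe - z_crit * moe, pe + z_crit * moe)

# Hypothetical counts: 660 of 10000 test clicks vs 600 of 10000 control.
z, ci = two_proportion_z_test(660, 10000, 600, 10000)
print(round(z, 3), ci)
```

Comparing the resulting z-score against the critical value Z_{α/2} = 1,96 then gives the same accept/reject decision as comparing p against α/2.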

Another metric from Table 1 that stands out from the rest is the Optout percentage. As mentioned before, this is the percentage of customers that, after seeing this mail, decided to unsubscribe from the email list. Applying formulas 3 and 4 from above gives the following metrics. The Z-Score is -2,2953. The corresponding value of p is 0,02144. The result is significant because p < α/2 (=0,025). From this alone it would be possible to claim that the personalized emails lead to a higher Optout ratio, were it not for the fact that the Optout ratio for the previous few campaigns aimed at the same customer segment was around 0,21%. The 0,05% Optout ratio of the control group can therefore be seen as an outlier: the control email is exactly what a normal non-personalized email looks like, so its Optout ratio was expected to be similar to that of a regular email.

A seemingly contradictory metric is the conversion rate. People seem to buy more, which can imply that a specific need has been filled, yet there is a lower conversion rate. Applying formulas 1 to 4 gives the following statistics. The Z-Score is 0,1975. The value of p is 0,84148. The result is not significant because p > α/2 (=0,025). The difference is likely caused by the small number of people who made an online transaction through the email. A larger sample size could possibly give more meaningful results.

It can also be meaningful to look at how effective the model was with regards to correctly predicting whether a specific category occurs in the next transaction or not. Table 2 shows the metrics for the next purchases of customers in the target group. As mentioned before, the offline precision does not necessarily equal the precision observed in practice. Showing people a product might prompt them to buy it or to find out what the product is, even if the prediction was wrong and the customer originally did not intend on buying it. This might cause the model to perform differently in a real-world setting.

Table 2 uses some of the same column names as Table 1, with three new ones. Prec. is the precision of a specific segment, Items is the total number of items bought, and Avg. is the average number of items bought per transaction. All metrics except precision have been turned into a relative increase/decrease.

Table 2: Purchase Behavior Week After Email Test

                      Prec.    Tr.    Rev.   AOV    Items  Avg.
Control               12,70%   1      1      1      1      1
Test                  13,19%   0,984  0,987  1,003  0,973  0,991
Control (Opened)      15,41%   0,233  0,225  0,965  0,214  0,920
Control (Unopened)    11,48%   0,767  0,775  1,010  0,784  1,025
Test (Opened)         12,44%   0,208  0,217  1,042  0,196  0,948
Test (Unopened)       12,62%   0,777  0,771  0,993  0,777  1,002

While it is expected that both these groups should act the same, since neither looked at any suggestion with regards to what to buy, there still appears to be quite a large difference. This is a possible indicator of high variance. Nevertheless, it appears that the email had a positive effect on the offline purchases of customers who viewed it as well.

The difference in behavior per day was also evaluated, since people might not be likely to remember a banner from a week earlier, but this shows similar behavior as in Table 2. The different metrics per day were plotted and can be found in Appendix A. Another metric that immediately stands out is the relatively low precision, which is much lower than what was achieved in the offline training.

The percentage of people who actually clicked on the personalized banner for the Boys Clothing, Girls Clothing, and Party categories is 14,03%, 26,54%, and 33% respectively. This appears to be inversely correlated with the AOV and the RPM.

Generally speaking, it appears that showing people a personalized banner with products they might buy according to the model does not necessarily cause them to buy these specific products, but it does seem to increase their average order value.

6 DISCUSSION

This section will discuss the results and explain important findings. Thereafter the limitations of this project and future work will be outlined.

Before discussing the results, it is important to reiterate that some purchases are impossible to predict correctly. To clarify, it is possible that a specific customer only buys products that are only available offline (e.g. meals from the restaurant). While the model may be able to predict this, there would currently be very limited use in showing the customer products that are only available offline, since there is nothing for the email to redirect to. If there is nothing to click on, there is nothing to measure the click-through ratio with. Another example of a transaction that is (near) impossible to predict is a customer buying a gift for someone else as their only item. This transaction is based on the needs of another person, which are impossible to know about with the currently available data. In addition, the current results were only found for a specific customer segment; customers with different interests might react differently or might require a different approach to achieve similar results.

All methods seem to suffer from the same issue: not being able to generalize very well. This can possibly be counteracted by turning it into a set of smaller classification problems: if only a single category needs to be evaluated, the resulting binary classification problem could be simpler to solve and could yield better results.
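This binary reformulation can be sketched as follows: instead of one model over all categories, each category gets its own 0/1 target indicating whether it appears in the next transaction. The category names below are made-up examples, not the company's actual taxonomy.

```python
def make_binary_labels(next_purchase_categories, category):
    """Turn multilabel targets (sets of categories per next purchase)
    into 0/1 labels for a single category's binary classifier."""
    return [1 if category in cats else 0 for cats in next_purchase_categories]

# Hypothetical next-purchase category sets for three customers.
targets = [{"party", "toys"}, {"boys_clothing"}, {"party"}]
print(make_binary_labels(targets, "party"))  # [1, 0, 1]
```

One such label vector per category would then feed one binary classifier per category, replacing the single multiclass model.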

A possible area of improvement might be the data fed to the model. Currently, only the categories of the previous purchase, the difference in dates, the number of items bought, and whether the purchase was online or offline are used. A possible data-related reason that the actual precision is quite a bit lower than what the offline model predicted might be the partially faulty dates. Sadly, this cannot be countered in any way with the current way data is loaded into the system. The data could possibly be extended by also including the advertisements a customer was sent and their activity on the website. For this project, however, this was infeasible within the time frame.

While customer interaction and customer behaviour can be influenced by email to a certain extent, there are more ways than the one evaluated in this research to influence this. Notable examples of this are tv commercials, radio commercials, flyers, online advertisements, and the HEMA app. By applying the same personalization to these channels as well, the observed effect could possibly increase even more.

The precision achieved in the test is much lower than what was achieved in the offline setting. The reason for this cannot be stated with full confidence, but it is likely that this decrease is due to the fact that only a select few of the categories could actually be implemented.

6.1 Limitations

There are a few limitations to the approaches taken in this project. This project is by definition quite unique, as it can be seen as a multilabel multiclass classification problem with an input of variable length. However, for this project the definition was changed to simply a multiclass classification problem. This choice was made because implementing the multilabel aspect was deemed not very usable for the specific use case, while it would require a large amount of effort to implement. The reason it was deemed infeasible is that new banners and email layouts would have to be created from scratch, and the current program that handles sending out emails requires manual creation for all possible combinations of categories. This would be far too much effort to implement manually in its entirety, since pairs alone from the current 20 categories already give 380 possible combinations which would need to be created manually. If there were an automatic way to do this, it might be useful to explore in further studies if the goal is to predict the complete next purchase as opposed to which class has the highest probability of occurring.
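The 380 figure can be checked quickly, assuming it counts ordered pairs of two distinct categories (20 × 19); unordered pairs would give 190.

```python
from itertools import permutations

# Each ordered pair of two distinct categories would need its own
# manually created banner and email layout.
n_pairs = len(list(permutations(range(20), 2)))
print(n_pairs)  # 380
```

The count grows quadratically with the number of categories, which is why manual creation does not scale without an automated layout pipeline.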

The effect of the personalized advertisement may be barely noticeable due to the placement of the extra category. Since it is below a different, larger banner and has many similarities to the rest of the content of the email, it might blend in without the customer even actively noticing it.

The promo campaign that the recommendation was attached to had been going on for some weeks already. As a result, all the observed metrics are highly likely to be somewhat lower than would be the case for a normal email. There was no way to counteract this, so the test does not give complete insights into how a normal mail would perform.

Whilst experimenting with algorithms, the assumption was made that the more precise a network is with regards to predicting the next purchase, the more likely the customer is to act on this. While the proposed algorithm directly caused an increase in revenue in practice, this does not have to be the optimal solution. As found in the results, some data seems to be contradictory (e.g. the lower conversion rate in the test group). While this difference may not be significant, it still raises the question of whether showing the customer their entire predicted next purchase causes them to buy the most products. Since this use case had never been performed for this company, further research is required to find out what works best with regards to personalization. Perhaps this can be done through counterfactual evaluation, so that an online test is not needed and multiple variants of the model can be run simultaneously.

However, due to the way email logs are currently stored within the company, counterfactual evaluation is not feasible yet. Processing the email logs would take a large amount of time, which was not achievable for this project. This eliminates the possibility of achieving relevant results without having to implement a full online A/B test. In the future, however, this could be applied to allow for a simple way of testing without the risk of harming the customer base by presenting strange advertisements. Also, if it is very certain a customer will not click, the decision to send them a specific advertisement could be revised. If this data had been available for use, it might have been possible to iterate over more unique algorithms.

6.2 Future Work

Since this problem can be tackled in a large number of different ways, the results of this experiment only give rise to even more questions. To what extent would new data improve the results? Could a deeper or wider network perform better than the possibilities explored? Is it possible to create a prediction of the entire next purchase of a customer?

Due to time constraints, it was sadly not possible to evaluate a possible hierarchical loss function, where if a product was ranked wrongly but was in the right world, the gradient update would be applied less severely because the model was still partially right. This might prove to be useful for future research in this area. Another limitation due to time was that it unfortunately was not possible to delve deeper into the softmax activation cutoff method. The results showed even more potential there, which might be realized by a more complex neural network that takes the output of the current LSTM model and determines what to do with uncertain predictions based on several different factors.
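The hierarchical loss idea can be sketched as follows: the negative log-likelihood of the true category is scaled down when the model's top prediction is wrong but lands in the correct world, so the gradient update is applied less severely. The partial_credit factor and the world mapping are hypothetical illustrations, not values from the thesis.

```python
import math

def hierarchical_nll(probs, true_cat, pred_cat, world_of, partial_credit=0.5):
    """Negative log-likelihood of the true category, down-weighted when the
    top prediction is wrong but falls in the same world (parent group).
    partial_credit is a hypothetical scaling factor."""
    loss = -math.log(probs[true_cat])
    if pred_cat != true_cat and world_of[pred_cat] == world_of[true_cat]:
        loss *= partial_credit  # smaller loss => gentler gradient update
    return loss

# Hypothetical worlds: categories 0 and 1 share a world, 2 is separate.
world_of = {0: "kids", 1: "kids", 2: "home"}
print(hierarchical_nll([0.2, 0.5, 0.3], true_cat=0, pred_cat=1, world_of=world_of))
```

A proper implementation would apply this scaling inside the training loop so that only the gradient magnitude, not the reported metric, is affected.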

Ensembling was not implemented to the extent that it could have been, given more time. Since most successful neural networks apply some form of bagging or boosting [7], applying a proper ensembling method is very likely to improve the behavior of the model.

Joachims et al. [14] have shown that click behaviour can be difficult to interpret as implicit feedback, and that using it as an absolute measure can incorporate some bias. They claimed that whether people click or not is influenced by the trust people have in the search function. For the company, however, this cannot be measured, since the personalized banner is not really a search function and the email tool does not provide the possibility of seeing where someone clicked. Nevertheless, it might prove useful to conduct a more in-depth analysis of whether a banner's relevance directly influences where or whether someone decides to click.

If the proposed solution can be improved to accurately predict the correct category with reasonably high precision, it may be possible to create a multitude of subnetworks which can each predict a product inside a given category. While this might currently be infeasible, it could become feasible if the world is also taken into account at the start, which would drastically limit what actually needs to be calculated when making a prediction. There would then exist three separate neural networks, where the first predicts the world, the second predicts the category within the world, and the third predicts the actual product within the category.
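The three-stage hierarchy could be wired together as below. The ConstantModel stub stands in for the trained networks, and all labels are hypothetical placeholders; only models for the predicted world and category ever need to run.

```python
class ConstantModel:
    """Stub standing in for a trained network; always returns one label."""
    def __init__(self, label):
        self.label = label
    def predict(self, history):
        return self.label

def predict_next_product(history, world_model, category_models, product_models):
    world = world_model.predict(history)                 # stage 1: world
    category = category_models[world].predict(history)   # stage 2: category in world
    product = product_models[category].predict(history)  # stage 3: product in category
    return world, category, product

result = predict_next_product(
    [],  # purchase history would go here
    ConstantModel("kids"),
    {"kids": ConstantModel("boys_clothing")},
    {"boys_clothing": ConstantModel("t-shirt")},
)
print(result)  # ('kids', 'boys_clothing', 't-shirt')
```

Because each stage conditions on the previous one, only one category model and one product model are evaluated per prediction instead of one model over the full product catalogue.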

7 CONCLUSION

This research undertook the task of predicting the next purchase of a customer using four different algorithms. The results outline the many difficulties when it comes to predicting the next purchase of a customer, especially for this use case. However, it appears that even implementing a simple heuristic already increases the revenue and clicks within the email by a relatively large amount. If some form of offline evaluation can be implemented, it might be possible to create many simple heuristics that together perform about as well as a neural network would, but with less required processing power.

To answer the main research question, "How can next-purchase prediction help in maximizing customer interaction?": next-purchase prediction increases customer interaction for customers who have Kids Apparel as their favorite world through the addition of a personalized banner for a specific category in an email. It appears that showing people a personalized banner with products they might buy does not directly influence the customer to buy these specific products, but it still increases the click-to-open ratio and the overall revenue. For this test, increases of 14,2% and 19,9% respectively were measured.

This research does not claim, however, that this is the absolute best possible way to perform personalization, but it is a step in the right direction. The main limitations for this specific project were related to the email test, since it was not automated for this use case, and therefore only a single test could be run within the given time frame. Due to the time constraints, it was not possible to fully delve into more specific methods.

Future work is needed in order to come closer to determining the best possible way to perform personalization for the company, especially with regards to other customer segments, since these might react differently to a similar approach. The main way of improving the network within the same use case is highly likely to be either ensembling or a mixture-of-experts model. Nevertheless, this experiment shines a light on the possibilities of implementing a prediction of the next purchase of a customer in the form of a personalized banner in email advertising.

REFERENCES

[1] Bollen, D., Knijnenburg, B. P., Willemsen, M. C., and Graus, M. Understanding choice overload in recommender systems. In Proceedings of the fourth ACM conference on Recommender systems (2010), ACM, pp. 63–70.

[2] Bottou, L., Peters, J., Quiñonero Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research 14, 1 (2013), 3207–3260.

[3] Boyer, K. K., and Hult, G. T. M. Customer behavior in an online ordering application: A decision scoring model. Decision Sciences 36, 4 (2005), 569–598.

[4] Buder, J., and Schwind, C. Learning with personalized recommender systems: A psychological view. Computers in Human Behavior 28, 1 (2012), 207–216.

[5] Chittenden, L., and Rettie, R. An evaluation of e-mail marketing and factors affecting response. Journal of Targeting, Measurement and Analysis for Marketing 11, 3 (2003), 203–217.

[6] Cho, C.-H. The effectiveness of banner advertisements: Involvement and click-through. Journalism & Mass Communication Quarterly 80, 3 (2003), 623–645.

[7] Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40, 2 (2000), 139–157.

[8] Dunne, R. A., and Campbell, N. A. On the pairing of the softmax activation and cross-entropy penalty functions and the derivation of the softmax activation function. In Proc. 8th Aust. Conf. on the Neural Networks, Melbourne (1997), vol. 181, Citeseer, p. 185.

[9] Elkahky, A. M., Song, Y., and He, X. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web (2015), International World Wide Web Conferences Steering Committee, pp. 278–288.

[10] Gomez-Uribe, C. A., and Hunt, N. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS) 6, 4 (2016), 13.

[11] Haykin, S. Neural Networks (vol. 2). New York: Prentice Hall (1994).

[12] Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition (1995), vol. 1, IEEE, pp. 278–282.

[13] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

[14] Joachims, T., Granka, L. A., Pan, B., Hembrooke, H., and Gay, G. Accurately interpreting clickthrough data as implicit feedback. Sigir 5 (2005), 154–161.

[15] Lam, X. N., Vu, T., Le, T. D., and Duong, A. D. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication (2008), ACM, pp. 208–211.

[16] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 1 (2003), 76–80.

[17] Nair, V., and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010), pp. 807–814.

[18] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (2013), pp. 1310–1318.

[19] Ricci, F., Rokach, L., and Shapira, B. Introduction to recommender systems handbook. Recommender Systems Handbook (2011), 1–35.

[20] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Cognitive Modeling 5, 3 (1988).

[21] Sarwar, B. M., Karypis, G., Konstan, J. A., Riedl, J., et al. Item-based collaborative filtering recommendation algorithms. Www 1 (2001), 285–295.

[22] Singhal, R., Shroff, G., Kumar, M., Choudhury, S. R., Kadarkar, S., Virk, R., Verma, S., and Tewari, V. Fast online 'next best offers' using deep learning. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (2019), ACM, pp. 217–223.

[23] Smith, B., and Linden, G. Two decades of recommender systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.

[24] Van den Oord, A., Dieleman, S., and Schrauwen, B. Deep content-based music recommendation. In Advances in Neural Information Processing Systems (2013), pp. 2643–2651.

[25] Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2015), ACM, pp. 1235–1244.

[26] Zhou, Z. H., Wu, J., and Tang, W. Ensembling neural networks: many could be better than all. Artificial Intelligence 137, 1-2 (2002), 239–263.

A FIGURES

For extra context, day 0 in the figures in this section is on a Thursday, and Figures 4 to 8 have been adjusted to show a relative increase/decrease.

Figure 3: Precision days after email

Figure 4: Transactions days after email

Figure 5: Total revenue days after email


Figure 6: Average revenue days after email

Figure 7: Total items purchased days after email

Figure 8: Average items purchased days after email
