Temporal Collaborative Filtering: Reordering Recommendations to Add Temporal Specificity

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Diederik Beker
10190848

Master Information Studies: Human-Centered Multimedia
Faculty of Science
University of Amsterdam

June 29, 2017

1st Supervisor: Dr. T.E.J. Mensink
2nd Supervisor: Dr. F.M. Nack

Temporal Collaborative Filtering: Reordering Recommendations to Add Temporal Specificity

Diederik Beker

University of Amsterdam

Amsterdam, The Netherlands

diederik.beker@live.nl

ABSTRACT

Recommender systems are a good way to help users find content or items that might interest them. Different approaches exist for generating recommendations, which output content or items with a relevance score. For some applications, however, it would be better to order recommendations based on a temporal aspect specific to the user's needs. Generating recommendations based on a temporal aspect is, as far as we know, a new approach in the recommender systems field. This paper describes a Temporal Collaborative Filtering method to determine a reordering of recommendations, generated with a Collaborative Filtering approach, based on the temporal needs of a user on an online marketplace for study materials. The approach makes use of a directed graph to find the shortest distance between a user's last download and his/her recommendations, reordering the recommendations according to temporal relevance. The results show that adding temporal specificity to recommendations is possible and might improve the recommendations.

Author Keywords

Collaborative Filtering, Graph, Recommender System, Temporal Recommendations, Temporal Collaborative Filtering

INTRODUCTION

Digital archives tend to grow very rapidly, making it impossible to browse through the content of some online marketplaces. Spotify currently holds over 30 million songs1, while Amazon's estimated number of items is over 1 billion2. Their massive size means that users generally have two options to find what they need: either they search for what they want, or they get recommended items they might want. The disadvantage of search is that users need to know the appropriate keywords to input into the search bar, and the results are generally not evaluated for relevance. Recommendations, on the other hand, take into account previous actions of users, can determine the relevance of items to users, and do not require the user to input specific keywords. Recommending items to users has great value for some companies; Netflix even hosted a competition with a prize of one million dollars for the team that could improve their recommendations by a mere ten percent3 [3].

1 https://press.spotify.com/us/about/
2 https://export-x.com/2015/12/11/how-many-products-does-amazon-sell-2015/

Many different types of recommender systems exist, with different goals and use cases. A recommender system might, for example, use previous user events to determine the best recommendations for a specific user. A suggested movie could be generated from the user's previously watched movies and his/her ratings for specific movies. If the user mostly likes action-filled war movies, the recommender system can recommend similar movies. This example can be considered a simple recommender system, though many approaches to generating recommendations exist. The approach used depends largely on the data the system has available and the output that the system should generate.

Recommendations are generated based on features that do not include temporal data; in other words, they are not ordered by time but by importance or relevance. This is a good approach in most cases, where temporal specificity is not of importance to the items that are recommended, for example with Netflix, where it does not matter if you get recommended a movie that was created before a movie you have already watched. This temporal specificity can, however, be important for some applications.

Stuvia4 is an online marketplace for study notes and summaries where students can go to obtain information they might need for studying. It currently holds over 300,000 items, including summaries, study notes, study questions, etc. Its member base consists of almost half a million members and is growing exponentially, with over 50,000 visits monthly and days with more than 100,000 page views. With this much content, a user can have a difficult time finding the right summaries or study notes. A search for “marketing” currently yields 12,415 results, making it hard for a user to select the summary they need. Much data exists about the users as well as about the content, which can be used to extend a helping hand to the user in finding the content they need. On such a platform, temporal specificity is important, as a user needs content in a particular order. To understand this, it is imperative to understand the users of such a platform: students. Students follow study programs with a particular order of courses; the content needed by students can thus be assumed to follow a specific order as

3 http://www.netflixprize.com/
4 www.stuvia.com


well. Even though this order occasionally changes and might differ across universities, these patterns can be studied to determine whether an order exists between content on an online marketplace for study materials. Since the content on the platform is not specific to any university, school, or course, it can be used across institutions and courses; the order in which content is needed can thus not be determined by simply following a course's logical order, but needs to be determined by studying patterns of users' transactions. An assumption for this research is then that if users have previously downloaded content in a particular order, other users will be interested in documents in the same order. This so-called ‘order’ can also be called ‘temporal specificity’. This temporal specificity can thus be of importance to a recommender system on such a platform for online study materials.

To the best of our knowledge, temporal specificity within recommendations, as proposed here, has never been described. This paper looks into the aspect of temporal specificity for recommendations. The research is guided by the following research question: “How can a recommender system be created to generate temporally specific recommendations for an online marketplace for study materials?”. Sub-questions accompanying this research question are:

• “How can temporal specificity in transactions be represented?”

• “How can item order for a list of content be determined based on a temporal aspect?”

• “How can an existing recommender technique be adapted to include temporal specificity?”

RELATED WORK

Recommender systems are a popular field of study. In the following section we highlight the related work most important for understanding the choices made in this paper, as well as papers proposing relevant approaches.

Recommendation Techniques

Different approaches exist for generating recommendations; some distinguish as many as six different recommendation approaches [16, 4]. The most important difference among the types of recommender systems is the way recommendations are generated and the purpose for which they are used. Whichever way the recommendations are generated, the core functionality of such a system is to predict a user's needs. For recommendations to be generated, similarities between people or items are determined. These similarities can then be used to determine which new items can be of use to a user.

An example of an item-similarity-based approach can be found in content-based filtering [16, 4, 12, 15]. This technique is based upon item feature comparison, where items are recommended to users that have preferences for those features. The user preferences can be extracted from previously used content with certain features. A big advantage of such a system is that it can easily deal with new items without needing user interactions with the item. A disadvantage of this approach is that sparsity in metadata will cause the system to perform poorly [9]. For this particular marketplace for online study materials, this type of recommender system is less useful: although the items are described with features, these differ depending on the type of item and are often incomplete.

Different methods also exist for comparing users to one another. An example of this is a demographic recommender, which recommends items based on the demographics of a user [8, 1]. Another example is the Collaborative Filtering (CF) technique, which uses user similarities to predict a user's needs [18, 16, 7, 17]. The basic assumption of this technique is that a user that likes the same items as another user is more likely to be interested in other items this user has chosen as well. CF uses a user-item matrix of preferences which can be used to calculate the similarities between users [9]. Two different approaches exist for CF: memory-based and model-based. The memory-based approach computes the similarity between users or items, after which a retrieval method can be used to obtain recommendations. Many user similarity functions exist, including the Euclidean distance, the Log Likelihood similarity, and the Pearson Correlation similarity, among others [14]. A much-used retrieval method is the k-nearest neighbour method, retrieving the top-k items [11]. The CF approach is mostly used with ratings for certain items, though it can also be used in a boolean manner, for example with downloads of certain items, where a download is equal to one and a non-download is equal to zero.

The techniques mentioned in this section give a rough overview of the possible approaches to recommender systems. For the purpose of this research, the above overview is sufficient for understanding the choices made in this paper.

Temporal Aspects of Recommender Systems

Some research has been done on temporal specificity in recommender systems, but it does not provide a solution to the problem stated in this paper. Two similar solutions are proposed by Xiong et al. and Wang et al., who both consider a temporal aspect within recommendations. Xiong et al. propose a new method for taking into account the shifting interests of users over time, using a CF approach [23]. Wang et al. describe a method of capturing a user's change of interest over time, inspired by ant colony behaviour [21]. Even though both take temporal specificity into account, theirs is a vastly different problem from the one stated in this paper.

Xiong et al. address a temporal issue on the global scale, spanning all data. The problem in this paper, however, is specifically tuned to a user's personal temporal change. The change does not happen on a global scale, it happens on a user-specific scale, making it an entirely different problem. Wang et al. do describe a method on a user scale, and even though their research is more closely related than that of Xiong et al., there is still a big difference between the problem stated there and the problem in this paper. The difference is that the change in user interest, as described by Wang et al., happens gradually over time, while a change in a student's course is instant: a user who has finished a course continues on to the next course. They describe the transactions between users and items as pheromones; transactions with the characteristic of fading over time.

By pheromone transmission between users and items and evaporation of those pheromones on both users and items over time - [21, p. 16]

The fading of these transactions is not appropriate for the problem stated in this paper, as past user activities are used both to learn for future users and to predict the user's own interests. Wang et al. do touch upon the notion of using graphs with a CF technique, though the way in which the graph is used is different, as they cast the CF problem as a bipartite graph mining problem.

METHODS

The chosen method for a temporal recommender system depends largely on the type of data that is available. The platform on which this is tested holds transactional data for each user. A transaction consists of one or multiple documents, flashcards, or bundles, and each transaction has a timestamp. The method described below can be used for any application that has similar transactional data, making this a general approach to adding temporal specificity to recommendations.

Recommendation Generation

As a starting point for the proposed temporal recommender system, where we reorder recommendations based on past user activities, we use CF, since it also uses past user activities. With CF, these past user activities are used to determine similarities between users, determining the most similar users based on item overlap and recommending items that a user has not yet obtained. Figure 1 illustrates an example of CF. The recommendations are generated using a memory-based CF approach. This approach requires the creation of a user-item matrix B, usually filled with preference data. As the data available within the platform is transactional, it does not include direct preferences. Here we assume that a download corresponds to a user having a preference for a certain item. This results in a boolean user-item matrix, where an interaction with an item refers to a download of that item, such that B_{i,j} = 1 denotes that user i downloaded item j. A small example of a user-item matrix can be seen in figure 1.

To generate recommendations, the CF system is given a user q ∈ U, where U contains all users on the platform. The similarity of q to each p ∈ U is calculated using a similarity measure. The retrieval method selects the top-k similar users, creating the subset T ⊆ U, and aggregates the user-item matrices of all users in T.
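As a minimal sketch of these two steps (in Python, with toy user and item identifiers taken from figure 1, not from the paper's data set), the boolean user-item matrix and the top-k retrieval could look like:

```python
from collections import defaultdict

def build_matrix(downloads):
    # Boolean user-item matrix B as nested dicts: B[user][item] = 1
    # whenever `user` downloaded `item`.
    B = defaultdict(dict)
    for user, item in downloads:
        B[user][item] = 1
    return B

def top_k_similar(B, q, k):
    # Retrieve the k users most similar to q; plain item overlap is used
    # here as a stand-in for the similarity measures compared below.
    overlap = lambda p: len(B[q].keys() & B[p].keys())
    candidates = [p for p in B if p != q]
    return sorted(candidates, key=overlap, reverse=True)[:k]

# the toy data of figure 1: User4 shares two downloads with User1
downloads = [("User1", "Doc1"), ("User1", "Doc2"), ("User1", "Doc3"),
             ("User2", "Doc2"), ("User3", "Doc1"),
             ("User4", "Doc1"), ("User4", "Doc3")]
B = build_matrix(downloads)
T = top_k_similar(B, "User1", 1)  # -> ["User4"]
```

With k = 1 the retrieved neighbourhood is User4, matching the worked example of figure 1.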

To compute the similarity between users, a similarity measure has to be chosen. Choosing the correct similarity measure is an important part of the approach and is detailed in the experiment section of this paper. Experimentally we compare the City Block distance (1) [20], the Euclidean distance (2) [5, 2], the Log Likelihood similarity (3) [6], and the Uncentered Cosine similarity (4) [22], which can be calculated as stated below, where each of the n items is seen as a dimension and the distance is computed between two users p and q.

S_CB(q, p) = \sum_{i=1}^{n} |q_i - p_i|    (1)

S_E(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (2)

S_LL(q, p) = \sum_{i=1}^{n} q_i \log p_i    (3)

S_UC(q, p) = \frac{\sum_{i=1}^{n} q_i p_i}{\sqrt{\sum_{i=1}^{n} q_i^2 \sum_{i=1}^{n} p_i^2}}    (4)

Another important step in generating recommendations using CF is the retrieval method, which is therefore also detailed in the experiment section of this paper. The best-performing retrieval method is the k-nearest neighbour method with k = 20; in other words, the top 20 most similar users are selected to generate recommendations from, resulting in the set T.
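Equations (1), (2), and (4) can be sketched directly in Python (eq. (3) is omitted here because log p_i is undefined for the zero entries of a boolean matrix); note that (1) and (2) are distances, where smaller means more similar, while (4) is a similarity:

```python
import math

def s_cityblock(q, p):
    # Eq. (1): City Block (Manhattan) distance between user vectors
    return sum(abs(a - b) for a, b in zip(q, p))

def s_euclidean(q, p):
    # Eq. (2): Euclidean distance between user vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))

def s_uncentered_cosine(q, p):
    # Eq. (4): Uncentered Cosine similarity between user vectors
    num = sum(a * b for a, b in zip(q, p))
    den = math.sqrt(sum(a * a for a in q) * sum(b * b for b in p))
    return num / den if den else 0.0

# two toy boolean download vectors over four items
q = [1, 1, 1, 0]
p = [1, 0, 1, 1]
```

For these vectors the City Block distance is 2, the Euclidean distance is sqrt(2), and the Uncentered Cosine similarity is 2/3.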

T can now be used to generate a prediction for item i for a user q by computing the weighted average for item i over all users in T, where the weight is defined by the similarity measure. Using user-item matrix B, similarity measure S_E, and p ∈ T being a user similar to q, we denote the prediction P as:

P(q, i) = \frac{\sum_{p \in T} S_E(q, p) \, B_{p,i}}{\sum_{p \in T} |S_E(q, p)|}    (5)
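A minimal sketch of equation (5) follows. One assumption is made explicit in the code: the Euclidean distance of eq. (2) is converted to a similarity weight via 1/(1 + d), as distance-based similarity implementations commonly do; the paper's equation only requires some similarity weight S_E(q, p):

```python
import math

def similarity(q, p):
    # Assumption: Euclidean distance (eq. (2)) converted to a similarity
    # in (0, 1]; eq. (5) only needs a similarity weight S_E(q, p).
    return 1.0 / (1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p))))

def predict(q_vec, neighbour_vecs, i):
    # Eq. (5): similarity-weighted average of the neighbours'
    # boolean interactions B[p][i] over all p in T.
    num = sum(similarity(q_vec, p) * p[i] for p in neighbour_vecs)
    den = sum(abs(similarity(q_vec, p)) for p in neighbour_vecs)
    return num / den if den else 0.0

q = [1, 1, 0, 0]                  # user q's boolean download vector
T = [[1, 1, 1, 0], [1, 0, 0, 1]]  # vectors of the top-k similar users
score = predict(q, T, 2)          # predicted preference of q for item 2
```

Only the first neighbour downloaded item 2, so the prediction is that neighbour's weight divided by the total weight.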

The resulting subset O ⊆ I, where I contains all documents, flashcards, and bundles present on the platform, comprises the predictions for i ∈ I for user q where P(q,i) > 0. Requesting k recommendations will return the top-k items in O.

Temporal Specificity

In this section we discuss our approach to adding temporal specificity to the recommendations in set O. We call this new method Temporal Collaborative Filtering (TCF), since it is an extension of CF, adding temporal specificity by reordering a subset of the recommendations in O.

Representation

To be able to determine the reordering of the recommendations, it is imperative to represent the transactional data while retaining its temporal component, as well as to have a method of determining the distance between items. To do this, we define a temporal graph as a directed unweighted graph G = (V, E) consisting of vertices V = {v_1, ..., v_n} and directed edges E ⊆ V × V. Each vertex v ∈ V represents a unique document, flashcard, or bundle, and each edge (v_i, v_j) ∈ E represents a user's subsequent download. This means that the order in which content was downloaded is maintained within the structure of the graph, and the distance between two vertices, i.e. documents, flashcards, or bundles, relates to their temporal dependency.


User    Doc1  Doc2  Doc3
User1   1     1     1
User2   ?     1     ?
User3   1     ?     ?
User4   1     ?     1

Figure 1. A Collaborative Filtering approach compares users based on their activity and recommends items that do not overlap. In this case, User4 is most similar to User1; a Collaborative Filtering approach would recommend User4 to download Doc2.

User    Trans1  Trans2     Trans3  Trans4
User1   doc1    doc2,doc3  doc4    doc5
User2   doc1    doc4       doc6    doc7,doc8
User3   doc2    doc5,doc9  doc7    doc10

Figure 2. Transactional data for three users used to create a directed unweighted graph maintaining the order between transactions.

Figure 3. The resulting directed unweighted graph from the transactional data shown in figure 2.

To create the temporal graph, the transactional data is ordered according to the timestamp and iterated per user. Each document, flashcard, or bundle is represented as a vertex in the graph. As the vertices represent unique documents, flashcards, or bundles, any vertex can exist only once. Edges are added from any item in a transaction to any item in that user's next transaction, e.g. if a user has downloaded document 3 in one transaction and subsequently document 5 in another, both documents 3 and 5 would be vertices in the graph and an edge would be created from the vertex representing document 3 to the vertex representing document 5. This results in a graph representing user transactions, while maintaining the order in which they took place. An example of a small part of such a graph is illustrated in figure 3.
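Assuming per-user transaction lists that are already ordered by timestamp, the construction can be sketched as follows (item ids as in figure 2):

```python
from collections import defaultdict

def build_temporal_graph(user_transactions):
    # One vertex per unique item; one directed edge from every item in a
    # transaction to every item in the same user's next transaction.
    edges = defaultdict(set)
    for transactions in user_transactions.values():
        for prev, nxt in zip(transactions, transactions[1:]):
            for a in prev:
                for b in nxt:
                    edges[a].add(b)
    return edges

# the transactional data of figure 2
data = {
    "User1": [[1], [2, 3], [4], [5]],
    "User2": [[1], [4], [6], [7, 8]],
    "User3": [[2], [5, 9], [7], [10]],
}
graph = build_temporal_graph(data)  # e.g. graph[4] == {5, 6}
```

Vertex 4 gets outgoing edges to 5 (User1) and 6 (User2), matching the graph of figure 3.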

Determining Temporal Order

The temporal graph now includes all transactional data needed to obtain temporal specificity within the recommendations. To add temporal specificity to the recommendations R ⊆ O, where R includes the top-k items from O, the temporal graph can be used to find the distances between a user q's last download q_latest and each recommendation r_i ∈ {r_1, ..., r_n} = R. As an item is represented by a vertex, finding the distance between q_latest and r_i is the same as finding the distance between the vertices v_1 and v_2 representing q_latest and r_i respectively. For each item r_i, we define the following distance:

t(q, r_i) = d_{pl}(q_{latest}, r_i)    (6)

For temporally specific recommendations, we sort the items in ascending order of path length t(q, r_i).

Two different distances between vertices can be used: the shortest path length and the average path length, as the main focus is to find the difference in distance between the recommendations in R given the user's last download q_latest. Although it might seem intuitive to take multiple of the user's last downloads into account to determine the reordering, these are already taken into account by the CF when constructing the set R, and they would not alter the shortest path between q_latest and r_i. As multiple paths might exist between two vertices, an assumption could be that the average path length between two vertices would give the most accurate result, since this would take into account multiple user transactions and distances between vertices. A problem that could arise with this is the fact that finding the average path length in a large graph can take a considerable amount of time, rendering it useless in a production environment. One way to overcome this issue is to set a cutoff length, providing the algorithm with a maximum path length to consider.

Using the shortest path length has the advantage of being much faster and could very well be a good distance measure to use in reordering the recommendations, as it provides accurate information about the distance from a user's last download to the recommendations. Even though they are different measures, they are closely related: the shortest path length is taken into consideration while calculating the average path length, and as such, a smaller shortest path length will most likely be reflected in a smaller average path length. Another advantage of using the shortest path length is that no cutoff is necessary, since finding the shortest path length is not as costly as finding the average path length. This can provide reordered recommendations even if these recommendations are far away from the user's latest download. The experiments section of this paper provides the results of testing both distance measures, showing that the shortest path length provides more accurate results.
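Because the temporal graph is unweighted, the shortest path length is a plain breadth-first search; unreachable recommendations get an infinite distance. A sketch, assuming the graph is stored as adjacency sets:

```python
from collections import deque

def shortest_path_length(edges, start, goal):
    # BFS over the directed, unweighted temporal graph; returns the
    # number of edges on a shortest path, or infinity when unreachable.
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        v, d = frontier.popleft()
        for w in edges.get(v, ()):
            if w == goal:
                return d + 1
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return float("inf")

# adjacency sets of the temporal graph in figure 3
edges = {1: {2, 3, 4}, 2: {4, 5, 9}, 3: {4}, 4: {5, 6},
         5: {7}, 6: {7, 8}, 7: {10}, 9: {7}}
d = shortest_path_length(edges, 4, 10)  # 4 -> 5 -> 7 -> 10, so 3
```

BFS visits each vertex and edge at most once, which is what makes this distance cheap enough to skip the cutoff that the average path length needs.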

Reordering

Using the above distance measure, the distance can be found between a user's last download q_latest and each of the recommendations r_i ∈ R. Suppose we have the temporal graph shown in figure 3, a user whose last downloaded document has id 4, and recommendations from the CF comprising documents 10, 9, 7, and 5, in that order. Using the Temporal Collaborative Filtering method with q_latest = 4 and R = {10, 9, 7, 5}, the distance is found between q_latest and each r_i ∈ R, resulting in t(q, r_1) = 3, t(q, r_2) = infinite, t(q, r_3) = 2, t(q, r_4) = 1. As document 9 can never be reached in this directed graph, its distance is infinite. The recommendation set R is reordered according to the found distances, creating TR = {5, 7, 10, 9}.
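This worked example can be reproduced with a small sketch: a BFS distance plus a stable sort, so unreachable items with infinite distance fall to the end while ties keep their CF order:

```python
from collections import deque

def distance(edges, start, goal):
    # BFS shortest path length; float('inf') when goal is unreachable.
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        v, d = frontier.popleft()
        if v == goal:
            return d
        for w in edges.get(v, ()):
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return float("inf")

def reorder(edges, q_latest, recommendations):
    # TCF step: sort the CF recommendations by their temporal distance
    # (eq. (6)) from the user's latest download.
    return sorted(recommendations, key=lambda r: distance(edges, q_latest, r))

# the temporal graph of figure 3 and the example in the text
edges = {1: {2, 3, 4}, 2: {4, 5, 9}, 3: {4}, 4: {5, 6},
         5: {7}, 6: {7, 8}, 7: {10}, 9: {7}}
TR = reorder(edges, 4, [10, 9, 7, 5])  # -> [5, 7, 10, 9]
```

The distances come out as 3, infinite, 2, and 1 respectively, reproducing TR = {5, 7, 10, 9}.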

Updating

For an online marketplace for study materials it is important to be able to update the TCF system regularly, as many transactions take place each day. Building a temporal graph from scratch, however, can take a long time; for the data set used in the experiment section of this paper, for example, constructing the temporal graph takes over thirty-three hours5. This is one more reason to use a graph as described above. By representing the transactional data as such a temporal graph, any new transactions can easily be added to an existing temporal graph by creating new vertices for new documents, flashcards, or bundles, and new edges between user transactions.
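Because the graph is just vertices with adjacency sets, appending new transactions only touches the affected vertices; a sketch of the incremental update (toy item ids, not the paper's data):

```python
from collections import defaultdict

def add_user_transactions(edges, previous_transaction, new_transaction):
    # Incrementally extend an existing temporal graph: add an edge from
    # every item in the user's previous transaction to every item in the
    # new one; unseen items become vertices implicitly.
    for a in previous_transaction:
        for b in new_transaction:
            edges[a].add(b)

edges = defaultdict(set, {1: {2, 3}})      # an existing (toy) graph
add_user_transactions(edges, [2, 3], [4])  # a user downloads item 4 next
# edges[2] == {4} and edges[3] == {4} now hold
```

Each day's transactions can thus be folded in at a cost proportional to the new edges, avoiding the thirty-three-hour rebuild.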

EXPERIMENTS

This section explains the experiments that were conducted to test the described system. This includes offline and online experiments, which have different goals, as explained in their individual sections. To test whether the described approach improves the recommendations, a quantitative study was conducted.

Offline experiments

The offline experiments are conducted for a variety of reasons. The first and foremost of those reasons is optimisation and parameter selection for both the CF and TCF. The second reason is to gain preliminary insights into the effectiveness of the proposed approach.

Data Set

To conduct the offline experiments, a cross-validation approach is used. A data set is chosen from the available data and split into a training and a validation set according to an 80%/20% split. One thing to consider with this data set is that the data has to stay ordered over time, as we are looking into temporal specifics. For this reason, we cannot create a random split of the data set, as the data cannot be reordered. A random split would also take into account future users and future content when generating recommendations, which would generate recommendations that might not have been available to a user at that time. This does pose the problem that only one data set is available for the offline experiments, which can thus only provide preliminary insights into the effectiveness of the system. The resulting data set includes 205,170 users and 621,819 entries in the train set and 80,075 users and 155,454 entries in the validation set.
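The order-preserving split can be sketched as follows, assuming each entry carries a timestamp (the field name is hypothetical):

```python
def temporal_split(entries, train_frac=0.8):
    # Order-preserving split: sort by timestamp and cut at 80%, so the
    # validation set contains only events after every training event,
    # unlike a random split, which would leak future users and content.
    ordered = sorted(entries, key=lambda e: e["timestamp"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# ten toy entries with increasing timestamps
entries = [{"user": "u1", "item": "d1", "timestamp": t} for t in range(10)]
train, validation = temporal_split(entries)
# len(train) == 8, len(validation) == 2
```

Every training event precedes every validation event, which is exactly the constraint that rules out standard shuffled cross-validation folds here.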

Generating Ground Truth

To be able to validate the system, the generated recommendations need to be compared to later user downloads. As users on the platform need specific content, not specific documents, the content of the recommendation has to be compared to the content of the user's downloads. A possible way to do this is by using a method like term frequency-inverse document frequency (tf-idf) to determine the importance of words in a document, which can be used to determine the semantic similarity of two documents [13].

5 From 777,273 rows of transactional data, using an AMD A10-5800k APU with Radeon HD Graphics x 4, Gallium 0.4 on AMD ARUBA (DRM 2.46.0 / 4.8.0-56-generic, LLVM 3.8.0) GPU, on 64-bit Ubuntu 16.04 LTS.

As the platform makes use of Elasticsearch6, which is loaded with all the content available on the platform, the ‘more_like_this’7 function available in Elasticsearch provides the basis for creating such a comparison. The ‘more_like_this’ function uses tf-idf and semantic comparison to retrieve similar documents. By giving this function the recommendations, and aggregating this with a function that filters out all documents except the downloads belonging to a user and present in the validation set, it can be verified whether the downloaded documents are similar to the recommendations based on the content of the documents. If a recommendation has a hit with one of the user's downloads, this recommendation is considered accurate.
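A sketch of such a validation query follows. The index layout and field names here are assumptions, not the paper's actual configuration; only the `more_like_this` and `ids` query types are Elasticsearch built-ins:

```python
def validation_query(recommendation_id, validation_download_ids):
    # Retrieve documents whose content is similar to the recommended
    # document, restricted to the user's downloads present in the
    # validation set; any hit marks the recommendation as accurate.
    return {
        "query": {
            "bool": {
                "must": {
                    "more_like_this": {
                        "fields": ["title", "body"],   # assumed field names
                        "like": [{"_id": recommendation_id}],
                        "min_term_freq": 1,
                    }
                },
                "filter": {"ids": {"values": validation_download_ids}},
            }
        }
    }

q = validation_query("doc42", ["doc7", "doc9"])
```

The `filter` clause does the "filter out all documents except the user's validation downloads" step, while the `more_like_this` clause does the tf-idf content comparison.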

Evaluation Measure

The correctness of the TCF does not lie only in whether or not a user has downloaded certain content. It is also about which content was downloaded first, or in which order the content was downloaded, as the TCF should be able to generate recommendations in the order a user would need them. For this reason, the correctness is calculated using average precision. Average precision is used to measure the fraction of retrieved documents relevant to the query according to the rank on which the document is placed [24]. The average precision for user q can be calculated using:

AP_q = \sum_{i=1}^{n} p(q, i) \, \Delta r(q, i)    (7)

where i is the rank of the document, n is the number of retrieved documents, p(q, i) is the precision at rank i for user q, and Δr(q, i) is the change in recall from rank i − 1 to i for user q. The resulting percentage shows the accuracy of the recommendations for a specific user; it is important to note that this does not reflect a real-life scenario. A user could decide he/she needs a summary for a certain course but not for another, while similar users did need a summary for that course. As the TCF predicts which document a user needs at this particular moment, and as the system does not know the user has ‘skipped’ a course, this could result in a recommendation for a document that is not present in the validation set. This results in misses which might not actually be misses. In a real-life test, the user might see the recommendation and decide they do need the document now that they know it exists. Although the offline tests can give an indication of the performance of the system, they are not an appropriate user test. For this reason, the online experiment results are very important in determining the actual performance of the system.
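Since recall only changes, by 1/|relevant|, at ranks where a hit occurs, equation (7) reduces to the familiar form below (a sketch with hypothetical document ids):

```python
def average_precision(ranked, relevant):
    # Eq. (7): sum over ranks of precision@i times the change in recall;
    # recall changes by 1/len(relevant) exactly at the ranks with a hit.
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

# hits at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision(["d5", "d7", "d10", "d9"], {"d5", "d10"})
```

Placing the relevant documents higher in the ranking raises the precision terms at the hit ranks, which is why this measure rewards the ordering the TCF is meant to improve.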

6 https://www.elastic.co/
7 https://www.elastic.co/guide/en/elasticsearch/reference/


Similarity measure          Neighbourhood distance  CF
CityBlock                   200                     33.6
EuclideanDistance           200                     47.8
Loglikelihood               200                     47.2
UncenteredCosineSimilarity  200                     47.8

Figure 4. Test on 80,075 users to see which similarity measures perform best on the full data set. The two best similarity measures, Euclidean distance and Uncentered Cosine similarity, are chosen to be used in a comparison between CF and TCF.

Similarity measure          Neighbourhood distance  CF    TCF average path  TCF shortest path
EuclideanDistance           200                     47.8  47.0              47.1
EuclideanDistance           20                      48.4  47.7              48.0
UncenteredCosineSimilarity  200                     47.8  47.0              47.1
UncenteredCosineSimilarity  20                      48.2  47.6              47.8

Figure 5. Test on 80,075 users to compare CF, TCF with shortest path length, and TCF with average path length, using the two similarity measures chosen previously and two different neighbourhood distances. The results show the best performance by the Euclidean distance with neighbourhood distance k = 20.

Optimisation and Parameter Selection

One of the goals of the offline tests is to optimise the systems as well as to select the appropriate parameters. The results of these tests can be seen in the appendix of this paper; this section only gives the relevant findings and explains the approach.

The used implementation of the CF system allows for a variety of user similarities8 as well as an option to set the neighbourhood distance or threshold9. To find an appropriate combination of parameters, the first experiment was conducted on a small data set (1,000 users), testing many different combinations of parameters. Combinations scoring above fifty percent accuracy were used in an experiment on the full data set. This included four similarity measures, as can be seen in figure 4. The two best-scoring similarity measures are combined with two neighbourhood distances and used to compare the CF to the TCF.

The TCF was also subject to some parameter selection, as mentioned in the method section of this paper. Two distance measures were tested: the average path length and the shortest path length. For the average path length, a cutoff has to be determined to ensure the system is fast enough to be used in a production environment. Different cutoff lengths were tested to see which one would perform best. One more experiment was conducted to test whether generating more recommendations would yield a higher accuracy when reordering the recommendations using the TCF; however, this was not the case. When generating more than twenty recommendations, the accuracy of both the CF and the TCF decreased.

The combination of the above-mentioned parameters created a final experiment including two similarity measures, two neighbourhood distances, the Collaborative Filtering, the Temporal Collaborative Filtering using the average path length, and the Temporal Collaborative Filtering using the shortest path length, the results of which can be seen in figure 5. The results show a greater accuracy when using the shortest path length as opposed to the average path length between two vertices. The best result was found using the Euclidean distance measure (2), a neighbourhood distance of k = 20, and the CF. The TCF using the shortest path length yielded an accuracy close to that of the CF. For this reason, the parameters mentioned above were selected for the online test, comparing the CF to the TCF using the shortest path length.

8 https://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.5.0/mahout-core/org/apache/mahout/cf/taste/similarity/UserSimilarity.html
9 http://archive-primary.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.3.2/mahout-core/org/apache/mahout/cf/taste/neighborhood/UserNeighborhood.html

Figure 6. A user's dashboard showing the generated recommendations on the left side of the screen, highlighted in orange.

Online Experiments

The online experiments are conducted to gather actual user data. As mentioned, a few reasons create the need for an online experiment with actual users, the most important being the fact that users might not know content exists and thus not download it, as they can not find it or don’t search for it. This might cause any offline experiments to generate false accuracies. This is enhanced by the fact that only one data set can be used in the offline experiments, as the temporal structure can not be lost. The online experiment’s goal is to determine whether or not the TCF improves the order of the


Figure 7. Click distribution for CF and TCF. Showing a big drop after the first recommendation and a small uptick near the end for CF. TCF shows a more gradual drop in clicks, with a small variation around rank four.

Figure 8. Cumulative click distribution for CF and TCF, showing more clicks in the top segment for TCF compared to CF.

recommendations compared to the CF, using the parameters mentioned in the previous section. To do this, an A/B test is conducted, with the baseline (A) being the CF and the experiment (B) being the TCF [10].
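The assignment of users to the A and B variants can be sketched as below. The source does not describe the bucketing mechanism, so this hash-based split is purely an assumption for illustration; hashing the user id makes the assignment deterministic across sessions and roughly 50/50 over the population, which is what an A/B test of this kind needs.

```python
import hashlib

def ab_variant(user_id: str) -> str:
    """Deterministically assign a user to the CF baseline (A) or the
    TCF experiment (B). Same user id always yields the same variant."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

A user who refreshes the dashboard is therefore always served by the same recommender, so their clicks are never split across both variants.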

Experimental Setup

The A/B test is set up using the system as described in this paper. On the platform side, the existing method is replaced by an API call to the CF or TCF system, depending on the test version assigned to a user. The returned recommendations are parsed by the platform and shown on the user's dashboard, as can be seen in figure 6. The recommendations are also checked for status and language, such that an inactive document, or a document that is neither in the language of the user nor in English, is removed from the recommendations. Some fail-safes are built in: if the recommender fails to provide recommendations, or if no documents remain after filtering out inactive or wrong-language documents, the platform falls back to its old system, making sure users are always presented with content.
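The filtering and fallback logic described above can be sketched as follows. The document schema (`status` and `language` fields) and the function name are illustrative assumptions; the actual platform code is not part of the thesis text.

```python
def present_recommendations(recs, fallback, user_language, max_items=10):
    """Drop inactive documents and documents not in the user's language
    or English; fall back to the old system's list if nothing survives.
    `recs` and `fallback` are lists of dicts with hypothetical
    'status' and 'language' fields."""
    allowed = {user_language, "en"}
    kept = [d for d in recs
            if d.get("status") == "active" and d.get("language") in allowed]
    # Fail-safe: never show an empty dashboard widget.
    return kept[:max_items] if kept else fallback[:max_items]
```

Filtering after generation (rather than inside the recommender) keeps the CF/TCF systems identical for both test variants, so the A/B comparison only measures the ordering.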

The two main aspects tested during this experiment are the conversion rates and the distribution of clicks on recommended items. The conversion rates are an obvious indication of accuracy for the generated recommendations; nonetheless, for this research the conversion is less relevant.

The conversion rates of the two tested systems should be nearly identical, as the recommendations shown are the same; the only difference is the order in which they are shown. A far more important aspect to test is the distribution of clicks on the recommended items. The expectation is that the CF will show a more erratic, seemingly random click distribution, while the TCF shows a more logical click distribution, starting high at the first recommendation and gradually decreasing towards the last. Another expectation is that the TCF will have more clicks in the top segment of the recommendations compared to the CF.

Results

The experiment was run for thirteen days; the daily views and clicks can be found in the appendix of this paper. Only users that were logged in could see the recommendations, resulting in a total of 22,569 views of the page on which the recommendations were shown. 11,317 of these views were of version A (CF), while 11,252 were of version B (TCF).

In total, 1,538 clicks on recommended items were registered, of which 770 were on recommendations generated by CF and 768 on recommendations generated by TCF. This results in a conversion rate of 6.80 percent for CF versus 6.83 percent for TCF. As mentioned, the expectation was that these conversion rates would be very similar; the results confirm this.

Using the platform's old system, users proceeded to buy a recommendation after clicking it 35.85 percent of the time. Using the CF, users proceeded to checkout 40.61 percent of the time, while with TCF 39.03 percent bought the recommendation. Looking at the last week of data, it is surprising to see that the share of users proceeding to buy a recommendation using CF dropped to 26.69 percent, while for TCF it rose to 41.52 percent. A reason for this could be the increase in users on the platform, especially of returning users. As exam week commenced at some Dutch universities, twice the views of the week before were recorded. This could indicate that TCF provides better results for users returning in an exam week. To confirm this, however, the same test has to be run when another exam week commences.

Next we compare the distribution of clicks based on the rank at which a recommendation was placed, the results of which can be seen in figure 9. Figure 7 shows the distribution of clicks for users presented with the CF or TCF recommendations. The click distribution belonging to the CF has an initially steeper descent compared to that of TCF. This shows that users found more use in the top documents when presented with recommendations generated by TCF than with those generated by CF. Another noticeable aspect is the seemingly random distribution towards the end of the CF click distribution, while the TCF click distribution shows a cleaner, gradual descent. Figure 8 also shows that the TCF has more clicks in the top segment of the recommendations compared to the CF, which could indicate an appropriate reordering of recommendations. The results show that it might be possible to improve recommendations by adding temporal specificity.
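The per-rank and cumulative click distributions used in figures 7 and 8 reduce to simple normalisation. A minimal sketch, with function name and list representation as assumptions:

```python
def click_distribution(clicks_per_rank):
    """Turn raw click counts per rank into a percentage distribution
    and its cumulative version (as plotted in figures 7 and 8)."""
    total = sum(clicks_per_rank)
    pct = [100.0 * c / total for c in clicks_per_rank]
    cumulative = []
    running = 0.0
    for p in pct:
        running += p
        cumulative.append(running)
    return pct, cumulative
```

Feeding in the CF counts from the appendix (313, 138, ... clicks over ranks 1-10, 770 in total) reproduces the 40.6% / 17.9% figures reported for the first two ranks.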


Rank   1      2      3      4      5      6      7      8      9      10
CF     40.6%  17.9%  10.0%  8.4%   6.4%   4.5%   4.9%   3.1%   1.7%   2.3%
TCF    39.6%  20.3%  11.8%  9.2%   4.0%   3.5%   3.6%   3.1%   2.9%   1.8%

Figure 9. Click distribution per rank for recommendations generated by CF and TCF during the online A/B test.

CONCLUSION AND DISCUSSION

In this paper we have introduced the notion of temporal specificity within recommendations. We have shown how Collaborative Filtering can be extended to Temporal Collaborative Filtering with the use of a directed graph representing transactional data and by determining the distance between a user's last download and his/her recommendations. We have evaluated a first method of TCF, and the results show a better click distribution and more clicks in the top segment of the recommendations compared to CF. This shows that adding temporal specificity to recommendations is possible and might improve the recommendations shown to a user. There are some points of concern which should be taken into consideration when using the proposed method; these are discussed below.

Both the CF and the TCF system, as used in the experiments described in this paper, were not optimised on aspects other than those mentioned. This means that no attention was paid to known disadvantages of Collaborative Filtering systems, such as the cold start problem, content-awareness, lack of diversity, etc. [19]. The results obtained from the conducted experiments could therefore be considered to have a lower accuracy than is possible when these shortcomings of CF systems are addressed. This is especially true for the conversion and the number of bought recommendations after a recommendation click. What is more, aspects like content-awareness and lack of diversity make for an interesting reordering of recommendations based on a temporal aspect, as more diversity between recommendations exists.

As with CF, TCF has limitations that are to be considered. To be able to use the temporal graph used in TCF, it is imperative to have a large amount of transactional data available. If transactional data is sparse, the graph would be sparse as well, which could pose a problem when finding distances between recommendations. Consider a document which has been downloaded by a few users, while each of these users has only had one transaction within the platform. The CF could recommend this document, since the transactions of those users could include other documents as well. Despite this, the temporal graph would not be able to find any distances to this document, as the users who downloaded the document have only had one transaction, thus no edges were created between the document and other documents.
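The sparsity problem described above shows up concretely as documents with no incoming or outgoing edges in the temporal graph. A minimal sketch of how such unreachable documents can be handled (the function name and the precomputed distance map are illustrative assumptions): they receive an infinite distance and, because Python's sort is stable, simply keep their original CF rank.

```python
import math

def reorder_with_sparse_graph(cf_recs, distances):
    """`distances` maps document -> path length from the user's last
    download. A document absent from the temporal graph (e.g. only
    downloaded by single-transaction users) gets infinity and keeps
    its CF rank, since Python's sort is stable."""
    return sorted(cf_recs, key=lambda d: distances.get(d, math.inf))
```

So a CF list [d3, d2, d4] with known distances only for d2 (1) and d4 (2) reorders to [d2, d4, d3]: the graph-isolated d3 is not dropped, merely demoted behind the documents the graph can say something about.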

The location on the platform where the recommendations were shown might not be ideally suited for that purpose. Users who come to the platform generally already know what they are looking for. Even though the TCF should propose items

that they are currently looking for, a user might overlook the recommendations and go straight to the search area of the platform. The recommendations generated by the TCF might gain more conversion for the platform when used in an email to users, providing them with recommendations when they are not already looking for specific content. This could be especially useful for the retention of users.

Many interesting aspects of the stated problem still exist and many other approaches are worth considering. Future research could look into a different recommendation generation approach to be used before reordering the recommendations based on the temporal graph described in this paper. As the shortcomings of CF systems were not considered during this research, an interesting path for future research is the combination of other recommendation approaches. It would be very interesting to find out if using a hybrid recommender system (i.e. a combination of CF and Content-based Filtering) would increase the accuracy when reordering recommendations based on a temporal aspect. Another interesting addition to the TCF system could be a determination of the user's last activity, which might be able to determine which content the user has 'skipped', providing a user with more accurate recommendations.

An addition to the TCF could be the use of a weighted graph instead of an unweighted graph. Both of these have advantages and disadvantages. An unweighted graph is easier to use and update in such a system than a weighted graph would be. Besides that, using a weighted graph requires a more difficult approach to finding distances between recommendations, as the weights have to be considered as well as the path length between recommendations. Future research can determine if using a weighted graph can improve the accuracy of the TCF approach.
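A weighted variant of the distance computation, as suggested above, could use Dijkstra's algorithm instead of plain BFS. The sketch below is an assumption about how such a variant might look (the `{node: [(neighbour, weight), ...]}` representation is illustrative); it shows the extra machinery a weighted graph requires compared to the unweighted case.

```python
import heapq

def dijkstra_distance(graph, source, target):
    """Shortest weighted distance in a directed graph given as
    {node: [(neighbour, weight), ...]}; inf if target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, ()):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")
```

Note that edge weights can change the ordering relative to the unweighted graph: a two-edge path of low total weight may beat a single heavy edge, which is exactly the behaviour a weighted TCF would need to evaluate.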

It might be possible to use the temporal graph as a stand-alone recommender system, which includes temporal specificity. A user's last download could be used to find the k items closest to that download, using only the temporal graph. This would generate recommendations that are closest to the user's last download; however, it only takes one download into consideration. Multiple downloads could also be considered, which might provide more accurate results, though this does pose a more difficult distance calculation.
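The stand-alone variant suggested above could be sketched as a breadth-first expansion from the last download, taking the first k distinct documents reached as recommendations. This is a hypothetical sketch of the proposal, not an evaluated system; the function name and graph representation are assumptions.

```python
from collections import deque

def k_closest(graph, last_download, k=5):
    """Breadth-first search from the user's last download; the first k
    distinct documents reached, in discovery order, are the recommendations."""
    seen = {last_download}
    queue = deque([last_download])
    result = []
    while queue and len(result) < k:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                result.append(nxt)
                queue.append(nxt)
                if len(result) == k:
                    break
    return result
```

BFS order guarantees that documents one transaction away are recommended before documents two transactions away, which is exactly the temporal-proximity ranking the proposal describes.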

ACKNOWLEDGEMENTS

The authors would like to thank Thomas Mensink and Hugo Kuijzer for their invaluable guidance, feedback, and supervision of this study, as well as Frank Nack for his supervision.


REFERENCES

1. Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE transactions on knowledge and data engineering 17, 6 (2005), 734–749.

2. Saikat Bagchi. 2015. Performance and quality assessment of similarity measures in collaborative filtering using mahout. Procedia Computer Science 50 (2015), 229–234.

3. Robert M Bell, Yehuda Koren, and Chris Volinsky. 2008. The bellkor 2008 solution to the netflix prize. Statistics Research Department at AT&T Research (2008).

4. Robin Burke. 2007. Hybrid web recommender systems. In The adaptive web. Springer, 377–408.

5. Per-Erik Danielsson. 1980. Euclidean distance mapping. Computer Graphics and image processing 14, 3 (1980), 227–248.

6. Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational linguistics 19, 1 (1993), 61–74.

7. Michael D Ekstrand, John T Riedl, Joseph A Konstan, and others. 2011. Collaborative filtering recommender systems. Foundations and Trends® in Human–Computer Interaction 4, 2 (2011), 81–173.

8. Mustansar Ali Ghazanfar and Adam Prugel-Bennett. 2010. A scalable, accurate hybrid recommender system. In Knowledge Discovery and Data Mining, 2010. WKDD’10. Third International Conference on. IEEE, 94–98.

9. FO Isinkaye, YO Folajimi, and BA Ojokoh. 2015. Recommendation systems: Principles, methods and evaluation. Egyptian Informatics Journal 16, 3 (2015), 261–273.

10. Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the web: survey and practical guide. Data mining and knowledge discovery 18, 1 (2009), 140–181.

11. Daniel T Larose. 2005. K-nearest neighbor algorithm. Discovering Knowledge in Data: An Introduction to Data Mining (2005), 90–106.

12. Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender systems handbook. Springer, 73–105.

13. Rada Mihalcea, Courtney Corley, Carlo Strapparava, and others. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, Vol. 6. 775–780.

14. Sean Owen. 2012. Mahout in action. (2012).

15. Michael Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. The adaptive web (2007), 325–341.

16. Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Recommender systems handbook. Springer.

17. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295.

18. JHJB Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. The adaptive web (2007), 291–324.

19. Xiaoyuan Su and Taghi M Khoshgoftaar. 2009. A survey of collaborative filtering techniques. Advances in artificial intelligence 2009 (2009), 4.

20. Jigang Wang, Predrag Neskovic, and Leon N Cooper. 2007. Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters 28, 2 (2007), 207–213.

21. Yongji Wang, Xiaofeng Liao, Hu Wu, and Jingzheng Wu. 2012. Incremental collaborative filtering considering temporal effects. arXiv preprint arXiv:1203.5415 (2012).

22. Hui Xiong, Gaurav Pandey, Michael Steinbach, and Vipin Kumar. 2006. Enhancing data analysis with noise removal. IEEE Transactions on Knowledge and Data Engineering 18, 3 (2006), 304–319.

23. Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff Schneider, and Jaime G Carbonell. 2010. Temporal collaborative filtering with bayesian probabilistic tensor factorization. In Proceedings of the 2010 SIAM International Conference on Data Mining. SIAM, 211–222.

24. Mu Zhu. 2004. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo 2 (2004), 30.


APPENDIX

Offline Experiment Results

This section shows the results of the offline experiments described in this paper, in the order in which the experiments were conducted. First, an often-used similarity measure is taken to determine which neighbourhood distance is appropriate to test the different similarity measures with. Once this is determined, all available similarity measures are tested to see which has the highest accuracy; any similarity measure scoring above fifty percent is used in a subsequent test. The highest-scoring similarity measure is once again used in a test to see which neighbourhood distance is best suited for it. The resulting neighbourhood distance is used to compare the four remaining similarity measures, finding two with a similar accuracy. The decrease in accuracy is due to the fact that the full test set is used. The two measures with similar accuracy are tested with a lower neighbourhood distance, as well as with different amounts of recommendations. This results in two neighbourhood distances, and two amounts of recommendations, that are close together in accuracy. The two similarity measures and two neighbourhood distances are then used to compare CF with TCF, generating ten recommendations, to determine which similarity measure and neighbourhood distance are best suited for the online experiment.
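The staged parameter sweep described above boils down to an exhaustive grid over similarity measures and neighbourhood distances. A minimal sketch of one such stage; the `evaluate` callback standing in for a full offline accuracy run is a hypothetical placeholder, not part of the thesis code:

```python
from itertools import product

def grid_search(evaluate, similarities, neighbourhoods):
    """Sweep all (similarity, neighbourhood) pairs; `evaluate` is a
    hypothetical callback returning the offline CF accuracy for one
    configuration. Returns (best_score, similarity, neighbourhood)."""
    best = None
    for sim, n in product(similarities, neighbourhoods):
        score = evaluate(sim, n)
        if best is None or score > best[0]:
            best = (score, sim, n)
    return best
```

Applied to the Loglikelihood numbers in table 1 (41.3 at n = 5, 54.6 at n = 40), this would select n = 40, matching the table's conclusion.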

Similarity measure Neighbourhood distance CF

Loglikelihood    5    41.3
Loglikelihood   30    53.0
Loglikelihood   40    54.6
Loglikelihood   50    54.0
Loglikelihood  100    54.0
Loglikelihood  200    54.0

Table 1. Test on 1000 users to see which neighbourhood distance is the, most likely, best to test different similarity measures with. Best result is found in n = 40.

Similarity measure Neighbourhood distance CF

Loglikelihood                40    54.6
CityBlock                    40    61.6
EuclideanDistance            40    52.0
PearsonCorrelation           40     0.0
SpearmanCorrelation          40    49.1
TanimotoCoefficient          40    44.8
UncenteredCosineSimilarity   40    54.1

Table 2. Test on 1000 users to see which similarity measures perform best on the data set. Any measure scoring above fifty percent is used in an experiment on the full data set.


Similarity measure   Neighbourhood distance   CF

CityBlock   5                  10.4
CityBlock   30                 37.5
CityBlock   100                60.9
CityBlock   150                61.9
CityBlock   200                63.0
CityBlock   250                41.0
CityBlock   300                40.8
CityBlock   500                40.5
CityBlock   1000               32.6
CityBlock   threshold = 2.0     0.0
CityBlock   threshold = 1.0     0.0
CityBlock   threshold = 0.8     0.0
CityBlock   threshold = 0.2     0.0
CityBlock   threshold = 0.09   20.7
CityBlock   threshold = 0.07   35.7
CityBlock   threshold = 0.06   39.7

Table 3. Test on 1000 users to see if a neighbourhood threshold performs better than a neighbourhood and which neighbourhood performs best for this similarity measure. Resulting in a distance of n = 200 as the best result, showing that a threshold is not the best method. In the output of the experiment it could be seen that the threshold worked well for some users, but worked very poorly for other users, as they did not meet the threshold. This resulted in users not getting any recommendations, because of this, the use of a threshold is discarded.

Similarity measure Neighbourhood distance CF

CityBlock 200 33.6

EuclideanDistance 200 47.8

Loglikelihood 200 47.2

UncenteredCosineSimilarity 200 47.8

Table 4. Test on 80,075 users to see which similarity measures perform best on the full data set. The two best similarity measures: Euclidean distance and Uncentered Cosine similarity, are chosen to be used in a comparison between CF and TCF.

Similarity measure Recommendations Neighbourhood distance CF TCF average path TCF shortest path

EuclideanDistance            5   200   59.5   59.5   59.6
EuclideanDistance            5    20   59.6   59.6   59.8
UncenteredCosineSimilarity   5   200   59.7   59.8   59.8
UncenteredCosineSimilarity   5    20   58.9   59.1   59.2
EuclideanDistance           10   200   55.2   54.8   54.8
EuclideanDistance           10    20   59.6   59.6   59.8
UncenteredCosineSimilarity  10   200   55.2   55.0   54.9
UncenteredCosineSimilarity  10    20   55.6   56.0   56.1
EuclideanDistance           20   200   52.0   51.3   51.6
EuclideanDistance           20    20   52.8   52.4   52.8
UncenteredCosineSimilarity  20   200   52.1   51.5   51.8
UncenteredCosineSimilarity  20    20   52.5   52.2   52.7

Table 5. Test on 1000 users to see if the amount of recommendations generated matters for accuracy and test the difference between two neighbourhood distances. The results show, when generating 5 or 10 recommendations, the same accuracy is obtained, however, when generating 20 recommendations accuracy is lost. It can also be seen that the neighbourhood distance of n = 200 and n = 20 result in closely related accuracies, thus they are both tested on the full data set.

Similarity measure Neighbourhood distance CF TCF average path TCF shortest path

EuclideanDistance 200 47.8 47.0 47.1

EuclideanDistance 20 48.4 47.7 48.0

UncenteredCosineSimilarity 200 47.8 47.0 47.1

UncenteredCosineSimilarity 20 48.2 47.6 47.8

Table 6. Test on 80,075 users to compare CF, TCF with shortest path length, and TCF with average path length, using the two similarity measures chosen previously, two different neighbourhood distances, while generating ten recommendations. The results show the best performance by the Euclidean distance with neighbourhood distance n=20.


Online Experiment Results

This section shows the results of the online experiments described in this paper. The clicks and views are displayed according to the days on which they were recorded.

Date          1      2      3      4      5      6      7      8      9      10     Total
6/9/2017      24     15     8      6      4      3      4      2      1      1      68
6/10/2017     21     6      9      3      3      2      5      1      1      0      51
6/11/2017     34     15     7      6      5      4      2      2      1      1      77
6/12/2017     30     8      11     4      2      3      2      1      4      5      70
6/13/2017     13     7      4      4      3      2      0      1      1      1      36
6/14/2017     19     9      7      8      2      1      6      1      0      2      55
6/15/2017     12     6      2      2      3      2      0      0      0      2      29
6/16/2017     20     11     1      3      3      3      2      1      0      0      44
6/17/2017     11     5      4      1      0      0      0      2      0      0      23
6/18/2017     28     8      6      1      5      1      4      1      1      0      55
6/19/2017     31     16     7      5      10     7      6      4      3      1      90
6/20/2017     36     21     5      15     6      5      5      3      1      3      100
6/21/2017     34     11     6      7      3      2      2      5      0      2      72
Total         313    138    77     65     49     35     38     24     13     18     770
Percentage    40.6%  17.9%  10.0%  8.4%   6.4%   4.5%   4.9%   3.1%   1.7%   2.3%

Table 7. Measured clicks on recommendations generated by CF, including the click percentage per rank.

Date          1      2      3      4      5      6      7      8      9      10     Total
6/9/2017      25     17     8      2      4      1      2      5      1      0      65
6/10/2017     24     11     9      6      6      3      4      2      2      3      70
6/11/2017     27     11     12     11     2      7      3      2      0      0      75
6/12/2017     31     22     13     7      3      3      2      4      1      1      87
6/13/2017     19     8      5      1      1      2      2      0      0      0      38
6/14/2017     14     5      5      7      2      1      0      0      1      1      36
6/15/2017     15     5      4      2      1      3      1      1      2      0      34
6/16/2017     17     7      1      5      1      0      0      0      1      0      32
6/17/2017     19     7      1      5      2      0      4      2      1      1      42
6/18/2017     12     4      4      2      0      0      0      4      1      0      27
6/19/2017     24     23     14     9      1      2      1      0      3      1      78
6/20/2017     45     20     8      9      7      5      7      1      3      1      106
6/21/2017     32     16     7      5      1      0      2      3      6      6      78
Total         304    156    91     71     31     27     28     24     22     14     768
Percentage    39.6%  20.3%  11.8%  9.2%   4.0%   3.5%   3.6%   3.1%   2.9%   1.8%

Table 8. Measured clicks on recommendations generated by TCF, including the click percentage per rank.


Date            Views   Conversion
6/9/2017        1031    5.3%
6/10/2017       809     8.4%
6/11/2017       1128    4.5%
6/12/2017       1168    6.6%
6/13/2017       554     12.6%
6/14/2017       681     5.3%
6/15/2017       619     8.9%
6/16/2017       513     5.7%
6/17/2017       487     9.0%
6/18/2017       580     4.0%
6/19/2017       1205    4.6%
6/20/2017       1346    6.7%
6/21/2017       1196    8.4%
Total/average   11317   6.9%

Table 9. Views of the page on which recommendations generated by CF were shown, including the conversion rate per day, total views, and average conversion rate.

Date            Views   Conversion
6/9/2017        984     6.6%
6/10/2017       811     8.6%
6/11/2017       1048    7.2%
6/12/2017       1230    7.1%
6/13/2017       553     6.9%
6/14/2017       671     5.4%
6/15/2017       628     5.4%
6/16/2017       505     6.3%
6/17/2017       499     8.4%
6/18/2017       497     5.4%
6/19/2017       1249    6.2%
6/20/2017       1361    7.8%
6/21/2017       1216    6.4%
Total/average   11252   6.8%

Table 10. Views of the page on which recommendations generated by TCF were shown, including the conversion rate per day, total views, and average conversion rate.
