Pairwise and sequence-based skip predictions on a music video platform


Submitted in partial fulfillment for the degree of

Master of Science

Priyanka Nanayakkara

12151173

Master Information Studies

Data Science

Faculty of Science

University of Amsterdam

June 28, 2019

Academic Supervisor: Dr. John Ashley Burgoyne (UvA)
First Industry Supervisor: Dr. Alessandro Pagliero (XITE)
Second Industry Supervisor: Dr. Bouke Huurnink (XITE)



Priyanka Nanayakkara

University of Amsterdam Amsterdam, The Netherlands

Abstract

Music streaming platforms often allow users to skip tracks to personalize their music-listening experiences. Skipping behavior can indicate a user’s listening preferences, and predicting such behavior presents an opportunity to improve the user’s experience (for instance, by recommending a new playlist at an appropriate time). We model whether the current video will be skipped on a music video platform using pairwise modeling approaches; in other words, we consider features of only the previous video and current video for the prediction task. We find that whether the previous video was skipped is the strongest predictor of whether the current video will be skipped. Next, we attempt a sequential modeling approach by including features of all videos in the session prior to the current video and find that such sequential information improves prediction, but involves significantly more computational expense than only including features of the immediately preceding video in the model. Based on these findings, we recommend further focus on pairwise modeling approaches to improve skip predictions in the music video domain.

Keywords skip prediction, machine learning, music videos, music streaming

1 Introduction

Feedback from users on music streaming platforms is invaluable to improving user experience. Track skips, and skipping behavior in general, can provide insights into how a user interacts with a service or segment of a service (e.g. a playlist). However, while “like” and “favorite” features serve as explicitly positive signals about a given track, skips are more ambiguous; a user may skip a track because she does not like the track, or perhaps because the track is one of her all-time favorites, but she is looking for new music during a particular listening session. In the latter case, never presenting the skipped track to the user could actually diminish her user experience in the long run, when she switches out of an exploratory mode of consumption. Thus, skipping behavior is complex, and requires further investigation and attention. We can begin this investigation by predicting when a user will skip a video. By predicting skip behavior, we can also investigate factors that correlate with users skipping videos (for instance, does a drastic change in genre or tempo from one video to another correspond to a higher likelihood of a user skipping the current video?) to better understand what motivates a skip. Ultimately, skip prediction will help improve recommendations to the user; for example, if we know that a user is likely to skip a video, we can suggest a more appropriate playlist or re-rank recommended videos taking into account predicted skipping behavior.

For this research project, we predict skips on an interactive television channel by XITE. The service allows users to watch and explore music videos on demand (see Figure 1). Specifically, we focus on the Dutch version of the service, which does not impose a skipping limit. In addition to viewing curated playlists (e.g. Pride Hits, Trending, Hits Now, Indie Essentials), users can search and favorite videos, and create personalized “mixer” channels (the user interacts with the service using a remote control). To create a personalized mixer channel, a user makes selections from four categories (genre, era, style, and mood), as shown in Figure 2.

Figure 1. XITE’s service allows users to watch on-demand music videos on their televisions. This example screen shows how users can watch various music video “channels” which are hand-curated compilations of music videos in various categories.

This research is particularly relevant considering the recent Spotify Sequential Skip Prediction Challenge [21], held in early 2019 by CrowdAI, Spotify, and WSDM. The goal was to predict skipping behavior for entire second halves of Spotify user listening sessions, using information from session logs and acoustic features. We mimic the setup of this challenge, and make appropriate modifications to suit XITE’s data and needs.

In order to understand skipping behavior, we explore the following research questions:


Figure 2. XITE’s Mixer feature allows users to create channels based on genre, era, style, and mood selections.

RQ1 How well can we predict skipping behavior of the current video using information from the current video and previous video?

RQ2 Which features are important in predicting skip behavior?

RQ3 Does including sequential information in a skip model improve performance?

The paper is structured as follows. We present the related work on which our research relies. We then describe our data preparation and exploratory data analysis processes. Next, we describe two pairwise skip prediction models and a third model framework intended to test the importance of sequential information in predicting skips. Finally, we present our results and end with a discussion on major takeaways, including which features are important in predicting skips and ideas for further model improvements.

2 Related Work

This project relies on existing machine learning methods as well as their recent applications in music-streaming contexts. In addition, we borrow ideas from the channel zapping domain, a traditional television-viewing paradigm parallel to our music video skipping paradigm.

2.1 Modeling

Logistic Regression. Logistic regression is a method of modeling a boolean variable in the following way:

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \tag{1}$$

where π (x) is the conditional probability that the outcome is 1 given a certain x value (the data) [12]. Beta values are obtained through maximum likelihood estimation, a process whereby parameter values that maximize the probability of observed data are found [12]. The method can be extended to include more than one covariate.
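As a minimal sketch of Equation 1, the snippet below evaluates the logistic function for a single covariate. The coefficient values are hypothetical and chosen only to illustrate the shape of the curve; they are not estimates from our data.

```python
import math


def skip_probability(x, beta0, beta1):
    """Logistic function of Equation 1 for a single covariate x."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))


# Hypothetical coefficients (assumption, for illustration only).
beta0, beta1 = -1.0, 2.0
p_off = skip_probability(0, beta0, beta1)  # covariate set to 0
p_on = skip_probability(1, beta0, beta1)   # covariate set to 1
print(f"pi(0) = {p_off:.3f}, pi(1) = {p_on:.3f}")
```

With a positive β1, switching the covariate from 0 to 1 raises the predicted probability, which is the behavior the coefficient signs in a fitted model would describe.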

Gradient Boosting Trees. Gradient boosting tree models can be used both in regression and classification tasks. They are built iteratively, by continually adding “weak” learners based on the negative gradient of a differentiable loss function [8]. (A weak learner is one “whose accuracy is slightly better than random guessing” [13].) In recent years, gradient tree boosting methods have been applied in various domains, from ranking problems to ad click-through rate prediction [6].

Recurrent Neural Networks. A recurrent neural network (RNN) is a type of artificial neural network which passes information from one state to the next, enabling a model to take into account earlier computations [9]. Long Short-Term Memory (LSTM) networks [11] are a type of RNN used for sequence-based prediction and recommendation tasks; they are especially successful at these tasks because “[r]emembering information for long periods of time is practically their default behavior” [17]. LSTMs are able to do this by updating information in the cell state through the use of various gates, which determine how much previous information to add to or remove from the cell state [17]. In particular, LSTMs have been applied to tasks such as time series prediction (predicting stock returns [5], extreme event forecasting [15]) and language modeling [23].

2.2 Findings from Spotify

Spotify Sequential Skip Prediction Challenge. As previously mentioned, the goal of the Spotify challenge was to predict skipping behavior for entire second halves of Spotify user sessions. Participants were provided with The Music Streaming Sessions Dataset [2]. The dataset included track information from session logs (such as the position in the session) as well as acoustic features (such as energy level) [22]. For the purposes of the challenge, the public was provided with a training dataset with full sessions, as well as a test dataset with full information for first halves of sessions and partial information for second halves of sessions [4]. Additionally, only sessions between 10 and 20 tracks were included in the challenge’s dataset [25].

The first-place team employed RNNs, and predicted not only skipping behavior, but also additional features about each track (for instance, whether there was a short pause before a play) to add additional information to the model [27]. LSTMs were also popular among high-ranking teams; for instance, the second-place team [10] and ninth-place team [25] utilized LSTMs. In addition, teams utilized gradient boosting trees (e.g., the ninth-place team and fourteenth-place team [7]). Because the goal of this challenge was to predict skipping behavior for multiple songs in a row, mean average accuracy was used as a measure of model performance. For the top five teams, these values ranged from 61.3% to 65.1% [21]. However, prediction accuracy of the first track in the second half of sessions is more relevant to the goals of this research project; first prediction accuracies for the top five teams ranged from 79.4% to 81.2% [21].

Spotify User Behavior. Additional research into general Spotify user behavior has found that users exhibit various listening patterns. For instance, users tend to have specific times of day at which they use the service [26]. Additionally, “users have a strong ‘inertia’ to continue successive sessions on both the most frequently used desktop and mobile devices” [26]. These findings provide insights into factors which may predict whether a user will skip a track. For example, if users interact with Spotify differently according to the hour of day, including hour of the track’s play in a model may help predict skipping behavior.

on a music video platform MSc Thesis, Information Studies, 2019, University of Amsterdam

2.3 Channel Zapping

In a television context, channel zapping occurs when viewers change channels. A body of research explores this phenomenon, including the zapping of ads. Siddarth and Chattopadhyay [20] modeled whether an ad is “zapped” using a binary logit model, and found that zapping was less likely when an ad contained a “brand differentiating message” (though they also note that the effect is small, and applies only to the non-zapping prone segment of households), indicating that content of an ad is related to zapping behavior. An eye-tracking study of ad zapping behavior by Teixeira et al. [24] found that multiple occurrences of a brand in its ad correspond to decreases in zapping, indicating that visual elements of an ad also play a role in zapping behavior.

3 Methods

We begin by creating a dataset for the skip prediction task by using XITE’s session logs and music video metadata. We briefly explore features of the data and perform feature engineering. We then build two pairwise models (gradient boosting trees and logistic regression) for predicting whether the current video will be skipped based on information from both the current video and previous video. We use these models to determine which features have skip predictive power. Next, we test whether utilizing sequential data (information from not only the video prior to the current video, but also from videos from the start of the session up until and including the video immediately prior to the current video) is helpful in improving predictive power of a skip model. We do so by creating a “partial” LSTM model including features of the immediately preceding video and a “full” LSTM model including features of all videos preceding the current video in the session, and comparing the two.

3.1 Data Preparation

Querying. Our data is from XITE’s service in the Netherlands. Specifically, we query session logs from a relatively recent two-month period (October-November 2018) through Apache Cassandra.

Cleaning. Session logs contain records of every interaction with the service, even interactions performed at the company for testing purposes. We remove these records because we are interested in predicting actual user behavior. We also remove what we can assume to be system glitches: tracks that have zero duration and interactions marked as skips or plays that occurred within the first two seconds of the video. Regarding the latter case, it is unlikely that a user can reasonably decide to skip a music video within its first two seconds, and it does not make sense for a full play to have a duration of two seconds or less. Thus, we consider these records to be system errors. These records make up 0.11% of the queried data, so removing them should not significantly alter results.
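The cleaning rules above can be sketched as a simple record filter. The field names (`is_internal_test`, `video_duration`, `seconds_watched`) and the toy log entries are assumptions made for illustration; the actual session log schema is not shown in this paper.

```python
def is_valid_record(record):
    """Return True if a session log record survives the cleaning rules."""
    if record["is_internal_test"]:      # interaction performed at the company
        return False
    if record["video_duration"] == 0:   # zero-duration track
        return False
    if record["seconds_watched"] <= 2:  # skip or play ending within two seconds
        return False
    return True


# Toy log entries with hypothetical field names.
logs = [
    {"is_internal_test": False, "video_duration": 215, "seconds_watched": 215},
    {"is_internal_test": True, "video_duration": 200, "seconds_watched": 30},
    {"is_internal_test": False, "video_duration": 0, "seconds_watched": 0},
    {"is_internal_test": False, "video_duration": 180, "seconds_watched": 1},
    {"is_internal_test": False, "video_duration": 180, "seconds_watched": 45},
]
cleaned = [record for record in logs if is_valid_record(record)]
print(len(cleaned), "of", len(logs), "records kept")
```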

Subsetting. We limit the dataset to sessions of between 10 and 20 videos, in keeping with the Spotify challenge [25]. Before limiting the dataset, the median session length is 15 videos, as noted by the black line in Figure 3. Thus, it makes further sense to limit sessions to between 10 and 20 videos to study user behavior in typical sessions. Rather than removing sessions longer than 20 videos entirely, we truncate them after 20 videos (essentially removing videos with session positions greater than 20) to include more data and to avoid biasing the dataset towards shorter sessions.
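The subsetting rule can be illustrated with a short sketch: sessions shorter than 10 videos are dropped, and longer sessions are truncated after the 20th video. The function name and the list-based session representation are assumptions for illustration.

```python
def subset_session(videos, min_len=10, max_len=20):
    """Drop sessions shorter than min_len; truncate the rest to max_len videos."""
    if len(videos) < min_len:
        return None              # session excluded from the dataset
    return videos[:max_len]      # keep only session positions 1..max_len


short_session = list(range(1, 10))   # 9 videos: excluded
long_session = list(range(1, 26))    # 25 videos: truncated to 20
print(subset_session(short_session))
print(len(subset_session(long_session)))
```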

Merging Session Logs with Music Video Metadata. We merge session logs with music video metadata, thus adding features of videos such as first release year, genre, and energy level to the dataset.

Figure 3. The histogram shows session lengths of XITE user sessions in the Netherlands between October and November 2018. Median session length is 15 videos, shown by the black line.

Defining a Target Variable. We are interested in predicting skipping behavior, and consider a skip to have occurred any time a music video is not played completely. This includes instances where a user leaves the music video for a different channel or for the search feature, for example. Thus, we define the target variable “is_skip” as a boolean variable indicating whether the music video is not played until the end.


Interestingly, the Spotify challenge designates whether “the track was only played briefly” [21] as the target variable; in other words, if a track was skipped late in the song, it would not be considered a skip. Because we are not privy to Spotify’s reasons for using a more limited definition of a skip, and because it is useful for XITE to predict any skipping behavior in general, we adopt a broader skip definition.

Furthermore, we aim to predict skipping behavior of a video somewhere in the middle of a session, to simulate a real-world prediction task. Thus, we only aim to predict the first video in the second half of a session. As seen in Figure 4, in a ten video session, videos one through five comprise the first half of the session while videos six through ten comprise the second half. We are thus interested in predicting whether the sixth video was skipped. Depending on the model, we will adopt different formulations of how much information we have about each half of the session.
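A minimal sketch of how the prediction target could be located within a session follows. It reproduces the ten-video example above; rounding the first half down for sessions with an odd number of videos is an assumption, since the text only describes the even-length case.

```python
def split_session(videos):
    """Return (first-half videos, first video of the second half)."""
    half = len(videos) // 2   # assumption: round down for odd session lengths
    history = videos[:half]
    current = videos[half]
    return history, current


history, current = split_session(list(range(1, 11)))  # session positions 1..10
print(history, current)
```

For a ten-video session this yields positions one through five as history and position six as the video to predict.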

3.2 Exploratory Data Analysis

In order to better understand the dataset, and develop a sense of how potential features will influence models, we conduct a brief exploratory analysis. In particular, we are interested in studying when users skip, how they skip, and how ads are related to skips.

When are users skipping? Users skipped 37.2% of videos, and 35.2% of first videos in the second halves of sessions. Proportions of skips vary by hour; users skipped the highest proportion of videos at hour 5.00¹ (53.0% of videos skipped) and the lowest percentage of videos at hour 2.00 (22.4% of videos skipped). Skip percentages dip between 5.00 and 13.00, as seen in Figure 5. Skip percentages rise again around hour 18.00. Additionally, skip percentages across day of the week are fairly consistent, ranging from 35.9% (Sundays) to 38.4% (Wednesdays).

How are users skipping? In Section 3.1 we defined a skip as whenever a music video is not completely played. Because this includes instances of a user leaving a channel mid-way through a video, for example, not all skips are the same. The vast majority of skips (93.1%) are made by the user clicking the “skip” button to move on to the next video. The second most prevalent method of skipping (comprising 5.60% of skips) is “leave4channel,” meaning the user left the video for another channel. Skipping by means of leaving the video for the search feature (“leave4search”), the filtering feature (“leave4filters”), and favorites (“leave4favorites”), along with liking the video and using the skip button in conjunction (“likeskip”), make up less than 2% of skips.

How are ads related to skips? Ads represent an interesting type of video, in part because users are not allowed to use the skip button on them, but can leave a video through other methods (for instance, by changing the channel). Thus, it makes sense that only 6.86% of ads are skipped according to our broader definition of a skip (see Section 3.1). This is in comparison to the overall skipping rate of 37.3%. Videos that come directly after ads are skipped 19.7% of the time.

¹ The Netherlands follows CET and CEST time zones. For 2018, the country switched from CEST (GMT+2) to CET (GMT+1) on October 28th.

How is genre related to skips? The dataset contains 23 genres. The most represented genres are Pop, Electronic/Dance, and Regional Popular, as seen in Figure 6.

Skip percentages (in other words, the percentage of occurrences where videos in a certain genre were skipped) by genre (not including genres with fewer than a thousand occurrences in the dataset) range from ∼35% to ∼50%.

3.3 Selecting and Engineering Features

We select and/or create the following features to use in our models.

3.3.1 is_mixer_channel

is_mixer_channel is a boolean variable indicating whether the music video was played in a mixer channel. We might hypothesize that if users are consuming videos in a mixer channel personalized to their interests, they might be less likely to skip videos. Conversely, we may expect users to skip videos more frequently while using mixer channels to quickly discover and “like” videos they enjoy.

3.3.2 is_default_channel

is_default_channel is a boolean variable indicating whether the video was played in the default channel, Hits Now. Nearly 79% of videos in the dataset were played in the default channel. We can hypothesize that if the user stays in the default channel, this is an indicator that she is less inclined to interact with the service, including skipping videos.

3.3.3 is_ad

is_ad is a boolean variable indicating whether the video is an ad. Because users are not permitted to skip ads, we see that users skip ads at a rate much lower than the overall skipping rate. Users are still allowed to leave the channel while viewing an ad, which we would consider to be a skip (because the ad would not be played until the end).

3.3.4 context_switch

context_switch is a boolean variable indicating whether the user left the video by changing the channel, entering the filters feature, the search feature, or her favorites. Changing contexts might indicate that a user intends to refine the type of music videos she is looking for (for instance, by changing the channel to one that more appropriately fits her tastes).

3.3.5 hour

hour is a numerical variable ranging from 0 to 23 which indicates the hour of the day the user played the music video. Hour is recorded in GMT. We create this variable by extracting hour from timestamps recorded in user logs using the Python library datetime. Zhang et al. [26] found that Spotify users favor certain times of the day to use the streaming service. Thus, we might expect time of day to be related to skipping behavior on XITE as well. Additionally, we found that users skip at different rates according to hour of the day in our exploratory data analysis (see Figure 5).

Figure 4. We are interested in predicting skipping behavior of the first video in the second half of a session. In a session comprised of ten videos, as shown above, videos one through five are part of the first half of the session and videos six through ten are part of the second half. Thus, we are interested in predicting skipping behavior for video six, boxed in red.

Figure 5. Videos played at different hours of the day (reported in GMT) have different overall skip percentages. In particular, hour 2.00 has the lowest skip percentage (22.4%), while hour 5.00 has the highest skip percentage (53.0%).

3.3.6 weekend

weekend is a boolean variable indicating whether the video was played during the weekend (Saturday, Sunday). Again, we create this variable using the Python library datetime and timestamps recorded in session logs. Because we did not find differences between skipping percentages for individual days of the week, we instead group Monday-Friday together and Saturday-Sunday together, creating a weekend identifier. We might expect users to skip more on weekends, when they are at home and can interact more actively with the service. Or, we may expect users to skip less on weekends due to XITE running in the background at parties and other social gatherings.

Figure 6. The barplot shows percentages of the top eight genres in the dataset. Over a quarter of the music videos played or skipped in the dataset are pop videos.
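The hour and weekend features can be derived from a timestamp with the standard datetime library, as in the sketch below. The function name and the assumption that timestamps are already parsed into GMT datetime objects are ours; the raw timestamp format in the session logs is not specified here.

```python
from datetime import datetime


def time_features(timestamp):
    """Extract the hour (0-23) and a weekend flag from a GMT timestamp."""
    return {
        "hour": timestamp.hour,
        # datetime.weekday() returns 0 for Monday through 6 for Sunday
        "weekend": timestamp.weekday() >= 5,
    }


saturday_evening = datetime(2018, 11, 3, 21, 15)  # a Saturday
monday_morning = datetime(2018, 11, 5, 5, 0)      # a Monday
print(time_features(saturday_evening))
print(time_features(monday_morning))
```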

3.3.7 first_release_date

first_release_date is a numerical variable indicating the year of the music video’s first release. We replace values above 2018 with missing values, and consider these instances as mistakes in the music video metadata.


3.3.8 isfavorites

isfavorites is a boolean variable indicating whether the music video is in the user’s favorites. Naturally, we expect users to have a preference towards videos in their favorites, and thus we might expect users to skip these videos less frequently.

3.3.9 issearch

issearch is a boolean variable indicating whether the music video was found through the search feature, as opposed to a curated channel, for example.

3.3.10 like

like is a boolean variable indicating whether the user used the “like” button while viewing the current music video. We might expect that users skip videos they “like” less often. However, company experts note that sometimes users quickly skip through videos when in an exploratory mood. An example could be a user skipping through videos while pressing the “like” button for several videos. In this case, a “like” would be associated with a skip, counter to general intuition.

3.3.11 unlike

unlike is a boolean variable similar to like. It indicates whether a user used the “unlike” button while watching a video, undoing a previous “like” for the video.

3.3.12 genre

genre is a character variable denoting the genre of the music video. Examples include Electronic/Dance, Latin, Pop, Rap/Hip-Hop, and R&B/Soul.

3.4 Modeling

Throughout all models, we predict the skip behavior of the first video in the second half of a session, as illustrated in Figure 4. For each session, we refer to the video we are interested in predicting as the current video.

We split the dataset into train and test sets. The train set includes observations with session start times before November 15, 2018, while the test set includes those with session start times including and after November 15, 2018. We split observations according to session start times as opposed to individual video timestamps in order to keep entire sessions in either the train or test set. Otherwise, we could end up with a portion of a session in the train set and another portion in the test set if the session occurred late at night into the early morning hours of the next day.
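A minimal sketch of the session-level split follows. Only the November 15, 2018 boundary comes from the text; the function name and example timestamps are hypothetical.

```python
from datetime import datetime

SPLIT_DATE = datetime(2018, 11, 15)  # boundary stated in the text


def assign_split(session_start):
    """Assign a whole session to train or test based on its start time."""
    return "train" if session_start < SPLIT_DATE else "test"


# A session starting late on November 14 stays entirely in the train set,
# even if some of its videos were played after midnight.
print(assign_split(datetime(2018, 11, 14, 23, 50)))
print(assign_split(datetime(2018, 11, 15, 0, 0)))
```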

We begin by modeling skip behavior in a pairwise manner by using information solely from the current video and video immediately preceding the current video. We create a gradient boosting trees model and a logistic regression model. Next, we test whether a sequence-based approach (using information from all videos preceding the current video in the session) adds information which improves prediction. We do so by creating an LSTM model with the video preceding the current video and an LSTM model with all videos in the session preceding the current video, and comparing the performance of both.

We train all models on a machine with four CPUs and 50GB RAM.

3.4.1 Pairwise Models

Gradient Boosting Trees Model

Features. We use all features² described in Section 3.3, except for context_switch. The reason for this exclusion is that the variable essentially codes when a certain type of skip occurs (when a user leaves a video for another system feature), and we would not have this information in a real-world context for the current video. We do not impute missing values for first_release_date, as we will do in future models. Additionally, we include features described in Table 1. These features capture information linking the previous video to the current video. For example, the variable new_genre notes whether the genre changed from the previous to the current video. In this way, we are able to effectively include the previous video in the model. The general idea is that by representing the change between videos, we can predict whether the user will skip the current video. For instance, if the user did not skip the previous video, and the current video is of the same genre, mood level, and energy level, the user may be less likely to skip the current video. By providing these bits of information to the model, we hope it will learn these types of relationships to improve prediction.

Model. We implement the gradient boosting trees model using the Python library xgboost [6]. We set the following parameters:

• objective = ‘binary:logistic’
• learning_rate = .001
• scale_pos_weight = 1.74

Due to the imbalanced nature of skips and non-skips in the dataset, we change the default scale_pos_weight to the ratio of non-skips to skips to prevent the learner from predicting all non-skips.
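As a small worked example of this ratio, the sketch below computes the weight from a list of binary labels. The toy label counts are invented so that the ratio matches the 1.74 value reported above; they are not the actual class counts in our dataset.

```python
def scale_pos_weight(labels):
    """Ratio of negative (non-skip) to positive (skip) observations."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives


# Toy labels (assumption): 100 skips and 174 non-skips.
toy_labels = [1] * 100 + [0] * 174
print(round(scale_pos_weight(toy_labels), 2))
```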

Additionally, we perform a stratified 10-fold cross validation grid search for the following parameter values:

• max_depth: [2, 4, 6, 8, 10]
• n_estimators: [5, 50, 100, 150]

Logistic Regression Model

Features. We include the same set of features in the logistic regression model as we do in the gradient boosting trees model. In addition, we impute missing first_release_date values with the mean of non-missing values. We do this separately for the train and test sets so as to limit information from the train set in the test set.

² We one-hot encode genre.


Feature | Description
new_clip_urgency, new_clip_mood, new_clip_speed, new_clip_energy | Boolean variables indicating whether the urgency/mood/speed/energy of the previous video is different from that of the current video. Example urgency values are “Classic Hit”, “Hit”, “Non-relevant”, “Recognizable”, and “Recurrent Hit”; example mood values are “Happy” and “Melancholic”; example speed values are “Relaxed,” “Normal,” and “Uptempo”; example energy values are “Chill,” “Easy,” and “Powerful.”
new_context | new_context is a boolean variable indicating whether the current video is the result of a context switch in the immediately preceding video. If the user left the previous video’s context, we can deduce that the current video is a new context. The Music Streaming Sessions Dataset includes a similar variable [2].
new_genre | new_genre is a boolean variable indicating whether the genre from the previous video to the current video changed. For example, new_genre would be set to 1 if the previous track was “Pop” and the current track is “Electronic/Dance.”
prev_is_ad | prev_is_ad is a boolean variable indicating whether the previous video was an ad.
prev_is_skip | prev_is_skip is a boolean variable indicating whether the previous video was skipped. In the Spotify challenge, this feature was found to be one of the most important in predicting skips [25].

Table 1. Additional Features in Pairwise Models. Features described above contain information about the previous track or relationship between the previous video and current video. These features are used in conjunction with features from Section 3.3 in both the gradient boosting trees model and logistic regression model.
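The pairwise features in Table 1 can be sketched as a comparison of two consecutive video records. The raw field names (`clip_urgency`, `clip_mood`, `clip_speed`, `clip_energy`, `genre`, `context_switch`, `is_ad`, `is_skip`) and the example records are assumptions for illustration; only the output feature names come from Table 1.

```python
# Hypothetical raw metadata fields compared between consecutive videos.
CLIP_ATTRIBUTES = ("clip_urgency", "clip_mood", "clip_speed", "clip_energy")


def pairwise_features(prev, curr):
    """Build the Table 1 features from the previous and current video records."""
    features = {
        "new_genre": int(prev["genre"] != curr["genre"]),
        "new_context": int(prev["context_switch"]),
        "prev_is_ad": int(prev["is_ad"]),
        "prev_is_skip": int(prev["is_skip"]),
    }
    for attribute in CLIP_ATTRIBUTES:
        features["new_" + attribute] = int(prev[attribute] != curr[attribute])
    return features


prev_video = {"genre": "Pop", "clip_urgency": "Hit", "clip_mood": "Happy",
              "clip_speed": "Normal", "clip_energy": "Easy",
              "context_switch": False, "is_ad": False, "is_skip": True}
curr_video = {"genre": "Electronic/Dance", "clip_urgency": "Hit",
              "clip_mood": "Happy", "clip_speed": "Uptempo",
              "clip_energy": "Powerful"}
print(pairwise_features(prev_video, curr_video))
```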

Model. We use the Python library statsmodels [19] to implement the logistic regression model. We first implement the model with all features described above, then create a final model including only the statistically significant (p < .01) features.

3.4.2 Sequence-Based Models

We create both a “full” LSTM model containing all videos preceding the current video in the session, and a “partial” model containing only the immediately preceding video to the current video.

Features. For both the full model and partial model, we include all features³ described in Section 3.3 for each video. Additionally, we replace missing first_release_date values in the same way as in the logistic regression model (see Section 3.4.1). We use the function preprocessing.scale from the Python library sklearn [18] on both continuous variables, first_release_date and hour.
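For intuition, the standardization applied to the continuous variables can be sketched in plain Python: each value is centered on the mean and divided by the population standard deviation, which is what preprocessing.scale does by default. The helper name and toy hour values are hypothetical, and the sketch assumes the input is not constant.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Center values to zero mean and scale them to unit variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]


toy_hours = [0, 6, 12, 18]  # toy values for the hour feature
scaled_hours = standardize(toy_hours)
print([round(v, 3) for v in scaled_hours])
```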

In the full model, sequence lengths differ due to varying session lengths. Because we initially cut off sessions longer than 20 videos, the maximum sequence length is 10 (in this case, we predict the 11th video in the session). Thus, we pre-pad shorter sequences with zeroes.
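Pre-padding can be sketched as follows: all-zero feature vectors are placed in front of a short sequence so that every sequence reaches the same length. The function name and the two-feature toy vectors are assumptions for illustration, not the padding utility we used.

```python
def pre_pad(sequence, max_len, n_features):
    """Prepend all-zero feature vectors until the sequence has max_len steps."""
    n_missing = max_len - len(sequence)
    padding = [[0.0] * n_features for _ in range(n_missing)]
    return padding + sequence


# A session with only two preceding videos, each with two toy features.
short_sequence = [[1.0, 0.0], [0.0, 1.0]]
padded = pre_pad(short_sequence, max_len=4, n_features=2)
print(padded)
```

Padding at the front keeps the most recent videos at the end of the sequence, closest to the prediction step.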

Model. We use the same model architecture for both the full model and partial model. We create sequential models each with two LSTM layers. Each layer contains 60 neurons, and after each layer, we perform a 20% dropout and batch normalization. We use the rectified linear unit (ReLU) activation function at each layer, a learning rate of .001, and the Adam algorithm [14] for optimization. We use a batch size of 1000 and train each model for 9 epochs. Prior to choosing this architecture, we experimented with various values for batch size, epochs, and number of layers on an ad hoc basis.

3.5 Evaluation

We evaluate all models against two baseline approaches, both of which were used in the Spotify challenge [10]:

• Baseline 1: Use the previous video’s skip outcome as the current video’s predicted skip outcome. For instance, if the user skipped the previous video, we predict that the user will also skip the current video.
• Baseline 2: Predict no skips.
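Both baselines are simple enough to sketch directly. The toy label lists below are invented for illustration and do not come from our dataset.

```python
def baseline_previous(prev_is_skip):
    """Baseline 1: predict the previous video's skip outcome."""
    return list(prev_is_skip)


def baseline_no_skips(n_observations):
    """Baseline 2: predict that no video is skipped."""
    return [0] * n_observations


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy data: skip outcomes of the previous and current videos in five sessions.
prev_is_skip = [1, 0, 0, 1, 0]
is_skip = [1, 0, 1, 1, 0]
print(accuracy(is_skip, baseline_previous(prev_is_skip)))
print(accuracy(is_skip, baseline_no_skips(len(is_skip))))
```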

In addition, we perform 10-fold cross validation to evaluate each model (including baseline approaches), and report mean accuracy and a 95% confidence interval for accuracy (using the t-distribution and 9 degrees of freedom).
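One common way to form such an interval is mean ± t × s / √n, where s is the sample standard deviation of the fold accuracies and t ≈ 2.262 is the two-sided 95% critical value of the t-distribution with 9 degrees of freedom. The sketch below follows that formula; treating it as the exact computation behind our reported intervals is an assumption, and the fold accuracies are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of the t-distribution with 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262


def confidence_interval(fold_accuracies):
    """95% confidence interval for mean accuracy over exactly 10 folds."""
    if len(fold_accuracies) != 10:
        raise ValueError("the hard-coded critical value assumes 10 folds")
    center = mean(fold_accuracies)
    standard_error = stdev(fold_accuracies) / sqrt(len(fold_accuracies))
    half_width = T_CRIT_95_DF9 * standard_error
    return center - half_width, center + half_width


# Made-up fold accuracies, for illustration only.
toy_folds = [0.84, 0.86, 0.85, 0.87, 0.85, 0.86, 0.84, 0.85, 0.86, 0.85]
low, high = confidence_interval(toy_folds)
print(f"[{low:.4f}, {high:.4f}]")
```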

We also report mean recall and precision. We use standard definitions of recall and precision:

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \tag{2}$$

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{3}$$

Finally, we report accuracy, recall, and precision for each model applied to the holdout test set. Together, these metrics give us a holistic view of model performance.
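The recall and precision definitions in Equations 2 and 3 can be checked on a toy example, treating a skip as the positive class. The labels below are invented for illustration.

```python
def recall_precision(y_true, y_pred):
    """Compute recall and precision with skips (label 1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
recall, precision = recall_precision(y_true, y_pred)
print(round(recall, 3), round(precision, 3))
```

Note that precision is undefined for a model that never predicts a skip, since the denominator in Equation 3 is then zero; this sketch would raise a ZeroDivisionError in that case.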

³ Again, we one-hot encode genre.


4 Results

Evaluation metrics for all models can be found in Table 2. We find that both the gradient boosting trees model and logistic regression model outperform Baseline 1 in terms of accuracy on the holdout test set. We also find that adding sequential information to the partial LSTM model improves accuracy on the holdout test set. We begin by presenting results for both pairwise models, then results from both sequential models.

4.1 Pairwise Models

Gradient Boosting Trees Model

The cross validated grid search resulted in the following optimal parameters:

• max_depth: 4
• n_estimators: 100

On the test set, the model achieved the best accuracy (85.6%) and precision (82.9%) of all models created. Its accuracy is 3.7 percentage points higher than that of Baseline 1 and 17.2 percentage points higher than that of Baseline 2.

Feature gains are found in Table 3. The feature prev_is_skip (whether the previous video was skipped) returned far more gain than any other feature in the model. The ad-related variables is_ad and prev_is_ad yielded the second and third most gain, respectively. The only genre to add gain to the model was Rap/Hip-Hop.

Logistic Regression Model

The logistic regression model performed comparably to the XGBoost model, with an accuracy of 85.5% on the test set. The model outdoes Baseline 1 in terms of accuracy and precision, but not recall.

The features that remained in the final model (after insignificant variables (p ≥ .01) were removed from the initial model) are found in Table 4. The feature prev_is_skip has the largest coefficient by absolute value, followed by is_ad and unlike. All features in the final model were significant at the .01 level except for is_default_channel.

Odds of a video being skipped across significant predictors in the final model are found in Figure 7. Points above x = 1 (red line) indicate that turning a given feature “on” and all other features “off” corresponds to odds of the video being skipped above one. For example, if the previous video was skipped, the odds of the current video being skipped are 22.9 to 1. In terms of probability, given that the previous video was skipped, there is a 95.8% probability of the current video being skipped. From Figure 7, we see that the presence of most features does not correspond to high odds of a skip. In fact, the only feature other than prev_is_skip with odds clearly above one is unlike, but its confidence interval indicates that we are not confident that unlike = 1 corresponds to odds of a skip greater than one.
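The conversion from a logistic regression coefficient to the odds and probability quoted above can be reproduced with a few lines of arithmetic. This sketch exponentiates the prev_is_skip coefficient from Table 4 on its own, which is the calculation that yields the 22.9 figure; the function names are hypothetical.

```python
import math


def odds_from_coefficient(beta):
    """Exponentiate a logistic regression coefficient to obtain odds."""
    return math.exp(beta)


def probability_from_odds(odds):
    """Convert odds of a skip into a probability of a skip."""
    return odds / (1.0 + odds)


coef_prev_is_skip = 3.13  # coefficient reported in Table 4
odds = odds_from_coefficient(coef_prev_is_skip)
prob = probability_from_odds(odds)

print(f"odds = {odds:.1f} to 1, probability = {prob:.1%}")
```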

Figure 7. The plot shows skip odds, with 95% confidence intervals, across significant predictors from the final logistic regression model. The red line indicates odds equal to one. Note that prev_is_skip is associated with the highest odds of a skip.

4.2 Sequence-Based Models

Partial LSTM Model The partial LSTM model yielded an accuracy of 81.9% on the test set, the same as Baseline 1. The model performs on par with Baseline 1 in terms of both precision and recall as well. Not including data preparation time, the model took 3 minutes 56 seconds to train.

Full LSTM Model The full LSTM model performs better overall than the partial LSTM model based on accuracy, precision, and recall on the test set. The full model yields an accuracy of 85.0%, which is 3.1 percentage points higher than the partial model’s accuracy. The model took considerably longer to train than the partial model, at 22 minutes and 54 seconds.

5 Limitations

While these results represent a first step in understanding music video skip behavior, we note their limitations. The data used represents user actions in or near the Netherlands, and thus our results may not be generalizable to populations in other locations, due to factors such as varying music tastes and social norms. Additionally, the data specifically represents how users interact with XITE’s service, and may not apply to other music video platforms which may, for example, have different modes of allowing users to interact with videos.

Additionally, limitations exist in the modeling portion of the project. While creating the logistic regression model, we did not take into consideration interaction terms between features, which may yield further insights into when and why users skip videos. Neither LSTM model takes into account features of the current video, which are useful in predicting skip behavior, as found in both the gradient boosting trees model and logistic regression model. Future iterations of this work can include these features to see if they improve performance.

                 Cross Validation                              Test Holdout
Model            Accuracy  95% C.I.        Precision  Recall   Accuracy  Precision  Recall
Logistic Reg.    0.843     (0.840, 0.844)  0.801      0.757    0.855     0.828      0.683
XGBoost          0.843     (0.842, 0.844)  0.802      0.756    0.856     0.829      0.682
LSTM (full)      0.851     (0.850, 0.852)  0.802      0.785    0.850     0.752      0.784
LSTM (partial)   0.829     (0.828, 0.831)  0.767      0.765    0.819     0.712      0.715
Baseline 1       0.829     (0.827, 0.832)  0.764      0.770    0.819     0.711      0.717
Baseline 2       0.635     (0.635, 0.635)  –          0.000    0.684     –          0.000

Table 2. Results from Models. The table shows various metrics for pairwise models, sequence-based models, and baseline approaches. For accuracy, precision, and recall cross validation results, mean values are reported. Additionally, confidence intervals are for mean accuracy. We find that the XGBoost model performs best in terms of accuracy and precision on the holdout test set, while the full LSTM model achieves the best recall on the holdout test set.

Feature              Gain
prev_is_skip         55,356.6
is_ad                2,297.9
prev_is_ad           312.5
Rap/Hip-Hop          96.7
new_context          96.2
issearch             94.2
like                 88.3
hour                 12.2
new_clip_speed       3.7
weekend              1.3
is_default_channel   0.5
new_clip_energy      0.5

Table 3. Feature Gains in Gradient Boosting Trees Model. The table shows, in descending order, gains resulting from features in the model. Features used in the model which do not appear in the table resulted in no gains. We see that prev_is_skip yielded the most gains.

6 Discussion

We discuss performance of models in this research project, as well as what these models show us about contexts of skips. Finally, we describe methods for improving our skip models.

6.1 Performance

In terms of performance measured by accuracy, XGBoost and logistic regression outdo both baseline accuracies. These models perform roughly 3.6 to 3.7 percentage points above Baseline 1’s accuracy on the holdout test set. Thus, the models would seem to present viable options for improving skip prediction at XITE. However, Baseline 1 performs better in recall than both XGBoost and logistic regression, meaning that Baseline 1 identifies a higher proportion of true skips present. In the context of music video streaming, recall is an especially important metric to consider because we are particularly interested in identifying skips. Knowledge that a user will skip a video can be considered to be more powerful information than a play because the user has to make a decision to skip the video, signalling an active choice. Thus, in general, a skip holds more user-specific data than a play. This being said, XGBoost and logistic regression score higher in precision than Baseline 1; in other words, when the former two models predict a skip, it is more likely to be a true skip. Thus, if XITE aims to recommend new music to a user before a predicted skip, they would want to be especially certain that the user will indeed skip the video. Otherwise, recommending a change in playlist, for instance, could unnecessarily interrupt the user’s session and be perceived as a nuisance. The trade-off between precision and recall [3] is one that XITE should carefully consider before choosing a model for production. This decision will depend heavily on the intended use of skip predictions, which may call specifically for either higher precision or recall.

Feature              Coefficient   S.E.
intercept            −1.80*        0.02
prev_is_skip          3.13*        0.01
is_ad                −2.22*        0.03
unlike                1.48*        0.43
like                 −0.47*        0.05
new_context          −0.42*        0.03
is_mixer_channel      0.31*        0.05
isfavorites          −0.26*        0.05
prev_is_ad            0.11*        0.02
weekend              −0.09*        0.01
is_default_channel    0.02         0.01
hour                  0.01*        <.01

Table 4. Logistic Regression Coefficients. *p < .01

The table shows regression coefficients for features in the final logistic regression model. All coefficients are statistically significant except is_default_channel.

Naturally, we can compare results from this research to results from the Spotify challenge. The equivalent of Baseline 1’s accuracy in the Spotify challenge was 74.2% [1]. The top team achieved an accuracy of 81.2% on the holdout test set, improving upon the previously mentioned baseline score by seven percentage points (odds ratio of baseline score to test score is 0.67) [21]. In this research, the XGBoost model improves upon Baseline 1’s score by nearly four percentage points (odds ratio of baseline score to test score is 0.76). The discrepancy between the top team’s improvement from the Spotify challenge and that of this research can be in part attributed to the fact that our baseline accuracy is much higher; in this project, Baseline 1 achieves an accuracy of 81.9%, which is already over seven percentage points higher than the baseline performance in the Spotify challenge. Thus, it is more difficult to make substantial improvements to accuracy with our dataset.

We find that adding sequential information to a model improves accuracy by roughly three percentage points (odds ratio of partial sequential model to full sequential model is 0.80). This gain in accuracy is small compared with the extra training time required to include all prior videos, which is nearly six times that of the partial model. Furthermore, we only considered sequences with a maximum of ten videos prior to the current video; if used in production, the model may need to be trained on longer sequences, which would increase computational expense. Thus, we recommend that XITE focus on improving pairwise models, which capture the relationship between the video immediately preceding the current video and the current video.
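The odds ratios and the training-time ratio quoted in this subsection can be checked with the short sketch below. The accuracies and training times are those reported in the text and in Table 2; the function names are hypothetical.

```python
def odds(p):
    """Convert an accuracy (as a proportion) into odds of a correct prediction."""
    return p / (1.0 - p)


def odds_ratio(p_reference, p_improved):
    """Odds of the reference accuracy divided by odds of the improved accuracy."""
    return odds(p_reference) / odds(p_improved)


def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds duration into seconds."""
    return 60 * minutes + seconds


# Spotify challenge: baseline (74.2%) versus top team (81.2%).
or_spotify = odds_ratio(0.742, 0.812)
# This project: Baseline 1 (81.9%) versus XGBoost (85.6%).
or_xgboost = odds_ratio(0.819, 0.856)
# This project: partial LSTM (81.9%) versus full LSTM (85.0%).
or_lstm = odds_ratio(0.819, 0.850)

# Training time of the full LSTM (22 min 54 s) relative to the partial LSTM (3 min 56 s).
runtime_ratio = to_seconds(22, 54) / to_seconds(3, 56)
```

A ratio closer to one indicates a smaller improvement in odds, which is consistent with the observation that gains are harder to achieve from a higher baseline.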

6.2 Skip Conditions

Findings related to features influential in both pairwise models point to how XITE can further understand user behavior. In both the XGBoost model and logistic regression model, whether the user skipped the previous video was the most important feature in predicting whether the current video would be skipped. This finding suggests that users are inclined to continue their current behavior. XITE can further explore this behavioral pattern to understand exactly when users will change behavior, and what triggers an initial change from plays to skips (discussed in more detail in Section 6.3).

Interestingly, whether the user changed contexts from one video to the next was also important in predicting skips in both pairwise models. From the logistic regression model, we know that if a user changes context from the previous to the current video, she is less likely to skip the current video. Thus, we can interpret context changes as the user’s way of refining her viewing experience and narrowing in on a set of videos more appropriate to her taste. Interpreting a context change this way, XITE can implement a feature to recommend the user a new set of videos when she changes contexts.

It is interesting to note the relationship between ads and skipping behavior. If a video is an ad, the user is less likely to skip the video. This makes sense considering users are not permitted to use the skip button on ads. However, we might expect users to use another method of skipping the ad (by leaving the channel for search, for example). Overall, our results indicate that users do not often take this route. However, instances where users do skip ads could be an interesting area of investigation; these skips require more effort, or creativity, than traditional skips, and might be utilized in cases where the user is already unsatisfied with the sequence of videos and the ad nudges her to find a better-suited set of videos.

We might also expect a genre change (new_genre = 1) to help predict skips; however, the feature does not add gains to the XGBoost model and is insignificant in the initial logistic regression model. This could be because a genre change helps predict skips only when combined with information about the previous video’s skipping behavior. For example, if the previous video and current video are of the same genre, and the previous video was not skipped, we might expect the current video to also not be skipped. Another explanation could be the fact that much of XITE’s traffic occurs in hand-curated channels, meaning that videos viewed in succession have commonalities beyond genre. Thus, even if a change in genre occurs, other features of the set of videos remain constant, lessening the impact of a genre change on the overall consistency of the set of videos. It would be interesting to explore the interaction between genre changes and other features, and further explore features other than genre which unite videos in a channel (perhaps by discussing with channel curation experts what goes into the process of selecting a channel’s videos). Identifying these other features which distinguish videos from one another in curated lists could yield further insights into predictors of skipping behavior specifically in XITE’s service.

Also related to genre, Rap/Hip-Hop is the only one-hot encoded genre to add gains (albeit a very small amount) to the gradient boosting trees model, and no genre category ends up in the final logistic regression model. We hypothesize that differences between the set of gain-adding features in the gradient boosting trees model and the set of significant features in the logistic regression model are due to differences in how each model determines decision boundaries. Therefore, the set of features helpful in each modeling context differs.

Behavior of is_mixer_channel in the logistic regression model points to a potential instance of positive skips. If a video is played in a mixer channel, it is more likely that the user will skip the video. Given that mixer channels are personalized based on user-selected preferences, we expect users to like videos in these channels. Thus, skips in these channels may be positive skips, indicating that the user is using the channel as a way of quickly finding videos of interest. On the other hand, users encounter a larger selection of videos in mixer channels compared to hand-curated channels, leading to a potentially higher variance in types of videos. Thus, higher skipping rates in mixer channels could be a result of the larger range of videos in mixer channels. Further investigation is required to substantiate these ideas.

6.3 Future Work and Improvements

We know that the skipping behavior of the previous video is highly predictive of the current video’s skipping behavior. To further improve skipping models, it would be extremely useful to identify when the first skip in the session will occur. Understanding the context of this first skip will yield insights into potential factors that initially prompt skips. Skips that occur after the initial skip could be due to the fact that the user is already in a skipping mode, skipping due to an inertia to continue the same behavior rather than video-specific features. Thus, focusing on the first skip in a session would be a concentrated way of examining factors that are associated with skips. An example line of research would be to explore whether after the first ad in the session, users are more likely to make the first skip. XITE can then explore whether this behavior is consistent across sessions with different ad placements. According to these findings, the company can strategically place ads.

Once we are able to predict the initial skip in a session, it would then be interesting to study ensuing skip behavior. While we expect future skips after the initial skip, it would be useful to understand how long the same skipping behavior continues for, and factors that correspond to a stop in skips. Then, using information about the initial and final skip, we can attempt to determine whether the user was, in fact, in a skipping, exploratory mode, meaning her skips may not be negative signals.

Additionally, XITE can direct their efforts towards capturing nuances of music videos which can be added to models. Most research in skip prediction has focused on song skips, which are similar to but distinct from music video skips. The main difference is the visual element that music videos introduce. As previously mentioned in Section 2, visual elements play a role in ad zapping. A similar phenomenon may exist with music video skips; for instance, if a video features the song’s singer early in the video, users may be less likely to skip compared to videos that show the singer later in the video or not at all. Research has shown that celebrity endorsements play a role in consumer purchasing behavior [16]. We can extend this idea to music videos: the singer’s presence in a video is a form of endorsement of the video, which may lead to higher viewer attention.

7 Conclusion

We find that the relationship between the previous video and current video is extremely important in predicting whether the current video is skipped. In particular, we find that the previous video’s skip outcome is the strongest predictor of the current video’s skip outcome. In addition, we find that adding information about videos beyond the immediately preceding video does not greatly increase model performance taking into consideration the added computational expense. Thus, we recommend that XITE focus on further capturing features of the previous and current video, and the changes that occur from one video to the next to improve skip prediction models.

Acknowledgments

I would like to thank my supervisors at XITE, Dr. Alessandro Pagliero and Dr. Bouke Huurnink, and my supervisor at the UvA, Dr. John Ashley Burgoyne, for allowing me the freedom to pursue my own ideas while providing invaluable guidance throughout the project. I would also like to thank my friends in Amsterdam, for making this city feel like home, and my parents, for supporting me always.

References

[1] Sainath Adapa. 2019. Sequential modeling of Sessions using Recurrent Neural Networks for Skip Prediction. arXiv preprint arXiv:1904.10273 (2019).

[2] Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The Music Streaming Sessions Dataset. In Proceedings of the 2019 Web Conference. ACM.

[3] Michael Buckland and Fredric Gey. 1994. The relationship between recall and precision. Journal of the American society for information science 45, 1 (1994), 12–19.

[4] Sungkyun Chang, Seungjin Lee, and Kyogu Lee. 2019. Sequential Skip Prediction with Few-shot in Streamed Music Contents. In WSDM Cup 2019 Workshop. WSDM.

[5] Kai Chen, Yi Zhou, and Fangyan Dai. 2015. A LSTM-based method for stock returns prediction: A case study of China stock market. In 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2823–2824.

[6] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 785–794.

[7] Andrés Ferraro, Dmitry Bogdanov, and Xavier Serra. 2019. Skip prediction using boosting trees based on acoustic features of tracks in sessions. arXiv preprint arXiv:1903.11833 (2019).

[8] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.

[9] C Lee Giles, Gary M Kuhn, and Ronald J Williams. 1994. Dynamic recurrent neural networks: Theory and applications. IEEE Transactions on Neural Networks 5, 2 (1994), 153–156.

[10] Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, and Christina Lioma. 2019. Modelling Sequential Music Track Skips using a Multi-RNN Approach. arXiv preprint arXiv:1903.08408 (2019).

[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.

[12] DW Hosmer and S Lemeshow. 2000. Wiley series in probability and statistics. Applied logistic regression. pp. pages cm (2000).


[13] Michael Kearns. 1988. Thoughts on hypothesis boosting. Unpublished manuscript 45 (1988), 105.

[14] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[15] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. 2017. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning. 1–5.

[16] Garima Malik and Abhinav Guptha. 2014. Impact of celebrity endorsements and brand mascots on consumer buying behavior. Journal of Global Marketing 27, 2 (2014), 128–143.

[17] Christopher Olah. 2015. Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[19] Skipper Seabold and Josef Perktold. 2010. Statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference.

[20] Sivaramakrishnan Siddarth and Amitava Chattopadhyay. 1998. To zap or not to zap: A study of the determinants of channel switching during commercials. Marketing Science 17, 2 (1998), 124–138.

[21] Spotify. 2018. Spotify Sequential Skip Prediction Challenge: Predict if users will skip or listen to the music they’re streamed. Retrieved April 30, 2019 from https://www.crowdai.org/challenges/spotify-sequential-skip-prediction-challenge

[22] Spotify. 2018. Spotify Sequential Skip Prediction Challenge: Predict if users will skip or listen to the music they’re streamed. Retrieved April 30, 2019 from https://www.crowdai.org/challenges/spotify-sequential-skip-prediction-challenge/dataset_files

[23] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.

[24] Thales Teixeira, Michel Wedel, and Rik Pieters. 2012. To zap or not to zap: How to insert the brand in TV commercials to minimize avoidance. GfK Marketing Intelligence Review 4, 1 (2012), 14–23.

[25] Charles Tremlett. 2019. Preliminary Investigation of Spotify Sequential Skip Prediction Challenge. WSDM (2019).

[26] Boxun Zhang, Gunnar Kreitz, Marcus Isaksson, Javier Ubillos, Guido Urdaneta, Johan A Pouwelse, and Dick Epema. 2013. Understanding user behavior in spotify. In 2013 Proceedings IEEE INFOCOM. IEEE, 220–224.

[27] Lin Zhu and Yihong Chen. 2019. Session-based Sequential Skip Prediction via Recurrent Neural Networks. arXiv preprint arXiv:1902.04743 (2019).