User type classification with platform interaction features


Submitted in partial fulfillment for the degree of Master of Science

Jesse Haenen
10670742

Master Information Studies
Data Science
Faculty of Science
University of Amsterdam

2019-06-27

Internal Supervisor: Dr Maarten Marx (UvA, FNWI, IvI)
External Supervisor: Sven van der Burg (Company.Info)


User type classification with platform interaction features

Jesse Haenen

University of Amsterdam
jessehaenen@gmail.com

ABSTRACT

Data Mining with user data is an increasingly lucrative operation for online businesses. Automating the process of dividing users into segments can lead to more efficient sales and marketing efforts and personalization of the website. Using a dataset of 8822 users, we use Decision Tree based models (Random Forest, Gradient Boosting and AdaBoost) to automate the classification of users into two types. These types were manually labelled by domain experts. As input features, platform interaction and page visit data from Google Analytics are used. A baseline level of performance was established using a one-feature rule in which users are classified by setting a threshold for user type assignment. Classifiers were optimized using cross-validation and evaluated on a stratified test set. To make recommendations for website personalization, feature importance is studied using both linear SVM feature weights and a performance-based approach. The performance of the baseline classification technique was improved with a Random Forest model. The results of the best model show that it remains difficult to accurately classify users from both classes. Feature importance gives insightful options for website personalization but needs more robust validation methods.

1 INTRODUCTION

Dividing customers into distinct groups is a strategy that is often employed by business managers to make more effective marketing decisions. In this situation, the goal is often to find groups that are as similar as possible internally, while as different as possible compared to other groups. Especially for online platforms, where users are consuming and interacting with products in a web-based environment, targeted marketing efforts and a customized experience are both profitable and technically feasible [11]. Large volumes of user-level data are available to automate the process of creating groups of users, and this has been the subject of research in recent years, especially in the domain of online retail [7]. Applying Data Mining techniques to user data is an active field of study, especially in the area of recommender systems. These problems, however, deal with labeled data that is different in nature from the problem of classifying user types. Instead of predicting user actions, this research focuses on comparing a human-centered baseline of user type classification with automated, supervised learning methods.

At Company.info, a large data platform for business information, user segmentation is done by hand-labeling clients based on domain expert knowledge. These 'ground-truth' labels allow for automated segmentation with classification algorithms. Domain experts at Company.info have defined two user types, making this a binary classification problem. Feature weights of the classification models can provide insight in the sense that they can validate the presuppositions of the company regarding the behavior of their client types. An accurate classification model and the identification of strong features allow targeted sales and advertisement to different user types and type-specific personalization of services provided on the Company.info website1.

Many studies on user segmentation focus on unsupervised methods to discover clusters of similar users. Clustering algorithms are most often used, as most datasets deal with unlabeled observations [12, 24]. This leaves an important shortcoming of clustering algorithms understudied: without ground-truth labels, the result of a clustering model has no a priori meaning and is thus difficult to interpret by the observer [21]. The approach in this study aims to improve upon a baseline customer segmentation with classification algorithms and to provide Company.info with interpretable results with regard to the most important characteristics of the two user types. This approach can improve the day-to-day workflow of users that use online platforms intensively.

The dataset used for this purpose contains 8822 individual users with platform interaction features that are aggregated from both internal databases and Google Analytics, and are assumed to reflect the usage patterns of different user types. The user types as defined by domain experts are divided into two groups: Commercial users and Risk users. These user types are defined based on the purpose for which they use the website.

Research questions. The goal of the research is twofold. Firstly, we determine to what extent machine learning models can automate the process of online user segmentation based on platform interaction features. The purpose is user profiling and personalization of the Company.info website. Secondly, using feature importance measures, we attempt to validate the predefined segments of users and their defining web usage characteristics. In doing so, we test the assumptions of domain experts about what defines a user type. The research is split into four sub-questions.

RQ1 What baseline performance level can be achieved for the classification of users with pre-defined user types, using simple rule-based segmentation?

RQ2 How much can the baseline performance be improved by using a well-performing (according to literature) classification algorithm?

RQ3 Can feature selection methods be used to reliably identify strong features in classes for the purpose of user profiling and recommendations for platform personalization?

RQ4 To what extent can we validate the assumptions of domain experts with feature rankings?

This work is divided into the following sections. Section 2 introduces related work that is relevant to the subject, while Section 3 provides background on studies related to the chosen methodology. Section 4 discusses the steps taken to create the Company.info user dataset, as well as the methods that were implemented

1https://companyinfo.nl/


to answer the research questions. Further, Section 5 discusses the evaluation of the results of the methods and the answers to the research questions. Section 6 draws conclusions from the results and provides a discussion and recommendations for future work.

2 RELATED WORK

This section describes previous work done in Data Mining in the context of user data and online platforms. These studies are closely related to the aim of web personalization as outlined in Section 1.

Using Data Mining techniques to divide users of online retailers or other online platforms into segments is an increasingly popular method to gain customer-centric business intelligence. Early research on the topic used novel methods for user segmentation for purposes such as direct marketing, sales communication or the personalization of website layouts [8, 24]. In essence, the segmentation of users of an online product or platform is used to identify segments of users that are internally similar, while different from users in other segments.

In the context of sales, retail and other online platforms, customer or user segmentation can lead to improved sales efforts. [2] describe costly marketing campaigns that can be improved by predicting the success or failure of direct telemarketing calls. For online retailers, it is important that users of their platform find the products they are looking for as fast as possible. Going beyond exploratory analysis of segments of users, [22] were the first to propose using user log files for the purpose of making websites more adaptive to user behavior [13]. In a case study, they present a method that semi-automatically proposes changes to the structure of the website to let users find more relevant results. In recent work within the field of Recommender Systems, the aim is more direct, as the goal is more turnover-oriented. [14] implement personalized recommendations for online retailer Zalando, combining item-based recommendations with personalized recommendations. Using A/B tests, they find that the group confronted with a personalized strategy consistently outperforms unpersonalized recommendations in terms of Click Through Rate and Conversion.

In a study done by [11], a K-Means approach is applied in an unsupervised cluster analysis. They manage to identify five clusters and assign defining characteristics with descriptive statistics. A refined segmentation is made in their largest cluster with a Decision Tree, but it is not tested for statistical significance with regard to the differences within segments. This approach is highly interpretable, as Decision Trees benefit from insightful methods of visualization. User segmentation in this context is most often approached with an unsupervised experimental setup, which therefore lacks external validation. Evaluating whether the resulting clusters from the algorithm make sense to human observers is a challenge within existing research. [18] propose a semi-supervised clustering technique in which domain experts specify additional pair-wise constraints. Incorporating additional rules about which data points must cluster together achieves better clustering results as the number of constraints increases.

Using supervised methods, [1] use navigation and click stream data in the context of making recommendations to individual users. Their results show that a K-NN based classifier was able to improve recommendations and helps "making support to the browsing process more genuine, rather than a simple reminder of what the user was interested in on his previous visit to the site as seen in path analysis technique." These findings are relevant to the aims of this study, as these recommendations can be used to personalize the website based on behavioral patterns. This approach stands out from existing research as segments are based on ground-truth labels. Within the scope of this study, we use predictor variables that are connected to Company.info ground-truth labels, translating a supervised experimental setup to strictly user-level data.

3 BACKGROUND

This section covers studies that deal with related methodology and are used as motivation for the chosen methods to answer the research questions. Section 3.1 introduces tree-based classification methods and ensemble learning. Section 3.2 covers techniques used to measure feature importance.

3.1 Decision Trees for classification

The presence of ground truth labels for user types means we can apply supervised learning methods for the automation of user segmentation. In this section, two particular areas in the scope of classification methods are covered: tree-based algorithms and ensembles of trees.

3.1.1 Tree-based algorithms. [6] first proposed tree-based algorithms that are better known as Classification And Regression Trees (CART). Decision Trees were introduced in the context of automated decision-making, allowing target prediction based on both continuous and categorical features [5]. Starting with a single split - the root node - and by making multiple subsequent binary decisions in 'children' of the root node, an ideal split is achieved at each node level, segmenting data points into target categories as well as possible [23]. These models are relatively easy to interpret and can achieve state-of-the-art results while being robust to feature distributions and outliers.

3.1.2 Ensemble learning. In more recent years, shortcomings of Decision Trees were addressed by using multiple trees and aggregating their results. Two approaches can be taken in the process of combining the predictive power of multiple Decision Trees: bagging methods combine multiple trees that each select only a random subset of the training data, while still being independent of the other trees, whereas boosting methods iteratively give more weight to misclassified points when training each subsequent tree. The distinction from bagging also lies in the prediction phase: in bagging, independent and individual trees take a simple majority vote, while in boosting subsequent trees take a weighted vote [20]. One of the foundational authors of the Decision Tree algorithm, Leo Breiman, presents a combination of tree predictors that achieves a significant improvement in prediction accuracy [4]. In the Random Forest method, classification is done by individual learners (trees) voting, hence falling into the bagging category. Additionally, an element of randomness is introduced by selecting a subset of features at each node.

3.2 Feature importance methods

In a study done by [9], feature importance rankings are found using a linear SVM model. For linear models, each individual feature is tied to a weight in the decision function. The larger the weight tied to a feature, the more important it is in the decision function. In their study, the results of the feature ranking using linear SVM models are similar to rankings obtained using methods that are independent of the model used, such as correlation-based and performance-based methods (F1 score and AUC score, respectively). [19] optimize candidate predictor rankings in Random Forest models using a performance-based method. Random Forest models employ built-in feature importance measures such as the decrease in Gini impurity whenever a variable is used for creating a split in the trees. They argue that an AUC-based2 feature importance measure is more suitable for datasets that have imbalanced classes, as AUC scoring puts the same weight on both classes regardless of the relative size of each class. Their proposed measure outperforms other feature importance measures on simulated features, differentiating strong features from noise (features without any association to the target variable) under various class imbalance ratios.

4 METHODOLOGY

The classification of users of an online platform starts from the assumption that users have fundamentally different characteristics that are worth exploiting. This section describes the user type characteristics from the perspective of domain experts and the features in the dataset collected from Company.info user website interaction (Section 4.1), as well as the models and methods used to answer the research questions (Sections 4.2 through 4.4).

4.1 Description of the data

The data used for answering the research questions has been provided by Company.info and is stored in a database at both client and user level. The core business of Company.info is providing enriched and organized data to clients (licensing access to the platform), which in turn have one or more users on the same license. On the user level, data is collected to gain insight into how users interact with website functionalities. The underlying assumption in automating user classification is that user types interact differently with the functionalities on the website. The dataset contains users with predictor variables that reflect usage patterns, and the target variable, user type, which is the focus of automating user segmentation.

Mapping usage patterns to features starts from client data that is manually labeled by domain experts: Company.info employees in marketing and sales who have expert knowledge about the specific needs per client. From a dataset of 2229 clients, 197 were manually labelled with the user type as the target variable, under the assumption that the users belonging to a client have similar ways of using the functionalities on the Company.info website. Following the client labels, the target variable is mapped to the individual users belonging to that client. The result is a labeled set of 8822 users for which platform interaction features are engineered. The target variable indicating user type has two labels: 'Risk' users that use the website for compliance work and background checks, and 'Commercial' users that use data for doing market research and other commercial purposes. This justifies the client-to-user mapping, as clients are labelled by their core business activity.

2 Also known as ROC-AUC or AUROC scores, based on Receiver Operating Characteristic curves.

Table 1: Class distributions on both client and user level.

             Clients                 Users
             Amount   % of Total     Amount   % of Total
Commercial       79           40       2321           26
Risk            118           60       6501           74
Total           197                    8822

In Table 1, an overview of class balance is shown for both the manually labeled data of clients and the dataset of users, which will be the basis for training the models. The manually labeled data has a slightly unequal distribution of the two types: user type Risk is the majority class compared to minority class Commercial with a 0.67 ratio of minority to majority. When the client-level labels are mapped to the level of the user, this unequal distribution becomes more prominent, and the ratio of minority to majority class members decreases to 0.36.

The predictor variables were chosen with the idea in mind that the insights from the model feature importance experiments (RQ3 and RQ4) are used for website personalization, such as functionality highlighting or increasing accessibility and ease of use. The website interaction features used to predict the class for each individual user are categorized into three distinct areas:

(1) Page visit statistics per website functionality.
(2) User watchlist statistics.
(3) User watchlist content.

4.1.1 Page visit statistics. The page visit statistics are aggregated from Google Analytics, which tracks navigational patterns on Company.info domains per user3. The features used in this category are the total number of hits (Hits) and the average daily time spent (Avgt) on pages of the Company.info website. In Table 2, we show which functionalities are assumed by domain experts to be distinctive for each user type, and to which of the features these are linked. The features collected from Google Analytics are divided into two sets, and the associated functionalities are related to each user type's professional usage of the product4.

Commercial users manage clients, do market research or search for organisations to approach for commercial purposes (Key functionalities 1-4). Search functionality is key to creating a list of potential clients, as well as to having an overview of customers (CRM). In doing market research, management overviews with shareholders, directors and boards can be viewed. In addition, the aforementioned Loket can be used to get annual (financial) reports of companies or sectors. Central to the work of Risk users is creating a risk assessment for doing business with certain persons or organisations (Key functionalities 5-8). These users are assumed to use the page Loket, from which they can buy overviews of information used for these assessments. They also check for news coverage about potential clients and perform (regulatory) compliance checks.

3 https://marketingplatform.google.com/intl/en_uk/about/analytics/

4 In further discussion of the page visit statistics, the *-symbols preceding the feature names in Table 2 are replaced by Avgt and Hits. For example, for *OrgSearch, both HitsOrgSearch and AvgtOrgSearch are used as features. The complete list of features used can be found in Appendix A.


Table 2: Assumed functionalities per user type by domain experts and their representation in page visit features.

Key functionality                        Feature name                Type
1. Lead lists                            *OrgSearch                  Comm.
2. Customer Relations Management (CRM)   *ConcernRelations           Comm.
3. Management overviews                  *OrgProfiles                Comm.
4. Business characteristics              *LoketReports               Comm.
5. Company extracts                      *LoketExtracts              Risk
                                         *LoketHistory
6. Company news                          *NewsSearch                 Risk
7. Compliance checks                     *ComplianceOrganisations    Risk
                                         *CompliancePersons
8. Person profiles                       *PersonProfiles             Risk

For every user, Hits and Avgt are collected and aggregated over two months. A limiting factor in collecting data from Google Analytics is pricing, as every query to the data is billed depending on its size. As a result of this limited time frame, many users have missing values, which indicates that these users were temporarily or permanently inactive during that period. In both Hits and Avgt, missing values are mapped to zero, indicating that no activity was logged. Figure 1 shows the distribution of the feature AvgtLoketExtracts. We expect this value to be higher for the class Risk, which the wider spread for this type seems to confirm. Most users have zero values for this feature, a trend that is consistent across these features. This is likely a consequence of using a limited time frame for which data is queried, as not all users in the dataset are matched with statistics from Google Analytics.
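
As an illustration of the zero-imputation described above, the following is a minimal pandas sketch; the column names and values are hypothetical placeholders, not the actual Company.info data.

```python
import pandas as pd

# Hypothetical extract of two Google Analytics features; users without
# logged activity in the two-month window show up as missing values.
ga = pd.DataFrame({
    "HitsLoketExtracts": [12, None, 3, None],
    "AvgtLoketExtracts": [45.0, None, 8.5, None],
})

# Missing values indicate that no activity was logged, so they are mapped to zero.
ga_filled = ga.fillna(0)
print(ga_filled)
```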

4.1.2 User watchlist statistics. A secondary functionality is the watchlist, in which users can follow companies and receive updates about them per email, such as quarterly results, publications from the Dutch Chamber of Commerce or relevant news articles. As opposed to the page visit statistics, these features have no assumed usage pattern belonging to either of the two user types. Per user, four features are extracted from the contents of a watchlist:

• Number of companies followed by that user
• Average size of watchlist companies (ordinal scale: 1-10)
• Company sector homogeneity
• Company legal form (type) homogeneity

In Figure 2, the feature distribution is shown for the number of companies followed by the user. Here it becomes apparent that most users (roughly 75%) do not follow any companies, skewing the distribution. We observe a wider spread among Risk users, which indicates that Risk users generally follow more companies. This effect is minor, as the mean difference between Commercial (6.2) and Risk (7.8) is relatively small. However, both classes have outliers which skew these findings. Company sector and legal form homogeneity are computed by dividing the number of unique sector (SBI codes5) and legal form codes in a watchlist by the total size of the watchlist.

5 Dutch Standaard Bedrijfsindeling / Standard Industrial Classification, version 2018: https://bit.ly/2X4nSEf

Figure 1: Distribution of the average time spent on Loket Extracts (AvgtLoketExtracts) per user type. Relative frequencies within classes are shown.

Figure 2: Distribution of the number of companies followed by the user (WatchlistLength) per user type. Relative frequencies within classes are shown.

4.1.3 User watchlist content. Besides statistics, other methods are employed to capture the content of a user watchlist. Each company in a watchlist has an SBI code indicating the sector or activity that the company is active in. To create word embeddings of SBI code descriptions, the text is converted to Word2Vec vector representations of length 3006. The purpose of using vector representations is to test the assertion that the user types follow companies in different sectors. The median values across company SBI vector representations are used to capture the type of companies a user is interested in following. To limit its size, Principal Component Analysis (PCA) is used to extract the 10 most important components of the averaged feature vector. The 10 PCA components are then used as input features to the classification models.

6 The Word2Vec model was trained on a corpus of English Wikipedia pages. SBI descriptions are also translated to English.


4.2 Methods

This section covers the methods used to automate user classification, described in the order of the research questions. Firstly, the method used to establish a baseline level of performance is described. Secondly, to compare against the baseline performance, the automation approach is presented. This part covers the description of the models used for the classification task, the methods used to improve model performance, and the metrics used for evaluation. Then, two methods are presented for measuring feature importance. Finally, the feature importance results are compared with the assessment of the domain experts of Company.info.

4.2.1 Establishing baseline human performance. To assess to what extent the process of user segmentation can be automated, the results need to be compared to a baseline level of performance. This is done by creating a one-feature based classification. The baseline aims to closely approximate the process that a human would follow in labelling each user case by case. This process consists of manually inspecting user data that is both easily accessible and interpretable7. Limiting our scope to one feature, a user is classified by determining on which side of a threshold value the user falls, with each side associated with a user type.

From the features described in Section 4.1, one feature is chosen from the page visit statistics. These features are easily accessible through a web interface and are available as-is by running simple queries. So as not to choose a feature from this subset at random, a Pearson's r correlation test is performed, where all features in the subset are ranked by their correlation with the target variable. The feature with the highest correlation is chosen, with the assumption that this feature best separates the two classes within the space of users. In one-feature based classification, we can assign labels to users based on a threshold value of the chosen feature. The intuition underpinning this method is that the higher (or lower) the value, the higher the likelihood that the user belongs to a certain user type. The threshold value is chosen by trying different values and evaluating the results with the F1 score. Performance indicators are further described in Section 4.3. As opposed to the conventional train-test splits used when training machine learning models, all data is used for establishing the baseline performance. The results are based on a single rule and are deterministic for a given threshold value.
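
A minimal sketch of this baseline procedure, combining the Pearson correlation ranking with a threshold sweep scored by macro-averaged F1; the data is synthetic and the feature names are placeholders, so the numbers bear no relation to the thesis results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

# Synthetic stand-in for the page visit features and the binary target
# (0 = Commercial, 1 = Risk).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AvgtLoketExtracts": rng.exponential(20, size=1000),
    "HitsOrgSearch": rng.poisson(3, size=1000),
    "user_type": rng.integers(0, 2, size=1000),
})

# Rank candidate features by absolute Pearson correlation with the target.
correlations = df.drop(columns="user_type").corrwith(df["user_type"]).abs()
chosen_feature = correlations.idxmax()

# Sweep thresholds: users above the threshold are labelled Risk (1).
best_threshold, best_score = None, -1.0
for threshold in range(0, 500, 10):
    predicted = (df[chosen_feature] > threshold).astype(int)
    score = f1_score(df["user_type"], predicted, average="macro")
    if score > best_score:
        best_threshold, best_score = threshold, score

print(chosen_feature, best_threshold, round(best_score, 2))
```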

4.2.2 Classification models. For the classification task, tree-based models are used. These models are suited to the distributions found in the predictor variables and are trained to make recommendations for the personalization of the Company.info website. In the context of this study, we can define this part as a binary classification problem. The models used for answering the research questions are implemented with the Python package scikit-learn8 unless stated otherwise.

Decision Trees. Starting at the 'root', binary Decision Trees work by creating subsequent splits of the data that best separate the classes into two 'nodes'. A Decision Tree classifier can use different methods of selecting the best split and will continue until a stopping criterion is reached. A node with no subsequent splits is called a leaf node. The splitting criterion used in this research is Gini impurity. Used by the CART algorithm as described in [23], the Gini criterion attempts to maximize the homogeneity of classes in each node. The CART algorithm is chosen as a starting point in this research as it is relatively insensitive to class imbalance. This is a built-in characteristic of CART, as class frequencies in nodes are calculated relative to the class frequencies in the root. The first experiment using a Decision Tree is running a single tree with the same feature as was used in the baseline, to validate a 'human' approach to user segmentation. We compare the performance, using the best split value at the root node against the chosen threshold value. Subsequently, a Decision Tree model using all features is used to assess whether the model is improved by including more features.

7 Easily accessible is defined here as not requiring preprocessing steps or other quantitative techniques.

8 https://scikit-learn.org/stable/
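
As a sketch of this first experiment, a depth-1 tree (a stump) with the Gini criterion finds the best single split on one feature, whose root threshold can be compared against the hand-chosen baseline threshold; the synthetic data and the noisy target below are illustrative assumptions only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the one baseline feature (e.g. AvgtLoketExtracts)
# and a noisy binary user type target.
rng = np.random.default_rng(0)
X = rng.exponential(20, size=(1000, 1))
y = (X[:, 0] + rng.normal(0, 30, size=1000) > 25).astype(int)

# A single split with the Gini criterion plays the same role as the
# hand-chosen baseline threshold.
stump = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=42)
stump.fit(X, y)
print(stump.tree_.threshold[0])  # split value at the root node
```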

Ensemble and boosting methods. Ensemble methods are chosen in the context of this research for their resilience to overfitting and better performance compared to single-tree models. Ensemble methods combine the predictions of multiple individual trees (learners). The key differences between the models lie in the methods employed in the training phase. Three models that use ensembles of trees are chosen: Random Forests, Gradient Boosting and AdaBoost.

Compared to a single Decision Tree, Random Forests introduce randomness in two ways: a random subset of the data points is chosen as training data for each tree, and only a random subset of the features is considered at each split [4]. The predictions of the trees are combined by means of soft voting, in which the predicted class probabilities are averaged and the class with the highest average is chosen. The hyperparameters tuned for the Random Forest model are the maximum depth of each tree and the maximum number of features to consider when looking for the best split.

The Gradient Boosting models differ from Random Forests in that, instead of using multiple full Decision Trees and combining the results, they combine multiple 'weak' learners. Gradient Boosting takes its name from the additive process of training weak learners on the errors (also called pseudo-residuals) of an initial base learner [17]. For each iteration, the contribution of a weak learner is determined by a gradient descent optimization process, which tries to minimize the errors of the base learner. AdaBoost is another variant of an ensemble method based on a boosting process. It is similar to Gradient Boosting but varies in how weak learners are trained on the residuals. AdaBoost focuses on difficult cases not by training only on the errors of the base learner, but rather by changing the weight of each observation, increasing it for wrong predictions and decreasing it for correct predictions [15, 16]. Another difference with Gradient Boosting is the way weak learners contribute to the outcome of the base learner: each weak learner contributes according to its own performance instead of the performance it adds to the base learner. The base learner in this case is a single Decision Tree that is tuned with the same hyperparameters as in the Random Forest.
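
A sketch of how the three ensemble models could be set up with scikit-learn; the synthetic data only mimics the 26/74 class split, the hyperparameter values are illustrative, and the AdaBoost base-tree argument is named estimator in recent scikit-learn versions (base_estimator in older releases).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 36-feature user dataset with a 26/74 class split.
X, y = make_classification(n_samples=2000, n_features=36, weights=[0.26, 0.74],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "GBDT": GradientBoostingClassifier(n_estimators=200, random_state=42),
    # AdaBoost with a Decision Tree base learner, mirroring the thesis setup.
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                                   n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "macro F1 =",
          round(f1_score(y_test, model.predict(X_test), average="macro"), 2))
```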

Improving models: sampling. Class imbalance occurs when the classification categories are not equally represented. As discussed in Section 4.1, the class imbalance in the target variable can pose problems when training an effective classification model. To increase the class balance of user types, oversampling is used in an attempt to improve model results. The goal of oversampling is to create more instances of the minority class, such that the models are better able to differentiate between the two classes.

To achieve a more equal representation of the two user types, we use the Synthetic Minority Over-sampling Technique (SMOTE) as proposed by [10] to create new instances of the minority class Commercial. SMOTE works by selecting points from the minority class and creating a new synthetic example of each selected instance. Depending on the amount of oversampling that is applied, members are chosen from the k nearest neighbors of a point to create new synthetic minority class instances. This is done by taking the difference between the feature vector of the point under consideration and one of its neighbors, multiplying this difference by a random number between 0 and 1, and adding the result to the feature vector of the point under consideration.

This process broadens the decision region for the minority class, leading to better results and generalization. The number of neighbors to consider when creating a new synthetic example is k = 5, the default option in the SMOTE implementation of the Python package imbalanced-learn9. To choose the right level of oversampling, multiple sampling ratios between 0.4 and 1 are tried, where 1 means an equal representation of instances from both user types. The sampling steps are taken after creating a train-test split, and SMOTE is only applied to the training set. Using the model with the best results for the original class distribution, multiple sampling strategies are evaluated and an optimal level of oversampling is chosen.
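
A minimal imbalanced-learn sketch of this oversampling step, applied to the training split only; the synthetic data and the 0.7 sampling ratio are illustrative assumptions.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the user dataset.
X, y = make_classification(n_samples=2000, n_features=36, weights=[0.26, 0.74],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample only the training set; sampling_strategy=0.7 grows the minority
# class to 70% of the majority class size (k_neighbors=5 is the default).
smote = SMOTE(sampling_strategy=0.7, k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))
print("after: ", Counter(y_resampled))
```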

4.3 Model performance metrics

The performance of each classification model is evaluated across all test set samples, as well as for each class separately. As performance metrics, precision, recall and F1 scores are used. The macro-averaged F1 is used as the metric for choosing the best model, as both classes are equally important in the implementation of the model for Company.info. This is seen as the most reliable performance metric, as the class representation in the stratified test set is the same as in the training set. For each model, a grid search is performed to find the combination of hyperparameters yielding the highest F1 score. The specific values used in the grid search are listed in Appendix B. Using stratified cross-validation with 10 folds, the model is trained on 70% of the data. The best trained model is then evaluated on the test data for a final comparison of performance.
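
A sketch of this evaluation setup, combining a stratified 70/30 split, a 10-fold stratified grid search on macro-averaged F1 (with the Random Forest grid from Appendix B), and a final per-class report on the held-out test set; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

# Synthetic stand-in for the user dataset.
X, y = make_classification(n_samples=2000, n_features=36, weights=[0.26, 0.74],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Hyperparameter grid mirroring Appendix B for the Random Forest.
param_grid = {
    "n_estimators": [200],
    "max_depth": [3, 5, 10, None],
    "max_features": [3, 5, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Final evaluation of the best model on the held-out stratified test set.
print(search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```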

4.4 Feature importance

To assess which website functionalities are used by each user type, a feature importance approach is taken. To answer RQ3 and RQ4, we measure feature importance in two ways: using linear SVM feature weights and using a performance-based feature ranking. The requirement of reliability is important, as different classification algorithms may yield varying results; the importance of a feature may depend on model specifics. We choose two methods to test for strong features to achieve more certainty in the recommendations to Company.info. The strongest features found in both rankings are used for insight into the Company.info usage patterns.

9 https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

For the sake of interpretability, only a subset of features is used. Features that do not directly represent a website functionality, such as the watchlist content features described in Section 4.1, are left out. To also address the issue of covariance, only the Hits on subdirectories are taken into account, and the average time spent (Avgt) features are left out.

4.4.1 Linear SVM feature ranking. Support Vector Machines attempt to maximize the margin between the training examples and the decision boundary. Maximizing the boundary between two classes makes this a binary classifier by nature. It does so by assigning weights to points to minimize the loss function [3]. The weights of the SVM model are used for feature ranking, where the larger the weight |w_j|, the more important the jth feature. In the context of this study, we want to know the feature rankings per class, so w_j is interpreted as is, instead of taking the absolute value of the weights. Negative weights are then associated with the negative class, and positive weights with the positive class. The target variable user type is encoded as 0 for Commercial users and 1 for Risk users.
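
A sketch of this signed-weight ranking with scikit-learn's LinearSVC; the data, feature names and the scaling step are assumptions for illustration (the thesis does not detail its SVM preprocessing).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with placeholder feature names; class 0 = Commercial,
# class 1 = Risk, as in the thesis encoding.
X, y = make_classification(n_samples=2000, n_features=17, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Scaling makes the weight magnitudes comparable across features.
svm = LinearSVC(C=1.0, max_iter=10000, random_state=42)
svm.fit(StandardScaler().fit_transform(X), y)

# Signed weights: negative values point towards Commercial (0),
# positive values towards Risk (1).
weights = svm.coef_[0]
order = np.argsort(weights)
print("strongest Commercial features:", [feature_names[i] for i in order[:3]])
print("strongest Risk features:", [feature_names[i] for i in order[-3:]])
```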

4.4.2 Performance-based feature importance. We also determine feature importance using a performance-based approach with the best performing model. This is done using the eli5 PermutationImportance package10. It works similarly to the approach taken by [9], but instead of the ROC-AUC score, the F1 score is used for both classes separately. This method iteratively replaces one of the features with noise, which is done by drawing values from the same distribution as the original feature values by means of shuffling. To increase the stability of the results, the number of shuffling iterations is set to 20. Using the best trained model, the features with the highest scores show the largest mean decrease in F1 score when that feature is replaced with noise (permutation).
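
The thesis uses the eli5 package for this; the sketch below shows the same idea with scikit-learn's built-in permutation_importance and a per-class F1 scorer, on synthetic data, so it is an equivalent illustration rather than the exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a fitted Random Forest as the "best model".
X, y = make_classification(n_samples=2000, n_features=17, weights=[0.26, 0.74],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# F1 for one class at a time (0 = Commercial, 1 = Risk); each feature is
# shuffled 20 times and the mean drop in F1 is its importance score.
for label, name in [(0, "Commercial"), (1, "Risk")]:
    scorer = make_scorer(f1_score, pos_label=label)
    result = permutation_importance(model, X_test, y_test, scoring=scorer,
                                    n_repeats=20, random_state=42)
    top = np.argsort(result.importances_mean)[::-1][:3]
    print(name, [(int(i), round(result.importances_mean[i], 3)) for i in top])
```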

5 EVALUATION

In this section, we present the experimental results to answer the research questions as defined in Section 1. The methods described in Section 4 will be implemented. The training and test data across models will be the same for the experiments relating to RQ1 and RQ2. RQ3 and RQ4 will use a different subset of features to improve interpretability.

5.1 Baseline performance

The performance for different threshold values of this feature is displayed in Figure 3. In Table 3, the results of the baseline are shown for both classes, as well as the macro-averaged F1 score over both.

The feature with the highest absolute correlation has a Pearson's r of 0.13, which is by no means a strong correlation with the target variable outside the context of establishing a baseline performance level. The feature AvgtLoketExtracts is chosen to establish the rule-based baseline. Looking at the feature correlations with the target variable, we observe that more time spent on this page is correlated with the positive class (Risk). This is in line with what is expected by the domain experts in terms of the functionalities of that user type: Risk users are generally expected to have higher values in AvgtLoketExtracts. We hence 'predict' that all users who fall above the threshold in the chosen feature are Risk users; all others are labeled as Commercial users. Using the feature AvgtLoketExtracts, we can label users based on a certain threshold. In Figure 3, various threshold levels are shown with their F1 (macro-averaged) performance to give an even indication of the performance of the baseline. Performance is given for threshold levels between 0 and 500 with intervals of 10. A threshold of 10 is chosen, as using this value for segmenting users achieves the highest score: 0.45. The F1 macro-average performance peaks at 10, after which the score continues to decrease as the class Commercial is increasingly over-classified.

10 https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

Figure 3: Performance levels of single-rule based segmentation on the complete dataset with various thresholds.

Upon further inspection of the baseline results, Table 3 shows that the baseline segmentation achieves a low precision of 0.31 on the minority class Commercial, but a high recall of 0.91. For the majority class Risk, this pattern is reversed: it achieves a precision of 0.90 and a recall of 0.29. This hints at an over-classification of the minority class Commercial, as opposed to an under-classification of the majority class. Further inspecting the confusion matrix shown in Table 4, we see that it is indeed the case that more than half of the total dataset are Risk users falsely classified as Commercial. The baseline hence achieves poor results, as the majority of users are falsely classified.

5.2 Model performance

In an effort to improve upon the baseline, several models are employed. The performance results of the threshold level found with a Decision Tree (using the same one feature) are shown in Table 3. A full overview of model results, including model results with SMOTE oversampling, is shown in Table 5.

Using a Decision Tree with the same feature used in the single-rule based classification does not achieve an improvement in performance. However, a higher threshold of 125.93 is chosen for the ideal split.

Table 3: Precision, recall and F1 performance per class in the baseline and the Decision Tree using one feature.

User type                     Commercial    Risk
One feature-based baseline
  Precision                         0.31    0.90
  Recall                            0.91    0.29
  F1                                0.47    0.44
One-feature Decision Tree
  Precision                         0.31    0.90
  Recall                            0.91    0.29
  F1                                0.47    0.44

Table 4: Confusion matrix of actual and predicted labels for the complete dataset using a single-rule based classification of users.

                          Predicted
                      Commercial    Risk    Total
Actual  Commercial          2108     213     2321
        Risk                4622    1879     6501
        Total               6730    2092     8822

The tree follows a logic consistent with our expectations, in the sense that higher values in the baseline feature identify a Risk user. Conversely, low values in the predictor variable more likely identify Commercial users. Even though the split value chosen by the Decision Tree is higher, the difference in performance is negligible; this subtle difference may be due to differences in the data the model was trained on. As expected, the performance of the one-feature Decision Tree is almost identical to the baseline performance. Precision for the Commercial class is 0.31, while recall is 0.88. For the majority class Risk, the precision is 0.88 while recall is 0.32. The macro-averaged F1 score is almost the same: 0.46.

To further improve these results, all other features are used in the Decision Tree, as well as three other models based on ensembles of trees. In Table 5, the results of a Decision Tree using all features, Random Forest, Gradient Boosting (GBDT) and AdaBoost are shown. When all features are used as input for the Decision Tree, a somewhat reversed pattern is observed. Here, the minority class (Commercial users) is predicted with 0.51 precision, but a much lower recall: 0.22. The precision for Risk users is 0.77, which is lower than the baseline and one-feature tree results. The recall is now much higher for Risk users, increasing from 0.29 in the baseline to 0.92. Compared to the baseline F1 scores for Risk users, there is an increase from 0.44 to 0.84 when all features are used, while the F1 score for Commercial users decreases from 0.47 to 0.30. Overall, there is an improvement in macro-averaged F1 score from 0.45 to 0.57. This reversed pattern indicates a shift from Commercial users to Risk users being over-classified, judging by the high-recall, low-precision results of the latter.

Random Forests are expected to achieve better results than a single tree, mainly due to less overfitting. The observed shift in classification results persists and even becomes more prominent. At no cost to precision and recall for the Risk class, the Random Forest achieves an improved precision of 0.77 for the Commercial class and a recall of 0.22. There is little change in the predictions for Risk users: precision is 0.78 while recall is slightly improved from 0.92 to 0.98 compared to the Decision Tree model. For both classes the F1 scores improve and overall the macro F1 increases by 0.05 to 0.62. Given the very high recall for the majority class Risk and the low recall for Commercial, we argue that the classifier has trouble identifying points as Commercial users.

In the GBDT model, the results are roughly the same as in the Random Forest model. No notable changes are observed besides a decrease in F1 score from 0.35 to 0.30 for the Commercial class; results for the Risk class are nearly identical. Ultimately it performs worse than the Random Forest model, with a macro-averaged F1 of 0.60. AdaBoost achieves very similar results to GBDT but has slightly lower precision for Commercial users: 0.73. Its F1 macro-average is also 0.60. Interestingly, using boosting methods did not lead to an improvement compared to Random Forests.

5.2.1 SMOTE oversampling results. To address the observed over-classification of the majority class, oversampling methods are used to create a more balanced class distribution. Using SMOTE, the models are evaluated again with oversampled training data. To find the ideal degree of oversampling of the minority class, several ratio values are tested for the highest performance. The reference sample ratio in the training set for the best model without oversampling (Random Forest) is 1625/4450 = 0.37. Figure 4 shows F1 macro-averaged results using the Random Forest model with varying SMOTE sampling ratios. A ratio of 0.70 (minority class / majority class) achieves the best results. Using this oversampling ratio, we again run the same models on the oversampled training data.

SMOTE did not lead to improvements in performance for either class. Across the board, the Decision Tree, Random Forest and boosting methods achieve 0.60-0.61 macro-averaged F1 scores. Contrary to what is expected when oversampling, widening the decision region separating user types in the n-dimensional feature space did not help the models classify more points from the minority class correctly. This may be due to characteristics of Decision Tree models or to patterns in the feature space of the minority class. The boosting methods also seem unable to improve the base learner with subsequent weak learners, and training these models with additional, synthetic data points does not lead to a significant improvement in minority class recall.

Concluding, the best model to improve upon the baseline performance is a Random Forest without oversampling, which increases the macro F1 score from 0.45 to 0.62, an improvement of 0.17. Due to its strong bias towards the minority class Commercial, the baseline still achieves the highest recall for that class, as well as the highest precision for the majority class Risk, but its macro F1 score remains low at 0.45. The Random Forest achieves the highest precision in the minority class, as well as the highest recall in the majority class.

5.3 Feature importance

For the feature importance results, the results of the linear SVM model are shown in Figure 5. The performance-based approach is split into results for Commercial and Risk users, shown in Tables 6 and 7, respectively.

Figure 4: Model performance using varying sampling ratios.

The weights in a linear SVM model are used to interpret the feature importance for both classes. As users of class Commercial are mapped to 0 and Risk users to 1, negative weights are given to features that are strong for Commercial users and positive weights to features that are strong for Risk users. In Figure 5, the results of the feature importance ranking are shown. The feature weights show that of the 17 features chosen to include in the model, 7 are negative and 10 are positive. Not all features may be helpful in training a classification model for user types. Using too many features may lead to a model that is too complex (overfitting the training data) and may not generalize well to the total population of Company.info users. Table 5 also shows the results of the models with feature selection by linear SVM (SVM-FS) applied. The results show that feature selection by linear SVM does not increase performance for any of the models.

Based on the feature weights of the linear SVM model, key features can be identified for both classes. For the class Commercial, two are clearly stronger than the others: the average size of companies followed in a watchlist, and the total hits on profile pages of companies. For user type Risk, three key features can be identified: homogeneity of company type in a watchlist, the number of hits on the homepage, and the number of hits on the checkout history for company extracts. Concluding, we argue that for users classified as Commercial, follow recommendations for large companies are helpful, and that the class labels can be used as input to a recommender system for watchlists. This is confirmed by the feature importance of the performance-based approach, where this feature contributes most strongly to the score by a substantial margin. For the class Risk, the overlap of strong features is less clear. The largest weight in the linear SVM model indicates that a watchlist that is more homogeneous in terms of company type points to a Risk user, but this finding is not supported as strongly by the performance-based method.


Table 5: Model performance in comparison to existing rule-based baseline with 10-fold cross-validation.

                                              Commercial                     Risk                All
Model                                   Precision  Recall    F1    Precision  Recall    F1    F1 (macro)
Baseline (one-rule)                          0.31    0.91  0.47         0.90    0.29  0.44        0.45
Decision Tree (one feature)                  0.31    0.88  0.46         0.88    0.32  0.47        0.46
Decision Tree (all features)                 0.51    0.22  0.30         0.77    0.92  0.84        0.57
Random Forest                                0.77    0.22  0.35         0.78    0.98  0.87        0.62
GBDT                                         0.77    0.18  0.30         0.77    0.98  0.86        0.60
AdaBoost                                     0.73    0.22  0.33         0.78    0.97  0.86        0.60
Decision Tree (all features) + SMOTE         0.56    0.26  0.35         0.78    0.93  0.85        0.60
Random Forest + SMOTE                        0.74    0.24  0.36         0.78    0.97  0.87        0.61
GBDT + SMOTE                                 0.74    0.23  0.35         0.78    0.97  0.87        0.61
AdaBoost + SMOTE                             0.65    0.24  0.35         0.78    0.95  0.86        0.60
Decision Tree (all features) + SVM-FS        0.55    0.20  0.29         0.77    0.94  0.85        0.57
Random Forest + SVM-FS                       0.64    0.20  0.30         0.77    0.96  0.85        0.58
GBDT + SVM-FS                                0.65    0.20  0.30         0.77    0.96  0.86        0.58
AdaBoost + SVM-FS                            0.63    0.19  0.29         0.77    0.96  0.85        0.57

Figure 5: Feature weights in the linear SVM classifier. Negative values indicate weighting towards the class Commercial, positive values towards the class Risk.

5.4 Validating domain expert assumptions

In linking Key functionalities to user types, we test whether the assumptions of domain experts are correct. Using the linear SVM model, we observe that for the Commercial user type, three out of four Key functionalities (1-3) as listed in Table 2 match the negative weights from the linear SVM model. The assumptions for Commercial users are correct in the sense that they indeed appear to use search functionality to look up organizations and their managerial structure. Key functionality 4 is found to be (weakly) associated with Risk users, while it is assumed to be linked to Commercial users. For Risk users too, three out of four Key functionality assumptions (5-7) match positive weights. Key functionality 8 is a feature that is expected to be strong for the Commercial users but is found to have a positive weight (a feature associated with the class Risk). Confirming these findings with the performance-based approach is problematic: while Key functionality 5 is confirmed to be associated with Risk users, it is also found to be strong for the Commercial class.


Table 6: Feature importance based on average F1 score performance decrease by feature permutation for Commercial users.

Feature name                   Average F1 decrease    Std. Dev
AvgWatchlistSize                            0.1397      0.0234
WatchlistSectorHomogeneity                  0.0538      0.0105
HitsLoketExtracts                           0.0529      0.0182

Table 7: Feature importance based on average F1 score performance decrease by feature permutation for Risk users.

Feature name                   Average F1 decrease    Std. Dev
HitsOrgSearch                               0.0597      0.0073
WatchlistTypeHomogeneity                    0.0439      0.0075
AvgWatchlistSize                            0.0358      0.0054
WatchlistSectorHomogeneity                  0.0243      0.0022
HitsLoketExtracts                           0.0237      0.0044

Further, Key functionality 1 is expected to be associated with Commercial users, while it is found to give the highest permutation importance for Risk users when the feature is replaced with noise. Concluding, the linear SVM weights largely confirm the expected usage patterns, while the feature rankings of the performance-based approach seem to be unrelated to the assumptions of the domain experts.

6 CONCLUSION & DISCUSSION

6.1 Conclusion

In classifying user types using platform interaction features, a rule-based method with a single feature was used to establish a baseline performance level. The baseline method achieved a poor precision of 0.31 on the minority class, while recall was 0.91. The majority class shows a reversed pattern, with a precision of 0.90 and a recall of 0.29. As both classes are equally important in scoring the predictions, the F1 score is used as an overall scoring metric; the baseline achieves a macro-averaged score of 0.45. This process was automated with a Random Forest model, which achieves an F1 score of 0.62, an improvement over the baseline. These results show that, contrary to the baseline, the models have a strong bias towards the majority class and underperform on the minority class in spite of model tuning and oversampling efforts to improve the results.

The weights from a trained linear SVM model provide interpretable results for strong features in both classes. For the minority class, two strong features are identified: the size of the companies followed by the user and page clicks on company address information. These features fall outside the expected Key functionalities and can be used for website personalization. For the majority class, three large weights stand out: the homogeneity in company types followed, the total hits on the home page, and the hits on the history of bought documents via the Company.info platform. The results of the feature permutation method are not as mutually exclusive and hence not as interpretable. They do confirm, however, that for Commercial users the company size feature contributes most to the performance of the model. Overall, the assumptions of the marketing and sales agents are mostly confirmed by the linear SVM weights. Several other features are found to be important for both classes and can provide recommendations for website personalization.

6.2 Discussion

With this study, an important step is taken in automating user classification for Company.info. A key focus must lie in improving the performance for the minority class of Commercial users. Steps to improve data quality include using the full scope of user navigation history (as opposed to only two months) to get more users in the dataset. In combination with simply labeling more clients, a larger training set would also allow more freedom in filtering out inactive users that have little data, or other outliers. Investigating the errors of the misclassified points to assess the feature space of these users can provide direction in the process of applying these filters.

One practical use of this research for Company.info would be to act on the predictions for users that are likely to be correct. The user type labels can then be used for direct sales and personalized marketing efforts. Relating to the topic of personalization, certain functionalities of the website that are also found to be strong features can be highlighted to improve the workflow of Company.info users. As a further area of investigation, clients that are assumed to be one user type but are shown by the model to be the other type can be valuable. Creating more modular services (where clients pay per functionality instead of for the whole product) allows directed offers that are valuable for both the client (in terms of reduced costs) and Company.info.

ACKNOWLEDGEMENTS

I would firstly like to thank Company.info for the opportunity to work on this interesting project. A special thanks to Sven van der Burg and the data science team at Company.info for technical know-how and valuable input. I could not have wished for a more welcoming environment to complete this internship. Finally a special thanks to my supervisor Maarten Marx for the necessary guidance and feedback in the process of writing my thesis, as well as Chen Yifan for being the second reader.

REFERENCES

[1] Adeniyi, D., Wei, Z., and Yongqan, Y. Automated web usage data mining and recommendation system using K-nearest neighbor (KNN) classification method. Applied Computing and Informatics 12, 1 (2016), 90–108.

[2] Barraza, N. R., Moro, S., Ferreyra, M., and de la Peña, A. Information theory based feature selection for customer classification. In Simposio Argentino de Inteligencia Artificial (ASAI 2016) (2016).

[3] Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992), ACM, pp. 144–152.

[4] Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5–32.

[5] Breiman, L. Classification and Regression Trees. Routledge, 2017.

[6] Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and regression trees. Wadsworth Int. Group 37, 15 (1984), 237–251.

[7] Brito, P. Q., Soares, C., Almeida, S., Monte, A., and Byvoet, M. Customer segmentation in a large database of an online customized fashion business. Robotics and Computer-Integrated Manufacturing 36 (2015), 93–100.

[8] Chan, C.-C. H. Online auction customer segmentation using a neural network model. International Journal of Applied Science and Engineering 3, 2 (2005), 101–109.

[9] Chang, Y.-W., and Lin, C.-J. Feature ranking using linear SVM. In Causation and Prediction Challenge (2008), pp. 53–64.

[10] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.

[11] Chen, D., Sain, S. L., and Guo, K. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management 19, 3 (2012), 197–208.

[12] Chiu, C.-Y., Chen, Y.-F., Kuo, I.-T., and Ku, H. C. An intelligent market segmentation system using k-means and particle swarm optimization. Expert Systems with Applications 36, 3 (2009), 4558–4565.

[13] Eirinaki, M., and Vazirgiannis, M. Web mining for web personalization. ACM Transactions on Internet Technology (TOIT) 3, 1 (2003), 1–27.

[14] Freno, A. Practical lessons from developing a large-scale recommender system at Zalando. In Proceedings of the Eleventh ACM Conference on Recommender Systems (2017), ACM, pp. 251–259.

[15] Freund, Y., and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.

[16] Friedman, J., Hastie, T., Tibshirani, R., et al. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28, 2 (2000), 337–407.

[17] Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.

[18] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31, 8 (2010), 651–666.

[19] Janitza, S., Strobl, C., and Boulesteix, A.-L. An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 14, 1 (2013), 119.

[20] Liaw, A., Wiener, M., et al. Classification and regression by randomForest. R News 2, 3 (2002), 18–22.

[21] Mueller, A. C., Guido, S., et al. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, 2016.

[22] Perkowitz, M., and Etzioni, O. Towards adaptive web sites: Conceptual framework and case study. Artificial Intelligence 118, 1-2 (2000), 245–275.

[23] Steinberg, D., and Colla, P. CART: classification and regression trees. The Top Ten Algorithms in Data Mining 9 (2009), 179.

[24] Wu, R.-S., and Chou, P.-H. Customer segmentation of multiple category data in e-commerce using a soft-clustering approach. Electronic Commerce Research and Applications 10, 3 (2011), 331–341.


Appendices

A FEATURE LIST

A total of 36 features are used to predict user type.

A.1 Page visit statistics

For page visit statistics (Google Analytics), 11 distinct parts of the website track the average time spent (Avgt) and the total hits (Hits), resulting in 22 features.

*BagProfiles Used to find address information, including the size of the building.
*ComplianceOrganisations Used to do compliance checks for organisations.
*CompliancePersons Used to do compliance checks for persons.
*HomePage Main page from which users can navigate.
*ConcernRelations Used to integrate clients and potential leads data from Company.info into other CRM software.
*LoketExtracts Used to buy extracts summarizing company function, structure and key decision makers.
*LoketHistory Used to view a history of bought documents and extracts.
*LoketReports Used to buy company annual reports.
*NewsSearch Used to find news about companies or sectors.
*OrgProfiles Used to view general company information such as sector, address and organisational structure.
*OrgSearch Used to search and filter organisations.

A.2 User watchlist statistics

For watchlist statistics, 4 features are made.

WatchlistLength Number of companies in a watchlist followed by the user.
AvgWatchlistSize Average size of watchlist companies. Size is measured on a 1-10 ordinal scale which maps to the number of employees.
WatchlistSectorHomogeneity Homogeneity in watchlist companies' sectors.
WatchlistTypeHomogeneity Homogeneity in watchlist companies' legal forms.

A.3 User watchlist content

For watchlist content, we try to represent the textual descriptions of followed companies in a vector. We use 10 Principal Components from Word2Vec word embeddings. 10 features are used as predictor variables.

MedianSBIPCA (1-10)

B MODEL PARAMETERS

For classification models, the following model parameters are used in grid search to optimize performance.

B.1 Random Forest

n_estimators: [200]
max_depth: [3, 5, 10, None]
max_features: [3, 5, None]

B.2 GBDT

n_estimators: [200]
max_depth: [3, 5, 10, None]
max_features: [3, 5, None]

B.3 AdaBoost

n_estimators: [200]
max_depth (base estimator): [3, 5, 10, None]
max_features (base estimator): [3, 5, None]
