
Machine Learning for Project Selection in a Professional Service Firm

Submitted in partial fulfillment for the degree of Master of Science

Quinten Oostdam

10787712

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam

2018-07-05

Internal Supervisor: Dr. Maarten Marx (UvA, FNWI, IvI)
Second reader: Loek Stolwijk (UvA, FNWI)
External Supervisor: Bart Vergeer (KPMG)


Machine Learning for Project Selection in a Professional Service Firm

Quinten Oostdam

University of Amsterdam
quinten.oostdam@gmail.com

ABSTRACT

In this study, a Machine Learning-based approach is proposed for the task of selecting projects in multi-project contexts. Several ensemble as well as non-ensemble algorithms were deployed in order to compose a binary classifier for distinguishing profitable projects from unprofitable projects. These algorithms were trained and evaluated using a dataset from a Professional Service Firm containing 10,349 projects.

The results indicated that the ensemble methods yielded the best results, with Bootstrap Aggregating having the highest F1 score of 0.51. Because the dataset is imbalanced, oversampling methods were also tested; this did not result in better performance.

The work introduces a new perspective on the task of project selection and shows that Machine Learning tools can be helpful for Project Managers, since the classifier is able to distinguish profitable projects from unprofitable projects moderately well using solely ERP data.

KEYWORDS

Project Management, Machine Learning, Ensemble Learning, Classification, Professional Service Firm, Project Profitability

1 INTRODUCTION

Projects have become a standard model for structuring and delivering work in many organizations [4] and are a key way to create value and benefits [23]. Project-based firms are found in many areas, such as consulting and professional services, complex products and systems, and cultural and sports industries [43].

Selecting the right projects is crucial for organizations with a project-based business model [37]. Usually, the number of available projects exceeds the number of projects that can be executed in a firm because of limited resources [3, 16]. This makes choosing the right projects a difficult task in multi-project contexts.

Project Portfolio Management is a discipline which is concerned with selecting which projects go into the pipeline and which do not [29]. It is essentially an evolved form of project management. There are hundreds of methods for managing project portfolios, such as expert judgment and parametric tools. These methods have limitations: expert judgment is susceptible to bias [40], and parametric tools are often not widely used because they are difficult to use or understand, require too much input data or are too complex [19].

In this study, a new method for project selection is proposed which is based on Machine Learning (ML). The goal of this method is to overcome the limitations of traditional methods. The input data is Enterprise Resource Planning (ERP) data, which is captured by ERP systems and thus easy to access [35]. Furthermore, complexity is minimal since the algorithms are only applied to distinguish profitable projects from unprofitable projects in the selection stage. This could help project managers avoid unprofitable projects, which is necessary since these projects drain scarce and valuable resources from the more profitable projects [30].

A major issue with research on this subject is that real-life project databases are often inadequate for obtaining significant results because their size and diversity are insufficient [6]. Furthermore, most project databases are not publicly available.

This study attempts to fill this gap by using a real-life project database from a professional service firm and deploying several Machine Learning algorithms on this dataset for distinguishing profitable projects from unprofitable projects. Different algorithms are evaluated using various metrics and their performance is compared.

This leads to the following main research question:

To what extent can Machine Learning algorithms be deployed to distinguish profitable projects from unprofitable projects at the start of their life cycle in a Professional Service Firm?

In order to answer this question, the following subquestions are addressed:

(1) What are the existing techniques used to predict project outcomes?

(2) How can Machine Learning algorithms be deployed on company data in order to distinguish profitable from unprofitable projects?

(3) How well do the Machine Learning algorithms distinguish profitable from unprofitable projects?

2 BACKGROUND AND RELATED LITERATURE

In this section, the existing scientific work related to the research questions is discussed.

2.1 Techniques for project selection

Existing literature was reviewed to find out how the task of selecting projects is currently performed, to establish what the limitations of current practice are, and how Machine Learning can overcome them. Three techniques are discussed: subjective techniques, because the literature suggests that managers often rely on opinion-based decision making; Earned Value Management, because it is one of the most used techniques for predicting project duration and cost; and Decision Support Systems, because the literature review revealed that many scholars propose such systems for the task of project selection.

2.1.1 Subjective Techniques

Literature on project management revealed that many processes rely on subjective techniques [23], which involve the use of human judgment. As a result, many organizations rely on the competencies of their employees for project selection [42].

Project selection involves making decisions such as go, hold or cancel [37]. Decision making is a cognitive process which can be the result of expertise or intuition. Expert judgment can be provided by a group or an individual. This expertise can originate from experience, education, knowledge or training. Individuals may also use their intuition or "gut feeling" for making decisions.

The advantage of subjective techniques is that these methods can also be deployed when limited data is available. Disadvantages are that human opinions can be biased (overoptimistic) or inaccurate and the procedure may cost valuable time [36, 40].

2.1.2 Earned Value Management

Earned Value Management (EVM) is a widely used technique for predicting project outcomes. It is based on the following key parameters:

• Budget at Completion (BAC): the total budget for all activities
• Planned Value (PV): the budget baseline at any point in time, derived from the original project planning schedule
• Actual Cost (AC): the actual cost of the work that has been performed
• Earned Value (EV): the original budget of the work that has been performed

These parameters can be used to monitor project performance as well as forecast final project cost and duration. By comparing the EV to the AC, cost variance can be determined. For schedule variance, EV and PV can be compared. This allows for monitoring the project progress in terms of duration and cost. See figure 1 for a graph showing EVM curves.

Figure 1: EVM curves
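As a minimal illustration of the arithmetic behind these comparisons, the sketch below computes the cost and schedule variances from hypothetical figures (the numbers and the EAC formula used here are illustrative, not taken from the thesis data):

```python
# Minimal sketch of the basic EVM arithmetic described above,
# using hypothetical figures (not taken from the project data).
bac = 100_000.0   # Budget at Completion
pv = 40_000.0     # Planned Value at the reporting date
ev = 35_000.0     # Earned Value of the work actually performed
ac = 45_000.0     # Actual Cost of the work actually performed

cost_variance = ev - ac        # negative: over budget
schedule_variance = ev - pv    # negative: behind schedule
cpi = ev / ac                  # Cost Performance Index
spi = ev / pv                  # Schedule Performance Index
eac = bac / cpi                # one common Estimate At Completion formula

print(cost_variance, schedule_variance, round(cpi, 2), round(spi, 2), round(eac))
```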

While EVM is based on simple arithmetic operations, it has proven to be effective on many types of projects [17]. Its usefulness is illustrated by many studies [2, 14, 23, 32] and EVM has become an important component of successful project management [33].

Despite its popularity, EVM comes with some disadvantages and limitations. It requires defining the complete work scope of a project, which is difficult, especially for IT projects [17]. Agile project management, which is focused on projects that change over time, has gained popularity [27]; this makes defining a fixed baseline obsolete. Furthermore, some scholars found that EVM is hard to use, unnecessary and costly [14, 45].

2.1.3 Decision Support Systems

Decision Support Systems (DSS) are used by managers to access current and historical operational data. This can provide them with insights which can be used to support decision making.

Power [38] defines Decision Support Systems as follows: "An interactive computer-based system or sub-system intended to help decision makers use communications technologies, data, documents, knowledge, and/or models to identify and solve problems, complete decision process tasks, and make decisions. Decision Support System is a general term for any computer application that enhances a person or group's ability to make decisions."

Data that can be accessed using these systems can include performance metrics, customer activities or organization processes. Many scholars have proposed Decision Support Systems for the task of project selection, e.g. [19, 21, 31, 42].

Results suggested that such tools can be useful in real-life situations, especially when limited human expertise is available [19]. Furthermore, research has shown that the use of DSS is correlated with higher profitability and productivity in firms [10]. For these reasons, Decision Support Systems can be considered an important technique for project selection.

2.2 Machine Learning algorithms applied to project predictions

In this section, an overview is given of the Machine Learning techniques that have been applied to project predictions, as found in the literature. To start, a definition of Machine Learning is given. Robert [39] defined Machine Learning as follows:

"A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty."

The field of Machine Learning is emerging and has been applied to many areas, including Project Management [34].

Several Machine Learning algorithms that have been applied in the literature were found. The techniques can be separated into two categories: traditional and ensemble. The traditional techniques are well-established algorithms which use a single learning method, while ensemble techniques use combinations of learning methods [28]. Ensemble learning methods are the current state-of-the-art Machine Learning methods and often perform better on complex, high-dimensional or imbalanced datasets [26, 50].


2.2.1 Traditional methods

The following three traditional (non-ensemble) methods are discussed: Nearest Neighbors, Decision Trees and Support Vector Machines.

Nearest Neighbors. Several papers concerned with estimating project success make use of the Nearest Neighbor algorithm [13, 34, 48]. A nearest neighbor algorithm calculates the distance between a given data point and a number (k) of instances in the training set, usually using the Euclidean distance metric. The number of nearest neighbors to use can be defined a priori. The idea behind this approach is that the algorithm can find the top-n most similar neighbors of a project. The performance of neighboring projects is expected to be similar, so the outcomes of these projects can be used as a predictor. This is comparable to a real-life situation where a project manager uses his past experience to predict project outcomes.

Decision Trees. Decision trees are a supervised learning method. They cover both classification and regression and can also be visually represented [9]. Each node in the tree represents an input feature and the arcs coming from the nodes are the possible values. The leaves of the tree are the outcome classes (classification tree) or values (regression tree). Decision tree models are relatively simple and easy to understand. They are able to mirror human decision making more accurately than other methods [24], which could be a benefit since the goal is to mirror the decisions of a project manager.

Support Vector Machines. A Support Vector Machine (SVM) is an algorithm that maps the training instances into a high-dimensional space such that the instances from the different classes are as divided as possible. This makes the SVM a binary linear classifier. Test samples are mapped into the trained space and then assigned to a class. If the relation between the predictors and the outcomes is assumed to be non-linear, a kernel function can be employed. The reader is referred to [41] for an extensive explanation of SVMs and kernel usage.
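The sketch below shows how these three traditional learners could be instantiated with scikit-learn. It is an illustration only, not the implementation from Appendix F: the synthetic data, feature count and parameter values are assumptions chosen to mimic an imbalanced problem.

```python
# Illustrative sketch (not the implementation from Appendix F): the three
# traditional learners applied to a synthetic, imbalanced binary problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the project data: 10 features, ~10% positive class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    # probability=True enables predict_proba; the RBF kernel captures
    # non-linear relations between predictors and outcome.
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```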

2.2.2 Ensemble Methods

In this section, several ensemble learning methods are discussed. Bootstrap Aggregating involves training multiple learners in a parallel manner and combining them, while Boosting is a technique that trains learners sequentially, focusing more on harder instances at each step. Three boosting algorithms are discussed: Gradient Boosting, Adaptive Boosting and XGBoost.

Bootstrap Aggregating. Bootstrap Aggregating - often referred to as "bagging" - was introduced by Breiman [8] and was designed to improve the accuracy of Machine Learning algorithms such as decision trees. Decision trees are susceptible to high variance, because a minor change in the training data could change the structure of the entire tree. Bagging solves this problem by training a number of trees on bootstrapped samples of the training data. The final prediction is the average of the predictions of the individual trees. See figure 2 for a schematic overview of the algorithm.

Figure 2: Schematic view of Bootstrap Aggregating

Gradient Tree Boosting. Decision trees can also be used in combination with a boosting function. Gradient Boosting is an iterative process, which starts with a weak model and strengthens it at every iteration. The loss function is minimized, with the goal of reducing test loss to a minimum. Boosting focuses on the residuals: at every step a new model is constructed which is better fitted to the residuals of the previous step. This approach focuses on instances which are hard to predict [18].

Adaptive Boosting. Adaptive Boosting is another boosting technique that starts with a weak decision tree and tries to improve it at every iteration. The difference with Gradient Boosting is that Adaptive Boosting focuses on the weights of the instances: wrongly classified instances gain a higher weight and correctly classified instances gain a lower weight. As a result, the learner is more focused on difficult instances.

XGBoost. Extreme Gradient Boosting (XGBoost) is an optimized implementation of tree boosting. The algorithm is focused on maximal computational speed and performance; features include CPU parallelization, cache optimization and data compression. XGBoost is much more computationally efficient, which results in lower execution time, and its results are often more accurate than those of other boosting algorithms. For more information, the reader is referred to [12]. Because of these apparent advantages over other boosting implementations, XGBoost is also utilized in this study.
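As with the traditional methods, the four ensemble learners can be instantiated in a few lines. The sketch below is again illustrative, reusing the same synthetic data as the previous snippet rather than the thesis dataset; parameter values are assumptions.

```python
# Illustrative sketch (not the thesis code): the four ensemble learners
# on the same kind of synthetic, imbalanced data as the previous snippet.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

ensembles = {
    # Bagging: many trees trained on bootstrapped samples, predictions averaged.
    "Bagging": BaggingClassifier(n_estimators=100, random_state=0),
    # Boosting: trees added sequentially, each focused on what the previous
    # ones got wrong (residuals or re-weighted instances).
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "XGBoost": XGBClassifier(n_estimators=100, random_state=0),
}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```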

2.3 Machine Learning applied to Project Management

In the previous section, the Machine Learning algorithms found in the literature have been described. In this section, literature that applied these algorithms for Project Management is reviewed to provide a background for this study.

Abe et al. [1] used a Naive Bayes classifier to estimate project success of IT projects. Data of 28 projects was used and contained 29 metrics ranging from the skill and experience of the developers to the amount of documentation. Project success was seen from three viewpoints: Quality, Cost and Duration. The classifier achieved higher accuracies than experts in predicting the success of projects in terms of these variables.

Costantino [15] proposed a system for predicting project success in the selection phase. An Artificial Neural Network was deployed to classify the degree of success of projects. The dataset contained 150 samples. The model used assessments of completed projects as input; these assessments were based on questionnaires of Critical Success Factors (CSFs). The goal of the system was to predict project success using a given set of CSFs. This output was compared to a degree of success given by experts to compare performance. Performance was satisfying, as the accuracy of the model was around 90%.

Cheng et al. [13] propose a system that estimates project success in the form of a value between 0 and 1. For this research, assessments of projects were used as input. The data contained 46 samples. The researchers suggest a hybrid model in which Support Vector Machines are used in combination with a Fast Messy Genetic Algorithm (fmGA). In addition, k-means clustering was applied to select projects with similar characteristics, which reduced the RMSE significantly. Results suggested that the proposed system can predict project success with a significant level of accuracy.

Wauters and Vanhoucke [46] employed a Support Vector Machine (SVM) for estimating the total cost and duration of projects. The input for the SVM was periodic EVM data that was generated using Monte Carlo simulations. The attributes consisted of a number of EVM metrics. The performance was better than that of the traditional EVM methods, with an average improvement of 1.42%. A limitation of this research is that the methods have not been evaluated using real-life project data; instead, simulated data was used. The results showed that the methods produced satisfying results when the test set was similar to the training set, but performance was rather poor when there was a difference between the test and training set.

Wauters and Vanhoucke [47] compared the performance of several Machine Learning methods for forecasting project duration: Decision Trees, Bagging, Random Forests, Boosting and Support Vector Machines. All these methods predict the final duration of a project with a higher accuracy and a lower standard deviation than the available EVM methods, with SVM and Boosting producing the best results. It should be noted that this research was also carried out using simulated project data. The research was later extended with a k-nearest neighbors algorithm to select more similar projects for the training set [48]; using this technique improved the results.

Berlin et al. [7] compared several methods for estimating cost and duration for IT projects. This is one of the few studies that used a large database of real-life projects (> 500) for this task. A traditional regression was compared to a neural network, using Size, Productivity, Complexity, Duration and Effort as input variables. The results contradict Wauters and Vanhoucke [46, 47], since regression performed better than neural networks. However, a simple neural network was used; the previously described studies were hybrid approaches with more advanced neural networks, which could explain the difference.

Wen et al. [49] reviewed over 80 studies which applied Machine Learning to predict the effort of software development projects. Their conclusion was that Machine Learning models provide acceptable results and overall perform better than non-ML models. Case-based reasoning, artificial neural networks and decision trees are used the most, and neural networks have the lowest mean error. The distribution of the studies and their methods is displayed in figure 3.

Figure 3: Machine Learning techniques that were used in previous studies for predicting project outcomes.

Concluding from the reviewed papers, Machine Learning approaches seem promising in the field of making project predictions. Results indicate that Machine Learning is capable of making accurate predictions; however, the conclusions should not be generalized since the datasets are by no means sizable, diverse or representative. Research is still in its early stages, and while the proposed systems can help managers in making decisions, they should not replace traditional methods and human expertise but should instead be used as complementary tools.

3 METHODOLOGY

In this section, the methods used for the research are discussed. First, the data and the necessary preprocessing are described. Thereafter, several Machine Learning techniques are tuned and implemented so that their performance can be compared. The algorithms are written in Python; the code is included in Appendix F.

3.1 Description of the data

For this study, project data from a large professional service firm was collected. This data is extracted from the Enterprise Resource Planning systems that are used to capture operational data in the firm.

The attributes in the raw data and their meaning are outlined in Appendix A.1. See table 1 for the statistics of the numerical variables. Exploratory plots can be viewed in Appendix B.

The data ranges from January 2013 to March 2018. For this thesis, only projects that were closed are used, since the outcomes of ongoing projects are not known. According to the data, 10,349 projects were finished in this period.

A distinction is made between projects with a positive net revenue and projects with a negative revenue. The Total Net Revenue column is converted to binary values, with a 0 for every project with a positive revenue (> 0) and a 1 for projects with a negative or zero revenue (<= 0). By doing this, the projects are divided into two classes: profitable and unprofitable.
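A minimal pandas sketch of this labelling step is shown below; the file name is hypothetical, while the column name and the labelling rule follow the text.

```python
import pandas as pd

# Hypothetical file name; one row per (closed) project with a
# "Total Net Revenue" column, as described in the text.
projects = pd.read_csv("projects_aggregated.csv")

# 1 = unprofitable (revenue <= 0), 0 = profitable (revenue > 0).
projects["label"] = (projects["Total Net Revenue"] <= 0).astype(int)
print(projects["label"].value_counts())
```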


In total, 794 projects have a revenue of zero or less and are thus labeled as "Not profitable" (value 1). The remaining 9,555 projects have a positive net revenue and are thus labeled with a 0.

Considering the data to be used as input, a selection of the attributes is made; see the underlined attributes in Appendix table A.1. Engagement ID, Name and dates are not used, since these values are not expected to have any predictive value. Line of Business is skipped because it is an aggregated version of the Industry Sector field.

The Client and Partner columns have many unique values; to increase efficiency, these are converted into counts rather than used directly. This was achieved by looping over all projects in the data and counting the number of occurrences of the client ID/partner ID in previous engagements (based on the closing dates). The Total hours column is not used, because the model should be applicable to projects in the selection phase, which do not have any hours yet.

Table 1: Project Data attribute statistics

Variable                     | Mean  | Median | Std. dev
Number of employees          | 6.76  | 5.0    | 6.05
Client previous engagements  | 7.10  | 1      | 14.07
Partner previous engagements | 61.06 | 38.0   | 67.13

3.2 Machine Learning Task

In order to deploy Machine Learning techniques on the dataset, a clear and suitable task must be defined. The goal of Machine Learning is to learn a function that best maps the input variables (X) to the output variable (y).

In this case, the goal is to correctly predict unprofitable projects, because this provides useful information for the firm when considering projects. As described before, these projects are labeled with a 1 in the binarized Total Net Revenue column, and the profitable projects are labeled with a 0. The values of this column are the y variable of the model. Since these values are binary, the task of the model is binary classification. The input (X) variables are the underlined columns in table A.2.

3.3 Preprocessing

The raw data was preprocessed in order to drop redundant and unnecessary attributes and missing values, and to store it in a format that can be used as input for Machine Learning tools. See Appendix F.1 for the Python code used to achieve this.

The data was delivered as a spreadsheet containing a row for every employee-project link. The columns of this spreadsheet are outlined in Appendix table A.1 and are the same for every row in the data. The bold attributes are the same for a project, and are thus repeated for every employee that worked on that project; the other attributes are linked to an employee.

This data was aggregated to project-level rows. A set of all the engagements in the Engagement column was made, and for every engagement in the set, the (aggregated) values were obtained using the code in Appendix F.1. Clients were delivered as a separate file and needed to be joined.

Previous engagements of partners and clients are calculated. Partner previous engagements is the total number of engagements that the partner has done before an engagement; Client previous engagements is the number of engagements that the same client has had before an engagement. These values are calculated by looping over the projects again and counting the number of occurrences of the Client or Partner ID in projects with an earlier close date.
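A vectorized alternative to the loop described above is sketched below with pandas; the file and column names are assumptions based on the description, and the actual code in Appendix F.1 may differ.

```python
import pandas as pd

# Hypothetical file and column names, following the description above.
projects = pd.read_csv("projects_aggregated.csv",
                       parse_dates=["Engagement close date"])
projects = projects.sort_values("Engagement close date")

# After sorting by close date, cumcount() gives, for each row, the number
# of earlier rows with the same client / partner ID, i.e. the number of
# previous engagements at the time the project was closed.
projects["Client previous engagements"] = projects.groupby("Client ID").cumcount()
projects["Partner previous engagements"] = \
    projects.groupby("Engagement Partner").cumcount()
```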

This preprocessing resulted in aggregated project data with the columns described in table A.2. The data contains categorical variables, which most Machine Learning algorithms cannot handle, so these values must be converted into numerical values. This is achieved by using dummy variables, which involves adding a new variable for every category and labeling it with 0 or 1 to indicate its absence or presence. The result was a DataFrame with 734 columns, which was used as input.

For the implementation of the Machine Learning algorithms, the data is split into 75% training data and 25% test data. Before splitting, the data is shuffled so that there is no order in the data; this is applied to reduce the risk of overfitting. Furthermore, the split is stratified so that the proportions of the two classes in the train and test data are kept equal. The data contains binary columns as well as columns with continuous values. To scale all the columns to an equal norm, the data was normalized to numeric values between 0 and 1.
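The encoding, splitting and scaling steps described above could look roughly as follows; the file and column names are assumptions, while the 75/25 split, shuffling, stratification and [0, 1] scaling follow the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

projects = pd.read_csv("projects_aggregated.csv")       # hypothetical file
y = (projects["Total Net Revenue"] <= 0).astype(int)    # 1 = unprofitable
X = projects.drop(columns=["Total Net Revenue"])

# One dummy column per category, labeled 0/1 for absence/presence.
X = pd.get_dummies(X)

# Shuffled, stratified 75/25 split followed by scaling to [0, 1].
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=0)
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```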

According to the data, unprofitable projects occur 7.7% of the time, so the data is quite imbalanced. Machine Learning algorithms often perform worse on imbalanced data [5]. Oversampling is a technique that can be used to increase the number of instances in minority classes, thus simulating a more balanced dataset. Oversampling the training data can help to improve the performance of the models. Therefore, the resampling methods SMOTE, ADASYN and Random Oversampling are applied and compared. Random oversampling is the simplest oversampling method and involves randomly replicating instances of the minority class. Synthetic Minority Over-sampling Technique (SMOTE) does not replicate instances but instead generates new synthetic instances in order to reduce overfitting [11]. Adaptive Synthetic Sampling (ADASYN) is another oversampling technique which focuses on harder instances: a k-nearest neighbor algorithm is deployed and the ADASYN oversampler focuses on generating samples next to the ones that were wrongly classified [20].

The oversampling methods were applied using the Python package imbalanced-learn. The final result consisted of four train datasets, of which three were oversampled.
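A sketch of applying the three oversamplers with imbalanced-learn is shown below; it continues the hypothetical X_train / y_train from the preprocessing sketch above and is not the thesis code.

```python
# Only the training split is resampled; the test set keeps the original
# class distribution. X_train / y_train as produced by the split above.
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

oversamplers = {
    "SMOTE": SMOTE(random_state=0),
    "ADASYN": ADASYN(random_state=0),
    "Random": RandomOverSampler(random_state=0),
}
resampled = {name: sampler.fit_resample(X_train, y_train)
             for name, sampler in oversamplers.items()}
# Together with the original split this yields four training datasets.
```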

3.4 Machine Learning Algorithms

As described before, the task that needs to be performed is binary classification; thus, the Machine Learning techniques that will be applied need to be suitable for this task. In chapter 2, several ensemble and non-ensemble techniques were described. In this section, the process of applying these techniques is outlined.

In total, 7 algorithms are applied, of which 4 are ensemble techniques and 3 are non-ensemble:

Non-ensemble techniques
• Decision Trees
• K-Nearest Neighbors
• Support Vector Machine

Ensemble techniques
• Bootstrap Aggregating
• Gradient Tree Boosting
• Adaptive Boosting
• Extreme Gradient Boosting

Using Python in combination with the pandas, sklearn, xgboost, imbalanced-learn and numpy modules, the algorithms are employed. See Appendix F.4 for the code.

3.4.1 Hyperparameter Tuning

Most algorithms have parameters which can be set before the training process begins. These parameters vary per algorithm, e.g. the depth of a Decision Tree or the number of neighbors to use in a K-NN classifier. The optimal parameters are not known a priori and thus need to be found using the training data.

A method for finding the optimal parameters is to perform a Grid Search. This involves defining a range of possible values for the parameters; the Grid Search then runs an exhaustive search over all possible combinations of the parameters and calculates a score for every run, which can be a metric of choice. For this Grid Search, the average AUC value of a 5-fold Cross-Validation was chosen as the metric: the Grid Search looks for the settings that result in the highest possible mean AUC value over the 5 folds. This metric is further explained in the results section. Cross-Validation involves splitting the training data into subsets and using one of these subsets for validation; by taking the average over these validations, the variance is reduced. By using stratified folds, the proportions of the classes are kept equal for every fold.
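Such a search could look as follows for one of the classifiers; the parameter grid here is hypothetical (the grids actually used are in Appendix E), while the stratified 5-fold setup and the AUC scoring follow the text.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical grid for one classifier; the grids actually used are
# summarized in Appendix E.
param_grid = {"n_estimators": [50, 100, 200],
              "max_samples": [0.5, 0.75, 1.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(BaggingClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)   # training split from the preprocessing step
print(search.best_params_, search.best_score_)
```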

Using this approach, a separate validation set is not needed, so the loss of training data is minimal. The best parameter settings that were found are summarized in Appendix E. An outline of the whole Machine Learning pipeline is displayed in figure 4.

4 RESULTS

In this section, the results of the applied methods are reported. Each model was trained on the normal training dataset as well as on the three oversampled training datasets, resulting in four variants per model to be evaluated. Several evaluation metrics are used, such as the Receiver Operating Characteristic (4.1) and Precision/Recall (4.2), along with the corresponding curves.

4.1 Receiver Operating Characteristic

The first metric chosen to evaluate the performance of the classifiers is the Receiver Operating Characteristic (ROC). An ROC curve is a plot which visualizes the performance of a binary classifier.

The applied classifiers output scores rather than hard class labels: they do not output the predicted class as a binary value but instead output the probability that an instance belongs to the "positive" class (in this case, unprofitable projects). A threshold setting specifies at what probability an instance should be assigned to the positive class. By default, most classifiers use a threshold of 0.5, but other thresholds can be chosen as well. Lower thresholds mean that more samples are classified as "positive", while higher thresholds mean that fewer samples are classified as "positive". Hence, the threshold setting has an impact on the outcomes of the classifications.

Figure 4: Overview of the Machine Learning Pipeline

There are four possible outcomes of a classification in this context:
• True Positive (TP): correct classification of an unprofitable project
• False Positive (FP): a profitable project that was incorrectly classified as unprofitable
• True Negative (TN): correct classification of a profitable project
• False Negative (FN): an unprofitable project that was incorrectly classified as profitable

Based on these outcomes, a True Positive Rate (TPR) and a False Positive Rate (FPR) can be calculated.

• True Positive Rate (TPR): proportion of unprofitable projects (label 1) that have been correctly identified as such
  TPR = TP / (TP + FN)
• False Positive Rate (FPR): proportion of profitable projects (label 0) that have been incorrectly identified as unprofitable
  FPR = FP / (FP + TN)

An ROC curve allows for visualizing the performance of classifiers for different thresholds. It is a figure in which the True Positive Rate is plotted against the False Positive Rate for various thresholds. Thus, an ROC curve essentially measures the ability of the classifier to distinguish between the two classes.

The ROC curves of the different algorithms are displayed in figure 5. The models in this figure were trained without oversampling techniques. The ROC curves of the models that were trained using oversampling techniques can be found in Appendix C.

The diagonal line in this figure is equal to the performance of a random classifier. Points above this dotted line mean that the results are better than random, while points below the line mean that the results are worse than random. A perfect classifier would have a point at the upper left corner, or coordinate (0,1).
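Such a curve and its AUC value can be obtained directly from the predicted probabilities, for example as in the sketch below; it assumes a fitted classifier (here called model) and the test split from the earlier preprocessing sketch, so the names are illustrative rather than taken from the thesis code.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# model: any fitted classifier from the earlier sketches; y_score is the
# predicted probability of the positive (unprofitable) class.
y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # random-classifier diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()
```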

Figure 5: ROC curves for the implemented algorithms (not using resampled data)

4.1.1 Area Under the Curve

While ROC curves give a visual indication of the performance of a classifier, the corresponding Area Under the Curve (AUC) value can also be used as a numeric metric. An AUC value of 0.5 means that there is no discriminative ability at all (random guessing) and a value of 1.0 denotes a perfect classifier. The actual AUC values of the classifiers in combination with the various oversampling techniques are outlined in table 2.

The traditional academic point system provides a good guideline for interpreting an AUC value [44]:
• 0.90-1.00 = excellent (A)
• 0.80-0.90 = good (B)
• 0.70-0.80 = fair (C)
• 0.60-0.70 = poor (D)
• 0.50-0.60 = fail (F)

Table 2: Area Under Curve values for the implemented algorithms

Oversampling method    | None | SMOTE | ADASYN | Random
Non-Ensemble           |      |       |        |
Decision Tree          | 0.76 | 0.80  | 0.77   | 0.82
Nearest neighbors      | 0.72 | 0.73  | 0.67   | 0.72
Support Vector Machine | 0.74 | 0.74  | 0.67   | 0.73
Ensemble               |      |       |        |
Bagging                | 0.86 | 0.86  | 0.84   | 0.87
Gradient Boosting      | 0.83 | 0.82  | 0.82   | 0.82
Adaptive Boosting      | 0.86 | 0.85  | 0.84   | 0.86
XGBoost                | 0.87 | 0.85  | 0.84   | 0.86

Bold = best result

As can be seen in figure 5 and table 2, the non-ensemble methods show fair performance and the ensemble methods perform well. Differences between the algorithms are only marginal and could be the result of variance, since the training process of the models is subject to a degree of randomness.

With regard to the oversampling techniques, it can be seen that the ROC curves and AUC values of the oversampled models barely differ from those of the models that were not oversampled. The non-ensemble methods seem to benefit slightly from oversampling, in contrast to the ensemble methods. Again, the differences are only marginal, so firm conclusions cannot be drawn.

Table 3: Average Precision of the applied algorithms

Oversampling method    | None | SMOTE | ADASYN | Random
Non-Ensemble           |      |       |        |
Decision Tree          | 0.33 | 0.27  | 0.21   | 0.34
Nearest neighbors      | 0.24 | 0.25  | 0.20   | 0.25
Support Vector Machine | 0.22 | 0.24  | 0.16   | 0.22
Ensemble               |      |       |        |
Bagging                | 0.46 | 0.45  | 0.44   | 0.44
Gradient Boosting      | 0.38 | 0.40  | 0.36   | 0.39
Adaptive Boosting      | 0.36 | 0.36  | 0.36   | 0.37
XGBoost                | 0.44 | 0.44  | 0.42   | 0.41

Bold = best result

4.2 Precision and Recall

To get the full picture of the performance of the different classifiers, Precision and Recall are also included in the evaluation. These metrics can be calculated as follows:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

Recall is the same as the True Positive Rate that was also included in the ROC. It provides an answer to the question: what proportion of unprofitable projects was identified correctly?

Precision is the proportion of positive classifications that were actually correct; it thus provides an answer to the question: of the projects that were classified as unprofitable, how many are actually unprofitable?

These metrics can be plotted in a Precision-Recall curve, which visualizes the precision and recall for various threshold settings of the classifier. The Precision-Recall curves of the applied algorithms (without oversampling) can be found in figure 6; Precision-Recall curves for the oversampled models are included in Appendix D. As can be seen, these curves are wide-ranging. As higher precision/recall values are better, a perfect classifier would have a point at the top right (1,1) coordinate. Again, it can be seen that the ensemble methods outperform the non-ensemble methods.

Figure 6: Precision-Recall curves for the implemented algorithms (not using resampled data)

The Average Precision (AP) value is a metric which can be used to summarize this curve: it is the area under the precision-recall curve. As can be seen in table 3, the bagging classifier has the highest AP in all cases.

Any point on the Precision-Recall curve can be selected for determining the desired precision and recall of the classifier. This means that a trade-off must be made between precision and recall.

Higher precision means lower recall, and vice versa. The F1 score is a metric that can be used to select a balanced threshold. F1 is the harmonic mean of precision and recall and can be calculated as follows:

F1 = 2 · (precision · recall) / (precision + recall)

The threshold with the highest F1 score is thus the optimal balance between precision and recall. See also Appendix F.5 for the calculation. Using this metric, thresholds are selected for the different classifiers and the results are outlined in table 4. In this table, it can be seen that Bootstrap Aggregating has the best overall performance with no oversampling.

From the results it became clear that the ensemble methods are most suited for this task, with Bootstrap Aggregating yielding the best results. Oversampling does not seem to be beneficial.

5 EVALUATION

5.1 RQ1

What are the existing techniques used to predict project outcomes? The current techniques that are used for predicting project profitability were uncovered by going through background literature. It was found that human expertise is still a widely used method for predicting project outcomes [49]. This is understandable, since human forecasting is also possible when little data is available and it takes into account the wealth of human experience [36]. However, humans tend to be biased and overoptimistic [36, 40].

A well-established technique for predicting project duration and cost is Earned Value Management. This is the most used method for project performance measurement [22]. It can be used in any type of project and its effectiveness has been demonstrated empirically.

Data-driven forecasting is another technique which does not have the disadvantages of subjective forecasting, but does require historical data. Such data is often available when companies use ERP systems, since these systems capture vast amounts of operational data. Firms have started using Data-Driven Support Systems to access operational data. Research has shown that the use of these systems is correlated with higher profitability and productivity in firms [10].

It is beyond the scope of this study to look into all of the many available tools for predicting project outcomes, but the literature review uncovered three important techniques, namely subjective techniques and parametric tools (Earned Value Management and Decision Support Systems), which all come with several limitations.

Table 4: Classifier scores

Oversampling technique | None           | SMOTE          | ADASYN         | Random
Metric                 | P    R    F1   | P    R    F1   | P    R    F1   | P    R    F1
Non-Ensemble           |                |                |                |
Decision Tree          | 0.53 0.37 0.44 | 0.36 0.47 0.40 | 0.21 0.24 0.24 | 0.38 0.53 0.45
Nearest neighbors      | 0.24 0.40 0.30 | 0.23 0.49 0.32 | 0.20 0.32 0.25 | 0.30 0.31 0.30
Support Vector Machine | 0.29 0.40 0.34 | 0.29 0.41 0.34 | 0.16 0.47 0.24 | 0.26 0.42 0.32
Ensemble               |                |                |                |
Bagging                | 0.50 0.51 0.51 | 0.46 0.52 0.49 | 0.47 0.51 0.49 | 0.41 0.62 0.49
Gradient Boosting      | 0.37 0.60 0.46 | 0.35 0.60 0.44 | 0.34 0.55 0.42 | 0.35 0.59 0.44
Adaptive Boosting      | 0.39 0.52 0.44 | 0.40 0.45 0.42 | 0.43 0.41 0.42 | 0.46 0.38 0.42
XGBoost                | 0.46 0.52 0.49 | 0.40 0.56 0.47 | 0.42 0.60 0.49 | 0.40 0.56 0.47

Bold = best result

5.2 RQ2

How can Machine Learning algorithms be deployed on company data in order to distinguish profitable from unprofitable projects?

For this study, a pipeline was built for processing company (ERP) data and applying Machine Learning algorithms.

Preprocessing the data was necessary, since the data needed to be aggregated to project-level data. Categorical variables were encoded as numerical variables and missing values were removed. 75% of the data was selected as training data. Several implementations of Ensemble and Non-Ensemble learning methods were imported using Python. Using the training data and a Grid Search with 5-fold stratified cross-validation, these algorithms were tuned, which was necessary in order to obtain optimal results.

In order to distinguish the profitable projects from the unprofitable projects, the data was labeled with a binary value: 0 for profitable projects and 1 for unprofitable projects. The algorithms produce a probability that an instance belongs to the positive (unprofitable) class. Thus, for actual classification, setting a threshold for discriminating between the classes is necessary. For this study, a balanced threshold based on the F1 score was chosen.

While a balanced threshold can be appropriate, other thresholds can be chosen as well, depending on what the project manager is interested in. A lower threshold results in classifying more projects as unprofitable, hence a higher recall; a higher threshold results in classifying more projects as profitable, and thus a lower recall. A well-adjusted threshold must be set and should be determined by consulting project managers.

As a result, the answer to this research question is that a well-constructed pipeline is necessary for the deployment of Machine Learning algorithms on company data. Such data is often quite complex, so preprocessing is crucial. Furthermore, good tuning and threshold selection are important. The approach proposed in this study can be used for distinguishing profitable projects from unprofitable projects using company data as input. More work is necessary to improve this approach and turn it into a system that can be used by project managers.

5.3 RQ3

How well do the Machine Learning algorithms distinguish profitable from unprofitable projects?

The applied Machine Learning algorithms were evaluated using various metrics. First, the Receiver Operating Characteristic curves and the corresponding Area Under the Curve values were examined. The ROC curves showed quite satisfying results, with all classifiers performing much better than a random classifier. The ensemble methods outperformed the non-ensemble methods, with Bootstrap Aggregating and Boosting showing AUC values up to 0.87, which implies good performance. The cause of this could lie in the characteristics of the data: it is very high-dimensional and imbalanced. Traditional methods try to reduce the error as much as possible but do not take these factors into account. Ensemble learning combines multiple models such that the variance of the final model is decreased and it is better suited to the data, hence the better performance.

In addition to the ROC, precision and recall were measured. The precision-recall curves turned out to be less satisfying than the ROC curves. This could be due to the fact that the ROC is known to be fairly unaffected by imbalanced data [25]. Three oversampling methods were tested in order to determine whether applying these techniques can overcome the impairments of an imbalanced dataset; applying them did not result in better performance.

The Bootstrap Aggregating classifier showed the best precision-recall curve, with an average precision of 0.51. To measure actual precision and recall, choosing a threshold was necessary. For this research, a balanced threshold was chosen based on the F1 metric. Since the F1 score is the harmonic mean of precision and recall, the highest possible F1 score forms a balanced classifier. Bootstrap Aggregating yielded the highest F1 score of 0.51, with a precision of 0.51 and a recall of 0.50. This means that the algorithm can find approximately half of all the unprofitable projects, and of all projects that were classified as unprofitable, approximately half are correctly identified as such. These results are by no means optimal, but they do provide proof that Machine Learning can at least modestly predict the profitability of projects.

As such, the answer to this research question is that the best ML models have a moderate ability of predicting project profitability.

6 CONCLUSION AND DISCUSSION

After providing answers to the subquestions, the main research question can be addressed: To what extent can Machine Learning algorithms be deployed to distinguish profitable projects from unprofitable projects at the start of their life cycle in a Professional Service Firm?

Going through the existing literature revealed that many techniques exist for the task of project prediction. Three techniques were reviewed: subjective techniques, Earned Value Management and Data-Driven Support Systems. While these three techniques are well-grounded in the literature, no qualitative research was conducted. Interviews with project managers could help to get a better picture of the methods that are used in practice for the task of project selection.

The traditional methods come with considerable limitations, and research has proposed Machine Learning tools in an attempt to overcome these limitations. Most scholars report promising results with the proposed Machine Learning tools.

Based on the background literature, a new approach is proposed in this thesis which is based on ERP data. Such data is easy to gather since it is recorded in ERP systems and can be exported.

ERP data from a large Professional Service Firm was gathered. With more than 10,000 projects, the dataset is sizable. The number of features, however, was very limited, with only 10 features of which 6 were categorical.

The results of the evaluation are twofold: the Receiver Operating Characteristic showed satisfying curves and AUC values, while the Precision and Recall curves were moderate, with a highest F1 value of 0.51 for the Bootstrap Aggregating algorithm.

When the algorithms are deployed to separate projects into profit or loss classes, a separation threshold needs to be chosen. For this research, a balanced threshold was suggested based on the F1 score. However, threshold selection is subject to preference. Project managers should determine what the algorithm should focus on: reducing false negatives or reducing false positives. For example, how much money is missed when a profitable project is rejected because of a false positive, versus the cost of carrying out an unprofitable project because of a false negative. By determining these costs, a threshold can be chosen that is better adjusted to a specific use-case.

The work shows that it is to some extent possible to distinguish profitable projects from unprofitable projects using ERP data from a professional service firm. Even with very limited data, moderate precision and recall were obtained. This indicates that Machine Learning has good potential and could possibly be improved with more input data. Furthermore, projects are generally complex and involve a degree of uncertainty; external factors cannot always be predicted, so perfect scores would be unreachable. The firm in which the approach was tested might already have good risk management, and its project selection could already be very efficient. The approach should be tested in other firms as well to determine its usefulness. Since many firms use similar ERP systems, it should be possible to gather comparable data and replicate the research.

With this in mind, the predictions could have some value for the firm in question but since accuracy cannot be guaranteed, the methods should not replace traditional tools. As a complementary tool however, the Machine Learning approach could be beneficial. This is also confirmed by other studies [34].

Looking at the training process of the proposed methods, some remarks must be made. As the models were trained on historical ERP data from the past 5 years, it should not be presumed that all data has equal predictive value. Projects take place in an environment that is subject to change, internally as well as externally. The company itself changes over time, departments are restructured and policies change. External factors such as the economy, politics and law can also change over time, and all these changes can have an influence on project outcomes. As a consequence, a project from 5 years ago might no longer be representative of current project outcomes. This was not taken into consideration in the development of this approach. Future work could improve the methods by giving older projects lower weights in the model or by including external factors.

For this study, profitability was used as the selection criterion for projects. While this makes sense because the firm wants to make as much profit as possible, other criteria should also be considered, such as resource limitations, project interdependencies, strategy, risk and effort. Sometimes it is known that a project is unprofitable, but it is still performed for strategic reasons: successfully completing a project may lead to more business, which means more profit in the future at the cost of a small loss.

To conclude, the proposed methods introduce a new perspective towards the application of Machine Learning on predicting project outcomes. Results are moderate, but by no means disappointing since the models are based on limited data. Further research is necessary to overcome the limitations that were encountered.

6.1 Acknowledgements

I would like to thank the people of KPMG for offering me an internship to write this thesis and providing a pleasant work environment. Special thanks to Bart Vergeer for helping me in the process of writing this thesis; your ideas and feedback were incredibly helpful. Thanks also to dr. Maarten Marx for his guidance from the university; your expertise on the subject has been very valuable in realizing this thesis.

REFERENCES

[1] Seiya Abe, Osamu Mizuno, Tohru Kikuno, Nahomi Kikuchi, and Masayuki Hi-rayama. 2006. Estimation of project success using bayesian classifier. In Proceed-ings of the 28th international conference on Software engineering. ACM, 600–603. [2] Frank T Anbari. 2003. Earned value project management method and extensions.

Project management journal34, 4 (2003), 12–23.

[3] Norm P Archer and Fereidoun Ghasemzadeh. 1999. An integrated framework for project portfolio selection. International Journal of Project Management 17, 4 (1999), 207–216.

[4] René M Bakker. 2010. Taking stock of temporary organizational forms: A system-atic review and research agenda. International Journal of Management Reviews 12, 4 (2010), 466–486.

[5] Gustavo EAPA Batista, Ronaldo C Prati, and Maria Carolina Monard. 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter 6, 1 (2004), 20–29.

[6] Jordy Batselier and Mario Vanhoucke. 2015. Construction and evaluation frame-work for a real-life project database. International Journal of Project Management 33, 3 (2015), 697–710.

[7] Stanislav Berlin, Tzvi Raz, Chanan Glezer, and Moshe Zviran. 2009. Comparison of estimation methods of cost and duration in IT projects. Information and software technology51, 4 (2009), 738–748.

[8] Leo Breiman. 1996. Bagging predictors. Machine learning 24, 2 (1996), 123–140. [9] Leo Breiman. 2017. Classification and regression trees. Routledge.

[10] Erik Brynjolfsson, Lorin M Hitt, and Heekyung Hellen Kim. 2011. Strength in numbers: How does data-driven decisionmaking affect firm performance? (2011). [11] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer.

2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research16 (2002), 321–357.

[12] Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 785–794.

[13] Min-Yuan Cheng, Yu-Wei Wu, and Ching-Fang Wu. 2010. Project success predic-tion using an evolupredic-tionary support vector machine inference model. Automapredic-tion in Construction19, 3 (2010), 302–307.

[14] David S Christensen. 1998. The costs and benefits of the earned value manage-ment process. Journal of Parametrics 18, 2 (1998), 1–16.

[15] Francesco Costantino, Giulio Di Gravio, and Fabio Nonino. 2015. Project selection in project portfolio management: An artificial neural network model based on critical success factors. International Journal of Project Management 33, 8 (2015), 1744–1754.

[16] Mats Engwall and Anna Jerbrant. 2003. The resource allocation syndrome: the prime challenge of multi-project management? International journal of project management21, 6 (2003), 403–409.

[17] Quentin W Fleming and Joel M Koppelman. 2010. Earned value management. Project Management Institute (PMI), Pennsylvania, USA(2010).

[18] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics (2001), 1189–1232.

[19] Fereidoun Ghasemzadeh and Norman P Archer. 2000. Project portfolio selection through decision support. Decision support systems 29, 1 (2000), 73–88. [20] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive

synthetic sampling approach for imbalanced learning. In Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE Interna-tional Joint Conference on. IEEE, 1322–1328.

[21] G Hu, L Wang, S Fetch, and B Bidanda. 2008. A multi-objective model for project portfolio selection to implement lean and Six Sigma concepts. International journal of production research46, 23 (2008), 6611–6625.

[22] Project Management Institute. 2016. The High cost of Low Performance | Pulse of the Profession 2016. (2016).

[23] Project Management Institute. 2017. A Guide to the Project Management Body of Knowledge. Project Management Institute.

[24] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An introduction to statistical learning. Vol. 112. Springer.

[25] László A Jeni, Jeffrey F Cohn, and Fernando De La Torre. 2013. Facing imbal-anced data–recommendations for the use of performance metrics. In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference

(12)

on. IEEE, 245–251.

[26] Mahesh Vijaykumar Joshi. 2002. Learning classifier models for predicting rare phenomena. University of Minnesota.

[27] Michael Karlesky and Mark Vander Voord. 2008. Agile project management. ESC 247, 267 (2008), 4.

[28] Sotiris B Kotsiantis, I Zaharakis, and P Pintelas. 2007. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering160 (2007), 3–24.

[29] Harvey A Levine. 2005. Project portfolio management. San Francisco (2005). [30] Harvey A Levine. 2007. Project portfolio management: a practical guide to selecting

projects, managing portfolios, and maximizing benefits. John Wiley & Sons. [31] Chinho Lin and Ping-Jung Hsieh. 2004. A fuzzy decision support system for

strategic portfolio management. Decision support systems 38, 3 (2004), 383–398. [32] Walt Lipke, Ofer Zwikael, Kym Henderson, and Frank Anbari. 2009. Prediction of project outcome: The application of statistical methods to earned value manage-ment and earned schedule performance indexes. International journal of project management27, 4 (2009), 400–407.

[33] Robert A Marshall, Philippe Ruiz, and Christophe N Bredillet. 2008. Earned value management insights using inferential statistics. International Journal of Managing Projects in Business1, 2 (2008), 288–294.

[34] D Magaña Martínez and Juan Carlos Fernández-Rodríguez. 2015. Artificial Intelligence applied to project success: a literature review. IJIMAI 3, 5 (2015), 77–84.

[35] Andrew McAfee. 2002. The impact of enterprise information technology adop-tion on operaadop-tional performance: An empirical investigaadop-tion. Producadop-tion and operations management11, 1 (2002), 33–53.

[36] John T Mentzer and Roger Gomes. 1989. Evaluating a decision support forecasting system. Industrial Marketing Management 18, 4 (1989), 313–323.

[37] Ralf Müller, Miia Martinsuo, and Tomas Blomquist. 2008. Project portfolio control and portfolio management performance in different contexts. Project management journal39, 3 (2008), 28–42.

[38] Daniel J Power. 2008. Understanding data-driven decision support systems. Information Systems Management25, 2 (2008), 149–154.

[39] Christian Robert. 2014. Machine learning, a probabilistic perspective. (2014). [40] Barry Shore. 2008. Systematic biases and culture in project failures. Project

Management Journal39, 4 (2008), 5–16.

[41] Alex J Smola and Bernhard Schölkopf. 2004. A tutorial on support vector regres-sion. Statistics and computing 14, 3 (2004), 199–222.

[42] Christian Stummer, Elmar Kiesling, and Walter J Gutjahr. 2009. A multicrite-ria decision support system for competence-driven project portfolio selection. International Journal of Information Technology & Decision Making8, 02 (2009), 379–401.

[43] Jörg Sydow, Lars Lindkvist, and Robert DeFillippi. 2004. Project-based organizations, embeddedness and repositories of knowledge. (2004).

[44] Thomas G Tape. 2006. Interpreting diagnostic tests. University of Nebraska Medical Center, http://gim.unmc.edu/dxtests (2006).

[45] Ricardo Viana Vargas. 2003. Earned value analysis in the control of projects: Success or failure? AACE International Transactions (2003), CS211.

[46] Mathieu Wauters and Mario Vanhoucke. 2014. Support vector machine regression for project control forecasting. Automation in Construction 47 (2014), 92–106.

[47] Mathieu Wauters and Mario Vanhoucke. 2016. A comparative study of Artificial Intelligence methods for project duration forecasting. Expert Systems with Applications 46 (2016), 249–261.

[48] Mathieu Wauters and Mario Vanhoucke. 2017. A Nearest Neighbour extension to project duration forecasting with Artificial Intelligence. European Journal of Operational Research 259, 3 (2017), 1097–1111.

[49] Jianfeng Wen, Shixian Li, Zhiyong Lin, Yong Hu, and Changqin Huang. 2012. Systematic literature review of machine learning based software development effort estimation models. Information and Software Technology 54, 1 (2012), 41–59.

[50] Cha Zhang and Yunqian Ma. 2012. Ensemble machine learning: methods and applications. Springer.


A DATA ATTRIBUTES

Table A.1: Overview of the attributes in the raw data

Attribute | Type | Description
Engagement | String | Internal project code
Engagement Suite | String | Business suite of the engagement
Estimated Fee | Integer | Estimation of the fee that is billed to the client
Creation Date | Datetime | Date on which the engagement has been created
Fiscal Year | Datetime | Fiscal year in which the project is undertaken
Entity Group - LOB | String | Product or Service that the client offers
Industry Sector | String | The industry sector of the client
Engagement Service Type | String | Type of work of the engagement
Engagement Service Type Group | String | Service type aggregated in groups
Engagement Partner | String | Employee ID of the engagement partner
Engagement Partner Suite | String | Suite of the partner
Engagement Profit Center ID | String | ID of the internal department where the project is carried out
Engagement Profit Center Name | String | Name of the internal department where the project is carried out
Employee | String | ID of the employee
Employee Profit Center | String | ID of the Profit Center where the employee is employed
Employee Profit Center Name | String | Name of the Profit Center where the employee is employed
Staff level | String | Job level of the employee
Hours | Float | Number of hours that the employee has registered on the engagement
Revenue @ Standard | Float | Commercial tariff, thus what is intended to be billed to the customer
Net Engagement Revenue | Float | Revenue after cost reduction
Employee Suite | String | Business suite of the employee

Bold = Same for entire project

Table A.2: Overview of the attributes in the preprocessed data

Attribute | Type | Source | Description | Comments
Engagement | String | Raw data | Internal project code (unique)
Project name | String | Raw data | Name of the project
Practice | String | Raw data | Business practice of the project | See figure B.2
Creation date | Datetime | Raw data | Date on which the engagement has been created
Engagement close date | Datetime | Raw data | Date on which the engagement has been closed
Client ID | String | Raw data | ID of the client company
Entity Line Of Business | String | Raw data | Product or Service that the client offers | See figure B.3
Industry Sector | String | Raw data | The industry sector of the client | See figure B.1
Profit Center | String | Raw data | Internal department where the project is carried out
Engagement Partner | String | Raw data | Employee ID of the engagement partner
Engagement Type | String | Raw data | Type of work of the engagement
Estimated fee | Integer | Raw data | Estimation of the fee that is billed to the client
Number of employees | Integer | Derived | Total number of employees with registered hours on the project | See figure B.5
Total hours | Float | Derived | Total number of hours registered on the project
Partner Previous Engagements | Int | Derived | Total number of previous engagements that have the same partner
Client Previous Engagements | Int | Derived | Total number of previous engagements that have the same client
Total Net Revenue | Float | Raw data | Net revenue in euros of the project

Underlined = Used as input for ML models


B EXPLORATORY PLOTS

Figure B.1: Industry Sector Distribution

Figure B.2: Practice Distribution

Figure B.3: Line Of Business Distribution


Figure B.4: Estimated fee distribution (Outliers removed & Z-normalized)

Figure B.5: Number of employees distribution (Outliers removed)


C ROC CURVES

Figure C.6: No oversampling

Figure C.7: SMOTE

Figure C.8: ADASYN

Figure C.9: Random oversampling


D PRECISION-RECALL CURVES

Figure D.10: No oversampling

Figure D.11: SMOTE

Figure D.12: ADASYN

Figure D.13: Random oversampling


E PARAMETERS OF THE ML ALGORITHMS

Table E.3: Best settings found using the hyperparameter tuning

Algorithm | Parameters

Non-Ensemble
Nearest neighbors | n_neighbors: 70, leaf_size: 1, p: 1, weights: distance, algorithm: ball_tree
Decision Tree | max_depth: 4, criterion: 'friedman_mse'
Support Vector Machine | kernel: 'linear'

Ensemble
Gradient Boosting | alpha: 0.95, n_estimators: 100, max_features: 0.8, max_depth: 20, loss: ls
Adaptive Boosting | loss: exponential, n_estimators: 300
Bagging | n_estimators: 500, max_features: 0.8, max_samples: 0.8
XGBoost | max_depth: 10, n_estimators: 100

F PYTHON CODE

F.1 Preprocessing

# Imports
import pandas as pd
import seaborn as sns
import numpy as np
import random
import matplotlib.pyplot as plt
import re
# Notebook magic for inline plots
%matplotlib inline

# Load the raw ERP export and the client list
raw_df = pd.read_excel("../../Data/Revenue Contribution data.xlsx")
raw_df["Engagement"] = raw_df["Engagement"].astype(str)
client_df = pd.read_excel("../../Data scriptie/Clients.xlsx")

# Only keep engagements that have been closed
finished_engagements = set(raw_df[raw_df["Eng last close date"] != "#"].Engagement)

# Aggregate the line-level ERP records into one row per engagement
data_list = []
for engagement in finished_engagements:
    engagement_df = raw_df[raw_df["Engagement"] == engagement]
    data_list.append({
        "Engagement": engagement,
        "Total Standard Revenue": sum(engagement_df["Revenue @ Standard"]),
        "Total Net Revenue": sum(engagement_df["Net Engagement\nRevenue"]),
        "Total Hours": sum(engagement_df["Hours"]),
        "Industry Sector": engagement_df["Industry Sector"].values[0],
        "Entity Line of Business": engagement_df["Entity Group - LOB"].values[0],
        "Estimated fee": engagement_df["Estimated Fee"].values[0],
        "Profit Center": engagement_df["Engagement Profit Center Name"].values[0],
        "Engagement Type": engagement_df["Engagement Service Type"].values[0],
        "Number of employees": len(engagement_df),
        "Engagement Partner": engagement_df["Engagement Partner"].values[0],
        "Creation Date": engagement_df["Creation Date"].values[0],
        "Close Date": engagement_df["Eng last close date"].values[0]})

project_df = pd.DataFrame(data_list)
print(project_df.shape)
project_df.head()

# Cleaning "Engagement Type": strip connectors and keep the part after the dash
clean_df = project_df.copy()
print(len(set(clean_df["Engagement Type"])))
clean_df["Engagement Type"] = clean_df["Engagement Type"].apply(
    lambda x: x.replace("AND", "").replace("&", "").replace(",", ""))
clean_df["Engagement Type"] = clean_df["Engagement Type"].apply(
    lambda x: re.findall('(?<=-).*', x)[0])

# Missing values
clean_df = clean_df[clean_df["Estimated fee"] != "Not assigned"]
clean_df["Estimated fee"] = clean_df["Estimated fee"].astype(int)
clean_df["Engagement Partner"] = clean_df["Engagement Partner"].astype(str)

# Add client information
clean_df = clean_df.merge(client_df, how="left", left_on="Engagement",
                          right_on="*Project ID").drop(
    ["Import Date", "*Confidential", "*Project ID", "Function"], axis=1)

# Functions for deriving features
def partner_engagements(engagement_id, partner_id):
    # Returns the number of previous engagements of the partner
    engagement_date = clean_df[clean_df["Engagement"] == engagement_id]["Close Date"].values[0]
    return len(clean_df[(clean_df["Close Date"] < engagement_date) &
                        (clean_df["Engagement Partner"] == partner_id)])

def client_engagements(engagement_id, client_id):
    # Returns the number of previous engagements of the client
    engagement_date = clean_df[clean_df["Engagement"] == engagement_id]["Close Date"].values[0]
    return len(clean_df[(clean_df["Close Date"] < engagement_date) &
                        (clean_df["Client.*Client ID"] == client_id)])

insert = []
for _, row in clean_df.iterrows():
    insert.append(partner_engagements(row["Engagement"], row["Engagement Partner"]))
clean_df["Partner Previous Engagements"] = insert

insert = []
for _, row in clean_df.iterrows():
    insert.append(client_engagements(row["Engagement"], row["Client.*Client ID"]))
clean_df["Client Previous Engagements"] = insert

# Convert to input for the models (one-hot encoding of the categorical features)
input_df = clean_df[['Engagement Partner', 'Engagement Type', 'Estimated fee',
                     'Industry Sector', 'Number of employees', 'Profit Center',
                     'Practice', 'Partner Previous Engagements',
                     'Client Previous Engagements']]
output_df = pd.DataFrame()
# Target: 1 if the project ended with non-positive net revenue (aligned with the cleaned rows)
output_df["Negative Revenue"] = (clean_df["Total Net Revenue"] <= 0).astype(int)
input_df = pd.get_dummies(input_df, columns=['Engagement Partner', 'Engagement Type',
                                             'Industry Sector', 'Profit Center', 'Practice'])
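After the one-hot encoding it can be useful to confirm the dimensionality of the resulting feature matrix, the class balance of the target, and that no missing values remain. The following lines are an illustrative sanity check added here, not part of the original listing:

# Shape of the model input and the class balance of the target
print(input_df.shape)
print(output_df["Negative Revenue"].value_counts())

# Verify that the one-hot encoded feature matrix contains no missing values
print(input_df.isnull().sum().sum())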

F.2 Train/Test Split & Oversampling

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
import sklearn.metrics as metrics

X = input_df
y = output_df["Negative Revenue"].values

# Normalize each feature (column) before splitting
X_normalized = preprocessing.normalize(X, axis=0)

# Stratified 75/25 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, test_size=0.25, random_state=0, shuffle=True, stratify=y)

# Oversample the minority class in the training set only
# (fit_sample is named fit_resample in newer versions of imbalanced-learn)
X_resampled_SMOTE, y_resampled_SMOTE = SMOTE().fit_sample(X_train, y_train)
X_resampled_ADASYN, y_resampled_ADASYN = ADASYN().fit_sample(X_train, y_train)
X_resampled_random, y_resampled_random = RandomOverSampler().fit_sample(X_train, y_train)
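The listing above produces three resampled training sets. A quick way to verify that each oversampler has actually balanced the classes is to compare the label counts before and after resampling. This snippet is a minimal sketch added for illustration and is not part of the original pipeline:

import numpy as np

# Class distribution in the original (imbalanced) training set
print("Original:", np.bincount(y_train))

# Class distributions after each oversampling strategy
print("SMOTE:   ", np.bincount(y_resampled_SMOTE))
print("ADASYN:  ", np.bincount(y_resampled_ADASYN))
print("Random:  ", np.bincount(y_resampled_random))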

F.3 Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer
from sklearn.metrics import auc
from sklearn.metrics import roc_auc_score

# Grid search with 5-fold stratified cross-validation, scored on ROC AUC
scorer = make_scorer(roc_auc_score)

def find_best_params(tuned_parameters, clf):
    grid_search_clf = GridSearchCV(
        clf, tuned_parameters,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
        scoring=scorer, n_jobs=-1)
    grid_search_clf.fit(X_train, y_train)
    return grid_search_clf.best_params_
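For illustration, the helper above could be called with a parameter grid such as the one below. The grid values here are hypothetical examples, not the settings reported in Table E.3:

from sklearn.ensemble import BaggingRegressor

# Hypothetical search space for the Bagging model (values are examples only)
bagging_grid = {"n_estimators": [100, 300, 500],
                "max_features": [0.6, 0.8, 1.0],
                "max_samples": [0.6, 0.8, 1.0]}

best_bagging_params = find_best_params(bagging_grid, BaggingRegressor(n_jobs=-1))
print(best_bagging_params)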

F.4 Model implementations

from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
import xgboost as xgb

# Models instantiated with the tuned settings from Table E.3
bagging_clf = BaggingRegressor(n_jobs=-1, n_estimators=500, max_features=0.8, max_samples=0.8)
adaboost_clf = AdaBoostRegressor(loss='exponential', n_estimators=300)
xgboost_clf = xgb.XGBRegressor(max_depth=10, n_estimators=100)
dtree_clf = DecisionTreeRegressor(max_depth=4, criterion='friedman_mse')
knn_clf = KNeighborsRegressor(n_neighbors=70, leaf_size=1, n_jobs=-1, p=1,
                              weights='distance', algorithm="ball_tree")
svm_clf = SVR(kernel="linear")
grb_clf = GradientBoostingRegressor(n_estimators=100, max_features=0.8, alpha=0.95,
                                    max_depth=20, random_state=0, loss='ls')
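Because the models are implemented as regressors, their predictions are continuous scores rather than class labels; the curves in Appendices C and D are based on treating such scores as the likelihood of a project being unprofitable. The lines below are a minimal evaluation sketch, not the full experimental loop of the thesis, and use the Bagging model as an arbitrary example:

from sklearn.metrics import roc_auc_score, roc_curve

# Fit on the (optionally oversampled) training set and score the held-out test set
bagging_clf.fit(X_train, y_train)
y_scores = bagging_clf.predict(X_test)

print("ROC AUC:", roc_auc_score(y_test, y_scores))

# False/true positive rates of the kind underlying the ROC curves in Appendix C
fpr, tpr, thresholds = roc_curve(y_test, y_scores)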

F.5 Finding optimal cutoff

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
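The remainder of this listing did not survive extraction. A minimal sketch of how an optimal cutoff could be chosen is shown below: candidate thresholds are swept over the predicted scores (the y_scores from the evaluation sketch in F.4) and the threshold with the highest F1 score is kept. The threshold grid and the use of F1 as the selection metric are assumptions for illustration, not necessarily the procedure used in the thesis:

import numpy as np

def find_optimal_cutoff(y_true, y_scores):
    # Sweep candidate cutoffs and keep the one with the highest F1 score
    best_cutoff, best_f1 = 0.5, 0.0
    for cutoff in np.linspace(0.01, 0.99, 99):
        y_pred = (y_scores >= cutoff).astype(int)
        score = f1_score(y_true, y_pred)
        if score > best_f1:
            best_cutoff, best_f1 = cutoff, score
    return best_cutoff, best_f1

cutoff, f1 = find_optimal_cutoff(y_test, y_scores)
print("Best cutoff:", cutoff, "F1:", f1)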
