Matching of employees to in-house jobs at Sogeti NL using employee descriptions and hard skills

Brechje Boeklagen, 11021462

First examiner (UvA): Maarten Marx

UvA supervisor and second examiner: Ana Lucic

Sogeti supervisor: Sanne Bouwman

June 26, 2019

ABSTRACT

This research is a case study for Sogeti NL in the field of human resources. Sogeti spends considerable time and resources on procedures for matching employees to in-house vacancies based on their resumes, interviews, and applications. To make this process smoother, this research investigates how employee resumes can be matched automatically. Current solutions use different models to predict personality traits, fitness for a job based on probability, and applicant matching to academic jobs, but they do not incorporate the job type and personality trait clustering based on job and applicant descriptions that this research focuses on. This research takes two approaches into account: multi-class classification for matching employees to the best-suited job type, and binary classification to determine the probability of being hired for that job type. The matches depend on hard and soft skills that are extracted using Guided Latent Dirichlet Allocation (LDA) on the job descriptions and the resumes, and on keywords. Balanced and unbalanced training sets were used to train the different models, which performed better than the logistic regression baseline model. Overall, the best models were Neural Networks (NN), K-Nearest Neighbors (KNN), and Decision Tree Classifiers (DS). The most predictive features of the baseline model for every job type did not include soft skills, suggesting that, for the job types indicated by Sogeti, hard skills are most important when hiring an employee.

1 INTRODUCTION

Machine learning has increasingly found its way into everyday life. More and more decisions are supported or made entirely by machine learning algorithms. One of these decisions is the hiring of employees by companies. These developments can have positive and negative effects on these companies, as well as on employees and society. An example is Amazon's effort to automate a part of its hiring process (Dastin, 2018). This algorithm, however, suffered from a bias favoring men over women when selecting a candidate for a vacancy. Extracting personality traits from text is also done by Cambridge Analytica, and may have played a role in influencing the outcome of the 2016 United States election and the Brexit referendum (Rosenberg et al., 2018). Another approach is implemented by Seedlink in Amsterdam, the Netherlands. This company has made an algorithm that poses open-ended questions to an applicant through an application and labels the input of the applicant as fit or unfit. The employer creates a profile and, in combination with the labeled user input, the system provides a recommendation (Seedlink, 2017). Textkernel, also from the Netherlands, focuses on collecting job vacancy information for matching and analytical purposes. Textkernel has created ’Jobfeed’, a source of job market data that allows users to analyze the demand side of the labor market and is unique in its domain (Textkernel, 2019). Clearly, algorithms can have an effect on companies, employees, and society.

This research is set up as a case study for Sogeti. Sogeti is a consultancy company that creates IT solutions and provides a wide range of IT services for its clients. The research has a pilot case component, to see whether a job matching algorithm is applicable to the area of human resources (HR) and planning within the company. As the selecting, matching, and hiring process is time-consuming, a matching algorithm is needed to speed up and smooth this process. Sogeti would like to know whom to select for a particular job, based on soft skills extracted from the resume descriptions and on keywords with a level of ability (hard skills). This case study focuses on finding suitable employees for a job type within the company, without having to spend a lot of time on matching procedures. A second component of this research is a comparative study. The comparative study is necessary because a well-performing model needs to be found to solve the pilot case as well as possible. Sogeti would like to know what job type fits an employee and, given that job type, whether the employee is likely to get hired.

When Sogeti gets a particular request from a client, it is directed to the sales department. Sales then communicates this offer to the planning department. Planning in turn approaches employees, and vice versa, to make a selection of employees to report back to sales. Sales then picks the best candidates to recommend to the client. This is highly time-consuming and could be automated using an algorithm that makes a pre-selection of employees automatically, based on the soft and hard skills of Sogeti employees. This led to the following research question:

How can we optimize the matching of hard and soft skills of Sogeti employees to in-house vacancies?

The sub-questions are:

• RQ1: Given a description of an applicant, what job type are they best suited for? What kind of model performs best for this multi-class purpose?

• RQ2: Given a description of an applicant and the job type they are best suited for, how likely are they to be hired for this particular job type? What kind of model performs best for this binary classification purpose?

• RQ3: What skills of an applicant are most predictive for being hired for a job type?

The results of this research will show a best performing model for the first two research questions, together with the metrics and parameters. To select the best performing model, multiple models are tested using a Grid Search cross-validation, and Neural Networks (NN) are used as well in an attempt to answer the research questions. This report is structured as follows: in Section 2 we discuss other work related to this research. In Section 3 the methodology is discussed, and in Section 4 we elaborate on the final models and their performance results. Section 5 criticizes the results, followed by conclusions about the research and steps for further research.

2 RELATED WORK

An approach of Bakar and Ting (2011) is to match soft skills to jobs through a Bayesian network. Their motivation is that it is necessary to identify suitable soft skills to perform a job effectively. Employers are simply looking at a combination of soft and hard skills (Bakar and Ting, 2011). The inputs given to the network are a job title, the qualification of the employee, and working experience (Bakar and Ting, 2011). The system should be able to propose a list of job titles, soft skills, and their relevance for the employer to select those employees. Compared to the research of Bakar and Ting (2011), the input of this research for Sogeti will be larger: there will be more features to incorporate into a model (see Section 3.1). Secondly, this research will match employees to job types, but it will be based on more than the job title: the description of the job will be used to indicate the job type, as job titles are not available. Another approach is focused on the selection process of different types of jobs (De Wolf and Van der Velden, 2001). This research is mainly focused on the behavior of employers selecting hypothetical applicants. The paper is focused on the transition from university to the labor market. The probability of being selected was estimated with a logistic regression analysis on the hypothetical applicants that were selected or not. After the selection, employers ranked the six best applicants based on how well the applicant fitted the job. Soft skill characteristics that were used in this study were flexibility, communication skills, outward appearance, and personality type. These characteristics are the most decisive for the selection procedure. Having a matching personality is less important for the ranking of scientific jobs than for generic and sector-specific jobs (De Wolf and Van der Velden, 2001). In the present case study, there is no manual selection of applicants for a job, and the study of De Wolf and Van der Velden (2001) is not focused on the context of Sogeti.
Compared to De Wolf and Van der Velden (2001), this research will be more generalized. Sogeti's in-house vacancies cover more categories than the three academic job types of De Wolf and Van der Velden (2001), namely: analysts, advisers, managers, creators, and developers. This research does include a ranking of the employees based on the fit to a vacancy, but this will only be a pre-selection. Periatt et al. (2007) incorporate personality testing in the selection process of employees. The five-factor model is used: neuroticism, extraversion, openness, agreeableness, and conscientiousness, also called the Big Five. Periatt et al. (2007) try to prove that personality testing should be used to complement the traditional methods of interviewing and applications. They established a relationship between personality and work-related behavior of logistics employees. Another method that incorporated the five-factor model is that of Li et al. (2008). The goal was to improve the quality of selecting a candidate for a position by using the five-factor model and support vector machines (SVM). The SVM is used to predict the fitness of the candidate for placement. Information about the knowledge, skills, and abilities of the applicants has been used as a major hiring criterion for the past decades (Li et al., 2008). The contribution of the research is an expert system to facilitate HR development: the model predicts the position suitable for workers based on probability. SVMs are used as well to predict employee turnover based on job performance, where SVM models are used to compare the prediction of performance (Hong et al., 2005). Li et al. (2008) used a questionnaire to extract personality traits, whereas this research will use other means to extract personality from the descriptions found in Sogeti resumes, namely clustering methods using Latent Dirichlet Allocation (LDA).

Liu et al. (2016) researched a probabilistic topic model (PT-LDA) to predict personality traits within the five-factor model. The model incorporates n-gram word features into latent topics about personality traits. The research shows that the PT-LDA approach can be used to extract topics that are associated with each personality trait, and therefore is a new way to analyze user behavior in online social networks. In this research, a PT-LDA model is not applicable, because the personality does not have to be predicted, only the best job type for an applicant and the probability that the applicant gets hired for that particular job type. Guided LDA is a suitable approach because descriptions of jobs and employees are going to be fitted in a specifically indicated context. The approach uses seed words to indicate topics that are of specific interest to the researcher (Jagarlamudi et al., 2012).

This research will yield the most relevance in ICT HR fields of research because job matching is now becoming more automated and is, therefore, reforming the current state of affairs. Insights for Sogeti could be the way of quantifying personality without any personality test data through the use of guided LDA and the way different models are built to optimize the matching process of employees. This will result in spending less time looking for people that are fit for a job and more time hiring employees for vacancies which could streamline the planning and hiring process.

3 DATASET

3.1 Resources

The datasets that are used originate from Sogeti. They provided datasets about Matched Vacancies of 2017, 2018, and 2019, and an HR management file with Resumes. The Matched Vacancy dataset contains data about:

• Year of request;
• Request ID;
• Region;
• Sogeti department;
• Difficulty level;
• Note;
• Employee ID;
• Job description;
• Client name;
• Dates of status changes;
• Hired/not-hired label.

In the Resume dataset, mostly descriptions are found:

• Employee ID;
• General description;
• Knowledge description;
• Education description;
• Past working experiences description (short and long with a beginning and end date);
• Keywords with level of ability.

There are approximately 2068 employees in the Resume dataset after the merging of all tabs in the file containing the different features. For the Matched Vacancies, 6960 matches in 2019, 17983 matches in 2018, and 17456 matches in 2017 are merged with the resume data per year, to be added to one large dataframe afterward. Matches in this context are employees who have applied for a job offer and were hired or rejected. This resulted in 11566 rows with matched people and no duplicates.

3.2 Preprocessing

3.2.1 Text cleaning.

As this research focuses on matching employees to jobs based on their personality description, the first step is to clean the texts. The texts are descriptions of the employee: a general overview, knowledge, education, and past working experiences. These texts were merged into one concatenated string. Firstly, the text is lowercased. Secondly, punctuation is removed using the ’punkt’ library of NLTK. Thirdly, all text is tokenized, meaning that the concatenated strings are split into a list of separate single-word strings. Fourthly, the tokenized lists were normalized, because many variations of a word carry the same meaning. This is called stemming. Lastly, stop words were removed using the NLTK stop word corpus. In some cases, the name of the employee is mentioned in the description. These names were removed by matching against a list of first names and replacing each match with an empty string.
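The cleaning steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library rather than NLTK: the stop word list, the first-name list, and the `stem` helper are hypothetical stand-ins, and names are dropped before stemming so that they still match the name list, which is an assumption about ordering rather than something the text specifies.

```python
import string

# Toy stand-ins (assumptions): the study used NLTK's stop word corpus
# and a full list of first names.
STOP_WORDS = {"de", "het", "een", "en", "is", "in", "van", "met"}
FIRST_NAMES = {"anna", "jan"}


def stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("en", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Lowercase, strip punctuation, tokenize, drop names, stem, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [t for t in tokens if t not in FIRST_NAMES]
    tokens = [stem(t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]
```

Calling `clean` on a merged description returns the list of normalized tokens that the later steps operate on.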

3.2.2 Select important words.

To select only the important words in the cleaned text, TF-IDF scores were used. TF-IDF stands for term frequency-inverse document frequency. It is a means to reflect how important a word is to a document in a collection (Rajaraman and Ullman, 2011). Furthermore, all words that occurred fewer than 10 times in the whole set of unique descriptions were removed, as well as words shorter than two characters.
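As an illustration of the scoring and filtering described here, the sketch below computes a plain TF-IDF score (relative term frequency times the logarithm of the inverse document frequency) and applies the frequency and length cut-offs. The exact TF-IDF variant used in the study is not stated, so this formula is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def filter_vocabulary(docs, min_count=10, min_length=2):
    """Drop words occurring fewer than min_count times or shorter than min_length."""
    totals = Counter(word for doc in docs for word in doc)
    keep = {w for w, c in totals.items() if c >= min_count and len(w) >= min_length}
    return [[w for w in doc if w in keep] for doc in docs]
```

A word that appears in every document receives a score of zero under this variant, which is why very common words drop out of the selection.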

3.3 Job type and personality trait feature engineering

3.3.1 Word2Vec and subjectivity.

Now that the descriptions are cleaned, they can be used to establish labels and features. To use the words for clustering, vectors are needed to give them a point in space. We use a pre-trained Word2Vec gensim model based on the Dutch wiki-320 corpus¹. The Dutch embeddings have been researched by Tulkens et al. (2016). The Dutch embedding made by Tulkens et al. (2016) performs better in an unsupervised way than a hand-crafted dictionary. To arrive at an optimal model in which clusters can be indicated, parameters for this case study have been altered. There are 5 topics, 300 iterations, a random state of zero, and a refresh of thirty for the fit of the Word2Vec model.

These words are filtered again, based on a sentiment threshold. The idea behind this filtering is to keep, for the employee personality-trait clustering, only the words that hold some kind of sentiment. This decision is based on a study by Higgins and Judge (2004). They researched fit and hiring recommendations and explain that subjective evaluations of the fit of a person to an organization or environment (P-O) are important factors for recruiters. The P-O fit is mostly about compatibility based on the values and personality traits of the applicant. For a person to job (P-J) fit, the central requirements for the job are important. However, subjective evaluations made by recruiters tend to have stronger effects than objective ones (Higgins and Judge, 2004). Furthermore, the hiring process is a process of cultural matching. Employers seek applicants who have the skills but are also culturally similar to themselves, and often lean toward an applicant with whom they feel a connection (Rivera, 2012). To stay as close as possible to what happens in real-life situations, both personality trait and job type clustering are based on subjective words.

The algorithm used for the subjectivity selection is the sentiment function from pattern.nl (CLiPS, 2011). This function returns a tuple with a polarity and a subjectivity score based on adjectives. Words that score below a threshold of 0.1 are not used to construct the personality-trait clusters, as 0.0 is neutral and the goal was to use only subjective words. The threshold was set at 0.1 to prevent a large reduction in the number of words that can be used. The same is done for the vacancy (job type) descriptions.

3.3.2 Guided LDA.

Guided Latent Dirichlet Allocation (LDA) is used to perform a word clustering based on seed words (Jagarlamudi et al., 2012). As the data was not labeled for a job type or personality traits, these features had to be engineered. The only data available to base these features on were job descriptions in the Matched Vacancy dataset and employee descriptions in the Resume dataset. Guided LDA is a topic modeling clustering method that takes some direction, indicated by seeds that the user considers representative of the underlying topics (freeCodeCamp.org, 2017; Jagarlamudi et al., 2012). The words indicating specific job types or personality traits were given to the guided LDA models to take into account during clustering. Seed words are generated using the unique words in the descriptions of the employees and their cosine similarity to the personality traits or job types in the wiki-320 Word2Vec model. After the clustering, each document was labeled with its most salient cluster.
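The seed word generation step can be illustrated as a nearest-neighbor lookup by cosine similarity. The sketch below is a simplified assumption of how this might look: the two-dimensional vectors are toy values rather than the wiki-320 embeddings, and `seed_words` and `top_n` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def seed_words(topic_terms, vocabulary, vectors, top_n=2):
    """For each topic term, pick the top_n most similar vocabulary words."""
    seeds = {}
    for term in topic_terms:
        candidates = [w for w in vocabulary if w != term and w in vectors]
        ranked = sorted(
            candidates,
            key=lambda w: cosine_similarity(vectors[term], vectors[w]),
            reverse=True,
        )
        seeds[term] = ranked[:top_n]
    return seeds


# Toy 2-d embeddings (assumption); real embeddings have hundreds of dimensions.
VECTORS = {
    "manager": [1.0, 0.1],
    "leiden": [0.9, 0.2],
    "team": [0.8, 0.3],
    "analist": [0.1, 1.0],
    "data": [0.2, 0.9],
    "rapport": [0.3, 0.8],
}

SEEDS = seed_words(["manager", "analist"], list(VECTORS), VECTORS)
```

The resulting per-topic word lists are the kind of input a guided LDA model takes as seed topics.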

¹ https://github.com/clips/dutchembeddings


3.3.3 Job type labeling.

Since the Matched Vacancy dataset does not include job types, the labels need to be generated using guided LDA. In this case, the seed topics were provided by Sogeti and are as follows:

(1) Analyst;
(2) Adviser;
(3) Manager;
(4) Creator;
(5) Organizer.

The cleaned job descriptions are run through the guided LDA model to assign each job description to a job type (one of the five seed topics). These job types are used as labels in the multi-class classification task mentioned as RQ1 in Section 1. The labels are also used to identify the populations for the binary classification tasks, which are split up by job type to answer RQ2.

3.3.4 Personality trait features.

As was emphasised by Periatt et al. (2007), Li et al. (2008), and Bakar and Ting (2011), soft skills are important to take into consideration when hiring employees for a job. To incorporate this into the case study for Sogeti, the description of employees in the Resume dataset is used to establish personality traits. Employees’ personality can be linked to different personality traits: extraversion, emotional stability, agreeableness, conscientiousness, and openness to experience (Barrick and Mount, 1991; Periatt et al., 2007). These personality traits are tested and quantified in personality tests by Sogeti, but this research has no access to that data. The words used to indicate the seed words were:

• Extravert;
• Customer-friendly;
• Careful;
• Stable;
• Openness.

3.3.5 Evaluation of clusters.

The clusters are evaluated using the Euclidean distance between the different cluster centroids and the maximum distance between a centroid and a point within its cluster. However, since the vector space consists of many dimensions, it is hard to say from these distances whether the clusters overlap, so this is not a good measure. To determine overlap, the silhouette coefficient is calculated using sklearn with a cosine metric, as this is the metric most used for textual data (scikit-learn.org, 2019). The coefficient uses the mean intra-cluster distance (a) and the mean nearest-cluster distance (b), called tightness and separation (see Formula 1) (Rousseeuw, 1987; scikit-learn.org, 2019).

$$ s = \frac{b - a}{\max(a, b)} \tag{1} $$

The high-dimensional word vectors and their labels were run through the function, which returned a value close to zero for both the job type clusters and the personality clusters. A value close to zero indicates overlapping clusters, whereas a value close to 1 indicates an appropriate clustering and a value close to -1 an inappropriate clustering (Rousseeuw, 1987). The silhouette score for the job types was 0.02 and for the personality traits 0.05. The job types were assigned to a cluster per document. Based on this evaluation, this is also done for personality traits. The initial idea was to count the words belonging to a cluster to give a more nuanced view of employees and their personality traits, but given the overlapping clusters, the frequencies per cluster do not make sense, as the words are all so similar. This was even the case when overlapping words indicating the clusters (the output of guided LDA) were removed. To base the personality on more than just the frequencies, a doc topic is assigned for training. Such a doc topic is the most salient topic measured in the whole document, and the personality traits are one-hot encoded for training purposes.
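To make Formula 1 concrete, the sketch below computes the mean silhouette coefficient from scratch with a cosine distance. It is a simplified stand-in for the sklearn function used in the study, written under the assumption that at least two clusters are present; samples in single-member clusters contribute zero.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def silhouette_score(points, labels, distance=cosine_distance):
    """Mean of (b - a) / max(a, b) over all samples; needs two or more clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(distance(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(distance(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Well-separated clusters give a score near 1, while a labeling that cuts across the natural groups gives a negative score, which matches the interpretation given above.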

3.4 Features

For the features, the one-hot encoded personality traits are used, as well as keywords (hard skills), together with their ability level (a score from 1 to 5). For example, employee A has a high ability level (5) for the following keywords: MS office 2007, CSS, Angular, Hadoop, Matplotlib, and Azure AS. Lastly, the overall number of unique jobs that a person has ever had is added as a feature, as well as the total duration of all jobs in years. These features all have a specific job type and a hired or not-hired label. All the features resulted in a dataset of 1845 columns and 11566 rows, one row for every match. This includes the job-specific features mentioned in Section 3.1. However, Sogeti would like to know beforehand who is suited for a job type, not after a match. This means that the job-specific features are deleted in the training and test split.

4 METHOD

4.1 Classification

As RQ1 states: given a description of an applicant, what job type are they best suited for, and what kind of model performs best for this multi-class purpose? Two approaches are pursued: a multi-class classification with all rows in the dataframe, and the highest-ratio multi-class classification. This last approach is introduced because of duplicate rows with a different label. This kind of duplicate data is possible because the same employee could be hired for multiple jobs, resulting in the same rows (personality and keywords) with a different label per job type. To solve this, the label that occurs most frequently per applicant is selected. To train these models, only the positive (hired) outcomes were used, to determine the probability of a successful match between the employee and job type. However, this leaves the model with 814 rows to train on, rather than 1607 as with the other approach. These models are based only on the employee characteristics. Specific match characteristics are not included in this process, as they contain information about the match, and the purpose of the model is to determine probabilities of the best fit for a person based on their characteristics, to help HR find the best employees to match to a job type.
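The highest-ratio labeling can be sketched as a per-employee majority vote over the job types of their hired matches. The function name and the tuple layout are hypothetical, and the tie-breaking rule (the job type seen first wins) is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter, defaultdict


def most_frequent_job_type(hired_matches):
    """hired_matches: iterable of (employee_id, job_type) pairs for hired cases.

    Returns {employee_id: job_type} with the most frequent job type per employee.
    Ties go to the job type encountered first (an assumption).
    """
    per_employee = defaultdict(Counter)
    for employee_id, job_type in hired_matches:
        per_employee[employee_id][job_type] += 1
    return {
        employee_id: counts.most_common(1)[0][0]
        for employee_id, counts in per_employee.items()
    }
```

Each employee then appears once in the training data, which is why the ratio dataset is smaller than the dataset of all hired rows.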

To determine if someone is likely to get hired for a job within a job type segment (RQ2), a binary classification model is built based on a balanced dataset of hired and not hired cases for every job type. This binary classification required the balancing of the dataset due to the skewness of the data: around 10% to 17% of the people that have applied for a job in a cluster are hired, the rest is rejected.

4.1.1 Balancing using SMOTE.

Balancing is performed for the binary classification, as well as for the two multi-class classification problems, to see whether there is a difference between the performance of the balanced and the unbalanced training-set. For the ratio multi-class classification, this means that rows are deleted to remove duplicates and new rows are generated to balance the data. In the normal approach of the multi-class classification model, no duplicates are removed, but data is still generated using the Synthetic Minority Oversampling Technique (SMOTE). SMOTE is a balancing algorithm that generates data from the cases available and fills up the data of the other labels to the maximum number of rows a specific class has, rather than oversampling with replacement (Lemaitre et al., 2017; Chawla et al., 2002). The data points are sampled randomly. SMOTE adapts the data automatically to the size of the majority class, generating data for the minority classes. The skewness of the data is visible in Figures 1 and 2, and in the binary (hired and not-hired) case around 90% of the cases for every job type are not hired. Balancing is performed only after the train/test split and only on the training data, so that only training data is generated and the test data contains only real matches.

Again, job characteristics are not included, as Sogeti would like to know whether a person is suited for a job before they are actually matched. The binary tasks are split into job types, meaning that there is one model per job type to predict hired and not-hired cases. This is done because the patterns of the features within one job type segment can differ from those in others, making it easier to train models if they are separate. However, there is less data to train on. The SMOTE approach is used on each of the training-sets in an attempt to solve this.

The division of the unbalanced data is visible in Figure 1. One can see that the accuracy of the unbalanced training-set with an unbalanced test-set should exceed 40%. The division of the ’best suitable’ job types based on the ratio is depicted in Figure 2, and the unbalanced training and testing procedure should result in an accuracy of at least 53%. For the balanced sets, the accuracy should be better than the random 20%. The binary models use a balanced training-set and are then tested on unbalanced test-sets. Because this data is so skewed for every job type, no unbalanced training-set is created. For the binary models, the f1 needs to be as high as possible and the accuracy preferably higher than 53% (random).
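The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the imbalanced-learn implementation cited above: the function names, the default `k`, and the fixed random seed are assumptions, and classes with a single row are left untouched.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the line segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance_training_set(X_train, y_train, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for x, y in zip(X_train, y_train):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = list(X_train), list(y_train)
    for label, rows in by_class.items():
        new_rows = smote(rows, target - len(rows), seed=seed) if len(rows) > 1 else []
        X_bal.extend(new_rows)
        y_bal.extend([label] * len(new_rows))
    return X_bal, y_bal
```

Applying `balance_training_set` only to the training split keeps the test split free of synthetic rows, in line with the procedure described above.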

Figure 1: Division of job type labels over the dataset of all positive cases

4.1.2 Models.

For every multi-class classification task, a baseline model is run on the balanced and unbalanced training-sets and tested on the unbalanced test-sets. No validation-set is used because of the size of the datasets. The baseline model that is used is logistic regression from the sklearn library. Several other models were used in the same way as the logistic regression model, but were slightly optimized using Grid Search with a 10-fold cross-validation. The Grid Search is based on the fit with the balanced data and validated on balanced data as well. A train/validation-split was made to perform the cross-validation Grid Search, to make sure the model is optimized before it has encountered the test data.

Figure 2: Division of most suitable job type labels over the dataset of non duplicate positive cases

• BernoulliNB: alpha = [0, 0.1, 0.3, 0.4, 0.5, 1, 1.5]

• Support Vector Classifier (SVC): C = [0.1, 1, 10, 20, 30, 40, 100]

• DecisionTreeClassifier (DS): min_samples_split = [0.1, 0.5, 0.9], min_samples_leaf = [1, 2, 5, 10]

• GradientBoostingClassifier (GBC): learning_rate = [0.1, 0.5, 1.0], max_features = [0.1, 0.5, 1.0], warm_start = True

• Linear SVC: penalty = [l1, l2], loss = squared_hinge, C = [0.1, 0.5, 1.0]

• KNN: n_neighbors = [1, 2, 3, 4, 5, 10], leaf_size = [1, 10, 20, 50, 100]
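The grid search with cross-validation over parameter lists like those above can be illustrated from scratch. The sketch below is a simplified stand-in for sklearn's Grid Search, using a tiny K-Nearest Neighbors classifier; the function names, the interleaved fold assignment, and the use of accuracy as the selection score are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def knn_predict(X_train, y_train, x, n_neighbors):
    """Majority vote among the n_neighbors closest training rows."""
    nearest = sorted(
        zip(X_train, y_train),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, n_folds, **params):
    """Accuracy over all held-out rows, with row i assigned to fold i % n_folds."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        X_tr = [x for i, x in enumerate(X) if i not in test_idx]
        y_tr = [t for i, t in enumerate(y) if i not in test_idx]
        correct += sum(
            knn_predict(X_tr, y_tr, X[i], **params) == y[i] for i in test_idx
        )
    return correct / len(X)


def grid_search(X, y, param_grid, n_folds=10):
    """Try every parameter combination and keep the best cross-validated score."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(X, y, n_folds, **params)
        if score > best_score:  # the first combination wins ties
            best_params, best_score = params, score
    return best_params, best_score
```

For instance, `grid_search(X, y, {"n_neighbors": [1, 3, 5]}, n_folds=10)` returns the neighbor count with the highest cross-validated accuracy together with that accuracy.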

The binary models for every job type used a default logistic regression model as a baseline model, as well as other models from the sklearn library that are optimized in the same way as the multi-class classification models. All models for the three tasks were based on previous research as mentioned in Section 2 and by Müller et al. (2016).

Apart from the sklearn models, a Neural Network (NN) from keras is introduced for both tasks to optimize the predictions, because some models did not perform well in intermediate measures. The multi-class classification tasks worked with one-hot encoded labels and a softmax. For the normal multi-class classification, the NN was compiled using a categorical cross-entropy loss and optimized with rmsprop. The activation function for the layers was relu, with softmax in the output layer. Furthermore, the fit of the balanced and unbalanced data is done with 400 epochs and a mini-batch size of 512. For the ratio NN, the same loss, activation functions, and optimizer are used. The unbalanced and balanced training-sets both used 300 epochs, with a mini-batch size of 512 and 128 respectively. The binary NN models worked with the same loss and optimizer as the multi-class classification models, and a softmax activation was used with one-hot encoded labels. 300 epochs were run with a mini-batch size of 128. The NNs and their parameters can be found in the Appendix, Figures 3 to 7.
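The output layer and loss named here can be written out in a few lines. The sketch below shows the standard softmax and categorical cross-entropy definitions for a single sample with a one-hot label; it is a worked illustration of the formulas, not the keras implementation, and the `eps` clipping value is an assumption added for numerical safety.

```python
import math


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))
```

With five job types and equal scores for every class, the loss equals log(5), which is the value an uninformed model would start from.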

The best model is used first to predict the job type an employee is most suited for (multi-class); afterward, the model belonging to that job type is used to predict whether the employee will be hired for a job within that job segment. For each binary classification, a ranking is made using the prediction probability belonging to an employee number, for Sogeti to use. To determine which features are most predictive for the task, the weights and coefficients of the features in the baseline model are analyzed. The coefficients are calculated and ranked for each label, so that a top three (multi-class) or top five (binary) can be extracted. The outcome of this analysis is presented in the next section.
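Both ranking steps can be sketched with plain sorting. The helper names and the input layout are hypothetical, and ranking features by the largest positive coefficient (rather than by absolute value) is an assumption, since the text does not specify which was used.

```python
def top_features(coefficients, feature_names, k=3):
    """Names of the k features with the largest coefficients for one label."""
    ranked = sorted(
        zip(feature_names, coefficients),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


def rank_employees(hire_probabilities):
    """Employee numbers ordered from most to least likely to be hired.

    hire_probabilities: {employee_number: predicted probability of 'hired'}.
    """
    return sorted(hire_probabilities, key=hire_probabilities.get, reverse=True)
```

Running `top_features` once per label on the baseline model's coefficient rows gives the per-class top three or top five described above.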

5 RESULTS

5.1 RQ1: multi-class classification

With the multi-class classification approach, the first research question will be answered. Given a description of an applicant, what job type are they best suited for? What kind of model performs best for this multi-class purpose? To answer these questions, two approaches were introduced: the normal multi-class classification and the ratio multi-class classification. When looking at the results for the multi-class classification in Table 1, one can see that the best model for the balanced training-set is the DS (accuracy = 32%). The unbalanced training-set has the best accuracy (37%) when an SVC is used. For the ratio multi-class classification (Table 2), the best models for the balanced and unbalanced dataset are the NN (40%) and SVC (58%) respectively. The optimized parameters can be found in the column ’parameters’ in the tables or in the NN Figures corresponding with the model (see Appendix).

5.2 RQ2: binary classification

Given a description of an applicant and the job type they are best suited for, how likely are they to be hired for this particular job type? What kind of model performs best for this binary classification purpose? The best models for the binary classification can be found in Tables 3 and 4. The choice of the best model depends heavily on the f1 score, as this determines the overall performance of the algorithm. For the analyst task, the NN model performed best with an f1 of .53. The model can be found in Appendix Figure 3. The adviser task resulted in an f1 of .19 versus an f1 of .12 for the LR baseline. The best model for the manager task was an NN as well, with an f1 of .33. An f1 score of .23 was found for the DS on the creator task, and an f1 of .35 for the organizer task with a KNN model. All parameters can be found in Tables 3 and 4.
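A small worked example shows why the f1 score is more informative than accuracy on data this skewed. The sketch uses the standard definitions of precision, recall, and f1 for the positive (hired) class; the toy labels are invented for illustration and do not come from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): 1 hired case out of 10, and a model that
# always predicts "not hired".
Y_TRUE = [1] + [0] * 9
Y_ALWAYS_REJECT = [0] * 10
```

On this toy data the always-reject model reaches an accuracy of 0.9 but an f1 of 0, because it never identifies a hired case.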

5.3 RQ3: feature importance

What skills of an applicant are most predictive for being hired for a job type? To answer this question, the weights/coefficients of the skills (features) used have to be analyzed. For the normal multi-class classification, the coefficients of the logistic regression baseline were calculated for the balanced and unbalanced dataset respectively. Each class has a different top three of most predictive features (see Table 5). In this classification task, 202 features were not used by the model: they received a coefficient of zero. The multi-class classification based on the ratio for best job type resulted in the feature importance in Table 6. For this logistic regression baseline model, 259 coefficients were equal to zero and do not contribute to the prediction.

The most predictive features for the binary classification baseline model are documented in Table 7. In the organizer class, 342 features had a zero weight; in the other classes, this was the case for 221 (analyst), 402 (adviser), 429 (manager), and 501 (creator) features. None of the zero-weight features were engineered by this study.
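The analysis above can be sketched as follows: given the coefficients of one class of a fitted linear model, rank the features and count the zero weights. The feature names and values are hypothetical placeholders, and ranking by absolute value is an assumption, since the text does not state whether the sign of a coefficient was taken into account.

```python
from typing import Dict, List, Tuple


def top_features(coefs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k features with the largest absolute coefficient (assumed ranking)."""
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


def count_zero_weights(coefs: Dict[str, float], tol: float = 1e-12) -> int:
    """Count features whose coefficient is (numerically) zero."""
    return sum(1 for weight in coefs.values() if abs(weight) <= tol)


# Hypothetical coefficients for one job type; not values from this study.
example_coefs = {
    "skill_a": 1.4,
    "skill_b": -0.9,
    "skill_c": 0.7,
    "skill_d": 0.0,
    "skill_e": 0.2,
    "skill_f": 0.0,
}

print(top_features(example_coefs))
print(count_zero_weights(example_coefs))
```

With a fitted scikit-learn logistic regression, such a dictionary could be built per class from the rows of the model's `coef_` attribute together with the feature names.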

6 DISCUSSION

One of the limitations of this research was the size of the datasets. The dataset of all matches was 11556 rows long, but only 1607 rows were hired cases, and only 814 rows remained when duplicates were removed for the ratio dataset. This is a limitation of the real-life problem: not everyone gets hired over a span of three years. Furthermore, in some job type segments, there are more job offers and more people get hired than in others. To solve these problems for training, the datasets were balanced over job types and hired/not-hired labels for multi-class classification and binary classification respectively. In the end, however, more data was still needed; a small number of hired employees is a reality of this case study.
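As an illustration of what balancing over labels involves, the sketch below randomly oversamples the smaller classes until every label occurs equally often. The text does not state which balancing method was used, so random oversampling, the `(description, label)` row layout, and the example rows are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # hypothetical layout: (description, label)


def oversample(rows: List[Row], seed: int = 0) -> List[Row]:
    """Duplicate random rows of the smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced: List[Row] = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training rows: 6 rejected and 2 hired applicants.
train_rows = [(f"description {i}", "rejected") for i in range(6)]
train_rows += [(f"description {i}", "hired") for i in range(6, 8)]

balanced = oversample(train_rows)
print(len(balanced))
```

Balancing of this kind is normally applied to the training split only, so that the test-set keeps the label distribution of the real data.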

This research is scalable to all sizes of the dataset. However, the algorithms will become slower in terms of execution time. Outside of Sogeti, the algorithms may not be the best ones to solve the matching problem, and the data format will form a preprocessing issue. In conclusion, this research is scalable within the limits of the data format and context. This research can be reproduced by others, but the dataset is only available to employees. Apart from the accessibility of the data, the data will be updated as more people leave or join the company and more employees are matched to a job and are hired or rejected. When the size of the dataset has increased or decreased, the number of words within the descriptions of jobs and employees will have changed as well. This means that the guided LDA will perform differently. The terms to collect the seed words can be the same (e.g. manager, extravert), but the seeds could differ from the ones in this research. Lastly, the outcome of the guided LDA specifying the jobs that belong to a specific job type or words that belong to a personality trait will probably differ as well.

As for the overlapping clusters, this could be due to the extraction of the top ten important words using TF-IDF scores and the extraction of everything but neutral words to construct the seed words for the guided LDA model and its output. The seeds have overlapping words because the same word occurs in the cosine similarity list for both job type and personality trait indicators. If overlapping words are removed during the selection of seed words, overlap could perhaps be avoided, but several indicative words would be deleted. As a solution, the output of the guided LDA was de-duplicated, but the silhouette score remained close to zero, indicating that the clusters still overlapped. To remove the overlap, more attention needs to be directed at the guided LDA input and model. Another issue could be that the seed words were not applicable to the dataset. The seeds are a means to steer the topic clustering towards an underlying topic the researcher deems representative (Jagarlamudi et al., 2012). However,


Table 1: Models with accuracy for (b) balanced and (unb) unbalanced dataset in the multi-class classification task with all rows used

| Model | parameters | train acc b | acc b | train acc unb | acc unb |
|---|---|---|---|---|---|
| Dataset | | - | 20% | - | 40% |
| LR (baseline) | default | 21% | 22% | 67% | 27% |
| BernoulliNB | alpha = 0 | 22% | 20% | 54% | 25% |
| SVC | C = 0.1 | 21% | 20% | 41% | 37% |
| DS | min_leafs = 10, min_split = 0.9 | 36% | 32% | 43% | 36% |
| GBC | learning_rate = 0.5, max_features = 0.5 | 22% | 21% | 68% | 30% |
| KNN | default | 23% | 22% | 52% | 30% |
| Keras NN | Appendix Figure 4, 5 | 22% | 17% | 59% | 30% |

Table 2: Models with accuracy for (b) balanced and (unb) unbalanced dataset in the multi-class classification task ratio

| Model | parameters | train acc b | acc b | train acc unb | acc unb |
|---|---|---|---|---|---|
| Dataset | | - | 20% | - | 53% |
| LR (baseline) | default | 21% | 25% | 91% | 28% |
| KNN | default | 26% | 25% | 55% | 43% |
| BernoulliNB | alpha = 0.1 | 20% | 19% | 70% | 29% |
| SVC | C = 1 | 18% | 10% | 67% | 58% |
| DS | min_leafs = 5, min_split = 0.5 | 16% | 15% | 55% | 55% |
| GBC | learning_rate = 1, max_features = 0.1 | 21% | 17% | 99% | 40% |
| Keras NN | Appendix Figure 6, 7 | 38% | 40% | 60% | 50% |

Table 3: Performance of analyst, adviser, and manager models for the binary classification task

| Model | Analyst parameters | Analyst acc | Analyst f1 | Adviser parameters | Adviser acc | Adviser f1 | Manager parameters | Manager acc | Manager f1 |
|---|---|---|---|---|---|---|---|---|---|
| LR (baseline) | default | 36% | .23 | default | 42% | .12 | default | 42% | .24 |
| SVC | C = 1 | 37% | .23 | C = 100 | 46% | .13 | C = 30 | 45% | .22 |
| DS | min_samples_leaf = 1, min_samples_split = 0.9 | 42% | .31 | min_samples_leaf = 1, min_samples_split = 0.5 | 62% | .16 | min_samples_leaf = 1, min_samples_split = 0.5 | 58% | .24 |
| Linear SVC | C = 1, max_iter = 1000, penalty = L1 | 36% | .25 | C = 0.1, max_iter = 500, penalty = L1 | 42% | .13 | C = 0.1, max_iter = 500, penalty = L1 | 39% | .23 |
| GBC | learning_rate = 0.1, max_features = 0.5 | 35% | .10 | learning_rate = 0.5, max_features = 0.1 | 55% | .16 | learning_rate = 0.1, max_features = 1 | 46% | .19 |
| KNN | n_leaf = 1, n_neighbors = 5 | 33% | .29 | n_leaf = 100, n_neighbors = 3 | 32% | .16 | n_leaf = 20, n_neighbors = 3 | 30% | .29 |
| Keras NN | Appendix Figure 3 | 46% | .48 | Appendix Figure 3 | 48% | .21 | Appendix Figure 3 | 45% | .30 |

it could have been the case that the single words in the filtered and cleaned descriptions did not lend themselves to this kind of topic clustering. This could be investigated in further research.
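To make the silhouette argument concrete, the sketch below computes the mean silhouette score for a small set of points. It is a plain-Python illustration with Euclidean distance and hypothetical two-dimensional points, not the pipeline used in this research. Scores near 1 indicate well-separated clusters, and scores around or below zero indicate overlapping clusters.

```python
import math
from typing import Dict, List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points: List[Sequence[float]], labels: List[int]) -> float:
    """Mean of (b - a) / max(a, b) over all points.

    a = mean distance to the other points in the same cluster,
    b = smallest mean distance to the points of any other cluster.
    """
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points: two clearly separated clusters ...
separated = silhouette_score([(0, 0), (0, 1), (10, 10), (10, 11)], [0, 0, 1, 1])
# ... and two clusters whose members are interleaved.
overlapping = silhouette_score([(0, 0), (1, 0), (0, 1), (1, 1)], [0, 1, 1, 0])
print(round(separated, 2), round(overlapping, 2))
```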

The models did not perform as well as hoped. This could be due to the skewness of the original data, possibly missing features about personal information, or the engineered features. Perhaps the decisions of the clients were based on something that was not included in the dataset or could not be extracted, as is discussed above. Most models nevertheless performed better than the baseline, but a note has to be made about the Neural Network models, which performed exceptionally well compared to the other models. Their predictions came close to the majority of the labels in the unbalanced test-set. This means that sometimes only one class was predicted, with some minor fluctuations. Furthermore, the same NN model was used for all binary classification tasks of job types. On the one hand, the classification tasks were the same, with the same labels and the same kind of input. On the other hand, the model could have performed better if it had been optimized for each job type separately. Due to time limitations, this is an issue for further research.
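A simple check for the majority-class behaviour described above is to compare a model's accuracy with the accuracy of always predicting the most frequent label, and to inspect how the predictions are distributed over the classes. The labels and predictions below are hypothetical and serve only as a sketch of this check.

```python
from collections import Counter
from typing import Dict, List


def majority_baseline_accuracy(y_true: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent true label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def predicted_share(y_pred: List[str]) -> Dict[str, float]:
    """Fraction of predictions that falls on each label."""
    counts = Counter(y_pred)
    return {label: n / len(y_pred) for label, n in counts.items()}


# Hypothetical unbalanced test-set and model output.
y_true = ["analyst"] * 4 + ["adviser"] * 3 + ["manager"] * 3
y_pred = ["analyst"] * 9 + ["adviser"]

print(majority_baseline_accuracy(y_true))
print(accuracy(y_true, y_pred))
print(predicted_share(y_pred))
```

In this example the model reaches the same accuracy as the majority baseline while placing 90% of its predictions on a single class, which is the pattern to watch for.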

A comparison needs to be made between the ratio and normal multi-class classification tasks. For both multi-class tasks, different models were optimized to research the difference between training a model when there are duplicate rows with different labels and when a ’best’ ratio is selected, which removes the duplicate rows. Both sets were balanced, but the ratio dataset was left with fewer rows to train on. As can be learned from Tables 1 and 2, for the balanced sets, the ratio dataset performed better overall, with higher accuracy scores than the normal multi-class dataset with the duplicates. For the unbalanced


Table 4: Performance of creator and organizer models for the binary classification task

| Model | Creator parameters | Creator acc | Creator f1 | Organizer parameters | Organizer acc | Organizer f1 |
|---|---|---|---|---|---|---|
| LR (baseline) | default | 45% | .18 | default | 48% | .25 |
| SVC | C = 40 | 50% | .18 | C = 10 | 48% | .25 |
| DS | min_samples_leaf = 1, min_samples_split = 0.9 | 42% | .31 | min_samples_leaf = 1, min_samples_split = 0.5 | 62% | .16 |
| Linear SVC | C = 0.5, max_iter = 500, penalty = L1 | 45% | .18 | C = 0.1, max_iter = 1000, penalty = L1 | 49% | .28 |
| GBC | learning_rate = 0.5, max_features = 0.1 | 51% | .16 | learning_rate = 0.5, max_features = 0.1 | 48% | .15 |
| KNN | n_leaf = 1, n_neighbors = 3 | 25% | .20 | n_leaf = 20, n_neighbors = 5 | 41% | .35 |
| Keras NN | Appendix Figure 3 | 46% | .30 | Appendix Figure 3 | 60% | .17 |

Table 5: Most predictive features for balanced (b) and unbalanced (unb) Logistic Regression multi-class classification baseline model.

| Analyst (b) | Analyst (unb) | Adviser (b) | Adviser (unb) | Manager (b) | Manager (unb) | Creator (b) | Creator (unb) | Organizer (b) | Organizer (unb) |
|---|---|---|---|---|---|---|---|---|---|
| verzeker | IA | IA | IA | sharepoint | consultancy | IA | extraversie | lean yellow | lean yellow |
| joomla | J meter | verzeker | inkoop | lean yellow | verzeker | SQL 2016 | opleiding | UNIX | resource |
| onderwijs | GIS | obs techn | voorraad | framework | medish | graf. industrie | CCNA | kantoor | orgkun |

Table 6: Most predictive features for balanced (b) and unbalanced (unb) Logistic Regression multi-class ratio classification baseline model.

| Analyst (b) | Analyst (unb) | Adviser (b) | Adviser (unb) | Manager (b) | Manager (unb) | Creator (b) | Creator (unb) | Organizer (b) | Organizer (unb) |
|---|---|---|---|---|---|---|---|---|---|
| verzeker | J meter | lean yellow | years of experience | resource | onderwijs | onderwijs | extraversie | UNIX | resource |
| graf. industrie | graf. industrie | Jquery | inkoop | software - house | arbo | Jquery | years of experience | inkoop | gestrest |
| kwaliteit | IA | onderwijs | voorraad | fortran | orgkun | gestrest | MIS | OO | vwo |

Table 7: Predictive features for binary Logistic Regression (baseline) models per job type

| Organizer | Analyst | Adviser | Manager | Creator |
|---|---|---|---|---|
| UNIX | relatie mgt | years work experience | years work experience | marketing |
| netwerk | IA | inkoop | vwo | EVT |
| kwaliteit | LAN | pers. mgt | GUI | lean yellow |
| inkoop | indorg | lean yellow | water | years work experience |
| GIS | OO | OO | lean yellow | novell |

training-sets, the models performed around 40% and 53% (normal and ratio respectively), which means that for the unbalanced sets there is not much of a difference in performance. However, there were more cases of overfitting on the training-set for the ratio unbalanced data than for the normal multi-class models.

Results will only be applicable to hiring and/or job type within the context of Sogeti vacancies and resumes, and no assumptions can be made outside this context. However, some details about this context are missing. Since bias-sensitive features like gender, age, or religion were not included in the dataset, it is unclear whether the algorithms are biased. Such features could have had an effect on the labeling of the data (hired or rejected) by clients of Sogeti, or on the performance of the models, since these personal features could be important for an algorithm predicting whom clients would hire. Perhaps women are hired less often than men. Although this cannot be evaluated with the current dataset, it does not mean that the bias does not exist.

6.1 For further research

As is mentioned in Section 6, the algorithm cannot spot any biases regarding gender, religion, and age, as this data is not available, but the bias may still be there. The trained models might be learning from a distorted picture, since the hiring process can depend on more than hard and soft skills. In further research, when this data is available, more attention could be given to the fairness of the binary classification and the recommendation of employees. Perhaps clients look more at these features than at soft or hard skills. This is for further research to investigate. Furthermore, I would like to suggest that Sogeti keeps track of these features as well, as this could be of importance to the company and the diversity of its employees.


This research used guided LDA to establish personality traits based on the descriptions of the employees, but there are more reliable personality metrics available within Sogeti. They have not been used in this research since the personality-test data was too sensitive. For further research, I would suggest validating the guided LDA with the Sogeti personality test outcomes and checking what contributes to a better model and matching algorithm. Lastly, the scores could be used in the models to check to what extent they are predictive of matching an employee to a job type.

To build the guided LDA model and the clusters, only single words were used: those among the top 10 most important words according to TF-IDF scores that also had a score above 0.1 for subjectivity and polarity. In further research, one ought to investigate the impact of n-grams on the clustering, as n-grams, in general, provide more context. Furthermore, more perspectives should be gathered in the field of subjectivity and polarity in the description of people indicating their personality. The context can also be specified further by a similarity score between the vacancy and the employee description. Since this was not in the scope of this research, it would be an addition to the matching process in future research.
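The two suggestions above, n-grams and a similarity score between vacancy and employee description, can be sketched together. The example below builds word n-gram counts and compares texts with cosine similarity. The texts are hypothetical, and raw counts are used instead of TF-IDF weights to keep the sketch short.

```python
import math
from collections import Counter
from typing import List


def word_ngrams(text: str, n_max: int = 2) -> List[str]:
    """All word n-grams of length 1 up to n_max, in lower case."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a: str, text_b: str, n_max: int) -> float:
    return cosine_similarity(
        Counter(word_ngrams(text_a, n_max)), Counter(word_ngrams(text_b, n_max))
    )


# Hypothetical vacancy and employee descriptions.
vacancy = "java developer"
employee_a = "java developer"
employee_b = "developer java"

# With single words only, both employees look identical to the vacancy.
uni_a, uni_b = similarity(vacancy, employee_a, 1), similarity(vacancy, employee_b, 1)
# With bigrams added, word order (context) starts to matter.
bi_a, bi_b = similarity(vacancy, employee_a, 2), similarity(vacancy, employee_b, 2)
print(uni_a, uni_b, bi_a, bi_b)
```

With single words the two employee descriptions are indistinguishable, while adding bigrams lowers the similarity of the description in which the words appear in a different order.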

Lastly, this research provides a ranking of people that are most likely to be hired for a job type, not recommending employees for a particular vacancy. Future research ought to look into the recommendation of employees to specific jobs based on soft and hard skills so that matching employees to jobs can be even more specific.

7 CONCLUSION

We aim to answer how we can optimize the matching of hard and soft skills of Sogeti employees to in-house vacancies.

In conclusion, the designed models for both tasks (multi-class and binary classification) would need to perform better for Sogeti to make outstanding predictions that answer the sub-questions:

• RQ1: Given a description of an applicant, what job type are they best suited for? What kind of model performs best for this multi-class purpose?

• RQ2: Given a description of an applicant and the job type they are best suited for, how likely are they to be hired for this particular job type? What kind of model performs best for this binary classification purpose?

Nevertheless, the models found for both tasks did perform better than the logistic regression baseline, with some minor exceptions. With a 32% accuracy for the balanced data and a 37% accuracy for the unbalanced data, a job type can be matched to an applicant based only on their hard and soft skills, using all the rows in the hired dataset. This answers RQ1 for the normal multi-class classification task. The models that were used were a DS and an SVC respectively. For the ratio of best job types for a person in the hired dataset, an accuracy of 40% and 58% is reached for the balanced NN and the unbalanced SVC respectively. This answers RQ1 for the ratio multi-class classification task. Based on these predictions, the task was split into five parts to answer RQ2: would an applicant get hired for a particular job type based on their hard and soft skills? The models that perform best are:

• Analyst: NN (.48 f1)

• Adviser: NN (.21 f1)

• Manager: NN (.30 f1)

• Creator: NN (.30 f1)

• Organizer: KNN (.35 f1)

The skills most predictive for being hired for a job type are presented in Table 7. No personality trait is in the top five of the best predictive features, so based on this outcome a conclusion can be that hard skills are most predictive of being hired for a job type specified by and within Sogeti. However, this is not completely the case for the matching of people to a job type, where ’extravert’ is mentioned as one of the most predictive features in the category ’creator’. This answers RQ3:

• RQ3: What skills of an applicant are most predictive for being hired for a job type?

This research could be a start for Sogeti to optimize its matching procedures based on hard and soft skills within the company in such a way that it will reduce time and costs. However, the matching and optimizing process depends on much more than just the models and algorithms: the fitness, the amount, and the diversity of the data in real-life situations are of crucial importance as well.

8 ACKNOWLEDGEMENTS

I want to thank Ana Lucic and Sanne Bouwman for their ongoing support and help during my thesis. Furthermore, I want to thank Paul Verhaar for sharing his practical and hands-on experience with me when I needed it, and Sogeti Netherlands for the learning experience and opportunity to perform a case study in the field.

REFERENCES

Bakar, A. A. and Ting, C.-Y. (2011). Soft skills recommendation systems for it jobs: A bayesian network approach.

Barrick, M. R. and Mount, M. K. (1991). The big five personality dimensions and job performance: a meta-analysis.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357.

CLiPS (2011). pattern.nl. https://www.clips.uantwerpen.be/pages/pattern-nl.

Dastin, J. (2018). Amazon scraps secret ai recruiting tool that showed bias against women. Retrieved from: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G.

De Wolf, I. and Van der Velden, R. (2001). Selection processes for three types of academic jobs. an experiment among dutch employers of social sciences graduates. European Sociological Review, 17(3):317–330.

freeCodeCamp.org (2017). How we changed unsupervised lda to semi-supervised guidedlda. https://www.freecodecamp.org/news/how-we-changed-unsupervised-lda-to-semi-supervised-guidedlda-e36a95f3a164/.


Higgins, C. A. and Judge, T. A. (2004). The effect of applicant influence tactics on recruiter perceptions of fit and hiring recommendations: a field study. Journal of Applied Psychology, 89(4):622.

Hong, W.-C., Pai, P.-F., Huang, Y.-Y., and Yang, S.-L. (2005). Application of support vector machines in predicting employee turnover based on job performance. pages 668–674.

Jagarlamudi, J., Daumé III, H., and Udupa, R. (2012). Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics.

Lemaitre, G., Nogueira, F., Aridas, C., and Oliveira, D. (2017). 2. over-sampling. https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html.

Li, Y.-M., Lai, C.-Y., and Kao, C.-P. (2008). Incorporate personality trait with support vector machine to acquire quality matching of personnel recruitment. pages 1–11.

Liu, Y., Wang, J., and Jiang, Y. (2016). Pt-lda: A latent variable model to predict personality traits of social network users. Neurocomputing, 210:155–163.

Müller, A. C., Guido, S., et al. (2016). Introduction to machine learning with Python: a guide for data scientists. O’Reilly Media, Inc.

Periatt, J. A., Chakrabarty, S., and Lemay, S. A. (2007). Using personality traits to select customer-oriented logistics personnel. Transportation Journal, pages 22–37.

Rajaraman, A. and Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.

Rivera, L. A. (2012). Hiring as cultural matching: The case of elite professional service firms. American sociological review, 77(6):999–1022.

Rosenberg, M., Confessore, N., and Cadwalladr, C. (2018). How trump consultants exploited the facebook data of millions. Retrieved from: https://www.nytimes.com/2018/03/17/us/politics/cambridge-analytica-trump-campaign.html.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.

scikit-learn.org (2019). sklearn.metrics.silhouette_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html.

Seedlink (2017). Using seedlink ai to measure candidates’ cultural fit to the company. The Importance of Culture Fit and How AI Can Help. Retrieved from: https://www.seedlinktech.com/en/articles/using-seedlink-ai-to-measure-candidates-cultural-fit-to-the-company/.

Textkernel (2019). This is how jobfeed by textkernel works. Textkernel BV. https://www.textkernel.com/how-jobfeed-by-textkernel-works/.

Tulkens, S., Emmery, C., and Daelemans, W. (2016). Evaluating unsupervised dutch word embeddings as a linguistic resource. arXiv preprint arXiv:1607.00225.

Appendices

A KERAS NEURAL NETWORK PARAMETERS

Figure 3: Keras Neural Net model for binary classification

Figure 4: Keras Neural Net model for multi-class classification with balanced dataset

Figure 5: Keras Neural Net model for multi-class classification with unbalanced dataset

Figure 6: Keras Neural Net model for multi-class ratio classification with balanced dataset


Figure 7: Keras Neural Net model for multi-class ratio classification with unbalanced dataset