Predicting salary of job adverts


Submitted in partial fulfillment for the degree of Master of Science

Roma Bakhyshov

12124931

Master Information Studies, Data Science

Faculty of Science, University of Amsterdam

2019-06-25

| | Internal Supervisor | 2nd Internal Supervisor | External Supervisor |
|---|---|---|---|
| Title, Name | Chen, Yifan | Ziming Li | Pim den Hertog |
| Affiliation | UvA, FNWI | UvA, FNWI | Dialogic |


ABSTRACT

The main goal of this paper is to assess whether the stated salary can be sufficiently predicted using multiple features found in a job advert placed on a job vacancy website. The classification model is constructed using a natural language processing (NLP) approach which combines various features with the job description for this task. The results show that salary can indeed be sufficiently predicted from the job advert features. Moreover, we find that the job description often incorporates information found in the other features, making the job description sufficient for effective salary prediction by itself. The best performance is achieved using a combination of one hot encoded features and a Doc2Vec representation of the job description, achieving 0.84 precision, 0.84 recall, 0.84 F1 and 0.85 Cohen's Quadratic Kappa.

1 INTRODUCTION

Accurate recruitment of employees is a key element for any business due to its impact on productivity and competitiveness[7][18]. A wrong hire can cost a company up to double the yearly salary of an employee[12]. Moreover, rapid technological change and the accompanying changes in required skills are said to be partially responsible for the rise in the number of job changes the average worker makes in a lifetime[21].

Job board platforms facilitate multiple parts of the recruitment process, from candidate ranking, resume summarisation and feedback to market analysis[20]. One feature a platform may want to implement is a salary suggestion mechanism. The proposed salary suggestion mechanism uses the other features of a job advert to suggest a salary range based on the other job adverts on the website. Traditionally, the salary indication is filled in by the HR department when posting a job offer on the website. Given the sensitive nature of salary information, the field is sometimes left blank when the job is posted[15]. Yet compensation is found to be the most important factor for job seekers when looking for new opportunities[6][14].

Improving the rate of salary indication on the platform may improve the job seekers' experience, reduce asymmetry of compensation between companies, and prevent overpaying as well as employee attrition caused by below-market salaries[15]. Such a system may also yield new insights into the job market and into the features that are most predictive of a certain salary bracket[15]. The reduced cost of computing power, combined with recent improvements in machine learning and natural language processing, may provide an automated method for large scale data classification[19]. With big data algorithms already automating document annotation, summarisation and machine translation, this paper uses modern machine learning methods to assess whether job advert data can be used for automatic salary range classification. This is done using the following research question: How accurately can job advert features predict the stated salary amount? To answer the research question we use the following sub-questions.

• What multi class classification performance can be achieved using job advert features excluding job description?

• How to extract valuable information from the job description?

• How much can performance be improved by adding job description feature to the model?

This paper is structured into several parts. After the introduction, the related work relevant to the research questions is introduced and used to form the hypotheses. Following the related work section, the methods used for feature extraction as well as the classification approaches are presented. Next, the data-set is described, as well as the cleaning process used in our approach. Afterwards the software, specific packages and evaluation metrics used to measure the performance of the classifiers are specified. Finally the results are presented, which provide answers to the hypotheses and the research questions, with the discussion and conclusion summarising the findings and discussing possible future research.

2 RELATED WORK

2.1 Factors influencing salary

The literature focusing on determinants of salary has found multiple societal and macroeconomic factors. This paper will only focus on a few significant determinants that are available to us, as many factors are employee related and we need to select relevant features available from the data-set to be used in the predictive model.

One of the main factors determining salary is the geographical location of the company[10]. Companies tend to locate where talent is located, which is often in geographically constrained places. This shortage of space increases the cost of living, requiring companies to offer higher wages to attract new talent to those locations and offset the rising cost of living[9].

Education and job training tend to positively influence an individual's productivity and, with it, the average salary of the individual[4]. Educated workers can perform a more complex set of duties that have a higher added value to the company, resulting in a higher wage. The employee's experience is found to have a similar positive effect on salary. Experienced workers are found to be more productive than less experienced workers, and to possess valuable tacit knowledge[23].

As the job title is a short description of the job, its domain and its level of responsibility, it was found to be helpful in salary prediction tasks, and was added to the model as well[8].

Finally, literature shows that there are significant differences in salary between specific skills of workers[22]. This is mainly dependent on the possible added value of the skill and availability of skilled workers on the market.

The geographical location, education, experience and job title features could be found directly within the data-set. The job description was used as a proxy for skills, as we assume that the required skills for a specific job are explicitly stated within the job description.


2.2 Multi class classification with employment data

Automatic classification of textual documents has played an important role in numerous industrial and academic applications within the data mining literature. Currently, the most popular classification approaches use support vector machines (SVM) and neural networks (NN)[17]. LinkedIn effectively used an SVM approach to tackle a job title classification problem using the short description text[2]. Yun Zhu's[24] work on LinkedIn data showed that a cascade system that includes job descriptions as well as job titles improves automatic job segment classification accuracy. Roberto Boselli[5] successfully used SVMs in his WoLMIS market intelligence system to build a job advert classifier that uses ISCO codes, one of the standardized job title classifications.

2.3 Hypotheses in salary prediction

As found above, multiple features have been shown to have predictive power in determining the salary amount. Geographical location, level of experience, education and job title have all been shown to partially influence the salary. Deriving from the previous literature, we theorise that using only these features we can achieve a significant performance level in the proposed multi class classifier. Specific skills are also found to be predictive of the salary amount. Some skills are very difficult to learn, creating a shortage on the job market, while others are required in every job or only in low paying jobs. Such a skill can be knowledge of a particular programming language or framework, or experience with specific machinery. A specific skill may also be in such high demand that it overshadows the other features. As job descriptions often state the skills required of the applicant, we theorise that the job description text will not only have predictive power when added to the classifier, but will also improve the classifier's performance when combined with the other features. The features used in this research and their expected effects on the salary can be found below in figure 1.

Figure 1: Effects of job advert features on salary

3 METHODOLOGY

3.1 Feature extraction from control features: One Hot encoding

The geographical location, experience, education and job title, which are used as control features, were encoded using the one hot encoding (OHE) approach. As all of these features are categorical in nature, this was needed to correctly represent the data for the methods proposed in this paper. Using this approach, a categorical feature is re-coded into multiple columns. Each column represents one of the classes of the categorical feature and takes the value 0 or 1. A value of 1 means that the job advert belongs to that feature class, while 0 means that it does not.
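As an illustration of the encoding described above, the following minimal sketch re-codes one categorical column into 0/1 columns. The function name `one_hot_encode` and the four example values are hypothetical and not taken from the thesis code.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # 1 marks the class this advert belongs to
        rows.append(row)
    return categories, rows


# Hypothetical experience values for four adverts.
experience = ["junior", "medior", "senior", "medior"]
categories, encoded = one_hot_encode(experience)
```

Each advert ends up with exactly one 1 per encoded feature, so the column order (here alphabetical) carries no ordinal meaning.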

3.2 Feature extraction from job description: BOW and D2V

To be able to use the job description as a feature within the classifier, we needed a method to represent this data. The first often used approach is the Bag of Words (BOW) representation[11]. Here, the word corpus of all documents is represented by an M×N matrix, where each of the M rows is a specific document and each of the N columns is a specific word found in the corpus. Weighting of words within the BOW matrix was done using TF-IDF. TF stands for the term frequency within the document, while IDF stands for the inverse document frequency and reflects the importance of the term within the collection of all documents. TF-IDF gives less frequently used words a higher weighting, while more often occurring words get a lower weighting. As our hypothesis states that sparsely occurring skill terms occur mainly in a specific salary bracket, TF-IDF is one of the most suitable approaches for representing the importance of words for this task. We used log-TF to dampen the effect of higher term counts within a document, as well as the effect of differences in document length on the score. IDF was not normalized, as skills are defined by original terms and are hypothesised to carry high predictive capacity. The final score per word was calculated by multiplying each word's log-TF and IDF values.

$$\mathrm{TFIDF}_{i,d} = \log \mathrm{TF}_{i,d} \cdot \mathrm{IDF}_{i} \qquad (1)$$
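Equation (1) can be illustrated with a small pure-Python sketch. The exact variants are assumptions, since the text does not spell them out: log-TF is taken here as 1 + ln(tf), and IDF as ln(N / df) without normalization. The function name `log_tf_idf` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def log_tf_idf(documents):
    """Weight each term of each tokenised document by log-TF times IDF."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Assumed variants: log-TF = 1 + ln(tf), IDF = ln(N / df).
            term: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return weights


# Hypothetical, already cleaned job descriptions.
docs = [
    ["java", "developer", "java", "spring"],
    ["helpdesk", "developer", "support"],
    ["developer", "python", "support"],
]
scores = log_tf_idf(docs)
```

Under these assumptions a term that occurs in every document ("developer") receives weight 0, while a rare skill term such as "java" receives the highest weight in its document.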

The other method often suggested in the literature is the Document-to-vector (D2V) approach for text representation[11]. D2V is an unsupervised algorithm that generates a dense numerical vector representation from text data. Very similar to word2vec, D2V computes a feature vector of a specified size N for each document, using a shallow neural network. D2V can be used in two forms, Distributed Bag of Words (DBOW) and Distributed Memory (DM). DBOW works in a similar way to skip-gram, only replacing the input by a special token representing the document. DM, on the other hand, tries to predict a specific word based on its context, the words around it. The main advantage of this approach over TF-IDF is that, while the D2V representation is not interpretable by humans, it is much denser than TF-IDF, which often produces a very sparse matrix. And while dimensionality reduction techniques can be used to decrease the sparsity of TF-IDF, information will still be lost. For this specific task we chose the D2V DM approach, because it allows semantic similarities to be captured. When using TF-IDF or DBOW, the words "house" and "home" are interpreted as different words, because the placement of the word is not retained. The DM approach retains the placement information of a word within the text, giving the ability to account for semantic differences between texts and improving the accuracy of the classifier.

3.3 Salary classification: SVM and MLP

The first classifier considered is the Support Vector Machine (SVM). SVM is chosen because of its good performance on the classification tasks found in the related work chapter[17]. SVMs were originally designed for binary classification tasks. Here the classifier tries to separate the data-points of the two classes in the training data-set using support vectors and a loss function[13]. Hyperparameters can be used to change the number of data-points the classifier uses for its decision boundary and how much wrongly classified data-points contribute to the penalization in the loss function. Using SVM for multi class classification can be done via different methods, the most popular being one against all and one against one. The oldest method is one against all, where q SVM models are constructed and q is the number of classes. For each class a classifier is trained where one class is salary class x and the other consists of all the other classes. The one vs one method trains a separate classifier for each pair of classes. It is often found to be slower but more precise than one vs all. One vs one is often more useful for smaller data-sets with highly imbalanced classes. In this paper the one vs all approach was used, as the data-set is large and the classes are relatively well balanced.
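The one against all scheme can be sketched with scikit-learn as below. This is an illustration under assumptions, not the configuration used in the thesis: the data is synthetic, and the linear kernel, `C=1.0` and `max_iter` are placeholder choices.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_classes, n_features = 5, 8

# Synthetic stand-in data: each salary class is a cluster around its own centre.
centres = rng.normal(scale=5.0, size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), 40)
X = centres[y] + rng.normal(size=(len(y), n_features))

# One binary SVM per class: that class against all the others.
# C controls how strongly misclassified points are penalised.
clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))
clf.fit(X, y)
predictions = clf.predict(X)
```

With q = 5 salary classes this trains five binary models, whereas a one vs one scheme would train q(q-1)/2 = 10.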

Another popular classification approach that was used is the Multi Layer Perceptron (MLP)[17]. This is a class of feed-forward neural network in which the input layer receives the data, the hidden layers learn the feature importance and the output layer produces the class labels. Each layer consists of nodes (neurons) and their activation functions. Learning is done by training the network on labeled data, with back-propagation automatically learning the importance of the features within the data. MLPs are found to be very effective on large amounts of data with non-linear relationships, making the MLP a viable method for salary classification.
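The structure of such a network can be sketched as follows. Note that this uses scikit-learn's `MLPClassifier` as a stand-in, whereas the thesis built its MLP with Keras; the layer sizes, activation and synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_classes, n_features = 5, 8

# Synthetic stand-in data with one cluster per salary class.
centres = rng.normal(scale=5.0, size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), 40)
X = centres[y] + rng.normal(size=(len(y), n_features))

# Input layer -> two hidden layers (assumed sizes) -> softmax output layer.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                    solver="adam", max_iter=500, random_state=0)
mlp.fit(X, y)
probabilities = mlp.predict_proba(X)  # one probability per salary class
```

The output layer yields one probability per salary class, and the predicted class is the one with the highest probability.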

4 EXPERIMENTAL SETUP

4.1 Data

The data-set used for this task consists of all the ICT related job postings on Dutch job boards from 01-01-2014 to 31-12-2017. The data-set contains 288.856 entries from 29.514 websites, but only 31.846 entries from 3241 websites contain a salary indication. As the salary indication is used as the label, we only used these entries as the data-set. The features selected for the model are based on the related work section: only the data-set features that are found to influence the salary amount are included in the model. The features used for training the classifier are listed below in table 1.

4.1.1 Data cleaning. The salary indication is sometimes given as a single value and sometimes as a range. To standardize the data, for salaries given as a range the mean of the two values was used as the value for the classifier. Around 200 adverts were found to have a salary value of 1 million or more. These were considered outliers and removed from the data-set, as we assume that individuals earning such a salary find work via other channels and that the salary amount of these job adverts was filled in incorrectly. The education feature has only the classes HBO, MBO and WO for 85% of the values. MBO stands for the lowest type of higher education, with its main focus on physical labour job types. HBO stands for applied sciences, with a focus on service related jobs, while WO is the highest level of education, for jobs requiring university level academic education. These were also used as the classes for the final classifier. 5% are combinations of the former and were classified into the lowest stated class within the combination. Another 1% are Dutch high school diplomas: HAVO, VWO and LWO. Any high school education was classified as MBO, being the lowest class. The rest of the job adverts did not contain any education requirements. The work experience feature contained the values junior, medior or senior and had 14.7% of its entries missing.
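The salary and education rules above can be sketched in a few lines. The raw input formats ("30000-40000" for a range, "HBO/WO" for a combination) and all function names are hypothetical, since the thesis does not show how these fields are stored.

```python
EDUCATION_RANK = {"MBO": 1, "HBO": 2, "WO": 3}
SECONDARY_SCHOOL = {"HAVO", "VWO", "LWO"}


def salary_value(raw):
    """Return a single salary figure; a 'low-high' range becomes its mean."""
    parts = [float(part) for part in raw.split("-")]
    return sum(parts) / len(parts)


def is_outlier(salary, limit=1_000_000):
    """Adverts with a salary of 1 million or more are treated as outliers."""
    return salary >= limit


def education_class(raw):
    """Map a (possibly combined) education requirement to MBO, HBO or WO."""
    levels = []
    for level in raw.upper().split("/"):
        level = level.strip()
        # Any high school diploma counts as MBO, the lowest class.
        levels.append("MBO" if level in SECONDARY_SCHOOL else level)
    # A combination is assigned the lowest class it mentions.
    return min(levels, key=EDUCATION_RANK.__getitem__)
```

For example, `salary_value("30000-40000")` gives 35000.0 and `education_class("HBO/WO")` gives "HBO". An unknown education label raises a `KeyError` in this sketch rather than being silently guessed.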

To impute the missing values still present in the education and experience columns we used the mean imputation technique, in which missing entries are replaced by the mean of the column. This was done because the values were missing at random, meaning that there was no specific sub-group with its education and experience features missing, because the percentage of missing values per feature is relatively small, and because deletion would have decreased our data-set by almost 18%. Missing education values were imputed with the mean class value, in this case HBO. Missing experience values were imputed with the class medior, which is the mean class for this feature.
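For an ordinal feature, mean imputation can be read as "fill with the class whose rank is closest to the mean rank". The sketch below follows that reading; the rank encoding 1 to 3, the function name and the six example values are assumptions.

```python
EXPERIENCE_RANK = {"junior": 1, "medior": 2, "senior": 3}
RANK_TO_EXPERIENCE = {rank: name for name, rank in EXPERIENCE_RANK.items()}


def impute_with_mean_class(values, ranking, labels):
    """Replace None by the class whose rank is closest to the column mean."""
    observed = [ranking[value] for value in values if value is not None]
    mean_rank = round(sum(observed) / len(observed))
    fill = labels[mean_rank]
    return [fill if value is None else value for value in values]


# Hypothetical column with two missing entries.
experience = ["junior", "medior", None, "senior", "medior", None]
imputed = impute_with_mean_class(experience, EXPERIENCE_RANK, RANK_TO_EXPERIENCE)
```

Applied to the experience counts reported in this paper (8118 junior, 20004 medior, 3203 senior), the mean rank is about 1.84, which rounds to medior.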

The job description required cleaning as well. First the HTML was removed, as we are only interested in the job description text. Next, the text was tokenized: the job description string was split on blank spaces, giving a list of strings, each representing a word. Numbers and punctuation were then removed, as they have no predictive value for our task. Stop words were removed as well, as these do not carry any meaningful predictive power and add to the dimensionality of the final word representation. This was done using a pre-built Dutch stop word dictionary. The remaining text was normalized by removing the capitalization of words. As there were a few job adverts in languages other than Dutch, and the classifier cannot perform well with texts in different languages, we used a Python package to label each job description as Dutch or other and removed all adverts with non-Dutch job descriptions. This removed 406 job descriptions, giving 31.325 job adverts as the final number of job advertisements in the clean data-set. The flowchart describing the cleaning steps from the raw data to the final input data-frame can be found in figure 2.
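The cleaning steps can be sketched with the standard library alone. This is a simplified stand-in: a regular expression replaces the HTML parser, the stop word set is a tiny hypothetical sample rather than the pre-built Dutch dictionary, lowercasing happens before the stop word check, and the language filter is omitted.

```python
import re

# Tiny illustrative stop word sample; the thesis used a pre-built Dutch list.
STOP_WORDS = {"de", "het", "een", "en", "wij", "met", "van"}


def clean_description(html_text):
    """Strip HTML, tokenise on whitespace, drop digits, punctuation and stop words."""
    text = re.sub(r"<[^>]+>", " ", html_text)        # remove HTML tags
    cleaned = []
    for token in text.split():                       # split on blank spaces
        token = re.sub(r"[^A-Za-zÀ-ÿ]", "", token)   # drop digits and punctuation
        token = token.lower()                        # normalise capitalisation
        if token and token not in STOP_WORDS:
            cleaned.append(token)
    return cleaned


tokens = clean_description(
    "<p>Wij zoeken een Senior Java Developer met 5 jaar ervaring!</p>"
)
```

The result is a list of lowercase word tokens that can be passed on to the TF-IDF or D2V representation.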

Figure 2: Data cleaning flowchart

| Feature | Type | Example | Cleaned example | Missings |
|---|---|---|---|---|
| Company location | Nominal | Noord-Holland | 1 | 0% |
| Job Title | Nominal | Systeemanalist | 1 | 0% |
| Education requirement | Ordinal | WO | 3 | 9% |
| Work experience indicator | Ordinal | Junior | 1 | 14.7% |
| Salary indication | Count | 25.500 | 25-35 | 0% |
| Job description | Text | `<li><a><h3>Ziggo moet nieuw contract afsluiten` | [ziggo, moet, nieuw, contract, afsluiten] | 0% |

Table 1: data-set features

4.1.2 Descriptive statistics. The work experience feature contains 20004 medior, 8118 junior and 3203 senior values. The distribution of education is significantly different from the population distribution and is highly skewed toward HBO, with MBO being the smallest class. This is possibly caused by the job posting website catering only to ICT adverts, which require a higher education than the population average. The distributions of work experience and education can be seen below in figure 3.

Figure 3: Education and experience distribution

The salary feature values are normally distributed around 45.000, with a slight skew to the right. The distribution of the salary feature can be found below in figure 4.

Figure 4: Salary distribution

After cleaning, the tokens of the job descriptions were counted. The structure of each job description text varies greatly, with the number of words per job description normally distributed around 300 words, with a slight skew to the right. The distribution of the word count per job description is visualized in figure 5.

Figure 5: Job descriptions word count distribution

The final data-set was randomly split 70/20/10 into training, test and validation data-sets. The salary was split into the classes 0-25.000, 25.001-35.000, 35.001-45.000, 45.001-60.000 and 60.001+. This split is based on the salary brackets found on other websites, which are wide enough to give leeway in per person salary negotiations, while still improving the search mechanism of the website. The 70/20/10 split was chosen to keep as much data as possible for training, while still having enough data to perform validation and testing. This helps the classifier reach its maximum performance while avoiding over-fitting and under-fitting. The class features were one hot encoded, which is required for their use as input for the classification algorithms.
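The bracket assignment and the random split can be sketched as follows. The function names, the short class labels and the fixed seed are illustrative assumptions; the bracket bounds are the ones stated above.

```python
import random
from bisect import bisect_left

# Inclusive upper bounds of the first four salary classes.
UPPER_BOUNDS = [25_000, 35_000, 45_000, 60_000]
CLASS_LABELS = ["0-25", "25-35", "35-45", "45-60", "60+"]


def salary_class(salary):
    """Return the index of the salary bracket a yearly salary falls into."""
    return bisect_left(UPPER_BOUNDS, salary)


def split_70_20_10(items, seed=0):
    """Shuffle and split into training (70%), test (20%) and validation (10%)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_test = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


train, test, validation = split_70_20_10(range(100))
```

A salary of 25.500, the example from table 1, lands in class index 1, the 25-35 bracket.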

4.2 Training details

The training and testing were performed on a corporate server with a 1080 Ti GPU, using Python 3.6 with Jupyter notebooks. Specifically, sklearn¹ was used for the SVM method and to create the TF-IDF text representation, Keras² for the MLP and the Gensim³ package for creating the D2V representation of the job description. Data cleaning and formatting was done using the pandas⁴, numpy⁵, NLTK⁶ and beautifulsoup4⁷ packages. Job descriptions in languages other than Dutch were filtered out using the langdetect⁸ package. Training of the MLP classifiers was done using the Adam optimizer⁹.

¹ https://scikit-learn.org/
² https://keras.io/
³ https://radimrehurek.com/gensim/

To improve the accuracy of the classifier the following techniques were used. While our data-set is not highly unbalanced, the Gaussian distribution of salaries still causes some imbalance between the salary classes. To combat this, we tried under-sampling and oversampling. Sampling is one of the most effective techniques to combat unbalanced data-sets. In oversampling, the smaller classes are sampled relatively more often than the bigger classes, by using the data of a smaller class multiple times when training the classifier. This is useful for smaller data-sets, but can lead to overfitting on the smaller classes. Alternatively, in under-sampling the bigger classes are not sampled fully. This is preferable to oversampling, but can lead to poor accuracy with smaller data-sets, as not all the data is used to train the classifier.
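Both sampling strategies can be sketched with one helper that returns the row indices to train on. This is a plain random resampler written for illustration; the thesis does not state which sampling implementation was used, and the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def resample(labels, mode, seed=0):
    """Return row indices giving every class the same number of samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    sizes = [len(indices) for indices in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    selected = []
    for indices in by_class.values():
        if mode == "over":
            # Keep all rows and re-draw extra ones with replacement.
            extra = [rng.choice(indices) for _ in range(target - len(indices))]
            selected.extend(indices + extra)
        else:
            # Keep only a random subset of the larger classes.
            selected.extend(rng.sample(indices, target))
    return selected


# Hypothetical class labels: 2, 5 and 3 adverts per class.
labels = [0] * 2 + [1] * 5 + [2] * 3
over = resample(labels, "over")
under = resample(labels, "under")
```

Resampling of this kind would normally be applied to the training split only, so that the test and validation sets keep the original class distribution.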

Another technique that was used is extra pre-processing of the text. Different suffixes and forms of one word are seen as different words by the algorithm and increase the dimensionality of the data-set. To combat this, stemming was used on the cleaned job description text. This technique removes word endings from the root of the word and simplifies words like eating and eaten to eat, normalizing any differences in the tenses or the usage of the words. Stemming was done using an algorithmic stemmer specifically developed for the Dutch language, called the Kraaij-Pohlmann stemmer.

Finally, to see whether the high sparsity of the TF-IDF matrix affects the performance of the classifier, we used the PCA[1] and LSA[16] dimensionality reduction techniques. Highly dimensional sparse matrices have not only been shown to be computationally intensive, but were also found to hamper convergence, negatively impacting the final accuracy of the classifier. Both techniques use singular value decomposition to reduce the size of the data-set and increase the density of the matrix used as input for the classifier. While PCA aims for the best affine linear subspace on a term by term matrix, LSA seeks the best linear subspace in Frobenius norm within the document term matrix. Both were used to see whether dimensionality reduction positively influences the performance metrics.
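The difference between the two techniques can be shown with a small NumPy sketch: both project onto the top singular vectors, but PCA centres the columns first. The random matrix stands in for a TF-IDF matrix and the function names are hypothetical; at the scale of this data-set a sparse, truncated solver would be the more practical choice.

```python
import numpy as np


def truncated_svd(matrix, n_components):
    """Project rows onto the top right-singular vectors of the matrix."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return matrix @ vt[:n_components].T


def lsa(doc_term, n_components):
    # LSA: SVD applied directly to the (uncentred) document-term matrix.
    return truncated_svd(doc_term, n_components)


def pca(doc_term, n_components):
    # PCA: the same SVD after subtracting each column's mean.
    centred = doc_term - doc_term.mean(axis=0)
    return truncated_svd(centred, n_components)


# Hypothetical TF-IDF matrix: 6 documents, 10 terms, mostly zeros.
rng = np.random.default_rng(0)
tfidf = rng.random((6, 10)) * (rng.random((6, 10)) < 0.3)
reduced_lsa = lsa(tfidf, 3)
reduced_pca = pca(tfidf, 3)
```

Centring destroys the sparsity of the input matrix, which is one practical reason LSA is often preferred for large TF-IDF matrices.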

⁴ https://pandas.pydata.org/
⁵ https://www.numpy.org/
⁶ https://www.nltk.org/
⁷ https://pypi.org/project/beautifulsoup4/
⁸ https://pypi.org/project/langdetect/
⁹ https://arxiv.org/abs/1412.6980

4.3 Evaluation

As discussed in the related work section, the methods stated above were evaluated using the standard metrics: precision, recall and F1 score. Precision is the number of correctly classified positive examples for class x, the true positives (TP), divided by the number of examples labeled by the system as positive, the true positives plus the false positives (TP + FP). In our case precision represents the ratio of correctly classified job adverts over all the job adverts classified into that salary class.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

Recall is the number of correctly classified positive examples for class x (TP) divided by the number of all adverts belonging to that class in the total data-set, the true positives plus the false negatives (TP + FN). Recall shows how well the classifier is able to find all the adverts of a class within the whole data-set.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

The F1 score summarises the overall performance of the classifier. It considers both the precision and the recall of the classifier by returning the harmonic mean of the two metric values.

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (4)$$
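Equations (2) to (4) can be sketched per class, followed by the unweighted (macro) mean over classes that this paper uses to aggregate them. The function names and the six example labels are hypothetical; returning 0.0 for an empty denominator is an assumed convention.

```python
def per_class_scores(y_true, y_pred, label):
    """Return precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class scores, so every class counts equally."""
    labels = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, label) for label in labels]
    return tuple(sum(column) / len(labels) for column in zip(*scores))


# Hypothetical true and predicted salary classes for six adverts.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_precision, macro_recall, macro_f1 = macro_average(y_true, y_pred)
```

In this toy example class 1 has precision 2/3 and recall 1, giving F1 = 0.8, and the macro-averaged precision over the three classes is 13/18.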

To account for the hierarchical structure of the salary, we used the Cohen's Quadratic Kappa (Coh.QK) metric[3]. Coh.QK is defined in the formula below, where $P_{o(w)}$ is the total weighted agreement probability, or the accuracy, and $P_{e(w)}$ is the agreement probability possible due to chance alone. By taking the weighted version of the accuracy we penalize large mistakes more than smaller ones, as classifying a 0-25.000 job advert into the 25.001-35.000 salary class is a much smaller mistake than classifying it into the 60.001+ salary class.

$$\kappa_w = \frac{P_{o(w)} - P_{e(w)}}{1 - P_{e(w)}} \qquad (5)$$

Precision, recall and F1 were calculated for each class, after which a macro-average was taken as the performance of the whole method. Macro-averaging is chosen because the performance on all classes is equally important in this case, as lower frequency salary brackets should not out-perform the higher frequency salary brackets. Coh.QK was calculated for all classes at the same time. All metrics were calculated for both the SVM and MLP classifiers. For each approach only the best performing hyper-parameter setup and performance adjustment are reported in the paper, as these numbers are the most relevant for the research questions.
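Equation (5) can be sketched with NumPy as below. The quadratic agreement weights, 1 - (i - j)² / (k - 1)² for k ordered classes, are the standard choice for this metric but are an assumption here, since the text does not write the weights out; the function name is hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic agreement weights for ordered classes."""
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    observed /= observed.sum()
    # Expected joint distribution if true and predicted labels were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Agreement weight: 1 on the diagonal, falling quadratically with distance.
    i, j = np.indices((n_classes, n_classes))
    weights = 1.0 - (i - j) ** 2 / (n_classes - 1) ** 2
    p_o = (weights * observed).sum()  # weighted observed agreement
    p_e = (weights * expected).sum()  # weighted chance agreement
    return (p_o - p_e) / (1.0 - p_e)


truth = [0, 1, 2, 3, 4]
near_miss = quadratic_weighted_kappa(truth, [1, 1, 2, 3, 4], 5)
far_miss = quadratic_weighted_kappa(truth, [4, 1, 2, 3, 4], 5)
```

Misplacing the lowest-bracket advert by one class (`near_miss`) scores higher than placing it in the highest bracket (`far_miss`), which is the behaviour the weighting is meant to provide.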

5 RESULTS AND ANALYSIS

5.1 Baseline Results

The results of the SVM and MLP classifiers, as well as various text representations and performance improvement techniques used can be found in table 2 below. Overall performance can be considered very satisfactory and we can confidently say that job advert features can be effectively used for salary prediction.

| Classifier | Approach | Precision | Recall | F1-Score | Coh.QK |
|---|---|---|---|---|---|
| SVM | OHE | 0.43 | 0.29 | 0.27 | 0.37 |
| SVM | OHE+TFIDF | 0.69 | 0.65 | 0.67 | 0.72 |
| SVM | TFIDF Only | 0.69 | 0.65 | 0.67 | 0.72 |
| SVM | OHE+TFIDF(STEM) | 0.68 | 0.64 | 0.66 | 0.71 |
| SVM | OHE+TFIDF(PCA) | 0.65 | 0.61 | 0.63 | 0.69 |
| SVM | OHE+TFIDF(LSA) | 0.67 | 0.63 | 0.65 | 0.71 |
| SVM | OHE+TFIDF(PCA+STEMMED) | 0.64 | 0.60 | 0.62 | 0.69 |
| SVM | OHE+DOC2VEC(DM) | 0.81 | 0.70 | 0.71 | 0.71 |
| SVM | OHE+DOC2VEC(STEM) | 0.80 | 0.80 | 0.80 | 0.81 |
| SVM | OHE+DOC2VEC(DM+DBOW) | 0.79 | 0.76 | 0.75 | 0.72 |
| SVM | DOC2VEC(DM) Only | 0.80 | 0.78 | 0.78 | 0.80 |
| MLP | OHE | 0.43 | 0.32 | 0.32 | 0.40 |
| MLP | OHE+TFIDF | 0.67 | 0.67 | 0.67 | 0.75 |
| MLP | TFIDF Only | 0.69 | 0.66 | 0.67 | 0.73 |
| MLP | OHE+TFIDF(STEM) | 0.65 | 0.67 | 0.66 | 0.74 |
| MLP | OHE+TFIDF(PCA) | 0.66 | 0.63 | 0.63 | 0.70 |
| MLP | OHE+TFIDF(LSA) | 0.68 | 0.65 | 0.66 | 0.73 |
| MLP | OHE+TFIDF(PCA+STEMMED) | 0.65 | 0.62 | 0.63 | 0.69 |
| MLP | OHE+DOC2VEC(DM) | 0.84 | 0.84 | 0.84 | 0.85 |
| MLP | OHE+DOC2VEC(STEM) | 0.82 | 0.81 | 0.81 | 0.83 |
| MLP | OHE+DOC2VEC(DM+DBOW) | 0.81 | 0.78 | 0.79 | 0.80 |
| MLP | DOC2VEC(DM) Only | 0.80 | 0.82 | 0.81 | 0.83 |

Table 2: Performance metrics

First sub-question: What multi class classification performance can be achieved using job advert features excluding job description? The best performance using only the control variables was achieved with the MLP approach, achieving 0.43 Precision, 0.32 Recall, 0.32 F1-score and 0.40 Coh.QK. As hypothesised, the control variables were able to significantly predict the salary classes on their own. However, the performance is not far above random (a score of 0.20 for 5 classes), and the control variables alone are not considered enough for an effective classifier.

Second sub-question: How to extract valuable information from the job description? We found that using the D2V DM approach as the textual representation worked best. The text was represented using a vector of 500 features for each document, as using more features did not improve the performance of the classifier. As expected, the D2V text representation performed much better than the TF-IDF representation. The extreme sparsity of the TF-IDF job description representation may have been of influence. The other possibility is the semantics within the text. The DM approach of D2V captures the semantics of the text inside the vector representation of each document, while TF-IDF and DBOW do not. The differences in performance suggest that this may be the main factor. Overall we see that the MLP approach performs better than the SVM. This is in line with the findings in the scientific literature, as MLPs can capture many non-linear relationships between features, but are more computationally expensive.

Third sub-question: How much can performance be improved by adding job description feature to the model? Compared to only using the control features, adding the job description showed a significant increase in the evaluation metrics. This is in line with the stated hypothesis that additional information about skills would significantly improve the predictive power of the classifier. Overall, we saw an average increase of 85% for each of the evaluation metrics. The best performance of the classifier was achieved using the combination of the OHE control features and the D2V DM representation of the job description. Using this we achieved 0.84 Precision, 0.84 Recall, 0.84 F1-score and 0.85 Coh.QK. Finally, even with imbalanced classes, there is no sign of over-fitting on one specific salary class. The per class evaluation metrics of the best performing approach can be found in table 3.

| Precision | Recall | F1-score | Salary class | N adverts in data-set |
|---|---|---|---|---|
| 0.85 | 0.87 | 0.86 | -25 | 1495 |
| 0.84 | 0.84 | 0.84 | 25-35 | 5681 |
| 0.82 | 0.81 | 0.81 | 35-45 | 8451 |
| 0.83 | 0.85 | 0.84 | 45-60 | 11639 |
| 0.84 | 0.81 | 0.83 | 60+ | 4059 |

Table 3: Best approach performance per class

5.2 Performance improvements

The proposed improvements did not increase the performance of the classifier. Using oversampling and under-sampling to balance the salary classes resulted in similar performance. As TF-IDF yields a very sparse matrix, the PCA and LSA dimensionality reduction techniques were used to create a job description text vector representation with 500, 1000, 2000 and 5000 features. With 91% of variance explained, 5000 features performed best, while still not outperforming the original TF-IDF text representation. Still, we have to note that while the performance did not improve, the loss of performance can be considered negligible, while the size of the data-set and the computational requirements decreased significantly. To account for different suffixes, we used a Dutch stemmer, decreasing the TF-IDF dimensionality from 14,000 to 11,000. Similar to the previous proposed improvements, the results were underwhelming, providing similar or worse results.

6 DISCUSSION

The results stated above show that salary can be reliably predicted from the job advert features. Nonetheless, it is important to remember that the data-set used contains only ICT related job adverts, and only from the Netherlands, and thus may not be representative of all job advert boards. Future research may want to include data from other industries and from advert boards within the EU to see whether the salary can still be effectively predicted using the job advert features. Moreover, the data-set only contains job adverts from 2014 to 2017. New jobs, skills and terms may influence the performance of the classifier in the future. To preserve the high performance of the classifier, new adverts would need to be added and the model retrained to account for the creation of new skills and job types.

It is important to note that the data-set only contains adverts with a salary label. It is possible that the attributes of job adverts without a salary differ so significantly from those of adverts with the salary mentioned that the model cannot be extrapolated to those adverts. This cannot be tested, as we do not have the labels for those adverts, and labelling them with the current model would be incorrect. However, the fact that the data-set spans 2014 to 2017, encompasses all of the Netherlands and is gathered from more than 3000 different job boards should reduce this problem. A feedback form for HR personnel on the performance of the salary suggestion feature could empirically assess whether this is the case.

The stemming done in this paper used a Dutch variant of the Porter stemmer, named the Kraaij-Pohlmann stemmer. While Kraaij-Pohlmann is considered one of the best stemmers for the Dutch language, the specificity of the text and the wide usage of very specific skill terms make this stemming approach less robust in our case. Future research may want to try a hybrid approach, where a dictionary of terms that will not be stemmed is created, mitigating this problem and possibly increasing the accuracy even further.
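The hybrid approach suggested above could be sketched as follows. This is only an illustration under stated assumptions: `toy_stem` is a placeholder suffix stripper and not the Kraaij-Pohlmann algorithm, and the function `stem_with_protected_terms`, the example sentence and the protected term list are hypothetical. In practice any stemmer exposed as a function from token to stem could be passed in.

```python
def toy_stem(token):
    """Placeholder stemmer that strips a few common Dutch suffixes.
    It stands in for a real stemmer and is NOT Kraaij-Pohlmann."""
    for suffix in ("heden", "ingen", "en", "e", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_with_protected_terms(text, stem, protected):
    """Stem every token except those listed in the protected dictionary."""
    protected_lower = {term.lower() for term in protected}
    result = []
    for raw in text.lower().split():
        # Remove surrounding punctuation but keep inner characters,
        # so that terms such as "c++" or "node.js" stay intact.
        token = raw.strip(".,;:!?()\"'")
        if not token:
            continue
        result.append(token if token in protected_lower else stem(token))
    return result


# Hypothetical skill dictionary and example sentence.
skill_terms = {"Kubernetes", "Jenkins"}
sentence = "Ervaring met Kubernetes en Jenkins, kennis van databases."

stemmed = stem_with_protected_terms(sentence, toy_stem, skill_terms)
unprotected = stem_with_protected_terms(sentence, toy_stem, set())
print(stemmed)
print(unprotected)
```

Comparing the two printed lists shows how skill terms that the stemmer would otherwise truncate are kept in their original form, so they remain distinct entries in the TF-IDF vocabulary.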

While adding the job description significantly increased the performance of the classifier compared to using only the control features, the performance did not decrease significantly when using only the job description. This may be because the job description also contains words synonymous with, or directly referring to, education, geographical location, experience and the job title. Such words are implicitly captured in the textual representation and thus reduce the need for the control features. Future research could filter out the words referring to the control features used in this paper and test whether this decreases the accuracy of the classifier when using only the job description, while performance remains similar when using both control features and job description.
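The filtering experiment proposed above could start from a preprocessing step like the one below. It is a minimal sketch and was not run as part of this thesis; the function `remove_control_terms`, the example description and the list of control terms are hypothetical. Only single-word terms are handled here; multi-word job titles or place names would need phrase matching.

```python
import re


def remove_control_terms(description, control_terms):
    """Return the description with every token that matches a
    control-feature term removed (case-insensitive, single words only)."""
    terms = {term.lower() for term in control_terms}
    tokens = re.findall(r"\w+", description.lower())
    return " ".join(token for token in tokens if token not in terms)


# Hypothetical values of the control features (job title, location, education).
control_terms = {"senior", "developer", "Amsterdam", "HBO"}
description = "Senior developer in Amsterdam, HBO diploma en 5 jaar ervaring"

filtered = remove_control_terms(description, control_terms)
print(filtered)
```

Training the description-only classifier once on the original text and once on the filtered text would then indicate how much of its performance depends on these overlapping words.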

7 CONCLUSION

This paper analysed the effectiveness of several techniques for salary prediction, guided by the following research question:

How accurately can job advert features predict the stated salary amount?

The findings above suggest that a salary suggestion mechanism can be successfully implemented using job advert features. Performance of the approaches is high and well balanced between classes, although not perfect. We also found that the job description significantly increased the performance of the classifier and has been shown to be the feature with the highest predictive value. The results also suggest that the job description often contains the information captured by the other features, which were used as control features in this thesis. This means that the job description should be considered the most important feature when constructing a salary suggestion feature and can possibly be used by itself for salary prediction.

The findings of this paper contribute to the fields of machine learning and job market analysis by identifying usable features for salary suggestion, as well as by assessing the viability of a salary suggestion feature in the production environment of a job board website.

REFERENCES

[1] Hervé Abdi and Lynne J Williams. “Principal component analysis”. In: Wiley interdisciplinary reviews: computational statistics 2.4 (2010), pp. 433–459.

[2] Ron Bekkerman and Matan Gavish. “High-precision phrase-based document classification on a modern scale”. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2011, pp. 231–239.

[3] Arie Ben-David. “About the relationship between ROC curves and Cohen’s kappa”. In: Engineering Applications of Artificial Intelligence 21.6 (2008), pp. 874–882.

[4] Emily Blanchard and Gerald Willmann. “Trade, education, and the shrinking middle class”. In: Journal of International Economics 99 (2016), pp. 263–278.

[5] Roberto Boselli et al. “WoLMIS: a labor market intelligence system for classifying web job vacancies”. In: Journal of Intelligent Information Systems 51.3 (2018), pp. 477–502.

[6] CareerBuilder. “How to rethink the candidate experience and make better hires.” In: CareerBuilder (2016).

[7] Richard Desjardins and Kjell Rubenson. “An analysis of skill mismatch using direct measures of skills”. In: (2011).

[8] Sananda Dutta, Airiddha Halder, and Kousik Dasgupta. “Design of a novel Prediction Engine for predicting suitable salary for a job”. In: 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN). IEEE. 2018, pp. 275–279.

[9] Richard Florida. Cities and the creative class. Routledge, 2005.

[10] Richard Florida and Charlotta Mellander. “The geography of inequality: Difference and determinants of wage and income inequality across US metros”. In: Regional Studies 50.1 (2016), pp. 79–92.

[11] Ian Goodfellow et al. Deep learning, vol. 1. 2016.

[12] James Houran. “New HR study: Candid recruitment experiences with LinkedIn”. In: Retrieved April 20 (2017), p. 2017.

[13] Chih-Wei Hsu and Chih-Jen Lin. “A comparison of methods for multiclass support vector machines”. In: IEEE transactions on Neural Networks 13.2 (2002), pp. 415–425.

[14] Jobvite. “Job seeker nation study.” In: (2016).

[15] Krishnaram Kenthapadi et al. “Bringing salary transparency to the world: Computing robust compensation insights via LinkedIn Salary”. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM. 2017, pp. 447–455.

[16] Thomas K Landauer, Peter W Foltz, and Darrell Laham. “An introduction to latent semantic analysis”. In: Discourse processes 25.2-3 (1998), pp. 259–284.

[17] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. “Introduction to information retrieval”. In: Natural Language Engineering 16.1 (2010), pp. 100–103.

[18] Ignacio Martín et al. “Salary Prediction in the IT Job Market with Few High-Dimensional Samples: A Spanish Case Study”. In: International Journal of Computational Intelligence Systems 11.1 (2018), pp. 1192–1209.

[19] Marcin Michał Mirończuk and Jarosław Protasiewicz. “A recent overview of the state-of-the-art elements of text classification”. In: Expert Systems with Applications 106 (2018), pp. 36–54.

[20] Ioannis Nikolaou. “Social networking web sites in job search and employee recruitment”. In: International Journal of Selection and Assessment 22.2 (2014), pp. 179–189.

[21] Miikka Rokkanen and Roope Uusitalo. “Changes in job stability: Evidence from lifetime job histories”. In: (2010).

[22] Regina Pefanis Schlee and Gary L Karns. “Job requirements for marketing graduates: Are there differences in the knowledge, skills, and personal attributes needed for different salary levels?” In: Journal of Marketing Education 39.2 (2017), pp. 69–81.

[23] Catherine Zhang et al. “Factors associated with increased academic productivity among US academic radiation oncology faculty”. In: Practical radiation oncology 7.1 (2017), e59–e64.

[24] Yun Zhu, Faizan Javed, and Ozgur Ozturk. “Semantic Similarity Strategies for Job Title Classification”. In: arXiv preprint arXiv:1609.06268 (2016).