Investigation of the indicative features of job advertisement interest by matching job and job seeker

Submitted in partial fulfillment for the degree of Master of Science

Rianne Klaver
12395927

Master Information Studies: Data Science
Faculty of Science, University of Amsterdam
2019-06-26

First Supervisor: Prof. Dr. Evangelos Kanoulas, Informatics Institute, e.kanoulas@uva.nl
Second Supervisor: Dr. Yuval Engel, Amsterdam Business School (Entrepreneurship & Innovation Section), y.engel@uva.nl
Investigation of the indicative features of job advertisement
interest by matching job and job seeker
Rianne Klaver
University of Amsterdam
ABSTRACT
Research into job recommendations and job attractiveness has always been of interest for companies. Choosing the right applicant can be of great financial importance, since the job needs to be performed adequately. If a company knows what aspects draw in applicants, it could alter its advertisement to be more attractive to the desired applicants. This research investigates which features contribute most to a user expressing interest in an advertisement. It makes a distinction between three kinds of indicative features: general, gender-specific, and user-specific. Furthermore, several methods are employed to determine whether a user is actually capable of performing the job, i.e. whether the user is a match.
KEYWORDS
Job recommendation, Job interest prediction, Text matching, Feature importance
1 INTRODUCTION
Finding the right candidate to fill a job vacancy is crucial for companies. Since a company invests in its new employee, it could lose a significant amount of money if the person's performance is lacking or if they resign [21]. Researchers have focused on how to select the right candidate from the pool of candidates [16], and how the selection process actually works in practice [10]. However, what if the right person was not even in that pool to begin with?

An application was built to match companies and job seekers. Job seekers could see job advertisements, including information about the company, and 'like' or 'dislike' the advertisement. Likewise, a company could like or dislike a job seeker's profile. A match occurred when both parties liked each other; the application would then provide communication opportunities between them to further discuss the job advertisement.

The data collected from this application can provide great insight into what drives a job seeker and a company looking to fill a vacancy to be interested in one another. This research focuses on the interest of the job seeker, and investigates whether it can be predicted if a certain user will like a certain advertisement using the data of the application, and which features drive that decision. This information could be useful for a company seeking the perfect candidate for its vacancy: it would know which of the information it provides is most important to job seekers. The company can then focus on altering that information in the job advertisement to attract the right candidates and thus improve its candidate pool. This increases its chance of finding the right employee.
This research aims to answer the following question:

• What company, job advertisement, and job seeker features are most indicative of whether a job seeker will like the advertisement or not?

The sub-questions that will be answered in this research are:

• What is the best method to determine if a user matches with a job advertisement?
• Do the indicative features differ per user?
• Is there a difference between indicative features for each gender?
The research questions are answered by first creating a random forest classifier that predicts whether a user will like the advertisement. The importance of each feature is then measured to see which features contribute most to a prediction. The feature contributions to single predictions of various user ratings are calculated to see whether those features differ per user prediction. Furthermore, this research explores methods to match the skills a user has to the skills a job requires, and to match the job title to the user's previous work experience. This is done to see if a user profile actually matches the job advertisement, i.e. whether the user can actually exercise the profession. Lastly, separate models are trained on datasets containing either only male or only female users, to see whether the indicative features, measured by the feature importance, differ per gender.
2 RELATED LITERATURE
2.1 Applicant's attraction to a job
Various studies have researched which features of a job advertisement and the company behind it can indicate whether an applicant will be interested in pursuing the job.

Chapman et al. studied the attraction of an applicant to a certain job [12]. They combined 71 studies to find that characteristics of the company and the job itself are important features for this attraction. Moreover, the researchers found that an applicant's perception of fit in the company, a subjective feature, is an important predictor as well. This perception of fit can be between the applicant and the organisation, or between the applicant and the job. Another interesting finding is that the available data about job characteristics is used more by women than by men when assessing a job.

Furthermore, research has been done on employee preferences per gender when selecting a job. Flory et al. found that showing a job advertisement with male connotations would attract different responses from women and men than the same advertisement with these connotations removed [17].

Moreover, Cheryan et al. found that the physical environment influences whether people want to participate in a certain group, computer science in their research [15]. For example, if women feel they do not fit in, because certain objects represent male stereotypes, they have less interest in participating. Gender stereotyping of certain jobs starts early in our lives: Miller et al. showed that children from the age of 8 already find that some jobs should be performed by men or by women [27].

Lastly, research has been conducted on which features women or men find important when choosing a job. Centers et al. found that women find 'good coworkers' more important than men do when choosing a job [11].
2.2 Job recommending
Section 2.1 demonstrates that different applicants can respond differently to the same job advertisement, dependent on several factors. Thus, this information could be used to provide applicants with the right advertisements, i.e. give the right job recommendations. This would possibly make them more likely to like the advertisement in the application. Research has been done on how to recommend the right job, i.e. predict whether the user will like the advertisement. Lacic et al. showed that a user's 'frequency and recency of interactions with a job posting' can be used to recommend jobs [22]. Kille et al. investigated whether clicking on an advertisement, bookmarking it, or replying can indicate whether a user finds the advertisement interesting [20]. Their research concluded that replying is the most accurate feature for providing relevant recommendations. The 2016 and 2017 ACM Recommender Systems Challenge involved job recommendations, and the 2017 contest specifically focused on cold-start recommending [8, 24]. The team of Lian et al. used a single boosting tree as their predictive model, and investigated whether basic profile features could be matched to the advertisement features to improve the recommendations [24]. They matched the country, industry, and career level, and reported that the country match was in their top ten of most important features. Bianchi et al. participated in the contest as well, and also investigated user-advertisement profile matching [9]. Their method involved calculating the cosine coefficient of the common features and computing TF-IDF to investigate feature relevance.
3 METHODOLOGY
3.1 Data Description
The dataset consists of ratings given between 2015 and 2017; its 23 features can be divided into three categories: data about the company, about the job advertisement, and about the user rating the job. The dataset is labelled; the labels state whether the user has liked or disliked the advertisement. The dataset contains the following unstructured features: job required skills, job tasks, user work experience, and user skills. All other considered features are structured. Table 1 shows general statistics of the dataset. A distinction is made between unique and total male and female users: a user was able to rate multiple advertisements, so the total number is the sum of all ratings made by either men or women.

Table 1: Statistics of the dataset

    Ratings              675787
    Liked                114080
    Disliked             561707
    Job advertisements   2588
    Users                8920
    Male                 4967 unique, 380090 total
    Female               3910 unique, 293746 total

Table 1 reveals a heavy imbalance between the like and dislike classes. Figure 1 shows the contribution per gender to this imbalance. It appears that women have been more active on the application, even though there were fewer unique female users than male users.

Figure 1: The job ratings per gender

3.1.1 Company data. The user would see certain information about the company behind the job advertisement when rating. This information concerned the company itself, such as the company size, foundation year, average employee age, and the percentage of female employees. The industry of the company is also recorded, and follows the Global Industry Classification Standard (GICS). GICS is a global standard for classifying companies into sectors [35]. The percentage of women in the industry is recorded as well.
3.1.2 Job data. Besides information about the company, the user would also see information about the job itself. This includes the tasks of the job, the skills the applicant is required to have, the location of the job, and the language of the advertisement. The user would also see whether the job is full-time, part-time, or an internship. Moreover, the job advertisement indicated the level of the job, ranging from entry-level to executive. The dataset also contains a classification of the job into six different occupational groups, e.g. 'management occupations'.
3.1.3 User data. Users had to disclose personal data when subscribing to the application. The company would see this data when rating users. The users provided their gender, location, languages, education, graduation year, and years of experience. Figure 2 shows a histogram of the graduation decade of users to provide an idea of the user distribution. It shows the distribution of the job ratings per graduation decade, in total percentage and absolute values per decade. It can be seen that most users of the application are recent graduates or still studying. The education information was stored unstructured, and has been modified into a structured feature indicating whether the education was Information Technology (IT) related or not.
3.2 Classifier
Figure 2: The job ratings per user graduation decade

A random forest classifier is used to classify whether a user will like an advertisement. Liaw et al. explain the algorithm [25]: a random forest consists of n decision trees, which are built on n bootstrap samples. To decide on the best split at every node, only a subset of the features is used instead of all of them. All decision trees are combined to provide one classification: each tree predicts the same sample and the majority vote for a certain class is the result. The model is evaluated on accuracy and F-score [6]. Accuracy alone would be insufficient since the dataset is imbalanced, i.e. a larger number of users has disliked advertisements [19].
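The classifier and evaluation setup described above can be sketched as follows. This is an illustrative Python snippet on synthetic stand-in data, not the thesis code: the feature count mirrors the dataset's 23 features, but the data, the signal in the labels, and the number of trees are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rating dataset: 23 numeric features and an
# imbalanced like (1) / dislike (0) label driven by the first two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 23))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 1.2).astype(int)

# 70/30 train/test split, as in section 3.3.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
f1 = f1_score(y_test, pred)  # F-score of the minority 'like' class
```

Reporting the F-score next to accuracy matters here because a classifier that always predicts 'dislike' would already reach a high accuracy on this imbalanced label.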
3.3 Dataset pre-processing
3.3.1 Filtering. The dataset has been filtered on the following conditions:

• Each job advertisement must concern a full-time or part-time job.
• Each job advertisement must have over 85 user ratings.
• Each job advertisement must be written in English.
• Each user must be able to speak English.
• Each company behind an advertisement must have fewer than 500 employees.

Advertisements with fewer than 85 ratings, internships, and large companies are filtered out to remove outliers which could negatively impact the model. Internships make up only a very small part of the dataset (4%), which is the case for large companies as well, and advertisements with fewer than 85 ratings make up only 5% of the dataset. The English language requirements are set to ensure that text matching on certain features can be performed. Furthermore, any entry containing missing values has been dropped, as the random forest classifier implemented by scikit-learn does not support handling missing values [36]. The dataset has been split into a train (70%) and test (30%) set.
3.3.2 Basic features. Several basic features have been created based on existing features in the dataset:

• Whether a user can speak Dutch or not.
• How many skills a user claims to have.
• The count of skills required in the job advertisement.
3.3.3 Text features. The features user skills, job required skills, job title, and user work experience all consist of text. These features have been sorted alphabetically per skill or job title per rating. Furthermore, the most frequently occurring Dutch words have been translated. The user work experience feature contained Dutch words even though the dataset has been filtered on English job advertisements and English-speaking users. This is most likely due to the ability to import previous work experience from LinkedIn, which can be in Dutch. For example, 'communicatie' has been translated to 'communication'. Frequently occurring abbreviations have also been replaced by the full word, e.g. 'jr.' by 'junior'. These alterations are of importance for the stemming and for calculating word similarity, as explained in the following sections.
3.3.4 Stemming. Each word in the text features mentioned in section 3.3.3 has been stemmed using the Lancaster stemmer. This stemmer consists of 115 rules that state when the ending of a word has to be removed or replaced [33]. Multiple rules can be applied to one word. Moral et al. state that the Lancaster stemmer is stronger than the Porter, Lovins, Dawson, and Krovetz stemmers [30]. A stronger stemmer could be beneficial since text matching will be performed on these features, as it is expected to stem the most words with the same meaning to the same stem.
3.3.5 Robust scaling. All features are scaled using the robust scaler, which is a scaler robust to outliers [3]. The features are scaled separately from each other by first subtracting the median [2]. Then, the data is scaled according to the range between the first and third quartile.
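A minimal sketch of this per-feature scaling, assuming the usual median/interquartile-range form of robust scaling described in [2, 3] (scikit-learn's RobustScaler implements the same idea):

```python
import numpy as np

def robust_scale(x):
    """Scale one feature: subtract the median, then divide by the
    interquartile range (Q3 - Q1), so outliers barely affect the scaling."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

# The outlier 100 does not distort the scale of the other values.
scaled = robust_scale([1, 2, 3, 4, 100])
```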
3.4 SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) has been used to resolve the imbalance of the dataset. SMOTE oversamples the class 'like', which has significantly fewer entries than the class 'dislike' [13]. Each entry of the class 'like' is seen as a vector, and its nearest neighbour is found. The original entry is subtracted from the neighbour vector; the result is multiplied by a value randomly chosen in the range [0, 1] and added to the original entry. This results in a new, synthetic entry. This is done for every entry, and can be repeated multiple times to achieve the same number of entries as the 'dislike' class.
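A minimal single-pass sketch of this interpolation step, with an invented two-dimensional 'like' class; a production pipeline would more likely use a library implementation such as imbalanced-learn's SMOTE:

```python
import numpy as np

def smote_samples(minority, rng):
    """One SMOTE pass: for each minority entry, interpolate between the
    entry and its nearest minority neighbour at a random point in [0, 1]."""
    synthetic = []
    for i, x in enumerate(minority):
        dists = np.linalg.norm(minority - x, axis=1)
        dists[i] = np.inf                      # exclude the entry itself
        neighbour = minority[np.argmin(dists)]
        gap = rng.uniform(0, 1)                # random value in [0, 1]
        synthetic.append(x + gap * (neighbour - x))
    return np.array(synthetic)

rng = np.random.default_rng(0)
likes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new = smote_samples(likes, rng)  # one synthetic entry per original entry
```

Each synthetic entry lies on the line segment between an original entry and its nearest neighbour, so the oversampled class stays inside its own region of the feature space.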
3.5 Job tasks
A company had to provide job tasks for every job advertisement, indicating what the employee would be working on and how much time they would spend performing each task. The time indication is on a daily scale, with a maximum of five days to divide over the tasks. A company could add up to three tasks to an advertisement. The following four features have been created based on this information:

• Task total count: the total number of days of one week spent working on the tasks
• Task different count: the number of different tasks that have to be performed on the job
• Task maximum: the longest period of time spent on one task in one week
• Task minimum: the shortest period of time spent on one task in one week
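The four task features can be derived directly from an advertisement's task list. A small sketch with a hypothetical set of tasks (the task names and day counts are invented):

```python
# Hypothetical task list: (task description, days per week), up to three per ad.
tasks = [("develop backend services", 3), ("code review", 1), ("sprint planning", 1)]

days = [d for _, d in tasks]
task_total_count = sum(days)       # total days per week spent on the tasks
task_different_count = len(tasks)  # number of different tasks on the job
task_maximum = max(days)           # longest time spent on one task
task_minimum = min(days)           # shortest time spent on one task
```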
3.6 Rarity of skills
3.6.1 User skills. A user provided their skills as part of their profile. To investigate whether their skills are an indication of whether they will like a certain job advertisement, the rarity of their skill set is determined. This is done in two ways: based on all users' skill sets, and based on the skill sets of the users who have rated that particular advertisement.
3.6.1.1 Based on all the users' skill sets - skill-rarity1. The rarity of the skill set is measured by taking the following steps. First, a new dataset is created containing all unique user profiles. Then, an Inverse Document Frequency (IDF) dictionary based on all the skills of the unique users is calculated. This dictionary contains the IDF score for every word appearing in any skill set. The IDF score is calculated as follows:

    IDF_i = log(N / df_i) + 1    (1)

where IDF_i is the IDF score for word i, N equals the total number of documents, i.e. skill sets, and df_i equals the number of documents that contain word i, i.e. the document frequency [7].

The following equation is used to calculate the total IDF of a user's skill set j:

    IDF_j = (1 / n_j) * Σ_{i=1}^{n_j} IDF_{i,j}    (2)

where IDF_{i,j} is the IDF score for word i appearing in document j, and n_j is the total number of words in document j.
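Equations 1 and 2 can be sketched as follows, with hypothetical skill sets standing in for the unique user profiles:

```python
import math

def idf_dictionary(skill_sets):
    """IDF_i = log(N / df_i) + 1 for every word over the given skill sets
    (equation 1), where df_i is the number of skill sets containing the word."""
    n = len(skill_sets)
    vocab = set(w for s in skill_sets for w in s)
    return {w: math.log(n / sum(w in s for s in skill_sets)) + 1 for w in vocab}

def skill_set_idf(skill_set, idf):
    """Average IDF of the words in one skill set (equation 2)."""
    return sum(idf[w] for w in skill_set) / len(skill_set)

# Invented unique user profiles: 'excel' is rare, 'python' is common.
profiles = [{"python", "sql"}, {"python", "java"}, {"sql", "excel"}, {"python"}]
idf = idf_dictionary(profiles)
rarity = skill_set_idf({"python", "sql"}, idf)
```

A rare skill appears in few skill sets, giving it a high IDF score, so a user with an unusual skill set receives a high average.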
3.6.1.2 Based on the skill sets of the users who have rated that certain advertisement - skill-rarity2. This calculation largely follows the steps taken in section 3.6.1.1; however, multiple new datasets are created, one for every job advertisement. Each dataset contains all profiles of users who have rated that advertisement. A new IDF dictionary is created for each dataset, containing the IDF score for every word appearing in only the skill sets of the users who have rated that advertisement. Then, the total IDF of a user's skill set in that particular dataset is calculated based on equation 2. Thus, users who have rated multiple advertisements are likely not assigned the same IDF score for their skill set for every advertisement.
3.6.2 Job required skills - skill-rarity3. A job advertisement contains information on the skills the applicant should possess, provided by the company. If a company asks for a very specific, uncommon skill, users might be more inclined to dislike the advertisement, as there is a higher chance they do not have that skill. Therefore, the rarity of the job required skills might be indicative of whether a user will like the advertisement. The rarity of these skills is determined in almost the same manner as presented in section 3.6.1.1; however, new datasets of all unique job advertisements, instead of all unique users, are created.
3.7 Matching user and job required skills
Both the user and the company provided skills the applicant has or should have. This can be used to determine whether the user is a match with the advertisement, i.e. whether the user is actually capable of performing the job. Several methods have been tested to determine these matches based on the skill sets; they are discussed in the following sections.
3.7.1 Levenshtein distance. The Levenshtein distance is a metric that calculates the edit distance between sentences x and y. The distance is defined as the sum of character insertions, deletions, and replacements performed on sentence x to transform it into sentence y [23]. Three different methods are implemented to calculate the Levenshtein distance between the user skill set and the job required skill set; they are explained in the following three sections.
3.7.1.1 Complete string - Levenshtein1. All the user skills are combined into one string u, and likewise all the job required skills into one string j. The Levenshtein distance L(u, j) is then calculated between these two strings. This distance is normalised by dividing it by the maximum character length of strings u and j, resulting in the normalised Levenshtein distance over the complete skill sets.
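A minimal sketch of the normalised distance, using a standard dynamic-programming implementation of the Levenshtein distance (the thesis does not specify its implementation):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of character insertions, deletions,
    and replacements turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def normalised_levenshtein(u, j):
    """L(u, j) divided by the maximum character length of u and j."""
    return levenshtein(u, j) / max(len(u), len(j))
```

The normalisation maps the distance into [0, 1], so long and short skill strings become comparable.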
3.7.1.2 Per skill - Levenshtein2. This method focuses on individual skill matches between the two skill sets, and outputs the number of matched skills. A matched skill is defined as a skill in the user skill set, u_i, that is similar to a skill in the job required skill set, j_k. To determine the number of matched skills, all skill pairs (u_i, j_k) between the user and job required skill sets are created. For every skill pair, the normalised Levenshtein distance as introduced in section 3.7.1.1 is calculated; however, the complete strings are now each represented by one skill. Next, the best matching j_k is determined for every u_i, i.e. the (u_i, j_k) pair with the lowest Levenshtein distance. Each u_i can only match with one j_k. A threshold t is needed to determine how many pairs (u_i, j_k) actually match, i.e. a match occurs when L(u_i, j_k) < t. t is found by empirically trying different thresholds; t = 0.5 is chosen as it results in the highest model accuracy.
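The per-skill matching can be sketched as follows; the skill lists are invented, and the threshold defaults to the t = 0.5 chosen above:

```python
def lev(a, b):
    # Character-level Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def matched_skills(user_skills, job_skills, t=0.5):
    """Count user skills whose best-matching job skill has a normalised
    Levenshtein distance below threshold t. Each user skill matches at most
    one job skill; job skills may be reused (as in Levenshtein2, not 3)."""
    matches = 0
    for u in user_skills:
        best = min(lev(u, j) / max(len(u), len(j)) for j in job_skills)
        matches += best < t
    return matches

# 'python' ~ 'python 3' and 'sql' ~ 'mysql' match; 'marketing' matches nothing.
n = matched_skills(["python", "sql", "marketing"], ["python 3", "mysql"])
```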
3.7.1.3 Per skill - Levenshtein3. This method mostly follows the method explained in section 3.7.1.2. However, besides u_i only being allowed to match once, j_k is only allowed to match once as well. The matching pairs (u_i, j_k) are selected in the following manner: find the lowest L(u_i, j_k), register the pair as a potential match, and remove all other pairs containing either u_i or j_k from the list of all skill pairs. Again, the lowest remaining L(u_l, j_m) is found and registered as a potential match; this pair can thus never contain u_i or j_k. Once all u_i or all j_k are matched, x potential matches have been found, where x is the size of the smaller of the two skill sets. The Levenshtein distances of these potential matches are then compared with a threshold t to arrive at the final number of actual matches; t = 0.2 is chosen as it results in the highest model accuracy.
3.7.2 ROUGE score. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is an algorithm to determine the similarity between two text fragments, based on overlapping words [26]. Several ROUGE scores have been used to calculate the distance between the user and job required skills: ROUGE-1 and ROUGE-2 (ROUGE-N scores), ROUGE-L, and ROUGE-WE-1 and ROUGE-WE-2 (ROUGE-WE scores).
3.7.2.1 ROUGE-N. ROUGE-N measures the overlap of either unigrams (ROUGE-1) or bigrams (ROUGE-2) between the hypothesis sentence and the reference sentence. A unigram is a single word; a bigram is a sequence of two words. The hypothesis sentence in this case is the job required skill set and the reference sentence is the skill set of the user. The ROUGE-N F-score is used to measure the similarity between the two sentences and is calculated by the following equations [26]:

    ROUGE-N_Recall = Σ_{gram_n ∈ R} Count_match(gram_n) / Σ_{gram_n ∈ R} Count(gram_n)    (3)

where gram_n is an n-gram of length n, R is the reference sentence, Count_match(gram_n) is the count of n-grams overlapping between the reference and hypothesis sentences, and Count(gram_n) is the count of an n-gram.

    ROUGE-N_Precision = Σ_{gram_n ∈ R} Count_match(gram_n) / Σ_{gram_n ∈ H} Count(gram_n)    (4)

where H is the hypothesis sentence.

The ROUGE-N F-score is then calculated using the general formula for the F-score of all ROUGE variants:

    ROUGE_F-score = (2 * ROUGE_Recall * ROUGE_Precision) / (ROUGE_Recall + ROUGE_Precision)    (5)

3.7.2.2 ROUGE-L. ROUGE-L is based on the longest common subsequence (LCS) between the hypothesis and reference sentences. The LCS is the longest sequence of words occurring in both sentences [39]. The sequence does not have to consist of consecutive words; however, the word order cannot be altered. For instance, say hypothesis = 'java programming, C#, SQL' and reference = 'C# programming, SQL'. The LCS would be either 'C#, SQL' or 'programming, SQL', and thus be of length two. ROUGE-L is then calculated as follows [26]:

    ROUGE-L_Recall = LCS(H, R) / Σ_{gram_1 ∈ R} Count(gram_1)    (6)

where LCS(H, R) is the length of the LCS of the hypothesis and reference sentences.

    ROUGE-L_Precision = LCS(H, R) / Σ_{gram_1 ∈ H} Count(gram_1)    (7)

The ROUGE-L F-score is calculated using formula 5.
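ROUGE-1 and ROUGE-L can be sketched as follows, reusing the example sentences above; tokenisation by whitespace splitting is a simplifying assumption:

```python
def rouge_f(recall, precision):
    # General ROUGE F-score (equation 5).
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def rouge_1(hyp, ref):
    """Unigram overlap: recall over the reference, precision over the hypothesis."""
    h, r = hyp.split(), ref.split()
    match = sum(min(h.count(w), r.count(w)) for w in set(r))
    return rouge_f(match / len(r), match / len(h))

def lcs_length(h, r):
    # Longest common subsequence of two word sequences: order is preserved,
    # but the words need not be consecutive.
    table = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i, hw in enumerate(h, 1):
        for j, rw in enumerate(r, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if hw == rw \
                else max(table[i - 1][j], table[i][j - 1])
    return table[len(h)][len(r)]

def rouge_l(hyp, ref):
    h, r = hyp.split(), ref.split()
    lcs = lcs_length(h, r)
    return rouge_f(lcs / len(r), lcs / len(h))

score = rouge_1("java programming c# sql", "c# programming sql")
```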
3.7.3 ROUGE-WE score. The ROUGE-WE score was introduced by Ng et al. and is based on calculating ROUGE over word embeddings [31]. Word embeddings aim to translate a word into a vector of numbers, 'such that their respective projections are closer to each other if the words are semantically similar, and further apart if they are not' [31]. The researchers use word2vec to calculate the word embeddings.
3.7.3.1 ROUGE-WE-1. The ROUGE-WE-1 scores are calculated in the following manner [32]: the first step is to calculate the word2vec vector w_i for every word i. A pre-trained model is used for this, trained on a dataset containing Google News articles [1]. Then, all pairs of words (H_i, R_j) between the hypothesis and reference sentences are determined, and their word2vec vectors are multiplied. If a word does not appear in the model, i.e. is out of vocabulary (OOV) and has no word2vec vector, w_x = 0. All resulting vectors are summed and then divided by either the total word count of the reference sentence for the recall score (VEC-ROUGE-WE-1_Recall), or of the hypothesis sentence for the precision score (VEC-ROUGE-WE-1_Precision). The entries of the final vector, VEC-ROUGE-WE-1_Recall or VEC-ROUGE-WE-1_Precision, are averaged to arrive at ROUGE-WE-1_Recall or ROUGE-WE-1_Precision, i.e.:

    VEC-ROUGE-WE-1_Recall = Σ_{i ∈ H} Σ_{j ∈ R} (w_{H_i} * w_{R_j}) / Σ_{gram_1 ∈ R} Count(gram_1)    (8)

where w_x = 0 if w_x is OOV.

    VEC-ROUGE-WE-1_Precision = Σ_{i ∈ H} Σ_{j ∈ R} (w_{H_i} * w_{R_j}) / Σ_{gram_1 ∈ H} Count(gram_1)    (9)

where w_x = 0 if w_x is OOV.

    ROUGE-WE-1_Recall = (1 / n) * Σ_{i=1}^{n} VEC-ROUGE-WE-1_Recall_i    (10)

where n equals the number of entries of the vector VEC-ROUGE-WE-1_Recall.

    ROUGE-WE-1_Precision = (1 / m) * Σ_{i=1}^{m} VEC-ROUGE-WE-1_Precision_i    (11)

where m equals the number of entries of the vector VEC-ROUGE-WE-1_Precision.

Then ROUGE-WE-1_F-score is calculated using formula 5.
3.7.3.2 ROUGE-WE-2. The calculation of the ROUGE-WE-2 scores broadly follows the steps presented in section 3.7.3.1. The word2vec vector of a bigram, w_{i,j}, is calculated by multiplying the vectors of the individual words constituting the bigram, i.e. w_{i,j} = w_i * w_j. The equations for calculating VEC-ROUGE-WE-2_Recall and VEC-ROUGE-WE-2_Precision thus differ slightly from equations 8 and 9 for unigrams:

    VEC-ROUGE-WE-2_Recall = Σ_{i ∈ H} Σ_{j ∈ R} (w_{H_i} * w_{R_j}) / Σ_{gram_2 ∈ R} Count(gram_2)    (12)

where w_x = 0 if w_x is OOV, i = gram_2 ∈ H, and j = gram_2 ∈ R.

    VEC-ROUGE-WE-2_Precision = Σ_{i ∈ H} Σ_{j ∈ R} (w_{H_i} * w_{R_j}) / Σ_{gram_2 ∈ H} Count(gram_2)    (13)

where w_x = 0 if w_x is OOV, i = gram_2 ∈ H, and j = gram_2 ∈ R.

Equations 10, 11, and 5 are then used to calculate ROUGE-WE-2_F-score.
3.7.4 BLEU score. The bilingual evaluation understudy (BLEU) score is based on the modified precision score for n-grams [34]:

    p_n = Σ_{gram_n ∈ R} Count_match(gram_n) / Σ_{gram_n ∈ H} Count(gram_n)    (14)

The modified n-gram precision scores (for n = 1, ...) are combined by taking their geometric mean. Papineni et al. stated that BLEU based on a maximum of 4-grams provided the best results [34]; therefore, this research also uses this maximum. Papineni et al. also introduce a brevity penalty BP to penalise hypothesis sentences that are shorter than the reference sentence:

    BP = 1           if h > r
    BP = e^(1 - r/h) if h ≤ r    (15)

where h is the length of the hypothesis sentence, and r the length of the reference sentence.

BLEU is equal to BP multiplied by the geometric mean of the p_n. BLEU can be equal to zero when e.g. no 3-grams or 4-grams are found, since the geometric mean of the p_n will then be zero. Smoothing function 1, as presented by Chen et al., is used to avoid this issue [14]. This function replaces p_n = 0 with ε = 0.1.
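A sketch of this BLEU computation with the smoothing described above; the sentences are invented, and the clipped-count form of the modified precision follows the standard formulation in [34]:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hyp, ref, max_n=4, eps=0.1):
    """BLEU up to 4-grams with the brevity penalty of equation 15; a modified
    precision of zero is replaced by eps = 0.1 (smoothing function 1)."""
    h, r = hyp.split(), ref.split()
    precisions = []
    for n in range(1, max_n + 1):
        hc, rc = Counter(ngrams(h, n)), Counter(ngrams(r, n))
        match = sum(min(c, rc[g]) for g, c in hc.items())  # clipped matches
        total = sum(hc.values())
        p = match / total if total else 0.0
        precisions.append(p if p > 0 else eps)             # smoothing 1
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(h) > len(r) else math.exp(1 - len(r) / len(h))
    return bp * geo_mean
```

Without the smoothing, two short skill sets sharing no 4-gram would always score zero, regardless of how many unigrams they share.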
3.8 Matching job title and user work experience
Each job advertisement contained the actual job title, and each user had to provide the titles of jobs they have held in the past. Matching these job titles could be indicative of whether the user is a good candidate for the job. If a user has held the same job title as the advertisement, they might be a suitable match for the job, and more inclined to like it. The following methods described in section 3.7 are used to determine if there could be a match: ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-WE-1, ROUGE-WE-2, and BLEU. The method using the Levenshtein distance has been altered slightly: first, the Levenshtein distance between the advertisement job title and each of the user's past job titles is calculated. Then the pair with the minimum of all distances is chosen as the best match. This Levenshtein distance method thus outputs one distance measurement.
3.9 Feature importance
Several methods have been implemented to investigate which features are most indicative of whether a job seeker will like the job advertisement. The feature importance of all features of the random forest classifier model has been calculated, and partial dependence plots (PDPs) are created. Furthermore, separate models are created for all male and all female users to see whether the feature importance differs per gender. Lastly, the contribution per feature to the classification decision for a single user is investigated. The following sections elaborate on the techniques used.
3.9.1 Of the model - feature importance. The feature importance is calculated using the Gini importance. The Gini importance uses the Gini impurity to calculate the impurity of a node [40]:

    G = Σ_{i=1}^{C} p(i) * (1 - p(i))    (16)

The Gini importance of one node is then calculated as follows [5, 37]: first, the node impurity is multiplied by the number of training samples reaching the node divided by the total number of samples, i.e. the weighted node samples. Then the node impurities of the left and right child nodes, each multiplied by their weighted node samples, are subtracted. To obtain the importance of feature f in a tree, the Gini importance of all nodes splitting on f is summed and then divided by the total number of nodes in the tree. This is then normalised and averaged over all trees to obtain the feature importance per feature of the random forest.
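Equation 16 and the weighted impurity decrease of a single split can be sketched as follows (a two-class example with an invented split; scikit-learn's `feature_importances_` aggregates such decreases over all nodes and trees):

```python
import numpy as np

def gini(labels):
    """Gini impurity G = sum_i p(i) * (1 - p(i)) over the classes i (eq. 16)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def split_importance(parent, left, right):
    """Weighted impurity decrease of one split: the parent impurity minus the
    child impurities, each child weighted by its share of the parent samples."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

# A perfectly separating split removes all impurity from the parent node.
decrease = split_importance([1, 1, 0, 0], [1, 1], [0, 0])
```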
3.9.2 Of the model - Partial Dependence Plots (PDP). PDPs are constructed per feature and aim to explain the relationship between the job rating and the features of the model. If the Gini importance of a certain feature is high, a PDP provides additional insight. It plots the likelihood of predicting 'like' against the possible values of the feature. Thus, it shows what influence a certain value of that feature has on the prediction, whereas the Gini importance only states to what extent a feature is of importance to the model.

A PDP for feature $f$ is created in the following way [18]: first the training dataset is altered slightly, changing only the values of feature $f$. This value is changed $i$ times, once for each possible value $f_i$ present in the dataset within the [5, 95] percentile range. Values outside this range are disregarded to diminish the effect of outliers. Then, for each of these altered versions of a row, a prediction is made by the model; this is thus done $i$ times for every row. Finally, the average prediction per possible value of $f$ is calculated and plotted, resulting in the PDP for $f$.
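The procedure can be sketched as follows. To keep the example self-contained, a simple stand-in prediction function replaces the trained random forest (its numbers are purely illustrative, not the thesis model); with a fitted scikit-learn classifier one would call its `predict_proba` method instead:

```python
def partial_dependence(predict, rows, feature, grid):
    """For each candidate value in `grid`, overwrite `feature` in every
    row, predict, and average: this yields the PDP curve of `feature`."""
    curve = []
    for value in grid:
        altered = [dict(row, **{feature: value}) for row in rows]
        avg = sum(predict(r) for r in altered) / len(altered)
        curve.append((value, avg))
    return curve

# Stand-in model: likelihood of 'like' rises with graduation year
# (hypothetical relationship for illustration only).
def predict(row):
    return min(1.0, max(0.0, 0.1 * (row["graduation_year"] - 2008)))

rows = [{"graduation_year": y, "years_experience": e}
        for y, e in [(2010, 8), (2014, 3), (2016, 1)]]
grid = [2010, 2012, 2014, 2016]  # values within the [5, 95] percentile range
curve = partial_dependence(predict, rows, "graduation_year", grid)
```

Each point of `curve` is the average model prediction when the whole dataset is forced to one candidate value of the feature, which is exactly what a PDP plots.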
3.9.3 Model per gender. The dataset has been split into $d_1$, containing only male users, and $d_2$, containing only female users. Random forest models have been trained and tested on these datasets separately. The feature importances are calculated using the Gini importance as explained in section 3.9.1.
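The split-and-retrain step can be sketched as below, using synthetic stand-in data (the features, sample sizes, and model settings are illustrative, not those of the thesis):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the user table: two numeric features,
# a gender flag, and a like/dislike label (illustrative only).
X = rng.normal(size=(200, 2))
gender = rng.integers(0, 2, size=200)               # 0 = male, 1 = female
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

importances = {}
for g, name in [(0, "male"), (1, "female")]:
    mask = gender == g                              # the d1 / d2 split
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[mask], y[mask])
    # Gini importances (section 3.9.1), one vector per gender model
    importances[name] = model.feature_importances_
```

Comparing the two importance vectors then reveals whether the indicative features differ per gender.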
3.9.4 Per user. The TreeInterpreter algorithm is used to investigate the importance of features per user prediction. This algorithm measures importance as the contribution of a feature to the final prediction. When looking at a single tree, the root node already has a prediction value. This value then changes based on the splitting criterion, resulting in different prediction values for the child nodes [4, 38]. The contribution of the feature at the root node is either the difference in value compared to the left child or the difference compared to the right child, depending on the actual decision path. This process repeats until the leaf nodes are reached. When predicting a user's liking, a certain path is followed in each tree of the random forest. For each tree the contributions per feature present in the tree path are registered, together with the final prediction. All contributions per feature are averaged to calculate the contributions for the entire random forest. The final probability of a user belonging to a class, i.e. like/dislike, is then calculated by summing all feature contributions for that class and adding the result to the forest bias, i.e. the average value of all root nodes. The algorithm can be written as [4]:

$$\mathit{Pred}(u) = \frac{1}{T}\sum_{t=1}^{T} \mathit{rootval}_t + \sum_{f=1}^{F}\left(\frac{1}{T}\sum_{t=1}^{T} \mathit{cont}_t(u, f)\right) \qquad (17)$$

where $\mathit{Pred}(u)$ is the prediction for user $u$, $T$ is the total number of trees, $\mathit{rootval}_t$ is the value at the root node of tree $t$, $F$ is the total number of features, and $\mathit{cont}_t(u, f)$ is the contribution of $f$ to the prediction of $u$ in tree $t$.
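Equation (17) can be checked with a toy example. Given per-tree root values and per-tree feature contributions along the decision paths (the numbers and feature names below are hypothetical), the forest prediction is the forest bias plus the averaged contributions:

```python
def forest_prediction(root_values, contributions):
    """Equation (17): forest bias (average root value) plus the
    per-feature contributions averaged over all trees.
    `contributions[t][f]` plays the role of cont_t(u, f)."""
    T = len(root_values)
    bias = sum(root_values) / T
    features = {f for tree in contributions for f in tree}
    avg_cont = {f: sum(tree.get(f, 0.0) for tree in contributions) / T
                for f in features}
    return bias + sum(avg_cont.values()), bias, avg_cont

# Two toy trees: root value plus contributions of two features
root_values = [0.5, 0.7]
contributions = [{"rouge_l_skills": 0.2, "graduation_year": -0.1},
                 {"rouge_l_skills": 0.1}]
pred, bias, avg = forest_prediction(root_values, contributions)
```

A feature absent from a tree's decision path simply contributes 0 for that tree, which is why missing keys default to 0.0 before averaging.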
4 RESULTS AND ANALYSIS

4.1 On the original dataset
A random forest classifier model has been trained on only the existing, unaltered features in the dataset. The results are presented in table 2. The accuracy of 81.57% thus serves as the baseline. The F-score is fairly low, mainly caused by the very low recall score.

Accuracy   0.8157
F-score    0.2618
Precision  0.6705
Recall     0.1627

Table 2: Results of testing on the original model
4.2 Choosing best method to match user and job required skills

Table 4 shows the accuracy, F-score, precision, and recall on the test set of several random forest classifiers. Each model is trained on one of the skill matching features plus all other features, except the ones concerning the job title and user experience matches. All results lie quite close together. ROUGE-L obtains the highest accuracy and precision, and BLEU the highest F-score and recall. The ROUGE-L score has been chosen for the feature selection of the final model, since it has the highest accuracy.
4.3 Choosing best method to match job title and user experience

Several models are created and tested to obtain the best method to match the job title and user experience. Each model has been trained on one of these methods plus all other features, except the ones concerning skill matching. The results are shown in table 5. Again, the results of each feature are reasonably alike; however, the maximum of each performance score is higher than when training on the skill matching features. Matching on job title and user work experience thus seems to lead to a slightly better model than matching on skill sets. Levenshtein distance results in the highest accuracy and precision, and ROUGE-L in the highest F-score and recall. Since the Levenshtein distance has the highest accuracy, it has been added to the final model.
4.4 Model results

Adding the ROUGE-L feature on skill matching and the Levenshtein distance feature on job title and work experience matching to the random forest classifier model yields the results presented in table 3. The final model obtains the highest accuracy compared to the results in tables 4 and 5, although only slightly. The F-score and precision have not improved compared to the results of ROUGE-L on skills and Levenshtein on job title matching; the recall score has improved slightly compared to these two models. Compared to the results of the original model on unaltered data (see section 4.1), the accuracy has improved slightly and the F-score has improved significantly. This is due to an increased recall score, meaning more users who have liked an advertisement are correctly predicted by the model. Thus the final model performs significantly better than the original model on unaltered data. However, the recall score remains low: out of all users who have liked an advertisement, the model only predicts 36.78% correctly.

Accuracy   0.8292
F-score    0.4613
Precision  0.6184
Recall     0.3678

Table 3: Results of final model
4.5 Feature importance
4.5.1 Gini importance. The Gini importance was calculated for each feature of the final model. The top five most important features are shown in figure 3 (see appendix A for the complete list). Graduation year is the most important feature. The added features concerning users' skills (ROUGE-L skill matching, skill-rarity1, and the count of user skills) are all in the top five as well. This indicates that the user skills do contain important information for whether a user will like the advertisement or not.

The importances can be categorized, since the Gini importances of all features combined add up to 1. Table 6 shows the summed importances per feature category. Features concerning the user information are most important, followed by the category containing the matched skill and job title features. The company features are least indicative of whether a user will like the advertisement. This shows that while a user's own circumstances are leading when rating job advertisements, a company should pay attention to the details of the advertisement to attract the right applicants. A company should focus on establishing the right required skill set, since matching it with the user skill set has an importance of almost 9%. The job required skill set is also of importance to the model when calculating the rarity.
           Levenshtein  BLEU-1  BLEU-2  BLEU-3  ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-WE-1  ROUGE-WE-2
Accuracy        0.8224  0.8192  0.8224  0.8204   0.8236   0.8183   0.8250      0.8217      0.8236
F-score         0.4418  0.4464  0.4594  0.4619   0.4613   0.4479   0.4472      0.4557      0.4564
Precision       0.5940  0.5864  0.5991  0.5721   0.5895   0.5848   0.6120      0.5906      0.5981
Recall          0.3517  0.3603  0.3725  0.3873   0.3789   0.3629   0.3523      0.3709      0.3690

Table 4: Results of user skill and job required skill set matching
           Levenshtein    BLEU  ROUGE-1  ROUGE-2  ROUGE-L  ROUGE-WE-1  ROUGE-WE-2
Accuracy        0.8275  0.8271   0.8229   0.8214   0.8226      0.8247      0.8223
F-score         0.4615  0.4645   0.4600   0.4584   0.4683      0.4655      0.4526
Precision       0.6205  0.6026   0.5961   0.5962   0.5866      0.6015      0.5921
Recall          0.3673  0.3779   0.3745   0.3724   0.3896      0.3796      0.3663

Table 5: Results of job title and user work experience matching
Figure 3: The Gini importance of the top 5 most important features
Features         GINI importance
User                      0.7203
Matching text             0.1342
Job                       0.0906
Company                   0.0549

Table 6: Total GINI importance per feature category
4.5.2 Partial dependence plots. The partial dependence plots are created for the top two most important features found by the Gini importance. Each plot contains a red dotted line indicating the 50% prediction; the higher the value, the higher the chance of predicting 'like'. Figure 4 shows that users with a graduation year up until 2011 are more likely to dislike the advertisement. Between 2011 and 2017 (the year the data was acquired) there is a higher chance of liking the advertisement. In 2017 there is an equal chance of liking or disliking the advertisement. The chance of liking decreases by over 10% between 2016 and 2017. This could be due to such users still studying and using the application informatively rather than actively looking for a job, since most of the dataset was obtained during 2016 and 2017.

Figure 5 shows that, generally, the fewer years of experience a user has, the more likely they are to like the job advertisement. A user is more likely to dislike an advertisement if they have over 10 years of experience. This gives the impression that either advertisements are mainly focused on graduate or junior positions, or users with less work experience are more open to different jobs that might not completely match their profile. Interestingly, users with almost no experience tend to dislike advertisements more as well. This can be linked to the results of the PDP of user graduation year: no or almost no experience might indicate that the user is still studying, and might therefore not be as interested in finding a job just yet.

Furthermore, the partial dependencies of the user gender are calculated to investigate the relation between gender and the classification. Male users seem to have a higher chance of liking an advertisement, about 50.6%, whereas female users have a higher chance of disliking an advertisement, about 54.0%. Thus, women are about 4.5% less likely to like an advertisement than men. This could mean that the advertisements were more focused on men, or that men are more likely to like an advertisement even though they might not match the profile. Research has indeed shown that job requirements are seen more as set rules by women than by men [28].
Figure 4: Partial Dependency Plot of Graduation year
Figure 5: Partial Dependency Plot of Years of experience
4.5.3 Model per gender. Two models have been trained, on only male or only female users, to inspect the difference between men and women when rating an advertisement. Table 7 shows the results of testing the models on separate test sets. The model trained on the female users predicts with higher accuracy, whereas the model trained on male users has a higher F-score.

Table 8 shows the top five most important features as calculated by the Gini importance (see appendix A for the complete list). These features differ per gender. For example, the most important feature for men, 'Years of experience', has an importance of 0.16, as opposed to a lower importance of 0.10 for women. Graduation year, the most important feature for women with a score of 0.15, is only the fourth most important feature for predicting the rating of men, with a score of 0.09. However, these features are correlated, since a user usually starts work after graduating, and thus contain potentially overlapping information.
4.5.4 Feature importance - per applicant. Two users have been randomly chosen to show the possible results when investigating which features contributed most to the prediction of the model. One user liked and one user disliked an advertisement, and for both the model has predicted this correctly. Table 9 shows some information about these users.
           Male    Female
Accuracy   0.8207  0.8322
F-score    0.5061  0.4274
Precision  0.6269  0.5716
Recall     0.4243  0.3413

Table 7: Model results on the test dataset per gender
Male                           Female
Years of experience  0.1613    Graduation year      0.1508
User country         0.1205    Count user skills    0.1369
Count user skills    0.1011    ROUGE-L skills       0.1026
Graduation year      0.0893    Years of experience  0.1007
Skill-rarity1        0.0861    Dutch                0.0889

Table 8: Top five most important features per gender model
User                    1       2
Liked advertisement     No      Yes
Gender                  Male    Female
User education field    IT      IT
Dutch                   Yes     No
User country            NL      Other
Graduation year         2011    2016
Language count          2       2
ROUGE-L skills          0       0.1004
skill-rarity-1          3.1083  2.9536
skill-rarity-2          2.8802  3.2263

Table 9: Information about user 1 and 2
The top five sets of features that contributed most to user 1's prediction are shown in table 10. ROUGE-L skills has the highest contribution. All other contribution sets are made up of multiple features. The set of user 1's gender (male), education field (IT), and Dutch (yes), combined with one or more other features, shows to be a big contributor to the prediction.

Table 11 shows the top five contribution sets for the prediction of user 2. The top set has a very high contribution of 0.5 to the prediction, which lies in the range [0, 1]. All top five sets contain the user's country (not the Netherlands) and ROUGE-L skills, sometimes combined with other features.
5 CONCLUSION

This research aimed to find the most indicative features of whether a user will like a job advertisement. It investigated how to determine whether a user and an advertisement match, i.e. whether the user could fulfill the position. The differences between men and women, and between individual users, have been explored as well.
Several methods have been applied to determine whether a user matches a job advertisement. The ROUGE-L measure proved best for matching the user skill set and the job required skills, although only by a small margin. To match the job title and user work experience, the Levenshtein distance demonstrated the best results. Adding these predictive features improved the original model significantly; however, the F-score of the final model is still fairly low, mainly caused by a low recall score. This means that the conclusions drawn from this research should be taken only tentatively.
This research has shown that there is a difference between the indicative features for each gender. A large difference can be found in the matching of skill sets: women seem to weigh more heavily than men whether their skill set actually matches the job required skills. If a company would like to attract more female applicants, it could decide to rephrase the required skills, for example by adding that possessing all skills is not a hard requirement.
The most indicative features have proven to be user related; however, matching a user and a job advertisement based on skill set is also of great importance. Although the indicative features differ from user to user, matching based on skills contributes in every case. A company should therefore put a strong focus on writing the required skill set as precisely as possible, to drive its ideal candidate to like the advertisement and thus enter its candidate pool.
This research has used partial dependence plots to investigate the influence of a single feature. However, these plots are less reliable when features are correlated [29]. Future research could therefore focus on constructing, for example, accumulated local effects plots. These plots are robust to correlated features, and will thus describe the individual influence of a feature in the model more accurately.
Contribution  Feature set
0.2496        ROUGE-L skills
0.0888        Gender, User education field, Dutch, ROUGE-L skills
0.0832        Gender, User education field, Dutch, Graduation year, skill-rarity1
0.0272        Gender, User education field, Dutch, Graduation year, ROUGE-L skills
0.0236        Gender, User education field, Dutch, Graduation year, ROUGE-L skills, skill-rarity1

Table 10: Top 5 feature contribution sets of user 1
Contribution  Feature set
0.5003        User country, ROUGE-L skills, User education field, Language count, skill-rarity-1
0.0535        User country, ROUGE-L skills, User education field
0.0209        User country, ROUGE-L skills, User education field, Language count, skill-rarity-1, Graduation year
0.0174        User country, ROUGE-L skills
0.0166        User country, ROUGE-L skills, User education field, Language count, skill-rarity-1, Graduation year, skill-rarity-2

Table 11: Top 5 feature contribution sets of user 2
Further, it would be interesting to involve a user's past activity on the application in the model. This can be used to determine whether a user is interested in the job advertisement, as discussed in section 2.1, and thus could improve the model. The original dataset contains the exact date and time of a user's rating, from which the user's usage frequency and interactions with other job advertisements can be derived.
6 ACKNOWLEDGMENTS

I would like to thank and express my great gratitude towards Prof. Evangelos Kanoulas and Dr. Yuval Engel for providing me with valuable support, constructive advice, and the opportunity to perform research on this intriguing problem. I also wish to thank Sam Waterson and Priyanka Nanayakkara for all our great discussions enhancing my research.
REFERENCES
[1] [n. d.]. GoogleNews-vectors-negative300.bin.gz. Retrieved June 11, 2019 from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
[2] [n. d.]. scikit-learn/sklearn/preprocessing/data.py. Retrieved June 11, 2019 from https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/preprocessing/data.py#L1047
[3] [n. d.]. sklearn.preprocessing.RobustScaler. Retrieved June 11, 2019 from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
[4] 2014. Interpreting random forests. Retrieved June 13, 2019 from http://blog.datadive.net/interpreting-random-forests/
[5] 2017. scikit-learn/sklearn/tree/_tree.pyx. Retrieved June 13, 2019 from https://github.com/scikit-learn/scikit-learn/blob/18cdaa69c14a5c84ab03fce4fb5dc6cd77619e35/sklearn/tree/_tree.pyx#L1056
[6] 2019. Evaluation of binary classifiers. Retrieved June 19, 2019 from https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
[7] 2019. scikit-learn/sklearn/feature_extraction/text.py. Retrieved June 18, 2019 from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py
[8] Fabian Abel, András Benczúr, Daniel Kohlsdorf, Martha Larson, and Róbert Pálovics. 2016. RecSys Challenge 2016: Job Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 425–426. https://doi.org/10.1145/2959100.2959207
[9] Mattia Bianchi, Federico Cesaro, Filippo Ciceri, Mattia Dagrada, Alberto Gasparin, Daniele Grattarola, Ilyas Inajjar, Alberto Maria Metelli, and Leonardo Cella. 2017. Content-Based Approaches for Cold-Start Job Recommendations. 1–5. https://doi.org/10.1145/3124791.3124793
[10] Pernilla Bolander and Jörgen Sandberg. 2013. How Employee Selection Decisions Are Made in Practice. Organization Studies 34, 3 (March 2013), 285–311. https://doi.org/10.1177/0170840612464757
[11] Richard Centers and Daphne E. Bugental. 1966. Intrinsic and Extrinsic Job Motivation Among Different Segments of the Working Population. The Journal of Applied Psychology 50, 3 (June 1966), 193–197. https://doi.org/10.1037/h0023420
[12] Derek S. Chapman, Krista L. Uggerslev, Sarah A. Carroll, Kelly A. Piasentin, and David A. Jones. 2005. Applicant Attraction to Organizations and Job Choice: A Meta-Analytic Review of the Correlates of Recruiting Outcomes. Journal of Applied Psychology 90, 5 (Sept. 2005), 928–944. https://doi.org/10.1037/0021-9010.90.5.928
[13] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[14] Boxing Chen and Colin Cherry. 2014. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. In Proceedings of the Ninth Workshop on Statistical Machine Translation. Association for Computational Linguistics, Baltimore, Maryland, USA, 362–367. https://doi.org/10.3115/v1/W14-3346
[15] Sapna Cheryan, Victoria C. Plaut, Paul G. Davies, and Claude Steele. 2009. Ambient Belonging: How Stereotypical Cues Impact Gender Participation in Computer Science. Journal of Personality and Social Psychology 97, 6 (Dec. 2009), 1045–1060. https://doi.org/10.1037/a0016239
[16] Jane Fish. 1992. How to choose the right applicant. British Journal of Nursing 1, 7 (July 1992), 352–355. https://doi.org/10.12968/bjon.1992.1.7.352
[17] Jeffrey A. Flory, Andreas Leibbrandt, and John A. List. 2014. Do competitive workplaces deter female workers? A large-scale natural field experiment on job entry decisions. Review of Economic Studies 82, 1 (Jan. 2014), 122–155. https://doi.org/10.1093/restud/rdu030
[18] Brandon M. Greenwell. 2017. pdp: An R Package for Constructing Partial Dependence Plots. The R Journal 9, 1 (2017), 421–436. https://doi.org/10.32614/RJ-2017-016
[19] Laszlo A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre. 2013. Facing Imbalanced Data–Recommendations for the Use of Performance Metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 245–251.
[20] Benjamin Kille, Fabian Abel, Balázs Hidasi, and Sahin Albayrak. 2015. Using Interaction Signals for Job Recommendations, Vol. 162. 301–308. https://doi.org/10.1007/978-3-319-29003-4_17
[21] Ute-Christine Klehe. 2004. Choosing How to Choose: Institutional Pressures Affecting the Adoption of Personnel Selection Procedures. International Journal of Selection and Assessment 12, 4 (Dec. 2004), 327–342. https://doi.org/10.1111/j.0965-075X.2004.00288.x
[22] Emanuel Lacic, Dominik Kowald, Markus Reiter-Haas, Valentin Slawicek, and Elisabeth Lex. 2017. Beyond Accuracy Optimization: On the Value of Item Embeddings for Student Job Recommendations. CoRR abs/1711.07762 (2017). arXiv:1711.07762 http://arxiv.org/abs/1711.07762
[23] Vladimir Iosifovich Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, 8 (Feb. 1966), 707–710. Doklady Akademii Nauk SSSR, V163 No4 845-848 1965.
[24] Jianxun Lian, Fuzheng Zhang, Min Hou, Hongwei Wang, Xing Xie, and Guangzhong Sun. 2017. Practical Lessons for Job Recommendations in the Cold-Start Scenario. In Proceedings of the Recommender Systems Challenge 2017 (RecSys Challenge '17). ACM, New York, NY, USA, Article 4, 6 pages. https://doi.org/10.1145/3124791.3124794
[25] Andy Liaw and Matthew Wiener. 2002. Classification and Regression by randomForest. R News 2, 3 (Dec. 2002), 18–22. http://CRAN.R-project.org/doc/Rnews/
[26] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://www.aclweb.org/anthology/W04-1013
[27] Linda Miller and Jacqueline Budd. 1999. The Development of Occupational Sex-role Stereotypes, Occupational Preferences and Academic Subject Preferences in Children at Ages 8, 12 and 16. Educational Psychology 19, 1 (1999), 17–35. https://doi.org/10.1080/0144341990190102
[28] Tara Sophia Mohr. 2014. Why Women Don't Apply for Jobs Unless They're 100% Qualified. Retrieved June 18, 2019 from https://hbr.org/2014/08/why-women-dont-apply-for-jobs-unless-theyre-100-qualified
[29] Christoph Molnar. 2019. Accumulated Local Effects (ALE) Plot. In Interpretable Machine Learning, Chapter 5.3. https://christophm.github.io/interpretable-ml-book/
[30] Cristian Moral, Angélica de Antonio, Ricardo Imbert, and Jaime Ramírez. 2014. A Survey of Stemming Algorithms in Information Retrieval. Information Research: An International Electronic Journal 19, 1 (March 2014), 22.
[31] Jun-Ping Ng and Viktoria Abrecht. 2015. Better Summarization Evaluation with Word Embeddings for ROUGE. (Aug. 2015). https://doi.org/10.18653/v1/D15-1222
[32] ng-j-p. 2015. ROUGE summarization evaluation metric, enhanced with use of Word Embeddings. Retrieved June 11, 2019 from https://github.com/ng-j-p/rouge-we
[33] Chris D. Paice. 1990. Another Stemmer. SIGIR Forum 24, 3 (Nov. 1990), 56–61. https://doi.org/10.1145/101306.101310
[34] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). Association for Computational Linguistics, Stroudsburg, PA, USA, 311–318. https://doi.org/10.3115/1073083.1073135
[35] Ryan L. Philips and Rita Ormsby. 2016. Industry classification schemes: An analysis and review. Journal of Business & Finance Librarianship 21, 1 (Jan. 2016), 1–25. https://doi.org/10.1080/08963568.2015.1110229
[36] raghavrv. 2015. [RFC] Missing values in RandomForest #5870. Retrieved June 11, 2019 from https://github.com/scikit-learn/scikit-learn/issues/5870
[37] Stacey Ronaghan. 2018. The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark. Retrieved June 13, 2019 from https://medium.com/@srnghn/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
[38] Ando Saabas. 2019. andosa/treeinterpreter. Retrieved June 13, 2019 from https://github.com/andosa/treeinterpreter/blob/master/treeinterpreter/treeinterpreter.py
[39] Kuo-Tsung Tseng, De-Sheng Chan, Chang-Biau Yang, and Shou-Fu Lo. 2018. Efficient merged longest common subsequence algorithms for similar sequences. Theoretical Computer Science 708 (Jan. 2018), 75–90. https://doi.org/10.1016/j.tcs.2017.10.027
[40] Victor Zhou. 2019. A Simple Explanation of Gini Impurity. Retrieved June 13, 2019 from https://victorzhou.com/blog/gini-impurity/
A FEATURE IMPORTANCE

Full model                            Male model                            Female model
Graduation year             0.1342    Years of experience         0.1613    Graduation year             0.1508
Years of experience         0.1214    User country                0.1205    Count user skills           0.1369
ROUGE-L skills              0.0918    Count user skills           0.1011    ROUGE-L skills              0.1026
skill-rarity1               0.0776    Graduation year             0.0893    Years of experience         0.1007
Count user skills           0.0754    skill-rarity1               0.0861    Dutch                       0.0889
User country                0.0743    ROUGE-L skills              0.0611    skill-rarity1               0.0636
Dutch                       0.0710    Job rating year             0.0497    Levenshtein job title       0.0472
Levenshtein job title       0.0424    Count user languages        0.0467    skill-rarity2               0.0373
User gender                 0.0413    Code user education field   0.0442    Job title code              0.0361
skill-rarity2               0.0379    Levenshtein job title       0.0420    User country                0.0349
Count user languages        0.0315    skill-rarity2               0.0376    Count user languages        0.0327
Code user education field   0.0305    skill-rarity3               0.0278    skill-rarity3               0.0251
Job rating year             0.0252    Dutch                       0.0178    Job classification code     0.0197
skill-rarity3               0.0250    Job title code              0.0161    Job rating year             0.0194
Job title code              0.0231    Job classification code     0.0122    Code user education field   0.0140
Job classification code     0.0159    Company average empl. age   0.0115    Company average empl. age   0.0130
Count required skills       0.0118    Count required skills       0.0111    Count required skills       0.0124
Company average empl. age   0.0110    Industry percent female     0.0102    Industry percent female     0.0104
Industry percent female     0.0096    Company percent females     0.0097    Company age                 0.0097
Company percent females     0.0087    Company age                 0.0090    Company percent females     0.0089
Company age                 0.0083    Job level code              0.0078    Company foundation year     0.0080
Company foundation year     0.0069    Industry GICS class         0.0075    Job level code              0.0066
Job level code              0.0065    Company foundation year     0.0072    Industry GICS class         0.0063
Industry GICS class         0.0063    Tasks maximum               0.0049    Tasks maximum               0.0057
Tasks maximum               0.0051    Company size                0.0044    Company size                0.0047
Company size                0.0041    Tasks different count       0.0017    Tasks different count       0.0015
Tasks different count       0.0014    Tasks minimum               0.0008    Tasks total count           0.0010
Tasks minimum               0.0008    Tasks total count           0.0005    Tasks minimum               0.0007
Tasks total count           0.0006    Job location                0.0002    Job commitment              0.0007
Job commitment              0.0004    Job commitment              0.0002    Job location                0.0004
Job location                0.0002