Bachelor Thesis Information Sciences:
Under The Influence - Big Data and its influence on recruitment
Tim Lorent (10452141)
Bachelor Information Sciences
FNWI (Faculteit der Natuurwetenschappen, Wiskunde en Informatica)
Under supervision of L. Stolwijk
Universiteit van Amsterdam
26 August 2015
Abstract
Today we generate increasingly large amounts of data with every interaction with our mobile phones, our social media
activities, and business transactions. In the year 2012 alone, around 2.5 billion gigabytes of data were produced. This
phenomenon, of huge quantities of data that require complex and smart algorithms to process and analyse them, is
called ‘big data’. With the advent of big data, the number of variables that can be used for profiling, or people
analytics, has increased. As such, businesses are eager to apply this to processes such as recruitment or hiring. By
conducting a qualitative literature study, this thesis investigates whether businesses are increasingly relying on external sources
of knowledge and how this influences recruitment. Consequently, it seeks to find out what can be lost and gained and
what recommendations can be made. It first provides a theoretical background on big data and its applications, privacy
issues, and criticism. Then, it will outline how external knowledge impacts business processes, including the relevant
advantages and disadvantages. Although the presented disadvantages to using big data in recruitment can be seen as
valid, the exciting new possibilities cannot be ignored. The thesis concludes with a theory and conceptual framework
that provides a guideline for businesses to successfully and safely implement big data in the recruitment process.
Contents
1 Introduction 4
1.1 Background 4
1.2 Problem definition 5
1.3 Research questions 5
2 Research methodology 6
2.1 Research approach and process 6
2.2 Literature research 6
2.3 Selection of data 8
2.4 Methodology evaluation 8
3 Frame of Reference 9
3.1 What is Big Data? 9
3.2 Big Data Collection 10
3.3 Big Data Opportunities 11
3.4 Privacy Issues 13
3.5 Criticism & Obstacles 15
4 Theoretical study 17
4.1 Business Processes 17
4.2 External Knowledge 18
4.3 The Hiring Process 18
4.4 Data Profiling in Recruitment 19
4.5 Advantages 20
4.6 Disadvantages 21
5 Analysis & Results 22
5.1 Frameworks and Models 22
5.2 Statistical Errors 26
5.3 Final Theory 28
6 Conclusion 31
6.1 Recruitment & Big Data? 31
6.2 Discussion 31
6.3 Future Research 32
1 Introduction
The introduction will provide a short overview of the concept of big data. Although it will be
extensively discussed in the theoretical framework, some knowledge is required to discuss the
problems and purpose of this thesis.
1.1 Background
Today, we generate increasingly large amounts of data with every interaction with our mobile
phones, our social media activities, and business transactions (George et al., 2014). In the year 2012
alone, around 2.5 billion gigabytes of data were produced (Wall, 2014). This is only expected to
increase significantly every year, as the number of smartphone users continues to grow (Wamba et al.,
2015). The phenomenon of huge quantities of data that require complex and smart algorithms to
process and analyse them is called ‘big data’. These data are commonly recognised as big when they
meet the following requirements: volume, velocity, and variety (McAfee & Brynjolfsson, 2012).
Volume because the amount is growing at an exceptional pace, velocity because new technologies
have made it possible to increase the speed at which these data are created (Welling, 2014), and
variety because these data are created in many shapes and come from various sources (Hoy, 2014).
Big data has inherent privacy issues. For example, highly sensitive information could leak
(Williamson, 2015). Data triangulation, combining data from multiple sources, allows anyone to
make inferences about people (George et al., 2014). Consequently, some advocate for agreements,
laws, legislations and policies to be put in place to safeguard our data. These include mechanisms for
data protection (George et al., 2014), consent forms (Boyd & Crawford, 2012), and educating the
public to make them ‘data savvy’ (Schadt, 2012).
These data are either consciously generated, which is called “self-quantification data”, or
unconsciously, which is called “data exhaust” (George et al., 2014). Afterwards, big data are being
applied to improve processes in many fields. Examples include improving airline ETAs, predicting
crimes, natural disasters and customer behaviour, improving recommendation systems on online
services such as Netflix, and changing recruitment and hiring practices. For example, several data
mining companies are using big data to find factors that determine job retention, testing skills
related to job performance, and searching for qualified programmers on the internet.
With the advent of big data, the number of variables that can be used for profiling, or people
analytics, has increased. As such, companies are eager to apply predictive analytics not only to their
business processes, but also their hiring practices (Peck, 2013). This noticeable increase in applying
data to people analytics (Bersin, 2015) has transformed how people are hired, fired and promoted
(Ibarra, 2013). This thesis further researches the possibilities of this new field.
1.2 Problem definition
Besides the internal knowledge that businesses have access to, the potential of the world of big data
has opened up new possibilities. As mentioned, big data are being implemented by businesses in
their recruitment and hiring practices. The essence of it is that people are profiled on the basis of
big data. These data are acquired via online personality tests, video games, or algorithms that detect
talented employees (Huhman, 2014; Economist, 2013; Peck, 2013). However, this noticeable
increase in applying big data to people analytics (Bersin, 2015) might lead to decisions on candidates’
applications being based solely on results generated by algorithms that have analysed big data.
Research has already shown that this leads to discrimination (Gangadharan, 2014; Barocas & Selbst,
2014), but it could lead to other problems as well. The field of Human Resource Management and
recruitment, as the name already suggests, needs a human element. As will be discussed in this
thesis, machines do not yet have the capacity to think like humans do, and although they bring a
new dynamic to the hiring table, to what extent can they be applied? In other words, might big data
be replacing well-known paradigms and knowledge based on experience, i.e. “tacit knowledge”, too
soon?
1.3 Research questions
As big data can create new possibilities in business processes, this thesis will explore whether businesses are
increasingly relying on external sources of knowledge, namely big data, and, if so, how this influences
business processes such as recruitment or hiring. Furthermore, what can be gained and lost by
applying big data to recruitment or hiring? What recommendations can be made to ensure the
successful and safe implementation of big data? The primary focus is on the influence that big data
can have on business processes, specifically hiring or recruitment, and whether or not these processes
become more effective when big data is applied. This research can help companies determine the
degree of external data that is used and help them find a suitable balance between internal and
external sources of knowledge to improve their processes.
2 Research methodology
This section outlines the research methodology. It describes the research approach and process, what
methods were used to search for literature and what the selection process was for including the
literature in the thesis. It concludes with an evaluation of the used methodology.
2.1 Research approach and process
For this thesis, a qualitative research approach was used. The focus is on answering ‘why’ and ‘how’
questions, specifically how the business process of hiring or recruitment is influenced by big data.
This should result in a theory based on existing and new models. It is an explanatory study, finding
an explanation for descriptive information (Gray, 2014). Qualitative research contributes to theory
building, as new theoretical insights are gained by relying on multiple perspectives of various
resources of a phenomenon and comparing these (Doz, 2011; Gray, 2014). It is important to take into
consideration the multidimensionality of available resources and provide contradicting views.
Furthermore, the approach was inductive, conceptualising theoretical perspectives after literature
research (Gray, 2014). This approach was best suited for this thesis, as it constructs a generalised,
relational view of a phenomenon. As will be further explained in this section, inductive reasoning
begins with planning data collection and then analyses this data to detect patterns and relationships
between theories (Gray, 2014). Finding an answer to the question of how big data influences
recruitment, without pre-existing theories or hypotheses, means that a diverse amount of theoretical
or literary research must be conducted, after which the findings from this research can be combined
and compared to gain new insights. These insights also come from existing frameworks and models,
as well as new models that are the result of a combination of theories in the literature and existing
models. The models, frameworks and theories found and generated in this thesis have eventually led
to a theory on the implementation of big data in recruitment, answering the subquestions.
2.2 Literature research
The thesis is divided into several sections or themes, all with the overarching concept of big data.
Each section explains different parts associated with big data that are relevant for this thesis. This
division aided the literature research, as it provided guidance for answering the related research
questions. The literature that was searched for was categorised accordingly. The first theme is big
data, with the sub-themes of collection, application, privacy issues and criticism. The second theme
is business processes, which is further divided into types of processes and reliance on external
information. A variation in business processes is needed to illustrate where big data can have an
influence. The third theme, used to go into further detail on the influence of big data, is the main
frame of reference. This is recruitment and data profiling, divided into sub-themes of advantages,
disadvantages and possible situations that can emerge once big data is applied.
Out of the literature research came an extensive body of work on big data itself. The
increased interest and wider implementation of it in businesses and organisations provided a large
amount of available research and literature. Consequently, the concept of big data could be
explained in great detail, as well as illustrating it by using the myriad of applications available in
various fields such as healthcare and crime. However, the literature does show that there is no
consensus regarding the meaning of big data. Combining all the different views did result in a
rationale relevant for the focus of this thesis. Although big data can help accelerate processes in
various fields, a remarkable amount of literature was found that focused on the negative side-effects
and inherent issues. As this thesis set out to answer what is lost and gained by using big data, the
literature consequently offered more insight into what can be lost than into what can be gained.
A variety of academic as well as magazine articles was available for the second theme,
business processes. As the process of recruitment or hiring is the main frame of reference, this
section also provides a short overview of the history of it and the inherent issues. Although this
theme is partly used to diversify the processes that businesses have in order to illustrate the impact
big data can have, the main focus is the reliance on external knowledge. The literature shows that
even before big data, businesses relied on other sources of knowledge besides the internal knowledge
they have of their processes, services, products and clients. Almost all literature shows the
importance and relevancy of external knowledge for business innovation and competitiveness.
However, several authors do emphasise that this should not over-shadow internal knowledge.
The third and final theme provided more articles on the web than in academic journals.
Perhaps this is due to the fact that the implementation of big data in recruitment is not yet widely
used and still relatively new. Again, more disadvantages than advantages to using big data in hiring
practices were found in the literature. There is a general tendency to believe that businesses do not
yet possess the necessary capabilities required for competent data analysis. As such, these limitations
and disadvantages can be used by businesses as guidelines to learn from their mistakes.
Although the main focus was on the impact that big data can have on business processes, a
comprehensive and detailed explanation of the concept itself provided the necessary insight into the
capabilities of big data. How big data are collected and used, as well as the privacy issues and
criticisms, are all linked to its influence on business processes and especially recruitment. This
provides the necessary background for the researcher and helped to comprehend the ‘bigger
picture’. The remainder of the literature, concentrating on external knowledge and data profiling,
has significantly helped answer the research questions. Furthermore, the body of work found on
people analytics and data profiling has shaped the sub-questions, reflecting on the influence on
hiring processes and what is consequently gained or lost.
2.3 Selection of data
After initial discovery, part of the literature was discarded from further use in the thesis, as it
merely provided more of the same answers or confirmed existing theories. The initial qualifications or
requirements for the inclusion of literature were that they had to contribute to answering the
research questions, provide new insights, and if possible provide insights that oppose the principal
views. Thus, the aim of the literature was not to provide a singular vision of every theme, instead to
provide a comprehensive and multi-faceted description that was sometimes contradictory. The
literature that was needed for each theme was guided by the following questions:
• Big Data: How does the text communicate a theory? How does it view the concept of big data?
How does it explain the process of collection? How does it illustrate the use of big data? What
type of privacy issues and critical points does it communicate?
• Business processes: How does the text communicate a theory? Which business process does
it present in combination with big data? How does it view the use of external knowledge in
business processes? How does it explain hiring? How does it view hiring problems?
• Recruitment: How does the text communicate a theory? How does it explain recruitment and
data profiling? How does it provide examples of big data in recruitment? How does it view
recruitment augmented by big data?
The answers to these questions were searched for in the literature and highlighted. Each theme has
its own literature, and each article or book was summarised after reading, on the basis of the
answers to the questions. Then, these summaries were used to provide the theory behind each
theme and sub-theme.
2.4 Methodology evaluation
The used methodology proved to be a fit for this thesis. The qualitative, inductive approach was
beneficial for theory building and for researching literature to gain multiple perspectives of the
phenomenon of big data in recruitment. There was no theory or hypothesis on which this research
was based, so it was necessary to first collect data and then find patterns and explanations. Dividing
the research into themes and sub-themes proved helpful for guiding the research process, as well as
using critical questions to evaluate the literature.
3 Frame of Reference
This section of the thesis discusses the extensive literature found on the subject of big data. It
provides a theoretical background of big data by explaining the concept and who gathers and
generates it. Furthermore, it will detail the opportunities of big data in several fields, as well as
discussing the debates on related privacy issues and finally, criticism of its relevancy and use.
3.1 What is Big Data?
Imagine walking around in the city when you get a phone call from a friend. You have a short
conversation, after which you agree to meet him at a restaurant nearby. But first you need to go to
an ATM for cash. You complete the transaction, retrieve your money and proceed to walk to your
destination. On the way over to the restaurant, you see a moment that has to be captured by your
camera. After taking a photo and posting it on Instagram, you send it to your friends via WhatsApp.
Eventually you arrive at the restaurant and you check-in on Facebook, notifying your friends of your
whereabouts. You may not realise this, but in this short amount of time you have generated a wealth
of data - ‘big data’. These data were generated from multiple sources: your mobile transactions and
activities, your user-generated content on platforms such as social media, and the content you
generated through business transactions (George et al., 2014). You are not the only actor in this
process though. The world is awash with more information than ever before and the scale of our
databases continues to grow at a rapid pace (Mayer-Schönberger & Cukier, 2013): approximately
2.5 billion gigabytes of data were generated in 2012 (Wall, 2014). Facebook must have a
considerable share in this, for it has 1.1 billion active users (Fairfield & Shtein, 2014) and processes
2.5 billion pieces of content a day (Kitchin, 2014). Around 2010-2011, about 4 billion
mobile-phone users were identified, of which 12% used smartphones (Wamba et al., 2015). Furthermore,
the global amount of digital information is expected to grow 45% every year, to approximately 8 trillion
gigabytes in 2015 (Fanning & Grant, 2014).
So what exactly are these big data that everyone is generating and talking about? The
literature shows that there is not one clear, generally accepted definition of the concept. Although
most, initially, define it as large or ‘massive’ datasets (Hoy, 2014; Lewis et al., 2013; Boyd &
Crawford, 2012; Chen et al., 2012; Crumbly & Church, 2013; Mahrt & Scharkow, 2013; Manyika
et al., 2011), there is more at play than just the sheer quantity or size. Williamson (2015) and
Kitchin (2014) argue it is about the interconnectedness between once disparate datasets, as do
Boyd & Crawford (2012), who believe the value lies not in the quantity of information but in the
capacity to cross-reference the datasets. What other characteristics make big data ‘big’? The most
common characteristics found in the literature are the three V’s (Hoy, 2014; Kitchin, 2014; McAfee
& Brynjolfsson, 2012): volume, velocity, and variety. Volume, because of the amount of created
data. For example, Walmart, the grocery behemoth, generated 2.5 petabytes of data (a petabyte is
one million gigabytes) relating to more than 1 million customer transactions every hour in 2012
(Kitchin, 2014). Velocity refers to the speed of creation, as big data shows similarities to Moore’s
law: its volume doubles roughly every two years (Welling, 2014). Variety, because the data come in
many shapes, such as text messages, photos, and GPS signals, and are created on multiple platforms and from various sources (Hoy,
2014). Other researchers and authors have added their own characteristics as well. Williamson
(2015), for example, argues that big data should be valid, veracious or authentic, valuable and
visible. Kitchin (2014) states that it should also be exhaustive in scope, flexible and scaleable, as well
as relational. The scope should strive to capture entire populations or systems, it should be flexible
enough to be easily extended and scaled, and finally the nature of it should be relational by
containing common fields that enable various data sets to be connected. This is ultimately what big
data is all about: combining or inter-connecting large, once disparate datasets to gain new insights.
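The cross-referencing of once disparate datasets described above can be sketched in a few lines of code. The records and the shared customer_id field below are invented for illustration; the point is only that a common field lets two individually unremarkable datasets be merged into a richer profile.

```python
# Minimal sketch of the "relational" property of big data: two hypothetical
# datasets joined on a common field yield information neither holds alone.
# All records here are invented for illustration.

purchases = [
    {"customer_id": 1, "item": "running shoes"},
    {"customer_id": 2, "item": "coffee maker"},
]
locations = [
    {"customer_id": 1, "city": "Amsterdam"},
    {"customer_id": 2, "city": "Utrecht"},
]

def cross_reference(left, right, key):
    """Join two record lists on a shared key, merging their fields."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

profiles = cross_reference(purchases, locations, "customer_id")
# Each profile now combines purchase and location data, e.g.
# {"customer_id": 1, "item": "running shoes", "city": "Amsterdam"}
```

The same join-on-a-common-field mechanism underlies the data triangulation discussed later in the privacy context.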
3.2 Big Data Collection
Mankind was not always capable of collecting huge amounts of data. The invention of modern
writing made it possible to record data for future reference, albeit slowly and arduously (Hoy, 2014).
Computers made the process easier and businesses started storing data in relational databases
(Fanning & Grant, 2014) and governments captured statistics of populations to generate national
censuses (Kitchin, 2014). In the early 2000s, the internet offered unique data collection, analytical
research and development opportunities for various sectors (Chen et al., 2012). Gone were the days
of large-scale mailing of catalogs or using phone directories to target customers (Fanning & Grant,
2014). Instead, retailers and other industries could interact with their customers directly through
IP-specific user search, using interaction and server logs and cookies (Chen et al., 2012). Customers
could basically be tracked by looking at what they buy, what they look at, how they navigate through
the website, and if they are influenced by promotions and reviews (McAfee & Brynjolfsson, 2012).
This is exemplified by the Chinese online shopping service Alibaba, which tracks and collects the huge
amounts of consumer data that are produced on its website every day (Palmer, 2015). Or Target,
an American store that collects data on each customer by linking their daily transactions and
communications with the company to a guest ID which includes demographic information (Duhigg,
2012). Lastly, big data is also collected by companies trying to improve their hiring practices and
decisions. They either search these data themselves by scanning online behaviour or buying them
from data mining companies selling publicly available data (Barocas & Selbst, 2014; Preston, 2015).
Today, the omnipresence of computing and mobile devices has created a new data era.
Several technological advances have helped speed up the process of generating these billions of
gigabytes of data. Examples are cheaper storage and improved computer processing power
(Williamson, 2015) that have made computational techniques for large-scale data analysis possible
(Lewis et al., 2013), as well as widespread internet (Kitchin, 2014), and the global expansion of
mobile phones that has made it possible for us to generate data on-the-spot.
Big data are generated from a variety of sources: location information through digital
CCTV, retail purchases, electronic communications such as social media postings, ‘clickstream’ data
that track an individual’s navigation through a website or app, measurements from sensors such as
RFID (Crumbly & Church, 2013; Kitchin, 2014). According to George et al. (2014), these sources
of high volume data can be sub-divided into two categories: “data exhaust” and “self-quantification
data”. The former refers to ambient data, such as mobile phone activity, that are passively collected
and only become useful once combined with other sources. This data exhaust contains information
that is normally not visible and only revealed when an individual interacts with information
technologies. Gaining access to this information is referred to as “reality mining”: processing large
quantities of data from mobile devices to predict human behaviour (Mayer-Schönberger & Cukier,
2013). The latter refers to data that are actively revealed by an individual through the use of sensors
that monitor, for example, exercise or movement. Kitchin’s (2013) categories - directed, automated,
and volunteered - bear a resemblance to the categories of George et al. (2014). Directed data are
generated by surveillance, such as CCTV, and automated data are generated as an inherent,
automatic function of a device, such as a mobile phone that records its user’s activity. Similar to
self-quantification data, volunteered data are generated or ‘gifted’ by the user themselves.
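As a toy illustration of the ‘reality mining’ idea above, the following sketch aggregates a hypothetical log of data exhaust (the hours at which one phone registered activity) to reveal a pattern that no single record shows. All timestamps and the inference drawn from them are invented.

```python
# Toy "reality mining" sketch: ambient phone-activity records ("data
# exhaust") only become informative once aggregated. Invented data.

from collections import Counter

# Hypothetical log of hours at which one user's phone registered activity.
activity_hours = [8, 8, 9, 12, 13, 13, 18, 18, 18, 22]

events_per_hour = Counter(activity_hours)
peak_hour, peak_count = events_per_hour.most_common(1)[0]
# peak_hour is 18: the aggregate reveals a recurring evening routine that
# no individual record exposes on its own.
```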
3.3 Big Data Opportunities
Many actors, consumers and businesses alike, have been producing large datasets for a long time
(Kitchin, 2014). However, they are now able to more effectively capture and analyse these large
datasets than before (Hoy, 2014). They are finding ways to implement these datasets to create value
for individuals, their business, communities and governments (George et al., 2014). For example, big
data can play a significant economic role to the benefit of national economies and their citizens.
The US could cut its healthcare expenditure by 8% and European government administrations
could save approximately €100 million in operational efficiency improvements (Manyika et al.,
2011). Another example is improving airline ETAs (McAfee & Brynjolfsson, 2012), which shows
how a combination of internal and external information is used to improve operational efficiency. A
major American airline used existing ETAs and hired a data mining company to collect publicly
available data on factors that influence arrival times, such as weather. They then combined this
external data with the company’s stored proprietary data. This resulted in multidimensional
information, allowing for sophisticated analyses and pattern matching, which effectively reduced the
gaps between estimated and actual arrival times. Other examples include analysing workplaces for
behaviour and efficiency (George et al., 2014), predicting crimes or pharmacy stock (Williamson,
2015), stopping terrorist attacks or flu outbreaks (Harris, 2014), and predicting natural disasters such as hurricanes
(Fanning & Grant, 2014).
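The airline example above, combining internal ETAs with bought-in external weather data, can be sketched as follows. The delay factors and the simple multiplicative model are invented assumptions for illustration, not the airline's actual method.

```python
# Hedged sketch of the airline-ETA example: a proprietary (internal)
# estimate adjusted by an external weather factor. All numbers invented.

scheduled_eta_min = 130          # internal data: flight plan estimate in minutes
weather_delay_factor = {         # external data: bought-in weather conditions
    "clear": 1.00,
    "rain": 1.05,
    "storm": 1.25,
}

def refined_eta(scheduled, conditions):
    """Adjust the internal estimate using the external weather factor."""
    return scheduled * weather_delay_factor[conditions]

print(refined_eta(scheduled_eta_min, "storm"))  # 162.5 minutes
```

The design point is the one made in the text: neither dataset alone suffices, but combining internal and external sources narrows the gap between estimated and actual arrival times.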
Figure 1: Benefits of successful big data implementations (Fanning et al., 2013)
Specific industries have benefited from and are widely implementing big data. Bloomberg
logs employee activities such as keystrokes and Harrah’s, a Las Vegas casino, tracks the smiles of its
employees, because its analytics team has quantified the impact that a smile can have on customer
satisfaction rates (Peck, 2013). Sears, a $36 billion retailer with more than 230,000 employees, hires
approximately 150,000 sales representatives a year and bases the hiring decisions on online
simulation tests and existing information on the competencies of current employees (Rifkin, 2014).
Target, as previously mentioned, gathers vast amounts of customer data. These data are then used by a
predictive analytics team to understand customers’ habits in order to market to them more effectively
(Duhigg, 2012). Purchases are analysed, for example which items are bought in combination with
each other, and each customer is assigned a score that predicts how likely they are to buy a given product.
The moment an individual’s or family’s shopping habits alter or become more flexible, and of
course more vulnerable to marketing intervention, Target starts a marketing scheme to nudge their
customers into new spending habits. Other examples include Google, which significantly updated its
web search algorithm to allow for more targeted advertising, and Netflix, which has revolutionised the
streaming industry by altering how people consume movies and television series (Hoy, 2014). The
city of Los Angeles uses big data in predictive policing, resulting in a 26%
decrease in burglaries (Welling, 2014). Further implementations are in the fields of biology,
healthcare, and people and business analytics (Chen et al., 2012; Savage, 2014; Bersin, 2015;
Davenport, 2013). For example, cancer research has benefited from big data as new discoveries are
made concerning the growth of cancer cells. Researchers have also developed a learning health
system, CancerLinQ, where patients can receive diagnoses based on findings from big data. These
developments by Google, Netflix and CancerLinQ are described by Hoy (2014) as “disruptive
breakthroughs”, because they could make other systems obsolete in the future.
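The kind of co-purchase scoring described for Target above can be illustrated with a deliberately simplified sketch. The baskets and the score definition (the share of baskets containing item A that also contain item B) are invented for illustration and are not Target's actual model.

```python
# Simplified co-purchase propensity sketch, in the spirit of the Target
# example: how often does item B appear alongside item A? Invented data.

baskets = [
    {"lotion", "vitamins"},
    {"lotion", "vitamins", "cotton balls"},
    {"lotion", "bread"},
    {"bread", "milk"},
]

def co_purchase_score(item_a, item_b, baskets):
    """Share of baskets containing item_a that also contain item_b."""
    with_a = [b for b in baskets if item_a in b]
    if not with_a:
        return 0.0
    return sum(item_b in b for b in with_a) / len(with_a)

score = co_purchase_score("lotion", "vitamins", baskets)
# score is 2/3: two of the three lotion baskets also contain vitamins.
```

A retailer would compute such scores over millions of baskets; here the toy data only shows the shape of the calculation behind "predicting the likelihood of buying a given product".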
Two other cases deserve more detailed descriptions: the US presidential elections of 2012
(Fanning & Grant, 2014; Kitchin, 2013) and the case of Paris Brown (Crumbly & Church, 2013). In
2012, the administration of President Obama used big data for its campaign. They collected this
data from multiple sources - website cookies, data assembled from registration, government and
census datasets as well as social media sites - resulting in a massive, interrelated database about every
voter in the country. It consisted of around 80 variables, relating to the voter’s demographics, social
and economic history, as well as patterns of behaviour and consumption. This large-scale, complex
and detailed data gathering provided Obama’s team with an exhaustive insight into US society. Paris
Brown was appointed Youth Police and Crime Commissioner for Kent, but was soon relieved of her
position after it emerged that she made ‘offensive’ and ‘inappropriate’ tweets that did not live up to
the standards of her office. However, these tweets were made between the ages of 14 and 16, prior
to her job appointment. Her digital footprint left an indelible data trail that led to her prompt
resignation.
3.4 Privacy Issues
Big data can bring a lot of profit for companies, while at the same time improving their operational
efficiency. Target, Sears, Google, Netflix and Facebook are all good examples. However, it also
brings to mind the issue of privacy. As Williamson (2015) argues, big data can be used to provide
powerful new ways to see and interpret the world, but this comes with some inherent ownership,
misuse and privacy issues. Of course, there is the risk of inappropriate surveillance and of leaks of
personal data (Williamson, 2015); as these data are nowadays generated in large amounts, the
chances of privacy infringement only increase. This amount also impacts IT security, as companies and IT
departments have to struggle with an overload of data resulting from diverse data sources, formats
and volumes (Chen et al., 2012). Furthermore, privacy risks are multiplied when data are combined
(Crumbly & Church, 2013). This can lead to imputed identity by inferring an individual’s identity
through data triangulation from multiple sources (George et al., 2014).
As was established, big data are generated from a variety of sources (Crumbly & Church,
2013; Kitchin, 2014). These sources are all capable of collecting highly confidential information,
such as your location, personal beliefs, purchase history and even your phone records. This is
exemplified in the case of the PRISM program, that focused public attention on the nature of mass
market consumer data mining (Fairfield & Shtein, 2014). This program involves the collection and
analysis of foreign and domestic communications from a range of sources, including companies
such as Microsoft, Google and Facebook (Crumbly & Church, 2013). Target and Walmart, with
their vast collection of demographic information, and the Obama administration, with its
large-scale database on voters’ social and economic history and patterns of behaviour and consumption,
all play a part in the increasing distrust of the use of big data. Furthermore, a Harvard research
group conducted a Facebook study with the intention of establishing how people’s interests and
friendships changed over time (Fairfield & Shtein, 2014). They gathered data from 1,000 students
but unfortunately this was done without their consent.
Although some examples above show that government agencies, social media websites and
stores gather our information, we also have a share in the aggregation of all these big data. This
raises questions about epistemology, about how knowledge is constituted (Lewis et al., 2013). As
defined by George et al. (2014) and Kitchin (2013), self-quantification and volunteered data are data
generated by users themselves. Individuals are constantly leaving digital traces (Boyd & Crawford,
2012) and choose to disclose personal information on the internet and in social networks such as
Facebook (Schadt, 2012). This easy sharing of sensitive information loosens our expectations of
what information is actually private (Schadt, 2012). According to Boyd & Crawford (2012),
accountability of big data is a multi-directional relationship. As companies are held accountable for
our information, we are accountable for our own information just as well.
Big data not only raises questions of epistemology, but also of ethics (Lewis et al., 2013): how
will user privacy be protected? Williamson (2015) argues that legislation should keep up with our
increasing use of big data, to prevent mistakes and misuse from happening. The inherent risks of big
data mean that strong governance is required. Boyd & Crawford (2012) suggest that consent from
the public should be obtained and that researchers or companies should not simply consider the use
of big data ethical because the data are publicly available. Thus, mechanisms for data protection and
privacy should be set up, such as anonymous open data, access control, rights management, and
usage control (George et al., 2014). Although Service Level Agreements, or SLAs, already exist, in
which contractors can specify which requirements need to be met, these are not yet sufficiently
applied to big data. However, not all authors agree with the above statements. Schadt (2012) believes
education is key, aiming to prevent discriminatory use of big data by educating the public.
Crumbly & Church (2013) also advocate less regulation as each step of the big data lifecycle -
collection, combination, analysis, and use - is already regulated by frameworks that balance the costs
and benefits of big data. These have been proven effective in cases of Google and Facebook
(Crumbly & Church, 2013). The former wanted to secure and combine multidimensional user data
from its many services, while the latter misled its users concerning the publication of private
information.
Boyd & Crawford (2012) provide an appropriate insight into the paradoxical nature of big
data: it triggers both utopian and dystopian rhetoric. Big data is seen as a powerful tool that can
help alleviate societal problems and gain new insights into diverse areas. However, it might also be
viewed as a threat to our privacy and civil freedoms. The Chinese government is a prime example,
restricting its citizens in their internet use while at the same time collecting their data to create a
modern form of authoritarianism (Palmer, 2015). In some cases, such as with automatically collected
data, we might not even be aware of what data are actually collected. Thus, as Williamson (2015) argues, there
is a strong need to educate people to become ‘data savvy’. After all, computers and machines are not
yet capable of deciding what is sensitive information (Hoy, 2014) - we are.
3.5 Criticism & Obstacles
Several authors in the literature have voiced their critical opinion of big data. According to Kitchin
(2014), the increasing amount of data that are generated on such a large scale has caused a shift in
data analysis. Before, data were scarce, static and clean and were generated and analysed with
specific questions or hypotheses in mind. Now the data resources are abundant, exhaustive, varied
and dynamic, but also messy, uncertain and are analysed without specific issues or questions in
mind. This has brought, according to some (Kitchin, 2014; Mahrt & Scharkow, 2013; Steadman,
2013; Mayer-Schönberger & Cukier, 2013), death to theory and the scientific method. Hypotheses
are no longer generated in advance, as databases are mined using ‘snowball’ techniques and
conclusions are based purely on analysis. Rather than defining theories and hypotheses beforehand,
data are used to justify conclusions after the research.
Figure 2: Big data problems and potential solutions (Fanning et al., 2013)
Furthermore, big data can be deceptive and misleading: working with big data remains
biased and subjective (Boyd & Crawford, 2012), as the data are always examined through a
particular lens that influences interpretation (Kitchin, 2014). However, human interpretation can
prevent the acceptance of meaningless correlations resulting from ‘apophenia’: seeing patterns
where none exist (Hoy, 2014). Patterns found within a dataset are not necessarily meaningful, as
correlations can be
completely random. This was exemplified when researchers found a correlation between changes in
the S&P 500 stock index and butter production in Bangladesh (Boyd & Crawford, 2012).
Mayer-Schönberger & Cukier (2013) emphasise that there is a growing tendency to rely on correlations
rather than causality. Consequently, meaningful interpretations by proficient analysts are much needed as big
data continues to increase in size and impact. As Brooks (2013) mentions, big data has a weak spot
in the social aspect of analysis: humans are good at assigning value to objects by using their
emotions, but computers excel at measuring quantity and have difficulties with explaining context in
social situations. However, by 2018 the US alone will face a shortage of 140,000 to 190,000 people
with deep analytical skills. This is reflected in research focused on challenges related to getting
business value from big data, as Figure 3 shows.
A final point of critique concerns access to and the size of big data. Access to big data is
sometimes limited (Boyd & Crawford, 2012; Kitchin, 2013; Welling, 2014), and bigger or more
data are not necessarily better data. The sample of the dataset must be considered (Boyd &
Crawford, 2012) and more data does not automatically mean statistically significant correlations
(Brooks, 2013), as overfitting might cause less effective predictions and conclusions.
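The point that more variables invite chance findings can be illustrated with a small simulation. The following is a minimal Python sketch; all data are pure random noise, and the sample size and variable count are invented for illustration:

```python
import numpy as np

# With a small sample and many unrelated variables, some variables will
# correlate with an outcome purely by chance (cf. the S&P 500 / butter
# production example). Everything here is random noise.
rng = np.random.default_rng(seed=42)

n_samples = 30        # a small sample of "employees"
n_variables = 1000    # many irrelevant screening variables

outcome = rng.normal(size=n_samples)                  # e.g. job performance
noise = rng.normal(size=(n_variables, n_samples))     # unrelated variables

# Pearson correlation of each random variable with the outcome.
correlations = np.array([np.corrcoef(v, outcome)[0, 1] for v in noise])

strongest = np.abs(correlations).max()
spurious = int((np.abs(correlations) > 0.4).sum())

print(f"Strongest correlation found by chance: {strongest:.2f}")
print(f"Variables with |r| > 0.4 (all meaningless): {spurious}")
```

With 1,000 noise variables and only 30 samples, dozens of "strong" correlations typically appear, none of which generalise beyond this dataset.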
4 Theoretical study
This section discusses the theoretical study on the influence of big data on business processes,
specifically recruitment. It provides a background on business processes and external knowledge,
before diving into the hiring process. Advantages and disadvantages will be presented, as well as
possible statistical errors that can emerge when big data is applied to recruitment.
4.1 Business Processes
Big data has become a vital input for businesses as it can create new forms of economic value
(Mayer-Schönberger & Cukier, 2013). Big data is most commonly applied in decision making,
human resources and enhancing competitive advantage (PWC, 2014; Marr, 2015; Davenport et al.,
2012). The majority of firms in a study by PWC (2014) indicate that they use data and advanced
analytics to optimise a range of variables. Financial firms such as Western Union use large
datasets to decide on the optimal price range that can generate the most customer satisfaction and
shareholder value. Large internet companies such as Google use big data and data analytics for
employee engagement (Davenport et al., 2012). Ensuring highest employee productivity,
engagement and retention are among their top priorities. Barton & Court (2012) emphasise the
implementation of big data for competitive differentiation. Staying competitive means using the
increasing amount of data that are available on customer satisfaction, product innovation and
services. Being able to fully exploit these data requires three mutually supportive capabilities,
resulting in the model shown in Figure 3. First, the capability to identify, combine, and manage a
multitude of sources. Second, the capability to build advanced models for the prediction and
optimisation of outcomes. Third, the managerial capability to transform the organisation in support
of the data and models that can yield better decisions. Clear strategies for the use of and competing
with data are essential, as well as deploying the appropriate technologies. Staying competitive as
more companies learn the core skills of using big data means building the aforementioned
capabilities, as they become a decisive competitive asset (Barton & Court, 2012).
4.2 External Knowledge
As stated in the problem definition, the increasing access to a wide range of data from external
sources has opened up new possibilities for businesses. The various examples in section 3.1.2
illustrate that businesses, organisations and researchers are eager to implement these big data.
However, in the early 2000s there was still a bias towards internal information (Ojala, 2002). There
was no acknowledgement of the value of external data and all relevant information resided
internally. Gallego (2013) and Larrañeta (2012) emphasise the importance of external knowledge for
cooperation, innovation and competitiveness. External knowledge is increasingly important for the
innovation process, as well as the ability to remain competitive by building novelty into products and
operations. It can be expressed in either diversity or novelty (Larrañeta, 2012). Diversity exposes
managers to new perspectives on how to compete and promotes strategic variety, while novelty
refers to the extent of control that businesses have over that knowledge. It can further augment
internal knowledge and lead to the development of multiple competitive approaches and strategic
variety (Larrañeta, 2012). Internal knowledge still remains of value (Ojala, 2002), as it holds a
corporate memory or a sort of repository of successful and failed processes and employees.
Therefore, it is advisable that companies try to balance the development of internal knowledge with
the implementation of and search for external knowledge (Gallego, 2013).
4.3 The Hiring Process
To research the influence that big data can have on recruitment, it is necessary to have some
knowledge of the hiring process. In the late 1980s there was a strong reliance on interviews as a
selection device (Raza & Carpenter, 1987) and this is still common practice (Rutledge et al., 2008),
even though interviews offer little reliability and validity and are subject to biases and
subjective judgments. These biases can include an applicant’s likability, professional and personal
characteristics that match current employees, and demographic background. As for the hiring
process itself, it comprises four steps: recruitment, screening, selection, and job offer (Rutledge et
al., 2008). Screening refers to the filtering of the initial applicant pool by eliminating any individual
who does not meet the requirements. Selection adds data to narrow down the search: objective
production data, personnel data, judgmental data such as recommendations, and job or work
sample data. Other factors that influence recruitment, which recur in the literature, are P-O and P-J
fit (Kwok et al., 2011). The former refers to person-organisation fit, or the compatibility between
people and an organisation, the latter to person-job fit or the relationship between an individual’s
characteristics and those of the offered job. Achieving high levels of P-O fit through hiring is
essential for a workforce to remain flexible in and committed to the organisation (Kristof, 1996). The
model in Figure 5 (Kristof, 1996) illustrates the characteristics and process of P-O fit.
Figure 5: Conceptualisation of P-O fit (Kristof, 1996)
4.4 Data Profiling in Recruitment
Profiling people, based on a certain set of characteristics in order to classify and predict a certain
type of behaviour, has been a common practice for decades. With the advent of big data the
number of variables that can be used for profiling, or people analytics, has increased. As such,
companies are eager to apply predictive analytics not only to their business processes, but also to their
hiring practices (Peck, 2013). This noticeable increase in applying data to people analytics (Bersin,
2015) has transformed how people are hired, fired and promoted (Ibarra, 2013). Before, recruiters
were stuck in their own paradigms and models (Byrne, 2014) and hiring was based on common
tests, such as IQ tests, skills aptitude tests and physical exams (Huhman, 2014). Now, it is based on
algorithms that are trained using large amounts of data and are able to assess the potential of
individuals (Walker, 2012). As mentioned, big data are generated from a multitude of sources such
as social media postings and other electronic communications. These data are not only collected by
companies, but also given by ourselves (Schadt, 2012). We dispense all sorts of personal information
and by doing this leave a digital imprint (Preston, 2015) with our ‘likes’ and ‘dislikes’, political and
religious views, personal photographs and videos. Consequently, this can be used by smart
algorithms that evaluate and measure this information to understand not only behavioural patterns,
but also habits, age, type of friends you have, or if you are a suitable employee for a company.
Big data are now used by various data mining companies to evaluate a candidate, based on a
set of variables, before an interview. Companies like Xerox use big data to measure job retention,
finding correlations between employee engagement and home-to-work distance (Huhman, 2014).
Evolv, in turn, found that people who have ‘job hopped’ in the past do not necessarily quit earlier
than those who have not (Economist, 2013). They have generated such an expansive dataset using
various data-mining techniques that they are able to say with precision which attributes matter most
to the success of retail-sales workers or customer-service personnel at call centres (Peck, 2013). Knack,
a company that develops app-based video games, tests skills related to job performance by collecting
every step the user makes and by analysing these decisions can infer whether or not someone is
creative, persistent and socially intelligent (Peck, 2013). Tech recruiters such as Gild have developed
algorithms to use big data to mine open-source code found on the internet to identify good software
engineers - hire based on merit by tapping from the online talent pool (Peck, 2013).
As Davenport (2013) suggests, by implementing these data and smarter algorithms we have
now reached an era of analytics 3.0. The era of business intelligence constitutes analytics 1.0, where
businesses could only use data sparingly and analysis was time-consuming. The era of big data
constitutes analytics 2.0, when data began to be collected from external sources, and the era of
data-enriched offerings constitutes analytics 3.0, where analytics are used to support customer and
business services or business features and improve products. However, research on nearly 500
organisations has found that only 4% could perform predictive analytics about their workforce and
that only 14% has done any significant statistical analysis of employee data (Bersin, 2013).
4.5 Advantages
Recruitment augmented by big data presents some advantages. For example, humans bring an
inherent bias to selection processes (Huhman, 2014) and big data and computers can remove this.
Furthermore, recruiters have been stuck in their own paradigms (Byrne, 2014): hiring systems are still
based on out-dated practices such as studies of human behaviour and military techniques. Big data
can help reveal other aspects and uncover patterns that were initially not searched for (Dyche, 2012).
For example, data has shown that for certain jobs, such as customer-support, there is no correlation
between people with criminal records and work performance (Economist, 2013). Or that call-center
employees should not be judged on their experience but on their personality or creativity, as this is a
better indication of job retention (Walker, 2012). Finally, the selection step in the hiring process can
be augmented by big data by searching for and including more relevant data, as well as providing
the necessary algorithms for processing these data.
4.6 Disadvantages
Despite the above mentioned advantages, the literature identifies disadvantages as well: fallacies of
the machine, generalisability, excessive data, and statistical errors. Although machines remove bias
from selection processes, they can get it wrong (Peck, 2013). One company rejected applicants
because they did not possess a job title which only exists at their company (Economist, 2013). Other
examples are relying on past prejudices (Barocas & Selbst, 2014) and not understanding the context
of a specific situation (Byrne, 2014). Furthermore, they can potentially allow discrimination to
happen (Gangadharan, 2014; Barocas & Selbst, 2014). Mortgages are denied on the basis of race,
creditworthiness is based on social media posts and applicants are discriminated against for not
possessing the required skills that past, successful employees have.
There are also issues of generalisability. Data collected from social media might be skewed as
individuals purposefully shape a certain identity (Mahrt & Scharkow, 2013). Relying on one specific
system or site is also a common practice (Hargittai, 2015), leading to unsuccessful evaluations as the
depth and breadth of the data is limited (Rifkin, 2014). For example, LinkedIn and Twitter data are
broad but not deep - they provide us with a large amount of data but not everything is of substance.
Besides machine and generalisability fallacies, there is also a risk of relying on an excessive
amount of data. Is bigger better? Not necessarily. As more and more data is thrown into personality
assessments, extra variables are added that are unrelated to the job (Corcodilos, 2014). This calls
into question the relevance of the specific variables that are being screened (Walker, 2012). The
example of Leprino Foods in Denver, Colorado, illustrates this. Employees and job applicants of this
company received a settlement of $550,000 after the government found that the tests the company
used included variables that were completely unrelated to the job.
Do these disadvantages outweigh the aforementioned advantages? Machine fallacies can
actually be reduced by including human input, and the issue of generalisability can be avoided by
simply extending the number of sources. Too much data is avoidable as well, as limits can be set on
the amount of data used. Moving away from known paradigms and tests, as well as finding
previously unknown patterns, outweighs the disadvantages: characteristics thought to be job related
turn out to be redundant or irrelevant, leading to the hiring of people initially thought unqualified.
More relevant data is a paradox, though, as more data can become a big disadvantage as well.
The most relevant and impactful disadvantage involved with using big data in recruitment,
for this thesis, is the risk of statistical errors and how they can increase compared to a hiring process
that does not implement big data. As the amount of big data increases, extra but irrelevant variables
are added that increase the risk of “false discoveries”, or type I errors (Shah et al., 2015). Examples
of statistical errors are false negatives, when potentially good employees are rejected, and false
positives, when candidates are wrongfully hired. These potential errors will be further discussed in
the next section.
5 Analysis & Results
This section discusses the analysis of the literature and the results. Existing models, as well as new
frameworks and models will be presented to gain insight into the implementation of big data in
recruitment. As such, it will seek to answer if big data have a positive or negative influence on
business processes, especially recruitment. It will conclude with a final theory and model, with
recommendations for successfully implementing big data in the hiring process.
5.1 Frameworks and Models
Several conceptual frameworks have been derived out of the literature study. These frameworks
already existed or were created on the basis of the theories in the literature. Once combined, these
models lead to a guide on how to successfully implement big data in recruitment.
5.1.1 Big Data Model
The first model, based on the findings in the literature, outlines the big data process. In the first
stage, there are several raw, unconnected datasets. These datasets contain, for example, social media
posts, business transactions or mobile phone activities. As the literature shows, the essence of big
data is that eventually these datasets are cross-referenced (Williamson, 2015; Kitchin, 2014; Boyd &
Crawford, 2012). Then, in the analysis stage, algorithms are applied to search for patterns and
correlations. Finally, the results lead to new insights. However, these results are not always equally
valuable as correlations do not imply causality. This is further explained in section 5.2.
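The stages of this model can be sketched with a toy example. All user ids and values below are hypothetical; the point is only to show raw, unconnected datasets being cross-referenced and then analysed for a pattern:

```python
# Stage 1: raw, unconnected datasets keyed by a (hypothetical) user id.
social_posts = {"u1": 120, "u2": 15, "u3": 80}    # posts per month
transactions = {"u1": 34, "u2": 4, "u3": 22}      # purchases per month
phone_usage = {"u1": 300, "u2": 60, "u3": 210}    # call minutes per month

# Stage 2: cross-reference the datasets into one combined record per user.
users = social_posts.keys() & transactions.keys() & phone_usage.keys()
combined = {
    u: {"posts": social_posts[u],
        "purchases": transactions[u],
        "minutes": phone_usage[u]}
    for u in users
}

# Stage 3: analysis. A trivial pattern check stands in for the
# correlation-mining algorithms the literature describes.
def most_active(records):
    """Return the user with the highest combined activity score."""
    return max(records, key=lambda u: sum(records[u].values()))

# Stage 4: the result is a new "insight" (which, as noted in the text,
# need not be causally meaningful).
print(most_active(combined))
```

The essential step is stage 2: individually harmless datasets become big data, in the sense used here, only once they are linked on a shared key.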
5.1.2 Privacy Model
In the literature there is a lot of concern for privacy issues. Thus, several suggestions for the
protection of data and correct analytics have been presented. These are now conceptualised into a
framework of privacy, as shown in Figure 7.
Figure 7: Conceptualisation of privacy
The requirements for big data privacy are the following: data & IT security, accountability, and
legislation. As Williamson (2015) points out, inappropriate surveillance and data leaks can occur and
this calls for security. Boyd & Crawford (2012) find accountability to be a multi-directional
relationship between businesses and users: we need to have realistic expectations of our privacy
(Schadt, 2012) and businesses need to get consent from the users to use their data. Finally, legislation
needs to be put in place so that all actors involved are conforming to standards and general rules.
Examples are mechanisms for data protection, access control and SLAs (George et al., 2014).
5.1.3 Big Data & Privacy Model
The above models for big data and privacy can also be combined to form a new conceptual
framework. This is shown in Figure 8. First, data & IT security must be put in place so that the raw,
unconnected datasets are protected. When these datasets are cross-referenced, and become big data,
businesses need to have the consent of the owners of the data, and users must have an educated
perspective on what their data are being used for. Then, in the analysis step, mechanisms and SLAs
need to exist so that the analysis conforms to the rules and no datasets are used beyond their
intended purpose.
Figure 8: Conceptualisation of big data & privacy
5.1.4 Recruitment
The model in Figure 9 (Kwok et al., 2011) is used as a basis for the recruitment process and is then
combined with big data to form a new framework in section 5.1.5. A hiring or recruitment model
by Kwok et al. (2011) for students was used as a reference. As Rutledge et al. (2008) explain, hiring
consists of screening and selection. In the model, these correspond to recruiters’ expectations,
combined with student characteristics, and the recruiting-selection process, respectively.
5.1.5 Recruitment & Big Data
The central theme of this thesis is big data in combination with recruitment, thus a framework is
needed to reflect the compatibility of both. This is shown in Figure 10.
Figure 10: Conceptualisation of big data recruitment
Big data can add value to the screening and selection steps of recruitment. The recruiter’s
expectations make up the screening process, where the initial applicant pool is filtered. An
applicant’s characteristics and their relevant datasets, which contain social media posts and other
activities, determine the initial requirements that recruiters look for. These datasets are
protected by data & IT security, in accordance with the privacy model. During the selection process,
additional big data can be added to narrow down the search. These data are protected by accountability.
The selection process then turns into analysis, protected by mechanisms and SLAs. The results of
the analysis lead to new insights, and these insights are used to determine hiring decisions.
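As an illustration of the screening and selection steps in this framework, the following hypothetical sketch filters an applicant pool on hard requirements and then ranks the remainder by an additional data-derived score. All fields, names, and values are invented:

```python
# Hypothetical applicant records: screening attributes plus a score
# derived from additional (big) data sources.
applicants = [
    {"name": "A", "years_experience": 5, "degree": True,  "data_score": 0.7},
    {"name": "B", "years_experience": 1, "degree": False, "data_score": 0.9},
    {"name": "C", "years_experience": 3, "degree": True,  "data_score": 0.4},
]

def screen(pool, min_experience=2, require_degree=True):
    """Screening: eliminate applicants who do not meet the requirements."""
    return [a for a in pool
            if a["years_experience"] >= min_experience
            and (a["degree"] or not require_degree)]

def select(pool):
    """Selection: rank the screened pool by the additional data score."""
    return sorted(pool, key=lambda a: a["data_score"], reverse=True)

shortlist = select(screen(applicants))
print([a["name"] for a in shortlist])
```

Note that applicant B, who has the highest data score, never reaches the selection step: in this framework the big data only refine a pool that screening has already narrowed.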
5.1.6 SMART
Parts of the theories and explanations found in the literature show similarities with the SMART
model of Marr (2015), shown in Figure 11. His model starts with a good data strategy, with the goal
of deciding the business’ strategic objectives and related business questions to achieve these
objectives. This helps decide what data should be collected and effectively search for these data.
Then, it is important to measure metrics and collect the data required to answer the questions. Data
comes in different shapes: structured, unstructured (big data), internal, and external. His
recommendation is to first investigate what is available internally, then search for external
information. The next steps are applying analytics, to turn data into relevant insights, and to report
the results to decision-makers so they can be understood. Finally, this will lead to a transformation
of your business and various ways to improve processes, services, and products. This final step is
where the value of big data is revealed.
Figure 11: SMART model (Marr, 2015)
5.2 Statistical Errors
As mentioned in the previous section, one of the disadvantages that big data brings to recruitment is
the risk of statistical errors. The main issue is the growing tendency to base conclusions on
correlations rather than causality (Mayer-Schönberger & Cukier, 2013; Corcodilos, 2014).
Correlation quantifies a statistical relationship between variables, where a strong correlation
indicates the likelihood of variable A changing when variable B does, or vice versa. Even though
correlations are useful for predictions, they are not inherently meaningful (Kitchin, 2014), nor do
they necessarily indicate a causal relationship. Increasing the amount of data does not help either,
as more data does not mean more statistically significant correlations: with a large enough dataset,
correlations can be found among almost all variables (Brooks, 2013; PWC, 2014). This ‘correlation
thinking’ is
reflected in the growing amount of research that is merely interested in finding patterns without a
clear definition and direction of research (PWC, 2014). Correlations might provide an indication of
a relationship between variables, but they do not answer the question of ‘why’: why are two events
related (Barnes, 2013)? This requires a causal explanatory framework that acknowledges the
inherent powers and flaws of a phenomenon and the underlying causal mechanisms (Barnes, 2013).
Consequently, the focus on correlation instead of causality influences profiling and can lead
to two types of errors:
• An individual is suited for a job, but profiling indicates otherwise. In other words, someone is
wrongfully rejected. This is defined as a type I error.
• An individual is not suited for a job, but profiling indicates that he or she is. In other words,
someone is wrongfully hired. This is defined as a type II error.
The type I and type II errors are intertwined: to decrease the chances of wrongly rejecting
potentially successful employees, the profiling criteria have to be made less strict. This, however,
increases the chances of wrongly hiring someone who is not suited. Setting the bar higher for
profiling decreases those chances, but at the same time increases the chances of wrongfully
rejecting potentially successful employees. How strict the profiling should be therefore depends on
which type of error one prefers to minimise or even avoid.
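This trade-off can be made concrete with a toy calculation, using the terminology of this thesis (type I: wrongful rejection; type II: wrongful hire). All scores and thresholds below are invented for illustration:

```python
# Hypothetical profiling scores: higher means the profile looks better.
suited = [0.9, 0.8, 0.55, 0.45]      # candidates who are in fact suited
unsuited = [0.6, 0.5, 0.3, 0.2]      # candidates who are in fact not suited

def error_rates(threshold):
    """Hire everyone scoring at or above the threshold; count both errors."""
    type_1 = sum(s < threshold for s in suited)      # suited but rejected
    type_2 = sum(s >= threshold for s in unsuited)   # unsuited but hired
    return type_1, type_2

strict = error_rates(0.7)   # a high bar for profiling
lenient = error_rates(0.4)  # a low bar for profiling
print(f"strict:  {strict[0]} wrongful rejections, {strict[1]} wrongful hires")
print(f"lenient: {lenient[0]} wrongful rejections, {lenient[1]} wrongful hires")
```

Moving the threshold converts one kind of error into the other; no threshold eliminates both, because the score distributions of suited and unsuited candidates overlap.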
Besides their obvious downsides, both types of error can also have upsides. With a type I error,
when someone is wrongfully rejected, an upside can be noted if a company finds out afterwards
that the applicant was in fact not suited. Downsides of a type I error occur when an individual is
rejected for a loan based on his or her race even though he or she meets the requirements for the
loan, or when a potentially talented employee deploys his or her skills at another company. With a
type II error, when someone is wrongfully hired, an upside can be noted when the decision to hire
someone is based on data that are not representative of the person’s actual capacities and this turns
out positively. On the other hand, downsides occur when, for example, a mentally unstable pilot is
hired, as illustrated by the case of the Germanwings crash (Hepher, 2015).
The influence that big data can have on profiling is shown in the following model:
5.3 Final Theory
The purpose of this thesis is to find out if businesses are increasingly relying on external knowledge,
such as big data, and how this affects business processes such as recruitment. The knowledge gained
in the frame of reference and theoretical study, as well as the theoretical frameworks will now be
used to answer this question. Once combined they lead to new insights. In section 2.3, several
questions were identified for guiding the process of researching the literature. For final analysis, not
all questions are equally relevant. Thus, the answers to the most relevant questions are now
summarised in the following tables, sorted by theme, with the corresponding author. To conclude, a
final theory complete with a new conceptual framework will provide the requirements for successful
implementation of big data in recruitment.
5.3.1 Big Data
• How does the text communicate a theory?
• How does it explain the process of collection?
• What type of privacy issues and critical points does it communicate?
Author(s) | Theory | Collection process | Privacy issues | Criticism
Williamson, 2015; Kitchin, 2014; Boyd & Crawford, 2012 | Interconnecting once disparate datasets, cross-reference | X | X | X
Hoy, 2014; Kitchin, 2014; McAfee & Brynjolfsson, 2012 | Volume, velocity, variety | X | X | X
George et al., 2014 | X | Data exhaust, self-quantification data | X | X
Williamson, 2015 | X | X | Inappropriate surveillance, data leaks | X
Schadt, 2012 | X | X | Privacy perceptions | X
Boyd & Crawford, 2012 | X | X | Accountability, multi-directional | X
George et al., 2014 | X | X | Mechanisms for data protection, access control | X
Brooks, 2013 | X | X | X | Excessive amount of data
Mayer-Schönberger & Cukier, 2013