Bachelor Thesis Information Sciences:
Under The Influence - Big Data and its influence on recruitment
Tim Lorent (10452141)
Bachelor Information Sciences
FNWI (Faculteit der Natuurwetenschappen, Wiskunde en Informatica)
Under supervision of L. Stolwijk
Universiteit van Amsterdam
26 August 2015
Abstract
Today we generate increasingly large amounts of data with every interaction with our mobile phones, our social media
activities, and business transactions. In the year 2012 alone, around 2.5 billion gigabytes of data were produced. This
phenomenon, of huge quantities of data that require complex and smart algorithms to process and analyse them, is
called ‘big data’. With the advent of big data, the number of variables that can be used for profiling, or people
analytics, has increased. As such, businesses are eager to apply this to processes such as recruitment or hiring. By
conducting a qualitative literature study, this thesis investigates whether businesses are increasingly relying on external sources
of knowledge and how this influences recruitment. Consequently, it seeks to find out what can be lost and gained and
what recommendations can be made. It first provides a theoretical background on big data and its applications, privacy
issues, and criticism. Then, it will outline how external knowledge impacts business processes, including the relevant
advantages and disadvantages. Although the presented disadvantages to using big data in recruitment can be seen as
valid, the exciting new possibilities cannot be ignored. The thesis concludes with a theory and conceptual framework
that provides a guideline for businesses to successfully and safely implement big data in the recruitment process.
Contents
1 Introduction 4
1.1 Background 4
1.2 Problem definition 5
1.3 Research questions 5
2 Research methodology 6
2.1 Research approach and process 6
2.2 Literature research 6
2.3 Selection of data 8
2.4 Methodology evaluation 8
3 Frame of Reference 9
3.1 What is Big Data? 9
3.2 Big Data Collection 10
3.3 Big Data Opportunities 11
3.4 Privacy Issues 13
3.5 Criticism & Obstacles 15
4 Theoretical study 17
4.1 Business Processes 17
4.2 External Knowledge 18
4.3 The Hiring Process 18
4.4 Data Profiling in Recruitment 19
4.5 Advantages 20
4.6 Disadvantages 21
5 Analysis & Results 22
5.1 Frameworks and Models 22
5.2 Statistical Errors 26
5.3 Final Theory 28
6 Conclusion 31
6.1 Recruitment & Big Data? 31
6.2 Discussion 31
6.3 Future Research 32
1 Introduction
The introduction will provide a short overview of the concept of big data. Although it will be
extensively discussed in the theoretical framework, some knowledge is required to discuss the
problems and purpose of this thesis.
1.1 Background
Today, we generate increasingly large amounts of data with every interaction with our mobile
phones, our social media activities, and business transactions (George et al., 2014). In the year 2012
alone, around 2.5 billion gigabytes of data were produced (Wall, 2014). This is only expected to
increase significantly every year, as the number of smartphone users continues to grow (Wamba et al.,
2015). The phenomenon of huge quantities of data that require complex and smart algorithms to
process and analyse them is called ‘big data’. These data are commonly recognised as big when they
meet the following requirements: volume, velocity, and variety (McAfee & Brynjolfsson, 2012).
Volume because the amount is growing at an exceptional pace, velocity because new technologies
have made it possible to increase the speed at which these data are created (Welling, 2014), and
variety because these data are created in many shapes and come from various sources (Hoy, 2014).
Big data has inherent privacy issues. For example, highly sensitive information could leak
(Williamson, 2015). Data triangulation, combining data from multiple sources, allows anyone to
make inferences about people (George et al., 2014). Consequently, some advocate for agreements,
laws, legislations and policies to be put in place to safeguard our data. These include mechanisms for
data protection (George et al., 2014), consent forms (Boyd & Crawford, 2012), and educating the
public to make them ‘data savvy’ (Schadt, 2012).
These data are either consciously generated, which is called “self-quantification data”, or
unconsciously, which is called “data exhaust” (George et al., 2014). Afterwards, big data are being
applied to improve processes in many fields. Examples include improving airline ETAs, predicting
crimes, natural disasters and customer behaviour, improving recommendation systems on online
services such as Netflix, and changing recruitment and hiring practices. For example, several data
mining companies are using big data to find factors that determine job retention, testing skills
related to job performance, and searching for qualified programmers on the internet.
With the advent of big data, the number of variables that can be used for profiling, or people
analytics, has increased. As such, companies are eager to apply predictive analytics not only to their
business processes, but also their hiring practices (Peck, 2013). This noticeable increase in applying
data to people analytics (Bersin, 2015) has transformed how people are hired, fired and promoted
(Ibarra, 2013). This thesis further researches the possibilities of this new field.
1.2 Problem definition
Besides the internal knowledge that businesses have access to, the potential of the world of big data
has opened up new possibilities. As mentioned, big data are being implemented by businesses in
their recruitment and hiring practices. The essence of it is that people are profiled on the basis of
big data. These data are acquired via online personality tests, video games, or algorithms that detect
talented employees (Huhman, 2014; Economist, 2013; Peck, 2013). However, this noticeable
increase in applying big data to people analytics (Bersin, 2015) might lead to decisions on candidates’
applications being based solely on results generated by algorithms that have analysed big data.
Research has already shown that this leads to discrimination (Gangadharan, 2014; Barocas & Selbst,
2014), but it could lead to other problems as well. The field of Human Resource Management and
recruitment, as the name already suggests, needs a human element. As will be discussed in this
thesis, machines do not yet have the capacity to think like humans do, and although they bring a
new dynamic to the hiring table, to what extent can they be applied? In other words, might big data
be replacing well-known paradigms and knowledge based on experience, i.e. “tacit knowledge”, too
soon?
1.3 Research questions
As big data can create new possibilities in business processes, this thesis will explore whether businesses are
increasingly relying on external sources of knowledge, namely big data, and, if so, how this influences
business processes such as recruitment or hiring. Furthermore, what can be gained and lost by
applying big data to recruitment or hiring? What recommendations can be made to ensure the
successful and safe implementation of big data? The primary focus is on the influence that big data
can have on business processes, specifically hiring or recruitment, and whether or not these processes
become more effective when big data is applied. This research can help companies determine the
degree of external data that is used and help them find a suitable balance between internal and
external sources of knowledge to improve their processes.
2 Research methodology
This section outlines the research methodology. It describes the research approach and process, what
methods were used to search for literature and what the selection process was for including the
literature in the thesis. It concludes with an evaluation of the used methodology.
2.1 Research approach and process
For this thesis, a qualitative research approach was used. The focus is on answering ‘why’ and ‘how’
questions, specifically how the business process of hiring or recruitment is influenced by big data.
This should result in a theory based on existing and new models. It is an explanatory study, finding
an explanation for descriptive information (Gray, 2014). Qualitative research contributes to theory
building, as new theoretical insights are gained by relying on multiple perspectives of various
resources of a phenomenon and comparing these (Doz, 2011; Gray, 2014). It is important to take into
consideration the multidimensionality of available resources and provide contradicting views.
Furthermore, the approach was inductive, conceptualising theoretical perspectives after literature
research (Gray, 2014). This approach was best suited for this thesis, as it constructs a generalised,
relational view of a phenomenon. As will be further explained in this section, inductive reasoning
begins with planning data collection and then analyses this data to detect patterns and relationships
between theories (Gray, 2014). Finding an answer to the question of how big data influences
recruitment, without pre-existing theories or hypotheses, means that a diverse amount of theoretical
or literary research must be conducted, after which the findings from this research can be combined
and compared to gain new insights. These insights also come from existing frameworks and models,
as well as new models that are the result of a combination of theories in the literature and existing
models. The models, frameworks and theories found and generated in this thesis have eventually led
to a theory on the implementation of big data in recruitment, answering the subquestions.
2.2 Literature research
The thesis is divided into several sections or themes, all with the overarching concept of big data.
Each section explains different parts associated with big data that are relevant for this thesis. This
division aided the literature research, as it provided guidance for answering the related research
questions. The literature that was searched for was categorised accordingly. The first theme is big
data, with the sub-themes of collection, application, privacy issues and criticism. The second theme
is business processes, which is further divided into types of processes and reliance on external
information. A variation in business processes is needed to illustrate where big data can have an
influence. The third theme, used to go into further detail on the influence of big data, is the main
frame of reference. This is recruitment and data profiling, divided into sub-themes of advantages,
disadvantages and possible situations that can emerge once big data is applied.
Out of the literature research came an extensive body of work on big data itself. The
increased interest and wider implementation of it in businesses and organisations provided a large
amount of available research and literature. Consequently, the concept of big data could be
explained in great detail, as well as illustrating it by using the myriad of applications available in
various fields such as healthcare and crime. However, the literature does show that there is no
consensus regarding the meaning of big data. Combining all the different views did result in a
rationale relevant for the focus of this thesis. Although big data can help accelerate processes in
various fields, a remarkable amount of literature was found that focused on the negative side-effects
and inherent issues. As this thesis set out to answer what is lost and gained by using big data, the
literature consequently offered more insight into what can be lost than into what can be gained.
A variety of academic as well as magazine articles was available for the second theme,
business processes. As the process of recruitment or hiring is the main frame of reference, this
section also provides a short overview of the history of it and the inherent issues. Although this
theme is partly used to diversify the processes that businesses have in order to illustrate the impact
big data can have, the main focus is the reliance on external knowledge. The literature shows that
even before big data, businesses relied on other sources of knowledge besides the internal knowledge
they have of their processes, services, products and clients. Almost all literature shows the
importance and relevancy of external knowledge for business innovation and competitiveness.
However, several authors do emphasise that this should not over-shadow internal knowledge.
The third and final theme provided more articles on the web than in academic journals.
Perhaps this is due to the fact that the implementation of big data in recruitment is not yet widely
used and still relatively new. Again, more disadvantages than advantages to using big data in hiring
practices were found in the literature. There is a general tendency to believe that businesses do not
yet possess the necessary capabilities required for competent data analysis. As such, these limitations
and disadvantages can be used by businesses as guidelines to learn from their mistakes.
Although the main focus was on the impact that big data can have on business processes, a
comprehensive and detailed explanation of the concept itself provided the necessary insight into the
capabilities of big data. How big data are collected and used, as well as the privacy issues and
criticisms, are all linked to its influence on business processes and especially recruitment. This
provides the necessary background for the researcher and helped to comprehend the ‘bigger
picture’. The remainder of the literature, concentrating on external knowledge and data profiling,
has significantly helped answer the research questions. Furthermore, the body of work found on
people analytics and data profiling has shaped the sub-questions, reflecting on the influence on
hiring processes and what is consequently gained or lost.
2.3 Selection of data
After initial discovery, part of the literature was discarded from further use in the thesis, as it
merely provided more of the same answers or confirmed existing theories. The initial qualifications or
requirements for the inclusion of literature were that they had to contribute to answering the
research questions, provide new insights, and if possible provide insights that oppose the principal
views. Thus, the aim of the literature was not to provide a singular vision of every theme, instead to
provide a comprehensive and multi-faceted description that was sometimes contradictory. The
literature that was needed for each theme was guided by the following questions:
• Big Data: How does the text communicate a theory? How does it view the concept of big data?
How does it explain the process of collection? How does it illustrate the use of big data? What
type of privacy issues and critical points does it communicate?
• Business processes: How does the text communicate a theory? Which business process does
it present in combination with big data? How does it view the use of external knowledge in
business processes? How does it explain hiring? How does it view hiring problems?
• Recruitment: How does the text communicate a theory? How does it explain recruitment and
data profiling? How does it provide examples of big data in recruitment? How does it view
recruitment augmented by big data?
The answers to these questions were searched for in the literature and highlighted. Each theme has
its own literature, and each article or book was summarised after reading, on the basis of the
answers to the questions. Then, these summaries were used to provide the theory behind each
theme and sub-theme.
2.4 Methodology evaluation
The used methodology proved to be a fit for this thesis. The qualitative, inductive approach was
beneficial for theory building and for researching literature to gain multiple perspectives of the
phenomenon of big data in recruitment. There was no theory or hypothesis on which this research
was based, so it was necessary to first collect data and then find patterns and explanations. Dividing
the research into themes and sub-themes proved helpful for guiding the research process, as well as
using critical questions to evaluate the literature.
3 Frame of Reference
This section of the thesis discusses the extensive literature found on the subject of big data. It
provides a theoretical background of big data by explaining the concept and who gathers and
generates it. Furthermore, it will detail the opportunities of big data in several fields, as well as
discussing the debates on related privacy issues and finally, criticism of its relevancy and use.
3.1 What is Big Data?
Imagine walking around in the city when you get a phone call from a friend. You have a short
conversation, after which you agree to meet him at a restaurant nearby. But first you need to go to
an ATM for cash. You complete the transaction, retrieve your money and proceed to walk to your
destination. On the way over to the restaurant, you see a moment that has to be captured by your
camera. After taking a photo and posting it on Instagram, you send it to your friends via WhatsApp.
Eventually you arrive at the restaurant and you check-in on Facebook, notifying your friends of your
whereabouts. You may not realise this, but in this short amount of time you have generated a wealth
of data - ‘big data’. These data were generated from multiple sources: your mobile transactions and
activities, your user-generated content on platforms such as social media, and the content you
generated through business transactions (George et al., 2014). You are not the only actor in this
process though. The world is awash with more information than ever before and the scale of our
databases continues to grow at a rapid pace (Mayer-Schönberger & Cukier, 2013): approximately
2.5 billion gigabytes of data were generated in 2012 (Wall, 2014). Facebook must have a
considerable share in this, for it has 1.1 billion active users (Fairfield & Shtein, 2014) and processes
2.5 billion pieces of content a day (Kitchin, 2014). Around 2010-2011, about 4 billion
mobile-phone users were identified, of which 12% used smartphones (Wamba et al., 2015). Furthermore,
the global amount of digital information is expected to grow 45% every year, to approximately 8 trillion
gigabytes in 2015 (Fanning & Grant, 2014).
So what exactly are these big data that everyone is generating and talking about? The
literature shows that there is not one clear, generally accepted definition of the concept. Although
most, initially, define it as large or ‘massive’ datasets (Hoy, 2014; Lewis et al., 2013; Boyd &
Crawford, 2012; Chen et al., 2012; Crumbly & Church, 2013; Mahrt & Scharkow, 2013; Manyika
et al., 2011), there is more at play than just the sheer quantity or size. Williamson (2015) and
Kitchin (2014) argue it is about the interconnectedness between once disparate datasets, as do
Boyd & Crawford (2012), who believe the value lies not in the quantity of information but in the
capacity to cross-reference the datasets. What other characteristics make big data ‘big’? The most
common characteristics found in the literature are the three V’s (Hoy, 2014; Kitchin, 2014; McAfee
& Brynjolfsson, 2012): volume, velocity, and variety. Volume, because of the amount of created
data. For example, Walmart, the grocery behemoth, generated 2.5 petabytes of data (a petabyte is
one million gigabytes) relating to more than 1 million customer transactions every hour in 2012
(Kitchin, 2014). Velocity refers to the speed of creation, as big data shows similarities to Moore’s
law: its volume doubles roughly every two years (Welling, 2014). Variety, because the data come in
many shapes, such as text messages, photos, and GPS signals, and are created on multiple platforms and from various sources (Hoy,
2014). Other researchers and authors have added their own characteristics as well. Williamson
(2015), for example, argues that big data should be valid, veracious or authentic, valuable and
visible. Kitchin (2014) states that it should also be exhaustive in scope, flexible and scaleable, as well
as relational. The scope should strive to capture entire populations or systems, it should be flexible
enough to be easily extended and scaled, and finally the nature of it should be relational by
containing common fields that enable various data sets to be connected. This is ultimately what big
data is all about: combining or inter-connecting large, once disparate datasets to gain new insights.
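The cross-referencing of once disparate datasets described above can be sketched in a few lines of code. The records and the shared customer_id field below are invented for illustration; the point is only that a common field lets two individually unremarkable datasets be merged into a richer profile.

```python
# Minimal sketch of the "relational" property of big data: two hypothetical
# datasets joined on a common field yield information neither holds alone.
# All records here are invented for illustration.

purchases = [
    {"customer_id": 1, "item": "running shoes"},
    {"customer_id": 2, "item": "coffee maker"},
]
locations = [
    {"customer_id": 1, "city": "Amsterdam"},
    {"customer_id": 2, "city": "Utrecht"},
]

def cross_reference(left, right, key):
    """Join two record lists on a shared key, merging their fields."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

profiles = cross_reference(purchases, locations, "customer_id")
# Each profile now combines purchase and location data, e.g.
# {"customer_id": 1, "item": "running shoes", "city": "Amsterdam"}
```

The same join-on-a-common-field mechanism underlies the data triangulation discussed later in the privacy context.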
3.2 Big Data Collection
Mankind was not always capable of collecting huge amounts of data. The invention of modern
writing made it possible to record data for future reference, albeit slowly and arduously (Hoy, 2014).
Computers made the process easier and businesses started storing data in relational databases
(Fanning & Grant, 2014) and governments captured statistics of populations to generate national
censuses (Kitchin, 2014). In the early 2000s, the internet offered unique data collection, analytical
research and development opportunities for various sectors (Chen et al., 2012). Gone were the days
of large-scale mailing of catalogs or using phone directories to target customers (Fanning & Grant,
2014). Instead, retailers and other industries could interact with their customers directly through
IP-specific user search, using interaction and server logs and cookies (Chen et al., 2012). Customers
could basically be tracked by looking at what they buy, what they look at, how they navigate through
the website, and if they are influenced by promotions and reviews (McAfee & Brynjolfsson, 2012).
This is exemplified by the Chinese online shopping service Alibaba, which tracks and collects the huge
amounts of consumer data that are produced on its website every day (Palmer, 2015). Or Target,
an American store that collects data on each customer by linking their daily transactions and
communications with the company to a guest ID which includes demographic information (Duhigg,
2012). Lastly, big data is also collected by companies trying to improve their hiring practices and
decisions. They either search these data themselves by scanning online behaviour or buying them
from data mining companies selling publicly available data (Barocas & Selbst, 2014; Preston, 2015).
Today, the omnipresence of computing and mobile devices has created a new data era.
Several technological advances have helped speed up the process of generating these billions of
gigabytes of data. Examples are cheaper storage and improved computer processing power
(Williamson, 2015) that have made computational techniques for large-scale data analysis possible
(Lewis et al., 2013), as well as widespread internet (Kitchin, 2014), and the global expansion of
mobile phones that has made it possible for us to generate data on-the-spot.
Big data are generated from a variety of sources: location information through digital
CCTV, retail purchases, electronic communications such as social media postings, ‘clickstream’ data
that track an individual’s navigation through a website or app, measurements from sensors such as
RFID (Crumbly & Church, 2013; Kitchin, 2014). According to George et al. (2014), these sources
of high volume data can be sub-divided into two categories: “data exhaust” and “self-quantification
data”. The former refers to ambient data, such as mobile phone activity, that are passively collected
and only become useful once combined with other sources. This data exhaust contains information
that is normally not visible and only revealed when an individual interacts with information
technologies. Gaining access to this information is referred to as “reality mining”: processing large
quantities of data from mobile devices to predict human behaviour (Mayer-Schönberger & Cukier,
2013). The latter refers to data that are actively revealed by an individual through the use of sensors
that monitor, for example, exercise or movement. Kitchin’s (2013) categories - directed, automated,
and volunteered - bear a resemblance to the categories of George et al. (2014). Directed data are
generated by surveillance, such as CCTV, and automated data are generated as an inherent,
automatic function of a device, such as a mobile phone that records its user’s activity. Similar to
self-quantification data, volunteered data are generated or ‘gifted’ by the user themselves.
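As a toy illustration of the ‘reality mining’ idea above, the following sketch aggregates a hypothetical log of data exhaust (the hours at which one phone registered activity) to reveal a pattern that no single record shows. All timestamps and the inference drawn from them are invented.

```python
# Toy "reality mining" sketch: ambient phone-activity records ("data
# exhaust") only become informative once aggregated. Invented data.

from collections import Counter

# Hypothetical log of hours at which one user's phone registered activity.
activity_hours = [8, 8, 9, 12, 13, 13, 18, 18, 18, 22]

events_per_hour = Counter(activity_hours)
peak_hour, peak_count = events_per_hour.most_common(1)[0]
# peak_hour is 18: the aggregate reveals a recurring evening routine that
# no individual record exposes on its own.
```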
3.3 Big Data Opportunities
Many actors, consumers and businesses alike, have been producing large datasets for a long time
(Kitchin, 2014). However, they are now able to more effectively capture and analyse these large
datasets than before (Hoy, 2014). They are finding ways to implement these datasets to create value
for individuals, their business, communities and governments (George et al., 2014). For example, big
data can play a significant economic role to the benefit of national economies and their citizens.
The US could cut its healthcare expenditure by 8% and European government administrations
could save approximately €100 million in operational efficiency improvements (Manyika et al.,
2011). Another example is improving airline ETAs (McAfee & Brynjolfsson, 2012), which shows
how a combination of internal and external information is used to improve operational efficiency. A
major American airline used existing ETAs and hired a data mining company to collect publicly
available data on factors that influence arrival times, such as weather. They then combined this
external data with the company’s stored proprietary data. This resulted in multidimensional
information, allowing for sophisticated analyses and pattern matching, which effectively reduced the
gaps between estimated and actual arrival times. Other examples include analysing workplaces for
behaviour and efficiency (George et al., 2014), predicting crimes or pharmacy stock (Williamson,
2015), stopping terrorist attacks or flu outbreaks (Harris, 2014), and predicting natural disasters such as hurricanes
(Fanning & Grant, 2014).
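The airline example above, combining internal ETAs with bought-in external weather data, can be sketched as follows. The delay factors and the simple multiplicative model are invented assumptions for illustration, not the airline's actual method.

```python
# Hedged sketch of the airline-ETA example: a proprietary (internal)
# estimate adjusted by an external weather factor. All numbers invented.

scheduled_eta_min = 130          # internal data: flight plan estimate in minutes
weather_delay_factor = {         # external data: bought-in weather conditions
    "clear": 1.00,
    "rain": 1.05,
    "storm": 1.25,
}

def refined_eta(scheduled, conditions):
    """Adjust the internal estimate using the external weather factor."""
    return scheduled * weather_delay_factor[conditions]

print(refined_eta(scheduled_eta_min, "storm"))  # 162.5 minutes
```

The design point is the one made in the text: neither dataset alone suffices, but combining internal and external sources narrows the gap between estimated and actual arrival times.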
Figure 1: Benefits of successful big data implementations (Fanning et al., 2013)
Specific industries have benefited from and are widely implementing big data. Bloomberg
logs employee activities such as keystrokes and Harrah’s, a Las Vegas casino, tracks the smiles of its
employees, because its analytics team has quantified the impact that a smile can have on customer
satisfaction rates (Peck, 2013). Sears, a $36 billion retailer with more than 230,000 employees, hires
approximately 150,000 sales representatives a year and bases the hiring decisions on online
simulation tests and existing information on the competencies of current employees (Rifkin, 2014).
Target, as previously mentioned, gathers vast amounts of customer data. These data are then used by a
predictive analytics team to understand customers’ habits in order to market to them more effectively
(Duhigg, 2012). Purchases are analysed, for example which items are bought in combination with
each other, and each customer is assigned a score that predicts how likely they are to buy a given product.
The moment an individual’s or family’s shopping habits alter or become more flexible, and of
course more vulnerable to marketing intervention, Target starts a marketing scheme to nudge their
customers into new spending habits. Other examples include Google, which significantly updated its
web search algorithm to allow for more targeted advertising, and Netflix, which has revolutionised the
streaming industry by altering how people consume movies and television series (Hoy, 2014). The
city of Los Angeles uses big data in predictive policing, resulting in a 26%
decrease in burglaries (Welling, 2014). Further implementations are in the fields of biology,
healthcare, and people and business analytics (Chen et al., 2012; Savage, 2014; Bersin, 2015;
Davenport, 2013). For example, cancer research has benefited from big data as new discoveries are
made concerning the growth of cancer cells. Researchers have also developed a learning health
system, CancerLinQ, where patients can receive diagnoses based on findings from big data. These
developments by Google, Netflix and CancerLinQ are described by Hoy (2014) as “disruptive
breakthroughs”, because they could make other systems obsolete in the future.
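The kind of co-purchase scoring described for Target above can be illustrated with a deliberately simplified sketch. The baskets and the score definition (the share of baskets containing item A that also contain item B) are invented for illustration and are not Target's actual model.

```python
# Simplified co-purchase propensity sketch, in the spirit of the Target
# example: how often does item B appear alongside item A? Invented data.

baskets = [
    {"lotion", "vitamins"},
    {"lotion", "vitamins", "cotton balls"},
    {"lotion", "bread"},
    {"bread", "milk"},
]

def co_purchase_score(item_a, item_b, baskets):
    """Share of baskets containing item_a that also contain item_b."""
    with_a = [b for b in baskets if item_a in b]
    if not with_a:
        return 0.0
    return sum(item_b in b for b in with_a) / len(with_a)

score = co_purchase_score("lotion", "vitamins", baskets)
# score is 2/3: two of the three lotion baskets also contain vitamins.
```

A retailer would compute such scores over millions of baskets; here the toy data only shows the shape of the calculation behind "predicting the likelihood of buying a given product".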
Two other cases deserve more detailed descriptions: the US presidential elections of 2012
(Fanning & Grant, 2014; Kitchin, 2013) and the case of Paris Brown (Crumbly & Church, 2013). In
2012, the administration of President Obama used big data for its campaign. They collected this
data from multiple sources - website cookies, data assembled from registration, government and
census datasets as well as social media sites - resulting in a massive, interrelated database about every
voter in the country. It consisted of around 80 variables, relating to the voter’s demographics, social
and economic history, as well as patterns of behaviour and consumption. This large-scale, complex
and detailed data gathering provided Obama’s team with an exhaustive insight into US society. Paris
Brown was appointed Youth Police and Crime Commissioner for Kent, but was soon relieved of her
position after it emerged that she made ‘offensive’ and ‘inappropriate’ tweets that did not live up to
the standards of her office. However, these tweets were made between the ages of 14 and 16, prior
to her job appointment. Her digital footprint left an indelible data trail that led to her prompt
resignation.
3.4 Privacy Issues
Big data can bring a lot of profit for companies, while at the same time improving their operational
efficiency. Target, Sears, Google, Netflix and Facebook are all good examples. However, it also
brings to mind the issue of privacy. As Williamson (2015) argues, big data can be used to provide
powerful new ways to see and interpret the world, but this comes with some inherent ownership,
misuse and privacy issues. Of course, there is the risk of inappropriate surveillance and of leaks of
personal data (Williamson, 2015); as these data are nowadays generated in large amounts, the
chances of privacy infringement only increase. This amount also impacts IT security, as companies and IT
departments have to struggle with an overload of data resulting from diverse data sources, formats
and volumes (Chen et al., 2012). Furthermore, privacy risks are multiplied when data are combined
(Crumbly & Church, 2013). This can lead to imputed identity by inferring an individual’s identity
through data triangulation from multiple sources (George et al., 2014).
As was established, big data are generated from a variety of sources (Crumbly & Church,
2013; Kitchin, 2014). These sources are all capable of collecting highly confidential information,
such as your location, personal beliefs, purchase history and even your phone records. This is
exemplified in the case of the PRISM program, that focused public attention on the nature of mass
market consumer data mining (Fairfield & Shtein, 2014). This program involves the collection and
analysis of foreign and domestic communications from a range of sources, including companies
such as Microsoft, Google and Facebook (Crumbly & Church, 2013). Target and Walmart, with
their vast collection of demographic information, and the Obama administration, with its
large-scale database on voters’ social and economic history and patterns of behaviour and consumption,
all play a part in the increasing distrust of the use of big data. Furthermore, a Harvard research
group conducted a Facebook study with the intention of establishing how people’s interests and
friendships changed over time (Fairfield & Shtein, 2014). They gathered data from 1,000 students
but unfortunately this was done without their consent.
Although some examples above show that government agencies, social media websites and
stores gather our information, we also have a share in the aggregation of all these big data. This
raises questions about epistemology, about how knowledge is constituted (Lewis et al., 2013). As
defined by George et al. (2014) and Kitchin (2013), self-quantification and volunteered data are data
generated by users themselves. Individuals are constantly leaving digital traces (Boyd & Crawford,
2012) and choose to disclose personal information on the internet and in social networks such as
Facebook (Schadt, 2012). This easy sharing of sensitive information loosens our expectations of
what information is actually private (Schadt, 2012). According to Boyd & Crawford (2012),
accountability of big data is a multi-directional relationship. As companies are held accountable for
our information, we are accountable for our own information just as well.
Big data not only raises questions of epistemology, but also of ethics (Lewis et al., 2013): how
will user privacy be protected? Williamson (2015) argues that legislation should keep up with our
increasing use of big data, to prevent mistakes and misuse from happening. The inherent risks of big
data mean that strong governance is required. Boyd & Crawford (2012) suggest that consent from
the public should be obtained and that researchers or companies should not simply consider the use
of big data ethical because the data are publicly available. Thus, mechanisms for data protection and
privacy should be set up, such as anonymous open data, access control, rights management, and
usage control (George et al., 2014). Although Service Level Agreements, or SLAs, already exist, in
which contractors can specify which requirements need to be met, these are not yet sufficiently
applied to big data. However, not all authors agree with the above statements. Schadt (2012) believes
education is key, aiming to prevent discriminatory use of big data by educating the public.
Crumbly & Church (2013) also advocate less regulation as each step of the big data lifecycle -
collection, combination, analysis, and use - is already regulated by frameworks that balance the costs
and benefits of big data. These have been proven effective in cases of Google and Facebook
(Crumbly & Church, 2013). The former wanted to secure and combine multidimensional user data
from its many services, while the latter misled its users concerning the publication of private
information.
Boyd & Crawford (2012) provide an appropriate insight into the paradoxical nature of big
data: it triggers both utopian and dystopian rhetoric. Big data is seen as a powerful tool that can
help alleviate societal problems and gain new insights into diverse areas. However, it might also be
viewed as a threat to our privacy and civil freedoms. The Chinese government is a prime example,
restricting its citizens in their internet use while at the same time collecting their data to create a
modern form of authoritarianism (Palmer, 2015). In some cases, such as with automatically collected
data, we might not even be aware of what data are actually collected. Thus, as Williamson (2015) argues, there
is a strong need to educate people to become ‘data savvy’. After all, computers and machines are not
yet capable of deciding what is sensitive information (Hoy, 2014) - we are.
3.5 Criticism & Obstacles
Several authors in the literature have voiced their critical opinion of big data. According to Kitchin
(2014), the increasing amount of data that are generated on such a large scale has caused a shift in
data analysis. Before, data were scarce, static and clean and were generated and analysed with
specific questions or hypotheses in mind. Now the data resources are abundant, exhaustive, varied
and dynamic, but also messy, uncertain and are analysed without specific issues or questions in
mind. This has brought, according to some (Kitchin, 2014; Mahrt & Scharkow, 2013; Steadman,
2013; Mayer-Schönberger & Cukier, 2013), death to theory and the scientific method. Hypotheses
are no longer generated in advance, as databases are mined using ‘snowball’ techniques and
conclusions are based purely on analysis. Rather than defining theories and hypotheses beforehand,
data are used to justify conclusions after the research.
Figure 2: Big data problems and potential solutions (Fanning et al., 2013)
Furthermore, big data can be deceptive and misleading: working with big data remains
biased and subjective (Boyd & Crawford, 2012), as the data are always examined through a
particular lens that influences interpretation (Kitchin, 2014). However, human interpretation can
prevent the acceptance of meaningless correlations resulting from ‘apophenia’: seeing patterns
where none exist (Hoy, 2014). Patterns found within a dataset are not necessarily meaningful, as
correlations can be
completely random. This was exemplified when researchers found a correlation between changes in
the S&P 500 stock index and butter production in Bangladesh (Boyd & Crawford, 2012).
Mayer-Schönberger & Cukier (2013) emphasise that there is a growing tendency to rely on correlations
rather than causality. Consequently, meaningful interpretations by proficient analysts are much needed as big
data continues to increase in size and impact. As Brooks (2013) mentions, big data has a weak spot
in the social aspect of analysis: humans are good at assigning value to objects by using their
emotions, but computers excel at measuring quantity and have difficulties with explaining context in
social situations. However, by 2018 the US alone will face a shortage of 140,000 to 190,000 people
with deep analytical skills. This is reflected in research focused on challenges related to getting
business value from big data, as Figure 3 shows.
A final point of critique concerns access to and the size of big data. Access to big data is
sometimes limited (Boyd & Crawford, 2012; Kitchin, 2013; Welling, 2014), and bigger or more
data are not necessarily better data. The sample of the dataset must be considered (Boyd &
Crawford, 2012) and more data does not automatically mean statistically significant correlations
(Brooks, 2013), as overfitting might cause less effective predictions and conclusions.
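The point that more variables invite chance findings can be illustrated with a small simulation. The following is a minimal Python sketch; all data are pure random noise, and the sample size and variable count are invented for illustration:

```python
import numpy as np

# With a small sample and many unrelated variables, some variables will
# correlate with an outcome purely by chance (cf. the S&P 500 / butter
# production example). Everything here is random noise.
rng = np.random.default_rng(seed=42)

n_samples = 30        # a small sample of "employees"
n_variables = 1000    # many irrelevant screening variables

outcome = rng.normal(size=n_samples)                  # e.g. job performance
noise = rng.normal(size=(n_variables, n_samples))     # unrelated variables

# Pearson correlation of each random variable with the outcome.
correlations = np.array([np.corrcoef(v, outcome)[0, 1] for v in noise])

strongest = np.abs(correlations).max()
spurious = int((np.abs(correlations) > 0.4).sum())

print(f"Strongest correlation found by chance: {strongest:.2f}")
print(f"Variables with |r| > 0.4 (all meaningless): {spurious}")
```

With 1,000 noise variables and only 30 samples, dozens of "strong" correlations typically appear, none of which generalise beyond this dataset.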
4 Theoretical study
This section discusses the theoretical study on the influence of big data on business processes,
specifically recruitment. It provides a background on business processes and external knowledge,
before diving into the hiring process. Advantages and disadvantages will be presented, as well as
possible statistical errors that can emerge when big data is applied to recruitment.
4.1 Business Processes
Big data has become a vital input for businesses as it can create new forms of economic value
(Mayer-Schönberger & Cukier, 2013). Big data is most commonly applied in decision making,
human resources and enhancing competitive advantage (PWC, 2014; Marr, 2015; Davenport et al.,
2012). The majority of firms in a study by PWC (2014) indicate that they use data and advanced
analytics to optimise a range of variables. Financial firms such as Western Union use large
datasets to decide on the optimal price range that can generate the most customer satisfaction and
shareholder value. Large internet companies such as Google use big data and data analytics for
employee engagement (Davenport et al., 2012). Ensuring highest employee productivity,
engagement and retention are among their top priorities. Barton & Court (2012) emphasise the
implementation of big data for competitive differentiation. Staying competitive means using the
increasing amount of data that are available on customer satisfaction, product innovation and
services. Being able to fully exploit these data requires three mutually supportive capabilities,
resulting in the model shown in Figure 3. First, the capability to identify, combine, and manage a
multitude of sources. Second, the capability to build advanced models for the prediction and
optimisation of outcomes. Third, the managerial capability to transform the organisation in support
of the data and models that can yield better decisions. Clear strategies for the use of and competing
with data are essential, as well as deploying the appropriate technologies. Staying competitive as
more companies learn the core skills of using big data means building the aforementioned
capabilities, as they become a decisive competitive asset (Barton & Court, 2012).
4.2 External Knowledge
As stated in the problem definition, the increasing access to a wide range of data from external
sources has opened up new possibilities for businesses. The various examples in section 3.1.2
illustrate that businesses, organisations and researchers are eager to implement these big data.
However, in the early 2000s there was still a bias towards internal information (Ojala, 2002). There
was no acknowledgement of the value of external data and all relevant information resided
internally. Gallego (2013) and Larrañeta (2012) emphasise the importance of external knowledge for
cooperation, innovation and competitiveness. External knowledge is increasingly important for the
innovation process, as well as the ability to remain competitive by building novelty into products and
operations. It can be expressed in either diversity or novelty (Larrañeta, 2012). Diversity exposes
managers to new perspectives on how to compete and promotes strategic variety, while novelty
refers to the extent of control that businesses have over that knowledge. It can further augment
internal knowledge and lead to the development of multiple competitive approaches and strategic
variety (Larrañeta, 2012). Internal knowledge still remains of value (Ojala, 2002), as it holds a
corporate memory or a sort of repository of successful and failed processes and employees.
Therefore, it is advisable that companies try to balance the development of internal knowledge with
the implementation of and search for external knowledge (Gallego, 2013).
4.3 The Hiring Process
To research the influence that big data can have on recruitment, it is necessary to have some
knowledge of the hiring process. In the late 1980s there was a strong reliance on interviews as a
selection device (Raza & Carpenter, 1987) and this is still common practice (Rutledge et al., 2008),
even though interviews offer little reliability and validity and are subject to biases and
subjective judgments. These biases can include an applicant’s likability, professional and personal
characteristics that match current employees, and demographic background. As for the hiring
process itself, it comprises four steps: recruitment, screening, selection, and job offer (Rutledge et
al., 2008). Screening refers to the filtering of the initial applicant pool by eliminating any individual
who does not meet the requirements. Selection adds data to narrow down the search: objective
production data, personnel data, judgmental data such as recommendations, and job or work
sample data. Other factors that influence recruitment, which recur in the literature, are P-O and P-J
fit (Kwok et al., 2011). The former refers to person-organisation fit, or the compatibility between
people and an organisation, the latter to person-job fit or the relationship between an individual’s
characteristics and those of the offered job. Achieving high levels of P-O fit through hiring is
essential for a workforce to remain flexible in and committed to the organisation (Kristof, 1996). The
model in Figure 5 (Kristof, 1996) illustrates the characteristics and process of P-O fit.
Figure 5: Conceptualisation of P-O fit (Kristof, 1996)
4.4 Data Profiling in Recruitment
Profiling people, based on a certain set of characteristics in order to classify and predict a certain
type of behaviour, has been a common practice for decades. With the advent of big data the
number of variables that can be used for profiling, or people analytics, has increased. As such,
companies are eager to apply predictive analytics not only to their business processes, but also to their
hiring practices (Peck, 2013). This noticeable increase in applying data to people analytics (Bersin,
2015) has transformed how people are hired, fired and promoted (Ibarra, 2013). Before, recruiters
were stuck in their own paradigms and models (Byrne, 2014) and hiring was based on common
tests, such as IQ tests, skills aptitude tests and physical exams (Huhman, 2014). Now, it is based on
algorithms that are trained using large amounts of data and are able to assess the potential of
individuals (Walker, 2012). As mentioned, big data are generated from a multitude of sources such
as social media postings and other electronic communications. These data are not only collected by
companies, but also given by ourselves (Schadt, 2012). We dispense all sorts of personal information
and by doing this leave a digital imprint (Preston, 2015) with our ‘likes’ and ‘dislikes’, political and
religious views, personal photographs and videos. Consequently, this can be used by smart
algorithms that evaluate and measure this information to understand not only behavioural patterns,
but also habits, age, type of friends you have, or if you are a suitable employee for a company.
Big data are now used by various data mining companies to evaluate a candidate, based on a
set of variables, before an interview. Companies like Xerox use big data to measure job retention,
finding correlations between employee engagement and home-to-work distance (Huhman, 2014).
Evolv, in turn, found that people who have ‘job hopped’ in the past do not necessarily quit earlier
than those who have not (Economist, 2013). They have generated such an expansive dataset using
various data-mining techniques that they are able to say with precision which attributes matter most
to the success of retail-sales workers or customer-service personnel at call centres (Peck, 2013). Knack,
a company that develops app-based video games, tests skills related to job performance by collecting
every step the user makes and by analysing these decisions can infer whether or not someone is
creative, persistent and socially intelligent (Peck, 2013). Tech recruiters such as Gild have developed
algorithms to use big data to mine open-source code found on the internet to identify good software
engineers - hire based on merit by tapping from the online talent pool (Peck, 2013).
As Davenport (2013) suggests, by implementing these data and smarter algorithms we have
now reached an era of analytics 3.0. The era of business intelligence constitutes analytics 1.0, where
businesses could only use data sparingly and analysis was time-consuming. The era of big data
constitutes analytics 2.0, when data began to be collected from external sources, and the era of
data-enriched offerings constitutes analytics 3.0, where analytics are used to support customer and
business services or business features and improve products. However, research on nearly 500
organisations has found that only 4% could perform predictive analytics about their workforce and
that only 14% has done any significant statistical analysis of employee data (Bersin, 2013).
4.5 Advantages
Recruitment augmented by big data presents some advantages. For example, humans bring an
inherent bias to selection processes (Huhman, 2014) and big data and computers can remove this.
Furthermore, recruiters have been stuck in their own paradigms (Byrne, 2014): hiring systems are still
based on out-dated practices such as studies of human behaviour and military techniques. Big data
can help reveal other aspects and uncover patterns that were initially not searched for (Dyche, 2012).
For example, data has shown that for certain jobs, such as customer-support, there is no correlation
between people with criminal records and work performance (Economist, 2013). Or that call-center
employees should not be judged on their experience but on their personality or creativity, as this is a
better indication of job retention (Walker, 2012). Finally, the selection step in the hiring process can
be augmented by big data by searching for and including more relevant data, as well as providing
the necessary algorithms for processing these data.
4.6 Disadvantages
Despite the above mentioned advantages, the literature identifies disadvantages as well: fallacies of
the machine, generalisability, excessive data, and statistical errors. Although machines remove bias
from selection processes, they can get it wrong (Peck, 2013). One company rejected applicants
because they did not possess a job title which only exists at their company (Economist, 2013). Other
examples are relying on past prejudices (Barocas & Selbst, 2014) and not understanding the context
of a specific situation (Byrne, 2014). Furthermore, they can potentially allow discrimination to
happen (Gangadharan, 2014; Barocas & Selbst, 2014). Mortgages are denied on the basis of race,
creditworthiness is based on social media posts and applicants are discriminated against for not
possessing the required skills that past, successful employees have.
There are also issues of generalisability. Data collected from social media might be skewed as
individuals purposefully shape a certain identity (Mahrt & Scharkow, 2013). Relying on one specific
system or site is also a common practice (Hargittai, 2015), leading to unsuccessful evaluations as the
depth and breadth of the data is limited (Rifkin, 2014). For example, LinkedIn and Twitter data are
broad but not deep - they provide us with a large amount of data but not everything is of substance.
Besides machine and generalisability fallacies, there is also a risk of relying on an excessive
amount of data. Is bigger better? Not necessarily. As more and more data is thrown into personality
assessments, extra variables are added that are unrelated to the job (Corcodilos, 2014). This calls
into question the relevance of the specific variables that are being screened (Walker, 2012). The
example of Leprino Foods in Denver, Colorado, illustrates this. Employees and job applicants of this
company received a settlement of $550,000 after the government found that the tests the company
used included variables that were completely unrelated to the job.
Do these disadvantages outweigh the aforementioned advantages? Machine fallacies can
actually be reduced by including human input, and the issue of generalisability can be avoided by
simply extending the number of sources. Too much data is avoidable as well, as limits can be set on
the amount of data used. Moving away from known paradigms and tests, as well as finding
previously unknown patterns, outweighs the disadvantages: characteristics thought to be job related
turn out to be redundant or irrelevant, leading to the hiring of people initially thought unqualified.
More relevant data is a paradox, though, as more data can become a big disadvantage as well.
The most relevant and impactful disadvantage involved with using big data in recruitment,
for this thesis, is the risk of statistical errors and how they can increase compared to a hiring process
that does not implement big data. As the amount of big data increases, extra but irrelevant variables
are added that increase the risk of “false discoveries”, or type I errors (Shah et al., 2015). Examples
of statistical errors are false negatives, when potentially good employees are rejected, and false
positives, when candidates are wrongfully hired. These potential errors will be further discussed in
the next section.
5 Analysis & Results
This section discusses the analysis of the literature and the results. Existing models, as well as new
frameworks and models will be presented to gain insight into the implementation of big data in
recruitment. As such, it will seek to answer if big data have a positive or negative influence on
business processes, especially recruitment. It will conclude with a final theory and model, with
recommendations for successfully implementing big data in the hiring process.
5.1 Frameworks and Models
Several conceptual frameworks have been derived out of the literature study. These frameworks
already existed or were created on the basis of the theories in the literature. Once combined, these
models lead to a guide on how to successfully implement big data in recruitment.
5.1.1 Big Data Model
The first model, based on the findings in the literature, outlines the big data process. In the first
stage, there are several raw, unconnected datasets. These datasets contain, for example, social media
posts, business transactions or mobile phone activities. As the literature shows, the essence of big
data is that eventually these datasets are cross-referenced (Williamson, 2015; Kitchin, 2014; Boyd &
Crawford, 2012). Then, in the analysis stage, algorithms are applied to search for patterns and
correlations. Finally, the results lead to new insights. However, these results are not always equally
valuable as correlations do not imply causality. This is further explained in section 5.2.
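The stages of this model can be sketched with a toy example. All user ids and values below are hypothetical; the point is only to show raw, unconnected datasets being cross-referenced and then analysed for a pattern:

```python
# Stage 1: raw, unconnected datasets keyed by a (hypothetical) user id.
social_posts = {"u1": 120, "u2": 15, "u3": 80}    # posts per month
transactions = {"u1": 34, "u2": 4, "u3": 22}      # purchases per month
phone_usage = {"u1": 300, "u2": 60, "u3": 210}    # call minutes per month

# Stage 2: cross-reference the datasets into one combined record per user.
users = social_posts.keys() & transactions.keys() & phone_usage.keys()
combined = {
    u: {"posts": social_posts[u],
        "purchases": transactions[u],
        "minutes": phone_usage[u]}
    for u in users
}

# Stage 3: analysis. A trivial pattern check stands in for the
# correlation-mining algorithms the literature describes.
def most_active(records):
    """Return the user with the highest combined activity score."""
    return max(records, key=lambda u: sum(records[u].values()))

# Stage 4: the result is a new "insight" (which, as noted in the text,
# need not be causally meaningful).
print(most_active(combined))
```

The essential step is stage 2: individually harmless datasets become big data, in the sense used here, only once they are linked on a shared key.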
5.1.2 Privacy Model
In the literature there is a lot of concern for privacy issues. Thus, several suggestions for the
protection of data and correct analytics have been presented. These are now conceptualised into a
framework of privacy, as shown in Figure 7.
Figure 7: Conceptualisation of privacy
The requirements for big data privacy are the following: data & IT security, accountability, and
legislation. As Williamson (2015) points out, inappropriate surveillance and data leaks can occur and
this calls for security. Boyd & Crawford (2012) find accountability to be a multi-directional
relationship between businesses and users: we need to have realistic expectations of our privacy
(Schadt, 2012) and businesses need to get consent from the users to use their data. Finally, legislation
needs to be put in place so that all actors involved are conforming to standards and general rules.
Examples are mechanisms for data protection, access control and SLAs (George et al., 2014).
5.1.3 Big Data & Privacy Model
The above models for big data and privacy can also be combined to form a new conceptual
framework. This is shown in Figure 8. First, data & IT security must be put in place so that the raw,
unconnected datasets are protected. When these datasets are cross-referenced, and become big data,
businesses need to have the consent of the owners of the data, and users must have an educated
perspective on what their data are being used for. Then, in the analysis step, mechanisms and SLAs
need to exist so that the analysis conforms to the rules and no datasets are used beyond their
intended purpose.
Figure 8: Conceptualisation of big data & privacy
5.1.4 Recruitment
The model in Figure 9 (Kwok et al., 2011) is used as a basis for the recruitment process and is then
combined with big data to form a new framework in section 5.1.5. A hiring or recruitment model
by Kwok et al. (2011) for students was used as a reference. As Rutledge et al. (2008) explain, hiring
consists of screening and selection. In the model, these correspond to recruiters’ expectations,
combined with student characteristics, and the recruiting-selection process, respectively.
5.1.5 Recruitment & Big Data
The central theme of this thesis is big data in combination with recruitment, thus a framework is
needed to reflect the compatibility of both. This is shown in Figure 10.
Figure 10: Conceptualisation of big data recruitment
Big data can add value to the screening and selection steps of recruitment. The recruiter’s
expectations make up the screening process, where the initial applicant pool is filtered. An
applicant’s characteristics and their relevant datasets, which contain social media posts and other
activities, determine the initial requirements that recruiters look for. These datasets are
protected by data & IT security, in accordance with the privacy model. During the selection process,
additional big data can be added to narrow down the search. These data are protected by accountability.
The selection process then turns into analysis, protected by mechanisms and SLAs. The results of
the analysis lead to new insights, and these insights are used to determine hiring decisions.
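As an illustration of the screening and selection steps in this framework, the following hypothetical sketch filters an applicant pool on hard requirements and then ranks the remainder by an additional data-derived score. All fields, names, and values are invented:

```python
# Hypothetical applicant records: screening attributes plus a score
# derived from additional (big) data sources.
applicants = [
    {"name": "A", "years_experience": 5, "degree": True,  "data_score": 0.7},
    {"name": "B", "years_experience": 1, "degree": False, "data_score": 0.9},
    {"name": "C", "years_experience": 3, "degree": True,  "data_score": 0.4},
]

def screen(pool, min_experience=2, require_degree=True):
    """Screening: eliminate applicants who do not meet the requirements."""
    return [a for a in pool
            if a["years_experience"] >= min_experience
            and (a["degree"] or not require_degree)]

def select(pool):
    """Selection: rank the screened pool by the additional data score."""
    return sorted(pool, key=lambda a: a["data_score"], reverse=True)

shortlist = select(screen(applicants))
print([a["name"] for a in shortlist])
```

Note that applicant B, who has the highest data score, never reaches the selection step: in this framework the big data only refine a pool that screening has already narrowed.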
5.1.6 SMART
Parts of the theories and explanations found in the literature show similarities with the SMART
model of Marr (2015), shown in Figure 11. His model starts with a good data strategy, with the goal
of deciding the business’ strategic objectives and related business questions to achieve these
objectives. This helps decide what data should be collected and effectively search for these data.
Then, it is important to measure metrics and collect the data required to answer the questions. Data
comes in different shapes: structured, unstructured (big data), internal, and external. His
recommendation is to first investigate what is available internally, then search for external
information. The next steps are applying analytics, to turn data into relevant insights, and to report
the results to decision-makers so they can be understood. Finally, this will lead to a transformation
of your business and various ways to improve processes, services, and products. This final step is
where the value of big data is revealed.
Figure 11: SMART model (Marr, 2015)
5.2 Statistical Errors
As mentioned in the previous section, one of the disadvantages that big data brings to recruitment is
the risk of statistical errors. The main issue is the growing tendency to base conclusions on
correlations rather than causality (Mayer-Schönberger & Cukier, 2013; Corcodilos, 2014).
Correlation quantifies a statistical relationship between variables, where a strong correlation
indicates the likelihood of variable A changing when variable B does, or vice versa. Even though
correlations are useful for predictions, they are not inherently meaningful (Kitchin, 2014), nor do
they necessarily indicate a causal relationship. Increasing the amount of data does not help either,
as more data does not mean more statistically significant correlations: with a large enough dataset,
correlations can be found among almost all variables (Brooks, 2013; PWC, 2014). This ‘correlation
thinking’ is
reflected in the growing amount of research that is merely interested in finding patterns without a
clear definition and direction of research (PWC, 2014). Correlations might provide an indication of
a relationship between variables, but they do not answer the question of ‘why’: why are two events
related (Barnes, 2013)? This requires a causal explanatory framework that acknowledges the
inherent powers and flaws of a phenomenon and the underlying causal mechanisms (Barnes, 2013).
Consequently, the focus on correlation instead of causality influences profiling and can lead
to two types of errors:
• An individual is suited for a job, but profiling indicates otherwise. In other words, someone is
wrongfully rejected. This is defined as a type I error.
• An individual is not suited for a job, but profiling indicates that he or she is. In other words,
someone is wrongfully hired. This is defined as a type II error.
The type I and type II errors are intertwined: to decrease the chances of wrongly rejecting
potentially successful employees, the profiling criteria have to be made less strict. This, however,
increases the chances of wrongly hiring someone who is not suited. Setting the bar higher for
profiling decreases those chances, but at the same time increases the chances of wrongfully
rejecting potentially successful employees. How strict the profiling should be therefore depends on
which type of error one prefers to minimise or even avoid.
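This trade-off can be made concrete with a toy calculation, using the terminology of this thesis (type I: wrongful rejection; type II: wrongful hire). All scores and thresholds below are invented for illustration:

```python
# Hypothetical profiling scores: higher means the profile looks better.
suited = [0.9, 0.8, 0.55, 0.45]      # candidates who are in fact suited
unsuited = [0.6, 0.5, 0.3, 0.2]      # candidates who are in fact not suited

def error_rates(threshold):
    """Hire everyone scoring at or above the threshold; count both errors."""
    type_1 = sum(s < threshold for s in suited)      # suited but rejected
    type_2 = sum(s >= threshold for s in unsuited)   # unsuited but hired
    return type_1, type_2

strict = error_rates(0.7)   # a high bar for profiling
lenient = error_rates(0.4)  # a low bar for profiling
print(f"strict:  {strict[0]} wrongful rejections, {strict[1]} wrongful hires")
print(f"lenient: {lenient[0]} wrongful rejections, {lenient[1]} wrongful hires")
```

Moving the threshold converts one kind of error into the other; no threshold eliminates both, because the score distributions of suited and unsuited candidates overlap.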
Besides their obvious downsides, both types of error can also have upsides. With a type I error,
when someone is wrongfully rejected, an upside can be noted if a company finds out afterwards
that the applicant was in fact not suited. Downsides of a type I error occur when an individual is
rejected for a loan based on his or her race even though he or she meets the requirements for the
loan, or when a potentially talented employee deploys his or her skills at another company. With a
type II error, when someone is wrongfully hired, an upside can be noted when the decision to hire
someone is based on data that are not representative of the person’s actual capacities and this turns
out positively. On the other hand, downsides occur when, for example, a mentally unstable pilot is
hired, as illustrated by the case of the Germanwings crash (Hepher, 2015).
The influence that big data can have on profiling is shown in the following model:
5.3 Final Theory
The purpose of this thesis is to find out if businesses are increasingly relying on external knowledge,
such as big data, and how this affects business processes such as recruitment. The knowledge gained
in the frame of reference and theoretical study, as well as the theoretical frameworks will now be
used to answer this question. Once combined they lead to new insights. In section 2.3, several
questions were identified for guiding the process of researching the literature. For final analysis, not
all questions are equally relevant. Thus, the answers to the most relevant questions are now
summarised in the following tables, sorted by theme, with the corresponding author. To conclude, a
final theory complete with a new conceptual framework will provide the requirements for successful
implementation of big data in recruitment.
5.3.1 Big Data
• How does the text communicate a theory?
• How does it explain the process of collection?
• What type of privacy issues and critical points does it communicate?
Author(s) | Theory | Collection process | Privacy issues | Criticism
Williamson, 2015; Kitchin, 2014; Boyd & Crawford, 2012 | Interconnecting once disparate datasets, cross-reference | X | X | X
Hoy, 2014; Kitchin, 2014; McAfee & Brynjolfsson, 2012 | Volume, velocity, variety | X | X | X
George et al., 2014 | X | Data exhaust, self-quantification data | X | X
Williamson, 2015 | X | X | Inappropriate surveillance, data leaks | X
Schadt, 2012 | X | X | Privacy perceptions | X
Boyd & Crawford, 2012 | X | X | Accountability, multi-directional | X
George et al., 2014 | X | X | Mechanisms for data protection, access control | X
Brooks, 2013 | X | X | X | Excessive amount of data
Mayer-Schönberger & Cukier, 2013