Big data skills in online vacancies: exploring the need for talent and skills in the age of data


Academic year: 2021



Big data skills in online vacancies

Exploring the need for talent and skills in the age

of data

Student: Pim Boon
Student number: 10998349
Date of submission: 26-01-2018
Business Administration (MSc): Digital Business (track)
University of Amsterdam (UvA)


Statement of originality

This document is written by student Pim Boon, who declares to take full responsibility for its contents. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of the completion of the work, not for the contents.

Abstract

In the age where information is power, Big Data is the game to play. There is an explosive growth of information available to organizations, coming from many different digital sources, such as data derived from online customer behavior and social media. Organizations try to capture as much information as possible and to derive value from it. The sheer amount of information gathered makes analysis difficult, and people with the skills required to perform these analyses are hard to find. Therefore, the importance lies with the people who can make this information understandable for the masses: from Big Data to understandable, valuable data. Previous research has identified the most common hard skills in the field of Big Data, and research has also examined the effect of different soft skills within Big Data related job titles. However, there is no research into the combination of soft and hard skills in the field of Big Data; this is the gap this thesis fills. We do this by answering the question of how well vacancies fit within the clusters. The skills were collected by hand from 200 Big Data related job vacancies posted by the first 20 companies of the 2016 Fortune 500. This research suggests that hard and soft skills are equally important in the segmentation process, alongside the years of experience required for the job and the years of education. In this thesis, we use a cluster analysis to map where different skills belong. Hence, predictive value lies in the possibility of managing these segments simultaneously in one job application; for instance, a set of two hard skills and two soft skills is more common for a “data analyst”, whereas a “data scientist” typically comes with a set of four hard skills. Using a scree plot to determine the best number of clusters, a K-means cluster analysis was performed with four clusters. First of all, this research suggests that no matter the field, be it healthcare, automotive or financial, the skill set requested for a “Data Scientist” remains the same across industries. Furthermore, hard skills and years of education form clear clusters, whereas soft skills are important in all vacancies. Finally, the importance of data governance is highlighted.
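The clustering procedure summarized above (vacancy features, a scree plot over candidate numbers of clusters, then K-means with four clusters) can be sketched as follows. The feature columns and toy data are illustrative assumptions, not the thesis dataset, and plain NumPy stands in for whatever statistical package was actually used:

```python
# Minimal pure-NumPy sketch: within-cluster sums of squares for a scree
# (elbow) plot, then clustering with k = 4 as the thesis does.
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distances of every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    inertia = ((X - centers[labels]) ** 2).sum()
    return labels, inertia

rng = np.random.default_rng(0)
# columns: hard-skill count, soft-skill count, years of experience, years of education
X = rng.integers([0, 0, 0, 12], [6, 5, 10, 18], size=(200, 4)).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the features

# Scree-plot data: inertia per candidate k; plot this to pick the elbow
scree = [kmeans(X, k)[1] for k in range(1, 9)]

labels, _ = kmeans(X, 4)                    # the thesis settles on 4 clusters
print(len(set(labels.tolist())))
```

Each resulting cluster can then be profiled (mean skill counts per cluster) to describe it, as the thesis does with its "data analyst" versus "data scientist" segments.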


Contents

1. Introduction
1.1 Problem statement
1.2 Research goal and research question
2. Theoretical framework
2.2.1 Big Data volume
2.2.2 Big Data variety
2.2.3 Big Data velocity
2.2.4 Big Data veracity
2.3 Impact on workforce
2.4 Big Data use by organizations
2.5 Main challenges of Big Data
2.5.1 Talent
2.5.2 Data challenges with the four V’s
2.5.2.1 Volume
2.5.2.2 Variety
2.5.2.3 Velocity
2.5.2.4 Veracity
2.5.2.5 Big Data security
2.6 Data Scientist
3. Method
3.1 Data collection and sample size
3.3 Results
3.3.2 Linear regression with ANOVA
3.4 Cluster description
4. Discussion, Conclusion and Further research
4.1 Discussion and Conclusion
4.2 Further research


1. Introduction

The digital age is producing vast amounts of Big Data. Digital data is being gathered in ever larger quantities, yet little attention is given to the skills needed to handle this vast amount of data: “US-based Acxiom offers clients, from banks to auto companies, profiles of 500 million customers—each profile enriched by more than 1,500 data points gleaned from the analysis of up to 50 trillion transactions” (Bughin, Chui, & Manyika, 2013). This thesis gives a description of Big Data, exploring definitions and raising the question of what is needed in terms of skills and how this is currently reflected in online vacancies. First, through a literature review, a solid understanding of Big Data and corresponding terminology is presented. In doing so, the most relevant issues and opportunities of Big Data for organizations are assessed. The use of Big Data offers enormous potential for organizations, which is currently not being utilized in all industries.

What is “Big Data”? Jacobs (2009) gives a meta-definition of it as “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time.” He describes that the definition of Big Data has changed over time, and that it should not be static but should change as data changes. One thing that will not change is the need for people who understand the algorithms and hardware needed to make the calculations required for Big Data solutions. The definition of Big Data used in this thesis is the one from Zikopoulos and Eaton (2011): “In short, the term Big Data applies to information that can’t be processed or analyzed using traditional processes or tools.” This definition was chosen because it considers the tools and skills needed for Big Data, where most definitions only focus on the size of the data.

Suthaharan (2014) adds: “Big Data is currently defined using three Data characteristics: volume, variety and velocity.” These three aspects are widely used in research on Big Data. Singh and Singla (2015) present three V’s in their article; later, a fourth V was introduced. The first of these four V’s is volume: the basic assumption is that the sheer size of Big Data is an important attribute. Currently, the growth of Big Data is enormous. Data is everywhere and gathered everywhere; even a mundane train has hundreds of sensors collecting data. These sensors keep track of conditions such as the state of individual parts, location (with GPS), shipment tracking and distance (Zikopoulos & Eaton, 2011). The second basic property of Big Data is

velocity, which stands for the speed at which data is generated. The third parameter of Big Data is variety, which is the way of structuring the data. The fourth V, which was introduced later, is veracity, or the accuracy of the data: that is, whether Big Data analytics and their outcomes are free of flaws and reliable. There are three main varieties: structured, unstructured and semi-structured data (Raghupathi & Raghupathi, 2014); more on this later.

This study investigates the much-needed extra dimension: the skill set needed to understand and handle Big Data. Is it all as relevant as we think, and is everyone getting the right information from the correct data as collected by the organization? In this thesis, we determine what skill set is needed for Big Data solutions in large organizations, looking at vacancies in the top ten organizations of the Fortune 500. In building the Big Data solutions that are necessary to solve the problems companies in the field face, employees have to keep learning and implementing new ways to handle and build Big Data, as the data keeps changing and technology improves. A few of the biggest problems of Big Data are the high infrastructure costs, such as system communication. Equipment, such as hardware in terms of computer processors, is very expensive, even if cloud solutions are in place (Tole, 2013). However, the large costs of the needed processing power, infrastructure, maintenance and other technology-related products are falling as these technologies for analyzing large data sets become important to a large variety of industries.

The difficulty that has not been researched in depth is the different skill sets required by current Big Data applications. Business intelligence and analytics education should cover the critical aspects, such as analytical skills, IT skills, knowledge about the business and its surroundings, and the communication skills to make Big Data solutions understandable for the intended user (Chen et al., 2012).

1.1 Problem statement

Most decisions are based on data, and more and more companies are collecting data in various forms and through various methods (Labrinidis & Jagadish, 2012). These data-driven decisions become more and more important as the amount of data around the world grows (Pospiech & Felden, 2012). But what skills do we need for these decisions? Motivated by the lack of information about the skill set a data analyst has to have, we began this research. The goal of this master thesis is to determine what specific skill set is required for each new data-related job description, by clustering skills and job titles. This research is a first step in measuring whether organizations are looking for the same skill set related to Big Data analytics and the acquisition of talent. This means that when a company is looking for a smaller skill set in the market, the organization could be less mature in the Big Data field than, for example, an organization that requests a larger skill set. Furthermore, results of this research can indicate whether a company is changing, based on its outstanding vacancies, in terms of Big Data maturity. The demand for Big Data skills may be relevant for drawing conclusions about the market. By analyzing the current vacancy offering, one can see whether companies have adopted certain Big Data technologies. Organizations’ outstanding vacancies can also serve as a measurement method for overall market developments, to see which skills are needed most and where.

Most research on Big Data tries to show how certain Big Data skills can improve companies; however, these studies do not look at the entire skill set the employee can offer, the “employee package”. In this thesis we take a step back to show how vacancies actually map the relevance of certain skills, without applying it in practice, by looking at the frequency of skills in vacancies. “There are many related areas for future research related to the diffusion of big data technologies. A question of policy interest, as the use of big data technologies becomes widespread, is the extent to which large-scale data driven decision-making will complement.” (Tambe, 2014). Furthermore, acquiring complementary Big Data skills is not the only obstacle to successful big data use. The effective use of big data will require changes in management beliefs and data governance. Previous studies provide insight into how management beliefs can yield superior performance by allowing organizations to get a much faster insight into customer behavior, competitors, and suppliers (Mendelson, 2000); the use of big data technologies can raise the returns to these practices by improving the depth of insight that firms derive from these interactions as well as the speed at which they respond. Installing these capabilities often requires organization-wide changes to complement data-driven practices.

This study sets out to find groups in which skill sets form the specifications of vacancies and the roles required by firms to fill these vacancies and add value to their Big Data analysis. Our

findings corroborate the ideas of Miller (2014), who suggested that Data Scientists and their profound expertise in analytical methods are far from sufficient in granting companies a real competitive advantage. Companies are looking for skilled employees who can manage and analyze their data sets to make more grounded decisions in order to maintain a competitive advantage. Among these companies are Amazon, Google, eBay, Netflix, and Wal-Mart. These companies analyze large data sets to gain new customer insights (Dunaway & McCarthy, 2015).

1.2 Research goal and research question

The main research questions are:

1. To what extent is Big Data represented in the recruitment needs of companies?

2. What segmentation in Big Data related job vacancies can be made using hard skills, soft skills, experience and years of education?

3. Is the perceived importance of hard skills affected by soft skills, years of education and experience?


2. Theoretical framework

2.1 Definitions of Big Data

There is no consensus on a scientific definition of Big Data and a lot of research is still being conducted on this topic. This makes it necessary to discuss the definition of Big Data and how the term is used in this thesis.

Big Data is, as the term suggests, a large amount of data, irrespective of its origin, whether this data is gathered via social media or with sensors on container ships measuring the height of the waves. There are many definitions as to what Big Data is and should be. Nasser and Tariq (2015) define Big Data as:

“Simply Big Data refers to the data and information which can’t be handled or processed through the current traditional software systems. Big Data is large sets of structured and unstructured data which needs be processed by advanced analytics and visualization

techniques to uncover hidden patterns and find unknown correlations to improve the decision making process”

Lohr (2012) argues that Big Data might just be something thought up by marketing. This is supported by Big Data being seen as a popular media term used by businesses in the industry (Manovich, 2011). Nevertheless, Big Data is an advancing trend in technology that could open new doors for organizations to get a better understanding of the world and make decisions based upon new types of information.

Big Data consists of a few components: structured, semi-structured and unstructured data. Structured data is relatively simple and easily handled, for example in a database. As an example, think of an Excel file with book sales, but a file so big that Excel is not able to handle the information. Structured data refers to data where the information is organized, meaning it can easily be found with search engine algorithms or similar systems.

Unstructured data is much more difficult to handle. This data consists of information that does not have a pre-defined data model or is not structured in any form that is easily handled with a straightforward data model. Looking at the example of books, these could be the reviews given online; these reviews consist of text and do not have a predetermined

order. Unstructured information typically consists of text-heavy data. Another example is a recorded conversation, where information is hidden within a story. Unstructured data may also contain data such as dates, numbers, and facts that are similar to structured data. The mixed nature of such a data set makes it an ambiguous and difficult task to analyze it with traditional data models and algorithms, compared to a data set from a traditional database with structured data. The difficulty of Big Data solutions is further explained by Hashem et al. (2015):

“Big Data are characterized by three aspects: (a) data are numerous, (b) data cannot be categorized into regular relational databases, and (c) data are generated, captured, and processed rapidly. Moreover, big data is transforming healthcare, science,

engineering, finance, business, and eventually, the society” (Hashem et al., 2015).
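The contrast between structured and unstructured data can be made concrete with a small sketch, continuing the book example above; the file contents, field names and the star-rating regex are illustrative assumptions:

```python
import csv, io, re

# Structured data: a delimited file maps directly onto named fields,
# so it can be queried without any interpretation step.
structured = io.StringIO("title,copies_sold\nMoby-Dick,120\nDracula,95\n")
sales = {row["title"]: int(row["copies_sold"]) for row in csv.DictReader(structured)}

# Unstructured data: a free-text review has no predefined data model;
# any structure (here, a star rating) must be extracted, e.g. with a regex.
review = "Loved Moby-Dick, a gripping read. 5 stars, would recommend!"
match = re.search(r"(\d)\s*stars?", review)
rating = int(match.group(1)) if match else None

print(sales["Moby-Dick"], rating)  # → 120 5
```

Semi-structured data (say, a review stored as JSON with free-text fields inside) sits between the two: the envelope parses like the CSV, while the text inside still needs extraction like the review.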

Other researchers suggest that there are several kinds of Big Data. Big Data is not a stand-alone term that refers to all information systems that have the key aspects of Big Data, like high volume, variety and velocity. They see two main aspects of Big Data: ‘Big Data technology’ and ‘Big Data analytical methods’, as explained by Sahu and Shandilya (2010): “Big Data analytics is the process where large amounts of data are collected, examined and organized to look for hidden patterns in the data.” When discussing Big Data analytics technology, they talk about the tools data scientists use to analyze large data sets.

Big Data is not only a lot of data; it also concerns quantification in terms of records, tables, files or transactions from, for example, a credit card company (Russom, 2011). For many organizations, the data they collect is very different when looking at different markets or even competitors. Where one organization is looking at the online chat about a product, the other is looking for defects in a production line. Both examples are Big Data related but have very different information sources. Data is collected, but not always in the most efficient way for analysts. Different forms of analysis require different data sets; some analyses are ad hoc, while other solutions require a more long-term approach. Because data keeps growing, we explain this using the four V’s (volume, velocity, variety, and veracity) to characterize different dimensions and challenges of Big Data (Hitzler & Janowicz, 2013).


2.2.1 Big Data volume

As discussed in the previous chapter, volume is an important aspect, if not the most important aspect, of Big Data. But at what point can we call Big Data ‘big’? Big Data does not only consist of the data we already possess, but also of the data gathered right now. Facebook generates about 10 terabytes (TB) every day; Twitter generates about 7 TB. Some companies produce 1 TB each hour. Not only social media gathers large data sets: retailers record all customer transactions made both on- and offline and gather large data sets (Manyika et al., 2011). The McKinsey Global Institute (2011) describes the large volume of data collected: “Their [The expanding digital universe, March 2007] analysis estimated that the total amount of data created and replicated in 2009 was 800 exabytes—enough to fill a stack of DVDs reaching to the moon and back. They projected that this volume would grow by 44 times to 2020, an implied annual growth rate of 40 percent”.

Therefore, volume refers to the bulk of data that is being gathered every year, month, hour or even second. Accordingly, volume provides challenges, because technologies are currently built for analyzing smaller data sets and do not scale up (Sahu & Shandilya, 2010). This is further supported by McAfee et al. (2012), who state that 2.5 exabytes of data are created every day and that this number doubles every 40 months.
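The quoted growth figures can be sanity-checked with a line of arithmetic: 2009 to 2020 spans 11 years, so 44-fold growth implies a compound annual rate of 44^(1/11), and doubling every 40 months implies a rate of 2^(12/40) per year:

```python
# Sanity-check of the growth rates quoted above.
annual_rate = 44 ** (1 / 11)            # 44x growth over 2009-2020 (11 years)
print(round((annual_rate - 1) * 100))   # → 41, close to the quoted 40 percent

doubling = 2 ** (12 / 40)               # doubling every 40 months, annualized
print(round((doubling - 1) * 100))      # → 23 percent per year
```

Note that the two sources describe different quantities (total data created versus data per day), so the rates need not agree with each other.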

2.2.2 Big Data variety

The number of smart devices that surround us is growing, as is the number of functionalities these devices have. Thus, the variety of data types and the formats data is supplied in is growing rapidly. These ‘smart’ devices, new technologies and the social networks we get our data from have generated data sets that are complex to deal with. Data now includes raw, semi-structured and unstructured data collected from various sources, like web pages, emails, social media forums, search indexes, audio and video.

These days most organizations focus on structured data, while 80% of data is unstructured or semi-structured (Ronik, 2014). This is further supported by McAfee et al. (2012), who state that “Many of the most important sources of big data are relatively new”. This refers to data gathered from messages via (social) media, measurements from GPS signals from cell phone applications, and more, which is seen as unstructured data. Variety of data refers to every

point where data is or could be gathered. From a payment made in a store to the search for a good restaurant in Amsterdam, the variety of data is enormous.

2.2.3 Big Data velocity

Velocity refers to the speed at which data is generated, the speed at which the data should be analyzed and the speed with which people should act on the new knowledge. The speed of data is further increased because it is “coming from any kind of device or sensor, say robotic manufacturing machines, thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd” (Russom, 2011). New technologies, like smartphones, sensors and other smart devices, have caused an unprecedented growth in data. This growth created a need for real-time analytics (Gandomi & Haider, 2015). For many applications, the speed of data generation has become more valuable than the volume it is presented in. For example, Wal-Mart processes more than a million transactions every hour (Cukier, 2010), giving it immediate insight into what is going on in its stores. Real-time decision making enables a company to be much more agile than its competitors. Quick insights present a clear advantage for many groups, for example Wall Street analysts and Main Street managers (McAfee & Brynjolfsson, 2012).

2.2.4 Big Data veracity

Veracity refers to incompleteness or flaws in the data set. This feature measures the possibility of conducting valuable analyses and provides a way to assess the data set. The accuracy of the data is very important, because of the high potential impact of decisions made based on incomplete data (Sahu & Shandilya, 2010). How complete or correct the data set is matters greatly for further analyses. The completeness and/or correctness of the data is measured as the quality of the data set, using several parameters. First, completeness: all the needed information is present, for example all addresses of clients who shop online. Second, accuracy: the data set does not contain typing errors, for example the addresses are all spelled correctly. Third, availability: the data is easy to find when needed, meaning all addresses should be easy to retrieve. Fourth, timeliness: the data is ready when needed for decision making, meaning an address change is processed on time (Nasser & Tariq, 2015).
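As an illustration, the four quality parameters could be expressed as simple checks over a set of customer records; the field names, the crude accuracy rule and the one-year timeliness window below are assumptions made for the sake of the sketch:

```python
from datetime import date

# Toy customer records; field names are illustrative.
records = [
    {"name": "A. Jansen", "address": "Herengracht 1, Amsterdam", "updated": date(2017, 12, 1)},
    {"name": "B. de Vries", "address": "", "updated": date(2015, 3, 9)},
]

def quality_report(recs, today=date(2018, 1, 26), max_age_days=365):
    complete = all(r["address"] for r in recs)          # completeness: no empty fields
    accurate = all("," in r["address"]                  # accuracy: crude format check
                   for r in recs if r["address"])
    available = all("address" in r for r in recs)       # availability: field exists at all
    timely = all((today - r["updated"]).days <= max_age_days
                 for r in recs)                         # timeliness: recently updated
    return {"complete": complete, "accurate": accurate,
            "available": available, "timely": timely}

print(quality_report(records))
```

Here the second record fails both completeness (empty address) and timeliness (not updated within a year), which is exactly the kind of flaw the veracity dimension is meant to surface.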


2.3 Impact on workforce

Why is it important that we look at the impact of Big Data? Wamba et al. (2015)

state that “The simple answer to this critical question is because big data has the potential to transform the entire business process”. The impact of Big Data is getting larger, not only in obvious fields, like online sales, but in other fields as well: “In astronomy, the Sloan Digital Sky Survey (SDSS) shows how computational methods and big data can support and

facilitate sense making and decision making at both the macroscopic and the microscopic level in a rapidly growing and globalized research field” (Chen, Chiang & Storey, 2012).

In the area of fuel consumption, the gathered information about traffic generates added value that is quantifiable to consumers in terms of time saved by not waiting in a traffic jam, as Google does when using Google Maps as your navigation system. The same goes for the fuel you save. Both the fuel and time savings are enabled by an application that tracks your coordinates. Taylor-Sakyi (2012) argues that Big Data applications allow organizations to analyze customer behavior in a much shorter time span than before, when it could have taken days or even weeks to generate this information. In addition to active academic research on data analytics, industry research and development has also generated much excitement, especially with respect to Big Data analytics for semi-structured content (Chen, Chiang, & Storey, 2012).

Artificial intelligence has been around since the 1950s and has long been used by researchers. In the 1990s, business intelligence gained popularity, and in the 2000s, business analytics was introduced to the world (Davenport, 2006). Based on a survey among 4,000 participants, business analytics was found to be one of the four most important technology trends of the 2010s. Furthermore, companies with revenues over $100 million were found to be using some sort of analytical tools. The phenomenon of Big Data is not new for those in the commercial sector, who have been collecting and combining different data sets to make better customer segmentations and understand the market better (Manyika et al., 2011). McAfee et al. (2012) found that the more companies act on data-driven decisions, the higher they score on objective measures of data-driven processes and the better they perform on operational and financial results.

At the World Economic Forum, Big Data was an important topic. Data was announced as a new class of economic asset, a commodity like oil and gold (Lohr, 2012). This indicates the importance of Big Data and the impact it will have on the way we live. These days, Big Data is generating attention worldwide. A Google search on Big Data in 2011 had about 252,000 hits; in 2012, it had almost 1.39 billion hits (Flory, 2012). In 2017, there are 4.6 million hits on Google Scholar alone.

Making decisions on the basis of data became popular in the 1980s and 1990s and is becoming more sophisticated, and therefore more popular, these days. The analytical algorithms used for data analysis are important factors in this popularity. However, the analytical tools we have are still in their infancy and it will take some time for them to mature

(Picciano, 2012). Additionally, Picciano (2012) states that “While big data and analytics are not panaceas for addressing all of the issues and decisions faced by higher education

administrators, they can become part of the solutions integrated into administrative and instructional functions”. Big Data analytics refers to business intelligence and analytics that find their origin in data mining and statistical analytics (Chen, Chiang, & Storey, 2012). The importance of Big Data and analytics is further supported by Chen, Chiang and Storey (2012).

Ammu and Irfanuddin (2013) state that “today, we are afloat in Data Ocean”. This indicates that a lot of information is being gathered all around us, but is it all of equal value? According to Wamba et al. (2015), “despite the excitement and recent interest in ‘big data’, little is known about what encompasses the concept. Indeed, potential adopters of ‘big data’ are struggling to better understand the concept and therefore capture the business value from ‘big data’.” This indicates that the real impact of Big Data is not clear for every organization, as organizations are still struggling with the information (Davenport & Patil, 2012). Many organizations want to implement and use the full potential of Big Data but do not quite know how to.

Chen et al. (2012) state that the “opportunities with the abovementioned emerging and high-impact applications have generated a great deal of excitement within both the BI&A [Business Intelligence & Analytics] industry and the research community.” Besides this, they state that the industry has focused on scale and implementations of applications, but more


2.4 Big Data use by organizations

Most organizations are dependent on Big Data and the information processes that collect and handle this data (Demirkan & Delen, 2013). In this section, we explain how organizations use their Big Data and what value they derive from it. The list of companies and industries that use Big Data is endless; Sagiroglu and Sinanc (2013) state:

“astronomy, atmospheric science, genomics, biogeochemical, biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, private sector, military surveillance, private sector, financial services, retail, social networks, web logs, text, document, photography, audio, video, click streams, search indexing, call detail records, POS information, RFID, mobile phones, sensor networks and telecommunications”.

The potential value of Big Data is enormous. According to Chen and Zhang (2014), Big Data use will potentially add $300 billion of value to US healthcare and €250 billion to European public administration. Using worldwide location data will potentially generate $600 billion annually for consumers. This shows the importance of Big Data use in organizations, now and in the future.

Getting value from the insights of data has been around for years. Management supported by data tends to make better-grounded decisions. But according to Wamba et al. (2015), “Big data has the potential to revolutionize the art of management”. The difference when using Big Data in the decision-making process is not just the supply of timely and high-quality data, but the “continuous autonomous decision-making via the use of automation” (Yulinsky, 2012). One of many examples is the automation of health care decisions: access to remote monitors for health conditions like heart disease or diabetes allows doctors to automate decisions (Manyika et al., 2011).


2.5 Main challenges of Big Data

In the following paragraphs we look at the different challenges faced when investigating and working with Big Data. With the fast growth of Big Data, new challenges arise in several fields of knowledge, and these challenges, or gaps in knowledge, have to be filled. When looking at process challenges, analysts and managers should both take into consideration the difficulties encountered when processing data; the hardest part is presenting the output in such a way that the target group understands it and has a bird’s-eye view of what is going on. Further aspects that should be considered are data procurement, information extraction and cleaning, data integration and aggregation, query processing, data modeling and analysis, interpretation, management challenges, privacy, security, governance, and flexibility (Nasser & Tariq, 2015). This is further supported by the research of Giri et al. (2014). According to Kaisler et al. (2013), “We suggest there are three fundamental issue areas that need to be addressed in dealing with Big Data: storage issues, management issues, and processing issues. Each of these represents a large set of technical research problems in its own right”.

2.5.1 Talent

A little history about how the current ‘talent’ was formed is in order. It started in the 1980s and 1990s, when the Wall Street data ‘quants’ were the data scientists of their day. In those days, people with a background in physics or math went to investment banks and hedge funds to think up entirely new algorithms. Then universities created programs like financial engineering, which gave the world the second generation of data analysts and which was accessible to the mainstream (Davenport & Patil, 2012). As Big Data is experiencing rapid growth, the number of people able to understand and extract valuable information from big data should grow at the same speed as Big Data technologies. The result of this need is a shortage of Big Data specialists, which is further addressed by Chen and Zhang (2014): “The shortage of talent will be a significant constraint to capture values from Big Data.”

The constraint is that the pool of experts who can implement Big Data solutions is not large enough to meet the demand, as the technology is still emerging and a large group of organizations wants to use Big Data. Data analysts not only need the technical skills to maintain algorithms, they also need to read, write and understand them. These days, organizations are hiring Big Data consultants to train their employees (Sahu & Shandilya, 2010). Additionally,

(17)

17 | Page vendors of Big Data solutions are working on their software to make it easier to work with and more manageable (Davenport & Patil, 2012).

Giri et al. (2014) state that talent is the most prominent challenge for Big Data. The United States alone has 140,000 to 190,000 jobs for people with deep analytical talent and 1.5 million positions for managers who have some data skills (Manyika et al., 2011). The lack of skilled workers is a serious challenge. When a technology is growing, talent is an important factor; however, the talent available for Big Data innovations is insufficient. Professionals with the necessary wide range of skills are rare, which explains why data scientists are currently in short supply (Giri et al., 2014). Data scientists are not the only profession in short supply; the need is much wider. Big Data has become an important decision driver in organizations and every employee must have a data mindset (Miller, 2014). There is a short supply of talented people working with Big Data analytics. "The demand from companies has been phenomenal. They just can't get this kind of high-quality talent" (Davenport & Patil, 2012). As Big Data grows, it becomes cheaper and more organizations can use it (McAfee et al., 2012). Recruiting 'data junkies' can be tricky, especially when the needed skill sets are not addressed in academic programmes. Big Data analyses are of a statistical nature, but many of the techniques currently used for them are not taught in traditional statistics courses (McAfee et al., 2012). Companies that do not recruit in the early stages of Big Data analytics risk falling behind (Davenport & Patil, 2012).

2.5.2 Data challenges with the four V’s

Several challenges compound the effect of the talent gap: the challenges that emerge from the amount of data increasing every day. These challenges are discussed along the four V's: volume, variety, velocity and veracity.

2.5.2.1 Volume

As Big Data grows, so does the amount of data (or volume) we receive. Managing this volume is becoming more complex as the amount of data grows at record speed (Giri et al., 2014). The immense growth in data is mainly due to machine-generated data. In 2000, around 0.8 zettabytes of data were stored worldwide; by 2020 this is expected to reach 35 zettabytes (Zicari, 2014). Additionally, this data used to be presented in a structured form, because this was the only way organizations could handle it. Current systems are mostly built to handle structured data, but now semi-structured and unstructured data is coming from all sorts of sources and is hard to present in rows (interview with David Gorbet on July 2, 2012). The new data flow brings challenges in the fields of data acquisition, analysis, storage and management. Current data management cannot handle these challenges: the volume of data is too large, the data moves too fast, or the data does not fit the structures of existing database architectures (Giri et al., 2014). The problem is that data volume is rising faster than the processing power available to analyze it.

2.5.2.2 Variety

The data that organizations retrieve does not come in one format. It is no longer purely numerical, as it used to be; it has become more complex as data is gathered from text, images, videos and other non-numerical sets. The challenges are how to retrieve, analyze and store this data and how to make it readable (Tole, 2013). For Big Data analyses, it is often unclear how the collected data can be interpreted, and the most complex part is how to analyze it. The algorithms that deal with the variety in data must support current and future needs, that is, they need to cope with increasingly expanding and complex data (Giri et al., 2014).

The complexity of the variety of data is further increased by the fact that when people converse, they do so for the most part by speaking. Natural language carries nuance and depth, for example an angry voice, yet machines do not have the algorithms to analyze this. At this point in time algorithms cannot understand these nuances, and because of this inability the data becomes homogeneous. To overcome the problems associated with homogeneity, the data has to be carefully restructured by hand before the analysis can begin (Jagadish et al., 2014). For instance, we do not yet have the ability to make machines recognize sarcasm in human voices.

2.5.2.3 Velocity

Data management as it exists today uses a traditional style: it can handle structured data, but not semi-structured or unstructured data. On Facebook alone, over three billion pieces of content are created every day (Chen & Zhang, 2014). The emerging data brings huge challenges for data acquisition, analysis, storage and management (Giri et al., 2014). One of these challenges is how to keep the data updated so that its accuracy stays intact (Kaisler et al., 2013).

Timeliness is important when looking at velocity. Real-time data is important, and techniques to summarize and filter what is to be stored are needed, since in many instances it is not economically viable to store the raw data (Jagadish et al., 2014). Big Data moves too fast to be handled by the processing capacity of traditional databases; another way must be found to process this data (Giri et al., 2014). For example, FICO's Falcon credit card fraud detection system administers 2.1 billion accounts all over the world (Chen & Zhang, 2014). When a fraudulent credit card transaction is made, ideally the transaction should be blocked. Currently, this is not possible because a full analysis of the consumer's purchasing history would take too long (Jagadish et al., 2014).

2.5.2.4 Veracity

Noise is a challenge to the veracity of data. Big Data ordinarily contains various types of measurements as well as measurement errors, missing values and outliers (Fan, Han, & Liu, 2014). The information supplied comes from increasingly diverse sources. Because of this, it is uncertain whether the gathered data is truly what it seems to be and how this uncertainty should be managed (Jagadish et al., 2014).

2.5.2.5 Big Data security

The last set of challenges has to do with ethics. When collecting and analysing Big Data, a number of ethical challenges arise: privacy, informed consent and protection from harm. But the question of security goes deeper, to what kind of data should be collected and combined (Eynon, 2013). Another major concern is the privacy of the data, and this concern continues to grow with the growth of Big Data. For instance, electronic health records have strict laws governing what data can be revealed, but this is not the case for all records used in Big Data. There is a public fear of inappropriate use of personal data, especially when it is combined with other data sources (Giri et al., 2014). Managing the privacy of the 'consumer' is both an ethical and a technical problem.

After the issues of data collection have been tackled, the next challenge is how to protect this data. Because the data might contain sensitive information, such as financial records, it must be distributed carefully without violating people's privacy (Giri et al., 2014).


2.6 Data Scientist

Data scientists solve business problems with advanced analytical approaches, using analytical tools to find patterns in data. For the most part, data scientists work with logistic regression, clustering methods, and classification methods to extract information from data (Mohanty et al., 2013). As data becomes cheaper and more widely available, the information gained from data analysis becomes more valuable. One of the most critical factors that comes with Big Data is the data scientist, who works with a vast amount of information in a rapidly changing environment where quick decision making is key (McAfee & Brynjolfsson, 2012).

Davenport and Patil (2012) state that a data scientist is "a high-ranking professional with the training and curiosity to make discoveries in the world of big data". Furthermore, a data scientist should know how to communicate with the entire organization and help managers make decisions using the inputs from Big Data (McAfee & Brynjolfsson, 2012). The relevance across different organizations and companies becomes clear: organizations see the need to use this information for their competitive advantage. To do so, however, they need to hire employees with the right knowledge or equip their current employees with the right training and tools to perform these complex analyses (Ismail & Abidin, 2009).

Figure 1: All the skills a data scientist should have, showing the complexity of the data scientist's skillset (Van der Aalst, 2016).

The data scientist uses different tools and technologies to analyze data sets. Some technologies that make Big Data possible are Hadoop (a widely used framework for distributed storage and processing of large data sets), tools that relate to Hadoop, cloud computing, and data visualization (Davenport & Patil, 2012). A data scientist should have skills in data mining, data modelling, data visualization, and machine learning (Shum et al., 2013), applied with the analytical tools mentioned above. Likewise, Ayankoya et al. (2014) state that a data scientist uses advanced analytical techniques, such as data modelling, data visualization, and predictive analytics, to see trends, anticipate what will happen in the future, and advise organizations on the basis of this newly found knowledge. Ismail & Abidin (2009) state that the two most relevant skills for an organization are programming and performing statistical analyses, again carried out with tools such as Hadoop.

Mohanty et al. (2013) state that: “Data scientists combine advanced statistical and mathematical knowledge along with business knowledge to contextualize big data. They work closely with business managers, process owners, and IT departments to derive insights that lead to more strategic decisions.” Furthermore, data scientists not only need to know how to program but should know multiple languages, such as Python, R, Java, Ruby, Clojure, Matlab, Pig or SQL. Besides programming skills, they need to understand Big Data tools, like Hadoop, Hive and Map-Reduce. Mohanty et al. (2013) go even further and state that data scientists also have to know about natural language processing, machine learning, conceptual modeling, statistical analytics and hypothesis testing.

The need for data scientists is clear, as stated in Wixom et al. (2014): "We need more data scientists with skills in statistics, modeling, that understand structured and unstructured data." Data scientists need integrated skillsets ranging from mathematics, machine learning, artificial intelligence, statistics, databases, and optimization, all the way to a deep understanding of the problems with Big Data and how to build effective solutions for these problems (Dhar, 2013).

The use of Big Data in organizations strongly depends on hiring data scientists. The challenges for hiring managers are to determine which skills have the best fit and to make the vacancy attractive to outstanding data scientists. Because of the different skill sets required, hiring data scientists is not straightforward (Davenport & Patil, 2012). These skills are for the most part known, but are not yet widely taught at university. Universities are working to get programs ready that address the gap and provide students with a more focused set of core skills, like computer science, statistics, causal modeling, problem isomorphs, formulation, and computational thinking (Dhar, 2013).


3. Method

The goal of this master thesis is to get a clear understanding of the relationship between job titles and the skills required by organizations. In the literature review, Big Data was described to give a better understanding of the current playing field. Subsequently, online data representing market developments and human resources was collected and analyzed. The next paragraphs explain this process.

3.1 Data collection and sample size

The data collected for this thesis is based on the content of online job vacancies. Data collection was done by hand rather than with algorithms that match text strings, as manual collection is more effective for small or moderate text strings than for larger texts. The study of Fletcher & Kasturi (1988) shows that "an optimal size of sequence length can be found in a region that is not very large." To compensate for this limitation, skills were selected by hand. The appendix contains two examples of vacancies used for this analysis, in which the hard skills, soft skills, years of experience and the job title are highlighted.

The Fortune 500 is a frequently used source among business scholars (Zhu, 2000). The selected job vacancies therefore came from the first 40 companies of the 2017 Fortune 500, in order to get a good representation of the workforce of relevant industries such as health care, technology and automotive. From each of these 40 companies the first five vacancies that appeared were selected, as these showed the closest relation to the search. All vacancies were gathered from the companies' own careers websites and no third-party websites were used, to make sure the information in the vacancy was not polluted by a third party. Different text strings were used to find vacancies involving Big Data. The research of Russom (2011) gives a clear understanding of jobs within the Big Data field; the following job titles were mentioned in his article: Business analyst, Director or manager of analytics, Data architect, Engineer, Data analyst and Big Data Analyst. Of the 200 vacancies targeted, 179 were gathered for analysis; the number of 200 was not met because some companies did not have five vacancies matching any of the above-mentioned search words.

This data was administered in Excel, where four main variables were collected. Job Title is the position the vacancy is offered for (for example Data Scientist), as long as it was in the top five results of our search. Years of Experience is the number of years required to get the job. Hard Skills are teachable abilities, in the case of this thesis: Hadoop, Python, R and more. Soft Skills are skills that are not directly measurable, like interpersonal skills, social skills and communication skills. Soft skills were added in line with the arguments of Ahmed et al. (2012), who suggest adding them to the education program for software development: "The production of any piece of software involves a human element, requiring activities such as Communication skills, Interpersonal skills, and Team player". Soft skills are important for future success in the working world, and it is important to teach them as they are valued the same as technical skills (Carter, 2011).

The text analysis for soft and hard skills had the following procedure:

- An understanding of the requirements for hard skills was created using the article of Shirani (2016). Together with the article of Fox et al. (2015), a list of hard skills to search for in the vacancies was made. These hard skills include, but are not limited to: Hadoop, SQL, NoSQL, Apache, Spark, Python, R, Shark and Hive. The ten hard skills mentioned most in this research are:

NR. Hard Skill            NR. Hard Skill
1   Python                6   C.org
2   Spark                 7   Hadoop
3   Hive                  8   Tableau
4   R.org                 9   SAS
5   Java                  10  SQL

- The same process was followed for the soft skills. A list of soft skills to consider when analyzing the vacancies was created using the articles of Zaharim et al. (2012) and Carter (2011); some of the soft skills in these articles are leadership, work ethic, team player and communication skills. The ten soft skills mentioned most in the vacancies are listed below:


NR. Soft Skill             NR. Soft Skill
1   Agile                  6   Managing Skills
2   Written skills         7   Team Building
3   Communication skills   8   Verbal skills
4   Collaboration skills   9   Reporting
5   Interpersonal skills   10  Time Management

The administered skills had to be quantified for the analysis. For education this was done by the number of years it takes to obtain a bachelor's (4 years), master's (5 years) or PhD (9 years) degree. This was a more difficult process for the soft and hard skills; in this case we decided to value them by the number of times they were mentioned across the different vacancies.

If R-studio was mentioned 10 times and Python 15 times, the overall score of a vacancy with Python would be higher than that of one with R-studio. In attachment one you can find the top 10 rankings for both soft and hard skills. Each skill was then replaced with its score using a macro; see appendix one for the macro used to do this.
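The frequency-based scoring described above can be sketched in a few lines. The thesis implemented it as an Excel macro; the Python below is only an illustrative re-implementation, and the three mini-vacancies are invented for the example.

```python
from collections import Counter

# Invented mini-corpus; the thesis hand-coded 179 real vacancies in Excel.
vacancies = [
    {"title": "Data Scientist", "hard": ["Python", "Spark", "Hive"]},
    {"title": "Data Analyst",   "hard": ["SQL", "Python"]},
    {"title": "Engineer",       "hard": ["Java", "Python", "SQL"]},
]

# Education is quantified as years to the degree, as in the thesis.
EDUCATION_YEARS = {"bachelor": 4, "master": 5, "phd": 9}

# Step 1: count how often each hard skill is mentioned across all vacancies.
mention_counts = Counter(s for v in vacancies for s in v["hard"])

# Step 2: a vacancy's hard-skill score is the sum of the mention counts of
# its skills (this is the substitution the Excel macro performed).
def hard_skill_score(vacancy):
    return sum(mention_counts[s] for s in vacancy["hard"])

scores = {v["title"]: hard_skill_score(v) for v in vacancies}
print(mention_counts["Python"], scores)
```

A skill that appears in many vacancies (Python here) lifts the score of every vacancy that lists it, which is exactly why widely demanded skills dominate the quantified input of the cluster analysis.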

3.3 Results

The world of Big Data is complex and growing, and the interpretation of Big Data is diverse with respect to the skillsets organizations need. The difficulty within the dataset lies in the different needs of data-driven jobs, which causes problems with heterogeneity. Accordingly, the collected data was divided into subgroups to create homogeneous groups and to answer the research question for specific subgroups.

First, a job title can mean many different things, resulting in different required capacities within similar job descriptions. Furthermore, fields such as Computer Science, Analytics, Data Management, Art Design and Entrepreneurship influence the skillset required of, for example, a data scientist. These structural differences cause the differences in the skillset needed and in where a future employee should look.


3.3.1 K-means cluster analysis

For this thesis, cluster analysis is the most effective way to find homogeneous groups of objects, ideally a natural composition of elements (Sekaran et al., 2016). The K-means cluster analysis is a popular tool used by several researchers (Littmann et al., 2000; Okazaki, 2006) to identify coherent groups. Wagstaff et al. (2001) state that "K-means is another popular clustering algorithm that has been used in a variety of domains." Moldovan & Mutu (2015) used this method to see whether there is a relationship between corporate governance and the performance of an organization, and data mining methods also use K-means clustering. The K-means cluster analysis is furthermore a popular tool to identify how different skills may overlap and to assess the implications for the job skills within a specific function. Cluster analysis will answer the main question as well as the sub-questions. In order to assess the distinctive patterns in the skills needed for different job titles, a cluster analysis was carried out. This gives an impression of the extent to which different skills are incorporated in corporate vacancies and whether this can be aligned with the Big Data efforts communicated by the organization. In theory, a cluster analysis places vacancies with similar skills in a single group, showing that each job title has its specific set of skills. After the cluster analysis, the vacancies in a cluster are homogeneous with the vacancies in the same cluster and different from the vacancies in other clusters.

In this thesis the K-means cluster analysis was chosen. A hierarchical cluster analysis, which combines individual observations step by step until one cluster is left, was not chosen because we are not interested in this dispersion of groups; the focus is rather on finding the best fit for a given number of clusters, which is exactly what K-means does. The K-means algorithm iterates over the data set, assigning each unit to the cluster whose centroid best matches its values.

K-means requires a specific number of clusters. One method for choosing it is trial and error, seeing which result gives the best clustering. For this thesis, we used a scree plot to help select the correct number of clusters. The scree plot is a popular method based on the scree test (Cattell, 1966) and is used to determine how many clusters should be used for further analysis: "The scree is the break between the steep slope and a leveling off indicates the number of meaningful factors, different from random error" (D'Agostino et al., 2005). The calculation and plot are shown in figure 3.1. If the total within-cluster sum of squares (WSS) is small, the objects within each cluster are close together, which is good; however, the WSS can always be made smaller by taking a larger number of clusters, which is bad. To select the correct number of clusters, you therefore look for an elbow in the scree plot. Below are the two pieces of R-studio code used to select the correct number of clusters.

Figure 3.1: Building a WSS plot with 20 clusters

Figure 3.2: Building a cluster in detail with 10 clusters

As shown in figures 3.1 and 3.2, the elbow lies after 4 clusters. After selecting four clusters from the scree plot, a K-means analysis can be made; the initial results of the K-means analysis are given below.
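The thesis produced the WSS scree plot and the final clustering with R's kmeans. The sketch below re-creates the same elbow logic in plain Python, on an invented set of (soft-skill, hard-skill) score points, so the mechanics are visible; it is not the thesis code or data.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=25, seed=42):
    """Plain k-means; returns (centroids, assignments, within-cluster SS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each vacancy joins its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    wss = sum(dist2(p, centroids[a]) for p, a in zip(points, assign))
    return centroids, assign, wss

# Two clearly separated artificial groups of (soft-skill, hard-skill) scores.
points = [(5, 8), (6, 9), (5, 10), (29, 48), (30, 47), (28, 46)]

# Scree inspection: total WSS for k = 1..4.
wss_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
```

Plotting `wss_by_k` against k reproduces the scree plot: the curve falls steeply up to the true number of groups (two in this toy data) and then levels off, which is the elbow the thesis reads off at four clusters in its real data.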

The cluster indication in the appendix allows each registered vacancy to be allocated to a cluster. Furthermore, the appendix contains the K-means withinss output, which shows the within-cluster sum of squares for every cluster. In total there are 159 values, the same as shown in the scree plot. When a cluster has a small within-cluster sum of squares, all objects in that cluster are close together; when it has a large within-cluster sum of squares, the cluster is dispersed. The withinss values of the clusters, from close together to more spread out, are: [4] 28.88994, [3] 33.25931, [2] 47.85336, [1] 48.92235. The next important aspect of the K-means output is the size of each cluster, which from small to large is: [1] 13, [4] 40, [3] 62, [2] 63. Size is the total number of vacancies in the cluster. Cluster 2 has the largest size, which also corresponds with its relatively large within-cluster sum of squares.

Table 3.1: Average scores per cluster and overall

Cluster  Years of study  Experience  Soft skill  Hard skill
1        4.8             6.9         29.0        47.6
2        4.6             5.7         10.4        8.3
3        4.8             5.3         4.7         41.2
4        5.5             5.2         6.1         82.7
Total    4.9             5.5         8.8         39.3

Table 3.1 above shows the averages of years of study, experience, soft skills and hard skills. These numbers give an overview of the vacancies in the clusters, providing a quick view of what each cluster requires on average. From the numbers we can quickly see that cluster 4 requires the most years of study (5.5) and the largest set of hard skills (82.7) for the job.


3.6 Vacancies per job title in each cluster

Cluster          1    2    3    4
Big Data         1    1    5    3
Data Analyst     1    32   4    4
Data Architect   0    1    7    6
Data Scientist   2    3    5    16
Developer        0    0    2    1
Director         0    6    2    0
Engineer         5    9    24   5
Senior           4    11   13   5

Table 3.6 shows the number of vacancies per job title in each cluster. It means, for example, that 1 vacancy for a "Data Analyst" is in cluster [1], 32 in the second cluster [2], and 4 each in the third [3] and fourth [4] clusters. This overview gives a clear understanding of what companies need in terms of skillset: of the 179 administered vacancies, 41 are looking for a "Data Analyst" and 32 of these belong to the same cluster, telling us that, independent of the organization or industry, organizations are looking for the same thing.
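Table 3.6 is essentially a cross-tabulation of job title against cluster label. Assuming one coded (title, cluster) pair per vacancy, it can be reproduced in a few lines; the pairs below are invented for illustration, not the thesis data.

```python
from collections import Counter

# Invented (job title, cluster) pairs; the thesis coded 179 real vacancies.
labels = [("Data Analyst", 2), ("Data Analyst", 2), ("Data Scientist", 4),
          ("Engineer", 3), ("Engineer", 3), ("Data Analyst", 1)]

# Counting the pairs directly gives the cell values of the crosstab.
crosstab = Counter(labels)

# Print one row per job title, one column per cluster (1..4), as in table 3.6.
for title in sorted({t for t, _ in labels}):
    row = [crosstab[(title, k)] for k in range(1, 5)]
    print(f"{title:15s}", row)
```

Row sums then give the total vacancies per job title (the 41 "Data Analyst" vacancies in the real table), and the largest cell in a row shows the cluster that title concentrates in.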

3.7 Cluster Hardskills and Softskills

Figure 3.3

The clusters in this plot are quite distinct; although there is still some overlap, this is a good result. The required number of hard skills in the vacancy is highest for the blue cluster, followed by the green and black clusters, and lowest for the red cluster. Furthermore, the soft skills are dispersed similarly across the clusters; only the black cluster has more soft-skill requirements in the vacancy than the other clusters.

3.8 Cluster Hardskills and Experience

Figure 3.4

This plot shows the years of experience combined with the hard skills. Again there is clear clustering for the red, green and blue clusters; however, the black cluster is dispersed over the green and, to a small extent, the blue cluster. The reason for this is that these vacancies represent the seniors or managers, whose backgrounds and skills are more diverse.

3.9 Cluster Hardskills and Years of Education

The last visualization of the clusters shows the years of education and the hard skills. First, we would like to explain the gap between 5 and 9 years: when administering the data we defined three levels of education, a bachelor's which takes 4 years, a master's which takes 5 years and a PhD which takes 9 years to obtain. Again, the clustering is very clear, which indicates that the number of clusters selected with the scree plot is correct.

3.3.2 Linear regression with ANOVA

The one-way Analysis of Variance (ANOVA) was chosen to test whether the effects of "years of experience", "soft skills" and "education" on "hard skills" are significant. The ANOVA was the best-fitting analysis for this thesis because the data consists of multiple independent categorical variables. Linear regression with ANOVA is used to predict the response that "years of experience", "soft skills" and "hard skills" have on the job title. In this thesis we chose a single regression model, in which we looked at the impact of one predictor variable at a time. The ANOVA was done in SPSS, a commonly used statistical tool.

The dataset, as shown above, consists of a data frame with 179 rows and 4 columns. Each row is an observation, in this case a job title in a vacancy. The columns are the predictors, such as "experience", "soft skills" and "hard skills"; these three predictor variables are responsible for the outcome. In order to carry out the one-way ANOVA, three hypotheses were created:

H1: When more "experience" is required, the required "hard skills" will be higher.
H2: "Soft skills" are higher when more "hard skills" are required.
H3: "Hard skills" are higher when more "years of study" are required.


H1 predicts that when the required "experience" is high, the required "hard skills" are high as well. The results of the one-way ANOVA are F = 1.233; p = 0.163. Since p > 0.01, H1 is not significant, so "experience" has no significant effect on "hard skills". The one-way ANOVA for H2, the effect of "soft skills" on "hard skills", shows F = 1.273; p = 0.129; again p > 0.01, so H2 is not significant. The one-way ANOVA for H3 results in F = 1.763; p = 0.004, where p < 0.01, and is therefore significant. This means that "years of study" has an effect on "hard skills".
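The thesis ran the ANOVAs in SPSS, but the F statistic itself is simple to compute by hand. The sketch below does so for three invented groups of hard-skill scores, grouped by a hypothetical categorical predictor, to show where the reported F values come from.

```python
# Invented hard-skill scores grouped by a categorical predictor
# (e.g. required experience level); not the thesis data.
groups = {
    "junior": [5, 7, 6, 8],
    "medior": [9, 11, 10, 12],
    "senior": [15, 14, 16, 13],
}

samples = [x for g in groups.values() for x in g]
grand_mean = sum(samples) / len(samples)
k, n = len(groups), len(samples)

# Between-group sum of squares: variation explained by the predictor.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-group sum of squares: residual variation inside each group.
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

# F = explained variance per df_between over residual variance per df_within.
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(round(f_stat, 2))  # → 38.4 for these made-up groups
```

A large F, compared against the F distribution with k−1 and n−k degrees of freedom, yields a small p value; with the thesis's threshold of p < 0.01, only H3 cleared it.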

3.4 Cluster description

As presented in the output in Input 3.2, it is clear that there are four clusters on which the K-means should be calculated. The results of the K-means cluster analysis are shown in Input 3.3, where each number corresponds to a line in the data set and the number in the table is the cluster that vacancy is in. The overview in Input 3.6 shows the job titles in the vacancies and the total number of vacancies related to each job.

Cluster 1: the manager. This group, which might be considered the consultants, has only 13 observations, less than 10% of the total vacancies registered; 5 engineers and 4 seniors make up the largest part of the group. Cluster 1 has the highest dispersion of the four clusters, as seen in Input 3.4, which means that the data points of this cluster are farthest apart and least correlated with each other. This group requires an average amount of education compared to the total sample (4.8 for cluster 1 versus 4.9 for the total sample), a little more experience (6.9 versus 5.5), by far the largest set of soft skills (29 versus 8.8) and a little above-average hard skills (47.6 versus 39.3). In the visualization of the clusters in Figure 3.3, cluster 1 is black: a cluster with a relatively high score for soft skills and an average score for hard skills.

Cluster 2: the data analyst. This cluster consists of 63 vacancies, as shown in Input 3.6, and for the largest part they are Data Analysts (32); together with seniors (11) and Engineers (9), these three titles represent 52 of the vacancies. With a withinss of 47.85336, the second largest of the group, the dispersion is still large, indicating that the vacancies in this cluster lie quite far apart. The second cluster will be further referred to as the analyst. It does not differ much from the total sample in years of study, experience and soft skills, but its hard skills do differ a lot (8.3 versus 39.3 for the total sample). In the visualization plots this cluster is red, as shown in Figure 3.3. For the most part this cluster sits in the lower ranges of hard and soft skills, with some outliers that have a higher set of soft skills. Figure 3.4, which shows experience against hard skills, adds no clear description; the clustering clearly happens on the basis of hard skills.

Cluster 3: the engineer. This cluster is the second largest of the group with 62 vacancies and is for the largest part represented by engineers. It has the second-smallest withinss, 33.25931, and shows a tighter group than the previous two clusters. The third cluster is also the most average among them: years of study, experience and hard skills are all around the average, while the soft skills are below average. The three biggest groups representing the third cluster are Engineers (24), seniors (13) and Data Scientists. This is the green cluster in the visualization, with an average set of hard skills and the lowest set of soft skills, as Figure 3.4 makes clearer.

Cluster 4: on average, a data scientist. The fourth cluster has a total of 40 vacancies, with Data Scientists (16), Data Architects (7) and Engineers (5). It requires the highest education (5.5 versus 4.9 for the total sample) and by far the largest set of hard skills. Its withinss of 28.88994 indicates that the data points in this group are closest together. In Figure 3.3 this is the blue cluster, which clearly scores highest on hard skills. Education alone in Figure 3.4 says little; however, when looking at the years of education against the level of hard skills there is a trend: as the vacancies' need for hard skills grows, so does the request for applicants to possess a PhD.

4. Discussion, Conclusion and Further Research

4.1 Discussion and Conclusion

This research gives an important insight into the field of “Big Data” and the skills it requires. Previous research focused only on hard skills or only on soft skills, and not on the combination of both in the market. In this thesis, soft skills, hard skills, education and years of experience have been used to evaluate the recruitment needs of companies, in order to group similar vacancies together and further analyze the requirements. We have therefore answered the following research question: to what extent is Big Data represented in the recruitment needs of companies? The analysis has shown that Big Data vacancies demand soft skills as well, and that these are as important as hard skills, filling the gap left by research that currently focuses on hard skills. Moreover, it showed that a “Data Analyst” is, in terms of skill set, not at all comparable to a “Data Scientist”. In answering the question we looked at four variables. The first was years of education and its relation to the job title: the vacancies asking for the highest-educated people were those for Data Scientists. The second variable was years of experience; overall this was quite similar across groups and different than expected, as managers and seniors did not differ much in required work experience. The third variable was soft skills; this one also did not have a very large impact on the groups, as a base of soft skills is needed in every job. The fourth variable was hard skills, on which the Data Scientist vacancies again scored highest.

The second question we wanted to answer was: “What segmentation in Big Data related job vacancies can be made using hard skills, soft skills, experience and years of education?” A scree plot was made to determine the number of clusters, and four segments were selected. Subsequently a K-means cluster analysis was performed. Cluster 1, “the Manager”, had a high level of soft skills and the most experience required, indicating a more managerial vacancy. Cluster 2, “the Data Analyst”, has the lowest hard skills and the second-highest soft skills, indicating that communicating the message is more important than understanding the data. Cluster 3, “the Engineer”, has the lowest set of soft skills, a master’s degree and more than average hard skills, indicating that building the system is more important than explaining it. Cluster 4, “the Data Scientist”, requires the highest average education and the highest level of hard skills; these are the people who do the “hard-core” programming. The results show that three clear clusters could be identified using information from vacancies. With this information a broader approach to selection criteria could be established. For example, four roughly identical vacancies may be posted online under different titles, which keeps the reach among prospective applicants small; using the same vacancy title allows more people to find and reply to them. The first cluster, “the Manager”, has a large dispersion, meaning these vacancies lie far apart within the cluster. When such vacancies are constructed, great attention should therefore be paid to which skills a company requests, in order not to attract the wrong candidates for the job.
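The scree-plot step that fixed the number of clusters at four works by running k-means for a range of k and watching where the total withinss stops dropping sharply (the "elbow"). A minimal one-dimensional sketch in plain Python; the toy scores stand in for the vacancy skill counts and are not the thesis data, which was clustered in R-studio.

```python
def total_withinss(xs, k, iters=50):
    """Total within-cluster sum of squares for a simple 1-D k-means."""
    xs_sorted = sorted(xs)
    # initialise centres at evenly spaced sample points
    centers = [xs_sorted[i * (len(xs) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: (x - centers[c]) ** 2) for x in xs]
        for i in range(k):
            members = [x for x, a in zip(xs, assign) if a == i]
            if members:
                centers[i] = sum(members) / len(members)
    # final assignment, then sum the squared distances to the fitted centres
    assign = [min(range(k), key=lambda c: (x - centers[c]) ** 2) for x in xs]
    return sum((x - centers[a]) ** 2 for x, a in zip(xs, assign))

# "scree": total withinss for k = 1..4; a sharp drop followed by a flat
# tail marks the elbow, i.e. the number of clusters worth keeping
scores = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
scree = {k: total_withinss(scores, k) for k in range(1, 5)}
```

On these toy scores the total withinss collapses between k = 2 and k = 3 and then flattens, so the scree criterion would pick three clusters; the thesis applied the same reasoning to its 179 vacancies and arrived at four.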

Based on this analysis we can conclude that the requirements in a vacancy are tied to a specific job title, such as Data Analyst, Data Scientist or Engineer. The skill set needed for these employees is clear, and the same skill set is requested across different organizations. However, this does not hold for all vacancies: cluster one is the smallest and has no single clear job title. Its vacancies are scattered around the other three clusters, telling us that these are the managers belonging to the different specializations, each with a unique skill set within the cluster. Furthermore, because the withinss is quite high for all clusters, the recruitment needs might shift towards the future. This means the clusters could change over time; for example, new skills such as an understanding of the law could be added.

Furthermore, we can conclude that although the companies differ in industry (this data set contains industries ranging from computing and automotive all the way to healthcare), these differences do not influence the need for a Data Analyst, Data Scientist or Engineer, nor the skill set attached to these vacancies. The companies may differ, but the methods used to analyze the data do not. Moreover, the R-studio results show that although the clusters are strongly separated, some vacancies, such as a “Data Scientist” falling in the “Data Engineer” cluster, suggest that a company has the wrong title on its vacancy and is therefore not finding the correct fit for it.

The relevance and impact of these four variables was tested via a one-way ANOVA in SPSS. The clusters formed the basis for the description of education, employee talent and skills, and these three features helped us answer what combination of skills is required per job. In total 179 vacancies were put into four clusters. To answer the question “Is the perceived importance of hard skills affected by soft skills, years of education and experience?” (for example, do the soft skills mentioned in a vacancy affect the number of hard skills mentioned in it?), an ANOVA was performed and the means were compared. Of the three predictors analyzed, only years of education supported the hypothesis, meaning that with 95% confidence a higher level of education in a vacancy goes together with more mentioned hard skills. We cannot conclude this for soft skills and years of experience: when more soft skills or more experience are required, this has no significant effect on the mentioned hard skills.
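The one-way ANOVA run in SPSS boils down to an F statistic that compares the variance between group means with the variance within groups. A plain-Python sketch of that statistic; the example groups are illustrative (e.g. hard-skill counts grouped by education level), not the thesis data.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    n = sum(len(g) for g in groups)   # total observations
    k = len(groups)                   # number of groups
    grand_mean = sum(x for g in groups for x in g) / n
    # between-group variability: how far the group means sit from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group variability: spread of observations around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    # mean squares use k - 1 and n - k degrees of freedom respectively
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F relative to the F distribution with k − 1 and n − k degrees of freedom corresponds to the significant effect found here for years of education; for soft skills and experience the F value was too small to reject the null hypothesis.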

4.2 Further research

There is no current research into how the market is adapting to the different skill sets needed, assessed via vacancies. Since this is the first time this research has been done, there are several methodological limitations. The first concerns the data gathered: even though the data set is large enough, it was all collected by hand by one person, which is prone to bias. In future research it would be wise to have a larger data set gathered by more than one person, or even to use data-scraping methods, so that there is no human interaction or bias in the selection of the data. However, single-person collection may also be an advantage: if more than one person collects information there is a chance of gathering more irrelevant information, whereas with one person the process more probably remains uniform and thus more valid.

Another limitation of this study is that, although we did collect samples from the top 40 Fortune 500 companies, we could not check the weight each company assigns to individual skills. By this we mean: a vacancy may list several soft skills, such as communicative, coaching, verbal skills and team player, and each of these might be valued as more or less important per organization and position. The same holds for the hard skills: where R-studio, Python and Power BI may all be mentioned in a vacancy, Python might be seen as the most valuable. Furthermore, the added value of combining soft and hard skills has not been taken into consideration when evaluating the skills; for example, a company might want someone with excellent communication and Python skills who does not need to be a coach. These combinations were not given additional weight in this thesis, because of the amount of time it would take to contact all the companies and ask them to evaluate the most important aspects of their vacancies.

For further research we would suggest not only looking at the skills currently required, but combining this with upcoming regulation, as data regulation becomes increasingly important to organisations, authorities and individuals.
