
Team performance classification based on team members’ personal traits

C.A. Miulescu, MSc

11434805

SUPERVISED BY

dr. S. Rudinac

Dissertation submitted in partial fulfilment for the degree of

Master of Business Administration in Big Data & Business Analytics


Abstract

Nowadays, companies across major industry segments work in collaborative environments, relying on teams to achieve their targets. Since the teamwork concept was introduced in organizational settings in the 1920s, it has been shown that groups of experts handle complex tasks better when they work together. As teams are indispensable, it is fair to say that cross-functional teams are the foundation of today’s professional environment. But a few questions remain unanswered: “What is the optimal team composition that leads to high performance?” and “Can project managers predict the performance of their teams using data analytics?”

Research is scattered when it comes to explaining the factors that influence team performance. From team dynamics to individual performance and a supportive environment, many factors contribute to team effectiveness. Would companies benefit from gaining knowledge on how to design high-performance teams using research and Artificial Intelligence technology?

In practice, we see teams being designed based on employees’ roles and expertise, with nearly no consideration for complementary personal traits. As a matter of fact, team dynamics are usually ignored when creating new teams, adjusting team structures or hiring candidates into existing teams. Yet the components of team dynamics - communication, coordination, cooperation and interdependence - are defining for a team’s ability to deliver a quality product or service in a timely and efficient manner. The present thesis assesses the feasibility of implementing a machine learning model that predicts team performance by analysing team members’ personal traits. To limit cross-domain biases, we focus this investigation on project teams of a specific software company active in the Technology field. The employees’ personal traits are defined by personality factors and behavioural traits, retrieved through the company’s self-development tool, Clifton StrengthsFinder. Thus, the machine learning model is trained and evaluated on the sample data collected from this company. In spite of the limited dataset, the outcome of the investigation shows promise for using machine learning to establish the relationship between personal traits and team performance. This result translates into a wake-up call for companies to support and facilitate extended applied research in order to develop a machine learning tool capable of designing high-performance teams. In a similar setting, we analysed a complementary perspective to evaluate the relationship between individual performance and team performance. This experiment uses a public academic dataset from the Computer Science domain. With this dataset, we assess how our machine learning model performs when predicting the citation level of scientific articles (team performance) based on researchers’ bibliometrics (individual performance).


Table of Contents

1 Introduction
2 Literature Review
   2.1 Team Performance
   2.2 Psychological Factors
   2.3 Technical Skills
   2.4 Group and Environmental Factors
   2.5 Aggregation Method
   2.6 Discussions
3 Data Extraction
   3.1 Methodology
   3.2 Business Understanding
   3.3 Data Collection
   3.4 Data Description
   3.5 Data Quality
4 Data Preparation
   4.1 Experimental setup
   4.2 Operationalization Method
   4.3 Feature Engineering
   4.4 Feature Importance
   4.5 Technology Stack
5 Software Teams Experiment
   5.1 Training Technique
   5.2 Evaluation Technique
   5.3 Analysis and Modeling
      5.3.1 Logistic Regression Model
      5.3.2 Support Vector Machine Model
      5.3.3 Random Forest Model
   5.4 Results and Implications
   5.5 Synthetic Data Generator
   5.6 Discussions
6 Research Teams Experiment
   6.1 Data Preparation
   6.2 Analysis and Modeling
      6.2.1 Logistic Regression Model
      6.2.2 Support Vector Machine Model
      6.2.3 Random Forest Model
   6.3 Results and Implications


1 Introduction

A balanced team composition plays an important role in today’s professional environment, with benefits for both companies and employees. Research studies have identified team composition as a key factor influencing team performance. By definition, team composition encapsulates all team members’ attributes: hard skills (technical skills) such as knowledge, skills, and abilities (KSAs), and soft skills (psychological skills) like personality, strengths and values. At the individual level, technical skills are easier to measure and often domain specific, while psychological skills are intangible and more difficult to quantify. We often see companies evaluating employees’ technical skills when forming teams or hiring, while paying less attention to the technical-psychological skills balance within a group. Organisations use team-building practices to improve team composition, under the correct assumption that productive outcomes occur when a team is mature, in the later stages of its lifetime. However, research found non-significant effects of such activities on team performance [1]. Despite the fact that teams with a balance of technical and psychological skills generate a positive economic impact, organisations often overlook the importance of team composition and run into significantly increased cost-per-delivery and a lack of efficiency.

“The team is now the norm at work”, says Arnold Bakker, professor of Work and Organizational Psychology, Rotterdam [1]. Teams are defined as interdependent groups, where individuals share responsibilities and work towards a common goal. They have autonomy in decision-making and rely on communication, coordination and cooperation to achieve their tasks. In practice, many companies apply a pseudo-teamwork approach, where team leaders have full control over the team processes. With this approach they lose many of the benefits of teamwork, like employees’ sense of ownership, commitment or learning. In professional environments, teamwork administration falls under operations and project management, with support from HR. Their main challenge is making use of existing information to define team structures and team processes. They often face problems like decentralised or incomplete data, difficulty in measuring team and individual performance, and missing records of employees’ psychological skills. Although many large organisations have access to internal statistical data to increase the efficiency of their collaborative workspace, few are concerned with optimizing team composition based on technical and psychological skills, and thereby improving their output quality.

In software companies, we see that changes in team structure have a direct effect on the team’s productivity. This pattern is noticed in many other industry segments where employees work interdependently to deliver an end product. According to Bruce Tuckman, the main reasons for this trend are the team development stages, as explained in his research from 1965. Teams tend to underperform until team members become familiar with each other (Forming), establish responsibilities (Storming), and solve internal conflicts and create specific group norms (Norming). In line with Tuckman’s theory, teams become fully productive when open communication patterns are established and they reach a strong team structure. Thus, a machine learning application designed to create congruent teams that reach the productivity stages faster could help project managers and HR make informed decisions when adjusting internal teams or hiring external candidates, without breaking the team’s composition balance.


Despite intensive research, finding team datasets with a reasonable number of records for machine learning algorithms proved to be a burdensome task. Research groups face similar issues when investigating team composition in company settings instead of laboratory setups. Surprisingly little is known in actual work environments about the impact of team composition on team performance. Companies seem to overlook this opportunity to optimise their organisational effectiveness. This thesis addresses the potential of developing such a predictive algorithm and provides insights into the implementation process. Prediction of team performance based on personal traits could be accelerated with companies’ support, having the potential to disrupt existing HR and recruiting practices. Consolidating the above-mentioned industry challenges, the main research question arises: Can a machine learning algorithm learn how to predict team performance based on members’ personal traits? This thesis tries to capture knowledge about how employees can efficiently be assigned to teams, considering the way they naturally think, feel and act. The hypothesis is that personal traits influence team performance, in contrast with teams created based on team members’ availability. According to research in the field, there are many enabling conditions for a team to perform well, such as a compelling direction, a supportive context, a strong team structure and a shared mindset. We investigate the team structure and shared mindset, in terms of complementary (psychological) skills, diversity in perspectives, responsibility and team interoperability. Thus, the following sub-questions help answer the main research question:

- Which psychological factors have a high impact on team performance and are worth considering in this study?

- How can team metrics be calculated using scores measured at the individual level?

- What data is available at company level to help assess team composition?

- What evaluation techniques are appropriate to measure the output of this machine learning model?

It is important to state that the thesis analyses software teams, mitigating the influence of roles and of the nature of tasks on our study. Also, under the assumption that a balanced team composition leads to higher team performance, the experiment focuses on team dynamics rather than individual performance.


2 Literature Review

This section introduces the team performance concept and describes the key factors (psychological and technical skills) that influence team performance. Also, it covers the major group and environmental factors which can affect individual and team measurements. Finally, we highlight the most common methods to calculate team-level metrics out of individual-level measurements used in team applications.

2.1 Team Performance

At its most basic level, team performance is the extent to which a team achieves its tasks. Recent studies (K. P. Palacios, 2015) show that a team can be modeled as a simple input-process-output (IPO) model. In this context, the team composition is the input, which processes tasks and produces measurable results as output. But the measure of team output alone does not reveal teamwork efficiency. For example, a software team could deliver a quality software product covering the agreed scope and time frame, but this output does not tell us anything about the communication within the team, the collaboration between team members, or whether the environment encourages personal development and creativity. Thus, when evaluating team performance it is important to clarify whether the measurement is meant to describe within-team interactions or team output measures.

Performance is often evaluated by team leaders overseeing teams and/or being responsible for them. This approach covers two important dimensions: team output and team efficiency. However, because leaders are not directly involved in every team process, they cannot capture collective metrics like team satisfaction or members’ attitude towards their work environment. A study conducted by (K. P. Palacios, 2015) proposes a more complete framework that leaders and managers should use when assessing their teams’ performance:

- Customer satisfaction

- Service / product quality

- Successfully covered the target scope

- Successfully delivered within time and budget

- Team’s external recognition by managers / other teams

- Perceived team’s responsiveness to new requests

- Perceived team’s productivity

This model was demonstrated to be effective and in line with psychometric theory, yet it has limitations such as cultural differences, specific group characteristics and industry particularities [12]. In general, the measurement of team performance is bound to domain-specific frameworks. Also, the measurement scale depends on the evaluation purpose, such as team feedback, individual training or budgeting evaluations. In order to accurately create or compare teams, organisations need objective, standardized metrics that can be used to assess team performance beyond industry boundaries.


2.2 Psychological Factors

For more than half a century, team research has tried to determine the factors that influence team performance and the circumstances under which they occur. Below we summarise the key factors identified in meta-analysis studies [1][6], with different degrees of influence on team performance.

Development Stages

According to (Tuckman B., 1965), teams need time to reach their full potential. In his analysis, Tuckman defines five development stages, namely Forming, Storming, Norming, Performing and Disbanding. The development stages are not always sequential and not necessarily unidirectional, as they can overlap or slip back to previous stages. A team composed of members with complementary skills could shorten or even skip the first three stages, reducing the time needed for the team to become productive. Traditionally, this factor is controlled through performance monitoring and periodic team reviews.

Team Climate

Psychological safety is triggered by a healthy team climate, where members are encouraged to express their perspective, argue in a constructive way and admit mistakes. This aspect of teamwork boosts creativity and innovation within the group, and often leads to better solutions and group decisions. To monitor the team climate, “team climate inventory” surveys were developed, which measure the levels of group innovation.

Group Diversity

Group diversity influences both team processes and team output, due to members’ wider perspectives and prior experiences. Although teams of diverse ethnicity, age or personality have the potential to be more successful due to the varied perspectives they add to the group, they often fail to achieve this potential. (Kandola, 1995) explains that teams need integration, which is more difficult to achieve as interpersonal differences increase. Individual research studies have shown strong effects of diversity on team performance; however, meta-analyses indicated insignificant effects. This is caused by inconsistent definitions of group diversity and by experimental results in which the strong effects of some studies are offset by the weak effects of others (Bowers et al., 2003; Fay et al., 2006) [3][6]. Due to this inconclusive research, we decided to treat personality traits separately, as a diversity aspect in the teamwork environment.

Collective Intelligence

Collective intelligence is a factor emerging from group collaboration; a study conducted by A. Woolley has shown that collective intelligence explains a large part of the variance in team performance [7]. In contrast to the general intelligence factor (IQ), collective intelligence is strongly correlated with the average social sensitivity, the distribution of time spent per member in conversations, and gender diversity [8]. Studies in psychology found that women have higher social sensitivity than men; therefore teams that include women have a higher collective intelligence and hence a higher team performance. This certainly does not mean that women-only teams perform better, but social sensitivity and gender diversity do make a difference to team performance.


Cognitive Ability

Meta-analyses have shown that team members’ cognitive ability (IQ) is a weak predictor of team performance [4]. This comes as a surprise, knowing that intelligence predicts individual performance. Research by (Devine and Phillips, 2001) found that a stronger relationship between intelligence and team performance is present at the beginning of a team’s life cycle, decreasing over time. But the vast majority of research papers revealed that smarter teams do not necessarily perform better than others [6]; therefore we do not consider cognitive ability a potential predictor of team performance.

Personality Factors

More extensive research was conducted on personality factors, where meta-analyses [1][3][5][6] have shown that personality traits and teamwork values are important factors influencing team performance. The Big Five personality framework (or variations of it) is generally used to show the impact of these factors on team performance. Fig 2.1 depicts the magnitudes of the most representative composition variables within a team.

Fig 2.1 Personal trait magnitudes in a team profile

The Big Five personality factors are measures of openness to experience, conscientiousness, extraversion, agreeableness and neuroticism. Research recommends considering all traits simultaneously when testing their effects on performance, yet few are truly relevant for a successful team. In the evaluated studies, the personality factors at team level are calculated as mean scores across team members, with the exception of ‘agreeableness’, calculated as the team members’ minimum score. Recent studies show a general lack of industry team samples - the majority of the experiments are conducted in laboratory settings, meaning that the results may not fully reflect the reality of industry settings.


To understand the particularities of each trait and their influence on team performance, we cover them separately:

- Openness to experience - defines a creative, open-minded individual, seeking new experiences and challenges. In teams, these individuals are likely adaptable and can easily lead the change required in dynamic environments. Through a meta-analysis, (Peeters M. et al., 2006) found insignificant effects on team performance [3]. However, (Bell S., 2007) showed that small to medium effects were observed for professional teams [5].

- Conscientiousness - characterises honest and hardworking individuals. High levels of conscientiousness in teams create a supportive workspace, where individuals help each other and contribute more to the overall team outcome. All conducted studies have shown a strong relationship between conscientiousness and team performance [3][5][9].

- Extraversion - represents sociable, enthusiastic and optimistic individuals. They are always willing to work with others, yet too many extroverts can give rise to too many leaders. The analysis of this particular trait has shown scattered results, because of the different methods (mean, min, variance) used to calculate the measure of extraversion at team level. Reviewing team meta-analyses, we can conclude that extraversion and team performance have a positive relationship, in line with the expectation that extroverts increase interpersonal interaction within a team [5].

- Agreeableness - characterises helpful, friendly and tolerant individuals. In teams, these individuals tend to be very cooperative. However, a single disagreeable individual can disturb the team’s harmony and indirectly affect team performance. Research has shown strong positive effects of agreeableness on team performance.

- Neuroticism - is the personality trait defined by emotional instability, inability to manage stress, and negativity. At team level, high neuroticism could harm team cohesion. However, research found a negligible relationship between emotional stability and team performance [1].

In addition to the personality factors above, personal values of teamwork were also found to have a strong effect on team performance. Personal values of teamwork are unanimously defined as collectivism and preference for teamwork. These values shape how individuals think, feel and act in a team context [6]. In contrast to personality traits, which are relatively stable or improve gradually over time, values of teamwork can change significantly over an individual’s life span [10].

2.3 Technical Skills

Top performers are of great value to any organisation. They are highly skilled individuals, respected as problem-solvers, and they always stand out. But better individuals do not necessarily form better teams [18]. Too often, managers believe that top performers increase team performance and rely on this belief in their team selection. (Ronay R., 2014) showed that for interdependent teams, top talent improves team performance up to a point, after which the advantages of having skilled team members diminish and performance may even decrease. This effect is explained through hampered intra-team coordination, where top performers often want to be the best in their team [19]. Technically skilled members are vital, but technical skills alone do not turn individual performance into team performance.


2.4 Group and Environmental Factors

Team performance measurements can be biased if a variety of group and environmental factors are ignored. These factors require special treatment, such as including them as team moderators or clearly delimiting the research scope. The literature advises examining the following set of factors [9]:

Group-level moderators

- Team size - team dynamics work differently for a team of 2 people compared to one of 50. Thus, factors that influence team performance might have a diminished effect in larger groups.

- Team cohesiveness - this moderator captures intra-team communication, conflict resolution and workload distribution. Studies have shown that social cohesion is a strong predictor of team performance [9].

- Operationalization method - the method used to calculate team-level metrics could mask different aspects of team composition. Common methods are treated later in this chapter.

Environment-level moderators

- Task types - the nature of the tasks performed within a team can trigger coordination problems, and this lack of coordination affects team performance. Research studies use Steiner’s Task Taxonomy framework, which divides tasks into 9 types: divisible, unitary, maximizing, optimizing, additive, compensatory, conjunctive, disjunctive, and discretionary tasks.

- Study settings - research has shown that social processes differ between laboratory conditions and field conditions.

2.5 Aggregation Method

Operationalization is the process of calculating variables that are not directly measured but are observed through other measurements. For teams, individual traits are aggregated to reflect the variables at team level. The operationalization method is likely to influence the significance of these variables for team performance. Murray R. summarises in his study [9] the most common methods used in research: mean, variance, minimum and maximum of individual metrics. Each operationalization provides unique information; therefore an optimal solution is to investigate the impact of every operationalization method (a short sketch follows the list below). In practice, however, researchers adopt only one method, mean or minimum, which have proved to be the most important predictors of team performance for the considered team variables.

- The mean of individual measurements - assumes that the amount of a certain trait is evenly distributed among team members. This approach works best when individual traits can be additively combined to reflect the team-level amount of a trait, e.g. conscientiousness.

- The variance of individual measurements - captures variations in team composition, which usually are not captured by the mean, e.g. extraversion.

- The minimum or maximum of individual measurements - assumes that for certain traits a single individual can significantly affect the team success, e.g. agreeableness.
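As a minimal illustration of these four operationalizations, the sketch below computes them for a single trait with pandas; the column names and values are assumed for illustration and are not taken from the thesis dataset.

```python
# Illustrative sketch (assumed column names and values): the four common
# operationalizations of an individual trait at team level.
import pandas as pd

# Hypothetical individual-level data: one row per team member.
members = pd.DataFrame({
    "team_id":       [1, 1, 1, 2, 2],
    "agreeableness": [0.7, 0.9, 0.5, 0.8, 0.6],
})

# Mean, variance, minimum and maximum of the trait per team.
team_vars = members.groupby("team_id")["agreeableness"].agg(
    ["mean", "var", "min", "max"]
)
print(team_vars)
```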


2.6 Discussions

In terms of personality traits, the literature recommends creating an optimal team consisting of conscientious and open-to-experience individuals. Attention should be paid to low-agreeable people, who have a negative impact on team performance. Also, teamwork values like collectivism and preference for teamwork proved to have a strong effect on team dynamics and, in consequence, on team performance. Finally, the importance of technical skills is indisputable; however, team performance depends on team members’ contribution across all the different skills they possess. Putting together an all-stars team will not transform it into a high-performing team, but the right mix of psychological skills leads to increased team success.


3 Data Extraction

This section describes the methodology we used to develop the machine learning application and explains the business scope of our research. The data sources for team compositions were provided by the target company. We extracted the individual evaluation metrics of team members from the company’s assessment framework, CliftonStrengths. The team structures were retrieved from the project schedule tool maintained at company level, and the team performance evaluations were gathered through semi-structured interviews conducted with every project’s leader.

3.1 Methodology

Many data science practitioners apply different methodologies during their project lifecycle in order to grasp a good understanding of their needs: business requirements, available data, modeling techniques and final deployment. For the present data project we used the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology. The CRISP-DM model is extensively used in data mining projects, where its six phases interact with each other in an iterative process, as described in Fig 3.1. Typically, steps like data understanding, data preparation and modeling are revisited multiple times.

Fig 3.1. CRISP-DM Model for Data Mining

3.2 Business Understanding

As seen in the Literature Review chapter, organisations struggle when creating teams based on intuitive rationale. Their business objective is to put together team members who complement each other’s skills and together are able to deliver a quality product, within time and budget. If the team fails or proves to be inefficient, the company suffers costs of delays, a damaged reputation and most probably loss of contracts. Therefore, our goal is to facilitate team creation based on advanced computational algorithms. We aim to implement a predictive analytics application using machine learning to predict team performance based on team composition. Specifically, we try to predict whether a team will be successful or challenged, by analysing the personal traits of its team members.


By analysing current and historical data on team composition within a company, we can make predictions about the performance of future teams. With this business goal defined, the task translates into a classification problem. Thus, solving the problem consists in finding the model that best fits the team composition - feature set X - to one of the two categories, high-performing or low-performing teams - class label y.

Fig 3.2 Classification of input X into its class label y

Fig 3.2 explains the data mining technique used to classify team performance, y. To reduce the effect of control variables related to group and environment particularities, we decided to focus on software teams with high levels of interdependence, supervised by a project manager. The targeted company is a software agency providing its clients with development teams consisting of engineers, application designers and business analysts. Key responsibilities rest with the project manager, who evaluates the team’s progress together with the client and the team members. The project manager also oversees the quality of the software product and the teamwork processes. The team has the freedom to propose and adopt practices in order to improve teamwork.

3.3 Data Collection

The initial data was collected from a sample of 240 employees and 45 projects. However, after applying the following restrictions, the final sample contains 121 employees working on 27 projects.

- Software products delivered by the targeted company

- Software teams with 2 - 9 team members

- Projects with a stable team structure for a duration of min 2 weeks and max 2 months

- Projects running in the time interval June 2017 - May 2018

- Team progress overseen by a project manager aware of the daily team processes

- Limited to teams for which the core team members (e.g. developers, testers, designers, technical lead) are part of the targeted company

3.4 Data Description

The plan is to collect personal traits of all employees, identify team structures and retrieve performance evaluations for each team. The experiment’s objective is to implement a classification model trained on the dataset in question and evaluate whether the model can accurately predict how teams perform. The following subsections explain these steps in detail.


Step 1. Collect personal traits from CliftonStrengths framework

Our target company uses the CliftonStrengths assessment tool to identify employees’ talent potential. The most common personality assessment tools on the market are Myers-Briggs (MBTI), the Caliper Profile, CliftonStrengths (CSF) and the Sixteen Personality Factor Questionnaire (16PF). All of them aim to determine employees’ strengths, weaknesses and behavioural tendencies, e.g. execution, influence, dominance and support.

CliftonStrengths, previously called Clifton StrengthsFinder, is an assessment-based development tool introduced by Don Clifton in 2001, after more than 25 years of research. CliftonStrengths analyzes employees’ skills, knowledge, abilities, attitudes and personality traits and categorizes them into 34 distinct themes. A theme represents a sum of talents, which gives teams and individuals an indication of their natural patterns of thinking, feeling and behaving. In 2007, an updated version of the assessment tool was introduced. The tool is designed to identify the employee’s strengths and help to achieve success through personal development and growth.

When looking at individual development, the 34 themes precisely capture one’s psychological traits. But when deciding how teams could be created, a more general framework is considered: the themes are clustered into 4 theme groups, namely Executing, Influencing, Relationship Building and Strategic Thinking, depicted in Fig 3.3. Each of these theme groups is conceptually correlated with one of the Big Five personality traits: the Executing group is highly correlated with conscientiousness; Influencing with extraversion; Relationship Building with agreeableness; and Strategic Thinking with openness to experience. No theme group correlates strongly with neuroticism [11].

Fig 3.3 CliftonStrengths Team Members Assessment Report
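For illustration only, the conceptual correlations described above can be captured in a simple lookup structure; the entries below restate the theme groups and Big Five factors from the text and are not part of the thesis implementation.

```python
# Hypothetical mapping, for illustration: each CliftonStrengths theme group
# and the Big Five factor it is conceptually correlated with [11].
THEME_GROUP_TO_BIG_FIVE = {
    "Executing":             "conscientiousness",
    "Influencing":           "extraversion",
    "Relationship Building": "agreeableness",
    "Strategic Thinking":    "openness to experience",
    # No theme group correlates strongly with neuroticism.
}
```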

We consider this psychometric to reflect team members’ personal traits, based on which we compute the team composition variables - our feature set X - with impact on team performance. Due to privacy regulations, the psychological data gathered from employees has to comply with the company’s internal HR policies. Thus, employees gave explicit consent to share their CliftonStrengths scores by participating in this study. Personal data and project information were anonymised before being used in the classification model, and individual data cannot be disclosed or used for purposes other than this experimental research.


Step 2. Extract present and historical team structures

Next to personal traits, we need the company’s project schedule to extract all the teams and the employees assigned to them. The required information was provided by the TFS (Talent Fulfillment Specialist) within the targeted company, covering June 2017 until May 2018. The lack of centralised data prior to this date forced us to focus on present and historical data for the length of one year only. The project schedule was used to determine the team structures, the duration of stable team configurations and other project characteristics used to filter the relevant teams. The challenges encountered with data integrity and data consistency were settled in close collaboration with the TFS, who is responsible for maintaining the project scheduling and other relevant initiatives to improve this process. The confidential data extracted from both data sources was merged into a preliminary dataset.

Step 3. Conduct team performance evaluation interviews

To add the team metrics to the final dataset we followed a two-step approach. First, teams and projects were selected to comply with the criteria stated in the Data Collection section. Second, the performance evaluations of the teams were obtained through semi-structured interviews with the project managers of the qualified projects. In this second phase, the biggest challenge was to find a domain-agnostic framework that measures team performance as a combination of objective and subjective metrics and captures all aspects of team output and intra-team dynamics.

The specialised literature and project management articles promote a set of evaluation methods based on the Six Factors framework, where project success is measured in terms of meeting the time constraints, agreed scope, budget, team satisfaction, client satisfaction and product quality [12]. This technique measures the team output at the end of the project cycle, but it fails to capture team interaction metrics. For this reason, we adapted this evaluation method to serve our purpose and focus on team dynamics. In this case, team performance is defined along three dimensions, evaluating the team output quality, the team process and the members’ experience, as described in Fig 3.4.

- Project Delivery - an objective evaluation of the team output, measuring whether the project was delivered on time, on budget and on scope.

- Team Productivity - evaluates whether the client is satisfied with the team output and the team process. We consider a project failed if the team output met the client’s expectations but the team process was hindered.

- Team Interaction - a more subjective metric, where the project manager evaluates the intra-group processes: communication patterns, conflict levels, team cohesion, growth and personal well-being of the team members.


Fig 3.4. Dimensions of Team Performance Evaluation

Each dimension is evaluated using a RAG scale, a standard project management method to indicate project status. The RAG system is based on the colours of the traffic light system: Red (high risk), Amber (medium risk) and Green (low risk). The choice of this scale was strongly influenced by its popularity within project management. Lastly, a ‘majority rule’ voting mechanism over the values of the three dimensions was used to determine the final team performance.
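A minimal sketch of this voting step is shown below; the mapping from RAG colours to a binary performance label is our assumption for illustration, as the text does not spell out how Amber is counted.

```python
# Minimal sketch (assumed RAG-to-label mapping): majority vote over the
# three dimensions to obtain the final team performance label.
from collections import Counter

RAG_TO_LABEL = {"Green": "high", "Amber": "low", "Red": "low"}  # assumption

def team_performance(delivery: str, productivity: str, interaction: str) -> str:
    votes = [RAG_TO_LABEL[r] for r in (delivery, productivity, interaction)]
    return Counter(votes).most_common(1)[0][0]

print(team_performance("Green", "Green", "Amber"))  # -> high
```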

3.5 Data Quality

Data integrity was verified in multiple stages during the data collection process. First, the company’s project schedule was re-evaluated by the TFS, and any misunderstandings were resolved directly. Second, the team structures were cross-checked with project managers during the performance evaluation interviews. The CliftonStrengths traits were collected based on individual invitations, and all data was centralised in a safe environment with limited access.

To eliminate biases in team performance evaluation, we established a standard format, followed during the evaluation interviews:

1. Establish a common understanding on the scope of the experiment

2. Detailed explanation of the performance dimensions: Project Delivery, Team Productivity, Team Interaction

3. Introduction to the RAG measurement scale

4. Define together with the project manager the point in time during the team’s lifetime to which the evaluation should relate.

Possible control variables that were not accounted for in the performance evaluation and that may influence the final results are:

- Virtual teams, where technology may influence the communication patterns

- Additional team members with secondary roles (not core team members), who are not part of the target company and may have a different culture, teamwork values or goals.

Although the dataset is relatively small, the feature proportions are levelled: Executing 26%, Influencing 13%, Relationship Building 26% and Strategic Thinking 35%.


4 Data Preparation

This chapter describes the approach we followed during our investigation and the techniques we used to prepare the data for machine learning modeling. It also describes the first insights observed by analysing the importance of personal traits in software teams.

4.1 Experimental setup

Although this study focuses on the effects of psychological skills on team performance, we cannot ignore the importance of technical skills in a team composition. Our investigation consisted of two parts: first, the prediction of team performance based on psychological skills, and second, based on technical skills. The first and most important part focuses on psychological skills and evaluates the effects of personal traits by running experiments on a private software team dataset. The second part, considered a complementary experiment, focuses on technical skills and evaluates the effects of individual performance on team performance. Ideally, both analyses would be performed on the same dataset. Because this was not feasible, we quantified the impact of technical skills using a public scientific publication dataset.

For both datasets we used similar techniques for feature engineering, feature importance and evaluation of the models’ output. In our experiments we trained three machine learning models and determined the extent to which these models are able to classify team performance based on individual variables. Finally, we discuss the implications of the selected machine learning models, the particularities of each dataset and the insights gained from this study.

4.2 Operationalization Method

The mean method is the most used operationalization technique to calculate team variables. As CliftonStrengths traits are descriptive, we can additively combine them to reflect the team-level variables. Therefore, for every feature (personal trait) we calculate the mean score over all team members, which gives us the team score:

$\mathrm{Score}(f_i, t_k) = \frac{1}{n} \sum_{j=1}^{n} \mathrm{Score}(f_i, m_j)$

where $f_i$ is feature $i$, $t_k$ is team $k$, $n$ is the number of team members in team $k$, and $m_j$ is team member $j$ of team $k$.


Fig 4.1 Application of Mean Operationalization at Team-Level

Fig 4.1 illustrates the first five teams in our dataset, for which we calculated the team variables using the mean operationalization method.

4.3 Feature Engineering

Input and Output Variables

The independent and dependent variables, X and y, are defined as numpy array data structures in Python. The dependent variable, y, also called the target variable, is the performance level of a team. This variable represents the output that we predict in supervised learning. The independent variables, X, also called predictor variables, are the input data mapped to the target variable. In our case, the predictor variables are the 34 themes, from Achiever to Strategic.

Scaling

Many classification algorithms are sensitive to data scaling because they often use the Euclidean distance to compute the distance between two data points. This distance is heavily affected by the features’ magnitudes. To reduce this effect, scaling techniques are used so that each feature contributes proportionately to the distance. Common types of data scaling are Min-Max Scaling, Standardization, Mean Normalization and scaling to Unit Vector. We decided to use Standardization, which centers and scales each feature independently, with mean μ = 0 and standard deviation σ = 1. The StandardScaler class from sklearn.preprocessing implements this standardisation in Python.

Dataset Split

We split the dataset with a 75%-25% ratio for training and testing our machine learning models. The training dataset (75%) is used for model fitting and hyperparameter tuning, while the testing dataset (25%) is used to obtain an unbiased evaluation of the final model. It is important to note that the testing set is held back from training the model.
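A minimal sketch of the scaling and splitting steps is given below, with placeholder data standing in for the team-level theme scores; fitting the scaler on the training set only is a standard precaution, not something stated in the thesis.

```python
# Minimal sketch (placeholder data): standardize and split 75%-25%.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(27, 34)       # placeholder for 34 team-level theme scores
y = np.random.randint(0, 2, 27)  # placeholder labels: 1 = high-performing

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the scaler on the training set only, then apply it to the test set,
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```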

Data Balancing

With a strongly imbalanced dataset - 6 observations belong to the low-performing class and 21 observations to the high-performing class - we risk developing inaccurate predictive models. The reason is that many machine learning models do not consider the class distribution or the proportion of classes during training.

For this purpose we use the SMOTE (Synthetic Minority Over-sampling Technique) sampling technique, which generates new synthetic data points based on a subset of the minority class. This synthetic data is then added to the original training dataset. Conveniently, the SMOTE technique does not involve under-sampling of the majority class, so there is no loss of information. A disadvantage of SMOTE is its inefficiency with high-dimensional data. Fig 4.3 exemplifies the data sampling of our minority class.

Fig 4.3 Synthetic data generation with SMOTE for low-performing teams
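A sketch of the over-sampling step, assuming the imbalanced-learn package and the variables from the split above; k_neighbors is reduced because SMOTE needs more minority samples than neighbours, and with only 6 low-performing teams the default of 5 is borderline.

```python
# Minimal sketch: over-sample the minority class in the training set only.
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=3, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```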

4.4 Feature Importance

Feature selection is used to reduce the number of features fed to the model. This technique aims to simplify the machine learning model, which in turn provides better generalisation. For this purpose, we use a meta-transformer called SelectFromModel in Python. In our case, we select all features with an importance weight larger than a given threshold, 0.05. A random forest classifier with 250 trees is used to compute the feature importance weights.
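Continuing the sketch above, the selection step could look as follows; the estimator and threshold mirror the text, while the variable names are carried over from the earlier snippets.

```python
# Minimal sketch: keep only the features whose importance weight, as
# computed by a 250-tree random forest, exceeds the 0.05 threshold.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

forest = RandomForestClassifier(n_estimators=250, random_state=42)
selector = SelectFromModel(forest, threshold=0.05)
X_train_sel = selector.fit_transform(X_train_res, y_train_res)

# Indices of the retained themes (e.g. Arranger, Adaptability, ...).
print(selector.get_support(indices=True))
```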

We can observe that for the mean operationalization, the top 4 features explain large parts of the team performance variation. The most important features found by the tree-based classifier are:

- Arranger - people characterised by order and flexibility, with a special talent for arranging everything for maximum productivity;

- Adaptability - people characterised by the term ‘now’, who prefer to take things as they come and adapt easily to dynamic situations;

- Includer - people characterised by accepting others and making efforts to include left-out individuals;

- Intellection - people characterised by introspection, reflection and intellectual activity.

As we can see, these skills are closely related to collectivism and preference for teamwork, prerequisite values for efficient team collaboration. Considering the size of our sample, a larger dataset may provide a different feature ranking.


Fig 4.4. Feature Importance for Software Teams Dataset

4.5 Technology Stack

The technology stack consists of the scikit-learn lightweight machine learning toolkit, together with complementary Python packages. The Jupyter Notebook environment was used to implement the application in the Python 3.6 programming language, and Tableau Software was used for data governance and EDA (Exploratory Data Analysis).

Scikit-learn is a free machine learning library for Python that provides a wide range of classification, regression and clustering algorithms, and interoperability with Python’s numerical and scientific libraries. Jupyter Notebook is an open-source, web-based IDE that facilitates code development for both small and large-scale applications. Its collaborative work environment and integration with big data tools make it widely used in data science projects. We chose Python 3.6, installed through the Anaconda distribution, because of its huge popularity throughout the industry, its flexibility and its dedicated libraries for machine learning and data analysis, namely scipy, numpy and matplotlib.


5 Software Teams Experiment

5.1 Training Technique

As in any data science application, there is no recipe that tells us which algorithms to use to solve a business problem. The algorithm selection process is highly dependent on the nature of the dataset. To solve a classification problem, scikit-learn provides a set of algorithms:

- Logistic Regression - a statistical method that estimates the model’s coefficients using maximum likelihood and provides expected probabilities of the predicted class;

- Naive Bayes Classifier - applies a technique based on Bayes’ Theorem, assuming independence between each pair of features;

- Support Vector Machine - for which the decision boundary is constructed so that it maximizes the margin between the decision hyperplane and the training data;

- Nearest Neighbor - finds the predicted class using a voting mechanism to determine the mode class of the closest neighbours;

- Decision Tree - builds a tree structure with decision nodes and leaf nodes, where the leaves represent the class labels;

- Random Forest - an ensemble of decision trees that calculates the final class label as the mode of the classes predicted by the individual trees;

- Neural Network - a network of neurons, arranged in layers, that converts an input vector into an output.

Our goal is to select three models that provide a probability interpretation of the relationship between the input and output variables, are easy to train, and are suitable for a relatively high dimensionality compared to the number of records. Because small datasets require models with low complexity to avoid overfitting, we decided to use Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF) models.

To evaluate our models’ performance against a baseline, we use a Dummy Classifier that makes predictions respecting the class distribution of the training dataset (‘stratified’). The model’s accuracy scores on the train and test sets are 0.50 and 0.54, respectively. This classifier is used for baseline comparison purposes only.

The Grid Search construct (GridSearchCV in sklearn) is used for all models to determine the optimal hyperparameters. The grid search method is an exhaustive search through all possible combinations of the given parameters for a machine learning model. The algorithm finds the best parameters by analysing the performance of each parameter combination through cross-validation.
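A minimal sketch of the baseline and of a grid search, reusing the variables from the earlier snippets; the parameter grid is illustrative, not the exact grid used in the thesis.

```python
# Minimal sketch: stratified dummy baseline plus an exhaustive grid search.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

baseline = DummyClassifier(strategy="stratified", random_state=42)
baseline.fit(X_train_res, y_train_res)
print("baseline test accuracy:", baseline.score(X_test, y_test))

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},  # illustrative values
    cv=3,  # 3-fold cross-validation per candidate
)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_, grid.best_score_)
```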


5.2 Evaluation Technique

Evaluation and cross-validation are standard methods to measure the performance of a machine learning model. Both techniques generate evaluation scores that can be inspected or compared with those of other models. We compare each model against the baseline scores of the Dummy Classifier, and later in the chapter we summarise the evaluation scores of all tested models.

Therefore, we define three evaluation methodologies to inspect a model:

- Obtain the model accuracy through cross-validation, which repeatedly trains, scores and evaluates subsets of the full data automatically;

- Obtain the accuracy of the model trained and scored on the training set, and compare it to the model accuracy evaluated on the testing set;

- Analyse the precision, recall and F1-scores.

First, we use cross-validation, which performs three train-score-evaluate operations (3 folds) on different subsets of the data. The dataset is divided into three parts; the model is trained on two parts (folds) and tested on the third. This operation is repeated three times and the evaluation accuracies are averaged. This average accuracy indicates how well the model would perform on unseen data. We use the learning curve to visualize the accuracy scores when evaluating a model through cross-validation.
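The sketch below shows how both evaluations can be obtained with scikit-learn, again reusing the variables from the earlier snippets; C=1 follows the best value reported later in this chapter.

```python
# Minimal sketch: 3-fold cross-validated accuracy and learning-curve data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, learning_curve

model = LogisticRegression(C=1, solver="liblinear")
print("mean CV accuracy:", cross_val_score(model, X_train_res, y_train_res, cv=3).mean())

# Training vs. validation accuracy at increasing training-set sizes.
sizes, train_scores, test_scores = learning_curve(
    model, X_train_res, y_train_res, cv=3, train_sizes=np.linspace(0.3, 1.0, 5)
)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))
```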

Second, we evaluate the accuracy of the model fitted on the training set and tested against the testing set. We use this evaluation to consolidate the cross-validation outcome and to understand whether the model is at risk of overfitting or underfitting the dataset. A large difference between the training and test accuracies indicates a high risk of overfitting: the model cannot generalize well enough, as it is tailored to the training dataset. We can detect this effect by visually inspecting the learning curve.

In classification problems, accuracy is the proportion of correctly classified data points. However, when the dataset is unbalanced, as in our case, accuracy masks many of a model’s faults. We therefore also look at the confusion matrix, to determine the ratio of correctly predicted data points, and at the classification report, to evaluate precision, recall and the F1-score.

5.3 Analysis and Modeling

5.3.1 Logistic Regression Model

The model’s efficiency depends heavily on its parameters, so we let the Grid Search construct choose the right hyperparameter, C, for our Logistic Regression model. Grid Search browses through a given parameter space and trains, evaluates and compares each unique combination of parameter values. In our case, the grid search algorithm fitted 30 models, performing a 3-fold cross-validation for each of the 10 parameter values. The best-performing Logistic Regression model was found for C=1, with an accuracy score of 0.906 on the training set. In Fig 5.1 we can observe the accuracy scores throughout the given parameter space.


Fig 5.1 Grid Search Model Selection for Logistic Regression

To evaluate the performance of the Logistic Regression model, we first look at the accuracy obtained through cross-validation. The average accuracy is 0.815, much higher than the baseline score of 0.66. This relatively high accuracy indicates that our model would perform well when predicting future data. But if we analyse the learning curve in Fig 5.2, we observe a high risk of overfitting: on the right side of the graph, the training score is much higher than the test score. A reason for this could be the limited size of our dataset - 27 data points.

The same effect can be noticed when comparing the accuracies of the model trained and scored on the training set and evaluated on the testing set. The Logistic Regression model scores 0.906 on the training set and 0.714 on the test set. Nevertheless, these scores are much higher than the baseline accuracies. Table 5.1 summarizes the evaluation scores retrieved for Logistic Regression against the baseline Dummy Classifier.


For binary classification models the predicted value can take only two values, high- or low-performing teams, which are usually referred to as positive and negative values. The confusion matrix (Appendix 1) shows the numbers of true positive, true negative, false positive and false negative data points. The positive and negative data points that a model predicts correctly are called true positives and true negatives; the incorrectly predicted data points are called false positives and false negatives.

Therefore, the Logistic Regression model correctly predicts 3 out of 5 high-performing teams and 2 out of 2 low-performing teams. Fig 1.1 (Appendix 1) illustrates the proportion of true positives and true negatives. In the normalized matrix, we observe that 60% of the positive data points were correctly predicted, whereas 100% of the negative data points were predicted correctly. Our Logistic Regression model performs very well in relation to these metrics.

To better understand the performance of a classifier, we can ask the following questions: “Out of the predicted high-performing teams, how many were classified correctly?” The answer to this question is the precision of the model, which represents the proportion of positive predictions that are correct: 0.86. A second question is “Out of all the high-performing teams, how many were correctly classified by the model?” This is the recall of the model, which represents the true positive rate: 0.71. The classification report in Table 1.1 (Appendix 1) contains these average values of precision and recall.

Every model has a trade-off between precision and recall, which might have important business implications. In our case the model tends to predict low-performing teams, thus it has a high precision but a low recall. This means that many of the high-performing teams would be misclassified.

Another metric used in model evaluation is the F1-score, calculated as the harmonic mean of precision and recall. The advantage of the F1-score is that it evaluates the model as a single metric, but best practices recommend also analysing precision and recall to get a better understanding of how the model behaves. For our classification model the F1-score is 0.73, relatively high compared to the baseline score of 0.32.
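A sketch of how these metrics are produced, reusing the variables from the earlier snippets; the class names are assumed label names, and the reported numbers will differ on the placeholder data.

```python
# Minimal sketch: confusion matrix and classification report for one model.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(C=1, solver="liblinear").fit(X_train_res, y_train_res)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # rows: actual, columns: predicted
print(classification_report(y_test, y_pred, target_names=["low", "high"]))
```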

5.3.2 Support Vector Machine Model

Second, we considered a Support Vector Machine classifier, implemented in scikit-learn as Support Vector Classification (SVC), under the assumption that SVC is effective on datasets for which the number of features is greater than the number of samples - we have 36 features and only 27 samples. Another reason for choosing SVC is the kernel functions that can be specified for the model’s decision function. A disadvantage of SVC is the lack of direct probability estimates, but prediction probabilities can still be extracted through the predict_proba function, which scikit-learn enables when the classifier is fitted with probability estimation turned on.

As a preliminary step, we analyse the model’s decision boundary for four kernels: linear, rbf, polynomial and sigmoid. In our analysis we considered the two most important features, ‘Intellection’ and ‘Arranger’. Fig 5.3 shows the decision boundaries for the models tuned with the regularisation parameters: C=0.1 and gamma=0.7 for SVC-rbf; degree=3 for SVC-polynomial; and gamma=2 for SVC-sigmoid. We notice that the most suitable models for our dataset are the SVC with linear and rbf kernels. Due to the high model complexity of SVC-rbf relative to our small dataset, we decided to use the SVC with a linear kernel in our experiment (SVC-linear).

Fig 5.3 Support Vector Machine Kernels for Two Most Important Features
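A sketch of this kernel comparison is given below, reusing the variables from the earlier snippets; the first two columns stand in for the ‘Intellection’ and ‘Arranger’ features, and the hyperparameter values follow the text.

```python
# Minimal sketch: fit an SVC per kernel on the two most important features.
from sklearn.svm import SVC

X2 = X_train_res[:, :2]  # placeholder for the Intellection and Arranger columns

kernels = {
    "linear":  SVC(kernel="linear", C=0.1),
    "rbf":     SVC(kernel="rbf", C=0.1, gamma=0.7),
    "poly":    SVC(kernel="poly", degree=3),
    "sigmoid": SVC(kernel="sigmoid", gamma=2),
}
for name, clf in kernels.items():
    clf.fit(X2, y_train_res)
    print(name, clf.score(X2, y_train_res))
```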

For the SVC model with linear kernel, the Grid Search construct retrieves the right hyperparameter, C, from the same parameter space as used for Logistic Regression. In this case, the best-performing SVC model was found for C = 0.01, with an accuracy score of 0.969 on the training set. In Fig 5.4 we can observe the accuracy scores throughout the given parameter space.

Fig 5.4 Grid Search Model Selection for Support Vector Classifier

To evaluate the performance of the SVC model, we look at the accuracy retrieved through cross-validation. The average accuracy is 0.778, higher than the baseline score of 0.66, but lower than the Logistic Regression cross-validation accuracy of 0.815. At first sight, this relatively low accuracy indicates that our model would not perform very well on unseen data. However, if we analyse the learning curve in Fig 5.5, we observe a much lower risk of overfitting: the training and testing scores are much closer to each other.

The same effect can be noticed when comparing the accuracies of the model trained and scored on the training set and evaluated on the testing set. The SVC model scores 0.969 on the training set and 0.857 on the test set. These scores are much higher than the baseline and Logistic Regression accuracies. Table 5.2 summarizes the evaluation scores retrieved for SVC against the baseline Dummy Classifier.

Fig 5.5 Cross-Validation Learning Curve for Support Vector Classifier

Table 5.2 Evaluation Scores for Support Vector Classifier

Inspecting the confusion matrix (Appendix 2), we observe that the SVC model correctly predicts 5 out of 5 high-performing teams and 1 out of 2 low-performing teams. Fig 2.1 (Appendix 2) illustrates the proportions of true positives and true negatives. From the normalized matrix we notice that 100% of the positive data points were correctly predicted, whereas only 50% of the negative data points were predicted correctly. Our SVC model performs relatively poorly in relation to these metrics; with a support of 2 negative and 5 positive samples, the result may differ for a larger dataset. The classification report in Table 2.1 (Appendix 2) shows a precision of 0.88, which represents the proportion of positive data points correctly classified by the model. The recall of the model is 0.86, which represents the true positive rate. Therefore, SVC shows a better trade-off between precision and recall, which results in fewer high-performing teams misclassified by this model. Also, the F1-score is 0.84, better positioned than the baseline and Logistic Regression scores of 0.32 and 0.73, respectively.


5.3.3 Random Forest Model

Lastly, we considered a Random Forest classifier in our experiment, because of its ease of tuning, its effectiveness in a wide range of applications and its insensitivity to noise in the data. For the Random Forest model, Grid Search looks for the optimal number of tree estimators and the maximum depth to which each tree is expanded. The search evaluates 12 (3x4) Random Forest configurations, combining ensembles of [100, 150, 200] decision trees with maximum tree depths of [4, 5, 6, 7], the depth being capped to prevent overfitting. The search identified n_estimators = 200 with max_depth = 4, for an accuracy of 1.0 on the training dataset.
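A sketch of this search, using the grid stated above, is given below (the random seed is an illustrative addition for reproducibility):

    # Sketch: Grid Search for the Random Forest classifier.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {'n_estimators': [100, 150, 200],  # ensemble sizes
                  'max_depth': [4, 5, 6, 7]}        # depth cap per tree
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)  # reported: n_estimators=200, max_depth=4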

To evaluate the performance of the Random Forest model, we again look at the accuracy retrieved through cross-validation. The average accuracy is 0.815, higher than the baseline and SVC scores of 0.66 and 0.778, respectively. At first sight, this relatively high accuracy suggests that the model would perform well. However, when we analyse the learning curve in Fig 5.6, we observe a very high risk of overfitting: the training and testing scores, on the right side of the graph, remain far apart. This second check shows that the Random Forest model is likely less suitable for our dataset than Logistic Regression or SVC.

The same conclusion can be drawn when comparing the accuracies of the model on the training and testing sets: the Random Forest model scores 1.0 on the training set but only 0.714 on the test set, a gap that confirms strong overfitting. Table 5.3 summarizes the evaluation scores retrieved for Random Forest against the baseline Dummy Classifier.


Analysing the confusion matrix (Appendix 3), we observe that Random Forest correctly predicts 5 out of 5 high-performing teams, but fails to classify any low-performing team correctly. Fig 3.1 (Appendix 3) shows the proportion of true positives and true negatives: in the normalized matrix, 100% of the positive data points are correctly predicted, against 0% of the negative data points. Our Random Forest model therefore performs poorly on these metrics. However, with a test set of only 7 samples, the results are not conclusive.

The classification report in Table 3.1 (Appendix 3) shows a precision of 0.51, the proportion of teams predicted as high-performing that actually are. The recall of the model is 0.71, the true positive rate. In this type of application, high precision matters more than recall, because a low-performing team misclassified as high-performing may have a heavy impact on the business. The F1-score, at 0.60, is the lowest among the models trained so far. All these evaluations show that the Random Forest classifier is not a good fit for our dataset.

5.4 Results and Implications

The summary in Table 5.4 contains the evaluation scores discussed individually in the previous sections. All three analysed models, Logistic Regression, Support Vector Classifier and Random Forest, are compared to each other and against the baseline Dummy Classifier.

The three-dimensional evaluation technique defined at the beginning of this chapter allows us to compare multiple models and find the best candidate for our dataset. First, we compare the cross-validation accuracies of all three models, on which Logistic Regression and Random Forest perform best. Because accuracy alone is not a reliable indicator when evaluating classifiers, we next look at the difference between the accuracy scores on the training and testing sets; this check eliminates models with a high risk of overfitting, such as Random Forest in this case. Lastly, the SVC model has the highest precision, recall and F1-score among the compared models. Therefore, the SVC model with linear kernel appears to be the best classifier for our dataset.
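The three checks can be scripted in one pass, as in the sketch below; model settings other than those reported earlier (the Dummy strategy and the Logistic Regression configuration) are assumptions:

    # Sketch: the three-step model comparison (CV accuracy, train/test
    # gap, F1) over all candidates, including the Dummy baseline.
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    models = {
        'baseline': DummyClassifier(strategy='most_frequent'),  # assumed
        'logreg':   LogisticRegression(max_iter=1000),          # assumed
        'svc':      SVC(kernel='linear', C=0.01),
        'forest':   RandomForestClassifier(n_estimators=200, max_depth=4),
    }
    for name, clf in models.items():
        cv_acc = cross_val_score(clf, X, y, cv=5).mean()
        clf.fit(X_train, y_train)
        gap = clf.score(X_train, y_train) - clf.score(X_test, y_test)
        f1 = f1_score(y_test, clf.predict(X_test))
        print(f'{name}: cv={cv_acc:.3f} gap={gap:.3f} f1={f1:.3f}')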


5.5 Synthetic Data Generator

Due to companies' strict regulations on access to employees' sensitive data, collecting large samples from industry is both challenging and time-consuming. To further test our hypothesis, according to which machine learning can predict team performance based on members' personal traits, we therefore artificially extend our dataset.

We consider that a reliable dataset extension preserves the particularities of the original dataset, e.g. similar team sizes, and maintains the same mean and standard deviation per feature (personal trait) for each support class. We therefore implemented a synthetic data generator that produces 2000 additional teams: 1000 labeled as high-performing, "1", and 1000 labeled as low-performing, "0". In this process, we extracted all high-performing teams from the original dataset and calculated their means and standard deviations at feature level. We then generated random teams whose features follow a normal distribution with the means and standard deviations calculated above, and repeated the operation for the low-performing teams.
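A minimal sketch of this generator is given below; df is assumed to hold one row per team, with the trait columns and a binary 'performance' label (all names are illustrative):

    # Sketch of the synthetic generator: per class, sample teams whose
    # features follow a normal distribution with the class-wise mean and
    # standard deviation of the original data.
    import numpy as np
    import pandas as pd

    def generate_teams(df, label, n_teams, rng):
        subset = df[df['performance'] == label].drop(columns='performance')
        mu, sigma = subset.mean(), subset.std()
        synthetic = pd.DataFrame(
            rng.normal(mu.values, sigma.values, size=(n_teams, len(mu))),
            columns=subset.columns)
        synthetic['performance'] = label
        return synthetic

    rng = np.random.default_rng(42)
    extended = pd.concat([df,
                          generate_teams(df, 1, 1000, rng),
                          generate_teams(df, 0, 1000, rng)],
                         ignore_index=True)  # 27 + 2000 = 2027 samples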

Finally, the two randomly generated sets of teams were merged and added to the original dataset. Fig 5.7 displays the resulting 2027-sample dataset in two clearly separated clusters, one for high-performing teams (light blue) and one for low-performing teams (dark blue).

Fig 5.7 Synthetic Data Generator for Software Teams with Mean Operationalization

As expected, strong similarities can be observed when analysing the feature importance: 7 of the 10 most important features in the synthetic dataset also appear among the top 10 features of the original dataset. Running a similar experiment with our Logistic Regression, SVM and Random Forest models, we notice a large improvement in performance. Table 5.5 summarizes the evaluation scores for all three models. Once again, the SVM model is slightly better, with a cross-validation accuracy of 0.97. Although the original analysis was made on a small dataset (27 samples), the analysis conducted on the synthetically generated dataset (2027 samples) supports our initial findings. This validation therefore allows us to conclude that SVM (SVC with linear kernel) is the most suitable classifier for our dataset.


Table 5.5 Model Comparison for Software Team Synthetic Dataset

The answers to questions like "How accurate is a machine learning model trained on a synthetic dataset in real-world scenarios?" or "To what extent does the original data reflect the reality in industry?" could only be found by comparing our findings against a larger dataset collected from industry. This thesis could only demonstrate the potential of a classification model that accurately predicts team performance based on team members' personal traits.

5.6 Discussions

As a particularity of the Software Teams dataset, we have seen that traits such as Arranger, Adaptability, Includer and Intellection explain most of the variation in team performance; in other words, these traits are able to predict top-performing teams. Analysing the definitions CliftonStrengths gives for these traits, we notice that all are closely related to the personality factors and values of collectivism and preference for teamwork described in Chapter 2 as factors with a high impact on team performance.

A brief description of these traits, as defined by CliftonStrengths [20], is presented below:

- Arrangers are able to bring together seemingly disconnected elements and make informed decisions;

- People strong in Adaptability help others stay calm and relaxed in uncertain situations;
- Includers invite others to join their group and make them feel comfortable;

- People strong in Intellection like to think a challenge or idea through and provide others with deep learnings and a thorough understanding of it.

Moreover, looking at the correlations between personal traits and team performance, we notice that some traits correlate more strongly, positively or negatively, than others. Teams composed of individuals who score high on positively correlated traits could be decisively pushed towards better performance, whereas individuals who score high on negatively correlated traits could pull team performance down. Fig 5.8 shows the positively correlated traits (left) and negatively correlated traits (right) for which the absolute correlation coefficient is larger than 0.2. These traits are the most influential factors with respect to the performance polarity of a team.
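A sketch of this correlation analysis, reusing the assumed df layout from Section 5.5, is given below:

    # Sketch: correlating each trait with the binary performance label
    # and keeping only traits with |r| > 0.2 (cf. Fig 5.8).
    correlations = df.drop(columns='performance').corrwith(df['performance'])
    positive = correlations[correlations > 0.2].sort_values(ascending=False)
    negative = correlations[correlations < -0.2].sort_values()
    print(positive)  # traits pushing performance up
    print(negative)  # traits pulling performance down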


Fig 5.8 Positive-correlated Traits (left) and Negative-correlated Traits (right) with Team Performance

From the CliftonStrengths description of the traits [20], we can deduce that the positively correlated traits are closely related to teamwork and collectivism, whereas the negatively correlated traits are individualistic and emphasise self-efficacy. A summary of the correlations between all personal traits analysed in relation to the software teams is given in the correlation matrix in Fig 7.1 (Appendix 7).
