Understanding Organizational Culture from a Complex System Perspective


University of Amsterdam

Masters Thesis

Understanding Organizational Culture from a Complex System Perspective. Proposal of an Automated Method to Measure Soft Controls.

Author:

Suzan Q. Blommestijn

Supervisors: Jori van Schijndel, Marcel Boersma

Examiner: Drona Kandhai

Second Reader: Sumit Sourabh

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in Computational Science in the Section Computational Science, Informatics Institute


Declaration of Authorship

I, Suzan Quirine Blommestijn, declare that this thesis, entitled 'Understanding Organizational Culture from a Complex System Perspective. Proposal of an Automated Method to Measure Soft Controls.' and the work presented in it are my own. I confirm that:

This work was done wholly or mainly while in candidature for a research degree at the University of Amsterdam.

Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

Where I have consulted the published work of others, this is always clearly attributed.

Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

I have acknowledged all main sources of help.

Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:


Abstract

To steer company performance in the right direction, a healthy culture is very important. Pinpointing exactly where the problem in an unhealthy company lies can be a difficult, expensive and time-consuming job. In the current study organizational culture is seen as an emergent property arising from interaction between employees. To understand the emergence of organizational culture from a complex system perspective, this thesis proposes an automated method to measure soft controls. This method consists of three steps: first, email data is labeled as containing information about soft controls. Next, the polarity of these emails is extracted with sentiment analysis. Finally, networks are built based on the extracted data. The results show that the method is able to extract information about part of the soft controls. In addition, centrality and clustering methods can be used to detect employees or regions within a company where the organizational culture is unhealthy. The absence of soft control related information in combination with the presence of overall communication appears to be an indication that intervention is necessary. In addition, employees with similar feelings about the organizational culture appear to cluster together. The study was of an exploratory nature and therefore the conclusions cannot be generalized to other companies. Future research should confirm the findings made in this study.


Acknowledgments

I would like to thank my supervisors Jori van Schijndel and Marcel Boersma for their help, feedback and enthusiasm during the process of writing this thesis. Special thanks to Drona Kandhai for taking the time to review and judge my thesis. In addition, I would like to thank Paul Hulshof for his expert opinion and feedback. Also, I hereby express my gratitude to the entire KPMG FTech group for welcoming me in their team.

Finally, I would like to let my friends and family (most of whom will probably not read or understand half of this thesis) know that I appreciated the very welcome distractions during this period. I do think I am one of the few people who actually enjoyed (for the most part) writing my thesis. Special thanks to my mom, who kept telling me she did not understand most of what I did, but agreed with everything I wrote anyway and helped me correct the d-t mistakes (who knew this was a multi-lingual trait of mine).


3.3.6 Conclusion
3.4 Summary
4 Discussion
4.1 Conclusion
4.1.1 Implications
4.2 Limitations
4.3 Future Research
4.4 Final Remarks
A Packages
B Data Structure Details
C Dictionaries
D Code: most important functions
D.1 Data Cleaning
D.2 Labeling
D.3 Sentiment Analysis
D.4 Network Analysis
E Additional Plots
F Statistical Tests
F.1 Spearman Correlation


Chapter 1 Introduction

This thesis proposes an automated method to measure the effect of controlling the culture and behavior of employees in a company. Control here is defined as the act or power of exercising direction over something or someone[1]. Control is an essential part of our society, and of large companies in particular. A term used to describe the culture and behavior of employees is organizational culture[2]. Kaptein (2012) explains that fencing in organizations with structures, procedures and systems is no guarantee that employees will do the right thing[3]. The important question here is how to explain the behavior of people in organizations. The unethical behavior of a company is described as being the result of both the ethics program and the ethical culture[4]. Some of the earlier literature on how to have a successful career focuses on the employee as an active agent who molds his own environment and future[5]. It is important to put yourself in charge and take control of your actions. As a passive agent you let your environment take control over you. Unfortunately, over the years organizations might have lost sight of individuals and put all emphasis on the performance of the organization alone, where the employees are sculpted to perform in accordance with the goals of the company. These goals are translated into logical rules that employees need to follow, and the importance of human-related aspects is thereby overlooked[6]. Luckily, organizations are becoming more aware of the importance of a balance between focusing on the performance of a company as a whole and looking at the goals and desires of the individuals in that company. Streatfield (2001) describes two views on organizations[7]. The first is the presence of a dominant voice, where the manager has control over the company from outside. This view focuses on intention, regularity and control. The second view is that humans themselves are members of the complex networks that they form. This has a more participative perspective. Ladner (2009) argues that patterns of interaction are a key component of organizational culture, and important for a company to survive[8]. Organizational culture is the product of behavior patterns and unstated assumptions in a company[9]. It is an emergent phenomenon arising from interactions in a company that cannot be traced back to a single individual or cause[10][9]. The organization is a complex adaptive system where the individual behavior of employees is at the local (microscopic) level, and the organizational culture at the global (macroscopic) level[11].

1.1 Goal

The goal of this thesis is to understand organizational culture from a complex system perspective. To this purpose an automated method to measure soft controls is proposed. Soft controls is a term used to describe the control of the behavior and culture of a company[12]. This subject will be explained in depth in section 2.1.

Research on soft controls is mainly theoretical; not many articles look at soft controls in practice[13]. The research that does look at the practical implications mostly makes use of questionnaires[14]. As far as the author knows, no automated and unobtrusive method to measure soft controls exists. The current study will use email data as the source for information about soft controls in a company. As supported by Guimera et al. (2003), the use of emails as a source of information provides an inexpensive but powerful alternative to conducting questionnaires, which is expensive in terms of time and cost[15][16]. The current study proposes an automated method to measure the eight soft controls as described by Kaptein[17] from email data to gain understanding about organizational culture from a complex (social) system perspective. Sentiment analysis provides a method to investigate the polarity of the measured information about soft controls. More about sentiment analysis and how it can be of use to study organizational culture is discussed in section 2.3. Network analysis can give more insight into how organizational culture develops and the role communication structure plays in the well-being of a company.

A sub-goal of this study is for the proposed method to improve on the limitations of current measurement methods, which are discussed in section 2.1.2. The method will be automated, and therefore efficient, inexpensive and not as susceptible to subjectivity as other methods. After the method is constructed and validated, it does not need the input or meddling of people. In addition, because the proposed method makes use of email data, the employees will not be as aware of this measure as they would be with the existing methods. Therefore the social desirability bias[18] will be less of an issue. Finally, social interactions between the employees can be investigated with network analysis. This could measure an important part of the effect of soft controls in an organization that other methods do not take into account.

1.2 Organization of Thesis

This thesis has the following structure. Chapter 2 provides an overview of the literature study. First the subject of soft controls and the limitations with the current measurement methods will be discussed. A short introduction to complex systems will be given. Available sentiment analysis methods will be explained. An introduction to network analysis will be provided. Chapter 2 will finish with a discussion of relevant previous research, and how the current study adds to this.

The methodological structure of this study consisted of three steps: (1) data labeling, (2) sentiment analysis, and (3) network analysis. Chapter 3 describes the methodological part of this study. First an overview of the data and the pre-processing steps will be given. Next the measuring of organizational culture by extracting information and sentiment about soft controls will be provided. With the obtained results networks will be built. Section 3.3 will discuss the network analysis steps and results. Two key properties of networks will be discussed in detail: centrality and clustering.

An overall conclusion will be given in chapter 4. The limitations of the current methods will be discussed. The chapter will end with recommendations for future research.


Chapter 2 Literature Study

This chapter will provide in-depth information about the subjects necessary to understand the methodology of the current study. First the definition of soft controls is discussed. Current methods to measure them and their limitations will be provided. Next a short introduction to complex systems will be given, complex social systems in particular. Because sentiment analysis will be used to decide the polarity of organizational culture related information, the next section will explain more about sentiment analysis methods. To study social interaction this thesis will use network analysis. Therefore more about the general terminology used in network analysis and some of the properties that are most important for this study will be explained. Finally relevant research examples will be discussed. This chapter will end with a concise overview of all that was discussed.

2.1 Soft Controls

Influencing the behavior of employees in a company towards achieving the organizational goals of that company is an important aspect of controls[19]. These controls are as much about avoiding undesirable behavior as they are about promoting desirable behavior. Hard controls lead to directly visible changed behavior, actions or skills[12][20]. Hard controls can be clearly observed and controlled. Examples of formal hard controls are fixed procedures, registration systems, written approvals, having to wear a badge and locks on doors and lockers.


three aspects of controls[19]. These aspects are information, motivation and equipment. The three described effects are positive, aversive and passive. The positive effects from the three aspects of hard controls are that the employee feels informed, motivated and equipped. The aversive effect leads to mistrust in the controls. The information can make the employee feel pedantic. The motivation could result in the employee feeling offended or insulted, and the employee could feel obstructed by the equipment, as if they are being taken by the hand. The result is a lack of engagement in desirable behavior, or even showing resistance and behaving opposite to what is desirable. The passive effect happens when the employee trusts the controls too much. The information can make the employee assume everything is taken care of. The motivation can make employees feel less responsible for their own behavior. The equipment can result in the employees being careless and less alert to risk because they trust the systems and procedures too much. This results either in passive behavior, or in the employee going overboard with trying to show desirable behavior and losing their sense of proportion (asking permission for everything, sending too much information, etc.). Kaptein and Vink describe that the effects of hard controls are mostly positive in the beginning, but as the number of hard controls increases, the aversive and passive effects increase as well. Therefore it is suggested to not only look at the hard controls of a company, but also to focus on what is happening beneath the surface: the culture of the company[17].

The organizational culture of a company can be described as organizational values communicated through norms and artifacts¹ and can be observed in behavioral patterns[2]. Schein's model of organizational culture describes different layers; values are at the bottom, next are the norms, followed by the artifacts. The norms and artifacts influence behavior. Literature does not always agree on the interpretation of organizational culture[8][2]. This could be explained by Schein's model, and researchers discussing the organizational culture on a different layer.

The culture of a company can be a challenging area to measure because it is not something that can be easily captured[21]. A term that describes the control of the behavior and culture of a company is soft controls. Soft controls can be defined as "a measure that intervenes in or appeals to employees' individual performance"[20]. Soft controls are described as the control of an employee's convictions, values and personality[12]. These attributes are not observable and influence the behavior that is observable. They are what controls the artifacts, norms and values described in Schein's model[2]. Kaptein[22] describes eight soft controls, or cultural drivers, that can be divided into three subcategories: prevention, detection and response[23]. Prevention controls the intention of employees and consists of the following four soft controls: clarity, role-modeling, commitment and enabling environment. Detection controls the actions and behavior of employees and consists of the following two soft controls: transparency and openness. Response controls the effects and results of the behavior of employees. Response consists of the following two soft controls: accountability and enforcement[4].

¹ Artifacts in this model are described as the elements in a culture that are visible and can be recognized by people outside of that culture[2]. An example is the dress code in a company.

These eight soft controls, preceded by their higher level categories, are described as follows[24][4]:

Prevention

1. Clarity It needs to be clear to the board, management and employees what behavior is desirable. The clearer the expectations, the better the employees know what to do and the larger the chance they actually do it.

2. Role-modeling The board, higher management and executives need to set an example. The better the example, the better employees behave, and vice versa.

3. Commitment The board, management and employees need to deal with each other with respect and keep each other involved. The more respect and feeling of involvement, the more people are committed to do what the organization needs from them.

4. Enabling Environment The goals, responsibilities and tasks need to be practicable. The more people can possess the right knowledge and skills, the better they can do what is expected. Goals that are one sided or that are set too high increase the chance of undesirable behavior.


Detection

5. Transparency The better people can observe the behavior of themselves and others, the better they can adjust their behavior to the expectations of others. Transparency increases people's awareness and sense of responsibility.

6. Openness The more people are given the opportunity to talk about their point of view, emotions, dilemmas and violations, the more they will actually do this and the more they will learn from this. An open and protective environment where employees can discuss their feelings, dilemmas and criticism is important.

Response

7. Accountability The more comfortable or safe people feel with reporting mistakes, incidents and calamities, the more they will actually do this and the more they will learn from these situations. It is important to create an environment where people are held accountable for their actions.

8. Enforcement Rewarding desirable and punishing undesirable behavior, without taking away an employee's intrinsic motivation. The better the enforcement, the more people will be inclined to do what is rewarded and avoid what will be punished, and the more they will learn from their mistakes.

Soft controls were already present in companies, even before the introduction of hard controls. An example would be the 'walking around' policy of a manager in a supermarket, where his mere presence on the work floor is a motivator for employees to show desirable behavior. With companies growing larger, the focus was set on implementing more and more hard controls. Based on research from Katz-Navon, Naveh and Stern (2005) it can be concluded that there is no linear relation between the number of hard controls and the number of incidents in a company[25]. They, however, found it possible to find an optimum number of hard controls that leads to the lowest chance of incidents[22]. This again implies that organizations should not put their focus solely on hard controls; there is only so much hard controls can do.


An important reason for more research to focus on implementing soft controls is the number of scandals where standard performance tests did not show the true state of a company. Examples of major failures in companies where the focus was entirely on hard controls, and not at all on soft controls, include the Enron case (2001), the Ahold case (2003) and Parmalat (2003)[26][27]. These cases are all examples of companies where performance was valued over ethical behavior within the company, resulting in an environment where fraudulent behavior was possible.

2.1.1 Existing Measurement Methods

Soft controls are not easily visible or measurable and it has therefore proven difficult to capture them[28][29]. However, since they are a key aspect of the actual well-being of a company, measurement methods have been developed and are put into use. These methods consist largely of questionnaires² and observation[17][6]. Kaptein (2010), for example, explains that to measure the ethical culture in his experiment, a five-point Likert-type scale ranging from "1 = strongly disagree" to "5 = strongly agree" was used[4].

The Institute of Internal Auditors (IIA) the Netherlands (2015) describes six techniques that can be used to measure soft controls[20]:

1. Interview Aimed to capture subconscious behavior, dilemmas and decisions.

2. Facilitated workshop Aimed to analyze the group process.

3. Questionnaire An often used method aimed to measure people's opinion towards the implementation of the soft controls. Results can be used for statistical testing to see if relations and differences exist between groups.

4. Behavioral observation Aim is to observe behavioral aspects at the workplace.

5. Consult existing sources Use existing documents like satisfaction questionnaires and incident logs to identify behavioral aspects.

6. Game Aimed to observe behavior and skills in a simulated environment.

² Literature about soft controls uses both the term questionnaire and survey, sometimes interchangeably, other times as separate measurement methods. This thesis will use the term questionnaire to represent both surveys and questionnaires.

2.1.2 Limitations Existing Methods

Although the methods described above are well established, they come with disadvantages. The limitations of the existing methods can be divided into five categories: time, cost, subjectivity, desirability bias and lack of interaction measures. These limitations will be discussed below.

1. Time Most of the existing methods take up a lot of time. Interviews, for example, have to be scheduled, performed, and evaluated. This is a limitation because it prevents the methods from being used on a regular basis and it is not possible to make an immediate assessment of a company's well-being. All existing methods deal with the limitation of time.

2. Cost It costs money to hire an expert. The more time an expert has to spend on the job, the more money this costs. In addition, with questionnaires for example, every employee has to fill in the questionnaire and a company has to be hired to analyze the results and write a report. All the discussed methods deal with the limitation of cost.

3. Subjectivity A rule auditors must live by is that all subjective audit statements must be clearly supported by objective evidence[30]. The problem with soft controls is that finding objective evidence can be a challenge. Although the interviews, for example, are done by experts, they still can only give their point of view and that of the person being interviewed.

4. Desirability Bias With the existing methods, employees are aware that they are being observed or tested. This can lead people to give socially desirable answers and show behavior consistent with what they think is expected of them, also known as the social desirability bias[18]. This can pose an issue because a key aspect of soft controls is what really goes on inside an employee's mind. If it were possible to measure the real feelings or thoughts of employees, specific interventions could be made to let them feel appreciated again. This would result in a healthier company.


5. No Interaction Measures The culture of a company is built from interactions between individuals. Most existing methods evaluate each individual separately, and do not look at the interactions between these individuals. Important information might be overlooked because of this. Although methods like questionnaires do look into the relation between people, from their point of view, the actual interaction is not measured.

The limitations discussed in this section do not mean to imply that the methods currently in place are not valid. The automated method the current study proposes could eventually be used as an addition to existing methods. Solving part of the discussed limitations is a sub-goal of this study. Since the goal of this thesis is to understand organizational culture as a complex social system, the next section will explain more about complex systems.

2.2 Complex Systems

The world around us is filled with complex systems. Examples can be found in biology, physics, finance and in social contexts[31]. A well-known example is the brain and the emergence of consciousness[32]. Complex systems are described as networks made up of components that interact with each other. These networks arise and evolve through self-organization so that they are neither completely random nor completely regular, thereby permitting the development of emergent behavior at macroscopic scales[33]. Self-organization is the dynamical process by which a system spontaneously forms macroscopic structures or behaviors over time. Emergence can be described as the relationship between the system's properties at different scales. A macroscopic property is called emergent when it cannot simply be explained by the microscopic properties. It can be defined as "the process of coming into existence"[34].

2.2.1 Complex Social Systems

Complex social systems consist of behavior that is primarily the result of interacting social agents[35]. Examples are ant colonies, people's traveling patterns and political movements[36]. The workplace has also been stated to be a complex (social) system[37][16]. Employees interact with each other in a dynamic environment, having to deal with goals, tasks and requests while receiving feedback and criticism that can all affect motivation and inner drive. Organizational culture is described as a phenomenon emerging from the interaction between employees[38]. To understand organizational culture it is therefore important to investigate this interaction. Interaction within a company occurs in multiple ways, the most apparent being via telephone, face-to-face and email.

The current study makes use of email data as a source of information. Emails provide a large amount of textual information that shows the relation between senders and receivers. Of the interaction methods within a company, email is the most accessible for research. The goal is to extract information about soft controls from this data. To get a clear image of the well-being of the culture of a company, it is also important to capture how employees feel about these soft controls. Employees might, for example, discuss how upset they are with their manager. If only the frequency of information about soft controls is extracted, this will result in a good evaluation for role-modeling, while it should result in a low score. Frequency alone cannot provide a measure for whether the soft controls are in place or not. To extract feelings or opinions, sentiment analysis will be used. The next section will give a short introduction to the different types of sentiment analysis methods.
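The point that frequency alone can mislead can be illustrated with a minimal Python sketch. The emails, labels and polarity values below are invented for illustration, and the helper `summarize` is a hypothetical name; this is not the scoring used in this thesis.

```python
from statistics import mean

# Hypothetical labeled emails: (soft control mentioned, polarity in [-1, 1]).
# Both the labels and the polarity values are invented for illustration.
emails = [
    ("role-modeling", -0.8),
    ("role-modeling", -0.6),
    ("role-modeling", -0.7),
    ("clarity", 0.5),
]


def summarize(emails, control):
    """Return (mention count, mean polarity) for one soft control."""
    scores = [polarity for label, polarity in emails if label == control]
    if not scores:
        return 0, None
    return len(scores), mean(scores)


count, avg = summarize(emails, "role-modeling")
# Role-modeling is mentioned most often, yet its mean polarity is negative.
print(count, round(avg, 2))
```

In this toy data role-modeling has the highest mention count, but the negative mean polarity shows that a count-based score would point in the wrong direction.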

2.3 Sentiment Analysis

More and more textual information is stored on the internet and this data is relatively easy to access for research; therefore the use of text and sentiment analysis is becoming more popular. Sentiment analysis is the computational treatment of opinion, sentiment and subjectivity in text[39]. The goal of sentiment analysis is to analyze the opinions, attitudes, emotions, sentiments and evaluations of people towards an entity[40]. This entity can, for example, be events, products, specific topics, other individuals, organizations or services. In addition to looking at the sentiment towards an entity, sentiment analysis can also be done on sentence or document level[41].

Sentiment analysis is used on different types of text content. Examples are messages on social media (Facebook posts, tweets, etc.)[42], emails[43], conversations in online chatboxes[44] and opinions on topic-specific webpages (IMDB, etc.)[45]. The difficulty of sentiment analysis is that different types of textual information need different methods to extract the sentiment[46]. Facebook posts and tweets, for example, usually contain little contextual information, which can make it difficult to extract the true meaning of the text[47].

Sentiment analysis can be done in different ways. Figure 2.1 gives an overview of the different sentiment analysis techniques[41]. Two main methods are lexicon based and (machine) learning based[48]. Lexicon based methods use dictionaries with, for example, positive and negative words. Based on these dictionaries, the orientation of the text is calculated. The sentences are, for example, rated with scores on a negative, neutral and positive scale[40].
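A lexicon based rating of this kind can be sketched in a few lines of Python. The two word sets and the function name `lexicon_polarity` are illustrative assumptions, not the dictionaries or code used in this thesis; a real lexicon would be far larger and would handle negation and intensity.

```python
import re

# Tiny illustrative dictionaries (assumed for this sketch only).
POSITIVE = {"good", "great", "clear", "helpful", "happy"}
NEGATIVE = {"bad", "unclear", "upset", "angry", "poor"}


def lexicon_polarity(sentence):
    """Rate a sentence as 'positive', 'negative' or 'neutral'
    by counting hits in the two dictionaries."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("The instructions were clear and helpful."))
print(lexicon_polarity("I am upset about the unclear goals."))
```

A sentence without any dictionary word is rated neutral, which already hints at the low recall of lexicon based methods.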

Lexicon based methods can again be divided into dictionary based and corpus based[41]. Dictionary based techniques depend on finding opinion words and searching the dictionary for their synonyms. Corpus based techniques begin with an initial list of opinion words. Next, other opinion words in a large corpus are found to help in finding opinion words with context-specific orientations.

The advantage of lexicon based methods is that they produce high precision[49]. This means fewer false positives. A limitation of lexicon based methods is that they generally have low recall. Low recall means a higher number of false negatives. It depends on the goal of the study whether this will cause real problems. For example, if a study aims to classify text messages as containing information about a terrorist attack, it can be imagined that more value is given to high recall than to high precision. If, however, the aim of a study is to make statements about whether people liked a certain movie, it is probably more important that every piece of information gathered is actually about the movie. So in this case high precision is preferred over high recall. Another limitation is that the same word can have a different meaning depending on context. The word 'unpredictable', for example, can be negative when a sentence is about using a gadget and positive when talking about the plot of a book[50].
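The trade-off between precision and recall can be made concrete with the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The Python sketch below uses invented message ids, and the function name `precision_recall` is an assumption for illustration.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from two sets of item ids:
    the items a method labeled positive and the truly positive items."""
    tp = len(predicted & actual)   # true positives
    fp = len(predicted - actual)   # false positives
    fn = len(actual - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: four messages are truly relevant (ids 1 to 4),
# and a cautious method flags only two of them.
precision, recall = precision_recall({1, 2}, {1, 2, 3, 4})
print(precision, recall)
```

Here everything the method flags is correct (no false positives), but half of the relevant messages are missed, which is the high precision, low recall pattern described above.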

Learning based methods involve the classification of text[40][41]. Learning based methods can again be divided into supervised and unsupervised learning[50]. Supervised learning methods involve a training set and a test set to distinguish between different text sentiments based on labels[51]. Examples of supervised learning are Bayesian classification[52] and Support Vector Machines (SVM)[53]. An example of unsupervised learning would be an algorithm that extracts all combinations of adverbs and adjectives if they follow each other in a sentence. Limitations of the learning based method are that a large enough training set is needed, which is not always possible, and that it can be inefficient and biased[54].
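As a rough illustration of supervised, Bayesian classification, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing in plain Python. The training sentences, labels and function names are invented for this example; it is not the classifier of the cited work, and a training set this small is far below what a real application would need.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocabulary = set()
    for text, label in labeled_texts:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data for illustration only.
train = [
    ("thanks for the great support", "positive"),
    ("really happy with the clear plan", "positive"),
    ("this delay is bad and frustrating", "negative"),
    ("upset about the poor communication", "negative"),
]
model = train_naive_bayes(train)
print(classify("great plan thanks", model))
print(classify("poor and frustrating delay", model))
```

In practice the labeled data would be split into a training part and a held-out test part, so that the classifier is evaluated on text it has not seen.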

The sentiment analysis methods discussed above are not mutually exclusive, and combining different methods can produce better results. Zhang et al. (2015), for example, used a combination of lexicon based and learning based methods to analyze Twitter data[55]. First, entity-level sentiment analysis was used to get high precision. Second, additional tweets were gathered in an automated way to improve recall. A classifier was trained to assign polarities to the entities in the new tweets. The lexicon based method produced the training data for the learning based method. Their method works well on Twitter data because the text in tweets contains a lot of made-up or slang words, which would not have been picked up by solely using a lexicon based method.

More complex methods like deep learning are also being used for sentiment analysis[56]. Besides analyzing the actual words in a sentence, emoticons have also been the subject of sentiment analysis to extract meaning from messages[57]. Most sentiment analysis programs extract whether a piece of text has a positive or negative tone. The Natural Language Toolkit (NLTK), for example, provides interfaces for Python to analyze text with[58]. The toolkit consists of a suite of open source program modules, tutorials and problem sets[59]. Another sentiment analysis program is the IBM Tone Analyser. It can be used to extract more than just negative and positive emotions from text[60]. The Tone Analyser can be used to detect three types of tones: social tendencies, emotions and writing style. The decision for which method to use depends on the type and amount of data available and what information should be extracted.

The current study proposes an automated method to measure soft controls. The proposed method aims to capture meaning about the culture of a company from text. Sentiment analysis is an appropriate choice because, in addition to providing methods to extract information about soft controls, it can also give the polarity of that information. Not only will the proposed method give insight into whether certain soft controls are discussed in emails, it will also capture the attitude towards those soft controls. It is a method that, as far as the author knows, has not been used for the measurement of soft controls before. As described above, soft controls control an employee's convictions, values and personality[12]. The use of sentiment analysis on email data is less obtrusive than existing methods to measure soft controls. The analysis and evaluation can all happen without being directly visible to the employees. If this automated and less obtrusive method works, it will be more time and cost efficient than existing methods.

Figure 2.1: Overview Sentiment Analysis Techniques (based partly on [41], p.1095).

2.4 Network Analysis

The social interaction between employees plays an intrinsic part in the organizational culture of a company[10]. To get a better understanding of organizational culture as an emergent property of the interaction between employees, the current study explores this interaction from a network perspective. Network analysis is a well established method to study social interaction[61][62]. Box 2.1 provides the general terminology used in network analysis. These terms will be used throughout this thesis. Section 2.4.1 will give an introduction to clustering algorithms. Determining the existence of clusters and whether they relate to organizational culture will be part of the network analysis phase of this study.

2.4.1 Clustering

To get insight into how a network is structured and whether communities exist, clustering algorithms can be used[63]. Barabasi explains that a community is a group of nodes that have a higher likelihood of connecting to each other than to nodes from other communities[64]. To detect communities different clustering methods are available. With graph partitioning, for example, a network is cut into groups that are least connected to each other, using a predefined number and size of communities. Most community detection methods, however, do not fix the number or size of communities a priori, so an arbitrary number of groups can be found.

Two hypotheses in community detection are that (1) each community corresponds to a connected sub-graph (connectedness), and (2) nodes in a community are more likely to connect to other members of the same community than to nodes in other communities (density).
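The density hypothesis can be illustrated with a minimal sketch. The toy graph, the partition into two communities and the function name `internal_external_density` are assumptions made purely for illustration and are not taken from the analyses in this thesis. The sketch compares the fraction of realized edges within communities to the fraction of realized edges between communities.

```python
from itertools import combinations

def internal_external_density(edges, communities):
    """Return (internal, external) edge density for a partition of an undirected graph.

    edges: iterable of (u, v) pairs; communities: dict mapping label -> list of nodes.
    Assumes at least one within-community pair and one between-community pair exist.
    """
    membership = {node: label for label, nodes in communities.items() for node in nodes}
    edge_set = {frozenset(edge) for edge in edges}
    # Key: do both nodes share a community? Value: [realized edges, possible pairs].
    counts = {True: [0, 0], False: [0, 0]}
    for u, v in combinations(sorted(membership), 2):
        same = membership[u] == membership[v]
        counts[same][1] += 1
        counts[same][0] += frozenset((u, v)) in edge_set
    internal = counts[True][0] / counts[True][1]
    external = counts[False][0] / counts[False][1]
    return internal, external

# Hypothetical toy network: two triangles (one missing an edge) joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("c", "d")]
toy_communities = {"left": ["a", "b", "c"], "right": ["d", "e", "f"]}
internal, external = internal_external_density(toy_edges, toy_communities)
print(f"internal density: {internal:.2f}, external density: {external:.2f}")
```

A partition that satisfies the density hypothesis should show a clearly higher internal than external density, as is the case for this toy example.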

There is a large number of algorithms available for clustering or community detection, and it can be difficult to choose the right one[65][66][67]. One example of a division in community detection algorithms is disjoint versus overlapping[68]. Disjoint algorithms, as a rule, produce separate clusters that do not have any overlap. Overlapping algorithms do allow for overlap between clusters.


Figure 2.2: MCL example (divisive clustering algorithm)

Another division is agglomerative versus divisive. Agglomerative clustering algorithms implement a bottom-up principle where each node is considered to be its own cluster and these clusters are merged together over multiple steps until some threshold or condition is met[69]. Divisive clustering uses a top-down principle where the entire network starts as one cluster and with each step connections are removed so that separate clusters appear. Figure 2.2 shows an example of the divisive clustering algorithm Markov Clustering (MCL)[70]. The original network is shown on the left, and over multiple iterations the weakest connections are removed according to a set of rules³. The network on the right is the end product, consisting of multiple separate clusters.

Another division between algorithms is those designed to find communities in static networks versus dynamic networks[71]. The advantage of using dynamic over static network analysis is that it represents real world behavior more intuitively, because time is taken into account. With static methods only a difference between networks can be obtained, whereas with dynamic networks a pattern over time can be recognized[72].

The goal of this study is to understand the emergence of organizational culture from a complex system perspective. To understand a phenomenon it is important to know what properties it is made of. Network analysis can be used to study interaction, and to study what properties a network is made of. The network will be built from information about the users and their interaction via email. How this is done will be explained in section 3.3.


Box 2.1: General Terminology Network Analysis

The most important terms and their notation used in network analysis and network modeling from graph theory are the following[61][73][74][64]:

Graph The representation of a network consisting of nodes and edges. A graph is usually denoted as G.

Nodes Represent the people or other entities in the network. N(G) is the set of all the nodes in graph G, and $n_i$ is the i-th node.

Edges Represent the connections between the nodes. E(G) is the set of all the edges in graph G.

Directed Graph The edges between nodes in a directed graph can be asymmetrical. This means that an edge $e_{ij}$ is not necessarily equal to $e_{ji}$.

Undirected Graph The edges between nodes in an undirected graph are symmetrical: $e_{ij} = e_{ji}$.

Weighted Graph As opposed to an unweighted graph, where the presence of an edge is given the value '1' and the absence of an edge the value '0', in a weighted graph each edge has a weight. This weight $\omega_{ij}$ represents the strength of the edge between node i and node j.

Communities Groups of nodes that have a higher likelihood of connecting to each other than to nodes from other communities.

Density An indicator for the general level of connectedness of a graph. The density is given by the number of edges in a graph divided by the number of possible edges. For an undirected graph the density is given by $D = \frac{2\,|E(G)|}{N(N-1)}$, where $|E(G)|$ is the number of edges and $N$ is the number of nodes.

Shortest Path The path from node i to node j that has either the least amount of edges or the least amount of combined weight of the edges.

Average Path Length The average number of edges along the shortest paths for every combination of nodes.

Centrality Properties on the node level relating to the structural importance of a node in the network. There are three main types of centrality:

1. Degree The number of connections a node has. $d(i) = \sum_{j} \omega_{ij}$, where $\omega_{ij}$ is the strength of the connection from node i to node j. The in-degree is given by the number (or weight) of edges from all other nodes to this node, the out-degree by the number (or weight) of edges from this node to all other nodes.

2. Closeness The total distance of a node to all other nodes. $c(i) = \sum_{j} d_{ij}$, where $d_{ij}$ is the number of edges in a shortest path from node i to node j.

3. Betweenness The number of shortest paths that pass through a node. $b(i) = \sum_{j,k} g_{jik}$, where $g_{jik}$ counts the shortest paths from node j to node k that pass through node i.
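The measures in Box 2.1 can be made concrete with a small sketch in plain Python. It assumes an unweighted, undirected and connected toy graph without self-loops, so that every edge weight equals 1 and the degree is simply the number of neighbours. The function names and the toy edge list are hypothetical and chosen only for illustration. Betweenness is computed as the plain count of shortest paths through a node over unordered node pairs, following the definition in the box rather than a normalized variant.

```python
from collections import deque
from itertools import combinations

def bfs(adj, source):
    """Return (dist, sigma): hop distances and number of shortest paths from source."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:              # first visit fixes the shortest distance
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma

def network_measures(edges):
    """Density, degree, closeness and betweenness as defined in Box 2.1."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    n = len(nodes)
    n_edges = sum(len(neighbours) for neighbours in adj.values()) // 2
    density = 2 * n_edges / (n * (n - 1))
    degree = {i: len(adj[i]) for i in nodes}
    dist, sigma = {}, {}
    for i in nodes:
        dist[i], sigma[i] = bfs(adj, i)
    # Total distance to all reachable nodes: a smaller value means a more central node.
    closeness = {i: sum(dist[i].values()) for i in nodes}
    betweenness = dict.fromkeys(nodes, 0)
    for j, k in combinations(nodes, 2):
        if k not in dist[j]:
            continue                       # j and k are not connected
        for i in nodes:
            if i in (j, k) or i not in dist[j] or k not in dist[i]:
                continue
            if dist[j][i] + dist[i][k] == dist[j][k]:
                betweenness[i] += sigma[j][i] * sigma[i][k]
    return density, degree, closeness, betweenness

# Hypothetical toy graph: a-b, b-c, c-d and b-d.
density, degree, closeness, betweenness = network_measures(
    [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
)
print(density, degree, closeness, betweenness)
```

In this toy graph node b has the highest degree, the smallest total distance to the other nodes and the only non-zero betweenness, which matches the intuition that it is the structurally most important node.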


2.5 Relevant Research Examples

This study aims to add valuable information to the existing research on organizational culture and the use of an automated method and network analysis to measure organizational culture. In doing so it is important to understand the research that has already been done in these areas. This section therefore will provide some examples of relevant research in the area of sentiment and network analysis, some of which also use email data to measure organizational culture. In addition it will be explained what the value of the current study is in addition to this research. Most research discussed below makes use of the Enron Corpus, as does the current study. This large email dataset is freely available to everyone[75].

2.5.1 Text and Sentiment Analysis

The Enron corpus has primarily been used for natural language processing (NLP) research[76]. Mohammad (2012), for example, used the Enron Corpus to show that women tend to communicate more on a joy-sadness axis, and men more on a trust-fear axis[77]. Another implementation of text analysis on email data is categorizing emails as being spam or not[78].

Jemmet and Rotundo (2016) used the CoreNLP sentiment analysis method to extract the overall sentiment of sentences in the Enron email corpus[79]. One of their goals was to compare the change in sentiment to the change in stock price over time. To get the sentiment score per time period they averaged this score. The authors conclude that there is a weak correlation between the sentiment and stock prices over time.

According to Gilbert (2012) "Email is the performance of power and hierarchy captured in text"[80]. In his study Gilbert used the Enron email corpus to extract phrases that signal company hierarchy. Gilbert concludes that the combined use of logistic regression and a Support Vector Machine (SVM) is successful in extracting phrases that signal hierarchy. The author states that a limitation of this study is that the results cannot be generalized to other companies since the organizational culture at Enron was a specific (and unhealthy) one.

Moniz and De Jong (2014) state that in addition to the opinion of the media and consumers, employee satisfaction about companies should be analyzed to rate company performance[81]. For their study they used employee reviews from multiple companies on the following subjects: culture, work/life balance, senior management, benefits and career opportunities. A lexicon based sentiment analysis method called General Inquirer was used to calculate the polarity of text. This was done by combining all reviews per company and transforming the count of all positive and negative words into a compound score per company. They conclude that automated sentiment analysis based on employee reviews can provide novel insights into company culture.

In a conference paper, Adruce et al. (2016) propose an opinion analysis method[82]. The authors state that existing sentiment analysis methods are not effective enough in extracting the opinion of employees on organizational culture. For their study they used a set of a hundred opinions on company culture from different sources (universities, telecom companies, etc.). Their method was a hybrid of rule-based and lexicon based sentiment analysis using a customized corpus. The authors conclude that their method performed well and can be used for electronic opinion analysis of organizational culture data.

The difference between previous research and the current study lies mainly in the way relevant information is extracted. The studies described above that also made use of email data used the entire dataset. The current study first extracts relevant information with regard to the organizational culture. This information is split into nine categories to be able to give more precise advice about how and where to intervene. Next, sentiment analysis is used to determine the polarity of this information. Instead of solely using this sentiment value, a score is calculated that represents a combination of the polarity and the frequency of relevant information⁴. By using this combination of steps it becomes clearer where the problem lies, and therefore easier to find a solution. Other studies made use of culture specific data from online questionnaires and reviews. In those cases the users are aware that their contributions are being used to determine the health of a company. The current study assumes that email will reveal more than the subjective personal opinion a person is willing to share.

⁴ The details of these methods and the calculation of this score are described in chapter

The current study makes use of dictionary based information extraction, both for the soft control labeling and for the sentiment analysis. The reason for using dictionary and rule based methods is that they are transparent. The problem with learning based methods, called the black box problem, is that although the produced results might be desirable, it is not clear how and why the method works[83]. As the goal of the current study is to gain understanding about organizational culture it is important to know exactly what is happening beneath the surface. In addition, the previous research discussed above that made use of lexicon based methods produced desirable results, whereas the research that used learning based methods did not.
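The transparency argument can be illustrated with a minimal sketch of dictionary based labeling combined with a lexicon based polarity. The two dictionaries, the polarity lexicon and the function name `label_sentence` below are hypothetical placeholders and do not reproduce the dictionaries or the score calculation used in this study. Unlike the manual labeling described later, the sketch allows a sentence to match several categories.

```python
# Hypothetical, heavily shortened dictionaries (assumption: one set of words per category).
SOFT_CONTROL_DICTIONARIES = {
    "clarity": {"clear", "unclear", "guideline"},
    "transparency": {"transparent", "disclose", "hide"},
}

# Hypothetical polarity lexicon: +1 for positive words, -1 for negative words.
POLARITY_LEXICON = {"clear": 1, "transparent": 1, "unclear": -1, "hide": -1}

def label_sentence(sentence):
    """Return (matches, polarity) for one sentence.

    matches maps each matched soft control category to the words that triggered it,
    so every label can be traced back to concrete dictionary entries.
    """
    tokens = sentence.lower().split()
    matches = {}
    for category, words in SOFT_CONTROL_DICTIONARIES.items():
        hits = sorted(words.intersection(tokens))
        if hits:
            matches[category] = hits
    polarity = sum(POLARITY_LEXICON.get(token, 0) for token in tokens)
    return matches, polarity

print(label_sentence("The guideline is unclear"))
print(label_sentence("See you at lunch"))
```

Because the output lists the matched words per category, it is always possible to see why a sentence received a label, which is the property a learning based classifier does not offer out of the box.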

2.5.2 Network Analysis

Sparrowe et al. (2001) investigated the link between network centrality and job performance [84]. One of the goals of their research was to see whether group performance is a function of the structure of informal relationships within groups, and whether this structure can hinder individual and group performance. They made a division between a work-related and an informal network, where the focus in the informal network was on hindrance, like rejections, sabotage and annoyance. The job performance of individuals was positively related to centrality in the work-related network and negatively related to centrality in the hindrance network. The density of the hindrance network was negatively related to group performance. This study shows that interactions on the work floor influence job performance. It could be inferred that the measurement of the type and polarity of interactions between employees can be used as an indicator of how well a company is performing.

Diesner, Frantz and Carley (2006) used the Enron email dataset to study real world organizational crises[76]. They used a social network analytic perspective to extract patterns in the data. Their study concludes that both the volume of emails and the communication pattern change during an organizational crisis. Their research also states that there is a need for a better automated method to extract relevant information from email data that can be used for network analysis. In addition they suggest that future research should take information extracted from the text of the emails into account.

Pathak et al. (2006) proposed a method for analyzing knowledge perception in a social network[85]. The authors used part of the Enron email corpus to manually examine 118 emails and evaluate those emails for sentiment towards company image. They state that in the future this approach could be used to monitor employees' sentiment regarding sensitive topics like company image, and that it is a better alternative to existing methods like intra-company surveys.


of employees from the Enron Corpus[86]. A combination of the number of emails, response time, a clustering coefficient, the degree and a centrality measure were used to rank each employee. Comparing the results with the known real job descriptions of the employees, the researchers conclude that their method can extract who the most important people in an organization are.

Collingsworth, Menezes and Martins (2009) explain it is widely known that email systems form a social network[87]. Properties observed from social network analysis are the result of patterns of social behavior. Observed social behavior may be attributed to organizational stability. Organizational health, therefore, may be detected by examining social network properties. Email activity reflects the change in organizational mood. The challenge is to extract this information. The authors use the Enron email corpus to demonstrate that problems in the organization were apparent as an emergent property of the social network.

Uddin et al. (2011) compared a static to a dynamic network, again making use of the Enron Corpus[88]. For their static network they aggregated all emails sent between July 2001 and December 2001, the period in which the organizational crisis emerged. For the dynamic network they used the emails of the same period, but for each day separately. The authors conclude that for both the static and the dynamic networks, as the organization goes into crisis, a few actors become prominent and central in the email communication networks. In addition they explain that dynamic social network analysis still needs a lot more research: currently only the metrics of the network structure can be discussed, not the consequences of network structure for function.
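The difference between a static and a dynamic representation can be sketched in a few lines of Python. The email records, the names and the two helper functions are hypothetical and serve only to illustrate the idea of aggregating all emails into one weighted network versus keeping one network per day. They are not a reconstruction of the method of Uddin et al.

```python
from collections import Counter, defaultdict

def static_network(emails):
    """Aggregate all emails into one weighted, directed edge list."""
    return Counter((sender, receiver) for sender, receiver, _day in emails)

def dynamic_networks(emails):
    """Build one weighted, directed edge list per day."""
    snapshots = defaultdict(Counter)
    for sender, receiver, day in emails:
        snapshots[day][(sender, receiver)] += 1
    return dict(snapshots)

# Hypothetical records: (sender, receiver, ISO date).
toy_emails = [
    ("alice", "bob", "2001-07-02"),
    ("alice", "bob", "2001-07-03"),
    ("bob", "carol", "2001-07-03"),
]

static = static_network(toy_emails)
dynamic = dynamic_networks(toy_emails)
print(static)
for day in sorted(dynamic):
    print(day, dict(dynamic[day]))
```

The static network only shows that alice wrote to bob twice, while the daily snapshots also show when each email was sent, which is the information needed to recognize a pattern over time.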

The main difference between existing research and the current study is the information used to build the networks. Previous research, as mentioned before, made use of the entire (Enron email) dataset, whereas the current study builds a network containing only relevant information about each soft control category. In addition, most studies used network analysis to extract hierarchy, whereas the current study aims to locate both influential groups of people and influential individuals. The current study will also examine the properties of the networks to find out more about the emergence of organizational culture. Finally, previous research focused on static models where all emails are aggregated over time. The current study will do so too, but in addition a temporal attribute will be used to detect patterns over time.


2.5.3 Combined Sentiment and Network Analysis on Organizational Culture

A study published in a conference proceeding by Barahona and Sun (2017) analyzed properties of the social network based on sentiment analysis on the Enron Corpus[89]. In their study Barahona and Sun labeled the content of a sample of a thousand pre-processed emails as being negative, neutral or positive with regard to the organizational culture. Positive emails would, for example, have content about socialization opportunities. Negative emails would, for example, include content about delays in processes and financial audit concerns. They trained a Support Vector Machine (SVM) classifier on these labels. This resulted in a large number of emails labeled as neutral, for which the precision and recall were both good. The negative and positive categories, however, did not produce the same results. The authors built a social network representing the employees as nodes and the emails as edges. They built three separate networks, one per polarity category (negative, neutral and positive). Barahona and Sun used centrality measures to get information about the most important people in each network. The authors conclude that the visualization of the spread of sentiment through a social network can provide meaningful information for organizational purposes.

The study by Barahona and Sun has many similarities with the current study. Both make use of the Enron email corpus, sentiment analysis and social network analysis to study organizational culture. The methods and definitions used are, however, very different. It is important to state these differences to understand what the current study adds to existing research. The differences are the following: (1) The operationalization of organizational culture is different. Barahona and Sun use the entire dataset as part of company culture, where emails can be negative, neutral or positive with regard to the organizational culture. The current study first extracts emails containing information about organizational culture. Each email can contain information about one or more of the eight soft controls, or about organizational culture in general. (2) Barahona and Sun use the entire body of the emails as input for their sentiment analysis. The current study looks at the information about organizational culture on sentence level. (3) Barahona and Sun use a supervised learning method (SVM) on the entire dataset. The current study made use of a lexicon and rule-based sentiment analysis method only on the extracted sentences. (4) The result of the sentiment analysis by Barahona and Sun is a categorization in one of the three polarity categories. The current study makes use of a soft control score calculation based on the frequency and sentiment of organizational culture related sentences⁵. This will result in each email having a soft control score per soft control category that it contains information about. (5) Because of these differences, the network representations are also different. Although both studies represent the employees as nodes and the emails as edges, the values these nodes and edges are given and the information included in the network are different. (6) Finally, Barahona and Sun represent the interactions as a static network where all emails are aggregated over time. The current study will make use of both static and dynamic network analyses to also capture the change in organizational culture over time.

2.6 Summary

The goal of this study is to gain understanding about organizational culture from a complex system perspective. In doing so an automated method to measure soft controls is proposed. This method makes use of lexicon based labeling and lexicon and rule-based sentiment analysis. In addition, network analysis is used to evaluate the interactions within the company.

The eight soft controls that the proposed method aims to extract from the Enron email corpus are the following: clarity, role-modeling, commitment, enabling environment, transparency, openness, accountability and enforcement. Current methods to measure these soft controls deal with limitations including time, cost, subjectivity, the desirability bias and a lack of interaction measures. By proposing an automated and less obtrusive method including the extraction of relevant information, sentiment analysis and network analysis, these limitations will (for the most part) be dealt with.

This chapter ended with a description of relevant research in the area of organizational culture, sentiment and network analysis. What becomes apparent from the research discussed above is the need for an automated method to extract information about organizational culture. In addition, it has been shown that combining sentiment analysis and social network analysis can give insight about organizations. Most of the discussed studies made use of the entire (cleaned) Enron corpus. The current study will add to existing experiments by building an automated method to extract only relevant information about organizational culture. That information will be used to build multiple networks. With the use of centrality and clustering methods this method could eventually help detect single employees or groups of employees who are key players in creating a negative organizational culture. Detecting these people or communities in time can lead to intervention and therefore to the prevention of incidents. In addition, the current study not only looks at the static networks where all information is aggregated, but also at the dynamic change over time. As far as the researcher knows, no other studies have used the combination of methods applied in the current study.

⁵ Section 3.2.3 will give a more detailed description of this soft control score calculation.


Chapter 3 Case Study

This chapter will explain the analyses done for this study. First the data, the Enron email corpus, will be described. Next, more information about the data cleaning phase will be given. This is followed by a description of the method used to extract the relevant information about soft controls, including the sentiment analysis method. After this the results will be given. It is important to note that multiple (similar) methods were used and evaluated, after which adjustments were made which were again evaluated. The method described below is the end result of this process. This chapter will end with the network analysis, where an explanation of the construction of the networks will first be given. Descriptives of the network properties will be provided, different measures will be used to determine the most central employees and clustering algorithms will be used to detect communities within the networks. Also, a closer look at the key players in the Enron scandal will be taken. Finally the results will be discussed.

As described by Sayama (2015), to understand, study and/or model a complex system the following steps are important[33]:

1. State the goal or questions that need to be answered.

2. Describe the microscopic scale of the system.

3. Describe the structure of the system.

4. Describe the possible states of the system.


For the current study the goal is to understand organizational culture from a complex system perspective and to propose a method that measures soft controls. The microscopic scale of the system translates to the employees, emails and the information those emails contain. The macroscopic phenomenon described as organizational culture emerges from the interaction between employees via email. The set of possible states of the system is given by the possible states of well-being of the organizational culture in a company. In the current study this is operationalized as the soft control score, which will be explained in section 3.2.3. Finally, the dynamical rules by which the system changes over time translate to the organizational culture changing over time, measured by the change in the frequency and sentiment of relevant emails. In addition, network structure and properties can cohere with the change of the cultural states over time. The following sections will explain more about the operationalization of these steps, and how they can help gain understanding about organizational culture as an emergent phenomenon. This section will start with a description of the data and the initial cleaning of this data.

3.1 Data Description and Data Cleaning

The data used in this study is the email dataset from Enron[90]. A description of this dataset is given below. All analyses will be done using Python (for a list of the used packages see Appendix A). To provide an absolute value for the status of a company's organizational culture, it is useful to compare different datasets. Unfortunately freely downloadable email data is scarce, and the Enron dataset was the only available dataset. However, comparing the culture between companies is not the only valid method for determining the state of the organizational culture. It might even be more interesting for a company to know how its organizational culture changes over time. Using a change over time as an alert that something in the organizational culture is changing could help to intervene in time. In addition, the definition of organizational culture can differ between companies. Detecting the change in the soft controls over time is therefore a useful method to capture the change in the well-being of a company.


3.1.1 Enron Dataset

A famous case study of fraud is the Enron Corporation case [91]. The energy trader went from being regarded as an innovative company to a company known for corruption and mismanagement[92]. In December 2001 Enron filed for bankruptcy[26]. To maintain an investment grade credit rating the company had made special arrangements and partnerships to keep its debts off the balance sheet. Enron's auditor, Arthur Andersen & Company, at the time one of the five leading accounting firms, should have been a line of defense against deception or fraud. According to Hosseini and Mahesh (2016) it is still not certain whether the auditors failed to notice the bad behavior or were part of it[26]. The audit firm went out of business after the incident.

Enron failed due to unethical practices[26]. The company is said to have had enough hard controls in place, but lacked the soft controls. If an internal auditor had evaluated these soft controls, the problems might have been identified in time[93].

The email data for this case study has been made available to the public for analysis, for example as part of a Kaggle machine learning challenge[90]. The dataset consists of approximately 500,000 emails from the inboxes of 148 employees (mostly senior management) of the Enron Corporation, obtained by the Federal Energy Regulatory Commission[94]. An example of an email is given in figure 3.1.

One of the advantages of making use of the Enron dataset is that it is known, to some extent, which people were responsible for the fraudulent behavior. During the network analysis part of this study this information can be used to see if the current method can identify these people. If this is indeed possible, the method can be of great value in showing where a company should intervene to restore the health of the organizational culture.

The people held most responsible for the downfall are Kenneth Lay, former chairman and CEO, and Jeffrey Skilling, former CEO and COO[96]. Other key players are Jeffrey McMahon and Ben Glisan, former Enron treasurers, Andrew Fastow, chief financial officer, and Ken Rice, former Enron executive and President of Enron's broadband service. Two other people who might be interesting to investigate during the network analysis stage are Sherron Watkins and Vince Kaminski. They are said to have raised concerns about the bad practices going on in Enron[97].


Figure 3.1: Example of email from the Enron dataset[95]

3.1.2 Data Cleaning

Data is not always provided in a clean and clear fashion. It usually contains noise and irrelevant information. The first step in analyzing the data is to remove this noise and build a data frame that is easy to work with. This is an important step for sentiment and text analysis[98]. The textual data used for sentiment analysis is usually unstructured. To extract the desired features or information the data needs to be presented in a way that makes it easier to analyze. What the pre-processing entails depends on the goal of the research and the data itself[98].

During the data cleaning phase the information necessary for further analysis was extracted. The cleaned data contained all the information about the emails present in the mailboxes of the 148 users, including the date and time, the name and email address of the sender and receiver(s), and the text of each email and of each sentence separately.


Time Window

The text of the emails was processed so that only the original message or original response was used as the text, and not the text of previously sent emails. This was done because keeping this text would result in duplicate sentences in the next steps. All text was converted to lower case so that no labeling mistakes would be made at a later stage¹. The dates of the emails were checked for outliers and odd values. Emails that were sent after 2002, the year data collection took place, were removed. In addition, emails with dates that did not make sense (year 0002, for example) were excluded from the dataset. In total 352 of the 517401 emails were removed, which is 0.068 percent. This is unlikely to influence the results in a material way. To take the change over time into consideration the resulting dataset was sorted into 49 time steps based on all year-month combinations present². Due to the number of emails present per time period it was decided to only look at the three and a half year range from January 1999 to July 2002, resulting in 43 time steps.
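The time window can be sketched as follows. This is a minimal illustration in plain Python, assuming that each email date is available as a `datetime.date` object. The constant names, the helper functions and the toy dates are hypothetical, and the sketch only shows the final January 1999 to July 2002 window, not the intermediate step with 49 time steps.

```python
from datetime import date

# Window taken from the text: January 1999 up to and including July 2002.
WINDOW_START = date(1999, 1, 1)
WINDOW_END = date(2002, 7, 31)

def time_step(day, start=WINDOW_START):
    """Zero-based index of the year-month combination that `day` falls in."""
    return (day.year - start.year) * 12 + (day.month - start.month)

def assign_time_steps(days):
    """Map every date inside the window to its time step and drop all other dates."""
    return {day: time_step(day) for day in days if WINDOW_START <= day <= WINDOW_END}

# Hypothetical dates, including an odd value (year 0002) and a date after 2002.
toy_dates = [date(2, 1, 1), date(1999, 1, 15), date(2000, 5, 3),
             date(2002, 7, 30), date(2004, 2, 1)]
steps = assign_time_steps(toy_dates)
n_steps = time_step(WINDOW_END) + 1
removed_pct = 100 * 352 / 517401  # share of emails removed because of their date

print(steps)
print(n_steps, round(removed_pct, 3))
```

The number of time steps and the percentage of removed emails computed this way agree with the 43 time steps and 0.068 percent reported above.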

Sentence Reduction

The following methods were used to reduce each sentence so that it contains only the important information. All punctuation apart from brackets, colons, question marks and exclamation marks was removed. This makes the use of other methods more straightforward at a later stage. The brackets, colons, question marks and exclamation marks were kept because they might provide information about the sentiment of the sentence. In addition, stop words, like 'to', 'this' and 'is', were removed. Finally, the words in the updated dictionaries were reduced to their lemmas. Lemmatization is a method to reduce a word to a common base form[69]. The lemma of 'learning', for example, is 'learn'.
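A minimal sketch of these reduction steps is given below. The stop word set and the lemma lookup table are tiny hypothetical stand-ins for the resources a real pipeline would take from an NLP library, and applying the lemma step to the sentence tokens (rather than only to the dictionary words) is an assumption made for the sake of the example.

```python
import re

# Hypothetical, heavily shortened resources (assumptions for illustration only).
STOP_WORDS = {"to", "this", "is", "the", "a"}
LEMMAS = {"learning": "learn", "emails": "email"}

# Remove everything except word characters, whitespace and the punctuation
# that is kept: brackets, colons, question marks and exclamation marks.
PUNCTUATION_TO_REMOVE = re.compile(r"[^\w\s()\[\]{}:?!]")

def reduce_sentence(sentence):
    """Lower-case, strip most punctuation, drop stop words and look up lemmas."""
    text = PUNCTUATION_TO_REMOVE.sub("", sentence.lower())
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return [LEMMAS.get(token, token) for token in tokens]

print(reduce_sentence("This is learning, to be clear!"))
print(reduce_sentence("Is this (really) the plan?"))
```

Kept punctuation stays attached to its neighbouring word in this sketch. A later step could split it off if the sentiment method expects separate tokens.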

Enron Only

To make sure only information from Enron employees is used to label the data, all emails sent or received by non-Enron employees were removed. The resulting set contained only the emails that were both sent and received by an Enron email address. With this filter, emails containing spam or irrelevant information about other companies are removed. It was decided to only look at the communication within Enron because the focus of this study is on the organizational culture within a company, not between companies. The result is a dataset of 352585 emails, containing only the communication between Enron employees. This dataset forms the basis for the following steps.

¹ For more information about the data structure, see appendix B.
² “2000, May” and “1999, June”, for example.

Relevant Word Extraction

To enlarge the nine dictionaries (discussed in section 3.2.1), a labeling method to extract soft control-specific words was implemented. The Enron dataset does not have any labels concerning soft controls or the culture of the company. Therefore the labels had to be assigned manually. To accomplish this, a set of 187 sentences was drawn randomly from the Enron-only dataset³. To label the data, each sentence was first labeled either as ’belonging to a soft control’ or ’not belonging to a soft control’. Next, each sentence belonging to a soft control was given a label stating to which of the soft controls it belonged, or whether it belonged to more general information about the culture of the company. It was decided to let each sentence belong to only one soft control category.

The labeled data was used to extract words that are characteristic of each of the nine soft control categories. The words that are more frequent in a soft control category than in the no-soft control category were extracted. All words with a frequency relation above 5:1 that also appeared relevant were added to the corresponding dictionaries⁴. This was done for every soft control category. A selection of the extracted words per soft control is shown in table 3.1⁵.
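The frequency comparison described above could be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the function name `candidate_words` and the label `"none"` for the no-soft control category are hypothetical, raw counts are compared, "above 5:1" is read as strictly greater than 5, and a word that never occurs in the no-soft control category is treated as if it occurred once. The manual relevance check is not automated here.

```python
from collections import Counter

def candidate_words(labeled_sentences, category, baseline="none", min_ratio=5.0):
    """Return (word, ratio) pairs whose count in `category` exceeds
    `min_ratio` times their count in the baseline category.

    labeled_sentences: iterable of (tokens, label) pairs.
    """
    in_category = Counter()
    in_baseline = Counter()
    for tokens, label in labeled_sentences:
        if label == category:
            in_category.update(tokens)
        elif label == baseline:
            in_baseline.update(tokens)

    candidates = []
    for word, count in in_category.items():
        # Assumption: a baseline count of 0 is treated as 1 to avoid
        # division by zero.
        ratio = count / max(in_baseline[word], 1)
        if ratio > min_ratio:
            candidates.append((word, ratio))
    # Highest ratio first; ties broken alphabetically.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))
```

The returned list is only a set of candidates; in the procedure above each candidate still has to be judged for relevance before it is added to a dictionary.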

Unique Emails Only

Because the Enron dataset consists of all the emails from the in-boxes of 148 employees, there are duplicate emails. The total number of Enron-only emails including duplicates was 352585. To use the results from the current analysis for network modeling it is important to reduce the data to unique emails only. If duplicate emails were kept, they would be counted more than once, which would distort their weight. Removing the duplicates resulted in a set of 160874 emails. This set contained all the sent and received emails from the 148 users, and in addition a subset of the sent and/or received emails from other Enron employees⁶.

³ The sample size of 187 was based on a combination of heuristics[99], time it takes to label the data and removal of 3 duplicate sentences.
⁴ A frequency relation of 5:1 means that the word appeared five times in the soft control category, versus 1 time in the no-soft control category.

Soft Control            SC-specific word
Clarity                 plan
Role-Modeling           -
Commitment              innovation
Enabling Environment    help
Transparency            inquiry
Openness                quiet
Accountability          rule
Enforcement             recognize

Table 3.1: Example of one extracted characteristic word per Soft Control
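The deduplication described under "Unique Emails Only" can be sketched as below. The thesis does not state how duplicates were identified, so the key used here (sender, receivers, date and text) is an assumption, as are the function name `unique_emails` and the dictionary field names.

```python
def unique_emails(emails):
    """Keep the first occurrence of each email.

    Assumption: two emails are duplicates when sender, set of receivers,
    date and text are all identical.
    """
    seen = set()
    unique = []
    for email in emails:
        key = (
            email["sender"],
            tuple(sorted(email["receivers"])),
            email["date"],
            email["text"],
        )
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique
```

Sorting the receivers makes the key independent of the order in which addresses appear in the different mailbox copies of the same email.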

3.2 Soft Control Measure

3.2.1 Automated Soft Control Labeling

To determine which emails contain information about the soft controls, nine dictionaries were constructed. The dictionaries are used to label the emails as containing information about soft controls. The nine dictionaries correspond to the eight soft controls and a more general dictionary (the dictionary names are given in table 3.2). This general dictionary contains everything that has to do with soft controls or ethical culture but is not part of any of the eight soft control dictionaries. This general category was built so as not to exclude important information about the culture of a company. A first draft of the dictionaries was made based on literature and questionnaires on soft controls. These dictionaries contained words related to the categories⁷. An example of a word in the general dictionary is “culture”.

⁶ This subset contained the emails either sent or received by one of the 148 users. It is a subset because not all of the sent and received emails are included, only those that were part of the in-box of one of the 148 users.

(1) General       (2) Clarity                (3) Role Modeling
(4) Commitment    (5) Enabling Environment   (6) Transparency
(7) Openness      (8) Accountability         (9) Enforcement

Table 3.2: The dictionaries.

These dictionaries were supplemented with the corresponding synonyms and antonyms⁸. The result was examined, and irrelevant and duplicate words were removed. The proposed dictionaries were reviewed by an expert⁹ and modified according to the expert’s opinion. To be able to differentiate between the soft controls it is important that the constructed dictionaries are mutually exclusive, or that a word belongs to at most two dictionaries. There turned out to be no overlap.

The resulting dictionaries were used to label the data as belonging to one of the nine categories. Every email and every sentence in that email was labeled based on whether it contained words from the dictionaries. The labeling was done as follows. All words were given the same weight of 1.0. A score per dictionary was calculated by summing the weights of all occurrences of that dictionary’s words in the text.

In the example box below, suppose a dictionary includes the words ’feedback’ and ’accountable’. The example sentence would get a score of 3.0 for that dictionary, because both words have a weight of 1.0, ’feedback’ is present twice and ’accountable’ once.

Example: soft control word labeling

“You have to work on the feedback we gave you, you will be held accountable if you do not do something with this feedback.”

For each sentence, nine scores were calculated based on the prevalence of words from the nine dictionaries. If the threshold of 2.0 was reached for a dictionary, the sentence was given the label of belonging to that soft control category¹⁰.

⁸ This was done using the Python package PyDictionary.
⁹ Paul Hulshof, Senior Manager at KPMG, is the expert referred to in this thesis.
¹⁰ The reason for this threshold is that after checking a random sample this threshold

Only emails containing at least one sentence with a soft control label were taken to the next analysis step.
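The scoring and thresholding described in this section can be sketched as follows. This is a minimal sketch, not the thesis code: the function names and the two small dictionaries in the usage note are hypothetical, tokens are matched as-is (the lemmatization step is omitted), and "reached" is read as a score greater than or equal to the threshold.

```python
import re

def dictionary_scores(text, dictionaries, weight=1.0):
    """Score a text per dictionary: every occurrence of a dictionary
    word adds `weight` to that dictionary's score."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        name: weight * sum(1 for token in tokens if token in words)
        for name, words in dictionaries.items()
    }

def soft_control_labels(text, dictionaries, threshold=2.0):
    """Return the categories whose score reaches the threshold."""
    scores = dictionary_scores(text, dictionaries)
    return [name for name, score in scores.items() if score >= threshold]

def email_has_soft_control(sentences, dictionaries, threshold=2.0):
    """An email is kept if at least one sentence receives a label."""
    return any(soft_control_labels(s, dictionaries, threshold) for s in sentences)
```

For the example sentence above and a hypothetical dictionary `{"accountability": {"feedback", "accountable"}}`, `dictionary_scores` gives a score of 3.0, so the sentence passes the threshold of 2.0 and receives that label.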

Descriptives

The total number of soft control related emails (according to the labels as discussed above) is 53103. This set contained 15275 unique email addresses. Table 3.3 shows the number of emails, the percentage of emails from the entire set of unique emails, and the percentage of emails from the soft control only set, per soft control category. It shows that clarity and openness are the most frequent labels among the extracted emails. According to the expert most of these results are as expected. Accountability and enforcement, however, were expected to have a higher prevalence because they are communicative soft controls. This means that they can be expressed on paper or in words. Role modeling, on the other hand, is a soft control that is, for a large part, non-verbal. A possible explanation for the low prevalence of accountability and enforcement related sentences is that the proposed method does not detect all relevant information. Another possible explanation is that these soft controls are really not prevalent in this dataset, and that the lack of information is a sign of an unhealthy organizational culture (with regard to accountability and enforcement).

Soft Control            Emails    Percentage of    Percentage of
                                  all emails       SC-only emails
General                 19092     11.9%            36%
Clarity                 23833     14.8%            44.9%
Role Modeling           15330     9.5%             28.9%
Commitment              17120     10.6%            32.2%
Enabling Environment    18419     11.4%            34.7%
Transparency            16868     10.5%            31.8%
Openness                25191     15.7%            47.4%
Accountability          3627      2.3%             6.9%
Enforcement             12589     7.8%             23.7%
