The dark side of information visualization


An explorative study of the realm of social media networks, visually represented using node-link diagrams

Master thesis Computer-mediated Communication
University of Groningen
25-12-2017

Supervisor
Dr. L.M. Bosveld-de Smet

Author
Richard Hoving
s2801094
Croeselaan 131b
3521BL Utrecht
r.hoving.4@student.rug.nl


Preface

A dark side as in an area still left unexplored like in open world videogames and a dark side as in the result of malpractice like in Star Wars. The following exploration of the realm of visually representing social media networks in the form of node-link diagrams proved to be more complex and challenging than initially anticipated. While I accepted this challenge with great pleasure, completing it would not have been possible without Dr. L.M. Bosveld-de Smet, which leads to the principal reason for the existence of this preface: Thank you, Dr. Bosveld-de Smet, I thoroughly enjoyed our meetings and I am grateful for all of the effort that you put into guiding me and helping me complete this thesis; it is much appreciated.


Abstract

Visualizing information in a meaningful fashion is difficult. Visualizing networks of social media applications is no exception. Visualizing network data of such applications is particularly difficult due to the size and complexity of the data involved. Visualization tools that require no advanced technological knowledge or skills, such as NodeXL and Gephi, allow any individual who is interested in social networks to create visualizations thereof. Network structured data can be visualized in different ways; however, simple application of a visualization tool when attempting to visually represent social media network data can easily lead to poor images, i.e. visual representations that are not meaningful. Visual communication consists of an application domain, a graphical domain, and a link between the application domain and graphical domain. Information visualization describes a process where some application domain, the data, is transformed into some graphical domain, the picture, in order to amplify cognition. The transformation process represents the link between the application domain and graphical domain. This thesis explores how the visualization process is commonly applied to the network structured data of social media applications visually represented by node-link diagrams. The exploration considers the following:

(1) The inventorization of retrievable social media network data

(2) The characterization of node-link diagrams representing social media networks

(3) An examination of how social media network data are visually represented in node-link diagrams

(4) An evaluation of the quality of node-link diagrams representing social media networks

The overall conclusion is that there is a need for context and that the link between the application domain and graphical domain needs to be made explicit. If meaningful node-link diagrams depicting social media networks are to be created, it is imperative that people consider and record the intention behind and the context for a visual representation and respect the general rules of information visualization.


Content

1. Introduction

2. Theoretical background

2.1. Visual communication

2.1.1. Three aspects of visual communication

2.1.2. Good and bad visual communication

2.1.3. Views on which aspects are important in good visual communication

2.2. Information visualization in general

2.2.1. Building blocks of the process of transforming an application domain into a graphical domain

2.3. Information visualization of network structured data

2.3.1. Application domain: components of network structured data

2.3.2. Graphical domain: visual representations of network structured data

2.3.3. Challenges of a node-link diagram

2.4. Visualizing social networks from social media

2.4.1. Application domain: the pre-computer era

2.4.2. Graphical domain: the pre-computer era

2.4.3. Application domain: the post-computer era

2.4.4. Graphical domain: the post-computer era

2.5. Summary

3. Method

3.1. Data collection

3.1.1. Selection of social media applications

3.1.2. Retrieval of social media network application domain entities from JSON data

3.1.3. Retrieval of existing node-link diagrams visually representing social media networks

3.2. Data analyses

3.2.1. Application domain

3.2.2. Graphical domain

3.2.3. Link between application domain and graphical domain

4. Results

4.1. Application domain

4.1.1. Application domain entities collected by descriptions of JSON data

4.1.2. Classification of application domain entities

4.1.3. Distribution of application domain entities

4.1.4. Summary

4.2. Graphical domain

4.2.1. Marks in existing node-link diagrams that visually represent social media networks

4.2.2. Graphical properties in existing node-link diagrams that visually represent social media networks

4.2.3. Summary

4.3. Link between application domain and graphical domain

4.3.1. Graphical domain entities that visually represent a social media user

4.3.2. Graphical domain entities visually representing relationships between users

4.3.3. Graphical domain entities visually representing user attributes

4.3.4. Summary

4.4. Evaluation of the link between application domain and graphical domain

4.4.1. Expressivity of existing node-link diagrams visually representing social media networks

4.4.2. Effectiveness of existing node-link diagrams that visually represent social media networks

4.4.3. Summary

5. Discussion

5.1. Main insights

5.1.1. Facebook’s exceptional position: focus on users, less on content

5.1.2. Limited variation in graphical domain entities

5.1.3. Little information being visually represented in node-link diagrams

5.1.4. Limited expressivity of node-link diagrams

5.1.5. Limited effectiveness of node-link diagrams

5.2. Future research

5.2.1. Subjects based on main insights

5.2.2. Subjects based on the limitations of this thesis

5.3. Recommendations

References

Appendix I - Retrieved application domain entities

Appendix II - Classified application domain entities


1. Introduction

The power of the unaided mind is highly overrated; it is things that make people smart, external aids (Norman, 1993: 43). One of these aids is the picture. An image can serve two purposes (Card, Mackinlay & Shneiderman, 1999: 1). The first is to communicate an idea in a manner that is both insightful and concise. This purpose is related to the well-known statement that a ‘picture is worth ten thousand words’ (Larkin & Simon, 1987), and it presupposes that there is already an idea to communicate. The second purpose is to use images to discover or create an idea, also referred to as ‘using vision to think’. The growing availability of computing power has made it possible to create new pictures, and thus new methods to amplify cognition through the use of pictures. Such computer-generated pictures were first applied in science; nowadays, however, they are widely used. This broader use is referred to as ‘information visualization’ (Card et al., 1999: 1).

The purpose of information visualization is to transform some data into visual form in order to amplify cognition. This is best explained using the example shown in Figure 1.1; this visual representation depicts a map of London’s Soho district on which deaths that occurred as a result of cholera are represented in the form of points and the locations of water pumps are marked with crosses (Spence, 2007: 3).

Figure 1.1: Visual representation of a cholera outbreak in London (Spence, 2007: 3).

In 1854, London dealt with a cholera outbreak. The medical officer for London at the time, Dr. Snow, had the task of controlling this outbreak. By examining the map of the Soho district shown in Figure 1.1, he reached the conclusion that the majority of deaths were concentrated around the Broad Street water pump. He ordered that the pump be shut down, which resulted in a decrease in the number of deaths caused by cholera. The visual representation of the location of cholera deaths and water pumps allowed Dr. Snow to draw a life-saving conclusion, which he may not have been able to reach solely by examining the underlying textual data. This visual representation amplified his cognition.

Information visualization is applied in many domains; one particular domain is social network analysis, which is a field that is concerned with the structure and characteristics of a social network. An example of a visual representation of a social network is the node-link diagram in Figure 1.2 (Spence, 2007: 78), which depicts the social network of department store employees.

Figure 1.2: The social choices among department store employees (Spence, 2007: 78).

The points represent department store employees and the lines represent their social choices. (“Social choices” here refers to recreational partners.) The researcher who created this visual representation, Marbella Canales, intended to explain these social choices. Canales discovered that social choices were based on similarities in age, not on other personal traits (e.g. gender). In Figure 1.2, blue points represent employees younger than 30, yellow points those between 30 and 40, and red points those older than 40.

The exponential growth of social media has made an abundance of complex social network data concerning various objects, various types of relationships, and information about these objects and relationships available. The practice of visually representing the networks of social media applications such as Facebook and Twitter is growing in popularity. Social media networks can easily be visualized using existing tools such as NodeXL and Gephi. Ease of access to social media network data and visualization software tools has stimulated interest in the creation of visual representations of the data provided by such platforms. Figure 1.3 depicts such a visual representation.

Figure 1.3: Abstract depiction of a Facebook network (Huang, 2014).

Although the above visual representation of Facebook network data is aesthetically pleasing, it does not convey meaning in the way that the visual representation of the 1854 cholera outbreak does. Does it amplify our cognition? Which idea is communicated? Creating effective visualizations is not easy, even if the visualizations themselves are aesthetically pleasing (Card et al., 1999: 15).

Leverage works both ways: While information visualization can make people ‘smart’, it can also make them ‘stupid’ as a result of ill-advised visualizations and ‘chart junk’ graphics that make information harder to comprehend (Card et al., 1999: 34). This leads to the problem statement of this thesis.

Visualizing information in a meaningful way is difficult. Visualizing networks of social media applications is no exception. Visualizing network data of social media applications is particularly difficult due to the size and complexity of the data involved. Visualization tools that do not require advanced technological knowledge or skills, such as NodeXL and Gephi, allow any individual who is interested in social networks to create visualizations. Network structured data can be visualized in different ways. Simple application of a visualization tool when attempting to visually represent social media network data can easily produce poor images, i.e. visual representations that are not meaningful.

This thesis explores how the visualization process is commonly applied to the network structured data of social media applications visually represented by node-link diagrams. To do so, this work investigates the following research questions:

(1) What network data are made available by social media applications?

(2) What do the node-link diagrams that visually represent social media networks look like?

(3) How are the network data provided by social media applications visually represented in the form of node-link diagrams?

(4) How meaningful are node-link diagrams visually representing social media networks?

Chapter 2 discusses the key theoretical concepts that are required when conducting such an exploration. In order to outline the ‘blueprint’ of this thesis, the research method employed is described in Chapter 3. Chapter 4 elaborates on the results of this exploration. Finally, the insights that can be identified from the results, proposals for future research, and recommendations are addressed in Chapter 5.


2. Theoretical background

To facilitate the creation of superior visual representations, the general intention of this thesis is to consider the networks of social media applications both in terms of how they are visually represented and how they could/should be.

The scope of this thesis encompasses the network structured data of social media applications, which are visually represented in the form of node-link diagrams. This thesis operates within the realm of visual communication, consisting of some application domain, some graphical domain and a link between the application and graphical domain. The aspects of visual communication in general and perspectives on which components are important in terms of ensuring good visual communication are discussed in Section 2.1.

Within the field of visual communication, there are multiple areas of study; however, the focus of this thesis is on information visualization. Information visualization refers to the process of transforming some application domain (i.e. data) into some graphical domain (i.e. a visual representation) and the building blocks that are required for such a transformation (Section 2.2.). When discussing information visualization in general, it is imperative to consider how data should be transformed into the form of a visual representation.

Depending on what it is intended to describe, an application domain will include different types of data. In this thesis, network structured data represent the application domain. Network structured data can be transformed into a graphical domain in multiple ways. Before focusing on the visual representation of social media networks by means of node-link diagrams, the components of network structured data and the possibilities in terms of visually representing them are described in Section 2.3.

The practice of representing network structured data in visual form through the use of node-link diagrams has been, and remains, of value for many academic fields. It is useful to discuss the components of social network data and how social networks have been visually represented using node-link diagrams in both the pre- and post-computer eras (Section 2.4.).


2.1. Visual communication

Visual communication is communication that involves the use of images (Wang, 1995: 1). As stated by Wang (1995: 1), pictures provide a language that is capable of distinctly conveying spatial concepts such as geometrical shapes, size, position, and spatial relations. She argues that people naturally draw pictures and ascribe meaning to the graphical objects and the spatial relations between them within such pictures when considering a problem or attempting to explain something within a non-graphical domain. Through the process of visualization, the nature of a problem can become more obvious than when it is described textually. Without the graphical domain, tackling problems could take a considerably greater amount of work and, even then, certain features may never be discovered; this is what makes a picture worth ten thousand words (Larkin & Simon, 1987).

2.1.1. Three aspects of visual communication

Visual communication consists of three aspects: a graphical domain (pictures), an application domain (the problem[s]), and a link that associates the graphical domain with the application domain (the semantics of a picture) (Wang, 1995: 3).

Figure 2.1: The three aspects of visual communication (Wang, 1995).

The link is important in visual communication as, without it, a picture will only contain spatial information. The link transforms spatial information into information about the application domain, thus making pictures helpful in communication. The link can be viewed from two perspectives. First, it can be viewed as originating from the graphical domain towards the application domain; this is what Wang (1995: 4) refers to as ‘interpretations’. From this perspective, the link specifies how entities within the graphical domain should be related to objects and their relationships within the application domain. Second, the link can be considered as originating from the application domain towards the graphical domain; this perspective is referred to as ‘picture specifications’. The scope of this thesis only encompasses the ‘interpretations’ perspective; the second perspective is not discussed further.

2.1.2. Good and bad visual communication

There are good and bad pictures. Whether a picture is good depends on the task, the situation, and the user it is intended for (Spence, 2007). As described by Spence (2007: 32), an example of a bad picture is the representation of an old aircraft altimeter shown in Figure 2.2.

Figure 2.2: Abstract representation of an old aircraft altimeter.

The smallest of the three indicators represents the number of tens of thousands of feet, the second largest indicator identifies the number of thousands of feet, and the largest indicator shows hundreds of feet. This altimeter could be interpreted quite rapidly by a familiar pilot as indicating an altitude of 13,460 feet. The major problem, however, would manifest when the gaze of a pilot shifts away from the altimeter, perhaps to view another indicator, to look out of a window, or to converse with the co-pilot. When the pilot’s gaze returns to the altimeter, it may be possible that a change in height would not be noticed, as illustrated in Figure 2.3.

Figure 2.3: Representations of an old altitude meter.

At first sight, the representations of the altimeters in Figure 2.3 appear identical. However, there is a difference of 10,000 feet between the heights that the altimeters report because the two smaller indicators have switched places. The smallest indicator in image A indicates a height of approximately 30,000 feet, while the same indicator in image B indicates a height of approximately 20,000 feet. If the intention of a representation is to make a change noticeable, it is essential that such a change is always noticed. The faulty representations of such older altimeters have been responsible for many accidents: sudden changes in altitude went unnoticed because the representations of different altitudes appeared to be the same. Modern aircraft altimeters have solved this problem by acknowledging the use to which altimeters are put and the characteristics of their users; a modern altimeter is depicted in Figure 2.4.

Figure 2.4: Abstract representation of a modern aircraft altimeter.

The modern aircraft altimeter reflects the need for rapid evaluation of potential danger by separating the scale into two sections, green and purple, and indicating the current altitude of an aircraft using a black box. A brief glance at a modern day altimeter will confirm whether or not an aircraft is flying at a safe altitude.

2.1.3. Views on which aspects are important in good visual communication

There are different views on what makes visual communication good or bad: Wang (1995: 4), for example, argues that the subject matter within the application domain should be represented in a manner that people view as natural and that is not misleading. There are researchers who argue that there should be agreement between a representation and what it represents (Card et al., 1999: 23; Mackinlay, 1986) and researchers who argue for a match between a representation and the task for which it is intended (Spence, 2007: 53; Bertin, 1983; Card et al., 1999: 23). There are also researchers who feel that there should be a balance between the efficiency of a representation and it being aesthetically pleasing (Hansen, Shneiderman & Smith, 2011: 47). Although these views differ from each other, they are all related to human perception and cognition in their own ways. Following these recommendations can, however, lead to either stronger or weaker visual representations, depending on how they are applied. Consider, for example, the aesthetics of a visual representation: Due to its subjective nature, aesthetics can be defined in multiple ways, e.g. with reference to colour, shape, or the positioning of the elements of a visual representation. Figure 2.5 depicts four different visual representations of a modern day altimeter in order to illustrate both good and bad uses of aesthetics.

Figure 2.5: Four versions of an abstract representation of a modern aircraft altimeter.

A researcher focusing solely on the task of a visual representation might create altimeter A because it can be used to correctly read the altitude of an aircraft. That same researcher might consider adding colour to create a superior visual representation and to make it more aesthetically pleasing, resulting in altimeter B. Green is generally accepted as representing positivity and red as negativity; therefore, altimeter B is superior to altimeter A because the colour in its visual representation enhances the user’s ability to identify positive and negative altitudes. The researcher may wish to improve the altimeter further by positioning the black measuring box slightly outside of the green box and adding a triangle pointed towards the corresponding height to guide the eye of the pilot better, resulting in altimeter C. Altimeters B and C illustrate that the appropriate use of aesthetics can lead to better visual representations. An ill-advised use of aesthetics, however, can also lead to poor visual representations. The same researcher who created altimeter A might have an illogical preference for the colour red, with altimeter D being the result. Altimeter D is considered to be a poor visual representation for the same reason that altimeters B and C are considered good, namely the colour scheme. The colour red is generally accepted as representing something negative; therefore, the altitudes positioned in the red box of altimeter D could be perceived as being negative.

Expressivity

There are right ways and wrong ways to show data (Tufte, 1997: 45). This thesis relies on the view proposed by Card et al. (1999: 23) concerning what makes a visual representation good. These authors state that there are multiple ways in which a picture can be created. If all of and only the underlying data are represented, a picture is expressive. (The underlying data is another way of describing the content of an application domain.) For instance, if an application domain consists of data concerning the sales of different car brands in Holland, Germany, and Belgium, the visual representation should only depict the differences in sales between car brands and countries, as illustrated in Figure 2.6.

Figure 2.6: An example of an expressive visual representation.

The bar chart shows the number of sales for each country per brand. This visual representation allows for easy interpretation of the data: For example, the most Audis are sold in Germany, the most BMWs are sold in Holland, and, in general, more Audis are sold than BMWs. The visual representation in Figure 2.6 is considered to be expressive because no unwanted data are present; all of the data that are visually represented are present in the data table in Figure 2.6.

Creating good visual representations is challenging, as it is easy for unwanted data to appear in a visual representation. A visual representation is considered inexpressive if it exhibits unwanted data; an example thereof is provided in Figure 2.7.

Figure 2.7: An example of an inexpressive visual representation.

The red bars in the visual representation in Figure 2.7 represent the combined total sales for Germany, Holland, and Belgium: nine for Audi and six for BMW. Germany is placed on the y-axis to illustrate the difference in car sales between Germany, Holland and Belgium. The bar chart does portray that more Audis are sold than BMWs, corresponding to the data in the data table; however, by placing the countries on the y-axis, the representation implies that both Audis and BMWs sell more in Germany than in Holland and Belgium. While Audis sell more in Germany, BMWs do not. Therefore, data that are not present in the data table, i.e. unwanted data, are visually represented. Thus, this visual representation is considered to be inexpressive.
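The "only the underlying data" half of expressivity can be sketched as a simple check, shown below in Python. The sales figures are hypothetical: the data table of Figure 2.6 is not reproduced in the text, so the numbers were merely chosen to agree with the totals mentioned above (nine Audis, six BMWs) and with the statement that BMWs do not sell best in Germany. The function names are likewise illustrative assumptions, not part of the framework of Card et al. (1999).

```python
# Hypothetical sales figures; the real data table (Figure 2.6) is not
# reproduced in the text. Totals match the text: Audi = 9, BMW = 6.
SALES = {
    ("Audi", "Germany"): 5, ("Audi", "Holland"): 3, ("Audi", "Belgium"): 1,
    ("BMW", "Germany"): 1, ("BMW", "Holland"): 3, ("BMW", "Belgium"): 2,
}

# Claims a reader could take from Figure 2.7: for each brand, the first
# country sells more than the second.
IMPLIED_BY_CHART = [
    ("Audi", "Germany", "Holland"), ("Audi", "Germany", "Belgium"),
    ("BMW", "Germany", "Holland"), ("BMW", "Germany", "Belgium"),
]


def unsupported_claims(sales, implied):
    """Return the implied (brand, higher, lower) claims the table does not support."""
    return [
        (brand, higher, lower)
        for brand, higher, lower in implied
        if not sales[(brand, higher)] > sales[(brand, lower)]
    ]


def shows_only_table_data(sales, implied):
    """True if every implied claim is backed by the data table."""
    return not unsupported_claims(sales, implied)


unwanted = unsupported_claims(SALES, IMPLIED_BY_CHART)
```

With these assumed numbers, the two BMW claims are the unwanted data, which mirrors the argument made about Figure 2.7. The sketch does not test the other half of the definition, namely that all of the underlying data are represented.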

Effectiveness

Humans must be able to perceive a picture well. Card et al. (1999: 23) state that a picture (p1) will be more effective than another picture (p2) if (p1) can be interpreted more rapidly, can convey more distinctions, or leads to fewer errors when compared to (p2). Some visual representations are simply more effective than others; an example thereof is given in Figure 2.8.

Figure 2.8: An example of a difference in effectiveness (Card et al., 1999).

In Figure 2.8, both (a) and (b) are visual representations of a continuous wave, a sine wave. Visual representation (b) is superior to visual representation (a) because representing a continuous wave in the form of positions on a two-dimensional plane is more effective than representing it in the form of colours in one dimension (Card et al., 1999: 23).

Individual differences between users also dictate the effectiveness of a visual representation: Some users are simply more successful when it comes to interpreting certain visual representations than others. For example, a social scientist will be able to interpret a visual representation of the social network of a community more rapidly than a geologist because of their differences in terms of educational backgrounds. The geologist, however, will be able to interpret a visual representation of ground composition more quickly than the social scientist for the same reason.

2.2. Information visualization in general

As a term, information visualization can be approached in different ways, namely as a discipline, an object of interest in research, or as an object of research in itself. The object of research in itself can be defined in multiple ways. This thesis relies on the definition of information visualization offered by Card et al. (1999: 6):

“The use of computer-supported, interactive, visual representations of data to amplify cognition” (Card, Mackinlay & Shneiderman, 1999: 6).

Card et al. (1999: 17) state that information visualization can be described as the mapping of data to visual form that supports human interaction. The mapping of data to visual form discussed by Card et al. (1999: 17) translates to Wang’s (1995: 4) transformation of the content of some application domain (data) into some graphical domain (a visual representation). Figure 2.9 shows the reference model of Card et al. (1999: 17), which illustrates the process of mapping data to visual form.

Figure 2.9: Reference model (Card et al., 1999: 17).

This reference model starts with raw data, meaning data in some original form. Raw data are transformed into data tables, which contain relational descriptions of the data, expanded to include metadata. Data tables are transformed into visual structures by means of visual mappings. A visual structure is a visual representation, while visual mappings are the spatial substrates, marks, and graphical properties that are chosen to create a visual representation. Visual structures are transformed into views by means of view transformations; view transformations specify graphical parameters such as position, scaling, and clipping. A user can interact with graphical parameters to create different views of a visual structure. For example, a view can be restricted to showing only a certain range of all of the data that are available in the data table. Mapping a data table into a visual structure by means of visual mappings is at the core of the reference model, and this model is relied upon in this thesis. In Wang’s terms, the application domain corresponds to some data table, or some part of a data table, and the graphical domain corresponds to a possible view. An application domain is mapped into a graphical domain by means of visual mappings. Figure 2.10 depicts the concept of mapping data to visual form, illustrated using an example provided by Wang (1995: 5).
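The stages of the reference model can be sketched as a small pipeline. The Python sketch below is only an illustration under stated assumptions: the raw data format ('name,value' lines), the `Mark` class, and the function names are hypothetical and do not come from Card et al. (1999). It shows one possible data transformation, one visual mapping, and one view transformation (clipping).

```python
from dataclasses import dataclass


@dataclass
class Mark:
    """A visible element placed in the spatial substrate (hypothetical type)."""
    kind: str
    x: float
    y: float
    label: str


def to_data_table(raw_lines):
    """Data transformation: parse raw 'name,value' lines into table rows."""
    rows = []
    for line in raw_lines:
        name, value = line.split(",")
        rows.append({"name": name.strip(), "value": float(value)})
    return rows


def to_visual_structure(table):
    """Visual mapping: row order -> x position, value -> y position, row -> point mark."""
    return [
        Mark("point", float(i), row["value"], row["name"])
        for i, row in enumerate(table)
    ]


def to_view(structure, y_min, y_max):
    """View transformation: clip the visual structure to a range of y values."""
    return [mark for mark in structure if y_min <= mark.y <= y_max]


raw = ["a, 1", "b, 5", "c, 3"]
view = to_view(to_visual_structure(to_data_table(raw)), y_min=0, y_max=4)
```

Here the view contains only the marks for `a` and `c`; changing `y_min` and `y_max` corresponds to the user interacting with a graphical parameter to obtain a different view of the same visual structure.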

Figure 2.10: Transforming an application domain into a graphical domain (Wang, 1995).

In the example provided by Wang (1995: 5), the database on the left side represents the application domain, while the visual representation on the right side represents the graphical domain. The application domain consists of a data table that describes a network of cities and roads between them. The graphical domain is a node-link diagram that consists of nodes that are connected with links. The arrow in Figure 2.10 represents the mapping process which determines the spatial substrates, marks, and graphical properties that will be used to visually represent the network of cities and roads. The circles may be interpreted as representing cities, and the roads between cities are mapped as lines. The letters identify which city a circle represents, and the numbers do the same for the roads between the cities. The lengths of the lines may imply that certain cities are connected by longer roads than others. Positioning circles closer to each other may imply that certain cities are closer to each other when compared to cities that are further from each other.
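A mapping of this kind can be sketched in a few lines of Python. The cities, road numbers, and the circular layout below are hypothetical and are not taken from Wang's (1995) example; the sketch only illustrates how table rows become marks (circles and lines) with labels.

```python
import math

# Hypothetical application domain: cities and numbered roads between them.
cities = ["A", "B", "C", "D"]
roads = [(1, "A", "B"), (2, "B", "C"), (3, "C", "D")]


def map_to_node_link(cities, roads, radius=1.0):
    """Map cities to labelled circle marks and roads to labelled line marks."""
    nodes = {}
    for i, city in enumerate(cities):
        # Assumption: cities are placed evenly on a circle, so position
        # encodes nothing about the application domain.
        angle = 2 * math.pi * i / len(cities)
        nodes[city] = {
            "mark": "circle",
            "label": city,
            "x": radius * math.cos(angle),
            "y": radius * math.sin(angle),
        }
    links = [
        {"mark": "line", "label": str(number), "source": a, "target": b}
        for number, a, b in roads
    ]
    return nodes, links


nodes, links = map_to_node_link(cities, roads)
```

Because the positions in this sketch come from the layout alone, line length and circle proximity carry no information about the roads or cities. A reader may nevertheless interpret them as distances, which is the kind of implied meaning described above.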

2.2.1. Building blocks of the process of transforming an application domain into a graphical domain

A graphical domain is comprised of a spatial substrate, marks, and graphical properties that encode the data from the application domain. The use of space is the most fundamental aspect of a graphical domain because space is perceptually dominant (MacEachren, 1995). Due to this dominance, the first step in designing a graphical domain is to decide which data from the application domain will be encoded spatially. Card et al. (1999: 26) state that, like other visual features, spatial position can be used to encode data from the application domain. However, because of the dominance of spatial position, it is treated separately from other visual features; this is referred to as a spatial substrate. The spatial substrate can be viewed as an empty space that serves as a container for graphical properties; this container can be constructed using axes. There are four fundamental axes:

(1) Unstructured axis: no axis

(2) Nominal axis: a region is divided into sub-regions

(3) Ordinal axis: the ordering of the sub-regions is meaningful

(4) Quantitative axis: a region has a metric

Marks are added to the spatial substrate to encode certain data from the application domain. Marks are the visible elements that occur in space. The four fundamental types of marks are shown in Figure 2.11 (Card et al., 1999: 28).

Figure 2.11: Four fundamental marks (Card et al., 1999: 28).

Marks possess several graphical properties, which can be used to encode additional information concerning the data represented by the marks (Card et al., 1999: 30). Bertin (1983) identified six graphical properties, as shown in Figure 2.12.

Figure 2.12: Graphical properties (Card et al., 1999: 30).

The six graphical properties identified by Bertin are size, value, texture, colour, orientation, and shape. Additional graphical properties have been proposed: Among others, MacEachren (1995) proposed dividing colour into hue and saturation (similar to “value” in Figure 2.12.), as shown in Figure 2.13.

Figure 2.13: Additional proposed graphical properties (Card et al., 1999: 30).

Some graphical properties are better suited to encoding certain data than others (Card et al., 1999: 30). The suitability of a graphical property in terms of encoding certain data depends on the task that a graphical property is intended to accomplish. Bertin (1983) identified four tasks common to information visualization:

(1) Association: the marks can be perceived as being similar

(2) Selection: the marks are perceived as being different

(3) Order: the marks are perceived as being ordered

(4) Quantity: the marks are perceived as being proportional to each other

Multiple sets of guidance are available concerning the suitability of graphical properties for encoding certain data to accomplish a task (Spence, 2007: 52). Figure 2.14 depicts an interpretation of Bertin’s guidance regarding the suitability of graphical properties to support the four common tasks.

Figure 2.14: Bertin’s guidance regarding encoding methods (Spence, 2007: 52).

Bertin’s guidance demonstrates that not all graphical properties are suitable for all four tasks. Spence (2007: 52) states that Bertin’s guidance is intuitively reasonable. However, the most appropriate method for encoding data from the application domain is very dependent on context, taking into account the purpose of a visual representation, user expertise in terms of the application domain and/or graphical domain, and the perspective adopted on the criteria used to define good pictures.

2.3. Information visualization of network structured data

The application domain investigated in this thesis is comprised of network structured data. Network structured data can be transformed into different visual forms: linear, tabular, or network (Kerren, Purchase & Ward, 2014: 1). The graphical domain examined in this thesis is a network visual form. An example of network structured data being transformed into a network visual form is illustrated by the roads and cities example provided in Figure 2.10.

2.3.1. Application domain: components of network structured data

Network structured data are commonly referred to as a network. A network consists of objects, relationships, and attributes (Kerren et al., 2014: 1). Hansen et al. (2011: 31) describe objects as the ‘things’ in a network, relationships as any form of connection between two objects, and attributes as the descriptive information that can be associated with objects or relationships. Figure 2.15 illustrates the components of a network using Wang’s (1995: 5) roads and cities example.

Figure 2.15: Components of network structured data.

In Wang’s (1995: 5) cities and roads example, there are two types of objects: cities and roads. ‘Connect’ is the relationship that is present, representing a road connection between cities: e.g. city A is connected to city D by road 2. The unique names of cities (e.g. a or b) and roads (e.g. 1 or 2) are attributes. The database used in Wang’s example is transformed into a network visual form wherein cities (circles) are connected with roads (lines), both of which are identified by unique names (letters and numbers).
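
The three components can be sketched as a small data structure. The following is a minimal illustration, not Wang's actual database: only the connection between city A and city D by road 2 comes from the example above, and storing the road as an attribute of the connection (rather than as a separate object) is a simplifying assumption made for brevity. The names `objects`, `relationships`, and `connected_to` are hypothetical.

```python
# Objects: the 'things' in the network (here: cities), keyed by their unique name.
objects = {
    "A": {"type": "city"},
    "D": {"type": "city"},
}

# Relationships: connections between two objects, each carrying its own attributes.
# Assumption: the road is modelled as an attribute of the 'connect' relationship.
relationships = [
    {"kind": "connect", "between": ("A", "D"), "attributes": {"road": 2}},
]


def connected_to(name):
    """Return the names of all objects that share a relationship with `name`."""
    result = set()
    for rel in relationships:
        a, b = rel["between"]
        if name == a:
            result.add(b)
        elif name == b:
            result.add(a)
    return result


print(connected_to("A"))
```

Calling `connected_to("A")` returns the set containing only city D, which mirrors the single connection taken from the example.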

There are different types of networks. Contandriopoulos, Larouche, Breton, and Brousselle (2017: 4) describe two types of networks, one-mode and two-mode networks. In a one-mode network, only objects of the same type are connected. Figure 2.15 depicts a one-mode network in which only cities are involved as objects. Figure 2.16 illustrates the difference between a one-mode network and a two-mode network: in the one-mode network, individuals are connected to each other, as they are objects of the same type. In the two-mode network, individuals are connected to leisure activities, which are objects of a different type.

Figure 2.16: A one-mode network and a two-mode network (Contandriopoulos et al., 2017: 5).

There are multiple properties that can be used to characterize a network. Kerren et al. (2014: 2) discuss the most important ones; however, they do so in a confusing manner, as they use terms that should not be used to describe the application domain, such as nodes and links (these terms belong instead to the graphical domain; see 2.3.2.).

The most relevant property for this thesis is whether a network is directed or undirected. The types of relationships that are present in a network dictate whether it is directed or undirected (Kerren et al., 2014: 2). Directed and undirected relationships are the two major types of connections (Hansen et al., 2011: 34). Hansen et al. (2011: 34) state that a relationship is considered to be directed if a distinct origin and destination are available, e.g. Andrew writes Adam a letter. A directed relationship can be followed by another directed relationship, e.g. Adam sends a letter to Andrew as a response. A relationship is considered to be undirected if the origin and destination are unknown, e.g. marriage. If John is married to Mary, Mary is automatically married to John; the relationship simply exists.
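
The distinction can be made concrete with a short sketch. The function below is an illustrative assumption rather than part of any cited tool: `build_adjacency` and its `directed` parameter are hypothetical names, and the data reuse the letter and marriage examples given above.

```python
def build_adjacency(relationships, directed):
    """Map each object to the set of objects it is connected to."""
    adjacency = {}
    for origin, destination in relationships:
        adjacency.setdefault(origin, set()).add(destination)
        adjacency.setdefault(destination, set())
        if not directed:
            # An undirected relationship holds in both directions at once.
            adjacency[destination].add(origin)
    return adjacency


# Directed: Andrew writes Adam a letter; Adam has not (yet) replied.
letters = build_adjacency([("Andrew", "Adam")], directed=True)

# Undirected: if John is married to Mary, Mary is married to John.
marriage = build_adjacency([("John", "Mary")], directed=False)

print(letters)
print(marriage)
```

In the directed case, only Andrew has an outgoing connection; a reply from Adam would have to be added as a second, separate relationship. In the undirected case, a single entry is enough to connect both individuals.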

2.3.2. Graphical domain: visual representations of network structured data

Networks can be transformed into different visual representations. Figure 2.17 shows an example of two different visual representations of the same network: The visual representation on the left side is a node-link diagram (an example of a network visual form), whereas the visual representation on the right side takes the form of a matrix display (an example of a tabular visual form).

Figure 2.17: Examples of a node-link diagram and matrix display of the same network (Ghoniem, Fekete and Castagliola, 2005: 17).

Node-link diagrams consist of nodes and links, also known as vertices and edges. Figure 2.18 illustrates the components of a node-link diagram. In node-link diagrams, the objects in a network are often referred to as nodes or vertices and are frequently represented by points. The relationships that exist between objects are often referred to as links or edges and are frequently represented by lines. Attributes are also referred to as features and are frequently represented by the graphical properties of the marks that are used to encode the objects and relationships between objects.

A matrix display consists of rows and columns; Figure 2.19 illustrates the components of a matrix display. In matrix displays, the objects are represented twice, once by a row of the matrix and once by a column. The relationships that exist between objects are represented by the cell, which will be highlighted in some fashion, at the intersection of a row and a column. Attributes are represented by the graphical properties of the marks that are used to encode the objects and relationships between them.

Figure 2.19: Components of a matrix display.
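
As a rough illustration of how a matrix display is derived from network structured data, the sketch below builds a 0/1 matrix for a small undirected network. The three objects and two relationships are made-up placeholders, and `adjacency_matrix` is a hypothetical helper, not a function of any tool discussed in this thesis.

```python
def adjacency_matrix(objects, relationships):
    """Return a square 0/1 matrix; rows and columns follow the order of `objects`."""
    index = {name: i for i, name in enumerate(objects)}
    matrix = [[0] * len(objects) for _ in objects]
    for a, b in relationships:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mark both cells
    return matrix


objects = ["A", "B", "C"]                   # each object appears as a row and a column
relationships = [("A", "B"), ("B", "C")]    # placeholder connections
matrix = adjacency_matrix(objects, relationships)

for name, row in zip(objects, matrix):
    print(name, row)
```

Each object appears once as a row and once as a column, and a cell value of 1 corresponds to the highlighted cell at the intersection of two connected objects.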

Some representations are better suited for a certain task than others. Ghoniem et al. (2005: 23) argue that matrix displays are more suitable for large and dense networks because a matrix display is orderable and immune to visual overlaps. However, matrix displays are more difficult to apprehend because objects are shown twice; this forces the eyes to move back and forth between the row representing objects and the column representing objects. Node-link diagrams are more suitable for small-sized network representations. This insight was also provided by Shneiderman and Aris (2006: 733), who stated that the most successful node-link representations encode small-sized networks with between 10 and 50 nodes and between 20 and 100 links. Shneiderman and Aris (2006: 733) operationalized success as the ability to effectively count the number of nodes and links and follow each link from source to destination in a diagram.

2.3.3. Challenges of a node-link diagram

Positioning nodes has long been a challenge when developing node-link representations. There are multiple approaches to positioning nodes in two-dimensional space, and the position of a node in that space can be used to encode meaning. For instance, in Wang’s (1995: 5) cities and roads example, the positions of nodes are based on the actual distances between the cities that they represent. The positioning of nodes in two-dimensional space can also be done randomly or can be controlled by other principles (such as the desire to avoid clutter).

The suitability of a positioning method is often dictated by the size of the network that is to be represented using a node-link diagram: the larger a network, the greater the likelihood of clutter as a result of node occlusions and link crossings (Hansen et al., 2011: 47). To be able to represent larger networks, many positioning methods rely on aesthetic rules intended to minimize clutter. One set of rules is the ‘Netviz nirvana’ of Shneiderman (Hansen et al., 2011: 47):

(1) Every vertex is visible

(2) Every link a node is connected to can be counted

(3) Every link can be followed from origin to destination

(4) Clusters and outliers are identifiable

A focus on aesthetics could result in the loss of information that could otherwise be encoded (e.g. the distances between cities). Moreover, it could produce a misleading representation, e.g. implying that some cities are closer to each other because the nodes that represent them are positioned closer to each other when the positioning is, in fact, partially based on node visibility. However, focusing on aesthetics could allow for node-link diagrams that represent larger networks with minimal clutter.

Once again, the most appropriate method for encoding data from the application domain is very dependent on context, including the task of a visual representation, user expertise with regard to the application domain and/or graphical domain, and the perspective on the criteria used to define good pictures.

2.4. Visualizing social networks from social media

The practice of representing networks in visual form using node-link diagrams can be applied to many fields of study, such as software engineering, biology, and sociology (Kerren et al., 2014: 2). Due to the wide range of fields of study, there are many different types of networks available to be represented visually. This thesis focuses on social networks. Social networks have been represented using node-link diagrams since the early 1930s; however, the growth in and increasing availability of computing power has made collecting network structured data increasingly practical. The increased practicality of data collection has changed the nature of social networks and the ways in which social networks are encoded into visual form (Freeman, 2000).

2.4.1. Application domain: the pre-computer era

Social networks are networks comprised of the social interactions and exchanges that occur between people. Such social interactions and exchanges occur when people interact with others (Hansen et al., 2011: 31). In the pre-computer era, researchers collected social network structured data manually. Data collection would typically include observing or surveying a population, such as asking each member of a given population to list whom they knew, how, and how regularly they would meet; this represented a costly and time-consuming method of data collection (Hansen et al., 2011: 39).

2.4.2. Graphical domain: the pre-computer era

Freeman (2000: 3) states that some of the first visual representations of social networks were those of Moreno (1934: 213). In the pre-computer era, node-link diagrams that represented social networks were drawn by hand. Figure 2.20 depicts a hand-drawn node-link diagram representing who liked whom and who disliked whom in a football team. The legend for this diagram was not included in the original image; it was added by the author of this thesis.

Figure 2.20: Who likes whom and who dislikes whom among the members of a football team (Moreno, 1934: 213).

As illustrated by Figure 2.20, Moreno introduced considerations that are important when constructing node-link diagrams intended to visually represent social networks (Freeman, 2000: 3):

(1) The graphical properties of marks can be used to encode differences

(2) The positions of nodes can be used to encode meaning

The use of the graphical properties of marks to encode differences between objects and relationships between objects has been widely adopted by the visualization community. Innovations such as new shapes followed rapidly. However, a greater challenge was, and remains, improving on the method used to position nodes.

In the early days, collecting and visually representing social network data was a time-consuming and costly undertaking; thus, it was not practiced by many. This changed when computers became widely available.

2.4.3. Application domain: the post-computer era

With the computer came the rise of social media. Hansen et al. (2011: 12) define social media as a “set of online tools that supports social interaction between users.” The focus of these authors is on the technological aspects of such media platforms. There are other approaches to defining social media; one example is the social perspective of Kaplan and Haenlein (2010: 62). These authors classify social media based on factors such as self-presentation, self-disclosure, social presence, and the richness of a given medium. Hansen et al. (2011: 4) state that people use social media to bring their families and friends closer together, to reach out to colleagues and neighbours, and to stimulate markets for products and services. Through the use of social media, connections that span continents and bind regions are created.

The explosion in the number of emerging social media applications has resulted in the vast availability of structured social network data (Hansen et al., 2011: 46). The components of network structured data from social media can be extremely varied, as objects can include the individual users of an application, teams, web pages, and videos. Relationships can take the form of friendships, transactions, or shared attributes. Attributes may describe the demographic characteristics of an individual user (e.g. age or gender), information that describes a person’s use of a system (e.g. number of messages sent or edits made), or other characteristics (e.g. income or location) (Hansen et al., 2011: 34).

Hansen et al. (2011: 16) describe three types of relationships that are distinct to the networks of social media applications: explicit, implicit, and subtle implicit. The three types of relationships described by these authors should not be mistaken for the properties used by Kerren et al. (2014: 2) to describe networks in a more general sense (e.g. directed/undirected relationships). Explicit connections are consciously created: For instance, on Facebook, a user will send another user a request to become friends, both users will need to accept the request for the friendship relationship to become reality, and the users will then both acknowledge the creation of a friendship relationship. Implicit connections are inferred from the behaviours of users on a social media application: For instance, on Twitter, users can like the posts of other users. A like may simply be used to show appreciation for a post made by another user, without the intention of creating a relationship. Hansen et al. (2011: 16) state that, although the intention to create a relationship is absent, a connection is still created between the two users in question through the like. Subtle implicit relationships are inferred from a shared attribute: For instance, on Facebook, users can become members of certain groups. If two users are members of the same group, they share that attribute. A relationship between those users will exist based on that shared attribute. For subtle-implicit relationships, attributes are transformed into relational data. Implicit and subtle-implicit relationships are created in ways that are quite different; therefore, in the opinion of the author of this thesis, subtle-implicit relationships should not be viewed as being similar to ‘regular’ implicit relationships and thus should be named differently.
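
The transformation of a shared attribute into relational data can be sketched as follows. This is an illustrative assumption about how such a derivation might look, not a description of how any social media application or tool computes it; the user names, group names, and the function `shared_group_relationships` are hypothetical.

```python
from itertools import combinations


def shared_group_relationships(memberships):
    """Derive user-to-user relationships from shared group membership."""
    # Invert the attribute data: group -> set of member users.
    members_by_group = {}
    for user, groups in memberships.items():
        for group in groups:
            members_by_group.setdefault(group, set()).add(user)

    # Every pair of users within the same group yields one relationship.
    relationships = set()
    for group, members in members_by_group.items():
        for a, b in combinations(sorted(members), 2):
            relationships.add((a, b, group))
    return relationships


# Attribute data: which (placeholder) groups each user is a member of.
memberships = {
    "user_1": {"cycling"},
    "user_2": {"cycling", "cooking"},
    "user_3": {"cooking"},
}

derived = shared_group_relationships(memberships)
print(sorted(derived))
```

No user in this sketch performed an action directed at another user; the relationships between user_1 and user_2 and between user_2 and user_3 exist only because the group attribute is shared.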

2.4.4. Graphical domain: the post-computer era

Node-link visualizations intended to represent the network structured data of social media applications are largely drawn digitally using a software tool. A variety of tools for visually representing network structured data are available.

Among these is NodeXL, a software tool used to visually represent network structured data in the form of node-link diagrams. Developed by Hansen et al. (2011), NodeXL has a built-in data collector that allows for the easy collection of data from social media applications. A range of marks and graphical properties are available for the encoding of objects, relationships, and attributes. The method used to position nodes involves multiple layout and clustering algorithms, each of which is based on its own set of rules. Tools such as NodeXL allow for the creation of insightful node-link diagrams that represent the networks of social media applications.

To use tools such as NodeXL, no programming skills or extensive knowledge concerning information visualization are needed. Furthermore, the process of collecting data is no longer time-consuming and costly. An individual can simply download some tool, import data from some social media application, and start creating visualizations. While easy-to-use tools allow for the creation of insightful visualizations, they can also allow for the creation of bad visualizations as a result of a user’s lack of knowledge concerning information visualization. Bad visualizations contaminate the existing pool of node-link diagrams that represent the networks of social media applications and should be avoided.

2.5. Summary

Visual communication consists of an application domain, a graphical domain, and a link between the two domains. The application domain is the data, the graphical domain the visual representation, and the link describes how graphical domain entities should be interpreted and how entities from the application domain can be visually represented. There are different views regarding the process of evaluating the quality of a picture. This thesis follows Card et al. (1999: 23), who distinguish between expressivity and effectiveness. If all and only the underlying data are represented, a picture is expressive. A picture (p1) is more effective than some other picture (p2) if (p1) can be interpreted more rapidly, can convey a greater number of distinctions, or leads to fewer errors than (p2).

Some application domain can be transformed into some graphical domain using a spatial substrate, marks, and graphical properties. There are different types of application domains. This thesis focuses on network structured data. Network structured data consist of objects, relationships, and attributes and can be visually represented in multiple ways. Node-link diagrams and matrix displays are examples of visual representations of network structured data; this thesis, however, focuses on node-link diagrams. A node-link diagram consists of nodes that represent objects and links that represent relationships.

There are many fields of study that include visual representations of network structured data using node-link diagrams. This thesis focuses on structured social network data, specifically the networks of social media applications. Within the networks of social media applications, objects can be the individual users of an application, teams, web pages, and videos. Relationships can be friendships, transactions, and shared attributes. Attributes may describe the demographic characteristics of an individual user (e.g. age, gender), information that describes an individual’s use of a system (e.g. number of messages sent or edits made) or other characteristics (e.g. income or location). Structured social network data have been visually represented using hand-drawn node-link diagrams since the 1930s. Nowadays, however, a variety of tools that allow for the easy collection of network structured data and the creation of digital node-link diagrams that represent that data, without the need for any extensive knowledge of information visualization or skill on a user’s part, are available. These easy-to-use tools can be used to produce insightful node-link visualizations of the network structured data of social media applications. The opposite is also possible, as they can also create bad visualizations that contaminate the existing pool of node-link diagrams that represent the network structured data of social media applications.

Figure 2.21: Structure of the goals of this thesis.

In order to explore the world of representing the structured data of social media applications visually by means of node-link diagrams, this thesis intends to:

(1) Describe and categorize the network data of social media applications that are available (red)

(2) Examine the marks and graphical properties used in the existing visualizations of social media network data generated by easy-to-use tools (blue)

(3) Explore the ways in which social media network data are commonly encoded in existing visualizations (yellow)

3. Method

The method used to obtain insights into the visual communication of social network data consists of collecting, coding and counting. The method employed in this study can be explained with reference to Figure 3.1. The different aspects of the research methodology are discussed in greater detail in their respective sections.

Figure 3.1: Structure of the research methodology employed.

Visual communication consists of some application domain (red), some graphical domain (blue), and a link between the application domain and graphical domain (yellow) (Section 2.1.1.). As illustrated by the reference model of information visualization developed by Card et al. (1999), data is mapped into a visual form. The content of an application domain is transformed into a graphical domain using an encoding method that maps data from the application domain to a spatial substrate, marks, and graphical properties (the link between the application domain and the graphical domain) (Section 2.2.). The quality of an encoding method can vary in terms of its expressivity and effectiveness (Section 2.1.3.) (green).

The network structured data considered in this thesis originate from social media applications. There are many such applications available; this thesis addresses Facebook, Twitter, Instagram, and Reddit. The selection process used to decide upon these social media platforms is described in Section 3.1.1.

The exploration of the content of the application domain (application domain entities), the network structured data obtained from the social media applications (red), is based on a so-called JSON study (pink).

Two hundred and fifty-one application domain entities were collected via the descriptions of JSON data on the developer websites made available by Facebook, Twitter, Instagram, and Reddit (Section 3.1.2.).

The application domain (red) is analysed by coding the 251 retrieved application domain entities using newly created categories. The number of entities per category is counted (Section 3.2.1.).

The explorations of the graphical domain (blue), the link between application and graphical domain (yellow), and the evaluation of this link (green) are based on a so-called Visualizations study (purple). The images and documentation for the 76 node-link diagrams that visually represent social media networks were collected using the NodeXL Pinterest page (Section 3.1.3.).

The graphical domain (blue) is analysed by coding and counting the marks and graphical properties (graphical domain entities) that are present in the actual images of the retrieved node-link diagrams (Section 3.2.2.).

The analysis of the link between the application domain and graphical domain (yellow) is based on the documentation of the retrieved node-link diagrams. The manner in which application domain entities are visually represented in the node-link diagrams, i.e. by what graphical domain entities, is coded and counted (Section 3.2.3.).

The quality of the link between the application domain and graphical domain (green) is analysed by evaluating the expressivity and effectiveness of the retrieved node-link diagrams. The evaluation process involves examining both the images and the documentation of the retrieved node-link diagrams (Section 3.2.4.).

3.1. Data collection

The data collection consists of (1) the JSON study and (2) the visualizations study, as shown in Figure 3.1. Both studies are grounded in a selection of social media applications, described in Section 3.1.1.

In a perfect world, all of the data that social media applications possess would be available and visible to everyone, thus providing a complete view of the application domain. However, the companies behind the various social media applications have decided neither to reveal all of their data nor to make them all retrievable. A user interface provides only a glimpse of the underlying data. JSON, short for JavaScript Object Notation, is a method of storing information in a readable form. JSON data describe the underlying data of a social media application. Application domain entities are retrieved from the description of JSON data provided on the developer websites of Facebook, Twitter, Instagram, and Reddit. JSON data in itself is not network structured data as described in Section 3.2.1, but network structured data can be inferred from JSON data. This inference of network structured data can be viewed as a data transformation (Section 2.2.). Only entities that describe the relationships between social media users or attribute information concerning such users are collected. The process by which application domain entities were collected is discussed in greater detail in Section 3.1.2.

Many social media network visualizations are available from various sources, but only a limited number include documentation. The Pinterest page for NodeXL (https://www.pinterest.com/nodexl/) served as the source of the visualizations discussed in this thesis because every visualization on this page is accompanied by an article, thesis, or another form of scientific documentation. There are different approaches to visualizing a social media network. For this thesis, only node-link visualizations of networks on Facebook, Twitter, Instagram, and Reddit were collected. Visualizations were only collected if users were represented in the form of nodes and the relationships between users by links. The process by which visualizations were collected is described in greater detail in Section 3.1.3.

3.1.1. Selection of social media applications

There are many social media applications available, and there are different approaches to classifying such applications. In order to limit the scope of analysis, only four social media applications are included. The process of selecting social media applications took into account the approach to the classification of social media developed by Hansen et al. (2011: 18). The authors define 11 different types of social media applications. The criteria of this thesis required the selection of four different types of social media applications; the selected social media applications are Facebook (social networking service - social and dating), Twitter (blogs and podcasts - microblogs and activity streams), Reddit (social sharing - bookmarks, news and books), and Instagram (blogs and podcasts - multimedia blogs and podcasts).

3.1.2. Retrieval of social media network application domain entities from JSON data

A network consists of objects (1), relationships between those objects (2), and information about the objects (3) and relationships (4) (Hansen et al., 2011: 31). Objects are the ‘things’ in a network (Hansen et al., 2011: 31). Also called agents or items, objects can include users, teams, web pages, and videos. A relationship is what connects objects in a network. As stated by Hansen et al. (2011: 31), a relationship exists if it has some official status, is recognized by a participant, or is observed as an exchange or interaction between two objects. A relationship is any form of connection between two entities. Relationships can be many things, e.g. friendship, transaction, and shared attributes. Attribute data are descriptive information that can be associated with objects or the relationships that exist between objects. Attribute data may describe the demographic characteristics of a person (e.g. age and gender), data that describe an individual’s use of a system (e.g. number of messages sent or edits made), or other characteristics (e.g. income or location) (Hansen et al., 2011: 31). For this thesis, the aspects of a network are defined as follows:

(1) Objects: the users of social media applications

(2) Relationships: the connections between social media users

(3) Object attributes: attribute information about social media users

(4) Relationship attributes: attribute information about the connections between social media users

Describing and categorizing social media network data forms one part of this thesis. In order to ensure that the research could be completed within the set timeframe, this thesis only focuses on the following factors:

(1) Social media users

(2) The connections between social media users

(3) Attribute information concerning social media users

The JSON data regarding the selected social media applications represents the basis of the process involved in retrieving social media application domain entities. An example of Twitter JSON data is shown in Figure 3.2, wherein application domain entities such as the username, profile picture, and biography of a user are present.

Figure 3.2: An example of JSON data.
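
For readers unfamiliar with the format, the sketch below shows what a JSON description of a user might look like and how the entity names can be read from it. The three keys correspond to the entities named in the text; the values are invented placeholders, and the snippet is not actual data returned by any social media application.

```python
import json

# Placeholder JSON text; only the key names are taken from the example in the text.
raw = """
{
  "username": "example_user",
  "profile_picture": "https://example.com/picture.jpg",
  "biography": "A placeholder biography."
}
"""

user = json.loads(raw)

# Each key is one candidate application domain entity.
entities = sorted(user.keys())
print(entities)
```

Each key in the parsed object corresponds to one individual data unit, which is the level at which application domain entities are collected in this thesis.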

There are different approaches to collecting JSON data. For this thesis, the websites of each social media application intended to assist developers are used because they provide descriptions for all of the data provided. The companies behind these social media applications own these websites or refer to them. Therefore, collecting application domain entities from these websites represents an ethical method of collecting data. The websites can be considered as dictionaries that can be used to decipher the meaning of JSON data. Social media application domain entities are not collected from actual JSON data, but instead from the descriptions of retrievable JSON data. The websites are as follows:

(1) Facebook: https://developers.facebook.com/docs/graph-api/reference/user

(2) Twitter: https://dev.twitter.com/overview/api/users

(3) Instagram: https://www.instagram.com/developer/endpoints/users/#get_users

(4) Reddit: https://github.com/reddit/reddit/wiki/JSON

Facebook, Twitter, and Instagram own the developer websites. Reddit refers to GitHub, a platform for developers, to decipher JSON data.

Individual data units are referred to as ‘application domain entities’: For instance, the username, profile_picture, and biography shown in Figure 3.2 are all considered to be unique application domain entities.

An application domain entity is collected if the description of that entity describes it as either a relationship between users or attribute information concerning a user. The description itself is also collected in order to provide clarity in the case of possibly ambiguous application domain entities and to provide insights into the differences and similarities that may exist between entities. If a description is absent, the decision to collect an application domain entity is based on the researcher’s expertise. Figure 3.3 depicts an example of the process of retrieving application domain entities from Twitter. All available application domain entities are collected from the descriptions of the JSON data.

Figure 3.3: An example of the process of retrieving application domain entities from a JSON data description.

All retrievable application domain entities, alongside their corresponding descriptions, are collected in the form of a spreadsheet. If the description of an application domain entity is ambiguous or absent, the helpdesks of the social media applications are consulted to refine or create a description. The helpdesks are as follows:

(1) Facebook: https://www.facebook.com/help/
(2) Twitter: https://support.twitter.com/
(3) Instagram: https://help.instagram.com/
(4) Reddit: https://www.reddithelp.com/
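The spreadsheet described above can be illustrated with a minimal Python sketch. It is not the procedure used in this thesis, where collection was done manually. The column names, the `EntityRecord` class, and the helper functions are hypothetical and only show what one row per application domain entity might look like, including a flag for entities whose description is absent or ambiguous and therefore has to be checked against a helpdesk.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EntityRecord:
    """One spreadsheet row (hypothetical column layout)."""
    application: str      # e.g. "Twitter"
    entity: str           # name of the application domain entity
    description: str      # text from the developer website; "" if absent
    kind: str             # "relationship" or "attribute", judged by the researcher
    needs_helpdesk: bool  # True when a description must be created or refined


def make_record(application, entity, description, kind, ambiguous=False):
    """Build a row and flag it when the description is absent or ambiguous."""
    description = description.strip()
    return EntityRecord(application, entity, description, kind,
                        needs_helpdesk=ambiguous or description == "")


def to_csv(records):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    names = [f.name for f in fields(EntityRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


records = [
    make_record("Twitter", "followers_count",
                "The number of followers that a user has.", "relationship"),
    # The empty description is a placeholder to demonstrate the flag;
    # it is not a claim about the actual Instagram documentation.
    make_record("Instagram", "biography", "", "attribute"),
]
print(to_csv(records))
```

The `kind` value is still assigned by hand in this sketch, which mirrors the manual judgement described above; only the bookkeeping is automated.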

The JSON data concerning the selected social media applications do not explicitly mention the relationships that exist between users, only the application domain entities that are based on a relationship. For instance, the JSON data of Twitter contain the application domain entity 'Followers_count', which is described as the number of followers that a user has. The 'Followers_count' implies a 'followers relationship', meaning that a user is followed by other users. While the count is based on this relationship, the followers relationship itself does not exist as an individual application domain entity in the JSON data. Therefore, relationships are derived from the descriptions of existing JSON application domain entities in order to gain insight into the relationships that are available in social media network data. A description of each relationship was created by the researcher, using the helpdesks mentioned previously. All application domain entities are manually collected in the form of a spreadsheet.
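The derivation step described above can be sketched in Python as follows. This is an illustration under a strong assumption, namely that an entity whose name ends in `_count` hints at an underlying relationship. In this thesis the relationships were derived manually from the entity descriptions, so the function `implied_relationship` is hypothetical and its output is only a candidate that still has to be checked against the description.

```python
from typing import Optional

COUNT_SUFFIX = "_count"  # assumed naming convention, not a documented rule


def implied_relationship(entity: str) -> Optional[str]:
    """Return a candidate relationship name for a count entity, else None."""
    name = entity.lower()
    if name.endswith(COUNT_SUFFIX) and len(name) > len(COUNT_SUFFIX):
        return name[: -len(COUNT_SUFFIX)] + " relationship"
    return None


for entity in ["Followers_count", "username"]:
    print(entity, "->", implied_relationship(entity))
```

A count does not necessarily refer to a relationship between users, which is why the result is treated as a suggestion for manual review rather than as a derived fact.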

3.1.3. Retrieval of existing node-link diagrams visually representing social media networks

The visualizations found on the Pinterest page for NodeXL are divided into four categories: 'General', 'Facebook', 'Twitter', and 'Other'. All categories are manually examined for node-link visualizations that represent the networks of Facebook, Twitter, Instagram, and Reddit, that is, networks consisting of users and the relationships between them.

For each visualization, the image, its name, its year of creation, and the hyperlink to its documentation are manually collected and added to a spreadsheet. Figure 3.4 provides an example of how a visualization is displayed on the NodeXL Pinterest page and how it is added to the spreadsheet.