Automatic data visualization


Bachelor Informatica


Ben Witzen

June 20, 2014

Supervisor(s): Dr. R.G. Belleman (UvA)

Informatica, University of Amsterdam


Abstract

The rapid development of computer hardware and software has enabled us to accumulate and store data at an increasing pace. However, the usefulness of data is limited by our ability to interpret and comprehend it. Data visualization can help, but the vast number of visualization techniques developed so far is overwhelming. Each technique has its own advantages and disadvantages, and choosing one over another affects how the data is interpreted. Selecting the correct visualization technique is therefore an important decision for the user. Dragon is a new tool that serves as a framework for easy integration of many visualization techniques. Furthermore, the tool is capable of analyzing the characteristics of the input data to automatically provide a visualization suggestion. The combination of both features results in a tool with a simple-to-use interface that can immediately turn raw data into a fitting visualization that reveals the (underlying) characteristics of that data, without requiring extensive visualization knowledge from the user. Our research has resulted in a tool that works


CHAPTER 1

Introduction

1.1 Data visualization

The past few years have been marked by the rapid development of computer hardware and software components. This growth in computing power has enabled us to collect and store data at higher rates and scales than ever before. We are often required to make decisions based on the collected data or draw conclusions from it. To help us do so, the fast accumulation of data has required us to develop strategies to display high volumes of data in a comprehensible way. To this end, many techniques were developed, collectively known as data visualization. The visualization does not have to include all of the data, but it must be based on the data or the characteristics thereof. What motivates data visualization is that it allows us to display a lot of data in a compact, yet comprehensible manner, and as such it is a useful way to quickly analyze large volumes of data.

Data visualization as a practice can probably be traced back to ancient times, when people first learned how to draw. The drawings in the cave system of Lascaux are estimated to be 17,300 years old and some of them are believed to depict maps of stars and constellations [7]. For most of history, however, data visualization received little attention and developments in the field were scarce. An important development was the invention of the Cartesian coordinate system by René Descartes in the seventeenth century, which allowed data to be visualized using two dimensions (and more if points are given different sizes, colors, symbols, et cetera), and which went on to revolutionize the mathematics of its time. At the end of the eighteenth century, we see the first signs of modern data visualization as people start using the Cartesian coordinate system to display line graphs [8].

Over the last five decades, data visualization as a field of study has seen rapid growth in interest and progress with the advent of the computer and the development of better hardware and software able to render complex imagery. Today, we consider data visualization to be a branch of modern descriptive statistics, and it involves the development and study of techniques that convert raw data into imagery. These images are no longer limited to flat figures on Cartesian grids, but may also feature three- or higher-dimensional elements. Most data visualization images found in the media or on the Internet are the result of developments made in the past 50 years [8]. Besides the imagery itself, interactivity, which gives the end user the means to dynamically adjust the visualization, has also grown in appreciation.

1.2 Research and motivation

Recent developments have caused a growth in both the size and complexity of datasets and the visualization strategies available to visualize them. Due to the high heterogeneity of the datasets at hand, we require several strategies to optimally highlight the features of each dataset. Although the theory has kept up with the requirements, we have now reached a point where managing these different strategies and selecting the correct visualization for our data has become challenging. To make optimal use of the data we collect and the data visualization theory we develop, we now need a tool that allows us to easily and automatically apply the theory to the data.

This need has opened up room for the development of a tool that is capable of aiding the end-user in automatically matching the best visualization method with the data. The research described in this study focuses on the creation of such a tool. The question this thesis aims to answer is whether or not this tool is indeed capable of making such decisions to a satisfactory degree.

1.3 Related work

Even though data visualization is a relatively young field of research, it has gained significant interest over the past few decades. The earliest progress in this area was made by Bertin in 1973, who discussed the various parameters of visualizations and their impact on the visualization [2]. Cleveland elaborated on this idea in 1984 [6], followed by Carpendale in 2003 [5]. These findings have helped us gain a better understanding of the fundamental properties of visualizations.

A wide range of software solutions is available today. Some of these environments offer guided user interfaces to turn data into a visualization. The spreadsheet components of various office suites, such as Microsoft Office¹ and LibreOffice², fall into this category, but are limited in the number of visualizations they offer. More recently, the tools Tableau³ and Spotfire⁴ have become available. They allow for significantly more options while retaining a user-friendly interface. For even more flexibility, we need to turn to Application Programming Interfaces (APIs). These are generally not suitable for the average user or researcher, who typically lacks the experience and proficiency needed to create an application around these APIs. The Visualization Toolkit, OpenGL, and D3.js are some of the many visualization APIs available today, although their initial usability for data visualization purposes differs. The present research focuses on implementation using D3, which is described in detail by its authors in the paper D3: Data-Driven Documents [4].

We are interested in a framework that combines the ease of guided user interfaces with the possibilities of APIs. The tool presented in this paper attempts to achieve this by implementing automatic visualization suggestion on top of a framework that allows for easy integration of new visualization techniques. In 1986, Mackinlay discussed both of these topics in his paper Automating the design of graphical presentations of relational information, which described the attributes of a tool that is able to automatically design effective graphical presentations [10]. Research performed in 2005 at the University of Massachusetts produced a Universal Visualization Platform that allows for incorporation of any number of visualizations within the same framework [9]. Recent research performed at the University of Amsterdam in 2013 resulted in a tool that provides automatically generated visualization suggestions using k-means clustering techniques combined with decision tree traversal [3].

1.4 Outline

This research presents a new tool, building on the knowledge gained by previous research, with an additional goal of reducing the complexity of data visualization even further. As such, the research focuses primarily on the following two points:

• The tool should be able to offer visualization suggestions depending on the data entered and the visualizations available. By making this choice for the user, we reduce the risk of mistakes and further reduce the complexity of turning data into a visualization. We discuss this topic in chapter two.

¹ office.microsoft.com  ² www.libreoffice.org  ³ www.tableausoftware.com  ⁴ spotfire.tibco.com

• The tool should provide a framework that allows the inclusion of any number of file formats and any number of visualizations. Having these technologies available within the same framework greatly simplifies the process of turning data into a visualization. The tool architecture, including this point, is discussed in chapter three.


CHAPTER 2

Visualization selection algorithm

The Visualization Selection Algorithm (VSA) is an important part of the tool and it automates the selection of a visualization befitting the data that was provided. Given a dataset as input, the VSA will return an ordered list of visualization selections based on the characteristics of that dataset. Although the idea for this module builds on previous work done at the University of Amsterdam by Blom [3], we go further by simplifying the process. The VSA must be compatible with a wide and changing range of visualizations which can be added and removed from the tool easily. Besides making a selection between these visualizations, the tool also incorporates the findings of Bertin’s Sémiologie Graphique by optimizing how the chosen visualization is eventually used [2].

Before we discuss these two challenges, selecting a visualization and optimizing its use, we first discuss the merits of implementing either a learning or a static system. We then look at two proposals for visualization selection algorithms. Of these, the Visualization Alphabet proposal is the one implemented in Dragon.

Although Dragon’s exact implementation is discussed in the next chapter, it is useful to note here that all data that is processed by the tool and by the VSA is numeric and placed in a matrix. Dragon ensures that this is the case before passing the data to the VSA. When we talk about a dataset, we are talking about a numeric matrix.

2.1 Learning

A learning system is capable of registering how users interact with it and adapting itself to that feedback. This is not in any way related to the tool’s flexibility to incorporate new visualizations and other extensions. It allows the tool to improve its performance regardless of these, by incorporating user feedback into the VSA. Because the tool gives suggestions to the user, it could be valuable to register how the user responds to these suggestions and use that feedback to improve future suggestions. The implementation of such a system has its merits, the biggest one being that it allows the initial algorithm to be imperfect. Especially when new visualizations are introduced to the system, this versatility can prove useful. However, the feasibility of incorporating it into the VSA depends on the algorithm chosen. The idea of incorporating a learning component into Dragon was considered from the start, and was an important factor for some of the decisions with regard to the first VSA proposal. The alternative to a learning system is a static system, which does not respond to user feedback.

2.2 First proposal: Data Feature Vectorization

Data Feature Vectorization finds a suitable visualization by looking at the individual columns of a data matrix, generating a large number of summary statistics for each column, and then selecting a visualization based on the result. Examples of such statistics include the number of elements, average, variance, correlation, maximum, minimum, and the linear regression. The premise is that, provided we generate enough of these statistics, we can create a discriminative feature vector that categorizes the data and allows us to provide a visualization suggestion based on that categorization.

The idea to use feature vectors as a means to implement a VSA has been previously explored by Blom [3], who used them in combination with a decision tree. For each column in the dataset, a feature vector is built using a large number of summary statistics over the contents of that column. Having generated these vectors, we need to determine how the VSA could use them to generate visualization suggestions. Vectors are easily compared to other vectors by measuring the angle between them, so ideally we find a way to create vectors that characterize the parameters of visualizations. For each visualization, we know of at least a few datasets that map well to that visualization. We could use the feature vectors of these datasets as initial characterizing vectors for the visualization.
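As a rough illustration of this idea, the sketch below builds a small feature vector for one column and compares vectors by the angle between them. The selection of statistics and the names `columnFeatures`, `angleBetween`, and `closestVisualization` are assumptions made for this example; they are not taken from Dragon or from Blom's tool.

```javascript
// Hypothetical sketch: a few summary statistics of one numeric column.
function columnFeatures(column) {
  const n = column.length;
  const mean = column.reduce((sum, v) => sum + v, 0) / n;
  const variance = column.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  return [n, mean, variance, Math.min(...column), Math.max(...column)];
}

// Angle (in radians) between two non-zero feature vectors of equal length.
function angleBetween(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const cosine = dot / (norm(a) * norm(b));
  // Clamp to [-1, 1] to guard against floating point rounding.
  return Math.acos(Math.max(-1, Math.min(1, cosine)));
}

// Return the visualization whose characterizing vector has the smallest
// angle to the given feature vector. `characterizing` maps a visualization
// name to one characterizing vector (an assumed, simplified layout).
function closestVisualization(features, characterizing) {
  let best = null;
  for (const [name, vector] of Object.entries(characterizing)) {
    const angle = angleBetween(features, vector);
    if (best === null || angle < best.angle) best = { name, angle };
  }
  return best;
}
```

In practice the statistics would need to be normalized first, since an unscaled component such as the element count can dominate the angle.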

Thus far we have a collection of visualizations, each with their own set of characterizing vectors. The previously discussed learning component can be implemented by making this set of characterizing vectors mutable, with changes to these vectors being driven by user feedback on the visualization suggestions the tool provides.

Doubts arise as to whether the vectors generated from the data can be distinctive enough for this method to work satisfactorily. The problem is that there is only a limited number of summary statistics we can extract from data columns, which limits the dimension of the vectors we generate. However, we require high-dimensional vectors to prevent problems such as the one illustrated by Anscombe’s quartet [1], where different datasets end up receiving the same feature vectors, and thus the same visualization suggestion. This hints at a potential problem with this approach, and has prompted this research to look for ways to increase the dimension of these vectors, and eventually for an entirely different solution altogether.
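The collision described above can be reproduced with two of the y-columns from Anscombe's quartet [1]. The sketch below is only an illustration: the three-component feature vector, the rounding to two decimals, and the name `coarseFeatures` are assumptions of this example and not part of either proposal.

```javascript
// Two of the four y-columns from Anscombe's quartet [1].
const anscombeY1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68];
const anscombeY2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74];

// Hypothetical low-dimensional feature vector: element count, mean, and
// sample variance, each rounded to two decimals.
function coarseFeatures(column) {
  const n = column.length;
  const mean = column.reduce((sum, v) => sum + v, 0) / n;
  const variance = column.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
  return [n, mean, variance].map((v) => Number(v.toFixed(2)));
}

// Compare the two feature vectors component by component.
const sameFeatures =
  JSON.stringify(coarseFeatures(anscombeY1)) === JSON.stringify(coarseFeatures(anscombeY2));
console.log(sameFeatures);
```

The two columns look very different when plotted, yet these three statistics cannot tell them apart. Adding components such as the minimum and maximum does separate this particular pair, which is the motivation for increasing the dimension of the vectors.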

2.3 Second proposal: Visualization Alphabet

Where the previous proposal attempted to reach a visualization by extracting knowledge from the data, this proposal attempts to reach the data by considering what we know about the visualizations. We approach this by defining a collection of characteristics that can be assigned to visualizations and see if the data fits within these characteristics. One advantage of this approach is that while it is impossible to account for all the different characteristics of all datasets that the VSA could possibly encounter, it is possible for creators of visualization techniques to expose what kind of characteristics their visualization expects. This proposal makes use of that knowledge and incorporates it directly into the tool.

2.3.1 Alphabet matching

The visualization alphabet is a language consisting of characters that allows visualizations to expose their characteristics to the VSA, which can investigate a dataset to see if its characteristics sufficiently match those of the visualization. Whereas the previous proposal had visualizations expose feature vectors, the alphabet VSA has visualizations expose alphabet characters. The programmers of visualizations attach alphabet characters to their visualizations, and the tool automatically attaches alphabet characters to each dataset provided. Visualizations and datasets that have an equal or similar sequence of characters are a match. The remainder of this chapter talks about assigning the alphabet to visualizations, but the reader should keep in mind that these characters are just as applicable to datasets, and that the tool in fact assigns them to datasets in order to match the two together.

2.3.2 Alphabet definition

Visualizations can be complex. While the goal of the VSA is to keep the definition of the alphabet as simple as possible, the alphabet must also be expressive enough to sufficiently describe a wide range of different visualizations. To this end, rather than a single array of characters, a variable number of arrays is assigned to each visualization.


We can characterize visualizations on different levels. To accommodate this, the alphabet that is implemented in Dragon’s VSA consists of three separate sets of characters, each applicable to a different layer. The definition of the alphabet follows, layer by layer.

• Layer 1 serves to expose dimensionality, the number of parameters the visualization supports. For each visualization, this is an array containing any number of values in ℕ⁺ (the positive integers). A visualization that exposes the numbers 3 and 5 in this layer thus supports a dataset with exactly three or five columns, but no other column counts. It follows that a visualization must contain at least one value above zero in this layer, or the tool will never suggest that visualization for any dataset. Furthermore, there are visualizations that support any number of data columns, and the alphabet supports this as well by implementing the character ’or more’. A visualization that exposes the value 3 and the character ’or more’ thus supports datasets that contain at least three columns.

• Layer 2 serves to expose information about the parameters of a visualization. Each parameter of a visualization tends to be mapped to a different aspect of that visualization, such as position, color, or shape. As such, each parameter receives its own array of characters to describe it. This layer directly tells something about the values that are allowed to be in a column of a dataset. If an illegal value is present in that dataset, the visualization will not be a viable suggestion. By default, if no characters are assigned to a parameter’s array, it accepts numeric values in ℕ⁺. Characters can be added to broaden or narrow the range of allowed values. For example, the ’-’ character states that the parameter is compatible with negative values and the ’latitude’ character restricts the values to floating point values between -90.0 and 90.0. This means that the alphabet is able to recognize geographical data.

Each layer 1 numbered parameter can receive its own array of characters, but parameters that fall under the layer 1 ’or more’ character must all have the same array of characters.

• Layer 3 serves to expose additional restrictions of the visualization to the VSA. This is required because some visualizations impose very strict requirements on the dataset that are inconvenient for most other visualizations. This layer can be used to force the matrix to be square, which is useful for visualizations of bidirectional relationships. In addition, this layer can be used to state a maximum amount of data that the visualization can handle without becoming messy.
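The three layers can be read as three successive checks on a dataset. The sketch below shows one possible way to encode an alphabet definition and test a numeric matrix against it. The object layout (`dimensions`, `orMore`, `parameters`, `restrictions`), the function names, the decision to accept zero by default, and the interpretation of the element limit as rows times columns are all assumptions of this example, not Dragon's actual implementation.

```javascript
// Hypothetical encodings of a pie chart and a chord diagram definition.
const pieChartDefinition = {
  dimensions: [1],                   // layer 1: exactly one column
  orMore: false,
  parameters: [["float"]],           // layer 2: one character array per parameter
  restrictions: { maxElements: 20 }, // layer 3
};
const chordDiagramDefinition = {
  dimensions: [2],
  orMore: true,                      // two columns or more
  parameters: [["float"], ["float"], ["float"]], // last array covers 'or more'
  restrictions: { square: true },
};

// Layer 2: is every value in the column allowed by the parameter's characters?
// Assumption: without characters, only non-negative integers are accepted.
function columnAllowed(column, characters) {
  return column.every((value) => {
    if (characters.includes("latitude")) return value >= -90 && value <= 90;
    if (value < 0 && !characters.includes("-")) return false;
    if (!Number.isInteger(value) && !characters.includes("float")) return false;
    return true;
  });
}

function matchesDefinition(matrix, definition) {
  const rows = matrix.length;
  const columns = rows > 0 ? matrix[0].length : 0;

  // Layer 1: an exact column count, or at least the largest listed count
  // when 'or more' is present.
  const exact = definition.dimensions.includes(columns);
  const open = definition.orMore && columns >= Math.max(...definition.dimensions);
  if (!exact && !open) return false;

  // Layer 2: check each column against the characters of its parameter.
  const parameters = definition.parameters;
  for (let j = 0; j < columns; j++) {
    const characters = parameters[Math.min(j, parameters.length - 1)] || [];
    if (!columnAllowed(matrix.map((row) => row[j]), characters)) return false;
  }

  // Layer 3: additional restrictions.
  const { square = false, maxElements = Infinity } = definition.restrictions || {};
  if (square && rows !== columns) return false;
  return rows * columns <= maxElements;
}
```

Ordering the checks from cheapest (column count) to most expensive (per-value inspection) lets a mismatch be rejected early, which matters when many visualizations are registered.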

2.3.3 Examples

Table 2.1 contains the alphabet definition for three visualizations.

The first visualization is a pie chart, a one-dimensional visualization. Layer 1 exposes the dimensionality of this visualization; it accepts exactly one parameter. Layer 2 defines which values the parameter accepts. Recall that the default is that all values in ℕ⁺ are accepted automatically and require no definition. By adding the ’float’ character, we allow floating point values in the column. Layer 3 imposes an additional restriction; the pie chart may only contain up to twenty elements.

The second visualization is a chord diagram, which is used to display bidirectional relationships. See figure 5.8 for a visual example of a chord diagram. Layer 1 exposes the dimensionality; a chord diagram needs at least two parameters, but has no upper bound. Layer 2 defines which values each parameter accepts; all parameters contained in the ’or more’ must be assigned the same character set. Layer 3 imposes an additional restriction; because the chord diagram displays bidirectional relationships, the matrix must be square.

The third visualization is a world map. This visualization accepts exactly three or four parameters. Each of the parameters has its character set defined in layer 2. This visualization requires no special restrictions, so layer 3 is empty.


visualization    layer   layer characters

pie chart        1       1
                 2       parameter 1: float
                 3       20

chord diagram    1       2, or more
                 2       parameter 1: float
                         parameter 2: float
                         parameter 3 and onwards: float
                 3       square-matrix

map              1       3, 4
                 2       parameter 1: latitude
                         parameter 2: longitude
                         parameter 3: float, -
                         parameter 4: float, -
                 3       none

Table 2.1: Alphabet definition for three different visualizations.

It should be noted that, although we speak about an ’alphabet’, we cannot make words or sentences of these characters. The collection of characters assigned to a certain visualization parameter or data column should be seen as a set, rather than as a word or sentence.

2.3.4 Conclusion and refining

The VSA we have discussed so far is capable of analyzing a dataset and attaching the discussed characters to that dataset. Using the known alphabet definitions of the visualizations, the tool can determine if the dataset could be matched to a visualization. The alphabet is quite discriminative and allows for countless combinations of characters and consequently for that many visualizations. However, the tool does not know which parameters map to which visual property of the visualization. This is not optimal, because Bertin noted that not every visualization parameter is equally expressive [2] [5], which means that the values of some visualization properties are more easily distinguished than others. Table 2.2 shows the ranking Bertin found with regard to the ’visibility’ of parameters, with a difference in position being most distinguishable and a difference in texture being least distinguishable. This observation allows for a potential addition to the VSA, where it will not just select a visualization but also optimize its use by mapping the most important data to the most expressive parameter.

rank   visualization variable
1      position
2      size
3      shape
4      value
5      color
6      orientation
7      texture

Table 2.2: Visualization parameters for quantitative expressiveness, ranked from most to least distinguishable by Bertin [2].

In order to let the VSA know how it should perform the optimization, we need to add characters to the alphabet that allow visualizations to mark which parameter of that visualization maps to which visualization variable. Layer 2 of the alphabet describes individual parameters and as such is the perfect place to add this extension. We expand this layer by adding the visualization variables as found in table 2.2 as characters, providing the VSA with the required knowledge to perform this optimization. For example, the character ’position’ could be added to a layer 2 parameter to indicate to the VSA that the visualization variable for this parameter is position.

We also need a way to mark which data is more important than other data. Like the rest of the VSA, we approach this by using the columns of the dataset. The user interface allows users to rank the columns of the dataset, allowing them to signal to the tool which columns are of most importance. Although this does complicate the usage of the tool slightly, it is a safe assumption that the user has an idea of which data is most important, and it allows the user to highlight multiple aspects of the data by running it through the tool multiple times with a different ordering of the columns.
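The optimization described above amounts to pairing two rankings: the user's ranking of columns and Bertin's ranking of visualization variables. A minimal sketch follows; the function name `assignColumns` and its input layout are assumptions of this example, and it deliberately ignores the layer 2 value checks that a complete VSA would also have to respect.

```javascript
// Ranking from table 2.2, from most to least distinguishable.
const BERTIN_RANKING = ["position", "size", "shape", "value", "color", "orientation", "texture"];

// parameterVariables[i] is the visualization variable exposed by parameter i.
// rankedColumns lists dataset column indices, most important first.
// Returns an array in which entry i is the column assigned to parameter i.
function assignColumns(parameterVariables, rankedColumns) {
  const order = parameterVariables
    .map((variable, index) => {
      const rank = BERTIN_RANKING.indexOf(variable);
      // Unknown variables are treated as least expressive (an assumption).
      return { index, rank: rank === -1 ? BERTIN_RANKING.length : rank };
    })
    .sort((a, b) => a.rank - b.rank);

  const assignment = new Array(parameterVariables.length);
  order.forEach((parameter, i) => {
    assignment[parameter.index] = rankedColumns[i];
  });
  return assignment;
}
```

For a visualization whose parameters map to color, position, and size, in that order, the column the user ranked first ends up on the position parameter.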

It should be noted that this VSA, unlike the previous proposal, does not include a learning component. This means that the answers of the VSA will always be the same unless new visualizations are added. Although feature vectors could be adjusted slightly in response to user feedback, the alphabet characters assigned to the parameters cannot be adjusted in the same way. Removing or changing a character assigned to a visualization significantly changes the behavior of the VSA with regard to that visualization.

The combination of the alphabet VSA and the implementation of optimization using Bertin’s visualization variables should result in a powerful algorithm that allows for easy integration of more visualizations while still retaining the ability to make useful visualization suggestions.


CHAPTER 3

Implementation of Dragon

This chapter details the implementation of a tool proposal called Dragon, which implements the previously mentioned alphabet visualization selection algorithm in a fully functional framework. The tool also serves as an example of how to set up a framework that is compatible with a wide range of file formats and data visualization techniques. A fully functional demonstration of the tool, which includes six visualizations and CSV (comma-separated values) file support, is available at www.ragey.net/dragon. Some datasets that can be used with the tool can be downloaded from www.ragey.net/dragon/data.

Recall that the goal for this framework is to simplify the process of data visualization not only by choosing a visualization for the user, but also by providing a single framework in which all those visualizations can be implemented. This chapter highlights primarily how this second point is achieved. Inspiration for Dragon is drawn from the Universal Visualization Platform by Gee, who also provided a framework that introduced compatibility with a multitude of user-specified visualizations [9]. The combination of a visualization selection algorithm and the ability for the user to specify importance between the columns are features that distinguish Dragon from previous work. Their inclusion required the tool to be written from the ground up.

Dragon provides a graphical user interface (GUI), which is written in HTML and CSS. The engine of the tool, including all visualizations and other extensions, is written in JavaScript. Because of this, the tool requires a browser to run in, but it does not require an active Internet connection to function.

3.1 Specifications

The most important goal for the tool is to significantly simplify the task of turning raw data into a visualization, and it is this goal that forms the base for each of the following specifications. Although some of the aspects of this list have already been discussed, it is useful to have them listed together to get a picture of what the tool will eventually do for the user.

• The tool must be able to present data visualizations to the user.

• The tool helps the user by selecting the visualization for that user. The theory and ideas behind this algorithm have been discussed in the previous chapter.

• The tool takes into consideration the wide range of devices and input methods that are available today. Ideally, the tool targets a platform that is compatible with many operating systems, input devices, et cetera. This directly affects the usability of the tool, and was achieved by implementing the tool in web languages.

• The tool offers a framework that performs every step of the process itself without the need of additional software. Having the entire process take place in a single tool significantly improves the user experience.


• The tool can be expanded upon. It serves as a framework that can incorporate additional visualizations and data sources with ease. The idea of expandability is vital to this tool, because it targets an emerging field of development. Expandability allows the tool to remain useful even when new technologies and data become available.

3.2 Tool hierarchy

The overall software design for Dragon is layered and can be seen in figure 3.1. Dragon can be divided into four modules. One of these modules is the previously discussed visualization selection algorithm, in vsa.js. Another is dragon.js, which serves as the core of the tool and ties all other modules together. It also communicates with the front end. The last two modules are collections of user extensions: one collection for file handlers and one collection for visualizations.

Figure 3.1: Overview of the software architecture of Dragon. The core of the tool is located on the second and third layers, counting from the top, and the user extensions are on the bottom layer.

3.3 Execution flow

Figure 3.2 shows an execution sequence for the tool. Most of the process takes place behind the scenes without the user noticing. First, the user enters a dataset into the tool, which prompts dragon.js to call the right file handler to extract the data and convert it into a format that the rest of the tool and its extensions can understand. The tool then returns to the user and asks which data columns should be used. This step allows the user to discard portions of the data and to rank the chosen columns by importance. The remaining columns and their ranking are sent to the visualization selection algorithm, which issues a ranking of the visualizations. The highest-ranking visualization is called and the result is displayed on the screen. At this point the tool is done, but the user can still select another viable visualization instead of the recommended one, if available. Because the data has already been processed, selecting a different visualization only requires Dragon to call one other visualization extension, without having to walk through the entire process again.
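The sequence above can be summarized in a few lines of code. The following is a simplified sketch under stated assumptions: `runPipeline`, the `registry` layout, and the stub extensions are hypothetical names introduced for illustration, and the real dragon.js interacts with the user between the steps rather than receiving the column ranking as an argument.

```javascript
// Hypothetical sketch of the execution flow; none of these names are Dragon's.
function runPipeline(fileName, contents, rankedColumns, registry) {
  // 1. Pick the file handler by extension and convert the file to a matrix.
  const extension = fileName.split(".").pop().toLowerCase();
  const handler = registry.fileHandlers[extension];
  if (!handler) throw new Error("No file handler for ." + extension);
  const dataset = handler(contents);

  // 2. Keep only the columns the user selected, in order of importance.
  const matrix = dataset.matrix.map((row) => rankedColumns.map((c) => row[c]));

  // 3. Ask the selection algorithm for an ordered list of visualization names.
  const ranking = registry.selectVisualizations(matrix, registry.visualizations);

  // 4. Draw the highest-ranked visualization, if any is viable.
  const output =
    ranking.length > 0 ? registry.visualizations[ranking[0]].draw(matrix) : null;
  return { matrix, ranking, output };
}

// Stub extensions, for illustration only.
const exampleRegistry = {
  fileHandlers: {
    csv: (text) => ({
      matrix: text.trim().split("\n").map((line) => line.split(",").map(Number)),
    }),
  },
  visualizations: {
    barchart: { draw: (matrix) => "bar chart of " + matrix.length + " rows" },
  },
  selectVisualizations: (matrix, visualizations) => Object.keys(visualizations),
};

const result = runPipeline("stocks.csv", "1,2,3\n4,5,6", [2, 0], exampleRegistry);
```

Because the returned `matrix` is kept, switching to another viable entry in `ranking` only requires one further `draw` call, which mirrors the last step of the description.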


Figure 3.2: The execution flow of the tool when it is presented with the file ’stocks.csv’. The file handler in this case is ’csv.js’ and the visualization selected by ’vsa.js’ is ’barchart.js’.

3.4 File format handling

Data can be stored in many different ways, and there are literally hundreds of different file formats available today, such as CSV (Comma-Separated Values), JSON (JavaScript Object Notation) and XLSX (Microsoft Excel). They make data visualization difficult because they each store their data in significantly different ways. Most software solutions only support a few of the file formats available and unfortunately offer only a few visualizations. We lose possibilities when a software solution does not support the combination of file type support and visualizations the user needs. Dragon attempts to solve this problem by providing a framework that can conceptually handle any file format.

Dragon implements its own data format and offers an interface for converting any file format into it. Programmers can write extensions that convert a file format to Dragon’s format. As soon as an extension is written for a file format, Dragon supports that file type.

The data format that is implemented in Dragon supports four elements, of which one is required and three are optional. The required element is a matrix that contains the data that is read from the file. This matrix may only contain numeric elements. It may be of any shape as long as each row has the same number of values. Two optional elements are arrays of text labels: one array for row labels and one array for column labels. The third optional element is a single text entry, specifying the name of the dataset. All file formats must be mapped onto these four elements. If the file contains additional information that cannot be mapped onto them, that information must be discarded.
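To make the four elements concrete, the sketch below shows one way such a dataset could be represented and validated. The property names (`matrix`, `rowLabels`, `columnLabels`, `name`) and the function `isValidDataset` are assumptions of this example, as are the rules that an empty matrix is invalid and that label arrays must match the matrix dimensions.

```javascript
// Hypothetical validation of a dataset in a Dragon-like internal format.
function isValidDataset(dataset) {
  const { matrix, rowLabels, columnLabels, name } = dataset;

  // Required: a non-empty, rectangular matrix of finite numbers.
  if (!Array.isArray(matrix) || matrix.length === 0) return false;
  const width = Array.isArray(matrix[0]) ? matrix[0].length : -1;
  const rectangularAndNumeric = matrix.every(
    (row) =>
      Array.isArray(row) &&
      row.length === width &&
      row.every((v) => typeof v === "number" && Number.isFinite(v))
  );
  if (!rectangularAndNumeric) return false;

  // Optional: labels must line up with the matrix when present.
  if (rowLabels !== undefined && rowLabels.length !== matrix.length) return false;
  if (columnLabels !== undefined && columnLabels.length !== width) return false;

  // Optional: the dataset name is a single text entry.
  if (name !== undefined && typeof name !== "string") return false;
  return true;
}

// Example dataset with all four elements (values invented for illustration).
const exampleDataset = {
  matrix: [[1.5, 10], [2, 4]],
  rowLabels: ["apple", "pear"],
  columnLabels: ["price", "stock"],
  name: "fruit",
};
```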

Without going into programming details, adding a file handler to Dragon requires the programmer to write a JavaScript file, add it to a folder within the tool’s environment, and add the new file handler to a list file. Although writing the code might take some time, integrating the code into Dragon takes little time and effort.
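As an indication of how small such a handler can be, the following sketch converts simple CSV text into the four-element format. It is not Dragon's csv.js: the function name `csvHandler`, its signature, and the assumption that the first row and first column hold labels are all made up for this example, and quoted fields are not supported.

```javascript
// Hypothetical CSV file handler. Assumes the first row holds column labels
// and the first column holds row labels; quoted fields are not handled.
function csvHandler(text, name) {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const cells = lines.map((line) => line.split(",").map((cell) => cell.trim()));

  const columnLabels = cells[0].slice(1);
  const rowLabels = cells.slice(1).map((row) => row[0]);
  // Note: Number("") is 0, so empty cells become zero in this sketch.
  const matrix = cells.slice(1).map((row) => row.slice(1).map(Number));

  // The internal matrix must be rectangular and purely numeric.
  const invalid = matrix.some(
    (row) => row.length !== columnLabels.length || row.some((v) => Number.isNaN(v))
  );
  if (invalid) {
    throw new Error("CSV must contain a rectangular block of numeric values");
  }
  return { matrix, rowLabels, columnLabels, name };
}
```

Anything in the file that does not fit these four elements is simply not carried over, in line with the rule that unmappable information is discarded.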

3.5 Visualizations

Visualizations are added in much the same way as file handlers. They are written in JavaScript and must expose two functions to Dragon: a ’meta’ function that exposes the alphabet definition of the visualization, and a function that draws the visualization to the screen. Dragon imposes conventions with which programmers must comply, such as the location of the visualization with respect to the screen edges. Dragon also provides a few utility functions, such as a legend creator, so that visualizations written by different programmers still feel coherent. In addition, programmers can use any API targeting JavaScript to aid them in creating their visualization.
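A visualization extension with these two functions might look roughly as follows. This is a sketch only: the object layout, the shape of the value returned by `meta`, and the name `pieChartExtension` are assumptions, and `draw` here merely computes slice angles instead of rendering, so that the example runs without a browser. An actual extension would pass such angles to a drawing API such as D3.

```javascript
// Hypothetical pie chart extension exposing a 'meta' and a 'draw' function.
const pieChartExtension = {
  // Exposes the alphabet definition (an assumed encoding of the three layers).
  meta() {
    return {
      dimensions: [1],                   // layer 1
      orMore: false,
      parameters: [["float"]],           // layer 2
      restrictions: { maxElements: 20 }, // layer 3
    };
  },

  // Computes the start and end angle (in radians) of each slice.
  // Assumes a single column of positive values.
  draw(dataset) {
    const values = dataset.matrix.map((row) => row[0]);
    const total = values.reduce((sum, v) => sum + v, 0);
    let start = 0;
    return values.map((value, i) => {
      const end = start + (value / total) * 2 * Math.PI;
      const label = dataset.rowLabels ? dataset.rowLabels[i] : String(i);
      const slice = { label, start, end };
      start = end;
      return slice;
    });
  },
};
```

Keeping the alphabet definition next to the drawing code means that adding the extension is enough for the selection algorithm to start considering it.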

3.6 Conclusion

The architecture that was used to implement the components of Dragon ensures that the tool is easily expanded in multiple ways. Support for multiple file formats is made possible by allowing programmers to add extensions that convert the contents of these files to the format internally used by Dragon. The pool of visualizations can easily be expanded using a similar extension interface. Furthermore, each visualization only needs to support Dragon’s formatting, and due to the way the tool is set up, such visualizations are automatically compatible with each file format implemented. The cross-compatibility between different file formats and visualizations within a single framework benefits the user experience.


CHAPTER 4

Web visualization techniques

This chapter discusses the technology used by Dragon to draw visualizations to the screen. As Dragon targets web browsers, it is written in HTML, CSS, and JavaScript. To ensure the tool is easy to install and requires no online connectivity, only client-side technology was used. The combination of these three languages and some powerful APIs allows the tool to draw complex visualizations with relatively little code, which in turn eases the process of adding visualizations to the tool.

4.1 Introduction to web technology

This section provides an introduction to the web technology used in Dragon, which is required for the remainder of this chapter. An extensive tutorial is beyond the scope of this paper and can be found online¹. Dragon uses the three languages HTML, CSS, and JavaScript, with HTML and CSS providing a visual front end and JavaScript an invisible back end.

HTML is a markup language that defines the elements within the browser window, such as a string of text or an image. CSS defines how these elements are presented. For example, CSS can make the string of text red and enlarge the image by 140%. The combination of these two languages provides us with all the tools needed to draw anything to the browser window, such as visualizations, but also user interaction elements such as menus.

JavaScript is a scripting language. JavaScript is able to monitor HTML elements, which allows JavaScript code to respond whenever the monitored element is interacted with. Furthermore, JavaScript is able to add, remove, or modify HTML elements and the CSS that affects those elements. The effect this has is of course visible to the user. This allows Dragon to use JavaScript to generate (interactive) menus and visualizations on the fly, depending on user interaction and the dataset entered. Almost all HTML and CSS elements within Dragon are generated by JavaScript, with static HTML and CSS making up less than 5% of the tool².

In December 2012, the W3C published the HTML5 specification as a Candidate Recommendation. The updated standard includes many new features, some of which are interesting for data visualization. The most important one is support for inline SVG (Scalable Vector Graphics) elements, which allow an image to be defined with markup and styled with CSS. Most of the default visualizations added to Dragon use JavaScript to generate SVG elements.

4.2 D3

Web technologies have proven to be very popular and have resulted in many visualization APIs targeting JavaScript, one of them being D3.js³, which stands for Data-Driven Documents. D3 is a library written in JavaScript and should not be seen as a visualization framework [4], but rather as a toolbox offering a wide range of utility functions.

¹ www.w3schools.com
² According to git
³ www.d3js.org

It should be noted that the following text talks about JavaScript. If we talk about an HTML element, we are actually talking about the variable in JavaScript that references that HTML element.

The first step is to introduce a new SVG element that will contain the visualization. Dragon can then introduce any number of elements inside this SVG element that will make up the visualization. D3 allows us to bind data to an SVG element, which makes it available to all elements within the SVG environment. One must realize that the contents of the data must be remapped before a computer can visualize it. If the original data contains distances in kilometers, these distances must be converted to pixels before they can be visualized. Using data binding, functions can be defined that convert the raw data into visualization-compatible data while retaining the original aspects of the data. This step effectively also allows us to set the size of the visualization; Dragon will typically create visualizations roughly the size of the browser window.

    // define rescaled data so that it fits in the SVG
    var xscale = d3.scale.linear()
        .domain([d3.min(data), d3.max(data)])
        .range([0, Dragon.winWidth]);

    // create an SVG element within the browser window
    var svg = d3.select('#visuals').append('svg')
        .attr('width', Dragon.winWidth)
        .attr('height', Dragon.winHeight)
        .attr('class', 'scatter_chart');

Figure 4.1: Simplified JavaScript code that creates a new SVG element that fits in the browser window and defines a rescaling function for data of a column so that it fits in the SVG.
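To make the rescaling step more concrete, the following is a minimal plain-JavaScript sketch of what a linear scale computes. It is an illustration of the idea only, not D3's implementation; the name makeLinearScale, the handling of a zero-width domain, and the example values (50 kilometers, 800 pixels) are assumptions made for this example.

```javascript
// Plain-JavaScript sketch of a linear scale (illustrative; not D3's code).
// Maps values from [domainMin, domainMax] onto [rangeMin, rangeMax].
function makeLinearScale(domainMin, domainMax, rangeMin, rangeMax) {
    var domainSpan = domainMax - domainMin;
    return function (value) {
        // Assumption: a zero-width domain maps everything to the range start.
        if (domainSpan === 0) {
            return rangeMin;
        }
        var fraction = (value - domainMin) / domainSpan;
        return rangeMin + fraction * (rangeMax - rangeMin);
    };
}

// Example: distances between 0 and 50 km drawn in a window 800 pixels wide.
var kmToPixels = makeLinearScale(0, 50, 0, 800);
console.log(kmToPixels(25)); // 400
```

The returned function plays the same role as xscale in figure 4.1: visualization code passes raw values through it and draws with the result.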

By selecting the visualization’s SVG, we can then append visualization elements to the SVG using the rescaled data values, as demonstrated in figure 4.2. In addition, D3 provides utility functions that can create axes based on the scaled data, allowing easy generation of an axis that exactly fits all data. Note that not all visualizations require axes.

    svg.selectAll('scatter-dots')
        .data(data)
        .enter().append('svg:circle')
        .attr('cx', function (d, i) {
            return xscale(d[0]);
        });

Figure 4.2: Simplified JavaScript code that expands on the code in figure 4.1. Note how the data is bound by the call to .data(data) on line 2, which enables us to pass a function to .attr on line 4 that is evaluated for each data element. The code adds the data elements of a column to the X axis of the visualization in the form of circles. The values of xscale are used rather than the values of the actual data.

This concludes all that needs to be done to create a visualization, although some surrounding boilerplate code is needed to tie it all together. Furthermore, certain visualizations require a legend to explain what is on the screen. D3 does not provide a convenient way to attach a legend to the visualizations and requires all labels to be attached to axes or visualization elements. This is inconvenient for some visualizations. To address this, the Dragon API exposes functionality to easily add a legend to the visualization. The legend can be dragged around so that it does not obscure potentially important data.


D3 is a library that allows programmers to easily create complex visualizations and add them to Dragon. However, it cannot be used to implement all visualizations. Most notably, the library offers limited support for three-dimensional and higher-dimensional visualizations. Although the library can be used to create a world map, this is far from optimal, and Dragon uses the Leaflet⁴ library for this purpose instead. As was previously mentioned, Dragon does not rely on a single visualization library, so future visualization extensions could be made using any other library, or using only plain JavaScript.


CHAPTER 5

Experiments

Dragon is capable of reading data and analyzing it. It suggests a visualization technique based on the contents of that data. The tool can be used on any device that supports a web browser that implements a modern JavaScript standard. Computers, (smart)phones, and tablets usually fall in this category. We are now interested in answering the original research question: Is Dragon able to produce visualization suggestions to a satisfactory degree? It is difficult to quantify what a satisfactory degree exactly is in the context of what Dragon produces as output. In the following experiments, we consider the output satisfactory if the visualization is legible and reveals some of the characteristics of the data.

5.1 Method

We will provide Dragon with ten significantly different datasets and see what Dragon produces. First, we want to verify that every dataset actually produces a legible visualization. This confirms whether Dragon correctly discards visualizations that are syntactically incompatible with the dataset entered. We would also like to measure the quality of the suggested visualization. For each dataset, we know some characteristics of the data, and we look at the visualization to see whether we can recognize them. Characteristics can include shape, outliers, and patterns. The characteristics that can be made visible differ for each visualization technique, so depending on the choice Dragon makes, we look for the characteristics that the chosen visualization could reveal. If a suggestion results in a legible visualization that also reveals characteristics of the dataset, we consider the suggestion good. If a suggestion is legible but reveals no characteristics, we consider the suggestion valid. If a suggestion results in a visualization that is useless or illegible, we consider it invalid.
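The three outcome labels can be summarized as a small decision function. This is only a sketch of the rating rule described above: the judgments themselves were made by hand, and the function name rateSuggestion and the flags usable and revealsCharacteristics are hypothetical names introduced for this example.

```javascript
// Hypothetical helper (not part of Dragon): maps the two manual judgments
// made per dataset onto the three labels used in the results.
function rateSuggestion(judgment) {
    // 'usable' is an assumed flag: true when the visualization is legible
    // and not useless for the dataset at hand.
    if (!judgment.usable) {
        return 'invalid';
    }
    return judgment.revealsCharacteristics ? 'good' : 'valid';
}

console.log(rateSuggestion({ usable: true, revealsCharacteristics: false })); // valid
```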

It should be noted that these tests serve only to test the visualization selection process, not the optimization process implemented by using Bertin’s visualization variables.

5.2 Results

The results of the experiment can be seen in table 5.1. Following the result table is a brief description of each of the datasets, describing their dimensionality and contents. Dragon was unable to open some datasets due to formatting errors within those datasets. These were omitted from the tests and descriptions that follow. Visualizations by Dragon can be seen in figures 5.1 through 5.10.


#    dataset                              visualization chosen   result
1    primes                               bar chart              good
2    age groups                           pie chart              good
3    tornado (X, Y, VY)                   scatter plot           good
4    tornado (X, Y, Z, VY)                scatter plot           valid
5    airline traffic                      scatter plot           good
6    letter frequencies                   bar chart              good
7    baseball managers (G, W, L, Rank)    scatter plot           good
8    hair color attraction                chord diagram          good
9    sf neighborhood                      world map              good
10   stocks                               scatter plot           invalid

Table 5.1: Results of the experiments performed on ten heterogeneous datasets.

• primes is a list of the first 50 prime numbers.

• age groups is a small one-dimensional dataset that divides approximately 40 million people in six age groups. It is unknown where this data originated from.

• tornado is a six-dimensional dataset that describes a three-dimensional figure of a tornado. It contains the X-, Y-, and Z-coordinates of vectors within this tornado, and also stores the X-, Y-, and Z-velocity of those vectors. Because the highest dimensionality among the visualizations included in the prototype is four, only four of the columns could be tested at once.

• airline traffic is a two-dimensional dataset that contains a count of people who traveled by plane between the years 1949 and 1960.

• letter frequencies is a one-dimensional dataset which contains the occurrence percentage of each letter in the alphabet. It is unknown on which language the data is based.

• baseball managers is a five-dimensional dataset which contains the performance of teams led by certain baseball managers.

• hair color attraction is a four-dimensional bidirectional dataset. It contains the number of occurrences of each hair color among a group of people, along with a count of their physical attraction towards a certain hair color.

• sf neighborhood is a three-dimensional dataset with the population density per neighborhood in San Francisco. The dataset includes latitude and longitude values.

• stocks contains the stock progression of three imaginary companies over time. (No real data in this field was available in a suitable format.)

We can draw some conclusions from the test results. First, we note that the scatter plot visualization is selected more often than any other visualization. Second, we see that a sixth visualization, the line graph, was never suggested, even though we passed datasets for which that visualization would have been better (dataset 5, airline traffic, and dataset 10, stocks; see figures 5.5 and 5.10).

The first conclusion is grounded on the fact that the scatter plot visualization has a very versatile alphabet definition; it supports four dimensionalities and a wide range of possible values for each dimension. Furthermore, it imposes no restrictions on layer 3. This versatility causes the tool to select it more often than the other, more restricted, visualizations.

The second conclusion is closely related to the first. The scatter plot and the line graph visualizations have an identical alphabet definition on layer 2, which means Dragon cannot differentiate between the two. This causes problems in the form of a suboptimal choice (for the airline traffic dataset) or even an outright wrong choice (for the stocks dataset).

Overall, Dragon performed well. Most datasets produced a clearly visible visualization that effectively displayed characteristics of the dataset.


Figure 5.1: The prime numbers dataset.

Figure 5.2: The age groups dataset. We can see some characteristics of this dataset, such as the proportional size of each age group compared to that of the others.


Figure 5.3: The tornado dataset, using the columns X, Y, and VY. We can clearly see the shape of the tornado in the visualization. It should be noted that achieving this requires a combination of effort from both the user and from Dragon, as the visibility of the tornado also depends on which columns of the dataset are chosen.

Figure 5.4: Another visualization of the tornado dataset, this time using the columns X, Y, Z, and VY. The result is not as good as in figure 5.3. It is hard to recognize any characteristics here.


Figure 5.5: Two visualizations of the airline traffic dataset. The top figure is produced by Dragon’s VSA and although the scatter plot also reveals the characteristics of the dataset, the bottom representation (ranked second by Dragon) is probably better suited for this dataset.


Figure 5.7: The baseball managers dataset visualized by Dragon. The number of games played is displayed along the X axis, the number of games won is displayed along the Y axis, the size of the circles denotes games lost, and the color displays rank from best (yellow) to worst (red). We can clearly see many characteristics, such as the fact that teams that play more games generally also win more games, and the direct relationship between losing games and having a low ranking.

Figure 5.8: The hair color dataset, which displays the hair color of people against the hair color of the people they feel physically attracted to.


Figure 5.9: The San Francisco neighborhood dataset. Larger circles represent a higher population density.

Figure 5.10: Two visualizations of the stocks dataset. The top figure is produced by Dragon’s VSA. Location describes the progression of one stock, size of another, and color of the third stock. Clearly this is not a useful visualization of stock progression. The bottom visualization is what we are looking for.


CHAPTER 6

Conclusion

The previous research of Blom on automatic visualization suggestions [3] and the research of Gee on building a generic visualization framework [9] have proven vital for many ideas in the present research. This research has produced a tool that is capable of providing automatic visualization suggestions within a framework that supports the integration of any number of visualization techniques. Combining these two characteristics was made possible by the alphabet visualization selection algorithm, which gives programmers of visualizations a convenient way to categorize their visualizations. This in turn enables the tool to give visualization suggestions from a growing pool of visualization techniques.

The separation of file handlers from the remainder of the tool and the definition of a single file format to be supported by the remainder of the tool requires all file formats to be converted to this new file format. However, it simplifies the process of data visualization significantly, because it allows for the integration of many file formats into a single framework. More importantly, it allows all visualizations to target only one single data format, which makes the integration of a large amount of visualizations less tedious.

Because Dragon was programmed in HTML, CSS, and JavaScript, it can run on any device that can run a modern web browser. This includes computers, (smart)phones, tablets, game consoles, and many other devices. If the browser running Dragon supports alternative input methods, such as touch displays, Dragon will support these as well. Furthermore, the development of many visualization APIs targeting these languages, such as D3, Leaflet, and WebGL, has made them a popular and viable platform for data visualization purposes.

However, the visualization alphabet as it is currently defined does not appear to perform optimally. Even between just six implemented visualizations, conflicts arise that cause the suggestions to be less than optimal. It is likely that this problem will become more evident when a significant number of additional visualizations is added, as that increases the chance of further conflicts. A potential solution for this problem would be to expand the alphabet with additional characters on the column-describing layer. The current alphabet for this layer only limits the possible values that can be placed in a column, but additional characters could be defined that tell more about the less obvious properties of the column, such as summary statistics or the mathematical relationship between the values in a column.

The research question was whether or not Dragon, being a framework that allows for the integration of an arbitrary amount of visualizations and support for multiple file types, would be able to provide useful visualization suggestions automatically, using only the information stored in the data itself. In most cases, Dragon provided useful suggestions.

To summarize, Dragon is a user-friendly tool that allows for quick visualization of data while requiring the user to have knowledge only about the data, and not about visualization theory. It immediately gives the user a suggestion and draws a visualization, but also allows the user to quickly see alternative compatible visualizations. For programmers and more advanced users, the tool provides an easy way to include additional visualizations and file handlers.


Bibliography

[1] Francis John Anscombe. Graphs in statistical analysis. American Statistician, 27:17–21, 1973.

[2] Jacques Bertin. Sémiologie graphique, 1973.

[3] Cédric Blom. Intelligent visualisatieframework. Bachelor’s thesis, University of Amsterdam, June 2013.

[4] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3: Data-driven documents, 2011.

[5] M. S. T. Carpendale. Considering visual variables as a basis for information visualisation, 2003.

[6] William S. Cleveland and Robert McGill. Graphical perception: Theory, experimentation, and application to the development of graphical methods. Journal of the American Statistical Association, 79:531–554, September 1984.

[7] Derek Cunningham. Analysis of the geometrical patterns found in the lascaux cave system. Midnight Science, 2014.

[8] M. Friendly. Handbook of data visualization, chapter A brief history of data visualization. Springer Berlin Heidelberg, 2008.

[9] Alexander G. Gee. A Universal Visualization Platform. PhD thesis, University of Massachusetts Lowell, 2005.

[10] Jock Mackinlay. Automating the design of graphical presentations of relational information. ACM Trans. Graph., 5(2):110–141, April 1986.


APPENDIX A

Working with Dragon

A prototype of the tool is available at www.ragey.net/dragon. The tool was tested in Firefox and Chrome, and works well in those browsers. The tool is not compatible with Internet Explorer. The prototype accepts CSV files and can create six different visualizations. If the CSV file is formatted incorrectly or contains data Dragon cannot work with, a warning is shown. It is recommended to use the direct upload option rather than the URL search box. Some datasets that are certain to be accepted by Dragon can be found at www.ragey.net/dragon/data.

Upon entering a correct CSV file, the tool allows the user to select which columns of the CSV file to visualize. Columns that contain non-numeric data outside of the labeling will not be displayed here, because Dragon cannot visualize them. The user can select any number of columns. The order in which the columns are selected affects how the eventual visualization looks, and can be seen on the right. After clicking on visualize, Dragon will propose a visualization and display it. If this visualization is unsatisfactory, the user can choose another visualization at the bottom of the suggested visualization.

A.1 Adding or removing extensions

A.1.1 File handlers

File handlers should have the function prototype DRAGONfilextension(data), for example DRAGONcsv(data). The data element is a stream of tokens, directly as it is read from the file. The file extension of a file determines which file handler Dragon will call. File handlers must return a single object with the elements matrix, xLabels, yLabels, and name. The labels and the name may be empty (null). The matrix must contain the data contained within the data character stream. If the file for some reason cannot be read because the stream provided by data is invalid, the function should return null.

The created file handler should be added to the folder filehandlers/ and should be registered by adding it to the file extension array in filelist.js. See filehandlers/csv.js for an example.
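As an illustration of the file handler contract described above, the following sketch handles tab-separated values. It is hypothetical: the prototype only ships a CSV handler, and the sketch assumes that data arrives as a single string holding the file contents and that the file contains numbers only, without label rows or columns. Registration in filelist.js is not shown, because its format is not described here.

```javascript
// Sketch of a file handler for tab-separated values (hypothetical; not part
// of the Dragon prototype). Assumes `data` is the file contents as a string
// and that the file holds numbers only, without label rows or columns.
function DRAGONtsv(data) {
    if (typeof data !== 'string') {
        return null;
    }
    // Split into lines and ignore empty ones (such as a trailing newline).
    var lines = data.split(/\r?\n/).filter(function (line) {
        return line.trim() !== '';
    });
    if (lines.length === 0) {
        return null;
    }
    var matrix = [];
    for (var i = 0; i < lines.length; i++) {
        var row = lines[i].split('\t').map(function (cell) {
            // Number('') is 0, so empty cells are rejected explicitly.
            return cell.trim() === '' ? NaN : Number(cell);
        });
        var hasInvalidCell = row.some(function (value) {
            return isNaN(value);
        });
        // Reject non-numeric cells and rows of unequal length.
        if (hasInvalidCell || (matrix.length > 0 && row.length !== matrix[0].length)) {
            return null;
        }
        matrix.push(row);
    }
    // Labels and name are optional and may be null.
    return { matrix: matrix, xLabels: null, yLabels: null, name: null };
}
```

Returning null for any malformed input keeps the handler in line with the rule that unreadable data must not produce a partial dataset.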

A.1.2 Visualizations

A visualization defines two functions; DRAGONname(data) and DRAGONnameMeta(). The first is the implementation of the visualization. Any element that is created here must be appended to the visuals HTML element. The data element here is the same object that is returned by a file handler. This means that the data itself can be obtained with data.matrix. To create the actual visualization, any JavaScript library can be used. See chapter 4 for more details. If the visualization needs a legend, a draggable legend element can be created using Dragon’s API. See figure A.1 for an example.


    addLegend(['Hello', 'World!']);

Figure A.1: Creates a legend with ’Hello’ on the first line and ’World!’ on the second. The addLegend function also accepts style elements, allowing legend elements to be assigned colors and other visual markers.

The meta function serves to expose the alphabet definition of the visualization to Dragon. See figure A.2 for an example of a meta() function.

    function DRAGONscatterMeta() {
        return {
            name: 'Scatter plot',
            dims: [1, 2, 3, 4],
            dim: [['position', '-', 'float'],
                  ['position', '-', 'float'],
                  ['size', '-', 'float'],
                  ['color', '-', 'float']],
            req: []
        };
    }

Figure A.2: Alphabet exposure function of the scatter plot visualization. The visualization supports datasets with one, two, three, or four columns. Each column supports any value. The scatter plot has no additional requirements. Note that the 'dims' value is the layer 1 we discussed in 2.3.2, 'dim' is layer 2, and 'req' is layer 3.
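To show how the 'dims' value might be used, the sketch below filters a list of meta functions by the number of selected columns. It covers layer 1 only and is not Dragon's actual selection algorithm; the helper filterByColumnCount and the second meta function DRAGONexampleMeta are made up for this illustration.

```javascript
// Hypothetical helper (not part of Dragon): returns the names of the
// visualizations whose 'dims' array (layer 1) lists the given column count.
function filterByColumnCount(metaFunctions, columnCount) {
    return metaFunctions
        .map(function (metaFunction) { return metaFunction(); })
        .filter(function (meta) { return meta.dims.indexOf(columnCount) !== -1; })
        .map(function (meta) { return meta.name; });
}

// Scatter plot definition as shown in figure A.2.
function DRAGONscatterMeta() {
    return {
        name: 'Scatter plot',
        dims: [1, 2, 3, 4],
        dim: [['position', '-', 'float'],
              ['position', '-', 'float'],
              ['size', '-', 'float'],
              ['color', '-', 'float']],
        req: []
    };
}

// Made-up definition for illustration only; it does not correspond to any
// visualization in the prototype.
function DRAGONexampleMeta() {
    return { name: 'Example chart', dims: [1], dim: [['size', '-', 'float']], req: [] };
}

console.log(filterByColumnCount([DRAGONscatterMeta, DRAGONexampleMeta], 3)); // [ 'Scatter plot' ]
```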

Created visualizations should be added to the visuals/ folder and should be registered by adding them to the array in visuals.js.
