### Eindhoven University of Technology

### MASTER

### NICUdash: a visual analytics tool for exploration of Neonatal Intensive Care Unit data

### Tran, Ky-Anh

Award date: 2022


### NICUdash: a visual analytics tool for exploration of Neonatal Intensive Care Unit data

### Master Thesis

### Ky-Anh Tran

### Department of Mathematics and Computer Science

### Visualization Research Group

### Supervisors:

### prof.dr. Anna Vilanova

### dr.ir. Carola van Pul

### Final Version

### Eindhoven, January 2022

## Abstract

Neonatal intensive care patients are closely monitored for their vital signs, as life-threatening complications can occur. The stored monitoring data takes the form of multivariate time series.

Currently, the hospital visualizes the curves of each patient separately; however, there is a need to look at the population as a whole, as researchers and neonatal experts want to explore and learn about patterns in this data in order to form hypotheses for further research. Research in data visualization is ongoing on using dimensionality reduction for interactive visual display of multivariate time series (MTS) data. We propose a visual analytics tool for exploration of neonatal intensive care monitoring data, and we created a strategy to calculate a distance matrix as input for dimensionality reduction. Users can navigate an overview of the whole population, select and examine groups of patients from it, and review specific patients.

## Preface

I would like to thank Anna Vilanova for the feedback and (technical) guidance during my thesis. Although this year had its ups and downs, thanks to her I learned a lot about my work ethic and how I could improve it to work independently on a project. I also want to thank Carola van Pul for giving helpful feedback wherever she could and for being patient with me. Next, I want to thank Natalia Sidorova for her feedback, and I would like to thank Jack van Wijk for being part of the assessment committee. Finally, I want to thank the people close to me, my family and friends, for supporting me during this project.

## Contents

Contents

List of Figures
List of Tables
Listings

1 Introduction

2 Project analysis
2.1 Domain Problem Analysis
2.2 Data type & Task abstraction analysis
2.2.1 Data Abstraction
2.2.2 Task Abstraction
2.3 User input requirements

3 Related Work
3.1 Patient monitoring data visualization
3.2 Visualization of time-varying data
3.2.1 Univariate Time Series visualization
3.2.2 Multivariate Time Series visualization
3.2.3 Visual Analytics

4 Data set pre-processing
4.1 Data averaging
4.2 Standardization
4.3 Alignment

5 NICUdash design

6 Dimensionality Reduction in NICUdash
6.1 Context for Dimensionality Reduction
6.2 Overall strategy
6.3 Missing data time series
6.4 Distance measure choices
6.5 Distance matrix calculation
6.5.1 Speeding Up Average Distance Matrix Calculation
6.6 Dimensionality Reduction choices

7 Results and Evaluation
7.1 Results
7.1.1 NICUdash Overview
7.1.2 Input Interface Component
7.1.3 Embedding component
7.1.4 Cluster summary component
7.1.5 Patient detail component
7.1.6 Interactions
7.1.7 Use case analysis
7.2 User Evaluation
7.3 Method
7.3.1 Questionnaire & feedback
7.4 Discussion

8 Conclusion and Future Work
8.1 Conclusion
8.2 Future Work

Bibliography

## List of Figures

2.1 The nested model for visualization design consisting of four layers. The output from the level above is an input for the level below. (Munzner [12])

3.1 Examples of two interfaces often seen in intensive care units. With (A) showing different line plots of the vital signs, and (B) showing an electronic record system (Faiola et al. [9]).

3.2 MIVA 2.0, a dashboard solution for analysis and interpretation of real-time trends and communicating clinical work and information [9].

3.3 Visualization showing clusters of days based on day patterns of employees being present at a company. Clusters are indicated by a color and are visualized on a calendar view; furthermore, the line plot shows the average day pattern of each cluster (van Wijk et al. [22]).

3.4 Visualization of lock-step alignment of time series. On the left, two time series, red and blue, are displayed; the vertical lines indicate the alignment between the curve points. Although the two time series are quite similar, they are not aligned correctly; some maxima are aligned to the minima between the two time series. On the right, the same time series are displayed; however, two windows for each time series indicate how the time series can be split up. The time series can be smoothed by, for example, using the average of each window. [3]

3.5 Visualization of an example of elastic alignment of two time series. Each data point in a time series is aligned to a data point in the other time series, even though the two time series are not of the same length. Furthermore, the image shows that similar local maxima and local minima are aligned accurately [3].

3.6 Example of two multivariate time series (MTS) items for two babies. The visualization shows that baby 1 has three variables measured while baby 2 has two variables measured over time.

3.7 Example visualizations of multivariate time-varying data. The charts display for five variables the values from monitoring data of one subject. (Nguyen et al. [13])

3.8 m-TSNE projection for activity levels of cancer patients, with each scatter point as one day. (Nguyen et al. [13])

3.9 Abstract overview of visual analytics showing the relation between visualization, data analysis and the human. (Cui [8])

3.10 Overview of MulTiDR, where users can explore similar instances and investigate what features and time points are relevant for the selected clusters. (Fujiwara et al. [10])

3.11 Two-step dimensionality reduction process visualized. This example first performs dimensionality reduction along the variables mode, then dimensionality reduction along the time points. The notations are: T, N, D, the lengths of the modes of time points, instances and variables respectively. Furthermore, X, X, y, Y, Z are a third-order tensor, matrix, vector, matrix and matrix respectively. (Fujiwara et al. [10])

5.1 Sketch 1 of main view of NICUdash with (1) the input interface, (2) the embedding component, (3) the cluster summary component Cluster Aggregated View and (4) the patient details component

5.2 Sketch 2 of main view of NICUdash with (1) the input interface, (2) the embedding component, (3) the cluster summary component Individual Lines View and (4) the patient details component

5.3 Sketch of measurement occurrence view of NICUdash with (5) Multiple patients occurrence count component, (6) Individual patient occurrence calendar component and (7) Multiple patients calendar component.

6.1 Multivariate time series of two patients plotted. In this example, the multivariate time series data has three variables. For variable 1 the time series have some overlap in time frame, for variable 2 this overlap is much smaller and for variable 3 patient 2 has no measurements.

6.2 Overall strategy from the Selected User Input to the resulting Lower Level embedding. The orange noted entities are objects and the red noted entities are actions.

6.3 Multivariate time series of two patients plotted. In this example, the multivariate time series data has three variables. The grey area indicates the Time series range selected by the user. The orange vertical lines with the black arrows indicate the time point areas where the two time series both have data. The intersection of the time points in the grey area and of the time points between the orange lines gives the final set of time points. For variable 1 the time series have some overlap in time, for variable 2 this overlap is much smaller and for variable 3 there are two ranges where there is overlap.

6.4 Euclidean distance matrix example [1]

6.5 Example of averaging two distance matrices for four patients p1, p2, p3 and p4. The red colored indices in the matrix indicate the patient pairs to be removed, while the orange colored indices will be replaced by the maximum value ∗ c, with c = 2 in this case. (step 4)

7.1 Overview of NICUdash. With (1) the input interface component; (2) the embedding component; (3) the clusters overview; (4) the patient detail overview.

7.2 On the left, the time series range selector by day, aligned by first day of measurement, and on the right, the time series range selector by hour, aligned by crash moment.

7.3 Example output of the dimensionality reduction, colored by the labels of the patients. With 1 the control group (pink colored) and 2 the sepsis group (yellow colored).

7.4 Dimensionality reduction output in a scatter plot where the patients are colored by their gestational age (GA). With the red hues early pre-term patients and with the blue hues later born pre-term patients.

7.5 Color scheme for defining the color of a cluster.

7.6 Dimensionality reduction output with three manually selected clusters using the lasso tool.

7.7 Options window with algorithm selection, distance measure selection and hyperparameter settings

7.8 Cluster summary component showing the time series of three selected clusters. The mean values and standard deviations are plotted per cluster.

7.9 Graph in the individual line mode; the time series of each patient from three selected clusters are plotted.

7.10 Patient details component showing three line graphs over days for patient 2 of group 1.

7.11 Example of the user brushing in the top graph; the highlighted patient is colored in bright green for all the individual line graphs.

7.12 Maturation analysis plot, on the left colored by gestational age and on the right colored by group, with a distance measure of Dynamic Time Warping.

7.13 Maturation analysis plot, on the left colored by gestational age and on the right colored by group, with a distance measure of RMSE.

7.14 Selected group of patients from figures 7.12 and 7.13

7.15 Dimensionality reduction result for parameters HF and RF, GA range [24, 32] and time series range in hours [-24, 24]

7.16 Dimensionality reduction result for parameters HF and RF, GA range [28, 32] and time series range in hours [-24, 24]; on the top right appears a separated group of control patients, while on the bottom center another separated group of controls is located

7.17 Dimensionality reduction result for parameters HF, SpO2 and RF, GA range [28, 32] and time series range in hours [-24, 24]; there appear no distinguishable groups of similar patients

7.18 View of NICUdash with settings that one neonatal expert used; selected are two groups of control patients. Using UMAP with Max-distance

7.19 View of NICUdash with settings that one neonatal expert used; selected are two groups of control patients. Using UMAP with RMSE.

## List of Tables

2.1 Data Abstraction of the monitoring data of a patient

2.2 Data Abstraction of the non-time varying data of a patient

4.1 Example of a data file of one parameter from one patient. The values are not real.

4.2 Table 4.1 after including alignment with a crash moment of 19-01-2020 20:00

7.1 Questionnaire questions and their results based on evaluation by two neonatal experts, with a score of 1: strongly disagree and 5: strongly agree

## Listings

### Chapter 1

## Introduction

Around 15 million babies are born preterm worldwide each year, which is more than 1 out of 10 babies. Of these, 1.1 million die from consequences of preterm birth [14]. A baby is preterm when it is born before 37 weeks of pregnancy. Preterm babies, and especially very preterm babies, may suffer from underdevelopment of their organ systems, which can hinder their physiological response.

Under these conditions, preterms require medical care such as constant nursing or respiratory support, which can be provided by a neonatal intensive care unit (NICU). This is an area in a hospital with advanced technology and specialized staff to take care of newborn babies, with most of these babies being preterm [2].

In the NICU, vital signs like heart rate, arterial blood pressure and oxygen saturation levels are continuously monitored by monitoring devices. These vital signs have to stay within specific ranges; otherwise alarms go off to notify nurses for possible interventions [23]. These measurements are time-varying; as a consequence the stored data is large and complex to analyze. Researchers and neonatal experts want to analyze and understand the data to improve care for the patients, which requires processing the data such that possible diseases or events can be predicted. Machine learning algorithms can learn from data and automatically detect patterns. Studies on NICU data have been performed to predict diseases such as sepsis, a blood infection, for which research shows that a drop in, for instance, heart rate variability is one of the characteristic signs of the disease. These studies use supervised machine learning algorithms [7][11].

These algorithms require labeled data; based on this knowledge the algorithm learns to classify unlabeled data points, i.e., detected patterns are matched with the labels of similar patterns.

In the context of the NICU there are specific labels, such as a certain disease.

However, there are cases where supervised machine learning is not a straightforward solution. Besides, machine learning solutions are black boxes and do not explain or give understanding. When researchers or neonatal experts do not have a specific label or disease to predict, these methods are not viable for learning about the data. Instead, other techniques that do not require labeled data need to be used to find patterns, based solely on the measurement data. Finally, if patterns are found, they can be used to create hypotheses for further research.

Visual analytics is an interactive approach that visually assists users in exploring big data so they can understand it, while minimizing the assumptions made and allowing for unforeseen insights. Users can find interesting patterns in groups, which can, for example, improve the labeling task.

Visual analytics combines automated analysis techniques with interactive visualization for the user, such that the user is in control of the direction of the analysis. Because the user is brought into the loop, user domain knowledge is used and results of the automated analysis can be examined and verified visually. In the case of the NICU data one use case could be finding and examining patterns in the maturation of patients during their stay in the hospital, while a different use case is finding new patterns leading to for example sepsis.

The choices of visualization are important in order to find structure in the data, such as clusters.

Clusters may be expected for groups such as patients likely to get a certain disease, while


there are also groups of patients that are likely to have no complications based on their time-varying measurements; however, these are assumptions and it is not certain whether they will hold. Many factors make automatically determining clusters harder: it is not always the case that all vital signs are measured in all patients, and different sets of vital sign measurements can complicate the use of machine learning techniques. Instead, users could be provided with tools to manually annotate the data after examination.

We propose visual analytics such that experts can explore the complex temporal data and formulate hypotheses that can be used to improve the understanding of the data and, potentially, its modelling. For this we present NICUdash, an interactive dashboard. In our visual analytics solution, dimensionality reduction is the core component, since it allows displaying high-dimensional data in a two-dimensional view. However, our data is high dimensional and time varying, making it non-trivial to use dimensionality reduction. Therefore we propose a strategy to create the distance matrix, with an emphasis on dealing with missing data without introducing artificial data. The distance matrix is the input for a dimensionality reduction algorithm. Defining distance measures to calculate the distance matrix is a challenge due to the multivariate, time-varying nature of the data. Next, we analyzed the output of the dimensionality reduction algorithm under different hyperparameter settings, and finally we performed a user evaluation of the dashboard with clinical users.
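To make the pipeline concrete, the following sketch illustrates the general idea under simplifying assumptions: a pairwise RMSE computed only on time points where both series have data (missing values marked as NaN), embedded in 2D with classical MDS implemented in NumPy. This is an illustration of the approach, not the NICUdash implementation; the actual distance measures, the handling of pairs without overlap, and the dimensionality reduction algorithms are discussed in later chapters.

```python
import numpy as np

def rmse_overlap(a, b):
    """RMSE between two equally indexed series, using only time
    points where both have data (NaN marks missing values)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return np.nan  # no overlap; the thesis handles this case separately
    d = a[mask] - b[mask]
    return np.sqrt(np.mean(d ** 2))

def distance_matrix(series):
    """Symmetric pairwise distance matrix for a list of 1-D arrays."""
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = rmse_overlap(series[i], series[j])
    return D

def classical_mds(D, dims=2):
    """Embed a distance matrix into `dims` dimensions (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dims]      # largest eigenvalues first
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

Any distance-matrix-based method (e.g. UMAP with a precomputed metric) could take the place of classical MDS here; the key point is that the distance computation, not the embedding, is where the missing-data strategy lives.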

The rest of the report is organized as follows. Chapter 2 gives a domain, data and task analysis. Chapter 3 covers the related work on distance measures and visual analytics for time-varying data. Chapter 4 shows the pre-processing of the data. Chapter 5 shows the initial design of NICUdash, which is derived from the abstracted tasks and data. Chapter 6 describes our dimensionality reduction strategy. Then, in chapter 7, we present the results, the evaluation and a small discussion. Finally, in chapter 8 we end with the conclusion and future work.

### Chapter 2

## Project analysis

This chapter covers the clinical challenges and design criteria of NICUdash. The process is inspired by the framework of Munzner [12], which introduces a nested four-layer visualization design process, illustrated by figure 2.1, of which we use the first two layers in this chapter. First, the domain problem is analyzed, covering domain-specific questions and the goals of the users. Next, the data and task abstraction section translates the domain problems and data into an abstracted, generic description.

Figure 2.1: The nested model for visualization design consisting of four layers. The output from the level above is an input for the level below. (Munzner [12])

### 2.1 Domain Problem Analysis

NICUdash will be used by neonatology experts and researchers of a study with an interest in the physiological changes of preterm babies during their maturation or illness. Data is collected through the clinical patient monitoring system installed in the NICU of the Máxima Medisch Centrum (MMC). The monitoring data is time-varying and consists of measured vital signs with timestamps.

Simultaneously, data such as the gestational age and birth weight are manually recorded.

We will be using a sepsis data set from the Máxima Medisch Centrum to test NICUdash. This data set has been studied thoroughly by the users, is also used in the study of Cabrera-Quiros et al. [7], and contains two labels: disease and control group. The subjects in this data set can have up to 45 different vital sign signals; this large number is due to seemingly similar vital signs being measured at different locations of the body or derived from different signals. An example is oxygen saturation, which can be measured at the ankle and in the pre-ductal area. The monitoring data has a frequency of 1 Hz, so the time resolution is in seconds. Therefore, due to the computational overhead of pre-processing, we have chosen to use just five vital signs, parameters that are expected to contain most of the patterns needed to determine sepsis. These parameters are oxygen saturation (SpO2), heart frequency (HF), resonance frequency breathing (RF), temperature (Temp) and mean airway pressure (MnAwP). From the previous work of Cabrera-Quiros et al. [7], sepsis is also reflected in the heart rate variability (HRV) parameter. However, processing a full electrocardiogram (ECG) was not feasible in our project. Although we use this particular data set in this work, NICUdash is developed to operate on similar data sets with multiple or an unknown number of groups of subjects.
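Chapter 4 describes the actual pre-processing; purely as an illustration of why averaging helps with the 1 Hz data (the function name and window choice here are hypothetical, not the NICUdash pipeline), 1 Hz samples can be reduced to per-minute means, shrinking one day of one signal from 86,400 to 1,440 values:

```python
import numpy as np

def minute_averages(samples_1hz):
    """Average 1 Hz samples into non-overlapping 60-sample windows.
    A trailing partial minute is dropped for simplicity."""
    n = len(samples_1hz) // 60
    return np.asarray(samples_1hz[:n * 60]).reshape(n, 60).mean(axis=1)

# Two minutes of (synthetic) heart frequency samples:
hf = np.concatenate([np.full(60, 150.0), np.full(60, 156.0)])
print(minute_averages(hf))  # -> [150. 156.]
```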


The aim of the users is to explore the data and formulate hypotheses that can be used to improve the understanding of the data and to model the prediction of diseases; the users are interested in finding cohorts and analyzing these further. Currently, it takes the users a lot of time to understand the data well enough to define a study. Multiple factors contribute to the difficulty of analyzing and understanding the data.

Vital sign monitoring contains a multitude of these factors. For example, healthy ranges of vital signs are not straightforward; these ranges can depend on the age of a patient, with younger patients having a higher heart frequency than older patients. Furthermore, the experts mention that a low oxygen saturation level in certain weeks could be worrisome, while a high oxygen saturation level in the weeks afterwards could also be alarming. Additionally, which vital signs are being measured indicates the type of baby; babies at the NICU do not necessarily have the same sets of vital signs monitored. Related to this is the wide variety of diseases that occur at the NICU, such as apnea or late-onset sepsis. Each of these diseases has a characteristic development that can be related to the vital sign measurement values. For instance, patients can deteriorate within a small time frame in the disease process of sepsis, while in other disease processes the deterioration is slow and can result in more chronic outcomes such as chronic lung disease [4].

Other factors the neonatal experts mention are that discrete values, such as how long parents were with a patient, could contribute to a patient's health. Finally, the neonatal experts have no tools to interactively explore the structure of the data, and thus cannot study patients in cohorts.

By interviewing the neonatal experts, we derived the following questions that they would like to answer:

1. Can we find groups of patients based on the monitoring data?

2. What patterns can be found in a selected amount of hours or days leading up to an event?

3. How do multiple patient groups which are derived from the monitoring data compare to each other?

4. What parameters can function as early warning signals for a disease or event?

5. What is the development of a vital sign in the NICU stay in the first selected days or weeks for one or multiple patients?

6. How do patients stabilize at the NICU, for example in the first 24 hours?

7. What parameters are measured for a patient or groups of patients and when during the stay at the NICU?

General Use Cases

The neonatal experts distinguish two general use cases:

• Maturation analysis

In this type of analysis the focus is on the patterns of progress after birth or admission to the NICU. This is related to questions 5 and 6.

• Anchor point analysis

In this type of analysis the focus is on discovering patterns around an event, such as a disease occurring. Such an event has a time point, which can serve as an anchor point to align the time series of patients. This is related to questions 2 and 4.

### 2.2 Data type & Task abstraction analysis

In this section we will abstract the data and the questions of the users stated in the domain problem analysis. In doing so, we can relate the data and tasks to work (in other domains) that


| Variable | Time Varying | Type | Description |
|---|---|---|---|
| Patient Number | No | Ordinal | Unique number in group |
| Group | No | Categorical | Unique group |
| Heart frequency (HF) | Yes | Continuous | Value of HF at a timestamp |
| Oxygen saturation (SpO2) | Yes | Continuous | Value of SpO2 at a timestamp |
| Resonance frequency breathing (RF) | Yes | Continuous | Value of RF at a timestamp |
| Temperature (Temp) | Yes | Continuous | Value of Temp at a timestamp |
| Mean airway pressure (MnAwP) | Yes | Continuous | Value of MnAwP at a timestamp |

Table 2.1: Data Abstraction of the monitoring data of a patient

dealt with similar data, tasks or both. As a result, the output of this section can be used as input for the visual encodings and interactions.

### 2.2.1 Data Abstraction

The data consists of monitoring data and non-time-varying data. For the monitoring data, we will use a subset of the vital signs as described in the domain problem analysis above. The non-time-varying data is descriptive data of a patient, such as the birth weight.

Table 2.1 depicts the data abstraction of the monitoring data for a patient, who is uniquely identified by combining a patient number and the group they belong to. The group corresponds to the label of the patient; when data is unlabeled, these two variables will not exist and a unique ID is sufficient to identify a patient. Furthermore, each variable indicated as time varying may, but need not, have a different value at each timestamp. From the data abstraction we can derive that the monitoring data is multivariate time series data. Although in this thesis we use only five time-varying variables, the data abstraction of a subject can have any number of time-varying variables.
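The abstraction above can be sketched in code as a subject record with static descriptors plus an open-ended set of time-varying variables; a hypothetical illustration (field names are ours, not the NICUdash data model):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Patient:
    """One subject: static descriptors plus any number of time-varying
    variables (variable name -> list of (timestamp, value) pairs)."""
    patient_id: str                   # unique ID; replaces (number, group) when data is unlabeled
    group: Optional[str] = None       # label, e.g. "sepsis" or "control"; may be absent
    static: dict = field(default_factory=dict)   # e.g. {"gestational_age": 27}
    signals: dict = field(default_factory=dict)  # e.g. {"HF": [(0, 152.0), (1, 153.0)]}

p = Patient("p1", group="control",
            static={"gestational_age": 27},
            signals={"HF": [(0, 152.0), (1, 153.0)], "SpO2": [(0, 96.0)]})
```

Note that nothing forces the variable sets of two patients to coincide, which is exactly the missing-data situation the distance calculation has to cope with.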

Table 2.2 shows the data abstraction of the non-time-varying data of a patient and contains descriptive information about a patient. The gestational age and birth weight describe the subject at birth, while the crash moment expressed in postnatal days, the crash date and the crash time describe when a certain event happened for the subject.

Thus, the multivariate time-varying data of subjects can be compared by aligning on their respective first timestamps or by aligning on the timestamp of a certain event; these relate to maturation analysis and anchor point analysis respectively.
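The two alignments amount to re-expressing each series' absolute timestamps as time relative to a chosen anchor: the first measurement (maturation analysis) or the event timestamp (anchor point analysis). A minimal sketch, with a hypothetical helper name:

```python
from datetime import datetime

def relative_hours(timestamps, anchor):
    """Re-express absolute timestamps as hours relative to `anchor`.
    Negative values lie before the anchor (e.g. before a crash moment)."""
    return [(t - anchor).total_seconds() / 3600.0 for t in timestamps]

ts = [datetime(2020, 1, 19, 14, 0),
      datetime(2020, 1, 19, 20, 0),
      datetime(2020, 1, 20, 2, 0)]

# Maturation analysis: align on the first measurement.
print(relative_hours(ts, ts[0]))      # [0.0, 6.0, 12.0]

# Anchor point analysis: align on an event (an example crash timestamp).
crash = datetime(2020, 1, 19, 20, 0)
print(relative_hours(ts, crash))      # [-6.0, 0.0, 6.0]
```

After this transformation, ranges such as the [-6, 0] hours before a crash become simple interval filters on the relative time axis.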

### 2.2.2 Task Abstraction

The task abstraction is derived from the questions from the neonatal experts in section 2.1; creating it is crucial to obtain a generic description of the operations and to make visual encoding decisions, as described by Munzner [12].

Task 1 - Discover groups of subjects from multivariate time-varying data

The focus of this task is to show similarities between subjects. Given that each subject has multivariate time-varying data, the neonatal experts are interested in discovering groups of subjects.

Their similarity is based on the similarity of their multivariate time-varying data. This task answers question 1.

To facilitate questions 2, 3, 4, 5 and 6, the similarity calculations can be done with respect to the two types of alignment described in the data abstraction, section 2.2.1.


| Variable | Type | Description |
|---|---|---|
| Patient Number | Ordinal | Unique number in group |
| Group | Categorical | Unique group |
| Crash moment expressed in postnatal days | Continuous | Amount of days of the crash moment after birth |
| Gestational age | Ordinal | The amount of weeks from the first day of the last menstrual period until birth |
| Birth weight | Ordinal | Value indicates the birth weight class |
| Crash date | Date | The date of getting sepsis |
| Crash time | Timestamp | The date and time of getting sepsis |

Table 2.2: Data Abstraction of the non-time varying data of a patient

Task 2 - Explore temporal patterns of discovered groups which are generated from multivariate time-varying data

The neonatal experts want to explore patterns of the discovered groups described in Task 1. These patterns relate to the time-varying behaviour of the variables for each group; found patterns can be related to existing domain knowledge of time-varying patterns of neonates, or serve as new hypotheses about unknown patterns to research. To facilitate exploration and comparison of groups, the focus of this task is to allow viewing the time-varying data of variables for groups of subjects or single subjects, to reason about relations between subjects, such as belonging or not belonging to a group. This task relates to questions 2, 3, 4, 5 and 6.

Task 3 - Identify which variables are measured and when for a single subject or in a comparison between subjects.

The focus of this task is to find when and which variables are measured for subjects, which can give an understanding of the type of the selected subjects. This task relates to question 7.

### 2.3 User input requirements

In this section we define the user input options needed for NICUdash to allow the neonatal experts to define their studies. These input options are based on further analysis of the questions of the neonatal experts:

• Define a time series range

In NICUdash we focus on analysing instances in groups, so at some point the time series of patients have to be compared. The users should therefore have the ability to select a specific time series range on which to perform the analysis. Similarities over time between patients are expected to differ between intervals such as [1, 7] and [1, 14] in day units: patients that are similar in the interval [1, 7] may not be similar in the interval [1, 14], as different developments can happen after the 7th day. Additionally, users can decide to investigate patterns before an event such as a crash moment. For example, a time interval of interest for neonatal experts is often the 6 hours before a sepsis crash, which can be indicated as the range [-6, 0] in hours.

• Select one or multiple variables

Different variables can correlate with, or add noise relative to, for example, a disease. By being able to select any combination of variables for each analysis, users can verify their domain knowledge or learn from unseen results about the relations between variables.

• Define a gestational age range


Different gestational age ranges allow users to filter on specific patients based on their gestational age. For example, selecting a range of [24, 27] leaves out patients that are born relatively later. This can be a valuable input option, as a neonatal expert could choose to analyse only early born babies.
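Together, the three input options amount to filtering subjects and slicing their series; a minimal sketch under our own simplified data layout (function and key names are hypothetical, and relative time is assumed to be precomputed as in the alignment step):

```python
def select_cohort(patients, variables, ga_range, time_range):
    """Filter patients by gestational-age range and slice the chosen
    variables to the requested relative-time window, e.g. hours
    [-6, 0] before a crash, or days [1, 7] after birth.

    `patients`: list of dicts with keys "ga" and "signals", where
    "signals" maps variable name -> list of (rel_time, value) pairs.
    """
    lo_ga, hi_ga = ga_range
    lo_t, hi_t = time_range
    cohort = []
    for p in patients:
        if not (lo_ga <= p["ga"] <= hi_ga):
            continue  # outside the selected gestational-age range
        sliced = {v: [(t, x) for t, x in p["signals"].get(v, [])
                      if lo_t <= t <= hi_t]
                  for v in variables}
        cohort.append({**p, "signals": sliced})
    return cohort

patients = [
    {"id": "p1", "ga": 25, "signals": {"HF": [(-8, 150.0), (-3, 160.0), (1, 155.0)]}},
    {"id": "p2", "ga": 30, "signals": {"HF": [(-5, 140.0)]}},
]
# Early preterms only, restricted to the 6 hours before the anchor event:
print(select_cohort(patients, ["HF"], (24, 27), (-6, 0)))
```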

### Chapter 3

## Related Work

In the previous chapter we described the domain data and challenges. From that we derived a data and task abstraction, which we use in this chapter to relate to similar work in the literature, not necessarily from the same domain. In this chapter we first look into the current state of patient monitoring visualization for neonatal experts. Then we look into ways to visualize univariate and multivariate time-varying data. From there we describe what visual analytics is and why we choose to use it. Finally, we show one example of a visual analytics tool for multivariate time-varying data.

### 3.1 Patient monitoring data visualization

Currently, there exist visualization tools that are typically used in practice at the intensive care unit by neonatal experts; examples of these are shown in figure 3.1. On the left, figure 3.1.A shows the patient vital sign display, which presents real-time vital sign measurements in line graphs;

on the right, figure 3.1.B shows an interface that supports making clinical decisions, not necessarily in real time. Both interfaces can complement each other; however, in practice using these methods can cause cognitive overload and strain, which impacts the quality of care and the safety of patients. Therefore Faiola et al. [9] proposed MIVA 2.0, which reduces clinician cognitive load and decision-making errors, see figure 3.2. The tool combines the systems mentioned before: it shows vital sign measurements in a time resolution of choice, with electronic medical records as supporting information. Users can study each patient's vital signs, compare them to data points in the past to recognize anomalous patterns, and collaboratively contribute to the diagnosis of the patient by placing clinical notes. However, this solution does not support analysing patients in cohorts and exploring their patterns, so patients have to be analyzed one by one.

CHAPTER 3. RELATED WORK

Figure 3.1: Examples of two interfaces often seen in intensive care units, with (A) showing different line plots of the vital signs, and (B) showing an electronic record system (Faiola et al. [9]).

Figure 3.2: MIVA 2.0, a dashboard solution for analysis and interpretation of real-time trends and for communicating clinical work and information [9].

### 3.2 Visualization time-varying data

In this section we describe visualization of univariate time series and how similarity between them is computed.

Then we introduce visualization of multivariate time series, followed by an introduction to visual analytics and an example solution.

From the previous chapter 2, we derived three abstract tasks:

• Task 1 Discover groups of subjects from multivariate time-varying data

• Task 2 Explore temporal patterns of discovered groups which are generated from multivariate time-varying data

• Task 3 Identify which variables are measured and when for a single subject or in a comparison between subjects.

Task 3 is the most trivial task as it merely focuses on showing whether things are measured and does not require complicated processing of the data. On the other hand, tasks 1 and 2 are


complicated to enable, as they focus on discovering cohorts and further analysing and exploring these cohorts; therefore, in this section we focus on relating these tasks to other work in literature. As the tasks and data are abstracted, we are not bound to a domain when looking into other work.

### 3.2.1 Univariate Time Series visualization

Univariate time series data in a simple form are described as a sequence of N observations on a variable. We then define a data set with univariate time series for each instance as Y = {Y_1, Y_2, ..., Y_Q}, with Q the number of instances. For 1 ≤ i ≤ Q, the series Y_i consists of N observations and we define it by Y_i = {Y_it} for t = 1, ..., N.

van Wijk et al. proposed a method to cluster univariate time series data by daily data patterns, showing the average pattern of each cluster and a calendar view [22]. Their aim is to create clusters with similar day patterns, such that day patterns within a cluster are more similar to each other than to day patterns in other clusters. Figure 3.3 shows clusters of days during the year, based on the number of employees present at a company on each day. The calendar view shows that weekend days and holidays such as Christmas belong to the same cluster; this could be common knowledge as these are days off. However, the intention is to drop assumptions and let the analysis tool show what is in the data. The calendar view is supported by a line plot with one line per cluster, where the time-varying values of the days within a cluster are averaged. This supporting view can be used for interpretation of the clusters. Using these two views, we can recognize that Fridays and summer days belong to the same cluster (cluster 722), and that during these days fewer employees are present.

Figure 3.3: Visualization showing clusters of days based on day patterns of employees being present at a company. Clusters are indicated by a color and are visualized on a calendar view, furthermore the line plot shows the average day pattern of each cluster (van Wijk et al. [22]).

Creating these clusters is an iterative process. Firstly, each day pattern is in its own cluster.

Then clusters with similar day patterns are merged; this is repeated until everything is merged into one cluster. This clustering method is a bottom-up hierarchical clustering algorithm.
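The bottom-up merging described above can be sketched with off-the-shelf agglomerative clustering. The day patterns, the linkage method, and the cluster count below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy "day patterns": 6 days x 24 hourly employee counts (synthetic).
rng = np.random.default_rng(0)
busy = 50 + 10 * np.sin(np.linspace(0, np.pi, 24))   # working days
quiet = np.full(24, 5.0)                             # weekend days
days = np.vstack([busy + rng.normal(0, 1, 24) for _ in range(3)] +
                 [quiet + rng.normal(0, 1, 24) for _ in range(3)])

# Bottom-up hierarchical clustering: start with each day in its own
# cluster, then repeatedly merge the two most similar clusters.
Z = linkage(pdist(days), method="average")

# Cut the resulting dendrogram into a chosen number of clusters (here 2,
# where van Wijk et al. selected the seven most significant clusters).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # working days and weekend days fall into separate clusters
```

Cutting the dendrogram at a different `t` mimics the selection of a different number of clusters from the hierarchy.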

To create a more interpretable visualization such as in figure 3.3, a selection of clusters has to be made. In this example van Wijk et al. choose the seven most significant clusters from the hierarchical clustering. Multiple distance measures exist to calculate the similarity between two


series, and the choice of the distance measure is crucial to the outcome of the clusters. van Wijk et al. propose several distance measures that can be used, one of them being the root-mean-square distance. Suppose $Y$ is a collection of univariate time series, and $i \neq j$; then the distance between the univariate time series of instances $Y_i$ and $Y_j$ is:

$$d_{\mathrm{rms}} = \sqrt{\sum_{t=1}^{N} (Y_{it} - Y_{jt})^2 / N}.$$

The researchers found that this measure is robust for their data and goals. They also experimented with normalized values to focus on similar shapes of patterns; the normalized root-mean-square distance is:

$$d_{\mathrm{norm\text{-}rms}} = \sqrt{\sum_{t=1}^{N} (Y_{it}/Y_{i(\max)} - Y_{jt}/Y_{j(\max)})^2 / N}.$$

with $Y_{i(\max)}$ and $Y_{j(\max)}$ the maximum values of the respective univariate time series. Another distance measure that they experimented with uses the maximum value more directly:

$$d_{\max} = |Y_{i(\max)} - Y_{j(\max)}|.$$

This distance measure can be used to create clusters based on similar peak values.
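The three distance measures above translate directly into code; a minimal sketch (the example series are made up for illustration):

```python
import numpy as np

def d_rms(yi, yj):
    """Root-mean-square distance between two equal-length series."""
    return np.sqrt(np.sum((yi - yj) ** 2) / len(yi))

def d_norm_rms(yi, yj):
    """RMS distance after scaling each series by its own maximum."""
    return d_rms(yi / yi.max(), yj / yj.max())

def d_max(yi, yj):
    """Distance based only on the peak values."""
    return abs(yi.max() - yj.max())

yi = np.array([1.0, 2.0, 4.0, 2.0])
yj = np.array([2.0, 4.0, 8.0, 4.0])  # same shape, twice the amplitude

print(d_norm_rms(yi, yj))  # 0.0 -> identical shapes after normalization
print(d_max(yi, yj))       # 4.0 -> the peak values differ
```

The example shows how the choice of measure changes the notion of similarity: the two series are identical under `d_norm_rms` but clearly different under `d_max`.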

We show this work by van Wijk et al. as it is a well-known example of visualizing time-varying data while allowing interaction for the user to explore and interpret the data. Furthermore, some aspects of this work are interesting for our work. The first is how trends in each cluster are shown by an average curve; a similar visualization could be useful to show the temporal patterns of discovered groups (Task 2). Second, this example illustrates that the choice of distance measure is crucial in determining similarity between instances, and affects the groups that can be discovered (Task 1), especially in solutions where the output is directly based on the structure of the data, such as clustering.

Univariate time series distance measures in literature

Serrà et al. distinguish multiple types of distance measures for time series; we will focus on two of these types: lock-step measures (e.g. Euclidean) and elastic measures (e.g. Dynamic Time Warping). For univariate time series, one of the distance measures that is considered superior is Dynamic Time Warping, although the Euclidean distance measure is also considered an accurate, robust, simple, and efficient way to measure the similarity between time series [16]. The root-mean-square distance is a Euclidean distance based measure. In a Euclidean distance measure, the time point $Y_{it}$ of time series $Y_i$ is compared to the time point $Y_{jt}$ of another time series $Y_j$ for each $t$.

Figure 3.4 shows how a lock-step measure aligns two time series. In the example, the two time series curves are similar in shape, but they are not in the same phase. Some of the alignments are a large mismatch, for example matching a local maximum of one time series to a local minimum of the other. Furthermore, the time series differ in length, so not all data points can be aligned.

A solution could be to average the values over each time window; a suggestion of windows is indicated on the right of figure 3.4 by orange boxes. Depending on the application and the data, this smoothing of time series is a viable option. For example, if the same patterns occur every hour but not at the same time, then averaging the data per hour is a good solution, so the effectiveness of this smoothing also depends on the time resolution of the smoothing. The challenge in this approach is finding a balance between smoothing the time series with certain window sizes without losing information, while still taking care of the out-of-phase shapes between time series.

The aim of elastic measures is to take care of misalignment between, for example, two curves, and they can be applied to time series. As mentioned earlier, Dynamic Time Warping is such a distance measure. Figure 3.5 shows how an elastic measure can align two time series that are out of phase and not of the same length. A non-linear mapping between time points is created, which allows for one-to-many mapping between data points of two time series. The downside of the popular elastic measure Dynamic Time Warping is that it is significantly more expensive to compute than


Figure 3.4: Visualization of lock-step alignment of time series. On the left, two time series (red and blue) are displayed; the vertical lines indicate the alignment between the curve points. Although the two time series are quite similar, they are not aligned correctly: some maxima of one time series are aligned to minima of the other. On the right, the same time series are displayed, with two windows per time series indicating how the time series can be split up. The time series can be smoothed by, for example, using the average for each window [3].

the simpler Euclidean distance measures. A combination of smoothing with a certain time resolution and using an elastic measure is one of the many variants to speed up the computation; an example is the FastDTW distance measure, which is an approximate Dynamic Time Warping algorithm [15].
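To make the lock-step versus elastic distinction concrete, here is a minimal textbook Dynamic Time Warping implementation (quadratic time, without the FastDTW approximation); the sine series are illustrative:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain O(n*m) Dynamic Time Warping between two 1-D series.
    The one-to-many alignment tolerates phase shifts and length differences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two sine curves that are out of phase and of different length,
# like the red and blue series in figure 3.5.
t1 = np.sin(np.linspace(0, 2 * np.pi, 50))
t2 = np.sin(np.linspace(0, 2 * np.pi, 70) + 0.5)

print(dtw_distance(t1, t1))  # 0.0: a series matches itself exactly
print(dtw_distance(t1, t2))  # > 0 but small: the warp absorbs the phase shift
```

The nested loop is where the quadratic cost comes from; FastDTW-style approximations restrict which `(i, j)` cells are visited.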

Figure 3.5: Visualization of an example of elastic alignment of two time series. Each data point in one time series is aligned to a data point in the other time series, even though the two time series are not of the same length. Furthermore, the image shows that similar local maxima and local minima are aligned accurately [3].

In our work we deal with multivariate instead of univariate time series; this introduces more complexity in identifying the similarity between instances and eventually visualizing the structure of the data correctly. In section 3.2.2 we elaborate further on visualization of multivariate time series data related to our work.

### 3.2.2 Multivariate Time Series visualization

The difference between univariate and multivariate time series is that a univariate time series has only one time-varying variable, while a multivariate time series has multiple time-varying variables for each instance. As there are multiple variables, variables can correlate with each other; how and whether we have to reason about this is something to take into account.

We then define a data set with multivariate time series for each instance as Y = {Y_1, Y_2, ..., Y_Q}, with Q the number of instances and d the number of variables. For 1 ≤ i ≤ Q, the series Y_i consists of N time observations and we define it by Y_i = {Y_ijt} for t = 1, ..., N; j = 1, ..., d; i = 1, ..., Q.


However, in our case, although the variables are of the type continuous data, the patients do not always have measurements on all vital signs at all N observations. In practice, vital signs are only measured if needed, which also gives information on the type of baby, depending on what is measured and when. Thus, a multivariate time series can contain null values at any time observation for any variable. Figure 3.6 shows how multivariate time series items can be shown in a table format.

Figure 3.6: Example of two multivariate time series (MTS) items for two babies. The visualization shows that baby 1 has three variables measured while baby 2 has two variables measured over time.

Before we dive into the visualization of the structure of multivariate time series data, it is important to be able to define similarities between multivariate time series instances. To recall, the similarity between two instances determines how alike they are and thus how they can be visualized together.

Examples of algorithms that are often used to transform complex data for visualization of hidden structure in the data are clustering and dimensionality reduction algorithms [24]. We put an emphasis on defining a similarity measure, as it gives more control for determining similarity.

Thus, the algorithms that we consider have to allow for choosing the distance measure, either by selecting a distance measure supported by the implementation of the algorithm or by providing a similarity matrix as a precomputed matrix of all pair-wise similarities between instances. In the former, the input is a matrix with a shape of, for example, (instances × variables) or (instances × time), while in the latter the shape is (instances × instances). In both cases the shape is a 2D array; in contrast, a multivariate time series data set can be defined as a 3D array with the shape (instances × variables × time). Therefore, to use these clustering or dimensionality reduction algorithms, the 3D array has to be transformed into a 2D array. First we want to show a couple of popular charts to display multivariate time series data. Figure 3.7 shows two of these graphs, both displaying the same multivariate time-varying data of one instance, in this case a patient.

Figure 3.7(a) shows the star chart; in this graph each axis represents a variable, and the axes are positioned around a circle. Figure 3.7(b) shows a parallel coordinate plot with a parallel axis for each variable. Each blue polygon and each blue line represents a multivariate data point at a time point, for the star chart and the parallel coordinate plot respectively. The downside of these graphs is that only data of one subject is shown. Furthermore, if the number of time points or variables increases, these types of graphs become difficult to interpret.
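The shape manipulations discussed above are easy to see in code. This sketch (with synthetic data) builds the 3D (instances × variables × time) array with null values as NaN, flattens it to a 2D input, and precomputes an (instances × instances) distance matrix; the NaN-aware `nan_dist` helper is our own illustration, not a standard function:

```python
import numpy as np

# Toy MTS data set: 4 instances (babies) x 3 variables x 5 time points.
rng = np.random.default_rng(1)
data = rng.normal(size=(4, 3, 5))
data[1, 2, :] = np.nan          # baby 2: third variable never measured
data[0, 0, 3:] = np.nan         # baby 1: variable 1 stops being measured

# Flatten to the 2D shape (instances x (variables*time)) expected by many
# clustering / dimensionality reduction implementations.
flat = data.reshape(4, -1)
print(flat.shape)  # (4, 15)

# Alternatively, precompute an (instances x instances) matrix, here with a
# NaN-aware Euclidean distance over the shared observations of each pair.
def nan_dist(a, b):
    mask = ~np.isnan(a) & ~np.isnan(b)
    return np.sqrt(np.mean((a[mask] - b[mask]) ** 2))

D = np.array([[nan_dist(flat[i], flat[j]) for j in range(4)] for i in range(4)])
print(D.shape)  # (4, 4)
```

The precomputed-matrix route is the one that keeps control over the distance measure, as argued above.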

Nguyen et al. propose the m-TSNE technique, which can deal with multiple instances of multivariate time series data; they use dimensionality reduction to project the data to a low dimension. This data can then be visualized in a 2D or 3D scatter plot. For this, they create a similarity matrix in the shape of (instances × instances), where the pair-wise similarities between instances are computed with the EROS distance. The EROS distance is based on principal


(a) Star chart of multivariate time series. Each blue polygon is a multivariate data point of the subject at a time point.

(b) A parallel coordinate plot of one subject, where each blue line represents the multivariate data at a time point.

Figure 3.7: Example visualizations of multivariate time-varying data. The charts display the values of five variables from monitoring data of one subject (Nguyen et al. [13]).

component analysis (PCA), which is itself a dimensionality reduction technique [28]. Even though both compute eigenvectors and eigenvalues, the application is different. In PCA the eigenvectors and eigenvalues of the whole data set are computed, in order to compute the principal components [27]. In the case of m-TSNE, the eigenvectors and eigenvalues are computed for the multivariate time series item of each instance. Then, the similarity between two instances is computed as a weighted sum of cosine similarities of the eigenvectors. The argument for using the EROS similarity is that for each instance the multivariate time series are summarized into eigenvectors and eigenvalues.

The researchers imply that the correlation between variables is kept using this technique. The dimensionality reduction technique used is a gradient descent method based on t-SNE, a popular dimensionality reduction technique introduced by van der Maaten [20].
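The EROS idea, eigen-decomposing each instance's covariance and comparing eigenvectors with a weighted cosine similarity, can be sketched as follows. The uniform weight vector and the synthetic items are assumptions for illustration; the actual EROS weights are aggregated from the eigenvalues across the whole data set:

```python
import numpy as np

def eigen_vectors(mts):
    """Eigenvectors of the (variables x variables) covariance of one MTS
    item with shape (time x variables), sorted by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(mts, rowvar=False))
    return vecs[:, np.argsort(vals)[::-1]]

def eros_similarity(a, b, w):
    """Weighted sum of absolute cosine similarities between corresponding
    eigenvectors of two MTS items (sketch of the EROS similarity)."""
    va, vb = eigen_vectors(a), eigen_vectors(b)
    return float(sum(w[l] * abs(va[:, l] @ vb[:, l]) for l in range(len(w))))

rng = np.random.default_rng(2)
scale = np.array([3.0, 1.0, 0.3])           # distinct variances per variable
a = rng.normal(size=(100, 3)) * scale       # 100 time points, 3 variables
b = a + rng.normal(0, 0.01, size=(100, 3))  # near-identical item
c = rng.normal(size=(80, 3)) * scale[::-1]  # different item, other length

w = np.ones(3) / 3                          # uniform weights (assumption)
print(eros_similarity(a, a, w))  # ~1: an item is maximally similar to itself
print(eros_similarity(a, b, w))  # close to 1 for near-identical items
```

Note that `c` has a different number of time points than `a`; the comparison still works, because each item is first summarized into a (variables × variables) eigen-decomposition.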

Figure 3.8 shows the result of m-TSNE on a multivariate data set containing measurements of activity levels of patients undergoing chemotherapy. In this case, an instance is a day of a patient, such that the multivariate time series item for an instance has the shape (variables × time).

For each instance the data is collected per hour for all five variables. The results of m-TSNE look promising: in the evaluation, the professionals could identify three groups in the scatter plot, namely high performance active days, low performance inactive days, and noisy sensor data.

The scatter points are annotated in the format [date number] [steps per day], and the plot shows that most high performance active days are after the second chemo date, and are days where the patient walked a lot.

m-TSNE could also be applied to our data, where the instances are the babies.

However, our data has missing values, such that not all time series are of equal length and multivariate time series items do not always have the same variables. The EROS similarity calculation can deal with different time series lengths; however, it cannot calculate the similarity between instances that have different sets of variables. Additionally, using m-TSNE to project to a low dimensional space alone would not be sufficient to interactively interpret the results. Since the neonatal experts want to interactively explore the structure in the data, we look further into tools that enable the user to explore multivariate time series data.

### 3.2.3 Visual Analytics

From tasks 1 and 2 in section 2.2.2 we obtain that the users want to interactively analyse the complex data; however, their expertise is not in data analysis and information visualization.

Visual analytics is a suitable solution to accommodate these needs, as in visual analytics the user is brought into the loop. The visual analytics process is inspired by Shneiderman's mantra


Figure 3.8: m-TSNE projection for activity levels of cancer patients, with each scatter point as one day. (Nguyen et al. [13])

"Overview first, Filter and zoom, Details on demand", which is well known in the scientific and information visualization field [19]. However, in visual analytics, the difference is that users can also interactively use data analytics algorithms and gain insights via interactively generated visualizations. This can be a looping process, as shown in figure 3.9. Users can visually detect explainable patterns using their expertise, while also discovering the unexpected [8]. In contrast, when users use machine learning solutions, they get less understanding, as a machine learning solution operates like a black box for the user and replicates what the model has learned.

If we relate the tasks in section 2.2.2 to the mantra "Overview first, Filter and zoom, Details on demand", we find that task 1 can refer to Overview first and Filter and zoom, while task 2 can refer to Filter and zoom and Details on demand.

Cui mentions that most visual analytics tools show the data in 2D, such as a 2D scatter plot; however, not all data is already in a 2D format [8]. In exploring a high-dimensional data set, starting with Overview first, a 2D scatter plot is a popular visualization in visual analytics. It can show relations between instances or groups of instances, and discovering these relations depends on where each instance is plotted. For this, clustering or dimensionality reduction algorithms are typically used to process the data and plot the output in a scatter plot [26].

Clustering and dimensionality reduction serve different purposes. In clustering the emphasis is on the relationship within and between clusters, whereas in dimensionality reduction the emphasis is on the relationship on an instance-to-instance level.

Clustering

Clustering requires making assumptions, for example deciding on the number of clusters, or selecting clusters after hierarchical clustering. Some clustering algorithms require specifying the


Figure 3.9: Abstract overview of visual analytics showing the relation between visualization, data analysis and the human. (Cui [8])

minimum size of a neighborhood. In the output, this implies that each instance is assigned to a specific cluster, or in the case of fuzzy clustering can belong to multiple clusters [26][6].

Dimensionality Reduction

In dimensionality reduction, instances are not assigned to a cluster. The algorithm's goal is to make sure that the relation between pairs of instances in the high dimension is reflected in the low-dimensional projection. Thus, when plotted in a scatter plot, the position between instances reflects their relationship.

To choose the appropriate technique, we relate to the task and data abstraction from section 2.2. Based on this we decide to use dimensionality reduction. The reason is that the data sets in the NICU are complex and a lot is unknown, while clustering assigns instances to clusters.

Moreover, the neonatal experts want to find out whether there is structure in the data and, if so, what structure, without making assumptions. Therefore, using clustering would not fit, as it already gives a hard suggestion of the groups in the data. Dimensionality reduction, on the other hand, is more relaxed: users can still manually detect groups based on the positions of instances and explore these groups with other supporting visualization graphs.

Next we will highlight an interesting visual analytics solution for multivariate time series data that uses dimensionality reduction. We will compare and relate the solution to our needs.

Example Visual Analytics solution for Multivariate Time Series data

Fujiwara et al. [10] presented MulTiDR, a visual analytics framework that processes multivariate time series items as a whole, and defines the multivariate time series as a 3D array of (instances × variables × time). They show an example of processing this 3D array via a two-step dimensionality reduction to a 2D array with the shape (instances × time); this 2D array is then used as input to a dimensionality reduction algorithm to project the data to a low dimension of 2D. This process is illustrated in figure 3.11. With this method, similarly to the EROS distance calculation, the multivariate time series items are taken as a whole, so the correlation between variables is not lost. However, when compressing all variables to one dimension there is a risk of losing a large amount of information.
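The two-step idea can be sketched as follows. MulTiDR lets the user pick the reduction method per step; here we use plain PCA for both steps and random data purely to show the shape bookkeeping, which is an assumption rather than Fujiwara et al.'s exact pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, d, t = 20, 4, 30                       # instances x variables x time
X = rng.normal(size=(n, d, t))

# Step 1: compress the variables mode to one value per (instance, time).
# Unfold to (n*t, d), reduce to 1 component, fold back to (n, t).
unfolded = X.transpose(0, 2, 1).reshape(n * t, d)
Y = PCA(n_components=1).fit_transform(unfolded).reshape(n, t)

# Step 2: reduce the resulting (instances x time) matrix to 2D for a
# scatter plot (MulTiDR supports other DR methods here; PCA keeps the
# sketch simple).
Z = PCA(n_components=2).fit_transform(Y)
print(Y.shape, Z.shape)  # (20, 30) (20, 2)
```

Step 1 is where the information loss mentioned above can occur: all `d` variables at each time point are collapsed into a single component score.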

This visual analytics framework consists at its core of the two-step dimensionality reduction


process with its projection plotted in a scatter plot, and feature contribution visualizations. As users explore the scatter plot they can manually select groups of instances. Then, the feature contributions of these selected groups are shown, which indicate what the characteristics of each group are.

As an example, the feature contribution visualizations in figure 3.10.c show that the blue group of instances is characterized by the time period between November and December, and that the variables NO2 and PM2.5 positively contribute to the characteristics of the group. Additionally, depending on the data and the needs of users, MulTiDR can show supporting visualizations, such as a map indicating the location of a selected instance. This gives more domain context to the user.

Related to our work, the downside of the two-step dimensionality reduction is that missing data has to be handled before any analysis is run, as MulTiDR cannot handle missing values. In our case, we want to create a tool that can deal with missing data at run time, as the neonatal experts indicated that missing data is still useful information. For example, whether a certain variable was measured or not gives information about the type of patient, and if there is no measurement, null values can appear.

Figure 3.10: Overview of MulTiDR, where users can explore similar instances and investigate what features and time points are relevant for the selected clusters. (Fujiwara et al. [10])


Figure 3.11: The two-step dimensionality reduction process visualized. This example first performs dimensionality reduction along the variables mode, then dimensionality reduction along the time points. The notations are: T, N, D the lengths of the time points, instances and variables modes respectively. Furthermore, X, X, y, Y, Z are a third-order tensor, matrix, vector, matrix and matrix respectively. (Fujiwara et al. [10])

### Chapter 4

## Data set pre-processing

This chapter elaborates on the pre-processing of the data, which is needed to run the dimensionality reduction and the auxiliary graphs. The data pre-processing is done once. As pre-processing the data is time intensive, due to time constraints we decided to only pre-process a subset of the parameters, which are known to be the most relevant in studying sepsis.

These parameters are oxygen saturation, heart frequency, mean airway pressure, temperature and respiration frequency.

### 4.1 Data averaging

The data is averaged to a resolution of an hour and a day, to decrease the amount of noise in the data, since the resolution of the raw time series is in seconds. Noise is not uncommon and can come from devices used on the patients or from interventions by nurses. In addition, there can be up to 60 · 60 · 24 = 86,400 measurements per day, per parameter per patient; operations on such time series are computationally expensive. Besides the averages, the medians at the same resolutions are also calculated, which adds more flexibility for the user to explore the data.
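With the raw signals indexed by timestamp, the averaging step can be done with standard resampling; the synthetic heart-frequency series below is only for illustration:

```python
import numpy as np
import pandas as pd

# Toy second-resolution heart-frequency signal for one patient (synthetic).
idx = pd.date_range("2020-01-19 19:00", periods=3 * 3600, freq="s")
hf = pd.Series(140 + np.random.default_rng(4).normal(0, 5, len(idx)), index=idx)

# Average (and median) per hour and per day to suppress noise.
hourly_mean = hf.resample("h").mean()
hourly_median = hf.resample("h").median()
daily_mean = hf.resample("D").mean()

print(len(hf), len(hourly_mean), len(daily_mean))  # 10800 3 1
```

Resampling reduces three hours of second-resolution data (10,800 points) to three hourly values, which is what makes the later distance computations tractable.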

### 4.2 Standardization

In component 1 of NICUdash in figure 7.1, users can select one or more parameters. Values of each parameter are of different magnitudes. For example, the heart frequency (HF) can be around 140 while the oxygen saturation (SpO2) is around 90, which makes the data inconsistent for comparison in the context of similarity or distance. Therefore all values of each parameter are standardized using z-score normalization. The z-score calculated for each parameter a is:

$$z = \frac{x - \bar{x}_a}{sd_a},$$

with $\bar{x}_a$ the mean value and $sd_a$ the standard deviation of all measurements of parameter $a$.
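The standardization step is a one-liner per parameter; the numbers below are illustrative, not patient data:

```python
import numpy as np

def z_score(x, mean_a, sd_a):
    """z-score of measurement x for parameter a."""
    return (x - mean_a) / sd_a

# Hypothetical parameter populations (values made up for illustration).
hf = np.array([138.0, 142.0, 150.0, 130.0])    # heart frequency
spo2 = np.array([88.0, 91.0, 95.0, 90.0])      # oxygen saturation

hf_z = z_score(hf, hf.mean(), hf.std())
spo2_z = z_score(spo2, spo2.mean(), spo2.std())

print(hf_z.mean(), hf_z.std())  # mean ~0, std ~1 after standardization
```

After standardization both parameters live on a comparable scale, so a distance between a heart-frequency value and an oxygen-saturation value is no longer dominated by the raw magnitudes.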

### 4.3 Alignment

In chapter 3 we talked about the alignment of the data. Comparison between time series depends on their alignment, so correct alignment is crucial. Following the data abstraction in chapter 2.2.1, we distinguish between two types of alignment: alignment on the crash moment, and alignment on the first timestamp of any measurement for a patient.

Since the sepsis data set is labelled, we distinguish between two groups of patients: group 1 (the control group) are the patients that did not get sepsis, and group 2 (the disease group) are the patients that did get sepsis. Alignment of these time series can be important, as in the hours before disease

CHAPTER 4. DATA SET PRE-PROCESSING

detection it is possible to see pattern changes. For both groups of patients a crash moment is determined. For group 2, the anchor point is based on the moment that a blood culture is taken and antibiotics are started. For patients in group 1, an equivalent crash moment is found based on a matching gestational age [7].

After averaging to hour and day resolution and standardizing the data, we have two types of data sets: one with values per hour and one with values per day. Table 4.1 shows an example of a data file after averaging and standardization, for one patient and one parameter at hour resolution.

| Timestamp | Hour Mean Value | Normalized value |
|---|---|---|
| 19-01-2020 19:00 | 120 | 0.56 |
| 19-01-2020 20:00 | 125 | 0.58 |
| 20-01-2020 01:00 | 122 | 0.57 |
| 20-01-2020 03:00 | 121 | 0.56 |

Table 4.1: Example of a data file of one parameter from one patient. The values are not real.

The timestamp hours are related to a patient's stay in the hospital; however, this format does not allow for comparison between patients. Thus, timestamps are converted to hours relative to the first hour of measurement of a patient. Next, timestamps are also converted to hours relative to the crash moment. Table 4.2 shows the result of the alignment of hours. A similar pre-processing computation was run for the day time resolution.

With the alignment columns added, the user can perform analysis from two perspectives: maturation of the patient at the hospital, and patterns before the crash moment. Relating to the two general use cases described in chapter 2.1, for maturation analysis the time series are based on the First hour aligned column, while for anchor point analysis the time series are based on the Crash aligned column.

| Timestamp | Hour Mean Value | Normalized value | First hour aligned | Crash aligned |
|---|---|---|---|---|
| 19-01-2020 19:00 | 120 | 0.56 | 2 | -1 |
| 19-01-2020 20:00 | 125 | 0.58 | 3 | 0 |
| 20-01-2020 01:00 | 122 | 0.57 | 8 | 5 |
| 20-01-2020 03:00 | 121 | 0.56 | 10 | 7 |

Table 4.2: Table 4.1 after including alignment, with a crash moment of 19-01-2020 20:00.
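The two alignment columns can be derived from the timestamps; a sketch with the (fictional) values of table 4.1. Note that the table starts First hour aligned at 2 rather than 0, suggesting the offset is counted from an earlier reference time (an assumption); here we count from the first measurement itself:

```python
import pandas as pd

# Example data file for one parameter of one patient (values not real).
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["19-01-2020 19:00", "19-01-2020 20:00",
                                 "20-01-2020 01:00", "20-01-2020 03:00"],
                                format="%d-%m-%Y %H:%M"),
    "hour_mean": [120, 125, 122, 121],
    "normalized": [0.56, 0.58, 0.57, 0.56],
})

crash = pd.Timestamp("2020-01-19 20:00")
first = df["timestamp"].min()

# Hours relative to the first measurement and to the crash moment
# (counted from the first measurement itself, see the note above).
df["first_hour_aligned"] = ((df["timestamp"] - first).dt.total_seconds() // 3600).astype(int)
df["crash_aligned"] = ((df["timestamp"] - crash).dt.total_seconds() // 3600).astype(int)

print(df["crash_aligned"].tolist())  # [-1, 0, 5, 7]
```

The Crash aligned column reproduces the values of table 4.2; timestamps before the crash moment become negative hours, matching the [-6, 0] interval convention used elsewhere in the thesis.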

### Chapter 5

## NICUdash design

This chapter is about the translation of the task and data abstraction from chapter 2 to a first sketch.

The visual encoding and interaction are considered together as both are fundamental elements of a visual analytics system. We will describe each component of the dashboard and define the priorities for implementation. We also elaborate on the design process.

Figure 5.1: Sketch 1 of main view of NICUdash with (1) the input interface, (2) the embedding component, (3) the cluster summary component Cluster Aggregated View and (4) the patient details component

The choice of the design of the main view is inspired by Shneiderman's mantra of Overview first, zoom and filter, then details on demand [18], see figure 5.3. Component 2 shows the overview of the data set; component 3 shows the filtered and zoomed part of the overview, using a different representation of the selected data such that patterns can be explored. Finally, component 4 is the details-on-demand overview.

There are 7 components in NICUdash, and the tool consists of 2 pages: a main page and a measurement occurrence view. Switching between the two pages is possible with the switch operator at the top of component 1. We relate the task abstractions from section 2.2 to the components of figures 5.1, 5.2 and 5.3:

CHAPTER 5. NICUDASH DESIGN

Figure 5.2: Sketch 2 of main view of NICUdash with (1) the input interface, (2) the embedding component, (3) the cluster summary component Individual Lines View and (4) the patient details component

Figure 5.3: Sketch of measurement occurrence view of NICUdash with (5) Multiple patients occur- rence count component, (6) Individual patient occurrence calendar component and (7) Multiple patients calendar component

Before we elaborate on the components, we first recall the tasks from chapter 2.2:

• Task 1 Discover groups of subjects from multivariate time-varying data

• Task 2 Explore temporal patterns of discovered groups which are generated from multivariate time-varying data


• Task 3 Identify which variables are measured and when for a single subject or in a comparison between subjects.

Now we continue with explaining the components of the design.

• (1) Input interface component

This component does not address a task directly but is a supporting component for task 1. This component allows the user to make a selection of patients with the gestational age (GA) slider, select any combination of vital sign variables of interest, and define the time range of the time series. Users can select computation and display of the data in days or hours using the mode selector. At the bottom of the component, we show the exclude option, which takes expressions defined by the user to exclude data.

• (2) Embedding component

This component addresses task 1 and is meant to show patients that are similar to each other in close proximity in the scatter plot, based on the settings of component 1. Users can identify groups based on position in the plot and manually select groups. This visualization is inspired by Fujiwara et al. and Nguyen, who also plot the low-dimensional embedding in a scatter plot [10][13].

The dimensionality reduction method projects the multivariate time series data to a lower-dimensional embedding of two variables. The structure in the high-dimensional space is thus reflected in the low-dimensional one, and a 2D scatter plot can visually show the distribution of all instances based on these two variables.
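A minimal sketch of this projection step, assuming a precomputed patient-by-patient distance matrix as input (the toy distances below are made up and do not reflect the distance strategy of the thesis). Classical multidimensional scaling stands in here for whichever dimensionality reduction method is configured:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed n points in k dimensions so that their
    Euclidean distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # keep the top-k components
    scale = np.sqrt(np.maximum(w[idx], 0.0)) # clip small negative eigenvalues
    return V[:, idx] * scale                 # (n, k) scatter-plot coordinates

# Toy distances for four patients: p0/p1 are similar, p2/p3 are similar.
D = np.array([[0., 1., 5., 5.],
              [1., 0., 5., 5.],
              [5., 5., 0., 1.],
              [5., 5., 1., 0.]])
X = classical_mds(D, k=2)   # rows are 2D positions for the scatter plot
```

Similar patients end up close together in `X`, which is exactly the property the embedding component relies on for group identification.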

• (3) Cluster summary component

This component addresses task 2: users select groups of patients of interest from component 2, and this component then gives meaning to the selected groups. As the data is time-varying, we chose a line plot to show the measurements over time. The component can either show a line for each instance or show the summarized pattern of a selected group. A summarized pattern consists of the mean value and the standard deviation plotted at each time point; this spread is visualized by the grey color. This visualization is inspired by van Wijk's work, which also shows an average curve for each cluster [22].

We decided to add the spread over time to give more detail on the time-varying similarities within groups. A selected group can have very similar time series for one variable but less similar time series for another; with the spread, this relationship can be identified.
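The summarized pattern reduces to a per-time-point mean with a one-standard-deviation band. A minimal sketch, assuming the selected group is available as an (n_patients, n_timepoints) array for one variable:

```python
import numpy as np

def group_summary(series):
    """Summarized pattern of a selected group: the per-time-point mean
    plus the band edges one standard deviation above and below it
    (the grey spread in the cluster summary component)."""
    mean = series.mean(axis=0)
    std = series.std(axis=0)          # population std over the group
    return mean, mean - std, mean + std

# Toy group: two patients, one variable, two time points.
group = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
mean, lower, upper = group_summary(group)
```

A narrow band at a time point means the group behaves homogeneously there; a wide band flags time points where the group's similarity breaks down for that variable.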

• (4) Patient details component

This component addresses task 2 and builds on top of components 2 and 3, allowing users to view detailed information on a patient through interaction, such as hovering over component 2, or over component 3 when the latter is in the Individual Lines view type. This allows users to understand the parameters of individual patients.

• (5) Multiple patients occurrence count component

This component shows the number of days that a parameter was measured, based on the settings defined in component 1. It is a supporting view: neonatal experts indicated that the number of days a patient was measured for a certain variable also gives information about the type of patient.

• (6) Individual patient occurrence calendar component

This component shows a calendar view for one patient that displays which parameters are measured on each day.

• (7) Multiple patients calendar component

This component shows a calendar view for a selected parameter across all patients, displaying on which days each patient had that parameter measured.
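The counting behind components 5 to 7 amounts to tallying distinct measurement days per (patient, parameter) pair. A minimal sketch, assuming measurements are available as (patient, parameter, day) records (the record format is an assumption for illustration):

```python
from collections import Counter

def measurement_days(log):
    """Number of distinct days on which each (patient, parameter) pair
    was measured; `log` is an iterable of (patient, parameter, day)."""
    seen = {(p, v, d) for p, v, d in log}      # de-duplicate within a day
    return Counter((p, v) for p, v, _ in seen)

# Toy log: repeated samples on the same day count as one measured day.
log = [("p1", "HR", 0), ("p1", "HR", 0),
       ("p1", "HR", 1), ("p1", "SpO2", 0),
       ("p2", "HR", 2)]
counts = measurement_days(log)
```

The same de-duplicated set also drives the calendar views: component 6 filters it by patient, component 7 by parameter.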


The design of NICUdash was developed iteratively: we first created a sketch based on the project description, which we then presented to the neonatal experts in an online meeting. From this session we received positive feedback, and the neonatal experts elaborated on what they liked and how they would use the tool:

The neonatal experts indicated that they like that a range of gestational age can be selected.

With this they can, for example, focus the analysis on low gestational age patients. They also approved of the option to define a range in the time series, which they would use, for example, to analyse how patients mature in the first 24 hours after birth; they could then switch to an analysis of maturation after seven days. They liked the flexibility to perform a wide range of types of analysis, as well as the fact that the tool allows them to study groups of patients instead of only one. Finally, they mentioned that the option of selecting a wide variety of combinations of variables (indicated as Parameters in the design) is interesting for exploring the influence of vital signs.

One thing we forgot to incorporate in the sketch is the ability to choose how the time series are aligned with each other. This relates to the maturation analysis and the anchor point analysis. We will implement this as a switch in the final implementation.

During the presentation we focused less on components 5, 6, and 7, and the experts seemed more enthusiastic about components 1-4. Therefore, and also due to time constraints, we will not implement components 5, 6, and 7. This implies that we will implement Task 1 and Task 2, but not Task 3.

In the next development phase the functionality of the tool was implemented, and the final implementation was evaluated in a user study with two neonatal experts. The results are presented in chapter 7.