Interactive time series weather data visualization on Jupyter


Bachelor Informatica


Felix Atsma

January 25, 2021

Supervisor(s): Dr. Ji Qi

Informatica, Universiteit van Amsterdam


Abstract

Current software for visualizing meteorological data is often poorly suited to the workflow of meteorologists, which combines analysis in code with visualization. The Jupyter platform with Python can support this workflow, but there is a lack of libraries that can visualize time series weather data and at the same time provide interactivity with the data. This thesis seeks to fill that gap by creating a Python package with those capabilities for the Jupyter platform. Creating such a package involves several choices, such as the interactive map on which the data is displayed and the visualization methods that are used. The package is intended as the start of a larger project and is meant to be expanded with more functionality in the future.


3.5 Interactivity
3.5.1 Animation control
3.5.2 Scientific functions
3.5.3 Miscellaneous controls
4 Implementation
4.1 Data
4.1.1 Format
4.1.2 Data reading
4.2 Interactive map
4.2.1 GeoViews / Bokeh
4.2.2 Ipyleaflet
4.2.3 Folium / Mapbox GL
4.3 Visualization
4.3.1 Temperature
4.3.2 Wind speed
4.4 Interactivity
4.4.1 Animation control
4.4.2 Selections
4.4.3 Miscellaneous controls
5 Experiments
5.1 Interactive map
5.1.1 GeoViews
5.2 Visualization
5.2.1 Performance
5.2.2 Correctness
5.3 Buffering
6 Discussion
6.1 Contributions
6.2 Limitations
6.2.1 Correctness Verification
6.2.2 Performance
6.2.3 Additional Experiments
6.3 Ethical Aspects
7 Conclusions


CHAPTER 1

Introduction

Visualization is an important aspect of any field, whether scientific or commercial, that deals with large amounts of data. In meteorology, visualizations are especially crucial, because the data is spatial and relates to the real world. Many people see meteorological visualizations on a regular basis, for example when checking the radar to see whether it is going to rain in the next few hours, or when watching the forecast on the news. Huge amounts of weather data are collected daily by many weather stations and research institutes, and there are many ways to process and visualize this data, depending on what is to be analyzed or displayed. Meteorologists and other scientists interested in weather often require the latest data, and need to be able to analyze it quickly and easily.

There is a range of software available for visualizing weather data, from command-line tools like GrADS¹ or GEMPAK² to full graphical software packages like Panoply³. Other options are web services like windy.com. These traditional approaches can provide effective visualizations but are often unsuitable for the workflow of meteorologists, which typically includes analysis using code combined with visualization.

The Jupyter platform is popular with meteorologists and data scientists [1]. It supports many programming languages, such as R, Julia, Ruby, and the most commonly used one, Python. It allows any visualization created with Python to be displayed next to the code. In addition, because the environment runs in a web browser, it can use JavaScript to add dynamic updates and interactivity to what is displayed. Jupyter has a set of official and community-developed widgets for this purpose, and it is possible to create custom widgets.

The Python programming language is used by many scientists for data analysis and visualization due to its ease of use and flexibility [2, 3]. With Python, however, visualizing meteorological data is usually done by creating static images or animations using a visualization library like matplotlib [4], aided by packages such as cartopy to provide maps and projections, or with more specific toolkits like Py-ART for working with radar data [5].

Jupyter allows for interactivity with widgets, and packages like ipyleaflet allow for interactive maps. Interactivity with the visualization can aid meteorological analysis by enabling the user to gather information from the visualization without needing to write code or manipulate the raw data. This can save time and effort, especially if the user does not have much experience with Python or programming in general.

At the moment, no package that provides an interactive map also allows easy display of time-series data or interaction with the data itself. This thesis attempts to answer the question of how best to create a Python package for the Jupyter platform that allows for visualization of time-series wind speed and temperature data. The produced charts will be interactive, and a user will be able to select data by selecting a region on a map. The package developed in this thesis is meant to be part of a larger project, which will be expanded in the future with support for more data sources and scientific functions.

¹ http://cola.gmu.edu/grads/

² https://www.unidata.ucar.edu/software/gempak/

³ https://www.giss.nasa.gov/tools/panoply/


CHAPTER 2

Theoretical Background

2.1 Geographical Information Systems

Geographic Information Systems (GIS) are systems for the capture, analysis, and display of geospatial data. Geospatial data describe both the location and the attributes of spatial features [6]. GIS can be used to manage data or gain new insights from it. Examples of applications are monitoring forest fires, analyzing historical climate change, visualizing crime distribution, and forecasting the weather.

2.1.1 GIS Data

GIS data generally follows one of two models: raster data or vector data. A raster data model represents the world as a set of cells arranged in a regular grid. Each cell, or pixel, in the grid is of equal size and represents a physical area, typically square. Raster data is often the standard way of representing continuous phenomena or features, such as temperature, air pressure, elevation, or precipitation. This data usually has 2 to 4 dimensions. The data covers a certain area, which can be 2D, or 3D if multiple elevations are viewed at the same time. Besides the spatial dimensions, data can be taken at multiple points in time, which adds another dimension [7]. Vector data is data represented by points, lines or polygons. It is useful for representing physical and geographical features, like borders and roads [8]. The work in this project focuses on visualizing raster data.

2.2 Wind Speed and Temperature Data

In this thesis, wind speed and temperature data are used as data types. These are both examples of raster data. They both have a latitude, longitude, and time dimension. The difference between the two is that temperature has one data variable, just the temperature value in kelvin, while wind speed data consists of two variables, the eastward component and northward component, often denoted as ’u’ and ’v’, respectively. Both wind speed components are measured in m/s.
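As a small illustration of how the two wind components relate to the quantities usually shown in a visualization, the sketch below derives the wind speed and the direction of flow from u and v. The function names are illustrative and are not part of the package described in this thesis.

```python
import math


def wind_speed(u: float, v: float) -> float:
    """Magnitude of the horizontal wind in m/s."""
    return math.hypot(u, v)


def wind_direction_to(u: float, v: float) -> float:
    """Compass bearing (degrees clockwise from north) the wind blows towards."""
    # atan2(u, v) measures the angle from the northward axis towards the east.
    return math.degrees(math.atan2(u, v)) % 360.0
```

Meteorological reports usually state the direction the wind comes from, which would be this bearing plus 180 degrees (modulo 360).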

2.3 ERA5 Dataset

The data used in this thesis comes from the ERA5 dataset, which is provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) under the Copernicus programme. ERA5 is a reanalysis project, which means it combines historical measurements and observations to provide a consistent view of atmospheric data and climate change.

This dataset is eventually meant to include global atmospheric data from January 1950; for now it covers 1979 onward. Data variables include temperature, wind speed, humidity, cloud cover and more. These variables are available at 137 vertical levels. The spatial resolution is 31 km, so the ground distance between neighboring data points is about 31 km in both the north-south and east-west directions. The temporal resolution is 1 hour, so data is available for each hour from January 1979 onwards [9].

2.4 Meteorological Visualization

Visualization is an important tool for meteorologists. It is used to analyze data, make decisions and communicate forecasts and research results [10]. Meteorological visualization has a long history; see, e.g., Galton [11]. Visualization has become more complicated in recent times because of an increase in the amount and diversity of available data. Computing power has also increased, allowing for more complex simulations and models, which further increases the difficulty of analysis. Monmonier [12] describes the history of weather mapping. Rautenhaus et al. [10] give a comprehensive overview of current visualization techniques used in meteorology.


CHAPTER 3

Design

3.1 Package structure

To keep track of potentially multiple datasets and their attributes along with the map itself, an object-based approach will be used for this package. The map object created by the interactive map library will be contained within a custom map object. This map object will also contain the layers the user adds. These layers are custom objects as well; each will contain a dataset and keep track of which frame is being shown. The map object has methods to add a layer with either regular raster data (temperature data) or wind data. The map will create a layer object and store it in a list. Another module of the package will contain the functions needed to process the data for visualization.

When a user wants to visualize a dataset, they first create a map object. The map object has methods to create and display a visualization layer. The user passes the dataset to such a method, along with optional arguments to alter the visualization. When the method is called, a layer object is created which reads the data and prepares it for visualization. This layer is added to the basemap, along with several control widgets.
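The structure described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the class names `WeatherMap` and `Layer`, the method names and the `n_frames` argument are all hypothetical, and the real map object of the interactive map library is represented only by a placeholder attribute.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Layer:
    """Hypothetical layer: holds one dataset and the frame currently shown."""
    dataset: Any
    kind: str          # "raster" (e.g. temperature) or "wind"
    n_frames: int
    frame: int = 0

    def show_frame(self, index: int) -> int:
        # Clamp the index so that values outside the time range are harmless.
        self.frame = max(0, min(index, self.n_frames - 1))
        return self.frame


@dataclass
class WeatherMap:
    """Hypothetical wrapper around the map object of the map library."""
    basemap: Any = None
    layers: List[Layer] = field(default_factory=list)

    def add_raster_layer(self, dataset: Any, n_frames: int) -> Layer:
        return self._add(Layer(dataset, "raster", n_frames))

    def add_wind_layer(self, dataset: Any, n_frames: int) -> Layer:
        return self._add(Layer(dataset, "wind", n_frames))

    def _add(self, layer: Layer) -> Layer:
        self.layers.append(layer)
        return layer
```

Keeping the layers in a list on the map object means that controls such as an opacity slider or a frame slider can be attached to one layer without affecting the others.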

(a) Map object structure (b) Layer object structure

Figure 3.1: Chart displaying the general structure of the map and layer objects which together form the package developed in this thesis.

3.2 Data preparation

3.2.1 Dataset input

Before any data can be visualized, it first needs to be accessed. This can be done in two different ways. The first is to provide a filename and extract the data automatically. While this is convenient for the user, it requires the application to be tailored to a specific file type (which would be netCDF in this case). The other option is to let the user open the file, extract the data themselves, and give the raw data as input.

Because meteorologists do not necessarily have much experience with programming, the aim is to streamline the process of creating their visualizations. For this reason the user will only need to provide the file name and the names of the variables within the file. For now, the only supported file type will be netCDF, but this can be expanded in the future.

3.2.2 Reprojection and processing

Because the earth is approximately a sphere, and a sphere cannot be flattened without distortion or discontinuities in the map, every map represents the earth according to a certain projection. The most common projection for interactive maps is the Mercator projection. The Mercator projection keeps lines of latitude and longitude straight, but stretches the map in the north-south direction increasingly the farther away from the equator [13]. Because the raster data lies on a grid whose cells have an equal (square) size in degrees of latitude, this stretching needs to be corrected for. This can be done by reprojecting the data.
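To make the effect of the projection concrete, the sketch below uses the standard spherical Mercator formula y = ln(tan(π/4 + φ/2)) and resamples the rows of a grid that is evenly spaced in latitude onto rows that are evenly spaced in Mercator y. It is a nearest-neighbour sketch with illustrative function names, written under the assumption that rows are ordered from north to south; it is not necessarily the reprojection method used by the package.

```python
import math


def mercator_y(lat_deg: float) -> float:
    """Mercator y coordinate (unit sphere) for a latitude in degrees."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))


def inverse_mercator_y(y: float) -> float:
    """Latitude in degrees for a Mercator y coordinate (unit sphere)."""
    return math.degrees(2 * math.atan(math.exp(y)) - math.pi / 2)


def reproject_rows(rows, lat_north, lat_south):
    """Resample rows evenly spaced in latitude (north to south) onto rows
    evenly spaced in Mercator y, using the nearest source row."""
    n = len(rows)
    y_north, y_south = mercator_y(lat_north), mercator_y(lat_south)
    out = []
    for i in range(n):
        # Mercator y at the centre of output row i.
        y = y_north + (i + 0.5) / n * (y_south - y_north)
        lat = inverse_mercator_y(y)
        # Index of the source row whose latitude band contains this latitude.
        src = int((lat_north - lat) / (lat_north - lat_south) * n)
        out.append(rows[min(max(src, 0), n - 1)])
    return out
```

For an area spanning latitudes 0 to 62, the middle row of the reprojected image comes from a latitude north of 31 degrees, which reflects how the projection gives more vertical space to cells far from the equator.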

The temperature data can be represented as an image directly after reprojection, but the wind speed data cannot be visualized the same way. Creating the wind visualization will take additional time. This difference will be expanded upon in a later chapter.

3.2.3 Buffering

Reprojecting and processing a frame of data takes time. The longer it takes to create a frame, the slower the animation will have to be, which is not desirable. If the time it takes to process an image is too long to create a smooth animation, it may be necessary to use buffering to be able to animate the data anyway. This means that frames will be processed ahead of time and stored so the only latency remaining is the time it takes to display the image onto the map.

3.3 Interactive map

The base requirements for an interactive map are being able to pan around the map, zoom in and out, and have a detailed basemap at every zoom level. If the basemap were a static image, it would either be blurry when zoomed in, or details such as names would be too small when zoomed out. To solve this, tile layer maps are used, which provide different images at different zoom levels. The map should support changing this tile basemap, because different basemaps can be used for different purposes. It also needs to be possible to display raster data on top of the map. Visualizations should be handled as layers, to allow for multiple visualizations which can be controlled separately.

3.4 Data visualization

3.4.1 Temperature data

Temperature data is a typical example of raster data, so the way of visualizing the temperature data will be the standard way of handling raster data. Because raster data are values of a single attribute on a regular grid, the standard way of displaying it is as an image, with the values determining the color of the pixels according to a given colormap. The colormap can be specified by the user. An example of temperature visualization is in Figure 3.2.
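The mapping from values to pixel colors can be sketched as follows. This example assumes a simple two-color linear colormap (blue to red) and illustrative function names; in practice a colormap from a plotting library would more likely be used.

```python
def normalize(value, vmin, vmax):
    """Map a value to the range [0, 1], clamping values outside [vmin, vmax]."""
    if vmax == vmin:
        return 0.0
    return min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))


def linear_colormap(t, low=(0, 0, 255), high=(255, 0, 0)):
    """Interpolate linearly between two RGB colours for t in [0, 1]."""
    return tuple(round(a + t * (b - a)) for a, b in zip(low, high))


def raster_to_rgb(grid, vmin, vmax):
    """Turn a 2D grid of values into a 2D grid of RGB pixels."""
    return [[linear_colormap(normalize(v, vmin, vmax)) for v in row]
            for row in grid]
```

Choosing `vmin` and `vmax` once for the whole time series, rather than per frame, keeps the colors comparable between frames of an animation.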

3.4.2 Wind speed data

Wind data is a special case of raster data, because it requires at least two variables to describe properly, which makes it hard to display in the same way as regular raster data. There are several ways to display wind. Common ways of visualizing wind are representing the data points as vectors or tracing the path of the wind flow [14]. Examples can be seen in Figure 3.3.


Figure 3.2: Example of a visualization of temperature over Europe and Africa using a linear scale colormap. The scale is in degrees Celsius. Land outlines are displayed in black.

Flow visualization

Wind is the flow of air, so it can be visualized using flow visualization techniques. The most common form of flow visualization for wind is using streamlines. A streamline is a curve which follows the direction of the local velocity at every point along its path. It shows the path a theoretical massless particle would take in the flow. The lines drawn can be colored to show the magnitude of the local velocity. Traditionally streamlines were drawn as static lines (Figure 3.3a), but more recently, especially on the web, a popular way to visualize them is by displaying simulated particles following the streamlines (Figure 3.3b).

Vector field

A vector field is a way of describing a flow by representing the direction, and optionally the magnitude, of the flow at a certain point with a vector. These vectors are normally represented with arrows in a grid pattern. The direction of an arrow represents the direction of the wind, and the size of the arrow the wind strength (Figure 3.3c).

3.5 Interactivity

3.5.1 Animation control

A common way of displaying time-series data is by animating the data over time. This makes it easier to observe the system dynamics over time compared to another common method, which is simply showing a sequence of frames. That method is more useful for comparing data between time frames. In meteorology, usually the information of interest is how a system changes. A basic form of interaction is providing controls for the animation. Included should be controls to play and pause the animation, a way to control the speed of the animation, and a timeline with which the user can select a frame manually.

3.5.2 Scientific functions

A user could interact with the data using various functions to aid analysis. Examples are filtering certain values or value ranges, or masking an area on the map. Functions like these can eventually be added to the program, but for this project the function that will be implemented is selecting a region on the map and extracting any data within that region. The selection tool should allow drawing rectangles, circles, and arbitrary polygons. It should be possible to retrieve the coordinates that form the drawn regions, which can then be used to retrieve data from the dataset. The user receives this selected data as an array which they can analyze further.

(a) Streamline visualization of wind. Generated using matplotlib’s streamplot function.

(b) Screenshot of animated streamline visualization from windfinder.com. The white particles flow with the direction of the wind.

(c) Vector field visualization of wind. Generated using matplotlib’s quiver function.

Figure 3.3: Three types of wind visualizations.

3.5.3 Miscellaneous controls

Some more controls can be added to improve the usability of the tool. Because the data visualizations are displayed on top of the basemap, they could obscure the map if they are not transparent. The ideal opacity value can differ with different basemaps, multiple visualizations or just user preference. Therefore a method to control this opacity dynamically would be useful. If a user is visualizing multiple datasets at the same time, the layers can overlap. Being able to control which layers are visible can help if one layer is obscuring a part of another layer of interest.


CHAPTER 4

Implementation

4.1 Data

4.1.1 Format

The data used in this project is in the netCDF format, which is the format used by the ERA5 dataset. NetCDF, or Network Common Data Form, is a data format made for storing array-based multidimensional data [15]. NetCDF data is self-describing: a netCDF dataset includes information defining the data it contains. Because of this, it is not dependent on any specific application. A netCDF dataset contains dimensions, variables and their attributes. For example, the temperature data used in this project has latitude, longitude, and time as dimensions. The variable is the temperature value, and the attributes include units such as kelvin for temperature, degrees for the longitude and latitude, and hours since a specific time for the time dimension.

4.1.2 Data reading

There are a few libraries with the capability to read netCDF files. The most prominent are netcdf4¹ and GDAL². Other libraries like xarray³ or rasterio⁴ can also open netCDF files, but actually use packages like netcdf4 or GDAL as a backend for opening them. GDAL is a powerful library for working with geographic data. When reading the data, it reads the entire file into RAM, which can amount to many gigabytes for large datasets. Netcdf4 is a straightforward package for reading and writing netCDF files. It only reads data into RAM when the data is actually used, which sacrifices a bit of performance for the ability to load large datasets without filling the RAM. Because of this capability to use large datasets without being restricted by RAM usage, netcdf4 will be used.

4.2 Interactive map

There are several libraries which are able to display an interactive map according to the requirements set earlier. The choice of map is important because the rest of the package will be built around this map, and it thus influences all other design decisions. Important factors in deciding which library to use are performance and compatibility: the map should be created quickly and display data without large delays, and it should be compatible with the Jupyter platform. In this section, several libraries will be compared to determine which will be used.

¹ https://unidata.github.io/netcdf4-python/netCDF4/index.html

² https://unidata.github.io/netcdf4-python/netCDF4/index.html

³ http://xarray.pydata.org/en/stable/


4.2.1 GeoViews / Bokeh

GeoViews is a package extending HoloViews with geographical features. HoloViews does not have its own plotting capability but uses a backend like matplotlib or Bokeh, and aims to make it easy to display labeled datasets. Bokeh has broader interactive capabilities than matplotlib, so it is used as the backend in this section. Using GeoViews, it is easy to create an interactive plot showing a map with the temperature data on top: this takes just 4 lines of code (excluding imports). The data is projected automatically and a slider is provided to select the time frame. However, creating the map seems to take longer the more time frames there are. During testing it became clear that creating a map from a full dataset would take too long to be practical, as can be seen in Section 5.1, so GeoViews will not be used for this project.

4.2.2 Ipyleaflet

Ipyleaflet provides Python bindings to Leaflet, an interactive JavaScript mapping library, and is specifically made for the Jupyter platform. Besides the map, it can display several Control elements, such as a drawing tool or a box which allows the user to hide certain layers. An important Control is the WidgetControl, which allows any Jupyter widget to be placed over the map. These widgets can interact with other Controls and map layers. Ipyleaflet is already used in the European project JEODPP, where it displays data on an interactive platform [16]. Because it supports Jupyter widgets and does not show the immediate performance problems that GeoViews suffers from, ipyleaflet will be used as the interactive map in this project.

4.2.3 Folium / Mapbox GL

Folium uses Leaflet as backend, like ipyleaflet, but is not specifically made for Jupyter. It implements a few more Leaflet features and extensions. However, none of those are as useful as the native Jupyter widget support provided by ipyleaflet.

Mapbox GL is another JavaScript mapping library. It uses WebGL for rendering, which its developers claim improves performance. However, like Folium, it does not have native support for Jupyter widgets. Besides that, it requires an access token, for which the user needs to register on the Mapbox website.

4.3 Visualization

4.3.1 Temperature

The temperature visualization is relatively straightforward. Reprojecting the data results in a 2D array, but ipyleaflet does not support rendering raster data directly from an array. Displaying an image is done using an ImageOverlay layer, which requires a URL as data source. A URL usually refers to a resource on the internet, but it can also refer to a local file or be a 'data URL'. A data URL embeds the file contents in the URL itself, using base64 to represent binary data [17]. To avoid having to save the reprojected data as a file on disk, it is therefore converted to an image and encoded in base64, so that it can be used to create an ImageOverlay layer which is displayed on the map.
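The data URL approach can be illustrated with the standard library alone. The sketch below writes a minimal uncompressed-filter PNG from RGBA pixels and wraps it in a base64 data URL. The hand-written PNG encoder is an assumption made to keep the example self-contained; an imaging library would normally do this step, and the function names are illustrative.

```python
import base64
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """One PNG chunk: length, type, data and CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def rgba_to_png(rows) -> bytes:
    """Encode rows of (r, g, b, a) tuples as an 8-bit RGBA PNG."""
    height, width = len(rows), len(rows[0])
    # Every scanline starts with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(c for px in row for c in px) for row in rows)
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


def png_to_data_url(png: bytes) -> str:
    """Wrap PNG bytes in a base64 data URL."""
    return "data:image/png;base64," + base64.b64encode(png).decode("ascii")
```

The resulting string can then be passed as the `url` of an image layer, together with the geographical bounds of the data, so that no temporary file is needed.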

4.3.2 Wind speed

Flow visualization

A common type of flow visualization is the streamline plot. There are two ways of using streamlines. The traditional method is drawing lines along the direction of the flow velocity at every point in the field. This produces a static image filled with lines. In Python this can be created using matplotlib’s streamplot function. A newer way is simulating particles flowing along these streamlines. Ipyleaflet has a layer type called Velocity, which allows such a visualization (Figure 4.1a). Although it looks nice and can provide an intuitive view of wind, it is meant to display the flow for one measurement. When the wind data is updated, it stops and restarts the simulation. That makes it difficult to see changes over time, which is an important part of visualizing time-series data. Creating a static streamplot also has issues when animating multiple frames: small changes in the data can cause the produced streamlines to appear to move erratically, which also makes it harder to see changes over time.

Vector field

Another way of representing wind is using vector fields. These are visualized by displaying arrows at various points in the field according to the local speed and direction. These arrows are arranged in a regular grid. In Python this is normally done with another matplotlib function, quiver (Figure 4.1b), which creates a plot with such a vector field visualization. Plots created by matplotlib normally include axes and a border around them, but these are not desired here: the arrows need to be as close to their geographical location as possible, which would be hard to accomplish with additional space around the plot. Once the plot consists of just the arrows, it can be saved as an image and displayed in the same way as the temperature data. Displaying the arrows as an image results in the arrows becoming pixelated when zoomed in far enough (Figure 4.2a).

Another way of creating a vector field visualization is using vector data (Figure 4.1c). Vector data (not to be confused with vector fields) consists of points, lines or polygons. Using lines, it is possible to use vector data to display arrows instead of describing a geographical feature. Because vector data is described using coordinates, this removes the need to create an image and display that. A simple way of creating vector data is with GeoJSON, a format based on JSON that supports the mentioned vector features and collections of multiple features. Ipyleaflet has a layer for displaying GeoJSON, which makes it easy to display the features once they are calculated. Because ipyleaflet draws these arrows itself instead of displaying an image of arrows, they will not be pixelated no matter how far the map is zoomed in (Figure 4.2b). The lines that form these arrows need to be calculated manually. The regular way of calculating a frame would be looping through all data cells and calculating the arrows one by one. To increase performance, multiprocessing can be used, so that the arrow calculations are spread over separate processes.
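A possible way of calculating such arrows is sketched below. Each arrow is a single GeoJSON LineString that runs from the data point to the tip and then traces the two barbs of the arrow head. The `scale` and `head` parameters and the function names are illustrative assumptions, and the arrow is constructed directly in degrees without correcting for the projection, so this is a simplification rather than the exact calculation used in the package.

```python
import json
import math


def arrow_feature(lat, lon, u, v, scale=0.1, head=0.3):
    """Build a GeoJSON LineString shaped like an arrow for one wind vector.

    scale converts m/s to degrees; head is the barb length as a fraction
    of the shaft length. Both are illustrative values.
    """
    # GeoJSON positions are ordered [longitude, latitude].
    tip = (lon + u * scale, lat + v * scale)
    angle = math.atan2(v, u)
    barb_length = math.hypot(u, v) * scale * head
    barbs = []
    for offset in (math.radians(150), math.radians(-150)):
        barbs.append((tip[0] + barb_length * math.cos(angle + offset),
                      tip[1] + barb_length * math.sin(angle + offset)))
    coords = [[lon, lat], list(tip), list(barbs[0]), list(tip), list(barbs[1])]
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": coords},
        "properties": {"speed": math.hypot(u, v)},
    }


def arrows_to_geojson(points):
    """Collect arrows for an iterable of (lat, lon, u, v) tuples."""
    return {"type": "FeatureCollection",
            "features": [arrow_feature(*p) for p in points]}
```

Storing the wind speed in `properties` would allow the arrows to be styled by speed later on, for instance with a color per speed range.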

In Chapter 5, validation and performance testing of the visualizations is performed. Because the quiver plot is significantly inaccurate in terms of position, and because it is not significantly faster than calculating arrows using vector data, the default method of visualizing wind speed data will be using vector data to display vector field arrows. The quiver plot visualization can still be used if the user does not require as much accuracy but wants to display a visualization with a high arrow density, as the vector data plot is much slower at lower strides.

Caching / Buffering

To prevent the data from having to be reprojected, or the arrows from having to be drawn, every time a new frame is shown, the calculated frames can be cached. With the calculation time removed, the only latency remaining is the time it takes to render the frame on the map.

A basic implementation is caching frames once they have been visited and calculated. This improves performance if the user wants to view the same frames multiple times, but otherwise hardly makes any difference. For the user to notice any improvement when, for example, watching an animation of their data, the frames need to be buffered ahead of time.
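The difference between caching visited frames and buffering ahead can be sketched as follows. The class below keeps rendered frames in a least-recently-used cache and offers a `prefetch` method that renders the next few frames before they are requested. The class name, the default sizes and the synchronous prefetch are assumptions for illustration; an actual implementation would probably run the prefetch in the background.

```python
from collections import OrderedDict


class FrameBuffer:
    """Cache of rendered frames with a simple look-ahead.

    render is any function that maps a frame index to a displayable frame.
    """

    def __init__(self, render, n_frames, lookahead=10, capacity=100):
        self.render = render
        self.n_frames = n_frames
        self.lookahead = lookahead
        self.capacity = capacity
        self.cache = OrderedDict()

    def _ensure(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)       # mark as recently used
        else:
            self.cache[index] = self.render(index)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # drop least recently used

    def prefetch(self, start):
        """Render the frames an animation starting at start needs next."""
        for index in range(start, min(start + self.lookahead, self.n_frames)):
            self._ensure(index)

    def get(self, index):
        """Return a frame, rendering it only if it is not cached yet."""
        self._ensure(index)
        return self.cache[index]
```

The `capacity` limit bounds memory use, which matters because a dataset can contain thousands of frames.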

4.4 Interactivity

4.4.1 Animation control

(a) Ipyleaflet’s Velocity (b) Matplotlib’s quiver

(c) Calculated GeoJSON arrows

Figure 4.1: Three options for visualizing wind data.

Jupyter maintains a collection of widgets in the package ipywidgets. Among these is a Play widget, which contains play, pause, stop/reset and loop buttons. It has an integer as value, and when the user presses the play button, it increases its value at a speed determined by an interval value. Another widget is the IntSlider. This slider is used to control the currently visible frame of the data. The value of the Play widget is linked to the value of the IntSlider, so pressing play moves the IntSlider and changes the visible frame. To control the animation speed, a textbox which accepts integers is placed next to the Play widget. Every time the user changes the animation speed (in frames per second), the interval value of the Play widget is updated accordingly.
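The conversion from the frame rate entered by the user to the interval of the Play widget, which is expressed in milliseconds, is a small calculation. A minimal sketch, with an illustrative function name and an assumed lower bound on the interval:

```python
def fps_to_interval_ms(fps: int, min_interval_ms: int = 1) -> int:
    """Interval between frames in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return max(min_interval_ms, round(1000 / fps))
```

Rejecting non-positive values guards against a division by zero when the textbox is cleared or set to 0.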

4.4.2 Selections

Ipyleaflet has a Control type for drawing figures called DrawControl. DrawControl allows drawing lines, rectangles, circles, and arbitrary polygons. Once a shape is drawn, a callback function is executed with information about the shape given as argument, including the coordinates of the vertices of a polygon or rectangle, or, in the case of a circle, the center coordinate and its radius. These coordinates can be used to extract data, as every data point has a corresponding coordinate.
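How the drawn coordinates can be turned into a data selection is sketched below for rectangles and polygons. The functions take the coordinates directly, so the sketch is independent of the drawing tool; the function names and the nested-list representation of the grid are illustrative assumptions. The polygon test uses the standard ray casting technique.

```python
def select_rectangle(lats, lons, grid, south, west, north, east):
    """Return the sub-grid of cells whose coordinates fall in the rectangle.

    lats and lons hold the coordinate of each row and column of grid.
    """
    rows = [i for i, lat in enumerate(lats) if south <= lat <= north]
    cols = [j for j, lon in enumerate(lons) if west <= lon <= east]
    return [[grid[i][j] for j in cols] for i in rows]


def point_in_polygon(lat, lon, polygon):
    """Ray casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    for k in range(len(polygon)):
        lat1, lon1 = polygon[k]
        lat2, lon2 = polygon[k - 1]
        # Only edges that straddle the latitude of the point can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside
```

For a polygon selection, the rectangle function could first cut out the bounding box of the polygon, after which `point_in_polygon` only has to be evaluated for the cells inside that box.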

4.4.3 Miscellaneous controls

Ipywidgets has several useful widgets, such as a FloatSlider for controlling opacity. Ipyleaflet has a Control type, LayersControl, that allows the user to select and deselect layers to toggle their visibility.


(a) Matplotlib’s quiver (b) Calculated GeoJSON arrows

Figure 4.2: Detail of the vector field wind visualization methods. Pixelation is visible when using the quiver method, if zoomed in far enough. This is because the plot is converted to an image before being displayed. Using vector data with GeoJSON does not suffer from this issue because the arrows are drawn by the map itself.


CHAPTER 5

Experiments

Several experiments are performed. The performance of the temperature visualization and of the two wind speed visualization methods is measured to determine whether they are fast enough to create a smooth animation. The accuracy of all visualization methods is determined. Finally, the performance of buffering frames is compared between calculating them serially and using multiprocessing.

All tests were performed on a 2015 MacBook Pro with a dual-core Intel i5-5257U processor with hyper-threading and 16 GB RAM. The data used for these experiments comes from the ERA5 dataset. The variables used are the u- and v-components of the wind, and temperature, both at a pressure level of 925 hPa. The time frame runs from 01-01-2017 until 31-12-2019, using every hour of every day. The geographical area is bounded by latitudes 0 to 62 and longitudes -17 to 45 (in degrees). The data is in the netCDF format.

5.1 Interactive map

5.1.1 GeoViews

Creating a map with multiple time frames of data seems to take longer the more time frames are given. If this time keeps increasing with the number of frames, the application could take a long time to start with a large dataset. To find out how long GeoViews takes to create a map, a map is created using an increasing number of time frames, and the time until the map is ready is measured. The results are shown in Figure 5.1. While creating the map, GeoViews appears to pre-render the frames, which slows it down significantly as the number of frames increases. At 28 and 29 frames of data it takes over 70 seconds to finish, and this is likely to increase further with more frames. A typical dataset used by meteorologists could have thousands of time frames, and if the trend continues, creating the map would take an unreasonable amount of time for practical use. Because ipyleaflet does not suffer from this problem, and because of its support for Jupyter widgets, ipyleaflet is the interactive map that will be used to build this package.

5.2 Visualization

5.2.1 Performance

In order to be able to view a smooth animation, attention needs to be paid to the performance of the visualization. The faster a frame can be displayed on the map, the faster the animation can be played. The time it takes to display a frame consists of two parts. The first is how fast data can be prepared for displaying. The second is how fast the map can display the data as a layer.


Figure 5.1: Time taken to display a map using GeoViews, for an increasing number of time frames of a dataset.

Temperature

Generating a frame of temperature data is relatively simple: it only consists of reprojecting the data and saving the reprojection as an image. To determine the calculation time, the average time of 1000 calculations is measured. This resulted in a calculation time of 35 ms per frame on average. This would allow for an animation running at over 20 frames per second (1/0.035 s is roughly 28 frames per second), which is relatively smooth.
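Averaging over many calculations, as done here, could look roughly like the following. This is a minimal sketch and not the measurement code of this thesis; `dummy_frame` is a hypothetical stand-in for the reprojection and image-saving step.

```python
import time

def mean_call_time(func, repeats=1000):
    """Average wall-clock seconds per call of `func` over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats

def dummy_frame():
    # Hypothetical stand-in for reprojecting one frame and saving it as an image.
    return [value * 0.5 for value in range(1000)]

seconds = mean_call_time(dummy_frame, repeats=1000)
print(f"{seconds * 1000:.3f} ms per frame")
```

Timing the whole loop and dividing by the number of repeats reduces the influence of timer resolution compared to timing each call separately.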

Wind speed

Generating a frame of wind data is more complicated than generating a frame of temperature data, and depends on the visualization method. Using matplotlib’s quiver requires reprojecting the data, creating the plot itself, and creating an image and encoding it in base64. Using vector data requires calculating arrows for the given data points. Because the vector data is described using coordinates, the arrows will be in the correct position and do not require reprojection. With either method, an arrow is created for every data point. If the data points are close together, the result can look cluttered. To remedy this, the data arrays can be read with a certain stride: a stride of n means reading every nth value from the array. A higher stride also reduces the time needed to generate a frame. Multiprocessing can be used for calculating the arrows with the vector data method. The Python multiprocessing package has a Pool class which can distribute tasks among multiple processes automatically. The ideal number of processes is equal to the number of CPU cores (or double that if the CPU supports simultaneous multithreading, which most modern processors do) [18].
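The effect of the stride on the number of arrows can be illustrated with NumPy slicing. The sketch below is an assumption-laden illustration rather than the package's implementation: the function names, the 0.25 degree grid spacing, the constant wind field and the `scale` factor are all chosen for the example.

```python
import numpy as np

def strided_wind(lats, lons, u, v, stride):
    """Keep every `stride`-th grid point along both axes."""
    return (lats[::stride], lons[::stride],
            u[::stride, ::stride], v[::stride, ::stride])

def arrow_segments(lats, lons, u, v, scale=0.1):
    """Return an array of shape (n_points, 2, 2): [[lat0, lon0], [lat1, lon1]].

    Each arrow starts at its grid point and extends along the wind vector;
    `scale` (degrees per m/s) is an arbitrary illustrative value.
    """
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
    starts = np.stack([lat_grid, lon_grid], axis=-1)
    # v is the northward component (latitude), u the eastward one (longitude).
    ends = starts + scale * np.stack([v, u], axis=-1)
    return np.stack([starts, ends], axis=-2).reshape(-1, 2, 2)

# Assumed 0.25 degree grid over the area used in the experiments.
lats = np.arange(0.0, 62.25, 0.25)
lons = np.arange(-17.0, 45.25, 0.25)
u = np.ones((lats.size, lons.size))   # made-up constant eastward wind
v = np.zeros((lats.size, lons.size))

s_lats, s_lons, s_u, s_v = strided_wind(lats, lons, u, v, stride=10)
segments = arrow_segments(s_lats, s_lons, s_u, s_v)
print(u.size, "grid points ->", segments.shape[0], "arrows")
```

Because the stride is applied along both axes, a stride of n reduces the number of arrows by roughly a factor of n squared, which explains why higher strides are so much cheaper to compute.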

The processing time of the wind speed visualizations is measured for both methods, using a range of stride values. The vector data processing time is tested using both multiprocessing and regular serial processing. Even with a stride of 10, calculating a frame using vector data takes around 0.18 seconds on average. This would allow for an animation of at most about 5.5 frames per second, which is not very smooth. Calculating a frame using quiver suffers from the same problem: with a stride of 10, it takes 0.35 seconds on average, which allows for only about 2.8 frames per second. Both methods thus suffer from a relatively slow calculation time, which leads to a slow animation speed.
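The relation between calculation time and achievable frame rate is simply the reciprocal, as the short sketch below shows for the averages reported in this section. The function name `max_fps` is chosen for illustration only.

```python
def max_fps(seconds_per_frame):
    """Upper bound on the frame rate when every frame is computed on demand."""
    return 1.0 / seconds_per_frame

# Average per-frame calculation times reported in this section.
frame_times = {
    "temperature": 0.035,
    "wind, vector data (stride 10)": 0.18,
    "wind, quiver (stride 10)": 0.35,
}

for name, seconds in frame_times.items():
    print(f"{name}: {max_fps(seconds):.2f} fps")
# temperature: 28.57 fps
# wind, vector data (stride 10): 5.56 fps
# wind, quiver (stride 10): 2.86 fps
```

These values are upper bounds, since they ignore the time the map needs to display a frame once it has been calculated.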


Figure 5.2: Average time (over 100 calculations) to calculate a frame of wind speed data using vector data arrows and matplotlib’s quiver visualization with different data strides. Separate lines are shown for calculating the arrows serially and using multiprocessing. Four processes were used for multiprocessing.

Results

Several performance measurements were made for the different visualization methods. For the temperature visualization, only one method is used, so no comparison has to be made, but the average calculation time of a frame is fast enough for a reasonably smooth animation. The two wind speed visualization methods were compared. When using a low stride, which gives a dense field of arrows, the quiver plot method is much faster, but from a stride of 5 onwards the calculation time is approximately the same. The vector plot method was also tested using multiprocessing as well as regular serial calculation, but no significant difference was measured.

5.2.2 Correctness

It is important that the visualizations created are accurate enough for meteorological analysis. For this reason they have to be validated. Because the temperature and wind speed data are visualized with different techniques, they require different verification methods. These verifications are performed in this section.

Temperature

The correctness of the temperature visualization is difficult to determine precisely. However, a visual inspection can establish a certain level of accuracy. Especially in warmer months, there is a clear difference in temperature between land and sea. This difference is visible in the data and can be used to recognise some country borders. If these temperature differences line up with the sea borders on the map, the reprojection can be considered relatively accurate; see Figure 5.3.

(a) Reprojected image

(b) Same data without reprojection. The arrow indicates the Strait of Gibraltar in the data.

Figure 5.3: Comparison between reprojected data and data without reprojection. Note how the warmer temperatures on land surrounding the Strait of Gibraltar align neatly with the country borders in the image with reprojection, while in the other image the same data is positioned further north, as indicated by the arrow.

Wind speed

Because the arrows created by either matplotlib’s quiver or vector data are supposed to be at specific coordinates, it is relatively easy to determine whether they are accurate. Ipyleaflet supports adding markers at specific coordinates. Using the same latitudes and longitudes at which the arrows in the visualization are supposed to be located, a group of markers can be added to the map and compared to the arrows.

The results are shown in Figure 5.4. The figure clearly shows that the arrows drawn using vector data are located correctly, while the arrows drawn with quiver diverge significantly from their correct locations. Because of this, and because neither method is significantly faster than the other, the vector data method is preferred over the quiver plot method.
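The comparison in this thesis was made visually. If the positions at which arrows are drawn were available as coordinates, a numeric counterpart could be sketched as below. This was not part of the experiments; the function name and all coordinates are made up purely for illustration.

```python
import math

def max_offset_degrees(expected, drawn):
    """Largest distance in degrees between expected and drawn (lat, lon) pairs."""
    return max(
        math.hypot(e_lat - d_lat, e_lon - d_lon)
        for (e_lat, e_lon), (d_lat, d_lon) in zip(expected, drawn)
    )

# Hypothetical grid points at which markers would be placed ...
expected = [(50.0, -15.0), (50.0, -12.5), (52.5, -15.0)]
# ... and hypothetical positions at which two renderers draw the arrows.
aligned_arrows = [(50.0, -15.0), (50.0, -12.5), (52.5, -15.0)]
shifted_arrows = [(50.5, -14.0), (50.5, -11.5), (53.0, -14.0)]

print(max_offset_degrees(expected, aligned_arrows))  # 0.0
print(round(max_offset_degrees(expected, shifted_arrows), 3))
```

A plain Euclidean distance in degrees is used for simplicity; it ignores that a degree of longitude covers less ground at higher latitudes.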

Results

A basic form of verification of the visualizations was performed. The temperature visualization was only visually inspected. The wind speed visualizations could be verified using markers that point out the correct locations of the arrows. The quiver method showed quite a divergence from the correct locations. This is most likely because it is generated from a matplotlib plot, which normally includes axes and does not place the data points at the edges of the plot, so as not to hide parts of the arrows.
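How blank space around a plot could displace the arrows can be illustrated with a little arithmetic. The sketch below only illustrates the suspected cause under simplifying assumptions: it assumes equal margins on both sides, and the 5% margin is an invented value, not a measured property of matplotlib output.

```python
def displayed_position(true_pos, low, high, margin):
    """Where a data point appears after an image with blank margins is
    stretched over the interval [low, high].

    `margin` is the fraction of the image width taken up by blank space on
    each side; the data itself occupies the central 1 - 2 * margin.
    """
    fraction = (true_pos - low) / (high - low)
    return low + (margin + (1 - 2 * margin) * fraction) * (high - low)

# Longitude bounds from the experiments; the 5% margin is an assumed value.
low, high, margin = -17.0, 45.0, 0.05
for lon in (-17.0, 14.0, 45.0):
    print(lon, "->", round(displayed_position(lon, low, high, margin), 2))
# -17.0 -> -13.9
# 14.0 -> 14.0
# 45.0 -> 41.9
```

Under these assumptions the displacement is zero at the centre of the image and largest at the edges, where points are pulled inwards.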

5.3 Buffering

Because generating every frame only when it is the currently selected frame can be slow, preventing a smooth animation, frames are buffered by calculating them ahead of time and storing the resulting image or vector data. The traditional way of buffering a series of frames is to calculate them one after the other (serially), but multiprocessing can be used to calculate multiple frames at the same time, which reduces the buffering time. To determine which method is faster, both are used to calculate an increasing number of frames, and the time it takes to complete is measured.

As shown in Figure 5.5, using multiprocessing significantly reduces the time taken to calculate a series of frames, with a reduction of over 10 seconds when calculating 70 frames. Because of this, multiprocessing will be used for buffering.
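The two buffering strategies compared here can be sketched with the Pool class of Python's multiprocessing package. The sketch is a simplified illustration, not the package's actual code: `render_frame` is a hypothetical stand-in for the real frame calculation, and the default of 4 processes mirrors the number used in Figure 5.2.

```python
from multiprocessing import Pool

def render_frame(index):
    # Hypothetical stand-in for computing the image or vector data of a frame.
    return index, sum(i * i for i in range(10_000))

def buffer_serial(indices):
    """Compute frames one after the other and store them by frame index."""
    return dict(render_frame(i) for i in indices)

def buffer_parallel(indices, processes=4):
    """Compute frames in worker processes and store them by frame index."""
    with Pool(processes=processes) as pool:
        return dict(pool.map(render_frame, indices))

# Serial buffering of the first eight frames.
frames = buffer_serial(range(8))
print(sorted(frames))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

When run as a script, `buffer_parallel(range(70))` should be called from within an `if __name__ == "__main__":` block, because on some platforms worker processes re-import the main module. Both functions return the same dictionary, so the animation code can look up a buffered frame by its index regardless of how the buffer was filled.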


(a) Vector arrows. Patch west of Ireland. (b) Vector arrows. Patch near the coast of Liberia.

(c) matplotlib quiver. Patch west of Ireland.

(d) matplotlib quiver. Patch near the coast of Liberia.

Figure 5.4: Markers placed over arrows. The markers indicate the correct locations of the arrows; each arrow should have its center at a marker.

Figure 5.5: The difference between using multiprocessing and calculating the frames one after the other (in series). The time shown is the time taken to calculate an increasing number of frames. The function used to calculate a frame was the quiver plot method.


CHAPTER 6

Discussion

6.1 Contributions

The goal of this project was to create a tool with which meteorologists and other scientists using meteorological data can more easily analyse the data they are interested in. The package that was built can be expanded into a more general library that visualizes multiple data types and includes more visualization methods.

6.2 Limitations

6.2.1 Correctness Verification

The verification of the correctness of the visualizations was limited. The temperature visualizations were only checked visually. A possible, more objective method is to use a reference dataset in which specific coordinates are highlighted, which can then be compared against markers such as those used for the wind speed verification. Besides the geographic accuracy, the accuracy of the displayed values was not determined.

6.2.2 Performance

All experiments were performed on a laptop with a relatively low-performance CPU. The calculation times could be much lower on a system with a faster CPU. Multiprocessing might also have more of an impact on performance on a CPU with more processing cores.

6.2.3 Additional Experiments

Additional experiments would have been useful, such as perception studies of the visual results and user feedback on the UI, interaction and workflow of the tool, but due to the limited time available for this project, these could not be performed.

6.3 Ethical Aspects

Open source software is an important part of ethical software development. The package developed in this thesis is mainly aimed at use by meteorologists in the scientific community. Making a piece of software open source allows users to add or change functionality and use it as they please. This aligns well with this project, as the package is meant to be expanded in the future. Because of this, the source code will be released under an open source licence.


CHAPTER 7

Conclusions

There is a range of software available to visualize meteorological data. Some can display time-series data, and some can visualize data and allow interactivity, but not many can provide both. No packages that do this are available on the Jupyter platform using Python, which is a popular platform for scientists to analyze meteorological data on. The goal of this project was therefore to fill this gap and find out how to create a Python package for the Jupyter platform that allows a user to visualize time-series wind speed data and temperature data. To answer this, various components of such a package were chosen and tested.

The first choice was the interactive map the rest of the package would be built around. This became ipyleaflet, because of its integration with Jupyter widgets, and because it does not need to load in an entire dataset as GeoViews does. Following that, the way the data would be visualized had to be determined. For temperature data and other simple raster data, there was only one commonly used method, which is to display it as an image using the data values as pixel colors. The visualization of wind speed data was possible using two methods: visualizing the flow using streamlines, or showing the vector field as arrows. Generating streamlines turned out not to be ideal for animation, as very different lines could be generated for successive frames, making it hard to see small changes in the wind data. Because of this, a vector field plot using arrows was chosen as the method of wind speed visualization. The visualization speed was tested, and it was clear that generating a frame for the wind speed data takes too long to animate smoothly. To work around this, frames are buffered ahead of time, so that during animation the data to be displayed can be retrieved instantly.

Effort was put into developing the best package possible, but there were some limitations. The verification of the visualizations was limited, and a study to gather user feedback on the UI and workflow would have been very helpful, but was not possible due to a lack of time. The package developed is meant to be the start of a larger project and is supposed to be expanded in the future with more functionality, such as support for more data sources and scientific functions.


Bibliography

[1] Jeffrey M Perkel. Why Jupyter is data scientists’ computational notebook of choice. Nature, 563(7732):145–147, 2018.

[2] Sheri Mickelson. Python programming and visualization for scientists. Bulletin of the American Meteorological Society, 97(12):2396–2398, 2016.

[3] Johnny Wei-Bing Lin. Why Python is the next wave in earth sciences computing. Bulletin of the American Meteorological Society, 93(12):1823–1824, 2012.

[4] John D Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

[5] Jonathan J Helmus and Scott M Collis. The Python ARM Radar Toolkit (Py-ART), a library for working with weather radar data in the Python programming language. Journal of Open Research Software, 4, 2016.

[6] Kang-Tsung Chang. Geographic information system. International Encyclopedia of Geography: People, the Earth, Environment and Technology, pages 1–10, 2016.

[7] Paul Bolstad. GIS fundamentals: A first text on geographic information systems. Eider Press, Minnesota, 2016.

[8] Peter A Burrough, Rachael A McDonnell, and Christopher D Lloyd. Principles of geographical information systems. Oxford University Press, 2015.

[9] Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand Schepers, et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020.

[10] Marc Rautenhaus, Michael Böttinger, Stephan Siemen, Robert Hoffman, Robert M Kirby, Mahsa Mirzargar, Niklas Röber, and Rüdiger Westermann. Visualization in meteorology - a survey of techniques and tools for data analysis tasks. IEEE Transactions on Visualization and Computer Graphics, 24(12):3268–3296, 2017.

[11] Sir Francis Galton. Meteorographica: Or, Methods of Mapping the Weather... Referring to the Weather of a Large Part of Europe During the Month of December 1861. Macmillan & Company, 1863.

[12] Mark Monmonier. Air apparent: How meteorologists learned to map, predict, and dramatize weather. University of Chicago Press, 2000.

[13] Mark Monmonier. Rhumb lines and map wars: A social history of the Mercator projection. University of Chicago Press, 2010.

[14] Man Wang, Jun Tao, Chaoli Wang, Ching-Kuang Shene, and Seung Hyun Kim. FlowVisual: Design and evaluation of a visualization tool for teaching 2D flow field concepts. In 2013 ASEE Annual Conference & Exposition, pages 23–609, 2013.


[15] Russ Rew and Glenn Davis. NetCDF: an interface for scientific data access. IEEE Computer Graphics and Applications, 10(4):76–82, 1990.

[16] Davide De Marchi, Armin Burger, Pieter Kempeneers, and Pierre Soille. Interactive visualisation and analysis of geospatial data with Jupyter. Proceedings of the BiDS, 17:71–74, 2017.

[17] Larry Masinter. The "data" URL scheme, 1998.

[18] Navtej Singh, Lisa-Marie Browne, and Ray Butler. Parallel astronomical data processing with Python: Recipes for multicore machines. Astronomy and Computing, 2:1–10, 2013.
