
Bachelor Informatica

DisJotter: an interactive containerization tool for enabling FAIRness in scientific code

Wilco Kruijer

June 15, 2020

Supervisor(s): Spiros Koulouzis & Zhiming Zhao

Informatica, Universiteit van Amsterdam


Abstract

Researchers nowadays often rapidly prototype algorithms and workflows of their experiments using notebook environments, such as Jupyter. After experimenting locally, cloud infrastructure is commonly used to scale experiments to larger data sets. We identify a gap in combining these workflows and address it by relating the existing problems to the FAIR principles. We propose and develop DisJotter, a tool that can be integrated into the development life-cycle of scientific applications and help scientists to improve the FAIRness of their code. The tool has been demonstrated in the Jupyter Notebook environment. By using DisJotter, a scientist can interactively create a containerized service from their notebook. In this way, the container can be scaled out to larger workflows across bigger infrastructures.


Contents

1 Introduction
  1.1 Research question
  1.2 Outline
2 Background
  2.1 FAIR principles
  2.2 Environments and tools for scientific research
    2.2.1 Jupyter computational notebooks
    2.2.2 JupyterHub
    2.2.3 Scientific workflow management
  2.3 Software encapsulation
    2.3.1 Docker
    2.3.2 Repo2docker
  2.4 Gap analyses
3 DisJotter
  3.1 Requirements
  3.2 Architecture
  3.3 Implementation
    3.3.1 Front-end and server extensions
    3.3.2 Service helper
    3.3.3 Introspection
4 Results
  4.1 Software prototype: current status & installation
  4.2 Demonstration
    4.2.1 Containerizing code fragments in a notebook environment
    4.2.2 Using the generated Docker image
    4.2.3 Publishing the generated Docker image
  4.3 Process analyses
  4.4 Evaluation of requirements
5 Discussion
  5.1 Alternative implementations
  5.2 Installation of dependencies within the generated image
  5.3 Missing metadata
  5.4 Ethical considerations
  5.5 Alternative containerization platform
  5.6 Implementing code introspection for other programming languages
6 Conclusion
  6.1 Future work


CHAPTER 1

Introduction

Activities in the life-cycle of scientific experiments are often modelled as a sequence of dependent steps. A hypothesis is formulated from observations, which is later tested and evaluated using experiments. Computational tasks and steps in a scientific experiment are commonly realized by scientific software or services, which can be automated by workflow management systems. An effective workflow programming and automation environment, in combination with reusable workflow components, enables quick experimental iterations and thus improves efficiency in scientific research.

In recent years a literate approach to programming has gained popularity in the form of notebooks. Literate programming is a paradigm in which natural language is combined with computer code and results to represent a program that is easy to understand for both the computer and humans. Advocates of the literate programming approach argue that this style of programming produces higher-quality computer programs, since it enables the programmer to be explicit about the decisions they make during the creation of software [5]. Literate programming has found its use in scientific research as computation became a central part of the scientific process: the more commonplace computational steps became, the more important it became to document them. Notebook environments can be used in most steps of the process: data collection, processing, analysis, and creating visualizations. They allow for rapid iteration and testing across all steps.

Reproduction of experimental results is important for multiple reasons. It allows the scientist to validate their experiments, but it also enables other scientists to reproduce the experiment for further studies. Nowadays, scientific computation is often done in distributed cloud environments. These environments have storage and computation capacities that far exceed the capacity of single computers. Using this type of infrastructure, research can be done on a much larger scale. The possibility of scaling experiments makes reproduction of experiments especially important.


1.1 Research question

An important aspect of notebooks is their highly interactive interface, via which users can rapidly develop algorithms and visualize the results. Published notebooks rarely make use of local modules [7], which means that notebooks are made available as a whole. This lowers the findability and accessibility of individual components within a notebook, e.g. a specific algorithm. It also makes it hard to execute components of a notebook as part of a workflow in environments such as cloud platforms, for instance when scaling to large data sets or parallelizing the execution.

In this thesis, we are motivated to answer the question: “how to reuse and share components in scientific code via notebook environments?”

To answer this question we will first answer the sub-question: “how to encapsulate a component of a notebook?”, and then we will look at how to make encapsulated components findable, accessible, and interoperable.

1.2 Outline

In the next chapter we first explore the background information related to our research: the principles we want our product to adhere to, the software that is used by researchers, and the existing solutions for software encapsulation. Then, we identify the gaps that exist within these concepts. Based on these gaps, we analyze requirements for a tool in the first part of the third chapter. We propose an architecture based on the requirements and discuss the technical considerations. In chapter four we discuss the implementation details and demonstrate the system functionality. We discuss the limitations and considerations of our software in the second-to-last chapter. Finally, we conclude this thesis.


CHAPTER 2

Background

2.1 FAIR principles

As the reliance on computation in the scientific process grew, the dependency on data grew with it. To improve “scientific data management and stewardship” [9], the data science community formulated a set of principles. The four principles are findability, accessibility, interoperability, and reusability. In this context, they refer to data specifically, but the same principles can be applied to implementations or code within computer science research [3].

Jiménez et al. describe four recommended best practices to keep in mind when working on research software. They argue that although not all of the FAIR principles directly apply to digital objects other than data, most of the principles can be applied to software. All recommendations given in this work draw parallels to the FAIR principles [3].

1. Findable: Researchers are encouraged to make software easy to discover in various manners.

2. Accessible: The article pleads for making source code (publicly) available from day one.

3. Interoperable: Although the article does not give a direct recommendation about the interoperability of software, it does recommend publishing software on community registries, which indirectly makes the software easier to integrate into different workflows.

4. Reusable: Jiménez et al. argue that all best practices in software improve reusability.

Reusing and reproducing are important aspects of the scientific method. In scientific studies, reproducibility can give guarantees about the research that has been done, and it also makes a study more transparent. Without reproducibility, science would not be able to progress in any meaningful way: to accumulate knowledge, research must be built upon older research and data. Findability is similarly important in the context of research software. Software packages are often made discoverable by being deposited in repositories along with metadata. Metadata usually includes version numbers, licenses, and contact details of the author. A summarized explanation of the capabilities of the software is generally also available with the package. This enables third parties to search for the software they desire. An important tool in findability is the persistent identifier. These identifiers can be used by software consumers as a way to uniquely refer back to the used software. In many registries this is simply a URL to a web page; another example is a namespace in Java.

FAIRness is widely considered important when applied to data sets; typically, however, the principles are not applied directly to research software.

2.2 Environments and tools for scientific research

2.2.1 Jupyter computational notebooks

Jupyter Notebook is a computational environment that is most widely used in the literate programming approach [6]. The Jupyter Notebook application can be used to create and share documents that help the user in data analysis, experimenting, and representing results. It supports many different programming languages but it is most commonly used in combination with the Python and R programming languages. It allows the user to create and display plots inline. Documents contain cells, which can be of a variety of types. Among these types are code, text, and images. It is possible to formulate mathematical equations in text cells using LaTeX syntax. Notebooks are meant to improve the shareability and reproducibility of the code that they contain [4]. Saved Jupyter documents contain static output from when the code was executed last. This allows third parties to read the output of a notebook without executing it.

Figure 2.1: A typical iterative workflow in a literate programming notebook environment [8].

The Jupyter system contains a set of components that ultimately deliver the notebook user experience. All projects in the Jupyter ecosystem are built on top of the Jupyter Client, a Python library that implements the Jupyter protocol. This protocol is a specification of messages used by so-called kernels. In the Jupyter context, a kernel is a program that runs and examines the user's code. Jupyter notebooks are most often used in conjunction with the Python kernel, but many other languages are supported, such as the R programming language. The end-user can manually install kernels of their choice to extend the programming languages available in the notebook environment. The Jupyter Client library can be used in a language-agnostic manner, which means any program built on top of it does not have to be specifically implemented for a single language.
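As a small illustration of this protocol, the following sketch uses the jupyter_client library to start a kernel and send it a code cell; it is a minimal example, assuming the library and the python3 kernel spec are installed, and is not part of DisJotter itself.

from jupyter_client.manager import start_new_kernel

# Start a Python kernel and a blocking client that speaks the Jupyter protocol.
km, kc = start_new_kernel(kernel_name="python3")

# Send a code cell to the kernel; output messages are relayed back over the protocol.
kc.execute_interactive("x = 21 * 2\nprint(x)")

# Shut the kernel down again.
km.shutdown_kernel()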

Jupyter Notebooks provide advantages in the fact that code, description and results are close together, which helps provide a (scientific) story to the reader. Giving working examples next to text allows readers to more easily understand the material and grasp the explained concepts. A programmer can use notebooks to rapidly prototype and immediately see results. However, notebooks also come with drawbacks. Code in notebooks is often tightly coupled. Because of the cell structure in the notebook, users underutilize traditional separation-of-concerns methods that exist in programming languages, such as classes and functions [7]. Pimentel et al. have also shown that the reproducibility rate of these notebooks is low. The number one reason given for notebooks not working is missing dependencies. Jupyter notebooks do not have a standardized way for users to declare their code's dependencies; instead, notebooks use libraries globally available on the user's system and expect the user to use a language- and package-manager-specific way of declaring and installing external libraries. They conclude that only a small percentage of public notebooks are able to produce the same results that they contain. This hinders the advancement of science that researchers seek.

2.2.2 JupyterHub

JupyterHub is a software platform built around the Jupyter Notebook workflow. This platform takes the notebook environment to the cloud and allows it to be used by multiple users from a centralized place. Using the hub allows users to share their notebooks by simply sharing a URL. The second user can then open the URL to see and modify the original notebook in exactly the same environment as the original creator. This solves the reproducibility issues. However, this environment is managed by a system administrator, and every researcher has to request access to the platform. It is therefore not a viable solution for sharing Jupyter Notebooks with third parties.

2.2.3 Scientific workflow management

Scientific workflow management systems are designed to orchestrate and manage scientific pipelines. Pipelines combine a series of steps in which data flows between elements of the pipeline. In these pipelines, it is crucial that many different applications and technologies can be combined to create an experimental setup. Pipelines that have been created using management systems are then executed in cloud environments.

Workflow management systems are not inherently compatible with all types of software. Software has to be made interoperable in some way to be able to be used in a pipeline. Interoperable software is often referred to as a “service”. This is a very broad term that refers to a software component with a specific purpose. It generally operates as a black box that takes some input and produces an output. Services can be used by a consumer for any purpose. Services are relevant to workflow management systems since they integrate well into pipelines. Any software can be developed to be a service, which can then be deployed and monitored using these management solutions.

Communication with services is often facilitated by HTTP. These services are referred to as web services. This is a common solution to making services interoperable. Workflow management systems and other service composition toolkits allow the user to invoke web services as part of the workflow.

2.3 Software encapsulation

In the service context, we refer to the process of creating a service out of an arbitrary piece of software as encapsulation. During this process, it is important to make sure the software is able to run in different environments. Defining the inputs and outputs of the software is also an important aspect of the encapsulation process.


Software that adheres to the FAIR principles lends itself well to service-oriented architecture. The interoperability and reusability aspects are especially relevant to software written for service-oriented architecture since the services are expected to be agnostic to the workflow they are being used in.

To ensure a software service is reusable in any environment, virtualization is generally used; more specifically, OS-level virtualization. In OS-level virtualization, multiple separated environments can be run on top of the host OS. These environments are often called containers. Software running inside of these containers does not have to be aware of the fact that it is running inside of a container. The containerized software only has access to hardware and other resources specifically assigned to that container. All software above the kernel level can be picked on a per-container basis, including the operating system and all libraries.

2.3.1 Docker

There are many OS-level virtualization solutions. One such solution is Docker. The Docker platform provides the tools for working with its containers. Docker containers are created from templates called images; these images include an operating system at their core, with other software, data, and configuration on top. Images can quickly grow in file size, which makes them inconvenient to distribute. Therefore Docker allows images to be created dynamically, via a command-line interface or via special files called Dockerfiles. These Dockerfiles contain a set of instructions that Docker uses to create and run images.

Docker registries are a component of the Docker ecosystem. Docker clients are able to push and pull images to and from a registry. This allows the user to easily share their Docker images in a reproducible way and allows other users to quickly download and run those images. The default registry used by the Docker client is Docker Hub.

2.3.2 Repo2docker

In the paper “Reproducible Research Environments with repo2docker”, Forde et al. propose a tool that fetches a repository, inspects it, and builds a container image based on the files in the repository [2]. This command-line tool uses standard file formats used by certain programming languages and their package managers to encapsulate an arbitrary software repository and create a Docker image out of it. It works best in conjunction with the programming languages used most often in the Jupyter toolchain, namely Python, Julia, and R. It is, however, also capable of containerizing software written in other languages. The image it produces is made interactively available through a Jupyter Notebook environment.

In repo2docker the created service is a user-facing application that can be used to interactively work with the research that has been done. This does improve the portability of the user's code, but it does not allow specific components to be extracted. It will not enable the encapsulated software to integrate into service-oriented workflows since it requires constant user interaction.

2.4 Gap analyses

We have seen multiple tools that are used in scientific research. First, we saw Jupyter Notebook, a tool which is used for quick iterative development of scientific software. Then we discussed scientific workflow management systems: tools that help researchers scale their software in cloud environments to enable experiments to run with more computational power and larger data sets. Notebooks have their own set of problems with regard to reuse; among these are the unavailability of dependencies and the absence of separation-of-concerns mechanisms. We can now also identify a clear gap between iterating on scientific software locally and getting this software running on remote systems.

Interoperability is a requirement for software to be integrated into scientific pipelines. To achieve this, software can be encapsulated as a service, or, going one step further, as a containerized service. The previous sections show software that helps in the process of encapsulation. Docker is a general solution that is used to improve the reusability of software. We also saw a tool that is a specialization within this space: Repo2docker enables the user to create a Docker image from a repository that includes arbitrary (scientific) software. However, this tool does not enable a repository to be used in scientific workflow management systems.


CHAPTER 3

DisJotter

In this chapter we present the requirements and architecture along with the implementation details of our tool named DisJotter.

Definition 1 Jotter [noun]: “a small pad or notebook used for notes or jottings.”

The etymology of the name DisJotter is explained as the combination of the word Jotter and the negative prefix “dis-”. The symbolic meaning of the word then is “to rip a page out of a notebook”.

3.1 Requirements

Based on the technical gaps specified in the last chapter, we now list and describe a number of requirements for DisJotter.

1. Flexible: the tool must be usable alongside existing environments for scientific research. During both the iterative experimental process and the scaled process in scientific pipelines, the user should not have to make a special effort to use our tool. Researchers should not be restricted in their design choices for their experiments as a consequence of using DisJotter.

2. Encapsulating: when the tool is used it should create a self-contained software service out of the specified component of the researcher's original work. The generated service should be interoperable, so as to be useful in different workflows. To improve the service's reusability, there should also be a level of customizability that allows for extending the original work by modifying its state before execution of a component. A standardized API should be provided to allow for this to be implemented for any programming language. The interoperability of the generated service has to be provided by industry standards.

3. Interactive: the user interface of the tool should provide options to adjust the eventual content of the generated service. This interface should first help the user in selecting a component they want to be encapsulated. It should then provide a method to select software dependencies, which will mitigate the problems that arise from dependencies being unavailable.


3.2 Architecture

To meet the requirements listed above we propose an architecture that allows researchers to interactively create a containerized service of a certain component of their work. Figure 3.1 shows a schematic overview of our architecture. The architecture consists of five separate components, split over two distinct environments. The first environment is depicted on the left. It shows three components running directly on the user’s hardware. The right side shows the resulting container image which is generated by the left side. This is the eventual output service that will be generated by our tool. This image includes two more components that belong to our architecture. Each component has a specific role in our architecture.


Figure 3.1: Schematic overview of the system architecture.

• Interface: this component is shown to the user to allow them to interactively configure the service that will be generated. It will enable the user to select the part of their program that will be used to generate output for the service. The user will also utilize this interface to adjust the software dependencies and libraries that will be available inside of the generated image.

• Back-end: the core of our architecture. The back-end will create a configuration file based on the user input in the interface. This component will then collect the user’s code and data together with the configuration file and instruct the containerization platform to create an image that includes these files. It is also responsible for packaging the service helper and runner components inside the image.

• Introspection - inspector: the first of two introspection components. This part of our architecture analyses the user’s code to figure out what variables are in use in the specified part of the user’s program. This data will be written to the configuration file and can later be used by the second introspection component.

• Service helper: this is the core component within the created container image. It wraps the user’s code and makes it available as a service. It uses a configuration file to determine exactly what part of the user’s code and data should be runnable.

• Introspection - runner: the final introspection component. This uses the data generated by the inspector component together with inputs given with the invocation of the service to modify the behaviour of the user’s code.


3.3 Implementation

We prototyped DisJotter, based on the described architecture, on top of the Jupyter Notebook stack. Portions of the Jupyter Notebook software stack were designed to be extended to provide additional functionality to the end-user. This extensibility, together with the fact that the Jupyter software is widely used by data scientists [6], makes it an obvious choice to implement our architecture on top of. Figure 3.2 shows a schematic overview of our implementation and how it relates to the Jupyter architecture.

The software that the notebook environment is built on top of also provides advantages. The client and kernel architecture is already modular; we can leverage this for the flexibility of our system. Incorporating our tool into the Jupyter Notebook workflow has limited our technical design choices: the Jupyter Notebook server is a Python web service that runs on top of the Tornado asynchronous networking library.

The Docker ecosystem is used as our containerization platform of choice. It is an industry standard that includes tools for the creation, composition and deployment of containers. This ecosystem also has one big advantage: it helps with the findability principle. Docker has an integrated solution that allows the user to upload a full container image to a public registry called Docker Hub. This enables other users to quickly download and run the created image; another important aspect is that many workflow management systems support downloading existing images from Docker registries.


Figure 3.2: Schematic overview of our implementation and its relation to the Jupyter architecture. Semi-transparent rectangles represent parts of the Jupyter architecture; fully opaque rectangles show our components.

3.3.1 Front-end and server extensions

The interface of our application is available from within the Jupyter Notebook interface. Since the Jupyter front-end is browser-based, browser technologies such as HTML, CSS and Javascript are used by DisJotter to display the user-interface.

Our back-end is implemented as an extension for the Jupyter server. This server is a set of RESTful APIs that the front-end communicates with to deliver the notebook user experience. We can extend this server by adding custom API routes. The notebook server is also responsible for loading the front-end code. Another important aspect of the back-end is the generation of the configuration file which will later be used by the service helper.
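To illustrate the extension mechanism, the sketch below shows how a classic Notebook server extension can register a custom API route; the /disjotter/build route and the handler body are hypothetical placeholders, not DisJotter's actual endpoints.

from notebook.base.handlers import IPythonHandler
from notebook.utils import url_path_join

class BuildHandler(IPythonHandler):
    def post(self):
        # A real handler would read the user's configuration from the request
        # body and instruct the containerization platform to build an image.
        self.finish({"status": "build started"})

def load_jupyter_server_extension(nb_server_app):
    # Called by the Notebook server when the server extension is enabled.
    web_app = nb_server_app.web_app
    route = url_path_join(web_app.settings["base_url"], "/disjotter/build")
    web_app.add_handlers(".*$", [(route, BuildHandler)])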

An important aspect of getting the user's code to be executable inside of the container environment is the availability of dependencies and libraries that the user imports. This is why, as was mentioned before, we allow the user to select software dependencies in the interface. To facilitate choosing and installing these packages we use Conda, a package management solution with support for all languages that are commonly used in combination with Jupyter notebooks. The user interface allows the user to modify an environment.yml file which Conda uses to install packages. To aid the user in selecting these libraries we use Pigar [1] to automatically detect the packages that are used by the notebook. We present the results in the user interface and allow the user to manually add more packages.

Docker provides a software development kit for Python which accommodates communication with the Docker daemon. The back-end creates a set of instructions which the daemon will use to build the image. The following instructions are executed using a Dockerfile.

• Use the selected base image as a foundation for the created image.

• Copy the service helper and introspection components to the image.

• Install software dependencies of these components, such as the Jupyter Client.

• Install the software dependencies the user's code requires using Conda.

• Copy the user's notebook, data and the generated configuration file.

• Lastly, instruct the image to start the service helper Python module when the image is run as a container.
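A rough sketch of how such a build could look with the Docker SDK for Python is shown below. The Dockerfile contents, file names, tag, and base image are illustrative stand-ins for the steps listed above, not the exact instructions DisJotter emits.

import os
import docker

# Illustrative Dockerfile following the listed steps (hypothetical file names).
dockerfile = """
FROM jupyter/minimal-notebook
COPY servicehelper/ introspection/ /opt/helper/
RUN pip install jupyter_client tornado
COPY environment.yml /tmp/environment.yml
RUN conda env update --file /tmp/environment.yml
COPY notebook.ipynb config.json data/ /home/jovyan/
CMD ["python", "-m", "servicehelper"]
"""

# Write the Dockerfile into a prepared build context directory.
os.makedirs("build_context", exist_ok=True)
with open("build_context/Dockerfile", "w") as f:
    f.write(dockerfile)

# Ask the Docker daemon to build an image from that build context.
client = docker.from_env()
image, logs = client.images.build(path="build_context", tag="user/notebook-service:0.1.0")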

3.3.2 Service helper

To make the user’s code available as a service from inside a container we run a separate Python module that runs a web server. This module is the container’s entry point and will automatically run when the container is started. This helper program will read a configuration file created by the back-end component. It indicates which notebook and what cell should be run by default. This service uses the Jupyter Client to execute the user’s code cells in the relevant kernels. The output of the last cell will always be returned. Several output types are supported such as plain text, HTML, markdown, JSON and images. The web server supports the following two routes:

• GET / runs a component of the notebook indicated by the configuration file.

• GET /notebook/<file>.ipynb?cell=<index> runs an arbitrary cell of a notebook available in the container.
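The following is a minimal, hypothetical sketch of such a helper server using Tornado; run_cell stands in for the part that executes a notebook cell through the Jupyter Client, and the route layout mirrors the two routes listed above. It is not DisJotter's actual module.

import json
import tornado.ioloop
import tornado.web

def run_cell(notebook: str, index: int) -> str:
    # Placeholder for "execute cell <index> of <notebook> in its kernel and
    # return the output of the last cell"; here we only echo the request.
    return json.dumps({"notebook": notebook, "cell": index, "output": "..."})

class DefaultHandler(tornado.web.RequestHandler):
    def get(self):
        # In DisJotter the default notebook and cell come from the config file.
        self.write(run_cell("example.ipynb", 2))

class CellHandler(tornado.web.RequestHandler):
    def get(self, notebook):
        index = int(self.get_query_argument("cell", "0"))
        self.write(run_cell(notebook, index))

app = tornado.web.Application([
    (r"/", DefaultHandler),
    (r"/notebook/(.+\.ipynb)", CellHandler),
])

if __name__ == "__main__":
    # Listen on 8888 inside the container, matching the port mapping used later.
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()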

3.3.3 Introspection

We use introspection as a method to extend the reusability of notebooks. Combined with the availability of a cell as a web route, this makes for a powerful tool that helps the user in scaling and extending their notebook. The Jupyter kernel protocol provides no introspection tools, which means introspection needs to be implemented manually for each kernel. To facilitate this we created a simple pluggable framework. Our introspection framework consists of two components: one component is used by the back-end while the other component is used by the service helper.


These are the inspector and runner components, respectively. Both the back-end and the service helper components of our architecture load the code inspector based on the notebook's kernel. The inspector is invoked by the back-end to analyse the currently selected part of the user's code. The analysis returns a list of variables that are in use by the specific cell of the notebook. The variables are shown to the user in the interface. Here, the user can select which variables should be configurable at service run-time. This selection is written to the configuration file so it can later be used by the runner.

The runner component ensures the selected variables are set to a specified value when the helper service is invoked. The runner, together with the service helper, allows the variables to be set either as a query-string argument or as POST body data.

Our reference implementation integrates with notebooks using the Python kernel. We use the Python built-in abstract syntax tree module to find variables that are in use by the selected cell.
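A minimal sketch of both parts for the Python kernel is given below; the function names are illustrative and simpler than DisJotter's actual implementation.

import ast

def inspect_cell(cell_source: str) -> set:
    """Inspector sketch: names a cell reads, i.e. candidate run-time inputs."""
    return {
        node.id
        for node in ast.walk(ast.parse(cell_source))
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }

def build_override(variables: dict) -> str:
    """Runner sketch: emit Python source that sets the chosen variables
    before the selected cell is executed in the kernel."""
    return "\n".join(f"{name} = {value!r}" for name, value in variables.items())

print(inspect_cell("plt.plot(line)"))    # {'plt', 'line'}
print(build_override({"line": [0, 2]}))  # line = [0, 2]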


CHAPTER 4

Results

4.1 Software prototype: current status & installation

The prototype we discussed in the previous chapter is available on the Python package index, https://pypi.org/project/DisJotter. The source code is available on GitHub, https://github.com/WilcoKruijer/DisJotter. This code is licensed under the Apache License. The source code includes a docker-compose file which allows for development in a reproducible environment.

To maximize the portability of our tool we made sure the installation process is very straightforward. Anyone familiar with the Python ecosystem should be able to easily install our tool. Listing 4.1 shows the steps to install our tool. After executing these steps and restarting the Jupyter Notebook application, an icon will appear in the toolbar of the notebook interface that will open the DisJotter interface. The Docker engine is required to be installed on the user's machine. Without it our tool will not be able to build and run Docker images.

# Downloads our tool from the Python package index
pip install disjotter

# First enable the server extension
jupyter serverextension enable --py disjotter

# Then install and enable the front-end extension
jupyter nbextension install --py disjotter
jupyter nbextension enable --py disjotter

Listing 4.1: DisJotter installation instructions.

4.2 Demonstration

In this section, we will demonstrate creating a service from a handcrafted, simplified version of a typical research notebook using our tool. Later on in the demonstration we will show a number of ways of using this service in practice. Figure 4.1 shows the overview of the notebook we use in this demonstration. It contains three cells that include code, one of which outputs a graph as an image. The first code cell imports the second most imported module in Python notebooks available on GitHub [7]. This module, Pyplot, is then used to create the plot in the last cell. The plotted graph is based on a variable defined in the second code cell. In a typical iterative workflow, the value of the line variable is experimented with until an interesting output is achieved.

Figure 4.1: Overview of a simplified research notebook.

4.2.1 Containerizing code fragments in a notebook environment

In this part of the demonstration we will use the interface to configure the Docker image that will be generated by the DisJotter back-end. Before using our tool the user has to make sure the notebook is properly tested and debugged. Any errors in the user’s code will result in the created Docker image failing to produce any proper results.

The user opens the DisJotter dialog by pressing a button in the Jupyter Notebook interface. The dialog presents the user with a range of options to customize the build process of their container. We will enumerate and explain each of the available options seen in figure 4.2.

1. Image name: this is the name the Docker daemon will assign to the image. It can later be used to execute operations on the image such as running it as a container or uploading it to a Docker registry. When uploading to a Docker registry this name is also used as a persistent identifier. Names are usually of the form author/image-name:semantic-version.

2. Base image: we allow the user to select the base image of their image. The data science notebook image is selected by default. In this demonstration we have selected a more minimalist base image. This will decrease the size of the resulting image.

3. Cell: this option specifies the cell which will be outputted by the created service.

4. Preview: the preview of the chosen cell, as it is available inside of the notebook.


Figure 4.2: The DisJotter interface.

5. Conda environment: this text field allows the user to pass any settings for use with the Conda package manager. In this example DisJotter has automatically detected the use of matplotlib and added it as a pip dependency.

6. Variables: this is the code introspection feature, which is only visible to the user because the notebook is using a Python kernel. It allows the user to designate any variable that is loaded in the specified cell to be changed right before the cell is executed. In this example the user has chosen the line variable to be accepted as a query parameter in the resulting web service.

After configuring the appropriate options the user presses the build button to instruct DisJotter to build their Docker image. Depending on the local availability of the base image and the build cache the build process can take seconds to minutes. The user now has the image available on their system. Next, we will go over some uses of this image.

4.2.2 Using the generated Docker image

The image that has been created by DisJotter can be used in a variety of ways. The simplest way of using the image is running it from the DisJotter interface. The second way of running the image is using the Docker command-line interface directly. Figure 4.3 shows the command used to start the image as a Docker container and the output provided by the helper service.


Figure 4.3: Running the generated image, exposing the container on port 10000.

When the container launches, the service helper component starts a web service. This web service is accessible on the host machine in multiple ways, e.g. with the curl tool, or by simply visiting the service in the browser. DisJotter, as well as the command-line Docker interface, allows the user to specify which port the service should run on. In this case we use port 10000. Using the image created in the previous section we can generate plots with various outputs. As an example we set different values for the line variable. We generate two graphs using different methods. The outputs are shown in figure 4.4.

• We generate the left graph by visiting the following URL in the browser.

http://localhost:10000?line=[0,2]

• The right example was generated by executing the following command in the Bash shell.

curl -G --data-urlencode line=[-10000,10002] http://localhost:10000

Figure 4.4: Graphs generated by the containerized service which was created from the sample notebook using DisJotter. Both graphs are different from the figure that was originally generated in the Jupyter Notebook environment.

There are other, more advanced, uses for the generated image. Containerization gives the user all the advantages of the Docker ecosystem. For example, it is trivial to run multiple containers of the same image that distribute a workload among them. Multiple containers can be spawned from a single image, each executing a potentially computationally expensive operation with different parameters that would normally have run sequentially within the original notebook environment.


Figure 4.5: A bar chart showing the running time of generating eight plots.

We measured the running time of the graph-generating service we created above. In the tests we generate 8 plots using 1, 2 or 4 containers running the image we generated above. We request a graph to be generated using a round-robin approach in conjunction with the curl tool. Figure 4.5 shows the performance increase gained from scaling our notebook from 1 to 4 services. In all measurements the containerized services are already running before starting the timer.
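The measurement setup can be approximated with a short script. The sketch below is a hypothetical re-creation in Python (standard library only, illustrative port numbers) of the round-robin requests that were issued with curl; it assumes four containers built from the generated image are already listening on ports 10000-10003.

import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

ports = [10000, 10001, 10002, 10003]
requests = [[0, i] for i in range(8)]

def fetch(item):
    # Round-robin: request i goes to container i modulo the number of containers.
    index, line = item
    url = f"http://localhost:{ports[index % len(ports)]}?line={quote(str(line))}"
    return urlopen(url).read()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(ports)) as pool:
    plots = list(pool.map(fetch, enumerate(requests)))
print(f"Generated {len(plots)} plots in {time.perf_counter() - start:.2f}s")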

Figure 4.6: Example pipeline implemented in Bash.

Another function of the image we will demonstrate is integrating the service in a workflow pipeline. The service can in principle be integrated into any pipeline that supports standardised web service interfaces. To demonstrate this we have set up a simple pipeline that uses the generated service. Figure 4.6 shows the Bash script that represents a four-stage pipeline. This pipeline fetches a random number generated by a remote service, then uses our local service to generate a graph based on this number. Next, the colours in the generated image are negated using the convert utility. The last step in the pipeline shows the modified image to the user.

4.2.3 Publishing the generated Docker image

Another advantage of the Docker ecosystem is the shareability of Docker images. The created Docker image can be pushed to the Docker Hub repository, which allows third parties to find and use the image. The image generated in this demonstration is available from Docker Hub. It can be run by executing the Docker command shown in listing 4.2 (try it!). The only requirement is having Docker installed on the system. The image will be downloaded from the repository and executed. The graph will then be generated by navigating to http://localhost:10000 in a browser.


docker run --rm -p 10000:8888 wilcokruijer/thesis-sample:1.0.1

Listing 4.2: Command to download and run the Docker image created from the sample notebook.
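The same can also be done programmatically, for example from another experiment script, using the Docker SDK for Python; the sketch below is an equivalent of listing 4.2, assuming the SDK is installed and the Docker daemon is running.

import docker

client = docker.from_env()
# Pull (if needed) and start the published image, mapping the helper's
# port 8888 inside the container to port 10000 on the host.
container = client.containers.run(
    "wilcokruijer/thesis-sample:1.0.1",
    detach=True,
    remove=True,
    ports={"8888/tcp": 10000},
)
print(container.status)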

4.3 Process analyses

In the demonstration above we have shown that our tool provides a way to automatically create a Docker container from a Jupyter Notebook. Without our tool this task would have to be performed manually. This is a convoluted process in which the user would have to execute a number of steps.

1. Extract code from notebook cells.

2. Wrap this code using separation-of-concerns mechanisms.

3. Extract data and any other files loaded from disk.

4. Analyze and create a list of module dependencies.

5. Create a web server and implement routes that execute the extracted code.

6. Write a Dockerfile that can be used to create an image that contains all code, data, and dependencies mentioned above.

DisJotter abstracts away these steps and integrates into the existing Jupyter Notebook environment. Using our tool the user can save a significant amount of time while bridging the gap between multiple elements in the experimental lifecycle. The pipeline example shows how an extracted component of a notebook can be used in a workflow that would otherwise not support running notebooks. Both demonstrations use the code introspection feature to extend the reusability of the service. Without this feature the pipeline would produce the same result on every run. This would make the pipeline effectively useless. We have also noted how the Docker ecosystem can be leveraged to enable the user to redistribute and scale components of their notebook.

4.4 Evaluation of requirements

We formalized a set of requirements for a tool to assist us in answering the research question. Walking through the requirements will show how they are satisfied.

We have demonstrated the flexibility requirement being met by showing how our tool integrates into the Jupyter Notebook environment as well as into arbitrary pipelines. Installing our tool is also shown to be trivial; its technical depth is similar to installing the Jupyter Notebook environment itself. Technical choices for experiments are not dictated by DisJotter. Our tool's flexibility is based on the versatility of the Jupyter Client. We leverage this technology to support a wide range of programming languages.

Furthermore, we have seen that our tool encapsulates components using industry-standard technologies. The web service that is produced by our tool runs inside of a Docker container. These containers are spawned from fully reproducible images. The image retains and uses all data from the original notebook, which ensures the output of the service is equal to the original notebook cell's output. We also demonstrated the extensible introspection interface. As a proof of concept, code introspection for the Python language was implemented. In future work the code introspection feature can be made compatible with other popular languages in the Jupyter ecosystem.

Lastly, the interface provides the user with an interactive way of configuring their encapsulated service. DisJotter displays a dynamic output of the notebook cell that is selected. Besides giving the user the ability to specify software packages to install, it also lets the user choose the base image for the container. This gives even more granular control over the output. The introspection component also displays options in the user interface.


CHAPTER 5

Discussion

In this chapter we reflect on the previous chapters and discuss the limitations and considerations of our tool.

5.1 Alternative implementations

The implementation of our architecture was built on top of the Jupyter platform. This allowed us to leverage both its intuitive user interface and its Jupyter Client and kernel architecture. However, our architecture is set up in an abstract way which would allow it to be implemented as a stand-alone application or on top of some other software stack. An alternative approach could be to implement our tool as a command-line interface. The Jupyter Client can be replaced by any REPL (read-eval-print loop); each cell would then be a command for the REPL. The components in our implementation would all need (slight) changes to make them compatible with other environments, as they are currently not fully stand-alone.

5.2 Installation of dependencies within the generated image

This work addresses the importance of installing the proper dependencies before running the user's code. Our tool automatically detects packages that are in use in the notebook. However, this is only supported when the notebook uses Python, and even for Python the method of detecting packages is not foolproof. We use Pigar for detecting packages, which has several open issues on its GitHub page [1]. Because of this, the user still needs to evaluate the packages which are in use by their work.

Our selection of base Docker image partly alleviates this problem. The base image that is selected by default in DisJotter contains runtimes for the three most popular programming languages in data science, as well as the most commonly used external libraries for each language. A trade-off for using this base image is the disk size of the generated image. The Conda package manager is used to download packages within the encapsulated service. DisJotter's user interface enables the user to specify packages to be downloaded using this manager. This gives fine-grained control over the final image.

Beyond what the interface offers, the user can also extend an image created by our tool manually. Because of the inheritance structure of Docker images the user can write their own Dockerfile based on an image generated by our service. In this workflow the user is still able to use our tool for an initial version of their Docker image. Later, they will have to use the command-line to enhance their image.

5.3 Missing metadata

The findability principle in FAIR describes that rich metadata should be provided with data, or in our case, software. DisJotter does not generate any metadata describing the service it creates. When the user shares the containerized service they are responsible for writing a description of the software. This is a deficiency in our tool. The Jupyter Notebook file format does include metadata, such as the kernel that is used. In future work this could be written to the created image along with data about the input variables of the service.
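As a pointer for such future work, the notebook's own metadata can be read with the nbformat library; the sketch below (the file name is hypothetical) shows the kernel information that could be copied into the generated image.

import nbformat

# Read a notebook and print the kernel metadata already stored in the file.
nb = nbformat.read("example.ipynb", as_version=4)
print(nb.metadata.get("kernelspec", {}))
print(nb.metadata.get("language_info", {}))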

5.4 Ethical considerations

The goal of this work is to make components of notebooks more reusable. Reusable research has obvious positive implications: it can easily be used in different workflows, enabling derivative research to prosper, and it can be verified quickly by independent sources. Of course, these are not direct consequences of our work; people would have to use our software for these benefits to materialize. For that reason, we released our implementation as free and open-source software. It is available for download on GitHub and is licensed under the very permissive Apache license.

5.5 Alternative containerization platform

In this project we have chosen Docker as the virtualization platform. Docker is one of many software programs that provide virtualization. The ecosystem around Docker provides us with more advantages than just virtualization, such as the Docker Hub platform. Singularity is a different containerization toolkit that is gaining popularity in the high-performance computing space. A large advantage of Singularity is that it can limit users in their access and capabilities: users that do not have administrator privileges on a machine can be safely given access to the Singularity toolkit, while the same is not true for Docker. This thesis does not touch on actually deploying Docker containers to remote servers or cloud computing environments; therefore this large advantage of Singularity has not affected our choice of virtualization software. Singularity includes a compatibility layer that allows Docker images to be run on the platform. This means the images created using our tool can also be used in a Singularity-enabled environment. The back-end component of our implementation specifically uses the Docker SDK and Dockerfile instructions; to use a different containerization platform these parts would have to be replaced. Other components in our implementation are agnostic to the containerization platform.

5.6 Implementing code introspection for other programming languages

Code introspection is currently only available for the Python kernel. It is, however, not difficult to implement this for other programming languages. Two parts need to be implemented to allow for introspection; in our implementation both are implemented as a single Python class. The first part is the analyser, which inspects source code in the language being added. The second part must emit source code in that language that sets the specified variables to the specified values.
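A hedged sketch of what such a pluggable interface could look like is given below; the class and method names are illustrative, not DisJotter's actual classes.

from abc import ABC, abstractmethod

class Introspector(ABC):
    """One implementation of this interface would be registered per kernel."""

    @abstractmethod
    def inspect(self, cell_source: str) -> list:
        """Analyse a cell and return the names of the variables it uses."""

    @abstractmethod
    def emit_assignments(self, variables: dict) -> str:
        """Return source code, in the kernel's language, that sets the given
        variables before the selected cell is executed."""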


CHAPTER 6

Conclusion

In this thesis we looked at improving the reusability of components in scientific notebooks. To accomplish this we first focused on how to encapsulate a component of a notebook. This encapsulation is facilitated by DisJotter, a Jupyter Notebook extension. To create this tool we analyzed a number of requirements so we could later measure the performance of our tool. We proposed an architecture for this tool based on the requirements, and implemented the architecture on top of the Jupyter environment. This tool allows the user to select and define components in a notebook and encapsulate them as a standalone web service. The encapsulation process uses the Docker containerization platform to build an image capable of running a service in a self-contained environment. To evaluate the implementation, we demonstrated the containerization of a sample notebook. We then used the generated image in a variety of ways, such as integrating the image into a pipeline and using the image to scale the execution of the original notebook.

The images our tool generates are findable. By creating a Docker image from a notebook, the user is able to publish their work easily. Registries also enable the uploader to describe their image, which third parties can use to search for images. Our tool does not directly provide any of these capabilities; they are, however, a consequence of bridging the gap between Jupyter Notebooks and the Docker ecosystem.

Secondly, they are accessible. All images generated by DisJotter are used in the same way. Once the user runs the container the predefined cell becomes available as a web route in the local environment. This means the execution of notebook components is standardized between different images created with our tool. The manner of executing the cell also uses a standard, namely HTTP. A missing feature in our tool that could help further improve accessibility would be a method of describing the variables that are available for introspection. Using the HTTP standard together with the fact that the generated service is fully self-contained by including all required dependencies means that the service is highly interoperable as well. Web services are usable in almost any research environment.

The effects of the three preceding principles all improve the reusability of notebooks, which is what we aimed to accomplish in this work. In applying these principles to notebooks we have bridged the gap between two phases in computational research: the iterative experimental workflow within literate scientific notebooks, and the scaled cloud-based research approach.


6.1 Future work

We have discussed a number of shortcomings of DisJotter that could be improved in future work. A useful addition to our tool would be code introspection features for other popular languages in data science such as R and Julia. Furthermore, there are a number of open issues on DisJotter's GitHub page that could be solved.

A larger addition to DisJotter would be a more generalized approach to creating a container out of the user’s code. Currently, we use very broad base Docker images to build services. Repo2docker is able to produce containers that only include dependencies that are actually in use by the contained code. This saves disk space and build time. Integrating Repo2docker into DisJotter could be worthwhile.

More metadata to describe the created image would be desirable as well. The inputs and outputs of the generated services could be described in a format such as the Common Workflow Language to integrate more smoothly into scientific workflows.


Bibliography

[1] Xiaochao Dong. Pigar - a fantastic tool to generate requirements.txt for your python project, and more than that. https://github.com/damnever/pigar, 2015.

[2] Jessica Forde, Tim Head, Chris Holdgraf, Yuvi Panda, Gladys Nalvarete, Benjamin Ragan-Kelley, and Erik Sundell. Reproducible research environments with repo2docker. 2018.

[3] Rafael C Jiménez, Mateusz Kuzak, Monther Alhamdoosh, Michelle Barker, Bérénice Batut, Mikael Borg, Salvador Capella-Gutierrez, Neil Chue Hong, Martin Cook, Manuel Corpas, et al. Four simple recommendations to encourage best practices in research software. F1000Research, 6, 2017.

[4] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B Hamrick, Jason Grout, Sylvain Corlay, et al. Jupyter Notebooks - a publishing format for reproducible computational workflows. In ELPUB, pages 87-90, 2016.

[5] Donald Ervin Knuth. Literate programming. The Computer Journal, 27(2):97-111, 1984.

[6] Jeffrey M Perkel. Why Jupyter is data scientists' computational notebook of choice. Nature, 563(7732):145-147, 2018.

[7] Joao Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. A large-scale study about quality and reproducibility of Jupyter notebooks. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 507-517. IEEE, 2019.

[8] Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, Mai H Nguyen, Sara Brin Rosenthal, Fernando Pérez, et al. Ten simple rules for writing and sharing computational analyses in Jupyter notebooks. PLoS Computational Biology, 15(7), 2019.

[9] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.
