
University of Groningen

Integration techniques for modern bioinformatics workflows

Kanterakis, Alexandros

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Kanterakis, A. (2018). Integration techniques for modern bioinformatics workflows. University of Groningen.

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.


Integration techniques for modern bioinformatics workflows


Alexandros Kanterakis. Integration techniques for modern bioinformatics workflows. Thesis, University of Groningen, with summary in English, Greek and Dutch.

The work in this thesis was financially supported by the University of Groningen Ubbo Emmius Fund, by BBMRI-NL, a research infrastructure financed by the Netherlands Organization for Scientific Research (NWO project 184.021.007), and by the European Union Seventh Framework Programme (FP7/2007-2013) research project BioSHaRE-EU (261433).

Printing of this thesis was financially supported by Rijksuniversiteit Groningen, University Medical Center Groningen, Groningen University Institute for Drug Exploration (GUIDE) and NWO VIDI grant number 917.164.455.

The front cover features artwork from artist Theo van Doesburg (30 August 1883 – 7 March 1931) titled “Tekening”. The layout is based on the “classicthesis” template, Copyright ©2012 André Miede http://www.miede.de.

Printed by Netzodruk Groningen.

©2018 Alexandros Kanterakis. All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means without permission of the author.


Integration techniques for modern bioinformatics workflows

PhD thesis

to obtain the degree of PhD at the University of Groningen

on the authority of the Rector Magnificus Prof. E. Sterken

and in accordance with

the decision by the College of Deans. This thesis will be defended in public on

Wednesday 11 July 2018 at 9:00 hours

by

Alexandros Kanterakis

born on 30 July 1978 in Moschato, Greece


Supervisors: Prof. M.A. Swertz, Prof. T.N. Wijmenga

Assessment committee: Prof. E.A. Valentijn, Prof. J. Heringa, Prof. H. Snieder


Paranymphs: Freerk van Dijk, Dimitris Gakis


Contents

1 Introduction and Outline
  1.1 Background
  1.2 Bioinformatics done right
  1.3 Integration as a service in genetics research
  1.4 Outline

2 Creating transparent and reproducible pipelines: Best practices for tools, data and workflow management systems
  2.1 Introduction
  2.2 Existing workflow environment
  2.3 What software should be part of a scientific workflow?
  2.4 Preparing data for automatic workflow analysis
  2.5 Quality criteria for modern workflow environments
  2.6 Benefits from integrated workflow analysis in Bioinformatics
  2.7 Discussion

3 Population-specific genotype imputations using minimac or IMPUTE2
  3.1 Introduction
  3.2 Methods
  3.3 Materials
  3.4 Procedure

4 Molgenis-impute: imputation pipeline in a box
  4.1 Background
  4.2 Methods
  4.3 Implementation
  4.4 Results and Discussion
  4.5 Supplementary information

5 PyPedia: using the wiki paradigm as crowd sourcing environment for bioinformatics protocols
  5.1 Introduction
  5.2 Implementation
  5.3 Results
  5.4 Discussion
  5.5 Conclusions
  5.6 Supplementary Information

6 MutationInfo: a tool to automatically infer chromosomal positions from dbSNP and HGVS genetic variants
  6.1 Introduction
  6.2 Existing tools for resolving HGVS position
  6.3 Methods
  6.4 PharmGKB as a testing platform
  6.5 Analysis of HGVS variants with gene names
  6.6 Conclusions

7 Discussion
  7.1 Final notes in Imputation
  7.2 Integration as a vehicle towards clinical genetics

Summary
Samenvatting
Περίληψη
Acknowledgments
List of publications
Curriculum vitae


1 Introduction and Outline

1.1 Background

Bioinformatics, the blend of computer science and biology, has played a dominant role in almost all major discoveries in genetics and, subsequently, in the life sciences over the last decade. The main reason for this lies in the very nature of modern genetics research. The latest technological advancements have allowed the massive profiling of genetic data. These data include gene expression, genotypes, DNA-sequencing, and RNA-sequencing, to name just a few. The quantity, complexity and heterogeneous nature of these data are such that their handling, storage and, most importantly, processing require advanced computational techniques. The need to develop these techniques is the reason why bioinformatics has become a vital part of modern genetics research.

Bioinformatics began as an experimental field that provided auxiliary help to geneticists, mainly for tackling algorithmic and organizational research tasks [22]. As genetics has become a more data-driven science, its reliance on bioinformatics has steadily increased. The level of this dependence is such that today we are experiencing two unprecedented events. The first is that although bioinformatics and genetics are tightly intertwined, with advances in one giving rise to progress in the other, today it is bioinformatics that is guiding the progress in genetics [15]. Of course, the opposite also happens, but on a smaller scale. The reason for this is that the computational challenges of genetic data are at the leading edge of computer science and technology in general [32]. It is not an exaggeration to claim that modern computer infrastructure is simply not powerful enough to qualitatively process the amount of genetic data being generated on a daily basis. This is mainly because the pace of progress in sequencing technology is far higher than the pace of progress in computer technology [33], [24]. Moreover, there are also considerations such as noise, quality assurance and data provenance.

The realization of these shortcomings has led to the second unprecedented event. Because of the extreme computational demands of many genetic experiments, there are now bioinformatics considerations in the formulation of genetic research hypotheses [15]. Thus, the complete intellectual process of shaping new ideas that might shed new light on the understanding of genetic regulation and disease mechanisms is now governed, and sometimes hindered, by bioinformatics limitations. For example, the files that contain the complete nucleotide sequence of the genome (Whole Genome Sequencing) of a single sample at an adequate quality level (usually 30X coverage, that is, the average number of times each nucleotide has been sequenced) take up approximately 350GB [21]. These files are enough to completely fill the hard drive of a commodity computer. Consequently, regardless of the low sequencing cost, a research team has to employ advanced data management techniques in order to test hypotheses that require the sequencing of even a modest number of samples [3]. These techniques may be more expensive than the sequencing itself. Another example can be seen in hypotheses that require either the integration of multi-omics datasets or the localization of multiple genetic factors that simultaneously play a role in complex diseases. In the first case, adding an extra layer of data that corresponds to a more complete picture of the disease increases the dimensionality of the problem [6]. In the second, the search space of possible combinations of multiple genetic factors is so large that even modern techniques require extremely long computational times [19] (assuming, of course, that adequate numbers of samples are available). In both cases, the existing statistical analysis methods lie at the leading edge of modern bioinformatics research, and the specifics of current implementations largely determine the formulation of currently testable hypotheses.
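To get a feeling for these orders of magnitude, the short back-of-envelope calculation below reproduces the approximate storage footprint of a single 30X whole genome. It is only a sketch: the genome size, bytes-per-base and compression factors are assumed typical values chosen for illustration, not figures taken from [21].

    # Back-of-envelope estimate of raw storage for one 30X whole genome.
    # All constants below are assumed "typical" values, for illustration only.
    GENOME_SIZE = 3.2e9        # haploid human genome length in bases (approx.)
    COVERAGE = 30              # average number of times each base is sequenced
    BYTES_PER_BASE_FASTQ = 2   # one byte for the base call, one for its quality score

    sequenced_bases = GENOME_SIZE * COVERAGE                  # ~9.6e10 bases
    raw_fastq_bytes = sequenced_bases * BYTES_PER_BASE_FASTQ  # ~192 GB uncompressed
    aligned_bam_bytes = raw_fastq_bytes * 0.5                 # assume ~2x compression in BAM

    total_gb = (raw_fastq_bytes + aligned_bam_bytes) / 1e9
    print(f"Approximate storage for one 30X genome: {total_gb:.0f} GB")
    # Prints roughly 290 GB, the same order of magnitude as the ~350 GB reported in the text.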

These are some of the main points that underline the significance of bioinformatics in genetics research and in healthcare in general. They also make it obvious that even small enhancements in the design of a bioinformatics pipeline can have an enormous impact on the progress of a genetic experiment [5]. So far, bioinformatics has managed to engineer solutions for extremely challenging problems at such a level that we can confidently state that the genomic revolution now lies in the past and that we are entering the post-genomic era. So despite this seemingly daunting present picture, we can envision that bioinformatics will continue to be at the forefront of scientific discovery in the life sciences. But for this to happen we first need to analyze the inner characteristics and the philosophy behind the development of current bioinformatics pipelines, locate points for improvement, and attempt to implement them.

1.2 Bioinformatics done right

1.2.1 Maturity model

If we attempt to divide the history of bioinformatics so far, we can see three distinct generations. These generations, briefly described in Table 1.1, were formed mainly by the computational needs of the research community over time. The first is the 'algorithmic' generation, in which special algorithms needed to be crafted for computational problems inherent in genetics. For example, one of the most demanding computational needs at that time was how to efficiently compare a genomic sequence with a database of available sequences. This gave rise to the Needleman–Wunsch algorithm (1970), which performed exact sequence alignment (also known as global alignment). This was followed by the introduction of the Smith–Waterman algorithm (1981), a variation of Needleman–Wunsch that performs local alignment, i.e. it aligns the best-matching sub-regions of the sequences rather than their full lengths. The Smith–Waterman algorithm also allowed for the computational detection of sequence variants (mainly insertions and deletions). It is interesting that the demanding time requirements of these algorithms gave rise to the prominent FASTA (1985) and BLAST (1990) algorithms, which are approximately 50 times faster, although they produce suboptimal results. Today these algorithms are available either as online web services or as programs implemented in low-level programming languages that make optimal utilization of the underlying computer architecture. These algorithms and their variations are the foundational components of modern Next Generation Sequencing techniques.

The second generation appeared from the need not so much to implement new algorithms but rather to manage large amounts of heterogeneous data [2]. The starting point of this generation was the Human Genome Project (HGP) [23], which lasted from 1990 until 2003 and delivered the first almost complete (92%) sequence of human DNA. At a cost of approximately 3 billion dollars, it remains to this day the largest scientific collaboration in biology. HGP managed to bring together 20 sequencing research facilities from 6 countries, although more than 200 labs from 18 different countries participated in the analysis. The successful outcome of HGP was also based on generating and establishing common analysis protocols and data formats. Additionally, HGP pioneered the establishment of initial public data release policies and brought forward ethical, legal and social issues of genetics. This was also the era in which large datasets started becoming public, mainly from large organizations (e.g. EMBL in Europe and NCBI in the United States). New notions appeared in bioinformatics, such as security, data sharing, data modeling, distributed computing and web services. HGP succeeded not only in filling one of the most eminent gaps in biology, that of the sequence of the human genome, but also in paving the way for other projects to continue its work.

Right after the end of HGP, the HapMap project [11] was started (in October 2002) with the goal of inferring the haplotype structure of the human genome. The haplotype structure contains information on the regions of the genome that tend to be co-inherited and in which recombination rarely happens. With this information, any genetic variant that is known (or suspected) to be involved in a disease can be associated with a complete region, thus offering many ways of investigating additional variants or even mechanisms of action. To accomplish this, the HapMap project genotyped 1184 individuals belonging to 11 different populations, investigating more than 3.1 million SNPs (Single Nucleotide Polymorphisms) [17]. The project released data in three phases (2005, 2007 and 2009) and, since the cost of sequencing technology at that time was prohibitive for population studies of this size, it mainly used genotyping techniques. The main difference between genotyping and sequencing is that genotyping is a technique to assess the combination of alleles that an individual has at a specific site, whereas sequencing reveals the complete nucleotide sequence of the individual's DNA. Despite being more affordable, the main drawback of genotyping is that it requires a priori knowledge of the location being genotyped. Therefore, as a technology, it has limited value for investigating novel variants.

Fortunately, by 2008, sequencing had advanced well enough to be used in a large population study. The 1000 Genomes Project [10] (1KGP) was launched at that time to investigate human genetic variation by employing mainly sequencing techniques. 1KGP was revolutionary in many aspects. It sequenced 1092 individuals from 26 different population regions, covering a significant part of the world. It practically tripled the number of known variants of the human genome (from 3.1 million to 10 million), extended the analysis of the haplotype structure, and provided insights into the evolutionary processes that guided the distribution of these variants. Additionally, it used several sequencing technologies (whole-genome, exon-targeted) on varying platforms and also included trios (mother-father-child). At its peak, 1KGP was sequencing 10 billion bases per day, which is more than 2 whole human genomes. After the completion of 1KGP (in 2012), biologists at last had a comprehensive catalogue of genetic variants for various populations, as well as a tested set of software, pipelines, benchmarks and good practices for conducting large-scale sequencing studies. In parallel, the completion of HGP sparked many investigations into the functional effect of these variants. Perhaps the most notable effort in this direction is the ENCODE project [9], an ongoing international study of the functional genomic elements of DNA. This project is building a comprehensive functional map of human DNA. For example, it annotates regions that are transcribed into RNA, regions that are mostly transcribed in particular cell types, and regions that are translated into certain types of proteins.
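Before moving on, and purely as an illustration of the kind of method that defined the first, 'algorithmic' generation described above, the following is a minimal sketch of the Needleman–Wunsch dynamic-programming recurrence for global alignment. It is a didactic toy with assumed unit match/mismatch/gap scores, far removed from the optimized implementations used in practice.

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        """Return the optimal global alignment score of sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):      # aligning a prefix of a against gaps only
            score[i][0] = i * gap
        for j in range(1, cols):      # aligning a prefix of b against gaps only
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag,                   # match or mismatch
                                  score[i - 1][j] + gap,  # gap in b
                                  score[i][j - 1] + gap)  # gap in a
        return score[-1][-1]

    print(needleman_wunsch("GATTACA", "GCATGCU"))   # score of a classic textbook example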

Apart from being milestone contributions to genetics, these projects also pioneered the open philosophy behind data publication and the willingness to receive feedback from the research community by publishing data and methods even at an early stage of the experiments. Nevertheless, the greatest contribution of these projects may be the huge momentum that they have created: many national organizations have designed experiments to investigate the genetic identity and diversity of their populations. Today (2017), there are more than 20 countries with national sequencing projects in various stages of completeness [16], [13]. These countries cover 29% of the world population (18% of which is accounted for by China) and, as this percentage grows, we can expect the genetic structure of nearly the whole human race to become known in the next few decades.

From these various projects, the Genome of the Netherlands [31] (GoNL) stands out. Although GoNL was a Dutch project that started in July 2009, only 1.5 years after the start of 1KGP, it represented a significant advance in population sequencing projects. Despite having approximately the same number of samples, it used better sequencing quality (∼12× coverage) than 1KGP (∼3× coverage). Additionally, GoNL's samples were composed only of related individuals: specifically, it contained 231 trios (mother, father, child), 8 quartets (mother, father, 2 siblings) with dizygotic twins, and 11 quartets with monozygotic twins. The relatedness information (i.e. kinship) is a valuable resource that helps to validate the sequencing information. For example, a nucleotide, say 'C', at a specific location in the child's genome should also be evident at the same location in at least one of the parents' genomes (except in the highly unlikely event of a de novo mutation). Most importantly, by exploiting the relatedness information we can deduce the haplotype structure of the population at superb quality. Finally, all the samples, drawn from four different Dutch regional biobanks, were pooled. These biobanks contained extensive phenotype information and additional biological samples for all the participating individuals. We can confidently state that if 1KGP demonstrated how to perform genetic studies on international populations, then GoNL demonstrated how to study the genetic structure on a national scale by applying high-quality and cost-effective methods.
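The kinship-based sanity check sketched above can be expressed in a few lines of code. The snippet below is illustrative only: genotypes are encoded as simple unordered allele pairs, and a real pipeline would additionally have to handle missing calls, phasing and genotyping error.

    def mendelian_consistent(child, mother, father):
        """Return True if each child allele could have come from a parent.

        Genotypes are unordered allele pairs, e.g. ('C', 'T').
        A de novo mutation (or a genotyping error) shows up as False.
        """
        child = tuple(sorted(child))
        # The child must be able to draw one allele from the mother and the
        # other from the father, in either order.
        return any(
            child[i] in mother and child[1 - i] in father
            for i in (0, 1)
        )

    # Child 'C/T', mother 'C/C', father 'T/T' -> consistent
    print(mendelian_consistent(('C', 'T'), ('C', 'C'), ('T', 'T')))   # True
    # Child 'A/T', mother 'C/C', father 'T/T' -> inconsistent
    print(mendelian_consistent(('A', 'T'), ('C', 'C'), ('T', 'T')))   # False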

It is interesting to look at the progression in the order of magnitude of included samples in some high-profile national projects. In March 2010 the UK10K project [18] was set up to sequence 10,000 individuals of UK descent, whereas in June 2017 the Precision Medicine Initiative project [1] in the United States started enrolling the first of 1,000,000 planned individuals.

The main 'product' of this second generation is the wealth of knowledge that these projects have created. From a bioinformatics point of view, this generation also produced an extremely rich set of open-access data, tools, pipelines and web services. Today, it is difficult to list even the main products of this generation, since they include files with identified genetic variants (e.g. VCF files from GoNL), simple portals (e.g. ArrayExpress, PubMed, dbSNP), genomic browsers (e.g. UCSC, ENSEMBL) and generic web platforms (e.g. MOLGENIS), to name just a few. A thorough discussion of the quantity and quality of existing data, tools and pipelines is given in Chapter 2.

We are currently experiencing the transition from the second generation to a new, third generation. As with the first two generations, there are new computational needs that cannot be fulfilled by the existing 'schools of thought'. The new necessities are open and reproducible protocols, methods and data. We also require more social and more 'extrovert' behavior from researchers at all stages of development (maybe even beforehand, from the fund-raising stage through crowd-funding [8]). Although these requirements sound more behavioral, they carry particular computational demands. These are: the adoption of more readable programming languages, a minimum dependence on proprietary software in favor of open source implementations, good data stewardship, and a smooth transition towards nationwide or international computer infrastructures, such as the Grid and the cloud.

Table 1.1: The three generations of bioinformatics and their main characteristics.

Algorithmic
– Build essential algorithms
– Introduce data mining, AI, machine learning, visualization
– Biostatistics
– Introduce terminology/nomenclatures
– Moderate computational needs
– Languages: C, C++, Fortran, Pascal

Data Management
– Web services
– Large/open datasets
– Build formats/models/ontologies
– Data exchange
– Introduce databases, software engineering to genetics
– Compute on local clusters
– Languages: Java, C#, Perl, R, DSL

Social
– Data provenance
– Open code/data/articles
– Reproducibility/Readability
– Social coding/Crowdsourcing
– Introduce Big Data to genetics
– Compute on Grid/Cloud
– Languages: Python, Ruby, GO, Scala


1.2.2 Moving forward

Although bioinformatics has matured and many lessons have been learned during these generations [14], there are still many points to improve. We discuss four points below.

As a first consideration, today we have repositories that host or describe thousands of tools and databases in the bioinformatics domain. Additionally, there are more than 100 meta-tools to ease the integration and inter-connection of these tools in complex analysis pipelines. Yet, in most cases, this bridging requires extensive technical knowledge and can be carried out only by skilled IT (Information Technology) personnel [34]. Unless this bridging becomes easy enough for a geneticist or biologist with basic IT knowledge to implement, this vibrant landscape will become fragmented and hostile towards innovative and ground-breaking ideas. There is therefore a prominent need for technologies that bring the very people who generate the research hypotheses into the foreground, rather than leave them dependent on an unfamiliar technical domain.

Second, today we can be assured that open source is finally the norm for code distribution. Nevertheless, open source does not automatically make the research community favor a certain tool. Apart from easy access, an open source tool is not guaranteed to have readable source code, tests, documentation, or a team or community able to provide support [36]. Moreover, the development plans and roadmap of the software are not always known, leading, for example, to high-quality implementations that are abandoned and forgotten, or to good ideas with great potential that the community hesitates to support for fear of future project discontinuation.

Third, the main criterion for career advancement in fields like bioinformatics is, right now, ultimately the number of papers published and the citations they receive. The existing scientific publishing status quo has limited interest in the open practices of the published solutions. Besides the obvious consequences [35], without this incentive researchers perceive the adoption of open policies as an extra burden, even though this adoption can be highly beneficial for the source code [4], the papers written [26] and the research data [30], [27].

A fourth and final point has to do with infrastructure. Developers often think locally when they implement their methods, tending to build solutions for the architecture of their laboratory or for the version of the problem that their fellow geneticists have formulated. Adhering to open practices also includes developing generic solutions in common environments. This is one of the reasons why developing for the cloud or for a nationwide grid infrastructure should be a priority for modern bioinformaticians.

By overcoming these shortcomings, bioinformatics can fully enter the emergent third generation. Moreover, these open challenges should be further polished, crystallized and consolidated into a set of principles that, at least for this thesis, can be called 'bioinformatics done right'. These are the main principles that guide the creation of the methods presented in this thesis.

1.3 Integration as a service in genetics research

Integration is the key to overcoming the presented shortcomings and to reaping all the potential benefits of the emerging third generation of bioinformatics. In this context, integration can be defined as the process of efficiently bridging together tools, data, research communities, and different computation resources for the purpose of conducting innovative science. The European Bioinformatics Institute (EBI) set two major goals for 2016: the first was data growth and the second was integration [12]. Data growth is a sine qua non objective for any major bioinformatics institute and relies mostly on the adoption of modern sequencing (and other mass profiling) technologies. In contrast, integration entails some additional and intricate challenges.

In general, the major purpose of integration is to provide solutions that offer high performance, efficiency and usability as a result of a sophisticated combination of existing components. Put differently, the purpose is to prove that "the whole is greater than the sum of its parts". This realization has also been pinpointed as the anticipated philosophy that should govern most future studies [7], [37]. Apart from the issues presented in section 1.2.2, the proper integration of existing methods can help tremendously in overcoming problems such as limited tool discoverability, the merging of heterogeneous data and the replication crisis. Therefore, one crucial question in this regard is how (1) modern tool developers, (2) data curators, and (3) Workflow Management System engineers can alter their practices in order to augment the connectivity and share-ability of their resources. To achieve this, the bioinformatics community should agree on adhering to some basic and easily applicable development guidelines and good practices for each of these groups. To do so, we first need to simply observe and note some of the common pitfalls and omissions that impede the process of integration. Subsequently, we need to distill these observations into clear and easy-to-follow guidelines, and we should also underline how and to what extent the bioinformatics community can benefit by adopting them. This is perhaps the most important challenge: it basically lays the groundwork on which the subsequent challenges have to be addressed.

One of these challenges is to properly and sufficiently communicate all the minor and major information that is essential for the seamless connection of the different tools and data that comprise a bioinformatics pipeline. This is the Documentation challenge. The effort becomes more demanding if the pipeline contains complicated, error-prone or resource-intensive components. The description has to cover all the technical instructions, including troubleshooting directions, system requirements and methods to measure the adequacy of the input and the quality of the results. Moreover, the instructions should be clear enough to cover a target audience with a wide range of IT and genetics literacy. An example of a pipeline of this kind is genotype imputation [29], which is the process of stochastically enriching the set of markers of a genotype experiment with additional markers from a more densely sequenced or genotyped population panel.

A subsequent challenge is Wrapping, or how to offer 'one-stop' solutions for bioinformatics pipelines that have been sufficiently described in prior studies. The objective in this case is to provide a single, open-source tool that wraps all the software, installation scripts, and system configuration commands. This tool should undertake the tasks of setting up the environment, installing the software, applying quality control to the input data, optimally splitting the processing into executable jobs, submitting the jobs to a wide selection of high-performance computing (HPC) environments, and finally assessing the quality of the results. The design philosophy of this tool should be user friendly, highly customizable, documented and easily embeddable in an external pipeline for upstream and downstream analysis.
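A minimal outline of what such a 'one-stop' wrapper could look like is given below. It is a sketch only: the stage functions and command-line options are hypothetical placeholders invented for this illustration, not the actual interface of MOLGENIS-compute or of the imputation pipeline presented later in this thesis.

    #!/usr/bin/env python
    """Illustrative skeleton of a 'one-stop' pipeline wrapper (hypothetical interface)."""
    import argparse
    import sys

    def setup_environment(workdir):      # install tools, download reference panels, ...
        print(f"Preparing working directory {workdir}", file=sys.stderr)

    def quality_control(study):          # e.g. check strands, allele frequencies, missingness
        print(f"Running QC on {study}", file=sys.stderr)

    def split_into_jobs(study, chunk_size):
        # Placeholder: a real implementation would chunk by genomic region of size chunk_size.
        return [f"{study}:chunk{i}" for i in range(3)]

    def submit(jobs, backend):           # local, PBS/TORQUE, SLURM, grid, cloud, ...
        for job in jobs:
            print(f"Submitting {job} to {backend}", file=sys.stderr)

    def main():
        parser = argparse.ArgumentParser(description="Hypothetical imputation wrapper")
        parser.add_argument("--study", required=True, help="input genotype study")
        parser.add_argument("--workdir", default="./run", help="working directory")
        parser.add_argument("--backend", default="local",
                            choices=["local", "pbs", "slurm", "grid"],
                            help="execution environment for the generated jobs")
        args = parser.parse_args()

        setup_environment(args.workdir)
        quality_control(args.study)
        jobs = split_into_jobs(args.study, chunk_size=5_000_000)
        submit(jobs, args.backend)

    if __name__ == "__main__":
        main()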

An additional consideration is Collaboration, which covers user involvement and engagement. The challenge in this case is to offer an online environment that promotes user collaboration and provides incentives for a diverse community of bioinformaticians to participate in creating, editing, documenting and testing bioinformatics pipelines. A pioneering method to achieve this objective is 'crowdsourcing' [20, 25]. Crowdsourcing is the process in which an online, loosely-coupled community collaborates in order to achieve a cumulative result, with little or no central moderation. User participation takes the form of 'edits'. Each edit can be anything from a minor typographic correction or a simple reference addition to a major contribution that immensely improves the quality of the result. Perhaps the most successful example of the crowdsourcing concept is Wikipedia, which contains, on average, high quality encyclopedic articles for almost all areas of interest. Therefore, an open question is whether crowdsourcing can also be used for the purpose of creating high quality bioinformatics computational solutions. A crowdsourced bioinformatics resource environment could concurrently serve the purposes of a repository, a content management system, a workflow management system and an execution environment, and also act as a social experiment to test how the bioinformatics community can collaborate under the 'wiki' philosophy [28].

A common issue that usually arises during the process of building a bioinformatics pipeline is that there is an abundance of available tools for a given task, all of which produce good results for some part of the input data, but none of which is able to process all the possible variations of the input. This problem is usually coupled with the fact that the input is not well defined and that the community rarely follows formal directions or even good practices when generating these data. I call this the Composition challenge. A pipeline that attempts to tackle a problem of this category has to apply strict quality checks but also has to be permissive towards erroneous input. Additionally, it should treat the component tools as an inventory and use the correct combination of tools for different kinds of input. Therefore, one challenge is to overcome the existing perception of pipelines as simple and rigid input/output connections between different tools. On the contrary, pipelines should contain inherent logic (sometimes even fuzzy logic) that selects the optimal tool combination and compensates for incomplete tool implementations or even erroneous input data.
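A sketch of this 'tool inventory' idea is shown below. The input classifier and the tool names are hypothetical and exist only for illustration; the point is that the pipeline inspects each input record and routes it to whichever ordered combination of tools is known to handle that flavour of input, skipping (but logging) records that no tool can resolve.

    # Hypothetical example of composition logic: pick tools based on input properties.
    def classify(record):
        """Very rough classifier for an input variant description (illustrative)."""
        if record.startswith("rs"):
            return "dbsnp_id"
        if ":c." in record or ":g." in record:
            return "hgvs"
        return "unknown"

    # An "inventory" mapping each input flavour to an ordered list of tools to try.
    INVENTORY = {
        "dbsnp_id": ["lookup_local_cache", "query_remote_db"],
        "hgvs":     ["strict_parser", "lenient_parser", "sequence_alignment_fallback"],
        "unknown":  ["lenient_parser"],
    }

    def process(record, tools):
        for tool_name in INVENTORY[classify(record)]:
            result = tools[tool_name](record)   # each tool returns a result or None
            if result is not None:
                return result                   # first tool that succeeds wins
        return None                             # permissive: log and skip, do not crash

    # tools = {"strict_parser": ..., "lenient_parser": ..., ...}  # supplied elsewhere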

In Figure 1.1, I present a visualization of these challenges as pieces of a puzzle. By assembling this puzzle we can transform a simple analysis workflow into an active component that is fully immersed in the upcoming third generation of bioinformatics. In this Ph.D. thesis, I present some well-grounded approaches to these challenges that are being used to tackle various problems in genetics. Of course, since this generation is currently taking shape, there may be additional considerations that I am unaware of; I therefore expect future studies to complement my work by identifying and describing them.


Figure 1.1: The main Integration techniques presented in this Ph.D. thesis. At the top there is a typical view of an analysis workflow (Data → Tool → Results). Making this workflow as integrate-able as possible for the wider bioinformatics community entails four basic challenges. These are the availability of proper Documentation, the inclusion of methods that Wrap the workflow in an embeddable one-step solution, the smart Composition of available tools that handles incomplete implementations or erroneous data, and the adoption of receptive policies that enhance user Collaboration.

1.4 Outline

Below I outline the chapters in this thesis, the particular research questions addressed, and how the chapters are related.

In Chapter 2, I present a set of practical and easy-to-follow best practices and guidelines for tool developers, data curators and Workflow Management System engineers. These guidelines can help these groups to significantly augment the re-usability, connectivity and user friendliness of their Research Objects. This chapter also includes a discussion of the expected benefits that the wider biomedical community can gain by following these guidelines.

In Chapter 3, I propose a modern bioinformatics protocol for the imputation of genetic data. I discuss all the practical considerations and present good practices that should be applied when performing this task.

In Chapter 4, I present a bioinformatics implementation of the good practices of Chapter 3 in a new imputation pipeline. The pipeline has been designed to require minimal effort from the user to install and to submit to a High Performance Computing (HPC) environment. It is also open and easily adapted to external workflows and HPCs. The solution uses Molgenis-compute as a pipeline management system.

In Chapter 5, I discuss a perspective on the characteristics and the problems of modern bioinformatics pipelines. Specifically, I concentrate on the reproducibility and openness of published methods and I suggest that crowdsourcing can improve existing solutions. I propose adopting PyPedia, a wiki platform for the development of bioinformatics pipelines in the Python language. To demonstrate its use, I have implemented parts of the imputation pipeline in PyPedia.

In Chapter 6, I address a problem that is more often seen in the area of clinical genetics. The majority of newly discovered mutations are presented in the scientific literature in one of the oldest and most widely used nomenclatures in genetics, called HGVS (Human Genome Variation Society). Although the main purpose of HGVS is the unambiguous and concise reporting of mutations, many authors disregard the official reporting guidelines. This renders some of the reported mutations ambiguous and hinders their location and validation in existing NGS studies. To remedy this problem I implemented MutationInfo, which combines 11 different tools in an exhaustive pipeline in order to locate the chromosomal position of an HGVS mutation with the highest possible confidence.


Bibliography

[1] Euan A Ashley. The precision medicine initiative: a new national effort. JAMA, 313(21):2119–2120, 2015.

[2] TK Attwood, A Gisel, Nils-Einar Eriksson, and Erik Bongcam-Rudloff. Concepts, historical milestones and the central place of bioinformatics in modern biology: a European perspective. In Bioinformatics - Trends and Methodologies. InTech, 2011.

[3] Riyue Bao, Lei Huang, Jorge Andrade, Wei Tan, Warren A Kibbe, Hongmei Jiang, and Gang Feng. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Informatics, pages 67–83, 2014.

[4] Nick Barnes. Publish your computer code: it is good enough. Nature, 467(7317):753, 2010.

[5] Bonnie Berger, Jian Peng, and Mona Singh. Computational solutions for omics data. Nature Reviews Genetics, 14(5):333–346, 2013.

[6] Matteo Bersanelli, Ettore Mosca, Daniel Remondini, Enrico Giampieri, Claudia Sala, Gastone Castellani, and Luciano Milanesi. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics, 17(Suppl 2):15, 2016.

[7] Joerg Martin Buescher and Edward M. Driggers. Integration of omics: more than the sum of its parts. Cancer & Metabolism, 4(1):4, 2016. ISSN 2049-3002. doi: 10.1186/s40170-016-0143-y. URL http://dx.doi.org/10.1186/s40170-016-0143-y.

[8] Pamela Cameron, David W Corne, Christopher E Mason, and Jeffrey Rosenfeld. Crowdfunding genomics and bioinformatics. Genome Biology, 14(9):134, sep 2013. ISSN 1465-6906. doi: 10.1186/gb-2013-14-9-134. URL http://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-9-134.

[9] ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, jun 2007. ISSN 1476-4687. doi: 10.1038/nature05874. URL http://www.ncbi.nlm.nih.gov/pubmed/17571346, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2212820.

[10] The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68–74, sep 2015. ISSN 0028-0836. doi: 10.1038/nature15393. URL http://www.ncbi.nlm.nih.gov/pubmed/26432245.

[11] The International HapMap Consortium. The International HapMap Project. Nature, 426(6968):789–796, dec 2003. ISSN 0028-0836. doi: 10.1038/nature02168. URL http://www.nature.com/articles/nature02168.

[12] Charles E. Cook, Mary Todd Bergman, Robert D. Finn, Guy Cochrane, Ewan Birney, and Rolf Apweiler. The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Research, 44(D1):D20, 2016. doi: 10.1093/nar/gkv1352. URL http://dx.doi.org/10.1093/nar/gkv1352.

[13] Manuel Corpas. The national genome project race, May 2017. URL https://personalgenomics.zone/2017/05/23/the-national-genome-project-race/.

[14] Manuel Corpas, Segun Fatumo, and Reinhard Schneider. How not to be a bioinformatician. Source Code for Biology and Medicine, 7(1):1, 2012.

[15] M. Scudellari. Data Deluge. The Scientist: Large-scale data collection and analysis have fundamentally altered the process and mind-set of biological research. http://www.thescientist.com/?articles.view/articleNo/31212/title/Data-Deluge/, 2011.

[16] Talitha Dubow and Sonja Marjanovic. Population-scale sequencing and the future of genomic medicine. RAND Corporation, 2016. URL http://www.rand.org/pubs/research_reports/RR1520.html.

[17] International HapMap 3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311):52–58, 2010.

[18] UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature, 526(7571):82–90, 2015.

[19] Damian Gola, Jestinah M Mahachie John, Kristel Van Steen, and Inke R König. A roadmap to multifactor dimensionality reduction methods. Briefings in Bioinformatics, 17(2):293–308, 2016.

[20] Benjamin M. Good and Andrew I. Su. Crowdsourcing for bioinformatics. Bioinformatics, 29(16):1925, 2013. doi: 10.1093/bioinformatics/btt333. URL http://dx.doi.org/10.1093/bioinformatics/btt333.

[21] Karen Y He, Dongliang Ge, and Max M He. Big data analytics for genomic medicine. International Journal of Molecular Sciences, 18(2):412, 2017.

[22] Paulien Hogeweg. The roots of bioinformatics in theoretical biology. PLoS Computational Biology, 7(3):e1002021, 2011.

[23] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, oct 2004. ISSN 0028-0836. doi: 10.1038/nature03001. URL http://www.nature.com/doifinder/10.1038/nature03001.

[24] K. Wetterstrand. The Cost of Sequencing a Human Genome. http://www.thescientist.com/?articles.view/articleNo/31212/title/Data-Deluge/, 2016.

[25] Ritu Khare, Benjamin M. Good, Robert Leaman, Andrew I. Su, and Zhiyong Lu. Crowdsourcing in biomedicine: challenges and opportunities. Briefings in Bioinformatics, 17(1):23, 2016. doi: 10.1093/bib/bbv021. URL http://dx.doi.org/10.1093/bib/bbv021.

[26] Mikael Laakso and Bo-Christer Björk. Anatomy of open access publishing: a study of longitudinal development and internal structure. BMC Medicine, 10(1):124, dec 2012. ISSN 1741-7015. doi: 10.1186/1741-7015-10-124. URL http://bmcmedicine.biomedcentral.com/articles/10.1186/1741-7015-10-124.

[27] Morgan GI Langille and Jonathan A Eisen. BioTorrents: a file sharing service for scientific data. PLoS One, 5(4):e10071, 2010.

[28] Panagiotis Louridas. Using wikis in software development. IEEE Software, 23(2):88–91, 2006.

[29] Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39(7):906–913, 2007.

[30] Jennifer C. Molloy. The Open Knowledge Foundation: Open Data Means Better Science. PLoS Biology, 9(12):e1001195, dec 2011. ISSN 1545-7885. doi: 10.1371/journal.pbio.1001195. URL http://dx.plos.org/10.1371/journal.pbio.1001195.

[31] Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics, 46(8):818–825, aug 2014. ISSN 1061-4036. doi: 10.1038/ng.3021. URL http://www.nature.com/articles/ng.3021.

[32] Mike Orcutt. Bases to Bytes. Cheap sequencing technology is flooding the world with genomic data. Can we handle the deluge? http://www.technologyreview.com/graphiti/427720/bases-to-bytes/, 2012.

[33] Rob Carlson. Planning for Toy Story and Synthetic Biology: It's All About Competition (Updated), synthesis, 2018. URL http://bit.ly/2BhJ8PG.

[34] Allegra Via, Thomas Blicher, Erik Bongcam-Rudloff, Michelle D Brazas, Cath Brooksbank, Aidan Budd, Javier De Las Rivas, Jacqueline Dreyer, Pedro L Fernandes, Celia Van Gelder, et al. Best practices in bioinformatics training for life scientists. Briefings in Bioinformatics, page bbt043, 2013.

[35] Timothy H Vines, Arianne YK Albert, Rose L Andrew, Florence Débarre, Dan G Bock, Michelle T Franklin, Kimberly J Gilbert, Jean-Sébastien Moore, Sébastien Renaut, and Diana J Rennison. The availability of research data declines rapidly with article age. Current Biology, 24(1):94–97, 2014.

[36] Greg Wilson, DA Aruliah, C Titus Brown, Neil P Chue Hong, Matt Davis, Richard T Guy, Steven HD Haddock, Kathryn D Huff, Ian M Mitchell, Mark D Plumbley, et al. Best practices for scientific computing. PLoS Biology, 12(1):e1001745, 2014.

[37] Qing Yan. Translational Bioinformatics and Systems Biology Approaches for Personalized Medicine. In Methods in Molecular Biology (Clifton, N.J.), volume 662, pages 167–178, 2010. doi: 10.1007/978-1-60761-800-3_8. URL http://www.ncbi.nlm.nih.gov/pubmed/20824471, http://link.springer.com/10.1007/978-1-60761-800-3_8.


2 Creating transparent and reproducible pipelines: Best practices for tools, data and workflow management systems

Alexandros Kanterakis1, George Potamias2, Morris A. Swertz1

1 Genomics Coordination Center, Department of Genetics, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands.

2 Institute of Computer Science, Foundation for Research and Technology Hellas (FORTH), Heraklion, Greece.

Submitted as a chapter to Human Genome Informatics: Translating Genes Into Health. Editors: Darrol Baker, Christophe Lambert and George Patrinos. Publication: August 2018. Publisher: Elsevier. https://www.elsevier.com/books/human-genome-informatics/baker/978-0-12-809414-3


Abstract

Today, the practice of properly sharing the source code, analysis pipelines and protocols of published studies has become commonplace in bioinformatics. Additionally, there is a plethora of technically mature Workflow Management Systems (WMS) that offer simple and user-friendly environments where users can submit tools and build transparent, shareable and reproducible pipelines. Arguably, the adoption of open science policies and the availability of efficient WMSs constitute major progress towards battling the replication crisis, advancing research dissemination and creating new collaborations. Yet today we still see that it is very difficult to include a large range of tools in a scientific pipeline, while certain technical and design choices of modern WMSs discourage users from doing just this. Here we present three sets of easily applicable "best practices" targeting (i) bioinformatics tool developers, (ii) data curators and (iii) WMS engineers, respectively. These practices aim to make it easier to add tools to a pipeline, to make it easier to directly process data, and to make WMSs widely hospitable to any external tool or pipeline. We also show how following these guidelines can directly benefit the research community.

2.1 Introduction

Today, publishing the source code, data and other implementation details of a research project serves two basic purposes. The first is to allow the community to scrutinize, validate and confirm the correctness and soundness of a research methodology. The second is to allow the community to properly apply the methodology to novel data, or to adjust it to test new research hypotheses. This is a natural process that pertains to practically every new invention or tool and can be reduced to the over-simplistic sequence: create a tool, test the tool, use the tool. Yet it is surprising that in bioinformatics this natural sequence was not standard practice until the 1990s, when specific groups and initiatives like BOSC [63] started advocating its use. Fortunately, today we can be assured that this process has become common practice, although there are still grounds for improvement. Many researchers state that, ideally, the 'materials and methods' part of a scientific report should be an object that encapsulates the complete methodology, is directly available and executable, and accompanies (rather than supplements) any biological investigation [69]. This highlights the need for another set of tools that automate the sharing, replication and validation of the results and conclusions of published studies. Additionally, there is a need to easily include external tools and data [58] and to be able to generate more complex or integrated research pipelines. This new family of tools is referred to as Workflow Management Systems (WMS) or Data Workflow Systems [79].

The tight integration of open source policies and WMSs is a highly anticipated milestone that promises to advance many aspects of bioinformatics. In section 2.6 we present some of these aspects, with fighting the replication crisis and advancing healthcare through clinical genetics being the most prominent. One critical question is where we stand now on the way to making this milestone a reality. Estimating the number of tools that are part of reproducible analysis pipelines is not a trivial task. Galaxy [55] has a publicly available repository of tools called the 'toolshed'1, which lists 3,356 different tools. myExperiment [56] is a social website where researchers can share Research Objects such as scientific workflows; it contains approximately 2,700 workflows. For comparison, bio.tools [74], a community-based effort to list and document bioinformatics resources, lists 6,842 items2. Bioinformatics.ca [16] curates a list of 1,548 tools and 641 databases, OMICtools [68] contains approximately 5,000 tools and databases, and the Nucleic Acids Research journal curates a list of 1,700 databases [53]. Finally, it is estimated that there are approximately 100 different WMSs3 with very diverse design principles.

Despite the plethora of available repositories, tools, databases and WMSs, the task of combining multiple tools in a research pipeline is still considered a cumbersome procedure that requires above-average IT skills [86]. This task becomes even more difficult when the aim is to combine multiple existing pipelines, to use more than one WMS, or to submit the analysis to a highly customized High Performance Computing (HPC) environment. Since progress and innovation in bioinformatics lie to a great extent in the correct combination of existing solutions, we should expect significant progress in the automation of pipeline building in the future [111]. In the meantime, today's developers of bioinformatics tools and services can follow certain software development guidelines that will tremendously help future researchers to build complex pipelines with these tools. Additionally, data providers and curators can follow clear directions in order to improve the availability, re-usability and semantic description of these resources. Finally, WMS engineers should follow certain guidelines that can augment the inclusiveness, expressiveness and user-friendliness of these environments. Here we present a set of easy-to-follow guidelines for each of these groups.

1 Galaxy Tool Shed: https://toolshed.g2.bx.psu.edu/
2 https://bio.tools/stats


2.2 Existing workflow environment

Thorough reviews on existing workflow environments in bioinformatics can be found in many relevant studies [86], [79], [130]. Nevertheless, it is worthwhile to take a brief look at some of the most well-known environments and frequently used techniques.

The most prominent workflow environment, and perhaps the only success story in the area, with more than a decade of continuous development and use, is Galaxy [55]. Galaxy has managed to build a lively community that includes biologists and IT experts. It also acts as an analysis frontend in many research institutes. Other features that have contributed significantly to Galaxy's success are: (1) the capability to build independent repositories of workflows and tools and to include these from one installation in another, (2) a very basic and simple wrapping mechanism for ambiguous tools, (3) native support for a large set of High Performance Computing (HPC) environments (TORQUE, cloud, grid) [82], and (4) a simple, interactive web-based environment for graph-like workflow creation. Despite Galaxy's success, it still lacks many of the quality criteria that we will present later. Perhaps the most important is the final criterion, which discusses enabling users to collaborate in analyses, share, discuss and rate results, exchange ideas, and co-author scientific reports.

The second most well-known environment is perhaps TAVERNA [150]. TAVERNA is built around the integration of many web services and has seen limited adoption in the area of genetics. The reason for this is that it is a more complex framework, it is not a web environment, and it forces users to adhere to specific standards (e.g. BioMoby, WSDL). We believe that the reasons why TAVERNA is lagging behind GALAXY, despite having a professional development team and extensive funding, should be studied more deeply in order to generate valuable lessons for future reference.

In this thesis we use a new workflow environment, MOLGENIS-compute [21], [19], [20]. This environment offers tool integration and a workflow description that are even simpler than those of GALAXY, and it is built around MOLGENIS, another successful tool used in the field [131]. Moreover, it comes with 'batteries included': tools and workflows for genotype imputation and RNA-sequencing analysis.

Other environments with promising features include Omics Pipe [50] and EDGE [89] for next generation sequencing data, and Chipster [77] for microarray data. All these environments claim to be decentralized and community-based although their wide adoption by the community still needs to be proven.

Besides integrated environments, it is worth noting some workflow solutions at the level of programming languages, for example Python packages like Luigi and bcbio-nextgen [61], and Java packages like bpipe [118]. Other solutions are Snakemake4, Nextflow5 and BigDataScript [29], which are novel programming languages dedicated solely to scientific pipelines6. Describing a workflow in these packages gives researchers the ability to execute their analyses easily in a plethora of environments, but unfortunately the packages are targeted at skilled IT users.
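As a small illustration of this programming-language-level approach, the following is a toy Luigi task written for this text (it is not taken from the documentation of any of the packages mentioned above). Each task declares its output, and the scheduler only runs tasks whose outputs are missing, which is how dependencies between pipeline steps are resolved.

    import luigi

    class CountVariants(luigi.Task):
        """Toy task: count the non-header lines of a VCF-like text file."""
        vcf = luigi.Parameter()                      # path to the input file

        def output(self):
            return luigi.LocalTarget(self.vcf + ".count")

        def run(self):
            with open(self.vcf) as fin, self.output().open("w") as fout:
                n = sum(1 for line in fin if not line.startswith("#"))
                fout.write(f"{n}\n")

    if __name__ == "__main__":
        # e.g. python count_variants.py CountVariants --vcf study.vcf --local-scheduler
        luigi.run()

A similar declarative, dependency-driven style, with different syntax, underlies tools such as Snakemake and Nextflow.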

Finally, existing languages for workflow description and exchange are YAWL [140], the Common Workflow Language7 and the Workflow Description Language8. Support for one or more workflow languages is essential for an environment and comprises one of its most important quality criteria.

2.3 What software should be part of a scientific workflow?

Even in the early phases of development of a bioinformatics tool, one should take into consideration that at some point it will become part of a yet-to-be-designed workflow. Since the needs of that workflow are unknown at that moment, all the design principles used for the tool should focus on making it easy to configure, adapt and extend. This is on top of the other quality criteria that bioinformatics methods should possess [149], [87]. After combining tens of different tools for the scope of this thesis, we can present a checklist of quality criteria for modern bioinformatics tools, covering coding, quality checks, project organization and documentation.

• On coding:

– Is the code readable, or else written in an extrovert way, assuming that there is a community ready to read, correct and extend it?

– Can people unfamiliar with the project's codebase get a good grasp of the architecture and module organization of the project?

– Is the code written according to the idiom used by the majority of programmers of the chosen language (e.g. camel case for Java or underscores for Python)?

– Is the code style (i.e. formatting, indentation, comment style) consistent throughout the project?

4 Similar to the 'make' tool, with a targeted repository of workflows: https://bitbucket.org/snakemake/snakemake/wiki/Home
5 http://www.nextflow.io/
6 For more: https://www.biostars.org/p/91301/
7 https://github.com/common-workflow-language/common-workflow-language
8 https://github.com/broadinstitute/wdl


• On quality checks:

– Do you provide test cases and unit tests that cover the entirety of the code base?

– When you correct bugs, do you make the previously failing input a new test case?

– Do you use assertions?

– Does the test data come from "a real world problem"?

– Does your tool generate logs that can easily trace errors to input parameters, input data, the user's actions or the implementation?

• On project organization:

– Is there a build tool that automates compilation, setup and installation?

– Does the build tool check for necessary dependent libraries and software?

– Have you tested the build tool in some basic, commonly used environments?

– Is the code hosted in an open repository (e.g. GitHub, Bitbucket, GitLab)?

– Do you make incremental changes, with sufficient commit messages?

– Do you "re-invent the wheel"? In other words, is any part of the project already implemented in a mature and easily embeddable way that you do not use?

• On documentation:

– Do you describe the tool sufficiently well?

– Is some part of the text targeted at novice users?

– Do you provide tutorials or step-by-step guides?

– Do you document interfaces, basic logic and project structure?

– Do you document installation, setup and configuration?

– Do you document memory, CPU, disk space or bandwidth requirements?

– Do you provide execution times for average use?

– Is the documentation self-generated (e.g. with Sphinx)?

– Do you provide a means of giving feedback or contacting the main developers?


• Having a user interface is always a nice choice, but does the tool also support command-line execution? Command-line tools are far more easily adapted to pipelines.

• Recommendations for command-line tools in bioinformatics include [125]: always have help screens, use stdout, stdin and stderr when applicable, check for as many errors as possible and raise exceptions when they occur, validate parameters, and do not hard-code paths (see the sketch after this list).

• Create a script (e.g. with BASH) or a program that takes care of the installation of the tool in a number of known operating systems.

• Adopt open and widely used standards and formats for input and output data.

• If the tool will most likely sit at the end of an analysis pipeline and will create a large list of findings (e.g. thousands of variants) that require human inspection, consider generating an online database view. Excellent choices for this are MOLGENIS [131] and BioMart. These systems are easily installed and populated with data, and they allow other researchers to explore the data through intuitive user interfaces.

• Finally, choose light web interfaces rather than local GUIs. Web interfaces allow easy automatic interaction, in contrast to local GUIs. Each tool can include a modular web interface that can in turn become a component of a larger website. Web frameworks like Django offer very simple solutions for this. An example of an integrated environment in Django is given by Cobb [30]. Tools that include web interfaces are Mutalyzer [148] and MutationInfo9.
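The sketch below illustrates the command-line recommendations from the list above (a help screen, use of stdin/stdout/stderr, parameter validation, explicit exit codes and no hard-coded paths). It is a generic template written for this text, not code taken from [125] or from any of the tools mentioned.

    #!/usr/bin/env python
    """Filter lines by a minimum quality score (template for a well-behaved CLI tool)."""
    import argparse
    import sys

    def main():
        # argparse provides a --help screen for free and validates parameter types.
        parser = argparse.ArgumentParser(description=__doc__)
        parser.add_argument("--min-quality", type=float, required=True,
                            help="discard records with quality below this value")
        parser.add_argument("--input", default="-",
                            help="input file, or '-' for stdin (no hard-coded paths)")
        args = parser.parse_args()

        if args.min_quality < 0:
            print("error: --min-quality must be non-negative", file=sys.stderr)
            return 2                                   # non-zero exit code on bad input

        stream = sys.stdin if args.input == "-" else open(args.input)
        try:
            for line in stream:                        # results go to stdout ...
                fields = line.rstrip("\n").split("\t")
                if float(fields[-1]) >= args.min_quality:
                    print(line, end="")
        except (ValueError, IndexError) as exc:        # ... diagnostics go to stderr
            print(f"error: malformed input line: {exc}", file=sys.stderr)
            return 1
        finally:
            if stream is not sys.stdin:
                stream.close()
        return 0

    if __name__ == "__main__":
        sys.exit(main())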

Before checking these lists, a novice bioinformatician might wonder what minimum IT knowledge is required in the area of bioinformatics. Apart from basic abstract notions of computer programming, newcomers should gain experience in BASH scripting, the Linux operating system and modern interpreted languages like Python [110]. They should also get accustomed to working with basic software development techniques [100] and with tools that promote openness and reproducibility [73], [90], like GIT [114].

These guidelines are targeted not only at novice but also at experienced users. The Biostar project was formed to counteract the complexity, bad quality, poor documentation and organization of bioinformatics software, which was discouraging potential new contributors. The purpose of this project was to create “low barrier entry” environments [14] that enforce stability and scale up community creation. Moreover, even if some of these guidelines are violated, publishing the code is still a very good idea [8]. It is important to note here that at one point part of the community was calling for early open publication of code, regardless of its quality, while at the same time another part of the community was being judgmental and rejecting this step [35]. So having guidelines for high-quality software should not mean that the community does not support users who fail to follow them, mainly due to inexperience.

Another question is whether there should be guidelines for good practices in scientific software usage (apart from creating new software). Nevertheless, since software is actually an experimental tool in a scientific experiment, the “abuse” of the tool might actually be a good idea! For this reason the only guideline that is crucial when using scientific software concerns tracking results and data provenance. This guideline urges researchers to be extremely cautious and responsible when monitoring, tracking and managing scientific data and logs that are generated by scientific software. In particular, software logs are no different from wet-lab notebook logs and should always be kept (and archived) for further validation and confirmation10. Here we argue that this responsibility is so essential to the scientific ethic that it should not be delegated lightheartedly to automatic workflow systems without careful human inspection.
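As a minimal sketch of this guideline (our own illustration, not a prescribed mechanism), the following Python snippet appends one JSON record per analysis run, containing the command, its parameters and an MD5 checksum of every input file, to a log that can be archived like a wet-lab notebook.

"""Minimal provenance log: record what was run, on which inputs, and when.
This is an illustrative sketch, not part of any specific workflow system."""

import hashlib
import json
import sys
import time


def md5sum(path, chunk_size=1024 * 1024):
    # Checksum the input so the exact file version can be verified later.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(command, parameters, input_files, log_path="analysis_provenance.log"):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "command": command,
        "parameters": parameters,
        "inputs": {path: md5sum(path) for path in input_files},
        "python_version": sys.version.split()[0],
    }
    # Append one JSON record per run; the log file can be archived like a lab notebook.
    with open(log_path, "a") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


# Example usage (tool name, parameters and file name are hypothetical):
# log_run("imputation", {"reference_panel": "GoNL", "chunk_size": 5000000},
#         ["chr20.study.vcf.gz"])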

2.4 Preparing data for automatic workflow analysis

Scientific analysis workflows are autonomous research objects in the same fashion as independent tools and data. This means that workflows need to be decoupled from specific data and should be able to analyze data from sources beyond the reach of the bioinformatician who authored them. Being able to locate open, self-described and directly accessible data is one of the most sensitive areas of life sciences research. There are two reasons for this. The first covers legal, ethical and privacy concerns regarding the individuals participating in a study. The second is the reluctance of researchers to release data that only they have the benefit of accessing, which gives them an advantage in scientific exploitation. This line of thought has placed scientific data management as a secondary concern, often given low priority in the scope of large projects and consortia. It is interesting in this regard to take a look at the conservative views of the medical community on open data policies [91]. These views consider the latest trends for openness as a side-effect of technology from which they should be protected, rather than as a revolutionary chance for progress. Instead of choosing sides in this controversy, we argue that technology itself can be used to resolve the issue. This is feasible by making data repositories capable of providing research data whilst protecting private sensitive information and ensuring proper citation.

Today it is evident that the only way to derive clinically applicable results from the analysis of genetic data is through the collective analysis of diverse datasets that are open not only for analysis, but also for quality control and validation scrutiny [96]. For this reason we believe that significant progress has to be made in order to establish the right legal frameworks that will empower political and decision-making boards to create mechanisms that enforce and control the adoption of open data policies. Even then, the most difficult change remains the paradigm shift that is needed: from the deep-rooted philosophy of researchers who treat data as personal or institutional property towards more open policies [5]. Nevertheless, we are optimistic: some changes have already started to take place due to the open data movement. Here, we present a checklist of open data management guidelines within a research project [134].

• Make long-term data storage plans early on. Most research data will remain valuable even decades after publication.

• Release the consent forms, or other legal documents and policies, under which the data were collected and released.

• Make sure that the data are discoverable. Direct links should not be more distant than 2 or 3 clicks away from a relevant search in a search engine.

• Consider submitting the data to an open repository, e.g. Gene Expression Omnibus or the European Genome-phenome Archive (EGA).

• Provide meta-data. Try to adopt open and widely used ontologies and formats, and make the datasets self-explanatory. Make sure all data are described uniformly according to the meta-data and make the meta-data specifications available.

• Provide the “full story” of the data: experimental protocols, equipment, pre-processing steps, data quality checks and so on.

• Provide links to software that has been used for pre-processing and links to tools that are directly applicable for downstream analysis.

• Make the data citable and suggest a proper citation for researchers to use. Also show the discoveries already made using the data and suggest further experiments.

• Provide a manager’s/supervisor’s email and contact information for potential inquiries.
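To illustrate the meta-data guideline above, a released dataset could ship with a small machine-readable description such as the following sketch; all field names and values are hypothetical, and a real release should follow an established ontology or community standard.

"""Illustrative, machine-readable dataset description (all values hypothetical)."""

import json

dataset_description = {
    "title": "Example genotype dataset",
    "version": "1.0",
    "license": "open access after signed data-access agreement",
    "contact": "data-manager@example.org",
    "citation": "Please cite: <suggested citation here>",
    "samples": 500,
    "assay": "genome-wide SNP array",
    "genome_build": "GRCh37",
    "files": [
        {"name": "genotypes.vcf.gz", "format": "VCF 4.1", "md5": "<checksum>"},
        {"name": "phenotypes.tsv", "format": "tab-delimited", "md5": "<checksum>"},
    ],
    "processing": ["quality control with standard filters", "phasing", "imputation"],
}

# Ship the description next to the data so the dataset is self-explanatory.
with open("dataset_description.json", "w") as handle:
    json.dump(dataset_description, handle, indent=2)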

2.5 Quality criteria for modern workflow environments

Although the notion of a scientific workflow is a simple concept with a historic presence in the bioinformatics community, there are many factors affecting the overall quality of workflows that are still being overlooked today. In this section we present some of these factors.

2.5.1 Being able to embed and to be embedded

Any workflow should have the ability to embed other workflows as components, regardless of how unfamiliar the two systems are with each other. Modern workflow environments tend to suffer from the “lock-in” syndrome. Namely, they demand that their users invest considerable time and resources to wrap an external component with the required meta-data, libraries and configuration files into a workflow. Workflow environments should be agnostic regarding the components they can support and should provide efficient mechanisms to wrap and include them with minimum effort.

Similarly, a workflow environment should not assume that it will be the primary tool with which the researcher performs the complete analysis. This behavior is selfish and reveals a desire to dominate a market rather than to contribute to a community. Instead, workflow systems should offer researchers the ability to export their complete analysis in formats that are easily digested by other systems. Examples include simple BASH scripts with meta-data described in well-known and easily parsed formats like XML, JSON or YAML. Another option is to directly export meta-types, scripts and analysis code in serialized objects like PICKLE and CAMEL11 that can be easily loaded as a whole by other tools.
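A minimal sketch of such an export, assuming a hypothetical two-step alignment workflow: the analysis is written out both as JSON meta-data and as a plain BASH script, so it can be inspected or re-executed without the original workflow environment.

"""Export a (hypothetical) two-step workflow as JSON meta-data plus a BASH script,
so that it can be re-used outside the workflow environment that created it."""

import json

workflow = {
    "name": "toy_alignment_workflow",
    "steps": [
        {"name": "align", "command": "bwa mem reference.fa reads.fq > aligned.sam",
         "inputs": ["reference.fa", "reads.fq"], "outputs": ["aligned.sam"]},
        {"name": "sort", "command": "samtools sort -o aligned.sorted.bam aligned.sam",
         "inputs": ["aligned.sam"], "outputs": ["aligned.sorted.bam"]},
    ],
}

# Meta-data in an easily parsed, well-known format.
with open("workflow.json", "w") as handle:
    json.dump(workflow, handle, indent=2)

# The same analysis as a plain BASH script that any system (or a human) can execute.
with open("workflow.sh", "w") as handle:
    handle.write("#!/bin/bash\nset -e\n")
    for step in workflow["steps"]:
        handle.write("# step: %s\n%s\n" % (step["name"], step["command"]))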

2.5.2 Support ontologies

Workflow systems tend to focus more on the analysis part of a tool and to neglect the semantic part. The semantic enrichment of a workflow can be achieved by adhering and conforming to the correct ontologies. Analysis pipelines that focus on high-throughput genomics, like Next Generation Sequencing (NGS), or on proteomics indeed have a limited need for semantic integration, mainly because a large part of this research landscape is still uncharted. Nevertheless, when a pipeline approaches findings closer to the clinical level, semantic enrichment becomes necessary. At this level, the plethora of findings, the variety of research pipelines, and sometimes the discrepancies between conclusions, can create a bewildering terrain. Therefore, semantic integration through ontologies can provide common ground for the direct comparison of findings and methods. Thus ontologies, for example for biobanks [109], gene function [34], genetic variation [22] and phenotypes [117], can be extremely helpful. Of course, ontologies are not the panacea for all these problems, since they have their own issues to be considered [94].
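As a small illustration of what semantic enrichment can look like in practice, the sketch below attaches ontology term identifiers to a single hypothetical finding; the specific fields and terms are only examples, and a real pipeline would follow the ontologies adopted by its community.

"""Attach ontology terms to a finding so that results from different pipelines
can be compared on common ground (term identifiers shown for illustration only)."""

finding = {
    "gene": "BRCA2",
    "variant_effect": {
        # Sequence Ontology term for a missense variant.
        "ontology": "SO", "term_id": "SO:0001583", "label": "missense_variant",
    },
    "associated_phenotype": {
        # Human Phenotype Ontology term (illustrative, not a real association).
        "ontology": "HPO", "term_id": "HP:0000118", "label": "Phenotypic abnormality",
    },
    "evidence": "hypothetical example",
}

# Because the annotation uses stable term identifiers instead of free text,
# findings from two pipelines can be compared by simply intersecting the reported terms.
print(finding["variant_effect"]["term_id"])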

2.5.3 Support virtualization

The computing requirements of a bioinformatics workflow, such as the processing power, memory consumption and time required, are often unknown to the workflow author or user. Additionally, the underlying computing environment has its own requirements (e.g. the operating system, installed libraries and preinstalled packages). Since a workflow environment has its own requirements and dependencies, it is cumbersome even for skilled and well IT-trained bioinformaticians to set it up and configure it. Consequently, a considerable amount of valuable research time is spent in configuring and setting up a pipeline or a workflow environment. Additionally, lack of documentation, IT inexperience and time pressure lead to misconfigured environments, which in turn leads to a waste of resources and may even produce erroneous results12. A solution for this can be virtualization. Virtualization is the “bundling” of all required software, operating system, libraries and packages into a single object (usually a file) that is called an “image” or “container”. This image can be executed on almost all known operating systems with the help of specific software, making this technique a very nice approach for the “be embeddable” feature discussed above. Any workflow environment that can be virtualized is automatically and easily embeddable in any other system by just including the created image.

Some nice examples include the I2B2 (Informatics for Integrating Biology and the Bedside) consortium13, which offers the complete framework as a VMware image [139]. Another is the transMART software that is offered in a VMware or VirtualBox container [4]. Docker is also an open-source project that offers virtualization, as well as an open repository where users can browse, download and execute a vast collection of community-generated containers. Docker borrows concepts from the GIT version control system; namely, users can download a container, apply changes and “commit” the changes to a new container. The potential value of Docker in science has already been discussed [12], [41]; for example, BioShaDock is a Docker repository of tools that can be seamlessly executed in Galaxy [104]. Other bioinformatics initiatives that are based on Docker are CAMI (Critical Assessment of Metagenomic Interpretation14), nucleotid.es15 and bioboxes.org. All these initiatives offer a testbed for the comparison and evaluation of existing techniques, mainly for NGS tasks.

12 Paper retracted due to software incompatibility: http://www.nature.com/news/error-found-in-study-of-first-ancient-african-genome-1.19258
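As an illustration of how virtualization supports the “be embeddable” feature, the following sketch executes one workflow step inside a Docker container by shelling out to docker run; the image name, tool invocation and file names are hypothetical.

"""Run one workflow step inside a Docker container (image and file names hypothetical)."""

import os
import subprocess


def run_in_container(image, command, data_dir):
    # Mount the data directory into the container and remove the container
    # afterwards (--rm), so the host only needs Docker, not the tool's dependencies.
    docker_command = [
        "docker", "run", "--rm",
        "-v", "%s:/data" % os.path.abspath(data_dir),
        image,
    ] + command
    subprocess.check_call(docker_command)


# Hypothetical usage: index a FASTA file with a samtools binary shipped inside an image.
# run_in_container("example/samtools:1.9", ["samtools", "faidx", "/data/reference.fa"], "data")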

2.5.4 Offer easy access to commonly used datasets

Over the last few years, an increasing number of large research projects have generated large volumes of high-quality data that are part of almost all bioinformatics analysis pipelines. Although locating and collecting these data is straightforward, their volume and their dispersed descriptions make their inclusion a difficult task. Modern workflow environments should offer automatic methods to collect and use them. Examples of relevant datasets are ExAC [32] with 60,000 exomes, the 1000 Genomes Project [1], GoNL [135], the ENCODE project [33], the METABRIC dataset [37] and data released from large consortia like GIANT16. Another impediment is that different consortia release data under different consent forms; for example, in the European Genome-phenome Archive (EGA) [84] each provider has a different access policy. Filling in and submitting these forms is another task that can be automated (with information provided by users). Another essential feature should be the automatic generation of artificial data whenever this is requested for testing and simulation purposes [105].
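A minimal sketch of what such automatic access could look like: a helper that downloads a reference file once into a local cache and verifies its checksum before every use. The URL and checksum below are placeholders, not real download locations.

"""Download-and-cache helper for reference datasets (URL and checksum are placeholders)."""

import hashlib
import os
import urllib.request


def fetch_reference(url, expected_md5, cache_dir="reference_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Download only once; subsequent workflow runs reuse the cached copy.
        urllib.request.urlretrieve(url, local_path)
    digest = hashlib.md5(open(local_path, "rb").read()).hexdigest()
    if digest != expected_md5:
        raise ValueError("Checksum mismatch for %s: got %s" % (local_path, digest))
    return local_path


# Placeholder values; a workflow environment would ship a curated list of these.
# fetch_reference("https://example.org/panels/chr20.vcf.gz", "<expected md5>")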

2.5.5 Support and standardize data visualization

A feature that is partially supported by existing workflow environments is the inherent support for data visualization. Over the course of genetics research, certain visualization methods have become standard and are easily interpreted and widely understood by the community. Workflow environments should not only support the creation and inclusion of these visualization techniques, but also suggest them whenever a pipeline is being written.

Examples from Genome-Wide Association Studies (GWAS) include Manhattan plots for detecting significance, Q-Q plots for detecting inflation and Principal Component Analysis plots for detecting population stratification, as well as the inclusion of visualizations from tools that have become standard in certain analysis pipelines, like LocusView and Haploview plots for GWAS. The environment should also support visualization of research data like haplotypes [75], [9], reads from sequencing [71], NGS data [126], [136] and biological networks [103].
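As an example of a standard visualization that a workflow environment could offer out of the box, the following sketch draws a basic Manhattan plot with matplotlib; the input file and its column names ('chromosome', 'position', 'pvalue') are assumptions.

"""Basic Manhattan plot from a GWAS results table (column names are assumptions)."""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Assumed input: a tab-delimited file with 'chromosome', 'position' and 'pvalue' columns.
results = pd.read_csv("gwas_results.tsv", sep="\t")
results["minus_log10_p"] = -np.log10(results["pvalue"])

offset = 0
ticks = []
for chromosome, group in results.groupby("chromosome", sort=False):
    # Shift each chromosome along the x-axis so they are plotted side by side.
    plt.scatter(group["position"] + offset, group["minus_log10_p"], s=2)
    ticks.append((offset + group["position"].max() / 2, str(chromosome)))
    offset += group["position"].max()

# Conventional genome-wide significance threshold (5e-8).
plt.axhline(-np.log10(5e-8), color="grey", linestyle="--")
plt.xticks([t[0] for t in ticks], [t[1] for t in ticks])
plt.xlabel("Chromosome")
plt.ylabel("-log10(p-value)")
plt.savefig("manhattan.png", dpi=150)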

14 http://cami-challenge.org/
15 Very interesting presentation on virtualization: http://nucleotid.es/blog/why-use-containers/
16 https://www.broadinstitute.org/collaboration/giant/index.php/Main_Page


As an implementation note, in the last few years we have seen a huge improvement in the ability of Internet browsers to visualize large and complex data. This was possible mainly due to the wide adoption and standardization of JavaScript. Today JavaScript can be used to build a standalone genome browser (for example JBrowse [128], although UCSC also allows this functionality [115]) or genome viewer (e.g. pileup.js [142]). JavaScript has allowed the inclusion of interactive plots like those from RShiny17 and the creation of aesthetic and informative novel forms of visualization with libraries like D318. Today there are hundreds of very useful, context-specific minor online tools in the area of bioinformatics. The major advantage of JavaScript is that it is already installed (and used daily) in almost all Internet browsers by default. We therefore strongly advocate workflow environments that support JavaScript data visualization.

2.5.6 Enable "batteries included" Workflow Environments

A workflow environment, no matter how advanced or rich in features, is of no use to the bioinformatics community if it does not offer a basic collection of widely adopted and ready-to-use tools and workflows. The inclusion of these, even in early releases, will offer a valuable testing framework and will quickly attract users who can become a contributing community. These tools and workflows can be split into the following categories:

Tools:

• Tools that are essential and can be used for a variety of purposes such as plink [113] and GATK [39].

• Tools for basic text file manipulation (e.g. column extraction), format validation, conversion and quality control.

• Tools for basic analysis, e.g. for variant detection, sequence alignment, variant calling, Principal Component Analysis, eQTL analysis, phasing, imputation, association analysis and meta-analysis.

• Wrappers for online tools. Interestingly, a large part of modern analysis demands the use of tools that require interaction through a web browser. These tools either do not offer a local offline version or operate on large datasets that are inconvenient to store locally. A modern workflow should include these tools via special methods that emulate and automate browser interaction. Examples of

17 http://shiny.rstudio.com/
