SemanticSCo: a platform to support the semantic composition of services for gene expression analysis


Gabriela D.A. Guardia (a), Luís Ferreira Pires (b), Eduardo G. da Silva (b), Cléver R.G. de Farias (a,⇑)

(a) Department of Computer Science and Mathematics – Faculty of Philosophy, Sciences and Letters at Ribeirão Preto (FFCLRP) – University of São Paulo (USP), Ribeirão Preto, Brazil

(b) Faculty of Electrical Engineering, Mathematics and Computer Science – University of Twente, Enschede, Netherlands

Article info

Article history: Received 12 July 2016; Revised 27 November 2016; Accepted 31 December 2016; Available online 3 January 2017

Keywords: Gene expression analysis; Semantic web services; Service composition; Composition architecture; Composition platform

Abstract

Gene expression studies often require the combined use of a number of analysis tools. However, manual integration of analysis tools can be cumbersome and error prone. To support a higher level of automation in the integration process, efforts have been made in the biomedical domain towards the development of semantic web services and supporting composition environments. Yet, most environments consider only the execution of simple service behaviours and require users to focus on technical details of the composition process.

We propose a novel approach to the semantic composition of gene expression analysis services that addresses the shortcomings of the existing solutions. Our approach includes an architecture designed to support the service composition process for gene expression analysis, and a flexible strategy for the (semi) automatic composition of semantic web services. Finally, we implement a supporting platform called SemanticSCo to realize the proposed composition approach and demonstrate its functionality by successfully reproducing a microarray study documented in the literature.

The SemanticSCo platform provides support for the composition of RESTful web services semantically annotated using SAWSDL. Our platform also supports the definition of constraints/conditions regarding the order in which service operations should be invoked, thus enabling the definition of complex service behaviours. Our proposed solution for semantic web service composition takes into account the requirements of different stakeholders and addresses all phases of the service composition process. It also provides support for the definition of analysis workflows at a high level of abstraction, thus enabling users to focus on biological research issues rather than on the technical details of the composition process. The SemanticSCo source code is available at https://github.com/usplssb/SemanticSCo.

1. Introduction

Gene expression analysis has become a widely used approach to the assignment of probable functions to the gene information stored in the genome of a number of biological species [1]. In order to accomplish this goal, gene expression analyses are performed in a high-throughput manner, measuring the expression of thousands of genes simultaneously. When gene expression data are available, a biologist usually defines an analysis workflow containing a series of activities [2]. These activities can include data normalization, identification of differentially expressed genes, cluster analysis and functional analysis, among others.

The execution of an analysis workflow frequently requires the combined use of a number of software tools [3]. The integration of different tools to perform the activities defined in an analysis workflow requires biologists to manually transfer data between tools and convert data formats, due to structural differences in these data [4]. However, the manual integration of software resources has become cumbersome due to the increasing number and variety of tools and data formats that need to be handled by the biologists [5,6]. Therefore, automated support for data/tool integration has become highly desirable [7].

In order to facilitate the (semi) automatic integration of tools, a number of efforts have been made in the biomedical domain to provide analysis tools as semantic web services [8–12]. Web services provide standardized programming interfaces, which facilitate tool interoperability in this domain. In addition, the semantic enrichment of service interfaces with ontological concepts that describe their associated functionality and the meaning of service inputs/outputs enables the creation of mechanisms to support (semi) automatic composition of these resources in order to accomplish an intended gene expression analysis task. Moreover, the assignment of semantics to data shared among different services in a given composition guarantees data exchange in a semantically consistent and meaningful manner.

Journal of Biomedical Informatics. Journal homepage: www.elsevier.com/locate/yjbin

http://dx.doi.org/10.1016/j.jbi.2016.12.014

⇑ E-mail address: farias@ffclrp.usp.br (C.R.G. de Farias).

Following these ideas, a number of approaches have been proposed to support semantic service composition in the biomedical domain [13–19]. Some support environments follow a fully-automated service composition approach, e.g., Bio-jETI [14] and jORCA [17]. These environments assume that users are able to provide a complete and accurate set of requirements beforehand. In contrast, other environments follow a semi-automatic composition approach, e.g., Sesame [15], Galaxy [16] and Taverna [18]. These environments assist users in the creation of an analysis workflow by iteratively suggesting the most appropriate services at each step of the workflow design [20,21]. These suggestions are based on the semantic matching between input and output data types of adjacent services defined in the composition chain. Regardless of the composition strategy adopted, once a composition is completely specified, the supporting environments provide support for workflow execution, by automatically invoking each individual service defined in the workflow and managing the data transfer between services.

In most support environments, users are supposed to define concrete (low level) workflows that consist of a number of interconnected web services, rather than specifying their requirements in a more abstract manner, e.g., in terms of desired functionality and data types that can be provided. Consequently, the definition of analysis workflows using such environments requires users to focus on technical issues in addition to the biological questions of interest. Moreover, most existing approaches to semantic service composition in the biomedical domain consider that individual services defined in a composition present a simple execution behaviour. However, some services may prescribe a local behaviour, i.e., constraints or conditions regarding the order in which service operations should be invoked.

In this paper, we propose a novel approach to address these shortcomings. This approach consists of an architecture and a supporting infrastructure to enable more abstract support to the semantic composition of services in the gene expression analysis domain. We claim that fully automatic composition is inadequate for the functional genomics domain since knowledge is iteratively obtained during the analysis process. Thus, we consider the use of a semi-automatic composition approach since it enables the gradual specification of the biologist's requirements as the study advances, while assisting him/her during the service discovery and composition tasks. We propose a layered architecture to support the semi-automatic semantic composition of gene expression analysis services. Then, we present the platform called SemanticSCo that we have built to implement and validate the proposed composition approach. Finally, we demonstrate how SemanticSCo was successfully used to reproduce part of a case study already documented in the literature involving the analysis of one-color microarray data. The platform source code is available under Apache License 2.0 at https://github.com/usplssb/SemanticSCo.

The remainder of the paper is structured as follows: Section 2 introduces a number of relevant technologies and contributions that were considered during the development of our platform; Section 3 describes our service composition approach; Section 4 describes the main aspects of the SemanticSCo platform; Section 5 presents an example of the SemanticSCo platform usage; Section 6 outlines a number of related contributions and highlights the most relevant aspects of our work; and finally, Section 7 presents some conclusions and outlines some opportunities for future work.

2. Methods

In this work, we first define our approach to the semantic composition of gene expression analysis services, and then we discuss the SemanticSCo platform that we have built to implement and validate the proposed approach. In order to build the SemanticSCo platform, we have adapted an existing composition framework named A-DynamiCoS [22]. The SemanticSCo platform considers that each individual service has a (complex) local behaviour defined by the order in which service operations are expected to be executed. Finally, we incorporate a set of semantic web services available at the GEAS Repository [23] into our composition platform in order to test and demonstrate its usage.

2.1. Framework Adaptable DynamiCoS (A-DynamiCoS)

Adaptable DynamiCoS (A-DynamiCoS) [22] is a framework developed to support user-centric service composition. This framework provides flexible support to the different activities of the service composition process, taking into account the specific needs of users during the composition process. A-DynamiCoS is structured into three main components, namely User Support, Coordinator and Composition Framework.

The User Support component is responsible for collecting user intentions and translating them into computational commands that indicate which activities of the composition process should be performed. The Coordinator component receives commands from the User Support and maps them onto necessary invocations to subcomponents of the Composition Framework, which provide the required behaviour. The Composition Framework aims at automatically supporting the whole service composition process. In this component, each activity of the composition process is supported by an independent subcomponent.

A Java prototype implementation of A-DynamiCoS is provided at http://www.dynamicos.sourceforge.net. In this prototype, the Coordinator and Composition Framework components were developed to support the composition of SOAP services semantically annotated using the SPATEL language [24]. These two components are implemented as SOAP web services [25].

2.2. Service local behaviour specification

Previously, we have defined a methodology for the creation of RESTful semantic web services for gene expression analysis [23]. According to our methodology, a service can be created in three steps: (1) a RESTful service is implemented as a wrapper for an existing analysis tool; (2) a WSDL service description is automatically generated from the service implementation; and (3) the service description is semantically annotated using SAWSDL. The last step can include the definition/extension of a service/domain ontology to be used in the annotation process. Service operations are semantically annotated with ontological terms that describe their functionality, inputs and outputs.

A web service created according to our methodology can expose a set of operations that independently provide some specific functionality. However, a web service may also prescribe the order in which its operations should be invoked. In addition, more complex service behaviours may need to be defined, e.g., conditions to be fulfilled before invoking a given service operation. In this paper, the term “local behaviour” denotes these behaviours and the constraints they impose.

A service description defines the operations provided by a service and the details to communicate with the service through the individual execution of its operations, as in the case of WSDL [26]. However, a service description usually does not define the order in which service operations should be invoked. In order to enable the automated execution of (complex) services, the explicit definition of their local behaviour is necessary, in addition to their interface descriptions. For this purpose, a web service may have its expected local behaviour specified as an executable process, defining a flow of service tasks (operations) that should be invoked to properly execute the service. In this context, we extend our previous methodology and model each created service as an executable process using the Business Process Model and Notation (BPMN) [27].

To model the local behaviour of a single service in BPMN, each service operation is defined as a BPMN service task that points to the WSDL definition of the operation. The connections among service tasks define the order in which they should be executed. Conditions under which these tasks are executed can also be defined using BPMN gateways and events, while the WSDL inputs and outputs of each service operation can be modelled as BPMN process inputs and outputs, respectively. Finally, service parameters can be modelled as BPMN data objects. Once a BPMN specification is available, the corresponding service can be automatically executed by any general-purpose BPMN execution engine.
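To make the notion of local behaviour more concrete, the following minimal Python sketch models a service as an ordered flow of tasks with an optional guard condition, loosely mirroring BPMN service tasks, sequence flows and gateway conditions. It does not use BPMN or any execution engine; the `ServiceTask` and `LocalBehaviour` classes and the operation names (`upload`, `normalize`, `download`) are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Context = Dict[str, object]


@dataclass
class ServiceTask:
    """One service operation (analogous to a BPMN service task)."""
    name: str
    run: Callable[[Context], None]
    # Optional guard, analogous to a condition attached to a BPMN gateway.
    condition: Optional[Callable[[Context], bool]] = None


@dataclass
class LocalBehaviour:
    """Ordered flow of tasks that must be followed to execute a service."""
    tasks: List[ServiceTask] = field(default_factory=list)

    def execute(self, inputs: Context) -> Context:
        ctx = dict(inputs)            # process inputs supplied by the user
        trace: List[str] = []
        for task in self.tasks:       # sequence flow: fixed invocation order
            if task.condition is None or task.condition(ctx):
                task.run(ctx)
                trace.append(task.name)
        ctx["trace"] = trace
        return ctx


# Hypothetical operations of a single wrapped analysis tool.
def upload(ctx: Context) -> None:
    ctx["dataset_id"] = "ds-1"        # internal identifier (data object)


def normalize(ctx: Context) -> None:
    ctx["normalized"] = True


def download(ctx: Context) -> None:
    ctx["result"] = f"normalized:{ctx['dataset_id']}"  # process output


behaviour = LocalBehaviour([
    ServiceTask("upload", upload),
    ServiceTask("normalize", normalize,
                condition=lambda c: not c.get("already_normalized")),
    ServiceTask("download", download),
])

raw = behaviour.execute({"already_normalized": False})
pre = behaviour.execute({"already_normalized": True})
```

In this sketch the caller never chooses the invocation order; it is fixed by the process definition, which is the property the BPMN specification is meant to capture.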

2.3. Semantic web services for gene expression analysis

We have developed a representative set of semantic web services capable of performing a number of analysis activities on gene expression data obtained from DNA microarrays and RNA-Seq experiments [23]. Our services have been developed according to the Representational State Transfer (REST) [28] architectural style. Once implemented, the services were described using the Web Services Description Language version 2 (WSDL 2.0) [26]. The WSDL description of a service specifies its set of operations, input and output message formats, as well as technical details to access the service. These WSDL descriptions were semantically annotated with concepts from a service ontology developed in the Web Ontology Language (OWL) [29]. The semantic annotations were performed using Semantic Annotations for WSDL (SAWSDL) [30].

The developed services are publicly available at the Gene Expression Analysis Services (GEAS) Repository (http://dcm.ffclrp.usp.br/lssb/geas). The GEAS Repository provides access to these services and detailed documentation about them. The available services support, among others, the preprocessing of microarray data, differential expression analysis of microarray and RNA-Seq data, hierarchical and K-means clustering of microarray data, visualization of hierarchically clustered microarray data, enrichment analysis of gene expression data and visualization of gene expression data rendered into KEGG pathways [31]. All services available at the GEAS Repository were deployed on the SemanticSCo platform.

3. Service composition approach

A gene expression study is usually performed in five general steps [32,33], viz., design experiment, perform experiment, analyse data, interpret results and validate results. Our composition approach focuses on the data analysis step, in which a biologist or bioinformatician is responsible for defining a workflow of analysis activities to be performed on data and selecting relevant tools to perform these activities. Due to the exploratory nature of a gene expression study, the analysis workflow may be iteratively refined in order to include (or modify) activities or tools according to the obtained analysis results and corresponding interpretation.

3.1. Composition architecture

Once semantic web services are created, the ultimate goal is to further integrate these services in order to create an analysis workflow (service composition). During the definition of this workflow, a biologist specifies an ordered set of analysis activities to be performed on gene expression data. Ideally, the biologist defines the set of goals and data types to be processed as an abstract (semantic) workflow rather than directly dealing with services to specify a concrete workflow. Once an abstract workflow is defined, a concrete service composition can then be subsequently derived from it.

In order to cope with these requirements, we propose a four-layer architecture to enable semantic service composition for gene expression analysis. Layer 1 contains a set of semantic web services for gene expression analysis. Layer 2 contains a set of executable processes that specify the local behaviour of each semantic web service contained in Layer 1. Layer 2 also incorporates mechanisms for the publication of these executable processes. Layer 3 performs the semantic composition of processes defined in Layer 2. Thus, Layer 3 incorporates mechanisms for process (semantic) discovery and selection as well as for the creation of the semantic composition itself. Layer 4 comprises the specification of an abstract analysis workflow from which a concrete service composition can be created in Layer 3. Hence, Layer 4 incorporates mechanisms to assist users during the specification task (requirements). Fig. 1 presents an overview of our architecture, indicating its layers and their interactions.

[Fig. 1: four-layer architecture. Layer 1: Semantic Web Services; Layer 2: Executable Processes; Layer 3: Semantic Composition; Layer 4: Abstract Workflow. Legend: Activity, Process, Service, Data.]

3.2. Composition strategy

A semantic service composition is created according to the fol-lowing steps:

1. The biologist selects some (new) input dataset, e.g., Affymetrix normalized data.

2. The supporting system attempts to find data types that can be used to specify the semantics of the selected dataset according to a service ontology. A list of data types (ontological concepts) is presented to the biologist, e.g., normalized microarray data, log-transformed normalized microarray data, and one-color normalized microarray data.

3. The biologist selects one of the suggested data types to specify the semantics of the input dataset, e.g., normalized microarray data.

4. The supporting system attempts to find analysis activities that can be performed on the available data according to a service ontology. The available data are provided by the biologist and produced by each service of the composition chain. A list of analysis activities (ontological concepts) is presented to the biologist, e.g., differential expression analysis, microarray data clustering, Student’s t test analysis. If the biologist intends to refine the suggested list of analysis activities, step 1 can be performed again to modify the previously specified data semantics using, for example, more specific ontological terms (e.g., one-color normalized microarray data).

5. The biologist selects one of the suggested analysis activities, e.g., two-sample Student’s t test analysis.

6. The supporting system attempts to find a (composite) service that accepts the available data as input and is capable of performing the selected analysis activity. A list of available services is presented to the biologist. Thereupon, the biologist can repeat step 1 to modify the previously specified data semantics or step 4 to modify the selected analysis activity.

7. The biologist selects one of the suggested services, e.g., MicroOneDifferentialAnalysis.

8. The system adds the selected service to the composition chain. The biologist can repeat step 1 to include an additional dataset or step 4 to include or modify an analysis activity.

9. The biologist executes one or more services of the composition chain.

10. The supporting system saves the results produced by the service(s) execution. At this point, the biologist can further extend the composition by repeating steps 1 or 4. Otherwise, the analysis ends.

Fig. 2 presents an overview of the proposed semi-automatic composition strategy.
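The interplay between user choices and system suggestions in the steps above can be sketched as a small session object. This is only an illustrative Python sketch, not the SemanticSCo implementation: the `CompositionSession` class is hypothetical, the two lookup tables stand in for the service ontology and the service registry, and service execution is stubbed. The concept and service names are taken from the examples in the steps, but their pairing in the tables is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Toy stand-in for the ontology: data type -> applicable analysis activities.
ACTIVITIES: Dict[str, List[str]] = {
    "normalized microarray data": [
        "differential expression analysis",
        "microarray data clustering",
        "two-sample Student's t test analysis",
    ],
}
# Toy stand-in for the registry: activity -> services able to perform it.
SERVICES: Dict[str, List[str]] = {
    "two-sample Student's t test analysis": ["MicroOneDifferentialAnalysis"],
}


@dataclass
class CompositionSession:
    data_type: Optional[str] = None
    activity: Optional[str] = None
    chain: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)

    def select_data_type(self, data_type: str) -> List[str]:
        """Steps 3-4: fix the data semantics and return suggested activities."""
        self.data_type, self.activity = data_type, None
        return ACTIVITIES.get(data_type, [])

    def select_activity(self, activity: str) -> List[str]:
        """Steps 5-6: fix the activity and return suggested services."""
        if activity not in ACTIVITIES.get(self.data_type or "", []):
            raise ValueError("activity not applicable to the selected data type")
        self.activity = activity
        return SERVICES.get(activity, [])

    def add_service(self, service: str) -> None:
        """Steps 7-8: append the chosen service to the composition chain."""
        if service not in SERVICES.get(self.activity or "", []):
            raise ValueError("service does not perform the selected activity")
        self.chain.append(service)

    def execute(self) -> List[str]:
        """Steps 9-10: run the chain and save the results (stubbed here)."""
        self.results = [f"result of {name}" for name in self.chain]
        return self.results


session = CompositionSession()
activities = session.select_data_type("normalized microarray data")
services = session.select_activity("two-sample Student's t test analysis")
session.add_service(services[0])
results = session.execute()
```

Because each method can be called again at any time, the biologist can go back to an earlier step (for instance, re-selecting the data type) without discarding the services already placed in the chain.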

4. The semantic services composition platform

SemanticSCo has been implemented in Java. In order to build SemanticSCo, we adapted the A-DynamiCoS framework [22] considering both the characteristics of the gene expression domain and the technologies employed in the development of the semantic web services available at the GEAS Repository, viz., REST, WSDL 2.0, SAWSDL and BPMN.

4.1. Platform overview

SemanticSCo is structured into six basic components: component Service Creation and Publication provides (semi) automatic support to the creation and publication of semantic web services; components Composite Service Enactment, Coordinator, Service Composition and Composition and Execution Context provide support for the creation and execution of semantic service compositions; finally, component Service Registry stores relevant information about all services incorporated into the platform. Fig. 3 presents SemanticSCo’s architecture.

Component Service Creation and Publication supports bioinformaticians in the creation of semantic web services and subsequent publication of relevant service information into component Service Registry. All service information stored in Service Registry can be later accessed by component Service Composition during service discovery.

Component Composite Service Enactment provides a graphical user interface to support biologists or bioinformaticians in the creation and execution of service compositions. During the definition of an (abstract) analysis workflow, component Composite Service Enactment collects user (biologist) intentions and translates them into computational (primitive) commands that indicate to component Coordinator which activities of the composition process should be performed.

[Fig. 2: overview of the semi-automatic composition strategy. Flowchart of steps 1 to 10 with the decision points "relevant activities?", "relevant services?", "change data or activity?", "change composition?" and "finalize analysis?".]

Each primitive command issued by component Composite Service Enactment triggers in component Coordinator a number of calls (supporting strategy) to the basic composition services provided by component Service Composition, as well as to the context storage services provided by component Composition and Execution Context. Thus, component Coordinator mediates the interactions between component Composite Service Enactment and components Service Composition and Composition and Execution Context. Component Coordinator allows the different activities of the composition process, e.g., service selection and execution, to be performed on demand according to the users' needs, thus complying with our flexible composition strategy. In this context, Coordinator enables, for example, a user to interleave the inclusion of a new analysis activity (service) in the composition workflow with the execution of analysis activities whose inputs are already available.
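The mediation role of the Coordinator can be illustrated with a simple dispatch table that maps each primitive command onto a supporting strategy. The Python sketch below is an assumption-laden illustration: the command name `DISCOVER_SERVICES`, the class names and the method signatures are all hypothetical, and the composition and context components are reduced to in-memory stubs rather than the SOAP services used by the platform.

```python
from typing import Callable, Dict, List, Tuple


class ServiceCompositionStub:
    """Stand-in for the basic composition services (hypothetical API)."""

    def discover(self, activity: str) -> List[str]:
        return [f"service-for-{activity}"]


class ContextStoreStub:
    """Stand-in for the composition and execution context storage."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, object]] = []

    def save(self, key: str, value: object) -> None:
        self.entries.append((key, value))


class Coordinator:
    """Maps each primitive command onto a supporting strategy."""

    def __init__(self, composition: ServiceCompositionStub,
                 context: ContextStoreStub) -> None:
        self.composition = composition
        self.context = context
        # One strategy per primitive command; more entries could be added.
        self.strategies: Dict[str, Callable[..., object]] = {
            "DISCOVER_SERVICES": self._discover_services,
        }

    def handle(self, command: str, **args: object) -> object:
        try:
            strategy = self.strategies[command]
        except KeyError:
            raise ValueError(f"unknown primitive command: {command}") from None
        return strategy(**args)

    def _discover_services(self, activity: str) -> List[str]:
        # A strategy is a sequence of calls to the other components.
        found = self.composition.discover(activity)
        self.context.save("discovered", found)
        return found


context = ContextStoreStub()
coordinator = Coordinator(ServiceCompositionStub(), context)
found = coordinator.handle("DISCOVER_SERVICES", activity="clustering")
```

Keeping the strategies in a table means that commands can be issued in any order, which is one way to obtain the on-demand behaviour described above.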

Component Service Composition automatically supports all service composition activities. Each activity is supported by an independent set of subcomponents, e.g., Service Discoverer and Registry Inquiry for service discovery. Automation of the discovery and composition processes is achieved by using semantic annotations to describe both services and user requests.

Finally, component Composition and Execution Context provides mechanisms to store all context information handled during the creation/execution of a service composition. Stored information includes not only services that were discovered, selected and executed, but also user-provided datasets and service execution results. Since this component separately stores for each user the information associated with a single composition, it provides the necessary support for handling multiple user sessions (compositions).

The available A-DynamiCoS prototype implementation provides a generic and domain-independent implementation of components Coordinator, Service Composition and Composition and Execution Context, allowing them to be reused in any application domain. However, this implementation supports only the composition of SOAP-based services semantically annotated using SPATEL [24]. Further, these services must have a single operation (simple service). In order to incorporate the set of services available at the GEAS Repository, we have adapted these components to support the composition of RESTful services semantically annotated using SAWSDL. New features introduced in SemanticSCo include support for the definition of (abstract) analysis workflows, management of multi-operation (complex) services and multiple service instances in a composition, as well as service discovery based on both the desired service functionality and the semantic matching between service inputs and outputs. Finally, we have developed the Composite Service Enactment component (named User Support in A-DynamiCoS) from scratch taking into account the specific requirements and characteristics of the stakeholders to be supported in the gene expression domain.

4.2. Service creation

The creation of RESTful semantic web services according to our extended methodology is supported by two components: Wsdl2 Creator and Wsdl2Bpmn Mapper. Component Wsdl2 Creator provides semi-automatic support for the creation, editing and validation of WSDL service descriptions. This component was adapted from a previously defined tool named Wsdl2 Generator [23]. Component Wsdl2Bpmn Mapper enables the automatic mapping of a WSDL 2.0 service description to a BPMN process definition, thus facilitating the definition of the local service behaviour. Once this mapping is automatically performed using the Wsdl2Bpmn Mapper component, any general-purpose BPMN editor can be used to further refine the created process. Fig. 4 presents the proposed mapping from WSDL to BPMN elements.

[Fig. 3: SemanticSCo architecture. Components: Service Creation and Publication (Wsdl2 Creator, Wsdl2Bpmn Mapper, BPMN Interpreter, BPMN Publisher); Composite Service Enactment (User Interface, Command Flow, Local Data Storage); Coordinator (Command Processor, Supporting Strategies); Service Composition (Semantic Provider, Semantic Reasoner, Registry Inquiry, Service Discoverer, Service Composer, CLM Manager); Composition and Execution Context; Service Registry.]

Each WSDL operation is mapped to a BPMN service task, while each XML Schema element referenced by a WSDL operation input is mapped to a BPMN Input or Data Object. If the operation input is intended to be provided by the user before the process execution starts, its corresponding XML Schema element should be mapped to a BPMN Input. Otherwise, if the operation input can be provided during the process execution, its corresponding XML Schema element can be mapped to a BPMN Data Object. Similarly, each XML Schema element referenced by a WSDL operation output is mapped to a BPMN Output or Data Object. If the operation output should be provided to the user at the end of the process execution (e.g., analysis result), its corresponding XML Schema element is mapped to a BPMN Output. Otherwise, if the operation output is intended to be consumed only by other service operation(s) defined within the BPMN process and is not intended to be returned to the user (e.g., internal identifier), its corresponding XML Schema element can be mapped to a BPMN Data Object. Once the WSDL operations and XML elements of a service description are mapped to a BPMN process, the associated SAWSDL annotations become indirectly accessible via BPMN.
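The mapping rules just described can be summarised in a few lines of Python. This is a minimal sketch of the decision logic only, assuming that each XML Schema element carries a flag saying whether it is exchanged with the user; it does not parse WSDL or emit BPMN, and the class names, the `user_facing` flag and the example operations (`uploadData`, `runAnalysis`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaElement:
    name: str
    # True if the element is provided by, or returned to, the user.
    user_facing: bool


@dataclass
class WsdlOperation:
    name: str
    inputs: List[SchemaElement] = field(default_factory=list)
    outputs: List[SchemaElement] = field(default_factory=list)


def map_operation(op: WsdlOperation) -> Dict[str, object]:
    """Apply the WSDL-to-BPMN rules to a single operation."""
    return {
        # Every WSDL operation becomes a BPMN service task.
        "service_task": op.name,
        # User-provided inputs become process Inputs, others Data Objects.
        "inputs": {e.name: ("Input" if e.user_facing else "Data Object")
                   for e in op.inputs},
        # Results returned to the user become Outputs, others Data Objects.
        "outputs": {e.name: ("Output" if e.user_facing else "Data Object")
                    for e in op.outputs},
    }


upload = WsdlOperation(
    "uploadData",
    inputs=[SchemaElement("rawData", True)],
    outputs=[SchemaElement("datasetId", False)],   # internal identifier
)
analyse = WsdlOperation(
    "runAnalysis",
    inputs=[SchemaElement("datasetId", False)],
    outputs=[SchemaElement("analysisResult", True)],
)
mapped = [map_operation(op) for op in (upload, analyse)]
```

In this example the internal identifier `datasetId` ends up as a Data Object on both sides, so it is passed between the two service tasks without ever being exposed to the user.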

4.3. Service publication

Component Service Registry has been implemented as a UDDI-based service registry using the jUDDI Java API (http://juddi.apache.org/), which allows the publication and discovery of services according to the UDDI specification [34]. The UDDI specification provides support for the publication of information extracted from WSDL 1.1 service descriptions in a UDDI registry [35]. However, no support is provided for the publication of information extracted from either WSDL 2.0 descriptions, including SAWSDL semantic annotations, or BPMN processes, so we defined mappings from WSDL 2.0 and BPMN elements to UDDI structures. Fig. 5 shows these mappings.

We model the local behaviour of each single service as a separate BPMN process. Each process consists of an ordered set of service tasks, where each service task refers to a WSDL operation defined within a WSDL interface. In order to map a service to the UDDI-based registry, the WSDL interface is first mapped to a UDDI Interface Technical Model (tModel). A tModel is a generic UDDI structure (a namespace with metadata) used to represent a unique (technical) concept or construct. The WSDL binding that points to the WSDL interface is then mapped to a UDDI Binding tModel, while the BPMN process itself is mapped to a UDDI Process tModel. Both the UDDI Binding and Process tModel point to the UDDI Interface tModel. Next, the BPMN process is mapped to a UDDI Business Service. The WSDL Endpoint that refers to the WSDL Binding is mapped to a UDDI Binding Template contained in the Business Service. The UDDI Binding Template points to the UDDI Binding, Process and Interface tModel. Finally, SAWSDL semantic annotations referenced by BPMN service task, input and output elements are mapped to UDDI Category Bags defined within the Business Service.
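The resulting registry structures and their cross-references can be pictured with plain dictionaries. The following Python sketch only mirrors the mapping described in the paragraph above; it does not call the jUDDI API, and the function name `map_to_uddi`, the dictionary keys and all example values (names, URL, annotation identifiers) are hypothetical.

```python
from typing import Dict, List


def map_to_uddi(process: str, interface: str, binding: str,
                endpoint: str, annotations: List[str]) -> Dict[str, dict]:
    """Build toy UDDI structures for one service (illustrative only)."""
    interface_tmodel = {"kind": "Interface tModel", "name": interface}
    # Binding and Process tModels both point to the Interface tModel.
    binding_tmodel = {"kind": "Binding tModel", "name": binding,
                      "refs": [interface]}
    process_tmodel = {"kind": "Process tModel", "name": process,
                      "refs": [interface]}
    business_service = {
        "kind": "Business Service",
        "name": process,
        # The WSDL endpoint becomes a Binding Template that points to
        # the Binding, Process and Interface tModels.
        "binding_templates": [{
            "access_point": endpoint,
            "tmodels": [binding, process, interface],
        }],
        # SAWSDL annotations are stored as category information.
        "category_bag": sorted(annotations),
    }
    return {
        "interface_tmodel": interface_tmodel,
        "binding_tmodel": binding_tmodel,
        "process_tmodel": process_tmodel,
        "business_service": business_service,
    }


registry_entry = map_to_uddi(
    process="ExampleAnalysisProcess",
    interface="ExampleAnalysisInterface",
    binding="ExampleAnalysisHttpBinding",
    endpoint="http://example.org/services/analysis",
    annotations=["onto#NormalizedMicroarrayData", "onto#DifferentialExpressionAnalysis"],
)
```

A discovery query can then match on the category information of the Business Service and follow the Binding Template to reach the technical details needed to invoke the service.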

In order to support the service publication task using our mapping, we first extended the jUDDI implementation of the Service Registry component with a set of tModels that store information about BPMN and WSDL components. We also developed two separate components, viz., BPMN Interpreter and BPMN Publisher. Component BPMN Interpreter is responsible for reading a BPMN document and extracting the necessary information for publication. This component also reads and extracts WSDL-related information (e.g., SAWSDL annotations) referenced in the BPMN document. Once relevant service information is extracted, BPMN Publisher automatically publishes this information into component Service Registry according to the proposed mapping.

4.4. Service composition

Components Semantic Provider, Service Discoverer and Service Composer were developed to support service request, service discovery/selection and service composition, respectively. Semantic Provider was developed from scratch, while the other components extended existing A-DynamiCoS’ components.

Component Semantic Provider supplies ontological concepts for the specification of input data semantics. These concepts are used by a user (biologist) during the creation of an abstract workflow. This component also extracts semantic information (user inputs and/or service outputs) from a user-defined abstract workflow and invokes component Semantic Reasoner to discover ontological concepts in a service ontology. These concepts represent analysis activities (functionality) that can be performed on the available data.
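As an illustration of how analysis activities might be derived from the semantics of the available data, the Python sketch below walks a toy concept hierarchy and returns the activities whose accepted input type equals or subsumes the available data type. This is an assumed simplification of what a semantic reasoner does over an OWL ontology: the hierarchy, the `ACCEPTS` table, the function names and the activity "one-color quality assessment" are all hypothetical.

```python
from typing import Dict, List, Optional

# Toy data-type hierarchy: concept -> parent concept (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "data": None,
    "normalized microarray data": "data",
    "one-color normalized microarray data": "normalized microarray data",
}

# Toy activity table: analysis activity -> data type it accepts.
ACCEPTS: Dict[str, str] = {
    "differential expression analysis": "normalized microarray data",
    "microarray data clustering": "normalized microarray data",
    "one-color quality assessment": "one-color normalized microarray data",
}


def ancestors_or_self(concept: str) -> List[str]:
    """Return the concept followed by all of its ancestors."""
    chain: List[str] = []
    current: Optional[str] = concept
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def applicable_activities(available: str) -> List[str]:
    """Activities whose accepted type equals or subsumes `available`."""
    usable = set(ancestors_or_self(available))
    return sorted(a for a, accepted in ACCEPTS.items() if accepted in usable)


general = applicable_activities("normalized microarray data")
specific = applicable_activities("one-color normalized microarray data")
```

Changing the data semantics from a general to a more specific concept changes the suggested list, which is why the composition strategy lets the biologist revisit the data type selection.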

Component Service Discoverer supports service discovery in two steps. First, Service Discoverer extracts semantic information (func-tionality and inputs/outputs) from a user-defined abstract work-flow and invokes component Semantic Reasoner to reason on a service ontology and find out similar ontological concepts. Second, Service Discoverer invokes component Registry Inquiry to query component Service Registry and discover services that match the request consisting of the concepts retrieved by Semantic Reasoner. Service Discoverer also ranks the discovered services according to the semantic similarity between the desired analysis activity and the service functionality and between services’ inputs and out-puts. The semantic similarity between ontological concepts is cal-culated based on six degrees of matching adapted from[36], viz., exact, direct/indirect subsume, direct/indirect plug-in and disjoint. The ranked list of discovered services is presented to the user, who should select a suitable service according to personal preferences. Finally, component Service Composer supports the service composition process in two steps. First, Service Composer invokes

Fig. 5. Mapping from WSDL and BPMN to UDDI-based service registry. Dashed lines represent references between elements, while solid lines represent WSDL and BPMN elements that are mapped onto our UDDI registry.

Fig. 4. Mapping from WSDL 2.0 to BPMN 2.0. Dashed lines represent references between elements, while solid lines represent the mapping of WSDL elements onto BPMN elements.


component CLM Manager to create a Causal Link Matrix (CLM) containing semantic similarity values defined between selected services’ inputs and outputs. CLM values are computed with the aid of component Semantic Reasoner. Second, Service Composer creates a composition chain. Two services can be interconnected if output data provided by one service are semantically compatible, as defined by the CLM values, with data consumed by the other service. A service composition is represented as a graph, where each node represents a separate service and each connecting edge represents the interconnection between a service output and another service input. The Service Composer component can create a composition chain by applying either a forward or a backward approach.

4.5. Composite service enactment and execution

Component Composite Service Enactment provides a graphical user interface that allows biologists to define an abstract workflow, associate concrete services with (abstract) analysis activities and execute each analysis activity separately. The graphical specification of an analysis workflow follows a BPMN-like notation, supported by the JGraphX Java API (https://github.com/jgraph/jgraphx).

Component Composite Service Enactment collects user intentions, translating them into primitive commands that trigger the execution of different supporting strategies in the Coordinator. Each primitive command also specifies the SOAP request/response messages to be exchanged between Composite Service Enactment and Coordinator components. We defined an extensible set of nine primitive commands, as follows:

1. DISCOVER_INPUT_SEMANTICS, which is used to request the set of available ontological concepts for the semantic specification of a user-provided dataset.

2. DISCOVER_FUNCTION_SEMANTICS, which is used to request the set of ontological concepts representing analysis activities (functionality) that can be performed on a given dataset.

3. DISCOVER_SERVICES, which is used to request the set of services that provide a given functionality and which are capable of consuming a given dataset.

4. INCLUDE_SERVICES, which is used to include a given set of services in the composition chain.

5. RESOLVE_SERVICES, which is used to request the set of services that are capable of producing data required as input for a given set of services.

6. VALIDATE_INPUTS, which is used to validate a given set of service inputs, i.e., the association of user-provided datasets to service inputs.

7. COMPOSE_SERVICES, which is used to request the forward or backward composition of two services previously included in the composition chain using the INCLUDE_SERVICES command.

8. GET_EXECUTABLE_SERVICES, which is used to request the set of services that are ready for execution, i.e., services whose required inputs are available.

9. ADD_TO_CONTEXT, which is used to store a given dataset produced by the execution of a service.
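To make the command set more concrete, the following minimal Python sketch enumerates the nine primitive commands and illustrates the readiness rule behind GET_EXECUTABLE_SERVICES (a service is executable once all of its required inputs are available). The class CompositionContext, its method names and the dataset names are hypothetical and introduced only for illustration; they are not part of SemanticSCo, which exchanges these commands as SOAP messages.

```python
from enum import Enum


class PrimitiveCommand(Enum):
    """The nine primitive commands, numbered as in the list above."""
    DISCOVER_INPUT_SEMANTICS = 1
    DISCOVER_FUNCTION_SEMANTICS = 2
    DISCOVER_SERVICES = 3
    INCLUDE_SERVICES = 4
    RESOLVE_SERVICES = 5
    VALIDATE_INPUTS = 6
    COMPOSE_SERVICES = 7
    GET_EXECUTABLE_SERVICES = 8
    ADD_TO_CONTEXT = 9


class CompositionContext:
    """Hypothetical in-memory stand-in for the state kept by the platform."""

    def __init__(self):
        self.required_inputs = {}  # service name -> set of required dataset names
        self.datasets = set()      # names of the datasets currently available

    def include_service(self, name, inputs):
        # Mirrors INCLUDE_SERVICES for a single service.
        self.required_inputs[name] = set(inputs)

    def add_to_context(self, dataset):
        # Mirrors ADD_TO_CONTEXT: a dataset becomes available.
        self.datasets.add(dataset)

    def get_executable_services(self):
        # A service is ready when every required input is already available.
        return sorted(name for name, needed in self.required_inputs.items()
                      if needed <= self.datasets)


ctx = CompositionContext()
ctx.include_service("AS1", ["normalized_data"])
ctx.include_service("MicroOneDifferentialAnalysis", ["group_a", "group_b"])
ctx.add_to_context("normalized_data")
print(ctx.get_executable_services())  # ['AS1']
```

Once the two hypothetical datasets group_a and group_b are added to the context, the second service would be reported as executable as well.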

Composite Service Enactment implements a Command Flow to support the composition strategy depicted in Fig. 2. The Command Flow consists of a workflow of primitive commands that should be issued by the user interface to perform each system activity defined in our composition strategy. Thus, Command Flow transparently defines the behaviour of SemanticSCo’s user interface.

Fig. 6 presents the Command Flow codified by SemanticSCo’s user interface. Each system activity of the composition strategy is mapped to a set of primitive commands. For example, when the biologist selects some input data, the system searches for data

Fig. 6. Command Flow codified by SemanticSCo’s user interface. System activities shown: 2. System searches for data types; 4. System searches for analysis activities; 6. System searches for services; 8. System adds service to composition; 10. System saves results.


types (activity 2). In order to perform this activity, SemanticSCo’s user interface issues the DISCOVER_INPUT_SEMANTICS primitive command.

Once a service composition is (partially) specified, the user can invoke the execution of one or more services. The execution of these services is delegated to an external execution engine. Since each service is specified as a BPMN executable process, any general-purpose BPMN execution engine can be used to support its execution. In the context of this work, we use the Activiti BPMN process engine for Java (http://activiti.org). The service execution task is delegated to Activiti and SemanticSCo only manages the information required/produced by each service during its execution. Moreover, if the execution of some service fails (e.g., due to a network error), the platform identifies and reports the failure to the user, who can execute the service again later.
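The division of responsibilities described above (an external engine runs the service, while the platform only manages data and reports failures) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the engine is modelled as a plain callable, and ServiceExecutionError, execute and the simulated failure are hypothetical names, not the Activiti API or SemanticSCo code.

```python
class ServiceExecutionError(Exception):
    """Raised by the (hypothetical) engine adapter when a run fails."""


def execute(service, engine, context, report):
    """Delegate one service run to `engine`; keep only data bookkeeping here."""
    try:
        outputs = engine(service, dict(context))  # engine receives a copy of the data
    except ServiceExecutionError as err:
        report(f"{service} failed: {err}")        # inform the user, change nothing
        return False
    context.update(outputs)                       # store the produced datasets
    return True


def make_flaky_engine():
    """Simulated engine that fails on its first call and succeeds afterwards."""
    calls = {"n": 0}

    def engine(service, inputs):
        calls["n"] += 1
        if calls["n"] == 1:
            raise ServiceExecutionError("network error")
        return {f"{service}.output": "clustered data"}

    return engine


messages = []
context = {"AS4.output": "filtered data"}
engine = make_flaky_engine()
first_try = execute("MicroHCluster", engine, context, messages.append)
second_try = execute("MicroHCluster", engine, context, messages.append)
```

In this sketch the first attempt is reported as a failure and leaves the context untouched, so the same service can simply be executed again later, as described in the text.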

Component Composition and Execution Context persistently stores user-provided datasets as well as any result produced during each service execution. Further, component Composite Service Enactment temporarily stores these datasets to optimize data retrieval.
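The combination of persistent storage and temporary storage can be illustrated with a small cache placed in front of a file-based store. The Python sketch below is only one possible arrangement, assuming JSON files as the persistent medium; the classes PersistentContext and CachingEnactment and the sample rows are hypothetical and do not reflect the actual storage format used by SemanticSCo.

```python
import json
import tempfile
from pathlib import Path


class PersistentContext:
    """Hypothetical persistent store: one JSON file per dataset."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def save(self, name, rows):
        (self.directory / f"{name}.json").write_text(json.dumps(rows))

    def load(self, name):
        return json.loads((self.directory / f"{name}.json").read_text())


class CachingEnactment:
    """Keeps datasets in memory in front of the persistent store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.store_reads = 0

    def put(self, name, rows):
        self.store.save(name, rows)  # always persisted
        self.cache[name] = rows      # and kept in memory for fast access

    def get(self, name):
        if name not in self.cache:   # cache miss: fall back to the store
            self.cache[name] = self.store.load(name)
            self.store_reads += 1
        return self.cache[name]


workdir = tempfile.mkdtemp()
enactment = CachingEnactment(PersistentContext(workdir))
enactment.put("normalized_data", [["geneA", 1.5], ["geneB", 0.7]])
enactment.cache.clear()                    # simulate a fresh session
first = enactment.get("normalized_data")   # read from the persistent store
second = enactment.get("normalized_data")  # served from memory
```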

5. SemanticSCo usage

In order to demonstrate the support provided by SemanticSCo, we first incorporated all services available at the GEAS Repository by publishing relevant service information into the platform service registry. In addition, we developed a set of general-purpose adaptation services to handle semantic and syntactic mismatches between service outputs and inputs. These services support different tasks including data filtering, mapping between different gene identifier types, data partitioning and concatenation according to user-defined experimental conditions, among others. All adaptation services were initially developed as semantic software connectors using the methodology proposed by Miyazaki et al. [37], wrapped as RESTful semantic services using our extended methodology and then incorporated into SemanticSCo. Any newly developed adaptation service can be easily incorporated into the platform following the same approach.

Next, we created three analysis scenarios to reproduce gene expression studies documented in the literature. In the first scenario, we reproduce part of a microarray study that investigates the transcription profiles of multipotent mesenchymal stromal cells isolated from multiple sclerosis patients at pre- and post-autologous hematopoietic stem cell transplantation [38]. In the second scenario, we reproduce part of an RNA-Seq study that investigates the role of both chromatin remodeller BRG1 and transcription factor MITF in the proliferation and morphology of human melanoma cells [39]. Each scenario comprises the composition and execution of different sets of services. A detailed description of each scenario is provided in [40].

This section details the creation and execution of the third analysis scenario. This scenario reproduces part of a microarray study that investigates the mechanisms associated with social eavesdropping in the brain transcriptome of zebrafish (Danio rerio) organisms [41]. In this study, Affymetrix Zebrafish Gene 1.1 ST Array data, available in Gene Expression Omnibus (GEO) under accession number GSE69719, were obtained from four behavioural groups: (i) bystanders to interacting conspecifics (BIC); (ii) bystanders attentive to non-interacting conspecifics (BANIC); (iii) bystanders inattentive to non-interacting conspecifics (BINIC); and (iv) isolated fish (ISOL).

5.1. Composition scenario for the analysis of Affymetrix data

The creation of a generic composition scenario in SemanticSCo can be performed as follows: (1) the (biologist) user includes a dataset to be analysed and specifies its semantics (ontological concept); (2) the user iteratively selects an analysis activity (ontological concept) to be performed on available data; (3) the user selects one web service from a suggested list of available services to perform each analysis activity. Once a service is associated with an analysis activity and the service inputs are available, its execution can be triggered at any moment by the user. After service execution, the resulting datasets can be locally saved by the user.

In the scenario created for the analysis of Affymetrix microarray data, we first identified a set of differentially expressed genes in the normalized microarray dataset using the MicroOneDifferentialAnalysis service. In this analysis, we applied the Student’s t-test to compare the ISOL group against the remaining groups one at a time. Next, we hierarchically clustered the pooled set of differentially expressed genes identified for the compared groups using the MicroHCluster service. The clustering was performed using the average-linkage method with Manhattan distance. The resulting clustered data were then graphically visualized with the support of the MicroHClusterViewer service. Finally, we used the GeneSetEnrichmentAnalysis service to perform a gene set enrichment analysis on the normalized microarray data. In this analysis, we applied the Generally Applicable Gene-set Enrichment (GAGE) [42] method to compare the ISOL group against the remaining groups one at a time. We considered gene sets from KEGG [31], Wikipathways [43] and Gene Ontology (GO) [44]. Adaptation services were used to facilitate the syntactical match between service outputs and inputs. Fig. 7 highlights the main steps taken to create this analysis scenario in SemanticSCo as follows.

Step 1: the biologist selects the input normalized microarray dataset comprising all experimental groups, i.e., BIC, BANIC, BINIC and ISOL. The biologist then selects the ontological concept one-color normalized microarray data to specify the semantics of the dataset. After that, the biologist selects three instances of the adaptation service AS1. Each instance is responsible for partitioning the normalized dataset into two independent datasets associated with different experimental groups: ISOL+BIC, ISOL+BANIC and ISOL+BINIC. The biologist then associates the normalized dataset with the corresponding inputs of the three instances of AS1.

Step 2: the biologist selects three instances of the MicroOneDifferentialAnalysis service to perform the activity differential analysis of one-color microarray data on each dataset produced by the corresponding instances of AS1. The biologist associates the two outputs produced by each instance of AS1 with the corresponding inputs of each instance of MicroOneDifferentialAnalysis. In this scenario, three instances of AS1 and MicroOneDifferentialAnalysis have been included considering that the resulting datasets should be concomitantly provided to the next analysis activity.

Step 3: the biologist selects the adaptation service AS4 to filter the initial normalized dataset according to the lists of gene identifiers produced by the three instances of the MicroOneDifferentialAnalysis service. The biologist then associates the normalized dataset and the three outputs of the MicroOneDifferentialAnalysis instances with the corresponding inputs of AS4. The AS4 service is also responsible for mapping the gene identifiers stored in the normalized dataset according to a user-provided mapping. Thus, the biologist selects a mapping dataset and the ontological concept gene identifiers mapping to specify the semantics of the selected dataset. The dataset is then associated with the corresponding input of AS4.

Step 4: the biologist selects the MicroHCluster service to perform the activity hierarchical clustering of microarray data and associates the output of AS4 with the corresponding input of MicroHCluster.


Step 5: the biologist selects the MicroHClusterViewer service to perform the activity dendrogram generation and associates the three outputs of MicroHCluster with the corresponding inputs of MicroHClusterViewer.

Step 6: the biologist selects the adaptation service AS5 to filter the initial normalized dataset according to the experimental groups of interest and map the gene identifiers stored in the dataset according to a user-provided mapping. The biologist then associates the previously selected normalized and mapping datasets with the corresponding inputs of AS5. Next, the biologist selects the GeneSetEnrichmentAnalysis service to perform the activity gene set enrichment analysis of gene expression data and associates the output of AS5 with the corresponding input of GeneSetEnrichmentAnalysis. The GeneSetEnrichmentAnalysis service also requires as input a user-provided gene set. Thus, the biologist selects a gene set of interest and the ontological concept gene set to specify the semantics of the selected dataset. Finally, the gene set is associated with the corresponding input of GeneSetEnrichmentAnalysis.

Fig. 8 illustrates the final structure of our analysis scenario in SemanticSCo.

After defining this analysis scenario, the biologist sequentially executes all instances of AS1. The resulting datasets are automatically provided as input for the corresponding instances of MicroOneDifferentialAnalysis, which can then be executed. Once the results of all instances of MicroOneDifferentialAnalysis are available, the AS4, MicroHCluster and MicroHClusterViewer services can then be sequentially executed. Concurrently with this analysis, services AS5 and GeneSetEnrichmentAnalysis can be sequentially executed. The datasets produced by the execution of each service instance defined in this composition scenario can be locally saved at the biologist’s discretion. In addition, the biologist can modify the defined scenario at any time.
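The execution order described above follows directly from the data dependencies defined in Steps 1 to 6. As an illustration, the Python sketch below encodes those dependencies as a graph and derives successive groups of services that can run once their predecessors have finished. It relies on the standard graphlib module (Python 3.9 or later); the function execution_waves is a hypothetical helper, and user-provided datasets (normalized data, identifier mapping, gene set) are assumed to be available from the start and are therefore not modelled as nodes.

```python
from graphlib import TopologicalSorter

# Service instance -> instances whose outputs it consumes (from Steps 1 to 6).
dependencies = {
    "AS1(1)": set(),
    "AS1(2)": set(),
    "AS1(3)": set(),
    "MicroOneDifferentialAnalysis(1)": {"AS1(1)"},
    "MicroOneDifferentialAnalysis(2)": {"AS1(2)"},
    "MicroOneDifferentialAnalysis(3)": {"AS1(3)"},
    "AS4": {
        "MicroOneDifferentialAnalysis(1)",
        "MicroOneDifferentialAnalysis(2)",
        "MicroOneDifferentialAnalysis(3)",
    },
    "MicroHCluster": {"AS4"},
    "MicroHClusterViewer": {"MicroHCluster"},
    "AS5": set(),
    "GeneSetEnrichmentAnalysis": {"AS5"},
}


def execution_waves(deps):
    """Group services into waves; each wave only needs results of earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # services whose inputs are all available
        waves.append(ready)
        sorter.done(*ready)                 # mark them as executed
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The first wave contains the three AS1 instances together with AS5, which reflects the observation that the enrichment branch can proceed independently of the differential analysis and clustering branch.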

5.2. Comparative analysis of Affymetrix data

The differential expression analysis performed using the MicroOneDifferentialAnalysis service (p < 0.05; fold-change ≥ 1.1) revealed a set of 8 up-regulated genes in the BIC group, including genes btg2, dnajb5, egr4, fos, msh4, npas4a, npas4b and nr4a1. This analysis also revealed a set of 12 differentially expressed genes in the BANIC group. In this set, genes egr4, fos, npas4a, nr4a1 and znf507 were up-regulated, while genes C25HXorf38, dap1b, ftr50 and soga3b were down-regulated. Finally, the differential expression analysis revealed a set of 6 differentially expressed genes in the BINIC group. In this set, genes osbpl1a, pcdh2ab7, pcdh2g5 and pcdhga10 were up-regulated, while genes pcdh2ab6 and ugt5c2 were down-regulated. All genes reported in the original study [41] were identified using the MicroOneDifferentialAnalysis service. Moreover, our service identified additional genes comprising 1 down-regulated gene (pcdh2ab6) in the BANIC group as well as 1 up-regulated (pcdh2g5) and 2 down-regulated genes (pcdh2ab6, ugt5c2) in the BINIC group. Similar results were also obtained regarding commonly expressed genes in BIC, BANIC and BINIC groups: 4 differentially expressed genes were shared by BIC and BANIC groups, including genes egr4, fos, npas4a and nr4a1, while 3 genes were shared by BANIC and BINIC groups, including genes pcdh2ab6, pcdh2ab7 and pcdhga10.
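The overlap between groups reported above can be checked with a simple set intersection. The Python sketch below uses only the gene symbols enumerated in the text for the BIC group and for the up-regulated genes of the BANIC group; since the text does not enumerate every differentially expressed gene of each group, the sets are to be read as the listed subsets rather than as complete results. The helper shared_genes is a hypothetical name.

```python
def shared_genes(group_a, group_b):
    """Return the gene symbols present in both groups, in alphabetical order."""
    return sorted(set(group_a) & set(group_b))


# Gene symbols as enumerated in the text (possibly partial lists).
bic_up = {"btg2", "dnajb5", "egr4", "fos", "msh4", "npas4a", "npas4b", "nr4a1"}
banic_up = {"egr4", "fos", "npas4a", "nr4a1", "znf507"}

shared = shared_genes(bic_up, banic_up)
print(shared)  # ['egr4', 'fos', 'npas4a', 'nr4a1']
```

The four shared symbols match the genes reported as common to the BIC and BANIC groups.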

In the hierarchical clustering performed using the MicroHCluster service, the BIC, BANIC and BINIC samples were separately clustered from the ISOL samples, except for a single BIC sample that was closer to ISOL than to the other groups. Similarly to the original study, the gene expression profile of BIC was closer to BANIC than to the remaining groups. Similar results were also obtained regarding the hierarchical clustering of genes, which revealed a set of 8 genes with similar expression profile across all samples: btg2, dnajb5, egr4, fos, msh4, npas4a, npas4b and nr4a1. Fig. 9 presents the graphical view of clustered data created using the MicroHClusterViewer service.
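For readers unfamiliar with the clustering configuration used in this scenario, the average-linkage criterion with Manhattan distance can be sketched in a few lines of Python. This is a didactic, unoptimized illustration and not the implementation behind the MicroHCluster service; the sample names and expression values are hypothetical toy data chosen so that the result can be verified by hand.

```python
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute coordinate differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise Manhattan distance between the members of two clusters."""
    total = sum(manhattan(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average-linkage distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        merges.append((a, b, average_linkage(a, b, profiles)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy expression profiles (two genes per sample).
profiles = {"s1": (1.0, 2.0), "s2": (1.0, 3.0), "s3": (6.0, 8.0), "s4": (7.0, 8.0)}
merges = hierarchical_clustering(profiles)
for merge in merges:
    print(merge)
```

In practice, the same configuration is available in SciPy through scipy.cluster.hierarchy.linkage with method="average" and metric="cityblock".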

The gene set enrichment analysis performed using the GeneSetEnrichmentAnalysis service (p.adj < 0.1) revealed genes enriched in most of the biological pathways and processes reported in the original study. Pathways enriched in the BIC group included cholesterol/steroid biosynthesis, exercise-induced circadian regulation, FGF signaling pathway and phototransduction. In this group, genes were also enriched in the following Gene Ontology (GO) biological processes: growth, lipid metabolic process and transcription. Pathways enriched in the BANIC group included FGF signaling pathway and oxidative phosphorylation. In this group, genes

Fig. 7. Main steps (Steps 1 to 6) taken to create the Affymetrix data analysis scenario in SemanticSCo.


were also enriched in the GO biological process visual perception. Finally, pathways enriched in the BINIC group included cholesterol biosynthesis, oxidative phosphorylation and ribosome. In this group, genes were also enriched in the GO biological processes positive regulation of transcription and RNA processing.

6. Discussion

In this work we proposed a novel approach for the semi-automatic semantic composition of gene expression analysis services. This approach rests on a four-layered architecture designed to support the proposed composition process. Each layer provides a set of elements whose functionality can be accessed by the layer immediately above and accesses the functionality provided by the elements defined in the layer immediately below. This modular structure provides a flexible solution that facilitates its modification and/or extension. Moreover, this structure promotes the use of components predefined in each layer. Finally, we developed a platform named SemanticSCo to implement the proposed architecture. This section compares the main features of SemanticSCo with similar composition environments that have been proposed in the biomedical domain. A comprehensive review of domain-independent semantic composition approaches can be found in [45].

Similarly to SemanticSCo, most of the software environments proposed in the biomedical domain follow a semi-automatic approach to service composition, by iteratively suggesting the most suitable services to be included by the user in a composition. However, most existing approaches focus only on some activities of the composition process, such as service discovery or selection. For example, Withers et al. [18] present two extensions to support semantic service discovery in the Taverna environment: the Bio-Moby and SADI plug-ins. These extensions assist users during the design of a concrete (low-level) workflow by iteratively suggesting services capable of consuming output data provided by each preceding service included in the workflow. These suggestions are only based on the semantic matching (compatibility) between services’ inputs and outputs, not explicitly considering the functionality provided by each component service. Therefore, the users should be able to select appropriate services based on their knowledge about the services available in the domain and their underlying functionality, ultimately posing a challenge for non-experienced users. Service creation/publication, requisition and selection are not discussed in the proposed extensions. In contrast,

Fig. 8. Affymetrix data analysis scenario created using SemanticSCo.

Fig. 9. Hierarchical clustering of differentially expressed genes (lines) in samples of each experimental group (columns). In the heatmap, green and red indicate high and low expression, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


SemanticSCo provides a complete solution for the semantic composition of services.

A number of platforms assume that biologists must provide a complete service composition specification before it can be deployed and executed, such as Bio-jETI [14] and Sesame [15]. However, the requirements of a biologist may change during the execution of a study according to, for example, the interpretation of some biological results. In this sense, SemanticSCo provides a more flexible support since the specification and execution activities can be interleaved in order to enable biologists to apply their newly obtained biological knowledge during the composition process.

Other platforms assume that the biologists driving the composition process have technical knowledge about the services available in the domain and their underlying functionality. For example, Dhamanaskar et al. [16] propose an extension to the Galaxy platform to support the semantic discovery and composition of web services. This extension consists of a service suggestion engine that assists users during the creation of a low-level workflow by suggesting a ranked list of services to be included in each step of the workflow design. These suggestions are computed by a path-based algorithm that considers both the desired service functionality and the semantic matching between the outputs of all preceding services in the composition chain and the input of the suggested services. Services are ranked according to the semantic similarity between the desired and the service functionality, and also between services’ inputs and outputs.

The proposed Galaxy extension is similar to SemanticSCo since it supports both forward and backward service composition, thus providing flexible support to users during the creation of their analysis workflows. However, the definition of an analysis workflow using the Galaxy extension relies on the knowledge that biologists have about the analysis tools available in the domain and their underlying functionality. In contrast, our platform was designed to support a higher level of abstraction, allowing its users to specify their requirements in a more abstract manner, i.e., in terms of desired functionality and data types that can be provided. Once an abstract workflow is (partially) specified, SemanticSCo provides mechanisms for the discovery and composition of services capable of fulfilling the requirements specified in the abstract workflow. This level of abstraction enables biologists to focus on biological research issues rather than on technical details of the available services.

Another example of a semantic-based composition approach that provides both forward and backward composition is proposed by Ba et al. [13]. This approach aims at interactively assisting users during service composition design, by suggesting services whose inputs/outputs are semantically compatible with services already included in the composition chain and automatically interconnecting the selected services. However, in contrast to the discovery mechanism provided by SemanticSCo, the functionality provided by a service is not explicitly considered during discovery.

The Sesame platform [15] provides a high level of abstraction during workflow design similarly to SemanticSCo. During the workflow specification, Sesame assists users by suggesting analysis activities that can be included at each step. These suggestions are based on the ontological relations defined between analysis activities and their inputs/outputs in a service ontology. Once an abstract workflow is completely specified by the user, Sesame suggests a list of suitable services for each analysis activity defined in the workflow. However, these suggestions are only based on the semantic matching between analysis activities and annotations that describe service functionality. As a consequence, users must solve syntactic and semantic mismatches between input and output data types.

In contrast to our semi-automatic service composition approach, some platforms have been developed to support fully-automated service composition, such as jORCA [17] and Bio-jETI [14]. In these platforms, the user specifies starting and ending data types, and the supporting system attempts to automatically generate a linear sequence of services capable of deriving the specified output from the input. Services are considered compatible and can be interconnected if the output of a prior service semantically matches the input of the next service in the composition chain. Consequently, these platforms assume that users are able to provide a complete and accurate set of requirements beforehand, which is not always true for gene expression analysis.
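The fully-automated strategy described above amounts to a search for a chain of services between a starting and an ending data type. The following Python sketch illustrates the idea with a breadth-first search; it is a generic illustration and not the algorithm used by jORCA or Bio-jETI. Two simplifying assumptions are made: each service has a single input and a single output, and semantic matching is reduced to equality of type names. All service and data type names are hypothetical.

```python
from collections import deque


def linear_composition(services, start_type, goal_type):
    """Return the shortest chain of service names deriving goal_type from start_type.

    `services` maps a service name to an (input_type, output_type) pair.
    Returns None when no chain exists.
    """
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        data_type, chain = queue.popleft()
        if data_type == goal_type:
            return chain
        for name, (consumes, produces) in services.items():
            # Assumption: a match is plain equality between data type names.
            if consumes == data_type and produces not in seen:
                seen.add(produces)
                queue.append((produces, chain + [name]))
    return None


# Hypothetical single-input, single-output services.
services = {
    "Normalizer": ("raw data", "normalized data"),
    "Clusterer": ("normalized data", "clustered data"),
    "Viewer": ("clustered data", "dendrogram"),
    "Annotator": ("normalized data", "annotated data"),
}

chain = linear_composition(services, "raw data", "dendrogram")
print(chain)  # ['Normalizer', 'Clusterer', 'Viewer']
```

The sketch also makes the limitation discussed in the text visible: the search needs both end points up front, so a user who cannot yet name the desired output type has nothing to ask for.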

Table 1 summarizes the main features provided by SemanticSCo and presents a comparison with other existing composition platforms. These features include support for the execution of RESTful (complex) services, automation level of service discovery and composition, technical abstraction level of composition process, support for all composition process activities, as well as support for composition specification and execution interleaving.

Although SemanticSCo currently uses only the services available at the GEAS Repository, the platform provides mechanisms to support both the development of new semantic web services and the publication (deployment) of these services into the platform. We provide a detailed methodology to support the systematic development of RESTful semantic web services for gene expression analysis [23]. Further, SemanticSCo supports the creation of WSDL service descriptions, the mapping of annotated service descriptions onto BPMN processes and the registration of service descriptions into the platform. Finally, SemanticSCo supports different service management activities, including the discovery, composition and execution of deployed services. The platform manages both RESTful and SOAP services, as long as they have been annotated using SAWSDL and specified using BPMN.

Usually, a biologist is interested not only in the final results obtained from a gene expression study, but also in intermediate results produced during the analysis process. Therefore, SemanticSCo provides access to all results produced during the composition process, thus facilitating their step-by-step interpretation. Another aspect of a gene expression study is reproducibility, which requires additional information regarding the context in which the study has been performed. This information can include, for example, the analysis activities that were performed on data and the parameters applied to each analysis. Although support for

Table 1
Main features provided by SemanticSCo and other composition platforms.

| Platform | REST support | Complex services support | Automation level | Composition abstraction level | Support for all composition activities | Specification/execution interleaving |
| --- | --- | --- | --- | --- | --- | --- |
| SemanticSCo | Yes | Yes | Guided | High | Yes | Yes |
| Taverna extension | Yes | No | Guided | Low | No | Yes |
| Galaxy extension | Yes | No | Guided | Low | No | Yes |
| Ba et al. | Not addressed | No | Guided | Low | No | Not addressed |
| Sesame | Not addressed | No | Guided | High | No | No |
| jORCA | Yes | No | Automated | Low | No | Yes |


reproducibility is still limited in our composition platform, the developed infrastructure already provides mechanisms for context information gathering and storage. Planned new features include facilities to help users manage data provenance.

To the best of our knowledge, no approach has been defined for the semantic composition of services in the functional genomics domain. We believe our composition approach, implemented in SemanticSCo, represents an adequate solution for gene expression analysis and its target users. Although our approach has been designed to support the requirements of the functional genomics domain, we believe it can be beneficial to other biological domains or even other science areas with similar requirements.

7. Conclusion

In this paper, we introduced a novel approach to the semantic composition of gene expression analysis services. In order to design our approach, we initially identified the main requirements for service composition in the functional genomics domain and characterised the stakeholders involved in the composition process. We then defined a layered architecture that, unlike existing approaches, provides a higher level of abstraction enabling biologists to focus on the biological study rather than on technical details of the composition process. Additionally, the proposed architecture provides support for the definition and execution of (complex) service behaviours.

Finally, we developed a composition platform named SemanticSCo to implement the proposed architecture. SemanticSCo supports all activities of the composition process through separate components, viz., Semantic Provider for service request, Service Discoverer for service discovery and selection, and Service Composer for service composition. In addition, we developed the Wsdl2Bpmn Mapper component to support the creation of (complex) semantic web services, a UDDI-based service registry and the BPMN Interpreter and BPMN Publisher components to support the automatic publication of the created services into SemanticSCo’s service registry.

Future research includes the incorporation of mechanisms to manage the information necessary to improve the reproducibility of gene expression analysis studies. SemanticSCo’s user interface can also be improved to facilitate the definition of analysis workflows by users with different computational skills. Furthermore, new services can be developed and made available in SemanticSCo to support additional analysis activities. In this sense, an increase in the number of available services may require the development of additional mechanisms to manage services’ non-functional properties.

Conflict of interest

The authors declare that they have no conflict of interests.

Acknowledgement

This work was supported by the Brazilian Ministry of Education (CAPES).

References

[1] I.S. Segundo-Val, C.S. Sanz-Lozano, Introduction to the Gene Expression Analysis, Springer, New York, 2016, http://dx.doi.org/10.1007/978-1-4939-3652-6_3. pp. 29–43.

[2] S. Wandelt, A. Rheinländer, M. Bux, L. Thalheim, B. Haldemann, U. Leser, Data management challenges in next generation sequencing, Datenbank-Spektrum 12 (3) (2012) 161–171, http://dx.doi.org/10.1007/s13222-012-0098-2.

[3] H.C. Lee, K. Lai, M.T. Lorenc, M. Imelfort, C. Duran, D. Edwards, Bioinformatics tools and databases for analysis of next-generation sequence data, Brief. Funct. Genom. 11 (1) (2012) 12–24, http://dx.doi.org/10.1093/bfgp/elr037.

[4] V. Lapatas, M. Stefanidakis, R.C. Jimenez, A. Via, M.V. Schneider, Data integration in biological research: an overview, J. Biol. Res.-Thessaloniki 22 (1) (2015) 1–16, http://dx.doi.org/10.1186/s40709-015-0032-5.

[5] S. Ghosh, Y. Matsuoka, Y. Asai, K.-Y. Hsin, H. Kitano, Software for systems biology: from tools to integrated platforms, Nat. Rev. Genet. 12 (12) (2011) 821–832, http://dx.doi.org/10.1038/nrg3096.

[6] S. Pettifer, D. Thorne, P. McDermott, J. Marsh, A. Villéger, D.B. Kell, et al., Visualising biological data: a semantic approach to tool and database integration, BMC Bioinform. 10 (Suppl. 6) (2009) 1–12, http://dx.doi.org/10.1186/1471-2105-10-S6-S19.

[7] P. Romano, Automation of in-silico data analysis processes through workflow management systems, Brief. Bioinform. 9 (1) (2008) 57–68, http://dx.doi.org/10.1093/bib/bbm056.

[8] C.E. Cook, M.T. Bergman, R.D. Finn, G. Cochrane, E. Birney, R. Apweiler, The European Bioinformatics Institute in 2016: data growth and integration, Nucl. Acids Res. 44 (D1) (2016) D20–D26, http://dx.doi.org/10.1093/nar/gkv1352.

[9] M.D. Wilkinson, B. Vandervalk, L. McCarthy, The Semantic Automated Discovery and Integration (SADI) web service design-pattern, API and reference implementation, J. Biomed. Semant. 2 (1) (2011) 1–23, http://dx.doi.org/10.1186/2041-1480-2-8.

[10] S. Pettifer, J. Ison, M. Kalas, D. Thorne, P. McDermott, I. Jonassen, et al., The EMBRACE web service collection, Nucl. Acids Res. 38 (Suppl. 2) (2010) W683–688, http://dx.doi.org/10.1093/nar/gkq297.

[11] J. Bhagat, F. Tanoh, E. Nzuobontane, T. Laurent, J. Orlowski, M. Roos, et al., BioCatalogue: a universal catalogue of web services for the life sciences, Nucl. Acids Res. 38 (Suppl. 2) (2010) W689–694, http://dx.doi.org/10.1093/nar/gkq394.

[12] D.D.G. Gessler, G.S. Schiltz, G.D. May, S. Avraham, C.D. Town, D. Grant, et al., SSWAP: a simple semantic web architecture and protocol for semantic web services, BMC Bioinform. 10 (1) (2009) 1–21, http://dx.doi.org/10.1186/1471-2105-10-309.

[13] M. Ba, S. Ferré, M. Ducassé, Safe suggestions based on type convertibility to guide workflow composition, in: F. Esposito, O. Pivert, M. Hacid, W.Z. Rás, S. Ferilli (Eds.), Foundations of Intelligent Systems. 22nd International Symposium ISMIS; 2015 Oct 21–23; Lyon (France), Springer International Publishing, Cham, 2015, pp. 230–236, http://dx.doi.org/10.1007/978-3-319-25252-0_25.

[14] A.-L. Lamprecht, The Bio-jETI framework, in: User-Level Workflow Design: A Bioinformatics Perspective, Lecture Notes in Computer Science, vol. 8311, Springer, Berlin, 2013, pp. 31–61, http://dx.doi.org/10.1007/978-3-642-45389-2_2.

[15] L. Zhang, Y. Wang, P. Xuan, A. Duvall, J. Lowe, Y. Wang, et al., Sesame: a new bioinformatics semantic workflow design system, in: IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2013 Dec 18–21; Shanghai (China), IEEE, 2013, pp. 504–508, http://dx.doi.org/10.1109/BIBM.2013.6732546.

[16] A. Dhamanaskar, M.E. Cotterell, J. Zheng, J.C. Kissinger, C.J. Stoeckert Jr., J.A. Miller, Suggestions for Galaxy workflow design using semantically annotated services, in: Proceedings of the 7th International Conference on Formal Ontology in Information Systems (FOIS); 2012 Jul 24–27; Graz (Austria), 2012, pp. 29–42.

[17] J. Karlsson, O. Trelles, jORCA and Magallanes sailing together towards integration of web services, in: A.T. Freitas, A. Navarro (Eds.), Bioinformatics for Personalized Medicine. 10th Spanish Symposium JBI; 2010 Oct 27–29; Torremolinos (Spain), Lecture Notes in Computer Science, vol. 6620, Springer, Berlin, 2012, pp. 94–101, http://dx.doi.org/10.1007/978-3-642-28062-7_11.

[18] D. Withers, E. Kawas, L. McCarthy, B. Vandervalk, M. Wilkinson, Semantically-guided workflow construction in Taverna: the SADI and BioMoby plug-ins, in: T. Margaria, B. Steffen (Eds.), Leveraging Applications of Formal Methods, Verification, and Validation. 4th International Symposium on Leveraging Applications (ISoLA); 2010 Oct 18–21; Heraklion (Greece), Springer, Berlin, 2010, pp. 301–312, http://dx.doi.org/10.1007/978-3-642-16558-0_26.

[19] M. DiBernardo, R. Pottinger, M. Wilkinson, Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework, J. Biomed. Inform. 41 (5) (2008) 837–847, http://dx.doi.org/10.1016/j.jbi.2008.02.005.

[20] J. Leipzig, A review of bioinformatic pipeline frameworks, Brief. Bioinform. (2016), http://dx.doi.org/10.1093/bib/bbw020.

[21] O. Spjuth, E. Bongcam-Rudloff, G.C. Hernández, L. Forer, M. Giovacchini, R.V. Guimera, et al., Experiences with workflows for automating data-intensive bioinformatics, Biol. Direct 10 (1) (2015) 1–12, http://dx.doi.org/10.1186/s13062-015-0071-8.

[22] E.G. da Silva, L. Ferreira Pires, M. van Sinderen, A-DynamiCoS: a flexible framework for user-centric service composition, in: Proceedings of the IEEE 16th International Enterprise Distributed Object Computing Conference (EDOC); 2012 Sep 10–14; Beijing (China), IEEE, 2012, pp. 81–92, http://dx.doi.org/10.1109/EDOC.2012.19.

[23] G.D.A. Guardia, L. Ferreira Pires, R.Z.N. Vêncio, K.C.R. Malmegrim, C.R.G. de Farias, A methodology for the development of RESTful semantic web services for gene expression analysis, PLoS ONE 10 (7) (2015) 1–28, http://dx.doi.org/10.1371/journal.pone.0134011.

[24] J.P. Almeida, A. Baravaglio, M. Belaunde, P. Falcarin, E. Kovacs, Service creation in the SPICE service platform, in: Proceedings of the 17th Wireless World Research Forum Meeting (WWRF17); 2006 Nov, Wireless World Research Forum, Heidelberg, 2006, pp. 1–7.

[25] World Wide Web Consortium, SOAP Version 1.2 Part 1: Messaging Framework, second ed., 2007.

[26] World Wide Web Consortium, Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language, 2007.

[27] Object Management Group, Business Process Model and Notation (BPMN) Version 2.0.2, 2013.

[28] R.T. Fielding, Architectural Styles and the Design of Network-based Software Architectures (Ph.D. Thesis), University of California, Irvine, 2000.

[29] World Wide Web Consortium, OWL 2 Web Ontology Language Structural Specification and Functional-Style Syntax, second ed., 2012.

[30] World Wide Web Consortium, Semantic Annotations for WSDL and XML Schema, 2007.

[31] M. Kanehisa, S. Goto, KEGG: kyoto encyclopedia of genes and genomes, Nucl. Acids Res. 28 (1) (2000) 27–30, http://dx.doi.org/10.1093/nar/28.1.27.

[32] J.B.W. Wolf, Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial, Mol. Ecol. Resour. 13 (4) (2013) 559–572, http://dx.doi.org/10.1111/1755-0998.12109.

[33] D.K. Slonim, I. Yanai, Getting started in gene expression microarray analysis, PLoS Comput. Biol. 5 (10) (2009) 1–4, http://dx.doi.org/10.1371/journal.pcbi.1000543.

[34] OASIS, UDDI Version 3.0.2, 2004.

[35] OASIS, Using WSDL in a UDDI Registry, Version 2.0.2 - Technical Note, 2004.

[36] M. Paolucci, T. Kawamura, T.R. Payne, K. Sycara, Semantic matching of web services capabilities, in: I. Horrocks, J. Hendler (Eds.), The Semantic Web - ISWC 2002. First International Semantic Web Conference; 2002 Jun 9–12; Sardinia (Italy), Lecture Notes in Computer Science, vol. 2342, Springer, Berlin, 2002, pp. 333–347, http://dx.doi.org/10.1007/3-540-48005-6_26.

[37] F.A. Miyazaki, G.D.A. Guardia, R.Z.N. Vêncio, C.R.G. de Farias, Semantic integration of gene expression analysis tools and data sources using software connectors, BMC Genom. 14 (Suppl. 6) (2013) S2, http://dx.doi.org/10.1186/1471-2164-14-S6-S2.

[38] G.L.V. de Oliveira, K.W.A. de Lima, A.M. Colombini, D.G. Pinheiro, R.A. Panepucci, P.V.B. Palma, et al., Bone marrow mesenchymal stromal cells isolated from multiple sclerosis patients have distinct gene expression profile and decreased suppressive function compared with healthy counterparts, Cell Transplant. 24 (2) (2015) 151–165, http://dx.doi.org/10.3727/096368913X675142.

[39] P. Laurette, T. Strub, D. Koludrovic, C. Keime, S.L. Gras, H. Seberg, et al., Transcription factor MITF and remodeller BRG1 define chromatin organisation at regulatory elements in melanoma cells, eLife 4 (2015) e06857, http://dx.doi.org/10.7554/eLife.06857.

[40] G.D.A. Guardia, Suporte ao desenvolvimento e à composição de serviços web semânticos para a análise de expressão gênica (Ph.D. Thesis), University of São Paulo, 2016. <http://www.teses.usp.br>.

[41] J.S. Lopes, R.A. de Abreu, R.F. Oliveira, Brain transcriptomic response to social eavesdropping in zebrafish (Danio rerio), PLoS ONE 10 (12) (2015) 1–21, http://dx.doi.org/10.1371/journal.pone.0145801.

[42] W. Luo, M.S. Friedman, K. Shedden, K.D. Hankenson, P.J. Woolf, GAGE: generally applicable gene set enrichment for pathway analysis, BMC Bioinform. 10 (1) (2009) 1–17, http://dx.doi.org/10.1186/1471-2105-10-161.

[43] T. Kelder, A.R. Pico, K. Hanspers, M.P. van Iersel, C. Evelo, B.R. Conklin, Mining biological pathways using WikiPathways web services, PLoS ONE 4 (7) (2009) 1–4, http://dx.doi.org/10.1371/journal.pone.0006447.

[44] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, et al., Gene Ontology: tool for the unification of biology, Nat. Genet. 25 (1) (2000) 25–29, http://dx.doi.org/10.1038/75556.

[45] E.M. Goncalves da Silva, User-centric Service Composition - Towards Personalised Service Composition and Delivery (Ph.D. Thesis), University of Twente, 2011.