DAPNA: an architectural framework for data processing networks


Hasan Sözer (1), Sander Nouta (2), Andreas Wombacher (2), and Paolo Perona (3)

(1) Özyeğin University, İstanbul, Turkey

(2) University of Twente, Enschede, The Netherlands

(3) Institute of Environmental Engineering, EPFL, Lausanne, CH

Abstract. A data processing network is a set of (software) components connected through communication channels to apply a series of operations on data. Realization and maintenance of large-scale data processing networks necessitate an architectural approach that supports analysis, verification, implementation and reuse. However, existing tools and architectural styles fall short of supporting all these features. In this paper, we introduce an architectural style and framework for documenting and realizing data processing networks. Our framework employs reusable and composable data filters. These filters are annotated with their deployment information. The overall architecture is specified with an XML-based architecture description language. The specification is processed by a toolset for analysis and code generation. The framework has been utilized for defining and realizing an environmental monitoring application.

Keywords: software architecture, framework, style, software reuse, data processing networks

1 Introduction

Data collection, processing and interpretation are essential tasks for almost all scientists. Automating these tasks is important (and sometimes necessary) to save time and prevent errors. Unfortunately, not all scientists have the necessary computer science background to realize this automation. They have to use standard tools, create scripts for different data processing steps and integrate them manually. This approach is not always adequate. In some research domains, there are too many data processing tasks. The integration of these tasks is subject to complex inter-task dependencies and timing constraints. Moreover, the processing tasks can be distributed among multiple computers at remote locations. The CCES project RECORD [18] is an example of such a case, where scientists need to monitor and analyze data regarding a river and its surroundings to calibrate their environmental models. Such cases demand the development of a data processing network (DPN), which we define as a set of (software) components connected through communication channels to apply a series of operations on data.


The realization of a DPN should be documented with an architecture description to facilitate communication, analysis and verification, and to support implementation and reuse. Otherwise, realizing and maintaining a large-scale DPN design can be cumbersome. DPNs can be documented with a set of existing architectural styles [11] such as the Pipe-and-Filter style [6] or its variants [16]. These styles can be sufficient for documentation and communication purposes. Analysis and verification can also be supported if the architecture is formally specified [2]. However, the lack of an architectural framework hinders automated analysis, reuse and code generation.

In this paper, we introduce an architectural framework called DAPNA for documenting, analyzing and realizing DPNs. The framework employs composable data filters. These filters can be reused both within the same project and across different projects. The DPN architecture is described using a style that we have defined as a specialization of the Pipe-and-Filter style. This style also incorporates deployment information in terms of annotations on data filters, which can be distributed on the Internet or on intranets. The overall DPN architecture is specified with an XML-based architecture description language (ADL). The specification is processed by a toolset for analysis and code generation. The framework has been utilized for defining and realizing an environmental monitoring application in the context of the CCES project RECORD [18] case study.

The remainder of this paper is organized as follows. In Section 2, we describe DAPNA and its utilization for analysis and realization of a DPN. In Section 3, we discuss the case study on environmental monitoring to which DAPNA was applied. In Section 4, related previous studies are summarized. Finally, in Section 5, we discuss future work and provide the conclusions.

2 The DAPNA framework

The overall process for realizing a DPN with DAPNA is depicted in Figure 1. The designer specifies the DPN using an XML-based ADL, for which we have devised an XML schema [13]. The DAPNA framework [13] processes such a description for analysis and code generation. First, it analyzes the provided architecture description to check its conformance to the architectural style and to detect any constraint violations. It also checks whether the specified DPN is connected, with no missing links. If an error is detected, the analysis results are provided to the designer. Otherwise, DAPNA generates a set of deployment packages. These packages should be copied to and executed on the corresponding deployment nodes as specified in the architecture description.

We have performed a domain analysis [3] to define the basic concepts and their relations pertaining to a DPN. By investigating the literature [12] and example applications [18], we have defined a domain model [13]. Based on this domain model and the existing Pipe-and-Filter style [6], we have defined our ADL that defines the set of elements and relations taking part in a DPN architecture description. In addition to the basic concepts such as pipes and filters, we have introduced the concept of binding to support the (hierarchical) composition of filters. Furthermore, we have defined particular types of data sinks and data sources that are common in a DPN. DPN elements, relations and their properties can be found in [13].

Fig. 1. The overall process for DPN realization.

DAPNA comprises a reusable library of communication primitives, data source and sink definitions, and data filters. This library can be further extended with user-defined, composable data filters. In addition to the filter elements, DAPNA provides several reusable abstractions for data sources, serialization options, random data generators (for training purposes) and conditional data flows. An overview of the framework elements can be found in [13]. The framework hides many implementation details from the user. For instance, the utilization of communication mechanisms (i.e., TCP/IP sockets) is taken care of by the framework according to the topology specified in the documented DPN. Furthermore, the input ports, transformations and output ports are all equipped with standard recovery mechanisms [13]. For example, if a node within the DPN is temporarily unavailable, the other nodes retransmit messages within a predefined interval. When the number of (unsuccessful) attempts exceeds a threshold, the node failure is reported to the user.
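The recovery behaviour described above (retransmit at a predefined interval, report a failure once a threshold of unsuccessful attempts is reached) can be illustrated with a minimal Python sketch. This is not DAPNA's implementation: the names `send_with_retry` and `NodeFailure`, and the default values for the threshold and the interval, are assumptions made for illustration only.

```python
import time
from typing import Callable


class NodeFailure(Exception):
    """Hypothetical error raised when a node stays unreachable."""


def send_with_retry(send: Callable[[], bool],
                    max_attempts: int = 3,
                    interval_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds and return the number of attempts used.

    send() is assumed to return True on success and False on failure.
    After max_attempts unsuccessful attempts, NodeFailure is raised so
    that the failure can be reported to the user.
    """
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        if attempt < max_attempts:
            # Wait for the predefined interval before retransmitting.
            sleep(interval_s)
    raise NodeFailure(f"node unreachable after {max_attempts} attempts")
```

Passing the `sleep` function as a parameter keeps the sketch easy to exercise without real waiting; a real system would additionally have to decide what happens to the buffered message after the failure is reported.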

In the following, we first explain the specification of a DPN with DAPNA. Then, in the next section, we discuss the application of DAPNA on a case study. The following listing shows the overall structure of a DPN description specified with our XML-based ADL [13].

    <project>
      <filters>
        <filter name="SampleFilter"
                runat="127.0.0.1" buffer="10">
          <input> ... </input>
          <transformation> ... </transformation>
          <output> ... </output>
        </filter>
        ...
      </filters>
    </project>

Listing 1.1. The basic structure of a DPN description.

The description consists of a root element called <project>, which contains a <filters> element. The <filters> element consists of a set of <filter> elements. The <filter> element has three attributes: name, runat and buffer (optional). The first attribute specifies the filter name and must be unique. The second attribute gives the IP address of the machine on which the filter will be deployed. The last attribute indicates the maximum number of data packages inside the internal buffer. A <filter> element has three sub-elements: an <input> element, a <transformation> element and an <output> element.
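The structural rules stated above (a <project> root, unique filter names, a runat attribute, an optional numeric buffer, and the three sub-elements) lend themselves to a mechanical check. The following Python sketch shows one way such a check could look using the standard library; it is not the DAPNA analysis tool, and the function name `check_filters` and the wording of the messages are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

# A complete, minimal description following the structure of Listing 1.1.
SAMPLE = """
<project>
  <filters>
    <filter name="SampleFilter" runat="127.0.0.1" buffer="10">
      <input/>
      <transformation/>
      <output/>
    </filter>
  </filters>
</project>
"""


def check_filters(xml_text: str) -> List[str]:
    """Return a list of problems found; an empty list means none."""
    root = ET.fromstring(xml_text.strip())
    if root.tag != "project":
        return ["root element must be <project>"]
    problems: List[str] = []
    seen = set()
    for f in root.findall("./filters/filter"):
        name = f.get("name")
        if not name:
            problems.append("filter without a name attribute")
            continue
        if name in seen:
            problems.append(f"duplicate filter name: {name}")
        seen.add(name)
        if f.get("runat") is None:
            problems.append(f"{name}: missing runat attribute")
        buffer = f.get("buffer")  # optional attribute
        if buffer is not None and not buffer.isdigit():
            problems.append(f"{name}: buffer must be a non-negative integer")
        for child in ("input", "transformation", "output"):
            if f.find(child) is None:
                problems.append(f"{name}: missing <{child}> element")
    return problems
```

Calling `check_filters(SAMPLE)` returns an empty list for the description above. In practice, validating against the XML schema mentioned in [13] would cover most of these rules; the sketch only makes them explicit.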

The <transformation> element defines how the incoming data packets will be processed. Here, one of the predefined transformations (e.g., merge) can be used. Alternatively, an external application can be utilized to process data packets. The following listing shows an example transformation definition that makes use of MATLAB [4].

    <transformation type="external">
      <command>matlab.exe</command>
      <parameters>
        <parameter>-nodesktop</parameter>
        <parameter>-wait</parameter>
        <parameter>-r</parameter>
        <parameter>load('%f'); correlatePixels(); save -v6 '%f'; exit;</parameter>
      </parameters>
      ...
    </transformation>

Listing 1.2. A transformation definition that makes use of an external application.

Note that the type attribute is set to external. The external application is specified in the <command> element, and its arguments are listed within the <parameters> element.
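To illustrate how an external transformation definition could be turned into a command line, the following Python sketch reads the <command> and <parameter> elements into an argument list. It rests on an assumption that the text does not state: that `%f` is a placeholder for the file holding the current data packet. The function name `build_argv` and the file name used in the usage note are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# The definition from Listing 1.2, without the elided part.
EXTERNAL = """
<transformation type="external">
  <command>matlab.exe</command>
  <parameters>
    <parameter>-nodesktop</parameter>
    <parameter>-wait</parameter>
    <parameter>-r</parameter>
    <parameter>load('%f'); correlatePixels(); save -v6 '%f'; exit;</parameter>
  </parameters>
</transformation>
"""


def build_argv(xml_text: str, data_file: str) -> List[str]:
    """Build an argument list from an external transformation definition.

    Assumption: every '%f' in a parameter stands for data_file.
    """
    node = ET.fromstring(xml_text.strip())
    if node.get("type") != "external":
        raise ValueError("not an external transformation")
    command = (node.findtext("command") or "").strip()
    if not command:
        raise ValueError("missing <command> element")
    params = [(p.text or "").strip().replace("%f", data_file)
              for p in node.findall("./parameters/parameter")]
    return [command] + params
```

For example, `build_argv(EXTERNAL, "frame.mat")` yields a list starting with `matlab.exe`, which could be handed to `subprocess.run` without involving a shell. Keeping the arguments as a list avoids quoting problems with the MATLAB statement in the last parameter.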


The other two sub-elements of a <filter> element are the <input> and <output> elements. These elements comprise (possibly) multiple <port> elements. An example <port> element is listed in the following.

    <port name="output" type="gateway">
      <conditions>
        <condition name="weekdays">
          <time type="lessequal">
            <format>M</format> <value>5</value>
          </time>
        </condition>
        <condition name="weekend">
          <not> <reference name="weekdays"/> </not>
        </condition>
      </conditions>
      <pipes>
        <pipe destination="Node1.input" condition="weekend" type="tcp">
          <portnumber>1444</portnumber>
        </pipe>
      </pipes>
    </port>

Listing 1.3. An OutputPort of type gateway.

Each <port> element has two attributes: name and type. The name must be unique. The output can be stored in a file, presented on the screen, or sent to another filter for further processing. In the example listing above, the port type is defined as gateway, which means that the output should be passed to the connected and active pipes that are defined in the <pipes> element as part of the <port> element. The <pipes> element consists of a set of <pipe> elements, each of which has three attributes: destination, condition (optional) and type. Pipes can connect filters residing at both local and remote hosts. The condition attribute contains the name of a condition that determines whether the pipe accepts data packages.
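The check for missing links mentioned at the beginning of this section can be sketched on top of this port and pipe structure. The Python example below is a minimal illustration, not the DAPNA analysis itself. It assumes that a destination such as Node1.input names a filter and one of its input ports, and that condition names are local to the port that defines them. The filter name Camera, the type given to the input port, and the function name `check_links` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

DPN = """
<project>
  <filters>
    <filter name="Camera" runat="127.0.0.1">
      <input/>
      <transformation/>
      <output>
        <port name="output" type="gateway">
          <conditions>
            <condition name="weekdays">
              <time type="lessequal"><format>M</format><value>5</value></time>
            </condition>
            <condition name="weekend">
              <not><reference name="weekdays"/></not>
            </condition>
          </conditions>
          <pipes>
            <pipe destination="Node1.input" condition="weekend" type="tcp">
              <portnumber>1444</portnumber>
            </pipe>
          </pipes>
        </port>
      </output>
    </filter>
    <filter name="Node1" runat="127.0.0.1">
      <input><port name="input" type="tcp"/></input>
      <transformation/>
      <output/>
    </filter>
  </filters>
</project>
"""


def check_links(xml_text: str) -> List[str]:
    """Report pipes and condition references that point at nothing."""
    root = ET.fromstring(xml_text.strip())
    filters = root.findall("./filters/filter")
    # Every "<filter name>.<input port name>" that a pipe may target.
    inputs = {f"{f.get('name')}.{p.get('name')}"
              for f in filters for p in f.findall("./input/port")}
    problems: List[str] = []
    for f in filters:
        for port in f.findall("./output/port"):
            where = f"{f.get('name')}.{port.get('name')}"
            known = {c.get("name")
                     for c in port.findall("./conditions/condition")}
            for ref in port.iter("reference"):
                if ref.get("name") not in known:
                    problems.append(
                        f"{where}: unknown reference {ref.get('name')}")
            for pipe in port.findall("./pipes/pipe"):
                dest = pipe.get("destination")
                if dest not in inputs:
                    problems.append(f"{where}: dangling pipe to {dest}")
                cond = pipe.get("condition")  # optional attribute
                if cond is not None and cond not in known:
                    problems.append(f"{where}: unknown condition {cond}")
    return problems
```

For the description above, `check_links(DPN)` returns an empty list; changing the destination to a filter that does not exist produces a "dangling pipe" message. The sketch deliberately does not evaluate the conditions themselves, since the semantics of the <time> element are defined in [13] rather than here.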

3 Case Study

The CCES project RECORD [18] is investigating the effects of restoring a river on flora, aquatic fauna, river morphology and ground water. One aspect of the project is the development of a model for river morphology changes. To evaluate the proposed models, the researchers installed two towers with cameras that take pictures on a regular basis, at least once per day, over three years. During floods, the rate of taking pictures has to be increased significantly. The pictures document the gravel bars in a restored river segment. From these pictures, the shape of the gravel bar must be inferred using image processing techniques. These shapes can then be compared with the shapes derived from the morphology models. The large number of pictures requires automation of the image processing. Since image processing is based on classification of image properties, the classification has to be trained, especially after bigger changes in the morphology; thus, this is again a repetitive process. A detailed description of the hydrological aspects concerning image processing can be found in [15].

Fig. 2. A data processing view for the CCES project RECORD [18] case study.

Figure 2 shows an example DPN architecture with a graphical notation [13]. This is a simplified example used within the CCES project RECORD [18] case study. Here, the DPN consists of a training part (upper part of Figure 2) and an analysis part (lower part of Figure 2). The Training step consists of two filters. The Reference Area filter determines the areas within the picture that are most discriminative for differentiating water and non-water pixels. In the Pixel Correlation filter, relationships between pixels are inferred, i.e., if a pixel is classified as water, then all its correlated pixels are water. The Training step takes as input a collection of Training Data provided by a filter and a set of configuration Settings, also provided by a filter. The two inputs are assembled in a Merge filter, where each training image is associated with the configuration settings, comparable to a database join operation.

The Analyze step takes as input the merged configuration Settings information, the output of the Training step and the pictures Acquired from the towers. First, the Colour-based Pixel Classification filter uses the RGB value distributions derived in the training step to classify pixels as water or non-water. This result is then further processed by applying the derived pixel correlations from the training step. The outcome of the Analyze step is sent to the Store filter, which stores the incoming information in the file system without further processing. In this example, the data processing is not distributed. The filters related to training and analysis are deployed on the same node, labelled as Server; however, the acquisition of data is performed at a remote site (i.e., Project Site).

Documenting a DPN in conformance to an architecture style makes the data processing steps, their relationships and the overall topology explicit. As such, it facilitates the communication of architectural design decisions. Furthermore, it supports domain-specific analysis, code generation and reuse.

We have evaluated the usability of the DAPNA framework in the context of the CCES project RECORD [18]. At the time of writing, a functional DPN has not yet been deployed in the field. However, we have designed a DPN and utilized the DAPNA framework to realize the corresponding system on a local testbed with real data flow. The system has processed over a thousand images of the use case within a day. We have observed that the high-level definition of the pipes makes it easy to chain the various transformations. As a drawback, we noticed that the users were not completely comfortable with editing XML documents. Hence, we plan to realize a graphical editor in the Eclipse GMF framework. Another challenge is the definition of locally installed tools and their integration with the DPN. It turned out that executing external scripts on different platforms and operating systems required different parameters in the DPN specification.

4 Related Work

A comparison of architectural styles for network-based software architectures can be found in [9]. In this work, after a classification and comparison of the existing styles, the Representational State Transfer (REST) style for distributed hypermedia systems is introduced. In comparison with ours, the REST style is more focused on the network properties instead of the data processing aspects. Depending on the interest of the stakeholders, both styles can be utilized as part of an architecture description.

Many algorithms have been developed for stream processing [12] and in particular for image processing [14, 17]. Our approach provides the means to document the utilization, composition and configuration of these solutions as part of the software architecture design.

Workflows can be created with a workflow management system like Kepler [10] or Taverna [19] using an intuitive drag-and-drop interface. A workflow is defined as the automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules [7]. A DPN is in fact a workflow, but a workflow is not always a DPN. DAPNA is specialized for defining and realizing DPNs. In addition, DAPNA supports distributed deployment and processing by definition. Workflows may implicitly support some decentralization aspects, but a central system is usually needed to execute the workflow.


5 Conclusions and Future Work

We have introduced DAPNA, an architectural framework for documenting, analyzing and realizing data processing networks (DPNs). DAPNA abstracts away many implementation details, while enabling designers to develop, configure and deploy DPNs. As such, our approach makes the development of DPNs less effort-consuming and less error-prone. In the future, we will investigate possibilities for performing dynamic analysis and runtime integration of DAPNA with analysis tools such as MATLAB [4].

Acknowledgments

This work has been carried out as part of the CCES project RECORD [18].

References

1. van der Aalst, W.M.P., Weijters, A.J.M.M.: Process mining: a research agenda. Computers in Industry 53(3), 231–244 (2004)

2. Abowd, G., Allen, R., Garlan, D.: Formalizing style to understand descriptions of software architecture. ACM Transactions on Software Engineering and Methodology 4(4), 319–364 (1995)

3. Arrango, G.: Domain analysis methods. In: Schafer, Prieto-Diaz, R., Matsumoto, M. (eds.) Software Reusability, pp. 17–49. Ellis Horwood (1994)

4. Bishop, R.H.: Modern Control Systems Analysis and Design Using MATLAB and SIMULINK. Addison Wesley (1996)

5. Bodenstaff, L., Wombacher, A., Wieringa, R., Jaeger, M.C., Reichert, M.: Monitoring service compositions in mode4sla - design of validation. In: Cordeiro, J., Filipe, J. (eds.) ICEIS (4). pp. 114–121 (2009)

6. Clements, P., Bachmann, F., Bass, L., Garlan, D., Ivers, J., Little, R., Nord, R., Stafford, J.: Documenting Software Architectures: Views and Beyond. Addison-Wesley (2002)

7. Workflow Management Coalition: The workflow reference model. Document Number TC00-1003, Issue 1.1 (January 1995)

8. Dashofy, E., van der Hoek, A., Taylor, R.: A highly-extensible, xml-based architecture description language. In: Proceedings of the Working IEEE/IFIP Conference on Software Architectures. Amsterdam, The Netherlands (2001)

9. Fielding, R.: Architectural styles and the design of network-based software architecture. Ph.D. dissertation, Department of Information and Computer Science, University of California, Irvine (2000)

10. Kepler: The kepler project. http://kepler-project.org (August 2011)

11. Monroe, R., Kompanek, A., Melton, R., Garlan, D.: Architectural styles, design patterns, and objects. IEEE Software 14(1), 43–52 (1997)

12. Muthukrishnan, S.: Data streams: Algorithms and applications. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms. (January 2003)

13. Nouta, C.: Data processing networks made easy. M.Sc. thesis, Department of Computer Science, University of Twente, Enschede, The Netherlands (2011)

14. Pal, N., Pal, S.: A review on image segmentation techniques. Pattern Recognition 26, 1277–1294 (1993)

15. Pasquale, P., Perona, P., Schneider, P., Shrestha, J., Wombacher, A., Burlando, P.: Modern comprehensive approach to monitor the morphodynamic evolution of restored river corridors. Hydrology and Earth System Sciences Discussions 7, 8873–8912 (2010)

16. Shaw, M., Clements, P.: A field guide to boxology: Preliminary classification of architectural styles for software systems. In: Proceedings of the 21st International Computer Software and Applications Conference. pp. 6–13. Washington, DC, USA (1997)

17. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)

18. Swiss Experiment: Record:home - swissexperiment (2011), http://www.swiss-experiment.ch/index.php/Record:Home

19. Taverna: Taverna - open source and domain independent workflow management system. http://www.taverna.org.uk (August 2011)