
A benchmark for water column data compression

ing. Bouke Grevelt

August 27, 2017, 65 pages

Supervisor: dr. ir. A.L. Varbanescu

Host organisation: Qps B.V., http://www.qps.nl, Zeist, +31 (0)30 69 41 200

Contact: Jonathan Beaudoin, PhD, Fredericton, +1 506 454 4487, beaudoin@qps.nl

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
Preface
1 Introduction
  1.1 Problem definition
  1.2 Approach
  1.3 Document structure
2 Related Work
  2.1 Water column data compression
  2.2 Benchmarking & test benches
  2.3 Corpus creation
3 Water Column Corpus
  3.1 Existing corpora
    3.1.1 Calgary corpus
    3.1.2 Canterbury corpus
    3.1.3 Ekushe-Khul
    3.1.4 Silesia corpus
    3.1.5 Prague Corpus
  3.2 Corpus creation method
  3.3 Results
4 Metrics
  4.1 Generic compression metrics
  4.2 Real-time compression
  4.3 Processing
  4.4 Cost reduction
  4.5 Overview
5 Test bench
  5.1 Requirements
  5.2 Structural design
    5.2.1 Test execution logic
    5.2.2 Input files
  5.3 Compression algorithm
  5.4 Compressed file / decompressed file
  5.5 Results
    5.5.1 Configuration file
  5.6 Behavioral design
    5.6.1 Test execution
    5.6.2 Metrics computation
6 Benchmark
  6.1 Publication
  6.2 Rules
7 Using the test bench
  7.1 Installing the test bench
  7.2 Adding a new algorithm to the test bench
    7.2.1 The algorithm to add
    7.2.2 Building a shared library
    7.2.3 Exposing the algorithm in Python
    7.2.4 Implementing the interface
    7.2.5 Adding the algorithm to the test bench configuration
  7.3 Configuring the test bench
  7.4 Running the test bench
8 Results and Analysis
  8.1 Algorithms
  8.2 Generic compression metrics
    8.2.1 Deviation in (de)compression time
    8.2.2 LZMA compression time
    8.2.3 Jpg2k compression ratio
    8.2.4 Random access decompression compared to full file decompression
    8.2.5 Jpg2k performance
  8.3 The processing metric
  8.4 The real-time metric
  8.5 The cost reduction metric
9 Conclusion & Future work
  9.1 Threats to validity
    9.1.1 File properties considered in corpus selection
    9.1.2 Restriction of candidate files for corpus selection
    9.1.3 Hardware differences between acquisition and processing
  9.2 Future work
    9.2.1 Water Column Corpus inclusion criteria
    9.2.2 Improve performance
    9.2.3 Data generation
    9.2.4 Lossy compression
    9.2.5 Hardware differences
  9.3 Community involvement
Bibliography
A Water column data
B Generic Water column Format
  B.1 Purpose
  B.2 Structure
  B.3 Conversion


Abstract

Multibeam echosounders are devices that use acoustic pulses to measure the topography of the sea floor. Modern multibeam echosounders make the raw data that is used to determine the sea floor topography available to the user. This data is referred to as water column data. The scientific community has identified many applications for water column data that improve and augment current hydrographic applications. Because of its large size compared to other hydrographic data, surveyors often choose not to store water column data. Compression of water column data has been identified as a possible solution to this problem, and multiple compression methods have been proposed. As there is currently no clear definition of how to measure the performance of water column compression algorithms, it is unclear what the current state of the art is.

In this work, we show that benchmarking can be used to evaluate the performance of water column compression algorithms. We present the Water Column Corpus: a methodically selected, representative set of water column data files in the public domain to be used as the input data for the benchmark, together with a set of metrics used to measure compression algorithm performance. A test bench was developed and published in the public domain to compute the presented metrics for the files in the Water Column Corpus. Four generic algorithms and one water column data specific compression algorithm are included in the test bench. The results clearly distinguish the different algorithms based on their performance.


Preface

This thesis describes my research project at Quality Positioning Services (QPS). QPS has been my employer since before I started the Software Engineering program at the University of Amsterdam (UvA).

Acknowledgements

First and foremost I need to thank Ana Lucia Varbanescu for her inexhaustible guidance and support throughout this whole endeavor. Thanks to Jonathan Beaudoin for always finding the time for a chat or a review and of course for coming up with the topic. Thanks to Duko van der Schaaf for supporting me when I wanted to work less to go back to school. Thanks to Duncan Mallace & Tom Weber for helping me find specific water column data. Thanks to QPS for supporting me throughout the program and with the thesis especially. But most of all thanks to Charlotte for putting up with me for two years of stress, complaints and preoccupation.


Chapter 1

Introduction

QPS B.V., hereafter referred to as 'the host organization', is a company that builds and sells hydrographic software. One of the main purposes of this software is to gather and display information on the depth and form of the seafloor. This is referred to as bathymetric data. A number of different types of devices are capable of gathering bathymetric data. The type of device most commonly used for this purpose is the multibeam echo sounder (MBES). This device emits sound waves to the seafloor and determines the location of the seafloor based on the received echo. The technique is similar to ultrasounds used in the medical field. Many modern multibeam echosounders make the raw data that was used to compute the bathymetry available to the user. This data is referred to as water column data.

Figure 1.1: Water column data in grey with seafloor detections in green and red

The scientific community has identified opportunities to use water column data for a number of purposes, including:

• Improved noise recognition [Cla06]
• Improved bottom detection [Cla06]
• Improved ability to estimate minimum clearance over wrecks [Cla06], [CRBW14]
• Study of marine life [CRBW14]
• Study of sub-aquatic gas vents [CRBW14]
• Study of suspended sediment [CRBW14]

These novel applications provide opportunities for the host organization to improve and expand their software suite.

A problem with water column data is its volume. With data rates of several gigabytes per hour [Bea10], storage requirements drastically increase when water column data is to be stored in addition to the data more commonly stored during hydrographic acquisition. Beaudoin states that "A ten-fold increase in data storage requirements is not uncommon" [Bea10]. Moszynski et al. state that "the size of water column data can easily exceed 95% of all data collected by a multibeam system" [MCKL13]. The additional costs incurred by the high volume of water column data are prohibitive to the collection of this data [MCKL13][Bea10][APR+16]. The host organization believes that reducing the volume of water column data will make the collection of the data less prohibitive and thus allow end users to benefit from future water column data related innovations.

The problem of water column data size has been noted in the scientific community. Applying compression to the data has been brought forward as a possible solution to this problem. Beulens et al. [BWSP06] note that data size is one of the challenges in water column data processing and state that "making use of a data compression scheme is the preferred approach" to solving this issue. Moszynski et al. [MCKL13], Beaudoin [Bea10], and Amblas et al. [APR+16] have proposed algorithms for water column compression.

1.1 Problem definition

The field of water column compression currently consists of three publications, by Beaudoin [Bea10], Moszynski et al. [MCKL13], and Amblas et al. [APR+16]. The authors of all three publications present results that show that their proposed algorithm outperforms some general purpose compression algorithm. As all of the publications use different sets of input files and none includes a direct comparison against any of the other water column specific compression algorithms, it is unclear what the current state of the art in water column compression is.

Furthermore, because none of the publications discuss their motivation for the selection of the input data used in the experiment, the scope of the results is unclear.

Therefore, the research question addressed in this thesis is ’How to evaluate the performance of lossless water column data compression algorithms?’

1.2 Approach

According to Sim et al. [SEH03], the lack of emphasis on validation and comparison with other solutions observed in the water column compression field is typical for immature scientific communities. The authors state that benchmarking can have a positive effect on the maturity of a scientific community, as the creation of a benchmark "requires a community to examine their understanding of the field, come to an agreement on what are the key problems, and encapsulate this knowledge in an evaluation" [SEH03, p. 1].

In this work, we present our vision for a benchmark for water column data compression based on the three components of a benchmark according to Sim et al. [SEH03]: motivation for the comparison, task sample, and performance measures. The motivation for comparison is the desire to know the current state of the art in water column compression (as discussed in section 1.1). The task sample consists of a set of input files that are selected using an empirical method based on the method for corpus creation presented by Arnold & Bell [AB97]. Performance measures are selected based on a literature review of related work, further augmented with new measures which we believe are relevant to the domain.

The positive effect of benchmarking discussed by Sim et al. [SEH03] depends on a collaborative effort to create a benchmark within that community. We therefore invite members of the community to respond, contribute and collaborate on this work in order to get to a community driven benchmark for water column compression.

To facilitate contributions from the scientific community, we provide not only the definition of the benchmark but also a test bench for the evaluation of the benchmark. This test bench facilitates computation of the performance measures over the task sample.

This work focuses solely on the lossless compression of water column data. Although the domain may benefit from lossy compression, the lack of widely used water column processing applications makes it hard to determine which types and amounts of loss are acceptable. Possible future work on the evaluation of lossy compression algorithms is described in section 9.2.4.

(8)

1.3 Document structure

In chapter 2, we review related work and its impact on our approach.

In chapter 3 we present the Water Column Corpus, a set of water column files to be used as the input for compression algorithms, selected to be representative of the real-world data water column compression algorithms may encounter. The Water Column Corpus is what Sim et al. [SEH03] refer to as the ’task sample’ of the benchmark.

Chapter 4 presents the metrics that are to be computed for each algorithm in the benchmark. The metrics are what Sim et al. [SEH03] refer to as the 'performance measures'.

Chapter 5 presents the design of a test bench that computes the metrics specified in chapter 4 over a collection of input files.

In addition to the three components of a benchmark presented by Sim et al. [SEH03] (motivation for the comparison, task sample, and performance measures), we present two more components in chapter 6 that we believe to be important for a benchmark: a set of rules that algorithms to be included in the benchmark should adhere to, and the method of benchmark result publication.

As part of this work, the design described in chapter 5 has been implemented. Chapter 7 describes how this test bench can be used. This includes running the test bench and adding a new algorithm to the set of algorithms used by the test bench.

In chapter 8, the results of running the test bench are presented and analyzed. Chapter 9 contains the conclusion, threats to validity, and future work.


Chapter 2

Related Work

In this chapter, we present relevant existing work, roughly divided into sections based on three different research areas.

2.1 Water column data compression

In [MCKL13], Moszynski et al. describe a method to compress MBES data (which includes water column data) based on Huffman coding. They propose two adaptations to standard Huffman coding to improve performance:

• Create a Huffman tree once for each message type instead of for every message. The authors assume that probabilities will be the same among different messages of the same type.

• Encode water column amplitude data in its true resolution instead of the file format’s resolution which is often too high.

The average compression ratio is approximately 1:3, and the method outperforms the general purpose compression methods ZIP, 7-ZIP and RAR in both compression rate and compression time. There are some exceptions, specifically when the settings of the multibeam echosounder are changed during the survey. In that case, the Huffman tree is no longer optimal (since it was created for the first packet) and the compression ratio drops.

The authors use two files for validation. Both originate from the same multibeam echosounder (a Reson 7125) and contain ”relatively flat and homogenous bathymetry” [MCKL13, p. 81]. The results obtained this way may not be representative for other systems and other types of survey (e.g., fish schools, wrecks, or gas plumes).

The authors note that their proposed solution (including file format) is specifically suited for large files as ”Reading the structure of the compressed file and retrieving the information such as the total number of datagrams, original size of particular datagrams and their types is also possible without the need of decoding the whole compressed dataset.” [MCKL13, p. 81]. However, decompression times, either for a single record or the complete file, are not part of the results.


In [Bea10], the author uses the JasPer [AK00] implementation of the JPEG2000 algorithm to compress the amplitude data in water column data. Lossless encoding leads to a compression rate of approximately 1:1.5. Compression rates up to 1:10 are attainable when lossy compression is used, while still "yielding very little loss of resolution or introduction of artifacts" [Bea10, p. 14].

The author uses data from a single type of multibeam echosounder (a Kongsberg EM3002) that was collected over a wreck. The results obtained this way may not be representative for other types of systems or surveys (e.g., fish schools, homogeneous bathymetry, or gas plumes). The source of the data is of specific interest as the proposed algorithm only compresses amplitude samples. For Kongsberg data, this constitutes the majority of the water column data. For other data formats, however, it does not. The Reson s7k water column encoding, for instance, may contain phase data, which typically has approximately the same size as the amplitude data. The performance of the proposed algorithm on data in the Kongsberg encoding may therefore not be representative of the algorithm's performance on other file formats.

The author does not include information on compression time or decompression time for the algorithm.

In [APR+16], the authors apply FAPEC, a compression algorithm initially developed for space data communications, to water column data. FAPEC uses entropy encoding and "includes mechanisms for the efficient compression of sequences with repeated values" [APR+16, p. 47]. The proposed algorithm uses a pre-processing pipeline tailored to MBES water column data.

The test results show a compression rate of approximately 1:1.7. The algorithm was considerably faster than any of the general purpose compression techniques that were evaluated (Gzip, Bzip2, Rar & 7Zip), with FAPEC being more than twice as fast as the runner-up.

The authors use two files for validation. Both originate from the same type of multibeam echosounder (a Kongsberg EM710) and correspond to "a relatively smooth and homogeneous bathymetry" [APR+16, p. 47], but one of the files includes a wreck. The results may not be representative for other types of systems or other types of surveys (e.g., fish schools or gas plumes).

Summarizing: The results presented in these works are incomparable and may not be representative for the complete domain. Our benchmark should provide representative and comparable results for all included algorithms.

2.2 Benchmarking & test benches

In [DHL15], the authors present a benchmark framework for data compression. The intended algorithms are lightweight memory compression algorithms for databases.

A benchmark is defined by the user as a combination of data generation parameters and the algorithms that should be included in the benchmark. The authors place specific emphasis on the performance of the framework. They aim to get maximum performance by reducing redundant actions in the framework's execution.

The framework uses standardized interfaces (not otherwise described in the work) that algorithms should conform to for easy inclusion in the framework.

As we want our test bench to have high performance, we should prevent redundant actions. Whether or not that requires an approach similar to the ’sophisticated approach’ described by the authors will have to become clear during the research.

The benchmark specification file offers an easy interface for the user to select which algorithms to include in a run. The use of standardized interfaces for the algorithms makes it easy to include new algorithms into the framework. Both features would be valuable in our test bench as well.

In [Swa08], the author proposes a single measure for the efficiency of data compression and storage. This measure is based on compression ratio, compression time, and decompression time. As many algorithms are asymmetric in compression and decompression performance, such a measure requires information on the expected ratio of compression and decompression actions for a single file.

The single measure for the efficiency of data compression presented is a cost measure. The author defines measures for the profits of storing data and the costs for storing data. The overall efficiency measure is the ratio of these two measures.

We want the benchmark for water column compression to include a single metric representative of the algorithm's performance. A cost measure would appear to be a logical choice for such a measure. For water column compression, the ratio of compression and decompression actions is interesting as well, both because the data may be decompressed multiple times (if processed on multiple systems), and because (part of) the compressed data may not be decompressed at all (if the data was stored only in case it was needed). Water column compression adds some complexity compared to data storage in the sense that the user of the compression algorithm is likely to want to compress the data as it is generated. This means that not only the relation between compression time and decompression time is important, but also the relation between compression time and generation time.

In [GVI+14], the authors are faced with a problem similar to the one addressed in this work: a relatively immature field (graph processing) which does not have a clear method for performance analysis.

As a step towards the goal of creating a benchmark suite, seven challenges for creating such a suite are presented:

• Evaluation process
How to design the benchmark process in such a way that a 'fair' comparison of platforms can be attained. This relates to rules and/or definitions for data format, workflows, multi-tenancy, and tuning.

• Selection and design of performance metrics
Which metrics are of interest, and how these can be normalized to directly compare runs on different hardware.

• Dataset selection
How to select a dataset that is representative, but also able to stress bottlenecks of the platforms.

• Algorithm coverage
How to select a representative, reduced set of graph-processing algorithms.

• Scalability
How to make the benchmark deal with platforms ranging from super-computer to small-business scale.

• Portability
How to balance the required features of the benchmark suite against the amount of work it takes to make a platform "benchmark ready".

• Reporting of results
Ideally the benchmark produces a single metric that represents the performance of the platform. The authors believe that such a metric will be hard or impossible to find, as no platform can offer the best performance over the whole dataset, even when only a single metric is evaluated.

As the fields of water column compression and graph processing differ significantly, not all challenges apply to a benchmark for water column compression. We specifically believe that algorithm coverage and scalability are not relevant in our domain. The former because the water column compression domain consists of a single type of algorithm (the compression algorithm), and the latter because we believe that there will not be much diversity in the types of platforms that will run water column compression algorithms.

Portability can be an important factor in the adoption of the benchmark: if the work required for the inclusion of an algorithm into the benchmark is too large, it is likely that the benchmark will not be adopted by algorithm implementers. The authors suggest that benchmarking graph-processing algorithms requires re-implementation of the algorithm in the domain of the benchmark. For the water column data compression benchmark, we want to look into methods that do not require such re-implementations.

In [CHI+15], a benchmark is introduced for graph-processing platforms (based on the vision detailed in ”Benchmarking Graph-Processing platforms: A Vision” [GVI+14]).


The benchmark uses a choke-point based design; the problems that challenge the current technology are collected and described early in the design process [Bon14]. These choke-points are identified by experts in the community. The benchmark uses both generated and real-world input data.

We believe that the field of water column compression is not mature enough for the choke-points in water column compression to be identified. Therefore a choke-point based benchmark, although an interesting concept for future work, is not feasible at this time.

The combination of real-world data and generated data could be beneficial for the water column data compression benchmark. Using real-world data would increase the credibility of the benchmark. Data generation could be used to show how certain factors of the input data affect the different compression algorithms.

This work unfortunately contains no reference to the 'normalized metrics' referred to in the preceding publication [GVI+14].

In [Pow01], the author describes the creation of a test bench for compression algorithms that compresses a number of widely known corpora, among which the Calgary and Canterbury corpora.

The test bench is employed by the author to maintain a benchmark for generic compression algorithms. The test bench is periodically run and the results are published on a website. The downside of this method is that only algorithms that have been built for UNIX can be included in the test bench. Also, the algorithms to be included in the benchmark need to be made available to the maintainer of the benchmark.

The author states that it is impossible to make sure that different compression tests are run under the same conditions due to differences in processor speed and system resources. The author claims to counter this by reporting all speed measurements relative to the speed measurement on the same file by the UNIX compress utility.

It is unclear if the normalization employed by the author is meant to normalize for hardware differences (e.g. different platforms running the test bench) or differences in the state of the system between different test runs. However, we believe that reporting performance relative to the performance of another algorithm is not a good approach for either, because it assumes that both algorithms scale with system change in the same way. This is not necessarily the case. Differences in parallelization, for instance, are likely to result in different scaling behavior as the CPU load changes between runs.

Summarizing: The primary difference between our benchmark and those presented in this section is the difference in domain. Water column data compression performance evaluation requires metrics specific to the domain, such as the ability of an algorithm to compress the data at the rate at which it is generated. Similar to the test benches described in this section, our test bench should be designed with a focus on performance, portability, and ease of use.

2.3 Corpus creation

In [AB97], the authors look into the reliability of empirical evaluations for lossless compression (of text). They indicate that the ’Calgary corpus’ which was commonly used for that purpose at the time, had become outdated and voice a concern that new compression algorithms may be tailored to that corpus.

The goal of a corpus of files is to facilitate a one-to-one comparison of compression methods by creating a small, publicly available, sample of ’likely’ files for the compression purpose. A corpus should conform to the following criteria:

• Be representative of the files that are likely to be used for the compression.
• Be widely available and in the public domain.


• Perceived to be valid and useful (otherwise it will not be used). This can be attained by including widely used files and publishing the inclusion procedure for the corpus.

• Actually be valid and useful. The performance of compression algorithms on files in the corpus should reflect the typical performance of the algorithms.

The authors created a new corpus for file compression using the following steps:

1. Select a large group of files that are relevant for inclusion in the corpus (all public domain).
2. Divide the files into groups according to their type (e.g. HTML, C code, English text).
3. Select a small number of representative files per group:
   (a) Create a scatter plot of the file size before and after compression of the files using a number of compression algorithms.
   (b) Fit a straight line to the points using ordinary regression techniques.
   (c) For each group, pick a file that is closest to the fitted line.

The authors note that due to the large deviation between files, the absolute compression ratio of the corpus may not be representative, but the relative compression ratio is representative.

In [IR08], Islam and Rajon propose a corpus for evaluating compression algorithms for Bengali text. The authors claim that the necessity of such a corpus arises because results on corpora in English have little significance for text in Bengali. The corpus was created in a way similar to that used by Arnold and Bell [AB97]. Like them, the authors start out with a candidate data set that is categorized. Islam and Rajon pre-process their candidate files to remove all non-Bengali text and convert the files to Unicode encoding before compression. They use type-to-token ratio to select the files for inclusion in the corpus rather than compression ratio. They select two files for each category: the one with the lowest TTR and the one with the highest TTR.

Although the authors show that there is a relation between TTR and compression ratio, they don’t explain why they choose TTR over compression ratio. Most interesting is their choice to select files with the lowest and highest TTR instead of the approach used by Arnold and Bell who use linear regression to find the most representative file in a category. Selecting the files with the lowest and highest TTR would appear to select the most unrepresentative files instead.

In his master's thesis, Řezníček evaluates existing corpora for the evaluation of compression methods and constructs a new one [Jak10]. The author believes a new corpus to be necessary to overcome a number of problems in the Calgary, Canterbury, and large Canterbury corpora: namely, the lack of large files (in the Calgary and Canterbury corpora), the over-representation of English text, the lack of files that are a concatenation of large projects, and the lack of medical images and databases.

The methodology for creating this corpus is very similar to the methodology proposed by Arnold and Bell, but includes a method to update the corpus.

The method to update the corpus is of specific interest, as the rest of the methodology is practically the same as that used by Arnold & Bell [AB97]. It is very likely that a corpus of water column data will become outdated at some point, because the corpus should be representative of real-world data, and the real-world data continually changes because of advances in technology. We do see a potential problem in the method presented by Řezníček: any modification to the corpus should be versioned, but as there is no central agent responsible for versioning, nor a centralized location that contains all versions of the corpus, the updater of the corpus has no means to determine what the latest version of the corpus is. This induces the risk of multiple versions of the corpus having the same version number.


Summarizing: The methodology used for the creation of corpora is either not described or based on that presented by Arnold & Bell [AB97] with slight modifications. New corpora are proposed not because authors disagree with the methodology used for previous corpora, but because the previous corpora have become outdated.


Chapter 3

Water Column Corpus

In order for a test bench to evaluate the performance of compression algorithms, it needs to contain data to be compressed. In order for the test bench to have credibility the data should be representative of the data the algorithm encounters during normal operation. At the same time, the volume of data used by the test bench should be as low as possible to guarantee swift execution.

The test bench can generate its own input data, use a set of input files or use a combination of both. The advantage of data generation is that it provides the opportunity to specifically change certain properties of the input data in isolation. This allows the user of the test bench to review the effects of these individual properties of the data on the compression algorithm performance.

Our test bench, in its first version, includes a corpus of real world data to be used as input for the algorithms: the Water Column Corpus. A data generator for water column data does not currently exist. Based on our analyses of water column data, a generator will be developed and included in a future release of the test bench.

In this chapter, we review the methods used for the creation of other corpora, present the method that we have used for the Water Column Corpus, and finally present the Water Column Corpus itself.

3.1 Existing corpora

Although we are the first to define a corpus for water column compression, a number of corpora have been defined for general purpose compression. In this section, we will look at existing corpora and the method used to create them.

3.1.1 Calgary corpus

Probably the most widely known corpus for lossless data compression is the Calgary corpus, created by Bell et al. for evaluating a number of lossless compression algorithms [BCW90]. The corpus consists of 14 files of 9 distinct types. Bell et al. provide no information on the method used for selection of the files other than that the file types are 'common on computer systems' [BCW90, p. 583]. The Calgary corpus was made publicly available and was frequently used for lossless compression evaluation in the 1990s [Jak10, p. 44].

3.1.2 Canterbury corpus

The Canterbury corpus was published in 1997 by Arnold & Bell [AB97] as a reaction to issues in the Calgary corpus. The authors argue that the corpus has become outdated, was not constructed methodically, and indicate a concern that new compression algorithms may be tailored to the content of the corpus.

The authors argue that a proper corpus should consist of a small sample of likely files. However, determining the likeliness of data is precisely the problem in compression algorithm development. Because of this, files for empirical validation are often selected haphazardly from files that are available. The authors claim that this way of working reduces repeatability and validity. They present a methodology for creating a corpus of test data for compression methods based on the following criteria:

• The corpus should be representative of the files that are likely to be encountered in real-world applications.

• The corpus should be widely available and in the public domain. • The corpus should not be larger than necessary.

• The corpus should be perceived to be valid and useful. This can be attained by including widely used files and publishing the inclusion procedure for the corpus.

• The corpus should actually be valid and useful. The performance of compression algorithms on files in the corpus should reflect the typical performance of the algorithms.

Based on these criteria a method for creating a corpus is presented that consists of the following steps:

1. Select a large number of candidate files from the public domain.
2. Divide the files into groups according to their type.
3. Compress all files.
4. Use linear regression to determine the correlation between compressed and uncompressed file size.
5. Choose the file that is closest to the regression line in each category to be included in the corpus.

3.1.3 Ekushe-Khul

Islam & Rajon propose a corpus for evaluating compression algorithms for Bengali text [IR08]. The authors claim that the necessity of such a corpus arises because results on corpora in English have little significance for text in Bengali. The method used for corpus creation is similar to that used for the Canterbury corpus. Islam & Rajon make the following changes to the methodology presented by Arnold & Bell:

• Prior to compression, all files are stripped of non-Bengali text and converted to Unicode.
• File selection is based on type-to-token ratio rather than compression ratio.
• For each group, the files with the highest and lowest type-to-token ratio are selected, whereas Arnold & Bell select the file with the compression rate closest to the regression line.

3.1.4 Silesia corpus

As part of his dissertation, Sebastian Deorowicz presents the Silesia corpus [Deo03] for lossless compression. The author believes a new corpus to be necessary to overcome a number of problems in the Calgary, Canterbury, and large Canterbury corpora: namely, the lack of large files (in the Calgary and Canterbury corpora), the over-representation of English text, the lack of files that are a concatenation of large projects, and the lack of medical images and databases. Furthermore, the author states that the Canterbury corpus is specifically faulty because it contains a file whose "specific structure causes different algorithms to achieve strange results (...) the differences in the compression ratio on this file are large enough to dominate the overall average corpus ratio" [Deo03, p. 92].

The Silesia corpus is meant to be used in combination with the Calgary corpus. Therefore, the Silesia corpus contains files of the types missing from that corpus. The author provides no insight into the method used for selecting files to be included in the corpus.


3.1.5 Prague Corpus

In his master's thesis, Jakub Řezníček presents the Prague Corpus [Jak10]. The author includes a verbose methodology (similar to the one used for the Canterbury corpus) and instructions on how to update the corpus. The method for corpus creation contains the following steps:

• Define file types (e.g. image, database, binary) that are frequently used in practice. These will be the groups used for the corpus.

• Collect candidate files that are real (not randomly generated) and can be placed in the public domain. Use as many sources as possible to prevent similarity.

• Decompress internally compressed files (such as PDFs).
• Divide the candidate files into groups.

• Subgroup the candidate files based on file extension.

• Compress all files using several (at least three) compression programs using different algorithms.
• For each subgroup that contains at least 15 files, and optionally for each complete group:

– Determine the correlation between uncompressed and compressed file sizes using linear regression for each compression method.

– Select the file closest to the regression line for each compression method.

– From the selected files, select the one with the lowest compression ratio for inclusion in the corpus.

If one wants to update the Prague corpus, the same method should be followed. The contributor may then choose to either make changes to, or completely replace the corpus. The old corpus should always be kept available. The new corpus should be documented in a report (an optional template is provided) and the new corpus should be published on the Internet.

3.2 Corpus creation method

In this section, we present a method for the creation of a corpus of water column data. The method is based on that proposed by Arnold & Bell for the Canterbury corpus [AB97]. Although a number of new corpora have been suggested to replace the Calgary corpus, most of them still use the method presented by Arnold & Bell (with slight adaptations) for constructing the new corpus. Those that do not, do not report any method at all for selecting files.

The only remark we have found that could be considered criticism of the method presented by Arnold & Bell is that of Deorowicz on the inclusion of an XML file that has such different compression characteristics across different compression algorithms that it dominates the overall corpus ratio [Deo03, p. 92]. Deorowicz, however, provides no evidence for this statement and we have not been able to find similar statements in literature or a web search.

Like the authors of other corpora, we have adapted the method proposed by Arnold & Bell for our purposes. The following adaptations have been made to the original method:

• As there is little water column data in the public domain, it is not feasible to collect a sufficiently large sample of files that is in the public domain. Instead, we will select candidate files regardless of whether they are in the public domain. Once files have been selected for inclusion in the corpus, we will request permission to put the file in the public domain from the owner. If the owner does not grant permission, the file will be removed from the candidate files and a new file will be selected.

• Although Arnold & Bell describe the selection process of representative files for the categories in detail, they provide little insight into the categorization of candidate files. We believe that the chosen categories serve two purposes: stating which types of files are important in the field and reducing the search scope for candidate files. As we start with a limited set of data, there is no need to limit our search scope. The purpose of categorization in our case is thus only to state which types of water column data are important in the domain.

We take a number of categories from the categories used by the 'Shallow Survey' conference, an important conference in the domain at which multibeam echosounder manufacturers are asked to gather data over various types of objects. From this data the categories survey, water seep, wreck, and target have been taken. We have added two categories that we believe to be important in the domain: fish schools and gas vents. Figure 3.1 provides example data for each of the included categories.

• Files that contain water column data often also contain other types of maritime data. Also, one format supports data from multiple devices in one file whereas the other formats do not. In order to get comparable files, the files need to be processed to remove all non-water column data, and files that contain data from different devices need to be split.

Because the current state of the art in water column compression cannot be identified (which is part of the problem we are addressing with the test bench proposed in this thesis), we cannot use the state of the art water column compression algorithm to determine the compression ratio for the files. Instead, we use the Lempel-Ziv-Markov chain algorithm (LZMA), as this algorithm (used by and often referred to as '7-Zip') is used by all authors of water column compression algorithms to compare their algorithm's performance against, and it either has the best compression ratio of all the algorithms compared against [MCKL13], [APR+16], or is the only algorithm the author compares against [Bea10].

After these adaptations, our process for creating a corpus of water column data comprises the following steps (a code sketch of the selection steps follows the list):

1. Gather a set of files that contain water column data (either proprietary or in the public domain).
2. Process the files to remove all non-water column data and split files when needed.
3. Divide the files into groups according to their category.
4. Compress all files using the LZMA algorithm.
5. Use linear regression to determine the correlation between compressed and uncompressed file size for each group.
6. Choose the file that is closest to the regression line for each group.
7. Ask the owner of the file for permission to put the file in the public domain.
8. If permission is granted, the file is selected for inclusion in the corpus. If not, the file is removed from the set of candidate files and the process is repeated from step 4.
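To make the selection concrete, the sketch below implements steps 4 through 6 for a single group of candidate files. It is illustrative only: the thesis used the pylzma binding, while this sketch uses Python's standard library lzma module (also with default settings), and the function name and the use of numpy for the regression are our own choices.

```python
import lzma

import numpy as np

def pick_representative(paths):
    """Return the candidate file whose compression behaviour lies closest
    to the group's regression line (steps 4-6 above)."""
    originals, compressed = [], []
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        originals.append(len(data))
        compressed.append(len(lzma.compress(data)))  # step 4: LZMA, default settings
    x = np.asarray(originals, dtype=float)
    y = np.asarray(compressed, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # step 5: fit compressed vs. original size
    residuals = np.abs(y - (slope * x + intercept))
    return paths[int(np.argmin(residuals))]  # step 6: file closest to the line
```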

We strongly agree with Řezníček [Jak10] that it is very likely for any corpus to become outdated, and as such it requires a method for updating and/or replacing the current version. We propose a high-level approach so as not to restrict the solution space for any future iteration:

The Water Column Corpus should always have a maintainer. The maintainer is responsible for the publication of the Water Column Corpus. The publication includes versioning of the corpus. Publications should include the methodology used for corpus selection (or a reference to a previous version of the corpus if the same methodology is used) and the corpus itself. All versions of the Water Column Corpus should be published at a single location.

3.3 Results

Files containing water column data were gathered from the host organization's intranet, a number of industry specialists, and the scientific community. After processing, this yielded a set of 1066 candidate files with a total size of just over 600 GB.


Figure 3.1: Water column data categories included in the corpus. Panels: (a) Survey, (b) Gas vent, (c) Fish, (d) Target, (e) Water seep, (f) Wreck.

The data was grouped based on meta-data when available. If this was not possible, we used visual inspection to determine the category of the data. As this is a slow process, the timeline of the project did not allow us to classify all of the candidate files. For maximum efficiency, we focused on the categorization of the larger sets of files. 768 out of 1066 files were categorized. An overview of the categories and the number of candidate files assigned to each can be found in table 3.1.

Category     N    Water column contains
Fish         8    A school of fish
Vent         15   A stream of bubbles
Survey       415  Little else than the return from the sea floor
Target       142  A small (<2 m) object
Wreck        86   A sunken object (e.g., the wreck of a ship)
Water seep   92   A seep of fresh water into salt water
Total        768

Table 3.1: Categorized candidate files

All candidate files were compressed using LZMA. We made use of the open source LZMA SDK (http://www.7-zip.org/sdk.html) and its Python binding pylzma (https://pypi.python.org/pypi/pylzma). All files were compressed using the default settings of the algorithm.

For each group, a scatter plot was created of the original file size versus the compressed file size. Using linear regression, a line was fitted to the data to determine the representative compression rate for the group. The results can be found in figure 3.2, which shows distinct groups with a linear relation for all categories except the wreck category, which appears to contain two groups. Closer inspection showed that a set of files had been included that did include wreck data, but in which the surveyed area was relatively large compared to the wreck. As a result, only a small part of the water column data actually contained a wreck. Therefore, the data was not representative of the category and the files were removed from the candidate files.

Figure 3.2: Original vs compressed size of 768 categorized candidate files

Table 3.2 shows the results of the categorization effort. Correlation and deviation numbers are very similar to those presented by Arnold and Bell. Table 3.3 shows the files that were selected to be included in the corpus. We obtained permission to put all these files in the public domain, so there was no need to reiterate the process for any of the categories.

Bin          N    Avg. compression rate   Std. deviation   Correlation   Best match ratio
Fish         8    0.7223                  0.01             0.9991        0.7195
Survey       415  0.5479                  0.11             0.9812        0.5477
Target       142  0.6546                  0.02             0.9966        0.6541
Vent         15   0.4774                  0.10             0.9947        0.4833
Water seep   92   0.2621                  0.05             0.9930        0.2622
Wreck        58   0.5041                  0.04             0.9949        0.5053

Table 3.2: Categorized compression properties

Category     Best match                                    Size (MB)
Fish         20060922 232920 7125 (400kHz) filtered.s7k    4293.3
Survey       0003 20140910 094146 True North.wcd           1241.4
Target       20140724 105715 filtered.s7k                  1023.2
Vent         0006 20090402 210804 ShipName.wcd             363.2
Water seep   0046 20110309 215521 IKATERE.wcd              551.7
Wreck        0012 - Bathy plus WCD 0 from qinsy db.wcd     40.4

Table 3.3: Selected files for the corpus

To verify that the corpus is representative, we computed the average compression ratios of LZMA, ZIP, and BZ2 on both the complete set of candidate files and the corpus. The results are shown in table 3.4. There is some deviation in the average compression ratios between the candidate files and the corpus. This is similar to the results obtained by Arnold and Bell [AB97, p. 207] and likely due to the relatively high deviation in compression ratios for files in a group (as shown in table 3.2). This means that our corpus, like the Canterbury corpus, is unreliable for absolute performance estimates.

Fortunately, like the Canterbury corpus, the relative compression ratios are very similar: ZIP has a compression ratio that is 118.8% of the compression ratio of LZMA for the complete set of candidate files, and 118.6% for the corpus. Similarly, the compression ratio of BZ2 is 102.5% of that of LZMA for the complete set of candidate files and 102.4% for the corpus. Therefore the corpus is reliable for relative compression performance estimation.

             Average compression ratio
Algorithm    Candidate files   Water Column Corpus
LZMA         0.5113            0.5287
ZIP          0.6078            0.6268
BZ2          0.5241            0.5413

Table 3.4: Average compression ratios of candidate files and corpus


Chapter 4

Metrics

In this chapter, we present the metrics that will be reported by the test bench. The test bench will include both generic compression metrics, which are presented in the first section, and water column data specific metrics, which are presented in the second section.

Specifics on the implementation of metric computation in the test bench can be found in chapter 5.

4.1 Generic compression metrics

We studied publications on water column compression, generic compression evaluation, and recent publications on lossless compression algorithms to see which metrics are reported. Compression ratio was reported by all, and compression time and/or decompression time by most. All of these metrics are included in the test bench:

CR = S_c / S_o

where

CR is the compression ratio.
S_c is the size of the compressed data.
S_o is the size of the original data.

The two timing metrics are:

T_c, the time required to compress the data.
T_d, the time required to decompress the compressed data.
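As a minimal sketch of how these three metrics can be measured, the function below times a compress/decompress round trip for one file held in memory. The compress and decompress callables are placeholders for whatever algorithm is under test (e.g. lzma.compress and lzma.decompress from the Python standard library); the actual test bench interface is described in chapter 7.

```python
import time

def generic_metrics(compress, decompress, data):
    """Measure CR, T_c and T_d for a single input file."""
    start = time.perf_counter()
    packed = compress(data)
    t_c = time.perf_counter() - start   # compression time T_c

    start = time.perf_counter()
    restored = decompress(packed)
    t_d = time.perf_counter() - start   # decompression time T_d

    assert restored == data             # the algorithm must be lossless
    cr = len(packed) / len(data)        # CR = S_c / S_o
    return cr, t_c, t_d
```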

4.2 Real-time compression

An important feature of a compression algorithm for water column data is its ability to compress data at the rate it is generated by the multibeam echosounder during data acquisition. Inability to compress the data ’on the fly’ means that the data will have to be stored uncompressed first and later be compressed. This separate compression step takes up time and thus induces cost.

The ability of an algorithm to perform real-time compression depends not only on the performance of the algorithm, but also on the environment in which it is executed. As compression will be one of many tasks executed by hydrographic acquisition software in parallel, the algorithm will have access to limited resources.

The test bench includes the 'real-time compression' metric: the ratio of the average water column record generation interval to the average water column record compression time:


RTC = T_rga / T_rca

where

T_rga is the average water column record generation interval.
T_rca is the average water column record compression time.

The average water column generation interval can be computed from the generation times included in the water column records:

T_rga = (T_wcl − T_wcf) / (N − 1)

where

T_rga is the average water column generation interval.
T_wcf is the generation time of the first water column record.
T_wcl is the generation time of the last water column record.
N is the number of water column records.

The average water column record compression time depends on the performance of the compression algorithm in the real-time environment:

T_rca = T_c(e) / N

where

T_rca is the average water column record compression time.
N is the number of water column records in the file.
T_c(e) is the compression time of the file in computation environment e.

The computation environment is defined by two user parameters: the memory and the CPU capacity available to the algorithm during compression. More details on how the test bench emulates the real-time environment can be found in section 5.6.2.

RTC has a range of (0, ∞), where:

• 1 indicates that on average the compression of water column records requires the same time as it takes for the hardware to generate them.

• < 1 indicates that the compression of water column records on average takes more time than it takes for the hardware to generate them. Thus real-time compression is not possible.

• > 1 indicates that on average water column records can be compressed in less time than it takes for the hardware to generate them. Thus real-time compression is possible.
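Putting the three equations together, the metric can be computed as in the sketch below, assuming per-record generation timestamps (in seconds) read from the water column records and a whole-file compression time measured in the emulated acquisition environment of section 5.6.2.

```python
def real_time_compression(generation_times, file_compression_time):
    """RTC = T_rga / T_rca; a value above 1 means real-time compression is possible."""
    n = len(generation_times)
    t_rga = (generation_times[-1] - generation_times[0]) / (n - 1)  # average generation interval
    t_rca = file_compression_time / n                               # average per-record compression time
    return t_rga / t_rca
```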

4.3 Processing

After data has been stored during acquisition, it is transferred to another department where it is processed. Using water column compression adds overhead to the processing phase, as data needs to be decompressed prior to its processing. Although there is a theoretical advantage of decreased disk space requirements due to compression, we believe that these cost savings are negligible when compared to the overhead in processing time.

The processing metric is included to indicate the overhead induced in processing due to the application of water column data compression. It is defined as the product of the decompression time required for processing and the number of times the file will be decompressed in the processing phase:

P = T_dp ∗ N_dp

where

T_dp is the decompression time for processing.
N_dp is the number of decompressions.

The number of decompressions N_dp on a single workstation can be reduced by storing the decompressed data in between different processing sessions. As it is common for hydrographic processing software to offer up disk space in favor of processing speed, we assume that processing software will follow this same trend for water column decompression, and thus that the number of processing sessions on a single machine has no influence on the number of decompressions. We have observed that support for concurrent data access is uncommon in hydrographic processing software, while it is common for multiple people to work on the same data. This means that the data is decompressed on each workstation. N_dp is therefore normally equal to the number of workstations working on processing the same data.

The decompression time for processing, T_dp, depends on the performance of the compression algorithm and on the number of records that are required to be decompressed for processing. The latter depends strongly on the reason for collecting water column data. If water column data is recorded with the intention of improving bathymetric data, it is likely to be stored for the entire survey, but only used in locations where the bathymetric data appears to be faulty. In other words, only a small part of the stored water column data will need to be processed (and thus decompressed). Conversely, if water column data is gathered with the intention of investigating a specific feature (e.g. a wreck or marine life), it is likely that water column data will only be stored at the location of that feature. In this situation, it is likely that all the stored water column records will be processed (and thus decompressed).

We assume that the processing software selects the best decompression method for the data: full file decompression if all records are required during processing, or individual record decompression if only a limited subset of records is required for processing. The decompression time is therefore defined as:

T_dp = min(T_d, T_dpartial)

where

T_dp is the time required for decompression in processing.
T_d is the time required for full file decompression.
T_dpartial is the time required for partial decompression of the required water column records.

The partial decompression time depends on the ratio of water column records the user expects to decompress during processing:


T_dpartial = N ∗ R_p ∗ T_dra

where

T_dpartial is the partial decompression time.
N is the number of water column records.
R_p is the ratio of the number of water column records that the user expects to process to the number of water column records in the file (a user parameter).
T_dra is the average random access decompression time per record.

The average random decompression time is calculated by the test bench by randomly selecting a number of records for decompression and then averaging the decompression time.
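The processing metric then follows directly from the definitions above; a minimal sketch, with T_d, T_dra, R_p and N_dp obtained as described in this section:

```python
def processing_metric(t_d, t_dra, n_records, r_p, n_dp):
    """P = T_dp * N_dp."""
    t_dpartial = n_records * r_p * t_dra  # partial decompression of the needed records
    t_dp = min(t_d, t_dpartial)           # assume the cheaper decompression method is used
    return t_dp * n_dp
```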

4.4 Cost reduction

As discussed in section 1.1, the cost incurred by handling and storing water column data is prohibitive to its collection. This cost includes one-time costs, such as the cost of transferring the data from the survey vessel to the processing station on shore over expensive satellite links and the acquisition of storage space, as well as continuous costs, such as maintenance of the storage space for the data. We refer to the sum of these costs as the total cost of data ownership.

Employing water column compression will decrease the volume of data and thus the total cost of ownership of the data, but it also incurs cost because of time lost on compression and decompression. The cost metric shows the reduction in cost realized by the application of a specific compression algorithm:

C = R_co − C_c − C_d

where

R_co is the reduction in the cost of ownership.
C_c is the cost incurred by compression.
C_d is the cost incurred by decompression.

The reduction of the cost of ownership depends on the data reduction realized by the compression algorithm and the cost of ownership of data. The latter is a user parameter:

R_co = R_d ∗ C_o

where

R_co is the reduction in the cost of ownership.
R_d is the data reduction resulting from the application of the compression algorithm.
C_o is the cost of ownership of data.

Rd = So ∗ (1 − CR)

where

Rd is data reduction resulting from the application of the compression algorithm.

So is the size of the original, uncompressed file.

CR is the compression ratio for the file.

The cost incurred by compression depends on the ability of the algorithm to perform real-time compression. If the data can be compressed in real time, no additional cost is incurred by compression. If the algorithm cannot process the data in real time, we assume that the data is compressed after acquisition. Compression is likely to occur on the ship (to reduce the amount of data that needs to be sent to shore) and thus takes up time that could otherwise have been spent on acquisition. We refer to this time as ship time. The equation for the cost of compression is thus:

Cc = Tc ∗ Cst if RTC < 1, otherwise Cc = 0

where

Cc is the compression cost.

Tc is the compression time for the complete file.

RTC is the real-time compression metric.

Cst is the cost of ship time (a user parameter).

The cost of decompression for processing is the product of the time spent on decompressing the data needed for processing (i.e. the processing metric presented in section 4.3) and the cost of processing time:

Cd = P ∗ Cpt

where

P is the processing metric.

Cpt is the cost of processing time (a user parameter).
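Putting the pieces together, a minimal sketch of the cost reduction computation in Python could look as follows; the function and parameter names are illustrative only, not the test bench's actual implementation.

def cost_reduction(original_size, compression_ratio, t_c, rtc,
                   processing_metric, cost_ship_time, cost_processing_time,
                   cost_of_ownership):
    # C = Rco - Cc - Cd, following the definitions above.

    # Reduction in cost of ownership: Rco = Rd * Co, with Rd = So * (1 - CR).
    data_reduction = original_size * (1.0 - compression_ratio)
    r_co = data_reduction * cost_of_ownership

    # Compression cost: only incurred when real-time compression fails (RTC < 1).
    c_c = t_c * cost_ship_time if rtc < 1 else 0.0

    # Decompression cost: the processing metric times the cost of processing time.
    c_d = processing_metric * cost_processing_time

    return r_co - c_c - c_d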


4.5 Overview

In this chapter, we proposed the following metrics to be included in the test bench:

Metric                  Symbol   Range      Better when
Compression ratio       CR       (0, ∞)     Lower
Compression time        Tc       (0, ∞)     Lower
Decompression time      Td       (0, ∞)     Lower
Real-time compression   RTC      (0, ∞)     Higher
Processing overhead     P        (0, ∞)     Lower
Cost reduction          C        (−∞, ∞)    Higher

Table 4.1: Metrics

In order to compute these metrics, the user must supply the following parameters to the test bench:

• Memory available to the algorithm during acquisition.
• Processor capacity (as a percentage) available to the algorithm during acquisition.
• The number of times a file is decompressed for processing.
• Processing ratio.
• The cost of processing time.
• The cost of ship time.
• The cost of ownership of data.

Additional information on how to determine the right values for these parameters can be found in table 7.1 in section 7.3.
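For concreteness, these parameters could be grouped in a single Python structure as shown below; the key names and example values are placeholders of our own, and table 7.1 in section 7.3 documents the actual configuration keys.

# Illustrative grouping of the user parameters listed above; key names
# and values are placeholders, not the test bench's actual configuration.
metric_parameters = {
    "acquisition_memory_bytes": 2 * 1024**3,  # memory available during acquisition
    "acquisition_cpu_percent": 25,            # processor capacity during acquisition
    "decompressions_for_processing": 3,       # Ndp, e.g. the number of work stations
    "processing_ratio": 0.1,                  # Rp, fraction of records processed
    "cost_of_processing_time": 100.0,         # Cpt, per unit of time
    "cost_of_ship_time": 1000.0,              # Cst, per unit of time
    "cost_of_data_ownership": 0.05,           # Co, per unit of data
}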


Chapter 5

Test bench

In this chapter, we present a test bench for water column compression algorithms based on the input files and metrics presented in chapters 3 and 4, respectively. The purpose of the test bench is both to facilitate the computation of metrics for the benchmark and to enforce that all benchmark results have been computed in the same manner.

5.1 Requirements

The requirements for the test bench are based on the requirements for benchmarks presented by Sim et al. [SEH03]. As this work is meant as a first step towards a benchmark for water column compression and we expect it to change based on input from the (scientific) community, we have added changeability as a requirement.

• Accessibility
  The test bench (including the set of input files) should be in the public domain.

• Affordability
  The work to make an algorithm available to the test bench should not take more than a day. A complete test run should not take more than a day.

• Relevance
  The results of the test bench should be representative of real-world applications.

• Solvability
  The test bench should not use (artificial) input data that is incompressible.

• Portability
  The test bench should be able to handle algorithms written in many programming languages.

• Scalability
  The test bench should support algorithms at different levels of maturity.

• Changeability
  The test bench should support changing the set of input files and the set of metrics.

5.2 Structural design

This section describes the structural design of the test bench based on the requirements presented in the previous section. For portability, the test bench is implemented in python, which means it will run on any platform for which a python interpreter is available, without requiring compilation.

Figure 5.1 shows the components of the test bench and their relation. Each component will be discussed in detail in the following sections.


Figure 5.1: structural test bench design

5.2.1 Test execution logic

The test execution logic is the code that runs when the test bench is executed; it governs the program flow of the test bench. Detailed information on the program flow can be found in section 5.6.1. More details on the way metrics are computed can be found in section 5.6.2.

5.2.2 Input files

The Water Column Corpus presented in chapter 3 is the default set of input files. However, the test bench can be configured to use any set of input files by editing the configuration file.

Input file format

Water column data files come in many different encodings, as it is common for MBES manufacturers to use their own proprietary encoding. If the test bench supported files in their native format, this would put the responsibility of decoding the data on the compression algorithm. We see two problems with this approach. First, it makes the test bench hard to extend with new file formats, as that would require an update of all the compression algorithms. Second, there is a risk of algorithms supporting only a single encoding¹. The inability of compression algorithms to compress the same files would make the test bench useless for comparing algorithm performance.

In order to overcome these problems, we present a generalized file format in which the test bench's default input files will be encoded. The goals for this format are:

• Be generic, so that the data from any vendor encoding can be converted to this format.
• Be as close as possible to the original file in its content.
• Contain all the information essential to water column data.
• Contain all the data needed by the test bench.

¹ This may already be the case for current publications on water column compression: Moszynski et al. use an application that reads files in the 's7k' format [MCKL13, p. 81]. Amblas et al. [APR+16] & Beaudoin [Bea10] do not specifically mention which formats are supported, but both only include data from Kongsberg systems in their input data sets.


Based on these goals, we defined the generic water-column format (GWF). Details of the file format can be found in appendix B.

In order for the data to be as generic as possible, we include as few fields as possible while still making sure that all the data needed for compression and the test bench is included. For each water column record we include the fields: identifier, generation time, and number of beams; for each beam, we include the amplitude samples and phase samples.

In order to retain as much similarity with the original file as possible, we do not perform conversion on the sample data (which constitutes the majority, >90%, of the data). Instead, the file contains a field indicating the format in which the samples are stored. Furthermore, all data that does not fit in one of the defined fields is stored as 'generic data'. Both the water column record header and each individual beam contain a field in which generic data can be stored. Because of these precautions, conversion from a vendor-specific format to the generic water-column format can be achieved largely by reordering the data in the original file. More details and examples of such a conversion can be found in appendix B.
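The sketch below illustrates the structure described above as Python dataclasses. The field names are our own shorthand for illustration; appendix B remains the authoritative definition of the format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class GwfBeam:
    amplitude_samples: bytes      # kept in the original vendor sample format
    phase_samples: bytes          # may be empty if the MBES records no phase
    generic_data: bytes = b""     # vendor-specific beam data that fits no field

@dataclass
class GwfRecord:
    identifier: int               # water column record identifier
    generation_time: float        # record generation time
    sample_format: int            # indicates how the samples are encoded
    beams: List[GwfBeam] = field(default_factory=list)  # number of beams = len(beams)
    generic_data: bytes = b""     # vendor-specific record data that fits no field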

We do not expect a correlation between the compressibility of a file and its structure. Therefore, the performance of a water column data compression algorithm should not differ when processing an input file in the GWF file format compared to processing the same data in its original format. However, we have not conducted extensive research to support this claim, and users of the test bench may worry that results on files in the GWF file format are not representative of an algorithm's real-life performance. Therefore, the test bench supports input files both in the GWF file format and in all the original file formats of the Water Column Corpus (.all, .wcd and .s7k).

5.3 Compression algorithm

<<interface>> Compression algorithm

Init(parameters : string) : int
Compress(inputPath : string, outputPath : string) : int
Decompress(inputPath : string, outputPath : string) : int
Decompress(inputPath : string, outputPath : string, recordID) : int

Figure 5.2: algorithm interface

The compression algorithm is the algorithm whose performance we want to evaluate. In order for the algorithm to be used by the test bench, it should be made available as a python module that supports the interface shown in figure 5.2. Compression algorithms are often written in C ([APR+16], [Bea10]) or C++ ([MCKL13]) for speed. Algorithms written in these programming languages can easily be made available as a python module by compiling them as shared libraries and making the libraries available in python using the ctypes module². There are many other methods to make code written in other languages available to the test bench³.

The init operation is used to set algorithm parameters that are used in (de)compression. The parameters are specific to the algorithm that is used. For example, generic compression algorithms often include a parameter to favor either speed or compression rate. The test bench passes parameters in JSON⁴ notation as they have been configured in the configuration file of the test bench. As the parameters are specific to the algorithm, all parameters and their possible values should be supplied by the algorithm manufacturer.

The compress and decompress operations both take the path of an input file (the original file for the compress function, the compressed file for the decompress function) and the path to an output file. The algorithm is responsible for creating the output file at the specified location and writing the (un)compressed data to that file. The interface specifies an operation to decompress the entire file and an operation to decompress a single record from the input file.

All operations should return an integer which is negative in case of failure (algorithm specific return values may be used) and zero in case of success.

² https://docs.python.org/3/library/ctypes.html
³ https://wiki.python.org/moin/IntegratingPythonWithOtherLanguages
⁴ http://www.json.org
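As an illustration of the ctypes route, the sketch below wraps a hypothetical shared library (libwcc.so) exporting the interface of figure 5.2. Since C has no overloading, the sketch assumes the single-record variant is exported under a separate symbol (DecompressRecord); the library and symbol names are assumptions, and chapter 7 describes the actual procedure.

import ctypes

# Load a hypothetical compression library; name and symbols are assumptions.
_lib = ctypes.CDLL("./libwcc.so")
_lib.Init.argtypes = [ctypes.c_char_p]
_lib.Init.restype = ctypes.c_int
_lib.Compress.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
_lib.Compress.restype = ctypes.c_int
_lib.Decompress.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
_lib.Decompress.restype = ctypes.c_int
_lib.DecompressRecord.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_int]
_lib.DecompressRecord.restype = ctypes.c_int

def init(parameters: str) -> int:
    return _lib.Init(parameters.encode())

def compress(input_path: str, output_path: str) -> int:
    return _lib.Compress(input_path.encode(), output_path.encode())

def decompress(input_path: str, output_path: str, record_id=None) -> int:
    # Python has no overloading either, so an optional record_id selects
    # between full-file and single-record decompression.
    if record_id is None:
        return _lib.Decompress(input_path.encode(), output_path.encode())
    return _lib.DecompressRecord(input_path.encode(), output_path.encode(), record_id)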


5.4 Compressed file / decompressed file

The compressed and decompressed files are transient files that are used to store the output of the compress and decompress functions, respectively. The algorithm under test is responsible for the creation of the files and the test bench is responsible for the removal of the files when they are no longer necessary. The files are used by the test bench for the calculation of the compression ratio metric.

5.5 Results

The results of the test bench are stored in an sqlite database. The results consist of all the metrics computed during the test run. The database has a simple format consisting of columns for the input file name, the name of the algorithm, and one for each computed metric. Refer to figure 7.2 for an example.
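A minimal sketch of such a results schema in Python; the column names are illustrative, and the actual schema is the one shown in figure 7.2.

import sqlite3

# One row per (input file, algorithm) pair, one column per metric.
conn = sqlite3.connect("results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        input_file          TEXT,
        algorithm           TEXT,
        compression_ratio   REAL,
        compression_time    REAL,
        decompression_time  REAL,
        real_time           REAL,
        processing          REAL,
        cost_reduction      REAL
    )
""")
conn.commit()
conn.close()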

5.5.1 Configuration file

The configuration file contains the configuration for the test bench. This includes:

• The paths to all the input files.
• The path to the python module(s) that contain the compression algorithm(s).
• The compression algorithm parameters.
• The metric parameters.

The configuration file uses the JSON format for readability. An example of a configuration file can be found in appendix C. Section 7.3 provides more details on the adaptation of the configuration file.


5.6 Behavioral design

In this section we present the behavioral design of the test bench. Section 5.6.1 describes the program flow of a test run. Section 5.6.2 describes how the metrics proposed in chapter 4 are computed.

5.6.1 Test execution

A run of the test bench computes the metrics proposed in chapter 4 for one or more compression algorithms using one or more input files. Both the algorithms and the input files to use are specified by the user in the configuration file. Figure 5.3 shows the execution flow of a test run of the test bench as a flow chart. The gray boxes in the flow chart's processes indicate (parts of) metrics that are calculated in that step.

The test bench iterates over all the input files. For each file, the test bench starts by reading metadata from the file. This metadata is necessary for the calculation of metrics later on (file size, the number of records in the file, and the timespan of the file) and for determining which records will be used for random access decompression.

Once the metadata for the file is read, the following steps are executed for each algorithm:

1. The algorithm is instructed to compress and subsequently decompress the input file, yielding measurements for the compression ratio, compression time and decompression time metrics.

2. The algorithm is instructed to decompress a single record from the compressed file for each record that was selected for random access decompression.

3. The test bench emulates the real-time scenario in which limited resources are available to the algorithm (this process is described in detail in section 5.6.2) and subsequently instructs the algorithm to compress the input file again to compute the average record compression time in a low-resource environment. After the compression step completes, all resources allocated for the real-time scenario are released.

When these actions have been completed for all input files and all algorithms, the metrics that require user parameters are computed from the previously calculated values and the user-defined parameters. Calculating these metrics outside of the file and algorithm loops allows for fast recomputation of these metrics from previous results when only the parameters have been changed.
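A condensed sketch of this flow in Python is given below. The helper functions are placeholders for the corresponding test bench components (here stubbed with dummy values so the sketch runs); the authoritative flow is figure 5.3.

import contextlib

# Placeholder helpers; each corresponds to a component described in this chapter.
def read_metadata(path): return {"path": path}
def select_random_records(meta): return [0]
def compress_and_decompress(algorithm, path): return (0.5, 1.0, 1.0)
def average_random_access_time(algorithm, path, records): return 0.01
def real_time_compression(algorithm, path, meta): return 1.0
def compute_parameter_metrics(results, parameters): return results

@contextlib.contextmanager
def limited_resources(parameters):
    yield  # a real implementation would claim memory and CPU (section 5.6.2)

def run_test_bench(input_files, algorithms, parameters):
    raw_results = []
    for path in input_files:
        meta = read_metadata(path)              # file size, record count, timespan
        records = select_random_records(meta)   # records for random access tests
        for algorithm in algorithms:
            r = {}
            r["cr"], r["tc"], r["td"] = compress_and_decompress(algorithm, path)
            r["tdra"] = average_random_access_time(algorithm, path, records)
            with limited_resources(parameters):  # emulate the real-time scenario
                r["rtc"] = real_time_compression(algorithm, path, meta)
            raw_results.append((path, algorithm, r))
    # Parameter-dependent metrics are computed outside the loops, so they can
    # be recomputed from stored results when only the parameters change.
    return compute_parameter_metrics(raw_results, parameters)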

5.6.2 Metrics computation

Chapter 4 presents the metrics that are computed by the test bench, but does not address the question of how these metrics should be computed. The computation of the metrics CR, P and C is obvious from their definitions. The computation of the metrics Tc, Td and RTC, however, is less obvious. In this section, we discuss the computation of these metrics.

Time measurement

Except for the compression ratio, the computation of all metrics depends on time measurement. It is important to discern which time we want to measure for each metric. Compression time (Tc) and decompression time (Td) are metrics that indicate the performance of the algorithm. Therefore, we want to exclude anything that is not related to algorithm performance. Examples of such activities are the operating system executing other tasks, and the time spent waiting on disk access to complete.

For the real-time compression (RTC) metric, on the other hand, we explicitly want to include the influence of other processes that claim CPU time in parallel to the execution of the compression algorithm.

The test bench is written in the python programming language, which offers a number of different methods to obtain timing information. We have tested these methods to find the right method to use for both scenarios. The most important observation we made is that not every method behaves according to its documentation. Specifically, we found a number of methods for which the results, contrary to the documentation, differ significantly when executed on different operating systems.
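The distinction between the two measurement needs maps naturally onto two of python's clocks: process_time counts only CPU time consumed by the current process (suited to Tc and Td), while perf_counter measures elapsed wall-clock time including interference from other processes (suited to RTC). The sketch below separates the two, assuming the clocks behave as documented; as noted above, this should be verified on the target platform.

import time

def measure(func, *args):
    # Return (wall_time, cpu_time) for one call of func.
    # perf_counter includes time spent in other processes and on I/O waits;
    # process_time counts only CPU time of the current process.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)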
