
Triple-C: resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks

Citation for published version (APA):

Albers, A. H. R., With, de, P. H. N., & Suijs, E. (2009). Triple-C: resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks. In IEEE International Symposium on Parallel and Distributed Processing 2009, IPDPS 2009, 23-29 May 2009, Rome, Italy (pp. 1-8). Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/IPDPS.2009.5160942

DOI:

10.1109/IPDPS.2009.5160942

Document status and date: Published: 01/01/2009

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)



Triple-C: Resource-Usage Prediction for Semi-Automatic Parallelization of Groups of Dynamic Image-Processing Tasks

Rob Albers∗†, Eric Suijs and Peter H.N. de With∗‡

Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands,
Philips Healthcare, X-Ray, PO Box 10.000, 5680 DA Best, The Netherlands,
CycloMedia Technology, PO Box 68, 4180 BB Waardenburg, The Netherlands.

E-mail: r.albers@philips.com

Abstract

With the emergence of dynamic video processing, such as in image analysis, runtime estimation of resource usage would be highly attractive for automatic parallelization and QoS control with shared resources. A possible solution is to characterize the application execution using model descriptions of the resource usage. In this paper, we introduce Triple-C, a prediction model for Computation, Cache-memory and Communication-bandwidth usage with scenario-based Markov chains. As a typical application, we explore a medical imaging function to enhance objects of interest in X-ray angiography sequences. Experimental results show that our method can be successfully applied to describe the resource usage for dynamic image-processing tasks, even if the flow graph dynamically switches between groups of tasks. An average prediction accuracy of 97% is reached, with sporadic excursions of the prediction error up to 20-30%. As a case study, we exploit the prediction results for semi-automatic parallelization. Results show that with Triple-C prediction, dynamic processing tasks can be executed in real-time with a constant low latency.

1. Introduction

Advanced video processing is becoming increasingly dynamic, which poses new requirements on the system design. With dynamic video-processing applications, such as in image analysis, the computational complexity has become data dependent and memory usage is more irregular. Detailed know-how of specific application aspects, such as data-driven complexity and the corresponding memory requirements, is relevant for

optimal mapping of tasks on a computing platform and optimizing the performance during runtime.

In order to guide the mapping and obtain efficiency in implementation, performance prediction may be applied in the form of modeling. This paper concentrates on achieving sufficient accuracy in the modeling for applications featuring dynamic execution of tasks.

The secondary objective is to use the modeling for runtime estimation of the resource usage, with the aim to execute more functions on the same platform. The model descriptions are used as a prediction for resource planning, parallelization and possibly the corresponding Quality-of-Service (QoS) control [1][2].

The above problem statement has been addressed in several fields of application, such as high-performance computing and multimedia. Within performance computing, several techniques are reported in the literature for performance prediction of parallel applications [3][4][5]. Most of them are based on block-parallel or dataflow-like approaches. The drawback of these approaches is that the analysis is restricted to applications which have static processing requirements at known rates and data sizes, while the required computations have a deterministic nature.

In the multimedia domain, application analysis has revealed that algorithms have variable rates and sizes, which involves application tuning based on this variable nature. In such a case, the application is described by its key features, such as a simple average or discrete distributions acquired from profile analysis [6][7]. In our case, the application field is professional medical imaging and processing. Since this involves a rather broad scope, we study a mixture of the previous features and new aspects. A difference is that our application involves image analysis, which has a more dynamic nature and will be discussed below.


The paper is organized as follows. In Section 2, we start with a survey of the system and architectural requirements in our application field. Section 3 presents the medical-imaging application under study. Performance modeling of the video-processing functions, memory communication and granularity optimization is discussed in Sections 4 and 5. For runtime estimation of the resource usage, we introduce task-level scalability and runtime quality control of the application in Section 6. Section 7 presents experimental results for mapping the image-analysis application on a multiprocessor platform with a constant low latency. Section 8 concludes the paper.

2. System and Architectural Requirements

The work presented in this paper aims at realizing a design flow involving the mapping and runtime management of advanced video applications onto a multiprocessor platform. We focus on the upper design layers, dealing with the applications and their control for an efficient execution. Let us start with a survey of the system and architectural requirements which are important for our application field in professional medical imaging.

• Low latency. We explore a medical imaging function to enhance objects of interest under X-ray fluoroscopy imaging, during a live interventional angiography procedure [8][9][10]. Because physicians must see their actions directly on the screen (eye-hand coordination), a constant low latency is a key requirement for the real-time imaging application.

• Off-the-shelf multiprocessor. In many medical imaging procedures, a multitude of imaging functions is carried out in parallel. Because dedicated hardware lacks flexibility, using off-the-shelf multiprocessors as a platform for software image processing seems to be a valid choice. The selected platform contains multiple processing cores in order to provide the required processing power, while maintaining some degree of runtime programmability and flexibility. To increase the cost-efficiency, specialized co-processors (such as graphics cards) can be added in the future.

• Variable image analysis. A trend in medical imaging is the introduction of image-analysis and feature-extraction techniques to increase the performance of other tasks in the real-time video pipeline (such as image enhancement). From an architectural perspective, the computational complexity has become data dependent and memory usage is more irregular. As an answer to the variable processing rates, performance prediction may be applied in the form of modeling to guide the mapping and to obtain an efficient implementation. Modeling of application resource usage is complicated because, depending on the image content and intermediate analysis results, the analysis algorithm may switch to a different group of processing tasks.

Figure 1. Two coronary angiograms of a stented bifurcation (a,b), and the enhanced views of the motion-compensated stent (c,d).

In this paper, we introduce Triple-C, a prediction model for Computation, Cache-memory and Communication-bandwidth usage for groups of dynamic image-processing tasks with scenario-based Markov chains. It is important to note that our approach is essentially different from considering only the worst-case state of the system. Unlike the worst-case approach, our approach is dynamic, i.e., it makes use of runtime characteristics of the input data and the environment of the application, with the aim to execute more functions on the same platform.

3. Medical image-analysis application

Let us now discuss the key application used to evaluate our architectural study. Coronary angioplasty is a catheter-based procedure performed by an interventional cardiologist in order to open up a blocked coronary artery and restore blood flow to the heart muscle. Angioplasty is used as an alternative treatment to coronary artery bypass surgery in more than half of the cases. Following balloon angioplasty, a wire mesh tube (stent) can be placed to keep the artery open. The correct deployment of a stent in the coronary arteries is important for ensuring the efficiency of drug-eluting stents. Image analysis and motion-compensation techniques can improve the visualization and measurement of objects of interest in X-ray angiography, thereby making it easier to achieve optimum and complete stent deployment, potentially eliminating the need for additional procedures, such as intravascular ultrasound. In this paper, we explore a medical-imaging application to enhance moving features under X-ray fluoroscopy imaging, during a live interventional angioplasty procedure¹.

Figure 2. Flow graph for motion-compensated feature enhancement and required bandwidth between tasks (MByte/s).

Motion-compensated feature enhancement consists of several steps, as depicted in Figure 2. The presented flow graph is based on a cascading of four stages, which are individually described in [11][12][13][14]. After stent placement, the candidate balloon markers are detected in the image using an automatic marker-extraction algorithm. Ridge detection (RDG) and filtering are applied to the input images such that all other structures, except candidate balloon markers, are removed. If no other dominant structures are detected, the function is skipped. Subsequently, marker extraction (MKX EXT) selects punctual dark zones contrasting on a brighter background as candidate markers.

Based on a-priori known distances between the balloon markers, couples selection (CPLS SEL) selects the best marker couple from the set of candidate couples. Subsequently, temporal registration (REG), which aligns respective markers in selected image frames, is based on a motion criterion, where a temporal difference is computed between two succeeding images of the sequence. A Region Of Interest is estimated in the original image (ROI EST), where the markers have previously been detected. The guide wire can be detected by a ridge filter in guide-wire extraction (GW EXT). If the markers of a possible couple are situated on a track corresponding to a ridge joining them (the guide wire), this indicates that the results obtained by automatic marker extraction are stable. Enhancement (ENH) of the stent is performed by temporal integration of the registered image frames according to the balloon markers. The output is presented by zooming (ZOOM) in on the ROI containing the stent. As evident from the images in Figure 1, the enhanced images enable improved control of the proper expansion of the stents and of their positioning relative to the vessel wall, without the use of other devices, such as intravascular ultrasound.

1. Commercially available as StentBoost or IC Stent.

The described application is dynamic in three major aspects: (1) at the start, an ROI of variable, data-dependent size is chosen for further analysis; (2) at every stage, switch functions select a specific flow graph, depending on the previous stage(s); and (3) some of the internal flow graphs intrinsically require a variable processing time. Performance prediction is challenging as the input data may cause variability within task execution times. Moreover, the dynamic decision making is based on outcomes of the image-analysis process, which heavily depend on the input video data. Tasks in the image analysis cannot be easily switched off, since that would lead to an incomplete or unacceptable result. In the next section, we introduce the prediction model for the computation time, including the estimation technique that copes with the dynamic behavior.
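To make the switching behavior concrete, the sketch below reduces the flow graph of Fig. 2 to a scenario-selection function driven by the three data-dependent switches just described. This is purely illustrative: the stage names follow Fig. 2 and Table 1, but the function, its state keys and the exact gating of each stage are simplified assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): the Fig. 2 flow graph reduced
# to a scenario-selection function driven by the three data-dependent switches.
# Stage names follow Fig. 2 / Table 1; everything else is an assumption.

from typing import Dict, List

def active_stages(state: Dict[str, bool]) -> List[str]:
    """Return the group of tasks executed for one frame, given the switch outcomes."""
    granularity = "ROI" if state["roi_estimated"] else "FULL"
    stages: List[str] = []
    if state["dominant_structures"]:          # switch 1: is ridge detection needed?
        stages.append(f"RDG_{granularity}")
    stages.append(f"MKX_EXT_{granularity}")   # marker extraction
    stages += ["CPLS_SEL", "REG", "ROI_EST", "GW_EXT"]
    if state["registration_successful"]:      # switch 3: enhancement only on success
        stages += ["ENH", "ZOOM"]
    return stages

# Example: worst-case scenario in terms of bandwidth (full frames, RDG on, REG succeeds).
print(active_stages({"dominant_structures": True,
                     "roi_estimated": False,
                     "registration_successful": True}))
```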

4. Prediction Model of Computation Time

As an answer to variable processing rates of the application, performance prediction may be applied in the form of modeling to guide the mapping and to obtain an efficient implementation. Modeling of application resource usage is complicated because, depending on the image content and intermediate analysis results, the analysis algorithm may switch to a different group of processing tasks.

We have considered several options for modeling of the computation time. As a first solution, we investigated literature on video traffic modeling [15]. Most of the papers deal with Markov-chain approaches, since the estimation of the model parameters is straightforward and there is a large number of analysis techniques available.


A Markov chain is a stochastic process with the Markov property. The Markov property means that, given the present state, future states are independent of the past states. In other words, the description of the present state fully captures all the information that could influence the future evolution of the process. Future states will be reached through a probabilistic process instead of a deterministic one.

To deal with applications for which the computation time depends on long-term statistics of the video frames, higher-order probabilistic processes can be used, but the state space will grow exponentially. Another problem is obtaining statistically significant estimates for the transition probabilities, because with an increasing order, the number of samples for each estimate is very small, even for long data sets. An alternative view on the modeling of the system behavior is to consider the timing statistics of the video frames in two categories, as a result of mapping the algorithms on a platform. Hence, we then investigate short-term and structural fluctuations in processing time on the platform. Short-term fluctuations can be caused by cache misses or the overhead imposed by task switching and control. Structural fluctuations are caused by the dependency of the processing time of the tasks on the video content itself over a longer time period. This points towards splitting the computational statistics into categories.

As a consequence of the previous discussions, we have adopted a concept where the long-term statistics are decoupled from the short-term stochastic behavior by employing different models for those statistics.

• Short-term data correlations. We try to describe the prediction model for the short-term data-dependent tasks as a probabilistic process such as a finite-state Markov chain. A first-order Markov chain is by definition memoryless, so the model implicitly assumes that the processing times of successive frames are independent. Based on computation of the autocorrelation function, we have concluded that the couples-selection (CPLS SEL) and guide-wire extraction (GW EXT) tasks can both be modeled with Markov chains. A disadvantage of Markov-chain modeling is the required exponentially decaying autocorrelation function of the input data. Markov-chain prediction falls short if processing times between video frames are correlated over a longer time period. Next, we will describe the modeling of long-term structural data dependencies.
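As an illustration of the autocorrelation test mentioned above, the following sketch estimates the normalized autocorrelation of a per-frame computation-time trace and applies a simple decay check. The trace here is synthetic and the decay criterion (threshold and lag) is an assumption for illustration only; the paper does not specify how the decay was judged.

```python
# Minimal sketch (assumption: per-frame computation times are available as a 1-D trace).
# It estimates the normalized autocorrelation and applies an assumed decay criterion,
# mirroring the kind of check used to decide whether a first-order Markov chain fits.

import numpy as np

def autocorrelation(trace, max_lag: int = 50) -> np.ndarray:
    """Normalized autocorrelation of a computation-time trace for lags 0..max_lag."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    var = float(np.dot(x, x))
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var for k in range(max_lag + 1)])

def decays_fast(acf: np.ndarray, threshold: float = 0.2, lag: int = 5) -> bool:
    """Heuristic (illustrative): Markov-friendly if the ACF drops below `threshold`
    within `lag` frames."""
    return abs(acf[lag]) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic AR(1)-like trace standing in for per-frame CPLS_SEL timings (ms).
    trace = np.empty(500)
    trace[0] = 10.0
    noise = rng.normal(0.0, 1.0, 500)
    for t in range(1, 500):
        trace[t] = 10.0 + 0.5 * (trace[t - 1] - 10.0) + noise[t]
    acf = autocorrelation(trace)
    print("ACF at lags 1..5:", np.round(acf[1:6], 2), "fast decay:", decays_fast(acf))
```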

Figure 3. Computation time of the RDG FULL task (ridge-detection trace with its high-pass (HPF) and low-pass (LPF) filtered components; computation time in ms versus frame number).

• Long-term data correlations. We consider the prediction model to consist of long-term low-frequency fluctuations, around which short-term high-frequency fluctuations can take place. Discrimination between the low- and high-frequency parts can be made by various types of filters, such as Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters. We apply the Exponentially Weighted Moving Average (EWMA) filter. As this IIR filter weights recent inputs more heavily than long-term previous ones, it adapts more quickly to the input signal compared to FIR filters. The EWMA filter is defined by:

$$ y(t_k) = (1 - \alpha)\, y(t_{k-1}) + \alpha\, x(t_k) \qquad (1) $$
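The EWMA recursion of Eq. (1) translates directly into a few lines of code. The sketch below is a minimal transcription; the value of α is not stated in the paper, so the default used here is only a placeholder.

```python
# Direct transcription of Eq. (1); alpha is a placeholder, not a value from the paper.

def ewma(samples, alpha=0.1, y0=None):
    """Exponentially Weighted Moving Average: y_k = (1 - alpha) * y_{k-1} + alpha * x_k."""
    y = samples[0] if y0 is None else y0
    out = []
    for x in samples:
        y = (1 - alpha) * y + alpha * x
        out.append(y)
    return out

# Example: long-term (low-frequency) trend of a computation-time trace in ms.
print(ewma([42.0, 45.0, 41.0, 60.0, 43.0], alpha=0.2))
```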

Given the separation of correlation behavior, the short-term fluctuations are modeled with Markov chains. We have validated the applicability of the Markov-chain modeling by analyzing the autocorrelation function. In Fig. 3, the computation-time statistics for the ridge-detection (RDG) task are shown. To model the computation time for the current video frame, the output of the EWMA filter is used for long-term behavior prediction. On top of that, a Markov chain predicts the short-term fluctuations in computation time. The state-space description can be generated by analyzing the computation time over a long time period. The number of states is $M = C_{\max}/\sigma_C$, where $C_{\max}$ denotes the largest measured value and $\sigma_C$ the standard deviation. We have experimentally evolved to a model with approximately $2M$ states to obtain sufficient accuracy. The quantization intervals are adaptively chosen such that each interval contains on average the same number of samples. The entries of the transition-probability matrix $\{P_{ij}\}$ are estimated by

$$ P_{ij} = n_{ij} \Big/ \sum_{k=1}^{M} n_{ik}, \qquad (2) $$

where $n_{ij}$ denotes the number of transitions from interval $i$ to interval $j$. Processing-time statistics for different Region-Of-Interest (ROI) sizes show that the RDG task has a linear dependency on the size of the ROI (see Fig. 6). To analyze load fluctuations caused by dependencies on the video content itself, we have


subtracted a linear growth function from the obtained statistics. This function is specified by

$$ y(t_k) = 0.067 \times t_k + 20.6. \qquad (3) $$

For the remaining data-dependent fluctuations after subtraction, we analyzed the autocorrelation function. As the function has an exponential decay, it can again be described with a Markov chain. As the fluctuations are of the same order as the high-frequency behavior described in the previous section, we have included these statistics in the Markov state-generation process, to generate a single Markov chain for the ridge-detection task.
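A compact sketch of the state-generation procedure described above: subtract the linear ROI-dependent trend (cf. Eq. (3)), quantize the residual into roughly $2M$ equal-frequency intervals with $M = C_{\max}/\sigma_C$, and estimate the transition matrix with Eq. (2). The synthetic trace, the use of a fitted trend instead of the fixed coefficients of Eq. (3), and all function names are assumptions made for illustration; only the rules themselves come from the text.

```python
# Sketch of the state generation and transition-matrix estimation (Eqs. 2-3).
# Synthetic data and names are illustrative; np.polyfit stands in for the paper's
# fixed Eq. (3) coefficients.

import numpy as np

def quantize_equal_frequency(values: np.ndarray, n_states: int) -> np.ndarray:
    """Map each sample to one of n_states intervals containing ~equal sample counts."""
    ranks = np.argsort(np.argsort(values))            # rank of each sample, 0..N-1
    return (ranks * n_states) // len(values)

def transition_matrix(states: np.ndarray, n_states: int) -> np.ndarray:
    """Estimate P[i, j] = n_ij / sum_k n_ik from a sequence of state indices (Eq. 2)."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(states[:-1], states[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                     # leave unvisited states all-zero
    return counts / row_sums

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-in for RDG computation times: linear ROI dependency plus noise.
    roi = rng.uniform(50_000, 250_000, 1_000)
    times = 25.0 + 1e-4 * roi + rng.normal(0.0, 2.0, 1_000)        # ms
    slope, offset = np.polyfit(roi, times, 1)                      # fit the growth function
    residual = times - (slope * roi + offset)                      # cf. Eq. (3)
    M = max(1, int(np.ceil(residual.max() / residual.std())))      # M = Cmax / sigma_C
    n_states = 2 * M                                               # ~2M states (see text)
    P = transition_matrix(quantize_equal_frequency(residual, n_states), n_states)
    print("states:", n_states, "first row:", np.round(P[0], 2))
```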

5. Cache-memory and Communication-bandwidth Analysis

In this section, we focus on the detailed analysis of the cache-memory and communication-bandwidth allocation for inter-task and intra-task communication. Optimal resource utilization following only the prediction model for computation time usually conflicts with the usage of communication resources. A reliable reasoning about tasks having variable processing rates requires analytical methods, covering on the one hand the actual required computation time and, on the other hand, the communication and storage requirements. We adopt this approach in Triple-C, in which, apart from prediction models for Computation time (Section 4), we also introduce analysis for Cache-memory and Communication-bandwidth usage. Let us start by describing the memory requirements for each task in the graph of Fig. 2, followed by the calculation of the communication bandwidth under various conditions.

5.1. Task Memory Requirements

The required amount of memory for each task can be derived by extracting the input/output requirements and intermediate storage requirements from a reference software implementation. In Table 1, the results are shown for the RDG, MKX, ENH and ZOOM tasks. To concentrate on significant bandwidth requirements, only operations on arrays are taken into account. The tasks that operate on a subset or on feature data are negligible in terms of memory consumption. Note that some of the functions have different memory requirements, depending on the state of a switch in the flow graph. For example, if the RDG task is switched off, the succeeding MKX function has a much smaller input buffer requirement. The amount of communication bandwidth required for correct operation of the image-analysis application is derived by a mapping of the memory requirements onto the platform architecture.

Task       RDG select   Input (KB)   Intermediate (KB)   Output (KB)
RDG FULL                   2,048           7,168             5,120
RDG ROI                    2,048           5,120             5,120
MKX FULL        -            512             512             2,560
MKX ROI         -            512             512             2,560
MKX FULL        x          4,608             512             2,560
MKX ROI         x          4,608             512             2,560
ENH                        2,048           8,192             1,024
ZOOM                       1,024           4,096             4,096

Table 1. Memory requirements for each task of Fig. 2.
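As a quick sanity check on the Table 1 figures, note that one full frame of 1024×1024 pixels at 2 bytes/pixel (the image format given in Section 5.2) occupies exactly 2,048 KB, which matches the input-buffer size listed for the RDG tasks. A short worked example, under that assumption:

```python
# Back-of-the-envelope check of the Table 1 figures, assuming buffers hold whole
# frames of 1024 x 1024 pixels at 2 bytes/pixel (as stated in Section 5.2).

FRAME_W, FRAME_H, BYTES_PER_PIXEL = 1024, 1024, 2

frame_kb = FRAME_W * FRAME_H * BYTES_PER_PIXEL // 1024   # = 2,048 KB
print(f"one full frame: {frame_kb} KB")   # matches the 2,048 KB input buffer of RDG FULL
```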

5.2. Communication-bandwidth Requirements

When calculating the required amount of communication bandwidth of the application under study, there are a number of constraints that influence the actually measured communication-bandwidth load on the platform architecture:

• Hardware platform: the choice for a particular hardware platform sets an upper limit on the available resources.

• Application scenarios: by switching between different groups of processing tasks (scenarios), the memory and communication-bandwidth requirements imposed on the system can vary.

• Intra-task memory: if a task internally requires more memory than can be stored locally in the cache memory of the processor, additional communication bandwidth will be initiated to swap data in and out of the external memory.

• Inter-task memory: the partitioning of the application on the platform has a direct relationship with the required amount of communication bandwidth between tasks.

• Hardware platform. For the experiments, we use a general-purpose multiprocessor platform containing two quad-core processors. In Figure 4, the system is shown. In total, the system consists of 8 processors of 2.33 GCycles/s, 8 level-1 caches of 32 KB and 4 level-2 caches of 4 MB. The system is equipped with 4 GB of external memory. For more details about the instantiated architecture, we refer to [16].

• Application scenarios. The application under study processes images of 1024×1024 pixels (2 Bytes/pixel) at 30 Hz. Due to the switch statements in the flow graph of Figure 2, multiple application scenarios are possible. For the worst-case scenario in terms of bandwidth requirements, the tasks operate at full-frame granularity, the ridge-detection task is activated and the registration phase completes successfully.


Figure 4. (a) Generic architecture model, (b) instantiated architecture with parameters (8 cores at 2,327 MCycles/s; bandwidth annotations of 0.94-3.83 GB/s, 29 GB/s, 48 GB/s and 72 GB/s).

Conversely, for the best-case scenario in terms of bandwidth requirements, the tasks operate at a small region-of-interest granularity, which saves a significant amount of required bandwidth. Furthermore, if the ridge-detection task is not required and the registration is not successful, the enhancement and zoom tasks are skipped, which saves another significant amount of required bandwidth. Note that in this 'best-case' scenario in terms of bandwidth, the algorithm will not output a satisfying result. In total, there are eight different scenarios possible, given the three switch statements in the flow graph.

• Intra-task memory. For several tasks, the required amount of intra-task memory will exceed the available local cache memory of the hardware platform. For the application under study, the RDG FULL, ENH and ZOOM tasks have an intra-task memory requirement that is higher than the level-2 cache capacity of the hardware platform. Therefore, additional communication bandwidth will be initiated to swap data in and out of the external memory. The amount of bandwidth of each task can be modeled by analysis of the access pattern of the internal buffers. As the above tasks operate at whole-image granularity, scanning the pixels linearly in the (x, y) direction ensures that all data items are accessed. The modeling of the cache-memory occupation and the corresponding eviction of internal buffers can be described with a space-time buffer-occupation model. In Figure 5, for each subtask of the RDG function, the eviction of cache lines in case of overload is shown, as it requires extra bandwidth between the cache memory and external memory storage. A rough sketch of this check, together with the per-scenario bandwidth arithmetic, follows after these bullets.

• Inter-task memory. In Figure 2, the required inter-task bandwidth is shown on the arrows between tasks (1024×1024 pixels, 30 Hz, 2 Bytes/pixel). The mapping of the application on the platform has a direct relationship with the actually measured communication bandwidth between tasks.
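The bullets above suggest two back-of-the-envelope checks, sketched together below: (i) the bandwidth of one full-frame stream between tasks (1024×1024 pixels, 2 bytes/pixel, 30 Hz), which lands in the same range as the arrow labels of Fig. 2, and (ii) the extra external-memory traffic caused by tasks whose working set (Table 1) exceeds the 4 MB level-2 cache. The overflow-per-frame rule in (ii) is a deliberately crude stand-in for the space-time buffer-occupation model; only the buffer sizes, the task selection and the platform constants are taken from the text, the rest is an assumption.

```python
# Two back-of-the-envelope checks (simplified assumptions, not the authors' model):
# (i) bandwidth of a full-frame stream between tasks, and (ii) extra external-memory
# traffic for tasks whose working set does not fit the 4 MB level-2 cache.
# Buffer sizes are taken from Table 1 (KB); the tasks listed are those the text
# identifies as exceeding the L2 capacity.

FRAME_BYTES = 1024 * 1024 * 2          # 1024 x 1024 pixels, 2 bytes/pixel
FRAME_RATE = 30                        # Hz
L2_CACHE_KB = 4 * 1024                 # 4 MB level-2 cache (Section 5.2)

# Per-task buffers in KB: (input, intermediate, output), from Table 1.
TASK_MEMORY_KB = {
    "RDG_FULL": (2048, 7168, 5120),
    "ENH":      (2048, 8192, 1024),
    "ZOOM":     (1024, 4096, 4096),
}

# (i) Inter-task bandwidth of one full-frame stream: ~60 MB/s, the same order of
# magnitude as the arrow labels of Fig. 2.
stream_mb_s = FRAME_BYTES * FRAME_RATE / 1e6
print(f"full-frame stream: {stream_mb_s:.0f} MB/s")

# (ii) Intra-task working set versus L2 capacity: assume (pessimistically) that the
# part of the working set that does not fit is streamed to/from external memory
# once per frame.
for task, (inp, mid, out) in TASK_MEMORY_KB.items():
    working_set_kb = inp + mid + out
    overflow_kb = max(0, working_set_kb - L2_CACHE_KB)
    extra_mb_s = overflow_kb * 1024 * FRAME_RATE / 1e6
    print(f"{task}: working set {working_set_kb} KB, ~{extra_mb_s:.0f} MB/s extra traffic")
```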

Figure 5. Required intra-task bandwidth for the RDG FULL task due to the limited cache-memory storage.

6. Semi-Automatic Parallelization

In this section, we want to exploit the modeling results from the previous sections for runtime estimation of the resource usage, as our aim is to execute more functions on the same platform. The model descriptions are used as a prediction for resource planning, parallelization and quality control.

For the application under study, data-dependent switch statements (Fig. 2) can cause the total processing time to change rather abruptly. As a result, the latency at the output may vary over time. However, during a live interventional X-ray procedure, large latency differences between succeeding frames are not allowed for clinical reasons (eye-hand coordination of the physician).

A straightforward solution is to employ a task partitioning on the platform based on worst-case resource usage. With a delay function at the end of the pipeline, the output latency can be kept constant. Although this solution will work as expected, the main drawback is that, most of the time, the reserved resource budget is set too conservatively. Furthermore, the output latency is higher than actually required. Moreover, it is impossible to exploit the difference between average-case and worst-case requirements without affecting the reliability (guaranteed performance) of the application. Another approach is to use the prediction results from Triple-C for semi-automatic parallelization of the application. It is important to note that this approach is essentially different from considering the worst-case state of the system. Unlike the worst-case approach, our approach is dynamic, i.e., it makes use of runtime characteristics of the input data and the environment of the application.


Figure 6. Processing-time statistics for different Region-Of-Interest (ROI) sizes: effective latency (ms) versus ROI size (pixels) for serial and 2-stripe parallel execution, with the linear fit $y(t_k) = 0.067 \times t_k + 20.6$.

Using Triple-C, we are able to accurately predict how many resources are required. This information can be used at runtime for on-the-fly repartitioning of groups of tasks. Furthermore, we have the possibility to execute more functions on the same platform. The approach consists of several steps.

• Initialization: By processing the first frame of the sequence, we initialize the partitioning of the flow graph based on the image characteristics. The output latency is set to an initial value (close to the average case), which will be our latency budget during runtime.

• Runtime adaptation: Based on the outcome of the resource predictions for subsequent frames, the resource manager can decide to repartition the flow graph to handle an increase or decrease of resource consumption, keeping the output latency stable at the initialized (average-case) value (see the sketch after this list).

• Profiling: The application can be profiled to gather statistical information on the differences between the actually consumed resources and the predicted values. This information can be used for on-line model training, or to give insight into the prediction quality of the model descriptions.
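A minimal sketch of this initialization / runtime-adaptation / profiling loop is given below. The callbacks `predict_latency_ms`, `repartition` and `measure_latency_ms` are hypothetical placeholders for the Triple-C predictor, the (re)partitioning step and the measured execution; the margin-based trigger is likewise an assumption, since the paper does not state how the resource manager decides when to repartition.

```python
# Sketch of the initialization / runtime-adaptation / profiling loop described above.
# All callbacks are hypothetical placeholders, not the authors' API.

def run_sequence(frames, predict_latency_ms, repartition, measure_latency_ms,
                 margin_ms=5.0):
    # Initialization: partition for the first frame and fix the latency budget
    # close to the average case.
    partitioning = repartition(frames[0], current=None)
    budget_ms = predict_latency_ms(frames[0], partitioning)

    errors = []                                   # profiling data for model training
    for frame in frames:
        predicted = predict_latency_ms(frame, partitioning)
        # Runtime adaptation: repartition when the prediction leaves the budget.
        if abs(predicted - budget_ms) > margin_ms:
            partitioning = repartition(frame, current=partitioning)
            predicted = predict_latency_ms(frame, partitioning)
        measured = measure_latency_ms(frame, partitioning)
        errors.append(measured - predicted)       # per-frame prediction error
    return budget_ms, errors

# Toy usage with stub callbacks:
if __name__ == "__main__":
    frames = list(range(10))
    budget, errs = run_sequence(
        frames,
        predict_latency_ms=lambda f, p: 40.0 + (5.0 if f % 4 == 0 else 0.0),
        repartition=lambda f, current: "2-stripe" if current is None else current,
        measure_latency_ms=lambda f, p: 41.0,
    )
    print(budget, [round(e, 1) for e in errs])
```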

For the application under study, the data of the RDG FULL and RDG ROI tasks can easily be partitioned, as the tasks have a streaming nature. For the CPLS SEL and GW EXT tasks, functional partitioning is more appropriate, as these tasks operate on extracted features rather than on image data. For a comparison between data-parallel and function-parallel partitioning, we refer to [17].

7. Experimental Results

In this section, we evaluate the analysis and prediction models on a set of test sequences. Furthermore, the statistics are used to guide the semi-automatic parallelization of the application, following the concept explained in the previous section. Let us start by evaluating the accuracy of Triple-C.

(a)

       s0    s1    s2    s3    s4    s5    s6    s7    s8    s9
s0   0.51  0.19  0.16  0.03  0.03  0.02  0.01  0.01  0.01  0.01
s1   0.21  0.21  0.17  0.15  0.07  0.08  0.02  0.06  0.00  0.03
s2   0.04  0.08  0.33  0.13  0.16  0.10  0.04  0.06  0.02  0.04
s3   0.04  0.05  0.19  0.16  0.20  0.14  0.06  0.08  0.04  0.04
s4   0.03  0.04  0.16  0.15  0.23  0.12  0.09  0.12  0.03  0.02
s5   0.03  0.05  0.11  0.15  0.18  0.18  0.15  0.09  0.05  0.03
s6   0.02  0.02  0.09  0.08  0.14  0.20  0.14  0.19  0.07  0.04
s7   0.02  0.04  0.08  0.07  0.10  0.09  0.16  0.25  0.12  0.07
s8   0.03  0.01  0.13  0.03  0.09  0.08  0.14  0.25  0.07  0.17
s9   0.01  0.01  0.03  0.01  0.03  0.03  0.06  0.10  0.12  0.60

(b)

Task       Prediction Model [ms]
RDG FULL   <Eq. 1> + Markov RDG
RDG ROI    <Eq. 3> + Markov RDG
MKX EXT    2.5
CPLS SEL   <Eq. 1> + Markov CPLS
REG        2
ROI EST    1
GW EXT     <Eq. 1> + Markov GW
ENH        24
ZOOM       12.5

Table 2. (a) RDG transition matrix and (b) model summary.

The cache-memory and communication-bandwidth analysis from Section 5 is used for the prediction of memory and bandwidth usage for the set of test sequences. At the scenario level, the memory resource usage is more or less constant. The size of the ROI only slightly impacts the memory usage; therefore, the differences between a large ROI and a small ROI are negligible in terms of bandwidth. For the test sequences, an average prediction accuracy of 90% is obtained between the analysis and the measured cache-memory and communication-bandwidth usage.

Computation-time statistics are obtained by profiling the executed application on a chip-multiprocessor platform. For the prediction of the computation time of data-dependent tasks, we have applied probabilistic models based on known methods for modeling video traffic performance in networks. For tasks with long-term dependencies on input data frames, we have applied filtering to make the signal suitable for probabilistic modeling. Data-dependent switch statements in the task graph are modeled with state tables. Changes in the processing granularity (ROI processing) are modeled with a linear growth function. In Fig. 6, the effective latencies are shown for two data-partitioning strategies: serial and 2-stripe parallel.

For training the prediction models, we have used a data set of 37 video sequences with a total of 1,921 video frames. The training set contains different scenarios to create the dynamics in algorithmic adaptation and switching. In Table 2(a), the Markov transition matrix is shown for the ridge-detection task. Similar matrices are generated for the couples-selection and guide-wire extraction tasks. A summary of the prediction models can be found in Table 2(b). For the test sequences, an average prediction accuracy of 97% is reached, with sporadic excursions of the prediction error up to 20-30%.
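One plausible way to turn the models of Table 2 into a per-frame prediction is sketched below: the EWMA output provides the long-term trend, and the Markov chain contributes the expected short-term deviation, taken here as the expectation over the next-state distribution. This combination rule, the state-centroid values and the accuracy metric (one minus the mean relative error) are our assumptions; the paper reports the accuracy figures without spelling out the exact formulas.

```python
# Sketch of a per-frame prediction from the Table 2 models. The "expected value"
# combination rule, the centroid values and the accuracy metric are assumptions.

import numpy as np

def predict_next(ewma_ms: float, P: np.ndarray, centroids_ms: np.ndarray,
                 current_state: int) -> float:
    """Predicted computation time = EWMA trend + E[deviation | current state]."""
    expected_deviation = float(P[current_state] @ centroids_ms)
    return ewma_ms + expected_deviation

def accuracy(predicted: np.ndarray, measured: np.ndarray) -> float:
    """Average accuracy as 1 - mean relative error (an assumed metric)."""
    return float(1.0 - np.mean(np.abs(predicted - measured) / measured))

if __name__ == "__main__":
    # Tiny 3-state example; a real matrix would be the 10x10 one of Table 2(a).
    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])
    centroids = np.array([-2.0, 0.0, 2.5])        # deviation from the trend, in ms
    print(predict_next(ewma_ms=42.0, P=P, centroids_ms=centroids, current_state=0))
    print(accuracy(np.array([42.0, 44.0]), np.array([43.0, 45.0])))
```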

In Fig. 7 (red curve, top), the computation-time statistics are shown for a straightforward mapping. The latency at the output can vary between 60 and 120 ms.


Figure 7. Prediction model vs. actual computation time: effective latency (ms) per frame number for the straightforward mapping, the semi-automatic parallel mapping, and the prediction model.

By feeding the computation-time prediction from Triple-C to a runtime manager for semi-automatic parallelization, we are already able to reduce the latency and jitter significantly. As our aim is to execute more functions on the same platform, the runtime manager will, in the future, also have to take into account the memory and bandwidth predictions for the different parallelization scenarios.

Results show that our method is able to reduce the jitter on the latency significantly (Fig. 7, yellow curve, bottom). The throughput is much more stable compared to a straightforward mapping. Some small peaks are still noticeable, but the difference between worst-case and average-case execution is reduced to 20% in the semi-automatic parallel case, compared to a straightforward mapping, which leads to a latency variability of almost 85%.

8. Conclusions

In this paper, we have introduced a prediction model, called Triple-C, for Computation, Cache-memory and Communication-bandwidth usage for groups of dynamic image-processing tasks, employing scenario-based Markov chains. The models can be used to execute more functions on the same platform. It has been validated that the scenario-based Markov modeling is suited to describe the runtime resource usage, even if the flow graph dynamically switches between groups of tasks. As a consequence, it is possible to realize a parallelization of data distribution and computations such that the latency is kept nearly constant. This feature enables the execution of more functions on the same platform.

Experimental results for a medical imaging function have shown that scenario-based modeling can be successfully applied to describe the resource-usage function. Results show an average prediction accuracy of 97%, with sporadic excursions of the prediction error up to 20-30%. Our semi-automatic parallelization strategy is able to lower the jitter on the latency by

almost 70% and guarantees a constant throughput. The techniques described in this paper can potentially be used for alternative applications using image analysis, such as in surveillance systems.

References

[1] C.C. Wüst et al., "QoS control strategies for high-quality video processing," Real-Time Systems, vol. 30, no. 1-2, pp. 7-29, 2005.

[2] B. Li and K. Nahrstedt, "QualProbes: middleware QoS profiling services for configuring adaptive applications," in Middleware '00: IFIP/ACM Int. Conf. on Distributed Systems Platforms, USA, 2000, pp. 256-272.

[3] D. Black-Schaffer, Block Parallel Programming for Real-time Applications on Multi-core Processors, Ph.D. thesis, Stanford University, U.S.A., June 2008.

[4] L. Thiele, E. Wandeler, and S. Chakraborty, "Performance analysis of multiprocessor DSPs," IEEE Signal Processing Magazine, vol. 22, no. 3, 2005.

[5] H. Gautama and A.J.C. van Gemund, "Performance prediction of data-dependent task parallel programs," in Euro-Par 2001, vol. 2150 of LNCS, pp. 106-116.

[6] P. Poplavko, T. Basten, and J. van Meerbergen, "Execution-time prediction for dynamic streaming applications with task-level parallelism," in DSD, 2007.

[7] M. Pastrnak and P.H.N. de With, "Data storage exploration and bandwidth analysis for distributed MPEG-4 decoding," in ISCE, IEEE Int. Symp. on Consumer Electronics, 2004.

[8] "Image guidance method coronary stent deployment," Patent application WO/2002/002173.

[9] "X-ray identification of interventional tools," Patent application US/2008/137923.

[10] V. Bismuth and R. Vaillant, "Elastic registration for stent enhancement in X-ray image sequences," in ICIP, IEEE Int. Conf. on Image Processing, Oct. 2008.

[11] "Viewing system for control of PTCA angiograms," Patent application WO/2005/104951.

[12] "Medical viewing system and method for spatially enhancing structures in noisy images," Patent application US/2005/058363.

[13] "System and method for enhancing an object of interest in noisy medical images," Patent application WO/2004/066842.

[14] "Medical viewing system and method for enhancing structures in noisy images," Patent application WO/2003/045263.

[15] V.S. Frost and B. Melamed, "Traffic modeling for telecommunications networks," IEEE Communications Magazine, vol. 32, no. 3, pp. 70-81, Mar. 1994.

[16] S. Radhakrishnan, S. Chinthamani, and K. Cheng, "The Blackford Northbridge chipset for the Intel 5000," IEEE Micro, vol. 27, no. 2, pp. 22-33, 2007.

[17] E.B. van der Tol, E.G. Jaspers, and R.H. Gelderblom, "Mapping of H.264 decoding on a multiprocessor architecture," in Proc. SPIE, vol. 5022, pp. 707-718, 2003.
