
This chapter presents various performance optimizations made to the 3D face scanner application, ranging from high-level optimizations, such as modification of the algorithms, to low-level optimizations, such as the implementation of time-consuming parts in assembly language.

In order to verify that the achieved optimizations were valid in general, and not only for specific cases, 10 scans of different persons were used for profiling the performance of the application. Every profile consisted of running the application 10 times for each scan and then averaging the results, in order to reduce the influence that external factors might have on the measured times. Figure 5.1 presents an example of the graphs that will be used throughout this and the following chapters to represent the changes in performance.

Here, each bar is divided into different colors that represent the distribution of the total execution time among the various stages of the application described in Chapter 3 and summarized in Figure 3.1.

The translation from MATLAB to C code corresponds to the first optimization performed. The top two bars in Figure 5.1 show that the C implementation resulted in a speedup of approximately 15 times over the MATLAB implementation running on a desktop computer. The bottom two bars, in turn, reflect the difference in execution time after running the C implementation on two different platforms; the much more limited resources available in the BeagleBoard-xM have a clear impact on the execution time. The C code was compiled with GCC's O2 optimization level.

The bottom bar in Figure 5.1 represents the starting point for a set of optimization procedures that will be described in the following sections. The order in which these are presented corresponds to the same order in which they were applied to the application.


Figure 5.1: Execution times of (Top) the MATLAB implementation on a desktop computer, (Middle) the C implementation on a desktop computer, (Bottom) the C implementation on the BeagleBoard-xM.

5.1 Double to single-precision floating-point numbers

The same floating-point representation format was necessary in the MATLAB and C implementations in order to compare their results at each step of the translation process. The C code was therefore originally implemented using the double-precision format, since that is the format used by MATLAB. Given that the additional precision offered by the double-precision format over single-precision was not essential, and that the ARM Cortex-A8 processor features a 32-bit architecture, the conversion from double- to single-precision format was made. Figure 5.2 shows that with this modification the total execution time decreased from 14.53 to 12.52 sec.


Figure 5.2: Difference in execution time when double-precision format is changed to single-precision.

5.2 Tuned compiler flags

While the previous versions of the C code were compiled with the O2 optimization level, the goal of this step was to determine a combination of compiler options that would translate into faster-running code. A full list of the options supported by GCC can be found in [41]. Figure 5.3 shows that the execution time decreased by approximately 3 seconds (24% of the total time of 12.5 sec) after tuning the compiler flags. The combination of compiler flags that produced the best performance at this stage of the optimization process was:

-funroll-loops -Ofast -fsingle-precision-constant -ftree-loop-distribution -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp
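Assuming a GNU cross-toolchain targeting the BeagleBoard-xM, the corresponding build invocation would look roughly as follows; the compiler prefix and source file names are placeholders, not the project's actual ones.

```shell
# Hypothetical build line combining the flags listed above.
arm-linux-gnueabi-gcc -o scanner scanner.c -lm \
    -Ofast -funroll-loops -fsingle-precision-constant \
    -ftree-loop-distribution \
    -mcpu=cortex-a8 -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp
```

Note that -Ofast subsumes the earlier O2 level (it enables the O3 optimizations) and additionally turns on -ffast-math, which relaxes strict IEEE floating-point compliance in exchange for speed.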


Figure 5.3: Execution time before and after tuning GCC’s compiler options.

5.3 Modified memory layout

A different memory layout for processing the camera frames was implemented to further exploit the spatial locality of the program. As noted in Section 3.3, many of the operations in the normalization stage involve pixels from pairs of consecutive frames, i.e., first and second, third and fourth, fifth and sixth, and so on. The camera frame data were therefore placed in memory in such a manner that corresponding pixels of each frame pair lie next to each other. The procedure is shown in Figure 5.4.
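The interleaving step can be sketched as follows; the function name and frame size are illustrative, not taken from the application's code.

```c
#include <stdint.h>

#define NPIX 8  /* pixels per frame; illustrative size */

/* Interleave two consecutive camera frames so that corresponding
 * pixels lie next to each other in memory, as in Figure 5.4. */
static void interleave_pair(const uint8_t *frameA, const uint8_t *frameB,
                            uint8_t *out) {
    for (int i = 0; i < NPIX; i++) {
        out[2 * i]     = frameA[i];  /* pixel i of the first frame of the pair  */
        out[2 * i + 1] = frameB[i];  /* pixel i of the second frame of the pair */
    }
}
```

After this transformation, an operation on pixel pairs reads two adjacent memory locations instead of two locations a full frame apart, which is the spatial-locality argument made above.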

However, this modification yielded no improvement in the execution time of the application, as can be seen in Figure 5.5.

5.4 Reimplementation of C’s standard power function

The generation of the Texture 1 frame in the normalization stage starts by averaging the last two camera frames, followed by a gamma correction procedure. Gamma correction in this application consists of raising each pixel to the power of 0.85. After profiling the application, it was found that the power function from the standard C math library was taking most of the time in this process. Taking into account that the

Figure 5.4: Modification of the memory layout of the camera frames. The blue, red, green, and purple circles represent pixels of the first, second, third, and fourth frames, respectively.

Figure 5.5: The execution time of the program did not change with a different memory layout for the camera frames.

high accuracy offered by this function was not required, and that the overhead of validating the input could be removed, a different implementation of the power function was adopted.

A novel approach proposed by Ian Stephenson in [42] works as follows. The power function is usually implemented using logarithms as:

pow(a, b) = x^(log_x(a) · b)

where x can be any convenient value. By choosing x = 2, the process of calculating the power function reduces to finding fast pow2() and log2() functions, both of which can be approximated with a few instructions. For example, log2(a) can be approximated based on the IEEE floating-point representation of a:

a = M · 2^E

where M is the mantissa and E is the exponent. Taking the base-2 logarithm of both sides gives:

log2(a) = log2(M) + E

and since M is normalized, log2(M) is always small, therefore:

log2(a) ≈ E

This new implementation of the power function yields the improvement in execution time shown in Figure 5.6.
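A minimal C sketch of this idea is given below. It is a linear-interpolation variant of Stephenson's approach that manipulates the IEEE 754 bit pattern directly, not the exact code used in the application.

```c
#include <stdint.h>
#include <string.h>

/* Approximate log2(a), a > 0: reinterpreting the float bits as an
 * integer makes the exponent field the integer part of log2 and
 * lets the mantissa bits act as a linear interpolation of the
 * fractional part. Dividing by 2^23 and subtracting the bias (127)
 * converts the bit pattern into that approximate logarithm. */
static float fast_log2(float a) {
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    return (float)bits / (1 << 23) - 127.0f;
}

/* Approximate 2^p: the inverse bit manipulation. */
static float fast_pow2(float p) {
    uint32_t bits = (uint32_t)((p + 127.0f) * (1 << 23));
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* pow(a, b) = 2^(log2(a) * b), valid for a > 0; no input checks. */
static float fast_pow(float a, float b) {
    return fast_pow2(b * fast_log2(a));
}
```

Powers of two are reproduced exactly, and for gamma correction the relative error of the approximation is on the order of a few percent, which is visually negligible for 8-bit pixel data.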


Figure 5.6: Difference in execution time before and after reimplementing C’s standard power function.

5.5 Reduced memory accesses

The original order of execution was modified to reduce the number of memory accesses and to increase the temporal locality of the program. Temporal locality is the principle that recently referenced memory locations tend to be referenced again soon. Moreover, the reordering made it possible to replace floating-point calculations in the modulation stage with integer calculations, which typically execute faster on ARM processors.

Figure 5.7 shows the order in which the algorithms are executed before and after this optimization. By moving the calculation of the modular frame to the preprocessing stage, the values of the camera frames do not have to be re-read. Moreover, the processes of discarding, cropping, and scaling frames are now being performed in an alternating fashion together with the calculation of the modular frame. This loop merging improves the locality of data and reduces loop overhead. Figure 5.8 shows the change in execution time of the application for this optimization step.
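The loop merging described above can be sketched as a single fused pass; the names, frame size, and the scaling/modulation formulas below are illustrative stand-ins, not the application's exact operations.

```c
#include <stdint.h>

#define NPIX 8  /* illustrative frame size; the real frames are larger */

/* Fused preprocessing pass: each camera pixel is read from memory
 * once, and both the scaling step and the integer modulation term
 * are computed in the same loop, instead of re-reading the frames
 * in a separate second pass. */
static void preprocess_fused(const uint8_t *frameA, const uint8_t *frameB,
                             uint8_t *scaledA, uint8_t *scaledB,
                             int32_t *modular) {
    for (int i = 0; i < NPIX; i++) {
        int a = frameA[i], b = frameB[i];
        scaledA[i] = (uint8_t)(a >> 1);  /* stand-in for the scaling step */
        scaledB[i] = (uint8_t)(b >> 1);
        modular[i] = a - b;              /* integer modulation term */
    }
}
```

Besides halving the traffic to the frame buffers, fusing the loops pays the loop-counter and branch overhead only once per pixel instead of once per pass.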

Figure 5.7: Order of execution before and after the optimization. (a) Original order of execution. (b) Modified order of execution.


Figure 5.8: Difference in execution time before and after reordering the preprocessing stage.

5.6 GMC in y dimension only

A description of the global motion compensation (GMC) method used in the application was presented in Chapter 3, and Figure 3.8 shows the different stages of this process.

However, this figure does not reflect the manner in which the GMC was initially implemented in the MATLAB code; in fact, it describes the GMC implementation after being modified with the optimization described in this section. A more detailed picture of the original GMC implementation is given in Figure 5.9. Previous research found that optimal results were achieved when GMC is applied in the y direction only.

This was implemented by estimating GMC for both directions but performing the shift only in the y direction. The optimization consisted in removing all calculations related to the estimation of GMC in the x direction, which were therefore unnecessary. This optimization provides the improvement in execution time shown in Figure 5.10.


Figure 5.9: Flow diagram for the GMC process as implemented in the MATLAB code.


Figure 5.10: Difference in execution time before and after modifying the GMC stage.

5.7 Error in Delaunay triangulation

OpenCV was used to compute the Delaunay triangulation, with a series of examples available in [43] serving as references for our implementation. Although OpenCV constructs the triangulation while abstracting the complete algorithm from the programmer, a not-so-straightforward approach is required to extract the triangles from a so-called subdivision. OpenCV offers a series of functions that can be used to navigate through the edges that form the triangulation; it is therefore the responsibility of the programmer to extract each of the triangles while stepping through these edges.

Moreover, care must be taken to avoid repeated triangles in the final set. An error was detected at this point of the optimization process in the mechanism that was being used to avoid repeated triangles. Figure 5.11 shows the increase in execution time after this bug was resolved.


Figure 5.11: Execution time of the application increased after fixing an error in the tessellation stage.

5.8 Modified line shifting in GMC stage

A series of optimizations performed on the original line shifting mechanism in the GMC stage are explained in this section. The MATLAB implementation uses the circular shift function to align the frames (the last step in Figure 3.8). Given that there is no justification for applying a circular shift, a regular shift was implemented instead, in which the last line of a frame is discarded rather than copied to the opposite border. Initially, this was implemented using a for loop; it was later optimized further by replacing the loop with the memcpy function from the standard C library, which in turn led to faster execution.

A further optimization in the GMC stage yielded both better memory usage and faster execution time. The original shifting approach used two equally sized portions of memory in order to avoid overwriting the frame being shifted. The need for a second portion of memory was removed by adding some extra logic to the shifting process: a conditional statement determines whether the shift has to be performed in the positive or negative direction. If the shift is negative, i.e., upwards, the shifting operation traverses the image from top to bottom while copying each line a certain number of rows above it. If the shift is positive, i.e., downwards, the operation traverses the image from bottom to top while copying each line a certain number of rows below it. The result of this set of optimizations is presented in Figure 5.12.
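The direction-aware in-place shift can be sketched as follows; the function name and image dimensions are illustrative. The key point is that the traversal direction guarantees every source row is read before it is overwritten, so no second buffer is needed.

```c
#include <string.h>

/* In-place vertical shift of a w x h image by `shift` rows:
 * negative shifts move the image up (traverse top to bottom),
 * positive shifts move it down (traverse bottom to top). Rows
 * shifted out of the frame are discarded; rows shifted in keep
 * their previous contents. */
static void shift_rows(unsigned char *img, int w, int h, int shift) {
    if (shift < 0) {                       /* upwards */
        for (int y = -shift; y < h; y++)
            memcpy(img + (size_t)(y + shift) * w, img + (size_t)y * w, (size_t)w);
    } else if (shift > 0) {                /* downwards */
        for (int y = h - 1 - shift; y >= 0; y--)
            memcpy(img + (size_t)(y + shift) * w, img + (size_t)y * w, (size_t)w);
    }
}
```

Since source and destination rows never overlap within a single memcpy call (they are whole distinct rows), memcpy is safe here and memmove is not required.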


Figure 5.12: Execution times of the application before and after optimizing the line shifting mechanism in the GMC stage.

5.9 New tessellation algorithm

A good motivation for using the Delaunay triangulation in a two-dimensional space is presented by Rippa [44], who proves that such a triangulation minimizes the roughness of the resulting model. Nevertheless, an important characteristic of the decoding process used in our application allows the adoption of a different triangulation mechanism that improved the execution time significantly while sacrificing only a very small amount of smoothness. This characteristic is that the set of vertices resulting from the decoding stage is already sorted, which removes the need to search for the nearest vertices and therefore allows the triangulation to be greatly simplified. More specifically, the vertices are ordered from left to right and from bottom to top in the plane. Moreover, they are equally spaced along the y dimension, which further simplifies the algorithm needed to connect them into triangles.

The developed algorithm traverses the set of vertices row by row from bottom to top, creating triangles between every pair of consecutive rows. Each pair of consecutive rows is traversed from left to right while connecting the vertices into triangles.

The algorithm is presented in Algorithm 1. Note that for each pair of rows, the algorithm describes the connection of vertices only until the last vertex of either row is reached. The unconnected vertices that remain in the longer row are connected with the last vertex of the shorter row in a later step (not included in Algorithm 1).

Algorithm 1 New tessellation algorithm

 1: for all pairs of rows do
 2:     find the left-most vertices in both rows and store them in vertex row A and vertex row B
 3:     while the last vertex in either row has not been reached do
 4:         if vertex row A is more to the left than vertex row B then
 5:             connect vertex row A with the next vertex on the same row and with vertex row B
 6:             change vertex row A to the next vertex on the same row
 7:         else
 8:             connect vertex row B with the next vertex on the same row and with vertex row A
 9:             change vertex row B to the next vertex on the same row
10:         end if
11:     end while
12: end for
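A C sketch of Algorithm 1 for one pair of rows is shown below. The data layout is an assumption made for illustration: rowA and rowB hold the sorted x coordinates of the vertices in two consecutive rows, and triangles are emitted as index triples, with row A vertices numbered 0..nA-1 and row B vertices numbered nA..nA+nB-1. The trailing fan to the last vertex of the shorter row (the "later step" mentioned above) is omitted, as in Algorithm 1.

```c
#include <stddef.h>

typedef struct { int v0, v1, v2; } Tri;

/* Stitch two sorted vertex rows into triangles (Algorithm 1,
 * lines 3-11). Returns the number of triangles written to out. */
static size_t stitch_rows(const float *rowA, size_t nA,
                          const float *rowB, size_t nB, Tri *out) {
    size_t iA = 0, iB = 0, n = 0;
    while (iA + 1 < nA && iB + 1 < nB) {
        if (rowA[iA] < rowB[iB]) {
            /* connect A[iA] with the next vertex in row A and with B[iB] */
            out[n++] = (Tri){ (int)iA, (int)iA + 1, (int)(nA + iB) };
            iA++;
        } else {
            /* connect B[iB] with the next vertex in row B and with A[iA] */
            out[n++] = (Tri){ (int)(nA + iB), (int)(nA + iB) + 1, (int)iA };
            iB++;
        }
    }
    return n;
}
```

Because the rows are already sorted, the whole stitching pass is a single linear merge over the two rows, which is what makes this approach so much cheaper than a general Delaunay construction.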

Figure 5.13 shows the result of applying the two described triangulation methods to the same set of vertices. The execution time of the application was reduced by approximately 1.4 seconds with this optimization, as shown in Figure 5.14. Furthermore, the new triangulation algorithm resulted in a speedup of approximately 125 times over OpenCV’s Delaunay triangulation implementation.

Figure 5.13: The Delaunay triangulation was replaced with a different algorithm that takes advantage of the fact that vertices are sorted.

5.10 Modified decoding stage

A major improvement was achieved in the execution time of the application after optimizing several time-consuming parts of the decoding stage. As a first step, two frequently called functions of the standard C math library, namely ceil() and floor(),


Figure 5.14: Execution times of the application before and after replacing the Delaunay triangulation with the new approach.

were replaced with faster implementations that used preprocessor directives to avoid the function call overhead. The time spent validating the input was also saved, since such validation was not required. However, what allowed the new implementations of ceil() and floor() to improve performance the most was the fact that these functions only operate on index values. Given that index values are always non-negative, the implementation of each of these functions could be simplified even further.
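Such simplified replacements can be sketched as macros like the following; the macro names are illustrative and, as stated above, they are only valid for non-negative inputs.

```c
/* Floor and ceiling for non-negative values only: truncation to int
 * replaces floor(), and ceil() needs just one extra comparison.
 * Implemented as preprocessor macros to avoid the call and
 * input-validation overhead of the standard library versions.
 * Behavior is undefined for negative x. */
#define FLOOR_NN(x) ((int)(x))
#define CEIL_NN(x)  ((int)(x) + ((x) > (float)(int)(x)))
```

For non-negative x, the C cast to int truncates toward zero, which coincides with floor; the ceiling macro adds 1 exactly when a fractional part is present.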

A second optimization applied to the decoding stage was to replace dynamically allocated memory on the heap with statically allocated memory on the stack, while ensuring that the amount of memory involved would not cause a stack overflow. Stack allocation is usually faster, since it avoids the bookkeeping overhead of a heap allocator.

The last optimization consisted of detecting and removing several tasks that did not contribute to the final result. Such tasks were present in the application because several alternatives had been implemented for achieving a common goal during the algorithmic design stage; after the best option was assessed and chosen, the others were never entirely removed.

The overall result of the optimizations described in this section is shown in Figure 5.15.

An important reduction of approximately 1 second was achieved. As a rough estimate, half of this speedup can be attributed to the removal of the nonfunctional code.

5.11 Avoiding redundant calculations of column-sum vectors in the GMC stage

This section describes the last optimization performed to the GMC stage. The algorithm presented in Figure 3.8 has the following shortcoming: for every pair of consecutive


Figure 5.15: Execution time of the application before and after optimizing the decoding stage.

frames, the sum of pixels in each column is calculated for both frames. This means that the column-sum vector is calculated twice for each image except for the first and last frame (n = 1 and n = N ). By reusing the column-sum vector calculated in the previous iteration, such recalculation can be avoided. An updated version of the GMC stage that incorporates this idea is shown in Figure 5.16. The speedup achieved for the GMC stage after performing this optimization was approximately 1.8 times. Figure 5.17 shows the execution times of the application before and after removing the redundant calculations.
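The reuse pattern can be sketched as follows. The frame size, function names, and the per-pair "estimate" (a simple difference of column sums standing in for the real GMC shift estimation) are all illustrative; the point is that column_sums() runs once per frame instead of twice per pair.

```c
#include <stdint.h>
#include <string.h>

#define W 4  /* illustrative frame width  */
#define H 2  /* illustrative frame height */

/* Sum of the pixels in each column of a W x H frame. */
static void column_sums(const uint8_t *frame, int32_t *sums) {
    for (int x = 0; x < W; x++) sums[x] = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            sums[x] += frame[y * W + x];
}

/* GMC driver: each frame's column-sum vector is computed once and
 * carried over to the next iteration, so pair (n-1, n) reuses the
 * vector computed for pair (n-2, n-1). The placeholder estimate
 * stored per pair is NOT the application's actual shift formula. */
static void gmc_all_pairs(const uint8_t *frames, int nframes, int32_t *shift_out) {
    int32_t prev[W], curr[W];
    column_sums(frames, prev);                   /* first frame, computed once */
    for (int n = 1; n < nframes; n++) {
        column_sums(frames + n * W * H, curr);   /* only the new frame */
        shift_out[n - 1] = curr[0] - prev[0];    /* placeholder estimate */
        memcpy(prev, curr, sizeof prev);
    }
}
```

For N frames this performs N column-sum computations instead of the original 2(N − 1), which is consistent with the roughly 1.8-times speedup of the GMC stage reported above.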

5.12 NEON assembly optimization 1

The ARM NEON general-purpose SIMD engine featured in the Cortex-A series processors was exploited for the last series of optimizations performed on the 3D face scanner application. The first step was to identify the stages of the application that exhibit a rich amount of exploitable data-level parallelism, where the NEON technology could be applied.

The vast majority of the operations performed in the preprocessing, normalization, and global motion compensation stages are data independent and, therefore, suitable for being computed in parallel on the ARM NEON architecture extension.

There are four major approaches to integrating NEON technology into an existing application: (i) using a vectorizing compiler that automatically translates C/C++ code into NEON instructions; (ii) using existing C/C++ libraries based on NEON technology; (iii) using the NEON C/C++ intrinsics, which provide low-level access to NEON instructions while the compiler does some of the work associated with writing assembly instructions; and (iv) directly writing NEON assembly instructions linked into the C/C++ project during compilation. A detailed explanation of each of these approaches can be found in [45]. Based on the results achieved in [46], directly writing NEON assembly instructions outperforms the other alternatives, and it was therefore this approach that was adopted.


Figure 5.16: Flow diagram for the optimized GMC process that avoids the recalculation of the image's column sums.

Figure 5.17: Execution times of the application before and after avoiding redundant calculations of the column-sum vectors in the GMC stage.