

Bachelor Informatica

Real-time visualization of in-stent restenosis models

Mathijs Molenaar

June 8, 2016

Supervisor(s): Alfons Hoekstra, Robert Belleman

Informatica, Universiteit van Amsterdam


Abstract

In-Stent Restenosis (ISR) is the narrowing of a blood vessel after a metal stent has been placed during a previous stenosis surgery. The exact causes of this phenomenon are not yet fully understood. In search of the causes, computer experiments are performed using advanced fluid simulations. During these investigations, model parameters are tweaked and additional simulations are run to better understand the results. Existing visualization solutions provide neither the performance nor the ease of use that is required.

Therefore, ISR Viewer was developed at SURFsara. It provides high run-time performance at the cost of pre-processing. Additionally, it only works on surfaces and attributes that are defined upon them. Volume rendering is not supported because of its high computational cost. This thesis describes how the original program is improved in terms of performance, graphics features and usability. Options are investigated and the choices made are presented and discussed. Start-up times and frame rates are improved, while predictions are made about scalability. The new graphics features focus on adding shadows. These help the user perceive depth better, which makes a big difference when looking perpendicular to a clip plane. To improve usability, multi-touch support is added and the feature set is extended. This includes customizable color mapping as well as hiding surfaces based on their attributes.

In the end, these enhancements get the program to a state where it is more than just a tech demo.


Contents

1 Introduction
2 Background Information
   2.1 ISR Viewer
      2.1.1 Input data
      2.1.2 Technology
      2.1.3 Program Structure
3 Performance
   3.1 Profiling Tools
   3.2 Start-up time
   3.3 Improving Frame Time
      3.3.1 Problem
      3.3.2 Solutions
      3.3.3 Implementation
   3.4 Performance prediction
4 Graphics Features
   4.1 Introduction to Computer Graphics
   4.2 Rendering Techniques
      4.2.1 Single Pass
      4.2.2 Forward Rendering
      4.2.3 Deferred Rendering
      4.2.4 Results
   4.3 Shadow Mapping
      4.3.1 Algorithm
      4.3.2 Implementation
      4.3.3 Performance
   4.4 Screen Space Ambient Occlusion
      4.4.1 Introduction
      4.4.2 Algorithm
      4.4.3 Implementation
      4.4.4 Results
5 Usability
   5.1 Color Mapping
      5.1.1 Features
      5.1.2 Implementation
   5.2 Multi-touch Interaction
      5.2.1 Introduction
      5.2.2 Implementation
6 Future Work
   6.1 Culling and Sorting
   6.2 Next Generation Graphics APIs
7 Conclusion
Appendices


CHAPTER 1

Introduction

This thesis describes the improvements made to a visual interactive exploration environment designed to aid in-stent restenosis research. Stenosis is the abnormal narrowing of a blood vessel. Modern medical procedures place a metal tube, a stent, in the vessel to prevent recurrence. Nevertheless, the vessel can narrow again, and when it does this is called in-stent restenosis. Both computational science engineers and medical experts are looking for the causes of this phenomenon, as they are not yet fully understood. Part of the research consists of performing computer flow simulations on virtual 3D models. Visualizing this data traditionally requires heavy computing power and delivers only a still image. It is clear that researchers require a faster, more intuitive way to explore the simulation results while they are tweaking the model.

This is what the ISR (In-Stent Restenosis) Viewer is designed for. It provides an integrated visualization environment for large data sets. Originally developed at SURFsara, it provides real-time interaction at the cost of some graphical fidelity. The main difference from existing visualization techniques is that it only supports surface meshes. The change from translucent volumes to surface meshes allows for the speed-up that is required for real-time rendering. The primary counterpart of ISR Viewer is ParaView[15], which is built upon VTK[16] (Visualization ToolKit). ParaView is an advanced visualization program that supports both filtering and rendering of data. Its extended feature set comes at the price of performance, which is where ISR Viewer differentiates itself.

The focus of this work is to extend ISR Viewer to aid the aforementioned workflow. The work is categorized into three main subjects:

1. Performance While the overall performance of the program is satisfying, there are two points to be improved. First, the start-up time is higher than desirable. Second, the frame rate deteriorated under certain conditions. Both of these problems are pinpointed and fixed to provide a fluid user experience.

2. Graphics features ISR Viewer supports only basic rendering techniques; both performance and usability take precedence over graphical fidelity. It turned out to be possible, however, to add shadows within the performance constraints. These help to point out differences in distance. The improved depth perception helps the user to better understand the geometric structure of an object.

3. Usability Extending the visualization options is the primary focus of this thesis. With the availability of a touch screen, it is a logical choice to add multi-touch support. This adds a unique selling point to the application. The color mapping functionality is also greatly extended. The most notable addition is the ability to hide geometry based on user-defined thresholds.


CHAPTER 2

Background Information

2.1 ISR Viewer

In this chapter the ISR Viewer application in its original form (figure 2.1) is discussed. As the input data plays an important role in the program, it is explained first. Afterwards, the global structure of the original program is discussed. The insights given here are essential to understanding some of the improvements described in later chapters.

Figure 2.1: Screenshot of the original version of ISR Viewer.

2.1.1 Input data

The results of fluid simulations cannot be directly loaded into ISR Viewer. Instead, some pre-processing has to be done, like generating low polygon representations and converting the meshes to the custom BIN format. A scene file describes a model and optionally provides material information. A scene may contain multiple variants of the same model; a good example is a full blood vessel and a clipped (cut in half) variant that shows the insides. In addition, a model is usually made up of multiple elements. An element can be the metal stent or a layer of the blood vessel's wall. Finally, a scene consists of multiple time steps to show the evolution of the in-stent restenosis over time. A mesh should be provided for every element in every variant at each time step. The available variants, elements and time steps are to be provided by the scene file, as well as the paths to the corresponding mesh files. Additionally, a scene file can indicate which surface attributes are available. For each mesh file one or more BIN files containing an attribute may be provided. These files are assumed to follow a specific naming convention, so they do not have to be supplied by the scene file. A script is provided to batch convert mesh files to the BIN format and to generate a scene file.
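To make this structure concrete, a scene description along these lines might look like the sketch below. This is purely illustrative: the actual scene file format is defined by ISR Viewer's conversion script, and every field name here is invented.

# Hypothetical scene file (a Python module); all names are illustrative.
scene = {
    "variants": ["full", "clipped"],
    "elements": ["stent", "vessel_wall"],
    "timesteps": range(0, 40),
    # One BIN mesh per (variant, element, time step):
    "mesh_path": "meshes/{variant}/{element}_{timestep:04d}.bin",
    # Attribute BIN files are found via a naming convention next to each mesh.
    "attributes": ["wall_shear_stress"],
    "materials": {"stent": {"color": (0.8, 0.8, 0.9)}},  # optional
}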

The BIN file format was specifically developed for ISR Viewer. It is used to store either a mesh or a scalar attribute defined at every vertex of a mesh. Meshes must consist of indices, vertex positions and vertex normals. A BIN file stores this information as contiguous binary arrays, which allows the data to be used directly by OpenGL without any processing.
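As an illustration of how cheaply such a file can be read, a NumPy sketch follows. The precise header layout of the BIN format is not documented in this thesis, so the counts and field order below are assumptions:

import numpy as np

def read_bin_mesh(path):
    """Hypothetical BIN mesh reader; the header layout is assumed, not specified."""
    with open(path, "rb") as f:
        index_count, vertex_count = np.fromfile(f, dtype=np.uint32, count=2)
        indices = np.fromfile(f, dtype=np.uint32, count=index_count)
        positions = np.fromfile(f, dtype=np.float32, count=3 * vertex_count)
        normals = np.fromfile(f, dtype=np.float32, count=3 * vertex_count)
    # The arrays are contiguous and can be handed to OpenGL without processing.
    return indices, positions.reshape(-1, 3), normals.reshape(-1, 3)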

2.1.2 Technology

ISR Viewer was first developed by Paul Melis at SURFsara in 2013. It is built upon the GUI framework Qt4[4] and is written in Python 2[10]. Although both of these technologies are cross-platform, only Windows is officially supported. It is widely known that Python by itself is not a high performance programming language. Libraries written in C, like NumPy[8], help bridge the performance gap with native languages. Furthermore, ISR Viewer was designed to not do any heavy lifting itself. With only one developer, the high productivity that Python delivers is considered most important. Rendering of the 3D meshes is performed with OpenGL 3.3. This API is well supported by all three major platforms (Windows, Linux and OS X) and all graphics vendors (AMD, Nvidia and Intel).

2.1.3 Program Structure

The original program consisted of a simple structure, with a GUI class containing most of the logic. When a different variant, attribute or time step is selected, new mesh or attribute data has to be loaded. This is a two-part process, as there are three different types of storage. The first step is to load the mesh data from the file system into CPU memory; because of the BIN file format this operation is trivial. Graphics cards contain their own memory (video random access memory) because access to CPU memory is too slow. So the second step is to copy data from CPU into GPU memory, after which it can be used for rendering. GPU memory usually does not have the capacity to store all time steps of the scene. Furthermore, it cannot be upgraded without purchasing a new graphics card. This is why the original program only stored one time step in GPU memory. The CPU memory is used as a cache between GPU memory and disk. In its original form, this cache was never cleared, so the system was required to have enough CPU memory to fit the entire data set. To hide the latency introduced by loading files from disk, a low polygon version of the mesh is drawn as a placeholder. These proxies can be generated using the BIN conversion script and are all loaded during start-up. To prevent file I/O from stalling the GUI, a separate I/O thread is used. File operations are executed in last in, first out (LIFO) order, as the latest request is deemed most important.
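A minimal sketch of such a LIFO loader thread (illustrative, not the actual ISR Viewer code):

import threading

class LoaderThread(threading.Thread):
    """Serves file requests newest-first so the GUI never waits on stale loads."""

    def __init__(self):
        super(LoaderThread, self).__init__()
        self.daemon = True
        self.stack = []  # LIFO: the most recent request sits on top
        self.condition = threading.Condition()

    def request(self, path, callback):
        with self.condition:
            self.stack.append((path, callback))
            self.condition.notify()

    def run(self):
        while True:
            with self.condition:
                while not self.stack:
                    self.condition.wait()
                path, callback = self.stack.pop()  # newest first
            with open(path, "rb") as f:  # blocking I/O, off the GUI thread
                callback(f.read())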


CHAPTER 3

Performance

This chapter will cover the performance problems that plagued the original implementation. The first section describes the tools that were used to find the source of the problems. The two main issues were long start-up times and low frame rates. Their source, possible fixes and the final results are described in the respective sections. At the end of this chapter, predictions are made about the performance characteristics of future larger models. It is shown that ISR Viewer will keep performing well in the foreseeable future.

3.1 Profiling Tools

The default tool for diagnosing performance problems is a profiler. Python comes bundled with three of them[11]: profile, hotshot and cProfile. Of those three, only cProfile will be discussed, as profile is too slow and hotshot is deprecated. In addition, the following third-party profilers were considered: line profiler[14], pprofile[21] and Visual Studio[18] with Python Tools[17].

Profilers can be grouped into two categories: deterministic profilers and statistical profilers. As Python is an interpreted language, it allows profilers to hook into every line of the program. Deterministic profilers use this feature to provide accurate profiling results. This comes at the cost of a high performance overhead, which is especially apparent in line-by-line profilers. Statistical profilers try to reduce this overhead by sampling the call stack at a fixed interval, which gives them a predictable performance impact. The downside is that their results may not accurately reflect the program's execution if the interval time is too high.

Of the profilers considered, cProfile and line profiler are deterministic, pprofile supports both modes and the type of the Visual Studio profiler is unknown. Although cProfile is a deterministic profiler, its performance overhead is low because it is only invoked on function calls. The biggest downside of cProfile is that it only records the number of calls and the cumulative execution time of functions. line profiler promises to do the same but at line granularity. Unfortunately, it could not be properly tested as it does not support one of the libraries that ISR Viewer uses. pprofile also provides line granularity, in both deterministic and statistical mode. This causes very high performance overhead in deterministic mode, raising the frame time from 4 milliseconds to 12 milliseconds; the statistical mode, however, has no noticeable performance impact. Like cProfile, pprofile only records the number of calls and the cumulative time. Finally, the Visual Studio profiler provides function-level granularity at low overhead. It is the only profiler that provides extended statistics, like the maximum execution time of a function. That is why Visual Studio was used as the main profiling tool for this thesis.
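For reference, profiling a block of code with the bundled cProfile module looks like this (main_loop() is a placeholder for the code under investigation):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
main_loop()  # placeholder: the code under investigation
profiler.disable()

# Print the ten functions with the highest cumulative execution time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)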

During normal operation the program is GPU limited on the test system (Appendix A). This means that the CPU produces frames faster than the GPU can render them: when the application wants to present a rendered frame, it has to wait until the GPU is finished. Improving CPU performance will thus not result in higher frame rates, as the GPU forms the bottleneck. Profiling tools for graphics cards are scarce. Microsoft supplies multiple GPU diagnostic tools with Visual Studio, but these only work with DirectX and do not support OpenGL. The only tools that support OpenGL are developed by the GPU manufacturers. In case of the test system (Appendix A) this is AMD, with its GPU PerfStudio[1]. This software supports both profiling and basic debugging of OpenGL shaders (GPU code). It can also show how long it took to draw each individual mesh, which can be of big help when analyzing performance problems of GPU-heavy rendering techniques.

3.2 Start-up time

The start-up time of the original version of ISR Viewer was not satisfactory. Reproducing the problem was a bit difficult, as it only occurred after a reboot. This is because the Windows operating system caches all recently used files in memory. For testing purposes, RAMMap[23] was used to clear this cache. The Visual Studio profiler mentioned earlier immediately shows the root of the problem (figure 3.1).

Figure 3.1: Profiling of start-up time.

The problem is caused by blocking file operations. Investigation of the respective methods shows that all the proxy (low polygon) meshes are loaded synchronously at start-up. An even bigger delay is caused by the get_information() function, which reads the limits of the 3D meshes at all time steps. These limits include the size of the axis-aligned bounding box (AABB) and the number of vertices and indices. Although this information is stored in the header of each BIN file, it takes a lot of time to process them all.

The suggested fix is to move the collection of these limits to a pre-processing step. The information can be stored in the Python file that describes a scene. A consideration is that this may break the program if the user updates the 3D mesh files without updating the scene file. Users who plan to do this a lot may opt to not provide the limits in the scene file; ISR Viewer will then work like it did before and synchronously read all the files. The loading of proxy meshes has been changed to an on-demand approach. This fix goes hand in hand with the changes made in the next section and is described there in more detail.

3.3 Improving Frame Time

3.3.1 Problem

The second major performance problem is sudden frame drops, which can also be described as spikes in the frame time. The first step in solving any bug is finding out how to reproduce it. Frame rate drops are only visible if something is happening on the screen, so for debugging purposes the camera was altered to automatically rotate around the models. After some experimentation it became clear that the frame rate always drops when changing time step. The next step is to fire up the profiler and see what is causing these stutters.

Because the problems do not occur every frame, the average execution time is not important; the maximum execution time is the main focus. This also shows why Visual Studio is a good choice, as it is the only profiler that shows minima and maxima. The profiling results (figure 3.2) show that BufferCollection.update() has a fairly high maximum execution time despite its low average. The only thing this function does is make multiple calls to glBufferSubData. This explains why the maximum execution time of glBufferSubData is only half that of its parent.


Figure 3.2: Profiling of frame rate stutters when time step changes.

glBufferSubData is an OpenGL function that copies data from the CPU to the GPU. In this case that includes vertex, index and optionally attribute data.

3.3.2 Solutions

A possible explanation of why it takes so long is given on the OpenGL wiki[24]: "There are several OpenGL functions that can pull data directly from client-side memory, or push data directly into client-side memory. (...) When any of these functions have returned, they must have finished with the client memory." This suggests that the glBufferSubData call blocks until all the data is copied to the GPU. In the sample scene, time steps contain up to 145MB of data that needs to be uploaded. PCI-express 3.0 has a bandwidth of 15.75GB/s, so in theory the transfer operation should take 9.2 milliseconds. OpenGL Insights[5] has shown, however, that the practical bandwidth of PCI-express 2.0 lies substantially below the theoretical bandwidth; this is assumed to be true for PCI-express 3.0 as well. In the same chapter, OpenGL Insights also claims that applications do not actually have to wait for the transfer operation: the driver makes a copy of the data and lets the GPU transfer it over later. This way the client application only has to wait for a memcpy, and such a copy should not take up to 50 milliseconds for 145MB of data. This was verified with an experiment using glMapBuffer. This function lets the driver allocate memory ahead of time, to which a pointer is given to the application. This way the data copy can be performed, and thus measured, on the application side. A call to glUnmapBuffer then returns ownership of the memory to the driver, after which it is transferred to the graphics card. Timing results show that the call to glUnmapBuffer takes far longer than the actual copy operation, which suggests that the driver is performing a synchronous transfer operation. This also happens on a newly allocated buffer that will not be used for rendering in the next frames.
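The experiment can be sketched as follows (PyOpenGL calls; the helper name is illustrative and the buffer is assumed to be created elsewhere):

import ctypes
import time

from OpenGL.GL import (GL_ARRAY_BUFFER, GL_WRITE_ONLY, glBindBuffer,
                       glMapBuffer, glUnmapBuffer)

def timed_upload(buffer_id, data):
    """Times the application-side memcpy separately from glUnmapBuffer."""
    glBindBuffer(GL_ARRAY_BUFFER, buffer_id)
    ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY)
    t0 = time.time()
    ctypes.memmove(ptr, data.ctypes.data, data.nbytes)  # copy into driver memory
    t1 = time.time()
    glUnmapBuffer(GL_ARRAY_BUFFER)  # ownership returns to the driver here
    t2 = time.time()
    return t1 - t0, t2 - t1  # (copy time, unmap time)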

A possible workaround is to perform the upload tasks on a secondary thread. Multi-threading support in OpenGL is very restricted, as the API is modelled after a state machine. Rendering from multiple threads is impossible, although using a separate upload thread is supported. In this workflow, both threads have their own state machine but share buffer resources. This requires the programmer to insert the correct synchronization primitives to ensure the buffer is always in the correct state. The solution was tested in a stand-alone program that shows the data with a rotating camera and changes time step every 400 frames. The results are given as percentiles of the frame times, showing the percentage of frames below a certain frame time. The results show that multithreading slightly reduces performance compared to the original implementation.

An alternative solution is to break up the big block of data into smaller parts, which can then be uploaded over multiple frames. A visual overview is given in figure 3.3. Although the average frame rate remains unchanged, the maximum frame time is reduced significantly. This workaround has been implemented in ISR Viewer, as the results were very positive (figure 3.4).
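A minimal sketch of the chunked upload idea (PyOpenGL calls; the class name and chunk size are assumptions, and buffer management around it is simplified):

from OpenGL.GL import GL_ARRAY_BUFFER, glBindBuffer, glBufferSubData

CHUNK_SIZE = 8 * 1024 * 1024  # bytes uploaded per frame; assumed, tunable

class UploadTask:
    """Uploads one buffer in fixed-size chunks, one chunk per rendered frame."""

    def __init__(self, buffer_id, data):
        self.buffer_id = buffer_id
        self.data = data  # numpy byte array kept in the CPU cache
        self.offset = 0

    def tick(self):
        """Call once per frame; returns True when the upload is complete."""
        chunk = self.data[self.offset:self.offset + CHUNK_SIZE]
        glBindBuffer(GL_ARRAY_BUFFER, self.buffer_id)
        glBufferSubData(GL_ARRAY_BUFFER, self.offset, chunk.nbytes, chunk)
        self.offset += chunk.nbytes
        return self.offset >= self.data.nbytes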


Figure 3.3: Visual representation of the difference in GPU uploading.

Figure 3.4: Frame time analysis of the original implementation and the two workarounds. The graph shows the percentage of frames below a certain frame time. This is also known as a percentile graph.

3.3.3 Implementation

The actual implementation in ISR Viewer was not as straightforward as in the test program. Spreading upload operations over multiple frames introduces some lag that needs to be covered up. Just like with file I/O, a proxy (low polygon version) of the model is shown until the transfer is completed. These proxy models are loaded in exactly the same way as described below; the assumption is that they load so fast that the user never has to wait for them.

Using a fixed number of buffers does not make optimal use of the available GPU memory, so a GPU buffer pool is introduced. Whenever data needs to be uploaded, a new buffer is allocated if there is room in GPU memory. Otherwise, the least recently used (LRU) buffer from the pool is invalidated and reused. Because upload operations span more than one frame, an upload task stack is necessary. It is important to note that this introduces a race condition: if a lot of new upload tasks are requested, the buffer that gets reused may still be on the upload stack from a previous request. The solution is to cancel the original upload operation, as it is outdated. The same LRU eviction rule is also applied to the CPU cache, which brings an extra dependency: when a CPU item gets evicted, there may be a GPU upload task that depends on that data. The current fix is to cancel that upload task so that the CPU item can be safely reused.
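For illustration, the LRU eviction logic could be organized as in the sketch below (not the actual ISR Viewer code; the allocate and invalidate callbacks stand in for the OpenGL buffer calls and the upload-task cancellation described above):

from collections import OrderedDict

class BufferPool:
    """Hypothetical LRU pool of GPU buffers under a fixed memory budget."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.buffers = OrderedDict()  # key -> (buffer, size), oldest first

    def acquire(self, key, size, allocate, invalidate):
        # Evict least recently used buffers until the new one fits.
        while self.used + size > self.budget and self.buffers:
            _, (buf, old_size) = self.buffers.popitem(last=False)
            invalidate(buf)  # e.g. cancel a pending upload task for this buffer
            self.used -= old_size
        buf = allocate(size)  # e.g. glGenBuffers + glBufferData
        self.buffers[key] = (buf, size)
        self.used += size
        return buf

    def touch(self, key):
        self.buffers.move_to_end(key)  # mark as most recently used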

The transfer performance may change depending on the hardware used, and users may have different opinions on what the target frame rate should be. As a solution, the amount of data copied every frame is included in the application settings. A frame rate target option is hard to implement, as it requires an indication of how long an upload operation will take.

To reduce the time proxies are shown, a prediction is required of what the user is going to do next. It is likely that a user will step to the next or previous time step. This assumption has been included in ISR Viewer: as long as there is room, two neighbouring time steps on each side are loaded into GPU memory. This is combined with the least recently used cache described above to make optimal use of memory.

3.4 Performance prediction

The models being visualized are getting bigger over time, so it is important to show how ISR Viewer will scale with increasing model sizes. Load times can be split into two parts: from disk to CPU memory and from CPU to GPU memory. Moreover, cache sizes are also very important: if everything fits into the cache, it only has to be loaded once. By default, the two neighbouring time steps on each side are loaded into GPU memory so they can be displayed instantly. This may hide some of the latency involved in getting 3D meshes from disk into GPU memory. Of course, the GPU memory should always have the capacity to contain at least one time step. If there is not enough room for five time steps, the number of preloaded time steps is decreased. This may seriously impact the user experience when the user scrolls through the time line.

The first part of the loading process, from disk to CPU memory, is primarily determined by the storage system. Every time step consists of at least one 3D mesh and optional attribute files. The mesh files are much bigger than the attribute files, as they contain the indices (32-bit integers) as well as 6 floats per vertex (position and normal), as opposed to just 1 float per vertex. The sample scene has mesh files ranging from 18MB to 65MB and attribute files starting from 2MB. These attribute files are so small that access times may come into play. The rated access time of a mechanical hard drive is about 10 milliseconds, which is considerable compared to the 13.33 milliseconds it takes to read a 2MB attribute file at 150MB/s. In theory, storing the model on a solid state drive should give a big boost to file I/O performance. Not only do solid state drives offer much higher sequential read speeds (especially PCI-express drives), but their low access times should also significantly reduce the read times of smaller files.

To test this hypothesis, a small test program was built that loads a subset of the sample data set (1.05GB). As solid state drives are optimized for parallel workloads, multi-threaded access is also evaluated. Between each test run the file system cache was cleared to ensure validity. The results (figure 3.5) show that an SSD delivers far better performance than an HDD. They also show that, contrary to expectation, the SSD does not scale well with a high number of threads. It may be interesting for future work to find an explanation for this unexpected behavior.

Figure 3.5: File load performance using multiple threads.

The time it takes to complete a GPU transfer depends on the frame rate and the number of blocks that an upload task is split into. With increasing model sizes it is recommended to adjust the GPU transfer block size accordingly. This can be a process of trial and error; a more elegant solution is considered future work. The actual transfer speed depends on the hardware and graphics driver. PCI-express development is not moving very fast at the moment. On the other hand, more powerful GPUs are released every year, which results in faster rendering and leaves more time for data transfers.


CHAPTER 4

Graphics Features

4.1 Introduction to Computer Graphics

Both shadow mapping and screen space ambient occlusion require some understanding of computer graphics. This section gives a quick introduction and describes at a very high level how OpenGL operates.

Computer graphics may be generated using three different rendering techniques: ray tracing, path tracing and rasterization. Path tracing tries to mimic the real-world behaviour of light rays. It traces light rays coming from the camera through the scene; when a ray hits a surface, it splits and bounces as defined by the surface material. Of course, it is impossible to trace the rays indefinitely, so a limit is set on the maximum number of bounces. When using a high number of bounces, path tracing is able to generate high quality imagery that is indistinguishable from real-world photography. The issue with path tracing is its extremely high computational cost. Ray tracing is somewhat similar to path tracing in that it traces light rays, but at the first intersection rays are shot towards all lights to determine if the point is lit by them. This produces accurate shadows and direct lighting. It does not account for indirect lighting, also known as global illumination. Although it is much faster than path tracing, ray tracing is still not fast enough to visualize complex models at interactive frame rates. That is why rasterization has become the standard in real-time graphics. Rasterization is the task of converting basic two-dimensional primitives into raster images; in computer graphics it means converting triangles into screen pixels. This technique creates images comparable to ray tracing when shadows are incorporated, with the advantage of much higher performance. Nowadays, some computer games actually use a combination of both techniques to provide high quality shadows (Frustum-Traced Raster Shadows [26]) or to mimic global illumination (Interactive Indirect Illumination Using Voxel Cone Tracing [6]).

The conversion from 3D to 2D is performed with some clever linear algebra. Three matrices are used to transfer primitives between four different coordinate systems, also called spaces (figure 4.1). Model space is the coordinate system in which geometry is defined. Geometry is placed in world space using a model matrix; model matrices often combine rotation, translation and scaling to put a mesh in the correct position in the world. The view matrix translates the world such that the camera becomes the origin, and rotates it such that the look-at vector of the camera points in the negative z direction. Finally, the projection matrix projects the view space into screen space. There are two different types of projection, although almost all applications use perspective projection. Perspective projection assumes that viewing rays converge to a single point, the eye of the camera. The other type of projection is orthogonal, which uses parallel viewing rays. This is seldom used for virtual cameras, but it can come in handy for other graphics techniques like shadow mapping.
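Summarized in the standard formula, with p given in homogeneous coordinates:

p_{screen} = P \cdot V \cdot M \cdot p_{model}

where M, V and P are the model, view and projection matrices respectively.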

The OpenGL rendering pipeline consists of a number of different blocks, some of which are programmable and some still fixed function. For the sake of simplicity, only part of the graphics pipeline will be covered; a complete overview can be found at http://openglinsights.com/pipeline.html. For now, it is only important to understand two blocks of the pipeline: the vertex shader and the fragment shader.


Figure 4.1: The 4 different coordinate systems used for rasterized rendering (courtesy of matrix44.net [22]).

Shaders are programmable blocks that can be written in GLSL, the programming language of OpenGL. The syntax is based on C but it has extended support for linear algebra. In the vertex shader, vertex data can be modified: using the matrices described above, the 3D world positions are transformed into 2D camera coordinates. The original program also used the vertex shader to determine the output color on a per-vertex basis. This process is known as Gouraud shading and relies on the graphics hardware to interpolate the colors between vertices. Gouraud shading has long been replaced by Phong shading, which performs the lighting calculations in the fragment shader. This programmable block is executed on a per-pixel basis. Doing the light calculations here results in considerably better image quality at the cost of a tiny bit of performance.

4.2 Rendering Techniques

4.2.1 Single Pass

Although shader programming has become pretty flexible, it does not support variable-size arrays of uniform variables. This can be troublesome in situations with a varying number of light sources. The simplest solution is to define an array of a size that will never be exceeded. The disadvantage of this technique is that it is not very efficient when the scene contains a high amount of overdraw. When a pixel in the frame buffer gets overwritten, the result of any previous light operation is discarded, meaning that the previous light operation on that pixel was a waste of processing time. This is why advanced rendering engines use depth sorting on a per-mesh basis to prevent overdraw. ISR Viewer is not very suitable for depth sorting, because a model usually contains only a handful of big meshes.

4.2.2 Forward Rendering

A more flexible solution is forward rendering. Forward rendering spreads the lighting calculations over multiple render passes. For every light source the geometry is rasterized again to calculate the light's influence on the final pixel colors. A fixed function block of the graphics pipeline performs the merging of multiple passes.


Figure 4.2: OpenGL pipeline (courtesy of opengl.org wiki [25]).

A possible performance improvement over single pass rendering is that consecutive render passes will not have overdraw, as the depth buffer is already filled after the first pass. The caveat is that rasterizing the geometry multiple times can be very expensive. Forward rendering was implemented as a test, but it took a big hit on the frame rates: the frame times scaled almost linearly with the number of lights. The cost of rasterization far outweighed the cost of the lighting operations, which is why forward rendering was eventually removed from the application.

4.2.3 Deferred Rendering

Deferred rendering trades the overhead of rasterizing the scene multiple times for higher memory usage. Just like forward rendering, it may use multiple render passes for the lighting calculations. But instead of rendering the geometry multiple times, the normals and material information are stored in a special frame buffer, called the G-buffer. Deferred rendering can be very heavy on the memory controller, so it is important to keep the size of the G-buffer to a minimum. This can be done by storing floating point numbers in a low precision representation. Lighting is done in one or multiple consecutive render passes that affect all pixels on the screen.

Essentially, deferred rendering combines the advantages of forward rendering and single pass rendering. Furthermore, some post-processing techniques are only possible in forward and deferred renderers: screen space ambient occlusion (section 4.4) requires normal information after the scene is already completely rendered.

4.2.4 Results

ISR Viewer originally used single pass rendering. As mentioned earlier, an experimental version was built using forward rendering. Its performance impact was considered unacceptable, with 250% higher frame times (for 3 lights). This can be attributed to the high amount of vertex data that has to be rasterized, as well as the low cost of the Blinn-Phong lighting model [2]. Because of screen space ambient occlusion (section 4.4), there is a need for a more advanced rendering approach. That is why ISR Viewer was upgraded to deferred rendering. As the lighting setup is fixed, a single pass is used to calculate the influence of all light sources. In the end the performance impact of deferred rendering is minimal, adding less than 1 millisecond to each frame.


4.3 Shadow Mapping

4.3.1 Algorithm

To give more insight into the geometry of the model, shadow mapping was added. The plan was to use it, combined with ambient occlusion, to improve depth perception. In the end, a good scene-independent lighting setup required too many shadow-casting lights; this hampered performance to the point that shadow-casting lights were disabled. Shadow mapping is now instead used to project shadows onto the inside of a cube around the model. This cube shows the orthogonal projection of the model onto its side planes.

Figure 4.3: Shadow mapping illustrated (courtesy of opengl-tutorial.org [20]).

The shadow mapping algorithm builds on the idea that light only reaches geometry that is visible from the light source; everything that is behind what the light can "see" will be in shadow. The scene is rendered from the light's view into a z-buffer, which from now on will be called the shadow map. This is a buffer that only contains depth information; writing color output is disabled to improve performance.

Figure 4.4: Shadow acne (courtesy of opengl-tutorial.org [20]).


During the main render pass, every visible point is projected into coordinates on the light's shadow map. This uses the same matrices as were used when rendering the shadow map in the first place. The projected x and y coordinates are used to sample the shadow map, and comparing the stored depth with the projected depth determines whether the point lies in shadow. After this, stripes may appear in non-shadowed areas. This artifact, caused by the limited resolution of the shadow map, is known as self-shadowing or "shadow acne" (figure 4.4). It is easily fixed by subtracting a small depth bias from the projection coordinates.
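Written compactly (the standard shadow map test, with bias b and the light's view and projection matrices V_l and P_l), a visible point p is lit when

\text{depthmap}(x, y) \ge z - b, \quad \text{where } (x, y, z) = \text{project}(P_l\, V_l\, p)

and lies in shadow otherwise.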

4.3.2 Implementation

Shadow mapping is pretty simple to implement. An off-screen depth-only buffer is created, into which the shadow map is rendered. This is then used as a texture during the lighting pass. The main difficulty with shadow mapping is the selection of a good viewport: it is important that the whole visible scene fits within the light's view, but making it too big results in worse aliasing artifacts. The decision to only project shadows on the axis-aligned shadow cube alleviates this problem, as the viewport of the shadow maps can now be based on the limits gathered at the start of the application.

4.3.3 Performance

Shadow mapping has a big performance impact, as the whole scene has to be re-rendered for every light. The dimensions of the shadow map, which can be set in the settings file, also have a big influence on performance. ISR Viewer uses the Blinn-Phong shading model, which is relatively lightweight. As such, rendering a shadow map is expected to be about as expensive as rendering with lighting enabled. For the shadow cube the scene has to be re-rendered from three different viewpoints. This results in a big hit on performance, so it is disabled by default.

4.4 Screen Space Ambient Occlusion

4.4.1 Introduction

Shadow mapping’s heavy performance impact did not allow shadows for all lights. This was accentuated by the fact that having shadows requires more lights to ensure that every part of the scene is well lit. Local shadows are still desirable as in-stent restenosis models contain a lot of small detail. Computer graphics engineers call the phenomenon of local shadows “ambient occlusion”. Phong shading uses a fixed amount of ambient light as a very rough approximation of global illumination. Global illumination is a collection of algorithms that take into account indirect lighting. The video game industry uses invisible light sources to approximate this effect. This is an impossible task for ISR Viewer, as this requires manual work for every possible scene. Ambient occlusion tries to incorporate shadows caused by close surroundings into the ambient light. Most of the ambient occlusion techniques only require screen space depth and normal information to function. This means that unlike shadow mapping, ambient occlusion has a much lower performance impact


4.4.2 Algorithm

The ambient occlusion algorithm of choice is Screen Space Ambient Occlusion (SSAO), because it is relatively easy to implement. SSAO was developed by a single engineer at Crytek for use in the computer game Crysis[19]. In the years after, it has become the standard for ambient occlusion in the video game industry. Unfortunately, the paper by Crytek does not contain sufficient information describing the technique, so the SSAO implementation in ISR Viewer is based on various sources from the internet [3][7][13].

Ambient occlusion describes to what extent a point on a surface is occluded by its surrounding geometry. To determine the occlusion factor, a hemisphere oriented along the surface normal is created at every pixel on the screen. This hemisphere is filled with randomly positioned sample points: the amount a screen pixel is occluded equals the fraction of samples in the hemisphere that are occluded. Closer points are considered to carry the most important information. To account for this, the distances of the points are distributed in such a way that most of them lie close to the center of the hemisphere. Note that the kernel of random points does not change per pixel, nor does it change over time. This ensures that the output image is both spatially and temporally stable.

Figure 4.6: Normal-orientated hemisphere (courtesy of learnopengl.com [7]).

It is clear that a higher number of samples results in a more accurate result. The computational complexity of SSAO is O(w · h · s), where w and h are the frame buffer dimensions (in pixels) and s is the number of samples, so increasing the sample count can become expensive. There is a little trick, however, that increases the effective sample count: rotating the sample kernel around the normal. If this is done using a repeating noise texture, it introduces regularity into the result. This regularity should occur at a high frequency, such that the noise can be removed with an image blur. To actually determine whether a sample is occluded, it is projected to screen space coordinates. The x and y coordinates are used to sample the depth buffer; if the depth buffer contains a higher (further) value than the sample z coordinate, the sample is considered occluded. Further work suggests tracing the complete path from the origin pixel to the sample. This enables sample points that are visible, but have an obstructed path to the pixel, to count as occluded. Such algorithms are often classified as horizon based.

4.4.3 Implementation

At the start of the application, the sample kernel (hemisphere) and noise texture are generated. The samples are generated in spherical coordinates and then converted to the Cartesian coordinate system. The angles should never lie flat on the surface, to prevent z-fighting. Z-fighting occurs when a sample vector is perpendicular to the normal of a flat surface: theoretically such a sample lies exactly on the surface, but because computers only have limited precision, the sample may be considered occluded. This is essentially the same problem as shadow acne in shadow mapping. The distance is scaled between 0.1 and 1.0 to prevent sample points from getting too close to the pixel.

import numpy as np

kernel_size = 64  # number of samples; illustrative value
kernel = np.empty((kernel_size, 3), dtype=np.float32)
for i in range(kernel_size):
    # Angles in degrees; theta stays at least 5 degrees away from the
    # surface plane to prevent z-fighting.
    theta = np.radians(np.random.uniform(-85.0, 85.0))  # polar angle from z
    phi = np.radians(np.random.uniform(0.0, 360.0))     # azimuth
    # Spherical to Cartesian, assuming a radius of 1.0 (z points out of the surface).
    kernel[i] = (np.sin(theta) * np.cos(phi),
                 np.sin(theta) * np.sin(phi),
                 np.cos(theta))
    # Linearly interpolate between 0.1 and 1.0 at position scale^2,
    # so most samples lie close to the center of the hemisphere.
    scale = i / float(kernel_size)
    kernel[i] *= 0.1 + 0.9 * scale * scale

The random noise texture contains normalized vectors with a random rotation around the z axis. The z coordinate is always 0, so a noise vector will never run parallel to the normal vector in view space. In the SSAO fragment shader, a tangent space matrix (TBN) is created from the view space normal and the random noise vector using the Gram-Schmidt process, which extracts an orthonormal basis from the normal and noise vectors. The resulting vectors form a matrix that transforms coordinates from tangent space to view space.

vec3 tangent = normalize(rvec - normal * dot(rvec, normal));
vec3 bitangent = cross(normal, tangent);
mat3 tbn = mat3(tangent, bitangent, normal);

Using the TBN matrix, the sample kernel is oriented along the surface normal in view space. The sample coordinates are then transformed to screen space using the same projection matrix as was used in the geometry pass. The G-buffer is sampled at the clip space position to get the view space depth. Besides the usual depth comparison, a range check is also executed. This check ensures that if a sample is occluded by something much closer to the camera, it won't count towards the ambient occlusion, so that the ambient occlusion is only caused by local geometry. An example of this is shown in figure 4.7.

float occlusion = 0.0;
for (int i = 0; i < KERNEL_SIZE; i++) {
    // Get sample position.
    vec3 sample = tbn * u_sampleKernel[i];
    sample = origin + sample * u_radius;
    // Project sample position.
    vec4 offset = vec4(sample, 1.0);
    // From view to clip-space.
    offset = u_projectionMatrix * offset;
    // Perspective divide.
    offset.xyz /= offset.w;
    // Transform to range 0.0 - 1.0.
    offset.xyz = offset.xyz * 0.5 + 0.5;
    // Get sample depth.
    float sampleDepth = -texture2D(u_positionDepthSampler, offset.xy).w;
    // If the scene fragment is before (smaller in z) the sample point,
    // increase occlusion.
    float rangeCheck = smoothstep(0.0, 1.0, u_radius / abs(origin.z - sampleDepth));
    occlusion += (sampleDepth >= sample.z ? 1.0 : 0.0) * rangeCheck;
}
occlusion = 1.0 - (occlusion / float(KERNEL_SIZE));

Finally, the resulting image should be blurred to remove the noise generated by the random rotation of the sample kernels. A Gaussian blur is used because its kernel is separable. This means that the 2D Gaussian kernel can be written as the multiplication of the one-dimensional Gaussian kernel oriented horizontally and vertically respectively. The convolution of the SSAO image with both one-dimensional Gaussian kernels results in the same output as a convolution with the two-dimensional kernel.


(a) Without range check (b) With range check

Figure 4.7: The effect of a range check in SSAO


G_s(x) = \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{x^2}{2s^2}\right) \quad (4.1)

G_s(x, y) = G_s(x)\, G_s(y) \quad (4.2)

G_s(x, y) = \frac{1}{2\pi s^2} \exp\left(-\frac{x^2 + y^2}{2s^2}\right) \quad (4.3)
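As a quick illustration of why separability pays off, the NumPy sketch below (illustrative, not ISR Viewer code) blurs an image with two 1D passes. For a kernel of k taps this costs O(w · h · k) per pass, instead of O(w · h · k^2) for the equivalent 2D convolution:

import numpy as np

def gaussian_1d(s, radius=3):
    """Sampled 1D kernel G_s(x), normalized so the blur preserves intensity."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    g = np.exp(-x**2 / (2.0 * s**2))
    return g / g.sum()

def blur_separable(image, s):
    """Two 1D convolutions (rows, then columns) instead of one 2D convolution."""
    g = gaussian_1d(s)
    rows = np.apply_along_axis(np.convolve, 1, image, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

# Example: smooth a noisy 64x64 occlusion image.
occlusion = np.random.rand(64, 64)
smoothed = blur_separable(occlusion, s=1.5)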

4.4.4 Results

The performance impact of SSAO is fairly low, adding about 2 milliseconds (figure 4.8) to every frame at 1080p on the test system (Appendix A). Note that the execution time of SSAO only depends on the resolution: larger models will not have any impact on SSAO performance. Furthermore, it is very predictable and does not change much over time.


CHAPTER 5

Usability

5.1 Color Mapping

5.1.1 Features

The in-stent restenosis viewer was designed for medical and computational science researchers to quickly analyse simulation results. It is important that the program is able to visualize any model attribute efficiently. Another requirement is keeping interactive frame rates, which means that advanced image based flow visualizations cannot be supported. Currently, the only supported type of flow visualization is glyphs generated by an external program like ParaView. ISR Viewer's visualization capabilities are built around attributes that are defined on the surface of the geometry. These are depicted using a color mapping. Often, only a small range of an attribute's spectrum is of interest to the user. That is why attributes are first mapped onto a range from 0 to 1 before a color is assigned to them. This mapping is performed by a color scale function; both linear and logarithmic scales are provided by default. Mapping this value to a color is done using a colormap function. In addition to the original rainbow colormap, ISR Viewer now also comes with a cool-to-warm colormap. Users may require other colormaps and scales to better represent their problems; therefore, both color scales and colormaps are now extendable. Color mapping can be enabled or disabled for every element, which can be used to make one color mapped element stand out from its environment.

Figure 5.1: Color mapping and attribute cutoffs.


To help users focus on specific regions of the models, they can now select which range of the attribute is interesting to them. Surfaces with attributes outside that range are not drawn (figure 5.1), leaving only those parts of the model that the user is interested in.

5.1.2 Implementation

The colormap, color scale and attribute cutoffs have to be combined into one user interface element. Qt4 does not provide an appropriate widget, so a custom one had to be built. A double-ended slider is used to select the cutoff range; the colormap is shown as the background of the slider, while tick marks represent the scale function. Two drop-down lists provide access to all available colormaps and scales. Extensibility is provided in the form of two JSON files, which contain the names and corresponding files of the colormaps and scales. Implementations should be provided both in Python for the GUI and in GLSL for the rendering pipeline. GLSL is the programming language used to write OpenGL shaders. Colormap files should provide a function that maps values in the range from 0 to 1 onto a color. In Python a color is represented by a QColor (part of PyQt4[12]) instance and in GLSL by a vec3. The reason for the lack of an alpha channel is that translucency is not supported: rendering translucent surfaces requires depth sorting, which is not trivial, and it would still not deliver the desired effect of translucent volumes, for which volume rendering is required. The color scale Python files should convert a value between 0 and 1 to an attribute value; the GLSL implementation should do the opposite and convert an attribute value into a value between 0 and 1.

Most readers probably have no experience with GLSL. This is not a problem, as it is a very simple language based on the familiar C/C++ syntax. To help readers get familiar with GLSL, two code samples are provided. Listing 5.1 shows the source code of the linear color scale.

float scaleValueToPos(float value, float attr_min, float attr_max) {
    return (value - attr_min) / (attr_max - attr_min);
}

Listing 5.1: Linear color scale in GLSL.

The second example (listing 5.2) contains the source code of the cool-to-warm colormap, which describes a gradient from blue to white to red. Notice how in GLSL colors are described using floating point values. This allows OpenGL to fully support high precision monitors.

vec3 posToColor(float value) {
    const vec3 start = vec3(59, 76, 192);
    const vec3 mid = vec3(220, 221, 221);
    const vec3 end = vec3(180, 4, 38);
    if (value < 0.5) {
        return mix(start, mid, value * 2) / 255.0;
    } else {
        return mix(mid, end, (value - 0.5) * 2) / 255.0;
    }
}

Listing 5.2: Cool-to-warm colormap in GLSL.
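For illustration, a possible Python counterpart of the same colormap could look as follows. The exact interface ISR Viewer expects from colormap files is not reproduced here, so the function name and signature are assumptions; only the QColor return type is given above.

from PyQt4.QtGui import QColor

def pos_to_color(value):
    """Map a value in [0, 1] to a cool-to-warm color (blue, white, red)."""
    start, mid, end = (59, 76, 192), (220, 221, 221), (180, 4, 38)

    def mix(a, b, t):
        # Linear interpolation per channel, like GLSL's mix().
        return [int(round(x + (y - x) * t)) for x, y in zip(a, b)]

    if value < 0.5:
        return QColor(*mix(start, mid, value * 2))
    return QColor(*mix(mid, end, (value - 0.5) * 2))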

5.2 Multi-touch Interaction

5.2.1 Introduction

ISR Viewer was designed to be a simple and fast exploration tool for in-stent restenosis models. User interaction plays a vital role in helping the user understand the environment. Multi-touch support was added to give users the feeling of actually touching the model. Multi-touch users are first class citizens: all interactions that are possible with a mouse and keyboard must also be available to touch users. This is important because touch users should not feel like they are being held back. Most of the interface uses buttons and drop-downs, which work out of the box with touch. Mouse gestures to control the camera, however, had to be replaced. The original translation feature translated with a fixed speed relative to the mouse movement, which gave an unnatural feeling to the movement. The new translation gesture is inspired by DabR[9]. It uses a one-finger press-and-hold gesture to grab an object; the camera then translates in such a way that the object follows the user's movement. This provides the user with the feeling that he or she is interacting with the 3D world itself. A problem with this technique is that it is possible to grab empty space. This would not be an issue in a closed room such as in previous work, but in ISR Viewer the model may not always fill the screen and the user is able to select the background. Currently, those interactions are ignored. This is not the best solution and some research could be done into providing a more graceful fallback. When a user translates the camera, the look-at point translates with it. At the end of the movement the look-at point is set to the point in 3D space at the center of the screen; when this points to the (infinite) background, the original look-at point is translated instead. This new way of translation has also been adopted for mouse users. Rotation works the same way as before: moving the mouse with the left button down rotates the camera around the look-at point. For touch this is mapped to a one-finger movement.

Zooming has also been updated. The original version of ISR Viewer supported two types of zooming: scrolling would change the field of view (FOV), while moving the mouse with the right button pressed would move the camera closer to or further from the look-at point. The field of view zoom created weird distortions when the FOV became too low (figure 5.2), so it has been removed in favor of moving the camera. This remaining type of zooming has been changed a bit. Instead of zooming towards the look-at point, the camera now translates towards the user's cursor. This maintains the cursor position in 3D space, just like with translation. At the end of the zoom operation, the look-at point is set to the center of the screen; if that point lies on the background, the original look-at point is translated instead. Touch users can zoom in the same way using a pinch gesture.

Figure 5.2: Field of view zoom distortion.

5.2.2 Implementation

Qt4 contains support for receiving touch events as well as gestures. Recognizers for pinch, swipe and (two-finger) pan gestures are provided out of the box. Combining the touch events with an OpenGL widget proved to be harder than it should have been. Qt is also capable of forwarding gesture events generated by the Windows operating system, and these events suppress the Qt gesture recognizers. Unfortunately, the Python bindings provided by PyQt4 do not contain the C bindings needed to read the native gesture events. It took numerous hours to find the correct combination and order of function calls to receive the gesture events. The alternative would have been to recompile Qt4 with native gesture events disabled, which would have led to an unwanted dependency mess.

Multiple of the previously described techniques require the world space position of what the user is pointing at. This could either be calculated using CPU ray tracing or by reading the depth buffer. CPU ray tracing is a lot of work, as the in-stent restenosis models contain very large amounts of triangles. It would also require a spatial data structure, such as an octree, to be built, which costs a lot of memory and processing time. So instead, the depth buffer is sampled and the world position is determined using the inverse view-projection matrix. Reading from the depth buffer takes about 2 milliseconds, during which CPU operations are blocked. Considering a frame has a budget of 33.3 milliseconds at 30 frames per second, this delay is acceptable.
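A minimal sketch of that unprojection step (NumPy; the function name is illustrative, and depth values are assumed to lie in OpenGL's default [0, 1] range):

import numpy as np

def screen_to_world(px, py, depth, width, height, view_proj):
    """Map a pixel position plus a sampled depth-buffer value to world space."""
    ndc = np.array([
        2.0 * px / width - 1.0,   # x to [-1, 1]
        1.0 - 2.0 * py / height,  # y to [-1, 1] (window origin is top-left)
        2.0 * depth - 1.0,        # depth to [-1, 1]
        1.0,
    ])
    world = np.linalg.inv(view_proj).dot(ndc)
    return world[:3] / world[3]   # perspective divide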


CHAPTER 6

Future Work

6.1 Culling and Sorting

Although frame rates are good, there is certainly room for further optimization. The most obvious one is culling geometry. Culling is the act of determining which meshes are visible and which are not; the performance improvement can be quite significant if a lot of meshes are culled. The most basic culling technique is back-face culling, which prevents backwards-facing triangles from being drawn. OpenGL has built-in support for back-face culling, based on whether triangles are defined clockwise or counter-clockwise. Because not all models may support it, back-face culling is currently disabled. Other culling techniques work on meshes instead of triangles. In-stent restenosis models consist of only a handful of very big meshes, so for efficient culling these meshes have to be split up into multiple smaller parts. The way they are split heavily influences the efficiency of the culling. As ISR Viewer is lightweight, it is advised to do this in a pre-processing step, since spatial partitioning algorithms are computationally intensive.

6.2 Next Generation Graphics APIs

In 2013, AMD released their proprietary Mantle graphics API. This API was designed to give game developers low-level access to graphics hardware, to help them achieve console-level efficiency on the PC. Battlefield 4 received a Mantle patch four months after release and was the first game to support it. As the game supports both Microsoft's DirectX and Mantle, it made for a fair comparison between the two APIs; in all benchmarks Mantle came out ahead, by up to 20%. The release of Mantle put public pressure on Microsoft to make DirectX more low-level, and after two years Microsoft presented their answer: DirectX 12. Simultaneously, the Khronos Group worked on their new low-level graphics API called Vulkan (formerly glNext). After numerous delays, Vulkan was finally released in February 2016. The most promoted feature of both Vulkan and DirectX 12 is the lower driver overhead. This is not very applicable to ISR Viewer, as it does not make many API calls. However, these APIs also provide lower-level access to data transfers between the CPU and GPU. Rendering, compute and data transfer commands are now separated into different queues, and it is up to the user to make sure that data is only used when it has finished transferring. This design promotes parallelism and may provide a better solution to the problem described in section 3.3.1.


CHAPTER 7

Conclusion

The ISR (In-Stent Restenosis) Viewer provides a visual interactive exploration environment for in-stent restenosis models. Compared to existing software it offers higher performance and a user interface specifically geared towards ISR models. This thesis covers the improvements made to the program in terms of performance, graphics features and usability. The start-up time is decreased by moving work to the pre-processing phase. Frame rates are improved by spreading CPU to GPU memory transfers over multiple frames. Predictions are made about scalability when larger models are loaded. The additional graphics features mainly focus on improving depth perception. Local shadows are added using SSAO, a technique that has a low performance impact and scales only with screen resolution, not geometric complexity. Direct shadows using shadow mapping were found to be too computationally expensive, so they are only used to draw a cube surrounding the model that shows its orthogonal projection. In terms of usability two big changes are made. The most important is the expansion of the color mapping feature: there is now support for multiple colormaps and scales, and additional ones can be added through the extension support. It is now also possible to hide surfaces whose attributes are not within a certain range, which greatly helps users find what they are looking for. Finally, basic multi-touch support was added and the camera interactions were improved.

With the advancements made to ISR Viewer it is now ready to be actually used, instead of being a tech demo. Compared to its main competitor ParaView, it trades features for performance. By gearing towards one specific use case, lots of unnecessary functionality can be dropped: ISR Viewer provides only the tools the user really needs. This not only makes the program run faster, but also provides a more streamlined user experience.


APPENDIX A

Experiment Hardware

CPU: Intel i7 4790K (4.4 GHz boost)
RAM: 4 x 4 GB DDR3 1600 MHz (dual channel)
GPU: AMD R9 290X 8 GB (1030 MHz core / 5500 MHz memory)
Hard drive: Seagate Barracuda 3 TB (ST3000DM001)
Solid state drive: Samsung 830 256 GB
Operating system: Windows 10 Pro


Bibliography

[1] AMD. GPU PerfStudio 3.5. http://developer.amd.com/tools-and-sdks/graphics-development/gpu-perfstudio/.

[2] Blinn, J. F. Models of light reflection for computer synthesized pictures. In ACM SIGGRAPH Computer Graphics (1977), vol. 11, ACM, pp. 192–198.

[3] Chapman, J. SSAO Tutorial. http://john-chapman-graphics.blogspot.nl/2013/01/ssao-tutorial.html.

[4] Company, T. Q. Qt4. http://www.qt.io/.

[5] Cozzi, P., and Riccio, C. OpenGL Insights. CRC press, 2012.

[6] Crassin, C., Neyret, F., Sainz, M., Green, S., and Eisemann, E. Interactive indirect illumination using voxel-based cone tracing: an insight. In ACM SIGGRAPH 2011 Talks (2011), ACM, p. 20.

[7] de Vries, J. SSAO Tutorial. http://www.learnopengl.com/#!Advanced-Lighting/SSAO.

[8] Developers, N. NumPy. http://www.numpy.org/.

[9] Edelmann, J., Schilling, A., and Fleck, S. The DabR - a multitouch system for intuitive 3D scene navigation. In 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2009 (2009), IEEE, pp. 1–4.

[10] Foundation, P. S. Python. https://www.python.org/.

[11] Foundation, P. S. Python Profilers. https://docs.python.org/2/library/profile.html.

[12] Hess, D. K., and Summerfield, M. PyQt Whitepaper. Tech. rep., Riverbank Computing Limited, 2013.

[13] iceFall Games. Know your SSAO artifacts. https://mtnphil.wordpress.com/2013/06/26/know-your-ssao-artifacts/.

[14] Kern, R. Line Profiler 1.0. https://github.com/rkern/line_profiler.

[15] Kitware. ParaView. http://www.paraview.org/.

[16] Kitware. Visualization ToolKit. http://www.vtk.org/.

[17] Microsoft. Python Tools for Visual Studio 2.2. https://microsoft.github.io/PTVS/.

[18] Microsoft. Visual Studio 2015. https://www.visualstudio.com/.

[19] Mittring, M. Finding next gen: Cryengine 2. In ACM SIGGRAPH 2007 courses (2007), ACM, pp. 97–121.


[20] opengl-tutorial.org. Shadow mapping tutorial. http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-16-shadow-mapping/.

[21] Pelletier, V. pprofile 1.8.3. https://github.com/vpelletier/pprofile.

[22] Rath, E. Coordinate Systems in OpenGL. http://www.matrix44.net/cms/notes/opengl-3d-graphics/coordinate-systems-in-opengl.

[23] Russinovich, M. RAMMap v1.5. https://technet.microsoft.com/en-us/sysinternals/rammap.aspx.

[24] The Khronos Group, Inc. OpenGL Wiki: Buffer Object Streaming. https://www.opengl.org/wiki/Buffer_Object_Streaming.

[25] The Khronos Group, Inc. OpenGL Wiki: Rendering Pipeline Overview. https://www.opengl.org/wiki/Rendering_Pipeline_Overview.

[26] Wyman, C., Hoetzlein, R., and Lefohn, A. Frustum-traced raster shadows: Revisiting irregular z-buffers. In Proceedings of the 19th Symposium on Interactive 3D Graphics and Games (2015), ACM, pp. 15–23.
