Real‐time detection of text in
natural images on mobile devices
Emiel Bon
Universiteit van Amsterdam
Software Engineering
22 February 2016
Supervisors:
Tijs van der Storm (Universiteit van Amsterdam) storm@cwi.nl
Pjotter Tommassen (Itude Mobile) p.tommassen@itude.com
Robin Puthli (Itude Mobile) r.puthli@itude.com

Abstract
The goal of this thesis is to quantify the performance benefits and accuracy trade-offs that occur when performing the Stroke Width Transform (SWT) algorithm on a mobile GPU, to see if it is possible to perform text detection in pictures of natural scenes in real time on a mobile device. We have focused on finding an architecture, programming languages and a set of libraries that are portable to the majority of current generation mobile devices. To this end we have developed a close approximation of the original SWT algorithm for execution on the mobile CPU (swtcpu), and a custom variant for execution on the mobile GPU (swtgpu). We have evaluated the accuracy of swtcpu and swtgpu using the ICDAR 2003 image database together with the precision and recall metrics. We have measured the performance of both variants by measuring the processing time for each substep in the algorithm, as well as the total processing time. We have found that swtcpu performs an analysis of a single image in 17.7 seconds on average, with a precision and recall of 0.42 and 0.57 respectively. swtgpu does the same in 3.6 seconds, with a precision and recall of 0.35 and 0.53 respectively. Furthermore, we have found that swtgpu can in theory only run on a small subset of all mobile devices due to specific non-standard functional requirements on the mobile GPU. In practice, we have only been able to run swtgpu on one device model due to implementation differences in the OpenGL ES 2.0 library across device models. In contrast, swtcpu runs on all devices. Finally, we estimate that the development effort for swtgpu has been between 5 and 10 times larger than for swtcpu. We conclude that the performance gain is insufficient to outweigh the accuracy loss, increased development time, and limited portability of the solution.
Glossary
GPU Graphics Processing Unit
Vertex (pl. vertices) A vertex is a data structure containing position data, and possibly other types of data as well (e.g. color data, normal data, texture coordinates, identifiers, a timespan etc.).
Shader A program executed on the GPU written in a specialized programming language.
Vertex shader A shader that does operations on a single vertex. E.g. in 3D games, this shader is used to transform 3-dimensional models, described in 3D world coordinates, to 2D screen coordinates.
Fragment A potential pixel. A fragment contains all the data needed to determine whether it will eventually be drawn on screen. It contains color data, but also possibly alpha, depth etc.
Fragment shader A shader that does operations on a single fragment. Sometimes called pixel shader.
Screen space The 2D coordinate system of the screen or output image, as opposed to the 3D “world” coordinate system where 3D models are defined.
Texture A 2D buffer for pixel data existing in GPU memory. A texture can have up to 4 color channels: red, green, blue and alpha (i.e. translucency). Pixel data can be represented as byte values in the range 0...255 for each color channel. OpenGL ES 2.0 extensions can also provide additional functionality where a texture’s color channels can be represented as 16 or 32 bit floating point numbers.
GPGPU General-Purpose computing on the Graphics Processing Unit; the concept of using the GPU as a highly parallel general purpose processor.
Kernel A square matrix used for image convolution. It contains weights that are applied to a pixel p and its neighbors to mix the colors and get a new color for p. It can be used to achieve blurring, gradient calculation, sharpening, dilation etc.
API Application Programming Interface
Mobile device Tablet or smartphone
Android A UNIX-based operating system created by Google, designed for and primarily used on mobile devices.
iOS A UNIX-based operating system created by Apple, designed for Apple’s mobile devices.
Open Computing Language (OpenCL) A framework for GPGPU. OpenCL makes it possible to write programs in a C-like language that can be executed on both (multi-core) CPU and GPU.
Compute Unified Device Architecture (CUDA) A proprietary framework for GPGPU by graphics card manufacturer NVIDIA. It gives the programmer access to a virtual instruction set that allows code to be executed on the GPU. By design, it is only available for a subset of NVIDIA graphics cards, which are present in a very select number of Android tablets.
Open Graphics Library (OpenGL) A language and platform independent API for rendering 2D and 3D graphics. It is typically used to interact with a graphics processing unit (GPU) to achieve hardware-accelerated rendering. The functionality of OpenGL is based on trends and advancements in graphics hardware development. Furthermore, device manufacturers can design and submit their own extensions to the OpenGL API. These extensions are maintained in a database by the Khronos Group. Because of this, OpenGL comprises a core specification and optional extensions.
OpenGL for Embedded Systems (OpenGL ES or GLES) A subset of the OpenGL API for rendering 2D and 3D computer graphics such as those used by video games, typically hardware-accelerated using a GPU. It is designed for resource-constrained systems like smartphones, tablets, video game consoles, embedded devices and PDAs.
Grayscale image A monochromatic image where each pixel has only an intensity value in the range [0...255] or [0...1], instead of the full RGB(A) spectrum.
Aspect ratio The proportional relationship between an image’s width and height.
DSL Domain Specific Language
Bounding Box The axis-aligned minimal rectangular area that encloses a shape or set of points.
1. Introduction
In 2010, a novel region-based algorithm called the Stroke Width Transform (SWT) [1] was developed at Microsoft Research for the accurate detection of text in natural scenes. Examples of text occurring in natural scenes are number plates on cars, writing on statue sockets, or text on billboards. Analyzing previously captured images with this algorithm is feasible thanks to some sophisticated image transformations that can be performed within reasonable time on a PC with a modern Central Processing Unit (CPU).
There are, however, many applications where an immediate image analysis is convenient. At present, mobile devices (like smartphones) are ubiquitous in everyday life. The majority of mobile devices have a camera, so everyone owning one can take pictures of text in natural scenes. However, these devices often have processors that are less powerful than those of a modern PC, so it is likely that the SWT algorithm cannot be performed efficiently on the CPU of a mobile device.
In this thesis, we will develop a version of the SWT algorithm that can be performed on the mobile GPU. We will quantify the performance benefits and accuracy trade-offs when performing the SWT algorithm on the mobile GPU. We will develop an architecture and a set of non-functional requirements that allow the GPU version of the algorithm to be portable to other platforms. We will describe what limitations are encountered, how they are compensated for and what concessions have been made. Furthermore, we will evaluate the chosen approach itself in terms of development effort and appropriateness.
2. Motivation
The host organization (Itude Mobile) has a client that manages the transport logistics of Unit Load Devices (ULDs). They manage thousands of these objects around the world. These objects are identified by alphanumeric codes, printed on a sticker which is applied to the hull of the object. These stickers are 9 to 12 characters in length, have different colors and fonts, and are often partially damaged due to rough transports or ad-hoc repairs. The host organization created an app for smartphones and tablets that can photograph these stickers, detect and parse the text on the stickers and immediately look up the history of the ULDs (where they are now, where they have been, their current status etc.). The app is available for the iOS and Android operating systems.
The app works well, but the image analysis is not fast enough for real-time application, i.e. analyzing a stream of images until a piece of text is found that matches the format of the code on the sticker. Right now, the user first has to take a picture of the code, and after some time (ranging from 20 seconds on an iPhone 4 to about one second on an iPhone 5S) the app shows the code as digital text and looks up relevant data for the corresponding object. When the analysis fails on the picture, a new picture has to be taken and the process is repeated.
Also, according to the specification of the SWT algorithm, the analysis has to be performed twice: once for dark text on a light background, and once for light text on a dark background. At present, the latter is not performed because this makes the running time of the algorithm twice as long, which in practice is unacceptable for many devices. The omission of this step means that many of the analyses inherently fail. Furthermore, the input picture is heavily downscaled to allow for fast detection at the cost of accuracy, which causes additional failures.
Too many experiences of failure would make users want to cut their losses beforehand, i.e. not use the image analysis at all and type in the code manually.
The desired user experience would be to just briefly hold the camera in front of the sticker until the text is successfully retrieved, thus analyzing a stream of images in real time. A more efficient implementation is desired, so the analysis is performed more quickly and more accurately.
3. Related work
Detecting Text in Natural Scenes with Stroke Width Transform [1], the paper that proposed the novel Stroke Width Transform algorithm, published in 2010. The algorithm’s accuracy is virtually unparalleled for horizontal texts, and its performance on modern PCs is very good. Our research will attempt to recreate and optimize this algorithm for real-time application on mobile devices. A more detailed description of the SWT algorithm follows in the next section.
Detecting Texts of Arbitrary Orientations in Natural Images [4], a paper that proposes some extensions for the original SWT algorithm, published in 2012. The extensions allow for greater accuracy for detection of text with arbitrary orientations. The extensions are quite complex and many factors rely on trained classifiers, making the approach unsuitable for a GPU architecture.
TranslatAR: A Mobile Augmented Reality Translator [5], an experiment funded by Nokia to allow for real-time text detection and translation in natural scenes on the Nokia N900 smartphone, published in 2011. The text detection uses a method different from SWT. It requires input from the user, can only detect one line of text in an image and relies on some very strong assumptions. Moreover, it does not achieve real-time performance and has very poor accuracy compared to SWT.
Parallel Text Detection [10], a project that attempts to perform real-time text detection in natural scenes using the Stroke Width Transform for use in robotics. The goal was to have real-time performance (analyzing 24 image frames per second) for text detection in images with 1080p resolution. They use NVIDIA’s CUDA [12] framework to leverage the GPU. They do this on a PC, which compared to mobile devices has virtually no hardware limitations. Furthermore, the CUDA framework is unique to NVIDIA GPU architectures, which are uncommon in mobile devices, as are other GPGPU approaches like OpenCL [13]. They increase the accuracy of SWT further by adding a dilation step, however this has a dramatically adverse effect on performance. They still show a good performance increase with respect to a CPU implementation (from 2 frames per second to 6 frames per second for 1080p input), but performance is still not real-time.
Text Detection on Nokia N900 Using Stroke Width Transform [6], a report made for a practical assignment that attempts to implement the Stroke Width Transform algorithm on a mobile device (the Nokia N900). Although not entirely correctly implemented, the report documents some implementation difficulties that may arise, and shows very good results for text that is relatively easy to detect, even without a perfect implementation. The implementation is not optimized for real-time performance.
None of the above works seeks a solution that uses SWT to achieve real-time performance on a mobile device, nor proposes an architecture that is suitable for cross-platform implementation.
4. Introduction to Stroke Width Transform
What follows is a short summary of the Stroke Width Transform (SWT) algorithm proposed in [1].
The key observation of the algorithm is that text is always written with some sort of pen, creating strokes of approximately constant width. The algorithm consists of 4 phases:
1. Input image pre-processing
2. Stroke Width Transform operation
3. Finding letter candidates
4. Chaining letters into words
What follows is a brief description of these steps to give the reader insight into the complexity and implementation implications. A more detailed (although high level) description and rationale can be found in [1].
4.1. Input image pre‐processing
The Stroke Width Transform operation requires 2 pieces of information for each pixel in the input image:
1. Whether the pixel is an edge pixel (i.e. does it lie on the stroke boundary of a potential letter)
2. The gradient direction for that pixel, which is assumed to be perpendicular to the stroke boundary
To achieve the former, the paper suggests using the edge detection algorithm proposed by Canny [8], often dubbed the “optimal edge detector”. This algorithm in turn also requires the gradient for each pixel, but to eliminate noise in the image these gradients have to be calculated from a blurred, grayscale version of the input image.
To summarize, this step consists of 4 subroutines:
1. Convert the input image to grayscale
2. Blur the grayscale image (Gaussian blur is the standard due to desirable properties like intensity preservation)
3. Calculate the gradient for each pixel from the blurred image
4. Create an edge map using the Canny edge detector
Each of these subroutines produces an output image of the same dimensions as the input image, and they are all retained for later steps in the algorithm (cf. sections 8.3.1 through 8.3.4 for example results).
4.2. The Stroke Width Transform operation
The SWT algorithm’s key contribution is an image operator that for each pixel in the input image determines the width of the stroke containing that pixel. The inputs to the SWT operation are the previously calculated edge map and gradient map. What follows is a condensed description of the SWT operation as described in [1].
The SWT output image has the same dimensions as the input images, and has all its values initialized to ∞. For each edge pixel p, a ray is cast in the direction of its gradient dp. We follow this ray from p until another edge pixel q is found. If the gradient dq at q is not roughly opposite to dp, or if no opposite edge pixel is found, the ray is discarded. We call the distance between p and q the stroke width. For each pixel that intersects the ray, we set its value to min(current value, stroke width).
Figure 4.2. Stroke width calculation, image taken from [1]
However, in certain cases the calculated values do not accurately reflect the actual stroke width (cf. [1]). To mitigate this, we calculate the median of all the pixels in each non-discarded ray and we follow the ray again, this time setting the value for each pixel to min(current value, median stroke width). See section 8.3.5. The Stroke Width Transform operation for example results.
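To make the two passes explicit, the update rules can be summarized as follows (notation ours; [1] accepts an opposite gradient within an angular tolerance of roughly π/6):

\[ \text{keep ray } \overline{pq} \iff \angle(d_q, -d_p) \le \pi/6 \]
\[ \text{first pass:} \quad SWT(x) \leftarrow \min\bigl(SWT(x),\ \lVert p - q \rVert\bigr) \quad \text{for every pixel } x \text{ on } \overline{pq} \]
\[ \text{second pass:} \quad SWT(x) \leftarrow \min\bigl(SWT(x),\ \operatorname{median}_{y \in \overline{pq}} SWT(y)\bigr) \]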
4.3. Finding letter candidates
The input to this phase is an image where each pixel has as its value the width of the stroke that contains that pixel. This phase groups together pixels with similar stroke widths into letter candidates. Pixels with similar stroke widths (ratio <= 3.0) are grouped together using the Connected Components algorithm [20]; each connected component receives a label, and the values of all pixels comprising the component are set to that label value. The resulting groups are called letter candidates (cf. section 8.3.6. Connected Components for example results).
Next, each letter candidate is tested against some heuristics to prune candidates that are not likely to be letters: the aspect ratio has to be between 0.1 and 10, its bounding box may not contain more than 2 bounding boxes of other letter candidates, its height has to be between 10 and 300 pixels, and the standard deviation of the stroke width values in the letter candidate may not be more than half the component’s average stroke width.
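With w and h the width and height of the candidate's bounding box, and σ and s̄ the standard deviation and mean of its stroke width values, these criteria can be summarized as:

\[ 0.1 \le \frac{w}{h} \le 10, \qquad 10 \le h \le 300\ \text{px}, \qquad \sigma \le \tfrac{1}{2}\,\bar{s}, \qquad \#\{\text{contained candidate boxes}\} \le 2 \]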
The candidates that survived are considered to be letters. For a more detailed description of the rationale behind the “magic numbers” in these heuristics, please refer to [1]. Within the scope of this thesis, we will assume they make sense, unless they turn out to produce very poor results in the optimized version of the algorithm.
4.4. Chaining letters into words
The letters are then grouped into words. Every possible combination of 2 letters is considered. They are paired together if they exhibit the following properties:
1. They have roughly the same median stroke width (ratio < 2.0)
2. The ratio between their heights is < 2.0
3. The distance between them is not more than 3 times the width of the wider one
4. Their average colors are approximately the same
Next, every pair is considered a chain. Chains are merged iff they have one letter in common, and their general direction is about the same. This process is repeated until no more chains can be merged based on these criteria. The result is a series of distinct chains of letters, which are considered to be text lines.
Finally the text lines are split into words by estimating the widths of spaces and intraword distances with a histogram.
4.5. Output
The output of the algorithm is a set of bounding boxes, indicating areas in the input image that are likely to contain text. The area in the bounding box can then be fed to a generic OCR algorithm for extraction of the digital text.
5. Introduction to the Graphics Processing Unit
A Graphics Processing Unit (GPU) is a specialized stream processor, designed to accelerate and facilitate the rendering of 3D shapes to 2D graphical images. The processing happens in several distinct stages, resembling a pipeline: each stage takes as input the result from the previous stage. This pipeline is often referred to as the rendering pipeline. The output of the pipeline is a rendered image. GPUs are present in every modern desktop computer and laptop, but also in mobile and embedded devices.
5.1. Shaders
Most modern GPUs are flexible. They contain some parts of the older fixed-function pipeline, but many parts are programmable through shaders. Shaders are small programs written in a DSL [14][15], intended for execution on the GPU instead of the CPU. Shaders are executed in specific parts of the rendering pipeline.
5.2. Rendering pipeline
The actual rendering pipeline consists of more steps than are described below, but for this thesis the following simplified view is sufficient.
Input
The rendering pipeline takes as input a set of vertices. A vertex is a data structure which can contain any kind of data. Commonly used data includes position data (x,y,z,w), color data (r,g,b,a), surface normal data and texture coordinates (u,v), but also identifiers, a timespan, etc.
Uniforms
Other types of input are possible in the form of uniforms. Uniforms are what connects the “outside world” to the shader code. They are variables (ints, floats, arrays, matrices etc.) that can be set from the application code, and remain constant throughout a single pass through the rendering pipeline. They can be accessed from any programmable part in the rendering pipeline (cf. Programmable Vertex Processor and Programmable Fragment Processor). A special type of uniform is the sampler, which allows access to texture data stored in the GPU’s memory space.
Programmable Vertex Processor
Each input vertex first enters the programmable vertex processor, where it is processed by a vertex shader, a program that operates on a single vertex. In 3D games, it is often used to transform a 3-dimensional model described in 3D world coordinates to 2D screen coordinates. The vertex shader outputs a position in screen space, and possibly other data as well, to the rasterizer. Note that the vertex shader is executed for each vertex individually; no context data with respect to the other vertices is available.
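A minimal vertex shader (a sketch with our own names, not taken from the thesis code) transforms a vertex to screen space and passes a texture coordinate on to the rasterizer:

// Vertex shader (GLSL ES 2.0), illustrative
attribute vec4 aPosition;            // vertex position in model/world coordinates
attribute vec2 aTexCoord;            // per-vertex texture coordinate
uniform mat4 uModelViewProjection;   // transforms world coordinates to clip space
varying vec2 vTexCoord;              // handed to the rasterizer for interpolation

void main() {
    vTexCoord = aTexCoord;
    gl_Position = uModelViewProjection * aPosition;
}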
Rasterizer
Depending on what kind of primitives (e.g. points, lines, triangles, quads) the graphics card is set up to draw, the rasterizer assembles one or more of these vertices into a primitive, e.g. it takes 3 consecutive vertices to define a triangle. The output data from the vertex shader is then
interpolated between the 3 vertices to get the values for each fragment inside the defined triangle. A fragment can be thought of as a “potential” pixel. It contains the same data as the vertices that define the primitive it is contained in, only interpolated to represent the values in between the vertices.
Programmable Fragment Processor
Each of these fragments enters the fragment shader, a program that operates on a single fragment. The fragment shader has to output a color value, which is then written to the frame buffer, the output image. This output value can be calculated from the data in the fragment, retrieved directly from a texture image, or a combination of the two. What is important to note is that the location to which the output value is written is determined by the rasterizer and cannot be altered by the fragment shader. Finally the output color is blended (according to some blending function, sometimes programmable) with the color that was already in the frame buffer.
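A minimal fragment shader (again with illustrative names of our own) that combines interpolated vertex data with a texture lookup via a sampler uniform; note that it can only decide the color of its output, not the location:

// Fragment shader (GLSL ES 2.0), illustrative
precision mediump float;
uniform sampler2D uTexture;   // texture image uploaded by the application
varying vec2 vTexCoord;       // texture coordinate, interpolated by the rasterizer
varying vec4 vColor;          // per-vertex color, interpolated by the rasterizer

void main() {
    // The output location is fixed by the rasterizer; we only compute the color.
    gl_FragColor = vColor * texture2D(uTexture, vTexCoord);
}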
Final thoughts
At first glance, it seems as if this way of processing data is only applicable for graphics rendering. This is because the GPU has been designed mainly for this purpose. Implementing operations related to graphics rendering becomes much easier, because the programmer no longer has to think about rasterizing primitives and interpolation of values.
5.3. Architecture
Client‐server
When using OpenGL, the CPU and GPU interact via a client-server architecture. The GPU generally has its own memory space, so any input data has to be uploaded to the GPU, and output data has to be downloaded. Calls and queries can be made to the GPU from the CPU, which executes these commands asynchronously, but in order.
Single Instruction, Multiple Data
The power of the GPU lies in its Single Instruction, Multiple Data (SIMD) architecture. It is optimized to do the same operations on many data elements (vertices and fragments) simultaneously. Furthermore, on a smaller scale it is also optimized for doing one operation on up to 4 data elements simultaneously (vectorization). For instance, in graphics programming it is common to take the dot product of two 4-component vectors. On the GPU this is often a single instruction due to the vectorization. This functionality is also present in most modern CPUs, albeit somewhat difficult to use, less performant and often not portable to other CPU models. Furthermore, the GPU also has built-in, hardware-accelerated functions for common graphics operations, such as value interpolation, linear sampling, taking the dot product of two vectors, and matrix multiplication, which can automatically make use of the vectorization functionality.
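In GLSL these operations are expressed directly on vector types, and on most GPUs each line below maps to one or very few hardware instructions (illustrative snippet):

// GLSL ES 2.0: vectorized operations on up to 4 components at once (illustrative)
void vectorExamples() {
    vec4 a = vec4(1.0, 2.0, 3.0, 4.0);
    vec4 b = vec4(0.5);
    vec4 sum = a + b;                  // component-wise addition of 4 elements at once
    float d = dot(a, b);               // dot product of two 4-component vectors
    vec4 blended = mix(a, b, 0.25);    // hardware-accelerated linear interpolation
    vec3 n = cross(a.xyz, b.xyz);      // cross product (defined for 3-component vectors)
}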
Limitations
Unfortunately the architecture of the GPU also has some significant limitations.
Limited debugging
GPUs do not support interrupts or exceptions [11], nor do they support basic debugging functionality like breakpoints or printing to standard out. This makes debugging significantly harder, especially for shader code.
Limited dynamic branching
The SIMD architecture can entail a performance penalty for dynamic branching (e.g. if/else statements). For example, the GPU processes each vertex in its own thread. GPUs combine threads into groups and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge, e.g. 1 thread in a 4-thread group may take the branch while the other 3 may not. Now the group is split into two groups of sizes 1 and 3. These newly formed groups will run inefficiently: the 1-thread group will run at 25% efficiency and the 3-thread group will run at 75% efficiency. One can imagine that if a GPU continues to encounter nested branches, its efficiency becomes very low (paraphrased from [11]).
No inter‐thread communication
There is no interthread communication [11], and therefore no shared or global data. Vertices do not know of each other's existence: they have no way of querying the values of other vertices, nor do they know what primitive shape they are part of. Fragments also do not know of each other's existence.
Only pixels as output
Output is limited to the output of the graphics pipeline, which means the GPU will always output data in a texture image. Textures are comprised of pixels, which in turn are comprised of up to 4 numeric elements indicating RGB light intensity and an alpha (i.e. translucency) value. These elements are usually bytes (8-bit integers in the range [0..255]) or floating point numbers (16- or 32-bit real numbers). For our solution, all data has to be encoded into the pixel’s color channels.
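For example, the ray-casting step in section 8.3.5.1 has to output a 2D position per pixel. A sketch of how such non-color data can be packed into the color channels is shown below (helper names are ours; with a floating-point texture extension the values keep full precision, with plain byte textures they are quantized to 256 levels):

// GLSL ES 2.0: storing non-color data in a texture's color channels (illustrative)
precision mediump float;
uniform sampler2D uPreviousPass;  // texture written by an earlier pass
uniform vec2 uImageSize;          // image width and height in pixels
varying vec2 vTexCoord;

// Encode a pixel position into the red and green channels, normalized to [0,1].
vec4 encodePosition(vec2 positionInPixels) {
    return vec4(positionInPixels / uImageSize, 0.0, 1.0);
}

// Decode a position stored by a previous pass.
vec2 decodePosition(vec4 texel) {
    return texel.rg * uImageSize;
}

void main() {
    vec2 pos = decodePosition(texture2D(uPreviousPass, vTexCoord));
    gl_FragColor = encodePosition(pos);  // round trip, for illustration only
}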
6. Problem analysis
The host organization has spent a significant amount of resources on increasing the efficiency of the image analysis on the mobile CPU, and has come to the conclusion that its efficiency cannot be further improved. However, the image analysis is still not fast enough for real-time application.
6.1. Proposed solution
Adapt some or all steps of the SWT algorithm so that they can be performed on the mobile GPU, increasing performance by using the GPU’s hardware-accelerated image processing functionality and inherent parallelism.
6.2. Rationale
Where the CPU is good at doing one (or a few) tasks very fast, the GPU is good at doing a lot of simple tasks slowly, but in parallel and in large quantities. The GPU is optimized for tasks related to 2D and 3D graphics rendering. Since the SWT algorithm directly transforms and analyzes pixel data (as opposed to training a classifier), a promising way to increase the algorithm’s efficiency is to adapt the majority of the algorithm’s processing steps for execution on the GPU.
6.3. Difficulties
It is desirable that the solution presented in this thesis can be used on a significant majority of existing mobile devices, especially for the host organization, which provides its current solution for both the Android and iOS platforms. Therefore we have to target different platforms simultaneously, which have varying native programming languages and APIs.
Communication with the GPU is costly. Suppose we have three necessarily consecutive subroutines in the algorithm: A, B and C. A and C can be adapted to work efficiently on the GPU, but B cannot. This results in a situation where first A has to be performed on the GPU, the results have to be downloaded for B to be performed on the CPU, and the results from B have to be uploaded again to the GPU as input for C. Downloading and uploading are among the slowest operations when communicating with the GPU, so the solution can only be efficient if such transfers are kept to a minimum.
We also have to cope with the enormous device diversity. The specifications of the CPU, GPU and RAM in these devices vary greatly, but we still want to draw conclusions about the relative performance. In particular, the specifications of different GPU models vary greatly, with some supporting only a subset of the needed functionality.
There are only a very limited number of available GPGPU APIs for mobile platforms. The popular GPGPU frameworks are not available for mobile devices, and other frameworks have undesirable properties (like limited portability).
There are various parts of the SWT algorithm that cannot be mapped to the GPU both efficiently and 100% accurately. A balance will have to be found between accuracy and efficiency. The trade-offs will be discussed and quantified.
[1] describes the SWT algorithm on a high level. The values of most parameters are mentioned, but not all of them (e.g. the lower and upper threshold for the edge detection algorithm). Also, not all algorithms that were used are discussed; e.g. the algorithm proposed to do the chaining of letter candidates is a very naive one (O(n^2) complexity), used only to convey the desired behavior. Finally, no open source implementation is made available. The experiment performed in [1] is therefore difficult to repeat, and it is possible we cannot achieve the same results.
7. Research method
7.1. Experiment
We want to assess the feasibility of a real-time implementation of the SWT algorithm on current generation mobile devices. We will develop two distinct versions of the SWT algorithm:
● swtcpu will mimic the SWT algorithm proposed in [1] to the best of our abilities. It will run fully on the CPU of the mobile device, making use of the image processing functions present in the highly optimized and portable OpenCV library. This will serve as a reference for the desired accuracy of the SWT algorithm and as a baseline for the algorithm’s performance.
● swtgpu will run on the GPU of the mobile device for the most part. Every required image processing function will be implemented in shader code for execution on the GPU. Parts that cannot be efficiently performed on the GPU will be performed on the CPU.
For both algorithms, the accuracy and performance will be measured and compared for a small set of devices.
7.2. Input
Both algorithms will run on the same input: images taken from the ICDAR 2003 database. This is the same dataset used in [1], which allows us to compare our results with [1]. The ICDAR 2003 database contains images of natural scenes containing text, which are annotated with the location and size of bounding boxes that enclose the text.
7.3. Accuracy measurement
The accuracy will be measured by using the same metrics as in [1]: precision and recall.
The precision metric in [1] represents the number of correctly retrieved areas of text as a fraction of all retrieved areas of text, i.e. precision = true positives / (true positives + false positives).
The recall metric in [1] represents the number of correctly retrieved areas of text as a fraction of all actual areas of text, i.e. recall = true positives / (true positives + false negatives).
Finally, these 2 metrics are combined into the f-measure. The f-measure is defined as the harmonic mean of precision and recall: f = 2 ∙ (precision ∙ recall) / (precision + recall). This metric serves as an overall comparative accuracy measure of text detection algorithms.
The precision and recall are calculated for each individual image analysis, and averaged for the final result. The f-measure is then calculated for this end result.
To compare the retrieved regions of text in our algorithm with the ground truth regions provided in the ICDAR database, we use the same approach as in [1]. For each ground truth region, we check which of the retrieved regions has the most overlap with the ground truth region. Instead of recording a match as “successfully retrieved”, i.e. “1”, we record its percentage of overlap with the ground truth region, so we get a sliding scale of success instead of a binary classification. This unfortunately gives somewhat pessimistic results in our case (cf. section 10. Discussion), but does allow us to compare our algorithm’s accuracy with [1].
7.4. Performance measurement
For each individual image analysis, we measure the processing time of each filter (cf. section
8.2. Architecture), for both swtcpu and swtgpu. The processing time is defined as the time in seconds between the start and completion of the filter. We also measure the total time in seconds the algorithm takes to do an individual analysis.
Finally, these measurements are averaged for all performed analyses, resulting in an average processing time for each filter, and an average total running time of the swtcpu and swtgpu algorithms.
7.5. Platform, API and programming language selection
In selecting the APIs and programming languages to be used, the following non-functional requirements were leading. The implementation must:
1. Be relevant for the vast majority of smartphones and tablets.
2. Be cross-platform so it can be easily adapted for use on any mobile device.
3. Have optimal performance for the GPU implementation.
7.5.1. Programming language
The programming language chosen for the implementation of swtcpu and swtgpu is C++. The rationale for this choice is threefold. First and foremost, C++ is a cross-platform language that has compilers for both iOS and Android. Second, OpenGL and OpenCV (cf. section 7.5.2 Libraries) are cross-platform C/C++ libraries, so using C++ has the benefit that no wrapper around these libraries has to be used. Lastly, optimal performance is very important for this implementation, and C++ has many ways to approach optimal execution of the code. A downside to C++, however, is its complexity and strictness with regard to type safety; generally speaking, using C++ implies a longer development time and harder to maintain code than using higher level languages like Java (cf. section 12. Conclusion).
7.5.2. Libraries
There are 2 important libraries used in the implementation of swtcpu and swtgpu.
OpenCV
For the implementation of swtcpu, the open source computer vision library OpenCV [16] is used. This library provides many common image processing functions and data structures, and focuses on accuracy, high performance and portability. At the time of writing, this library leverages only the CPU when used on mobile devices.
OpenGL ES 2.0
For the implementation of swtgpu, the cross-platform graphics library for embedded systems OpenGL ES 2.0 is used. OpenGL (ES) is not actually a library but a specification, which every GPU manufacturer implements for their GPU model. OpenGL ES version 2.0 is the most widely supported API to make use of the GPU in mobile devices.6 However, for this solution the core specification is too limited (cf. section 9.3. Limitations). The largest downside to using OpenGL ES 2.0 is that the algorithm and its input and output have to be tailored to the OpenGL ES 2.0 rendering pipeline, which can also impede optimal performance.
Rejected alternatives
There are several General Purpose GPU (GPGPU) frameworks available that give access to the GPU as a general purpose processor. Well known examples are the Open Computing Language (OpenCL) [13] and NVIDIA’s Compute Unified Device Architecture (CUDA) [12]. However, we cannot use these frameworks in this thesis, for the simple reason that at the time of writing, OpenCL is not supported by the two most popular mobile operating systems: Google’s Android and Apple’s iOS.7 The CUDA framework is for NVIDIA GPUs only, which are found in a very select number of Android tablets, and only the very latest iterations of the GPU model support CUDA.
Specific mobile GPGPU frameworks are Google’s RenderScript [17] and Apple’s Metal [18].
RenderScript specifies a language for parallel processing that can be executed on both the mobile GPU and the multi-core CPU. The downside is that there is little guarantee that the code will actually be executed on the GPU, especially on older devices. Metal is an attempt to allow low-overhead access to the GPU for graphics rendering and parallel computation. The major disadvantage of both frameworks is that they are not cross-platform, i.e. they only work on their own respective platform, which violates one of our non-functional requirements.
6 Other APIs / libraries exist as well (e.g. Microsoft’s DirectX or OpenGL ES 3.1 with compute shader support), but are less common for mobile devices.
7 Although OpenCL is supported by most GPUs, even the ones in Android and iOS devices, it is not exposed by the OS.
7.5.3. Platform
For feasibility reasons, this thesis will implement swtgpu only for the family of devices created by Apple. The reason is that support for specific OpenGL extensions is incremental in Apple’s mobile devices, i.e. once a device was released with support for a specific OpenGL extension, all newer devices also support this functionality, with increased performance. This makes it easier to draw conclusions about (future) performance of the algorithm.
However, we realize that Apple has a relatively small market share when looking at the total market of mobile devices. Therefore we have chosen a programming language and a set of libraries / APIs that are also available on other devices (like Google’s Android), to make the solution and its results more general.
8. Implementation
8.1. Programming languages and source lines of code
For feasibility reasons, the proposed algorithms are only implemented for the family of mobile devices running the iOS operating system. For the proposed solution to be relevant to other platforms as well, we use platform independent programming languages and APIs. The proposed algorithm was written in C++ and GLSL, with a wrapper of Objective-C and Objective-C++ to use the code on iOS.
The implementation consists of 7 parts, comprising roughly 5,300 source lines of code (SLOC):
1. An object-oriented wrapper around OpenGL, to improve readability and maintainability (~1200 lines of code)
2. 59 vertex and fragment shaders (~700 lines of code, cf. Appendix A-E)
3. Filter implementations for swtcpu (~800 lines of code)
4. Filter implementations for swtgpu (~1200 lines of code)
5. Code common to swtcpu and swtgpu to chain letters into words (~650 lines of code)
6. Application code (~400 lines of code)
7. Platform specific code to execute the algorithm as an iOS app (~400 lines of code)
Figure 8.1. SWT application component stack
The implementation can be used on any platform that has a C++ compiler and supports OpenGL ES 2.0 (this also includes desktop PCs). The only exception is the small amount of platform specific code.
8.2. Architecture
We chose to implement the image processing as a pipeline of filters. Each filter represents an image processing step, possibly consisting of other sub-filters. It has as input one or more textures and zero or more parameters. As output it has zero or one texture. The output texture can be used as input for the next filter in the pipeline.
Figure 8.2.1 Filter pipeline architecture
Each filter in swtgpu has an initialization phase where it sets appropriate states on the GPU. Next, it instructs the GPU to execute a sequence of image operations. Finally, it waits for the GPU to finish with the result. For swtcpu the initialization phase is not used; it simply performs the sequence of image operations.
Figure 8.2.2 Sequence of operations in a filter
The filters monitor their own performance by starting a timer just before the initialization phase and stopping the timer directly after the result is ready. The processing time of the filter is then expressed as the time in seconds between the start and end of the timer.
Figure 8.2.3 describes the architecture of the whole swtgpu pipeline from input image to a set of text regions containing individual letters to be used for the chaining phase (cf. section 4).
8.3. Filter implementations
This section will give a detailed description of how the image processing steps (i.e. filters ) in the algorithm were implemented for swtgpu.
8.3.1. Grayscale filter
Input
1. The original image
2. Four vertices defining a quad, which spans the entire image
Output
A grayscale version of the input image
Shaders
Vertex shader: Normal.vs (Appendix A)
Fragment shader: Grayscale.fs (Appendix B)
Figure 8.3.1 Grayscale CPU (l) and GPU (r)
Explanation
A single pass over the image is performed, calculating a single intensity value for each pixel as the weighted sum of the pixel’s RGB values with weights (0.2126, 0.7152, 0.0722) respectively.
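The corresponding fragment shader essentially reduces to a single dot product; the sketch below uses our own names, the actual shader is Grayscale.fs in Appendix B:

// Grayscale fragment shader (GLSL ES 2.0), sketch
precision mediump float;
uniform sampler2D uImage;   // the original input image
varying vec2 vTexCoord;

void main() {
    vec3 rgb = texture2D(uImage, vTexCoord).rgb;
    // Weighted sum of the R, G and B channels with the weights given above.
    float intensity = dot(rgb, vec3(0.2126, 0.7152, 0.0722));
    gl_FragColor = vec4(vec3(intensity), 1.0);
}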
8.3.2. Gaussian blur
Input
1. The grayscale image
Output
A blurred version of the input image
Figure 8.3.2 Gaussian Blur CPU (l) and GPU (r).8
Shaders
Vertex shader: GaussianBlur.vs (Appendix B) Fragment shader: GaussianBlur.fs (Appendix B)
Explanation
The gray image needs to be blurred to lessen the response of the subsequent edge detection filter, i.e. remove noise that could cause false positives. A 5x5 Gaussian blur kernel is first calculated. This can be done using tools available online or by using a precalculated Gaussian kernel. We make use of the Gaussian kernel’s separability property9 and we use an optimization technique described in [21] to leverage the GPU’s hardware-accelerated linear interpolation to get the same result as a 5x5 kernel using only a 3x3 kernel.
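The sketch below shows one (horizontal) pass of the separable blur. Because a texture fetch placed between two texel centers lets the hardware's linear filtering average both neighbors, three fetches effectively cover five taps; the weights and offsets here are illustrative, not the exact values of GaussianBlur.fs (Appendix B). The vertical pass is identical with the offset applied to y.

// Horizontal pass of a separable Gaussian blur (GLSL ES 2.0), illustrative weights
precision mediump float;
uniform sampler2D uImage;     // grayscale input image
uniform vec2 uTexelSize;      // (1/width, 1/height)
varying vec2 vTexCoord;

void main() {
    // One center fetch plus two offset fetches. Each offset fetch lies between two
    // texel centers, so linear filtering averages those two neighbors in hardware.
    vec2 offset = vec2(1.5 * uTexelSize.x, 0.0);
    vec4 color = texture2D(uImage, vTexCoord) * 0.4;
    color += texture2D(uImage, vTexCoord + offset) * 0.3;
    color += texture2D(uImage, vTexCoord - offset) * 0.3;
    gl_FragColor = color;
}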
8 The CPU variant requires less blurring because OpenCV’s Canny operation blurs its input itself.
9 An NxN kernel that has the separability property can be rewritten as the product of an Nx1 and a 1xN kernel. Instead of having to perform 5 x 5 = 25 operations per pixel (for N = 5) in a single pass, we can perform two consecutive passes of only 5 operations per pixel each, resulting in a total of only 2 x 5 = 10 operations per pixel, plus the overhead of an extra pass.
8.3.3. Sobel gradients calculation
Input
1. A grayscale image (in our case the blurred grayscale image)
2. Four vertices defining a quad, which spans the entire image
Output
An image containing the 2D gradients (x,y), mapped to the red and green channels of the image, and the gradient’s length and direction in polar coordinates, mapped to the blue and alpha channels.
Figure 8.3.3 Gradients CPU (l) and GPU (r)
Shaders
Horizontal pass
Vertex shader: Sobel1.vs (Appendix B) Fragment shader: Sobel1.fs (Appendix B) Vertical pass
Vertex shader: Sobel2.vs (Appendix B) Fragment shader: Sobel2.fs (Appendix B)
Explanation
In this step we calculate a gradient map. We use a 3x3 Sobel kernel, which has to be applied 2 times: once for the horizontal direction and once for the vertical direction. This kernel also has the separability property9, so the whole operator can be applied in 4 passes. In practice, doing it in 4 passes turned out to be as slow as doing it in one big pass, probably because the kernel is so small that the overhead of doing so many passes becomes significant. We have combined the horizontal and vertical parts of both directions into one, resulting in two passes that have 3 texture fetches each, the same as the Gaussian blur described above. In addition to the 2-dimensional gradient, we also output the gradient’s direction and length, which will be used by the edge detection operator.
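To show how the four output channels are filled, the sketch below computes the full 3x3 Sobel operator in a single pass; the optimized two-pass version corresponds to Sobel1/Sobel2 in Appendix B. It assumes a render target that can hold negative values (cf. the floating-point texture extensions mentioned in the Glossary); names and packing are illustrative.

// Sobel gradient fragment shader (GLSL ES 2.0), single-pass sketch
precision mediump float;
uniform sampler2D uGray;    // blurred grayscale image
uniform vec2 uTexelSize;    // (1/width, 1/height)
varying vec2 vTexCoord;

float intensityAt(float dx, float dy) {
    return texture2D(uGray, vTexCoord + vec2(dx, dy) * uTexelSize).r;
}

void main() {
    // 3x3 Sobel kernels for the horizontal (gx) and vertical (gy) derivatives.
    float gx = -intensityAt(-1.0, -1.0) - 2.0 * intensityAt(-1.0, 0.0) - intensityAt(-1.0, 1.0)
             +  intensityAt( 1.0, -1.0) + 2.0 * intensityAt( 1.0, 0.0) + intensityAt( 1.0, 1.0);
    float gy = -intensityAt(-1.0, -1.0) - 2.0 * intensityAt(0.0, -1.0) - intensityAt(1.0, -1.0)
             +  intensityAt(-1.0,  1.0) + 2.0 * intensityAt(0.0,  1.0) + intensityAt(1.0,  1.0);
    float magnitude = length(vec2(gx, gy));
    float direction = atan(gy, gx);
    // Pack (gx, gy) into red/green and the polar form into blue/alpha.
    gl_FragColor = vec4(gx, gy, magnitude, direction);
}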
8.3.4. Canny edge detection
Input
1. A gradient map
2. An upper intensity threshold argument for the Canny operator
3. Four vertices defining a quad, which spans the entire image
Output
A binary image, where a black pixel represents a pixel not located on an edge, and a white pixel represents a pixel located on an edge.
Figure 8.3.4 Canny edge detector on CPU (l) and GPU (r)
Shaders
Vertex shader: Normal.vs (Appendix A) Fragment shader: Canny.fs (Appendix B)
Explanation
We are now going to calculate an edge map, as required by the SWT algorithm [1]. The edge detection algorithm described by Canny cannot be adapted to work efficiently on the GPU due to the sequential nature of the hysteresis step [8]. We have chosen to use an approximation where this step is simply omitted. A solution described in [6] used an approximation of the hysteresis, but we have found that this approximation affected performance significantly, while its effect could also be achieved by simply using a lower “upper threshold”. The “lower threshold” used to classify weak pixels therefore has no more meaning, and is also omitted in our solution. The resulting algorithm produces edge maps that look similar to Canny’s (edges of 1 pixel wide, which is what we need for the SWT algorithm), but with more false positives due to the lower “upper threshold” and many false negatives due to the omission of the hysteresis step.
To estimate the threshold value, we use an approach suggested in [22]. We first calculate a histogram of the blurred gray image, placing each pixel’s intensity value in a bin in the [0, 255] range. We then download this histogram from the GPU. Although downloading from the GPU is generally slow, this is acceptable here due to the small amount of data that is transferred (exactly 256 16-bit float values). We then find the median Hm of this histogram on the CPU, and set the upper threshold of the Canny filter to a ratio of this median. The values suggested in [22] do not produce good results in swtgpu’s Canny filter, due to the fact that the lower threshold and hysteresis step are omitted in this implementation, so we have estimated a different value by a simple iterative process using several input images to see what works well. The upper threshold value we have found to work well is 0.4 ∙ Hm.
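Our simplified edge test thus reduces to non-maximum suppression along the gradient direction plus the single upper threshold; a condensed sketch with our own structure and names (cf. Canny.fs in Appendix B):

// Simplified Canny fragment shader (GLSL ES 2.0): no hysteresis, single threshold
precision mediump float;
uniform sampler2D uGradients;   // Sobel output: (gx, gy, magnitude, direction)
uniform vec2 uTexelSize;        // (1/width, 1/height)
uniform float uUpperThreshold;  // 0.4 * median of the intensity histogram
varying vec2 vTexCoord;

void main() {
    vec4 g = texture2D(uGradients, vTexCoord);
    float magnitude = g.b;
    // Sample the gradient magnitude one texel along and against the gradient direction.
    vec2 dir = normalize(g.rg) * uTexelSize;
    float forward = texture2D(uGradients, vTexCoord + dir).b;
    float backward = texture2D(uGradients, vTexCoord - dir).b;
    // A pixel is an edge pixel iff it is a local maximum along the gradient
    // and its magnitude exceeds the single (upper) threshold.
    bool isEdge = magnitude >= forward && magnitude >= backward && magnitude > uUpperThreshold;
    gl_FragColor = isEdge ? vec4(1.0) : vec4(0.0);
}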
8.3.5. The Stroke Width Transform operation
The Stroke Width Transform operation on the GPU consists of 4+1 consecutive steps:
1. Ray casting
2. Ray value writing
3. Ray averaging
4. Average ray value writing
Steps 1 through 4 are performed 2 times: once in the gradient direction and once in the opposite gradient direction. Because the gradients are calculated from intensity differences, 2 executions are necessary to account for dark text on a light background and light text on a dark background. An additional step is done to create vertices for the line primitives needed for the second and fourth step (cf. figure 8.2.3: Prepare ray lines). It has to be done only once, but it requires us to download the edge map from the GPU. Since the edge map contains what are essentially boolean values, we reduce the amount of data transferred by only downloading the red color channel, using the smallest data type (bytes).
8.3.5.1. Ray casting
Input
1. The edge map
2. The gradient map
3. A boolean parameter indicating whether we are looking for dark text on a light background, or light text on a dark background. This boolean is used as an indication to look for an opposite edge in the direction of the gradient, or in the opposite direction
4. Four vertices defining a quad, which spans the entire image
Output
An image where each pixel contains the location of the opposite edge pixel, or (0,0) if no opposite edge pixel was found.
Shaders
Fragment shader: CastRays.fs (Appendix C)
Explanation
For each pixel, we record its position in a variable (pos0) and look up the gradient dp at this pixel’s position. We want to find the position of the opposite edge pixel (pos1), i.e. the first edge pixel we encounter when we walk from pos0 in the direction of the gradient.
We start walking from pos0 in the direction of the gradient. On the CPU we keep walking until an opposite edge pixel is found, but on the GPU we walk for a constant number of steps to avoid dynamic branching. We keep walking for a maximum of 50 pixels (the original paper uses 300), while taking steps of 0.2 pixels. This results in 50 / 0.2 = 250 steps for each edge pixel. As long as we have not yet found the opposite edge pixel, we keep updating pos1 during each step. After each step, we sample the pixel at pos1 and check if it is an edge pixel. If so, we mark that the edge pixel has been found to stop updating pos1 in subsequent steps, but we (unfortunately) have to keep walking until we have taken 250 steps.
Finally, the pixel at pos1 is evaluated according to the same criteria as in [1]:
● Is there an edge pixel at pos1?
● Is pos1 still within the image’s boundaries?
● Is the gradient at pos1 roughly opposite to the gradient at pos0?
If so, we write the position of the opposite edge pixel (pos1) to the pixel located at pos0 in the output image. The result of this step is a map where each pixel contains the location of the opposite edge pixel, or (0,0) if no valid opposite edge pixel was found.
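A condensed sketch of this fixed-step ray march is shown below (cf. CastRays.fs, Appendix C; the validity checks and the exact encoding of positions are simplified, and names are ours):

// Ray casting fragment shader (GLSL ES 2.0), condensed sketch
precision mediump float;
uniform sampler2D uEdges;       // binary edge map
uniform sampler2D uGradients;   // (gx, gy, magnitude, direction)
uniform vec2 uTexelSize;        // (1/width, 1/height)
uniform float uDirection;       // +1.0: with the gradient, -1.0: against the gradient
varying vec2 vTexCoord;

void main() {
    vec2 pos0 = vTexCoord;
    if (texture2D(uEdges, pos0).r < 0.5) {
        gl_FragColor = vec4(0.0);   // not an edge pixel: no ray is cast
        return;
    }
    vec2 dir = normalize(texture2D(uGradients, pos0).rg) * uDirection;
    vec2 pos1 = pos0;
    bool found = false;
    // Constant iteration count: at most 50 pixels in steps of 0.2 pixels = 250 steps.
    for (int i = 0; i < 250; i++) {
        if (!found) {
            pos1 += dir * 0.2 * uTexelSize;
            if (texture2D(uEdges, pos1).r > 0.5) {
                found = true;   // stop updating pos1, but keep looping
            }
        }
    }
    // Checks on the opposite gradient and the image boundary are omitted in this sketch.
    gl_FragColor = found ? vec4(pos1, 0.0, 1.0) : vec4(0.0);
}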
Notes
This step is inherently inefficient on the GPU due to the large number of sequential operations and texture fetches; however, the optimizations we propose yield surprisingly good performance (cf. 9.1. Performance).
8.3.5.2 Stroke width value writing
Input
1. An array of lines, one for each edge pixel, each defined as a pair of vertices p and q, with their positions set to the x,y-coordinates of the edge pixel and the z-coordinate set to either 0 or 1, indicating whether the vertex is p (the start vertex of the line) or q (the end vertex of the line).
2. An image where each pixel contains the position of the opposite edge pixel, or nothing. Obtained in the previous step.
Output
An image where each pixel contains the minimum of the widths of the strokes that contain that pixel.
Pseudocode (vertex shader)
pos0 = vertexPosition
pos1 = lookup(oppositePositions, pos0)
dist = distance(pos0, pos1)
strokeWidth = if dist = 0.0 then ∞ else dist
outputPosition = if vertexPosition.z = 0 then pos0 else pos1
Shaders
Vertex shader: WriteRays.vs (Appendix C) Fragment shader: WriteRays.fs (Appendix C)
Explanation
First, we set up the GPU to draw lines. For each line (i.e. 2 consecutive vertices) that is processed, both its start vertex p and end vertex q enter the vertex shader independently, but in order. Both their positions are initially set to the coordinates of the same edge pixel. We first set pos0 to the current vertex position, and we look up pos1 by reading the value in the texture containing the opposite edge positions for each pixel (the result of the previous step). We then calculate the distance between pos0 and pos1, which is the stroke width. Finally, only when the vertex is q (i.e. the z-coordinate is 1), we actually move it to the opposite edge position, which allows the rasterizer to generate a line from p (on the edge pixel) to q (on the opposite edge pixel).
All pixels on the generated line pass through the fragment shader, and we pass it the stroke width from the vertex shader. This fragment shader simply writes this value to every pixel on the line.
8.3.5.3. Ray averaging
Input
● An image where each pixel contains the position of the opposite edge pixel, or nothing. Obtained in an earlier step.
● Output from the ray value writing step.
Output
An image where each edge pixel contains the minimum of all stroke width averages of lines that contain that pixel.
Shaders
Vertex shader: Normal.vs (Appendix A)
Fragment shader: AverageRays.fs (Appendix C)
Explanation
The original algorithm casts the rays again and determines the median of the set of pixels in the output of the ray writing step that intersect the ray. Next, min(current value, median) is written to every pixel on the ray. This is done to eliminate some incorrect cases (cf. section 4.2).
We take a similar approach here; however, the “rays” (actually lines) were cast by the line drawing algorithm in the rendering pipeline, by drawing a line primitive between each edge pixel and the opposite edge pixel. We have to retrieve all pixels that were written using this algorithm, but unfortunately the line drawing algorithm is not specified explicitly by the OpenGL standard. Instead, a set of rules is specified to which the algorithm must conform. Bresenham’s line algorithm [23], one of the most used and efficient line drawing algorithms, conforms to these rules. Furthermore, calculating the median of a set of values requires us to first sort a list of values. Due to hardware limitations (e.g. we cannot allocate a dynamically sized array), this is not possible on the GPU.
We propose the following implementation. For each edge pixel, we start walking in the direction of the opposite edge pixel using Bresenham’s line algorithm. We add the values of all pixels identified by the algorithm together, and finally divide it by the number of pixels, thus calculating the average instead of the median.
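The sketch below conveys the idea; for brevity it walks the line with fixed-size interpolation steps rather than exact Bresenham rasterization, and it assumes the opposite-edge positions are stored normalized to [0,1] (names are ours, cf. AverageRays.fs in Appendix C):

// Ray averaging fragment shader (GLSL ES 2.0), simplified sketch
precision mediump float;
uniform sampler2D uOppositePositions; // per-pixel position of the opposite edge pixel
uniform sampler2D uStrokeWidths;      // output of the ray value writing step
uniform vec2 uImageSize;              // image width and height in pixels
varying vec2 vTexCoord;

void main() {
    vec2 pos0 = vTexCoord;
    vec2 pos1 = texture2D(uOppositePositions, pos0).rg;
    if (pos1 == vec2(0.0)) {
        gl_FragColor = vec4(0.0);   // no valid ray starts at this pixel
        return;
    }
    vec2 deltaInPixels = (pos1 - pos0) * uImageSize;
    float steps = max(abs(deltaInPixels.x), abs(deltaInPixels.y));
    float sum = 0.0;
    float count = 0.0;
    // Fixed upper bound, matching the maximum ray length of 50 pixels.
    for (int i = 0; i < 50; i++) {
        if (float(i) <= steps) {
            vec2 p = mix(pos0, pos1, float(i) / max(steps, 1.0));
            sum += texture2D(uStrokeWidths, p).r;
            count += 1.0;
        }
    }
    // Average stroke width along the line; written back for every pixel on the
    // line in the next pass.
    gl_FragColor = vec4(sum / max(count, 1.0));
}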
Notes
This part of the algorithm deviates from [1] because the average is used instead of the median. This increases the chance of false negatives in later steps, because erroneous values are averaged into the final value instead of being eliminated.
8.3.5.4. Average stroke width writing
Input
1. An array of lines, one for each edge pixel, each defined as a pair of vertices p and q, with their positions set to the x,y-coordinates of the edge pixel and the z-coordinate set to either 0 or 1, indicating whether the vertex is p (the start vertex of the line) or q (the end vertex of the line).
2. An image where each pixel contains the position of the opposite edge pixel, or (0,0). Obtained in the previous step.
Output
An image where each pixel contains the stroke width of the stroke that contains that pixel.
Figure 8.3.5.4a Stroke Width Transform with the gradient CPU (l) and GPU (r)
Figure 8.3.5.4b Stroke Width Transform against the gradient CPU (l) and GPU (r)
Shaders
Vertex shader: WriteAverageRays.vs (Appendix C) Fragment shader: WriteAverageRays.fs (Appendix C)
Explanation
This step is exactly the same as 8.3.5.2. Stroke width value writing , with the exception that the stroke width value that is written is not calculated from the distance between the edge pixel and the opposite edge pixel. Instead, it is looked up in the output of the previous step, which contains the minimum average stroke width value.
8.3.6. Connected Components
Input
1. One image where each pixel contains the width of the stroke that contains that pixel, in the direction of the gradient (i.e. “with” the gradient), obtained in the previous step.
2. One image where each pixel contains the width of the stroke that contains that pixel, in
the opposite direction of the gradient (i.e. “against” the gradient), obtained in the previous step.
Output
Two images where each pixel is set to the identifier of the group it belongs to. As a consequence of the Connected Components algorithm proposed in [5], in swtgpu this identifier contains some extra information, namely the location (x,y) of the top-left pixel in the group. The output from this filter is a set of pixel groups that are considered letter candidates.
Figure 8.3.6a Connected Components “with the gradient” CPU (l) and GPU (r)
Figure 8.3.6b Connected Components “against the gradient” CPU (l) and GPU (r)
All vertex and fragment shaders in appendix D.
Explanation
Pixels with a similar stroke width value (ratio <= 3.0) are grouped together using the Connected Components algorithm [20]. A desktop GPU version of this algorithm was proposed in [5], which as far as we know is a unique approach. We have adapted it to the limited functionality of the mobile GPU, while at the same time making use of some features unique to the mobile GPU. We also made functional changes to perform the grouping on stroke width value instead of the binary classification used in most implementations.
This has probably been the most difficult and time-consuming filter to implement, due to the complexity of the algorithm and the limitations of the mobile GPU with respect to the desktop GPU. Practically all implementations of the Connected Components algorithm are sequential, running on the CPU, and adapting this algorithm to the GPU has been the subject of a thesis in itself. Explaining the steps and rationale needed to implement the algorithm on the GPU is therefore beyond the scope of this thesis, and we refer the reader to [5]. The shaders we have implemented to run the algorithm on the mobile GPU can be found in appendix D.