
Real-time detection of text in natural images on mobile devices

Emiel Bon

Universiteit van Amsterdam
Software Engineering
22 February 2016

Supervisors:
Tijs van der Storm (Universiteit van Amsterdam), storm@cwi.nl
Pjotter Tommassen (Itude Mobile), p.tommassen@itude.com
Robin Puthli (Itude Mobile), r.puthli@itude.com


Abstract

The goal of this thesis is to quantify the performance benefits and accuracy trade-offs that occur when performing the Stroke Width Transform (SWT) algorithm on a mobile GPU, to see if it is possible to perform text detection in pictures of natural scenes in real time on a mobile device. We have focused on finding an architecture, programming languages and a set of libraries that are portable to the majority of current-generation mobile devices. To this end we have developed a close approximation of the original SWT algorithm for execution on the mobile CPU (swt-cpu), and a custom variant for execution on the mobile GPU (swt-gpu). We have evaluated the accuracy of swt-cpu and swt-gpu on the ICDAR 2003 image database using the precision and recall metrics. We have measured the performance of both variants by recording the processing time of each sub-step in the algorithm, as well as the total processing time. We have found that swt-cpu analyzes a single image in 17.7 seconds on average, with a precision of 0.42 and a recall of 0.57. swt-gpu does the same in 3.6 seconds, with a precision of 0.35 and a recall of 0.53. Furthermore, we have found that swt-gpu can in theory only run on a small subset of all mobile devices due to specific non-standard functional requirements on the mobile GPU. In practice, we have only been able to run swt-gpu on one device model due to implementation differences in the OpenGL ES 2.0 library across device models, whereas swt-cpu runs on all devices. Finally, we estimate that the development effort for swt-gpu has been between 5 and 10 times larger than for swt-cpu. We conclude that the performance gain is insufficient to outweigh the accuracy loss, increased development time, and limited portability of the solution.


Glossary

 

GPU  Graphics Processing Unit

Vertex (pl. vertices)  A data structure containing position data, and possibly other types of data as well (e.g. color data, normal data, texture coordinates, identifiers, a timespan, etc.).

Shader  A program executed on the GPU, written in a specialized programming language.

Vertex shader  A shader that operates on a single vertex. E.g. in 3D games, this shader is used to transform 3-dimensional models, described in 3D world coordinates, to 2D screen coordinates.

Fragment  A potential pixel. A fragment contains all data needed to test whether it will eventually be drawn on screen. It contains color data, but possibly also alpha, depth, etc.

Fragment shader  A shader that operates on a single fragment. Sometimes called a pixel shader.

Screen space  The 2D coordinate system of the screen or output image, as opposed to the 3D "world" coordinate system in which 3D models are defined.

Texture  A 2D buffer for pixel data residing in GPU memory. A texture can have up to 4 color channels: red, green, blue and alpha (i.e. translucency). Pixel data can be represented as byte values in the range 0...255 for each color channel. OpenGL ES 2.0 extensions can also provide additional functionality where a texture's color channels are represented as 16 or 32 bit floating point numbers.

GPGPU  General Purpose Graphics Processing Unit: the concept of using the GPU as a highly parallel general purpose processor.

Kernel  A square matrix used for image convolution. It contains weights that are applied to a pixel p and its neighbors to mix the colors and obtain a new color for p. It can be used to achieve blurring, gradient calculation, sharpening, dilation, etc.

API  Application Programming Interface

Mobile device  Tablet or smartphone

Android  A UNIX-based operating system created by Google, designed for and primarily used on mobile devices.

iOS  A UNIX-based operating system created by Apple, designed for mobile devices created by Apple.

Open Computing Language (OpenCL)  A framework for GPGPU. OpenCL makes it possible to write programs in a C-like language that can be executed on both the (multi-core) CPU and the GPU.

Compute Unified Device Architecture (CUDA)  A proprietary framework for GPGPU by graphics card manufacturer NVIDIA. It gives the programmer access to a virtual instruction set that allows code to be executed on the GPU. By design, it is only available for a subset of NVIDIA graphics cards, which are present in a very select number of Android tablets.

Open Graphics Library (OpenGL)  A language- and platform-independent API for rendering 2D and 3D graphics. It is typically used to interact with a graphics processing unit (GPU) to achieve hardware-accelerated rendering. The functionality of OpenGL follows trends and advancements in graphics hardware development. Furthermore, device manufacturers can design and submit their own extensions to the OpenGL API; these extensions are maintained in a database by the Khronos Group. Because of this, OpenGL is comprised of a core specification and optional extensions.

OpenGL for Embedded Systems (OpenGL ES or GLES)  A subset of the OpenGL API for rendering 2D and 3D computer graphics such as those used by video games, typically hardware-accelerated using a GPU. It is designed for resource-constrained systems like smartphones, tablets, video game consoles, embedded devices and PDAs.

Grayscale image  A monochromatic image where each pixel has only an intensity value in the range [0...255] or [0...1], instead of the full RGB(A) spectrum.

Aspect ratio  The proportional relationship between an image's width and height

DSL  Domain Specific Language

Bounding Box  The axis-aligned minimal rectangular area that encloses a shape or set of points


1. Introduction

In 2010, a novel region-based algorithm called the Stroke Width Transform (SWT) [1] was developed at Microsoft Research for the accurate detection of text in natural scenes. Examples of text occurring in natural scenes are number plates on cars, writing on statue sockets, or text on billboards. Analyzing previously captured images with this algorithm is feasible because its sophisticated image transformations can be performed within reasonable time on a PC with a modern Central Processing Unit (CPU).

 

There are, however, many applications where immediate image analysis is convenient. At present, mobile devices (like smartphones) are ubiquitous in everyday life. The majority of mobile devices have a camera, so anyone owning one can take pictures of text in natural scenes. However, mobile devices often have processors that are less powerful than those of a modern PC, so the SWT algorithm likely cannot be performed efficiently on the CPU of a mobile device.

In this thesis, we will develop a version of the SWT algorithm that can be performed on the mobile GPU. We will quantify the performance benefits and accuracy trade-offs of performing the SWT algorithm on the mobile GPU. We will develop an architecture and a set of non-functional requirements that allow the GPU version of the algorithm to be ported to other platforms. We will describe what limitations are encountered, how they are compensated for, and what concessions have been made. Furthermore, we will evaluate the chosen approach itself in terms of development effort and appropriateness.


2. Motivation

The host organization (Itude Mobile) has a client that manages the transport logistics of Unit Load Devices (ULDs). They manage thousands of these objects around the world. These objects are identified by alphanumeric codes, printed on a sticker that is applied to the hull of the object. The codes are 9 to 12 characters in length, have different colors and fonts, and are often partially damaged due to rough transport or ad-hoc repairs. The host organization created an app for smartphones and tablets that can photograph these stickers, detect and parse the text on them, and immediately look up the history of the ULDs (where they are now, where they have been, their current status, etc.). The app is available for the iOS and Android operating systems.

 

The app works well, but the image analysis is not fast enough for real-time application, i.e. analyzing a stream of images until a piece of text is found that matches the format of the code on the sticker. Right now, the user first has to take a picture of the code, and after some time (ranging from 20 seconds on an iPhone 4 to about one second on an iPhone 5S) the app shows the code as digital text and looks up relevant data for the corresponding object. When the analysis fails on the picture, a new picture has to be taken and the process is repeated.

 

Also, according to the specification of the SWT algorithm, the analysis has to be performed twice: once for dark text on a light background, and once for light text on a dark background. At present, the latter pass is not performed because it would double the running time of the algorithm, which in practice is unacceptable for many devices. The omission of this step means that many of the analyses inherently fail. Furthermore, the input picture is heavily downscaled to allow for fast detection at the cost of accuracy, which causes an additional number of failures.

 

Too many experiences of failure would make users cut their losses beforehand, i.e. not use the image analysis at all and type in the code manually.

 

The desired user experience would be to briefly hold the camera in front of the sticker until the text is successfully retrieved, thus analyzing a stream of images in real time. A more efficient implementation is desired, so that the analysis is performed faster and more accurately.


3. Related work

Detecting Text in Natural Scenes with Stroke Width Transform [1], the paper that proposed the novel Stroke Width Transform algorithm, was published in 2010. The algorithm's accuracy is virtually unparalleled for horizontal text, and its performance on modern PCs is very good. Our research attempts to recreate and optimize this algorithm for real-time application on mobile devices. A more detailed description of the SWT algorithm follows in the next section.

 

Detecting Texts of Arbitrary Orientations in Natural Images [4], published in 2012, proposes extensions to the original SWT algorithm. The extensions allow for greater accuracy when detecting text with arbitrary orientations. The extensions are quite complex and many of them rely on trained classifiers, making the approach unsuitable for a GPU architecture.

TranslatAR: A Mobile Augmented Reality Translator [5], an experiment funded by Nokia to allow for real-time text detection and translation in natural scenes on the Nokia N900 smartphone, was published in 2011. Its text detection uses a method different from SWT. It requires input from the user, can only detect one line of text in an image, and relies on some very strong assumptions. It also does not achieve real-time performance and has very poor accuracy compared to SWT.

 

Parallel Text Detection [10] is a project that attempts to perform real-time text detection in natural scenes using the Stroke Width Transform for use in robotics. The goal was to reach real-time performance (analyzing 24 image frames per second) for text detection in images with 1080p resolution. The authors use NVIDIA's CUDA framework [12] to leverage the GPU. They do this on a PC, which compared to mobile devices has virtually no hardware limitations. Furthermore, the CUDA framework is unique to NVIDIA GPU architectures, which are uncommon in mobile devices, as are other GPGPU approaches like OpenCL [13]. They increase the accuracy of SWT further by adding a dilation step, which however has a dramatically adverse effect on performance. They still show a good performance increase with respect to a CPU implementation (from 2 frames per second to 6 frames per second for 1080p input), but performance is still not real-time.

 

Text Detection on Nokia N900 Using Stroke Width Transform [6] is a report made for a practical assignment that attempts to implement the Stroke Width Transform algorithm on a mobile device (the Nokia N900). Although not entirely correctly implemented, the report documents some implementation difficulties that may arise, and shows very good results for text that is relatively easy to detect, even without a perfect implementation. The implementation is not optimized for real-time performance.

 

None of the above works seeks a solution that uses SWT to achieve real-time performance on a mobile device, nor do they propose an architecture that is suitable for cross-platform implementation.


4. Introduction to Stroke Width Transform

What follows is a short summary of the Stroke Width Transform (SWT) algorithm proposed in [1].

The key observation of the algorithm is that text is always written with some sort of pen, creating strokes of approximately constant width. The algorithm consists of four phases:

1. Input image pre-processing
2. Stroke Width Transform operation
3. Finding letter candidates
4. Chaining letters into words

What follows is a brief description of these steps to give the reader insight into their complexity and implementation implications. A more detailed (though still high-level) description and rationale can be found in [1].

4.1. Input image pre‐processing

The Stroke Width Transform operation requires two pieces of information for each pixel in the input image:

1. Whether the pixel is an edge pixel (i.e. whether it lies on the stroke boundary of a potential letter)
2. The gradient direction at that pixel, which is assumed to be perpendicular to the stroke boundary

To obtain the former, the paper suggests using the edge detection algorithm proposed by Canny [8], often dubbed the "optimal edge detector". This algorithm in turn also requires the gradient at each pixel, but to eliminate noise in the image these gradients have to be calculated from a blurred, grayscale version of the input image.

To summarize, this step consists of four subroutines:

1. Convert the input image to grayscale
2. Blur the grayscale image (a Gaussian blur is the standard choice due to desirable properties like intensity preservation)
3. Calculate the gradient for each pixel from the blurred image
4. Create an edge map using the Canny edge detector

Each of these subroutines produces an output image of the same dimensions as the input image, and all of them are retained for later steps in the algorithm (cf. sections 8.3.1 through 8.3.4 for example results).
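As an illustration, these four subroutines map directly onto common OpenCV operations, which is essentially how swt-cpu realizes them; the sketch below is illustrative only, and the kernel size, thresholds and other parameters are assumptions rather than the values used in the thesis.

    #include <opencv2/imgproc.hpp>

    // Illustrative pre-processing chain (grayscale, blur, gradients, edge map); not the thesis code.
    void preprocess(const cv::Mat& input, cv::Mat& gray, cv::Mat& blurred,
                    cv::Mat& gradX, cv::Mat& gradY, cv::Mat& edges)
    {
        cv::cvtColor(input, gray, cv::COLOR_BGR2GRAY);       // 1. convert to grayscale
        cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 0);  // 2. Gaussian blur (5x5 kernel)
        cv::Sobel(blurred, gradX, CV_32F, 1, 0, 3);          // 3. horizontal gradient component
        cv::Sobel(blurred, gradY, CV_32F, 0, 1, 3);          //    vertical gradient component
        cv::Canny(blurred, edges, 100, 200);                 // 4. edge map (placeholder thresholds)
    }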


4.2. The Stroke Width Transform operation

The SWT algorithm's key contribution is an image operator that, for each pixel in the input image, determines the width of the stroke containing that pixel. The inputs to the SWT operation are the previously calculated edge map and gradient map. What follows is a condensed description of the SWT operation as described in [1].

 

The SWT output image has the same dimensions as the input images, and all its values are initialized to ∞. For each edge pixel p, a ray is cast in the direction of its gradient dp. We follow this ray from p until another edge pixel q is found. If the gradient dq at q is not roughly opposite to dp, or if no opposite edge pixel is found, the ray is discarded. We call the distance between p and q the stroke width. For each pixel that intersects the ray, we set its value to min(current value, stroke width).

  Figure 4.2. Stroke width calculation, image taken from [1]  

However, in certain cases the calculated values do not accurately reflect the actual stroke width (cf. [1]). To mitigate this, we calculate the median stroke width over all the pixels in each non-discarded ray and follow the ray again, this time setting the value of each pixel to min(current value, median stroke width). See section 8.3.5 (The Stroke Width Transform operation) for example results.
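A minimal CPU-style sketch of the ray casting, assuming the edge map and gradients are held in OpenCV matrices; this illustrates the operator itself rather than the thesis' swt-cpu code, and the -0.866 constant approximates the "roughly opposite" test (within about π/6) of [1].

    #include <opencv2/core.hpp>
    #include <algorithm>
    #include <cmath>
    #include <vector>

    // edges: 8-bit edge map (255 = edge pixel); gx, gy: float gradient components;
    // swt: float output image pre-filled with a large value that stands in for infinity.
    void strokeWidthTransform(const cv::Mat& edges, const cv::Mat& gx, const cv::Mat& gy, cv::Mat& swt)
    {
        for (int y = 0; y < edges.rows; ++y) {
            for (int x = 0; x < edges.cols; ++x) {
                if (edges.at<unsigned char>(y, x) == 0) continue;   // rays start at edge pixels only
                float dx = gx.at<float>(y, x), dy = gy.at<float>(y, x);
                float len = std::sqrt(dx * dx + dy * dy);
                if (len == 0.0f) continue;
                dx /= len; dy /= len;                               // unit gradient direction dp at p
                std::vector<cv::Point> ray;
                ray.push_back(cv::Point(x, y));
                float cx = x + 0.5f, cy = y + 0.5f;
                while (true) {
                    cx += dx; cy += dy;                             // walk one pixel along the ray
                    int ix = (int)cx, iy = (int)cy;
                    if (ix < 0 || iy < 0 || ix >= edges.cols || iy >= edges.rows) break;  // left the image
                    ray.push_back(cv::Point(ix, iy));
                    if (edges.at<unsigned char>(iy, ix) == 0) continue;
                    // q found: accept the ray only if the gradient dq is roughly opposite to dp
                    float qx = gx.at<float>(iy, ix), qy = gy.at<float>(iy, ix);
                    float qlen = std::sqrt(qx * qx + qy * qy);
                    if (qlen > 0.0f && (dx * qx + dy * qy) / qlen < -0.866f) {
                        float w = std::sqrt(float((ix - x) * (ix - x) + (iy - y) * (iy - y)));
                        for (size_t i = 0; i < ray.size(); ++i) {
                            float& v = swt.at<float>(ray[i].y, ray[i].x);
                            v = std::min(v, w);                     // min(current value, stroke width)
                        }
                    }
                    break;                                          // stop at the first edge pixel hit
                }
            }
        }
    }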

4.3. Finding letter candidates

The input to this phase is an image where each pixel's value is the width of the stroke that contains it. This phase groups pixels with similar stroke widths into letter candidates. Pixels with similar stroke widths (ratio <= 3.0) are grouped together using the Connected Components algorithm; each group receives a label, and the values of the pixels comprising the component are set to that label value. The resulting groups are called letter candidates (cf. section 8.3.6, Connected Components, for example results).

Next, each letter candidate is tested against some heuristics to prune candidates that are not likely to be letters: the aspect ratio has to be between 0.1 and 10, its bounding box may not contain more than 2 bounding boxes of other letter candidates, its height has to be between 10 and 300 pixels, and the standard deviation of the stroke width values in the letter candidate may not be more than half the component's average stroke width.

The candidates that survive are considered to be letters. For a more detailed description of the rationale behind the "magic numbers" in these heuristics, please refer to [1]. Within the scope of this thesis, we assume they make sense, unless they turn out to produce very poor results in the optimized version of the algorithm.
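For illustration, the pruning test can be written as a simple predicate; the struct and its fields below are assumptions for the sketch, not the thesis' data structures.

    // Hypothetical candidate description; one instance per connected component.
    struct LetterCandidate {
        float width, height;          // bounding box dimensions in pixels
        float meanStrokeWidth;        // average stroke width of the component
        float strokeWidthStdDev;      // standard deviation of the component's stroke widths
        int   containedBoxes;         // bounding boxes of other candidates inside this one
    };

    bool isLikelyLetter(const LetterCandidate& c)
    {
        float aspect = c.width / c.height;
        return aspect >= 0.1f && aspect <= 10.0f                 // aspect ratio between 0.1 and 10
            && c.containedBoxes <= 2                             // at most 2 nested bounding boxes
            && c.height >= 10.0f && c.height <= 300.0f           // height between 10 and 300 pixels
            && c.strokeWidthStdDev <= 0.5f * c.meanStrokeWidth;  // variation at most half the mean
    }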

4.4. Chaining letters into words

The letters are then grouped into words. Every possible combination of 2 letters is considered. Two letters are paired together if they satisfy the following properties:

1. They have roughly the same median stroke width (ratio < 2.0)
2. The ratio between their heights is < 2.0
3. The distance between them is not more than 3 times the width of the wider one
4. Their average colors are approximately the same

Next, every pair is considered a chain. Chains are merged if they have one letter in common and their general direction is about the same. This process is repeated until no more chains can be merged based on these criteria. The result is a series of distinct chains of letters, which are considered to be text lines.

 

Finally, the text lines are split into words by estimating the widths of spaces and intra-word distances with a histogram.
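The pairing test above can be sketched in the same style; distances are taken between bounding box centers here, and the color threshold is a placeholder because [1] does not specify one.

    #include <algorithm>
    #include <cmath>

    // Hypothetical letter description used only for this sketch.
    struct Letter {
        float cx, cy;                 // bounding box center
        float width, height;          // bounding box dimensions
        float medianStrokeWidth;
        float r, g, b;                // average color of the component
    };

    bool canPair(const Letter& a, const Letter& b)
    {
        float swRatio   = std::max(a.medianStrokeWidth, b.medianStrokeWidth) /
                          std::min(a.medianStrokeWidth, b.medianStrokeWidth);
        float hRatio    = std::max(a.height, b.height) / std::min(a.height, b.height);
        float dist      = std::sqrt((a.cx - b.cx) * (a.cx - b.cx) + (a.cy - b.cy) * (a.cy - b.cy));
        float maxDist   = 3.0f * std::max(a.width, b.width);
        float colorDiff = std::fabs(a.r - b.r) + std::fabs(a.g - b.g) + std::fabs(a.b - b.b);
        return swRatio < 2.0f && hRatio < 2.0f && dist <= maxDist
            && colorDiff < 0.3f;      // placeholder color tolerance
    }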

4.5. Output

The output of the algorithm is a set of bounding boxes, indicating areas in the input image that are likely to contain text. The area in each bounding box can then be fed to a generic OCR algorithm to extract the digital text.


5. Introduction to the Graphics Processing Unit

A Graphics Processing Unit (GPU) is a specialized stream processor, designed to accelerate and facilitate the rendering of 3D shapes into 2D graphical images. The processing happens in several distinct stages, resembling a pipeline: each stage takes as input the result of the previous stage. This pipeline is often referred to as the rendering pipeline. The output of the pipeline is a rendered image. GPUs are present in every modern desktop computer and laptop, but also in mobile and embedded devices.

5.1. Shaders

Most modern GPUs are flexible. They contain some parts of the older fixed-function pipeline, but many parts are programmable through shaders. Shaders are small programs written in a DSL [14][15], intended for execution on the GPU instead of the CPU. Shaders are executed in specific parts of the rendering pipeline.

5.2. Rendering pipeline

The actual rendering pipeline consists of more steps than are described below, but for this thesis the following simplified view is sufficient.

Input

The rendering pipeline takes as input a set of vertices. A vertex is a data structure that can contain any kind of data. Commonly used data include position data (x, y, z, w), color data (r, g, b, a), surface normal data and texture coordinates (u, v), but also identifiers, a timespan, etc.

Uniforms

Other types of input are possible in the form of uniforms. Uniforms are what connects the "outside world" to the shader code. They are variables (ints, floats, arrays, matrices, etc.) that can be set from the application code, and remain constant throughout a single pass through the rendering pipeline. They can be accessed from any programmable part of the rendering pipeline (cf. Programmable Vertex Processor and Programmable Fragment Processor). A special type of uniform is the sampler, which allows access to texture data stored in the GPU's memory space.

Programmable Vertex Processor

Each input vertex first enters the programmable vertex processor, where it is processed by a vertex shader, a program that operates on a single vertex. In 3D games, it is often used to transform a 3-dimensional model described in 3D world coordinates to 2D screen coordinates. The vertex shader outputs a position in screen space (and possibly other data as well) to the rasterizer. Note that the vertex shader is executed for each vertex individually; no context data with respect to the other vertices is available.


Rasterizer

Depending on what kind of primitives (e.g. points, lines, triangles, quads) the graphics card is set up to draw, the rasterizer assembles one or more of these vertices into a primitive; e.g. it takes 3 consecutive vertices to define a triangle. The output data from the vertex shader is then interpolated between the 3 vertices to get the values for each fragment inside the defined triangle. A fragment can be thought of as a "potential" pixel. It contains the same data as the vertices that define the primitive it is contained in, only interpolated to represent the values in between the vertices.

Programmable Fragment Processor

Each of these fragments enters the fragment shader, a program that operates on a single fragment. The fragment shader has to output a color value, which is then written to the frame buffer, the output image. This output value can be calculated from the data in the fragment, retrieved directly from a texture image, or a combination of the two. It is important to note that the location to which the output value is written is determined by the rasterizer and cannot be altered by the fragment shader. Finally, the output color is blended (according to some blending function, which is sometimes programmable) with the color that was already in the frame buffer.
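As a concrete (generic) illustration of these two programmable stages, a minimal GLES 2.0 pass-through shader pair could look as follows; it is embedded here as C++ string constants and is not one of the 59 shaders written for this thesis.

    // Vertex stage: one invocation per vertex; forwards a texture coordinate to the rasterizer.
    static const char* kVertexShader = R"(
        attribute vec4 aPosition;
        attribute vec2 aTexCoord;
        varying vec2 vTexCoord;       // interpolated by the rasterizer for every fragment
        void main() {
            vTexCoord = aTexCoord;
            gl_Position = aPosition;  // output position in screen (clip) space
        }
    )";

    // Fragment stage: one invocation per fragment; must output a color for the frame buffer.
    static const char* kFragmentShader = R"(
        precision mediump float;
        varying vec2 vTexCoord;       // interpolated value produced by the rasterizer
        uniform sampler2D uTexture;   // sampler uniform giving access to texture data
        void main() {
            gl_FragColor = texture2D(uTexture, vTexCoord);
        }
    )";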

Final thoughts

At first glance, it seems as if this way of processing data is only applicable to graphics rendering. This is because the GPU has been designed mainly for that purpose. Implementing operations related to graphics rendering becomes much easier, because the programmer no longer has to think about rasterizing primitives and interpolating values.

5.3. Architecture

Client‐server

When using OpenGL, the CPU and GPU interact via a client-server architecture. The GPU generally has its own memory space, so any input data has to be uploaded to the GPU, and output data has to be downloaded. Calls and queries can be made to the GPU from the CPU; the GPU executes these commands asynchronously, but in order.
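A sketch of this upload/download round trip with plain OpenGL ES 2.0 calls; error handling is omitted and the header path shown is the iOS one.

    #include <OpenGLES/ES2/gl.h>   // on other platforms the GLES2 header path differs

    // Upload: copy pixel data from CPU memory into a texture in GPU memory.
    GLuint uploadImage(const unsigned char* rgba, int width, int height)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, rgba);
        return tex;
    }

    // Download: read the currently bound framebuffer back into CPU memory.
    // This forces the CPU to wait until the GPU has finished all queued work.
    void downloadResult(unsigned char* rgbaOut, int width, int height)
    {
        glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, rgbaOut);
    }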


Single Instruction, Multiple Data

The power of the GPU lies in its Single Instruction, Multiple Data (SIMD) architecture. It is optimized to perform the same operations on many data elements (vertices and fragments) simultaneously. Furthermore, on a smaller scale it is also optimized for performing one operation on up to 4 data elements simultaneously (vectorization). For instance, in graphics programming it is common to take the cross product of two 4-component vectors. On the GPU this is often a single instruction thanks to vectorization. This functionality is also present in most modern CPUs, albeit somewhat harder to use, less performant and often not portable to other CPU models. Furthermore, the GPU also has built-in, hardware-accelerated functions for common graphics operations, such as value interpolation, linear sampling, taking the dot product of two vectors, and matrix multiplication, which can automatically make use of the vectorization functionality.

Limitations

Unfortunately the architecture of the GPU also has some significant limitations. 

Limited debugging

GPUs do not support interrupts or exceptions [11], nor do they support basic debugging functionality like breakpoints or printing to standard output. This makes debugging significantly harder, especially for shader code.

Limited dynamic branching

The SIMD architecture can incur a performance penalty for dynamic branching (e.g. if/else statements). For example, the GPU processes each vertex in its own thread. GPUs combine threads into groups and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge; e.g. 1 thread in a 4-thread group may take the branch while the other 3 do not. Now the group is split into two groups of sizes 1 and 3. These newly formed groups will run inefficiently: the 1-thread group will run at 25% efficiency and the 3-thread group at 75% efficiency. One can imagine that if a GPU continues to encounter nested branches, its efficiency becomes very low (paraphrased from [11]).

No inter‐thread communication

There is no inter-thread communication [11], and therefore no shared or global data. Vertices do not know of each other's existence: they have no way of querying the values of other vertices, nor do they know which primitive shape they are part of. Fragments likewise do not know of each other's existence.


Only pixels as output

Output is limited to the output of the graphics pipeline, which means the GPU will always output data as a texture image. Textures are comprised of pixels, which in turn are comprised of up to 4 numeric elements indicating RGB light intensity and an alpha (i.e. translucency) value. These elements are usually bytes (8-bit integers in the range [0..255]) or floating point numbers (16 or 32 bit real numbers). For our solution, all data has to be encoded into the pixels' color channels.
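As a hypothetical example of such encoding, a fragment shader can store a 2D position, normalized to [0, 1] by the image dimensions, in the red and green channels of its output pixel; a later pass (or the CPU) decodes it again.

    static const char* kEncodePositionFS = R"(
        precision highp float;
        varying vec2 vTexCoord;   // this fragment's position in normalized [0,1] image coordinates
        void main() {
            // red = x, green = y; blue and alpha are unused in this illustration
            gl_FragColor = vec4(vTexCoord, 0.0, 1.0);
        }
    )";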


6. Problem analysis

The host organization has spent a significant amount of resources on increasing the efficiency of the image analysis on the mobile CPU, and has come to the conclusion that its efficiency cannot be improved further. However, the image analysis is still not fast enough for real-time application.

6.1. Proposed solution

Adapt some or all steps of the SWT algorithm so that they can be performed on the mobile GPU, increasing the algorithm's performance by using the GPU's hardware-accelerated image processing functionality and inherent parallelism.

6.2. Rationale

Where the CPU is good at doing one (or a few) tasks very fast, the GPU is good at doing a lot of simple tasks slowly, but in parallel and in large quantities. The GPU is optimized for tasks related to 2D and 3D graphics rendering. Since the SWT algorithm directly transforms and analyzes pixel data (as opposed to training a classifier), a promising way to increase the algorithm's efficiency is to adapt the majority of its processing steps for execution on the GPU.

6.3. Difficulties

It is desirable that the solution presented in this thesis can be used on a significant majority of existing mobile devices, especially for the host organization, which provides its current solution for both the Android and iOS platforms. Therefore, we have to target different platforms simultaneously, which have varying native programming languages and APIs.

Communication with the GPU is costly. Suppose we have three necessarily consecutive subroutines in the algorithm: A, B and C. A and C can be adapted to work efficiently on the GPU, but B cannot. This results in a situation where first A has to be performed on the GPU, its results have to be downloaded for B to be performed on the CPU, and the results of B have to be uploaded again to the GPU as input for C. Such downloads and uploads are among the slowest operations when communicating with the GPU. The solution can only be efficient if these transfers are kept to a minimum.

We also have to cope with enormous device diversity. The specifications of the CPU, GPU and RAM in these devices vary greatly, but we still want to draw conclusions about relative performance. In particular, the specifications of different GPU models vary greatly, with some supporting only a subset of the needed functionality.


There is only a very limited number of GPGPU APIs available for mobile platforms. The popular GPGPU frameworks are not available for mobile devices, and other frameworks have undesirable properties (like limited portability).

 

Various parts of the SWT algorithm cannot be mapped to the GPU both efficiently and with 100% accuracy. A balance will have to be found between accuracy and efficiency. The trade-offs will be discussed and quantified.

 

[1] describes the SWT algorithm at a high level. The values of most parameters are mentioned, but not all of them (e.g. the lower and upper thresholds for the edge detection algorithm). Also, not all algorithms that were used are discussed; e.g. the algorithm proposed for chaining letter candidates is a very naive one (O(n²) complexity), used only to convey the desired behavior. Finally, no open source implementation is made available. The experiment performed in [1] is therefore difficult to repeat, and it is possible we cannot achieve the same results.


7. Research method

7.1. Experiment

We want to assess the feasibility of a real-time implementation of the SWT algorithm on current-generation mobile devices. We will develop two distinct versions of the SWT algorithm:

● swt-cpu will mimic the SWT algorithm proposed in [1] to the best of our abilities. It will run fully on the CPU of the mobile device, making use of the image processing functions in the highly optimized and portable OpenCV library. It serves as a reference for the desired accuracy of the SWT algorithm and as a baseline for the algorithm's performance.

● swt-gpu will run on the GPU of the mobile device for the most part. Every required image processing function will be implemented in shader code for execution on the GPU. Parts that cannot be efficiently performed on the GPU will be performed on the CPU.

For both algorithms, the accuracy and performance will be measured and compared for a small set of devices.

7.2. Input

Both algorithms will run on the same input: images taken from the ICDAR 2003 database. This is the same dataset as used in [1], which allows us to compare our results with [1]. The ICDAR 2003 database contains images of natural scenes containing text, annotated with the location and size of bounding boxes that enclose the text.

7.3. Accuracy measurement

The accuracy will be measured using the same metrics as in [1]: precision and recall.

The precision metric in [1] represents the number of correctly retrieved areas of text as a fraction of all retrieved areas of text, i.e. precision = true positives / (true positives + false positives).

The recall metric in [1] represents the number of correctly retrieved areas of text as a fraction of all areas of text, i.e. recall = true positives / (true positives + false negatives).

Finally, these two metrics are combined into the f-measure, defined as the harmonic mean of precision and recall: f = 2 · (precision · recall) / (precision + recall). This metric serves as an overall comparative accuracy measure of text detection algorithms.

The precision and recall are calculated for each individual image analysis and averaged for the final result. The f-measure is then calculated from this end result.
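A small sketch of this evaluation bookkeeping; the Score struct and function names are assumptions, not the thesis' evaluation code.

    #include <vector>

    struct Score { double precision, recall; };

    // Average the per-image precision and recall values.
    Score averageScores(const std::vector<Score>& perImage)
    {
        Score avg = {0.0, 0.0};
        for (size_t i = 0; i < perImage.size(); ++i) {
            avg.precision += perImage[i].precision;
            avg.recall    += perImage[i].recall;
        }
        avg.precision /= perImage.size();
        avg.recall    /= perImage.size();
        return avg;
    }

    // f-measure: harmonic mean of the averaged precision and recall.
    double fMeasure(const Score& s)
    {
        return 2.0 * (s.precision * s.recall) / (s.precision + s.recall);
    }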


To compare the regions of text retrieved by our algorithm with the ground truth regions provided in the ICDAR database, we use the same approach as in [1]. For each ground truth region, we check which of the retrieved regions has the most overlap with the ground truth region. Instead of recording a match as "successfully retrieved", i.e. 1, we record its percentage of overlap with the ground truth region, so we get a sliding scale of success instead of a binary classification. This unfortunately gives somewhat pessimistic results in our case (cf. section 10, Discussion), but does allow us to compare our algorithm's accuracy with [1].

7.4. Performance measurement

For each individual image analysis, we measure the processing time of each filter (cf. section 8.2, Architecture), for both swt-cpu and swt-gpu. The processing time is defined as the time in seconds between the start and completion of the filter. We also measure the total time in seconds the algorithm takes to perform an individual analysis.

Finally, these measurements are averaged over all performed analyses, resulting in an average processing time for each filter and an average total running time of the swt-cpu and swt-gpu algorithms.

7.5. Platform, API and programming language selection

In selecting the APIs and programming languages to be used, the following non-functional requirements were leading. The implementation must:

1. Be relevant for the vast majority of smartphones and tablets.
2. Be cross-platform so it can be easily adapted for use on any mobile device.
3. Have optimal performance for the GPU implementation.

7.5.1. Programming language

The programming language chosen for the implementation of swt-cpu and swt-gpu is C++. The rationale for this choice is threefold. First and foremost, C++ is a cross-platform language with compilers for both iOS and Android. Second, OpenGL and OpenCV (cf. section 7.5.2, Libraries) are cross-platform C/C++ libraries, so using C++ has the benefit that no wrapper around these libraries is needed. Lastly, optimal performance is very important for this implementation, and C++ offers many ways to approach optimal execution of the code. That said, a downside of C++ is its complexity and strictness with regard to type safety; generally speaking, using C++ implies a longer development time and harder-to-maintain code than higher-level languages like Java (cf. section 12, Conclusion).


7.5.2. Libraries

Two important libraries are used in the implementation of swt-cpu and swt-gpu.

OpenCV 

For the implementation of swt-cpu, the Open Computer Vision library OpenCV [16] is used. This library provides many common image processing functions and data structures, and focuses on accuracy, high performance and portability. At the time of writing, this library only leverages the CPU when used on mobile devices.

OpenGL ES 2.0 

For the implementation of swt-gpu, the cross-platform graphics library for embedded systems OpenGL ES 2.0 is used. OpenGL (ES) is not actually a library but a specification, which every GPU manufacturer implements for their GPU models. OpenGL ES version 2.0 is the most widely supported API for making use of the GPU in mobile devices (note 6). However, for this solution the core specification is too limited (cf. section 9.3, Limitations). The largest downside of using OpenGL ES 2.0 is that the algorithm and its input and output have to be tailored to the OpenGL ES 2.0 rendering pipeline, which can also impede optimal performance.

Rejected alternatives 

There are several General Purpose GPU (GPGPU) frameworks available that give access to the GPU as a general purpose processor. Well-known examples are the Open Computing Language (OpenCL) [13] and NVIDIA's Compute Unified Device Architecture (CUDA) [12]. However, we cannot use these frameworks in this thesis, for the simple reason that at the time of writing OpenCL is not supported by the two most popular mobile operating systems, Google's Android and Apple's iOS (note 7). The CUDA framework is for NVIDIA GPUs only, which are found in a very select number of Android tablets, and only the very latest iterations of that GPU model support CUDA.

 

Mobile-specific GPGPU frameworks are Google's RenderScript [17] and Apple's Metal [18]. RenderScript specifies one language for parallel processing that can be executed on both the mobile GPU and the multi-core CPU. The downside is that there is little guarantee that the code will actually be executed on the GPU, especially on older devices. Metal is an attempt to allow low-overhead access to the GPU for graphics rendering and parallel computation. The major disadvantage of both frameworks is that they are not cross-platform, i.e. they only work on their own respective platform, which violates one of our non-functional requirements.

Note 6: Other APIs / libraries exist as well (e.g. Microsoft's DirectX or OpenGL ES 3.1 with compute shader support), but they are less common on mobile devices.

Note 7: Although OpenCL is supported by most GPUs, even the ones in Android and iOS devices, it is not exposed by the OS.


7.5.3. Platform

For feasibility reasons, this thesis implements swt-gpu only for the family of devices created by Apple. The reason is that support for specific OpenGL extensions is incremental in Apple's mobile devices, i.e. once a device has been released with support for a specific OpenGL extension, all newer devices also support this functionality, with increased performance. This makes it easier to draw conclusions about the (future) performance of the algorithm.

 

However, we realize that Apple has a relatively small market share when looking at the total market of mobile devices. Therefore, we have chosen a programming language and a set of libraries / APIs that are also available on other devices (like Google's Android), to make the solution and its results more general.


8. Implementation

8.1. Programming languages and source lines of code

For feasibility reasons, the proposed algorithms are only implemented for the family of mobile devices running the iOS operating system. For the proposed solution to be relevant to other platforms as well, we use platform-independent programming languages and APIs. The proposed algorithm was written in C++ and GLSL, with a wrapper of Objective-C and Objective-C++ to use the code on iOS.

 

The implementation consists of 7 parts, comprising roughly 5,300 source lines of code (SLOC):

1. An object-oriented wrapper around OpenGL, to improve readability and maintainability (~1200 lines of code)
2. 59 vertex and fragment shaders (~700 lines of code, cf. Appendix A - E)
3. Filter implementations for swt-cpu (~800 lines of code)
4. Filter implementations for swt-gpu (~1200 lines of code)
5. Code common to swt-cpu and swt-gpu to chain letters into words (~650 lines of code)
6. Application code (~400 lines of code)
7. Platform-specific code to execute the algorithm as an iOS app (~400 lines of code)

Figure 8.1. SWT application component stack

The implementation can be used on any platform that has a C++ compiler and supports OpenGL ES 2.0 (this also includes desktop PCs). The only exception is the small amount of platform-specific code.


8.2. Architecture

 

We chose to implement the image processing as a pipeline of filters. Each filter represents an image processing step, possibly consisting of other sub-filters. It takes as input one or more textures and zero or more parameters. As output it has zero or one texture. The output texture can be used as input for the next filter in the pipeline.

 

  Figure 8.2.1 Filter pipeline architecture 

 

Each filter in swt-gpu has an initialization phase in which it sets the appropriate states on the GPU. Next it instructs the GPU to execute a sequence of image operations. Finally it waits for the GPU to finish and collects the result. For swt-cpu the initialization phase is not used; the filter simply performs the sequence of image operations.

 

  Figure 8.2.2 Sequence of operations in a filter 

 

The filters monitor their own performance by starting a timer just before the initialization phase and stopping the timer directly after the result is ready. The processing time of the filter is then expressed as the time in seconds between the start and end of the timer.
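A sketch of this filter abstraction, with the timing wrapped around the three phases described above; the class and method names are assumptions and the actual implementation differs.

    #include <chrono>
    #include <vector>

    class Texture;   // handle to an image living in GPU (swt-gpu) or CPU (swt-cpu) memory

    class Filter {
    public:
        virtual ~Filter() {}

        // Runs the filter on its inputs and records the processing time in seconds.
        Texture* run(const std::vector<Texture*>& inputs) {
            std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
            initialize();                        // swt-gpu: set GPU state; swt-cpu: no-op
            Texture* output = process(inputs);   // execute the sequence of image operations
            waitForResult();                     // swt-gpu: wait until the GPU has finished
            std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
            processingTimeSeconds = std::chrono::duration<double>(end - start).count();
            return output;
        }

        double processingTimeSeconds = 0.0;

    protected:
        virtual void initialize() {}
        virtual Texture* process(const std::vector<Texture*>& inputs) = 0;
        virtual void waitForResult() {}
    };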


Figure 8.2.3 describes the architecture of the whole swt-gpu pipeline, from the input image to a set of text regions containing individual letters to be used in the chaining phase (cf. section 4).


8.3. Filter implementations

This section gives a detailed description of how the image processing steps (i.e. filters) in the algorithm were implemented for swt-gpu.

8.3.1. Grayscale filter

Input

1. The original image
2. Four vertices defining a quad, which spans the entire image

Output

A grayscale version of the input image

Shaders

Vertex shader: Normal.vs (Appendix A)
Fragment shader: Grayscale.fs (Appendix B)

 

Figure 8.3.1 Grayscale CPU (l) and GPU (r) 

Explanation 

A single pass over the image is performed, calculating a single intensity value for each pixel as the weighted sum of the pixel's RGB values with weights (0.2126, 0.7152, 0.0722), respectively.
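A minimal GLES 2.0 fragment shader implementing this weighted sum could look as follows; it is an illustrative equivalent of Grayscale.fs, not the actual shader.

    static const char* kGrayscaleFS = R"(
        precision mediump float;
        varying vec2 vTexCoord;          // interpolated texture coordinate from the vertex shader
        uniform sampler2D uInput;        // the original RGBA image
        void main() {
            vec3 rgb = texture2D(uInput, vTexCoord).rgb;
            float luma = dot(rgb, vec3(0.2126, 0.7152, 0.0722));  // weights from the text above
            gl_FragColor = vec4(vec3(luma), 1.0);
        }
    )";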

8.3.2. Gaussian blur

Input 

1. The grayscale image 


Output 

A blurred version of the input image 

 

Figure 8.3.2. Gaussian Blur CPU (l) and GPU (r) (note 8)

Shaders 

Vertex shader: GaussianBlur.vs (Appendix B)
Fragment shader: GaussianBlur.fs (Appendix B)

Explanation 

The gray image needs to be blurred to lessen the response of the subsequent edge detection filter, i.e. to remove noise that could cause false positives. A 5x5 Gaussian blur kernel is first calculated. This can be done using tools available online or using a pre-calculated Gaussian kernel. We make use of the Gaussian kernel's separability property (note 9) and we use an optimization technique described in [21] that leverages the GPU's hardware-accelerated linear interpolation to get the same result as a 5x5 kernel using only a 3x3 kernel.

Note 8: The CPU variant requires less blurring because OpenCV's Canny operation blurs its input itself.

Note 9: An NxN kernel that has the separability property can be rewritten as the product of an Nx1 and a 1xN kernel. Instead of having to perform 5 x 5 = 25 operations per pixel (for N = 5) in a single pass, we can perform two consecutive passes of only 5 operations per pixel each, resulting in a total of only 2 x 5 = 10 operations per pixel, plus the overhead of an additional pass.
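A sketch of one separable pass that exploits the linear-interpolation trick of [21]; the binomial weights and the 1.2-texel offset shown here are illustrative and are not necessarily the coefficients used in GaussianBlur.fs.

    static const char* kSeparableBlurFS = R"(
        precision mediump float;
        varying vec2 vTexCoord;
        uniform sampler2D uInput;
        uniform vec2 uTexelStep;   // (1/width, 0) for the horizontal pass, (0, 1/height) for the vertical pass
        void main() {
            // 5-tap binomial kernel [1 4 6 4 1]/16 collapsed to 3 fetches: the two outer taps on each
            // side are merged into one linearly interpolated fetch at an offset of 1.2 texels.
            vec4 sum = texture2D(uInput, vTexCoord) * 0.375;                   // 6/16
            sum += texture2D(uInput, vTexCoord + uTexelStep * 1.2) * 0.3125;   // (4+1)/16
            sum += texture2D(uInput, vTexCoord - uTexelStep * 1.2) * 0.3125;   // (4+1)/16
            gl_FragColor = sum;
        }
    )";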

8.3.3. Sobel gradients calculation

Input 

1. A grayscale image (in our case the blurred grayscale image)  2. Four vertices defining a quad, which spans the entire image 

Output

An image containing the 2D gradients (x, y), mapped to the red and green channels of the image, and the gradient length and direction in polar coordinates, mapped to the blue and alpha channels.

 

Figure 8.3.3. Gradients CPU (l) and GPU (r)

Shaders 

Horizontal pass
Vertex shader: Sobel1.vs (Appendix B)
Fragment shader: Sobel1.fs (Appendix B)

Vertical pass
Vertex shader: Sobel2.vs (Appendix B)
Fragment shader: Sobel2.fs (Appendix B)

Explanation 

In this step we calculate a gradient map. We use a 3x3 Sobel kernel, which has to be applied twice: once for the horizontal direction and once for the vertical direction. This kernel also has the separability property (note 9), so the whole operator can be applied in 4 passes. In practice, doing it in 4 passes turned out to be as slow as doing it in one big pass, probably because the kernel is so small that the overhead of the many passes becomes significant. We have therefore combined the horizontal and vertical parts of both directions, resulting in two passes with 3 texture fetches each, the same as for the Gaussian blur described above. In addition to the 2-dimensional gradient, we also output the gradient's tangent and length, which will be used by the edge detection operator.
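For reference, a naive single-pass 3x3 Sobel sketch that also packs the gradient, its length and its direction into the four output channels; the thesis' optimized two-pass Sobel1/Sobel2 shaders differ, and a floating-point render target (via the extensions mentioned in the glossary) is assumed.

    static const char* kSobelFS = R"(
        precision highp float;
        varying vec2 vTexCoord;
        uniform sampler2D uGray;   // blurred grayscale image
        uniform vec2 uTexel;       // (1/width, 1/height)
        float lum(vec2 offset) { return texture2D(uGray, vTexCoord + offset * uTexel).r; }
        void main() {
            float tl = lum(vec2(-1.0,  1.0)); float t = lum(vec2(0.0,  1.0)); float tr = lum(vec2(1.0,  1.0));
            float l  = lum(vec2(-1.0,  0.0));                                 float r  = lum(vec2(1.0,  0.0));
            float bl = lum(vec2(-1.0, -1.0)); float b = lum(vec2(0.0, -1.0)); float br = lum(vec2(1.0, -1.0));
            float gx = (tr + 2.0 * r + br) - (tl + 2.0 * l + bl);   // horizontal Sobel response
            float gy = (tl + 2.0 * t + tr) - (bl + 2.0 * b + br);   // vertical Sobel response
            float len = length(vec2(gx, gy));
            float dir = atan(gy, gx);                               // gradient direction
            gl_FragColor = vec4(gx, gy, len, dir);                  // packed roughly as described above
        }
    )";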

8.3.4. Canny edge detection

Input 

1. A gradient map 

2. An upper intensity threshold argument for the Canny operator
3. Four vertices defining a quad, which spans the entire image

Output 

A binary image, where a black pixel represents a pixel not located on an edge, and a white pixel        represents a pixel located on an edge. 

 

Figure 8.3.4. Canny edge detector on CPU (l) and GPU (r)

Shaders 

Vertex shader: Normal.vs (Appendix A)
Fragment shader: Canny.fs (Appendix B)

Explanation 

We now calculate an edge map, as required by the SWT algorithm [1]. The edge detection algorithm described by Canny cannot be adapted to work efficiently on the GPU due to the sequential nature of the hysteresis step [8]. We have chosen to use an approximation in which this step is simply omitted. A solution described in [6] used an approximation of the hysteresis, but we found that this approximation affected performance significantly, while its effect could also be achieved by simply using a lower "upper threshold". The "lower threshold" used to classify weak pixels therefore no longer has meaning, and is also omitted in our solution. The resulting algorithm produces edge maps that look similar to Canny's (edges of 1 pixel wide, which is what we need for the SWT algorithm), but with more false positives due to the lower "upper threshold" and many false negatives due to the omission of the hysteresis step.

 

To estimate the threshold value, we use an approach suggested in [22]. We first calculate a histogram of the blurred gray image, placing each pixel's intensity value in a bin in the [0, 255] range. We then download this histogram from the GPU. Although downloading from the GPU is generally slow, it is acceptable here due to the small amount of data that is transferred (exactly 256 16-bit float values). We then find the median Hm of this histogram on the CPU, and set the upper threshold of the Canny filter to a ratio of this median. The values suggested in [22] do not produce good results in swt-gpu's Canny filter, because the lower threshold and hysteresis step are omitted in this implementation, so we have estimated a different value through a simple iterative process using several input images and observing what works well. The upper threshold value we have found to work well is 0.4 · Hm.
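A sketch of the CPU side of this threshold estimation, assuming the downloaded histogram has been converted to 256 per-bin pixel counts.

    // Find the median intensity Hm from a 256-bin histogram and derive the Canny upper threshold.
    float cannyUpperThreshold(const unsigned int histogram[256], unsigned int totalPixels)
    {
        unsigned int half = totalPixels / 2, seen = 0;
        int median = 0;
        for (int bin = 0; bin < 256; ++bin) {       // walk the bins until half of the pixels are covered
            seen += histogram[bin];
            if (seen >= half) { median = bin; break; }
        }
        return 0.4f * (float)median;                // upper threshold = 0.4 * Hm
    }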

8.3.5. The Stroke Width Transform operation

The Stroke Width Transform operation on the GPU consists of 4+1 consecutive steps:

1. Ray casting
2. Ray value writing
3. Ray averaging
4. Average ray value writing

Steps 1 through 4 are performed twice, once in the gradient direction and once in the opposite gradient direction. Because the gradients are calculated from intensity differences, the two executions are necessary to account for dark text on a light background and light text on a dark background. An additional step creates the vertices for the line primitives needed in the second and fourth steps (cf. figure 8.2.3, Prepare ray lines). It has to be done only once, but it requires us to download the edge map from the GPU. Since the edge map contains what are essentially boolean values, we reduce the amount of data transferred by downloading only the red color channel, using the smallest data type (bytes).

8.3.5.1. Ray casting

Input 

1. The edge map
2. The gradient map
3. A boolean parameter indicating whether we are looking for dark text on a light background, or light text on a dark background. This boolean indicates whether to look for an opposite edge in the direction of the gradient, or in the opposite direction.
4. Four vertices defining a quad, which spans the entire image

Output 

An image where each pixel contains the location of the opposite edge pixel, or (0,0) if no        opposite edge pixel was found. 

Shaders 


Fragment shader: CastRays.fs (Appendix C) 

Explanation 

For each pixel, we record its position in a variable (pos0) and look up the gradient dp at this pixel's position. We want to find the position of the opposite edge pixel (pos1), i.e. the first edge pixel we encounter when we walk from pos0 in the direction of the gradient.

We start walking from pos0 in the direction of the gradient. On the CPU we keep walking until an opposite edge pixel is found, but on the GPU we walk for a constant number of steps to avoid dynamic branching. We walk for a maximum of 50 pixels (the original paper uses 300), taking steps of 0.2 pixels. This results in 50 / 0.2 = 250 steps for each edge pixel. As long as we have not yet found the opposite edge pixel, we keep updating pos1 during each step. After each step, we sample the pixel at pos1 and check whether it is an edge pixel. If so, we mark that the edge pixel has been found to stop updating pos1 in subsequent steps, but we (unfortunately) have to keep walking until we have taken all 250 steps.

Finally, the pixel at pos1 is evaluated according to the same criteria as in [1]:

● Is there an edge pixel at pos1?
● Is pos1 still within the image's boundaries?
● Is the gradient at pos1 roughly opposite to the gradient at pos0?

If so, we write the position of the opposite edge pixel (pos1) to the pixel located at pos0 in the output image. The result of this step is a map where each pixel contains the location of the opposite edge pixel, or (0, 0) if no valid opposite edge pixel was found.

Notes 

This step is inherently inefficient on the GPU due to the large number of sequential operations and texture fetches; however, the optimizations we propose yield surprisingly good performance (cf. section 9.1, Performance).
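A condensed sketch of such a constant-trip-count ray walk; it is not the actual CastRays.fs, the texture encodings and uniform names are assumptions, and the final opposite-gradient test is omitted for brevity.

    static const char* kCastRaysFS = R"(
        precision highp float;
        varying vec2 vTexCoord;
        uniform sampler2D uEdges;      // edge map: r > 0.5 means edge pixel
        uniform sampler2D uGradients;  // r,g assumed to hold the gradient direction
        uniform vec2 uTexel;           // (1/width, 1/height)
        uniform float uDirectionSign;  // +1.0: walk with the gradient, -1.0: walk against it
        void main() {
            vec2 pos0 = vTexCoord;
            vec2 dir  = uDirectionSign * normalize(texture2D(uGradients, pos0).rg);
            vec2 pos1 = vec2(0.0);
            float found = 0.0;
            for (int i = 1; i <= 250; i++) {                  // 50 pixels in steps of 0.2, fixed count
                vec2 p = pos0 + dir * uTexel * (0.2 * float(i));
                float isEdge = step(0.5, texture2D(uEdges, p).r);
                pos1 = mix(pos1, p, isEdge * (1.0 - found));  // keep only the first edge position hit
                found = max(found, isEdge);
            }
            gl_FragColor = vec4(pos1, found, 1.0);            // (0,0) if no opposite edge was found
        }
    )";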

8.3.5.2 Stroke width value writing

Input 

1. An array of lines, one for each edge pixel, each defined as a pair of vertices p and q, with their positions set to the (x, y) coordinates of the edge pixel and the z-coordinate set to either 0 or 1, indicating whether the vertex is p (the start vertex of the line) or q (the end vertex of the line).

2. An image where each pixel contains the position of the opposite edge pixel, or nothing.        Obtained in the previous step. 


Output

An image where each pixel contains the minimum of the widths of the strokes that contain that pixel.

Pseudocode (vertex shader)

pos0 = vertexPosition
pos1 = lookup(oppositePositions, pos0)
dist = distance(pos0, pos1)
strokeWidth = if dist = 0.0 then ∞ else dist
outputPosition = if vertexPosition.z = 0 then pos0 else pos1

Shaders

Vertex shader: WriteRays.vs (Appendix C)
Fragment shader: WriteRays.fs (Appendix C)

Explanation 

First, we set up the GPU to draw lines. For each line (i.e. 2 consecutive vertices) that is processed, both its start vertex p and end vertex q enter the vertex shader independently, but in order. Both their positions are initially set to the coordinates of the same edge pixel. We first set pos0 to the current vertex position, and we look up pos1 by reading the value in the texture containing the opposite edge positions for each pixel (the result of the previous step). We then calculate the distance between pos0 and pos1, which is the stroke width. Finally, only when the vertex is q (i.e. the z-coordinate is 1), we actually move it to the opposite edge position, which allows the rasterizer to generate a line from p (on the edge pixel) to q (on the opposite edge pixel).
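A possible GLES realization of the vertex-shader pseudocode above; this is not the thesis' WriteRays.vs, the attribute layout is an assumption, and reading a texture in a vertex shader is an optional GLES 2.0 capability that is not guaranteed on every device. The paired fragment shader simply writes the received stroke width to each covered pixel, as described below.

    static const char* kWriteRaysVS = R"(
        attribute vec3 aVertex;              // x,y = edge pixel position in [0,1]; z = 0 for p, 1 for q
        uniform sampler2D uOppositePositions;
        varying float vStrokeWidth;          // passed on to the fragment shader
        void main() {
            vec2 pos0 = aVertex.xy;
            vec2 pos1 = texture2D(uOppositePositions, pos0).rg;  // opposite edge position from the previous step
            float dist = distance(pos0, pos1);
            vStrokeWidth = (dist == 0.0) ? 1.0e6 : dist;         // 1.0e6 stands in for infinity
            vec2 outPos = (aVertex.z < 0.5) ? pos0 : pos1;       // only the end vertex q is moved
            gl_Position = vec4(outPos * 2.0 - 1.0, 0.0, 1.0);    // map [0,1] image coords to clip space
        }
    )";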

All pixels on the generated line pass through the fragment shader, and the stroke width is passed to it from the vertex shader. The fragment shader simply writes this value to every pixel on the line.

8.3.5.3. Ray averaging

Input

● An image where each pixel contains the position of the opposite edge pixel, or nothing. Obtained in an earlier step.
● Output from the ray value writing step.

Output

An image where each edge pixel contains the minimum of all stroke width averages of the lines that contain that pixel.


Shaders 

Vertex shader: Normal.vs (Appendix A) 

Fragment shader: AverageRays.fs (Appendix C) 

Explanation 

The original algorithm casts each ray again and determines the median of the set of pixels in the output of the ray writing step that intersect the ray. Next, min(current value, median) is written to every pixel on the ray. This is done to eliminate some incorrect cases (cf. section 4.2).

 

We take a similar approach here; however, the "rays" (actually lines) were cast by the line drawing algorithm in the rendering pipeline, by drawing a line primitive between each edge pixel and the opposite edge pixel. We have to retrieve all pixels that were written by this algorithm, but unfortunately the line drawing algorithm is not specified explicitly by the OpenGL standard. Instead, a set of rules is specified to which the algorithm must conform. Bresenham's line algorithm [23], one of the most used and efficient line drawing algorithms, conforms to these rules. Furthermore, calculating the median of a set of values requires us to first sort a list of values. Due to hardware limitations (e.g. we cannot allocate a dynamically sized array), this is not possible on the GPU.

 

We propose the following implementation. For each edge pixel, we start walking in the direction of the opposite edge pixel using Bresenham's line algorithm. We add up the values of all pixels identified by the algorithm, and finally divide the sum by the number of pixels, thus calculating the average instead of the median.

Notes 

This part of the algorithm deviates from [1] because the average is used instead of the median. This increases the chance of false negatives in later steps, because erroneous values are averaged into the final value instead of being eliminated.

8.3.5.4. Average stroke width writing

Input 

1. An array of lines, one for each edge pixel, each defined as a pair of vertices p and q, with their positions set to the (x, y) coordinates of the edge pixel and the z-coordinate set to either 0 or 1, indicating whether the vertex is p (the start vertex of the line) or q (the end vertex of the line).

2. An image where each pixel contains the position of the opposite edge pixel, or (0,0).        Obtained in the previous step. 

Output

An image where each pixel contains the stroke width of the stroke that contains that pixel.

Figure 8.3.5.4a Stroke Width Transform with the gradient CPU (l) and GPU (r)

Figure 8.3.5.4b Stroke Width Transform against the gradient CPU (l) and GPU (r) 

Shaders 

Vertex shader: WriteAverageRays.vs (Appendix C)
Fragment shader: WriteAverageRays.fs (Appendix C)

Explanation 

This step is exactly the same as 8.3.5.2 (Stroke width value writing), except that the stroke width value that is written is not calculated from the distance between the edge pixel and the opposite edge pixel. Instead, it is looked up in the output of the previous step, which contains the minimum average stroke width value.


8.3.6. Connected Components

Input 

1. One image where each pixel contains the width of the stroke that contains that pixel, in the direction of the gradient (i.e. "with" the gradient), obtained in the previous step.
2. One image where each pixel contains the width of the stroke that contains that pixel, in the opposite direction of the gradient (i.e. "against" the gradient), obtained in the previous step.

Output 

Two images where each pixel is set to the identifier of the group it belongs to. As a consequence of the Connected Components algorithm proposed in [5], in swt-gpu this identifier contains some extra information, namely the location (x, y) of the top-left pixel in the group. The output of this filter is a set of pixel groups that are considered letter candidates.

 

Figure 8.3.6a Connected Components “with the gradient” CPU (l) and GPU (r) 

 

Figure 8.3.6b Connected Components “against the gradient” CPU (l) and GPU (r) 


Shaders

All vertex and fragment shaders are in Appendix D.

Explanation 

Pixels with a similar stroke width value (ratio <= 3.0) are grouped together using the Connected Components algorithm [20]. A desktop GPU version of this algorithm was proposed in [5], which as far as we know is a unique approach. We have adapted it to the limited functionality of the mobile GPU, while at the same time making use of some features unique to the mobile GPU. We also made functional changes to perform the grouping on stroke width values instead of the binary classification used in most implementations.

This has probably been the most difficult and time-consuming filter to implement, due to the complexity of the algorithm and the limitations of the mobile GPU with respect to the desktop GPU. Practically all implementations of the Connected Components algorithm are sequential, running on the CPU, and adapting this algorithm to the GPU has been the subject of a thesis in itself. Explaining the steps and rationale for implementing the algorithm on the GPU is therefore beyond the scope of this thesis, and we refer the reader to [5]. The shaders we have implemented to run the algorithm on the mobile GPU can be found in Appendix D.
