
Medical Document Image Filtering Using Convolutional and Recurrent Neural Networks

Lars Lokhoff, 10606165
lars.lokhoff@student.uva.nl
August 29, 2018, 33 pages

Supervisors: Jurgen Deege, Ana-Maria Oprescu
Host organisation: ChipSoft

Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract

1 Introduction
  1.1 Research Questions
  1.2 ChipSoft
  1.3 Contributions
  1.4 Thesis Outline

2 Background
  2.1 Artificial Neural Networks
  2.2 Convolutional Neural Networks
    2.2.1 Convolution
    2.2.2 Convolutional Layer
    2.2.3 Non-Linearity
    2.2.4 Pooling Layer
    2.2.5 Classification
  2.3 Recurrent Neural Network
    2.3.1 Long Short-Term Memory
  2.4 Domain-Specific Language
    2.4.1 DSL Design
  2.5 CNN Architecture Comparison
    2.5.1 Architectures
    2.5.2 Metrics
  2.6 XAML
  2.7 One-Hot Encoding

3 Related Work
  3.1 Document Image Classification
    3.1.1 Optical Character Recognition
    3.1.2 Interval Encoding
    3.1.3 Decision Tree
    3.1.4 Structure-based Features
    3.1.5 Relational Graphs
    3.1.6 Neural Network
    3.1.7 Convolutional Neural Network
  3.2 Image-Based Code Generation
    3.2.1 Pix2code
    3.2.2 Webpages from Screenshots
  3.3 XAML based DSL

4 OCR Pipeline
  4.1 Data Validation
  4.2 Image Classification
  4.3 DSL Generation
  4.4 OCR Sweetspot

5 Experiments
  5.1 CNN Comparison
  5.2 RNN Comparison
  5.3 OCR Sweetspot

6 Results
  6.1 CNN Comparison
  6.2 RNN Comparison
  6.3 OCR Pipeline Sweetspot

7 Discussion
  7.1 Choosing a CNN
  7.2 Choosing a RNN
  7.3 Running Time of the OCR Pipeline

8 Conclusion
  8.1 Future work


Abstract

Reducing the tasks of hospital staff that are not directly patient related is important for hospitals, as insurance companies demand that hospitals treat more patients and treat them faster. Managing paperwork and converting paper documents into digital form using OCR is a tedious and time-consuming task for hospital staff, and it is not directly patient related. Documents are run through the OCR pipeline in batches, even though the contents of 25% to 50% of the documents in a batch are judged useless by a specialist before or after the OCR process. By filtering the documents automatically, the specialists no longer have to judge the documents themselves. In this research, we filter images of medical documents based on their layout by extracting features from the image using a CNN and feeding these features, concatenated with a vectorized sequence of tokens, to an RNN. This encoder/decoder model outputs the layout of a document, described using DSL tokens. The CNN and RNN are chosen by feeding annotated images of medical documents into two frameworks that compare their performance. After implementing the encoder/decoder model into the OCR pipeline, we measure the time it takes to run the extended OCR pipeline as a function of the number of images that are filtered out. Our encoder/decoder model achieves an accuracy of 95.86% on predicting the tokens and layout of medical document images. For the OCR pipeline with filtering to take at most the same amount of time on a batch of images as naively applying OCR to the full batch, at least 31% of the images in the batch need to be filtered out.


Chapter 1

Introduction

Hospitals use paperwork in many different ways and departments. Each consult, type of research and department has its own document, each with a unique layout. Think of letters from specialist to specialist, letters from a specialist to a patient, lists of activities a nurse should take part in on a certain day and different kinds of research result documents. Apart from being stored within the hospital in their paper form, documents are often scanned and stored in image form, connected to a patient's digital medical file. This happens, for example, when a patient is sent from one hospital to another because of a procedure the other hospital is specialized in. Although the digital connection between hospitals is improving, it is nowhere near safe enough to send documents. Therefore the patient is handed their documents in paper form, which they have to hand over at the 'new' hospital. The documents are transferred this way for privacy reasons: the patient is made responsible for their own data and its safekeeping during the transfer.

After the documents are handed in at the hospital, they are either stored in their paper form or scanned and stored in the hospital's system. The documents are scanned in batches of multiple documents. A specialist then reviews all these documents, looking for the information they need to do their work. This means a specialist also has to review the documents they do not need. Filtering documents on their usefulness could be a major time-saver for a specialist, giving them more time to work with their patients.

Within hospitals, the application of machine learning has potential. Machine learning can be used within different domains, from image classification and object recognition to medical diagnosis. Even within the field of image classification and object recognition, many different applications can be thought of: detecting fractures in x-ray images, recognizing skin diseases or suspicious moles, or even detecting open and closed doors in ERs. Most of these applications share a common goal for the hospital: saving time. This might sound simple, but jobs like recognizing fractures and recognizing diseases can be time-consuming for specialists. For a hospital, these specialists do not come cheap. They are paid a high hourly rate, and from a hospital's perspective, specialists are used optimally when they apply their specially acquired skills to patients. When the time spent on tedious tasks like fracture recognition can be reduced, or the task can be completely removed from a specialist's routine, a hospital can use a specialist more effectively.

The insurance companies that pay hospitals constantly push for cost reduction. As every Dutch citizen is required to have health insurance, insurance companies look for ways to utilize hospital staff time as efficiently as possible. When you have an appointment in a hospital, this is usually paid through this insurance. In other words, the insurance company pays for many hospital visits. This is why insurance companies want patients to be treated faster and cheaper. When patients are treated faster, the throughput of patients within the same time frame is higher, effectively reducing cost. The insurance companies also push the hospitals to find cheaper ways to treat their patients. When a specialist is able to spend less time on tasks like information retrieval from documents, they are able to spend more time on patients, which helps reach the goal of the insurance companies.

Given the advances in the machine learning field, a machine learning solution for filtering scanned documents on their usefulness is interesting to look into. Combining this filtering with Optical Character Recognition (OCR) within the same application allows specialists to directly receive useful information without looking at the images of the scanned documents or the documents themselves. One of the main concerns when it comes to using machine learning algorithms is their accuracy. Especially when the health of people is involved, mistakes are costly. Filtering documents, however, is less sensitive to mistakes: when a specialist misses information because a document is wrongly filtered, for example, they will not continue their actions until they have found the required information.

The reduced sensitivity to mistakes makes the use of machine learning for the purpose of presenting certain information to a specialist an interesting starting point for the integration of machine learning within health organizations. Together with ChipSoft, we will look into the possibility of filtering documents based on their layout and content using Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In this research, we aim at speeding up the OCR pipeline within hospitals with this filtering. We aim at speeding up two parts of the process. First, the OCR process itself: applying OCR unnecessarily to document images that, for example, only contain pictures takes time, processing power and energy. Second, we aim at speeding up the information gathering process of a specialist while judging the output of the OCR process.

Within this research, we will focus on creating a pipeline that filters documents based on their layout using CNNs and RNNs by generating a DSL. This DSL describes the layout of a document, and based on the DSL we can decide whether or not to apply OCR to an image. We will not focus on the OCR part of the pipeline, as this is being implemented as a different project within ChipSoft.

1.1

Research Questions

Creating a pipeline means this research comprises multiple steps. We can divide the pipeline into three parts: Convolutional Neural Network (CNN) feature extraction, Domain-Specific Language (DSL) generation using the extracted features and a Recurrent Neural Network (RNN), and finally applying OCR to the images of medical documents. The first part requires us to find a CNN that extracts features from the images of medical documents. To find the right CNN, we will use the framework from [22] with our own annotated dataset of images of medical documents. This gives us the CNN that performs best in classifying the images of medical documents based on the metrics described in [22]. The research question for this part of our research is as follows:

RQ 1: Which CNN from the framework of Rico Lamein performs best on classifying images of medical documents?

After finding the CNN, a combination of CNN and RNN is found that generates a DSL describing the layout and content of a document based on the extracted features. Finding this combination is done in a similar fashion as in [22]. We create a framework that compares the performance of three RNNs that are fed with the features extracted by our CNN of choice. The RNNs are a standard LSTM network, an LSTM network with an additional bias on the forget gate, and a GRU network. The research question that belongs to this part is:

RQ 2: What combination of CNN and RNN performs best on DSL generation of images of medical documents?

The final part consists of putting the whole pipeline together: the language model that generates the DSL, combined with OCR. The full pipeline will be run on our dataset, and the time it takes to apply OCR to the filtered documents will be compared to the time it takes for a full run of OCR on all documents in the dataset. The point where the pipeline with filtering implemented takes as much time as the naive OCR pipeline will be referred to as the sweetspot.


RQ 3: How many images need to be filtered out before the filtering process decreases the time of running the OCR pipeline?

1.2

ChipSoft

ChipSoft aims at developing healthcare IT that supports healthcare providers. The software is developed to assist healthcare providers in giving the right care at the right time. The main product from ChipSoft, HiX, is a fully integrated software solution that provides one-time registration at the source and multiple-time use. HiX helps to share information between patients and healthcare providers in a secure digital environment.

The MultiMedia department within ChipSoft handles the documents regarding patients. All scan, import, print and other document-related actions are started through the MultiMedia module. Photo- and video-related actions are also regulated within this module. Within the module, the department is constantly working on improving the handling of documents within a hospital, making document and media handling as natural as possible for specialists and patients.

With the current developments in machine learning, the MultiMedia department has shown interest in possible applications of machine learning that deal with document handling. As mentioned earlier, the developments in the image recognition field have shown impressive results. This led to the idea of automatically classifying scanned documents based on their layout. Taking this a step further led to the combination of feature extraction with a natural language model, which is used to decide whether to apply OCR to an image of a document based on its layout and content.

1.3

Contributions

With this research, we contribute the following:

• A document image filtering pipeline that filters images of documents based on their layout with an accuracy of 95.86%.

• A framework that allows a user to find the combination of CNN and RNN that best suits their image layout description problem.

1.4

Thesis Outline

In chapter 2, the theoretical knowledge used in our research is described. This chapter contains information about the different types of neural networks, metrics and domain-specific languages used in our research. Chapter 3 states the work related to our research in terms of different classification algorithms, image-based code generation and XAML. In chapter 4, we discuss the approach that will help us answer our research questions: our plan of attack and the resources we use. Chapter 5 describes our experiments, which are used to directly answer our research questions. In chapter 6 the results of our experiments are stated, and in chapter 7 they are discussed. Finally, in chapter 8 we draw our conclusions and answer our research questions.


Chapter 2

Background

In this chapter we discuss the theoretical background and information used to conduct our research. This chapter starts with an explanation of the basis of convolutional neural networks (CNNs): artificial neural networks (ANNs). After explaining the basis of CNNs, we discuss how CNNs are able to classify images. Next, we describe another ANN type: recurrent neural networks (RNNs). We then describe XAML, the markup language we have based our DSL on. Finally, we discuss a method called one-hot encoding.

2.1

Artificial Neural Networks

In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts wrote a paper on how neurons might work [26]. They attempted to describe how neurons in the brain might work by modeling a simple neural network using electrical circuits. In 1949, Donald Hebb wrote "The Organization of Behavior" [13], a work which pointed out the fact that neural pathways are strengthened each time they are used. This concept is fundamentally essential to the ways in which humans learn. As computers became more advanced, the first hypothetical neural framework was simulated in the 1950s. This simulation was attempted by Nathaniel Rochester from the IBM research laboratories. In 1959, Bernard Widrow and Marcian Hoff of Stanford developed models called ADALINE and MADALINE. MADALINE was the first neural network applied to a real-world problem, using an adaptive filter that eliminates echoes on phone lines.

However, these models all lacked learning. This changed when Rosenblatt created the perceptron in 1957. A perceptron is an algorithm for supervised learning of classifiers. The perceptron looked promising at first. However, it was quickly proven that perceptrons could not be trained to recognize many classes of patterns. This caused the field of neural network research to stagnate for many years. After these years, it was recognized that a feedforward neural network with two or more layers (a multilayer perceptron) had greater processing power than a perceptron with one layer (a single-layer perceptron). The neural network research stagnated again some years later, after two key issues with the computational machines that processed neural networks were discovered. Basic perceptrons were incapable of processing the exclusive-or circuit, and computers did not have enough processing power to effectively handle the work required by large neural networks.

A key trigger for renewed interest in the neural networks field was the backpropagation algorithm developed by Paul J. Werbos in 1975. This algorithm solved the exclusive-or problem and accelerated the training of multi-layer networks. Backpropagation distributes the error term back through the layers of a network by modifying the weights in each layer. Because the algorithm was computationally very expensive, neural networks lost popularity to computationally less expensive classifiers like support vector machines.


implementations. The ANNs have since then been used to solve multiple problems, ranging from image recognition to predictive systems.

As stated earlier, ANNs are either single-layered (perceptron) or multi-layered. Perceptrons are based on biological neurons in our brains. Biological neurons have binary inputs and outputs (0 or 1). A neuron accepts inputs from other neurons through dendrites, which connect neurons together. Dendrites have a gap between them called the synapse that assigns a weight to a particular input. All of these inputs are considered by a neuron according to this weight by the cell body, or soma. The neurons exhibit an all-or-nothing behavior: if the combination of inputs exceeds a certain threshold, then an output signal is produced. This output signal is transmitted through its dendrites and synapses to other connected neurons. We use this biological explanation because the mathematical model is very similar. In this mathematical model, we have $n$ inputs $x_1, \dots, x_n$ with $n$ weights $w_1, \dots, w_n$. A weighted sum $z$ of these inputs is calculated and fed to an activation function $\sigma$. The activation function can be seen as a threshold function, as it defines the output based on a certain threshold.
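Written out, the computation of a single perceptron is:

$$ z = \sum_{i=1}^{n} w_i x_i, \qquad \text{output} = \sigma(z) $$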

2.2

Convolutional Neural Networks

Convolutional neural networks (CNNs) are similar to multi-layered ANNs. They also consist of an input layer, output layer and multiple hidden layers. A CNN also contains nodes representing neurons with learnable weights. Each node receives some input which can be followed by non-linearity, which will be explained later on. Finally, the network uses the same single differentiable score function as ANNs, classifying an image based on the raw input pixels it is fed.

Convolutional neural networks are a category of neural networks that have proven very effective in areas such as image recognition and classification. One of the first CNNs, which helped propel the field of deep learning, was the LeNet architecture. The LeNet architecture, designed by Yann LeCun [23], was mainly used for character recognition. This network created a foundation for every CNN to come. The basis of every CNN consists of convolution, non-linearity, pooling or subsampling, and classification.

2.2.1

Convolution

CNNs derive their name from the convolution operator. The main purpose of convolution is to extract features from the input image. A convolution is a mathematical operation on two functions (f and g) that produces a third function, typically viewed as a modified version of one of the original functions: it gives the integral of the pointwise multiplication of the two functions as a function of the amount by which one of the original functions is translated.

In image processing, a convolution is a general-purpose filter effect for images. A convolution is done by multiplying a pixel's and its neighboring pixels' color values by a matrix. This matrix is called the kernel. A kernel can be of different sizes, resulting in different results when applied to the same image. To produce the end result, the kernel is slid over the image. To preserve the convolution's commutative property, the kernel has to be flipped before sliding it over the image. To deal with the problem that arises when sliding the kernel over corner and edge pixels, where part of the kernel falls off the image, the part of the kernel that falls off can either be wrapped around the image or the values that fall off can be set to 0.
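For a discrete image $f$ and a (flipped) kernel $g$, the convolution at pixel position $(x, y)$ can be written as:

$$ (f * g)(x, y) = \sum_{i}\sum_{j} f(x - i,\, y - j)\, g(i, j) $$

where $i$ and $j$ range over the kernel coordinates.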

2.2.2

Convolutional Layer

At the core of a CNN lies the convolutional layer. A convolutional layer's parameters consist of a set of learnable filters. Every filter is small spatially, but extends through the full depth of the input volume. During a forward pass between layers, we slide the filter across the width and the height of the input volume and compute dot products between the entries of the filter and the input at any position. As the filter is moved across the width and height of the input, a 2-dimensional activation map that gives responses of that filter at every spatial position is produced. The network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. We then have an entire set of filters in each convolutional layer, each producing a separate 2-dimensional activation map. These activation maps are stacked along the depth dimension and produce the output volume.

When dealing with high-dimensional inputs such as images, it is not practical to connect neurons to all neurons in the previous volume. Instead, each neuron is connected to only a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called the receptive field of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. The connections are local in space (along the width and height), but always full along the entire depth of the input volume.

To control the number of parameters in a convolutional layer, parameter sharing is applied. The number of parameters can be reduced drastically by making one assumption: if one feature is useful to compute at some position $(x, y)$, then it should also be useful to compute at a different position $(x_2, y_2)$. By denoting a single 2-dimensional slice of depth as a depth slice, neurons in each depth slice are constrained to use the same weights and bias.
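As an illustration of the effect of parameter sharing (the symbols $F$, $D$ and $K$ are introduced here for this example only): a convolutional layer with $K$ filters of spatial size $F \times F$ applied to an input volume of depth $D$ has

$$ K \cdot (F \cdot F \cdot D + 1) $$

parameters (counting one bias per filter), independent of the width and height of the input volume.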

2.2.3

Non-Linearity

The second core concept of CNNs is non-linearity (ReLU). ReLU is optionally applied after a convolution. ReLU stands for Rectified Linear Unit and is a non-linear operation. ReLU is an element-wise operation (it is applied to every pixel) and replaces all negative pixel values in the feature map by 0.
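In formula form, applied to a single value $x$:

$$ \mathrm{ReLU}(x) = \max(0, x) $$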

2.2.4

Pooling Layer

The third core concept of CNNs is the pooling layer. A pooling layer is usually placed in between successive convolutional layers. Its function is to progressively reduce the spatial size of the representation, reducing the number of parameters and the amount of computation in the network and hence controlling overfitting. Overfitting means that the CNN performs well on the training set of the data, but poorly on new data. The pooling layer operates independently on every depth slice of the input and resizes it spatially using the MAX operation.

2.2.5

Classification

The final step in a CNN is the classification of the given input into one of the possible classes based on the training dataset. The convolutional and pooling layers act as feature extractors from the input image, while the final layers in the network are fully connected layers that act as the classifier. The fully connected layer is a traditional multi-layer perceptron that uses a softmax activation function in the output layer. The sum of the output probabilities from the fully connected layer is 1. This is ensured by using the softmax as the activation function in the output layer of the fully connected layer. The softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between 0 and 1.
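The softmax function mentioned above maps a score vector $z$ to probabilities as $\mathrm{softmax}(z)_j = e^{z_j} / \sum_k e^{z_k}$. The building blocks from sections 2.2.1 through 2.2.5 can be combined into a small network; the sketch below is illustrative only (layer sizes, input size and the Keras API are assumptions for this example, not the architectures compared later in this thesis).

```python
# Minimal illustrative CNN: convolution + ReLU, max-pooling and a fully
# connected softmax classifier. All sizes are assumptions for this sketch.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 38  # number of document classes used later in this thesis

model = keras.Sequential([
    keras.Input(shape=(256, 256, 3)),                # raw input pixels
    layers.Conv2D(32, (3, 3), activation="relu"),    # convolution + non-linearity
    layers.MaxPooling2D((2, 2)),                     # pooling reduces spatial size
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                # feature maps to a vector
    layers.Dense(128, activation="relu"),            # fully connected layer
    layers.Dense(NUM_CLASSES, activation="softmax"), # class probabilities sum to 1
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```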

2.3

Recurrent Neural Network

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a sequence. This allows it to exhibit dynamic temporal behavior for a time sequence. The term recurrent neural network is used indiscriminately to refer to two broad classes of networks with a similar general structure, where one is finite impulse and the other is infinite impulse. Both of these classes exhibit dynamic behavior [27]. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled. Both classes of RNNs can have an additional stored state, and the storage can be under direct control of the neural network. The storage can also be replaced by another network or graph. Such controlled states are referred to as gated state or gated memory, and are part of Long Short-Term Memory (LSTM) neural networks.


2.3.1

Long Short-Term Memory

LSTM neural networks were invented by Hochreiter and Schmidhuber in 1997 [14]. LSTM neural networks are RNNs that use LSTM units as a building block in their layers. As can be seen in figure 2.1, a common LSTM unit is composed of a cell ($c_t$), an input gate ($i_t$), an output gate ($o_t$) and a forget gate ($f_t$). The cell is responsible for remembering values over arbitrary time intervals. Each of the three gates can be seen as an artificial neuron as used in a multi-layer or feedforward neural network: they compute an activation of a weighted sum using an activation function. They can be seen as regulators of the flow of values that go through the connections of the LSTM. All gates are connected.

Figure 2.1: LSTM example.
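A common formulation of these gate computations, with $x_t$ the input at time step $t$, $h_{t-1}$ the previous output, $\sigma$ the sigmoid function, $\odot$ element-wise multiplication, and weight matrices $W$, $U$ and biases $b$ per gate, is:

$$ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) $$
$$ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) $$
$$ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) $$
$$ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) $$
$$ h_t = o_t \odot \tanh(c_t) $$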

2.4

Domain-Specific Language

A domain-specific language (DSL) is a computer language specialized to a particular application domain. There is a wide variety of DSLs, ranging from widely used languages for common domains, such as BibTex, CSS and SQL, down to languages used by only one or a few pieces of software. DSLs can be divided by the kind of language, and they can include domain-specific markup languages, modeling languages or programming languages. The idea of a DSL is that it is based on only the relevant concepts and features of its target domain. A DSL is a means of describing and generating members of a program family within a given domain, without the need for knowledge about general programming. By providing notations tailored to the application domain, a DSL offers substantial gains in productivity and even enables end-user programming [20].

2.4.1

DSL Design

According to [33], designing a DSL consists of the following development phases: decision, analysis, design, implementation and deployment. However, DSL development is not a simple sequential process. The decision process may be influenced by preliminary analysis which, in turn, may have to supply answers to unforeseen questions arising during design, and design is often influenced by implementation considerations.

As a decision pattern, we will be using the "transform visual to textual notation" pattern [33]. In the analysis phase, a domain model is created consisting of a domain definition, a domain terminology, descriptions of domain concepts and feature models describing the commonalities and variabilities of domain concepts and their interdependencies. A feature model will be composed according to the FODA domain analysis methodology [18]. A FODA feature model consists of definitions of the semantics of features, feature composition rules describing which combinations of features are valid or invalid, and reasons for choosing a certain feature. According to [33], the easiest way to design a DSL is to base it on an existing language. Possible benefits are easier implementation and familiarity for users. Since we will be using the DSL to describe the layout of medical documents, we will be using the "piggyback" approach in this phase. This means we will be basing our DSL on other DSLs that are designed to work with layout. Finally, as we are not deploying our DSL, we will be using the interpretation strategy in the implementation phase of the DSL.

2.5

CNN Architecture Comparison

To select the CNN architecture that performs best in classifying the medical documents, we have to be able to compare them in some way. Because we are building on the work of [22], we will use the metrics proposed there, as shown in table 2.1. These metrics are chosen to systematically compare the CNN architectures, making it easy for the user to choose the classifier that best fits her image recognition problem. The metrics are derived from [30], a paper that proposes performance measures for multiple machine learning classification tasks, one of them being multi-class classification.

2.5.1

Architectures

The CNN architectures that are used in the framework are VGG16 and VGG19 [29], Xception [9], Resnet50 [12] and InceptionV3 [31].

VGG16 and VGG19

The VGG designers [29] intended to keep their CNN architecture simple. They only used 3x3 convolutional layers and 2x2 max-pooling layers. The extracted features are fed to a softmax classifier. The difference between VGG16 and VGG19 lies in the number of weight layers used. One downside of the VGG architectures is their training time: because of their depth, training these architectures takes more time compared to other, less deep architectures.

InceptionV3

The Inception model uses Inception modules. In traditional CNN architecture, one has to make a choice whether to add a convolutional or pooling layer. Unlike traditional CNN architectures, the InceptionV3 architecture allows the parallel execution of convolutional and max-pooling layers. However, this increases the number of outputs the architecture has. The solution the authors provide is to add 1x1 convolutional layers to reduce the dimensions of the data.

Xception

The Xception architecture was introduced by [9]. This model is an extension of the Inception architecture: it replaces the modules used in Inception with depthwise separable convolutions. According to the paper, this results in an overall improvement of the architecture. The authors state that Xception outperforms the InceptionV3 architecture on a large image classification dataset. Because the number of parameters in the two architectures is roughly the same, the performance increase lies mainly in the more efficient use of model parameters rather than an increased capacity.

Resnet50

The Resnet50 architecture was introduced by [12]. This architecture relies on the use of micro-architecture modules, which together form the Resnet50 architecture. With more traditional CNN architectures, there is an underlying mapping with a nonlinear function $F(x)$. Instead of using $F(x)$, the nonlinear function $G(x)$ is used, which can be defined as $F(x) - x$. By arithmetically adding $x$ to $F(x)$ after the second weight layer and passing this through the Rectified Linear Unit (ReLU) function, information from previous layers can be carried to the next. Despite the larger depth of the Resnet50 architecture compared to other architectures, the use of the ReLU functions increases training speed. The model disk size is also substantially smaller due to the usage of global average pooling instead of fully-connected layers.

2.5.2

Metrics

In table 2.1, there are a couple of variables that need explaining. $tp_i$, $fp_i$, $tn_i$ and $fn_i$ respectively represent true positives, false positives, true negatives and false negatives, with $i$ representing a certain class. True positives are the samples of class $i$ that are correctly recognized by the classifier. True negatives are samples that do not belong to class $i$ and are also recognized as not belonging to class $i$. False positives are samples that are classified as class $i$ but do not belong to class $i$ and, finally, false negatives are samples that belong to class $i$ but are not recognized as class $i$.

| Metric | Formula |
| --- | --- |
| Average Accuracy | $\frac{1}{N}\sum_{i=1}^{N}\frac{tp_i + tn_i}{tp_i + tn_i + fp_i + fn_i}$ |
| Error Rate | $\frac{1}{N}\sum_{i=1}^{N}\frac{fp_i + fn_i}{tp_i + tn_i + fp_i + fn_i}$ |
| Precision$_{\mu}$ | $\frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N}(tp_i + fp_i)}$ |
| Recall$_{\mu}$ | $\frac{\sum_{i=1}^{N} tp_i}{\sum_{i=1}^{N}(tp_i + fn_i)}$ |

Table 2.1: Metrics for comparing CNNs, [22]

2.6

XAML

Extensible Application Markup Language (XAML) is a declarative XML-based markup language developed by Microsoft [3]. The acronym XAML originally stood for Extensible Avalon Markup Language, Avalon being the code-name for Windows Presentation Foundation (WPF). XAML is used extensively in .NET Framework 3.0 and 4.0 technologies, particularly WPF, Silverlight, Windows Workflow Foundation and Windows Runtime XAML Framework. In WPF, XAML forms a user interface markup language to define UI elements, data binding, events and other features. XAML elements map directly to Common Language Runtime object instances, while XAML attributes map to Common Language Runtime properties and events on those objects. Anything that is created or implemented in XAML can be expressed using a more traditional .NET language. However, a key aspect of the technology is the reduced complexity needed for tools to process XAML, because it is based on XML.


2.7

One-Hot Encoding

To transform a sequence of words into a vector of numbers, we use a technique called one-hot encoding. The set of possible words or DSL tokens is what we call categorical data. Some algorithms can work with categorical data directly; however, combining a feature vector of numbers with a vector of strings will not work. First, all possible tokens are assigned an integer. This integer is then converted into a binary variable: a bit string with as many bits as there are possible tokens. Each token is represented by a unique bit string in which one of the bits is set to one and the rest to zero.
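A small sketch of this encoding (the example tokens follow the DSL described in chapter 4; the helper function itself is illustrative):

```python
# One-hot encoding of DSL tokens: every token gets an integer index, and every
# index becomes a bit string with a single bit set to one. Illustrative sketch.
TOKENS = ["<start>", "<end>", "<row-start>", "<row-end>", "<text-block>"]
TOKEN_TO_INDEX = {token: i for i, token in enumerate(TOKENS)}

def one_hot(token):
    vector = [0] * len(TOKENS)
    vector[TOKEN_TO_INDEX[token]] = 1
    return vector

# one_hot("<row-start>") -> [0, 0, 1, 0, 0]
encoded_sequence = [one_hot(t) for t in ["<start>", "<text-block>", "<end>"]]
```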


Chapter 3

Related Work

This chapter discusses the work related to our research. We describe different types of classification algorithms and approaches, and why they are not sufficient for our goal. We end the chapter with two approaches that are similar to our solution for speeding up the OCR pipeline.

3.1

Document Image Classification

Image-based document classifiers classify documents from a single application domain [4,5,6,28] or multiple application domains [32,21,15,10]. For example: a classifier can identify bank documents, business letters or tax forms, or the classifier classifies documents coming from all 3 of these domains. In our research we expect the classifier to classify medical documents. We expect these documents to come from different departments from across the entire hospital or health-care organization, and treat these departments as different domains.

The application domain or document space can be partitioned in three different ways. The possible document classes may make up the entire document space, which means a document can always be classified as one of the possible classes. The document space can be larger than the union of the possible document classes, which means some documents do not belong to any of the possible classes and should be rejected. Finally, the document classes can overlap, which means a document can belong to multiple classes. In our research, we focus on the first type of document space: each document is a member of one of the possible classes, and a document can always be classified as one of these classes (no documents are rejected).


| Approach | Average Accuracy | Precision | Recall | Classification algorithm |
| --- | --- | --- | --- | --- |
| Text recognition [16] | NR | 73 | 53 | Vector space |
| Text recognition [7] | 97.1 | 81.95 (average) | 26.5 (average) | LEOPARD algorithm |
| Layout analysis [15] | 70.84 | NR | NR | Hidden Markov model |
| Layout analysis [11] | 96.8 | NR | NR | Decision tree |
| Structure-based features [28] | 88.51 | NR | NR | Self-organizing map |
| Functional landmarks [32] | 93.3 | NR | NR | Two-layer document classifier |
| Attributed relational graphs [5] | 97 | NR | NR | First order random graphs |
| Neural network [25] | NR | 0.475 (average) | 0.486 (average) | Neural networks (2 methods) |
| Convolutional neural network [19] | 65.35 | NR | NR | Neural network (logistic regression with softmax layer) |

Table 3.1: Document image classification approaches and their average accuracy, precision, recall and classification algorithm, if stated.

Document classification algorithms have been studied for quite some time. In our research we are trying to find the CNN that best matches the requirements within ChipSoft. Most papers found focus on only one attribute of the classifiers: their accuracy. In [22], this focus is broadened: classifiers are compared on multiple metrics, as previously mentioned in the Background chapter. Classifying images of documents can be done in multiple ways: based on their content (text-based) [16, 7], layout-based using 'traditional' classifiers like naive Bayes and nearest neighbor [32, 11, 5, 28, 15], and with the more state-of-the-art neural networks and CNNs [25, 19]. We have chosen to use CNNs for two reasons. Firstly, CNNs are currently outperforming any other type of image classification algorithm. Secondly, classifying images using OCR (Optical Character Recognition) is not viable from a business standpoint, as it is too time-consuming and not reliable enough.

In table 3.1, we have summed up some of the approaches we found that classify document images, with their average accuracy, precision, recall and the classification algorithm used.

3.1.1

Optical Character Recognition

The approaches listed first use text recognition (OCR) to classify documents [16, 7]. These approaches work by attempting to identify the text written in a document and matching this text to certain keywords to classify the document. The accuracy achieved by [7] is high, 97.1%, but such accuracy is only feasible when all images are clear. In a hospital, especially when dealing with old images, the quality is not always suitable for OCR, and many documents contain mostly images, which means OCR cannot be used to classify these documents.

3.1.2

Interval Encoding

The approach listed second is a document classification algorithm that classifies documents using interval encoding [15]. With interval encoding, the elements of spatial layout are captured. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of an image. The features are then used in a hidden Markov model based page layout classification system that is trainable and extendable. As can be seen in table 3.1, an average accuracy of 70.84% is achieved using this method. For our research, we aim to achieve an accuracy of at least 90% while using CNNs. This approach, moreover, requires the document to be pre-processed in such a way that the document is segmented into rectangular blocks of text. As we want to reduce the tasks of specialists and personnel in the hospital and not increase them, this is an unwanted characteristic of the approach.

3.1.3

Decision Tree

The approach that uses decision trees [11] applies a first-order learning system for the induction of rules used for the layout-based classification and understanding of documents. The authors have divided the problem of document classification into sub-problems: the classification of blocks within a document, the classification of the document itself, and the understanding of the layout of a document by labeling layout components. Based on the achieved accuracy of 96.8%, this approach would meet our requirements. However, this approach also uses OCR, which as mentioned earlier is undesirable.

3.1.4

Structure-based Features

The approach described in [28] uses structure-based features such as percentages of text or non-text content regions. These features are used by two classifiers: a decision tree classifier and a self-organizing map. Images are divided into n rows and m columns, evenly sized and spread across the document. The average accuracy achieved was 88.51%, which comes close to our goal of 90%. However, this approach does not use CNNs. The second approach, described in [32], uses the number of columns in the image and functional landmarks to classify a document. Each document is sorted according to the number of columns it contains, which puts it in a certain class. The presence or absence of landmarks in these columns is then used to decide whether a document belongs to that class. This is done for all classes.

3.1.5

Relational Graphs

The fourth approach uses attributed relational graphs (ARGs) to represent the layout structure of documents [5]. These representations are used to classify the documents using first order random graphs (FORGs). The average accuracy of this approach is 97%, which is very high. However, this approach still does not use CNNs, which we need for our research.

3.1.6

Neural Network

The fifth approach listed in table 3.1 compares the results of using a neural network that classifies images of documents to methods such as Rocchio, nearest neighbor, naive Bayes and one-class SVM algorithms [25]. The paper does not report the average accuracy achieved by the neural network, only the precision and recall. Although this approach does not use a CNN, it is a step closer to what we are looking for in terms of the classification method used, compared to the other document classification methods found.

3.1.7

Convolutional Neural Network

The final method shown in table 3.1 uses a CNN to classify the document images [19]. The document classes are defined by structural similarity. The CNN is equipped with rectified linear units and trained with dropout. The authors used two datasets: the Tobacco litigation dataset and the NIST tax-form dataset. The average accuracy is only reported on the Tobacco dataset, and is 65.35%. So this method uses a CNN, but its accuracy is below our goal.

3.2

Image-Based Code Generation

Image-based code generation is a technique where programming code is generated based on the contents of an image. For example, when an image of a GUI is given, the HTML code to produce that GUI is generated. Below we discuss the work of [8] and [24] and their approaches to image-based code generation.


3.2.1

Pix2code

The automatic generation of programs using machine learning techniques is a relatively new field of research. The authors of [8] propose an approach to generate code based on a graphical user interface (GUI) screenshot or image. They divide the problem of generating code based on a GUI image into three sub-problems. First, a computer vision understanding of the given GUI is needed: what objects are present, which pose do they have and what are their identities? Second, a language modeling problem of understanding text and generating syntactically and semantically correct samples has to be solved. Finally, the solutions to the first two sub-problems are used to generate computer code for the objects present in the GUI image.

The authors have proposed to solve the first problem, getting a computer vision understanding of the given GUI image, by using a CNN to perform unsupervised feature learning by mapping an input image to a learned fixed-length vector. Input images are resized to 256 by 256 pixels, no further preprocessing is performed.

To solve the second sub-problem, the authors propose a DSL to describe GUIs. Because only the layout, graphical components and their relationships are of interest, the actual textual value of the labels is ignored. The simplicity of the DSL enables token-level language modeling with a discrete input by using one-hot encoded vectors. This is done using an LSTM.

The final sub-problem is solved by the authors by combining the vectorial representations generated by the CNN and LSTM. This feature vector is fed into a second LSTM-based model decoding the representations learned by both the vision model and the language model. The decoder learns to model the relationship between objects present in the input GUI image and the associated tokens present in the DSL code.

We will be applying an approach similar to the one used in this research, but instead of generating code, we will use the output DSL to assess a document on the possible information it contains. Like the approach mentioned below, we will also not use an LSTM in the feature extraction process.

3.2.2

Webpages from Screenshots

The work of [24] builds on the work of [8] by altering the end-to-end model used in [8]: the non-pre-trained CNN is replaced by a pre-trained CNN.

The work of [24] uses a dataset of websites mapped into a DSL consisting of 18 vocabulary tokens. A pre-trained ResNet-152 model is used to extract features from an image, which are passed into a decoder model. The decoder takes in the expected target for the screenshot and the features from the encoder. It stores the target in a word embedding. The source and target sequences, along with the feature vector, are then used to train an LSTM layer.

This work uses a similar approach to ours by using a pre-trained CNN. However, we will be using a different CNN architecture, and also different types of images.

3.3

XAML based DSL

As described in the XAML section of the Background chapter, XAML is a markup language used for applications such as WPF. Within ChipSoft, developers work with XAML on a daily basis. Moreover, XAML is not only frequently used within ChipSoft: it is used for the development of a wide variety of Microsoft applications. For these reasons we will design our DSL with the XAML syntax in mind. This gives the users within ChipSoft a familiar looking piece of code to look at when the output of the end-to-end system is being inspected.


Chapter 4

OCR Pipeline

This chapter discusses our approach to answering our research questions and building the new OCR pipeline. First, we discuss the data used in our experiments, as this data was gathered by hand. We then discuss how we find the CNN and RNN we will be using in our OCR pipeline. Next, we discuss the DSL that is used to describe the layout of documents. Finally, we describe the OCR pipeline itself.

4.1

Data Validation

The images used in our research were annotated and categorized by hand. The images are all images of documents that are used in different hospitals working with products from ChipSoft. We categorized the images into a total of 38 different classes based on their layout. The documents range from hospital release documents to reports of patient appointments. An example document is given in figure 4.2. The contents of these documents are highly confidential and contain patient information that cannot be distributed outside of ChipSoft. We have created an example document without any patient-sensitive data, which can be seen in figure 4.1. The paneled version, as the network will describe the document, can be seen in figure 4.2.

To be certain that the categorization is correct and the CNNs are not fed any incorrect data, we used OCR to validate the category an image is placed in. The OCR is implemented using the Python library pytesseract [1]. This library is a Python wrapper for Google's Tesseract-OCR [2]. We set up the OCR to go through the images in each category and look for certain keywords that belong to documents of that category. Images that, according to the OCR, do not contain any of these keywords are marked and validated by hand. We do this extra re-validation by hand because some images are of bad quality and the OCR is unable to correctly validate them.
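A sketch of this validation step, assuming one directory per category and a hypothetical keyword list per category (both the directory layout and the keywords are illustrative assumptions):

```python
# Flag images whose OCR text contains none of the keywords expected for their
# category; flagged images are re-validated by hand. Keywords and directory
# layout below are hypothetical.
import os
from PIL import Image
import pytesseract

CATEGORY_KEYWORDS = {
    "discharge_letters": ["ontslag", "discharge"],
    "appointment_reports": ["afspraak", "consult"],
}

def flag_images_for_manual_check(root_dir):
    flagged = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        category_dir = os.path.join(root_dir, category)
        for name in os.listdir(category_dir):
            path = os.path.join(category_dir, name)
            text = pytesseract.image_to_string(Image.open(path)).lower()
            if not any(keyword in text for keyword in keywords):
                flagged.append(path)  # no expected keyword found
    return flagged
```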


Figure 4.1: An example document with all patient sensitive data blurred out.


4.2

Image Classification

The classification process is executed with the help of the framework created by [22]. The framework ranks 5 CNNs on the metrics shown in section 2.5.2. The CNNs, as described in the Background chapter, are: VGG16 and VGG19 [29], Xception [9], Resnet50 [12] and InceptionV3 [31]. The framework is able to scrape the Internet using a Google scraper for images based on the search term the user enters. Because medical documents are not or hardly found on the Internet, and the few that are found are not being used in hospitals working with ChipSoft, we provide our own validated dataset of images of medical documents. This dataset is fed into the framework, which trains each CNN separately on classifying an image into one of the 38 possible categories. From each class, 80% of the images are used for training and 20% for evaluation. After training a CNN, the framework evaluates it with the evaluation dataset and reports the results in a text file. The text file contains the accuracy of a CNN per class, an average accuracy, the rank 1 accuracy, the rank 5 accuracy, the number of false positives and the number of false negatives. For a given image, the CNN outputs, for each of the 38 possible classes, a probability that the image belongs to that class, and the class with the highest probability is chosen as its prediction. The rank 5 accuracy is calculated by counting the number of times a CNN had the correct class in its top 5 of most probable classes and dividing this by the total number of samples. Rank 1 accuracy is calculated by counting the number of times a CNN predicted the right class for a given input, divided by the total number of samples.
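As an illustration, rank 1 and rank 5 accuracy can be computed from the predicted class probabilities as follows (a sketch; the function and variable names are ours, not the framework's):

```python
# Rank-k accuracy from an array of predicted class probabilities.
# probabilities: (num_samples, num_classes), true_labels: (num_samples,).
import numpy as np

def rank_k_accuracy(probabilities, true_labels, k):
    # Indices of the k most probable classes per sample.
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = sum(label in top_k[i] for i, label in enumerate(true_labels))
    return hits / len(true_labels)

# rank_1 = rank_k_accuracy(probs, labels, k=1)
# rank_5 = rank_k_accuracy(probs, labels, k=5)
```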

As stated earlier, the metrics shown in section 2.5.2 will be used to rank the CNNs. Average accuracy, error rate, precision and recall are calculated using the true positives, true negatives, false positives and false negatives. In consultation with ChipSoft, we decided that the most important metric is average accuracy; if accuracies are equal, we look at the error rate, then precision, and finally recall. The accuracy is most important because this is the metric hospitals will be most interested in.

4.3

DSL Generation

To generate the DSL corresponding to the layout of an image of a medical document, a combination of a CNN and an RNN is used. The output will be a sequence of tokens, where each token represents a section in a document. The tokens are generated in a left to right manner, where a document is divided into rows. Each row contains multiple sections. When only one section is present in a row, the row tokens are left out. All tokens are listed in figure 4.3. We have based the notation of the DSL tokens on XAML. Within ChipSoft, XAML is used in front-end development. Basing the notation of the tokens on XAML gives the employees something familiar to work with, even though the token names themselves are not used within XAML itself.

The CNN extracts features from an input image. The output of the CNN is a feature vector. This vector, together with a vector that represents the predicted words so far, functions as input for the RNN.

The RNN will be trained to generate the DSL code, a sequence of tokens, based on these input vectors. The feature vector contains information about the image that corresponds to the layout of that image. The vector of predicted words consists of a sequence of words that are vectorized so it can be combined with the feature vector. The output, a sequence of tokens from the DSL, then gives us the information on the contents of the image in a more understandable way than a feature vector. The RNN can be seen as a decoder of the feature vector, translating the features together with the vectorized sequence of tokens into a sequence of tokens that is readable by humans. For example, the expected sequence of tokens generated for the document in figure 4.2 is: <row start>, <logo>, <stamp>, <row end>, <text block>, <text block>, <text block>. As can be seen here, the start and end of the first row, which contains multiple sections, are marked by the <row start> and <row end> tokens. This is not needed for the sections containing text, as each text block fills up an entire row by itself.

To give a demonstration of how the pipeline operates, we take figure 4.1 as an example image.


<start> This token marks the start of a document. The token is needed by the RNN so it knows it is at the beginning of a document. The RNN will base its first prediction on the feature vector and the start token.

<end> This token marks the end of a document. The RNN will keep predicting tokens until either this token is predicted or the maximum number of tokens has been reached.

<row-start> This token marks the beginning of a row containing multiple sections.

<row-end> This token marks the end of a row.

<block> This token marks a section containing empty or blank space within a document.

<text-block> This token marks the presence of text blocks or pieces of text. This is the block we are looking for in terms of applying OCR.

<graph> This token marks the presence of a graph. Not usable for OCR.

<stamp> This token marks a stamp containing hospital and department specific information. Also useful for OCR purposes.

<image> This token marks the presence of an image. This token is not used for OCR purposes.

<table> This token marks a table. Tables contain text, so OCR can be applied to documents containing these tokens.

<pad> This special token is mainly used in training. Because the RNN expects fixed-length vectors, every training sequence has to be padded until it is as long as the longest sequence in the training data.

Figure 4.3: DSL tokens and their meaning.

First, the CNN extracts the features from that image. The output of the CNN is a feature vector. This feature vector is concatenated with the <start> token in vectorized form. The RNN takes this vector as input and based on that vector the RNN predicts the next token, which is in this case the <row start> token. This token is then added to the vectorized sequence of tokens, now consisting of the <start> and the <row start> token. The new vectorized sequence of tokens is again concatenated with the feature vector and fed to the RNN. This cycle continues until the <end> token is generated and the sequence consists of <start>, <row start>, <logo>, <stamp>, <row end>, <text block>, <text block>, <text block> and <end>. The RNN is now finished predicting the layout of the example document, and based on the presence of the <text block> token this image is not filtered out.

Encoder/Decoder Structure

The CNN and RNN combination is often referred to as an encoder/decoder structure or architecture. The structure consists of a CNN that extracts the features from an image and an LSTM that generates a token. The features, however, are extracted only once, and not at every training step. The two vectors are concatenated and fed into a second LSTM. The encoder thus consists of an input that accepts a feature vector and the first LSTM, where the feature vector remains the same for an image and the LSTM output vector changes after every generated token. The resulting vector is fed into the decoder, which is the second LSTM. The decoder is the part of the architecture that outputs the sequence of tokens.

To predict the sequence of tokens that belongs to an image of a document, the image features and the start token are fed into the network. The network predicts the first token. The new sequence, which is the start token plus the predicted token, is then fed into the network. This repeats itself until either the end token is predicted or the maximum number of tokens has been generated.
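A minimal sketch of this encoder/decoder structure in Keras is shown below. The layer sizes, vocabulary size, maximum sequence length and feature vector length are illustrative assumptions, not the values used in our experiments.

```python
# Minimal sketch of the encoder/decoder structure described above (Keras).
# All sizes below are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 11      # number of DSL tokens (assumed)
MAX_LEN = 48         # maximum token sequence length (assumed)
FEATURE_DIM = 2048   # length of the CNN feature vector (assumed)

# The fixed feature vector, extracted once per image by the CNN.
image_features = keras.Input(shape=(FEATURE_DIM,), name="image_features")
encoded_image = layers.Dense(256, activation="relu")(image_features)
# Repeat the image encoding for every time step so it can be concatenated
# with the encoded token sequence.
encoded_image = layers.RepeatVector(MAX_LEN)(encoded_image)

# The one-hot encoded, padded sequence of tokens predicted so far.
token_sequence = keras.Input(shape=(MAX_LEN, VOCAB_SIZE), name="token_sequence")
encoded_tokens = layers.LSTM(128, return_sequences=True)(token_sequence)  # first LSTM (encoder)

# Concatenate the two vectors and feed them to the decoder LSTM.
decoder = layers.concatenate([encoded_image, encoded_tokens])
decoder = layers.LSTM(512)(decoder)                                       # second LSTM (decoder)
next_token = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder)      # next DSL token

model = keras.Model(inputs=[image_features, token_sequence], outputs=next_token)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```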


4.4

OCR Sweetspot

The current OCR pipeline consists of running a full batch of images through OCR and letting the specialist decide which images should be part of that batch, or letting the specialist decide which output is useful after applying OCR to all images. The goal is to eliminate the part where a specialist has to decide either which documents enter the OCR pipeline or which OCR output is useful. Our approach to eliminating this specialist input is to expand the OCR pipeline. A combination of a CNN and an RNN will be used to filter out any documents that do not require OCR. The combination, also known as an encoder/decoder model, outputs DSL tokens that describe the layout of a document. Based on the wishes of a specialist, the tokens can be used to decide which documents are run through OCR. With the result of the image classification experiment, where we compare the performance of 5 CNNs on the earlier mentioned metrics, we will find out which of the CNNs performs best at classifying the images of medical documents. This CNN will also be used in the generation of the DSL code based on the images of medical documents. The CNN is combined with one of the RNNs, based on the experiment from section 4.3. The combined neural networks generate a DSL that will be used to decide whether a document contains text that makes it suitable for OCR. By suitable we mean that the document contains information we would like to extract using OCR. By filtering out documents like this, the OCR process takes less time, because we skip non-suitable documents. The non-suitable documents will not have OCR applied to them and will also not be reviewed by a specialist.

To find out how many images need to be filtered out to reduce the time it takes to apply OCR to a batch, we use 3 equations. The first equation, equation 4.1, calculates the time it takes to OCR a batch of N images. The second equation, equation 4.2, calculates the amount of time needed to run a batch of N images through the filtering pipeline. The final equation, equation 4.3, calculates the total time ($T_{\text{OCR filtered}}$) it takes to run a batch of images through the filtering pipeline and then apply OCR to the images that are not filtered out: it adds the time it takes to filter a batch of N images to the time it takes to apply OCR to the non-filtered images. The purpose of equation 4.3 is to find the point where the filtering and OCR process on a batch of images takes less time than just applying OCR to the full batch.

$T_{\text{OCR total}} = N \cdot T_{\text{OCR image}}$   (4.1)

$T_{\text{filter total}} = N \cdot T_{\text{filter image}}$   (4.2)
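Written out, with $N_{\text{filtered}}$ denoting the number of images that are filtered out of the batch, the description above gives:

$T_{\text{OCR filtered}} = T_{\text{filter total}} + (N - N_{\text{filtered}}) \cdot T_{\text{OCR image}}$   (4.3)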


Chapter 5

Experiments

In the previous chapter, we discussed our approach to answering the research questions. This chapter describes all the experiments that are conducted to answer the research questions and build our OCR pipeline.

5.1

CNN Comparison

The first research question of our research is: Which CNN from the framework of [22] performs best on classifying images of medical documents? Therefore, the first experiment conducted in this research is the comparison of the performance of the CNNs in the framework of [22]. Because classifiers and CNNs essentially perform the same task, the methods used for classifier comparison can be applied to CNN comparison. Within the field of machine learning, most literature on the comparison of CNNs focuses on only one metric: accuracy. According to [22], however, other attributes of classifiers are important as well.

Based on the metrics described in section 2.5.2, we choose the CNN that performs best on classifying images of medical documents. The 5 CNNs that are compared can be found in section 2.5. These architectures are chosen based on their reported performance. To conduct this experiment, a total of 4910 images are divided into 38 different classes. The size of each class varies from 50 images to 1200 images. The number of images in each class is not normalized, because we want the sizes to match the number of times each class is used in a real-life environment.

5.2

RNN Comparison

The second research question is: What combination of CNN and RNN performs best on DSL generation for images of medical documents? To find an RNN that best matches the CNN chosen in the previous experiment, a second experiment is conducted, similar to the CNN comparison experiment. This time we have chosen RNN architectures that perform well in predicting XML tag sequences, and we create a framework that rates these RNNs based on their accuracy. The 3 RNNs are each fed the same input: the feature vectors generated by the CNN chosen in the previous experiment, together with the DSL sequence belonging to each image. The images used are identical to the ones used in the first experiment. A total of 4910 images, divided over the 38 different classes, are used. Each image is paired with a layout description in the form of tokens. This experiment ranks the RNNs based on their accuracy, because this is the metric most important for ChipSoft. The accuracy is measured in terms of token prediction, not on the prediction of a total sequence. This means that each time the RNN predicts a token based on its input, the accuracy metric is updated; a small sketch of this per-token accuracy is shown below.
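The sketch below illustrates the per-token accuracy; the data layout (lists of equal-length predicted and target token sequences) is an assumption for illustration.

```python
# Per-token accuracy: every individual token prediction counts, not whole sequences.
def per_token_accuracy(predicted_sequences, target_sequences):
    correct = total = 0
    for predicted, target in zip(predicted_sequences, target_sequences):
        for p, t in zip(predicted, target):
            correct += int(p == t)
            total += 1
    return correct / total

# Example: 2 of 3 tokens are correct, so the accuracy is 0.666...
print(per_token_accuracy([['<start>', '<text block>', '<end>']],
                         [['<start>', '<logo>', '<end>']]))
```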


5.3

OCR Sweetspot

The final research question is: How many images need to be filtered out before the filtering process decreases the time of running the OCR pipeline? To find the percentage of images that need to be filtered out before the pipeline with filtering takes at most as much time to process a batch of images as naively applying OCR to the full batch, we measure the average time it takes to OCR the image of a document. We OCR images of each of the 38 classes, to make sure we take into consideration all possible documents and the time it takes to apply OCR to them. We then feed the same documents to the filtering pipeline and measure the average time it takes to filter an image. Using the average time it takes to OCR an image and the average time it takes to run an image through the filtering pipeline, we calculate the time it takes to filter a batch of images and then apply OCR to the images that are not filtered out.


Chapter 6

Results

This chapter presents the results from the experiments described in chapter 5. We present the results of finding the CNN that performs best on classifying images of medical documents and the results of running the RNN comparison framework with images of medical documents. Finally, we discuss the outcome of filling in the equations described in the Approach chapter.

6.1

CNN Comparison

The results of running the framework with the annotated data can be seen in table 6.1. This table shows the average accuracy, error rate, precision and recall for each of the 5 CNNs that are used in the framework of [22]. The general results of feeding our image collection to each of the CNNs are on par with state-of-the-art results. Because most documents have a straightforward layout, CNNs perform well on classifying the images of documents. From the output of the framework, we can see that most mistakes are made in classifying a document as a class with a similar (but slightly different) layout. The architecture performing best is VGG16. This architecture ranks highest or on par on all 4 metrics compared to the other architectures, even though the difference between VGG16 and VGG19 is very small and lies only in accuracy and error rate.

Architecture   Average accuracy   Error rate   Precision   Recall
InceptionV3    0.9971             0.0029       0.9458      0.9458
Resnet50       0.9968             0.0031       0.9373      0.9453
VGG19          0.9982             0.0018       0.9673      0.9673
VGG16          0.9985             0.0015       0.9673      0.9673
Xception       0.9972             0.0028       0.9487      0.9487

Table 6.1: Framework experiment results

6.2

RNN Comparison

Table 6.2 shows the results of training and evaluating three RNNs. Each RNN is trained with features extracted using the VGG16 architecture and a sequence of tokens assigned by hand. The combination of VGG16 and an LSTM with a bias to the forget gate gave us the highest accuracy (95.86%). With an accuracy of 95.69%, the combination of the VGG16 and GRU architectures is not far behind. With a little more than 1% lower accuracy, the combination of VGG16 and the standard LSTM is third best.


Because the idea of generating a sequence of DSL tokens came from [8], we compare the performance of their architecture to ours. The result of feeding our image collection to their framework can be seen in table 6.2. This shows that our CNN and RNN combination achieves an increase in accuracy of roughly 5 percentage points.

Architecture                    Average accuracy (%)   Training time (seconds)
LSTM                            94.36                  4344
LSTM with bias to forget gate   95.86                  4469
GRU                             95.69                  3442
pix2code                        90.42                  51404

Table 6.2: Results of running the RNN comparison framework. The accuracy and the training time of each combination of VGG16 and an RNN are shown.

6.3

OCR Pipeline Sweetspot

We measured the time it takes to OCR 4910 images, spread across 38 different document classes. The time it takes to apply OCR to each image is measured, and the total amount of time needed is then divided by the total number of images, giving us the average time needed to apply OCR to an image. The average time needed to apply OCR to an image is 4.9 seconds.

After measuring the average time it takes to apply OCR to an image, we measured the average time it takes to run an image through our filtering pipeline containing the CNN and RNN. The average time it takes to run an image through the filtering pipeline is 1.5 seconds.

When we fill in equations 4.1, 4.2 and 4.3, using the measured times, we are able to plot a graph that shows the point where running the filtering pipeline takes less time than simply applying OCR to a full batch of images. To test this we have used a batch of 100 images. On the y-axis we display the time and on the x-axis we display the number of images that are filtered out. The more images are filtered out, the less time it takes to complete the OCR pipeline.

Figure 6.1: Naive OCR time versus OCR filtered pipeline time according to the number of images filtered out. The blue line represents the time it takes to filter and apply OCR according to the number of filtered images before OCR.
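As an illustration of how the break-even point in figure 6.1 follows from equations 4.1 to 4.3, the sketch below uses the rounded averages reported above (4.9 seconds per OCR'd image, 1.5 seconds per filtered image); the exact, unrounded measurements may shift the result slightly.

```python
# Break-even sketch: compare naive OCR time with filter-then-OCR time for a batch,
# using equations 4.1-4.3 and the rounded average times reported in this chapter.
T_OCR_IMAGE = 4.9      # average seconds to apply OCR to one image (rounded)
T_FILTER_IMAGE = 1.5   # average seconds to run one image through the filter (rounded)
N = 100                # batch size used in the experiment

def naive_ocr_time(n_images):
    return n_images * T_OCR_IMAGE                        # equation 4.1

def filtered_pipeline_time(n_images, n_filtered_out):
    filter_time = n_images * T_FILTER_IMAGE              # equation 4.2
    ocr_time = (n_images - n_filtered_out) * T_OCR_IMAGE
    return filter_time + ocr_time                        # equation 4.3

# Smallest number of filtered-out images for which the filtering pipeline is not slower.
break_even = next(k for k in range(N + 1)
                  if filtered_pipeline_time(N, k) <= naive_ocr_time(N))
print(break_even)  # 31 with these rounded averages
```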


Chapter 7

Discussion

The previous chapter presented the results of the experiments we performed. We further discuss these results in this chapter while reflecting back on our research questions, stated in section 1.1. We begin by discussing the outcome of the first experiment, which is related to RQ1. The results of the second experiment are then discussed; this experiment is related to RQ2. Finally, we discuss the last experiment, which is used to answer RQ3.

7.1

Choosing a CNN

The first experiment we performed was the experiment that helped us decide which CNN to use in our encoder/decoder model. We fed our dataset of annotated images of medical documents to a framework created by [22]. The framework outputs the variables required to calculate the metrics from section 2.5.2. The framework contains 5 CNNs, each of them performing according to the state of the art. Their accuracy was tested on the MNIST dataset (http://yann.lecun.com/exdb/mnist/) and the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html).

The performance of the 5 CNNs can be seen in table 6.1. Each CNN performed according to the state of the art as described in [22]. The CNN that performed best is the VGG16 architecture, closely followed by the VGG19 architecture. The performance of the CNNs is higher than expected, as not all of the 38 classes contain the same number of images to train on, and the quality of the images ranges from poor, with pixel and background noise, to high, where an image is clear. Because of the difference in the number of images per class and the quality of some of the images, we expected the accuracy of the CNNs to be lower than the state of the art. We believe the accuracy is still on par with the state of the art because of the simplicity of the images: they are all black and white images containing rectangular sections and text, with some classes containing a logo or an image of a body part or organ.

7.2

Choosing a RNN

After finding the CNN that performed best in classifying the images of medical documents in our dataset, we performed a second experiment. Now that we have found our encoder in the encoder/decoder model, we need to find the decoder. This is done by comparing the performance of 3 RNNs in decoding the feature vector fed to them by the CNN chosen in our previous experiment. We constructed a framework that is similar to the one used in the first experiment. This framework is used to find the combination that performs best on generating a DSL that describes the layout of a document, which is the goal of the encoder/decoder model. The framework thus contains the CNN chosen in the previous chapter, which extracts features from images. These features are combined with a vectorized sequence of tokens describing the layout of a document. Based on these vectors the RNNs



are trained to output the correct token sequence. All 3 RNNs are trained and evaluated. Based on the evaluation, which rates each RNN on the accuracy of its token prediction, the RNNs are ranked. The results of the experiment are shown in table 6.2. We chose the RNNs based on their performance on XML token prediction as reported in [17]. One of the RNN architectures reported in that paper, called MUT, is not implemented in the Keras framework used to implement the RNN and CNN architectures. The authors of the architecture did not respond to our request for the code of this architecture. For this reason we chose not to implement the MUT architecture in our framework. We have chosen to create this framework so that it can be reused at a later moment, for instance when the layout of documents changes drastically and we want to check whether the chosen combination still has the highest accuracy.

As seen in table 6.2, the encoder/decoder model containing VGG16 and the LSTM with a bias to the forget gate has the highest accuracy when it comes to token prediction based on images of medical documents. The other combinations are not far behind; the standard LSTM has a 1.5% lower accuracy. This is as expected, as the accuracies reported in [17] are also similar. We have chosen to test the difference in results ourselves because, even though the differences are small, we want to be certain that the combination we choose has the highest accuracy, and the test in [17] is only based on the prediction of words, not on the combination of images and tokens.

Apart from the increase in accuracy, our architecture also has a greatly reduced training time. Training the pix2code architecture for 5 epochs using the non-memory-intensive setting, which means that instead of loading all data into memory in one go the data is loaded into memory in batches, takes 15 hours. Training our architecture takes 2 hours. We believe this large difference in training time comes from the difference in the RNN architecture. The pix2code RNN architecture has more layers, which means there are more parameters to train. The pix2code architecture has 109,818,157 trainable parameters, whereas our architecture has 5,994,229 trainable parameters. The pix2code architecture takes as input an image converted to an array; the feature extraction is part of its encoder/decoder architecture. This part is also trained during the training process and accounts for 104,098,080 trainable parameters. Our architecture uses a pre-trained CNN, which eliminates the training of this part of the architecture.

Another difference between our architecture and the pix2code architecture is the layer after the LSTM layer in the encoder. Pix2code uses no such layer; we, however, use a TimeDistributed layer. The TimeDistributed layer applies a Dense layer at every time step (unrolling of the LSTM). The TimeDistributed layer is used to keep the values between time steps separate and is used in many-to-many problems, like ours. A small illustrative sketch is shown below.
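A minimal illustration of this layer is given below; the shapes are assumptions chosen for the example.

```python
# TimeDistributed applies the same Dense layer independently at every LSTM time step,
# so each step gets its own token prediction while sharing a single set of Dense weights.
from tensorflow.keras.layers import Input, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

steps, vocab = 48, 20                                        # illustrative values
hidden_in = Input(shape=(steps, 128))
per_step = LSTM(64, return_sequences=True)(hidden_in)        # one hidden vector per time step
per_step_tokens = TimeDistributed(Dense(vocab, activation='softmax'))(per_step)
Model(hidden_in, per_step_tokens).summary()                  # summary() also lists the trainable parameters
```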

7.3

Running Time of the OCR Pipeline

The goal of the final experiment was to find the number of images that need to be filtered out in order for the OCR pipeline with filtering to take as much time as the naive OCR approach. We need to find out what the impact of the change in the OCR pipeline is in terms of the time it takes to run the pipeline: the more time it takes to run, the more power is consumed, and running the OCR pipeline is a heavy burden on a PC, which makes it almost unusable for other tasks while it is running. In the first part of this experiment we measured the time it takes to run an image through the old OCR pipeline. We then measured the time it takes to run an image through the encoder/decoder model. This gave us the ability to calculate the number of documents that need to be filtered out for the new OCR pipeline with filtering to take as much time as running the naive OCR pipeline. As can be seen in figure 6.1, running a batch of 100 images through OCR with filtering takes almost 2 minutes more than naively applying OCR to the full batch when all images are judged as useful by the filtering process. As more images are filtered out, fewer images are run through OCR and the process takes less time. The more images are filtered out, the closer we come to the point where filtering and applying OCR takes as much time as applying OCR to a full batch.


This point is at 30.5 images, which in practice means 31 images (you cannot filter half an image). This means that when 31% or more of the images are filtered out, the OCR and filtering process takes less time than applying OCR to the full batch. By looking at the composition of batches that are uploaded by specialists, we expect between 25% and 50% of the documents to be filtered out per batch. This means the filtering will reduce the time of the process in most cases.
