
Text Recognition in Printed Historical Documents

Master thesis by

Twan van Laarhoven (twanvl@gmail.com)

Date: September 11, 2010

Advisors: Wim H. Hesselink and Arnold Meijster


Contents

1 Introduction
   1.1 Introduction
   1.2 The OCR pipeline
   1.3 The Emmius OCR System
       1.3.1 A User's Perspective
       1.3.2 Page Annotation Files
       1.3.3 Webserver

2 Preprocessing
   2.1 Conversion to Grayscale
   2.2 Methods for Contrast Normalization
   2.3 Local Contrast Stretching
       2.3.1 Algorithm

3 Finding Rectangles
   3.1 Finding Filled Rectangles
       3.1.1 Maximal Filled Rectangles
       3.1.2 Blockers
   3.2 Finding Border Rectangles
       3.2.1 Scanning Algorithm
       3.2.2 Divide and Conquer Algorithm

4 Lines and Columns
   4.1 Splitting into Columns
       4.1.1 Paths
       4.1.2 Cutting the Image
   4.2 Splitting into Lines
       4.2.1 Straightening
       4.2.2 Approximation Algorithm
       4.2.3 Finding Lines
       4.2.4 Finding Peak Intervals
       4.2.5 Mapping Pixels to Lines

5 Character Recognition
   5.1 Character Segmentation
       5.1.1 Characters Spanning Multiple Components
       5.1.2 Labeled Transition Systems
       5.1.3 With a Dictionary
       5.1.4 Characters of Different Lengths
   5.2 Character Classifier
       5.2.1 Features for Character Recognition
       5.2.2 Nearest Neighbor Classifier
       5.2.3 k-means Classifier
       5.2.4 Taking Distance into Account
       5.2.5 Other Classifiers
       5.2.6 Difficult Cases
   5.3 Words and Spaces
       5.3.1 Features
       5.3.2 Problem Cases
       5.3.3 Classifier
   5.4 Evaluation

6 Conclusions
   6.1 Summary
   6.2 Future Work

Bibliography

A Algorithms
   A.1 Finding Filled Rectangles
   A.2 Finding Border Rectangles
   A.3 Seams for Separating Margins
   A.4 Training Prototype+Distance Model


Chapter 1

Introduction

1.1 Introduction

Libraries and archives house many old books that are of importance to historians and other scholars. However, to search and read these texts, a historian needs to visit a library and carefully handle delicate old books. If these books could be made available online in a searchable form, that would be very useful to scholars and other interested readers.

In this thesis we focus on one book in particular, the “Rerum Frisicarum Historia” by Ubbo Emmius, published in 1616 and hereafter referred to as the Rerum. It is a book about the history of Friesland. Because the author, Ubbo Emmius, was the first Rector Magnificus of the University of Groningen, this book is of special interest to scholars from that university.

The current situation is that books like these are digitized by hand, a labor-intensive and expensive process. A computer program would be much more efficient for this task.

Traditional OCR programs are intended for modern text, and hence they have considerable difficulties with historical works. We have therefore developed our own OCR system that is specifically tuned to this task. In contrast with other OCR systems, our goal is not to achieve perfect automatic recognition, but merely to assist a human with digitizing a text.

The focus of this thesis will be on the algorithms that we have developed for various steps of the character recognition process. Besides these algorithms we have also developed an OCR system that is useful to historians. This system is detailed in the next section.

1.2 The OCR pipeline

Our OCR program is structured as a pipeline. We start with a photograph or scan of a book page, and output a plain text representation of that page. The pipeline has the following steps:

0. A photograph or scan is made of a book page.

1. The input image is converted to grayscale and processed to improve contrast. (Chapter 2)

2. The area containing the body text is selected; the rest of the image is discarded. (Chapters 3 and 4)

3. The body text is split into lines. (Chapter 4)

4. Each line is split into ‘components’, which form characters. (Chapter 5)


5. The text on each line is recognized. (Chapter 5)

Figure 1.1 shows how a typical page is processed by this pipeline. The rest of this report will focus on each of these steps in turn.

(a) Input image (b) After preprocessing (c) Area containing text (d) Split into lines (e) Split into components (f) Recognized text

Figure 1.1: Steps in the OCR pipeline.

1.3 The Emmius OCR System

The main goal of this project is to digitize books, so besides good algorithms the end result should be a usable OCR program.

In this section we describe our OCR program, which we have dubbed “Emmius OCR”. The Emmius OCR system consists of four parts:

• A library of image processing and classification algorithms. It includes the algorithms described in this report, among others.

• A suite of command line programs for invoking the OCR pipeline.

• A graphical user interface (GUI) for annotating pages.

• A webserver that manages annotated pages and source images.

One of the design goals has been to make these components usable independently.


1.3.1 A User’s Perspective

A typical workflow is that a user gets an image, either from the webserver or from her own scan.

Then she opens this image in the graphical user interface. The GUI uses the OCR library with a pre-trained classifier to recognize the text on the page, which will in practice not be perfect.

The user then corrects the errors made by the OCR system, and marks all characters that are recognized correctly. These annotations can then be stored in a “page annotation file”.

One of our design philosophies is that the user is always right. This means that she should be able to correct the system at all points. For example, if the system marks the wrong region of the image as ‘body text’ the user can select a different region. In the same way the user also has the final say on character segmentation and the recognized text.

1.3.2 Page Annotation Files

Emmius OCR uses page annotation files to store both the annotations entered by a user as well as those inferred by the system. For each character we store a boolean flag indicating whether the user has accepted that character. The GUI provides a button with which the user can accept the labels of characters. Once all characters and spaces in a text are accepted we consider the entire page to be accepted.

The page annotation file needs to include an identifier of the page, but the page is given as just an image file. We therefore use a SHA1 hash of the source image as the image identity. The GUI automates the link between annotation and image file, making the process invisible to the user: When an image file is opened the program searches for a corresponding annotation file and opens it automatically; conversely when an annotation file is opened the program searches for the corresponding image file.
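As an illustration, such an identifier can be computed in a few lines of Python using the standard hashlib module; the function name below is only illustrative and not part of Emmius OCR:

    import hashlib

    def image_identifier(path):
        # Hash the raw bytes of the image file. Any change to the file
        # (recompression, cropping) yields a different identifier.
        sha1 = hashlib.sha1()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(65536), b''):
                sha1.update(block)
        return sha1.hexdigest()

The resulting hex digest can then be recorded in the annotation file and compared against the hash of any image the user opens.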

Another important point is how the annotations are stored. Consider what happens when one of the early steps in the OCR pipeline is changed. Then the exact location of the lines and characters that are recognized can also change. This can happen while the system is still being developed, but also as the result of parameters that could be changed by the user. We don’t want any annotations made by the user to be lost in these cases. When the user is about to make such a dangerous parameter change, we first store the page annotation file, and reload it after the change.

It is therefore important that the page annotation file is robust to changes in the exact locations of characters. For each line or character that is recognized we therefore store the coordinates of its bounding box in the source image. When a page annotation file is opened we need to match the characters in the file with those recognized from the image. We define the match between two character bounding boxes A and B to be the fraction which they overlap,

match(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|).

Finding these best matches using a linear search is quadratic in the total number of characters/components, but it is fast enough in practice. It would be more efficient to use a space partitioning data structure to store the character bounding boxes, but for simplicity we have not done so.

When the segmentation of components has changed significantly, there may not be a one-to-one mapping between the two sets of characters. For example, where we recognize two characters in the image there may be only one in the annotation file, or vice versa. In these cases we do our best to preserve the annotations from the page annotation file, and emit a warning message if that fails.
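For illustration, the overlap measure and the linear search can be written down directly; this is a sketch in Python, assuming bounding boxes are given as (x1, y1, x2, y2) tuples with inclusive pixel coordinates:

    def match(a, b):
        # a, b: bounding boxes (x1, y1, x2, y2), inclusive pixel coordinates.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1 + 1) * max(0, iy2 - iy1 + 1)
        area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
        area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
        return inter / (area_a + area_b - inter)   # |A ∩ B| / |A ∪ B|

    def best_match(box, candidates):
        # Linear search over all candidate boxes; quadratic in the number of
        # characters overall, but fast enough in practice.
        return max(candidates, key=lambda c: match(box, c))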


1.3.3 Webserver

To make the recognized and annotated text available to interested users, we have also developed an accompanying website. The webserver maintains a set of pages, and for each page it stores its image and the latest annotation file (if any).

Using the command line version of the recognition program, the webserver extracts the text from these annotation files. This text is shown on the website, and can also be searched. Whenever a new annotation file is uploaded the page’s text is updated.

In a nightly batch job the webserver updates the recognized text of all pages in two steps:

1. First a new recognizer is trained using all annotated pages.

2. Then all pages are passed through the OCR system again, to improve the recognized text.

If there are annotations for a page, then only the unannotated characters are changed.

The GUI is linked to the webserver through the open and save commands. When an image or annotation file is opened, the GUI requests the latest annotation for that page from the webserver, and when an annotation file is saved it is also uploaded to the webserver. The link to the webserver is not mandatory, however, which makes it possible to work offline.


Chapter 2

Preprocessing

The contrast and color in photographs of book pages can vary, see figure 2.1(a). The first step of our processing pipeline is to normalize the images so that the background is always ‘white’ (value 0) and the ink is always ‘black’ (value 255)1. Then later steps in the processing will not have to worry about the contrast. In this chapter we describe how we do this normalization.

2.1 Conversion to Grayscale

The source photographs are color images, so they should first be converted to grayscale images.

We use the standard formula for luminance Y = 0.2126R + 0.7152G + 0.0722B. Because the image is a photograph of black and white text, the exact factors do not matter. Note that it might be possible to improve upon this formula. For example colored spots are less likely to be text than are black spots, so they could be assigned a higher grayscale value. We have not investigated such improvements.
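In terms of an H × W × 3 array of R, G, B values, this conversion is a single weighted sum; a minimal sketch assuming numpy:

    import numpy as np

    def to_grayscale(rgb):
        # rgb: H x W x 3 array; returns the H x W luminance image
        # Y = 0.2126 R + 0.7152 G + 0.0722 B.
        return rgb @ np.array([0.2126, 0.7152, 0.0722])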

2.2 Methods for Contrast Normalization

The simplest method of contrast normalization given a grayscale image would be to use a threshold.

Pixel values larger than or equal to the threshold value are considered background, while values lower are foreground. Using a fixed threshold is not an option, because the contrast varies across images. Otsu’s method [Otsu, 1979] is a popular way of automatically picking a threshold. This method chooses the threshold value that maximizes the separation between black and white. An important disadvantage of Otsu’s method is that it picks a single threshold for the entire image.

If different parts of the image are lit differently then this will not give good results, see figure 2.1(b).

More advanced thresholding methods exist, such as Sauvola’s method [Sauvola & Pietikäinen, 2000], which picks a threshold based on the mean and variance in a small window. The threshold value is µ + kσ, where µ is the mean value of the window, σ is the standard deviation in the window, and k is a constant. Unfortunately this method produces bad results for historical documents, because the paper is very rough and uneven. In the literature on character recognition different values of the parameter k are recommended. Sauvola recommends k = 0.5 [Sauvola & Pietikäinen, 2000], while a later review paper recommends k = 0.34 [Badekas & Papamarkos, 2005]. In short,

1 Using the value 0 for white is opposite to the normal convention, but this choice simplifies the rest of the pipeline by allowing us to assume that all pixels outside the image have value 0 and are part of the background.


k should be high for rough paper and low for smooth paper, but no single value of k suffices in all cases, see figure 2.1(c). We believe the main reason for the sensitivity to the parameter value is that the mean and standard deviation are not robust to outliers, which are caused by irregularities in the paper.

Instead of using a threshold directly on the input image we will first improve the contrast, so that the background always becomes white and the foreground always becomes black. Then, any thresholding method can be applied to give a binary image. We choose not to always apply a threshold, because gray pixel values can give useful information for character recognition. Some of the later steps in the processing do require a binary (i.e. two-color) image, and in that case we use a fixed threshold value of 100. After contrast stretching the exact value of this threshold is no longer important, because ink is always black and the background is always white.

(a) Input images (b) Otsu’s method (c) Sauvola’s method (d) Our contrast stretching method

Figure 2.1: An example of contrast stretching and thresholding. Otsu’s method sometimes picks a threshold that is too low, resulting in too many ink pixels. Sauvola’s method sometimes picks a threshold that is too high, resulting in too many background pixels.

2.3 Local Contrast Stretching

Our approach to contrast stretching is based on estimating the local foreground and background color in small parts of the image, called windows. In each of these windows we determine the minimum and maximum pixel values. If the window contains only foreground or background pixels, then the difference between the minimum and maximum will be small. On the other hand, if the window contains both foreground and background pixels, then the difference will be larger.

We keep only those windows with a large difference. Assuming the paper is white (high) and ink is black (low), the maximum pixel value in a window is an estimate of the background color and the minimum is an estimate of the foreground color. Then we propagate the estimated colors from the retained windows to neighboring discarded windows. The new ‘contrast stretched’ value s(x) of a pixel x is then

s(x) = clip( (im(x) − bg(x)) / (fg(x) − bg(x)) ),

where fg and bg are the local fore- and background color estimates, and the function clip rounds its argument to the nearest value in the target range, in our implementation integers in the range 0..255. Figure 2.1(d) shows what a typical page looks like after contrast stretching.
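Per pixel the stretching is a linear rescaling followed by clipping. A sketch assuming numpy, with fg and bg already expanded to per-pixel estimates; scaling the ratio to the 0..255 target range is an assumption about where that scaling happens:

    import numpy as np

    def stretch(im, fg, bg):
        # im, fg, bg: float arrays of the same shape.
        # Background maps to 0, foreground (ink) to 255; everything else is clipped.
        s = (im - bg) / (fg - bg + 1e-6) * 255.0   # small epsilon avoids division by zero
        return np.clip(np.rint(s), 0, 255).astype(np.uint8)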


(a) original image (b) another image (c) contrast stretched, fg = min, bg = max

(d) contrast stretched, fg = median, bg = max

(e) contrast stretched, fg = min, bg = median

Figure 2.2: An example of contrast stretching.

2.3.1 Algorithm

The previous paragraph gives the basic idea, but several improvements and clarifications must be made. First of all the image needs to be divided into windows in some way. We choose to use rectangular windows of 8 by 8 pixels, with adjacent windows overlapping by 4 pixels so that each pixel belongs to 4 windows. The size of the windows is not very important for the results, but using more and larger windows is slower.

Secondly, instead of using the maximum value in a window as an estimate of the background color we use the median. Since most of the pixels in the image are background pixels, the median will most often be the background color. Compared to the maximum value the median is less sensitive to outliers. In this way we eliminate more of the background noise, as illustrated in figure 2.2.

Incidentally, there is another way in which we can use the window medians. Previously we had assumed that the foreground color corresponds to low pixel values and the background to high values. But now that we use the median as an estimate of the background color this assumption is no longer necessary. Since there are many more background pixels than foreground pixels, the median will be closer to the background color than to the foreground color in most windows. So if on average the median is closer to the minimum then we know that the maximum value can be used as an estimate of the foreground color, and vice-versa.

Next, to determine which windows to keep and which to remove we use a threshold t: we keep a window w if |fg(w) − bg(w)| > t. The threshold is

t = mean(|fg − bg|) + τ · stddev(|fg − bg|),

where we take the mean and standard deviation over all windows. The higher the value of τ, the fewer windows will be kept. We use τ = 1.

Another improvement is to blur the estimates of the foreground and background colors, to give a smoother result. In the extreme case we could use an infinite blur radius, so all the estimates are the same for the entire image. That might be appropriate if the contrast does not vary over the image. We make the blur radius σ a parameter of the algorithm. A value of the order of 0.01 times the width of the image gives good results for the Rerum images.

Blurring the foreground and background estimates also gives a way of determining values for the removed windows. We assign a weight to the estimate in each window, removed windows get weight 0. Then we blur the weighted values, which ensures that all windows get a value. Instead of making a binary decision whether to keep each window, the algorithm becomes slightly more


Listing 1 Contrast stretching
Require: gray value image im
  // Calculate window minimum, median and maximum
  for each window w do
    lo[w] ← min{im[x] | x ∈ w}
    med[w] ← median{im[x] | x ∈ w}
    hi[w] ← max{im[x] | x ∈ w}
  end for
  // Is the foreground black (lo) or white (hi)?
  if mean(med − lo) > mean(hi − med) then
    fg ← lo, bg ← med    // this is assignment of entire arrays
  else
    fg ← hi, bg ← med
  end if
  // Calculate window weights
  t ← mean(|fg − bg|) + τ · stddev(|fg − bg|)
  for each window w do
    if |fg[w] − bg[w]| > t then
      weight[w] ← |fg[w] − bg[w]| − t
    else
      weight[w] ← 0
    end if
  end for
  // Calculate final foreground and background estimates
  fg ← WeightedBlur(weight, fg, σ)
  bg ← WeightedBlur(weight, bg, σ)
  // Stretch image values
  for each pixel x do
    w ← window containing x
    s[x] ← clip((im[x] − bg[w]) / (fg[w] − bg[w]))
  end for

robust if we vary the weights. Thus we assign to each window w the weight |fg(w) − bg(w)| − t if |fg(w) − bg(w)| > t, and 0 otherwise.

To perform the blurring, let value, weight : X → R be an image of values and an image of weights respectively. Then define

WeightedBlur(weight, value, σ) = GaussianBlur(weight · value, σ) / GaussianBlur(weight, σ),

where multiplication and division are done pointwise. Hence

WeightedBlur(weight, value, σ)(x) = ( Σ_y weight(y) · value(y) · e^(−‖x−y‖² / (2σ²)) ) / ( Σ_y weight(y) · e^(−‖x−y‖² / (2σ²)) ).

This weighted blur works like a regular Gaussian blur, by convolving the image with a Gaussian kernel. But instead of using only the Gaussian kernel, the weights are also taken into account. Note that if there are any non-zero weights, then after a weighted blur all pixels will get a meaningful value.
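The definition above translates directly into two ordinary Gaussian blurs; a sketch using scipy.ndimage (the epsilon guard is an addition for regions where all weights are zero):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def weighted_blur(weight, value, sigma):
        # WeightedBlur(weight, value, sigma)
        #   = GaussianBlur(weight * value, sigma) / GaussianBlur(weight, sigma)
        num = gaussian_filter(weight * value, sigma)
        den = gaussian_filter(weight, sigma)
        return num / np.maximum(den, 1e-12)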

By far the most expensive step in the algorithm is computing the median value of each window.

Because the input values are bytes we could use a 256 bin histogram. To speed up the search through this histogram we use a multilevel histogram [Alparone et al., 1994]. This means that we keep two histograms: a coarse histogram with 16 bins, as well as a fine scale histogram with


256 bins. To insert or remove a value, just one bin has to be updated in each histogram. To look up the median value (or another percentile), first the coarse histogram is searched, and then only 16 bins in the fine histogram have to be searched, namely those that correspond to the coarse bin containing the median.
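A sketch of such a two-level histogram for byte values (16 coarse bins of 16 values each); inserting or removing a value touches one bin per level, and a percentile query inspects at most 16 coarse and 16 fine bins:

    class TwoLevelHistogram:
        # Running median (or other percentile) over a sliding window of byte values.
        def __init__(self):
            self.coarse = [0] * 16
            self.fine = [0] * 256
            self.count = 0

        def insert(self, v):
            self.coarse[v >> 4] += 1
            self.fine[v] += 1
            self.count += 1

        def remove(self, v):
            self.coarse[v >> 4] -= 1
            self.fine[v] -= 1
            self.count -= 1

        def percentile(self, p=0.5):
            # Smallest value whose cumulative count reaches the requested rank.
            target = int(p * (self.count - 1)) + 1
            c = 0
            for cb in range(16):                              # coarse search
                if c + self.coarse[cb] >= target:
                    for v in range(cb << 4, (cb << 4) + 16):  # fine search, 16 bins
                        c += self.fine[v]
                        if c >= target:
                            return v
                c += self.coarse[cb]
            raise ValueError("empty histogram")

When the window slides, the values leaving the window are removed and the values entering it are inserted before the next percentile query.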

The algorithm is given in pseudo code in listing 1. There are four parameters in the algorithm:

• The size of windows (implicit in listing 1).

• The overlap between windows, or equivalently the number of windows.

• The threshold τ for keeping windows.

• The radius σ of the blur of estimated fore- and background color.

None of the parameter values is critical.


Chapter 3

Finding Rectangles

In this chapter we develop algorithms for finding axis aligned rectangles in binary images. In particular, we are interested in a rectangle that is the largest in some sense. These largest rectangles will help in separating the area of the page that contains text from the rest of the image. This selection is described in detail in the next chapter, but first we develop the underlying algorithms.

In this chapter we work with binary images, which in our pipeline are created by applying a threshold after contrast stretching (see chapter 2). An N by M binary image A can be represented as a set of points, A ⊆ {(x, y) | 0 ≤ x < N, 0 ≤ y < M }.

3.1 Finding Filled Rectangles

First we focus on filled rectangles. A filled rectangle is a set of points R(x1, x2, y1, y2) = {(x, y) | x1 ≤ x ≤ x2, y1 ≤ y ≤ y2}.

On rectangles we can define an increasing function of width (x2 − x1) and height (y2 − y1), for example the area Φ(R(x1, x2, y1, y2)) = (x2 − x1)(y2 − y1). We are then looking for a filled rectangle contained in the image that maximizes Φ. This could be the rectangle with the largest area, but we could also use any other increasing function of the width and height of rectangles. For example in [Breuel, 2002] tall skinny rectangles of the background color are used to identify columns in scanned text.

3.1.1 Maximal Filled Rectangles

We say that a filled rectangle B ⊆ A is maximal if it is not itself contained in a larger filled rectangle in A. Since Φ is an increasing function of width and height, and a filled rectangle can only be contained in one of a larger size, any rectangle that maximizes Φ must be maximal. Therefore, one way of finding the largest filled rectangle is by enumerating all maximal filled rectangles contained in A.

Any non-maximal filled rectangle can be made into a maximal one by extending it as far as possible in all directions. In general there is no unique way to do this extending. Therefore we use the following recipe to map each filled rectangle to a unique maximal one:

1. First extend the rectangle upward as far as possible while remaining in A.

2. Then extend the rectangle to the left and right as far as possible.


(a) (b) (c) (d)

Figure 3.1: Constructing a maximal filled rectangle from a point. In this example A consists of white pixels. (a) Starting from a single point. (b) The rectangle is extended upward as far as possible. (c) The rectangle is extended to the left and right. (d) The result is a maximal filled rectangle.

3. Then extend the rectangle downward as far as possible.

This is illustrated in figure 3.1. Note also that each point in A is itself a small filled rectangle.

Theorem 1 (maximal filled rectangles). All maximal filled rectangles in an image A can be constructed by extending a single point {(x, y)} ⊆ A using the above recipe.

Proof. For each maximal filled rectangle R(x1, x2, y1, y2) ⊆ A there is an x with x1 ≤ x ≤ x2 such that (x, y1 − 1) ∉ A, because otherwise the rectangle would not be maximal.

Now start with the rectangle {(x, y2)} = R(x, x, y2, y2) ⊆ A. In step 1 it is extended upwards to R(x, x, y1, y2). Then in step 2 it is extended to R(x1, x2, y1, y2), and the last step leaves the rectangle unchanged.

The above recipe can be used directly as an algorithm, by extending all points in the image. As theorem 1 shows we will find all maximal filled rectangles in this way. However, the same rectangle might be found multiple times. For example the rectangle in figure 3.1(d) will also be associated with the point left of the one marked in figure 3.1(a). Since we are only interested in the largest rectangle such duplicates are not a problem.

Note that for the purpose of finding the largest filled rectangle we can skip the last step. Then we will also find rectangles that can still be extended downward, and hence are not maximal. But the largest rectangle will still be found.

A naive implementation of step 1 requires O(M) time, while step 2 requires O(NM) time. Because these steps are repeated for all points in the image, the total runtime is O((NM)²).

3.1.2 Blockers

For a more efficient algorithm we need to be able to determine how far a rectangle can be extended in constant time. Define the left blocker, lb(x, y) of a point (x, y) as the x-coordinate of the first point left of (x, y) that is not in A. A filled rectangle in A containing (x, y) must have left


coordinate greater than lb(x, y). The right, top and bottom blockers are defined similarly,

lb(x, y) = max{ x′ | x′ ≤ x, (x′, y) ∉ A }
rb(x, y) = min{ x′ | x ≤ x′, (x′, y) ∉ A }
tb(x, y) = max{ y′ | y′ ≤ y, (x, y′) ∉ A }
bb(x, y) = min{ y′ | y ≤ y′, (x, y′) ∉ A }.

Each of these blockers can be calculated in a single pass over the image.

By using blockers it is possible to implement step 1 in constant time, by extending the rectangle up to tb(x, y) + 1. In step 2 the rectangle needs to be extended to the left (and right). We can do that by looking for the rightmost left blocker (and the leftmost right blocker) for the rows between tb(x, y) and y,

lbext(x, y) = max{ lb(x, y′) | y′ ∈ N, tb(x, y) < y′ ≤ y }, and
rbext(x, y) = min{ rb(x, y′) | y′ ∈ N, tb(x, y) < y′ ≤ y }.

The algorithm now amounts to reporting the rectangle R(lbext(x, y) + 1, rbext(x, y) − 1, tb(x, y) + 1, y) for each point (x, y) ∈ A. If lbext and rbext are calculated using a scan over all y′ values then the total runtime is O(NM²).

It is also possible to calculate lbext and rbext incrementally. First note that if (x, y) ∉ A then tb(x, y) = y and hence lbext(x, y) = max ∅ = −∞ and rbext(x, y) = ∞. Otherwise tb(x, y) = tb(x, y − 1), in which case lbext(x, y − 1) can be used to calculate lbext(x, y). So

lbext(x, y) = max(lbext(x, y − 1), lb(x, y))   if (x, y) ∈ A
lbext(x, y) = −∞                               otherwise.

And similarly for rbext.

The incremental calculation of lbext and rbext needs only a constant amount of time per pixel, so the total runtime of the algorithm is O(NM). The entire algorithm is given in detail in appendix A.1.

The output of the algorithm is a set that includes all maximal rectangles in an image A, as well as some non-maximal ones. To find the largest rectangle with respect to some function Φ, we simply compare the value of Φ for all the rectangles in this set. Note that each rectangle that is inspected in this search was first inserted into the set, so the time complexity for finding the largest rectangle is no worse than that of constructing the set of all maximal rectangles.
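A sketch of the whole O(NM) procedure in Python (numpy assumed; A is a boolean array indexed as A[y, x], and the scoring function phi is a parameter, as in the text). It performs steps 1 and 2 for every pixel and keeps the rectangle maximizing phi:

    import numpy as np

    def largest_filled_rectangle(A, phi=lambda w, h: w * h):
        # A: 2D boolean array, A[y, x] == True means pixel (x, y) is set.
        # Returns (x1, x2, y1, y2) of the rectangle maximizing phi(width, height).
        M, N = A.shape

        # Left and right blockers per row: nearest unset pixel to the left / right.
        lb = np.empty((M, N), dtype=int)
        rb = np.empty((M, N), dtype=int)
        for y in range(M):
            last = -1
            for x in range(N):
                if not A[y, x]:
                    last = x
                lb[y, x] = last
            last = N
            for x in range(N - 1, -1, -1):
                if not A[y, x]:
                    last = x
                rb[y, x] = last

        # Top blockers, computed incrementally row by row.
        tb = np.empty((M, N), dtype=int)
        tb[0] = np.where(A[0], -1, 0)
        for y in range(1, M):
            tb[y] = np.where(A[y], tb[y - 1], y)

        best, best_rect = -1, None
        lbe = np.full(N, -1, dtype=int)   # lbext of the previous row (-inf encoded as -1)
        rbe = np.full(N, N, dtype=int)    # rbext of the previous row (+inf encoded as N)
        for y in range(M):
            for x in range(N):
                if A[y, x]:
                    lbe[x] = max(lbe[x], lb[y, x])
                    rbe[x] = min(rbe[x], rb[y, x])
                    x1, x2 = lbe[x] + 1, rbe[x] - 1
                    y1 = tb[y, x] + 1
                    score = phi(x2 - x1 + 1, y - y1 + 1)
                    if score > best:
                        best, best_rect = score, (x1, x2, y1, y)
                else:
                    lbe[x], rbe[x] = -1, N   # reset: no filled rectangle through (x, y)
        return best_rect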

3.2 Finding Border Rectangles

Instead of looking for filled rectangles we now consider border rectangles, rectangles where all pixels on the border are set. Formally a border rectangle is a set

B(x1, x2, y1, y2) = {(x, y1) | x1 ≤ x ≤ x2} ∪ {(x, y2) | x1 ≤ x ≤ x2} ∪ {(x1, y) | y1 ≤ y ≤ y2} ∪ {(x2, y) | y1 ≤ y ≤ y2}.

A possible application of border rectangles is finding the area of the page that contains text. The raw photographs we work with are larger than a single page, so they can also contain parts of the cover and the table on which the book lies. Some of these areas are dark, and we must be careful not to confuse them with text. One way to do that is to remove these areas before further processing the page. We have found that the largest border rectangle of the background color in an image corresponds to the page area, and hence finding this largest rectangle gives a good way of removing the irrelevant parts of the image. An example of this idea is shown in figure 3.2.

We will abuse terminology somewhat by saying that a border rectangle a contains a border rectangle b if b is contained in a’s border or in its interior, so B(x1, x2, y1, y2) contains B(x1′, x2′, y1′, y2′) iff R(x1′, x2′, y1′, y2′) ⊆ R(x1, x2, y1, y2).


(a) (b)

Figure 3.2: A possible application of border rectangles is finding the page area of a book. (a) A binarized photograph of a book page. (b) The largest border rectangle of background pixels.

As in the previous section the goal is to find rectangles contained in an image A that are maximal with regard to an increasing function of the rectangle’s width and height. Also as in the previous section, we do this by enumerating all maximal border rectangles. However, the algorithm for finding filled rectangles can not be adapted for this case, because border rectangles in an image can not always be made by extending smaller rectangles.

3.2.1 Scanning Algorithm

Using the blockers defined in the previous section it is easy to test whether a given border rectangle is contained in A. A naive generate-and-test algorithm for finding the largest border rectangle would therefore require O(N²M²) time.

We now first present an algorithm that takes O(NM²) time. The core of this algorithm is finding all maximal border rectangles with given top and bottom coordinates y1 and y2. We can do this using a horizontal scan over the image, which takes O(N) time. This scan is repeated for all O(M²) possible (y1, y2) pairs.

At each position x in the horizontal scan there are three possible situations:

• The rectangle’s left or right edge can be at x. This is the case if {(x, y) | y1 ≤ y ≤ y2} ⊆ A, or using blockers, if tb(x, y2) < y1.

• The rectangle can pass through x if (x, y1) ∈ A and (x, y2) ∈ A.

• No rectangle with top y1 and bottom y2 can pass through x.

All we need to do now is keep track of the leftmost edge of possible rectangles, and store rectangles when reaching a possible right edge. The full algorithm is given in listing 2. As in the previous section, the output is stored in a set V, which contains all maximal border rectangles, and possibly some non-maximal ones. Finding the largest rectangle can be done with a search afterwards.

An important optimization is possible when we are only interested in the largest border rectangle.

In that case we should examine large rectangles first, by making the outer loop run over the rectangle’s height h = y2 − y1 in decreasing order. When a rectangle of size N × h would be smaller than the largest rectangle found so far, we can exit the outer loop.


Listing 2 Find maximal border rectangles in image A
Ensure: V ⊆ {r | r is a border rectangle in A}
Ensure: V ⊇ {r | r is a maximal border rectangle in A}
  for y1 ← 0 to M − 1 do
    for y2 ← y1 to M − 1 do
      x1 ← ∞
      for x ← 0 to N − 1 do
        if (x, y1) ∉ A or (x, y2) ∉ A then
          x1 ← ∞    // no rectangle here
        else if tb(x, y2) < y1 then
          x1 ← min(x1, x)
          V ← V ∪ {B(x1, x, y1, y2)}
        end if
      end for
    end for
  end for

3.2.2 Divide and Conquer Algorithm

A more efficient algorithm is possible that uses a divide and conquer approach. The idea is to only find the border rectangles that intersect a vertical line through the middle of the image, x = xmid. Then the algorithm is applied recursively to the left and right halves of the image to also find rectangles not intersecting that line. To ensure that both N and M decrease in the recursion we should transpose the image halves.

Of course when searching for the largest rectangle the search can be stopped if the sub images can not possibly contain a larger rectangle. It turns out that in practice a recursive search is often not necessary at all.

In high-level pseudocode the algorithm is:

  if image is empty then stop
  xmid ← ⌊N/2⌋
  find border rectangles B(x1, x2, y1, y2) where x1 ≤ xmid and x2 ≥ xmid
  recursively find rectangles in {(y, x) | (x, y) ∈ A ∧ x < xmid}
  recursively find rectangles in {(y, x) | (x, y) ∈ A ∧ x > xmid}

To find border rectangles intersecting x = xmid, we must, for each pair of y-coordinates y1 ≤ y2 < M:

1. Find the smallest x1 such that x1 can be the left side of a rectangle with these y-coordinates that intersects xmid. That is, x1 > lb(xmid, y1), x1 > lb(xmid, y2) and bb(x1, y1) > y2.

2. Find the largest x2 such that x2 can be the right side of a rectangle with these y-coordinates that intersects xmid. That is, x2 < rb(xmid, y1), x2 < rb(xmid, y2) and bb(x2, y1) > y2.

In contrast with the algorithm for finding filled rectangles, it is in general not possible to efficiently find the left and right sides given the best sides for adjacent y-coordinates. Instead, we maintain two data structures with all possible left and right sides,

lsides(y1, y2, x) = min{ x′ | x′ ≥ x, bb(x′, y1) > y2 }
rsides(y1, y2, x) = max{ x′ | x′ ≤ x, bb(x′, y1) > y2 }.

lsides(y1, y2, x) gives the best possible left side to the right of x, while rsides gives the best right


(a) (b) (c) (d) (e) (f)

Figure 3.3: Finding border rectangles. In this example A consists of white pixels. (a) The largest border rectangle, column xmid = 3 is indicated. (b) Possible left edges in the left half of the image. At the top lsides(2, 2, −) is indicated as a pointer structure. At the bottom the left half of the resulting rectangle. (c) lsides(2, 3, −), column 3 is no longer a valid side, so lsides(2, 3, 3) = 4. There is no rectangle possible here. (d) lsides(2, 4, −), column 0 is no longer a valid side. (e) lsides(2, 6, −). (f) lsides(2, 7, −), column 1 is no longer a valid side, so lsides(2, 7, 0) = lsides(2, 7, 1) = 2.

side. So now

x1 = lsides(y1, y2, max(lb(xmid, y1), lb(xmid, y2)) + 1), and
x2 = rsides(y1, y2, min(rb(xmid, y1), rb(xmid, y2)) − 1).

For each pair y1, y2 we can report a border rectangle if x1 ≤ x2.

From lsides(y1, y2, x) we can efficiently determine lsides(y1, y2 + 1, x). Note that by definition bb(lsides(y1, y2, x), y1) > y2, which leaves two cases when moving to y2 + 1:

lsides(y1, y2 + 1, x) = lsides(y1, y2, x)             if bb(lsides(y1, y2, x), y1) > y2 + 1
lsides(y1, y2 + 1, x) = lsides(y1, y2 + 1, x + 1)     if bb(lsides(y1, y2, x), y1) = y2 + 1.

rsides can be updated analogously. For a fixed value of y1 the algorithm will loop over the values of y2; figure 3.3 shows an example.

The update of lsides described above still affects O(N) values in each step, one for each x-coordinate. We can improve this by computing lsides lazily, and only updating values that change.

When lsides(y1, y2, x) = x we store the value x; otherwise lsides(y1, y2, x) = lsides(y1, y2, x′) for some x′ > x, and we store x′. This is the same data structure as is used for storing disjoint sets. In effect lsides partitions the set of x-coordinates into disjoint sets, where lsides(y1, y2, x) = x′ for different values of x′. Using path compression and union by rank, lookups and updates of lsides can be done in O(α(N)) time, where α is the inverse Ackermann function [Tarjan, 1975].

To efficiently determine which elements of lsides and rsides need to be updated we need a mapping from y2 to the set of x-coordinates with bb(x, y1) = y2. One way to implement this mapping is to maintain a list of x-coordinates sorted by bb(x, y1). Such a sorted list can be constructed efficiently using bucket sort. The relevant x-coordinates will be at the front of this list; when increasing y2, values are removed from the front of the list while bb(xfront, y1) ≤ y2.
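The lazily updated lsides can be sketched as a "next valid position" array with path compression (find and invalidate only; union by rank is left out for brevity, so this sketch gives the simpler logarithmic amortized variant rather than O(α(N))):

    class NextValid:
        # next[x] points towards the smallest x' >= x that is still a valid
        # left side for the current (y1, y2); index n is a sentinel meaning "none".
        def __init__(self, n):
            self.next = list(range(n + 1))

        def find(self, x):
            root = x
            while self.next[root] != root:
                root = self.next[root]
            while self.next[x] != root:        # path compression
                self.next[x], x = root, self.next[x]
            return root

        def invalidate(self, x):
            # Called when bb(x, y1) becomes <= y2, i.e. x stops being a valid side.
            self.next[x] = x + 1

lsides(y1, y2, x) then corresponds to find(x); the symmetric rsides structure keeps, for each x, the largest valid position to its left.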

The body of the algorithm loops over all M values of y1. For each of these there will be O(N) updates to lsides and rsides, amortized over the O(M) possible values of y2. The runtime of the ‘find rectangles intersecting xmid’ step is therefore O(M² + MNα(N)) ≤ O(MNα(N)). In the recursive step MN is cut in half and the image is transposed, so the recursion depth is O(log(MN)), while the cost at each level is O(MN max(α(M), α(N))) ≤ O(MNα(M + N)).

Therefore the total runtime of the algorithm is O(MN log(MN) α(M + N)).


The algorithm is given in full detail in appendix A.2.


Chapter 4

Lines and Columns

4.1 Splitting into Columns

Figure 4.1: Several pages with text in the margins.

The Rerum has comments in the margins, see figure 4.1. These margins are not part of the body text, and we have therefore chosen to separate them from the body text. The margins also use a different, smaller font, which means that the line positions and line height in the body text do not correspond to those of the margins. So, if the margins were not separated, breaking the text into lines would become more difficult.

The space between the margin and the body text can be smaller than the space between words. So distance alone is not enough to separate the margins from the body. In other OCR systems, large spaces are used for finding margins and columns [Breuel, 2002]. In particular, the assumption is made that columns can be separated by a (possibly rotated) rectangle of white space. As figure 4.2 illustrates, this method can not be used here.

As input we are given a preprocessed N by M grayscale image f . Such an image can be represented as a function f : N × N → R. We assume that f (x, y) = 0 unless 0 ≤ x < N and 0 ≤ y < M .


Figure 4.2: The input image, downsampled vertically. An angle of 45° from vertical in this image corresponds to an angle of 5.7° in the original image. This is sufficient to separate the margin text (on the right) from the body text. Note that the body text and margin can not be separated by a rectangle.

4.1.1 Paths

The approach we take is to find paths from top to bottom that run through the background pixels. One of these paths should separate the margin from the body. The idea of these paths is inspired by seam carving [Avidan & Shamir, 2007].

To get paths that separate the body text from margins we want paths that

1. Run from top to bottom without backtracking, i.e. paths that are ascending in y-coordinate.

2. Pass through background pixels as much as possible, instead of paths running through the body text.

3. Are as vertical as possible, instead of more complicated paths that pass around the text altogether.

A path p is a function that assigns an x-coordinate to each y-coordinate. We look for paths that minimize the cost

cost(p) = Σ_{y=0..M−1} f(p(y), y) + Σ_{y=1..M−1} D(|p(y) − p(y − 1)|),

where D is a monotonic function. This cost has two terms. The first term penalizes paths that pass through ink pixels. The second term penalizes paths that do not run vertically, weighted by the increasing function D.

Consider an image with a large amount of background above and below the text. Without the second penalty term all paths could first move to a specific x-coordinate where the cost is low, so all paths would be essentially the same.

For the cost of diagonal paths we use the function D(0) = 0, D(1) = d and D(x) = ∞ for larger values of x. The parameter d controls the relative weight of the through-ink and the non-vertical costs. This penalty function disallows paths with |p(y) − p(y − 1)| > 1, that is, paths with an angle of more than 45° from vertical. But an angle of 45° is still too much for our purposes, since such diagonal paths can cut corners from a block of text. To enforce a smaller angle we use a simple trick: we downsample the image vertically by a factor of 10. A path at an angle of 45° in the downsampled image corresponds to a path at an angle of 5.7° in the original image, as illustrated in figure 4.2.

Finding paths that minimize the cost function can be done with a dynamic programming algorithm.

For each pixel in the image we find the cost of the cheapest path from the top of the image that ends at that pixel. To find a path passing through (x, y) we only need to examine paths coming from the three possible previous positions, at (x − 1, y − 1), (x, y − 1) and (x + 1, y − 1). After the costs are calculated we can read off the paths by tracing back from bottom to top. The full algorithm is included in appendix A.3. The runtime of the algorithm is linear in the number of pixels.
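A sketch of this dynamic program (Python with numpy; f is the vertically downsampled image with f[y, x] the ink value at (x, y), and d the diagonal penalty from the cost function above):

    import numpy as np

    def cheapest_paths(f, d):
        # f: M x N array of ink values (0 = background), d: penalty for a diagonal step.
        M, N = f.shape
        cost = np.empty((M, N))
        back = np.zeros((M, N), dtype=int)
        cost[0] = f[0]
        for y in range(1, M):
            for x in range(N):
                # Candidate predecessors: (x-1, y-1), (x, y-1), (x+1, y-1).
                best_c, best_x = cost[y - 1, x], x
                if x > 0 and cost[y - 1, x - 1] + d < best_c:
                    best_c, best_x = cost[y - 1, x - 1] + d, x - 1
                if x < N - 1 and cost[y - 1, x + 1] + d < best_c:
                    best_c, best_x = cost[y - 1, x + 1] + d, x + 1
                cost[y, x] = f[y, x] + best_c
                back[y, x] = best_x
        return cost, back

    def trace_path(back, x_end):
        # Read back one path from bottom to top.
        M = back.shape[0]
        path = [x_end]
        for y in range(M - 1, 0, -1):
            path.append(back[y, path[-1]])
        return path[::-1]    # path[y] = x-coordinate of the path at row y

Calling trace_path for every column of the bottom row gives the N candidate paths discussed in the next section.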


4.1.2 Cutting the Image

The seam carving algorithm gives N possible paths for an image of width N, one for each pixel in the bottom row. Some of these paths will pass through the text, and many paths will pass through the same pixels. We are only interested in paths that pass mostly through the background, that is, paths that are actually cheap. We consider a path to be cheap if its cost is at most 5% of the cost of the most expensive path, and if it is cheaper than its neighboring paths.

These paths will divide the image into a number of sections. A section is the part of the image between two paths. We count the pixels in the left hand path as part of the section, so all pixels will belong to exactly one section. The final step is to select the sections that make up the body text from the other sections.

Our assumption is that margins contain less text than the body. For each section we count the fraction of foreground pixels. The section with the largest foreground fraction is part of the body, as well as adjacent sections with a fraction close to it. To be precise, we keep all sections with a foreground fraction that is at least 0.5 times the largest foreground fraction.

We discard the margins entirely, and continue processing only the body text.

This approach of removing margins works for almost all pages of the Rerum. There are two special cases where errors occur.

• On pages without body text, such as title pages.

• On pages with multiple columns, such as the index at the end of the book. Depending on the exact contents, either some of the columns are discarded, or multiple columns are joined together.

We have not investigated how these problems can be avoided, because the pages containing normal body text are more important. In the future this problem needs to be solved, though.

4.2 Splitting into Lines

After the margins have been removed, the image can be split into lines.

The Rerum pages are especially challenging for line finding algorithms, because the distance between text lines is much smaller than in modern texts. Sometimes characters from adjacent lines touch.

Our approach to line finding is based on a simple idea: calculate the average pixel value on each row of the image. Local maxima in this projection profile correspond to text lines, while local minima correspond to the space between rows.

However, when trying to apply this idea directly to the Rerum pages, one encounters a problem:

the lines in the image are not straight, so values from different lines blend together, see figure 4.3.

This happens because the page is curved and because it might be rotated slightly in the photograph.

To avoid these problems we need to make the text straight before trying to find text lines, see figure 4.4.

As in the previous sections we work with an N by M grayscale image f , represented as a function f : N × N → R. We assume that f (x, y) = 0 unless 0 ≤ x < N and 0 ≤ y < M .

4.2.1 Straightening

Imagine the image as a piece of elastic material. We then try to push and pull it in such a way that all text lines become strictly horizontal. This of course raises the question of what a text line


(a) A page image, with the projection profile overlaid. (b) The projection profile.

Figure 4.3: The projection profile can not be used for finding lines directly, because different lines blend together.

(a) A page image, with the projection profile overlaid. (b) The projection profile.

Figure 4.4: After making the text straight, peaks in the projection profile correspond to lines.


is, since we perform straightening before finding lines.

We use the same idea as for finding text lines, namely that the average pixel value inside text lines is higher than the average value outside text lines. Making the text straight is then a matter of moving pixels around to maximize the similarity among pixels in the same row of the image.

Clearly, moving parts of the image left or right has no effect on how horizontal the lines are, so we only need to consider movement up and down.

We can formalize this pushing and pulling as a mapping L from pairs of coordinates to a single (vertical) coordinate. We call this coordinate L(x, y) the logical row.

Since we want to distort the image as little as possible, the straightening algorithm has to balance several goals:

1. The mapping L should be smooth horizontally.

2. The mapping L should be smooth vertically.

3. The logical lines should not be compressed or stretched too much vertically.

4. The pixel values inside each logical row should be as similar as possible, matching the projection of that line closely.

Now the projection profile we use for finding lines is that of the straightened image:

P(l) = Σ_{(x,y) : L(x,y) = l} f(x, y).

If we knew this projection profile of the straightened image, we could fit the image to it by shifting columns of pixels up or down to maximize the similarity. So now we have a chicken-and-egg problem: we need the projection profile to find a good mapping, but we need the mapping to calculate the projection profile.

4.2.2 Approximation Algorithm

We use a simple approximate solution, namely a single left-to-right pass. The projection profile of the straightened image up to column x is used to find the optimal mapping for column x + 1.

For each y coordinate the goal is to find the logical row L(x + 1, y) = l so that f (x + 1, y) is most similar to P (l). Since we want the mapping to be smooth, we know that L(x + 1, y) will not differ much from L(x, y). Therefore we try all values of l from L(x, y) − ∆max to L(x, y) + ∆max for some small constant ∆max.

For measuring the similarity we calculate the product P(l) · f(x + 1, y). The reader may wonder why we maximize this product instead of minimizing the squared error (P(l) − f(x + 1, y))². The reason is that there are background pixels both between text lines and inside text lines. By using the product these background pixels are given a lower weight.

If we were to just pick the best value of l for each row independently, then there would be a problem in regions without ink pixels, since there we have no information on what value of l to use. Here we use the smoothness goal. If, for example, the mapping distorts the image upward in some area of the image, then we assume that the distortion is also upward at points with little or no information. This smoothness can be realized by simply blurring the similarity values vertically for each different value of ∆l(x, y) = L(x + 1, y) − L(x, y). The result is that we always look at similarity values in a small neighborhood, instead of single pixels. Then we pick l = L(x, y) + ∆l(x, y) where ∆l(x, y) gives the highest blurred similarity score.

This blurring of similarity values is not enough to ensure that L is smooth. There can be discon- tinuities in the index of maximal similarity, as illustrated in figure 4.5. Therefore, to ensure that the mapping is actually smooth, we also blur L itself vertically for each column x.


To ensure horizontal smoothness we average the new logical row coordinate at each position with the one from the previous column. This results in an exponentially decaying weight of previous columns.

(a) Similarity scores s. (b) The ∆l index where similarity is maximal.

Figure 4.5: Smoothly varying similarity scores do not guarantee that argmax_∆l s(∆l) is smooth.

One final detail is that using this algorithm directly on an image is problematic, because there is white space not only between lines, but also between characters. So lines with text contain both fore- and background pixels. The algorithm would try to shift the background pixels between (and even inside) characters out of the text line. A simple solution is to resample the image to a smaller width and to blur it horizontally, so that characters are blended together with the white space between characters. An added bonus is that by using a fixed width the algorithm becomes scale invariant.

Listing 3 Line Straightening
Require: an N′ by M grayscale image f
  f′ ← downsample image f horizontally to a width of N pixels
  for l ← 0 to M − 1 do
    P[l] ← 0
  end for
  for x ← 0 to N − 1 do
    // How well do the logical lines match the image with different shifts?
    for ∆l ← −∆max to ∆max do
      for y ← 0 to M − 1 do
        l ← y + L[x − 1, y] + ∆l
        match[∆l][y] ← f′(x, y) · P[l]/x
      end for
      match[∆l] ← GaussianBlur(match[∆l], σ1)
    end for
    // Pick the best logical line for each y
    for y ← 0 to M − 1 do
      ∆l ← argmax∆l match[∆l][y]    // find ∆l with best match
      Lx[y] ← y + L[x − 1, y] + ∆l
    end for
    Lx ← GaussianBlur(Lx, σ2)
    // Store logical row in L, and update P
    for y ← 0 to M − 1 do
      L[x, y] ← λ · Lx[y] + (1 − λ) · L[x − 1, y]
      P[L[x, y]] ← P[L[x, y]] + f′(x, y)
    end for
  end for

The full algorithm is given in listing 3. It takes as input an N′ by M grayscale image, and produces a mapping L as well as the projection profile P. The algorithm has several parameters that affect the balance between the goals:


N: The width of the downsampled image; we found that N = 100 pixels gives sufficient room.

∆max: Do not consider L[x, y] differing more than this value from L[x − 1, y]; this turns vertical smoothness into a hard constraint. We used ∆max = 5 for images with a height of 2000 pixels.

σ1: Blur of the match values. We used 0.03 times the height of the image.

σ2: Blur of the logical row indices. We used 0.03 times the height of the image.

λ: Decay rate for horizontal smoothness; higher values give a smoother mapping. We used λ = 0.9.

None of these parameter values is critical.

4.2.3 Finding Lines

As stated in the introduction of this chapter, we will use the maxima in the projection profile of the straightened text to find lines. The font of the Rerum has large serifs, decorations at the ends of the strokes that make up letters. Because of these serifs, the projection profile has not one but two local maxima for each text line, as can be seen in figure 4.4(b). Therefore, we can not simply take all local maxima as text lines.

Our approach is to instead look for intervals around the local maxima that are still ‘high enough’.

The part of the projection profile between the local maximum at the baseline of a text line and the x-line of that same text line contains the projection of the body of that text line. If we pick the correct threshold then these two maxima will belong to a single interval. The next section will describe a fast algorithm for finding these intervals.

4.2.4 Finding Peak Intervals

Let g : Z → R+ be a function from the integers to the non-negative reals, in our case it will be the projection profile g = P . Assume that g(i) is zero for i < 0 or i ≥ n. Let 0 < α ≤ 1 be a constant.

A peak interval in g is an interval [a; b] such that

• g(i) ≥ α · max g[a; b] for all i ∈ [a; b], and

• g(a − 1) < α · max g[a; b] and g(b + 1) < α · max g[a; b]

where max g[a; b] = max{g(i) | a ≤ i ≤ b} is the maximum value of g on the interval [a; b].

In words: in a peak interval all values are at least α times the maximum in that interval, and the values directly adjacent to the interval are less than that. As theorem 2 shows, peak intervals as defined this way can not overlap. That makes sense when we are looking for lines in an image, since two different lines do not occupy the same pixels.
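For illustration, here is one straightforward way to extract all peak intervals from a profile g in Python; it repeatedly takes the maximum of the remaining range and expands around it, which matches the definition above but is not necessarily the fast algorithm referred to in the previous section:

    def peak_intervals(g, alpha):
        # g: sequence of non-negative values, 0 < alpha <= 1.
        # Returns a list of peak intervals (a, b), both ends inclusive.
        n = len(g)
        out = []
        value = lambda i: g[i] if 0 <= i < n else 0.0
        stack = [(0, n - 1)]
        while stack:
            lo, hi = stack.pop()
            if lo > hi:
                continue
            m = max(range(lo, hi + 1), key=lambda i: g[i])   # maximum of the range
            t = alpha * g[m]
            a, b = m, m
            while a - 1 >= lo and g[a - 1] >= t:             # expand left
                a -= 1
            while b + 1 <= hi and g[b + 1] >= t:             # expand right
                b += 1
            # The values directly adjacent to the interval must fall below alpha * max.
            if value(a - 1) < t and value(b + 1) < t:
                out.append((a, b))
            stack.append((lo, a - 1))
            stack.append((b + 1, hi))
        return out

For the line finding step, g is the projection profile P of the straightened image, and each reported interval corresponds to one text line.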

Theorem 2. Different peak intervals can not overlap.

Proof. (by contradiction) Assume that [a; b] and [c; d] are two overlapping peak intervals.

First consider the case that one peak interval contains the other, say a ≤ c ≤ d ≤ b. Then obviously max g[c; d] ≤ max g[a; b]. By the definition of peak intervals, both g(c − 1) < α · max g[c; d] ≤ α · max g[a; b] and g(d + 1) < α · max g[c; d] ≤ α · max g[a; b]. This means that neither c − 1 nor d + 1 can be in the peak interval [a; b], so it must be the case that a = c and b = d, and the two intervals are the same.

If neither of the intervals contains the other, then they must have a different starting point, say a < c. This implies that b < d. If also b < c then the intervals do not overlap, so we can assume that a < c ≤ b < d. Then c − 1 ∈ [a; b], so g(b + 1) < α · max g[a; b] ≤ g(c − 1) < α · max g[c; d].

But also b + 1 ∈ [c; d], and because [c; d] is a peak interval, g(b + 1) ≥ α · max g[c; d]. This is a contradiction, so two different peak intervals can not overlap.
