
Fuzzy Spatial Relations for Document Layout Analysis

E. Kuiper, R. Wieringa


Fuzzy Spatial Relations for Document Layout Analysis

Authors:

E. Kuiper, R. Wieringa

Date: 24-12-1999

Supervisors:

Dr. ir. J.A.G. Nijhuis

Drs. J.H. Stevens


Abstract

Document Understanding (DU) is the process of converting a document from its paper form to an electronic, editable form. One step of Document Understanding is the labeling of blocks on the page; this process is called Document Layout Analysis. Document pages are segmented, which results in blocks (rectangles) that contain parts of the document such as headings, paragraphs and figures. The blocks are labeled with logical labels such as paragraph, heading, page number etc. The research in this report focuses on this labeling.

Much research has been done in the area of Document Layout Analysis; most researchers use rule bases or document templates to label the blocks on a document.

In this report the possibility of using fuzzy spatial relations for document layout analysis is examined. Technical papers (IEEE, Elsevier) are used as input to the system, and only the geometrical properties of the blocks are used, not their contents. Two experiments for recognizing the layout structure were performed: the first experiment uses block information (size, position) only; the second experiment examines how the results of the first method can be improved using fuzzy spatial relations in combination with the iterative method. In the iterative method the rule base is processed more than once for a single document page.

It can be concluded that it is possible to create a good layout analysis system that uses fuzzy logic and spatial relations; the recognition rate is up to 85% for 103 document pages. Equations, paragraphs and headings present problems; in future research these problems might be solved by using extra information such as the pixel density of the blocks and the font type setting (bold, italic etc.).

Samenvatting (Summary)

Document Understanding (DU) is the process of converting a document in paper form to an electronic, editable form. One step within Document Understanding is the labeling of blocks on a page; this process is called Document Layout Analysis. Document pages are segmented, which results in rectangular blocks that each contain a part of the document, such as paragraphs, section headings and page numbers. The research in this report focuses on this labeling. Much research has been done in the area of Document Layout Analysis; most researchers use rule bases and document templates to label the blocks in a document.

In this report the possibility of using fuzzy spatial relations with document layout analysis is examined. Technical documents (IEEE, Elsevier) are used as input, and only the geometrical properties of the blocks are used, not their contents. Two experiments for recognizing the layout were performed: the first experiment uses only block information (the size and position of the block); the second experiment is used to show that the results of the first experiment can be improved by using fuzzy spatial relations in combination with the iterative method. In the iterative method the rule base is processed several times for each page.

It can be concluded that it is possible to build a good layout analysis system that uses fuzzy logic and spatial relations; the recognition of the layout is almost 85% for 103 pages. Equations, paragraphs and section headings are difficult to recognize; in further research these problems could be solved by using extra information such as the pixel density of the blocks and the font type setting (bold, italic etc.).

Fuzzy Wuzzy was a bear,
Fuzzy Wuzzy had no hair;
then Fuzzy Wuzzy wasn't fuzzy, was he?

Traditional British poem, unattributed, nineteenth century

Contents

1 Introduction 1

1.1 Why layout analysis? 1

1.2 Research area 2

1.3 Problem description 2

1.4 Hypothesis 2

1.5 Limiting conditions 3

2 Document Understanding 5

2.1 Image acquisition and preprocessing 5

2.2 Document analysis 6

2.2.1 Binarization 7

2.2.2 Area segmentation 7

2.2.3 Area labeling 9

2.3 Document recognition 11

2.4 Document understanding 11

2.5 Problems during document understanding 14

3 Spatial relations 17

3.1 Spatial reasoning 17

3.2 Modeling of spatial relations and objects 18

3.3 Used approaches in document understanding 20

4 Fuzzy Logic 22

4.1 What is Fuzzy Logic? 22

4.1.1 Fuzzy sets 22

4.1.2 Operations on fuzzy sets 24

4.1.3 Fuzzy relations 25

5 The use of fuzzy logic and spatial relations 26

5.1 Fuzzy logic 26

5.2 Spatial relations 27

6 System overview 29

6.1 Overview of the system 29

6.2 Inputs of the system 29

6.3 Outputs of the system 31

6.4 Setting up the rule base 32

7 Experiments 34

7.1 Introduction 34

7.2 Confusion matrices 34

7.3 Using block information only 35

7.3.1 The inputs 35

7.3.2 The outputs 37

7.3.3 Setting up the rules 37

7.4 Spatial relations and iterations 40

7.4.1 Experiment using business cards 40

7.5 Spatial relations and iterations 42

7.5.1 The inputs 42

7.5.2 The outputs 44

7.5.3 Iterations 44

7.5.4 Setting up the rules 45

7.6 Final results and conclusion 48

8 Conclusion and further research 50

References 51

Appendix A 53

Appendix B 58

Appendix C 61

Appendix D 68

Appendix E 71

Appendix F 72

Appendix G 76

Appendix H 85


Chapter 1 Introduction

In this first chapter an introduction is given to document layout analysis. The chapter gives some reasons why layout analysis can be used and explains the research area in which document layout analysis is found. Furthermore the hypothesis is given, which will be the main research topic for this entire report. In the following chapters the hypothesis will either be proved or rejected.

1.1 Why layout analysis?

Nowadays we see that companies have a lot of old paper-based documents that need to be converted to an electronic form, e.g. for backup purposes or to decrease the pile of paper in the company. This conversion to an electronic form is done by scanning the document and saving the document image in a database of scanned documents. These document images take up a lot of storage space, and it is not possible to edit their contents. In some cases we are not (only) interested in scanning the document; sometimes we want to derive the contents and logical structure of a document. With many documents available we prefer that the job of reconstructing the document contents and layout structure is done automatically instead of typing all the data by hand.

There can be several reasons to derive the contents and the logical layout of documents. The first is that if all the scanned documents are present in a large database, we know nothing about the contents of these documents except for the scanned image. Sometimes it is useful to know what is in a document, e.g. when we search for the authors or the title of a document; in this case it is sufficient to recognize the parts of the document that contain the title and the authors and do Optical Character Recognition (OCR) on those parts only. If this were possible we could build a table of contents for the whole database in which we can find the authors and/or title of a document image. In the same way it is useful to extract information about the abstracts of documents, so we can look for keywords in the documents without having to do OCR on the whole document.

Another reason can be that a company changes its document layout format and wants to convert its existing documents into the new format. Document conversion can also be useful for some writers of books: if a writer wants to republish an old book at a different publisher, then the layout of the book has to be converted into the publisher's format.

1.2 Research area

The questions mentioned in the previous section have motivated many researchers to find a solution for the automatic conversion of a document on paper to an electronic, editable form; the research area involved is called Document Understanding research.

With the techniques found in Document Understanding research it is also possible to set up the documents in a new layout format. Currently standard document formats are being developed, for example SGML and XML [26]. With these standards it is easy to use the documents on any type of computer and in any type of word processing application; the conversion of documents into these standards can be done using Document Understanding.

The area of Document Understanding can be divided into three sub-areas, namely Document Analysis, Document Recognition and Document Understanding. The latter can be confusing because the same term is used for the whole research area. In the literature about Document Understanding these terms are used interchangeably. For a more detailed explanation of Document Understanding we refer to chapter 2.

1.3 Problem description

In the previous section we saw that sometimes it is very useful to get the layout structure from documents.

In figure 1.1 the layout structure of an example document is shown. The question is how to retrieve this logical structure from the document. So the problem description for the retrieval of the layout structure of the document is:

How do we get the logical layout of documents?

In this report this will be the main question and an answer to this question will become clear in the following chapters.

1.4 Hypothesis

The main topic researched in this report is whether fuzzy logic and spatial relations (chapters 3 and 4) can help us find the layout of a document. The only information used is the information of the rectangular blocks in the document; the reason for this can be found in chapter 5. Because humans use fuzzy terms and spatial relations in their reasoning, it is expected that this is possible. With the geometrical information it is possible to calculate spatial relations between those rectangular blocks. This brings us to the hypothesis that we will try to prove in the rest of this report.

We can build a fuzzy system that determines the logical layout of a document page using only the geometrical properties of rectangular blocks and the spatial relations between those blocks.

Figure 1.1: The layout structure of a document (blocks labeled TITLE, AUTHORS, ABSTRACT PARAGRAPH, ABSTRACT TEXT, PICTURE, CHAPTER PARAGRAPH, PARAGRAPH)

1.5 Limiting conditions

As can be concluded from the hypothesis, the fuzzy system does not use any low-level information like OCR text or other contents; this is the first limiting condition. The reason why OCR and other contents are not used is that humans do not need this information either to say useful things about the layout.

We are only interested in labeling the blocks on a document page, so we are not interested in building a system that produces electronic, editable output. The main reason for this is that we would need other techniques like OCR to produce such output.

It is difficult to build a system for many different types of documents because of the diversity in the documents. Therefore the research in this report focuses only on technical papers such as IEEE and Elsevier publications, since the range of diversity between technical papers is limited. It could be objected that this requires a priori knowledge, like Document Grammars (chapter 2); although this is true, we expect that there are enough correspondences between technical papers to create a recognition system for them. Many researchers have already used technical papers as input for their document understanding systems, so some results may be used to compare with ours.


Chapter 2 Document Understanding

In this chapter Document Understanding and research that has been done in this area will be explained. At the end of this chapter some problems with document understanding will be shown.

The process of converting a document on paper to an electronic, editable form is called Document Understanding (DU) [11]. The Document Understanding task is divided into three conceptual levels: document (image) analysis, document (image) recognition and document understanding. The term document understanding is confusing here because it is used for the whole research area as well as for the final process, in which the document structure is reproduced the way humans would reproduce it. An architectural overview of Document Understanding can be found in figure 2.1. In the next sections we briefly explain the three conceptual levels.

2.1 Image acquisition and preprocessing

After the document is scanned (e.g. at 300 dpi), several preprocessing steps can be done to enhance the scanned image; in this section noise removal and skew correction are discussed.

Noise removal is the process of removing so-called noise from an image: when documents are copied they will often show black 'dots' due to, for example, a dirty copier. In many cases morphology (erosion/dilation) is used to clear noise from the scanned document image.

Skew correction is the technique to correct the skew of images; sometimes images are not scanned straight and the whole scanned image is rotated a little. In later processing this will likely cause trouble in OCR but also in the detection of image regions like paragraphs. The Hough transform is the most commonly used method for the correction of skew; in [2] an example is given of the use of the Hough method, and an explanation of the Hough transform can be found in [21].

Figure 2.1: Architecture of document understanding (image acquisition and preprocessing; document (image) analysis: binarization, area segmentation, area labeling; document (image) recognition: OCR, photograph/graphics analysis/understanding; document understanding: natural language processing; result: a description of the document)

2.2 Document analysis

As can be seen in figure 2.1, the process of document analysis consists of three main tasks: binarization, area segmentation and area labeling. In the following sections these three sub-areas are discussed. The result of this part of the document understanding process is the geometric structure of the document, which describes the building components of the document and their position on the page.


2.2.1 Binarization

For OCR and also for segmentation methods it is better to have a black-and-white image instead of a grayscale or color image, because it is then easier to find edges within the image. Binarization is used to convert grayscale and color images to black-and-white images. Often a threshold is used to convert the pixels of the image to black or white pixels; an example of thresholding is shown in [2], where a threshold of 128 is used on an 8-bit grayscale image.
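As a minimal sketch (in Python, with an illustrative image representation of rows of integer pixel values; not code from the cited papers), fixed-threshold binarization looks like this:

    # Pixels darker than the threshold become black (1), others white (0).
    def binarize(image, threshold=128):
        return [[1 if pixel < threshold else 0 for pixel in row]
                for row in image]

    gray = [[12, 200, 129], [127, 128, 255]]
    print(binarize(gray))  # [[1, 0, 0], [1, 0, 0]]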

2.2.2 Area segmentation

Area segmentation is the process of dividing the image into rectangular blocks; such a rectangular block contains a single logical component, i.e. a title, paragraph, picture or caption. Segmentation attempts to isolate these logical components in the document image but does not identify the type of logical component (this will be discussed later).

Many ways to segment the image can be found, and some of these methods are mentioned in [2]. There are mainly two approaches to segmentation: top-down and bottom-up. In the top-down approach we begin by defining large image blocks and divide these blocks into smaller blocks. In the bottom-up approach the opposite is done: first define small image blocks and then create larger blocks by merging these small blocks. In some cases a combination of the top-down and bottom-up approaches is used; a description of such a method can be found in [6]. When looking at these two approaches in more detail we see that most segmentation methods use the top-down approach. The most commonly used segmentation methods are described in the following paragraphs.

Top-down segmentation

Run-Length Smoothing Algorithm (RLSA)

This segmentation method tries to create continuous streams of pixels by merging any two black pixels that are less than a threshold value apart from each other. The method can first be applied row-by-row and then column-by-column, resulting in two distinct bitmaps; these two bitmaps are then combined into a single bitmap using a pixelwise OR operation [6]. The resulting output image has a smear wherever printed material (text or images) was found on the original image. The difficulty with this algorithm is the selection of the thresholds for the horizontal and vertical smear [8]; these two thresholds are usually set through experiments.
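A sketch of the one-dimensional smoothing step for a single row (an assumption-level illustration, not the implementation of [6] or [8]); applied row-by-row and column-by-column and OR-ed pixelwise, this yields the smeared bitmap described above:

    # 1 = black, 0 = white: any run of white pixels shorter than the
    # threshold between two black pixels is filled with black.
    def smooth_row(row, threshold):
        out = row[:]
        last_black = None
        for i, pixel in enumerate(row):
            if pixel == 1:
                if last_black is not None and i - last_black - 1 < threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1  # fill the short white gap
                last_black = i
        return out

    print(smooth_row([1, 0, 0, 1, 0, 0, 0, 0, 1], 3))
    # [1, 1, 1, 1, 0, 0, 0, 0, 1]: only the gap of length 2 is filled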

Recursive X-Y cuts (RXYC)

This method assumes that a document can be represented in the form of nested rectangular blocks. First the whole image is taken into account; the algorithm then tries to split the image into two blocks. This is done by obtaining the projection profile of the block, which can be done as follows [6]: given an image and a rectangular block in this image, histograms are calculated for the horizontal and vertical direction. This results in two arrays containing the number of black pixels in the horizontal and vertical direction. (By applying a threshold to these counts the influence of noise can be reduced; for example, two black pixels can be noise, so a threshold of 2 would result in 0 black pixels in the histogram.) After the projection profiles have been found, the algorithm searches for a vertical and a horizontal cut. The projection profiles can be seen as waveforms in which a deep valley with a width greater than a certain threshold can be seen as a place to cut the block into two blocks [20]. The splitting of image blocks is continued until a certain level is reached (characters or words, for example). How good the segmentation is depends highly on the chosen threshold; we can see in [6] that a wrong threshold can lead to an erroneous segmentation. The main advantage of the RXYC method is its resistance to noise, in contrast to the RLSA method, which is very sensitive to noise.
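A sketch of one X-Y cut step along the horizontal direction (thresholds and representation are illustrative assumptions, not values from [6] or [20]):

    # Compute the horizontal projection profile (black pixels per row),
    # suppress noise with a threshold, then look for a valley (run of
    # empty rows) wide enough to cut the block in two.
    def horizontal_profile(image, noise_threshold=2):
        counts = [sum(row) for row in image]
        return [c if c >= noise_threshold else 0 for c in counts]

    def find_cut(profile, min_valley_width):
        start = None
        for i, value in enumerate(profile):
            if value == 0:
                if start is None:
                    start = i
                if i - start + 1 >= min_valley_width:
                    return (start + i) // 2  # cut position inside the valley
            else:
                start = None
        return None  # no valley wide enough: stop splitting this block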

Hough transform

This method exploits the fact that documents contain significant linearity and is able to find straight lines in images and text. The linearity in text is found by viewing text as thick texture lines; the method is therefore ideal for table-form documents and for detecting tables in documents. For a more detailed explanation of the Hough transform we refer to [21].

Bottom-up segmentation

Connected Components (CC)

In this algorithm the pixels of an image are combined into small rectangular blocks (components). The algorithm then repeatedly tries to combine these smaller components into larger components. Translating this perspective to the analysis of documents, we first have individual characters and figures at the lowest level of analysis. The characters are then merged into a larger component, a word; after that a word can be merged with other words into a text line, etc. In [6] an implementation of Connected Components is presented.
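A sketch of the lowest level of this process, labeling 4-connected components of black pixels with breadth-first search (a generic illustration, not the implementation of [6]; merging components into words and lines would follow on top of this):

    from collections import deque

    def connected_components(image):
        rows, cols = len(image), len(image[0])
        label = [[0] * cols for _ in range(rows)]
        current = 0
        for r in range(rows):
            for c in range(cols):
                if image[r][c] == 1 and label[r][c] == 0:
                    current += 1                  # start a new component
                    queue = deque([(r, c)])
                    label[r][c] = current
                    while queue:
                        y, x = queue.popleft()
                        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols and
                                    image[ny][nx] == 1 and label[ny][nx] == 0):
                                label[ny][nx] = current
                                queue.append((ny, nx))
        return label, current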

As mentioned earlier, combinations of top-down and bottom-up approaches are also used; we often see that RLSA is used for smearing the document image and CC is used to extract blocks from the smeared document image [2, 6, 8]. In [11] another method called docstrum is presented; it uses a K-nearest-neighbor clustering of components.

In table 2.1 we present a few technical papers and the segmentation method(s) used in these papers. Not all the methods mentioned in these papers use the directly scanned images as input. Papers [2, 4, 8, 9, 11, 12] use the scanned images as input; in [4] skew correction is done using a Hough transform, and in [2] no skew correction is used. For the papers [8, 9, 11, 12] no information on skew correction is available; however, all papers assume a straight scanned image as input.

The adaptive split-and-merge algorithm mentioned in [4] globally works as follows. The image is first split up into 4 homogeneous regions, then each region is split up into homogeneous regions. If two neighboring regions are homogeneous and their union is also homogeneous, then both regions are merged. The algorithm is adaptive in the sense that the partitioning (splitting) lines are chosen by an algorithm that adapts the lines to find the best regions.

Table 2.1: Overview of papers and the used segmentation method(s)

Paper                          Approach             Segmentation method(s)
[2]                            Top-down             RLSA and connected components for block extraction
Jiming Liu et al. [4]          Top-down             Adaptive split-and-merge
Michael Sharpe et al. [6]      Top-down/Bottom-up   RLSA, RXYC and connected components for block extraction
Philippe Chauvet et al. [8]    Top-down             RLSA and connected components for block extraction
Toyohide Watanabe et al. [9]   Top-down             Edge extraction filter (this paper only takes table-form documents as input)
Debashish Niyogi et al. [11]   Bottom-up            Docstrum
Debashish Niyogi et al. [12]   Bottom-up            Docstrum

2.2.3 Area labeling

Area labeling, or block classification [10], is the labeling of blocks according to the type of data they contain, that is: text, picture, line drawing, table etc. Area labeling is necessary to determine the type of processing required for each block; text blocks, for example, are usually passed on for OCR processing. Here area labeling is presented as part of the geometric structure of a document, but one could say that there is no difference between area labeling and the kind of labeling we do in Document Understanding (see figure 2.1). This is partially true: area labeling can be seen as a low level of abstraction on the document contents. Denoting that a block contains text is a lower level of abstraction than denoting that the block contains a title. Labeling areas with information like text, picture etc. depends heavily on the low-level information that is used from the document; usually features of blocks are used to label the content type of the block. Some examples of features are [6]:

Block eccentricity: width/height of the block
Black pixel density: percentage of black pixels
Cut count: number of white lines through a block

The features mentioned above are all low-level features; using these features we can derive information about the contents of the block, e.g. text blocks have many (horizontal) white cuts and a low black pixel density. When using a high level of abstraction we can often do without any low-level information; an example of this are Document Grammars in the process of document understanding (discussed later). This is a reason why area labeling and the labeling in document understanding are often separated.
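As a sketch, the three block features above can be computed directly from a binary block image (assuming a non-empty block given as rows of 0/1 pixels; this is an illustration, not the implementation of [6]):

    def block_features(block):
        height = len(block)
        width = len(block[0])
        black = sum(sum(row) for row in block)
        eccentricity = width / height                         # block eccentricity
        density = black / (width * height)                    # black pixel density
        cut_count = sum(1 for row in block if sum(row) == 0)  # white lines through the block
        return eccentricity, density, cut_count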

All the steps in document analysis have now been introduced; the result of this analysis is the geometric layout of the document. This layout is often stored in a data structure that is easy to access; the most commonly used structures for storing information about the layout of documents are trees. We have already mentioned that the process of document understanding should finally deliver a description of the document; this should be a form that is not tied to a specific output format (LaTeX or Word). There are currently two international standards for the representation and interchange of documents: Office Document Architecture (ODA) (most commonly used [10]) and Standard Generalized Markup Language (SGML). During Document Understanding research it is better to keep these standards in mind; intermediate results should conform to one of the standards or should use a representative form which can easily be converted into one of the two standards, because the standards are not platform or application dependent.

A representative form used during this document understanding process can be an X-Y tree which contains information about the document. After the document analysis process is completed the X-Y tree is built; this tree is passed to the document understanding process, which adds additional (layout) information about the document. Finally the resulting X-Y tree can be converted to ODA, SGML or some other format (e.g. LaTeX). In figure 2.2 a segmented document and its X-Y tree are shown. Detailed information on X-Y trees can be found in [6].
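A minimal sketch of what an X-Y tree node might hold (names and fields are illustrative assumptions, not the data structure of [6]):

    # Each node holds the bounding rectangle of a block; children are the
    # sub-blocks produced by the alternating horizontal/vertical cuts.
    # Leaves may carry the logical label added during understanding.
    class XYNode:
        def __init__(self, x, y, width, height, label=None):
            self.x, self.y = x, y
            self.width, self.height = width, height
            self.label = label       # e.g. "title", filled in later
            self.children = []       # sub-blocks from the next cut

    page = XYNode(0, 0, 2480, 3508)  # an A4 page scanned at 300 dpi
    page.children.append(XYNode(500, 150, 1480, 120, label="title"))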

Figure 2.2: A segmented document and its X-Y tree representation (blocks: TITLE, AUTHORS, ABSTRACT PARAGRAPH, ABSTRACT TEXT, PICTURE, CHAPTER PARAGRAPH, PARAGRAPH)

2.3 Document recognition

In the document recognition process the contents of the blocks found in the document analysis process are recognized. For example, Optical Character Recognition (OCR) is part of this process and tries to recognize the textual contents of the blocks.

Photograph and graphics analysis can also be used in this process to reproduce the contents of photographs and graphics. In graphics analysis the graphics of the document are segmented; after segmentation the analysis tries to identify the parts of the graphics. A part of graphics analysis is vectorization: in vectorization line drawings are converted to vectors, so that it is possible to modify the contents of the drawing. In photograph and graphics understanding the contents of the graphics or photographs are recognized; licence plate recognition is an example of photograph analysis and photograph understanding.

The analysis and understanding parts for OCR, photographs and graphics are placed outside the hierarchical chart in figure 2.1; the reason for this is that there is no predetermined moment at which we should do analysis of photographs, graphics or OCR. If OCR is taken as an example, then OCR could be done after the processes of document analysis and document understanding, but it could also be done directly after document analysis.

2.4 Document understanding

At this stage the information on the geometric structure of the document is known; now it is time to derive the logical structure of the document. This is the area on which this research focuses. This section will not discuss the graphics and photograph understanding area; by Natural Language Processing we mean the grouping of rectangular blocks based on the human interpretation of the contents and the spatial relations of these blocks. This process can also be called logical grouping, and in the rest of this report this term will be used. There are several methods for logical grouping; this section discusses two of them.

Document grammars

With the help of document grammars the layout of documents can be recognized. Document grammars are descriptions of documents and can be described by context-free grammars in the same manner as programming languages. In figure 2.3 an example of such a context-free grammar for bibliographical references is given [14]; the text on the left of each line is the non-terminal symbol, and the text after the '::=' is the production rule for that non-terminal. The terminal symbols are single characters.

Ackley, D. H. (1985). A connectionist algorithm for genetic search. Proceedings of an International Conference on Genetic Algorithms and Their Applications, 121-135.

reference     ::= refhead '.' refbody
refhead       ::= authors date
authors       ::= author | authorlist
authorlist    ::= {author ','} author ',' '&' author
author        ::= aname ',' fname '.' {fname '.'}
date          ::= '(' year [',' month] ')'
refbody       ::= book | article | phdthesis | inproceedings | inbook
inproceedings ::= title booktitle ',' pages '.'
pages         ::= number '-' number
number        ::= {digit}
fname         ::= 'A' | 'B' | ...

Figure 2.3: Example of a context-free document grammar

The grammar in figure 2.3 can be split up into production and recognition rules [6]. All the production rules combined form the document grammar, which defines the layout of (a part of) a document, in this case a reference. The recognition rules are used in the terminal productions of the grammar; they decide whether a given element is the type of logical component that is expected by the grammar. In figure 2.4 we have split up the grammar as discussed above.

Production rules:
reference     ::= refhead '.' refbody
refhead       ::= authors date
authors       ::= author | authorlist
authorlist    ::= {author ','} author ',' '&' author
author        ::= aname ',' fname '.' {fname '.'}
date          ::= '(' year [',' month] ')'
refbody       ::= book | article | phdthesis | inproceedings | inbook
inproceedings ::= title booktitle ',' pages '.'
pages         ::= number '-' number
year          ::= number

Recognition rules:
number        ::= {digit}
fname         ::= 'A' | ...
aname, month, title and booktitle ::= {character}

Figure 2.4: The production and recognition rules of the grammar in figure 2.3

The X-Y tree that we assume as input can now simply be parsed. In this example it is necessary that OCR has been done, because character information is used in the terminal productions. The rules could, however, have been set up differently, in such a way that no OCR information is needed. For this case, let's take a simple document that starts with a title and then the name of the author (one author, for simplicity); after the author we expect at least one paragraph. The grammar would look like this:

document  ::= title author paragraph {paragraph}
title     ::= rules for titles
author    ::= rules for authors
paragraph ::= rules for paragraphs

Title, author and paragraph are terminal symbols here. The rules for title are not defined like 'a sequence of characters'; another definition is used, for example: 'a title is one line and centered on the page'. In this way no OCR information is needed, and the same kind of rule can be used for the author. We could ask ourselves: how would we recognize the difference between a title and an author? This is not necessary in this case because of the use of a document grammar. The grammar says that the document starts with a title, so if we find one line of text that is centered on the page then this should be the title. This also means that if someone forgot to use a title, the author would be seen as the title.
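A sketch of such a layout-based recognition rule for the terminal 'title' (the block dictionary, its fields and the tolerance are illustrative assumptions, not the thesis' code):

    # 'Title': one line of text, centered on the page.
    def is_title(block, page_width, tolerance=30):
        one_line = block["lines"] == 1
        block_center = block["x"] + block["width"] / 2
        centered = abs(block_center - page_width / 2) <= tolerance
        return one_line and centered

    print(is_title({"x": 600, "width": 900, "lines": 1}, page_width=2100))  # True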

The main disadvantage of document grammars is that a priori information is needed on the document model that will be examined. Document models for which no document grammars exist will simply not be recognized; therefore document grammars are not useful if we want to build a generic document understanding system. An example of a document grammar based understanding system can be found in [6].

Rule bases

In the document grammars method we have already seen an example of a rule base: document grammars have rules to recognize the terminal symbols in the document. Their disadvantage, as noted, is that a priori knowledge is needed of the documents used in the document understanding system. The rule-base method discussed here may help to find a generic approach for document understanding.

Rule bases consist of rules used to recognize logical components in documents; the rules can be set up for each different type of logical component. The information on which the rules are based depends on the preceding document analysis stage, i.e. if no information on the number of text lines in a block was extracted during document analysis, then this information cannot be used in the stage of logical grouping [11].

An example of a rule which needs extra information from the document analysis stage is:

IF a block X is of type small-text
AND IF it is below another block Y which is a photograph
AND IF the widths of the two blocks are equal
THEN block X is a photo-caption

Here, it is assumed that during area labeling the data type of each block was determined.
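The rule above could be sketched as a crisp check like the following (field names are illustrative; 'top'/'bottom' are y-coordinates with y growing downward, and 'datatype' comes from area labeling; the fuzzy system of later chapters replaces such hard comparisons with membership degrees):

    def is_photo_caption(x, y, width_tolerance=5):
        return (x["datatype"] == "small-text"
                and x["top"] >= y["bottom"]                      # X is below Y
                and y["datatype"] == "photograph"
                and abs(x["width"] - y["width"]) <= width_tolerance)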

The problem with rule-based logical grouping is setting up the rules; it is hard to capture all the knowledge we have about documents (for example, that a title is always at the top of the document). Therefore it has been proposed that users of the document understanding system can modify the rule bases [1].

In [11] the rules are split up into three different types: knowledge rules, control rules and strategy rules. The knowledge rules define general characteristics of documents; the control rules and strategy rules regulate the analysis of the document. In [3] the knowledge base is trained with information extracted from the blocks. The following table gives an overview of the papers and the used method.

Table 2.2: Overview of papers and the used logical grouping method(s)

Paper                          Logical grouping method(s)
Nenad Marovac [1]              Rule base; the user can correct wrongly labeled blocks
William Lovegrove et al. [3]   Rule base; knowledge is trained
Jiming Liu et al. [4]          Rule base
Michael Sharpe et al. [6]      Document grammar
Kazem Taghva et al. [7]        Rule base
Debashish Niyogi et al. [11]   Rule base, 3 levels
Debashish Niyogi et al. [12]   Rule base, 3 levels
Frédéric Bapst et al. [14]     Rule base

After the logical grouping stage is complete, the X-Y tree is updated and contains all the information we need to create an editable document. All the steps in a document understanding system have now been discussed; the result is an output in an editable form. It is clear that there are various steps to be taken to get a good document understanding system, and during the entire process a lot of things can go wrong, e.g. a bad segmentation could lead to bad logical grouping because information needed for good logical grouping is not available. Is bad logical grouping really such a bad thing in all cases? If the system fails to identify a section heading, this would not be a problem in a representation form such as HTML (as long as we reproduce the same typographic settings). But if we use the heading as a basis to build the table of contents of the document, it would be a problem; so how big the failure tolerance is depends on the application of the document understanding.

The logical structure is a grouping of components based on human interpretation of the contents of the components and the spatial relations between the components. Figure 2.5 shows the relation between the document understanding architecture and the geometric and logical structure. In the geometric structure we see spatial relations between the components in the document; in chapter 3 we will look deeper into these spatial relations and their use in document understanding.

In figure 2.5 the steps in the document understanding process are shown; each image shows the result of the corresponding document understanding level.

2.5 Problems during document understanding

It was said that during the process of converting a paper document into an electronic, editable form a lot of things can go wrong. In this section we discuss some problems that many researchers have encountered during their research on document understanding.

During the last couple of years imaging techniques have evolved; things like binarization and skew correction are often incorporated in the imaging software that is packaged with scanners. Capturing document images is a problem when the documents are of very bad quality; color documents are also very difficult to scan, in the sense that it is difficult to separate text from colored backgrounds. Up to now all currently available document understanding systems use black-and-white images.

Figure 2.5: Relation between layout structures and document understanding architecture (document analysis yields the geometric structure; document understanding yields the logical structure, with blocks such as TITLE, HEADER, AUTHOR(S), PAGENR, TEXT, PARAGRAPH and FOOTER)

We have seen that the trouble with area segmentation is finding the right thresholds for the segmentation algorithms. There is no mathematical method for setting up the thresholds, which means that the thresholds are set on an experimental basis. Segmentation might fail on documents due to wrong threshold settings. Another problem is that all the segmentation methods try to set up rectangular blocks: text that is wrapped like a circle around an image will not be segmented correctly. However, most documents use straight, rectangular layouts.


On the subject of area labeling there are no particular problems; however, one can imagine cases in which area labeling might fail. Looking at figure 2.3, it can be seen that the figure is actually built up of text. What will the area labeling algorithm see in this case, text or a figure? In most cases it will see text, which is no problem because the entire picture is text; but what if we had put one rectangle in the figure? Normally area labeling sticks to primitive labels like text, figure and table, but if more complex labels are defined, like equation, line drawing, small-text, big-text etc., the problem of area labeling becomes more and more complex. An easy way to deal with areas that cannot be labeled is to label them as pictures; in this way the area will appear as a picture in the resulting document description.

In logical grouping the usual problem is to define a set of rules that covers all possible labels and can detect all labels. With document grammars the problem is to define good rules for the terminal symbols. With rule bases the problem is to define rules that cover all labels and do not overlap; the rules should be kept simple because complex rules are difficult to understand and to implement. It might be a good idea to let the computer derive complex rules from an input example, i.e. if an author's name is always preceded by a title, the computer derives this logic from the example. Of course it is very difficult to cover all the labels if there are many of them; therefore it can be useful to let the user add new rules, or to split the problem into small subproblems (document types) to build a rule base for the many types of documents we want to understand.

As can be read above, many problems can arise during the process of document understanding. All the research found concentrates on a particular type of document to avoid the problems mentioned in the previous paragraphs; in this way problems with thresholds in segmentation can also be avoided. Most researchers simply choose a threshold that works fine for the type of document that is used; in many cases these thresholds do not work for other types of documents.

In the next chapter we will look into spatial relations; spatial relations are very important in this research, and the reason why will be explained in that chapter.


Chapter 3 Spatial relations

This chapter presents related work on spatial relations. We start by explaining spatial reasoning, then we give a short overview of the modeling of spatial relations, and finally we present three approaches to spatial reasoning which can be used in document understanding.

3.1 Spatial reasoning

Spatial reasoning is, in general, reasoning about entities occupying space. These entities can be physical entities (e.g. a table or a chair) or abstract entities (e.g. enemy territory). Spatial prepositions characterize an object's position relative to one or several other objects in space; usually one reference object is used, but sometimes more than one (e.g. between). In spatial expressions two kinds of relationships can be distinguished: static relationships (e.g. at, between) and dynamic relationships (e.g. towards, to the right of). Static spatial relationships are used to represent the original image completely without loss of information; in dynamic scenes we try to recognize and track object movements over several images.

The following kinds of spatial prepositions are distinguished [16]:

Topological prepositions: these spatial relations can be characterized without reference to an orientation of space (e.g. inside, outside, at, on, by, near or far). Sometimes they describe the source (e.g. from, out of) or the goal (e.g. to).

Projective prepositions: these spatial relations do need an orientation of space. They need a certain frame of reference that assigns the vertical and horizontal axes (e.g. left, right, above, below).

Conditions for object configurations such as distance and relative position are used to describe the spatial relations that represent the relationships between spatial entities. We refer to these entities via the corresponding objects.

Before anything can be said about a scene, a two-dimensional representation of the scene is needed. Given such a description of a scene, it is often desirable to transform it into a higher-level specification of what it is a picture of. To accomplish this we first need to consider the different objects in the picture; these objects can be retrieved using a technique called segmentation. The objects resulting from segmentation can be represented using lines, edges and regions. Between the objects in the picture several relations will exist, with various kinds of properties (e.g. shape, size or texture); two types of relations can be distinguished:

Those involving comparison of the properties of objects (e.g. larger, darker or smoother)

Those involving the relative position (e.g. above, near and to the left of)

For spatial relationships according to relative position, a list of names can be made that is exhaustive in the sense that any other kind of relation can be described using combinations of these [17]:

1. Left of
2. Right of
3. Beside (alongside, next to)
4. Above (over, higher than, on top of)
5. Below (under, underneath, lower than)
6. Behind (in back of)
7. In front of
8. Near (close to, next to)
9. Far
10. Touching
11. Between
12. Inside (within)
13. Outside

The relations mentioned above should be read as, e.g., Left of(a, b) = "a is to the left of b". There are binary but also n-ary (n > 2) relations, e.g. Between(a, b, c) = "a is between b and c".

3.2 Modeling of spatial relations and objects

Objects in the scene must have some kind of representation, and each object can have several representations. It would be inefficient to code all these representations into the names of the objects; this would lead to much overhead in the data structure and a very large knowledge base. It is a better idea for objects simply to be given names; a model is then only applicable to the relations that the model was told about, and only the object names according to their model need to be given. The objects can then be described using their properties and attributes, both geometrical (e.g. area, height, extrinsic diameter, intrinsic diameter, roundness and elongatedness) and non-geometrical (e.g. brightness, color and texture) [18].

Representations for spatial relations

Because 2D images are considered, the treatment of spatial relations can be built up from points to regions. Spatial relations between points can be considered using fuzzy sets whose membership values are often given by angular information: the spatial relations between two points are determined by the angle θ made by the line passing through the two points and the x-axis.

Figure 3.1: Relation between two points

This information can be used to define the relations left, right, above and below according to this angle.

Table 3.1: Membership values derived from the angle

More relations are possible, but this is just an example.
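As a sketch of how such angle-based memberships might be defined, here is one common choice (an assumption for illustration, not the thesis' exact membership functions), using squared cosine/sine profiles of the angle:

    import math

    # Memberships describing where point b lies relative to point a,
    # with the mathematical convention that y grows upward.
    def direction_memberships(ax, ay, bx, by):
        theta = math.atan2(by - ay, bx - ax)   # angle of the line a -> b
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        return {
            "right of": cos_t ** 2 if cos_t > 0 else 0.0,
            "left of":  cos_t ** 2 if cos_t < 0 else 0.0,
            "above":    sin_t ** 2 if sin_t > 0 else 0.0,
            "below":    sin_t ** 2 if sin_t < 0 else 0.0,
        }

    # b at 45 degrees from a: half 'right of', half 'above'.
    print(direction_memberships(0, 0, 1, 1))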

Ambiguity of spatial relations

After a data structure is present for the objects in the scene, it is necessary to decide which relations apply to those objects. It might be useful to create characteristic functions for relation pairs, e.g. ABOVE/BELOW, LEFT/RIGHT, IN FRONT OF/BEHIND. Further study has shown that some approaches sometimes turn out to be incorrect. For example:

Figure 3.2: Is A to the left of B? Is D to the right of C?

Sometimes we want to have different fuzzy models for left and right [17].

Another problem is the choice of representative points for the characterization of regions. In figure 3.3 we see three different shapes with the same height, width and center of gravity; a reference point cannot be chosen easily in this case [23].

Figure 3.3: Different shapes with the same width, height and center of gravity

3.3 Used approaches in document understanding

Past work on document understanding has used three approaches [5]: the qualitative approach; the image resolution and texture feature approach; and the formalism for document layout description and modeling approach.

Qualitative approach

This approach has been a research area within artificial intelligence and can be used to reason qualitatively about physical systems [13]. The qualitative approach uses a fixed format that is often based on the numerical description of a block in terms of two corners per block. How the document page is organized in terms of rectangular blocks can be described using a Form Definition Language (FDL). This FDL provides functions for programming, with the following steps to determine the values of variables:

a) Extract rectangles
b) Sort the rectangles
c) Group the rectangles
d) Calculate area and join small rectangles
e) Resolve the variables

Qualitative descriptions provide the ability to reason with incomplete and weak information [13]. Using weak descriptions of variables and relationships, this approach can derive many significant deductions. When additional information is available it can reduce the amount of pre-analysis. Another reason to use this approach is that humans often reason qualitatively about their environment: without knowing the laws of motion, humans can qualitatively deduce that a struck ball will roll forward and eventually come to a halt. Because it is hard to work with qualitative information, the quantitative approach is often used as well. The quantitative approach tries to capture the abstract spatial information, which is not normally given in numerical expressions, in terms of quantities, which makes the information less abstract. Therefore the quantitative approach is often used together with the qualitative approach.

Image resolution and texture feature approach

This approach can be used to classify the segmented blocks into categories (e.g. text with large letters, text with small letters and line drawings). Image texture modeling has been used with great efficiency to recognize objects with specific layouts (e.g. newspapers). A disadvantage of this approach is that it is quite difficult to abstract these properties so they can be used for generic objects. To use this method, a great deal of human knowledge of documents must be encoded. The information modeled from human knowledge is more universal; we know, for example, that the title of a page usually appears somewhere near the top of the page.

Formalism for document layout description and modeling approach

In this case the model is realized using a tree structure that describes the document layout. The tree consists of different layout abstraction levels. This is done in the following way:

The document is scanned.
The binary image is segmented into components (e.g. characters, text lines, text blocks and graphics).
The document image is segmented along the horizontal and vertical axes.
The identified layout components are mapped into a hierarchical data structure.
Using stepwise refinement, the syntactical structure of the document is established.

Using this method on a wide variety of document classes would force us to use very large knowledge bases; search time and the size of the database increase significantly.


Chapter 4 Fuzzy Logic

In this chapter we will give a short introduction to Fuzzy Logic.

4.1 What is Fuzzy Logic?

Fuzzy theory is a mathematical concept first introduced by Prof. L.A. Zadeh in 1965. The theory is based on the fuzziness in our reasoning and tries to capture this within the framework of set theory. The term fuzzy takes in an aspect of uncertainty: fuzziness is the ambiguity that can be found in the meaning of a word or the definition of a concept. In human speech we express ourselves in terms like "an old woman", "a tall man" or "a high number"; with fuzziness people can express the uncertainty that is part of the meaning of these terms. So, by using fuzzy logic we would be able to reason in logic the same way humans do.

4.1.1 Fuzzy sets

The difference between standard (crisp) sets and fuzzy sets is that the membership functions of fuzzy sets can take any real value between 0 and 1, whereas standard (crisp) sets use clearly defined boundaries. An element x can be either in a standard set or outside it; a standard set uses the characteristic function, defined as follows:

χ_E(x) = 1 if x ∈ E, and χ_E(x) = 0 if x ∉ E

The characteristic function for crisp sets is thus expressed by χ_E : X → {0, 1}.

Fuzzy set theory defines the degree to which an element x is included in a subset, in contrast to standard sets where an element is either in or outside the set. This degree is 0 if the element is not a part of the subset and 1 if the element is totally a part of the subset. The function that gives the degree to which an element is included in a fuzzy set is called the membership function μ (0 <= μ <= 1) and is expressed by:

μ : X → [0, 1]

For example, if the grade of membership of an element x in an area A is 0.4, this is expressed by μ_A(x) = 0.4: the element belongs to area A to a degree of 0.4.

Membership functions are extensions of characteristic functions, and fuzzy sets are extensions of standard sets; because we are talking about extensions, some characteristics of fuzzy sets may be lost (or differ) in comparison with standard sets.

Mapping a parameter to a membership function is usually done using a graphical representation; figure 4.1 shows an example for a parameter called temperature.

Figure 4.1: Fuzzy membership function for 'warm'

Membership functions can be defined in different shapes; the most commonly used shapes can be found in figure 4.2. The shape of the membership function depends on the area in which the set is used; the crisp shape, for example, can be used to convert fuzzy sets into standard (crisp) sets.

Figure 4.2: Trapezoid, triangle, Gauss, singleton and crisp membership functions

Fuzzy variables are called linguistic variables; categories describing the same context are combined in one linguistic variable. An example of such a linguistic variable is temperature, where we can distinguish between cold, warm and hot temperatures; these are called linguistic terms. Linguistic terms represent the possible values of a linguistic variable. In figure 4.3 the membership functions for the linguistic terms of the linguistic variable temperature are shown.

Figure 4.3: The fuzzy linguistic terms for the linguistic variable temperature
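A sketch of a trapezoidal membership function and the linguistic variable temperature (the breakpoints are illustrative assumptions, not values read from the figure):

    # 0 below a, rising a..b, 1 between b..c, falling c..d, 0 above d.
    def trapezoid(x, a, b, c, d):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

    temperature = {
        "cold": lambda t: trapezoid(t, -40, -39, 5, 15),
        "warm": lambda t: trapezoid(t, 5, 15, 20, 30),
        "hot":  lambda t: trapezoid(t, 20, 30, 60, 61),
    }
    print({term: f(18) for term, f in temperature.items()})
    # 18 degrees is fully 'warm', not at all 'cold' or 'hot'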

4.1.2 Operations on fuzzy sets

Because fuzzy sets use membership functions and standard sets use characteristic functions, the operations on fuzzy sets are defined differently; but just as with standard sets, it is possible to calculate the intersection, union or complement of fuzzy sets.

Intersection

The intersection of two fuzzy sets A and B, A ∩ B, is defined by the following membership function:

μ_{A∩B}(x) = μ_A(x) ∧ μ_B(x)

To calculate this intersection we take the minimum (∧-operator) of the two sets.

Union

The union of fuzzy sets A and B, A ∪ B, is the fuzzy set defined by the following membership function:

μ_{A∪B}(x) = μ_A(x) ∨ μ_B(x)

This union can be calculated by taking the maximum (∨-operator) of the two sets A and B.

Complement

The complement of a fuzzy set A is the fuzzy set defined by the following membership function:

μ_{Ā}(x) = 1 - μ_A(x)

In figure 4.4 a graphical representation of the membership functions for intersection and union is shown.

Figure 4.4: Fuzzy intersection (AND) and union (OR)
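A minimal sketch of these three operations, representing a fuzzy set as a dictionary from elements to membership degrees:

    def fuzzy_intersection(a, b):
        return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

    def fuzzy_union(a, b):
        return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

    def fuzzy_complement(a):
        return {x: 1.0 - degree for x, degree in a.items()}

    A = {"x1": 0.2, "x2": 0.8}
    B = {"x1": 0.5, "x2": 0.4}
    print(fuzzy_intersection(A, B))  # x1 -> 0.2, x2 -> 0.4 (minimum)
    print(fuzzy_union(A, B))         # x1 -> 0.5, x2 -> 0.8 (maximum)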

There are many other operations on fuzzy sets; the interested reader should refer to [24] for more information.

4.1.3 Fuzzy relations

Just as fuzzy sets can be explained as an extension of standard sets, as shown in the previous section, fuzzy relations can be explained as an extension of standard relations, and some characteristics may be lost (or differ) in comparison with standard relations.

The fuzzy relation R between a set X and a set Y is a fuzzy set on the direct product

X × Y = {(x, y) | x ∈ X, y ∈ Y}

The fuzzy relation R is characterized by the membership function μ_R : X × Y → [0, 1].

Using fuzzy relations we can reason about certain systems (e.g. fuzzy control, fuzzy diagnosis and fuzzy expert systems). Consider the following inference scheme:

a) If x is A then y is B
b) x is A'
Conclusion: y is B'

The following is an example of fuzzy reasoning along these lines:

a) If a tomato is red then the tomato is ripe.
b) This tomato is very red.
Conclusion: This tomato is very ripe.

Fuzzy reasoning takes input parameters and relates these to a set of output parameters; these output parameters can be used as input parameters for other fuzzy relations. The degree to which the output parameter applies depends on which operations are used between the input parameters.

When an AND operation is used between the input parameters in the IF expression, the degree to which the rule counts is calculated by taking the minimum of all inputs; for example, in the rule "If A and B then C", C will count for MIN(A, B). The same holds for OR operations, but in this case the MAX operation is used to calculate the degree to which the output parameter applies.

There are many other methods to calculate the degree to which the output parameter applies, an example being the weighted sum over all inputs; for more information see [24].
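A sketch of this min/max rule evaluation:

    def fire_and(*degrees):
        # Degree to which "IF A AND B THEN C" fires: the minimum.
        return min(degrees)

    def fire_or(*degrees):
        # Degree for an OR of antecedents: the maximum.
        return max(degrees)

    degree_a, degree_b = 0.7, 0.4
    print(fire_and(degree_a, degree_b))  # "If A and B then C": C counts for 0.4
    print(fire_or(degree_a, degree_b))   # "If A or B then C":  C counts for 0.7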


Chapter 5 The use of fuzzy logic and spatial relations

This chapter states some arguments for using fuzzy logic and spatial relations.

5.1 Fuzzy logic

When looking at a document page from such a distance that the text is no longer readable, many things can still be said about the layout of the document. Although the contents of the document cannot be seen, we can often reason about its layout simply by looking at the size of the blocks, their position on the page and their relation to other blocks. Take figure 5.1, for example.

Figure 5.1: The rectangles around the text blocks (TITLE, AUTHORS, ABSTRACT HEADING, ABSTRACT TEXT)

Only the rectangles around the text blocks are visible in this figure; for readers who often read technical papers it is obvious that this is part of the title page of a technical paper.


The first block at the top of the page is most probably a title, because the block is at the top of the page and centered, and the block is both wide and high.

The block below it would then be an authors block, because author information is in many cases below the title. Furthermore, the block's height is not very small, so more than one author can fit in this rectangle.

After the block with the authors we see a very small block with the size of a small heading; this is probably an abstract heading, because it is the first heading below the authors block.

Below the abstract heading is, of course, an abstract text; apparently this is the case because the block below the abstract heading has the size of a paragraph.

In this example the only information available about the document is the rectangles around the text blocks. Still it is possible to say many things about the layout of the document. Apparently it is not always necessary to use extra information (e.g. OCR) to recognize the layout structure of documents.

How can the knowledge we have about the layout of documents be translated into an automated system? The approach used in this report is fuzzy logic with spatial relations. Readers who are not familiar with spatial relations and/or fuzzy logic are advised to read chapters 3 and 4.

The reason to use fuzzy logic is the way humans reason about documents. In the example it could be seen that the arguments used to recognize a block are in fuzzy terms: "small width", "at the top of the page", "very high" etc. Another advantage of fuzzy logic is that the blocks are not restricted to a bounded area: if a block is a little bit outside the area "on top of the page", then the block's value for "on top of the page" would be "a little bit on top of the page" instead of "not on top of the page".

5.2 Spatial relations

When humans look at a document they can take in the document and all its blocks at once. They use spatial relations between all the blocks on the document page to recognize the layout of the document. As could be seen in the example, this is also what we did: the authors block was labeled as an authors block because a title block was above it, and 'above' is a spatial relation. Spatial relations have no sharp boundaries for the applicability of the spatial expressions [16]; therefore it is hard to quantify spatial relations using standard mathematics. A good method for quantifying human judgements in a mathematical model is fuzzy logic. According to [18], properties and spatial relations based on fuzzy set theory, coupled with fuzzy segmentation, will yield realistic results. Using fuzzy logic with the qualitative approach to spatial relations retains both the topology and the geometrical attributes of the block regions [17]. Therefore spatial relations and fuzzy logic are combined.

As was mentioned earlier, humans can take in the whole document at once. This cannot be accomplished with an automated system: looking at all the blocks at the same time to get a global idea about the document would make the system too complex, because the number of relations between the blocks grows rapidly with every block on the document page.

Figure 5.2: Example of business card recognition (a scanned card with blocks labeled company logo, name, function and address)

This idea of fuzzy logic and spatial relations can also be applied to the recognition of the layout structure of business cards (see figure 5.2). This has been done as a practical assignment for students of the Rijksuniversiteit Groningen. First the business cards are scanned and segmented using rectangular regions; for these regions spatial relations can be derived, like 'a function is always below a name'. The fuzzy system we have built is based on this idea; in the next chapter we describe our fuzzy system.


Chapter 6 System overview

In this chapter an overview of the system is given; each part of the system is discussed briefly.

6.1 Overview of the system

In figure 6.1 a schematic overview of the system is given: on the left side are the inputs and on the right side the outputs; in between is the fuzzy rule base, which contains a number of IF-THEN rules.

Figure 6.1: Overview of the system (inputs: position, size, distance, neighbour; in between: the fuzzy rule base; output: label)

In the following sections each of the three parts of the system is discussed briefly; a detailed discussion can be found in the next chapter.

6.2 Inputs of the system

First, documents are scanned at a resolution of 300 dots per inch, one page at a time. There is no mechanism for skew correction, so the documents must be scanned straight, although a little skew is no problem because segmentation is done by hand. Further, it is assumed that the scanned document page is the entire page, that is: the entire page including the margins is part of the scanned image. The segmentation of the pages is done by hand with an application called Metrics; the Run Length Smoothing Algorithm [6] is used to support the segmentation. The main problem with segmenting by hand is the choice of whether or not to take two pieces of text together. As was mentioned in section 2.5, the problem with segmentation algorithms is the choice of the thresholds, and this is also the problem with hand segmentation. Finding the right thresholds is an entirely different problem and not part of this research; therefore we kept the following rule at hand: if we are not certain how to segment a piece of text on the document, we assume (for convenience) the best segmentation. In other words, we assume an ideal segmentation.

The input document pages have gone through the process of area labeling, and we assume that for each block the type of data of its contents is known; we have the following data types:

Text

Table

Figure

Line

The reason for choosing these four data types is that much research has been done in the area of area labeling, and these data types can be found using existing area labeling techniques.

An example of a page after it has been segmented and has gone through the process of area labeling is shown in figure 6.2.

Figure 6.2: A page after segmentation and area labeling

Each block on the page is input into the system one block at a time, in top-to-bottom order. For each block a number of properties are known; these are the actual inputs to the fuzzy rule base. There are four main properties: the position of the block, the size of the block, the distance to neighbouring blocks and the neighbouring blocks themselves (see figure 6.1).
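A sketch of how such labeled, segmented blocks might be fed to the rule base one at a time, in top-to-bottom order (field names are illustrative; 'datatype' is the result of area labeling, and rule_base() is a placeholder for the fuzzy rule base described in the next chapter):

    def rule_base(block):
        return "paragraph"  # placeholder label

    blocks = [
        {"x": 400, "y": 100, "width": 1300, "height": 120, "datatype": "text"},
        {"x": 150, "y": 300, "width": 800, "height": 600, "datatype": "figure"},
    ]
    for block in sorted(blocks, key=lambda b: b["y"]):  # top to bottom
        print(rule_base(block))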
