Supporting OCR Impact Estimation on the Retrieval of the First Mention of A Concept in Historic Newspaper Archives


Supporting OCR Impact Estimation on the Retrieval of the First Mention of A Concept in Historic Newspaper Archives

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Xiaoli Li
10874968

MASTER INFORMATION STUDIES
HUMAN-CENTERED MULTIMEDIA
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

October 6, 2015

1st Supervisor: MSc. Myriam Traub
2nd Supervisor: Dr. Lynda Hardman


Supporting OCR Impact Estimation on the Retrieval of the

First Mention of A Concept in Historic Newspaper Archives

Xiaoli Li

Human-Centered Multimedia, Information Studies

University of Amsterdam, Netherlands

xiaoli.li@student.uva.nl

ABSTRACT

Digitized historical newspaper archives play an increasingly important role in humanities studies. However, Optical Character Recognition (OCR) errors significantly influence the dependability of search results. Being aware of such issues is essential when conducting specific humanities research tasks. In this paper, we describe the early phase of design and evaluation of a user interface that supports estimation of the impact of OCR errors on a keyword retrieval task: finding the first mention of a concept. In order to gain insights into users’ workflow and probable acceptance of the interface design, we conducted simulated tasks and semi-structured interviews with three potential users. Focusing on users’ problem solving process, we set requirements and created the prototype interface. Our proposed design was evaluated with five humanities researchers in an informal usability test. Our findings indicate that information for users to estimate the impact of OCR errors is needed. Providing OCR misspellings of the original query is useful for clearly understanding OCR issues, as well as for suggesting OCR misspellings for future tasks. The suggested way of showing OCR misspellings was confirmed by the users and seen as easy to understand. The chronological visualization can be difficult to understand but can help users estimate the impact of OCR errors. It was considered background information; interesting to know but not directly necessary.

1. INTRODUCTION

Digitization of archives opens a wealth of information for humanities researchers. In recent years, long-term projects developing searchable databases for nationwide historical newspapers have been set up in many countries. Chronicling America (http://chroniclingamerica.loc.gov), for instance, has digitized more than one million pages, and the project intends to digitize all surviving historical newspapers in the United States within a few years. Similar projects and achievements can be found all over the world: Delpher (http://www.delpher.nl) in the Netherlands, Trove (http://trove.nla.gov.au) in Australia and PaperPast in New Zealand. In general, digital newspaper archives have tremendously improved access to historical sources, which are no longer bound to a physical location.

A major issue in historical digital archives is the quality of the searchable text obtained. Though state-of-the-art OCR has achieved satisfactory results for modern prints, many studies show its limitations when processing historical materials [2, 8] – it is technically impossible to recognize all characters with 100% accuracy [6]. Additionally, OCR quality is influenced significantly by inconsistencies in typography and the original source’s physical condition [4, 11]; therefore, from a long-term perspective, older documents are more likely to produce much lower data quality.

This does not mean digital archives cannot be useful to humanities researchers, who see the OCR problem as “a serious obstacle” [13]. If good quality means “enough for the organization to [...] make reasonable decisions” [7], we may assume that sufficient support provided by the system can help users’ decision making, even if the data quality remains imprecise. Nevertheless, on today’s historical digitized platforms, we found no further support, such as measurements of OCR accuracy or impact, that might benefit end-users when conducting professional research tasks.

Theoretically, potentially feasible solutions to measure OCR data quality exist [4]. In practice, however, we lack the data needed for such quality measurement. The digitization of an entire large archive usually takes years, during which period the OCR software used is updated or its details are unclear. Take Delpher, for example: the OCR process was outsourced and “not all details of this process known to the library” [13]. Therefore, the current Delpher system is not capable of providing an estimation of data accuracy.

Another way of supporting users’ decision making is to estimate how OCR errors influence the search results. Compared to data quality, the impact of OCR errors can be assessed more realistically with current technologies, and is likely to be found useful by users. Nevertheless, there is no on-line archive or research tool that provides such a feature. Therefore, we explore how estimation of OCR impact might support experts in their work. Particularly, we concentrate on a typical humanities research task [13]: finding the first mention of a concept.

Our research is based on the digital newspaper archive of the National Library of the Netherlands. The research question we pose is:

RQ: Can we support humanities researchers in estimating the impact of OCR errors on the retrieval of the first mention of a concept?

This project was organized in two phases. In the first phase, we conducted a preliminary study (Section 3) to understand how humanities researchers estimate the reliability of search results using Delpher, and the methods experts currently use to deal with OCR errors. Insights gained from users allowed us to identify requirements (Section 4) and make design choices (Section 5). In the second phase, in order to test our design, we conducted an evaluation study (Section 6) using a prototype interface with five humanities researchers. Based on the findings from the evaluation study, we drew our conclusions.

2. RELATED WORK

To our knowledge, no current OCR platform provides convenient tools for estimating OCR errors. In this section, we discuss the work done by archives to improve awareness of OCR issues, and tools related to our project in the domain of digital humanities.

Many on-line archives draw users’ attention to the OCR problem by placing a sentence or an icon on their websites. For instance, Delpher warns users at the article level, and further information can be accessed via a link (Figure 1). Trove not only shows such a warning, but also allows users to contribute text corrections in a crowd-sourcing style. However, when querying a keyword, none of the known archives warns that the result list may be incomplete due to OCR issues, not to mention providing a guide for users to retrieve related but unrecognized data.

Figure 1: Notes that raise user awareness of uncertain data quality on a single page (Delpher)

Digital libraries not only play an important role in providing access to resources, but are becoming “tools to service and enable scholars’ research” [12]. An increasing number of applications that enhance the working environment for specific research tasks are being integrated with comprehensive digital libraries. Google Ngram Viewer (Figure 2) is a popular example. It displays a graph showing how often queried phrases occurred in a corpus of books. By looking at the graph, users can explore the frequency of certain phrases and thus learn about the development of words and expressions throughout the years. Tools can be found in digital newspaper archives as well. Figure 3 shows an overview of newspapers in Delpher. By filtering on newspaper titles, the number of particular newspapers can be found easily.

Figure 2: Google Ngram Viewer

Figure 3: Krantenoverzicht: An overview of newspapers in the Delpher database

3. PRELIMINARY STUDY

While a general understanding of search tasks in digital libraries can be found in the literature [1, 3, 14], there is little information on the workflow of the specific task “finding the first mention”, or on users’ opinions of current research tools. Such information can be collected only with humanities researchers’ help, because of the professional character of the task. In order to fill this information gap and investigate user requirements regarding the estimation of OCR impact, we conducted a preliminary study.

In order to open the conversation quickly and directly, we gave tasks to participants and observed how they dealt with OCR errors. Besides, to gain insights from users’ criticism, we also showed initial designs (paper prototypes) that illustrate potential solutions, based on our understanding of the problem-solving process at the time. Although such mock-ups show only simulated information, which reduces realism, they are “nevertheless very useful in the early stages of design” [15].


3.1 Setup

We set up experiments with three humanities professionals at their individual workplaces. One was a Dutch history PhD student (P1, age = 29), one a senior reporter (P2, age = 57) and one an interdisciplinary digital humanities PhD student (P3, age = 26). All had knowledge of the Dutch language (two of them were native speakers) and had no difficulties using a computer, enabling them to carry out the test. Our participants were recruited by email invitations. To observe their awareness of OCR issues, no clear introduction to the purpose of our research was given prior to the meeting.

During the meeting we gave the participant three tasks. All tasks were printed on paper to ensure careful reading, and participants were given enough time to complete each task (they decided themselves when to stop). We used the tasks not to evaluate users’ ability to deal with OCR errors, but to create an opportunity to observe users’ behavior and preferences. A user usually cannot say specifically what part of a system should be improved [3], but by observing the fulfillment of a search task, it is possible to summarize the requirements. At the same time, we could open a conversation around the topic. The tasks and their specific purposes are listed in Table 1. Participants were encouraged to “think aloud” [14] and their spoken words throughout the process were recorded.

Figure 4: OCR background information given to test participants.

3.2 Key Findings

3.2.1 Information Seeking Process

In order to design useful interfaces, we need to understand the user’s problem solving process (the information seeking process [5]). The information seeking process is an interactive cycle consisting of (1) recognition of the problem, (2) activities of query specification and (3) examination of retrieval results, repeated until a satisfactory result is found [9, 10]. We observed that a user’s workflow for finding the first mention in an OCRed archive can be summarized as follows:

Recognition of the Problem. Task 2 let us observe how researchers conduct their search after they have understood the possible influence of OCR errors. We learned that all participants tried to search with more keywords related to the given word “Amsterdam”. They expanded their keyword lists in two ways: (1) Participants wrote down, or considered in their minds, several possible misspellings of “Amsterdam”, based on their background knowledge and experience of how a letter might be misspelled. In addition, one participant tried the old version of “Amsterdam”, seeing it as a chance to find earlier mentions in the archive. (2) Wild-card characters (“*”) were used to expand the search, with different letters in “Amsterdam” being replaced.

Activities of Query Specification and Examination of Retrieval Results. This step closely combined searching and comparing. Participants searched several words (including the original word “Amsterdam”) one by one using the Delpher query box, sorted the search results of each query, looked at the earliest date appearing in the result lists, and compared it with the previous results. They then wrote down the new date if it was an earlier mention than the ones they had already written down.

Confirmation of the Result. Within 10 minutes, all participants gave one date and indicated they had finished the task. The two participants (P1, P3) who used wild-card characters were confident about their results – when asked to what extent they believed the result was true, they gave a four out of five. P2 did not use wild-card characters and was aware that he had tried only a small set of possible misspelled forms. “You can never be sure”, he said of the uncertain accuracy of the result, and explained that if he needed precise results, it would be necessary to go to the archive and browse physical newspapers around the date he had just found.

3.2.2 Query Expansion as an Approach

The earliest mention of “Amsterdam” found in the Delpher database appeared in the year 1618. None of our participants “correctly” found it, even with the help of the wild-card.

Evidently, participants used a personal “misspelling list” to deal with OCR errors. The accuracy of the result relied on their prior experience in choosing misspelled letters and on the number of misspelled forms they queried. This indicates that, given unlimited time, a user would probably make a “correct” query and find the earliest mention in the result list.

None of our participants could guarantee that their findings were correct, but this does not mean they did not trust their results. With repeated searching and checking they might gain enough confidence to make a decision, especially when they know the search covered many possibilities.

4. USER REQUIREMENTS

Findings from the preliminary study indicated that an interface facilitating query expansion is needed for the task “finding the first mention”. First, it is a method intuitively used by users themselves. Second, by looking at the misspellings, users understand the OCR issues better, and they might be able to estimate the influence of the errors. Around the idea of query expansion, we list the user requirements, which would guide the user interface design.


TASK 1
Task: Find the first mention of “Amsterdam” in a Dutch newspaper, using Delpher, and write down the finding.
Purpose: Help users get used to the platform. This task could be skipped if the participant was an experienced Delpher user and aware of the impact of OCR errors.

TASK 2
Task: (1) Read a brief introduction to OCR technology and an informal confusion table of misspelling forms caused by OCR errors (see Figure 4). (2) Find the first mention of “Amsterdam” in Delpher and write down the finding again.
Purpose: (1) Make sure that the user understands that OCR errors can produce incorrect spellings. (2) Observe users’ workflow dealing with such issues in their own way.

TASK 3
Task: Showing the paper prototype, we gave a short oral description of the design. The participants were asked to give feedback on our design (what do you think about the functions provided by the interface, how much do you trust the results found with this interface, can you estimate how data quality is influenced by OCR errors?).
Purpose: (1) Capture users’ judgements and personal considerations in dealing with the OCR issues. (2) Explore users’ opinions and expectations about the solution.

Table 1: Tasks in the preliminary study

Features in this section reflect our approach to dealing with OCR impact estimation on the retrieval of the first mention of a concept. The regular functionalities of a search interface (e.g. searching, sorting, filtering by newspaper title) are left out, since they are not specific to querying the first mention.

4.1 Misspelling Generation

When searching for the first mention of a concept, a user needs as many hits as possible to make sure no earlier ones were missed. We observed that all users tried to search for various OCR forms of the original query to extend the search manually. Such methods have some obvious disadvantages. First, it is unlikely one can generate a misspelling list covering most of the frequent OCR confusions based on personal experience or observation alone. Second, though wild-cards can be helpful, the number of search results might be unnecessarily large because of false positives.

To improve on these manual methods of misspelling generation, users need a list of misspellings based on known OCR confusions (an example OCR confusion table can be found in Figure 4). In addition, the interface should show the hits found by all misspellings, so that the user does not need to query manually.
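To make the generation step concrete, the following is a minimal sketch, our own illustration rather than the prototype's code, of deriving misspelling candidates from a confusion table. The confusion entries shown are hypothetical examples, and only single substitutions are generated; a full implementation would also compose multiple confusions (e.g. “amflerdam” combines s→f and t→l).

```javascript
// Sketch of misspelling generation from an OCR confusion table.
// The entries below are hypothetical examples, not the table actually
// derived from the Delpher ground truth.
var confusions = { "s": ["f"], "t": ["l", "i"], "m": ["rn"] };

function generateMisspellings(query) {
  var variants = [];
  for (var i = 0; i < query.length; i++) {
    (confusions[query[i]] || []).forEach(function (sub) {
      // Replace the character at position i by a known OCR confusion.
      variants.push(query.slice(0, i) + sub + query.slice(i + 1));
    });
  }
  return variants;
}

console.log(generateMisspellings("amsterdam"));
// -> ["arnsterdam", "amfterdam", "amslerdam", "amsierdam", "amsterdarn"]
```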

4.2 Visualizing Occurrence of Misspellings

Delpher gives a notification about the influence of OCR and provides general background information about its technical limitations. However, there is a lack of detailed support, such as confusion tables or how frequently certain misspellings occurred. Considering that such information might help users to make decisions for their research tasks, we assumed that a visualization for estimating OCR influence is needed. Looking at the visualized information, users can analyze the existing information and, further, estimate the OCR impact. However, this requirement was built on an assumption and needed to be tested in the evaluation study.


4.3 Quickly Browsing

“Researchers engage in browsing behaviour more than searching” [1]. We also found a requirement for tools that support quickly browsing the original newspaper of a result item. We observed that users always go to the original newspaper to make sure that the document is correctly related to their query. In Delpher, they have to click and open every item on the result list and find the keyword in the original scanned newspaper; if the item is a false positive, they have to go back to the result page and continue checking. For instance, a search result for the query “Amsterdam” might not really show a newspaper including the word “Amsterdam” – on the original material it might say “Amstel” – and this can only be checked when users see the original newspaper. Such a process is necessary, but should be simplified.

5. USER INTERFACE

We provided a possible solution that fulfills the user requirements. In this section, we describe the design and implementation of the prototype interface.

5.1 Design Description & Rationale

Our suggested interface consists of three main areas (see Figure 5): [A] the visualization of the occurrence of OCR misspellings, [B] an interactive misspelling list, and [C] the result list.

Figure 5: The prototype user interface

5.1.1 Misspelling Generation

To expand the original query, we provide an interactive misspelling list (Figure 5-[B]). Words in this list are of three types, distinguished by color: the original query (green), the misspellings that get hits (light blue), and the misspellings that get no hits (dark blue). Initially, only the original query, which in our evaluation was “Amsterdam”, is checked; [A] then shows only the green bars and [C] gives only the results found by “Amsterdam”.

To assure flexibility, the original query and the misspellings that get hits can be added to or removed from the selection. In front of each word, there is a check box that the user can enable and disable; to the right of the word, there is a bar that indicates how many hits the particular word found. The misspellings that get no hits are listed for system transparency, though their check boxes are always disabled. If the user enables the original query, its check box and bar turn green, corresponding to its percentage in the visualization ([A]); for misspellings that got hits, their bars and check boxes turn blue when selected. At the same time, [A] and [C] change correspondingly, as the sketch below illustrates. By interacting with the misspelling list, the user can gain information about how misspellings are influencing the search results.
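The sketch below illustrates this wiring using jQuery, as the prototype did; the markup, class names and the two update helpers (`updateChart`, `updateResultList`) are our assumptions, not the prototype's actual code.

```javascript
// Sketch: a change to any misspelling check box drives both the
// visualization [A] and the result list [C]. Markup and helpers assumed.
$(".misspelling-checkbox").on("change", function () {
  var selected = $(".misspelling-checkbox:checked").map(function () {
    return $(this).val(); // e.g. "amsterdam", "amfterdam"
  }).get();
  updateChart(selected);      // redraw the percentage bars in [A]
  updateResultList(selected); // refilter and re-sort the hits by date in [C]
});
```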

By clicking the “info button” at the top, users can read background information on how this misspelling list is generated in a pop-up box.

Figure 6: An item on the result list

5.1.2 Visualizing Occurrence of Misspellings

In order to illustrate the OCR impact, we chose to visualize the occurrence frequency of the original query and its misspellings through the years. By comparing their percentages of all hits, the user can directly recognize how misspellings influence the search results.

As shown in Figure 5-[A], the green bars show the percentage of mentions of the original query, which in our evaluation study was “amsterdam”; the light blue represents the selected words “amfterdam” and “amflerdam” in the misspelling list ([B]); the dark blue stands for all other results found by the remaining misspellings that got hits. For example, in the 1810s, the original query “amsterdam” yielded roughly 75% of the total search results, while 20% were found by “amfterdam” and about 5% of the results by other misspellings. The values shown in the visualization are controlled by the misspelling list.

Regarding the choice of percentages rather than absolute counts, we did not find a clear preference among the experts in the design process, but we had two reasons for making that temporary decision. First, by looking at the ratio of the original query, the user might directly perceive the OCR impact, which we considered the main purpose of the visualization. Showing the absolute counts of different misspellings may have the same effect, but the counts vary significantly and would not be easy for users to read. Second, we learned that Google Ngram Viewer uses percentages to illustrate the occurrence frequency of words in historical books, and it is a clear way of comparing the frequency of different words. This choice was to be tested in the coming evaluation study.
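The per-decade percentages could be computed roughly as in the sketch below; this is our own illustration, under the assumption that each hit in the JSON file carries a year and the term that matched it, which the paper does not spell out.

```javascript
// Sketch: per-decade shares of hits for the stacked bars in [A].
// Assumes `hits` is an array like [{ year: 1812, term: "amfterdam" }, ...].
function decadeShares(hits, originalQuery, selectedTerms) {
  var byDecade = {};
  hits.forEach(function (h) {
    var decade = Math.floor(h.year / 10) * 10;
    var d = byDecade[decade] ||
            (byDecade[decade] = { original: 0, selected: 0, other: 0, total: 0 });
    if (h.term === originalQuery) d.original++;                   // green
    else if (selectedTerms.indexOf(h.term) !== -1) d.selected++;  // light blue
    else d.other++;                                               // dark blue
    d.total++;
  });
  Object.keys(byDecade).forEach(function (k) {
    var d = byDecade[k]; // convert counts to fractions of all hits per decade
    d.original /= d.total; d.selected /= d.total; d.other /= d.total;
  });
  return byDecade; // e.g. { "1810": { original: 0.75, selected: 0.20, ... } }
}
```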

5.1.3 Quickly Browsing

In the result area (Figure 5-[C]), search results are initially sorted by date, for users’ convenience in finding the first mention. Each result item (Figure 6) carries information about the newspaper article, such as the date of publishing, the title of the article and the context of the text. In particular, to support browsing, the interface shows an image that highlights the word in its original newspaper. Therefore, the user does not need to open any link to examine whether the document is really relevant to the original query.

5.2 Implementation

5.2.1 Infrastructure

The implementation of the prototype interface (available at http://xiaoli-li.com/evaluation/result.php?amsterdam) uses HTML, CSS and JavaScript. The animation and visualization rely on the D3 framework and the JavaScript library jQuery. The prototype was tested on the Chrome browser (version 45.0). To simplify the development, we did not make the interface work for queries other than the keyword used in our evaluation study. The information shown on the interface was provided by a JSON file which was custom generated and only contained data related to the evaluation tasks. The highlighting of keywords on images was implemented based on position information in the OCR data (XML files).
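The paper does not show the highlighting code itself; the sketch below illustrates the idea, under the assumption that the OCR XML provides ALTO-style word coordinates (HPOS/VPOS/WIDTH/HEIGHT), drawn onto a canvas overlay scaled to the displayed image.

```javascript
// Sketch: highlight every occurrence of `word` on a page image, assuming
// ALTO-style <String CONTENT=".." HPOS=".." VPOS=".." WIDTH=".." HEIGHT="..">
// elements in the OCR XML. `ctx` is the 2D context of a canvas overlay and
// `scale` maps OCR coordinates to displayed-image pixels.
function highlightWord(xmlDoc, word, ctx, scale) {
  var strings = xmlDoc.getElementsByTagName("String");
  ctx.fillStyle = "rgba(255, 255, 0, 0.4)"; // translucent yellow box
  for (var i = 0; i < strings.length; i++) {
    var s = strings[i];
    if ((s.getAttribute("CONTENT") || "").toLowerCase() !== word) continue;
    ctx.fillRect(
      Number(s.getAttribute("HPOS")) * scale,
      Number(s.getAttribute("VPOS")) * scale,
      Number(s.getAttribute("WIDTH")) * scale,
      Number(s.getAttribute("HEIGHT")) * scale
    );
  }
}
```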

5.2.2 Data

The data used in the interface was a small subset of the Delpher data. It contained 265 newspaper pages (265 OCR XML documents, meta-data of the newspapers in separate csv files, and their original scanned images), sampled from a wide time range in the Delpher database. 134 of these had corresponding ground truth and, based on this, a confusion table was made in an earlier project [13].

Our data processing is illustrated in Figure 7. Having received a query, which was “Amsterdam” in our case, the system generates a misspelling list based on the OCR confusion table. Then the original query and its misspellings are searched in the 265 newspaper pages. All hits were aligned with their relevant meta-data, and the outcome was a JSON file containing the information we need for the evaluation tasks, which can be found in the following section. The data preparation was done using PHP and MySQL.

Figure 7: Data Processing
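The data preparation itself was done in PHP and MySQL; purely as an illustration of the pipeline shape in Figure 7, the sketch below restates it in JavaScript. The shapes of `pages` and `metadata`, and the reuse of `generateMisspellings` from the earlier sketch, are our assumptions.

```javascript
// Sketch of the Figure 7 pipeline: query -> misspellings -> search -> JSON.
// The actual preparation used PHP and MySQL; this is only an illustration.
function buildEvaluationJson(query, pages, metadata) {
  // `pages`: [{ id: "p001", text: "..." }, ...] for the 265 OCR documents.
  // `metadata`: { p001: { date: "1618-06-14", title: "..." }, ... } from csv.
  var terms = [query].concat(generateMisspellings(query));
  var hits = [];
  pages.forEach(function (page) {
    terms.forEach(function (term) {
      if (page.text.toLowerCase().indexOf(term) !== -1) {
        var meta = metadata[page.id]; // align hit with newspaper meta-data
        hits.push({ term: term, pageId: page.id, date: meta.date, title: meta.title });
      }
    });
  });
  hits.sort(function (a, b) { return a.date < b.date ? -1 : 1; }); // earliest first
  return JSON.stringify(hits); // the file loaded by the prototype interface
}
```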

6. EVALUATION STUDY

Our goals were to evaluate the requirements defined in Section 4, and to explore whether the user interface meets the requirements and can be considered useful. The evaluation was a qualitative exploratory study.

Since finding the first mention is a specific research task conducted by professionals, we required humanities experts to participate in our evaluation. We browsed the websites of different institutions, reading the introductions of many humanities researchers, making sure they knew Dutch and that their studies concerned the Netherlands after the seventeenth century (so that they might use historical newspapers). In the invitation emails, we also asked whether the expert uses digitized archives in their work. And to ensure that they were aware of OCR issues before the meeting, we provided detailed information about the project background, and gave OCR confusions as examples.

6.1 Setup

We visited five PhD students individually (in years 2-4 of their PhD, average age 28) at the University of Amsterdam and Leiden University for a task-based semi-structured interview. They were experienced professionals and used Delpher for their own research. In the evaluation, each of them was asked to carry out five tasks and answer questions related to the tasks. The evaluation had two phases:

Introduction. We first read out the general information about the evaluation and what we expected, and introduced participants to the think-aloud method. After the participant signed a consent form and filled in personal information, we started to record our conversation and the events on the computer screen using the research software Silverback (http://silverbackapp.com).

Experiment Session. As a warm-up, participants had to complete a rehearsal task using Delpher. During this task, we asked questions such as “what is this element?” and “what do you expect to see when clicking?”, to help participants practice thinking aloud.

Following the rehearsal task, four tasks were given to the user one by one, with 5-10 minutes allocated to each task. In addition, we prepared semi-structured questions to explore their opinions about both the interface and their experience of working with digitized archives, specifically Delpher. All tasks in the evaluation, as well as their specific purposes, can be found in Table 2.

In addition to the tasks, we prepared open questions that were not closely tied to the interface, but concerned users’ general preferences when using digitized archives, such as the demand for OCR error estimation, the way they publish a search result, etc.

6.2 Results

6.2.1 Misspelling List

During the rehearsal tasks, participants showed us the need for a proper misspelling list. Most participants would query misspellings of the original query in real-life tasks. Only one, who focused on qualitative research, could directly accept the result provided by the search engine. The methods of choosing OCR misspellings were diverse: some participants queried only the misspellings given in the introduction of the evaluation study; some tried to seek support from “Uitgebreid zoeken” (advanced search) or “Krantenoverzicht” (a visualization tool on the Delpher interface), though neither really offered a solution; some tried to use search methods that work on other platforms but not on Delpher. Notably, one used the wild-card query “Amster*”, and got the earliest item among all participants’ findings, but there was no solid clue as to why the participant chose “Amster*” rather than something else. All participants indicated that the choice of misspellings should depend on background information – be it personal historical background or information about the database provided by the archive. Therefore it is important for archives to provide enough information for users’ decision making.

When looking at the mock-up interface, most participants could understand the misspelling list easily, though at first sight, two participants misunderstood it as a set of historical spellings of “Amsterdam”.

The misspelling list was seen as useful in two ways. First, it drew attention to OCR issues: participants welcomed seeing which misspellings could be encountered and how they could influence search results. Second, the list could be used as a detailed reference or as background information, suggesting OCR forms for future research tasks. On an individual level, two participants indicated that they would like to check changes in the search results and in the visualization by interacting with the list. Compared with the wild-card method, one participant found the functionality convenient, even though it was clear that the misspelling list was not complete.

6.2.2 Visualizing Occurrence of Misspellings

We observed that it took minutes for users to figure out the meaning of the visualization that illustrated the OCR impact. One participant indicated that a user needs to see the misspelling list first to gain enough information for understanding the visualization. After an oral explanation, all users could understand it clearly.

In task 3, all participants could analyze the OCR influence through the years and find which misspellings could possibly get more hits. During the task, users tried enabling and disabling different misspellings and looked at the changes in the visualization; finally, they made decisions by comparing the occurrence counts of different misspellings. The questions in this task were specifically designed for OCR impact estimation, although according to the participants such questions would be unlikely in their own workflow.

As to the usage of the visualization tool, participants expressed that it was not directly needed for their research, but that it was good to know as background information. “This chronology is quite important for historians to see”, one participant told us, indicating that it helped users become further aware of the development of OCR issues over time. Notably, by looking at the visualization, one participant easily found gaps in our data set (which was correct, since our data included only 265 newspaper pages). When asked whether it would be better to show absolute numbers of occurrences rather than percentages in the visualization, three participants strongly preferred the former. One also suggested we provide the information in both ways.

6.2.3 Quickly Browsing Search Results

Participants gave few opinions about the highlighted images on the result list; only two, during their thinking aloud, mentioned that they were nice to see. Through observation, it was a smooth process for all participants to perceive the function and start using it. For example, in task 4, after the first item on the list was found, users looked at the image, checking whether the search result really was “Amsterdam” – they did not go to the original newspaper pages on Delpher.


REHEARSAL TASK
Task: Find the first mention of “Amsterdam”, using Delpher, and explain how far the result can be trusted (select from four levels: do not trust, have little trust, tend to trust, completely trust).
Purpose: (1) The participant can practice thinking aloud. (2) We observe the way users deal with OCR errors.

TASK 1
Task: Query “Amsterdam” in the prototype and try to explore the interface without clicking.
Purpose: (1) We find out if the user interface is easily understandable to the user. (2) If needed, we explain the interface so that the user is ready for the following tasks.

TASK 2
Task: Find two of the most frequently occurring words in the misspelling list. Relevant questions will be asked (does the misspelling list create extra confusion for your searching? would you use the misspelling list? etc.).
Purpose: By observing and asking questions, we get an idea of whether our approach, the query expansion, is seen as useful for dealing with OCR errors or for other research purposes.

TASK 3
Task: Find answers to three questions: (1) When is the first occurrence of “Amsterdam” that has been misspelled as “Amfterdam”? (2 & 3) If you want to find information about Amsterdam between the years 1800 to 1809 and the years 1780 to 1789, what misspelling(s) would you like to use?
Purpose: (1) We observe how the user finds specific information that requires them to interact with the interface. (2) Further, questions 2 and 3 require the user to analyze the visualization and perceive the OCR impact on the search results.

TASK 4
Task: Using the prototype interface, find the first mention of “Amsterdam” in the 265 newspaper pages and explain how far the result can be trusted.
Purpose: We explore whether the user gains more (or less) trust in their findings after interacting with the prototype, and what has influenced their trust.

Table 2: Tasks in the evaluation study

6.2.4 Other Findings

Generally, there were two attitudes among users when dealing with OCR issues. Two participants preferred to browse the documents one by one from the first scanned newspaper in the database, concerned that the system would not find the items they need. When publishing a paper, they would not write about the technical limitations; only the archive name would be mentioned, as it sets the boundary of the research. The others tended to trust the results and publish with a footnote about the limitations of the data source and the Delpher interface. In particular, they would like to list the main misspellings, emphasizing the uncertainty of the search results.

All participants confirmed the importance of the system’s transparency. Detailed background information and the limitations of the system were considered helpful. One participant even emphasized that a more obvious warning on the Delpher interface is necessary, so that researchers become more aware of the limitations of the data and the system.

In both the rehearsal task and task 4, users were asked to give their trust level for the two results. Four of the five participants had more confidence in their findings in task 4. According to their explanations, their trust was gained mainly from the additional system-side information given by the mock-up interface, though they knew that there were still limitations (e.g. the misspelling list was incomplete, the demo system was not fully implemented). One participant told us that the misspelling list, showing many possibilities, had helped a lot in increasing trust.

7. DISCUSSION

Though the main parts of the interface were positively received by participants, deficiencies in our design remain.

7.1 Misspelling List

Though the misspelling list was said to be useful and easy to understand by all participants, disadvantages were also pointed out. One participant suggested that all words in the misspelling list should be enabled in the initial state, so that important search results would be unlikely to be missed. Regarding misspellings that got no hits, three participants questioned the necessity of their presence in the list and preferred that such extra information be hidden, although they saw the transparency of the system as important.

As noted, there was a common (two of five) misunderstanding that our list consisted of different ways of writing the word. This might be because OCR misspellings are less used than other related spellings (e.g. historical spellings, singular or plural forms) in conducting humanities research tasks, though all participants were aware of the issues.

Apart from OCR misspellings, support for different ways of writing a word was suggested as important and needed by participants. It would be interesting to integrate both functionalities in our interface, since they are similar issues from the users’ perspective.


7.2 Visualizing Occurrence of Misspellings

After an explanation of the visualization, all participants indicated that it was good to see as background information. This confirms our assumption that such a visualization is needed, though it may have low priority.

There can be several reasons why participants needed extra explanation or tutorials to understand the visualization. First, the wording of the legends on the visualization might lack clarity; furthermore, the choice of showing percentages rather than absolute counts on the y-axis might create confusion as well. Second, such information was not directly useful for the search task (finding the first mention); therefore some users might have had no motivation to explore it. As users suggested, it might be better to locate the visualization at the bottom of the interface, or even to hide it.

During task 3, three users tried to click on the bars in the visualization, where no function was designed. Though no one clearly explained what they needed to see, apart from the absolute occurrence counts, we surmise that they expected to check which newspaper items were represented by a specific bar in the visualization, digging into a deeper level of the interface. We could not conclude whether such interaction is necessary for OCR impact estimation; however, it might be a direction for further study.

7.3 Quickly Browsing

Within our research scope, we could not conclude that the image highlighting the query word should replace the thumbnail of the whole newspaper, which is the typical display style for digital archives. However, it would be interesting to investigate further whether the former is a better choice in several respects (e.g. is it more effective for perceiving OCR errors?).

8. CONCLUSION & FUTURE WORK

We conducted a user-centered design study of an interface that supports estimating the impact of OCR errors on the retrieval of the first mention of a concept. In a preliminary study, we set up experiments, investigated the activities involved in performing the specific research task, and identified user requirements. In an evaluation study, we tested the requirements and the usefulness of our interface design qualitatively with five humanities researchers.

Our findings indicate that information for users to estimate the impact of OCR errors is needed. Providing OCR misspellings of the original query was confirmed to be useful for clearly understanding OCR issues, as well as for suggesting OCR misspellings for future tasks. The suggested way of showing OCR misspellings was confirmed by the users and seen as easy to understand. The chronological visualization can be difficult to understand but can help users estimate the impact of errors. It was considered background information; interesting to know but not directly necessary.

For future work, there are several interesting directions related to our study. First, participants showed a demand for support for different ways of spelling when querying digital archives. It might be interesting to combine such needs with handling OCR misspellings, since these are similar issues for users. Second, it would be interesting to explore various ways of visualizing the occurrence frequency, so that users can understand the information more easily. Additionally, the usefulness and effectiveness of the highlighting function could be evaluated quantitatively. Last but not least, this study could be reproduced with more participants to confirm the UI features in a larger population.

9. ACKNOWLEDGEMENTS

We thank Myriam Traub and Lynda Hardman for their supervision, Emmanuelle Beauxis-Aussalet and Jacco van Ossenbruggen for their advice, and all members of the Information Access group for their help. We thank Frank Nack for reviewing this paper and for his support throughout the whole process. Finally, we are grateful to all the humanities researchers who participated in this project and shared their experience and insights.

10. REFERENCES

[1] Robert B. Allen and Robert Sieczkiewicz. How historians use historical newspapers. Proceedings of the American Society for Information Science and Technology, 47(1):1–4, 2010.

[2] Kata Gábor and Benoît Sagot. Automated error detection in digitized cultural heritage documents. In EACL 2014 Workshop on Language Technology for Cultural Heritage, 2014.

[3] Marti Hearst. Search User Interfaces. Cambridge University Press, 2009.

[4] Rose Holley. How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine, 15(3/4), 2009.

[5] Gary Marchionini. Information-seeking strategies of novices using a full-text electronic encyclopedia. Journal of the American Society for Information Science, 40(1):54–66, 1989.

[6] Gudila Paul Moshi, Lazaro S. P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi, and Fumitaka Kimura. An impact of linguistic features on automated classification of OCR texts. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 287–292. ACM, 2010.

[7] Ken Orr. Data quality and systems theory. Communications of the ACM, 41(2):66–71, 1998.

[8] Ulrich Reffle and Christoph Ringlstetter. Unsupervised profiling of OCRed historical documents. Pattern Recognition, 46(5):1346–1357, 2013.

[9] Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.

[10] Ben Shneiderman, Don Byrd, and W. Bruce Croft. Clarifying search: A user-interface framework for text searches. D-Lib Magazine, 3(1):18–20, 1997.

[11] Thomas Smits. Problems and possibilities of digital newspaper and periodical archives. Tijdschrift voor Tijdschriftstudies, (36):139–146, 2014.

[12] E. Toms and N. Flora. From physical to digital humanities library – designing the humanities scholar's workbench. Mind Technologies: Humanities Computing and the Canadian Academic Community, pages 91–115, 2006.

[13] Myriam C. Traub, Jacco van Ossenbruggen, and Lynda Hardman. Impact analysis of OCR quality on research tasks in digital archives. In Sarantos Kapidakis, Cezary Mazurek, and Marcin Werla, editors, Research and Advanced Technology for Digital Libraries, volume 9316 of Lecture Notes in Computer Science, pages 252–263. Springer International Publishing, 2015.

[14] Luuk Van Waes. Thinking aloud as a method for testing the usability of websites: the influence of task variation on the evaluation of hypertext. IEEE Transactions on Professional Communication, 43(3):279–291, 2000.

[15] Robert A. Virzi, Jeffrey L. Sokolov, and Demetrios Karis. Usability problem identification using both low- and high-fidelity prototypes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 236–243. ACM, 1996.
