Ubuzz - Making history part of everyday life.

Martijn C. Loos 10205802

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dhr. dr. F.M. Nack
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 27th, 2014

Abstract

The Amsterdam Museum wants to involve the residents of Amsterdam more closely with their city’s history. It has a large inventory-database with a variety of historical records of Amsterdam. This database is to be used to retrieve records that return information about the history of Amsterdam to the resident. This information should highlight the point of view opposite to the one the resident holds on a topic. To accomplish this, a system is implemented in Java that uses multiple search techniques, where each subsequent technique weakens the semantic link between the input and the output. An existing application is used to gather trending topics from Twitter in a small area of Amsterdam, together with the sentiment engendered by these topics. These topics are used to search the inventory-database for records that reveal relevant information on the opposite view of a topic, where opposite means records that reveal a negative point of view if the input topic engendered a positive one. Tags linked to each record are gathered and classified as belonging to a positive or a negative sentiment; these tags then guide the search towards records with the correct sentiment.

A small-scale test with only a few input topics shows that the concept of the system works. The system lays the groundwork for a larger system and leaves room for many improvements.

1 Introduction

This research project is conducted in cooperation with the Amsterdam Museum. The Amsterdam Museum (AM) wants to involve the residents of Amsterdam more closely with their city’s history and is therefore seeking an application that can interact with users by providing relevant information on topics the user is discussing at a particular moment in time. An existing way in which the AM provides information is the DNA exhibition in the Amsterdam Museum. This exhibition is divided according to four core-keywords: Ondernemerschap, Vrijdenken, Burgerschap and Creativiteit. The DNA exhibition displays the history of Amsterdam, and every exhibited item is categorized under at least one of the four themes. This division quickly informs visitors about the history of Amsterdam, because it helps them understand the context of an exhibited item faster. The information provided by the DNA exhibition has to be extended to an application. To achieve this, the AM first has to discover what topics residents are discussing, retrieve appropriate information belonging to such a topic and return it to the user. The AM already has an application that can discover what currently concerns people through social media, specifically Twitter. This application can pick up the trending topics from a small area in Amsterdam, together with the sentiment engendered by these topics. The AM also has a large inventory-database with a great variety of records on the history of Amsterdam, and it wants to use this database to retrieve records that highlight an opposing point of view on the trending topic that the Twitter application discovered. However, as the name implies, the inventory-database was originally designed to keep track of the possessions the AM has in storage; the AM would now like to use it to retrieve the information that belongs to a topic.
To realize this, a layer has to be implemented that can establish what the opposing point of view is and gather the records that are relevant to it. The AM wants to continually stimulate the discussion. This has to happen within the boundaries of the information that the inventory-database can deliver, so the story elements can only use limited information. The layer should guide the retrieval of this information. After relevant records are found, they are returned as output and passed on to a story engine, which can create a coherent story that connects the input topic with the information in the output records. This story highlights opposing viewpoints, making the user aware of different opinions on a topic. For example, when the input is a topic with a positive sentiment linked to it, the output should be a record that has some connection with the topic and a negative sentiment.

The focus of this research project is the layer between the Twitter query and the AM Database. This has to be a program that uses the topic and sentiment as input to guide a search into the AM Database, in order to find the appropriate records to pass on to the story engine. This system only delivers the records as output.

The main objective of this research project is to discover whether this kind of layer, with the use of tags, results in the retrieval of relevant records. Therefore, a prototype of the layer is programmed in Java: a small version of the layer that only works for certain topics and tags. The assumption is that if this prototype works in a small environment, the approach will also work in a large environment. The prototype can then be extended so that it works for all topics and all tags and returns the appropriate records.

The research question belonging to this project is the following: How can a Twitter topic be mapped to an inventory database to establish an argument structure?

The Twitter topic is the input from the application, the inventory database is the AM Database, and the argument structure consists of the returned records, which carry the opposite sentiment to the input sentiment in order to highlight the opposing point of view on a topic.

The experiments indicate that the prototype works for the tested topics and is a useful basis to expand into a layer for the AM Database.

2 Literature Review

This project utilizes the AM Database, which can be defined as a collection of cultural heritage. In earlier research, [Isaac et al., 2008] conducted experiments on connecting search queries to databases in which cultural heritage resources are stored, so that one search gives equivalent results in more than one database. Isaac et al. state that “a faceted browser that gives a unified access to two collections of illuminated manuscripts via any of its respective vocabularies” is a good example of a demonstrator, but also that more experimentation is needed in practical applications. Aligning multiple databases is not needed in this research, because only the AM Database is used.

This research takes as input a topic gathered from Twitter by an existing application and uses it to discover relevant records in the AM Database. To guide this search, an extra layer is built in front of the database. Researchers from the VU did something similar; however, where this research uses one topic at a time, the VU research used a complete database. They transformed the AM Database from a relational database into a triple store, which made the database easier to access. [Tordai et al., 2009] conducted theoretical research, and with its results they created their own application for database alignment, called AMALGAME [Van Ossenbruggen et al., 2011].

Part of guiding the search for relevant records consists of collecting more information on the topic, which broadens the concept and increases the chance of finding a relevant record. Section 3.1, The thematic atom, of the PhD thesis of Charlie Hargood [Hargood, 2011] is relevant to this problem. It contains information about story content and structure. This section describes how, starting from the smallest unit of content, which Hargood calls a natom, the features, motif and theme that belong to it can also be discovered.

For this project, primarily Wikipedia is used to collect more information on a topic. Two studies, by [Coursey and Mihalcea, 2009] and [Coursey et al., 2009], identify the topics of documents using Wikipedia. However, my research uses only one topic at a time, instead of complete documents. Coursey and Mihalcea, and Coursey et al., try to identify documents that provide information, whereas my research tries to verify information.

The output of this research project must be a record that represents a counter-argument. [Bocconi et al., 2005] conducted research on automatically generating an argument structure from videos. The videos are annotated, and with the use of the Toulmin model [Toulmin et al., 1984] the right video fragments have to be found to create a correct piece of video that displays a point of view on a discussion. Whereas Bocconi et al. use video fragments, this research uses only text, gathered from records. To simplify matters for this research, the counter-argument delivered as output is just a record that corresponds to an opposite sentiment, instead of a record discovered by using the Toulmin model, as is done by Bocconi et al.

3 The System

Previous attempts have been made to create a layer that also incorporates the sentiment belonging to topics to find appropriate records, but they were not very successful. This is why this project uses something different, called tags. Tags are predetermined keywords that are used to retrieve records that have the opposite sentiment to the input sentiment and are still relevant to the initial topic. The use of tags is described in detail in the following sections.

Before the system can be explained, extra context has to be mentioned: the division used in the DNA exhibition should also be apparent in the system through the four core-keywords: Ondernemerschap, Vrijdenken, Burgerschap and Creativiteit. These keywords have to be incorporated in the information that is returned to the user; however, they are not incorporated in the database. Section 3.2 and onwards explain how these keywords are incorporated without being explicitly saved in the database.

The second piece of context concerns the information in the database and a few definitions that belong to it. The database consists of stored records, where each one contains information about an object that can be or is exhibited in the Amsterdam Museum. A record contains multiple fields, where each field contains different information about the record. Table 1 lists the names of the fields with a short description of their semantics. Section 4.1 explains which fields are used and why. Appendix A contains an example of a record.

Field name                Field description

priref                    The priref is the record number of the object and the unique identifier in the database.
object number             Every object has a unique object number. This number has no association with the priref number.
title                     The title of the object. An object can have multiple titles.
creator                   The creators of the object. Names of persons are stored as “Lastname, Firstname”.
production.date.start     The start date of the manufacturing of the object.
production.date.end       The end date of the manufacturing of the object. When this field is equal to production.date.start, it indicates a specific date.
object name               The keyword with which the object is described.
object category           The sub-collection of which the object is an element.
description               A description of the object.
content.motif.general     Contains keywords with which the image is described.
content.subject           Contains keywords of things visible on the image.
content.person.name       Contains the name of the person that is portrayed.
association.subject       Contains keywords that describe the context of the object.
association.person.name   Contains names of persons that are associated with the object.
material                  Contains keywords with which the material is described.
technique                 Contains keywords with which the technique of manufacturing is described.
dimension                 Contains data about the dimensions of the object.
related object.reference  Contains a link to a related object inside the data set.
acquisition.date          The date of acquisition of the object.
AHMteksten                Information in text about the object. This can be texts in Dutch or English.
documentation             Contains data about the publication of the object. The title of the publication is a link to the library database of the AM.
reproduction              A link to the image of the object.

Table 1: Table of the field names with their respective field descriptions. This is a translation of the information that can be found at the website of the AM: http://www.amsterdammuseum.nl/open-data

3.1 Overview of the system

Figure 1 shows the pipeline of the system and, together with this overview, gives a global picture of the system.

As mentioned before, the system will be a layer that guides the search for appropriate records into the database. This will be conducted in several steps. This overview will show the general approach and the following sections will go into detail for every part of the system:

The input of the system is the output of the application that gathers topics from Twitter. When the topic is entered as input value, the first step is to categorize it into one of three categories: Event, Human or Object. The topic also has one of the themes from the DNA exhibition linked to it, along with a sentiment. These values are required for the following step, which is retrieving the required tags. Every theme has its own tags for each category, for a negative and for a positive sentiment. Listing 1 contains pseudo-code that gives an indication of how the tags are stored.

Listing 1: Example of how the tags are stored in the tag database.

Ondernemerschap
    Event
        PositiveTags
            Tag
            Tag
        NegativeTags
            Tag
            Tag
    Human
        PositiveTags
            Tag
        NegativeTags
            Tag
    Object
        PositiveTags
            Tag
        NegativeTags
            Tag

This storage of tags exists for every theme.

Furthermore, the concept of the system is that information is discovered that highlights the opposite point of view on a topic, thus when a topic arrives with a positive sentiment, the tags for the negative sentiment of that theme from the discovered category are gathered and used.
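The tag lookup described above can be sketched as a nested map from theme to category to sentiment. This is only an illustrative sketch under assumptions, not the implementation of the thesis: the class TagStore, its enums and its method names are hypothetical, and the tags stored in it are whatever the caller supplies.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of the tag database from Listing 1.
final class TagStore {
    enum Theme { ONDERNEMERSCHAP, VRIJDENKEN, BURGERSCHAP, CREATIVITEIT }
    enum Category { EVENT, HUMAN, OBJECT }
    enum Sentiment {
        POSITIVE, NEGATIVE;
        Sentiment opposite() { return this == POSITIVE ? NEGATIVE : POSITIVE; }
    }

    // theme -> category -> sentiment -> tags
    private final Map<Theme, Map<Category, Map<Sentiment, List<String>>>> tags =
            new EnumMap<>(Theme.class);

    void put(Theme theme, Category category, Sentiment sentiment, List<String> values) {
        tags.computeIfAbsent(theme, k -> new EnumMap<>(Category.class))
            .computeIfAbsent(category, k -> new EnumMap<>(Sentiment.class))
            .put(sentiment, List.copyOf(values));
    }

    // Returns an empty list when nothing is stored for the combination.
    List<String> tagsFor(Theme theme, Category category, Sentiment sentiment) {
        return tags.getOrDefault(theme, Map.of())
                   .getOrDefault(category, Map.of())
                   .getOrDefault(sentiment, List.of());
    }

    // Tags that guide the search for the opposing point of view.
    List<String> counterTags(Theme theme, Category category, Sentiment inputSentiment) {
        return tagsFor(theme, category, inputSentiment.opposite());
    }
}
```

With this layout, a topic that arrives with a positive sentiment is looked up with counterTags, which returns the negative tags of the same theme and category, or an empty list if none are stored.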

The system performs three kinds of searches. In the first search, a request is constructed to the database to retrieve records that have the name of the topic in one of the fields of the record, in combination with one of the tags. When this does not result in enough records, a second search is performed. This search leaves the topic out and only searches for records with one of the tags linked to the category of the topic. When this still does not result in enough records, a third search is performed. Because the first and second search did not return enough records with the tags in the associated category, the third search looks for records with the tags from the other categories under the same theme. With this approach, the semantic link between the original topic and the retrieved records is weakened with every search. The first search has a strong semantic link, because records are sought with the initial topic, under the initial category. The second search weakens the semantic link, but still holds it by searching on tags in the same category, although the topic is left out. The third search weakens the link even further by searching on tags in other categories. However, the semantic link never disappears, because the tags that are used always come from the initial theme, which is the one thing that never changes throughout the different searches.

Eventually, the gathered records are stored and used as output, from where the records can be sent to a story engine, which can create a coherent story of the records.

The next sections describe each part of the system in detail, starting with the Twitter application that delivers the input.

3.2 Twitter application

The system utilizes a Twitter application that already exists and was created in a project for the course “Tweedejaarsproject” by Miriam Huijser, Eva van Weel, Sander Beerenpoot and Martijn Loos. In Figure 1, this is the first square, with the Twitter logo. The application runs on a smartphone, from which tweets can be sent. In the background, a server gathers all the tweets sent through the application and processes them, which enables it to discover the trending topics for certain areas in Amsterdam. The area for which it can discover the trending topics has a 1 km radius around the point from which the user sends their tweet. This point can be located because the application sends the coordinates from which the tweet was sent along with the tweet.

exhibition, ensuring that topics can be stored under one of the four themes. This will always be possible, because the application forces the user to choose a theme before sending a tweet.

The topics also have a sentiment engendered by them, which is expressed in a number from -1 to 1, where -1 is extremely negative, 0 is neutral and 1 is extremely positive.

The topic, the theme under which the topic belongs and the sentiment that is engendered by the topic are stored in an XML file. This data can be used as input for the system of the current research project and guarantees that a theme and sentiment are always available for the topic.

3.3 Categorization

The first step of the system is to categorize the topic that is used as input into one of three categories: Event, Human or Object. By categorizing in such a way, the right tags can be retrieved in the next stage.

The categorization first analyzes whether the topic belongs to the category Event. If that is not the case, the topic is analyzed to discover whether it is Human, and if that is also not the case, the topic is automatically categorized as Object. The first step of the categorization uses Wikipedia: when a topic is delivered as input, a connection to Wikipedia is set up through the MediaWiki API (http://www.mediawiki.org/wiki/MediaWiki). Through MediaWiki, the content of a Wikipedia page can be retrieved as an XML file by using URL requests, provided the page exists. Only the introduction of the Wikipedia page is retrieved, which generally contains enough information to analyze.

To discover to which category a topic belongs, the introduction is searched for the words “date” or “data”, and “geboren”. When one of the first two keywords is found, the topic can be categorized as Event, because it is something that has a date associated with it. While humans also have a date associated with them, this is almost always the date of birth, which is detected by searching for the word “geboren” (Dutch for “born”). Therefore, when “date” or “data” is detected on the Wikipedia page, the topic is categorized as Event, whereas if “geboren” is found, the topic is categorized as Human.

If none of the category-specific words is present, a second check is performed to discover whether the topic is Human, although this is not pictured in Figure 1. This check uses the AM Database. Through the AMDLib API, a connection can be established between the system and the AM Database, and records are retrieved where the topic occurs in one of the following fields: content.person.name, association.person.name and creator. If there are one or more hits for each of the fields content.person.name and association.person.name, or one or more hits for each of content.person.name and creator, the topic is categorized as Human. One hit equals one record returned. See Section 4.1 for why these fields were chosen.

If these two methods do not result in a categorization, the topic is categorized as Object.
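The categorization rules above can be summarized in a short sketch. It is an illustration under assumptions rather than the code of the thesis: the class and method names are hypothetical, the Wikipedia introduction is passed in as a plain string (null when no page exists), the keyword check is assumed to be a case-insensitive substring match, and the database check is reduced to the hit counts per field.

```java
import java.util.Locale;

// Hypothetical sketch of the Event / Human / Object decision.
final class Categorizer {
    enum Category { EVENT, HUMAN, OBJECT }

    // Keyword check on the Wikipedia introduction; returns null when undecided.
    static Category fromIntroduction(String introduction) {
        if (introduction == null) {
            return null; // no Wikipedia page available
        }
        String text = introduction.toLowerCase(Locale.ROOT);
        if (text.contains("date") || text.contains("data")) {
            return Category.EVENT;
        }
        if (text.contains("geboren")) {
            return Category.HUMAN;
        }
        return null;
    }

    // Database check: content.person.name must have a hit together with
    // either association.person.name or creator.
    static boolean humanByHits(int contentPersonHits, int associationPersonHits, int creatorHits) {
        return contentPersonHits >= 1 && (associationPersonHits >= 1 || creatorHits >= 1);
    }

    static Category categorize(String introduction, int contentPersonHits,
                               int associationPersonHits, int creatorHits) {
        Category fromWikipedia = fromIntroduction(introduction);
        if (fromWikipedia != null) {
            return fromWikipedia;
        }
        return humanByHits(contentPersonHits, associationPersonHits, creatorHits)
                ? Category.HUMAN
                : Category.OBJECT; // default when neither method decides
    }
}
```

A plain substring match is a simplification: a word such as "database" would also trigger the Event rule, so a real implementation may want to match whole words.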

3.4 Keywords-without-records storage

A storage XML file keeps track of all the keywords that do not return any records. Before any search is conducted, this XML file is consulted to check whether the topic has been used as input before. If the topic is in this storage, there is no need to proceed through the entire system, because it is already known that no records will be returned. The topics stored in this XML file are topics that users are discussing but that do not result in feedback for them, because no records could be retrieved. This is convenient for the AM: when they notice this, they can change certain records in the AM Database in such a manner that these topics will deliver results, thereby also improving the AM Database.
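The short-circuit this storage provides can be sketched as follows. The sketch keeps the topics in memory instead of in an XML file, and the class name, method names and the trimming and lower-casing of topics are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical in-memory stand-in for the keywords-without-records XML file.
final class NoResultStore {
    private final Set<String> topics = new LinkedHashSet<>();

    // Assumption: topics are compared ignoring case and surrounding whitespace.
    private static String normalize(String topic) {
        return topic.trim().toLowerCase(Locale.ROOT);
    }

    // Checked before any search is started.
    boolean isKnownEmpty(String topic) {
        return topics.contains(normalize(topic));
    }

    // Called when all searches returned no records for the topic.
    void remember(String topic) {
        topics.add(normalize(topic));
    }

    // Lets the museum inspect which discussed topics yield no feedback.
    Set<String> topicsWithoutRecords() {
        return Collections.unmodifiableSet(topics);
    }
}
```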

3.5

3.6 Three types of searches

The system progresses through three types of searches, where every search weakens the semantic link between the topic and the output records. All the searches that are performed retrieve records from the AM Database.

The First search retrieves the records that are semantically closest to the input query, because it uses both the topic and the tag as input. The fields mentioned here are all shown in Table 1. For all categories, the fields title and AHMteksten are used to search for records. For Events, the field association.subject is also used; for Humans, the fields association.person.name, content.person.name and creator are added; and for Objects, the field description is added. The tag is inserted in the association.subject field. The First search then searches on the topic and the tag, and if this results in five or more records, the system immediately proceeds to the output. However, if fewer than five records are returned, the Second search takes place.

The Second search retrieves records that have a weaker semantic link to the topic. It does not use the topic at all, only the tags for the corresponding theme and category. For example, if a topic has the theme Ondernemerschap and is classified as Event, the Event tags are obtained and records are retrieved that contain these tags. This way, records that are similar to the input topic are returned, because they are in the same theme and category. There is, however, a difference in the way Event records are gathered compared to the other two categories: the Event search includes the year. Since an Event always has a time at which it took place, for example the Olympic Games in Amsterdam in 1928, this can be used to find records from approximately the same time. To discover the date on which the event took place, one of two methods is used:

The first method connects to Wikipedia, using the same strategy as in the categorization part, although this time the content of the information box (see Figure 2) is gathered. The content of this information box may contain the start date of the event, and if so, this date is retrieved and set as the start date of the event. Not every Wikipedia page is complete, however, and many Wikipedia pages of an event topic do not have the date in this content. Some event topics do not have a corresponding Wikipedia page at all. In those cases a second method is used, which utilizes the AM Database: it searches for all the years that are mentioned in the AHMteksten field of all the records that are retrieved with the tag. The year that occurs most often is set as the start date of the event.
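The second method, taking the year mentioned most often in the text fields, can be sketched as below. The sketch rests on assumptions: a year is taken to be a standalone four-digit number from 1000 to 2099, ties are broken in favour of the earliest year, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: estimate an event's start year from record texts.
final class YearEstimator {
    // Assumption: a year is a standalone number in the range 1000-2099.
    private static final Pattern YEAR = Pattern.compile("\\b(1\\d{3}|20\\d{2})\\b");

    static OptionalInt mostFrequentYear(List<String> texts) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String text : texts) {
            Matcher matcher = YEAR.matcher(text);
            while (matcher.find()) {
                counts.merge(Integer.parseInt(matcher.group(1)), 1, Integer::sum);
            }
        }
        int bestYear = -1;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            int year = entry.getKey();
            int count = entry.getValue();
            // Assumption: on a tie, prefer the earliest year.
            if (count > bestCount || (count == bestCount && year < bestYear)) {
                bestYear = year;
                bestCount = count;
            }
        }
        return bestCount == 0 ? OptionalInt.empty() : OptionalInt.of(bestYear);
    }
}
```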

When the start date of the event is known, records are sought that date from a time frame of five years before to five years after the start date. If this does not result in a total of five or more records, including the records already retrieved in the First search, another five years is added on both sides, until twenty years have been added to both sides. If still not enough records are obtained after that, the Third search takes place. Topics in the Human and Object categories do not use this date-based search.
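The widening time frame can be sketched as a loop. This is an illustrative sketch, not the thesis code: the interface YearRangeSearch stands in for the actual database request, records are represented by their record numbers, and it is assumed that the margin grows from five to at most twenty years on each side.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the date-window loop used for Event topics.
final class DateWindowSearch {
    static final int MIN_RECORDS = 5;
    static final int STEP_YEARS = 5;
    static final int MAX_MARGIN_YEARS = 20; // assumption: margin is capped at 20 years per side

    // Stand-in for a tag search restricted to records dated within [fromYear, toYear].
    interface YearRangeSearch {
        Set<Integer> find(int fromYear, int toYear);
    }

    static Set<Integer> collect(int startYear, Set<Integer> alreadyFound, YearRangeSearch search) {
        // Records from the First search count towards the total.
        Set<Integer> found = new LinkedHashSet<>(alreadyFound);
        for (int margin = STEP_YEARS;
             margin <= MAX_MARGIN_YEARS && found.size() < MIN_RECORDS;
             margin += STEP_YEARS) {
            found.addAll(search.find(startYear - margin, startYear + margin));
        }
        return found; // may still hold fewer than MIN_RECORDS; the Third search follows then
    }
}
```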

The Third search weakens the semantic link between the records and the topic further, while still holding it. At this point, searches have already been conducted on the topic combined with the tags from the topic’s category and on those tags alone, and neither resulted in enough records. Searching in the same category again is therefore of no use, so the other categories of the same theme are used. When a topic is categorized as Event, first the Human tags are gathered and a search like the Second search takes place with these tags. If that does not result in enough records, the Object tags are used. For Human topics, the order is Event first, then Object; for Object topics, it is Event first, then Human. For example, if the topic is categorized as Human and has the theme Vrijdenken, but the First and Second search did not result in enough records, the tags for Event under Vrijdenken are gathered. The records that have these tags are then retrieved, and if this does not result in five or more records, the tags for Object under Vrijdenken are gathered. With these tags a last record retrieval is performed.

In the Third search, Event topics are again treated differently, in the same way as in the Second search. If a topic is an Event but the First and Second search did not return enough records, the tags for Human are gathered. The date of the event was already found in the Second search. The Third search uses this date by running the same loop as in the Second search with the tags of the Human category: records with these tags are gathered within a time frame of five years before to five years after the date of the topic, even though the tags do not belong to the Event category. This also loops until twenty years have been added to both sides.

If, after all the searches, fewer than five records have been retrieved, the system writes the records that were found to the output XML file, or returns nothing if not even one record was obtained.
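The order in which the three searches fall back on one another can be sketched as follows. The sketch makes several assumptions: the Database interface is a hypothetical stand-in for the real requests to the AM Database, records are represented by their record numbers, and the date-window handling for Event topics is left out to keep the control flow visible.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the First / Second / Third search cascade.
final class SearchCascade {
    enum Category { EVENT, HUMAN, OBJECT }

    static final int MIN_RECORDS = 5;

    // Stand-in for the actual database requests.
    interface Database {
        Set<Integer> byTopicAndTags(String topic, Category category); // First search
        Set<Integer> byTags(Category category);                       // Second and Third search
    }

    // Order of the other categories tried in the Third search.
    static List<Category> fallbackOrder(Category category) {
        switch (category) {
            case EVENT: return List.of(Category.HUMAN, Category.OBJECT);
            case HUMAN: return List.of(Category.EVENT, Category.OBJECT);
            default:    return List.of(Category.EVENT, Category.HUMAN);
        }
    }

    static Set<Integer> run(String topic, Category category, Database db) {
        Set<Integer> found = new LinkedHashSet<>(db.byTopicAndTags(topic, category));
        if (found.size() >= MIN_RECORDS) {
            return found;
        }
        found.addAll(db.byTags(category)); // Second search: same category, topic dropped
        for (Category other : fallbackOrder(category)) {
            if (found.size() >= MIN_RECORDS) {
                break;
            }
            found.addAll(db.byTags(other)); // Third search: other categories, same theme
        }
        return found; // may be empty or hold fewer than MIN_RECORDS
    }
}
```

Because every request uses tags from the same theme, the theme stays fixed throughout; only the topic and the category are relaxed.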

Figure 2: The red framed box is the infobox on a Wikipedia page.

3.7 Output

The output is constructed in such a way that when it is sent to the story engine, the engine knows what the initial parameters were and what kind of records were found in the end. Listing 2 is an example with the topic Heineken, showing the first five records that are stored.

Listing 2: Example of how the output is displayed with the topic Heineken as input.

<RecordList>
  <InitialParameters>
    <Theme>Ondernemerschap</Theme>
    <Topic>Heineken</Topic>
    <Category>Human</Category>
    <Sentiment>Positive</Sentiment>
  </InitialParameters>
  <Record id="25798">
    <Thema>Ondernemerschap</Thema>
    <Sentiment>Positive</Sentiment>
    <Category>Human</Category>
    <SearchOn>Heineken</SearchOn>
    <Media>
      <MediaType>Text</MediaType>
      <MediaType>Image</MediaType>
    </Media>
    <Year>1975</Year>
  </Record>
  <Record id="2281">
    <Thema>Ondernemerschap</Thema>
    <Sentiment>Positive</Sentiment>
    <Category>Human</Category>
    <SearchOn>jubileum</SearchOn>
    <Media />
    <Year>1853</Year>
  </Record>
  <Record id="4875">
    <Thema>Ondernemerschap</Thema>
    <Sentiment>Positive</Sentiment>
    <Category>Human</Category>
    <SearchOn>jubileum</SearchOn>
    <Media>
      <MediaType>Text</MediaType>
      <MediaType>Image</MediaType>
    </Media>
    <Year>1875</Year>
  </Record>
  <Record id="6124">
    <Thema>Ondernemerschap</Thema>
    <Sentiment>Positive</Sentiment>
    <Category>Human</Category>
    <SearchOn>jubileum</SearchOn>
    <Media>
      <MediaType>Text</MediaType>
      <MediaType>Image</MediaType>
    </Media>
    <Year>1878</Year>
  </Record>
  <Record id="6128">
    <Thema>Ondernemerschap</Thema>
    <Sentiment>Positive</Sentiment>
    <Category>Human</Category>
    <SearchOn>jubileum</SearchOn>
    <Media>
      <MediaType>Text</MediaType>
      <MediaType>Image</MediaType>
    </Media>

The information under InitialParameters is the information that was used as input. The most important values in InitialParameters are the topic that is searched on and the category, because these can change throughout the searches. For the story engine, this gives insight into how strong the semantic link is between a retrieved record and the input topic. If, for example, the input category was Event but a record is returned with category Human, the story engine can infer that the record was retrieved by the Third search and that the semantic link is therefore quite weak. It can then design a story that explains that the semantic link is weak, while still connecting the topic to the record. The rest of the data consists of the retrieved records. The record id is the record number, which is stored along with the data on how the record was found, such as the theme, sentiment, category and topic that was searched on. The MediaType tag informs the story engine what kind of information the record contains. A record can contain text, when the field AHMteksten is used, and it can also contain a link to an image, when the field reproduction is used, which the story engine can also use. Notice that in this example the first record has Heineken as its SearchOn tag and the second record has jubileum as its SearchOn tag. This reflects the difference between the First search and the Second search: the First search returned one record when searching on the topic and the tag. Subsequently, the Second search was invoked and the topic was dropped; only records with the positive tags from Human under Ondernemerschap were searched. The first tag in the tag database with these parameters is “jubileum” (see Appendix B), so the remainder of the records was retrieved with this word. This tag alone already returned more than five records, so no other tags were used to search for records.
The Third search was not invoked here. Had it been, the category would have changed from Human to Event or Object, depending on which category resulted in records.

It is possible that no records are returned. In that case, an error message is stored, stating that no records could be retrieved. For example, this happens when searching on negative records for Baeu, which has the theme Creativiteit, because it has no tags stored, as is shown in Listing 3.

Listing 3: Example of how the output is displayed with the topic Baeu as input, for which no records are found.

<record>
  <InitialParameters>
    <Theme>Creativiteit</Theme>
    <Topic>Baeu</Topic>
    <Category>Object</Category>
    <Sentiment>Positive</Sentiment>
  </InitialParameters>
  <record>
    <Error>No records found.</Error>
  </record>
</record>

When this occurs, the topic is also put into the storage of keywords that return no records.

4 Experimentation

To make it possible to test the system, the AM delivered keywords that would return many records and stated under which theme each keyword belongs. Table 2 shows the list that the AM delivered:

Ondernemerschap       | Burgerschap       | Vrijdenken                           | Creativiteit
----------------------|-------------------|--------------------------------------|------------------
ambachten en beroepen | regenten*         | coffeeshop                           | Felix Meritis
bedrijven, fabrieken  | sociale zorg      | Hartjesdag                           | Blaeu*
bierbrouwerij         | dak- en armenzorg | Provo                                | atelier
taxi-oorlog           | charitas          | witkar                               | Bredero*
Hoop, Adriaan van der | Wibaut, Floor     | wederdopersoproer                    | schuttersstukken
Heineken              | Willet, Abraham   | aansprekersoproer                    | regentenstukken
Noord/Zuidlijn        | schutterij        | Potten- en flikkerdiscotheek de Trut |
Paleis van Volksvlijt |                   | Lieverdje                            |
VOC                   |                   | Gogh, Theo van                       |

Table 2: The list of keywords delivered and classified by the Amsterdam Museum

The stars indicate that the topic has a wildcard: all records are gathered that have the topic in a field with anything after it. For example, when the field title contains "Blaeu art", this record is retrieved, because it contains "Blaeu" followed by something that fills in the wildcard.
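The wildcard behaviour can be illustrated with a small Java sketch. This is only an approximation under assumptions: in the real system the wildcard is presumably resolved by the query syntax of the AM Database, whereas here it is imitated with a regular expression. The class and method names are hypothetical, and the sketch assumes that matching is case-insensitive and that the wildcard may also match nothing.

```java
import java.util.regex.Pattern;

/** Sketch of trailing-wildcard matching; not the AM Database's own query logic. */
final class WildcardTopicSketch {

    /** Turns a keyword such as "Blaeu*" into a pattern; a trailing '*' allows a longer word. */
    static Pattern toPattern(String keyword) {
        boolean wildcard = keyword.endsWith("*");
        String stem = wildcard ? keyword.substring(0, keyword.length() - 1) : keyword;
        // Pattern.quote keeps characters such as '/' or '-' in a keyword literal.
        String regex = "\\b" + Pattern.quote(stem) + (wildcard ? "\\w*" : "\\b");
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** True when the field value contains the keyword, honouring a trailing wildcard. */
    static boolean matches(String keyword, String fieldValue) {
        return toPattern(keyword).matcher(fieldValue).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("Blaeu*", "Blaeu art"));
        System.out.println(matches("regenten*", "regentenstukken"));
        System.out.println(matches("regenten", "regentenstukken"));
    }
}
```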

With this list two experiments were conducted: the discovery of what the best fields are to search on per category and the discovery of the tags.

4.1 Category search fields

To discover what fields were the best fields to search on for each category, not all the words from the initial list were used, but a couple were chosen as test words. Because the words were not categorized yet, this was something I had to do myself. The resulting list is a list divided by categories with the keywords that were most easily categorized, and two keywords that were added by myself, namely ”Olympische Spelen” and ”Tweede Wereldoorlog”:


Event               | Human                 | Object
--------------------|-----------------------|----------------------
Olympische Spelen   | Hoop, Adriaan van der | Ambachten en beroepen
Tweede Wereldoorlog | Heineken              | bierbrouwerij
Wederdopersoproer   | Wibaut, Floor         | Paleis van Volksvlijt
Aansprekersoproer   | Willet, Abraham       | VOC
                    | Gogh, Theo van        | coffeeshop
                    | Blaeu                 | witkar
                    | Bredero               |

Table 3: The table of keywords categorized by hand into the three categories

What had to be discovered was which fields return the most records, in other words, result in the most hits. To determine this, the fields that can hold the most useful information were chosen: title, association.subject, association.person.name, content.subject, content.person.name, creator, description and AHMteksten. The topics from Table 3 were used as input to each of the fields and the number of records returned was stored. This produced the following result:

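Field                   | Olympische Spelen | Tweede Wereldoorlog | Wederdopersoproer | Aansprekersoproer
------------------------|-------------------|---------------------|-------------------|------------------
title                   | 50                | 2                   | 0                 | 7
association.subject     | 85                | 82                  | 16                | 7
association.person.name | 3                 | 0                   | 0                 | 0
content.subject         | 1                 | 0                   | 0                 | 1
content.person.name     | 0                 | 0                   | 0                 | 0
creator                 | 0                 | 0                   | 0                 | 0
description             | 17                | 1                   | 0                 | 0
AHMteksten              | 9                 | 33                  | 0                 | 0

Table 4: Number of records per field returned for keywords in the Event category.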

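Field                   | Hoop, Adriaan van der | Heineken | Wibaut, Floor | Willet, Abraham | Gogh, Theo van | Blaeu | Bredero
------------------------|-----------------------|----------|---------------|-----------------|----------------|-------|--------
title                   | 5                     | 25       | 1             | 32              | 39             | 15    | 17
association.subject     | 0                     | 0        | 0             | 0               | 0              | 1     | 0
association.person.name | 10                    | 16       | 0             | 229             | 42             | 4     | 5
content.subject         | 0                     | 0        | 0             | 0               | 0              | 0     | 0
content.person.name     | 2                     | 2        | 0             | 19              | 1              | 10    | 4
creator                 | 0                     | 19       | 0             | 0               | 0              | 25    | 0
description             | 5                     | 7        | 0             | 14              | 0              | 0     | 5
AHMteksten              | 12                    | 6        | 2             | 59              | 23             | 15    | 17

Table 5: Number of records per field returned for keywords in the Human category.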


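Field                   | ambachten en beroepen | bierbrouwerij | Paleis van Volksvlijt | VOC | coffeeshop | witkar
------------------------|-----------------------|---------------|-----------------------|-----|------------|-------
title                   | 13                    | 3             | 41                    | 18  | 7          | 17
association.subject     | 723                   | 10            | 0                     | 1   | 0          | 7
association.person.name | 0                     | 0             | 0                     | 90  | 0          | 0
content.subject         | 0                     | 2             | 0                     | 2   | 0          | 2
content.person.name     | 0                     | 0             | 0                     | 0   | 0          | 0
creator                 | 0                     | 0             | 0                     | 0   | 0          | 0
description             | 0                     | 2             | 4                     | 13  | 2          | 0
AHMteksten              | 1                     | 1             | 15                    | 143 | 3          | 1

Table 6: Number of records per field returned for keywords in the Object category.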

The first thing noticeable is that every category has hits on the fields title and AHMteksten, therefore, these fields are used for all three categories.

Comparing the tables shows that Table 5 has hits on the field content.person.name, whereas the other two tables do not. This indicates that it is an appropriate field to use when the category is Human. Two other fields that result in more hits for Table 5 than for the other two tables are creator and association.person.name. However, not every Human topic creates hits for these fields, so they are not reliable when used alone. Hence, in the categorization step that checks whether a topic is Human, the topic is filled into the three fields content.person.name, association.person.name and creator. If content.person.name and creator both result in one hit or more, or if content.person.name and association.person.name both result in one hit or more, the topic is categorized as Human. With this method, topics that are not human but have hits on association.person.name or creator alone are not categorized as Human. For example, "VOC" has 90 hits on association.person.name but none on content.person.name, so it is not categorized as Human. The method also handles a topic that is a person but not a creator, as is the case with "Theo van Gogh". He has no records where he is listed as creator, but he is definitely a human. Due to the hits on association.person.name and content.person.name, "Theo van Gogh" can still be categorized as Human.
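The rule for recognising a Human topic comes down to one boolean expression. The Java sketch below is an illustration only: the class and method names are hypothetical, and the hit counts are assumed to have been obtained beforehand by querying the three fields.

```java
/** Sketch of the Human check based on hit counts per field; names are hypothetical. */
final class HumanCategorizerSketch {

    /**
     * A topic counts as Human when content.person.name has at least one hit
     * and, in addition, creator or association.person.name has at least one hit.
     */
    static boolean isHuman(int contentPersonHits, int associationPersonHits, int creatorHits) {
        boolean inContent = contentPersonHits >= 1;
        return inContent && (creatorHits >= 1 || associationPersonHits >= 1);
    }

    public static void main(String[] args) {
        // Hit counts taken from the tables above.
        System.out.println("Theo van Gogh: " + isHuman(1, 42, 0));
        System.out.println("VOC: " + isHuman(0, 90, 0));
    }
}
```

With the counts from the tables, Theo van Gogh (1, 42, 0) passes the check, while VOC (0, 90, 0) does not.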

The three fields that classify a Human topic can also be used to retrieve as many records as possible for topics that are already categorized as Human, so these fields are also used in the three searches, on top of the fields title and AHMteksten. For Event topics, the field that results in more hits than for the other categories is association.subject, so this field is used in the three searches for topics categorized as Event. Likewise, the field description is used when a topic is of the Object category.
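The resulting choice of search fields per category can be written down as a small lookup. The following Java sketch only restates the mapping described above; the enum and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the search fields used per category; names are hypothetical. */
final class SearchFieldsSketch {

    enum Category { EVENT, HUMAN, OBJECT }

    /** Returns the fields to query: the two shared fields plus the category-specific ones. */
    static List<String> fieldsFor(Category category) {
        List<String> fields = new ArrayList<>(List.of("title", "AHMteksten"));
        switch (category) {
            case HUMAN:
                fields.addAll(List.of("content.person.name", "association.person.name", "creator"));
                break;
            case EVENT:
                fields.add("association.subject");
                break;
            case OBJECT:
                fields.add("description");
                break;
        }
        return fields;
    }

    public static void main(String[] args) {
        for (Category c : Category.values()) {
            System.out.println(c + ": " + fieldsFor(c));
        }
    }
}
```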

4.2 Creating tag database

Although the tag database currently consists of tags, an early idea was to work with templates. The difference is that the templates would have been created by me, while the tags are gathered from the records that the AM Database returns for the test topics. In the early stages, templates would have worked almost like the tags: a list would be created for every category under a theme, for both sentiments. The only template that was created in the early stages was for Ondernemerschap, under Event: the positive template was "winst" (profit) and the negative template was "verlies" (loss). These words would indicate that with large events, cities or entrepreneurs could make a large profit, which is positive, or lose money on them, which is negative. However, it turned out that none of the records contained such words, so a search with this template never returned any records, and it was also very hard to think of new templates for all the other themes and categories. The contact at the AM indicated that all the records also had tags, which could be used as a replacement for the templates. This is the idea that was followed through in the end.

The tags are the words that are stored in the field association.subject, and for every test keyword the tags were retrieved. However, many tags do not carry a sentiment and could not be classified as positive or negative, only as neutral. Consequently, an XML file was created to serve as a database that stores, for every category under every theme, the positive, negative and neutral tags. After all the tags were gathered, they were classified by hand as positive, negative or neutral and put into the database, shown in Appendix B. Most of the tags were neutral, but a few could be classified as negative or positive. Notice that some tags are named "pos1" or "neg2"; these are placeholders added to fill up the tag database, and because no tags with these names exist in the AM Database, they do not interfere with the operation of the system.
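How such a tag database can be queried is sketched below in Java. The sketch keeps the tags in memory instead of reading the XML file, and the class name, method names and key layout are assumptions made for illustration; only the idea of looking up tags by theme, category and sentiment is taken from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the XML tag database; the layout is an assumption. */
final class TagDatabaseSketch {

    enum Sentiment { POSITIVE, NEGATIVE, NEUTRAL }

    // Key: theme, category and sentiment joined into one string.
    private final Map<String, List<String>> tags = new HashMap<>();

    private static String key(String theme, String category, Sentiment sentiment) {
        return theme + "/" + category + "/" + sentiment;
    }

    /** Stores a hand-classified tag under its theme, category and sentiment. */
    void add(String theme, String category, Sentiment sentiment, String tag) {
        tags.computeIfAbsent(key(theme, category, sentiment), k -> new ArrayList<>()).add(tag);
    }

    /** Returns the tags in insertion order, or an empty list when none are stored. */
    List<String> lookup(String theme, String category, Sentiment sentiment) {
        List<String> found = tags.get(key(theme, category, sentiment));
        return found == null ? Collections.emptyList() : Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        TagDatabaseSketch db = new TagDatabaseSketch();
        db.add("Ondernemerschap", "Human", Sentiment.POSITIVE, "jubileum");
        System.out.println(db.lookup("Ondernemerschap", "Human", Sentiment.POSITIVE));
        System.out.println(db.lookup("Creativiteit", "Human", Sentiment.POSITIVE).isEmpty());
    }
}
```

An empty result from lookup corresponds to the case in which a combination of theme and category has no tags, so that no records can be retrieved.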

The one substantial problem with this method is that most tags are rare. A search was performed in which records for the test keywords were retrieved. The tags from these records were gathered, and it was counted in how many records each tag occurred. Two of the keywords with their tags are shown:

Olympische Spelen 90
Tag: Sapporo: 3
Tag: Olympische Spelen: 86
Tag: drugs: 1
Tag: sport: 1
Tag: sport en spel: 5
Tag: Indonesië: 2
Tag: Amsterdam: 6
Tag: Anton-Geesink (Olympisch) stadion: 1
Tag: Amsterdam Olympische Spelen, 1928: 2
Tag: reclame: 1
Tag: Amsterdam Olympisch Stadion Olympische Spelen, 1928: 1
Tag: Beijing: 8
Tag: burgemeester van Amsterdam: 1
Tag: burgemeester: 1
Tag: Sydney: 1
Tag: Amsterdam Zuid: 2

Tweede Wereldoorlog 115
Tag: jubileum: 1
Tag: Dam: 12
Tag: hongerwinter: 2
Tag: politieke en sociale beroeringen: 2
Tag: Sint Nicolaasfeest: 1
Tag: Olympische Spelen: 1
Tag: verzetstrijder: 2
Tag: toneel: 1
Tag: empire: 2
Tag: Biedermeier: 2
Tag: wethouders: 1
Tag: Rusland (?): 1
Tag: Amsterdam: 1
Tag: mode: 2
Tag: bezetting 1940-1945: 3
Tag: ballet, dans: 1
Tag: gezelschapsspel: 1
Tag: Amsterdam Noord: 1
Tag: Leidsestraat: 2
Tag: boksport: 1
Tag: Tweede Wereldoorlog: 69
Tag: verzet: 2
Tag: aanslag: 1
Tag: speelgoed: 2
Tag: bevrijding: 1
Tag: Felix Meritis inwijding: 1
Tag: luchtbescherming: 1
Tag: Suriname: 1
Tag: Keizersgracht 324: 1

The number next to the topic is the total number of records that were returned for that topic. For example, the first tag for Olympische Spelen is "Sapporo", which occurred in three of the ninety records; "Olympische Spelen" (as a tag) occurred in 86 records, "drugs" in only one record, and so on. As is apparent here, many tags occur in only one record. Although both topics are test keywords for the Event category, their tags are very different. This phenomenon occurs for all the keywords: in this small test environment there are no tags that are specific to a category, hence when the Second search is used, the semantic link will still be very weak. The other problem is that very few tags are not neutral. For example, the only positive tags that are used for Olympische Spelen are "sport" and "sport en spel", and the only negative tag is "drugs", while the rest are all classified as neutral. Consequently, when Olympische Spelen is used as input topic, not many records will be returned in this prototype version, because there are still very few usable tags. However, when a record is returned by the First search, it will be highly relevant to the input topic.
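The counting behind these lists can be sketched in a few lines of Java. The class and method names are hypothetical, and the input is assumed to be the list of tags of each retrieved record; the sample data in main is invented and only reuses tag names from the lists above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

/** Sketch of counting in how many records each tag occurs; names are hypothetical. */
final class TagFrequencySketch {

    /** Counts, per tag, the number of records that carry it (a tag counts once per record). */
    static Map<String, Integer> countRecordsPerTag(List<List<String>> tagsPerRecord) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> recordTags : tagsPerRecord) {
            for (String tag : new LinkedHashSet<>(recordTags)) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Number of tags that occur in exactly one record. */
    static long singleRecordTags(Map<String, Integer> counts) {
        return counts.values().stream().filter(n -> n == 1).count();
    }

    public static void main(String[] args) {
        // Invented sample: three records with their tags.
        List<List<String>> records = List.of(
                List.of("Olympische Spelen", "Sapporo"),
                List.of("Olympische Spelen", "drugs"),
                List.of("Olympische Spelen", "Sapporo", "sport en spel"));
        Map<String, Integer> counts = countRecordsPerTag(records);
        System.out.println(counts);
        System.out.println("Tags in one record only: " + singleRecordTags(counts));
    }
}
```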

In conclusion, when this prototype system is used, the First search almost always returns very few records, because few tags are usable; however, when it does return a record, that record is mostly highly relevant to the input topic. Ideally, all the tags that occur over all the topics should be mapped to a theme, category and sentiment. More records could then be found that are more closely related to the topic, and tags could be found that are specific to a category. This prototype system could be the starting point for mapping all the tags. Each time a topic is processed and tags come up that are not yet in the database, they can be classified and added, effectively turning the system into a learning system. It would expand its own database while running, creating better results for each new topic.

5 Results

The result of this research project is a prototype system that returns records in many cases, although the semantic link is still weak. All of the test topics have been run through the system to discover what kind of output they deliver. Although the topics are sorted by category, for convenience the theme is noted in brackets next to the topic. Under every topic it is noted by which search the records were returned and how many records were returned. Because some tags cause the number of returned records to be well over five, this is indicated as "rest". Where it is stated that all the records, or the rest, are "by 1 tag" in a search, this indicates that the threshold of five records was passed by records retrieved with one tag only, and no other tags were picked to retrieve more records.

Event topics:

Olympische Spelen (Ondernemerschap):
Positive: 5 records by First search.
Negative: 1 record by First search and 1 by Second search.

Tweede Wereldoorlog (Vrijdenken):
Positive: All records by First search.
Negative: All records by First search.

Wederdopersoproer and Aansprekersoproer were both incorrectly classified as Object, therefore the results of these topics will not be reviewed.

Human topics:

Adriaan van der Hoop (Ondernemerschap):
Positive: 1 record by First search, rest by 1 tag in Second search.
Negative: All records by 1 tag in Second search.

Heineken (Ondernemerschap):
Positive: 1 record by First search, rest by 1 tag in Second search.
Negative: All records by 1 tag in Second search.

Floor Wibaut (Burgerschap):
Positive: 1 record by First search, rest by 1 tag in Second search.
Negative: No records returned.

Abraham Willet (Burgerschap):
Positive: 2 records by First search, rest by 1 tag in Second search.
Negative: No records returned.

Theo van Gogh (Vrijdenken):
Positive: All records by 1 tag in Second search.
Negative: All records by First search.

Blaeu (Creativiteit):
Positive: No records returned.
Negative: All records by 1 tag in Second search.

Bredero (Creativiteit):
Positive: No records returned.
Negative: All records by 1 tag in Second search.

Object topics:

Ambachten en beroepen (Ondernemerschap):
Positive: All records by 1 tag in Second search.
Negative: 3 records by 1 tag in Second search and rest by 1 tag in Third search.

Bierbrouwerij (Ondernemerschap):
Positive: All records by 1 tag in Second search.

Paleis van Volksvlijt (Ondernemerschap):
Positive: 4 records by First search, rest by 1 tag in Second search.
Negative: 2 records by First search, 1 record by 1 tag in Second search and rest by 1 tag in Third search.

VOC (Ondernemerschap):
Positive: 2 records by First search, rest by 1 tag in Second search.

Coffeeshop (Vrijdenken):
Positive: All records by 1 tag in Second search.
Negative: All records by 1 tag in Second search.

Witkar (Vrijdenken):
Positive: All records by 1 tag in Second search.
Negative: All records by 1 tag in Second search.

For the First search, the number of tags used to retrieve the records is not indicated, because these records are mostly retrieved by several tags, whereas with the Second or Third search they are mostly retrieved with one tag only.


6 Evaluation

Several patterns can be found in the results. First, there are only three keywords for which all records are returned by the First search, which gives the highest relevance to the topic. These are Tweede Wereldoorlog, the positive records for Olympische Spelen and the negative records for Theo van Gogh. This indicates either that many combinations of tag and topic matched, or that one combination resulted in many records. Either way, these three keywords result in the records with the strongest semantic link. Furthermore, the negative records of Olympische Spelen and the positive records of Adriaan van der Hoop, Heineken, Floor Wibaut and Abraham Willet retrieve some records from the First search, while filling up the records to the threshold by using the Second search. When retrieving records on tags only, it is possible, and happens many times, that enough records are retrieved on one tag to pass the threshold. This happens with all the aforementioned topics. For the negative records of Adriaan van der Hoop, Heineken, Blaeu and Bredero, the positive records of Theo van Gogh, Ambachten en beroepen and bierbrouwerij, and the topics Coffeeshop and Witkar, the First search does not deliver results, so the system immediately proceeds to the Second search, which reaches the threshold with records for only one tag.

The negative records for Ambachten en beroepen, bierbrouwerij, Paleis van Volksvlijt and VOC also use the Third search, in addition to the First and Second search. This indicates that the records from the tags of the Second search were not enough to pass the threshold, so the tags from another category were used, which did include a tag that could gather enough records to pass the threshold. The topic Paleis van Volksvlijt even uses all three searches in one run, which shows that the system progresses through the searches as intended.

The negative records for Floor Wibaut and Abraham Willet and the positive records for Blaeu and Bredero are non-existent. This means that none of the searches had tags that could provide records. The reason is that not every combination of theme and category has tags, and if exactly such a combination is used, naturally no records will be found.

Topics that are from the same theme and category have results that look alike. This is to be expected, because after the First search the topic, which is the only unique search term, is no longer used and only the tags are relevant. The tags are the same for every topic in the same theme and category. Therefore, the results of topics such as Adriaan van der Hoop and Heineken, Floor Wibaut and Abraham Willet, Blaeu and Bredero, and Coffeeshop and Witkar are very much alike or even equal. Although these results show that the system works properly, it often delivers records that do not have semantically strong links with the topic, because the Second or Third search is used many more times than the First search. In turn, this is because there are so few tags in the database that can be classified as positive or negative. When the database is expanded with more tags, more records with stronger relevance will be retrieved. Also, if a topic is inserted that is not one of the test topics, the system will almost immediately fall back on the Second or Third search, for the recurring reason that there are not enough tags in the database yet: there is only a very small chance that records of the unknown topic contain one of the tags that are already in the database.

The conclusion can be made that this prototype works, albeit limited, and can be expanded on all sides to create a system that will deliver records that have a semantically strong relationship to the topic. The next section will describe how the Amsterdam Museum evaluated the system.

6.1 Evaluation by the Amsterdam Museum

The AM also evaluated the output of the system and they focused more on the quality of the output than the quantity. The following ten topics were used as input and for every input the first five records were used for evaluation. The sentiment that is written with the topics is the sentiment on which the records were retrieved. The AM was given the first five results along with the topics and had to state if the records that were returned for the topic were indeed relevant records or not, and if not, why they were not the expected records.

Event topics: Olympische Spelen (positive records), Wederdopersoproer (positive records).

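Human topics: Adriaan van der Hoop (positive records), Heineken (positive records), Abraham Willet (positive records), Theo van Gogh (negative records).

Object topics: bierbrouwerij (negative records), Paleis van Volksvlijt (positive records), VOC (positive records), Coffeeshop (negative records).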

The evaluation of the AM on the topics will be explained for each topic:

Olympische Spelen (positive records):

All five records that were returned for this topic were found to be relevant records by the AM, probably because the semantic link between the topic and the returned records is strong, considering that all the records are retrieved by the First search. Moreover, the tags that are used in combination with the topic while searching for positive records are "Sport" and "Sport en spel", indicating that the topic and the tags are already in the same context.

Wederdopersoproer (positive records):

All the records of this topic were irrelevant, but that was to be expected, since this topic is categorized incorrectly when running the system. This topic is therefore disregarded.

Adriaan van der Hoop (positive records):

In this case, none of the five returned records were records the AM expected, and the AM even states that none of the records are related to the topic. Only the first record was retrieved by the First search, so this one should have the strongest semantic link. The content of the record is an image of ice skating on the Zaan. The system returned this record because the topic is Human and occurred in the field association.person.name, and three of the tags that are stored for positive records on a Human in Ondernemerschap, "IJsvermaak", "Kinderspel" and "Sport en spel", occurred in combination with the topic. This indicates that there are simply too few tags to always retrieve records that have a semantically strong link, even though this record was returned by the First search. The other four records were returned by the Second search with the tag "jubileum", which immediately creates a very weak semantic link with the topic.

Heineken (positive records):

The topic Heineken has almost the same result as the previous topic, with one record retrieved by the First search and the rest by one tag in the Second search. This time, however, the first record is considered relevant by the AM. It is a record about a Heineken drinking glass, retrieved because there is a match between the topic occurring in the field association.person.name and the tag "jubileum". No further matches can be made in the First search, so the system proceeds to the Second search, yielding the same results as for the previous topic, because Heineken is also a Human under the theme Ondernemerschap. Consequently, records with the tag "jubileum" are gathered again, and these records again have no semantic connection with the topic. That is the reason the AM states that these records are not records one would expect with this input.

Abraham Willet (positive records):

With this topic as input, two records are retrieved by the First search and three by the Second search. A match between the topic occurring in the field association.person.name and the tag "verjaardag" delivers the first record, and a match between the topic occurring in the field AHMteksten and the tag "bloemen" yields the second record. Both records have a strong semantic link to the topic, and the AM agrees, stating that the first two records are indeed relevant. However, the records retrieved by the Second search are again not what the AM expects for this topic. These records are retrieved because the tag "verjaardag" is used, and without the topic as an extra search guide this results in records with a very weak semantic link to the topic.

Theo van Gogh (negative records):

With this topic as input, the First search retrieves five records that are memorabilia of Theo van Gogh, and the AM states that all of them are relevant. The First search matches the topic in the field association.person.name with the tag "moordaanslag". This gives so many records (40) that the threshold is passed easily and the first five records are all retrieved by the First search. This also results in records that have a strong semantic link with the topic.


bierbrouwerij (negative records):

In this case, the Second search delivers all five records, with a surprising outcome. All the records with the tag "brand" are returned, which results in two records about the Paleis van Volksvlijt, a record about a tableau, a sign that prohibits smoking cannabis and a tableau for the abstainers' union. The first three records are unusual according to the AM; however, they state that the last two records are funny and indeed have a correct negative nature. This is surprising, because most of the time the first few records fit the topic best and the last records do not. However, I think that if negative records were retrieved for the topic Paleis van Volksvlijt, the AM would think vice versa. That is not the case though, because the next topic retrieves positive records for the topic Paleis van Volksvlijt.

Paleis van Volksvlijt (positive records):

This topic results in four records from the First search, with the rest coming from the Second search. However, due to a miscommunication on my part, the last record from the First search was not sent to the AM, so they evaluated only three of the four First search records. One of the results is a record about a portrait in honor of the 25-year jubilee of the reign of King Willem III and Queen Sophie. The system delivers this record because it finds a match between the topic and the tag "jubileum". However, none of the fields that lead to this record contains the topic. Since no synonym handling is implemented in the system, this suggests that the AM Database itself holds some synonym for the topic that delivers this record.

Another record from the First search is about an award ceremony that took place in the Paleis van Volksvlijt. It is retrieved because the topic occurs in the field association.subject and the record also carries the tag "prijsuitreiking". The AM does not consider this record or the previous one relevant for this topic. For the first one the reason is clear: it has a very weak semantic link with the topic, because the topic does not occur in any of the fields. For the second one it is less straightforward. A possible explanation is that the record is about something that took place in the Paleis van Volksvlijt but does not contain information about the building itself, which makes it an incorrect output.

A third record delivered by the First search is, however, considered relevant for this topic. It is a record about the revue "Amsterdam je bent goud waard" in the Paleis van Volksvlijt. This record is delivered because the topic occurs in the field association.subject in combination with the tag "revue, cabaret".

In addition to these records, there is one record from the Second search. It is retrieved by the tag "onderwijs", so it has a very weak semantic link to the topic, and the AM also states that this record is not a relevant output record.

VOC (positive records):

This topic retrieves two records from the First search and the rest from the Second search. The two records from the First search are about koppertjesmaandag and therefore have only a small connection to the topic. The system reaches these records because the topic occurs in the field AHMteksten in combination with the tag "eerste steenlegging". This results in two records whose semantic link to the topic is somewhat stronger than weak, and although the AM finds these records passable, they only barely qualify.

The other records are retrieved by the Second search and are, just as with the previous topic, records with the tag ”onderwijs”, resulting in a very weak semantic link to the topic and the AM states that these records are no relevant output records.

Coffeeshop (negative records):

All the records that are returned for this topic are relevant records according to the AM. One record is found by the First search and the rest by the Second search. The first record is found because the topic occurs in the field title in combination with the tag ”drugs”. Thereafter, the records that are retrieved are the ones that only have ”drugs” as their tag, resulting in records that have a strong semantic link to the topic.

A pattern can be discovered in these evaluations. Often the records that are retrieved by the First search are relevant as output and, because they result from the First search, also have a strong semantic link. However, when the records from the Second search start appearing, the link almost immediately disappears and the AM no longer finds the resulting records acceptable. This is because the tag database is still very small, and its tags are on the one hand too specific (the First search barely finds any records) and on the other hand too broad (the Second or Third search finds too many records). The tag database should definitely be expanded, possibly with a structure that classifies tags into subcategories, to navigate the tags better and thereby discover better-suited tags. Currently, a tag from the Second search only occasionally matches the topic and results in relevant records. That chance is still too small, and if this already happens in the Second search, the Third search will perform even worse.

After the evaluation by the AM it may seem that the Second and Third search do not perform strongly and only deliver records that are semantically too weak, and thus irrelevant to the topic. However, the result has to be seen as part of a bigger concept: the Twitter application, this system and the story engine together form one system. The output of this system may seem weak, but the story engine can make the results semantically valuable, because the output contains both the initial parameters and the records that were retrieved with them. The story engine can therefore create a coherent story that connects the input queries with the output records in such a way that the user who receives the output gets a good idea of how the input and output are connected. This may still be a weak semantic link, but a connection and a coherent story can be made nonetheless. Seen in this way, the Second and Third search actually perform well, because the output always stays inside the same theme. The AM was asked to evaluate the records that were returned for an input, and that is exactly what they did, which makes it seem as if the Second and Third search do not deliver. Within the larger system of which this system will be a part, however, they do.

7 Future work

Although this system lays a basis for how the whole process could work, many improvements can be made. One of the largest was already mentioned earlier: adding every tag that is not yet in the database, to expand and eventually complete the tag database. This could be done while new topics come in: when a topic comes in and unknown tags are linked to it, they can be classified and stored under the right theme, category and sentiment. This will improve the results of the system, effectively making it a learning system.
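The first step of such a learning system, detecting which tags of an incoming topic are still unknown, could look roughly like the Java sketch below. The class and method names are hypothetical, tags are compared by exact spelling, and the classification itself is still assumed to be done by hand afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of collecting tags that still need classification; names are hypothetical. */
final class UnclassifiedTagsSketch {

    /** Returns the tags that are not in the tag database yet, without duplicates. */
    static List<String> unclassified(List<String> recordTags, Set<String> knownTags) {
        Set<String> pending = new LinkedHashSet<>();
        for (String tag : recordTags) {
            if (!knownTags.contains(tag)) {
                pending.add(tag);
            }
        }
        return new ArrayList<>(pending);
    }

    public static void main(String[] args) {
        // Invented sample: tags found for a new topic versus tags already classified.
        List<String> found = List.of("Sapporo", "drugs", "Sapporo", "Beijing");
        Set<String> known = Set.of("drugs");
        System.out.println("To classify: " + unclassified(found, known));
    }
}
```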

There are also smaller things that could be improved:

In the part where the Twitter application is explained, it is stated that the output of that application is an XML file, which contains the input for this system. However, to keep the system simple to test, this connection was not implemented yet. If it had been implemented, a trending topic would have had to be created every time a topic was to be used as input, so that the Twitter application could detect it and process it as output. Therefore, for this system, the theme, sentiment and topic have to be filled in by hand. Future work should attach the Twitter application to create one fluent system; at a later stage the story engine can also be added, to obtain a complete system from user input to output back to the user.

In the categorization of topics, a connection with Wikipedia is created. However, not every topic has a Wikipedia page, or the Wikipedia page that comes up is a disambiguation page, for example when there are multiple persons with the same name. If this happens, no connection with Wikipedia can be established. This is not a problem for Human topics, because a second process involving the AM Database categorizes them. However, it is a problem for topics that are events, because there is no second process to categorize them. When no Wikipedia connection can be established, topics that are Events cannot be categorized as such, will also not be categorized as Human, and therefore always end up as Objects. This is an incorrect classification, as happens with the topics Wederdopersoproer and Aansprekersoproer, hence an improvement should be made to correct this.

The AM Database stores names of humans as “Lastname, Firstname”, which is not how people tweet about persons. Because persons can only be found by entering names as they are stored in the database, the word order is changed for every topic that consists of more than one word. For example, when the topic Theo van Gogh is inserted, it is changed to Gogh, Theo van. This takes place before the categorization, so at that point it is not yet known that Theo van Gogh is a human. If, for example, Olympische Spelen is inserted as a topic, the order of the words also changes, to Spelen, Olympische. The first example returns records that lead to the categorization of the topic as Human, whereas the second does not retrieve any records. Although changing the order of words works, it is not an elegant solution. An improvement would be a connection to the Union List of Artist Names (ULAN) of the Getty Research Institute (http://www.getty.edu/research/tools/vocabularies/ulan/). This is a database with the names of thousands of artists and information about them, such as what kind of artist the person is, dates of birth and death, relationships with other people, etc. When a topic is inserted into the system, the system could query ULAN to discover whether the topic is a name listed there. If so, it can categorize the topic as Human and immediately use the information from ULAN for further processing. This is a more elegant solution than the current implementation.
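The word reordering described in this paragraph can be illustrated with a short Java sketch. It reproduces the two examples from the text by moving the last word to the front, followed by a comma. The class and method names are hypothetical, and the actual implementation may differ in details such as whitespace handling.

```java
/** Hypothetical helper that rewrites a topic into "Lastname, Firstname" order. */
final class NameReorder {

    /**
     * Moves the last word of the topic to the front, followed by a comma.
     * Topics of a single word are returned unchanged (apart from trimming).
     */
    static String toDatabaseOrder(String topic) {
        String trimmed = topic.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        if (lastSpace < 0) {
            return trimmed; // one word: nothing to reorder
        }
        String lastWord = trimmed.substring(lastSpace + 1);
        String rest = trimmed.substring(0, lastSpace).trim();
        return lastWord + ", " + rest;
    }

    public static void main(String[] args) {
        System.out.println(toDatabaseOrder("Theo van Gogh"));
        System.out.println(toDatabaseOrder("Olympische Spelen"));
    }
}
```

Because the rewrite is purely syntactic, it is applied to person names and non-names alike, which is exactly why Olympische Spelen ends up as a query that matches nothing.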

The Second search uses the date of an Event topic to find records within a certain time frame, which yields records that are semantically closer to the original topic because they are closer in time. It would be an improvement if this were also possible for the other two categories. The difficulty is that it is hard to obtain a date corresponding to a human or an object, and this problem was not solved in this research project. Even if it were solved, a couple of questions remain. What date should be used for a human: the date of birth, the date of death, an average of the two, or something else? And what date should be used for an object: the date of creation, the date when the object became recognized or used, or something else? If date-based search is extended to the Human or Object category, all these questions should be researched.
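The time-frame idea behind the Second search can be sketched as a simple filter in Java. This is an assumption-laden illustration: the DatedRecord type, the use of a single year per record, the symmetric window and the ordering by distance in time are all choices made here for clarity, not details taken from the implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a search restricted to a time frame around a topic's date. */
final class DateWindowSearch {

    /** Minimal stand-in for a database record with one associated year. */
    static final class DatedRecord {
        final String title;
        final int year;

        DatedRecord(String title, int year) {
            this.title = title;
            this.year = year;
        }
    }

    /**
     * Keeps the records whose year lies within windowYears of topicYear,
     * ordered from closest to furthest in time.
     */
    static List<DatedRecord> withinWindow(List<DatedRecord> records,
                                          int topicYear, int windowYears) {
        List<DatedRecord> hits = new ArrayList<>();
        for (DatedRecord r : records) {
            if (Math.abs(r.year - topicYear) <= windowYears) {
                hits.add(r);
            }
        }
        hits.sort(Comparator.comparingInt(
                (DatedRecord r) -> Math.abs(r.year - topicYear)));
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample data, for illustration only.
        List<DatedRecord> records = new ArrayList<>();
        records.add(new DatedRecord("record A", 1530));
        records.add(new DatedRecord("record B", 1536));
        records.add(new DatedRecord("record C", 1600));
        for (DatedRecord r : withinWindow(records, 1535, 10)) {
            System.out.println(r.title + " (" + r.year + ")");
        }
    }
}
```

For a Human or Object topic the same filter would apply once a representative year is chosen, which is the open question raised above.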

An initial idea was to create an argument structure from the input topic, in order to return an argument structure that opposes the input. When a topic with a positive sentiment is used as input, it would be turned into an argument structure, and the argument structure returned as output would argue for the negative point of view. However, due to time constraints, this idea was not implemented; it was simplified to returning the records that belong to a topic and carry the opposite sentiment. Adding this argument structure would be an improvement, because it results in a more specific search. The research from [Bocconi et al., 2005] can be used as a basis for this feature.

The tags in the database are currently treated as a list, where the first tag is used first in a search. The first tag therefore always gathers the first records, and if the number of records already passes the threshold, no other tag is ever used. This can be improved by rotating the tag that is at the top of the list, or by gathering only one record per tag; if the threshold cannot be passed this way, further rounds along all the tags can be made until it is. An additional improvement would be to place the tags with the most social relevance at the top of the list, in order to retrieve records that are also socially relevant.
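The second proposal, one record per tag with repeated rounds, could look roughly like the Java sketch below. It is only an illustration under stated assumptions: records are represented by plain identifiers, the per-tag search results are assumed to be available as a map, "passing the threshold" is read as collecting at least that many distinct records, and the name TagRoundRobin is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical round-robin gathering of records across tags. */
final class TagRoundRobin {

    /**
     * Takes one record per tag per round, in tag order, until the threshold
     * is reached or every tag has run out of records. Duplicates are skipped.
     */
    static List<String> collect(Map<String, List<String>> recordsByTag, int threshold) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        boolean anyLeft = true;
        for (int round = 0; anyLeft && result.size() < threshold; round++) {
            anyLeft = false;
            for (List<String> records : recordsByTag.values()) {
                if (round >= records.size()) {
                    continue; // this tag has no record for the current round
                }
                anyLeft = true;
                String record = records.get(round);
                if (seen.add(record)) {
                    result.add(record);
                }
                if (result.size() >= threshold) {
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample data; the map order stands in for the tag list order.
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        byTag.put("tag1", java.util.Arrays.asList("a1", "a2", "a3"));
        byTag.put("tag2", java.util.Arrays.asList("b1"));
        byTag.put("tag3", java.util.Arrays.asList("c1", "c2"));
        System.out.println(collect(byTag, 4));
    }
}
```

Ordering the map by social relevance before calling collect would cover the additional improvement mentioned above without changing the loop.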

This system utilizes Wikipedia through the MediaWiki API; however, DBpedia would have been a better online source. DBpedia is a semantic representation of the information that is available on Wikipedia and also has pages in Dutch, which is an important feature because this application will be used by Dutch users. The option of using DBpedia came up too late in the process, so the connection with Wikipedia was retained. The structure of Wikipedia and DBpedia is, however, the same, so replacing the former with the latter should be achievable.

8 Conclusion

The research question “How can a Twitter topic be mapped to an inventory-database to establish an argument structure?” has been answered by this research project. By creating a system that places a layer between the input topic and the inventory-database, relevant records can be retrieved from the database, and by using tags as a representation of the sentiment, records can be found that highlight an opposite point of view on the input topics, representing the argument structure. In conclusion, this prototype system works on a small scale and lays the groundwork for improvements on multiple points, in order to create a full scale system with better results.

9 Discussion

Unfortunately, the tag database is rather small and contains few tags that were actually practical. This is because the test topics, from which the tags are extracted, were to be delivered by the Amsterdam Museum. This took longer than initially planned and I received them in the last week of coding; therefore, changes had to be made quickly and there was no time left to extend the tag database further. Had the topics arrived earlier, this improvement could have been made earlier.

References

[Bocconi et al., 2005] Bocconi, S., Nack, F., and Hardman, L. (2005). Supporting the generation of argument structure within video sequences. In Proceedings of the sixteenth ACM Conference on Hypertext and Hypermedia, pages 75–84. ACM.

[Coursey and Mihalcea, 2009] Coursey, K. and Mihalcea, R. (2009). Topic identification using wikipedia graph centrality. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 117–120. Association for Computational Linguistics.

[Coursey et al., 2009] Coursey, K., Mihalcea, R., and Moen, W. (2009). Using encyclopedic knowledge for automatic topic identification. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL ’09, pages 210–218, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Hargood, 2011] Hargood, C. (2011). Semiotic term expansion as the basis for thematic models in narrative systems.

[Isaac et al., 2008] Isaac, A., Schlobach, S., Matthezing, H., and Zinn, C. (2008). Integrated access to cultural heritage resources through representation and alignment of controlled vocabularies. Library Review, 57(3):187–199.

[Tordai et al., 2009] Tordai, A., van Ossenbruggen, J., and Schreiber, G. (2009). Combining vocabulary alignment techniques. In Proceedings of the fifth international conference on Knowledge capture, pages 25–32. ACM.

[Toulmin et al., 1984] Toulmin, S., Rieke, R., and Janik, A. (1984). Introduction to Reasoning. MacMillan Publishing Company, 2nd edition.

[Van Ossenbruggen et al., 2011] Van Ossenbruggen, J., Hildebrand, M., and De Boer, V. (2011). Interactive vocabulary alignment. In Research and Advanced Technology for Digital Libraries, pages 296–307. Springer.

Contents

1 Introduction
2 Literature Review
3 The System
3.1 Overview of the system
3.2 Twitter application
3.3 Categorization
3.4 Keywords-without-records storage
3.5 Tags
3.6 Three types of searches.
3.7 Output
4 Experimentation
4.1 Category search fields
4.2 Creating tag database
5 Results
6 Evaluation
6.1 Evaluation by the Amsterdam Museum
7 Future work
8 Conclusion
9 Discussion
References