Enriching news for supporting users’ information needs using schema-driven classification of entities and relations

Thesis Master

Information Studies - Human Centered Multimedia

University of Amsterdam - Faculty of Science

Title

Enriching news for supporting users’ information needs

using schema-driven classification of entities and relations

Author: Viola Pinzi

Student number: 10652434

Supervisor: Lynda Hardman – Centrum Wiskunde & Informatica (CWI)

Second examiner: Frank Nack – University of Amsterdam

Enriching news for supporting users’ information needs

using schema-driven classification

of entities and relations

Viola Pinzi

MSc Information Studies – University of Amsterdam

viola.pinzi@student.uva.nl

ABSTRACT

The LinkedTV project News scenario aims at improving the experience of watching news on TV. It envisages that potential users of the system watch news broadcasts, express a need for additional information and that the system provides resources from the web that are potentially relevant to them.

Our goal was to investigate user information needs for a given news topic, based on news video fragments. Furthermore, we aimed at representing the news video fragment and related information needs in a form compatible with the system knowledge representation model.

Our contribution consisted of a method to formally represent fragments and requirements using a controlled vocabulary, which was applied to information needs collected through a user study. The analysis resulted in lists of concepts and schemas of the content structure. This contribution supports semantic linking between news, related information needs and additional resources to be retrieved from the web to satisfy those needs.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services – data sharing, web-based services.

General Terms

Human Factors, Languages.

Keywords

News enrichment, user information needs, news classification, semantic web, semantic annotation.

1. INTRODUCTION

When watching news broadcasts, viewers may have information needs triggered by the news item they are watching. The LinkedTV project¹ aims at connecting television mainstream content and web resources in a single experience (networked media), through an automatic system that generates links between the two. One of the scenarios used to illustrate the technology developed is the Interactive/enriched News scenario. The News scenario aims at improving the experience of watching news on TV. It envisages that a potential user of the system watches a news broadcast (consisting of news fragments) and expresses a need for additional information related to it [15]. The system aims at providing users with resources from the web that are potentially relevant to them, given a specific news item. User information needs are defined in terms of the value users attribute to different contents the application might propose to them, that is, in terms of user relevance and not as system relevance (algorithmic) or topicality (relevance to a topic) [2].

We identified two issues: first, the need to investigate which resources are relevant for users, when watching news; second, the need for user requirements to be made available for the system, once learned, in a form that the system can understand. While the user would express needs using natural language, in the form of free text, the system uses a machine-readable model to represent the content of web resources. This means that user requirements cannot be directly used by an automated system to generate queries and to retrieve web resources that satisfy those needs. They have to be coded and represented in a way that is compatible with the system knowledge representation model. We tackled both problems: on the one hand, we aimed at identifying which kind of information news followers would like to find in additional resources from the web, related to the content of specific news video fragments of TV broadcasts (user information needs and relevance of additional information); on the other, our purpose was to specify a method to represent user needs in a form that is usable for the system and to apply this method to the identified needs (user requirements representation).

¹ www.linkedtv.eu

Section 2 provides a brief overview of previous work on information needs and models for semantic annotation, in the news field. Section 3 presents the research methods. Section 4 describes the user study we carried out to collect user information needs. Section 5 presents the analysis of news fragments and information needs and final results.

2. RELATED WORK

Several studies have been conducted on user information needs related to seed news content, and numerous models have been developed for news representation, classification and retrieval.

2.1 News enrichment and user information needs

To evaluate methods for investigating user requirements, we considered studies that match our context: news enrichment as final purpose; news video fragments as seed content; understanding related information needs as aim. Palumbo et al. [14] and Perez et al. [15] conducted research within analogous scenarios. The method used by Perez et al. in the first phase of their research was selected as reference model for our investigation of user requirements. They asked participants of a user study to express their needs, after watching news video fragments, in a free and open way, through unstructured text. This method was expected to emulate the real-life situation envisaged by the LinkedTV News scenario. The concept of annotation schema for video fragments, proposed by Palumbo et al., constituted a starting point to specify our analysis method of news fragments and needs, since it supports the representation of news contents in a structured way. However, these studies did not tackle the problem of how to make user requirements compatible with the knowledge representation model of an automated system, which was one of our aims.

2.2 Semantic Web technologies and news classification

To propose a solution for the user requirements representation and how they could be used to retrieve relevant additional resources, we considered approaches that apply Semantic Web technologies, on one hand, to link resources with related contents across different systems, and, on the other, to describe and classify news articles.

Demartini et al. [6] discuss methods to establish these semantic links among resources through the identification of entities, in the resource text, and the extraction of relations among entities to be used for their enrichment. Many systems tackle this task by mapping entities and relations to concepts of specific knowledge bases (for example, DBpedia datasets²), or, more ambitiously, to the whole Linked Open Data graph³. This approach has been increasingly applied also in the field of media production and news. The British Broadcasting Corporation⁴ (BBC) created a system to manage their knowledge base and to link resources across all their platforms and to external data through semantic annotation [11]. Their system uses DBpedia identifiers as controlled vocabulary, mapped to their internal concepts categorization system (CIS). They effectively introduced information needs into the mix, with the idea to support users in their navigation across different information flows. However, their approach takes into consideration user needs mostly in terms of usability of the platforms. In our study, semantic annotation was selected as reference method for representing not only resources, but also information needs, to identify links among concepts that are relevant for users.

The system described by Vargas-Vera et al. [16] recognizes events within the stories of news articles, using the KMi ontology to automatically instantiate pre-defined event templates. In the KMi ontology, forty types of events are defined as classes and each is represented through a template, which indicates properties and classes of objects. Although their analysis of news content structure and the resulting templates constitute a reference model for our method, they tackled the news classification problem from the point of view of event types and not in terms of categories of contents. Furthermore, in their model, each template has a central concept, the Event, which is the subject of all the properties. In our research, on the one hand, we look at news classification in terms of news topics, as macro-categories, and, on the other, we represent news stories as sets of linked concepts, not as event types corresponding to predefined templates. We believe that this approach may prove more sustainable and efficient for large-scale applications, since topic categorization is a higher level classification method in comparison with an event typology. Furthermore, through sets of linked concepts, it is possible to represent any news content structure, seen and unseen.

3. RESEARCH METHODS

To investigate user requirements for news enrichment, our approach was to design a user study, to prompt a group of potential users of the LinkedTV system to freely express their information needs, in an unstructured way, after watching news video fragments, and to introduce them to the idea of finding additional resources, online, to match and satisfy those needs. As for the representation of requirements and news fragments in a form compatible with the system knowledge model, we opted for applying Semantic Web technologies, using semantic annotation and schemas, to support the linking between fragments, needs and web resources, across different systems.

² http://wiki.dbpedia.org/Datasets
³ http://linkeddata.org
⁴ www.bbc.co.uk/rd

3.1 User study

3.1.1 Exploratory search task

We designed a user study that emulated the expected scenario of a real-life situation: a person watches a news video fragment and then searches online resources to satisfy the resulting needs for additional information. The study protocol was based on an information seeking task and, in particular, on the concept of exploratory search behaviour. The characteristics of this type of search task match the main aspects of our scenario: an open-ended and dynamic multi-stage process that entails a certain level of uncertainty and that is associated with not well defined information needs [17]. Since the process was designed to elicit a real-life situation [4], it was formulated as a simulated work task [3], where the participant is provided with a general description of the situation (context, problem and objective) and with minimum constraints, to support free expression of needs and open search process.

3.1.2 News topic and news video fragments selection

A list of parameters was specified and applied to guide the selection of news video fragments proposed to the participants of the study.

To make the analysis of results feasible and to support the identification of content structure, the news video fragments watched by participants needed to have some level of consistency. We identified topic as the central dimension to investigate news classification, since it is one of the main parameters of news categorization systems and determines the way news programmes are organised. Furthermore, different news topics are represented by different domain knowledge and domain knowledge influences the process of information representation, which is another core matter of this research. Topic is therefore the consistency aspect that guided the selection of news fragments for the study. Because of feasibility constraints, we also decided to investigate only one topic, chosen considering the related study by Palumbo et al. [14], which focused on environmental news. Environmental news can be considered mostly as ‘slow developing news’, while TV news programmes consist mainly of ‘fast breaking news’ [12]. Therefore, we opted for a topic connected with environmental news, but whose nature is mainly ‘fast breaking’: Disasters and accidents. Furthermore, since in stories belonging to this topic both human and natural factors are involved, we expected the final findings to be more generalizable with respect to the scope of the problem. DBpedia’s subject categories were used as reference model, to identify the topic and its label and to further specify five sub-topics: Industrial accidents and incidents; Transport accidents and incidents; Fires and explosions; Engineering failures; Natural disasters.

We also defined a boundary with respect to the events’ causes. While natural disaster causes are not intentional, man-made disasters and accidents include events whose causes may be both intentional (such as War) and unintentional (such as Structural failures). For consistency, we avoided events whose main cause was intentional or possibly cross-category. For example, within the sub-topic Explosions, we considered news about Gas explosions and we ruled out the ones about Bombings.

Finally, there was the need to ensure clarity and intelligibility of news video fragments (content-wise and at technical level) and alignment in terms of format and length.

3.2 Analysis and representation of news fragments and information needs

We specified a method for news and information needs representation, along three dimensions: to make the needs available for an automated system; to relate them with the content of news fragments; to identify the relations between concepts that are relevant for users.

Within the Semantic Web framework, an automated system can represent information in terms of entities, entity types and relations and assign to them semantic descriptions (concepts). Instances, classes and properties are the three types of concepts providing these descriptions [1] and the whole process is defined as semantic annotation [10]. Concepts are also assigned unique identifiers, which allow different instantiations to be linked to the same description. This linking is a fundamental aspect of the Semantic Web [9] and the main reason we applied these technologies for our analysis. It provides the possibility for each piece of information to be semantically represented in a standard way, which supports linking and retrieval of resources that contain the same entities and/or the same relations between entities. The opportunity of linking resources, not only through single entities but also through the relations among them, is one of the central aspects of our approach, since it was expected to result in a richer interpretation of user needs, which might enhance the level of relevance of potential related resources retrieved from the web.
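The role of unique identifiers described above can be illustrated with a minimal Python sketch. It is not the system's implementation: the `Annotation` record and the `group_by_concept` helper are hypothetical names introduced here, and pairing the phrase "which river" from an information need with the Elk River resource is an assumption made only for illustration. The two DBpedia resource identifiers are the ones that appear in Tables 4 and 5.

```python
from dataclasses import dataclass

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: one text mention annotated with one concept."""
    mention: str      # surface text as it appears in the source
    source: str       # "news fragment" or "information need"
    concept_uri: str  # unique identifier of the concept
    kind: str         # "instance", "class" or "property"


def group_by_concept(annotations):
    """Group mentions that share an identifier, whatever text they come from."""
    groups = {}
    for annotation in annotations:
        groups.setdefault(annotation.concept_uri, []).append(annotation)
    return groups


elk_river = DBPEDIA_RESOURCE + "Elk_River_(West_Virginia)"
west_virginia = DBPEDIA_RESOURCE + "West_Virginia"

annotations = [
    Annotation("The Elk river", "news fragment", elk_river, "instance"),
    # Assumption for illustration: this need is read as referring to the Elk River.
    Annotation("which river", "information need", elk_river, "instance"),
    Annotation("west virginia", "information need", west_virginia, "instance"),
]

linked = group_by_concept(annotations)
for uri, mentions in linked.items():
    print(uri, [m.source for m in mentions])
```

Because both mentions of the river carry the same identifier, the news fragment and the information need end up linked through that concept, which is the property the analysis relies on.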

Our analysis method consists of five processes: extraction of entities; identification of relations between entities; mapping of entities and relations to concepts of a controlled vocabulary (resources and properties); identification of entity types (classes); representation of the content structure using the mapped concepts (schema). We refer to the model proposed by Grishman [8], for the information extraction steps, and to Exner et al. [7], for the mapping to the concepts of a controlled vocabulary. The method was expected to reveal which relations between entities (or groups of relations) are relevant for users, given specific news fragments and related information needs, and to provide a representation of these relations compatible with the knowledge representation model the system uses.

3.2.1 Information extraction and concept mapping

The first step of the analysis consists in identifying entities and relations, within each news fragment and set of information needs, and verifying co-references, to group recurring ones under the same label. Furthermore, when entities are qualified by adverbs, adjectives, relative clauses, implicit or explicit copulative expressions, relations with other entities etc., inference is applied to select the entity to retain and its label. To map entities and relations to concepts, DBpedia was chosen as reference knowledge base. DBpedia consists of large and structured datasets [13], which supports, on one hand, a good level of mapping of factual information and, on the other, the possibility to link entities and relations to the concepts of its Ontology⁵. Since DBpedia is already widely used as hub to link other knowledge bases (such as the LOD cloud⁶ and BBC’s platforms), an annotation schema mapped to its ontology was expected to facilitate retrieval and classification of linked resources from the web.

Entities are mapped to Resources of DBpedia Named Graph and entity types and relations to Classes and Properties of the Ontology. Resources are explored by related keywords, through DBpedia Lookup Service⁷, and coupled with entities based on their Descriptions. Properties and Classes are identified exploring manually DBpedia schema⁸, analysing Range and Domain, for Properties, and Description and position in the hierarchy, for Classes. Co-references are analysed across news fragments and information needs to map to the same concept.

⁵ http://wiki.dbpedia.org/Ontology
⁶ http://wiki.dbpedia.org/Interlinking

3.2.2 Content structure and representation

To look at the structure of the content, in order to identify the links between concepts that are relevant for users, the whole corpus of factual information of news fragments and information needs is represented through expressions subject-predicate-object, at instance level. From this low-level representation, a higher level schema is constructed, where each slot is defined by a Property and specifies the Classes of resources for both Range and Domain. The aim is to support retrieval of resources that contain both same entities and relations and related entities belonging to the same classes.
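The step from instance-level expressions to schema slots can be sketched as follows. This is an illustrative reading of the description above, not the procedure actually used in the study: `Triple`, `Slot`, `build_schema` and the fallback to the upper level class `Thing` are assumptions, and `ExampleRiver` and `ExampleTown` are made-up placeholder instances.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")
Slot = namedtuple("Slot", "domain_class property range_class")


def build_schema(triples, class_of, default_class="Thing"):
    """Lift instance-level triples to slots by replacing instances with classes.

    Instances without a known class fall back to an upper level class
    (an assumption of this sketch). Duplicate slots are kept only once.
    """
    slots = []
    for triple in triples:
        slot = Slot(
            class_of.get(triple.subject, default_class),
            triple.predicate,
            class_of.get(triple.object, default_class),
        )
        if slot not in slots:
            slots.append(slot)
    return slots


# Hypothetical class assignments for the instances used below.
class_of = {
    "Elk_River_(West_Virginia)": "NaturalPlace/BodyOfWater",
    "West_Virginia": "PopulatedPlace",
    "ExampleRiver": "NaturalPlace/BodyOfWater",  # placeholder instance
    "ExampleTown": "PopulatedPlace",             # placeholder instance
}

triples = [
    Triple("Elk_River_(West_Virginia)", "Location", "West_Virginia"),
    Triple("ExampleRiver", "Location", "ExampleTown"),
]

schema = build_schema(triples, class_of)
print(schema)
```

Both triples collapse into a single slot with Domain class NaturalPlace/BodyOfWater, Property Location and Range class PopulatedPlace, so a resource about a different river and town would still match the schema.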

4. USER STUDY

Before the main experiment, a pilot study was carried out with two participants, to test the protocol, efficiency of tasks and clarity of instructions.

4.1 Protocol description

Participants were fluent in English and were interested in following the news. It was expected that they could relate with the scenario and that they possessed minimum competences to perform the tasks and to understand the news video fragments. The sessions were conducted in a semi-controlled environment (meeting-style setting) and participants used a laptop provided by the researcher to perform the tasks, through an online template.

The protocol is summarized in the following.

• Introduction – The researcher provided a brief explanation and prompted the participant to read the text explaining the Situation.

• Step 1. Selecting and watching seed news – The participant was asked to select and watch one news video from a list of five, while also thinking about the additional information they would like to have, related to the content of the news.

• Step 2. Expressing information needs on news topic – The participant was asked to express information needs in writing, as a list, ordered from high to low interest.

• Step 3. Online search – The participant was prompted to search online for the items of information in the list and to choose one resource for each item. They were asked to use Firefox with the HCI Browser⁹ plug-in [5], which allowed the search process to be structured by item and transaction logs to be recorded.

• Step 4. Information needs’ motivations – Finally, the participant was asked to provide motivations for the items in the list.

To support open and free expression of needs, the constraints were minimal: no time limit; no fixed number of information needs to express or to search for; no limitations about the online search process (type of sources, formats etc.). Furthermore, even if not explicitly stated in the instructions, the participants were allowed to stop at any time, to skip information items, during the online search, and to end a search task without selecting any resource.

⁷ http://lookup.dbpedia.org/api/search.asmx
⁸ http://mappings.dbpedia.org/index.php
⁹ http://ils.unc.edu/hcibrowser

4.2 Participants and data collected

The user study was conducted in June 2014, with 18 participants, 11 females and 7 males, recruited mainly at a university. The mean age was 29, with a standard deviation of 9; 14 participants were in the age range 22-29 and 4 over 30. As for the education level, 1 participant indicated Secondary school, 3 Bachelor degree, 13 Master’s degree and 1 Other. 8 participants were students, 4 indicated a profession and 6 did not provide specific indications.

News fragments 1, 2, 3 and 5 were each watched by 4 participants, and news fragment 4 by 2.

The list of collected information needs constitutes a first level answer to the problem. Participants expressed between 2 and 5 items each, for a total of 58 items. By news fragment, 12 items were collected for video 1, 16 for 2, 11 for 3, 7 for 4, 12 for 5.

5. ANALYSIS AND RESULTS

5.1 Identification of concepts

We analysed each news fragment. Three types of data were used: speech transcripts; Title and Description of the video provided within the webpage hosting it; screen captions (headlines overlapping the video frames). Speech transcripts were extracted manually or, when available, using the speech recognition tool built into the online player. This raw text was segmented into sentences and sentences expressing related information were grouped in blocks. Blocks containing factual information about the news story and identifiable entities were selected for further analysis (opinions, direct speech, imperative sentences and metadata of the news fragment were ruled out). Subsequently, we analysed the related information needs. The interpretation of collected needs (Step 2 of the user study) was also expanded and validated using data extracted from Step 3 (search queries, visited pages, selected resources and text answers) and Step 4 (motivations for needs).

Entities and relations were identified and mapped to DBpedia concepts, analysing co-references.

When it was not possible to map to DBpedia, an alternative representation was found. Unmapped entities were represented in four ways: their original form in the text; lexical entities from WordNet¹⁰; a Class assigned by us; using a Property and an expression subject/predicate/object. If an entity was mapped to a resource but its Class was not defined, a Class was assigned. If no matching Class was found, an upper level Class was assigned and the resource or lexical entity was retained in the representation to specify it. Finally, when a relation was not defined in DBpedia ontology, a lexical entity from WordNet was used. When a relation was defined but Range and/or Domain of the mapped Property did not match our context, the relation was represented either with a lexical entity or indicating the mismatch with DBpedia definition.
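The fallback rules for relations and entity types can be written down as a small decision procedure. The sketch below is only one possible reading: the function names, the order in which the alternatives are tried, and the final fallback to the original text for relations are assumptions not stated in the study. The output strings imitate the notation used in Tables 4 and 6.

```python
def represent_relation(text, dbpedia_property=None, domain_range_ok=True,
                       wordnet_entry=None):
    """Choose a representation for a relation (hypothetical helper)."""
    if dbpedia_property and domain_range_ok:
        # Relation defined in the DBpedia ontology and fitting our context.
        return f"OntologyProperty:{dbpedia_property}"
    if wordnet_entry:
        # Relation missing or mismatched: use a lexical entity from WordNet.
        return f"WordNet: {wordnet_entry}"
    if dbpedia_property:
        # Keep the Property but flag the Domain/Range mismatch.
        return f"OntologyProperty:{dbpedia_property} (Domain/Range mismatch)"
    # Assumption of this sketch: keep the original text as a last resort.
    return text


def represent_type(dbpedia_class=None, upper_class="Thing", specifier=None):
    """Choose a representation for an entity type (hypothetical helper)."""
    if dbpedia_class:
        return dbpedia_class
    if specifier:
        # Upper level Class, with the resource or lexical entity retained.
        return f"{upper_class} ({specifier})"
    return upper_class


print(represent_relation("contaminated with", wordnet_entry="verb contaminatedBy"))
print(represent_relation("related to", dbpedia_property="CausedBy"))
print(represent_type(specifier="resource Water_resource"))
```

The three calls reproduce, respectively, the relation of expression 4 in Table 4, the relation of expression 5 in Table 4, and the Range class of slot 1 in Table 6.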

Tables 1 and 2 provide statistics for the concepts identified, by type, respectively within news fragments and information needs, indicating how many were mapped to DBpedia and how many were represented with an alternative form. Table 3 shows the combined results. Each concept is counted only once, across both news fragments and information needs. Thus, the figures in Table 2 refer to new concepts that were not mentioned in news fragments.

¹⁰ http://wordnetweb.princeton.edu/perl/webwn

Table 1. Concepts identified within news fragments

Concept type | DBpedia | Alternative | Total
Entity – Resource | 37 | 25 | 62
Type – Class | 19 | 19 | 38
Relation – Property | 15 | 14 | 29

Table 2. Concepts identified within information needs

Concept type | DBpedia | Alternative | Total
Entity – Resource | 7 | 4 | 11
Type – Class | 3 | 4 | 7
Relation – Property | 6 | 3 | 9

Table 3. Concepts identified across news and needs

Concept type | DBpedia | Alternative | Total
Entity – Resource | 44 | 29 | 73
Type – Class | 22 | 23 | 45
Relation – Property | 21 | 17 | 38
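Since each concept is counted only once, the figures in Table 3 should equal the sums of Tables 1 and 2. The short sketch below checks this and derives the share of concepts mapped to DBpedia for each type. The shares are computed here from the table counts and are not reported in the study; the dictionary layout and function names are illustrative.

```python
# (mapped to DBpedia, alternative form) per concept type, from Tables 1 and 2.
news_fragments = {
    "Entity - Resource": (37, 25),
    "Type - Class": (19, 19),
    "Relation - Property": (15, 14),
}
information_needs = {
    "Entity - Resource": (7, 4),
    "Type - Class": (3, 4),
    "Relation - Property": (6, 3),
}


def combine(first, second):
    """Add the counts of two tables, concept type by concept type."""
    return {
        key: (first[key][0] + second[key][0], first[key][1] + second[key][1])
        for key in first
    }


def dbpedia_share(mapped, alternative):
    """Fraction of concepts that were mapped to DBpedia."""
    return mapped / (mapped + alternative)


overall = combine(news_fragments, information_needs)
for concept_type, (mapped, alternative) in overall.items():
    total = mapped + alternative
    share = dbpedia_share(mapped, alternative)
    print(f"{concept_type}: {total} concepts, {share:.1%} mapped to DBpedia")
```

The combined counts match Table 3, and the derived shares make the mapping difficulty for classes and properties, discussed in the next paragraph, easier to compare across types.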

Regarding entities and types, the mapping was very efficient for places, and, within DBpedia Named Graph, a specific Class was assigned for geographical resources (Country, Region, Settlement). As for entities of other types, even when they were mapped to DBpedia resources, often no Class or only upper level Class was assigned (Thing or Agent). As for relations, static or slow changing attributes of persons, organisations and places were easily mapped (Name, Description, Map etc.), while the majority of unmapped relations concerned types of actions. In some cases, we represented actions with the Property Activity, which is meant to describe a general activity and not to express the dynamicity or the uniqueness of a specific action in time (for example, in the case Organisation/Activity/Official action).

5.2 Schemas of news fragments and information needs

The content structure of news fragments and information needs was represented through subject-predicate-object expressions, at instance level (Tables 4 and 5). In some cases, the subject or the object was represented indicating an upper level Class and another expression within brackets (Table 4, line 5). This annotation means that the Event causing the damage is the same Event described by Expression 4.

From this low-level representation, the higher level schemas were constructed. The information needs schemas constitute a second level of answer to our problem: they represent the additional information relevant for users, in a form compatible with the system knowledge representation, and they show the relevant relations between classes of entities. Tables 6 and 7 show, respectively, the results for News fragment 1 and for its information needs. The graph in Figure 1 shows how the concepts of News fragment 1 (Table 6) are connected to each other, across the slots of the schema. The other schemas are available in the Appendix.
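To illustrate what "compatible with the system knowledge representation" could mean in practice, the sketch below turns one fully mapped slot into a SPARQL query string. It is a hypothetical example, not part of the study: the `slot_to_sparql` function, the query shape, and the local names `BodyOfWater`, `location` and `PopulatedPlace` (assumed here to correspond to the Domain class, Property and Range class of slot 2 in Table 6) are assumptions. Only the DBpedia ontology namespace is a known identifier.

```python
DBPEDIA_ONTOLOGY = "http://dbpedia.org/ontology/"


def slot_to_sparql(domain_class, property_name, range_class, limit=10):
    """Build a SPARQL SELECT query for resources matching one schema slot.

    The arguments are local names in the DBpedia ontology namespace;
    mapping slot labels to these names is left to the caller.
    """
    return (
        f"PREFIX dbo: <{DBPEDIA_ONTOLOGY}>\n"
        "SELECT ?subject ?object WHERE {\n"
        f"  ?subject a dbo:{domain_class} ;\n"
        f"           dbo:{property_name} ?object .\n"
        f"  ?object a dbo:{range_class} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


query = slot_to_sparql("BodyOfWater", "location", "PopulatedPlace")
print(query)
```

Slots that rely on WordNet entries or other alternative forms have no ontology term to place in such a query, so they would need a different retrieval strategy.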

Table 4. Examples of expressions from News fragment 1

ID | Text from the video | Entity or type | Relation | Entity or type
4 | The Elk river, a water source for hundreds of thousands of people in West Virginia, and now contaminated with chemical used to clean coal. | http://dbpedia.org/resource/Elk_River_(West_Virginia) | WordNet: verb contaminatedBy | OntologyClass: ChemicalSubstance
5 | Health officials say some 4 to 6 people were admitted to the hospital with problems related to the contamination | WordNet: noun health; noun damage | OntologyProperty:CausedBy | OntologyClass: Event (Expression n. 4)

Table 5. Examples of expressions from information needs related to News fragment 1

ID | Text from participants’ results | Entity or type | Relation | Entity or type
28 | Text from Participant 11. Item 1: location information (where is west virginia, which river). submittedAnswerText: found this page, with a map and a link to the elk river | http://dbpedia.org/resource/West_Virginia | OntologyProperty:Map | Thing

Table 6. Final schema for News fragment 1 – Sub-topic: Industrial accidents and incidents

Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot)
1 | NaturalPlace/BodyOfWater | WordNet: verb provide | Thing (resource Water_resource)
2 | NaturalPlace/BodyOfWater | Location | PopulatedPlace
3 | Person | Residence | PopulatedPlace
4 | NaturalPlace/BodyOfWater | WordNet: verb contaminatedBy | ChemicalSubstance
5 | Thing (WordNet: noun health; noun damage) | CausedBy | Event (Slot n. 4)
6 | Thing (WordNet: adjective economic; noun impact) | CausedBy | Event (Slot n. 4)
7 | ChemicalSubstance | OwningOrganisation | Organisation/Company
8 | Organisation/Company | President | Person
9 | Person (Slot n. 3) | WordNet: verb harmedBy | Event (Slot n. 4)
10 | Person (Slot n. 9) | Number | xsd:integer
11 | ChemicalSubstance | Description | langString
12 | Organisation/GovernmentAgency | Location | PopulatedPlace
13 | Organisation/GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action)
14 | Public_company | President | Person
15 | Public_company | Location | PopulatedPlace

Table 7. Final schema for information needs related to News fragment 1 - Participants: P1, P4, P8, P11

Slot ID | Domain class | Property | Range class
16 | Organisation/Company | Product - Service | Thing
17 | Organisation/Company | OrganisationMember | Person
18 | ChemicalSubstance | Description | langString
19 | Organisation/Company | Description | langString
20 | Event (resource Chemical_accident) | PreviousEvent | Event (Slot n. 4)
21 | ChemicalSubstance | Name | langString
22 | Person (Slot n. 3) | WordNet: verb harmedBy | Event (Slot n. 4)
24 | Politician/Senator | Location | PopulatedPlace
25 | Politician/Senator | Activity | Activity (WordNet: adjective official; noun action)
26 | Organisation/Company | WordNet: verb involvedIn | LegalCase
27 | Organisation/Company | WordNet: verb involvedIn | Event (Slot n. 20)

Figure 1. Semantic graph representing News fragment 1
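A graph of this kind can be derived mechanically from the schema slots, treating classes as nodes and Properties as labelled edges. The sketch below uses a subset of the slots in Table 6. How Figure 1 was actually produced is not stated, so `build_graph` and `reachable` are hypothetical helpers, and the shortened edge label `contaminatedBy` stands in for the WordNet entry of slot 4.

```python
from collections import defaultdict

# (slot id, Domain class, Property, Range class), taken from Table 6.
slots = [
    (2, "NaturalPlace/BodyOfWater", "Location", "PopulatedPlace"),
    (3, "Person", "Residence", "PopulatedPlace"),
    (4, "NaturalPlace/BodyOfWater", "contaminatedBy", "ChemicalSubstance"),
    (7, "ChemicalSubstance", "OwningOrganisation", "Organisation/Company"),
    (8, "Organisation/Company", "President", "Person"),
]


def build_graph(slots):
    """Directed graph: Domain class -> list of (Property, Range class, slot id)."""
    graph = defaultdict(list)
    for slot_id, domain, prop, range_class in slots:
        graph[domain].append((prop, range_class, slot_id))
    return graph


def reachable(graph, start):
    """Classes that can be reached from `start` by following slots."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(range_class for _, range_class, _ in graph.get(node, []))
    return seen


graph = build_graph(slots)
print(sorted(reachable(graph, "NaturalPlace/BodyOfWater")))
```

Starting from the body of water, the traversal reaches the chemical substance, the company that owns it and the company's president, which shows how relations chain concepts together beyond single entities.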

5.3 Final results

For each news fragment, we analysed the relations between its content and the related information needs and we observed some trends within each needs set. The slot numbers refer to the schemas of related information needs, in Table 7 for News fragment 1 and in the Appendix for the other fragments.

News fragment 1

For 8 slots out of 13, the needs concern more detailed or background information about resources already mentioned in the news (Company, ChemicalSubstance, PopulatedPlace), represented by the relations Product, Service, Member, Description, Name, involvedIn and Map (slots 16, 17, 18, 19, 21, 26, 27, 28). Relevant focus is on the Company (6 slots out of 13). A new aspect is the interest for related Chemical accidents in the same area (relation previousEvent).

News fragment 2

Relevant focus is on the people involved in the event, with 8 slots out of 19 (slots 16, 17, 18, 19, 22, 23, 28, 29). Regarding people, two new concepts were identified in the needs set: survivorOf and Opinion (slots 16, 17, 19). Other new aspects were the interest for related events and their number (slots 15, 21, 30, 31), and for maps of the location and of the accident (slots 13, 20).

News fragment 3

Overall the additional information required is very similar to that provided in the fragment. A new aspect identified in the needs is the interest for events of the type Oil disaster (slot 23, 24), which is linked to the activity of the company mentioned in the news (State oil company), but not to the type of event (which was an Explosion in a Building). We also observed interest in the people involved in the event (slots 20, 21, 27, 28, 29).

News fragment 4

New aspects identified in the needs are the interest for updates on the event (slots 29, 30) and for general information on that type of events, in this case Earthquake (slot 31). Interest on the human factor is also relevant (slots 25, 27, 28).

News fragment 5

Overall the needs regard more detailed information about concepts already mentioned in the fragment. We can observe relevant interest for the industry the news deals with, Construction and Real estate (slots 21, 22, 23, 24), for the Company owning the building site where the event takes place (slots 25, 26, 30, 31) and for people affected by the event (slots 27, 28, 34, 35).

Finally, we observed some trends in relevant concepts across all information needs.

1. In all sets, we observed interest in people involved in the events and their conditions. These needs are represented by the relations involvedIn, harmedBy, survivorOf, witnessOf, passengerOf, killedBy, requirement, status, Name, opinion.

2. Detailed and visual geographical information (relation Map) is required for all fragments except 4, where a map was already shown in the video.

3. An interest for previous and/or related events was identified in 4 out of 5 needs sets: for NF1 slot 20; for NF2 slots 15, 21, 30, 31; for NF3 slots 23, 24; for NF5 slot 30.

6. DISCUSSION

6.1 User study

The type of task and the low level of constraints of the protocol proved to be an effective strategy. The participants were asked to perform an exploratory search task, from need identification (Step 2) to online search for resources matching their information needs (Step 3), and to provide motivations (Step 4). This rich and diverse information allowed us to expand the interpretation process and to better specify the needs for 14 participants out of 18 (Participants: 1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18). Furthermore, for 8 participants out of these 14, the additional data revealed novel information that was not explicit or clear in the items provided in the initial list (Participants 1, 5, 7, 8, 9, 10, 11, 15). However, the dynamicity of the whole process may have been reduced due to its structure organized by activity (watching video, writing down the needs etc.), since, in real-life situations, a user may choose to perform it by information need item instead. Since information needs evolve over time and during the exploration process, the structure of the tasks may affect their identification. During the sessions, it was observed that some participants slightly modified the information needs’ list, while performing the search; some realized that two or more of their identified information needs could be grouped together; some commented that the same web resource was answering more than one need and that answers could be grouped together.

Another issue concerns how news video fragments were proposed to participants. In order to support their interest in the content of the news, participants were given the possibility to choose the video fragment to watch. However, unexpectedly, many participants seemed to prefer 2 of the news fragments over the other 3. After session 11, 2 of the videos had reached 4 views each, while the other 3 had 1, 1 and none, respectively. This observation made us re-evaluate the strategy and, in order to collect information needs about all news fragments, we decided to modify the protocol in progress, removing a fragment from the list of choices once it reached 4 views.
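The modified selection rule can be sketched in a few lines. This is only an illustration of the rule described above, not the procedure actually used to run the sessions: the function name `available_fragments`, the constant name `MAX_VIEWS` and the fragment labels A to E are assumptions, since the text does not say which fragments reached 4 views. Only the cap of 4 views and the counts after session 11 come from the study.

```python
from collections import Counter

MAX_VIEWS = 4  # cap from the text: a fragment is removed once it reaches 4 views


def available_fragments(all_fragments, view_counts, max_views=MAX_VIEWS):
    """Return the fragments a participant may still choose from."""
    return [f for f in all_fragments if view_counts[f] < max_views]


# Hypothetical labels; view counts are those reported after session 11.
fragments = ["A", "B", "C", "D", "E"]
views = Counter({"A": 4, "B": 4, "C": 1, "D": 1, "E": 0})

print(available_fragments(fragments, views))  # ['C', 'D', 'E']
```

With such a rule, later participants can only choose among fragments that are still under-represented, at the cost of a smaller choice.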

6.2 Analysis method and final results

We were able to identify not only single entities and relations and lists of concepts, in both news fragments and related information needs, but also schemas of their content structure, which represent links between concepts. However, part of the information could not be mapped to the DBpedia Named Graph or Ontology and, in the final schemas, it was represented in alternative forms. As a result, some of the slots of the schemas are not directly usable by an automated system. Furthermore, since our investigation was limited to one news topic and five news fragments, the schemas do not cover all the requirements of a real-world application such as LinkedTV, which deals with numerous news topics.

Finally, the method to analyse news fragments and information needs was reviewed by three researchers but, due to limited resources, the full analysis and interpretation of results was performed by one. The potential bias resulting from the lack of cross-validation has to be considered in future work based on the results of this study. Similarly, due to time constraints, we could not perform an evaluation of the final schemas with other news fragments and users.

7. CONCLUSIONS

Our work contributed to the news enrichment scenario within the LinkedTV project.

We proposed a method to formally represent both the content of news video fragments from TV broadcasts and user requirements for additional information related to that content. This representation is based on mapping the information extracted from news and user needs to the same controlled vocabulary, which supports semantic linking between news, needs and additional resources to be retrieved from the web to satisfy those needs. We collected a list of user information needs related to news fragments of the topic Disasters and accidents. Applying the method to those news fragments and to the list of needs, we obtained a list of concepts, partially mapped to the DBpedia knowledge base and partially represented through WordNet lexical entities. We identified additional links between the concepts, looking at the content structure and constructing schemas at instance and class level. Finally, we identified some trends within information needs, by news fragment and across the whole set of needs.
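To make the structure of the schemas concrete, the following minimal sketch shows one possible in-memory representation of a slot and of the links between slots. It is an illustration under stated assumptions, not the representation used in the study: the class `Slot`, its field names and the helper `referenced_slots` are hypothetical. The three rows are slots 1 to 3 of News fragment 2 in the appendix.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Slot:
    """One schema row: a subject, a relation and an object.

    subject_ref / object_ref hold the number of another slot when the row
    says "(Slot n. X)"; these field names are illustrative assumptions.
    """
    number: int
    subject: str
    relation: str
    obj: str
    subject_ref: Optional[int] = None
    object_ref: Optional[int] = None


def referenced_slots(schema: Dict[int, Slot], slot: Slot) -> List[Slot]:
    """Return the slots that `slot` points to via its subject or object."""
    refs = (slot.subject_ref, slot.object_ref)
    return [schema[n] for n in refs if n is not None and n in schema]


# Slots 1-3 of News fragment 2, copied from the appendix.
rows = [
    Slot(1, "Person", "KilledBy", "Event", object_ref=3),
    Slot(2, "Person", "Number", "xsd:integer", subject_ref=1),
    Slot(3, "Train; Event (WordNet: noun accident)", "Location",
         "Country/PopulatedPlace"),
]
schema = {s.number: s for s in rows}

# Slot 2 ("Number") refers back to the Person of slot 1.
for linked in referenced_slots(schema, schema[2]):
    print(linked.number, linked.subject, linked.relation, linked.obj)
```

Following the references in this way shows how slots chain together: the number in slot 2 counts the persons of slot 1, who were killed by the event described in slot 3.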

These contributions are valuable to enhance the LinkedTV content management flow at three levels. Within the system backend, the schemas could be used to generate training datasets for algorithms or direct queries, to retrieve resources from the web, or to filter from a repository of annotated resources, in order to automatically populate the user interface. Moreover, such structured data can be mapped to other existing ontologies (beyond DBpedia), both at upper level (general ontologies) and at domain knowledge level (related to the news topic), to support linking with other knowledge bases. Finally, such schemas could serve as guidelines for human editors, who use the LinkedTV editing tool to manually publish content in the interface.
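As an illustration of the "direct queries" use, the sketch below turns one class-level slot into a SPARQL query string. It is a hedged example, not part of the LinkedTV system: the function `slot_to_sparql` is hypothetical, and it assumes that the subject, relation and object of the slot all correspond to terms in the DBpedia Ontology namespace under the names given (here `Train`, `owningOrganisation` and `Organisation`, taken from slot 24 of News fragment 2). Slots represented through WordNet entities would need a further mapping before they could be used this way. The code only builds the query text and does not contact any endpoint.

```python
DBO_PREFIX = "http://dbpedia.org/ontology/"


def slot_to_sparql(subject_class: str, relation: str, object_class: str,
                   limit: int = 10) -> str:
    """Build a SPARQL SELECT query string for one class-level slot.

    Assumes all three terms are local names in the DBpedia Ontology
    namespace; this is not checked here.
    """
    return (
        f"PREFIX dbo: <{DBO_PREFIX}>\n"
        "SELECT ?subject ?object WHERE {\n"
        f"  ?subject a dbo:{subject_class} ;\n"
        f"           dbo:{relation} ?object .\n"
        f"  ?object a dbo:{object_class} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Slot "Train OwningOrganisation Organisation" (property name assumed).
query = slot_to_sparql("Train", "owningOrganisation", "Organisation")
print(query)
```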

7.1 Future work

As for the results, since we were not able to represent part of the content through the concepts of the DBpedia Ontology, a further mapping to other ontologies is required for the schemas to be fully usable within a system. Validation of the results by other researchers is also needed, as well as an evaluation phase to test the capacity of the final schemas to predict information needs, given other news fragments belonging to the same topic.
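One simple way such an evaluation could be scored is sketched below. This is an assumption for illustration only: the study does not prescribe a metric, and the choice of precision and recall over sets of relations, the function name `precision_recall` and both example sets are hypothetical, not collected data.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], observed: Set[str]) -> Tuple[float, float]:
    """Compare relations predicted by a schema with relations users asked for."""
    hits = len(predicted & observed)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(observed) if observed else 0.0
    return precision, recall


# Hypothetical example: relations a schema predicts for a new fragment
# versus relations found in the needs expressed by new participants.
predicted = {"Map", "CausedBy", "PreviousEvent", "opinion"}
observed = {"Map", "CausedBy", "Number"}

precision, recall = precision_recall(predicted, observed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision indicates how many of the predicted relations were actually requested, while recall indicates how many of the requested relations the schema anticipated.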

Furthermore, a wider user study is needed to perform a similar analysis for other news topics. For such a study, a tool could be used to support efficient manual annotation of news fragments and information needs, allowing a group of researchers to perform the analysis cooperatively and to easily review the identified concepts and schemas. In a pilot phase, it would also be interesting to test how the structure and order of the tasks in the protocol may influence the final information needs. In particular, it would be useful to verify whether the richness and specificity of needs are affected by performing the exploratory search structured by information item instead of by activity. Finally, the strategy of offering participants a list of video fragments to choose from has to be reviewed, in order to balance supporting participants’ interest in the news content against the need to have a minimum number of participants selecting and watching each fragment.

ACKNOWLEDGEMENTS

This work was partially supported by the LinkedTV project, funded by the European Commission through the 7th Framework Programme (FP7-287911). We would like to thank: Lynda Hardman, NL; Lilia Perez Romero, NL; Michiel Hildebrand, NL; Frank Nack, NL; all the participants of the user study.



APPENDIX

Schemas of news fragments and related information needs

News fragment 2

Sub-topic: Transport accidents and incidents

Slot | Subject | Relation | Object
1 | Person | KilledBy | Event (Slot n. 3)
2 | Person (Slot n. 1) | Number | xsd:integer
3 | Train; Event (WordNet: noun accident) | Location | Country/PopulatedPlace
4 | Train; Event (WordNet: noun accident) | Location | Settlement/PopulatedPlace
5 | Person | WordNet: noun witnessOf | Event (Slot n. 3)
6 | Person | WordNet: verb harmedBy | Event (Slot n. 3)
7 | Event (Slot n. 3) | CausedBy | Thing or Event (WordNet: noun cause)
8 | Person | WordNet: noun passengerOf | Train
9 | Person (resource Rescuer) | Activity | Activity (resource Rescue)
10 | OfficeHolder | Activity | Activity (WordNet: adjective official; noun action)
11 | GovernmentAgency | Location | Country/PopulatedPlace
12 | Organization (Slot n. 11) | Activity | Activity (WordNet: adjective official; noun action)

Related information needs - Participants: P2, P6, P9, P10

Slot | Subject | Relation | Object
13 | Settlement/PopulatedPlace | Map | Thing
14 | Train | Type | Thing
15 | Train; Event (WordNet: noun accident) | Number | xsd:integer
16 | Person | WordNet: noun survivorOf | Event (Slot n. 3)
17 | Person (Slot n. 16) | WordNet: noun opinion | Thing
18 | Person | WordNet: noun witnessOf | Event (Slot n. 3)
19 | Person (Slot n. 18) | WordNet: noun opinion | Thing
20 | Event (Slot n. 4) | Map | Thing
21 | Train; Event (WordNet: noun accident) | PreviousEvent | Event (Slot n. 3)
22 | Person | WordNet: noun driver | Train
24 | Train | OwningOrganisation | Organisation
25 | Organisation (Slot n. 24) | Description | langString
26 | Event (Slot n. 3) | CausedBy | Thing or Event (WordNet: noun cause)
27 | Organization (Slot n. 11) | Activity | Activity (WordNet: adjective official; noun action)
28 | Person | WordNet: noun passengerOf | Train
29 | Person (Slot n. 28) | WordNet: noun opinion | Thing
30 | Train; Event (WordNet: noun accident) | PreviousEvent | Event (Slot n. 4)
31 | Event (Slot n. 30) | Number | xsd:integer

News fragment 3

Sub-topic: Fires and explosions

Slot | Subject | Relation | Object
1 | Event (WordNet: explosion) | Location | Settlement/PopulatedPlace
4 | Event (Slot n. 1) | Location | Building
5 | Building | OwningOrganisation | Organisation/Company
6 | Person (resource Rescuer) | Activity | Activity (resource Rescue)
7 | Person | WordNet: noun survivorOf | Event (Slot n. 1)
8 | GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action)
9 | President | Activity | Activity (WordNet: adjective official; noun action)
10 | President | WordNet: noun opinion | Thing
11 | Person | WordNet: verb harmedBy | Event (Slot n. 1)
12 | Person (Slot n. 11) | Number | xsd:integer
13 | Event (Slot n. 1) | CausedBy | Thing or Event (WordNet: noun cause)
14 | Event (Slot n. 1) | CausedBy | Gas_leak (resource)
15 | Person | InvolvedIn | Event (Slot n. 1)
16 | Person (Slot n. 15) | Number | xsd:integer

Related information needs – Participants: P7, P14, P15, P16

Slot | Subject | Relation | Object
17 | Event (Slot n. 1) | CausedBy | Thing or Event (WordNet: noun cause)
18 | Building | OwningOrganisation | Organisation/Company
19 | Event (Slot n. 1) | CausedBy | Thing (resource Gas_leak)
22 | Person (resource Rescuer) | Activity | Activity (resource Rescue)
23 | Event (WordNet: noun oil WordNet: noun disaster) | Location | Place
24 | Event (Slot n. 23) | Number | xsd:integer
25 | Building | Map | Thing
26 | Event (WordNet: explosion) | WordNet: noun radiance | float
27 | Person | WordNet: noun survivorOf | Event (Slot n. 1)
29 | Person | WordNet: harmedBy | Event (Slot n. 1)
30 | Event (Slot n. 29) | Number | xsd:integer

News fragment 4

Sub-topic: Natural disasters

Slot | Subject | Relation | Object
1 | NaturalEvent (resource Earthquake) | Location | Settlement/PopulatedPlace
2 | Settlement/PopulatedPlace | neighbourRegion | Country/PopulatedPlace
3 | Settlement/PopulatedPlace | WordNet: noun distance; adjective regional noun capital | String
5 | Settlement/PopulatedPlace | Type | Desert
6 | Event (Slot n. 1) | WordNet: noun epicenter | Place
7 | Settlement/PopulatedPlace | WordNet: noun distance; noun epicenter | Length
8 | GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action)
9 | Event (Slot n. 1) | WordNet: noun magnitude | xsd:string
14 | Building (WordNet: adjective residential) | WordNet: verb destroyedBy | Event (Slot n. 1)
15 | Building (Slot n. 14) | Number | xsd:integer
16 | MilitaryUnit | Activity | Activity (resource Rescue)
17 | GeopoliticalOrganisation | Activity | Activity (resource Rescue)
18 | GovernmentAgency | Location | PopulatedPlace (resource Pakistan)
19 | Organisation (Slot n. 18) | Activity | Activity (WordNet: adjective official; noun action)
20 | Person | WordNet: noun survivorOf | Event (Slot n. 1)
21 | Person (Slot n. 20) | requirement | Food
22 | Person (Slot n. 20) | requirement | Place (WordNet: noun Shelter)
23 | Person (Slot n. 20) | requirement | Thing (resource Water_resource)
24 | Person (Slot n. 20) | requirement | Activity (resource Health_care)

Related information needs – Participants: P17, P18

Slot | Subject | Relation | Object
25 | Person (Slot n. 20) | requirement | Place (WordNet: noun Shelter)
26 | Event (Slot n. 1) | WordNet: noun magnitude | xsd:string
27 | Person (Slot n. 20) | requirement | Thing (resource Water_resource)
28 | Organisation | Activity | Activity (resource Humanitarian_aid)
29 | Event (Slot n. 1) | Description | Thing
30 | Thing (Slot n. 29) | updated | xsd:date


News fragment 5

Sub-topic: Engineering failures

Slot | Subject | Relation | Object
1 | Event (resource Structural failure) | Location | Place (resource Building_site)
2 | Event (resource Structural failure) | Location | Country/PopulatedPlace
3 | Event (resource Structural failure) | Location | Settlement/PopulatedPlace
4 | Settlement/PopulatedPlace | neighbourRegion | Settlement/PopulatedPlace
5 | Place (resource Building_site) | buildingType | ShoppingMall/Building
6 | Person | killedBy | Event (Slot n. 1)
7 | Person (Slot n. 6) | number | xsd:integer
8 | Person (resource Construction_worker) | WordNet: verb trappedIn | Place (resource Building_site)
10 | Person (resource Construction_worker) | WordNet: verb harmedBy | Event (Slot n. 1)
12 | Person (resource Rescuer) | Activity | Activity (resource Rescue)
13 | Place (resource Building_site) | OwningOrganisation | Company
14 | Company | Activity | Activity (WordNet: mismanagement)
15 | Event (Slot n. 1) | CausedBy | Activity (Slot n. 14)
16 | GeopoliticalOrganisation | Location | Settlement/PopulatedPlace
17 | Organisation (Slot n. 16) | Activity | Activity (WordNet: adjective official; noun action)
18 | Activity (Slot n. 17) | PreviousEvent | Event (Slot n. 1)
19 | Activity (resource Construction) | Location | Country/PopulatedPlace
20 | Activity (Slot n. 19) | SystemOfLaw | SystemOfLaw

Related information needs – Participants: P3, P5, P12, P13

Slot | Subject | Relation | Object
21 | Activity (resource Real_estate_development) | Location | Country/PopulatedPlace
22 | Activity (Slot n. 21) | Description | langString
23 | Person (resource Construction_worker) | WordNet: noun status | String
24 | Activity (Slot n. 19) | SystemOfLaw | SystemOfLaw
25 | Company | Activity | Activity (WordNet: mismanagement)
26 | Company | WordNet: verb involvedIn | LegalCase
27 | Person | killedBy | Event (Slot n. 1)
30 | Company | WordNet: verb involvedIn | Event (Slot n. 1)
31 | Company (Slot n. 30) | Name | xsd:langString
32 | ShoppingMall/Building | WordNet: verb involvedIn | Place (resource Building_site)
33 | ShoppingMall/Building (Slot n. 32) | Name | xsd:langString
35 | Person (Slot n. 34) | Name | xsd:langString
36 | Event (Slot n. 1) | Date | xsd:date
37 | Event (Slot n. 1) | Description | Thing
