Enriching news for supporting users’ information needs using schema-driven classification of entities and relations

Academic year: 2021


Thesis Master

Information Studies - Human Centered Multimedia

University of Amsterdam - Faculty of Science

Title

Enriching news for supporting users’ information needs using schema-driven classification of entities and relations

Author: Viola Pinzi

Student number: 10652434

Supervisor: Lynda Hardman – Centrum Wiskunde & Informatica (CWI)

Second examiner: Frank Nack – University of Amsterdam

Enriching news for supporting users’ information needs using schema-driven classification of entities and relations

Viola Pinzi

MSc Information Studies – University of Amsterdam

viola.pinzi@student.uva.nl

ABSTRACT

The LinkedTV project News scenario aims at improving the experience of watching news on TV. It envisages that potential users of the system watch news broadcasts, express a need for additional information and that the system provides resources from the web that are potentially relevant to them.

Our goal was to investigate user information needs for a given news topic, based on news video fragments. Furthermore, we aimed at representing the news video fragment and related information needs in a form compatible with the system knowledge representation model.

Our contribution consisted of a method to formally represent fragments and requirements using a controlled vocabulary, which was applied to information needs collected through a user study. The analysis resulted in lists of concepts and schemas of the content structure. This contribution supports semantic linking between news, related information needs and additional resources to be retrieved from the web to satisfy those needs.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services – data sharing, web-based services.

General Terms

Human Factors, Languages.

Keywords

News enrichment, user information needs, news classification, semantic web, semantic annotation.

1. INTRODUCTION

When watching news broadcasts, viewers may have information needs triggered by the news item they are watching. The LinkedTV project 1 aims at connecting television mainstream content and web resources in a single experience (networked media), through an automatic system that generates links between the two. One of the scenarios used to illustrate the technology developed is the Interactive/enriched News scenario. The News scenario aims at improving the experience of watching news on TV. It envisages that a potential user of the system watches a news broadcast (consisting of news fragments) and expresses a need for additional information related to it [15]. The system aims at providing users with resources from the web that are potentially relevant to them, given a specific news item. User information needs are defined in terms of the value users attribute to different contents the application might propose to them, so in terms of user relevance and not as system relevance (algorithmic) or topicality (relevance to a topic) [2].

We identified two issues: first, the need to investigate which resources are relevant for users when watching news; second, the need for user requirements, once learned, to be made available to the system in a form that it can understand. While the user would express needs using natural language, in the form of free text, the system uses a machine-readable model to represent the content of web resources. This means that user requirements cannot be directly used by an automated system to generate queries and to retrieve web resources that satisfy those needs. They have to be coded and represented in a way that is compatible with the system knowledge representation model. We tackled both problems: on the one hand, we aimed at identifying which kind of information news followers would like to find in additional resources from the web, related to the content of specific news video fragments of TV broadcasts (user information needs and relevance of additional information); on the other, our purpose was to specify a method to represent user needs in a form that is usable for the system and to apply this method to the identified needs (user requirements representation).

1 www.linkedtv.eu

Section 2 provides a brief overview of previous work on information needs and models for semantic annotation, in the news field. Section 3 presents the research methods. Section 4 describes the user study we carried out to collect user information needs. Section 5 presents the analysis of news fragments and information needs and final results.

2. RELATED WORK

Several studies have been conducted on user information needs related to seed news content, and numerous models have been developed for news representation, classification and retrieval.

2.1 News enrichment and user information needs

To evaluate methods for investigating user requirements, we considered studies that match our context: news enrichment as final purpose; news video fragments as seed content; understanding related information needs as aim. Palumbo et al. [14] and Perez et al. [15] conducted research within analogous scenarios. The method used by Perez et al. in the first phase of their research was selected as reference model for our investigation of user requirements. They asked participants of a user study to express their needs, after watching news video fragments, in a free and open way, through unstructured text. This method was expected to emulate the real-life situation envisaged by the LinkedTV News scenario. The concept of annotation schema for video fragments, proposed by Palumbo et al., constituted a starting point to specify our analysis method of news fragments and needs, since it supports the representation of news contents in a structured way. However, these studies did not tackle the problem of how to make user requirements compatible with the knowledge representation model of an automated system, which was one of our aims.

2.2 Semantic Web technologies and news classification

To propose a solution for the user requirements representation and how they could be used to retrieve relevant additional resources, we considered approaches that apply Semantic Web technologies, on one hand, to link resources with related contents across different systems, and, on the other, to describe and classify news articles.

Demartini et al. [6] discuss methods to establish these semantic links among resources through the identification of entities in the resource text and the extraction of relations among entities, to be used for their enrichment. Many systems tackle this task by mapping entities and relations to concepts of specific knowledge bases (for example, DBpedia datasets 2) or, more ambitiously, to the whole Linked Open Data graph 3. This approach has also been increasingly applied in the field of media production and news. The British Broadcasting Corporation 4 (BBC) created a system to manage their knowledge base and to link resources across all their platforms and to external data through semantic annotation [11]. Their system uses DBpedia identifiers as a controlled vocabulary, mapped to their internal concepts categorization system (CIS). They effectively introduced information needs into the mix, with the idea of supporting users in their navigation across different information flows. However, their approach takes user needs into consideration mostly in terms of usability of the platforms. In our study, semantic annotation was selected as the reference method for representing not only resources, but also information needs, to identify links among concepts that are relevant for users.

The system described by Vargas-Vera et al. [16] recognizes events within the stories of news articles, using the KMi ontology to automatically instantiate pre-defined event templates. In the KMi ontology, forty types of events are defined as classes and each is represented through a template, which indicates properties and classes of objects. Although their analysis of news content structure and the resulting templates constitute a reference model for our method, they tackled the news classification problem from the point of view of event types and not in terms of categories of contents. Furthermore, in their model, each template has a central concept, the Event, which is the subject of all the properties. In our research, on the one hand, we look at news classification in terms of news topics, as macro-categories, and, on the other, we represent news stories as sets of linked concepts, not as event types corresponding to predefined templates. We believe that this approach may prove more sustainable and efficient for large-scale applications, since topic categorization is a higher level classification method in comparison with an event typology. Furthermore, through sets of linked concepts, it is possible to represent any news content structure, seen and unseen.

3. RESEARCH METHODS

To investigate user requirements for news enrichment, our approach was to design a user study, to prompt a group of potential users of the LinkedTV system to freely express their information needs, in an unstructured way, after watching news video fragments, and to introduce them to the idea of finding additional resources, online, to match and satisfy those needs. As for the representation of requirements and news fragments in a form compatible with the system knowledge model, we opted for applying Semantic Web technologies, using semantic annotation and schemas, to support the linking between fragments, needs and web resources, across different systems.

2 http://wiki.dbpedia.org/Datasets

3 http://linkeddata.org

4 www.bbc.co.uk/rd

3.1 User study

3.1.1 Exploratory search task

We designed a user study that emulated the expected scenario of a real-life situation: a person watches a news video fragment and then searches online resources to satisfy the resulting needs for additional information. The study protocol was based on an information seeking task and, in particular, on the concept of exploratory search behaviour. The characteristics of this type of search task match the main aspects of our scenario: an open-ended and dynamic multi-stage process that entails a certain level of uncertainty and that is associated with not well defined information needs [17]. Since the process was designed to elicit a real-life situation [4], it was formulated as a simulated work task [3], where the participant is provided with a general description of the situation (context, problem and objective) and with minimum constraints, to support free expression of needs and open search process.

3.1.2 News topic and news video fragments selection

A list of parameters was specified and applied to guide the selection of news video fragments proposed to the participants of the study.

To make the analysis of results feasible and to support the identification of content structure, the news video fragments watched by participants needed to have some level of consistency. We identified topic as the central dimension to investigate news classification, since it is one of the main parameters of news categorization systems and determines the way news programmes are organised. Furthermore, different news topics are represented by different domain knowledge, and domain knowledge influences the process of information representation, which is another core matter of this research. Topic is the consistency aspect that led the selection of news fragments for the study. Because of feasibility constraints, we also decided to investigate only one topic, chosen considering the related study by Palumbo et al. [14], which focused on environmental news. Environmental news can be considered mostly as ‘slow developing news’, while TV news programmes consist mainly of ‘fast breaking news’ [12]. Therefore, we opted for a topic connected with environmental news, but whose nature is mainly ‘fast breaking’: Disasters and accidents. Furthermore, since both human and natural factors are involved in stories belonging to this topic, we expected the final findings to be more generalizable with respect to the scope of the problem. DBpedia’s subject categories were used as reference model, to identify the topic and its label and to further specify five sub-topics: Industrial accidents and incidents; Transport accidents and incidents; Fires and explosions; Engineering failures; Natural disasters.

We also defined a boundary with respect to the events’ causes. While natural disaster causes are not intentional, man-made disasters and accidents include events whose causes may be both intentional (such as War) and unintentional (such as Structural failures). For consistency, we avoided events whose main cause was intentional or possibly cross-category. For example, within the sub-topic Explosions, we considered news about Gas explosions and we ruled out the ones about Bombings.

Finally, there was the need to ensure clarity and intelligibility of news video fragments (content-wise and at technical level) and alignment in terms of format and length.

3.2 Analysis and representation of news fragments and information needs

We specified a method for news and information needs representation, along three dimensions: to make the needs available for an automated system; to relate them with the content of news fragments; to identify the relations between concepts that are relevant for users.

Within the Semantic Web framework, an automated system can represent information in terms of entities, entity types and relations and assign to them semantic descriptions (concepts). Instances, classes and properties are the three types of concepts providing these descriptions [1] and the whole process is defined as semantic annotation [10]. Concepts are also assigned unique identifiers, which makes it possible to link different instantiations to the same description. This linking is a fundamental aspect of the Semantic Web [9] and the main reason we applied these technologies for our analysis. It provides the possibility for each piece of information to be semantically represented in a standard way, which supports linking and retrieval of resources that contain the same entities and/or the same relations between entities. The opportunity of linking resources, not only through single entities but also through the relations among them, is one of the central aspects of our approach, since it was expected to result in a richer interpretation of user needs, which might enhance the level of relevance of potential related resources retrieved from the web.

Our analysis method consists of the following processes: extraction of entities; identification of relations between entities; mapping of entities and relations to concepts of a controlled vocabulary (resources and properties); identification of entity types (classes); representation of the content structure using the mapped concepts (schema). We refer to the model proposed by Grishman [8], for the information extraction steps, and to Exner et al. [7], for the mapping to the concepts of a controlled vocabulary. The method was expected to reveal which relations between entities (or groups of relations) are relevant for users, given specific news fragments and related information needs, and to provide a representation of these relations compatible with the knowledge representation model the system uses.

3.2.1 Information extraction and concept mapping

The first step of the analysis consists of identifying entities and relations, within each news fragment and set of information needs, and verifying co-references, to group recurring ones under the same label. Furthermore, when entities are qualified by adverbs, adjectives, relative clauses, implicit or explicit copulative expressions, relations with other entities etc., inference is applied to select the entity to retain and its label. To map entities and relations to concepts, DBpedia was chosen as the reference knowledge base. DBpedia consists of large and structured datasets [13], which supports, on the one hand, a good level of mapping of factual information and, on the other, the possibility to link entities and relations to the concepts of its Ontology 5. Since DBpedia is already widely used as a hub to link other knowledge bases (such as the LOD cloud 6 and BBC’s platforms), an annotation schema mapped to its ontology was expected to facilitate retrieval and classification of linked resources from the web.

Entities are mapped to Resources of the DBpedia Named Graph, and entity types and relations to Classes and Properties of the Ontology. Resources are explored by related keywords, through the DBpedia Lookup Service 7, and coupled with entities based on their Descriptions. Properties and Classes are identified by manually exploring the DBpedia schema 8, analysing Range and Domain, for Properties, and Description and position in the hierarchy, for Classes. Co-references are analysed across news fragments and information needs so that they map to the same concept.

5 http://wiki.dbpedia.org/Ontology

6 http://wiki.dbpedia.org/Interlinking

3.2.2 Content structure and representation

To look at the structure of the content, in order to identify the links between concepts that are relevant for users, the whole corpus of factual information of news fragments and information needs is represented through expressions subject-predicate-object, at instance level. From this low-level representation, a higher level schema is constructed, where each slot is defined by a Property and specifies the Classes of resources for both Range and Domain. The aim is to support retrieval of resources that contain both same entities and relations and related entities belonging to the same classes.
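The step from instance-level expressions to schema slots can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the procedure used in the study (where the mapping was done manually): the names `Triple`, `Slot`, `ENTITY_CLASS` and `lift_to_schema` are hypothetical, and the class assignments and the extra river instance are chosen for the example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Instance-level expression: subject-predicate-object."""
    subject: str
    predicate: str
    obj: str


class Slot(NamedTuple):
    """Schema-level slot: a Property with its Domain and Range classes."""
    domain: str
    prop: str
    range_: str


# Hypothetical class assignments, written in the class notation used in
# this paper. Kanawha_River is an extra instance that does not come from
# the study; it only shows that duplicate slots collapse into one.
ENTITY_CLASS = {
    "Elk_River_(West_Virginia)": "NaturalPlace/BodyOfWater",
    "Kanawha_River": "NaturalPlace/BodyOfWater",
    "West_Virginia": "PopulatedPlace",
}


def lift_to_schema(triples, entity_class, default="Thing"):
    """Replace each instance by its class and drop duplicate slots."""
    slots = []
    for t in triples:
        slot = Slot(
            entity_class.get(t.subject, default),
            t.predicate,
            entity_class.get(t.obj, default),
        )
        if slot not in slots:  # keep the first occurrence, preserve order
            slots.append(slot)
    return slots


triples = [
    Triple("Elk_River_(West_Virginia)", "Location", "West_Virginia"),
    Triple("Kanawha_River", "Location", "West_Virginia"),
    Triple("West_Virginia", "Map", "some_map_page"),  # object has no class
]
schema = lift_to_schema(triples, ENTITY_CLASS)
for slot in schema:
    print(slot)
```

Two instance-level expressions with the same Property and the same classes yield a single slot, which is what allows a schema to match resources about related entities of the same classes and not only about the same entities.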

4. USER STUDY

Before the main experiment, a pilot study was carried out with two participants, to test the protocol, efficiency of tasks and clarity of instructions.

4.1 Protocol description

Participants were fluent in English and were interested in following the news. It was expected that they could relate to the scenario and that they possessed the minimum competences to perform the tasks and to understand the news video fragments. The sessions were conducted in a semi-controlled environment (meeting-style setting) and participants used a laptop provided by the researcher to perform the tasks, through an online template.

The protocol is summarized below.

• Introduction – The researcher provided a brief explanation and prompted the participant to read the text explaining the Situation.

• Step 1. Selecting and watching seed news – The participant was asked to select and watch one news video from a list of five, while also thinking about the additional information they would like to have, related to the content of the news.

• Step 2. Expressing information needs on news topic – The participant was asked to express information needs in writing, as a list, ordered from high to low interest.

• Step 3. Online search – The participant was prompted to search online for the items of information in the list and to choose one resource for each item. They were asked to use Firefox with the HCI Browser 9 plug-in [5], which made it possible to structure the search process by item and to record transaction logs.

• Step 4. Information needs’ motivations – Finally, the participant was asked to provide motivations for the items in the list.

To support open and free expression of needs, the constraints were minimal: no time limit; no fixed number of information needs to express or to search for; no limitations about the online search process (type of sources, formats etc.). Furthermore, even if not explicitly stated in the instructions, the participants were allowed to stop at any time, to skip information items, during the online search, and to end a search task without selecting any resource.

7 http://lookup.dbpedia.org/api/search.asmx

8 http://mappings.dbpedia.org/index.php

9 http://ils.unc.edu/hcibrowser

4.2 Participants and data collected

The user study was conducted in June 2014, with 18 participants, 11 females and 7 males, recruited mainly at a university. The mean age was 29, with a standard deviation of 9; 14 participants were in the age range 22-29 and 4 over 30. As for the education level, 1 participant indicated Secondary school, 3 Bachelor degree, 13 Master’s degree and 1 Other. 8 participants were students, 4 indicated a profession and 6 did not provide specific indications.

News fragments 1, 2, 3 and 5 were each watched by 4 participants, and news fragment 4 by 2.

The list of collected information needs constitutes a first level answer to the problem. Participants expressed between 2 and 5 items each, for a total of 58 items. By news fragment, 12 items were collected for video 1, 16 for 2, 11 for 3, 7 for 4, 12 for 5.

5. ANALYSIS AND RESULTS

5.1 Identification of concepts

We analysed each news fragment. Three types of data were used: speech transcripts; Title and Description of the video provided within the webpage hosting it; screen captions (headlines overlapping the video frames). Speech transcripts were extracted manually or, when available, using the speech recognition tool built into the online player. This raw text was segmented into sentences, and sentences expressing related information were grouped in blocks. Blocks containing factual information about the news story and identifiable entities were selected for further analysis (opinions, direct speech, imperative sentences and metadata of the news fragment were ruled out). Subsequently, we analysed the related information needs. The interpretation of collected needs (Step 2 of the user study) was also expanded and validated using data extracted from Step 3 (search queries, visited pages, selected resources and text answers) and Step 4 (motivations for needs).

Entities and relations were identified and mapped to DBpedia concepts, analysing co-references.

When it was not possible to map to DBpedia, an alternative representation was found. Unmapped entities were represented in four ways: their original form in the text; lexical entities from WordNet10; a Class assigned by us; using a Property and an expression subject/predicate/object. If an entity was mapped to a resource but its Class was not defined, a Class was assigned. If no matching Class was found, an upper level Class was assigned and the resource or lexical entity was retained in the representation to specify it. Finally, when a relation was not defined in DBpedia ontology, a lexical entity from WordNet was used. When a relation was defined but Range and/or Domain of the mapped Property did not match our context, the relation was represented either with a lexical entity or indicating the mismatch with DBpedia definition.
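The fallback rules for Classes and relations described above can be sketched in Python as follows. This is a hedged illustration, not the authors' tooling (the annotation was carried out manually): the function names `annotate_class` and `annotate_relation` are hypothetical, the label format follows the notation of the schema tables in this paper, and the order in which the alternatives are tried is an assumption.

```python
def annotate_class(dbpedia_class=None, assigned_class=None,
                   resource=None, lexical=None, upper_class="Thing"):
    """Return a class label for an entity.

    Assumed order: DBpedia Class, then a Class assigned by the
    annotator, then an upper level Class that retains the resource
    or the WordNet lexical entity to specify it.
    """
    if dbpedia_class:
        return dbpedia_class
    if assigned_class:
        return assigned_class
    if resource:
        return f"{upper_class} (resource {resource})"
    if lexical:
        return f"{upper_class} (WordNet: {lexical})"
    raise ValueError("an unmapped entity needs a resource or a lexical entity")


def annotate_relation(dbpedia_property=None, lexical=None):
    """Return a DBpedia Property, or a WordNet lexical entity when the
    relation is not defined in the DBpedia ontology."""
    if dbpedia_property:
        return dbpedia_property
    if lexical:
        return f"WordNet: {lexical}"
    raise ValueError("a relation needs a Property or a lexical entity")


print(annotate_class(dbpedia_class="ChemicalSubstance"))
print(annotate_class(resource="Water_resource"))
print(annotate_class(lexical="noun health; noun damage"))
print(annotate_relation(lexical="verb contaminatedBy"))
```

The sketch leaves out the case of a Property whose Range or Domain does not match the context, which the text handles either with a lexical entity or by flagging the mismatch.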

Tables 1 and 2 provide statistics for the concepts identified, by type, within news fragments and information needs respectively, indicating how many were mapped to DBpedia and how many were represented with an alternative form. Table 3 shows the overall results. Each concept is counted only once, across both news fragments and information needs. Thus, the figures in Table 2 refer to new concepts that were not mentioned in news fragments.

10 http://wordnetweb.princeton.edu/perl/webwn

Table 1. Concepts identified within news fragments

| Concept type | DBpedia | Alternative | Total |
|---|---|---|---|
| Entity – Resource | 37 | 25 | 62 |
| Type – Class | 19 | 19 | 38 |
| Relation – Property | 15 | 14 | 29 |

Table 2. Concepts identified within information needs

| Concept type | DBpedia | Alternative | Total |
|---|---|---|---|
| Entity – Resource | 7 | 4 | 11 |
| Type – Class | 3 | 4 | 7 |
| Relation – Property | 6 | 3 | 9 |

Table 3. Concepts identified across news and needs

| Concept type | DBpedia | Alternative | Total |
|---|---|---|---|
| Entity – Resource | 44 | 29 | 73 |
| Type – Class | 22 | 23 | 45 |
| Relation – Property | 21 | 17 | 38 |
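As a reading aid for Tables 1 to 3, the short Python sketch below recomputes the combined counts from the two partial tables and derives the share of concepts mapped to DBpedia. The counts are copied from the tables; the percentages are derived here and are not reported in the original text, and the variable names are illustrative.

```python
# (DBpedia, Alternative) counts per concept type, copied from Tables 1 and 2.
news = {
    "Entity - Resource": (37, 25),
    "Type - Class": (19, 19),
    "Relation - Property": (15, 14),
}
needs = {
    "Entity - Resource": (7, 4),
    "Type - Class": (3, 4),
    "Relation - Property": (6, 3),
}

# Table 3 is the row-wise sum, because each concept is counted only once.
combined = {
    kind: (news[kind][0] + needs[kind][0], news[kind][1] + needs[kind][1])
    for kind in news
}

# Share of concepts mapped to DBpedia, in percent (derived figure).
share = {
    kind: 100 * dbp / (dbp + alt) for kind, (dbp, alt) in combined.items()
}

total_dbp = sum(dbp for dbp, _ in combined.values())
total_all = sum(dbp + alt for dbp, alt in combined.values())

for kind, (dbp, alt) in combined.items():
    print(f"{kind}: {dbp} + {alt} = {dbp + alt} ({share[kind]:.1f}% DBpedia)")
print(f"All concepts: {total_dbp} of {total_all} mapped to DBpedia")
```

Under these counts, roughly 60% of entities, 49% of classes and 55% of relations were mapped to DBpedia.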

Regarding entities and types, the mapping was very efficient for places, and, within the DBpedia Named Graph, a specific Class was assigned for geographical resources (Country, Region, Settlement). As for entities of other types, even when they were mapped to DBpedia resources, often no Class or only an upper level Class was assigned (Thing or Agent). As for relations, static or slow changing attributes of persons, organisations and places were easily mapped (Name, Description, Map etc.), while the majority of unmapped relations concerned types of actions. In some cases, we represented actions with the Property Activity, which is meant to describe a general activity and not to express the dynamicity or the uniqueness of a specific action in time (for example, in the case Organisation/Activity/Official action).

5.2 Schemas of news fragments and information needs

The content structure of news fragments and information needs was represented through subject-predicate-object expressions, at instance level (Tables 4 and 5). In some cases, the subject or the object was represented by indicating an upper level Class and another expression within brackets (Table 4, line 5). This annotation means that the Event causing the damage is the same Event described by Expression 4.

From this low-level representation, the higher level schemas were constructed. The information needs schemas constitute a second level of answer to our problem: they represent the additional information relevant for users, in a form compatible with the system knowledge representation, and they show the relevant relations between classes of entities. Tables 6 and 7 show the results for News fragment 1 and for its information needs, respectively. The graph in Figure 1 shows how the concepts of News fragment 1 (Table 6) are connected to each other, across the slots of the schema. The other schemas are available in the Appendix.

Table 4. Examples of expressions from News fragment 1

| ID | Text from the video | Entity or type | Relation | Entity or type |
|---|---|---|---|---|
| 4 | The Elk river, a water source for hundreds of thousands of people in West Virginia, and now contaminated with chemical used to clean coal. | http://dbpedia.org/resource/Elk_River_(West_Virginia) | WordNet: verb contaminatedBy | OntologyClass: ChemicalSubstance |
| 5 | Health officials say some 4 to 6 people were admitted to the hospital with problems related to the contamination | WordNet: noun health; noun damage | OntologyProperty:CausedBy | OntologyClass: Event (Expression n. 4) |

Table 5. Examples of expressions from information needs related to News fragment 1

| ID | Text from participants’ results | Entity or type | Relation | Entity or type |
|---|---|---|---|---|
| 28 | Text from Participant 11. Item 1: location information (where is west virginia, which river). submittedAnswerText: found this page, with a map and a link to the elk river | http://dbpedia.org/resource/West_Virginia | OntologyProperty:Map | Thing |

Table 6. Final schema for News fragment 1 – Sub-topic: Industrial accidents and incidents

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 1 | NaturalPlace/BodyOfWater | WordNet: verb provide | Thing (resource Water_resource) |
| 2 | NaturalPlace/BodyOfWater | Location | PopulatedPlace |
| 3 | Person | Residence | PopulatedPlace |
| 4 | NaturalPlace/BodyOfWater | WordNet: verb contaminatedBy | ChemicalSubstance |
| 5 | Thing (WordNet: noun health; noun damage) | CausedBy | Event (Slot n.4) |
| 6 | Thing (WordNet: adjective economic; noun impact) | CausedBy | Event (Slot n.4) |
| 7 | ChemicalSubstance | OwningOrganisation | Organisation/Company |
| 8 | Organisation/Company | President | Person |
| 9 | Person (Slot n.3) | WordNet: verb harmedBy | Event (Slot n.4) |
| 10 | Person (Slot n.9) | Number | xsd:integer |
| 11 | ChemicalSubstance | Description | langString |
| 12 | Organisation/GovernmentAgency | Location | PopulatedPlace |
| 13 | Organisation/GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action) |
| 14 | Public_company | President | Person |
| 15 | Public_company | Location | PopulatedPlace |

Table 7. Final schema for information needs related to News fragment 1 - Participants: P1, P4, P8, P11

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 16 | Organisation/Company | Product - Service | Thing |
| 17 | Organisation/Company | OrganisationMember | Person |
| 18 | ChemicalSubstance | Description | langString |
| 19 | Organisation/Company | Description | langString |
| 20 | Event (resource Chemical_accident) | PreviousEvent | Event (Slot n.4) |
| 21 | ChemicalSubstance | Name | langString |
| 22 | Person (Slot n.3) | WordNet: verb harmedBy | Event (Slot n.4) |
| 23 | Person (Slot n.9) | Number | xsd:integer |
| 24 | Politician/Senator | Location | PopulatedPlace |
| 25 | Politician/Senator | Activity | Activity (WordNet: adjective official; noun action) |
| 26 | Organisation/Company | WordNet: verb involvedIn | LegalCase |
| 27 | Organisation/Company | WordNet: verb involvedIn | Event (Slot n.20) |


Figure 1. Semantic graph representing News fragment 1
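Since the figure itself is not reproduced here, the idea of connecting schema slots into a graph can be sketched in Python. This is a minimal illustration that uses a subset of the slots in Table 6 (slots 2, 3, 4, 7 and 8) and leaves out slot cross-references such as Event (Slot n.4). The names `SLOTS`, `build_graph` and `reachable` are hypothetical, and the sketch is not a reconstruction of Figure 1.

```python
from collections import defaultdict, deque

# (domain class, property, range class), taken from Table 6.
SLOTS = [
    ("NaturalPlace/BodyOfWater", "Location", "PopulatedPlace"),           # slot 2
    ("Person", "Residence", "PopulatedPlace"),                            # slot 3
    ("NaturalPlace/BodyOfWater", "contaminatedBy", "ChemicalSubstance"),  # slot 4
    ("ChemicalSubstance", "OwningOrganisation", "Organisation/Company"),  # slot 7
    ("Organisation/Company", "President", "Person"),                      # slot 8
]


def build_graph(slots):
    """Directed graph: each class maps to its (property, range class) edges."""
    graph = defaultdict(list)
    for domain, prop, range_ in slots:
        graph[domain].append((prop, range_))
    return graph


def reachable(graph, start):
    """Breadth-first search: classes linked to `start` by a chain of slots."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for _prop, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


graph = build_graph(SLOTS)
linked = reachable(graph, "NaturalPlace/BodyOfWater")
print(linked)
```

Starting from the contaminated body of water, the traversal reaches the chemical, the company that owns it and the company's president, which mirrors the kind of chain a user might follow when asking for background information.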

5.3 Final results

For each news fragment, we analysed the relations between its content and the related information needs and we observed some trends within each needs set. The slot numbers refer to the schemas of related information needs, in Table 7 for News fragment 1 and in the Appendix for the other fragments.

News fragment 1

For 8 slots out of 13, the needs concern more detailed or background information about resources already mentioned in the news (Company, ChemicalSubstance, PopulatedPlace), represented by the relations Product, Service, Member, Description, Name, involvedIn and Map (slots 16, 17, 18, 19, 21, 26, 27, 28). Relevant focus is on the Company (6 slots out of 13). A new aspect is the interest for related Chemical accidents in the same area (relation previousEvent).

News fragment 2

Relevant focus is on the people involved in the event, with 8 slots out of 19 (slots 16, 17, 18, 19, 22, 23, 28, 29). Regarding people, two new concepts were identified in the needs set: survivorOf and Opinion (slots 16, 17, 19). Other new aspects were the interest for related events and their number (slots 15, 21, 30, 31), and for maps of the location and of the accident (slots 13, 20).

News fragment 3

Overall, the additional information required is very similar to that provided in the fragment. A new aspect identified in the needs is the interest for events of the type Oil disaster (slots 23, 24), which is linked to the activity of the company mentioned in the news (State oil company), but not to the type of event (which was an Explosion in a Building). We also observed interest in the people involved in the event (slots 20, 21, 27, 28, 29).

News fragment 4

New aspects identified in the needs are the interest for updates on the event (slots 29, 30) and for general information on that type of event, in this case Earthquake (slot 31). Interest in the human factor is also relevant (slots 25, 27, 28).

News fragment 5

Overall, the needs regard more detailed information about concepts already mentioned in the fragment. We can observe relevant interest for the industry the news deals with, Construction and Real estate (slots 21, 22, 23, 24), for the Company owning the building site where the event takes place (slots 25, 26, 30, 31) and for people affected by the event (slots 27, 28, 34, 35).

Finally, we observed some trends in relevant concepts across all information needs.

1. In all sets, we observed interest in the people involved in the events and their conditions. These needs are represented by the relations involvedIn, harmedBy, survivorOf, witnessOf, passengerOf, killedBy, requirement, status, Name, opinion.

2. Detailed and visual geographical information (relation Map) is required for all fragments except fragment 4, where a map was already shown in the video.

3. An interest in previous and/or related events was identified in 4 out of 5 needs sets: for NF1 slot 20; for NF2 slots 15, 21, 30, 31; for NF3 slots 23, 24; for NF5 slot 30.

6. DISCUSSION

6.1 User study

The type of task and the low level of constraints in the protocol proved to be an effective strategy. The participants were asked to perform an exploratory search task, from need identification (Step 2) to online search for resources matching their information needs (Step 3), and to provide motivations (Step 4). This rich and diverse information allowed us to expand the interpretation process and to better specify the needs for 14 participants out of 18 (Participants 1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 17, 18). Furthermore, for 8 of these 14 participants, the additional data revealed novel information that was not explicit or clear in the items provided in the initial list (Participants 1, 5, 7, 8, 9, 10, 11, 15). However, the dynamics of the whole process may have been constrained by its organization by activity (watching the video, writing down the needs, etc.), since in real-life situations a user may choose to proceed by information need item instead. Since information needs evolve over time and during the exploration process, the task structure may affect their identification. During the sessions, we observed that some participants slightly modified their list of information needs while performing the search; some realized that two or more of their identified information needs could be grouped together; and some commented that the same web resource answered more than one need and that answers could be grouped together.

Another issue regards how news video fragments were proposed to participants. In order to support their interest in the content of the news, participants were given the possibility to choose the video fragment to watch. However, unexpectedly many participants seemed to prefer 2 of the news fragments over the other 3. After session 11, 2 of the videos had reached the count of 4 views, while the other 3 had respectively 1, 1 and none. This observation made us re-evaluate the strategy, and, in order to be able to collect information needs about all news fragments, we decided to modify the protocol in progress, removing a fragment once it reached 4 views.
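The revised selection rule can be illustrated with a minimal sketch. The fragment labels A to E and the function name are hypothetical, and the view counts only mirror the situation reported after session 11 (two fragments at 4 views, the others at 1, 1 and 0); the mapping of counts to actual fragments is not stated here.

```python
from collections import Counter

VIEW_CAP = 4  # from the text: a fragment is removed once it reaches 4 views


def available_fragments(all_fragments, sessions_so_far, cap=VIEW_CAP):
    """Return the fragments that can still be offered to the next participant."""
    views = Counter(sessions_so_far)
    return [f for f in all_fragments if views[f] < cap]


# Hypothetical labels; one entry in `watched` per recorded view.
fragments = ["A", "B", "C", "D", "E"]
watched = ["A"] * 4 + ["B"] * 4 + ["C", "D"]

print(available_fragments(fragments, watched))  # ['C', 'D', 'E']
```

A cap of this kind trades some freedom of choice for coverage: participants still choose, but only among fragments that have not yet been watched 4 times.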

6.2 Analysis method and final results

We were able to identify not only single entities, relations and lists of concepts, in both news fragments and related information needs, but also schemas of their content structure, which represent links between concepts. However, part of the information could not be mapped to the DBpedia Named Graph or Ontology and was represented with alternative forms in the final schemas. As a consequence, some slots of the schemas are not directly usable by an automated system. Furthermore, since our investigation was limited to one news topic and five news fragments, the schemas do not cover the full requirements of a real-world application such as LinkedTV, which deals with numerous news topics.
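As a rough illustration of this limitation, the sketch below separates slots that rely on a WordNet lexical entity from those that do not. The four slots are taken from the schema of information needs for News fragment 2 in the Appendix; the `Slot` type, the function name and the string test on "WordNet" are assumptions made for this example, not part of the analysis method.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    slot_id: int
    domain: str
    prop: str
    range_: str


def uses_lexical_fallback(slot: Slot) -> bool:
    """Assumed heuristic: a slot is not directly usable if any cell cites WordNet."""
    return any("WordNet" in cell for cell in (slot.domain, slot.prop, slot.range_))


slots = [
    Slot(13, "Settlement/PopulatedPlace", "Map", "Thing"),
    Slot(14, "Train", "Type", "Thing"),
    Slot(16, "Person", "WordNet: noun survivorOf", "Event (Slot n. 3)"),
    Slot(26, "Event (Slot n. 3)", "CausedBy", "Thing or Event (WordNet: noun cause)"),
]

mapped = [s.slot_id for s in slots if not uses_lexical_fallback(s)]
fallback = [s.slot_id for s in slots if uses_lexical_fallback(s)]

print(mapped)    # [13, 14]
print(fallback)  # [16, 26]
```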

Finally, the method to analyse news fragments and information needs was reviewed by three researchers but, due to limited resources, the full analysis and interpretation of results was performed by one. The potential bias resulting from the lack of cross-validation has to be considered in future work based on the results of this study. Similarly, due to time constraints, we could not evaluate the final schemas with other news fragments and users.

7. CONCLUSIONS

Our work contributed to the news enrichment scenario within the LinkedTV project.

We proposed a method to formally represent both the content of news video fragments from TV broadcasts and user requirements for additional information related to that content. This representation is based on mapping the information extracted from news and user needs to the same controlled vocabulary, which supports semantic linking between news, needs and additional resources to be retrieved from the web to satisfy those needs. We collected a list of user information needs related to news fragments of the topic Disasters and accidents. Applying the method to those news fragments and to the list of needs, we obtained a list of concepts, partially mapped to the DBpedia knowledge base and partially represented through WordNet lexical entities. We identified additional links between the concepts, looking at the content structure and constructing schemas at instance and class level. Finally, we identified some trends within information needs, by news fragment and across the whole set of needs.

These contributions are valuable to enhance the LinkedTV content management flow at three levels. Within the system backend, the schemas could be used to generate training datasets for algorithms or direct queries, either to retrieve resources from the web or to filter a repository of annotated resources, in order to populate the user interface automatically. Moreover, such structured data can be mapped to other existing ontologies (beyond DBpedia), both at upper level (general ontologies) and at domain knowledge level (related to the news topic), to support linking with other knowledge bases. Finally, such schemas could serve as guidelines for human editors, who use the LinkedTV editing tool to manually publish content in the interface.
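The idea of deriving direct queries from a schema slot can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the slot (Building, OwningOrganisation, Organisation) comes from the schema of News fragment 3, and whether the corresponding class and property names are available under the `dbo:` namespace in a given DBpedia release is assumed rather than verified here.

```python
DBO = "http://dbpedia.org/ontology/"  # DBpedia Ontology namespace


def slot_to_sparql(domain_class, prop, range_class, limit=10):
    """Build a SPARQL SELECT query string from one (domain, property, range) slot.

    Assumes all three names are DBpedia Ontology terms; slots represented
    through WordNet lexical entities cannot be translated this way.
    """
    return (
        f"PREFIX dbo: <{DBO}>\n"
        "SELECT ?subject ?object WHERE {\n"
        f"  ?subject a dbo:{domain_class} ;\n"
        f"           dbo:{prop} ?object .\n"
        f"  ?object a dbo:{range_class} .\n"
        "}\n"
        f"LIMIT {limit}"
    )


query = slot_to_sparql("Building", "owningOrganisation", "Organisation")
print(query)
```

The resulting string could then be sent to a SPARQL endpoint by whichever client the system backend uses; that step is left out because it depends on the deployment.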

7.1 Future work

As for the results, since we were not able to represent part of the content through the concepts of DBpedia Ontology, a further mapping to other ontologies is required, in order for the schemas to be fully usable within a system. Validation of the results by other researchers is also needed, as well as an evaluation phase to test the capacity of the final schemas to predict information needs, given other news fragments belonging to the same topic.

Furthermore, a wider user study is needed to perform a similar analysis for other news topics. For such a study, a tool could be used to support efficient manual annotation of news fragments and information needs, allowing a group of researchers to perform the analysis cooperatively and to easily review the identified concepts and schemas. In a pilot phase, it would also be interesting to test how the structure and order of the tasks in the protocol may influence the final information needs. In particular, it would be useful to verify whether the richness and specificity of needs are affected by structuring the exploratory search by information item rather than by activity. Finally, the strategy of offering participants a list of video fragments to choose from has to be reviewed, in order to balance support for participants’ interest in the news content against the need to have a minimum number of participants selecting and watching each fragment.

ACKNOWLEDGEMENTS

This work was partially supported by the LinkedTV project, funded by the European Commission through the 7th Framework Programme (FP7-287911). We would like to thank: Lynda Hardman, NL; Lilia Perez Romero, NL; Michiel Hildebrand, NL; Frank Nack, NL; and all the participants of the user study.

REFERENCES

[1] Antoniou, G., Groth, P., van Harmelen, F., and Hoekstra, R. 2012. Semantic Web Primer. Third Edition. MIT Press.

[2] Borlund, P. 2003. The concept of relevance in IR. J. Am. Soc. Inf. Sci. Technol. 54, 10 (August 2003), 913-925. DOI=http://dx.doi.org/10.1002/asi.10286

[3] Borlund, P. 2003. The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. In Information Research, 8, 3. http://informationr.net/ir/8-3/paper152.html

[4] Byström, K. and Hansen, P. 2005. Conceptual framework for tasks in information studies: Book Reviews. J. Am. Soc. Inf. Sci. Technol. 56, 10 (August 2005), 1050-1061.


[5] Capra, R. 2009. The HCI Browser Tool for Studying Web Search Behavior. In Proceedings of the Third Workshop on Human-Computer Interaction and Information Retrieval (Washington DC, USA, October 23, 2009). HCIR 09. http://cuaslis.org/hcir2009/HCIR2009.pdf

[6] Demartini, G., Difallah, E. D., and Cudré-Mauroux, P. 2012. ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web. WWW '12. ACM, New York, NY, USA, 469-478. DOI=http://doi.acm.org/10.1145/2187836.2187900

[7] Exner, P., and Nugues, P. 2012. Entity extraction: From unstructured text to DBpedia RDF triples. The Web of Linked Entities Workshop (Boston, USA, November 11, 2012). WoLE 2012. http://lup.lub.lu.se/record/3191701

[8] Grishman, R. 1997. Information Extraction: Techniques and Challenges. In Proceedings International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. SCIE '97. Maria Teresa Pazienza (Ed.). Springer-Verlag, London, UK, 10-27. http://dl.acm.org/citation.cfm?id=669801

[9] Heath, T., and Bizer, C. 2011. Linked Data: Evolving the Web into a Global Data Space. 1st edition. Synthesis Lectures on the Semantic Web: Theory and Technology, 1, 1. Morgan & Claypool, 1-136. http://linkeddatabook.com/book

[10] Kiryakov, A., Popov, B., Terziev, I., Manov, D., and Ognyanoff, D. 2004. Semantic annotation, indexing, and retrieval. Web Semant. 2, 1 (December 1, 2004), 49-79. DOI=http://dx.doi.org/10.1016/j.websem.2004.07.005

[11] Kobilarov, G., Scott, T., Raimond, Y., Oliver, S., Sizemore, C., Smethurst, M., Bizer, C., and Lee, R. 2009. Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications. ESWC 2009. Springer-Verlag, Berlin, Heidelberg, 723-737. DOI=http://dx.doi.org/10.1007/978-3-642-02121-3_53

[12] Media Insight Project. 2014. The Personal News Cycle. Report. American Press Institute and the Associated Press-NORC Center for Public Affairs Research. http://www.apnorc.org/projects/Pages/the-personal-news-cycle.aspx

[13] Mendes, P., Jacob, M., and Bizer, C. 2012. DBpedia: a Multilingual Cross-domain Knowledge Base. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (Istanbul, Turkey, May 21-27, 2012). LREC '12. http://www.lrec-conf.org/proceedings/lrec2012/pdf/570_Paper.pdf

[14] Palumbo, A. C., and Hardman, L. 2013. User information needs for environmental opinion-forming and decision-making in link-enriched video. In Proceedings of the 11th European conference on Interactive TV and video. EuroITV '13. ACM, New York, NY, USA, 85-88. http://doi.acm.org/10.1145/2465958.2465973

[15] Perez Romero, L., Hardman, L., and Hildebrand, M. 2013. Deliverable 3.5 - Requirements Document LinkedTV User Interfaces (Version 2). Technical report. LinkedTV project. http://www.linkedtv.eu/demos-materials/linkedtv-deliverables

[16] Vargas-Vera, M., and Celjuska, D. 2004. Event Recognition on News Stories and Semi-Automatic Population of an Ontology. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (September 20-24, 2004). WI '04. IEEE Computer Society, Washington, DC, USA, 615-618. DOI=http://dx.doi.org/10.1109/WI.2004.10148

[17] Wildemuth, B. M., and Freund, L. 2012. Assigning search tasks designed to elicit exploratory search behaviors. In Proceedings of the Symposium on Human-Computer Interaction and Information Retrieval. HCIR '12. ACM, New York, NY, USA, Article 4.


APPENDIX

Schemas of news fragments and related information needs

News fragment 2

Sub-topic: Transport accidents and incidents

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 1 | Person | KilledBy | Event (Slot n. 3) |
| 2 | Person (Slot n. 1) | Number | xsd:integer |
| 3 | Train; Event (WordNet: noun accident) | Location | Country/PopulatedPlace |
| 4 | Train; Event (WordNet: noun accident) | Location | Settlement/PopulatedPlace |
| 5 | Person | WordNet: noun witnessOf | Event (Slot n. 3) |
| 6 | Person | WordNet: verb harmedBy | Event (Slot n. 3) |
| 7 | Event (Slot n. 3) | CausedBy | Thing or Event (WordNet: noun cause) |
| 8 | Person | WordNet: noun passengerOf | Train |
| 9 | Person (resource Rescuer) | Activity | Activity (resource Rescue) |
| 10 | OfficeHolder | Activity | Activity (WordNet: adjective official; noun action) |
| 11 | GovernmentAgency | Location | Country/PopulatedPlace |
| 12 | Organization (Slot n. 11) | Activity | Activity (WordNet: adjective official; noun action) |

Related information needs - Participants: P2, P6, P9, P10

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 13 | Settlement/PopulatedPlace | Map | Thing |
| 14 | Train | Type | Thing |
| 15 | Train; Event (WordNet: noun accident) | Number | xsd:integer |
| 16 | Person | WordNet: noun survivorOf | Event (Slot n. 3) |
| 17 | Person (Slot n. 16) | WordNet: noun opinion | Thing |
| 18 | Person | WordNet: noun witnessOf | Event (Slot n. 3) |
| 19 | Person (Slot n. 18) | WordNet: noun opinion | Thing |
| 20 | Event (Slot n. 4) | Map | Thing |
| 21 | Train; Event (WordNet: noun accident) | PreviousEvent | Event (Slot n. 3) |
| 22 | Person | WordNet: noun driver | Train |
| 23 | Person (Slot n. 8) | Number | xsd:integer |
| 24 | Train | OwningOrganisation | Organisation |
| 25 | Organisation (Slot n. 24) | Description | langString |
| 26 | Event (Slot n. 3) | CausedBy | Thing or Event (WordNet: noun cause) |
| 27 | Organization (Slot n. 11) | Activity | Activity (WordNet: adjective official; noun action) |
| 28 | Person | WordNet: noun passengerOf | Train |
| 29 | Person (Slot n. 28) | WordNet: noun opinion | Thing |
| 30 | Train; Event (WordNet: noun accident) | PreviousEvent | Event (Slot n. 4) |
| 31 | Event (Slot n. 30) | Number | xsd:integer |
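The cross-references between slots used in these schemas (for example, slot 17 has the domain "Person (Slot n. 16)", and slot 16 in turn points to the event of slot 3) can be followed mechanically. The sketch below is one possible way to do so; the regular expression, the function names and the dictionary layout are assumptions made for illustration, and only three slots of News fragment 2 are included.

```python
import re

# Matches references such as "(Slot n. 3)" or "(Slot n.1)".
SLOT_REF = re.compile(r"\(Slot\s+n\.\s*(\d+)\)")


def referenced_slot(cell):
    """Return the slot ID a table cell refers to, or None if it has no reference."""
    match = SLOT_REF.search(cell)
    return int(match.group(1)) if match else None


# slot ID -> (domain, property, range), copied from the schema of News fragment 2
schema = {
    3: ("Train; Event (WordNet: noun accident)", "Location", "Country/PopulatedPlace"),
    16: ("Person", "WordNet: noun survivorOf", "Event (Slot n. 3)"),
    17: ("Person (Slot n. 16)", "WordNet: noun opinion", "Thing"),
}


def reference_chain(slot_id, schema):
    """Follow domain or range references until a slot without references is reached."""
    chain, seen = [slot_id], {slot_id}
    while True:
        domain, _, range_ = schema[chain[-1]]
        nxt = referenced_slot(domain) or referenced_slot(range_)
        if nxt is None or nxt in seen or nxt not in schema:
            return chain
        chain.append(nxt)
        seen.add(nxt)


print(reference_chain(17, schema))  # [17, 16, 3]
```

Read from right to left, the chain says: there was an accident (slot 3), a person survived it (slot 16), and the need concerns that person's opinion (slot 17).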

News fragment 3

Sub-topic: Fires and explosions

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 1 | Event (WordNet: explosion) | Location | Settlement/PopulatedPlace |
| 2 | Person | KilledBy | Event (Slot n. 1) |
| 3 | Person (Slot n. 2) | Number | xsd:integer |
| 4 | Event (Slot n. 1) | Location | Building |
| 5 | Building | OwningOrganisation | Organisation/Company |
| 6 | Person (resource Rescuer) | Activity | Activity (resource Rescue) |
| 7 | Person | WordNet: noun survivorOf | Event (Slot n. 1) |
| 8 | GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action) |
| 9 | President | Activity | Activity (WordNet: adjective official; noun action) |
| 10 | President | WordNet: noun opinion | Thing |
| 11 | Person | WordNet: verb harmedBy | Event (Slot n. 1) |
| 12 | Person (Slot n. 11) | Number | xsd:integer |
| 13 | Event (Slot n. 1) | CausedBy | Thing or Event (WordNet: noun cause) |
| 14 | Event (Slot n. 1) | CausedBy | Gas_leak (resource) |
| 15 | Person | InvolvedIn | Event (Slot n. 1) |
| 16 | Person (Slot n. 15) | Number | xsd:integer |

Related information needs – Participants: P7, P14, P15, P16

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 17 | Event (Slot n. 1) | CausedBy | Thing or Event (WordNet: noun cause) |
| 18 | Building | OwningOrganisation | Organisation/Company |
| 19 | Event (Slot n. 1) | CausedBy | Thing (resource Gas_leak) |
| 20 | Person | KilledBy | Event (Slot n. 1) |
| 21 | Person (Slot n. 20) | Number | xsd:integer |
| 22 | Person (resource Rescuer) | Activity | Activity (resource Rescue) |
| 23 | Event (WordNet: noun oil WordNet: noun disaster) | Location | Place |
| 24 | Event (Slot n. 23) | Number | xsd:integer |
| 25 | Building | Map | Thing |
| 26 | Event (WordNet: explosion) | WordNet: noun radiance | float |
| 27 | Person | WordNet: noun survivorOf | Event (Slot n. 1) |
| 28 | Person (Slot n. 27) | Number | xsd:integer |
| 29 | Person | WordNet: harmedBy | Event (Slot n. 1) |
| 30 | Event (Slot n. 29) | Number | xsd:integer |

News fragment 4

Sub-topic: Natural disasters

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 1 | NaturalEvent (resource Earthquake) | Location | Settlement/PopulatedPlace |
| 2 | Settlement/PopulatedPlace | neighbourRegion | Country/PopulatedPlace |
| 3 | Settlement/PopulatedPlace | WordNet: noun distance; adjective regional noun capital | String |
| 4 | Settlement/PopulatedPlace | Map | Thing |
| 5 | Settlement/PopulatedPlace | Type | Desert |
| 6 | Event (Slot n. 1) | WordNet: noun epicenter | Place |
| 7 | Settlement/PopulatedPlace | WordNet: noun distance; noun epicenter | Length |
| 8 | GovernmentAgency | Activity | Activity (WordNet: adjective official; noun action) |
| 9 | Event (Slot n. 1) | WordNet: noun magnitude | xsd:string |
| 10 | Person | KilledBy | Event (Slot n. 1) |
| 11 | Person (Slot n. 10) | Number | xsd:integer |
| 12 | Person | WordNet: verb harmedBy | Event (Slot n. 1) |
| 13 | Person (Slot n. 12) | Number | xsd:integer |
| 14 | Building (WordNet: adjective residential) | WordNet: verb destroyedBy | Event (Slot n. 1) |
| 15 | Building (Slot n. 14) | Number | xsd:integer |
| 16 | MilitaryUnit | Activity | Activity (resource Rescue) |
| 17 | GeopoliticalOrganisation | Activity | Activity (resource Rescue) |
| 18 | GovernmentAgency | Location | PopulatedPlace (resource Pakistan) |
| 19 | Organisation (Slot n. 18) | Activity | Activity (WordNet: adjective official; noun action) |
| 20 | Person | WordNet: noun survivorOf | Event (Slot n. 1) |
| 21 | Person (Slot n. 20) | requirement | Food |
| 22 | Person (Slot n. 20) | requirement | Place (WordNet: noun Shelter) |
| 23 | Person (Slot n. 20) | requirement | Thing (resource Water_resource) |
| 24 | Person (Slot n. 20) | requirement | Activity (resource Health_care) |

Related information needs – Participants: P17, P18

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 25 | Person (Slot n. 20) | requirement | Place (WordNet: noun Shelter) |
| 26 | Event (Slot n. 1) | WordNet: noun magnitude | xsd:string |
| 27 | Person (Slot n. 20) | requirement | Thing (resource Water_resource) |
| 28 | Organisation | Activity | Activity (resource Humanitarian_aid) |
| 29 | Event (Slot n. 1) | Description | Thing |
| 30 | Thing (Slot n. 29) | updated | xsd:date |


News fragment 5

Sub-topic: Engineering failures

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 1 | Event (resource Structural failure) | Location | Place (resource Building_site) |
| 2 | Event (resource Structural failure) | Location | Country/PopulatedPlace |
| 3 | Event (resource Structural failure) | Location | Settlement/PopulatedPlace |
| 4 | Settlement/PopulatedPlace | neighbourRegion | Settlement/PopulatedPlace |
| 5 | Place (resource Building_site) | buildingType | ShoppingMall/Building |
| 6 | Person | killedBy | Event (Slot n. 1) |
| 7 | Person (Slot n. 6) | number | xsd:integer |
| 8 | Person (resource Construction_worker) | WordNet: verb trappedIn | Place (resource Building_site) |
| 9 | Person (Slot n. 8) | number | xsd:integer |
| 10 | Person (resource Construction_worker) | WordNet: verb harmedBy | Event (Slot n. 1) |
| 11 | Person (Slot n. 10) | number | xsd:integer |
| 12 | Person (resource Rescuer) | Activity | Activity (resource Rescue) |
| 13 | Place (resource Building_site) | OwningOrganisation | Company |
| 14 | Company | Activity | Activity (WordNet: mismanagement) |
| 15 | Event (Slot n. 1) | CausedBy | Activity (Slot n. 14) |
| 16 | GeopoliticalOrganisation | Location | Settlement/PopulatedPlace |
| 17 | Organisation (Slot n. 16) | Activity | Activity (WordNet: adjective official; noun action) |
| 18 | Activity (Slot n. 17) | PreviousEvent | Event (Slot n. 1) |
| 19 | Activity (resource Construction) | Location | Country/PopulatedPlace |
| 20 | Activity (Slot n. 19) | SystemOfLaw | SystemOfLaw |

Related information needs – Participants: P3, P5, P12, P13

| Slot ID | Domain class (alt: resource/lexical entity/slot) | Property (alt: lexical entity) | Range class (alt: resource/lexical entity/slot) |
|---|---|---|---|
| 21 | Activity (resource Real_estate_development) | Location | Country/PopulatedPlace |
| 22 | Activity (Slot n. 21) | Description | langString |
| 23 | Person (resource Construction_worker) | WordNet: noun status | String |
| 24 | Activity (Slot n. 19) | SystemOfLaw | SystemOfLaw |
| 25 | Company | Activity | Activity (WordNet: mismanagement) |
| 26 | Company | WordNet: verb involvedIn | LegalCase |
| 27 | Person | killedBy | Event (Slot n. 1) |
| 28 | Person (Slot n. 27) | number | xsd:integer |
| 29 | Settlement/PopulatedPlace | Map | Thing |
| 30 | Company | WordNet: verb involvedIn | Event (Slot n. 1) |
| 31 | Company (Slot n. 30) | Name | xsd:langString |
| 32 | ShoppingMall/Building | WordNet: verb involvedIn | Place (resource Building_site) |
| 33 | ShoppingMall/Building (Slot n. 32) | Name | xsd:langString |
| 34 | Person | WordNet: verb harmedBy | Event (Slot n. 1) |
| 35 | Person (Slot n. 34) | Name | xsd:langString |
| 36 | Event (Slot n. 1) | Date | xsd:date |
| 37 | Event (Slot n. 1) | Description | Thing |
