
Master Thesis

Temporal Summarization of News Streams

Author: Georgeta-Cristina Gârbacea
Supervisor: Dr. Evangelos Kanoulas

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Artificial Intelligence


Acknowledgements

I would like to thank my supervisor Dr. Evangelos Kanoulas for his guidance and support throughout this thesis. I would not have been able to carry out this work without the constant help, useful advice and uplifting encouragement I received from him over the past year.

I would also like to thank Prof. Dr. Maarten de Rijke for giving me the chance to join the Information and Language Processing (ILPS) group within the University of Amsterdam, and the Amsterdam Data Science Research Center, as a Master student. It was a very inspiring environment to be in, where I had the opportunity to work on many interesting research problems that challenged my thinking.

I sincerely thank all present and former ILPS members and co-authors, especially Manos Tsagkias, Daan Odijk, David Graus, Isaac Sijaranamual, Zhaochun Ren, Ridho Reinada, and Nikos Voskarides. The interesting discussions we had have always been very helpful and a true source of inspiration. Furthermore, I would also like to thank Prof. Dr. Maarten de Rijke, Prof. Dr. Arjen de Vries, and Dr. Piet Rodenburg for agreeing to be part of my thesis defence committee.

Finally, I would like to express my gratitude to my parents, brother, and friends for their endless support.


Abstract

Monitoring and analyzing the rich and continuously updated content in an online environment can yield valuable information that allows users and organizations to gain useful knowledge about ongoing events and, consequently, take immediate action. This calls for effective ways to accurately monitor, analyze and summarize the emergent information present in an online environment. News events such as protests, accidents, or natural disasters represent a unique source of fast-updated information. In such scenarios, users who are directly affected by the event have an urgent need for up-to-date information that enables them to initiate and carry out immediate action. Temporal summarization algorithms filter large volumes of streaming documents and emit sentences that constitute salient event updates. The systems developed typically combine traditional retrieval and document summarization algorithms in an ad-hoc fashion to filter sentences inside documents. Retrieval and summarization algorithms, however, have been developed to operate on static document collections. Therefore, a deep understanding of the limitations of these approaches when applied to a temporal summarization task is necessary. In this work we present a systematic analysis of temporal summarization methods, and demonstrate the limitations and potentials of previous approaches by examining the retrievability and the centrality of event updates, as well as the existence of inherent characteristics in update versus non-update sentences. We also probe the utility of traditional information retrieval methods, event centrality and event modelling techniques at identifying salient sentence updates. Last, we employ supervised machine learning methods for event summarization. Our results show that retrieval algorithms have a theoretical upper bound that does not allow for the identification of all relevant event updates; nevertheless, we outline promising directions towards improving the performance of an efficient temporal summarization system.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Contributions

2 Related Work
  2.1 TREC Temporal Summarization
    2.1.1 TREC Temporal Summarization 2013
    2.1.2 TREC Temporal Summarization 2014
    2.1.3 TREC Temporal Summarization 2015
  2.2 News tracking and summarization
    2.2.1 Event Detection
      Specified vs. unspecified event detection
      Retrospective vs. new event detection
      Supervised vs. unsupervised event detection
    2.2.2 Event Tracking
    2.2.3 Event Summarization

3 Task, Datasets and Evaluation Metrics
  3.1 TREC Temporal Summarization Task
  3.2 Datasets
  3.3 Evaluation Metrics
    3.3.1 Evaluation Metrics Example

4 Upper Bound Analysis
  4.1 Main Approaches
  4.2 Experimental Design
  4.3 Results and Analysis
    4.3.1 Retrieval Algorithms: Are event updates retrievable?
    4.3.2 Do event updates demonstrate inherent characteristics?
    4.3.3 Summarization Algorithms: Do event updates demonstrate centrality?
    4.3.4 Can entities help in multi-document summarization?
  4.4 Conclusion

5 Methods and Techniques
  5.1 Data pre-processing and indexing
  5.2 Information retrieval module
  5.3 Information processing module
    5.3.1 Methods and Techniques
  5.4 Novelty Detection
  5.5 Experimental Results
  5.6 Analysis and Conclusions

6 Machine Learning for Summarization
  6.1 Approach
  6.2 Experimental Setup
  6.3 Results and Analysis
  6.4 Conclusion
  6.5 Future Work

7 Conclusion

A Appendix A

B Appendix B

Bibliography


Chapter 1

Introduction

With the exponential growth of the World Wide Web, finding relevant information buried in an avalanche of data has become increasingly difficult. Automatic tools that offer people timely access to up-to-date news and are able to digest information from various sources are in high demand nowadays, since these can alleviate the information overload problem when the volume and diversity of data is overwhelming. As the amount of information increases, the interest in systems that can produce a concise and fluent overview of the important content grows larger. This calls for effective summarization techniques that can help people grasp the essential pieces of information conveyed. Recent years have witnessed the development of applications capable of summarizing news streams, scientific articles, voicemail, meeting recordings, broadcast news and videos [75]. Even though these systems are far from perfect, they have successfully shown their utility in helping users cope with vast amounts of data in a timely manner.

During unexpected crisis events such as natural disasters, mass protests or human catastrophes, people have an urgent need for information, especially if they are directly involved in the event. Crisis situations typically involve many open questions and uncertainty, and it is often necessary to make quick decisions having only limited knowledge about a developing event. Generally, at the beginning of an event there is little relevant content available and information is very scarce. As the event develops, on-going information becomes available through news agencies, social media, television, newspapers, and radio stations. Social media in particular has become the main channel for people to post situation-sensitive information, allowing affected populations and those outside the impact zone to stay informed on “what is happening right now”, and learn “first hand” news in almost real-time. This enables members of the public, humanitarian organisations and formal agencies to take immediate action in time-dependent and safety-critical situations. The role the public plays in disaster response efforts is essential when it comes to gathering and spreading critical information, fast communication, and the organization of relief efforts [44].

Time-critical news events happen unexpectedly and information about the topic, while often voluminous, evolves rapidly. On-time and up-to-date information about the event is essential throughout the entire disaster lifecycle (preparation, impact, response and recovery) to people directly involved in the event, first responders, formal response agencies, and local, national and international crisis management organizations. However, collecting authoritative news is especially challenging during major events which involve extensive damage – the diversity of news sources disseminating the event causes many rumours and controversial information to propagate, and the high volume of data can overwhelm those trying to monitor the situation [38]. In addition, the quality of information is frequently degraded by the inclusion of unimportant, duplicate or inaccurate content, which makes finding the right information in a timely fashion a challenging process. Emergency situations create a scenario where it is important to be able to present users with only the novel and relevant facts that characterize an event as it develops. Because information is needed rapidly, people cannot wait for comprehensive reports to materialize – they have limited time to follow the event and therefore they should receive updates that include only the most pertinent information. This calls for methods that can successfully monitor, filter, analyze and organize the dynamic and overwhelming amount of data produced over the duration of a disaster or crisis event.

Automatic summarization techniques have the potential to assist during catastrophic events, by delivering relevant and salient information at regular time intervals and in cases when human volunteers are unable to do this [49]. According to Radev et al [87], a summary is defined as “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that”; in the given context the word “text” refers to any speech, video, multimedia document, or hypertext content. Therefore, the main goal of a summarizer system is to present the user with a summary of the main ideas inside the input in a concise and coherent way, reducing the size of the input while preserving the initial degree of informativeness. Since the information content in a document appears in bursts, the main challenge of an efficient summarization system is distinguishing between the more and the less informative segments of text.

Document summarization is aimed at automatically creating a representative summary or synopsis of the entire document by finding the most informative sentences. In general, there are two main approaches to document summarization - extraction and abstraction. Extractive summaries (extracts) tend to be more practical and work by selecting a subset of the most significant concepts inside the original documents (existing words, phrases, or sentences) to form a summary. As opposed to that, abstractive summaries (abstracts) first need to “understand” the input, i.e. build an internal semantic representation of the documents, and then paraphrase the salient concepts by the use of natural language generation techniques. The end result is an output summary close to what a human might generate. However, due to current limitations in natural language processing technology and complexity constraints, research into abstractive methods for summarization is restricted to specific domains only, such as image collection and video summarization. In terms of the input, summarization systems can produce a summary of either one single document, in which case they perform single-document summarization, or of multiple source documents, in which case they perform multi-document summarization. Motivated by the ever-growing size of the web and increased information access, multi-document summarization is extremely useful in providing a brief digest of many documents on the same topic or discussing the same event. In addition, according to the topic of focus, summarization can be further distinguished into generic summarization and query-focused summarization. Generic summarization typically addresses a broad audience: anyone may end up reading the summary, therefore no assumptions are made about the genre or domain of input documents, nor about the goal for generating the summary [75]. Generic summarization primarily answers the question “What is this document/collection of documents about?”. In contrast, query-focused summarization attempts to summarize the information content given a query issued by the user, and a document/set of relevant documents returned by a search engine. The summarizer takes the query into account, and finds information within the relevant document(s) that relates to the user query. Hence, query-focused summarization aims to give an answer to the question “What does this document(s) say about the query?”.

One drawback of the approaches presented above is that they take a retrospective perspective when issuing updates, and assume that all the relevant information about an event has already been collected. This makes them unsuitable for scenarios of long-running events with a dynamic flow of information, which typically require concise on-topic updates to be issued in a timely manner. This observation introduces a new dimension to summarization – time, so that the information conveyed is also time-sensitive. Update summarization takes this time dimension into account, and produces an incremental summary which contains the most salient and evolving information from a collection of input documents, starting from the assumption that the user has prior knowledge about the event and has read previous documents on the topic. The summary is expected to convey the most important developments of an event beyond what the user has already seen, i.e. only new information not covered inside an initial update summary (which is the output of an Initial Summarization step).

In this thesis we focus on efficiently monitoring the information associated with an event over time, following the specifications of the TREC Temporal Summarization track [13]. The goal of the task is to encourage the development of systems which can detect useful, novel, and timely sentence-length updates about a developing crisis event when new information rapidly emerges in the presence of a dynamic corpus. Given as input a named crisis event with high impact, and a large-volume stream of documents concerning that event, systems are required to emit a series of sentence updates that describe the evolution of the current event over time. An optimal summary covers all of the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. Writing a concise and fluent summary requires the capability to recognize, modify and merge information expressed in different sentences inside the input. Given this pre-defined task, our goal is to investigate methods that can help in efficiently summarizing long-running events by identifying the important content at the sentence level. In more specific terms, we delve into the problem of extractive query-focused multi-document update summarization, and analyze the potential and limitations of existing information retrieval and machine learning methods at identifying salient sentence updates deemed to be included in temporal summaries of news event streams. We also include a study through which we aim to obtain a deeper understanding of how and why some of the aforementioned approaches fail, and what is required for a successful temporal summarization system. We believe that such an analysis is necessary and can shed light on developing more effective algorithms in the future.


1.1 Research Questions

The main research question of this thesis is whether we can effectively identify and broadcast short, timely, relevant and novel sentence-length updates about an ongoing crisis event in an online, sequential setting. In order to address this research question, we aim at answering the following sub-questions. First, we examine what the limitations of retrieval algorithms are for temporal summarization of news events (RQ1). To this end, we focus on the overlap between the language of an event query and the language of an event update in terms of the shared vocabulary, and based on that we perform an upper bound analysis of the potentials and limitations of information retrieval algorithms. Furthermore, we investigate the existence of inherent characteristics in update versus non-update sentences (RQ2). We are mainly interested in finding which sentences are important to include in a summary, and whether we can identify discriminative terms that characterize these sentence updates. We are also interested in assessing whether an event update is central inside the documents that contain it (RQ3). In addition, we examine whether extracting entities and relations can help in summarization (RQ4). Finally, we investigate the performance of retrieval techniques individually, and combined with other features, when we employ supervised machine learning methods for summarization in a learning-to-rank framework (RQ5).

1.2 Contributions

The main contributions of this work are:

• a systematic analysis of the limitations and potentials of retrieval, summarization, and event update modeling algorithms;

• insights into the performance of the main methods for event update identification on the TREC Temporal Summarization dataset;

• an analysis of the efficiency of supervised machine learning techniques for the task at hand;

• an examination of what makes for a good sentence update, with suggestions for future work and further improvements.

The remainder of this thesis is organized as follows. In Chapter 2 we discuss related work. The TREC Temporal Summarization task and datasets are presented in Chapter 3. In Chapter 4 we include an upper bound analysis of the limitations of the employed methods for sentence update extraction, including the utility of entities in text summarization. In Chapter 5 we describe the experimental setup and the main methods used. We employ machine learning methods for ranking sentence updates in Chapter 6. We analyze our results, conclude, and discuss future research directions in Chapter 7.


Chapter 2

Related Work

In this chapter we present work in temporal summarization that is directly related to ours. More specifically, we first focus on describing the design of the best systems participating in previous editions of the TREC Temporal Summarization track, and afterwards we present related work in the area of temporal summarization of online news.

2.1 TREC Temporal Summarization

The Temporal Summarization (TS) task is one of the tracks in the Text REtrieval Conference (TREC), an on-going series of workshops that encourage research in information retrieval and related applications1,2,3. As we describe in Chapter 3, TREC TS provides high-volume streams of news articles and blog posts crawled from the Web, for a set of pre-defined crisis events. Each article or blog post is associated with an event, and each event is represented by a topic description (a textual query briefly describing the event), and the start and end time of the period during which the event took place. Participant systems are required to return a set of sentences extracted from the corpus of relevant documents for each event; these sentences form the summary of the respective event over time. An optimal summary covers all the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. In addition, it is often the case that an event can be decomposed into a variety of fine-grained atomic sub-events, called nuggets. The ideal temporal summarization system would update users as soon as each of these sub-events occurs. Therefore, an optimal summary should cover all the information nuggets associated with an event in the minimum number of sentences describing the event.

To evaluate the quality of a summary, TREC TS introduced custom evaluation metrics developed by the track organizers that look specifically at the relevance, coverage, novelty and latency of the updates. These metrics include time-sensitive versions of precision and recall, ensuring that systems are penalized when information about an event is delivered long after the event occurred. Expected Gain denotes how much information is provided by the updates, while Comprehensiveness measures how well the updates cover the event. Latency is used to measure how fast updates are issued by the system. Expected Latency Gain is similar to traditional precision, with the observation that runs are penalized for delaying the emission of relevant updates. Latency Comprehensiveness is analogous to traditional recall, and measures the coverage of relevant nuggets in a run. Runs are scored under a combined measure of the last two metrics, i.e. the Harmonic Mean of the normalized Expected Latency Gain and Latency Comprehensiveness.

1 http://trec.nist.gov/pubs/trec22/trec2013.html
2 http://trec.nist.gov/pubs/trec23/trec2014.html
3 http://trec.nist.gov/pubs/trec24/trec2015.html

Most TREC TS participants addressed the problem in multiple stages, first selecting relevant documents likely to concern the event, and afterwards extracting relevant and novel sentences based on a set of filters. All systems use a data preprocessing step, which consists of decrypting, decompressing and deserializing the data, and indexing it using open source tools to facilitate dealing with the large volume of documents and news articles. Most participants use query expansion techniques to improve retrieval performance, and to overcome the word mismatch problem between the language of the query and the vocabulary used in relevant updates [112, 18, 57, 62, 50, 114]. In what follows we describe systems of particular interest from the 2013, 2014 and 2015 editions of TREC TS.
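To make the interaction between these metrics concrete, the listing below computes toy versions of Expected Latency Gain, Latency Comprehensiveness and their harmonic mean from a list of emitted updates and the nuggets they match. It is only an illustration under simplifying assumptions (a single generic latency discount, no run-level normalization, no handling of duplicate matches); the official TREC TS evaluation scripts define these quantities more carefully.

    import math
    from typing import Dict, List, Tuple

    ALPHA = 3600 * 6  # latency scale in seconds; an assumed value for illustration only

    def latency_discount(nugget_time: float, update_time: float) -> float:
        # Close to 1 for on-time updates, approaching 0 for very late ones.
        return 1.0 - (2.0 / math.pi) * math.atan((update_time - nugget_time) / ALPHA)

    def evaluate_summary(updates: List[Tuple[float, List[str]]],
                         nuggets: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
        """Toy Expected Latency Gain, Latency Comprehensiveness and harmonic mean.

        updates: (emission time, ids of nuggets matched by the update), per emitted update
        nuggets: nugget id -> (nugget timestamp, relevance weight)
        """
        matched: Dict[str, float] = {}   # best discounted credit obtained per nugget
        total_gain = 0.0
        for emit_time, nugget_ids in updates:
            for nid in nugget_ids:
                n_time, weight = nuggets[nid]
                credit = weight * latency_discount(n_time, emit_time)
                total_gain += credit
                matched[nid] = max(matched.get(nid, 0.0), credit)

        elg = total_gain / max(len(updates), 1)                 # gain per emitted update
        total_weight = sum(w for _, w in nuggets.values()) or 1.0
        lc = sum(matched.values()) / total_weight               # coverage of nugget weight
        h = 2 * elg * lc / (elg + lc) if (elg + lc) > 0 else 0.0
        return {"ELG": elg, "LC": lc, "H": h}

The sketch captures the trade-off the track rewards: emitting many sentences dilutes the per-update gain, while emitting too few leaves nuggets uncovered and lowers comprehensiveness.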

2.1.1 TREC Temporal Summarization 2013

The best performing system in TREC TS 2013 in terms of the Expected Gain metric is the information extraction system developed by team PRIS [112]. They use hierarchical Latent Dirichlet Allocation on the set of relevant event documents to infer potential event topics. The generated topic descriptions undergo a manual filter to select keywords describing the topic, and these keywords are later used in scoring sentences based on the similarity between the keyword vector of a topic and the vector of a given sentence. Sentences that are most similar to the topic are selected as output. Their system includes a post-processing step to ensure that the most up to date information is selected every hour, and that duplicate sentences are removed from the output. However, their approach is semi-supervised, and involves human labour, which is expensive and time-consuming.

In the same TREC edition, the University of Waterloo [18] achieved the highest score in terms of the Latency Comprehensiveness metric. They first rank documents using the query likelihood model in increasing order of their time stamps, and extract relevant sentences from these documents using standard retrieval techniques. They achieve high recall on the task, but their method performs poorly in terms of precision.

The highest harmonic mean in the competition was obtained by ICTNET [57]. They first determine whether a document is relevant to the topic by checking whether the title of a news article covers all words inside the event query. Afterwards they learn a set of crisis-specific trigger words from the training data – for example, words such as “kill”, “die” and “injure”. This set of keywords is augmented with WordNet synonyms, and sentences are scored based on how many of these keywords are found inside a sentence. They check for novelty, and discard the current sentence if its similarity degree with past emitted sentences exceeds a certain threshold. However, their approach can fail when the type of the event is not known in advance, or when the present information is not specific to the given event type.
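A minimal sketch of this kind of trigger-word scoring with a novelty filter is given below. The seed trigger list, the prefix matching, the Jaccard-based redundancy check and the thresholds are illustrative placeholders, not ICTNET's actual configuration (which learned trigger words from training data and expanded them with WordNet synonyms).

    # Illustrative seed list only; a real system would learn these from training data.
    TRIGGERS = {"kill", "die", "injure", "evacuate", "collapse"}

    def keyword_score(sentence, triggers=TRIGGERS):
        # Count trigger words, matched by prefix as a crude stand-in for stemming.
        tokens = sentence.lower().split()
        return sum(any(tok.startswith(t) for t in triggers) for tok in tokens)

    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def filter_stream(sentences, min_score=2, max_sim=0.6):
        """Emit sentences with enough trigger words that are not near-duplicates."""
        emitted = []
        for s in sentences:
            if keyword_score(s) < min_score:
                continue
            if any(jaccard(s, prev) >= max_sim for prev in emitted):
                continue
            emitted.append(s)
        return emitted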

Other teams participated in the competition to investigate real-time extractive models for the summarization of crisis events from news streams. University of Glasgow [62] proposed a core summarization framework able to process hourly document batches at the document, sentence, and temporal level. At the document level, they select the top 10 documents per hour as a function of their relatedness to the query; at the sentence level they identify sentences from these documents that are most likely to be useful for inclusion in the summary; lastly, at the temporal level they compare each candidate sentence with the current representation of the event, and emit only novel sentences that have not been selected from prior batches. To this end, they use the Maximal Marginal Relevance (MMR) algorithm designed to balance relevance and novelty [21]. They find that during periods of low prominence, off-topic and non-relevant content is likely to be included in the summary, which leads to topic drift within the final summary. Therefore, the volume of information to be summarized should be adjusted during time periods when no important sub-events occur. They conclude that adapting the sentence selection strategy over time is critical for an efficient update summarization system.

Finally, the HLTCOE team [110] identifies three key challenges any temporal summarization system must address: i) topicality: the appropriate selection of on-topic sentences, ii) novelty: the inclusion in the summary of sentences that contain novel content, and iii) importance: the selection of informative sentences that a human being would consider adding to the summary. They create bag-of-words representations for each topic, including unigrams, named entities and predicates, starting from the event query description and the titles of documents; each time a new sentence is selected for inclusion into the summary, these representations are updated accordingly. They conclude that dynamically updating a topic's representation as sentences are selected for inclusion in the summary is a challenging process, and that more sophisticated models are needed for the extraction of relevant sentences, models that capture simultaneously a document's relevance to the topic and a sentence's relevance inside a document.
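The MMR criterion used by the Glasgow system above can be sketched as follows: at each step the next sentence maximizes a trade-off between similarity to the query and dissimilarity to what has already been selected. The sketch follows the standard MMR formulation [21] with a generic similarity function and trade-off parameter; it is not tied to any system-specific tuning.

    def mmr_select(candidates, query, sim, k=10, lam=0.7):
        """Greedy Maximal Marginal Relevance selection.

        candidates: list of candidate sentences; query: the event query
        sim(a, b): similarity function in [0, 1]; lam: relevance/novelty trade-off
        """
        selected = []
        pool = list(candidates)
        while pool and len(selected) < k:
            def mmr_score(s):
                novelty_penalty = max((sim(s, t) for t in selected), default=0.0)
                return lam * sim(s, query) - (1.0 - lam) * novelty_penalty
            best = max(pool, key=mmr_score)
            selected.append(best)
            pool.remove(best)
        return selected

Setting lam close to 1 favours relevance to the query, while lower values push harder against redundancy with already emitted sentences.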

2.1.2 TREC Temporal Summarization 2014

TREC TS 2014 released a pre-filtered version of the TREC KBA dataset containing documents more likely to include relevant sentences, this time for a new set of crisis events. Participants could choose between using the entire corpus of documents just like in the previous year, or running experiments on the pre-filtered subset. The best performing run in the competition in terms of the H metric belongs to Columbia University [50, 51]. Their system combines supervised machine learning approaches with the Affinity Propagation clustering technique for sentence salience prediction. They use a diverse set of features for selecting an exemplar set of sentences from the relevant event documents. These features include surface features, query features, language model scores, and geographic and temporal relevance features. They use these features to train a machine learning classifier for predicting the centrality of a sentence within an event. Sentences which pass this filter constitute the input to the next clustering step, which ultimately ensures that the selected sentences are the most central and relevant. Finally, they check for novelty by imposing a minimum salience threshold on each emitted sentence update.

Team BJUT [114] scored best in terms of Expected Latency Gain. Their temporal summarization system includes a corpus pre-processing module, an information retrieval module, and an information processing module. After performing basic pre-processing steps, they index the data and cluster sentences using the k-means clustering algorithm. They choose the centers of the clusters and the top-k sentences from each cluster for inclusion into the summary, after ranking them by time and cosine similarity to the query. While attaining good precision, their system does less well in terms of recall.
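As an illustration of the clustering step in such pipelines, the sketch below selects exemplar sentences with scikit-learn's AffinityPropagation over a tf.idf feature matrix, after an optional salience filter. The feature extraction and the salience threshold are placeholders standing in for the richer, learned features of the Columbia and BJUT systems.

    from sklearn.cluster import AffinityPropagation
    from sklearn.feature_extraction.text import TfidfVectorizer

    def exemplar_sentences(sentences, salience=None, min_salience=0.0):
        """Cluster candidate sentences and return one exemplar per cluster.

        salience: optional per-sentence scores used as a filter; when absent, all
        sentences are kept (a stand-in for a learned salience classifier).
        """
        if salience is not None:
            sentences = [s for s, sc in zip(sentences, salience) if sc >= min_salience]
        if not sentences:
            return []
        features = TfidfVectorizer().fit_transform(sentences).toarray()
        ap = AffinityPropagation(random_state=0).fit(features)
        return [sentences[i] for i in ap.cluster_centers_indices_]

Affinity propagation returns the indices of its exemplars directly, which is why it is a convenient fit for extractive selection: the exemplars themselves are the candidate updates.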

The team from the University of Glasgow [63] achieved the best recall, devising a real-time filtering framework for document processing and sentence selection. To this end, they leverage TREC TS historical data from the previous year (2013) to train a machine learning classifier for predicting whether a document is relevant to a given event or not. They expand event representations based on Freebase and DBpedia, and process documents in real time as they arrive. Each document is assessed for the extraction of salient sentences from it. To pass the selection filter, a sentence should be of medium length, well written, and contain one or more named entities. Supervised machine learning methods are employed for finding well-written sentences. Finally, there is a novelty-based filtering module that ensures only sentences with low cosine similarity to already selected sentences are emitted as updates. One of the important observations of their work is that there is a high degree of vocabulary mismatch between the description of an event (i.e. the event query) and the associated information nuggets for that event. This makes the task of identifying relevant updates particularly challenging. To tackle this problem, they leverage the positional relationship of sentences inside a document to emit updates that do not exhibit any semantic overlap with the event query, but are found in close proximity to sentences estimated as relevant. They find that using a larger browsing window increases the comprehensiveness of the summary, but harms the expected latency gain scores. Their main conclusion is that using sentence proximity within documents can address the semantic gap between the event query and the relevant updates that share no common terms with the query.

Other teams applied various techniques. The BUPT PRIS team [85] achieved the best latency score. Their main focus is on keyword mining through the use of query expansion techniques. To this end, they expand query terms with words with similar meaning based on WordNet, Word2Vec, and a neural network model. Sentences are scored in batches at regular time intervals by the number of keywords contained inside the sentence. IRIT [1] proposes a generic event model, starting from the assumption that event updates contain specific crisis words independent of the event type (for example, general keywords such as “storm”, “hurricane” and “bombing”). They build a generic event model by estimating term frequencies of words inside the gold standard corpus, and use this model in scoring incoming sentences for inclusion into the summary. Because of the topic drift between the event query and the relevant event updates, ICTNET [24] uses the topic of a document to determine whether the respective document is related to the query or not. They infer latent semantic topics from documents using Latent Dirichlet Allocation, and use the generated list of keywords and their weights in sentence scoring. Additionally, they mine a list of discriminative words with the χ2 method, and use these words as features in training a Support Vector Machines classifier for the selection of relevant and novel sentence updates.
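The χ2-based selection of discriminative terms followed by a linear SVM can be sketched with scikit-learn as below. The labels (update vs. non-update sentences), the number of selected terms and the classifier settings are illustrative assumptions, not ICTNET's actual configuration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_update_classifier(sentences, labels, n_terms=500):
        """Keep the n_terms most discriminative words (chi-squared) and fit a linear SVM.

        sentences: training sentences; labels: 1 for update, 0 for non-update.
        n_terms must not exceed the vocabulary size of the training data.
        """
        model = make_pipeline(
            CountVectorizer(lowercase=True, stop_words="english"),
            SelectKBest(chi2, k=n_terms),
            LinearSVC(),
        )
        return model.fit(sentences, labels)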


2.1.3 TREC Temporal Summarization 2015

TREC 2015 builds upon the full/partial-filtering and summarization tasks introduced in the previous years (2013 and 2014). Besides these, this is the first year in which the summarization-only task is introduced, providing participants with lower-volume streams of on-topic documents for a given set of events. As we are mainly interested in this task, in what follows we describe the architecture of participant systems in the summarization-only track.

The best scoring system in the competition was developed by the University of Waterloo [90]. They observe that news articles frequently have an inverted pyramid structure: the title describes the newest and most relevant information, the first paragraph explains the new information on the event, while the remaining paragraphs only mention supportive information. Moreover, it is often the case that the first relevant documents about the event reveal important information such as date, time, location or initial estimates of damages, while the corpus later expands to provide more precise information in the form of continuous updates. Based on this line of reasoning, they develop a system for processing documents in 5-minute batches, retrieving the highest scoring sentences for each time interval using BM25 (a sketch of this scoring function is given at the end of this subsection). To avoid pushing redundant updates, they compute the cosine similarity between the document titles of the previously pushed updates and the document title of the proposed update. To ensure that good quality updates are added to the summary, they develop a custom metric for sentence selection by looking at documents for which the number of paragraphs is strictly higher than the number of images contained.

BJUT [111] use two different clustering algorithms for summarization: non-negative matrix factorization with similarity-preserving feature regularization, and affinity propagation. Compared to other clustering algorithms, affinity propagation has the advantage of being a fast and efficient clustering algorithm for large datasets which does not require the number of clusters to be specified in advance. They find that these methods perform similarly, but seem to lack stability.

Starting from the assumption that “events are about entities”, the University of Glasgow [65] aims at creating summaries of events using features derived from the entities involved in the development of an event. In particular, they investigate the role of features such as entity importance and entity-entity interaction in capturing salient entities, and how these entities connect with each other. The importance of an entity is estimated as the frequency of the respective entity throughout the entire duration of the event, and the entity-entity interaction is estimated via entity co-occurrence. They use these features for scoring sentence updates for inclusion into the event summary, using two distinct corpus processing methods: summarizing the content of an event document by document (real-time scenario), and in hourly batches (near real-time task). To produce temporal summaries of events, they first score sentences by their cosine similarity to the query, after which the set of candidate sentences goes as input into a re-ranking step which makes use of the entity-focused features already mentioned. The top-k sentences which pass this filter are included in the final summary. They observe that processing the corpus hour by hour is more effective under the expected gain metric, whereas processing the corpus document by document performs better in terms of comprehensiveness. In both cases the feature encoding the entity-entity interaction is more effective than the feature encoding the entity importance.
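The BM25 scoring used by the Waterloo-style batch loop described at the start of this subsection can be sketched as follows; the tokenization, the parameter values (k1, b) and the per-batch statistics are generic textbook choices, not the exact configuration of [90].

    import math
    from collections import Counter

    def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
        """Score each tokenized document (or sentence) in the current batch with BM25."""
        n_docs = len(docs)
        avgdl = sum(len(d) for d in docs) / max(n_docs, 1)
        df = Counter(t for d in docs for t in set(d))        # document frequencies
        scores = []
        for d in docs:
            tf = Counter(d)
            s = 0.0
            for t in query_terms:
                if t not in tf:
                    continue
                idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(s)
        return scores

In a streaming setting these collection statistics would be computed per batch (e.g. per 5-minute window) or maintained incrementally over the documents seen so far.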


The UDEL FANG team [59] performs query expansion based on event-type information and information about event-related entities in the query, incorporating external knowledge from Wikipedia. Sentences are scored using the query likelihood method with Dirichlet smoothing. ISCASIR [108] relies on distributed word representations to compute distances between the query and the relevant sentence updates. ILPS.UvA [34] employs a set of retrieval-based and event modeling methods in building the summary of a crisis event. For more details regarding their methods and results, see Appendix B.
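Query likelihood with Dirichlet smoothing scores a sentence (or document) by the probability of generating the query from a smoothed language model, p(t|d) = (tf(t,d) + μ p(t|C)) / (|d| + μ). A minimal sketch, with μ and the collection model treated as inputs rather than the values used by any particular team:

    import math
    from collections import Counter

    def dirichlet_ql(query_terms, doc_terms, collection_prob, mu=2000.0):
        """Log query likelihood of a document under Dirichlet smoothing.

        collection_prob(t): background probability of term t in the whole collection.
        """
        tf = Counter(doc_terms)
        dlen = len(doc_terms)
        score = 0.0
        for t in query_terms:
            p = (tf[t] + mu * collection_prob(t)) / (dlen + mu)
            score += math.log(p) if p > 0 else float("-inf")
        return score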

2.2 News tracking and summarization

Single and multi-document summarization have long been studied by the natural language processing and information retrieval communities [75, 44, 7, 9, 4, 26, 74]. Multi-document summarization is more complex, and issues such as compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Ideally, multi-document summaries contain the key relevant pieces of information shared across documents without any redundancy, plus some other information unique to individual documents directly relevant to the user's query. According to Goldstein et al [36], multi-document summarization is particularly useful in cases when there is a large collection of dissimilar documents available and the user wants to assess the information landscape contained in the collection, or when, as the result of the user issuing a query, a collection of topically related documents is returned.

Events play a central role in many online news summarization systems. In general, events are real-world occurrences that unfold over space and time [6]. An event can be defined in multiple ways, either as “the occurrence of something significant associated with a specific time and location” [20], or as “an occurrence causing changes in the volume of text data that discusses the associated topic at a specific time; this occurrence is characterized by topic and time, and is often associated with entities such as people and location” [30]. Topic Detection and Tracking (TDT) [6] has focused on monitoring broadcast news stories and issuing alerts about seminal events and related sub-events in a stream of broadcast news stories at the document level. To retrieve text at different granularities, passage retrieval methods have been widely employed; see the TREC HARD track [3] and INEX ad hoc [47] initiatives for an overview. Passages are typically treated as documents, and existing language modeling techniques that take into account contextual information [22, 72, 33], the document structure [12] or the hyperlinks contained inside the document [77] are adapted for retrieval [52]. Nevertheless, passage retrieval techniques assume a static test collection and are not directly applicable to a streaming corpus scenario. Clustering [106, 105], topic modelling [8, 93, 92], and graph-based approaches [31, 67] have been proposed to quantify the salience of a sentence within a document.

McCreadie et al [64] introduce the task of incremental update summarization, aimed at producing a series of sentence updates about an event over time to issue to an end-user tracking the event. Their approach builds upon traditional update summarization systems that produce fixed-length update summaries at regular time intervals. During times when no new pieces of information emerge, these update summaries often contain irrelevant or redundant information that needs to be filtered. They model the sentence selection from each update summary as a rank cutoff problem, and predict, based on the current update summary and previous sentences issued as updates, how many sentences from the current update summary to select for inclusion in the final summary. To this end, they use supervised machine learning approaches, and devise a set of features that capture the prevalence of the event, the novelty of the content, and the overall sentence quality across the sentences inside the input update summary. They use these features in predicting the optimal rank cutoff θ at which to stop reading the current update summary, given the previously seen sentence updates.

Kedzie et al [49] combine sentence salience prediction with clustering to produce relevant summaries that track events across time. They process relevant documents in hourly batches, by first predicting the salience of each sentence in the current batch, and afterwards selecting the most salient sentences and clustering them. For the task of sentence salience prediction they use a diverse set of features, including language model scores, geographic relevance and temporal relevance, in combination with a Gaussian process regression model. In order to select sentences that are both salient and representative for the current batch, they combine the output of the salience prediction model with the affinity propagation clustering algorithm. Lastly, they select the most salient and representative sentences for the current hour after performing a sequential redundancy check, in decreasing order of sentence salience scores. However, their approach presents a number of limitations: first, because they train the regression model offline, it is difficult to include features that capture information about the incoming document stream or the current summary; second, the clustering algorithm suffers from an inevitable time lag, as it needs to collect one hour's worth of new documents to analyze before issuing updates; and third, clustering severely limits the scalability of their system. To address these shortcomings, the same authors propose in [48] a locally optimal learning-to-search algorithm that uses reinforcement learning and learning-based search to sample training data from the vast space of all update summary configurations with all sentence updates, and train a binary classifier which can predict whether or not to include a candidate sentence in the summary of an event. Each environmental state in the reinforcement learning problem corresponds to having seen the first incoming sentences in the stream and a sequence of actions, and is encoded using both static and dynamic features. They learn a policy that maps states to actions – either select or skip the current sentence.

Vuurens et al [105] propose a three-step approach for online news tracking and summarization consisting of the following steps: routing, identification of salient sentences, and summarization. They first build a graph in which each streaming news article is represented as a node, and assign directed edges to the top three nearest neighbours of the article based on the similarity of titles and the proximity of publication times. To this end, they introduce a 3-NN streaming variant of the k-nearest neighbour clustering algorithm, with the purpose of detecting newly formed clusters of articles. From these clusters they later extract salient sentences, after applying the 3-NN heuristic one more time. Lastly, they generate a concise summary by selecting the most relevant sentences which include the most recent developments of the given topic.


Althoff et al [11] generate a timeline of events and relations for entities in a knowledge base, accounting for quality criteria such as relevance, temporal diversity and content diversity. The relevance of an entity is modeled as a linear combination of the relevance of related entities and the importance of related dates. For a given entity of interest, they first generate a set of possible candidate events by searching the knowledge base. Simple events are encoded as paths of length one through the knowledge graph, while compound events are nodes connected through a path in the graph. To ensure an event can be understood by an end-user in natural language, they manually generate description templates for the most frequent paths. In order to select the most diverse and relevant set of events, they rely on submodular optimization and devise a greedy selection algorithm with optimal guarantees of diversity in the temporal spacing of events.

Sipos et al [102] re-interpret summarization as a coverage problem over words anchored in time. The components of the summary they generate for a collection of news articles can be authors, keywords, or documents extracted from the collection; they therefore aim to extend document summarization beyond extractive sentence retrieval. Given a corpus of related articles, they first identify the most influential documents on the content of the corpus, then at each point in time they identify the most influential documents for that respective time period. The next step is identifying the most influential authors for the given period, and lastly the most influential key phrases in the respective time interval. An optimal summary should not only optimize the coverage of the information content, but should also reflect which documents and authors have had the highest influence in the development of the corpus. The key assumption of their approach is that coverage of words can be used as a proxy for the coverage of the information content. They model each task as a maximum coverage problem, and perform optimization via a greedy selection algorithm.
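Greedy selection for maximum coverage, which both works above use in different forms, repeatedly picks the item that covers the most not-yet-covered elements (here, words). The sketch below is the generic textbook greedy algorithm, which carries a (1 − 1/e) approximation guarantee for monotone submodular coverage objectives; it is not tied to the specific objectives of [11] or [102].

    def greedy_max_coverage(items, covers, k):
        """Pick up to k items greedily to maximize the number of covered elements.

        items: candidate summary components (e.g. documents, authors, key phrases)
        covers(item): set of elements (e.g. words) covered by the item
        """
        covered = set()
        chosen = []
        remaining = list(items)
        for _ in range(min(k, len(remaining))):
            best = max(remaining, key=lambda it: len(covers(it) - covered))
            if not covers(best) - covered:      # no marginal gain left, stop early
                break
            chosen.append(best)
            covered |= covers(best)
            remaining.remove(best)
        return chosen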

Ren et al [92] focus on contrastive themes summarization, quantifying the opposing viewpoints found in a set of opinionated documents. The main challenges of the task are the unknown number of topics and the unknown relationships among topics. To address these, they combine a nested Chinese restaurant process with a hierarchical non-parametric topic model. They extract a set of diverse and salient themes from documents and, based on the probabilistic distributions of the themes, they generate contrastive summaries incorporating divergence, diversity and relevance.

2.2.1 Event Detection

Many systems for online news and social media processing during periods of crisis focus in their initial stages on the detection of events. Event detection from online news streams represents a vibrant research area, incorporating diverse techniques from different fields such as machine learning, natural language processing, data mining, information extraction and retrieval, and text mining. Some disasters can be predicted in advance (or forewarned) up to a certain level of accuracy, based on meteorological, geographic, demographic, or other types of data [44]. In such cases alarming signals are raised before the actual event takes place, and immediate action is taken to minimize the impact of the event on the directly affected population. There are some other cases, however, when even though an event cannot be fully anticipated, it can still be forecasted by social media analysis, as for example strikes or mass protests [88]. Nevertheless, a large category of events, such as earthquakes or natural disasters, are unexpected events and cannot be predicted before they happen. Automatic detection methods are useful in finding critical information about disasters as quickly as it becomes available. This information is vital for populations in the affected areas, in addition to rescuers and people able to help, who have an urgent need for up-to-date information. Therefore, techniques for the automatic detection of both predicted and unexpected crisis events are in high demand nowadays to assist in making time-critical and potentially life-saving decisions.

Event detection has long been addressed in the Topic Detection and Tracking (TDT) program [5], a research initiative aimed at encouraging the development of tools for news monitoring from traditional media sources, and at keeping users updated about the latest news and developments of an event. TDT was made up of three tasks – story segmentation, topic detection and topic tracking – with the purpose of segmenting the news text into cohesive stories, detecting unforeseen events and tracking the development of a previously reported event. Three important sub-tasks make up the event detection phase: i) data preprocessing, ii) data representation, and iii) data organization. The data preprocessing step involves stop word removal, text stemming and tokenization. Data representation for event detection can be done either through term vectors (bag of words), or through named entity vectors. Term vectors contain non-zero entries for the terms which appear in documents, typically weighted using the classical tf.idf approach [99]. However, the term vector model is likely to suffer from the curse of dimensionality when the text is long. Moreover, the temporal, syntactic and semantic features of the text are lost. The named entity vector is an alternative representation, which tries to answer the 4W questions: who, what, when, and where [70]. In the vector space, the similarity between events is measured using the Euclidean distance, Pearson's correlation coefficient, the cosine similarity, or the more recently introduced Hellinger distance and clustering index.
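The tf.idf term-vector representation and the cosine similarity mentioned here can be sketched as follows; this is the standard textbook weighting, w(t, d) = tf(t, d) · log(N / df(t)), rather than any specific TDT system's variant.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Build tf.idf term vectors for a list of tokenized documents."""
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))   # document frequency per term
        vectors = []
        for d in docs:
            tf = Counter(d)
            vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse term-weight dictionaries."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0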

According to Atefeh et al [16], methods found in the literature for event detection can be classified according to the event type (specified or unspecified), the detection task (retrospective or new event detection), and the detection method (supervised or unsupervised). We present each category in turn below.

Specified vs. unspecified event detection

Specified Event Detection. Events can be either fully or partially specified, along with related content and metadata information such as the location, time or people involved in the event. Supervised event detection techniques usually exploit this metadata information inside a wide range of machine learning, data mining and text analysis techniques. Sakaki et al [98] detect hazards and crises such as earthquakes, typhoons and large traffic jams using spatial and temporal information. They formulate the event detection problem as a classification problem, and extract a set of statistical (the number of words), contextual (words surrounding user queries), and lexical (keywords in a tweet) features used to train a Support Vector Machines classifier for event detection. They perform this analysis on Twitter data, and manually label the training set with instances of events and non-events. Their experiments show that statistical features carry the most weight, and that combining them with the rest of the features yields only small improvements in classification performance. Controversial events that give rise to public debate are detected using supervised gradient boosted decision trees [83]. A diverse set of linguistic, structural, bursting, sentiment, and controversy features is used in ranking controversial-event snapshots. In addition, the importance and the number of entities are found useful in determining the relative importance of a snapshot. Other features used for event detection and entity extraction include the relative positional information and tf.idf frequencies of terms, part-of-speech tags and regular expressions. Metzler et al [66] retrieve a ranked list of historical event summaries in response to a user query based on term frequencies within the retrieved timespan. They perform query expansion with the top highest-weighted terms during a specific time interval, and rank summaries using a query likelihood scoring function with Dirichlet smoothing.

Unspecified Event Detection. Because no information about the event is known a priori, unknown events are usually detected by exploiting temporal patterns in the incoming document streams. News events of general interest exhibit a sudden and sharp increase in the usage of specific keywords. However, techniques for unspecified event detection need to discriminate between trivial incidents and events of major importance using scalable and efficient algorithms. In [100], after employing a Naive Bayes classifier to separate news from irrelevant information, the authors employ an online clustering algorithm based on tf.idf term vectors and cosine similarity to form clusters of news articles. The authors of [82] highlight the importance of named entities in improving the overall system accuracy. They identify named entities using the Stanford Named Entity Recognizer trained on conventional news corpora. Topical words, defined as words which are more popular than others with respect to an event, are extracted from news articles on the basis of their frequency and entropy, and divided into event clusters in a co-occurrence graph; this helps in tracking changes among events at different times [58]. Finally, continuous wavelet transformations localized in the time and frequency domain [25], and therefore able to track the development of a bursty event, have been combined with Latent Dirichlet Allocation [19] topic model inference in the task of unspecified event detection.

In the rest of this thesis we focus on developing summarization techniques for specified events, as defined by the TREC Temporal Summarization campaign. For more information about the test events, please see Chapter 3 and Appendix A.

Retrospective vs. new event detection

Retrospective Event Detection. Retrospective event detection aims to identify events from accumulated historical records, mentioned inside documents that have arrived in the past. Common methods involve the creation of clusters of documents using similar words, referring to similar groups of people, or occurring close to each other in time or space [44]. Zhao et al [113] combine clustering techniques and graph analysis to detect events. They construct graphs where each node represents an actor involved in the event and edges represent the flow of information between actors, by exploiting the social, textual and temporal characteristics of documents. Sayyadi et al [101] assume that there is a topical relationship between keywords co-occurring across documents. To this end, they build a graph over the documents based on word co-occurrence. Temporal and geographical distributions of space and time tags found in documents have been used as features in the clustering process, and in determining the type of an event. In [94], the authors extract named entities, calendar dates, part-of-speech tags and certain types of words and phrases that characterize an event. Afterwards, they proceed to classify the event type retrospectively using a latent variable model.

New Event Detection. New event detection techniques continuously monitor the incoming data stream for signals that may indicate emerging events in real time, such as breaking news or trending events. It is mostly regarded as a query-free retrieval task – since the event information is not known in advance, it cannot be formulated as an explicit query. Clustering techniques, although they present an inherent time lag, have been widely employed for the task [106]. In general, documents are processed sequentially, merging a document into an existing cluster if its similarity to that cluster exceeds a specific threshold, or creating a new cluster if the similarity falls below that threshold. In order to ensure that a text document contains information that has not been reported before, metrics such as the Hellinger similarity, the Kullback-Leibler divergence and the cosine similarity are used to compare the current content with previously reported facts. Online new event detection seeks to achieve low latency, in the sense that a new event has to be reported as soon as a document corresponding to the new event has been received by the event detection system. Most of the methods for new event detection rely on bursts of specific keywords that present a sharp increase in frequency as an event emerges. Bursty patterns are captured when the observed frequencies of these keywords are much higher than they used to be in fixed time interval windows in the past [84]. Other methods employed for the task rely on Locality Sensitive Hashing [81] for processing documents in a bounded time and space frame. Some detection methods use wavelet-based signals to find bursts in individual words, and compute their cross-correlation to differentiate trivial events from major ones [109].
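A minimal sketch of this threshold-based, single-pass online clustering is given below; the similarity function, the naive centroid update and the threshold value are generic illustrative choices, not those of any cited system.

    def online_cluster(documents, similarity, threshold=0.3):
        """Single-pass clustering: assign to the most similar cluster or open a new one.

        documents: dense document vectors arriving in temporal order
        similarity(doc, centroid): similarity score in [0, 1]
        Opening a new cluster signals a candidate new event.
        """
        clusters = []   # each cluster: {"centroid": vector, "members": [vectors]}
        for doc in documents:
            best, best_sim = None, 0.0
            for cl in clusters:
                s = similarity(doc, cl["centroid"])
                if s > best_sim:
                    best, best_sim = cl, s
            if best is not None and best_sim >= threshold:
                best["members"].append(doc)
                # naive centroid update; real systems use weighted or decayed updates
                best["centroid"] = [(c + d) / 2 for c, d in zip(best["centroid"], doc)]
            else:
                clusters.append({"centroid": list(doc), "members": [doc]})
        return clusters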

In the current work, our goal is to process documents in temporal order as they arrive, and to extract novel updates from news streams in a timely manner. Although we mainly focus on new event detection, in Chapter 4 we take a retrospective look at the limitations of event update identification algorithms. This is meant to offer us a deeper understanding of what makes a summarization system successful.

Supervised vs. unsupervised event detection

Unsupervised Event Detection. Most techniques for unsupervised event detection rely on clustering algorithms. Because the data is dynamically evolving and new events arise over time, these algorithms should not require any prior knowledge of the number of clusters (techniques such as k-means, k-median, k-medoid or other approaches based on expectation maximization [2] are therefore inappropriate). Furthermore, these clustering methods need to be efficient and highly scalable. Incremental clustering approaches [23] have been used for grouping continuously generated text. When the level of similarity between incoming text and existing clusters exceeds a specific threshold, the new text is considered similar and is merged into the closest cluster; otherwise it is considered a new event for which a new cluster is created. Features used in unsupervised event detection include tf.idf weighted term vectors computed over some time period, word frequency, word entropy and proper names. Cosine similarity is the most commonly used metric to compute distances between term vectors and the centres of clusters. Graph-based clustering algorithms, such as hierarchical divisive clustering approaches [58], have been proposed to divide topical words into event clusters. Graph partitioning techniques [109] have been used to form events by splitting a graph into subgraphs, where each subgraph corresponds to an event. However, hierarchical clustering algorithms are known not to scale well to large datasets, as they require a full similarity matrix. A minimal code sketch of such an incremental clustering scheme is given below.

Supervised Event Detection. Manually labelling data for supervised event detection is a labor-intensive and time-consuming task, more feasible for specified events than for unspecified events. Classification algorithms that have been employed for the task include Naive Bayes, Support Vector Machines and gradient boosted decision trees, typically trained on a small set of human-labelled data. Classifiers use a vast set of linguistic, structural, burst, and sentiment features. Often, relative positional information, POS tagging and named entity extraction features are also incorporated. However, the main drawback of supervised event detection approaches is that they assume a static environment: there is usually one classifier trained offline, on a small batch of manually labelled data, and used for detecting events either directly, or as a pre-processing step before performing clustering. When data arrives in streams in a continuously evolving environment, the classifier is prone to err as soon as a topic drift occurs. Incremental learning [46] and ensemble methods [53] have been used to account for unseen events in the training data and adapt to changes that may occur over time. In addition, semi-supervised learning approaches, which rely on a small amount of labelled data combined with a large amount of unlabelled data, have been proposed to train the event detection classifiers [116].
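The following Python sketch illustrates the incremental, threshold-based clustering scheme described above: each incoming document is represented as a term vector, compared against existing cluster centroids with cosine similarity, and either merged into the closest cluster or used to start a new cluster (a candidate new event). The threshold value and the running-mean centroid update are simplifying assumptions made for illustration, not the exact procedure of [23].

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two term vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

class IncrementalClusterer:
    """Single-pass clustering: merge a document into the closest cluster
    if its similarity to the centroid exceeds a threshold; otherwise
    start a new cluster, which corresponds to a candidate new event."""

    def __init__(self, similarity_threshold: float = 0.5):
        self.threshold = similarity_threshold
        self.centroids = []  # running centroid vector of each cluster
        self.sizes = []      # number of documents assigned to each cluster

    def add(self, doc_vector: np.ndarray) -> int:
        """Assign a document vector to a cluster and return the cluster id."""
        if self.centroids:
            similarities = [cosine(doc_vector, c) for c in self.centroids]
            best = int(np.argmax(similarities))
            if similarities[best] >= self.threshold:
                # Merge: update the running mean of the cluster centroid.
                n = self.sizes[best]
                self.centroids[best] = (self.centroids[best] * n + doc_vector) / (n + 1)
                self.sizes[best] += 1
                return best
        # No sufficiently similar cluster: treat the document as a new event.
        self.centroids.append(doc_vector.astype(float))
        self.sizes.append(1)
        return len(self.centroids) - 1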

Hybrid Event Detection. Often, a combination of unsupervised and supervised event detection approaches is employed [100]. Supervised classification techniques can be used as a first step in identifying relevant and important documents before clustering them. A trained classifier is expected to improve the efficiency of a summarization system by discriminating between events and non-events, and therefore reducing the amount of noisy data that goes as input into the clustering step. This approach is however sensitive to classification accuracy, and the cascading pipeline can lead to the loss of many important events before the clustering stage takes place. Conversely, other approaches perform the clustering first, and then classify whether each cluster contains relevant information about an event. As the clusters evolve over time, the features for the old clusters need to be periodically updated, and features for the newly formed clusters inferred.

In our approach to the task we employ supervised machine learning methods. In Chapter 7 we present a machine learned approach able to predict whether a sentence should be included in the event summary or skipped, based on a set of retrieval, centrality and event modelling features.


2.2.2 Event Tracking.

Event tracking studies how events evolve and unfold over space and time [5]. An event typically consists of multiple stages: warning, impact, emergency and recovery. The situation is monitored during the warning phase. The disaster is ongoing during the impact phase, while the immediate post-impact period, during which rescue and other emergency activities take place, is the emergency phase. Recovery is the period of returning to normal. Iyengar et al [45] extract a set of discriminative words as features and use them to train a Support Vector Machines classifier which, combined with a Hidden Markov model, is able to predict the phase of an event for an incoming message.

Often, events are made up of small-scale sub-events that take place as a crisis situation unfolds. Sub-event detection using Conditional Random Fields has been employed in [54]. Hua et al [43] group sets of words related to the event into clusters and use supervised learning to classify incoming messages. In addition, they perform geographical location estimation on the classified output. However, challenges in sub-event detection arise from the inadequate reporting of spatial and temporal information, or in cases when mundane events introduce a lot of noise, making it hard to distinguish important happenings from trivial ones.

2.2.3 Event Summarization.

Event summarization is particularly important in helping users deal with the information overload problem in times of crisis. A text-based representation of an evolving event can provide a brief digest of the core topics discussed in the set of input documents [75]. Most text summarization systems operate in an incremental and temporal fashion: given a set of documents the user has read, and a new set of documents to be summarized, the objective is to present the user with a summary of only the data the user has not already read [62, 14]. Purely content-based methods, such as the incremental or hierarchical clustering of text, have been widely used in addressing the task. Alternatively, regression-based combinations of features have been employed in summarizing events [39]. However, the relative importance of different sets of features, inspired primarily from batch summarization, is not yet well understood. In addition, novelty detection is an integral part of many summarization algorithms; it requires aggressive inter-sentence similarity computation, a procedure which scales poorly for large datasets.
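As a simple illustration of the inter-sentence novelty computation mentioned above, the sketch below emits a candidate sentence only if its maximum cosine similarity to previously emitted sentences stays below a threshold. The bag-of-words representation and the threshold value are illustrative assumptions, not a method taken from the cited work.

import math
from collections import Counter
from typing import List

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words sentence vectors."""
    overlap = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return overlap / norm if norm > 0 else 0.0

def novel_updates(candidates: List[str], novelty_threshold: float = 0.6) -> List[str]:
    """Keep a candidate sentence only if it is sufficiently dissimilar
    from every sentence already emitted into the summary."""
    emitted_vectors: List[Counter] = []
    summary: List[str] = []
    for sentence in candidates:
        vector = Counter(sentence.lower().split())
        if all(cosine_sim(vector, prev) < novelty_threshold for prev in emitted_vectors):
            summary.append(sentence)
            emitted_vectors.append(vector)
    return summary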

In the rest of this thesis we focus on tracking and summarization techniques for news events. We include an analysis of temporal summarization methods, and demonstrate the limitations and potentials of previous approaches by examining the retrievability and the centrality of event updates, as well as the existence of intrinsic inherent characteristics in update versus non-update sentences.


Chapter 3

Task, Datasets and Evaluation Metrics

In this chapter we present the temporal summarization task as defined by the TREC Temporal Summarization track [13, 15, 14], which aims to provide an evaluation testbed for developing methods for the temporal summarization of large volumes of news streams. More precisely, we describe the goal of the task and the main challenges involved, the datasets provided, and the evaluation framework. In addition, we describe and motivate our method of choice when it comes to evaluating summaries of news articles.

3.1 TREC Temporal Summarization Task

The TREC Temporal Summarization (TS) task facilitates research in the monitoring and summarization of information associated with an event over time. It encourages the development of systems able to emit sentence updates over time, given as input a named crisis event identified through a query, a time period during which the event occurred, and a high volume stream of news articles concerning the event.

Figure 3.1: Example of a TREC topic description for the topic “2012 Buenos Aires Rail Disaster”.

According to Aslam et al [13], an event refers to a temporally acute topic which can be represented by the following set of fields: i) title - a short retrospective description of the event, specified as a string, ii) description - a retrospective free text event description, given as a URL, iii) start - the time when the system should start summarization, specified as a UNIX timestamp, iv) end - the time when the system should end summarization, also specified as a UNIX timestamp, v) query - a keyword representation of the event, given as a string, and finally vi) type - the event type. In Figure 3.1 we include a typical TREC TS event description; please check Appendix A for an overview of all topics.
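To make the field layout above concrete, the snippet below shows a hypothetical topic representation for the event in Figure 3.1. The field names follow the description above, but the specific values (timestamps, URL and query string) are placeholders of ours, not the content of the official topic file.

# Hypothetical TREC TS topic representation, for illustration only;
# the timestamps, URL and query string below are made-up placeholder values.
example_topic = {
    "title": "2012 Buenos Aires Rail Disaster",              # short retrospective description
    "description": "http://example.org/event-description",   # free-text description, given as a URL
    "start": 1330000000,                                      # UNIX timestamp: start of summarization
    "end": 1330900000,                                        # UNIX timestamp: end of summarization
    "query": "buenos aires train crash",                      # keyword representation of the event
    "type": "accident",                                       # event type
}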

For a specified event, systems are required to emit a series of sentence updates that cover all the essential information regarding the event throughout its entire duration. TREC TS focuses on large events with a wide impact, such as natural catastrophes, disasters, protests or accidents. In Table 3.1 we present an overview of the distribution of TREC TS event types for the past three years of the competition. Participants are provided with very high-volume streams of news articles and blog posts crawled from the Web, but only a small portion of the stream is relevant to the event. In addition to filtering out the irrelevant content, the stream of documents from the tracking period for each event must be processed in temporal order. Therefore, the set of sentences emitted should form a chronological summary of the respective event over time.

Table 3.1: Event type statistics for the testing events inside the TREC TS 2013, 2014 and 2015 datasets.

Event Type     TREC TS 2013   TREC TS 2014   TREC TS 2015
accident             2              1              7
bombing              1              1              8
conflict             –              –              1
earthquake           1              –              2
hostage              –              1              –
impact event         –              1              –
protest              –              6              2
riot                 –              1              –
storm                4              3              1
shooting             2              1              –
Total               10             15             21

An optimal summary covers all the essential information about the event with no redundancy, and each new piece of information is added to the summary as soon as it becomes available. The automatic summarization of long-running events from news streams poses a number of challenges that emerge from the dynamic nature of the corpus. First, a long-running event can contain hundreds or thousands of unique nuggets of information to be summarized, spread out across the entire lifetime of the event. Second, the information reported about the event can rapidly become outdated, and it is often highly redundant. Therefore, there is a need for online algorithms that extract significant and novel event updates in a timely manner. Third, a temporal summarization system should be able to adjust the volume of content issued as updates over time with respect to the prevalence and novelty of discussions about the event. Lastly, in contrast to classic summarization and timeline generation tasks, this analysis is expected to be carried out online, as documents flow into the system in a streaming scenario.

Participant systems are required to emit a series of sentences extracted from the corpus of relevant documents for each event. When a sentence is emitted, the time of the underlying document must be recorded. If the emit/ignore sentence decisions are made immediately on a per-sentence basis, the timestamp will correspond to the crawl time of the document. Otherwise, if the emission of a sentence is delayed so as to collect more information about whether or not to issue an update, the timestamp recorded will reflect the additional latency. Incorporating information which is not part of the KBA corpus is allowed, as long as this external information has been created prior to the event start time, or the source of this information is perfectly time-aligned with the KBA corpus of documents (such that no information from the future is considered). Therefore, any external document created during or after the event end time is considered information from the future and is implicitly excluded from being used for training the summarization system.

3.2 Datasets

The TREC Temporal Summarization track uses documents from the TREC KBA 2014 Stream Corpus. The corpus (4.5 TB) contains timestamped documents from a variety of news and social media sources, and spans the time period October 2011 - April 2013. Each document inside the KBA Stream Corpus is assigned a unique identifier and a timestamp representing the time when the respective document was crawled. A document typically includes the original HTML source of the document, and a set of zero or more sentences extracted by the organizers; each sentence inside the document is identified by the positional index of its occurrence inside the document. However, the extraction of sentences from documents was done using rudimentary heuristics, occasionally producing sentences of several hundred words in cases when no punctuation marks were present (for example, such sentences can include entire paragraphs, tables or navigational labels). A sentence update inside a document is specified by the distinct combination made up of the identifier of the document which contains the sentence, and an integer value designating the index of the sentence inside the document (i.e. the sentence identifier).
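A minimal sketch of this update representation is given below, under the assumption that per-sentence decisions are made immediately, so the decision timestamp defaults to the document crawl time plus any deliberate latency. The class and field names are our own and are not prescribed by the track guidelines.

from dataclasses import dataclass

@dataclass(frozen=True)
class SentenceUpdate:
    """An emitted update: a specific sentence inside a specific document."""
    document_id: str      # unique identifier of the KBA document
    sentence_index: int   # positional index of the sentence within the document
    decision_time: float  # UNIX timestamp at which the emit decision was made

def emit_update(document_id: str, sentence_index: int,
                crawl_time: float, latency: float = 0.0) -> SentenceUpdate:
    # If the decision is delayed to gather more evidence, the recorded
    # timestamp reflects the additional latency beyond the crawl time.
    return SentenceUpdate(document_id, sentence_index, crawl_time + latency)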

Organizers made available different versions of the TREC KBA Stream Corpus to participants, depending on their interests in the competition. In the first edition of the TREC Temporal Summarization task in 2013, they were provided with the entire version of the corpus, containing high-volume streams of news articles and blog posts collected from various online sources. Since only a very small portion of the stream was relevant to the set of test events, the corpus had to be aggressively filtered to remove irrelevant content. Sentence extracts would then be selected from the relevant documents identified, and returned to the user as updates describing an event over time. One year later, in 2014, organizers made available a pre-filtered version of the TREC KBA Stream Corpus (559 GB), containing documents from the time range of the event topics, which is more likely to include documents with relevant sentences. The irrelevant content would still have to be filtered, and the relevant documents processed in temporal order to select sentence updates that summarize the event. In addition to these two datasets, in 2015 the organizers released a much lower volume stream of on-topic documents for a given set of events. Accounting for the datasets from the past two years, in 2015 participants can run experiments on three distinct test sets. Each of these three datasets corresponds to a specific sub-task: Full Filtering and Summarization (based on the entire TREC KBA Stream Corpus), Partial Filtering and Summarization (based on the pre-filtered version of the TREC KBA Stream Corpus), and Summarization Only (based on the filtered version of the TREC KBA Stream Corpus).

Given the corpus of documents, participant systems are required to produce temporal summaries for distinct sets of crisis events every year. In Table 3.2 and Table 3.3 we present the test topics for TREC TS 2013, together with statistics regarding the number of updates annotated by human assessors and the number of essential pieces of information describing each event (also called nuggets). Since not all documents and updates mentioned in the gold standard set for the TREC TS 2013 collection exist in the database of indexed documents, we also include statistics about the number of indexed documents, nuggets and updates that we can effectively retrieve from the database. Following the same pattern, in Table 3.4 and Table 3.5 we present information regarding the number of pooled, relevant and indexed updates, documents and nuggets for the TREC TS 2014 collection. Moreover, in Table 3.6 we include statistics regarding the number of nuggets, pooled and relevant updates for the events inside the TREC TS 2015 collection. Since the gold standard set for TREC TS 2015 had not been released at the time of writing of this thesis, in this last table we restrict ourselves to information as reported by the TREC TS organizers [14]. For this reason, in the rest of the thesis we also refrain from running experiments on the TREC TS 2015 dataset, considering that we would not be able to perform any evaluation on this data in the absence of the annotated gold standard sentence updates.

Table 3.2: TREC Temporal Summarization 2013 topics with gold nuggets, pooled and relevant updates for event-nugget matching.

Event Id   Event Title                           Nuggets   Pooled Updates   Relevant Updates
1          2012 Buenos Aires Rail Disaster          56          N/A               426
2          2012 Pakistan garment factory fires      89          N/A               372
3          2012 Aurora shooting                    139          N/A               210
4          Wisconsin Sikh temple shooting           97          N/A               406
5          Hurricane Isaac 2012                    108          N/A                81
6          Hurricane Sandy                         418          N/A               485
7          June 2012 North American derecho         91          N/A                 0
8          Typhoon Bopha                            88          N/A               172
9          2012 Guatemala earthquake                45          N/A               166
10         2012 Tel Aviv bus bombing                37          N/A               281

3.3 Evaluation Metrics

TREC Temporal Summarization runs are evaluated by taking into consideration the relevance, coverage, novelty, and latency of the updates. To account for each of these aspects, the organizers of the track have defined custom evaluation metrics. The evaluation process is centered around the concept of information nuggets, defined as fine-grained atomic pieces of content that capture all the essential information a good summary should contain. As events are composed of a variety of sub-events, a nugget typically corresponds to a sub-event that an ideal system would broadcast to the end user.
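As a simplified illustration of nugget-based evaluation (and not the official track metrics), the sketch below computes the fraction of gold nuggets covered by a system's emitted updates, given a mapping from updates to the nuggets they match; the matching itself is assumed to come from human assessors or an automatic matcher.

from typing import Dict, Iterable, Set

def nugget_coverage(emitted_update_ids: Iterable[str],
                    update_to_nuggets: Dict[str, Set[str]],
                    all_nuggets: Set[str]) -> float:
    """Fraction of gold nuggets covered by the emitted updates.
    A simplified recall-style measure, not the official TREC TS
    gain- and latency-based metrics."""
    covered = set()
    for update_id in emitted_update_ids:
        covered |= update_to_nuggets.get(update_id, set())
    return len(covered & all_nuggets) / len(all_nuggets) if all_nuggets else 0.0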
