
NACK: A Mashup that Facilitates Objective Analysis of News Events

Merijn van Wouden 6306632 / 10008519

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. F.M. Nack
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 27th, 2014


Abstract

We propose a system that aids the reader in obtaining an objective viewpoint on the news, effacing the influence of bias in journalism. The system clusters the latest articles from various news sources by subject and presents them grouped in an online interface. Alongside each article are icons that indicate the sentiment of its language and the comprehension level needed to understand it; the user can apply selective filters to both.

The results from a usability survey were promising; a considerable part of the human test panel could benefit from the system’s rendition of the news. Another considerable group was sceptical and raised well-founded criticism of the sentiment classification in particular.

Contents

1 Introduction
1.1 Background
1.1.1 Sentiment in Journalism
1.1.2 Structure
1.2 Related work
1.2.1 Sentiment Analysis
1.2.2 Text Clustering
1.3 My Approach
2 System Architecture
2.1 Crawler
2.2 Clustering
2.2.1 Vectorization
2.2.2 New Algorithm
2.3 Sentiment Analysis
2.4 Complexity Analysis
2.5 Interface
2.6 Failed Attempt
3 Evaluation
3.1 Usability Survey
3.1.1 Test Panel’s Distribution
3.1.2 Ratings
3.2 Accuracy
3.2.1 Sentiment
3.2.2 Clustering
4 Conclusions and Discussion
References

1 Introduction

1.1 Background

News journals are usually not entirely objective, or leave out information that another newspaper does cover in a report on the same subject. Newspaper journalists and reporters tell stories according to their own biased view of the world, or even embellish a story to make it more interesting. While this may be just what most readers are looking for, an involved news reader forms an objective opinion based on several sources. Some of those readers subscribe to several newspapers and arrange them side by side on the table to compare them (the supervisor of this project being one of this kind). Naturally, this task can be made easier by providing the reader with a mashup of the articles from all desired news sources, grouped into clusters based on the subject they cover. Further improvements would be to provide metadata about each article, such as the political orientation of the author, the sentiment of the article and the comprehension level required by its use of language.

These involved news readers can now add a new medium to their toolbox, as the system described above has been implemented and made accessible online. After creating this news comparison tool (from now on referred to as NACK: News Analysis and Clustering Kit), a survey was conducted among a selection of people to determine to what extent NACK can be a valuable addition to the reader’s daily routine. While the part classifying an article’s complexity level was implemented by my fellow student Daniëlle Tump for her own bachelor thesis [1], I have taken on the other aspects of NACK’s architecture, which are described in this paper.

1.1.1 Sentiment in Journalism

This paper mainly focuses on whether knowledge of an article’s sentiment (positive/neutral/negative) can be an aid to the reader, or instead distract him from forming an objective viewpoint. It is important to note here that the proportion in which we process good versus bad news already interferes with our forming of an objective opinion in the first place. The negativity bias in human psychology causes us to have a better memory for bad news than for good news [2]; and according to Psychology Today, studies have shown that “Bad news far outweighs good news by as much as seventeen negative news reports for every one good news report” [3]. Taking this into consideration, it is no surprise that a Google query for ‘positive news’ shows numerous platforms trying to break from this norm by focusing on publishing mainly positive news. Furthermore, it has been observed that with the increased informational utility of news due to the emergence of new media channels, selective exposure to both negative and positive news has also increased [4]. These findings show that there is a need among news readers to discriminate between positive and negative articles, and thus make a selection based on personal preference. This discrimination, however, suffers from the fact that everyone has their own opinion; an article can mean good news to one person but bad news to another. This is why the sentiment of an article’s word usage is used as the indicative feature in NACK. The supposition is that this classification will, most of the time, agree with the reader’s experience: articles containing many words such as massacre or crisis will likely mean negative news to the reader, and likewise a majority of positive words will indicate a positive experience.


1.1.2 Structure

In this paper, the several aspects of NACK are discussed: crawling online for news articles, clustering, sentiment analysis and the interface. Then, NACK is evaluated with an accuracy measurement and a survey conducted among a group of people. Finally, possible other applications of NACK and future work are discussed.

We encourage the reader to examine NACK at http://niwz.org.

1.2 Related work

The literature relevant to this project is divided into sentiment analysis and text clustering.

1.2.1 Sentiment Analysis

Godbole et al. [5] presented a system that assigns scores to English news articles indicating whether an article is positive or negative. They created their own dictionaries of words together with their indicative sentiment, every news category having its own dictionary. Their system processed large amounts of documents, forcing design choices that boosted the algorithm’s speed at the cost of accuracy.

Baccianella et al. [6] introduced SentiWordNet, a lexical resource specifically for sentiment analysis of English language. It consists of lexical structures that are automatically extracted from sentimentally annotated sources.

Socher et al. [7] created a sentiment database of English phrases from parse trees labeled with their sentiment (using human annotation via Amazon’s Mechanical Turk). It pushed the state of the art in single-sentence positive/negative classification from a previously achieved accuracy of 80% up to approximately 85%, and can accurately capture the effects of negation.

1.2.2 Text Clustering

Matching articles from different news sources requires text clustering techniques. This can be done in several ways; the traditional method is to use a bag of words and subject it to statistical analysis. Hotho et al. [8] proposed a method that combines statistical text clustering with an ontological database, which resulted in improved accuracy for the categorisation of news articles in particular.

Liu et al. [9] show that feature selection improves text clustering and discuss which features to extract.


1.3 My Approach

NACK combines the findings of the studies above. For sentiment analysis, SentiWordNet is used together with a Naive Bayes classifier. The sentiment classifier in Stanford’s NLP package [7] has been shown to be the best performing one to date, but it would require an interface to a different programming language and therefore more time to implement. This was not done during this project and remains a later improvement of NACK.

Clustering is done by a newly invented clustering algorithm. Inspired by Hotho et al.’s findings, the input for this algorithm is combined with an ontological database (DBpedia). The extra features that Liu et al. propose are also considered.

Work has also been done on discriminating between political preferences: whether an article is right-wing or left-wing. This did not yield satisfactory results; the reason is explained at the end of the System Architecture section.

Research Question

The main question to be answered within this paper is how knowledge of an article’s sentiment can aid the reader to form a more objective viewpoint on the news. In order to evaluate this question with a survey among a group of people, NACK should satisfy a few requirements so that the test panel produces reliable feedback. These are listed below.

System Requirements

1. NACK must have a user-friendly interface. In order to always show the latest news, it must update its articles at a regular interval.

2. The user must find comparing articles from different newspapers easier than doing so the ‘traditional way’.

3. The accuracy of both the clustering on subject and the sentiment analysis must be at least 80%. An accuracy of 80% is generally easy to reach with classification algorithms; pushing the accuracy beyond this point usually requires significantly more effort.

2 System Architecture

NACK is implemented in a PHP environment on a Linux server. The complexity analysis was written in Python, so a communication interface was built between these languages to connect the parts. NACK is available on http://niwz.org (N!WZ is an instance of NACK, but the system can be duplicated for any platform, hence the name discrepancy).

Every hour, NACK updates its database by collecting the latest online news articles, followed by all the text processing operations required for producing the final display shown to the visitor. The sequence of operations performed during this periodical update is described below.

2.1 Crawler

NACK’s crawler (which crawls online sources to collect articles) is fed by a set of RSS feeds of the world-news sections of English-language newspapers. Feeds can be added to the set through NACK’s administrative interface. Upon every periodical update, the crawler iterates through the feeds’ XML pages, checking whether new entries have been added to each feed. New entries are added to the database; the title, URL, post date and short summary can usually be extracted from the XML. This is however not enough: NACK requires the full article body for the text processing parts, and for the user, direct access to the full article is convenient. Thus, the crawler downloads the article’s full HTML page from the newspaper’s website and then parses its structure. Every website has a different HTML template. The crawler deals with this problem as follows:

(a) If a newspaper is one of the newspapers for which a custom function has been written to extract the article’s body text, this function is applied.

(b) Otherwise, a general function is applied, which searches for the longest connected piece of text present in the HTML document. Since even the body text is usually chopped up into separate HTML structures (for instance, one will find a structure as shown in figure 1), all text inside the descendants of a single parent is regarded as one connected piece of text, but only if all these descendants have the same parent-child structure from their mutual ancestor.

<article>
  <p>This is</p>
  <p>the article</p>
  <p>structure.</p>
</article>

Figure 1: Example of an article’s body HTML

This way, adding an extra RSS feed to NACK usually works without having to write custom code. In addition to the article’s body text, the content of the <meta name="keywords"> tag is saved into the database, as these are (usually) manually inserted descriptive keywords most relevant to the subject, and therefore form the most useful input for the clustering algorithm.
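As an illustration, the longest-connected-text heuristic of step (b) could look like the sketch below. Python and xml.etree are illustrative choices here, not NACK’s actual PHP implementation, and the check that all descendants share the same parent-child structure is omitted for brevity:

```python
import xml.etree.ElementTree as ET

def longest_connected_text(markup):
    # For every parent element, join the text of its direct children;
    # the parent whose joined text is longest is taken as the article body.
    root = ET.fromstring(markup)
    best = ""
    for parent in root.iter():
        joined = " ".join((child.text or "").strip() for child in parent)
        if len(joined) > len(best):
            best = joined
    return best

html = "<article><p>This is</p><p>the article</p><p>structure.</p></article>"
print(longest_connected_text(html))  # This is the article structure.
```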


Figure 2: Vectorizing articles before being clustered

2.2 Clustering

The next step is to cluster the articles in the database. The latest 250 entries are considered; older ones are disregarded.

2.2.1 Vectorization

Every clustering algorithm requires the articles to be represented as vectors, making it possible to mutually compare each of them. The schema in figure 2 shows the steps in this process:

1. The first step breaks the articles apart into bags of words, and reduces every word to its stem using the Porter stemmer [10], which is a mild stemmer. Most other stemming algorithms, such as the Lancaster stemmer, are too aggressive for this task (they stem different words in such a way that they are unified to the same stem). These word sets are saved as arrays in which each word’s count equals its number of occurrences.

2. Then, each article’s title and keywords (from the meta-tag in the HTML head) are added to the corresponding arrays, again increasing the counts for each word, but with a greater increment than before (as these words are more descriptive of the article).

3. Next, a vector containing all words from every article is built. Every word is subjected to the following steps:

(a) The first step follows Hotho et al.’s findings on combining ontological information with clustering [8]. NACK examines whether the word is an encyclopedia concept by querying DBpedia (a structured database derived from Wikipedia [11]) using the SPARQL query language. If the word is present in DBpedia, its weight is increased, reflecting that an encyclopedia entry is more descriptive of an article than other words are. The amount of weight increase depends on the DBpedia category to which the word belongs. For instance, the ontology categories Person and Surname increase the weight by a factor of 10, while the category PopulatedPlace even increases it by a factor of 15. This resulted in a higher accuracy. Changing the weights per category influenced the accuracy, and these weights were manually optimized. Naturally, once a word has been queried via SPARQL, the result is cached in NACK’s database to speed up future execution.


(b) If a word was not found in DBpedia but is present in our own dictionary of English words, its weight is divided by three (an optimized parameter value), marking these words as less important.

(c) The (few) remaining words remain untouched. They may be incorrectly spelled or gibberish, but they can also be an important descriptive concept that happened not to be present in DBpedia.
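The weighting scheme of steps 1–3 can be sketched as follows. The factors 10, 15 and the division by three come from the text above; the title/keyword increment, the function name and the dictionary handling are hypothetical simplifications, and stemming and the SPARQL lookup are omitted:

```python
from collections import Counter

DBPEDIA_BOOST = {"Person": 10, "Surname": 10, "PopulatedPlace": 15}
EXTRA_WEIGHT = 3  # hypothetical increment for title/keyword words

def weigh_article(body_words, title_words, dbpedia_category, dictionary):
    # Steps 1-2: word counts, with title/keyword words counted extra.
    counts = Counter(body_words)
    for w in title_words:
        counts[w] += EXTRA_WEIGHT
    # Step 3: boost DBpedia concepts, dampen ordinary dictionary words.
    weights = {}
    for word, count in counts.items():
        if word in dbpedia_category:
            weights[word] = count * DBPEDIA_BOOST.get(dbpedia_category[word], 1)
        elif word in dictionary:
            weights[word] = count / 3
        else:
            weights[word] = count  # unknown word: left untouched
    return weights

w = weigh_article(["obama", "visits", "paris", "paris"], ["obama"],
                  {"obama": "Person", "paris": "PopulatedPlace"}, {"visits"})
# obama: (1+3)*10 = 40, paris: 2*15 = 30, visits: 1/3
```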

Figure 3: dendrogram from hierarchical clustering

Figure 4: problem with the data set

2.2.2 New Algorithm

In most applications, hierarchical clustering gives the best results for grouping text [12]. A hierarchical clustering algorithm builds a dendrogram in which the clusters are split at certain distance thresholds. See figure 3; depending on which threshold is best, one makes a cut at a certain point to extract the breakdown into clusters, which for the thick line in this example would be {a,b},{c},{d},{e,f,g},{h}.

There is a problem with applying this method to our data: some of our clusters are sparse and cover a large area of the data space, whereas other clusters have a high density and lie close together. Consider figure 4: in (a), the threshold is high, so the maximum distance between the cluster centroids is too large, which allows the 2s and 3s to be incorrectly grouped together. In (b), this maximum distance is too small, so the 1s are divided into separate groups, which is also wrong.

This is why a new algorithm is proposed for NACK’s purpose. Hierarchical clustering generally has a complexity of O(n³); the proposed new clustering algorithm has a complexity of O(n²) and can be summarized by the pseudocode in figure 5. The premise underpinning this algorithm is that a news platform does not publish the exact same news event twice. This means that a cluster will not contain more news reports than the number of news sources in use. This number will be referred to as X. Clustering starts with a mild threshold, allowing large clusters to emerge. Clusters that contain more articles than X (in the code: FOR ALL CLUSTERS LARGER THAN X ENTRIES) are ungrouped and recursively reconsidered by the algorithm, but with a lower threshold


(hence, smaller clusters are formed). These articles are grouped with other clusters where possible, or else form their own cluster. The recursion finishes once no cluster has more than X entries.

This algorithm has been shown to give better results on our data set.

FUNCTION CLUSTER(vectors, threshold, clusters = {empty set})
    FOREACH(vectors as vector)
        IF(FIND(cluster in clusters where distance to vector <= threshold))
            THEN add vector to cluster
        ELSE create a new cluster from vector
    END
    FOR ALL CLUSTERS LARGER THAN X ENTRIES:
        BREAK APART INTO VECTORS AND ADD TO newvectors
    threshold -= Y
    clusters = CLUSTER(newvectors, threshold, clusters)
    RETURN clusters
END

Figure 5: new clustering algorithm; X, Y and the initial threshold are variable parameters
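Figure 5 can be read as the following Python sketch, shown with one-dimensional points and absolute distance for brevity. The centroid-based distance, the default parameter values and the threshold guard are illustrative assumptions rather than NACK’s actual implementation:

```python
def centroid(cluster):
    return sum(cluster) / len(cluster)

def cluster_articles(vectors, threshold, X=3, Y=0.5, clusters=None):
    # Greedy pass: join the first cluster within the threshold,
    # otherwise start a new one.
    if clusters is None:
        clusters = []
    for v in vectors:
        for c in clusters:
            if abs(centroid(c) - v) <= threshold:
                c.append(v)
                break
        else:
            clusters.append([v])
    # Clusters larger than X (the number of news sources) are broken
    # apart and re-clustered with a tighter threshold.
    oversized = [c for c in clusters if len(c) > X]
    if oversized and threshold - Y > 0:
        kept = [c for c in clusters if len(c) <= X]
        leftover = [v for c in oversized for v in c]
        return cluster_articles(leftover, threshold - Y, X, Y, kept)
    return clusters

print(cluster_articles([1.0, 1.1, 1.2, 5.0, 5.1], threshold=1.0))
# [[1.0, 1.1, 1.2], [5.0, 5.1]]
```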

2.3 Sentiment Analysis

The sentiment analysis is done with a Naive Bayes classifier. A Naive Bayes algorithm assumes that each word is statistically independent of every other word. While this is of course not the case for natural language (grass is more often preceded by green than by red), this naive assumption has been shown to work well for language classification tasks.

The classifier uses SentiWordNet [6], a database of words categorized by their indicative sentiment, and takes as input the article as a bag of words. Simple negation structures are also considered: a negating structure negates the sentiment value of a word. The three-dimensional output (positivity, neutrality and negativity) is then transformed into a one-dimensional value (-1 to 1) and stored in the database for later display to the user; negativity and positivity push the value toward either limit, while neutrality pushes it toward 0.
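A minimal sketch of this scoring, assuming a lexicon that maps each word to (positivity, negativity) values; the single-word negation rule and the averaging into [-1, 1] are simplified stand-ins for NACK’s actual rules:

```python
NEGATORS = {"not", "no", "never"}

def sentiment_score(words, lexicon):
    # Sum per-word (pos - neg), flipping the sign of the word that
    # follows a negator, then average so neutral words pull toward 0.
    score, flip = 0.0, 1
    for w in words:
        if w in NEGATORS:
            flip = -1
            continue
        pos, neg = lexicon.get(w, (0.0, 0.0))
        score += flip * (pos - neg)
        flip = 1
    if not words:
        return 0.0
    return max(-1.0, min(1.0, score / len(words)))

print(sentiment_score(["not", "good"], {"good": (0.8, 0.0)}))  # -0.4
```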

2.4 Complexity Analysis

The complexity analysis was done by D. Tump and can be found in her paper about NACK [1].


2.5 Interface

While the above steps are all executed periodically, the interface (website frontend) is constantly available online; it can be seen in figures 6 and 7. Time has been spent on making the interface user-friendly, as this is one of our system requirements.

Basic functionalities that usually come with RSS readers are implemented: marking articles as read or starred, and associating tags with sources (done here through the administrative interface). Currently, the tags connected to sources are newspaper-dependent and either right-wing (e.g. Washington Post), left-wing (e.g. The Guardian) or neutral (e.g. Reuters). Every tag has a color assigned to it, and the proportion in which each tag is present in a cluster determines the color of the little square in its top-right corner (the colors are mixed). A user can see the sentiment and complexity rates assigned to each article by their icons (emoticons and eyeglasses respectively), and can filter on them to receive a selective set of results. A cluster’s visual size in the interface is relative to the number of news entries it contains.

For every cluster, NACK finds the country that its news articles are about, so that it can add an image to the cluster (the database stores an image for every country). Finding the country is done by searching for the most prevalent country name in the texts. A custom stemming algorithm was made to unify words such as China and Chinese (both resulting in chin@, so as not to collide with occurrences of chin), and to map irregular nationalities to country names (so that e.g. Dutch, Holland and Netherlands are mapped to the same stem). The country with the most occurrences decides the image. I chose to save only a single image per country, so that a user who repeatedly visits NACK will associate each image with the country it belongs to. This helps the user quickly recognize where a news event has taken place.
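Such a country stemmer could be sketched as below; only the China/Chinese and Dutch/Holland/Netherlands examples come from the text, so the suffix list and the irregular-form table shown are hypothetical:

```python
IRREGULAR = {"dutch": "netherlands", "holland": "netherlands"}  # assumed table
SUFFIXES = ("ese", "a")  # assumed nationality/country suffixes

def country_stem(word):
    # Map irregular forms first, then strip a suffix and append '@'
    # so the stem cannot collide with an ordinary word (chin@ vs. chin).
    w = IRREGULAR.get(word.lower(), word.lower())
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)] + "@"
    return w + "@"

print(country_stem("China"), country_stem("Chinese"))  # chin@ chin@
```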

There are two reasons why NACK does not grab an image from the news articles themselves: firstly, there is a high chance of receiving a wrong, irrelevant image (such as an advert); secondly, it would be an image from one single news source, while part of our goal is to prevent any one news source from playing a dominant role in the presentation towards the user.


Figure 7: NACK’s web interface, display of a cluster’s contents

2.6 Failed Attempt

It is worth noting the attempts made while implementing and testing new methods that did not turn out as expected or desired.

Distinguishing Between Political Preferences

In the current state of NACK’s implementation, the labels indicating right-wing, left-wing or political neutrality depend solely on the general political orientation that a news source is known to have. However, an attempt was made to create a dictionary of words indicating political preference, by which articles would have been classified in the same manner as the sentiment analysis: using a Naive Bayes classifier. From each group, on one hand the newspapers generally known as left-wing oriented, and on the other hand the newspapers known as right-wing, 10,000 articles were downloaded from the LexisNexis portal. The assumption was that the language used in a newspaper known to have a certain political preference consists of a vocabulary that is specific to, and can therefore be reduced to, the political view in question. However, once the vocabularies were extracted and given to the classifier, its decisions were close to, if not completely, statistically random. The test set consisted of 30 manually annotated news articles about the crisis in Ukraine during spring 2014. Adding other features to the classification input, such as the lengths of sentences and words, did not improve the results. The conclusion we can draw is that political preference is a complex property of text and cannot simply be deduced from its vocabulary. Future attempts could use bigrams or parse sentence structure as a tree, but it seems that political preference is an aspect contained in the semantics of a full article, and the effort required to build an automated classifier for it will therefore be considerable.

3 Evaluation

3.1 Usability Survey

The first and second system requirements concern NACK’s usability and user-friendliness. These qualities were determined by conducting a survey among a group of human testers. We asked the test panel questions about the layout and about their opinion of NACK’s way of presenting news.

3.1.1 Test Panel’s Distribution

The test panel consisted of 50 people whom the author and Daniëlle Tump know personally, half of whom are women and the other half men. The average age was 24 years, with a standard deviation of 9 years. Half of the group focused specifically on the sentiment indicators (they did not see the complexity indicators); the other half likewise focused on the complexity indicators.

Figure 8 shows how frequently the participants inform themselves through news channels. The number of news channels they consult is shown in figure 9. As one might expect, the two were correlated. Participants who consult five or more news sources do so on a daily basis, and this part of the panel (20% of all participants) is NACK’s most suitable target audience. As a side note, the data showed that women read the news less frequently (once or a few times a week) than men (mostly daily), although it may be questioned whether there were enough measurements to draw this conclusion.

Figure 8: How often the participants read news

Figure 9: Number of consulted sources

3.1.2 Ratings

The sentiment indicators appeared correct most of the time to 40% of the participants, sometimes correct to 32%, usually incorrect to 16% and never correct to 12%. The indicators are thus indicated as correct more often than incorrect. However, when asked whether they are an improvement over the version without them, the majority answered no (40%; maybe, 32%; yes, 28%). These results are not correlated, so the accuracy of the sentiment indicators did not influence the participants’ opinions on their usefulness.

Negative criticism of the sentiment icons usually concerned the fact that they push the reader to read the news in a certain way, influence the reader before he has made a judgement by himself, make the reader look at the world through a filtered glass and thereby even prevent the


reader from forming an objective opinion; thus the sentiment indicators are counter-effective for these participants.

Positive criticism usually came from participants who are generally annoyed by the prevalent focus on negativity in the news; using NACK, they can filter this negativity away. The positive criticism consisted of weaker and shorter argumentation than the negative.

The overall layout and interface of the website were also judged by the test panel. 24% of them found the interface intuitive and had no problems using it, 42% thought the layout was passable yet could use some improvement, and 34% found the interface too chaotic. Some remarks were made on the choice of colors and other personal preferences; these will be considered in the next version of NACK, which is currently being revised by a visual web designer.

3.2 Accuracy

3.2.1 Sentiment

Although the participants of the survey were told to focus their attention primarily on usability and user-friendliness, the scores by which they rated the correctness of NACK’s classifications suggest that the system requirement concerning sentiment accuracy was not achieved; only 40% of the participants calling the indicators correct most of the time does not come close to the intended minimum of 80%. However, the hypothesis is that a user’s memory keeps better track of NACK’s incorrect classifications than of the correct ones. For example, a single article about bloody warfare falsely classified as positive will likely be seen as a serious mistake, damaging the user’s opinion of NACK to such an extent that it is not simply compensated by numerous correct classifications. If this assumption holds, the actual accuracy can be far higher than the survey results suggest.

To obtain a more reliable measure of the accuracy, the author classified 100 articles without seeing NACK’s classification beforehand. These manually acquired results were then compared to NACK’s automated classification of the same set of articles. The manual classification used the same scale of 5 possible values that NACK uses, 1 being very negative and 5 very positive. The accuracy could then be calculated from the distance between the two classifications, and came to 76%. This is indeed better than the score suggested by the test panel’s opinions, yet worse than planned in the system requirement. Most of NACK’s mistakes occurred with articles that were manually classified as neutral; NACK often classified these as one of the extremes. Making NACK classify articles as neutral more often would thus benefit its accuracy.
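One plausible reading of this distance-based measurement, not necessarily the author’s exact formula, is a per-article score that decreases linearly with the distance between the two classifications:

```python
def agreement_accuracy(manual, automatic, scale_span=4):
    # Per article: 1.0 for an exact match, decreasing linearly to 0.0
    # at the maximum possible distance (|5 - 1| = 4 on the 5-point scale).
    assert len(manual) == len(automatic)
    scores = [1 - abs(m - a) / scale_span for m, a in zip(manual, automatic)]
    return sum(scores) / len(scores)

print(agreement_accuracy([3, 3, 5], [3, 1, 4]))  # 0.75
```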

3.2.2 Clustering

The clustering algorithm’s accuracy is measured as the ratio between completely correct clusters (homogeneous: every article is about the same subject) and wrong clusters (which cover 2 or more different subjects). This was counted for 100 clusters: 54 were correct, 46 were mixed up. The clustering accuracy is thus 54%, far below the requirement. If, however, a less strict benchmark is used in which a cluster containing both correctly and incorrectly grouped articles also adds to the accuracy, this value would be higher: up to 70–75%.

4 Conclusions and Discussion

A large group of people, 40%, does not want to know an article’s sentiment before reading it. Most of these people are part of our original target audience: involved news readers. They do not want to be influenced by a computerized classification suggesting whether a news event is good or bad before they have read it themselves. This means that the sentiment icons, contrary to their original purpose, counteract the forming of an objective opinion.

However, 30% of the people are enthusiastic about the sentiment icons for a different reason than what they were originally meant for; mostly, these users want to filter out bad news reports, which they think are too prevalent in the media. This is in line with the suggestions made in section 1.1.1 of the introduction.

Improvements

It would be worthwhile to improve the accuracy of both the sentiment analysis and the clustering; this would probably convince more people of NACK’s potential. An upgrade of the interface would also help (the most frequently heard complaint being that it is too chaotic).

The clustering algorithm can be improved by making the dimensions comparable according to the WordNet distance (the minimum number of steps through the WordNet ontology needed to reach the second concept from the first). This would group all similar concepts together, in contrast to the current situation where no semantic linkage is used. The same could be done with DBpedia instead of WordNet.

To push the accuracy of the sentiment classifier, it would be best to consider the sentiment classifier in Stanford’s NLP package [7]. This classifier uses sentences’ parse trees, which could also help to reconsider classification by political preference.

The user evaluation of NACK did not cover the complete system; the test panel did not get access to both the sentiment icons and the complexity icons at once. This combination could, however, enhance the user’s experience, so a future evaluation should consider it as well.

Other applications

NACK can, without many alterations, be applied to the domain of written literature; this was one of the suggestions that came out of the user evaluation. Books contain longer texts than news articles, so the complexity analysis would be more reliable, and its indicators would help the reader filter between difficult and easy-to-read books. The sentiment analysis, however, would blend all moments of positive and negative sentiment together, which can result in an unreliable indicator. This problem can be solved by showing how the sentiment changes over the course of the story. The clustering should be reconsidered as well, though it might already successfully group books by genre.


References

[1] Daniëlle Tump. News is my story. Bachelor thesis, University of Amsterdam, 2014.

[2] Roy F Baumeister, Ellen Bratslavsky, Catrin Finkenauer, and Kathleen D Vohs. Bad is stronger than good. Review of general psychology, 5(4):323, 2001.

[3] Ray B. Williams. Why we love bad news. Psychology Today, December 2010.

[4] Silvia Knobloch-Westerwick, Francesca Dillman Carpentier, and Andree Blumhoff. Selective exposure effects for positive and negative news: Testing the robustness of the informational utility model. Journalism & Mass Communication Quarterly, 82(1):181–195, 2005.

[5] Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2007.

[6] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC, 2010.

[7] Richard Socher, Alex Perelygin, Jean Y. Wu, et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, 2013.

[8] Andreas Hotho, Steffen Staab, and Gerd Stumme. Ontologies improve text document clustering. In Proc. of ICDM ’03, the 2003 IEEE International Conference on Data Mining, pages 541–544, 2003.

[9] Tao Liu, Shengping Liu, Zheng Chen, and Wei-Ying Ma. An evaluation on feature selection for text clustering. In Tom Fawcett and Nina Mishra, editors, ICML, pages 488–495. AAAI Press, 2003.

[10] M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[11] Jens Lehmann, Robert Isele, and Max Jakob. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 2014.

[12] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In TextMining Workshop at KDD 2000, 2000.

IMPLEMENTATION

The entire project code, including Daniëlle Tump’s part, is available for download at http://niwz.org/nack.zip.
