
Mining contrastive opinions on political texts using cross-perspective topic model

Academic year: 2021



Mining contrastive opinions on political texts using cross-perspective topic model

Kees C.N. Halvemaan 10188010

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dr. Maarten J. Marx
Informatics Institute (ILPS)

Faculty of Science University of Amsterdam

Science Park 904 1098 XH Amsterdam


1 Abstract

Fang et al. (2012) proposed a novel opinion mining research problem called contrastive opinion modeling (COM) using a cross-perspective topic model (CPT). The goal of COM is to extract topics from documents written from various perspectives and present the opinions of each perspective on those topics. Fang et al. (2012) used statement records of United States senators as data and compared the opinions of Democrats and Republicans. This thesis replicates that research using proceedings from the Dutch house of representatives as data, comparing the opinions of the government and the opposition. The model can be used for many interesting applications, including opinion summarisation and forecasting, government intelligence and cross-cultural studies. An extensive set of experiments has been conducted to evaluate the model on a dataset in the political domain: the proceedings of the Dutch house of representatives of 2011-2012. The results of the experiments have been qualitatively analysed and show the effectiveness of the model when applied to Dutch data.


Contents

1 Abstract
2 Introduction
3 Related work
  3.1 Opinion mining
  3.2 Latent Dirichlet allocation
  3.3 Parsimonious Language Models
4 Method
  4.1 Experimental setup
    4.1.1 Data collections
    4.1.2 Topic-word extraction
    4.1.3 Opinion-word extraction
  4.2 Contrastive Opinion Mining
    4.2.1 Parsimonisation
    4.2.2 Steps
5 Evaluation
  5.1 Topics
    5.1.1 Topic quality
    5.1.2 Topic distribution
  5.2 Opinions
    5.2.1 Quality of opinion words
    5.2.2 Cross-perspectiveness
6 Conclusion & Discussion
7 Bibliography
Appendices
  A Translations
  B Software


2 Introduction

Fang et al. (2012) proposed a novel opinion mining research problem called contrastive opinion modeling (COM) using a cross-perspective topic model (CPT). The model is based on the idea that different perspectives use the same topic words (nouns), but express them using different opinion words (verbs, adjectives and adverbs). Given any query topic and a set of texts written from different perspectives, the goal of COM is to present the opinions of each perspective on the topic and further quantify their difference. The result will be a list of nouns (the topic) and, for each perspective, a list of verbs, adjectives and adverbs (the opinion).

The goal of this thesis will be to replicate the research done by Fang et al. (2012) using data from debates in the Dutch house of representatives. Their main research question was:

Can the CPT model effectively discover the shared topics across multiple perspectives and accurately capture the opinions expressed by different perspectives on the topics?

This thesis will be an investigation of whether this also holds for Dutch parliamentary data. The findings will be useful for studying the diversity of opinions in the political landscape of the Netherlands. This thesis focuses on the proceedings of a single parliamentary year. Future work could apply the same technique to multiple years, allowing for a view of the development of opinions over time.

The proceedings of the parliamentary year 2011-2012 have been used; these have been made available via an open data initiative of the Dutch government. The text of the proceedings has been annotated, part-of-speech tagged and lemmatised, allowing for easy parsing.

Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents; one such model is used to extract the topics from the data. Latent Dirichlet Allocation (LDA) is a statistical topic model of document collections. The intuition behind LDA is that a document can exhibit multiple topics. The nouns are extracted from the data and used to create an LDA model, which is then used to assign topics to the documents.

Using maximum likelihood estimation (MLE), a conditional probability distribution over the opinion words is made for each topic. This is called a language model. From this model, a parsimonious language model (PLM) is then created to find the opinions of the opposition and the government on specific topics. A PLM implements the idea that one should model what language use distinguishes a relevant opinion word from other opinion words, instead of blindly modeling language use.


The topics are evaluated by checking the top 10 words of each topic distribution for cohesion. The quality of the opinion words is evaluated by comparing the PLM with the results from a non-parsimonious language model. The cross-perspective topic model is evaluated by examining the opinion words used by the two perspectives on specific topics where the opinions can be assumed in advance with little domain-specific knowledge. For example, in debates about the possible bail-out of financially weak states in the Eurozone, the government was pro bail-out, while the opposition was against it.

Outline of the thesis.

Section 3 reviews related work focused on opinion mining, while also explaining the workings of topic models and parsimonious language models. The layout of the data is explained in section 4 as well as the method for extracting topic words and opinion words. The application of contrastive opinion modeling on the data can be found in section 4.2. The model is evaluated in section 5, with a separate subsection for the topics and the opinions (section 5.1 and section 5.2 respectively). A conclusion is given in section 6, as well as a short discussion.

3 Related work

The approach taken by Fang et al. (2012) for their opinion mining problem uses latent Dirichlet allocation (LDA; Blei et al. 2003) to create topic models, which are used to find the main themes in the data. A literature review of opinion mining is given in section 3.1 and the workings of LDA are explained in section 3.2. The approach taken by this thesis also uses a parsimonious language model to filter out non-informative and non-opinionated words; the theory behind this can be found in section 3.3.

3.1 Opinion mining

Several opinion mining studies have been conducted on comparing texts or opinions. Zhai et al. (2004) created a probabilistic model for comparing text collections for a problem called comparative text mining. The model was later extended by Paul & Girju (2009) to detect cultural differences from people's experiences in various countries. The above works do not quantify the differences they find, which makes the differences not comparable or measurable across multiple requests. On top of that, they also cannot deal with ad hoc queries, tasks that COM can perform (Fang et al. 2012).


The research field of opinion mining and sentiment analysis has been extensively studied in the past decade. Pang & Lee (2008) gave a general overview of the developments, problems and applications of sentiment analysis. Most of the early work focused on the polarity of opinions at either the word (Hatzivassiloglou & McKeown 1997), sentence (Kim & Hovy 2004) or document level (Pang et al. 2002), with no consideration of the dependence of opinions on topics.

Polarity and topicality were first combined by Hurst & Nigam (2003) to form the notion of opinion retrieval, i.e. finding opinionated documents about a given topic. Eguchi & Lavrenko (2006) introduced an early ranking formula: the cross-entropy of topics and sentiments under a generative model. The Text REtrieval Conference (TREC) introduced a Blog Track with a major task of opinion retrieval (Ounis et al. 2006). The goal was to make a retrieval system for opinions in order to locate blog documents expressing opinions. The opinion retrieval effort was approached as a two-stage task: first retrieving the documents relevant to a certain topic and then ranking the documents by their opinion scores. Zhang et al. (2007) describe a popular and simple method to identify opinion-related content in documents by matching the documents with a sentiment-word dictionary and calculating the term frequency. Zhang & Ye (2008) worked on modeling the topic and sentiment of documents in a unified way. The opinion retrieval task is in this case essentially a document retrieval process; opinions are not directly returned in response to a search request.

Another body of related research is based on feature-based opinion mining, which is a supervised machine learning approach that identifies opinions based on a set of features or attributes of a product instead of an overall evaluation. Early work in this sector is the association rule mining based method (Hu & Liu 2004). The problem with this approach is that it relies heavily on training sets, which have to be created by manual annotation, a resource-consuming task.

3.2 Latent Dirichlet allocation

Fang et al. (2012) used a cross-perspective topic model to map the opinions of different perspectives on a certain topic. Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Latent Dirichlet Allocation (LDA; Blei et al. 2003) is one of the earliest topic models. The intuition behind it is that documents exhibit multiple topics.

Figure 1 gives an intuitive explanation of the workings of LDA. The image shows a speech made by Ms. Faber-van de Klashorst, a member of parliament for the Partij voor de Vrijheid (PVV party). Different words in the article have been manually highlighted. Words about slaughterhouses, such as slachthuis, bloed and mes, are highlighted in blue. Words about animal welfare have been highlighted in yellow: dier, welzijn and verdoving. Words highlighted in pink are about the PVV party and those highlighted in green are about Islam. If the author had taken the time to highlight every word, it would become clear that the article mixes the topics of animal welfare, slaughterhouses and the PVV party in different proportions, and is not about the (green) topic Islam. The proportions of the latent topics "behind" this article are given by the histogram on the right. Knowing that those topics are being used by Ms. Faber-van de Klashorst would allow us to find her, and her party's, opinion on those topics.

Figure 1: The intuitions behind latent Dirichlet allocation. It is assumed that a number of topics exist for the whole collection (the column on the far left). Document generation is assumed to be as follows: first, a distribution over the topics is chosen (the histogram on the right). Then, each word (the coloured circles) is drawn from the corresponding topic. The topics and topic assignment in this image are purely illustrative. This figure, adapted to the dataset used in this thesis, is based on figure 1 in Blei (2012).

3.3 Parsimonious Language Models

Language models estimated using maximum likelihood estimation suffer from two problems: a lot of probability mass goes to non-informative "stop words", and hapaxes enter the models while they should be treated as outliers. Hiemstra et al. (2004) proposed a method, called parsimonisation, to remedy these two problems. The method is language independent and works by comparing the probability of a term in a specific document with its probability in the complete corpus. If these are almost equal, then the background corpus explains the observed probability in the document; e.g. the frequent use of the word voorzitter is explained by the fact that the corpus contains parliamentary speeches only. One can thus reduce the importance of the term in the document by lowering its probability and redistributing its probability mass over the other terms in the document.
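The EM procedure behind parsimonisation can be sketched in a few lines of Python. This is a simplified sketch in the spirit of Hiemstra et al. (2004); the mixing weight `lam` and the pruning threshold are illustrative choices, not the thesis's actual settings.

```python
from collections import Counter

def parsimonise(doc_tf, corpus_tf, lam=0.1, iters=50, threshold=1e-4):
    """EM parsimonisation after Hiemstra et al. (2004).

    doc_tf:    term frequencies of one document (dict)
    corpus_tf: term frequencies of the background corpus (dict)
    lam:       weight of the document model; 1 - lam goes to the background
    """
    c_total = sum(corpus_tf.values())
    p_bg = {t: c / c_total for t, c in corpus_tf.items()}
    d_total = sum(doc_tf.values())
    p_doc = {t: tf / d_total for t, tf in doc_tf.items()}  # MLE initialisation
    for _ in range(iters):
        expected = {}
        for t, tf in doc_tf.items():
            pd = p_doc.get(t, 0.0)
            if pd > 0.0:
                # E-step: the share of tf that the document model, rather
                # than the background corpus, has to explain.
                expected[t] = tf * lam * pd / ((1 - lam) * p_bg.get(t, 0.0) + lam * pd)
        total = sum(expected.values())
        # M-step: renormalise, pruning terms the background fully explains.
        p_doc = {t: v / total for t, v in expected.items() if v / total >= threshold}
    return p_doc

# voorzitter dominates the background corpus, so its probability collapses.
background = Counter({"voorzitter": 1000, "debat": 500, "dier": 10, "slacht": 10})
plm = parsimonise({"voorzitter": 5, "dier": 5, "slacht": 4}, background)
```

Under this sketch, the probability mass of voorzitter is redistributed to dier and slacht, exactly the behaviour described above.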

4 Method

The approach taken by Fang et al. (2012) was followed as closely as possible. The implementation of COM in this thesis was written in Python, using the package Gensim, which contains many topic modeling algorithms. A complete list of the software that was used can be found in appendix B. The data had already been preprocessed, which allowed for easy parsing; further information on the data can be found in section 4.1.1. A number of filters have been applied for the extraction of the topic and opinion words; the process is described in sections 4.1.2 and 4.1.3 respectively. Section 4.2 explains the workings of COM as applied in this thesis.



4.1 Experimental setup

4.1.1 Data collections

The proceedings of the Dutch house of representatives have been archived by the government since 1859 and made available online for the general public as part of an open data initiative. The supervisor of the author parsed the files from the parliamentary year 2011 to 2012 and applied lemmatisation and part-of-speech tags. Please see appendix C for a general layout of the files. A scene starts when a member of parliament approaches the lectern and starts speaking, consequently also starting the first speech of that particular scene. Other members can then interrupt the speaker and start their own speech within the scene. Once the debate ends, the scene is closed and the next person on the list of speakers will open a new scene. Each speech node has information about its speaker, such as name, party, gender and the role in which the person is speaking. The implementation uses only the data found in the speeches, as they contain the most relevant and opinion-rich data. The dataset consists of 2130 files, of which only 880 contain speech nodes; table 1 gives detailed statistics of the collections.
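Reading the speech nodes is then straightforward. The snippet below only illustrates the idea: the element and attribute names are hypothetical stand-ins, not the actual schema of the proceedings files (see the appendix for that).

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, for illustration only:
# <scene><speech speaker="..." party="..." role="...">lemmatised text</speech></scene>
xml_data = """<proceedings>
  <scene>
    <speech speaker="Faber-van de Klashorst" party="PVV" role="mp">dier slacht dierenwelzijn</speech>
    <speech speaker="Onbekend" party="VVD" role="government">zorgpremie huisarts</speech>
  </scene>
</proceedings>"""

root = ET.fromstring(xml_data)
# Collect (party, words) pairs from every speech node.
speeches = [(s.get("party"), s.text.split()) for s in root.iter("speech")]
print(speeches)
```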

Table 1: Statistics of the testbed.

Description                      Value
Number of documents              880

Topics
Number of unique topic words     3988
Number of topic words            265356

Opinions                         Government   Opposition   Total
Number of unique opinion words   5397         6389         6518²
Number of opinion words          332768       389251       722019

² Note that this is not the sum of the number of unique opinion words of the government and those of the opposition, as some words are used by both perspectives.

4.1.2 Topic-word extraction

The lemmatised nouns are extracted from the documents and used in the creation of the topic model. A number of filters are applied to filter out nouns which are either overused or underused. First, only words with more than three characters are taken into consideration for processing. Second, words that are found in more than 10% of the files are filtered out, as well as those that are found in fewer than 10 documents.


4.1.3 Opinion-word extraction

The same procedure which was applied to the topic words is used for the opinion words. Both the lemmatised adjectives and adverbs are extracted from the documents as opinion words. First, only words with more than three characters are considered for processing. Second, a more lenient filter is applied than for the topic words, in order to get as many different opinion words as possible while still filtering overused and underused words. Words found in more than 95% of the files, and those found in fewer than two files, are filtered out.
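Both extraction filters can be expressed as one document-frequency pass. The sketch below is illustrative (Gensim's `Dictionary.filter_extremes` offers the same thresholds); the topic words used `min_docs=10` and `max_doc_frac=0.10`, the opinion words `min_docs=2` and `max_doc_frac=0.95`.

```python
from collections import Counter

def filter_vocabulary(docs, min_docs, max_doc_frac, min_len=4):
    """Return the words that survive the document-frequency filters.

    docs: one list of lemmatised words per file.
    """
    df = Counter()
    for words in docs:
        df.update(set(words))                # document frequency, not term frequency
    n_docs = len(docs)
    return {w for w, c in df.items()
            if len(w) >= min_len             # only words with more than three characters
            and c >= min_docs                # not underused
            and c / n_docs <= max_doc_frac}  # not overused

# Toy input with relaxed thresholds, purely for illustration.
nouns_per_file = [["voorzitter", "dier"], ["voorzitter", "slacht"],
                  ["voorzitter", "dier"], ["voorzitter", "kat"]]
kept = filter_vocabulary(nouns_per_file, min_docs=2, max_doc_frac=0.75)
```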

The opinion-word extraction process is handled differently by Fang et al. (2012): besides adjectives and adverbs, they also use verbs as opinion words. An attempt was made to use all three for the proceedings data used in this thesis, but it led to a significant increase in non-informative and non-opinionated words. The verbs 'drowned out' the more opinionated adverbs and adjectives, thus reducing the overall quality of the opinions. See table 2 for the difference between the opinion words with and without verbs.

Table 2: The top 10 government words of the opinion-topic distribution for topic #69, in descending order (see table 5 for the topic words corresponding with this topic). The left column has been generated with verbs, adjectives and adverbs, while the right column only used adjectives and adverbs. Note that these opinion words have not been parsimonised yet; the effects of parsimonisation are explained in section 4.2.1.

Verbs, adjectives and adverbs    Adjectives and adverbs
worden                           daarna
vragen                           Nederlands
willen                           klein
zullen                           nauwkeurig
niet                             echter
gaan                             helemaal
goed                             simpel
echter                           eigen
staan                            kort
eens                             gewoon

(Note that the topics are referred to by number, as the LDA model returns only a set of nouns as the topic, without a human-readable label.)


4.2 Contrastive Opinion Mining

The implementation of contrastive opinion mining first needs an explanation of parsimonious language models. An overview of the theory was given in section 3.3; however, two types of PLMs are generated, which are explained in section 4.2.1. Then, section 4.2.2 gives the steps by which the COM task was achieved.

4.2.1 Parsimonisation

A parsimonised language model needs a background-corpus to which documents are compared for similarity. The background-corpus will determine what word-usage is considered to be characteristic for that corpus. If a document is then compared with it, and a certain word is overused in the document, that word will be seen as typical for the document. Two types of PLMs were created to test two different theories.

First, a parsimonious language model was made with the whole corpus as background-corpus, which can be seen in table 3. The top 10 words are shown with their respective log probabilities. This PLM can be used to check which words are typically used by each perspective. The opposition has a high probability of using the word bezuiniging. This can be explained by the many debates that took place concerning budget cuts: the opposition was not supportive and tended to argue against the measures taken by the government.

Second, for each topic a PLM has been created with as background-corpus all documents that exhibit that topic. Table 4 shows the top 10 words per perspective of the parsimonious language model with all documents that exhibit topic #69 as background-corpus. Since the background-corpus only contains documents about the same topic, the characteristic words per perspective for that topic can be found. The opposition uses more words connected to the religious aspect of animal welfare (joods, islamitisch), while the government is more likely to talk about the financial aspects (fiscaal, btw-tarief).

Table 3: The top 10 words per perspective with their log probabilities, as calculated in the parsimonious language model made with the whole corpus as background-corpus.

Government                        Opposition
Word          Log probability     Word           Log probability
ontraden      -2.23               ondersteunen   -3.68
specifiek     -4.19               bezuinigen     -4.39
beschikbaar   -4.21               democratisch   -4.74
overigens     -4.48               benieuwd       -4.83
buitengewoon  -4.63               ziek           -5.01
uiteindelijk  -4.77               stem           -5.12
wisselen      -4.93               eindelijk      -5.18
individueel   -4.95               dreigen        -5.30
ingaan        -4.96               snappen        -5.40
vooruitlopen  -5.00               korten         -5.43

Table 4: The top 10 words per perspective with their log probabilities, as calculated in the parsimonious language model with as background-corpus all documents exhibiting topic #69.

Government                        Opposition
Word          Log probability     Word             Log probability
fiscaal       -3.02               joods            -3.81
symbolisch    -3.60               maatschappelijk  -4.00
buitengewoon  -3.60               islamitisch      -4.06
ethisch       -3.78               sterk            -4.21
blij          -3.78               grootschalig     -4.31
gewoonlijk    -4.00               juist            -4.42
btw-tarief    -4.00               onnodig          -4.57
waaronder     -4.28               kritisch         -4.57
inbegrepen    -4.28               hoeverre         -4.57
generiek      -4.27               eveneens         -4.75

4.2.2 Steps

There are a number of steps needed in order to perform the contrastive opinion mining task. The program starts with the creation of dictionaries, which are hash tables of the unique words with their frequencies. The use of hash tables was necessary to decrease the run time of the program and to allow the topic model algorithms to work with the data. Dictionaries are created for the topic words and for the opinion words used by each perspective. The filters mentioned in sections 4.1.2 and 4.1.3 are applied to the dictionaries. Three corpora are created using the dictionaries: a topic corpus, and an opinion corpus for each perspective. A corpus is a vector of IDs which refer to the words in their respective dictionary. An LDA model is made using the topic dictionary and corpus. Fang et al. (2012) generated 200 topics; however, their dataset was considerably larger, and a larger dataset requires more topics to cover all the possible themes in the texts. The number of topics to generate was set at 100, since this gave the best results in several trials with different numbers of topics. The LDA model is then used to assign a topic distribution to each document; e.g. a document could be 50% about topic #69, 25% about topic #77 and 25% about topic #3. This distribution is used to weigh the opinion words per topic for each document in the language model. In the example, each opinion word would be counted with weight 0.25 for topic #77 and for topic #3, and with weight 0.5 for topic #69. A maximum likelihood estimate is then made for each of the topics over the opinion words. Next, the PLMs are created: first one with the whole opinion corpus as background-corpus, and second, for each topic, one with as background-corpus all documents that exhibit that topic. The final step is the generation of the topics with the opinions of each perspective. The output is generated by iterating over the topics and printing the top 10 words from the topic-word distribution; then, for each perspective, the top 10 words are printed from the opinion-topic distribution for both of the parsimonised language models.
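The weighting and MLE steps described above can be sketched as follows. This is a simplified sketch with illustrative data structures, not the thesis's actual code.

```python
from collections import defaultdict

def topic_opinion_counts(doc_topics, doc_opinions):
    """Weigh each document's opinion words by its topic distribution.

    doc_topics:   per document, a {topic_id: probability} dict
    doc_opinions: per document, the list of its opinion words
    """
    counts = defaultdict(lambda: defaultdict(float))
    for topics, words in zip(doc_topics, doc_opinions):
        for word in words:
            for topic, p in topics.items():
                counts[topic][word] += p  # a 25% topic adds 0.25 of a count
    return counts

def mle(word_counts):
    """Maximum likelihood estimate p(o|z) from the weighted counts of topic z."""
    total = sum(word_counts.values())
    return {w: c / total for w, c in word_counts.items()}

# The example from the text: a document that is 50% topic #69,
# 25% topic #77 and 25% topic #3.
counts = topic_opinion_counts([{69: 0.50, 77: 0.25, 3: 0.25}], [["goed"]])
```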

5 Evaluation

The evaluation has been split up into two parts: the evaluation of the topics in section 5.1 and the evaluation of the opinions in section 5.2. The latter treats both the quality of the opinions and the performance of the cross-perspective topic model.

5.1 Topics

5.1.1 Topic quality

The topics were manually evaluated on the topical cohesion of the top 10 words of each topic-word distribution. Of the 100 topics generated by the LDA model, 80% were cohesive. A topic is considered to be cohesive when no more than two unrelated words appear in its top 10 words. The first column of table 5 shows an example of what is considered a cohesive topic and the second column shows an example of a non-cohesive topic. The cohesive topic has only nouns about animal welfare and the slaughtering of animals, while the non-cohesive topic has words mixed in from different topics, such as child support and privacy.


Table 5: An example of a cohesive topic (topic #69) and a non-cohesive topic (topic #77). The top 10 words for each of the topic-word distributions of the topics are shown in descending order.

TOPIC #69            TOPIC #77
dier                 kinderbijslag
convenant            beloning
slacht               camera
dierenwelzijn        topinkomen
godsdienst           export
indienster           bonus
godsdienstvrijheid   ziekenhuis
ritueel              voedselbank
seconde              persoonsgegevens
grondrecht           woningcorparatie

5.1.2 Topic distribution

The topic distribution can give valuable insights into the dataset. Figure 3 shows the number of documents against the number of topics that they exhibit. A lot of documents seem to have one clear winner, and there are few documents that exhibit more than two topics. About 85% of the documents had either one or two topics assigned to them. There seems to be a tipping point when the threshold reaches 20%, after which documents start losing all their topics. Figure 2 shows the distribution of the topic with the highest probability assigned to a document by the LDA model. More than half of the documents (both the '50-75' and '75-100' slices) have one clear winner topic.

Having one topic per document gives a clear indication that the LDA model is confident in its topic assignments. The effect of the threshold in figure 3 shows that it could be useful to filter out topics with a low probability. The opinion words are gathered per document; no distinction is made between an opinion word used with one topic or another. Thus, when all opinion words are matched to a single topic, the results for the opinion words should improve.
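The thresholding described above can be sketched as follows. Note that renormalising the remaining topics is an assumption of this sketch; the text only describes dropping low-probability topics.

```python
def apply_topic_threshold(dist, threshold):
    """Drop topics below the threshold and renormalise what remains
    (renormalisation is an assumption of this sketch)."""
    kept = {z: p for z, p in dist.items() if p >= threshold}
    total = sum(kept.values())
    return {z: p / total for z, p in kept.items()} if total else {}

# A document that is 55% topic #1, 20% topic #2 and 25% topic #3
# loses topic #2 at a 22% threshold.
filtered = apply_topic_threshold({1: 0.55, 2: 0.20, 3: 0.25}, 0.22)
```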

5.2 Opinions

5.2.1 Quality of opinion words

The quality of the opinion words has been examined based on the number of non-informative and non-opinionated opinion words, so-called 'stop words', found in the top 10 of the topic-opinion distribution. As explained in section 4.1.3, removing the verbs from the opinion words created a boost in quality. The parsimonised language model also makes a significant difference, as can be seen in table 6. Whereas the opinion word list of the government first consisted solely of stop words such as niet, ander and heel, it now has more opinionated words such as beheersbaar and complementair. There is a smaller difference in the opinion words used by the opposition, as they started out with fewer stop words. The overall number of stop words has decreased; however, there are still some left. Fang et al. (2012) had about ten times more data than was used in this thesis (compare table 1 in (Fang et al. 2012) with table 1). More data would mean more training of the model, which should improve the quality of the opinion words, as fewer hapax-like words will enter the system.

Figure 2: The collection of documents is divided into slices, with each slice representing the probability of the highest topic for that document. For example, if a document has three topics (topic #1 at 55%, topic #2 at 20% and topic #3 at 25%), then it will get assigned to the '50-75' slice.


Figure 3: The number of documents against the number of topics they exhibit. A threshold was established to remove topics whose distribution percentage was less than the threshold. Each bar represents a 2% increase in the threshold, starting at 0%. The darkest blue bar indicates documents without topics, the lighter blue those with one topic, etc.

Table 6: The effects of parsimonisation. The top part shows the results for topic #10 without parsimonisation, and the bottom part with parsimonisation. The parsimonisation used here uses all opinion words in combination with topic #10 as background-corpus. The top 10 words of the topic-word log distribution p(w|z) are shown, as well as the top 10 words of the topic-opinion log distribution p(o|z) for each perspective.

Without parsimonisation

TOPIC #10                     Government               Opposition
Word               Log Prob.  Opinion     Log Prob.    Opinion         Log Prob.
zorgverzekeraar    -1.87      niet        -8.66        recent          -10.71
zorgkosten         -2.51      ander       -8.69        allerlaatst     -11.41
zorgpremie         -2.94      heel        -8.70        waartoe         -11.65
huisarts           -2.97      goed        -8.70        nederlands      -12.54
betaling           -3.06      eens        -8.75        geregeld        -12.79
overschrijding     -3.49      zelf        -8.75        extramuraal     -12.79
gedragseffect      -3.58      alleen      -8.76        gegrond         -13.13
ziekenhuis         -3.61      even        -8.76        big             -13.13
huisartsenzorg     -3.66      belangrijk  -8.80        acceptabel      -13.13
sturingsinstrument -3.77      echter      -8.83        eenzelfde       -13.64

With parsimonisation

zorgverzekeraar    -1.87      beheersbaar   -4.07      fiscaal         -4.06
zorgkosten         -2.51      complementair -4.07      bureaucratisch  -4.38
zorgpremie         -2.94      ineens        -4.07      ernstig         -4.51
huisarts           -2.97      erom          -4.29      ruim            -4.51
betaling           -3.06      omlaag        -4.29      dichtbij        -4.85
overschrijding     -3.49      alweer        -4.29      plezierig       -4.85
gedragseffect      -3.58      eenduidig     -4.29      eenvoudig       -4.85
ziekenhuis         -3.61      definitief    -4.35      telkens         -4.85
huisartsenzorg     -3.66      uiterst       -4.58      afschrikwekkend -4.85
sturingsinstrument -3.77      teniet        -4.58      kritisch        -4.91

5.2.2 Cross-perspectiveness

Table 7 contains a sample of topics and the corresponding opinions of the government and the opposition. The top 10 words from the topic-word log distribution p(w|z) are shown, as well as the top 10 words from the topic-opinion log distribution p(o|z) for each perspective. The pairing of p(w|z) and p(o|z) can indicate the opinion o of each perspective on topic z, represented by the word w. The differences of opinion between the government and the opposition are reflected in their choice of opinion words. For example, in topic #10 the opposition seems to emphasise the fiscal and bureaucratic aspects of health care, while the government emphasises manageability. Topic #19 shows some truly contrasting opinions regarding the European monetary union⁷; the government focuses on strictness and practicability, while the focus of the opposition is on reciprocity and democracy.

⁷ In the year 2011, a lot of debates took place about financially weaker states in the European union, and the effect of their bankruptcy on the banking system.

The government is prone to using less opinionated words than the opposition. This can be seen in topic #70, where words such as bovendien and vreemd, used by the government, express a more timid opinion than economisch and modern. The government seems to use such stop words more often than the opposition. Issues are clouded by members of the government in order to downplay them; the words used in this process are usually stop words such as bovendien, vandaar, etc. Applying extra filters for the stop words could solve this problem, but it would also introduce another: a certain word might be a stop word for one topic, but not for another. For example, the word vreemd is a stop word for topic #70, which is about retirement funds, while it might be a regular opinion word for a topic about immigration (e.g. een persoon uit den vreemde).

Figure 4: The top word cloud shows the topic words, and the bottom word cloud the opinion words used by the government (blue) and the opposition (red).


Table 7: A sample of topics and their corresponding opinions from the government and the opposition. The top 10 words of the topic-word log distribution p(w|z) are shown, as well as the top 10 words of the topic-opinion log distribution p(o|z) for each perspective.

TOPIC #70                     Government                  Opposition
Word               Log Prob.  Opinion        Log Prob.    Opinion         Log Prob.
pensioenfonds      -2.05      bovendien      -2.82        sterk           -4.29
dekkingsgraad      -3.10      wenselijk      -3.15        politiek        -4.37
pensioenakkoord    -3.28      specifiek      -3.15        institutioneel  -4.37
herindeling        -3.30      helder         -3.66        modern          -4.46
rendement          -3.69      vandaar        -3.66        economisch      -4.57
pensioenwet        -3.76      totaal         -3.66        bestuurlijk     -4.71
pensioenleeftijd   -4.03      bereid         -3.66        collectief      -4.71
bestuurskracht     -4.16      vreemd         -3.66        concreet        -4.71
toetsingskader     -4.26      destijds       -3.78        adequaat        -4.71
gemeenteraad       -4.27      consequent     -4.07        blij            -4.71

TOPIC #10
zorgverzekeraar    -1.87      beheersbaar    -4.07        fiscaal         -4.06
zorgkosten         -2.51      complementair  -4.07        bureaucratisch  -4.38
zorgpremie         -2.94      ineens         -4.07        ernstig         -4.51
huisarts           -2.97      erom           -4.29        ruim            -4.51
betaling           -3.06      omlaag         -4.29        dichtbij        -4.85
overschrijding     -3.49      alweer         -4.29        plezierig       -4.85
gedragseffect      -3.58      eenduidig      -4.29        eenvoudig       -4.85
ziekenhuis         -3.61      definitief     -4.35        telkens         -4.85
huisartsenzorg     -3.66      uiterst        -4.58        afschrikwekkend -4.85
sturingsinstrument -3.77      teniet         -4.58        kritisch        -4.91

TOPIC #19
tegemoetkoming     -3.34      wettelijk      -3.69        illegaal        -4.28
bankenunie         -3.45      nauw           -4.14        vanmiddag       -4.38
munt               -3.86      hypothecair    -4.19        demissionair    -4.41
verdragswijziging  -3.93      budgettair     -4.41        serieus         -4.42
inkomensgrens      -4.13      maatschappelijk -4.55       eenzijdig       -4.51
groeipact          -4.31      intergouvernementeel -4.55  contant         -4.60
overdracht         -4.45      uitvoerbaar    -4.70        thuis           -4.73
eurozone           -4.48      bankroet       -4.70        bilateraal      -4.73
vergezicht         -4.53      aanzienlijk    -4.70        ondemocratisch  -4.73
eurobond           -4.54      strikt         -4.79        fundamenteel    -4.73


6 Conclusion & Discussion

A successful reproduction of the research by Fang et al. (2012) was achieved using Dutch parliamentary data. The topics found by the LDA model were of high quality: 80% were cohesive. About 85% of documents had either one or two topics assigned to them, which indicates that the LDA model assigned those topics with high confidence. A clear contrast could be seen between the opinions of the government and the opposition on various topics. However, the opinion words still contain some stop words, so there is still room for improvement there. This may be a domain-specific issue, since politicians often cloud the issue by using many words where few would do. Overall, the cross-perspective topic model gave sufficient results.

There are a number of differences between the implementation of CPT in this thesis and that of Fang et al. (2012). They only used opinion words found in sentences that exhibited opinion clues: constructions in a sentence that indicate it contains an opinion. Furuse et al. (2007) used opinion clues to extract opinionated sentences from blogs, and Chen et al. (2010) used them to extract the statements that best express an opinionist's standpoint on a specific topic. Applying this technique in the implementation of this thesis might improve the quality of its opinion words.
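As a rough illustration of such a clue-based filter, one could keep only the sentences that contain at least one clue word and restrict opinion-word extraction to those sentences. The clue list and the sentence splitter below are illustrative assumptions, not the lexicon or method of Furuse et al. (2007):

```python
# Sketch of an opinion-clue filter: keep only sentences that contain
# at least one clue word. The clue set below is a made-up example.
import re

OPINION_CLUES = {"vind", "mening", "helaas", "gelukkig", "onacceptabel"}

def opinionated_sentences(text):
    """Split text into sentences and keep those containing a clue word."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences
            if OPINION_CLUES & set(re.findall(r"\w+", s.lower()))]

text = ("Ik vind dit voorstel onverstandig. "
        "De commissie vergadert morgen om tien uur.")
print(opinionated_sentences(text))  # → ['Ik vind dit voorstel onverstandig.']
```

Only the first sentence survives, since it contains the clue word "vind"; opinion words would then be mined from the surviving sentences only.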

The usage of verbs did not seem to work for the domain used in this thesis. Perhaps Dutch has more verbal stop words than English. It could also be caused by data sparsity, as explained in section 5.2.1.

For the evaluation of the topics, Fang et al. (2012) used the word-intrusion method (Chang et al. 2009). This measures the cohesion of a topic by replacing one word in the top 10 of its topic-word distribution with a word from a different topic. A number of test subjects are then asked to spot the word that has been replaced: when a topic is highly cohesive, the odd word is easy to spot. Applying this technique to all topics found by the LDA model would yield a score based on the input of multiple people, instead of the author alone.
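Generating such intrusion test items is straightforward. The sketch below builds one item in the spirit of Chang et al. (2009); the function name is our own, and the word lists are abbreviated top words from table 7:

```python
# Build one word-intrusion test item: the top words of a topic with one
# word swapped for a high-probability word from another topic, shuffled.
import random

def make_intrusion_item(topic_words, other_topic_words, rng=random):
    """Return (shuffled word list, intruder word) for one test item."""
    words = list(topic_words[:10])
    intruder = rng.choice([w for w in other_topic_words if w not in words])
    words[rng.randrange(len(words))] = intruder
    rng.shuffle(words)
    return words, intruder

rng = random.Random(0)  # fixed seed so items are reproducible
pension = ["pensioenfonds", "dekkingsgraad", "pensioenakkoord",
           "herindeling", "rendement", "pensioenwet"]
health = ["zorgverzekeraar", "zorgkosten", "zorgpremie", "huisarts"]
item, intruder = make_intrusion_item(pension, health, rng)
```

Judges would be shown `item` and asked to pick the intruder; the fraction of judges who find it gives the topic's cohesion score.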

Fang et al. (2012) also performed a quantitative evaluation of the opinions, using a perplexity score. Perplexity is a quantitative measure for comparing language models and is often used to compare the predictive performance of topic models (Griffiths & Steyvers 2004). The perplexity value indicates the model's ability to generalise to unseen data: a lower perplexity score indicates better generalisation performance. In the CPT model, it reflects the model's ability to predict opinion words for unseen documents.
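Concretely, perplexity is the exponential of the negative average log-likelihood that the model assigns to the held-out words. A minimal sketch (the log-probabilities here are synthetic, not model output):

```python
# Perplexity as used for topic models: exp(-(sum of word log-likelihoods) / N).
import math

def perplexity(log_probs):
    """log_probs: natural-log likelihood of each held-out word."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a model assigning every held-out word probability 1/8
# has perplexity exactly 8, regardless of the number of words.
uniform = [math.log(1.0 / 8)] * 100
print(round(perplexity(uniform), 6))  # → 8.0
```

This makes the interpretation concrete: a perplexity of 8 means the model is, on average, as uncertain as if it were choosing uniformly among 8 words.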

There are many applications for the CPT model described in this thesis. For example, instead of modelling data from a single year, it could be run on multiple years to track changes in opinions over time. The proceedings of each day's debates in the house of representatives are made available in the evening; this data can then be visualised as two word clouds, one for the topic and one for the opinions per perspective (see figure 4 for an example), and used as an aid for summarising the debate.

7 Bibliography

Blei, D. M. (2012), ‘Probabilistic topic models’, Communications of the ACM 55(4), 77–84.

Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003), 'Latent Dirichlet allocation', the Journal of Machine Learning Research 3, 993–1022.

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L. & Blei, D. M. (2009), Reading tea leaves: How humans interpret topic models, in 'Advances in neural information processing systems', pp. 288–296.

Chen, B., Zhu, L., Kifer, D. & Lee, D. (2010), What is an opinion about? exploring political standpoints using opinion scoring model., in ‘AAAI’, Citeseer.

Eguchi, K. & Lavrenko, V. (2006), Sentiment retrieval using generative models, in 'Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing', Association for Computational Linguistics, pp. 345–354.

Fang, Y., Si, L., Somasundaram, N. & Yu, Z. (2012), Mining contrastive opinions on political texts using cross-perspective topic model, in 'Proceedings of the fifth ACM international conference on Web search and data mining', ACM, pp. 63–72.

Furuse, O., Hiroshima, N., Yamada, S. & Kataoka, R. (2007), Opinion sentence search engine on open-domain blog., in ‘IJCAI’, pp. 2760–2765.

Griffiths, T. L. & Steyvers, M. (2004), 'Finding scientific topics', Proceedings of the National Academy of Sciences 101(suppl 1), 5228–5235.

Hatzivassiloglou, V. & McKeown, K. R. (1997), Predicting the semantic orientation of adjectives, in 'Proceedings of the 35th annual meeting of the Association for Computational Linguistics and eighth conference of the European chapter of the Association for Computational Linguistics', Association for Computational Linguistics, pp. 174–181.


Hiemstra, D., Robertson, S. & Zaragoza, H. (2004), Parsimonious language models for information retrieval, in ‘Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval’, ACM, pp. 178–185.

Hu, M. & Liu, B. (2004), Mining opinion features in customer reviews, in ‘AAAI’, Vol. 4, pp. 755–760.

Hurst, M. F. & Nigam, K. (2003), Retrieving topical sentiments from online document collections, in 'Electronic Imaging 2004', International Society for Optics and Photonics, pp. 27–34.

Kim, S.-M. & Hovy, E. (2004), Determining the sentiment of opinions, in 'Proceedings of the 20th international conference on Computational Linguistics', Association for Computational Linguistics, p. 1367.

Ounis, I., Rijke, M., Macdonald, C., Mishne, G. & Soboroff, I. (2006), Overview of the trec-2006 blog track, Technical report, DTIC Document.

Pang, B. & Lee, L. (2008), 'Opinion mining and sentiment analysis', Foundations and Trends in Information Retrieval 2(1-2), 1–135.

Pang, B., Lee, L. & Vaithyanathan, S. (2002), Thumbs up?: sentiment classification using machine learning techniques, in ‘Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10’, Association for Computational Linguistics, pp. 79–86.

Paul, M. & Girju, R. (2009), Cross-cultural analysis of blogs and forums with mixed-collection topic models, in 'Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3', Association for Computational Linguistics, pp. 1408–1417.

Zhai, C., Velivelli, A. & Yu, B. (2004), A cross-collection mixture model for comparative text mining, in 'Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining', ACM, pp. 743–748.

Zhang, M. & Ye, X. (2008), A generation model to unify topic relevance and lexicon-based sentiment for opinion retrieval, in 'Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval', ACM, pp. 411–418.

Zhang, W., Yu, C. & Meng, W. (2007), Opinion retrieval from blogs, in 'Proceedings of the sixteenth ACM conference on Conference on information and knowledge management', ACM, pp. 831–840.


Appendices

A Translations

Table 8: A list of Dutch words and their English translations. Words that are the same in Dutch and English have been left out. The words of tables 6 and 7 have not been translated; please consult a dictionary when needed.

Dutch               English
convenant           covenant
slacht              slaughter
dierenwelzijn       animal welfare
godsdienst          religion
indienster          petitioner (feminine)
godsdienstvrijheid  freedom of religion
ritueel             ritual
seconde             second
grondrecht          fundamental right
kinderbijslag       child support
beloning            reward
topinkomen          high income
ziekenhuis          hospital
voedselbank         food bank
persoonsgegevens    personal data
woningcorparatie    housing cooperative
worden              to be
vragen              to ask
willen              to want
zullen              shall
niet                not
gaan                to go
goed                good
echter              however
staan               to stand
eens                agree
daarna              afterwards
Nederlands          Dutch
klein               small
nauwkeurig          accurate
slachthuis          slaughterhouse
bloed               blood
mes                 knife
dier                animal
welzijn             well-being
verdoving           anesthesia
voorzitter          chairman
ontraden            to dissuade
specifiek           specifically
beschikbaar         available
overigens           moreover
buitengewoon        extraordinary
uiteindelijk        eventually
wisselen            to exchange
individueel         individual
ingaan              to go into
vooruitlopen        to anticipate
ondersteunen        to support
bezuinigen          to save
benieuwd            curious
ziek                ill
stem                vote
eindelijk           finally
dreigen             to threaten
snappen             to understand
korten              to rebate
simpel              simple
eigen               own
kort                short
gewoon              as usual


B Software

The implementation was written in Python 2.7 and ran on Ubuntu 14.04.2 LTS (32-bit). The program took about one hour to create the dictionaries and corpora while running on the author's laptop with an Intel Core i5-2410M CPU @ 2.30 GHz and 4 GB of RAM. The dictionaries and corpora were saved, so they did not need to be regenerated unless significant changes were made to the program. When loading the saved files from disk, a run took only 5 minutes. Table 9 shows a list of software used in the implementation.

Table 9: A list of software used in the implementation.

Name         Description                                          URL
Python 2.7   A programming language with many free                https://www.python.org
             open-source packages that can be used for
             scientific research.
Gensim       A Python package containing many topic               https://radimrehurek.com/gensim
             model algorithms.
Weighwords   An implementation of parsimonious language           https://github.com/larsmans/weighwords
             models in Python.
Git          A distributed version control system, used in        https://git-scm.com/,
             combination with the online repository Bitbucket.    https://bitbucket.org
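The dictionary and corpus creation step mentioned above can be illustrated in plain Python, mirroring the style of Gensim's Dictionary/doc2bow API. This is a simplified sketch for exposition, not the actual thesis code:

```python
# Build a token-to-id dictionary and sparse bag-of-words vectors,
# in the style of Gensim's Dictionary and doc2bow.
def build_dictionary(docs):
    """Map each unique token across all documents to an integer id."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow(doc, vocab):
    """Convert a token list to sorted sparse (token_id, count) pairs."""
    counts = {}
    for token in doc:
        if token in vocab:
            counts[vocab[token]] = counts.get(vocab[token], 0) + 1
    return sorted(counts.items())

docs = [["zorg", "premie", "zorg"], ["premie", "huisarts"]]
vocab = build_dictionary(docs)
print(doc2bow(docs[0], vocab))  # → [(0, 2), (1, 1)]
```

Persisting `vocab` and the bag-of-words vectors to disk is what lets later runs skip the hour-long preprocessing step.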


C The proceedings files

Figure 5: This diagram shows the general layout of the proceedings files. An asterisk indicates that the node can have multiple siblings of the same type. The children of the 'docinfo' and 'meta' nodes have been left out, as they have not been used in this project.

root
├── docinfo
│   └── ...
├── meta
│   └── ...
└── proceedings
    └── topic
        ├── stage-direction*
        │   └── text
        └── scene*
            ├── stage-direction*
            │   └── text
            └── speech*
                └── text
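Files with this layout can be walked with the Python standard library. The element names below follow the diagram in figure 5; the sample XML is a made-up minimal instance of that structure, not an actual proceedings file:

```python
# Walk the proceedings layout of figure 5 and collect speech texts.
import xml.etree.ElementTree as ET

sample = """
<root>
  <proceedings>
    <topic>
      <stage-direction><text>De vergadering wordt geopend.</text></stage-direction>
      <scene>
        <speech><text>Voorzitter, ik dien een motie in.</text></speech>
      </scene>
    </topic>
  </proceedings>
</root>
"""

root = ET.fromstring(sample)
# ElementTree's limited XPath: every <text> child of any <speech> node.
speeches = [t.text for t in root.findall(".//speech/text")]
print(speeches)  # → ['Voorzitter, ik dien een motie in.']
```

The same `findall` pattern with `.//stage-direction/text` would collect stage directions, which can then be filtered out before topic modelling.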
