Topical diversity of debates

Ferron Saan 10386831

Bachelor thesis Informatiekunde
Credits: 12 EC
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. Maarten J. Marx
Informatics Institute (ILPS)
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

In this paper the relation between interestingness and topical diversity of Canadian parliamentary proceedings is investigated. For interestingness the method described in [13] is employed, and for topical diversity the method described in [2]. The main findings are that in general there is a relatively low correlation between interestingness and topical diversity, which can be explained by the fact that the main sources of interestingness do not influence topical diversity.

Furthermore, the role of group diversity with respect to the topical diversity of a debate is investigated, since it is likely that when a debate is topically diverse, this is caused by people who represent different interests, such as the parties represented in the House. Group diversity is measured using the Jensen–Shannon divergence. The findings are that there is a weak correlation between group diversity and topical diversity.


Contents

1 Introduction
2 Related Work
  2.1 Latent Dirichlet Allocation
  2.2 Topical diversity
  2.3 Debate interestingness
3 Methods
  3.1 Dataset
  3.2 Creating LDA model
  3.3 Topic Parsimonization
  3.4 Measuring topical diversity
  3.5 Measuring interestingness
  3.6 Group diversity
4 Results
5 Conclusion


1 Introduction

John Naisbitt stated in 1984 that “we are drowning in information but starved for knowledge” [16]. Thirty years later this is especially true: since 1986 we have stored more than 295 exabytes, which is 295 billion gigabytes [11]. There is clearly an abundance of information, and because storage capacity will continue to grow, people tend to keep storing information [12]. The problem, however, is that there is not enough manpower to process all of this information. One way to process large collections of text is topic modelling using Latent Dirichlet Allocation (LDA), introduced by [6]. With LDA one can measure the diversity of a collection, and based on these measures the collection becomes easier to explore. In [7] it is shown that text interestingness is correlated with the topical diversity of texts. In this study, this relation between interestingness and topical diversity is investigated with respect to parliamentary debates. The main research question is:

”Do topical diversity and importance of debates correlate?”.

In addition to this question this study also investigates the role of different groups in topical diversity, since it is likely that when a debate is topically diverse, this is caused by people who represent different interests, such as the parties represented in the House. This question is expressed as follows:

”Do different groups contribute to topical diversity of a debate?”.

To answer the main research question, interestingness and topical diversity of the debates are measured independently in order to calculate their correlation. This is done on the parliamentary debates of Canada. For measuring the interestingness the method described in [13] is employed, and for the topical diversity the method from [2]. To answer the question regarding group diversity, each debate is divided into two parts representing the coalition and the opposition. The correlation between the topic distributions of these parts and of the entire debate can then be calculated to tell how much group diversity contributes to topical diversity.

2 Related Work

In this section Latent Dirichlet Allocation, topical diversity and debate interestingness are explained.

2.1 Latent Dirichlet Allocation

The goal of topic modeling is to discover the topic structure of a collection of documents automatically. This topic structure consists of the topics, the topic distribution over the documents in the collection and the distribution of the words over each topic [5]. Latent Dirichlet Allocation (LDA) is one example of topic modeling, introduced by [6] and used in a variety of studies such as finding scientific topics [10], tag recommendation [15] and spam filtering [3].

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as probability distributions over latent topics, where each topic is a distribution over words [6]. LDA assumes a document is generated as follows. First a distribution over the topics is drawn. Then, for each word, a topic assignment is chosen and finally a word from the corresponding topic is chosen. This way, LDA reflects the intuition that documents have multiple topics by representing a document as a probability distribution over latent topics.
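To make this generative story concrete, the following is a minimal sketch in Python of how a single document would be sampled under LDA; the number of topics, the priors and the toy vocabulary are illustrative assumptions, not values used in this thesis.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setup: 3 topics over a 6-word vocabulary.
n_topics = 3
vocab = ["budget", "farm", "health", "tax", "crop", "hospital"]
alpha = np.full(n_topics, 0.1)                              # document-topic prior
beta = rng.dirichlet(np.full(len(vocab), 0.01), n_topics)   # per-topic word distributions

theta = rng.dirichlet(alpha)            # 1) draw the document's topic distribution
document = []
for _ in range(20):                     # 2) for each word position in the document
    z = rng.choice(n_topics, p=theta)           # draw a topic assignment
    w = rng.choice(len(vocab), p=beta[z])       # draw a word from that topic
    document.append(vocab[w])
print(document)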

2.2 Topical diversity

Diversity is of pervasive interest [18], for example in the field of recommender systems where diversity can be used to reflect the user’s interests [21]. [2] studied the diversity of documents based on their content. Other studies related to text-based diversity are in fields such as information retrieval [9][20]. In this study diversity is used to determine the topical diversity of a debate, and whether or not this diversity correlates with the definition of interestingness given by [13]. When a debate is topically diverse it is more likely to be important, or interesting. This intuition is shown by [7], where the problem of automatic prediction of text interestingness is discussed and an approach for quantifying it with the aid of topic diversity is presented.

2.3 Debate interestingness

[13] described the interestingness of a debate as the probability that the public finds a debate of great significance. In this study this definition of interestingness is used to see if there is a correlation between the topical diversity of a debate and its interestingness. [13] modelled interestingness as a three-dimensional construct: the first dimension is the intensity of the debate, the second is constituted by the quantity and quality of key players in a debate, and the third is formed by the debate length. Based on these three dimensions [13] were able to rank documents on interestingness. They show that debate intensity and key players are the best indicators of debate interestingness.

3 Methods

In this section we describe the dataset, the pre-processing steps and the methods used for measuring the topical diversity and interestingness of debates, as well as the measures for calculating group diversity. The source code is available online: https://www.dropbox.com/s/6m0txqjg4e2dz2k/Thesis2.ipynb?dl=0.


3.1 Dataset

The dataset consists of 9,174 XML files, which are Canadian debates from 1994 to 2014. This dataset is publicly available at https://openparliament.ca/ or http://search.politicalmashup.nl. Each file has three tags under the root: docinfo, meta and proceedings. Docinfo provides some information which is the same for each file. In meta, the date, title and description of a debate are stored, as well as the official link to the debate. In proceedings, the actual debate is stored. Each debate consists of one or more topics and each topic is divided into one or more scenes. Each scene consists of multiple speech tags, which contain the actual content of the debate. Every speech tag has attributes for the person who said it, their party and several others such as district, gender and time. In Listing 1 a simplified example of a file is shown.

<root>
  <docinfo xmlns="http://www.politicalmashup.nl/docinfo"> ... </docinfo>
  <meta>
    <dc:date>1994-01-19</dc:date>
    <dc:format>text/xml</dc:format>
    ...
  </meta>
  <proceedings pm:id="ca.proc.d.19940119-1013">
    <topic ... >
      <scene ... >
        <speech pm:speaker="Ian McClelland" ... >
          <p>Mr. Speaker ...</p>
        </speech>
      </scene>
    </topic>
  </proceedings>
</root>

Listing 1: Simplified file example

3.2 Creating LDA model

The first step towards the LDA model was to filter out the speech of each debate. This is done by selecting the content of each speech tag in the XML file. Each filtered debate was saved as a new file. The second step was the pre-processing of these new files, which consists of multiple steps. The first step was to lowercase each word, remove punctuation and tokenize the debate. The second step was removing the stopwords and applying part-of-speech tagging to keep only the nouns. The POS tagger that was used is from the NLTK package, which is available at http://www.nltk.org/. We filtered out everything but nouns because nouns typically bear argument functions [19]. It is thus a good way to capture thematic information, and it ignores opinions, which are not interesting for this study. After selecting only the nouns, they were lemmatized and the files with fewer than 100 words remaining were removed. This resulted in removing 121 files, leaving 9,053 files. A sketch of this pipeline is shown below.
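The following is a possible implementation of the extraction and pre-processing steps described above: the text of each speech element is collected, lowercased, tokenized, stripped of stopwords and punctuation, POS-tagged to keep only nouns, and lemmatized. The tag name speech matches Listing 1; file handling and namespace handling are illustrative assumptions, not the original notebook code.

import string
from xml.etree import ElementTree as ET

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Required NLTK resources (download once):
# nltk.download("punkt"); nltk.download("stopwords")
# nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def extract_speech_text(xml_path):
    """Concatenate the text of all <speech> elements in one debate file."""
    root = ET.parse(xml_path).getroot()
    parts = []
    for element in root.iter():
        if element.tag.endswith("speech"):          # tolerate namespace prefixes
            parts.append(" ".join(element.itertext()))
    return " ".join(parts)

def preprocess(text):
    """Lowercase, tokenize, remove stopwords/punctuation, keep lemmatized nouns."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in nltk.word_tokenize(text) if t not in STOP]
    nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
    return [LEMMATIZER.lemmatize(n) for n in nouns]

# docs = [preprocess(extract_speech_text(p)) for p in xml_paths]
# docs = [d for d in docs if len(d) >= 100]   # drop files with fewer than 100 words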


After these pre-processing steps the id2word mapping for the LDA model is created with the aid of the gensim package, which is available at https://radimrehurek.com/gensim/. The top 100 most frequently occurring words in the corpus are removed from this mapping, as well as the top 100 words by document frequency and the words which occur in fewer than 5 documents. The result is an id2word mapping with 38,002 unique words.

With this id2word mapping ready, the LDA models are created for 100, 75, 50 and 25 topics. After some experiments the number of iterations was set to 1000 and the number of passes to 10, since this produced the best LDA models.
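A possible way to build the id2word mapping and train the models with gensim is sketched below; the use of filter_extremes and filter_n_most_frequent to approximate the frequency-based pruning described above, as well as the variable names, are assumptions about how the original notebook did it.

from gensim import corpora, models

def build_dictionary(docs):
    """docs: list of token lists produced by the pre-processing step."""
    id2word = corpora.Dictionary(docs)
    # Drop words occurring in fewer than 5 documents.
    id2word.filter_extremes(no_below=5, no_above=1.0)
    # Drop the most frequent words (by document frequency).
    id2word.filter_n_most_frequent(100)
    return id2word

def train_lda(docs, id2word, num_topics):
    corpus = [id2word.doc2bow(doc) for doc in docs]
    return models.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics,
                           iterations=1000, passes=10)

# lda_models = {k: train_lda(docs, id2word, k) for k in (25, 50, 75, 100)}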

3.3 Topic Parsimonization

In order to improve the LDA models, topic parsimonization was applied to the probability vectors of each debate. The algorithm for topic parsimonization is the one used by [14], but applied to the topic probabilities of documents instead of the term distributions of documents. Topic parsimonization consists of two steps and a fixed number of iterations. In the e-step (1) the maximum likelihood estimates are used to estimate each topic probability. This step benefits topics that occur relatively more frequently in the document than in the whole collection. In the m-step (2) these estimates are normalized. At the end of each iteration the topics with a probability lower than a threshold are removed. Topic parsimonization thus concentrates the probability mass on fewer topics in comparison to the previous probability vectors, because topics that are assigned to many documents (P(t|C)) are penalized in the e-step, drop below the threshold and are removed during the parsimonization process.

$$\text{E-step:}\quad e_t = tf_{t,D}\,\frac{\alpha P(t\mid D)}{\alpha P(t\mid D) + (1-\alpha)P(t\mid C)} \qquad (1)$$

$$\text{M-step:}\quad P(t\mid D) = \frac{e_t}{\sum_{t} e_t} \qquad (2)$$

where $tf_{t,D}$ is the topic probability of topic t in document D, $P(t\mid D)$ is $tf_{t,D}$ divided by the sum of $tf_{t,D}$ over all topics assigned to the document, and $P(t\mid C)$ is the sum of $tf_{t,D}$ over every document to which topic t is assigned, divided by the number of documents. The penalty α was set to 0.01, the threshold to 0.00005 and the number of iterations to 100. The next steps are measuring the diversity of each debate based on these four LDA models and the importance of each debate.
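The following is a minimal sketch of how equations (1) and (2) could be applied to the topic probability vector of a single debate, using the reported parameter values; the handling of the background distribution P(t|C) and of topics that drop below the threshold is an interpretation, not the original implementation.

import numpy as np

def parsimonize(tf, p_C, alpha=0.01, threshold=5e-5, iterations=100):
    """tf: topic probabilities of one debate (length = number of topics);
    p_C: background probability of each topic over the whole collection."""
    tf = np.asarray(tf, dtype=float)
    p_C = np.asarray(p_C, dtype=float)
    p_D = tf / tf.sum()                                   # initial P(t|D)
    for _ in range(iterations):
        # E-step (1): boost topics that are relatively more frequent in the
        # debate than in the collection.
        denom = alpha * p_D + (1 - alpha) * p_C
        e = np.where(denom > 0,
                     tf * alpha * p_D / np.where(denom > 0, denom, 1.0),
                     0.0)
        # M-step (2): renormalize.
        p_D = e / e.sum()
        # Remove topics whose probability falls below the threshold.
        p_D = np.where(p_D < threshold, 0.0, p_D)
        p_D = p_D / p_D.sum()
    return p_D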

3.4 Measuring topical diversity

For measuring the topical diversity of debates the method proposed in [2] is used, which estimates the diversity of a document D using Rao’s coefficient [17]:

$$\mathrm{div}(D) = \sum_{i=1}^{T}\sum_{j=1}^{T} p_i^D\, p_j^D\, \delta(i,j) \qquad (3)$$

where T is the number of topics, $p_i^D$ and $p_j^D$ are the probabilities of assigning topics i and j to document D, and δ(i, j) is the distance (dissimilarity) of topics i and j. For this distance metric angular similarity is used, because it is a proper distance metric and has been popularly used in information retrieval for semantic analysis of text documents [4][1]. δ(i, j) is calculated as follows:

$$\delta(i,j) = \frac{\cos^{-1}(i,j)}{\pi} \qquad (4)$$

where $\cos^{-1}(i,j)$ denotes the inverse cosine (arccos) of the cosine similarity of topics i and j. To calculate the similarity of topics, we identify a topic i with the vector consisting of all $p_i^D$ for all documents D in the collection.
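A possible implementation of equations (3) and (4) is sketched below; the representation of each topic by its vector of probabilities over all documents follows the description above, and the variable names are illustrative.

import numpy as np

def angular_distance(u, v):
    """delta(i, j) = arccos(cosine similarity) / pi, a value in [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def rao_diversity(p, topic_vectors):
    """p: topic-probability vector of one debate;
    topic_vectors: matrix with one row per topic (its probabilities over all documents)."""
    div = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if p[i] > 0 and p[j] > 0:
                div += p[i] * p[j] * angular_distance(topic_vectors[i], topic_vectors[j])
    return div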

3.5 Measuring interestingness

In order to calculate the interestingness of the debates, seven features are used, calculated in the same way as described in [13]: the number of speakers, the percentage of members present, the presence of the prime minister, the presence of the deputy prime minister, the number of floor leaders speaking, the word count and the closing time. According to [13], these features were the best indicators for the importance of a debate. The importance of a debate D is then calculated as:

$$I(D) = \sum_{i=1}^{7} w_i \cdot f_i \qquad (5)$$

where $f_i$ is a feature and $w_i$ is the multivariate linear regression value of $f_i$ in the trained model reported by [13] for assigning interestingness values to debates; the sum is taken over the seven features mentioned above.
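A minimal sketch of equation (5) follows; the feature names and weights are placeholders, since the actual regression values from [13] are not reproduced in this thesis.

def interestingness(features, weights):
    """features, weights: dicts keyed by the seven feature names;
    returns the weighted sum I(D) of equation (5)."""
    return sum(weights[name] * features[name] for name in weights)

# Hypothetical call, with made-up feature values purely for illustration:
# I_D = interestingness(
#     {"speakers": 120, "members_present": 0.8, "pm_present": 1, "deputy_pm_present": 1,
#      "floor_leaders": 4, "word_count": 31000, "closing_time": 1},
#     weights_reported_in_13)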

3.6 Group diversity

After calculating the Pearson correlation between topical diversity and interestingness, we focus on diversity between groups. It is likely that when a debate is topically diverse, this is caused by people who represent different interests, such as the parties represented in the House. It goes without saying that the coalition and the opposition are represented at each debate and will try to pursue their party’s interests. It is thus likely that the topical diversity of a debate is caused as a result of this endeavour. We focus on the coalition and the opposition. In order to calculate this group diversity, each debate is divided into a coalition part and an opposition part based on the speaker. Then, based on the LDA models, a probability distribution over topics is created for both parts. Based on these probability vectors the Jensen–Shannon divergence (JSD) [8] is calculated for three situations, namely coalition versus general, opposition versus general and coalition versus opposition. The general probability vector is the topic distribution assigned to the whole debate used in section 3.4.

$$JSD(P \,\|\, Q) = 0.5\, KLD(P \,\|\, M) + 0.5\, KLD(Q \,\|\, M) \qquad (6)$$

where $M = 0.5(P + Q)$ and $KLD(P \,\|\, M)$, $KLD(Q \,\|\, M)$ are the Kullback–Leibler divergences between P and M, and between Q and M. P and Q are the probability vectors of the coalition part of a debate, the opposition part of a debate or a whole debate.
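A possible implementation of equation (6) is given below; note that scipy's jensenshannon function returns the square root of this quantity, so the divergence is computed here directly from two Kullback–Leibler divergences. The vector names in the usage comment are illustrative.

import numpy as np

def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q), skipping zero entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence between two topic probability vectors."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# e.g. jsd(coalition_vector, debate_vector), jsd(opposition_vector, debate_vector),
#      jsd(coalition_vector, opposition_vector)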

4 Results

First, it must be mentioned that the topic parsimonization step (section 3.3) turned out to be a wrong decision with respect to calculating topical diversity. This step was conducted to reduce the inevitable noise, but there are some types of debates (such as a debate where the government answers petitions) where the topic probability vector consists of a lot of small probabilities. These probabilities were filtered out during the parsimonization step, so this step was reverted and the results presented here are without topic parsimonization.

Table 1: Showcase of topical diversity on two files

File: ca.proc.d.19991214
  Subject: Routine proceedings, statements by members, oral question and government orders
  Number of words: 314586
  Number of lemmatized nouns: 82882
  Number of topic tags: 6
  Interestingness: 0.0413890564494
  Diversity: 0.157578161636
  Probability vector (topic, probability):
    [0, 0.2676450934523554], [1, 0.01652837338582317], [3, 0.0253779593802695],
    [18, 0.0119699958917873], [26, 0.0124263585879280], [30, 0.0970119116911362],
    [31, 0.0331682572503551], [38, 0.0212052595246803], [39, 0.4094926719339155],
    [40, 0.0323368601995707], [46, 0.0256382379173712]

File: ca.proc.d.20130423
  Subject: Standing Committee on Agriculture and Agri-Food
  Number of words: 57662
  Number of lemmatized nouns: 15207
  Number of topic tags: 1
  Interestingness: 0.00762147784718
  Diversity: 0.0719757081831
  Probability vector (topic, probability):
    [2, 0.06824049564695601], [5, 0.7855067369108776], [6, 0.024907819359605156],
    [7, 0.011282848073440736], [9, 0.03306117790730514], [41, 0.035327454902111]

In Table 1 two different debates are shown, one with a low diversity and one with a high diversity, to illustrate the LDA model with 50 topics. It is clear that all values are consistent: not only is the subject very diverse for the first debate, but the number of words and topic tags is also high, which are both good indicators for topical diversity. This is reflected in the probability vector, where there are 11 topics, all with smaller probabilities. This debate has topic 39 with the highest probability; the top 10 words of this topic are ”nation, treaty, province, constitution, columbia, referendum, indian, land, court, newfoundland”, thus government related, and this is in line with a part of the debate (government orders and routine proceedings). The second debate has topic 5 with a high probability, and the top 10 words for that topic are ”farm, agriculture, production, market, food, management, farmer, crop, ontario, agri”, which is also in line with the subject of the debate.

Table 2 shows the top 5 most diverse debates in the Canadian parliament. The most diverse debate is one where witnesses talk with regard to the consideration of the pre-budget consultation report of the House. A great number of witnesses talked about a wide variety of topics, like chemicals, export, radiology, film producers and a lot more. A great number of questions was asked to these speakers as well, resulting in a diverse debate. The other four debates shown in Table 2 are all meetings of a standing committee. In these kinds of debates a number of speakers get the chance to discuss certain subjects regarding the standing committee, and questions can also be asked, thus resulting in diverse debates as well. These debates score low on interestingness because the prime minister, deputy prime minister and the floor leaders are never present at these types of debates. These debates also take much less time than other types of debates, so the closing time is always zero. This results in debates with a low interestingness.

Table 2: Top five diverse debates

Topic                                           Diversity   Interestingness
Pre-budget consultation report                  0.224       0.019
Foreign Affairs and International Development   0.219       0.007
Industry, Science and Technology                0.218       0.013
Industry, Science and Technology                0.218       0.015
Industry, Science and Technology                0.217       0.012

Table 3: Top five most interesting debates

Topic                                     Interestingness   Diversity
Government’s response to 14 petitions     0.523             0.183
Sustainable Development Technology        0.509             0.168
Gas emissions                             0.507             0.179
Special Import Measures                   0.505             0.200
Skills development                        0.487             0.197

Table 3 shows the top 5 most interesting debates. These debates still have a high diversity, with an average diversity of 0.185. The main difference with the top 5 diverse debates is, however, that the prime minister is present at all of the top 5 most interesting debates, as well as the deputy prime minister and all the floor leaders. This is in contrast to the top 5 diverse debates, where the prime minister, deputy prime minister and floor leaders were not present. Also, in the most interesting debates the word count is almost three times as high as in the most diverse debates. This is consistent with the findings in [13], which showed that debate intensity and key players are the best indicators for debate interestingness. Furthermore, these debates all focus on different bills and are therefore a completely different type of debate than the standing committees of the top 5 most diverse debates.

Table 4 shows the Pearson correlation values of diversity and interestingness, both in total and per feature. Figure 1 shows the scatter plot of interestingness against diversity. These values are lower than expected. This can be explained by the fact that interestingness and diversity represent different characteristics of debates. Diversity represents to what extent a debate is focused on one topic, while interestingness represents to what extent a debate is interesting for the public, the latter primarily in relation to the presence of the prime minister, deputy prime minister and floor leaders, which are the main sources of interestingness. A debate which takes a long time is more likely to be focused on just a single or a few topics. Thus our intuition that when a debate is topically diverse it is more likely to be interesting does not hold, since the main sources of interestingness are defined by characteristics which do not influence diversity.

Table 4: Pearson correlation of debates’ diversity and interestingness features, with 100, 75, 50 and 25 LDA topics respectively

Feature                  100      75       50       25
All features             -0.021   0.058    0.101    0.150
Number of speakers       0.016    0.063    0.086    0.119
Prime minister           -0.001   0.052    0.077    0.102
Deputy prime minister    0.0002   0.028    0.044    0.093
Floor leaders            -0.006   0.067    0.101    0.114
Word count               -0.103   -0.022   0.076    0.186
Closing time             -0.024   -0.044   -0.068   -0.106


Figure 1: Scatter plot of interestingness (y-axis) against diversity (x-axis). Each point in the plot corresponds to a debate with 50 LDA topics.

Figure 2 shows the Jensen–Shannon divergence between groups during various parliaments. In all but two parliaments the JSD score of the coalition versus opposition is the highest, followed by the coalition versus general. The first position of the coalition versus opposition makes sense since, needless to say, these two groups differ the most in opinion. The second position of the coalition versus general is a bit surprising, because it indicates that the opposition differs less from the entire debate, while one would expect this to be the coalition, especially since the coalition holds the majority of the House. However, this might be explained by the fact that the opposition has an average of 2315 nouns per debate while the coalition has an average of 1541 nouns per debate.

The two parliaments that differ from the others are the 37th and the 38th parliaments. Not only are the results for these two periods reversed, they are also much higher than for the others. After investigation no obvious reason was found for this difference, although both periods stand apart in some way. The 37th period stands apart because it is the only period during which two parties merged: the Canadian Alliance and the Progressive Conservatives merged to create the Conservative Party of Canada. The 38th period stands apart because there were only 150 debates during this period, since the parliament was dissolved after a vote of no confidence just a year after the elections. But no obvious evidence was found that these particularities are the cause of the differences.


Figure 2: Bar charts for the average Jensen–Shannon divergence between groups during various parliament periods with 50 LDA topics.

Table 5: Pearson correlation of topical diversity and group diversity, with 100, 75, 50 and 25 LDA topics respectively

Groups                      100     75      50      25
Coalition and General       0.169   0.147   0.202   0.178
Opposition and General      0.150   0.124   0.128   0.149
Coalition and Opposition    0.176   0.138   0.238   0.179

Table 5 shows the Pearson correlation values of the group diversity against the topical diversity of the debates. It shows that the diversity between groups has a weak positive correlation with the topical diversity of a debate. This is consistent with expectations, because when there is a difference between the topics of, for example, the coalition and the general debate, it is likely that the topical diversity is greater. Also, the fact that the correlation between opposition and general is lower makes perfect sense, since we have seen in Figure 2 that the divergence between these two is also lower. In conclusion, group diversity does contribute to topical diversity, even though the correlation is weak and lower than one would expect.


5 Conclusion

This study has investigated the correlation between topical diversity and interestingness on Canadian proceedings with the research question ”do topical diversity and importance of debates correlate”. Interestingness was measured using the seven features described in [13], and topical diversity was measured with the method described in [2], applied to the probability vectors created by an LDA model trained on the parliamentary proceedings. The Pearson correlation between topical diversity and interestingness was calculated, and the results show a low correlation. This low correlation can be explained by the fact that the main sources of interestingness do not influence topical diversity.

Furthermore, the role of group diversity was investigated with respect to the topical diversity of a debate with the research question ”do different groups contribute to topical diversity of a debate”. For this, the Jensen–Shannon divergence was calculated between the probability vectors of different groups, in particular the coalition and the opposition. The Pearson correlation between the Jensen–Shannon divergence and topical diversity was calculated, and the results show a weak correlation. Thus group diversity does contribute to topical diversity, even though the correlation is weak and lower than one would expect.

The results indicate a low correlation between topical diversity and interestingness; however, a couple of factors may have influenced these findings. First of all, different types of debates may have influenced the findings: standing committees are very diverse but not interesting because no key players are present at these kinds of debates. Other types of debates found are bills, which are both diverse and interesting, and messages from the Senate and voting procedures, which are neither diverse nor interesting. The type of debate is likely to influence the correlation scores. Despite this knowledge, there was no easy way to retrieve the type of debate. This aspect should be taken into account in further research. Secondly, for measuring the interestingness the weights reported by [13] are used, but it is possible that these values do not apply to Canadian debates, since [13] used Dutch debates. Finally, we could also have considered a topic tag as a debate instead of the entire file. Each topic tag contains speech about one topic; for example, in a debate where members can ask questions or give statements, each question or statement and the responses are in one topic tag. The entire file is thus diverse while each topic tag on its own is not.

Bibliography

[1] Tan Apaydin and Hakan Ferhatosmanoglu. Access structures for angular similarity queries. IEEE Transactions on Knowledge and Data Engineering, 18(11):1512–1525, 2006.

[2] Kevin Bache, David Newman, and Padhraic Smyth. Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 23–31. ACM, 2013.

[3] István Bíró, Dávid Siklósi, Jácint Szabó, and András A. Benczúr. Linked latent Dirichlet allocation in web spam filtering. In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, pages 37–40. ACM, 2009.

[4] Paul V. Biron and Donald H. Kraft. New methods for relevance feedback: improving information retrieval performance. In Proceedings of the 1995 ACM Symposium on Applied Computing, pages 482–487. ACM, 1995.

[5] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[7] Michal Derzinski and Khashayar Rohanimanesh. An information theoretic approach to quantifying text interestingness. 2015.

[8] Bent Fuglede and Flemming Topsøe. Jensen–Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, page 31, 2004.

[9] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Discovering diverse and salient threads in document collections. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 710–720. Association for Computational Linguistics, 2012.

[10] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[11] Martin Hilbert and Priscila López. The world’s technological capacity to store, communicate, and compute information. Science, 332(6025):60–65, 2011.

[12] Peter Hinssen. The New Normal. 2010.

[13] Alexander Hogenboom, Maarten Jongmans, and Flavius Frasincar. Structuring political documents for importance ranking. In Natural Language Processing and Information Systems, pages 345–350. Springer, 2012.

[14] Rianne Kaptein, Rongmei Li, Djoerd Hiemstra, and Jaap Kamps. Using parsimonious language models on web data. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 763–764. ACM, 2008.

[15] Ralf Krestel and Peter Fankhauser. Personalized topic-based tag recommendation. Neurocomputing, 76(1):61–70, 2012.

[16] John Naisbitt and J. Cracknell. Megatrends: Ten New Directions Transforming Our Lives. Warner Books, New York, 1984.

[17] C. Radhakrishna Rao. Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21(1):24–43, 1982.

[18] Andy Stirling. A general framework for analysing diversity in science, technology and society. Journal of the Royal Society Interface, 4(15):707–719, 2007.

[19] Robert Lawrence Trask. Language and Linguistics: The Key Concepts. Taylor & Francis, 2007.

[20] Michael J. Welch, Junghoo Cho, and Christopher Olston. Search result diversity for informational queries. In Proceedings of the 20th International Conference on World Wide Web, pages 237–246. ACM, 2011.

[21] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, pages 22–32. ACM, 2005.
