The Validity of Latent Dirichlet Allocation as a Scientific Instrument
Master Thesis Digital Humanities
13th July 2017
Student: T.R. Anthonio – s271106
Primary supervisor: Prof. Dr. Jan-Willem Romeijn
Secondary supervisor: Dr. Emar Maier
Abstract
Scientific models function as crucial instruments to study theories, phenomena or both.
Models are able to function as instruments because they represent the studied subject together with additional elements, such as mathematical formulations, laws and descriptions. Because models represent the target system rather than copying it exactly, they contain idealizations or assumptions. It is therefore important to investigate whether the results generated by a model would also occur in the real world, where these false assumptions do not hold. In parallel, scholars are often uncertain how to build a given model. Most models cannot be constructed without making decisions about the input of the model, that is, the data set and the parameter settings. Frequently, there is no theory concerned with the model-building process of a specific model. For these two reasons, it is important to investigate the validity of any scientific model.
In this study, we investigated the validity of the most popular topic model, which is Latent
Dirichlet Allocation (LDA). LDA is an algorithm that infers the hidden topic structure
of a collection of documents by using conditional probability. In order to investigate the
validity of LDA, we conducted two experiments. In our first experiment, we created dif-
ferent LDA models with distinct parameter settings to investigate whether they yielded
conflicting images of the same corpus. In our second experiment, we created an accurate
distribution of topics that reflected both the correct proportion and the content of the
five topics of our corpus according to our knowledge about the data. We compared this
distribution with the output of an LDA model that inferred five topics. In particular,
we determined the Hellinger distance between these two distributions to assess whether a
normal run of the LDA model of five topics corresponded to the accurate distribution of
topics. In addition, we investigated whether adaptations in the parameter settings would bring us closer to the accurate representation of the topics. Our findings suggest that the results of LDA should be assessed critically, as they vary under adaptations in subjective input. We argue that these variations could be related to the idealizations that the model makes. In addition, our results showed that the model does not manage to provide the topics that pervade a corpus as we see them. Moreover, without any adaptations in the sampling procedure of the model, it provides different topic representations of the same input data. We conclude that these issues are problematic for the model to function as a valid instrument to formulate and correct theories.
Contents
1 Introduction
  1.1 Models
  1.2 Uncertainty
  1.3 Purpose
2 Latent Dirichlet Allocation
  2.1 Characteristics
    2.1.1 Model assumptions
    2.1.2 Generative model
    2.1.3 Inverted generative model
    2.1.4 Gibbs sampling
  2.2 LDA in scientific inquiry
    2.2.1 Machine learning
    2.2.2 Humanities
    2.2.3 Evaluation methods
  2.3 Research question
3 Models
  3.1 Representation and function
  3.2 Computational models
    3.2.1 Simulation models
    3.2.2 Machine learning models
  3.3 Idealization
  3.4 False models
  3.5 Robustness analysis
4 Method
  4.1 Overall approach
  4.2 Parameter selection and notation
  4.3 Data set
  4.4 Experiment 1
    4.4.1 Set-up
    4.4.2 Operationalization
    4.4.3 Additional data set
    4.4.4 Scripts and sampling
  4.5 Experiment 2
    4.5.1 Set-up
    4.5.2 Script
5 Results
  5.1 Experiment 1: results on BBC data set
    5.1.1 Topic-terms
    5.1.2 Semantic coherence
    5.1.3 Overall distribution
  5.2 Experiment 1: results on Wikipedia data set
    5.2.1 Topic-terms
    5.2.2 Semantic coherence
  5.3 Overall distribution
  5.4 Experiment 2
6 Discussion
  6.1 Experiment 1
    6.1.1 Parameter robustness
    6.1.2 Topic representation
  6.2 Experiment 2
    6.2.1 Topic accuracy
    6.2.2 Hellinger distance
7 Conclusion
  7.1 Summary
  7.2 Further research
  7.3 Final statement
Appendices
  A Script 1
  B Script 2
  C Script 3
  D Script 4
Chapter 1
Introduction
This chapter introduces the most important concepts of this thesis. In particular, it will provide a brief explanation of scientific models, model uncertainty and topic modeling.
Furthermore, this chapter outlines the purpose of this work. The relevance of this study will be addressed as well.
1.1 Models
Models are important tools in our daily and professional lives. For instance, travelers use a map to navigate. Students use graphs, tables and illustrations to memorize concepts. A biology professor uses an anatomical model to illustrate and explain various parts of the human body during a lecture. Finally, car engineers use a concept car to showcase a future end-product.
In science, models function as instruments to learn something about the world. Frequently, these models are generated by means of computational tools. The most important reason to exploit computational models is that they can be used to examine something that cannot be observed directly by humans. In this light, a computational model can be seen as an instrument that provides access to the object of study. One way to use a computational model in this manner is to create a simulation of the phenomenon or object that is being studied. For example, archaeologists create simulations of human behavior in a landscape to examine the past (Brouwer Burg et al., 2016; Guimil-Farina and Parcero-Oubina, 2015; Liefferinge et al., 2014; Verbrugghe et al., 2017). Similarly, ecologists use computational simulations to study rare weather conditions (Cheng et al., 2013; Grimm and Berger, 2016; MacLeod et al., 2005; Ruczynski et al., 2010).
In addition to simulations, statistical models of textual data have also become crucial
to extend the human vision. Originally, these algorithms were developed to navigate
through the World Wide Web. Since the massive increase in available digital information,
there has been a huge demand for algorithms that categorize, search and manage (big) data. Now, these algorithms are used as research methods to learn something about digitized texts and the phenomena that they represent. Since humans are not able to read large amounts of data in a workable period of time, these tools function as instruments to read, analyze and understand texts. This approach is currently used in the humanities, where scholars use such texts to learn something about their disciplines, such as literature, language, journalism and history.
1.2 Uncertainty
One type of statistical model that is used in this way is a topic model. A topic model represents the thematic structure of a large collection of documents (Blei and Jordan, 2003; Blei, 2012). Currently, the most popular topic-model algorithm is Latent Dirichlet Allocation (Blei and Jordan, 2003). In fact, the term topic modeling has become nearly synonymous with Latent Dirichlet Allocation (LDA). In LDA, a topic is defined as a list of words that likely occur in a collection of digitized documents. In order to retrieve the topics of a document, LDA attempts to find the set of topics that most likely generated the collection of documents. In this way, LDA functions as an instrument that builds a representation of the thematic structure of a large collection of documents.
Although Latent Dirichlet Allocation has promising applications for different disciplines, it is important to consider the extent to which this model accurately represents the target system. All models contain an aspect of uncertainty because they are representations of something else, rather than the real thing. Some models may only represent a part of the target system. For instance, a map may show the roads of a city, but not its vegetation. This is unproblematic if one is not interested in vegetation, but problematic if vegetation is the subject of study. Furthermore, a model of a human body may not correspond to the actual size of the human body; often, such models are scaled down to facilitate observation. Moreover, in the case of weather simulations, some assumptions of the model may be simplified or exaggerated. Consequently, the model might not be a valid tool to investigate the weather, because the assumptions of the model do not hold in reality.
It is of great significance to identify the components of a model that do not correspond to the real world, because this helps determine the ability of a model to function as a valid instrument of science. If the assumptions of the model do not correspond to the real world, then the results of the model may also not occur in the real world. More importantly, without awareness of the validity of the model, an idealized component of the model may lead to a (partially) false image of the studied object. As a consequence, scholars obtain incorrect knowledge about the subject and lack awareness that their results should be reviewed critically. Idealized components in LDA may likewise lead to incorrect knowledge about the corpus. For example, the model could show the topics film, music, theatre, whereas the real topics of the corpus are film, sport, opera.
In addition to incorrect representations, a model can also function as an invalid tool by providing inconsistent images of the same phenomenon or object. One component that impacts this is the choice of parameter settings. It can be the case that a value of 20 for a certain parameter yields a very different image than a value of 21. As a result, it can be rather challenging to determine the parameter settings of a model. LDA is a model that requires the modeler to determine the values of its parameters, yet no general instructions are available for parameter selection. Therefore, both idealization and parameter selection may impact the ability of LDA to function as a scientific instrument.
1.3 Purpose
The purpose of this work is to study the validity of LDA as a scientific tool to build a representation of the topics of a corpus. Our work distinguishes itself from other work because we study model uncertainty in a machine learning algorithm with critical questions about its reliability in the back of our minds. Similar work has been done to study uncertainty in computational simulations. In accordance with Pietsch (2013), we believe that simulation models are different from statistical models that manage big data. The exact differences and what they entail will become more apparent in the remainder of this thesis.
In order to investigate the validity of LDA, we studied model uncertainty in LDA.
This implied that we both considered the robustness of the model and how accurately the
model represents the topics of a corpus. In general, the findings of this work contribute
to the theoretical development of epistemological literature on scientific models. More
importantly, this work can be seen as the first attempt to study model uncertainty in a
computational statistical model. In addition, this work contributes to the field of digital humanities, because it raises awareness of the validity of computational models. Furthermore, it describes a novel approach to verify a statistical model of texts. The results of this work are also relevant for software engineers who use topic modeling in their applications, such as search engines and recommender systems. If our results indicate that LDA fails to provide the correct topics of the corpus, then this will also affect the performance of such applications.
Chapter 2
Latent Dirichlet Allocation
This chapter is divided into two parts. In the first part, we describe the assumptions that the LDA model makes when it generates a representation of the topics of a collection of documents. In addition, we outline the generative process and how it is represented in a mathematical model and a graphical model. In the second part, we describe how LDA functions as an instrument in science by outlining different studies that used LDA as an instrument of scientific inquiry. This section also outlines the metrics that are currently available to measure the validity of LDA. Finally, we present the research questions that we aim to answer in this thesis.
2.1 Characteristics
2.1.1 Model assumptions
We start this chapter by describing the underlying assumptions behind Latent Dirichlet Allocation. The first assumption is that documents contain multiple topics. This makes sense because most documents are composed of several topics, rather than just one. For example, consider the previous chapter of this thesis. One topic that we discussed is models. However, we also talked about digital information. In addition, we devoted the final section to the purpose of this current research. All these themes are different topics and they are all part of Chapter 1. Now, suppose that we would give participants a new version of Chapter 1, after shuffling the words. We also provide them a set of highlighters with the colors green, blue and yellow. Then, we instruct participants to highlight each word that is related to the theme research with the green highlighter.
Similarly, participants ought to use the yellow highlighter to mark words that refer to
models. Finally, any word that is related to digital information ought to be highlighted
with a blue color. We ask participants to find 20 words for each topic and to ignore stop
words and structural words in this process, such as the, and, furthermore, a, would and so on. We also ask them to write down how often each topic occurs by summing up the frequencies with which the words of that topic occur in the document.
LDA attempts to conduct this process automatically for multiple documents. In this process, LDA defines a topic as a distribution over words that likely occur together. So, one assumption that LDA makes is that a topic consists of a distribution of words. It is called a distribution because each topic is represented by its words together with the probability that each word occurs in the given topic. For example, the topic research may consist of the words thesis, study, purpose, findings, work, model. The word thesis may have a probability of 0.08 of occurring in this topic, the word purpose a probability of 0.01, and so on.
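The notion of a topic as a distribution over words can be made concrete with a small sketch. The words and probabilities below are invented for illustration, echoing the research-topic example from the text:

```python
# A topic, as LDA defines it, is a probability distribution over words.
# The words and probabilities below are illustrative, not real LDA output.
topic_research = {
    "thesis": 0.08,
    "study": 0.05,
    "purpose": 0.01,
    "findings": 0.04,
    "work": 0.03,
    "model": 0.06,
}

# The remaining probability mass is spread over all other vocabulary words,
# so the listed values need not sum to 1, but may not exceed it.
assert sum(topic_research.values()) <= 1.0

# The most probable words are what LDA reports as the "topic-terms".
top_terms = sorted(topic_research, key=topic_research.get, reverse=True)[:3]
print(top_terms)  # ['thesis', 'model', 'study']
```

In an actual LDA model this distribution covers the entire vocabulary; only the highest-probability words are shown to the user as topic-terms.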
The second assumption is related to how the LDA model observes our collection of documents. Recall that we gave our participants documents where all the words were shuffled. The reason is that the LDA model views every document as a bag-of-words.
In other words, word order is completely discarded in any LDA model. Hence, another assumption that LDA makes is that all documents are written without any specific word order.
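The bag-of-words assumption is easy to demonstrate: two documents containing the same words in a different order are indistinguishable to LDA. A minimal sketch using only the Python standard library:

```python
from collections import Counter

# Two toy "documents" with the same words in a different order.
doc_a = "models represent the world and the world shapes models"
doc_b = "the world shapes models and models represent the world"

# LDA's bag-of-words view: only word counts matter, order is discarded.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a == bag_b)  # True: under the bag-of-words assumption,
                       # the two documents are identical to the model
```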
The final assumption of the LDA model is that all the documents in the collection share the same set of topics; however, the proportion in which each topic occurs differs per document. Suppose that the model decides that the three previously mentioned topics are the topics of this thesis and that each chapter is treated as one document. It makes sense that Chapter 3, which is entirely devoted to research about models, has a higher proportion of the theme models than of the theme digital information. However, the theme digital information will still be addressed in that chapter. Similarly, we might still use terms that refer to the topic research to describe relevant work.
In sum, there are four crucial assumptions in LDA. First of all, LDA assumes that topics are composed of multiple words. Secondly, it supposes that each document contains multiple topics. Thirdly, all documents are assumed to be bags-of-words. The final assumption implies that each document in the collection contains the same topics. In addition, it is important to be aware of the different layers in which LDA views a collection of documents. The highest layer is the corpus. A corpus is divided into documents, which can be viewed as the second layer in LDA. Each document consists of a distribution of topics. Finally, each topic is composed of a distribution of words.
2.1.2 Generative model
Suppose that we have a set of documents and an overview of three topics. The first
topic consists of the words biology, DNA, genetics. The second topic contains the words
research, data, validity. Finally, the third topic is composed of the words country, war, leader. We also know the probability of each word occurring in a topic. For instance, we know that the word country has a probability of 0.1 of occurring in Topic 3. Now, imagine that we read one of the documents and that the first word that we see is Darwin. We may want to know to which topic this word belongs. In other words, we are asking the question: "What is the probability that the word Darwin occurs in Topic 1?" The same question can be posed for Topics 2 and 3. We can express this question with the following equation:
P(W_D \mid \beta_{1:K}, \theta_{1:D}, Z_{1:D})    (2.1)

In this notation, W_D is the word in a document, which is in our case Darwin. \beta_{1:K} denotes the topics, \theta_{1:D} the per-document topic proportions and Z_{1:D} the per-word topic assignments. The notation represents the conditional probability of a word W in document D, given that we know the topics \beta_{1:K}, the topic proportions \theta_{1:D} and the topic assignments Z_{1:D}.
Technically, LDA assumes that all the topics are generated before the documents.
Therefore, in LDA, each document is created by the following procedure (Blei and Jordan, 2003):
1. Randomly choose a distribution over topics.
2. From this distribution, randomly choose a topic (for each word).
3. From this topic, randomly choose a word.
4. Repeat this process until the required number of words per document is reached.
In the first step, all the topics are drawn from the same distribution. This reflects the assumption that all documents exhibit the same topics. A possible outcome of the first step could be that Topic 1 has a probability of 0.4 of occurring in our collection of documents. For Topic 2, this value could be 0.1, and for Topic 3, 0.3. In the second step, we randomly choose a topic, which is comparable to rolling a three-sided die. A possible outcome of this roll might be Topic 3. Then, the model looks at the distribution of words that belong to this topic. These words were country, war, leader. In Step 3, we randomly choose a word from this list. Consequently, the output of this step could be the word war. Finally, this process is repeated until all the documents are constructed.
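The steps above can be sketched as a short simulation using only the Python standard library. The topic word lists echo the toy example from the text; all probabilities are illustrative, and word choice within a topic is simplified to uniform sampling rather than a learned word distribution:

```python
import random

random.seed(0)

# Three toy topics from the running example; word lists are illustrative.
topics = [
    ["biology", "DNA", "genetics"],     # Topic 1
    ["research", "data", "validity"],   # Topic 2
    ["country", "war", "leader"],       # Topic 3
]

def dirichlet(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k categories,
    built from Gamma draws (the standard construction)."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def generate_document(n_words, alpha=0.5):
    # Step 1: randomly choose this document's distribution over topics.
    theta = dirichlet(alpha, len(topics))
    words = []
    for _ in range(n_words):
        # Step 2: roll the "three-sided die" weighted by theta to pick a topic.
        z = random.choices(range(len(topics)), weights=theta)[0]
        # Step 3: pick a word from that topic (uniform here for simplicity;
        # real LDA samples from the topic's own word distribution).
        words.append(random.choice(topics[z]))
    # Step 4 is the loop above: repeat until the document length is reached.
    return words

print(generate_document(8))
```

Inverting this generative story, i.e. inferring theta and the topic-word distributions from observed documents alone, is exactly what the LDA inference procedure described later in this chapter attempts to do.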
This generative model of LDA can be described by a graphical model. In this notation, a plate (rectangle) is drawn to represent a group of variables that repeat together, rather than drawing each variable separately. The number of repetitions is indicated in the plate. Another important characteristic of the graphical model is that all the observed variables are shaded, and that each variable is treated as a random variable.
Figure 2.1: The graphical model of LDA
The graphical model of LDA is shown in Figure 2.1. Two variables are placed outside the plates: α and η. These parameters are derived from the Dirichlet distribution. A Dirichlet distribution is a distribution over distributions. The meaning of a distribution over distributions can be explained by the following example. Suppose that we have a collection of boxes with sweets. There are three boxes with lollies, one box with chocolates and five boxes with muffins. Now, suppose that we randomly pick five sweets from this collection, and that the result of this sampling is: two muffins, one chocolate bar and two lollies. The result is that we have two 'distributions': a distribution of boxes and a distribution of the sweets that we collected. In LDA, there are two such Dirichlet distributions: one that models the proportion of words for each topic and one that models the proportion of topics in each document. The former is governed by η, and the latter by α. Both η and α represent prior beliefs that the scholar has about his or her data set. For instance, a high value of α reflects the assumption that one document may contain a mixture of several topics, because it may contain words about both sports and entertainment. In contrast, a low value of α reflects the assumption that one document is primarily about sport or primarily about entertainment. η reflects the uniformity of the topic-terms: a high value of η reflects the belief that the topic-terms per topic contain few specific words, whereas a low value of η represents the opposite. Note that these values must be set by the user, in addition to the number of topics, the number of topic-terms per topic and the Gibbs sampling settings (Binkley et al., 2014), which will be described later.
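The effect of the prior α can be illustrated numerically. The sketch below uses only the Python standard library and hypothetical values, building Dirichlet draws from Gamma draws; it shows that a low α concentrates the probability mass on one topic per draw, while a high α spreads the mass evenly over topics:

```python
import random

random.seed(42)

def dirichlet(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k categories."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def mean_max(alpha, k=5, n=2000):
    """Average size of the largest component over n Dirichlet draws."""
    return sum(max(dirichlet(alpha, k)) for _ in range(n)) / n

# Low alpha: most mass lands on a single topic (documents dominated by
# one topic). High alpha: mass spreads out (documents mix many topics).
peaked = mean_max(0.1)
spread = mean_max(10.0)
print(round(peaked, 2), round(spread, 2))
assert peaked > spread
```

Analogous reasoning applies to η, with topics over words in place of documents over topics; this is one concrete sense in which α and η encode the scholar's prior beliefs about the data set.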
Now, consider Figure 2.1. We observe that the topic distribution \theta_d is drawn from α. The topic assignment z_{d,n} determines the observed word w_{d,n} in document d. The other variable that determines the observed words is \beta_k, which in turn depends on η. The number of topics, K, is determined by the user. One can equivalently denote this process as the following joint distribution of latent and observed variables:
\prod_{i=1}^{K} p(\beta_i \mid \eta) \; \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})    (2.2)
The first component of this joint distribution, \prod_{i=1}^{K} p(\beta_i \mid \eta),