The Validity of Latent Dirichlet Allocation as a Scientific Instrument
Master Thesis Digital Humanities
13th July 2017
Student: T.R. Anthonio – s271106
Primary supervisor: Prof. Dr. Jan-Willem Romeijn
Secondary supervisor: Dr. Emar Maier
Abstract
Scientific models function as crucial instruments to study theories, phenomena or both.
Models are able to function as instruments because they represent the studied subject together with additional elements, such as mathematical formulations, laws and descriptions. Because models represent the target system rather than copying it exactly, they contain idealizations or assumptions. It is therefore important to investigate whether the results generated by a model would also occur in the real world, where these false assumptions do not hold. In parallel, scholars are often uncertain how to build a given model. Most models cannot be constructed without making decisions about the input of the model, that is, the data set and the parameter settings. Frequently, there is no theory concerned with the model-building process of a specific model. For these two reasons, it is important to investigate the validity of any scientific model.
In this study, we investigated the validity of the most popular topic model, which is Latent
Dirichlet Allocation (LDA). LDA is an algorithm that infers the hidden topic structure
of a collection of documents by using conditional probability. In order to investigate the
validity of LDA, we conducted two experiments. In our first experiment, we created dif-
ferent LDA models with distinct parameter settings to investigate whether they yielded
conflicting images of the same corpus. In our second experiment, we created an accurate
distribution of topics that reflected both the correct proportion and the content of the
five topics of our corpus according to our knowledge about the data. We compared this
distribution with the output of an LDA model that inferred five topics. In particular,
we determined the Hellinger distance between these two distributions to assess whether a
normal run of the LDA model of five topics corresponded to the accurate distribution of
topics. In addition, we investigated whether adaptations in the parameter settings would bring us closer to the accurate representation of the topics. Our findings suggest that the results of LDA should be assessed critically, as they vary under adaptations in subjective input. We argue that these variations could be related to the idealizations that the model makes. In addition, our results showed that the model does not manage to provide the topics that pervade a corpus as we see them. Moreover, without any adaptations in the sampling procedure of the model, it provides different topic representations of the same input data. We conclude that these issues are problematic for the model to function as a valid instrument to formulate and correct theories.
Contents
1 Introduction
  1.1 Models
  1.2 Uncertainty
  1.3 Purpose
2 Latent Dirichlet Allocation
  2.1 Characteristics
    2.1.1 Model assumptions
    2.1.2 Generative model
    2.1.3 Inverted generative model
    2.1.4 Gibbs sampling
  2.2 LDA in scientific inquiry
    2.2.1 Machine learning
    2.2.2 Humanities
    2.2.3 Evaluation methods
  2.3 Research question
3 Models
  3.1 Representation and function
  3.2 Computational models
    3.2.1 Simulation models
    3.2.2 Machine learning models
  3.3 Idealization
  3.4 False models
  3.5 Robustness analysis
4 Method
  4.1 Overall approach
  4.2 Parameter selection and notation
  4.3 Data set
  4.4 Experiment 1
    4.4.1 Set-up
    4.4.2 Operationalization
    4.4.3 Additional data set
    4.4.4 Scripts and sampling
  4.5 Experiment 2
    4.5.1 Set-up
    4.5.2 Script
5 Results
  5.1 Experiment 1: results on BBC data set
    5.1.1 Topic-terms
    5.1.2 Semantic coherence
    5.1.3 Overall distribution
  5.2 Experiment 1: results on Wikipedia data set
    5.2.1 Topic-terms
    5.2.2 Semantic coherence
  5.3 Overall distribution
  5.4 Experiment 2
6 Discussion
  6.1 Experiment 1
    6.1.1 Parameter robustness
    6.1.2 Topic representation
  6.2 Experiment 2
    6.2.1 Topic accuracy
    6.2.2 Hellinger distance
7 Conclusion
  7.1 Summary
  7.2 Further research
  7.3 Final statement
Appendices
  A Script 1
  B Script 2
  C Script 3
  D Script 4
Chapter 1
Introduction
This chapter introduces the most important concepts of this thesis. In particular, it will provide a brief explanation of scientific models, model uncertainty and topic modeling.
Furthermore, this chapter outlines the purpose of this work. The relevance of this study will be addressed as well.
1.1 Models
Models are important tools in our daily and professional lives. For instance, travelers use a map to navigate. Students use graphs, tables and illustrations to memorize concepts. A biology professor uses an anatomical model to illustrate and explain various parts of the human body during a lecture. Finally, car engineers use a concept car to showcase a future end-product.
In science, models function as instruments to learn something about the world. Frequently, these models are generated by means of computational tools. The most important reason to exploit computational models is that they can be used to examine something that cannot be observed directly by humans. In this light, a computational model can be seen as an instrument that provides access to the object of study. One way to use a computational model in this manner is to create a simulation of the phenomenon or object that is being studied. For example, archaeologists create simulations of human behavior in a landscape to examine the past (Brouwer Burg et al., 2016; Guimil-Farina and Parcero-Oubina, 2015; Liefferinge et al., 2014; Verbrugghe et al., 2017). Similarly, ecologists use computational simulations to study rare weather conditions (Cheng et al., 2013; Grimm and Berger, 2016; MacLeod et al., 2005; Ruczynski et al., 2010).
In addition to simulations, statistical models of textual data have also become crucial
to extend the human vision. Originally, these algorithms were developed to navigate
through the World Wide Web. Since the massive increase in available digital information,
there has been a huge demand for algorithms that categorize, search and manage (big) data. Now, these algorithms are used as research methods to learn something about digitized texts and the phenomena that they represent. Since humans are not able to read large amounts of data in a workable period of time, these tools function as instruments to read, analyze and understand texts. This approach is currently used in the humanities, where scholars use such texts to learn something about their disciplines, such as literature, language, journalism and history.
1.2 Uncertainty
One type of statistical model that is used in this way is a topic model. A topic model represents the thematic structure of a large collection of documents (Blei and Jordan, 2003; Blei, 2012). Currently, the most popular topic-model algorithm is Latent Dirichlet Allocation (Blei and Jordan, 2003). In fact, the term topic modeling has become nearly synonymous with Latent Dirichlet Allocation (LDA). In LDA, a topic is defined as a list of words that likely occur in a collection of digitized documents. In order to retrieve the topics of a document, LDA attempts to find the set of topics that most likely generated the collection of documents. In this way, LDA functions as an instrument that builds a representation of the thematic structure of a large collection of documents.
Although Latent Dirichlet Allocation has promising applications for different disciplines, it is important to consider the extent to which this model accurately represents the target system. All models contain an aspect of uncertainty because they are representations of something else, rather than the real thing. Some models may only represent a part of the target system. For instance, a map may show the roads of a city, but not its vegetation. This is unproblematic if one is not interested in vegetation, but problematic if vegetation is the subject of study. Furthermore, a model of a human body may not correspond to the actual size of the human body; often, such models are scaled down to facilitate observation. Moreover, in the case of weather simulations, some assumptions of the model may be simplified or exaggerated. Consequently, the model might not be a valid tool to investigate the weather, because the assumptions of the model do not hold in reality.
It is of great significance to identify the components of a model that do not correspond to the real world, because this helps determine the ability of a model to function as a valid instrument of science. If the assumptions of the model do not correspond to the real world, then the results of the model may also not occur in the real world. More importantly, without awareness of the validity of the model, an idealized component of the model may lead to a (partially) false image of the studied object. As a consequence, scholars obtain incorrect knowledge about the subject and lack awareness that their results should be reviewed critically. Idealized components in LDA may likewise lead to incorrect knowledge about the corpus. For example, the model could show the topics film, music, theatre, whereas the real topics of the corpus are film, sport, opera.
In addition to incorrect representations, a model can also function as an invalid tool by providing inconsistent images of the same phenomenon or object. One component that impacts this is the choice of parameter settings. It can be the case that a value of 20 for a certain parameter yields a very different image than a value of 21. As a result, it can be rather challenging to determine the parameter settings of a model. LDA is a model that requires the modeler to determine the values of its parameters, yet no general instructions are available for parameter selection. Therefore, both idealization and parameter selection may impact the ability of LDA to function as a scientific instrument.
1.3 Purpose
The purpose of this work is to study the validity of LDA as a scientific tool to build a representation of the topics of a corpus. Our work distinguishes itself from other work because we study model uncertainty in a machine learning algorithm with critical questions about its reliability in the back of our minds. Similar work has been done to study uncertainty in computational simulations. In accordance with Pietsch (2013), we believe that simulation models are different from statistical models that manage big data. The exact differences and what they entail will become more apparent in the remainder of this thesis.
In order to investigate the validity of LDA, we studied model uncertainty in LDA.
This implied that we both considered the robustness of the model and how accurately the
model represents the topics of a corpus. In general, the findings of this work contribute
to the theoretical development of epistemological literature on scientific models. More
importantly, this work can be seen as the first attempt to study model uncertainty in a
computational statistical model. In addition, this work contributes to the field of digital humanities, because it raises awareness of the validity of computational models. Furthermore, it describes a novel approach to verify a statistical model of texts. The results of this work are also relevant for software engineers who use topic modeling in their applications, such as search engines and recommender systems. If our results indicate that LDA fails to provide the correct topics of the corpus, then this will also affect the performance of such applications.
Chapter 2
Latent Dirichlet Allocation
This chapter is divided into two parts. In the first part, we describe the assumptions that the LDA model makes when it generates a representation of the topics of a collection of documents. In addition, we outline the generative process and how it is represented in a mathematical model and a graphical model. In the second part, we describe how LDA functions as an instrument in science by outlining different studies that used LDA as an instrument of scientific inquiry. This section also outlines the metrics that are currently available to measure the validity of LDA. Finally, we present the research questions that we aim to answer in this thesis.
2.1 Characteristics
2.1.1 Model assumptions
We start this chapter by describing the underlying assumptions behind Latent Dirichlet Allocation. The first assumption is that documents contain multiple topics. This makes sense because most documents are composed of several topics, rather than just one. For example, consider the previous chapter of this thesis. One topic that we discussed is models. However, we also talked about digital information. In addition, we devoted the final section to the purpose of this current research. All these themes are different topics and they are all part of Chapter 1. Now, suppose that we would give participants a new version of Chapter 1, after shuffling the words. We also provide them a set of highlighters with the colors green, blue and yellow. Then, we instruct participants to highlight each word that is related to the theme research with the green highlighter.
Similarly, participants ought to use the yellow highlighter to mark words that refer to
models. Finally, any word that is related to digital information ought to be highlighted
with a blue color. We ask participants to find 20 words for each topic and to ignore stop
words and structural words in this process, such as the, and, furthermore, a, would and so on. We also ask them to write down how often each topic occurs by summing up the frequencies with which the words of that topic occur in the document.
LDA attempts to conduct this process automatically for multiple documents. In this process, LDA defines a topic as a distribution over words that likely occur together. So, one assumption that LDA makes is that a topic consists of a distribution of words. It is called a distribution because each topic is represented by its words together with the probability that each word occurs in the given topic. For example, the topic research may consist of the words thesis, study, purpose, findings, work, model. The word thesis may have a probability of 0.08 of occurring in this topic, the word purpose a probability of 0.01, and so on.
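The notion of a topic as a distribution over words can be made concrete with a small sketch. The words and probabilities below are invented for illustration, echoing the research-topic example from the text:

```python
# A topic, as LDA defines it, is a probability distribution over words.
# The words and probabilities below are illustrative, not real LDA output.
topic_research = {
    "thesis": 0.08,
    "study": 0.05,
    "purpose": 0.01,
    "findings": 0.04,
    "work": 0.03,
    "model": 0.06,
}

# The remaining probability mass is spread over all other vocabulary words,
# so the listed values need not sum to 1, but may not exceed it.
assert sum(topic_research.values()) <= 1.0

# The most probable words are what LDA reports as the "topic-terms".
top_terms = sorted(topic_research, key=topic_research.get, reverse=True)[:3]
print(top_terms)  # ['thesis', 'model', 'study']
```

In an actual LDA model this distribution covers the entire vocabulary; only the highest-probability words are shown to the user as topic-terms.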
The second assumption is related to how the LDA model observes our collection of documents. Recall that we gave our participants documents where all the words were shuffled. The reason is that the LDA model views every document as a bag-of-words.
In other words, word order is completely discarded in any LDA model. Hence, another assumption that LDA makes is that all documents are written without any specific word order.
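The bag-of-words assumption is easy to demonstrate: two documents containing the same words in a different order are indistinguishable to LDA. A minimal sketch using only the Python standard library:

```python
from collections import Counter

# Two toy "documents" with the same words in a different order.
doc_a = "models represent the world and the world shapes models"
doc_b = "the world shapes models and models represent the world"

# LDA's bag-of-words view: only word counts matter, order is discarded.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a == bag_b)  # True: under the bag-of-words assumption,
                       # the two documents are identical to the model
```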
The final assumption of the LDA model is that all the documents in the collection share the same set of topics; however, the proportion in which each topic occurs differs per document. Suppose that the model decides that the three previously mentioned topics are the topics of this thesis and that each chapter is treated as one document. It makes sense that Chapter 3, which is entirely devoted to research about models, has a higher proportion of the theme models than of the theme digital information. However, the theme digital information will still be addressed in that chapter. Similarly, we might still use terms that refer to the topic research to describe relevant work.
In sum, there are four crucial assumptions in LDA. First of all, LDA assumes that topics are composed of multiple words. Secondly, it supposes that each document contains multiple topics. Thirdly, all documents are assumed to be bags-of-words. The final assumption implies that each document in the collection contains the same topics. In addition, it is important to be aware of the different layers in which LDA views a collection of documents. The highest layer is the corpus. A corpus is divided into documents, which can be viewed as the second layer in LDA. Each document consists of a distribution of topics. Finally, each topic is composed of a distribution of words.
2.1.2 Generative model
Suppose that we have a set of documents and an overview of three topics. The first
topic consists of the words biology, DNA, genetics. The second topic contains the words
research, data, validity. Finally, the third topic is composed of the words country, war, leader. We also know the probability of each word occurring in a topic. For instance, we know that the word country has a probability of 0.1 of occurring in Topic 3. Now, imagine that we read one of the documents and that the first word that we see is Darwin. We may want to know to which topic this word belongs. In other words, we are asking the question: "What is the probability that the word Darwin occurs in Topic 1?" The same question can be posed for Topics 2 and 3. We can express this question with the following equation:
P(W_D \mid \beta_{1:K}, \theta_{1:D}, Z_{1:D})    (2.1)

In this notation, W_D is the word in a document, which is in our case Darwin. \beta_{1:K} denotes the topics, \theta_{1:D} the per-document topic proportions and Z_{1:D} the per-word topic assignments. The notation represents the conditional probability of a word W in document D, given that we know the topics \beta_{1:K}, the topic proportions \theta_{1:D} and the topic assignments Z_{1:D}.
Technically, LDA assumes that all the topics are generated before the documents.
Therefore, in LDA, each document is created by the following procedure (Blei and Jordan, 2003):
1. Randomly choose a distribution over topics.
2. From this distribution, randomly choose a topic (for each word).
3. From this topic, randomly choose a word.
4. Repeat this process until the required number of words per document is reached.
In the first step, all the topics are drawn from the same distribution. This reflects the assumption that all documents exhibit the same topics. A possible outcome of the first step could be that Topic 1 has a probability of 0.4 of occurring in our collection of documents. For Topic 2, this value could be 0.1, and for Topic 3, 0.3. In the second step, we randomly choose a topic, which is comparable to rolling a three-sided die. A possible outcome of this roll might be Topic 3. Then, the model looks at the distribution of words that belong to this topic. These words were country, war, leader. In Step 3, we randomly choose a word from this list. Consequently, the output of this step could be the word war. Finally, this process is repeated until all the documents are constructed.
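The steps above can be sketched as a short simulation using only the Python standard library. The topic word lists echo the toy example from the text; all probabilities are illustrative, and word choice within a topic is simplified to uniform sampling rather than a learned word distribution:

```python
import random

random.seed(0)

# Three toy topics from the running example; word lists are illustrative.
topics = [
    ["biology", "DNA", "genetics"],     # Topic 1
    ["research", "data", "validity"],   # Topic 2
    ["country", "war", "leader"],       # Topic 3
]

def dirichlet(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k categories,
    built from Gamma draws (the standard construction)."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def generate_document(n_words, alpha=0.5):
    # Step 1: randomly choose this document's distribution over topics.
    theta = dirichlet(alpha, len(topics))
    words = []
    for _ in range(n_words):
        # Step 2: roll the "three-sided die" weighted by theta to pick a topic.
        z = random.choices(range(len(topics)), weights=theta)[0]
        # Step 3: pick a word from that topic (uniform here for simplicity;
        # real LDA samples from the topic's own word distribution).
        words.append(random.choice(topics[z]))
    # Step 4 is the loop above: repeat until the document length is reached.
    return words

print(generate_document(8))
```

Inverting this generative story, i.e. inferring theta and the topic-word distributions from observed documents alone, is exactly what the LDA inference procedure described later in this chapter attempts to do.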
This generative model of LDA can be described by a graphical model. In this notation, a plate (rectangle) is drawn to represent a group of variables that repeat together, rather than drawing each variable separately. The number of repetitions is indicated in the plate. Another important characteristic of the graphical model is that all the observed variables are shaded, and that each variable is treated as a random variable.
Figure 2.1: The graphical model of LDA
The graphical model of LDA is shown in Figure 2.1. Two variables are placed outside the plates: α and η. These parameters are derived from the Dirichlet distribution. A Dirichlet distribution is a distribution over distributions. The meaning of a distribution over distributions can be explained by the following example. Suppose that we have a collection of boxes with sweets. There are three boxes with lollies, one box with chocolates and five boxes with muffins. Now, suppose that we randomly pick five sweets from this collection, and that the result of this sampling is: two muffins, one chocolate bar and two lollies. The result is that we have two 'distributions': a distribution of boxes and a distribution of the sweets that we collected. In LDA, there are two such Dirichlet distributions: one that models the proportion of words for each topic and one that models the proportion of topics in each document. The former is governed by η, and the latter by α. Both η and α represent prior beliefs that the scholar has about his or her data set. For instance, a high value of α reflects the assumption that one document may contain a mixture of several topics, because it may contain words about both sports and entertainment. In contrast, a low value of α reflects the assumption that one document is primarily about sport or primarily about entertainment. η reflects the uniformity of the topic-terms: a high value of η reflects the belief that the topic-terms per topic contain few specific words, whereas a low value of η represents the opposite. Note that these values must be set by the user, in addition to the number of topics, the number of topic-terms per topic and the Gibbs sampling settings (Binkley et al., 2014), which will be described later.
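The effect of the prior α can be illustrated numerically. The sketch below uses only the Python standard library and hypothetical values, building Dirichlet draws from Gamma draws; it shows that a low α concentrates the probability mass on one topic per draw, while a high α spreads the mass evenly over topics:

```python
import random

random.seed(42)

def dirichlet(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k categories."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def mean_max(alpha, k=5, n=2000):
    """Average size of the largest component over n Dirichlet draws."""
    return sum(max(dirichlet(alpha, k)) for _ in range(n)) / n

# Low alpha: most mass lands on a single topic (documents dominated by
# one topic). High alpha: mass spreads out (documents mix many topics).
peaked = mean_max(0.1)
spread = mean_max(10.0)
print(round(peaked, 2), round(spread, 2))
assert peaked > spread
```

Analogous reasoning applies to η, with topics over words in place of documents over topics; this is one concrete sense in which α and η encode the scholar's prior beliefs about the data set.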
Now, consider Figure 2.1. We observe that the topic distribution \theta_d is drawn from α. The topic assignment z_{d,n} determines the observed word w_{d,n} in document d. The other variable that determines the observed words is \beta_k, which in turn depends on η. The number of topics, K, is determined by the user. One can equivalently denote this process as the following joint distribution of latent and observed variables:
\prod_{i=1}^{K} p(\beta_i \mid \eta) \; \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})    (2.2)
The first component of this joint distribution, \prod_{i=1}^{K} p(\beta_i \mid \eta),