

Topic Modeling in Arabic

Root Distribution in Arabic topic modeling

Peter M. Sprenger

A thesis presented for the degree of

Master of Arts

Faculty of Arts

University of Groningen

Netherlands

February 19, 2018


Contents

Acknowledgments i

List of figures ii

List of tables iii

Abstract iv

1 Introduction 1

2 Background Information 4

2.1 Introduction to the Arabic language . . . 4

2.1.1 General information . . . 5

2.1.2 The concept of the root . . . 6

2.2 Introduction to Topic modeling . . . 8

2.2.1 Latent Dirichlet Allocation . . . 9

2.2.2 Methods to evaluate LDA . . . 13

2.3 Research done with LDA . . . 14

2.4 LDA on Arabic texts . . . 15

3 Setup 17

3.1 Overview over the data sets . . . 18

3.2 Research Questions . . . 18

3.3 Technical approach . . . 19

4 Data 20

4.1 The data sets . . . 20

4.1.1 Arabic Bible . . . 20

4.1.2 UN Corpora. . . 21

4.1.3 TED talk subtitles . . . 22

4.2 Segmentation of the files . . . 22

4.3 Further preprocessing . . . 25


5 Experiments 26

5.1 Setup . . . 26

5.2 Choosing k using a stemmer . . . 27

5.3 Topic modeling runs and testing . . . 29

6 Results 32

6.1 Results for k=number of root types . . . 32

6.2 Results: Topic modeling runs . . . 32

6.2.1 Arabic Bible . . . 33

6.2.2 UN Corpora. . . 33

6.2.3 TED talk subtitles . . . 34

7 Discussion 39

8 Conclusion and future work 41

Bibliography 46


Acknowledgments

First and foremost, I want to thank my thesis supervisors Johan Bos, Barbara Plank and Maria Moritz for their continued and helpful guidance throughout the whole process of writing this thesis. Learning how to code during my year in Groningen was an immeasurable gift and something I learnt to really enjoy.

A big “thank you” also goes to Marco Büchler, Ute Pietruschka and everyone in the DH team at the University of Leipzig, thanks to whom my interest in the Digital Humanities grew in the first place. The interaction with all of you made me go to Groningen and study Digital Humanities. Thank you!

Finally, without the support and love of my family, friends and partner, all of this would have never happened. Thank you for always making me feel good!


List of Figures

2.1 The illustration from (Blei 2012, p. 78) visualizes how we can imagine the output of topic modeling. Blei highlights words in an article in Science that LDA would associate with certain topics. 10

2.2 Boyd-Graber’s version of the graphical plate diagram for the latent Dirichlet allocation (Boyd-Graber 2010, p. 8). Each node shows a variable of the LDA process; the labels follow the algorithm on page 11. The hidden variables, like topic proportions, topic assignments, and the topics themselves, are not shaded. The single shaded node represents the observed words in the documents. The plates signify repetitions; the letter in the corner gives the number of repetitions. The bigger plate stands for the collection of documents in our data set. The smaller plate shows a document described by a collection of words and weighted topic assignments to these words. . . . 12

6.1 Arabic Bible. . . 35

6.2 UN Corpora. . . 36

6.3 TED talk subtitles: k=25, α=1,3,5 . . . 37

6.4 TED talk subtitles: k=100, α=1,3,5 . . . 37

6.5 TED talk subtitles: k=400, α=1,3,5 . . . 38

8.1 Arabic Bible. . . 51

8.2 UN Corpora. . . 52


List of Tables

2.1 The sentence “Do you eat it?” written in Arabic and separated into its particles (Kelaiaia and Merouani 2016). . . 7

2.2 Bag of words representation of the first lines of Dr. Seuss’s “One fish, two fish, red fish, blue fish” (Boyd-Graber 2010). . . 10

6.1 The results from the topic modeling runs where k was set to the number of root types. The table shows the number of empty topics as well as this number as a percentage of the overall tokens in the data set. . . . 33


Abstract

Topic modeling is a widely used technique in the Digital Humanities. More often than not, its results are somewhat ambiguous and leave room for interpretation. This thesis aims at testing latent Dirichlet allocation (LDA) applied to Arabic language corpora. We demonstrate what LDA does with regard to the Arabic stem system.

In Arabic, meaning is closely connected to the root of a word. Not only words from the same word family (as in writer, writing) but also words with a much looser semantic relationship (as in plane, bird) share the same root. We are therefore interested in whether and how this affects clustering in a topic modeling task using LDA.

The goal of this thesis is to find out whether LDA gives high probability to words with the same root within a given topic. If we find that roots and topics correlate, this is an important outcome, because it opens possibilities for cross-fertilization between research on topic modeling and research on Semitic languages that makes use of the root system.


Chapter 1

Introduction

Finding patterns in large data sets is a common task in information retrieval. Techniques such as the widely used topic modeling algorithm latent Dirichlet allocation (LDA) come with the promise of an output in which the contained words are clustered by topic.

According to Brett, topic modeling “clusters (...) words and groups them together by a process of similarity” (Brett 2013). We will deal closely with the notion of the topic in this thesis, as this is what the output of topic modeling is first and foremost about. Within the scope of this thesis we argue that the meaning contained in the Arabic base form of a word—namely the root—can be regarded as a topic itself. The Arabic stem system is also a case of a highly developed system of patterns, and thus we want to find out whether we can find this system reflected in the output of LDA. Specifically, we want to find out how roots are distributed over the top words in each topic of the topic modeling output.

It makes sense to further test LDA with Arabic data sets for several reasons. First, research in this area has so far been relatively sparse. Second, the morphological structure of the Arabic language makes our approach a natural fit. Third, Alan Liu’s notion of critically scrutinizing the new tools in our methodological toolbox leads us to pose new questions and to test the boundaries of what is possible with the tools at hand (Liu 2012).

Topic modeling has become a widely used technique in the Digital Humanities and beyond. The assumption is that, if applied correctly, topic modeling can split a given corpus up into many meaningful clusters—something we would call topics. It is “one of the most touted methods of finding patterns in large corpora” (Binder 2016). In some cases this approach has undoubtedly produced informative output. More often than not, however, the visually impressive output obscures what can actually be deduced from it. As Brett specifies, “topic modeling is not an exact science by any means” (Brett 2013).

What, though, is the added value of a topic modeling process? Why should the humanities work with this technical tool? Thanks to this technology it is possible to get an overview of a lot of data. Nelson makes this clear when he describes the data as “so large that [it] cannot be read by any individual human being” anymore (Nelson 2011). According to Binder, topic modeling “relies on a procedure that has little resemblance to anything a human being could do” (Binder 2016, p. 201). Hence, we should engage with the technology and at the very least see whether we can make good use of it for our purposes as Digital Humanists as well.

This thesis aims to add to the research on LDA applied to Arabic language data sets. We want to demonstrate how LDA behaves when we confront it with a semantic clustering task in the Arabic language. The work contributes empirical evidence on how a Semitic language behaves when a topic modeling approach is applied to it. Because only little work has been done on Arabic topic clustering, the thesis also illustrates, more generally, characteristics of Arabic under statistical testing and processing.

As previously mentioned, we want to find commonalities between the way LDA defines a topic and the way a topic is predefined by an Arabic root. Does the LDA output group words with the same roots together? To what degree do we find the same root in the output? On the following pages we strive to answer the question of how the output of LDA relates to the Arabic morphological stem system. Are there any relations to be found at all? Specifically, we want to test how Arabic roots are distributed in the output of LDA.

For this reason, we will test three data sets: the first a rather traditional text, an Arabic translation of the Bible; the second the Arabic parts of the UN Corpora; and the third the subtitles from TED talks. Since—within the frame of this thesis—we argue that roots can represent a topic, we will explore what happens when we set the variable k of the LDA algorithm to the number of root types found in each of our data sets. If our assumption is true, we should find words sharing a root among the top words of the topic modeling output. Anticipating the likelihood that this concept fails, we will also analyze the output of the topic models when setting k to lower numbers.

How to evaluate topic models is still an open question; different purposes call for different evaluation techniques. To answer our research question, we will count the appearances of roots within the topics and plot the distribution of these counts as a box plot. Using this visualization, we can see to what extent roots appear only once or indeed more often in the output of the topic models.
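The counting step described above can be sketched in a few lines of Python. The root lookup TOY_ROOTS below is a hypothetical stand-in for the stemmer used in the actual experiments, and the transliterated words and root labels are invented for illustration only:

```python
from collections import Counter

# Hypothetical toy lookup from (transliterated) word to root; in the real
# experiments this mapping comes from an Arabic stemmer.
TOY_ROOTS = {
    "kataba": "ktb", "kitab": "ktb", "katib": "ktb",   # the k-t-b family
    "tayr": "tyr", "tayyara": "tyr",                   # the t-y-r family
    "faras": "frs",
}

def root_counts_per_topic(topics):
    """For each topic (a list of its top words), count how often each root appears."""
    per_topic = []
    for top_words in topics:
        roots = [TOY_ROOTS[w] for w in top_words if w in TOY_ROOTS]
        per_topic.append(Counter(roots))
    return per_topic

topics = [["kataba", "kitab", "faras"], ["tayr", "tayyara", "katib"]]
counts = root_counts_per_topic(topics)
print(counts[0])  # the first topic's top words share the root k-t-b twice
```

The resulting per-topic counts are exactly the values whose distribution a box plot would then summarize.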

When using tools like topic models, we have to explore their boundaries and what we can or cannot do with them. In this thesis, we use LDA to answer a research question that has not been asked before. With our work, we strive to add knowledge to the existing research about topic modeling. The remainder of this thesis is organized as follows. We will begin by giving background information in chapter 2 (p. 4). Here, we will first give an introduction to the Arabic language in section 2.1 (p. 4) and go into detail about the concept of the root system in section 2.1.2 (p. 6). Next, we will introduce the reader to the basics of topic modeling in section 2.2 (p. 8) and discuss the latent Dirichlet allocation model and the Dirichlet distribution in section 2.2.1 (p. 9). Furthermore, we will present possible methods to evaluate LDA in section 2.2.2 (p. 13). Finally in this chapter, we will talk briefly about research that has been done using LDA in section 2.3 (p. 14) and specifically about research on LDA approaches for Arabic corpora in section 2.4 (p. 15).

In the third chapter (p. 17) we will give a short overview of the general setup of our experiments. We will briefly introduce the data sets in section 3.1 (p. 18), the research questions in section 3.2 (p. 18) and finally the technical approach in section 3.3 (p. 19).

In chapter 4 (p. 20), we will then present the data used in our thesis and how to handle it in order to use it later in the topic modeling task. First, we will introduce all three data sets in section 4.1. Second, in section 4.2 (p. 22) we will present our method for splitting the original data into smaller files that can be used for topic modeling. Further preprocessing is needed, which we will explain in section 4.3 (p. 25). Potential problems in this process are discussed in section 4.4 (p. 25).

We use chapter 5 (p. 26) to explain the general setup of our experiments. In section 5.1 (p. 26), we show how to import data into the topic modeling tool MALLET. Section 5.2 (p. 27) presents our initial attempt at creating topic models by setting the number of topics to the number of root types per data set. We will see that we need to approach the problem of topic modeling in a different manner, which is why we added section 5.3 (p. 29) with more topic modeling runs and their configurations.

We will present our results in chapter 6 (p. 32). Here, we will first present in section 6.1 (p. 32) the result of the topic modeling runs where k was set to the number of root types. Then, we will present the results of the other runs, with more moderate numbers for k, in section 6.2 (p. 32).

In chapter 7 (p. 39), we will discuss the results and compare them to one another.

Last but not least, chapter 8, Conclusion and future work (p. 41), summarizes all the work and gives an overview of what could be done next.


Chapter 2

Background Information

In this chapter we will present the background knowledge needed for this thesis and beyond.

In the first section, 2.1, we will introduce the Arabic language and its grammar to the reader. Specifically, we will first give some information about the Arabic language in general. In section 2.1.2 on page 6 we will then explain the stem system of Semitic languages—and Arabic in particular—and how meaning is contained in the root of a word. This is important because we will analyze whether we can find a relation between the roots and the output of the latent Dirichlet allocation (LDA) topic modeling algorithm.

In the following section, 2.2 on page 8, we will give a basic introduction to topic modeling. Where does it come from and how was it used when it was originally introduced? On page 9 we will explain the reasoning behind the most common topic modeling algorithm, LDA. We will present how several figures in the Digital Humanities community describe how LDA works and what other research has been done with it.

After this, beginning on page 13, we will show some ways to evaluate LDA. We will explain that there is no standardized, generally useful evaluation measure that works for all scenarios. For our approach in this thesis, we will come up with our own testing measure (see section 3.3, Technical approach, on page 19). We will close this chapter with an overview of the work that has been done with LDA in general and on the Arabic language specifically (page 15).

2.1

Introduction to the Arabic language

We will first give some general information about the Arabic language, discussing its Semitic background; then, in the second section starting on page 6, we will explain the concept of the root in Semitic languages.


2.1.1

General information

Arabic belongs to the family of Semitic languages, which comprises, aside from Arabic, also Hebrew, Amharic and Aramaic, to name only a few still spoken today (Ryding 2005, p. 1). The region in which Arabic is spoken nowadays stretches from Morocco in Northern Africa over the Levant, through the Arabian peninsula and Iraq, ending in some enclaves in Iran and Pakistan (Watson 2012). Naturally, we cannot speak of only one Arabic language; we have to differentiate between its many branches and dialects. Semantics and grammar are also very different from what somebody coming from an Indo-European background might know. In this section, we will discuss some characteristics of the Arabic language to understand the background of Arabic in general and the root system in Semitic languages in particular.

Like any other natural language, Arabic has evolved over the past hundreds of years. “Modern Standard Arabic” (MSA) is generally taught to students all over the world today, and is to be distinguished from “Classical Arabic”, the language that dates back to the emergence of the Quran and even to years before that (Ryding 2012). MSA—in Arabic it is called ﻰﺤﺼﻔﻟا ﺔﻴﺑﺮﻌﻟا ﺔﻐﻠﻟا (in short: fuṣḥá), which translates to “the purest Arabic language”—is used today mainly in the media—television or newspapers—and in conversations between highly educated people. A big problem for people learning the Arabic language is that what is spoken in the different Arab countries differs significantly from what a student learns as MSA. Furthermore, a person speaking a dialect from Syria might not understand a person from Morocco because of the vast differences between their dialects. Most likely, though, both of them can understand the Egyptian dialect, thanks to the extensive dissemination of Egyptian movies and TV series throughout the Arab world. When it comes to MSA, language use is limited to only a few settings. Ryding writes that Modern Standard Arabic “not only serves as the vehicle for current forms of literature, but also as a resource language for communication between literate Arabs from geographically distant parts of the Arab world. A sound knowledge of MSA is a mark of prestige, education and social standing” (p. 845). As Ryding also remarks, it is not clearly distinguishable where dialects end and fuṣḥá begins (p. 846-847). With this, we have at least a slight idea of the difficulty of the Arabic language.

In the next paragraphs, we will introduce some of the peculiarities of the Arabic language and its script. Specifically, we will take note of the stem system. Let us first have a look at the letters of the Arabic language in the table in the appendix on page 48. This will help us to recognize words later in this thesis. As we can see in the table, letters can look different depending on the position at which they stand in a word.

A very simple example is the letter bāʾ—ب when written independently, without any other letters accompanying it—the Arabic form of our b. If it stands at the beginning of a word it looks like this: ـﺑ; in the middle of a word it looks as follows: ـﺒـ; and at the end of a word it would look like this: ﺐـ. This is an example of a very regular letter: it is straightforward to recognize the bāʾ no matter the position at which it stands in a word.

A more difficult letter is the ʿayn—ع. If the letter stands at the first position of a word it looks like this: ـﻋ; in the middle of a word like this: ـﻌـ; and at the end like this: ﻊـ. All of these actually represent the letter ʿayn—or ع in Arabic. For someone new to Arabic, this can make identifying a letter in a word more difficult. Further down the line, when we concern ourselves with stemming, letters will inevitably change their appearance when words are cut down to the stem. Therefore, we are advised to have a look at the table on page 48 every once in a while when we cannot recognize a letter anymore.

Most letters are connected to the preceding ones. There are a few exceptions, though: ʾalif, dāl, ḏāl, rāʾ, zāy and wāw (Ryding 2005). If one of these letters appears in a word, it creates a little break in the script, as can be seen in these example words: ْﺪ ِﺣاَو, ًاﺮ ْﻜُﺷ, ْكوُﺮْﺒَﻣ.

Another difficulty of the Arabic script is that short vowels are usually missing from the written text. They are represented by a system of diacritics, which is usually left out when writing. Reading a text aloud thus becomes a problem for someone without profound knowledge of the language. A simple example illustrates this. The sentence ﺔﺒﺘﻜﻤﻟا ﻰﻟإ ﺖﺒﻫذ—meaning “I went to the library”—could be converted into these letters: “dhhbt ʾilā āl-mktb”. Either we know enough Arabic to insert and pronounce the vowels between the root letters, or we are lost. With full vocalization the sentence would look like this: ِﺔَﺒَﺘ ْﻜَﻤْﻟا ﻰَﻟِإ ُﺖْﺒَﻫَذ, which would be transcribed and spoken properly as “dhahabtu ʾilā āl-maktaba”1.

So, if we already know the words and the grammatical function of each word, the pronunciation should be no problem. But as Ryding puts it: “Few are those who can readily and accurately inflect all lexical items in any text without substantial preparation” (Ryding 2012, p. 847). Thus, pronouncing letters and unwritten vowels alike is usually reserved for true masters of the Arabic language. Nevertheless, we can still identify the root of each verb or noun without having this knowledge beforehand.
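As an aside, removing such diacritics from vocalized text is a common normalization step when processing Arabic. Since the short-vowel signs are combining characters in Unicode (category Mn), a minimal sketch needs only the standard library. This assumes fully composed input and ignores other marks a real pipeline would also handle:

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks (Unicode category Mn), i.e. the short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "dhahabtu" with full vocalization reduces to its bare consonant skeleton:
vocalized = "ذَهَبْتُ"
print(strip_diacritics(vocalized))  # the unvocalized form ذهبت
```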

2.1.2

The concept of the root

Arabic morphology describes the rules according to which all words are structured. The smallest entity of a word that can carry meaning is called a morpheme. Derivational or lexical morphology in particular deals with variations of words: in English, for example, we can derive the words “faithful” and “unfaithfulness” from the word “faith”. Inflectional morphology, on the other hand, describes how a word can be altered to reflect features like number, tense, mood, gender, case and so forth. “However, the boundaries between derivation and inflection are not as clear-cut in Arabic as they are in English because Arabic morphology works on different principles, and because Arabic morphological theory views elements of word structure and sentence structure from a different perspective”, remarks Ryding (Ryding 2005, p. 45). Understanding Arabic morphology is thus important for understanding the concept of the root in Semitic languages like Arabic or Hebrew.

1. We are using the Library of Congress transliteration scheme described here: https://www.loc.gov/catdir/cpso/romanization/arabic.pdf. It should be noted that the fully correct and proper pronunciation would be “dhahabtu ʾilā āl-maktabati”; in spoken language, one would usually not pronounce the ending ti.

This Python script was used to convert Arabic Betacode into the transliteration: https://github.com/maximromanov/ArabicBetacode.

Arabic morphology generally requires a word to have three root consonants surrounded by a specific pattern of (mostly) vowels. There is, of course, an exception to this rule that allows a group of words to have four letters in their most basic form. This most basic form is called the “root” of a word. These three (or four) root letters “form words, or word stems” (p. 45). The symbolism of the root—the stem and the word tree—is a very illustrative image here, as it shows well how the language works: based on the word in its root form, many branches stretch out from the original stem. Each connected branch still carries the meaning of the original root within itself but might look quite different from the original root.

Let us take a common example: k-t-b ﺐﺘﻛ (meaning to write). This is the root. Now we can find all kinds of branches coming from this root. In the case of k-t-b these would be words like َﺐَﺘَﻛ2 kataba, meaning “he wrote”, or ُﺐُﺘ ْﻜَﻳ yaktubu, “he writes”; but also words like بﺎﺘ ِﻛ “book” or ﺐِﺗﺎَﻛ “writer”. This is the most generic example one can find for the Arabic language, but in essence all verbs tend to work like this3. To sum up: all the above-mentioned words that can be traced back to the root k-t-b describe the same thing—the concept of writing. One could also say that they in some sense describe a topic.

Furthermore, Arabic is an “agglutinative language” in which proclitics (prepositions, conjunctions, prefixes and suffixes) and enclitics (personal pronouns) are attached to the base form of the word. Thus a whole sentence in a Western language can be expressed by one single word in Arabic: ﺎَﻬَﻧﻮُﻠَﻛَﺄَﺗَأ means “Do you eat it?”. Table 2.1 shows how the question can be separated into its particles (Kelaiaia and Merouani 2016):

Table 2.1: The sentence “Do you eat it?” written in Arabic and separated into its particles (Kelaiaia and Merouani2016).

Enclitic   Suffix   Stem   Prefix   Proclitic
ﺎَﻫ   َنو   ُﻞَﻛَأ   َت   َأ

In this table we can clearly see that the stem is only three letters long, but the whole word—here, a whole sentence—is much longer. A stemmer needs to be able to properly identify the root. The stemmer that we will use is described in section 5.2 on page 27.

2. In Arabic, most vowels are not written out. Instead, a system of diacritic marks placed under or above the consonants shows which vowel should be spoken after that very consonant.

3. Of course, like every language, Arabic has major complications built in too, such as irregularly behaving weak verbs.

Let us consider the term topic from another angle, though. More realistic is the scenario of more than one root representing a topic. Take another example of a possible topic: horse riding. Words like horseman, rider or jockey come to mind, and certainly words like riding, horse, saddle or even foal. In Arabic these words would be translated as follows:

horseman; rider: ﻞﻴﺧ
horse: ﻞﻴﺧ
rider; jockey: ﺐﻛر
he/it gets on; boards; rides; climbs: ﺐﻛر
steed; horse: دﻮﺟ
horse: ﻦﺼﺣ
horse: سﺮﻓ
saddle: جﺮﺳ
saddle: ﺪﺷ
he/it rides; mounts: ﻮﻄﻣ
colt; foal: ﻮﻠﻓ
foal; colt: ﻮﻠﻓ

So, in this simple example we can already determine nine different roots for the same topic. The idea of having only one root represent one topic seems far-fetched from this perspective. Another example, though, shows that two loosely connected words can indeed have the same root: plane and bird share the root ﺮﻴﻃ.

In any case, it very much depends on how we define the term topic in this context; topic may turn out to be a very flexible term itself. This leads to the next section, in which we introduce topic modeling—more specifically the latent Dirichlet allocation—and how topics are determined using this method.

2.2

Introduction to Topic modeling

Topic modeling can help us to get an overview of large collections of data. In times of ever-growing amounts of data becoming available to researchers and enterprises, a lot of knowledge and resources are being put into information retrieval, where topic modeling is commonly applied. From early in their development, topic model algorithms were used not only in natural language processing tasks but also in the fields of psychology and bioinformatics (Boyd-Graber 2010, p. 9).

In the 1990s, topic modeling techniques for Digital Humanities purposes emerged. Before this, vaguely similar programs were developed by the American military, namely DARPA’s “Topic Detection and Tracking initiative”. This initiative was built to survey a constant flow of information in newsfeeds and to detect when important events took place. For the Digital Humanities, programs were needed instead that work with static data sets (Binder 2016, p. 203-204).

Several algorithms fall under the umbrella of topic modeling: probabilistic latent semantic indexing (pLSI), which stems originally from latent semantic analysis (LSA), latent Dirichlet allocation (LDA) and correlated topic modeling (CTM) (Chang et al. 2009, p. 2 and 5). Since LDA is the most commonly used—and the one used in the experiments of this thesis—we will focus on this topic modeling algorithm in particular and explain how it fundamentally works in the next section.

2.2.1

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) was most prominently introduced by David M. Blei, Andrew Y. Ng and Michael I. Jordan (Blei, Ng, and Jordan 2003). We can call LDA a “probabilistic, generative model” (Boyd-Graber 2010), as it tells us, using probabilities, how a text could have been generated.

Let us begin with a definition of topics and topic models that was given by David Blei himself. He explains how we can understand what a topic is in the context of topic modeling and specifically LDA. As we can read in the following lines, the term can first and foremost be understood in a rather technical sense.

“What exactly is a topic? Formally, a topic is a probability distribution over terms. In each topic, different sets of terms have high probability, and we typically visualize the topics by listing those sets [...]. As I have mentioned, topic models find the sets of terms that tend to occur together in the texts. They look like ‘topics’ because terms that frequently occur together tend to be about the same subject.” (Blei 2012).
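Blei’s formal definition is easy to make concrete: a topic is simply a mapping from vocabulary terms to probabilities, and “visualizing” a topic means listing its highest-probability terms. The probabilities below are invented toy numbers:

```python
# A topic as a probability distribution over terms (toy values):
topic = {"gene": 0.04, "dna": 0.03, "genetic": 0.02, "the": 0.001}

def top_terms(topic, n=3):
    """Visualize a topic by listing its n highest-probability terms."""
    return sorted(topic, key=topic.get, reverse=True)[:n]

print(top_terms(topic))  # -> ['gene', 'dna', 'genetic']
```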

As Blei says, a topic is defined as a probability distribution over terms, or as co-occurring words. We will go a little deeper into the mathematical side of this on the following pages; for now, we take this definition for granted. There is one assumption we are dealing with when using LDA: the bag of words model, which is very common in topic modeling applications. Here we assume that the order of the words in a document does not matter; only frequencies count, and syntax et cetera are dropped (Binder 2016, p. 204). As an example, let us take a look at the opening lines of Dr. Seuss’s book “One fish, two fish, red fish, blue fish” and their representation in the bag of words model in Table 2.2.

Table 2.2: Bag of words representation of the first lines of Dr. Seuss’s “One fish, two fish, red fish, blue fish” (Boyd-Graber 2010).

Original document          Bag of words model
One fish, two fish         fish: 8    red: 1
red fish, blue fish        blue: 2    black: 1
black fish, blue fish      one: 1     old: 1
old fish, new fish         two: 1     new: 1

The bag of words model means that the computing machine takes all the words in a document and ignores any structure immanent in the text. For information retrieval purposes this approach retains enough valuable information: from the bag of words representation we can learn that Dr. Seuss’s book is mostly about fish. We have to be aware, though, that further semantic information contained in the text is lost in this approach. There are related methods such as the N-gram model, where we compute the counts of N-grams of a text. This approach could, for example, find occurrences of “New York” instead of counting “New” and “York” separately. Still, “LDA does a surprisingly good job of sorting words based on co-occurrence”, as Rhody notes (Rhody 2012). When we are only looking for the topic of a text in a corpus, the bag of words model works just fine.
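The bag of words representation in Table 2.2 amounts to nothing more than a frequency count; a minimal sketch:

```python
from collections import Counter

# Bag of words: word order is discarded, only frequencies remain.
lines = [
    "one fish two fish",
    "red fish blue fish",
    "black fish blue fish",
    "old fish new fish",
]
bag = Counter(word for line in lines for word in line.split())
print(bag["fish"])  # -> 8, matching Table 2.2
```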

Jockers—literature researcher that he is—explains in rich language how he visualizes topic modeling: Melville and Austen walking side by side and taking spoonfuls of words from the “LDA Buffet”. Afterwards they sit down at a table and put their words together, choosing each word carefully from certain topics. In the end, their books Moby Dick and Persuasion are finished (Jockers 2011). This picture of topic modeling seems to be a good one: single words are taken from collections of words and then distributed into new collections of words that form new documents.

Figure 2.1 is David Blei’s visualization of how the output of LDA would look based on a given text (Blei 2012):

Figure 2.1: The illustration from (Blei 2012, p. 78) visualizes how we can imagine the output of topic modeling. Blei highlights words in an article in Science that LDA would associate with certain topics.


Figure 2.1 shows an article from the magazine Science. Blei highlighted words that make up topics as LDA could find them: “gene”, “DNA” and “genetic” could form such a topic, as could “brain”, “neuron”, “nerve”, or “data”, “number” and “computer”. Such collections of words are essentially what the output of a topic modeling tool looks like. The categories of words are not named by LDA, though; the term we would use for such a distribution of words—a name for the topic—is up to humans to decide on. Whether or not such a topic has to make sense and seem logical to a human mind is a question that needs to be settled before a topic model can be evaluated.

The promise of topic modeling is that it “discovers the topics that permeate the corpus and assigns documents to these topics” (Boyd-Graber 2010, p. 2). The output of a topic modeling algorithm is, firstly, the assignment of topics to the documents that have been fed to the LDA tool, and secondly the assignment of words to these topics (p. 2). As these assignments are output as probabilities that describe with what kind of words the corpus ‘was created’, we can call LDA a ‘probabilistic, generative model’. This sounds rather abstract, but Boyd-Graber also remarks that the fact that the outputs of topic modeling algorithms “correspond to human notions of topics is a product of how language is used” (p. 3-4).

Binder takes the view that “what is important is not that words be used literally, but that the vocabularies of texts correlate with their topics in a uniform fashion” (Binder 2016, p. 206). He concludes that topic modeling does well with technical and informational literature because language has been heavily standardized in these genres (p. 207).

So, the underlying assumption here is that if done correctly, the output of an LDA algorithm should make sense to a human being. As we will see in section 2.2.2, evaluation of topic models is not completely straightforward.

Jordan Boyd-Graber describes the algorithmic process of LDA as follows (Boyd-Graber 2010, p. 7):

1. For each document d ∈ {1, . . . , M}:

   (a) Choose the document’s topic weights θ_d ∼ Dir(α)

   (b) For each word n ∈ {1, . . . , N_d}:

      i. Choose topic assignment z_{d,n} ∼ Mult(θ_d)

      ii. Choose word w_{d,n} ∼ Mult(β_{z_{d,n}})

There are two parameters, α and β, that influence the topic model. These two parameters are to be set beforehand. α is the parameter of a Dirichlet distribution and needs to be very small (e.g. 0.0001) so that LDA assigns only a small number of topics to each document (p. 5-6). For good interpretability, we would prefer only one topic to have high probability per document.

In a first step, LDA chooses for each document d in our data set of size M the corresponding topic weights θ_d. The topic weights θ are the probabilities with which the different topics appear in the document. The length of the θ_d vectors is determined by K—a parameter that specifies how many topics LDA should find in the documents (p. 7). In a second step, LDA chooses a topic from the distribution of our first step. Then, LDA determines the words that go with the topic.
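The generative process described above can be sketched directly in code. The following toy sampler is an illustration, not part of the thesis pipeline; all parameter values and function names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha=0.1, eta=0.01):
    """Sample a toy corpus from the LDA generative process."""
    # Each topic beta_k is itself a distribution over the vocabulary.
    beta = rng.dirichlet([eta] * vocab_size, size=n_topics)
    corpus = []
    for _ in range(n_docs):
        # (a) document topic weights theta_d ~ Dir(alpha)
        theta = rng.dirichlet([alpha] * n_topics)
        doc = []
        for _ in range(doc_len):
            # (b)i. topic assignment z_{d,n} ~ Mult(theta_d)
            z = rng.choice(n_topics, p=theta)
            # (b)ii. word w_{d,n} ~ Mult(beta_{z_{d,n}})
            doc.append(rng.choice(vocab_size, p=beta[z]))
        corpus.append(doc)
    return corpus

docs = generate_corpus(n_docs=4, doc_len=25, n_topics=3, vocab_size=40)
```

Because α is small, each sampled θ_d is heavily skewed towards one or two topics, which is exactly the behavior described above.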

As output of LDA we receive probabilities of words, and these probabilities describe how our text could have been created. As Blei points out, it is still the task of scholars to interpret and understand the outcome of the topic model. The probabilities alone do not speak for themselves—we have to carefully investigate what they mean and how they support or negate our hypotheses (Blei 2012).

Figure 2.2 is the graphical plate diagram for LDA. The plates represent repetition and the letter in the corner of each plate shows us the number of times a certain process is being repeated. The one shaded circle is the piece of information we can observe—these are the words in our documents. All others are statistical probabilities or the parameters that influence the model.


Figure 2.2: Boyd-Graber’s version of the graphical plate diagram for latent Dirichlet allocation (Boyd-Graber 2010, p. 8). Each node shows a variable of the LDA process; the labels are according to the algorithm on page 11. The hidden variables—topic proportions, topic assignments, and the topics themselves—are not shaded. The one shaded node represents the observed words in the documents. The plates signify repetition, the letter in each corner the number of repetitions. The bigger plate stands for the collection of documents in our data set. The smaller plate shows a document described by a collection of words and weighted topic assignments to these words.

The mathematical side of LDA is rather technical. Since we will not use the probabilities in the output of the LDA implementation, it is not necessary to repeat the mathematical background of LDA in detail—especially since Blei and others have comprehensively explained it before (Blei, Ng, and Jordan 2003; Blei 2012; Boyd-Graber 2010). Still, a few things should be noted. David Blei explains the rationale behind LDA:

“The name ‘latent Dirichlet allocation’ comes from the specification of this generative process. In particular, the document weights come from a Dirichlet distribution—a distribution that produces other distributions—and those weights are responsible for allocating the words of the document to the topics of the collection. The document weights are hidden variables, also known as latent variables.” (Blei 2012).

The Dirichlet distribution is, as Blei explained, a distribution that generates other distributions. Its density is (Blei, Ng, and Jordan 2003):

p(θ | α) = [ Γ(∑_{i=1}^{k} α_i) / ∏_{i=1}^{k} Γ(α_i) ] · θ_1^{α_1−1} ⋯ θ_k^{α_k−1}.

The probability of a data set to emerge is defined as follows (Blei, Ng, and Jordan 2003):

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} ∑_{z_{dn}} p(z_{dn} | θ_d) p(w_{dn} | z_{dn}, β) ) dθ_d.

Further below, we will discuss in short how our research is set up, what our data is, and what kinds of LDA implementations exist.

2.2.2 Methods to evaluate LDA

We can find different methods to evaluate the output of an LDA algorithm in the literature. However, how to set up the evaluation depends very much on what the goal of the research is. One of the more standard ways to evaluate a machine learning task is prediction over held-out documents (Chang et al. 2009; Wallach et al. 2009). This way, we can find out whether our topic model works well on a previously unseen data set. Wallach, Murray, Salakhutdinov and Mimno present in their article an extensive overview of several “methods for estimating the probability of held-out documents given a trained model” (Wallach et al. 2009). The content of their paper is highly technical; their conclusion though is that many of the methods used today “are generally inaccurate”. They suggest that using “Chib-style estimation” or the “‘left-to-right’ algorithm” should accurately estimate the probability of held-out documents and would thus be proper methods to evaluate topic models (Wallach et al. 2009).

Another criterion can be the interpretability of the output of our LDA implementation. We might have people looking at our topic models, and in this case, they should make sense. Different research goals lead to different methods of topic model evaluation. Especially in the sphere of the Digital Humanities, evaluation needs to be taken seriously in order for the Digital Humanities to be appreciated as a rigorous science. We will introduce a few of the existing approaches in the following paragraphs.

Posner presents possible methods for researchers in the Digital Humanities. She introduces the Topic Modeling Tool (TMT)—a tool with a GUI that people can use without ever going to the command line (Newman and Balagopalan 2011):

“[The developers] have done us all a great service. But they may also have created a monster. The barrier for running the TMT is so low that it’s entirely possible to run a topic modeling test and produce results without having much idea what you’re doing or what the results mean.” (Posner2012).

Her explanations include some instructions to evaluate the topic models using visualizations—all within common programs such as Excel. She admits that “by now it should be abundantly clear that no part of this process is ‘scientific’; it’s just one way of getting your head around a large body of text.” (Posner2012).

More sophisticated approaches can be found as well. Chang et al. find another method to measure the ‘internal representation’ of topic models using human annotators. Since topic models are utilized by humans to find and understand topics, they thought it best to let people evaluate the output of topic models. Two approaches were followed to find out whether this works or not: “word intrusion” and “topic intrusion” (Chang et al. 2009, p. 3). Word intrusion measures whether human annotators find one ‘intruding’ word in a word list: the top five words from a topic are presented to the annotator and an additional sixth word is added. This additional word has a low probability in the same topic but a high probability in another topic. The assumption is that a topic is easily identifiable and ‘makes sense’ when the ‘intruder’ is easy to recognize (p. 3-4).
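A word-intrusion item can be constructed mechanically from the ranked word lists of a topic model. The sketch below is a hypothetical illustration of the task setup; the example topics and function name are made up:

```python
import random

def word_intrusion_item(topics, topic_id, seed=0):
    """Build one word-intrusion question: the top five words of one topic
    plus an 'intruder' that ranks high in a different topic but does not
    occur in this one (after Chang et al. 2009)."""
    rng = random.Random(seed)
    top5 = topics[topic_id][:5]
    other = rng.choice([i for i in range(len(topics)) if i != topic_id])
    intruder = next(w for w in topics[other] if w not in topics[topic_id])
    item = top5 + [intruder]
    rng.shuffle(item)  # hide the intruder's position from the annotator
    return item, intruder

# Hypothetical topics represented as ranked word lists:
topics = [
    ["gene", "dna", "genetic", "sequence", "genome", "cell"],
    ["brain", "neuron", "nerve", "cortex", "synapse", "memory"],
]
item, intruder = word_intrusion_item(topics, 0)
```

If annotators reliably pick the intruder out of `item`, the topic is considered interpretable.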

Topic intrusion does almost the same: the annotators are shown the title and the first few lines of the text that the topic model is describing. Then, four sets of five words each are shown that supposedly come out of a topic model describing this text, while one of the four sets actually comes from somewhere else and thus does not fit in. Again, the assumption is that it should be easy for the annotators to find the ‘intruding’ topic if the topic model properly reflects the content of the text and one set clearly does not match (Chang et al. 2009). Chang et al. interestingly found that:

“The highest probability did not have the best interpretability; in fact, the trend was the opposite. This suggests that as topics become more fine-grained in models with larger number of topics, they are less useful for humans. [...] implying that the models are often trading improved likelihood for lower interpretability.” (p. 6-7).

2.3 Research done with LDA

Let us have a look at what other researchers have achieved with topic modeling, and LDA specifically. Especially in the field of Digital Humanities we can find interesting use cases of topic modeling.

The Digital Scholarship Lab at the University of Richmond has done research on the newspaper Dispatch (Nelson 2017). Nelson discusses his project in the New York Times: topic modeling made it possible to see general trends in the newspaper’s articles from between 1860 and 1865 without having to read thousands of articles himself. The newspaper deals a lot with the American Civil War. Nelson identifies several topics that were dominant in the paper, among them “anti-northern diatribes” and “patriotism and poetry”: while the paper stoked hatred of Northerners by dehumanizing them in its articles, it also tried to invoke patriotism in the form of poetry (Nelson 2011). The website of the Digital Scholarship Lab offers an illustrative opportunity to explore the topics found in the newspaper.

Rhody uses topic modeling to analyze figurative language. She describes her approach of using LDA as a “reductive methodology of sorting poetic language into relatively stable categories”. Rhody uses topic modeling in order to have an “ability to address a larger scale of texts, revealing patterns and relationships that might otherwise have remained hidden”. It becomes evident that topic modeling is largely about the ability to deal with large corpora and thus distant reading. “Rather than selecting from just a few poems, LDA allowed me to cast my net as wide as 4,500 poems” (Rhody 2012).

Mimno and McCallum devised a topic model that finds the most influential authors in a digital library. Their topic model takes information about citations, authorship data and the abstracts of the papers into account (Mimno and McCallum 2007).

Bergamaschi and Po compared two topic modeling algorithms to build a recommendation system for movies that is based on the plots of the movies (Bergamaschi and Po 2015).

Block and Newman built topic models of a little more than 500.000 abstracts of America: History and Life and Historical Abstracts to find trends and “relationships between women’s history and history as a whole” (Block and Newman 2011).

2.4 LDA on Arabic texts

There is not much research to be found that uses latent Dirichlet allocation or topic modeling in general on Arabic texts. In the following paragraphs we will take a look at the few works that are available on this matter.

Siddiqui, Faraz and Sattar used topic modeling to find the hidden thematic structures in the Quran. They considered each surah a separate document and ran LDA with different numbers of topics. Interestingly, when setting k=2, LDA managed to separate the surahs into two sets that are regarded as having been written at different times in the prophet’s life (Siddiqui, Faraz, and Sattar 2013).

Alhawarat also analyzes the Quran—though only one chapter. He runs the topic modeling on the regular word level as well as on the root forms of the words. For some reason, no stemmer was used in this study and all roots were identified manually. Alhawarat’s finding is that only few of the topics are coherent. Although he initially states that his research is not a religious one, his conclusion, stating that the findings are incoherent because the Quran is not human-made, tells another story (Alhawarat 2015).

In his thesis, Panju focuses on the whole Quran. He topic models a Buckwalter transliteration of the Quran and then maps the relations between connected chapters of the book (Panju 2014). The visualizations contained in his thesis look pretty, but their meaning and added value cannot easily be identified.

Bagdouri researched the impact of the Iraq War on the Iraqi blogosphere using topic modeling. He had 11.669 Arabic and 31.246 English blog posts from Iraqi bloggers at hand. His thesis finds that higher numbers of war-related posts correlate with higher body counts in the war (Bagdouri 2011).

Brahmi, Ech-Cherif and Benyettou assess the need to follow up on research in Arabic information retrieval tasks. They compare available stemmers with a newly developed one and apply topic modeling to a data set of several tens of thousands of Arabic newspaper articles (Brahmi, Ech-Cherif, and Benyettou 2012).

Kelaiaia and Merouani compare LDA and K-means and their ability to identify topics in Arabic texts. Evaluating their results with the F-measure and entropy, they find that LDA overall achieves better results than K-means (Kelaiaia and Merouani 2016).


Chapter 3

Setup

This thesis is based on latent Dirichlet allocation (LDA), a common—if not the most common—topic modeling algorithm. We gave an introduction to LDA and the necessary information on the Arabic language in the previous chapter. Thus, we now have appropriate knowledge of the important principles of LDA and the Arabic language to look at the setup of this thesis. In this chapter, we will look into what specifically needs to be done on the practical, computational side.

Brett clarifies that we need several things when working with LDA (Brett 2013). Firstly, a corpus, as large as possible. How to divide this corpus is up to us: we could split a newspaper corpus into its articles, but lines in poems could also work if that is what we are interested in. How to set up the LDA implementation always depends on what we are after in our research. Secondly, tokenization and removal of function words, punctuation and generally everything we do not believe to carry meaning in the sense of a topic. These preprocessing steps will be discussed in the next chapter, 4 Data on page 20. Brett also believes that we need to know beforehand what we will generally find in the corpus in order to notice when something is going wrong in the output—to recognize an “outlier”, so to speak (Brett 2013). And thirdly, of course, we need a tool that can do the topic modeling for us. There are many ready-made applications out there, and we can choose among them as we will describe further down in this chapter.

We will now first introduce the three data sets used in our thesis. The UN Corpora contains resolutions of the United Nations General Assembly. The Arabic Bible is a translation dating back to 1865. And lastly, we will use a collection of TED talk subtitles for this thesis.

Secondly, we will present our research questions and explain why we think it is important to pose and answer them. We will also discuss how our research questions will be tested and answered with our chosen approach.

Thirdly, we will introduce the technical approach of our thesis. Which LDA implementation will we use for the topic modeling tasks? As a few of these tools are available, it is worthwhile to see which one fits our task best. Finally, we will present our method of testing our topic models given our research question.

3.1 Overview of the data sets

We will use three data sets in our thesis. In order to experiment with different genres of text, we chose three data sets that are very different from each other.

The first data set is a rather traditional text: an Arabic translation of the Bible. The Arabic version of the Bible should be of proper size and literary quality to conduct experiments on. Mayer and Cysouw describe the data set in their paper (Mayer and Cysouw 2014).

The second data set is the UN Corpora which contains resolutions passed by the United Nations. The 2100 resolutions of the United Nations General Assembly in the data set are available in six languages, among them Arabic. Rafalovitch and Dale describe the data set to have an average of 3 million tokens per language (Rafalovitch and Dale2009).

The third data set that we are using contains subtitles from TED talks that were translated into Arabic. According to Tiedemann, the data set has about 2.4 million tokens (Tiedemann 2012).

For each data set, an English version is at hand as well. The English translations would come in handy if we wanted to compare the topic models across languages.

3.2 Research Questions

Within the scope of this thesis we argue that the meaning contained in each Arabic root can be regarded as a topic itself. As we described in the previous chapter, all Arabic words that have been derived from the same Arabic root talk about the same concept. Thus we define our research question: In what sense can we find commonalities between the way LDA defines a topic and how a topic is predefined by an Arabic root? Does the LDA output group words with the same roots together? To what degree do we find the same root in the output?

Rather than accepting that the meaning of a word is based on its context—the co-occurring words—we are interested in determining whether the output of LDA reflects the Arabic system of morphology as well. In our approach we want to find out how roots are distributed over the output of LDA.

Our research question can be answered fairly easily and without using the probabilities that we could also obtain from the output of a topic modeling tool. In order to answer our question, we concentrate on the top words in the output of LDA. As we learned in the previous chapter, LDA will give high probability to terms that co-occur. For our approach we decide that the ten highest-ranked words in a topic are reflective enough of that topic. Ten words per topic give us enough data to test our research question. Of course, completely different results might be achieved if we used 20 or even more words instead of ten.

3.3 Technical approach

Many different implementations of the LDA topic modeling algorithm can be found. Once one of these tools is chosen, handling the algorithm itself should be fairly straightforward. Ted Underwood and Graham et al. have posted introductions that are accessible even to a less tech-literate audience (Underwood 2012; Shawn Graham and Milligan 2012). In the following paragraphs, we will introduce a few of the available LDA tools.

Python itself has a package named lda for topic modeling.1 This implementation is very basic.

A popular tool is Gensim2 (Řehůřek and Sojka 2010). With its slogan “topic modeling for humans” and its relatively easy documentation, it is the go-to option for many humanities scholars.

Another tool is Paper Machines.3 Adam Crymble discusses the tool in the Journal of Digital Humanities (Crymble 2012). Paper Machines can topic model and visualize a Zotero library. It can generate all sorts of visualizations, among others word clouds, phrase nets or mapped geo-references.

For R, there is also a package that deals with topic modeling (Hornik and Grün 2011).

We will be using the LDA tool MALLET (McCallum 2002).4 The MALLET LDA topic modeling tool can be used fairly easily given some prior knowledge of the command line. The Programming Historian has an easy introduction to how MALLET can be set up and used (Shawn Graham and Milligan 2012). Dariah gives basic instructions for MALLET as well (Dariah-DE, n.a.[a]), along with ideas on how to visualize topic models (Dariah-DE, n.a.[b]).

To load the text into MALLET’s own data structure we have to preprocess it. We will explain how to do this in the next chapter on Data.
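Following the Programming Historian instructions, a MALLET session typically consists of an import step and a training step. The paths, file names and the number of topics below are illustrative, not the exact values used in this thesis:

```shell
# Import a directory of one-document-per-file texts into MALLET's format.
bin/mallet import-dir --input data/arabic/ --output topics.mallet --keep-sequence

# Train an LDA model; write the top words per topic and the
# per-document topic proportions to text files.
bin/mallet train-topics --input topics.mallet --num-topics 20 \
    --output-topic-keys topic-keys.txt --output-doc-topics doc-topics.txt
```

The `topic-keys.txt` file then contains the high-ranked words per topic that our test operates on.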

We will test the output of MALLET in the following way: how often does a specific root get repeated in the output? For each topic, we count the occurrences of every root among the top words. If a root appears only once, we plot a measure point at y=1; if a root repeats itself three times, we plot a measure point at y=3. We then show the distribution of these measure points in a box plot. The implementation of this test can be found online at https://github.com/pmsprenger/ma/blob/master/src/test2.py. A more in-depth discussion of this code can be found in section 5.3 on page 29.
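The counting scheme can be sketched as follows; the root lookup table is a hypothetical stand-in for the stemmer output, and the actual implementation is the test2.py script referenced above:

```python
from collections import Counter

def root_counts(top_words, get_root):
    """Count how often each root occurs among a topic's top words.
    Returns one measure point (y-value) per distinct root."""
    return sorted(Counter(get_root(w) for w in top_words).values(), reverse=True)

# Hypothetical example: roots looked up in a small dict that stands in
# for the stemmer.
roots = {"kataba": "ktb", "kitab": "ktb", "maktab": "ktb", "qalam": "qlm"}
points = root_counts(["kataba", "kitab", "maktab", "qalam"], roots.get)
# points == [3, 1]: the root ktb appears three times among the top words, qlm once
```

Collecting these points over all topics gives the data for the box plot.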

1. A short documentation can be found here: https://pypi.python.org/pypi/lda; the code is being developed by Allan Riddell here: https://github.com/ariddell/lda/.

2. It can be downloaded here: https://radimrehurek.com/gensim/.

3. We can find it here: http://papermachines.org/. Note that Paper Machines is not being maintained anymore.


Chapter 4

Data

In this chapter we will introduce the data of our thesis and how we should preprocess it in order for the topic modeling application to run well and produce proper results.

Firstly, section 4.1 The data sets provides information on the three data sets used in our thesis.

Secondly, we will learn about the preprocessing steps that need to be done before the data can be used in the LDA tool MALLET. As in any data-driven project, preprocessing takes up a lot of time and is the basis for good results. We will explain in sections 4.2 Segmentation of the files and 4.3 Further preprocessing the necessary preprocessing tasks that we coded. Finally, 4.4 Threats to validity will list some of the problems that occurred during preprocessing.

4.1 The data sets

We chose three data sets, as previously mentioned in this thesis. The data sets differ very much regarding their size, language use and structure. It makes sense to test our ideas on parallel corpora, since translations make it possible to compare the output of the LDA algorithm in different languages—for now, we will not do this, but for future research this is a good opportunity. We will have a look at each of them in the following sections 4.1.1 Arabic Bible, 4.1.2 UN Corpora and 4.1.3 TED talk subtitles.

4.1.1 Arabic Bible

The first data set that we used is a version of the Bible, translated into Arabic. The Bible yields the advantage that we can be sure to find a digital version of it, including transliterations or vocalized versions in case we need them. Mayer and Cysouw created a huge parallel Bible corpus1 spanning over 900 translations. According to them, the Bible is one of the most translated books in history (Mayer and Cysouw 2014). The data set as a whole is very interesting since the translations differ very much among themselves:

“The number of verses per translation varies widely. The average number of verses per translation is 10,707 (with a standard deviation of 7,727 verses). The largest number of verses (36,986) is in the text of the English King James Version, which includes many apocrypha books. The smallest number of verses can be found in the text for the Papua New Guinea language Wedau [wed], which lists only 677 verses. The average number of words per translation is 408,973 (standard deviation: 367,572). The average vocabulary size (number of types) is 21,176 (standard deviation: 15,134).” (Mayer and Cysouw 2014).

The data set is one large text file with numbers at the beginning of each line to indicate the books, chapters and verses. The files we will run our topic modeling on have the following characteristics. After splitting the file along the boundaries of the chapters of the Bible—explained further below in section 4.2 Segmentation of the files—we have 1179 files. These files contain our Arabic version of the Bible. The corpus size is 556217 tokens, while the number of unique root types—discussed in section 4.3 Further preprocessing—is 8387.

The token/root type ratio is thus: 556217 / 8387 = 66.32.

For future work, we should keep in mind that the different versions of the Bible do not always have the same length. While working on this thesis, we discovered that the King James version of the Bible had about 20.000 lines more than the Arabic or the German ones2. If we want to proceed with topic modeling of parallel corpora in different languages, we should make sure that the same amount of data—or chapters of the Bible in this case—will be given to MALLET.

1. The data set we used is apparently not available anymore: http://paralleltext.info/data/full/arb-x-bible/. However, there are more web sites online where you can find the Arabic Bible, for example here: http://christos-c.com/bible/ or https://github.com/christos-c/bible-corpus/blob/master/bibles/Arabic.xml.

2. The data set by Mayer and Cysouw is not available anymore but was found here: http://paralleltext.info/data/full/deu-x-bible-elberfelder1905/ and here: http://paralleltext.info/data/full/eng-x-bible-kingjames/.

4.1.2 UN Corpora

Our second data set is the UN Corpora, which contains resolutions from 2000 to 2007 of the United Nations General Assembly in the six official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. Rafalovitch and Dale describe the data set in their paper (Rafalovitch and Dale 2009). As the resolutions have legal weight, their translations can be considered to be carefully checked. With this corpus we could tackle future topic modeling tasks to compare topic models in various languages. The UN Corpora is freely downloadable3. It comes as a TMX file, which is an XML standard that stands for Translation Memory eXchange4. Thus, it is a markup for keeping translations of a text together in one file.

According to Rafalovitch and Dale, the data set has an average of 3 million tokens per language (Rafalovitch and Dale 2009). And in fact, our split files have 3.299.099 Arabic tokens that we will do our topic modeling on. The number of unique root types after removing function words and stemming is 6210.

Thus, we can calculate the following token/root type ratio: 3299099 / 6210 = 531.26.

Rafalovitch and Dale’s description of the structure of the data set will help us split the large file into the 2100 resolutions. In section 4.2 Segmentation of the files we will explain how this works.

4.1.3 TED talk subtitles

Our third and last data set is a corpus containing subtitles from TED talks. They are available via OPUS, a platform of the Department of Linguistics and Philology at Uppsala University in Sweden5.

Subtitles from TED talks are an interesting choice for this thesis because of the diverse topics available in this kind of corpus. In general, a TED talk focuses closely on one narrowly defined topic, so many TED talks together should cover a broad range of topics. We expect interesting findings thanks to this factor.

According to Tiedemann, the data set has about 2.4 million tokens (Tiedemann 2012). Our bash query showed that there are 2.517.161 tokens and 13390 unique root types after removing stop words and stemming.

Following is our calculation of the token/root type ratio: 2517161 / 13390 = 187.99.
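The three ratios reported in this chapter can be reproduced with a few lines of code; the token and root-type counts are the ones given in the text:

```python
# Token and unique-root-type counts as reported for the three data sets.
corpora = {
    "Arabic Bible": (556217, 8387),
    "UN Corpora": (3299099, 6210),
    "TED talk subtitles": (2517161, 13390),
}

ratios = {name: round(tokens / roots, 2)
          for name, (tokens, roots) in corpora.items()}
# ratios == {"Arabic Bible": 66.32, "UN Corpora": 531.26,
#            "TED talk subtitles": 187.99}
```

The UN Corpora stands out: its very repetitive legal language reuses a small set of roots far more often than the other two data sets do.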

The data comes, just like the UN Corpora, as a TMX file. We will explain how we split this data set up into 1113 documents in the following section 4.2 Segmentation of the files.

4.2 Segmentation of the files

In order for our LDA tool MALLET to work properly, the data has to be preprocessed so that it matches MALLET’s requirements. Each text that should be treated by MALLET as one unique document should be in one file. Technically, a document could be only a paragraph or a verse from the Bible, but also a whole book. It does make sense though for MALLET to have at least around 1000 different documents to produce the topic models. That is why it made sense to split the three data sets in the following manner.

3. This is the website: http://www.uncorpora.org/.

4. Have a look here: http://www.ttt.org/oscar/index.html.

Bible

Since the Arabic version of the Bible that we have contains 1179 chapters, we will split the original file into 1179 files, each containing one chapter. Note that the English version—the King James Bible—that we had access to had 1357 chapters. If we were to compare topic models of the two languages, we would need to cut the English version after chapter 1179 to ensure that both data sets represent the same content.

The script for splitting the Bible can be downloaded from https://github.com/pmsprenger/ma/blob/master/src/split_bible.py. We will go through the important steps of the script.

After opening and reading in the file, we find that each line with a verse of the Bible starts with an 8-digit number. We can also observe that a new chapter of the Bible begins whenever there is a change in the fifth digit. With the following RegEx, we match this fifth digit:

re.sub(r'^[0-9]{4}([0-9]).*', r'\1', line)

We then design a for loop that writes the contents of the lines to a file every time there has been a change in the fifth digit. We use a counter to name the files according to the sequence of the chapters in the Bible. Finally, if there is content left in our variable en, we also write it to a file (lines 68 to 73 in the script).

We have two functions that can be triggered by setting sys.argv[1] in the command line to ar for Arabic or en for English.
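The splitting logic described above can be sketched as follows. The function and the synthetic verse lines are illustrative stand-ins; the actual implementation is split_bible.py linked above:

```python
import re

def split_chapters(lines):
    """Group verse lines into chapters, starting a new chapter whenever
    the fifth digit of the leading 8-digit verse number changes
    (the file layout described above)."""
    chapters, current, prev = [], [], None
    for line in lines:
        m = re.match(r'^[0-9]{4}([0-9])', line)
        if not m:
            continue  # skip lines without a verse number
        digit = m.group(1)
        if prev is not None and digit != prev:
            chapters.append(current)
            current = []
        current.append(line)
        prev = digit
    if current:  # flush the last chapter
        chapters.append(current)
    return chapters

# Synthetic lines: two verses of one chapter, then one verse of the next.
verses = ["00001001 verse a", "00001002 verse b", "00002001 verse c"]
```

In the real script, each chapter would then be written to its own numbered file instead of being collected in a list.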

UN Corpora

In general, the script for the UN Corpora works the same as the one for the Bible6. A specific line in the file triggers the script to save the previous lines to a new file. The UN Corpora looks a little different from the Bible data set, so other regular expressions are needed. To begin with, the following RegEx will match all lines that contain Arabic text:

re.compile("<tuv xml:lang=\"ar\"><seg>")

The next RegEx recognizes the start of a new resolution:

re.compile(r'<tuv xml:lang="en"><seg>RESOLUTIONS? \d+/\d+\s?\S?.')

This RegEx will trigger the lines previously saved in variable en or ar to be saved to a new file—we use this line for both English and Arabic.

6. The script can be found here: https://github.com/pmsprenger/ma/blob/master/src/split_un.py.


To get the right file name for the English translations, we need to capture two different kinds of lines, as there are different numbering systems for resolutions:

re.compile(r"<tuv xml:lang=\"en\"><seg>\d+/\d+\.?\s?(A|B)?")

or

re.compile(r"^\d+/\d+ (A|B)\.")

For Arabic we need to capture three specific lines:

re.compile(r"<tuv xml:lang=\"ar\"><seg>\d+/\d+\.?\s?")

re.compile(r"<tuv xml\:lang=\"ar\"><seg>\d+/\d+ \u0623\u0644\u0641")

and

re.compile(r"<tuv xml\:lang=\"ar\"><seg>\d+/\d+ \u0628\u0627\u0621")

We are using the unicode escapes of the Arabic letters to avoid the hassle that is unavoidable when mixing left-to-right and right-to-left text in one line.

A few specific lines are deliberately skipped: for example, the lines that specify which countries abstained, voted against or in favor of a resolution7.

TED talk subtitles

Lastly, we need to split the TED talk subtitle data set. The script can be downloaded here: https://github.com/pmsprenger/ma.

The following RegEx triggers the saving of the lines previously stored in the variable en or ar:

rx_http_en = re.compile('<tuv xml:lang="en"><seg>http://www.ted.com/talks/')

for Arabic it is the same but with the tag xml:lang="ar". The file name can be caught with

re.compile(r'<tuv xml:lang="en"><seg>\d+</seg></tuv>')

for Arabic it’s again the same but with the ar tag.
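Both the UN and the TED splitting scripts follow the same save-on-trigger pattern: segment lines are buffered until a trigger line appears, at which point the buffer is flushed as one document. The following is a simplified, hypothetical sketch of that pattern, not the actual code from the repository:

```python
import re

def split_on_trigger(lines, seg_rx, trigger_rx):
    """Buffer <seg> lines; flush the buffer as one document whenever a
    trigger line (e.g. a TED talk URL) is encountered."""
    docs, buffer = [], []
    for line in lines:
        if trigger_rx.search(line):
            if buffer:
                docs.append(buffer)
                buffer = []
            continue  # the trigger line itself is not document content
        if seg_rx.search(line):
            buffer.append(line)
    if buffer:  # flush the last document
        docs.append(buffer)
    return docs

seg = re.compile(r'<seg>(.*)</seg>')
trigger = re.compile(r'<seg>http://www\.ted\.com/talks/')
```

Whether a trigger line is kept or discarded (as here) depends on the data set; in our scripts it only supplies the file name.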

7. We do not want country names to interfere with our topic models, so we discard them. Of course, given another research question and context, it could also be very interesting to focus on which countries would generally vote on certain topics.


4.3 Further preprocessing

To make the data fit for MALLET, we have to further preprocess the texts. Firstly, we need to tokenize the data sets. Secondly, we need to get rid of all words that do not carry any topical meaning.

Our script that tokenizes the words and removes those that do not carry meaning is available here: https://github.com/pmsprenger/ma/blob/master/src/tokenizer.py.

Tokenizing can easily be achieved through NLTK:

from nltk import wordpunct_tokenize as tokenizer

(see lines 49 and 109 of our script for English and Arabic, respectively).
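Since wordpunct_tokenize is a purely regex-based tokenizer, its behavior can be illustrated with a stdlib-only sketch (assuming NLTK's \w+|[^\w\s]+ pattern). Because \w is Unicode-aware in Python 3, the same pattern also splits Arabic script:

```python
import re

# Stdlib sketch of wordpunct-style tokenization: runs of word
# characters are split from runs of punctuation.
def wordpunct(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct("left, right."))  # ['left', ',', 'right', '.']
print(wordpunct("كتب، قرأ."))     # Arabic comma and period split off too
```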

Removing function words needs a little more coding. Stop-word lists are implemented in MALLET itself, but only for some languages: English, German, Finnish, French and Japanese. Since we are working with Arabic texts, we need another way to separate meaningful words from function words. A comprehensive list of Arabic function words can be found here: https://github.com/mohataher/arabic-stop-words. This list contains 750 Arabic words that do not carry meaning. We further found that we needed to remove these punctuation marks: ., ؟, ، and :.
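The filtering step can then be sketched as follows; the tiny stop-word set here is an illustrative stand-in for the 750-word list, not the real file:

```python
# Punctuation marks the thesis removes, plus a filter against a
# stop-word set (here a two-word illustrative subset).
PUNCT = {".", "؟", "،", ":"}

def remove_function_words(tokens, stopwords):
    return [t for t in tokens if t not in stopwords and t not in PUNCT]

stopwords = {"في", "من"}  # e.g. "in", "from"
tokens = ["الكتاب", "في", "،", "البيت"]
print(remove_function_words(tokens, stopwords))  # ['الكتاب', 'البيت']
```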

After tokenizing and removing function words, we can move on to the topic modeling of our texts, described in the next chapter, 5 Experiments.

4.4 Threats to validity

A potential problem is that we can also find English text in the Arabic files. A concrete example is the word DNA, which appears in the output of the TED talk subtitles topic modeling. Since words like DNA are scientific terms that are also used in the Arab world, English words will occur in the output. If one of these English words is among the top ten words of a topic, the root counts per topic will be skewed.
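One way to quantify this threat (an assumed check, not part of the thesis pipeline) is to count the tokens containing Latin letters that survived into the Arabic files:

```python
import re

# Flag tokens that contain at least one Latin letter; in a cleanly
# Arabic file this list should be (nearly) empty.
rx_latin = re.compile(r"[A-Za-z]")

def latin_tokens(tokens):
    return [t for t in tokens if rx_latin.search(t)]

print(latin_tokens(["الحمض", "DNA", "النووي"]))  # ['DNA']
```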


Chapter 5

Experiments

In this chapter, we present the experiments that we carried out on our data using MALLET’s topic modeling (McCallum 2002). We first explain how to set up MALLET (5.1 Setup); thereafter we explain our initial idea for setting k in 5.2 Choosing k using a stemmer. We briefly discuss the results of this part in Section 6.1 on page 32. In Section 5.3 Topic modeling runs and testing, we present the parameters with which we ran our topic models to obtain the results presented in Chapter 6 Results. We also take a look at our method for assessing the output of the topic models and the script we wrote for this purpose.

5.1 Setup

Before running the actual topic modeling implemented in MALLET, we have to import the data into MALLET’s own data format. For this, MALLET has an import function; the instructions can be found here: http://mallet.cs.umass.edu/import.php. We imported our files in the following way:

bin/mallet import-dir --input ../out/tokenized/ted/ar/ --keep-sequence --output data/ted-ar.mallet

The --input option defines which folder should be used as input. All files in this folder are imported and used in the topic model; in other words, our data set, split into separate files, must be available in the specified folder. The --keep-sequence option is recommended both by the Programming Historian and by MALLET itself (Shawn Graham and Milligan 2012; McCallum 2002). Finally, the --output option defines where the resulting .mallet file will be saved.

Theoretically, MALLET could also detect stop words and filter them out. For this, we would need the options --remove-stopwords and --stoplist-file FILE, and possibly also --extra-stopwords FILE if we drew our stop words from more than one file. We chose to do this ourselves
