Mining social media for psychiatric effects of 5-alpha reductase inhibitors

Academic year: 2021

Liset van Wijk

Student

Liset M. van Wijk
Student number: 12003875
l.m.vanwijk@amsterdamumc.nl

Tutor

Dr. Julio C. Facelli
Professor and Vice Chair
Department of Biomedical Informatics
University of Utah
julio.facelli@utah.edu

Mentor

Dr. Martijn C. Schut
Department of Medical Informatics
Amsterdam UMC, location AMC
m.c.schut@amsterdamumc.nl

Location

Department of Biomedical Informatics
University of Utah
421 Wakara Way
Salt Lake City, Utah, 84108
United States

Period

Abstract

INTRODUCTION

Nowadays, the development of new drugs has a low rate of success, while development costs are rising. Drug repurposing (defined as finding new purposes for existing drugs) could be a fruitful alternative. Previous experimental drug repurposing research showed promising effects of 5-alpha-reductase (5-AR) inhibitors for psychiatric conditions. 5-AR inhibitors are already used as a treatment for hair loss and benign prostatic hyperplasia. Social media has also been suggested as a source for drug repurposing research, which can strengthen the theory that 5-AR inhibitors are effective drugs for psychiatric conditions. The goal of this study is to determine whether social media can be used to find effects of 5-AR inhibitors for people with psychiatric conditions.

METHODS

First, we selected a suitable social media venue based on content and size. As 5-AR inhibitors are already used as a treatment for hair loss, we focused on social media venues related to hair loss. Then, the posts of the venue were gathered and pre-processed, and concepts were extracted with the medical named entity recognition tool MetaMap. Afterwards, we selected a subset of concepts related to psychiatric conditions based on the relevance of the concept. Finally, we went manually through relevant posts to identify people who take a 5-AR inhibitor in combination with having a psychiatric condition.

RESULTS

The selected social media venue was the forum https://www.hairlosstalk.com/. This forum contains over a million posts, written by about 28,000 different members. From this venue, we extracted over 15 million concept instances with MetaMap. Because the majority of the concept instances were irrelevant, we selected 46 concepts that were related to a mental disorder or a psychoactive substance as indicators for a psychiatric condition. For six of these concepts, we manually identified people with a psychiatric condition who also take a 5-AR inhibitor. For each of the concepts, we identified at least one person who met both requirements. However, the number of people identified was too low, and the data were too sparse to find effects of 5-AR inhibitors.

CONCLUSION

Our research showed that it is possible to identify people who take a 5-AR inhibitor and also have a psychiatric condition. However, we could not find effects of 5-AR inhibitors, due to the sparsity of relevant data. We still see possibilities in the use of social media to find drug repurposing candidates. We recommend that further research take the following suggestions into account: use a disease-related social media venue instead of a drug-related venue, and identify people who take a drug and/or have a disease with an automatic approach.

Key words: drug repurposing, 5-alpha reductase inhibitor, social media mining, natural language processing

INTRODUCTION

Nowadays, the development of new drugs has a low success rate, while development costs are rising. Drug repurposing (defined as finding new applications for existing drugs) can be a valuable alternative. Experimental studies on drug repurposing of 5-alpha reductase (5-AR) inhibitors have shown promising results for use in psychiatric conditions. 5-AR inhibitors are already used as a drug against hair loss and benign prostatic hyperplasia. In addition, social media is proposed as a source for drug repurposing research, which can strengthen the theory that 5-AR inhibitors are effective drugs for psychiatric conditions. The goal of this study is to determine whether social media can be used to find effects of 5-AR inhibitors for people with psychiatric conditions.

METHODS

We start by selecting a social media platform based on content and size. Because 5-AR inhibitors are already prescribed as a drug against hair loss, we focus on social media related to hair loss. Next, we gather posts from the platform, which are processed to make it easier to extract information from them. After that, concepts are extracted from the forum with the medical named entity recognition tool MetaMap. Then we select a group of concepts related to psychiatric conditions. Finally, we analyze the posts manually to identify people who both take a 5-AR inhibitor and have a psychiatric condition.

RESULTS

The selected social media platform is https://www.hairlosstalk.com/. The platform contains more than a million posts, written by about 28,000 different forum members. With MetaMap, we extract more than 15 million concept instances from the platform. Because the majority of the concept instances are irrelevant, we select 46 concepts that are related to a mental disorder or a psychoactive substance. For six of these concepts, we manually identify people who both have the psychiatric condition and take a 5-AR inhibitor. For each concept, at least one person is found who meets both requirements. However, the number of people we identified is too low, and the data were too sparse to find effects of 5-AR inhibitors.

CONCLUSION

Our research has shown that it is possible to find people who take a 5-AR inhibitor and have a psychiatric condition. However, no effects of 5-AR inhibitors were found, because the relevant data are sparse. We still see possibilities for using social media to find drug repurposing candidates. We offer future research the following suggestions: use a disease-related social media platform instead of a drug-related platform, and identify people who use drugs and/or have a psychiatric condition with an automated approach.

Acknowledgements

This thesis is the final work towards my Master’s degree in Medical Informatics. It has been almost a year since I went to Salt Lake City and started my research at the University of Utah. I could not have been more grateful for this experience while also having the opportunity to see more of Utah and the rest of the United States.

Funding for this study was provided by the University of Utah Health Program of Personalized Health, the University of Utah Center for Clinical and Translational Science under NCATS Grant U01TR002538. Furthermore, the Center for High-Performance Computing at the University of Utah provided computer resources for High-Performance Computing, which was partially funded by the N.I.H. Shared Instrumentation Grant 1S10OD021644-01A1.

I would like to thank my mentor Julio Facelli for giving me the opportunity to go to Salt Lake City and write my thesis at the University of Utah. I would also like to thank him for his supervision and for providing me with housing. Furthermore, I would like to thank my tutor Martijn Schut for his guidance and advice, especially during the final months of writing this thesis.

Furthermore, I would like to thank Kathy Scott and the rest of the roommates for making me feel so welcome in Salt Lake. I had an amazing experience that I will never forget. Finally, I would like to thank my family and friends for their support over the year(s).

Liset van Wijk

Utrecht, November 2019

5-AR 5-alpha-reductase

ADR Adverse Drug Reaction

BPH Benign Prostatic Hyperplasia

CUI Concept Unique Identifier

DHT Dihydrotestosterone

EHR Electronic Health Record

MMI MetaMap Indexing

NER Named Entity Recognition

NLP Natural Language Processing

OCD Obsessive-Compulsive Disorder

R&D Drug Research and Development

SVM Support Vector Machine

UMLS Unified Medical Language System

Contents

1 Introduction
2 Background
  2.1 Drug repurposing
  2.2 5-alpha-reductase inhibitors
  2.3 Text Mining
  2.4 Social media in healthcare research
  2.5 Strengths and weaknesses of social media research
  2.6 Implications for this research
3 Selection of a social media venue
  3.1 Introduction
  3.2 Methods
  3.3 Results
  3.4 Discussion
4 Extracting psychiatric condition concepts
  4.1 Introduction
  4.2 Methods
  4.3 Results
  4.4 Discussion
5 Identifying people who take a 5-AR inhibitor and have a psychiatric condition
  5.1 Introduction
  5.2 Methods
  5.3 Results
  5.4 Discussion
6 General discussion & Conclusion
  6.1 Main findings
  6.2 Implications & Recommendations
  6.3 Conclusion
References
Appendices
  A Overview of social media sites
  B Additional MetaMap information
  C Trigger information

Chapter 1

Introduction

Drug research and development (R&D) is an expensive, time-consuming, high-risk process, and its efficacy is low [1–3]. According to DiMasi et al. [3], the cost per approved new drug was nearly 2.6 billion US dollars in 2013, while the process of drug development takes between 10 and 17 years [4]. At the same time, the US Food and Drug Administration approves roughly one in ten new drugs that enter a clinical trial [5]. Moreover, about 25 drug candidates from first discovery are needed to launch one new drug [1]. Together, these figures show the need for new approaches to drug discovery.

Drug repurposing is an effective strategy as it finds new indications for existing drugs. Safety profiles of the drugs are already known, which reduces clinical trial expenses, saves time, and makes the development low-risk [6]. The drug repurposing strategy as described in [7] has three steps. The critical step is finding the right candidate drug for an indication to generate a repurposing hypothesis. An experimental, a computational or a mixed approach can be used to generate a hypothesis. An experimental approach gives direct evidence of links between drugs and diseases, while a computational drug repurposing approach is mostly data-driven and involves a methodical analysis. A mixed approach is favourable to create a strong hypothesis for a candidate drug. Over the years, several drugs have found a new purpose based on drug repurposing research [8–10].

Recently, experimental drug repurposing approaches showed promising results for the 5-α-reductase (5-AR) inhibitors finasteride and dutasteride. The drugs are prescribed for male hair loss and benign prostatic hyperplasia (BPH) [11–13], but Paba et al. [14] identified therapeutic effects of 5-AR inhibitors for neuropsychiatric disorders. Furthermore, several studies showed the effects of finasteride in animal models for schizophrenia, Tourette’s syndrome and Levodopa-induced dyskinesia [15–18]. Similarly, dutasteride had effects in protecting mice from Parkinson-like disease [19]. Combining this established experimental research with computational research may generate a strong hypothesis for repurposing 5-AR inhibitors for psychiatric conditions.

A study by Hurle et al. [20] suggested social media as a source for computational drug repurposing research. Several studies on finding adverse drug reactions (ADRs) utilizing social media also reported beneficial effects of drugs [21–24], which could lead to repurposing the drug. Finding ADRs in social media has already proven successful [25], and the research process is similar to finding beneficial effects. So far, limited research has been performed on using social media for drug repurposing [26, 27], but both studies showed its potential. As social media demonstrates potential in finding adverse and beneficial effects, the goal here is to assess further whether 5-AR inhibitors have effects that are beneficial for psychiatric conditions. We use social media as a data source for computational drug repurposing.

To succeed in this, a high volume of relevant data must be available. Finasteride is a highly used drug, and in 2016 it was prescribed over 10 million times for both male baldness and BPH in the US [12, 13]. The number of prescriptions for dutasteride in 2016 in the US was over 1.6 million [28]. Furthermore, both drugs are mentioned heavily on hair loss related forums [29, 30]. Altogether, the 5-AR inhibitors show promising experimental repurposing results, have a high prescription rate and a relevant number of references on social media. These aspects make them good candidate drugs for this drug repurposing study. To generate a hypothesis for repurposing 5-AR inhibitors, we formulate the following research question: “Using social media, can we find signals that indicate a change in psychiatric condition in patients receiving 5-AR inhibitor treatment?” We approach the research question stated above through the following three aims:

1. Identify social media venues with substantial 5-AR inhibitor content and extract posts from participants mentioning 5-AR inhibitors in at least one posting.

2. Use a concept extraction tool to extract concepts of the postings, and so obtain posts with mentions of psychiatric conditions.

3. Identify people that have been using a 5-AR inhibitor in combination with having a psychiatric condition.

Our approach is derived from the five steps described in the systematic review conducted by Tricco et al. [25] on using social media for finding ADRs. First, we acquire data from a social media venue and then pre-process the textual data to improve the quality. In the third step, we extract the information by selecting concepts related to drugs and effects. The fourth step consists of data analysis looking for patterns and relations between concepts. Lastly, we evaluate the performance of the approach. In Figure 1.1, we show each step of the approach of Tricco et al. alongside the chapter in which the step is executed in our research.

The remaining part of the thesis is organised as follows. The next chapter provides background information, and the three chapters after that, Chapters 3-5, each present one of the aims in relation to one or two steps from the research by Tricco et al. [25]. In Chapter 3, a social media venue with 5-AR inhibitor content is selected, from which we extract psychiatric condition concepts in Chapter 4. People who have a psychiatric condition and take a 5-AR inhibitor are identified in Chapter 5. The final chapter consists of a general discussion and conclusion, where we answer the research question.

Figure 1.1: Steps described in Tricco et al. [25] in correspondence with Chapters in this research.

Chapter 2

Background

In this chapter, we provide background information that is relevant throughout the different aims of this research. Accordingly, we cover the need for drug repurposing and provide information about it. Furthermore, we explain how 5-α-reductase inhibitors work and what relation they have to psychiatric diseases. We also elaborate on text mining, clarify the five steps described by Tricco et al. to find ADRs in social media [25] and discuss the strengths and weaknesses of social media health research. We describe background information related to each specific aim in the respective chapter.

2.1 Drug repurposing

Currently, several studies point out concerns in the development of new drugs [2, 3, 31]. The duration of the process is long, costs and risks are high, it is inefficient, and only a small portion of candidate drugs are successfully brought to the market. Therefore, Paul et al. [1] argue that improving the productivity of R&D is very important. Drug repurposing is a useful approach to tackle current drug development issues. It is an approach to identify new therapeutic indications for approved or investigational drugs that differ from the original indication [7, 26]. Drug repurposing has advantages in comparison with traditional R&D. These are a reduced risk of failure, a shorter development duration, and cost savings [7]. Usually, the drug repurposing strategy consists of three steps before a drug is moved further into the developmental pipeline. These steps are [7]:

1. Identify a candidate drug for a given indication (hypothesis generation).

2. Assess the effect of the candidate drug in pre-clinical models.

3. Evaluate the efficacy of the candidate drug in phase II clinical trials.

To identify new candidates for drug repurposing, two systematic approaches for hypothesis generation are beneficial. These approaches are computational approaches and experimental approaches. Combining both approaches into a mixed approach strengthens the hypothesis [7]. Computational approaches are conducted with a broad range of data, of which social media is one.

Only two studies which use social media as a data source for drug repurposing are known to us [26, 27]. In a feasibility study by Rastegar-Mojarad et al. [27], online patient medication reviews were used to find repurposing possibilities. Patterns were identified that discover beneficial effects of drugs, and with a rule-based system, they found five drug repurposing candidates. In research by Nugent et al. [26], Tweets were used to find drugs and side-effects for drug repurposing. The study identified new indications for drugs, which suggests usability in drug repurposing. Both studies provide evidence that social media is a valuable source for drug repurposing.

2.2 5-alpha-reductase inhibitors

We earlier introduced 5-AR inhibitors as a potential target for drug repurposing. In this section, we define current indications of these drugs, possible new indications, and why they are good candidates for drug repurposing.

The enzyme 5-alpha reductase converts testosterone into dihydrotestosterone (DHT), which is the more potent androgen [17, 32]. 5-alpha reductase has three isoforms, of which two are known to be expressed during different developmental stages and in different tissues [33, 34]. It is therefore likely that the isoforms have separate therapeutic targets.

5-AR inhibitors are drugs that inhibit isoforms of the enzyme 5-alpha-reductase. In this way, they stop the transformation of testosterone to DHT, which has applications in male hair loss and BPH. The drugs finasteride and dutasteride are both 5-AR inhibitors, but they act differently on the isoforms.

Finasteride is a 5-AR inhibitor and is clinically approved as a treatment for BPH and male hair loss [12]. Currently it is on the market under the brand names Proscar® (Merck) and Propecia® (Merck). Proscar comes as a 5 mg dose and is mainly used for BPH, while Propecia is distributed as 1 mg tablets and prescribed as a treatment for male hair loss [33].

Dutasteride is also a 5-AR inhibitor but, in contrast to finasteride, it is only approved for BPH in the United States [11]. Still, dutasteride is prescribed off-label for male hair loss. Dutasteride is prescribed daily in a 0.5 mg dose, under the brand name Avodart® (GlaxoSmithKline). In recent research, dutasteride was shown to be more effective than finasteride for male hair loss, while having similar side effects [35].

2.2.1 Experimental outcomes

The 5-AR inhibitor finasteride showed promising results in experimental studies for the treatment of several neuropsychiatric disorders. Early research revealed antipsychotic-like effects in rats [17]. Soggiu et al. [12] hinted that finasteride has behavioral effects and therapeutic potential for neuropsychiatric disorders. Additionally, Paba et al. [14] suggested that finasteride could have therapeutic effects in psychotic disorders, impulse control disorders, and Tourette’s syndrome. Effects of finasteride for Tourette’s syndrome are also pointed out in [18, 36]. The research by Frau et al. [36] gave further evidence that finasteride has antipsychotic-like abilities. The study implied that the 5-AR inhibitor could be useful for treating schizophrenia and neuropsychiatric disorders, including obsessive-compulsive disorder. Recently, Diviccaro et al. [37] showed that treating rats with finasteride has long-term effects on depressive-like behaviour.

Less is known about the potential effects of dutasteride. So far, it has been reported that the drug decreased the development and expression of dyskinesia, which is often seen in Parkinson patients treated with Levodopa [38]. Finasteride also showed these results in earlier research [15], but dutasteride seems more effective at a lower dose and without impacting the effectiveness of Levodopa. Dutasteride was also suggested as a promising drug for the protection of neurons in Parkinson disease itself [19].

The outcomes of these studies imply drug repurposing possibilities of 5-AR inhibitors in a broad segment of neurological and psychiatric disorders. The findings from these studies show effects in animal models, so it has not been demonstrated that similar effects are seen in humans.

2.3 Text Mining

We define text mining as the techniques used to extract computable information, which can be used to generate knowledge from large amounts of unstructured textual data [23]. As most of the data online are unstructured and contain valuable information, there is a need to develop methods to retrieve and analyze this data. With Natural Language Processing (NLP), large volumes of data can be processed automatically [39].

NLP consists of different techniques, each with a different goal [40]. One of the techniques is lemmatization, which groups the different forms of a word so they can be analyzed as the same item [40]. Stemming is a method to get the root of words. Both stemming and lemmatization are useful to detect the similarity of words [41]. Tokenization is the splitting of texts into meaningful parts, such as words, phrases, sentences, or paragraphs [42]. Texts are split into tokens by considering their delimiters. Stop word removal is an NLP technique that removes high-frequency words [41]. Part-of-speech tagging assigns each token to its assumed part of speech [43]. Word sense disambiguation computationally determines the meaning of a word in the context of its sentence [23, 44].
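To make three of these techniques concrete, the following is a minimal Python sketch of tokenization, stop word removal, and stemming. It is an illustration only and not the pipeline used in this thesis: the stop word list is a tiny hand-picked assumption, and `toy_stem` is a deliberately naive suffix stripper, not an established stemming algorithm.

```python
import re

# Tiny illustrative stop word list (an assumption); real lists are much larger.
STOP_WORDS = {"a", "an", "and", "the", "i", "is", "it", "my", "of", "on",
              "to", "was", "for", "have", "been"}

# Suffixes removed by the toy stemmer below (hypothetical, for illustration).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Any character that is not a letter, digit, or inner hyphen acts as a delimiter.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens):
    """Drop high-frequency words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stop_words(tokenize(text))]
```

The toy stemmer over-strips some words (for example, "taking" becomes "tak"), which is one reason real studies rely on established stemmers or lemmatizers.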

Named Entity Recognition (NER) is an NLP technique to discover words or groups of words in the data as entities [45]. In medical text processing, NER is mostly used to extract medical concepts and terms. MetaMap is a tool that provides medical NER amongst some other NLP techniques [46]. It is a widely used tool and has a broad range of applications, including concept extraction from social media sources [27, 39, 45]. We will further explain the features of MetaMap in Chapter 4.
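The idea behind dictionary-based NER can be sketched in a few lines of Python. This is not MetaMap and does not reproduce its algorithm or output: the lexicon and the concept identifiers below are made-up placeholders (they are not real UMLS CUIs), and the matching strategy is a simple greedy longest-match lookup chosen for illustration.

```python
# Hypothetical mini-lexicon mapping token sequences to placeholder concept IDs.
# The IDs are invented for this sketch and are NOT real UMLS identifiers.
LEXICON = {
    ("hair", "loss"): "CONCEPT_HAIR_LOSS",
    ("depression",): "CONCEPT_DEPRESSION",
    ("panic", "attack"): "CONCEPT_PANIC_ATTACK",
    ("finasteride",): "CONCEPT_FINASTERIDE",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in LEXICON)


def extract_concepts(tokens):
    """Greedy longest-match lookup of token n-grams in the lexicon.

    Returns a list of (matched phrase, concept ID) pairs in order of appearance.
    """
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i : i + n])
            if phrase in LEXICON:
                found.append((" ".join(phrase), LEXICON[phrase]))
                i += n
                break
        else:
            i += 1  # no lexicon phrase starts here; move to the next token
    return found
```

A lookup like this cannot handle misspellings, synonyms, or ambiguous words, which is where a full NER tool with normalization and word sense disambiguation adds value.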

2.4 Social media in healthcare research

Social media has several applications in healthcare research. For example, it is used to manage and monitor outbreaks of diseases [47], to understand the patient perspective of a treatment [48], and to recruit participants for medical studies [49].

We earlier mentioned two studies [26, 27] as the only known research about drug repurposing using social media. A more frequent application of social media in healthcare research is finding adverse effects of drugs. Several studies were conducted to find ADRs from social media content [25, 50, 51]. The approach for finding ADRs has a similar goal as finding effects for a drug repurposing hypothesis, with the difference that we are not just looking for adverse effects, but for effects in general.

Most of the approaches to find ADRs consist of similar steps. Tricco et al. [25] described a pipeline of five steps regularly executed in finding ADRs using social media:

1. Data acquisition

2. Text pre-processing

3. Information extraction: identification of named entities and normalization

4. Data analysis: relation extraction

5. Evaluation

2.5 Strengths and weaknesses of social media research

In comparison to more traditional medical research on health records, social media research has advantages and disadvantages. Sampathkumar et al. [52] point out that online forums have lowered a barrier for patients to report their experiences, and thus could be a valuable resource for ADRs. However, there might also be under- and overreporting of side effects on web forums [53]. Moreover, ‘mild’ and symptom-related ADRs are over-represented in social media, while laboratory tests and ‘serious’ ADRs are mentioned less in comparison with other data sources [54]. On the other hand, spontaneous adverse event reporting systems also suffer from reporting biases [55].

Another interesting point is that the research of Topaz et al. [56] found a match between the ADRs most frequently mentioned on social media and in electronic health records (EHRs). They also argue that less frequently reported reactions were more frequently reported in social media. In [57], it is argued that social media is a complementary source to traditional reporting systems and thus is valuable. Research in finding ADRs in EHRs shows similarities to using social media, but it has complexity issues related to privacy concerns [58]. Another problem mentioned by the same research is the lack of a comprehensive EHR for patients, with some patients having multiple EHRs. Social media has similar data available at one location and without privacy concerns.

Liu et al. [55] observed that it is challenging to discern ADRs from indications in free-text clinical notes. Likely, this is also the case with social media. Additionally, people could give general statements about effects, but not talk about personal experience [39]. Another study documents that ADR mentions are not exclusively related to actual patient experience, but also to research, news items or word of mouth [59]. These other mentions cause noise in the data and can affect the quality of the findings.

The difference in vocabulary between biomedical literature on one hand and user-generated health content on the other is one of the challenges in our research. The language of social media frequently contains misspellings, abbreviations, irregular grammar, symbols, ambiguity, and noise [21, 39, 53, 60–63]. Together with the already informal language, this makes it difficult for ontologies to normalize it to formal terms [64]. We propose several solutions to these disadvantages. For example, CSpell is a spell checker developed to deal with consumer health language and can improve the quality of texts [65]. It can manage non-word errors, real-word errors, word boundary infractions, punctuation errors and combinations of them. Also, Sarker & Gonzalez-Hernandez [66] developed a misspelling generator for health-related text sources to find misspelled keywords. Word sense disambiguation is a useful NLP technique to deal with ambiguous words [53].

To deal with noise in Tweets, Bian et al. [61] proposed a Support Vector Machine (SVM) to filter the noise out. Also, to include only texts with personal experience about ADRs, an SVM was used to classify texts. In [48], long posts are excluded, arguing that they only contain news articles or research papers instead of focusing on the user’s experience. In their study, Hoang et al. [67] took the credibility and authenticity of users into account.

Taking all strengths and weaknesses of social media into account, we see social media as a valuable data source for our study. Furthermore, we considered the solutions for improving the quality of the results.

2.6 Implications for this research

Based on the provided background information, we identify several implications that we take into account during this study.

Because research so far has only been executed in rodents, the additional value of our study is to explore whether the effects of 5-AR inhibitors are also observable in humans. The 5-AR inhibitors finasteride and dutasteride work slightly differently and have different effects. Taking this into account, we still want to bundle both drugs as a group of 5-AR inhibitors to see what effects we find. Furthermore, previous research has mainly reported effects in psychiatric disorders, but some neurological effects were also presented. Because drug repurposing research is best established in psychiatric effects, this research also focuses on finding a relation between 5-AR inhibitors and psychiatric effects to strengthen the hypothesis.

Because the approach of finding ADRs and general effects are similar, the steps described by Tricco et al. [25] are also incorporated in our research. We elaborate on the steps in the chapter where their implementation is reported. We present an overview of the steps and in which chapter they are performed in Figure 1.1.

As Luque et al. discussed, MetaMap is a widely used NER tool [45], including applications for social media [27, 39]. Therefore, we decided to use MetaMap as a tool to extract concepts for our research. To deal with the textual disadvantages, we want to use the spellchecker CSpell [65] and incorporate word sense disambiguation. Another disadvantage of social media data is the ambiguity concerning statements about drugs and effects [39, 55, 59]. We want to take account of this ambiguity by analyzing if people take a drug and experience an effect themselves.

Chapter 3

Selection of a social media venue

3.1 Introduction

In this chapter, we describe our approach to complete the first aim, which is: Identify social media venues with substantial 5-AR inhibitor content and extract posts from participants mentioning 5-AR inhibitors in at least one posting. We started this aim by selecting a social media venue that fits best with our goal. The aim was further accomplished by the first two steps shown in Figure 1.1, which are data acquisition and text pre-processing.

3.1.1 Data acquisition

The first step described by Tricco et al. [25] was performed to gather the data. To execute data acquisition, a social media venue is needed. Various social media venues are used in health-related studies, from general social media sites like Twitter [61, 62] to health-related discussion boards like DailyStrength [21, 24, 39, 60] and MedHelp [68, 69], and even disease-specific forums [59, 70]. Web crawlers are often used in this step to gather the data from the venue. A web crawler identifies the threads in a forum and then parses through all pages of each thread, extracting the posts [52].
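The parsing half of such a crawler can be sketched with Python's standard `html.parser` module. This is a sketch under stated assumptions rather than the crawler used in this research: the markup in `SAMPLE_THREAD_PAGE` and the class names `post`, `author`, and `message` are hypothetical, and fetching pages over the network is left out on purpose.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are assumptions for this sketch
# and do not describe the structure of any real forum.
SAMPLE_THREAD_PAGE = """
<div class="post"><span class="author">user_a</span>
<p class="message">Started finasteride last month.</p></div>
<div class="post"><span class="author">user_b</span>
<p class="message">Any side effects so far?</p></div>
"""


class PostParser(HTMLParser):
    """Collect (author, message) pairs from the pages of one thread."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None   # name of the field currently being read, if any
        self._current = {}   # fields collected for the post being parsed

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "post":
            self._current = {}
        elif css_class in ("author", "message"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None
        elif tag == "div" and self._current:
            author = self._current.get("author", "").strip()
            message = self._current.get("message", "").strip()
            self.posts.append((author, message))
            self._current = {}


def parse_thread(pages):
    """Parse every page of a thread and return its posts in order."""
    parser = PostParser()
    for page in pages:
        parser.feed(page)
    return parser.posts
```

Calling `parse_thread([SAMPLE_THREAD_PAGE])` yields one `(author, message)` tuple per post; a full crawler would additionally discover thread URLs and follow pagination links.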

Volumes of acquired data vary a lot. Studies that use Twitter as their data source often mention large volumes of data. Bian et al. mention they gathered 2 billion Tweets [61], while Jimeno-Yepes et al. [71] obtained 43 million Tweets. In both studies, most of the gathered content has no relation to the goal, and their final datasets were much smaller. When posts of health-related discussion boards are extracted, datasets usually contain only thousands of posts [21, 39].

3.1.2 Text pre-processing

After the data is acquired, text pre-processing is important as it can significantly improve correctness of classification in later steps [42]. With pre-processing, the extracted data is standardized and cleaned with different NLP techniques [45]. Stemming, lemmatization, tokenization, stop word removal, and part-of-speech tagging are all mentioned as pre-processing techniques [21, 24, 39, 53, 60, 69].

Some studies mention specific pre-processing steps for social media. Jimeno-Yepes et al. [71] mention that they filtered out duplicate tweets and removed non-English tweets, which was also done by Bian et al. [61]. De-identification is the removal of identifiable information, and it protects the identity of the posting’s author [72, 73]. In a study by Mao et al. [70], a de-identification system removes personal identifiers including e-mail addresses, URLs and user names. It then replaces all names with ‘tagged’ identifiers to keep track of posts.


3.2

Methods

We received an Institutional Review Board exemption (IRB 00118104) from the University of Utah under category 4 defined by the United States Federal Regulations 45 CFR 46.101(b). Because of the size of the data, we used the computing capabilities of the Center for High Performance Computing at the University of Utah to acquire, store, process, and analyze the data. Throughout the process, we used Python [74].

3.2.1

Gathering of social media venues

Social media venues of interest were identified in two ways. First, we queried the general health forums from the review by Sarker et al. [51] for 5-AR inhibitor content. Second, the web was explored to find health forums with content related to hair loss or 5-AR inhibitor use. To identify content, both the brand names and the chemical names were used, as well as terms related to ‘hair loss’ and ‘alopecia’. Both direct and indirect searches were executed. General terms such as ‘health discussion board’ or ‘health forum’ were used to identify relevant websites. On the site itself, we searched further for terms such as ‘finasteride’, ‘hair loss’ or ‘alopecia’ to determine whether a forum has relevant content. To select social media venues that are suitable for further research, we set selection criteria. First, the majority of the posts must be in English. Furthermore, people have to talk about personal experience using a 5-AR inhibitor. Also, details are needed about the author of a post and the time at which it was posted. Furthermore, the content has to be relevant, and recent use of the venue is preferred. Lastly, a social media venue with more content and more members is favored over a smaller one, as the chance to identify content related to psychiatric conditions is then larger.

3.2.2

Data extraction and pre-processing

After a social media venue was selected, the textual data was extracted and pre-processed. To this end, the complete venue was downloaded to obtain links to all web pages, after which all textual content was extracted with the Python package Beautiful Soup [75]. An overview of the extracted information is presented in Table 3.1. With Beautiful Soup, quoted texts, pictures, and likes or dislikes were removed from the posts. Using regular expressions, we identified and removed URLs and email addresses from the posts. To solve encoding issues in the posts, the Python package ftfy [76] was used to convert non-ASCII characters to their ASCII equivalents. The non-ASCII characters that remained were removed manually, as MetaMap cannot process them.
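The cleaning of a single post text can be sketched as follows. This is a minimal illustration, not the code used in this study: the two regular expressions are assumed patterns, and the ASCII folding with the standard library module unicodedata is a simple stand-in for what was done with ftfy and manual removal.

```python
import re
import unicodedata

# Assumed patterns; the expressions used in the study are not given in the text.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def clean_post(text):
    """Remove e-mail addresses and URLs, fold to ASCII, and collapse whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" then drops what has no ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return " ".join(text.split())

print(clean_post("Mail me at joe.doe@example.com or see https://example.org/x?y=1 café"))
```

Characters without an ASCII decomposition are silently dropped by this stand-in, so a post that consists only of such characters becomes empty.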

All duplicate posts were removed, followed by posts dated after February 14th, 2019, which was the date of downloading the venue. Also, all posts from Guest members were removed, and all member names were replaced with a unique identifier, both to protect the privacy of the members and to make further processing easier. Finally, posts without content, such as posts with missing values or containing only white space, were removed. Table 3.2 presents the pre-processing actions performed and how they influenced the number of posts and members. To deal with the misspellings in the posts, CSpell was used [65]. This spell checker was developed for answering health-related questions and is thus related to the content of this research, which makes it more likely to provide good results than general spell checkers.
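The removal steps described above can be sketched as a small filter pipeline that also records how many posts each step removes. The field names, the definition of a duplicate (same member, timestamp and text), the literal member name ‘Guest’ and the end-of-day cutoff are assumptions made for this illustration; the example posts are invented.

```python
from datetime import datetime

CUTOFF = datetime(2019, 2, 14, 23, 59, 59)  # end of the download date named in the text

def filter_posts(posts):
    """Apply the four removal steps in order; return (kept posts, removed count per step)."""
    counts = {}

    def keep(step, rows, predicate):
        kept = [r for r in rows if predicate(r)]
        counts[step] = len(rows) - len(kept)
        return kept

    seen = set()

    def first_time(p):
        # Assumption: a duplicate has the same member, timestamp and text.
        key = (p["member"], p["timestamp"], p["text"])
        if key in seen:
            return False
        seen.add(key)
        return True

    rows = keep("duplicates", posts, first_time)
    rows = keep("after_cutoff", rows, lambda p: p["timestamp"] <= CUTOFF)
    rows = keep("guest", rows, lambda p: p["member"].lower() != "guest")
    rows = keep("empty", rows, lambda p: bool(p["text"] and p["text"].strip()))
    return rows, counts

def pseudonymize(rows):
    """Replace each member name with a stable identifier such as 'member-1'."""
    ids = {}
    return [dict(r, member=ids.setdefault(r["member"], f"member-{len(ids) + 1}")) for r in rows]

# Invented example posts.
posts = [
    {"member": "alice", "timestamp": datetime(2018, 5, 1, 10, 0), "text": "On finasteride for a year."},
    {"member": "alice", "timestamp": datetime(2018, 5, 1, 10, 0), "text": "On finasteride for a year."},
    {"member": "bob", "timestamp": datetime(2019, 3, 1), "text": "Late post"},
    {"member": "Guest", "timestamp": datetime(2017, 1, 1), "text": "hi"},
    {"member": "carol", "timestamp": datetime(2018, 6, 1), "text": "   "},
    {"member": "carol", "timestamp": datetime(2018, 6, 2), "text": "Started dutasteride."},
]
kept, counts = filter_posts(posts)
anonymous = pseudonymize(kept)
print(counts, [p["member"] for p in anonymous])
```

Keeping a count per step makes it straightforward to report the effect of each action, as is done in Table 3.2.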

CHAPTER 3. SELECTION OF A SOCIAL MEDIA VENUE 10

Using regular expressions, posts mentioning ‘finasteride’, ‘propecia’, ‘proscar’, ‘dutasteride’, or ‘avodart’ were identified and occurrences of the drug names were counted. To establish whether CSpell improved the spelling of the drugs, the names of the drugs were counted both before and after CSpell was applied. Each member that mentioned at least one of the five drugs was identified as a potential user of the drug. Further analysis compares the group of members who mention the drug against the group of members who never mention the drug.
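Counting posts per drug name and collecting the members who mention a drug could look like the sketch below. The pattern (case-insensitive, whole words only) and the function name are assumptions, since the regular expressions used in this study are not given in the text; the example posts are invented.

```python
import re
from collections import Counter

DRUGS = ("finasteride", "propecia", "proscar", "dutasteride", "avodart")
# Assumed pattern: case-insensitive match on whole words.
DRUG_RE = re.compile(r"\b(" + "|".join(DRUGS) + r")\b", re.IGNORECASE)

def count_drug_posts(posts):
    """posts: iterable of (member_id, text).

    Returns (posts per drug, posts mentioning any drug, members mentioning a drug).
    """
    per_drug, any_drug, members = Counter(), 0, set()
    for member, text in posts:
        found = {m.lower() for m in DRUG_RE.findall(text)}
        if found:
            any_drug += 1
            members.add(member)
            per_drug.update(found)  # each drug is counted once per post
    return per_drug, any_drug, members

# Invented example posts.
example = [
    ("m1", "Finasteride and Propecia are the same; finasteride works for me."),
    ("m2", "I use Avodart."),
    ("m3", "No drug names here."),
    ("m1", "propecia again"),
]
per_drug, any_drug, members = count_drug_posts(example)
print(per_drug, any_drug, sorted(members))
```

Because one post can name several drugs, the per-drug counts can add up to more than the number of posts that mention any drug.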

Table 3.1: Extracted information for each post.

Extracted information    Type
Post                     String
PostID                   String
Member name              String
Timestamp                DateTime

3.3

Results

3.3.1

Selecting a social media venue

Initially, 22 websites were identified through the search and from the paper by Sarker et al. [51] (see Figure 3.1). Five of them were removed due to inaccessibility or language constraints. When the websites were further inspected, six websites were discarded due to missing info about the writer of a message. Missing information makes it impossible to connect a writer to multiple messages, and thus posts from the same writer cannot be linked to each other.

A last eligibility check on the websites was performed based on content, which excluded a total of eight websites. Two websites had irrelevant content, Twitter and PropeciaHelp [77]. Twitter content contains many retweets, marketing of 5-AR inhibitors, news articles, and Tweets not in English. Although PropeciaHelp is dedicated to people who have taken Propecia, it is only about ADRs experienced and therefore too narrow to be of relevance. One website was excluded because it only has seven reviews about 5-AR inhibitors. Five websites have a section about hair loss, but the 5-AR related content in there is hard to find, which is why we excluded them. A complete overview of the social media venues explored here is presented in Table A.1 of the Appendix.

This leaves us with two eligible venues, HairLossTalk.com [29] and baldtruthtalk [30]. Both are forums about hair loss. They have sections related to hair loss treatment and hair transplantation, and people can share their own stories. HairLossTalk.com has over a million posts and 66,976 members, while baldtruthtalk has about 200,000 posts and 44,796 members. Due to the larger number of posts and members of HairLossTalk.com, we decided to use this forum for the remaining part of this study. However, baldtruthtalk should be considered if the research is replicated with another source.

3.3.2

Text pre-processing performance

After selecting the venue, pre-processing steps were performed. Table 3.2 presents how many posts were removed and how this changed the number of members. A first remark is that, although we previously mentioned that the forum has almost 67,000 members, only 28,902 have posted something and can be considered active members. Table 3.2 shows that the removal of posts after February 14th had the most influence on the number of members. Meanwhile, the removal of posts by Guest members accounted for the most deleted posts. It is also remarkable that more than 9,000 posts did not have content and therefore were removed, although this might be the result of the text pre-processing. In the end, 99.3% of the members and 96.1% of the posts were included in the final dataset.


Figure 3.1: Flow diagram depicting the selection process for relevant data sources.

Table 3.2: Text pre-processing steps and difference in number of posts and members.

Pre-processing step                     Number of posts   Posts removed   Number of members   Members removed
Extracted posts                         1,067,884         n.a.            28,902              n.a.
Removed duplicate posts                 1,067,223         661             28,902              0
Removed posts after 14 February 2019    1,064,552         2,671           28,723              179
Removed ‘guest’ posts                   1,035,089         29,463          28,722              1
Removed posts without content           1,025,945         9,144           28,696              26


In Table 3.3, we present the number of posts mentioning each drug before and after applying CSpell to the data. For all drug names, new instances are found, and thus the spell checker corrected misspelled words to the correct drug names. The largest percentage increase was for Dutasteride, while Proscar has the smallest increase. The overall increase in 5-AR inhibitor posts is 1.00%.

Table 3.3: Number of drugs mentioned in posts before and after applying CSpell.

Drug name     Before (posts)   After (posts)   Increase
Finasteride   121,914          123,184         1.04%
Propecia      54,243           54,845          1.11%
Proscar       14,516           14,579          0.43%
Dutasteride   27,422           28,050          2.29%
Avodart       5,270            5,380           2.09%
Total         183,421*         185,255*        1.00%

*Totals are not addable as multiple 5-AR inhibitors can be mentioned in the same post.
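The Increase column of Table 3.3 is the relative increase in the number of posts, (after - before) / before. The short sketch below recomputes it from the counts in the table; the variable and function names are chosen for illustration.

```python
# Post counts (before CSpell, after CSpell) copied from Table 3.3.
counts = {
    "Finasteride": (121_914, 123_184),
    "Propecia": (54_243, 54_845),
    "Proscar": (14_516, 14_579),
    "Dutasteride": (27_422, 28_050),
    "Avodart": (5_270, 5_380),
    "Total": (183_421, 185_255),
}

def pct_increase(before, after):
    """Relative increase in percent."""
    return (after - before) / before * 100

for name, (before, after) in counts.items():
    print(f"{name:<12} {pct_increase(before, after):.2f}%")
```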

3.3.3

5-alpha-reductase inhibitor content

After the text pre-processing steps, properties of 5-AR inhibitor content on the forum were examined, to assess if the content is of value.

First, we analyzed how many members mention a 5-AR inhibitor at least once. Out of all 28,696 members, 15,869 mention a 5-AR inhibitor at least once, which is 55.3%. A Mann-Whitney U test was performed to compare the number of posts for the group of members that mention a 5-AR inhibitor with the group that does not mention the drug. The median [quartiles] of the number of posts per member for the mentioning group and the non-mentioning group are 8 [3, 28] and 2 [1, 4], respectively. The number of posts is significantly higher for the group that mentions a 5-AR inhibitor (U=43,357,129; p<0.001).
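For readers unfamiliar with the test, the sketch below shows how a Mann-Whitney U statistic and a two-sided p-value can be computed with the standard library, using average ranks for ties and the normal approximation without continuity correction. It illustrates the technique only: the text does not state which implementation or options were used in this study, and the two samples are invented.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y):
    """Return (U statistic of sample x, two-sided p-value by normal approximation)."""
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(combined)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Variance with tie correction; it is zero if all values are identical.
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / sqrt(var_u)
    return u_x, erfc(abs(z) / sqrt(2))

# Invented post counts per member for two small groups.
mentioning = [8, 3, 28, 12, 40]
non_mentioning = [2, 1, 4, 1, 2]
print(mann_whitney_u(mentioning, non_mentioning))
```

The normal approximation is only reasonable for larger samples, which holds for the group sizes in this chapter but not for the toy data above.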

A total of 185,255 posts (see Table 3.3) contain a mention of a 5-AR inhibitor after applying CSpell. Finasteride is the most represented 5-AR inhibitor name, occurring in about two thirds of the posts that mention a 5-AR inhibitor, while Propecia is represented in almost 30% of these posts. Different 5-AR inhibitor names can be mentioned in the same post, which is why the total number of posts that contain a 5-AR inhibitor is lower than the sum of the counts per drug name.

To assess if members are active for sufficient time, we checked the time between the first and last post. The active period is the time between these posts. On average, a member that mentions a 5-AR inhibitor is active for almost a year (M=364.44 days, SD=680.74 days).
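The active period can be computed per member from the post timestamps, as in the sketch below. The member identifiers and dates are invented, and the use of the sample standard deviation is an assumption, as the text does not state which variant was reported.

```python
from datetime import datetime
from statistics import mean, stdev

def active_days(timestamps):
    """Days between a member's first and last post."""
    return (max(timestamps) - min(timestamps)).total_seconds() / 86400

# Invented example members (not data from the forum).
post_times = {
    "member-1": [datetime(2018, 1, 1), datetime(2018, 1, 11), datetime(2018, 1, 5)],
    "member-2": [datetime(2017, 6, 1), datetime(2018, 6, 1)],
    "member-3": [datetime(2018, 3, 3)],  # a single post gives an active period of 0 days
}
periods = {member: active_days(ts) for member, ts in post_times.items()}
values = list(periods.values())
print(periods, mean(values), stdev(values))
```

Members with a single post contribute an active period of zero days, which pulls the mean down and widens the spread.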

3.4

Discussion

Our aim was to ‘Identify social media venues with substantial 5-AR inhibitor content and extract posts from participants mentioning 5-AR inhibitors in at least one posting’. We achieved this by finding a web forum containing over a million postings, and extracting the posts. Posts were pre-processed and put into the correct format. Moreover, we have shown that a substantial number of posts (over 180,000) mention a 5-AR inhibitor and that over half of the members mention a 5-AR inhibitor at least once.

The website HairLossTalk.com was chosen as a data source. We identified 20 other social media venues, but 19 of them were excluded as they were no longer accessible (4), not in English (1), missed important information (6), had irrelevant content (2), or had too little (1) or hard-to-find content (5). This left two similar sources eligible, as both were web forums dedicated to discussing hair loss. Of these two, HairLossTalk.com was chosen because it had more content and more members.

One of the sources that was excluded because of irrelevant content is Twitter. This surprised us, as several other studies used Twitter as a source to extract ADRs [24, 39, 61] and obtained significant results. Moreover, several studies used the website DailyStrength [21, 24, 39, 60], which made us hopeful about this website too. However, the website was not working and was therefore excluded.

About 600 posts were removed because they were duplicates and about 9,000 posts were removed because they lacked content. It is unclear why there were duplicate posts. Possible causes are posts that were submitted multiple times shortly after each other, or HTML pages that were downloaded multiple times. There are several explanations for the posts lacking content. For one, members can edit their posts and thereby remove all content. Another explanation is that some posts consisted only of quotes and/or pictures, which were automatically removed with Beautiful Soup. Also, the removal of URLs and e-mail addresses could have emptied posts. Taken together, these removals do not affect the quality of the data, as the removed posts would not have added value for this research.

The 5-AR content findings suggest that people who mention a 5-AR inhibitor post significantly more. This was considered a good sign, as it makes it more likely that an indication of a psychiatric condition can be found. Furthermore, people who mention a 5-AR inhibitor are active for around one year on average.

3.4.1

Strengths & Limitations

Several earlier studies on finding ADRs reported that misspellings influenced the normalization of terms [27, 39, 53, 61]. In this study, CSpell has shown its usefulness for detecting and correcting misspelled drug names. The increase in posts with 5-AR inhibitors was 1.00%, which is small but still helpful. Although we did not take a further look into the CSpell corrections, and it is possible that CSpell also changed correctly spelled words, we see its use as a strength of this research.

Another strength is the volume of our data. When compared to related work, the size of our data is on the larger side, which raises the chance to find members of interest. The only studies that have a larger volume of data are studies where Twitter is used [61, 71]. The main difference is that in the Twitter studies, the density of relevant content is low. Meanwhile, in studies with smaller data sources, all posts are of interest, thus leading to a high density of relevant content. As less than a fifth of our posts contain 5-AR inhibitors, the density of relevant content is on the low side. This is also a limitation as it is more time-consuming to distinguish relevant content from all the data.

We pointed out that over half (15,869) of the members (28,696) of the forum mention a 5-AR inhibitor, which is a promising sign for the rest of our research. However, from this proportion, we cannot know how many people really take a 5-AR inhibitor. If the percentage of members that take a 5-AR inhibitor is low, this is a limitation of the study, mainly because our final goal is to identify members that both take a 5-AR inhibitor and have a psychiatric condition. With fewer members that take a 5-AR inhibitor, there will be even fewer members who meet both conditions.


3.4.2

Future work

One of the social media sources considered here is the website AskAPatient [78], which was excluded in our research because of missing user information. Nevertheless, the website seems very interesting for future research. The reason for this is the detailed information about why people take a drug, how long they have been taking a drug, and what side effects they experience. Future research could be conducted with data from this website to find what effects of 5-AR inhibitors are represented there.

There is also some future work that can be conducted to overcome the limitations mentioned before. First, we mentioned that the density of relevant posts is low and that it is time-consuming to distinguish relevant data from all posts. The second limitation is that we cannot infer which members take a 5-AR inhibitor and which mention it for other reasons. Sarker & Gonzalez [39] provided a solution which we could apply to address both limitations. In their study, they first manually annotated a small part of the posts for the presence or absence of ADRs. Then, they applied a machine learning classification approach to distinguish ADR and non-ADR posts, which was successful. We could apply a similar approach in our research, which could start with annotating posts to differentiate posts where people mention taking a drug from other posts. This could be followed by a classification task to single out posts where people mention that they take a 5-AR inhibitor. It can then be inferred which members are relevant for the remaining part of our research. This would lead to a higher density of relevant content, and we would know that the selected members take a 5-AR inhibitor.


Extracting psychiatric condition concepts

4.1

Introduction

This chapter describes the methods and accomplishments related to the second aim of this thesis: Use a concept extraction tool to extract concepts of the postings, and so obtain posts with mentions of psychiatric conditions. In Chapter 3, social media posts from the website HairLossTalk.com [29] were gathered and pre-processed. In this chapter, we continue with the third step described by Tricco et al. [25], which is information extraction. This step is executed with the medical concept extraction tool MetaMap. Because it is unknown if every extracted concept is correct and relevant, we also evaluate the extracted concepts. The aim is finalized by the selection of a subset of concepts that are related to psychiatric conditions.

As this aim consists of the use of an extraction tool and the evaluation and selection of extracted concepts, this part of our research is more a technical execution than an innovative study.

4.1.1

Information extraction

Information extraction comprises identification and normalization of concepts [25]. It is an essential step for many other natural language processing tasks, for example, classification and text mining [79]. Concept identification covers finding concepts in the text, while concept normalization is the mapping of the identified concept to a vocabulary [63, 64]. Over 60 biomedical vocabularies are incorporated into the Unified Medical Language System (UMLS), and therefore this system is often used for medical information extraction purposes [80].

Most information extraction approaches are dictionary/lexicon-based. Other information extraction approaches are based on machine learning or on a combination of a dictionary and a machine learning approach. An example of a lexicon-based approach is the study by Leaman et al. [21], which created a lexicon from four medical vocabularies to find ADRs in user comments. Furthermore, Bian et al. [61] use MetaMap to generate a list of UMLS concept codes that are mapped from their Twitter messages. MetaMap is the most popular medical knowledge-based extraction system, as it maps concepts to the UMLS Metathesaurus [79].

4.1.2

MetaMap

MetaMap is a widely used NER tool in biomedical applications [45]. It integrates text mining tasks and uses a wide range of NLP techniques. Concepts in texts are extracted with both lexical and syntactic analysis, in a few sequential steps. MetaMap first determines sentence boundaries, parses the text into tokens, applies part-of-speech tagging and identifies acronyms and abbreviations. Then a lexical lookup is executed, and syntactic analysis is applied to identify phrases. This is followed by processing each phrase with variation generation, candidate identification, and mapping construction to produce and evaluate the best match for the phrase. An optional last step consists of word sense disambiguation, where the best match is chosen based on semantic consistency between the mapping and its surrounding text [81, 82].

MetaMap was originally developed for mapping user queries, abstracts and titles from MEDLINE citations to the UMLS Metathesaurus [46]. Tari et al. [83] used this approach to find drug names for discovering drug-drug interactions in abstracts. MetaMap is not limited to this, and other uses are to process radiology and pathology reports to help to detect diseases in admission data [84], and to identify indications and adverse drug events from patient medication reviews of an online forum [27].

Depending on the desired task, MetaMap can be set to different configurations for the best extraction of concepts needed for each application. Configuration options are about the data source, behavioral functioning, and output format [85]. Because all concepts in MetaMap are derived from the biomedical vocabularies in the UMLS, the user can use one or more vocabularies as a data source. Also, restrictions can be set to specific semantic types.

Behavior configuration options include NegEx, using word-sense disambiguation and setting thresholds for matching. The output can be formatted in diverse formats, including XML output and fielded MetaMap Indexing (MMI) output. The MMI output includes a unique identifier, the UMLS concept name, the semantic type, trigger information of the mapping, the textual location of the identified concept and the Concept Unique Identifier (CUI). CUIs were introduced to have one coherent concept with a unique identifier for terminology that has different names among UMLS vocabularies but has essentially the same meaning [39].

4.2

Methods

In this section, we describe which UMLS sources we used, how MetaMap was configured, and how concepts were selected. This is all related to the information extraction process as described by Tricco et al. [25].

4.2.1

UMLS Source

We decided to use only SNOMED-CT (2018AB US version) as a data source, which is the largest terminology in the UMLS [86]. SNOMED-CT is a semi-hierarchical resource, going from general terms to specific concepts. Because concepts have relations to multiple other concepts, they can occur multiple times in the hierarchy, and they can also have multiple parents. We chose SNOMED-CT because of its size, its hierarchical structure, and the inclusion of categories related to this research. Concepts from hierarchical SNOMED-CT structures were extracted with the Python module PyMedTermino [87].

4.2.2

MetaMap configuration

Several MetaMap configurations were explored. For this study, we chose the MMI output format with the postID (see Table 3.1) as unique identifier and SNOMED-CT as the only UMLS vocabulary included. Furthermore, word sense disambiguation was incorporated, a pruning threshold was set to 30 (default is no maximum), and we limited the length of composite phrases to 0 (default is 4). More details about each setting are provided in Table B.1 in the Appendix.

After MetaMap processed all posts, the extracted mappings are either of the MMI type or the ‘Acronyms and Abbreviations’ (AA) type. From all extracted mappings, the AA type mappings were removed. This was necessary as these mappings do not contain a CUI and cannot be used for further analysis. Furthermore, not all output that is generated as MMI output is of relevance for this research. So only the postID, the UMLS concept name, the CUI, the semantic type and the trigger information were preserved.
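The filtering described above can be sketched as follows, assuming the pipe-delimited fielded MMI layout in which the second field holds the record type and the fourth to seventh fields hold the concept name, CUI, semantic types and trigger information. The function name is an assumption, and the two sample lines are invented apart from the values that appear in Table 4.1.

```python
def parse_mmi(lines):
    """Keep MMI records only and return the five fields retained in this study."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 7 or fields[1] != "MMI":
            continue  # skips AA (acronym/abbreviation) records, which carry no CUI
        records.append({
            "post_id": fields[0],
            "concept": fields[3],
            "cui": fields[4],
            "semantic_types": fields[5].strip("[]").split(","),
            "triggers": fields[6],
        })
    return records

# Illustrative lines; score, positions and the AA record are made up.
sample = [
    'post-1384024|MMI|3.61|Finasteride|C0060389|[horm,orch,phsu]|'
    '["Finasteride"-tx-2-"finasteride"-noun-0,"Finasteride"-tx-1-"finasteride"-noun-0]|TX|41/11;105/11|',
    'post-1384024|AA|DECA|example expansion|1|4|3|17|0:4',
]
records = parse_mmi(sample)
print(records)
```

Splitting on the pipe character assumes that no field value contains a pipe itself, which should be checked on real output.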

In Table 4.1, the CUIs extracted from the following input are shown: “post-1384024 - will not make a difference if you are on finasteride. Oh, and not being able to take DECA with finasteride is a myth.” The post is identified by the postID. If a word occurs multiple times in the same post, it leads to one mapping triggered by multiple words (see the Trigger Information of the Finasteride concept). The Finasteride concept also shows that a concept can belong to multiple semantic types. A negated mapping is shown for the UMLS concept ‘Able (finding)’, where the Trigger Information ends with a one instead of a zero.

Table 4.1: An example of output from extracted CUIs.

postID         UMLS concept name      CUI        Semantic Type      Trigger Information
post-1384024   Finasteride            C0060389   [horm,orch,phsu]   [“Finasteride”-tx-2-“finasteride”-noun-0,“Finasteride”-tx-1-“finasteride”-noun-0]
post-1384024   Togo                   C0040363   [geoa]             [“Togo”-tx-2-“to”-adv-0]
post-1384024   Able (finding)         C1299581   [fndg]             [“Able”-tx-2-“able”-adj-1]
post-1384024   Take                   C1515187   [hlca]             [“Take”-tx-2-“take”-verb-0]
post-1384024   Differential quality   C0443199   [qlco]             [“Differential”-tx-1-“difference”-noun-0]
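The Trigger Information entries in Table 4.1 can be split into their parts with a regular expression, which makes the triggering word and the negation flag available for later checks. The sketch assumes straight double quotes, as in raw MetaMap output (the table shows typeset quotes), and the field names in the result are chosen for illustration.

```python
import re

# One trigger: "Concept"-location-utterance number-"matched text"-part of speech-negation flag
TRIGGER_RE = re.compile(r'"([^"]*)"-(\w+)-(\d+)-"([^"]*)"-(\w*)-([01])')

def parse_triggers(trigger_field):
    """Split a Trigger Information field into one dict per trigger."""
    return [
        {
            "concept": concept,
            "location": location,
            "utterance": int(number),
            "text": text,
            "pos": pos,
            "negated": flag == "1",
        }
        for concept, location, number, text, pos, flag in TRIGGER_RE.findall(trigger_field)
    ]

able = parse_triggers('["Able"-tx-2-"able"-adj-1]')
finasteride = parse_triggers(
    '["Finasteride"-tx-2-"finasteride"-noun-0,"Finasteride"-tx-1-"finasteride"-noun-0]'
)
print(able[0]["negated"], [t["utterance"] for t in finasteride])
```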

4.2.3

Concept selection strategy

To find if there is a relation between 5-AR inhibitors and the effects mentioned in Chapter 2, related concepts are needed. Several studies point out that 5-AR inhibitors have neuropsychiatric, antipsychotic, and anti-depressive-like abilities [12, 17, 36, 37]. Although both psychiatric and neurologic effects are mentioned in recent studies, this research just focuses on psychiatric effects.

No ontology or list of CUIs related to our goal is known to exist. Even the seemingly related UMLS semantic type ‘Mental or Behavioral Dysfunction’ lacks relevant concepts [86]. Therefore, a selection of relevant CUIs had to be determined by hand for this purpose. Different groups of concepts are useful to indicate a psychiatric condition. We defined three groups that could be used as indicators for a psychiatric condition, which are disorders, findings, and medication. We searched the SNOMED-CT hierarchy for one umbrella-concept for each of the indicators. We chose the highest relevant level in the hierarchy, as this would minimize the exclusion of relevant concepts. For each of the umbrella-concepts identified, a list was constructed containing the umbrella-concept and its 50 most frequently occurring descendants. In Table 4.2, the three concept names and CUIs are shown. Because of the length of the term ‘Mental state, behavior and/or psychosocial function finding’, ‘Mental finding’ will be used throughout the rest of this chapter to refer to this umbrella-concept.
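Selecting the most frequent descendants of an umbrella-concept amounts to a traversal of the hierarchy followed by a frequency ranking. The sketch below uses a toy hierarchy with placeholder identifiers instead of real CUIs, and it does not use PyMedTermino; the function names are assumptions.

```python
from collections import Counter

def descendants(children, root):
    """All concepts reachable from root; a concept with several parents is visited once."""
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def top_descendants(children, root, cui_counts, n=50):
    """The n most frequently extracted descendants of root, with their counts."""
    wanted = descendants(children, root)
    return Counter({c: k for c, k in cui_counts.items() if c in wanted}).most_common(n)

# Toy hierarchy: C has two parents (A and B); X is not below ROOT.
children = {"ROOT": ["A", "B"], "A": ["C"], "B": ["C"]}
cui_counts = {"A": 5, "C": 12, "X": 99}
print(top_descendants(children, "ROOT", cui_counts, n=2))
```

Tracking visited concepts matters because SNOMED-CT concepts can have multiple parents and would otherwise be counted more than once.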


Table 4.2: The three selected umbrella-concepts.

SNOMED-CT concept                                              CUI
Mental disorder                                                C0004936
Psychoactive substance                                         C0682880
Mental state, behavior and/or psychosocial function finding    C1272788

The descendants of the umbrella-concepts were evaluated on correctness of the trigger, relevance of semantic type, relevance towards psychiatric conditions, personal experiences, and number of occurrences. Based on the criteria, several CUIs were excluded. CUIs that made the final selection are eligible for further examination in Chapter 5.

4.3

Results

In total, MetaMap extracted 15,118,335 mapping instances. Of all instances, 5,702 were of the type AA and thus subsequently removed. MetaMap extracted 23,521 unique CUIs, of 222 different semantic type combinations (as some CUIs are related to multiple semantic types). A total of 987,711 posts contained at least one CUI, which is 96.3% of the posts.

4.3.1

Most occurring CUIs

In Figure 4.1 the top 25 extracted CUIs and their corresponding concept names are presented. These 25 concepts account for almost a quarter of the total extracted instances (24.6%). The legend of Figure 4.1 provides abbreviations of semantic types, of which the full names can be found in Table B.2 of the Appendix. Most semantic types are represented only once, with the exception of Geographic Area [geoa], Quantitative Concept [qnco], and Temporal Concept [tmco]. The figure also shows that some concepts belong to multiple semantic types. Finasteride relates to three different semantic types, and Tryptophanase, Minoxidil and Blood group antibody I each occur in two semantic types.

Some of the CUIs in the top 25 are related to hair loss and treatment options (Hair, Alopecia, Finasteride and Minoxidil), others are often occurring words, but some of the CUIs seem odd at first glance. For example, the countries Togo, Somalia, and Guyana would not be expected to be presented often in this forum. In Table C.1 of the appendix, the two most common triggers for each concept are shown. As can be seen from the table, Togo is mostly mapped from ‘to’, Somalia is coming from ‘so’, and Guyana is triggered by ‘guy’ or ‘guys’, which all are words often represented in posts. Furthermore, the most occurring UMLS concept ‘Iodides’ and the ‘Blood group antibody I’ concept are both triggered by the word ‘I’ which is naturally often occurring in forums. ‘Tryptophanase’ is also triggered by the word ‘to’. Different CUIs for the same word are the result of word sense disambiguation.

These results indicate that the most occurring CUIs are not related to psychiatric conditions, which underlines the need for a more specific set of concepts. They also highlight that trigger information needs to be checked before concepts are used. This was done for the three concepts of Table 4.2. For each of them, the top 50 most common descendants were ranked in bar charts in which semantic types are also visualised. Furthermore, we assessed the two most common triggers of each concept. The trigger information is shown in Appendix Tables C.2-C.4. Based on the bar chart and the trigger information, concepts were either excluded or included in a final list of concepts. The reason for exclusion is also shown in the respective appendix table. In the next three sections about the three umbrella-concepts, we also elaborate on this.
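Finding the two most common triggers of each concept is a grouped frequency count, which could be sketched as follows. The input pairs of concept name and triggering word are invented for illustration (only the words themselves follow the examples discussed above), and the function name is an assumption.

```python
from collections import Counter, defaultdict

def top_triggers(pairs, n=2):
    """pairs: iterable of (concept name, triggering word).

    Returns the n most common triggering words per concept, most frequent first.
    """
    by_concept = defaultdict(Counter)
    for concept, word in pairs:
        by_concept[concept][word.lower()] += 1
    return {c: [w for w, _ in counter.most_common(n)] for c, counter in by_concept.items()}

# Invented counts for two of the concepts discussed in the text.
pairs = (
    [("Guyana", "guys")] * 3
    + [("Guyana", "guy")] * 2
    + [("Guyana", "Guyana")]
    + [("Togo", "to"), ("Togo", "To")]
)
print(top_triggers(pairs))
```

A list like this makes mappings from common words easy to spot before a concept is accepted for further analysis.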

Figure 4.1: Top 25 extracted UMLS concepts*.

* Several concepts are mapped incorrectly by MetaMap. Iodides and Blood Group antibody I are mapped for the word ‘I’, while Togo and Tryptophanase are triggered by the word ‘to’. Somalia and Guyana are respectively triggered by the words ‘so’ and ‘guy’.

4.3.2

Mental disorder

In Figure 4.2 the descendant concepts of ‘Mental Disorder’ are presented. From the figure, it is apparent that ‘Mental or Behavioral Dysfunction’ [mobd] is by far the most represented semantic type for this umbrella-concept and that ‘Mild Mental Retardation’ is the most represented descendant concept with 1,359 instances.

Based on the criteria mentioned before and the trigger information from Appendix Table C.2, several concepts were excluded. A total of nine concepts were excluded because the trigger seems incorrect. Most often, these are common words mapped to a more specific UMLS concept. One incorrect concept was mapped from an unrelated abbreviation.

Furthermore, seven concepts were excluded based on the trigger information, as it is unlikely that the concept describes a personal experience. For example, the ‘Mild Mental Retardation’ concept is triggered most often by the words ‘moron’ and ‘morons’. It is unlikely that a person would call himself a moron, so a personal experience is most likely lacking, and therefore these concepts were excluded.

Additionally, five concepts were eliminated because they lack relevance towards a psychiatric condition. An example of these concepts is ‘Personality Change’. Finally, the concept Trichotillomania was removed from the list because it is expected that the triggers are related to the forum content (hair loss) and not to a psychiatric condition. In total, 22 concepts were excluded, which leaves 28 concepts available for further examination. An overview of excluded concepts is shown in Table 4.3.


Figure 4.2: Top 50 most occurring descendant concepts of Mental Disorder.

Table 4.3: Excluded Mental Disorder descendant concepts.

Reason for exclusion            Concept
Incorrect trigger               Confusion; Binge eating disorder; Stereotypic Movement Disorder; Physical and emotional exhaustion state; Phobic anxiety disorder; Pica Disease; Phobia, Social; Coffin-Siris syndrome; Hallucinogen Persisting Perception Disorder
Unlikely personal experience    Mild Mental Retardation; Pedophilia; Moderate mental retardation (I.Q. 35-49); Alcoholic Intoxication, Chronic; Profound Mental Retardation; Abuse of steroids; Alcohol abuse
Lacking psychiatric relevance   Fetishism (Psychiatric); Identity Problem; Flashing; Personality change; Dyslexia
Hair-related                    Trichotillomania

4.3.3

Psychoactive substance

As can be seen in Figure 4.3, Cocaine and Heroin are the most commonly occurring descendants of ‘Psychoactive substance’, with 3,258 and 1,882 instances, respectively. Together with four other concepts, they are related to the ‘Hazardous or Poisonous Substance’ [hops] semantic type, which represents recreational drugs. Furthermore, the ‘Food’ [food] semantic type contains only alcoholic beverages as concepts. Because these two semantic types lack relevance to psychiatric conditions, 22 concepts were excluded from further examination. Moreover, nine other concepts were excluded because they are related to cannabis or alcohol, or because they are not known as a treatment for a psychiatric condition.

In appendix Table C.3, the trigger information is shown. Overall, the triggers seem accurate, with only the concept allobarbital being doubtful. The concept is triggered by the word ‘dial’. As this is a common everyday word, the mapping is probably incorrect, and therefore this concept was excluded. Table 4.4 provides an overview of all excluded concepts.
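The triggers in this study were reviewed manually. Purely as an illustration, the same kind of check could be pre-screened automatically by flagging concepts whose trigger is an everyday word. In the sketch below, the stoplist `COMMON_WORDS` and the function `doubtful_mappings` are hypothetical. The pairs for allobarbital, Diazepam and Methylphenidate Hydrochloride are the ones reported in this chapter, and the Sertraline pair is an assumed example.

```python
# Hypothetical stoplist of everyday words. A real one would come from a
# word-frequency list rather than being written by hand.
COMMON_WORDS = {"dial", "can", "support", "water"}

# (concept, trigger word) pairs, as found in MetaMap trigger information.
triggers = [
    ("allobarbital", "dial"),
    ("Diazepam", "Valium"),
    ("Methylphenidate Hydrochloride", "Ritalin"),
    ("Sertraline", "sertraline"),  # assumed example, not from the text
]


def doubtful_mappings(triggers, common_words):
    """Return concepts whose trigger word is an everyday word."""
    return [
        concept
        for concept, trigger in triggers
        if trigger.lower() in common_words
    ]


flagged = doubtful_mappings(triggers, COMMON_WORDS)
print(flagged)
```

A flag of this kind only nominates a concept for manual inspection. It does not show that the mapping is wrong.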

In total, 18 of the descendants from the ‘Psychoactive substance’ concept are included for further examination. Interestingly, all concepts have at least the ‘Pharmacologic Substance’ [phsu] semantic type.

Figure 4.3: Top 50 most occurring descendant concepts of Psychoactive Substance.

Table 4.4: Excluded Psychoactive substance descendant concepts.

Reason for exclusion | Concept
Has [food] or [hops] semantic type. | Cocaine; Heroin; Beer; distilled alcoholic beverage; Cider; Methamphetamine; Rum; Vodka; Wine; Cocktail; Red wine; Nicotine; Whisky; Gin; Home brewed beer; Ecstacy - drug; Stout; Alcoholic Beverages; Hallucinogens; Tequila; Bourbon; Lager
Not known as psychiatric condition treatment. | Marijuana leaf; Tetrahydrocannabinol; Morphine; tenocyclidine; Cannabis substance; Cannabidiol; Codeine; Methadone; Ingestible alcohol
Incorrect trigger | allobarbital

4.3.4 Mental finding

Figure 4.4 shows instances from the umbrella-concept ‘Mental Finding’. In contrast to the two classes discussed before, the number of instances for each of the descendants is higher. We also see that various semantic types are represented within the top 50 concepts, with ‘Finding’ [fndg] and ‘Mental Process’ [menp] standing out. Even without looking at the trigger information, it is clear that these descendant concepts are common words, often related to emotions, which are not indicators of having a psychiatric condition. The only interesting information they provide is about the general well-being of a member, which is not relevant for this research. Because the concepts are common words and occur with high frequency, we decided to exclude this umbrella-concept and all its descendants. The complete list of the 46 included concepts from the other two umbrella-concepts is shown in Table 4.5.
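The concept counts reported in this section can be checked with a short tally. The sketch below assumes that screening started from the 50 descendants shown in each figure. The exclusion counts per reason are read from Tables 4.3 and 4.4, and the variable names are illustrative.

```python
# Assumption: each umbrella-concept was screened on the 50 descendants
# shown in its figure.
TOP_N = 50

# Number of excluded concepts per reason (Tables 4.3 and 4.4).
excluded = {
    "Mental disorder": {
        "Incorrect trigger": 9,
        "Unlikely personal experience": 7,
        "Lacking psychiatric relevance": 5,
        "Hair-related": 1,
    },
    "Psychoactive substance": {
        "Has [food] or [hops] semantic type": 22,
        "Not known as psychiatric condition treatment": 9,
        "Incorrect trigger": 1,
    },
}

# Included concepts are what remains after subtracting all exclusions.
included = {
    umbrella: TOP_N - sum(reasons.values())
    for umbrella, reasons in excluded.items()
}
total_included = sum(included.values())

print(included)
print("total included:", total_included)
```

Under that assumption the tally gives 28 and 18 included concepts, 46 in total, which matches the figures given in the text.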

Figure 4.4: Top 50 most occurring descendant concepts of Mental finding.

Table 4.5: Final list of concepts related to a psychiatric condition.

Mental disorder | Psychoactive substance
Paranoia | Antidepressive Agents
Obsessive-Compulsive Disorder | Selective Serotonin Reuptake Inhibitors
Anxiety Disorders | Lithium
Mental disorders | Diazepam
Autistic Disorder | Lithium Chloride
Antisocial Personality Disorder | Anti-Anxiety Agents
Mixed anxiety and depressive disorder | Methylphenidate Hydrochloride
Severe depression | Sedatives
Attention deficit hyperactivity disorder | Antipsychotic Agents
Body Dysmorphic Disorders | Benzodiazepine
Alzheimer’s Disease | Citalopram
Presenile dementia | Sertraline
Schizophrenia | Modafinil
Major Depressive Disorder | Perphenazine
Neurotic Disorders | Sodium Valproate
Bipolar Disorder | Amitriptyline
Eating Disorders |
Post-Traumatic Stress Disorder |
Hypochondriasis |
Nonorganic psychosis |
Phobia, Social |
Personality Disorders |
Mild depression |
Drug Dependence |
Panic Disorder |
Dementia |


4.3.5 5-AR inhibitor CUIs

The number of CUIs found for finasteride (123,075, see Figure 4.1) is very close to the number of instances found in Chapter 3 after applying Cspell (123,184). This is interesting, as the two approaches to gathering the concepts are seemingly different. The number of CUIs was also checked and compared for the other 5-AR inhibitor concepts. For dutasteride, exactly 28,000 instances were extracted from the posts. Compared with the Cspell count of 28,050, almost all dutasteride instances were found, which confirms that regular expressions and MetaMap extract a similar number of concepts.
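To make ‘very close’ concrete, the two counts per drug can be expressed as a ratio. The sketch below uses only the four numbers reported above, and the helper name `count_ratio` is illustrative. The ratio compares totals only and does not show that both methods matched the same individual mentions.

```python
# Counts reported in the text: (MetaMap CUI instances, regex count after Cspell).
counts = {
    "finasteride": (123_075, 123_184),
    "dutasteride": (28_000, 28_050),
}


def count_ratio(metamap_count, regex_count):
    """Ratio of the MetaMap instance count to the regex instance count."""
    return metamap_count / regex_count


for drug, (metamap_count, regex_count) in counts.items():
    ratio = count_ratio(metamap_count, regex_count)
    gap = regex_count - metamap_count
    print(f"{drug}: ratio {ratio:.2%}, {gap} fewer instances via MetaMap")
```

For both drugs, the MetaMap count is within 0.2% of the regular-expression count.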

Although the concepts Propecia, Proscar, and Avodart do have a corresponding CUI, they do not appear in the SNOMED-CT ontology, but only in other UMLS vocabularies. Therefore, it was not expected that they would show up as CUIs. However, none of the brand names were mapped to the finasteride or dutasteride concepts either, which means that MetaMap ignored the 5-AR inhibitor brand names completely. This is not consistent with other results: appendix Table C.3 shows that ‘Diazepam’, ‘Methylphenidate Hydrochloride’, and ‘Perphenazine’ were triggered by their brand names, Valium, Ritalin, and Trilafon, respectively.
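One conceivable workaround for the missed brand names, which was not part of this study, is to rewrite brand names to their generic ingredient before concept extraction. The sketch below is hypothetical throughout: the names `BRAND_TO_GENERIC` and `normalize_brands` are illustrative, and only the three brand names discussed above are covered.

```python
import re

# Brand name (lowercase) -> generic ingredient, for the three brands
# discussed in the text.
BRAND_TO_GENERIC = {
    "propecia": "finasteride",
    "proscar": "finasteride",
    "avodart": "dutasteride",
}

# Match any brand name as a whole word, regardless of case.
BRAND_PATTERN = re.compile(
    r"\b(" + "|".join(BRAND_TO_GENERIC) + r")\b",
    re.IGNORECASE,
)


def normalize_brands(text):
    """Replace each brand name in text with its generic ingredient."""
    return BRAND_PATTERN.sub(
        lambda match: BRAND_TO_GENERIC[match.group(1).lower()],
        text,
    )


print(normalize_brands("Started Propecia, then switched to avodart."))
```

After such a rewrite, brand-name mentions would be counted under the generic concept, so the counts would no longer be directly comparable with those reported above.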

4.4 Discussion

In this aim, the goal was to: ‘Use a concept extraction tool to extract concepts of the postings, and so obtain posts with mentions of psychiatric conditions.’ We extracted UMLS concepts from the posts using the concept extraction tool MetaMap. An analysis was performed on the top 25 extracted concepts and the extracted concepts related to 5-AR inhibitors. Additionally, three concepts related to psychiatric conditions and their descendants were evaluated to find suitable concepts for further analysis in Chapter 5.

Among the top 25 most frequent concepts, six were surprising because common words wrongly triggered them. All other concepts were correctly derived from common words or words related to hair loss. These 25 concepts account for almost a quarter of all extracted instances, and none of them are related to psychiatric conditions. This showed that only a small part of all extracted instances would be relevant, and that the further search for appropriate concepts had to be more specific.

The chosen umbrella-concepts ‘Mental Disorder’ and ‘Psychoactive Substance’ have 28 and 18 correct descendants with a connection to psychiatric conditions, respectively. Because research mentioned in Chapter 2 shows that 5-AR inhibitors have neuropsychiatric effects, the included concepts from ‘Mental Disorder’ are of interest for further research. The concepts ‘Obsessive-Compulsive Disorder’ and ‘Schizophrenia’ are emphasized in these studies and are therefore of particular interest. With the umbrella-concept ‘Psychoactive substance’, several prescription drugs for psychiatric conditions were extracted. Although there is no prior knowledge about the relation between 5-AR inhibitors and psychiatric medication, these drugs can signal psychiatric conditions and are therefore relevant for further research as well.

In contrast, the concept ‘Mental finding’ is not valuable for identifying concepts related to psychiatric conditions. Its descendants are less specific and not directly related to a psychiatric condition. These concepts have value as indicators of well-being, but they do not allow the inference that someone has a psychiatric condition.
