Automatic detection of humorous content in short texts


Bachelor Project Artificial Intelligence (18 Credits)
Amsterdam, June 24, 2016

Written by
Jochem Barelds
Student Number: 5795567

Supervised by
Dr. Tejaswini Deoskar
ILLC, Office Address: Room F2.44, SP 107
email: t.deoskar@uva.nl

University of Amsterdam
Faculty of Science


Contents

1 Introduction
2 Literature overview
3 Theoretical foundations
  3.1 Basic definitions
  3.2 Features
    3.2.1 POS tag counts
    3.2.2 Word polarity
  3.3 Ambiguity
  3.4 Abstraction
4 Implementation
  4.1 Core
  4.2 Data extraction
5 Evaluation
  5.1 Performance of individual features
  5.2 Average feature values per category
  5.3 Overall performance


1 Introduction

The automatic classification of content in texts is an important research domain in the field of artificial intelligence. Recently, the automatic detection of humor has drawn attention within this domain, but it still poses significant challenges, since it requires a deep semantic understanding of the text, which is difficult to automate.

Previous research has focused primarily on the detection of puns [Kao et al., 2015] and humor in one-liners [Mihalcea et al., 2010]. Closely related is the research done on the detection of memorable content in quotes [Danescu-Niculescu-Mizil et al., 2012].

The problem is that most previous research is constrained to a very predictable form of humor, one that relies more on the detection of specific syntactic patterns than on a deep semantic understanding of the humor. Another problem is that puns are not always considered humorous, and not every joke can be reduced to a play on words. Furthermore, as the puns are contrasted with common language models, the question arises whether the results are not primarily due to the memorability associated with humor, since memorable quotes often involve a play on words in a fashion similar to humor.

This thesis focuses on contrasting humor with memorability. What distinguishes one from the other? What kind of features are needed to make this distinction? Aside from purely syntactic features, semantic features such as sentiment, word similarity, and ambiguity are implemented and evaluated. It is shown that these features contribute to the automatic detection of humorous content in short texts, in particular to distinguishing it from memorable but not humorous content.

To avoid any confusion, throughout this thesis memorability and humor will be treated as mutually exclusive. If a quote is “memorable” in the common usage of the word, but also “humorous”, then it is treated as humorous only, not memorable. This distinction should be kept in mind.


2 Literature overview

In [Mihalcea et al., 2010] several features are proposed to detect humor in one-liners. A one-liner is defined as a single humorous sentence with a set-up and a punch line. The set-up, as the name suggests, sets up the joke by priming the dominant frame of reference in the mind of the receiver. The punch line then adds extra semantic content which forces the receiver of the joke to suddenly change the initial frame of reference to an unexpected and opposing frame of reference, which is said to evoke laughter. Several features are explored, most of them concerning the semantic relatedness between the set-up and the punch line, which, according to incongruity theory, should be minimized. The experiments include 150 one-liners in the dataset. Every one-liner keeps the same set-up but, in addition to the actual punch line, has three non-humorous endings. The classifier is trained on this dataset and then evaluated with a similar test set. The baseline is 25% (the accuracy of a random guess), while the implemented classifier has an accuracy of 84%. This shows that it is able to correctly predict the punch lines 84% of the time; in other words, it detects the humorous content and correctly distinguishes it from the non-humorous endings.

Since one-liners capture the essence of the incongruity of humor in a very elegant manner, incongruity detection can be used for the detection of humor in short texts as well. It provides valuable insights into what distinguishes humor from non-humor.

In [de Oliveira and Rodrigo, ] Yelp user reviews have been used to (1) collect humorous and non-humorous content and (2) automatically annotate the content as humorous or non-humorous. This saves the researchers considerable time, since annotating training data is a very time-consuming process. Deep learning has been used to classify reviews as humorous or non-humorous. This method of collecting data has also been used in this thesis, with the difference that quote sites with quotes from comedians and philosophers, which have already been categorized by the webmaster, are used instead of Yelp user reviews.

In [Radev et al., 2015] the automatic detection of humor in user-submitted captions of New Yorker cartoons is investigated. This paper focuses on features such as negative sentiment (or negative word polarity), human-centeredness, and lexical centrality that strongly correlate with the funniest captions. Several datasets were created, such as the funniest captions, the most incoherent captions, and the most human-centered captions, and these were then compared to each other. Sentiment will also be a focus of attention in this thesis.

In [Mihalcea and Strapparava, 2006] very large datasets from different sources were used to classify humorous texts. The same techniques could be employed to compare different sources and to examine how they differ from each other.


3 Theoretical foundations

3.1 Basic definitions

Before we move on to the classifier and the features, some basic definitions will be laid out below.

Let $W$ be the set of all possible words, which includes non-existing words. Let $T$ be the set of all valid POS tags, as laid out by the Penn Treebank Project [Marcus et al., 1993]. Let $S$ be the set of all synsets covered by WordNet [Miller, 1995]. Let $\mathcal{S} : W \times T \to 2^S$ be the function that maps a tagged word to a subset of $S$. If a word is not found in WordNet, then a stemmer is used. If the stemmed word is still not found, then $\mathcal{S}(w, t) = \emptyset$.

Let $Z$ be the set of annotated sentences, where each $\vec{z} \in Z$ represents an ordered vector of tagged words.

Let $Q$ be the set of all quotes in the corpus. Each quote $\vec{q} \in Q$ represents an ordered vector of sentences.

Let $C = \{C_h, C_m, C_0\}$ be the set of categories, where $C_h$ is the humor category, $C_m$ the memorable category, and $C_0$ the baseline category, i.e., the “common language” category. Then $Q^*$ is the annotated quotes set, where each quote is assigned to a single category.
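To make the definition of $\mathcal{S}$ concrete, the sketch below performs a WordNet synset lookup with the stemmer fallback described above. The thesis does not name the WordNet library it used (its bibliography cites a comparison of Java WordNet libraries [Finlayson, 2014]); MIT JWI is assumed here purely for illustration, and the mapping from Penn Treebank tags to WordNet POS values is omitted.

```java
import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.ISynset;
import edu.mit.jwi.item.IWordID;
import edu.mit.jwi.item.POS;
import edu.mit.jwi.morph.WordnetStemmer;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of S(w, t): look a tagged word up in WordNet, fall back to the
// stemmer, and return the empty set if the stemmed word is also missing.
// JWI is an assumed library choice, not necessarily the thesis's.
public class SynsetLookup {
    private final IDictionary dict;
    private final WordnetStemmer stemmer;

    public SynsetLookup(File wordNetDictDir) throws Exception {
        dict = new Dictionary(wordNetDictDir);
        dict.open();
        stemmer = new WordnetStemmer(dict);
    }

    /** Returns the synsets for a word/POS pair, or an empty list (∅). */
    public List<ISynset> synsets(String word, POS pos) {
        List<ISynset> result = lookup(word, pos);
        if (result.isEmpty()) {
            // Fall back to the stemmed forms, as described in the text.
            for (String stem : stemmer.findStems(word, pos)) {
                result = lookup(stem, pos);
                if (!result.isEmpty()) break;
            }
        }
        return result;
    }

    private List<ISynset> lookup(String word, POS pos) {
        List<ISynset> synsets = new ArrayList<>();
        IIndexWord idx = dict.getIndexWord(word, pos);
        if (idx != null) {
            for (IWordID id : idx.getWordIDs()) {
                synsets.add(dict.getWord(id).getSynset());
            }
        }
        return synsets;
    }
}
```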


3.2 Features

3.2.1 POS tag counts

The POS tag count of a quote is the number of times that a given tag $t \in T$ occurs in a quote $\vec{q} \in Q$.
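As a minimal sketch, this feature reduces to a single pass over the tags of a quote; the class and method names below are illustrative and not taken from the thesis code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the POS tag count feature: one pass over the tags of a
// quote, counting how often each Penn Treebank tag occurs.
public class TagCounter {

    public static Map<String, Integer> tagCounts(List<String> tags) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1, Integer::sum); // increment this tag's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // A quote tagged as DT JJ NN VBZ yields {DT=1, JJ=1, NN=1, VBZ=1}.
        System.out.println(tagCounts(Arrays.asList("DT", "JJ", "NN", "VBZ")));
    }
}
```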

3.2.2 Word polarity

The polarity of a synset is given by SentiWordNet, which associates synsets with polarities for a subset of $S$ [Baccianella et al., 2010]. This is a multi-dimensional property, where a synset $s$ can be both positive ($e_\oplus$) and negative ($e_\ominus$), but also objective ($e_\odot$):

$e_\oplus(s) + e_\ominus(s) \leq 1$  (3.1)

$e_\odot(s) = 1 - (e_\oplus(s) + e_\ominus(s))$  (3.2)

The polarity of a tagged word $(w, t)$ is then computed as the average of all possible synsets:[1]

$e_\oplus(w, t) = \frac{1}{|\mathcal{S}(w, t)|} \sum_{s \in \mathcal{S}(w, t)} e_\oplus(s)$  (3.3)

$e_\ominus(w, t) = \frac{1}{|\mathcal{S}(w, t)|} \sum_{s \in \mathcal{S}(w, t)} e_\ominus(s)$  (3.4)

$e_\odot(w, t) = 1 - (e_\oplus(w, t) + e_\ominus(w, t))$  (3.5)

[1] Context is not taken into account. However, if context were taken into account, the subset of possible synsets for a tagged word could be further reduced.

The average polarity of a quote can then be computed by taking the average over every word in the quote that has a polarity associated with it (ignoring the rest). However, this is not a very good approach, since the polarities in a quote tend to neutralize each other. Furthermore, a lot of potentially useful information is thrown away.

A better approach would be to measure the change in polarity over the course of a quote. Every polarity associated with a tagged word $(w, t)$ can be plotted against its position in the quote, and a line can be fitted to these points. The offset of the line would represent the starting polarity, while the slope would represent the gradual change over time. Again, however, polarities tend to cancel each other out, so this is still not a good approach.

One could further investigate polynomials and sinusoids to capture the change, but a far easier method, one that throws away less information than the average would, is to use a thresholding function that maps the polarities to tokens from a very small set $\{\oplus, \ominus, \odot\}$. Sentiment n-grams can then be made from a string of these tokens, and these n-grams can be counted. For example, we could count the changes from positive to negative $\langle \oplus, \ominus \rangle$, or the changes from negative to positive $\langle \ominus, \oplus \rangle$.
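A minimal sketch of this thresholding-and-counting step is given below. The threshold value is an assumption (the thesis does not state the cut-off it used), and ASCII characters stand in for the tokens $\oplus$, $\ominus$, and $\odot$.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the sentiment n-gram feature: threshold each word's polarity
// pair into a single token, then count adjacent token pairs (bigrams).
// THRESHOLD is an assumed value; the thesis does not state its cut-off.
public class SentimentNgrams {

    private static final double THRESHOLD = 0.25; // assumed cut-off

    // Maps one word's (positive, negative) polarity scores to a token:
    // '+' stands for ⊕, '-' for ⊖, and 'o' for the objective token ⊙.
    static char toToken(double pos, double neg) {
        if (pos >= THRESHOLD && pos > neg) return '+';
        if (neg >= THRESHOLD && neg > pos) return '-';
        return 'o';
    }

    // Counts sentiment bigrams such as "+-" (positive to negative) over the
    // sequence of per-word (positive, negative) polarity pairs of a quote.
    static Map<String, Integer> bigramCounts(List<double[]> polarities) {
        List<Character> tokens = new ArrayList<>();
        for (double[] p : polarities) {
            tokens.add(toToken(p[0], p[1]));
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String bigram = "" + tokens.get(i) + tokens.get(i + 1);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }
}
```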

3.3 Ambiguity

The total ambiguity of a quote can be measured by counting the possible synsets for each word, and subtracting 1 for each word in the quote, so that a completely unambiguous quote has an ambiguity measure of 0.
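In code, this measure reduces to one sum over per-word synset counts. A small sketch follows, under the assumption that words with no synsets at all contribute 0 rather than −1:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the ambiguity measure: each word contributes its number of
// possible synsets minus one, so a quote in which every word has exactly
// one synset scores 0. Words with no synsets are assumed to contribute 0.
public class AmbiguityMeasure {

    static int ambiguity(List<Integer> synsetCountPerWord) {
        int total = 0;
        for (int count : synsetCountPerWord) {
            total += Math.max(0, count - 1); // unambiguous words add nothing
        }
        return total;
    }

    public static void main(String[] args) {
        // Three words with 4, 1, and 2 possible synsets: 3 + 0 + 1 = 4.
        System.out.println(ambiguity(Arrays.asList(4, 1, 2))); // prints 4
    }
}
```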

3.4 Abstraction

Abstractions lend themselves very well to a play on words. Since abstractions exist solely as cognitive tools, it is not surprising that they are used in humorous and memorable content. Memorable quotes often employ words such as “good” and “bad”, which are both abstract, and mold them into an idea that, for example, inspires or motivates people.

How, then, do we measure abstractions? WordNet has a valuable synset for abstractions. Using WordNet notation, this is referred to as abstraction.n.06, which denotes that, given “abstraction” as an index word, this synset can be uniquely identified as the sixth synset in the list of synsets that have that word as a noun. The symbol $\alpha$ will be used for this synset.

A measure of abstraction can then be computed by calculating the distance $\delta$ between a synset $s$ and $\alpha$ as its hypernym. Every level in the inheritance hierarchy adds 1 to the distance, and a synset that does not have the abstraction synset as its hypernym will be treated as if the distance were $\infty$. The abstraction of a synset can then be defined as:[2]

$A(s) = \frac{1}{\delta(s, \alpha) + \epsilon}$  (3.6)

If multiple synsets exist for a word, the average of those synsets is taken as the abstraction for that word. The abstraction of an entire quote is then equal to the sum of the abstraction measures of its words.
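A sketch of the distance computation follows, with $\delta$ found by breadth-first search up the hypernym links. JWI is again assumed as the WordNet library, and the value of $\epsilon$ is an assumption, since the thesis does not define it.

```java
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.ISynset;
import edu.mit.jwi.item.ISynsetID;
import edu.mit.jwi.item.Pointer;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch of A(s) = 1 / (δ(s, α) + ε), where δ is the hypernym distance from
// s to abstraction.n.06 (α), and δ = ∞ when α is not an ancestor of s.
public class AbstractionMeasure {

    private static final double EPSILON = 1.0; // assumed value for ε

    // Hypernym distance from s to alpha via breadth-first search.
    public static int distance(IDictionary dict, ISynset s, ISynset alpha) {
        Set<ISynsetID> visited = new HashSet<>();
        Queue<ISynsetID> frontier = new ArrayDeque<>();
        frontier.add(s.getID());
        int depth = 0;
        while (!frontier.isEmpty()) {
            int levelSize = frontier.size();
            for (int i = 0; i < levelSize; i++) {
                ISynsetID id = frontier.poll();
                if (id.equals(alpha.getID())) return depth;
                if (!visited.add(id)) continue; // skip already-seen synsets
                // Follow hypernym links one level up the hierarchy.
                frontier.addAll(
                        dict.getSynset(id).getRelatedSynsets(Pointer.HYPERNYM));
            }
            depth++;
        }
        return Integer.MAX_VALUE; // α is not a hypernym of s: δ = ∞
    }

    public static double abstraction(IDictionary dict, ISynset s, ISynset alpha) {
        int d = distance(dict, s, alpha);
        // With δ = ∞ the measure tends to 0, so 0 is returned directly.
        return d == Integer.MAX_VALUE ? 0.0 : 1.0 / (d + EPSILON);
    }
}
```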


4 Implementation

4.1 Core

The code was written in Java 8.[1] The WEKA library was used for training and evaluation [Hall et al., 2009]. The Stanford POS Tagger was used for tagging words in quotes [Manning et al., 2014].
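As an illustration of the tagging step, the snippet below runs the Stanford POS Tagger on a single sentence. The model file path is an assumption; any of the English models shipped with the tagger distribution can be used.

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Sketch of tagging one sentence with the Stanford POS Tagger.
public class TagQuote {
    public static void main(String[] args) {
        // The model path is an assumed example, not the thesis's configuration.
        MaxentTagger tagger = new MaxentTagger(
                "models/english-left3words-distsim.tagger");
        // tagString returns the sentence with a Penn Treebank tag appended
        // to each token, in word_TAG format.
        System.out.println(tagger.tagString("Time flies like an arrow."));
    }
}
```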

4.2 Data extraction

Data extraction was done using Jodd’s Jerry and HTTP libraries.[2] The funny jokes section of Reader’s Digest was used to fetch the humorous content, and the motivational and inspirational sections were used to fetch the memorable content. In addition, quotes from BrainyQuote were used. In total, 1,588 memorable and 1,185 humorous quotes were fetched. The Brown corpus was used to fetch the content for “common language”, a total of 1,000 quotes.

[1] http://github.com/jchmb/bachelor-thesis.git
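The scraping step is sketched below using Jodd’s HTTP and Jerry libraries, which the text names. The URL and CSS selector are hypothetical placeholders, and the exact Jerry calls may differ between Jodd versions; this is an illustration of the approach rather than the author’s actual pipeline.

```java
import jodd.http.HttpRequest;
import jodd.jerry.Jerry;

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of scraping quotes with Jodd: fetch a page over HTTP,
// parse it with Jerry (a jQuery-like DOM API), and collect the matched text.
public class QuoteScraper {

    public static List<String> fetchQuotes(String url, String selector) {
        String html = HttpRequest.get(url).send().bodyText(); // download page
        List<String> quotes = new ArrayList<>();
        Jerry.jerry(html).$(selector).each(($this, index) -> {
            quotes.add($this.text().trim());
            return true; // continue iterating over matched nodes
        });
        return quotes;
    }

    public static void main(String[] args) {
        // Hypothetical example call; the real pages and selectors used by
        // the author are not documented in the thesis.
        for (String quote : fetchQuotes("http://example.com/quotes",
                                        "blockquote.quote")) {
            System.out.println(quote);
        }
    }
}
```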


5 Evaluation

5.1 Performance of individual features

The accuracy gain of each feature is laid out below. Features whose gain was 0 or less were left out to improve readability.

Feature             Accuracy   Gain
indefinite          59.63      2.34
past tense          57.68      0.39
noun count          58.30      1.01
abstraction score   58.19      0.90

The generality of indefinite determiners and the use of past tense versus present tense seem to contribute significantly to the distinction, as do the noun count and the abstraction score.

5.2 Average feature values per category

The average feature value per category is laid out below. Features whose average value was zero for every category were left out to improve readability.


Feature                 Humorous   Memorable   Common language
detection NN l l        0.01       0.00        0.00
sentiment objectivity   1.00       1.00        1.00
neg to pos              0.00       0.00        0.00
count VB                1.26       1.34        1.35
detection NN l f        0.11       0.16        0.08
detection NN f f        0.15       0.20        0.08
detection JJ f f        0.02       0.03        0.01
detection JJ l f        0.01       0.02        0.00
abstraction score       501.55     465.07      324.90
count NN                3.19       3.71        2.07
count JJ                2.14       1.86        0.85
third person            0.02       0.02        0.03
detection VB f l        0.00       0.00        0.00
detection VB l f        0.04       0.06        0.11
length                  128.25     130.22      86.32
detection VB f f        0.05       0.07        0.10

The table shows that humorous quotes tend to be significantly more abstract than either memorable or common language quotes. It can also be seen that memorable quotes tend to have more nouns than humorous quotes, while humorous quotes tend to have more adjectives.

However, sentiment does not seem to contribute as much as hypothesized.

5.3 Overall performance

Humorous instances          1185
Memorable instances         1588
Total number of instances   2773
Training set size           90%
Evaluation set size         10%
Cross-validation folds      10
Accuracy                    63.71
Baseline                    57.27

The baseline is nothing more than the number of memorable quotes divided by the total number of quotes (1,588 / 2,773 ≈ 57.27%). Note that the common language category was not used in this evaluation; it was used mostly for feature design.
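For reference, a minimal sketch of a 10-fold cross-validation run with WEKA’s programmatic API is shown below. The ARFF file name and the choice of J48 as classifier are assumptions, since the thesis does not state which WEKA classifier or feature file it used.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of a 10-fold cross-validation run with WEKA, matching the setup in
// the table above. "quotes.arff" and the J48 decision tree are placeholders.
public class EvaluateClassifier {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("quotes.arff")));
        data.setClassIndex(data.numAttributes() - 1); // class = last attribute
        J48 classifier = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
    }
}
```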


Conclusion & future work

While the distinction between humorous and memorable quotes is subtle, since both tend to employ similar language features such as abstractions, ambiguity, and incongruity, it is nonetheless demonstrated that this distinction can be made on the basis of syntactic and semantic features, by creating a classifier that performs better than chance would allow. However, since this research has been largely an exploratory effort, much work still needs to be done. In particular, as most features operate at the word level, features based on higher-level syntax and semantics are likely to increase the accuracy of the classifier, e.g.,

• Incongruity between (sub)sentences within a quote.

• Counting the possible parse trees of the sentences of a quote. Each parse tree represents a different interpretation. Since humor, like memorable quotes, often employs ambiguity, this is a promising feature to explore in the future.

• Taking context into account, although it still remains unclear how this should be approached.


References

[Baccianella et al., 2010] Baccianella, S., Esuli, A., and Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC, volume 10, pages 2200–2204.

[Danescu-Niculescu-Mizil et al., 2012] Danescu-Niculescu-Mizil, C., Cheng, J., Kleinberg, J., and Lee, L. (2012). You had me at hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 892–901. Association for Computational Linguistics.

[de Oliveira and Rodrigo, ] de Oliveira, L. and Rodrigo, A. L. Humor detection in Yelp reviews.

[Finlayson, 2014] Finlayson, M. A. (2014). Java libraries for accessing the Princeton WordNet: Comparison and evaluation. In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

[Hall et al., 2009] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

[Kao et al., 2015] Kao, J. T., Levy, R., and Goodman, N. D. (2015). A computational model of linguistic humor in puns. Cognitive Science.

[Manning et al., 2014] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60.

[Marcus et al., 1993] Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

[Mihalcea and Strapparava, 2006] Mihalcea, R. and Strapparava, C. (2006). Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence, 22(2):126–142.

[Mihalcea et al., 2010] Mihalcea, R., Strapparava, C., and Pulman, S. (2010). Computational models for incongruity detection in humour. In Computational Linguistics and Intelligent Text Processing, pages 364–374. Springer.

[Miller, 1995] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

[Radev et al., 2015] Radev, D., Stent, A., Tetreault, J., Pappu, A., Iliakopoulou, A., Chanfreau, A., de Juan, P., Vallmitjana, J., Jaimes, A., Jha, R., et al. (2015). Humor in collective discourse: Unsupervised funniness detection in the New Yorker cartoon caption contest. arXiv preprint arXiv:1506.08126.
