Academic year: 2021

Capabilities of Diagnostic Classifier and

LSTM in detecting Sarcasm

Roel Kuiper (11337575)
Bachelor thesis, 18 EC
Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Dr. T. Lentz
Institute for Language and Logic, Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Abstract

This thesis examines the capability of a conventional LSTM to build an internal representation of sarcasm, with the eventual aim of predicting sarcasm. A data set consisting of audio of Donald Trump, of Alec Baldwin, and of Alec Baldwin sarcastically imitating Trump was used to train LSTM models to finish pitch contours. A Diagnostic Classifier was then trained on the internal states of these models to determine whether an internal representation of the speakers was present. The conclusion of this thesis is that, with this approach, sarcasm is hard to detect in the internal representation of LSTM models. However, the Diagnostic Classifier is a promising new way to open the black box of machine learning and warrants use in further research in corresponding fields.


Contents

1 Introduction
  1.1 Cues for sarcasm
  1.2 Predicting sarcasm
    1.2.1 Pitch
    1.2.2 Finishing pitch contour with LSTM
    1.2.3 Diagnostic Classifier
2 Method
  2.1 Data preprocessing
  2.2 Experiment 0: Optimal Diagnostic Classifier
  2.3 Experiment 1: Optimal LSTM Model
  2.4 Experiment 2: Recognizing Sarcasm/Imitation
  2.5 Experiment 3: Recognizing Sarcasm on specified data
3 Results
  3.1 Experiment 0
  3.2 Experiment 1
    3.2.1 Best models on complete data set
  3.3 Experiment 2
  3.4 Experiment 3
4 Discussions
5 Conclusions
6 Appendix A
  6.1 No smoothing
  6.2 Smoothing
7 Appendix B
  7.1 No smoothing
  7.2 Smoothed


1 Introduction

According to Merriam-Webster, the meaning of the word "sarcasm" can be traced back to the Greek verb sarkazein, which initially meant "to tear flesh like a dog." Sarkazein eventually developed to mean "to sneer," and led to the Greek noun sarkasmos ("a sneering or hurtful remark"), iterations of which passed through French and Late Latin before arriving in English as "sarcasm" in the mid-16th century. Even today sarcasm is often described as sharp, cutting, or wounding, reminiscent of the original meaning of the Greek verb (Merriam-Webster, 2019). If it were possible to predict sarcasm, subtitles for the hearing-impaired could be extended with a sarcasm flag when needed. Additionally, with the rise of deep fakes, predicting sarcasm/imitation might even be used to automatically flag (deep) fake news. The aim of this thesis is to find out whether it is possible to algorithmically predict sarcasm based on audio alone. All results of this research will be made public.

1.1 Cues for sarcasm

Humans detect sarcasm through several key cues. Firstly, Kreuz suggests that there are lexical cues for sarcasm and that sarcasm can be constructed via a formula (Kreuz and Caucci, 2007). Secondly, the visual cue for sarcasm is the phenomenon called 'blank face' (Attardo et al., 2003), often used in spontaneous sarcasm. This cue is important for detecting sarcasm since body language plays a major role in communication. The third and most important cue for sarcasm is context/co-text. As Gibbs concluded in 1986, for sarcasm to be comprehended, the literal meaning of the sarcastic statement does not need to be processed first: the processing of sarcasm is mostly influenced by prior knowledge of the subject and societal norms (Gibbs, 1986). For the data set used in this thesis, this means that humans almost instantaneously know that statements are sarcastic if they know that the speaker is imitating somebody else. This knowledge is not so easily encoded in an algorithm and falls outside the scope of this research, so other cues for sarcasm need to be used. Finally, auditive cues for sarcasm include changes in resonance, reductions of speech rate and a reduction in pitch range (Cheang and Pell, 2008). Moreover, Rockwell suggests that a lower pitch level, a slower tempo and a greater intensity while speaking are the discriminatory factors for sarcasm (Rockwell, 2000). Attardo even concludes that pitch alone is the contrastive marker for sarcasm and that an 'ironical intonation' does not exist. So while the sarcasm research field has not reached a consensus on all the auditive cues for sarcasm, there is one commonly recurring cue: pitch. However, even pitch is not considered an auditive cue for sarcasm by all. Bryant concludes that an ironic tone of voice does not exist; listeners interpret verbal irony by combining a variety of cues, including information outside of the linguistic context (Bryant and Tree, 2005).

1.2 Predicting sarcasm

To predict sarcasm based on auditive cues, it is important to know what the most important cues are. Pitch carries auditive cues for sarcasm and is interesting to use as a predictor, even though the precise encoding of sarcasm in pitch is not yet clear.

1.2.1 Pitch

Speech sound waves, like all waves, can be described in terms of frequency, amplitude, and the other characteristics that are used for pure sine waves. In speech waves, these are not quite as simple to measure as for sine waves. Consider frequency: even though speech waves are not exactly sine waves, the wave is nonetheless periodic, i.e. it has a frequency. The frequency of this wave comes from the speed of vibration of the vocal cords. When the vocal folds are open, air pushes up from the lungs, creating a region of high pressure; when the folds are closed, there is no pressure from the lungs. So, when the vocal folds are vibrating, regular peaks in amplitude are expected, where each major peak corresponds to an opening of the vocal folds. The frequency of the vocal fold vibration, i.e. the frequency of the complex wave, is called the fundamental frequency of the waveform, often abbreviated to F0. F0 can be plotted over time in a pitch track, also called a pitch contour (Jurafsky and Martin, 2008a). The pitch of a sound is the mental sensation, or perceptual correlate, of fundamental frequency; in general, if a sound has a higher fundamental frequency we perceive it as having a higher pitch. 'In general' is used because the relationship is not linear, since human hearing has different acuities for different frequencies. Roughly speaking, human pitch perception is most accurate between 100 Hz and 1000 Hz, and in this range pitch correlates linearly with frequency. Human hearing represents frequencies above 1000 Hz less accurately, and above this range pitch correlates logarithmically with frequency. Logarithmic representation means that the differences between high frequencies are compressed and hence not as accurately perceived (Jurafsky and Martin, 2008b). Additionally, pitch values can differ from F0 values for devoiced phones and voiceless phonemes. These lack an F0 value, yet when someone is talking people do not perceive a lack of pitch. This means that such phones and phonemes are audible to humans, but pitch extraction toolkits, like Praat, cannot extract pitch values for them. Instead, Praat extracts zero values, which essentially mean: "not available" (p.c. Tom Lentz).
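The piecewise relationship described above can be illustrated with a toy mapping. This is purely illustrative and not part of the thesis pipeline; the function name and the exact form of the logarithmic branch are assumptions chosen only to make the scale continuous at 1000 Hz.

```python
import math

def perceived_pitch(f0_hz):
    """Toy perceptual pitch scale: linear up to 1000 Hz, logarithmic
    above it, continuous at the 1000 Hz boundary. Praat-style zero
    values ('not available') map to 0."""
    if f0_hz <= 0:
        return 0.0
    if f0_hz <= 1000.0:
        return float(f0_hz)
    # Above 1000 Hz, equal frequency differences are compressed.
    return 1000.0 * (1.0 + math.log(f0_hz / 1000.0))

# An equal 500 Hz step shrinks on the perceptual scale above 1000 Hz:
step_low = perceived_pitch(1000) - perceived_pitch(500)    # 500.0
step_high = perceived_pitch(1500) - perceived_pitch(1000)  # ~405.5
```

The compression of the high step relative to the low one mirrors the reduced acuity for high frequencies described by Jurafsky and Martin.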

1.2.2 Finishing pitch contour with LSTM

There are multiple algorithms that are proficient at prediction based on sequences, with the LSTM being commonly used. A Long Short-Term Memory (LSTM) model is a form of Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997). So-called memory blocks in the recurrent hidden layer are the main difference between LSTMs and standard RNNs. These memory blocks contain memory cells which store the temporal state of the network. Additionally, a memory cell contains gates to control the flow of information: the output gate sends the cell activations into the rest of the network, while the input gate handles the input activations into the cell. On top of the input and output gate, a memory cell also accommodates a forget gate, which resets the cell state so that unsegmented continuous input streams can be processed. An LSTM network maps an input sequence x = (x_1, ..., x_N) to an output sequence h = (h_1, ..., h_N), where N is the length of the sequence, by iteratively applying the following equations for t from 1 to N:

(1) f_t = σ(W_f [h_{t−1}, x_t] + b_f)
(2) i_t = σ(W_i [h_{t−1}, x_t] + b_i)
(3) C̃_t = tanh(W_C [h_{t−1}, x_t] + b_C)
(4) C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
(5) o_t = σ(W_o [h_{t−1}, x_t] + b_o)
(6) h_t = o_t ∗ tanh(C_t)

Table 1: LSTM equations (Olah, 2015)

Figure 1: Visual representation of an LSTM cell (Olah, 2015)

Throughout the equations in Table 1, versions of W and b are used: the W terms are weight matrices and the b terms are bias vectors, each corresponding to their layer. In equation 1, f_t represents the forget layer; looking at the previous output, h_{t−1}, and the new input, x_t, it decides what part of the information is kept, using a sigmoid function σ. The next part, equations 2 and 3, decides what part of the new information is used in the cell state. First the input gate layer, equation 2, decides which values to update. The tanh layer of equation 3 then creates new candidate values, C̃_t, that could be added to the cell state. Using the results of equations 1, 2 and 3, the new cell state C_t is computed in equation 4: parts of the previous cell state that are no longer deemed necessary are forgotten, f_t ∗ C_{t−1}, and the new candidate values are added, i_t ∗ C̃_t. The final two equations decide the output. First, a sigmoid layer, equation 5, decides which parts of the cell state are output. This is multiplied by a tanh layer in equation 6, to push all values between −1 and 1. A visual representation of an LSTM cell can be seen in Figure 1.

For this thesis there are three relevant parameters of an LSTM model: units, look back and look forward. Firstly, units describes the dimensionality of the output space, i.e. the dimension of the internal state of the LSTM layer (Wang, 2020). Secondly, look back is the length of the sequence the LSTM model uses to make a prediction. Finally, look forward is the length of the sequence to be predicted by the LSTM model.
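How look back and look forward carve a contour into training pairs can be sketched as a simple sliding window. The helper below is hypothetical; the thesis's actual preprocessing code is not shown here.

```python
import numpy as np

def make_windows(contour, look_back, look_forward):
    """Slide over one pitch contour: each input is `look_back`
    consecutive frames, each target the `look_forward` frames that
    immediately follow it."""
    X, y = [], []
    last = len(contour) - look_back - look_forward
    for start in range(last + 1):
        X.append(contour[start:start + look_back])
        y.append(contour[start + look_back:start + look_back + look_forward])
    return np.array(X), np.array(y)

contour = np.arange(15, dtype=float)  # stand-in for one sentence's F0 values
X, y = make_windows(contour, look_back=10, look_forward=1)
# X has shape (5, 10); the first target y[0] is frame 10
```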


1.2.3 Diagnostic Classifier

In this thesis, a relatively new approach called diagnostic classification (Hupkes, Veldhoen, and Zuidema, 2018) is used. This approach is based on the idea that if a model is computing or keeping track of certain information, it should be possible to extract this information from its internal state space. In Hupkes' vision, whether a network represents a certain variable or feature is tested by training an additional classifier, a diagnostic classifier, to predict the sequence of values this variable takes at each step of the computation from the sequence of hidden states a trained network goes through while processing the input sentence. If the sequence of values can be predicted with high accuracy by the diagnostic classifier, this indicates that the hypothesised information is indeed computed by the network. Conversely, a low accuracy suggests that this information is not represented in the hidden state. Hupkes uses the Diagnostic Classifier (DC) to address most of the shortcomings of visualisation-based methods for opening the 'black box' that machine learning can be, as well as to quantitatively test hypotheses about neural networks that range from very simple to fully fledged strategy descriptions. For instance, it can be used to test for the existence of feature detectors, but it can also be extended to test whether a network computes the type of information needed for an algorithmically defined symbolic strategy. A slight alteration to Hupkes' approach is used in this thesis. Hupkes' focus lies on confirming whether detailed hypotheses about internal representations hold, but for satiric pitch tracks detailed hypotheses do not exist yet. Therefore, the DC is adjusted to differentiate the speakers based on the hidden nodes of an LSTM model. If this results in a high accuracy, it might indicate that there is an internal representation for sarcasm and that sarcasm can be detected using pitch tracks.
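The core idea, a simple classifier reading class information out of hidden states, can be sketched as follows. Synthetic "hidden states" and a plain logistic regression stand in for the real LSTM states and the DC used in the thesis; all names, dimensions and the degree of class separation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for hidden states recorded from a trained network:
# 200 state vectors (dim 8) per class. They are separable only if the
# network actually encodes the class; here that is simulated by shifting
# the class means.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 8)),
               rng.normal(1.5, 1.0, size=(200, 8))])
y = np.array([0] * 200 + [1] * 200)

# Diagnostic classifier: plain logistic regression, gradient descent.
w, b = np.zeros(8), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)       # clip to keep exp() stable
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean(((X @ w + b) > 0) == (y == 1))
# High accuracy: the states linearly encode the class.
# Near-chance accuracy would suggest the information is absent.
```

The same read-out logic applies whether the classes are two abstract labels, as here, or the three speakers used later in the thesis.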

2 Method

The data set available for this thesis, obtained from Saskia Leymann, is divided into three parts: audio of Donald Trump, audio of Alec Baldwin, and audio of Alec Baldwin satirically imitating Donald Trump. The Trump audio is taken from some of his weekly addresses and sums up to a total of 2:01:56 hours. The Baldwin audio is of him narrating his autobiography 'Nevertheless: A Memoir' and sums up to a total of 1:48:01 hours. The imitation audio is of Baldwin narrating his own book ridiculing Trump, 'You can't spell America without me: The really tremendous inside story of my fantastic first year as president Donald J. Trump (A so-called parody)', and sums up to a total of 1:58:22 hours. In the rest of this thesis these parts of the data set will respectively be called Trump, Baldwin and Sarcasm.

2.1 Data preprocessing

Praat is used to extract the pitch tracks from the audio files. The optimal pitch range setting for the extraction is from 50 Hz to 400 Hz (p.c. Saskia Leymann), using a time frame of 0.015 s per pitch value. All pitch values are combined into vectors and separated per sentence, based on the average sentence length of the corresponding speaker.

Trump Sarcasm Baldwin

7.5 6.8 4.8

Table 2: Average sentence length in seconds

The average sentence length, as seen in Table 2, was estimated by taking the average length of the first 30 sentences of each speaker. This results in an estimated 1350 sentences for Baldwin, 1044 sentences for Sarcasm and 975 sentences for Trump. Since people are unable to hear a lack of pitch, and since Praat makes mistakes in the pitch extraction process, it might be necessary to smooth pitch values of 0 for optimal performance. Therefore, to examine whether smoothing is necessary, all models in Experiment 1 are trained on both smoothed and unsmoothed data. Smoothing is done by simply removing all frames where the F0 is 0. However, this might result in a loss of information, due to the possible loss of pauses within and between words and other irregularities.
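The smoothing and sentence-splitting steps described above can be sketched as follows. Both helpers are hypothetical stand-ins for the thesis's actual preprocessing scripts.

```python
import numpy as np

FRAME_SEC = 0.015  # one pitch value per 0.015 s, as extracted with Praat

def smooth(contour):
    """'Smoothing' as used in this thesis: drop every frame where Praat
    returned F0 = 0 ('not available'). Note this shortens the contour,
    so pauses between and within words are lost."""
    contour = np.asarray(contour, dtype=float)
    return contour[contour > 0]

def split_sentences(contour, avg_sentence_sec):
    """Cut a long contour into chunks of the speaker's average sentence
    length, a rough stand-in for real sentence segmentation."""
    n = int(round(avg_sentence_sec / FRAME_SEC))
    return [contour[i:i + n] for i in range(0, len(contour), n)]

raw = np.array([120.0, 0.0, 130.0, 0.0, 0.0, 125.0])
voiced = smooth(raw)                         # zero frames removed
trump_chunks = split_sentences(voiced, 7.5)  # 7.5 s average (Table 2)
```

The shortening effect of `smooth` is exactly the property discussed later in the Discussions section: a fixed look back covers more speech time on smoothed contours than on normal ones.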


2.2 Experiment 0: Optimal Diagnostic Classifier

This experiment was conducted during Experiment 1. Experiments 0, 1 and 2 are codependent: it is not possible to find the optimal solution for one without knowing the answer to the others. However, the goals of the experiments differ: pitch prediction for the LSTM in Experiment 1, and classification by the Diagnostic Classifier, from the LSTM's states to two factors of interest (identity and style), in Experiments 0 and 2. For Experiment 0, in order to save time, the optimal parameters for the Diagnostic Classifier are found using the easiest, i.e. smallest, set of LSTM parameters: 50 units, a look back of 10 and a look forward of 1. These parameters are used to train an LSTM on a toy problem, to test whether the approach of this thesis will work. The toy problem consists of generated data from two different sine functions, as depicted in Figure 2.
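Toy data of this kind could be generated along the following lines. The frequencies and duration below are arbitrary illustrative choices; the actual values used for Figure 2 are not specified in the text.

```python
import numpy as np

# Two artificial "speakers": sine waves with different frequencies,
# sampled at the 0.015 s pitch frame rate used elsewhere in the thesis.
t = np.arange(0.0, 10.0, 0.015)              # 10 s of frames
speaker_a = np.sin(2 * np.pi * 1.0 * t)      # 1.0 Hz sine
speaker_b = np.sin(2 * np.pi * 2.5 * t)      # 2.5 Hz sine

# Label every frame with the sine it came from, giving a case where an
# LSTM plus Diagnostic Classifier should separate the classes easily.
data = np.concatenate([speaker_a, speaker_b])
labels = np.concatenate([np.zeros(len(t)), np.ones(len(t))])
```

Because the two classes are trivially distinguishable, this setup isolates the question of whether the DC pipeline itself works.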

Units Layers

50 1
100 2
200 3

Table 3: All possible parameters of the DC

Figure 2: Example of data used for the toy problem

The performance of this LSTM model is tested by training a Diagnostic Classifier on the internal states of the LSTM; why this is done is explained in Experiment 1. The performance of the DC is tested for the parameter settings listed in Table 3. The best performing DC will not necessarily be used to conduct Experiment 1: to save time, speed is also taken into account.

2.3 Experiment 1: Optimal LSTM Model

This experiment consists of two parts. Again, this experiment is limited by the time constraints of the thesis and by limited computing power, which makes a 'preselection' of the LSTM models necessary. So first, LSTM models are trained on a quarter of the available data for varying parameters. These parameters, listed in Table 4, are selected to allow an estimation of the optimal parameters.

Units Look back Look forward

50 10 1

100 100 10

200 200 50

Table 4: All parameters that are used for finding the optimal parameters of the LSTM model

The LSTM models are trained to predict the next n (look forward) values based on the last m (look back) values. The predictions are demonstrated in Figure 3; Appendix A showcases an example prediction for each of the models. In total 54 models are trained: 27 on smoothed and 27 on normal data.


Figure 3: The graph in the middle is from a model that is trained on the smoothed data, the other graphs are from models that are trained on the normal data

However, it is hard to define a metric that can be used to rank these predictions. Predictions vary from the actual pitch, since the model has no knowledge of the next word. Therefore, it might occur that good models, i.e. models that have a good representation of the pitch pattern of the different speakers, score very low on such a metric. On top of that, a good model is hard to define mathematically. Ideally, human listeners would judge whether a predicted pitch track sounds appropriate. Since that is beyond the scope of this thesis, a Diagnostic Classifier is used to rank the models. The rationale is: if an LSTM model predicts pitch the way humans would, it might have an internal representation for satire/sincerity, for speaker-specific properties caused by personal style, and for physical properties such as the length and shape of the speech channel and vocal cords.

So by using the best parameters for the Diagnostic Classifier, the best parameters for the LSTM can be decided: if the DC is able to differentiate between the different speakers, a representation of the pitch contour is stored in the LSTM. The optimal parameters for the DC, from Experiment 0, are used to give each model an accuracy and F1 score. For the second part of this experiment, the best LSTM model is selected and its parameters are used to train a model on all of the data. This model is then also tested with the DC, resulting in an accuracy and F1 score.

2.4 Experiment 2: Recognizing Sarcasm/Imitation

Using the optimal LSTM from Experiment 1, it is time to look at sarcasm again. The DC is adjusted to no longer train on all speakers, but only on two of the three. Firstly, the DC is trained only on Baldwin and Sarcasm data, even though the LSTM models are trained on all speakers. The hypothesis is that this discrepancy will not cause noise, since the LSTM was shown more pitch contours and therefore had a higher chance of building a better internal representation. Secondly, the DC is trained only on Trump and Sarcasm data. Examining the ability to differentiate a speaker from Sarcasm might give more insight into the predictability of sarcasm using an LSTM. This is done for both the optimal smoothed and the optimal not smoothed model. If Trump and Sarcasm are easier to differentiate than Baldwin and Sarcasm, then it is likely that the LSTM simply recognizes Sarcasm as Baldwin. If this is not the case, however, Baldwin's imitation of Trump might be so successful that the LSTM recognizes Sarcasm as Trump.

2.5 Experiment 3: Recognizing Sarcasm on specified data

In Experiment 2 the DC was adjusted to compare Sarcasm to one speaker at a time, using a model trained on all speakers. This experiment examines whether that approach of Experiment 1 has caused unnecessary noise. Instead of using the models trained on all three speakers, new LSTM models are trained only on the corresponding speakers: a smoothed and a not smoothed model trained only on Trump and Sarcasm data, as well as a smoothed and a not smoothed model trained only on Baldwin and Sarcasm data.

3 Results

The following sections contain the results of all conducted experiments. All models are tested based on an average accuracy and average F1-measure. These values are calculated by testing whether the Diagnostic Classifier was able to differentiate the three speakers; this is done three times, after which the average is taken and noted in the upcoming tables. In total there are 54 models, 27 on the normal data and 27 on smoothed data. Experiments 0 and 1 were done in cooperation with Kleber (Kleber, 2020) and Bakker (Bakker, 2020); the results of the optimal not smoothed model are courtesy of Bakker (Bakker, 2020) and are included for ease of comparison. Abbreviations used in the upcoming tables:

• U, the number of units used in the LSTM layer of the model.
• LF, look forward, the number of frames that is predicted.
• LB, look back, the number of frames used to make a prediction.
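The per-class precision, recall and F1 values in the tables below follow the standard definitions and can be computed from confusion counts. The counts in this sketch are hypothetical, chosen only to mirror the shape of a row like Trump in Table 7.

```python
def prf(tp, fp, fn):
    """Per-class precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: of 100 items of one class, 45 are labelled
# correctly (recall 0.45) and 38 other items are mislabelled as it.
p, r, f1 = prf(tp=45, fp=38, fn=55)
# p ~ 0.54, r = 0.45, f1 ~ 0.49
```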

3.1 Experiment 0

Units 50 100 200

1 layer 69.6% 74.8% 77.9%
2 layers 81.2% 82.2% 84.1%

Table 5: Accuracy for DC parameters

3.2 Experiment 1

Only the results of the model with the highest accuracy trained on the smoothed data set and the model with the highest accuracy trained on the not smoothed data set are displayed here. The results of the other models can be found in Appendix B.

Precision Recall F1 Accuracy

Trump 0.36 0.34 0.35 34.60%

Sarcasm 0.44 0.24 0.30 24.06%

Baldwin 0.53 0.66 0.58 66.01%

Average 0.46 0.47 0.45 41.53%

Table 6: LB:10, LF:10, U:100, smoothed

Precision Recall F1 Accuracy

Trump 0.54 0.45 0.49 45.0%

Sarcasm 0.45 0.2 0.27 19.64%

Baldwin 0.56 0.78 0.65 78.02%

Average 0.53 0.54 0.51 47.53%

Table 7: LB:200, LF:10, U:200, not smoothed

3.2.1 Best models on complete data set

Precision Recall F1 Accuracy

Trump 0.35 0.52 0.42 51.59%

Sarcasm 0.4 0.38 0.38 37.46%

Baldwin 0.56 0.44 0.49 43.99%

Average 0.47 0.44 0.44 44.38%

Table 8: LB:10, LF:10, U:100, smoothed

Precision Recall F1 Accuracy

Trump 0.67 0.58 0.62 58.4%

Sarcasm 0.5 0.31 0.38 30.3%

Baldwin 0.6 0.77 0.68 76.77%

Average 0.59 0.6 0.58 55.12%

Table 9: LB:200, LF:10, U:200, not smoothed (Bakker, 2020)


3.3 Experiment 2

Precision Recall F1 Accuracy

Sarcasm 0.51 0.46 0.44 46.36%

Baldwin 0.52 0.56 0.48 55.69%

Average 0.52 0.51 0.46 51.04%

Table 10: Sarcasm vs Baldwin, not smoothed

Precision Recall F1 Accuracy

Sarcasm 0.58 0.44 0.50 43.68%

Trump 0.55 0.69 0.61 68.60%

Average 0.57 0.56 0.55 56.14%

Table 11: Trump vs Sarcasm, not smoothed

Precision Recall F1 Accuracy

Sarcasm 0.76 0.24 0.35 24.5%

Baldwin 0.38 0.85 0.52 85.17%

Average 0.63 0.46 0.41 54.83%

Table 12: Sarcasm vs Baldwin, smoothed

Precision Recall F1 Accuracy

Sarcasm 0.71 0.46 0.55 45.62%

Trump 0.58 0.81 0.68 80.48%

Average 0.65 0.63 0.61 63.06%

Table 13: Trump vs Sarcasm, smoothed

3.4 Experiment 3

Precision Recall F1 Accuracy

Sarcasm 0.50 0.30 0.37 29.7%

Baldwin 0.51 0.71 0.59 70.56%

Average 0.50 0.51 0.48 50.1%

Table 14: Sarcasm vs Baldwin, on not smoothed model trained purely on Baldwin and Sarcasm

Precision Recall F1 Accuracy

Sarcasm 0.55 0.34 0.42 34.10%

Trump 0.52 0.72 0.60 71.65%

Average 0.53 0.53 0.51 52.88%

Table 15: Trump vs Sarcasm, on not smoothed model trained purely on Trump and Sarcasm

Precision Recall F1 Accuracy

Sarcasm 0.72 0.45 0.55 45.67%

Baldwin 0.58 0.81 0.67 80.5%

Average 0.65 0.63 0.61 63.03%

Table 16: Sarcasm vs Baldwin, on smoothed model trained purely on Baldwin and Sarcasm

Precision Recall F1 Accuracy

Sarcasm 0.72 0.47 0.57 47.1%

Trump 0.59 0.81 0.69 81.25%

Average 0.66 0.64 0.63 64.17%

Table 17: Trump vs Sarcasm, on smoothed model trained purely on Trump and Sarcasm

4 Discussions

The first experiment, Experiment 0, was conducted to determine the optimal parameters for the Diagnostic Classifier. As can be seen in Table 5, a higher accuracy is achieved with multiple layers and with more units. Due to time and resource constraints it was decided that two layers of 200 units would be sufficient. It would however be interesting to further examine the optimal combination of units and layers, also taking differences in speed into account.

In the second experiment, Experiment 1, the best LSTM parameters were determined, while simultaneously testing whether smoothing has a positive impact on the predictability of sarcasm. The best performing models are displayed in Tables 6 and 7; to boost performance, their parameters were used to train models on the entire data set, displayed in Tables 8 and 9. An important note is that both models beat chance, but not by a large margin. In addition, it appears that while models trained on not smoothed data outperform the smoothed models on average, smoothed models predict sarcasm with a higher accuracy. Before conclusions concerning the predictability of sarcasm using an LSTM model can be drawn, it is interesting to examine the results of Experiments 2 and 3.

As can be seen in Appendix B, for normal, not smoothed, models a higher look back results in a higher average accuracy and average F1-score, while smoothed models do not share that pattern. Overall, smoothed models perform worse on average F1-score and average accuracy than normal models. A possible explanation is that the information captured in the length of the silences between and within words can be distinguishing, and with the current implementation of smoothing, silences are removed from the sentences. This also shortens the sentences, since frames are removed, so a given look back reaches further back in time for smoothed sentences than for normal sentences, possibly taking in too much information and clouding the ability to make an accurate prediction. This is also part of the reason why no smoothed model with a look back of 200 appears in this thesis: as can be seen in the final three tables of Appendix B, the sentences of the smoothed Baldwin data set are too short to handle the larger combinations of look back and look forward.

The third experiment, Experiment 2, was conducted to test whether sarcasm can be detected by classifying on fewer speakers. As can be seen in Tables 10 to 13, the smoothed models are slightly better at differentiating Sarcasm from another speaker. However, none of the models beats chance by a significant margin. The best result is achieved when Sarcasm and Trump are compared on the smoothed data. This might be because the LSTM classifies Sarcasm as Baldwin, so that it is effectively differentiating between the two speakers Baldwin and Trump. Kleber's work (Kleber, 2020) supports this suggestion, since the LSTM is relatively capable of seeing the difference between Trump and Baldwin.

In the last experiment, LSTMs were trained and tested on a maximum of two speakers each, always including Sarcasm, to determine whether having the third speaker as extra data is justifiable. On average these models perform as well as the models from the previous experiment; the only model with a significant change is the smoothed model for Baldwin and Sarcasm. The changes are not sufficient to warrant further research into training only on the relevant speakers.

Looking back at the results of Experiments 1 and 2, smoothed models are on average better at predicting the Sarcasm data set: where not smoothed models do not beat chance, smoothed models beat chance by a slight margin. A possible explanation is that Praat regularly makes errors, which result in zeros, and with smoothing these errors have been removed from the data set. However, it seems unlikely that this is the main reason that sarcasm is hard to predict for not smoothed models, since Praat is a commonly used tool to extract pitch (contours). Additionally, one would assume that there is relevant information in the silences between words, which could also be a speaker-specific indicator.

5 Conclusions

This thesis set out to research the algorithmic predictability of sarcasm from audio, using a conventional LSTM to predict the next values of a pitch contour. Within the constraints of this research, the optimal parameters for performing this task have been found, using a Diagnostic Classifier to determine whether the LSTM models have an internal representation for sarcasm. Overall, however, sarcasm is hard to detect with this approach. A conventional LSTM model capably sees the difference between two distinct speakers, but struggles to detect the Sarcasm data set. A possible explanation is that Alec Baldwin's imitation of Trump is too close in pitch to Baldwin's normal voice as well as to Trump's voice. Additionally, the pitch contour of Sarcasm might not have been distinguishable enough for the LSTM to accurately build an internal representation, implying that predicting sarcasm using pitch only might not be possible. Alternatively, it could be the case that conventional LSTMs and the approach of this thesis are not suitable for the task of predicting sarcasm based on audio. In contrast, the Diagnostic Classifier is a promising new way to open the black box of machine learning and warrants use in further research in corresponding fields. A possible improvement on the approach used in this thesis is to use deep LSTMs instead of a conventional LSTM, as Sak, Senior, and Beaufays (2014) have shown that deep LSTMs outperform conventional LSTMs. Additionally, a feature containing the words/phonemes could be added to improve the ability of the LSTM to predict the next word, as this has been shown to be a suitable addition in this context (Zen et al., 2016).


References

Attardo, Salvatore et al. (Jan. 2003). "Multimodal Markers of Irony and Sarcasm". In: Humor: International Journal of Humor Research 16, pp. 243–260. doi: 10.1515/humr.2003.012.

Bakker, Rico (Jan. 2020). “An analysis of the importance of pitch in the detection of satire by use of LSTM and a Diagnostic Classifier”. In:

Bryant, Gregory A. and Jean E. Fox Tree (2005). “Is there an Ironic Tone of Voice?” In: Language and Speech 48.3. PMID: 16416937, pp. 257–277. doi: 10.1177/00238309050480030101.

Cheang, Henry S. and Marc D. Pell (May 2008). "The Sound of Sarcasm". In: Speech Commun. 50.5, pp. 366–381. issn: 0167-6393. doi: 10.1016/j.specom.2007.11.003. url: http://dx.doi.org/10.1016/j.specom.2007.11.003.

Gibbs, R (1986). “On the psycholinguistics of sarcasm”. In: Journal of Experimental Psychology: General 115, pp. 3–15.

Hochreiter, Sepp and Jürgen Schmidhuber (Nov. 1997). "Long Short-Term Memory". In: Neural Comput. 9.8, pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735. url: https://doi.org/10.1162/neco.1997.9.8.1735.

Hupkes, Dieuwke, Sara Veldhoen, and Willem Zuidema (Jan. 2018). "Visualisation and 'Diagnostic Classifiers' Reveal How Recurrent and Recursive Neural Networks Process Hierarchical Structure". In: J. Artif. Int. Res. 61.1, pp. 907–926. issn: 1076-9757.

Jurafsky, Daniel and James H. Martin (2008a). Speech and Language Processing. 2nd ed. Prentice Hall, p. 342. isbn: 9780131873216.

— (2008b). Speech and Language Processing, pp. 258–260.

Kleber, Fedor (Jan. 2020). “Using LSTM and a Diagnostic Classifier to predict pitch and internally represent individual voices”. In:

Kreuz, Roger J. and Gina M. Caucci (2007). “Lexical Influences on the Perception of Sarcasm”. In: FigLanguages ’07, pp. 1–4. url: http://dl.acm.org/citation.cfm?id=1611528.1611529.

Merriam-Webster (2019). Sarcasm — Definition of Sarcasm. url: https://www.merriam-webster.com/dictionary/sarcasm (visited on 12/05/2019).

Olah, Christopher (2015). Understanding LSTM Networks. url: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (visited on 01/25/2020).

Rockwell, Patricia (Sept. 2000). “Lower, Slower, Louder: Vocal Cues of Sarcasm”. In: Journal of Psycholinguistic Research 29.5, pp. 483–495. issn: 1573-6555. doi: 10.1023/A:1005120109296. url: https://doi.org/10.1023/A:1005120109296.

Sak, H., Andrew Senior, and F. Beaufays (Jan. 2014). “Long short-term memory recurrent neural network architectures for large scale acoustic modeling”. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 338–342.

Wang, Ji Yang (2020). What is ”units” in LSTM layer of keras. url: https://zhuanlan.zhihu.com/p/58854907 (visited on 01/25/2020).

Zen, Heiga et al. (2016). “Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices”. In: Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pp. 2273–2277. doi: 10.21437/Interspeech.2016-522. url: https://doi.org/10.21437/Interspeech.2016-522.


6 Appendix A


7 Appendix B

7.1 No smoothing
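The per-class scores reported in the tables below can be computed with scikit-learn. One pattern worth noting: the per-class “Accuracy” column appears to equal recall expressed as a percentage (e.g. Trump's recall of 0.02 against an accuracy of 2.13% in Table 18). A minimal sketch with made-up predictions (the labels and values here are illustrative only, not the thesis's results):

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions for the 3-class task (Trump / Sarcasm / Baldwin).
y_true = ["Trump", "Trump", "Sarcasm", "Sarcasm", "Baldwin", "Baldwin"]
y_pred = ["Trump", "Baldwin", "Sarcasm", "Baldwin", "Baldwin", "Baldwin"]

labels = ["Trump", "Sarcasm", "Baldwin"]
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

for name, p, r, f in zip(labels, prec, rec, f1):
    # The per-class "Accuracy" column in the tables matches recall as a
    # percentage: the fraction of that class's samples classified correctly.
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f} Acc={100 * r:.1f}%")
```

The “Average” rows in the tables are then macro-averages over the three classes.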

Precision Recall F1 Accuracy

Trump 0.37 0.02 0.04 2.13%

Sarcasm 0.61 0.21 0.31 21.07%

Baldwin 0.53 0.95 0.68 95.5%

Average 0.51 0.53 0.42 39.6%

Table 18: LB:10, LF:1, U:50

Precision Recall F1 Accuracy

Trump 0.46 0.17 0.24 16.7%

Sarcasm 0.57 0.34 0.42 33.56%

Baldwin 0.56 0.85 0.68 85.71%

Average 0.54 0.55 0.5 45.33%

Table 19: LB:10, LF:1, U:100

Precision Recall F1 Accuracy

Trump 0.53 0.07 0.13 7.4%

Sarcasm 0.6 0.32 0.41 31.53%

Baldwin 0.55 0.91 0.68 91.38%

Average 0.56 0.55 0.47 43.4%

Table 20: LB:10, LF:1, U:200

Precision Recall F1 Accuracy

Trump 0.49 0.1 0.16 10.17%

Sarcasm 0.46 0.08 0.13 7.8%

Baldwin 0.5 0.93 0.65 92.88%

Average 0.49 0.5 0.39 36.93%

Table 21: LB:10, LF:10, U:50


Precision Recall F1 Accuracy
Trump 0.42 0.15 0.23 15.5%
Sarcasm 0.49 0.1 0.17 10.6%
Baldwin 0.5 0.88 0.64 88.06%
Average 0.48 0.5 0.41 38.04%

Table 22: LB:10, LF:10, U:100

Precision Recall F1 Accuracy

Trump 0.41 0.19 0.26 18.87%

Sarcasm 0.51 0.11 0.17 10.6%

Baldwin 0.5 0.85 0.63 85.31%

Average 0.48 0.49 0.42 38.26%

Table 23: LB:10, LF:10, U:200

Precision Recall F1 Accuracy

Trump 0.31 0.06 0.09 5.47%

Sarcasm 0.25 0.03 0.05 2.7%

Baldwin 0.46 0.93 0.62 93.21%

Average 0.36 0.45 0.32 33.77%

Table 24: LB:10, LF:50, U:50

Precision Recall F1 Accuracy

Trump 0.33 0.07 0.11 6.73%

Sarcasm 0.28 0.07 0.11 7.27%

Baldwin 0.47 0.89 0.61 88.85%

Average 0.38 0.45 0.34 34.3%

Table 25: LB:10, LF:50, U:100

Precision Recall F1 Accuracy

Trump 0.39 0.19 0.25 19.1%

Sarcasm 0.3 0.06 0.09 5.87%

Baldwin 0.48 0.85 0.61 84.92%

Average 0.41 0.46 0.37 36.65%

Table 26: LB:10, LF:50, U:200

Precision Recall F1 Accuracy

Trump 0.47 0.29 0.36 28.6%

Sarcasm 0.55 0.23 0.32 23.0%

Baldwin 0.56 0.83 0.67 83.38%

Average 0.53 0.54 0.5 45.0%

Table 27: LB:100, LF:1, U:50

Precision Recall F1 Accuracy

Trump 0.43 0.21 0.28 20.8%

Sarcasm 0.57 0.22 0.31 21.8%

Baldwin 0.54 0.86 0.66 85.83%

Average 0.52 0.53 0.48 42.82%

Table 28: LB:100, LF:1, U:100

Precision Recall F1 Accuracy

Trump 0.48 0.34 0.39 34.0%

Sarcasm 0.54 0.24 0.33 24.4%

Baldwin 0.57 0.81 0.66 81.02%

Average 0.54 0.55 0.52 46.47%

Table 29: LB:100, LF:1, U:200

Precision Recall F1 Accuracy

Trump 0.49 0.32 0.38 31.57%

Sarcasm 0.5 0.2 0.29 20.17%

Baldwin 0.55 0.82 0.66 81.71%

Average 0.52 0.53 0.49 44.5%

Table 30: LB:100, LF:10, U:50

Precision Recall F1 Accuracy

Trump 0.47 0.33 0.39 33.2%

Sarcasm 0.51 0.27 0.35 27.03%

Baldwin 0.56 0.79 0.66 78.81%

Average 0.53 0.54 0.51 46.36%

Table 31: LB:100, LF:10, U:100


Precision Recall F1 Accuracy
Trump 0.5 0.38 0.43 37.7%
Sarcasm 0.49 0.22 0.31 22.47%
Baldwin 0.56 0.79 0.66 79.33%
Average 0.53 0.54 0.51 46.53%

Table 32: LB:100, LF:10, U:200

Precision Recall F1 Accuracy

Trump 0.44 0.3 0.36 30.13%

Sarcasm 0.39 0.18 0.24 17.93%

Baldwin 0.51 0.77 0.62 77.21%

Average 0.46 0.48 0.44 41.77%

Table 33: LB:100, LF:50, U:50

Precision Recall F1 Accuracy

Trump 0.43 0.29 0.35 29.23%

Sarcasm 0.4 0.22 0.29 22.4%

Baldwin 0.52 0.75 0.61 74.56%

Average 0.46 0.48 0.45 42.06%

Table 34: LB:100, LF:50, U:100

Precision Recall F1 Accuracy

Trump 0.44 0.34 0.38 34.36%

Sarcasm 0.36 0.22 0.27 22.03%

Baldwin 0.51 0.7 0.59 70.0%

Average 0.45 0.47 0.45 42.18%

Table 35: LB:100, LF:50, U:200

Precision Recall F1 Accuracy

Trump 0.52 0.35 0.41 34.97%

Sarcasm 0.44 0.1 0.17 10.16%

Baldwin 0.54 0.85 0.67 85.6%

Average 0.51 0.53 0.47 43.56%

Table 36: LB:200, LF:1, U:50

Precision Recall F1 Accuracy

Trump 0.51 0.32 0.39 31.57%

Sarcasm 0.47 0.14 0.22 14.5%

Baldwin 0.55 0.85 0.67 85.17%

Average 0.52 0.53 0.48 43.73%

Table 37: LB:200, LF:1, U:100

Precision Recall F1 Accuracy

Trump 0.55 0.4 0.46 40.27%

Sarcasm 0.44 0.13 0.2 13.13%

Baldwin 0.56 0.84 0.67 83.92%

Average 0.52 0.55 0.5 45.76%

Table 38: LB:200, LF:1, U:200

Precision Recall F1 Accuracy

Trump 0.51 0.37 0.43 37.43%

Sarcasm 0.47 0.17 0.25 17.27%

Baldwin 0.55 0.81 0.66 81.44%

Average 0.52 0.54 0.5 45.36%

Table 39: LB:200, LF:10, U:50

Precision Recall F1 Accuracy

Trump 0.52 0.36 0.42 35.8%

Sarcasm 0.48 0.19 0.27 19.0%

Baldwin 0.55 0.82 0.66 81.5%

Average 0.52 0.54 0.5 45.44%

Table 40: LB:200, LF:10, U:100

Precision Recall F1 Accuracy

Trump 0.54 0.45 0.49 45.0%

Sarcasm 0.45 0.2 0.27 19.64%

Baldwin 0.56 0.78 0.65 78.02%

Average 0.53 0.54 0.51 47.53%

Table 41: LB:200, LF:10, U:200


Precision Recall F1 Accuracy
Trump 0.48 0.37 0.41 36.49%
Sarcasm 0.45 0.18 0.26 18.37%
Baldwin 0.52 0.78 0.62 77.44%
Average 0.49 0.5 0.47 44.06%

Table 42: LB:200, LF:50, U:50

Precision Recall F1 Accuracy

Trump 0.46 0.31 0.37 31.27%

Sarcasm 0.38 0.23 0.29 23.1%

Baldwin 0.52 0.74 0.61 73.73%

Average 0.46 0.48 0.46 42.7%

Table 43: LB:200, LF:50, U:100

Precision Recall F1 Accuracy
Trump 0.48 0.37 0.42 37.07%
Sarcasm 0.41 0.22 0.28 21.63%
Baldwin 0.52 0.73 0.61 73.35%
Average 0.48 0.5 0.47 44.03%

Table 44: LB:200, LF:50, U:200

7.2 Smoothed
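The tables below report results on smoothed pitch contours. As one hypothetical illustration of such preprocessing — the thesis's actual smoothing procedure may differ — a centered moving average could be applied to each contour before training:

```python
import numpy as np

def smooth_pitch(contour, window=5):
    """Smooth a pitch contour with a centered moving average.

    A generic smoother, assumed here for illustration; it is not
    necessarily the smoothing method used in the thesis.
    """
    kernel = np.ones(window) / window
    # mode="same" keeps the contour length unchanged; values near the
    # edges are effectively averaged with zero padding.
    return np.convolve(contour, kernel, mode="same")

# An alternating (hypothetical) pitch contour in Hz; smoothing pulls the
# values toward the local mean.
contour = np.array([100.0, 140.0, 100.0, 140.0, 100.0, 140.0, 100.0])
print(smooth_pitch(contour, window=3))
```

Smoothing removes fine-grained pitch movement, which may explain why some per-class results shift markedly between the two conditions.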

Precision Recall F1 Accuracy

Trump 0.24 0.05 0.08 4.83%

Sarcasm 0.28 0.50 0.36 50.47%

Baldwin 0.53 0.53 0.53 52.71%

Average 0.39 0.40 0.37 36.00%

Table 45: LB:10, LF:1, U:50

Precision Recall F1 Accuracy

Trump 0.25 0.04 0.07 4.47%

Sarcasm 0.29 0.42 0.34 41.97%

Baldwin 0.53 0.62 0.57 61.66%

Average 0.39 0.42 0.38 36.03%

Table 46: LB:10, LF:1, U:100

Precision Recall F1 Accuracy

Trump 0.26 0.19 0.22 19.40%

Sarcasm 0.33 0.42 0.37 42.00%

Baldwin 0.52 0.52 0.52 51.90%

Average 0.41 0.41 0.41 37.73%

Table 47: LB:10, LF:1, U:200

Precision Recall F1 Accuracy

Trump 0.37 0.22 0.27 21.90%

Sarcasm 0.46 0.20 0.27 19.90%

Baldwin 0.52 0.78 0.62 71.88%

Average 0.47 0.49 0.44 39.91%

Table 48: LB:10, LF:10, U:50

Precision Recall F1 Accuracy

Trump 0.36 0.34 0.35 34.60%

Sarcasm 0.44 0.24 0.30 24.06%

Baldwin 0.53 0.66 0.58 66.01%

Average 0.46 0.47 0.45 41.53%

Table 49: LB:10, LF:10, U:100

Precision Recall F1 Accuracy

Trump 0.39 0.30 0.34 29.64%

Sarcasm 0.40 0.24 0.30 23.46%

Baldwin 0.53 0.71 0.61 71.12%

Average 0.46 0.48 0.46 41.40%

Table 50: LB:10, LF:10, U:200


Precision Recall F1 Accuracy
Trump 0.40 0.03 0.06 3.17%
Sarcasm 0.01 0.00 0.00 0.03%
Baldwin 0.44 0.97 0.61 97.38%
Average 0.30 0.44 0.28 33.50%

Table 51: LB:10, LF:50, U:50

Precision Recall F1 Accuracy

Trump 0.30 0.12 0.16 11.57%

Sarcasm 0.07 0.00 0.00 0.33%

Baldwin 0.44 0.89 0.59 88.79%

Average 0.30 0.42 0.30 33.56%

Table 52: LB:10, LF:50, U:100

Precision Recall F1 Accuracy

Trump 0.28 0.33 0.30 33.20%

Sarcasm 0.17 0.01 0.02 1.00%

Baldwin 0.44 0.66 0.53 65.97%

Average 0.32 0.38 0.32 33.36%

Table 53: LB:10, LF:50, U:200

Precision Recall F1 Accuracy

Trump 0.28 0.86 0.42 85.90%

Sarcasm 0.39 0.18 0.23 18.07%

Baldwin 0.43 0.09 0.15 9.50%

Average 0.38 0.31 0.24 37.79%

Table 54: LB:100, LF:1, U:50

Precision Recall F1 Accuracy

Trump 0.29 0.87 0.43 86.96%

Sarcasm 0.32 0.17 0.21 17.27%

Baldwin 0.48 0.09 0.15 9.07%

Average 0.39 0.30 0.23 37.77%

Table 55: LB:100, LF:1, U:100

Precision Recall F1 Accuracy

Trump 0.28 0.86 0.42 86.15%

Sarcasm 0.28 0.18 0.19 15.43%

Baldwin 0.39 0.09 0.13 7.53%

Average 0.33 0.29 0.22 36.36%

Table 56: LB:100, LF:1, U:200

Precision Recall F1 Accuracy

Trump 0.28 0.84 0.42 84.00%

Sarcasm 0.33 0.27 0.29 26.64%

Baldwin 0.35 0.03 0.06 2.93%

Average 0.33 0.29 0.21 37.91%

Table 57: LB:100, LF:10, U:50

Precision Recall F1 Accuracy

Trump 0.28 0.88 0.43 87.79%

Sarcasm 0.40 0.24 0.30 24.46%

Baldwin 0.55 0.06 0.11 6.37%

Average 0.44 0.32 0.24 39.56%

Table 58: LB:100, LF:10, U:100

Precision Recall F1 Accuracy

Trump 0.29 0.87 0.44 86.35%

Sarcasm 0.37 0.28 0.32 28.07%

Baldwin 0.57 0.06 0.11 6.23%

Average 0.45 0.32 0.25 40.21%

Table 59: LB:100, LF:10, U:200

Precision Recall F1 Accuracy

Trump 0.28 0.77 0.42 77.19%

Sarcasm 0.32 0.29 0.30 28.73%

Baldwin 0.00 0.00 0.00 0.00%

Average 0.17 0.29 0.20 35.3%

Table 60: LB:100, LF:50, U:50


Precision Recall F1 Accuracy
Trump 0.29 0.69 0.41 68.88%
Sarcasm 0.32 0.38 0.35 38.34%
Baldwin 0.00 0.00 0.00 0.00%
Average 0.17 0.30 0.21 35.73%

Table 61: LB:100, LF:50, U:100

Precision Recall F1 Accuracy

Trump 0.29 0.79 0.42 79.02%

Sarcasm 0.31 0.26 0.29 26.33%

Baldwin 0.00 0.00 0.00 0.00%

Average 0.17 0.30 0.20 34.12%

Table 62: LB:100, LF:50, U:200
